Statistical Programming

July 26, 2021
 

A SAS programmer recently asked why his SAS program and his colleague's R program display different estimates for the quantiles of a very small data set (less than 10 observations). I pointed the programmer to my article that compares the nine common definitions for sample quantiles. The article has a section that explicitly compares the default sample quantiles in SAS and R. The function in the article is written to support all nine definitions. The programmer asked whether I could provide a simpler function that computes only the default definition in R.

This article compares the default sample quantiles in SAS and R. It is a misnomer to refer to one definition as "the SAS method" and to another as "the R method." In SAS, procedures such as PROC UNIVARIATE and PROC MEANS support the QNTLDEF= option, which enables you to choose among five different quantile definitions. By using SAS/IML, you can compute all nine estimation methods. Similarly, R supports all nine definitions. However, users of statistical software often use the default method, then wonder why they get different answers from different software. This article explains the difference between the default method in SAS and the default method in R. The default in R is also the default method in Julia and in the Python packages SciPy and NumPy.

The Hyndman and Fan taxonomy

The purpose of a sample statistic is to estimate the corresponding population parameter. That is, the sample quantiles are data-based estimates of the unknown quantiles in the population. Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996) discuss nine definitions of sample quantiles that commonly appear in statistical software packages. All nine definitions result in valid estimates. For large data sets (say, 100 or more observations), they tend to give similar results. The differences between the definitions are most evident for a small data set that has wide gaps between adjacent values (after sorting the data). The example in this article is small and has a wide gap between the largest value and the next-largest value.

By default, SAS uses Hyndman and Fan's Type=2 method, whereas R (and Julia, SciPy, and NumPy) use the Type=7 method. The Type=2 method uses the empirical cumulative distribution of the data (empirical CDF) to estimate the quantiles, whereas the Type=7 method uses a piecewise-linear estimate of the cumulative distribution function. This is demonstrated in the next section.

An example of sample quantiles

To focus the discussion, consider the data {0, 1, 1, 1, 2, 2, 2, 4, 5, 8}. There are 10 observations, but only six unique values. The following graphs show the estimates of the cumulative distribution function used by the Type=2 and Type=7 methods. The fringe plot (rug plot) below the CDF shows the locations of the data:

The sample quantiles are determined by the estimates of the CDF. The largest gap in the data is between the values X=5 and X=8. So, for extreme quantiles (greater than 0.9), we expect to see differences between the Type=2 and the Type=7 estimates. The following examples show that the two methods agree for some quantiles, but not for others:

  • The 0.5 quantile (the median) is determined by drawing a horizontal line at Y=0.5 and seeing where the horizontal line crosses the estimate of the CDF. For both graphs, the corresponding X value is X=2, which means that both methods give the same estimate (2) for the median.
  • The 0.75 quantile (the 75th percentile) estimates are different between the two methods. In the upper graph, a horizontal line at 0.75 crosses the empirical CDF at X=4, which is a data value. In the lower graph, the estimate for the 0.75 quantile is X=3.5, which is neither a data value nor the average of adjacent values.
  • The 0.95 quantile (the 95th percentile) estimates are different. In the upper graph, a horizontal line at 0.95 crosses the empirical CDF at X=8, which is the maximum data value. In the lower graph, the estimate for the 0.95 quantile is X=6.65, which is between the two largest data values.
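
For example, here is the arithmetic behind the Type=7 estimate of the 0.95 quantile. (The same formula is implemented in the SAS/IML function later in this article.) For n=10 sorted values and p=0.95, the quantity n*p + (1-p) = 9.55 has integer part j=9 and fractional part g=0.55. The Type=7 estimate is the linear interpolation (1-g)*x[9] + g*x[10] = 0.45*5 + 0.55*8 = 6.65, which is the value shown for the lower graph.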

Comments on the CDF estimates

The Type=2 method (the default in SAS) uses an empirical CDF (ECDF) to estimate the population quantiles. The ECDF has a long history of being used for fitting and comparing distributions. For example, the Kolmogorov-Smirnov test uses the ECDF to compute nonparametric goodness-of-fit tests. When you use the ECDF, a quantile is always an observed data value or the average of two adjacent data values.

The Type=7 method (the default in R) uses a piecewise-linear estimate of the CDF. There are many ways to create a piecewise-linear estimate, and there have been many papers (going back to the 1930's) written about the advantages and disadvantages of each choice. In Hyndman and Fan's taxonomy, six of the nine methods use piecewise-linear estimates. Some people prefer the piecewise-linear estimates because the inverse CDF is continuous: a small change to the probability value (such as 0.9 to 0.91) results in a small change to the quantile estimates. This property is not present in the methods that use the ECDF.

A function to compute the default sample quantiles in R

Back in 2017, I wrote a SAS/IML function that can compute all of the common definitions of sample quantiles. If you only want to compute the default (Type=7) definition in R, you can use the following simpler function:

proc iml;
/* By default, R (and Julia, and some Python packages) uses
   Hyndman and Fan's Type=7 definition. Compute Type=7 sample quantiles.
*/
start GetQuantile7(y, probs);
   x = colvec(y);
   N = nrow(x);
   if N=1 then return (y);  /* handle the degenerate case, N=1 */
 
   /* remove missing values, if any */
   idx = loc(x^=.);
   if ncol(idx)=0 then  
      return (.);           /* all values are missing */
   else if ncol(idx)<N then do;
      x = x[idx,]; N = nrow(x);  /* remove missing */
   end;
 
   /* Main computation: Compute Type=7 sample quantile. 
      Estimate is a linear interpolation between x[j] and x[j+1]. */
   call sort(x);
   p = colvec(probs);
   m = 1-p;
   j = floor(N*p + m);      /* indices into sorted data values */
   g = N*p + m - j;         /* 0 <= g <= 1 for interpolation */
 
   q = j(nrow(p), 1, x[N]); /* if p=1, estimate by x[N]=max(x) */
   idx = loc(p < 1);
   if ncol(idx) >0 then do;
      j = j[idx]; g = g[idx];
      q[idx] = (1-g)#x[j] + g#x[j+1]; /* linear interpolation */
   end;   
   return q;
finish;
 
/* Compare the SAS and R default definitions.
   The differences between definitions are most apparent 
   for small samples that have large gaps between adjacent data values. */
x = {0 1 1 1 2 2 2 4 5 8}`;
prob = {0.5, 0.75, 0.9, 0.95};
 
call qntl(SASDefaultQntl, x, prob);
RDefaultQntl = GetQuantile7(x, prob);
print prob SASDefaultQntl RDefaultQntl;

The table shows some of the quantiles that were discussed previously. If you choose prob to be evenly spaced points in [0,1], you get the values on the graphs shown previously.
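
An earlier section claimed that the Type=7 estimates vary continuously with the probability value, whereas the ECDF-based estimates can jump. If the PROC IML session is still active, the following statements are a small sketch (not part of the original program) that illustrates the claim by comparing the two estimates at p=0.90 and p=0.91. For these data, the ECDF-based estimate jumps between the two probabilities, whereas the Type=7 estimate changes only slightly.

p2 = {0.90, 0.91};
call qntl(SASq, x, p2);      /* default SAS definition (Type=2) */
Rq = GetQuantile7(x, p2);    /* default R definition (Type=7) */
print p2 SASq Rq;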

Summary

There are many ways to estimate quantiles. Hyndman and Fan (1996) list nine common definitions. By default, SAS uses the Type=2 method, whereas R (and other software) uses the Type=7 method. SAS procedures support five of the nine common definitions of sample quantiles, and you can use SAS/IML to compute the remaining definitions. To make it easy to reproduce the default values of sample quantiles from other software, I have written a SAS/IML function that computes the Type=7 quantiles.

If you do not have SAS/IML software, but you want to compute estimates that are based on a piecewise-linear estimate of the CDF, I suggest you use the QNTLDEF=1 option in PROC UNIVARIATE or PROC MEANS. This produces the Type=4 method in Hyndman and Fan's taxonomy. For more information about the quantile definitions that are natively available in SAS procedures, see "Quantile definitions in SAS."
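
For reference, here is a minimal sketch of how you might request the QNTLDEF=1 definition. The data set name (Have), the variable name (x), and the output names are hypothetical placeholders.

/* Have and x are hypothetical placeholders for your data */
proc univariate data=Have qntldef=1;                   /* QNTLDEF= accepts the values 1-5 */
   var x;
   output out=Pctls pctlpts=50 75 90 95 pctlpre=P;     /* write percentiles to a data set */
run;
proc print data=Pctls; run;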

The post Compare the default definitions for sample quantiles in SAS, R, and Python appeared first on The DO Loop.

July 21, 2021
 

To get better at something, you need to practice. That maxim applies to sports, music, and programming. If you want to be a better programmer, you need to write many programs. This article provides an example of forming the intersection of items in a SAS/IML list. It then provides several exercises so that SAS/IML programmers can practice working with lists.

What are SAS/IML lists?

A SAS/IML list is a data structure that enables you to pack various SAS/IML objects (scalars, vectors, matrices, tables, ...) into a single object. A primary use of lists is to pass information between functions. You can create lists and extract items from lists by using a special syntax. For example, if L is a list then L$1 is the first item in the list, L$2 is the second item, and so on.
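
For example, the following statements are a small sketch that creates a list with two items and extracts information from it. The names A, B, and L are arbitrary.

proc iml;
A = {1 2 3};          /* a numeric matrix */
B = "SAS";            /* a character scalar */
L = [A, B];           /* pack both items into a list */
first = L$1;          /* extract the first item (the matrix A) */
len = ListLen(L);     /* the number of items in the list */
print first len;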

Recently, a SAS/IML programmer asked some basic questions about manipulating items in a list. The chapter about lists in the SAS/IML User's Guide includes many examples, but there is a big difference between reading documentation and writing a program yourself. This article provides exercises that enable you to practice working with lists.

The intersection of items in a list

This section shows how to compute the intersection of items in a list. The input is a SAS/IML list that contains many matrices. The output is the intersection between the items.

If you have a set of matrices, the XSECT function in SAS/IML computes the intersection of the values in the matrices. However, you cannot pass a list to the XSECT function. The following example shows that the XSECT function takes a comma-separated set of matrices:

proc iml;
x = 1:20;          /* 1,2,3,...,20 */
y = do(0, 20, 2);  /* 0,2,4,...,20 */
z = do(0, 21, 3);  /* 0,3,6,...,21 */
w = 3:24;          /* 3,4,5,...,24 */
 
intersect = xsect(x,y,z,w);
print intersect;

The intersection of the matrices (thought of as sets) contains three elements. Notice that the arguments to the XSECT function are matrices.

So, how could you compute the intersection of matrices if they are items in a list? The key is to recognize that the intersection of k sets can be computed by performing k-1 pairwise intersections. First, form S1 as the intersection of X and Y. Then form S2 as the intersection of S1 and Z. Lastly, form S3 (the answer) as the intersection of S2 and W. Thus, you can compute the total intersection by always computing the intersection of pairs of sets.

Thus, you can compute the intersection by looping over the items in the list. You can use the ListLen function to get the number of items in a list. Then repeatedly compute the pairwise intersections, as follows:

start ListIntersect(L);
   len = ListLen(L);
   if len=0 then                /* no items in list */
      return ( {} );            /* return empty matrix */
   else if len=1 then           /* one item */
      return L$1;               /* return the item */
 
   intersect = xsect(L$1, L$2); /* the intersection of the first 2 items */
   do i = 3 to len;
      intersect = xsect(intersect, L$i); /* the intersection of the first i items */
   end;
   return intersect;
finish;
 
/* test the function: pack matrices into a list */
L = [x, y, z, w];
s = ListIntersect(L);
print s;

Success! The ListIntersect function returns the same values as the original call to the XSECT function.

Exercises for manipulating lists

Now it is your turn. Write programs for the following tasks:

  1. The previous example shows that the ListIntersect function works on a list that contains numerical matrices. But the XSECT function also works for character values. Use the following character matrices to test whether the ListIntersect function works for a list that contains character items. If not, modify the function so that it works for a list of character matrices.
    a = 'A':'Z';
    b = {A C E G I K M O Q S U W Y};
    c = {A B C F G I K N O R S V Y};
    d = {I THINK IT IS EITHER X OR Y};
  2. The ListIntersect function does not check that the items in the list are the same type. Use the TYPE function to ensure that all items in the list have the same type, either all numeric or all character. If not, return the empty matrix.
  3. Write a function called ListUnion that uses the UNION function to compute the union of the elements in a list. Make sure it handles lists that contain all numeric or all character items.
  4. Given a matrix (m) and a list (L), you can check to see which items in the list have a nonempty intersection with the matrix. If there are N items in the list, write a function called HasIntersection that returns a binary vector (v) of length N. The i_th value, v[i], is 1 if XSECT(m, L$i) is nonempty. Otherwise, v[i] is 0. For example, the following call should return the vector {0 1 1 1} because L has four items and m has a nonempty intersection with items 2, 3, and 4:
    m = {0 24};
    v = HasIntersection(m, L);

Summary

If you want to be good at something, you need to practice. This article shows how to manipulate items in a SAS/IML list to form the intersection of items. It then provides a set of exercises that a motivated programmer can use to practice working with lists.

The post Operations on lists in SAS/IML appeared first on The DO Loop.

July 19, 2021
 

After my recent articles on simulating data by using copulas, many readers commented about the power of copulas. Yes, they are powerful, and the geometry of copulas is beautiful. However, it is important to be aware of the limitations of copulas. This article creates a bizarre example of bivariate data, then shows what happens when you fit copula models to the example.

You might think that if you match the correlation and marginal distributions of data, your model must look similar to the data. However, that statement is false. This article demonstrates that there are many bivariate distributions that have the same correlation and marginal distributions. Some might look like the data; others do not. To keep the discussion as simple as possible, I will discuss only bivariate data with marginal distributions that are normal.

Many ways to generate normally distributed variables

Before constructing the bizarre bivariate example, let's look at an interesting fact about symmetric univariate distributions. Let Z ~ N(0,1) be a standard normal random variable. Consider the following methods for generating a new random variable from Z:

  • Y = Z: Clearly, Y is a standard normal random variable.
  • Y = –Z: Since Z is symmetric, Y is a standard normal random variable.
  • With probability p, Y = –Z; otherwise, Y = Z: This is a little harder to understand, but Y is still a normal random variable! Think about generating a large sample from Z. Some observations are negative, and some are positive. To get a sample from Y, you randomly change the sign of each observation with probability p. Because the distribution is symmetric, you change half the positives to negatives and half the negatives to positives. Thus, Y is normally distributed. Both previous methods are special cases of this one. If p=0, then Y = Z. If p=1, then Y = –Z.

To demonstrate the third method, the following SAS DATA step generates a normally distributed sample by generating Z~N(0,1), then randomly permuting the sign of 50% of the variates:

%let N = 2000;
data Sim;
call streaminit(1234);
p = 0.5;
do i = 1 to &N;
   z = rand("Normal");
   /* flip the sign of Z with probability p */
   if rand("Bernoulli", p) then 
      y = z;
   else 
      y = -z;
   output;
end;
run;
 
proc univariate data=Sim;
   var y;
   histogram / normal;
run;

The graph indicates that Y is normally distributed. If you prefer statistical tests, the table shows several goodness-of-fit tests. The results indicate that a normal distribution fits the simulated data well.

This example shows that you can change the signs of 50% of the observations and still obtain a normal distribution. This fact is used in the next section to construct a bizarre bivariate distribution that has normal marginals. (Note: Actually, this example is valid for any value of p, 0 ≤ p ≤ 1. You always get normally distributed data.)

A bizarre example: Normal marginals do not imply bivariate normality

Let's play a game. I tell you that I have bivariate data where each variable is normally distributed and the correlation between the variables is 0.75. I ask you to sketch what you think a scatter plot of my data looks like. Probably most people would guess it looks like the graph below, which shows bivariate normal data:

If the joint distribution of X and Y is bivariate normal, then the univariate distributions of X and Y are indeed normal. However, the converse of that statement is not true! There are many bivariate distributions that are not bivariate normal, yet their marginal distributions are normal. Can't picture it? Study the following SAS DATA step, which simulates (X,Z) as bivariate data, but changes the sign for 50% of the Z variates:

/* Bivariate data whose marginals are normal. Idea from 
https://stats.stackexchange.com/questions/30159/is-it-possible-to-have-a-pair-of-gaussian-random-variables-for-which-the-joint-d
*/
%let N = 2000;
data NotBinormal;
call streaminit(123);
do i = 1 to &N;
   x = rand("Normal");
   z = rand("Normal");
   in24 = (x< 0 & z>=0) | /* (x,z) in 2nd quadrant */ 
          (x>=0 & z< 0);  /* (x,z) in 4th quadrant */
   /* If (x,z) is in 2nd quadrant, flip into 3rd quadrant.
      If (x,z) is in 4th quadrant, flip into 1st quadrant. */
   if in24 then 
      y = -z;
   else 
      y = z;
   output;
end;
run;
 
ods graphics / width=480px height=480px;
proc corr data=NotBinormal noprob Spearman plots=matrix(hist);
   var x y;
   ods select SpearmanCorr MatrixPlot;
run;

The scatter plot looks like a bow tie! I saw this example on StackExchange in an answer by 'cardinal'. I like it because the joint distribution is obviously NOT bivariate normal even though X and Y are both standard normal variables. Obviously, the variable X is univariate normal. For the variable Y, notice that the program generates Z as random normal but changes the sign for a random 50% of the Z values. In this case, the 50% is determined by looking at the location of the (X,Z) points. If (X,Z) is in the second or fourth quadrants, which occurs with probability p=0.5, flip the sign of Z. The result is bivariate data that lies in only two quadrants. Yet, both marginal variables are normal.

When asked to describe the scatter plot of correlated data with normal marginals, I think that few people would suggest this example! Fortunately, real data does not look like this pathological example. Nevertheless, it is instructive to ask, "what does a copula model look like if we fit it to the bow-tie data?"

Fit a copula to data

Let's fit a copula to the bizarre bow-tie distribution. We know from a previous article that a copula will have the same Spearman correlation (0.75) and the marginal distributions (normal) as the data. But if you simulate data from a Gaussian copula, a scatter plot of the simulated data looks nothing like the bow-tie scatter plot. Instead, you get bivariate normal data, as demonstrated by the following statements:

proc copula data=NotBinormal;
   var x y;              /* original data vars */
   fit normal;            /* choose copula: Normal, Clayton, Frank,... */
   simulate / seed=1234 ndraws=&N
              marginals=empirical    /* transform from copula by using empirical CDF of data */
              out=SimData;           /* contains the simulated data */
run;
 
title "Bivariate Normal Data";
proc sgplot data=SimData aspect=1;
   scatter x=x y=y / transparency=0.75 markerattrs=(symbol=CircleFilled size=12);
   xaxis grid; yaxis grid;
run;

The scatter plot is shown at the beginning of the previous section.

The COPULA procedure supports other types of copulas, such as the Clayton, Frank, and Gumbel copulas. These are useful in modeling data that have more (or less) weight in the tails, or that might be unsymmetric. For example, the Gumbel copula is an unsymmetric distribution that has more weight in the right tail than the normal distribution does. If you change the keyword NORMAL to GUMBEL in the previous call to PROC COPULA, you obtain the following scatter plot of simulated data from the Gumbel copula:
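
For reference, here is the modified call. It is identical to the previous PROC COPULA call except that the FIT statement requests the Gumbel copula and the output data set has a different (arbitrary) name.

proc copula data=NotBinormal;
   var x y;
   fit gumbel;             /* use the Gumbel copula instead of the normal copula */
   simulate / seed=1234 ndraws=&N
              marginals=empirical
              out=SimGumbel;
run;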

This is another example of a distribution that has the same Spearman correlation as the data and has normal marginals. It is not bivariate normal, but it still doesn't look like the pathological bow-tie example.

Summary

When you learn a new technique, it is important to know its limitations. This article demonstrates that there are many distributions that have the same correlation and the same marginal distributions. I showed a very dramatic example of a bizarre scatter plot that looks like a bow tie but, nevertheless, has normal marginals. When you fit a copula to the data, you preserve the Spearman correlation and the marginal distributions, but none of the copula models (Gaussian, Gumbel, Frank, ...) look like the data.

I strongly encourage modelers to use the simulation capabilities of PROC COPULA to help visualize whether a copula model fits the data. You can make a scatter plot matrix of the marginal bivariate distributions to understand how the model might deviate from the data. As far as I know, PROC COPULA does not support any goodness-of-fit tests, so a visual comparison is currently the best you can do.

For more information about PROC COPULA in SAS, see the SAS/ETS documentation. (Or, for SAS Viya customers, see PROC CCOPULA in the SAS Econometrics product.)

For more information about different types of copulas and tail dependence, see
  • Venter, Gary G. (2002) "Tails of copulas," in Proceedings of the Casualty Actuarial Society (Vol. 89, No. 171, pp. 68-113).
  • Zivot, Eric (to appear), "Copulas" in Modeling Financial Time Series with R, Springer-Verlag.

The post Copulas and multivariate distributions with normal marginals appeared first on The DO Loop.

July 5, 2021
 

Do you know what a copula is? It is a popular way to simulate multivariate correlated data. The literature for copulas is mathematically formidable, but this article provides an intuitive introduction to copulas by describing the geometry of the transformations that are involved in the simulation process. Although there are several families of copulas, this article focuses on the Gaussian copula, which is the simplest to understand.

This article shows the geometry of copulas. This article and its example are based on Chapter 9 of Simulating Data with SAS (Wicklin, 2013). A previous article shows the geometry of the Iman-Conover transformation, which is an alternative way to create simulated data that have a specified rank correlation structure and specified marginal distributions.

Simulate data by using a copula

Recall that you can use the CDF function to transform any distribution to the uniform distribution. Similarly, you can use the inverse CDF to transform the uniform distribution to any distribution. To simulate correlated multivariate data from a Gaussian copula, follow these three steps:

  1. Simulate correlated multivariate normal data from a correlation matrix. The marginal distributions are all standard normal.
  2. Use the standard normal CDF to transform the normal marginals to the uniform distribution.
  3. Use inverse CDFs to transform the uniform marginals to whatever distributions you want.

The transformations in the second and third steps are performed on the individual columns of a data matrix. The transformations are monotonic, which means that they do not change the rank correlation between the columns. Thus, the final data has the same rank correlation as the multivariate normal data in the first step.

The next sections explore a two-dimensional example of using a Gaussian copula to simulate correlated data where one variable is Gamma-distributed and the other is Lognormally distributed. The article then provides intuition about what a copula is and how to visualize it. You can download the complete SAS program that generates the example and creates the graphs in this article.

A motivating example

Suppose that you want to simulate data from a bivariate distribution that has the following properties:

  • The rank correlation between the variables is approximately 0.6.
  • The marginal distribution of the first variable, X1, is Gamma(4) with unit scale.
  • The marginal distribution of the second variable, X2, is lognormal with parameters μ=0.5 and σ=0.8. That is, log(X2) ~ N(μ, σ).

The hard part of a multivariate simulation is getting the correlation structure correct, so let's start there. It seems daunting to generate a "Gamma-Lognormal distribution" with a correlation of 0.6, but it is straightforward to generate a bivariate NORMAL distribution with that correlation. Let's do that. Then we'll use a series of transformations to transform the normal marginal variables into the distributions that we want while preserving the rank correlation at each step.

Simulate multivariate normal data

The SAS/IML language supports the RANDNORMAL function, which can generate multivariate normal samples, as shown in the following statements:

proc iml;
N = 1e4;
call randseed(12345);
/* 1. Z ~ MVN(0, Sigma) */
Sigma = {1.0  0.6,
         0.6  1.0};
Z = RandNormal(N, {0,0}, Sigma);          /* Z ~ MVN(0, Sigma) */

The matrix Z contains 10,000 observations drawn from a bivariate normal distribution with correlation coefficient ρ=0.6. Because the mean vector is (0,0) and the covariance parameter is a correlation matrix, the marginal distributions are standard normal. The following graph shows a scatter plot of the bivariate normal data along with histograms for each marginal distribution.

Transform marginal distributions to uniform

The first step is to transform the normal marginals into a uniform distribution by using the probability integral transform (also known as the CDF transformation). The columns of Z are standard normal, so Φ(X) ~ U(0,1), where Φ is the cumulative distribution function (CDF) for the univariate normal distribution. In SAS/IML, the CDF function applies the cumulative distribution function to all elements of a matrix, so the transformation of the columns of Z is a one-liner:

/* 2. transform marginal variables to U(0,1) */
U = cdf("Normal", Z);                     /* U_i are correlated U(0,1) variates */

The columns of U are samples from a standard uniform distribution. However, the columns are not independent. Because the normal CDF is a monotonic transformation, it does not change the rank correlation between the columns. That is, the rank correlation of U is the same as the rank correlation of Z:

/* the rank correlations for Z and U are exactly the same */
rankCorrZ = corr(Z, "Spearman")[2]; 
rankCorrU = corr(U, "Spearman")[2]; 
print rankCorrZ rankCorrU;

The following graph shows a scatter plot of the transformed data along with histograms for each marginal distribution. The histograms show that the columns U1 and U2 are uniformly distributed on [0,1]. However, the joint distribution is correlated, as shown in the following scatter plot:

Transform marginal distributions to any distribution

Now comes the magic. One-dimensional uniform variates are useful because you can transform them into any distribution! How? Just apply the inverse cumulative distribution function of whatever distribution you want. For example, you can obtain gamma variates from the first column of U by applying the inverse gamma CDF. Similarly, you can obtain lognormal variates from the second column of U by applying the inverse lognormal CDF. In SAS, the QUANTILE function applies the inverse CDF, as follows:

/* 3. construct the marginals however you wish */
gamma = quantile("Gamma", U[,1], 4);            /* gamma ~ Gamma(alpha=4)   */
LN    = quantile("LogNormal", U[,2], 0.5, 0.8); /* LN ~ LogNormal(0.5, 0.8) */
X = gamma || LN;
 
/* check that the rank correlation still has not changed */
rankCorrX = corr(X, "Spearman")[2]; 
print rankCorrZ rankCorrX;

The first column of X is gamma-distributed. The second column of X is lognormally distributed. But because the inverse of a continuous CDF is monotonic, the column transformations do not change the rank correlation between the columns. The following graph shows a scatter plot of the newly transformed data along with histograms for each marginal distribution. The histograms show that the columns X1 and X2 are distributed as gamma and lognormal, respectively. The joint distribution is correlated.

What about the Pearson correlation?

For multivariate normal data, the Pearson correlation is close to the rank correlation. However, that is not true for nonnormal distributions. The CDF and inverse-CDF transformations are nonlinear, so the Pearson correlation is not preserved when you transform the marginals. For example, the following statements compute the Pearson correlation for Z (the multivariate normal data) and for X (the gamma-lognormal data):

rhoZ = corr(Z, "Pearson")[2];
rhoX = corr(X, "Pearson")[2];
print rhoZ rhoX;

Chapter 9 of Wicklin (2013) discusses the fact that you can adjust the correlation for Z, and it will affect the correlation for X in a complicated manner. With a little work, you can choose a correlation for Z that will result in a specified correlation for X. For example, if you want the final Pearson correlation for X to be 0.6, you can use 0.642 as the correlation for Z:

/* re-run the example with new correlation 0.642 */
Sigma = {1.0    0.642,
         0.642  1.0};
newZ = RandNormal(N, {0,0}, Sigma);
newU = cdf("Normal", newZ);           /* columns of U are U(0,1) variates */
gamma = quantile("Gamma", newU[,1], 4);      /* gamma ~ Gamma(alpha=4) */
expo = quantile("Expo", newU[,2]);           /* expo ~ Exp(1)          */
newX = gamma || expo;
 
rhoZ = corr(newZ, "Pearson")[2];
rhoX = corr(newX, "Pearson")[2];
print rhoZ rhoX;

Success! The Pearson correlation between the gamma and lognormal variables is 0.6. In general, this is a complicated process because the value of the initial correlation depends on the target value and on the specific forms of the marginal distributions. Finding the initial value requires finding the root of an equation that involves the CDF and inverse CDF.
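
The article does not show how to solve that equation. One simulation-based alternative (a sketch that uses common random numbers and the FROOT function, not a method from the article) is to define the Pearson correlation of the transformed data as a function of the normal correlation and then find a root numerically:

/* sketch: find the normal correlation that yields a target Pearson correlation
   for the Gamma(4) and LogNormal(0.5, 0.8) marginals */
start TargetCorr(rhoZ) global(N, targetRho);
   call randseed(54321, 1);                /* reset the stream: common random numbers */
   Sigma = (1 || rhoZ) // (rhoZ || 1);
   Z = randnormal(N, {0 0}, Sigma);
   U = cdf("Normal", Z);
   X = quantile("Gamma", U[,1], 4) || quantile("LogNormal", U[,2], 0.5, 0.8);
   return ( corr(X, "Pearson")[2] - targetRho );
finish;

N = 1e4;  targetRho = 0.6;
rhoZ = froot("TargetCorr", {0.3 0.9});     /* search for a root in [0.3, 0.9] */
print rhoZ;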

Higher dimensional data

The example in this article generalizes to higher-dimensional data in a natural way. The SAS program that generates the example in this article also includes a four-dimensional example of simulating correlated data. The program simulates four correlated variables whose marginal distributions are distributed as gamma, lognormal, exponential, and inverse Gaussian distributions. The following panel shows the bivariate scatter plots and marginal histograms for this four-dimensional simulated data.

What is a copula?

I've shown many graphs, but what is a copula? The word copula comes from a Latin word (copulare) which means to bind, link, or join. The same Latin root gives us the word "copulate." A mathematical copula is a joint probability distribution that induces a specified correlation structure among independent marginal distributions. Thus, a copula links or joins individual univariate distributions into a joint multivariate distribution that has a specified correlation structure.

Mathematically, a copula is any multivariate cumulative distribution function for which each component variable has a uniform marginal distribution on the interval [0, 1]. That's it. A copula is a cumulative distribution function whose domain is the cube [0,1]^d. So the second graph in this article is the graph most nearly related to the copula for the bivariate data, since it shows the relationship between the uniform marginals.

The scatter plot shows the density, not the cumulative distribution. However, you can use the data to estimate the bivariate CDF. The following heat map visualizes the Gaussian copula for the correlation of 0.6:
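
If you want to build a similar heat map yourself, the following statements are a rough sketch. They assume that the matrix U of uniform marginals from the earlier PROC IML session is still defined, evaluate the empirical copula CDF on a uniform grid, and call the HEATMAPCONT subroutine:

/* sketch: evaluate the empirical copula CDF, F(u,v) = P(U1<=u and U2<=v), on a grid */
u = do(0, 1, 0.05);
F = j(ncol(u), ncol(u), 0);
do i = 1 to ncol(u);          /* i indexes the vertical coordinate */
   do k = 1 to ncol(u);       /* k indexes the horizontal coordinate */
      F[i,k] = mean( (U[,1] <= u[k]) & (U[,2] <= u[i]) );
   end;
end;
call heatmapcont(F) xvalues=u yvalues=u title="Empirical Gaussian Copula, rho=0.6";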

Notice that this function does not depend on the final marginal distributions. It depends only on the multivariate normal distribution and the CDF transformation that produces the uniform marginals. In that sense, the copula captures only the correlation structure, without regard for the final form of the marginal distributions. You can use this same copula for ANY set of marginal distributions because the copula captures the correlation structure independently of the final transformations of the marginals.

Why are copulas important?

A remarkable theorem (Sklar’s theorem) says that every joint distribution can be written as a copula and a set of marginal distributions. If the marginals are continuous, then the copula is unique. Let that sink in for a second: Every continuous multivariate probability distribution can be expressed in terms of its marginal distributions and a copula. Every one. So, if you can understand copulas you can understand all multivariate distributions. Powerful math, indeed!

For simulation studies, you can use the converse of Sklar's theorem, which is also true. Specifically, if you have a set of d uniform random variables and a set of marginal distribution functions, a copula transforms the d components into a d-dimensional probability distribution.

Summary

Copulas are mathematically sophisticated. However, you can use copulas to simulate data without needing to understand all the mathematical details. This article presents an example of using a Gaussian copula to simulate multivariate correlated data. It shows the geometry at each step of the three-step process:

  1. Simulate data from a multivariate normal distribution with a known correlation matrix.
  2. Use the normal CDF to transform the marginal distributions to uniform.
  3. Use inverse CDFs to obtain any marginal distributions that you want.

The result is simulated correlated multivariate data that has the same rank correlation as the original simulated data but has arbitrary marginal distributions.

This article uses SAS/IML to show the geometry of the copula transformations. However, SAS/ETS software provides PROC COPULA, which enables you to perform copula modeling and simulation in a single procedure. A subsequent article shows how to use PROC COPULA.

The post An introduction to simulating correlated data by using copulas appeared first on The DO Loop.

June 30, 2021
 

A recent article about how to estimate a two-dimensional distribution function in SAS inspired me to think about a related computation: a 2-D cumulative sum. Suppose you have numbers in a matrix, X. A 2-D cumulative sum is a second matrix, C, such that C[p,q] gives the sum of the cells of X for rows less than or equal to p and columns less than or equal to q. In symbols, C[p,q] = sum( X[i,j], i ≤ p, j ≤ q ).

This article shows a clever way to compute a 2-D cumulative sum and shows how to use this computation to compute a 2-D ogive, which is a histogram-based estimate of a 2-D cumulative distribution (CDF). An example of a 2-D ogive is shown to the right. You can compare the ogive to the 2-D CDF estimate from the previous article.

2-D cumulative sums

A naive computation of the 2-D cumulative sums would use nested loops to sum submatrices of X. If X is an N x M matrix, you might think that the cumulative sum requires computing N*M calls to the SUM function. However, there is a simpler computation that requires only N+M calls to the CUSUM function. You can prove by induction that the following algorithm gives the cumulative sums:

  1. Compute the cumulative sums for each row. In a matrix language such as SAS/IML, this requires N calls to the CUSUM function.
  2. Use the results of the first step and compute the "cumulative sums of the cumulative sums" down the columns. This requires M calls to the CUSUM function.

Thus, in the SAS/IML language, a 2-D cumulative sum algorithm looks like this:

proc iml;
/* compute the 2-D CUSUM */
start Cusum2D(X);
   C = j(nrow(X), ncol(X), .);  /* allocate result matrix */
   do i = 1 to nrow(X);
      C[i,] = cusum( X[i,] );   /* cusum across each row */
   end;
   do j = 1 to ncol(X);
      C[,j] = cusum( C[,j] );   /* cusum down columns */
   end;
   return C;
finish;
store module=(Cusum2D);
 
Count = { 7  6 1 3,
          6  3 5 2,
          1  2 4 1};
Cu = Cusum2D(Count);
print Count, Cu;

The output shows the original matrix and the 2-D cumulative sums. For example, in the CU matrix, the (2,3) cell contains the sum of the submatrix Count[1:2, 1:3]. That is, the value 28 in the CU matrix is the sum (7+6+1+6+3+5) of values in the Count matrix.

From cumulative sums to ogives

Recall that an ogive is a cumulative histogram. If you start with a histogram that shows the number of observations (counts) in each bin, then an ogive displays the cumulative counts. If you scale the histogram so that it estimates the probability density, then the ogive estimates the univariate CDF.

A previous article showed how to construct an ogive in one dimension from the histogram. To construct a two-dimensional ogive, do the following:

  • You start with a 2-D histogram. For definiteness, let's assume that the histogram displays the counts in each two-dimensional bin.
  • You compute the 2-D cumulative sums, as shown in the previous section. If you divide the cumulative sums by the total number of observations, you obtain the cumulative proportions.
  • The matrix of cumulative proportions is the 2-D ogive. However, by convention, an ogive starts at 0. Consequently, you should prepend a row of zeros at the top and a column of zeros at the left of the matrix of cumulative proportions, as shown in the sketch after this list.
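
As a rough sketch of the second and third steps (assuming the Cusum2D function from the previous section is defined), a helper function might look like the following. The name Ogive2DSketch is hypothetical; the real Ogive2D function described below also returns the bin endpoints.

start Ogive2DSketch(Count);
   C = Cusum2D(Count) / sum(Count);        /* cumulative proportions */
   zeroRow = j(1, ncol(C)+1, 0);           /* row of zeros for the top */
   zeroCol = j(nrow(C), 1, 0);             /* column of zeros for the left */
   return ( zeroRow // (zeroCol || C) );   /* pad so the ogive starts at 0 */
finish;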

You can download the complete SAS/IML program that computes the ogive. The program defines two SAS/IML functions. The Hist2D function creates a matrix of counts for bivariate data. The Ogive2D function takes the matrix of counts and converts it to a matrix of cumulative proportions by calling the Cusum2D function in the previous section. It also pads the cumulative proportions with a row and a column of zeros. The function returns a SAS/IML list that contains the ogive and the vectors of endpoints for the ogive. The following statements construct an ogive from the bivariate data for the MPG_City and Weight variables in the Sashelp.Cars data set:

proc iml;
/* load computational modules */
load module=(Hist2D Cusum2D Ogive2D);
 
/* read bivariate data */
use Sashelp.Cars;
   read all var {MPG_City Weight} into Z;
close;
 
* define cut points in X direction (or use the GSCALE function);
cutX = do(10, 60, 5);        
cutY = do(1500, 7500, 500);
Count = Hist2D(Z, cutX, cutY);   /* matrix of counts in each 2-D bin */
L = Ogive2D(Count, cutX, cutY);  /* return value is a list */
ogive = L$'Ogive';               /* matrix of cumulative proportions */
print ogive[F=Best5. r=(L$'YPts') c=(L$'XPts')];

The table shows the values of the ogive. You could use bilinear interpolation if you want to estimate the curve that corresponds to a specific probability, such as 0.5. You can also visualize the ogive by using a heat map, as shown at the top of this article. Notice the difference between the tabular and graphical views: rows increase DOWN in a matrix, but the Y-axis points UP in a graph. To compare the table and graph, you can reverse the rows of the table.

Summary

This article shows how to compute a 2-D cumulative sum. You can use the 2-D cumulative sum to create a 2-D ogive from the counts of a 2-D histogram. The ogive is a histogram-based estimate of the 2-D CDF. You can download the complete SAS program that computes 2-D cumulative sums, histograms, and ogives.

The post Compute 2-D cumulative sums and ogives appeared first on The DO Loop.

June 28, 2021
 

This article shows how to estimate and visualize a two-dimensional cumulative distribution function (CDF) in SAS. SAS has built-in support for this computation. Although the bivariate CDF is not used as much as the univariate CDF, the bivariate version is still a useful tool in understanding the probable values of a random variate for a two-dimensional distribution.

Recall that the univariate CDF at the value x is the probability that a random observation from a distribution is less than x. In symbols, if X is a random variable, then the distribution function is \(F_X(x) = P(X\leq x)\). The two-dimensional CDF is similar, but it gives the probability that two random variables are both less than specified values. If X and Y are random variables, the bivariate CDF for the joint distribution is given by \(F_{XY}(x,y) = P(X \leq x ~\mbox{and}~ Y \leq y)\).

Univariate estimates of the CDF

Let's start with a one-dimensional example before exploring the topic in higher dimensions. For univariate probability distributions, statisticians use both the density function and the cumulative distribution function. They each contain the same information, and if you know one you can find the other. The cumulative distribution (CDF) of a continuous distribution function is the integral of the probability density function (PDF); the density function is the derivative of the cumulative distribution.

In terms of data analysis, a histogram is an estimate of a density function. So is a kernel density estimate. Consequently, there are three popular methods for estimating a CDF from data:

  • Use the empirical CDF, which is a step function that increases at each observed data value. You can compute it by using PROC UNIVARIATE.
  • Use an ogive, which is a piecewise-linear cumulative histogram.
  • Integrate a kernel density estimate of the data, which you can do by using PROC KDE.

Let's compare the first and last methods to estimate the CDF. For data, use the MPG_City variable from the Sashelp.Cars data.

/* Use PROC UNIVARIATE to compute the empirical CDF from data */
proc univariate data=sashelp.Cars; /* use the FREQ option to get a table */
   var MPG_City;
   cdfplot MPG_City;
   ods select CDFPlot;
   ods output CDFPlot=UNIOut;
run;
 
/* Use PROC KDE to estimate the CDF from a kernel density estimate */
proc kde data=sashelp.Cars;
   univar MPG_City / CDF out=KDEOut;
run;
 
/* combine the estimates and plot on the same graph */
data All;
set UNIOut(rename=(ECDFY=ECDF ECDFX=X))
    KDEOut(rename=(distribution=kerCDF Value=X));
ECDF = ECDF / 100;  /* convert from percent to proportion */
keep X ECDF kerCDF;
run;
 
title "Estimates of CDF";
proc sgplot data=All;
   label X = "MPG_City";
   series x=X y=ECDF / legendlabel="Empirical CDF";
   series x=X y=kerCDF / legendlabel="Kernel CDF";
   yaxis label="Cumulative Distribution" grid;
   xaxis values=(5 to 60 by 5) grid valueshint;
run;

The two estimates are very close. The empirical CDF is more "jagged," whereas the estimate from the kernel density is smoother. You could use either curve to estimate quantiles or probabilities.

Bivariate estimates of the CDF

Now onto the bivariate case. For bivariate data, you cannot use PROC UNIVARIATE (obviously, from the name!), but you can still use PROC KDE. Instead of using the UNIVAR statement, you can use the BIVAR statement to compute a bivariate KDE. Let's look at the joint distribution of the MPG_City and Weight variables in the Sashelp.Cars data:

title;
proc sgplot data=sashelp.Cars;
   scatter x=MPG_City y=Weight;
run;

The scatter plot shows that the density of these points is concentrated near the lower-left corner of the plot. You can use PROC KDE to visualize the density. The PLOTS= option can create a contour plot and a two-dimensional histogram, as follows:

title "2-D Histogram";
proc kde data=sashelp.Cars;
   bivar MPG_City Weight / plots=(Contour Histogram)
         CDF out=KDEOut(rename=(Distribution=kerCDF Value1=X Value2=Y));
run;

The OUT= option writes a SAS data set that contains the estimates for the density; the CDF option adds the cumulative distribution estimates. The column names are Value1 (for the first variable on the BIVAR statement), Value2 (for the second variable), and Distribution (for the CDF). I have renamed the variables to X, Y, and kerCDF, but that is not required. You can use a heat map (or a contour plot) to visualize the bivariate CDF, as follows:

title "Estimate of 2-D CDF";
proc sgplot data=KDEOut;
   label X="MPG_City" Y="Weight" kerCDF="Kernel CDF";
   heatmapparm x=X y=Y colorresponse=kerCDF; 
run;

The CDF of bivariate normal data

The CDF is not used very often for bivariate data because the density function shows more details. In some sense, all CDFs look similar. In the previous example, the two variables are negatively correlated and are not linearly related. For comparison, the following program uses the RANDNORMAL function in SAS/IML to generate bivariate normal data that has a positive correlation of 0.6. (I have previously shown the empirical CDF for this data, and I have also drawn the graph of the bivariate CDF for the population.) The calls to PROC KDE and PROC SGPLOT are the same as for the previous example:

proc iml;
call randseed(123456);
N = 1e5;                           /* sample size */
rho = 0.6; 
Sigma = ( 1 || rho)//              /* correlation matrix */
        (rho||  1 );
mean = {0 0};
Z = randnormal(N, mean, Sigma);    /* sample from MVN(0, Sigma) */
create BivNorm from Z[c={X Y}];  append from Z;  close;
quit;
 
proc kde data=BivNorm;
   bivar X Y / plots=NONE
         CDF out=KDEOut(rename=(distribution=kerCDF Value1=X Value2=Y));
run;
 
title "Estimate of 2-D Normal CDF";
proc sgplot data=KDEOut aspect=1;
   label kerCDF="Kernel CDF";
   heatmapparm x=X y=Y colorresponse=kerCDF; 
run;

If you compare the bivariate CDF for the Cars data to the CDF for the bivariate normal data, you can see differences in the range of the red region, but the overall impression is the same. The CDF is near 0 in the lower-left corner, near 1 in the upper-right corner, and is approximately 0.5 along an L-shaped curve near the middle of the data. This is the basic shape of a generic 2-D CDF, just as an S-shape or sigmoid-shaped curve is the basic shape of a generic 1-D CDF.

Summary

You can use the CDF option on the BIVAR statement in PROC KDE to obtain an estimate for a bivariate CDF. The estimate is based on a kernel density estimate of the data. You can use the HEATMAPPARM statement in PROC SGPLOT to visualize the CDF surface. If you prefer a contour plot, you can use the graph template language (GTL) to create a contour plot.
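
For reference, the following statements are a minimal sketch of a GTL template for a contour plot of the kernel CDF estimate. The template name (ContourCDF) is arbitrary, and the variable names match the KDEOut data set created in the previous example:

proc template;
define statgraph ContourCDF;
   begingraph;
      entrytitle "Estimate of 2-D CDF";
      layout overlay;
         contourplotparm x=X y=Y z=kerCDF / contourtype=fill name="c";
         continuouslegend "c";
      endlayout;
   endgraph;
end;
run;

proc sgrender data=KDEOut template=ContourCDF;
run;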

The post Estimate a bivariate CDF in SAS appeared first on The DO Loop.

June 23, 2021
 

This article uses simulation to demonstrate the fact that any continuous distribution can be transformed into the uniform distribution on (0,1). The function that performs this transformation is a familiar one: it is the cumulative distribution function (CDF). A continuous CDF is defined as an integral, so the transformation is called the probability integral transformation.

This article demonstrates the probability integral transformation and its close relative, the inverse CDF transformation. These transformations are useful in many applications, such as constructing statistical tests and simulating data.

The probability integral transform

Let X be a continuous random variable whose probability density function is f. Then the corresponding cumulative distribution function (CDF) is the integral \(F(x) = \int\nolimits_{-\infty}^{x} f(t)\,dt\). You can prove that the random variable Y = F(X) is uniformly distributed on (0,1). In terms of data, if {X1, X2, ..., Xn} is a random sample from X, the values {F(X1), F(X2), ..., F(Xn)} are a random sample from the uniform distribution.

Let's look at an example. The following SAS DATA step generates a random sample from the Gamma distribution with shape parameter α=4. At the same time, the DATA step computes U=F(X), where F is the CDF of the gamma distribution. A histogram of the U variable shows that it is uniformly distributed:

/* The probability integral transformation: If X is a random 
   variable with CDF F, then Y=F(X) is uniformly distributed on (0,1).
*/
%let N = 10000;
data Gamma4(drop=i);
call streaminit(1234);
do i = 1 to &N;
   x = rand("Gamma", 4);     /* X ~ Gamma(4) */
   u = cdf("Gamma", x, 4);   /* U = F(X) ~ U(0,1)   */
   output;
end;
run;
 
title "U ~ U(0,1)";
proc sgplot data=Gamma4;
   histogram U / binwidth=0.04 binstart=0.02;  /* center first bin at h/2 */
   yaxis grid;
run;

The histogram has some bars that are greater than the average and others that are less than the average. That is expected in a random sample.

One application of the CDF transformation is to test whether a random sample comes from a particular distribution. You can transform the sample by using the CDF function, then test whether the transformed values are distributed uniformly at random. If the transformed variates are uniformly distributed, then the original data was distributed according to the distribution F.

The inverse probability transformation

It is also useful to run the CDF transform in reverse. That is, start with a random uniform sample and apply the inverse-CDF function (the quantile function) to generate random variates from the specified distribution. This "inverse CDF method" is a standard technique for generating random samples.

The following example shows how it works. The DATA step generates random uniform variates on (0,1). The variates are transformed by using the quantiles of the lognormal(0.5, 0.8) distribution. A call to PROC UNIVARIATE overlays the density curve of the lognormal(0.5, 0.8) distribution on the histogram of the simulated data. The curve matches the data well:

/* inverse probability transformation */
data LN(drop=i);
call streaminit(12345);
do i = 1 to &N;
   u = rand("Uniform");                    /* U ~ U(0,1)   */
   x = quantile("Lognormal", u, 0.5, 0.8); /* X ~ LogNormal(0.5, 0.8) */
   output;
end;
run;
 
proc univariate data=LN;
   var x;
   histogram x / lognormal(scale=0.5 shape=0.8) odstitle="X ~ LogNormal(0.5, 0.8)";
run;

Summary

The probability integral transform (also called the CDF transform) is a way to transform a random sample from any distribution into the uniform distribution on (0,1). The inverse CDF transform transforms uniform data into a specified distribution. These transformations are used in testing distributions and in generating simulated data.

Appendix: Histograms of uniform variables

The observant reader might have noticed that I used the BINWIDTH= and BINSTART= options to set the endpoints of the bin for the histogram of uniform data on (0,1). Why? Because by default the histogram centers bins. For data that are bounded on an interval, you can use the BINWIDTH= and BINSTART= options to improve the visualization.

Let's use the default binning algorithm to create a histogram of the uniform data:

title "U ~ U(0,1)";
title2 "34 Bins: First Bin Centered at 0 (Too Short)";
proc sgplot data=Gamma4;
   histogram u;
   yaxis grid;
run;

Notice that the first bin is "too short." The first bin is centered at the location U=0. Because the data cannot be less than 0, the first bin has only half as many counts as the other bins. The last bin is also affected, although the effect is not usually as dramatic.

The histogram that uses the default binning is not wrong, but it doesn't show the uniformity of the data as clearly as a histogram that uses the BINWIDTH= and BINSTART= options to force 0 and 1 to be the endpoints of the bins.

The post The probability integral transform appeared first on The DO Loop.

June 16, 2021
 

A previous article showed how to simulate multivariate correlated data by using the Iman-Conover transformation (Iman and Conover, 1982). The transformation preserves the marginal distributions of the original data but permutes the values (columnwise) to induce a new correlation among the variables.

When I first read about the Iman-Conover transformation, it seemed like magic. But, as with all magic tricks, if you peer behind the curtain and understand how the trick works, you realize that it's math, not magic. This article shows the main mathematical ideas behind the Iman-Conover transformation. For clarity, the article shows how two variables can be transformed to create different variables that have a new correlation but the same marginal distributions.

A series of transformations

The Iman-Conover transformation is actually a sequence of four transformations. Let X be a data matrix (representing continuous data) and let C be a target rank-correlation matrix. The goal of the Iman-Conover transformation is to permute the elements in the columns of X to create a new data matrix (W) whose rank correlation is close to C. Here are the four steps in the Iman-Conover algorithm:

  1. Create normal scores: Compute the van der Waerden normal scores for each column of X. Let S be the matrix of normal scores.
  2. Uncorrelate the scores. Let Z be the matrix of uncorrelated scores. Let Q-1 be the transformation that maps S to Z.
  3. Correlate the scores to match the target. Let Y be the correlated scores. Let P be the transformation that maps Z to Y.
  4. Reorder the elements in the i_th column of X to match the order of the elements in the i_th column of Y. This ensures that the reordered matrix, W, has the same rank correlation as Y.

The following diagram presents the steps of the Iman-Conover transformation. The original data are in the upper-left graph. The final data are in the lower-left graph. Each transformation is explained in a separate section of this article.

Background information

To understand the Iman-Conover transformation, recall the three mathematical facts that make it possible:

  • A monotonic transformation of the columns of a data matrix (such as the normal-scores transformation) does not change the rank correlation between the columns.
  • For multivariate normal data, you can use the Cholesky root of a correlation matrix to induce that correlation, and you can use the inverse of the Cholesky root to remove correlation.
  • Any two matrices whose columns have the same ranks have the same rank correlation.

The purpose of this article is to present the main ideas behind the Iman-Conover transformation. This is not a proof, and almost every sentence should include the word "approximately." For example, when I write, "the data are multivariate normal" or "the data has the specified correlation," you should insert the word "approximately" into those sentences. I also blur the distinction between correlation in the population and sample correlations.

Example data

This example uses the EngineSize and MPG_Highway variables from the Sashelp.Cars data set. These variables are not normally distributed. They have a rank correlation of -0.77. The goal of this example is to permute the values to create new variables that have a rank correlation that is close to a target correlation. This example uses +0.6 as the target correlation, but you could choose any other value in the interval (-1, 1).

The following DATA step creates the data for this example. The graph in the upper-left of the previous diagram shows the negative correlation between these variables:

data Orig;
set sashelp.cars(rename=(EngineSize=X MPG_Highway=Y));
keep X Y;
label X= Y=;
run;

Transformation 1: Create normal scores

You can compute the van der Waerden scores for each variable. The van der Waerden scores are approximately normal—or at least "more normal" than the original variables. The following SAS/IML statements create the van der Waerden scores as columns of a matrix, S. Because the normalizing transformation is monotonic, the columns of S have the same rank correlation as the original data. The second graph in the previous diagram shows a scatter plot of the columns of S.

proc iml;
varNames = {"X" "Y"};
use Orig; read all var varNames into X; close;
 
/* T1: Create normal scores of each column 
   Columns of S have exactly the same ranks as columns of X.
   ==> Spearman corr is the same    */
N = nrow(X);
S = J(N, ncol(X));
do i = 1 to ncol(X);
   ranks = ranktie(X[,i], "mean");          /* tied ranks */
   S[,i] = quantile("Normal", ranks/(N+1)); /* van der Waerden scores */
end;
/* verify that rank correlation has not changed */
corrX      = corr(X, "Spearman")[2];   /* rank correlation of the original data */
corrScores = corr(S, "Spearman")[2];   /* rank correlation of the normal scores */
print corrX, corrScores;

Transformation 2: Uncorrelate the scores

For many data sets, the van der Waerden scores are approximately multivariate normal. Let CS be the Pearson correlation matrix of S. Let Q be the Cholesky root of CS. The inverse of Q is a transformation that removes any correlations among multivariate normal data. Thus, if you use the inverse of Q to transform the scores, you obtain multivariate data that are uncorrelated. The third graph in the previous diagram shows a scatter plot of the columns of Z.

/* T2: use the inverse Cholesky transform of the scores to uncorrelate */
CS = corr(S);        /* correlation of scores */
Q = root(CS);        /* Cholesky root of correlation of scores */
Z = S*inv(Q);        /* uncorrelated MVN data */
corrZ = corr(Z, "Spearman")[2];
print corrZ;

Transformation 3: Correlate the scores

Let C be the target correlation matrix. Let P be its Cholesky root. The matrix P is a transformation that induces the specified correlations in uncorrelated multivariate normal data. Thus, you obtain data, Y, that are multivariate normal with the given Pearson correlation. In most situations, the Spearman rank correlation is close to the Pearson correlation. The fourth graph in the previous diagram shows a scatter plot of the columns of Y.

/* T3: use the Cholesky transform of the target matrix to induce correlation */
/* define the target correlation */
C = {1    0.6,
     0.6 1  };
P = root(C);         /* Cholesky root of target correlation */
Y = Z*P;
corrY = corr(Y, "Spearman")[2];
print corrY;

Transformation 4: Reorder the values in columns of the original matrix

The data matrix Y has the correct correlations, but the Cholesky transformations of the scores have changed the marginal distributions of the scores. But that's okay: Iman and Conover recognized that any two matrices whose columns have the same ranks also have the same rank correlation. Thus, the last step uses the column ranks of Y to reorder the values in the columns of X.

Iman and Conover (1982) assume that the columns of X do not have any duplicate values. Under that assumption, when you use the ranks of each column of Y to reorder the values in the columns of X, the reordered matrix has exactly the same rank correlation as Y. (And, of course, the reordering does not change the marginal distributions of X.) If X has duplicate values (as in this example), the rank correlation of the reordered matrix is merely close to the rank correlation of Y, as shown by the following statements:

/* T4: Permute or reorder data in the columns of X to have the same ranks as Y */
W = X; 
do i = 1 to ncol(Y);
   rank = rank(Y[,i]);          /* use ranks as subscripts, so no tied ranks */
   tmp = W[,i]; call sort(tmp); /* sort the values in the i_th column */
   W[,i] = tmp[rank];           /* reorder the column of X by the ranks of Y */
end;
 
corrW = corr(W, "Spearman")[2];
print corrW;

The rank correlation for the matrix W is very close to the target correlation. Furthermore, the columns have the same values as the original data, and therefore the marginal distributions are the same. The lower-left graph in the previous diagram shows a scatter plot of the columns of W.

The effect of tied values

As mentioned previously, the Iman-Conover transformation was designed for continuous variables that do not have any duplicate values. In practice, the Iman-Conover transformation still works well even if there are some duplicate values in the data. For this example, there are 428 observations. However, the first column of X has only 43 unique values and the second column has only 33 unique values. Because the normal scores are created by using the RANKTIE function, the columns of the S matrix also have 43 and 33 unique values, respectively.

Because the inverse of Q is upper triangular, the transformation can change the number of unique values in the columns of the Z matrix. For these data, the columns of Z (and of Y) have 43 and 194 unique values, respectively. These unique values determine the matrix W, which is why the rank correlation for W is slightly different from the rank correlation for Y in this example. If all the data values were unique, then W and Y would have exactly the same rank correlation.
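
If you want to verify these counts, you can count the distinct values in each column by using the UNIQUE function. The following SAS/IML statements are a sketch, not part of the original program; they assume that the matrices X, Z, and Y from the previous steps are still defined in the PROC IML session:

/* count the number of unique values in each column of a matrix */
start CountUnique(M);
   c = j(1, ncol(M), .);
   do i = 1 to ncol(M);
      c[i] = ncol( unique(M[,i]) );   /* UNIQUE returns a row vector of distinct values */
   end;
   return c;
finish;
 
print (CountUnique(X))[label="Unique values in columns of X"],
      (CountUnique(Z))[label="Unique values in columns of Z"],
      (CountUnique(Y))[label="Unique values in columns of Y"];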

Summary

The Iman-Conover transformation seems magical when you first see it. This article attempts to lift the curtain and reveal the mathematics behind the algorithm. By understanding the geometry of various transformations, you can understand the mathe-magical properties that make the Iman-Conover algorithm an effective tool in multivariate simulation studies.

You can download the complete SAS program that creates all graphs and tables in this article.

The post The geometry of the Iman-Conover transformation appeared first on The DO Loop.

June 14, 2021
 

Simulating univariate data is relatively easy. Simulating multivariate data is much harder. The main difficulty is to generate variables that have given univariate distributions but also are correlated with each other according to a specified correlation matrix. However, Iman and Conover (1982, "A distribution-free approach to inducing rank correlation among input variables") proposed a transformation that approximately induces a specified correlation among component variables without changing the univariate distribution of the components. (The univariate distributions are called the marginal distributions.) The Iman-Conover transformation is designed to transform continuous variables.

This article defines the Iman-Conover transformation. You can use the transformation to generate a random sample that has (approximately) a given correlation structure and known marginal distributions. The Iman-Conover transformation uses rank correlation, rather than Pearson correlation. In practice, the rank correlation and the Pearson correlation are often close to each other.

This article shows how to use the Iman-Conover transformation. A separate article explains how the Iman-Conover transformation works. The example and SAS program in this article are adapted from Chapter 9 of my book Simulating Data with SAS (Wicklin, 2013).

Create simulated data with known distributions

To use the Iman-Conover transformation, you must specify two things: A set of data and a target correlation matrix. The transformation will induce the specified correlation in the data by rearranging (or permuting) the columns of the data. In other words, the algorithm permutes the columns independently to induce the specified correlation without changing the marginal distributions.

This example generates data with four variables whose marginal distributions are normal, lognormal, exponential, and uniform, respectively. It does not matter whether the original data are correlated or not. For simplicity, generate each variable independently, so the initial correlation matrix is approximately the identity matrix. The following SAS DATA step simulates four variables from the specified distributions. The call to PROC CORR computes the rank correlation of the original data. It also creates a scatter plot matrix that shows the marginal distributions of the variables in the diagonal cells and the bivariate joint distributions in the off-diagonal cells.

%let N = 100; /* sample size */
/* independently simulate variables from known distributions */
data SimIndep;
call streaminit(12345);
do i = 1 to &N;
   Normal    = rand("Normal");
   Lognormal = rand("Lognormal");
   Expon     = rand("Expo");
   Uniform   = rand("Uniform");
   output;
end;
drop i;
run;
 
proc corr data=SimIndep Spearman noprob plots=matrix(hist);
   var Normal Lognormal Expon Uniform;
run;

For these simulated data, the table shows that the rank correlation between any pair of variables is small. None of the correlations are statistically different from 0, which you can verify by removing the NOPROB option from the PROC CORR statement.

Implement the Iman-Conover transformation in SAS

The Iman-Conover transformation is actually a sequence of transformations of multivariate data. The following program defines the Iman-Conover transformation as a SAS/IML function. An explanation of the function is provided in a separate article.

/* define a store a SAS/IML function that computes the Iman-Conover transformation */
proc iml;
/* Input: X is a data matrix with k columns
          C is a (k x k) rank correlation matrix
   Output: A matrix W. The i_th column of W is a permutation of the i_th
          column of X. The rank correlation of W is close to the 
          specified correlation matrix, C.
*/          
start ImanConoverTransform(X, C);
   N = nrow(X);
   S = J(N, ncol(X));
   /* T1: Create normal scores of each column */
   do i = 1 to ncol(X);
      ranks = ranktie(X[,i], "mean");          /* tied ranks */
      S[,i] = quantile("Normal", ranks/(N+1)); /* van der Waerden scores */
   end;
   /* T2: apply two linear transformations to the scores */
   CS = corr(S);        /* correlation of scores */
   Q = root(CS);        /* Cholesky root of correlation of scores */
   P = root(C);         /* Cholesky root of target correlation */
   T = solve(Q,P);      /* same as  T = inv(Q) * P; */
   Y = S*T;             /* transform scores: Y has rank corr close to target C */
 
   /* T3: Permute or reorder data in the columns of X to have the same ranks as Y */
   W = X;
   do i = 1 to ncol(Y);
      rank = rank(Y[,i]);          /* use ranks as subscripts, so no tied ranks */
      tmp = W[,i]; call sort(tmp); /* sort the values in the i_th column */
      W[,i] = tmp[rank];           /* reorder the column of X by the ranks of Y */
   end;
   return( W );
finish;
store module=(ImanConoverTransform);  /* store definition for later use */
quit;

Unlike many algorithms in the field of simulation, the Iman-Conover transformation is deterministic. Given a matrix of data (X) and a correlation matrix (C), the function returns the same transformed data every time you call it.
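
For example, the following statements are a quick sketch that demonstrates the deterministic behavior. They call the stored function twice on the SimIndep data with an arbitrary target correlation matrix (here, the identity matrix, which is my choice for illustration) and confirm that the two results are identical:

proc iml;
load module=(ImanConoverTransform);
use SimIndep;  read all var _NUM_ into X;  close;
 
C = I(ncol(X));                  /* arbitrary valid target correlation matrix */
W1 = ImanConoverTransform(X, C);
W2 = ImanConoverTransform(X, C);
maxDiff = max(abs(W1 - W2));     /* 0 ==> the two calls return the same matrix */
print maxDiff;
quit;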

Use the Iman-Conover transformation to simulate correlated data

The following SAS/IML program shows how to use the Iman-Conover transformation to simulate correlated data. There are three steps:

  1. Read real or simulated data into a matrix, X. The columns of X define the marginal distributions. For this example, we will use the SimIndep data, which contains four variables whose marginal distributions are normal, lognormal, exponential, and uniform, respectively.
  2. Define a "target" correlation matrix. This specifies the rank correlations that you hope to induce. For this example, the target correlation between the Normal and Lognormal variables is 0.75, the target correlation between the Normal and Exponential variables is -0.7, the target correlation between the Normal and Uniform variables is 0, and so forth.
  3. Call the ImanConoverTransform function. This returns a new data matrix, W. The i_th column of W is a permutation of the i_th column of X. The permutations are chosen so that the rank correlation of W is close to the specified correlation matrix:
proc iml;
load module=(ImanConoverTransform);  /* load the function definition */
 
/* Step 1: Read in the data */
varNames = {'Normal' 'Lognormal' 'Expon' 'Uniform'};
use SimIndep; read all var varNames into X; close;
 
/* Step 2: specify target rank correlation */
/*    X1    X2    X3    X4                 */
C = { 1.00  0.75 -0.70  0,           /* X1 */
      0.75  1.00 -0.95  0,           /* X2 */
     -0.70 -0.95  1.00 -0.3,         /* X3 */
      0     0    -0.3   1.0};        /* X4 */
 
/* Step 3: Strategically reorder the columns of X. 
           The new ordering has the specified rank corr (approx)
           and preserves the original marginal distributions. */
W = ImanConoverTransform(X, C);
RankCorr = corr(W, "Spearman");
print RankCorr[format=6.3 r=varNames c=varNames];

As advertised, the rank correlation of the new data matrix, W, is very close to the specified correlation matrix, C. You can write the data matrix to a SAS data set and use PROC CORR to create a scatter plot matrix that has the marginal histograms on the diagonal:

/* write new data to a SAS data set */
newvarNames = {'newNormal' 'newLognormal' 'newExpon' 'newUniform'};
create SimCorr from W[c=newvarNames];  append from W;  close;
quit;
 
proc corr data=SimCorr Spearman noprob plots=matrix(hist);
   var newNormal newLognormal newExpon newUniform;
run;

If you compare the diagonal histograms in this graph to the histograms in the previous graph, you will see that they are identical. They have to be identical because each new variable contains exactly the same data values as the original variables.

At the same time, the bivariate scatter plots show how permuting the data has induced the prescribed correlations between the variables. For this example, the newNormal and newLognormal variables have a strong positive rank correlation (0.73). The newNormal and newExpon variables have a strong negative rank correlation (-0.65). The newNormal and newUniform variables are essentially uncorrelated (-0.03), and so forth.

Summary

The Iman-Conover transformation is a remarkable algorithm that induces a specified correlation structure on a data matrix while preserving the marginal distributions. You start with any data matrix (X) and any target correlation matrix. (The columns of X must represent continuous data.) You apply the Iman-Conover transformation to obtain a new matrix (W). The rank correlation of the new matrix is close to the values you specified. The columns of W have the same values as the columns of X, but the values have been reordered. Thus, the new matrix has the same marginal distributions as the original matrix.

In a follow-up article, I will explain the geometry of the Iman-Conover transformation and how it works.

The post Simulate correlated variables by using the Iman-Conover transformation appeared first on The DO Loop.

June 9, 2021
 

Many nonparametric statistical methods use the ranks of observations to compute distribution-free statistics. In SAS, two procedures that use ranks are PROC NPAR1WAY and PROC CORR. Whereas the SPEARMAN option in PROC CORR (which computes rank correlation) uses only the "raw" tied ranks, PROC NPAR1WAY uses transformations of the ranks, called scores, to compute rank-based analysis of variance (ANOVA), empirical distribution tests, and other analyses. This article shows two things:

  • How to generate normally distributed scores that have the same (rank) correlation as the original data.
  • How these scores are computed if there are tied values in the data.

What is a "score" in statistics?

Statisticians like the word "score." As a verb, "to score" means to evaluate a regression model on one or more observations. As a noun, "raw score" means the original data. "Standardized score" (or z-score) implies that the data have been transformed according to a standardizing transformation. The "rank score" (or Wilcoxon score) is the result of replacing each observation by its rank. Other types of rank-based scores, often called normal scores, are explained in the documentation for PROC NPAR1WAY, including the Blom, Tukey, and van der Waerden scores.
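
To make these terms concrete, the following SAS/IML statements compute the standardized scores and the rank scores for a small vector of raw scores. The data vector is hypothetical and is used only to illustrate the definitions:

proc iml;
x = {3, 1, 4, 1, 5, 9, 2, 6};      /* raw scores (hypothetical data) */
z = (x - mean(x)) / std(x);        /* standardized scores (z-scores) */
r = ranktie(x, "mean");            /* rank scores; tied values get the average rank */
print x z[format=6.3] r;
quit;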

Ranks and normal scores

You can use the NORMAL= option in PROC RANK to obtain three common rank-based normal scores. The three scores are essentially the same and can be used to construct normal Q-Q plots. Assume that the data do not have any duplicate values. Let R_i be the rank of the i_th observation of a data vector, X. Then the normal scores have the form
s_i = Φ^{-1}( (R_i − c) / (n + b) )
where Φ^{-1} is the inverse of the standard normal cumulative distribution function, and n is the number of nonmissing observations. The constants c and b depend on the score type. (A short numerical example appears below.)

  • For Blom's score, c=3/8 and b=1/4
  • For Tukey's score, c=1/3 and b=1/3
  • For van der Waerden's score, c=0 and b=1

If the data have duplicate values, you need to adjust the formulas slightly, as shown in the last section of this article.
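
As a concrete illustration of the formula, the following SAS/IML statements compute the Blom, Tukey, and van der Waerden scores for a small vector that has no tied values. This is a sketch with made-up data; the constants c and b match the bullet list above:

proc iml;
x = {10, 2, 7, 5, 13};                 /* small example with no tied values */
n = nrow(x);
R = rank(x);                           /* ranks; no ties, so the RANK function suffices */
Blom  = quantile("Normal", (R - 3/8) / (n + 1/4));
Tukey = quantile("Normal", (R - 1/3) / (n + 1/3));
VW    = quantile("Normal",  R         / (n + 1)  );
print x R Blom[format=7.4] Tukey[format=7.4] VW[format=7.4];
quit;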

Why are normal scores useful?

The idea behind a normal score is rather clever. You compute the ranks of each variable and scale them so that they are in the range (0,1). You then compute the standard normal quantiles of these values. The resulting scores are often approximately normal.

Because the transformation includes a standardization, some statistics (such as the median and the interquartile range) are different between the data and the scores. However, the ranks of the scores are the same as the ranks of the original data, which means that many rank-based statistics are identical for the data and for the scores. This includes multivariate statistics such as the rank correlation between two variables.

Let's compute an example. First, let's compute the rank correlation between three variables in the Sashelp.Cars data set. The variables are not normally distributed, as shown by the matrix of scatter plots and the marginal histograms:

proc corr data=Sashelp.Cars spearman nosimple noprob
          plots(maxpoints=none)=matrix(hist);
   var EngineSize MPG_Highway Weight;
   label EngineSize= MPG_Highway= Weight=; /* suppress labels */
run;

You can use PROC RANK to create new variables that contain the normal scores for the variables. The NORMAL= option supports all three normal scores. For this example, I will use the van der Waerden scores (NORMAL=VW). The following statements generate the variables, then compute the rank correlation between the scores:

/* write VDW scores to RANKOUT data set */
proc rank data=Sashelp.Cars out=RankOut ties=mean normal=VW;
   var EngineSize MPG_Highway Weight;
   ranks vdwEngineSize vdwMPG_Highway vdwWeight;  /* names of scores */
run;
 
proc corr data=RankOut spearman nosimple noprob
          plots(maxpoints=none)=matrix(hist);
   var vdw:;
   label vdwEngineSize= vdwMPG_Highway= vdwWeight=; /* suppress labels */
run;

The rank correlations between the scores are exactly the same as for the original data. This is because the transformation between data and scores preserves the ranks. The scatter plot matrix of the scores shows that the scores are approximately multivariate normal. The bivariate scatter plots show the elliptical shapes that you would expect for correlated multivariate data.

This example shows the value of transforming data to scores. You get a new set of variables that are approximately multivariate normal and that have the same (rank) correlation as the original data. Although the Pearson product-moment correlation is not invariant under the transformation, the Pearson and Spearman correlations tend to be close to each other, so the Pearson correlation for the data tends to be close to the Spearman correlation for the scores.
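
If you want to check that claim for these data, you can compute the Pearson correlation of the original variables and compare it to the Spearman correlation of the scores that was displayed earlier. The following call to PROC CORR is a minimal sketch:

proc corr data=Sashelp.Cars pearson nosimple noprob;
   var EngineSize MPG_Highway Weight;
run;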

Compute a normal score for tied values

As mentioned previously, when there are tied values, the transformation to produce the normal scores needs to be modified. You might guess that you should compute the tied ranks and then transform the tied ranks into normal scores. However, both PROC NPAR1WAY and PROC RANK do something slightly different. The following quote is from the documentation for PROC NPAR1WAY:

When there are tied values, PROC NPAR1WAY first sorts the observations in ascending order and assigns ranks as if there were no ties. Then the procedure computes the scores based on these ranks by using the formula for the specified score type. The procedure averages the scores for tied observations and assigns this average score to each of the tied observations. Thus, all equal data values have the same score value.

The following SAS/IML program computes the van der Waerden scores for the three variables in the Sashelp.Cars data. First, the program computes the scores "incorrectly" by using the RANKTIE function to compute the averaged ranks and then transforming the tied ranks. Next, the program uses the method in the PROC NPAR1WAY documentation to compute the correct scores. Both methods are compared to the output from PROC RANK. The numerical difference between the incorrect-but-fast method and the correct-but-slower method is very small.

proc iml;
/* read the original data */
varNames = {'EngineSize' 'MPG_Highway' 'Weight'};
use Sashelp.Cars;
read all var varNames into Y;
close;
 
/* The simple method: If there are no ties, these statements 
   transform the data into the van der Waerden scores.
   Assume Y does not have any missing values.
*/
n = nrow(Y);
SimpleScores = j(n, ncol(Y), .);
do i = 1 to ncol(Y);
   r = ranktie(Y[,i], "mean");
   SimpleScores[,i] = quantile("Normal", r/(n+1));
end;
 
/* read the scores from PROC RANK and compare */
use RankOut;
read all var {'vdwEngineSize' 'vdwMPG_Highway' 'vdwWeight'} into VDW;
close;
 
maxDiffSimple = max(abs(SimpleScores-VDW));
 
/* If tied values, the computation is different and more complicated.
*/
start VDWTies(X);
   n = nrow(X);
   rank = rank(X);                     /* ignore ties */
   q = quantile("Normal", rank/(n+1)); /* compute VDW scores */
   score = j(n,1,.);
   uX = unique(X);                     /* get the distinct values */
   do i = 1 to ncol(uX);               /* for each distinct value */
      idx = loc( X=uX[i] );            /* locate all ties for this value */
      score[idx] = mean(q[idx]);       /* average the scores */
   end;
   return score;
finish;
 
scores = j(n, ncol(Y), .);
do j = 1 to ncol(Y);
   scores[,j] = VDWTies(Y[,j]);
end;
 
maxDiff = max(abs(scores-VDW));
print maxDiffSimple maxDiff;

The simple application of the van der Waerden formula is slightly different from the PROC RANK output. To get the correct scores, you must average the scores of the tied values. In practice, you can often use the simpler transformation, especially for large samples and for variables that have only a small proportion of tied values.

Even for moderate-sized samples (this example has 428 observations), the two methods tend to agree. For these data, the EngineSize variable has only 43 unique values. The MPG_Highway variable has only 33 unique values. Nevertheless, the simple computation of the van der Waerden scores is almost the same as the more complicated method that is used by PROC NPAR1WAY. A scatter plot of the two scores shows that the scores computed by the two methods are almost the same. The following statements create the scatter plot for the MPG_Highway variable:

title "MPG_Highway Variable";
call scatter(SimpleScores[,2], scores[,2]) procopt="noautolegend aspect=1"
     label={'Simple Scores' 'Tied Scores'} lineparm={0 0 1};

Summary

This article discusses rank-based normal scores, which are used in certain nonparametric tests. The scores are obtained by transforming the ranks of the original data. The scores have the same ranks as the data. Furthermore, the scores are approximately normally distributed, and they have the same rank correlation as the data.

This article also shows the correct way to compute the normal scores for tied values in the data. However, there is often little difference between the true scores and a simpler approximation that transforms the tied ranks.

The post Rank-based scores and tied values appeared first on The DO Loop.