rick wicklin

September 20, 2017
 
Fisher's Z Transformation: z = arctanh(r)

Pearson's correlation measures the linear association between two variables. Because the correlation is bounded between [-1, 1], the sampling distribution for highly correlated variables is highly skewed. Even for bivariate normal data, the skewness makes it challenging to estimate confidence intervals for the correlation, to run one-sample hypothesis tests ("Is the correlation equal to 0.5?"), and to run two-sample hypothesis tests ("Do these two samples have the same correlation?").

In 1921, R. A. Fisher studied the correlation of bivariate normal data and discovered a wonderful transformation (shown to the right) that converts the skewed distribution of the sample correlation (r) into a distribution that is approximately normal. Furthermore, whereas the variance of the sampling distribution of r depends on the correlation, the variance of the transformed distribution is independent of the correlation. The transformation is called Fisher's z transformation. This article describes Fisher's z transformation and shows how it transforms a skewed distribution into a normal distribution.

The distribution of the sample correlation

The following graph (click to enlarge) shows the sampling distribution of the correlation coefficient for bivariate normal samples of size 20 for four values of the population correlation, rho (ρ). You can see that the distributions are very skewed when the correlation is large in magnitude.

Sampling distributions of correlation for bivariate normal data of size N=20

The graph was created by using simulated bivariate normal data as follows:

  1. For rho=0.2, generate M random samples of size 20 from a bivariate normal distribution with correlation rho. (For this graph, M=2500.)
  2. For each sample, compute the Pearson correlation.
  3. Plot a histogram of the M correlations.
  4. Overlay a kernel density estimate on the histogram and add a reference line to indicate the correlation in the population.
  5. Repeat the process for rho=0.4, 0.6, and 0.8.

The histograms approximate the sampling distribution of the correlation coefficient (for bivariate normal samples of size 20) for the various values of the population correlation. The distributions are not simple. Notice that the variance and the skewness of the distributions depend on the value of the underlying correlation (ρ) in the population.
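The following SAS/IML statements are a minimal sketch of steps 1–3 for a single value of rho; the seed and output data set name are arbitrary choices, not necessarily those used to create the graph:

proc iml;
call randseed(54321);                       /* arbitrary seed */
rho = 0.2;  N = 20;  M = 2500;              /* parameters from the text */
Sigma = (1 || rho) // (rho || 1);           /* 2 x 2 correlation matrix */
r = j(M, 1);                                /* holds the M sample correlations */
do i = 1 to M;
   X = randnormal(N, {0 0}, Sigma);         /* bivariate normal sample of size N */
   r[i] = corr(X)[1,2];                     /* Pearson correlation of the sample */
end;
create CorrSim var {"r"};  append;  close;  /* write correlations for plotting */
quit;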

Fisher's transformation of the correlation coefficient

Fisher sought to transform these distributions into normal distributions. He proposed the transformation f(r) = arctanh(r), which is the inverse hyperbolic tangent function. The graph of arctanh is shown at the top of this article. Fisher's transformation can also be written as (1/2)log( (1+r)/(1-r) ). This transformation is sometimes called Fisher's "z transformation" because the letter z is used to represent the transformed correlation: z = arctanh(r).

How he came up with that transformation is a mystery to me, but he was able to show that arctanh is a normalizing and variance-stabilizing transformation. That is, when r is the sample correlation for bivariate normal data and z = arctanh(r), then the following statements are true (see Fisher, Statistical Methods for Research Workers, 6th Ed, pp 199-203):

Transformed sampling distributions of correlation for bivariate normal data of size N=20
  • The distribution of z is approximately normal and "tends to normality rapidly as the sample is increased" (p 201).
  • The standard error of z is approximately 1/sqrt(N-3), which is independent of the value of the correlation.

The graph to the right demonstrates these statements. The graph is similar to the preceding panel, except these histograms show the distributions of the transformed correlations z = arctanh(r). In each cell, the vertical line is drawn at the value arctanh(ρ). The curves are normal density estimates with σ = 1/sqrt(N-3), where N=20.

The two features of the transformed variables are apparent. First, the distributions are normally distributed, or, to quote Fisher, "come so close to it, even for a small sample..., that the eye cannot detect the difference" (p. 202). Second, the variance of these distributions is constant and is independent of the underlying correlation.

Fisher's transformation and confidence intervals

From the graph of the transformed variables, it is clear why Fisher's transformation is important. If you want to test some hypothesis about the correlation, the test can be conducted in the z coordinates where all distributions are normal with a known variance. Similarly, if you want to compute a confidence interval, the computation can be made in the z coordinates and the results "back transformed" by using the inverse transformation, which is r = tanh(z).

You can perform the calculations by applying the standard formulas for normal distributions (see p. 3-4 of Shen and Lu (2006)), but most statistical software provides an option to use the Fisher transformation to compute confidence intervals and to test hypotheses. In SAS, the CORR procedure supports the FISHER option to compute confidence intervals and to test hypotheses for the correlation coefficient.

The following call to PROC CORR computes a sample correlation between the length and width of petals for 50 Iris versicolor flowers. The FISHER option specifies that the output should include confidence intervals based on Fisher's transformation. The RHO0= suboption tests the null hypothesis that the correlation in the population is 0.75. (The BIASADJ= suboption turns off a bias adjustment; a discussion of the bias in the Pearson estimate will have to wait for another article.)

proc corr data=sashelp.iris fisher(rho0=0.75 biasadj=no);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;
Use Fisher's transformation to compute confidence intervals and to test hypotheses in PROC CORR in SAS

The output shows that the Pearson estimate is r=0.787. A 95% confidence interval for the correlation is [0.651, 0.874]. Notice that r is not the midpoint of that interval. In the transformed coordinates, z = arctanh(0.787) = 1.06 is the center of a symmetric confidence interval (based on a normal distribution with standard error 1/sqrt(N-3)). However, the inverse transformation (tanh) is nonlinear, and the right half-interval gets compressed more than the left half-interval.
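As a check, the following DATA step is a small sketch that applies the standard normal-theory formulas to the values in the output (r=0.787, N=50); it should reproduce the interval [0.651, 0.874]:

data FisherCI;
   r = 0.787;  N = 50;                      /* estimate and sample size from the output */
   z  = atanh(r);                           /* Fisher's z transformation */
   se = 1 / sqrt(N-3);                      /* standard error of z */
   zL = z - quantile("Normal", 0.975)*se;   /* symmetric interval in z coordinates */
   zU = z + quantile("Normal", 0.975)*se;
   rL = tanh(zL);  rU = tanh(zU);           /* back-transform to the r scale */
run;
proc print data=FisherCI noobs;  var z rL rU;  run;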

For the hypothesis test of ρ = 0.75, the output shows that the p-value is 0.574. The data do not provide evidence to reject the hypothesis that ρ = 0.75 at the 0.05 significance level. The computations for the hypothesis test use only the transformed (z) coordinates.

Summary

This article shows that Fisher's "z transformation," which is z = arctanh(r), is a normalizing transformation for the Pearson correlation of bivariate normal samples of size N. The transformation converts the skewed and bounded sampling distribution of r into a normal distribution for z. The standard error of the transformed distribution is 1/sqrt(N-3), which does not depend on the correlation. You can perform hypothesis tests in the z coordinates. You can also form confidence intervals in the z coordinates and use the inverse transformation (r=tanh(z)) to obtain a confidence interval for ρ.

The Fisher transformation is exceptionally useful for small sample sizes because, as shown in this article, the sampling distribution of the Pearson correlation is highly skewed for small N. When N is large, the sampling distribution of the Pearson correlation is approximately normal except for extreme correlations. Although the theory behind the Fisher transformation assumes that the data are bivariate normal, in practice the Fisher transformation is useful as long as the data are not too skewed and do not contain extreme outliers.

You can download the SAS program that creates all the graphs in this article.

The post Fisher's transformation of the correlation coefficient appeared first on The DO Loop.

September 18, 2017
 
Toe bone connected to the foot bone,
Foot bone connected to the leg bone,
Leg bone connected to the knee bone,...
             — American Spiritual, "Dem Bones"

Last week I read an interesting article on Robert Kosara's data visualization blog. Kosara connected the geographic centers of the US zip codes in the lower 48 states, in order, from 01001 (Agawam, MA) to 99403 (Clarkston, WA). Since the SasHelp.zipcode data set is one of the sample data sets that is distributed with SAS, it is a simple matter to recreate the graph with SAS. The following SAS statements sort the data and exclude certain American territories before graphing the locations of the US zip codes in the contiguous US. (Click on a graph to enlarge it.):

proc sort data=sashelp.zipcode(where=(StateCode 
     NOT IN ("PR", "FM", "GU", "MH", "MP", "PW", "VI")  /* exclude territories */
     AND ZIP_Class = " "))                              /* exclude "special" zip codes */
     out=zipcode(keep=ZIP x y City StateCode);
by zip;
run;
 
title "Path of US Zip Codes in Numerical Order";
proc sgplot data=zipcode noborder;
   where StateCode NOT IN ("AK", "HI");                 /* contiguous US */
   series x=X y=Y / group=StateCode;
   xaxis  display=none;
   yaxis  display=none;
run;
Map formed by connecting the US ZIP codes in order

Note that the coordinates are "raw" longitudes and latitudes; no projection is used. The graph is interesting. On this scale, the western states clearly show that the zip codes form small clusters within the states. This fact is also true for the Eastern states, but it is harder to see because of the greater density of zip codes in those states. Kosara includes an interactive map that enables you to zoom in on regions of interest. To create a static "zoomed-in" version in SAS, you can use WHERE processing to subset the data. You can also add state boundaries and labels to obtain a close-up map, such as this map of Utah (UT), Colorado (CO), Arizona (AZ), and New Mexico (NM):

Connect zip codes in order for four western US states

This graph shows that there are about a dozen zip-code clusters within each of these states. This reflects the hierarchical nature of zip codes. By design, the first digit in a zip code specifies a region of the country, such as the Southeast or the Midwest. The next two digits determine the Sectional Center Facility (SCF) code, which is a region within a state. From looking at the map, I conjecture that those clusters of jagged line segments are approximations of the SCFs.

You can use the following SAS code to generate the SCF from the zip code. The subsequent call to PROC FREQ counts the number of zip codes in each SCF in Utah. Some SCFs have many postal delivery zones within them (for example, 840xx) whereas others have relatively few (for example, 844xx). Note that the SCFs are not necessarily contiguous: Utah does not (yet) have zip codes of the form 842xx.

data Zip1;
   length SCF $5.;
   set zipcode;
   FloorZip = floor(zip/100);         /* round down to nearest 100 */
   SCF = putn(FloorZip,"Z3.")||"xx";  /* Sectional Center Facility, eg, 841xx */
   keep x y zip StateCode City SCF;
run;
proc freq data=Zip1;
   where StateCode='UT';
   tables SCF / nocum;
run;
Counts of zip codes within each SCF code in Utah

If you choose a point within each Sectional Center Facility and connect those points in order, you can obtain a much less cluttered diagram that shows the basic progression of the hierarchy of zip codes. The SCFs can zig-zag across a state and do not necessarily follow a geographical progression such as north-south or east-west. The following image connects the location of the first zip code in each SCF region in Utah. The individual zip-code centers are shown as markers that are colored by the SCF.
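One way to construct such a subset is sketched below; the data set names are arbitrary, and this is not necessarily the code that produced the graph:

proc sort data=Zip1 out=UTZip;
   by SCF zip;
   where StateCode='UT';
run;
data UTPath;
   set UTZip;
   by SCF;
   if first.SCF;                            /* keep the first zip code in each SCF */
run;
proc sgplot data=UTPath;
   series x=X y=Y / markers datalabel=SCF;
run;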

Map that connects the Sectional Center Facility (SCF) codes in Utah

For states that have more than a dozen SCFs, the five-character labels can obscure the path of the SCFs. If you don't care about the actual zip-code prefixes but you just want to visualize the progression, you can label positions along the path by integers. For example, there are 25 SCFs in Florida. The following graph visualizes the regions. The first SCF (320xx) is labeled '1' and the last SCF (349xx) is labeled '25'.

Map that connects the Sectional Center Facility (SCF) codes in Florida

Lastly, the following graph shows the progression of Sectional Center Facilities at the national level. You can see certain large "jumps" across multiple states. These are present in the original map of zip codes but are obscured by the complexity of thousands of crisscrossing line segments. Two large jumps that are apparent are a diagonal line from Montana in the Pacific Northwest (prefix '59') down to Illinois (prefix '60'). Another big jump is from Nebraska in the Midwest (prefix '69') down to Louisiana (prefix '70') in the South-Central region.

Map showing US Sectional Center Facility (SCF) codes connected in order

In summary, this article uses SAS to reproduce the fascinating image on Robert Kosara's blog. Kosara's image is the path obtained by connecting the geographic centers of each zip code. By zooming in on individual states, you can identify clusters of zip codes, which correspond to Sectional Center Facilities (SCFs). By choosing one zip code from each SCF and connecting those locations in order, you obtain a simplified version of the path that connects major postal zones.

If you would like to explore these data yourself, you can download the SAS program that analyzes the zip code data.

The post The path of zip codes appeared first on The DO Loop.

September 13, 2017
 
Simulate clusters of multivariate normal observations in SAS by using a Gaussian mixture distribution

This article shows how to simulate data from a mixture of multivariate normal distributions, which is also called a Gaussian mixture. You can use this simulation to generate clustered data. The adjacent graph shows three clusters, each simulated from a four-dimensional normal distribution. Each cluster has its own within-cluster covariance, which controls the spread of the cluster and the amount of overlap between clusters.

This article is based on an example in Simulating Data with SAS (Wicklin, 2013, p. 138). Previous articles have explained how to simulate from a multivariate normal distribution and how to generate a random sample from a univariate mixture distribution.

A review of mixture distributions

The graph at the top of this article shows 100 random observations from a Gaussian mixture distribution. A mixture distribution consists of K component distributions and a set of mixing probabilities that determine the probability that a random observation belongs to each component. For example, if π = {0.35, 0.5, 0.15} is a vector of mixing probabilities, then in large random samples about 35% of the observations are drawn from the first component, about 50% from the second component, and about 15% from the third component.

For a mixture of Gaussian components, you need to specify the mean vector and the covariance matrix for each component. For this example, the means and covariances are approximately the estimates from Fisher's famous iris data set, so the scatter plot matrix might look familiar to statisticians who have previously analyzed the iris data. The means of the three component distributions are μ1 = {50, 34, 15, 2}, μ2 = {59, 28, 43, 13}, and μ3 = {66, 30, 56, 20}. The covariance matrices for the component distributions are shown later.

Simulate a Gaussian mixture in SAS

The SAS/IML language is the easiest way to simulate multivariate data in SAS. To simulate from a mixture of K Gaussian distributions, do the following:

  1. Generate a random draw from the multinomial distribution with probability vector π. This gives the number of observations to sample from each component.
  2. For each component, simulate from a multivariate normal distribution.

A technical issue is how to pass the mean vectors and covariance matrices into a module that simulates the data. If you are using SAS/IML 14.2 or beyond, you can use a list of lists (Wicklin, 2017, p. 12–14). For compatibility with older versions of SAS, the implementation in this article uses a different technique: each mean and covariance are stored as rows of a matrix. Because the covariance matrices are symmetric, Wicklin (2013) stores them in lower-triangular form. For simplicity, this article stores each 4 x 4 covariance matrix as a row in a 3 x 16 matrix.

proc iml;
/* Simulate from mixture of K multivariate normal distributions in dimension d.  
   OUTPUT variables:
   X  : a SampleSize x d matrix of simulated observations
   ID : a column vector that contains the component from which each obs is drawn
 
   INPUT variables:
   pi  : 1 x K vector of mixing probabilities, sum(pi)=1
   mu  : K x d matrix whose i_th row contains the mean vector for the i_th component
   Cov : K x (d**2) matrix whose i_th row contains the covariance matrix for the i_th component
*/
start SimMVNMixture(X, ID,                     /* output arguments */
                    SampleSize, pi, mu, Cov);  /* input arguments  */
   K = ncol(pi);                               /* number of components */
   d = ncol(mu);                               /* number of variables */
   X = j(SampleSize, d);                       /* output: each row is observation */
   ID = j(SampleSize, 1);                      /* ID variable */
   N = RandMultinomial(1, SampleSize, pi);     /* vector of sample sizes for components */
   b = 1;                                      /* b = beginning index for group i */
   do i = 1 to K;
      e = b + N[i] - 1;                        /* e = ending index for group i */
      ID[b:e] = i;                             /* set ID variable */
      c = shape(Cov[i,], d, d);                /* reshape Cov to square matrix */
      X[b:e, ] = RandNormal(N[i], mu[i,], c);  /* simulate i_th MVN sample */
      b = e + 1;                               /* next group starts at this index */
   end;
finish;

The SimMVNMixture routine allocates a data matrix (X) that is large enough to hold the results. It generates a vector N ={N1, N2,..., NK} to determine the number of observations that will be drawn from each component and calls the RANDNORMAL function to simulate from each Gaussian component. The scalar values b and e keep track of the beginning and ending rows of each sample from each component.

After the module is defined, you can call it for a specific set of parameters. Assume you want to generate K=3 clusters of four-dimensional data (d=4). The following statements specify the mixing probabilities for a three-component model. The mu matrix is a K x d matrix whose rows are the mean vectors for the components. The Cov matrix is a K x (d**2) matrix whose rows are the covariance matrices for the components. The following statements generate a total of 100 observations from the three-component mixture distribution:

/* specify input args; means/cov correspond to sashelp.iris data for species */
pi = {0.35 0.5 0.15};                   /* mixing probs for K=3 groups */
mu = {50 34 15  2 ,                            /* means of Group 1     */
      59 28 43 13 ,                            /* means of Group 2     */
      66 30 56 20 };                           /* means of Group 3     */
/* specify within-group covariances */
Cov = {12 10  2 1  10 14 1 1   2 1  3 1  1 1 1 1 ,   /* cov of Group 1 */
       27  9 18 6   9 10 8 4  18 8 22 7  6 4 7 4 ,   /* cov of Group 2 */
       40  9 30 5   9 10 7 5  30 7 30 5  5 5 5 8 };  /* cov of Group 3 */
/* run the simulation to generate 100 observations */
call randseed(12345);
run SimMVNMixture(X, Group, 100, pi, mu, Cov);

The call to the SimMVNMixture routine returned the simulated random sample in X, which is a 100 x d matrix. The module also returns an ID vector that identifies the component from which each observation was drawn. You can visualize the random sample by writing the data to a SAS data set and using the SGSCATTER procedure to create a paneled scatter plot, as follows:

/* save simulated data to data set */
varNames="x1":"x4";
Y = Group || X;
create Clusters from Y[c=("Group" || varNames)];  append from Y;  close;
quit;
 
ods graphics / attrpriority=none;
title "Multivariate Normal Components";
title2 "(*ESC*){unicode pi} = {0.35, 0.5, 0.15}";
proc sgscatter data=Clusters;
   compare y=(x1 x2) x=(x3 x4) / group=Group markerattrs=(Size=12);
run;

The graph is shown at the top of this article.

Although each component in this example is multivariate normal, the same technique will work for any component distributions. For example, one cluster could be multivariate normal, another multivariate t, and a third multivariate uniform.
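For example, the following fragment is a hypothetical modification of the DO loop in the SimMVNMixture module that draws the second component from a multivariate t distribution; the choice of component and the 5 degrees of freedom are arbitrary:

   /* inside the DO loop, replace the call to RANDNORMAL with: */
   if i = 2 then
      X[b:e, ] = RandMVT(N[i], 5, mu[i,], c);    /* multivariate t component (5 df) */
   else
      X[b:e, ] = RandNormal(N[i], mu[i,], c);    /* multivariate normal component */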

In summary, you can create a function module in the SAS/IML language to simulate data from a mixture of Gaussian components. The RandMultinomial function returns a random vector that determines the number of observations to draw from each component. The RandNormal function generates the observations. A technical detail involves how to pass the parameters to the module. This implementation packs the parameters into matrices, but other options, such as a list of lists, are possible.

The post Simulate multivariate clusters in SAS appeared first on The DO Loop.

September 11, 2017
 

Did you know that you can get SAS to compute symbolic (analytical) derivatives of simple functions, including applying the product rule, quotient rule, and chain rule? SAS can form the symbolic derivatives of single-variable functions and partial derivatives of multivariable functions. Furthermore, the derivatives are output in a form that can be pasted into a SAS program. The trick is to use PROC NLIN with the LIST option.

In SAS/IML, the nonlinear optimization routines will use analytical derivatives if they are provided; otherwise, they will automatically generate numerical finite-difference approximations to the derivatives. I rarely specify derivatives because it can be time-consuming and prone to error. But recently I realized that PROC NLIN and other SAS procedures automatically generate symbolic derivatives, and you can "trick" the procedure into displaying the derivatives so that they can be used in other applications. To get PROC NLIN to output symbolic derivatives, do the following:

  • Create a data set to use for the DATA= option of PROC NLIN.
  • List the variables that you want to take derivatives of on the PARMS statement. In open code, assign a value to constants that appear in the function.
  • Specify the function to differentiate on the MODEL statement.

Symbolic derivatives of functions of one variable

Here is a one-variable example. Suppose you want to take the derivative of
f(x) = x^3 + x sin(x) + exp( x^2 ).
The function and its derivative are shown to the right.

The following SAS statements create a "fake" data set that defines a variable named F. The call to PROC NLIN sets up the problem. The LIST option on the PROC NLIN statement generates the 'ProgList' table, which contains the derivative function:

data _dummy;
   f = 1;                                    /* name of function */
run;
 
proc nlin data=_dummy list;
   parms x=0;                                /* list variables */
   model f = x**3 + x*sin(x) + exp(x**2);    /* define function */
   ods select ProgList;                      /* output symbolic derivatives */
run;
Symbolic derivatives in SAS for a univariate function

The output shows the expressions for the function ("MODEL.f") and the derivative with respect to x. The derivative df/dx is written as "@MODEL.f/@x", where the symbol '@' is used for the 'd'. You can see that SAS took the simple derivative of the x^3 term, applied the product rule to the x*sin(x) term, and applied the chain rule to the exp( x^2 ) term. You can also see that the expressions are written in SAS notation, with "**" indicating the power operator. SAS functions are written in upper case (SIN, COS, and EXP).

Symbolic partial derivatives of multivariate functions

In exactly the same way, you can obtain partial derivatives. In a partial derivative, all variables are held constant except one. Suppose you have the function for the normal density,
f(x) = 1/(sqrt(2 π) σ) exp( -(x - μ)**2 / (2 σ**2) )
Suppose that you want to compute the partial derivatives ∂f/∂μ and ∂f/∂σ. To compute these derivatives, list μ and σ on the PARMS statement and assign values to the other symbols (x and π) that appear in the function. It does not matter what values you assign to x and π; you are just preventing PROC NLIN from complaining that there are unassigned variables. You could assign the value 1 to both variables, but I prefer to tell the procedure that π equals 3.14. In the following program, note that PROC NLIN reuses the fake data set that defines the F variable:

proc nlin data=_dummy list;
   parms mu=0 sigma=1;                        /* list variables */
   x=1; pi = constant('pi');                  /* assign arbitrary values to constants */
   model f = exp( -(x-mu)**2 / (2*sigma**2) ) / (sqrt(2*pi)*sigma); /* define function */
   ods select ProgList;                      /* output symbolic derivatives */
run;
Symbolic partial derivatives in SAS for a multivariate function

The derivative with respect to σ requires using the chain rule and the quotient rule. Notice that the derivative includes a call to the original function ("MODEL.f"). Notice also that there is repetition of various expressions such as -(x - μ)**2 / (2 σ**2). If you copy and paste these expressions into a SAS program that computes the derivative, the computation will be slightly inefficient because the program will evaluate the same expression multiple times. One way to improve the efficiency is to collect sub-expressions into a temporary variable, as shown in the next section.

Sub-expressions and the chain rule

An experienced statistician will recognize that the expression z = (x - μ) / σ arises naturally as the standardized variable. Using such an expression might simplify the derivatives.

The following program computes the same derivatives as before, except this time the programmer defines the expression z = (x - μ) / σ and uses z in the function. When SAS takes the derivative, it computes dz/dμ and dz/dσ as part of the computation, as shown below:

proc nlin data=_dummy list;
   parms mu=0 sigma=1;                        /* list variables */
   x=1; pi = constant('pi');                  /* assign arbitrary values to constants */
   const = sqrt(2*pi);
   z = (x-mu) / sigma;                        /* intermediate expression */
   model f = exp(-z**2 / 2) / (const*sigma);  /* define function */
   ods select ProgList;                       /* output symbolic derivatives */
run;
Symbolic derivatives in SAS that use sub-expressions and apply the chain rule

Notice that the derivatives use several sub-expressions such as dz/dμ and dz/dσ. This computation is more efficient than the previous one because it reuses previously computed quantities. For another example, define f1 = 1 / (sqrt(2*pi)*sigma) and f2 = exp(-z**2 / 2) and define the MODEL as f = f1*f2. You will see that each term in the derivative is efficiently evaluated in terms of previously computed expressions.

Discussion of symbolic derivatives

As I said previously, I do not usually specify analytical derivatives for my optimizations. I find that numerical derivatives (which are automatically computed) are fast and accurate for 99% of the optimizations that I perform.

Furthermore, some of the objective functions that I optimize do not have analytical derivatives. Others are matrix computations for which it would be difficult and time-consuming to write down the analytical derivative. (For an example, see the article on the derivative of the bivariate normal cumulative distribution.)

If you decide to specify an analytical derivative in SAS, make sure that the derivative is correct and efficient. The technique in this article will help to ensure that the derivative is correct. By strategically defining sub-expressions as in the previous section, you can help SAS generate derivatives that are computationally efficient.

Be aware that the derivatives are generated by applying the chain rule, product rule, and quotient rule without any algebraic simplifications of the result. Thus the result can be unnecessarily complicated. For an example, ask SAS to display the derivative of log( (1+x)/(1-x) ) / 2. The naive result is complicated. If you rewrite the function as log(1+x)/2 - log(1-x)/2, then the derivative is much simpler. However, in neither case will SAS simplify the derivative to obtain 1 / (1-x**2).
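The following sketch, which reuses the _dummy data set from earlier, displays the derivative that SAS generates for the first formulation; change the MODEL statement to the second formulation to compare the results:

proc nlin data=_dummy list;
   parms x=0.5;                              /* list the variable */
   model f = log( (1+x)/(1-x) ) / 2;         /* try also: log(1+x)/2 - log(1-x)/2 */
   ods select ProgList;                      /* output symbolic derivative */
run;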

For more of my thoughts on using derivatives in SAS, see "Two hints for specifying derivatives." What about your thoughts? What do you think about using PROC NLIN to generate automatic derivatives of simple functions? Do you think this technique will be useful to you?

The post Symbolic derivatives in SAS appeared first on The DO Loop.

September 7, 2017
 

If you use SAS regression procedures, you are probably familiar with the "stars and bars" notation, which enables you to construct interaction effects in regression models. Although you can construct many regression models by using that classical notation, a friend recently reminded me that the EFFECT statement in SAS provides greater control over the interaction terms in a regression model.

The EFFECT statement is supported in many SAS procedures, including the GLIMMIX, GLMSELECT, and LOGISTIC procedures. Because those procedures can output a design matrix, you can use a model that is generated by an EFFECT statement in any SAS procedure, even older procedures that do not support the EFFECT statement. I have previously shown how you can use the SPLINE option in the EFFECT statement to generate spline effects. This article deals with polynomial effects, which are effects that are formed by elementwise multiplication of continuous variables, such as x1*x1, x1*x2, and x1*x1*x2*x3.

The following statements rename a few continuous variables in the Sashelp.Heart data set, which will be used in the examples. The new variables (x1, x2, x3, and x4) are easier to type, and using these short variable names makes it easier to see how interaction effects are formed.

data Heart / view=Heart;
set Sashelp.Heart;
rename Height=x1  Weight=x2  Diastolic=x3  Systolic=x4;
run;

A review of "stars and bars" notation in SAS

Recall that many SAS regression procedures (such as GLM, GENMOD, LOGISTIC, MIXED,...) support the bar operator (|) to specify interactions between effects. For example, the following MODEL statement specifies that the model should include all main effects and all higher-order interactions:

proc logistic data=Heart;
   model Y = x1 | x2 | x3 | x4;   /* all main effects and interactions up to 4-way */
run;

The previous MODEL statement includes all two-way, three-way, and four-way interaction effects between distinct variables. In practice, fitting a model with so many effects will lead to overfitting, so most analysts restrict the model to two-way interactions. In SAS you can use the "at" operator (@) to specify the highest interaction terms in the model. For example, the following syntax specifies that the model contains only main effects and two-way interactions:

model Y = x1 | x2 | x3 | x4 @2;   /* main effects and two-way interactions */
/* equivalent: model Y = x1 x2 x3 x4   x1*x2 x1*x3 x1*x4   x2*x3 x2*x4   x3*x4; */

Use the EFFECT statement to build polynomial effects

Notice that the bar operator does not generate the interaction of a variable with itself. For example, the terms x1*x1 and x2*x2 are not generated. Notice also that you need to explicitly type out each variable when you use the bar operator. You cannot use "colon notation" (x:) or hyphens (x1-x4) to specify a range of variables. For four variables, this is an inconvenience; for hundreds of variables, this is a serious problem.

The EFFECT statement enables you to create polynomial effects, which have the following advantages:

  • You can use colon notation or a hyphen to specify a range of variables. (You can also specify a space-separated list of variable names.)
  • You can control the degree of the polynomial terms and the maximum value of the exponent that appears in each term.
  • You can generate interactions between a variable and itself.

The POLYNOMIAL option in the EFFECT statement is described in terms of multivariate polynomials. Recall that a multivariate monomial is a product of powers of variables that have nonnegative integer exponents. For example, x^2 y z is a monomial that contains three variables. The degree of a monomial is the sum of the exponents of the variables. The previous monomial has degree 4 because 4 = 2 + 1 + 1.

Use the EFFECT statement to generate two-way interactions

The syntax for the EFFECT statement is simple. For example, to generate all main effects and two-way interactions, including second-degree terms like x1*x1 and x2*x2, use the following syntax:

ods select ParameterEstimates(persist);
proc logistic data=Heart;
   effect poly2 = polynomial(x1-x4 / degree=2);
   model Status = poly2;
/* equivalent:  model Status = x1 | x2 | x3 | x4  @2     
                               x1*x1 x2*x2 x3*x3 x4*x4;  */
run;

The name of the effect is 'poly2'. It is a polynomial effect that contains all terms that involve first- and second-degree monomials. Thus it contains the main effects, the two-way interactions between variables, and the terms x1*x1, x2*x2, x3*x3, and x4*x4. The equivalent model in "stars and bars" notation is shown in the comment. The models are the same, although the variables are listed in a different order.

You can also use the colon operator to select variables that have a common prefix. For example, to specify all polynomial effects for variables that begin with the prefix 'x', you can use EFFECT poly2 = POLYNOMIAL(x: / DEGREE=2).

You can use the MDEGREE= option to control the maximum degree of any exponent in a monomial. For example, the monomials x1*x1 and x1*x2 are both second-degree monomials, but the maximum exponent that appears in the first monomial is 2 whereas the maximum exponent that appears in the second monomial is 1. The following syntax generates only monomial terms for which the maximum exponent is 1, which means that x1*x1 and x2*x2 will not appear:

proc logistic data=Heart;
   effect poly21 = polynomial(x1-x4 / degree=2 mdegree=1);  /* exclude x1*x1, x2*x2, etc. */
   model Status = poly21;
/* equivalent:  model Status = x1 | x2 | x3 | x4  @2    */
run;

Use the EFFECT statement to generate two-way interactions between lists of variables

A novel use of polynomial effects is to generate all two-way interactions between variables in one list and variables in another list. For example, suppose you are interested in the interactions between the lists (x1 x2) and (x3 x4), but you are not interested in within-list interactions such as x1*x2 and x3*x4. By using polynomial effects, you can create the two lists and then use the bar operator to request all main and two-way interactions between elements of the lists, as follows:

proc logistic data=Heart;
   effect list1 = polynomial(x1-x2);    /* first list of variables */
   effect list2 = polynomial(x3-x4);    /* second list of variables */
   model Status = list1 | list2;        /* main effects and pairwise interactions between lists */
/* equivalent:
   model Status = x1 x2                    
                  x3 x4                
                  x1*x3 x1*x4 x2*x3 x2*x4; */
run;

Notice that you can use the EFFECT statement multiple times in the same procedure. This is a powerful technique! By using multiple EFFECT statements, you can name sets of variables that have similar properties, such as demographic effects (age, sex), lifestyle effects (diet, exercise), and physiological effects (blood pressure, cholesterol). You can then easily construct models that contain interactions between these sets, such as LIFESTYLE | PHYSIOLOGY.

In summary, the POLYNOMIAL option in the EFFECT statement enables you to control the interactions between variables in a model. You can form models that are equivalent to the "stars and bars" syntax or specify more complex models. In particular, you can use the EFFECT statement multiple times to construct lists of variables and interactions between elements of the lists. For more information about polynomial effects, see the SAS documentation for the POLYNOMIAL option in the EFFECT STATEMENT.

The post Construct polynomial effects in SAS regression models appeared first on The DO Loop.

September 5, 2017
 

Correlation is a fundamental statistical concept that measures the linear association between two variables. There are multiple ways to think about correlation: geometrically, algebraically, with matrices, with vectors, with regression, and more. To paraphrase the great songwriter Paul Simon, there must be 50 ways to view your correlation! But don't "slip out the back, Jack," this article describes only seven of them.

How can we understand these many different interpretations? As the song says, "the answer is easy if you take it logically." These seven ways to view your Pearson correlation are based on the wonderful paper by Rodgers and Nicewander (1988), "Thirteen ways to look at the correlation coefficient," which I recommend for further reading.

1. Graphically

The simplest way to visualize correlation is to create a scatter plot of the two variables. A typical example is shown to the right. (Click to enlarge.) The graph shows the heights and weights of 19 students. The variables have a strong linear "co-relation," which is Galton's original spelling for "correlation." For these data, the Pearson correlation is r = 0.8778, although few people can guess that value by looking only at the graph.

For data that are approximately bivariate normal, the points will align northwest-to-southeast when the correlation is negative and will align southwest-to-northeast when the data are positively correlated. If the point-cloud is an amorphous blob, the correlation is close to zero.

In reality, most data are not well-approximated by a bivariate normal distribution. Furthermore, Anscombe's Quartet provides a famous example of four point-clouds that have exactly the same correlation but very different appearances. (See also the images in the Wikipedia article about correlation.) So although a graphical visualization can give you a rough estimate of a correlation, you need computations for an accurate estimate.

2. The sum of crossproducts

In elementary statistics classes, the Pearson sample correlation coefficient between two variables x and y is usually given as a formula that involves sums. The numerator of the formula is the sum of crossproducts of the centered variables. The denominator involves the sums of squared deviations for each variable. In symbols:
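r = Σ (x_i - x̄)(y_i - ȳ) / sqrt( Σ (x_i - x̄)^2 · Σ (y_i - ȳ)^2 ),   where the sums are over i = 1, ..., n.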

The terms in the numerator involve the first central moments and the terms in the denominator involve the second central moments. Consequently, Pearson's correlation is sometimes called the product-moment correlation.

3. The inner product of standardized vectors

I have a hard time remembering complicated algebraic formulas. Instead, I try to visualize problems geometrically. The way I remember the correlation formula is as the inner (dot) product between two centered and standardized vectors. In vector notation, the centered vectors are x - x̄ and y - ȳ. A vector is standardized by dividing by its length, which in the Euclidean norm is the square root of the sum of the squares of the elements. Therefore you can define u = (x - x̄) / || x - x̄ || and v = (y - ȳ) / || y - ȳ ||. Looking back at the equation in the previous section, you can see that the correlation between the vectors x and y is the inner product r = u · v. This formula shows that the correlation coefficient is invariant under affine transformations of the data.
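As a numerical check, the following SAS/IML sketch computes the inner product of the standardized vectors for the heights and weights of 19 students; it assumes the Sashelp.Class data, which matches the description of the example:

proc iml;
use sashelp.class;  read all var {Height Weight} into M;  close;
x = M[,1];  y = M[,2];
u = (x - mean(x)) / sqrt(ssq(x - mean(x)));  /* center and scale to unit length */
v = (y - mean(y)) / sqrt(ssq(y - mean(y)));
r = u` * v;                                  /* inner product = Pearson correlation */
print r;                                     /* 0.8778 */
quit;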

4. The angle between two vectors

Linear algebra teaches that the inner product of two vectors is related to the angle between the vectors. Specifically, u · v = ||u|| ||v|| cos(θ), where θ is the angle between the vectors u and v. Dividing both sides by the lengths of u and v and using the equations in the previous section, we find that r = cos(θ), where θ is the angle between the vectors.

This equation gives two important facts about the Pearson correlation. First, the correlation is bounded in the interval [-1, 1]. Second, it gives the correlation for three important geometric cases. When x and y have the same direction (θ = 0), then their correlation is +1. When x and y have opposite directions (θ = π), then their correlation is -1. When x and y are orthogonal (θ = π/2), their correlation is 0.

5. The standardized covariance

Recall that the covariance between two variables x and y is
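s_xy = Σ (x_i - x̄)(y_i - ȳ) / (n - 1)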

The covariance between two variables depends on the scales of measurement, so the covariance is not a useful statistic for comparing the linear association. However, if you divide by the standard deviation of each variable, then the variables become dimensionless. Recall that the standard deviation is the square root of the variance, which for the x variable is given by
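s_x^2 = Σ (x_i - x̄)^2 / (n - 1)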

Consequently, the expression s_xy / (s_x s_y) is a standardized covariance. If you expand the terms algebraically, the "n - 1" terms cancel and you are left with the Pearson correlation.

6. The slope of the regression line between two standardized variables

Most textbooks for elementary statistics mention that the correlation coefficient is related to the slope of the regression line that estimates y from x. If b is the slope, then the result is r = b (s_x / s_y). That's a messy formula, but if you standardize the x and y variables, then the standard deviations are unity and the correlation equals the slope.

This fact is demonstrated by using the same height and weight data for 19 students. The graph to the right shows the data for the standardized variables, along with an overlay of the least squares regression line. The slope of the regression line is 0.8778, which is the same as the correlation between the variables.
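A sketch of that demonstration (again assuming the Sashelp.Class data) standardizes the variables and fits the regression line; the slope estimate equals the correlation:

proc stdize data=sashelp.class out=Std method=std;  /* center and scale to unit variance */
   var Height Weight;
run;
proc reg data=Std plots=none;
   model Weight = Height;                    /* slope estimate = 0.8778 = r */
run; quit;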

7. Geometric mean of regression slopes

The previous section showed a relationship between two of the most important concepts in statistics: correlation and regression. Interestingly, the correlation is also related to another fundamental concept, the mean. Specifically, when two variables are positively correlated, the correlation coefficient is equal to the geometric mean of two regression slopes: the slope of y regressed on x (b_x) and the slope of x regressed on y (b_y).

To derive this result, start from the equation in the previous section, which is r = b_x (s_x / s_y). By symmetry, it is also true that r = b_y (s_y / s_x). Consequently, r^2 = b_x b_y, or r = sqrt( b_x b_y ), which is the geometric mean of the two slopes.

Summary

This article shows multiple ways that you can view a correlation coefficient: analytically, geometrically, via linear algebra, and more. The Rodgers and Nicewander (1988) article includes other ideas, some of which are valid only asymptotically or approximately.

If you want to see the computations for these methods applied to real data, you can download a SAS program that produces the computations and graphs in this article.

What is your favorite way to think about the correlation between two variables? Leave a comment.

The post 7 ways to view correlation appeared first on The DO Loop.

August 30, 2017
 

A previous article discussed the mathematical properties of the singular value decomposition (SVD) and showed how to use the SVD subroutine in SAS/IML software. This article uses the SVD to construct a low-rank approximation to an image. Applications include image compression and denoising an image.

Construct a grayscale image

The value of each pixel in a grayscale image can be stored in a matrix where each element of the matrix is a value between 0 (off) and 1 (full intensity). I want to create a small example that is easy to view, so I'll create a small matrix that contains information for a low-resolution image of the capital letters "SVD." Each letter will be five pixels high and three pixels wide, arranged in a 7 x 13 array of 0/1 values. The following SAS/IML program creates the array and displays a heat map of the data:

ods graphics / width=390px height=210px;
proc iml;
SVD = {0 0 0 0 0 0 0 0 0 0 0 0 0,
       0 1 1 1 0 1 0 1 0 1 1 0 0,
       0 1 0 0 0 1 0 1 0 1 0 1 0,
       0 1 1 1 0 1 0 1 0 1 0 1 0,
       0 0 0 1 0 1 0 1 0 1 0 1 0,
       0 1 1 1 0 0 1 0 0 1 1 0 0,
       0 0 0 0 0 0 0 0 0 0 0 0 0 };
A = SVD;
/* ColorRamp:  White    Gray1    Gray2    Gray3    Black   */
ramp =        {CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000};
call heatmapcont(A) colorramp=ramp showlegend=0 title="Original Image" range={0,1};
Rank-5 data matrix for singular value decomposition (SVD)

A low-rank approximation to an image

Because the data matrix contains only five non-zero rows, the rank of the A matrix cannot be more than 5. The following statements compute the SVD of the data matrix and create a plot of the singular values. The plot of the singular values is similar to the scree plot in principal component analysis, and you can use the plot to help choose the number of components to retain when approximating the data.

call svd(U, D, V, A);  /* A = U*diag(D)*V` */
title "Singular Values";
call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"};
Singular values of rank-5 data matrix

For this example, it looks like retaining three or five components would be good choices for approximating the data. To see how low-rank approximations work, let's generate and view the rank-1, rank-2, and rank-3 approximations:

keep = 1:3;          /* use keep=1:5 to see all approximations */
do i = 1 to ncol(keep);
   idx = 1:keep[i];                           /* components to keep */
   Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;  /* rank k approximation */
   Ak = (Ak - min(Ak)) / range(Ak);           /* for plotting, stdize into [0,1] */
   s = "Rank = " + char(keep[i],1);
   call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1};
end;
Low-rank approximation: Rank-1 approximation via SVD of a data matrix

The rank-1 approximation does a good job of determining which columns contain the letters and which columns do not. The approximation also picks out two rows that contain the horizontal strokes of the capital "S."

Low-rank approximation: Rank-2 approximation via SVD of a data matrix

The rank-2 approximation refines the image and adds additional details. You can begin to make out the letters "SVD." In particular, all three horizontal strokes for the "S" are visible and you can begin to see the hole in the capital "D."

Low-rank approximation: Rank-3 approximation via SVD of a data matrix

The rank-3 approximation contains enough details that someone unfamiliar with the message can read it. The "S" is reconstructed almost perfectly and the space inside the "V" and "D" is almost empty. Even though this data is five-dimensional, this three-dimensional approximation is very good.

Not only is a low-rank approximation easier to work with than the original five-dimensional data, but a low-rank approximation represents a compression of the data. The original image contains 7 x 13 = 91 values. For the rank-3 approximation, three columns of the U matrix contain 33 numbers and three columns of VT contain 15 numbers. So the total number of values required to represent the rank-3 approximation is only 48, which is almost half the number of values as for the original image.

Denoising an image

You can use the singular value decomposition and low-rank approximations to try to eliminate random noise that has corrupted an image. Every TV detective series has shown an episode in which the police obtain a blurry image of a suspect's face or license plate. The detective asks the computer technician if she can enhance the image. With the push of a button, the blurry image is replaced by a crystal clear image that enables the police to identify and apprehend the criminal.

The image reconstruction algorithms used in modern law enforcement are more sophisticated than the SVD. Nevertheless, the SVD can do a reasonable job of removing small random noise from an image, thereby making certain features easier to see. The SVD has to have enough data to work with, so the following statements duplicate the "SVD" image four times before adding random Gaussian noise to the data. The noise has a standard deviation equal to 25% of the range of the data. The noisy data is displayed as a heat map on the range [-0.5, 1.5] by using a color ramp that includes yellow for negative values and blue for values greater than 1.

call randseed(12345);
A = (SVD // SVD) || (SVD // SVD);                     /* duplicate the image */
A = A + randfun( dimension(A), "Normal", 0, 0.25);    /* add Gaussian noise */
/*       Yellow   White    Gray1    Gray2    Gray3    Black    Blue    */
ramp2 = {CXFFEDA0 CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000 CX3182BD};
call heatmapcont(A) colorramp=ramp2 showlegend=0 title="Noisy Image" range={-0.5, 1.5};
Image corrupted by Gaussian noise

I think most police officers would be able to read this message in spite of the added noise, but let's see if the SVD can clean it up. The following statements compute the SVD and create a plot of the singular values:

call svd(U, D, V, A);  /* A = U*diag(D)*V` */
call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"};
Plot of singular values for a small image that is corrupted by Gaussian noise

There are 14 non-zero singular values. In theory, the main signal is contained in the components that correspond to the largest singular values whereas the noise is captured by the components for the smallest singular values. For these data, the plot of the singular values suggests that three or five (or nine) components might capture the main signal while ignoring the noise. The following statements create and display the rank-3 and rank-5 approximations. Only the rank-5 approximation is shown.

keep = {3 5};        /* which components to examine? */
do i = 1 to ncol(keep);
   idx = 1:keep[i];
   Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;
   Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, standardize into [0,1] */
   s = "Rank = " + char(keep[i],2);
   call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1};
end;
Low-rank approximation: Rank-5 reconstruction via SVD of a small image that is corrupted by Gaussian noise

The denoised low-rank image is not as clear as Hollywood depicts, but it is quite readable. It is certainly good enough to identify a potential suspect. I can hear the detective mutter, "Book 'em, Danno! Murder One."

Summary and further reading

In summary, the singular value decomposition (SVD) enables you to approximate a data matrix by using a low-rank approximation. This article uses a small example for which the full data matrix is rank-5. A plot of the singular values can help you choose the number of components to retain. For this example, a rank-3 approximation represents the essential features of the data matrix.

For similar analyses and examples that use the singular value decomposition, see

In SAS software, the SVD is heavily used in the SAS Text Miner product. For an overview of how a company can use text mining to analyze customer support issues, see Sanders and DeVault (2004) "Using SAS at SAS: The Mining of SAS Technical Support."

The post The singular value decomposition and low-rank approximations appeared first on The DO Loop.

8月 282017
 

The singular value decomposition (SVD) could be called the "billion-dollar algorithm" since it provides the mathematical basis for many modern algorithms in data science, including text mining, recommender systems (think Netflix and Amazon), image processing, and classification problems. Although the SVD was mathematically discovered in the late 1800s, computers have made the SVD an indispensable tool in computational statistics and data science.

SVD: A fundamental theorem of linear algebra

Mathematically, the singular value decomposition is a fundamental theorem of linear algebra. (You could argue that it is THE fundamental theorem, but Gil Strang names a different result.) The singular value decomposition says that every n x p matrix can be written as the product of three matrices: A = U Σ V^T where

  • U is an orthogonal n x n matrix
  • Σ is a diagonal n x p matrix. In practice, the diagonal elements are ordered so that Σ_ii ≥ Σ_jj for all i < j.
  • V is an orthogonal p x p matrix and V^T represents a matrix transpose.

The SVD represents the essential geometry of a linear transformation. It tells us that every linear transformation is a composition of three fundamental actions. Reading the equation from right to left:

  1. The matrix V represents a rotation or reflection of vectors in the p-dimensional domain.
  2. The matrix Σ represents a linear dilation or contraction along each of the p coordinate directions. If n ≠ p, this step also canonically embeds (or projects) the p-dimensional domain into (or onto) the n-dimensional range.
  3. The matrix U represents a rotation or reflection of vectors in the n-dimensional range.

Thus the SVD specifies that every linear transformation is fundamentally a rotation or reflection, followed by a scaling, followed by another rotation or reflection. The Strang (1993) article about the fundamental theorem of linear algebra includes the following geometric interpretation of the singular value decomposition of a 2 x 2 matrix:

Geometric interpretation of the singular value decomposition (SVD) as the product of a rotation/reflection, followed by a scaling, followed by another rotation/reflection.

The diagram shows that the transformation induced by the matrix A (the long arrow across the top of the diagram) is equivalent to the composition of the three fundamental transformations, namely a rotation, a scaling, and another rotation.

SVD: The fundamental theorem of multivariate data analysis

Because of its usefulness, the singular value decomposition is a fundamental technique for multivariate data analysis. A common goal of multivariate data analysis is to reduce the dimension of the problem by choosing a small linear subspace that captures important properties of the data. The SVD is used in two important dimension-reducing operations:

  • Low-rank approximations: Recall that the diagonal elements of the Σ matrix (called the singular values) in the SVD are computed in decreasing order. The SVD has a wonderful mathematical property: if you choose some integer k ≥ 1 and let D be the diagonal matrix formed by replacing all singular values after the kth by 0, then the matrix U D V^T is the best rank-k approximation to the original matrix A. (A SAS/IML sketch appears after this list.)
  • Principal components analysis: The principal component analysis is usually presented in terms of eigenvectors of a correlation matrix, but you can show that the principal component analysis follows directly from the SVD. (In fact, you can derive the eigenvalue decomposition of a matrix from the SVD.) The principal components of A^T A are the columns of the V matrix; the scores are the columns of U. Dimension reduction is achieved by truncating the number of columns in U, which results in the best rank-k approximation of the data.
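The following SAS/IML statements sketch how the rank-k truncation works. This is a minimal example with a made-up 4 x 3 matrix (it is not data from this article), and the value k = 2 is chosen only for illustration:

proc iml;
A = {1 2 3,
     2 4 6,
     3 6 9,
     1 0 1};                                 /* small example matrix; its rank is 2 */
call svd(U, D, V, A);                        /* A = U*diag(D)*V` */
k = 2;                                       /* number of components to keep */
idx = 1:k;
Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;    /* best rank-k approximation to A */
err = sqrt( ssq(A - Ak) );                   /* Frobenius norm of the error */
print err;                                   /* essentially zero because A has rank 2 */
quit;

For principal component analysis, you would center (and possibly scale) the columns of A before computing the SVD; the columns of V are then the principal component directions.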

Compute the SVD in SAS

In SAS, you can use the SVD subroutine in SAS/IML software to compute the singular value decomposition of any matrix. To save memory, SAS/IML computes a "thin SVD" (or "economical SVD"), which means that the U matrix is an n x p matrix. This is usually what the data analyst wants, and it saves considerable memory in the usual case where the number of observations (n) is much larger than the number of variables (p). Technically speaking, the U for an "economical SVD" is suborthogonal: U^T U is the identity when n ≥ p and U U^T is the identity when n ≤ p.

As with eigenvectors, the columns of U and V are not unique, so be careful if you compare the results in SAS to the results from MATLAB or R. The following example demonstrates a singular value decomposition for a 3 x 2 matrix A. For the full SVD, the U matrix would be a 3 x 3 matrix and Σ would be a 3 x 2 diagonal matrix. For the "economical" SVD, U is 3 x 2 and Σ is a 2 x 2 diagonal matrix, as shown:

proc iml;
A = {0.3062 0.1768,
    -0.9236 1.0997,
    -0.4906 1.3497 };
 
call svd(U, D, V, A);  /* A = U*diag(D)*V` */
print U[f=6.3], D[f=6.3], V[f=6.3];

Notice that, to save memory, only the diagonal elements of the matrix Σ are returned in the vector D. You can explicitly form Σ = diag(D) if desired. Geometrically, the linear transformation that corresponds to V rotates a vector in the domain by about 30 degrees, the matrix D scales it by 2 and 0.5 in the coordinate directions, and U embeds it into a three-dimensional space and applies another rotation.
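As a quick check (not part of the original example), you can continue the PROC IML session to form Σ explicitly, verify the factorization, and confirm that the columns of U and V are orthonormal:

Sigma = diag(D);              /* expand the vector of singular values into a diagonal matrix */
B = U*Sigma*V`;               /* should reproduce A up to rounding error */
maxDiff = max( abs(A - B) );
UtU = U`*U;                   /* 2 x 2 identity: columns of U are orthonormal */
VtV = V`*V;                   /* 2 x 2 identity: columns of V are orthonormal */
print maxDiff, UtU[f=6.3], VtV[f=6.3];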

My next blog post shows how the SVD enables you to reduce the dimension of a data matrix by using a low-rank approximation, which has applications to image compression and de-noising.

The post The singular value decomposition: A fundamental technique in multivariate data analysis appeared first on The DO Loop.

8月 232017
 

All statisticians are familiar with the classical arithmetic mean. Some statisticians are also familiar with the geometric mean. Whereas the arithmetic mean of n numbers is the sum divided by n, the geometric mean of n nonnegative numbers is the nth root of the product of the numbers. The geometric mean is never greater than the arithmetic mean; the two means are equal only when all the numbers are equal.

Did you know that there is a hybrid quantity, called the arithmetic-geometric mean, which is defined by combining the two quantities? The arithmetic-geometric mean is only defined for two positive numbers, x and y. It is defined as the limit of an alternating iterative process:

  • Define a_1 = (x + y)/2 and g_1 = sqrt(x y) to be the arithmetic and geometric means, respectively.
  • Iteratively define a_(n+1) = (a_n + g_n)/2 and g_(n+1) = sqrt(a_n g_n).

Interestingly, this process always converges. The number that it converges to is called the arithmetic-geometric mean of x and y, which is denoted by AGM(x,y). The AGM is between the geometric mean and the arithmetic mean. I am not aware of a multivariate generalization of the AGM function.

In SAS you can use the SAS/IML language or PROC FCMP to create a user-defined function that computes the arithmetic-geometric mean. Since the AGM is not a vector operation, I show the PROC FCMP approach:

proc fcmp outlib=work.funcs.MathFuncs;
function AGM(x, y);
   if x<0 | y<0 then return( . );
   epsilon = 1e-10;               /* precision of computation */
   a = mean(x, y);   
   g = geomean(x,y);   
   do while (abs(a - g) > epsilon);
      a_new = mean(a, g);   
      g_new = geomean(a, g);   
      a = a_new; 
      g = g_new;
   end;
   return( mean(a, g) );
endsub;
quit;
 
/* test the function */
options cmplib=work.funcs;  /* define location of function */
data _NULL_;
   agm = AGM(1, 0.8); 
   put agm 18.14;
run;
   0.89721143211504

An example of calling the new AGM function is shown for the two numbers 1 and 0.8. The arithmetic mean of 1 and 0.8 is 0.9. The geometric mean of 1 and 0.8 is sqrt(0.8) = 0.8944. The computation shows that the arithmetic-geometric mean is 0.8972, which is between the arithmetic and geometric means.
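Because the article mentions that you can also use the SAS/IML language, here is a minimal SAS/IML sketch (not the author's implementation) whose logic mirrors the PROC FCMP function:

proc iml;
start AGM(x, y);
   if x<0 | y<0 then return( . );
   a = (x + y)/2;                 /* arithmetic mean */
   g = sqrt(x*y);                 /* geometric mean  */
   do while (abs(a - g) > 1e-10); /* iterate until the two means agree */
      t = (a + g)/2;
      g = sqrt(a*g);
      a = t;
   end;
   return( (a + g)/2 );
finish;
 
agm = AGM(1, 0.8);
print agm[format=18.14];          /* approximately 0.89721143, which matches the FCMP result */
quit;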

The following program evaluates the AGM function on a uniform grid and displays a contour plot of the result. Note that AGM(x, x) = x for any nonnegative value of x.

data grid;
do x = 0 to 5 by 0.1;
   do y = 0 to 5 by 0.1;
      agm = AGM(x, y);
      output;
   end;
end;
run;
 
/* see http://blogs.sas.com/content/iml/2012/07/02/create-a-contour-plot-in-sas.html */
proc sgrender data=grid template=ContourPlotParm;
dynamic _TITLE="Graph of Arithmetic-Geometric Mean AGM(x,y)"
        _X="x" _Y="y" _Z="AGM";
run;
Graph of the arithmetic-geometric mean

So, what is the arithmetic-geometric mean good for? The AGM seems mainly useful in applied mathematics and number theory for computing elliptic functions and elliptic integrals. I am not aware of any applications to statistics, perhaps because the AGM is not defined for a sample that contains n elements when n > 2.
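For example, a classical result of Gauss relates the AGM to the complete elliptic integral of the first kind: K(k) = π / (2 AGM(1, sqrt(1 - k^2))). The following DATA step is a sketch of that computation; it is not from the original article, and it assumes that the AGM function above has been compiled and that the CMPLIB option points to it:

data _NULL_;
   k = 0.5;                                            /* modulus of the elliptic integral */
   ellipK = constant("pi") / (2*AGM(1, sqrt(1 - k**2)));
   put ellipK= 8.5;                                    /* approximately 1.68575 */
run;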

Although the AGM function is not used in statistics, the iterative algorithm that defines the AGM is a technique that I have seen often in computational statistics. When a researcher wants two conditions to hold simultaneously, one possible method is to alternately solve for each condition and prove that this iterative process converges to a solution that satisfies both conditions simultaneously. The iterative process can be an effective way to implement an optimization problem if each sub-optimization is simple.

In multivariate statistics, this technique is used in the method of alternating least squares, which is used in several SAS procedures such as PROC TRANSREG, VARCLUS, and MDS. In linear algebra, this technique is used in the method of alternating projections. I have used this technique to compute the closest correlation matrix to an arbitrary matrix.

One last interesting fact. There is another kind of mean called the harmonic mean. You can compute the arithmetic-harmonic mean by applying the same iterative algorithm, but replacing the GEOMEAN function with the HARMEAN function. The process converges, and the arithmetic-harmonic mean is equal to the geometric mean! Thus although this two-step definition might seem contrived, it is "natural" in the sense that it produces the geometric mean (of two numbers) from the arithmetic and harmonic means.
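If you want to verify that claim numerically, you can copy the PROC FCMP function, swap GEOMEAN for HARMEAN, and compare the result with the geometric mean. This is a sketch; the function name AHM is mine, not from the article:

proc fcmp outlib=work.funcs.MathFuncs;
function AHM(x, y);               /* arithmetic-harmonic mean */
   if x<=0 | y<=0 then return( . );
   epsilon = 1e-10;               /* precision of computation */
   a = mean(x, y);
   h = harmean(x, y);
   do while (abs(a - h) > epsilon);
      a_new = mean(a, h);
      h_new = harmean(a, h);
      a = a_new;
      h = h_new;
   end;
   return( mean(a, h) );
endsub;
quit;
 
options cmplib=work.funcs;
data _NULL_;
   ahm = AHM(1, 0.8);
   gm  = geomean(1, 0.8);         /* sqrt(0.8) = 0.89442719... */
   put ahm= 18.14 gm= 18.14;      /* the two values agree */
run;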

The post The arithmetic-geometric mean appeared first on The DO Loop.