Statistical Thinking

7月 222019
 
Illustration of the 68-95-99.7 rule

Is 4 an extreme value for the standard normal distribution? In high school, students learn the famous 68-95-99.7 rule, which is a way to remember that 99.7 percent of random observation from a normal distribution are within three standard deviations from the mean. For the standard normal distribution, the probability that a random value is bigger than 3 is 0.0013. The probability that a random value is bigger than 4 is even smaller: about 0.00003 or 3 x 10-5.

So, if you draw randomly from a standard normal distribution, it must be very rare to see an extreme value greater than 4, right? Well, yes and no. Although it is improbable that any ONE observation is more extreme than 4, if you draw MANY independent observations, the probability that the sample contains an extreme value increases with the sample size. If p is the probability that one observation is less than 4, then pn is the probability that n independent observations are all less than 4. Thus 1 – pn is the probability that some value is greater than 4. You can use the CDF function in SAS to compute these probabilities in a sample of size n:

/* What is a "large value" of a normal sample? The answer depends on the size of the sample. */
/* Use CDF to find probability that a random value from N(0,1) exceeds 4 */
proc iml;
P_NotGT4 = cdf("Normal", 4);   /* P(x < 4) */
 
/* Probability of an extreme obs in a sample that contains n independent observations */
n = {1, 100, 1000, 10000};     /* sample sizes */
P_NotGT4 = P_NotGT4**n;        /* P(all values are < 4) */
P_GT4 = 1 - P_NotGT4;          /* P(any value is > 4)   */
print n P_NotGT4 P_GT4;
Probability of a value greater than 4 in a sample that contains n independent observations from the standard normal distribution

The third column of the table shows that the probability of seeing an observation whose value is greater than 4 increases with the size of the sample. In a sample of size 10,000, the probability is 0.27, which implies that about one out of every four samples of that size will contain a maximum value that exceeds 4.

The distribution of the maximum in a Gaussian sample: A simulation approach

You can use a simulation to approximate the distribution of the maximum value of a normal sample of size n. For definiteness, choose n = 1,000 and sample from a standard normal distribution N(0,1). The following SAS/IML program simulates 5,000 samples of size n and computes the maximum value of each sample. You can then graph the distribution of the maximum values to understand how the maximum value varies in random samples of size n.

/* What is the distribution of the maximum value in a 
   sample of size n drawn from the standard normal distribution? */
call randseed(12345);
n = 1000;                  /* sample size */
numSim = 5000;             /* number of simulations */
x = j(n, numSim);          /* each column will be a sample */
call randgen(x, "Normal"); /* generate numSim samples */
max = x[<>, ];             /* find max of each sample (column) */
 
Title "Distribution of Maximum Value of a Normal Sample";
title2 "n = 1000";
call histogram(max);
 
/* compute some descriptive statistics */
mean = mean(max`);
call qntl(Q, max`, {0.25 0.5 0.75});
print (mean // Q)[rowname={"Mean" "P25" "Median" "P75"}];
Simulate 5000 samples of size n=1000. Plot the maximum value of each sample.

Based on this simulation, the expected maximum value of a sample of size n = 1,000 is about 3.2. The table indicates that about 25% of the samples have a maximum value that is less than 3. About half have a maximum value less than 3.2. About 75% of the samples have a maximum value that is less than 3.4. The graph shows the distribution of maxima in these samples. The maximum value of a sample ranged from 2.3 to 5.2.

The distribution of a maximum (or minimum) value in a sample is studied in an area of statistics that is known as extreme value theory. It turns out that you can derive the sampling distribution of the maximum of a sample by using the Gumbel distribution, which is also known as the "extreme value distribution of Type 1." You can use the Gumbel distribution to describe the distribution of the maximum of a sample of size n. The Gumbel distribution actually describes the maximum for many distributions, but for simplicity I will only refer to the normal distribution.

The distribution of the maximum in a Gaussian sample: The Gumbel distribution

This section does two things. First, it uses PROC UNIVARIATE to fit the parameters of a Gumbel distribution to the maximum values from the simulated samples. The Gumbel(3.07, 0.29) distribution is the distribution that maximizes the likelihood function for the simulated data. Second, it uses a theoretical formula to show that a Gumbel(3.09, 0.29) distribution is the distribution that models the maximum of a normally distributed sample of size n = 1,000. Thus, the results from the simulation and the theory are similar.

You can write the 5,000 maximum values from the simulation to a SAS data set and use PROC UNIVARIATE to estimate the MLE parameters for the Gumbel distribution, as follows:

create MaxVals var "max"; append; close;
QUIT;
 
/* Fit a Gumbel distribution, which models the distribution of maximum values */
proc univariate data=MaxVals;
   histogram max / gumbel;
   ods select Histogram ParameterEstimates FitQuantiles;
run;

The output from PROC UNIVARIATE shows that the Gumbel(3.07, 0.29) distribution is a good fit to the distribution of the simulated maxima. But where do those parameter values come from? How are the parameters of the Gumbel distribution related to the sample size of the standard normal distribution?

It was difficult to find an online reference that shows how to derive the Gumbel parameters for a normal sample of size n. I finally found a formula in the book Extreme Value Distributions (Kotz and Nadarajah, 2000, p. 9). For any cumulative distribution F that satisfies certain conditions, you can use the quantile function of the distribution to estimate the Gumbel parameters. The result is that the location parameter (μ) is equal to μ = F-1(1-1/n) and the scale parameter (σ) is equal to σ = F-1(1-1/(ne)) - μ, where e is the base of the natural logarithm. The following SAS/IML program uses the (1 – 1/n)th quantile of the normal distribution to derive the Gumbel parameters for a normal sample of size n:

proc iml;
n = 1000;
/* Compute parameters of Gumbel distribution when the sample is normal of size n.
   SAS calls the parameters (mu, sigma). Wikipedia calls them (mu, beta). Other
   references use (alpha, beta). */
mu_n    = quantile("Normal", 1-1/n);                /* location parameter */
sigma_n = quantile("Normal", 1-1/n*exp(-1)) - mu_n; /* scale parameter */
print n mu_n sigma_n;
 
/* what is the mean (expected value) and median of this Gumbel distribution? */
gamma = constant("Euler");             /* Euler–Mascheroni constant = 0.5772157 */
mean = mu_n + gamma*sigma_n;           /* expected value of maximum */
median = mu_n - sigma_n * log(log(2)); /* median of maximum value distribution */
print n mean median;

Notice that these parameters are very close to the MLE estimates from the simulated normal samples. The results tell you that the expected maximum in a standard normal sample is 3.26 and about 50% of samples will have a maximum value of 3.19 or less.

You can use these same formulas to find the expected and median values of the maximum in samples of other sizes:

/* If the sample size is n, what is expected maximum */
n = {1E4, 2E4, 1E5, 2E5, 1E6, 2E6};
mu_n    = quantile("Normal", 1-1/n);                /* location parameter */
sigma_n = quantile("Normal", 1-1/n*exp(-1)) - mu_n; /* scale parameter */
mean = mu_n + gamma*sigma_n;           /* expected value of maximum */
median = mu_n - sigma_n * log(log(2)); /* meadian of maximum value distribution */
print n mean median;

The table shows that you would expect to see a maximum value of 4 in a sample of size 20,000. If there are two million observations in a sample, you would expect to see a maximum value of 5!

You can graph this data over a range of sample sizes. The following graph shows the expected value of the maximum value in a sample of size n (drawn from a standard normal distribution) for large values of n.

You can create similar images for quantiles. The p_th quantile for the Gumbel distribution is q = mu_n - sigma_n log(-log(p)).

So, is 4 an unlikely value for the standard normal distribution? Yes, but for sufficiently large samples that value is likely to be observed. You can use the Gumbel distribution to model the distribution of the maximum in a normal sample of size n to determine how likely it is that the sample contains an extreme value. The larger the sample, the more likely it is to observe an extreme value. Although I do not discuss the general case, the Gumbel distribution can also model the maximum value for samples drawn from some non-normal distributions.

The post Extreme values: What is an extreme value for normally distributed data? appeared first on The DO Loop.

5月 082019
 

At SAS Global Forum 2019, Daymond Ling presented an interesting discussion of binary classifiers in the financial industry. The discussion is motivated by a practical question: If you deploy a predictive model, how can you assess whether the model is no longer working well and needs to be replaced?

Daymond discussed the following three criteria for choosing a model:

  1. Discrimination: The ability of the binary classifier to predict the class of a labeled observation. The area under an ROC curve is one measure of a binary model's discrimination power. In SAS, you can compute the ROC curve for any predictive model.
  2. Accuracy: The ability of the model to estimate the probability of an event. The calibration curve is a graphical indication of a model's accuracy. In SAS, you can compute a calibration curve manually, or you can use PROC LOGISTIC in SAS/STAT 15.1 to automatically compute a calibration curve.
  3. Stability: Point estimates are often used to choose a model, but you should be aware of the variability of the estimates. This is a basic concept in statistics: When choosing between two unbiased estimators, you should usually choose the one that has smaller variance. SAS procedures provide (asymptotic) standard errors for many statistics such as the area under an ROC curve. If you have reason to doubt the accuracy of an asymptotic estimate, you can use bootstrap methods in SAS to estimate the sampling distribution of the statistic.

Estimates of model stability

My article about comparing the ROC curves for predictive models contains two competing models: A model from using PROC LOGISTIC and an "Expert model" that was constructed by asking domain experts for their opinions. (The source of the models is irrelevant; you can use any binary classifier.) You can download the SAS program that produces the following table, which estimates the area under each ROC curve, the standard error, and 90% confidence intervals:

The "Expert" model has a larger Area statistic and a smaller standard error, so you might choose to deploy it as a "champion model."

In his presentation, Daymond asked an important question. Suppose one month later you run the model on a new batch of labeled data and discover that the area under the ROC curve for the new data is only 0.73. Should you be concerned? Does this indicate that the model has degraded and is no longer suitable? Should you cast out this model, re-train all the models (at considerable time and expense), and deploy a new "champion"?

The answer depends on whether you think Area = 0.73 represents a degraded model or whether it can be attributed to sampling variability. The statistic 0.73 is barely more than 1 standard error away from the point estimate, and you will recall that 68% of a normal distribution is within one standard deviation of the mean. From that point of view, the value 0.73 is not surprising. Furthermore, the 90% confidence interval indicates that if you run this model every day for 100 days, you will probably encounter statistics lower than 0.68 merely due to sampling variability. In other words, a solitary low score might not indicate that the model is no longer valid.

Bootstrap estimates of model stability

If "asymptotic normality" makes you nervous, you can use the bootstrap method to obtain estimates of the standard error and the distribution of the Area statistic. The following table summarizes the results of 5,000 bootstrap replications. The results are very close to the asymptotic results in the previous table. In particular, the standard error of the Area statistic is estimated as 0.08 and in 90% of the bootstrap samples, the Area was in the interval [0.676, 0.983]. The conclusion from the bootstrap computation is the same as for the asymptotic estimates: you should expect the Area statistic to bounce around. A value such as 0.73 is not unusual and does not necessarily indicate that the model has degraded.

You can use the bootstrap computations to graphically reveal the stability of the two models. The following comparative histogram shows the bootstrap distributions of the Area statistic for the "Expert" and "Logistic" models. You can see that not only is the upper distribution shifted to the right, but it has less variance and therefore greater stability.

I think Daymond's main points are important to remember. Namely, discrimination and accuracy are important for choosing a model, but understanding the stability of the model (the variation of the estimates) is essential for determining when a model is no longer working well and should be replaced. There is no need to replace a model for a "bad score" if that score is within the range of typical statistical variation.

References

Ling, D. (2019), "Measuring Model Stability", Proceedings of the SAS Global Forum 2019 Conference.

Download the complete SAS program that creates the analyses and graphs in this article.

The post Discrimination, accuracy, and stability in binary classifiers appeared first on The DO Loop.

5月 082019
 

At SAS Global Forum 2019, Daymond Ling presented an interesting discussion of binary classifiers in the financial industry. The discussion is motivated by a practical question: If you deploy a predictive model, how can you assess whether the model is no longer working well and needs to be replaced?

Daymond discussed the following three criteria for choosing a model:

  1. Discrimination: The ability of the binary classifier to predict the class of a labeled observation. The area under an ROC curve is one measure of a binary model's discrimination power. In SAS, you can compute the ROC curve for any predictive model.
  2. Accuracy: The ability of the model to estimate the probability of an event. The calibration curve is a graphical indication of a model's accuracy. In SAS, you can compute a calibration curve manually, or you can use PROC LOGISTIC in SAS/STAT 15.1 to automatically compute a calibration curve.
  3. Stability: Point estimates are often used to choose a model, but you should be aware of the variability of the estimates. This is a basic concept in statistics: When choosing between two unbiased estimators, you should usually choose the one that has smaller variance. SAS procedures provide (asymptotic) standard errors for many statistics such as the area under an ROC curve. If you have reason to doubt the accuracy of an asymptotic estimate, you can use bootstrap methods in SAS to estimate the sampling distribution of the statistic.

Estimates of model stability

My article about comparing the ROC curves for predictive models contains two competing models: A model from using PROC LOGISTIC and an "Expert model" that was constructed by asking domain experts for their opinions. (The source of the models is irrelevant; you can use any binary classifier.) You can download the SAS program that produces the following table, which estimates the area under each ROC curve, the standard error, and 90% confidence intervals:

The "Expert" model has a larger Area statistic and a smaller standard error, so you might choose to deploy it as a "champion model."

In his presentation, Daymond asked an important question. Suppose one month later you run the model on a new batch of labeled data and discover that the area under the ROC curve for the new data is only 0.73. Should you be concerned? Does this indicate that the model has degraded and is no longer suitable? Should you cast out this model, re-train all the models (at considerable time and expense), and deploy a new "champion"?

The answer depends on whether you think Area = 0.73 represents a degraded model or whether it can be attributed to sampling variability. The statistic 0.73 is barely more than 1 standard error away from the point estimate, and you will recall that 68% of a normal distribution is within one standard deviation of the mean. From that point of view, the value 0.73 is not surprising. Furthermore, the 90% confidence interval indicates that if you run this model every day for 100 days, you will probably encounter statistics lower than 0.68 merely due to sampling variability. In other words, a solitary low score might not indicate that the model is no longer valid.

Bootstrap estimates of model stability

If "asymptotic normality" makes you nervous, you can use the bootstrap method to obtain estimates of the standard error and the distribution of the Area statistic. The following table summarizes the results of 5,000 bootstrap replications. The results are very close to the asymptotic results in the previous table. In particular, the standard error of the Area statistic is estimated as 0.08 and in 90% of the bootstrap samples, the Area was in the interval [0.676, 0.983]. The conclusion from the bootstrap computation is the same as for the asymptotic estimates: you should expect the Area statistic to bounce around. A value such as 0.73 is not unusual and does not necessarily indicate that the model has degraded.

You can use the bootstrap computations to graphically reveal the stability of the two models. The following comparative histogram shows the bootstrap distributions of the Area statistic for the "Expert" and "Logistic" models. You can see that not only is the upper distribution shifted to the right, but it has less variance and therefore greater stability.

I think Daymond's main points are important to remember. Namely, discrimination and accuracy are important for choosing a model, but understanding the stability of the model (the variation of the estimates) is essential for determining when a model is no longer working well and should be replaced. There is no need to replace a model for a "bad score" if that score is within the range of typical statistical variation.

References

Ling, D. (2019), "Measuring Model Stability", Proceedings of the SAS Global Forum 2019 Conference.

Download the complete SAS program that creates the analyses and graphs in this article.

The post Discrimination, accuracy, and stability in binary classifiers appeared first on The DO Loop.

4月 172019
 

I think every course in exploratory data analysis should begin by studying Anscombe's quartet. Anscombe's quartet is a set of four data sets (N=11) that have nearly identical descriptive statistics but graphical properties. They are a great reminder of why you should graph your data. You can read about Anscombe's quartet on Wikipedia, or check out a quick visual summary by my colleague Robert Allison. Anscombe's first two examples are shown below:

The Wikipedia article states, "It is not known how Anscombe created his data sets." Really? Creating different data sets that have the same statistics sounds like a fun challenge! As a tribute to Anscombe, I decided to generate my own versions of the two data sets shown in the previous scatter plots. The first data set is linear with normal errors. The second is quadratic (without errors) and has the exact same linear fit and correlation coefficient as the first data.

Generating your own version of Anscombe's data

The Wikipedia article notes that there are "several methods to generate similar data sets with identical statistics and dissimilar graphics," but I did not look at the modern papers. I wanted to try it on my own. If you want to solve the problem on your own, stop reading now!

I used the following approach to construct the first two data sets:

  1. Use a simulation to create linear data with random normal errors: Y1 = 3 + 0.5 X + ε, where ε ~ N(0,1).
  2. Compute the regression estimates (b0 and b1) and the sample correlation for the linear data. These are the target statistics. They define three equations that the second data set must match.
  3. The response variable for the second data set is of the form Y2 = β0 + β1 X + β2 X2. There are three equations and three unknowns parameters, so we can solve a system of nonlinear equations to find β.

From geometric reasoning, there are three different solution for the β parameter: One with β2 > 0 (a parabola that opens up), one with β2 = 0 (a straight line), and one with β2 < 0 (a parabola that opens down). Since Anscombe used a downward-pointing parabola, I will make the same choice.

Construct the first data set

You can construct the first data set in a number of ways, but I choose to construct it randomly. The following SAS/IML statements construct the data, defines a helper function (LinearReg). The program computes the target values, which are the parameter estimates for a linear regression and the sample correlation for the data:

proc iml;
call randseed(12345);
x = T( do(4, 14, 0.2) );                              /* evenly spaced X */
eps = round( randfun(nrow(x), "Normal", 0, 1), 0.01); /* normal error */
y = 3 + 0.5*x + eps;                                  /* linear Y + error */
 
/* Helper function. Return paremater estimates for linear regression. Args are col vectors */
start LinearReg(Y, tX);
   X = j(nrow(tX), 1, 1) || tX;
   b = solve(X`*X, X`*Y);       /* solve normal equation */
   return b;
finish;
 
targetB = LinearReg(y, x);          /* compute regression estimates */
targetCorr = corr(y||x)[2];         /* compute sample correlation */
print (targetB`||targetCorr)[c={'b0' 'b1' 'corr'} F=5.3 L="Target"];

You can use these values as the target values. The next step is to find a parameter vector β such that Y2 = β0 + β1 X + β2 X2 has the same regression line and corr(Y2, X) has the same sample correlation. For uniqueness, set β2 < 0.

Construct the second data set

You can formulate the problem as a system of equations and use the NLPHQN subroutine in SAS/IML to solve it. (SAS supports multiple ways to solve a system of equations.) The following SAS/IML statements define two functions. Given any value for the β parameter, the first function returns the regression estimates and sample correlation between Y2 and X. The second function is the objective function for an optimization. It subtracts the target values from the estimates. The NLPHQN subroutine implements a hybrid quasi-Newton optimization routine that uses least squares techniques to find the β parameter that generates quadratic data that tries to match the target statistics.

/* Define system of simultaneous equations:
   https://blogs.sas.com/content/iml/2018/02/28/solve-system-nonlinear-equations-sas.html */
/* This function returns linear regression estimates (b0, b1) and correlation for a choice of beta */
start LinearFitCorr(beta) global(x);
   y2 = beta[1] + beta[2]*x + beta[3]*x##2;    /* construct quadratic Y */
   b = LinearReg(y2, x);      /* linear fit */
   corr = corr(y2||x)[2];     /* sample corr */
   return ( b` || corr);      /* return row vector */
finish;
 
/* This function returns the vector quantity (beta - target). 
   Find value that minimizes Sum | F_i(beta)-Target+i |^2 */
start Func(beta) global(targetB, targetCorr);
   target = rowvec(targetB) || targetCorr;
   G = LinearFitCorr(beta) - target;
   return( G );              /* return row vector */
finish;
 
/* now let's solve for quadratic parameters so that same 
   linear fit and correlation occur */
beta0 = {-5 1 -0.1};         /* initial guess */
con = {.  .  .,              /* constraint matrix */
       0  .  0};             /* quadratic term is negative */
optn = ncol(beta0) || 0;     /* LS with 3 components || amount of printing */
/* minimize sum( beta[i] - target[i])**2 */
call nlphqn(rc, beta, "Func", beta0, optn) blc=con;  /* LS solution */
print beta[L="Optimal beta"];
 
/* How nearly does the solution solve the problem? Did we match the target values? */
Y2Stats = LinearFitCorr(beta);
print Y2Stats[c={'b0' 'b1' 'corr'} F=5.3];

The first output shows that the linear fit and correlation statistics for the linear and quadratic data are identical (to 3 decimal places). Anscombe would be proud! The second output shows the parameters for the quadratic response: Y2 = 4.955 + 2.566*X - 0.118*X2. The following statements create scatter plots of the new Anscombe-like data:

y2 = beta[1] + beta[2]*x + beta[3]*x##2;
create Anscombe2 var {x y y2};  append;  close;
QUIT;
 
ods layout gridded columns=2 advance=table;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=X y=y;    lineparm x=0 y=3.561 slope=0.447 / clip;
run;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=x y=y2;   lineparm x=0 y=3.561 slope=0.447 / clip;
run;
ods layout end;

Notice that the construction of the second data set depends only on the statistics for the first data. If you modify the first data set, and the second will automatically adapt. For example, you could choose the errors manually instead of randomly, and the statistics for the second data set should still match.

What about the other data sets?

You can create the other data sets similarly. For example:

  • For the data set that consist of points along a line except for one outlier, there are three free parameters. Most of the points fall along the line Y3 = a + b*X and one point is at the height Y3 = c. Therefore, you can run an optimization to solve for the values of (a, b, c) that match the target statistics.
  • I'm not sure how to formulate the requirements for the fourth data set. It looks like all but one point have coordinates (X, Y4), where X is a fixed value and the vertical mean of the cluster is c. The outlier has coordinate (a, b). I'm not sure whether the variance of the cluster is important.

Summary

In summary, you can solve a system of equations to construct data similar to Anscombe's quartet. By using this technique, you can create your own data sets that share descriptive statistics but look very different graphically.

To be fair, the technique I've presented does not enable you to reproduce Anscombe's quartet in its entirety. My data share a linear fit and sample correlation, whereas Anscombe's data share seven statistics!

Anscombe was a pioneer (along with Tukey) in using computation to assist in statistical computations. He was also enamored with the APL language. He introduced computers into the Yale statistics department in the mid-1960s. Since he published his quartet in 1973, it is possible that he used computers to assist his creation of the Anscombe quartet. However he did it, Anscombe's quartet is truly remarkable!

You can download the SAS program that creates the results in this article.

The post Create your own version of Anscombe's quartet: Dissimilar data that have similar statistics appeared first on The DO Loop.

3月 252019
 

An important concept in multivariate statistical analysis is the Mahalanobis distance. The Mahalanobis distance provides a way to measure how far away an observation is from the center of a sample while accounting for correlations in the data. The Mahalanobis distance is a good way to detect outliers in multivariate normal data. It is better than looking at the univariate z-scores of each coordinate because a multivariate outlier does not necessarily have extreme coordinate values.

The geometry of multivariate outliers

In classical statistics, a univariate outlier is an observation that is far from the sample mean. (Modern statistics use robust statistics to determine outliers; the mean is not a robust statistic.) You might assume that an observation that is extreme in every coordinate is also a multivariate outlier, and that is often true. However, the converse is not true: when variables are correlated, you can have a multivariate outlier that is not extreme in any coordinate!

The following schematic diagram gives the geometry of multivariate normal data. The middle of the diagram represents the center of a bivariate sample.

  • The orange elliptical region indicates a region that contains most of the observations. Because the variables are correlated, the ellipse is tilted relative to the coordinate axes.
  • For observations inside the ellipse, their Mahalanobis distance to the sample mean is smaller than some cutoff value. For observations outside the ellipse, their Mahalanobis distance to the sample mean is larger than the cutoff.
  • The green rectangle at the left and right indicate regions where the X1 coordinate is far from the X1 mean.
  • The blue rectangle at the top and bottom indicate regions where the X2 coordinate is far from the X2 mean.
Geometry of multivariate outliers showing the relationship between Mahalanobis distance and univariate outliers. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

The diagram displays two observations, labeled "A" and "B":

  • The observation "A" is inside the ellipse. Therefore, the Mahalanobis distance from "A" to the center is "small." Accordingly, "A" is not identified as a multivariate outlier. However, notice that "A" is a univariate outlier for both the X1 and X2 coordinates!
  • The observation "B" is outside the ellipse. Therefore, the Mahalanobis distance from "B" to the center is relatively large. The observation is classified as a multivariate outlier. However, notice that "B" is not a univariate outlier for either the X1 or X2 coordinates; neither coordinate is far from its univariate mean.

The main point is this: An observation can be a multivariate outlier even though none of its coordinate values are extreme. It is the combination of values which makes an outlier unusual. In terms of Mahalanobis distance, the diagram illustrates that an observation "A" can have high univariate z scores but not have an extremely high Mahalanobis distance. Similarly, an observation "B" can have a higher Mahalanobis distance than "A" even though its z scores are relatively small.

Applications to real data

This article was motivated by a question from a SAS customer. In his data, one observation had a large Mahalanobis distance score but relatively small univariate z scores. Another observation had large z scores but a smaller Mahalanobis distance. He wondered how that was possible. His data contained four variables, but the following two-variable example illustrates his situation:

Geometry of multivariate outliers. The point 'A' has large univariate z scores but the Mahalanobis distance is only about 2.5. The point 'B' has a Mahalanobis distance of 5 and is a multivariate outlier.

The blue markers were simulated from a bivariate normal distribution with μ = (0, 0) and covariance matrix Σ = {16 32.4, 32.4 81}. The red markers were added manually. The observation marked 'B' is a multivariate outlier. The Mahalanobis distance (MD) from 'B' to the center of the sample is about 5 units. (The center is approximately at (0,0).) In contrast, the observation marked 'A' is not a multivariate outlier even though it has extreme values for both coordinates. In fact, the MD from 'A' to the center of the sample is about 2.5, or approximately half the MD of 'B'. The coordinates (x1, x2), standardized coordinates (z1, z2), and MD for both points are shown below:

You can connect the Mahalanobis distance to the probability of a multivariate random normal variable. The squared MD for multivariate normal data is distributed according to a chi-square distribution. For bivariate normal data, the probability that an observation is within t MD units of the center is 1 - exp(-t2/2). Observations like 'A' are not highly unusual. Observations that have MD ≥ 2.5 occur in exp(-2) = 4.4% of random variates from the bivariate normal distribution. In contrast, observations like 'B' are extremely rare. Observations that have MD ≥ 5 occur with probability exp(-25/2) = 3.73E-6. Yes, if you measure in Euclidean distance, 'A' is farther from the center than 'B' is, but the correlation between the variables makes 'A' much more probable. The Mahalanobis distance incorporates the correlation into the calculation of "distance."

Summary and further reading

In summary, things are not always as they appear. For univariate data, an outlier is an extreme observation. It is far from the center of the data. In higher dimensions, we need to account for correlations among variables when we measure distance. The Mahalanobis distance does that, and the examples in this post show that an observation can be "far from the center" (as measured by the Mahalanobis distance) even if none of its individual coordinates are extreme.

The following articles provide more information about Mahalanobis distance and multivariate outliers:

You can download the SAS program that generates the examples and images in this article.

The post The geometry of multivariate versus univariate outliers appeared first on The DO Loop.

3月 182019
 

Statisticians often emphasize the dangers of extrapolating from a univariate regression model. A common exercise in introductory statistics is to ask students to compute a model of population growth and predict the population far in the future. The students learn that extrapolating from a model can result in a nonsensical prediction, such as trillions of people or a negative number of people! The lesson is that you should be careful when you evaluate a model far beyond the range of the training data.

The same dangers exist for multivariate regression models, but they are emphasized less often. Perhaps the reason is that it is much harder to know when you are extrapolating a multivariate model. Interpolation occurs when you evaluate the model inside the convex hull of the training data. Anything else is an extrapolation. In particular, you might be extrapolating even if you score the model at a point inside the bounding box of the training data. This differs from the univariate case in which the convex hull equals the bounding box (range) of the data. In general, the convex hull of a set of points is smaller than the bounding box.

You can use a bivariate example to illustrate the difference between the convex hull of the data and the bounding box for the data, which is the rectangle [Xmin, Xmax] x [Ymin, Ymax]. The following SAS DATA step defines two explanatory variables (X and Y) and one response variable (Z). The SGPLOT procedure shows the distribution of the (X, Y) variables and colors each marker according to the response value:

data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7 
24 78 14  24 76 10  25 63  5
26 62 10  26 94 20  26 63 15
27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
;
 
title "Response Values for Bivariate Data";
proc sgplot data=Sample;
scatter x=x y=y / markerattrs=(size=12 symbol=CircleFilled)
        colorresponse=Z colormodel=AltThreeColorRamp;
xaxis grid; yaxis grid;
run;
Bivariate data. The data are not uniformly distributed in the bounding box of the data.

The data are observed in a region that is approximately triangular. No observations are near the lower-left corner of the plot. If you fit a response surface to this data, it is likely that you would visualize the model by using a contour plot or a surface plot on the rectangular domain [10, 30] x [62, 95]. For such a model, predicted values near the lower-left corner are not very reliable because the corner is far from the data.

In general, you should expect less accuracy when you predict the model "outside" the data (for example, (10, 60)) as opposed to points that are "inside" the data (for example, (25, 70)). This concept is sometimes discussed in courses about the design of experiments. For a nice exposition, see the course notes of Professor Rafi Hafka (2012, p. 49–59) at the University of Florida.

The convex hull of bivariate points

You can use SAS to visualize the convex hull of the bivariate observations. The convex hull is the smallest convex set that contains the observations. The SAS/IML language supports the CVEXHULL function, which computes the convex hull for a set of planar points.

You can represent the points by using an N x 2 matrix, where each row is a 2-D point. When you call the CVEXHULL function, you obtain a vector of N integers. The first few integers are positive and represent the rows of the matrix that comprise the convex hull. The (absolute value of) the negative integers represents the rows that are interior to the convex hull. This is illustrated for the sample data:

proc iml;
use Sample;  read all var {x y} into points;  close;
 
/* get indices of points in the convex hull in counter-clockwise order */
indices = cvexhull( points );
print (indices`)[L="indices"];  /* positive indices are boundary; negative indices are inside */

The output shows that the observation numbers (indices) that form the convex hull are {1, 6, 7, 12, 13, 14, 11}. The other observations are in the interior. You can visualize the interior and boundary points by forming a binary indicator vector that has the value 1 for points on the boundary and 0 for points in the interior. To get the indicator vector in the order of the data, you need to use the SORTNDX subroutine to compute the anti-rank of the indices, as follows:

b = (indices > 0);               /* binary indicator variable for sorted vertices */
call sortndx(ndx, abs(indices)); /* get anti-rank, which is sort index that "reverses" the order */
onBoundary = b[ndx];             /* binary indicator data in original order */
 
title "Convex Hull of Bivariate Data";
call scatter(points[,1], points[,2]) group=onBoundary 
             option="markerattrs=(size=12 symbol=CircleFilled)";
Convex hull: Blue points are on the convex hull and red points are in the interior

The blue points are the boundary of the convex hull whereas the red points are in the interior.

Visualize the convex hull as a polygon

You can visualize the convex hull by forming the polygon that connects the first, sixth, seventh, ..., eleventh observations. You can do this manually by using the POLYGON statement in PROC SGPLOT, which I show in the Appendix section. However, there is an easier way to visualize the convex hull. I previously wrote about SAS/IML packages and showed how to install the polygon package. The polygon package contains a module called PolyDraw, which enables you to draw polygons and overlay a scatter plot.

The following SAS/IML statements extract the positive indices and use them to get the points on the boundary of the convex hull. If the polygon package is installed, you can load the polygon package and visualize the convex hull and data:

hullNdx = indices[loc(b)];         /* get positive indices */
convexHull = points[hullNdx, ];    /* extract the convex hull, in CC order */
 
/* In SAS/IML 14.1, you can use the polygon package to visualize the convex hull:
   https://blogs.sas.com/content/iml/2016/04/27/packages-share-sas-iml-programs-html */
package load polygon;    /* assumes package is installed */
run PolyDraw(convexHull, points||onBoundary) grid={x y}
    markerattrs="size=12 symbol=CircleFilled";

The graph shows the convex hull of the data. You can see that it primarily occupies the upper-right portion of the rectangle. The convex hull shows the interpolation region for regression models. If you evaluate a model outside the convex hull, you are extrapolating. In particular, even though points in the lower left corner of the plot are within the bounding box of the data, they are far from the data.

Of course, if you have 5, 10 or 100 explanatory variables, you will not be able to visualize the convex hull of the data. Nevertheless, the same lesson applies. Namely, when you evaluate the model inside the bounding box of the data, you might be extrapolating rather than interpolating. Just as in the univariate case, the model might predict nonsensical data when you extrapolate far from the data.

Appendix

Packages are supported in SAS/IML 14.1. If you are running an earlier version of SAS, you create the same graph by writing the polygon data and the binary indicator variable to a SAS data set, as follows:

hullNdx = indices[loc(b)];         /* get positive indices */
convexHull = points[hullNdx, ];    /* extract the convex hull, in CC order */
 
/* Write the data and polygon to SAS data sets. Use the POLYGON statement in PROC SGPLOT. */
p = points || onBoundary;       
poly = j(nrow(convexHull), 1, 1) || convexHull;
create TheData from p[colname={x y "onBoundary"}]; append from p;    close;
create Hull from poly[colname={ID cX cY}];         append from poly; close;
quit;
 
data All; set TheData Hull; run;    /* combine the data and convex hull polygon */
 
proc sgplot data=All noautolegend;
   polygon x=cX y=cY ID=id / fill;
   scatter x=x y=y / group=onBoundary markerattrs=(size=12 symbol=CircleFilled); 
   xaxis grid; yaxis grid;
run;

The resulting graph is similar to the one produced by the PolyDraw modules and is not shown.

The post Interpolation vs extrapolation: the convex hull of multivariate data appeared first on The DO Loop.

2月 062019
 

Feature generation (also known as feature creation) is the process of creating new features to use for training machine learning models. This article focuses on regression models. The new features (which statisticians call variables) are typically nonlinear transformations of existing variables or combinations of two or more existing variables. This article argues that a naive approach to feature generation can lead to many correlated features (variables) that increase the cost of fitting a model without adding much to the model's predictive power.

Feature generation in traditional statistics

Feature generation is not new. Classical regression often uses transformations of the original variables. In the past, I have written about applying a logarithmic transformation when a variable's values span several orders of magnitude. Statisticians generate spline effects from explanatory variables to handle general nonlinear relationships. Polynomial effects can model quadratic dependence and interactions between variables. Other classical transformations in statistics include the square-root and inverse transformations.

SAS/STAT procedures provide several ways to generate new features from existing variables, including the EFFECT statement and the "stars and bars" notation. However, an undisciplined approach to feature generation can lead to a geometric explosion of features. For example, if you generate all pairwise quadratic interactions of N continuous variables, you obtain "N choose 2" or N*(N-1)/2 new features. For N=100 variables, this leads to 4950 pairwise quadratic effects!

Generated features might be highly correlated

In addition to the sheer number of features that you can generate, another problem with generating features willy-nilly is that some of the generated effects might be highly correlated with each other. This can lead to difficulties if you use automated model-building methods to select the "important" features from among thousands of candidates.

I was reminded of this fact recently when I wrote an article about model building with PROC GLMSELECT in SAS. The data were simulated: X from a uniform distribution on [-3, 3] and Y from a cubic function of X (plus random noise). I generated the polynomial effects x, x^2, ..., x^7, and the procedure had to build a regression model from these candidates. The stepwise selection method added the x^7 effect first (after the intercept). Later it added the x^5 effect. Of course, the polynomials x^7 and x^5 have a similar shape to x^3, but I was surprised that those effects entered the model before x^3 because the data were simulated from a cubic formula.

After thinking about it, I realized that the odd-degree polynomial effects are highly correlated with each other and have high correlations with the target (response) variable. The same is true for the even-degree polynomial effects. Here is the DATA step that generates 1000 observations from a cubic regression model, along with the correlations between the effects (x1-x7) and the target variable (y):

%let d = 7;
%let xMean = 0;
data Poly;
call streaminit(54321);
array x[&d];
do i = 1 to 1000;
   x[1] = rand("Normal", &xMean);  /* x1 ~ U(-3, 3] */
   do j = 2 to &d;
      x[j] = x[j-1] * x[1];        /* x[i] = x1**i, i = 2..7 */
   end;
   y = 2 - 1.105*x1 - 0.2*x2 + 0.5*x3 + rand("Normal");  /* response is cubic function of x1 */
   output;
end;
drop i j;
run;
 
proc corr data=Poly nosimple noprob;
   var y;
   with x:;
run;
Correlations between the target variable and polynimal effects

You can see from the output that the x^5 and x^7 effects have the highest correlations with the response variable. Because the squared correlations are the R-square values for the regression of Y onto each effect, it makes intuitive sense that these effects are added to the model early in the process. Towards the end of the model-building process, the x^3 effect enters the model and the x^5 and x^7 effects are removed. To the model-building algorithm, these effects have similar predictive power because they are highly correlated with each other, as shown in the following correlation matrix:

proc corr data=Poly nosimple noprob plots(MAXPOINTS=NONE)=matrix(NVAR=ALL);
   var x:;
run;
Correlations among polynomial effects

I've highlighted certain cells in the lower triangular correlation matrix to emphasize the large correlations. Notice that the correlations between the even- and odd-degree effects are close to zero and are not highlighted. The table is a little hard to read, but you can use PROC IML to generate a heat map for which cells are shaded according to the pairwise correlations:

Heat map of correlations among polynomial effects

This checkerboard pattern shows the large correlations that occur between the polynomial effects in this problem. This image shows that many of the generated features do not add much new information to the set of explanatory variables. This can also happen with other transformations, so think carefully before you generate thousands of new features. Are you producing new effects or redundant ones?

Generated features are not independent

Notice that the generated effects are not statistically independent. They are all generated from the same X variable, so they are functionally dependent. In fact, a scatter plot matrix of the data will show that the pairwise relationships are restricted to parametric curves in the plane. The graph of (x, x^2) is quadratic, the graph of (x, x^3) is cubic, the graph of (x^2, x^3) is an algebraic cusp, and so forth. The PROC CORR statement in the previous section created a scatter plot matrix, which is shown below: Again, I have highlighted cells that have highly correlated variables.

Scatter plot matrix that shows the statistical dependencies between polynomial effects

I like this scatter plot matrix for two reasons. First, it is soooo pretty! Second, it visually demonstrates that two variables that have low correlation are not necessarily independent.

Summary

Feature generation is important in many areas of machine learning. It is easy to create thousands of effects by applying transformations and generating interactions between all variables. However, the example in this article demonstrates that you should be a little cautious of using a brute-force approach. When possible, you should use domain knowledge to guide your feature generation. Every feature you generate adds complexity during model fitting, so try to avoid adding a bunch of redundant highly-correlated features.

For more on this topic, see the article "Automate your feature engineering" by my colleague, Funda Gunes. She writes: "The process [of generating new features]involves thinking about structures in the data, the underlying form of the problem, and how best to expose these features to predictive modeling algorithms. The success of this tedious human-driven process depends heavily on the domain and statistical expertise of the data scientist." I agree completely. Funda goes on to discuss SAS tools that can help data scientists with this process, such as Model Studio in SAS Visual Data Mining and Machine Learning. She shows an example of feature generation in her article and in a companion article about feature generation. These tools can help you generate features in a thoughtful, principled, and problem-driven manner, rather than relying on a brute-force approach.

The post Feature generation and correlations among features in machine learning appeared first on The DO Loop.

8月 272018
 

A frequent topic on SAS discussion forums is how to check the assumptions of an ordinary least squares linear regression model. Some posts indicate misconceptions about the assumptions of linear regression. In particular, I see incorrect statements such as the following:

  • Help! A histogram of my variables shows that they are not normal! I need to apply a normalizing transformation before I can run a regression....
  • Before I run a linear regression, I need to test that my response variable is normal....

Let me be perfectly clear: The variables in a least squares regression model do not have to be normally distributed. I'm not sure where this misconception came from, but perhaps people are (mis)remembering an assumption about the errors in an ordinary least squares (OLS) regression model. If the errors are normally distributed, you can prove theorems about inferential statistics such as confidence intervals and hypothesis tests for the regression coefficients. However, the normality-of-errors assumption is not required for the validity of the parameter estimates in OLS. For the details, the Wikipedia article on ordinary least squares regression lists four required assumptions; the normality of errors is listed as an optional fifth assumption.

In practice, analysts often "check the assumptions" by running the regression and then examining diagnostic plots and statistics. Diagnostic plots help you to determine whether the data reveal any deviations from the assumptions for linear regression. Consequently, this article provides a "getting started" example that demonstrates the following:

  1. The variables in a linear regression do not need to be normal for the regression to be valid.
  2. You can use the diagnostic plots that are produced automatically by PROC REG in SAS to check whether the data seem to satisfy some of the linear regression assumptions.

By the way, don't feel too bad if you misremember some of the assumptions of linear regression. Williams, Grajales, and Kurkiewicz (2013) point out that even professional statisticians sometimes get confused.

An example of nonnormal data in regression

Consider this thought experiment: Take any explanatory variable, X, and define Y = X. A linear regression model perfectly fits the data with zero error. The fit does not depend on the distribution of X or Y, which demonstrates that normality is not a requirement for linear regression.

For a numerical example, you can simulate data such that the explanatory variable is binary or is clustered close to two values. The following data shows an X variable that has 20 values near X=5 and 20 values near X=10. The response variable, Y, is approximately five times each X value. (This example is modified from an example in Williams, Grajales, and Kurkiewicz, 2013.) Neither variable is normally distributed, as shown by the output from PROC UNIVARIATE:

/* For n=1..20, X ~ N(5, 1). For n=21..40, X ~ N(10, 1).
   Y = 5*X + e, where e ~ N(0,1) */
data Have;
input X Y @@;
datalines;
 3.60 16.85  4.30 21.30  4.45 23.30  4.50 21.50  4.65 23.20 
 4.90 25.30  4.95 24.95  5.00 25.45  5.05 25.80  5.05 26.05 
 5.10 25.00  5.15 26.45  5.20 26.10  5.40 26.85  5.45 27.90 
 5.70 28.70  5.70 29.35  5.90 28.05  5.90 30.50  6.60 33.05 
 8.30 42.50  9.00 45.50  9.35 46.45  9.50 48.40  9.70 48.30 
 9.90 49.80 10.00 48.60 10.05 50.25 10.10 50.65 10.30 51.20 
10.35 49.80 10.50 53.30 10.55 52.15 10.85 56.10 11.05 55.15 
11.35 55.95 11.35 57.90 11.40 57.25 11.60 57.95 11.75 61.15 
;
 
proc univariate data=Have;
   var x y;
   histogram x y / normal;
run;

There is no need to "normalize" these data prior to performing an OLS regression, although it is always a good idea to create a scatter plot to check whether the variables appear to be linearly related. When you regress Y onto X, you can assess the fit by using the many diagnostic plots and statistics that are available in your statistical software. In SAS, PROC REG automatically produces a diagnostic panel of graphs and a table of fit statistics (such as R-squared):

/* by default, PROC REG creates a FitPlot, ResidualPlot, and a Diagnostics panel */
ods graphics on;
proc reg data=Have;
   model Y = X;
quit;

The R-squared value for the model is 0.9961, which is almost a perfect fit, as seen in the fit plot of Y versus X.

Using diagnostic plots to check the assumptions of linear regression

You can use the graphs in the diagnostics panel to investigate whether the data appears to satisfy the assumptions of least squares linear regression. The panel is shown below (click to enlarge).

The first column in the panel shows graphs of the residuals for the model. For these data and for this model, the graphs show the following:

  • The top-left graph shows a plot of the residuals versus the predicted values. You can use this graph to check several assumptions: whether the model is specified correctly, whether the residual values appear to be independent, and whether the errors have constant variance (homoscedastic). The graph for this model does not show any misspecification, autocorrelation, or heteroscedasticity.
  • The middle-left and bottom-left graphs indicate whether the residuals are normally distributed. The middle plot is a normal quantile-quantile plot. The bottom plot is a histogram of the residuals overlaid with a normal curve. Both these graphs indicate that the residuals are normally distributed. This is evidence that you can trust the p-values for significance and the confidence intervals for the parameters.

In summary, I wrote this article to addresses two points:

  1. To dispel the myth that variables in a regression need to be normal. They do not. However, you should check whether the residuals of the model are approximately normal because normality is important for the accuracy of the inferential portions of linear regression such as confidence intervals and hypothesis tests for parameters. (A colleague mentioned to me that standard errors and hypothesis tests tend to be robust to this assumption, so a modest departure from normality is often acceptable.)
  2. To show that the SAS regression procedures automatically provide many graphical diagnostic plots that you can use to assess the fit of the model and check some assumptions for least squares regression. In particular, you can use the plots to check the independence of errors, the constant variance of errors, and the normality of errors.

References

There have been many excellent books and papers that describe the various assumptions of linear regression. I don't feel a need to rehash what has already been written, In addition to the Wikipedia article about ordinary linear regression, I recommend the following:

The post On the assumptions (and misconceptions) of linear regression appeared first on The DO Loop.

8月 222018
 

A SAS programmer recently asked how to interpret the "standardized regression coefficients" as computed by the STB option on the MODEL statement in PROC REG and other SAS regression procedures. The SAS documentation for the STB option states, "a standardized regression coefficient is computed by dividing a parameter estimate by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor." Although correct, this definition does not provide an intuitive feeling for how to interpret the standardized regression estimates. This article uses SAS to demonstrate how parameter estimates for the original variables are related to parameter estimates for standardized variables. It also derives how regression coefficients change after a linear transformation of the variables.

Proof by example

One of my college physics professors used to smile and say "I will now prove this by example" when he wanted to demonstrate a fact without proving it mathematically. This section uses PROC STDIZE and PROC REG to "prove by example" that the standardized regression estimates for data are equal to the estimates that you obtain by standardizing the data. The following example uses continuous response and explanatory variables, but there is a SAS Usage Note that describes how to standardize classification variables.

The following call to PROC REG uses the STB option to compute the standardized parameter estimates for a model that predicts the weights of 19 students from heights and ages:

proc reg data=Sashelp.Class plots=none;
   Orig: model Weight = Height Age / stb;
   ods select ParameterEstimates;
quit;

The last column is the result of the STB option on the MODEL statement. You can get the same numbers by first standardizing the data and then performing a regression on the standardized variables, as follows:

/* Put original and standardized variables into the output data set.
   Standardized variables have the names 'StdX' where X was the name of the original variable. 
   METHOD=STD standardizes variables according to StdX = (X - mean(X)) / std(X) */
proc stdize data=Sashelp.Class out=ClassStd method=std OPREFIX SPREFIX=Std;
run;
 
proc reg data=ClassStd plots=none;
   Std: model StdWeight = StdHeight StdAge;
   ods select ParameterEstimates;
quit;

The parameter estimates for the standardized data are equal to the STB estimates for the original data. Furthermore, the t values and p-values for the slope parameters are equivalent because these statistics are scale- and translation-invariant. Notice, however, that scale-dependent statistics such as standard errors and covariance of the betas will not be the same for the two analyses.

Linear transformations of random variables

Mathematically, you can use a algebra to understand how linear transformations affect the relationship between two linearly dependent random variables. Suppose X is a random variable and Y = b0 + b1*X for some constants b0 and b1. What happens to this linear relationship if you apply linear transformations to X and Y?

Define new variables U = (Y - cY) / sY and V = (X - cX) / sX. If you solve for X and Y and plug the expressions into the equation for the linear relationship, you find that the new random variables are related by
U = (b0 + b1*cX - cY) / sY + b1*(sX/sY)*V.
If you define B0 = (b0 + b1*cX - cY) / sY and B1 = b1*(sX/sY), then U = B0 + B1*V, which shows that the transformed variables (U and V) are linearly related.

The physical significance of this result is that linear relationships persist no matter what units you choose to measure the variables. For example, if X is measured in inches and Y is measured in pounds, then the quantities remain linearly related if you measure X in centimeters and measure Y in "kilograms from 20 kg."

The effect of standardizing variables on regression estimates

The analysis in the previous section holds for any linear transformation of linearly related random variables. But suppose, in addition, that

  • U and V are standardized versions of Y and X, respectively. That is, cY and cX are the sample means and sY and sX are the sample standard deviations.
  • The parameters b0 and b1 are the regression estimates for a simple linear regression model.

For simple linear regression, the intercept estimate is b0 = cY - b1*cY, which implies that B0 = 0. Furthermore, the coefficient B1 = b1*(sX/sY) is the original parameter estimate "divided by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor," just as the PROC REG documentation states and just as we saw in the PROC REG output in the previous section. Thus the STB option on the MODEL statement does not need to standardize any data! It produces the standardized estimates by setting the intercept term to 0 and dividing the parameter estimates by the ratio of standard deviations, as noted in the documentation. (A similar proof handles multiple explanatory variables.)

Interpretation of the regression coefficients

For the original (unstandardized) data, the intercept estimate predicts the value of the response when the explanatory variables are all zero. The regression coefficients predict the change in the response for one unit change in an explanatory variable. The "change in response" depends on the units for the data, such as kilograms per centimeter.

The standardized coefficients predict the number of standard deviations that the response will change for one STANDARD DEVIATION of change in an explanatory variable. The "change in response" is a unitless quantity. The fact that the standardized intercept is 0 indicates that the predicted value of the (centered) response is 0 when the model is evaluated at the mean values of the explanatory variables.

In summary, standardized coefficients are the parameter estimates that you would obtain if you standardize the response and explanatory variables by centering and scaling the data. A standardized parameter estimate predicts the change in the response variable (in standard deviations) for one standard deviation of change in the explanatory variable.

The post Standardized regression coefficients appeared first on The DO Loop.

7月 112018
 

In a previous article, I showed how to find the intersection (if it exists) between two line segments in the plane. There are some fun problems in probability theory that involve intersections of line segments. One is "What is the probability that two randomly chosen chords of a circle intersect?" This article shows how to create a simulation in SAS to estimate the probability.

An exact answer

For this problem, a "random chord" is defined as the line segment that joins two points chosen at random (with uniform probability) on the circle. The probability that two random chords intersect can be derived by using a simple counting argument. Suppose that you pick four points at random on the circle. Label the points according to their polar angle as p1, p2, p3, and p4. As illustrated by the following graphic, the points are arranged on the circle in one of the following three ways. Consequently, the probability that two random chords intersect is 1/3 because the chords intersect in only one of the three possible arrangements.

Possible arrangements of two random chords in the circle

A simulation in SAS

You can create a simulation to estimate the probability that two random chords intersect. The intersection of two segments can be detected by using either of the two SAS/IML modules in my article about the intersection of line segments. The following simulation generates four angles chosen uniformly at random in the interval (0, 2π). It converts those points to (x,y) coordinates on the unit circle. It then computes whether the chord between the first two points intersects the chord between the third and fourth points. It repeats this process 100,000 times and reports the proportion of times that the chords intersect.

proc iml;
/* Find the intersection between 2D line segments [p1,p2] and [q1,q2].
   This function assumes that the line segments have different slopes (A is nonsingular) */
start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
finish;
 
/* Generate two random chords on the unit circle.
   Simulate the probability that they intersect  */
N = 1e5;
theta = j(N, 4);
call randseed(123456);
call randgen(theta, "uniform", 0, 2*constant('pi'));
intersect = j(N,1,0);
do i = 1 to N;
   t = theta[i,]`;                 /* 4 random U(0, 2*pi) */
   pts = cos(t) || sin(t);         /* 4 pts on unit circle */
   p1 = pts[1,];    p2 = pts[2,];
   q1 = pts[3,];    q2 = pts[4,];
   intersect[i] = all(IntersectSegsSimple(p1, p2, q1, q2) ^= .);
end;
 
prob = mean(intersect);
print prob;

This simulation produces an estimate that is close to the exact probability of 1/3.

Connection to Bertrand's Paradox

This problem has an interesting connection to Bertrand's Paradox. Bertrand's paradox shows the necessity of specifying the process that is used to define the random variables in a probability problem. It turns out that there are multiple ways to define "random chords" in a circle, and the different definitions can lead to different answers to probability questions. See the Wikipedia article for an example.

For the definition of "random chords" in this problem, the density of the endpoints is uniform on the circle. After you make that choice, other distributions are determined. For example, the distribution of the lengths of 1,000 random chords is shown below. The lengths are NOT uniformly distributed! The theoretical density of the chord lengths is overlaid on the distribution of the sample. If you change the process by which chords are randomly chosen (for example, you force the lengths to be uniformly distributed), you might also change the answer to the problem, as shown in Bertrand's Paradox.

Distribution of lengths of random chords in the unit circle, generated by choosing uniformly distributed endpoints

The post The probability that two random chords of a circle intersect appeared first on The DO Loop.