Last year, I wrote more than 100 posts for The DO Loop blog. In previous years, the most popular articles were about SAS programming tips, statistical analysis, and data visualization. But not in 2020. In 2020, when the world was ravaged by the coronavirus pandemic, the most-read articles were related to analyzing and visualizing the tragic loss and suffering of the pandemic. Here are some of the most popular articles from 2020 in several categories.

### The coronavirus pandemic

Relationship between new cases and cumulative cases in COVID-19 infections

### Statistical analysis: Regression

Visualization of collinearity diagnostics

### Data visualization

Many articles in the previous sections included data visualization, but two popular articles are specifically about data visualization:

Reference lines at values of statistical estimates

Many people claim they want to forget 2020, but these articles provide a few tips and techniques that you might want to remember. So, read (or re-read!) these popular articles from 2020. And if you made a resolution to learn something new this year, consider subscribing to The DO Loop so you don't miss a single article!

The post Top posts from *The DO Loop* in 2020 appeared first on The DO Loop.

When you perform a linear regression, you can examine the R-square value, which is a goodness-of-fit statistic that indicates how well the response variable can be represented as a linear combination of the explanatory variables. But did you know that you can also go the other direction? Given a set of explanatory variables and an R-square statistic, you can create a response variable, Y, such that a linear regression of Y on the explanatory variables produces exactly that R-square value.

### The geometry of correlation and least-squares regression

In a previous article, I showed how to compute a vector that has a specified correlation with another vector. You can generalize that situation to obtain a vector that has a specified relationship with a linear subspace that is spanned by multiple vectors.

Recall that the correlation is related to the angle between two vectors by the formula cos(θ) = ρ, where θ is the angle between the vectors and ρ is the correlation coefficient. Therefore, correlation and "angle between" measure similar quantities. It makes sense to define the angle between a vector and a linear subspace as the smallest angle the vector makes with any vector in the subspace. Equivalently, it is the angle between the vector and its (orthogonal) projection onto the subspace.

This is shown graphically in the following figure. The vector z is not in the span of the explanatory variables. The vector w is the projection of z onto the linear subspace. As explained in the previous article, you can find a vector y such that the angle between y and w is θ, where cos(θ) = ρ. Equivalently, the correlation between y and w is ρ.

### Correlation between a response vector and a predicted vector

There is a connection between this geometry and the geometry of least-squares regression. In least-squares regression, the predicted response is the projection of an observed response vector onto the span of the explanatory variables. Consequently, the previous article shows how to simulate an "observed" response vector that has a specified correlation with the predicted response.

For simple linear regression (one explanatory variable), textbooks often point out that the R-square statistic is the square of the correlation between the independent variable, X, and the response variable, Y. So, the previous article enables you to create a response variable that has a specified R-square value with one explanatory variable.

The generalization to multivariate linear regression is that the R-square statistic is the square of the correlation between the predicted response and the observed response. Therefore, you can use the technique in this article to create a response variable that has a specified R-square value in a linear regression model.
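A short derivation sketch shows why this identity holds. It assumes that the model includes an intercept, so that the residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to both $\hat{\mathbf{y}}$ and the $\mathbf{1}$ vector, and the mean of $\hat{\mathbf{y}}$ equals the mean $\bar{y}$ of $\mathbf{y}$. The symbols $\mathbf{e}$ and $\bar{y}$ are notation introduced only for this sketch.

$$
\begin{aligned}
(\mathbf{y}-\bar{y}\mathbf{1})\cdot(\hat{\mathbf{y}}-\bar{y}\mathbf{1})
 &= (\hat{\mathbf{y}}-\bar{y}\mathbf{1}+\mathbf{e})\cdot(\hat{\mathbf{y}}-\bar{y}\mathbf{1})
  = \lVert\hat{\mathbf{y}}-\bar{y}\mathbf{1}\rVert^2 \\
\operatorname{corr}(\mathbf{y},\hat{\mathbf{y}})
 &= \frac{\lVert\hat{\mathbf{y}}-\bar{y}\mathbf{1}\rVert^2}
         {\lVert\mathbf{y}-\bar{y}\mathbf{1}\rVert \, \lVert\hat{\mathbf{y}}-\bar{y}\mathbf{1}\rVert}
  = \frac{\lVert\hat{\mathbf{y}}-\bar{y}\mathbf{1}\rVert}{\lVert\mathbf{y}-\bar{y}\mathbf{1}\rVert} \\
R^2 &= \frac{\lVert\hat{\mathbf{y}}-\bar{y}\mathbf{1}\rVert^2}{\lVert\mathbf{y}-\bar{y}\mathbf{1}\rVert^2}
  = \operatorname{corr}(\mathbf{y},\hat{\mathbf{y}})^2
\end{aligned}
$$

The last line uses the usual definition of R-square as the ratio of the model sum of squares to the corrected total sum of squares.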

To be explicit, suppose you are given explanatory variables X1, X2, ..., Xk, and a correlation coefficient, ρ. The following steps generate a response variable, Y, such that the R-square statistic for the regression of Y onto the explanatory variables is ρ²:

1. Start with an arbitrary guess, z.
2. Use least-squares regression to find w = $\hat{\mathbf{z}}$, which is the projection of z onto the subspace spanned by the explanatory variables and the 1 vector.
3. Use the technique in the previous article to find Y such that corr(Y, w) = ρ.
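For readers who do not have SAS, the steps above can be sketched in plain Python. This is a minimal illustration, not the SAS/IML program from this article. It assumes two small made-up explanatory variables (`x1`, `x2`), and it computes the projection by Gram-Schmidt orthogonalization of the centered columns rather than by solving the normal equations. All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [vi / n for vi in v]

def orthonormal_basis(columns):
    """Gram-Schmidt on the centered columns. Centering removes the
    component along the 1 vector, which accounts for the intercept."""
    basis = []
    for col in columns:
        v = center(col)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        basis.append(unit(v))
    return basis

def project(v, basis):
    """Orthogonal projection of v onto the span of an orthonormal basis."""
    out = [0.0] * len(v)
    for q in basis:
        c = dot(v, q)
        out = [oi + c * qi for oi, qi in zip(out, q)]
    return out

def response_with_r2(columns, rho, guess):
    basis = orthonormal_basis(columns)
    z = center(guess)                       # step 1: (centered) guess
    w = project(z, basis)                   # step 2: centered prediction of z
    w_perp = [zi - wi for zi, wi in zip(z, w)]
    s = math.sqrt(1.0 - rho ** 2)           # step 3: corr(y, w) = rho
    return [rho * a + s * b for a, b in zip(unit(w), unit(w_perp))]

def r_square(columns, y):
    """R-square = (model sum of squares) / (corrected total sum of squares)."""
    yc = center(y)
    fit = project(yc, orthonormal_basis(columns))
    return dot(fit, fit) / dot(yc, yc)

# Made-up data, for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
rho = 0.543
rng = random.Random(123)
guess = [rng.gauss(0.0, 1.0) for _ in x1]

y = response_with_r2([x1, x2], rho, guess)
r2 = r_square([x1, x2], y)
print(r2, rho ** 2)   # the two values should agree up to rounding error
```

Because w is the projection of the guess, the guess is always positively correlated with w, so this sketch does not need a sign check.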

### Create a response variable that has a specified R-square value in SAS

The following program shows how to carry out this algorithm in the SAS/IML language:

    proc iml;
    /* Define or load the modules from
       https://blogs.sas.com/content/iml/2020/12/17/generate-correlated-vector.html */
    load module=_all_;
    /* read some data X1, X2, ... into columns of a matrix, X */
    use sashelp.class;
       read all var {"Height" "Weight" "Age"} into X;   /* read data into (X1,X2,X3) */
    close;

    /* Least-squares fit = Project Y onto span(1,X1,X2,...,Xk) */
    start OLSPred(y, _x);
       X = j(nrow(_x), 1, 1) || _x;     /* add an intercept column */
       b = solve(X`*X, X`*y);           /* solve the normal equations */
       yhat = X*b;
       return yhat;
    finish;

    /* specify the desired correlation between Y and \hat{Y}. Equiv: R-square = rho^2 */
    rho = 0.543;

    call randseed(123);
    guess = randfun(nrow(X), "Normal");   /* 1. make random guess */
    w = OLSPred(guess, X);                /* 2. w is in Span(1,X1,X2,...) */
    Y = CorrVec1(w, rho, guess);          /* 3. Find Y such that corr(Y,w) = rho */
    /* optional: you can scale Y any way you want ... */
    /* in regression, R-square is squared correlation between Y and YHat */
    corr = corr(Y||w)[2];
    R2 = rho**2;
    PRINT rho corr R2;

The program uses a random guess to generate a vector Y such that the correlation between Y and the least-squares prediction for Y is exactly 0.543. In other words, if you run a regression model where Y is the response and (X1, X2, X3) are the explanatory variables, the R-square statistic for the model will be ρ² = 0.2948. Let's write the Y variable to a SAS data set and run PROC REG to verify this fact:

    /* Write to a data set, then call PROC REG */
    Z = Y || X;
    create SimCorr from Z[c={Y X1 X2 X3}];
       append from Z;
    close;
    QUIT;

    proc reg data=SimCorr plots=none;
       model Y = X1 X2 X3;
       ods select FitStatistics ParameterEstimates;
    quit;

The "FitStatistics" table that is created by using PROC REG verifies that the R-square statistic is 0.2948, which is the square of the ρ value that was specified in the SAS/IML program. The ParameterEstimates table from PROC REG shows the vector in the subspace that has correlation ρ with Y. It is -1.26382 + 0.04910*X1 - 0.00197*X2 - 0.12016*X3.

### Summary

Many textbooks point out that the R-square statistic in multivariable regression has a geometric interpretation: It is the squared correlation between the response vector and the projection of that vector onto the linear subspace of the explanatory variables (which is the predicted response vector). You can use the program in this article to solve the inverse problem: Given a set of explanatory variables and a correlation, ρ, you can find a response variable for which the R-square statistic is exactly ρ².

The post Create a response variable that has a specified R-square value appeared first on The DO Loop.

'Tis the season for my annual fun Christmas-themed blog! This is the seventh year and my tenth song. I hope you enjoy this 2020 holiday song (to the tune of Rockin' Around the Christmas Tree). Hackin around the Decision Tree at the SAS party hackathon Data science algorithms available [...]

Hackin' around the decision tree was published on SAS Voices by David Pope

It takes a lot of companies collaborating in a lot of new and different ways behind the scenes to make AI work seamlessly on the front lines. We see this all the time in our work on AI. Processing power is one of the most obvious examples – when clients [...]

Do you know that you can create a vector that has a specific correlation with another vector? That is, given a vector, x, and a correlation coefficient, ρ, you can find a vector, y, such that corr(x, y) = ρ. The vectors x and y can have an arbitrary number of elements, n > 2. One application of this technique is to create a scatter plot that shows correlated data for any correlation in the interval (-1, 1). For example, you can create a scatter plot with n points for which the correlation is exactly a specified value, as shown at the end of this article.

The algorithm combines a mixture of statistics and basic linear algebra. The following facts are useful:

• Statistical correlation is based on centered and normalized vectors. When you center a vector, it usually changes the direction of the vector. Therefore, the calculations use centered vectors.
• Correlation is related to the angle between the centered vectors. If the angle is θ, the correlation between the vectors is cos(θ).
• Projection is the key to finding a vector that has a specified correlation. In linear algebra, the projection of a vector w onto a unit vector u is given by the expression (w ⋅ u) u, where w ⋅ u is the dot product.
• Affine transformations do not affect correlation. For any real number, α, and for any β > 0, the vector α + β y has the same correlation with x as y does. For simplicity, the SAS program in this article returns a centered unit vector. You can scale and translate the vector to obtain other solutions.
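These facts are easy to check numerically. The following Python sketch uses only the standard library and two made-up vectors. The helper names (`center`, `unit`, `corr`, `project`) are hypothetical and are not the SAS/IML modules that are defined later in this article.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [vi / n for vi in v]

def corr(x, y):
    """Pearson correlation, computed as the cosine of the angle
    between the centered vectors."""
    return dot(unit(center(x)), unit(center(y)))

def project(w, u):
    """Projection of w onto the unit vector u: (w . u) u"""
    c = dot(w, u)
    return [c * ui for ui in u]

# Made-up data, for illustration only
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]

r = corr(x, y)
theta = math.acos(r)                  # angle between the centered vectors

# An affine transformation a + b*y with b > 0 does not change the correlation
y2 = [100 + 23 * yi for yi in y]
r_affine = corr(x, y2)

# The part of y that remains after projecting onto u is orthogonal to u
u = unit(center(x))
w = project(center(y), u)
w_perp = [a - b for a, b in zip(center(y), w)]

print(r, r_affine, dot(w_perp, u))
```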

### The geometry of a correlated vector

Given a centered vector, u, there are infinitely many vectors that have correlation ρ with u. Geometrically, you can choose any vector on a positive cone in the same direction as u, where the cone has angle θ and cos(θ)=ρ. This is shown graphically in the figure below. The plane marked $\mathbf{u}^{\perp}$ is the orthogonal complement to the vector u. If you extend the cone through the plane, you obtain the cone of vectors that are negatively correlated with u.

One way to obtain a correlated vector is to start with a guess, z. The vector z can be uniquely represented as the sum $\mathbf{z} = \mathbf{w} + \mathbf{w}^{\perp}$, where w is the projection of z onto the span of u, and $\mathbf{w}^{\perp}$ is the projection of z onto the orthogonal complement.

The following figure shows the geometry of the right triangle with angle θ such that cos(θ) = ρ. If you want the vector y to be unit length, you can read off the formula for y from the figure. The formula is
$\mathbf{y} = \rho \mathbf{w} / \lVert\mathbf{w}\rVert + \sqrt{1 - \rho^2} \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert$
In the figure, $\mathbf{v}_1 = \mathbf{w} / \lVert\mathbf{w}\rVert$ and $\mathbf{v}_2 = \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert$.
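As a quick check of the formula, recall that $\mathbf{v}_1$ and $\mathbf{v}_2$ are orthogonal unit vectors, so $\mathbf{v}_1 \cdot \mathbf{v}_2 = 0$. The following lines sketch why y has unit length and makes the required angle with $\mathbf{v}_1$:

$$
\begin{aligned}
\lVert\mathbf{y}\rVert^2 &= \rho^2 \lVert\mathbf{v}_1\rVert^2 + (1-\rho^2)\lVert\mathbf{v}_2\rVert^2
   + 2\rho\sqrt{1-\rho^2}\,(\mathbf{v}_1 \cdot \mathbf{v}_2) = \rho^2 + (1-\rho^2) = 1 \\
\mathbf{y} \cdot \mathbf{v}_1 &= \rho\,(\mathbf{v}_1 \cdot \mathbf{v}_1) + \sqrt{1-\rho^2}\,(\mathbf{v}_2 \cdot \mathbf{v}_1) = \rho
\end{aligned}
$$

Because both vectors have unit length, $\cos(\theta) = \mathbf{y} \cdot \mathbf{v}_1 = \rho$. Note that $\mathbf{v}_1$ equals either u or -u, depending on the sign of the projection of z onto u, so the correlation between y and u is either ρ or -ρ.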

### Compute a correlated vector

It is straightforward to implement this projection in a matrix-vector language such as SAS/IML. The following program defines two helper functions (Center and UnitVec) and uses them to implement the projection algorithm. The function CorrVec1 takes three arguments: the vector x, a correlation coefficient ρ, and an initial guess. The function centers and scales the vectors into the vectors u and z. The vector z is projected onto the span of u. Finally, the function uses trigonometry and the fact that cos(θ) = ρ to return a unit vector that has the required correlation with x.

    /* Given a vector, x, and a correlation, rho, find y such that corr(x,y) = rho */
    proc iml;
    /* center a column vector by subtracting its mean */
    start Center(v);
       return ( v - mean(v) );
    finish;
    /* create a unit vector in the direction of a column vector */
    start UnitVec(v);
       return ( v / norm(v) );
    finish;

    /* Find a vector, y, such that corr(x,y) = rho. The initial guess can be almost
       any vector that is not in span(x), not orthog to span(x), and not in span(1) */
    start CorrVec1(x, rho, guess);
       /* 1. Center the x and z vectors. Scale them to unit length. */
       u = UnitVec( Center(x) );
       z = UnitVec( Center(guess) );

       /* 2. Project z onto the span(u) and the orthog complement of span(u) */
       w = (z`*u) * u;
       wPerp = z - w;

       /* 3. The requirement that cos(theta)=rho results in a right triangle where
             y (the hypotenuse) has unit length and the legs have lengths
             rho and sqrt(1-rho^2), respectively */
       v1 = rho * UnitVec(w);
       v2 = sqrt(1 - rho**2) * UnitVec(wPerp);
       y = v1 + v2;

       /* 4. Check the sign of y`*u. Flip the sign of y, if necessary */
       if sign(y`*u) ^= sign(rho) then
          y = -y;
       return ( y );
    finish;

The purpose of the function is to project the guess onto the green cone in the figure. However, if the guess is in the opposite direction from x, the algorithm will compute a vector, y, that has the opposite correlation. The function detects this case and flips y, if necessary.

The following statements call the function for a vector, x, and request a unit vector that has correlation ρ = 0.543 with x:

    /* Example: Call the CorrVec1 function */
    x = {1,2,3};
    rho = 0.543;
    guess = {0, 1, -1};
    y = CorrVec1(x, rho, guess);
    corr = corr(x||y);
    print x y, corr;

As requested, the correlation coefficient between x and y is 0.543. This process will work provided that the guess satisfies a few mild assumptions. Specifically, the guess cannot be in the span of x or in the orthogonal complement of x. The guess also cannot be a multiple of the 1 vector. Otherwise, the process will work for positive and negative correlations.
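If you want to experiment outside of SAS, the following Python sketch mirrors the steps of the CorrVec1 function by using only the standard library. It is an illustrative translation, not the original module. The name `corr_vec1` and the helper functions are hypothetical, and the inputs are the same x, ρ, and guess as in the example call.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [vi / n for vi in v]

def corr(x, y):
    return dot(unit(center(x)), unit(center(y)))

def corr_vec1(x, rho, guess):
    # 1. Center x and the guess; scale them to unit length
    u = unit(center(x))
    z = unit(center(guess))
    # 2. Project z onto span(u) and onto the orthogonal complement
    c = dot(z, u)
    w = [c * ui for ui in u]
    w_perp = [zi - wi for zi, wi in zip(z, w)]
    # 3. Legs of the right triangle have lengths rho and sqrt(1 - rho^2)
    v1 = [rho * t for t in unit(w)]
    v2 = [math.sqrt(1.0 - rho ** 2) * t for t in unit(w_perp)]
    y = [a + b for a, b in zip(v1, v2)]
    # 4. Flip y if its correlation with x has the wrong sign
    if math.copysign(1.0, dot(y, u)) != math.copysign(1.0, rho):
        y = [-t for t in y]
    return y

x = [1.0, 2.0, 3.0]
guess = [0.0, 1.0, -1.0]
y_pos = corr_vec1(x, 0.543, guess)
y_neg = corr_vec1(x, -0.543, guess)
print(corr(x, y_pos), corr(x, y_neg))
```

For this guess, the projection of z onto u points in the direction opposite to u, so the positive-correlation call exercises the sign flip in step 4.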

The function returns a vector that has unit length and 0 mean. However, you can translate the vector and scale it by any positive quantity without changing its correlation with x, as shown by the following example:

    /* because correlation is a relationship between standardized vectors,
       you can translate and scale Y any way you want */
    y2 = 100 + 23*y;       /* rescale and translate */
    corr = corr(x||y2);    /* the correlation will not change */
    print corr;

When y is a centered unit vector, the vector β*y has L2 norm β. If you want to create a vector whose standard deviation is β, use β*sqrt(n-1)*y, where n is the number of elements in y.
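The scaling rule follows from the definition of the sample standard deviation: a centered vector with unit L2 norm has sum of squares 1, so its standard deviation is 1/sqrt(n-1). The following Python sketch checks both claims on a made-up vector. The variable names are hypothetical, and β = 23 and the shift of 100 are chosen only to match the example above.

```python
import math
import statistics

# Build a centered unit vector from arbitrary made-up values
raw = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0]
m = sum(raw) / len(raw)
centered = [t - m for t in raw]
norm = math.sqrt(sum(t * t for t in centered))
y = [t / norm for t in centered]

n = len(y)
beta = 23.0

# beta*y has L2 norm beta
l2 = math.sqrt(sum((beta * t) ** 2 for t in y))

# 100 + beta*sqrt(n-1)*y has mean 100 and sample standard deviation beta
v = [100 + beta * math.sqrt(n - 1) * t for t in y]
mean_v = statistics.mean(v)
sd_v = statistics.stdev(v)      # sample std dev (divides by n-1)

print(l2, mean_v, sd_v)
```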

### Random vectors with a given correlation

One application of this technique is to create a random vector that has a specified correlation with a given vector, x. For example, in the following program, the x vector contains the heights of 19 students in the Sashelp.Class data set. The program generates a random guess from the standard normal distribution and passes that guess to the CorrVec1 function and requests a vector that has the correlation 0.678 with x. The result is a centered unit vector.

    use sashelp.class;
       read all var {"Height"} into X;
    close;

    rho = 0.678;
    call randseed(123);
    guess = randfun(nrow(x), "Normal");
    y = CorrVec1(x, rho, guess);

    mean = 100;
    std = 23*sqrt(nrow(x)-1);
    v = mean + std*y;
    title "Correlation = 0.678";
    title2 "Random Normal Vector";
    call scatter(X, v) grid={x y};

The graph shows a scatter plot between x and the random vector, v. The correlation in the scatter plot is 0.678. The sample mean of the vector v is 100. The sample standard deviation is 23.

If you make a second call to the RANDFUN function, you can get another random vector that has the same properties. Or you can repeat the process for a range of ρ values to visualize data that have a range of correlations. For example, the following graph shows a panel of scatter plots for ρ = -0.75, -0.25, 0.25, and 0.75. The X variable is the same for each plot. The Y variable is a random vector that was rescaled to have mean 100 and standard deviation 23, as above.

The random guess does not need to be from the normal distribution. You can use any distribution.

### Summary

This article shows how to create a vector that has a specified correlation with a given vector. That is, given a vector, x, and a correlation coefficient, ρ, find a vector, y, such that corr(x, y) = ρ. The algorithm in this article produces a centered vector that has unit length. You can multiply the vector by β > 0 to obtain a vector whose norm is β. You can multiply the vector by β*sqrt(n-1) to obtain a vector whose standard deviation is β.

There are infinitely many vectors that have correlation ρ with x. The algorithm uses a guess to produce a particular vector for y. You can use a random guess to obtain a random vector that has a specified correlation with x.

The post Find a vector that has a specified correlation with another vector appeared first on The DO Loop.

There’s nothing worse than being in the middle of a task and getting stuck. Being able to find quick tips and tricks to help you solve the task at hand, or simply entertain your curiosity, is key to maintaining your efficiency and building everyday skills. But how do you get quick information that’s ALSO engaging? By adding some personality to traditionally routine tutorials, you can learn and may even have fun at the same time. Cue the SAS Users YouTube channel.

With more than 50 personality-filled videos published to date and more than 10,000 hours watched, there’s no shortage of learning going on. Our team of experts loves to share its knowledge and passion (with personal flavor!) to give you solutions to those everyday tasks.

What better way to round out the year than provide a roundup of our most popular videos from 2020? Check out these crowd favorites:

### Looking forward to 2021

We’ve got you covered! SAS will continue to publish videos throughout 2021. Subscribe now to the SAS Users YouTube channel, so you can be notified when we’re publishing new videos. Be on the lookout for some of the following topics:

• Transforming variables in SAS
• Tips for working with SAS Technical Support
• How to use Git with SAS

2020 roundup: SAS Users YouTube channel how to tutorials was published on SAS Users.