Reshma Saujani, of Girls Who Code, doesn’t have the background you’d expect for the person leading an organization whose mission is to inspire, educate and equip young women with computing skills to pursue 21st-century opportunities. She’s not a coder or computer science graduate, and she grew up terrified of math [...]
In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here's Randall:
For a number of years I and several of my colleagues have been teaching R to beginners using an approach that includes a combination of

- the lattice package for graphics,
- several functions from the stats package for modeling (e.g., lm() and t.test()), and
- the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.

Key to this approach is a common template shared by these operations:

goal ( y ~ x , data = mydata, ... )
Many data analysis operations can be executed by filling in the template's four pieces (goal, y, x, and mydata) appropriately for the desired task. This allows students to become fluent quickly with a powerful, coherent toolkit for data analysis.
Trouble in paradise
As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into sub-plots (facets) is easy, but the labeling of such plots is less convenient (and takes more space) than in the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing.
But incorporating ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a "Less Volume, More Creativity" approach, ggplot2 is tough to justify.
Enter ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that it provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics).
Plots in ggformula are created with functions whose names begin with the prefix gf_. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post.
To overlay two layers, we insert the operator %>% (commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.
The ggformula package provides two ways to create facets. The first uses | very much like lattice does.
Notice that the gf_lm() layer inherits information from the gf_points() layer in these plots, saving some typing when the information is the same in multiple layers.
The second way uses gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired.
ggformula also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function, and there is no need to switch between %>% and + when moving from data transformations to plot operations.
ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactical structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail.
How do the North American amusement parks compare in popularity? If this question were to come up during a lunch discussion, I bet someone would pull out their smartphone and go to Wikipedia for the answer. But is Wikipedia the definitive source? How can we tell if Wikipedia is wrong? [...]
The post Amusement park attendance (could Wikipedia be wrong?!?) appeared first on SAS Learning Post.
Pearson's correlation measures the linear association between two variables. Because the correlation is bounded in the interval [-1, 1], the sampling distribution for highly correlated variables is highly skewed. Even for bivariate normal data, the skewness makes it challenging to estimate confidence intervals for the correlation, to run one-sample hypothesis tests ("Is the correlation equal to 0.5?"), and to run two-sample hypothesis tests ("Do these two samples have the same correlation?").
In 1921, R. A. Fisher studied the correlation of bivariate normal data and discovered a wonderful transformation (shown to the right) that converts the skewed distribution of the sample correlation (r) into a distribution that is approximately normal. Furthermore, whereas the variance of the sampling distribution of r depends on the correlation, the variance of the transformed distribution is independent of the correlation. The transformation is called Fisher's z transformation. This article describes Fisher's z transformation and shows how it transforms a skewed distribution into a normal distribution.
The distribution of the sample correlation
The following graph shows the sampling distribution of the correlation coefficient for bivariate normal samples of size 20 for four values of the population correlation, rho (ρ). You can see that the distributions are very skewed when the correlation is large in magnitude.
The graph was created by using simulated bivariate normal data as follows:
- For rho=0.2, generate M random samples of size 20 from a bivariate normal distribution with correlation rho. (For this graph, M=2500.)
- For each sample, compute the Pearson correlation.
- Plot a histogram of the M correlations.
- Overlay a kernel density estimate on the histogram and add a reference line to indicate the correlation in the population.
- Repeat the process for rho=0.4, 0.6, and 0.8.
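The steps above can be sketched in Python (my translation for illustration; the seed, and the use of numpy rather than the author's tools, are my choices, not from the post):

```python
import numpy as np

# Approximate the sampling distribution of r for bivariate normal samples
# of size N = 20, using M = 2500 samples for each population correlation.
rng = np.random.default_rng(1)
N, M = 20, 2500
skews, stds = {}, {}

for rho in (0.2, 0.4, 0.6, 0.8):
    cov = [[1.0, rho], [rho, 1.0]]
    r = np.array([
        np.corrcoef(rng.multivariate_normal([0, 0], cov, size=N),
                    rowvar=False)[0, 1]
        for _ in range(M)
    ])
    # Sample skewness: grows in magnitude (more negative) as rho increases.
    skews[rho] = np.mean((r - r.mean())**3) / r.std()**3
    stds[rho] = r.std()
    print(f"rho={rho}: std={stds[rho]:.3f}, skewness={skews[rho]:.2f}")
```

Running this shows the two features visible in the histograms: the spread of r shrinks and the left skew deepens as ρ grows.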
The histograms approximate the sampling distribution of the correlation coefficient (for bivariate normal samples of size 20) for the various values of the population correlation. The distributions are not simple. Notice that both the variance and the skewness of the distributions depend on the value of the underlying correlation (ρ) in the population.
Fisher's transformation of the correlation coefficient
Fisher sought to transform these distributions into normal distributions. He proposed the transformation f(r) = arctanh(r), which is the inverse hyperbolic tangent function. The graph of arctanh is shown at the top of this article. Fisher's transformation can also be written as (1/2)log( (1+r)/(1-r) ). This transformation is sometimes called Fisher's "z transformation" because the letter z is used to represent the transformed correlation: z = arctanh(r).
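The two forms of the transformation are identical, as a quick numerical check (in Python, my choice of language for illustration) confirms:

```python
import math

r = 0.6
z_atanh = math.atanh(r)                     # Fisher's z = arctanh(r)
z_log = 0.5 * math.log((1 + r) / (1 - r))   # the equivalent closed form
r_back = math.tanh(z_atanh)                 # inverse transformation recovers r
print(z_atanh, z_log, r_back)
```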
How he came up with that transformation is a mystery to me, but he was able to show that arctanh is a normalizing and variance-stabilizing transformation. That is, when r is the sample correlation for bivariate normal data and z = arctanh(r) then the following statements are true (See Fisher, Statistical Methods for Research Workers, 6th Ed, pp 199-203):
- The distribution of z is approximately normal and "tends to normality rapidly as the sample is increased" (p 201).
- The standard error of z is approximately 1/sqrt(N-3), which is independent of the value of the correlation.
The graph to the right demonstrates these statements. The graph is similar to the preceding panel, except these histograms show the distributions of the transformed correlations z = arctanh(r). In each cell, the vertical line is drawn at the value arctanh(ρ). The curves are normal density estimates with σ = 1/sqrt(N-3), where N=20.
Two features of the transformed variable are apparent. First, the distributions are approximately normal or, to quote Fisher, "come so close to it, even for a small sample..., that the eye cannot detect the difference" (p. 202). Second, the variances of these distributions are constant and independent of the underlying correlation.
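A quick Monte Carlo check of the variance-stabilizing claim (a Python sketch of my own; the seed and the choice ρ = 0.6 are arbitrary):

```python
import numpy as np

# Simulate sample correlations for N = 20, rho = 0.6, then transform them.
rng = np.random.default_rng(2)
N, M, rho = 20, 2500, 0.6
cov = [[1.0, rho], [rho, 1.0]]
r = np.array([
    np.corrcoef(rng.multivariate_normal([0, 0], cov, size=N),
                rowvar=False)[0, 1]
    for _ in range(M)
])
z = np.arctanh(r)
# The empirical standard deviation of z is close to 1/sqrt(N-3) = 0.2425.
print(z.std(), 1 / np.sqrt(N - 3))
```

Repeating this with other values of ρ gives essentially the same standard deviation, which is the point of the transformation.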
Fisher's transformation and confidence intervals
From the graph of the transformed variables, it is clear why Fisher's transformation is important. If you want to test some hypothesis about the correlation, the test can be conducted in the z coordinates where all distributions are normal with a known variance. Similarly, if you want to compute a confidence interval, the computation can be made in the z coordinates and the results "back transformed" by using the inverse transformation, which is r = tanh(z).
You can perform the calculations by applying the standard formulas for normal distributions (see p. 3-4 of Shen and Lu (2006)), but most statistical software provides an option to use the Fisher transformation to compute confidence intervals and to test hypotheses. In SAS, the CORR procedure supports the FISHER option to compute confidence intervals and to test hypotheses for the correlation coefficient.
The following call to PROC CORR computes a sample correlation between the length and width of petals for 50 Iris versicolor flowers. The FISHER option specifies that the output should include confidence intervals based on Fisher's transformation. The RHO0= suboption tests the null hypothesis that the correlation in the population is 0.75. (The BIASADJ= suboption turns off a bias adjustment; a discussion of the bias in the Pearson estimate will have to wait for another article.)
proc corr data=sashelp.iris fisher(rho0=0.75 biasadj=no);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;
The output shows that the Pearson estimate is r=0.787. A 95% confidence interval for the correlation is [0.651, 0.874]. Notice that r is not the midpoint of that interval. In the transformed coordinates, z = arctanh(0.787) = 1.06 is the center of a symmetric confidence interval (based on a normal distribution with standard error 1/sqrt(N-3)). However, the inverse transformation (tanh) is nonlinear, and the right half-interval gets compressed more than the left half-interval.
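That arithmetic can be reproduced by hand. The following Python sketch (my illustration, not part of the PROC CORR output) uses the rounded values r = 0.787 and N = 50 from the post:

```python
import math

r, n = 0.787, 50          # reported estimate and sample size
z = math.atanh(r)         # about 1.06: center of the interval in z coordinates
se = 1 / math.sqrt(n - 3)
lo, hi = z - 1.96 * se, z + 1.96 * se      # symmetric 95% interval for z
r_lo, r_hi = math.tanh(lo), math.tanh(hi)  # back-transform: asymmetric about r
print(round(r_lo, 3), round(r_hi, 3))
```

The back-transformed limits match the reported interval [0.651, 0.874], and the right half-interval (hi minus r) is visibly shorter than the left.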
For the hypothesis test of ρ = 0.75, the output shows that the p-value is 0.574. The data do not provide evidence to reject the hypothesis that ρ = 0.75 at the 0.05 significance level. The computations for the hypothesis test use only the transformed (z) coordinates.
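A rough check of that p-value in Python. Note one assumption on my part: the small-sample term rho0/(2*(n-1)) added to the null value below is my reading of how PROC CORR forms the test statistic, not something stated in the post; with the rounded r = 0.787 the result lands near, but not exactly at, the reported 0.574.

```python
import math
from statistics import NormalDist

r, n, rho0 = 0.787, 50, 0.75
z = math.atanh(r)
# Null value in z coordinates; the rho0/(2*(n-1)) adjustment is assumed.
z0 = math.atanh(rho0) + rho0 / (2 * (n - 1))
stat = (z - z0) * math.sqrt(n - 3)          # standardized test statistic
p = 2 * NormalDist().cdf(-abs(stat))        # two-sided normal p-value
print(round(p, 3))
```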
This article shows that Fisher's "z transformation," which is z = arctanh(r), is a normalizing transformation for the Pearson correlation of bivariate normal samples of size N. The transformation converts the skewed and bounded sampling distribution of r into a normal distribution for z. The standard error of the transformed distribution is 1/sqrt(N-3), which does not depend on the correlation. You can perform hypothesis tests in the z coordinates. You can also form confidence intervals in the z coordinates and use the inverse transformation (r=tanh(z)) to obtain a confidence interval for ρ.
The Fisher transformation is exceptionally useful for small sample sizes because, as shown in this article, the sampling distribution of the Pearson correlation is highly skewed for small N. When N is large, the sampling distribution of the Pearson correlation is approximately normal except for extreme correlations. Although the theory behind the Fisher transformation assumes that the data are bivariate normal, in practice the Fisher transformation is useful as long as the data are not too skewed and do not contain extreme outliers.
The post Fisher's transformation of the correlation coefficient appeared first on The DO Loop.
Healthcare, like many industries, is in the midst of a paradigm shift, says Chris Donovan, Executive Director of Enterprise Information Management & Analytics for the Cleveland Clinic. "Historically, healthcare was really about intervention, and about taking care of you when you were sick and getting you better." That type of care [...]
12 hours: That’s how quickly you can die from sepsis. Oh – you’ve never heard of sepsis? Not surprising. More Americans have heard of Ebola, a nearly non-existent condition in the U.S., than sepsis – a condition that affects more than 1.6 million Americans every year. Sepsis is the body’s [...]
Using data for good: It’s a matter of life and death was published on SAS Voices by Jill Gress
In Part 1 of this blog posting series, we discussed our current viewpoints on marketing attribution and conversion journey analysis in 2017. We concluded on a cliffhanger, and would like to return to our question of which attribution measurement method should we ultimately focus on. As with all difficult questions [...]
Get faster value out of your data by empowering business users to work with data on their own.
The post Discoverability enables self-service data preparation appeared first on The Data Roundtable.
In 2016, Capella University and global analytics leader SAS teamed up to offer the first-ever Capella Women in Analytics Scholarships to encourage women to join the growing field of big data and analytics sciences. The awards provide two women with full scholarships for Capella’s Bachelor of Science in Information Technology [...]
The post Capella and SAS Announce 2017 Scholarships for Women in Analytics appeared first on SAS Analytics U Blog.
Have you heard the term “analytics economy” and wondered what it means? Or maybe you’ve wondered how your organization can use data and analytics to achieve economic gains. Now we have more than just data. We have accessible data, fueled by advances in compute power and connectivity, and interpreted by ever-more powerful [...]
Analytics economy = data + analytics + collaboration was published on SAS Voices by Randy Guard