September 22, 2017
 

Reshma Saujani, of Girls Who Code, doesn’t have the background you’d expect for the person leading an organization whose mission is to inspire, educate and equip young women with computing skills to pursue 21st-century opportunities. She’s not a coder or computer science graduate, and she grew up terrified of math [...]

How 20 girls turned into 40,000 programmers was published on SAS Voices by Anjelica Cummings

September 21, 2017
 
A previous entry (http://sas-and-r.blogspot.com/2017/07/options-for-teaching-r-to-beginners.html) describes an approach to teaching graphics in R that also “get[s] students doing powerful things quickly,” as David Robinson suggested.

In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here's Randall: 

For a number of years, several of my colleagues and I have been teaching R to beginners using an approach that combines
  • the lattice package for graphics,
  • several functions from the stats package for modeling (e.g., lm(), t.test()), and
  • the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.
Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.  

    goal ( y ~ x , data = mydata, ... )


Many data analysis operations can be executed by filling in the four slots of this template (goal, y, x, and mydata) with the information appropriate to the desired task. This allows students to become fluent quickly with a powerful, coherent toolkit for data analysis.

Trouble in paradise
As the earlier post noted, the use of lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex ones or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into sub-plots (facets) is easy, but the labeling of such plots is less convenient (and takes more space) than in the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing.
On the other hand, introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a “Less Volume, More Creativity” approach, ggplot2 is tough to justify.
ggformula: The third-and-a-half way
Danny Kaplan and I recently introduced ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that this provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics).
For simple plots, the only thing that changes is the name of the plotting function. Each of these functions begins with gf_. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post.
We can even overlay these two types of plots to see how they compare. To do so, we simply place what I call the "then" operator (%>%, also commonly called a pipe) between the two layers and adjust the transparency so we can see both where they overlap.

Comparing groups
Groups can be compared either by overlaying multiple groups distinguished by some attribute (e.g., color), or by creating multiple plots arranged in a grid rather than overlaying subgroups in the same space. The ggformula package provides two ways to create these facets. The first uses | much as lattice does. Notice that the gf_lm() layer inherits information from the gf_point() layer in these plots, saving some typing when the information is the same in multiple layers.


The second way adds facets with gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired.
Fitting into the tidyverse workflow
ggformula also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function, and there is no need to switch between %>% and + when moving from data transformations to plot operations.
Summary
The “Less Volume, More Creativity” approach is based on a common formula template that has served well for several years. The arrival of ggformula strengthens this approach by bringing a richer graphical system within reach of beginners without introducing new syntactic structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail.
-- Randall Pruim
September 21, 2017
 

How do the North American amusement parks compare in popularity? If this question were to come up during a lunch discussion, I bet someone would pull out their smartphone and go to Wikipedia for the answer. But is Wikipedia the definitive answer? How can we tell if Wikipedia is wrong? [...]

The post Amusement park attendance (could Wikipedia be wrong?!?) appeared first on SAS Learning Post.

September 20, 2017
 
Fisher's Z Transformation: z = arctanh(r)

Pearson's correlation measures the linear association between two variables. Because the correlation is bounded in [-1, 1], the sampling distribution of the sample correlation is highly skewed when the population correlation is large in magnitude. Even for bivariate normal data, the skewness makes it challenging to estimate confidence intervals for the correlation, to run one-sample hypothesis tests ("Is the correlation equal to 0.5?"), and to run two-sample hypothesis tests ("Do these two samples have the same correlation?").

In 1921, R. A. Fisher studied the correlation of bivariate normal data and discovered a wonderful transformation (shown to the right) that converts the skewed distribution of the sample correlation (r) into a distribution that is approximately normal. Furthermore, whereas the variance of the sampling distribution of r depends on the correlation, the variance of the transformed distribution is independent of the correlation. The transformation is called Fisher's z transformation. This article describes Fisher's z transformation and shows how it transforms a skewed distribution into a normal distribution.

The distribution of the sample correlation

The following graph (click to enlarge) shows the sampling distribution of the correlation coefficient for bivariate normal samples of size 20 for four values of the population correlation, rho (ρ). You can see that the distributions are very skewed when the correlation is large in magnitude.

Sampling distributions of correlation for bivariate normal data of size N=20

The graph was created by using simulated bivariate normal data as follows (an R sketch of the process appears after the list):

  1. For rho=0.2, generate M random samples of size 20 from a bivariate normal distribution with correlation rho. (For this graph, M=2500.)
  2. For each sample, compute the Pearson correlation.
  3. Plot a histogram of the M correlations.
  4. Overlay a kernel density estimate on the histogram and add a reference line to indicate the correlation in the population.
  5. Repeat the process for rho=0.4, 0.6, and 0.8.

The histograms approximate the sampling distribution of the correlation coefficient (for bivariate normal samples of size 20) for the various values of the population correlation. The distributions are not simple. Notice that the variance and the skewness of the distributions depend on the value of the underlying correlation (ρ) in the population.

Fisher's transformation of the correlation coefficient

Fisher sought to transform these distributions into normal distributions. He proposed the transformation f(r) = arctanh(r), which is the inverse hyperbolic tangent function. The graph of arctanh is shown at the top of this article. Fisher's transformation can also be written as (1/2)log( (1+r)/(1-r) ). This transformation is sometimes called Fisher's "z transformation" because the letter z is used to represent the transformed correlation: z = arctanh(r).

How he came up with that transformation is a mystery to me, but he was able to show that arctanh is a normalizing and variance-stabilizing transformation. That is, when r is the sample correlation for bivariate normal data and z = arctanh(r), the following statements are true (see Fisher, Statistical Methods for Research Workers, 6th ed., pp. 199-203):

  • The distribution of z is approximately normal and "tends to normality rapidly as the sample is increased" (p. 201).
  • The standard error of z is approximately 1/sqrt(N-3), which is independent of the value of the correlation.

Transformed sampling distributions of correlation for bivariate normal data of size N=20

The graph to the right demonstrates these statements. The graph is similar to the preceding panel, except these histograms show the distributions of the transformed correlations z = arctanh(r). In each cell, the vertical line is drawn at the value arctanh(ρ). The curves are normal density estimates with σ = 1/sqrt(N-3), where N=20.

Two features of the transformed variables are apparent. First, the distributions are approximately normal or, to quote Fisher, "come so close to it, even for a small sample..., that the eye cannot detect the difference" (p. 202). Second, the variances of these distributions are constant and independent of the underlying correlation.

Fisher's transformation and confidence intervals

From the graph of the transformed variables, it is clear why Fisher's transformation is important. If you want to test some hypothesis about the correlation, the test can be conducted in the z coordinates where all distributions are normal with a known variance. Similarly, if you want to compute a confidence interval, the computation can be made in the z coordinates and the results "back transformed" by using the inverse transformation, which is r = tanh(z).

You can perform the calculations by applying the standard formulas for normal distributions (see pp. 3-4 of Shen and Lu, 2006), but most statistical software provides an option to use the Fisher transformation to compute confidence intervals and to test hypotheses. In SAS, the CORR procedure supports the FISHER option to compute confidence intervals and to test hypotheses for the correlation coefficient.

The following call to PROC CORR computes a sample correlation between the length and width of petals for 50 Iris versicolor flowers. The FISHER option specifies that the output should include confidence intervals based on Fisher's transformation. The RHO0= suboption tests the null hypothesis that the correlation in the population is 0.75. (The BIASADJ= suboption turns off a bias adjustment; a discussion of the bias in the Pearson estimate will have to wait for another article.)

proc corr data=sashelp.iris fisher(rho0=0.75 biasadj=no);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;
Use Fisher's transformation to compute confidence intervals and to test hypotheses in PROC CORR in SAS

The output shows that the Pearson estimate is r=0.787. A 95% confidence interval for the correlation is [0.651, 0.874]. Notice that r is not the midpoint of that interval. In the transformed coordinates, z = arctanh(0.787) = 1.06 is the center of a symmetric confidence interval (based on a normal distribution with standard error 1/sqrt(N-3)). However, the inverse transformation (tanh) is nonlinear, and the right half-interval gets compressed more than the left half-interval.

For the hypothesis test of ρ = 0.75, the output shows that the p-value is 0.574. The data do not provide evidence to reject the hypothesis that ρ = 0.75 at the 0.05 significance level. The computations for the hypothesis test use only the transformed (z) coordinates.

Summary

This article shows that Fisher's "z transformation," which is z = arctanh(r), is a normalizing transformation for the Pearson correlation of bivariate normal samples of size N. The transformation converts the skewed and bounded sampling distribution of r into a normal distribution for z. The standard error of the transformed distribution is 1/sqrt(N-3), which does not depend on the correlation. You can perform hypothesis tests in the z coordinates. You can also form confidence intervals in the z coordinates and use the inverse transformation (r=tanh(z)) to obtain a confidence interval for ρ.

The Fisher transformation is exceptionally useful for small sample sizes because, as shown in this article, the sampling distribution of the Pearson correlation is highly skewed for small N. When N is large, the sampling distribution of the Pearson correlation is approximately normal except for extreme correlations. Although the theory behind the Fisher transformation assumes that the data are bivariate normal, in practice the Fisher transformation is useful as long as the data are not too skewed and do not contain extreme outliers.

You can download the SAS program that creates all the graphs in this article.

The post Fisher's transformation of the correlation coefficient appeared first on The DO Loop.

September 20, 2017
 

Healthcare, like many industries, is in the midst of a paradigm shift, says Chris Donovan, Executive Director of Enterprise Information Management & Analytics for the Cleveland Clinic. "Historically, healthcare was really about intervention, and about taking care of you when you were sick and getting you better." That type of care [...]

How Cleveland Clinic and Cox Automotive are transforming with analytics was published on SAS Voices by Alison Bolen

September 19, 2017
 

12 hours: That’s how quickly you can die from sepsis. Oh – you’ve never heard of sepsis?  Not surprising. More Americans have heard of Ebola, a nearly non-existent condition in the U.S., than sepsis – a condition that affects more than 1.6 million Americans every year. Sepsis is the body’s [...]

Using data for good: It’s a matter of life and death was published on SAS Voices by Jill Gress

September 19, 2017
 

In Part 1 of this blog posting series, we discussed our current viewpoints on marketing attribution and conversion journey analysis in 2017. We concluded on a cliffhanger, and would like to return to our question of which attribution measurement method should we ultimately focus on. As with all difficult questions [...]

Algorithmic marketing attribution and conversion journey analysis [Part 2] was published on Customer Intelligence Blog.

September 19, 2017
 

In 2016, Capella University and global analytics leader SAS teamed up to offer the first-ever Capella Women in Analytics Scholarships to encourage women to join the growing field of big data and analytics sciences. The awards provide two women with full scholarships for Capella’s Bachelor of Science in Information Technology [...]

The post Capella and SAS Announce 2017 Scholarships for Women in Analytics appeared first on SAS Analytics U Blog.

September 18, 2017
 

Have you heard the term “analytics economy” and wondered what it means? Or maybe you’ve wondered how your organization can use data and analytics to achieve economic gains. Now we have more than just data. We have accessible data, fueled by advances in compute power and connectivity, and interpreted by ever-more powerful [...]

Analytics economy = data + analytics + collaboration was published on SAS Voices by Randy Guard