July 28, 2017
 

In my last post, I discussed the outstanding experiences I’ve had as an intern at SAS and some of the things that I’ve learned so far. I love working at SAS, and now that I’ve been here for a while, I can confidently say I’m contributing to the success of [...]

The post Interning at SAS: A Summer Adventure - Part 2 appeared first on SAS Analytics U Blog.

July 28, 2017
 

Every year, SAS ranks high on the Fortune 100 Best Companies to Work For list. The rankings are largely based on anonymous feedback from employees, who consistently rank SAS high on everything from executive communications to work-life balance. But SAS isn’t just a great place to work for professionals; it’s an [...]

The post Interning at SAS: A Summer Adventure - Part 1 appeared first on SAS Analytics U Blog.

July 28, 2017
 
I've been reading David Robinson's excellent blog entry "Teach the tidyverse to beginners" (http://varianceexplained.org/r/teach-tidyverse), which argues that a tidyverse approach is the best way to teach beginners.  He summarizes two competing curricula:

1) "Base R first": teach syntax such as $ and [[]], built in functions like ave() and tapply(), and use base graphics

2) "Tidyverse first": start from scratch with pipes (%>%) and leverage dplyr and use ggplot2 for graphics

If I had to choose one of these approaches, I'd also go with 2) ("Tidyverse first"), since it moves us closer to helping our students "think with data" using more powerful tools (see here for my sermon on this topic).


A third way

Of course, there’s a third option that addresses David’s imperative to "get students doing powerful things quickly".  The mosaic package was written to make R easier to use in introductory statistics courses.  The package is part of Project MOSAIC (http://mosaic-web.org), an NSF-funded initiative to integrate statistics, modeling, and computing. A paper outlining the mosaic package's "Less Volume, More Creativity" approach was recently published in the R Journal (https://journal.r-project.org/archive/2017/RJ-2017-024). To his credit, David mentions the mosaic package in a response to one of the comments on his blog.

Less Volume, More Creativity

One of the big ideas in the mosaic package is that students build on the existing formula interface in R as a mechanism to calculate summary statistics, generate graphical displays, and fit regression models. Randy Pruim has dubbed this approach "Less Volume, More Creativity".

While teaching this formula interface involves adding a new learning outcome (what is "Y ~ X"?), the mosaic approach simplifies the calculation of summary statistics by groups and the generation of two- or three-dimensional displays on day one of an introductory statistics course (see, for example, Wang et al., "Data Viz on Day One: bringing big ideas into intro stats early and often" (2017), TISE).

The formula interface also prepares students for more complicated models in R (e.g., logistic regression, classification).

Here's a simple example using the diamonds data from the ggplot2 package. We model the relationships among color (comparing D and J diamonds), the number of carats, and price.

I'll begin with a bit of data wrangling to generate an analytic dataset with just those two colors. (Early in a course I would either hide the next code chunk or make the recoded dataframe accessible to the students to avoid cognitive overload.)  Note that an R Markdown file with the following commands is available for download at https://nhorton.people.amherst.edu/mosaic-blog.Rmd.

library(mosaic)    # also attaches lattice for bwplot()/xyplot()
library(ggplot2)   # provides the diamonds data
library(dplyr)     # pipe and data verbs
recoded <- diamonds %>%
  filter(color == "D" | color == "J") %>%
  mutate(col = as.character(color))

We first calculate the mean price (in US$) for each of the two colors.

mean(price ~ col, data = recoded)
   D    J 
3170 5324

This call is an example of how the formula interface facilitates calculation of a variable's mean for each of the levels of another variable. We see that D color diamonds tend to cost less than J color diamonds.

A useful function in mosaic is favstats(), which provides a rich set of summary statistics (including the sample size and number of missing values) by group.

favstats(price ~ col, data = recoded)
col min   Q1 median   Q3   max mean   sd    n missing
  D 357  911   1838 4214 18693 3170 3357 6775       0
  J 335 1860   4234 7695 18710 5324 4438 2808       0

A similar command can be used to generate side-by-side boxplots. Here we illustrate the use of lattice graphics. (An alternative formula-based graphics system, ggformula, will be the focus of a future post.)

bwplot(col ~ price, data = recoded)


The distributions are skewed to the right (not surprisingly since they are prices). If we wanted to formally compare these sample means we could do so with a two-sample t-test (or in a similar fashion, by fitting a linear model).

t.test(price ~ col, data = recoded)
Welch Two Sample t-test

data:  price by col
t = -20, df = 4000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2336 -1971
sample estimates:
mean in group D mean in group J 
           3170            5324 


msummary(lm(price ~ col, data = recoded))
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3170.0       45.0    70.4   <2e-16 ***
colJ          2153.9       83.2    25.9   <2e-16 ***

Residual standard error: 3710 on 9581 degrees of freedom
Multiple R-squared:  0.0654, Adjusted R-squared:  0.0653 

F-statistic:  670 on 1 and 9581 DF,  p-value: <2e-16

The results from the two approaches are consistent: the group differences are highly statistically significant. We could conclude that J diamonds tend to cost more than D diamonds in the population of all diamonds.

Let's do a quick review of the mosaic modeling syntax to date:

mean(price ~ col)
bwplot(price ~ col)
t.test(price ~ col)
lm(price ~ col)

See the pattern? On a statistical note, it's important to remember that the diamonds were not randomized into colors: this is a found (observational) dataset, so there may be other factors at play. The revised GAISE College report reiterates the importance of multivariate thinking in intro stats.

Moving to three dimensions

Let's continue with the "Less Volume, More Creativity" approach to bring in a third variable: the number of carats in each diamond.

xyplot(price ~ carat, groups = col, auto.key = TRUE, type = c("p", "r"), data = recoded)
We see that controlling for the number of carats, the D color diamonds tend to sell for more than the J color diamonds. We can confirm this by fitting a regression model that controls for both variables (and then displaying the resulting predicted values from this parallel slopes model using plotModel()).
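Such a model might look like the following (a sketch under my own naming; plotModel() is the mosaic function mentioned above):

mod <- lm(price ~ carat + col, data = recoded)
msummary(mod)
plotModel(mod)   # display predicted values from the parallel slopes model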
This is a great example of Simpson's paradox: accounting for the number of carats has yielded opposite results from a model that didn't include carats. If we were to move forward with such an analysis, we'd need to undertake an assessment of our model and verify conditions and assumptions (but for the purpose of this blog entry I'll defer that).
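As an aside, this is also where the formula interface pays off for the more complicated models mentioned earlier. A minimal sketch of a logistic regression (isD is a binary variable I create just for illustration):

recoded <- mutate(recoded, isD = (col == "D"))
logit <- glm(isD ~ carat + price, family = binomial, data = recoded)
msummary(logit)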

Moving beyond mosaic

The revised GAISE College report enunciated the importance of technology when teaching statistics. Many courses still use calculators or web-based applets to incorporate technology into their classes. R is an excellent environment for teaching statistics, but many instructors feel uncomfortable using it (particularly if they feel compelled to teach the $ and [[]] syntax, which many find off-putting). The mosaic approach helps make the use of R feasible for many audiences by keeping things simple.

It's unfortunately true that many introductory statistics courses don't move beyond bivariate relationships (so students may feel paralyzed about what to do about other factors). The mosaic approach has the advantage that it can bring multivariate thinking, modeling, and exploratory data tools together with a single interface (and a modest degree of difficulty in terms of syntax). I've been teaching multiple regression as a descriptive method early in an intro stat course for the past ten years (and it helps to get students excited about material that they haven't seen before).

The mosaic approach also scales well: it's straightforward to teach students dplyr/tidyverse data wrangling by adding in the pipe operator and some key data idioms. (So perhaps the third option should be labeled "mosaic and tidyverse".)

See the following for an example of how favstats() can be replaced by dplyr idioms. 

recoded %>%
  group_by(col) %>%
  summarize(meanval = mean(price, na.rm = TRUE))
col meanval
  D    3170
  J    5324
That being said, I suspect that many students (and instructors) will still use favstats() for simple tasks (e.g., to check sample sizes or check for missing data). I know that I do. But the important thing is that, unlike training wheels, mosaic doesn't hold students back when they want to learn new things.

I'm a big fan of ggplot2, but even Hadley agrees that the existing syntax is not what he wants it to be. While it's not hard to learn to use + to glue together multiple graphics commands and to get your head around aesthetics, teaching ggplot2 adds several additional learning outcomes to a course that's already overly pregnant with them.
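For comparison, here is what the earlier boxplot might look like in ggplot2 (a sketch; same data and variables as above):

ggplot(recoded, aes(x = col, y = price)) +
  geom_boxplot()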


Side note

I would argue that a lot of what is in mosaic should have been in base R (e.g., a formula interface to mean(), a data= option for mean()). Other parts are more focused on teaching (e.g., plotModel(), xpnorm(), and resampling with the do() function).
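To give a flavor of the last of these, here is a sketch of a randomization test for the price difference using do() (the object names are mine; diffmean() and shuffle() are mosaic functions):

obs <- diffmean(price ~ col, data = recoded)                           # observed difference in means
nulldist <- do(1000) * diffmean(price ~ shuffle(col), data = recoded)  # permutation null distribution
prop(~ (abs(diffmean) >= abs(obs)), data = nulldist)                   # approximate p-value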

Closing thoughts

In summary, I argue that the mosaic approach is consistent with the tidyverse. It dovetails nicely with David's "Teach tidyverse" as an intermediate step that may be more accessible for undergraduate audiences without a strong computing background. I'd encourage people to check it out (and let Randy, Danny, and me know if there are ways to improve the package).

Want to learn more about mosaic?  In addition to the R Journal paper referenced above, you can see how we get students using R quickly in the package's "Less Volume, More Creativity" and "Minimal R" vignettes.  We also provide curated examples from commonly used textbooks in the “mosaic resources” vignette and a series of freely downloadable and remixable monographs including The Student’s Guide to R and Start Teaching with R.
July 28, 2017
 

I recently saw an interesting PEW study showing the percent of each state's revenue that came from federal funds. They had some pretty nice graphs ... but just like jell-o, there's always room for more graphs, eh! Let's start with the map. Their map had an informative title, a reasonable gradient [...]

The post Which states rely the most on federal funds? appeared first on SAS Learning Post.


July 26, 2017
 

I use SAS Enterprise Guide every day, and for a wide variety of tasks. As a result, I have a huge collection of project files (EGP files) and SAS program files.

I have always relied on the "recently used" list in the File menu to provide me with quick access to the files I need to open. The File menu keeps two lists: one for project files and one for SAS programs. With a click into either list, you can see a simple list of all of the file names. Hover your cursor over any of the names to see a tooltip with the full path -- very useful in case you have similarly named files in different folders.

By default, the number of recently used items (SAS programs and projects) that the File menu tracks is just 6. That's not nearly as many as I need, so I've always changed that setting to the maximum, which has always been 15. You can find the setting in Tools→Options, General tab.

I recently learned that in SAS Enterprise Guide v7.12 and later, the maximum number of "recent files" to track was raised to 50! With more high-resolution displays in the field, even a long list of recent files can provide a convenient method to save a few clicks. I think 50 is a little high for me (I'll be reaching into last year's files), so I bumped my setting to 30.

Get EG to remember more files for you!

And don't forget the "jump list" in Windows 7 and later! Every program or project file that SAS Enterprise Guide opens is also added to this quick-access list that you can find on your Windows toolbar or Start menu. Here's an example of what that looks like:

jump list in Windows

The post Open recent files with fewer clicks in SAS Enterprise Guide appeared first on The SAS Dummy.

July 26, 2017
 

In a previous blog, I described a few new features related to report and page prompts in SAS Visual Analytics 8.1, namely the ability to configure cascading prompts: Cascading Prompts as Report and Page Prompts.

In this blog, I will cover how to configure prompts (report, page, or report canvas prompts) that use different data sources.

Different Data Sources with overlapping data values

First, you must have two different data sources added to your Visual Analytics report. These data sources must have overlapping values that you wish to prompt on. Not all of the values need to map, but the sources must have some values in common if you wish to use a shared prompt.

In this example, we will prompt for Product Line. Let’s examine the column values:

I've color coded the values that I would like to map together. I see that the only value that matches "out of the box" is Game.

One workaround to get all of the values to match is to create a Custom Category and use that column for the mapping.

In a “real world” scenario, this may not be ideal. The cardinality of the two columns may be so large that you may have to go back to either the source data or ETL job to produce better matching values.

However, if you are using date columns as the mapping columns, things are considerably easier, since year, month, and quarter are standard values that match without extra steps.

Here is my new Custom Category that I will use for my mapping:

Here are my mappings now. I will be using Product Line (New) for the Insight Toys data source moving forward.

Add prompts

There are two different locations where you can add prompts (that is, Control Objects), which means there are two different ways to configure prompts with different data sources:

1. Report and Page Prompts
2. Report Canvas Prompts

Report and Page Prompt configuration for different data sources

For this first example, I will configure a Button Bar object placed in the Page Prompt area to filter two different data sources. For the Button Bar’s Category Role, I will use the data source with the largest available selection, in this case, the Product Line (New) from Insight Toys.

Now let's configure this button bar to filter both data sources. You must activate the button bar by clicking on it, then right-click and select Edit data source mappings.

Then you simply have to pick your source table’s column to map to your target table’s column.

That's it. The mapping is complete. Here is what the report would look like with different selections made for the button bar. Notice that since I used the Insight Toys data source for the Role assignment, it has more values than are available in the Mega Corp data. If a selection is made where nothing matches in Mega Corp, as in the Gift example, then the Mega Corp bar chart is blank.

Report Canvas Prompt configuration for different data sources

In this second example, I am going to use a List Control object within the report canvas to filter two different data sources. Again, I will use the Insight Toys’ Product Line (New) column as the List Role Category assignment since it has the most values.

Now to configure the list to filter both bar charts. Click on the list control object to activate the window. Then select the Actions pane, and use the Add button to select Add filter.

Then select both bar charts as the target of the filter Action.
Next, select the Map data option.

Select the source data’s column to map to the target data’s column. Use the + to add additional column mapping criteria.

Here is how the report would look with a few of the values selected from the list control. You can see that both Mega Corp and Insight Toys display the overlapping values of Product Line, but any Product Line unique to one source, such as Gift, is displayed only on the Insight Toys bar chart.

Now you know how to configure your control objects for multiple data sources. This works no matter how many data sources you add to your report, simply use the Map data option and select the mappings between the source data and target data.

As I mentioned earlier, a frequently used application of mapping prompts for multiple data sources is for date columns. Here is a screenshot of one example using year and month. I also styled the button bar’s selected background and text color to coordinate with the graphs.

 

SAS Visual Analytics 8.1: Configuring prompts with different source data was published on SAS Users.

July 26, 2017
 

“I do not like this modern technology,” said my father-in-law. “It is making people too lazy. Things are too easy now.” He was referring to my grocery order. I was sitting in his kitchen in Reykjavik, Iceland, the day before my return to the United States. I had just explained [...]

Icelandic in-laws and inventory management: digital supply chain innovations was published on SAS Voices by Marcia Walker

July 26, 2017
 
One of the biggest challenges educators face is how to teach statistical thinking integrated with data and computing skills to allow our students to fluidly think with data.  Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. For example, how can one model high earnings as a function of other features that might be available for a customer? How do the results of a decision tree compare to a logistic regression model? How does one assess whether the underlying assumptions of a chosen model are appropriate?  How are the results interpreted and communicated? 


While there are a lot of other useful textbooks and references out there (e.g., R for Data Science, Practical Data Science with R, Intro to Data Science with Python) we saw a need for a book that incorporates statistical and computational thinking to solve real-world problems with data.  The result was Modern Data Science with R, a comprehensive data science textbook for undergraduates that features meaty, real-world case studies integrated with modern data science methods.  (Figure 8.2 above was taken from a case study in the supervised learning chapter.)

Part I (introduction to data science) motivates the book and provides an introduction to data visualization, data wrangling, and ethics. Part II (statistics and modeling) covers fundamental concepts in statistics, supervised learning, unsupervised learning, and simulation. Part III (topics in data science) reviews dynamic visualization, SQL, spatial data, text as data, network statistics, and moving towards big data. A series of appendices cover the mdsr package, an introduction to R, algorithmic thinking, reproducible analysis, multiple regression, and database creation.

We believe that several features of the book are distinctive:
  1. minimal prerequisites: while some background in statistics and computing is ideal, appendices provide an introduction to R, how to write a function, and key statistical topics such as multiple regression
  2. ethical considerations are raised early, to motivate later examples
  3. recent developments in the R ecosystem (e.g., RStudio and the tidyverse) are featured
Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in R/RStudio can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling statistical questions.  


This book is intended to help readers with some background in statistics and modest prior experience with coding develop and practice the appropriate skills to tackle complex data science projects. We've taught a variety of courses using it, ranging from an introduction to data science to a sophomore-level data science course, and have used it as a component of a senior capstone class.

We've made three chapters freely available for download: data wrangling I, data ethics, and an introduction to multiple regression. An instructor's solution manual is available, and we're working to create a series of lab activities (e.g., text as data). (The code to generate the above figure can be found in the supervised learning materials at http://mdsr-book.github.io/instructor.html.)
Modern Data Science with R (book cover)

An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
July 26, 2017
 

A classical problem in elementary probability asks for the expected lengths of line segments that result from randomly selecting k points along a segment of unit length. It is both fun and instructive to simulate such problems. This article uses simulation in the SAS/IML language to estimate solutions to the following problems:

  • Randomly choose a point in the interval (0, 1). The point divides the interval into two segments of length x and 1-x. What is the expected length of the larger (smaller) segment?
  • Randomly choose k points in the interval (0, 1). The points divide the interval into k+1 segments. What is the expected length of the largest (smallest) segment?
  • When k=2, the points divide the interval into three segments. What is the probability that the three segments can form a triangle? This is called the broken-stick problem and is illustrated in the figure to the right.

You can find a discussion and solution to these problems on many websites, but I like the Cut-The-Knot.org website, which includes proofs and interactive Java applets.

Simulate a solution in SAS

You can simulate these problems in SAS by writing a DATA step or a SAS/IML program. I discuss the DATA step at the end of this article. The body of this article presents a SAS/IML simulation and constructs helper modules that solve the general problem. The simulation will do the following:

  1. Generate k points uniformly at random in the interval (0, 1). For convenience, sort the points in increasing order.
  2. Compute the lengths of the k+1 segments.
  3. Find the length of the largest and smallest segments.

In many languages (including the SAS DATA step), you would write a loop that performs these operations for each random sample. You would then estimate the expected length by computing the mean value of the largest segment for each sample. However, in the SAS/IML language, you can use matrices instead of using a loop. Each sample of random points can be held in the column of a matrix. The lengths of the segments can also be held in a matrix. The largest segment for each trial is stored in a row vector.

The following SAS/IML modules help solve the general simulation problem for k random points. Because the points are ordered, the lengths of the segments are the differences between adjacent rows. You can use the DIF function for this computation, but the following program uses the DifOp module to construct a small difference operator, and it uses matrix multiplication to compute the differences.

proc iml;
/* Independently sort column in a matrix.
   See http://blogs.sas.com/content/iml/2011/03/14/sorting-rows-of-a-matrix.html */
start SortCols(A);
   do i = 1 to ncol(A);
      v = A[ ,i];  call sort(v);  A[ ,i] = v; /* get i_th col and sort it */
   end;
finish;
 
/* Generate a random (k x NSim) matrix of points, then sort each column. */
start GenPts(k, NSim);
   x = j(k, NSim);               /* allocate k x NSim matrix */
   call randgen(x, "Uniform");   /* fill with random uniform in (0,1) */
   if k > 1 then run SortCols(x);  
   return x;
finish;
 
/* Return matrix for difference operator.
   See  http://blogs.sas.com/content/iml/2017/07/24/difference-operators-matrices.html */
start DifOp(dim);
   D = j(dim-1, dim, 0);         /* allocate zero matrix */
   n = nrow(D); m = ncol(D);
   D[do(1,n*m, m+1)] = -1;       /* assign -1 to diagonal elements */
   D[do(2,n*m, m+1)] = 1;        /* assign +1 to super-diagonal elements */
   return D;
finish;
 
/* Find lengths of segments formed by k points in the columns of x.
   Assume each column of x is sorted and all points are in (0,1). */
start SegLengths(x);   
   /* append 0 and 1 to top and bottom (respectively) of each column */
   P = j(1, ncol(x), 0) // x // j(1, ncol(x), 1);
   D = DifOp(nrow(P));           /* construct difference operator */
   return D*P;                   /* use difference operator to find lengths */
finish;
 
P = {0.1  0.2  0.3,
     0.3  0.8  0.5,
     0.7  0.9  0.8 };
L = SegLengths(P);
print L[label="Length (k=3)"];
Lengths of segments formed by random points in the unit interval

The table shows the lengths of three different sets of points for k=3. The first column of P corresponds to points at locations {0.1, 0.3, 0.7}. These three points divide the interval [0, 1] into four segments of lengths 0.1, 0.2, 0.4, and 0.3. Similar computations hold for the other columns.

The expected length of the longer of two segments

For k=1, the problem generates a random point in (0, 1) and asks for the expected length of the longer segment. Obviously the expected length is greater than 1/2, and you can read the Cut-The-Knot website to find a proof that shows that the expected length is 3/4 = 0.75.
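For readers who want the calculation itself, here is the one-line derivation (with X uniform on (0,1), the longer segment has length max(X, 1-X)):

\[ E[\max(X,\,1-X)] \;=\; \int_0^{1/2} (1-x)\,dx \;+\; \int_{1/2}^1 x\,dx \;=\; \tfrac{3}{8} + \tfrac{3}{8} \;=\; \tfrac{3}{4}. \]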

The following SAS/IML statements generate one million random points and compute the larger of the segment lengths. The average value of the larger segments is computed and is very close to the expected value:

call randseed(54321);
k = 1;  NSim = 1E6;
x = GenPts(k, NSim);             /* simulations of 1 point dropped onto (0,1) */
L = SegLengths(x);               /* lengths of  segments */
Largest = L[<>, ];               /* max length among the segments */
mean = mean(Largest`);           /* average of the max lengths */
print mean;
Estimate for expected length of longest segment

You might not be familiar with the SAS/IML max subscript operator (<>) and the min subscript operator (><). These operators compute the minimum or maximum values for each row or column in a matrix.

The expected length of the longest of three segments

For k=2, the problem generates two random points in (0, 1) and asks for the expected length of the longest segment. You can also ask for the average shortest length. The Cut-The-Knot website shows that the expected length for the longest segment is 11/18 = 0.611, whereas the expected length of the shortest segment is 2/18 = 0.111.
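These two values are instances of a standard result for uniform spacings (stated here without proof): for n = k+1 segments,

\[ E[\text{shortest}] = \frac{1}{n^2}, \qquad E[\text{longest}] = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{i}, \]

which for n = 3 gives 1/9 = 2/18 and (1/3)(1 + 1/2 + 1/3) = 11/18.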

The following SAS/IML statements simulate choosing two random points on one million unit intervals. The program computes the one million lengths for the resulting longest and shortest segments. Again, the average values of the segments are very close to the expected values:

k = 2;  NSim = 1E6;
x = GenPts(k, NSim);             /* simulations of 2 points dropped onto (0,1) */
L = SegLengths(x);               /* lengths of segments */
maxL = L[<>, ];                  /* max length among the segments */
meanMax = mean(maxL`);           /* average of the max lengths */
minL = L[><, ];                  /* min length among the segments */
meanMin = mean(minL`);           /* average of the min lengths */
print meanMin meanMax;
Estimates for expected lengths of the shortest and longest segments formed by two random points in the unit interval

The broken stick problem

You can use the previous simulation to estimate the broken stick probability. Recall that three line segments can form a triangle provided that they satisfy the triangle inequality: the sum of the two smaller lengths must be greater than the third length. If you randomly choose two points in (0,1), the probability that the resulting three segments can form a triangle is 1/4, which is smaller than what most people would guess.
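To see where the 1/4 comes from, order the two points so that x < y; the three lengths x, y-x, and 1-y form a triangle exactly when each is less than 1/2, that is,

\[ x < \tfrac{1}{2}, \qquad y > \tfrac{1}{2}, \qquad y - x < \tfrac{1}{2}. \]

The region 0 < x < y < 1 has area 1/2 and the favorable subregion has area 1/8, so the probability is (1/8)/(1/2) = 1/4.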

The vectors maxL and minL each contain one million lengths, so it is trivial to compute the vector that contains the middle lengths.

/* what proportion of randomly broken sticks form triangles? */
medL = 1 - maxL - minL;          /* compute middle length */
isTriangle = (maxL <= minL + medL); /* do lengths satisfy triangle inequality? */
prop = mean(isTriangle`);        /* proportion of segments that form a triangle */
print prop;
Estimate for the probability that three random segments form a triangle

As expected, about 0.25 of the simulations resulted in segments that satisfy the triangle inequality.

In conclusion, this article shows how to use the SAS/IML language to solve several classical problems in probability. By using matrices, you can run the simulation by using vectorized computations such as matrix multiplication and finding the minimum or maximum values of columns. (However, I had to use a loop to sort the points. Bummer!)

If you want to try this simulation yourself in the DATA step, I suggest that you transpose the SAS/IML setup. Use arrays to hold the random points and use the CALL SORTN subroutine to sort the points. Use the LARGEST function and the SMALLEST function to compute the largest and smallest elements in an array. Feel free to post your solution (and any other thoughts) in the comments.

The post Random segments and broken sticks appeared first on The DO Loop.