Distribution

October 6, 2016
 

As data analysts, we all try to do the right thing. When there is a choice of statistical distributions to be used for a given application, it’s a natural inclination to try to find the “best” one.

But beware...

Fishing for the best distribution can lead you into a trap. Just because one option appears to be best doesn’t mean that it’s correct! For example, consider this data set:

[Image: distribution]

What is the best distribution we can use to describe this data? JMP can help us answer this question. From the Distribution platform, we can choose to fit a number of common distributions to the data: Normal, Weibull, Gamma, Exponential, and others. To fit all possible continuous distributions to this data in JMP, go to the red triangle hotspot for this variable in the Distribution report, and choose “Continuous Fit > All”. Here is the result:

[Image: fit-all]

JMP has compared 11 potential distributions for this data, and ranked them from best (Gamma) to worst (Exponential). The metric used to perform the ranking is the corrected Akaike Information Criterion (AICc). Lower values of AICc indicate better fit, and so the Gamma distribution is the winner here.
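For reference, the usual textbook definition of AICc is given below. The symbols are notation assumed here rather than taken from the JMP report: \hat{L} is the maximized likelihood of the fitted distribution, k is the number of estimated parameters, and n is the sample size.

```latex
\mathrm{AICc} = -2\ln\hat{L} + 2k + \frac{2k(k+1)}{n-k-1}
```

The last term is a small-sample correction. It shrinks toward zero as n grows, so AICc approaches the ordinary AIC for large samples.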

Here’s the catch

This data set was generated by drawing a random sample of size 50 from a population that is normally distributed with a mean of 50 and a standard deviation of 10. The Normal distribution is the correct answer by definition, but our fishing expedition gave us a misleading result.

How often is there a mismatch like this? One way to approach this question is through simulation. I wrote a small JMP script to draw samples of various sizes from a normally distributed population. I investigated sample sizes of 5, 10, 20, 30, 50, 75, 100, 250, and 500 observations; for each of these, I drew 1,000 independent samples and had JMP compute the fit for all possible continuous distributions. Finally, for each sample I recorded the name of the best-fitting distribution, as measured by AICc. (The JSL script is available in the JMP File Exchange.)
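The idea behind the simulation can be sketched outside JMP as well. The Python below is not the JSL script mentioned above and will not reproduce its percentages. It is a minimal illustration under stated assumptions: only three candidate distributions (Normal, Lognormal, Weibull) instead of JMP's full list, maximum-likelihood fits written by hand, and hypothetical function names throughout.

```python
import math
import random
from collections import Counter


def aicc(loglik, k, n):
    """Corrected AIC for k fitted parameters and n observations (needs n > k + 1)."""
    return -2.0 * loglik + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)


def normal_loglik(x):
    """Maximized Normal log-likelihood (the MLE variance divides by n)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def lognormal_loglik(x):
    """Maximized Lognormal log-likelihood: Normal fit to log(x) plus the Jacobian term."""
    logs = [math.log(v) for v in x]
    return normal_loglik(logs) - sum(logs)


def weibull_loglik(x):
    """Maximized two-parameter Weibull log-likelihood; shape found by bisection."""
    n = len(x)
    logs = [math.log(v) for v in x]
    xmax = max(x)
    z = [v / xmax for v in x]  # rescale for numerical stability; shape is scale-invariant
    logz = [math.log(v) for v in z]
    mean_logz = sum(logz) / n

    def score(k):
        # Profile-likelihood equation for the shape; increasing in k, root is the MLE.
        w = [v ** k for v in z]
        return sum(wi * li for wi, li in zip(w, logz)) / sum(w) - 1.0 / k - mean_logz

    lo, hi = 1e-3, 1e3
    for _ in range(60):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    k = math.sqrt(lo * hi)
    scale = xmax * (sum(v ** k for v in z) / n) ** (1.0 / k)
    # At the MLE, sum((x / scale) ** k) equals n, hence the trailing "- n".
    return n * math.log(k) - n * k * math.log(scale) + (k - 1) * sum(logs) - n


def best_fit(x):
    """Name of the candidate with the smallest AICc (each candidate has 2 parameters)."""
    n = len(x)
    logliks = {
        "Normal": normal_loglik(x),
        "Lognormal": lognormal_loglik(x),
        "Weibull": weibull_loglik(x),
    }
    return min(logliks, key=lambda name: aicc(logliks[name], 2, n))


def simulate(sample_size, n_samples=1000, seed=0):
    """Count how often each candidate wins on samples drawn from Normal(50, 10)."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_samples):
        x = [rng.gauss(50, 10) for _ in range(sample_size)]
        if min(x) <= 0:
            continue  # Lognormal and Weibull require strictly positive data
        wins[best_fit(x)] += 1
    return wins
```

Calling `simulate(50)` returns a `Counter` of how often each candidate had the lowest AICc across 1,000 samples. Because all three candidates have two parameters, the AICc penalty is the same for each, so the ranking here comes down to the maximized likelihood alone.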

The results were quite surprising!

[Image: results]

  • Remember, the correct answer in each case is “Normal”. If our fishing expedition were yielding good results across the board, the line for the Normal distribution would be high and flat, hovering near 100%.
  • Instead, the wrong distribution was chosen with disturbing frequency. For sample sizes under 50, the Normal distribution was not even the most commonly chosen. That honor belongs to the Weibull distribution.
  • For a sample size of 5 observations from a Normal distribution, the correct identification was not made a single time out of 1,000 samples.
  • If you want to have at least a 50% chance of correctly identifying normally distributed data by this method, you’ll need more than 100 observations!
  • Even at a sample size of 500 observations, the likelihood of the Normal distribution being correctly called the best is only about 80%.

The moral of the story

When comparing the fit of different distributions to a data set, don’t assume that the distribution with the smallest AICc is the correct one. What counts is the relative magnitude of the AICc statistics. A rule of thumb (used elsewhere in JMP) is that models whose values of AICc are within 10 units of the “best” one are roughly equivalent.* In our first example above, the Gamma distribution is nominally the best, but its AICc is only 0.2 units lower than that of the Normal distribution. There is no good statistical evidence to choose the Gamma over the Normal.
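The rule of thumb is easy to apply mechanically. The sketch below uses hypothetical function names and made-up AICc values (only the 0.2-unit gap between Gamma and Normal comes from the example above) to list every candidate whose AICc is within 10 units of the smallest.

```python
def aicc_differences(aicc_by_model):
    """Difference between each model's AICc and the smallest AICc in the set."""
    best = min(aicc_by_model.values())
    return {name: value - best for name, value in aicc_by_model.items()}


def roughly_equivalent(aicc_by_model, threshold=10.0):
    """Sorted names of models within `threshold` AICc units of the best one."""
    diffs = aicc_differences(aicc_by_model)
    return sorted(name for name, diff in diffs.items() if diff <= threshold)


# Hypothetical AICc values for illustration only; not taken from the JMP report.
example = {"Gamma": 370.1, "Normal": 370.3, "Exponential": 512.6}
candidates = roughly_equivalent(example)
```

With these illustrative numbers, both Gamma and Normal survive the cut, so the choice between them should rest on subject-matter knowledge rather than on the ranking.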

More generally, as a best practice it is wise to consider only distributions that make sense in the context of the problem. Your own knowledge and expertise are usually the best guides. Don’t choose an exotic distribution that has a slightly better fit over one that makes sense and has a proven track record in your field of work.

*This rule is used to compare models built in the Generalized Regression personality of the Fit Model platform in JMP Pro. See Burnham, K.P. and Anderson, D.R. (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York.

tags: Distribution, Statistics, Tips and Tricks

The post Is that the best (distribution) you've got? appeared first on JMP Blog.

October 3, 2016
 

By now you may have heard that in JMP 13, the most frequently used features of reports created in Graph Builder can be saved as interactive HTML, which can then be viewed using just a web browser.

Getting Graph Builder output to work for the web in JMP 13 involved bringing new features to several graphical elements that had been available in interactive HTML output since JMP 11. Areas and lines can be used to display some of the same information as points but in a different way. Exploring these stacked areas in interactive HTML, you can now see the values along the edge of the area.

[Image: SmartphoneOSArea]

The tooltips for lines display the rows that are included in each point along the line as well as information about the values. Graph Builder gives you the ability to customize various attributes of the lines. The example below combines lines using different drawing styles with annotations and the gray reference ranges to create a rich graph.

[Image: MarriageDivorceRatesLines]

While the most heavily used graph types and options are exported as interactive HTML, the remaining ones are exported as static images. Contour plots are exported as static images; however, if your data is categorical, Graph Builder produces violin plots, which are exported as interactive HTML. Below you can see the close relationship between the violin plot and another Graph Builder element, the box plot.

[Image: IrisViolinPlot]

What if you want to bin data into categories to explore their distribution? There are a number of ways to do this in Graph Builder. The histogram was already available in interactive HTML from the Distribution platform (and as an option in several other JMP platforms), but it can now also be exported to the web after you explore your data by drag and drop in Graph Builder. Below is an example created using Titanic passenger data to examine the distribution of ages.

[Image: TitanicHistogram]

A mosaic plot is used to examine the relationship between two categorical variables. Each cell has an informative tooltip showing the share and number of rows it represents, and selecting a cell selects the same rows in the other linked charts in the report.

[Image: TitanicMosaic]

In JMP, you can use Dashboard Builder to create reports with several types of Graph Builder output on the same page, so people who do not have JMP yet can interactively explore your data. Here, a mosaic plot, bars and histograms are combined to analyze the importance of different goals to schoolchildren.

[Image: GraphBuilderDashboard]

These are just a few examples of the powerful graphs you can create to explore your data in Graph Builder and share with others using interactive HTML. The graphs shown here as well as a few other examples are available as live interactive HTML files to explore on the web at http://www.jmp.com/jmphtml5/, but be sure to try your own Graph Builder creations!

tags: Dashboard, Dashboard Builder, Data Visualization, Distribution, Graph Builder, Interactive HTML, JMP 13

The post Interactive HTML: Lines, mosaic plots and more for Graph Builder appeared first on JMP Blog.

September 4, 2016
 

You might say I love sports. I began swimming at a very early age and participated on swim teams for many years. Gymnastics, volleyball, softball, basketball and even track teams were all part of my life, and I loved playing and competing. So maybe that is why I always love […]

The post Looking at Summer Games data with JMP appeared first on JMP Blog.

July 22, 2016
 

Let's say you are in the Distribution platform in JMP, and you have created a report that you wish to drill down into. Well, the Local Data Filter can help with that. But perhaps you also want to share a portion of the data with a co-worker, and not just […]

The post Video: Subsetting data from a JMP Distribution report appeared first on JMP Blog.

June 8, 2016
 

In a previous post, I wrote how pedigree might be used to help predict outcomes of horse races. In particular, I discussed a metric called the Dosage Index (DI), which appeared to be a leading indicator of success (at least historically). In this post, I want to introduce the Center […]

The post What does a winning thoroughbred horse look like? appeared first on JMP Blog.

December 20, 2012
 
In the last 30 years supply chains have been asked to take on a leading role as a facilitator for the proposed pull methodology, as needed to be more responsive to demands of the customer. This new and reactive supply chain would enable practitioners to quickly act upon demand signals [...]
March 14, 2012
 
The Distribution platform, frequently the backbone of data exploration, is one of the most widely used platforms in JMP. The JMP 10 Distribution platform has additional customizations and options to make this phase of data exploration even more individualized and easier than before. This blog post will focus on the reports [...]