One of the first and most important steps in analyzing data, whether for descriptive or inferential statistical tasks, is to check for possible errors in your data. In my book, Cody's Data Cleaning Techniques Using SAS, Third Edition, I describe a macro called %Auto_Outliers. This macro allows you to search for possible data errors in one or more variables with a simple macro call.

### Example Statistics

To demonstrate how useful and necessary it is to check your data before starting your analysis, take a look at the statistics on heart rate from a data set called Patients (in the Clean library) that contains an ID variable (Patno) and another variable representing heart rate (HR). This is one of the data sets I used in my book to demonstrate data cleaning techniques. Here is output from PROC MEANS:

The mean of 79 seems a bit high for normal adults, but the standard deviation is clearly too large. As you will see later in the example, one person's heart rate of 90.0 was entered as 900 by mistake (shown as the maximum value in the output). A severe outlier can have a strong effect on the mean but an even stronger effect on the standard deviation. Recall that one step in computing a standard deviation is to subtract the mean from each value and square that difference; the squaring is what gives an outlier such a huge effect on the standard deviation.
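To see just how strongly a single bad value distorts both statistics, here is a small numeric illustration in Python (the numbers are invented for illustration; this is not the Patients data set):

```python
import statistics

# Plausible adult heart rates, plus the same list with 90 mis-keyed as 900
clean = [62, 68, 71, 74, 76, 79, 82, 85, 90]
dirty = clean[:-1] + [900]

print(round(statistics.mean(clean), 1), round(statistics.stdev(clean), 1))
print(round(statistics.mean(dirty), 1), round(statistics.stdev(dirty), 1))
```

The one mis-keyed value roughly doubles the mean but inflates the standard deviation by more than an order of magnitude, because the outlier's squared deviation dominates the sum.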

### Macro

Let's run the %Auto_Outliers macro on this data set to check for possible outliers (that may or may not be errors).

Here is the call:

```
%Auto_Outliers(Dsn=Clean.Patients,
               Id=Patno,
               Var_List=HR SBP DBP,
               Trim=.1,
               N_Sd=2.5)
```

This macro call is looking for possible errors in three variables (HR, SBP, and DBP); however, we will only look at HR for this example. Setting the value of Trim equal to .1 specifies that you want to remove the top and bottom 10% of the data values before computing the mean and standard deviation. The value of N_Sd (number of standard deviations) specifies that you want to list any heart rate beyond 2.5 trimmed standard deviations from the mean.
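The trim-then-flag logic the macro applies can be sketched outside SAS. This Python function is an illustration of the approach described above, not the %Auto_Outliers macro itself:

```python
import statistics

def flag_outliers(values, trim=0.1, n_sd=2.5):
    """Flag values beyond n_sd trimmed standard deviations from the trimmed mean.

    Illustrative sketch of the logic described in the text,
    not the %Auto_Outliers macro itself."""
    s = sorted(values)
    k = int(len(s) * trim)                 # how many values to drop from each tail
    trimmed = s[k:len(s) - k] if k else s  # remove top and bottom trim fraction
    center = statistics.mean(trimmed)
    spread = statistics.stdev(trimmed)
    return [v for v in values if abs(v - center) > n_sd * spread]

hr = [62, 68, 71, 74, 76, 79, 82, 85, 90, 900]
print(flag_outliers(hr))  # flags the impossible 900
```

Trimming first matters: if the 900 were included when computing the mean and standard deviation, it would inflate both so much that it might not flag itself.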

### Result

Here is the result:

After checking each flagged value, it turned out that every one except the value for patient 003 (HR = 56) was a data error. Let's see the mean and standard deviation after these data points are removed. Notice that the mean is now 71.3 and the standard deviation is 11.5. You can see why it is so important to check your data before performing any analysis.

You can download this macro and all the other macros in my data cleaning book by going to support.sas.com/cody. Scroll down to Cody's Data Cleaning Techniques Using SAS, and click on the link named "Example Code and Data." This will download a file containing all the programs, macros, and data files from the book.  By the way, you can do this with any of my books published by SAS Press, and it is FREE!

Let me know if you have questions in the comments section, and may your data always be clean! To learn more about SAS Press, check out up-and-coming titles, and receive exclusive discounts, make sure to subscribe to the newsletter.

Finding Possible Data Errors Using the %Auto_Outliers Macro was published on SAS Users.

Some key components of CASL are the action statements. These statements perform tasks that range from configuring the connection to the server, to summarizing large amounts of data, to processing image files. Each action has its own purpose. However, there is some overlapping functionality between actions. For example, more than one action can summarize numeric variables.
This blog looks at three actions: SIMPLE.SUMMARY, AGGREGATION.AGGREGATE, and DATAPREPROCESS.RUSTATS. Each of these actions generates summary statistics. Though there might be more actions that generate the same statistics, these three are a good place to start as you learn CASL.

### Create a CAS table for these examples

The following step generates a table called mydata, stored in the casuser caslib, that will be used for the examples in this blog.

```
cas;
libname myuser cas caslib='casuser';

data myuser.mydata;
   length color $8;
   array X{100};
   do k=1 to 9000;
      do i=1 to 50;
         X{i} = rand('Normal', 0, 4000);
      end;
      do i=51 to 100;
         X{i} = rand('Normal', 100000, 1000000);
      end;
      if x1 < 0 then color='red';
      else if x1 < 3000 then color='blue';
      else color='green';
      output;
   end;
run;
```

### SIMPLE.SUMMARY

The purpose of the Simple Analytics action set is to perform basic analytical functions. One of the actions in the action set is the SUMMARY action, used for generating descriptive statistics like the minimum, maximum, mean, and sum.
This example demonstrates obtaining the sum, mean, and n statistics for five variables (x1–x5) and grouping the results by color. The numeric input variables are specified in the INPUTS parameter. The desired statistics are specified in the SUBSET parameter.

```
proc cas;
   simple.summary /
      inputs={"x1","x2","x3","x4","x5"},
      subset={"sum","mean","n"},
      table={caslib="casuser", name="mydata", groupBy={"color"}},
      casout={caslib="casuser", name="mydata_summary", replace=true};
   run;

   table.fetch /
      table={caslib="casuser", name="mydata_summary"};
   run;
quit;
```

The SUMMARY action creates a table that is named mydata_summary. The TABLE.FETCH action is included to show the contents of the table. The mydata_summary table can be used as input for other actions, its variable names can be changed, or it can be transposed. Now that you have the summary statistics, you can use them however you need to.

### AGGREGATION.AGGREGATE

Many SAS® procedures have been CAS-enabled, which means you can use a CAS table as input. However, specifying a CAS table does not mean all of the processing takes place on the CAS server. Not every statement, option, or statistic is supported on the CAS server for every procedure. You need to be aware of what is not supported so that you do not run into issues if you choose to use a CAS-enabled procedure. In the documentation, refer to the CAS processing section to find the relevant details.
When a procedure is CAS-enabled, it means that, behind the scenes, it is submitting an action. The MEANS and SUMMARY procedure steps submit the AGGREGATION.AGGREGATE action.
With PROC MEANS it is common to use a BY or CLASS statement and ask for multiple statistics for each analysis variable, even different statistics for different variables. Here is an example:

```proc means sum data=myuser.mydata noprint;
by color;
var x1 x2 x3;
output out=test(drop=_type_ _freq_) sum(x1 x3)=x1_sum x3_sum
max(x2)=x2_max std(x3)=x3_std;
run;
```

The AGGREGATE action produces the same statistics and the same structured output table as PROC MEANS.

```
proc cas;
   aggregation.aggregate /
      table={name="mydata", caslib="casuser", groupby={"color"}}
      casout={name="mydata_aggregate", caslib='casuser', replace=true}
      varspecs={{name='x1', summarysubset='sum', columnnames={'x1_sum'}},
                {name='x2', agg='max', columnnames={'x2_max'}},
                {name='x3', summarysubset={'sum','std'}, columnnames={'x3_sum','x3_std'}}}
      savegroupbyraw=true,
      savegroupbyformat=false,
      raw=true;
   run;
quit;
```

The VARSPECS parameter might be confusing. It is where you specify the variables that you want to generate statistics for, which statistics to generate, and what the resulting columns should be called. Check the documentation: depending on the desired statistic, you need to use either the SUMMARYSUBSET or the AGG argument.

If you are using the GROUPBY parameter, you most likely want to use the SAVEGROUPBYRAW=TRUE parameter. Otherwise, you must list every GROUPBY variable in the VARSPECS parameter. Also, the SAVEGROUPBYFORMAT=FALSE parameter prevents the output from containing _f (formatted) versions of all of the GROUPBY variables.

### DATAPREPROCESS.RUSTATS

The RUSTATS action, in the Data Preprocess action set, computes univariate statistics, centralized moments, quantiles, and frequency distribution statistics. This action is extremely useful when you need to calculate percentiles. If you ask for percentiles from a procedure, all of the data will be moved to the compute server and processed there, not on the CAS server.
This example has an extra step. Actions require a list of variables, which can be cumbersome when you want to generate summary statistics for more than a handful of variables. Macro variables are a handy way to insert a list of strings, variable names in this case, without having to enter all of the names yourself. The SQL procedure step generates a macro variable containing the names of all of the numeric variables. The macro variable is referenced in the INPUTS parameter.
The RUSTATS action has TABLE and INPUTS parameters like the previous actions. The REQUESTPACKAGES parameter is the parameter that allows for a request for percentiles.
The example also contains a bonus action, TRANSPOSE.TRANSPOSE. The goal is to have a final table, mydata_rustats2, with a structure like PROC MEANS would generate. The tricky part is the COMPUTEDVARSPROGRAM parameter.
The table generated by the RUSTATS action has a column called _Statistic_ that contains the name of the statistic. However, it contains “Percentile” multiple times. A different variable, _Arg1_, contains the value of the percentiles (1, 10, 20, and so on). The values of _Statistic_ and _Arg1_ need to be combined, and that new combined value generates the new variable names in the final table.
The COMPUTEDVARS parameter specifies that the name of the new variable will hold the concatenation of _Statistic_ and _Arg1_. The COMPUTEDVARSPROGRAM parameter tells CAS how to create the values for NEWID. The NEWID value is then used in the ID parameter to make the new variable names—pretty cool!

```
proc sql noprint;
   select quote(strip(name)) into :numvars separated by ','
      from dictionary.columns
      where libname='MYUSER' and memname='MYDATA' and type='num';
quit;

proc cas;
   dataPreprocess.rustats /
      table={name="mydata", caslib="casuser"}
      inputs={&numvars}
      requestpackages={{percentiles={1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99},
                        scales={"std"}}}
      casoutstats={name="mydata_rustats", caslib="casuser"};

   transpose.transpose /
      table={caslib='casuser', name="mydata_rustats",
             groupby={"_variable_"},
             computedvars={{name="newid", format="$20."}},
             computedvarsprogram="newid=strip(_statistic_)||compress(strip(_arg1_),'.-');"}
      transpose={"_value_"}
      id={"newid"}
      casOut={caslib='casuser', name="mydata_rustats2", replace=true};
   run;
quit;
```
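The string manipulation inside COMPUTEDVARSPROGRAM is easier to see outside CASL. This Python sketch (an illustration only, not CASL code) mimics what `strip(_statistic_)||compress(strip(_arg1_),'.-')` produces:

```python
def newid(statistic, arg1):
    # Mimics newid = strip(_statistic_) || compress(strip(_arg1_), '.-'):
    # trim both values, drop '.' and '-' from the percentile value,
    # and concatenate to form a valid column name.
    suffix = "".join(ch for ch in str(arg1).strip() if ch not in ".-")
    return statistic.strip() + suffix

print(newid("Percentile", "50"))  # Percentile50
print(newid("Std", ""))           # Std
```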

Here is a small portion of the final table. Remember, you can use the TABLE.FETCH action to view the table contents as well.

### Summary

Summarizing numeric data is an important step in analyzing your data. CASL provides multiple actions that generate summary statistics. This blog provided a quick overview of three of those actions: SIMPLE.SUMMARY, AGGREGATION.AGGREGATE, and DATAPREPROCESS.RUSTATS.
The wonderful part of so many choices is that you can decide which one best fits your needs. Summarizing your data with actions also ensures that all of the processing occurs on the CAS server and that you are taking full advantage of its capabilities.
Be sure to use the DROPTABLE action to delete any tables that you do not want taking up space in memory:

```
proc cas;
   table.droptable / caslib='casuser' name='mydata' quiet=true;
   table.droptable / caslib='casuser' name='mydata_summary' quiet=true;
   table.droptable / caslib='casuser' name='mydata_aggregate' quiet=true;
   table.droptable / caslib='casuser' name='mydata_rustats' quiet=true;
   table.droptable / caslib='casuser' name='mydata_rustats2' quiet=true;
quit;

cas casauto terminate;
```

Summarization in CASL was published on SAS Users.

In the hype and excitement surrounding artificial intelligence and big data, most of us miss out on critical aspects related to collecting, processing, handling, and analyzing data. It's important for data science practitioners to understand these critical aspects and add a human touch to big data. What are these aspects? [...] Galit Shmueli, National Tsing Hua University’s Distinguished Professor of Service Science, will be visiting the SAS campus this month for an interview for an Analytically Speaking webcast.

Her research interests span a number of interesting topics, most notably her acclaimed research, To Explain or Predict, as well as noteworthy research on statistical strategy, bio-surveillance, online auctions, count data models, quality control and more.

In the Analytically Speaking interview, we’ll focus on her most interesting Explain or Predict work as well as her research on Information Quality and Behavioral Big Data, which was the basis of her plenary talk at the Stu Hunter conference earlier this year. I'll also ask about her books and teaching.

Galit has authored and co-authored many books, two of which — just out this year — include some JMP. First is Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro, with co-authors, Peter C. Bruce, Nitin R. Patel, and Mia Stephens of JMP. This first edition release coincides with the third edition release of Data Mining for Business Analytics: Concepts, Techniques, and Applications with XLMiner, with the first two co-authors listed above. As Michael Rappa says so well in the foreword of the JMP Pro version of the book, “Learning analytics is ultimately about doing things to and with data to generate insights.  Mastering one's dexterity with powerful statistical tools is a necessary and critical step in the learning process.”

The second book is Information Quality: The Potential of Data and Analytics to Generate Knowledge, which Galit co-authored with Professor Ron S. Kenett, CEO and founder of KPA and research professor at the University of Turin in Italy (you may recognize Ron and KPA colleagues as guest bloggers on the JMP Blog on the topic of QbD). As David Hand notes in his foreword, the book explains that “the same data may be high quality for one purpose and low quality for another, and that the adequacy of an analysis depends on the data and the goal, as well as depending on other less obvious aspects, such as the accessibility, completeness, and confidentiality of the data.”

Both Ron and Galit will be plenary speakers at Discovery Summit Prague in March. You can download a chapter from their book, which discusses information quality support with JMP and features an add-in for Information Quality, both written by Ian Cox of JMP. You can see a short demo of JMP support for information quality during the Analytically Speaking webcast on Nov. 16.

Whether your analysis is seeking to explain some phenomena and/or to make useful predictions, you will want to hear Galit’s thoughtful perspective on the tensions between these two goals, as well as what Galit has to say on other topics up for discussion. Join us! If Nov. 16 doesn’t suit your schedule, you can always view the archived version when convenient.

The post To explain or predict with Galit Shmueli appeared first on JMP Blog.

Ronald Snee and Roger Hoerl have written a book called Strategies for Formulations Development. It is intended to help scientists and engineers be successful in creating formulations quickly and efficiently.

The following tip is from this new book, which focuses on providing the essential information needed to successfully conduct formulation studies in the chemical, biotech and pharmaceutical industries:

Although most journal articles present mixture experiments and models that only involve the formulation components, most real applications also involve process variables, such as temperature, pressure, flow rate and so on. How should we modify our experimental and modeling strategies in this case? A key consideration is whether the formulation components and process variables interact. If there is no interaction, then an additive model, fitting the mixture and process effects independently, can be used:

c(x,z) = f(x) + g(z)     (1)

where

f(x) is the mixture model, and g(z) is the process variable model. Independent designs could also be used. However, in our experience, there is typically interaction between mixture and process variables. What should we do in this case? Such interaction is typically modeled by replacing the additive model in Equation 1 with a multiplicative model:

c(x,z) = f(x) * g(z)     (2)

Note that this multiplicative model is actually non-linear in the parameters. Most authors, including Cornell (2002), therefore suggest multiplying out the individual terms in f(x) and g(z) from Equation 2, creating a linear hybrid model. However, this tends to be a large model, since the number of terms in the linearized version of c(x,z) will be the number in f(x) times the number in g(z). In Cornell’s (2002) famous fish patty experiment, there were three mixture variables (7 terms) and three process variables (8 terms), so the linearized c(x,z) had 7*8 = 56 terms, requiring a 56-run hybrid design.

Recent research by Snee et al. (2016) has shown that by considering hybrid models that are non-linear in the parameters, the number of terms required, and therefore the size of designs required, can be significantly reduced, often on the order of 50%. For example, if we fit Equation 2 directly as a non-linear model, then the number of terms to estimate is the number in f(x) plus the number in g(z); 7 + 8 = 15 in the fish patty case. Snee et al. (2016) showed using real data that this approach can often provide reasonable models, allowing use of much smaller fractional hybrid designs. We therefore recommend an overall sequential strategy involving initial use of fractional designs and non-linear models, with the option of moving to linearized models if necessary.
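The arithmetic behind the 56-versus-15 comparison can be sketched directly. The counting functions below are our own illustration; the model sizes (a special-cubic mixture model and a full-factorial process model, as in Cornell's fish patty experiment) come from the text:

```python
from math import comb

def special_cubic_terms(q):
    # Scheffe special-cubic mixture model for q components:
    # q linear terms + C(q,2) two-way cross products + C(q,3) special-cubic terms
    return q + comb(q, 2) + comb(q, 3)

def full_factorial_terms(p):
    # Intercept, main effects, and all interactions of p two-level process variables
    return 2 ** p

fx = special_cubic_terms(3)   # 7 terms for three mixture components
gz = full_factorial_terms(3)  # 8 terms for three process variables

print("linearized hybrid model:", fx * gz, "terms")       # 56
print("non-linear multiplicative model:", fx + gz, "terms")  # 15
```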

When designing an experiment, a common diagnostic is the statistical power of effects. Bradley Jones has written a number of blog posts on this very topic. In essence, what is the probability that we can detect non-negligible effects given a specified model? Of course, there are a set of assumptions/specifications needed in order to do this, such as effect sizes, error, and significance level of tests. I encourage you to read some of those previous blog posts if you’re unfamiliar with the topic.

If our response is continuous, and we are assuming a linear regression model, we can use results from the Power Analysis outline under Design Evaluation. However, what if our response is based on pass/fail data, where we are planning to do 10 trials at each experimental run? For this response, we can fit a logistic regression model, but we cannot use the results in the Design Evaluation outline. Nevertheless, we’re still interested in the power...

What to do?
We could search the literature for a way to estimate the power and hope to find something that applies (and do so again for each specific case that comes up in the future). But it is more straightforward to run a Monte Carlo simulation. To do so, we need to be able to generate responses according to a specified logistic regression model. For each of these generated responses, we fit the model and, for each effect, check whether the p-value falls below a certain threshold (say 0.05). This has been possible in previous versions of JMP using JSL, but it requires a certain level of comfort with scripting, in particular scripting formulas and extracting information from JMP reports. Also, you need to find the time to write the script. In JMP Pro 13, you can now perform Monte Carlo simulations with just a few mouse clicks.
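The same Monte Carlo idea can be sketched outside JMP. This Python sketch is an illustration of the method only, not what JMP runs internally; the design, coefficients, and fitting code are all invented for the example. It simulates binomial responses from a specified logistic model, refits with a minimal Newton/IRLS loop, and counts how often each effect's Wald p-value falls below 0.05:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)

# Hypothetical 8-run design in two coded factors (A active, B inactive),
# each of the 4 combinations run twice, with n = 10 binomial trials per run.
X = np.array([[1, a, b] for a in (-1, 1) for b in (-1, 1)] * 2, dtype=float)
n_trials = 10
beta_true = np.array([0.0, 1.0, 0.0])  # intercept, A, B on the logit scale

def wald_pvalues(X, y, n, n_iter=25):
    """Fit binomial logistic regression by Newton/IRLS; return Wald p-values."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (n * p * (1 - p))[:, None])   # Fisher information
        step = np.linalg.solve(H, X.T @ (y - n * p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    z = np.abs(beta) / se
    return np.array([erfc(zi / sqrt(2)) for zi in z])  # two-sided normal p-values

n_sim = 500
reject = np.zeros(3)
for _ in range(n_sim):
    p = 1.0 / (1.0 + np.exp(-X @ beta_true))
    y = rng.binomial(n_trials, p).astype(float)
    reject += wald_pvalues(X, y, n_trials) < 0.05

power = reject / n_sim
print("empirical power (intercept, A, B):", power)
```

The rejection rate for the active factor A is the empirical power; for the inactive factor B it should hover near the nominal 0.05 significance level.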

That sounds awesome
The first time I saw the new one-click simulate, I was ecstatic, thinking of the possible uses with designed experiments. A key element needed to use the one-click simulate feature is a column containing a formula with a random component. If you read my previous blog post on the revamped Simulate Responses in DOE, then you know we already have a way to generate such a formula without having to write it ourselves.

1. Create the Design, and then Make Table with Simulate Responses checked
In this example, we have four factors (A-D), and plan an eight-run experiment. I’ll assume that you’re comfortable using the Custom Designer, but if not, you can read about the Custom Design platform here. This example can essentially be set up the same way as an example in our documentation.

Before you click the Make Table button, you need to make sure that Simulate Responses has been selected from the hotspot at the top of the Custom Design platform.

2. Set up the simulation
Once the data table is created, we set up our simulation via the Simulate Responses dialog described previously. Under Distribution, we select Binomial and set N to 10 (i.e., 10 trials for each row of the design). Here, I’ve chosen a variety of coefficients for A-D, with factor D having a coefficient of 0 (i.e., that factor is inactive). The Simulate Responses dialog I will use is: Clicking the Apply button, we get a Y Simulated column simulating the number of successes out of 10 trials, and a column indicating the number of trials (which is used in Fit Model). For modeling purposes, I copied the Y Simulated column into Y. If we look at the formula for Y Simulated, we see that it can generate a response vector based on the model given in the Simulate Responses dialog.

3. Fit the Model
Now that we have a formula for simulating responses, we need to set up the modeling for the simulated responses. In this case, we want to collect p-values for the effects from repeated logistic regression analyses on the simulated responses. We first need to do the analysis for a single response. If we launch the Fit Model platform, we can add the number of trials to the response role (Y) and change the Personality to Generalized Linear Model with a Binomial Distribution. My Fit Model launch looks like this: Click the Run button to fit the model. The section of the report that we’re interested in is the Parameter Estimates outline. For the initial simulated response, A, B, and C were found to be active, and D was not (which is correct). Of course, this is just for a single simulation. We could keep simulating a new response vector and tracking these p-values for each effect, or we could use one-click simulate and let it do this for us.

4. Run the simulation
The column we’re interested in for this blog post is Prob>ChiSq. We right-click on that column to bring up the menu, and (if you have JMP Pro) at the bottom, above Bootstrap, we see an option for Simulate. The dialog that pops up has a choice for Column to Switch Out and a choice for Column to Switch In. For our simulations, instead of using the response Y, we want to use Y Simulated, as it contains the formula with the Random Binomial. (Instead of using Y when we first used Fit Model, we could have used Y Simulated and switched it out with itself.) The Number of Samples refers to how many times to simulate a response. Here I’ve left it at 2500. Now we just click OK and let it run.

After a short wait, we’re presented with a data table containing a column for each effect from the Fit Model dialog (as well as a simulation ID, SimID) and 2501 rows: the first is the original fit, marked as excluded, while each other row corresponds to the results from one of our 2500 simulated responses. The values are the p-values for each effect from Fit Model. The one-click Simulate has also pre-populated the data table with a Distribution script and, because it recognizes the results are p-values, another script called Power Analysis.

Running the Power Analysis script provides a distribution of the p-values for each effect, as well as a summary titled Simulated Power with the rejection rate at different levels of alpha. For example, if we look at the effect of factor B, we see that at alpha = 0.05, the null hypothesis of no effect was rejected 2103 times out of 2500, for a rejection rate (empirical power) of about 84%. I typically right-click on one of the Simulated Power tables and select Make Combined Data Table. This provides a data table with the rejection rates for each term at the four different alpha levels, which makes it easier to view the results in Graph Builder, such as the results for alpha = 0.05.
Now we can see that we have high power to detect the effects for A and B (recall that they had the largest coefficients), while C and the intercept are around 50%. Since D was inactive in our simulation, its rejection rate is around 0.05, as we would expect. We may be concerned about 50% power if we believe the assumed effect of C is correct. With the ease of performing these simulations, it’s simple to go back to Simulate Responses and change the number of trials for each row of the design before running another one-click simulation. Likewise, we could create a larger design to see how that affects the power. We could even try modeling using generalized regression with a binomial distribution.

Final thoughts
To me, a key aspect of this new feature is that it allows you to go through different “what if” scenarios with ease. This is especially true if you are in a screening situation, where it’s not unusual to be using model selection techniques when analyzing data. Now you can have empirical power calculations that match the analysis you plan to use, and help alert you to pitfalls that can arise during analysis. While this was possible prior to JMP 13, I typically didn’t find the time to create a custom formula each time I was considering a design. In the short time I’ve been using the one-click simulate, the ease with which I can create the formula and run the simulations has led me to insights I would not have gleaned otherwise, and has become an important tool in my toolbox.

The post Empirical power calculations for designed experiments with 1-click simulate in JMP 13 appeared first on JMP Blog.

The Simulate Responses feature throughout various design of experiments (DOE) platforms has always been a useful tool for generating a set of responses according to a specified model. I use it frequently for the simulated responses in Fit Model (or other appropriate platforms) as a way to check that the model is being fit as expected. Prior to JMP 13, Simulate Responses had limitations:

• Simulation was limited to linear regression models with normal errors.
• The ability to simulate responses was tied to the DOE window and the Simulate Responses window. If you closed either window, you would have to make a new data table to simulate responses again.
• If you wanted to run a Monte Carlo simulation study using simulated responses (that is, simulating a large number of responses from the specified model and collecting results), there was no easy way to do so using the simulated responses from the DOE platform.

### Simulate Responses in JMP 13

The look and feel of the Simulate Responses dialog remains the same in JMP 13. But to address the limitations I mentioned above, some new features have been added. That's the focus of the rest of this post.

### Different distributions for the response

There are times when you conduct an experiment where the response is not continuous, but is instead pass/fail or count data. In JMP 13, in addition to a linear regression model with normal errors, you now also have the ability to simulate responses that follow a Binomial or Poisson distribution.

### Relaunching the Simulate Responses dialog

As discussed above, because Simulate Responses was tied to the DOE platform, there was no easy way to relaunch the Simulate Responses dialog once either window was closed. In JMP 13, Simulate Responses is tied to the data table. A table script, called DOE Simulate, relaunches the simulation dialog.

### Automatic formula creation

In my view, the most powerful new aspect of the revamped Simulate Responses can be easy to miss. I’ll demonstrate with a simple example, a 12-run Custom Design with four continuous factors (X1-X4). When you select Simulate Responses from the hotspot in a DOE platform and create a data table, you now end up with two columns for each response that initially have the same values. The second one, Y Simulated (where Y is the name of the response), gets updated each time the Apply button is clicked in the dialog. Why the need for two columns? If you right-click on the column name, you see that Y Simulated is actually a formula column. The column Y has no formula; it is simply filled in during data table creation, the idea being that you fill in your own response values there when collecting data. If you examine the formula, you see the responses are generated based on the model specified in the Simulate Responses dialog. This means that you can also simulate responses by clicking the Apply button within the formula editor. Note that this formula is automatically created by the Simulate Responses dialog. Suppose you change the coefficients and distribution: when you click the Apply button, your Y Simulated column now has count data, and the script updates the underlying formula to reflect the new model.

### Simulation studies

For users comfortable using JSL, the automatic formula only requires the use of Eval Formula to simulate a new response, and then gathering the desired information via scripting. What’s more, if you have JMP Pro 13, this formula allows you to perform one-click Monte Carlo simulations (akin to the one-click bootstrap).
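The formula column is essentially a stored simulator: a recipe that returns a fresh response vector from the specified model every time it is evaluated. As a rough analogy (in Python, not JSL; all names and numbers here are invented), it behaves like a closure:

```python
import math
import random

def make_response_formula(coefs, design, n_trials=10, seed=None):
    """Return a function that, like the Y Simulated formula column, draws a
    fresh binomial response vector from the specified model on each call.
    A rough Python analogy; JMP stores this as a column formula."""
    rng = random.Random(seed)

    def simulate():
        responses = []
        for row in design:
            eta = sum(c * x for c, x in zip(coefs, row))  # linear predictor
            p = 1.0 / (1.0 + math.exp(-eta))              # logistic link
            responses.append(sum(rng.random() < p for _ in range(n_trials)))
        return responses

    return simulate

design = [(1, -1), (1, 1)]                       # intercept plus one coded factor
y_simulated = make_response_formula([0.0, 2.0], design, seed=7)
print(y_simulated())  # a new simulated response vector on each call
print(y_simulated())
```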

Stay tuned for a blog post by Ryan Parker explaining how to use this new Simulate feature. In a future blog post, I’ll show just how easy it is to perform empirical power calculations using the new Simulate Responses with one-click Simulate.

The post Simulate Responses is revamped to be more useful in JMP 13 appeared first on JMP Blog.

In our previous blog post, we wrote about using designed experiments to develop analytic methods. This post continues the discussion of analytic methods and shows how a new type of experimental design, the Definitive Screening Design (DSD), can be used to assess and improve analytic methods.

We begin with a quick review of analytic methods and a brief summary of the experiment described in that previous blog post, and then show what is learned by using a DSD.

### What are analytic methods?

Analytic methods are used to carry out essential product and process measurements. Such measurement systems are critical in the pharmaceutical industry, where understanding of the process monitoring and control requirements is important for developing sound analytic methods. The typical requirements for evaluating analytic methods include:

• Precision: This requirement ensures that method variability is only a small proportion of the specification range (upper specification limit – lower specification limit).
• Selectivity: This determines which impurities to monitor at each production step and specifies design methods that adequately discriminate the relative proportions of each impurity.
• Sensitivity: This relates to the need for methods that accurately reflect changes in CQAs (critical quality attributes) that are important relative to the specification limits, which is essential for effective process control.

QbD implementation in the development of analytic methods is typically a four-stage process addressing both design and control of the methods. The stages are:

1. Method Design Intent: Define the performance requirements that the method must meet.
2. Method Design Selection: Select the method work conditions to achieve the design intent.
3. Method Control Definition: Establish and define appropriate controls for the components with the largest contributions to performance variability.
4. Method Control Validation: Demonstrate acceptable method performance with robust and effective controls.

We continue here the discussion of how to use statistically designed experiments to achieve robustness, which we began in our previous blog post.

### A case study in HPLC development

The case study we presented in our previous post concerns the development of a High Performance Liquid Chromatography (HPLC) method. The specific system consists of an Agilent 1050, with a variable-wavelength UV detector and a model 3396-A integrator. Table 1 lists the factors and their levels used in the designed experiments of this case study. The original experimental array was a 2^(7-4) fractional factorial experiment with three center points (see Table 2). The levels "-1" and "1" correspond to the lower and upper levels listed in Table 1, and "0" corresponds to the nominal level. The lower and upper levels are chosen to reflect variation that might naturally occur about the nominal setting during regular operation.

Table 1. Factors and levels in HPLC experiments.
Table 2. Original fractional factorial experimental array for HPLC experiment.

The fractional factorial experiment (Table 2) consists of 11 runs: eight that combine the design factor levels in a balanced set of combinations, plus three center points.

### What do we learn from the fractional factorial experiment?

In our previous post, we analyzed the data from the factorial experiment and found that the experiment provided answers to several important questions:

• How sensitive is the method to natural variation in the input settings?
• Which inputs have the largest effect on the outputs from the method?
• Are there different inputs that dominate the sensitivity of different responses?
• Is the variation transmitted from factor variation large relative to natural run-to-run variation?

All of the above answers relate to our ability to assess the effects of factor variation when the factors are at their nominal settings. However, they do not address the possibility of improving robustness by moving the nominal setting to one that is less sensitive to factor variation.

### Robustness and Nonlinearity

Robustness has a close link to nonlinearity. We saw this feature in the previous blog post.  There the initial analysis of the factorial experiment showed clear lack-of-fit, which the team attributed to the "gradient" factor.  We used a model with a quadratic term for gradient and found that situating the nominal value near the "valley" of the resulting curve could effectively reduce the amount of transmitted variation. Thus, the added quadratic term gave valuable information about where to set the gradient to achieve a robust method.

The presence of interactions is another form of nonlinearity that has consequences for method robustness. Two factors have an interaction effect on a response when the slope of either factor's effect depends on the setting of the other factor.  In a robustness experiment, the slope is a direct reflection of method sensitivity. So when there is an interaction, we can typically set the nominal level of one of the factors to a level that moderates the slope of the second factor, thereby reducing its contribution to transmitted variation. Exploiting interactions in this manner is a basic tool in the quality engineering experiments of Genichi Taguchi.
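
To see how an interaction lets us moderate a slope, consider a minimal sketch. The coefficients below are illustrative, not values from the HPLC study: in a model y = b1·x1 + b2·x2 + b12·x1·x2, the effective slope of x1 is b1 + b12·x2, so choosing the nominal of x2 near -b1/b12 suppresses the variation that x1 transmits.

```python
# Hypothetical two-factor model: y = b1*x1 + b2*x2 + b12*x1*x2.
# The effective slope of x1 is (b1 + b12 * x2), so it depends on x2.
b1, b2, b12 = 3.0, 1.0, -2.0   # illustrative values, not from the HPLC study

def transmitted_sd_from_x1(x2_nominal, sd_x1=0.4):
    """SD transmitted to y by variation in x1 when x2 sits at x2_nominal."""
    slope = b1 + b12 * x2_nominal
    return abs(slope) * sd_x1

# At x2 = 0, x1 transmits its full slope; setting x2 = -b1/b12 (= 1.5 here)
# cancels the slope and silences x1's contribution entirely.
print(round(transmitted_sd_from_x1(0.0), 2))   # 1.2
print(round(transmitted_sd_from_x1(1.5), 2))   # 0.0
```

This is exactly the lever that Taguchi-style robustness experiments exploit: the interaction lets one factor's nominal setting dial down another factor's sensitivity.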

### How can we plan the experiment for improving robustness?

The fractional factorial experiment that we analyzed in the previous post was effective for estimating linear effects of the factors – and this was sufficient for assessing robustness.  However, to improve robustness, we need a design that is large enough to let us estimate both linear and nonlinear effects.  The natural first step is to consider estimating "second order effects", which include pure quadratic effects like the one for gradient in our earlier post and two-factor interactions.

There are three ways we can think about enlarging the experiment to estimate additional terms in a model of the analytic method’s performance. Specifically, we can use a design that is appropriate for estimating:

1. All two-factor interactions and pure quadratic effects.
2. All two-factor interactions but no pure quadratics.
3. All pure quadratics but no interactions.

Effective designs exist for option 1, like the central composite and Box-Behnken designs.  Similarly, the two-factor interactions can be estimated from two-level fractional factorial designs (option 2). The main drawback to both of these choices is that they require too many experimental runs. With K factors in the experiment, there are K main effects, K pure quadratics and K(K-1)/2 two-factor interactions. We also need to estimate the overall mean, so we need at least 1+K(K+1)/2 runs to estimate all the main effects and two-factor interactions. If K is small, this may be perfectly feasible. However, with K=7, as in the HPLC experiment, that adds up to at least 29 runs (and at least 36 to also estimate the pure quadratics). These experiments are about three times as large as the fractional factorial design analyzed in the previous blog post  and would be too expensive to implement.
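
The run-count arithmetic above is easy to encode. This sketch counts minimum parameter (and hence run) requirements only; real designs may add runs for error estimation:

```python
def min_runs_second_order(k, include_quadratics=True):
    """Minimum runs to estimate the intercept, k main effects,
    k*(k-1)//2 two-factor interactions and, optionally, k pure quadratics."""
    runs = 1 + k + k * (k - 1) // 2      # 1 + k(k+1)/2
    if include_quadratics:
        runs += k
    return runs

print(min_runs_second_order(7, include_quadratics=False))  # 29
print(min_runs_second_order(7))                            # 36
```

For the seven-factor HPLC study this reproduces the 29- and 36-run floors quoted above, roughly three times the size of the 11-run fractional factorial.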

Here we consider option 3: designs to estimate all the pure quadratics, but no interactions. A very useful class of experimental designs for this purpose is the Definitive Screening Designs (DSDs). We show in the next section how to use a DSD for studying and improving the robustness of an analytic method.

### Applying a Definitive Screening Design

A Definitive Screening Design (DSD) for K factors requires 2K+1 runs if K is even and 2K+3 if K is odd (to ensure main effect orthogonality). The design needs to estimate 2K+1 regression terms, so this is at or near the minimum number of runs needed. In such a design, all factors are run at three levels in a factorial arrangement; main effects are orthogonal and free of aliasing (partial or full) with quadratic effects and two-factor interaction effects; and no quadratic or two-factor interaction effect is fully aliased with another quadratic or two-factor interaction effect. With a DSD we can estimate all linear and quadratic main effects. Further, if some factors prove to have negligible effects, we may be able to estimate some two-factor interactions. The HPLC study had seven factors, so a DSD requires 17 experimental runs (see Table 3). For robustness studies, it is important to estimate the magnitude of run-to-run variation. The DSD in this application has two degrees of freedom for error, so no additional runs are needed. Were K even, it would be advisable to add at least two runs to permit estimation of error; a simple way to do this is to add more center points to the design.

Table 3. Definitive Screening Design with seven factors.
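
The DSD run-count rule stated above (Jones and Nachtsheim, 2011) can be captured in a one-line helper:

```python
def dsd_runs(k):
    """Run count for a Definitive Screening Design with k factors:
    2k+1 runs for even k, 2k+3 for odd k (to keep main effects orthogonal)."""
    return 2 * k + 1 if k % 2 == 0 else 2 * k + 3

print(dsd_runs(7))  # 17, as in the HPLC study
print(dsd_runs(6))  # 13
```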

### What do we learn from analyzing the DSD experimental data?

As in the previous blog post, we will illustrate the analysis by looking at the results for the peakHeight response in the HPLC application. Throughout, we divide the actual peakHeights by 1000 for ease of presentation. We fit a model to the DSD experimental data that includes all main effects and pure quadratic effects. The analysis shows that the only significant quadratic effect is the one for Gradient. In addition to the Gradient quadratic effect, we kept in the model, as linear main effects: Gradient, Column Temperature, Detection Wavelength and Triethylamine Percentage. In Figure 1, we show parameter estimates from fitting this reduced model to the peakHeight responses. All terms are statistically significant, the adjusted R² is 87%, and the run-to-run variation has an estimated standard deviation of 2.585.

Figure 1. Parameter estimates of peakHeight with quadratic model.

We show a Profiler for the reduced quadratic model in Figure 2, below.

Figure 2. Profiler of HPLC experiments with reduced quadratic model derived from DSD experiments.

### Finding a robust solution

In order to improve robustness, we need to identify nonlinear effects. Here the only nonlinear effect is for Gradient. Figure 2 shows that the quadratic response curve for Gradient reaches a minimum quite close to the nominal value (0 in the coded units of Figure 2). Consequently, setting the nominal level of Gradient to that level is a good choice for robustness. The other factors can be kept at their nominal settings; they have only minor quadratic effects, so moving them to other settings would do little to improve method robustness.

We can assess the level of variation, as in the previous post, by assigning normal distributions to the input factors. As in that post, we use the default option in JMP, which assigns to each input a normal distribution with standard deviation of 0.4 (in coded units). Figure 3 shows the results of this simulation. The standard deviation of peakHeight associated with variation in the factor levels is 2.697, very similar in magnitude to the SD for run-to-run variation from the experimental data. The estimate of the overall method SD is then 3.736 (the square root of 2.697² + 2.585²).

Figure 3. Profiler of peakHeight at nominal levels with added noise.
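
The simulation idea can be sketched outside JMP as well. The model coefficients below are illustrative placeholders, not the fitted values from the study, but the procedure is the one described above: draw each input from a normal distribution with SD 0.4, propagate it through a reduced quadratic model, and combine the resulting SD with run-to-run variation on the variance scale.

```python
import math
import random
import statistics

random.seed(1)

# Illustrative reduced quadratic model for peakHeight (coded units).
# These coefficients are placeholders, NOT the fitted values from the study.
def peak_height(grad, col_temp, det_wave, tea):
    return (250.0 + 4.0 * grad + 6.0 * grad ** 2
            + 5.0 * col_temp - 3.0 * det_wave + 2.0 * tea)

# Each input gets a normal distribution with SD 0.4 (the JMP default).
sims = [peak_height(*(random.gauss(0.0, 0.4) for _ in range(4)))
        for _ in range(20000)]
sd_factors = statistics.stdev(sims)      # SD transmitted by factor variation

# Variances, not SDs, add: combine with the run-to-run component.
sd_run_to_run = 2.585                    # run-to-run SD reported above
sd_overall = math.sqrt(sd_factors ** 2 + sd_run_to_run ** 2)
print(round(sd_factors, 2), round(sd_overall, 2))
```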

It is instructive to compare the results from analyzing the DSD to those from analyzing the fractional factorial in the previous blog post. Both experiments ended with the conclusion that gradient has a nonlinear effect on peakHeight, and that setting gradient close to its planned nominal level is a good choice for robustness of the analytic method. The fractional factorial was not able to identify gradient as the interesting factor; this happened only after substantial discussion by the experimental team. And even then, there was concern that the decision to attribute all the nonlinearity to the gradient might be completely off the mark. The DSD, on the other hand, with just a few more runs, was able to support a firm conclusion that gradient is the only factor that has a nonlinear effect. There was no need for debate and assumptions; the issue could be determined from the experimental data.

The DSD and the fractional factorial are both able to assess variance from factor uncertainty and both agree that the three factors with the most important contributions are gradient, column temperature and detection wavelength. The DSD identified a fourth factor, the percent of Triethylamine, as playing a significant role.

The DSD, by estimating all the pure quadratic effects, was also able to fully confirm that there would be no direct gain in method robustness by shifting any of the factors to different nominal values. Improvement might still be possible due to two-factor interactions; but as we pointed out, only a much larger experiment could detect those interactions.

### Can we still improve method robustness?

The DSD has shown us that changing nominal levels is not a solution. An alternative is to institute tighter control of the process parameters, thereby limiting their natural variation. Moreover, the DSD helps us prioritize the choice of which factors to control. Figures 1 and 3 show us that the strongest linear effect is due to the column temperature. They also show that the strong, nonlinear effect of gradient may be contributing some of the most extreme high values of peakHeight. Thus these two variables appear to be the primary candidates for enhanced control. Figures 4 and 5 use the simulator option with the profiler to show the effect of reducing the natural spread of each of these factors, in turn, by a factor of 2. With enhanced control of the column temperature, the SD related to factor uncertainty drops from 2.697 to 2.550. Reducing the variation of the gradient leads to a much more substantial improvement: the SD drops by about 40%, to 1.667.

Figure 4. Profiler of peakHeight at nominal levels with added noise and enhanced control of the column temperature.

Figure 5. Profiler of peakHeight at nominal levels with added noise and enhanced control of the gradient.

### Summary

Experiments on robustness are an important stage in the development of an analytic method. These experiments intentionally vary process factors that cannot be perfectly controlled about their nominal value. Experiments that are geared to fitting a linear regression model are useful for assessing robustness, but have limited value for improving robustness.

We have shown here how to exploit nonlinear effects to achieve more robust analytic methods.  The Definitive Screening Design can be especially useful for such experiments. For a minimal experimental cost, it provides enough data to estimate curvature with respect to each input factor. When curvature is present, we have seen how to exploit it to improve robustness. When curvature has been exploited, we have seen how to use the experimental results to achieve further improvements via tighter control of one or more input factors.

Notes

 Jones, B. and Nachtsheim, C. J. (2011) “A Class of Three-Level Designs for Definitive Screening in the Presence of Second-Order Effects” Journal of Quality Technology, 43. 1-15.

 Borman, P., Nethercote, P., Chatfield, M., Thompson, D., Truman, K. (2007), Pharmaceutical Technology. http://pharmtech.findpharma.com/pharmtech/Peer-Reviewed+Research/The-Application-of-Quality-by-Design-to-Analytical/ArticleStandard/Article/detail/463580

 Romero R., Gasquez, D., Sanshez, M., Rodriguez, L. and Bagur, M.  (2002), A geometric approach to robustness testing in analytical HPLC, LCGC North America, 20, pp. 72-80, www.chromatographyonline.com.

 Steinberg, D.M., Bursztyn, D. (1994). Dispersion effects in robust-design experiments with noise factors, Journal of Quality Technology, 26, 12-20.

This blog post is brought to you by members of the KPA Group: Ron Kenett and David Steinberg. Read the whole QbD Column series.

The post The QbD Column: Applying QbD to make analytic methods robust appeared first on JMP Blog.

Development of measurement or analytic methods parallels the development of drug products. Understanding of the process monitoring and control requirements drives the performance criteria for analytical methods, including the process critical quality attributes (CQAs) and specification limits. Uncovering the characteristics of a drug substance that require control to ensure safety or efficacy determines the identification of CQAs. The typical requirements of analytic methods include:

• Precision: This requirement makes sure that method variability is only a small proportion of the specifications range (upper specification limit – lower specification limit). This is also called Gage Reproducibility and Repeatability (GR&R).
• Selectivity: This determines which impurities to monitor at each production step and specifies design methods that adequately discriminate the relative proportions of each impurity.
• Sensitivity: To achieve effective process control, this requires methods that accurately reflect changes in CQAs that are important relative to the specification limits.

These criteria establish the reliability of methods for use in routine operations. This has implications for analysis time, acceptable solvents and available equipment. To develop an analytic method with QbD principles, the method’s performance criteria must be understood, as well as the desired operational intent of the eventual end-user. Limited understanding of a method can lead to poor technology transfer from the laboratory into use in commercial manufacturing facilities or from an existing facility to a new one. Failed transfers often require significant additional resources to remedy the causes of the failure, usually at a time when there is considerable pressure to move ahead with the launch of a new product. Applying Quality by Design (QbD) to analytic methods aims to prevent such problems.

QbD implementation in the development of analytic methods is typically a four-stage process, addressing both design and control of the methods. The stages are:

1. Method Design Intent: Define the performance requirements the method must satisfy (its design intent).
2. Method Design Selection: Select the method work conditions to achieve the design intent.
3. Method Control Definition: Establish and define appropriate controls for the components with the largest contributions to performance variability.
4. Method Control Validation: Demonstrate acceptable method performance with robust and effective controls.

Testing robustness of analytical methods involves evaluating the influence of small changes in the operating conditions. Ruggedness testing identifies the degree of reproducibility of test results obtained by the analysis of the same sample under various normal test conditions such as different laboratories, analysts, and instruments. In the following case study, we focus on the use of experiments to assess and improve robustness.

### A case study in HPLC development

The case study presented here is from the development of a High Performance Liquid Chromatography (HPLC) method. It is a typical example of testing the robustness of analytical methods. The specific system consists of an Agilent 1050, with a variable-wavelength UV detector and a model 3396-A integrator.

The goal of the robustness study is to find out whether deviations from the nominal operating conditions affect the results. Table 1 lists the factors and their levels used in this case study. The experimental array is a 2^(7-4) Fractional Factorial experiment with three center points (see Table 2). The levels "-1" and "1" correspond to the lower and upper levels listed in Table 1, and "0" corresponds to the nominal level. The lower and upper levels are chosen to reflect variation that might naturally occur about the nominal setting during regular operation. For examples of QbD applications of Fractional Factorial experiments to formulation and drug product development, see the second and third blog posts in this series.

Table 1. Factors and levels in the HPLC experiment.

Table 2. Experimental array of the HPLC experiment.

The experimental array consists of 11 experimental runs: eight that combine the design factor levels in a balanced set of combinations, plus three center points.

### What we can learn from the HPLC experiments

In analyzing the HPLC experiment, we have the following goals:

1. Find the expected method measurement prediction variance for recommended setups of the method (the measurement uncertainty).
2. Identify the best settings of the experimental factors to achieve acceptable performance.
3. Determine the factors that impact the performance of the method on one or more responses.
4. Assess the impact of variability in the experimental factors on the measured responses.
5. Make the HPLC method robust by exploiting nonlinearity in the factor effects to achieve performance that is not sensitive to changes about nominal levels.

If we consider only the eight experimental runs of the 2^(7-4) fractional factorial, without the center points, we get an average prediction variance of 0.417 and 100% efficiency for fitting a first-order model. This is due to the balanced property of the design (see Figure 1, left). The design in Table 2, with three center points, reduces prediction uncertainty near the center of the region and has a lower average prediction variance of 0.38. However, the center points don't contribute to estimating slopes, as seen in the lower efficiency for fitting the first-order model (see Figure 1, right).

Figure 1. Design diagnostics for this fractional factorial without (left) and with (right) three center points.

Figure 2. Prediction variance profile without (top) and with (bottom) center points.

The JMP Prediction Variance Profile in Figure 2 shows the ratio of the prediction variance to the error variance, also called the relative variance of prediction, at various factor level combinations. Relative variance is minimized at the center of the design. Adding three center points reduces prediction variance by 25%, from 0.12 to 0.09. This is an advantage derived by adding experimental runs at the center points. Another advantage that we will see later is that the center points permit us to assess nonlinear effects, or lack-of-fit for the linear regression model. A third advantage is that the center points give us a model-free estimate of the extent of natural variation in the system.
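The 0.12 and 0.09 figures follow directly from the orthogonality of the design. As a sketch, using one standard choice of generators for the 2^(7-4) fraction (which may differ from the published array, though the diagnostics agree): because X'X is diagonal, the relative prediction variance at the center point reduces to 1/n.

```python
from itertools import product

# Build the 8-run 2^(7-4) fractional factorial from base factors A, B, C,
# using one standard set of generators: D=AB, E=AC, F=BC, G=ABC.
runs = [(a, b, c, a * b, a * c, b * c, a * b * c)
        for a, b, c in product((-1, 1), repeat=3)]

# The seven main-effect columns are mutually orthogonal ...
for i in range(7):
    for j in range(i + 1, 7):
        assert sum(r[i] * r[j] for r in runs) == 0

# ... so X'X is diagonal, and the relative prediction variance at the center
# point x0 = (1, 0, ..., 0) reduces to the intercept term alone, i.e. 1/n.
print(round(1 / 8, 2))    # 0.12 with the 8 factorial runs only
print(round(1 / 11, 2))   # 0.09 after adding 3 center points
```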

At each factor level combination, the experiments produced five responses: 1) Area of chromatogram at peak (peakArea), 2) Height of chromatogram at peak (peakHeight), 3) Minimum retention time adjusted to standard (tRmin),  4) Unadjusted minimum retention time (unad tRmin) and 5) Chromatogram resolution (Res).

Our first concern in analyzing the data is to identify proper models linking factors and responses.

### What do we learn from analyzing the data from the fractional factorial experiment?

Linear regression models are the simplest models to consider. They represent changes in responses between two levels of the factors, in our case the levels labeled “-1” and “+1”. Since we also have three center points, at levels labeled “0”, we can also assess nonlinear effects. We do so, as in our second blog post, by adding a synthetic indicator variable designed to assess lack-of-fit (LOF); it is equal to “1” at the center points and “0” elsewhere. The JMP Effect Summary report, for all five responses with linear effects on all seven factors, and the LOF indicator, is presented in Figure 3.

Figure 3. Effect Summary report of seven factors and LOF indicator on five responses.

The Effect Summary table lists the model effects across the full set of five responses, sorted by ascending p-values. The LogWorth for each effect is defined as -log10(p-value), which adjusts p-values to provide an appropriate scale for graphics. A LogWorth that exceeds 2 is significant at the 0.01 level because -log10(0.01)=2. The report includes a bar graph of the LogWorth with dashed vertical lines at integer values and a blue reference line at 2. The displayed p-values correspond to the significance test displayed in the Effect Tests table of the model report. The report in Figure 3 shows that, overall, four factors and LOF are significant at the 0.01 level (Col Temp, Gradient, Buf PH and Dim Perc) and Buf Conc, Det Wave and Trie Perc are non-significant. From the experimental plan in Table 2, one can estimate the main effects of the seven factors and the LOF indicator on the five responses with a linear model.
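The LogWorth transformation described above is simple to compute; a quick sketch:

```python
import math

def logworth(p_value):
    """LogWorth = -log10(p-value); values above 2 are significant at 0.01."""
    return -math.log10(p_value)

print(round(logworth(0.01), 3))    # 2.0, exactly at the blue reference line
print(round(logworth(0.001), 3))   # 3.0, well past it
print(round(logworth(0.05), 3))    # 1.301, not significant at the 0.01 level
```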

Figure 4 presents parameter estimates for peakHeight, with an adjusted R² of 93%, a very good fit. The peakHeight response is most sensitive to variations in Col Temp, Det Wave and Gradient.

Figure 4. Parameter estimates of peakHeight of seven factors and LOF indicator with linear model. For improved readability, peakHeights have been divided by 1000.

We observe a statistically significant difference between the predicted value at the center point of the experimental design and the three measurements actually performed there (via the LOF variable).

Figure 5 displays a profiler plot showing the linear effects of each factor on all five responses. The plot is very helpful in highlighting which conditions might affect the HPLC method. We see that Col Temp and Gradient, the two most important factors, affect several different responses. Buf pH, Buf Conc and Dim Perc have especially strong effects on the retention responses, but weak effects on the other CQAs. The factors give good fits to the retention responses and to peakHeight, but not to peakArea or Res, which is reflected in the wide confidence bands for those CQAs and in high p-values for the overall model F-tests in the Analysis of Variance line of the model output.

Figure 5. Profiler of HPLC experiments with linear model.

What should we do about the nonlinearity? Our analysis found a significant effect of the LOF indicator, which points to a nonlinear effect that is not accounted for in the profiler of Figure 5. The center points we added to the two-level fractional factorial design let us detect the nonlinearity, but they don’t provide enough information to determine what causes it – any one of the seven factors (and possibly several of them) could be responsible for the nonlinear effect on peakHeight. In our next blog post, we will discuss some design options to address the problem. For now, we show what we achieved with the current experiment.

After much brainstorming, the HPLC team decided that it was very likely that the Gradient was the factor causing the nonlinearity. This important assumption, based only on process knowledge, is crucial to all our subsequent conclusions. We proceeded to fit a model to the original experimental data that includes a quadratic effect for Gradient. The team also decided to retain only the factors with the strongest main effects for each response; for peakHeight, these were Gradient, Column Temperature and Detection Wavelength. In Figure 6, we show parameter estimates from fitting this reduced model to the peakHeight responses. With this model, all terms are significant, with an adjusted R² of 89%. The root mean squared error, which estimates run-to-run variation at the same settings of the factors, is 1.754, slightly less than 1% of the magnitude of peakHeight itself (after dividing peakHeight by 1000).

Figure 6. Parameter estimates of peakHeight with quadratic model. For improved readability, peakHeights have been divided by 1000.

We show a Profiler for the reduced quadratic model in Figure 7.

Figure 7. Profiler of HPLC experiment with reduced quadratic model for peakHeight. For improved readability, peakHeights have been divided by 1000.

### Finding a robust solution

One of the main goals of the experiment was to assess the robustness of the system to variation in the input factors. We explore this question by introducing normal noise to the three factors in the reduced quadratic model. For each factor, we assumed a standard deviation of 0.4 (in the coded units), which is the default option in JMP. This reflects a view that the experimental settings are about 2.5 SD from the nominal level, and thus represent rather extreme deviations that might be encountered in practice.

Figure 8 presents the effect of noise on peakHeight for a setup at the center point, which was initially identified as the nominal setting. We can compute the SD of the simulated outcomes by saving them to a table and using the “distribution” tab in JMP. The SD turns out to be 2.397, somewhat larger than the run-to-run SD of 1.754 that we computed earlier. The overall SD associated with the analytic system involves both of these components. To combine them, we first square them, then add them (because variances, not SDs, are additive) and then take a square root to return to the original measurement scale. The resulting combined SD is 2.970, so the anticipated variation in factor settings leads to an SD about 70% larger than the one from run-to-run variation alone. The overall SD is less than 1.5% of the typical values of peakHeight, and that was considered acceptable for this process.
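The square-add-square-root recipe above is just the combination of independent variance components; a quick sketch with the numbers from this section:

```python
import math

def combine_sds(*sds):
    """Combine independent variance components: square, add, square-root."""
    return math.sqrt(sum(s * s for s in sds))

sd_factors = 2.397      # SD from simulated factor variation
sd_run_to_run = 1.754   # root mean squared error from the fitted model

overall = combine_sds(sd_factors, sd_run_to_run)
print(round(overall, 3))                       # 2.97
print(round(overall / sd_run_to_run - 1, 2))   # 0.69, i.e. about 70% larger
```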

### Is our nominal solution a good one for robustness?

Figure 8 is very helpful in answering this question. The important factor here is Gradient, through its non-linear relationship to peakHeight. The “valley” of that relationship is near the nominal choice of 0. Our simulation of factor variation generates values of Gradient that cover the range from -1 to 1. When those values are in the valley, they transmit very little variation to peakHeight. By contrast, when they are near the extremes, there is substantial variation in peakHeight. So the fact that the bottom of the valley is close to the nominal setting assures us that the transmitted variation will be about as small as possible. We can test this feature by shifting the nominal value of Gradient. When the nominal is -0.5, the simulator shows that the SD from factor variation increases to 4.282, almost 80% more than for the nominal setting at 0.
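A small simulation illustrates the valley effect. The pure-quadratic coefficient below is illustrative, not the fitted value, but the qualitative conclusion matches the text: shifting the nominal off the valley floor inflates the transmitted variation.

```python
import random
import statistics

random.seed(2)

# Pure-quadratic sketch of the gradient effect, y = c * g**2, with the
# "valley" at g = 0. The coefficient c is illustrative, not the fitted value.
c = 6.0

def transmitted_sd(nominal, sd=0.4, n=20000):
    """SD of y when the gradient varies normally about its nominal setting."""
    sims = [c * (nominal + random.gauss(0.0, sd)) ** 2 for _ in range(n)]
    return statistics.stdev(sims)

sd_at_valley = transmitted_sd(0.0)
sd_off_valley = transmitted_sd(-0.5)
# Moving the nominal off the valley floor inflates the transmitted variation.
print(round(sd_off_valley / sd_at_valley, 1))
```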

The dependence of peakHeight on Col Temp and on Det Wave is linear. So regardless of how we choose the nominal settings of these factors, they will transmit the same degree of variation to the peakHeight output. The experiment lets us assess how they affect robustness, but does not provide any opportunity to exploit the results to improve robustness.

Figure 8. Prediction Profiler of peakHeight when the factors are at their nominal settings and the natural SD is 0.4. For improved readability, peakHeights have been divided by 1000.

### Going back to the original questions

In reviewing the questions originally posed, we can now provide the following answers:

1. What is the expected method measurement prediction variance for recommended setups of the method (the measurement uncertainty)?

Answer: We looked at this question most closely for peakHeight, where we found that the overall SD is 2.970 (on the scale of peakHeight divided by 1000), with somewhat more than half of the variance coming from variation in the measurement process factors and the rest from run-to-run variation.

2. What setup of the experimental factors will achieve acceptable performance?

Answer: With all factors at their nominal settings (coded value 0), the SD of 2.970 (on the scale of peakHeight divided by 1000) is less than 1.5% of the size of the values measured, which is an acceptable level of variation in this application.

3. What are the factors that impact the performance of the method in one or more responses?

Answer: The three factors with the highest impact on the method’s performance are gradient profile, column temperature and detection wavelength.

4. Can we make the setup of the experimental factors robust in order to achieve performance that is not sensitive to changes in factor levels?

Answer: We saw that we can improve robustness for peakHeight by setting the gradient to its coded level of 0 (the nominal level in the experiment).  That setting helps us to take advantage of the non-linear effect of gradient and reduce transmitted variation.

5. Can we assess the impact of variability in the experimental factors on the analytical method?

Answer: As we noted earlier, the natural variability of the input factors is responsible for slightly more than half the variation in peakHeight.

### Wrap-Up

In reviewing the questions originally posed, we first fit a linear regression model. After realizing that there was an unaccounted-for nonlinear effect, we used a reduced quadratic model and found that it fits the data well. By inducing variability in the factors of the reduced quadratic model (gradient profile, column temperature and detection wavelength), we could estimate the variability due to the method and assess the robustness of the recommended setup.

The team’s assumption that gradient is responsible for the non-linearity is clearly important here.  If other factors also have non-linear effects, there could be consequences for how to best improve the robustness of the method. We will explore this issue further in our next blog post.

Notes

 Borman, P., Nethercote, P., Chatfield, M., Thompson, D., Truman, K. (2007), Pharmaceutical Technology. http://pharmtech.findpharma.com/pharmtech/Peer-Reviewed+Research/The-Application-of-Quality-by-Design-to-Analytical/ArticleStandard/Article/detail/463580

 Kenett, R.S, and Kenett, D.A (2008), Quality by Design Applications in Biosimilar Technological Products, Accreditation and Quality Assurance, Springer Verlag, Vol. 13, No 12, pp. 681-690.

 Romero R., Gasquez, D., Sanshez, M., Rodriguez, L. and Bagur, M.  (2002), A geometric approach to robustness testing in analytical HPLC, LCGC North America, 20, pp. 72-80, www.chromatographyonline.com.

This blog post is brought to you by members of the KPA Group: Ron Kenett and David Steinberg.

The post The QbD Column: Is QbD applicable for developing analytic methods? appeared first on JMP Blog.

As data analysts, we all try to do the right thing. When there is a choice of statistical distributions to be used for a given application, it’s a natural inclination to try to find the “best” one.

### But beware...

Fishing for the best distribution can lead you into a trap. Just because one option appears to be best – that doesn’t mean that it’s correct! For example, consider this data set: What is the best distribution we can use to describe this data? JMP can help us answer this question. From the Distribution platform, we can choose to fit a number of common distributions to the data: Normal, Weibull, Gamma, Exponential, and others. To fit all possible continuous distributions to this data in JMP, go to the red triangle hotspot for this variable in the Distribution report, and choose “Continuous Fit > All”. Here is the result: JMP has compared 11 potential distributions for this data, and ranked them from best (Gamma) to worst (Exponential). The metric used to perform the ranking is the corrected Akaike Information Criterion (AICc). Lower values of AICc indicate better fit, and so the Gamma distribution is the winner here.
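For reference, AICc can be computed directly from the maximized log-likelihood. The sketch below does this for a normal fit, whose MLE has a closed form; the sample data are made up for illustration.

```python
import math
import statistics

def normal_aicc(data):
    """AICc for a normal MLE fit (k = 2 parameters):
    AICc = -2 lnL + 2k + 2k(k+1)/(n-k-1)."""
    n, k = len(data), 2
    mu = statistics.fmean(data)
    sigma2 = sum((x - mu) ** 2 for x in data) / n   # MLE of the variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return -2 * loglik + 2 * k + 2 * k * (k + 1) / (n - k - 1)

sample = [42.0, 55.3, 48.1, 61.2, 50.7, 39.8, 53.4, 47.6]  # made-up data
print(round(normal_aicc(sample), 1))
```

Lower AICc means better fit, so comparing this value across candidate distributions is exactly the ranking JMP performs.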

### Here’s the catch

This data set was generated by drawing a random sample of size 50 from a population that is normally distributed with a mean of 50 and a standard deviation of 10. The Normal distribution is the correct answer by definition, but our fishing expedition gave us a misleading result.

How often is there a mismatch like this? One way we can approach this question is through simulation. I wrote a small JMP script to draw samples of various sizes from a normally distributed population. I investigated sample sizes of 5, 10, 20, 30, 50, 75, 100, 250, and 500 observations; for each of these, I drew 1,000 independent samples and had JMP compute the fit for all possible continuous distributions. Last, for each sample I recorded the name of the best-fitting distribution, as measured by AICc. (JSL script available in the JMP File Exchange).
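
The same experiment can be sketched in miniature outside JMP. Here is a scaled-down Python/SciPy version of the idea, my own stand-in rather than the blog's JSL script: far fewer replicates, a single sample size, and only four candidate distributions, each fit by maximum likelihood and compared on AICc.

```python
import warnings
import numpy as np
from scipy import stats

warnings.filterwarnings("ignore")  # MLE fits can emit harmless optimizer warnings

rng = np.random.default_rng(0)

# Candidate set (a small subset of the 11 distributions JMP tries)
candidates = {
    "Normal": stats.norm,
    "Gamma": stats.gamma,
    "Weibull": stats.weibull_min,
    "Lognormal": stats.lognorm,
}

def aicc(log_l, k, n):
    """Corrected Akaike Information Criterion; lower is better."""
    return 2 * k - 2 * log_l + 2 * k * (k + 1) / (n - k - 1)

def best_fit(sample):
    """Fit each candidate by maximum likelihood and return the AICc winner."""
    scores = {}
    for name, dist in candidates.items():
        params = dist.fit(sample)
        log_l = dist.logpdf(sample, *params).sum()
        scores[name] = aicc(log_l, len(params), len(sample))
    return min(scores, key=scores.get)

# 100 replicates at n = 20 (the blog used 1,000 replicates per sample size)
wins = {name: 0 for name in candidates}
for _ in range(100):
    wins[best_fit(rng.normal(50, 10, size=20))] += 1
print(wins)
```

Because this sketch uses a different fitting engine and candidate list than JMP, the win counts will not match the blog's figures exactly, but the qualitative lesson is the same: at small n, the true distribution frequently loses.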

The results were quite surprising!

• Remember, the correct answer in each case is “Normal”. If our fishing expedition were yielding good results across the board, the line for the Normal distribution should be high and flat, hovering near 100%.
• Instead, the wrong distribution was chosen with disturbing frequency. For sample sizes under 50, the Normal distribution was not even the most commonly chosen. That honor belongs to the Weibull distribution.
• For a sample size of 5 observations from a Normal distribution, the correct identification was not made a single time out of 1,000 samples.
• If you want to have at least a 50% chance of correctly identifying normally distributed data by this method, you’ll need more than 100 observations!
• Even at a sample size of 500 observations, the likelihood of the normal distribution being correctly called the best is only about 80%.

### The moral of the story

When comparing the fit of different distributions to a data set, don’t assume that the distribution with the smallest AICc is the correct one. What matters is the relative magnitude of the AICc statistics. A rule of thumb (used elsewhere in JMP) is that models whose AICc values are within 10 units of the “best” one are roughly equivalent.* In our first example above, the Gamma distribution is nominally the best, but its AICc is only 0.2 units lower than that of the Normal distribution. That is not good statistical evidence for choosing the Gamma over the Normal.
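
The rule of thumb is easy to apply mechanically. A tiny sketch, using hypothetical AICc values (the blog does not report its full table) chosen so that Gamma edges out Normal by 0.2 units, as in the example:

```python
# Hypothetical AICc values (illustrative only, not the blog's actual output),
# constructed so Gamma is nominally best by a 0.2-unit margin.
fits = {"Gamma": 376.2, "Normal": 376.4, "Exponential": 458.9}

best = min(fits.values())
# Models within 10 AICc units of the best are treated as roughly equivalent.
equivalent = [name for name, score in fits.items() if score - best <= 10]
print(equivalent)  # → ['Gamma', 'Normal']
```

Here the Exponential is clearly ruled out, but nothing in the AICc values alone separates Gamma from Normal; that decision should fall back on subject-matter knowledge.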

More generally, as a best practice it is wise to consider only distributions that make sense in the context of the problem. Your own knowledge and expertise are usually the best guides. Don’t choose an exotic distribution that has a slightly better fit over one that makes sense and has a proven track record in your field of work.

*This rule is used to compare models built in the Generalized Regression personality of the Fit Model platform in JMP Pro. See Burnham, K.P. and Anderson, D.R. (2002), Model Selection And Multimodel Inference: A Practical Information Theoretic Approach. Springer, New York.

The post Is that the best (distribution) you've got? appeared first on JMP Blog. 