Distribution

August 10, 2021
 

This post demonstrates how to rank data and how to place these ranks into roughly equal groups.

There are certain variables, such as annual salary, that are highly skewed. Many people earn between $50,000 and $150,000, but some earn millions or even hundreds of millions of dollars a year. Using variables like annual salary directly in statistical models typically violates the assumptions of many popular statistical techniques. There are several solutions to the distribution problems just described. One solution is to apply a transformation, such as a logarithm, to "bring in the tail." Another is to substitute ranks for the original values: the lowest salary is assigned a rank of one, the next lowest a rank of two, and so forth. A third method is to place all of the values into a number of bins. For example, you could place all the salaries into ranges constructed so that each range contains approximately the same number of values.

You can use SAS Studio tasks to create ranks and, with a tiny bit of editing, create salary ranges.

Let's start with a data set called Salary that was created by a small program using a random number function. Shown below is a histogram and a smooth line representing 1,000 values of salary from this data set.

You see a grouping of values on the left side of the distribution and a few very high salaries in the right tail. For curious readers, here is the program that generated these data values.
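The program itself appears as an image in the original post; a minimal reconstruction consistent with the description (1,000 observations drawn with the RAND function; the seed and scaling constants are illustrative assumptions, not the author's actual values) might look like this:

data Salary;
   call streaminit(13579);   /* illustrative seed for reproducibility */
   do Subj = 1 to 1000;
      /* rand('exponential') returns a right-skewed value with mean 1;
         scale and shift it into a plausible salary range */
      Salary = round(30000 + 60000*rand('exponential'), 100);
      output;
   end;
run;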

The RAND function can generate quite a few distributions, such as uniform and normal. For this program, an exponential distribution was used.

Suppose you plan to use yearly salary in a binary logistic regression model. Using the actual values from the Salary data set would not work well. Let's start out by creating a new variable that represents the rank of salary. In SAS Studio, this is easily done using the Rank Data task as one of the selections under the Data tab. You can see this in the figure below.

You choose the data set and variable to rank on the DATA tab, like this.

The Salary data set was selected, and the variable Salary was chosen as the variable (column) to rank. Finally, Rank_Salary was selected as the output data set name. A histogram of the ranks is, as you would expect, uniform, ranging from one to 1,000 (see figure below).

How can you place these 1,000 values into 10 bins? To do this, you click the CODE tab and then click Edit (circled in the figure below).

All you need to do is add the PROC RANK option Groups=10 to this program as shown next.
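Here is a sketch of the edited program (the exact code that SAS Studio generates may differ slightly; the data set and variable names follow the choices described above):

proc rank data=WORK.SALARY out=WORK.RANK_SALARY groups=10;
   var Salary;
   ranks Rank_Salary;
run;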

This option collapses the ranks into 10 groups, numbered 0 through 9. Below is a histogram of the variable Rank_Salary with the GROUPS= option included.

This new variable would work quite well in a logistic regression model or other types of regression.

If you found this blog post helpful, you might be interested in some of my books. As always, comments and/or suggestions are welcome.

How to Transform a Skewed Distribution to a Uniform Distribution was published on SAS Users.

July 29, 2021
 

When using ordinary least squares (OLS) regression, if your response or dependent attribute isn't close to a normal distribution, your analysis is going to be affected, and typically not in a good way. The farther your input data is from normality, the greater the impact on your model. One of the key assumptions of OLS and other regressions is that the dependent variable be at least approximately normally distributed.

This blog post discusses two distributional data transforms: the natural (or common) logarithm and the Johnson family of transforms [1], specifically the Su transform. The Su transform has been shown empirically to improve logistic regression performance when applied to the input attributes [2]. Here, I will focus primarily on the response, or dependent, attribute.

Example

In this example, the attribute called VALUE24 is the amount of revenue generated from campaign purchases that were obtained in the last 24 months. The histogram below shows the distribution, and the fitted line is what a normal distribution should look like given the mean and variance of this data.

You can clearly see that the normal curve overlaid on the histogram indicates that the data isn’t close to a normal distribution. The dependent attribute of VALUE24 will need to be transformed.

The following histogram shows the VALUE24 attribute transformed with a natural logarithm, with the normal curve plotted as before. While the logarithm transform is much better (the normal curve is much closer to the transformed distribution), there is still some room for improvement.
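As a minimal sketch, the transform itself is a one-line DATA step (the data set names are illustrative assumptions; LOG is the natural logarithm and requires strictly positive values):

data work.buytest_log;
   set work.buytest;            /* assumed source data set */
   log_value24 = log(value24);  /* natural log; value24 must be > 0 */
run;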

Below is the Su-transformed distribution of VALUE24. Notice that not only does the transformed variable closely follow a normal distribution, it is also centered at a mean of zero with a variance of one!

Benefits of the Su transform

The benefits of using this transform for use in predictive models are:

  • Link-free consistency, obtained by allowing an elliptically contoured predictor space.
  • A reduction in nonlinear confounding, which leads to less misspecification.
  • Local adaptivity: suitable transforms allow models to balance their sensitivity to dense versus sparse regions of the predictor space [2].

For details about the first benefit, please refer to Potts [2]. The second benefit is that your model has a much greater chance of being specified correctly, owing to a reduction in confounding among attributes; this improves not only your model's accuracy but its interpretability as well. The last major benefit is that the Su transform brings in very long tails of a distribution much better than a log or other transforms [1].

Computing the Su transform

The transform that makes the distribution work best for global smoothing, without resorting to nonparametric density estimation, is to estimate a Beta distribution, which can be very close to a normal distribution when certain parameters of the Beta distribution are set [2]. The actual equation for the Su family is below. However, we won't be using that exact equation.
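(The equation appears as an image in the original post. For reference, the standard Johnson Su transform, reconstructed from Johnson [1] with ξ denoting the location parameter, is:)

z = \gamma + \delta \, \sinh^{-1}\!\left(\frac{x - \xi}{\lambda}\right), \qquad \sinh^{-1}(u) = \ln\!\left(u + \sqrt{u^{2} + 1}\right)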

What we need is to frame the above equation as an optimal transformation that can be estimated using nonlinear regression, with normal scores as the response variable. The above equation can be framed as the following [2]:
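(This equation is also an image in the original post. A reconstruction consistent with the description, with the normal score \hat{z}_i of each observation as the response and the location parameter fixed at zero so that only \gamma, \delta, and \lambda remain, is:)

\hat{z}_i = \gamma + \delta \, \sinh^{-1}\!\left(\frac{x_i}{\lambda}\right) + \varepsilon_i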

In SAS, you can use the RANK procedure to compute the normal scores and the nonlinear regression procedure PROC NLIN (or, on the completely in-memory SAS Viya platform, PROC NLMOD) to solve for the three parameters λ, δ, γ.
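A minimal sketch of this fitting step, assuming illustrative data set names, variable names, and starting values (this is not the author's actual code):

proc rank data=work.buytest out=work.scores normal=blom;
   var value24;
   ranks z_value24;    /* normal scores serve as the response */
run;

proc nlin data=work.scores;
   parms gamma=0 delta=1 lambda=1;   /* illustrative starting values */
   bounds lambda > 0;
   u = value24 / lambda;
   /* log(u + sqrt(u*u + 1)) computes the inverse hyperbolic sine */
   model z_value24 = gamma + delta*log(u + sqrt(u*u + 1));
run;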

However, there is one serious downside to fitting a Su transform with the above nonlinear function. Unlike the log and some other transforms, it doesn't have an inverse link function to transform the data back to its original form. There is one saving grace, however: you can perform an empirical data simulation and develop score code that mimics the relationship between the Su-transformed and the untransformed data. This enables the Su transform to be estimated on a good, statistically representative sample and then applied to the larger data set from which the sample was derived. The scatter plot below shows the general relationship between VALUE24 and the Su-transformed VALUE24.

Using PROC GENSELECT, the empirical fit of the above relationship can be performed and score code generated. While other SAS procedures, such as the TRANSREG and GLMSELECT procedures, can be used to do the same fitting, they won't at present write out score code when computed EFFECT statements are used. After fitting this data with a sixth-degree polynomial spline, the fitted curve is shown in the scatter plot below.

The following SAS score code was generated from the GENSELECT procedure and written to a .sas file on the server.

SAS code used to fit the sixth-degree curve

ods graphics on;

title 'General Smoothing Spline for Value24 Empirical Link Function';

proc genselect data=casuser.merge_buytest;
   effect poly_su24 = poly(su_value24 / degree=6);
   model value24 = poly_su24 / distribution=normal;
   code file="/shared/users/racoll/poly_value24_score.sas";
   output out=casuser.glms_out pred=pred_value24 copyvar=id;
run;
title;

Generated SAS scoring code from PROC GENSELECT

drop _badval_ _linp_ _temp_ _i_ _j_; 
_badval_ = 0; 
_linp_   = 0; 
_temp_   = 0; 
_i_      = 0; 
_j_      = 0; 
drop MACLOGBIG; 
MACLOGBIG= 7.0978271289338392e+02; 
array _xrow_0_0_{7} _temporary_; 
array _beta_0_0_{7} _temporary_ (    210.154143004534 
    123.296571835751 
    49.7657489425605 
    9.82725091766856 
    -3.13300266038598 
    -0.70029417420551 
    0.17335174313709); 
array _xtmp_0_0_{7} _temporary_; 
array _xcomp_0_0_{7} _temporary_; 
array _xpoly1_0_0_{7} _temporary_;  
 if missing(su_value24) 
 then do; 
    _badval_ = 1; 
    goto skip_0_0; 
end;   
 
do _i_=1 to 7; _xrow_0_0_{_i_} = 0; end; 
do _i_=1 to 7; _xtmp_0_0_{_i_} = 0; end; 
do _i_=1 to 7; _xcomp_0_0_{_i_} = 0; end; 
do _i_=1 to 7; _xpoly1_0_0_{_i_} = 0; end;   
 
_xtmp_0_0_[1] = 1;   
 
_temp_ = 1; 
_xpoly1_0_0_[1] = su_value24; 
_xpoly1_0_0_[2] = su_value24 * _xpoly1_0_0_[1]; 
_xpoly1_0_0_[3] = su_value24 * _xpoly1_0_0_[2]; 
_xpoly1_0_0_[4] = su_value24 * _xpoly1_0_0_[3]; 
_xpoly1_0_0_[5] = su_value24 * _xpoly1_0_0_[4]; 
_xpoly1_0_0_[6] = su_value24 * _xpoly1_0_0_[5]; 
do _j_=1 to 1; _xtmp_0_0_{1+_j_} = _xpoly1_0_0_{_j_}; end;   
 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{2+_j_} = _xpoly1_0_0_{_j_+1}; end; 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{3+_j_} = _xpoly1_0_0_{_j_+2}; end; 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{4+_j_} = _xpoly1_0_0_{_j_+3}; end; 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{5+_j_} = _xpoly1_0_0_{_j_+4}; end; 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{6+_j_} = _xpoly1_0_0_{_j_+5}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+0} = _xtmp_0_0_{_j_+0}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+1} = _xtmp_0_0_{_j_+1}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+2} = _xtmp_0_0_{_j_+2}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+3} = _xtmp_0_0_{_j_+3}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+4} = _xtmp_0_0_{_j_+4}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+5} = _xtmp_0_0_{_j_+5}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+6} = _xtmp_0_0_{_j_+6}; end;  
 
do _i_=1 to 7; 
_linp_ + _xrow_0_0_{_i_} * _beta_0_0_{_i_}; 
end;   
 
skip_0_0: 
label P_VALUE24 = 'Predicted: VALUE24'; 
if (_badval_ eq 0) and not missing(_linp_) then do; 
  P_VALUE24 = _linp_; 
end; else do; 
    _linp_ = .; 
    P_VALUE24 = .; 
end;

This score code can be placed in a SAS DATA step along with your analytical model's score code so that your predicted dependent variable VALUE24 can be back-transformed to its original scale.
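A hedged usage sketch (the data set names are illustrative assumptions; %INCLUDE simply injects the generated score code into the DATA step, which then assigns P_VALUE24):

data work.back_transformed;
   set work.model_predictions;   /* must contain su_value24 */
   %include "/shared/users/racoll/poly_value24_score.sas";
   /* the score code assigns P_VALUE24, the back-transformed prediction */
run;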

If you liked this blog post, then you might like my latest book, Segmentation Analytics with SAS® Viya®: An Approach to Clustering and Visualization.

References

[1] Johnson, N. L. "Systems of Frequency Curves Generated by Methods of Translation." Biometrika, 1949.

[2] Potts, W. "Elliptical Predictors for Logistic Regression." Keynote address, SAS Data Mining Conference, Las Vegas, NV, 2006.

 

 

An analytic transform for cantankerous data was published on SAS Users.

October 6, 2016
 

As data analysts, we all try to do the right thing. When there is a choice of statistical distributions to be used for a given application, it’s a natural inclination to try to find the “best” one.

But beware...

Fishing for the best distribution can lead you into a trap. Just because one option appears to be best – that doesn’t mean that it’s correct! For example, consider this data set:

[Figure: histogram of the data set]

What is the best distribution we can use to describe this data? JMP can help us answer this question. From the Distribution platform, we can choose to fit a number of common distributions to the data: Normal, Weibull, Gamma, Exponential, and others. To fit all possible continuous distributions to this data in JMP, go to the red triangle hotspot for this variable in the Distribution report, and choose “Continuous Fit > All”. Here is the result:

[Figure: results of Continuous Fit > All]

JMP has compared 11 potential distributions for this data, and ranked them from best (Gamma) to worst (Exponential). The metric used to perform the ranking is the corrected Akaike Information Criterion (AICc). Lower values of AICc indicate better fit, and so the Gamma distribution is the winner here.
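For reference, AICc is Akaike's information criterion with a small-sample correction. For a candidate distribution with k estimated parameters, n observations, and maximized likelihood \hat{L}:

\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}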

Here’s the catch

This data set was generated by drawing a random sample of size 50 from a population that is normally distributed with a mean of 50 and a standard deviation of 10. The Normal distribution is the correct answer by definition, but our fishing expedition gave us a misleading result.

How often is there a mismatch like this? One way we can approach this question is through simulation. I wrote a small JMP script to draw samples of various sizes from a normally distributed population. I investigated sample sizes of 5, 10, 20, 30, 50, 75, 100, 250, and 500 observations; for each of these, I drew 1,000 independent samples and had JMP compute the fit for all possible continuous distributions. Last, for each sample I recorded the name of the best-fitting distribution, as measured by AICc. (JSL script available in the JMP File Exchange).

The results were quite surprising!

[Figure: percentage of samples for which each distribution was chosen as best, by sample size]

  • Remember, the correct answer in each case is “Normal”. If our fishing expedition were yielding good results across the board, the line for the Normal distribution should be high and flat, hovering near 100%.
  • Instead, the wrong distribution was chosen with disturbing frequency. For sample sizes under 50, the Normal distribution was not even the most commonly chosen; that honor belonged to the Weibull distribution.
  • For a sample size of 5 observations from a Normal distribution, the correct identification was not made a single time out of 1,000 samples.
  • If you want to have at least a 50% chance of correctly identifying normally distributed data by this method, you’ll need more than 100 observations!
  • Even at a sample size of 500 observations, the likelihood of the normal distribution being correctly called the best is only about 80%.

The moral of the story

When comparing the fit of different distributions to a data set, don’t assume that the distribution with the smallest AICc is the correct one. The relative magnitudes of the AICc statistics are what count. A rule of thumb (used elsewhere in JMP) is that models whose AICc values are within 10 units of the “best” one are roughly equivalent.* In our first example above, the Gamma distribution is nominally the best, but its AICc is only 0.2 units lower than that of the Normal distribution. There is no good statistical evidence to choose the Gamma over the Normal.

More generally, as a best practice it is wise to consider only distributions that make sense in the context of the problem. Your own knowledge and expertise are usually the best guides. Don’t choose an exotic distribution that has a slightly better fit over one that makes sense and has a proven track record in your field of work.

*This rule is used to compare models built in the Generalized Regression personality of the Fit Model platform in JMP Pro. See Burnham, K. P., and Anderson, D. R. (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York.

tags: Distribution, Statistics, Tips and Tricks

The post Is that the best (distribution) you've got? appeared first on JMP Blog.

October 3, 2016
 

By now you may have heard that in JMP 13, the most frequently used features of reports created in Graph Builder can be saved as interactive HTML, which can then be viewed using just a web browser.

Getting Graph Builder output to work for the web in JMP 13 involved bringing new features to several graphical elements that had been available in interactive HTML output since JMP 11. Areas and lines can be used to display some of the same information as points but in a different way. Exploring these stacked areas in interactive HTML, you can now see the values along the edge of the area.

[Figure: stacked area chart of smartphone OS data]

The tooltips for lines display the rows that are included in each point along the line as well as information about the values. Graph Builder gives you the ability to customize various attributes of the lines. The example below combines lines using different drawing styles with annotations and the gray reference ranges to create a rich graph.

[Figure: marriage and divorce rates shown with customized line styles]

While the most heavily used graph types and options are exported as interactive HTML, the remaining ones are exported as static images. Contour plots are exported as static images; however, if your data is categorical, Graph Builder produces violin plots, which are exported as interactive HTML. Below you can see the close relationship between the violin plot and another Graph Builder element, the box plot.

[Figure: violin plot of the Iris data]

What if you want to bin data into categories to explore their distribution? There are a number of ways to do this in Graph Builder. The histogram is available in Interactive HTML in the Distribution platform (as well as options in several other JMP platforms), but now can also be exported to the web after exploring your data in a drag-and-drop manner in Graph Builder in JMP. Below is an example created using Titanic passenger data to examine the distribution of ages.

[Figure: histogram of Titanic passenger ages]

A mosaic plot is used to examine the relationship between two categorical variables. Cells give informative tooltips regarding the share and number of rows associated with each cell, and cells can be selected with rows being linked to other related charts in the report.

[Figure: mosaic plot of Titanic passenger data]

In JMP, you can use Dashboard Builder to create reports with several types of Graph Builder output on the same page, so that people who do not have JMP can interactively explore your data. Here, a mosaic plot, bar charts, and histograms are combined to analyze the importance of different goals to schoolchildren.

[Figure: dashboard combining a mosaic plot, bar charts, and histograms]

These are just a few examples of the powerful graphs you can create to explore your data in Graph Builder and share with others using interactive HTML. The graphs shown here as well as a few other examples are available as live interactive HTML files to explore on the web at http://www.jmp.com/jmphtml5/, but be sure to try your own Graph Builder creations!

tags: Dashboard, Dashboard Builder, Data Visualization, Distribution, Graph Builder, Interactive HTML, JMP 13

The post Interactive HTML: Lines, mosaic plots and more for Graph Builder appeared first on JMP Blog.

September 4, 2016
 

You might say I love sports. I began swimming at a very early age and participated on swim teams for many years. Gymnastics, volleyball, softball, basketball and even track teams were all part of my life, and I loved playing and competing. So maybe that is why I always love […]

The post Looking at Summer Games data with JMP appeared first on JMP Blog.

July 22, 2016
 

Let's say you are in the Distribution platform in JMP, and you have created a report that you wish to drill down into. Well, the Local Data Filter can help with that. But perhaps you also want to share a portion of the data with a co-worker, and not just […]

The post Video: Subsetting data from a JMP Distribution report appeared first on JMP Blog.

June 8, 2016
 

In a previous post, I wrote how pedigree might be used to help predict outcomes of horse races. In particular, I discussed a metric called the Dosage Index (DI), which appeared to be a leading indicator of success (at least historically). In this post, I want to introduce the Center […]

The post What does a winning thoroughbred horse look like? appeared first on JMP Blog.

December 20, 2012
 
In the last 30 years, supply chains have been asked to take on a leading role as facilitators of the pull methodology, which arose from the need to be more responsive to customer demand. This new, reactive supply chain would enable practitioners to act quickly upon demand signals [...]
March 14, 2012
 
The Distribution platform, frequently the backbone of data exploration, is one of the most widely used platforms in JMP. The JMP 10 Distribution platform has additional customizations and options to make this phase of data exploration even more individualized and easier than before. This blog post will focus on the reports [...]