On discussion forums, I often see questions that ask how to Winsorize variables in SAS. For example, here are some typical questions from the SAS Support Community:
- I want an efficient way of replacing (upper) extreme values with (95th) percentile. I have a data set with around 600 variables and want to get rid of extreme values of all 600 variables with 95th percentile.
- I have several (hundreds of) variables that I need to “Winsorize” at the 95% and 5%. I want all the observations with values greater 95th percentile to take the value of the 95th percentile, and all observations with values less than the 5th percentile to take the value of the 5th percentile.
It is clear from the questions that the programmer wants to modify the extreme values of dozens or hundreds of variables. As we will soon learn, neither of these requests satisfy the standard definition of Winsorization. What is Winsorization of data? What are the pitfalls and what are alternative methods?
Winsorization: Definition, pitfalls, and alternatives #StatWisdom
Click To Tweet
What is Winsorization?
The process of replacing a specified number of extreme values with a smaller data value has become known as Winsorization or as Winsorizing the data. Let's start by defining Winsorization.
Winsorization began as a way to "robustify" the sample mean, which is sensitive to extreme values. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of this new set of numbers is called the Winsorized mean. If the data are from a symmetric population, the Winsorized mean is a robust unbiased estimate of the population mean.
The graph to right provides a visual comparison. The top graph shows the distribution of the original data set. The bottom graph shows the distribution of Winsorized data for which the five smallest and five largest values have been modified. The extreme values were not deleted but were replaced by the sixth smallest or largest data value.
I consulted the Encyclopedia of Statistical Sciences (Kotz et al. (Eds), 2nd Ed, 2006) which has an article "Trimming and Winsorization " by David Ruppert (Vol 14, p. 8765). According to the article:
- Winsorizaion is symmetric: Some people want to modify only the large data values. However, Winsorization is a symmetric process that replaces the k smallest and the k largest data values.
- Winsorization is based on counts: Some people want to modify values based on quantiles, such as the 5th and 95th percentiles. However, using quantiles might not lead to a symmetric process. Let k1 be the number of values less than the 5th percentile and let k2 be the number of values greater than the 95th percentile. If the data contain repeated values, then k1 might not equal to k2, which means that you are potentially changing more values in one tail than in the other.
As shown by the quotes at the top of this article, posts on discussion forums sometimes muddle the definition of Winsorization. If you modify the data in an unsymmetric fashion, you will produce biased statistics.
Winsorization: The good
Why do some people want to Winsorize their data? There are a few reasons:
- Classical statistics such as the mean and standard deviation are sensitive to extreme values. The purpose of Winsorization is to "robustify" classical statistics by reducing the impact of extreme observations.
- Winsorization is sometimes used in the automated processing of hundreds or thousands of variables when it is impossible for a human to inspect each and every variable.
- If you compare a Winsorized statistic with its classical counterpart, you can identify variables that might contain contaminated data or are long-tailed and require special handling in models.
Winsorization: The bad
There is no built-in procedure in SAS that Winsorizes variables, but there are some user-defined SAS macros on the internet that claim to Winsorize variables. BE CAREFUL! Some of these macros do not correctly handle missing values. Others use percentiles to determine the extreme values that are modified. If you must Winsorize, I have written a SAS/IML function that Winsorizes data and correctly handles missing values.
As an alternative to Winsorizing your data, SAS software provides many modern robust statistical methods that have advantages over a simple technique like Winsorization:
- For regression, the ROBUSTREG procedure provides four different methods for handling univariate and multivariate outliers and high-leverage points. You can also use PROC ROBUSTREG to compute robust univariate statistics and to identify multivariate outliers.
- Quantile regression (PROC QUANTREG) also provides robust estimates.
- For univariate data, the UNIVARIATE procedure provides several robust estimates of location and scale. You can use the ROBUSTSCALE option, the TRIMMED= option, and the WINSORIZED= option to obtain robust estimates without modifying the original data. SAS provides these statistics by using the original data.
Winsorization: The ugly
If the data contains extreme values, then classical statistics are influenced by those values. However, modifying the data is a draconian measure. Recently I read an article by John Tukey, one of the early investigator of robust estimation. In the article "A survey of sampling from contaminated distributions" (1960), Tukey says (p. 457) that when statisticians encounter a few extreme values in data,
we are likely to think of them as 'strays' [or] 'wild shots' ... and to focus our attention on how normally distributed the rest of the distribution appears to be. One who does this commits two oversights, forgetting Winsor's principle that 'all distributions are normal in the middle,' and forgetting that the distribution relevant to statistical practice is that of the values actually provided and not of the values which ought to have been provided.
A little later in the essay (p. 458), he says
Sets of observations which have been de-tailed by over-vigorous use of a rule for rejecting outliers are inappropriate, since they are not samples.
I love this second quote. All of the nice statistical formulas that are used to make inferences (such as standard errors and confidence intervals) are based on the assumption that the data are a random sample that contains all of the observed values, even extreme values. The tails of a distribution are extremely important, and indiscriminately modifying large and small values invalidates many of the statistical analyses that we take for granted.
Should you Winsorize data? Tukey argues that indiscriminately modifying data is "inappropriate." In SAS, you can get the Winsorized mean directly from PROC UNIVARIATE. SAS also provides alternative robust methods such the ones in the ROBUSTREG and QUANTREG procedures.
If you decide to use Winsorization to modify your data, remember that the standard definition calls for the symmetric replacement of the k smallest (largest) values of a variable with the (k+1)st smallest (largest). If you download a program from the internet, be aware that some programs use quantiles and others do not handle missing values correctly.
What are your thoughts about Winsorizing data? Share them in the comments.