I encountered a wonderful survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. Not only are the authors major contributors to the field of robust estimation, but the article is short and very readable. This blog post walks through the examples in the paper and shows how to compute each example by using SAS. In particular, this post shows how to compute robust estimates of location for univariate data. Future posts will show how to compute robust estimates of scale and multivariate estimates.

The Rousseeuw and Hubert article begins with a quote:

> In real data sets, it often happens that some observations are different from the majority. Such observations are called outliers. ... They do not fit the model well. It is very important to be able to detect these outliers.

The quote explains why outlier detection is connected to robust estimation methods. Classical statistical estimators are so affected by the outliers that "the resulting fitted model does not allow [you] to detect the deviating observations." The goal of robust statistical methods is to "find a fit that is close to the fit [you] would have found without the [presence of] outliers." You can then identify the outliers by their large deviation from the robust model.

The simplest example is computing the "center" of a set of data, which is known as estimating location. Consider the following five measurements:

`6.25, 6.27, 6.28, 6.34, 63.1`

As the song says, *one of these points is not like the other....* The last datum is probably a miscoding of 6.31.

## Robust estimate of location in SAS/IML software

SAS/IML software contains several functions for robust estimation. For estimating location, the MEAN and MEDIAN functions are the primary computational tools. It is well known that the mean is sensitive to even a single outlier, whereas the median is not. The following SAS/IML statements compute the mean and median of these data:

    proc iml;
    x = {6.25, 6.27, 6.28, 6.34, 63.1};
    mean = mean(x);       /* or x[:] */
    median = median(x);
    print mean median;

The mean (17.648) is not representative of the bulk of the data, but the median (6.28) is.

Although the survey article doesn't mention it, there are two other robust estimators of location that have been extensively studied. They are the *trimmed mean* and the *Winsorized mean*:

    trim = mean(x, "trimmed", 0.2);       /* 20% of obs */
    winsor = mean(x, "winsorized", 1);    /* one obs */
    print trim winsor;

The trimmed mean is computed by excluding the *k* smallest and *k* largest values, and computing the mean of the remaining values. The Winsorized mean is computed by replacing the *k* smallest values with the *(k+1)st* smallest, and replacing the *k* largest values with the *(k+1)st* largest. The mean of the resulting values (the data set keeps its original size) is the Winsorized mean. For both estimators, you can specify either the number of observations to trim or Winsorize, or a proportion of the observations (such as 0.2 for 20%). Formulas for the trimmed and Winsorized means are included in the documentation of the UNIVARIATE procedure. If you prefer an example, here are the equivalent computations for *k* = 1, which rely on the fact that these five values are already sorted:

    trim2 = mean( x[2:4] );
    winsor2 = mean( x[2] // x[2:4] // x[4] );
    print trim2 winsor2;
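The subscript computations above work only because the five values happen to be in ascending order. As a rough sketch of the same definitions for unsorted data and a general *k*, the following SAS/IML modules sort a copy of the data first. The module names `TrimMeanK` and `WinsorMeanK` are hypothetical helpers invented for this illustration, not built-in functions, and the sketch assumes that the data contain no missing values and that *k* is less than half the number of observations.

```sas
proc iml;
/* Hypothetical helper (not built in): mean after dropping the
   k smallest and k largest values.
   Assumes no missing values and 0 <= k < n/2. */
start TrimMeanK(x, k);
   y = colvec(x);                  /* work on a copy, as a column vector */
   call sort(y);                   /* ascending order */
   n = nrow(y);
   return( mean(y[(k+1):(n-k)]) );
finish;

/* Hypothetical helper (not built in): mean after replacing the
   k smallest values by the (k+1)st smallest and the k largest
   values by the (k+1)st largest. Same assumptions as above. */
start WinsorMeanK(x, k);
   y = colvec(x);
   call sort(y);
   n = nrow(y);
   if k > 0 then do;
      y[1:k] = y[k+1];             /* pull in the lower tail */
      y[(n-k+1):n] = y[n-k];       /* pull in the upper tail */
   end;
   return( mean(y) );
finish;

x = {63.1, 6.28, 6.25, 6.34, 6.27};   /* same data, deliberately unsorted */
trim3   = TrimMeanK(x, 1);
winsor3 = WinsorMeanK(x, 1);
print trim3 winsor3;
quit;
```

For these five measurements, the results should agree with the built-in MEAN function: a trimmed mean of about 6.297, which is (6.27 + 6.28 + 6.34)/3, and a Winsorized mean of 6.3, which is (6.27 + 6.27 + 6.28 + 6.34 + 6.34)/5. In practice the built-in options are the better choice; the modules only make the definitions concrete.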

## Robust estimates in the UNIVARIATE procedure

The UNIVARIATE procedure also supports these robust estimators. The trimmed and Winsorized means are computed by using the TRIM= and WINSOR= options, respectively. Not only does PROC UNIVARIATE compute the robust estimates, but it also computes their standard errors, as shown in the following example.

    data a;
    input x @@;
    datalines;
    6.25 6.27 6.28 6.34 63.1
    ;
    run;

    proc univariate data=a trim=0.2 winsor=1;
       var x;
       ods select BasicMeasures TrimmedMeans WinsorizedMeans;
    run;

Next time: robust estimates of scale.