Some practicing statisticians (especially those over a certain age!) learned parametric modeling techniques in school, but later incorporated nonparametric techniques into their arsenal of statistical tools. A very common nonparametric technique is kernel density estimation, in which the PDF at a point, x is estimate by a weighted average of the sample data in a neighborhood of x. Kernel density estimates are widely used, especially when the data appear to be multimodal.
Kernel Density Estimation in SAS
As before, consider the cholesterol of 2,873 women in the Framingham Heart Study as recorded in the SAS data set SASHELP.Heart, which is distributed with SAS/GRAPH software in SAS 9.2. You can use the UNIVARIATE procedure to construct a kernel density estimate for the cholesterol variable.
The kernel density estimator has a parameter (called the bandwidth) that determines the size of the neighborhood used in the computation to compute the estimate. Small values of the bandwidth result in wavy, wiggly, KDEs, whereas large values result in smooth KDEs. The UNIVARIATE procedure has various methods to select the bandwidth automatically. For this example, I use the SheatherJones plugin (SJPI) bandwidth.
The following statements use the OUTKERNEL= option to output the kernel density estimate at a sequence of regularly space values, and the VSCALE= option to set the vertical scale of the histogram to the density scale:
data heart; set sashelp.heart; where sex="Female"; run; ods graphics on; proc univariate data=heart; var cholesterol; histogram / vscale=proportion kernel(C=SJPI) outkernel=density; run;
The nonparametric kernel density estimate exhibits features that are not captured by parametric models. For example, the KDE for these data show several bumps and asymmetries that are not captured by the assumption that the data are normally distributed.
The DENSITY data set contains six variables. The _VALUE_ variable represents the X locations for the density curve and the _DENSITY_ variable represents the Y locations. Therefore the area under the density estimate curve can be computed by numerically integrating those two variables.
Overcoming a Bug in the OUTKERNEL= Option
Unfortunately, there is a bug in the OUTKERNEL= option in SAS 9.2 which results in the DENSITY data set containing two (identical) copies of the KDE curve! (The bug is fixed in SAS 9.3.) Here's how to remove the second copy: find the maximum value of the X variable and delete all observations that come after it. The maximum value of X ought to be the last observation in the DENSITY data. The following statements read the density curve into the SAS/IML language and eliminate the second copy of the curve:
proc iml; /** read in kernel density estimate **/ use density; read all var {_value_} into x; read all var {_density_} into y; close density; /** work around bug in OUTKERNEL data set in SAS 9.2 **/ /** remove second copy of data **/ max = max(x); idx = loc(x=max); if ncol(idx)>1 then do; x = x[1:idx[1]]; y = y[1:idx[1]]; end;
The previous statements remove the second copy of the data if it exists, but do nothing if the second copy does not exist. Therefore it is safe to run the statements in all versions of SAS.
The Area under a Kernel Density Estimate Curve
The kernel density estimate is not a simple function. The points on the curve are available only as a sequence of (X, Y) pairs in the DENSITY data set. Consequently, to integrate the KDE you need to use a numerical integration technique. For this example, you can use the TrapIntegral module from my blog post on the trapezoidal rule of integration.
The following statements define the TrapIntegral module and verify that the KDE curve is correctly defined by integrating the curve over the entire range of X values. The area under a density estimate should be 1. If the area is not close to 1, then something is wrong:
start TrapIntegral(x,y); N = nrow(x); dx = x[2:N]  x[1:N1]; meanY = ( y[2:N] + y[1:N1] )/2; return( dx` * meanY ); finish; /** compute the integral over all x **/ Area = TrapIntegral(x,y); print Area;
0.9999992 
Ahh! Success! The area under the KDE is 1 to within an acceptable tolerance. Now to the real task: compute the area under the KDE curve for certain medically interesting values of cholesterol. In particular, the following statements estimate the probability that the cholesterol of a random woman in the population is less than 200 mg/dL (desirable), or greater than 240 mg/dL (high):
/** compute the integral for x < 200 **/ idx = loc(x<200); lt200 = TrapIntegral(x[idx],y[idx]); /** compute the integral for x >= 240 **/ idx = loc(x>=240); ge240 = TrapIntegral(x[idx],y[idx]); print lt200 ge240;
ge240 

0.291803 
0.35916 
If you believe that the KDE for these data is a good estimate of the distribution of cholesterol in women in the general population, the computations show that about 29% of women have healthy cholesterol levels, whereas 36% have high levels of cholesterol. These are approximately the same values that were computed by using empirical probability estimates, as shown in my original post on this topic. They differ by about 2% from the parametric model, which assumes normally distributed data.