Statistical Programming


Prior to SAS/IML 14.2, every variable in the Interactive Matrix Language (IML) represented a matrix. That changed when SAS/IML 14.2 introduced two new data structures: data tables and lists. This article gives an overview of data tables. I will blog about lists in a separate article.

A matrix is a rectangular array that contains all numerical or all character values. Numerical matrices are incredibly useful for computations because linear algebra provides a powerful set of tools for implementing analytical algorithms. However, a matrix is somewhat limiting as a data structure. Matrices are two-dimensional, rectangular, and cannot contain mixed-type data (numeric AND character). Consequently, you can't use one single matrix to pass numeric and character data to a function.

Data tables in SAS/IML are in-memory versions of a data set. They contain columns that can be numeric or character, as well as column attributes such as names, formats, and labels. The data table is associated with a single symbol and can be passed to modules or returned from a module. the TableCreateFromDataSet function, as shown:

proc iml;
tClass = TableCreateFromDataSet("Sashelp", "Class"); /* SAS/IML 14.2 */

The function reads the data from the Sashelp.Class data set and creates an in-memory copy. You can use the tClass symbol to access properties of the table. For example, if you want to obtain the names of the columns in the table, you can use

varNames = TableGetVarName(tClass);
print varNames;
Column names for a data table in SAS/IML

Extracting columns and adding new columns

Data tables are not matrices. You cannot add, subtract, or multiply with tables. When you want to compute something, you need to extract the data into matrices. For example, if you want to compute the body-mass index (BMI) of the students in Sashelp.Class, you can use the TableAddVar function to add the BMI as a new column in the table:

Y = TableGetVarData(tClass, {"Weight" "Height"});
wt = Y[,1]; ht = Y[,2];                /* get Height and Weight variables */
BMI = wt / ht##2 * 703;                /* BMI formula */
call TableAddVar(tClass, "BMI", BMI);  /* add new "BMI" column to table */

Passing data tables to modules

As indicated earlier, you can use data tables to pass mixed-type data into a user-defined function. For example, the following statements define a module whose argument is a data table. The module prints the mean value of the numeric columns in the table, and it prints the number of unique levels for character columns. To do so, it first extracts the numeric data into a matrix, then later extracts the character data into a matrix.

start QuickSummary(tbl);
   type = TableIsVarNumeric(tbl);      /* 0/1 vector   */
   /* for numeric columns, print mean */
   idx = loc(type=1);                  /* numeric cols */
   if ncol(idx)>0 then do;             /* there is a numeric col */
      varNames = TableGetVarName(tbl, idx);         /* get names */
      m = TableGetVarData(tbl, idx);   /* extract numeric data   */
      mean = mean(m);
      print mean[colname=varNames L="Mean of Numeric Variables"];
   /* for character columns, print number of levels */
   idx = loc(type=0);                  /* character cols */
   if ncol(idx)>0 then do;             /* there is a character col */
      varNames = TableGetVarName(tbl, idx);           /* get names */
      m = TableGetVarData(tbl, idx);   /* extract character data   */
      levels = countunique(m, "col");
      print levels[colname=varNames L="Levels of Character Variables"];
run QuickSummary(tClass);
Pass data tables to SAS/IML functions and modules


SAS/IML 14.2 supports data tables, which are rectangular arrays of mixed-type data. You can use built-in functions to extract columns from a table, add columns to a table, and query the table for attributes of columns. For more information about data tables, see the chapter

The post Data tables: Nonmatrix data structures in SAS/IML appeared first on The DO Loop.


A categorical response variable can take on k different values. If you have a random sample from a multinomial response, the sample proportions estimate the proportion of each category in the population. This article describes how to construct simultaneous confidence intervals for the proportions as described in the 1997 paper "A SAS macro for constructing simultaneous confidence intervals for multinomial proportions" by Warren May and William Johnson (Computer Methods and Programs in Biomedicine, p. 153–162).

Estimates of multinomial proportions

In their paper, May and Johnson present data for a random sample of 220 psychiatric patients who were categorized as either neurotic, depressed, schizophrenic or having a personality disorder. The observed counts for each diagnosis are as follows:

data Psych;
input Category $21. Count;
Neurotic              91
Depressed             49
Schizophrenic         37
Personality disorder  43

If you divide each count by the total sample size, you obtain estimates for the proportion of patients in each category in the population. However, the researchers wanted to compute simultaneous confidence intervals (CIs) for the parameters. The next section shows several methods for computing the CIs.

Methods of computing simultaneous confidence intervals

May and Johnson discussed six different methods for computing simultaneous CIs. In the following, 1–α is the desired overall coverage probability for the confidence intervals, χ2(α, k-1) is the upper 1–α quantile of the χ2 distribution with k-1 degrees of freedom, and π1, π2, ..., πk are the true population parameters. The methods and the references for the methods are:

  1. Quesenberry and Hurst (1964): Find the parameters πi that satisfy
    N (pi - πi)2 ≤ χ2(α, k-1) πi(1-πi).
  2. Goodman (1965): Use a Bonferroni adjustment and find the parameters that satisfy
    N (pi - πi)2 ≤ χ2(α/k, 1) πi(1-πi).
  3. Binomial estimate of variance: For a binomial variable, you can bound the variance by using πi(1-πi) ≤ 1/4. You can construct a conservative CI for the multinomial proportions by finding the parameters that satisfy
    N (pi - πi)2 ≤ χ2(α, 1) (1/4).
  4. Fitzpatrick and Scott (1987): You can ignore the magnitude of the proportion when bounding the variance to obtain confidence intervals that are all the same length, regardless of the number of categories (k) or the observed proportions. The formula is
    N (pi - πi)2 ≤ c2(1/4)
    where the value c2= χ2(α/2, 1) for α ≤ 0.016 and where c2= (8/9)χ2(α/3, 1) for 0.016 < α ≤ 0.15.
  5. Q and H with sample variance: You can replace the unknown population variances by the sample variances in the Quesenberry and Hurst formula to get
    N (pi - πi)2 ≤ χ2(α, k-1) pi(1-pi).
  6. Goodman with sample variance: You can replace the unknown population variances by the sample variances in the Goodman Bonferroni-adjusted formula to get
    N (pi - πi)2 ≤ χ2(α/k, 1) pi(1-pi).

In a separate paper, May and Johnson used simulations to test the coverage probability of each of these formulas. They conclude that the simple Bonferroni-adjusted formula of Goodman (second in the list) "performs well in most practical situations when the number of categories is greater than 2 and each cell count is greater than 5, provided the number of categories is not too large." In comparison, the methods that use the sample variance (fourth and fifth in the list) are "poor." The remaining methods "perform reasonably well with respect to coverage probability but are often too wide."

A nice feature of the Q&H and Goodman methods (first and second on the list) is that they procduce unsymmetric intervals that are always within the interval [0,1]. In contrast, the other intervals are symmetric and might not be a subset of [0,1] for extremely small or large sample proportions.

Computing CIs for multinomial proportions

You can download SAS/IML functions that are based on May and Johnson's paper and macro. The original macro used SAS/IML version 6, so I have updated the program to use a more modern syntax. I wrote two "driver" functions:

  • CALL MultCIPrint(Category, Count, alpha, Method) prints a table that shows the point estimates and simultaneous CIs for the counts, where the arguments to the function are as follows:
    • Category is a vector of the k categories.
    • Count is a vector of the k counts.
    • alpha is the significance level for the (1-alpha) CIs. The default value is alpha=0.05.
    • Method is a number 1, 2, ..., 6 that indicates the method for computing confidence intervals. The previous list shows the method number for each of the six methods. The default value is Method=2, which is the Goodman (1965) Bonferroni-adjusted method.
  • MultCI(Count, alpha, Method) is a function that returns a three-column matrix that contains the point estimates, lower limit, and upper limit for the CIs. The arguments are the same as above, except that the function does not use the Category vector.

Let's demonstrate how to call these functions on the psychiatric data. The following program assumes that the function definitions are stored in a file called; you might have to specify a complete path. The PROC IML step loads the definitions and the data and calls the MultCIPrint routine and requests the Goodman method (method=2):

%include "";   /* specify full path name to file */
proc iml;
load module=(MultCI MultCIPrint);
use Psych; read all var {"Category" "Count"}; close;
alpha = 0.05;
call MultCIPrint(Category, Count, alpha, 2); /* Goodman = 2 */
Estimates and 95% simultaneous confidence intervals for m ultinomial proportions

The table shows the point estimates and 95% simultaneous CIs for this sample of size 220. If the intervals are wider than you expect, remember the goal: for 95% of random samples of size 220 this method should produce a four-dimensional region that contains all four parameters.

You can visualize the width of these intervals by creating a graph. The easiest way is to write the results to a SAS data set. To get the results in a matrix, call the MultCI function, as follows:

CI = MultCI(Count, alpha, 2);  /*or simply CI = MultCI(Count) */
/* write estimates and CIs to data set */
Estimate = CI[,1];  Lower = CI[,2];  Upper = CI[,3];
create CIs var {"Category" "Estimate" "Lower" "Upper"};
ods graphics / width=600px height=240px;
title 'Simultaneous Confidence Intervals for Multinomial Proportions';
title2 'Method of Goodman (1965)';
proc sgplot data=CIs;
  scatter x=Estimate y=Category / xerrorlower=Lower xerrorupper=upper
          markerattrs=(Size=12  symbol=DiamondFilled)
  xaxis grid label="Proportion" values=(0.1 to 0.5 by 0.05);
  yaxis reverse display=(nolabel);
Graph of estimates and simultaneous confidence intervals for multinomial proportions

The graph shows intervals that are likely to enclose all four parameters simultaneously. The neurotic proportion in the population is probably in the range [0.33, 0.50] and at the same time the depressed proportion is in the range [0.16, 0.30] and so forth. Notice that the CIs are not symmetric about the point estimates; this is most noticeable for the smaller proportions such as the schizophrenic category.

Because the cell counts are all relatively large and because the number of categories is relatively small, Goodman's CIs should perform well.

I will mention that it you use the Fitzpatrick and Scott method (method=4), you will get different CIs from those reported in May and Johnson's paper. The original May and Johnson macro contained a bug that was corrected in a later version (personal communication with Warren May, 25FEB2016).


This article presents a set of SAS/IML functions that implement six methods for computing simultaneous confidence intervals for multinomial proportions. The functions are updated versions of the macro %CONINT, which was presented in May and Johnson (1987). You can use the MultCIPrint function to print a table of statistics and CIs, or you can use the MultCI function to retrieve that information into a SAS/IML matrix.

tags: Statistical Programming

The post Simultaneous confidence intervals for multinomial proportions appeared first on The DO Loop.


A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to repeat the analysis for each unique value of one or more categorical variables. But did you know that a BY-group analysis can sometimes be used to replace macro loops? This article shows how you can efficiently run hundreds or thousands of different regression models by restructuring the data.

One model: Many samples

As I've written before, BY-group analysis is also an efficient way to analyze simulated sample or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.

In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of time are relatively slow.

As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.

Stuck in a macro loop? BY-group processing can be more efficient. #SASTip
Click To Tweet

Many models: One sample

There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=Xi, where i = 1 to 1000.

One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:

%macro RunReg(DSName, NumVars);
%do i = 1 %to &NumVars;                    /* repeat for each x&i */
   proc reg data=&DSName noprint
            outest=PE(rename=(x&i=Value)); /* save parameter estimates */
   model Y = x&i;                          /* model Y = x_i */
   /* ...then accumulate statistics... */

The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.

If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.

The BY way for many models

An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.

To run a BY-group analysis:

  1. Transpose the data from wide to long form. As part of this process, you need to create a variable (the BY-group variable) that will be unique for each model.
  2. Sort the data by the BY-group variable.
  3. Run the SAS procedure, which uses the BY statement to specify each model.

1. Transpose the data

In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the i_th element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.

/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */
data Long;
set Wide;                       /* <== specify data set name HERE         */
array x [*] x1-x&nCont;         /* <== specify explanatory variables HERE */
do varNum = 1 to dim(x);
   VarName = vname(x[varNum]);  /* variable name in char var */
   Value = x[varNum];           /* value for each variable for each obs */
drop x:;

2. Sort the data

In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:

/* 2. Sort by BY-group variable */
proc sort data=Long;  by VarName;  run;

3. Run the analyses

You can now call a SAS procedure one time to compute all regression models:

/* 3. Call PROC REG and use BY statement to compute all regressions */
proc reg data=Long noprint outest=PE;
by VarName;
model Y = Value;
/* Look at the results */
proc print data=PE(obs=5);
var VarName Intercept Value;

The PE data set contains the parameter estimates for every single-variable regression of Y onto Xi. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.


You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).

This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=Xi, but a similar result would work for GLM models such as Y=A|Xi, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.

Can you think of other ways to use this trick? Leave a comment.

tags: Data Analysis, Getting Started, Statistical Programming

The post An easy way to run thousands of regressions in SAS appeared first on The DO Loop.


On discussion forums, I often see questions that ask how to Winsorize variables in SAS. For example, here are some typical questions from the SAS Support Community:

  • I want an efficient way of replacing (upper) extreme values with (95th) percentile. I have a data set with around 600 variables and want to get rid of extreme values of all 600 variables with 95th percentile.
  • I have several (hundreds of) variables that I need to “Winsorize” at the 95% and 5%. I want all the observations with values greater 95th percentile to take the value of the 95th percentile, and all observations with values less than the 5th percentile to take the value of the 5th percentile.

It is clear from the questions that the programmer wants to modify the extreme values of dozens or hundreds of variables. As we will soon learn, neither of these requests satisfy the standard definition of Winsorization. What is Winsorization of data? What are the pitfalls and what are alternative methods?

Winsorization: Definition, pitfalls, and alternatives #StatWisdom
Click To Tweet

What is Winsorization?

The process of replacing a specified number of extreme values with a smaller data value has become known as Winsorization or as Winsorizing the data. Let's start by defining Winsorization.

Winsorization began as a way to "robustify" the sample mean, which is sensitive to extreme values. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of this new set of numbers is called the Winsorized mean. If the data are from a symmetric population, the Winsorized mean is a robust unbiased estimate of the population mean.

The graph to right provides a visual comparison. The top graph shows the distribution of the original data set. The bottom graph shows the distribution of Winsorized data for which the five smallest and five largest values have been modified. The extreme values were not deleted but were replaced by the sixth smallest or largest data value.

I consulted the Encyclopedia of Statistical Sciences (Kotz et al. (Eds), 2nd Ed, 2006) which has an article "Trimming and Winsorization " by David Ruppert (Vol 14, p. 8765). According to the article:

  • Winsorizaion is symmetric: Some people want to modify only the large data values. However, Winsorization is a symmetric process that replaces the k smallest and the k largest data values.
  • Winsorization is based on counts: Some people want to modify values based on quantiles, such as the 5th and 95th percentiles. However, using quantiles might not lead to a symmetric process. Let k1 be the number of values less than the 5th percentile and let k2 be the number of values greater than the 95th percentile. If the data contain repeated values, then k1 might not equal to k2, which means that you are potentially changing more values in one tail than in the other.

As shown by the quotes at the top of this article, posts on discussion forums sometimes muddle the definition of Winsorization. If you modify the data in an unsymmetric fashion, you will produce biased statistics.

Winsorization: The good

Why do some people want to Winsorize their data? There are a few reasons:

  • Classical statistics such as the mean and standard deviation are sensitive to extreme values. The purpose of Winsorization is to "robustify" classical statistics by reducing the impact of extreme observations.
  • Winsorization is sometimes used in the automated processing of hundreds or thousands of variables when it is impossible for a human to inspect each and every variable.
  • If you compare a Winsorized statistic with its classical counterpart, you can identify variables that might contain contaminated data or are long-tailed and require special handling in models.

Winsorization: The bad

There is no built-in procedure in SAS that Winsorizes variables, but there are some user-defined SAS macros on the internet that claim to Winsorize variables. BE CAREFUL! Some of these macros do not correctly handle missing values. Others use percentiles to determine the extreme values that are modified. If you must Winsorize, I have written a SAS/IML function that Winsorizes data and correctly handles missing values.

As an alternative to Winsorizing your data, SAS software provides many modern robust statistical methods that have advantages over a simple technique like Winsorization:

Winsorization: The ugly

If the data contains extreme values, then classical statistics are influenced by those values. However, modifying the data is a draconian measure. Recently I read an article by John Tukey, one of the early investigator of robust estimation. In the article "A survey of sampling from contaminated distributions" (1960), Tukey says (p. 457) that when statisticians encounter a few extreme values in data,

we are likely to think of them as 'strays' [or] 'wild shots' ... and to focus our attention on how normally distributed the rest of the distribution appears to be. One who does this commits two oversights, forgetting Winsor's principle that 'all distributions are normal in the middle,' and forgetting that the distribution relevant to statistical practice is that of the values actually provided and not of the values which ought to have been provided.

A little later in the essay (p. 458), he says

Sets of observations which have been de-tailed by over-vigorous use of a rule for rejecting outliers are inappropriate, since they are not samples.

I love this second quote. All of the nice statistical formulas that are used to make inferences (such as standard errors and confidence intervals) are based on the assumption that the data are a random sample that contains all of the observed values, even extreme values. The tails of a distribution are extremely important, and indiscriminately modifying large and small values invalidates many of the statistical analyses that we take for granted.


Should you Winsorize data? Tukey argues that indiscriminately modifying data is "inappropriate." In SAS, you can get the Winsorized mean directly from PROC UNIVARIATE. SAS also provides alternative robust methods such the ones in the ROBUSTREG and QUANTREG procedures.

If you decide to use Winsorization to modify your data, remember that the standard definition calls for the symmetric replacement of the k smallest (largest) values of a variable with the (k+1)st smallest (largest). If you download a program from the internet, be aware that some programs use quantiles and others do not handle missing values correctly.

What are your thoughts about Winsorizing data? Share them in the comments.

tags: Statistical Programming, Statistical Thinking

The post Winsorization: The good, the bad, and the ugly appeared first on The DO Loop.


In a previous article, I showed how to simulate data for a linear regression model with an arbitrary number of continuous explanatory variables. To keep the discussion simple, I simulated a single sample with N observations and p variables. However, to use Monte Carlo methods to approximate the sampling distribution of statistics, you need to simulate many samples from the same regression model.

This article shows how to simulate many samples efficiently. Efficient simulation is the main emphasis of my book Simulating Data with SAS. For a detailed discussion about simulating data from regression models, see chapters 11 and 12.

The SAS DATA step in my previous post contains four steps. To simulate multiple samples, put a DO loop around the steps that generate the error term and the response variable for each observation in the model. The following program modifies the previous program and creates a single data set that contains NumSamples (=100) samples. Each sample is identified by an ordinal variable named SampleID.

/* Simulate many samples from a  linear regression model */
%let N = 50;            /* N = sample size               */
%let nCont = 10;        /* p = number of continuous variables */
%let NumSamples = 100;  /* number of samples                  */
data SimReg(keep= SampleID i Y x:);
call streaminit(54321);
array x[&nCont];        /* explanatory variables are named x1-x&nCont */
/* 1. Specify model coefficients. You can hard-code values such as
array beta[0:&nCont] _temporary_ (-4 2 -1.33 1 -0.8 0.67 -0.57 0.5 -0.44 0.4 -0.36);
      or you can use a formula such as the following */
array beta[0:&nCont] _temporary_;
do j = 0 to &nCont;
   beta[j] = 4 * (-1)**(j+1) / (j+1);       /* formula for beta[j] */
do i = 1 to &N;              /* for each observation in the sample */
   do j = 1 to &nCont;
      x[j] = rand("Normal"); /* 2. Simulate explanatory variables  */
   eta = beta[0];                       /* model = intercept term  */
   do j = 1 to &nCont;
      eta = eta + beta[j] * x[j];       /*     + sum(beta[j]*x[j]) */
   /* 5. simulate response for each sample */
   do SampleID = 1 to &NumSamples;      /* <== LOOP OVER SAMPLES   */
      epsilon = rand("Normal", 0, 1.5); /* 3. Specify error distrib*/
      Y = eta + epsilon;                /* 4. Y = model + error    */

The efficient way to analyzed simulated samples with SAS is to use BY-group processing. With By-group processing you can analyze all samples with a single procedure call. The following statements sort the data by the SampleID variable and call PROC REG to analyze all samples. The NOPRINT option ensures that the procedure does not spew out thousands of tables and graphs. (For procedures that do not support the NOPRINT option, there are other ways to turn off ODS when analyzing simulated data.) The OUTEST= option saves the parameter estimates for all samples to a SAS data set.

proc sort data=SimReg;
   by SampleID i;
proc reg data=SimReg outest=PE NOPRINT;
   by SampleID;
   model y = x:;

The PE data set contains NumSamples rows. Each row contains the p parameter estimates for the analysis of one simulated sample. The distribution of estimates is an approximation to the true (theoretical) sampling distribution of the statistics. The following image visualizes the joint distribution of the estimates of four regression coefficients. You can see that the distribution of the estimates appears to be multivariate normal and centered at the values of the population parameters.

You can download the SAS program that simulates the data, analyzes it, and produces the graph. The program is very efficient. For 10,000 random samples of size N=50 that contain p=10 variables, it takes about one second to run the Monte Carlo simulation and analyses.

tags: Simulation, Statistical Programming

The post Simulate many samples from a linear regression model appeared first on The DO Loop.


If you are a SAS programmer and use the GROUP= option in PROC SGPLOT, you might have encountered a thorny issue: if you use a WHERE clause to omit certain observations, then the marker colors for groups might change from one plot to another. This happens because the marker colors depend on the data by default. If you change the number of groups (or the order of groups), the marker colors also change.

A simple example demonstrates the problem. The following scatter plots are colored by the ORIGIN variable in the SasHelp.Cars data. (Click to enlarge.) The ORIGIN variable has three levels: Asia, Europe, and USA. On the left, all values of the ORIGIN variable are present in the graph. On the right, the Asian vehicles are excluded by using WHERE ORIGIN^="Asia". Notice that the colors of the markers on the right are not consistent with the values on the left.

Warren Kuhfeld wrote an excellent introduction to legend order and group attributes, and he describes several other ways that group colors can change from one plot to another. To solve this problem, Kuhfeld and other experts recommend that you create a discrete attribute map. A discrete attribute map is a SAS data set that specifies the colors to use for each group value. If you are not familiar with discrete attribute maps, I provide several references at the end of this article.

Automatically create a discrete attribute map for PROC SGPLOT #SASTip
Click To Tweet

Automatic creation of a discrete attribute map

The discrete attribute map is powerful, flexible, and enables the programmer to completely determine the legend order and color for all categories. However, I rarely use discrete attribute maps in my work because the process requires the manual creation of a data set. The data set has to contain all categories (spelled and capitalized correctly) and you have to remember (or look up) the structure of the data set. Furthermore, many examples use hard-coded color values such as CXFFAAAA or "LightBlue," whereas I prefer to use the GraphDatan style elements in the current ODS style.

However, I recently realized that PROC FREQ and the SAS DATA step can lessen the burden of creating a discrete attribute map. The documentation for the discrete attribute map mentions that you can define a column named MarkerStyleElement (or MarkerStyle), which specifies the names of styles elements such as GraphData1, GraphData2, and so on. Therefore, you can use PROC FREQ to write the category levels to a data set, and use a simple DATA step to add the MarkerStyleElement variable. For example, you can create a discrete attribute map for the ORIGIN variable, as follows:

/* semi-automatic way to create a DATTRMAP= data set */
%let VarName = Origin;           /* specify name of grouping variable */
proc freq ORDER=FORMATTED;   /* or ORDER=DATA|FREQ  */
   tables &VarName / out=Attrs(rename=(&VarName=Value));
data DAttrs;
ID = "&VarName";                 /* or "ID_&VarName" */
set Attrs(keep=Value);
length MarkerStyleElement $11.;
MarkerStyleElement = cats("GraphData", 1+mod(_N_-1, 12)); /* GraphData1, GraphData2, etc */
proc print; run;
Structure of discrete attribute maps for DATTRMAP= option in PROC SGPLOT

Voila! The result is a valid discrete attribute data set for the ORIGIN variable. The DATTRS data set contains all the information you need to ensure that the first category is always displayed by using the GraphData1 element, the second category is displayed by using GraphData2, and so on. The program does not require that you manually type the categories or even know how many categories there are. Obviously, you could write a macro that makes it easy to generate these statements.

This data set uses the alphabetical order of the formatted values to determine the group order. However, you can use the ORDER=DATA option in PROC FREQ to order by the order of categories in the data set. You can also use the ORDER=FREQ option to order by the most frequent categories. Because most SAS-supplied styles define 12 style elements, the MOD function is used to handle categorical variable that have more than 12 levels.

Use the discrete attribute map

To use the discrete attribute map, you need to specify the DATTRMAP= option on the PROC SGPLOT statement. You also need to specify the ATTRID= option on every SGPLOT statements that will use the map. Notice that I set the value of the ID variable to be the name of the GROUP= variable. (If that is confusing, you could choose a different value, as noted in the comments of the program.) The following statements are similar to the statements that create the right-hand graph at the top of this article, except this call to PROC SGPLOT uses the DATTRS discrete attribute map:

proc sgplot DATTRMAP=DAttrs;
where origin^='Asia' && type^="Hybrid";
   scatter x=weight y=mpg_city / group=Origin ATTRID=Origin 
   keylegend / location=inside position=TopRight across=1;
Markers colored by attributes specified in a discrete attribute data set, using PROC SGPLOT and the DATTRMAP= option

Notice that the colors in this scatter plot are the same as for the left-hand graph at the top of this article. The group colors are now consistent, even though the number of groups is different.

Generalizing the automatic creation of a discrete attribute map

The previous section showed how to create a discrete attribute map for one variable. You can use a similar approach to automatically create a discrete data map that contains several variables. The main steps are as follows:

  1. Use ODS OUTPUT to save the OneWayFreqs tables from PROC FREQ to a SAS data set.
  2. Use the SUBSTR function to extract the variable name into the ID variable.
  3. Use the COALESCEC function to form a Value column that contains the values of the categorical variables.
  4. Use BY-group processing and the UNSORTED option to assign the style elements GraphDatan.
ods select none;
proc freq;
   tables Type Origin;        /* specify VARS here */ 
   ods output OneWayFreqs=Freqs;
ods select all;
data Freqs2;
set Freqs;
length ID $32.;
ID = substr(Table, 6);        /* original values are "Table VarName" */
Value = COALESCEC(F_Type, F_Origin);  /* also specify F_VARS here */
keep ID Value;
data DAttrs(drop=count);
set Freqs2;
length MarkerStyleElement $11.;
by ID notsorted;
if first.ID then count = 0;
count + 1;
MarkerStyleElement = cats("GraphData", 1 + mod(count-1, 12));

The preceding program is not completely general, but it shows the main ideas. You can adapt the program to your own data. If you are facile with the SAS macro language, you can even write a macro that generates appropriate code for an arbitrary number of variables. Leave a comment if this technique proves useful in your work or if you have ideas for improving the technique.


tags: Statistical Graphics, Statistical Programming

The post Automate the creation of a discrete attribute map appeared first on The DO Loop.

十二 072016

Many SAS procedure compute statistics and also compute confidence intervals for the associated parameters. For example, PROC MEANS can compute the estimate of a univariate mean, and you can use the CLM option to get a confidence interval for the population mean. Many parametric regression procedures (such as PROC GLM) can compute confidence intervals for regression parameters. There are many other examples.

If an analysis provides confidence intervals (interval estimates) for multiple parameters, the coverage probabilities apply individually for each parameter. However, sometimes it is useful to construct simultaneous confidence intervals. These are wider intervals for which you can claim that all parameters are in the intervals simultaneously with confidence level 1-α.

This article shows how to use SAS to construct a set of simultaneous confidence intervals for the population mean. The middle of this article uses some advanced multivariate statistics. If you only want to see the final SAS code, jump to the last section of this article.

Compute simultaneous confidence intervals for the mean in #SAS. #Statistics
Click To Tweet

The distribution of the mean vector

If the data are a random sample from a multivariate normal population, it is well known (see Johnson and Wichern, Applied Multivariate Statistical Analysis, 1992, p. 149; hereafter abbreviated J&W) that the distribution of the sample mean vector is also multivariate normal. There is a multivariate version of the central limit theorem (J&W, p. 152) that says that the mean vector is approximately normally distributed for random samples from any population, provided that the sample size is large enough. This fact can be used to construct simultaneous confidence intervals for the mean.

Recall that the most natural confidence region for a multivariate mean is a confidence ellipse. However, simultaneous confidence intervals are more useful in practice.

Confidence intervals for the mean vector

Before looking at multivariate confidence intervals (CI), recall that many a univariate two-sided CIs are symmetric intervals with endpoints b ± m*SE, where b is the value of the statistic, m is some multiplier, and SE is the standard error of the statistic. The multiplier must be chosen so that the interval has the appropriate coverage probability. For example, the two-sided confidence interval for the univariate mean is has the familiar formula xbar ± tc SE, where xbar is the sample mean, tc is the critical value of the t statistic with significance level α and n-1 degrees of freedom, and SE is the standard error of the mean. In SAS, you can compute tc as quantile("t", 1-alpha/2, n-1).

You can construct similar confidence intervals for the multivariate mean vector. I will show two of the approaches in Johnson and Wichern.

Hotelling T-squared statistic

As shown in the SAS documentation, the radii for the multivariate confidence ellipse for the mean are determined by critical values of an F statistic. The Hotelling T-squared statistic is a scaled version of an F statistic and is used to describe the distribution of the multivariate sample mean.

The following SAS/IML program computes the T-squared statistic for a four-dimensional sample. The Sashelp.iris data contains measurements of the size of petals and sepals for iris flowers. This subset of the data contains 50 observations for the species iris Virginica. (If you don't have SAS/IML software, you can compute the means and standard errors by using PROC MEANS, write them to a SAS data set, and use a DATA step to compute the confidence intervals.)

proc iml;
use sashelp.iris where(species="Virginica"); /* read data */
read all var _NUM_ into X[colname=varNames];
n = nrow(X);               /* num obs (assume no missing) */
k = ncol(X);               /* num variables */
alpha = 0.05;              /* significance level */
xbar = mean(X);            /* mean of sample */
stderr = std(X) / sqrt(n); /* standard error of the mean */
/* Use T-squared to find simultaneous CIs for mean parameters */
F = quantile("F", 1-alpha, k, n-k);  /* critical value of F(k, n-k) */
T2 = k*(n-1)/(n-k) # F;              /* Hotelling's T-squared is scaled F */
m = sqrt( T2 );                      /* multiplier */
Lower = xbar - m # stdErr;
Upper = xbar + m # stdErr;
T2_CI = (xbar`) || (Lower`) || (Upper`);
print T2_CI[F=8.4 C={"Estimate" "Lower" "Upper"}  R=varNames];

The table shows confidence intervals based on the T-squared statistic. The formula for the multiplier is a k-dimensional version of the 2-dimensional formula that is used to compute confidence ellipses for the mean.

Bonferroni-adjusted confidence intervals

It turns out that the T-squared CIs are conservative, which means that they are wider than they need to be. You can obtain a narrower confidence interval by using a Bonferroni correction to the univariate CI.

The Bonferroni correction is easy to understand. Suppose that you have k MVN mean parameters that you want to cover simultaneously. You can do it by choosing the significance level of each univariate CI to be α/k. Why? Because then the joint probability of all the parameters being covered (assuming independence) will be (1 - α/k)k, and by Taylor's theorem (1 - α/k)k ≈ 1 - α when (α/k) is very small. (I've left out many details! See J&W p. 196-199 for the full story.)

In other words, an easy way to construct simultaneous confidence intervals for the mean is to compute the usual two-sided CIs for significance level α/k, as follows:

/* Bonferroni adjustment of t statistic when there are k parameters */
tBonf = quantile("T", 1-alpha/(2*k), n-1);  /* adjusted critical value */
Lower = xbar - tBonf # stdErr;              
Upper = xbar + tBonf # stdErr;
Bonf_CI = (xbar`) || (Lower`) || (Upper`);
print Bonf_CI[F=8.4 C={"Estimate" "Lower" "Upper"} R=varNames];

Notice that the confidence intervals for the Bonferroni method are narrower than for the T-square method (J&W, p. 199).

The following graph shows a scatter plot of two of the four variables. The sample mean is marked by an X. For reference, the graph includes a bivariate confidence ellipse. The T-squared confidence intervals are shown in blue. The thinner Bonferroni confidence intervals are shown in red.

Bonferroni and T-squared simultaneous confidence intervals for the mean of four-dimensional iris data

Compute simultaneous confidence intervals for the mean in SAS

The previous sections have shown that the Bonferroni method is an easy way to form simultaneous confidence intervals (CIs) for the mean of multivariate data. If you want the overall coverage probability to be at most (1 - α), you can construct k univariate CIs, each with significance level α/k.

You can use the following call to PROC MEANS to construct simultaneous confidence intervals for the multivariate mean. The ALPHA= method enables you to specify the significance level. The method assumes that the data are all nonmissing. If your data contains missing values, use listwise deletion to remove them before computing the simultaneous CIs.

/* Bonferroni simultaneous CIs. For k variables, specify alpha/k 
   on the ALPHA= option. The data should c ontain no missing values. */
proc means data=sashelp.iris(where=(species="Virginica")) nolabels
     alpha=%sysevalf(0.05/4)  /* use alpha/k, where k is number of variables */
     mean clm maxdec=4;
var SepalLength SepalWidth PetalLength PetalWidth;  /* k = 4 */

The values in the table are identical to the Bonferroni-adjusted CIs that were computed earlier. The values in the third and fourth columns of the table define a four-dimensional rectangular region. For 95% of the random samples drawn from the population of iris Virginica flowers, the population means will be contained in the regions that are computed in this way.

tags: Statistical Programming

The post Simultaneous confidence intervals for a multivariate mean appeared first on The DO Loop.


A previous post discusses how the loess regression algorithm is implemented in SAS. The LOESS procedure in SAS/STAT software provides the data analyst with options to control the loess algorithm and fit nonparametric smoothing curves through points in a scatter plot.

Although PROC LOESS satisfies 99.99% of SAS users who want to fit a loess model, some research statisticians might want to extend or modify the standard loess algorithm. Researchers like to ask "what if" questions like "what if I used a different weighting function?" or "what if I change the points at which the loess model is evaluated?" Although the loess algorithm is complicated, it is not hard to implement a basic version in a matrix language like SAS/IML.

Implement a basic version of loess regression in SAS/IML. #SAStip
Click To Tweet

Implement loess regression in SAS/IML

Recent blog posts have provided some computational modules that you can use to implement loess regression. For example, the PairwiseNearestNbr module finds the k nearest neighbors to a set of evaluation points. The functions for weighted polynomial regression computes the loess fit at a particular point.

You can download a SAS/IML program that defines the nearest-neighbor and weighted-regression modules. The following call to PROC IML loads the modules and defines a function that fits a loess curve at points in a vector, t. Each fit uses a local neighborhood that contains k data values. The local weighted regression is a degree-d polynomial:

proc iml;
load  module=(PairwiseNearestNbr PolyRegEst PolyRegScore);
/* 1-D loess algorithm. Does not handle missing values.
   Input: t: points at which to fit loess (column vector)
          x, y: nonmissing data (column vectors)
          k: number of nearest neighbors used for loess fit
          d: degree of local regression model 
   Output: column vector L[i] = f(t[i]; k, d) where f is loess model
start LoessFit(t, x, y, k, d=1);
   m = nrow(t);
   Fit = j(m, 1);                            /* Fit[i] is predicted value at t[i] */
   do i = 1 to m;
      x0 = t[i];
      run PairwiseNearestNbr(idx, dist, x0, x, k);
      XNbrs = X[idx];  YNbrs = Y[idx];       /* X and Y values of k nearest nbrs */
      /* local weight function where dist[,k] is max dist in neighborhood */
      w = 32/5 * (1 - (dist / dist[k])##3 )##3; /* use tricubic weight function */
      b = PolyRegEst(YNbrs, XNbrs, w`, d);   /* param est for local weighted regression */
      Fit[i] = PolyRegScore(x0, b);          /* evaluate polynomial at x0 */
   return Fit;

This algorithm provides some features that are not in PROC LOESS. You can use this function to evaluate a loess fit at arbitrary X values, whereas PROC LOESS evaluates the function only at quantiles of the data. You can use this function to fit a local polynomial regression of any degree (for example, a zero-degree polynomial), whereas PROC LOESS fits only first- and second-degree polynomials. Although I hard-coded the standard tricubic weight function, you could replace the function with any other weight function.

On the other hand, PROC LOESS supports many features that are not in this proof-of-concept function, such as automatic selection of the smoothing parameter, handling missing values, and support for higher-dimensional loess fits.

Call the loess function on example data

Let's use polynomials of degree 0, 1, and 2 to compute three different loess fits. The LoessData data set is defined in my previous article:

use LoessData;  read all var {x y};  close;   /* read example data */
s = 0.383;                /* specify smoothing parameter */
k = floor(nrow(x)*0.383); /* num points in local neighborhood */
/* grid of points to evaluate loess curve (column vector) */
t    = T( do(min(x), max(x), (max(x)-min(x))/50) );
Fit0 = LoessFit(t, x, y, k, 0);    /* loess fit with degree=0 polynomials */
Fit1 = LoessFit(t, x, y, k, 1);    /* degree=1 */
Fit2 = LoessFit(t, x, y, k, 2);    /* degree=2 */
create Sim var {x y t Fit0 Fit1 Fit2};  append;  close;

You can use PROC SGPLOT to overlay these loess curves on a scatter plot of the data:

title "Overlay loess curves computed in SAS/IML";
proc sgplot data=Sim;
label Fit0="Loess (Deg=0)" Fit1="Loess (Deg=1)" Fit2="Loess (Deg=2)"; 
scatter x=x y=y;
series x=t  y=Fit0 / curvelabel;
series x=t  y=Fit1 / curvelabel lineattrs=(color=red);
series x=t  y=Fit2 / curvelabel lineattrs=(color=ForestGreen);
xaxis grid; yaxis grid;
Loess curves computed in SAS/IML

The three curves are fairly close to each other on the interior of the data. The degree 2 curve wiggles more than the other two curves because it uses a higher-degree polynomial. The over- and undershooting becomes even more pronounced if you use cubic or quartic polynomials for the local weighted regressions.

The curious reader might wonder how these curves compare to curves that are created by PROC LOESS or by the LOESS statement in PROC SGPLOT. In the attached program I show that the IML implementation produces the same predicted values as PROC LOESS when you evaluate the models at the same set of points.

Most SAS data analysts are happy to use PROC LOESS. They don't need to write their own loess algorithm in PROC IML. However, this article shows that IML provides the computational tools and matrix computations that you need to implement sophisticated algorithms, should the need ever arise.

tags: Data Analysis, Statistical Programming

The post Loess regression in SAS/IML appeared first on The DO Loop.


Finding nearest neighbors is an important step in many statistical computations such as local regression, clustering, and the analysis of spatial point patterns. Several SAS procedures find nearest neighbors as part of an analysis, including as PROC LOESS, PROC CLUSTER, PROC MODECLUS, and PROC SPP. This article shows how to find nearest neighbors for every observation directly in SAS/IML, which is useful if you are implementing certain algorithms in that language.

Compute the distance between observations

Let's create a sample data set. The following DATA step simulates 100 observations that are uniformly distributed in the unit square [0,1] x [0,1]:

data Unif;
call streaminit(12345);
do i = 1 to 100;
   x = rand("Uniform");   y = rand("Uniform");   output;

I have previously shown how to compute the distance between observations in SAS by using PROC DISTANCE or the DISTANCE function in SAS/IML. The following statements read the data into a SAS/IML matrix and computes the pairwise distances between all observations:

proc iml;
use Unif; read all var {x, y} into X; close;
D = distance(X);              /* N x N distance matrix */

The D matrix is a symmetric 100 x 100 matrix. The value D[i,j] is the Euclidean distance between the ith and jth rows of X. An easy way to look for the nearest neighbor of observation i is to search the ith row for the column that contains smallest distance. Because the diagonal elements of D are all zero, a useful trick is to change the diagonal elements to be missing values. Then the smallest value in each row of D corresponds to the nearest neighbor.

You can use the following statements assigns missing values to the diagonal elements of the D matrix. You can then use the SAS/IML subscript reduction operators to find the minimum distance in each row:

diagIdx = do(1, nrow(D)*ncol(D), ncol(D)+1); /* index diagonal elements */
D[diagIdx] = .;                              /* set diagonal elements */
dist = D[ ,><];            /* smallest distance in each row */
nbrIdx = D[ ,>:<];         /* column of smallest distance in each row */
Neighbor = X[nbrIdx, ];    /* coords of closest neighbor */

Visualizing nearest neighbors

If you write the nearest neighbors and distances to a SAS data set, you can use the VECTOR statement [LINK] in PROC SGPLOT to draw a vector that connects each observation to its nearest neighbor. The graph indicates the nearest neighbor for each observation.

Z = X || Neighbor || dist;
create NN_Unif from Z[c={"x" "y" "xc" "yc" "dist"}];
append from Z;
title "Nearest Neighbors for 100 Uniformly Distributed Points";
proc sgplot data=NN_Unif aspect=1;
label dist = "Distance to Nearest Neighbor";
   scatter x=x y=y / colorresponse=dist markerattrs=(symbol=CircleFilled);
   vector x=xc y=yc / xorigin=x yorigin=y transparency=0.5 
                      colorresponse=dist;  /* COLORRESPONSE= requires 9.4m3 for VECTOR */
   xaxis display=(nolabel);
   yaxis display=(nolabel);
Nearest neighbors each of 100 observations

Alternative ways to compute nearest neighbors

If you don't have SAS/IML but still want to compute nearest neighbors, you can use PROC MODECLUS. The NEIGHBOR option on the PROC MODECLUS statement produces a table that gives the observation number (or ID value) of nearest neighbors. For example, the following statements produce the observation numbers for the nearest neighbors:

ods select neighbor;
/* Use K=p to find nearest p-1 neighbors */
proc modeclus data=Unif method=1 k=2 Neighbor;  /* k=2 for nearest nbrs */
   var x y;

To get the coordinates of the nearest neighbor, you can create a variable that contains the observation numbers and then use an ID statement to include that ID variable in the PROC MODECLUS output. You can then look up the coordinates. I omit the details.

Why stop at one? A SAS/IML module for k nearest neighbors

You can use the ideas in the earlier SAS/IML program to write a program that returns the indices (observation numbers) of the k closest neighbors for k ≥ 1. The trick is to replace the smallest distance in each row with a missing value and then repeat the process of finding the smallest value (and column) in each row. The following SAS/IML module implements this computation:

proc iml;
/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  X    an (N x p) data matrix
           k    specifies the number of nearest neighbors (k>=1) 
   OUTPUT: idx  an (N x k) matrix of row numbers. idx[,j] contains
                the row numbers (in X) of the j_th closest neighbors
           dist an (N x k) matrix. dist[,j] contains the distances
                between X and the j_th closest neighbors
start NearestNbr(idx, dist, X, k=1);
   N = nrow(X);  p = ncol(X);
   idx = j(N, k, .);            /* j_th col contains obs numbers for j_th closest obs */
   dist = j(N, k, .);           /* j_th col contains distance for j_th closest obs */
   D = distance(X);
   D[do(1, N*N, N+1)] = .;      /* set diagonal elements to missing */
   do j = 1 to k;
      dist[,j] = D[ ,><];       /* smallest distance in each row */
      idx[,j] = D[ ,>:<];       /* column of smallest distance in each row */
      if j < k then do;         /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:N)||idx[,j]);
         D[ndx] = .;            /* set elements to missing */

You can use this module to compute the k closest neighbors for each observation (row) in a data matrix. For example, the following statements compute the two closest neighbors for each observation. The output shows a few rows of the original data and the coordinates of the closest and next closest observations:

use Unif; read all var {x, y} into X; close;
run NearestNbr(nbrIdx, dist, X, k);
Near1 = X[nbrIdx[,1], ];    /* 1st nearest neighbors */
Near2   = X[nbrIdx[,2], ];  /* 2nd nearest neighbors */
Coordinates of nearest and second nearest neighbors to a set of observations

In summary, this article defines a short module in the SAS/IML language that you can use to compute the k nearest neighbors for a set of N numerical observations. Notice that the computation builds an N x N distance matrix in RAM, so this matrix might consume a lot of memory for large data sets. For example, the distance matrix for a data set with 16,000 observations requires about 1.91 GB of memory. For larger data sets, you might prefer to use PROC DISTANCE or PROC MODECLUS.

tags: 9.4, Data Analysis, Statistical Programming

The post Compute nearest neighbors in SAS appeared first on The DO Loop.


A kernel density estimate (KDE) is a nonparametric estimate for the density of a data sample. A KDE can help an analyst determine how to model the data: Does the KDE look like a normal curve? Like a mixture of normals? Is there evidence of outliers in the data?

In SAS software, there are two procedures that generate kernel density estimates. PROC UNIVARIATE can create a KDE for univariate data; PROC KDE can create KDEs for univariate and bivariate data and supports several options to choose the kernel bandwidth.

Kernel density estimate (KDE) and component densities

The KDE is a finite mixture distribution. It is a weighted sum of small density "bumps" that are centered at each data point. The shape of the bumps are determined by the choice of a kernel function. The width of the bumps are determined by the bandwidth.

In textbooks and lecture notes about kernel density estimation, you often see a graph similar to the one at the left. The graph shows the kernel density estimate (in blue) for a sample of 10 data values. The data values are shown in the fringe plot along the X axis. The orange curves are the component densities. Each orange curve is a scaled version of a normal density curve centered at a datum.

This article shows how to create this graph in SAS. You can use the same principles to draw the component densities for other finite mixture models.

The kernel bandwidth

The first step is to decide on a bandwidth for the component densities. The following statements define the data and use PROC UNIVARIATE to compute the bandwidth by using the Sheather-Jones plug-in method:

data sample;
input x @@;
46 60 24 15 17 14 21 59 22 16 
proc univariate data=sample;
   histogram x / kernel(c=SJPI);
Histogram and kernel density estimate (KDE)

The procedure create a histogram with a KDE overlay. The legend of the graph gives a standardized kernel bandwidth of c=0.27, but that is not the bandwidth you want. As explained in the documentation, the kernel bandwidth is derived from the normalized bandwidth by the formula λ = c IQR n-1/5, where IQR is the interquartile range and n is the number of nonmissing observations. For this data, IQR = 30 and n=10, so λ ≈ 5. To save you from having to compute these values, the SAS log displays the following NOTE:

NOTE: The normal kernel estimate for c=0.2661 has a bandwidth of 5.0377

The UNIVARIATE procedure can use several different kernel shapes. By default, the procedure uses a normal kernel. The rest of this article assumes a normal kernel function, although generalizing to other kernel shapes is straightforward.

Compute the component densities

Because the kernel function is centered at each datum, one way to visualize the component densities is to evaluate the kernel function on a symmetric interval about x=0 and then translate that component for every data point. In the following SAS/IML program, the vector w contains evenly spaced points in the interval [-3λ, 3λ], where λ is the bandwidth of the kernel function. (This interval contains 99.7% of the area under the normal curve with standard deviation λ.) The vector k is the normal density function evaluated on that interval, scaled by 1/n where n is the number of nonmissing observations. The quantity 1/n is the mixing probability for each component; the KDE is obtained by summing these components.

The program creates an output data set called component that contains three numerical variables. The ID variable is an ID variable that identifies which observation is being used. The z variable is a translated copy of the w variable. The k variable does not change because every component has the same shape and bandwidth.

proc iml;
use sample;  read all var {x};  close;
n = countn(x);
mixProb = 1/n;
lambda = 5;                                 /* bandwidth from PROC UNIVARIATE */
w = do(-3*lambda, 3*lambda, 6*lambda/100);  /* evenly spaced; 3 std dev */
k = mixProb * pdf("Normal", w, 0, lambda);  /* kernel = normal pdf centered at 0 */
ID= .; z = .;
create component var {ID z k};             /* create the variables */
do i = 1 to n;
   ID = j(ncol(w), 1, i);                  /* ID var */
   z = x[i] + w;                           /* center kernel at x[i] */
close component;

Compute the kernel density estimate

The next step is to sum the component densities to create the KDE. The easy way to get the KDE in a data set is to use the OUTKERNEL= option on the HISTOGRAM statement in PROC UNIVARIATE. Alternatively, you can create the full KDE in SAS/IML, as shown below.

The range of the data is [14, 60]. You can extend that range by the half-width of the kernel components, which is 15. Consequently the following statements use the interval [0, 75] as a convenient interval on which to sum the density components. The actual summation is easy in SAS/IML because you can pass a vector of positions to the PDF function i Base SAS.

The sum of the kernel components is written to a data set called KDE.

/* finite mixture density is weighted sum of kernels at x[i] */
a = 0; b = 75;    /* endpoints of interval [a,b] */
t = do(a, b, (b-a)/200);
kde = 0*t;
do i = 1 to n;
   kde = kde + pdf("Normal", t, x[i], lambda) * mixProb;
create KDE var {t kde}; append; close;

Visualize the KDE

You can merge the original data, the individual components, and the KDE curve into a single SAS data set called All. Use the SGPLOT procedures to overlay the three elements. The individual components are plotted by using the SERIES plot with a GROUP= option. A second SERIES plot graphs the KDE curve. A FRINGE statement plots the positions of each datum as a hash mark. The plot is shown at the top of this article.

data All;
merge sample component KDE;
title "Kernel Density Estimate as Weighted Sum of Component Densities";
proc sgplot data=All noautolegend;
   series x=z y=k / group=ID lineattrs=GraphFit2(thickness=1); /* components */
   series x=t y=kde /lineattrs=GraphFit;                       /* KDE curve  */
   fringe x;                                      /* individual observations */
   refline 0 / axis=y;
   xaxis label="x";
   yaxis label="Density" grid;

You can use the same technique to visualize other finite mixture models. However, the FMM procedure creates these plots automatically, so you might never need to create such a plot if you use PROC FMM. The main difference for a general finite mixture model is that the component distributions can be from different families and usually have different parameters. Therefore you will need to maintain a vector of families and parameters. Also, the mixing probabilities usually vary between components.

In summary, the techniques in this article are useful to teachers and presenters who want to visually demonstrate how choosing a kernel shape and bandwidth gives rise to the kernel density estimate.

tags: Statistical Graphics, Statistical Programming

The post How to visualize a kernel density estimate appeared first on The DO Loop.