10月 312016
 

Every year near Halloween I write an article in which I demonstrate a simple programming trick that is a real treat to use. This year's trick (which features the CMISS function and the crossproducts matrix in SAS/IML) enables you to count the number of observations that are missing for pairs of columns. In other words, this trick counts how many missing values are in columns 1 and 2, columns 1 and 3, and so on for all "k choose 2" pairwise combinations of k columns.

This trick generalizes, so keep reading even if you don't care about counting missing values. Given ANY indicator matrix, you can use this trick to count the pairwise occurrence of events.


Counting pairwise events: A connection to matrix multiplication #Statistics
Click To Tweet


Counting missing values

Of course, you can count combinations of missing values without using SAS/IML or matrices. I've previously shown how to use PROC MI to count all combinations of missing values among a set of k variables. However, as I said, this SAS/IML trick enables you to count more than just missing values: it generalizes to count ANY pairwise event.

To get started, read data into a SAS/IML matrix. The following statements read 5209 observations and eight variables into the matrix X:

proc iml;
varNames = {"AgeAtStart" "Diastolic" "Systolic" 
            "Height" "Weight" "MRW" "Smoking" "Cholesterol"};
use Sashelp.Heart;               /* open data set */
read all var varNames into X;    /* create numeric data matrix, X */
close Sashelp.Heart;

The first part of the trick is to convert the matrix into a 0/1 indicator matrix, where 1 indicates that a data value is missing and 0 indicates nonmissing. I like to use the CMISS function in Base SAS for this task.

The second part of the trick is to recognize that if C is an indicator matrix, the crossproducts matrix P=C`C is a matrix that gives the counts of the pairwise events in C. The element P[i,j] is the inner product of the i_th and j_th columns of C, and because C is an indicator matrix the inner product counts the number of simultaneous "events" for column i and column j.

In the SAS/IML language, the code to compute pairwise counts of missing values requires only two lines:

C = cmiss(X);                    /* create 0/1 indicator matrix */
P = C`*C;                        /* pairwise counts */
print P[c=varNames r=varNames label="Pairwise Missing Counts"];
pairwise1

The P matrix is symmetric. The diagonal elements P[i,i] show the number of missing values for the i_th variable. For example, the Height, Weight, and MRW variables have 6 missing values whereas the Cholesterol variable has 152 missing values. The off-diagonal elements P[i,j] show the number of observations that are simultaneously missing for the i_th and j_th variables. For example, 28 observations have missing values for both the Smoking and Cholesterol variables; two observations have missing values for both the Height and Weight variables.

Counting any pair of events

If C is any indicator matrix, the crossproducts matrix P=C`C counts the number of observations for which two columns of C are simultaneously 1.

For example, the following statements standardize the data to create z-scores. You can form an indicator matrix that has the value 1 if a z-score is exceeds 3 in absolute value; these are outliers for normally distributed data. The crossproducts matrix counts the number of observations that are univariate outliers (on the diagonal) and the number that are outliers for a pair of variables:

Z = (X - mean(X)) / std(X);      /* standardize data */
L = (abs(Z) > 3);                /* L indicates outliers */
Q = L`*L;                        /* counts for pairs of outliers */
print Q[c=varNames r=varNames label="Pairwise Outliers"];
pairwise2

The table indicates that 52 patients have a diastolic blood pressure (BP) that is more than three standard deviations above the mean. Similarly, 83 patients are outliers for systolic BP, and 35 patients are outliers for both measurements.

The following graph visualizes the patients that have high blood pressure in each category:

Outlier = j(nrow(L), 1, "No         ");       /* create grouping variable */
Outlier[ loc(L[,2] | L[,3]) ] = "Univariate";
Outlier[ loc(L[,2] & L[,3]) ] = "Bivariate";
title "Standardized Blood Pressure";
title2 "Outliers More than Three Standard Deviations from Mean";
call scatter(Z[,2], Z[,3]) group=Outlier 
         label={"Standardized Diastolic" "Standardized Systolic"}
         option="transparency=0.5 markerattrs=(symbol=CircleFilled)"
         other="refline 3/axis=x; refline 3/axis=y;";
pairwise3

In conclusion, you can create an indicator matrix for any set of conditions. By forming the crossproducts matrix, you get the counts of all observations that satisfy each condition individually and in pairs. I think this trick is a real treat!

tags: Data Analysis, Matrix Computations, Tips and Techniques

The post Counting observations for which two events occur appeared first on The DO Loop.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)