4月 022018

As a general rule, when SAS programmers want to manipulate data row by row, they reach for the SAS DATA step. When the computation requires column statistics, the SQL procedure is also useful. When both row and column operations are required, the SAS/IML language is a powerful addition to a SAS programmer's toolbox.

I was reminded of this fact recently when a SAS programmer (possibly a student) asked how to "manually" perform the classic chi-square test for association in a two-way frequency table. The computation requires computing the means across rows and down columns, and the student was struggling with implementing the computations in the DATA step. This article illustrates how SAS/IML can simplify the rowwise and columnwise computations in the classic chi-square test.

The chi-square test for association in PROC FREQ

In SAS, the easy way to compute the chi-square test for association is to use PROC FREQ. The following data are from several examples in the PROC FREQ documentation. The data shows the hair color and eye color of 762 European children. The call to PROC FREQ computes the chi-square test and a cross-tabulation that displays the observed value, expected values (under the hypothesis that hair color and eye color are independent), and deviations, which are the "observed minus expected" values:

data Color;
   input Eyes $ Hair $ Count @@;
   label Eyes  ='Eye Color'
         Hair  ='Hair Color';
blue  black   6  blue  dark   51  blue  fair   69
blue  medium 68  blue  red    28  brown black  16
brown dark   94  brown fair   90  brown medium 94
brown red    47  green dark   37  green fair   69
green medium 55  green red    38
title 'Eye and Hair Color of European Children';
proc freq data=Color;
   tables Eyes*Hair / nopercent norow nocol 
                      deviation expected chisq;
   weight Count;

In the eye-by-hair table, each cell contains three values. The first value is the observed cell count, the second value is the expected cell count (assuming independence), and the third value is their difference, which is sometimes called the "deviation." The test statistic and p-value for the chi-square test are outlined in red. The test statistic is 20.92. The probability of observing that value from a random draw of a chi-square distribution with 8 degrees of freedom is 0.0073. Because that probability is so small, we reject the null hypothesis that hair color and eye color are independent.

Compute the chi-square test "manually" in SAS

The chi-square test on a 3 x 4 table is simple enough to compute by hand, but suppose you want to use SAS to validate or reproduce the numbers that PROC FREQ produces? This is a good programming exercise for students to make sure they understand the computations. The PROC FREQ documentation provides the formula for the test statistic by using the equation

where nij is the observed count in row i and column j and eij is the expected count, but there is nothing like programming a formula to ensure understanding.

The following steps indicate the (vectorized) computations that can be used to implement the chi-square test in SAS/IML.

  1. Use subscript reduction operators to compute the means for each row and column, and the grand mean for all cells.
  2. Use an outer product to form the table of expected values from the mean vectors.
  3. Compute the test statistic by using elementwise matrix operations.
  4. Use the CDF function to compute the p-value.
proc iml;
/* row means, column means, and overall mean */
cName = {"black" "dark" "fair" "medium" "red"};
rName = {"blue" "brown" "green"};
C = { 6  51  69  68  28,  
     16  94  90  94  47,   
      0  37  69  55  38};
colMarg = C[:, ];       /* mean of each column */
rowMarg = C[ ,:];       /* mean of each row */
grandMean = rowMarg[:]; /* grand mean (also C{:]) */
/* expected values under hypothesis of independence */
Expected  = rowMarg*colMarg / grandMean;   /* outer product */
Deviance = C - Expected;                   /* difference (observed-expected) for each cell */
/* chi square statistic: Sum(( Observed[i,j] - Expected[i,j])^2 / Expected[i,j] ) */
ChiSq = sum( Deviance##2 / Expected );     
DF = (nrow(C)-1) * (ncol(C)-1);
pvalue = 1 - cdf("ChiSq", ChiSq, DF);
print Expected[c=cName r=rName], Deviance[c=cName r=rName];
print ChiSq pvalue;

Notice that the program does not contain any loops, although the formulas contain double summations over the elements of the table. This is an example of "vectorizing" the computations, which means writing the computations as vector or matrix computations rather than scalar operations in a loop.

You can see that the 'Expected' matrix matches the PROC FREQ output for the expected values for each cell. Similarly, the 'Deviance' matrix matches the PROC FREQ output for the difference between observed and expected values. The test statistic is the sum of the ratios of the squared deviances and the expected values. A call to the CDF function computes the p-value.

In summary, you can use the high-level SAS/IML language to implement basic statistical tests such as the chi-square test for association in a two-way frequency table. Such an exercise enables students to understand the details of elementary statistical tests. For programmers who know the statistical details but who are new to the SAS/IML language, this short exercise provides a way to gain proficiency with vectorized programming techniques.

The post The chi-square test: An example of working with rows and columns in SAS appeared first on The DO Loop.

6月 212017

One way to assess the precision of a statistic (a point estimate) is to compute the standard error, which is the standard deviation of the statistic's sampling distribution. A relatively large standard error indicates that the point estimate should be viewed with skepticism, either because the sample size is small or because the data themselves have a large variance. The jackknife method is one way to estimate the standard error of a statistic.

Some simple statistics have explicit formulas for the standard error, but the formulas often assume normality of the data or a very large sample size. When your data do not satisfy the assumptions or when no formula exists, you can use resampling techniques to estimate the standard error. Bootstrap resampling is one choice, and the jackknife method is another. Unlike the bootstrap, which uses random samples, the jackknife is a deterministic method.

This article explains the jackknife method and describes how to compute jackknife estimates in SAS/IML software. This is best when the statistic that you need is also implemented in SAS/IML. If the statistic is computed by a SAS procedure, you might prefer to download and use the %JACK macro, which does not require SAS/IML.

The jackknife method: Leave one out!

The jackknife method estimates the standard error (and bias) of statistics without making any parametric assumptions about the population that generated the data. It uses only the sample data.

The jackknife method manufactures jackknife samples from the data. A jackknife sample is a "leave-one-out" resample of the data. If there are n observations, then there are n jackknife samples, each of size n-1. If the original data are {x1, x2,..., xn}, then the i_th jackknife sample is
{x1,..., xi-1,xi+1,..., xn}
You then compute n jackknife replicates. A jackknife replicate is the statistic of interest computed on a jackknife sample. You can obtain an estimate of the standard error from the variance of the jackknife replicates. The jackknife method is summarized by the following:

  1. Compute a statistic, T, on the original sample of size n.
  2. For i = 1 to n, repeat the following:
    • Leave out the i_th observation to form the i_th jackknife sample.
    • Compute the i_th jackknife replicate statistic, Ti, by computing the statistic on the i_th jacknife sample.
  3. Compute the average (mean) of the jackknife replicates: Tavg = Σi Ti / n.
  4. (Optional) Estimate the bias as BiasTjack = (n-1)(Tavg - T)
  5. Estimate the standard error as SEjack = sqrt( (n-1)/n (Σi Ti - Tavg)**2 )

Data for a jackknife example

Resampling methods are not hard, but the notation in some books can be confusing. To clarify the method, let's choose a particular statistic and look at example data. The following example is from Martinez and Martinez (2001, 1st Ed, p. 241), which is also the source for this article. The data are the LSAT scores and grade-point averages (GPAs) for 15 randomly chosen students who applied to law school.

data law;
input lsat gpa @@;
653 3.12   576 3.39   635 3.30   661 3.43   605 3.13
578 3.03   572 2.88   545 2.76   651 3.36   555 3.00
580 3.07   594 2.96   666 3.44   558 2.81   575 2.74

The statistic of interest (T) will be the correlation coefficient between the LSAT and the GPA variables for the n=15 observations. The observed correlation is TData = 0.776. The standard error of T helps us understand how much T would change if we took a different random sample of 15 students. The next sections show how to implement the jackknife analysis in the SAS/IML language.

Construct a jackknife sample in SAS

The SAS/IML matrix language is the simplest way to perform a general jackknife estimates. If X is an n x p data matrix, you can obtain the i_th jackknife sample by excluding the i_th row of X. The following two helper functions encapsulate some of the computations. The SeqExclude function returns the index vector {1, 2, ..., i-1, i+1, ..., n}. The JackSample function returns the data matrix without the i_th row:

proc iml;
/* return the vector {1,2,...,i-1, i+1,...,n}, which excludes the scalar value i */ 
start SeqExclude(n,i);
   if i=1 then return 2:n;
   if i=n then return 1:n-1;
   return (1:i-1) || (i+1:n);
/* return the i_th jackknife sample for (n x p) matrix X */
start JackSamp(X,i);
   n = nrow(X);
   return X[ SeqExclude(n, i), ];  /* return data without i_th row */

The jackknife method for multivariate data in SAS

By using the helper functions, you can carry out each step of the jackknife method. To make the method easy to modify for other statistics, I've written a function called EvalStat which computes the correlation coefficient. This function is called on the original data and on each jackknife sample.

/* compute the statistic in this function */
start EvalStat(X);
   return corr(X)[2,1];  /* <== Example: return correlation between two variables */
/* read the data into a (n x 2) data matrix */
use law;  read all var {"gpa" "lsat"} into X;  close;
/* 1. compute statistic on observed data */
T = EvalStat(X);
/* 2. compute same statistic on each jackknife sample */
n = nrow(X);
T_LOO = j(n,1,.);             /* LOO = "Leave One Out" */
do i = 1 to n;
   Y = JackSamp(X,i);
   T_LOO[i] = EvalStat(Y); 
/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO );  
/* 4 & 5. compute jackknife estimates of bias and standard error */
biasJack = (n-1)*(T_Avg - T);
stdErrJack = sqrt( (n-1)/n * ssq(T_LOO - T_Avg) );
result = T || T_Avg || biasJack || stdErrJack;
print result[c={"Estimate" "Mean Jackknife Estimate" "Bias" "Std Error"}];
Jackknife estimate of standard error and bias for a correlation coefficient of multivariate data

The output shows that the estimate of bias for the correlation coefficient is very small. The standard error of the correlation coefficient is estimated as 0.14, which is about 18% of the estimate.

To use this code yourself, simply modify the EvalStat function. The remainder of the program does not need to change.

The jackknife method in SAS/IML: Univariate data

When the data are univariate, you can sometimes eliminate the loop that computes jackknife samples and evaluates the jackknife replicates. If X is column vector, you can computing the (n-1) x n matrix whose i_th column represents the i_th jackknife sample. (To prevent huge matrices, this method is best for n < 20000.) Because many statistical functions in SAS/IML operate on the columns of a matrix, you can often compute the jackknife replicates in a vectorized manner.

In the following program, the JackSampMat function returns the matrix of jackknife samples for univariate data. The function calls the REMOVE function in SAS/IML, which deletes specified elements of a matrix and returns the results in a row vector. The EvalStatMat function takes the matrix of jackknife samples and returns a row vector of statistics, one for each column. In this example, the function returns the sample standard deviation.

/* If x is univariate, you can construct a matrix where
   each column contains a jackknife sample.
   Use for univariate column vector x when n < 20000 */
start JackSampMat(x); 
   n = nrow(x);
   B = j(n-1, n,0);
   do i = 1 to n;
      B[,i] = remove(x, i)`;   /* transpose to column vevtor */
   return B;
/* Input: matrix where each column of X is a bootstrap sample. 
   Return a row vector of statistics, one for each column. */
start EvalStatMat(x); 
   return std(x);   /* <== Example: return std dev of each sample */

Let's use these functions to get a jackknife estimate of the standard error for the statistic (the standard deviation). The data (from Martinez and Martinez, p. 246) have been studied by many researchers and represent the weight gain in grams for 10 rats who were fed a low-protein diet of cereal:

x = {58,67,74,74,80,89,95,97,98,107}; /* Weight gain (g) for 10 rats */
/* optional: visualize the matrix of jackknife samples */
*M = JackSampMat(x);
*print M[c=("S1":"S10") r=("1":"9")];
/* Jackknife method for univariate data */
/* 1. compute observed statistic */
T = EvalStatMat(x);    
/* 2. compute same statistic on each jackknife sample */
T_LOO = EvalStatMat( JackSampMat(x) ); /* LOO = "Leave One Out" */
/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO` );                /* transpose T_LOO */
/* 4 & 5. compute jackknife estimates of bias and standard error */
biasJack = (n-1)*(T_Avg - T);
stdErrJack = sqrt( (n-1)/n * ssq(T_LOO - T_Avg) );
result = T || T_Avg || biasJack || stdErrJack;
print result[c={"Estimate" "Mean Jackknife Estimate" "Bias" "Std Error"}];
Jackknife estimate of standard error and bias for the standard deviation of univariate data

The output shows that the standard deviation of these data is about 15.7 grams. The jackknife method computes that the standard error for this statistic about 2.9 grams, which is about 18% of the estimate.

In summary, jackknife estimates are straightforward to implement in SAS/IML. This article shows a general implementation that works for all data and a specialized implementation that works for univariate data. In both cases, you can adapt the code for your use by modifying the function that computes the statistic on a data set. This approach is useful and efficient when the statistic is implemented in SAS/IML.

The post The jackknife method to estimate standard errors in SAS appeared first on The DO Loop.

4月 102017

Many intervals in statistics have the form p ± δ, where p is a point estimate and δ is the radius (or half-width) of the interval. (For example, many two-sided confidence intervals have this form, where δ is proportional to the standard error.) Many years ago I wrote an article that mentioned that you can construct these intervals in the SAS/IML language by using a concatenation operator (|| or //). The concatenation creates a two-element vector, like this:

proc iml;
mu = 50; 
delta = 1.5;
CI = mu - delta  || mu + delta;   /* horizontal concatenation ==> 1x2 vector */

Last week it occurred to me that there is a simple trick that is even easier: use the fact that SAS/IML is a matrix-vector language to encode the "±" sign as a vector {-1, 1}. When SAS/IML sees a scalar multiplied by a vector, the result will be a vector:

CI = mu + {-1  1}*delta;         /* vector operation ==> 1x2 vector */
print CI;

You can extend this example to compute many intervals by using a single statement. For example, in elementary statistics we learn the "68-95-99.7 rule" for the normal distribution. The rule says that in a random sample drawn from a normal population, about 68% of the observations will be within 1 standard deviation of the mean, about 95% will be within 2 standard deviations, and about 99.7 % will be within 3 standard deviations of the mean. You can construct those intervals by using a "multiplier matrix" whose first row is {-1, +1}, whose second row is {-2, +2}, and whose third row is {-3, +3}. The following SAS/IML statements construct the three intervals for the 69-95-99.7 rule for a normal population with mean 50 and standard deviation 8:

mu = 50;  sigma = 8;
m = {-1 1,
     -2 2,
     -3 3};
Intervals = mu + m*sigma;
ApproxPct = {"68%", "95%", "99.7"};
print Intervals[rowname=ApproxPct];

Just for fun, let's simulate a large sample from the normal population and empirically confirm the 68-95-99.7 rule. You can use the RANDFUN function to generate a random sample and use the BIN function to detect which observations are in each interval:

call randseed(12345);
n = 10000;                                /* sample size */
x = randfun(n, "Normal", mu, sigma);      /* simulate normal sample */
ObservedPct = j(3,1,0);
do i = 1 to 3;
   b = bin(x, Intervals[i,]);             /* b[i]=1 if x[i] in interval */
   ObservedPct[i] = sum(b) / n;           /* percentage of x in interval */
results = Intervals || {0.68, 0.95, 0.997} || ObservedPct;
print results[colname={"Lower" "Upper" "Predicted" "Observed"}
              label="Probability of Normal Variate in Intervals: X ~ N(50, 8)"];

The simulation confirms the 68-95-99.7 rule. Remember that the rule is a mnemonic device. You can compute the exact probabilities by using the CDF function. In SAS/IML, the exact computation is p = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);

In summary, the SAS/IML language provides an easy syntax to construct intervals that are symmetric about a central value. You can use a vector such as {-1, 1} to construct an interval of the form p ± δ, or you can use a k x 2 matrix to construct k symmetric intervals.

The post A simple trick to construct symmetric intervals appeared first on The DO Loop.

3月 102017

I recently needed to solve a fun programming problem. I challenge other SAS programmers to solve it, too! The problem is easy to state: Given a long sequence of digits, can you write a program to count how many times a particular subsequence occurs? For example, if I give you a sequence of 1,000 digits, can you determine whether the five-digit pattern {1 2 3 4 5} appears somewhere as a subsequence? How many times does it appear?

If the sequence is stored in a data set with one digit in each row, then SAS DATA step programmers might suspect that the LAG function will be useful for solving this problem. The LAG function enables a DATA step to examine the values of several digits simultaneously.

The SAS/IML language also has a LAG function which enables you to form a matrix of lagged values. This leads to an interesting way use vectorized computations to solve this problem. The following SAS/IML program defines a small eight-digit set of numbers in which the pattern {1 2 3} appears twice. The LAG function in SAS/IML accepts a vector of lags and creates a matrix where each column is a lagged version of the input sequence:

/* Find a certain pattern in sequence of digits */
proc iml;
Digit = {1,1,2,3,3,1,2,3};      /* digits to search */
target = {1 2 3};               /* the pattern to find */
p = ncol(target);               /* length of target sequence */
D = lag(Digit, (p-1):0);        /* columns shift the digits */
print D;

The output shows a three-column matrix (D) that contains the second, first, and zeroth lag (in that order) for the input sequence. Notice that if I am searching for a particular three-digit pattern, this matrix is very useful. The rows of this matrix are all three-digit patterns that appear in the original sequence. Consequently, to search for a three-digit pattern, I can use the rows of the matrix D.

To make the task easier, you can delete the first two rows, which contain missing values. You can also form a binary matrix X that has the value X[i,j]=1 when the j_th element of the pattern equals the j_th element of the i_th row, and 0 otherwise, as shown in the following:

D = D[p:nrow(Digit),];          /* delete first p rows */
X = (D=target);                 /* binary matrix */
print X;

Notice that in SAS/IML, the comparison operator (=) can perform a vector comparison. The binary comparison operator detects that the matrix on the left (D) and the vector on the right (target) both contain three columns. Therefore the operator creates the three-column logical matrix X, as shown. The X matrix has a wonderful property: a row of X contains all 1s if and only if the corresponding row of D matches the target pattern. So to find matches, you just need to sum the values in the rows of X. If the row sum equals the number of digits in the pattern, then that row indicates a place where the target pattern appears in the original sequence.

You can program this test in PROC IML as follows. The subscript reduction operator [,+] is shorthand for "compute the sum of each row over all columns".

/* sum across columns. Which rows contain all 1s? */
b = (X[,+] = p);                /* b[i]=1 if i_th row matches target */
NumRepl = sum(b);               /* how many times does target appear? */
if NumRepl=0 then 
   print "The target does not appear in the digits";
   print "The target appears at location " (loc(b)[1]),  /* print 1st location */
         "The target appears" (NumRepl) "times.";

The program discovered that the target pattern appears in the sequence twice. The first appearance begins with the second digit in the sequence. The pattern also appears in the sequence at the sixth position, although that information is not printed.

Notice that you can solve this problem in SAS/IML without writing any loops. Instead, you can use the LAG function to convert the N-digit sequence into a matrix with N-p rows and p columns. You can then test whether the target pattern matches one of the rows.

Your turn! Can you solve this problem?

Now that I have shown one way to solve the problem, I invite SAS programmers to write their own program to determine whether a specified pattern appears in a numeric sequence. You can use the DATA step, DS2, SQL, or any other SAS language.

Post a comment to submit your best SAS program. Extra points for programs that count all occurrences of the pattern and display the location of the first occurrence.

To help you debug your program, here is test data that we can all use. It contains 10 million random digits:

data Have(keep=Digit);
call streaminit(31488);
do i = 1 to 1e7;
   Digit = floor(10*rand("uniform"));

To help you determine if your program is correct, you can use the following results. In this sequence of digits:

  • The five-digit pattern {1 2 3 4 5} occurs 101 times and the first appearance begins at row 34417
  • The six-digit patter {6 5 4 3 2 1} occurs 15 times and the first appearance begins at row 120920

You can verify these facts by using PROC PRINT as follows:

proc print data=Have(firstobs=34417 obs=34421); run;
proc print data=Have(firstobs=120920 obs=120925); run;

Happy programming!

The post Find a pattern in a sequence of digits appeared first on The DO Loop.

7月 112016

Two of my favorite string-manipulation functions in the SAS DATA step are the COUNTW function and the SCAN function. The COUNTW function counts the number of words in a long string of text. Here "word" means a substring that is delimited by special characters, such as a space character, a period, or a comma. The SCAN function enables you to parse a long string and extract words. You can specify the delimiters yourself or use the default delimiters. Ron Cody discusses these and other string manipulation functions in his excellent 2005 tutorial, "An Introduction to SAS Character Functions."

Using the COUNTW and SCAN functions in the DATA step

For example, the following DATA step reads in a long line of text. The COUNTW function counts how many words are in the string. A loop then iterates over the number of words and the SCAN function extracts each word into a variable:

data parse;
length word $20;                 /* number of characters in the longest word */
input str $ 1-80;
delims = ' ,.!';                 /* delimiters: space, comma, period, ... */
numWords = countw(str, delims);  /* for each line of text, how many words? */
do i = 1 to numWords;            /* split text into words */
   word = scan(str, i, delims);
drop str delims i;
Introduction,to SAS/IML   programming!
Do you have... a question?
proc print data=parse;

Notice that the delimiters do not include the '/' or '?' characters. Therefore these characters are considered to be part of words. For example, the strings "SAS/IML" and "question?" include those non-letter characters. Notice also that consecutive delimiters are automatically excluded, such as extra spaces or the ellipses marks.

Creating a vector of words in SAS/IML

One of the advantages of the SAS/IML matrix language is that you can call the hundreds of functions in Base SAS. When you pass in a vector of arguments to a Base SAS function, the function returns a vector that is the same size and shape as the parameter. In this way, you can vectorize the calling of Base SAS functions. In particular, you can pass in a vector of indices to the SCAN function and get back a vector of words. You do not need to write a loop to extract multiple words, as the following example demonstrates:

proc iml;
s = "Introduction,to SAS/IML... programming!";
delims = ' ,.!'; 
n = countw(s, delims);  
words = scan(s, 1:n, delims);  /* pass parameter vector: create vector of words */
print words;

In summary, Base SAS provides many useful functions such as the string manipulation functions. This article shows that when you call these functions from SAS/IML and pass in a parameter vector, you get back a vector of results.

tags: Getting Started, vectorization

The post Break a sentence into words in SAS appeared first on The DO Loop.

6月 292016

I was a freshman in college the first time I saw the Cantor middle-thirds set and the related Cantor "Devil's staircase" function. (Shown at left.) These constructions expanded my mind and led me to study fractals, real analysis, topology, and other mathematical areas.

The Cantor function and the Cantor middle-thirds set are often used as counter-examples to mathematical conjectures. The Cantor set is defined by a recursive process that requires infinitely many steps. However, you can approximate these pathological objects in a matrix-vector language such as SAS/IML with only a few lines of code!

Construction of the Cantor middle-thirds set

The Cantor middle-thirds set is defined by the following iterative algorithm. The algorithm starts with the closed interval [0,1], then does the following:

  1. In the first step, remove the open middle-thirds of the interval. The two remaining intervals are [0,1/3] and [2/3, 1], which each have length 1/3.
  2. In the ith step, remove the open middle-thirds of all intervals from the (i-1)th step. You now have twice as many intervals and each interval is 1/3 as long as the intervals in the previous step.
  3. Continue this process forever.

After two steps you have four intervals: [0,1/9], [2/9,1/3], [2/3, 7/9], and [8/9,1]. After three steps you have eight intervals of length 1/27. After k steps you have 2k intervals of length 3-k.

The Cantor set is the set of all points that never get removed during this infinite process. The Cantor set clearly contains infinitely many points because it contains the endpoints of the intervals that are removed: 0, 1, 1/3, 2/3, 1/9. 2/9, 7/9, 8/9, and so forth. Less intuitive is the fact that the cardinality of the Cantor set is uncountably infinite even though it is a set of measure zero.

Construction of the Cantor function

The Cantor function F: [0,1] → [0,1] can be defined iteratively in a way that reflects the construction of the Cantor middle-thirds set. The function is shown at the top of this article.

  • At step 0, define the function on the endpoints of the [0,1] interval by F(0)=0 and F(1)=1.
  • At step 1, define the function on the closed interval [1/3, 2/3] to be 1/2.
  • At step 2, define the function on [1/9, 2/9] to be 1/4. Define the function on [7/9, 8/9] to be 3/4.
  • Continue this construction. At each step, the function is defined on the closure of the middle-thirds intervals so that the function value is halfway between adjacent values. The function looks like a staircase with steps of difference heights and widths. The function is constant on every middle-thirds interval, but nevertheless is continuous and monotonically nondecreasing.

Visualize the Cantor staircase function in #SAS.
Click To Tweet

Visualizing the Cantor function in SAS

This is a SAS-related blog, so I want to visualize the Cantor function in SAS. The middle-third intervals during the kth step of the construction have length 3-k, so you can stop the construction after a small number of iterations and get a decent approximation. I'll use k=8 steps.

Although the Cantor set and function were defined geometrically, they have an equivalent definition in terms of decimal expansion. The Cantor set is the set of decimal values that can be written in base 3 without using the '1' digit. In other words, elements of the Cantor set have the form x = 0.a1a2a3... (base 3), where ai equals 0 or 2.

An equivalent definition in terms of fractions is x = Σi ai3-i where ai equals 0 or 2. Although the sum is infinite, you can approximate the Cantor set by truncating the series after finitely many terms. A sum like this can be expressed as an inner product x = a*v` where a is a k-element row vector that contains 0s and 2s and v is a vector that contains the elements {1/3, 1/9, 1/27, ..., 1/3-k}.

You can define B to be a matrix with k columns and 2k rows that contains all combinations of 0s and 2s. Then the matrix product B*v is an approximation to the Cantor set after k steps of the construction. It contains the right-hand endpoints of the middle-third intervals.

In SAS/IML you can use the EXPANDGRID function to create a matrix whose rows contain all combinations of 0s and 2s. The ## operator raises an element to a power. Therefore the following statements construct and visualize the Cantor function. With a little more effort, you can write a few more statements that improve the approximation and add fractional tick marks to the axes, as shown in the graph at the top of this article.

proc iml;
/* rows of B contain all 8-digit combinations of 0s and 2s */ 
B = expandgrid({0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2});
B = B[2:nrow(B),];      /* remove first row of zeros */
k = ncol(B);            /* k = 8 */
v = 3##(-(1:k));       /* vector of powers 3^{-i} */
t = B * v`;            /* x values: right-hand endpts of middle-third intervals */
u = 2##(-(1:k));       /* vector of powers 2^{-i} */
f = B/2 * u`;          /* y values: Cantor function on Cantor set */
call series(t, f);     /* graph the Cantor function */

I think this is a very cool construction. Although the Cantor function is defined iteratively, there are no loops in this program. The loops are replaced by matrix multiplication and vectors. The power of a matrix language is that it enables you to compute complex quantities with only a few lines of programming.

Do you have a favorite example from math or statistics that changed the way that you look at the world? Leave a comment.


This short article cannot discuss all the mathematically interesting features of the Cantor set and Cantor function. The following references are provided for the readers who want additional information:

tags: Just for Fun, vectorization

The post Visualize the Cantor function in SAS appeared first on The DO Loop.

2月 032016

Last week I showed how to use PROC EXPAND to compute moving averages and other rolling statistics in SAS. Unfortunately, PROC EXPAND is part of SAS/ETS software and not every SAS site has a license for SAS/ETS. For simple moving averages, you can write a DATA step program, as discussed in previous post. However, for complex rolling statistics, the SAS/IML language, which enables you to easily access previous observations (and even future ones!), is a more powerful tool for computing rolling statistics.

This article shows how to implement various rolling statistics in SAS/IML. To keep the explanations and programs simple, the functions assume that there are no missing values in the data. The article "What is a moving average" explains the mathematical formulas used in this post.

A simple moving average function

The key to computing most rolling statistics is to define a rolling window of observations. At each time point, you extract the observations in the rolling window and use them to compute the statistic. You then move on to the next time point and repeat the computation. You might need to perform special computations at the beginning of the time series.

The following SAS/IML program implements a simple moving average. The arguments to the MA function are a column vector, Y, of time series values and a scalar value, k, which indicates the number of values to use for each moving average computation. The program loops over elements in Y. For each element, the program computes the mean of the current element and previous k-1 values.

proc iml;
/* Simple moving average of k data values.
   First k-1 values are assigned the mean of all previous values.
   Inputs:     y     column vector of length N >= k
               k     number of data points to use to construct each average
start MA(y, k);
   MA = j(nrow(y), 1, .);
   do i = 1 to nrow(y);
      idx = max(1,(i-k+1)):i;   /* rolling window of data points */
      MA[i] = mean( y[idx] );   /* compute average */
   return ( MA );

The first k-1 values require special handling because these values have fewer than k-1 prior observations to average. You could handle these special values by using a separate loop. However, I chose to use the expression max(1, (i-k+1)) to select the first element for the rolling mean computation. When i is less than k, this expression returns 1 for the first element, and the program computes the mean of the first i values. Otherwise, this expression returns i minus k-1 (which is i-k+1) for the first element, and the program computes the mean of k values.

The most important part of this computation is enumerating the time points to use in the computation (for example, idx = (i-k+1):i;) followed by extracting the associated data (for example, y[idx]). With these two expressions, you can compute any rolling statistic. For example, by changing the function call from MEAN to STD, you can compute a rolling standard deviation. The rolling median, rolling minimum, and rolling maximum are also trivial to implement. By changing the time points, you can compute rolling statistics for centered windows. If your data contain several variables, you can compute a rolling correlation.

A weighted moving average function

The following function computes a weighted moving average. The arguments to the WMA function are a column data vector, Y, and a vector of weights that has k elements. For each time point, wk (the last weight) is the weight for current data value, wk-1 is for the previous data value, and so forth. The function internally standardizes the weights so that they sum to unity. (This ordering was chosen so that the WMA function uses the same syntax as PROC EXPAND.) This function handles the first few values in a separate loop:

/* Weighted moving average of k data values.
   First k-1 values are assigned the weighted mean of all preceding values.
   Inputs:     y     column vector of length N >= k
               wt    column vector of weights. w[k] is weight for most recent 
                      data; wt[1] is for most distant data value.  The function 
                     internally standardizes the weights so that sum(wt)=1.
   Example call: WMA  = WMA(y, 1:5);
start WMA(y, wt);
   w = colvec(wt) / sum(wt);       /* standardize weights so that sum(w)=1 */
   k = nrow(w);
   MA = j(nrow(y), 1, .);
   /* handle first k values separately */
   do i = 1 to k-1;
      wIdx = k-i+1:k;                 /* index for previous i weights */
      tIdx = 1:i;                     /* index for previous i data values */
      MA[i] = sum(wt[wIdx]#y[tIdx]) / sum(wt[wIdx]);  /* weighted average */
   /* main computation: average of current and previous k-1 data values */
   do i = k to nrow(y);
      idx = (i-k+1):i;               /* rolling window of k data points */
      MA[i] = sum( w#y[idx] );       /* weighted sum of k data values */
   return ( MA );

Notice that the function requires computing a weighted mean, which is described in a previous article.

An exponentially weighted moving average function

An exponentially weighted moving average is defined recursively. The average at time t is a weighted average of the data point at time t and the average from time t-1. The relative weights are determined by the smoothing parameter, α. The following function implements that definition:

/* Exponentially weighted moving average (EWMA) with smoothing parameter alpha.
   Inputs:      y     column vector of length N
                alpha scalar value 0 < alpha < 1
start EWMA(y, alpha);
   MA = j(nrow(y), 1, .);
   MA[1] = y[1];              /* initialize first value of smoother */
   do i = 2 to nrow(y);
      MA[i] = alpha*y[i] + (1-alpha)*MA[i-1];
   return ( MA );

The three moving average functions are now defined. You can read the time series data into a vector and call the functions. If necessary, you can write the rolling statistics to a SAS data set:

/* read time series data */
use Sashelp.Air;  
   read all var "date" into t;
   read all var "air" into y;
MA   = MA(y, 5);           /* moving average, k=5 */
WMA  = WMA(y, 1:5);        /* weighted moving average */
EWMA = EWMA(y, 0.3);       /* exponentially WMA, alpha=0.3 */
create out var{t y MA WMA EWMA};  append;  close out;

You can use the SGPLOT procedure to visualize the rolling statistics, as shown in my previous article.

Vectorizing time series computations

The experienced SAS/IML programmer will notice that these functions are not heavily vectorized. The MA and WMA computations use vectors of length k to compute the means and weighted means, respectively. It is possible to write these functions by using a matrix operation, but if the time series has N data points, the transformation matrix is an N x N lower-triangular banded matrix, which requires a lot of memory for large values of N.

Notice that the EWMA function uses scalar quantities inside a loop. For time series computations that use lagged data values, you can sometimes vectorize the time series computations. However, for operations that are defined recursively, such as the EWMA, the effort required to vectorize the computation might exceed the benefit you gain from the vectorization. In many cases, a function that uses a simple loop is fast and easy to maintain.


This article presents functions for computing rolling statistics in SAS/IML. Examples included a simple moving average (MA), a weighted moving average (WMA), and an exponentially weighted moving average (EWMA). The article describes how to modify these function to compute other rolling statistics in SAS.

Computations of rolling statistics might not be easy to vectorize. Even when you can vectorize a computation, a simpler approach might run faster.

tags: Data Analysis, Statistical Programming, vectorization

The post Rolling statistics in SAS/IML appeared first on The DO Loop.

1月 132016

Recently I blogged about how to compute a weighted mean and showed that you can use a weighted mean to compute the center of mass for a system of N point masses in the plane. That led me to think about a related problem: computing the center of mass (called the centroid) for a planar polygon.

If you cut a convex polygon out of stiff cardboard, the centroid is the position where the polygon would balance on the tip of a needle. However, for non-convex polygons, the centroid might not be located in the interior of the polygon.

For a general planar figure, the mathematical formula for computing the centroid requires computing an integral. However, for polygons there are some well-known equations that involve only the locations of the vertices (provided that the vertices are enumerated in a consistent clockwise or counter-clockwise manner). The Wikipedia article about the centroid give a useful computational formula. It turns out that the (signed) area of a simple polygon with vertices (x0, y0), (x1, y1), ..., (xN-1, yN-1), is given by


In the formula, the vertex (xN, yN) is identified with the first vertex, (x0, y0). The centroid of the polygon occurs at the point (Cx, Cy), where

centroidEqn1 centroidEqn2

A polygon whose vertices are enumerated in a counter-clockwise manner will have a positive value for A. The formula for A will be negative when vertices are enumerated in a clockwise manner.

In the SAS language, arrays use 1-based indexing by convention. Therefore, when implementing the formula in SAS, it is customary for the vertices to be enumerated from 1 to N. The limits in the summation formulas are modified similarly. The (N+1)st vertex is identified with the first vertex.

Centroids are sometimes used as a position to label a country or state on a map. For SAS customers who have a license for SAS/GRAPH, the official SAS %CENTROID macro computes the centroid of a polygonal region. It is designed for use with map data sets, which have variables named ID, X, and Y. My colleague Sanjay Matange provides a similar macro in his blog post "A Macro for Polygon Area and Center."

A vectorized SAS/IML function that computes a centroid

Implementing the centroid formulas in SAS/IML is a good exercise in vectorization. Recall that function is vectorized if it avoids unnecessary loops and instead uses vector and matrix operations to perform the computations. For the area and centroid formulas, you can form a vector that contains all of the values that are to be summed. You can use the elementwise multiplication operator (#) to compute the elementwise product of two vectors. The SUM function adds up all elements in a vector.

The following function computes the centroid of a single polygon in a vectorized manner. The first few statements build the vectors from the values of the vertices; the last four statements compute the area and centroid by using vectorized operations. There are no loops.

proc iml;
/* _PolyCentroid: Return the centroid of a simple polygon.
   P  is an N x 2 matrix of (x,y) values for consectutive vertices. N > 2. */
start _PolyCentroid(P);
   lagIdx = 2:nrow(P) || 1;              /* vertices 2, 3, ..., N, 1 */
   xi   = P[ ,1];       yi = P[ ,2];     /* x_i and y_i for i=1..N   */
   xip1 = xi[lagIdx]; yip1 = yi[lagIdx]; /* x_{i+1} and y_{i+1}, x_{N+1}=x_1 */
   d = xi#yip1 - xip1#yi;                /* vector of difference of products */
   A = 0.5 * sum(d);                     /* signed area of polygon */
   cx = sum((xi + xip1) # d) / (6*A);    /* X coord of centroid */
   cy = sum((yi + yip1) # d) / (6*A);    /* Y coord of centroid */
   return( cx || cy );
/* test the function by using an L-shaped polygon */
L = {0 0, 6 0, 6 1, 2 1, 2 3, 0 3};
centroid = _PolyCentroid(L);
print centroid;

For this L-shaped polygon, the centroid is located outside the polygon. This example shows that the centroid might not be the best position for a label for a polygon. Examples of this phenomenon are easy to find on real maps. For example, the states of Louisiana and Florida have centroids that are near the boundaries of the states.

Compute the centroid of many polygons

The %CENTROID macro can compute the centroids for multiple polygons, assuming that there is an identifying categorical variable (called the ID variable) whose unique values distinguish one polygon from another.

If you want to compute the centroids of multiple polygons in SAS/IML, you can use the UNIQUEBY function to extract the rows that correspond to each polygon. For each polygon, the previous _PolyCentroid function is called, as follows:

/* If the polygon P has two columns, return the centroid. If it has three 
   columns, assume that the third column is an ID variable that identifies
   distinct polygons. Return the centroids of the multiple polygons. */
start PolyCentroid(P);
   if ncol(P)=2 then  return( _PolyCentroid(P) );
   ID = P[,3];
   u = uniqueby(ID);         /* starting index for each group */
   result = j(nrow(u), 2);   /* allocate vector to hold results */
   u = u // (nrow(ID)+1);    /* append (N+1) to end of indices */
   do i = 1 to nrow(u)-1;    /* for each group... */
      idx = u[i]:(u[i+1]-1); /* get rows in group */
      result[i,] = _PolyCentroid( P[idx, 1:2] ); 
   return( result );
/* test it on an example */
rect = {0 0,  2 0,  2 1,  0 1};
ID = j(nrow(L), 1, 1) // j(nrow(rect), 1, 2);
P = (L // rect) || ID;
centroids = PolyCentroid(P);
print centroids;

In summary, you can construct an efficient function in SAS/IML to compute the centroid of a simple polygon. The construction shows how to vectorize a computation by putting the vertices in vectors, rather than the less efficient alternative, which is to iterate over the vertices in a loop. Because polygons are often used to display geographical regions, I also included a function that iterates over polygons and computes the centroid of each.

tags: Math, vectorization

The post Compute the centroid of a polygon in SAS appeared first on The DO Loop.

10月 302015

Every year near Halloween I write a trick-and-treat article in which I demonstrate a simple programming trick that is a real treat to use. This year's trick features two of my favorite functions, the CUSUM function and the LAG function. By using these function, you can compute the rows of a matrix (or indices of a vector) that correspond to the beginning and end of groups.


The picture at left illustrates the problem to solve. You have data that contains a grouping or ID variable, and you want to find the first and last observation that corresponds to each group. In the picture, the ID variable is named Group. The beginning the groups are indicated by the green arrows: rows 1, 3, 6, and 8. The end of the groups are located at the red arrows: rows 2, 5, 7, and 10.

Often it is the case that you already know the sizes of the groups from some previous calculation. If you know the sizes, the CUSUM and LAG functions enable you to directly construct the vectors that contain the row numbers. (Of course, if the group sizes are all equal, you can easily solve this problem by using the DO function.)

Recall that the CUSUM function computes the cumulative sum of the elements in a vector. I've used the CUSUM function in more than a dozen blog posts over the years. The LAG function shifts a vector of values and is useful for any vectorized operations that involve comparing or manipulating adjacent values.

To see the CUSUM-LAG trick in action, assume that you know the sizes of each group. For the example in the picture, the group sizes are 2, 3, 2, and 3. The following function returns the vector of rows that begin and end each group::

proc iml;
/* Return kx2 matrix that contains the first and last elements 
   for k groups that have sizes s[1], s[2],...,s[k]. The i_th row contains
   the first and last index for the i_th group. */
start ByGroupIndices( s );
   size = colvec(s);              /* make sure you have a column vector */
   endIdx = cusum(size);          /* locations of ending index */
   beginIdx = 1 + lag(endIdx);    /* begin after each ending index ... */
   beginIdx[1] = 1;               /*    ...and at 1  */
   return ( beginIdx || endIdx );
size = {2, 3, 2, 3};              /* sizes for this example */
idx = ByGroupIndices(size);
print idx[c={"Begin" "End"}];

The CUSUM function computes the end of each group by accumulating the group sizes. The LAG function merely returns a vector of the same size, but with a missing value in the first element and all other elements shifted down one position. To that vector, you add 1 because the next group starts immediately after the previous group ends. Lastly, you just replace the missing value in the first position with 1 because the first group starts at the first row.

I've used the CUSUM-LAG trick many times. It is useful when you are simulating data in which the size of the sample is also being simulated. For example, just last week I used it to simulate contingency tables. Several years ago I used this trick to expand data by using frequencies. In general, you can use it to perform "BY-group processing" in the SAS/IML language.

Happy Halloween to all my readers. I hope you find this trick to be a real treat!

tags: Tips and Techniques, vectorization

The post The CUSUM-LAG trick in SAS/IML appeared first on The DO Loop.

10月 022015

In a previous post I described how to simulate random samples from an urn that contains colored balls. The previous article described the case where the balls can be either of two colors. In that csae, all the distributions are univariate. In this article I examine the case where the urn contains c > 2 colors, which leads to multivariate distributions.

Let K1, K2, ..., Kc be the number of balls for each of the c colors. Let N = Σ Ki be the total number of balls.

The table distribution

The simplest experiment is to reach into the urn and pull out a single ball. The probability of choosing a ball of color i is pi = Ki / N. In SAS you can use the table distribution to specify the probabilities of selecting each integer in the range [1, c]. Some people call the table distribution a categorical distribution or a generalized Bernoulli distribution.

You can use the RAND function in the DATA step to simulate from a categorical distribution. You can also use the RANDGEN function in SAS/IML software, or you can use the SAMPLE function.

The multinomial distribution

If you repeat the experiment n times (replacing the selected ball after recording its value), you obtain a c-tuple of counts. Let ki be the number of times that a ball of color i was selected. Then the result of n draws with replacement is the vector (k1, k2, ..., kc) where Σ ki = n.

By definition, this c-tuple is a single draw from the multinomial distribution. In SAS, you can use the RANDMULTINOMIAL function in SAS/IML to simulate from the multinomial distribution.

The multivariate hypergeometric distribution

When sampling with replacement, the probability of choosing a ball of color i never changes. However, you can also sample without replacement. Without replacement, the probability of choosing a given color changes from draw to draw.

The distribution of the number of colored balls when you sample without replacement is called the multivariate hypergeometric distribution. SAS does not have a built-in function for simulating data from a multivariate hypergeometric distribution, but the following programs shows that you can iteratively use the univariate hypergeometric distribution to construct a multivariate function.

A standard technique for drawing a sample from multiple categories is to reduce the problem to a series of two-category problems by combining colors (Gentle, 2003, p. 206). You can obtain a random number of balls drawn from the first color category by using the hypergeometric distribution with K1 balls of the first color and N - K1 balls in the "other" category. Suppose that k1 balls of the first color are selected. For the second step, there are N - K1 balls of the other colors in the urn and n - k1 balls left to draw. Repeat this process with the second color category. Use the hypergeometric distribution to draw k2 balls of the second color. That leaves (N - K1 - K2) balls of the remaining colors in the urn and (n - k1 - k1) balls left to draw. Continue this process to obtain counts for all the categories.

The following SAS/IML function can return multiple draws from the multivariate hypergeometric distribution. The function is efficient because it exploits the fact that you can pass vectors of parameters to the RANDGEN subroutine. The RandMVHyper function uses a vector of N's (ItemsLeft) and a vector of n's (nDraw) to generate an arbitrary number of independent draws from the multivariate hypergeometric distribution. There is no loop over the number of replications. The only loop is over the number of colors, which is necessary because that loop updates the selection probabilities. (If you study the code, you might want to mentally set nRep = 1 so that the vectors of parameters become scalar quantities.)

proc iml;
/* There are K[i] balls of color i in an urn.
   Draw N balls without replacement and report the number of 
   each color. 
   N[1] = sample size
   N[2] = number of replications (optional. Default is 1) 
   K = vector that gives the number balls of each color.
       The total number of balls is sum(K). 
start RandMVHyper(N, _K);
   if nrow(N)*ncol(N)>1 then nRep = N[2];
   else nRep = 1;
   K = rowvec(_K);               /* K[i] is number of items for category i */
   nDraw = j(nRep, 1, N[1]);     /* run nRep sims at once */
   ItemsLeft = j(nRep, 1, sum(K));
   x = j(nRep, ncol(K), 0);     /* each row is draw from MV hyper */
   h = j(nRep, 1, 0);           /* vec for hypergeometric values  */
   do i = 1 to ncol(K)-1;
      Kvec = j(nRep, 1, K[i]);
      call randgen(h, "Hyper", ItemsLeft, Kvec, nDraw);
      x[,i] = h;
      ItemsLeft = ItemsLeft - K[i];         /* update parameters */
      nDraw = nDraw - h;
   x[,ncol(K)] = nDraw;
   return (x);
call randseed(1234);
/* TEST:      nDraws nRep     K1 K2 K3 */
y = RandMVHyper({100 1000}, {100 40 60});
print (y[1:5,])[c={"black" "white" "red"} L="Draws w/o Replacement"];
mean = mean(y);
print mean;

The test code requests 1,000 replicates. Each replicate is a random vector from the multivariate hypergeometric distribution, namely the result of drawing 100 balls without replacement from an urn that contains 100 black balls, 40 white balls, and 60 red balls. The table shows the first five results. For example, the fourth experiment resulted in 53 black, 24 white, and 23 red balls. Each row sums to 100, which is the number of balls drawn for each experiment. The average counts from the 1,000 experiments were approximately 50 black ball, 20 white balls, and 30 red balls. These are the expected values.

The RandMVHyper function handles most cases, but it can break down when the sample size is small and when some categories (colors) have small marginal sums. You can download a more robust version of the function that handles these edge cases.

In summary:

  • Use the table (generalized Bernoulli) distribution to simulate drawing a colored ball from an urn that contains balls of different colors.
  • Use the multinomial distribution to generate the counts when you sample n balls with replacement. In other words, you repeat n independent and identical generalized Bernoulli trials and tabulate the number of balls that were selected for each color.
  • Use the multivariate hypergeometric distribution to generate the counts when you sample n balls without replacement. This article contains a SAS/IML function that samples from the multivariate hypergeometric distribution.
tags: Simulation, Statistical Thinking, vectorization

The post Balls and urns Part 2: Multi-colored balls appeared first on The DO Loop.