5月 132019

In SAS/IML programs, a common task is to write values in a matrix to a SAS data set. For some programs, the values you want to write are in a matrix and you use the CREATE FROM/APPEND FROM syntax to create the data set, as follows:

proc iml;
X = {1  2  3, 
     4  5  6, 
     7  8  9, 
    10 11 12};
create MyData from X[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from X;                                /* write all rows of X */
close;                                        /* close the data set */

In other programs, the results are computed inside an iterative DO loop. If you can figure out how many observations are generated inside the loop, it is smart to allocate room for the results prior to the loop, assign the rows inside the loop, and then write to a data set after the loop.

However, sometimes you do not know in advance how many results will be generated inside a loop. Examples include certain kinds of simulations and algorithms that iterate until convergence. An example is shown in the following program. Each iteration of the loop generates a different number of rows, which are appended to the Z matrix. If you do not know in advance how many rows Z will eventually contain, you cannot allocate the Z matrix prior to the loop. Instead, a common technique is to use vertical concatenation to append each new result to the previous results, as follows:

/* sometimes it is hard to determine in advance how many rows are in the final result */
free Z;
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Y = j(k , 3, n);         /* k x 3 matrix */
   Z = Z // Y;              /* vertical concatenation of results */
create MyData2 from Z[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from Z;                                 /* write all rows */
close;                                         /* close the data set */

Concatenation within a loop tends to be inefficient. As I like to say, "friends don't let friends concatenate results inside a loop!"

If your ultimate goal is to write the observations to a data set, you can write each sub-result to the data set from inside the DO loop! The APPEND FROM statement writes whatever data are in the specified matrix, and you can call the APPEND FROM statement multiple times. Each call will write the contents of the matrix to the open data set. You can update the matrix or even change the number of rows in the matrix. For example, the following program opens the data set prior to the DO loop, appends to the data set multiple times (each time with a different number of rows), and then closes the data set after the loop ends.

/* alternative: create data set, write to it during the loop, then close it */
Z = {. . .};                /* tell CREATE stmt that data will contain three numerical variables */
create MyData3 from Z[colname={'A' 'B' 'C'}];   /* open before the loop. The TYPE of the variables are known. */
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Z = j(k , 3, n);         /* k x 3 matrix */
   append from Z;           /* write each block of data */
close;                      /* close the data set */

The following output shows the contents of the MyData3 data set, which is identical to the MyData2 data set:

Notice that the CREATE statement must know the number and type (numerical or character) of the data set variables so that it can set up the data set for writing. If you are writing character variables, you also need to specify the length of the variables. I typically use missing values to tell the CREATE statement the number and type of the variables. These values are not written to the data set. It is the APPEND statement that writes data.

I previously wrote about this technique in the article "Writing data in chunks," which was focused on writing large data set that might not fit into RAM. However, the same technique is useful for writing data when the total number of rows is not known until run time. I also use it when running simulations that generate multivariate data. This technique provides a way to write data from inside a DO loop and to avoid concatenating matrices within the loop.

The post Write to a SAS data set from inside a SAS/IML loop appeared first on The DO Loop.

10月 082018

This article compares several ways to find the elements that are common to multiple sets. I test which method is the fastest in the SAS/IML language. However, all algorithms are intrinsically fast, which raises an important question: when is it worth the time and effort to optimize an algorithm?

The idea for this topic came from reading a blog post by 'mikefc' at the "Cool but Useless" blog about the relative performance of various ways to intersect vectors in R. Mikefc used vectors that contained between 10,000 and 100,000 elements and whose intersection was only a few dozen elements. In this article, I increase the sizes of the vectors by an order of magnitude and time the performance of the intersection function in SAS/IML. I also ensure that the sets share a large set of common elements so that the intersection is sizeable.

The intersection of large sets

In general, finding the intersection between sets is a fast operation. In languages such as R, MATLAB, and SAS/IML, you can store the sets in vectors and use built-in functions to form the intersection. Because the intersection of two sets is always smaller than (or equal to) the size of the original sets, computing the "intersection of intersections" is usually faster than computing the intersection of the original (larger) sets.

The following SAS/IML program creates eight SAS/IML vectors that have between 200,000 and 550,000 elements. The elements in the vectors are positive integers, although they could also be character strings or non-integers. The vectors contain a large subset of common elements so that the intersection is sizeable. The vectors A, B, ..., H are created as follows:

proc iml;
call randseed(123);
NIntersect = 5e4; 
common = sample(1:NIntersect, NIntersect, "replace"); /* elements common to all sets */
source = 1:1e6;        /* elements are positive integers less than 1 million */
BaseN = 5e5;           /* approx magnitude of vectors (sets) */
A = sample(source,     BaseN, "replace") || common; /* include common elements */
B = sample(source, 0.9*BaseN, "replace") || common;
C = sample(source, 0.8*BaseN, "replace") || common;
D = sample(source, 0.7*BaseN, "replace") || common;
E = sample(source, 0.6*BaseN, "replace") || common;
F = sample(source, 0.5*BaseN, "replace") || common;
G = sample(source, 0.4*BaseN, "replace") || common;
H = sample(source, 0.3*BaseN, "replace") || common;

The intersection of these vectors contains about 31,600 elements. You can implement the following methods that find the intersection of these vectors:

  1. The XSECT function in SAS/IML accepts up to 15 arguments. You can therefore make a single call to find the intersection.
  2. If you have dozens or hundreds of sets to intersect, you can loop over the sets and call the XSECT function multiple times. (Recall that the VALUE function enables you to loop over the names of the sets and access the vectors by names.) For example, you can let W1 be the intersection of the first two sets, let W2 be the intersection of W1 with the third set, and so forth. Because the size of the intersection is typically smaller than the size of the sets, later intersections require less time to compute than earlier intersections. (Although I only implement pairwise intersections, you could conceivably intersect more than two sets at a time.)
  3. As 'mikefc' mentions, it is quicker to intersect small sets than to intersect large sets. Thus you can preprocess the names of the sets and sort them by size. This heuristic might make a difference when the set sizes vary greatly.

The following SAS/IML statements implement these three methods and compute the time required to intersect each:

/* 1. one call to the built-in method */
t0  = time();
   w = xsect(a,b,c,d,e,f,g,h);
tBuiltin = time() - t0;
/* 2. loop over variable names and form pairwise intersections */
varName = "a":"h";
t0  = time();
   w = value(varName[1]);
   do i = 2 to ncol(varName);
     w = xsect(w, value(varName[i]));  /* compute pairwise intersection */
tPairwise = time() - t0;
/* 3. Sort by size of sets, then loop */
varName = "a":"h";
t0  = time();
   len = j(ncol(varName), 1);    
   do i = 1 to ncol(varName);
     len[i] = ncol(value(varName[i])); /* number of elements in each set */
   call sortndx(idx, len);             /* sort smallest to largest */
   sortName = varName[,idx];
   w = value(sortName[1]);
   do i = 2 to ncol(sortName);
     w = xsect(w, value(sortName[i])); /* compute pairwise intersection */
tSort = time() - t0;
print tBuiltin tPairwise tSort;

For this example data, each method takes about 0.5 seconds to find the intersection of eight sets that contain a total of 3,000,000 elements. If you rerun the analysis, the times will vary by a few hundredths of a second, so the times are essentially equal. ('mikefc' also discusses a few other methods, including a "tally method." The tally method is fast, but it relies on the elements being positive integers.)

Whether to optimize or not

There is a quote that I like: "In theory, there is no difference between theory and practice. But, in practice, there is." (Published by Walter J. Savitch, who claims to have overheard it at a computer science conference.) For the intersection problem, you can argue theoretically that pre-sorting by length is an inexpensive way to potentially obtain a boost in performance. However, in practice (for this example data), pre-sorting improves the performance by less than 1%. Furthermore, the absolute times for this operation are short, so saving a 1% of 0.5 seconds might not be worth the effort.

Even if you never compute the intersection of sets in SAS, there are two take-aways from this exercise:

  • When you measure the performance of algorithms, use example data that is similar to the real data that you expect to encounter. The performance of this problem depends on the size of the sets, the number of sets, and the contents of the sets. The results that I show here are based on my choices for these parameters. Different parameters might yield different results.
  • Always consider the total time for an algorithm before you start to optimize it. If your analysis takes 30 seconds, there is no need to optimize a step in the analysis that takes 0.5 seconds. No matter how well you optimize that step, it is not going to substantially shrink the total run time of your analysis.

The post The intersection of multiple sets appeared first on The DO Loop.

9月 262018

A radial basis function is a scalar function that depends on the distance to some point, called the center point, c. One popular radial basis function is the Gaussian kernel φ(x; c) = exp(-||xc||2 / (2 σ2)), which uses the squared distance from a vector x to the center c to assign a weight. The weighted sum of Gaussian kernels, Σ wi φ(x; c) arises in many applications in statistics, including kernel density estimation, kernel smoothing, and machine learning algorithms such as support vector machines. It is therefore important to be able to efficiently evaluate a radial basis function and compute a weighted sum of several such kernel functions.

One of the many useful features of the SAS/IML language is its ability to compactly represent matrix and vector expressions. The expression ||xc|| looks like the distance between two vectors, but in the SAS/IML language the DISTANCE function can handle multiple sets of vectors:

  • The DISTANCE function can compute the distance between two vectors of arbitrary dimensions. Thus when x and c are both d-dimensional row vectors, you can compute the distance by using r = DISTANCE(x, c). The result is a scalar distance.
  • The DISTANCE function can compute the distance between multiple points and a center. Thus when x is an m x d matrix that contains m points, you can compute the m distances between the points and c by using r = DISTANCE(x, c). Again, the syntax is the same, but now r is an m x 1 vector of distances.
  • The DISTANCE function in SAS/IML 14.3 can compute the distance between multiple points and multiple centers. Thus when x is an m x d matrix that contains m points and c is a p x d matrix that contains p centers, you can compute the m*p distances between the points and c by using r = DISTANCE(x, c). The syntax is the same, but now r is an m x p matrix of distances.

A SAS/IML function that evaluates a Gaussian kernel function

The following SAS/IML statements define a Gaussian kernel function. Notice that the function is very compact! To test the function, define one center at C = (2.3, 3.2). Because SAS/IML is a matrix language, you can evaluate the Gaussian kernel on a grid of integer coordinates (x,y) where x is an integer in the range [1,5] and y is in the range [1,8]. Let Z be the matrix of the 40 ordered pairs. The following call evaluates the Gaussian kernel at the grid of points:

proc iml;
/* Radial basis function (Gaussian kernel). If z is m x d and c is n x d, this function
   returns the mxn matrix of values exp( -||z[i,] - c[j,]||**2 / (2*sigma**2) ) */
start GaussKernel(z, c, sigma=1);
   return exp( -distance(z,c)##2 / (2*sigma**2) );
/* test on small data: Z is an 5 x 8 grid and C = {2.3 3.2} */
xPts = 1:5; yPts = 1:8;
Z = expandgrid(xPts, yPts);              /* expand into (8*5) x 2 matrix */
C = {2.3 3.2};                           /* the center */ 
phi = GaussKernel(Z, C);                 /* phi is 40 x 1 vector */
print Z phi;                             /* print in expanded form */
phi_Grid = shapecol(phi, ncol(yPts));    /* reshape into grid (optional) */
print phi_Grid[c=(char(xPts)) r=(char(yPts)) F=4.2];

The table shows the Gaussian kernel evaluated at the grid points. The columns represent the values at the X locations and the rows indicate the Y locations. The function is largest at the value (x,y)=(2,3) because (2,3) is the grid point closest to the center (2.3, 3.2). The largest value 0.94. Notice that the function is essentially zero at points that are more than 3 units from the center, which you would expect from a Gaussian distribution with σ = 1.

You can use the HEATMAPCONT subroutine to make a heat map of the function values. However, notice that in the matrix the rows increase in the downward direction, whereas in the usual Cartesian coordinate system the Y direction increases upward. Consequently, you need to reverse the rows and the Y-axis labels when you create a heat map:

start PlotValues( v, xPts, yPts );
   G = shapecol(v, ncol(yPts));      /* reshape vector into grid */
   M =  G[nrow(G):1, ];              /* flip Y axis (rows) */
   yRev = yPts[, ncol(yPts):1];      /* reverse the Y-axis labels */
   call heatmapcont(M) xvalues=xPts yValues=yRev;
run PlotValues(phi, xPts, yPts);
Evaluate Gaussian kernel on grid of points. Visualize with a heat map.

Sums of radial basis functions

Locations of 86 US cities with population greater than 200,000

Often the "centers" are the locations of some resource such as a warehouse, a hospital, or an ATM. Let's use the locations of 86 large US cities, which I used in a previous article about spatial data analysis. A graph of the locations of the cities is shown to the right. (Click to enlarge.) The locations are in a standardized coordinate system, so they do not directly correspond to longitudes and latitudes.

If there are multiple centers, the GaussKernel function returns a column for every center. Many applications require a weighted sum of the columns. You can achieve a weighted sum by using a matrix-vector product A*w, where w is a column vector of weights. If you want an unweighted sum, you can use the SAS/IML subscript reduction operator to sum across the columns: A[,+].

For example, the following statements evaluate the Gaussian kernel function at each value in a grid (the Z matrix) and for each of 86 cities (the C matrix). The result is a 3726 x 86 matrix of values. You can use the subscript reduction operator to sum the kernel evaluations over the cities, as shown:

use BigCities;
   read all var {x y} into C;     /* C = (x,y) locations of centers */
   read all var "City";
/* Z = a regular grid in (x,y) coordinates that contains the data */
XGridPts = round( do(-0.4, 0.4, 0.01), 0.001);
YGridPts = round( do(-0.2, 0.25, 0.01), 0.001);
Z = expandgrid( XGridPts, YGridPts );  /* 3,726 points on a 81x46 grid */
phi = GaussKernel(X, C, 0.025);   /* use smaller bandwidth */
sumPhi = phi[,+];                 /* for each grid point, add sum of kernel evaluations */
Sum of 86 Gaussian kernels evaluated on a regular grid

The resulting heat map shows blobs centered at each large city in the data. Locations near isolated cities (such as Oklahoma City) are lighter in color than locations near multiple nearby cities (such as southern California and the New York area) because the image shows the superposition of the kernel functions. At points that are far from any large city, the sum of the Gaussian kernel functions is essentially zero.

In summary, if you work with algorithms that use radial basis functions such as Gaussian kernels, you can use the SAS/IML language to evaluate these functions. By using the matrix features of the language and the fact that the DISTANCE function supports matrices as arguments, you can quickly and efficiently evaluate weighted sums of these kernel functions.

The post Radial basis functions and Gaussian kernels in SAS appeared first on The DO Loop.

8月 212017

When you implement a statistical algorithm in a vector-matrix language such as SAS/IML, R, or MATLAB, you should measure the performance of your implementation, which means that you should time how long a program takes to analyze data of varying sizes and characteristics. There are some general tips that can help you eliminate bottlenecks in your program so that your program is fast as lightning! In fact, it is a little bit frightening how quickly you can become an expert in timing.

The following general principles apply regardless of the language that you use to implement the algorithm:

  1. Use simulation to construct test data of varying sizes. By simulating data, you can easily vary the size of the size while preserving distributional characteristics. For example, you might want to test an algorithm on data that are normally distributed and contain 25, 50, 75, and 100 thousand observations. (Each language has different techniques to simulate data efficiently. For SAS software, see Simulating Data with SAS.)
  2. Vary the distribution of the test data. Concentrate mainly on how the algorithm performs on typical data, but also test examples of "best case" and "worst case" scenarios for which the algorithm might run very quickly or very slowly. If the performance is affected by other factors such as the proportion of outliers or missing values, then include those factors in the study.
  3. If possible, construct timing tests that run between 0.1 and 1 seconds. This length is long enough to be reliably measured but short enough that you can run many tests. If you try to time very short intervals (less than a millisecond), you will discover that the operating system constantly performs unrelated background tasks that can interfere with your timings, which leads to noisy measurements. For ultra-short intervals (less than a microsecond), you can encounter an "uncertainty principle" phenomena in which the very act measuring the performance of an algorithm affects the performance that you are trying to measure.
  4. To reduce timing noise, call the routine multiple times and report the mean or median. If the performance can vary greatly (±20% or more), report the distribution of times by using quantiles or by displaying a box plot.
  5. Use a "burn-in" call to reduce the impact of overhead costs. The first time a routine is called, it might require the operating system to load DLLs or allocate a huge block of memory. When the routine is called a second time, the operating system might have cached the needed information. If so, subsequent calls could be considerably faster than the first.
  6. Use a line plot or box plots to present the performance results, where the vertical axis represents time and the horizontal axis represents the size of the data. If the performance characteristics depend on other factors, you can represent those factors by overlaying multiple lines or by constructing a panel of graphs.

Timing the performance of algorithms in SAS/IML

This article is motivated by an R programmer who was trying to find the fastest way (on average) to check whether a vector contains more than one unique value. The programmer was timing the performance of five different methods in R in hopes of finding the fastest. I will use the same example, but will examine only two SAS/IML methods:

  • The ALL function in the SAS/IML language tests whether all elements of a vector are equal to some specified value. Thus the expression ALL(x = x[1]) is true only when all elements of a vector are the same.
  • The UNIQUE function in the SAS/IML language returns an ordered list of the distinct elements of a vector. Thus the expression (ncol(UNIQUE(x))=1) is true only when a vector containsone distinct value.

The ALL function should be fast to discard non-constant vectors because it a "short-circuiting" operation. That is, as soon as it detects two different values, it returns 0 (false). If you have a vector with 100,000 elements, the ALL function might only examine a small number of elements to determine that the vector is not constant. In contrast, the UNIQUE function should be relatively slow: it always examines all the elements, and it allocates memory to return a sorted list of all unique values.

The following program illustrates many of the best practices. It constructs random binary vectors that contain between 100,000 and 1 million elements. Most elements are 0, but there is a small probability that an element could be 1. Some of the vectors will contain a 1 near the beginning of the vector (the best case for the ALL function), others will contain a 1 near the end (the worst case for ALL).

proc iml;
/* TIP: "Burn-in" by calling each important function once before you start timing */
   x = randfun(N, "Bernoulli", 1/N);   /* Simulate data for size */
   r = all(x = x[1]);          /* method 1: The ALL function */
   r = (ncol(unique(x)) = 1);  /* method 2: The UNIQUE function */
/* end burn-in */
sizes = do(2,10,2)*1e5;        /* TIP: choose sizes so test runs in reasonable time */
NReps = 300;                   /* TIP: call functions multiple times */
TotalTime = j(ncol(sizes), 2);  /* result matrix: for each size, save average time */
do j = 1 to ncol(sizes);        /* TIP: use vectors of different sizes */
   N = sizes[j];
   x = j(N, 1);
   /* TRICK: estimate time to generate the random vectors; subtract that time later */
   t0 = time();
      do i = 1 to NReps;  call randgen(x, "Bernoulli", 1/N);  end;
   tRand = time() - t0;
   /* Method 1: time for NReps calls */
   t0 = time();
   do i = 1 to NReps;
      call randgen(x, "Bernoulli", 1/N);   /* TIP: simulate data for size */
      r = all(x = x[1]);                   /* Method 1: The ALL function */
   TotalTime[j,1] = time() - t0 - tRand;   /* subtract time to generate random numbers */
   /* Method 2: time for NReps calls */
   t0 = time();
   do i = 1 to NReps;
      call randgen(x, "Bernoulli", 1/N);   /* TIP: simulate data for size */
      r = (ncol(unique(x)) = 1);           /* Method 2: The UNIQUE function */
   TotalTime[j,2] = time() - t0 - tRand;   /* subtract time to generate random numbers */
AvgTime = TotalTime / NReps;               /* compute average time per call */
print AvgTime[c={ALL UNIQUE} r=(putn(sizes,'comma9.')) format=6.4];
/* TIP: create a series plot that overlays both curves */
Size = Sizes` // Sizes`;                   /* convert from wide to long data */
Time = AvgTime[,1] // AvgTime[,2];
Group = j(ncol(sizes), 1, "ALL") // j(ncol(sizes), 1, "UNIQUE");
call series(Size, Time) group=Group grid={x y} 
            option="curvelabel" label={"Size", "Average Time per Call"};

The results are shown to the right. The graph shows the average time to determine whether the elements of a vector of size N are unique for N in the range [1e5, 1e6]. Notice that the graph shows the average time out of 300 different samples of data. As expected, the average time for the ALL function is less than the average time for the UNIQUE function. For the ALL function, some of the tests run almost instantaneously, whereas others require longer run times. The method that calls the UNIQUE function has less variation, although the variation is not shown in this graph.

In terms of absolute time, both methods require only a few milliseconds. In relative terms, however, the ALL method is much faster, and the relative difference increases as the size of the data increases.

Notice that the program demonstrates a useful trick. The ALL function runs much faster than the time required to generate a random vector with a million elements. Therefore the time required to generate a vector and determine whether it is constant is dominated by generating the data. To estimate ONLY the time spent by the ALL and UNIQUE functions, you can either pre-compute the data or you can estimate how long it takes to generate the data and subtract that estimate from the total time. Because this particular test requires generating 300 vectors with 1 million elements, pre-computing and storing the vectors would require a lot of RAM, therefore I used the estimation trick.

In conclusion, when you are timing an algorithm in a vector-matrix language, follow the best practices in this article. The most important tip is to call the method multiple times and plot the average (as I have done here) or plot the distribution of times (not shown). Plotting the average gives you an estimate for the expected time whereas plotting the distribution enables you to estimate upper and lower bounds on the times.

SAS/IML programmers can find additional examples of timing performance in the following articles:

The post 6 tips for timing the performance of algorithms appeared first on The DO Loop.

7月 132015

As my colleague Margaret Crevar recently wrote, it is useful to know how long SAS programs take to run. Margaret and others have written about how to use the SAS FULLSTIMER option to monitor the performance of the SAS system. In fact, SAS distributes a macro that enables you to parse SAS logs to extract performance and timing information.

But for researchers who are developing custom algorithms in the SAS/IML language, there is a much simpler way to monitor performance. You can use the TIME function in Base SAS to find out (to within about 15 milliseconds) how long it takes for a SAS/IML function or set of statements to run. I use this function often to assess how an algorithm scales with the size of the input data. The TIME function returns the time of day, so if you call it twice and compute the difference in times, you get the time (in seconds) that elapsed between calls.

For example, suppose that you need to compute eigenvalues of a large symmetric matrix. You might be interested in knowing how the algorithm scales with the size of the (square) input matrix. The following SAS/IML program uses the SQRVECH function to create symmetric matrices of size 500, 1000, ..., 2500. For each matrix the TIME function is called just before and immediately after a call to the EIGVAL function, which computes the eigenvalues of the matrix. The elapsed time is plotted against the size of the matrix:

proc iml;
size = T(do(500, 2500, 250));    /* 500, 1000, ..., 2500 */
time = j(nrow(size), 1);         /* allocate room for results */
call randseed(12345); 
do i = 1 to nrow(size);
   n = size[i];
   r = j(n*(n+1)/2, 1);
   call randgen(r, "uniform");   /* generate random elements */
   A = sqrvech(r);               /* form symmetric matrix */
   t0 = time();                  /* save the current time */
   val = eigval(A);              /* put computation here */
   time[i] = time() - t0;        /* compute elapsed time */
title "Time versus Matrix Size";
call series(size, time) grid={x y};

The line plot (click to enlarge) shows the timing of the eigenvalue computation on square matrices of varying sizes. The computation is very fast when the matrices are less than 1000 x 1000, but takes longer as the matrix grows. For a 2500 x 2500 matrix, the computation takes about 15 seconds.

You can also use the TIME function to compare the performance of two or more different algorithms. For example, you can compare the performance of solving linear systems to help you write efficient programs.

You can also use this technique to time how long SAS procedures take to run: You can use the SUBMIT/ENDSUBMIT statements to call any SAS procedure, which means you can "drive" the performance analysis from the SAS/IML language. This technique is much easier than parsing SAS logs!

Incidentally, the distribution of the eigenvalues for a matrix with random elements that are drawn from a given distribution is a fascinating topic that is part of "random matrix theory." For a peek into this beautiful mathematical topic, see the article "The curious case of random eigenvalues", which discusses the symmetric matrices that I used in today's article. For unsymmetric matrices, the eigenvalues can be complex, and the distribution of the eigenvalues in the complex plane makes beautiful images.

For more details about timing computations and assessing the performance of algorithms, see Chapter 15 of Statistical Programming with SAS/IML Software.

tags: Efficiency, Getting Started

The post Compare the performance of algorithms in SAS appeared first on The DO Loop.

3月 182015

Imagine that you have one million rows of numerical data and you want to determine if a particular "target" value occurs. How might you find where the value occurs?

For univariate data, this is an easy problem. In the SAS DATA step you can use a WHERE clause or a subsetting IF statement to find the rows that contain the target value. In the SAS/IML language you can use the LOC function to find the rows.

For multivariate data, you can use a similar approach, except the code become messier. For example, suppose that you have a data set that contains five numeric variables and you want to test whether the 5-tuple (1, 2, 3, 4, 5) appears in the data. In the DATA step you might write the following statement:

if x1=1 & x2=2 & x3=3 & x4=4 & x5=5 then ...;

A general mathematical principle is that a "target problem" is equivalent to a root-finding problem. The standard mathematical trick is to subtract the target value and then find zeros of the resulting set of numbers. This trick provides the basis for writing a general SAS/IML programs that is vectorized and that can handle an arbitrary number of variables. The following statements generate a matrix that has one million rows and five columns. Each element is an integer 0 through 9.

proc iml;
call randseed(123);
x = floor( 10*randfun({1e6 5}, "Uniform") ); /* matrix of values 0-9 */

Suppose that you want to find whether there is a row that matches the target value {1 2 3 4 5}. The following statements subtract the target value from the data and then find all rows for which the difference is the zero vector:

target = {1 2 3 4 5};
y = x - target;             /* match target ==> y={0 0...0}   */
count = (y=0)[,+];          /* count number of 0s in each row */
idx = loc(count=ncol(target));      /* rows that match target */
if ncol(idx)>0 then 
   print "Target pattern found in " (ncol(idx)) " rows";
else print "Target pattern not found";

The variable idx contains the row numbers for which the pattern occurs. For this random integer matrix, the target pattern appeared four times. Other random matrices might have six, 12, or 15 rows that match the target. It is easy to show that in a random matrix with one million rows, the target value is expected to appear 10 times.

The example in this article shows how to search a numerical matrix for rows that have a particular value. For character matrices, you can use the ROWCATC function in SAS/IML to concatenate the column values into a single vector of strings. You can then use the LOC function to find the rows that match the pattern.

tags: Data Analysis, Efficiency
3月 042015

Evaluating a cumulative distribution function (CDF) can be an expensive operation. Each time you evaluate the CDF for a continuous probability distribution, the software has to perform a numerical integration. (Recall that the CDF at a point x is the integral under the probability density function (PDF) where x is the upper limit of integration. I am assuming that the PDF does not have a closed-form antiderivative.) For example, the following statements compute and graph the CDF for the standard lognormal distribution at 121 points in the domain [0,6]. To create the graph, the software has to compute 121 numerical integrations.

proc iml;
x = do(0, 6, 0.05);
CDF = cdf("Lognormal", x);
title "Standard Lognormal CDF";
call series(x, CDF) grid={x y};


An integral is an accumulation of slices

In contrast, evaluating the PDF is relatively cheap: you only need to evaluate the density function at a series of points. No integrals reqired. This suggests a clever little approximation: Instead of calling the CDF function many times, call the PDF function and use the cumulative sum function (CUSUM) to add together the areas of many slices of the density. For example, it is simple to modify the trapezoidal rule of integration to return a cumulative sum of the trapezoidal areas that approximate the area under a curve, as follows:

start CumTrapIntegral(x,y);
   N = nrow(colvec(x));
   dx    =   x[2:N] - x[1:N-1];
   meanY = ( y[2:N] + y[1:N-1] )/2;
   return( 0 // cusum(dx # meanY) );

With this function you can approximate the (expensive) CDF by using a series of (cheap) PDF evaluations, as follows:

pdf = pdf("Lognormal", x);             /* evaluate PDF */
CDFApprox = CumTrapIntegral(x, pdf);   /* cumulative sums of trapezoidal areas */

How good is the approximation to the CDF? You can plot the difference between the values that were computed by the CDF function and the values that were computed by using the trapeoidal areas:

Difference = CDF - CDFApprox`;
title "Difference between Lognormal CDF and Aproximate CDF";
call Series(x, Difference) grid={x y};

The graph shows that for the lognormal distribution evaluated on [0,6], the approximate CDF differs by less than 0.001. For most values of x, CDF(x) differs from the cumulative trapezoidal approximation by less than 0.0001. This result uses PDF values with a step size of 0.05. In general, smaller step sizes lead to smaller errors, and the error for the trapezoidal rule is proportional to the square of the step size.

Adjustments to the approximation

The lognormal distribution is exactly zero for x < 0, which means that the integral from minus infinity to 0 is zero. However, for distributions that are defined on the whole real line, you need to make a small modification to the algorithm. Nanmely, if you are approximating a cumulative distribution on the interal [a, b], you need to add the CDF value at a to the values returned by the CumTrapIntegral function. For example, the following statements compare the values of the standard normal CDF on [-3, 3] with the cumulative trapezoidal sums:

/* for distributions on [-infinity, x], correct by adding constant */
x = do(-3, 3, 0.05);
CDF = cdf("Normal", x);              /* gold standard */
pdf = pdf("Normal", x);
CDFApprox = CumTrapIntegral(x, pdf); /* cumulative sums */
CDFAdd = cdf("Normal", -3);          /* integral on (-infinity, -3) */
CDFApprox = CDFAdd + CDFApprox`;     /* trapezoidal approximation */
Difference = CDF - CDFApprox;
title "Difference between Normal CDF and Aproximate CDF";
call Series(x, Difference) grid={x y};

The graph shows that the approximation is within ±5e-5 of the values from the CDF function.

When should you use this approximation?

You might be asking the same question that my former math students would ask: "When will I ever use this stuff?" One answer is to use the approximation when you are computing the CDF for a distribution that is not supported by the SAS CDF function. You can implement the CDF by using the QUAD subroutine, but you can use the method in this post when you only need a quick-and-dirty approximation to the CDF.

In my next blog post I will show how to graph the PDF and CDF for a distribution that is not built into SAS.

tags: Efficiency, Numerical Analysis
2月 162015

Friends have to look out for each other.

Sometimes this can be slightly embarrassing. At lunch you might need to tell a friend that he has some tomato sauce on his chin. Or that she has a little spinach stuck between her teeth. Or you might need to tell your friend discreetly that he should examine his zipper before he presents at the staff meeting.

Yes, it can be awkward to speak up. However, it would be more embarrassing to my friends if I don't tell them these things. What kind of a friend would I be if I allow them to walk around all day with food on their face or an unzipped fly?

I consider many of my readers to be friends, so let me whisper a little secret in your ear: "When you program in a matrix-vector language such as SAS/IML, do not concatenate results inside a loop."

It is the programming equivalent of having spinach in your teeth. People will see it and think "how could he have missed such an obvious thing?" It makes you look bad. Some will snicker.

Concatenation inside a loop is a "rookie mistake" that experienced statistical programmers avoid. The better way to accumulate a vector or matrix of results is to allocate an array for the results prior to the loop and fill the array inside the loop.

The following example illustrates why it is more efficient to use the "allocate and fill" method instead of dynamically growing an array by using concatenation. The programmer wants to call the ComputeIt function 10,000 times inside a loop. The function returns a row vector with 100 elements. The first group of statements times how long it takes to dynamically grow the matrix of results. The second group of statements times how long it takes to allocate the matrix outside the loop and fill each row inside the loop:

proc iml;
NumIters = 10000;
start ComputeIt(n);
   return( n:(n+99) );
/* time how long it takes to dynamically grow an array */
t0 = time();
free result;
do i = 1 to NumIters;
   result = result // ComputeIt(i);     /* inefficient */
tConcat = time() - t0;
/* same computation, but allocate array and fill each row */
t0 = time();
result = j(NumIters, 100);
do i = 1 to nrow(result);
   result[i,] = ComputeIt(i);           /* more efficient */
tAlloc = time() - t0;
print tConcat tAlloc;

For this example, forming the result matrix by using concatenation takes almost 3 seconds. Filling rows of a pre-allocated array takes about 0.02 seconds for the same computation. The computations and the results are identical, but one method is one hundred times faster.

The results are not always this dramatic. If the ComputeIt function returns a tiny vector (such as return(2#n);), then the difference in run time is not as extreme. Similarly, if the ComputeIt function takes a long time to run, then it won't matter much whether you concatenate or pre-allocate the result. If each iteration of the loop takes 1 second, then the difference between 10,003 seconds and 10,000 seconds is inconsequential.

Nevertheless, I recommend that you adopt the "allocate and fill" technique as a good programming habit. Good habits pay off and make you look professional. And if someone points to your program and asks, "Why are you doing it this way?", you can pull them aside and whisper, "Friend, let me tell you a secret...."

tags: Efficiency, Tips and Techniques
1月 202015

A common task in SAS/IML programming is finding elements of a SAS/IML matrix that satisfy a logical expression. For example, you might need to know which matrix elements are missing, are negative, or are divisible by 2.

In the DATA step, you can use the WHERE clause to subset data. You can also use IF-THEN/ELSE logic to examine each observation in a data set to determine whether it satisfies a criterion. For data that are stored in a SAS/IML matrix, some programmers write a loop that iterates over the rows and/or columns and use IF-THEN/ELSE logic on each scalar element in the matrix. However, there is an easier and more efficient way: Use the LOC function, which finds the LOCation of elements that satisfy a logical expression.

This article provides examples of using the LOC function and a logical expression to find elements in a matrix that satisfy a criterion. It also points out that the SAS/IML language creates temporary 0/1 matrices from logical expressions.

The following SAS/IML statements define a 3 x 4 matrix. The LOC function is used to find the location of the elements that are missing, negative, and even, respectively:

proc iml;
x = {. -5 2  5,
    -2  . 3  4,
     4  . . -1};
missingLoc = loc(x=.);             /* missing values */
negativeLoc = loc(x^=. & x<0);     /* nonmissing and negative  */
evenLoc = loc(mod(x,2)=0);         /* n is even if (n mod 2)=0 */
print missingLoc, negativeLoc, evenLoc;

The output shows the location of the elements for each criterion. The elements of the matrix are enumerated in row-major order: the first row contains elements 1–4, the second row contains elements 5–8, and the third row contains elements 9–1. The program output shows that missing values are stored at locations 1, 6, 10 and 11. Negative values are stored at locations 2, 5, and 12. Even integers occur at locations 3, 5, 8, and 9.

Even if you are familiar with the LOC function, you might not be familiar with the way SAS/IML passes a logical expression like x=. or x<0 into the LOC function. When the SAS/IML parser sees an expression that involves a comparison operator, it evaluates the logical expression, which creates a temporary matrix of 0s and 1s. That temporary 0/1 matrix is sent to the LOC function. You can observe this behavior by using the RESET PRINTALL statement to display all matrices that SAS/IML creates, including the temporary matrices. If you add the statement

reset printall;

as the second line of the previous program, then the following output is displayed when the first LOC function executes:


The output shows that PROC IML creates a temporary matrix named _TEM1001. The matrix is the result of evaluating the logical expression x=.. Notice that the 0/1 matrix contains a 1 in the locations for which the data are missing and a 0 in the other locations. Similarly, the other logical expressions in the program create temporary matrices with names like _TEM1002 and _TEM1003.

There are two takeaways from this example:

  • Use the LOC function to find the location of matrix elements that satisfy a logical condition. The LOC function is more efficient than an explicit loop over the matrix elements.
  • The argument that is passed to the LOC function is always a matrix. When you specify a logical condition, SAS/IML creates temporary variables from the logical expressions. You can use the RESET PRINTALL statement to see the temporary variables that are created.

For more information, here are two related posts:

tags: Efficiency, Getting Started
7月 022014

A SAS customer showed me a SAS/IML program that he had obtained from a book. The program was taking a long time to run on his data, which was somewhat large. He was wondering if I could identify any inefficiencies in the program.

The first thing I did was to "profile" the program to identify trouble spots. (See Ch. 15, "Timing Computations and the Performance of Algorithms," in Statistical Programming with SAS/IML Software.) I discovered one function that was taking about 80% of the total time. The inefficient function (if I may oversimplify the problem) essentially took a vector x with N elements and returned an N x N indicator matrix, A, whose (i, j)th element is 1 if x[i] ≥ x[i] and is 0 otherwise. A small example of the x vector is used in the following statements. The inefficient code looked something like this:

proc iml;
x = {16, 15, 6, 6, 20, 5, 6, 12, 5, 12};  /* data for pairwise comparison */
N = nrow(x);
A = j(N,N,0);
do i = 1 to N;
   do j = 1 to N;
      if x[i] >= x[j] then 
         A[i,j] = 1;       /* A[i,j] = 1 if x[i] >= x[j]; 0 otherwise */

The 10 x 10 indicator matrix for this example is shown. When N is large, this double DO loop approach can be slow. I have written about similar computations before. For example, I've written about how to compute pairwise differences for a vector of values, which is almost the same problem. The key is to eliminate nested DO loops and replace them with vector or matrix operations, a process known as vectorization.

Pairwise comparisons: An exercise in vectorization

To implement a vector-based computation, think about the jth column of A. What is the jth column? It is the indicator vector obtained by comparing x to the jth value of x. The following statements rewrite the computation by using a single DO loop that iterates over each column of the resulting matrix:

B = j(N,N,0);
do j = 1 to N;
   B[,j] = (x >= x[j]);

Equivalently, you could loop over rows of the indicator matrix by using B[i,] = (x` <= x[i]) (but notice the direction of the inequality!). Either method produces a big speedup in the computation.

A matrix approach is less intuitive, but can be more efficient unless the matrices are huge. The idea is to form a matrix with N columns, each column being a copy of the data vector x. If you compare the elements of this matrix to the elements of its transpose, you obtain the desired indicator matrix:

xx = repeat(x,  1, N);    /* the i_th row is x[i]     */
C = (xx >= xx`);          /* indicator matrix         */

The matrices A, B, and C are equal.

A slight twist

In the previous section, I oversimplified the problem. The real problem evaluated several complicated logical expressions (by using IF-THEN statements) within the doubly-nested DO loops. The result of those logical expressions affected the values of the indicator matrix. To simplify matters, I'll illustrate the situation by using a binary vector and a simple logical AND operator. Assume that the IF-THEN statements resolve to either 1 or 0 (true or false) for each value of the column index, j, as follows:

s = {0, 1, 1, 0, 0, 1, 0, 1, 1, 0};   /* result of logical expressions, j=1 to N */
A = j(N,N,0);
do i = 1 to N;
   do j = 1 to N;
      if x[i] >= x[j] & s[j]=1 then   /* more complicated IF-THEN logic: */
         A[i,j] = 1;                  /* A[i,j] = 1 if x[i] >= x[j] AND s[j]=1 */

Although this new situation is more complicated, the lessons of the previous section remain relevant. Here's how you can compute the indicator matrix by using vector operations:

B = j(N,N,0);
do j = 1 to N;
   B[,j] = (x >= x[j]) & (s[j]=1);

Here's the equivalent matrix operations:

xx = repeat(x, 1, N);
ss = repeat(s, 1, N);
C = (xx >= xx`) & (ss`=1);

If the logic within the doubly-nested loops depends on both rows and columns (i and j), then the variable s becomes a matrix and the computations are modified accordingly.

In summary, many SAS/IML programmers know that looping over rows and columns of a matrix is inefficient and that vectorizing the computations leads to performance improvements. This article applies the same principles to pairwise comparisons of data in a vector. By avoiding scalar comparisons, you can improve the performance of the computation.

tags: Efficiency, vectorization