Statistical Programming

April 16, 2018
 

A colleague and I recently discussed how to generate random permutations without encountering duplicates. Given a set of n items, there are n! permutations. My colleague wants to generate k unique permutations at random from among the total of n!. Said differently, he wants to sample without replacement from the set of all possible permutations.

In SAS, you can use the RANPERM function to generate random permutations. For example, the statement p = ranperm(4) generates a random permutation of the elements in the set {1, 2, 3, 4}. However, the function does not ensure that the permutations it generates will be unique.
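
To see the issue concretely, here is a small illustration (my own, not from the original post): ask for more permutations of a small set than actually exist, and by the pigeonhole principle some rows must repeat.

proc iml;
call randseed(1);
p = ranperm(3, 10);   /* 10 random permutations of {1,2,3}; only 3! = 6 exist, so some rows must be duplicates */
print p;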

I can think of two ways to generate random permutations without replacement. The first is to generate all permutations and then choose a random subset of size k without replacement from the set of n! possibilities. This technique is efficient when n is small or when k is a substantial proportion of n!. The second technique is to randomly generate permutations until you get a total of k that are unique. This technique is efficient when n is large and k is small relative to n!.

Generate all permutations, then sample without replacement

For small sets (say, n ≤ 8), an efficient way to generate random permutations is to generate all permutations and then extract a random sample without replacement. In the SAS/IML language you can use the ALLPERM function to generate all permutations of a set of n items, where n ≤ 18. The ALLPERM function returns a matrix that has n! rows and n columns. Each row is a permutation. You can then use the SAMPLE function to sample without replacement from among the rows, as follows:

proc iml;
call randseed(123);
n = 4;                       /* permutations of n items */
k = 3;                       /* want k unique permutations */
 
p = allperm(n);               /* matrix has n! rows; each row is a permutation */
rows = sample(1:nrow(p), k, "WOR"); /* k random rows w/o replacement from the n! rows */
ranPermWOR = p[rows, ];       /* extract rows */
print ranPermWOR;

Generate random permutations, then check for uniqueness

The matrix of all permutations has n! rows, which gets big fast. If you want only a small number of permutations from among a huge set of possible permutations, it is more efficient to use the RANPERM function to generate permutations, then discard duplicates. Last week I showed how to eliminate duplicate rows from a numeric matrix so that the remaining rows are unique. The following SAS/IML statements use the UniqueRows function, which is defined in the previous post. You must define or load the module before you can use it.

/* <define or load the DupRows and UniqueRows modules HERE> */
 
n = 10;                      /* permutations of n items */
k = 5;                       /* want k unique permutations; k << n! */
p = ranperm(n, 2*k);         /* 2x more permutations than necessary */
U = UniqueRows(p);           /* get unique rows of 2k x n matrix */
if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
else do;
   U = UniqueRows( U // ranperm(n, 10*k) ); /* generate more... repeat as necessary */
   if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
end;
print U;

Notice in the previous statements that the call to RANPERM requests 2*k random permutations, even though we only want k. You should ask for more permutations than you need because some of them might be duplicates. The factor of 2 is ad hoc; it is not based on any statistical consideration.

If k is much smaller than n!, then you might think that the probability of generating a duplicate permutation is small. However, the Birthday Problem teaches us that duplicates arise much more frequently than we might expect, so it is best to expect some duplicates and generate more permutations than necessary. When k is moderately large relative to n!, you might want to use a factor of 5, 10, or even 100. I have not tried to compute the number of permutations that would generate k unique permutations with 95% confidence, but that would be a fun exercise. In my program, if the initial attempt generates fewer than k unique rows, it generates an additional 10*k permutations. You could repeat this process, if necessary.
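
As a rough guide (a back-of-the-envelope sketch of my own, not part of the original analysis), you can compute the probability that m randomly generated permutations are all distinct by using the same product formula that appears in the Birthday Problem:

proc iml;
n = 10;  k = 5;                            /* same parameters as the example above */
T = fact(n);                               /* total number of permutations, n! */
m = 2*k;                                   /* number of permutations requested */
i = 0:(m-1);
probAllUnique = exp( sum(log(1 - i/T)) );  /* product of (1 - i/n!) for i=0..m-1, computed on the log scale */
print probAllUnique;

For n = 10 and m = 10 the probability of a duplicate is tiny, which is why a factor of 2 suffices in the example above; the factor matters more when k is a sizeable fraction of n!.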

In summary, if you want to generate random permutations without any duplicates, you have at least two choices. You can generate all permutations and then extract a random sample of k rows. This method works well for small values of n, such as n ≤ 8. For larger values of n, you might want to generate random permutations (more than k) and use the first k unique rows of the matrix of permutations.

The post Random permutations without duplicates appeared first on The DO Loop.

April 4, 2018
 

Correlation is a statistic that measures how closely two variables are related to each other. The most popular definition of correlation is the Pearson product-moment correlation, which is a measurement of the linear relationship between two variables. Many textbooks stress the linear nature of the Pearson correlation and emphasize that a zero value for the Pearson correlation does not imply that the variables are independent. A classic example is to define X to be a random variable on [-1, 1] and define Y = X². X and Y are clearly not independent, yet you can show that the Pearson correlation between X and Y is 0.
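
A quick way to verify that claim (a sketch of the standard argument, not from the original post): if X is symmetric about 0 (for example, uniform on [-1, 1]), then E[X] = E[X³] = 0, so Cov(X, X²) = E[X·X²] – E[X]E[X²] = E[X³] – 0 = 0, and therefore the Pearson correlation is 0.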

In 2007, G. Szekely, M. Rizzo, and N. Bakirov published a paper in The Annals of Statistics called "Measuring and Testing Dependence by Correlation of Distances." This paper defines a distance-based correlation that can detect nonlinear relationships between variables. This blog post describes the distance correlation and implements it in SAS.

An overview of distance correlation

It is impossible to adequately summarize a 27-page paper from the Annals in a few paragraphs, but I'll try. The Szekely-Rizzo-Bakirov paper defines the distance covariance between two random variables. It shows how to estimate the distance correlation from data samples. A practical implication is that you can estimate the distance correlation by computing two matrices: the matrix of pairwise distances between observations in a sample from X and the analogous distance matrix for observations from Y. If the elements in these matrices co-vary together, we say that X and Y have a large distance correlation. If they do not, they have a small distance correlation.

For motivation, recall that the Pearson covariance between X and Y (which is usually defined as the inner product of two centered vectors) can be written in terms of the raw observations:
Classical covariance formula

The terms (xi – xj) and (yi – yj) can be thought of as the one-dimensional signed distances between the i_th and j_th observations. Szekely et al. replace those terms with centered Euclidean distances D(xi, xj) and define the distance covariance as follows:
Distance covariance formula
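
The formula images are not reproduced here. For reference, a plain-text transcription (mine, consistent with the definitions in the paper and with the program below) is:

Cov(X,Y) = (1/(2n²)) Σ_{i,j} (x_i – x_j)(y_i – y_j)
dCov²(X,Y) = (1/n²) Σ_{i,j} A_{ij} B_{ij}

where A_{ij} and B_{ij} are the doubly centered pairwise distance matrices for the X and Y samples (each distance minus its row mean and column mean, plus the grand mean).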

The distance covariance between random vectors X and Y has the following properties:

  • X and Y are independent if and only if dCov(X,Y) = 0.
  • You can define the distance variance dVar(X) = dCov(X,X) and the distance correlation as dCor(X,Y) = dCov(X,Y) / sqrt( dVar(X) dVar(Y) ) when both variances are positive.
  • 0 ≤ dCor(X,Y) ≤ 1 for all X and Y. Note this is different from the Pearson correlation, for which negative correlation is possible.
  • dCov(X,Y) is defined for random variables in arbitrary dimensions! Because you can compute the distance between observations in any dimensions, you can compute the distance covariance regardless of dimensions. For example, X can be a sample from a 3-dimensional distribution and Y can be a sample from a 5-dimensional distribution.
  • You can use the distance correlation to define a statistical test for independence. I don't have space in this article to discuss this fact further.

Distance correlation in SAS

The following SAS/IML program defines two functions. The first "double centers" a distance matrix by subtracting the row and column marginals. The second is the distCorr function, which computes the Szekely-Rizzo-Bakirov distance covariance, variances, and correlation for two samples that each have n rows. (Recall that X and Y can have more than one column.) The function returns a list of statistics. This list syntax is new to SAS/IML 14.3, so if you are running an older version of SAS, modify the function to return a vector.

proc iml;
start AdjustDist(A);    /* double centers matrix by subtracting row and column marginals */
   rowMean = A[:, ];
   colMean = rowMean`;  /* the same, by symmetry */
   grandMean = rowMean[:];
   A_Adj = A - rowMean - colMean + grandMean;
   return (A_Adj);
finish;
 
/* distance correlation: G. Szekely, M. Rizzo, and N. Bakirov, 2007, Annals of Statistics, 35(6) */
start DistCorr(x, y);
   DX = distance(x);    DY = distance(y); 
   DX = AdjustDist(DX); DY = AdjustDist(DY);
   V2XY = (DX # DY)[:];  /* mean of product of distances */
   V2X  = (DX # DX)[:];  /* mean of squared (adjusted) distance */
   V2Y  = (DY # DY)[:];  /* mean of squared (adjusted) distance */
 
   dCov = sqrt( V2XY );     /* distance covariance estimate */
   denom = sqrt(V2X * V2Y); /* product of std deviations */
   if denom > 0 then R2 = V2XY / denom;    /* R^2 = (dCor)^2 */
   else              R2 = 0;
   dCor = sqrt(R2);     
   T = nrow(DX)*V2XY;   /* test statistic p. 2783. Reject indep when T>=z */
 
   /* return List of results: */
   L = [#"dCov"=dCov,  #"dCor"=dCor,  #"dVarX"=sqrt(V2X), #"dVarY"=sqrt(V2Y), #"T"=T];
   /* or return a vector: L = dCov || dCor || sqrt(V2X) || sqrt(V2Y) || T; */
   return L;
finish;

Let's test the DistCorr function on two 4-element vectors. The following (X,Y) ordered pairs lie on the intersection of the unit circle and the coordinate axes. The Pearson correlation for these observations is 0 because there is no linear association. In contrast, the distance-based correlation is nonzero. The distance correlation detects a relationship between these points (namely, that they lie along the unit circle) and therefore the variables are not independent.

x = {1,0,-1, 0};
y = {0,1, 0,-1};
results = DistCorr(x, y);
/* get items from results */
dCov=results$"dCov";  dCor=results$"dCor";  dVarX=results$"dVarX"; dVarY=results$"dVarY";
/* or from vector: dCov=results[1];  dCor=results[2];  dVarX=results[3]; dVarY=results[4]; */
print dCov dCor dVarX dVarY;
Distance correlation for points on a circle

Examples of distance correlations

Let's look at a few other examples of distance correlations in simulated data. To make it easier to compare the Pearson correlation and the distance correlation, you can define two helper functions that return only those quantities.

Classic example: Y is quadratic function of X

The first example is the classic example (mentioned in the first paragraph of this article) that shows that a Pearson correlation of 0 does not imply independence of variables. The vector X is distributed in [-1,1] and the vector Y is defined as X²:

/* helper functions */
start PearsonCorr(x,y);
   return( corr(x||y)[1,2] );
finish;
start DCorr(x,y);
   results = DistCorr(X, Y);
   return( results$"dCor" );  /* if DistCorr returns vector: return( results[2] ); */
finish;
 
x = do(-1, 1, 0.1)`;    /* X is in [-1, 1] */
y = x##2;               /* Y = X^2 */
PCor = PearsonCorr(x,y);
DCor = DCorr(x,y);
print PCor DCor;
Example where Pearson correlation equals 0 but distance correlation is nonzero

As promised, the Pearson correlation is zero but the distance correlation is nonzero. You can use the distance correlation as part of a formal hypothesis test to conclude that X and Y are not independent.

Distance correlation for multivariate normal data

You might wonder how the distance correlation compares with the Pearson correlation for bivariate normal data. Szekely et al. prove that the distance correlation is at most the absolute value of the population parameter: dCor(X,Y) ≤ |ρ|. The following statements generate a random sample from a bivariate normal distribution with Pearson correlation ρ for a range of positive and negative ρ values.

N = 500;                   /* sample size */
mu = {0 0};  Sigma = I(2); /* parameters for bivariate normal distrib */
rho = do(-0.9, 0.9, 0.1);  /* grid of correlations */
PCor = j(1, ncol(rho), .); /* allocate vectors for results */
DCor = j(1, ncol(rho), .);
call randseed(54321);
do i = 1 to ncol(rho);     /* for each rho, simulate bivariate normal data */
   Sigma[1,2] = rho[i]; Sigma[2,1] = rho[i]; /* population covariance */
   Z = RandNormal(N, mu, Sigma);        /* bivariate normal sample */
   PCor[i] = PearsonCorr(Z[,1], Z[,2]); /* Pearson correlation */
   DCor[i] = DCorr(Z[,1], Z[,2]);       /* distance correlation */
end;

If you graph the Pearson and distance correlation against the parameter values, you obtain the following graph:

Graph of Pearson correlation and distance correlation for samples of bivariate normal data with correlation rho
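
The graph is not reproduced here, but the following sketch (my own; the data set name and labels are arbitrary) shows one way to create it from the PCor and DCor vectors:

rhoVal = rho`;  pearsonEst = PCor`;  distanceEst = DCor`;   /* transpose to column vectors */
create CorrCompare var {"rhoVal" "pearsonEst" "distanceEst"};  append;  close;
submit;
proc sgplot data=CorrCompare;
   series x=rhoVal y=pearsonEst / curvelabel="Pearson";
   series x=rhoVal y=distanceEst / curvelabel="Distance";
   xaxis grid label="Correlation parameter (rho)";
   yaxis grid label="Estimated correlation";
run;
endsubmit;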

You can see that the distance correlation is always positive. It is close to, but in many cases less than, the absolute value of the Pearson estimate. Nevertheless, it is comforting that the distance correlation is closely related to the Pearson correlation for correlated normal data.

Distance correlation for data of different dimensions

As the last example, let's examine the most surprising property of the distance correlation, which is that it enables you to compute correlations between variables of different dimensions. In contrast, the Pearson correlation is defined only for univariate variables. The following statements generate two independent random normal samples with 1000 observations. The variable X is a bivariate normal sample. The variable Y is a univariate normal sample. The distance correlation for the sample is close to 0. (Because the samples were drawn independently, the distance correlation for the populations is zero.)

/* Correlation between a 2-D distribution and a 1-D distribution */
call randseed(12345, 1);          /* reset random number seed */
N = 1000;
mu = {0 0};
Sigma = { 1.0 -0.5, 
         -0.5  1.0};
X = RandNormal(N, mu, Sigma);     /* sample from 2-D distribution */
Y = randfun(N, "Normal", 0, 0.6); /* uncorrelated 1-D sample */
 
DCor = DCorr(X, Y);
print DCor;
Distance correlation for samples of different dimensions

Limitations of the distance correlation

The distance correlation is an intriguing idea. You can use it to test whether two variables (actually, sets of variables) are independent. As I see it, the biggest drawbacks of distance correlation are

  • The distance correlation is always positive because distances are always positive. Most analysts are used to seeing negative correlations when two variables demonstrate a negative linear relationship.
  • The distance correlation for a sample of size n must compute the n(n–1)/2 pairwise distances between observations. This implies that the distance correlation is an O(n²) operation, as opposed to the Pearson correlation, which is a much faster O(n) operation. The implementation in this article explicitly forms the n x n distance matrix, which can become very large. For example, if n = 100,000 observations, each distance matrix requires more than 74 GB of RAM (see the quick arithmetic after this list). There are ways to use less memory, but the distance correlation is still a relatively expensive computation.
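
As a quick check on that memory figure (my arithmetic, not from the original article): a 100,000 × 100,000 matrix of double-precision values contains 10¹⁰ elements, and 10¹⁰ × 8 bytes = 8×10¹⁰ bytes ≈ 74.5 GB, and the computation needs one such matrix for X and another for Y.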

Summary

In summary, this article discusses the Szekely-Rizzo-Bakirov distance-based covariance and correlation. The distance correlation can be used to create a statistical test of independence between two variables or sets of variables. The idea is interesting and appealing for small data sets. Unfortunately, the performance of the algorithm is quadratic in the number of observations, so the algorithm does not scale well to big data.

You can download the SAS/IML program that creates the computations and graphs in this article. If you do not have SAS/IML, T. Billings (2016) wrote a SAS macro that uses PROC DISTANCE to compute the distance correlation between two vectors. Rizzo and Szekely implemented their method in the 'energy' package of the R software product.

The post Distance correlation appeared first on The DO Loop.

March 28, 2018
 

Suppose you want to find observations in multivariate data that are closest to a numerical target value. For example, for the students in the Sashelp.Class data set, you might want to find the students whose (Age, Height, Weight) values are closest to the triplet (13, 62, 100). The way to do this is to compute a distance from each observation to the target. Unfortunately, there are many definitions of distance! Which distance should you use? This article describes and compares a few common distances and shows how to compute them in SAS.

Euclidean and other distances

The two most widely used distances are the Euclidean distance (called the L2 distance by mathematicians) and the "sum of absolute differences" distance, which is better known as the L1 distance or occasionally the taxicab distance. In one dimension, these distances are equal. However, when you have multiple coordinates, the distances are different. The Euclidean and L1 distances between (x1, x2, ..., xn) and a target vector (t1, t2, ..., tn) are defined as follows:
L1 distance: |x1-t1| + |x2-t2| + ... + |xn-tn|
Euclidean distance: sqrt( (x1-t1)² + (x2-t2)² + ... + (xn-tn)² )

Both of these distances are supported in the SAS DATA step. You can use the EUCLID function to compute the Euclidean distance and the SUMABS function to compute the L1 distance. For example, the following DATA step computes the distance from each observation to the target value (Age, Height, Weight) = (13, 62, 100):

data Closest;
/* target (Age, Height, Weight) = (13, 62, 100) */
set Sashelp.Class;
EuclidDist = euclid(Age-13, Height-62, Weight-100);
L1Dist     = sumabs(Age-13, Height-62, Weight-100);
run;
 
/* sort by Euclidean distance */
proc sort data=Closest out=Euclid; by EuclidDist; run;
 
/* plot the distances for each observation */
title "Distance to Target Values";
proc sgplot data=Euclid;
   series x=Name y=EuclidDist / curvelabel="Euclidean";  *datalabel=Weight;
   series x=Name y=L1Dist     / curvelabel="L1";         *datalabel=Weight;
   yaxis grid label="Distance";
   xaxis grid discreteorder=data;
run;
Euclidean and L1 distances between observations and a target value

The graph shows the Euclidean and L1 distances from each student's data to the target value. The X axis is ordered by the Euclidean distance from each observation to the target value. If you add the DATALABEL=Weight or DATALABEL=Height options to the SERIES statements, you can see that the students who appear near the left side of the X axis have heights and weights that are close to the target values (Height, Weight) = (62, 100). The students who appear to the right are much taller/heavier or shorter/lighter than the target values. In particular, Joyce is the smallest student and Philip is the largest.

If you order the students by their L1 distances, you will obtain a different ordering. For example, in an L1 ranking, John and Henry would switch positions and Jeffrey would be ranked 7th instead of 12th.
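
A quick way to see the L1 ordering (a sketch; not part of the original post) is to sort the same output data set by the L1 distance:

proc sort data=Closest out=L1Order; by L1Dist; run;
proc print data=L1Order; var Name L1Dist EuclidDist; run;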

Distances that account for scale

There is a problem with the computations in the previous section: the variables are measured in different units, but the computations do not account for these differences. In particular, Age contributes little to the distance computations because all student ages are within three years of the target age. In contrast, the heaviest student (Philip) is 50 pounds heavier than the target weight.

The distance formula needs to account for amount of variation within each variable. If you run PROC MEANS on the data, you will discover that the standard deviation of the Age variable is about 1.5, whereas the standard deviations for the Height and Weight variables are about 5.1 and 22.8, respectively. It follows that one year should be treated as a substantial difference, whereas one pound of weight should be treated as a small difference.
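
For example, a one-line check of the scales (a sketch; any descriptive procedure would do):

proc means data=Sashelp.Class N mean std;
   var Age Height Weight;
run;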

A way to correct for differences among the scales of variables is to standardize the variables. In SAS, you can use PROC STDIZE to standardize the variables in a variety of ways. By default, the procedure centers each variable by subtracting the sample mean; it scales by dividing by the standard deviation. The following call also writes the means and standard deviations to a data set, which is displayed by using PROC PRINT:

proc stdize data=Sashelp.Class out=StdClass outstat=StdIn method=STD;
   var Age Height Weight;
run;
 
proc print data=StdIn(obs=2);
run;

Notice that the target value is not part of the standardization! Only the data are used. However, you must convert the target value to the new coordinates before you can compute the standardized distances. You can standardize the target value by using the METHOD=IN option in PROC STDIZE to tell the procedure to transform the target value by using the location and scale values in the StdIn data set, as follows:

data Target;
   Age=13; Height=62; Weight=100;
run;
 
proc stdize data=Target out=StdTarget method=in(StdIn);
   var Age Height Weight;
run;
 
proc print data=StdTarget; run;

The data and the target values are now in standardized coordinates. You can therefore repeat the earlier DATA step and compute the Euclidean and L1 distances in the new coordinates. The graph of the standardized distances is shown below:
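
A minimal sketch of that repeated DATA step (my own; it assumes the StdClass and StdTarget data sets created above) looks like this:

data StdClosest;
if _N_ = 1 then set StdTarget(rename=(Age=tAge Height=tHeight Weight=tWeight)); /* read the standardized target once; SET variables are retained */
set StdClass;
EuclidDist = euclid(Age-tAge, Height-tHeight, Weight-tWeight);  /* Euclidean distance in standardized coordinates */
L1Dist     = sumabs(Age-tAge, Height-tHeight, Weight-tWeight);  /* L1 distance in standardized coordinates */
run;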

In the standardized coordinates "one unit" corresponds to one standard deviation from the mean in each variable. Thus Age is measured in units of 1.5 years, Height in units of 5.1 inches, and Weight in units of 22.8 pounds. Notice that some of the students have changed positions. Carol is still very similar to the target value, but Jeffrey is now more similar to the target value than previously thought. The smallest student (Joyce) and largest student (Philip) are still dissimilar (very distant) from the target.

Distances that account for correlation

In the previous section, different scales in the data are handled by using distances in a standardized coordinate system. This is an improvement over the unstandardized computations. However, there is one more commonly used distance computation, called the Mahalanobis distance. The Mahalanobis distance is similar to the standardized L2 distance but also accounts for correlations between the variables. I have previously discussed the meaning of Mahalanobis distance and have described how you can use the inverse Cholesky decomposition to uncorrelate variables. I won't repeat the arguments, but the following graph displays the Mahalanobis distance between each observation and the target value. Some students are in a different order than they were for the standardized distances:

If you would like to examine or modify any of my computations, you can download the SAS program that computes the various distances.

In summary, there are several reasonable definitions of distance between multivariate observations. The simplest is the raw Euclidean distance, which is appropriate when all variables are measured on the same scale. The next simplest is the standardized distance, which is appropriate if the scales of the variables are vastly different and the variables are uncorrelated. The third is the Mahalanobis distance, which becomes important if you want to measure distances in a way that accounts for correlation between variables.


The post Find the distance between observations and a target value appeared first on The DO Loop.

March 19, 2018
 

About once a month I see a question on the SAS Support Communities that involves what I like to call "computations with combinations." A typical question asks how to find k values (from a set of p values) that maximize or minimize some function, such as "I have 5 variables, and for each observation I want to find the largest product among any 3 values."

These types of problems are specific examples of a single abstract problem, as follows:

  1. From a set of p values, generate all subsets that contain k < p elements. Call the subsets Y1, Y2, ..., Yt, where t equals "p choose k". (In SAS, use the COMB function to compute the number of combinations: t = comb(p,k).)
  2. For each subset, evaluate some function on the subset. Call the values z1, z2, ..., zt.
  3. Return some statistic of the zi. Often the statistic is a maximum or minimum, but it could also be a mean, variance, or percentile.

This is an "exhaustive" method that explicitly generates all subsets, so clearly this technique is impractical for large values of p. The examples that I've seen on discussion forums often use p ≤ 10 and small values of k (often 2, 3, or 4). For parameters in this range, an exhaustive solution is feasible.

This general problem includes "leave-one-out" or jackknife estimates as a special case (k = p – 1), so clearly this formulation is both general and powerful. This formulation also includes the knapsack problem in discrete optimization. In the knapsack problem, you have p items and a knapsack that can hold k items. You want to choose the items so that the knapsack holds as much value as possible. The knapsack problem maximizes the sum of the values whereas the general problem in this article can handle nonlinear functions of the values.

Example data

You can use the following DATA step to simulate integer data with a specified number of columns and rows. I use the relatively new "Integer" distribution to generate uniformly distributed integers in the range [-3, 9].

%let p = 5;      /* number of variables */
%let NObs = 6;   /* number of observations */
data have(drop=i j);
call streaminit(123);
array x[&p];
do i = 1 to &NObs;
   do j = 1 to dim(x);
      x[j] = rand("Integer", -3, 9); /* SAS 9.4M4 */
   end;
   output;
end;
run;
 
proc print data=have; run;
Sample data

Computing with combinations in SAS/IML

For p = 5 and k = 3, the problem is: "For each observation of the 5 variables, find the largest product among any 3 values." In the SAS/IML language, you can solve problems like this by using the ALLCOMB function to generate all combinations of size k from the index set {1,2,...,p}. These values are indices that you can use to reference each combination of values. You can evaluate your function on each combination and then compute the max, min, mean, etc. For example, the following SAS/IML statements generate all combinations of 3 values from the set {1, 2, 3, 4, 5}:

proc iml;
p = 5; k = 3;
c = allcomb(p, k);  /* combinations of p items taken k at a time */
print c;
All combinations of 5 elements taken 3 at a time

A cool feature of the SAS/IML language is that you can use these values as column subscripts! In particular, the expression X[i, c] generates all 3-fold combinations of values in the i_th row. You can then use the SHAPE function to reshape the values into a matrix that has 3 columns, as follows:

/* Example: find all combination of elements in the first row */
varNames = "x1":"x5";
use have; read all var varNames into X; close;
Y = X[1, c];                 /* all combinations of columns for 1st row */
M = shape(Y, nrow(c), k);    /* reshape so each row has k elements */
prod = M[, #];               /* product of elements across columns */
print M prod;
A function evaluated each subset of size 3

Notice that each row of the matrix M contains k = 3 elements of Y. There are "5 choose 3" = 10 possible ways to choose 3 items from a set of 5, so the M matrix has 10 rows. Notice that you can use a subscript reduction operator (#) to compute the product of elements for each combination of elements. The maximum three-value product for the first row of data is 24.

The following loop performs this computation for each observation. The result is a vector that contains the maximum three-value product of each row. The original data and the results are then displayed side by side:

/* for each row of X1-X5, find the maximum product of three elements */
result = j(nrow(X), 1);
do i = 1 to nrow(X);
   Y = X[i, c];                  /* get i_th row and all combinations of columns */
   M = shape(Y, nrow(c), k);     /* reshape so each row has k elements */
   result[i] = max( M[,#] );     /* max of the products over all combinations */
end;
print X[colname=varNames] result[L="maxMul"];
Maximum product of three elements from a set of 5

Of course, if the computation for each observation is more complicated than in this example, you can define a function that computes the result and then call the module like this: result[i]= MyFunc(M);
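
For instance, here is a sketch (my own example) of such a module, which returns the median product instead of the maximum:

start MyFunc(M);
   return ( median( M[ ,#] ) );   /* median of the products, one product per combination (row) */
finish;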

Generate all combinations in the DATA step

You can perform a similar computation in the DATA step, but it requires more loops. You can use the ALLCOMBI subroutine (or the LEXCOMBI function) to generate all k-fold combinations of the indices {1, 2, ..., p}. You should call the ALLCOMBI subroutine inside a loop from 1 to NCOMB(p, k). Inside the loop, you can evaluate the objective function on each combination of data values. Many DATA step functions such as MAX, MIN, SMALLEST, and LARGEST accept arrays of variables, so you probably want to store the variables and the indices in arrays. The following DATA step contains comments that describe each step of the program:

%let p = 5;
%let k = 3;
%let NChooseK = %sysfunc(comb(&p,&k));  /* N choose k */
data Want(keep=x1-x&p maxProd);
set have;
array x[&p] x1-x&p;        /* array of data */
array c[&k];               /* array of indices */
array r[&NChoosek];        /* array of results for each combination */
ncomb = comb(&p, &k);      /* number of combinations */
do i=1 to &k; c[i]=0; end; /* zero the array of indices before first call to ALLCOMBI */
do j = 1 to ncomb;
   call allcombi(&p, &k, of c[*]);  /* generate j_th combination of indices */
   /* evaluate function of the array {x[c[1]], x[c[2]], ..., x[c[k]]} */
   r[j] = 1;               /* initialize product to 1 */
   do i=1 to &k; r[j] = r[j] * x[c[i]]; end;  /* product of j_th combination */
end;
maxProd = max(of r[*]);    /* max of products */
output;
run;
 
proc print data=Want; run;
Maximum product of three elements from a set of five

The DATA step uses an array (R) of values to store the result of the function evaluated on each subset. For a MAX or MIN computation, this array is not necessary because you can keep track of the current MAX or MIN inside the loop over combinations. However, for more general problems (for example, find the median value), an array might be necessary.
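
For example (a sketch), with the R array in place you can compute the median of the products by replacing the MAX statement with:

medProd = median(of r[*]);    /* median over all "p choose k" products */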

In summary, this article shows how to solve a general class of problems. The general problem generates all subsets of size k from a set of size p. For each subset, you evaluate a function and produce a statistic. From among the "p choose k" statistics, you then choose the max, min, or some other measure. This article shows how to solve these problems efficiently in the SAS/IML language or in the SAS DATA step. Because this is a "brute force" technique, it is limited to small values of p. I suggest p ≤ 25.

The post Compute with combinations: Maximize a function over combinations of variables appeared first on The DO Loop.

March 7, 2018
 

Data analysts often fit a probability distribution to data. When you have access to the data, a common technique is to use maximum likelihood estimation (MLE) to compute the parameters of a distribution that are "most likely" to have produced the observed data. However, how can you fit a distribution if you do not have access to the data?

This question was asked by a SAS programmer who wanted to fit a gamma distribution by using sample quantiles of the data. In particular, the programmer said, "we have the 50th and 90th percentile" of the data and "want to find the parameters for the gamma distribution [that fit] our data."

This is an interesting question. Recall that the method of moments uses sample moments (mean, variance, skewness,...) to estimate parameters in a distribution. When you use the method of moments, you express the moments of the distribution in terms of the parameters, set the distribution's moments equal to the sample moments, and solve for the parameter values for which the equation is true.

In a similar way, you can fit a distribution matching quantiles: Equate the sample and distributional quantiles and solve for the parameters of the distribution. This is sometimes called quantile-matching estimation (QME). Because the quantiles involve the cumulative distribution function (CDF), the equation does not usually have a closed-form solution and must be solved numerically.

Fit a two-parameter distribution from two quantiles

CDF that matches quantiles of a sample

To answer the programmer's question, suppose you do not have the original data, but you are told that the 50th percentile (median) of the data is x = 4 and the 90th percentile is x = 8. You suspect that the data are distributed according to a gamma distribution, which has a shape parameter (α) and a scale parameter (β). To use quantile-matching estimation, set F(4; α, β) = 0.5 and F(8; α, β) = 0.9, where F is the cumulative distribution of the Gamma(α, β) distribution. You can then solve for the values of (α, β) that satisfy the equations. You will get a CDF that matches the quantiles of the data, as shown to the right.

I have previously written about four ways to solve nonlinear equations in SAS. One way is to use PROC MODEL, as shown below:

data initial;
   alpha=1; beta=1;    /* initial guess for finding root */
   p1=0.5;  X1 = 4;    /* eqn for 1st quantile: F(X1; alpha, beta) = p1 */
   p2=0.9; X2 =  8;    /* eqn for 2nd quantile: F(X2; alpha, beta) = p2 */
run;
 
proc model data=initial;
  eq.one = cdf("Gamma", X1, alpha, beta) - p1;   /* find root of eq1 */
  eq.two = cdf("Gamma", X2, alpha, beta) - p2;   /*    and eq2 */
  solve alpha beta / solveprint out=solved outpredict;
run;quit;
 
proc print data=solved noobs;
   var alpha beta;
run;
Quantile-matching estimates: parameters estimated from observed quantiles

The output indicates that the parameters (α, β) = (2.96, 1.52) are the values for which the Gamma(α, β) quantiles match the sample quantiles. You can see this by graphing the CDF function and adding reference lines at the 50th and 90th percentiles, as shown at the beginning of this section. The following SAS code creates the graph:

/* Graph the CDF function to verify that the solution makes sense */
data Check;
set solved;            /* estimates of (alpha, beta) from solving eqns */
do x = 0 to 12 by 0.2;
   CDF = cdf("gamma", x, alpha, beta);
   output;
end;
run;
 
title "CDF of Gamma Distribution";
title2 "Showing 50th and 90th Percentiles";
proc sgplot data=Check;
   series x=x y=CDF / curvelabel;
   dropline y=0.5 X=4 / dropto=both;   /* first percentile */
   dropline y=0.9 X=8 / dropto=both;   /* second percentile */
   yaxis values=(0 to 1 by 0.1) label="Cumulative Probability";
   xaxis values=(0 to 12 by 2);
run;

Least squares estimates for matching quantiles

The previous section is relevant when you have as many sample quantiles as parameters. If you have more sample quantiles than parameters, then the system is overconstrained and you probably want to compute a least squares solution. If there are m sample quantiles, the least squares solution is the set of parameters that minimizes the sum of squares Σi (pi – F(xi; α, β))², where the sum is over the m sample quantiles.

For example, the following DATA step contains five sample quantiles. The observation (p,q) = (0.1, 1.48) indicates that the 10th percentile is x=1.48. The second observation indicates that the 25th percentile is x=2.50. The last observation indicates that the 90th percentile is x=7.99. You can use PROC NLIN to find a least squares solution to the quantile-matching problem, as follows:

data SampleQntls;
input p q;  /* p is cumul probability; q is p_th sample quantile */
datalines;
0.1  1.48 
0.25 2.50
0.5  4.25
0.75 6.00
0.9  7.99 
;
 
/* least squares fit of parameters */
proc nlin data=SampleQntls /* sometimes the NOHALVE option is useful */
          outest=PE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   model p = cdf("Gamma", q, alpha, beta);
run;
 
proc print data=PE noobs;
   var alpha beta;
run;
Least squares estimates based on matching sample quantiles

The solution indicates that the parameter values (α, β) = (2.72, 1.70) minimize the sum of squares between the observed and theoretical quantiles. The following graph shows the observed quantiles overlaid on the CDF of the fitted Gamma(α, β) distribution. Alternatively, you can graph the quantile-quantile plot of the observed and fitted quantiles.

CDF for fitted distribution where parameters are based on matching sample quantiles

Weighted least squares estimates for matching quantiles

For small samples, quantiles in the tail of a distribution have a large standard error, which means that the observed quantile might not be close to the theoretical quantile. One way to handle that uncertainty is to compute a weighted regression analysis where each sample quantile is weighted by the inverse of its variance. According to Stuart and Ord (Kendall's Advanced Theory of Statistics, 1994, section 10.10), the variance of the p_th sample quantile in a sample of size n is σ² = p(1–p) / (n f(ξp)²), where ξp is the p_th quantile of the distribution and f is the probability density function.

In PROC NLIN, you can perform weighted analysis by using the automatic variable _WEIGHT_. The following statements define the variance of the p_th sample quantile and define weights equal to the inverse variance. Notice the NOHALVE option, which can be useful for iteratively reweighted least squares problems. The option eliminates the requirement that the weighted sum of squares must decrease at every iteration.

/* weighted least squares fit where w[i] = 1/variance[i] */
proc nlin data=SampleQntls NOHALVE
          outest=WPE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   N = 80;                            /* sample size */
   xi = quantile("gamma", p, alpha, beta); /* quantile of distrib */
   f = pdf("Gamma", xi, alpha, beta); /* density at quantile */
   variance = p*(1-p) / (N * f**2);   /* variance of sample quantiles */
   _weight_ = 1 / variance;           /* weight for each observation */
   model p = cdf("Gamma", q, alpha, beta);
run;
Parameter estimates where parameters are based on a weighted matching of observed quantiles

The parameter estimates for the weighted analysis are slightly different than for the unweighted analysis. The following graph shows the CDF for the weighted estimates, which does not pass as close to the 75th and 90th percentiles as does the CDF for the unweighted estimates. This is because the PDF of the gamma distribution is relatively small for those quantiles, which causes the regression to underweight those sample quantiles.

CDF for fitted distribution where parameters are based on a weighted matching of observed quantiles

In summary, this article shows how to use SAS to fit distribution parameters to observed quantiles by using quantile-matching estimation (QME). If the number of quantiles is the same as the number of parameters, you can numerically solve for the parameters for which the quantiles of the distribution equal the sample quantiles. If you have more quantiles than parameters, you can compute a least squares estimate of the parameters. Because quantile estimates in the tail of a distribution have larger uncertainty, you might want to underweight those quantiles. One way to do that is to run a weighted least squares regression where the weights are inversely proportional to the variance of the sample quantiles.

The post Fit a distribution from quantiles appeared first on The DO Loop.

February 21, 2018
 

This article describes and implements a fast algorithm that estimates a median for very large samples. The traditional median estimate sorts a sample of size N and returns the middle value (when N is odd). The algorithm in this article uses Monte Carlo techniques to estimate the median much faster.

Of course, there's no such thing as a free lunch. The Monte Carlo estimate is an approximation. It is useful when a quick-and-dirty estimate is more useful than a more precise value that takes longer to compute. For example, if you want to find the median value of 10 billion credit card transactions, it might not matter whether the median is $16.74 or $16.75. Either would be an adequate estimate for many business decisions.

Although the traditional sample median is commonly used, it is merely an estimate. All sample quantiles are estimates of a population quantity, and there are many ways to estimate quantiles in statistical software. To estimate these quantities faster, researchers have developed approximate methods such as the piecewise-parabolic algorithm in PROC MEANS or "probabilistic methods" such as the ones in this article.

Example data: A sample from a mixture distribution

Sample from mixture distribution showing sample median

Consider the following data set, which simulates 10 million observations from a mixture of an exponential and a normal distribution. You can compute the exact median for the distribution, which is 8.065. A histogram of the data and the population median are shown to the right.

Because the sample size is large, PROC MEANS takes a few seconds to compute the sample median by using the traditional algorithm which sorts the data (an O(N log N) operation) and then returns the middle value. The sample median is 8.065.

/* simulate from a mixture distribution. See
   https://blogs.sas.com/content/iml/2011/09/21/generate-a-random-sample-from-a-mixture-distribution.html */
%let N = 1e7;
data Have;
call streaminit(12345);
do i = 1 to &N;
   d = rand("Bernoulli", 0.6);
   if d = 0 then
      x = rand("Exponential", 0.5);
   else
      x = rand("Normal", 10, 2);
   output;
end;
keep x;
run;
 
/* quantile estimate */
proc means data=Have N Median;
   var x;
run;

A naive resampling algorithm

The basic idea of resampling methods is to extract a random subset of the data and compute statistics on the subset. Although the inferential statistics (standard errors and confidence interval) on the subset are not valid for the original data, the point estimates are typically close, especially when the original data and the subsample are large.

The simplest form of a resampling algorithm is to generate a random subsample of the data and compute the median of the subsample. The following example uses PROC SURVEYSELECT to resample (with replacement) from the data. The subsample is one tenth as large as the original sample. The subsequent call to PROC MEANS computes the median of the subsample, which is 8.063.

proc surveyselect data=Have seed=1234567 NOPRINT 
     out=SubSample
     method=urs samprate=0.1;                 /* 10% sampling with replacement */
run;
 
title "Estimate from 10% of the data";
proc means data=SubSample N Median;
   freq NumberHits;
   var x;
run;

It takes less time to extract a 10% subset and compute the median than it takes to compute the median of the full data. This naive resampling is simple to implement and often works well, as in this case. The estimate on the resampled data is close to the true population median, which is 8.065. (However, the standard error of the traditional estimate is smaller.)

You might worry, however, that by discarding 90% of the data you might discard too much information. In fact, there is a 90% chance that we discarded the middle data value! Thus it is wise to construct the subsample more carefully. The next section constructs a subsample that excludes data from the tails of the distribution and keeps data from the middle of the distribution.

A probabilistic algorithm for the median

This section presents a more sophisticated approximate algorithm for the median. My presentation of this Monte Carlo algorithm is based on the lecture notes of Prof. H-K Hon and a Web page for computer scientists. Unfortunately, neither source credits the original author of this algorithm. If you know the original reference, please post it in the comments.

Begin with a set S that contains N observations. The following algorithm returns an approximate median with high probability:

  1. Choose a random sample (with replacement) of size k = N^(3/4) from the data. Call the sample R.
  2. Choose lower and upper "sentinels," L and U. The sentinels are the lower and upper quantiles of R that correspond to the ranks k/2 ± sqrt(N). These sentinels define a symmetric interval about the median of R.
  3. (Optional) Verify that the interval [L, U] contains the median value of the original data.
  4. Let C be the subset of the original observations that are within the interval [L, U]. This is a small subset.
  5. Return the median of C.

The idea is to use the random sample only to find an interval that probably contains the median. The width of the interval depends on the size of the data. If N=10,000, the algorithm uses the 40th and 60th percentiles of the random subset R. For N=1E8, it uses the 49th and 51st percentiles.
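
To see where those percentages come from (my arithmetic, following the definitions above): the sentinels sit at ranks k/2 ± sqrt(N) out of k = N^(3/4) values, so as a fraction of k they sit at 1/2 ± sqrt(N)/N^(3/4) = 1/2 ± N^(-1/4). For N = 10,000, N^(-1/4) = 0.1, which gives the 40th and 60th percentiles; for N = 1E8, N^(-1/4) = 0.01, which gives the 49th and 51st percentiles.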

The following SAS/IML program implements the Monte Carlo estimate for the median:

proc iml;
/* Choose a set R of k = N##(3/4) elements in x, chosen  at random with replacement. 
   Choose quantiles of R to form an interval [L,U].
   Return the median of data that are within [L,U]. */
start MedianRand(x);
   N = nrow(x);
   k = ceil(N##0.75);    /* smaller sample size */
   R = T(sample(x, k));  /* bootstrap sample of size k (with replacement) */
   call sort(R);         /* Sort this subset of data */
   L = R[ floor(k/2 - sqrt(N)) ]; /* lower sentinel */
   U = R[  ceil(k/2 + sqrt(N)) ]; /* upper sentinel */
   UTail = (x > L);       /* indicator: values above the lower sentinel */
   LTail = (x < U);       /* indicator: values below the upper sentinel */
   if sum(UTail) > N/2 & sum(LTail) > N/2 then do;
      C = x[loc(UTail & LTail)]; /* extract central portion of data */
      return ( median(C) );      
   end;
   else do;
      print "Median not between sentinels!";
      return (.);
   end;
finish;
 
/* run the algorithm on example data */
use Have; read all var {"x"}; close;
call randseed(456);
 
t1 = time();
   approxMed = MedianRand(x);  /* compute approximate median; time it */
tApproxMed = time()-t1;
 
t0 = time();
   med = median(x);            /* compute traditional estimate; time it */
tMedian = time()-t0;
 
results = (approxMed || med) // (tApproxMed|| tMedian);
print results[colname={"Approx" "Traditional"}
              rowname={"Estimate" "Time"} F=Best6.];

The output indicates that the Monte Carlo algorithm gives an estimate that is close to the traditional estimate, but produces it three times faster. If you repeat this experiment for 1E8 observations, the Monte Carlo algorithm computes an estimate in 2.4 seconds versus 11.4 seconds for the traditional estimate, which is almost five times faster. As the data size grows, the speed advantage of the Monte Carlo algorithm increases. See Prof. Hon's notes for more details.

In summary, this article shows how to implement a probabilistic algorithm for estimating the median for large data. The Monte Carlo algorithm uses a bootstrap subsample to estimate a symmetric interval that probably contains the true median, then uses the interval to strategically choose and analyze only part of the original data. For large data, this Monte Carlo estimate is much faster than the traditional estimate.

The post A Monte Carlo algorithm to estimate a median appeared first on The DO Loop.

February 19, 2018
 

Your statistical software probably provides a function that computes quantiles of common probability distributions such as the normal, exponential, and beta distributions. Because there are infinitely many probability distributions, you might encounter a distribution for which a built-in quantile function is not implemented. No problem! This article shows how to numerically compute the quantiles of any probability distribution from the definition of the cumulative distribution (CDF).

In SAS, the QUANTILE function computes the quantiles for about 25 distributions. This article shows how you can use numerical root-finding methods (and possibly numerical integration) in SAS/IML software to compute the quantile function for ANY continuous distribution. I have previously written about related topics and particular examples, such as the following:

The quantile is the root of an integral equation

Quantiles are the solutions to the equation CDF(x)-p=0, where p is a probability

Computing a quantile would make a good final exam question for an undergraduate class in numerical analysis. Although some distributions have an explicit CDF, many distributions are defined only by a probability density function (the PDF, f(x)) and numerical integration must be used to compute the cumulative distribution (the CDF, F(x)). A canonical example is the normal distribution. I've previously shown how to use numerical integration to compute a CDF from a PDF by using the definition F(x) = ∫ f(t) dt, where the lower limit of the integral is –∞ and the upper limit is x.

Whether the CDF is defined analytically or through numerical integration, the quantile for p is found implicitly as the solution to the equation F(x) = p, where p is a probability in the interval (0,1). This is illustrated by the graph at the right.

Equivalently, you can define G(x; p) = F(x) – p so that the quantile is the root of the equation G(x; p) = 0. For well-behaved densities that occur in practice, a numerical root is easily found because the CDF is monotonically increasing. (If you like pathological functions, see the Cantor staircase distribution.)

Example: Create a custom distribution

SAS/IML software provides the QUAD subroutine, which provides numerical integration, and the FROOT function, which solves for roots. Thus SAS/IML is an ideal computational environment for computing quantiles for custom distributions.

As an example, consider a distribution that is a mixture of an exponential and a normal distribution:
F(x) = 0.4 Fexp(x; 0.5) + 0.6 Φ(x; 10, 2),
where Fexp(x; 0.5) is the exponential distribution with scale parameter 0.5 and Φ(x; 10, 2) is the normal CDF with mean 10 and standard deviation 2. In this case, you do not need to use numerical integration to compute the CDF. You can compute the CDF as a linear combination of the exponential and normal CDFs, as shown in the following SAS/IML function:

/* program to numerically find quantiles for a custom distribution */
proc iml;
/* Define the cumulative distribution function here. */
start CustomCDF(x); 
   F = 0.4*cdf("Exponential", x, 0.5) +
       0.6*cdf("Normal", x, 10, 2);
   return F;
finish;

The previous section shows the graph of the CDF on the interval [0, 16]. The vertical and horizontal lines correspond to the first, second and third quartiles of the distribution. The quartiles are close to the values Q1 ≈ 0.5, Q2 ≈ 8, and Q3 ≈ 10.5. The next section shows how to compute the quantiles.

Compute quantiles for an arbitrary distribution

As long as you can define a function that evaluates the CDF, you can find quantiles. For unbounded distributions, it is usually helpful to plot the CDF so that you can visually estimate an interval that contains the quantile. (For bounded distributions, the support of the distribution contains all quantiles.) For the mixture distribution in the previous section, it is clear that the quantiles are in the interval [0, 16].

The following program finds arbitrary quantiles for whichever CDF is evaluated by the CustomCDF function. To find quantiles for a different function, you can modify the CustomCDF and change the interval on which to find the quantiles. You do not need to modify the RootFunc or CustomQuantile functions.

/* Express CDF(x)=p as the root for the function CDF(x)-p. */
start RootFunc(x) global(gProb);
   return CustomCDF(x) - gProb;  /* quantile for p is root of CDF(x)-p */
finish;
 
/* You need to provide an interval on which to search for the quantiles. */
start CustomQuantile(p, Interval) global(gProb);
   q = j(nrow(p), ncol(p), .);              /* allocate result matrix */
   do i = 1 to nrow(p)*ncol(p);             /* for each element of p... */
      gProb = p[i];                         /*    set global variable   */
      q[i] = froot("RootFunc", Interval);   /*    find root (quantile)  */ 
   end;
   return q;
finish;
 
/* Example: look for quartiles in interval [0, 16] */
probs = {0.25 0.5 0.75};         /* Q1, Q2, Q3 */
intvl = {0 16};                  /* interval on which to search for quantiles */
quartiles = CustomQuantile(probs, intvl);
print quartiles[colname={Q1 Q2 Q3}];
Quartiles for mixture of exponential and normal distributions

In summary, you can compute an arbitrary quantile of an arbitrary continuous distribution if you can (1) evaluate the CDF at any point and (2) numerically solve the equation CDF(x) – p = 0 for a probability value p. Because the support of the distribution is arbitrary, the implementation requires that you provide an interval [a,b] that contains the quantile.

The computation should be robust and accurate for non-pathological distributions provided that the density is not tiny or zero at the value of the quantile. Although this example is illustrated in SAS, the same method will work in other software.

The post Compute the quantiles of any distribution appeared first on The DO Loop.

January 24, 2018
 

A popular way to use lists in the SAS/IML language is to pack together several related matrices into a single data structure that can be passed to a function. Imagine that you have written an algorithm that requires a dozen different parameters. Historically, you would have to pass those parameters to the function by explicitly listing each argument. If you use lists (introduced in SAS/IML 14.2, which is SAS 9.4M4), you can pack all parameters into one list and pass that list into the function module where the individual items can be extracted as needed.

Example: An algorithm that requires multiple parameters

To illustrate this list-passing technique, consider the algorithm for Newton's method, which is an iterative method for finding the roots of a nonlinear system of equations. An implementation of Newton's method is included in the SAS/IML documentation. Newton's method requires several arguments:

  • The name of a module that evaluates the function whose roots are sought.
  • The name of a module that evaluates the Jacobian matrix (the derivative of the function).
  • An initial guess for the solution.
  • The maximum number of iterations before the algorithm decides that the method is not converging.
  • A convergence criterion that determines the accuracy of the solution.

The example in the documentation (which was written in the 1980s) uses global variables and hard-codes the names of functions and the convergence criteria. The example is meant to demonstrate the basic ideas, but is not reusable. You can make the example more flexible and robust by passing parameters to the module as shown in the next section.

Create a list of parameters

Suppose that you define the modules F and DF to evaluate a multivariate function and the matrix of partial derivatives (the Jacobian matrix), respectively. The following modules evaluate the same mathematical function and derivatives that are used in the SAS/IML documentation:

proc iml;
/* Newton's method to solve for roots of the system of equations
  F(x1,x2) = [ x1+x2-x1*x2+2 ] = [ 0 ]
             [ x1*exp(-x2)-1 ]   [ 0 ]
*/
/* 1. define a function F that evaluates a multivariate function */
start F(x);
   x1 = x[1];   x2 = x[2];                 /* extract the values */
   f = (x1+x2-x1*x2+2) // (x1*exp(-x2)-1); /* evaluate the function */
   return ( f );
finish F;
 
/* 2. define a function DF that evaluates the Jacobian of F */
start DF(x);
   x1 = x[1];   x2 = x[2];                 /* extract the values */
   J = {. ., . .};
   J[1,1] = 1-x2;       J[1,2] = 1-x1;
   J[2,1] = exp(-x2);   J[2,2] =  -x1*exp(-x2);
   return ( J );
finish DF;

As shown in a previous article, you can create a named list that contains the parameters for Newton's algorithm. In SAS/IML 14.3, you can create the list by specifying a sequence of name-value pairs:

/* 3. list of NAME  = VALUE pairs  (syntax uses SAS/IML 14.3) */
args = [#"Func"     = "F",         /* name of module that evaluates the function */
        #"Deriv"    = "DF",        /* name of module that evaluates the deivative */ 
        #"x0"       = {0.1, -2.0}, /* initial guess for solution */
        #"MaxIters" = 100,         /* maximum iterations */
        #"ConvCrit" = 1e-6 ];      /* convergence criterion */

In SAS/IML 14.3, you can use the list item operator ($) to specify an item in a list. You can also get the name of the function and use CALL EXECUTE to evaluate the function at the initial guess, as follows:

x = args$"x0";                           /* initial guess */ 
EvalFunc = "y=" + args$"Func" + "(x);";  /* string for "y=F(x);" */
call execute(EvalFunc);                  /* evaluates function at x */
print EvalFunc, y;

This is a useful trick for evaluating a function by name. The next section uses this trick inside a module that implements Newton's method.

Pass a list of parameters to and from a SAS/IML module

You can now implement Newton's method by defining a SAS/IML module that takes one argument, which is a list. Inside the module, the parameters are extracted from the list and used to evaluate the functions F and DF and to control the iteration and convergence of Newton's method, as follows:

/* 4. Newton's method for finding the roots of F(x)=0. The argument to the module is a list. */
start Newton( L );
   x = L$"x0";                              /* initial guess */
   EvalFunc = "y=" + L$"Func" + "(x);";     /* string for "y=F(x);" */
   EvalDeriv = "D=" + L$"Deriv" + "(x);";   /* string for "D=DF(x);" */
 
   call execute(EvalFunc);          /* evaluate F at starting values */
   converged = (max(abs(y)) < L$"ConvCrit");              /* 0 or 1 */
   do iter = 1 to L$"MaxIters"      /* iterate until max iterations */
             while( ^converged );   /*      or until convergence */
      call execute(EvalDeriv);      /* evaluate derivatives D=DF(x) */
      delta = -solve(D,y);          /* solve for correction vector */
      x = x+delta;                  /* update x with new approximation */
      call execute(EvalFunc);       /* evaluate the function y=F(x)*/
      converged = (max(abs(y)) < L$"ConvCrit");
   end;
   /* you can return x or  a list that summarizes the results */
   result = [#"x"          = x,           /* approximate root (if converged) */
             #"iterations" = iter,        /* total number of iterations */
             #"status"      = converged]; /* 1 if root found; 0 otherwise */
   return ( result );
finish Newton;
 
/* 5. call Newton's method and pass in a list of parameters */
soln = Newton( args );              
xSoln = soln$"x";                   /* get root */
if soln$"status" then               /* check convergence statust */
   print "Newton's method converged in " (soln$"iterations") " iterations", xSoln;
else 
   print "Newton's method did not converge in " (soln$"iterations") " iterations";

The Newton module returns a named list that has three items. The item named "status" is a binary value that indicates whether the method converged. If the method converged, the item named "x" contains the approximate root. The item named "iterations" contains the number of iterations of the algorithm.

In summary, you can use lists to create a structure that contains many parameters. You can pass the list to a SAS/IML module, which can extract and use the parameters. This enables you to simplify the calling syntax for user-defined modules that require many parameters.
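
One practical benefit of this design is that you can adjust a single parameter and rerun the algorithm without changing the module's signature. The following sketch assumes that the list item operator also accepts an item name on the left-hand side of an assignment, as it does for positional indices:

args$"ConvCrit" = 1e-10;        /* tighten the convergence criterion */
soln = Newton( args );          /* rerun Newton's method with the updated parameter list */
numIters = soln$"iterations";   /* iterations needed at the tighter tolerance */
print numIters;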

For additional examples of using lists to pass heterogeneous data to SAS/IML modules, see "More Than Matrices: SAS/IML Software Supports New Data Structures" (Wicklin, 2017, pp. 11–12) and the SAS/IML documentation for lists.

The post Use lists to pass parameters to SAS/IML functions appeared first on The DO Loop.

1月 222018
 

SAS/IML 14.3 (SAS 9.4M5) introduced a new syntax for creating lists and for assigning and extracting items in a list. Lists (introduced in SAS/IML 14.2) are data structures that are convenient for holding heterogeneous data. A single list can hold character matrices, numeric matrices, scalar values, and other lists, as discussed in a previous article about how to use lists in SAS/IML.

The list creation operator

You can use square brackets to create a list. The elements of the list are separated by commas. For example, the following syntax creates a list that contains three elements. The list represents a hypothetical patient in a cholesterol-lowering study.

proc iml;
/* 1. New list creation syntax (L = [val1, val2,...]) */
/*                 Cholesterol at
       Name  Age   0mo  3mo  6mo  12mo */
P1 = ["Bob",  36, {182, 170, 170, 162}]; /* Patient 1 */

The name of the list is P1. The first element of the list is a string, the second is a scalar number, and the third is a numeric vector. In this case, the elements are specified by using literal values, but you can also use previously defined variables. The following statement is equivalent but defines the list by using existing variables:

Name = "Bob"; Age = 36; Chol = {182, 170, 170, 162};
P1 = [Name, Age, Chol];   /* define list from copies of existing variables */

As mentioned earlier, lists can contain other lists. For example, if you have multiple patients in a study, you can create a list of each patient's data, then create a list that contains all the patients' data, as follows:

P2 = ["Fred", 52, {175, 165, 155}     ]; /* Patient 2 */
P3 = ["Tom",  45, {160, 145,   ., 139}]; /* Patient 3 */
Patients = [P1, P2, P3];                 /* a list of patients */

Assign and extract list items

You can use the list item operator ($) to specify an item in a list. For each patient, the age is stored as the second item in a list. If you want the age of Bob (the first patient), you can use either of the following statements:

/* 2. New list item syntax ($) */
BobAge = P1$2;         /* get 2nd item from P1 list */
BobAge = Patients$1$2; /* get 1st item of Patients, then 2nd item from that list */

The first statement is straightforward: P1$2 means "get the second item from the P1 list." The second statement is parsed left to right. The syntax Patients$1 means "get the first item from the Patients list," which is equivalent to P1. Thus Patients$1$2 gets Bob's age.

The preceding example uses literal values to specify the position of an item in a list, but you can also use a variable. This makes it possible to extract items in a loop. For example, the following statements loop over all patients, extract the name and cholesterol values of the patient, and compute each patient's average cholesterol value during the study:

N = ListLen(Patients);      /* number of patients */
Name = j(N,1,"     ");      /* allocate vector for names */
AvgChol = j(N,1,.);         /* allocate vector for results */
do i = 1 to N;
   Name[i] = Patients$i$1;  /* get name of i_th patient */
   Chol = Patients$i$3;     /* get sequence of cholesterol values */
   AvgChol[i] = mean(Chol); /* average value */
end;
print AvgChol[rowname=Name];

You can use the list item operator on the left side of an assignment operator to change an item in a list. For example, suppose you discover that Bob's age was typed wrong: Bob is really 63, not 36. You can update Bob's data as follows:

P1$2 = 63;         /* update 2nd item in P1 list */
Patients$1 = P1;   /* update entire patient list */

Alternatively, you could use the syntax Patients$1$2 = 63 to update the data in place.
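
For example, the following short sketch performs the in-place update and reads the value back to verify the change:

Patients$1$2 = 63;          /* update Bob's age directly inside the nested list */
BobAge = Patients$1$2;      /* extract the item again to confirm the update */
print BobAge;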

Extract sublists

To subset a list, use square brackets ([]) and specify the indices of the items. For example, the following statement extracts the second and third patients into a new list:

/* 3. Extract sublist SL = L[ {i1, i2,...} ] */
SubList = Patients[ {2 3} ]; /* Fred and Tom */

The sublist operator ([]) ALWAYS returns a list, even if you specify only one item! Thus Patients[1] is a LIST that contains one item. If you want the item itself, use Patients$1.
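
The following sketch makes the distinction concrete by applying the ListLen function (used earlier in this article) to both results:

SL   = Patients[1];         /* a LIST that contains one item (the list for Bob) */
item = Patients$1;          /* the item itself, which is the list P1 */
lenSL   = ListLen(SL);      /* 1: the sublist contains a single item */
lenItem = ListLen(item);    /* 3: P1 contains three items (name, age, cholesterol) */
print lenSL lenItem;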

Named lists

In the previous example, the items in the list P1 are a patient's name, age, and cholesterol readings. If you want to extract Bob's age, you can write P1$2, but someone unfamiliar with the order of the list items would have no idea what that item represents. Thus it is helpful to define a named list, which is also called an associative array. When you specify a named list, you specify the items by using name-value pairs, as follows:

P = [#"Name" = "Bob",                        /* P$1 or P$"Name" */
     #"Age"  = 63,                           /* P$2 or P$"Age" */
     #"Cholesterol" = {182, 170, 170, 162}]; /* P$3 or P$"Cholesterol" */

You can use the names to refer to the items in a list. For example, the following statements extract the patient's name and cholesterol readings by using the list item operator:

Name = P$"Name";
Chol = P$"Cholesterol";       /* get Bob's measurements */
print Chol[Label=Name];

You can also use names in the sublist operator to extract a sublist:

L = P[ {"Age" "Cholesterol"} ];  /* sublist that contains two items */

Summary

In summary, SAS/IML 14.3 contains new syntax for creating a list and for extracting items and sublists. This syntax makes it easier to use lists and to read and write SAS/IML programs that use lists.

The post Create lists by using a natural syntax in SAS/IML appeared first on The DO Loop.

1月 102018
 

Last week I wrote about the 10 most popular articles from The DO Loop in 2017. My most popular articles tend to be about elementary statistics or SAS programming tips. Less popular are the articles about advanced statistical and programming techniques. However, these technical articles fill an important niche. Not everyone needs to know how to interpret a diffogram that visually compares the differences between group means, but those who do often send me a note of thanks, which motivates me to research and write similar articles.

Statistical programmers might want to review the following technical articles from 2017. This ain't summertime, and the reading ain't easy, but I think these articles are worth the effort. I've broken them into three categories. Enjoy!

Statistical Concepts

(Figure: visualization of regression that uses a weight variable in SAS)

Statistical Data Analysis and Visualization

(Figure: a multivariate regression model visualized by slicing the continuous variables; the graph was created by using the EFFECTPLOT SLICEFIT statement in SAS)

Advanced SAS Programming Techniques

If you are searching for a way to enhance your SAS knowledge in this New Year, I think these 10 articles from The DO Loop are worth a second look. Was there an article from 2017 that you found particularly useful? Leave a comment.

The post 10 posts from 2017 that deserve a second look appeared first on The DO Loop.