Rick Wicklin

September 16, 2020
 

A data analyst recently asked a question about restricted least squares regression in SAS. Recall that a restricted regression puts linear constraints on the coefficients in the model. Examples include forcing a coefficient to be 1 or forcing two coefficients to equal each other. Each of these problems can be solved by using PROC REG in SAS. The analyst's question was, "Why can I use PROC REG to solve a restricted regression problem when the restriction involves EQUALITY constraints, but PROC REG can't solve a regression problem that involves an INEQUALITY constraint?"

This article shows how to use the RESTRICT statement in PROC REG to solve a restricted regression problem that has linear constraints on the parameters. I also use geometry and linear algebra to show that solving an equality-constrained regression problem is mathematically similar to solving an unrestricted least squares system of equations. This explains why PROC REG can solve equality-constrained regression problems. In a future article, I will show how to solve regression problems that involve inequality constraints on the parameters.

The geometry of restricted regression

The linear algebra for restricted least squares regression gets messy, but the geometry is easy to picture. A schematic depiction of restricted regression is shown to the right.

Geometrically, ordinary least-squares (OLS) regression is the orthogonal projection of the observed response (Y) onto the column space of the design matrix. (For continuous regressors, this is the span of the X variables, plus an "intercept column.") If you introduce equality constraints among the parameters, you are restricting the solution to a linear subspace of the span (shown in green). But the geometry doesn't change. The restricted least squares (RLS) solution is still a projection of the observed response, but this time onto the restricted subspace.
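In matrix notation, this is a standard result: if X is the design matrix, the OLS prediction is the orthogonal projection of the observed response onto the column space of X:
\( \hat{Y} = X (X^{\prime}X)^{-1} X^{\prime} Y \)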

In terms of linear algebra, the challenge is to write down the projection matrix for the restricted problem. As shown in the diagram, you can obtain the RLS solution in two ways:

  • Starting from Y: You can directly project the Y vector onto the restricted subspace. The linear algebra for this method is straightforward and is often formulated in terms of Lagrange multipliers.
  • Starting from the OLS solution: If you already know the OLS solution, you can project that solution vector onto the restricted subspace. Writing down the projection matrix is complicated, so I refer you to the online textbook Practical Econometrics and Data Science by A. Buteikis.

Restricted models and the TEST statement in PROC REG

Let's start with an example, which is based on one in the online textbook by A. Buteikis. The following SAS DATA step simulates data from a regression model
      Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4 + ε
where B3 = B1 and B4 = –2*B2 and ε ~ N(0, 3) is a random error term.

/* Model: Y = B0 + B1*X1 + B2*X2 + B3*X3 + B4*X4
   where B3 = B1 and B4 = -2*B2
*/
data RegSim;
array beta[0:4] (-5, 2, 3, 2, -6);  /* parameter values for regression coefficients */
N = 100;                            /* sample size */
call streaminit(123);
do i = 1 to N;
   x1 = rand("Normal", 5, 2);       /* X1 ~ N(5,2) */
   x2 = rand("Integer", 1, 50);     /* X2 ~ random integer 1:50 */
   x3 = (i-1)*10/(N-1);             /* X3 = sequence: 0 to 10 by 10/(N-1) */
   x4 = rand("Normal", 10, 3);      /* X4 ~ N(10, 3) */
   eps = rand("Normal", 0, 3);      /* RMSE = 3 */
   Y = beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3 + beta[4]*x4 + eps;
   output;
end;
run;

Although there are five parameters in the model (B0-B4), there are only three free parameters because the values of B3 and B4 are determined by the values of B1 and B2, respectively.

If you suspect that a parameter in your model is a known constant or is equal to another parameter, you can use the RESTRICT statement in PROC REG to incorporate that restriction into the model. Before you do, however, it is usually a good idea to perform a statistical test to see if the data supports the model. For this example, you can use the TEST statement in PROC REG to hypothesize that B3 = B1 and B4 = –2*B2. Recall that the syntax for the TEST statement uses the variable names (X1-X4) to represent the coefficients of the variables. The following call to PROC REG carries out this analysis:

proc reg data=RegSim plots=none;
   model Y = x1 - x4;
   test x3 = x1, x4 = -2*x2;
run;

The output shows that the parameter estimates for the simulated data are close to the parameter values used to generate the data. Specifically, the root MSE is close to 3, which is the standard deviation for the error term. The Intercept estimate is close to -5. The coefficients for the X1 and X3 terms are each approximately 2. The coefficient for X4 is approximately –2 times the coefficient of X2. The F test for the test of hypotheses has a large p-value, so you should not reject the hypotheses that B3 = B1 and B4 = –2*B2.

The RESTRICT statement in PROC REG

If the TEST statement fails to reject the hypotheses, then you can use the RESTRICT statement to recompute the parameter estimates subject to the constraints. You can rerun the entire analysis or, if you are running SAS Studio in interactive mode, simply submit a RESTRICT statement and PROC REG will compute the new parameter estimates without rereading the data:

   restrict x3 = x1, x4 = -2*x2; /* restricted least squares (RLS) */
   print;
quit;

The new parameter estimates enforce the restrictions among the regression coefficients. In the new model, the coefficients for X1 and X3 are exactly equal, and the coefficient for X4 is exactly –2 times the coefficient for X2.

Notice that the ParameterEstimates table is augmented by two rows labeled "RESTRICT". These rows have negative degrees of freedom because they represent constraints. The "Parameter Estimate" column shows the values of the Lagrange multipliers that are used to enforce equality constraints while solving the least squares system. Usually, a data analyst does not care about the actual values in these rows, although the Wikipedia article on Lagrange multipliers discusses ways to interpret the values.
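If you are not running interactively, you can submit the MODEL and RESTRICT statements in a single call. The following program is equivalent to submitting the earlier statements in two steps:

proc reg data=RegSim plots=none;
   model Y = x1 - x4;
   restrict x3 = x1, x4 = -2*x2;  /* restricted least squares (RLS) */
   print;
quit;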

The linear algebra of restricted regression

This section shows the linear algebra behind the restricted least squares solution by using SAS/IML. Recall that the usual way to compute the unrestricted OLS solution is to solve the "normal equations" (X`*X)*b = X`*Y for the parameter estimates, b. Although textbooks often solve this equation by using the inverse of the X`X matrix, it is more efficient to use the SOLVE function in SAS/IML to solve the equation, as follows:

proc iml;
varNames = {x1 x2 x3 x4};
use RegSim;                  /* read the data */
   read all var varNames into X;
   read all var "Y";
close;
X = j(nrow(X), 1, 1) || X;   /* add intercept column to form design matrix */
XpX = X`*X;
 
/* Solve the unrestricted OLS system */
b_ols = solve(XpX, X`*Y);    /* normal equations X`*X * b = X`*Y */
ParamNames = "Intercept" || varNames;
print b_ols[r=ParamNames F=10.5];

The OLS solution is equal to the first (unrestricted) solution from PROC REG.

If you want to require that the solution satisfies B3 = B1 and B4 = –2*B2, then you can augment the X`X matrix to enforce these constraints. If L*b = c is the matrix equation for the linear constraints on the parameters, then the augmented system is
\( \begin{pmatrix} X^{\prime}X & L^{\prime} \\ L & 0 \end{pmatrix} \begin{pmatrix} b_{\small\mathrm{RLS}} \\ \lambda \end{pmatrix} = \begin{pmatrix} X^{\prime}Y \\ c \end{pmatrix} \)

The following statements solve the augmented system:

/* Direct method: Find b_RLS by solving augmented system */
/* Int X1 X2 X3 X4 */
L = {0  1  0 -1  0,     /*    X1 - X3 = 0 */
     0  0 -2  0 -1};    /* -2*X2 - X4 = 0 */
c = {0, 0};
 
zero = j(nrow(L), nrow(L), 0);
A = (XpX || L`   ) //
    (L   || zero );
z = X`*Y // c;
b_rls = solve(A, z);
rNames = ParamNames||{'Lambda1' 'Lambda2'};
print b_rls[r=rNames F=10.5];

The parameter estimates are the same as those produced by PROC REG. You can usually ignore the last two rows, which are the values of the Lagrange multipliers that enforce the constraints.

By slogging through some complicated linear algebra, you can also obtain the restricted least squares solution from the OLS solution and the constraint matrix (L). The following equations use the formulas in the online textbook by A. Buteikis to compute the restricted solution:
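In the notation of the previous section (where L is the constraint matrix and c is the vector of constraint values), the adjustment that the following program computes can be written as
\( b_{\small\mathrm{RLS}} = b_{\small\mathrm{OLS}} - (X^{\prime}X)^{-1} L^{\prime} \left[ L (X^{\prime}X)^{-1} L^{\prime} \right]^{-1} (L b_{\small\mathrm{OLS}} - c) \)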

/* Adjust the OLS solution to obtain the RLS solution.
   b_RLS = b_OLS - b_adjustment
   Follows
   http://web.vu.lt/mif/a.buteikis/wp-content/uploads/PE_Book/4-4-Multiple-RLS.html
*/
RA_1 = solve(XpX, L`);
RA_3 = solve(L*RA_1, L*b_ols - c);
b_adj = RA_1 * RA_3;
b_rls = b_ols - b_adj;
print b_rls[r=ParamNames F=10.5];

This answer agrees with the previous answer but does not compute the Lagrange multipliers. Instead, it computes the RLS solution as the projection of the OLS solution onto the restricted linear subspace.

Summary

This article shows how to use the RESTRICT statement in PROC REG to solve a restricted regression problem that involves linear constraints on the parameters. Solving an equality-constrained regression problem is very similar to solving an unrestricted least squares system of equations. Geometrically, they are both projections onto a linear subspace. Algebraically, you can solve the restricted problem directly or as the projection of the OLS solution.

Although changing one of the equality constraints into an INEQUALITY seems like a minor change, it means that the restricted region is no longer a linear subspace. Ordinary least squares can no longer solve the problem because the solution might not be an orthogonal projection. The next article discusses ways to solve problems where the regression coefficients are related by linear inequalities.

The post Restricted least squares regression in SAS appeared first on The DO Loop.

September 14, 2020
 

My 2020 SAS Global Forum paper was about how to write custom parallel programs by using the iml action in SAS Viya 3.5. My conference presentation was canceled because of the coronavirus pandemic, but I recently recorded a 15-minute video that summarizes the main ideas in the paper.

One of the reasons I enjoy attending conferences is to obtain a high-level "mental roadmap" of a topic that I can leverage if I decide to study the details. Hopefully, this video will provide you with a roadmap of the iml action and its capabilities.


If your browser does not support embedded video, you can go directly to the video on YouTube.

The video introduces a few topics that I have written about in more detail on this blog.

For more details, you can read the paper: Wicklin and Banadaki, 2020, "Write Custom Parallel Programs by Using the iml Action."

The post Video: How to Write a Custom Parallel Program in SAS Viya appeared first on The DO Loop.

September 10, 2020
 

I previously wrote about the RAS algorithm, which is a simple algorithm that performs matrix balancing. Matrix balancing refers to adjusting the cells of a frequency table to match known values of the row and column sums. Ideally, the balanced matrix will reflect the structural relationships in the original matrix. An alternative method is the iterative proportional fitting (IPF) algorithm, which is implemented in the IPF subroutine in SAS/IML. The IPF method can balance n-way tables, n ≥ 2.

The IPF subroutine implements a statistical modeling method. It computes maximum likelihood estimates for a hierarchical log-linear model of the counts as a function of the categorical variables. It is more complicated to use than the RAS algorithm, but it is also more powerful.

This article shows how to use the IPF subroutine for the two-way frequency table from my previous article. The data are for six Girl Scouts who sold seven types of cookies. We know the row totals for the Type variable (the boxes sold for each cookie type) and the column totals for the Person variable (the boxes sold by each girl). But we do not know the values of the individual cells, which are the Type-by-Person totals. A matrix balancing algorithm takes any guess for the table and maps it into a table that has the observed row and column sums.

Construct a table that has the observed marginal totals

As mentioned in the previous article, there are an infinite number of matrices that have the observed row and column totals. Before we discuss the IPF algorithm, I want to point out that it is easy to compute one particular balanced matrix. The special matrix is the one that assumes no association (independence) between the Type and Person variables. That matrix is formed by the outer product of the marginal totals:
     B[i,j] = (sum of row i)(sum of col j) / (sum of all counts)

It is simple to compute the "no-association" matrix in the SAS/IML language:

proc iml;
/* Known quantities: row totals and column totals */
u = {260, 214, 178, 148, 75, 67, 59}; /* total boxes sold, by type */
v = {272 180 152 163 134 100};        /* total boxes sold, by girl */
 
/* Compute the INDEPENDENT table, which is formed by the outer product
   of the marginal totals:
   B[i,j] = (sum of row i)(sum of col j) / (sum of all counts) */
B = u*v / sum(u);
 
Type   = 'Cookie1':'Cookie7';                  /* row labels    */
Person = 'Girl1':'Girl6';                      /* column labels */
print B[r=Type c=Person format=7.1 L='Independent Table'];

For this balanced matrix, there is no association between the row and column variables (by design). This matrix is the "expected count" matrix that is used in a chi-square test for independence between the Type and Person variables.

Because the cookie types are listed in order from most popular to least popular, the values in the balanced matrix (B) decrease down each column. If you order the girls according to how many boxes they sold (you need to exchange the third and fourth columns), then the values in the balanced matrix would decrease across each row.

Use the IPF routine to balance a matrix

The documentation for the IPF subroutine states that you need to specify the following input arguments. For clarity, I will only discuss two-way r x c tables.

  • dim: The vector of dimensions of the table: dim = {r c};
  • table: This r x c matrix is used to specify the row and column totals. You can specify any matrix; only the marginal totals of the matrix are used. For example, for a two-way table, you can use the "no association" matrix from the previous section.
  • config: An input matrix whose columns specify which marginal totals to use to fit the counts. Because the model is hierarchical, all subsets of specified marginals are included in the model. For a two-way table, the most common choices are main-effect models such as config={1} (model counts by using the first variable only), config={2} (model counts by using the second variable only), and config={1 2} (model counts by using both variables). It is possible to fit the "fully saturated model" (config={1, 2}), but that model is overparameterized and results in a perfect fit if the IPF algorithm converges.
  • initTable (optional): The elements of this matrix will be adjusted to match the observed marginal totals. In practice, this matrix is often based on a survey and you want to adjust the counts by using known marginal totals from a recent census of the population. Or it might be an educated guess, as in the Girl Scout cookie example. If you do not specify this matrix, a matrix of all 1s is used, which implies that you think all cell entries are approximately equal.

The following SAS/IML program defines the parameters for the IPF call.

/* Specify structure between rows and columns by asking each girl 
   to estimate how many boxes of each type she sold. This is the 
   matrix whose cells will be adjusted to match the marginal totals. */
/*   G1 G2 G3 G4 G5 G6          */
A = {75 45 40 40 40 30 ,  /* C1 */
     40 35 45 35 30 30 ,  /* C2 */
     40 25 30 40 30 20 ,  /* C3 */
     40 25 25 20 20 20 ,  /* C4 */
     30 25  0 10 10  0 ,  /* C5 */
     20 10 10 10 10  0 ,  /* C6 */
     20 10  0 10  0  0 }; /* C7 */
 
Dim = ncol(A) || nrow(A);  /* dimensions of r x c table */
Config = {1 2};            /* model main effects: counts depend on Person and Type */
call IPF(FitA, Status,     /* output values */
         Dim, B,           /* use the "no association" matrix to specify marginals */
         Config, A);       /* use best-guess matrix to specify structure */
print FitA[r=Type c=Person format=7.2],     /* the IPF call returns the FitA matrix and Status vector */
      Status[c={'rc' 'Max Diff' 'Iters'}];

The IPF call returns two quantities. The first output is the balanced matrix. For this example, the balanced matrix (FitA) is identical to the result from the RAS algorithm. It has a similar structure to the "best guess" matrix (A). For example, notice that the cells in the original matrix that are zero are also zero in the balanced matrix.

The second output is the Status vector, which provides information about the convergence of the IPF algorithm (see the sketch after this list):

  • If the first element is zero, the IPF algorithm converged.
  • The second element gives the maximum difference between the marginal totals of the balanced matrix (FitA) and the observed marginal totals.
  • The third element gives the number of iterations of the IPF algorithm.
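For example, the following statements show one way to check the status vector before you use the fitted table. This is a minimal sketch; the element positions are as described in the list above:

/* minimal sketch: check convergence before using the FitA matrix */
if Status[1] = 0 then
   print "The IPF algorithm converged", (Status[2])[L="Max Diff"], (Status[3])[L="Iters"];
else
   print "WARNING: The IPF algorithm did not converge.";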

Other models

The IPF subroutine is more powerful than the RAS algorithm because you can fit n-way tables and because you can model the table in several natural ways. For example, the model in the previous section assumes that the counts depend on the salesgirl and on the cookie type. You can also model the situation where the counts depend only on one of the factors and not on the other:

/* counts depend only on Person, not on cookie Type */
Config = {1};   
call IPF(FitA1, Status, Dim, B, Config, A);
print FitA1[r=Type c=Person format=7.2], Status;
 
/* counts depend only on cookie Type, not on Person */
Config = {2};   
call IPF(FitA2, Status, Dim, B, Config, A);
print FitA2[r=Type c=Person format=7.2], Status;

Summary

In summary, the SAS/IML language provides a built-in subroutine (CALL IPF) for matrix balancing of frequency tables. The routine incorporates statistical modeling, and you can specify which variables to use to model the cell counts. The documentation includes several examples that demonstrate how to use the IPF algorithm to balance n-way tables, n ≥ 2.

The post Iterative proportional fitting in SAS appeared first on The DO Loop.

September 8, 2020
 

Matrix balancing is an interesting problem that has a long history. Matrix balancing refers to adjusting the cells of a frequency table to match known values of the row and column sums. One of the early algorithms for matrix balancing is known as the RAS algorithm, but it is also called the raking algorithm in some fields. The presentation in this article is inspired by a paper by Carol Alderman (1992).

The problem: Only marginal totals are known

Alderman shows a typical example (p. 83), but let me give a smaller example. Suppose a troop of six Girl Scouts sells seven different types of cookies. At the end of the campaign, each girl reports how many boxes she sold. Also, the troop leader knows how many boxes of each cookie type were sold. However, no one kept track of how many boxes of each type were sold by each girl.

The situation can be summarized by a 7 x 6 table. We know the row totals, which are the numbers of boxes that were sold for each cookie type. We also know the column totals, which are the numbers of boxes that were sold by each girl. But we do not know the values of the individual cells, which are the type-by-girl totals.

Guessing the cells of a matrix

As stated, the problem usually has infinitely many solutions. For an r x c table, there are r*c cells, but the marginal totals only provide r + c linear constraints. (For the 7 x 6 cookie table, that is 42 unknowns but only 13 equations.) Except for 2 x 2 tables, there are more unknowns than equations, which leads to an underdetermined linear system that (typically) has infinitely many solutions.

Fortunately, we can often obtain a unique solution if we provide an informed guess for the cells in the table. Perhaps we can ask the girls to provide estimates of the number of boxes of each type that they sold. Or perhaps we have values for the cells from the previous year. In either case, if we can estimate the cell values, we can use a matrix balancing algorithm to adjust the cells so that they match the observed marginal totals.

Let's suppose that the troop leader asks the girls to estimate (from memory) how many boxes of each type they sold. The estimates are shown in the following SAS/IML program. The row and column sums of the estimates do not match the known marginal sums, but that's okay. The program uses subscript reduction operators to compute the marginal sums of the girls' estimates. These are displayed next to the actual totals, along with the differences.

proc iml;
/* Known marginal totals: */
u = {260, 214, 178, 148, 75, 67, 59}; /* total boxes sold, by type */
v = {272 180 152 163 134 100};        /* total boxes sold, by girl */
 
/* We don't know the cell values that produce these marginal totals,
   but we can ask each girl to estimate how many boxes of each type she sold */
/*   G1 G2 G3 G4 G5 G6          */
A = {75 45 40 40 40 30 ,  /* C1 */
     40 35 45 35 30 30 ,  /* C2 */
     40 25 30 40 30 20 ,  /* C3 */
     40 25 25 20 20 20 ,  /* C4 */
     30 25  0 10 10  0 ,  /* C5 */
     20 10 10 10 10  0 ,  /* C6 */
     20 10  0 10  0  0 }; /* C7 */
 
/* notice that the row/col totals for A do not match the truth: */
rowTotEst = A[ ,+];        /* for each row, sum across all cols */
colTotEst = A[+, ];        /* for each col, sum down all rows   */
Items  = 'Cookie1':'Cookie7';                  /* row labels    */
Person = 'Girl1':'Girl6';                      /* column labels */
 
/* print known totals, estimated totals, and difference */
rowSums = (u || rowTotEst || (u-rowTotEst));
print rowSums[r=Items c={'rowTot (u)' 'rowTotEst' 'Diff'}];
colSums = (v // colTotEst // (v-colTotEst));
print colSums[r={'colTot (v)' 'colTotEst' 'Diff'} c=Person];

You can see that the marginal totals do not match the known row and column totals. However, we can use the guesses as an initial estimate for the RAS algorithm. The RAS algorithm will adjust the estimates to obtain a new matrix that DOES satisfy the marginal totals.

Adjusting a matrix to match marginal totals

The RAS algorithm can iteratively adjust the girls' estimates until the row sums and column sums of the matrix match the known row sums and column sums. The resulting matrix will probably not be an integer matrix. It will, however, reflect the structure of the girls' estimates in three ways:

  1. It will approximately preserve each girl's recollection of the relative proportions of cookie types that she sold. If a girl claims that she sold many of one type of cookie and few of another, that relationship will be reflected in the balanced matrix.
  2. It will approximately preserve the relative proportions of cookie types among girls. If one girl claims that she sold many more of one type of cookie than another girl, that relationship will be reflected in the balanced matrix.
  3. It will preserve zeros. If a girl claims that she did not sell any of one type of cookie, the balanced matrix will also contain a zero for that cell.

The RAS algorithm is explained in Alderman (1992). The algorithm starts with a nonnegative r x c matrix, A. Let u be the observed r x 1 vector of row sums. Let v be the observed 1 x c vector of column sums. The main steps of the RAS algorithm are:

  1. Initialize X = A.
  2. Scale the rows. Let R = u / RowSum(X), where 'RowSum' is a function that returns the vector of row sums. Update X ← R # X. Here R#X indicates elementwise multiplication of each column of X by the corresponding element of R.
  3. Scale the columns. Let S = v / ColSum(X), where 'ColSum' is a function that returns the vector of column sums. Update X ← X # S. Here X#S indicates elementwise multiplication of each row of X by the corresponding element of S.
  4. Check for convergence. If RowSum(X) is close to u and ColSum(X) is close to v, then the algorithm has converged, and X is the balanced matrix. Otherwise, return to Step 2.

The RAS algorithm is implemented in the following SAS/IML program:

/* Use the RAS matrix scaling algorithm to find a matrix X that 
   approximately satisfies 
   X[ ,+] = u  (marginals constraints on rows)
   X[+, ] = v  (marginals constraints on cols)
   and X is obtained by scaling rows and columns of A */
start RAS(A, u, v);
   X = A;       /* Step 1: Initialize X */
   converged = 0;
   do k = 1 to 100 while(^Converged);
      /* Step 2: Row scaling: X = u # (X / X[ ,+]); */
      R = u / X[ ,+];
      X = R # X;
 
      /* Step 3: Column scaling: X = (X / X[+, ]) # v; */
      S = v / X[+, ];
      X = X # S;
 
      /* Step 4: Check for convergence: u=rowTot and v=colTot */
      rowTot = X[ ,+];   /* for each row, sum across all cols */
      colTot = X[+, ];   /* for each col, sum down all rows */
      maxDiff = max( abs(u-rowTot), abs(v-colTot) );
      Converged = (maxDiff < 1e-8);
   end;
   return( X );
finish;
 
X = RAS(A, u, v);
print X[r=Items c=Person format=7.1];

The matrix X has row and column sums that are close to the specified values. The matrix also preserves relationships among individual columns and rows, based on the girls' recollections of how many boxes they sold. The algorithm preserves zeros for the girls who claim they sold zero boxes of a certain type. It does not introduce any new zeros for cells.

You can't determine how close the X matrix is to "the truth," because we don't know the true number of boxes that each girl sold. The X matrix depends on the girls' recollections and estimates. If the recollections are close to the truth, then X will be close to the truth as well.

Summary

This article looks at the RAS algorithm, which is one way to perform matrix balancing on a table that has nonnegative cells. "Balancing" refers to adjusting the interior cells of a matrix to match observed marginal totals. The RAS algorithm starts with an estimate, A. The estimate does not need to satisfy the marginal totals, but it should reflect relationships between the rows and columns that you want to preserve, such as relative proportions and zero cells. The RAS algorithm iteratively scales the rows and columns of A to create a new matrix that matches the observed row and column sums.

It is worth mentioning that the row scaling could be accumulated into a diagonal matrix, R, and the column scaling into a diagonal matrix, S. Thus, the balanced matrix is the product of these diagonal matrices and the original estimate, A. As a matrix equation, you can write X = RAS. It is this matrix factorization that gives the RAS algorithm its name.
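As a sketch (assuming you accumulate the row and column multipliers alongside the iteration, which the RAS function above does not do), you can verify the factorization numerically. The vectors r and s below play the role of the diagonals of R and S:

/* sketch: accumulate row and column multipliers to verify X = R*A*S */
X2 = A;
r = j(nrow(A), 1, 1);        /* running product of row factors    */
s = j(1, ncol(A), 1);        /* running product of column factors */
do k = 1 to 100;
   Rk = u / X2[ ,+];   X2 = Rk # X2;   r = r # Rk;
   Sk = v / X2[+, ];   X2 = X2 # Sk;   s = s # Sk;
end;
maxDiff = max(abs( X2 - (r # A # s) ));  /* r # A # s = diag(r)*A*diag(s) */
print maxDiff;                           /* should be essentially zero    */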

The RAS algorithm is not the only algorithm for matrix balancing, but it is the simplest. Another algorithm, called iterative proportional fitting (IPF), will be discussed in a second article.

The post Matrix balancing: Update matrix cells to match row and column sums appeared first on The DO Loop.

September 2, 2020
 

The HighLow plot often enables you to create many custom plots without resorting to annotation. Although it is designed to create a candlestick chart for stocks, it is incredibly versatile. Recently, a SAS programmer wanted to create a patient-profile graph that looked like a stacked bar chart but had repeated categories. The graph is shown to the right. The graph shows which treatments each patient was given and in which order. Treatments can be repeated, and during some periods a patient was given no treatment.

Although the graph looks like a stacked bar chart, there is an important difference. A stacked bar chart is a graph of (counts of) discrete categories. For each patient, each category appears at most one time. Although the categories can be ordered, the categories are displayed in the same order for every patient.

In contrast, by using the HighLow chart, you can order the categories independently for each patient and you can display repeated categories.

A graph of mutually exclusive events over time

This chart visualizes the sequence of treatments for a set of patients over time, but you can generalize the idea. In general, this graph is used to show different subjects over time. At each instant of time, the subject can experience one and only one event at a time. In theory, this means that the events should be mutually exclusive, but you can get around that restriction by defining additional events such as "A and B", "A and C", and "B and C".
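For example, the following rows are hypothetical (they are not part of the data in the next section), but they illustrate how a period during which a patient receives treatments A and B concurrently could be coded as its own category:

501   0   30 A
501  30   60 A+B
501  60   90 B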

The following DATA step defines the data in "long form," meaning that each patient profile is defined by using multiple rows. There are four variables: the subject ID, the starting time, the ending time, and the treatment category. Each row of the data set defines the starting and ending time for one treatment for one patient. The DATA step also implements a trick that I like: prepending "fake data" to specify the order in which the treatments appear in the legend.

data Patients;
length Treatment $ 10;
input ID StartDay EndDay Treatment;
/* trick: The first patient is a 'missing patient', which 
   determines the order that the categories appear in the legend. */
datalines;
  .   0    0 A
  .   0    0 B
  .   0    0 C
  .   0    0 None
 
101   0   90 A
101  90  120 B
101 120  180 A
101 180  210 None
101 210  270 B
101 270  360 None
101 360  420 A
 
201   0  180 B
201 180  240 None
201 240  300 C
 
301   0  270 C
301 270  320 None
301 320  480 A
 
401   0  180 C
401 180  240 None
401 240  360 A
401 360  450 B
;
 
title "Treatments by Day for Each Patient";
proc sgplot data=Patients;
   highlow y=ID low=StartDay high=EndDay / group=Treatment type=BAR;
   yaxis type=discrete reverse;   /* the ID var is numeric, so you must specify DISCRETE */
   xaxis grid label='Day Since Start of Treatment';
run;

The graph is shown at the top of the article. The PROC SGPLOT call is essentially a one-liner. The "hard part" about creating this graph is knowing that it can be created by using a HighLow plot and creating the data in the correct form.

In summary, the HighLow plot gives you more control over the graph than a standard stacked bar chart. It enables you to display repeated categories and to order the categories independently for each subject. By using these features, you can visualize a sequence of mutually exclusive events over time for each subject.

Further reading

There have been many blog posts that use the HighLow plot to create custom graphs. In statistics, I have used the HighLow plot to create deviation plots and butterfly fringe plots. It can also create many kinds of clinical graphs.

The post The HighLow plot: When a stacked bar chart is not enough appeared first on The DO Loop.

August 31, 2020
 

On discussion forums, many SAS programmers ask about the best way to generate dummy variables for categorical variables. Well-meaning responders offer all sorts of advice, including writing your own DATA step program, sometimes mixed with macro programming. This article shows that the simplest and easiest way to generate dummy variables in SAS is to use PROC GLMSELECT. It is not necessary to write a SAS program to generate dummy variables. This article shows an example of generating dummy variables that have meaningful names, which are based on the name of the original variable and the categories (levels) of the variable.

A dummy variable is a binary indicator variable. Given a categorical variable, X, that has k levels, you can generate k dummy variables. The j-th dummy variable indicates the presence (1) or absence (0) of the j-th category. For example, if X has the three levels 'A', 'B', and 'C', then an observation for which X='B' has the dummy values (0, 1, 0).

Why GLMSELECT is the best way to generate dummy variables

I usually avoid saying "this is the best way" to do something in SAS. But if you are facing an impending deadline, you are probably more interested in solving your problem and less interested in comparing five different ways to solve it. So let's cut to the chase: If you want to generate dummy variables in SAS, use PROC GLMSELECT.

Why do I say that? Because PROC GLMSELECT has the following features that make it easy to use and flexible:

  • The syntax of PROC GLMSELECT is straightforward and easy to understand.
  • The dummy variables that PROC GLMSELECT creates have meaningful names. For example, if the name of the categorical variable is X and it has values 'A', 'B', and 'C', then the names of the dummy variables are X_A, X_B, and X_C.
  • PROC GLMSELECT creates a macro variable named _GLSMOD that contains the names of the dummy variables.
  • When you write the dummy variables to a SAS data set, you can include the original variables or not.
  • By default, PROC GLMSELECT uses the GLM parameterization of CLASS variables. This is what you need to generate dummy variables. But the same procedure also enables you to generate design matrices that use different parameterizations, that contain interaction effects, that contain spline bases, and more.

The only drawback to using PROC GLMSELECT is that it requires a response variable to put on the MODEL statement. But that is easily addressed.

How to generate dummy variables

Let's show an example of generating dummy variables. I will use two categorical variables in the Sashelp.Cars data: Origin and Cylinders. First, let's look at the data. As the output from PROC FREQ shows, the Origin variable has three levels ('Asia', 'Europe', and 'USA') and the Cylinders variable has seven valid levels and also contains two missing values.

%let DSIn    = Sashelp.Cars;     /* name of input data set */
%let VarList = Origin Cylinders; /* name of categorical variables */
 
proc freq data=&DSIn;
   tables &VarList;
run;

In order to use PROC GLMSELECT, you need a numeric response variable. PROC GLMSELECT does not care what the response variable is, but it must exist. The simplest thing to do is to create a "fake" response variable by using a DATA step view. To generate the dummy variables, put the names of the categorical variables on the CLASS and MODEL statements. You can use the OUTDESIGN= option to write the dummy variables (and, optionally, the original variables) to a SAS data set. The following statements generate dummy variables for the Origin and Cylinders variables:

/* An easy way to generate dummy variables is to use PROC GLMSELECT */
/* 1. add a fake response variable */
data AddFakeY / view=AddFakeY;
set &DSIn;
_Y = 0;
run;
 
/* 2. Create the dummy variables as a GLM design matrix. Include the original variables, if desired */
proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=Want(drop=_Y);
   class      &VarList;   /* list the categorical variables here */
   model _Y = &VarList /  noint selection=none;
run;

The dummy variables are contained in the WANT data set. As mentioned, the GLMSELECT procedure creates a macro variable (_GLSMOD) that contains the names of the dummy variables. You can use this macro variable in procedures and in the DATA step. For example, you can use it to look at the names and labels for the dummy variables:

/* show the names of the dummy variables */
proc contents varnum data=Want(keep=&_GLSMOD);
ods select Position;
run;

Notice that the names of the dummy variables are very understandable. The three levels of the Origin variable are 'Asia', 'Europe', and 'USA', so the dummy variables are named Origin_Asia, Origin_Europe, and Origin_USA. The dummy variables for the seven valid levels of the Cylinders variable are named Cylinders_N, where N is a valid level.

A macro to generate dummy variables

It is easy to encapsulate the two steps into a SAS macro to make it easier to generate dummy variables. The following statements define the %DummyVars macro, which takes three arguments:

  1. DSIn is the name of the input data set, which contains the categorical variables.
  2. VarList is a space-separated list of the names of the categorical variables. Dummy variables will be created for each variable that you specify.
  3. DSOut is the name of the output data set, which contains the dummy variables.

/* define a macro to create dummy variables */
%macro DummyVars(DSIn,    /* the name of the input data set */
                 VarList, /* the names of the categorical variables */
                 DSOut);  /* the name of the output data set */
   /* 1. add a fake response variable */
   data AddFakeY / view=AddFakeY;
      set &DSIn;
      _Y = 0;      /* add a fake response variable */
   run;
   /* 2. Create the design matrix. Include the original variables, if desired */
   proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=&DSOut(drop=_Y);
      class      &VarList;   
      model _Y = &VarList /  noint selection=none;
   run;
%mend;
 
/* test macro on the Age and Sex variables of the Sashelp.Class data */
%DummyVars(Sashelp.Class, Age Sex, ClassDummy);

When you run the macro, it writes the dummy variables to the ClassDummy data set. It also creates a macro variable (_GLSMOD) that contains the names of the dummy variables. You can use the macro variable to analyze or print the dummy variables, as follows:

/* _GLSMOD is a macro variable that contains the names of the dummy variables */
proc print data=ClassDummy noobs; 
   var Name &_GLSMod;
run;

The dummy variables tell you that Alfred is a 14-year-old male, Alice is a 13-year-old female, and so forth.

What happens if a categorical variable contains a missing value?

If a categorical variable contains a missing value, so do all dummy variables that are generated from that variable. For example, we saw earlier that the Cylinders variable for the Sashelp.Cars data has two missing values. You can use PROC MEANS to show that the dummy variables (named Cylinders_N) also have two missing values. Because the dummy variables are binary variables, the sum of each dummy variable equals the number of observations at the corresponding level. Compare the SUM column in the PROC MEANS output with the earlier output from PROC FREQ:

/* A missing value in Cylinders results in a missing value for 
   each dummy variable that is generated from Cylinders */
proc means data=Want N NMiss Sum ndec=0;
   var Cylinders_:;
run;

Summary

In most analyses, it is unnecessary to generate dummy variables. Most SAS procedures support the CLASS statement, which enables you to use categorical variables directly in statistical analyses. However, if you do need to generate dummy variables, there is an easy way to do it: use PROC GLMSELECT or use the %DummyVars macro in this article. The result is a SAS data set that contains the dummy variables and a macro variable (_GLSMOD) that contains the names of the dummy variables.

Further Reading

Here are links to previous articles about dummy variables and creating design matrices in SAS.

The post The best way to generate dummy variables in SAS appeared first on The DO Loop.

August 26, 2020
 

In the paper "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018), I discussed an example that uses the new STREAMREWIND subroutine in Base SAS 9.4M5. As its name implies, the STREAMREWIND subroutine rewinds a random number stream, essentially resetting the stream to the beginning. I struggled to create a compelling example for the STREAMREWIND routine because using the subroutine "results in dependent streams of numbers" and because "it is usually not necessary in simulation studies" (p. 12). Regarding an application, I asserted that the subroutine "is convenient for testing."

But recently I was thinking about two-factor authentication and realized that I could use the STREAMREWIND subroutine to emulate generating a random token that changes every 30 seconds. I think it is a cool example, and it gives me the opportunity to revisit some of the newer features of random-number generation in SAS, including new generators and random number keys.

A brief overview of two-factor authentication

I am not an expert on two-factor authentication (TFA), but I use it to access my work computer, my bank accounts, and other sensitive accounts. The main idea behind TFA is that before you can access a secure account, you must authenticate yourself in two ways:

  • Provide a valid username and password.
  • Provide information that depends on a physical device that you own and that you have previously registered.

Most people use a smartphone as the physical device, but it can also be a PC or laptop. If you do an internet search for "two factor authentication tokens," you can find many images like the one on the right. This is the display from a software program that runs on a PC, laptop, or phone. The "Credential ID" field is a long string that is unique to each device. (For simplicity, I've replaced the long string with "12345.") The "Security Code" field displays a pseudorandom number that changes every 30 seconds. The Security Code depends on the device and on the time of day (within a 30-second interval). In the image, you can see a small clock and the number 28, which indicates that the Security Code will be valid for another 28 seconds before a new number is generated.

After you provide a valid username and password, the account challenges you to type in the current Security Code for your registered device. When you submit the Security Code, the remote server checks whether the code is valid for your device and for the current time of day. If so, you can access your account.

Two-factor random number streams

I love the fact that the Security Code is pseudorandom and yet verifiable. And it occurred to me that I can use the main idea of TFA to demonstrate some of the newer features in the SAS random number generators (RNGs).

Long-time SAS programmers know that each stream is determined by a random number seed. But a newer feature is that you can also set a "key" for a random number stream. For several of the new RNGs, streams that have the same seed but different keys are independent. You can use this fact to emulate the TFA app:

  • The Credential ID (which is unique to each device) is the "seed" for an RNG.
  • The time of day is the "key" for an RNG. Because the Security Code must be valid for 30 seconds, round the time to the nearest 30-second boundary.
  • Usually each call to the RAND function advances the state of the RNG so that the next call to RAND produces a new pseudorandom number. For this application, we want to get the same number for any call within a 30-second period. One way to do this is to reset the random number stream before each call so that RAND always returns the FIRST number in the stream for the (seed, time) combination.

Using a key to change a random-number stream

Before worrying about using the time of day as the key value, let's look at a simpler program that returns the first pseudorandom number from independent streams that have the same seed but different key values. I will use PROC FCMP to write a function that can be called from the SAS DATA step. Within the DATA step, I set the seed value and use the "Threefry 2" (TF2) RNG. I then call the Rnd6Int function for six different key values.

proc fcmp outlib=work.TFAFunc.Access;
   /* this function sets the key of a random-numbers stream and 
      returns the first 6-digit pseudorandom number in that stream */
   function Rnd6Int(Key);
      call stream(Key);               /* set the Key for the stream */
      call streamrewind(Key);         /* rewind stream with this Key */
      x = rand("Integer", 0, 999999); /* first 6-digit random number in stream */
      return( x );
   endsub;
quit;
 
options cmplib=(work.TFAFunc);       /* DATA step looks here for unresolved functions */
data Test;
DeviceID = 12345;                    /* ID for some device */
call streaminit('TF2', DeviceID);    /* set RNG and seed (once per data step) */
do Key = 1 to 6;
   SecCode = Rnd6Int(Key);           /* get random number from seed and key values */
   /* Call the function again. Should produce the same value b/c of STREAMREWIND */
   SecCodeDup = Rnd6Int(Key);  
   output;
end;
keep DeviceID Key SecCode:;
format SecCode SecCodeDup Z6.;
run;
 
proc print data=Test noobs; run;

Each key generates a different pseudorandom six-digit integer. Notice that the program calls the Rnd6Int function twice for each seed value. The function returns the same number each time because the random number stream for the (seed, key) combination gets reset by the STREAMREWIND call during each call. Without the STREAMREWIND call, the function would return a different value for each call.
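To see the effect, here is a hypothetical variant (it is not part of the original program, and the function name is my own) that omits the STREAMREWIND call. If you call it twice with the same key, you get two different numbers because the second RAND call advances the stream:

proc fcmp outlib=work.TFAFunc.Access;
   /* hypothetical variant: same as Rnd6Int, but without rewinding the stream */
   function Rnd6IntNoRewind(Key);
      call stream(Key);               /* set the Key for the stream */
      x = rand("Integer", 0, 999999); /* each call advances the stream state */
      return( x );
   endsub;
quit;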

Using a time value as a key

With a slight modification, the program in the previous section can be made to emulate the program/app that generates a new TFA token every 30 seconds. However, so that we don't have to wait so long, the following program sets the time interval (the DT macro) to 10 seconds instead of 30. Instead of talking about a 30-second interval or a 10-second interval, I will use the term "DT-second interval," where DT can be any time interval.

The program below gets the "key" by looking at the current datetime value and rounding it to the nearest DT-second interval. This value (the RefTime variable) is sent to the Rnd6Int function to generate a pseudorandom Security Code. To demonstrate that the program generates a new Security Code every DT seconds, I call the Rnd6Int function 10 times, waiting 3 seconds between each call. The results are printed below:

%let DT = 10;                  /* change the Security Code every DT seconds */
 
/* The following DATA step takes 30 seconds to run because it
   performs 10 iterations and waits 3 secs between iterations */
data TFA_Device;
keep DeviceID Time SecCode;
DeviceID = 12345;
call streaminit('TF2', DeviceID);   /* set the RNG and seed */
do i = 1 to 10;
   t = datetime();                  /* get the current time */
   /* round to the nearest DT seconds and save the "reference time" */
   RefTime = round(t, &DT); 
   SecCode = Rnd6Int(RefTime);      /* get a random Security Code */
   Time = timepart(t);              /* output only the time */
   call sleep(3, 1);                /* delay 3 seconds; unit=1 sec */
   output;
end;
format Time TIME10. SecCode Z6.;
run;
 
proc print data=TFA_Device noobs; 
   var DeviceId Time SecCode;
run;

The output shows that the program generated three different Security Codes. Each code is constant for a DT-second period (here, DT=10) and then changes to a new value. For example, when the seconds are in the interval [05, 15), the Security Code has the same value. The Security Code is also constant when the seconds are in the interval [15, 25) and so forth. A program like this emulates the behavior of an app that generates a new pseudorandom Security Code every DT seconds.

Different seeds for different devices

For TFA, every device has a unique Device ID. Because the Device ID is used to set the random number seed, the pseudorandom numbers that are generated on one device will be different than the numbers generated on another device. The following program uses the Device ID as the seed value for the RNG and the time of day for the key value. I wrapped a macro around the program and called it for three hypothetical values of the Device ID.

%macro GenerateCode(ID, DT);
data GenCode;
   keep DeviceID Time SecCode;
   format DeviceID 10. Time TIME10. SecCode Z6.;
   DeviceID = &ID;
   call streaminit('TF2', DeviceID); /* set the seed from the device */
   t = datetime();                   /* look at the current time */
   /* round to the nearest DT seconds and save the "reference time" */
   RefTime = round(t, &DT);          /* round to nearest DT seconds */
   SecCode = Rnd6Int(RefTime);       /* get a random Security Code */
   Time = timepart(t);               /* output only the time */
run;
 
proc print data=GenCode noobs; run;
%mend;
 
/* each device has a unique ID */
%GenerateCode(12345, 30);
%GenerateCode(24680, 30);
%GenerateCode(97531, 30);

As expected, the program produces different Security Codes for different Device IDs, even though the time (key) value is the same.

Summary

In summary, you can use features of the SAS random number generators in SAS 9.4M5 to emulate the behavior of a TFA token generator. The SAS program in this article uses the Device ID as the "seed" and the time of day as a "key" to select an independent stream. (Round the time into a certain time interval.) For this application, you don't want the RAND function to advance the state of the RNG, so you can use the STREAMREWIND call to rewind the stream before each call. In this way, you can generate a pseudorandom Security Code that depends on the device and is valid for a certain length of time.

The post Rewinding random number streams: An application appeared first on The DO Loop.

August 24, 2020
 

I got a lot of feedback about my recent article about how to find roots of nonlinear functions by using the SOLVE function in PROC FCMP. A colleague asked how the FCMP procedure stores the functions. Specifically, why does the OUTLIB= option on the PROC FCMP statement use a three-level syntax: OUTLIB=libref.DataSetName.PackageName? The three levels are a libref, a data set name, and a package name. The documentation is terse about what the third level (the package name) is used for and why it is important. This article describes how the FCMP-defined functions are stored and how you can use the package name to call different versions of a function.

This article is my attempt to "reverse engineer" how PROC FCMP stores functions based on what I have read and observed. In addition to the FCMP documentation, I recommend reading Secosky (2007) and Eberhardt (2009). Feel free to add your own knowledge in the comments.

How FCMP functions are stored

I started writing about the capabilities of the FCMP procedure in 2012, but the procedure itself goes back to SAS 9.2. Modern versions of SAS store functions in an analytic store (which is read by using PROC ASTORE) or in an item store (which is read by using PROC PLM). But these binary storage formats did not exist when PROC FCMP was developed, so PROC FCMP stores functions in a SAS data set. That means you can use PROC PRINT to investigate how PROC FCMP stores functions.

When you use the OUTLIB= option in PROC FCMP, you specify a three-level name: OUTLIB=libref.DataSetName.PackageName. The first two levels specify the name of a SAS data set. This data set is created if it doesn't exist, or it is modified if it already exists. The third level is used as a text field in a variable named _KEY_, which enables one data set to contain functions that belong to different packages. The package name becomes important if two packages define a function that has the same name.

To demonstrate, let's define some functions and store them in a data set named Work.MyFuncs. The following statements create two functions (A and B) that belong to the 'PROD' (for 'Production') package and one function (A) that belongs to the 'DEV' (for 'Development') package. Notice that both packages have a function named 'A'. The following statements define the functions and use PROC PRINT to display a portion of the Work.MyFuncs data set:

/* Store all functions in the data set WORK.MyFuncs */
/* Define functions in 'PROD' package */
proc fcmp outlib=work.MyFuncs.Prod;
   function A(x);
      return( x );      /* in the 'Prod' pkg, A(x) equals x */
   endsub;
   function B(x);
      return( x > 0 );
   endsub;
quit;
 
/* Define functions in 'DEV' package */
proc fcmp outlib=work.MyFuncs.Dev;
   function A(x);
      return( 2*x );    /* the 'Dev' pkg uses a different definition for A(x) */
   endsub;
quit;
 
proc print data=work.MyFuncs;
   var _Key_ Sequence Type Subtype Name;
run;

The output from PROC PRINT is shown. The data set contains 20 rows. I have put a red rectangle around rows 1–13 and another around rows 14–20. Each rectangle defines a package. The names of the packages are defined by the observations where Subtype='Package', which are highlighted in yellow. The Type, Subtype, and Name columns indicate how the FCMP statements that define the functions are stored in the data set. The _KEY_ column identifies which rows define which functions. There are other columns (not shown) that store the actual content of each function.

This output shows how the third level of the OUTLIB= option is used. The _KEY_ column records the package name and appends each function name ('A' or 'B') to the name of the package. So PROC FCMP knows that there are three stored functions whose full names are PROD.A, PROD.B, and DEV.A.

Calling a function from the DATA step

Since there are two functions called 'A', what happens if I call 'A' from a SAS DATA step? The answer is that the DATA step uses the most recent definition, which in this example is DEV.A. To alert you to the fact that calling 'A' is ambiguous, the SAS log displays a warning. You can also use the _DISPLAYLOC_ flag on the CMPLIB= system option to display the origin of each call to an FCMP function, as follows:

/* Tell the DATA step where to look for unresolved functions.
   The _DISPLAYLOC_ flag shows the full name for each call to an FCMP function */
options cmplib=(work.MyFuncs _DISPLAYLOC_); 
data Want;
   x = 1; y = A(x);   /* y is the result of the latest definition */
run;
 
proc print data=Want noobs; run;
WARNING: Function 'A' was defined in a previous package. 'A' in current
         package DEV will be used as default when the package name is not
         specified.
 
NOTE: Function 'A' loaded from work.MyFuncs.DEV.

The value of the X and Y variables make it clear that the function DEV.A was called (because A(x)=2*x in that definition). The WARNING and NOTE in the SAS log reinforce this fact.

Choosing which package to call

The WARNING in the previous section says that the current (most recent) package "will be used as default when the package name is not specified." This message seems to imply that you can somehow call PROD.A, which is the other stored function that is named 'A'. This is, in fact, true. The PROC FCMP documentation states, "to select a specific subroutine when there is ambiguity, use the package name and a period as the prefix to the subroutine name."

You cannot specify the package name directly in the DATA step, but you can specify the package name in an FCMP function. So, for example, you can define a function called 'ChooseA' that includes a flag that indicates which package to use. The following PROC FCMP statements define a function that will call either PROD.A or DEV.A, depending on the value of a parameter. This wrapper function can then be called in the DATA step:

/* In PROC FCMP, you can "dis-ambiguate" by using a two-level function name */
proc fcmp outlib=work.MyFuncs.Choose;
   function ChooseA(x, choice $);
      if upcase(choice)="DEV" then
        return( Dev.A(x) );
      else
        return( Prod.A(x) );
   endsub;
quit;
 
data WantChoice;
   x = 1;
   y_Dev  = ChooseA(x, "Dev");   /* call Dev.A */
   y_Prod = ChooseA(x, "Prod");  /* call Prod.A */
run;
 
proc print data=WantChoice noobs; run;

From the definitions of DEV.A and PROD.A, you can verify that each function was called correctly. Because the _DISPLAYLOC_ option is still active, the SAS log also indicates that each function was called.

Summary

This article was motivated by a question about how the FCMP procedure stores functions. The answer is that the OUTLIB= option on the PROC FCMP statement requires a libref, a data set name, and a package name. In most circumstances, you do not need to use the package name. The package name becomes important only if two different packages each support a function that has the same name. In that case, you can use the package name to disambiguate the function call.

Personally, I prefer to avoid having two packages that define the same function, but if you cannot avoid it, this trick shows you how to handle it. Eberhardt (2009, p. 15) discusses a related issue, which is how to call functions that are stored in two (or more) different data sets.

The post How does PROC FCMP store functions? appeared first on The DO Loop.

August 19, 2020
 

Finding the root (or zero) of a nonlinear function is an important computational task. In the case of a one-variable function, you can use the SOLVE function in PROC FCMP to find roots of nonlinear functions in the DATA step. This article shows how to use the SOLVE function to find the roots of a user-defined function from the SAS DATA step.

Some ways to find the root of a nonlinear function in SAS

I have previously blogged about several ways to find the roots of nonlinear equations in SAS, including by using SAS/IML software.

The DATA step does not have a built-in root-finding function, but you can implement one by using PROC FCMP.

PROC FCMP: Implement user-defined functions in SAS

PROC FCMP enables you to write user-defined functions that can be called from the DATA step and from SAS procedures that support DATA step programming statements (such as PROC NLIN, PROC NLMIXED, and PROC MCMC). To demonstrate finding the roots of a custom function, let's use the same function that I used to show how to find roots by using SAS/IML. The following statements use PROC FCMP to define a function (FUNC1) and to store the function to WORK.FUNCS. You must use the CMPLIB= system option to tell the DATA step to look in WORK.FUNCS for unknown functions. The DATA step evaluates the function for input arguments in the interval [-3, 3]. The SGPLOT procedure creates a graph of the function:

/* use PROC FCMP to create a user-defined function (FUNC1) */
proc fcmp outlib=work.funcs.Roots;
   function Func1(x);         /* create and store a user-defined function */
      return( exp(-x**2) - x**3 + 5*x +1 );
   endsub;
quit;
options cmplib=work.funcs;   /* DATA step will look here for unresolved functions */
 
data PlotIt;   
do x = -3 to 3 by 0.1;       /* evaluate function on [-3, 3] */
   y = Func1(x);
   output;
end;
run;
 
title "Graph of User-Defined Function";
proc sgplot data=PlotIt;
   series x=x y=y;
   refline 0 / axis=y;
   xaxis grid;
run;

From the graph, it appears that the function has three roots (zeros). The approximate locations of the roots are {-2.13, -0.38, 2.33}, as found in my previous article. The next section shows how to use the SOLVE function in PROC FCMP to find these roots.

Use the SOLVE function to find roots

The SOLVE function is not a DATA step function. It is a "special function" in PROC FCMP. According to the documentation for the special functions, "you cannot call these functions directly from the DATA step. To use these functions in a DATA step, you must wrap the special function inside another user-defined FCMP function." You can, however, call the SOLVE function directly from procedures such as NLIN, NLMIXED, and MCMC.

I was confused when I first read the documentation for the SOLVE function, so let me give an overview of the syntax. The SOLVE function finds a value x such that f(x) = y0 for a specified "target value," y0. Equivalently, it finds a root of the function g(x) = f(x) – y0. In this article, I set y0 = 0 so that x is a root of f.

Because the function might have multiple roots, you need to provide a guess (x0) that is close to the root. The SOLVE function will start with your initial guess and apply an iterative algorithm to obtain the root. Thus, you need to provide the following information to the SOLVE function:

  • The name of an FCMP function (such as "Func1") that evaluates the function.
  • An initial guess, x0, that is near a root. (You can also provide convergence criteria that tell the iterative algorithm when it has found an approximate root.)
  • The parameters to the function. Use a missing value to indicate the parameter that you are solving for. For a function of one variable, specify a missing value for x, which tells the SOLVE function to solve for x. In general, a function can take multiple parameters, so use a missing value in the parameter list to indicate which parameter to solve for.

The following SAS program defines a new FCMP function called Root_Func1. That function takes one argument, which is the initial guess for a root. Inside the function, the SOLVE function finds the root of the "Func1" function (defined earlier). If the search is successful, the wrapper returns the root to the caller. To test the Root_Func1 function, I call it from a DATA step and pass in the values -2, 0, and +2. I obtained these guesses by looking at the graph of the Func1 function.

/* to call SOLVE in the DATA step, wrap it inside an FCMP function */
proc fcmp outlib=work.funcs.Roots;
   function Root_Func1(x0);
      array opts[5] init absconv relconv maxiter status; /* initialize to missing */
      opts[1] = x0;        /* opts[1] is the initial guess */
      z = solve("Func1",   /* name of function */
                opts,      /* initial condition, convergence criteria, and status */
                0,         /* target: find x so that f(x) = 0 */
                .);        /* Params for f. A missing value indicates param to solve for */
      if status=0 then 
         return( z );      /* root found */
      else return (.);     /* a root was not found */
   endsub;
quit;
 
data Roots1;
input x0 @@;   
root = Root_Func1(x0);     /* x0 is a "guess" for a root of Func1 */
datalines;
-2 0 2
;
 
proc print data=Roots1 noobs; run;

The output shows that the function has found three roots, which are approximately {-2.13, -0.38, 2.33}. These are the same values that I found by using the FROOT function in SAS/IML.
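As an aside, recall that the third argument to the SOLVE function is the target value, y0. The following hypothetical wrapper (a sketch, not part of the original example) shows how you might solve Func1(x) = y0 for a nonzero target. It assumes the Func1 definition and the WORK.FUNCS library from earlier:

/* hypothetical sketch: find x near x0 such that Func1(x) = y0 */
proc fcmp outlib=work.funcs.Roots;
   function Solve_Func1_Target(x0, y0);
      array opts[5] init absconv relconv maxiter status;
      init = x0;                        /* initial guess */
      z = solve("Func1", opts, y0, .);  /* third argument is the target value */
      if status=0 then return( z );     /* solution found */
      else return( . );                 /* no solution found */
   endsub;
quit;
 
data Target;
   x = Solve_Func1_Target(2, 3);        /* find x near 2 such that Func1(x) = 3 */
run;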

The roots of functions that have parameters

Sometimes it is useful to include parameters in a function. For example, the following function of x contains two parameters, a and b:
    g(x; a, b) = exp(-x^2) + a*x^3 + b*x + 1
Notice that the first function (Func1) is a special case of this function because f(x) = g(x; -1, 5). You can include parameters in the FCMP functions that define the function and that find its roots. When you call the SOLVE function, the last arguments are the parameters of the function (x, a, and b). Give specific values to the a and b parameters, and use a missing value to indicate that x is the variable to solve for. This is shown in the following program:

/* you can also define a function that depends on parameters */
proc fcmp outlib=work.funcs.Roots;
   /* define the function. Note Func1(x) = Func2(x; a=-1, b=5) */
   function Func2(x, a, b);
      return( exp(-x**2) + a*x**3 + b*x +1 );
   endsub;
 
   function Root_Func2(x0, a, b);
      array opts[5] init absconv relconv maxiter status; /* initialize missing */
      init = x0;           /* pass in initial guess */
      z = solve("Func2",   /* name of function */
                opts,      /* initial condition, convergence opts, status */
                0,         /* Find x so that f(x) = 0 */
                ., a, b ); /* Params for f. A missing value indicates param to solve for */
      if status=0 then 
         return( z );      /* root found */
      else return (.);     /* a root was not found */
   endsub;
quit;
 
/* Evaluate Func2 for x in [-3,3] for three pairs of (a,b) values:
   (-1,5), (-1,3), and (-2,1) */
data PlotIt;
do x = -3 to 3 by 0.1;
   y = Func2(x, -1, 5); lbl="a=-1; b=5";  output;
   y = Func2(x, -1, 3); lbl="a=-1; b=3";  output;
   y = Func2(x, -2, 1); lbl="a=-2; b=1";  output;
end;
run;
 
title "Graph of User-Defined Functions";
proc sgplot data=PlotIt;
   series x=x y=y / group=lbl;
   refline 0 / axis=y;
   xaxis grid;
   yaxis min=-10 max=10;
run;

The three graphs are shown. You can see that two of the graphs have three roots, but the third graph has only one root. You can call the Root_Func2 function from the DATA step to find the roots for each function:

/* find the roots of Func2 for (a,b) values (-1,5), (-1,3), and (-2,1) */
data Roots2;
input a b x0;
root = Root_Func2(x0, a, b);
datalines;
-1 5 -2 
-1 5  0 
-1 5  2 
-1 3 -2 
-1 3  0 
-1 3  2 
-2 1  0
-2 1  2
;
 
proc print data=Roots2 noobs; run;

Notice that the roots for (a,b) = (-1, 5) are exactly the same as the roots for Func1. The roots for the function when (a,b) = (-1, 3) are approximately {-1.51, -0.64, 1.88}. The root for the function when (a,b) = (-2, 1) is approximately 1.06.

Notice that the initial guess x0 = 0 did not converge to a root for the function that has (a,b) = (-2, 1). It is well known that Newton-type methods do not always converge. This is in contrast to Brent's method (used by the FROOT function in SAS/IML), which always converges to a root if the function changes sign on the bracketing interval.
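For comparison, the following SAS/IML sketch uses the FROOT function to find the roots of Func1. The bracketing intervals are assumptions that I read off the graph; FROOT finds a root inside each interval on which the function changes sign:

proc iml;
start Func1(x);
   return( exp(-x##2) - x##3 + 5*x + 1 );
finish;
 
/* each row is an interval that brackets one sign change */
intervals = {-3 -1,
             -1  1,
              1  3};
roots = froot("Func1", intervals);   /* Brent's method on each interval */
print roots;
quit;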

Summary

In summary, this article shows how to use the SOLVE function in Base SAS to find the roots (zeros) of a nonlinear function of one variable. You first define a nonlinear function by using PROC FCMP. You can call the SOLVE function directly from some SAS procedures. However, if you want to use it in the DATA step, you need to define a second FCMP function that takes the function parameters and an initial guess and calls the SOLVE function. This article shows two examples of using the SOLVE function to find roots in SAS.

The post Find the root of a function by using the SAS DATA step appeared first on The DO Loop.

August 17, 2020
 

I recently showed how to use simulation to estimate the power of a statistical hypothesis test. The example (a two-sample t test for the difference of means) uses a simple SAS/IML module that runs very fast. Fast is good because you often want to perform a sequence of simulations over a range of parameter values to construct a power curve. If you need to examine 100 sets of parameter values, you would expect the full computation to take about 100 times longer than one simulation.

But that calculation applies only when you run the simulations sequentially. If you run the simulations in parallel, you can complete the simulation study much faster. For example, if you have access to a machine or cluster of machines that can run 32 threads, then each thread needs to run only a few simulations. You can perform custom parallel computations by using the iml action in SAS Viya. The iml action is supported in SAS Viya 3.5.

This article shows how to distribute computations by using the PARTASKS function in the iml action. (PARTASKS stands for PARallel TASKS.) If you are not familiar with the iml action or with parallel processing, see my previous example that uses the PARTASKS function.

Use simulation to estimate a power curve

I previously showed how to use simulation to estimate the power of a statistical test. In that article, I simulated data from N(15,5) and N(16,5) distributions. Because the t test is invariant under centering and scaling operations, this study is mathematically equivalent to simulating from the N(0,1) and N(0.2,1) distributions. (Subtract 15 and divide by 5.)

Recall that the power of a statistical hypothesis test is the probability of rejecting the null hypothesis when the alternative hypothesis is true. (Equivalently, the power is 1 minus the probability of a Type 2 error, which is failing to reject the null hypothesis when the alternative is true.) The power depends on the sample sizes and the magnitude of the effect you are trying to detect. In this case, the effect is the difference between two means. You can study how the power depends on the mean difference by estimating the power of a t test that compares data from an N(0,1) distribution with data from an N(δ, 1) distribution, where δ is a parameter in the interval [0, 2]. All hypothesis tests in this article use the α=0.05 significance level.

The power curve we want to estimate is shown to the right. The horizontal axis is the δ parameter, which represents the true difference between the population means. Each point on the curve is an estimate of the power of a two-sample t test for random samples of size n1 = n2 = 10. The curve indicates that when there is no difference between the means, you will conclude otherwise in 5% of random samples (because α=0.05). For larger differences, the probability of detecting the difference increases. If the difference is very large (more than 1.5 standard deviations apart), more than 90% of random samples will correctly reject the null hypothesis in the t test.

The curve is calculated at 81 values of δ, so this curve is the result of running 81 independent simulations. Each simulation uses 100,000 random samples and carries out 100,000 t tests. Although it is hard to see, the graph also shows 95% confidence intervals for power. The confidence intervals are very small because so many simulations are run.

So how long did it take to compute this power curve, which is the result of 8.1 million t tests? About 0.4 seconds by using parallel computations in the iml action.

Using the PARTASKS function to distribute tasks

Here's how you can write a program in the iml action to run 81 simulations across 32 threads. The following program uses four nodes, each with 8 threads, but you can achieve similar results by using other configurations. The main parts of the program are as follows:

  • The program assumes that you have used the CAS statement in SAS to establish a CAS session that uses four worker nodes, each with at least eight CPUs. For example, the CAS statement might look something like this:
        cas sess4 host='your_host_name' casuser='your_username' port=&MPPPort nworkers=4;
    The IML program is contained between the SOURCE/ENDSOURCE statements in PROC CAS.
  • The TTestH0 function implements the t test. It runs a t test on the columns of the input matrices and returns the proportion of samples that reject the null hypothesis. The TTestH0 function is explained in a previous article.
  • The "task" we will distribute is the SimTTest function. (This encapsulates the "main program" in my previous article.) The SimTTest function simulates the samples, calls the TTestH0 function, and returns the number of t tests that reject the null hypothesis. The first sample is always from the N(0, 1) distribution and the second sample is from the N(&delta, 1) distribution. Each call creates B samples. The input to the SimTTest function is a list that contains all parameters: a random-number seed, the sample sizes (n1 and n2), the number of samples (B), and the value of the parameter (delta).
  • The main program first sets up a list of arguments for each task. The i_th item in the list is the argument to the i_th task. For this computation, all arguments are the same (a list of parameters) except that the i_th argument includes the parameter value delta[i], where delta is a vector of parameter values in the interval [0, 2].
  • The call to the PARTASKS function is
         RL = ParTasks('SimTTest', ArgList, {2, 0});
    where the first parameter is the task (or list of tasks), the second parameter is the list of arguments to the tasks, and the third parameter specifies how to distribute the computations. The documentation of the PARTASKS function provides the full details.
  • After the PARTASKS function returns, the program processes the results. For each value of the parameter, δ, the main statistic is the proportion of t tests (p) that reject the null hypothesis. The program also computes a 95% (Wald) confidence interval for the binomial proportion. These statistics are written to a CAS table called 'PowerCurve' by using the MatrixWriteToCAS subroutine.

/* Power curve computation: delta = 0 to 2 by 0.025 */
proc cas;
session sess4;                         /* use session with four workers     */
loadactionset 'iml';                   /* load the iml action set           */
source TTestPower;
 
/* Helper function: Compute t test for each column of X and Y.
   X is (n1 x m) matrix and Y is (n2 x m) matrix.
   Return the number of columns for which t test rejects H0 */
start TTestH0(x, y);
   n1 = nrow(X);     n2 = nrow(Y);     /* sample sizes                      */
   meanX = mean(x);  varX = var(x);    /* mean & var of each sample         */
   meanY = mean(y);  varY = var(y);
   poolStd = sqrt( ((n1-1)*varX + (n2-1)*varY) / (n1+n2-2) );
 
   /* Compute t statistic and indicator var for tests that reject H0 */
   t = (meanX - meanY) / (poolStd*sqrt(1/n1 + 1/n2));
   t_crit =  quantile('t', 1-0.05/2, n1+n2-2);       /* alpha = 0.05        */
   RejectH0 = (abs(t) > t_crit);                     /* 0 or 1              */
   return  RejectH0;                                 /* binary vector       */
finish;
 
/* Simulate two groups; Count how many reject H0: delta=0 */
start SimTTest(L);                     /* define the mapper                 */
   call randseed(L$'seed');            /* each thread uses different stream */
   x = j(L$'n1', L$'B');               /* allocate space for Group 1        */
   y = j(L$'n2', L$'B');               /* allocate space for Group 2        */
   call randgen(x, 'Normal', 0,         1);   /* X ~ N(0,1)                 */
   call randgen(y, 'Normal', L$'delta', 1);   /* Y ~ N(delta,1)             */
   return sum(TTestH0(x, y));          /* count how many reject H0          */
finish;
 
/* ----- Main Program ----- */
numSamples = 1e5;
L = [#'delta' = .,   #'n1' = 10,  #'n2' = 10,
     #'seed'  = 321, #'B'  = numSamples];
 
/* Create list of arguments. Each arg gets different value of delta */
delta = T( do(0, 2, 0.025) );
ArgList = ListCreate(nrow(delta));
do i = 1 to nrow(delta);
   L$'delta' = delta[i];
   call ListSetItem(ArgList, i, L);
end;
 
RL = ParTasks('SimTTest', ArgList, {2, 0});  /* assign nodes before threads */
 
/* Summarize results and write to CAS table for graphing */
varNames = {'Delta' 'ProbEst' 'LCL' 'UCL'};  /* names of result vars        */
Result = j(nrow(delta), 4, .);
zCrit = quantile('Normal', 1-0.05/2);  /* zCrit = 1.96                      */
do i = 1 to nrow(delta);               /* for each task                     */
   p = RL$i / numSamples;              /* get proportion that reject H0     */
   SE = sqrt(p*(1-p) / L$'B');         /* std err for binomial proportion   */
   LCL = p - zCrit*SE;                 /* 95% CI                            */
   UCL = p + zCrit*SE;
   Result[i,] = delta[i] || p || LCL || UCL;
end;
 
call MatrixWriteToCas(Result, '', 'PowerCurve', varNames);
endsource;
iml / code=TTestPower nthreads=8;
quit;

You can pull the 81 statistics back to a SAS data set and use PROC SGPLOT to plot the results. You can download the complete program that generates the statistics and graphs the power curve.
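For reference, here is a minimal sketch of that step. It assumes that a CAS session is active; the libref name (mycas) is arbitrary:

libname mycas cas;                    /* CAS engine libref for the active caslib */
 
data PowerCurve;                      /* copy the statistics to a SAS data set */
   set mycas.PowerCurve;
run;
 
title "Power of the Two-Sample t Test (n1=n2=10)";
proc sgplot data=PowerCurve noautolegend;
   band x=Delta lower=LCL upper=UCL;  /* 95% confidence band */
   series x=Delta y=ProbEst;
   xaxis label="Delta" grid;
   yaxis label="Power" grid;
run;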

Notice that there are 81 tasks but only 32 threads. That is not a problem. The PARTASKS function tries to distribute the workload as evenly as possible. In this example, 17 threads are assigned three tasks and 15 threads are assigned two tasks (17×3 + 15×2 = 81). If T is the time required to perform one power estimate, the total time to compute the power curve will be approximately 3T, plus "overhead costs" such as the time required to set up the problem, distribute the parameters to each task, and aggregate the results. You can minimize the overhead costs by passing only small amounts of data across nodes.

Summary

In summary, if you have a series of independent tasks in IML, you can use the PARTASKS function to distribute tasks to available threads. The speedup can be dramatic; it depends on the time required for the thread that performs the most work.

This article shows an example of using the PARTASKS function in the iml action, which is available in SAS Viya 3.5. The example shows how to distribute a sequence of computations among k threads that run concurrently and independently. In this example, k=32. Because the tasks are essentially identical, each thread performs two or three of the 81 tasks. The results from each task are returned in a list, where they can be further processed by the main program.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post Estimate a power curve in parallel in SAS Viya appeared first on The DO Loop.