Statistical Programming

November 30, 2022

A probabilistic card trick is a trick that succeeds with high probability and does not require any skill from the person performing the trick. I have seen a certain trick mentioned several times on social media. I call it "ladders" or the "ladders game" because it reminds me of the childhood board game Chutes and Ladders (previously known as Snakes and Ladders in India and the UK). In Chutes and Ladders, some squares on the board tell the player to move forward (go up a ladder) or to move backward (go down a chute). In the "ladders" game, you only move forward.

This article uses Monte Carlo simulation in SAS to show that the "ladders" card trick "works" about 98.2% of the time.

I have been told that Martin Gardner wrote about this game. He referred to mathematical card tricks as those that "do not require prestidigitation," or sleight of hand.

The ladders card trick

The rules of the game and a Python simulation are presented in a blog post by Bob Carpenter, who saw the trick performed at a children's museum. The performer shuffles an ordinary deck of 52 playing cards and places them face up in a row on a table. The performer then describes the game of ladders:

  • A player starts by pointing to any card in the first five positions.
  • The value of the card indicates the number of cards to move forward. If the card is 2, 3, ..., 9, you move forward that many cards. If the card is a 10, jack, queen, king, or ace, you move forward one space.
  • From the new position, apply the rule again. Repeat this process until you cannot advance any more.

The performer selects two volunteers from the audience and tells them each to point randomly to any card from among the first five cards. Each player does so. The performer then announces that both players will end up on a certain "magical" card and he tells everyone the name of the magical card. The players then follow the game rules. Amazingly, each person's path terminates on the magical card that was previously announced. The trick works about 98.2% of the time.

To illustrate the ladders game, first consider a simpler situation of a deck that contains only 13 cards, one for each card value (for example, use only spades). One random shuffle of the 13 cards is shown to the right. For this arrangement of cards, the performer announces that both volunteers will end up on the 7 card (the magical card for this shuffle), which is the 11th card in the row of cards. Suppose the first volunteer points to the ace in the first position as her starting card. The rules of the game dictate the following sequence of moves:

  • From the A, the player moves forward one card and lands on the J.
  • From the J, the player moves forward one card and lands on the 2.
  • From the 2, the player moves forward two cards and lands on the 6.
  • From the 6, the player moves forward six cards and lands on the 7. The player cannot move forward seven spaces, so the 7 is her final card, as predicted in advance.

Thus, the path for the first player is the sequence of positions (1, 2, 3, 5, 11).

What happens for the other player? Well, if the second volunteer chooses his initial card at positions 2, 3, or 5, then his path will instantly coincide with the path of the first player, so his path, too, terminates at the 7 card in the 11th position. If the second volunteer chooses his initial card at position 4, then his path hits positions 4, 8, and 11. This path also terminates at the 7 card in the 11th position. Thus, for this set of cards and this shuffle, all initial positions follow a path that eventually lands on the 7.
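
To make the moves concrete, here is a short SAS/IML sketch that traces the first player's path. The article does not list the full 13-card arrangement, so the moves at the positions that do not affect this path (positions 6, 7, 9, 10, 12, and 13) are filled in arbitrarily with the leftover cards:

proc iml;
/* moves for a 13-card arrangement that is consistent with the description:
   position:  1 2 3 4 5 6 7 8 9 10 11 12 13
   card:      A J 2 4 6 ? ? 3 ? ?   7  ?  ?     (? = arbitrary leftover card) */
moves = {1 1 2 4 6 5 8 3 9 1 7 1 1};
Pos = 1;                                     /* first volunteer starts at position 1 (the ace) */
path = Pos;
do while(Pos + moves[Pos] <= ncol(moves));   /* stop when you cannot advance any more */
   Pos = Pos + moves[Pos];                   /* move forward by the value of the current card */
   path = path || Pos;
end;
print path;                                  /* positions 1 2 3 5 11, as described above */
QUIT;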

How does the performer know the name of the final magical card? As he deals, he secretly counts the path determined by the first card in the deck.

Simulate the ladders game in SAS

You can use Monte Carlo simulation to estimate the probability that the ladders card trick "works." To keep it simple, let's say the trick "works" when the paths of the two players converge. That is, the players initially start on two random positions (1-5), but they end up on the same final card after they each follow the path dictated by their initial position. The simulation for the ladders card trick has the following major steps:

  1. Set up the card deck. For each card, assign the number of steps to move forward.
  2. Create a function that computes the final position, given the initial position and the sequence of moves determined by the shuffled deck.
  3. Run the Monte Carlo simulation. This consists of several sub-steps:
    • Shuffle the deck (that is, shuffle the array of moves).
    • Use sampling without replacement to choose two initial positions.
    • Compute the final card for each initial position.
    • Record whether the two paths converged.
  4. Estimate the probability of the trick working as the proportion of times that the paths converged in the simulation.

As is often the case, this simulation is easiest to perform in SAS by using the SAS/IML language. You can use the RANPERM function to shuffle the deck. You can use the SAMPLE function to sample without replacement from the values 1-5. You can write a function that computes the final position based on the initial position. The following program is one way to implement a Monte Carlo simulation for the ladders card trick:

/* The Ladders card trick:
   https://statmodeling.stat.columbia.edu/2022/10/07/mira-magic-a-probabilistic-card-trick/
   Lay out a shuffled deck of cards.
   Choose one of the first five cards as a starting point.
   Move down the deck as follows:
   If the current card is 2-9, move that many cards forward.
   For other cards, move one space forward.
*/
proc iml;
/* 1. set up the card deck and assign values. 
   Note: We don't actually need the deck, only the instructions
   about how to move. */
values = {A,'2','3','4','5','6','7','8','9','10', J, Q, K};
advance= {1,  2,  3,  4,  5,  6,  7,  8,  9,  1,  1, 1, 1};
deck  = repeat(values, 4);   /* full deck of 52 cards */
moves = repeat(advance, 4);  /* corresponding moves */
nCards = nrow(moves);
 
/* 2. Find the final position, given the first position and 
   the vector of moves for the current shuffled deck.
   startPos : a scalar value 1-5
   Move : a vector of values. If your current card is in 
          position i, the next card is position (i + Move[i])
*/
start EndPos(startPos, Move);
   Pos = startPos;
   do while(Pos + Move[Pos] <= nrow(Move));  /* can still advance while the next position is within the deck */
      Pos = Pos + Move[Pos];
   end;
   return( Pos );
finish;
 
/* 3. Run the Monte Carlo simulation:
      A. Shuffle the deck (vector of moves)
      B. Use sampling w/o replacement to choose two initial positions
      C. Compute the final card for each initial position
      D. Record whether the two paths converged
*/
call randseed(1234);
NSim = 1E5;
match = j(NSim,1);
do j = 1 to nSim;
   k = ranperm(nCards);          /* shuffle the deck */
   Move = moves[k];
   /* choose two different starting positions in 1-5 */
   startPos = sample(1:5, 2, "NoReplace");
   EndPos1 = EndPos(startPos[1], Move);
   EndPos2 = EndPos(startPos[2], Move);
   match[j] = (EndPos1=EndPos2);
end;
 
/* 4. Estimate probability as the proportion of matches (1s) in a binary vector.
      Optional: Compute a Wald CL for the proportion */
ProbMatch = mean(match);
StdErr = sqrt( ProbMatch*(1-ProbMatch)/ NSim );
CL = ProbMatch + {-1 1}*1.96*StdErr;
print NSim[F=comma9.] ProbMatch StdErr CL[c={'LCL' 'UCL'}];

The simulation shows that the two players end up on the same card more than 98% of the time.

Summary

This article describes a probabilistic card trick, which I call the game of ladders. Two players choose random cards as an initial condition, then follow the rules of the game. For about 98.2% of the deals and initial positions, the paths of the two players will converge to the same card. As the performer lays out the cards, he can predict in advance that the paths will converge and, with high probability, announce the name of the final card.

Bob Carpenter's blog post mentions the connection between this card trick and Markov chains. He also presents several variations, such as using more cards in the deck.

The post Ladders: A probabilistic card trick appeared first on The DO Loop.

November 21, 2022

Recently, I needed to know "how much" of a piecewise linear curve is below the X axis. The coordinates of the curve were given as a set of ordered pairs (x1,y1), (x2,y2), ..., (xn, yn). The question is vague, so the first step is to define the question better. Should I count the number of points on the curve for which the Y value is negative? Should I use a weighted sum and add up all the negative Y values? Ultimately, I decided that the best measure of "negativeness" for my application is to compute the area that lies below the line Y=0 and above the curve. In calculus, this would be the "negative area" of the curve. Because the curve is piecewise linear, you can compute the area exactly by using the trapezoid rule of integration.

An example is shown to the right. The curve is defined by 12 ordered pairs. The goal of this article is to compute the area shaded in blue. This is the "negative area" with respect to the line Y=0. With very little additional effort, you can generalize the computation to find the area below any horizontal line and above the curve.

Area defined by linear segments

The algorithm for computing the shaded area is simple. For each line segment along the curve, let [a,b] be the interval defined by the left and right abscissas (X values). Let f(a) and f(b) be the corresponding ordinate values. Then there are four possible cases for the positions of f(a) and f(b) relative to the horizontal reference line, Y=0:

  • Both f(a) and f(b) are above the reference line. In this case, the area between the line segment and the reference line is positive. We are not interested in this case for this article.
  • Both f(a) and f(b) are below the reference line. In this case, the "negative area" can be computed as the area of a trapezoid: \(A = 0.5 (b - a) (f(b) + f(a))\).
  • The value f(a) is below the reference line, but f(b) is above the line. In this case, the "negative area" can be computed as the area of a triangle. You first solve for the location, c, at which the line segment intersects the reference line. The negative area is then \(A = 0.5 (c - a) f(a)\).
  • The value f(a) is above the reference line and f(b) is below the line. Again, the relevant area is a triangle. Solve for the intersection location, c, and compute the negative area as \(A = 0.5 (b - c) f(b)\).

The three cases for negative area are shown in the next figure:

You can easily generalize these formulas if you want the area above the curve and below the line Y=t. In every formula that includes f(a), replace that value with (f(a) – t). Similarly, replace f(b) with (f(b) – t).
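
For the two triangle cases, you need the intersection location, c. Because the segment is linear, you can solve for c by linear interpolation:
\(c = a - f(a) \frac{b-a}{f(b)-f(a)}\)
This is the expression that appears in the SAS/IML function in the next section. (When the reference line is Y=t, replace f(a) and f(b) by f(a) – t and f(b) – t; the denominator is unchanged because the t terms cancel.)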

Compute the negative area

The simplest computation for the negative area is to loop over all n points on the curve. For the i_th point (1 ≤ i < n), let [a,b] be the interval [x[i], x[i+1]] and apply the formulas in the previous section. Because we skip any interval for which f(a) and f(b) are both positive, we can exclude the point (x[i], y[i]) if y[i-1], y[i], and y[i+1] are all positive. This is implemented in the following SAS/IML function. By default, the function returns the area below the line Y=0 and above the curve. You can use an optional argument to change the value of the horizontal reference line.

proc iml;
/* compute the area below the line y=y0 for a piecewise
   linear function with vertices given by (x[i],y[i]) */
start AreaBelow(x, y, y0=0);
   n = nrow(x);
   idx = loc(y<y0);                /* find indices for which y[i] < y0 */
   if ncol(idx)=0 then return(0);
 
   k = unique(idx-1, idx, idx+1);  /* we need indices before and after */
   jdx = loc(k > 0 & k <  n);      /* restrict to indices in [1, n-1] */
   v = k[jdx];                     /* a vector of the relevant vertices */
 
   NegArea = 0;
   do j = 1 to nrow(v);            /* loop over intervals where f(a) or f(b) negative */
      i = v[j];                    /* get j_th index in the vector v */
      fa = y[i]-y0; fb = y[i+1]-y0;/* signed distance from cutoff line */
      if fa > 0 & fb > 0 then 
         ;                         /* segment is above cutoff; do nothing */
      else do;
         a  = x[i];  b = x[i+1];
         if fa < 0 & fb < 0 then do;  /* same sign, use trapezoid rule */
            Area = 0.5*(b - a) * (fb + fa);
         end;
         /* different sign, f(a) < 0, find root and use triangle area */
         else if fa < 0 then do;  
            c = a - fa * (b - a) / (fb - fa);
            Area = 0.5*(c - a)*fa;
         end;
         /* different sign, f(b) < 0, find root and use triangle area */
         else do;
            c = a - fa * (b - a) / (fb - fa);
            Area = 0.5*(b - c)*fb;
         end;
         NegArea = NegArea + Area;
      end;
   end;
   return( NegArea );
finish;
 
/* points along a piecewise linear curve */
x = {   1,    2, 3.5,   4,5,     6, 6.5,   7,   8,  10, 12,  15};
y = {-0.5, -0.1, 0.2, 0.7,0.8,-0.2, 0.3, 0.6, 0.3, 0.1,-0.4,-0.6};
 
/* compute area under the line Y=0 and above curve (="negative area") */
NegArea = AreaBelow(x,y);
print NegArea;

The program defines the AreaBelow function and calls the function for the piecewise linear curve that is shown at the top of this article. The output shows that the area of the shaded regions is -2.185.

The area between a curve and a reference line

I wrote the program in the previous section so that the default behavior is to compute the area below the reference line Y=0 and above a curve. However, you can use the third argument to specify the value of the reference line. For example, the following statements compute the areas below the lines Y=0.1 and Y=0.2, respectively:

Area01 = AreaBelow(x,y, 0.1);      /* reference line Y=0.1 */
Area02 = AreaBelow(x,y, 0.2);      /* reference line Y=0.2 */
print Area01 Area02;

As you would expect, the magnitude of the area increases as the reference line moves up. In fact, you can visualize the area below the line Y=t and above the curve as a function of t. Simply call the AreaBelow function in a loop for a sequence of increasing t values and plot the results:

t = do(-0.7, 1.0, 0.05);
AreaUnder = j(1,ncol(t));
do i = 1 to ncol(t);
   AreaUnder[i] = AreaBelow(x, y, t[i]);
end;
title "Area Under Reference Line";
call series(t, AreaUnder) label={'Reference Level' 'Area'} grid={x y} xvalues=do(-0.7, 1.0, 0.1);

The graph shows that the area under the reference line is a monotonic function of t. If the reference line is below the minimum value of the curve, the area is zero. As you increase t, the area below the line Y=t and above the curve increases in magnitude. After the reference line reaches the maximum value along the curve (Y=0.8 for this example), the magnitude of the area increases linearly. It is difficult to see in the graph above, but the curve is actually linear for t ≥ 0.8.
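
To see why the curve is eventually linear, note that when t exceeds the maximum value of the curve, every segment lies entirely below the reference line, so the total is the full trapezoidal integral
\(A(t) = \int_{x_1}^{x_n} (f(x) - t)\, dx = \int_{x_1}^{x_n} f(x)\, dx - t\,(x_n - x_1)\)
which is linear in t. For these data, \(x_1 = 1\) and \(x_n = 15\), so the slope is –14.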

Summary

You can use numerical integration to determine "how much" of a function is negative. If the function is piecewise linear, the integral over the negative intervals can be computed by using the trapezoid rule. This article shows how to compute the area between a reference line Y=t and a piecewise linear curve. When t=0, this is the "negative area" of the curve.

Incidentally, this article is motivated by the same project that inspired me to write about how to test whether a function is monotonically increasing. If a function is monotonically increasing, then its derivative is nonnegative. Therefore, another way to test whether a function is monotonically increasing is to test whether its derivative is never negative. A way to measure how far a function deviates from being monotonic is to compute the "negative area" of its derivative.
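
As a small illustration of that idea, the following sketch (not part of the original project; it assumes the AreaBelow module defined earlier is still available) evaluates the derivative of a test function on a grid and measures its negative area:

/* g(x) = x + 2*sin(x) has derivative g'(x) = 1 + 2*cos(x), which is
   negative on part of [0, 2*pi], so g is not monotonically increasing */
x    = T( do(0, 6.2832, 0.01) );   /* grid of points in [0, 2*pi] */
dgdx = 1 + 2*cos(x);               /* derivative evaluated on the grid */
NonMono = AreaBelow(x, dgdx);      /* 0 ==> the derivative is never negative on the grid */
print NonMono;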

The post The area under a piecewise linear curve appeared first on The DO Loop.

October 24, 2022

I recently gave a presentation about the SAS/IML matrix language in which I emphasized that a matrix language enables you to write complex analyses by using only a few lines of code. In the presentation, I used least squares regression as an example. One participant asked how many additional lines of code would be required for binary logistic regression. I think my answer surprised him. I claimed it would take about a dozen lines of code to obtain parameter estimates for logistic regression. This article shows how to obtain the parameter estimates for a logistic regression model "manually" by using maximum likelihood estimation.

Example data and logistic regression model

Of course, if you want to fit a logistic regression model in SAS, you should use PROC LOGISTIC or another specialized regression procedure. The LOGISTIC procedure not only gives parameter estimates but also produces related statistics and graphics. However, implementing a logistic regression model from scratch is a valuable exercise because it enables you to understand the underlying statistical and mathematical principles. It also enables you to understand how to generalize the basic model. For example, with minimal work, you can modify the program to support binomial data that represent events and trials.

Let's start by running PROC LOGISTIC on data from the PROC LOGISTIC documentation. The response variable, REMISS, indicates whether there was cancer remission in each of 27 cancer patients. There are six variables that might be important in predicting remission. The following DATA step defines the data. The subsequent call to PROC LOGISTIC fits a binary logistic model to the data. The procedure produces a lot of output, but only the parameter estimates are shown:

data Remission;
   input remiss cell smear infil li blast temp;
   label remiss='Complete Remission';
   datalines;
1   .8   .83  .66  1.9  1.1     .996
1   .9   .36  .32  1.4   .74    .992
0   .8   .88  .7    .8   .176   .982
0  1     .87  .87   .7  1.053   .986
1   .9   .75  .68  1.3   .519   .98
0  1     .65  .65   .6   .519   .982
1   .95  .97  .92  1    1.23    .992
0   .95  .87  .83  1.9  1.354  1.02
0  1     .45  .45   .8   .322   .999
0   .95  .36  .34   .5  0      1.038
0   .85  .39  .33   .7   .279   .988
0   .7   .76  .53  1.2   .146   .982
0   .8   .46  .37   .4   .38   1.006
0   .2   .39  .08   .8   .114   .99
0  1     .9   .9   1.1  1.037   .99
1  1     .84  .84  1.9  2.064  1.02
0   .65  .42  .27   .5   .114  1.014
0  1     .75  .75  1    1.322  1.004
0   .5   .44  .22   .6   .114   .99
1  1     .63  .63  1.1  1.072   .986
0  1     .33  .33   .4   .176  1.01
0   .9   .93  .84   .6  1.591  1.02
1  1     .58  .58  1     .531  1.002
0   .95  .32  .3   1.6   .886   .988
1  1     .6   .6   1.7   .964   .99
1  1     .69  .69   .9   .398   .986
0  1     .73  .73   .7   .398   .986
;
 
proc logistic data=remission;
   model remiss(event='1') = cell smear infil li blast temp;
   ods select ParameterEstimates;
run;

The next section shows how to obtain the parameter estimates "manually" by performing a maximum-likelihood computation in the SAS/IML language.

Logistic regression from scratch

Suppose you want to write a program to obtain the parameter estimates for this model. Notice that the regressors in the model are all continuous variables, and the model contains only main effects, which makes forming the design matrix easy. To fit a logistic model to data, you need to perform the following steps:

  1. Read the data into a matrix and construct the design matrix by appending a column of 1s to represent the Intercept variable.
  2. Write the loglikelihood function. This function takes a vector of parameters (b) as input and evaluates the loglikelihood for the binary logistic model, given the data. The Penn State course in applied regression analysis explains the model and how to derive the loglikelihood function. The function is
    \(\ell(\beta) = \sum_{i=1}^{n}[y_{i}\log(\pi_{i})+(1-y_{i})\log(1-\pi_{i})]\)
    where \(\pi_i\) is the i_th component of the logistic transformation of the linear predictor: \(\pi(\beta) = \frac{1}{1+\exp(-X\beta)}\)
  3. Make an initial guess for the parameters and use nonlinear optimization to find the maximum likelihood estimates. The SAS/IML language supports several optimization methods. I use the Newton-Raphson method, which is implemented by using the NLPNRA subroutine.
proc iml;
/* 1. read data and form design matrix */
varNames = {'cell' 'smear' 'infil' 'li' 'blast' 'temp'};
use remission;
read all var "remiss" into y;   /* read response variable */
read all var varNames into X;   /* read explanatory variables */
close;
X = j(nrow(X), 1, 1) || X;     /* design matrix: add Intercept column */
 
/* 2. define loglikelihood function for binary logistic model */
start BinLogisticLL(b) global(X, y);
   z = X*b`;                   /* X*b, where b is a column vector */
   p = Logistic(z);            /* 1 / (1+exp(-z)) */
   LL = sum( y#log(p) + (1-y)#log(1-p) );   
   return( LL );
finish;
 
/* 3. Make initial guess and find parameters that maximize the loglikelihood */
b0 = j(1, ncol(X), 0);         /* initial guess */
opt = 1;                       /* find maximum of function */
call nlpnra(rc, b, "BinLogisticLL", b0, opt);   /* use Newton's method to find b that maximizes the LL function */
print b[c=("Intercept"||varNames) L="Parameter Estimates" F=D8.];
QUIT;

The parameter estimates from the MLE are similar to the estimates from PROC LOGISTIC. When performing nonlinear optimization, different software uses different methods and criteria for convergence, so you will rarely get the exact parameter estimates when you solve the same problem by using different methods. For these data, the relative difference in each parameter estimate is ~1E-4 or less.

I didn't try to minimize the number of lines in the program. As written, the number of executable lines is 16. If you shorten the program by combining lines (for example, p = Logistic(X*b`);), you can get the program down to 10 lines.
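
For example, a condensed sketch (not the author's 10-line version; the module name LL is arbitrary) might look like this:

proc iml;
use remission;
read all var "remiss" into y;
read all var {'cell' 'smear' 'infil' 'li' 'blast' 'temp'} into X;
close;
X = j(nrow(X), 1, 1) || X;                 /* design matrix with Intercept column */
start LL(b) global(X, y);
   p = Logistic(X*b`);                     /* combine the two statements */
   return( sum(y#log(p) + (1-y)#log(1-p)) );
finish;
b0 = j(1, ncol(X), 0);                     /* initial guess */
call nlpnra(rc, b, "LL", b0, {1});         /* opt[1]=1 ==> maximize the loglikelihood */
print b;
QUIT;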

Of course, a program should never be judged solely by the number of lines it contains. I can convert any program to a "one-liner" by wrapping it in a macro! I prefer to ask whether the program mimics the mathematics underlying the analysis. Can the statements be understood by someone who is familiar with the underlying mathematics but might not be an expert programmer? Does the language provide analysts with the tools they need to perform and combine common operations in applied math and statistics? I find that the SAS/IML language succeeds in all these areas.

Logistic regression with events-trials syntax

As mentioned previously, implementing the regression estimates "by hand" not only helps you to understand the problem better, but also to modify the basic program. For example, the syntax of PROC LOGISTIC enables you to analyze binomial data by using the events-trials syntax. When you have binomial data, each observation contains a variable (EVENTS) that indicates the number of successes and the total number of trials (TRIALS) for a specified value of the explanatory variables. The following DATA step defines data that are explained in the Getting Started documentation for PROC LOGISTIC. The data are for a designed experiment. For each combination of levels for the independent variables (HEAT and SOAK), each observation contains the number of items that were ready for further processing (R) out of the total number (N) of items that were tested. Thus, each observation represents N items for which R items are events (successes) and N-R are nonevents (failures).

The following DATA step defines the binomial data and calls PROC LOGISTIC to fit a logistic model:

/* Getting Started example for PROC LOGISTIC */
data ingots;
   input Heat Soak r n @@;
   datalines;
7 1.0 0 10  14 1.0 0 31  27 1.0 1 56  51 1.0 3 13
7 1.7 0 17  14 1.7 0 43  27 1.7 4 44  51 1.7 0  1
7 2.2 0  7  14 2.2 2 33  27 2.2 0 21  51 2.2 0  1
7 2.8 0 12  14 2.8 0 31  27 2.8 1 22  51 4.0 0  1
7 4.0 0  9  14 4.0 0 19  27 4.0 1 16
;
 
proc logistic data=ingots;
   model r/n = Heat Soak;
   ods select ParameterEstimates;
run;

Can you modify the previous SAS/IML program to accommodate binomial data? Yes! The key is to "expand the data." Each observation represents N items for which R items are events (successes) and N-R are nonevents (failures). You could therefore reuse the previous program if you introduce a "frequency variable" (FREQ) that indicates the number of items that are represented by each observation.

The following program is a slight modification of the original program. The FREQ variable is derived from the EVENTS and TRIALS variables. The data is duplicated so that each observation in the binomial data becomes a pair of observations, one with frequency EVENTS and the other with frequency TRIALS-EVENTS. In the loglikelihood function, the frequency variable is incorporated into the summation.

proc iml;
/* 1. read data and form design matrix */
varNames = {'Heat' 'Soak'};
use ingots;
read all var "r" into Events;
read all var "n" into Trials;
read all var varNames into X;
close;
n = nrow(X);
X = j(n,1,1) || X;             /* design matrix: add Intercept column */
 
/* for event-trial syntax, split each data row into TWO observations,
   one for the events and one for the nonevents. Create Freq = frequency variable. */ 
Freq = Events // (Trials-Events);
X = X // X;                    /* duplicate regressors */
y = j(n,1,1) // j(n,1,0);      /* binary response: 1s and 0s */
 
/* 2. define loglikelihood function for events-trials logistic model */
start BinLogisticLLEvents(b) global(X, y, Freq);
   z = X*b`;                   /* X*b, where b is a column vector */
   p = Logistic(z);            /* 1 / (1+exp(-z)) */
   LL = sum( Freq#(y#log(p) + (1-y)#log(1-p)) );   /* weight each term by its frequency */
   return( LL );
finish;
 
/* 3. Make initial guess and find parameters that maximize the loglikelihood */
b0 = j(1, ncol(X), 0);         /* initial guess (row vector) */
opt = 1;                       /* find maximum of function */
call nlpnra(rc, b, "BinLogisticLLEvents", b0, opt);
print b[c=("Intercept"||varNames) L="Parameter Estimates" F=D8.];
QUIT;

For these data and model, the parameter estimates are the same to four decimal places. Again, there are small relative differences in the estimates. But the point is that you can solve for the MLE parameter estimates for binomial data from first principles by using about 20 lines of SAS/IML code.

Summary

This article shows how to obtain parameter estimates for a binary logistic regression model by writing a short SAS/IML program. With slightly more effort, you can obtain parameter estimates for a binomial model in events-trials syntax. These examples demonstrate the power, flexibility, and compactness of the SAS/IML matrix programming language.

Of course, this power and flexibility come at a cost. First, when you use SAS/IML to solve a problem, you must understand the underlying mathematics of the problem. Second, the learning curve for SAS/IML is steeper than for other SAS procedures. SAS procedures such as PROC LOGISTIC are designed so that you can focus on building a good predictive model without worrying about the details of numerical methods. Accordingly, SAS procedures can be used by analysts who want to focus solely on modeling. In contrast, the IML procedure is often used by sophisticated programmers who want to extend the capabilities of SAS by implementing custom algorithms.

The post Implement binary logistic regression from first principles appeared first on The DO Loop.

October 3, 2022

A previous article discusses the definitions of three kinds of moments for a continuous probability distribution: raw moments, central moments, and standardized moments. These are defined in terms of integrals over the support of the distribution. Moments are connected to the familiar shape features of a distribution: the mean, variance, skewness, and kurtosis. This article shows how to evaluate integrals in SAS to compute the moments of a continuous distribution and, from the moments, the mean, variance, skewness, and kurtosis of the distribution.

Notice that this article is not about how to compute the sample estimates. You can easily use PROC MEANS or PROC UNIVARIATE to compute the sample mean, sample variance, and so forth, from data. Rather, this article is about how to compute the quantities for the probability distribution.

Definitions of moments of a continuous distribution

Evaluating a moment requires performing an integral. The domain of integration is the support of the distribution. This is often infinite (for example, the normal distribution on (-∞, ∞)) or semi-bounded (for example, the lognormal distribution on [0,∞)), but sometimes it is finite (for example, the beta distribution on [0,1]). The QUAD subroutine in SAS/IML software can compute integrals on finite, semi-infinite, and infinite domains. You need to specify a SAS/IML module that evaluates the integrand at an arbitrary point in the domain. To compute a raw moment, the integrand is the product of a power of x and the probability density function (PDF). To compute a central moment, the integrand is the product of the PDF and a power of (x - μ), where μ is the mean of the distribution. Let f(x) be the PDF. The moments relate to the mean, variance, skewness, and kurtosis as follows:

  • The mean is the first raw moment: \(\mu = \int_{-\infty}^{\infty} x f(x)\, dx\)
  • The variance is the second central moment: \(\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx\)
  • The skewness is the third standardized moment: \(\mu_3 / \sigma^3\) where \(\mu_3 = \int_{-\infty}^{\infty} (x - \mu)^3 f(x)\, dx\) is the third central moment.
  • The full kurtosis is the fourth standardized moment: \(\mu_4 / \sigma^4\) where \(\mu_4 = \int_{-\infty}^{\infty} (x - \mu)^4 f(x)\, dx\) is the fourth central moment. For consistency with the sample estimates, you can compute the excess kurtosis by subtracting 3 from this quantity.

Compute moments in SAS

To demonstrate how to compute the moments of a continuous distribution, let's use the gamma distribution with shape parameter α=4. The PDF of the Gamma(4) distribution is shown to the right. The domain of the PDF is [0, ∞). The mean, variance, skewness, and excess kurtosis for the gamma distribution can be computed analytically. The formulas are given in the Wikipedia article about the gamma distribution. The following SAS/IML program evaluates the formulas for α=4.

/* compute moments (raw, central, and standardized) for a probability distribution */
proc iml;
/* Formulas for the mean, variance, skewness, and excess kurtosis, 
   of the gamma(alpha=4) distribution from https://en.wikipedia.org/wiki/Gamma_distribution
*/
alpha = 4;                 /* shape parameter */
mean  = alpha;             /* mean */
var   = alpha;             /* variance */
skew  = 2 / sqrt(alpha);   /* skewness */
exKurt= 6 / alpha;         /* excess kurtosis */
colNames = {'Mean' 'Var' 'Skew' 'Ex Kurt'};
print (mean||var||skew||exKurt)[c=colNames L="Formulas for Gamma"];

Let's evaluate these same quantities by performing numerical integration of the integrals that define the moments of the distribution:

/* return the PDF of a specified probability distribution */
start f(x) global(alpha);
   return  pdf("Gamma", x, alpha);  /* return the PDF evaluated at x */
finish;
 
/* the first raw moment is the mean: \int x*f(x) dx */
start Raw1(x);
   return( x # f(x) );              /* integrand for first raw moment */
finish;
 
/* the second central moment is the variance: \int (x-mu)^2*f(x) dx */
start Central2(x) global(mu);
   return( (x-mu)##2 # f(x) );      /* integrand for second central moment */
finish;
 
/* the third central moment is the UNSTANDARDIZED skewness: \int (x-mu)^3*f(x) dx */
start Central3(x) global(mu);
   return( (x-mu)##3 # f(x) );      /* integrand for third central moment */
finish;
 
/* the fourth central moment is the UNSTANDARDIZED full kurtosis: \int (x-mu)^4*f(x) dx */
start Central4(x) global(mu);
   return( (x-mu)##4 # f(x) );      /* integrand for fourth central moment */
finish;
 
Interval = {0, .I};                 /* support of gamma distribution is [0, infinity) */
call quad(mu, "Raw1", Interval);    /* define mu, which is used for central moments */
call quad(var, "Central2", Interval);
call quad(cm3, "Central3", Interval);
call quad(cm4, "Central4", Interval);
 
/* do the numerical calculations match the formulas? */
skew = cm3 / sqrt(var)##3;          /* standardize 3rd central moment */
exKurt = cm4 / sqrt(var)##4 - 3;    /* standardize 4th central moment; subtract 3 to get excess kurtosis */
print (mu||var||skew||exKurt)[c=colNames L="Numerical Computations for Gamma"];

For this distribution, the numerical integration produces numbers that exactly match the formulas. For other distributions, the numerical integration might differ from the exact values by a tiny amount.
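
For example, here is a sketch (not part of the original program) that reuses the same integrand modules for the exponential distribution with unit scale. The exact values are mean=1, variance=1, skewness=2, and excess kurtosis=6, and the interval of integration is the same [0, ∞):

/* redefine the PDF module; Raw1, Central2, Central3, and Central4 are unchanged */
start f(x);
   return( pdf("Exponential", x) );        /* exponential(1) density */
finish;
 
call quad(mu,  "Raw1",     Interval);      /* mean */
call quad(var, "Central2", Interval);      /* variance */
call quad(cm3, "Central3", Interval);
call quad(cm4, "Central4", Interval);
skew   = cm3 / sqrt(var)##3;
exKurt = cm4 / sqrt(var)##4 - 3;
print (mu||var||skew||exKurt)[c=colNames L="Numerical Computations for Exponential(1)"];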

Checking the moments against the sample estimates

Not all distributions have analytical formulas for the mean, variance, and so forth. When I compute a new quantity, I like to verify the computation by comparing it to another computation. In this case, you can compare the distributional values to the corresponding sample estimates for a large random sample. In other words, if you generate a large random sample from the Gamma(4) distribution and compute the sample mean, sample variance, and so forth, you should obtain sample estimates that are close to the corresponding values for the distribution. You can perform this simulation in the DATA step or in the SAS/IML language. The following statements use the DATA step and PROC MEANS to generate a random sample of size N=10,000 from the gamma distribution:

data Gamma(keep = x);
alpha = 4;
call streaminit(1234);
do i = 1 to 10000;
   x = rand("Gamma", alpha);
   output;
end;
run;
 
proc means data=Gamma mean var skew kurt NDEC=3;
run;

As expected, the sample estimates are close to the values of the distribution. The sample skewness and kurtosis are biased statistics and have relatively large standard errors, so the estimates for those parameters might not be very close to the distributional values, even for a large sample.

Summary

This article shows how to use the QUAD subroutine in SAS/IML software to compute raw and central moments of a probability distribution. This article uses the gamma distribution with shape parameter α=4, but the computation generalizes to other distributions. The important step is to write a user-defined function that evaluates the PDF of the distribution of interest. For many common distributions, you can use the PDF function in Base SAS to evaluate the density function. You also must specify the support of the distribution, which is often an infinite or semi-infinite domain of integration. When the moments exist, you can compute the mean, variance, skewness, and kurtosis of any distribution.

The post Compute moments of probability distributions in SAS appeared first on The DO Loop.

September 21, 2022

The noncentral t distribution is a probability distribution that is used in power analysis and hypothesis testing. The distribution generalizes the Student t distribution by adding a noncentrality parameter, δ. When δ=0, the noncentral t distribution is the usual (central) t distribution, which is a symmetric distribution. When δ > 0, the noncentral t distribution is positively skewed; for δ < 0, the noncentral t distribution is negatively skewed. Thus, you can think of the noncentral t distribution as a skewed cousin of the t distribution.

SAS software supports the noncentrality parameter in the PDF, CDF, and QUANTILE functions. This article shows how to use these functions for the noncentral t distribution. The RAND function in SAS does not directly support the noncentrality parameter, but you can use the definition of a random noncentral t variable to generate random variates.

Visualize the noncentral t distribution

The classic Student t distribution contains a degree-of-freedom parameter, ν. (That's the Greek letter "nu," which looks like a lowercase "v" in some fonts.) For small values of ν, the t distribution has heavy tails. For larger values of ν, the t distribution resembles the normal distribution. The noncentral t distribution uses the same degree-of-freedom parameter.

The noncentral t distribution also supports a noncentrality parameter, δ. The simplest way to visualize the effect of δ is to look at its probability density function (PDF) for several values of δ. The support of the PDF is all real numbers, but most of the probability is close to x = δ. You can use the PDF function in SAS to compute the PDF for various values of the noncentrality parameter. The fourth parameter for the PDF("t",...) call is the noncentrality value. It is optional and defaults to 0 if not specified.

The following visualization shows the density functions for positive values of δ and positive values of x. In the computer programs, I use DF for the ν parameter and NC for the δ parameter.

/* use the PDF function to visualize the noncentral t distribution */
%let DF = 6;
data ncTPDFSeq;
df = &DF;                        /* degree-of-freedom parameter, nu */
do nc = 4, 6, 8, 12;             /* noncentrality parameter, delta */
   do x = 0 to 20 by 0.1;        /* most of the density is near x=delta */
      PDF = pdf("t", x, df, nc);
      output;
   end;
end;
label PDF="Density";
run;
 
title "PDF of Noncentral t Distributions";
title2 "DF=&DF";
proc sgplot data=ncTPDFSeq;
   series x=x y=PDF / group=nc lineattrs=(thickness=2);
   keylegend / location=inside across=1 title="Noncentral Param" opaque;
   xaxis grid; yaxis grid;
run;

The graph shows the density functions for δ = 4, 6, 8, and 12 for a distribution that has ν=6 degrees of freedom. You can see that the modes of the distributions are close to (but a little less than) δ when δ > 0. For negative values of δ, the functions are reflected across x=0. That is, if f(x; ν, δ) is the pdf of the noncentral t distribution with parameter δ, then f(-x; ν, -δ) = f(x; ν, δ).
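
As a quick numerical sanity check of that reflection property (a sketch, not from the original post), you can evaluate both sides of the identity for a few values of x:

data CheckReflect;
nu = 6;  delta = 4;
do x = -3 to 3 by 1.5;
   lhs  = pdf("t", -x, nu, -delta);   /* f(-x; nu, -delta) */
   rhs  = pdf("t",  x, nu,  delta);   /* f( x; nu,  delta) */
   diff = lhs - rhs;                  /* should be 0, up to rounding error */
   output;
end;
run;
 
proc print data=CheckReflect noobs; run;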

The CDF and quantile function of the noncentral t distribution

If you change the PDF call to a CDF call, you obtain a visualization of the cumulative distribution function for various values of the noncentrality parameter, δ.
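
For example, a sketch of the analogous DATA step (it reuses the DF macro variable defined earlier) is:

data ncTCDFSeq;
df = &DF;                        /* degree-of-freedom parameter, nu */
do nc = 4, 6, 8, 12;             /* noncentrality parameter, delta */
   do x = 0 to 20 by 0.1;
      CDF = cdf("t", x, df, nc); /* cumulative probability */
      output;
   end;
end;
run;
 
title "CDF of Noncentral t Distributions";
title2 "DF=&DF";
proc sgplot data=ncTCDFSeq;
   series x=x y=CDF / group=nc lineattrs=(thickness=2);
   xaxis grid; yaxis grid;
run;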

The quantile function is important in hypothesis testing. The following DATA step finds the quantile that corresponds to an upper-tail probability of 0.05. This would be a critical value in a one-sided hypothesis test where the test statistic is distributed according to a noncentral t distribution.

%let NC = 4;
data CritVal;
do alpha = 0.1, 0.05, 0.01;
   tCritUpper = quantile("T", 1-alpha, &DF, &NC);
   output;
end;
run;
 
proc print data=CritVal nobs; run;

The table shows the critical values of a noncentral t statistic for a one-sided hypothesis test at the α significance level for α=0.1, 0.05, and 0.01. A test statistic that is larger than the critical value would lead you to reject the null hypothesis at the given significance level.

Random variates from the noncentral t distribution

Although the RAND function in SAS does not support a noncentrality parameter for the t distribution, it is simple to generate random variates. By definition, a noncentral t random variable, T(ν, δ), is the ratio of a normal variate that has mean δ and unit variance and a scaled chi-distributed variable. That is, if Z ~ N(δ, 1) is a normal random variable and V ~ χ2(ν) is a chi-squared random variable with ν degrees of freedom, then the ratio T(ν, δ) = Z / sqrt(V / ν) is a random variable from a noncentral t distribution with ν degrees of freedom and noncentrality parameter δ.

/* Rand("t",df) does not support a noncentrality parameter. Use the definition instead. */
data ncT;
df = &DF;
nc = &NC;
call streaminit(12345);
do i = 1 to 10000;
   z = rand("Normal", nc);   /* Z ~ N(nc, 1)    */
   v = rand("chisq", df);    /* V ~ ChiSq(df)   */
   t = z / sqrt(v/df);       /* T ~ NCT(df, nc) */
   output;
end;
keep t;
run;
 
title "Random Sample from Noncentral t distribution";
title2 "DF=&DF; nc=&NC";
proc sgplot data=ncT noautolegend;
   histogram t;
   density t / type=kernel;
   xaxis max=20;
run;

The graph shows a histogram for 10,000 random variates overlaid with a kernel density estimate. The density is very similar to the earlier graph that showed the PDF for the noncentral t distribution with ν=6 degrees of freedom and δ=4.

Summary

The noncentral t distribution is a probability distribution that is used in power analysis and hypothesis testing. You can think of the noncentral t distribution as a skewed t distribution. SAS software supports the noncentral t distribution by using an optional argument in the PDF, CDF, and QUANTILE functions. You can generate random variates by using the definition of a noncentral t random variable, which is the ratio of a normal variate and a scaled chi-distributed variable.

The post The noncentral t distribution in SAS appeared first on The DO Loop.

September 12, 2022

An integer can be represented in many ways. This article shows how to represent a positive integer in any base b. The most common base is b=10, but other popular bases are b=2 (binary numbers), b=8 (octal), and b=16 (hexadecimal). Each base represents integers in different ways. Think of a "representation" as a name that is assigned to each integer. This article shows a mathematical algorithm for representing an integer in base b ≥ 2. It implements the algorithm by using the SAS DATA step.

This process is commonly called "converting" an integer from base 10 to base b, but that is a slight misnomer. An integer has an intrinsic value regardless of the base that you use to represent it. We aren't converting from base 10, although we typically use base 10 to input the number into a computer. I prefer to use the term "represent the integer" when discussing the process of writing it in a specified base.

Represent an integer in any base

The most common way to represent an integer is to use base 10, which uniquely represents each positive integer as a sum
\(x = \sum\nolimits_{i=0}^k {c_i} 10^i\)
where the \(c_i\) are integers \(0 \leq c_i < 10\). Notice that this sum of powers looks like a polynomial, so the \(c_i\) are often called coefficients. For example, the number 675 (base 10) can be represented as 6*10^2 + 7*10^1 + 5*10^0. So, the ordered tuple (6,7,5) represents the integer in base 10. We usually concatenate the tuple values and simply write 675 and optionally add "base 10" if the base is unclear.

We call this representation "base 10" because it represents a number as a sum of powers of 10. The phrase "change the base" means that we represent the same number as a sum of powers of some other positive integer. For example, the number 675 (base 10) can be represented as 1243 (base 8) because 675 = 1*8^3 + 2*8^2 + 4*8^1 + 3*8^0. Consequently, the tuple (1,2,4,3) represents the integer in base 8. Equivalently, the coefficients in base 8 are (1,2,4,3).

An algorithm to represent an integer in any base

There is a simple iterative algorithm to represent a positive integer, x, in any base b ≥ 2. The steps in the algorithm are indexed by an integer i=0, 1, 2, ..., k-1, where k is the smallest integer such that x < b^k.

The algorithm works by rewriting the "sum of powers" by using Horner's method. For example, the sum 1*8^3 + 2*8^2 + 4*8^1 + 3*8^0 can be rewritten as
((1*8 + 2)*8 + 4)*8 + 3.
In this representation, each nested term is some number times 8 (the base) plus a remainder. For a general base, you can write the Horner sum as
(( ... (c[k-1]*b + c[k-2])*b + ... + c[2])*b + c[1])*b + c[0]

The mathematical algorithm exploits Horner's representation, as follows:

  1. Divide the integer, x, by b. In a computer language, the integer-part of the division is FLOOR(x/b) whereas the remainder is MOD(x,b). Write the remainder as the rightmost value in the base b representation of x.
  2. Repeat this process as long as the integer-part of the division is positive. At each step, write the remainder to the left of the previous remainders. When the integer part becomes 0, the process terminates.

For example, let's represent the number 675 (base 10) in base 8. Use x=675 and b=8 as the values in the formulas. The algorithm is as follows:

  1. For i=0: Divide the number 675 by the base, which is 8. You get 84 with 3 as the remainder. So c[0]=3. The number 84 is used in the next step.
  2. For i=1: Divide the number 84 by 8. You get 10 with 4 as the remainder. So c[1]=4.
  3. For i=2: Divide the number 10 by 8. You get 1 with 2 as the remainder. So c[2]=2.
  4. For i=3: Divide the number 1 by 8. You get 0 with 1 as the remainder. So c[3]=1.
  5. Because the integer-part of the previous division is 0, the algorithm terminates.
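
As a check, reassembling the coefficients reproduces the original value: 1*8^3 + 2*8^2 + 4*8^1 + 3*8^0 = 512 + 128 + 32 + 3 = 675.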

This process is summarized in the following table for base 8 and the base-10 integer 675:

From the coefficients, you can build a string (such as '1243') that represents the number in the given base. For b ≤ 10, the symbols 0, 1, ..., b-1 are used to represent the coefficients. For 10 < b ≤ 36, the letters of the English alphabet represent the higher coefficients: A=10, B=11, C=12, ..., Z=35. The SAS program in the next section uses these symbols to represent the coefficients.

A SAS program for representing an integer in an arbitrary base

Let's put a set of base-10 numbers into a data set:

/* Example data: Base 10 integers to convert to other bases */
data Base10;
input x @@;      /* x >= 0 */
datalines;
2 3 4 5 8 15 31 32 33 49 63 675
;

Because we want to store the representation in a character variable, we set the length of the variable by using the parameter MAXCOEF=32. A string of length 32 is usually sufficient for representing a wide range of integers. The VALUELIST macro defines the set of characters to use to represent each integer as a string for bases b=2, 3, ..., 36. If you want to use a larger base, extend this set of values.

%let maxCoef = 32;    /* how many characters in a string that represents the number? */
%let valueList = ('0' '1' '2' '3' '4' '5' '6' '7' '8' '9' 
                  'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 
                  'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z');

The following SAS DATA step implements the previous algorithm to represent the numbers in a new base. Let's begin by representing the numbers in binary (base 2).

%let base = 2;        /* positive integer. Usually 2 <= base <= 36 */
 
data NewBase;
array values[36] $ _temporary_ &valueList;     /* characters to use to encode values */
array c[0:%eval(&maxCoef-1)]  _temporary_ ;    /* integer coefficients c[0], c[1], ... */
length s $&maxCoef;                            /* string for base b */
b = &base;                                     /* base for representation */
 
set Base10;                                    /* x is a positive integer; represent in base b */
/* compute the coefficients that represent x in Base b */
y = x;
do k = 0 to &maxCoef while(y>0);
   c[k] = mod(y, b);                           /* remainder when y is divided by b */
   y = floor(y / b);                           /* whole part of division */
   substr(s,&maxCoef-k,1) = values[c[k]+1];    /* represent coefficients as string */
end;
keep s x k b;
run;
 
proc print data=NewBase noobs label;
   label x="Base 10" s="Base &base";
   var x s k;
run;

For each positive integer, x, the variable k specifies how many characters are required to represent the number in base 2. (This assumes that you do not want to use leading zeros in the representation.) For example, 31 (base 10) = 11111 (base 2) requires five characters, whereas 32 (base 10) = 100000 (base 2) requires six characters. The program set MAXCOEF = 32, which means that this program can compute the binary representation of any number up to 2^32 – 1 = 4,294,967,295 (base 10).

Convert base 10 to octal or hexadecimal

To represent a positive integer in a different base, simply redefine the BASE macro variable and rerun the program. For example, the following statements enable you to convert the numbers to base 8:

%let base = 8;        /* positive integer 2 <= base <= 36 */

The last row of the table shows that 675 (base 10) = 1243 (base 8), as was shown previously.

For hexadecimal (base 16), you can use the letters 'A' through 'F' to represent the larger values of the coefficients:

%let base = 16;        /* positive integer 2 <= base <= 36 */

The hexadecimal numbers might be familiar to statistical programmers who use hexadecimal values for RGB colors in statistical graphics. In many programming languages, you can specify a color as a hexadecimal value such as 0x1F313F. The first two digits ('1F') specify the amount of red in the color, the next two digits ('31') specify the amount of green, and the last two digits specify the amount of blue. As an integer, this number is 2,044,223 (base 10). You might have seen an advertisement for a computer monitor that claims that the monitor "displays more than 16 million colors." That number is used because FFFFFF (base 16) = 16,777,215 (base 10). In other words, if each of the red, green, and blue components can take 256 values, the total number of colors is more than 16 million.
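
If you only want to check a single conversion like this one, a quick alternative (a sketch that uses the Base SAS HEXw. informat rather than the algorithm in this article) is:

data _null_;
   rgb = input("1F313F", hex6.);   /* read the hexadecimal string as an integer */
   put rgb= comma12.;              /* writes rgb=2,044,223 to the log */
run;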

Summary

This article shows an algorithm that enables you to represent a positive integer in any base. This is commonly called "converting from base 10," although that is a misnomer. At each step of the algorithm, you use division by the base to find the integer part and remainder of an integer. This algorithm is demonstrated by using the SAS DATA step. As written, the program supports bases 2–36.

The post Convert integers from base 10 to another base appeared first on The DO Loop.

August 31, 2022

The SELECT-WHEN statement in the SAS DATA step is an alternative to using a long sequence of IF-THEN/ELSE statements. Although logically equivalent to IF-THEN/ELSE statements, the SELECT-WHEN statement can be easier to read. This article discusses the two distinct ways to specify the SELECT-WHEN statement. You can use the first syntax when you are making logical decisions based on the possible values of a discrete variable. The second syntax enables you to use more complex logical expressions. In brief:

  • Use the first syntax when the possible values of a single discrete variable are known in advance.
  • Use the second syntax when the logic involves multiple variables or complex logic.

The first syntax for the SELECT-WHEN statement

A previous article shows the first syntax for the SELECT-WHEN statement and includes several examples. However, I will provide only a simple example here. The first syntax requires that you specify an expression on the SELECT statement. On the WHEN statement, you specify one or more values that the expression can take. Following the WHEN statement, you specify a statement that you want to perform when the WHEN expression is true. If you want to perform more than one statement, you can use a DO-END block to perform multiple operations.

It sounds more difficult than it is, so let's see an example. The following SAS data set contains dates in 2020. Some of them are well-known religious holidays, others are US holidays, and others are "personal holidays" that I like to celebrate with my friends, such as Leap Day, Pi Day, Cinco de Mayo, and Halloween. The data set contains the date of the holiday and a name. You can use a SELECT-WHEN statement to classify each holiday as Religious, USA, or Fun (a personal holiday), as follows:

data Dates;
length Holiday $11;
input Holiday 1-12 Date date9.;
format Date date9.;
datalines;
MLK         20JAN2020 
Leap        29FEB2020 
Pi          14MAR2020 
StPatrick   17MAR2020 
Easter      12APR2020 
CincoDeMayo 05MAY2020 
Memorial    25MAY2020 
Labor       07SEP2020 
Halloween   31OCT2020 
Christmas   25DEC2020 
;
 
data Holidays;
length Type $12;
set Dates;
select(upcase(Holiday));                             /* specify expression on the SELECT statement */
  when('MLK','MEMORIAL','LABOR')   Type="USA";       /* handle possible values of expression */
  when('EASTER','CHRISTMAS')       Type="Religious";
  otherwise                        Type="Fun";       /* handle any value not previously handled */
end;
run;
 
proc print data=Holidays noobs;
   var Date Holiday Type;
run;

Notice that the SELECT statement branches on the expression UPCASE(Holiday). The new data set has a Type variable that classifies each holiday according to the value of UPCASE(Holiday).

In the SELECT-WHEN syntax, the OTHERWISE statement is optional but recommended. If you do not have an OTHERWISE statement and all the WHEN statements are false, the DATA step stops with an error. You can use a null statement after the OTHERWISE keyword to specify that the program should do nothing for unhandled values, as follows:
OTHERWISE ; /* semicolon ==> null statement ==> do nothing */

By the way, if you have only the dates, you can still form the Type variable by using the HOLIDAYNAME function in SAS. This is left as an exercise.

As stated previously, the SELECT-WHEN statement provides the same functionality as an IF-THEN/ELSE statement, but it is a simpler way to handle multiple values. For example, the equivalent IF-THEN/ELSE logic is as follows:

if upcase(Holiday)='MLK' |
   upcase(Holiday)='MEMORIAL' |
   upcase(Holiday)='LABOR' then Type="USA";
else if ...

The second syntax for the SELECT-WHEN statement

The first SELECT-WHEN syntax enables you to perform an action for any value of an expression. Usually, you will branch on values of a discrete variable.

You can use the second SELECT-WHEN syntax when you need more complex logical processing. In the second syntax, you do not specify an expression on the SELECT statement. Instead, you specify a logical expression on each WHEN statement. The WHEN statements are executed in order until one is true. The corresponding action is then executed. If no WHEN statements are true, the OTHERWISE statement is processed.

The previous example uses the name of a holiday to assign values to the Type variable. Many holidays do not have fixed dates but rather dates that vary from year to year. For example, US Thanksgiving is the fourth Thursday of November, and MLK Day is the third Monday of January. Easter is even more complicated. Luckily, SAS provides the HOLIDAY function, which enables you to obtain the dates of common holidays for each year. For each date in the data set, you can use the HOLIDAY function to determine whether the date corresponds to a religious or USA holiday. The following SELECT-WHEN statement shows how to implement the second syntax:

data Want;
length Type $12;
set Dates;    
Year = year(Date);                            /* get the year for the date */
select;                                       /* no expression on SELECT statement */
  when(holiday('MLK',           Year)=Date |  /* logic to determine type of holiday */
       holiday('Memorial',      Year)=Date |
       holiday('USIndependence',Year)=Date |
       holiday('Labor',         Year)=Date |
       mdy(9, 11, Year)              =Date |  /* Patriot Day 9/11 */
       holiday('Veterans',      Year)=Date |
       holiday('Thanksgiving',  Year)=Date) 
          Type="USA";
  when(holiday('Easter',     Year)=Date |
       holiday('Christmas',  Year)=Date )
          Type="Religious";
  otherwise                  
          Type="Fun";
end;
run;
 
proc print data=Want noobs;
   var Date Holiday Type;
run;

This syntax for the SELECT-WHEN statement looks even more like an IF-THEN/ELSE statement.

Consider PROC FORMAT for recoding variables

On discussion forums, I see a lot of questions like this example in which programmers are creating a new variable from the values of an existing variable. In simple situations, you can use PROC FORMAT to perform this task. For more complicated situations, you can even define your own function to use as a format.
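
For the holiday data in the first example, a PROC FORMAT approach might look like the following sketch (the format name and the output data set name are arbitrary):

proc format;
value $HolidayType
   'MLK', 'MEMORIAL', 'LABOR' = 'USA'
   'EASTER', 'CHRISTMAS'      = 'Religious'
   other                      = 'Fun';
run;
 
data HolidayTypes;
   set Dates;
   Type = put(upcase(Holiday), $HolidayType.);   /* recode by applying the format */
run;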

Summary

This article shows two ways to specify the SELECT-WHEN statement in SAS. The SELECT-WHEN statement is an alternative to using multiple IF-THEN/ELSE statements. For simple logical processing, you can specify an expression on the SELECT statement and specify values for the expression on the WHEN statements. For more complicated logic, you can put all the logic into the WHEN statements.

The post Two types of syntax for the SELECT-WHEN statement in SAS appeared first on The DO Loop.

July 20, 2022

It isn't easy to draw the graph of a function when you don't know what the graph looks like. To draw the graph by using a computer, you need to know the domain of the function for the graph: the minimum value (xMin) and the maximum value (xMax) for plotting the function. When you know a lot about the function, you can make well-informed choices for visualizing the graph. However, without knowledge, you might need several attempts before you discover the domain that best demonstrates the shape of the function. This trial-and-error approach is time-consuming and frustrating.

If the function is the density function for a probability distribution, there is a neat trick you can use to visualize the function. You can use quantiles of the distribution to determine the interval [xMin, xMax]. This article shows that you can set xMin to an extreme-left quantile such as QUANTILE(0.01) and set xMax to an extreme-right quantile such as QUANTILE(0.99). The values 0.01 and 0.99 are arbitrary, but in practice, these values provide good initial choices for many distributions. From looking at the resulting graph, you can often modify those numbers to improve the visualization.

Choose a domain for the lognormal distribution

Here's an example. Suppose you want to graph the PDF or CDF of the standard lognormal distribution. What is the domain of the lognormal distribution? You could look it up on Wikipedia, but mathematically the 0.01 and 0.99 quantiles of the distribution provide an interval that contains 98% of the probability. Thus, it makes sense to initially visualize the function on this domain. The following SAS DATA step evaluates the PDF at 201 evenly spaced points on the interval [xMin, xMax]:

/* Choose xMin and xMax by using quantiles, then use equally spaced intervals in x */
data PDF;
xMin = quantile("LogNormal", 0.01);   /* p = 0.01 */
xMax = quantile("LogNormal", 0.99);   /* p = 0.99 */
dx = (xMax - xMin) / 200;
do x = xMin to xMax by dx;
   pdf = pdf("Lognormal", x); 
   output;
end;
drop xMin xMax dx;
run;
 
title "Visualize PDF by using Quantiles to Determine Domain";
proc sgplot data=PDF;
  series x=x y=PDF;
run;

This graph is pretty good, but not perfect. For the lognormal distribution, you might want xMin to move to the left so that the graph shows more of the shape of the left tail of the distribution. If you know that x=0 is the minimum value for the lognormal distribution, you could set xMin=0. However, you could also let the distribution itself decide the lower bound by decreasing the probability value used for the left quantile. You can change one number to re-visualize the graph. Just change the probability value from 0.01 to a smaller number such as 0.001 or 0.0001. For example, if you use 0.0001, the first statement in the DATA step looks like this:

xMin = quantile("LogNormal", 0.0001);  /* p = 0.0001 */

With that small change, the graph now effectively shows the shape of the PDF for the lognormal distribution:

Visualize discrete densities

You can use the same trick to visualize densities or cumulative probabilities for a discrete probability distribution. For example, suppose you want to visualize the density for the Poisson(20) distribution. If you are familiar with the Poisson distribution, you might know that you need an interval that is roughly symmetric about x=20 (the mean value), but how wide should the interval be? Instead of doing computations in your head, just use the quantile trick, as follows:

/* A discrete example: PDF = Poisson(20) */
data PMF;
xMin = quantile("Poisson", 0.01, 20);   /* p = 0.01 */
xMax = quantile("Poisson", 0.99, 20);   /* p = 0.99 */
do x = xMin to xMax;                    /* x is integer-valued */
   pmf = pdf("Poisson", x, 20);         /* discrete probability mass function (PMF) */
   output;
end;
drop xMin xMax;
run;
 
title "Visualize Discrete PDF by using Quantiles to Determine the Domain";
proc sgplot data=PMF;
  vbar x / response=pmf;
run;

Notice that the DATA step uses integer steps between xMin and xMax because the Poisson distribution is defined only for integer values of x. Notice also that I used a bar chart to visualize the density. Alternatively, you could use a needle chart. Regardless, the program automatically chose the interval [10, 31] as a representative domain. If you want to visualize more of the tails, you can use 0.001 and 0.999 as the input values to the QUANTILE call.
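
For example, you would change only the two statements that compute the endpoints:

xMin = quantile("Poisson", 0.001, 20);   /* p = 0.001 */
xMax = quantile("Poisson", 0.999, 20);   /* p = 0.999 */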

Summary

This article shows a convenient trick for visualizing the PDF or CDF of a distribution function. You can use the quantile function to find quantiles in the left tail (xMin) and in the right tail (xMax) of the distribution. You can visualize the function by graphing it on the interval [xMin, xMax]. I suggest you use the 0.01 and 0.99 quantiles as an initial guess for the domain. You can modify those values if you want to visualize more (or less) of the tails of the distribution.

The post A tip for visualizing the density function of a distribution appeared first on The DO Loop.

July 18, 2022
 

A colleague was struggling to compute a right-tail probability for a distribution. Recall that the cumulative distribution function (CDF) is defined as a left-tail probability. For a continuous random variable, X, with density function f, the CDF at the value x is
    F(x) = Pr(X ≤ x) = ∫ f(t) dt
where the integral is evaluated on the interval (-∞, x). That is, it is the area under the density function "to the left" of x. In contrast, the right-tail probability is the complement of this quantity. The right-tail probability is defined by
    S(x) = Pr(X ≥ x) = 1 – F(x)
It is the area under the density function "to the right" of x.
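
As a minimal numerical illustration, the following DATA step computes both tail probabilities for the standard normal distribution at x=1.96:

data _null_;
x = 1.96;
F = cdf("Normal", x);    /* left-tail probability Pr(X <= x), about 0.975 */
S = 1 - F;               /* right-tail probability Pr(X >= x), about 0.025 */
put F= S=;
run;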

The right-tail probability is also called the right-side or upper-tail probability. The function, S, that gives the right-tail probability is called the survival distribution function (SDF). Similarly, a quantile that is associated with a probability that is close to 0 or 1 might be called an extreme quantile. This article discusses why a naive computation of an extreme quantile can be inaccurate. It shows how to compute extreme probabilities and extreme quantiles by using special functions in SAS: the SQUANTILE function computes quantiles in the right tail, and the SDF function computes right-tail probabilities.

Accurate computations of extreme probabilities

My colleague contacted me to discuss some issues related to computing an extreme quantile for p ≈ 1. To demonstrate, I will use the standard lognormal distribution, whose PDF is shown to the right. (However, this discussion applies equally well to other distributions.) When p is small, the lognormal quantile is close to 0. When p is close to 1, the lognormal quantile is arbitrarily large because the lognormal distribution is unbounded.

The following SAS DATA step uses the QUANTILE function to compute extreme quantiles in the left and right tails of the lognormal distribution. The quantiles are associated with the probabilities p=1E-k and p=1 – 1E-k for k=8, 9, 10, ..., 15.

%let Distrib = Lognormal;
data Have;
do pow = -8 to -15 by -1;
   p = 10**pow;                   
   Lx = quantile("&Distrib", p);   /* left-tail quantile: find x such that CDF(x) = p */
   Rp = 1 - p;                     /* numerically, 1-p = 1 for p < MACEPS */
   Rx = quantile("&Distrib", Rp);  /* right-tail quantile: find x such that CDF(x) = 1-p */
   output;
end;
format Rp 18.15;
run;
 
proc print noobs; run;

The second column shows the probability, p, and the Lx column shows the corresponding left-tail quantile. For example, the first row shows that 1E-8 is the probability that a standard lognormal random variate has a value less than 0.00365. The Rp column shows the probability 1 – p. The Rx column shows the corresponding quantile. For example, the first row shows that 1 – 1E-8 is the probability that a standard lognormal random variate has a value less than 273.691.

From experimentation, it appears that the QUANTILE function returns a missing value when the probability argument is greater than or equal to 1 – 1E-12. The SAS log displays a note that states
    NOTE: Argument 2 to function QUANTILE('Normal',1) at line 5747 column 9 is invalid.
This means that the function will not compute extreme-right quantiles. Why? Well, I can think of two reasons:

  • When p is very small, the quantity 1 – p does not have full precision. (This is called the "loss-of-precision problem" in numerical analysis.) In fact, if p is less than machine epsilon, the expression 1 – p EXACTLY EQUALS 1 in floating-point computations! (See the short demonstration after this list.) Thus, we need a better way to specify probabilities that are close to 1.
  • For an unbounded distribution, the CDF is extremely flat for values in the right tail. Because the pth quantile is obtained by solving for the root of the function CDF(x) - p = 0, the computation of an extreme-right quantile is numerically imprecise.
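
The following DATA step is a minimal sketch that demonstrates the loss-of-precision problem. The value of p is chosen to be smaller than machine epsilon, which is about 2.2E-16 for double-precision arithmetic:

data _null_;
p = 1e-17;           /* p is smaller than machine epsilon */
q = 1 - p;           /* in double precision, q is exactly 1 */
isOne = (q = 1);     /* isOne is 1 (true) */
put p= q= isOne=;
run;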

Compute extreme quantiles in SAS

The previous paragraph does not mean that you can't compute extreme-right quantiles. It means that you should use a function that is specifically designed to handle probabilities that are close to 1. The special function is the SQUANTILE function, which uses asymptotic expansions and other tricks to ensure that extreme-right quantiles are computed accurately. If x is the value returned by the SQUANTILE function, then 1 – CDF(x) = p. You can, of course, rewrite this expression to see that x satisfies the equation CDF(x) = 1 – p.

The syntax for the SQUANTILE function is almost the same as for the QUANTILE function. For any distribution, QUANTILE("Distrib", 1-p) = SQUANTILE("Distrib", p). That is, you add an 'S' to the function name and replace 1 – p with p. Notice that by specifying p instead of 1 – p, the right-tail probability can be specified more accurately.

The following SAS DATA step is a modification of the previous DATA step. It uses the SQUANTILE function to compute the extreme-right quantiles:

data Want;
do pow = -8 to -15 by -1;
   p = 10**pow;
   Lx = quantile("&Distrib", p);   /* find x such that CDF(x) = p */
   Rp = 1 - p;
   Rx = squantile("&Distrib", p);  /* find x such that 1-CDF(x)=p */
   output;
end;
format Rp 18.15;
drop pow;
run;
 
proc print noobs; run;

The table shows that the SQUANTILE function can compute quantiles that are far in the right tail. The first row of the table shows that the probability is 1E-8 that a lognormal random variate exceeds 273.69. (Equivalently, 1 – 1E-8 is the probability that the variate is less than 273.69.) The last row of the table shows that 1E-15 is the probability that a lognormal random variate exceeds 2811.14.

Compute extreme probabilities in SAS

When x is large, the "loss-of-precision problem" also affects the computation of a right-tail probability. The naive expression 1 – CDF("Distrib", x) is not very accurate because CDF(x) ≈ 1, so the subtraction loses precision. To counter that problem, SAS provides the SDF function, which computes the survival distribution function (SDF). For a value x, the SDF function returns the right-tail probability Pr(X ≥ x), which equals 1 – CDF(x).

The following DATA step computes several one-sided probabilities. For each value of x, the program computes the probability that a random lognormal variate is greater than x:

%let Distrib = lognormal;
data SDF;
do x = 500 to 2500 by 500;              /* extreme quantiles  */
   Rp       = 1 - cdf("&Distrib", x);   /* right-tail probability */
   RpBetter = sdf("&Distrib", x);       /* better computation of right-tail probability */
   output;
end;
run;
 
proc print noobs; run;

The differences between the probabilities in the Rp and RpBetter columns are not as dramatic as the differences for the quantiles in the previous section. Although CDF(x) ≈ 1 for these values of x, the subtraction 1 – CDF(x) still yields a small probability whose absolute error is tiny; it is the relative precision that suffers. Nevertheless, you can see small differences between the columns, and the RpBetter column tends to be more accurate when x is large.

Summary

This article shows two tips for working with extreme quantiles and extreme probabilities. When the probability is close to 1, consider using the SQUANTILE function to compute quantiles. When the quantile is large, consider using the SDF function to compute probabilities.

The post Tips for computing right-tail probabilities and quantiles appeared first on The DO Loop.

July 11, 2022
 

I previously wrote about partial leverage plots for regression diagnostics and why they are useful. You can generate a partial leverage plot in SAS by using the PLOTS=PARTIALPLOT option in PROC REG. One useful property of partial leverage plots is the ability to graphically represent the null hypothesis that a regression coefficient is 0. John Sall ("Leverage Plots for General Linear Hypotheses", TAS, 1990) showed that you could add a confidence band to the partial regression plot such that if the line segment for Y=0 is not completely inside the band, then you reject the null hypothesis.

This article shows two ways to get the confidence band. The easy way is to write the partial leverage data to a SAS data set and use the REG statement in PROC SGPLOT to overlay a regression fit and a confidence band for the mean. This way is easy to execute, but it requires performing an additional regression analysis for every effect in the model. A second way (which is due to Sall, 1990) computes the confidence band directly from the statistics on the original data. This article shows how to use the SAS/IML language to compute the confidence limits from the output of PROC REG.

Confidence limits on a partial leverage plot: The easy way

The following example is taken from my previous article, which explains the code. The example uses a subset of Fisher's iris data. For convenience, the variables are renamed to Y (the response) and X1, X2, and X3 (the regressors). The call to PROC REG includes the PARTIAL and PARTIALDATA options on the MODEL statement, and the ODS OUTPUT statement saves the partial leverage data to the PD data set:

data Have;
set sashelp.iris(
    rename=(PetalLength = Y
            PetalWidth  = x1
            SepalLength = x2
            SepalWidth  = x3)
    where=(Species="Versicolor"));
label Y= x1= x2= x3=;
run;
 
proc reg data=Have plots(only)=PartialPlot;
   model Y = x1 x2 x3 / clb partial partialdata;
   ods exclude PartialPlotData;
   ods output PartialPlotData=PD;
quit;
 
/* macro to create a partial leverage plot (with CLM) for each variable */
%macro MakePartialPlot(VarName);
   title "Partial Leverage Plot for &VarName";
   title2 "with 95% CL for H0: Beta_&VarName = 0";
   proc sgplot data=PD ;
      reg     y=part_Y_&VarName x=part_&VarName / CLM;
      refline 0 / axis=y;
   run;
%mend;
 
%MakePartialPlot(x2);
%MakePartialPlot(x3);

Confidence limits on a partial leverage plot: The efficient way

The previous section shows an easy way to produce partial leverage regression plots in SAS. The confidence bands are a graphical way to examine the null hypotheses βi = 0. Look at the confidence band for the partial leverage plot for the X2 variable. The Y=0 axis (the gray line) is NOT wholly contained in the confidence band. You can therefore reject (with 95% confidence) the null hypotheses that β2=0. In contrast, look at the confidence band for the partial leverage plot for the X3 variable. The Y=0 axis IS wholly contained in the confidence band. You therefore cannot reject the null hypotheses that β3=0.

Although this method is easy, Sall (1990) points out that it is not the most computationally efficient method. For every variable, the SGPLOT procedure computes a bivariate regression, which means that this method requires performing k+2 linear regressions: the regression for the original data that has k regressors, followed by an additional bivariate regression for the intercept and for each of the k regressors. In practice, each bivariate regression can be computed very quickly, so the entire process is not very demanding. Nevertheless, Sall (1990) showed that all the information you need is already available from the regression on the original data. You just need to save that information and use it to construct the confidence bands.

The following section uses Sall's method to compute the quantities for the partial leverage plots. The program is included for the mathematically curious; I do not recommend that you use it in practice because the previous method is much simpler.

Sall's method for computing partial leverage regression plots

Sall's method requires the following output from the original regression model:

  • The means of the regressor variables. You can get these statistics by using the SIMPLE option on the PROC REG statement and by using ODS OUTPUT SimpleStatistics=Simple to save the statistics to a data set.
  • The inverse of the crossproduct matrix (the X`*X matrix). You can get this matrix by using the I option on the MODEL statement and by using ODS OUTPUT InvXPX=InvXPX to save the matrix to a data set.
  • The mean square error and the error degrees of freedom for the regression model. You can get these statistics by using ODS OUTPUT ANOVA=Anova to save the ANOVA table to a data set.
  • The parameter estimates and the F statistics for the null hypotheses βi=0. You can get these statistics by using the CLB option on the MODEL statement and by using ODS OUTPUT ParameterEstimates=PE to save the parameter estimates table to a data set. The data set contains t statistics for the null hypotheses, but the F statistic is the square of the t statistic.

The following call to PROC REG requests all the statistics and saves them to data sets:

proc reg data=Have plots(only)=PartialPlot simple;
   model Y = x1 x2 x3 / clb partial partialdata I;
   ods exclude InvXPX PartialPlotData;
   ods output SimpleStatistics=Simple
              InvXPX=InvXPX ANOVA=Anova ParameterEstimates=PE 
              PartialPlotData=PD(drop=Model Dependent Observation);
quit;

Sall (1990) shows how to use these statistics to create the partial leverage plots. The following SAS/IML program implements Sall's method.

/* Add confidence bands to partial leverage plots: 
   Sall, J. P. (1990). "Leverage Plots for General Linear Hypotheses."  
   American Statistician 44:308–315.
*/
proc iml;
alpha = 0.05;
 
/* Use statistics from original data to estimate the confidence bands */
/* xBar = mean of each original variable */
use Simple;  read all var "Mean" into xBar; close;
p = nrow(xBar)-1;            /* number of parameters */
xBar = xBar[1:p];            /* remove the Y mean */
 
/* get (X`*X)^{-1} matrix */
use InvXPX;  read all var _NUM_ into xpxi;  close;
xpxi = xpxi[1:p, 1:p];       /* remove column and row for Y */
hbar = xbar` * xpxi * xbar;  /* compute hbar from inv(X`*X) and xbar */
 
/* get s2 = mean square error and error DF */
use ANOVA;  read all var {DF MS};  close;
s2 = MS[2];                  /* extract MSE */
DFerror = DF[2];             /* extract error DF */
 
/* get parameter estimates and t values */
use PE;  read all var {Estimate tValue};  close;
FValue  = tValue#tValue;     /* F statistic is t^2 for OLS */
b = Estimate;                /* lines have slope b[i] and intercept 0 */
 
tAlpha  = quantile("T", 1-alpha/2, DFerror); /* assume DFerror > 0 */
FAlpha  = tAlpha#tAlpha;     /* = quantile( "F", 1-alpha, 1, DFerror); */
 
/* Get partial plot data from REG output */
/* read in part_: names for intercept and X var names */
use PD;  read all var _NUM_ into partial[c=partNames];  close;
 
/* The formula for the confidence limits is from Sall (1990) */
nPts = 100;               /* score CL at nPts values on [xMin, xMax] */ 
XPts = j(nPts, p, .);     /* allocate matrices to store the results  */
Pred  = j(nPts, p, .);
Lower = j(nPts, p, .);
Upper = j(nPts, p, .);
do i = 1 to p;
   levX = partial[,2*i-1];               /* partial x variable */
   xMin = min(levX);  xMax = max(levX);  /* [xMin, xMax] */
   /* partial regression line is (t,z) and CL band is (t, z+/- SE) */
   t = T( do(xMin,xMax,(xMax-xMin)/(nPts-1)));   /* assume xMin < xMax */
   z = b[i]*t;                                   /* predicted mean: yHat = b[i]*x + 0 */
   SE = sqrt( s2#FAlpha#hbar + FAlpha/FValue[i] # (z#z) ); /* Sall, 1990 */
   XPts[,i] = t;
   Pred[,i]  = z;
   Lower[,i] = z - SE;
   Upper[,i] = z + SE;
end;
 
/* create names for new variables for the CL band */
oddIdx = do(1,ncol(partNames),2);  /* names we want are the ODD elements */
Names = partNames[,oddIdx];
XNames     = compress('X_'   +Names);
PredNames  = compress('Pred_'+Names);
LowerNames = compress('LCL_' +Names);
UpperNames = compress('UCL_' +Names);
 
/* write regression lines and confidence bands to a data set */
Results = XPts || Pred || Lower || Upper;
create PartCL from Results[c=(XNames||PredNames||LowerNames||UpperNames)];
   append from Results;
close; 
QUIT;

The data set PartCL contains the fit lines (which have slopes equal to b[i]) and the lower and upper confidence limits for the means of the partial leverage plots. The lines are evaluated at 100 evenly spaced points in the range of the "partial variables." The program uses matrix multiplication to evaluate the quadratic form \(\bar{x}^\prime (X^\prime X)^{-1} \bar{x}\). The remaining computations are scalar operations.

The following statements concatenate the regression lines and confidence bands with the data for the partial leverage plots. You can then create a scatter plot of the partial leverage values and overlay the regression line and confidence band.

data Want;
set PD PartCL;
run;
 
/* macro for plotting a partial leverage plot with CL from the IML program */
%macro MakePartialPlotCL(VarName);
   title "Partial Leverage Plot for &VarName";
   proc sgplot data=Want noautolegend;
      band    x=x_part_&VarName lower=LCL_part_&VarName upper=UCL_part_&VarName; 
      scatter x=part_&VarName y=part_Y_&VarName;
      series  x=x_part_&VarName y=Pred_part_&VarName;
      refline 0 / axis=y;
   run;
%mend;
 
ods graphics / PUSH AttrPriority=None width=360px height=270px;
%MakePartialPlotCL(x2);
%MakePartialPlotCL(x3);
ods graphics / POP;      /* restore ODS GRAPHICS options */

As expected, the graphs are the same as the ones in the previous section. However, the quantities in the graphs are all produced by PROC REG as part of the computations for the original regression model. No additional regression computations are required.

Summary

This article shows two ways to get a confidence band for the conditional mean of a partial regression plot. The easy way is to use the REG statement in PROC SGPLOT. A more efficient way is to compute the confidence band directly from the statistics on the original data. This article shows how to use the SAS/IML language to compute the confidence limits from the output of PROC REG.

This article also demonstrates a key reason why SAS customers choose SAS/IML software. They want to compute some statistic that is not directly produced by any SAS procedure. The SAS/IML language enables you to combine the results of SAS procedures and use matrix operations to compute other statistics from formulas in textbooks or journal articles. In this case, PROC IML consumes the output of PROC REG and creates regression lines and confidence bands for partial leverage plots.

The post Confidence bands for partial leverage regression plots appeared first on The DO Loop.