rick wicklin

November 30, 2022
 

A probabilistic card trick is a trick that succeeds with high probability and does not require any skill from the person performing the trick. I have seen a certain trick mentioned several times on social media. I call it "ladders" or the "ladders game" because it reminds me of the childhood board game Chutes and Ladders (known as Snakes and Ladders in India and the UK). In Chutes and Ladders, some squares on the board tell the player to move forward (go up a ladder) or to move backward (go down a chute). In the "ladders" game, you only move forward.

This article uses Monte Carlo simulation in SAS to show that the "ladders" card trick "works" about 98.2% of the time.

I have been told that Martin Gardner wrote about this game. He referred to mathematical card tricks as those that "do not require prestidigitation," or sleight of hand.

The ladders card trick

The rules of the game and a Python simulation are presented in a blog post by Bob Carpenter, who saw the trick performed at a children's museum. The performer shuffles an ordinary deck of 52 playing cards and places them face up in a row on a table. The performer then describes the game of ladders:

  • A player starts by pointing to any card in the first five positions.
  • The value of the card indicates the number of cards to move forward. If the card is 2, 3, ..., 9, you move forward that many cards. If the card is a 10, jack, queen, king, or ace, you move forward one space.
  • From the new position, apply the rule again. Repeat this process until you cannot advance any more.

The performer selects two volunteers from the audience and tells them each to point randomly to any card from among the first five cards. Each player does so. The performer then announces that both players will end up on a certain "magical" card and he tells everyone the name of the magical card. The players then follow the game rules. Amazingly, each person's path terminates on the magical card that was previously announced. The trick works about 98.2% of the time.

To illustrate the ladders game, first consider a simpler situation of a deck that contains only 13 cards, one for each card value (for example, use only spades). One random shuffle of the 13 cards is shown to the right. For this arrangement of cards, the performer announces that both volunteers will end up on the 7 card (the magical card for this shuffle), which is the 11th card in the row of cards. Suppose the first volunteer points to the ace in the first position as her starting card. The rules of the game dictate the following sequence of moves:

  • From the A, the player moves forward one card and lands on the J.
  • From the J, the player moves forward one card and lands on the 2.
  • From the 2, the player moves forward two cards and lands on the 6.
  • From the 6, the player moves forward six cards and lands on the 7. The player cannot move forward seven spaces, so the 7 is her final card, as predicted in advance.

Thus, the path for the first player is the sequence of positions (1, 2, 3, 5, 11).

What happens for the other player? Well, if the second volunteer chooses his initial card at positions 2, 3, or 5, then his path will instantly coincide with the path of the first player, so his path, too, terminates at the 7 card in the 11th position. If the second volunteer chooses his initial card at position 4, then his path hits positions 4, 8, and 11. This path also terminates at the 7 card in the 11th position. Thus, for this set of cards and this shuffle, all initial positions follow a path that eventually lands on the 7.

How does the performer know the name of the final magical card? As he deals, he secretly counts the path determined by the first card in the deck.
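
To make the example concrete, the following SAS/IML sketch traces the complete path from any starting position. The FullPath module is a hypothetical helper (the program later in this article defines a related EndPos function that returns only the final position), and the 13-card arrangement below is a reconstruction that is consistent with the paths described above, because the figure that shows the actual shuffle is not reproduced here.

proc iml;
/* Hypothetical helper: return the full sequence of positions visited,
   given a starting position and a vector of moves (one move per card) */
start FullPath(startPos, Move);
   path = startPos;
   Pos = startPos;
   do while(Pos + Move[Pos] <= nrow(Move));   /* advance while the destination is a valid position */
      Pos = Pos + Move[Pos];
      path = path // Pos;                     /* append the new position to the path */
   end;
   return( path );
finish;
 
/* One 13-card arrangement (one card per rank) that is consistent with the
   example: an ace or face card moves 1; a 2-9 moves its face value.
   Positions:  1  2  3  4  5  6  7  8  9 10 11 12 13
   Cards:      A  J  2  4  6  5  8  3  9 10  7  Q  K   (assumed arrangement) */
Move  = {1, 1, 2, 4, 6, 5, 8, 3, 9, 1, 7, 1, 1};
path1 = T( FullPath(1, Move) );    /* 1 2 3 5 11 */
path4 = T( FullPath(4, Move) );    /* 4 8 11 */
print path1, path4;
quit;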

Simulate the ladders game in SAS

You can use Monte Carlo simulation to estimate the probability that the ladders card trick "works." To keep it simple, let's say the trick "works" when the paths of the two players converge. That is, the players initially start on two random positions (1-5), but they end up on the same final card after they each follow the path dictated by their initial position. The simulation for the ladders card trick has the following major steps:

  1. Set up the card deck. For each card, assign the number of steps to move forward.
  2. Create a function that computes the final position, given the initial position and the sequence of moves determined by the shuffled deck.
  3. Run the Monte Carlo simulation. This consists of several sub-steps:
    • Shuffle the deck (that is, shuffle the array of moves).
    • Use sampling without replacement to choose two initial positions.
    • Compute the final card for each initial position.
    • Record whether the two paths converged.
  4. Estimate the probability of the trick working as the proportion of times that the paths converged in the simulation.

As is often the case, this simulation is easiest to perform in SAS by using the SAS/IML language. You can use the RANPERM function to shuffle the deck. You can use the SAMPLE function to sample without replacement from the values 1-5. You can write a function that computes the final position based on the initial position. The following program is one way to implement a Monte Carlo simulation for the ladders card trick:

/* The Ladders card trick:
   https://statmodeling.stat.columbia.edu/2022/10/07/mira-magic-a-probabilistic-card-trick/
   Lay out a shuffled deck of cards.
   Choose one of the first five cards as a starting point.
   Move down the deck as follows:
   If the current card is 2-9, move that many cards forward.
   For other cards, move one space forward.
*/
proc iml;
/* 1. set up the card deck and assign values. 
   Note: We don't actually need the deck, only the instructions
   about how to move. */
values = {A,'2','3','4','5','6','7','8','9','10', J, Q, K};
advance= {1,  2,  3,  4,  5,  6,  7,  8,  9,  1,  1, 1, 1};
deck  = repeat(values, 4);   /* full deck of 52 cards */
moves = repeat(advance, 4);  /* corresponding moves */
nCards = nrow(moves);
 
/* 2. Find the final position, given the first position and 
   the vector of moves for the current shuffled deck.
   startPos : a scalar value 1-5
   Move : a vector of values. If your current card is in 
          position i, the next card is position (i + Move[i])
*/
start EndPos(startPos, Move);
   Pos = startPos;
   do while(Pos + Move[Pos] <= nrow(Move));  /* can advance only while the destination is a valid position */
      Pos = Pos + Move[Pos];
   end;
   return( Pos );
finish;
 
/* 3. Run the Monte Carlo simulation:
      A. Shuffle the deck (vector of moves)
      B. Use sampling w/o replacement to choose two initial positions
      C. Compute the final card for each initial position
      D. Record whether the two paths converged
*/
call randseed(1234);
NSim = 1E5;
match = j(NSim,1);
do j = 1 to nSim;
   k = ranperm(nCards);          /* shuffle the deck */
   Move = moves[k];
   /* choose two different starting positions in 1-5 */
   startPos = sample(1:5, 2, "NoReplace");
   EndPos1 = EndPos(startPos[1], Move);
   EndPos2 = EndPos(startPos[2], Move);
   match[j] = (EndPos1=EndPos2);
end;
 
/* 4. Estimate probability as the proportion of matches (1s) in a binary vector.
      Optional: Compute a Wald CL for the proportion */
ProbMatch = mean(match);
StdErr = sqrt( ProbMatch*(1-ProbMatch)/ NSim );
CL = ProbMatch + {-1 1}*1.96*StdErr;
print NSim[F=comma9.] ProbMatch StdErr CL[c={'LCL' 'UCL'}];

The simulation shows that the two players end up on the same card more than 98% of the time.

Summary

This article describes a probabilistic card trick, which I call the game of ladders. Two players choose random cards as an initial condition, then follow the rules of the game. For about 98.2% of the deals and initial positions, the paths of the two players will converge to the same card. As the performer is laying out the cards, he can predict in advance that the paths will converge and, with high probability, announce the name of the final card.

Bob Carpenter's blog post mentions the connection between this card trick and Markov chains. He also presents several variations, such as using more cards in the deck.

The post Ladders: A probabilistic card trick appeared first on The DO Loop.

November 28, 2022
 

A SAS programmer was trying to simulate poker hands. He was having difficulty because the sampling scheme for simulating card games requires that you sample without replacement for each hand. In statistics, this is called "simple random sampling."

If done properly, it is straightforward to simulate poker hands in SAS. This article shows how to generate a standard deck of 52 cards. It then shows how to generate many card hands for a game that involves several players. This can be done in two ways: by using the SAMPLE function in the SAS/IML language or by using PROC SURVEYSELECT in SAS/STAT software.

Encode a deck of cards

The first step is to generate a standard deck of 52 playing cards. There are many ways to do this, but the following DATA step iterates over 13 values (A, 2, 3, ..., Q, K) and over four suits (clubs, diamonds, hearts, and spades). The step creates a data set that has 52 cards. To make the output more compact, I concatenate the value of the card with the first letter of the suit to obtain a two-character string for each card. For example, "5C" is the "five of clubs" and "AS" is the "ace of spades." So that each card can be represented by using two characters, I use "T" instead of "10." For example, "TH" is the "ten of hearts."

data Cards;
length Value $2 Suit $8 Card $3;
array Values[13] $ _temporary_ ('A','2','3','4','5','6','7','8','9','T','J','Q','K');
array Suits[4] $ _temporary_ ('Clubs', 'Diamonds', 'Hearts', 'Spades');
do v = 1 to 13;
   do s = 1 to 4;
      Value = Values[v]; Suit = Suits[s]; Card = cats(Value,substr(Suit,1,1));
      output;
   end;
end;
run;
 
proc print data=Cards(obs=8); 
   var v s Value Suit Card;
run;

The cards are generated in order. The output shows the first eight cards in the deck, which are the aces and twos. For some analyses of poker hands (such as detecting pairs and straights), the V variable can be useful; for other analyses (such as detecting flushes), the S variable can be useful.
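
As a quick sketch of how those variables might be used, the following SAS/IML statements classify a single five-card hand. The V and S values are made up for illustration; they do not come from the simulation below.

proc iml;
/* Hypothetical 5-card hand, encoded by the V (value index, 1-13) and
   S (suit index, 1-4) variables described above; these values are made up */
v = {5, 5, 9, 12, 2};             /* card values: the hand contains a pair of fives */
s = {3, 1, 3, 3, 3};              /* suit indices: not all equal, so not a flush */
 
isFlush = all( s = s[1] );        /* flush <=> every card has the same suit */
u = unique(v);                    /* distinct card values in the hand */
counts = j(1, ncol(u), 0);
do i = 1 to ncol(u);
   counts[i] = sum( v = u[i] );   /* number of cards that share each value */
end;
nPairs = sum( counts = 2 );       /* number of distinct pairs */
print isFlush nPairs;
quit;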

Deal cards by using SAS/IML

When humans deal cards, we use a two-step method: shuffle the deck and then deal the top cards to players in a round-robin fashion. A computer simulation can simulate dealing cards by using a one-step method: deal random cards (without replacement) to every player. SAS provides built-in sampling methods to make this process easy to program.

In SAS/IML, the default sampling method for the SAMPLE function is sampling without replacement. If there are P players and each player is to receive C cards, then a deal consists of P*C cards, drawn without replacement from the deck. The deck does not need to be shuffled because the SAMPLE function outputs the cards in random order. Because the cards are in random order, it does not matter how you assign the cards to the players. The following SAS/IML program uses P=3 players and generates C=5 cards for each player. The program simulates one deal by assigning the first C cards to the first player, the second C cards to the second player, and so on:

%let nCards   = 5;    /* cards per player */
%let nPlayers = 3;    /* number of players */
%let nDeals   = 10;   /* number of deals to simulate */
 
/* simulate dealing many card hands in SAS/IML */
proc iml;
call randseed(1234);                       /* set random number seed */
use Cards;  read all var "Card";  close;   /* read the deck of cards into a vector */
 
nCardsPerDeal = &nCards * &nPlayers;       /* each row is a complete deal */
D = sample(Card, nCardsPerDeal, "WOR" );   /* sample without replacement from the deck */
Deal = shape(D, &nPlayers, &nCards);       /* reshape into nPlayers x nCards matrix */
 
/* print one deal */
cardNames   = "C1":"C&nCards";
playerNames = "P1":"P&nPlayers";
print Deal[c=cardNames r=playerNames];

For this deal, the first player has two tens, two fives, and a king. The second player has a four, a five, a nine, an ace, and a two. You could sort the cards in each row, but I have left each player's cards unsorted so that you can see the random nature of the assignment.

Simulate multiple deals

The previous program simulates one deal. What if you want to simulate multiple deals? One way is to loop over the number of deals, but there is a more efficient method. The second argument to the SAMPLE function can be a two-element vector. The first element specifies the number of cards for each deal, and the second element specifies the number of deals. Thus, for example, if you want to simulate 10 deals, you can use the following statement to generate a matrix that has 10 rows. Each row is a sample (without replacement) from a 52-card deck:

/* use one call to simulate many deals */
D = sample(Card, nCardsPerDeal//&nDeals, "WOR" );   /* simulate many deals */
 
/* Ex: The 8th row contains the 8th deal. Display it. */
Deal8 = shape(D[8,], &nPlayers, &nCards);
print Deal8[c=cardNames r=playerNames];

To demonstrate that each row is a deal, I have extracted, reshaped, and displayed the eighth row. The eighth deal is unusual because each player is dealt a "good" hand. The first player has a pair of eights, the second player has two pairs (threes and sixes), and the third player has a pair of sevens. If you want to print (or analyze) the 10 deals, you can loop over the rows, as follows:

do i = 1 to &nDeals;
   Deal = shape(D[i,], &nPlayers, &nCards);
   /* analyze or print the i_th deal */
   print i, Deal[c=cardNames r=playerNames];
end;

Deal random hands by using PROC SURVEYSELECT

You can also simulate card deals by using the SURVEYSELECT procedure. You should use the following options:

  • Use the METHOD=SRS option to specify sampling without replacement for each deal. To obtain the cards in random order, use the OUTRANDOM option.
  • Use the N= option to specify the number of cards in each deal.
  • Use the REPS= option to specify the total number of deals.

An example is shown in the following call to PROC SURVEYSELECT:

/* similar approach for PROC SURVEYSELECT */
proc surveyselect data=Cards seed=123 NOPRINT
     method=SRS                   /* sample without replacement */
     outrandom                    /* randomize the order of the cards */
     N=%eval(&nCards * &nPlayers) /* num cards per deal */
     reps=&nDeals                 /* total number of deals */
     out=AllCards;
run;

The output data set (AllCards) contains a variable named Replicate, which identifies the cards in each deal. Because the cards for each deal are in a random order, you can arbitrarily assign the cards to players. You can use a round-robin method or assign the first C cards to the first player, the next C cards to the second player, and so on. This is done by using the following DATA step:

/* assign the cards to players */
data AllDeals;
set AllCards;
by Replicate;
if first.Replicate then do;
   Cnt = 0;
end;
Player = 1 + floor(Cnt/&nCards);
Cnt + 1;
drop Cnt;
run;
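
For completeness, here is a sketch of the round-robin alternative mentioned above. It differs from the previous DATA step only in how the Player variable is computed; the data set name AllDealsRR is arbitrary.

/* hypothetical round-robin assignment: deal cards to players 1,2,...,&nPlayers in turn */
data AllDealsRR;
set AllCards;
by Replicate;
if first.Replicate then Cnt = 0;
Player = 1 + mod(Cnt, &nPlayers);   /* cycle through the players */
Cnt + 1;
drop Cnt;
run;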

The simulated data are in long form. For some analyses, it might be more convenient to convert it into wide form. You can do this by using the DATA step or by using PROC TRANSPOSE, as below:

/* OPTIONAL: Convert from long form to wide form */
proc transpose data=AllDeals 
               out=Deals(rename=(col1-col&nCards=C1-C&nCards) drop=_NAME_);
   var Card;
   by Replicate Player;
run;
 
proc print data=Deals;
   where Replicate=8;
   var Replicate Player C:;
run;

To demonstrate the results, I display the result of the eighth deal (Replicate=8). For this deal, the first player has several face cards but no pairs, the second player has a pair of threes, and the third player has a pair of fours.

Summary

SAS provides many tools for simulation studies. For simulations of card games, this article shows how to create a data set that represents a standard deck of 52 playing cards. From the cards, you can use the SAMPLE function in SAS/IML software or the SURVEYSELECT procedure to simulate many deals, where each deal is a sample without replacement from the deck. For each deal, you can assign the cards to the players in a systematic manner.

The post Simulate poker hands in SAS appeared first on The DO Loop.

November 21, 2022
 

Recently, I needed to know "how much" of a piecewise linear curve is below the X axis. The coordinates of the curve were given as a set of ordered pairs (x1,y1), (x2,y2), ..., (xn, yn). The question is vague, so the first step is to define the question better. Should I count the number of points on the curve for which the Y value is negative? Should I use a weighted sum and add up all the negative Y values? Ultimately, I decided that the best measure of "negativeness" for my application is to compute the area that lies below the line Y=0 and above the curve. In calculus, this would be the "negative area" of the curve. Because the curve is piecewise linear, you can compute the area exactly by using the trapezoid rule of integration.

An example is shown to the right. The curve is defined by 12 ordered pairs. The goal of this article is to compute the area shaded in blue. This is the "negative area" with respect to the line Y=0. With very little additional effort, you can generalize the computation to find the area below any horizontal line and above the curve.

Area defined by linear segments

The algorithm for computing the shaded area is simple. For each line segment along the curve, let [a,b] be the interval defined by the left and right abscissas (X values). Let f(a) and f(b) be the corresponding ordinate values. Then there are four possible cases for the positions of f(a) and f(b) relative to the horizontal reference line, Y=0:

  • Both f(a) and f(b) are above the reference line. In this case, the area between the line segment and the reference line is positive. We are not interested in this case for this article.
  • Both f(a) and f(b) are below the reference line. In this case, the "negative area" can be computed as the area of a trapezoid: \(A = 0.5 (b - a) (f(b) + f(a))\).
  • The value f(a) is below the reference line, but f(b) is above the line. In this case, the "negative area" can be computed as the area of a triangle. You first solve for the location, c, at which the line segment intersects the reference line. The negative area is then \(A = 0.5 (c - a) f(a)\).
  • The value f(a) is above the reference line and f(b) is below the line. Again, the relevant area is a triangle. Solve for the intersection location, c, and compute the negative area as \(A = 0.5 (b - c) f(b)\).
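
In the two triangle cases, the intersection location, c, follows from linear interpolation along the segment: \(c = a - f(a)\,\frac{b-a}{f(b)-f(a)}\). This is the same formula that the SAS/IML function in the next section uses to find the crossing point.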

The three cases for negative area are shown in the next figure:

You can easily generalize these formulas if you want the area above the curve and below the line Y=t. In every formula that includes f(a), replace that value with (f(a) – t). Similarly, replace f(b) with (f(b) – t).

Compute the negative area

The simplest computation for the negative area is to loop over all n points on the line. For the i_th point (1 ≤ i < n), let [a,b] be the interval [x[i], x[i+1]] and apply the formulas in the previous section. Since we skip any intervals for which f(a) and f(b) are both positive, we can exclude the point (x[i], y[i]) if y[i-1], y[i], and y[i+1] are all positive. This is implemented in the following SAS/IML function. By default, the function returns the area below the line Y=0 and the curve. You can use an optional argument to change the value of the horizontal reference line.

proc iml;
/* compute the area below the line y=y0 for a piecewise
   linear function with vertices given by (x[i],y[i]) */
start AreaBelow(x, y, y0=0);
   n = nrow(x);
   idx = loc(y<y0);                /* find indices for which y[i] < y0 */
   if ncol(idx)=0 then return(0);
 
   k = unique(idx-1, idx, idx+1);  /* we need indices before and after */
   jdx = loc(k > 0 & k <  n);      /* restrict to indices in [1, n-1] */
   v = k[jdx];                     /* a vector of the relevant vertices */
 
   NegArea = 0;
   do j = 1 to nrow(v);            /* loop over intervals where f(a) or f(b) negative */
      i = v[j];                    /* get j_th index in the vector v */
      fa = y[i]-y0; fb = y[i+1]-y0;/* signed distance from cutoff line */
      if fa > 0 & fb > 0 then 
         ;                         /* segment is above cutoff; do nothing */
      else do;
         a  = x[i];  b = x[i+1];
         if fa < 0 & fb < 0 then do;  /* same sign, use trapezoid rule */
            Area = 0.5*(b - a) * (fb + fa);
         end;
         /* different sign, f(a) < 0, find root and use triangle area */
         else if fa < 0 then do;  
            c = a - fa * (b - a) / (fb - fa);
            Area = 0.5*(c - a)*fa;
         end;
         /* different sign, f(b) < 0, find root and use triangle area */
         else do;
            c = a - fa * (b - a) / (fb - fa);
            Area = 0.5*(b - c)*fb;
         end;
         NegArea = NegArea + Area;
      end;
   end;
   return( NegArea );
finish;
 
/* points along a piecewise linear curve */
x = {   1,    2, 3.5,   4,5,     6, 6.5,   7,   8,  10, 12,  15};
y = {-0.5, -0.1, 0.2, 0.7,0.8,-0.2, 0.3, 0.6, 0.3, 0.1,-0.4,-0.6};
 
/* compute area under the line Y=0 and above curve (="negative area") */
NegArea = AreaBelow(x,y);
print NegArea;

The program defines the AreaBelow function and calls the function for the piecewise linear curve that is shown at the top of this article. The output shows that the area of the shaded regions is -2.185.

The area between a curve and a reference line

I wrote the program in the previous section so that the default behavior is to compute the area below the reference line Y=0 and above a curve. However, you can use the third argument to specify the value of the reference line. For example, the following statements compute the areas below the lines Y=0.1 and Y=0.2, respectively:

Area01 = AreaBelow(x,y, 0.1);      /* reference line Y=0.1 */
Area02 = AreaBelow(x,y, 0.2);      /* reference line Y=0.2 */
print Area01 Area02;

As you would expect, the magnitude of the area increases as the reference line moves up. In fact, you can visualize the area below the line Y=t and above the curve as a function of t. Simply call the AreaBelow function in a loop for a sequence of increasing t values and plot the results:

t = do(-0.7, 1.0, 0.05);
AreaUnder = j(1,ncol(t));
do i = 1 to ncol(t);
   AreaUnder[i] = AreaBelow(x, y, t[i]);
end;
title "Area Under Reference Line";
call series(t, AreaUnder) label={'Reference Level' 'Area'} grid={x y} xvalues=do(-0.7, 1.0, 0.1);

The graph shows that the area is a monotone decreasing function of the reference level, t. If the reference line is below the minimum value of the curve, the area is zero. As you increase t, the area between the line Y=t and the curve increases in magnitude. After the reference line reaches the maximum value along the curve (Y=0.8 for this example), the magnitude of the area increases linearly. It is difficult to see in the graph above, but the curve is actually linear for t ≥ 0.8.
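
It is not hard to see why the tail is linear. Once \(t\) exceeds the maximum value of the curve, every segment lies below the reference line, so the function returns \(A(t) = \int_{x_1}^{x_n} \bigl(y(x) - t\bigr)\,dx\), which is linear in \(t\) with slope \(-(x_n - x_1) = -14\) for these data.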

Summary

You can use numerical integration to determine "how much" of a function is negative. If the function is piecewise linear, the integral over the negative intervals can be computed by using the trapezoid rule. This article shows how to compute the area between a reference line Y=t and a piecewise linear curve. When t=0, this is the "negative area" of the curve.

Incidentally, this article is motivated by the same project that inspired me to write about how to test whether a function is monotonically increasing. If a function is monotonically increasing, then its derivative is nonnegative. Therefore, another way to test whether a function is monotonically increasing is to test whether its derivative is never negative. A way to measure how far a function deviates from being monotonic is to compute the "negative area" of its derivative.
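
As a sketch of that idea, you can evaluate the derivative on a grid and pass it to the AreaBelow function. The statements below assume that the AreaBelow module defined earlier is still available in the PROC IML session; the cubic function is a toy example of my own.

/* toy example: f(x) = x##3 - x on [-2,2] has derivative f'(x) = 3*x##2 - 1,
   which is negative on (-1/sqrt(3), 1/sqrt(3)), so f is not monotone there */
xg   = T( do(-2, 2, 0.01) );      /* fine grid of x values */
dfdx = 3*xg##2 - 1;               /* derivative evaluated on the grid */
NonMono = AreaBelow(xg, dfdx);    /* 0 only if the derivative is never negative */
print NonMono;                    /* a negative value measures the departure from monotonicity */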

The post The area under a piecewise linear curve appeared first on The DO Loop.

November 16, 2022
 

A profile plot is a way to display multivariate values for many subjects. The optimal linear profile plot was introduced by John Hartigan in his book Clustering Algorithms (1975). In Michael Friendly's book (SAS System for Statistical Graphics, 1991), Friendly shows how to construct an optimal linear profile by using the SAS/IML language. He displays the plot by using traditional SAS/GRAPH methods: the old GPLOT procedure and an annotation data set. If you do not have a copy of his book, you can download Friendly's %LINPRO macro from his GitHub site.

SAS graphics have changed a lot since 1991. It is now much easier to create the optimal linear profile plot by using modern graphics in SAS. This article reproduces Friendly's SAS/IML analysis to create the data for the plot but uses PROC SGPLOT to create the final graph. This approach does not require an annotate data set.

The optimal linear profile plot

Friendly (1991) illustrates the optimal linear profile plot by using the incidence rate for seven types of crime in 16 US cities in 1970. I have previously shown how to visualize the crime data by using line plots and heat maps. A profile plot for the raw data is shown to the right.

The optimal linear profile plot modifies the profile plot in two ways. It computes an optimal ordering of the categories (crimes) and transforms the raw responses (rates) so that the transformed data are as close to linear as possible. Let's focus on the meaning of the words "linear" and "optimal":

  • Linear: The goal of the linear profile plot is to show linear trends rather than exact data values. The method models the data to determine which cities tend to have relatively low rates of some crimes and relatively high rates of other crimes. Thus, the linear profile plot shows a linear model for the crime trends.
  • Optimal: In creating the optimal linear profiles, you can choose any positions for the categories on the X axis. You can also (affinely) scale the data for each variable in many ways. The algorithm chooses positions and scales that make the transformed data as linear as possible.

The optimal linear profile is constructed by using the first two principal coordinates (PC) for the data (in wide form). The optimal positions are found as the ratio of the elements of the first two eigenvectors. The slopes and intercepts for the linear profiles are obtained by projecting the centered and scaled data onto the eigenspace, so they are related to the PC scores. See Friendly (1991) or the SAS/IML code for details.
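
In symbols (matching the variables e1, e2, pos, f1, and f2 in the program below): if \(e_1\) and \(e_2\) are the first two unit eigenvectors, the j-th variable is placed at position \(p_j = e_{2j}/e_{1j}\), and the profile for the i-th subject is the line \(y_i(x) = b_{0i} + b_{1i}\,x\), where the intercept \(b_{0i} = D_i \cdot e_1\) and the slope \(b_{1i} = D_i \cdot e_2\) are the projections of the i-th row of the centered, rescaled data matrix onto the eigenvectors. The fitted value for subject i at variable j is therefore \(b_{0i} + b_{1i}\,p_j\).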

The following DATA step creates the crime data in wide form. The SAS/IML program is taken from the %LINPRO macro and is similar to the code in Friendly (1991). It contains a lot of print statements, which you can comment out if you do not want to see the intermediate calculations. I made a few minor changes and three major changes:

  • Friendly's code computed the covariance and standard deviations by using N (the sample size) as the denominator. My code uses N-1 as the denominator so that the computations of the PCs agree with PROC PRINCOMP.
  • Friendly's code writes a data set (LINFIT), which is used to create an annotate data set. This data set is no longer needed, but I have retained the statements that create it. I have added code that creates two macro variables (POSNAMES and POSVALUES) that you can use to specify the positions of the crime categories.
  • The program uses PROC SGPLOT to create the optimal linear profile plot. The labels for the lines are displayed automatically by using the CURVELABEL option on the SERIES statement. The VALUES= and VALUESDISPLAY= options are used to position the categories at their optimal positions.

/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city $16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* Program from Friendly (1991, p. 424-427) and slightly modified 
   by Rick Wicklin (see comments). You can download the LINPRO MACRO
   from https://github.com/friendly/SAS-macros/blob/master/linpro.sas
 
   The program assumes the data are in wide form:
   There are k response variables. Each row is the response variables 
   for one subject. The name of the subject is contained in an ID variable.
 
   The output for plotting is stored in the data set LINPLOT.
   Information about the linear regressions is stored in the data set LINFIT.
*/
 
%let data = crime;    /* name of data set */
%let var = _NUM_;     /* list of variable names or the _NUM_ keyword */
%let id = City;       /* subject names */
proc iml;
   *--- Get input matrix and dimensions;
   use &data;
   if upcase("&var") = "_NUM_" then
      read all var _NUM_ into X[colname=vars];
   else
      read all var {&var} into X[colname=vars];
   n = nrow(X);
   p = ncol(X);
   m = mean(X);
   D = X - m;         *- centered cols = deviations from means;
 
   if "&id" = " " then
      ID = T(1:nrow(X));
   else
      read all var{&id} into ID;
   if type(ID)='N' then ID = compress(char(id));
   close;
 
   *--- Compute covariance matrix of x;
   /* Modification by Wicklin: Friendly's COV and SD are different 
      because he uses a divisor of n for the COV matrix instead of n-1 */
   C = cov(X);   
   sd = std(X);
   /* If you want to exactly reproduce Friendly's macro, use the following: 
   c = d` * d /  n;               *- variance-covariance matrix;
   sd = sqrt( vecdiag( c))`;      *- standard deviations;
   */
 
   *--- Standardize if vars have sig diff scales;
   ratio = max(sd) / min(sd);
   if ratio > 3 then do;
      s = sd;             * s is a scale factor;
      C = cov2corr(C);
      Clbl = "Correlation Matrix";
   end;
   else do;
      s = j(1, p, 1);
      Clbl = "Covariance Matrix";
   end;
   print C[colname=vars rowname=vars f=7.3 L=Clbl];
 
   *--- Eigenvalues & vectors of C;
   call eigen(val, vec, C);
 
   *--- Optional: Display information about the eigenvectors;
   prop = val / val[+];
   cum = cusum(prop);
   T = val || dif(val,-1) || prop || cum;
   cl = {'Eigenvalue' 'Difference' 'Proportion' 'Cumulative'}; 
   lbl = "Eigenvalues of the " + Clbl;
   print T[colname=cl f=8.3 L=lbl];
 
   *--- Scale values;
   e1 = vec[ , 1] # sign( vec[<> , 1]);
   e2 = vec[ , 2];
   pos = e2 / e1;
   Ds = D;                /* the raw difference matrix */
   D = Ds / ( s # e1` );  /* a scaled difference matrix */
 
   *--- For case i, fitted line is  Y = F1(I) + F2(I) * X   ;
   f1 = ( D * e1 ) / e1[##];  /* ssq(e1) = ssq(e2) = 1 */
   f2 = ( D * e2 ) / e2[##];
   f = f1 || f2;
 
   *--- Optionally display the results;
   scfac = sd` # e1;   * NOTE: Friendly rounds this to the nearest 0.1 unit;
   table = m` || sd` || e1 || e2 || scfac || pos;
   ct = { 'Mean' 'Std_Dev' 'Eigvec1' 'Eigvec2' 'Scale' 'Position'};
   print table [ rowname=vars colname=ct f=8.2];
 
   *--- Rearrange columns;
   r = rank( pos);
   zz = table;  table [r, ] = zz;
   zz = vars;   vars [ ,r] = zz;
   zz = pos;    pos [r, ] = zz;
   zz = scfac;  scfac [r, ] = zz;
   zz = D;      D [ ,r] = zz;
   *--- Optionally display the results;
   print table[rowname=vars colname=ct f=8.2 L="Variables reordered by position"];
   lt = { 'Intercept' 'Slope'};
   print f[rowname=&id colname=lt f=7.2 L="Case lines"];
 
   *--- Fitted values, residuals;
   fit = f1 * j( 1 , p) + f2 * pos`;
   resid = D - fit;
   *--- Optionally display the results;
   print fit [ rowname=&id colname=vars format=7.3 L="Fitted values"];
   print resid [ rowname=&id colname=vars format=7.3 L="Residuals"];
 
   *--- Optionally display summary statistics for fit;
   sse = resid[##];   ssf = fit[##];
   sst = D[##];       vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   sse = (ds-fit)[##];   sst = ds[##];
   vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   *--- Construct output array for annotate data set -
          residuals bordered by fitted scale values;
   v1 = val[ 1 ] || {0};
   v2 = {0} || val[ 2 ];
   xout = ( resid || f ) //
          ( pos` || v1 ) //
          ( scfac` || v2 );
   rl = colvec(ID) // {'Variable','Scale'};
   cl = colvec(vars) // {'Intercept','Slope'};
   create LINFIT from xout[ rowname=rl colname=cl ];
      append from xout[ rowname= rl ];
   close;
   free rl cl xout;
 
   *--- Output the array to be plotted;
   do col = 1 to p;
       rows = j(n,1,pos[col,]) ||  fit[,col] || resid[,col];
       pout = pout // rows;
       rl = rl // shape(ID,1);
   end;
   cl = { 'Position' 'Fit' 'Residual'};
   create LINPLOT from pout[ rowname=rl colname=cl ];
      append from pout[ rowname=rl ];
   close;
 
   /* Modification by Wicklin: create two macro variables for 
      labeling the plotting positions. For example:
     POSNAME = 'larceny' 'burglary' 'auto_thf' 'rape' 'robbery' 'assault' 'murder'
     POSVALUES= -1.698 -1.014 -0.374 0.1389 0.5016 0.5805 2.1392
   */
   M1 = compress("'" + vars + "'");
   M1 = rowcat(M1 + " ");
   M2 = rowcat(char(pos`,6) + " ");
   call symputx("posNames", M1);
   call symputx("posValues",M2);
quit;
 
%put &=posNames;
%put &=posValues;
 
/* Use the POSNAMES and POSVALUES macro variables to add custom 
   tick labels and values. See
   https://blogs.sas.com/content/iml/2020/03/23/custom-tick-marks-sas-graph.html
*/
ods graphics / height=640px width=640px;
title "Optimal Linear Profiles for City Crime Rates";
footnote J=L "Friendly (1991, p. 430) Output 8.11";
proc sgplot data=LINPLOT(rename=(RL=&ID));
  label fit="Fitted Crime Index" position="Crime (Scaled Position)";
  series x=position y=fit / group=city curvelabel curvelabelloc=outside;
  xaxis values = (&posValues) valuesdisplay=(&posNames) fitpolicy=stagger;  
run;
footnote;title;

The optimal linear profile plot is shown for the crimes data. Unlike Friendly, I do not show the points along these lines. I think adding markers gives the false impression that the method has perfectly linearized the data.

The X axis shows the optimal positions of the crimes. They are roughly ordered so that crimes against property are on the left and violent crimes against persons are on the right. As is often the case, the PC interpretation does not perfectly agree with our intuitions. For these data, the analysis has placed rape to the left of robbery and assault. Notice that I use the VALUES= and VALUESDISPLAY= options to specify the positions of the labels, and I use the FITPOLICY=STAGGER option to prevent the labels from colliding.

The cities can be classified according to their linear trends. For brevity, let's call larceny and burglary "property crimes," and let's call robbery, assault, and murder "violent crimes." Then you can observe the following classifications for the cities:

  • The lines for some cities (Dallas, Chicago, Houston) have a large positive slope. This indicates that the crime rates for property crimes are low relative to the rates of violent crimes.
  • The lines for other cities (Los Angeles, New York, Denver) have a large negative slope. This indicates that the crime rates for property crimes are high relative to the rates of violent crimes.
  • The lines for other cities (Kansas City, Tucson, Hartford) are relatively flat. This indicates that the crime rates for property crimes are about the same as the rates of violent crimes.
  • When the line for one city is above the line for another city, it means that the first city tends to have higher crime rates across the board than the second city. For example, the line for Los Angeles is above the line for New York. If you look at the raw data, you will see that the crime rates in LA are mostly higher, although NYC has higher rates for robbery and larceny. Similarly, the rates for Portland are generally higher than the rates for Honolulu, except for auto theft.
  • The ordering of the city labels is based primarily on the violent crime rates.

The following color image is copied from Friendly (1991, Appendix 3, p. 673) so that you can compare the SAS/GRAPH output to the newer output.

An alternative to the optimal linear profile plot

I'll be honest, I am not a big fan of the optimal linear profile plot. I think it can give a false impression about the data. It displays a linear model but does not provide diagnostic tools to assess issues such as outliers or deviations from the model. For example, the line for Boston shows no indication that auto theft in that city was extremely high in 1970. In general, there is no reason to think that one set of positions for the categories (crimes) leads to a linear profile for all subjects (cities).

If you want to show the data instead of a linear model, see my previous article, "Profile plots in SAS," and consider using the heat map near the end of the article. If you want to observe similarities and differences between the cities based on the multivariate data, I think performing a principal component analysis is a better choice. The following call to PROC PRINCOMP does a good job of showing similarities and differences between the categories and subjects. The analysis uses the correlation matrix, which is equivalent to saying that the crime rates have been standardized.

ods graphics/reset;
proc princomp data=crime out=Scores N=2 plot(only)=(PatternProfile Pattern Score);
   var murder rape robbery assault burglary larceny auto_thf;
   id City;
run;

For this analysis, the "Pattern Profile Plot" (not shown) shows that the first PC can be interpreted as the total crime rate. The second PC is a contrast between property crimes and violent crimes. The component pattern plot shows that the positions of the variables along the second PC are nearly the same as the positions computed by the optimal linear profile algorithm. You can see property crimes in the lower half of the plot and violent crimes in the upper half. The "Score Plot" enables you to see the following characteristics of cities:

  • Cities to the right have high overall crime rates. Cities to the left have low overall crime rates.
  • Cities near the top have higher rates of violent crime relative to property crimes. Cities near the bottom have higher rates of property crimes. Cities near the middle have rates of property and violent crimes that are roughly in balance.

In my opinion, the score plot provides the same information as the optimal linear profile plot. However, interpreting the score plot requires knowledge of principal components, which is an advanced skill. The optimal linear profile plot tries to convey this information in a simpler display that is easier to explain to a nontechnical audience.

Summary

This article reproduces Friendly's (1991) implementation of Hartigan's optimal linear profile plot. Friendly distributes the %LINPRO macro, which uses SAS/IML to carry out the computations and uses SAS/GRAPH to construct the linear profile plot. This article reproduces the SAS/IML computations but replaces the SAS/GRAPH calls with a call to PROC SGPLOT.

This article also discusses the advantages and disadvantages of the optimal linear profile plot. In my opinion, there are better ways to explore the similarities between subjects across many variables, including heat maps and score plots in principal component analysis.

The post Optimal linear profile plots in SAS appeared first on The DO Loop.

November 14, 2022
 

A profile plot is a compact way to visualize many variables for a set of subjects. It enables you to investigate which subjects are similar to or different from other subjects. Visually, a profile plot can take many forms. This article shows several profile plots: a line plot of the original data, a line plot of standardized data, and a heat map.

This article uses crime data to illustrate the various kinds of plots. The line plot at the right shows the incidence rate for seven types of crime in 16 US cities in 1970. The profile plot looks somewhat like a spaghetti plot, but a spaghetti plot typically features a continuous variable (often, time) along the horizontal axis, whereas the horizontal axis in a profile plot features discrete categories. The vertical axis shows a response variable, which is often a count, a rate, or a proportion. The line segments enable you to trace the rates for each city. You can use the graph to discover which cities have high crime rates, which have low rates, and which are high for some crimes but low for others.

In the example, the subjects are cities; the variables are the seven crime rates. However, a profile plot is applicable to many fields. For example, in a clinical setting, the subjects could be patients and the variables could be the results of certain lab tests, such as blood levels of sugar, cholesterol, iron, and so forth. Be aware, however, that there are several different graphs that are known as "patient profile plots." Some are more complicated than the profile plots in this article.

A simple profile plot

For the city crime data, all variables are measured as rates per 100K people. Because all variables are measured on the same scale, you can plot the raw data. You could use a box plot if you want to show the distribution of rates for each crime. However, the purpose of a profile plot is to be able to trace each subject's relative rates across all variables. To do that, let's use a line plot.

The data are originally from the United States Statistical Abstracts (1970) and were analyzed by John Hartigan (1975) in his book Clustering Algorithms. The data were reanalyzed by Michael Friendly (SAS System for Statistical Graphics, 1991). This article follows Friendly's analysis for the line plots. The heat map in the last section is not considered in Friendly (1991).

The following SAS DATA step defines the data in the wide format (seven variables and 16 rows). However, to plot the data as a line plot in which colors identify the cities, it is useful to convert the data from wide to long format. For this example, I use PROC TRANSPOSE. You can then use the SERIES statement in PROC SGPLOT to visualize the rates for each crime and for each city.

/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city &$16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* convert to long format and use alphabetical ordering along X axis */
proc transpose data=crime out=crime2(rename=(col1=rate)) name=crime;
   by city;
run;
proc sort data=crime2;  by crime;  run;
 
title "Profile Plot of Crime Rates in US Cities in 1970";
footnote J=L "Friendly (1991, p. 421) Output 8.9";
proc sgplot data=crime2;
  label rate="Crime Rate (per 100K population)" crime="Crime";
  series x=crime y=rate / group=city markers;
run;

The graph is shown at the top of this article. It is not very useful. The graph does not enable you to trace the crime rate patterns for each city. To quote Friendly (1991, p. 422): "In plotting the raw data, the high-frequency crimes such as burglary completely dominate the display, and the low-frequency crimes (murder, rape) are compressed to the bottom of the plot. Because of this, the possibility that the cities may differ in their patterns is effectively hidden."

Whenever you are trying to graph data that are on different scales, it is useful to transform the scale of the data. You could do this globally (for example, a logarithmic transformation) or you can focus on relative aspects. Following Friendly (1991), the next section standardizes the data, thereby focusing the attention on which cities have relatively high or low crime rates for various crimes.
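
As an aside, the following sketch shows the "global transformation" option by displaying the raw rates on a logarithmic scale. This is not part of Friendly's analysis; it is merely one alternative to consider before the next section standardizes the data.

title "Profile Plot of Crime Rates (Log Scale)";
proc sgplot data=crime2;
  label rate="Crime Rate (per 100K population)" crime="Crime";
  series x=crime y=rate / group=city markers;
  yaxis type=log logbase=10;   /* a log scale keeps the high-frequency crimes from dominating */
run;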

A profile plot for standardized data

The most common way to standardize data is to use an affine transformation for each variable so that the standardized variables have mean 0 and unit variance. You can use PROC STANDARD in SAS to standardize the variables. You can specify the name of variables by using the VAR statement, or you can use the keyword _NUMERIC_ to standardize all numeric variables in the data. The following statements standardize the original wide data and then convert the standardized data to long format for graphing.

proc standard data=crime mean=0 std=1 out=stdcrime;
   var _numeric_;
run;
/* convert to long format and ensure a consistent ordering along X axis */
proc transpose data=stdcrime out=stdcrime2(rename=(col1=rate)) name=crime;
   by city;
run;
proc sort data=stdcrime2;  by crime;  run;
 
title "Profile Plot of Standardized Crime Rates in 1970";
footnote J=L "Friendly (1991, p. 423) Output 8.10";
proc sgplot data=stdcrime2;
  label rate="Crime Rate (standard score)" crime="Crime";
  series x=crime y=rate / group=city markers;
run;

Unfortunately, for these data, this graph is not much better than the first profile plot. Friendly (1991, p. 423) calls the plot "dismal" because it is "hard to see any similarities among the cities in their patterns of crime." With a close inspection, you can (barely) make out the following facts:

  • For most crimes, Los Angeles, Detroit, and New York have high rates.
  • For most crimes, Tucson and Hartford have low rates.
  • The crime rate in Boston is low except for auto theft.

This plot is similar to a parallel coordinates plot. There are three ways to improve the standardized profile plot:

  • Rearrange the order of the categories along the X axis by putting similar categories next to each other. I created one possible re-ordering, in which property crimes are placed on the left and crimes against persons are placed on the right. (A sketch of one way to do this appears after this list.) The profile plot is slightly improved, but it is still hard to read.
  • Change the affine transformation for the variables. Instead of always using mean 0 and unit variance, perhaps there is a better way to transform the response values? And perhaps using nonuniform spacing for the categories could make the profiles easier to read? These two ideas are implemented in the "optimal linear profile plot" (Hartigan, 1975; Friendly, 1991). I will show how to create the optimal linear profile plot in a subsequent article.
  • Abandon the spaghetti-like line plot in favor of a lasagna plot, which uses a heat map instead of a line plot. This is shown in the next section.
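
The following sketch shows one way to implement the first idea (re-ordering the categories). The specific ordering, the ORDER variable, and the use of DISCRETEORDER=DATA are my own choices; the re-ordering used for the graph mentioned above might differ.

/* sketch: assign a custom order to the categories, sort the long data,
   and tell PROC SGPLOT to honor the data order */
data stdcrime3;
   set stdcrime2;
   select (upcase(crime));              /* property crimes on the left, crimes against persons on the right */
      when ('AUTO_THF') order=1;
      when ('LARCENY')  order=2;
      when ('BURGLARY') order=3;
      when ('ROBBERY')  order=4;
      when ('ASSAULT')  order=5;
      when ('RAPE')     order=6;
      when ('MURDER')   order=7;
      otherwise         order=99;
   end;
run;
proc sort data=stdcrime3;  by order city;  run;
 
title "Profile Plot of Standardized Crime Rates (Reordered Categories)";
proc sgplot data=stdcrime3;
  label rate="Crime Rate (standard score)" crime="Crime";
  series x=crime y=rate / group=city markers;
  xaxis discreteorder=data;             /* use the sorted order of the categories */
run;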

A lasagna plot of profiles

The main advantage of a lasagna plot is that data are neatly displayed in rows and columns. For each subject, you can examine the distribution of the response variable across categories. Furthermore, a rectangular heat map treats subjects and categories symmetrically: for each category, you can also examine the distribution of the response variable across subjects! For these data, the subjects are cities, the categories are crimes, and the response variable is either the raw crime rate or the standardized rate. The disadvantage of a heat map is that the response values are represented by colors, and it can be difficult to perceive changes in colors for some color ramps. On the other hand, you can always print the data in tabular form if you need to know the exact values of the response variables.

The following statements create a lasagna plot for the standardized data in this example. Like the line plot, the heat map uses the data in long form:

%let colormap = (CX0571B0 CX92C5DE CXF7F7F7 CXF4A582 CXCA0020);
/* lasagna plot: https://blogs.sas.com/content/iml/2016/06/08/lasagna-plot-in-sas.html 
   and https://blogs.sas.com/content/iml/2022/03/28/viz-missing-values-longitudinal.html */
title "Lasagna Plot of City Profiles";
title2 "Standardized Crime Rates in 1970";
proc sgplot data=stdcrime2;
   label city="City" crime="Crime" rate="Standardized Rate";
   heatmapparm x=crime y=city colorresponse=rate / outline outlineattrs=(color=gray) colormodel=&colormap; 
run;

This is a better visualization of the crime profiles for each city. You can easily see the standardized rates and detect patterns among the cities. You can see that Detroit, Kansas City, Los Angeles, New York, and Washington (DC) have high relative rates. You can see that Atlanta, Boston, Hartford, Portland, and Tucson tend to have relatively low rates. You can see the crimes for which Atlanta, Boston, and Portland do not have low rates.

This display uses an alphabetical ordering of both the categories (crimes) and the subject (cities). However, you can sort the rows and columns of a lasagna plot to improve the visualization. For example, the following heat map sorts the cities by the average standardized crime rate across all categories. It also arranges the horizontal axis so that nonviolent crimes against property are on the left and violent crimes against people on the right:
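
The program that produced the sorted heat map is not shown here, but the following sketch outlines one way to create a similar graph. It uses PROC SQL to compute the mean standardized rate for each city, sorts the long data by that mean, and uses DISCRETEORDER=DATA so that the heat map honors the sorted order. The ordering of the crime categories could be handled in the same way as in the earlier sketch.

/* sketch: sort the subjects (cities) by mean standardized rate, then
   let the heat map honor the data order */
proc sql;
   create table stdcrimeSorted as
   select a.*, b.meanRate
   from stdcrime2 as a
        left join (select city, mean(rate) as meanRate
                   from stdcrime2 group by city) as b
   on a.city = b.city
   order by b.meanRate, a.city;
quit;
 
title "Lasagna Plot of City Profiles (Cities Sorted by Mean Rate)";
proc sgplot data=stdcrimeSorted;
   label city="City" crime="Crime" rate="Standardized Rate";
   heatmapparm x=crime y=city colorresponse=rate / outline outlineattrs=(color=gray) colormodel=&colormap;
   yaxis discreteorder=data;      /* cities appear in sorted order */
run;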

This arrangement makes it easier to see outliers such as the orange and red cells in the lower left of the graph. Both Portland and Honolulu have rates of property crime that are higher than you might suspect, given their overall low crime rates. Similarly, Boston has relatively high rates of auto theft compared to its otherwise low crime rates. In the same way, the relatively low rate of rapes in New York City is apparent.

Summary

This article shows three ways to create a profile plot in SAS. A profile plot enables you to visualize the values of many variables for a set of subjects. The simplest profile plot is a line plot of the raw values, where each line corresponds to a subject and the discrete horizontal axis displays the names of variables. However, if the variables are measured in different units or on different scales, it is useful to rescale the values of the variables. One way is to create a line plot of the standardized variables, which displays the relative values (high or low) of each variable. A third display is a heat map (sometimes called a lasagna plot), which uses color to display the high or low values of each variable.

In this article, the categories are uniformly spaced along the horizontal axis. In a subsequent article, I will show an alternative plot that uses nonuniform spacing.

The post Profile plots in SAS appeared first on The DO Loop.

November 7, 2022
 

I recently blogged about how to compute the area of the convex hull of a set of planar points. This article discusses the expected value of the area of the convex hull for n random uniform points in the unit square. The article introduces an exact formula (due to Buchta, 1984) and uses Monte Carlo simulation to approximate the sampling distribution of the area of a convex hull. A simulation like this is useful when testing data against the null hypothesis that the points are distributed uniformly at random.

The expected value of a convex hull

Assume that you sample n points uniformly at random in the unit square, n ≥ 3. What is the expected area of the convex hull that contains the points? Christian Buchta ("Zufallspolygone in konvexen Vielecken," 1984) proved a formula for the expected value of the area of n random points in the unit square. (The unit square is sufficient: For any other rectangle, simply multiply the results for the unit square by the area of the rectangle.)

Unfortunately, Buchta's result is published in German and is behind a paywall, but the formula is given as an answer to a question on MathOverflow by 'user9072'. Let \(s = \sum_{k=1}^{n+1} \frac{1}{k} (1 - \frac{1}{2^k}) \). Then the expected area of the convex hull of n points in [0,1] x [0,1] is \( E[A(n)] = 1 -\frac{8}{3(n+1)} \bigl[ s - \frac{1}{(n+1)2^{n+1}} \bigr] \)

The following SAS DATA step evaluates Buchta's formula for random samples of n ≤ 100. A few values are printed. You can use PROC SGPLOT to graph the expected area against n:

/* Expected area of the convex hull of a random sample of size n in the unit square.
   Result by Buchta (1984, in German), as reported by 
   https://mathoverflow.net/questions/93099/area-enclosed-by-the-convex-hull-of-a-set-of-random-points
*/
data EArea;
do n = 3 to 100;
   s = 0;
   do k = 1 to n+1;
      s + (1 - 1/2**k) / k;
   end;
   EArea = 1 - 8/3 /(n+1) * (s - 2**(-n-1) / (n+1));
   output;
end;
keep n EArea;
run;
 
proc print data=EArea noobs; 
   where n in (3,4,6,8,10,12,13) or mod(n,10)=0;
run;
 
title "Expected Area of Convex Hull";
proc sgplot data=EArea;
   series x=n y=EArea;
   xaxis grid label="Number of Random Points in Unit Square";
   yaxis grid label="Expected Area" min=0 max=1;
run;

The computation shows that when n is small, the expected area, E[A(n)], is very small. The first few values are E[A(3)]=11/144, E[A(4)]=11/72, E[A(5)] = 79/360, and E[A(6)] = 199/720. For small n, the expected area increases quickly as n increases. When n = 12, the expected area is about 0.5 (half the area of the bounding square). As n gets larger, the expected area asymptotically approaches 1 very slowly. For n = 50, the expected area is 0.8. For n = 100, the expected area is 0.88.

The distribution of the area of a convex hull

Buchta's result gives the expected value, but the expected value is only one number. You can use Monte Carlo simulation to compute and visualize the distribution of the area statistics for random samples. The distribution enables you to estimate other quantities, such as the median area and an interval that predicts the range of the areas for 95% of random samples.

The following SAS/IML program performs a Monte Carlo simulation to approximate the sampling distribution of the area of the convex hull of a random sample in the unit square of size n = 4. The program does the following:

  1. Load the PolyArea function, which was defined in a previous article that shows how to compute the area of a convex hull. Or you can copy/paste from that article.
  2. Define a function (CHArea) that computes the convex hull of a set of points and returns the area.
  3. Let B be the number of Monte Carlo samples, such as B = 10,000. Use the RANDFUN function to generate B*n random points in the unit square. This is an efficient way to generate B samples of size n.
  4. Loop over the samples. For each sample, extract the points in the sample and call the CHArea function to compute the area.
  5. The Monte Carlo simulation computes B sample areas. The distribution of the areas is the Monte Carlo approximation of the sampling distribution. You can compute descriptive statistics such as mean, median, and percentiles to characterize the shape of the distribution.

proc iml;
/* load the PolyArea function or copy it from 
   https://blogs.sas.com/content/iml/2022/11/02/area-perimeter-convex-hull.html
*/
load module=(PolyArea);
 
/* compute the area of the convex hull of n pts in 2-D */
start CHArea(Pts);
   indices = cvexhull(Pts);            /* compute convex hull of the N points */
   hullIdx = indices[loc(indices>0)];  /* the positive indices are on the CH */
   P = Pts[hullIdx, ];                 /* extract rows to form CH polygon */
   return PolyArea(P);                 /* return the area */
finish;
 
call randseed(1234);
NSim = 1E4;                            /* number of Monte Carlo samples */
n = 4;
X = randfun((NSim*n)//2, "Uniform");   /* matrix of uniform variates in [0,1]x[0,1] */
Area = j(NSim, 1, .);                  /* allocate array for areas */
do i = 1 to NSim;
   beginIdx = (i-1)*n + 1;             /* first obs for this sample */
   ptIdx    = beginIdx:(beginIdx+n-1); /* range of obs for this sample */
   Pts = X[ptIdx, ];                   /* use these n random points */
   Area[i] = CHArea(Pts);              /* find the area */
end;
 
meanArea = mean(Area);
medianArea = median(Area);
call qntl(CL95, Area, {0.025 0.975});
print n meanArea medianArea (CL95`)[c={LCL UCL}];

For samples of size n = 4, the computation shows that the Monte Carlo estimate of the expected area is 0.154, which is very close to the expected value (0.153). In addition, the Monte Carlo simulation enables you to estimate other quantities, such as the median area (0.139) and a 95% interval estimate for the predicted area ([0.022, 0.369]).

You can visualize the distribution of the areas by using a histogram. The following SAS/IML statements create a histogram and overlay a vertical reference line at the expected value.

title "Distribution of Areas of Convex Hull";
title2 "N=4; NSim = 10,000";
refStmt = "refline 0.15278 / axis=x lineattrs=(color=red) label='E(Area)';";
call histogram(Area) other=refStmt;

The histogram shows that the distribution of areas is skewed to the right for n = 4. For this distribution, a better measure of centrality might be the median, which is smaller than the expected value (the mean). Or you can describe the distribution by reporting a percentile range such as the inter-quartile range or the range for 95% of the areas.
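
For example, the following statements (a sketch, meant to be run in the same PROC IML session so that the Area vector is still in memory) report the quartiles and the interquartile range of the simulated areas:

/* quartiles and interquartile range of the simulated areas */
call qntl(Q, Area, {0.25 0.5 0.75});
IQR = Q[3] - Q[1];
print (Q`)[c={'Q1' 'Median' 'Q3'} L="Quartiles of Area"] IQR;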

Graphs of the distribution of the area of a convex hull

You can repeat the previous Monte Carlo simulation for other values of n. For example, if you use n = 50, you get the histogram shown to the right. The distribution for n = 50 is skewed to the left. A prediction interval for 95% of areas is [0.69, 0.89].

Whether or not you know about Buchta's theoretical result, you can use Monte Carlo simulation to estimate the expected value for any sample size. The following SAS/IML program simulates 10,000 random samples of size n for n in the range [3, 100]. For each simulation, the program writes the estimate of the mean and a prediction interval to a SAS data set. You can use the BAND statement in PROC SGPLOT to overlay the interval estimates and the estimate of the expected value for each value of n:

/* repeat Monte Carlo simulation for many values of N. Write results to a data set and visualize */
create ConvexHull var {'n' MeanArea LCL UCL};
 
numPts = 3:100;
NSim = 1E4;                            /* number of Monte Carlo samples */
do j = 1 to ncol(numPts);                 /* number of random pts for each convex hull */
   N = numPts[j];
   X = randfun((NSim*N)//2, "Uniform");   /* matrix of uniform variates in [0,1]x[0,1] */
   Area = j(NSim, 1, .);                  /* allocate array for areas */
   LCL = j(NSim, 1, .);                   /* allocate arrays for 95% lower/upper CL */
   UCL = j(NSim, 1, .);
   do i = 1 to NSim;
      beginIdx = (i-1)*N + 1;             /* first obs for this sample */
      ptIdx    = beginIdx:(beginIdx+N-1); /* range of obs for this sample */
      Pts = X[ptIdx, ];                   /* use these N random points */
      Area[i] = CHArea(Pts);
   end;
 
   meanArea = mean(Area);
   call qntl(CL95, Area, {0.025 0.975});
   LCL = CL95[1];   UCL = CL95[2];
   append;
end;
close ConvexHull;
QUIT;
 
title "Monte Carlo Estimate of Expected Area of Convex Hull";
proc sgplot data=ConvexHull noautolegend;
   band x=n lower=LCL upper=UCL / legendlabel="95% Prediction";
   series x=n y=meanArea / legendlabel="Expected Area";
   keylegend / location=inside position=E opaque across=1;
   xaxis grid label="Number of Random Points in Unit Square";
   yaxis grid label="Expected Area" min=0 max=1;
run;

From this graph, you can determine the expected value and a 95% prediction range for any sample size, 3 ≤ n ≤ 100. Notice that the solid line is the Monte Carlo estimate of the expected value. This estimate is very close to the true expected value, which we know from Buchta's formula.
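
If you want to see how closely the two curves agree, one option (a sketch that assumes the EArea data set from the earlier DATA step is still available) is to merge the exact values with the Monte Carlo estimates and overlay the curves:

/* overlay Buchta's exact expected value on the Monte Carlo estimates */
data Compare;
   merge ConvexHull EArea;    /* both data sets are in order of n */
   by n;
run;
 
title "Monte Carlo Estimate versus Exact Expected Area";
proc sgplot data=Compare noautolegend;
   band x=n lower=LCL upper=UCL / legendlabel="95% Prediction";
   series x=n y=meanArea / legendlabel="Monte Carlo Estimate";
   series x=n y=EArea / lineattrs=(pattern=dash) legendlabel="Exact E[A(n)]";
   keylegend / location=inside position=E opaque across=1;
   xaxis grid label="Number of Random Points in Unit Square";
   yaxis grid label="Expected Area" min=0 max=1;
run;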

Summary

Some researchers use the area of a convex hull to estimate the area in which an event occurs. The researcher might want to test whether the area is significantly different from the area of the convex hull of a random point pattern. This article presents several properties of the area of the convex hull of a random sample in the unit square. The expected value of the area is known theoretically (Buchta, 1984). Other properties can be estimated by using a Monte Carlo simulation.

You can download the SAS program that generates the tables and graphs in this article.

The post The area of the convex hull of random points appeared first on The DO Loop.

November 2, 2022
 

The area of a convex hull enables you to estimate the area of a compact region from a set of discrete observations. For example, a biologist might have multiple sightings of a wolf pack and want to use the convex hull to estimate the area of the wolves' territory. A forestry expert might use a convex hull to estimate the area of a forest that has been infested by a beetle or a blight. This article shows how to compute the area and perimeters of convex hulls in SAS. Because a convex hull is a convex polygon, we present formulas for the area and perimeter of polygons and apply those formulas to convex hulls.

Gauss' shoelace formula for the area of a polygon

There are many formulas for the area of a planar polygon, but the one used in this article is known as Gauss' shoelace formula, or the triangle formula. This formula enables you to compute the area of a polygon by traversing the vertices of the polygon in order. The formula uses only the coordinates of adjacent pairs of vertices. Assume a polygon has n vertices that are enumerated counterclockwise, and the coordinates are \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\). Then the area of the polygon is given by the shoelace formula
\(A = \frac{1}{2} \sum_{i=1}^{n} (x_i y_{i+1} - x_{i+1} y_i)\)
where we use the convention that \((x_{n+1}, y_{n+1}) = (x_1, y_1)\).

To me, the remarkable thing about this formula is that it looks like it has nothing to do with an area. However, the Appendix derives Gauss' shoelace formula from Green's theorem in the plane, which reveals the connection between the formula and the area of the polygon.

You might wonder why this is called the "shoelace formula." Suppose a polygon has vertices \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\). A schematic visualization of the formula is shown to the right. You list the X and Y coordinates of the vertices and append the first vertex to the end of the list. The blue arrows represent multiplication. The red arrows represent multiplication with a negative sign. So, the first two rows indicate the quantity \(x_1 y_2 - x_2 y_1\). (If you are familiar with linear algebra, you can recognize this quantity as the determinant of the 2 x 2 matrix with first row \((x_1, y_1)\) and second row \((x_2, y_2)\).) The sequence of arrows looks like criss-cross lacing on a shoe.

Implementing the area and perimeter of a polygon in SAS

The visualization of the shoelace formula suggests an efficient method for computing the area of a polygon. First, append the first vertex to the end of the list of vertices. Then, define the vector Lx = lag(x) as the vector with elements \(x_2, x_3, \ldots, x_n, x_1\). Define Ly = lag(y) similarly. Then form the vector d = x#Ly - y#Lx, where the '#' operator indicates elementwise multiplication. The area of the polygon is sum(d)/2. This computation is implemented in the following SAS/IML program, which also implements a function that computes the Euclidean distance along the edges of the polygon. The perimeter formula is given by
\(s = \sum_{i=1}^{n} \sqrt{(x_{i+1} -x_i)^2 + (y_{i+1} -y_i)^2}\), where \((x_{n+1}, y_{n+1}) = (x_1, y_1)\).

proc iml;
/* The following functions are modified from 
   Rick Wicklin (2016). The Polygon package. SAS Global Forum 2016 */
/* PolyArea: Return the area of a simple polygon.
   P is an N x 2 matrix of (x,y) values for vertices. N > 2.
   This function uses Gauss' "shoelace formula," which is also known as the 
   "triangle formula."  See https://en.wikipedia.org/wiki/Shoelace_formula  */
start PolyArea(P);
   lagIdx = 2:nrow(P) || 1;   /* vertices of the lag; appends first vertex to the end of P */
   xi   = P[ ,1];       yi = P[ ,2];
   xip1 = xi[lagIdx]; yip1 = yi[lagIdx]; 
   area = 0.5 * sum(xi#yip1 - xip1#yi);
   return( area );
finish;
 
/* PolyPerimeter: Return the perimeter of a polygon.
   P is an N x 2 matrix of (x,y) values for vertices. N > 2.  */
start PolyPeri(_P);
   P = _P // _P[1,];     /* append first vertex */
   /* (x_{i+1} - x_i)##2 + (y_{i+1} - y_i)##2 */
   v = dif(P[,1])##2 + dif(P[,2])##2;  /* squared distance from i_th to (i+1)st vertex */
   peri = sum(sqrt(v));                /* sum of edge lengths */
   return( peri );
finish;
 
/* demonstrate by using a 3:4:5 right triangle */
Pts = {1 1,
       4 1,
       4 5};
Area = PolyArea(Pts);
Peri = PolyPeri(Pts);
print Area Peri;

The output shows the area and perimeter for a 3:4:5 right triangle with vertices (1,1), (4,1), and (4,5). As expected, the area is 0.5*base*height = 0.5*3*4 = 6. The perimeter is the sum of the edge lengths: 3+4+5 = 12.

Notice that neither of the functions uses a loop over the vertices of the polygon. Instead, the area is computed by using vectorized computations. Essentially, the area formula uses the LAG of the vertex coordinates, and the perimeter formula uses the DIF of the coordinates.
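
To make the idea concrete, here is a tiny sketch (run it inside the same PROC IML session) that illustrates the wrap-around "lag" index trick and the DIF function on the x-coordinates of the triangle:

x = {1, 4, 4};                /* x-coordinates of the 3:4:5 triangle */
lagIdx = 2:nrow(x) || 1;      /* wrap-around indices {2 3 1} */
Lx = x[lagIdx];               /* "lagged" coordinates: 4, 4, 1 */
d = dif(x);                   /* successive differences: ., 3, 0 (DIF puts a missing value first) */
print x Lx d;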

Apply the area and perimeter formulas to convex hulls

You can use the previous functions to compute the area and perimeter of the 2-D convex hull of a set of planar points. In SAS, you can use the CVEXHULL function to compute a convex hull. The CVEXHULL function returns the indices of the points on the convex hull in counterclockwise order. You can therefore extract those points to obtain the polygon that encloses the points. You can send this polygon to the functions in the previous section to obtain the area and perimeter of the convex hull, as follows:

/* Use the CVEXHULL function in SAS to compute 2-D convex hulls. See
   https://blogs.sas.com/content/iml/2021/11/15/2d-convex-hull-sas.html  */
points = {0  2, 0.5 2, 1 2, 0.5 1, 0 0, 0.5 0, 1  0, 
          2 -1,   2 0, 2 1,   3 0, 4 1,   4 0, 4 -1, 
          5  2,   5 1, 5 0,   6 0 }; 
 
/* Find the convex hull: indices on the convex hull are positive */
Indices = cvexhull( points ); 
hullIdx = indices[loc(indices>0)];   /* extract vertices with positive indices */
convexPoly = points[hullIdx, ];      /* extract rows to form a polygon */
print hullIdx convexPoly[c={'cx' 'cy'} L=""];
 
/* what is the area of the convex hull? */
AreaCH = PolyArea(convexPoly);
PeriCH = PolyPeri(convexPoly);
print AreaCH PeriCH;

The output shows the indices and coordinates of the convex polygon that bounds the set of points. It is a two-column matrix, where each row represents the coordinates of a vertex of the polygon. The area of this polygon is 15 square units. The perimeter is 15.71 units. If these coordinates are the locations of spatial observations, you can compute the area and perimeter of the convex hull that contains the points.

Summary

A convex hull is a polygon. Accordingly, you can use formulas for the area and perimeter of a polygon to obtain the area and perimeter of a convex hull. This article shows how to efficiently compute the area and perimeter of the convex hull of a set of planar points.

Appendix: Derive Gauss' shoelace formula from Green's theorem

If a polygon has n vertices with coordinates \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), we want to prove Gauss' shoelace formula:
\(A = \frac{1}{2} \sum_{i=1}^{n} (x_i y_{i+1} - x_{i+1} y_i)\)
where we use the convention that \((x_{n+1}, y_{n+1}) = (x_1, y_1)\).

At first glance, Gauss' shoelace formula does not appear to be related to areas. However, the formula is connected to area because of a result from multivariable calculus known as "the area corollary of Green's theorem in the plane." Green's theorem enables you to compute the area of a planar region by computing a line integral around the boundary of the region. That is what is happening in Gauss' shoelace formula. The "area corollary" says that you can compute the area of the polygon by computing \(\oint_C x/2\; dy - y/2\; dx\), where C is the piecewise linear path along the perimeter of the polygon.

Let's evaluate the integral on the edge between the i_th and (i+1)st vertex. Parameterize the i_th edge as
\((x(t), y(t)) = ( x_i + (x_{i+1}-x_i)t, y_i + (y_{i+1}-y_i)t )\quad \mathrm{for} \quad t \in [0,1]\). With this parameterization, \(dx = (dx/dt)\; dt = (x_{i+1}-x_i)\; dt\) and \(dy = (dy/dt)\; dt = (y_{i+1}-y_i)\; dt\).

When you substitute these expressions into the line integral, the integrand becomes a linear polynomial in t, but a little algebra shows that the linear terms cancel and only the constant terms remain. Therefore, the line integral on the i_th edge becomes
\(\frac{1}{2} \int_{\mathrm{edge}} x\; dy - y\; dx = \frac{1}{2} \int_0^1 \bigl[ x_i (y_{i+1}-y_i) - y_i (x_{i+1}-x_i) \bigr] \, dt = \frac{1}{2} (x_i y_{i+1}- y_i x_{i+1})\).
When you evaluate the integral on all edges of the polygon, you get a summation, which proves Gauss' shoelace formula for the area of a polygon.

The post The area and perimeter of a convex hull appeared first on The DO Loop.

October 31, 2022
 

Every year, I write a special article for Halloween in which I show a SAS programming TRICK that is a real TREAT! This year, the trick is to concatenate two strings into a single string in a way that guarantees you can always recover the original strings. I learned this trick from @RobinHouston on Twitter.

Why would you want to combine and pull apart strings?

I have combined strings when I wanted to concatenate levels of two categorical variables to form an interaction variable. By combining the strings, you can create a box plot that shows the response for each joint level of two categorical variables. I have not needed to separate the strings later, but I can imagine scenarios in which that capability might be useful.
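
For example, the following sketch (my own illustration, using the Sashelp.Cars data and the CATX function) forms an interaction variable from two categorical variables and uses it as the category in a box plot:

data CarsJoint;
   set sashelp.cars;
   length Joint $40;
   Joint = catx('/', Origin, Type);   /* joint levels such as 'Asia/Sedan' */
run;
 
title "Response for Joint Levels of Two Categorical Variables";
proc sgplot data=CarsJoint;
   vbox mpg_city / category=Joint;
   xaxis display=(nolabel);
run;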

To give a concrete example, suppose you have two strings: "ghost" and "goblin." The first trick constructs a string such as "ghost/goblin" and shows you how to reconstruct the original values from that string if the first string does not contain the character ('/') that is used to separate the first string from the second. The second trick constructs a more complicated string: "ghost/goblinghost*goblin". You can always reconstruct the original values regardless of the characters in the strings.

The easy way: Use one delimiter

A simple technique for combining strings is to use a single delimiter to concatenate two strings and then pull them apart. That idea works provided that the delimiter is not part of the first string. Let's see how it works by concatenating the Make and Model character variables from the Sashelp.Cars data set. The following DATA step extracts a subset of the full data:

data Have;
set Sashelp.Cars;
by Make;
if First.Make;
keep Make Model;
run;
 
proc print data=Have(obs=5);
run;

Suppose that you want to concatenate the Make and Model variables to form a new string while still being able to determine where the first string ends and the second string begins. If you know that some character such as '/' is not part of the first string, you can use the CAT or CATS function to concatenate the strings. The CATS function strips off leading and trailing blanks from all strings. Because SAS strings are blank-padded, I prefer to use the CATS function, as shown in the following DATA step:

%let DELIM = /;    /* choose any rare character as a delimiter */
data SimpleCombine;
length Join $55;
set Have;
/* Use CAT to preserve leading and trailing blanks.
   Use CATS to strip leading and trailing blanks.   */
Join = cats(Make,"&DELIM",Model);  
run;
 
proc print data=SimpleCombine(obs=5);
run;

The JOIN variable contains the concatenated strings, separated by the '/' character. If you want to recover the original two values (minus any spaces that were stripped off), you can use the FIND function to locate the position of the delimiter, then use the SUBSTR function to extract the substrings before and after the delimiter, as follows:

data SimpleSplit;
length v1 v2 $55;
set SimpleCombine;
loc = find(Join, "&DELIM");       /* location of delimiter */
v1 = substr(Join, 1, loc-1);
v2 = substr(Join, loc+1);
run;
 
proc print data=SimpleSplit(obs=5);
run;

The output shows that the v1 and v2 variables contain the original string values. The v1 and v2 variables do not contain any leading or trailing blanks, but they would if you use the CAT function instead of the CATS function. Thus, if the delimiter is not part of the first string, you can use a single delimiter to combine and split strings. The logic of the program assumes that the position of the first delimiter in the combined string is the location at which to split the string. That logic fails if the first string contains the delimiter as one of its characters.
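
The following sketch demonstrates the failure. The strings are my own example: the first string is 'A/C', which contains the delimiter, so the split occurs at the wrong position.

data FailDemo;
   length Join v1 v2 $55;
   Join = cats('A/C', '/', 'Ghost');   /* combined string: 'A/C/Ghost' */
   loc = find(Join, '/');              /* finds the '/' inside 'A/C' */
   v1 = substr(Join, 1, loc-1);        /* 'A'       (not 'A/C')   */
   v2 = substr(Join, loc+1);           /* 'C/Ghost' (not 'Ghost') */
run;
proc print data=FailDemo noobs;  var Join v1 v2;  run;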

A trick for combining and splitting any strings

In many situations, you know that the input strings will not contain a certain character, such as '/'. However, if you want a FOOLPROOF method that will work for ANY strings regardless of their contents, the trick from @RobinHouston becomes relevant. The trick enables you to combine and split any two strings!

The basic idea is to combine the strings by using two different delimiters instead of only one. I will use the delimiters '/' and '*', but the particular characters do not matter, as long as the two delimiters are different. You can use 'A' and 'B' or '0' and '1', if you prefer. The method still works even if a delimiter character is contained in one or both strings.

Here's how to combine the strings so that they can be split later. If the original strings are v1 and v2, you form the concatenated strings
      s1 = v1 + '/' + v2
and
      s2 = v1 + '*' + v2
Notice that the strings s1 and s2 have the same length. In fact, they are identical except for one location: s1 contains the first delimiter whereas s2 contains the second delimiter. You then store the concatenated string
      s = s1 + s2
which always has an even number of characters.

You can use the SAS DATA step to implement this combination step:

/* Use the characters '/' and '*' to mark the boundary between Make and Model 
   Note: '*'=ASCII 42 precedes '/'=ASCII 47 in the ASCII table */
data Encode;
length s1 s2 $55 Join $110;
set Have;
/* Use CAT to preserve leading and trailing blanks.
   Use CATS to strip leading and trailing blanks.   */
s1 = cats(Make,'/',Model);  
s2 = cats(Make,'*',Model);
Join = cats(s1,s2);        /* Join does not have leading spaces */
drop s1 s2;
run;
 
proc print data=Encode(obs=5);
run;

To recover the original strings from s, you can do the following:

  • Let s1 be the first half of s and s2 be the second half of s. Because s has an even number of characters, this operation is always well defined.
  • Use the COMPARE function to find the position, k, where s1 differs from s2. By construction, this is the location of the delimiter.
  • Recover v1 by extracting the first k-1 characters from s1, and recover v2 by extracting the characters of s1 that start at position k+1.

The following DATA step splits the string back into its original values:

/* Split the encoded string in half. Use the COMPARE function to find where the two 
   halves are different. Split the string at that location.  */
data Decode;
set Encode;
L = lengthc(trim(Join));         /* trim any trailing spaces */
s1 = substr(Join, 1, L/2);       /* first half of string */
s2 = substr(Join, L/2 + 1, L/2); /* second half */
/* find the location where strings differ */
diffLoc = abs(compare(s1, s2));
v1 = substr(s1, 1, diffLoc-1);
v2 = substr(s1, diffLoc+1);
drop s1 s2 diffLoc;
run;
 
proc print data=Decode(obs=5);
run;

The output shows that the v1 and v2 variables contain the original strings.

Summary

This article shows how to concatenate two strings so that you can recover the original string values. The first trick uses only one delimiter and requires that you choose a character that is not part of the first string. The second trick is more complicated, but it works for ANY strings, regardless of their contents.

I think both tricks are real treats! You might never have need for the second trick, but it is very clever. I like to write these tricks down in case someone in the future (maybe me!) searches for "how do I combine two strings so that I can recover the original strings?" Post a comment if you use either of these tricks!

The post A trick to combine and split strings appeared first on The DO Loop.

October 26, 2022
 

A SAS programmer asked how to create a graph that shows whether missing values in one variable are associated with certain values of another variable. For example, a patient who is supposed to monitor his blood glucose daily might have more missing measurements near holidays and in the summer months due to vacations. In this example, the "missingness" in the glucose variable is related to the date variable. In general, we say that missing values depend on another variable when the probability of obtaining a missing value in one variable depends on the value of another variable.

It can be useful to visualize the dependency if you are trying to impute values to replace the missing values. In missing-value imputation, the analyst must decide how to model the missing values. Three popular assumptions are that the values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). You can read more about these missing-value assumptions. The visualization in this article can help you decide whether to treat the data as MCAR or MAR.

Generate fake data and missing values

Let's generate some fake data. The following DATA step creates four variables:

  • The DATE variable does not contain any missing values. It contains daily values between 01JAN2018 and 31DEC2020.
  • The X1 variable is missing only in 2020. In 2020, there is an 8% probability that X1 is missing.
  • The X2 variable is missing only on Sundays. On Sundays, there is a 10% probability that X2 is missing.
  • The X3 variable is nonmissing from March through October. During the winter months, there is a 5% probability that X3 is missing.

/* generate FAKE data to use as an example */
data Have;
call streaminit(1234);
format Date DATE9.;
do Date = '01JAN2018'd to '31DEC2020'd;
   x1 = rand("Normal");
   x2 = rand("Normal");
   x3 = rand("Normal");
   /* assign missing values:
      x1 is missing only in 2020
      x2 is missing only on Sundays
      x3 is missing only in Nov, Dec, Jan, and Feb
   */
   if year(Date)=2020            & rand("Bernoulli", 0.08) then x1 = .;
   if weekday(date)=1            & rand("Bernoulli", 0.10) then x2 = .;
   if month(date) in (11,12,1,2) & rand("Bernoulli", 0.05) then x3 = .;
   output;
end;
run;

The goal of this article is to create a graph that indicates how the missing values in the X1-X3 variables depend on time. In a previous article, I showed how to use heat maps in SAS/IML to visualize missing values. In the next section, I use the SAS DATA step to transform the data into a form that you can visualize by using the HEATMAPPARM statement in PROC SGPLOT.

Visualize patterns of missing values

To visualize the patterns of missing values, I use the following ideas, which are numbered in the comments of the program below:

/* Plot missing values of X1-X3 versus the values of Date. */
%let IndepVar = Date;           /* variable to show on Y axis */
%let DepVars  = x1 x2 x3;       /* visualize missing values for these variables */
 
/* 1. Filter the data to keep only observations that have missing values.
   2. Convert from wide to long. New variables VARNAME and ISMISSING contain the 
      information to plot.   */
data Want;
set Have;
length varName $32;
array xVars[3] &DepVars;        /* response variables (inputs) */
ObsNum = _N_;                   /* if you want the original row number */
if nmiss(of xVars[*]) > 0;      /* keep only obs that have missing values */
do i = 1 to dim(xVars);
   varName = vname(xVars[i]);   /* assign the name of the original variable */
   isMissing = nmiss(xVars[i]); /* binary indicator variable */
   output;
end;
keep ObsNum &IndepVar varName isMissing;
run;
 
/* 3. Define discrete attribute map: missing values are black and nonmissing values are white. */
data AttrMap;
  retain id 'HeatMiss';
  length fillcolor $20;
  value=0; fillcolor='White'; output;  /* nonmissing = white */
  value=1; fillcolor='Black'; output;  /* missing = black */
run;
 
/* 4. Create a heat map that shows the missing values for each variable for each date. */
/*    Thanks to Dan Heath for suggesting that I use INTERVAL=QUARTER
      for major ticks and MINORINTERVAL=MONTH for minor ticks */
ods graphics / width=400px height=600px;
title "Visualize Missing Values";
proc sgplot data=Want dattrmap=AttrMap;
   heatmapparm x=varName y=&IndepVar colorgroup=isMissing / attrid=HeatMiss;
   xaxis display=(nolabel);
   yaxis display=(nolabel) type=time interval=quarter valuesformat=monyy7. 
                           minor minorinterval=month;
run;

The heat map is shown to the right. It displays dates along the vertical axis. The horizontal axis displays the three numerical variables that have missing values. You can use the graph to determine whether the pattern of missing values depend on the Date variable. In the program, I used macro variables to store the name of the X and Y variables so that the program is easier to reuse.

What does the heat map show?

  • For the X1 variable, the heat map clearly shows that the variable does not have any missing values prior to 2020, but then has quite a few in 2020. The graph is consistent with the fact that the missing values in 2020 are distributed uniformly. You can see gaps and clusters, which are common features of uniform randomness.
  • For the X2 variable, the heat map is less clear. It is possible that the missingness occurs uniformly at random for any date. You would need to run an additional analysis to discover that all the missing values occur on Sundays.
  • For the X3 variable, the heat map suggests that the missing values occur in the winter months. However, you should probably run an additional analysis to discover that all the missing values occur between November and February.

You can use PROC FREQ to perform additional analyses on the X2 and X3 variables. The following calls to PROC FREQ use the MONTHw. and WEEKDAYw. formats to format the Date variable. PROC FREQ can perform a two-way analysis to investigate whether missingness in X2 and X3 is associated with days or months, respectively. You can use a mosaic plot to show the results, as follows:

ods graphics / reset;
proc freq data=Want order=formatted;
   where varName="x2";
   format Date WEEKDAY.;
   label IsMissing = "Missing Value for X2" Date = "Day of Week";
   tables IsMissing * Date / plots=MosaicPlot;
run;
 
proc freq data=Want order=formatted;
   where varName="x3";
   format Date MONTH.;
   label IsMissing = "Missing Value for X3" Date = "Month of Year";
   tables IsMissing * Date / plots=MosaicPlot;
run;

For brevity, only the mosaic plot for the month of the year is shown. The mosaic plot shows that all the missing values for X3 occur in November through February. A similar plot shows that the missing values for X2 occur on Sunday.

Summary

This article shows how to use a heat map to plot patterns of missing values for several variables versus a date. By plotting the missing values of X1, X2, X3,..., you can assess whether the missing values depend on time. This can help you to decide whether the missing values occur completely at random (MCAR). The missing values in this article are not MCAR. However, that fact is not always evident from the missing value pattern plot, and you might need to run additional analyses on the missing values.

Although this analysis used a date as the independent variable, that is not required. You can perform similar visualizations for other choices of an independent variable.

The post Visualize dependencies of missing values appeared first on The DO Loop.