data analysis

November 28, 2022
 

A SAS programmer was trying to simulate poker hands. He was having difficulty because the sampling scheme for simulating card games requires that you sample without replacement for each hand. In statistics, this is called "simple random sampling."

If done properly, it is straightforward to simulate poker hands in SAS. This article shows how to generate a standard deck of 52 cards. It then shows how to generate many card hands for a game that involves several players. This can be done in two ways: by using the SAMPLE function in the SAS/IML language or by using PROC SURVEYSELECT in SAS/STAT software.

Encode a deck of cards

The first step is to generate a standard deck of 52 playing cards. There are many ways to do this, but the following DATA step iterates over 13 values (A, 2, 3, ..., Q, K) and over four suits (clubs, diamonds, hearts, and spades). The step creates a data set that has 52 cards. To make the output more compact, I concatenate the value of the card with the first letter of the suit to obtain a two-character string for each card. For example, "5C" is the "five of clubs" and "AS" is the "ace of spades." So that each card can be represented by using two characters, I use "T" instead of "10." For example, "TH" is the "ten of hearts."

data Cards;
length Value $2 Suit $8 Card $3;
array Values[13] $ _temporary_ ('A','2','3','4','5','6','7','8','9','T','J','Q','K');
array Suits[4] $ _temporary_ ('Clubs', 'Diamonds', 'Hearts', 'Spades');
do v = 1 to 13;
   do s = 1 to 4;
      Value = Values[v]; Suit = Suits[s]; Card = cats(Value,substr(Suit,1,1));
      output;
   end;
end;
run;
 
proc print data=Cards(obs=8); 
   var v s Value Suit Card;
run;
TABLE

The cards are generated in order. The output shows the first eight cards in the deck, which are the aces and twos. For some analyses (such as detecting pairs and straights), the V variable is useful; for other analyses (such as detecting flushes), the S variable is useful.

Deal cards by using SAS/IML

When humans deal cards, we use a two-step method: shuffle the deck and then deal the top cards to players in a round-robin fashion. A computer simulation can simulate dealing cards by using a one-step method: deal random cards (without replacement) to every player. SAS provides built-in sampling methods to make this process easy to program.

In SAS/IML, the default sampling method for the SAMPLE function is sampling without replacement. If there are P players and each player is to receive C cards, then a deal consists of P*C cards, drawn without replacement from the deck. The deck does not need to be shuffled because the SAMPLE function outputs the cards in random order. Because the cards are in random order, it does not matter how you assign the cards to the players. The following SAS/IML program uses P=3 players and generates C=5 cards for each player. The program simulates one deal by assigning the first C cards to the first player, the second C cards to the second player, and so on:

%let nCards   = 5;    /* cards per player */
%let nPlayers = 3;    /* number of players */
%let nDeals   = 10;   /* number of deals to simulate */
 
/* simulate dealing many card hands in SAS/IML */
proc iml;
call randseed(1234);                       /* set random number seed */
use Cards;  read all var "Card";  close;   /* read the deck of cards into a vector */
 
nCardsPerDeal = &nCards * &nPlayers;       /* each row is a complete deal */
D = sample(Card, nCardsPerDeal, "WOR" );   /* sample without replacement from the deck */
Deal = shape(D, &nPlayers, &nCards);       /* reshape into nPlayers x nCards matrix */
 
/* print one deal */
cardNames   = "C1":"C&nCards";
playerNames = "P1":"P&nPlayers";
print Deal[c=cardNames r=playerNames];
TABLE

For this deal, the first player has two tens, two fives, and a king. The second player has a four, a five, a nine, an ace, and a two. You could sort the cards in each row, but I have left each player's cards unsorted so that you can see the random nature of the assignment.
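
As a quick illustration of the kind of analysis you might perform on a deal, the following statements (my own sketch, not part of the original example) continue the PROC IML session and check the first player's hand for a pair or a flush by parsing the two-character card codes. You could obtain the same information by carrying along the V and S variables from the Cards data set.

/* sketch: check the first player's hand for a pair or a flush */
values = substr(Deal[1,], 1, 1);              /* card values: 'A','2',...,'T','J','Q','K' */
suits  = substr(Deal[1,], 2, 1);              /* suit letters: 'C','D','H','S' */
hasPair = (ncol(unique(values)) < &nCards);   /* a repeated value ==> at least one pair */
isFlush = (ncol(unique(suits)) = 1);          /* all cards in one suit ==> a flush */
print hasPair isFlush;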

Simulate multiple deals

The previous program simulates one deal. What if you want to simulate multiple deals? One way is to loop over the number of deals, but there is a more efficient method. The second argument to the SAMPLE function can be a two-element vector. The first element specifies the number of cards for each deal, and the second element specifies the number of deals. Thus, for example, if you want to simulate 10 deals, you can use the following statement to generate a matrix that has 10 rows. Each row is a sample (without replacement) from a 52-card deck:

/* use one call to simulate many deals */
D = sample(Card, nCardsPerDeal//&nDeals, "WOR" );   /* simulate many deals */
 
/* Ex: The 8th row contains the 8th deal. Display it. */
Deal8 = shape(D[8,], &nPlayers, &nCards);
print Deal8[c=cardNames r=playerNames];
TABLE

To demonstrate that each row is a deal, I have extracted, reshaped, and displayed the eighth row. The eighth deal is unusual because each player is dealt a "good" hand. The first player has a pair of eights, the second player has two pairs (threes and sixes), and the third player has a pair of sevens. If you want to print (or analyze) the 10 deals, you can loop over the rows, as follows:

do i = 1 to &nDeals;
   Deal = shape(D[i,], &nPlayers, &nCards);
   /* analyze or print the i_th deal */
   print i, Deal[c=cardNames r=playerNames];
end;

Deal random hands by using PROC SURVEYSELECT

You can also simulate card deals by using the SURVEYSELECT procedure. You should use the following options:

  • Use the METHOD=SRS option to specify sampling without replacement for each deal. To obtain the cards in random order, use the OUTRANDOM option.
  • Use the N= option to specify the number of cards in each deal.
  • Use the REPS= option to specify the total number of deals.

An example is shown in the following call to PROC SURVEYSELECT:

/* similar approach for PROC SURVEYSELECT */
proc surveyselect data=Cards seed=123 NOPRINT
     method=SRS                   /* sample without replacement */
     outrandom                    /* randomize the order of the cards */
     N=%eval(&nCards * &nPlayers) /* num cards per deal */
     reps=&nDeals                 /* total number of deals */
     out=AllCards;
run;

The output data set (AllCards) contains a variable named Replicate, which identifies the cards in each deal. Because the cards for each deal are in a random order, you can arbitrarily assign the cards to players. You can use a round-robin method, or you can assign the first C cards to the first player, the next C cards to the second player, and so on. The following DATA step implements the second (block) assignment:

/* assign the cards to players */
data AllDeals;
set AllCards;
by Replicate;
if first.Replicate then do;
   Cnt = 0;
end;
Player = 1 + floor(Cnt/&nCards);
Cnt + 1;
drop Cnt;
run;
OUTPUT
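
If you prefer the round-robin assignment mentioned above, a minimal variation of the previous DATA step (my own sketch, not from the original post) uses the MOD function instead of the FLOOR function:

/* sketch: round-robin assignment deals cards to players 1,2,3,1,2,3,... */
data AllDealsRR;
set AllCards;
by Replicate;
if first.Replicate then Cnt = 0;
Player = 1 + mod(Cnt, &nPlayers);
Cnt + 1;
drop Cnt;
run;

Because the cards within each replicate are already in random order, the two assignment schemes produce statistically equivalent hands.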

The simulated data are in long form. For some analyses, it might be more convenient to convert the data to wide form. You can do this by using the DATA step or by using PROC TRANSPOSE, as below:

/* OPTIONAL: Convert from long form to wide form */
proc transpose data=AllDeals 
               out=Deals(rename=(col1-col&nCards=C1-C&nCards) drop=_NAME_);
   var Card;
   by Replicate Player;
run;
 
proc print data=Deals;
   where Replicate=8;
   var Replicate Player C:;
run;
OUTPUT

To demonstrate the results, I display the result of the eighth deal (Replicate=8). For this deal, the first player has several face cards but no pairs, the second player has a pair of threes, and the third player has a pair of fours.

Summary

SAS provides many tools for simulation studies. For simulations of card games, this article shows how to create a data set that represents a standard deck of 52 playing cards. From the cards, you can use the SAMPLE function in SAS/IML software or the SURVEYSELECT procedure to simulate many deals, where each deal is a sample without replacement from the deck. For each deal, you can assign the cards to the players in a systematic manner.

The post Simulate poker hands in SAS appeared first on The DO Loop.

November 16, 2022
 

A profile plot is a way to display multivariate values for many subjects. The optimal linear profile plot was introduced by John Hartigan in his book Clustering Algorithms (1975). In Michael Friendly's book (SAS System for Statistical Graphics, 1991), Friendly shows how to construct an optimal linear profile by using the SAS/IML language. He displays the plot by using traditional SAS/GRAPH methods: the old GPLOT procedure and an annotation data set. If you do not have a copy of his book, you can download Friendly's %LINPRO macro from his GitHub site.

SAS graphics have changed a lot since 1991. It is now much easier to create the optimal linear profile plot by using modern graphics in SAS. This article reproduces Friendly's SAS/IML analysis to create the data for the plot but uses PROC SGPLOT to create the final graph. This approach does not require an annotate data set.

The optimal linear profile plot

Friendly (1991) illustrates the optimal linear profile plot by using the incidence rate for seven types of crime in 16 US cities in 1970. I have previously shown how to visualize the crime data by using line plots and heat maps. A profile plot for the raw data is shown to the right.

The optimal linear profile plot modifies the profile plot in two ways. It computes an optimal ordering of the categories (crimes) and transforms the raw responses (rates) so that the transformed data are as close to linear as possible. Let's focus on the meaning of the words "linear" and "optimal":

  • Linear: The goal of the linear profile plot is to show linear trends rather than exact data values. The method models the data to determine which cities tend to have relatively low rates of some crimes and relatively high rates of other crimes. Thus, the linear profile plot shows a linear model for the crime trends.
  • Optimal: In creating the optimal linear profiles, you can choose any positions for the categories on the X axis. You can also (affinely) scale the data for each variable in many ways. The algorithm chooses positions and scales that make the transformed data as linear as possible.

The optimal linear profile is constructed by using the first two principal components (PCs) of the data (in wide form). The optimal position for each category is the ratio of the corresponding elements of the second and first eigenvectors. The slopes and intercepts for the linear profiles are obtained by projecting the centered and scaled data onto the eigenvectors, so they are related to the PC scores. See Friendly (1991) or the SAS/IML code for details.
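
In symbols (a compact sketch that uses the quantities computed in the SAS/IML program below, not Friendly's notation): let \(e_1\) and \(e_2\) be the first two eigenvectors and let \(d_{ij}\) be the centered (and possibly scaled) response for the i_th subject and j_th category. The position of the j_th category and the intercept and slope for the i_th subject are
\( p_j = e_{2j}/e_{1j}, \qquad a_i = \frac{\sum_j d_{ij} e_{1j}}{\sum_j e_{1j}^2}, \qquad b_i = \frac{\sum_j d_{ij} e_{2j}}{\sum_j e_{2j}^2} \)
and the i_th profile is the line \(\hat{d}_{ij} = a_i + b_i\, p_j\).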

The following DATA step creates the crime data in wide form. The SAS/IML program is taken from the %LINPRO macro and is similar to the code in Friendly (1991). It contains a lot of print statements, which you can comment out if you do not want to see the intermediate calculations. I made a few minor changes and three major changes:

  • Friendly's code computed the covariance and standard deviations by using N (the sample size) as the denominator. My code uses N-1 as the denominator so that the computations of the PCs agree with PROC PRINCOMP.
  • Friendly's code writes a data set (LINFIT), which is used to create an annotate data set. This data set is no longer needed, but I have retained the statements that create it. I have added code that creates two macro variables (POSNAMES and POSVALUES) that you can use to specify the positions of the crime categories.
  • The program uses PROC SGPLOT to create the optimal linear profile plot. The labels for the lines are displayed automatically by using the CURVELABEL option on the SERIES statement. The VALUES= and VALUESDISPLAY= options are used to position the categories at their optimal positions.
/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city $16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* Program from Friendly (1991, p. 424-427) and slightly modified 
   by Rick Wicklin (see comments). You can download the LINPRO MACRO
   from https://github.com/friendly/SAS-macros/blob/master/linpro.sas
 
   The program assumes the data are in wide form:
   There are k response variables. Each row is the response variables 
   for one subject. The name of the subject is contained in an ID variable.
 
   The output for plotting is stored in the data set LINPLOT.
   Information about the linear regressions is stored in the data set LINFIT.
*/
 
%let data = crime;    /* name of data set */
%let var = _NUM_;     /* list of variable names or the _NUM_ keyword */
%let id = City;       /* subject names */
proc iml;
   *--- Get input matrix and dimensions;
   use &data;
   if upcase("&var") = "_NUM_" then
      read all var _NUM_ into X[colname=vars];
   else
      read all var {&var} into X[colname=vars];
   n = nrow(X);
   p = ncol(X);
   m = mean(X);
   D = X - m;         *- centered cols = deviations from means;
 
   if "&id" = " " then
      ID = T(1:nrow(X));
   else
      read all var{&id} into ID;
   if type(ID)='N' then ID = compress(char(id));
   close;
 
   *--- Compute covariance matrix of x;
   /* Modification by Wicklin: Friendly's COV and SD are different 
      because he uses a divisor of n for the COV matrix instead of n-1 */
   C = cov(X);   
   sd = std(X);
   /* If you want to exactly reproduce Friendly's macro, use the following: 
   c = d` * d /  n;               *- variance-covariance matrix;
   sd = sqrt( vecdiag( c))`;      *- standard deviations;
   */
 
   *--- Standardize if vars have sig diff scales;
   ratio = max(sd) / min(sd);
   if ratio > 3 then do;
      s = sd;             * s is a scale factor;
      C = cov2corr(C);
      Clbl = "Correlation Matrix";
   end;
   else do;
      s = j(1, p, 1);
      Clbl = "Covariance Matrix";
   end;
   print C[colname=vars rowname=vars f=7.3 L=Clbl];
 
   *--- Eigenvalues & vectors of C;
   call eigen(val, vec, C);
 
   *--- Optional: Display information about the eigenvectors;
   prop = val / val[+];
   cum = cusum(prop);
   T = val || dif(val,-1) || prop || cum;
   cl = {'Eigenvalue' 'Difference' 'Proportion' 'Cumulative'}; 
   lbl = "Eigenvalues of the " + Clbl;
   print T[colname=cl f=8.3 L=lbl];
 
   *--- Scale values;
   e1 = vec[ , 1] # sign( vec[<> , 1]);
   e2 = vec[ , 2];
   pos = e2 / e1;
   Ds = D;                /* the raw difference matrix */
   D = Ds / ( s # e1` );  /* a scaled difference matrix */
 
   *--- For case i, fitted line is  Y = F1(I) + F2(I) * X   ;
   f1 = ( D * e1 ) / e1[##];  /* ssq(e1) = ssq(e2) = 1 */
   f2 = ( D * e2 ) / e2[##];
   f = f1 || f2;
 
   *--- Optionally display the results;
   scfac = sd` # e1;   * NOTE: Friendly rounds this to the nearest 0.1 unit;
   table = m` || sd` || e1 || e2 || scfac || pos;
   ct = { 'Mean' 'Std_Dev' 'Eigvec1' 'Eigvec2' 'Scale' 'Position'};
   print table [ rowname=vars colname=ct f=8.2];
 
   *--- Rearrange columns;
   r = rank( pos);
   zz = table;  table [r, ] = zz;
   zz = vars;   vars [ ,r] = zz;
   zz = pos;    pos [r, ] = zz;
   zz = scfac;  scfac [r, ] = zz;
   zz = D;      D [ ,r] = zz;
   *--- Optionally display the results;
   print table[rowname=vars colname=ct f=8.2 L="Variables reordered by position"];
   lt = { 'Intercept' 'Slope'};
   print f[rowname=&id colname=lt f=7.2 L="Case lines"];
 
   *--- Fitted values, residuals;
   fit = f1 * j( 1 , p) + f2 * pos`;
   resid = D - fit;
   *--- Optionally display the results;
   print fit [ rowname=&id colname=vars format=7.3 L="Fitted values"];
   print resid [ rowname=&id colname=vars format=7.3 L="Residuals"];
 
   *--- Optionally display summary statistics for fit;
   sse = resid[##];   ssf = fit[##];
   sst = D[##];       vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   sse = (ds-fit)[##];   sst = ds[##];
   vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   *--- Construct output array for annotate data set -
          residuals bordered by fitted scale values;
   v1 = val[ 1 ] || {0};
   v2 = {0} || val[ 2 ];
   xout = ( resid || f ) //
          ( pos` || v1 ) //
          ( scfac` || v2 );
   rl = colvec(ID) // {'Variable','Scale'};
   cl = colvec(vars) // {'Intercept','Slope'};
   create LINFIT from xout[ rowname=rl colname=cl ];
      append from xout[ rowname= rl ];
   close;
   free rl cl xout;
 
   *--- Output the array to be plotted;
   do col = 1 to p;
       rows = j(n,1,pos[col,]) ||  fit[,col] || resid[,col];
       pout = pout // rows;
       rl = rl // shape(ID,1);
   end;
   cl = { 'Position' 'Fit' 'Residual'};
   create LINPLOT from pout[ rowname=rl colname=cl ];
      append from pout[ rowname=rl ];
   close;
 
   /* Modification by Wicklin: create two macro variables for 
      labeling the plotting positions. For example:
     POSNAME = 'larceny' 'burglary' 'auto_thf' 'rape' 'robbery' 'assault' 'murder'
     POSVALUES= -1.698 -1.014 -0.374 0.1389 0.5016 0.5805 2.1392
   */
   M1 = compress("'" + vars + "'");
   M1 = rowcat(M1 + " ");
   M2 = rowcat(char(pos`,6) + " ");
   call symputx("posNames", M1);
   call symputx("posValues",M2);
quit;
 
%put &=posNames;
%put &=posValues;
 
/* Use the POSNAMES and POSVALUES macro variables to add custom 
   tick labels and values. See
   https://blogs.sas.com/content/iml/2020/03/23/custom-tick-marks-sas-graph.html
*/
ods graphics / height=640px width=640px;
title "Optimal Linear Profiles for City Crime Rates";
footnote J=L "Friendly (1991, p. 430) Output 8.11";
proc sgplot data=LINPLOT(rename=(RL=&ID));
  label fit="Fitted Crime Index" position="Crime (Scaled Position)";
  series x=position y=fit / group=city curvelabel curvelabelloc=outside;
  xaxis values = (&posValues) valuesdisplay=(&posNames) fitpolicy=stagger;  
run;
footnote;title;

The optimal linear profile plot is shown for the crimes data. Unlike Friendly, I do not show the points along these lines. I think adding markers gives the false impression that the method has perfectly linearized the data.

The X axis shows the optimal positions of the crimes. They are roughly ordered so that crimes against property are on the left and violent crimes against persons are on the right. As is often the case, the PC interpretation does not perfectly agree with our intuitions. For these data, the analysis has placed rape to the left of robbery and assault. Notice that I use the VALUES= and VALUESDISPLAY= to specify the positions of the labels, and I use the FITPOLICY=STAGGER option to prevent the labels from colliding.

The cities can be classified according to their linear trends. For brevity, let's call larceny and burglary "property crimes," and let's call robbery, assault, and murder "violent crimes." Then you can observe the following classifications for the cities:

  • The lines for some cities (Dallas, Chicago, Houston) have a large positive slope. This indicates that the crime rates for property crimes are low relative to the rates of violent crimes.
  • The lines for other cities (Los Angeles, New York, Denver) have a large negative slope. This indicates that the crime rates for property crimes are high relative to the rates of violent crimes.
  • The lines for other cities (Kansas City, Tucson, Hartford) are relatively flat. This indicates that the crime rates for property crimes are about the same as the rates of violent crimes.
  • When the line for one city is above the line for another city, it means that the first city tends to have higher crime rates across the board than the second city. For example, the line for Los Angeles is above the line for New York. If you look at the raw data, you will see that the crime rates in LA are mostly higher, although NYC has higher rates for robbery and larceny. Similarly, the rates for Portland are generally higher than the rates for Honolulu, except for auto theft.
  • The ordering of the city labels is based primarily on the violent crime rates.

The following color image is copied from Friendly (1991, Appendix 3, p. 673) so that you can compare the SAS/GRAPH output to the newer output.

An alternative to the optimal linear profile plot

I'll be honest, I am not a big fan of the optimal linear profile plot. I think it can give a false impression about the data. It displays a linear model but does not provide diagnostic tools to assess issues such as outliers or deviations from the model. For example, the line for Boston shows no indication that auto theft in that city was extremely high in 1970. In general, there is no reason to think that one set of positions for the categories (crimes) leads to a linear profile for all subjects (cities).

If you want to show the data instead of a linear model, see my previous article, "Profile plots in SAS," and consider using the heat map near the end of the article. If you want to observe similarities and differences between the cities based on the multivariate data, I think performing a principal component analysis is a better choice. The following call to PROC PRINCOMP does a good job of showing similarities and differences between the categories and subjects. The analysis uses the correlation matrix, which is equivalent to saying that the crime rates have been standardized.

ods graphics/reset;
proc princomp data=crime out=Scores N=2 plot(only)=(PatternProfile Pattern Score);
   var murder rape robbery assault burglary larceny auto_thf;
   id City;
run;

For this analysis, the "Pattern Profile Plot" (not shown) shows that the first PC can be interpreted as the total crime rate. The second PC is a contrast between property crimes and violent crimes. The component pattern plot shows that the position of the variables along the second PC is nearly the same as the positions computed by the optimal linear profile algorithm. You can see property crimes in the lower half of the plot and violent crimes in the upper half. The "Score Plot" enables you to see the following characteristics of cities:

  • Cities to the right have high overall crime rates. Cities to the left have low overall crime rates.
  • Cities near the top have higher rates of violent crime relative to property crimes. Cities near the bottom have higher rates of property crimes relative to violent crimes. Cities near the middle have rates that are roughly balanced between property and violent crimes.

In my opinion, the score plot provides the same information as the optimal linear profile plot. However, interpreting the score plot requires knowledge of principal components, which is an advanced skill. The optimal linear profile plot tries to convey this information in a simpler display that is easier to explain to a nontechnical audience.

Summary

This article reproduces Friendly's (1991) implementation of Hartigan's optimal linear profile plot. Friendly distributes the %LINPRO macro, which uses SAS/IML to carry out the computations and uses SAS/GRAPH to construct the linear profile plot. This article reproduces the SAS/IML computations but replaces the SAS/GRAPH calls with a call to PROC SGPLOT.

This article also discusses the advantages and disadvantages of the optimal linear profile plot. In my opinion, there are better ways to explore the similarities between subjects across many variables, including heat maps and score plots in principal component analysis.

The post Optimal linear profile plots in SAS appeared first on The DO Loop.

November 14, 2022
 

A profile plot is a compact way to visualize many variables for a set of subjects. It enables you to investigate which subjects are similar to or different from other subjects. Visually, a profile plot can take many forms. This article shows several profile plots: a line plot of the original data, a line plot of standardized data, and a heat map.

This article uses crime data to illustrate the various kinds of plots. The line plot at the right shows the incidence rate for seven types of crime in 16 US cities in 1970. The profile plot looks somewhat like a spaghetti plot, but a spaghetti plot typically features a continuous variable (often, time) along the horizontal axis, whereas the horizontal axis in a profile plot features discrete categories. The vertical axis shows a response variable, which is often a count, a rate, or a proportion. The line segments enable you to trace the rates for each city. You can use the graph to discover which cities have high crime rates, which have low rates, and which are high for some crimes but low for others.

In the example, the subjects are cities; the variables are the seven crime rates. However, a profile plot is applicable to many fields. For example, in a clinical setting, the subjects could be patients and the variables could be the results of certain lab tests, such as blood levels of sugar, cholesterol, iron, and so forth. Be aware, however, that there are several different graphs that are known as "patient profile plots." Some are more complicated than the profile plots in this article.

A simple profile plot

For the city crime data, all variables are measured as rates per 100K people. Because all variables are measured on the same scale, you can plot the raw data. You could use a box plot if you want to show the distribution of rates for each crime. However, the purpose of a profile plot is to be able to trace each subject's relative rates across all variables. To do that, let's use a line plot.

The data are originally from the United States Statistical Abstracts (1970) and were analyzed by John Hartigan (1975) in his book Clustering Algorithms. The data were reanalyzed by Michael Friendly (SAS System for Statistical Graphics, 1991). This article follows Friendly's analysis for the line plots. The heat map in the last section is not considered in Friendly (1991).

The following SAS DATA step defines the data in the wide format (seven variables and 16 rows). However, to plot the data as a line plot in which colors identify the cities, it is useful to convert the data from wide to long format. For this example, I use PROC TRANSPOSE. You can then use the SERIES statement in PROC SGPLOT to visualize the rates for each crime and for each city.

/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city &$16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* convert to long format and use alphabetical ordering along X axis */
proc transpose data=crime out=crime2(rename=(col1=rate)) name=crime;
   by city;
run;
proc sort data=crime2;  by crime;  run;
 
title "Profile Plot of Crime Rates in US Cities in 1970";
footnote J=L "Friendly (1991, p. 421) Output 8.9";
proc sgplot data=crime2;
  label rate="Crime Rate (per 100K population)" crime="Crime";
  series x=crime y=rate / group=city markers;
run;

The graph is shown at the top of this article. It is not very useful. The graph does not enable you to trace the crime rate patterns for each city. To quote Friendly (1991, p. 422): "In plotting the raw data, the high-frequency crimes such as burglary completely dominate the display, and the low-frequency crimes (murder, rape) are compressed to the bottom of the plot. Because of this, the possibility that the cities may differ in their patterns is effectively hidden."

Whenever you are trying to graph data that are on different scales, it is useful to transform the scale of the data. You could do this globally (for example, a logarithmic transformation) or you can focus on relative aspects. Following Friendly (1991), the next section standardizes the data, thereby focusing the attention on which cities have relatively high or low crime rates for various crimes.
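
Before moving on, here is a minimal sketch (my own, not part of Friendly's analysis) of the global log-scale alternative mentioned above. All the crime rates are positive, so you can display the original profile plot on a logarithmic axis:

title "Profile Plot of Crime Rates (Log Scale)";
proc sgplot data=crime2;
  label rate="Crime Rate (per 100K population)" crime="Crime";
  series x=crime y=rate / group=city markers;
  yaxis type=log logbase=10;   /* compress the range so low-frequency crimes are visible */
run;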

A profile plot for standardized data

The most common way to standardize data is to use an affine transformation for each variable so that the standardized variables have mean 0 and unit variance. You can use PROC STANDARD in SAS to standardize the variables. You can specify the names of the variables by using the VAR statement, or you can use the keyword _NUMERIC_ to standardize all numeric variables in the data. The following statements standardize the original wide data and then convert the standardized data to long format for graphing.

proc standard data=crime mean=0 std=1 out=stdcrime;
   var _numeric_;
run;
/* convert to long format and ensure a consistent ordering along X axis */
proc transpose data=stdcrime out=stdcrime2(rename=(col1=rate)) name=crime;
   by city;
run;
proc sort data=stdcrime2;  by crime;  run;
 
title "Profile Plot of Standardized Crime Rates in 1970";
footnote J=L "Friendly (1991, p. 423) Output 8.10";
proc sgplot data=stdcrime2;
  label rate="Crime Rate (standard score)" crime="Crime";
  series x=crime y=rate / group=city markers;
run;

Unfortunately, for these data, this graph is not much better than the first profile plot. Friendly (1991, p. 423) calls the plot "dismal" because it is "hard to see any similarities among the cities in their patterns of crime." With a close inspection, you can (barely) make out the following facts:

  • For most crimes, Los Angeles, Detroit, and New York have high rates.
  • For most crimes, Tucson and Hartford have low rates.
  • The crime rate in Boston is low except for auto theft.

This plot is similar to a parallel coordinates plot. There are three ways to improve the standardized profile plot:

  • Rearrange the order of the categories along the X axis by putting similar categories next to each other. I created one possible re-ordering, in which property crimes are placed on the left and crimes against persons are placed on the right (a sketch of one way to do this follows this list). The profile plot is slightly improved, but it is still hard to read.
  • Change the affine transformation for the variables. Instead of always using mean 0 and unit variance, perhaps there is a better way to transform the response values? And perhaps using nonuniform spacing for the categories could make the profiles easier to read? These two ideas are implemented in the "optimal linear profile plot" (Hartigan, 1975; Friendly, 1991). I will show how to create the optimal linear profile plot in a subsequent article.
  • Abandon the spaghetti-like line plot in favor of a lasagna plot, which uses a heat map instead of a line plot. This is shown in the next section.
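
Regarding the first suggestion, the following statements sketch one way to impose a custom category order (my own code; the ordering shown, with property crimes on the left and violent crimes on the right, is one plausible choice and might not match the original figure):

/* sketch: assign a sort key to each crime, sort, and honor the data order on the X axis */
data stdcrimeOrd;
   set stdcrime2;
   order = whichc(crime, 'auto_thf','larceny','burglary','robbery','assault','rape','murder');
run;
proc sort data=stdcrimeOrd;  by order city;  run;
 
proc sgplot data=stdcrimeOrd;
  label rate="Crime Rate (standard score)" crime="Crime";
  series x=crime y=rate / group=city markers;
  xaxis discreteorder=data;   /* categories appear in the sorted order */
run;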

A lasagna plot of profiles

The main advantage of a lasagna plot is that data are neatly displayed in rows and columns. For each subject, you can examine the distribution of the response variable across categories. Furthermore, a rectangular heat map is symmetric: for each category, you can examine the distribution of the response variable across subjects! For these data, the subjects are cities, the categories are crimes, and the response variable is either the raw crime rate or the standardized rate. The disadvantage of a heat map is that the response values are represented by colors, and it can be difficult to perceive changes in colors for some color ramps. On the other hand, you can always print the data in tabular form if you need to know the exact values of the response variables.

The following statements create a lasagna plot for the standardized data in this example. Like the line plot, the heat map uses the data in long form:

%let colormap = (CX0571B0 CX92C5DE CXF7F7F7 CXF4A582 CXCA0020);
/* lasagna plot: https://blogs.sas.com/content/iml/2016/06/08/lasagna-plot-in-sas.html 
   and https://blogs.sas.com/content/iml/2022/03/28/viz-missing-values-longitudinal.html */
title "Lasagna Plot of City Profiles";
title2 "Standardized Crime Rates in 1970";
proc sgplot data=stdcrime2;
   label city="City" crime="Crime" rate="Standardized Rate";
   heatmapparm x=crime y=city colorresponse=rate / outline outlineattrs=(color=gray) colormodel=&colormap; 
run;

This is a better visualization of the crime profiles for each city. You can easily see the standardized rates and detect patterns among the cities. You can see that Detroit, Kansas City, Los Angeles, New York, and Washington (DC) have high relative rates. You can see that Atlanta, Boston, Hartford, Portland, and Tucson tend to have relatively low rates. You can see the crimes for which Atlanta, Boston, and Portland do not have low rates.

This display uses an alphabetical ordering of both the categories (crimes) and the subject (cities). However, you can sort the rows and columns of a lasagna plot to improve the visualization. For example, the following heat map sorts the cities by the average standardized crime rate across all categories. It also arranges the horizontal axis so that nonviolent crimes against property are on the left and violent crimes against people on the right:
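
The following statements sketch one way to create such a sorted lasagna plot (my own code, not from the original post, so the result might differ slightly from the displayed image). The PROC SQL step computes each city's mean standardized rate and remerges it onto the long data; sorting by that mean and honoring the data order on the Y axis arranges the cities. You could reorder the crime categories by using the same trick as in the earlier sketch.

/* sketch: sort the cities by their mean standardized rate */
proc sql;
   create table stdcrimeSort as
   select *, mean(rate) as meanRate      /* group mean is remerged onto each row */
   from stdcrime2
   group by city
   order by meanRate, city, crime;
quit;
 
title "Lasagna Plot of City Profiles, Sorted by Mean Standardized Rate";
proc sgplot data=stdcrimeSort;
   label city="City" crime="Crime" rate="Standardized Rate";
   heatmapparm x=crime y=city colorresponse=rate / outline outlineattrs=(color=gray) colormodel=&colormap;
   yaxis discreteorder=data;             /* cities appear in the sorted order */
run;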

This arrangement makes it easier to see outliers such as the orange and red cells in the lower left of the graph. Both Portland and Honolulu have rates of property crime that are higher than you might suspect, given their overall low crime rates. Similarly, Boston has relatively high rates of auto theft compared to its otherwise low crime rates. In the same way, the relatively low rate of rapes in New York City is apparent.

Summary

This article shows three ways to create a profile plot in SAS. A profile plot enables you to visualize the values of many variables for a set of subjects. The simplest profile plot is a line plot of the raw values, where each line corresponds to a subject and the discrete horizontal axis displays the names of variables. However, if the variables are measured in different units or on different scales, it is useful to rescale the values of the variables. One way is to create a line plot of the standardized variables, which displays the relative values (high or low) of each variable. A third display is a heat map (sometimes called a lasagna plot), which uses color to display the high or low values of each variable.

In this article, the categories are uniformly spaced along the horizontal axis. In a subsequent article, I will show an alternative plot that uses nonuniform spacing.

The post Profile plots in SAS appeared first on The DO Loop.

November 2, 2022
 

The area of a convex hull enables you to estimate the area of a compact region from a set of discrete observations. For example, a biologist might have multiple sightings of a wolf pack and want to use the convex hull to estimate the area of the wolves' territory. A forestry expert might use a convex hull to estimate the area of a forest that has been infested by a beetle or a blight. This article shows how to compute the area and perimeters of convex hulls in SAS. Because a convex hull is a convex polygon, we present formulas for the area and perimeter of polygons and apply those formulas to convex hulls.

Gauss' shoelace formula for the area of a polygon

There are many formulas for the area of a planar polygon, but the one used in this article is known as Gauss' shoelace formula, or the triangle formula. This formula enables you to compute the area of a polygon by traversing the vertices of the polygon in order. The formula uses only the coordinates of adjacent pairs of vertices. Assume a polygon has n vertices that are enumerated counterclockwise, with coordinates \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\). Then the area of the polygon is given by the shoelace formula
\(A = \frac{1}{2} \sum_{i=1}^{n} (x_i y_{i+1} - x_{i+1} y_i)\)
where we use the convention that \((x_{n+1}, y_{n+1}) = (x_1, y_1)\).

To me, the remarkable thing about this formula is that it looks like it has nothing to do with an area. However, the Appendix derives Gauss' shoelace formula from Green's theorem in the plane, which reveals the connection between the formula and the area of the polygon.

You might wonder why this is called the "shoelace formula." Suppose a polygon has vertices \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\). A schematic visualization of the formula is shown to the right. You list the X and Y coordinates of the vertices and append the first vertex to the end of the list. The blue arrows represent multiplication. The red arrows represent multiplication with a negative sign. So, the first two rows indicate the quantity \(x_1 y_2 - x_2 y_1\). (If you are familiar with linear algebra, you can recognize this quantity as the determinant of the 2 x 2 matrix with first row \((x_1, y_1)\) and second row \((x_2, y_2)\).) The sequence of arrows looks like criss-cross lacing on a shoe.
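
As a quick arithmetic check, consider the 3:4:5 right triangle that is used later in this article, with vertices (1,1), (4,1), and (4,5) listed in counterclockwise order. The shoelace sum is
\( (1\cdot 1 - 4\cdot 1) + (4\cdot 5 - 4\cdot 1) + (4\cdot 1 - 1\cdot 5) = -3 + 16 - 1 = 12 \)
so the area is \(A = 12/2 = 6\), which agrees with the familiar formula (1/2)(base)(height) = (1/2)(3)(4) = 6.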

Implementing the area and perimeter of a polygon in SAS

The visualization of the shoelace formula suggests an efficient method for computing the area of a polygon. First, append the first vertex to the end of the list of vertices. Then, define the vector Lx = lag(x) as the vector with elements \(x_2, x_3, \ldots, x_n, x_1\). Define Ly = lag(y) similarly. Then form the vector d = x#Ly - y#Lx, where the '#' operator indicates elementwise multiplication. The area of the polygon is sum(d)/2. This computation is implemented in the following SAS/IML program, which also implements a function that computes the perimeter as the sum of the Euclidean distances between adjacent vertices. The perimeter formula is given by
\(s = \sum_{i=1}^{n} \sqrt{(x_{i+1} -x_i)^2 + (y_{i+1} -y_i)^2}\), where \((x_{n+1}, y_{n+1}) = (x_1, y_1)\).

proc iml;
/* The following functions are modified from 
   Rick Wicklin (2016). The Polygon package. SAS Global Forum 2016 */
/* PolyArea: Return the area of a simple polygon.
   P is an N x 2 matrix of (x,y) values for vertices. N > 2.
   This function uses Gauss' "shoelace formula," which is also known as the 
   "triangle formula."  See https://en.wikipedia.org/wiki/Shoelace_formula  */
start PolyArea(P);
   lagIdx = 2:nrow(P) || 1;   /* vertices of the lag; appends first vertex to the end of P */
   xi   = P[ ,1];       yi = P[ ,2];
   xip1 = xi[lagIdx]; yip1 = yi[lagIdx]; 
   area = 0.5 * sum(xi#yip1 - xip1#yi);
   return( area );
finish;
 
/* PolyPerimeter: Return the perimeter of a polygon.
   P is an N x 2 matrix of (x,y) values for vertices. N > 2.  */
start PolyPeri(_P);
   P = _P // _P[1,];     /* append first vertex */
   /* (x_{i+1} - x_i)##2 + (y_{i+1} - y_i)##2 */
   v = dif(P[,1])##2 + dif(P[,2])##2;  /* squared distance from i_th to (i+1)st vertex */
   peri = sum(sqrt(v));                /* sum of edge lengths */
   return( peri );
finish;
 
/* demonstrate by using a 3:4:5 right triangle */
Pts = {1 1,
       4 1,
       4 5};
Area = PolyArea(Pts);
Peri = PolyPeri(Pts);
print Area Peri;

The output shows the area and perimeter for a 3:4:5 right triangle with vertices (1,1), (4,1), and (4,5). As expected, the area is 0.5*base*height = 0.5*3*4 = 6. The perimeter is the sum of the edge lengths: 3+4+5 = 12.

Notice that neither of the functions uses a loop over the vertices of the polygon. Instead, the area is computed by using vectorized computations. Essentially, the area formula uses the LAG of the vertex coordinates, and the perimeter formula uses the DIF of the coordinates.

Apply the area and perimeter formulas to convex hulls

You can use the previous functions to compute the area and perimeter of the 2-D convex hull of a set of planar points. In SAS, you can use the CVEXHULL function to compute a convex hull. The CVEXHULL function returns the indices of the points on the convex hull in counterclockwise order. You can therefore extract those points to obtain the polygon that encloses the points. You can send this polygon to the functions in the previous section to obtain the area and perimeter of the convex hull, as follows:

/* Use the CVEXHULL function in SAS to compute 2-D convex hulls. See
   https://blogs.sas.com/content/iml/2021/11/15/2d-convex-hull-sas.html  */
points = {0  2, 0.5 2, 1 2, 0.5 1, 0 0, 0.5 0, 1  0, 
          2 -1,   2 0, 2 1,   3 0, 4 1,   4 0, 4 -1, 
          5  2,   5 1, 5 0,   6 0 }; 
 
/* Find the convex hull: indices on the convex hull are positive */
Indices = cvexhull( points ); 
hullIdx = indices[loc(indices>0)];   /* extract vertices with positive indices */
convexPoly = points[hullIdx, ];      /* extract rows to form a polygon */
print hullIdx convexPoly[c={'cx' 'cy'} L=""];
 
/* what is the area of the convex hull? */
AreaCH = PolyArea(convexPoly);
PeriCH = PolyPeri(convexPoly);
print AreaCH PeriCH;

The output shows the indices and coordinates of the convex polygon that bounds the set of points. It is a two-column matrix, where each row represents the coordinates of a vertex of the polygon. The area of this polygon is 15 square units. The perimeter is 15.71 units. If these coordinates are the locations of spatial observations, you can compute the area and perimeter of the convex hull that contains the points.

Summary

A convex hull is a polygon. Accordingly, you can use formulas for the area and perimeter of a polygon to obtain the area and perimeter of a convex hull. This article shows how to efficiently compute the area and perimeter of the convex hull of a set of planar points.

Appendix: Derive Gauss' shoelace formula from Green's theorem

If a polygon has n vertices with coordinates \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), we want to prove Gauss' shoelace formula:
\(A = \frac{1}{2} \sum_{i=1}^{n} (x_i y_{i+1} - x_{i+1} y_i)\)
where we use the convention that \((x_{n+1}, y_{n+1}) = (x_1, y_1)\).

At first glance, Gauss' shoelace formula does not appear to be related to areas. However, the formula is connected to area because of a result from multivariable calculus known as "the area corollary of Green's theorem in the plane." Green's theorem enables you to compute the area of a planar region by computing a line integral around the boundary of the region. That is what is happening in Gauss' shoelace formula. The "area corollary" says that you can compute the area of the polygon by computing \(\oint_C x/2\; dy - y/2\; dx\), where C is the piecewise linear path along the perimeter of the polygon.

Let's evaluate the integral on the edge between the i_th and (i+1)st vertex. Parameterize the i_th edge as
\((x(t), y(t)) = ( x_i + (x_{i+1}-x_i)t, y_i + (y_{i+1}-y_i)t )\quad \mathrm{for} \quad t \in [0,1]\). With this parameterization, \(dx = (dx/dt)\; dt = (x_{i+1}-x_i)\; dt\) and \(dy = (dy/dt)\; dt = (y_{i+1}-y_i)\; dt\).

When you substitute these expressions into the line integral, the integrand becomes a linear polynomial in t, but a little algebra shows that the linear terms cancel and only the constant terms remain. Therefore, the line integral on the i_th edge becomes
\(\frac{1}{2} \int_{\mathrm{edge}} x\; dy - y\; dx = \frac{1}{2} \int_0^1 \bigl( x_i (y_{i+1}-y_i) - y_i (x_{i+1}-x_i) \bigr)\, dt = \frac{1}{2} (x_i y_{i+1} - y_i x_{i+1})\).
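Explicitly, substituting the parameterization into \(x\; dy - y\; dx\) gives \(\bigl[x_i + (x_{i+1}-x_i)t\bigr](y_{i+1}-y_i)\,dt - \bigl[y_i + (y_{i+1}-y_i)t\bigr](x_{i+1}-x_i)\,dt\). The two terms that contain t are \((x_{i+1}-x_i)(y_{i+1}-y_i)\,t\,dt\) and \(-(y_{i+1}-y_i)(x_{i+1}-x_i)\,t\,dt\), which cancel, leaving only the constant integrand.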
When you evaluate the integral on all edges of the polygon, you get a summation, which proves Gauss' shoelace formula for the area of a polygon.

The post The area and perimeter of a convex hull appeared first on The DO Loop.

September 19, 2022
 

A common question on SAS discussion forums is how to use SAS to generate random ID values. The use case is to generate a set of random strings to assign to patients in a clinical study. If you assign each patient a unique ID and delete the patients' names, you can preserve the patients' privacy during data analysis.

A common requirement is that the strings be unique. This presents a potential problem. In a set of random values, duplicate values are expected. Think about rolling a six-sided die several times: you would expect to see a duplicate value appear after a few rolls. Even for a larger set of possible values, duplicates arise surprisingly often. For example, if a room contains 23 people, the Birthday Paradox shows that it is likely that two people in the room have the same birthday!
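
As a quick numerical check (a small sketch that is not part of the original program), the following DATA step computes the probability that at least two of N people share a birthday, assuming 365 equally likely birthdays. The probability exceeds 1/2 at N=23:

data Birthday;
   logP = 0;                                /* log of P(all N birthdays are distinct) */
   do N = 1 to 30;
      logP = logP + log((365 - N + 1)/365); /* multiply by the next conditional probability */
      ProbDup = 1 - exp(logP);              /* P(at least one shared birthday among N people) */
      output;
   end;
   keep N ProbDup;
run;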

If you generate N random numbers independently, you will likely encounter duplicates. There are several ways to use SAS to achieve both randomness and unique values. Perhaps the easiest way is to start with a long list of unique values, then randomly select N of these values. You can do the selection in either of two ways:

  1. Generate a random permutation of these values. Assign the first N permuted values to the subjects.
  2. Use sampling without replacement to select N values from the long list.

A third way to obtain random strings without duplicates is to use a SAS/IML program.

This article shows how to use SAS to generate random four-character strings. You can use the strings as IDs for subjects in a study that requires hiding the identity of the subjects. The article also discusses an important consideration: some four-character strings are uncomplimentary or obscene. I show how to remove vulgar strings from the list of four-character strings so that no patient is assigned an ID such as 'SCUM' or 'SPAZ'.

A mapping between strings and integers

I have previously shown that you can associate each string of English characters with a positive integer. For most applications, it suffices to consider four-character strings because there are 26^4 = 456,976 strings that contain four English characters from the set {A, B, C, ..., Z}. Thus, I will use four-character strings in this article. (If you use five-character strings, you can assign up to 11,881,376 unique IDs.)

The following SAS DATA step outputs the integers 0 through 26^4 - 1. For each integer, the program also creates a four-character string. Small integers are padded with "leading zeros," so that the integers 0, 1, 2, ..., 456975 are associated with the base 26 strings AAAA, AAAB, AAAC, ..., ZZZZ. See the previous article to understand how the program works.

/* Step 1: Generate all integers in the range [0, 26^k -1] for k=4.
   See https://blogs.sas.com/content/iml/2022/09/14/base-26-integer-string.html
   Thanks to KSharp for the suggestion to use the RANK and BYTE functions:
     rank('A') = 65 in ASCII order
     byte(rank('A') + j) is the ASCII character for the j_th capital letter, j=0,1,...,25
*/
%let nChar = 4;                               /* number of characters in base 26 string */
data AllIDs;
array c[0:%eval(&nChar-1)]  _temporary_ ;     /* integer coefficients c[0], c[1], ... */
length ID $&nChar;                            /* string for base b */
b = 26;                                       /* base b = 26 */
offset = rank('A');                           /* = 65 for ASCII order */
do nID = 0 to b**&nChar - 1;
   /* compute the coefficients that represent nID in Base b */
   y = nID;
   do k = 0 to &nChar - 1;
      c[k] = mod(y, b);                       /* remainder when y is divided by b */
      y = floor(y / b);                       /* whole part of division */
      substr(ID,&nChar-k,1) = byte(offset+c[k]); /* represent coefficients as string */
   end;   
   /* Some strings are vulgar. Exclude strings on the denylist. */
   if ID not in (
   'CRAP','DAMN','DUMB','HELL','PUKE','SCUM','SLUT','SPAZ' /* add F**K, S**T, etc */
   ) then output;
end;
drop b offset y k;
run;

The table shows the first few and the last few observations in the data set. The ID column contains all four-character strings of English letters, except for words on a denylist. The list of objectionable words can be quite long, so I included only a few words as an example. I left out the F-bomb, other vulgar terms, and racist slurs. The appendix contains a more complete denylist of four-letter words.

Random permutation of the complete list

The first step created a list of unique four-character ID values. The second step is to randomly select N elements from this list, where N is the number of subjects in your study that need a random ID. This section shows how to perform random selection by permuting the entire list and selecting the first N rows of the permuted data.

The following DATA step generates a random uniform number for each ID value. A call to PROC SORT then sorts the data by the random values. The resulting data set, RANDPERM, has randomly ordered ID values.

/* Option 1: Generate a random permutation of the IDs */
data RandPerm;
set AllIDs;
call streaminit(12345);
_u = rand("uniform");
run;
/* sort by the random variate */
proc sort data=RandPerm;
   by _u;
run;
/*
proc print data=RandPerm(obs=10) noobs; 
   var nID ID;
run;
*/

You can use PROC PRINT to see that the ID values are randomly ordered. Because we started with a list of unique values, the permuted IDs are also unique.

You can now merge the first N random IDs with your data. For example, suppose I want to assign an ID to each student in the Sashelp.Class data set. The following program counts how many subjects (N) are in the data, and merges the first N random IDs with the data:

/* Count how many rows (N) are in the input data set */
%let dsName = sashelp.class;
proc sql noprint;
   select count(*) into :N
   from &DSName;
quit;
 
data All;
merge &DSName RandPerm(keep=ID obs=&N);
run;
 
proc print data=All(obs=8) noobs label;
   var ID Name _NUMERIC_;
run;

The output shows that each student is assigned a random four-character ID.

Use PROC SURVEYSELECT to select IDs

The previous section uses only Base SAS routines: The DATA step and PROC SQL. An alternative approach is to extract N random IDs by using PROC SURVEYSELECT in SAS/STAT software. In this approach, you do not need to generate random numbers and manually sort the IDs. Instead, you extract the random values by using PROC SURVEYSELECT. As of SAS 9.4 M5, the SURVEYSELECT procedure supports the OUTRANDOM option, which causes the selected items in a simple random sample to be randomly permuted after they are selected. Thus, an easier way to assign random IDs is to count the number of subjects, randomly select that many ID values from the (unsorted) set of all IDs, and then merge the results:

/* Option 2: Extract random IDs and merge */
%let dsName = sashelp.class;
proc sql noprint;
   select count(*) into :N
   from &DSName;
quit;
 
proc surveyselect data=AllIDs out=RandPerm noprint seed=54321
     method=srs           /* sample w/o replacement */
     OUTRANDOM            /* SAS 9.4M5 supports the OUTRANDOM option */
     sampsize=&N;         /* number of observations in sample */
run;
 
data All;
merge &DSName RandPerm(keep=ID);
run;
 
proc print data=All(obs=8) noobs label;
   var ID Name _NUMERIC_;
run;

The table shows the ID values that were assigned to the first few subjects.

A SAS/IML method

Both previous methods assign random strings of letters to subjects. However, I want to mention a third alternative because it is so compact. You can write a SAS/IML program that performs the following steps:

  • Create a vector that contains the 26 English letters.
  • Use the EXPANDGRID function to create a matrix that contains all four-character combinations of letters. Concatenate those letters into strings.
  • Optionally, remove objectionable strings. You can use the ELEMENT function to identify the location of strings on a denylist.
  • Use the SAMPLE function to generate a random sample (without replacement).

As usual, the SAS/IML version is very compact. It requires about a dozen lines of code to generate the IDs and write them to a data set for merging with the subject data:

/* Generate N random 4-character strings (remove words on denylist) */
proc iml;
letters = 'A':'Z';                    /* all English letters */
L4 = expandgrid(letters, letters, letters, letters); /* 26^4 combinations */
strings = rowcat(L4);                 /* concatenate into strings */
free L4;                              /* done with matrix; delete */
 
deny = {'CRAP','DAMN','DUMB','HELL','PUKE','SCUM','SLUT','SPAZ'}; /* add F**K, S**T, etc*/
idx = loc( ^element(strings, deny) ); /* indices of strings NOT on denylist */
ALLID = strings[idx];
 
call randseed(1234);
ID = sample(ALLID, &N, "WOR");        /* random sample without replacement from the filtered list */
create RandPerm var "ID"; append; close; /* write IDs to data set */
QUIT;
 
/* merge data and random ID values */
data All;  merge &DSName RandPerm;  run;
 
proc print data=All(obs=8) noobs label;
   var ID Name _NUMERIC_;
run;

Summary

This article shows how to assign a random string value to subjects in an experiment. The best way to do this is to start with a set of unique strings. Make sure there are no objectionable words in the set! Then you can extract a random set of N strings by permuting the set or by using PROC SURVEYSELECT to perform simple random sampling (SRS). In the SAS/IML language, you can generate all strings and extract a random subset by using only a dozen lines of code.

In a DATA step, I used base 26 to generate the list of all four-character strings. By changing the definition of the nChar macro variable, this program also works for character strings of other lengths. If you set the macro to the value k, the program will generate 26^k strings of length k.

Of the three methods, the first (DATA step and PROC SORT) is the simplest. It can also easily handle new subjects who are added to the study after the initial assignment of IDs. You can simply assign the next unused random IDs to the new subjects.

Appendix: A denylist of four-letter words

This appendix shows a list of objectionable words. You do not want to assign a patient ID that is obscene, profane, insulting, racist, sexist, or crude. I used several web pages to compile a list of four-letter words that some people might find objectionable. You can use an IF-THEN statement to block the following list of four-letter English words:

/* some words on this list are from 
   https://www.wired.com/2016/09/science-swear-words-warning-nsfw-af/
   https://en.everybodywiki.com/List_of_profane_words
*/
if ID not in (
'ANAL','ANUS','ARSE','BOOB','BUNG','BUTT','CLIT','COCK','COON','CRAP',
'CUMS','CUNT','DAMN','DICK','DONG','DUMB','DUMP','DYKE','FAGS','FUCK',
'GOOK','HEBE','HELL','HOMO','JEEZ','JERK','JISM','JIZZ','JUGS','KIKE',
'KNOB','KUNT','MICK','MILF','MOFO','MONG','MUFF','NADS','PAKI','PISS',
'POON','POOP','PORN','PUBE','PUKE','PUSS','PUTO','QUIM','SCUM','SHAG',
'SHAT','SHIT','SLAG','SLUT','SMEG','SPAZ','SPIC','SUCK','TARD','THOT',
'TOSS','TURD','TWIT','TWAT','WANK','WANG'
) then output;

The post Generate random ID values for subjects in SAS appeared first on The DO Loop.

September 7, 2022
 

Monotonic transformations occur frequently in math and statistics. Analysts use monotonic transformations to transform variable values, with Tukey's ladder of transformations and the Box-Cox transformations being familiar examples. Monotonic functions also figure prominently in probability theory because the cumulative distribution function is a monotonic increasing function. For a continuous distribution whose cumulative distribution function is strictly increasing, the quantile function is also monotonic.

In a recent project, I needed to determine whether a certain function is monotonic increasing. Because the function is only known at a finite sequence of values, the reduced problem is to determine whether a sequence is increasing. I have previously shown that you can use the DIF function in SAS to test for an increasing sequence. This article shows how to apply the method to the problem of deciding whether a function is increasing.

What is a monotone sequence?

There are two types of monotonicity. In a weakly monotonic sequence, adjacent terms can be equal. That is, the difference between adjacent terms is allowed to be 0. In a strongly monotonic sequence (also called strictly monotonic), adjacent terms cannot be equal. For a strictly increasing sequence, the difference between adjacent terms is always positive. For a strictly decreasing sequence, the difference between adjacent terms is always negative. By itself, the term monotonic means the sequence is either increasing or decreasing.

Thus, there are four different tests for monotonicity. You can test whether a sequence is weakly increasing, strictly increasing, weakly decreasing, or strictly decreasing.
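
In symbols, a sequence \(x_1, x_2, \ldots, x_n\) is weakly increasing if \(x_{i+1} \geq x_i\) for all i, strictly increasing if \(x_{i+1} > x_i\) for all i, weakly decreasing if \(x_{i+1} \leq x_i\) for all i, and strictly decreasing if \(x_{i+1} < x_i\) for all i.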

Test for monotone increasing in SAS

The following SAS/IML module evaluates a vector and tests whether the elements are increasing. It uses the DIF function to produce a vector that contains the differences between adjacent elements of the input vector. That is, if x = {x1, x2, x3, ..., xn} is the input vector, then d = dif(x,1,1) is the vector {x2-x1, x3-x2, ..., x[n]-x[n-1]}. The second argument specifies the lag. The third argument is a flag that specifies whether to delete the leading missing values. (See the Appendix for more information.) By default, the function tests for a weakly increasing sequence.

proc iml;
/* Test whether a sequence of elements is monotonic increasing.
   Valid options are 
   strict=0 : (Default) Return 1 if a sequence is nondecreasing
   strict=1 : Return 1 if a sequence is strictly increasing
*/
start IsIncr(_x, strict=0);
   x = colvec(_x);
   if nrow(x)=1 then return(1);      /* one element is always monotonic! */
   d = dif(x,1,1);                   /* lag=1; delete initial missing value */
   if strict then 
      return( all(d > 0) );
   return( all(d >= 0) );
finish;
 
/* test whether sequences are increasing */
x = {0,2,2,2,6,7,9};
y = {0,1,3,4,6,7,9};
z = {0,1,3,4,2,7,9};
b1 = IsIncr(x);     /* test weakly increasing */
b2 = IsIncr(x, 1);  /* test strictly increasing */
b3 = IsIncr(y, 1);  /* test strictly increasing */
b4 = IsIncr(z);     /* test weakly increasing */
 
print b1 b2 b3 b4;

The IsIncr function is called four times:

  1. The first call returns the value 1 because the elements of x are weakly monotonic (nondecreasing).
  2. The second call returns the value 0 because the elements of x are not strictly increasing.
  3. The third call returns the value 1 because the elements of y are strictly increasing.
  4. The fourth call returns the value 0 because the elements of z are not monotonic.

Test for monotone decreasing in SAS

If a sequence {x1, x2, x3, ...} is monotone increasing, then the sequence obtained by multiplying each element by -1 is monotone decreasing. Therefore, it is trivial to write a function that tests whether a sequence is decreasing: simply test whether the negative of the sequence is increasing! This is accomplished by the following SAS/IML function:

/* Test whether a sequence of elements is monotonic decreasing.
   strict=0 : (Default) Return 1 if a sequence is nonincreasing
   strict=1 : Return 1 if a sequence is strictly decreasing
*/
start IsDecr(x, strict=0);
   return IsIncr(-x, strict);
finish;
 
/* test whether sequences are decreasing */
u = {9,8,7,7,6,2,0};
b5 = IsDecr(u);     /* test weakly decreasing */
b6 = IsDecr(u, 1);  /* test strictly decreasing */
print b5 b6;

The sequence is weakly decreasing but not strictly decreasing. The first call to the IsDecr function returns 1. The second call returns 0.

An application: Test whether a transformation is monotonic

I used the IsIncr function to address a general question: Can you detect whether an unknown univariate function is monotonic increasing?

In a recent project, I was modeling the cumulative distribution function (CDF) of an unknown continuous distribution. By definition, a CDF must be increasing, but for some values of the parameters, the model is not an increasing function. I needed to identify these "infeasible" parameter values in a quick and reliable manner.

Mathematically, an increasing function, F, has the property that for every increasing sequence, {x_i}, the image of that sequence, {F(x_i)}, is also increasing. Numerically, you can generate an increasing sequence in the domain and ask whether the image of that sequence is also increasing.

Here's an example. Suppose you want to test whether the following functions are increasing:
F1(x) = (5 - 4*x) log( x/(1-x) )
F2(x) = (5 - 5.2*x) log( x/(1-x) )
The functions are defined on the domain (0, 1). The following program defines an increasing sequence on (0,1) and tests whether the image of the sequence is also increasing for F1 and for F2:

start Func1(x);
   return (5 - 4*x)#log( x/(1-x) );
finish;
start Func2(x);
   return (5 - 5.2*x)#log( x/(1-x) );
finish;
 
dt = 0.005;
x = do(dt, 1-dt, dt);  /* increasing sequence on a fine grid */
y1 = Func1(x);         /* image of sequence under F1 */
y2 = Func2(x);         /* image of sequence under F2 */
b1 = IsIncr(y1);
b2 = IsIncr(y2);
print b1 b2;

The output indicates that the first function is increasing, but the second function is not. The following graph of the second function shows, indeed, that the function is not strictly increasing.

The accuracy of this method depends on choosing a sequence on a fine grid of points in the domain. It assumes that the derivative of the function is bounded so that the function cannot quickly increase and decrease within one of the small gaps between consecutive elements of the sequence.

Summary

This article shows a way to test whether a function is monotonic on an interval. A common application is to test for an increasing function, but it is equally easy to test for a decreasing function. You can test whether the function is weakly increasing or strictly increasing.

These numerical tests evaluate a function at a finite set of points. Consequently, they can rule out the hypothesis of monotonicity, but they do not prove that the function is increasing everywhere. Nevertheless, these tests work well in practice if you choose the input sequence on a fine grid in the domain. If you have a mathematical bound on the derivative of the function, you can say more.

Appendix: A useful option for the DIF function

By default, the DIF function in SAS/IML returns a vector that is the same size as the input vector. If k is the lag parameter, the first k elements are missing values. The differences between elements are contained in the elements k+1, k+2, and so forth. If you specify the value 1 for the third option, the DIF function deletes the leading missing values. The result is a vector that has n – k elements, where n is the length of the input vector and k is the lag parameter.

For example, the following example passes in k=1 and deletes the leading missing value for the result:

t = {0,3,3,2,7};   /* input argument has 5 elements */
d = dif(t, 1, 1);  /* LAG=1; delete leading missing value */
print d;           /* result has 4 elements */

The post A test for monotonic sequences and functions appeared first on The DO Loop.

August 22, 2022
 

A SAS customer asked how to use the Box-Cox transformation to normalize a single variable. Recall that a normalizing transformation is a function that attempts to convert a set of data to be as nearly normal as possible. For positive-valued data, introductory statistics courses often mention the log transformation or the square-root transformation as possible ways to eliminate skewness and normalize the data. Both of these transformations are part of the family of power transformations that are known as the Box-Cox transformations. A previous article provides the formulas for the Box-Cox transformation.

Formally, a Box-Cox transformation is a transformation of the dependent variable in a regression model. However, the documentation of the TRANSREG procedure contains an example that shows how to perform a one-variable transformation. The trick is to formulate the problem as an intercept-only regression model. This article shows how to perform a univariate Box-Cox transformation in SAS.

A normalizing transformation

Let's look at the distribution of some data and then try to transform it to become "more normal." The following call to PROC UNIVARIATE creates a histogram and normal quantile-quantile (Q-Q) plot for the MPG_Highway variable in the Sashelp.Cars data set:

%let dsName = Sashelp.Cars;    /* name of data set */
%let YName = MPG_Highway;      /* name of variable to transform */
 
proc univariate data=&dsName;
   histogram &YName / normal;
   qqplot &YName / normal(mu=est sigma=est);
   ods select histogram qqplot;
run;

The histogram shows that the data are skewed to the right. This is also seen in the Q-Q plot, which shows a point pattern that is concave up.

You can use a Box-Cox transformation to attempt to normalize the distribution of the data. But you must formulate the problem as an intercept-only regression model. To use the TRANSREG procedure for the Box-Cox transformation, do the following:

  • The syntax for PROC TRANSREG requires an independent variable (regressor). To perform an intercept-only regression, you need to manufacture a constant variable and specify it on the right-hand side of the MODEL statement.
  • The procedure will complain that the independent variable is constant, but you can add the NOZEROCONSTANT option to suppress the warning.
  • Optionally, the documentation for PROC TRANSREG points out that you can improve efficiency by using the MAXITER=0 option.

Notice that the values of the response variable are all positive; therefore, we can perform a Box-Cox transformation without having to shift the data. The following statements use a DATA step view to create a new constant variable named _ZERO. The call to PROC TRANSREG uses an intercept-only regression model to transform the response variable:

data AddZero / view=AddZero;
   set &dsName;
   _zero = 0;     /* add a constant variable to use for an intercept-only model */
run;
 
proc transreg data=AddZero details maxiter=0 nozeroconstant;
   model BoxCox(&YName / geometricmean convenient lambda=-2 to 2 by 0.05) = identity(_zero);
   output out=TransOut;    /* write transformed variable to data set */
run;

The graph is fully explained in my previous article about the Box-Cox transformation. In the graph, the horizontal axis represents the value of the λ parameter in the Box-Cox transformation. For this example, the LAMBDA= option in the BOXCOX transformation specifies values for λ in the interval [-2, 2]. The vertical axis shows the value of the normal log-likelihood function for the residuals after the dependent variable is transformed for each value of λ. The log-likelihood attains its maximum at λ=0.05. However, because the CONVENIENT option was used and because λ=0 is included in the confidence interval for the optimal value (and is "convenient"), the procedure selects λ=0 for the transformation. The Box-Cox transformation for λ=0 is the logarithmic transformation Y → G*log(Y) where G is the geometric mean of the response variable. (If you do not want to scale by using the geometric mean, omit the GEOMETRICMEAN option in the BOXCOX transformation.)

To summarize, the Box-Cox method selects a logarithmic transformation as the power transformation that makes the residuals "most normal." In an intercept-only regression, the distribution of the residuals is the same as the distribution of the centered data. Consequently, this process results in a transformation that makes the response variable as normally distributed as possible, within the family of power transformations.
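
If you want to see the selected transformation explicitly, the following sketch (my own check, not part of the original program) computes the geometric mean of the response and applies Y → G*log(Y) directly. You can compare the result with the transformed variable T&YName that PROC TRANSREG writes to the OUT= data set:

/* compute the geometric mean G = exp(mean(log(Y))) and apply G*log(Y) manually */
proc sql noprint;
   select exp(mean(log(&YName))) into :G trimmed from &dsName;
quit;
 
data CheckLog;
   set &dsName(keep=&YName);
   T_manual = &G * log(&YName);   /* Box-Cox transformation for lambda=0, scaled by G */
run;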

Visualize the transformed variable

You can use PROC UNIVARIATE to visualize the distribution of the transformed variable, as follows:

proc univariate data=TransOut;
   histogram T&YName / normal kernel;
   qqplot T&YName / normal(mu=est sigma=est);
   ods select histogram qqplot;
run;

The distribution of the transformed variable is symmetric. The distribution is not perfectly normal (it never will be). In the Q-Q plot, the point pattern shows quantiles that are below the diagonal line on the left and above the line on the right. This indicates outliers (with respect to normality) at both ends of the data distribution.

Summary

This article shows how to perform a Box-Cox transformation of a single variable in SAS. The Box-Cox transformation is intended for regression models, so the trick is to run an intercept-only regression model. To do this, you can use a SAS DATA view to create a constant variable and then use that variable as a regressor in PROC TRANSREG. The procedure produces a Box-Cox plot, which visualizes the normality of the transformed variable for each value of the power-transformation parameter. The parameter value that maximizes the normal log-likelihood function is the best parameter to choose, but in many circumstances, you can use a nearby "convenient" parameter value. The convenient parameter is more interpretable because it favors well-known transformations such as the square-root, logarithmic, and reciprocal transformations.

Appendix: Derivation of logarithm as the limiting Box-Cox transformation

Upon first glance, it is not clear why the logarithm is the correct limiting form of the Box-Cox power transformations as the parameter λ → 0. The power transformations have the form \((x^\lambda - 1)/\lambda\). If you rewrite \(x^\lambda = \exp(\lambda \log(x))\) and use the Taylor series expansion of \(\exp(z)\) near z=0, you obtain
\(\frac{\exp(\lambda \log(x)) - 1}{\lambda} = \frac{(1 + \lambda \log(x) + \lambda^2/2 \log^2(x) + \cdots) - 1}{\lambda} \approx \log(x)\) as λ → 0.
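
As a quick numerical check (a short sketch, not from the original article), you can evaluate the power transformation for progressively smaller values of λ and compare the result with log(x):

proc iml;
x = 5;
lambda = {0.1, 0.01, 0.001};
BC = (x##lambda - 1) / lambda;   /* simple Box-Cox transformation of x */
logx = log(x);                   /* limiting value as lambda -> 0 */
print lambda BC logx;
quit;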

The post The univariate Box-Cox transformation appeared first on The DO Loop.

August 17, 2022
 

In the 1960s and '70s, before nonparametric regression methods became widely available, it was common to apply a nonlinear transformation to the dependent variable before fitting a linear regression model. This is still done today, with the most common transformation being a logarithmic transformation of the dependent variable, which fits the linear least squares model log(Y) = X*β + ε, where ε is a vector of independent normally distributed variates. Other popular choices include power transformations of Y, such as the square-root transformation.

The Box-Cox family of transformations (1964) is a popular way to use the data to suggest a transformation for the dependent variable. Some people think of the Box-Cox transformation as a univariate normalizing transformation, and, yes, it can be used that way. (I will discuss the univariate Box-Cox transformation in another article.) However, the primary purpose of the Box-Cox transformation is to transform the dependent variable in a regression model so that the residuals are normally distributed. The Box-Cox transformation attempts to find a new effect W = f(Y) such that the residuals (W - X*β) are normally distributed, where f is a power transformation that is chosen to maximize the normality of the residuals. Recall that normally distributed residuals are useful if you intend to make inferential statements about the parameters in the model, such as confidence intervals and hypothesis tests.

It is important to remember that you are normalizing residuals, not the response variable! Linear regression does not require that the variables themselves be normally distributed.

This article shows how to use the TRANSREG procedure in SAS to compute a Box-Cox transformation of Y so that the least-squares residuals are approximately normally distributed.

The simplest Box-Cox family of transformations

In the simplest case, the Box-Cox family of transformations is given by the following formula:
\( f_\lambda(y) = \left\{ \begin{array}{l l l} (y^\lambda - 1) / \lambda & & \lambda \neq 0 \\ \log (y) & & \lambda = 0 \end{array} \right. \)
The objective is to use the data to choose a value of the parameter λ that maximizes the normality of the residuals (fλ(Y) - X*β).
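
As a concrete illustration (a minimal sketch, not the internal implementation that PROC TRANSREG uses), the simple family is easy to compute directly for a vector of positive values:

proc iml;
/* simple Box-Cox transformation of a positive vector y for a given lambda */
start BoxCoxSimple(y, lambda);
   if lambda = 0 then return( log(y) );
   return( (y##lambda - 1) / lambda );
finish;
 
y = {1, 2, 5, 10, 20};
T = BoxCoxSimple(y, 0.5);    /* lambda=0.5 is related to the square-root transformation */
print y T;
quit;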

In SAS, you can use PROC TRANSREG to perform a regression that includes the BOXCOX transformation.

Before performing a Box-Cox transformation, let's demonstrate why it might be necessary. The following call to PROC GLM in SAS performs a least squares regression of the MPG_Highway variable (response) onto two explanatory variables (Weight and Wheelbase) for vehicles in the Sashelp.Cars data set. The residuals from the model are examined in a histogram and in a Q-Q plot:

%let dsName = Sashelp.Cars;
%let XName = Weight Wheelbase;
%let YName = MPG_Highway;
 
proc glm data=&dsName plots=diagnostics(unpack);
   model &YName = &XName;
   ods select ResidualHistogram QQPlot;
quit;

The graphs show that the distribution of the residuals for this model deviates from normality in the right tail. Apparently, there are outliers (large residuals) in the data. One way to handle this is to increase the complexity of the model by adding additional effects. Another way to handle it is to transform the dependent variable so that the relationship between variables is more linear. The Box-Cox method is one way to choose a transformation.

To implement the Box-Cox transformation of the dependent variable, use the following syntax in the TRANSREG procedure in SAS:

  • On the MODEL statement, enclose the name of the dependent variable in a BOXCOX transformation option.
  • After the name of the dependent variable, you can include options for the transformation. These options determine which values of the parameter λ are used for the transformation. I typically use the values LAMBDA=-2 to 2 by 0.1, although sometimes I will use -3 to 3 by 0.1.
  • I like to use the CONVENIENT option, which tells PROC TRANSREG to use simpler transformations, when possible, instead of the mathematically optimal transformation. The idea is that if the optimal transformation is a value like λ=0.4, you might prefer to use a square-root transformation (λ=0.5) if there is no significant difference between the two parameter values. Similarly, you might prefer a LOG transformation (λ=0) over an inconvenient value such as λ=0.1. The documentation contains details of the CONVENIENT option.

The following call to PROC TRANSREG implements the Box-Cox transformation for these data and saves the residuals to a SAS data set:

proc transreg data=&dsName ss2 details plots=(boxcox);
   model BoxCox(&YName / convenient lambda=-2 to 2 by 0.1) = identity(&XName);
   output out=TransOut residual;
run;

The procedure forms 41 regression models, one for each requested value of λ in the range [-2, 2]. To help you visualize how each value of λ affects the normality of the residuals, the procedure creates the Box-Cox plot, which is shown. The graph is a panel that shows two plots:

  • The bottom panel shows the graph of the log-likelihood function (for the normality of the residuals) for each value of λ that was examined. For these data and family of models, the log-likelihood is maximized for λ= -0.4. A blue band shows a confidence interval for the maximal lambda. The confidence interval includes the value -0.5, which is a more "convenient" value because it is related to the transformation y → 1/sqrt(y). Since the CONVENIENT option was specified, λ= -0.5 will be used for the regression model. (Note: The Box-Cox transformation that is used is 2*(1 - 1/sqrt(y)), which is a scaled and shifted version of 1/sqrt(y).)
  • The top panel shows the graph of the F statistics for each effect in the model as a function of the λ parameter. For the optimal and the "convenient" values of λ, the F statistics are large (approximately F=700 for the Weight variable and F=30 for the Wheelbase variable), which means that both effects are significant predictors for the transformed dependent variable.

The procedure also displays a table that summarizes the optimal value of λ, the value used for the regression, and other information:

You can use PROC UNIVARIATE to plot the distribution of the residuals for the selected model (λ= -0.5). The name of the residual variable is RYName, where YName is the name of the dependent variable.

proc univariate data=TransOut(keep=R&YName);
   histogram R&YName / normal kernel;
   qqplot R&YName / normal(mu=est sigma=est) grid;
   ods select histogram qqplot GoodnessOfFit Moments;
run;

As shown by the histogram of the residuals, the Box-Cox transformation has eliminated the extreme outliers and improved the fit. Be aware that "improving the normality of the residuals" does not mean that the residuals are perfectly normal, especially if you use the CONVENIENT option. A test for normality (not shown) rejects the hypothesis of normality. However, the inferential statistics for linear regression are generally robust to mild departures from normality.

A normal quantile-quantile (Q-Q) plot provides an alternate way to visualize normality. The histogram appears to be slightly left-skewed. In a Q-Q plot, a left-skewed distribution will have a curved appearance, as follows:

The full Box-Cox distribution

There are two mathematical issues with the simple Box-Cox family of transformations:

  • Every element of Y must be positive, otherwise, the transformation is not well-defined. A common way to address this issue is to shift the Y variable (if necessary) by a constant, c, so that Y+c is strictly positive. One choice for c is c = 1 - min(Y).
  • The magnitude of the transformed variable depends on the parameter λ. This makes it hard to compare regression statistics for different values of λ. Spitzer (1982) popularized using the geometric mean of the response to standardize the transformations so that they are easier to compare. Accordingly, you will often see the geometric mean incorporated into the Box-Cox transformation.

Thus, the general formulation of the Box-Cox transformation incorporates two additional terms: a shift parameter and a scaling parameter. If any of the response values are not positive, you must add a constant, c. (Sometimes c is also used when Y is very large.) The scaling parameter is a power of the geometric mean of Y. The full Box-Cox transformation is
\( f_\lambda(y) = \left\{ \begin{array}{l l l} ((y + c)^\lambda - 1) / (\lambda g) & & \lambda \neq 0 \\ \log (y + c) / g & & \lambda = 0 \end{array} \right. \)
where \(g = G^{\lambda - 1}\), and G is the geometric mean of Y.

Although the response variable in this example is already positive, the following program uses a PROC SQL calculation to compute the shift parameter c = 1 - min(Y) and write the value to a macro variable. You can then use the PARAMETER= option in the BOXCOX transformation to specify the shift parameter. The geometric mean is much simpler to specify: merely use the GEOMETRICMEAN option. Thus, the following statements implement the full Box-Cox transformation in SAS:

/* use full Box-Cox transformation with c and G. */
/* compute c = 1 - min(Y) and put in a macro variable */
proc sql noprint;                              
 select 1-min(&YName) into :c trimmed from &dsName;
quit;
%put &=c;
 
proc transreg data=&dsName ss2 details plots=(boxcox);
   model BoxCox(&YName / parameter=&c geometricmean 
                         convenient lambda=-2 to 2 by 0.05) = identity(&XName);
   output out=TransOut residual;
run;

For these data, the value of the c parameter is -11, so when we add c, we are shifting the response variable to the left. For variables that have negative values (the usual situation in which you would use c), you shift the response to the right. The geometric mean scales each transformation. The bottom plot in the panel indicates that λ = 0.3 is the value of λ that makes the residuals most normal. For this analysis, λ = 0.25 is NOT in the confidence interval for λ, so no "convenient" parameter value is used. Instead, the chosen transformation is 0.3.

If you want to examine the distribution of the residuals, you can use the following call to PROC UNIVARIATE:

/* optional: plot the distribution of the residuals */
proc univariate data=TransOut(keep=R&YName);
   histogram R&YName / normal kernel;
   qqplot R&YName / normal(mu=est sigma=est) grid;
   ods select histogram qqplot;
run;

Both the histogram (not shown) and the Q-Q plot indicate that the residuals are approximately normal but are left-skewed.

Advantages and disadvantages of the Box-Cox transformation

The advantage of the Box-Cox transformation is that it provides an automated way to transform a dependent variable in a regression model so that the residuals for the model are as normal as possible. The transformations are all power transformations, and the logarithmic transformation is a limiting case that is also included.

The Box-Cox transformation also has several disadvantages:

  • The Box-Cox transformation cannot correct for a poor model, and it might obscure the fact that the model is a poor fit.
  • The Box-Cox transformation can be hard to interpret. Similarly, it can be hard to explain to a client. The client asked you to predict a variable Y. Instead, you show them a model that predicts a variable defined as \(W = ((Y + c)^\lambda - 1) / (\lambda G^{\lambda - 1})\), where λ= -0.3 and G is the geometric mean of Y. You can, of course, invert the transformation, but it is messy.

Modern nonparametric regression methods (such as splines and loess curves) might provide a better fit. However, I suggest looking at the Box-Cox transformation to see if the method suggests a simple transformation, such as the inverse, log, square-root, or quadratic transformations (λ = -1, 0, 0.5, 2). If so, you can choose the parametric approach for the transformed response.

Summary

This article shows how to use PROC TRANSREG in SAS to perform a Box-Cox transformation of the response variable in a regression model. The procedure examines a family of power transformations indexed by a parameter, λ. For each value of λ, the procedure transforms the response variable and computes a regression model. The optimal parameter is the one for which the distribution of the residuals is "most normal," as measured by a maximum likelihood computation. You can use the CONVENIENT option to specify that a nearby simple transformation (for example, a square-root transformation) is preferable to a less intuitive transformation.

This article assumes that there is one or more regressors in the model. The TRANSREG procedure can also perform a univariate Box-Cox transformation. The univariate case is discussed in a second article.


The post The Box-Cox transformation for a dependent variable in a regression appeared first on The DO Loop.

August 15, 2022
 

John Tukey was an influential statistician who proposed many statistical concepts. In the 1960s and 70s, he was fundamental in the discovery and exposition of robust statistical methods, and he was an ardent proponent of exploratory data analysis (EDA). In his 1977 book, Exploratory Data Analysis, he discussed a small family of power transformations that you can use to explore the relationship between two variables. Given bivariate data (x1, y1), (x2, y2), ..., (xn, yn), Tukey's transformations attempt to transform the Y values so that the scatter plot of the transformed data is as linear as possible, and the relationship between the variables is simple to explain.

Tukey's Transformations

Tukey's transformations depend on a "power parameter," λ, which can be either positive or negative. When Y > 0, the transformations are as follows:

  • When λ > 0, the transformation is T(y; λ) = y^λ.
  • When λ = 0, the transformation is T(y; λ) = log(y).
  • When λ < 0, the transformation is T(y; λ) = –(y^λ).

If not all values of Y are positive, it is common to shift Y by adding a constant to every value. A common choice is to add c = 1 – min(Y) to every value. This ensures that the shifted values of Y are positive, and the minimum value is 1.

Although Tukey's transformation is defined for all values of the parameter, λ, he argues for using values that result in simple relationships such as integer powers (squares and cubes), logarithms, and reciprocal power laws such as inverse-square laws. Thus, he simplified the family to the small set of useful transformations that correspond to the parameters in the set {-2, -1, -1/2, 0, 1/2, 1, 2}.

Some practitioners add -3, -1/3, 1/3, and 3 to this set of transformations. Tukey's ladder of transformations is similar to the famous Box-Cox family of transformations (1964), which I will describe in a subsequent article. The Box-Cox transformations have a different objective: they transform variables to achieve normality, rather than linearity.

A graphical method for Tukey's transformation

Tukey proposed many graphical techniques for exploratory data analysis. To explore whether a transformation of the Y variable results in a nearly linear relationship with the X variable, he suggests plotting the data. This is easily performed by using a SAS DATA step to transform the Y variable for each value of λ in the set {-2, -1, -1/2, 0, 1/2, 1, 2}. To ensure that the transformation only uses positive values, the following program adds c = 1 - min(Y) to each value of Y before transforming. The resulting scatter plots are displayed by using PROC SGPANEL.

/* define the data set and the X and Y variables */
%let dsName = Sashelp.Cars;
%let XName = Weight;
%let YName = MPG_Highway;
 
/* compute c = 1 - min(Y) and put in a macro variable */
proc sql noprint;                              
 select 1-min(&YName) into :c trimmed 
 from &dsName;
quit;
%put &=c;
 
data TukeyTransform;
array lambdaSeq[7] _temporary_ (-2, -1, -0.5, 0, 0.5, 1, 2);
set &dsName;
x = &XName;
y = &YName + &c;  /* offset by c to ensure Y >= 1 */
do i = 1 to dim(lambdaSeq);
   lambda = lambdaSeq[i];
   if lambda=0 then        z = log(y);
   else if lambda > 0 then z = y**lambda;
   else                    z = -(y**lambda);
   output;
end;
keep lambda x y z;
label x="&xName" y="&yName" z="Transformation of &YName";
run;
 
ods graphics / width=400px height=840px;
title "Tukey's Ladder of Transformations";
proc sgpanel data=TukeyTransform;
   panelby lambda / layout=rowlattice uniscale=column onepanel rowheaderpos=right;
   scatter x=x y=z;
   rowaxis grid; colaxis grid;
run;

The panel of scatter plots shows seven transformations of the Y variable plotted against X. It is difficult to visually discern which plot is "most linear." The graph for λ=0 seems to be most linear, but the scatter plot for λ=0.5 also has very little curvature.

This panel of graphs is typical. Graphical methods can hint at the relationships between variables, but analytical techniques are often necessary to quantify the relationships. I put the name of the data set and the name of the variables in macro variables so that you can easily run the code on other data and investigate how it works on other examples.

An analytical method for Tukey's transformation

Tukey emphasized graphical methods in Exploratory Data Analysis (1977), but his ladder of transformations has a rigorous statistical formulation. The correlation coefficient is a measure of linearity in a scatter plot, so for each value of λ you can compute ρλ = corr(X, T(Y; λ)), where T(Y; λ) is the Tukey transformation for the parameter λ. The value of λ that results in the correlation that has the largest magnitude (closest to ±1) is the optimal parameter. In SAS, you can sort the transformed data by the Lambda variable and use PROC CORR to compute the correlation for each parameter value:

proc sort data=TukeyTransform;
   by lambda;
run;
 
/* compute corr(X, T(Y; lambda)) for each lambda */
proc corr data=TukeyTransform outp=CorrOut noprint;
   by lambda;
   var Z;
   with X;
run;
 
proc print data=CorrOut(where=(_TYPE_="CORR")) noobs label;   
   var lambda z;
   label z="corr(x, T(y))";
run;

The correlation coefficients indicate that λ=0 results in a correlation of -0.85, which is the largest magnitude. For these data, λ=0.5 also does a good job of linearizing the data (ρ=-0.84). Several values of λ result in a correlation coefficient that is very strong, which is why it is difficult to find the best parameter by visually inspecting the graphs.

Summary

Tukey's ladder of transformations is a way to check whether two variables, X and Y, have a simple nonlinear relationship. For example, they might be related by a square-root or a logarithmic transformation. By using Tukey's ladder of transformations, you can discover relationships between variables such as inverse square laws, reciprocal laws, power laws, and logarithmic/exponential laws.

The post Tukey's ladder of variable transformations appeared first on The DO Loop.