rick wicklin

September 20, 2021
 

The SAS/IML language supports lists, which are containers that store other objects, such as matrices and other lists. A primary use of lists is to pack objects of various types into a single symbol that can be passed to and from modules. A useful feature of lists is that they can dynamically grow or shrink as necessary. You can use the ListAddItem subroutine to add a new item to an existing list, which is convenient if you do not know until run time how many items the list needs to hold.
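
For readers who have not used lists before, here is a minimal sketch (my own toy example, not from the original program) that creates an empty list and grows it dynamically:

proc iml;
L = ListCreate();               /* create an empty list */
call ListAddItem(L, {1 2 3});   /* add a numeric matrix */
call ListAddItem(L, "hello");   /* add a character item */
nItems = ListLen(L);            /* the list now contains 2 items */
print nItems;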

A SAS/IML programmer asked whether it is possible to add items to a sublist of a list. The answer is yes, and there is even an option to make the process efficient. Intuitively, you need to copy the sublist, modify the copy, and then replace the sublist in the original list. Because copying a list can be an expensive operation, the SAS/IML language provides a special option to move the list items into new symbols without copying. The move operator assigns the memory to a new symbol and "nulls out" the previous symbol.

Suppose you have a SAS/IML list, L, that contains sublists. Suppose you want to add a new item to the first sublist (L$1). You can do the following:

  • Get the sublist by using the ListGetSubItem function. For example, the syntax Y = ListGetSubItem(L, 1, 'm'); gets the first sublist and assigns it to the Y symbol. Using 'm' for the third argument means "move" the memory to the new symbol. After this call, L$1 is empty and the symbol Y points to the sublist that was formerly contained in L$1.
  • The symbol Y is a list, so use ListAddItem to add a new item to Y.
  • Now set the first item of L. You can use call ListSetSubItem(L, 1, Y, 'm'); to assign L$1 to be the value of Y. Because you use the 'm' option, the memory is moved, not copied. After this call, Y is an empty symbol. The first item of L contains the list that was formerly in Y.

The schematic diagram above illustrates the process. The following SAS/IML program shows how to implement the technique:

 
proc iml;
/* create list of lists. Specify L$1$1, L$1$2, L$2$1, and L$2$2 */
L = ListCreate(2);
L$1 = ['L11', 12];    /* first  item of L is a list with two items */
L$2 = ['L21', 22];    /* second item of L is a list with two items */
 
/* Now add a third item to L$1 and access it as L$1$3 */
newVal = 23;
Y = ListGetSubItem(L, 1, 'm');     /* get the sublist */
call ListAddItem(Y, newVal);       /* add new item */
call ListSetSubItem(L, 1, Y, 'm'); /* set it back in same position */
 
/* check that L$1$3 exists and that Y is empty */
package load ListUtil;
call struct(L);
 
L13 = L$1$3;
dimY = dimension(Y);
print L13, dimY;

Summary

A useful feature of SAS/IML lists is that a list can dynamically grow. Lists also include an option that moves (rather than copies) memory to and from list items. You can use this feature to efficiently add items to a sublist of a list. For clarity, I used three separate function calls in the example program. If you plan to use this feature repeatedly, you can encapsulate the operations into a SAS/IML user-defined function.
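
For example, the following sketch shows one way to encapsulate the three steps into a module. The module name (AddItemToSubList) and its argument order are my own choices, not part of the original program; SAS/IML modules pass arguments by reference, so the list L is updated in the caller:

proc iml;
start AddItemToSubList(L, pos, item);
   Y = ListGetSubItem(L, pos, 'm');       /* move the sublist out of L      */
   call ListAddItem(Y, item);             /* append the new item            */
   call ListSetSubItem(L, pos, Y, 'm');   /* move the modified sublist back */
finish;
 
/* example usage, mirroring the earlier program */
L = ListCreate(2);
L$1 = ['L11', 12];
L$2 = ['L21', 22];
run AddItemToSubList(L, 1, 23);           /* now L$1$3 contains 23 */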

The post Add an item to a sublist appeared first on The DO Loop.

September 15, 2021
 

I previously wrote about one way to solve the partition problem in SAS. In the partition problem, you divide (or partition) a set of N items into two groups of size k and N-k such that the sum of the items' weights is the same in each group. For example, if the weights of six items are X = {0.4, 1.0, 1.2, 1.7, 2.6, 2.7} and k=3, you can put the weights {0.4, 1.7, 2.7} in one group and the weights {1.0, 1.2, 2.6} in the other group. Both groups contain 4.8 units of weight.

The previous article discussed a brute-force approach in which you explicitly generate all combinations of k items from the set of N items. The program could produce all possible solutions, provided that a solution exists. This article recasts the partition problem as an optimization problem and shows two ways to solve it:

  • A feasibility problem: Define the problem as a set of constraints. Find one or more solutions that satisfy the constraints.
  • An optimization problem: Find a partition that minimizes the difference between the weights in the two groups. When there is a solution, this method finds it. If there is not a solution, this method finds partitions that distribute the weight as equally as possible.

This article shows how to use PROC OPTMODEL in SAS/OR software. In SAS Viya, you can submit the same statements by using the OPTMODEL procedure in SAS Optimization or the runOptModel action. I thank my colleague, Rob Pratt, who wrote the programs in this article and also kindly reviewed both articles about the partition problem.

A feasible set of binary vectors

The previous article formulated the problem in terms of a binary vector. I defined an N-dimensional vector, c, that contained k elements with the value -1 and N-k elements with the value +1. The two values of the vector c indicate whether an item belongs to the first group or the second group. If X is a solution (an ordering of the weights that satisfies the partition problem), then the inner product X`*c=0.

In this article, it is more convenient to define the two groups by using a 0/1 binary indicator vector, Y. With this definition of Y, a solution satisfies the equation X`*Y = X`*(1-Y). The equation specifies that the sum of weights in the '1' group equals the sum of weights in the '0' group.
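
To see that this formulation is equivalent to the one in the previous article, note that (with the convention that Y=1 corresponds to c=-1) the two encodings are related by c = 1 - 2Y. A one-line derivation shows that the two conditions agree:

\[
X^{\prime}c \;=\; X^{\prime}(\mathbf{1}-2Y) \;=\; X^{\prime}\mathbf{1} - 2\,X^{\prime}Y \;=\; 0
\quad\Longleftrightarrow\quad
X^{\prime}Y \;=\; X^{\prime}(\mathbf{1}-Y).
\]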

To define the feasible set of Y vectors that solve the partition problem, you can define the following two constraints:

  1. 1`*Y = k, where 1 is the vector of 1s. This constraint requires that there be exactly k values of Y that are equal to 1.
  2. X`*Y = X`*(1-Y). This constraint means that the sum of the k values of X in the first set equals the sum of the N-k values of X in the second set.

The OPTMODEL procedure enables you to specify the problem and the constraints in a natural syntax. Instead of using vector notation, as I did above, it uses summations and indices. The following statements define a set that has 11 items that are to be partitioned into two groups that contain 5 and 6 items, respectively. This program returns an arbitrary feasible point and prints the solution:

/* feasibility problem: enforce equal group weights */
proc optmodel;
set ITEMS = 1..11;
num weight {ITEMS} = [0.8, 1.2, 1.3, 1.4, 1.6, 1.8, 1.9, 2.0, 2.2, 2.3, 2.5];
num k = 5;
 
/* Y[i] = 1 if item i is in group 1; Y[i] = 0 if item i is in group 2 */
var Y {ITEMS} binary;
 
/* group 1 must contain exactly k items */
con NumItemsInGroup1:            /* 1`*Y = k */
   sum{i in ITEMS} Y[i] = k;
 
/* constraint: each group must contain the same weight */
con SameWeightInEachGroup:      /* weight`*Y = weight`*(1-Y) */
   sum{i in ITEMS} weight[i] * Y[i] = sum{i in ITEMS} weight[i] * (1 - Y[i]);
 
/* find one feasible solution */
solve noobj;        /* default solver is MILP */
print Y weight;

As shown in the output, this solution partitions the weights {1.6, 1.8, 1.9, 2.0, 2.2} into one group and the other six weights into another group.

The solution is shown as a column vector. Because we will want to display multiple solutions, it is convenient to transpose the solution (or set of solutions) so that each solution is a row vector in a matrix. It is also helpful to standardize the solution vector so that the left-most k elements contain the items in one group, and the right-most N-k elements contain the elements in the other group. Fortunately, PROC OPTMODEL provides programming statements and the automatic variable _NSOL_, which tells us how many solutions were found. The following macro code fills a matrix that has _NSOL_ rows and N=11 columns. It then puts one group of weights into the first k columns and the other group in the last N-k columns. Because PROC OPTMODEL is an interactive procedure, it does not stop running until you specify a QUIT statement. Therefore, the procedure is still running, and you can submit the following statements:

/* Macro to print table with one row per solution. Put the group 
   with k items on the left and the group with N-k items on the right. */
%macro printSolutionTable;
for {s in 1.._NSOL_} do;
   leftCount  = 0;
   rightCount = 0;
   for {i in ITEMS} do;
      if Y[i].sol[s] > 0.5 then do;
         leftCount = leftCount + 1;
         soln[s,leftCount] = weight[i];
      end;
      else do;
         rightCount = rightCount + 1;
         soln[s,k+rightCount] = weight[i];
      end;
   end;
end;
print soln;
%mend printSolutionTable;
 
/* define the matrix soln and the index variables */
num soln {1.._NSOL_, 1..card(ITEMS)};
num leftCount, rightCount;
%printSolutionTable;

Find all feasible solutions to the partition problem

The previous section found one solution to the partition problem. If you want to find additional solutions, you can add options to the SOLVE statement. The options depend on the solver that you use. The previous SOLVE statement used the MILP solver. You can obtain additional solutions by using the following options:

/* find up to 100 feasible solutions */
solve noobj with milp / maxpoolsols=100 soltype=best;
put _NSOL_=;
%printSolutionTable;

By the way, you can also use constraint logic programming (CLP) to solve this problem. The CLP solver supports the FINDALLSOLNS option, which you can specify as follows:

/* an alternate way to find all feasible solutions */
solve with clp / findallsolns;

Both methods find all 13 solutions that were found in the previous article.

Minimize the absolute deviations between groups

In the previous sections, the second constraint ensures that the difference between the weights in the two groups is exactly zero. However, you might have data that cannot be evenly divided. In that case, you might want to find a partition that minimizes the absolute deviance between the two groups. You can do that by using the second equation as an objective function rather than as a constraint.

Note that you must minimize the ABSOLUTE VALUE of the difference between the weights in the two groups. If you minimize the difference, the optimization will put the smallest weights in one group and the largest in the other.

The absolute value of a linear function is not linear. However, you can use a standard trick to rewrite the objective function as a linear function subject to two linear constraints. Let x be the quantity to minimize. Recall that the absolute value |x| is defined by x when x≥0 and by -x when x≤0. Introduce a new variable, z, subject to the linear constraints that z ≥ x and z ≥ -x. The graph at the right shows the geometry of this choice. Minimizing the constrained variable z is equivalent to minimizing |x|, but it can be done by using linear programming techniques. For more information about handling absolute values in the objective function, see this "open textbook" from Cornell University.
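
In symbols, the reformulation is the standard linearization of an absolute value (nothing here is specific to the partition problem):

\[
\min_{x}\;|x|
\quad\Longleftrightarrow\quad
\min_{x,\,z}\; z
\quad\text{subject to}\quad z \ge x,\;\; z \ge -x.
\]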

By using this trick, you can formulate the partition problem as an optimization that minimizes the absolute deviation between the sum of weights in the two groups:

/* optimization problem: minimize absolute difference of group weights */
proc optmodel;
set ITEMS = 1..11;
num weight {ITEMS} = [0.8, 1.2, 1.3, 1.4, 1.6, 1.8, 1.9, 2.0, 2.2, 2.3, 2.5];
num k = 5;
 
/* Y[i] = 1 if item i is in group 1; Y[i] = 0 if item i is in group 2 */
var Y {ITEMS} binary;
 
/* group 1 must contain exactly k items */
con NumItemsInGroup1:
   sum{i in ITEMS} Y[i] = k;
 
/* minimize absolute difference of group weights */
/* Use a standard trick to handle absolute values in an objective function:
   z=|L*y| is z = {  L*y if L*y >= 0
                  { -L*y if L*y <  0
*/
var AbsValue;
min GroupWeightDifference = AbsValue;
 
con Linearize1:
   AbsValue >=  sum{i in ITEMS} weight[i] * Y[i] - sum{i in ITEMS} weight[i] * (1 - Y[i]);
con Linearize2:
   AbsValue >= -sum{i in ITEMS} weight[i] * Y[i] + sum{i in ITEMS} weight[i] * (1 - Y[i]);
 
/* find up to 100 optimal solutions */
solve with milp / maxpoolsols=100 soltype=best presolver=none;
put _NSOL_=;
 
num soln {1.._NSOL_, 1..card(ITEMS)};
num leftCount, rightCount;
%printSolutionTable;

This optimization also finds the 13 solutions, although the rows are displayed in a different order. The output is not shown.

An advantage of this formulation is that this technique enables you to handle the case where an exact solution is not possible. For example, if you change the smallest weight from 0.8 to 0.85, it is no longer possible to partition the weights evenly. However, the optimization continues to find 13 solutions, only now each solution defines two groups in which the weights differ by 0.05, which is the smallest possible difference.

Easier syntax in SAS Viya

The trick that handles an absolute value in an objective function is one example of a general method that can "linearize" several types of problems. In SAS Viya, there is an automatic way to perform "linearization" for these standard problems. In SAS Optimization software, PROC OPTMODEL supports the LINEARIZE option, beginning in SAS Viya 2020.1.1 (December 2020). Thus, in Viya you can solve the partition problem by using a simpler syntax:

/* show the LINEARIZE option in SAS Viya 2020.1.1 */
min GroupWeightDifference =
   abs( sum{i in ITEMS} weight[i] * Y[i] - sum{i in ITEMS} weight[i] * (1 - Y[i]) );
solve with milp LINEARIZE / maxpoolsols=100 soltype=best;

Summary

This article shows some optimization techniques for solving the partition problem on N items. You can use PROC OPTMODEL in SAS to solve the partition problem in two ways. If the partition problem has a solution, you can formulate the problem as a feasibility problem. Alternatively, you can minimize the absolute deviations between the two groups. This second approach will also find approximate partitions when an exact solution is not possible.

This article also shows a standard trick that enables you to optimize the absolute value of a linear objective function. The trick is to replace the absolute value by using two linear constraints. In SAS Viya, the LINEARIZE option implements the trick automatically.

The post The partition problem: An optimization approach appeared first on The DO Loop.

September 13, 2021
 

Photograph by Poussin Jean, license CC BY-SA 3.0, via Wikimedia Commons

The partition problem has many variations, but recently I encountered it as an interactive puzzle on a computer. (Try a similar game yourself!) The player is presented with an old-fashioned pan-balance scale and a set of objects of different weights. The challenge is to divide (or partition) the objects into two groups. You put one group of weights on one side of the scale and the remaining group on the other side so that the scale balances.

Here's a canonical example of the partition problem for two groups. The weights of six items are X = {0.4, 1.0, 1.2, 1.7, 2.6, 2.7}. Divide the objects into two groups of three items so that each group contains half the weight, which is 4.8 for this example. Give it a try! I'll give a solution in the next section.

As is often the case, there are at least two ways to solve this problem: the brute-force approach and an optimization method that minimizes the difference in weights between the two groups. One advantage of the brute-force approach is that it is guaranteed to find all solutions. However, the brute-force method quickly becomes impractical as the number of items increases.

This article considers brute-force solutions. A more elegant solution will be discussed in a future article.

Permutations: A brute-force approach

Let's assume that the problem specifies the number of items that should be placed in each group. One way to specify groups is to use a vector of ±1 values to encode the group to which each item belongs. For example, if there are six items, the vector c = {-1, -1, -1, +1, +1, +1} indicates that the first three items belong to one group and the last three items belong to the other group.

One way to solve the partition problem for two groups is to consider all permutations of the items. For example, Y = {0.4, 1.7, 2.7, 1.0, 1.2, 2.6} is a permutation of the six weights in the previous section. The vector c = {-1, -1, -1, +1, +1, +1} indicates that the items {0.4, 1.7, 2.7} belong to one group and the items {1.0, 1.2, 2.6} belong to the other group. Both groups have 4.8 units of weight, so Y is a solution to the partition problem.

Notice that the inner product Y`*c = 0 for this permutation. Because c is a vector of ±1 values, the inner product is the difference between the sum of weights in the first group and the sum of weights in the second group. An inner product of 0 means that the sums are the same in both groups.
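
As a quick check (my own snippet, separate from the program in the next section), the following SAS/IML statements compute the inner product for the permutation above:

proc iml;
Y = {0.4, 1.7, 2.7, 1.0, 1.2, 2.6};   /* a permutation of the weights */
c = { -1,  -1,  -1,   1,   1,   1};   /* group membership (+/-1)      */
diff = Y` * c;                        /* difference between group sums; 0 for a valid partition */
print diff;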

A program to generate all partitions

Let's see how you can use the ALLPERM function in SAS/IML to solve the two-group partition problem. Since the number of permutations grows very fast, let's use an example that contains only four items. The weights of the four items are X = {1.2, 1.7, 2.6, 3.1}. We want two items in each group, so define c = {-1, -1, 1, 1}. We search for a permutation Y = π(X), such that Y`*c = 0. The following SAS/IML program generates ALL permutations of the integers {1,2,3,4}. For each permutation, the sum of the first two weights is compared to the sum of the last two weights. The permutations for which Y`*c = 0 are solutions to the partition problem.

proc iml;
X = {1.2, 1.7, 2.6, 3.1};
c = { -1,  -1,   1,   1};  /* k = 2 */
 
/* Brute Force Method 1: Generate all permutations of items */
N = nrow(X);
P = allperm(N);            /* all permutations of 1:N */
Y = shape( X[P], nrow(P), ncol(P) );   /* reshape to N! x N matrix */
z = Y * c;                 /* want to find z=0 (or could minimize z) */
idx = loc(abs(z) < 1E-8);  /* indices where z=0 in finite precision */
Soln = Y[idx, ];           /* return all solutions; each row is solution */

This program generates P, a matrix whose rows contain all permutations of four elements. The matrix Y is a matrix where each row is a permutation of the weights. Therefore, Y*c is the vector of all differences. When a difference is zero, the two groups contain the same weights. The following statements count how many solutions are found and print the first solution:

numSolns = nrow(soln);
s = soln[1,];
print numSolns, s[c={'L1' 'L2' 'R1' 'R2'}];

There are a total of eight solutions. One solution is to put the weights {1.2, 3.1} in one group and the weights {1.7, 2.6} in the other group. What are the other solutions? For this set of weights, the other solutions are trivial variations of the first solution. The following statement prints all the solutions:

print soln[c={'L1' 'L2' 'R1' 'R2'}];

I have augmented the output so that it is easier to see the structure. In the first four rows, the values {1.2, 3.1} are in the first group and the values {1.7, 2.6} are in the second group. In the last four rows, the values switch groups. Thus, the method that generates all permutations produces many solutions that are, in practice, qualitatively the same.

Combinations: Another brute-force approach

The all-permutation method generates N! possible partitions and, as we have seen, not all the partitions are qualitatively different. Thus, using all permutations is inefficient. A more efficient (but still brute-force) method is to use combinations instead of permutations. Combinations are essentially a sorted version of a permutation. The values {1, 2, 3}, {2, 1, 3}, and {3, 2, 1} are different permutations, whereas there is only one combination ({1, 2, 3}) that contains these three numbers. If there are six items and you want three items in each group, there are 6! = 720 permutations to consider, but only "6 choose 3" = 20 combinations.
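
As a quick sanity check on those counts, the following short program evaluates both quantities (the FACT and COMB functions are standard SAS functions that are also available in SAS/IML):

proc iml;
N = 6;  k = 3;
nPerm = fact(N);       /* 6!           = 720 */
nComb = comb(N, k);    /* "6 choose 3" =  20 */
print nPerm nComb;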

The following SAS/IML function uses combinations to implement a brute-force solution of the two-group partition problem. The Partition2_Comb function takes a vector of item weights and a scalar, k, which specifies the number of items in the first group; the second group contains the remaining N-k items. If you want k items in the first group, the ALLCOMB function creates the complete set of combinations of k indices. The SETDIFF function computes the complementary set of indices. For example, if there are six items and {1, 4, 5} is a set of indices, then {2, 3, 6} is the complementary set of indices. After the various combinations are defined, the equation Y*c = 0 is used to find solutions, if any exist, where the first k elements of c are -1 and the last N-k elements are +1.

/* Brute Force Method 2: Generate all combination of size k, N-k */
start Partition2_Comb(_x, k, tol=1e-8);
   x = colvec(_x);
   N = nrow(x);
   call sort(x);                 /* Optional: standardize the output */
 
   c = j(k, 1, -1) // j(N-k, 1, 1); /* construct +/-1 vector */
   L = allcomb(N, k);            /* "N choose k" possible candidates in "left" group */
   R = j(nrow(L), N-k, .);
   do i = 1 to nrow(L);
      R[i,] = setdif(1:N, L[i,]);  /* complement indices in "right" group */
   end;
   P = L || R;                   /* combine the left and right indices */
 
   Y = shape( X[P], nrow(P) );   /* reshape X[P] into an (N choose k) x N matrix */
   z = Y * c;                    /* want to find z=0 (or could minimize z) */
   solnIdx = loc(abs(z) < tol);  /* indices where z=0 in finite precision */
   if ncol(solnIdx) = 0 then 
      soln = j(1, N, .);         /* no solution */
   else
      soln = Y[solnIdx, ];       /* each row is solution */
   return soln;                  /* return all solutions */
finish;
 
/* test the function on a set that has 11 items, partitioned into groups of 5 and 6 */
x = {0.8, 1.2, 1.3, 1.4, 1.6, 1.8, 1.9, 2.0, 2.2, 2.3, 2.5}; 
soln = Partition2_Comb(x, 5);
numSolns = nrow(soln);
print numSolns, soln[c=(('L1':'L5') || ('R1':'R6'))];

The output shows 13 solutions for a set of 11 items that are partitioned into two groups, one with five items and the other with 11-5=6 items. For each solution, both groups contain 9.5 units of weight.

Summary

This article shows some SAS/IML programming techniques for using a brute-force method to solve the partition problem on N items. In the partition problem, you split items into two groups that have k and N-k items so that the weight of the items in each group is equal. This article introduced the solution in terms of all permutations of the items, but implemented a solution in terms of all combinations, which eliminates redundant orderings of the items.

In many applications, you don't need to find all solutions, you just need any solution. A subsequent article will discuss formulating the partition problem as an optimization that seeks to minimize the difference between the weights of the two groups.

The post The partition problem appeared first on The DO Loop.

September 9, 2021
 

A statistical programmer asked how to simulate event-trials data for groups. The subjects in each group have a different probability of experiencing the event. This article describes one way to simulate this scenario. The simulation is similar to simulating from a mixture distribution. This article also shows three different ways to visualize the results of the simulation.

Modeling the data-generation process

A simulation is supposed to reproduce a data-generation process. Typically, a simulation is based on some real data. You (the programmer) need to ensure that the simulation reflects how the real data are generated. For example, there are often differences between simulating a designed experiment and simulating an observational study:

  • In a designed experiment, the number of subjects in each group is determined by the researcher. For example, a researcher might choose 100 subjects, 50 men and 50 women. If so, each simulated data sample should also contain 50 males and 50 females.
  • In an observational study, the number of subjects in each group is a random variable that is based on the proportion of the population in each group. For example, a researcher might choose 100 subjects at random. In the data, there might be 53 men and 47 women, but that split is the result of random chance. Each simulated data sample should generate the number of males and females according to a random binomial variable. Some samples might have a 52:48 split, others a 49:51 split, and so on. In general, the group sizes are simulated from a multinomial distribution.

For other ways to choose the number of subjects in categories, see the article, "Simulate contingency tables with fixed row and column sums in SAS."
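
To make the distinction concrete, the following sketch contrasts the two designs for a hypothetical three-group study of 100 subjects. The proportions and the random-number seed are arbitrary values that I chose for illustration:

proc iml;
call randseed(4321);
pi = {0.5 0.3 0.2};                        /* hypothetical population proportions     */
sizeObs = RandMultinomial(1, 100, pi);     /* observational study: random group sizes */
sizeFixed = {50 30 20};                    /* designed experiment: fixed group sizes  */
print sizeObs, sizeFixed;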

Data that specify events and trials

Suppose you have the following data from a pilot observational study. Subjects are classified into six groups based on two factors. One factor has three levels (for example, political party) and another has two levels (for example, sex). For each group, you know the number of subjects (Ni) and the number of events (Ei). You can therefore estimate the proportion of events in the i-th group as Ei / Ni. Because you know the total number of subjects (Sum(N)), you can also estimate the probability that an individual belongs to each group (πi = Ni / Sum(N)).

The following SAS DATA step defines the proportions for each group:

%let NTotal = 107;
data Events;
length Group $ 3;
input Cat1 $ Cat2 $ Events N;
Group = catx(" ", Cat1, Cat2); /* combine two factors into group ID */
p  = Events / N;               /* estimate P(event | Group[i]) */
pi = N / &NTotal;              /* estimate P(subject in Group[i]) */
drop Cat1 Cat2;
format p pi Best5.;
datalines;
A  F  1  10
A  M  8  24
B  F  4  13
B  M  12 36
C  F  12 16
C  M  6  8
;
 
proc print data=Events; run;

This pilot study has 107 subjects. The first group ("A F") contained 10 subjects, which is π1 = 9.3% of the total. In the first group, one person experienced the event, which is p1 = 10% of the group. The other rows of the table are similar. Some people call this "event-trials" data because the quantity of interest is the proportion of events that occurred in the subjects. The next section shows how to use these estimates to simulate new data samples.

Simulate group sizes and proportions in groups

One of the advantages of simulation studies is that you can take preliminary data from a pilot study and "scale it up" to create larger samples, assuming that the statistics from the pilot study are representative of the parameters in the population. For example, the pilot study involved 107 subjects, but you can easily simulate samples of size 250 or 1000 or more.

The following SAS/IML program simulates samples of size 250. The statistics from the pilot study are used as parameters in the simulation. For example, the simulation assumes that 0.093 of the population belongs to the first group, and that 10% of the subjects in the first group experience the event of interest. Notice that some groups will have a small number of subjects (such as Group 1 and Group 6, which have small values for π). Among the groups, some will have a small proportion of events (Group 1) whereas others will have a large proportion (Group 5 and Group 6).

The simulation uses the following features:

  • The RandMultinomial function in SAS/IML generates a random set of samples sizes based on the π vector, which estimates the proportion of the population that belongs to each group.
  • For each group, you can use the binomial distribution to randomly generate the number of events in the group.
  • The proportion of events is the ratio (Number of Events) / (Group Size).

/* simulate proportions of events in groups */
proc iml;
use Events;
read all var _ALL_;  /* creates vectors Group, Events, N, p, pi, etc. */
close;
numGroups = nrow(Group);
 
call randseed(12345);
nTrials = 250;       /* simulate data for a study that contains 250 subjects */
 
Size = T( RandMultinomial(1, nTrials, pi) ); /* new random group sizes */
nEvents = j(numGroups, 1, .);                /* vector for number of events */
do i = 1 to nrow(nEvents);
   nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
end;
Proportion = nEvents / Size;                 /* proportion of events in each group */
 
results = Size || nEvents || Proportion;
print results[r=Group c={'Size' 'nEvents' 'Proportion'} F=Best5.];

The table shows one random sample of size 250 based on the statistics in the smaller pilot study. The group size and the number of events are both random variables. If you rerun this simulation, the number of subjects in the groups will change, as will the proportion of subjects in each group that experience the event. It would be trivial to adapt the program to handle a designed experiment in which the Size vector is constant.

Simulating multiple samples

An important property of a Monte Carlo simulation study is that it reveals the distribution of statistics that can possibly result in a future data sample. The table in the previous section shows one possible set of statistics, but we can run the simulation additional times to generate hundreds or thousands of additional variations. The following program generates 1,000 possible random samples and outputs the results to a SAS data set. You can then graph the distribution of the statistics. This section uses box plots to visualize the distribution of the proportion of events in each group:

/* simulate 1000 random draws under this model */
SampleID = j(numGroups, 1, .);
nEvents = j(numGroups, 1, .);                /* vector for number of events */
create Sim var {'SampleID' 'Group' 'Size' 'nEvents' 'Proportion'};
do ID = 1 to 1000;
   /* assign all elements the same value:
      https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   SampleID[,] = ID;
   Size = T( RandMultinomial(1, nTrials, pi) );  /* new random group sizes */
   do i = 1 to nrow(nEvents);
      nEvents[i] = randfun(1, "Binomial", p[i], Size[i]);
   end;
   Proportion = nEvents / Size;
   append;
end;
close Sim;
QUIT;
 
data PlotAll;
set Sim Events(keep=Group p);
run;
 
title "Proportion of Events in Each Group";
title2 "Simulated N=250; NumSim=1000";
/* box plot */
proc sgplot data=PlotAll noautolegend;
   hbox Proportion / category=Group;
   scatter y=Group x=p / markerattrs=GraphData2(symbol=CircleFilled size=11);
   yaxis type=discrete display=(nolabel); /* create categorical axis */
   xaxis grid; 
run;

The box plot shows the distribution of the proportion of events in each group. For example, the proportion in the first group ("A F") ranges from a low of 0% to a high of 33%. The proportion in the sixth group ("C M") ranges from a low of 22% to a high of 100%. The red markers are the parameter values used in the simulation. These are the estimates from the pilot study.

Alternative ways to visualize the distribution within groups

The box plot shows a schematic representation of the distribution of proportions within each group. However, there are alternative ways to visualize the distributions. One way is to use a strip plot, as follows:

/* strip plot */
proc sgplot data=PlotAll noautolegend;
   scatter y=Group x=Proportion / 
           jitter transparency=0.85       /* handle overplotting */
           markerattrs=(symbol=CircleFilled);
   scatter y=Group x=p / markerattrs=(symbol=CircleFilled size=11);
   yaxis type=discrete reverse display=(nolabel); /* create categorical axis */
   xaxis grid; 
run;

The strip plot uses a semi-transparent marker to display each statistic. (Again, the red markers indicate the parameters for the simulation.) The density of the distribution is visualized by the width of the strips and by the darkness of the plot, which is due to overplotting the semi-transparent markers.

This plot makes it apparent that the variation between groups is based on the relative size of the groups in the pilot study. Large groups such as Group 4 ("B M") have less variation than small groups such as Group 6 ("C M"). That's because the standard error of the proportion is inversely proportional to the square root of the sample size. Thus, smaller groups have a wider distribution than the larger groups.
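
For reference, the standard error of a sample proportion that is based on n subjects is given by the usual binomial formula, which shows the inverse dependence on the square root of the group size:

\[
\operatorname{SE}(\hat{p}) \;=\; \sqrt{\frac{p\,(1-p)}{n}}
\]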

You can see the same features in a panel of histograms. In the following visualization, the distribution of the proportions is shown by using a histogram. Red vertical lines indicate the parameters for the simulation. This graph might be easier to explain to non-statisticians.

/* plot of histograms */
proc sgpanel data=PlotAll;
   panelby Group / onepanel layout=rowlattice novarname;
   histogram Proportion;
   refline p / axis=x lineattrs=GraphData2(thickness=2);
   colaxis grid; 
run;

Summary

This article shows how to simulate event-trials data in SAS. In this example, the data belong to six different groups, and the probability of experiencing the event varies between groups. The groups are also different sizes. In the simulation, the group size and the number of events in each group are random variables. This article also shows three different ways to visualize the results of the simulation: a box plot, a strip plot, and a panel of histograms.

The post Simulate proportions for groups appeared first on The DO Loop.

September 7, 2021
 

A colleague spent a lot of time creating a panel of graphs to summarize some data. She did not use SAS software to create the graph, but I used SAS to create a simplified version of her graph, which is shown to the right. (The colors are from her graph.) The graph displays a panel of bar charts with uncertainty intervals. The height of the bar is the mean value of the response. I don't know which statistic she used for the "error bars." It could be one of three common statistics that visualize uncertainty: the standard deviation of the data, the standard error of the mean, or a 95% confidence interval for the mean.

Each cell in the panel shows a set of "dynamite plots," which is a common name for those bar charts with error bars. Data visualization experts generally agree that dynamite plots are not the best way to summarize a data distribution. Alternatives that show the distribution of the data include box plots, strip plots, and violin plots.

Even if you choose to present summarized data instead of the raw data, the dynamite plot is not the best choice. If you adhere to Tufte's advice to maximize the data-to-ink ratio, a simple dot plot with error bars conveys the same information with less ink.

This article discusses a makeover of the panel. In particular, here are some best practices for remaking this graph:

  • Replace the dynamite plots with a simple dot plot.
  • Exchange the axes so that the group labels can be written horizontally and the response variable is plotted along the X axis.
  • Do the colors improve or detract from the visualization? Try plotting the data with and without using colored bars.

Moving from bar charts to dot plots

Bar charts are best for representing frequencies and percentages. A dot plot is a better way to display the means and error bars for data. In the SGPLOT procedure in SAS, the DOT statement will summarize the data and visualize the summary statistics. However, I do not have access to the raw data, so I will use a SCATTER statement to create a panel of dot plots.
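
For readers who do have the raw data, a minimal sketch of the DOT statement follows. The data set name (RawData) and the response variable (Y) are hypothetical placeholders for your own data:

/* hypothetical example: summarize raw data with the DOT statement */
proc sgplot data=RawData;
   dot Group / response=Y stat=mean limitstat=clm;   /* group means and 95% CIs */
run;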

The following DATA step creates some data that are similar to the results that my colleague presented. I use PROC FORMAT to create a SAS format for the categories and groups. This enables readers to reuse my code for their own visualizations, and it keeps my colleague's data private:

proc format;
value CatFmt
      1 = "Category 1"   2 = "Category 2" 
      3 = "Category 3"   4 = "Category 4";
value GroupFmt
      1 = "Group 1"  2 = "Group 2"  3 = "Control";
run;
 
data Have;
format Category CatFmt. Group GroupFmt.;
input Category Group Mean W;
Lower = Mean - W;
Upper = Mean + W;
datalines;
1 1 35 2
1 2 55 5
1 3 70 9
2 1 55 10
2 2 30 7
2 3 10 4
3 1 5.5  3
3 2 8.2  4
3 3 7.2  1
4 1 24.5 7
4 2 27   15
4 3 12   11
;

The following makeover uses the following features of SAS graphics:

  • You can use PROC SGPANEL to create the panel of plots. The PANELBY statement specifies the categorical variable to use for the cells. You can use options on this statement to control the layout for the cells and the appearance of the cell headers.
  • The SCATTER statement plots the mean within each category. Use the GROUP= option to offset the means for each group. Use the XERRORLOWER= and XERRORUPPER= options to plot error bars. This plot uses the default colors for distinguishing groups. The colors are determined by the current ODS style. For the HTMLBlue style, the colors are dark shades of blue, red, and green. According to my colleague, she chose the purple-pink-yellow palette of colors "for aesthetic reasons." In a subsequent section, I show how to override the default colors for groups.
  • The COLAXIS and ROWAXIS statements are equivalent to the XAXIS and YAXIS statements in PROC SGPLOT. They enable you to add grid lines, control labels and tick marks, and modify axis-related properties.

The following call to PROC SGPANEL shows one way to remake my colleague's graph as a panel of dot plots:

ods graphics / width=480px height=400px;
title "Paneled Makeover Plot";
title2 "Horizontal Dot Plot with Error Bars";
footnote J=L "Without Colored Bands";
proc sgpanel data=Have noautolegend;
   panelby Category / novarname layout=rowlattice rows=4;
   scatter x=Mean y=Group / xerrorlower=Lower xerrorupper=Upper 
      errorbarattrs=(thickness=2) errorcapscale=0.5
      markerattrs=(symbol=CircleFilled) group=Group;
   colaxis grid display=(nolabel);
   rowaxis grid display=(nolabel) type=discrete REVERSE
           offsetmin=0.2 offsetmax=0.2;
run;

If you turn your head sideways, you can see that this graph displays exactly the same information as the original graph. However, it is easier to compare the means and intervals within and across categories. The group labels are displayed horizontally, which means that this same layout will support longer group labels without needing to rotate the text.

Furthermore, you can use the row-based layout even if there are additional categories or groups. The graph will be longer, but it will be the same width. That means that the graph will fit on one sheet of paper for many groups or categories. In contrast, the original layout gets wider as more categories are added, which might make it hard to fit the panel on a printed page in portrait mode.

Adding colored bands to a dot plot

In the previous graph, colors are used to distinguish the markers and lines for each group. This enables you to easily track the same group across different categories. The author of the original graph chose light colors for the bars, but light colors are not good choices for lines and markers. However, if those colors are important to the story that you want to tell, you can add colored bands to the background of the graph.

You can use the STYLEATTRS statement to tell the procedure what colors to use for groups. You can use the HIGHLOW statement to create colored bands, which is a tip I learned from my colleague Sanjay Matange. To use the HIGHLOW statement, you need to add new variables to the data to indicate the minimum and maximum values for the bands. The following statements create an alternative visualization:

/* set minimum and maximum values for colored bars */
data Want / view=Want;
set Have;
xmin=0; xmax=80;  
run;
 
footnote J=L "With Colored Bands";
proc sgpanel data=Want noautolegend;
   styleattrs datacolors=(Lavender Salmon LightYellow);
   panelby Category / novarname layout=rowlattice rows=4;
   /* plot colored bands in the background */
   highlow y=Group low=xmin high=xmax / group=Group type=bar groupdisplay=cluster
           barwidth=1.0 clusterwidth=0.9 transparency=0.5 fill nooutline;
   scatter x=Mean y=Group / xerrorlower=Lower xerrorupper=Upper 
      markerattrs=(symbol=CircleFilled color=Black) 
      errorbarattrs=(thickness=2 color=Black) errorcapscale=0.5;
   colaxis grid display=(nolabel);
   rowaxis grid display=(nolabel) type=discrete REVERSE
           offsetmin=0.2 offsetmax=0.2;
run;

Since the colored bands identify the groups, I used black to display the means and error bars. The information in this graph is the same as for the previous graph, but this graph emphasizes the colors more. In general, I prefer the graph without the color bands, but there might be situations in which colors enhance the presentation of the data.

Summary

In summary, this article shows how to remake a panel of dynamite plots, which are bar charts with error bars. The article shows that you can display the same information by using a dot plot with error bars. Furthermore, it is often helpful to rotate the plot so the response variable is plotted horizontally instead of vertically. This redesign can support many groups and categories because adding more categories makes the graph grow taller, not wider. Lastly, the article shows how to add colored bands in the background of a dot plot.

Appendix: Create the original graph

For completeness, the following SAS code creates the panel that is shown at the top of this article:

title "Paneled Dynamite Plot";
title2 "Vertical Bar Charts with Error Bars";
footnote;
proc sgpanel data=Have noautolegend;
   styleattrs datacolors=(CX6f6db2 CXe18dac CXe9ef7a); /* colors of original graph */
   panelby Category / novarname layout=columnlattice columns=4;
   vbarbasic  Group / response=Mean transparency=0.5 group=Group;
   scatter x=Group y=Mean / yerrorlower=Lower yerrorupper=Upper
           markerattrs=(size=0) errorbarattrs=(color=Black);
   colaxis grid display=(nolabel) fitpolicy=rotate;
   rowaxis grid display=(nolabel);
run;

The post Remaking a panel of dynamite plots appeared first on The DO Loop.

September 1, 2021
 

The number of possible bootstrap samples for a sample of size N is big. Really big.

Recall that the bootstrap method is a powerful way to analyze the variation in a statistic. To implement the standard bootstrap method, you generate B random bootstrap samples. A bootstrap sample is a sample with replacement from the data. The phrase "with replacement" is important. In fact, a bootstrap sample is sometimes called a "resample" because it is generated by sampling the data with REplacement.

This article compares the number of samples that you can draw "with replacement" to the number of samples that you can draw "without replacement." It demonstrates, for very small samples, how to generate the complete set of bootstrap samples. It also explains why generating a complete set is impractical for even moderate-sized samples.

The number of samples with and without replacement

For a sample of size N, there are N! samples if you sample WITHOUT replacement (WOR). The quantity N! grows very quickly as a function of N, so the number of permutations is big for even moderate values of N. In fact, I have shown that if N≥171, then you cannot represent N! by using a double-precision floating-point value because 171! is greater than MACBIG (=1.79769e+308), which is the largest value that can be represented in eight bytes.

However, if you sample WITH replacement (WR), then the number of possible samples is N^N, which grows even faster than the factorial function! If the number of permutations is "big," then the number of WR samples is "really big"!

For comparison, what do you think is the smallest value of N such that N^N exceeds the value of MACBIG? The answer is N=144, which is considerably smaller than 171. The computation is shown at the end of this article.

Another way to compare the relative sizes of N! and N^N is to print a few values of both functions for small values of N. The following table shows the values for N≤10:

proc iml;
N = 1:10;
nPerm = fact(N);     /* N!  */
nResamp = N##N;      /* N^N */
both = nPerm // nResamp;
print both[r={"Permutations" "Samples (WR)"} c=N format=comma15. L=""];

Clearly, the number of bootstrap samples (samples WR) grows very big very fast. This is why the general bootstrap method uses random bootstrap samples rather than attempting some sort of "exact" computation that uses all possible bootstrap samples.

An exact bootstrap computation that uses all possible samples

If you have a VERY small data set, you could, in fact, perform an exact bootstrap computation. For the exact computation, you would generate all possible bootstrap samples, compute the statistic on each sample, and thereby obtain the exact bootstrap distribution of the statistic.

Let's carry out this scheme for a data set that has N=7 observations. From the table, we know that there are exactly 7^7 = 823,543 samples with replacement.

For small N, it's not hard to construct the complete set of bootstrap samples: just form the Cartesian product of the set with itself N times. In the SAS/IML language, you can use the ExpandGrid function to define the Cartesian product. If the data has 7 values, the Cartesian product will be a matrix that has 823,543 rows and 7 columns.

The following SAS/IML program performs a complete bootstrap analysis of the sample mean. The sample, S, has 7 observations. The minimum value is -1; the maximum value is 4. The sample mean is approximately 0.51. What is the bootstrap distribution of the sample mean? You can form the set of all possible subsamples of size 7, where the elements are resampled from S. You can then compute the mean of every resample and plot the distribution of the means:

proc iml;
S = {-1  -0.1  -0.6  0  0.5  0.8  4};
/* 1. compute original statistic */
m = S[:];          /* = mean(S`) */
print m;
 
/* 2. Generate ALL resamples! */
/* Cartesian product of the elements in S is
   S x S x ... x S
   \------v------/
         N times                              */
z = ExpandGrid(S, S, S, S, S, S, S);
 
/* 3. Compute statistic on each resample */
means = z[,:];
 
/* 4. Analyze the bootstrap distribution */
title "Complete Bootstrap Distribution of the Mean";
title2 "N = 7";
call histogram(means) rebin={-1 0.1} xvalues=-1:4
     other="refline 0.51/axis=x lineattrs=(color=red);";

The graph shows the mean for every possible resample of size 7. The mean of the original sample is shown as a red vertical line. Here are some of the resamples in the complete set of bootstrap samples:

  • The resamples where a datum is repeated seven times. For example, one resample is {-1, -1, -1, -1, -1, -1, -1}, which has a mean of -1. Another resample is {4, 4, 4, 4, 4, 4, 4}, which has a mean of 4. These two resamples define the minimum and maximum values, respectively, of the bootstrap distribution. There are only seven of these resamples.
  • The resamples where a datum is repeated six times. For example, one of the resamples is {-1, -1, -1, -1, -1, -1, 4}. Another is {-1, -1, 0, -1, -1, -1, -1}, and a third is {4, 4, 4, 4, 0.5, 4, 4}. There are 7 × 6 × 7 = 294 resamples of this type.
  • The resamples where each datum is present exactly one time. These sets are all permutations of the sample itself, {-1 -0.1 -0.6 0 0.5 0.8 4}. There are 7! = 5,040 resamples of this type. Each of these permutations has the same mean, which is 0.51. This helps to explain why there is a visible peak in the distribution at the value of the sample mean.

The bootstrap distribution does not make any assumptions about the distribution of the data, nor does it use any asymptotic (large sample) assumptions. It uses only the observed data values.

Comparison with the usual bootstrap method

For larger samples, it is impractical to consider ALL possible samples of size N. Therefore, the usual bootstrap method generates B random samples (with replacement) for a large value of B. The resulting distribution is an approximate bootstrap distribution. The following SAS/IML statements perform the usual bootstrap analysis for B=10,000.

/* Generate B random resamples. Compute and analyze bootstrap distribution */
call randseed(123);
w = sample(S, {7, 10000});   /* 10,000 samples of size 7 (with replacement) */
means = w[,:];
title "Bootstrap Distribution of the Mean";
title2 "B = 10,000; N = 7";
call histogram(means) rebin={-1 0.1} xvalues=-1:4
     other="refline 0.51/axis=x lineattrs=(color=red);";

The resulting bootstrap distribution is similar in shape to the complete distribution, but it uses only 10,000 random samples instead of all possible samples. The properties of the second distribution (such as its quantiles) are similar to those of the complete distribution, even though the second distribution is based on many fewer bootstrap samples.

Summary

For a sample of size N, there are N^N possible resamples of size N when you sample with replacement. The bootstrap distribution is based on a random sample of these resamples. When N is small, you can compute the exact bootstrap distribution by forming the Cartesian product of the sample with itself N times. This article computes the exact bootstrap distribution for the mean of seven observations and compares the exact distribution to an approximate distribution that is based on B random resamples. When B is large, the approximate distribution looks similar to the complete distribution.

For more about the bootstrap method and how to perform bootstrap analyses in SAS, see "The essential guide to bootstrapping in SAS."

Appendix: Solving the equation N^N = y

This article asked, "what is the smallest value of N such that N^N exceeds the value of MACBIG?" The claim is that N=144, but how can you prove it?

For any value of y > 1, you can use the Lambert W function to find the value of N that satisfies the equation N^N ≤ y. Here's how to solve the equation:

  1. Take the natural log of both sides to get N*log(N) ≤ log(y).
  2. Define w = log(N) and x = log(y). Then the equation becomes w exp(w) ≤ x. The solution to this equation is found by using the Lambert W function: w = W(x).
  3. The solution to the original equation is N = exp(w). Since N is supposed to be an integer, round up or down, according to the application.

The following SAS/IML program solves for the smallest value of N such that N^N exceeds the value of MACBIG in double-precision.

proc iml;
x = constant('LogBig');   /* x = log(MACBIG) */
w = LambertW(x);          /* solve for w such that w*exp(w) = x */
N1 = exp(w);              /* because N = exp(w) */
N = ceil(N1);             /* round up so that N^N exceeds MACBIG */
print N1 N;
quit;

So, when N=144, N^N is larger than the maximum representable double-precision value (MACBIG).

The post On the number of bootstrap samples appeared first on The DO Loop.

August 30, 2021
 

You can use the bootstrap method to estimate confidence intervals. Unlike formulas, which assume that the data are drawn from a specified distribution (usually the normal distribution), the bootstrap method does not assume a distribution for the data. There are many articles about how to use SAS to bootstrap statistics for univariate data and for data from a regression model. This article shows how to bootstrap the correlation coefficients in multivariate data. The main challenge is that the correlation coefficients are typically output in a symmetric matrix. To analyze the bootstrap distribution, you must convert the statistics from a "wide form" (a matrix) to a "long form" (a column), as described in a previous article.

To motivate the bootstrap method for correlations, recall that PROC CORR in SAS supports the FISHER option, which estimates confidence intervals for the correlation coefficients. However, the interval estimates are parametric: they assume that the pairs of variables are bivariate normal. How good are the estimates for real data? You can generate the bootstrap distribution for the correlation coefficients and compare the Fisher confidence intervals to the bootstrap estimates. As a preview of things to come, the graph to the right (click to expand) shows histograms of bootstrap estimates and the Fisher confidence intervals.

This article assumes that the reader is familiar with basic bootstrapping in SAS. If not, see the article, "Compute a bootstrap confidence interval in SAS."

Bootstrap multivariate data in SAS

The basic steps of the bootstrap algorithm are as follows:

  1. Compute a statistic for the original data.
  2. Use PROC SURVEYSELECT to resample (with replacement) B times from the data.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample.
  4. Use the bootstrap distribution to obtain estimates of bias and uncertainty, such as confidence intervals.

Let's use a subset of Fisher's Iris data for this example. There are four variables to analyze, so you have to estimate confidence intervals for 4*3/2 = 6 correlation coefficients. Let's use the FISHER option to estimate the confidence intervals. We can then compute the proportion of bootstrap estimates that fall within those confidence intervals, thereby estimating the coverage probability of the Fisher intervals. Does a 95% Fisher interval have 95% coverage for these data? Let's find out!

Step 1: Compute statistics on the original data

The following SAS statements subset the Sashelp.Iris data and call PROC CORR to estimate Fisher's confidence intervals for the six correlation coefficients:

data sample;   /* N = 50 */
   set Sashelp.Iris(where=(Species="Versicolor"));
   label PetalLength= PetalWidth= SepalLength= SepalWidth=;
run;
 
/* STEP 1: Compute statistics on the original data */
%let varNames = PetalLength PetalWidth SepalLength SepalWidth;
%let numVars=4;
proc corr data=sample fisher(biasadj=no) noprob plots=matrix;
   var &varNames;
   ods output FisherPearsonCorr=FPCorr(keep=Var WithVar NObs Corr Lcl Ucl);
run;
 
proc print data=FPCorr noobs; run;

The output shows the estimates for the 95% Fisher confidence intervals for the correlation coefficients. Remember, these estimates assume bivariate normality. Let's generate a bootstrap distribution for the correlation statistics and examine how well the Fisher intervals capture the sampling variation among the bootstrap estimates.

Steps 2 and 3: Bootstrap samples and the bootstrap distribution

As shown in other articles, you can use PROC SURVEYSELECT in SAS to generate B bootstrap samples of the data. Each "bootstrap sample" is a sample (with replacement) from the original data. You can then use the BY statement in PROC CORR to obtain B correlation matrices, each estimated from one of the bootstrap samples:

/* STEP 2: Create B=10,000 bootstrap samples */
/* Bootstrap samples */
%let NumSamples = 10000;       /* number of bootstrap resamples */
proc surveyselect data=sample NOPRINT seed=12345
     out=BootSamples
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;
 
/* STEP 3. Compute the statistic for each bootstrap sample */
proc corr data=BootSamples noprint outp=CorrOut;
   by Replicate;
   freq NumberHits;
   var &varNames;
run;

For many bootstrap analyses, you can proceed directly to the next step, which is to summarize the bootstrap distribution. However, in this case, the statistics in the CorrOut data set are not in a convenient format. The statistics are stored in matrices, as shown by the following call to PROC PRINT, which prints the correlation estimates for the first two bootstrap samples:

/* look at estimates for first two bootstrap samples */
proc print data=CorrOut noobs;
   where _Type_="CORR" & Replicate<=2;
run;

To analyze the distribution of the correlation estimates, it is useful to reshape the data, as described in a previous article. The article shows how to obtain only the values in the cells that are highlighted in yellow. The article also shows how to add labels that identify the pairs of variables. The following DATA step reshapes the data and adds labels such as "Corr(X1,X2)":

/* Convert correlations in the OUTP= data set from wide to long form. See
   https://blogs.sas.com/content/iml/2021/08/25/symmetric-matrix-wide-to-long.html */
data BootLong;
set CorrOut(where=(_Type_="CORR") rename=(_NAME_=Var1));
length Var2 $32 Label $13;
array X[*] &varNames;
Row = 1 + mod(_N_-1, &NumVars);
do Col = Row+1 to &NumVars;       /* names for columns */
   Var2 = vname(X[Col]);
   Corr = X[Col];
   Label = cats("Corr(X",Row,",X",Col,")");
   output;
end;
drop _TYPE_ &varNames;
run;

The bootstrap estimates are now in a "long format," which makes it easier to analyze and visualize the bootstrap distribution of the estimates.

The next section combines the original estimates and the bootstrap estimates. To facilitate combining the data sets, you need to add the Label column to the data set that contains the Fisher intervals. This was also described in the previous article. Having a common Label variable enables you to merge the two data sets.

/* add labels to the Fisher statistics */
data FisherCorr;
length Label $13;
set FPCorr;
retain row col 1;
if col>=&NumVars then do; row+1; col=row; end;
col+1;
Label = cats("Corr(X",row,",X",col,")");
run;

Step 4: Analyze the bootstrap distribution

You can graph the bootstrap distribution for each pair of variables by using the SGPANEL procedure. The following statements create a basic panel:

proc sgpanel data=BootLong;
   panelby Label;
   histogram Corr;
run;

However, a goal of this analysis is to compare the Fisher confidence intervals (CIs) to the bootstrap distribution of the correlation estimates. Therefore, it would be nice to overlay the original correlation estimates and Fisher intervals on the histograms of the bootstrap distributions. To do that, you can concatenate the BootLong and the FisherCorr data sets, as follows:

/* STEP 4: Analyze the bootstrap distribution */
/* overlay the original sample estimates and the Fisher confidence intervals */
data BootLongCI;
set BootLong FisherCorr(rename=(Var=Var1 WithVar=Var2 Corr=Estimate));
label Corr="Correlation Estimate";
run;
 
title "Bootstrap Correlation Estimates (B=&NumSamples)";
title2 "Overlay Fisher Confidence Intervals";
proc sgpanel data=BootLongCI;
   panelby Label / novarname;
   histogram Corr;
   refline Estimate / axis=X lineattrs=(color=red);
   refline LCL / axis=X ;
   refline UCL / axis=X ;
run;

The graph is shown at the top of this article. In the graph, the red lines indicate the original estimates of correlation from the data. The gray vertical lines show the Fisher CIs for the data. The histograms show the distribution of the correlation estimates on 10,000 bootstrap re-samples of the data.

Visually, it looks like the Fisher confidence intervals do a good job of estimating the variation in the correlation estimates. It looks like most of the bootstrap estimates are contained in the 95% CIs.

The coverage of Fisher's confidence intervals

You can use the bootstrap estimates to compute the empirical coverage probability of the Fisher CIs. That is, for each pair of variables, what proportion of the bootstrap estimates (out of 10,000) are within the Fisher intervals? If all pairs of variables were bivariate normal, the expected proportion would be 95%.

You can use the logical expression (Corr>=LCL & Corr<=UCL) to create a binary indicator variable. The proportion of 1s for this variable equals the proportion of bootstrap estimates that are inside a Fisher interval. The following DATA step creates the indicator variable and the call to PROC FREQ counts the proportion of bootstrap estimates that are in the Fisher interval.

/* how many bootstrap estimates are inside a Fisher interval? */
proc sort data=BootLong; by Label; run;
 
data Pctl;
merge BootLong FisherCorr(rename=(Corr=Estimate));
by Label;
label Corr="Correlation Estimate";
InLimits = (Corr>=LCL & Corr<=UCL);
run;
 
proc freq data=Pctl noprint;
   by Label;
   tables InLimits / nocum binomial(level='1');
   output out=Est binomial;
run;
proc print data=Est noobs;
   var Label N _BIN_ L_BIN U_BIN;
run;

The table shows empirical estimates for the coverage probability of the Fisher intervals. The _BIN_ column contains estimates of the binary proportion. The L_BIN and U_BIN columns contain confidence intervals for the proportions. You can see that all intervals include more than 90% of the bootstrap estimates, and some are close to 95% coverage. The following graph visualizes the empirical coverage for each pair of variables:

ods graphics / width=400px height=240px;
title "Proportion of Bootstrap Estimates in Fisher Intervals"; 
proc sgplot data=Est noautolegend;
   scatter x=_BIN_ y=Label / xerrorlower=L_Bin xerrorupper=U_Bin;
   refline 0.95 / axis=x;
   xaxis  max=1 grid;
   yaxis grid display=(nolabel) offsetmin=0.1 offsetmax=0.1;
run;

The graph indicates that many of the Fisher intervals are too wide (too conservative). Their empirical coverage is greater than the 95% coverage that you would expect for bivariate normal data. On the other hand, the Fisher interval for Corr(X2,X4) has lower coverage than expected.

Summary

This article shows how to perform a bootstrap analysis for correlation among multiple variables. The process of resampling the data (PROC SURVEYSELECT) and generating the bootstrap estimates (PROC CORR with a BY statement) is straightforward. However, to analyze the bootstrap distributions, you need to restructure the statistics by converting the output data from a wide form to a long form. You can use this technique to perform your own bootstrap analysis of correlation coefficients.

Part of the analysis also used the bootstrap distributions to estimate the coverage probability of the Fisher confidence intervals for these data. The Fisher intervals have 95% coverage for normally distributed data. For the iris data, most of the intervals have coverage that is greater than expected, although one interval has lower-than-expected coverage. These results are strongly dependent on the data.

The post Bootstrap correlation coefficients in SAS appeared first on The DO Loop.

August 25, 2021
 

For graphing multivariate data, it is important to be able to convert the data between "wide form" (a separate column for each variable) and "long form" (which contains an indicator variable that assigns a group to each observation). If the data are numeric, the wide data can be represented as an N x p matrix. The same data in long form can be represented by two columns and Np rows, where the first column contains the data and the second column identifies that the first N rows belong to the first group, the second N rows belong to the second group, and so on. Many people have written about how to use PROC TRANSPOSE or the SAS DATA step to convert from wide form to long form and vice versa.
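
For example, the following statements are a minimal sketch of a wide-to-long conversion for ordinary (nonsymmetric) data. The data and variable names are hypothetical; the sketch uses PROC TRANSPOSE:

/* Sketch: convert hypothetical wide data (x1-x3) to long form */
data Wide;
   input ID x1 x2 x3;
   datalines;
1  0.5  1.2  3.4
2  0.7  0.9  2.8
;
 
proc transpose data=Wide out=Long(rename=(_NAME_=VarName COL1=Value));
   by ID;
   var x1-x3;
run;
 
proc print data=Long noobs; run;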

The conversion is slightly different for a symmetric matrix because you might want to display only the upper-triangular portion of the matrix. This article examines how to convert between a symmetric correlation matrix (wide form) and a "compressed symmetric" form that only stores the elements in the upper-triangular portion of the symmetric matrix.

Compressed symmetric storage

When the data represents a symmetric N x N matrix, you can save space by storing only half the data, such as only the upper triangular portion of the matrix. This is called compressed symmetric storage. Familiar examples of symmetric matrices include correlation, covariance, and distance matrices. For example, you can represent the 3 x 3 symmetric matrix A = {3 2 1, 2 5 0, 1 0 9} by storing only the upper triangular entries U = {3, 2, 1, 5, 0, 9}. There are N(N+1)/2 upper triangular elements. For a correlation matrix, you don't need to store the diagonal elements; you only need to store N(N-1)/2 elements.
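
If you use the SAS/IML language, you can move between the two representations by using the VECH and SQRVECH functions. The following statements are a minimal sketch that uses the 3 x 3 example matrix; for a symmetric matrix, the lower-triangular elements (stacked by columns) contain the same information as the upper-triangular elements:

proc iml;
A = {3 2 1,
     2 5 0,
     1 0 9};
v = vech(A);        /* lower-triangular elements, stacked by columns */
B = sqrvech(v);     /* rebuild the full symmetric matrix from the vector */
print v, B;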

When you run a correlation analysis by using PROC CORR in SAS, you usually get a square symmetric matrix of correlation estimates. However, if you use the FISHER option to get confidence limits for the correlation coefficients, then you get a table that shows the correlation estimates for each pair of variables. I will demonstrate this point by using the Sashelp.Iris data. To make the discussion general, I will put the variable names into a macro variable so that you can concentrate on the main ideas:

data sample;   /* use a subset of the Iris data */
   set Sashelp.Iris(where=(Species="Versicolor"));
   label PetalLength= PetalWidth= SepalLength= SepalWidth=;
run;
 
%let varNames = PetalLength PetalWidth SepalLength SepalWidth;
/* count how many variables are in the macro variable */
data _null_;
nv = countw("&varNames");
call symputx("numVars", nv);
run;
%put &=numVars;   /* for this example, numVars=4 */
 
proc corr data=sample fisher(rho0=0.7 biasadj=no) noprob outp=CorrOut;
   var &varNames;
   ods select PearsonCorr FisherPearsonCorr;
   ods output FisherPearsonCorr = FisherPCorr;
run;

The usual square symmetric correlation matrix is shown at the top of the image. I have highlighted the elements in the strictly upper triangular portion of the matrix. By using these values, you can re-create the entire matrix. The second table is an alternative display that shows only the six pairwise correlation estimates. The arrangement in the second table can be useful for plotting the pairwise correlations in a bar chart.

It is useful to be able to convert the first table into the format of the second table. It is also useful to be able to augment the second table with row/column information so that each row of the second table is easily associated with the corresponding row and column in the first table.

Convert a symmetric table to a pairwise list

Let's use the information in the first table to list the correlations for pairs of variables, as shown in the second table. Notice that the call to PROC CORR uses the OUTP= option to write the correlation estimates to a SAS data set. You should print that data set to understand its structure; it contains more than just the correlation estimates! After you understand the structure, the following DATA step will make more sense. The main steps of the transformation are:

  • Create a variable named ROW that has the values 1, 2, 3, ..., N. In the following DATA step, I use the MOD function because in a subsequent article, I will use the same DATA step to perform a bootstrap computation in which the data set contains B copies of the correlation matrix.
  • Create a variable named COL. For each value of ROW, the COL variable loops over the values ROW+1, ROW+2, ..., N.
  • Put the correlation estimates into a single variable named CORR.
  • If desired, create a short label that identifies the two variables that produced the correlation estimate.

/* Convert correlations in the OUTP= data set from wide to long form (pairwise statistics) */
data Long;
set CorrOut(where=(_Type_="CORR") rename=(_NAME_=Var1));
length Var2 $32 Label $13;
array X[*] &varNames;
row = 1 + mod(_N_-1, &NumVars);   /* 1, 2, ..., N, 1, 2, ..., N, 1, 2, ... */
do col = row+1 to &NumVars;       /* names for columns */
   Var2 = vname(X[col]);
   Corr = X[col];
   Label = cats("Corr(X",row,",X",col,")");
   output;
end;
drop _TYPE_ &varNames;
run;
 
proc print data=Long; run;

The ROW and COL variables identify the position of the correlation estimates in the upper-triangular portion of the symmetric correlation matrix. You can write a similar DATA step to convert a symmetric matrix (such as a covariance matrix) in which you also want to display the diagonal elements.
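
For example, the following sketch includes the diagonal elements by starting the inner loop at Col=Row instead of Col=Row+1. (For a covariance matrix, you would also filter on the appropriate value of the _TYPE_ variable, such as "COV", and request that matrix from PROC CORR.)

/* Sketch: include the diagonal elements in the long-form output */
data LongWithDiag;
set CorrOut(where=(_Type_="CORR") rename=(_NAME_=Var1));
length Var2 $32;
array X[*] &varNames;
row = 1 + mod(_N_-1, &NumVars);
do col = row to &NumVars;        /* start at ROW to include the diagonal */
   Var2 = vname(X[col]);
   Value = X[col];
   output;
end;
drop _TYPE_ &varNames;
run;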

Identify the rows and columns of a pairwise list

The data for the second table is stored in the FisherPCorr data set. Although it is possible to convert the pairwise list into a dense symmetric matrix, a more useful task is to identify the rows and columns for each entry in the pairwise list, as follows:

/* Write the (row,col) info for the list of pairwise correlations in long form */
data FisherCorr;
set FisherPCorr;
retain row col 1;
if col>=&NumVars then do;
   row+1;
   col=row;
end;
col+1;
Label = cats("Corr(X",row,",X",col,")");
run;
 
proc print data=FisherCorr; 
   var Var WithVar Corr row col Label;
run;

From the ROW and COL variables, you can assemble the data into a symmetric matrix. For example, I've written about how to use the SAS/IML language to create a symmetric correlation matrix from the strictly upper-triangular estimates.
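
A minimal sketch of that idea uses the ROW and COL variables in the FisherCorr data set. Because the diagonal elements of a correlation matrix are 1, you only need to fill in the off-diagonal cells:

proc iml;
use FisherCorr;
   read all var {row col Corr};
close;
p = max(col);                    /* number of variables */
R = I(p);                        /* diagonal of a correlation matrix is 1 */
do i = 1 to nrow(Corr);
   R[row[i], col[i]] = Corr[i];  /* fill the upper triangle */
   R[col[i], row[i]] = Corr[i];  /* mirror into the lower triangle */
end;
print R;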

Summary

For a symmetric matrix, you can display the matrix in a wide format (an N x N matrix) or you can display the upper-triangular portion of the matrix in a single column (long format). This article shows how to use the SAS DATA step to convert a correlation matrix into a long format. By adding some additional variables, you can identify each value with a row and column in the upper-triangular portion of the symmetric matrix.

The post Convert a symmetric matrix from wide to long form appeared first on The DO Loop.

August 23, 2021
 

This article shows how to create a "sliced survival plot" for proportional-hazards models that are created by using PROC PHREG in SAS. Graphing the result of a statistical regression model is a valuable way to communicate the predictions of the model. Many SAS procedures use ODS graphics to produce graphs automatically or upon request. The default graphics from SAS procedures are sufficient for most tasks. However, occasionally I customize or modify the graphs that are produced by SAS procedures.

The graph to the right is an example of how to create a customized sliced fit plot from PROC PHREG. This article shows how to create the plot and others like it.

It is worth mentioning that SAS supports multiple procedures for analyzing survival data. Of these, PROC LIFETEST has the most extensive methods for customizing graphics. An entire chapter of the SAS/STAT User's Guide is devoted to customizing the Kaplan-Meier plot, which is a popular graph in the life sciences.

Sliced fit plots in SAS regression procedures

Before discussing PROC PHREG, let's review the concepts for a sliced fit plot for a regression model. In a "sliced fit plot," you display the predicted values for the model (typically versus a continuous covariate) for a specific value of the other continuous covariates. Often the mean value is used for slicing, but other popular choices include the quartiles: the 25th percentile, the median, and the 75th percentile.

My favorite way to create fit plots is to use the EFFECTPLOT SLICEFIT statement in PROC PLM. By default, the EFFECTPLOT statement evaluates the continuous covariates at their mean values, but you can use the SLICEBY= option or the AT= option to specify other values.

PROC PLM and the EFFECTPLOT statement were designed primarily for generalized linear models. If you try to use the EFFECTPLOT statement for other kinds of regression models (or nonparametric regressions), you might get a warning: WARNING: The EFFECTPLOT statement cannot be used for the specified model. In that case, you can manually create a sliced fit plot by scoring the model on a carefully prepared data set.
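
The main ingredient of the manual approach is a scoring data set in which one covariate varies over a grid while the other covariates are held fixed at the slicing values. The following DATA step is a sketch that uses hypothetical variable names (X for the continuous covariate on the horizontal axis and Z for the sliced covariate); you would then score the model on this data set, for example by using the SCORE statement in PROC PLM or the scoring mechanism of the procedure that fit the model:

/* Sketch: a scoring grid for a sliced fit plot (hypothetical variables X and Z) */
data ScoreGrid;
do Z = 0.25, 0.50, 0.75;      /* slicing values for Z (for example, its quartiles) */
   do X = 0 to 10 by 0.5;     /* grid of values for the continuous covariate X */
      output;
   end;
end;
run;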

Example data and the default behavior of PROC PHREG

PROC PHREG can create graphs automatically, so let's start by looking at the default survival plot. The following data is from an example in the PROC PHREG documentation. The goal is to predict the survival time (TIME and VSTATUS) of patients with multiple myeloma based on the measured values (log-transformed) of blood urea nitrogen (BUN). The example in the documentation analyzes nine variables, but for clarity, I only include two explanatory variables in the following data and examples:

data Myeloma;
   input Time VStatus LogBUN Frac @@;
   label Time='Survival Time'
         VStatus='0=Alive 1=Dead';
   datalines;
 1.25  1  2.2175  1   1.25  1  1.9395  1 
 2.00  1  1.5185  1   2.00  1  1.7482  1 
 2.00  1  1.3010  1   3.00  1  1.5441  0 
 5.00  1  2.2355  1   5.00  1  1.6812  0 
 6.00  1  1.3617  0   6.00  1  2.1139  1 
 6.00  1  1.1139  1   6.00  1  1.4150  1 
 7.00  1  1.9777  1   7.00  1  1.0414  1 
 7.00  1  1.1761  1   9.00  1  1.7243  1 
11.00  1  1.1139  1  11.00  1  1.2304  1 
11.00  1  1.3010  1  11.00  1  1.5682  0 
11.00  1  1.0792  1  13.00  1  0.7782  1 
14.00  1  1.3979  1  15.00  1  1.6021  1 
16.00  1  1.3424  1  16.00  1  1.3222  1 
17.00  1  1.2304  1  17.00  1  1.5911  0 
18.00  1  1.4472  0  19.00  1  1.0792  1 
19.00  1  1.2553  1  24.00  1  1.3010  1 
25.00  1  1.0000  1  26.00  1  1.2304  1 
32.00  1  1.3222  1  35.00  1  1.1139  1 
37.00  1  1.6021  0  41.00  1  1.0000  1 
41.00  1  1.1461  1  51.00  1  1.5682  1 
52.00  1  1.0000  1  54.00  1  1.2553  1 
58.00  1  1.2041  1  66.00  1  1.4472  1 
67.00  1  1.3222  1  88.00  1  1.1761  0 
89.00  1  1.3222  1  92.00  1  1.4314  1 
 4.00  0  1.9542  0   4.00  0  1.9243  0 
 7.00  0  1.1139  1   7.00  0  1.5315  0 
 8.00  0  1.0792  1  12.00  0  1.1461  0 
11.00  0  1.6128  1  12.00  0  1.3979  1 
13.00  0  1.6628  0  16.00  0  1.1461  0 
19.00  0  1.3222  1  19.00  0  1.3222  1 
28.00  0  1.2304  1  41.00  0  1.7559  1 
53.00  0  1.1139  1  57.00  0  1.2553  0 
77.00  0  1.0792  0 
;

Suppose you want to predict survival time by using LogBUN as the only explanatory variable. The following call to PROC PHREG builds the model and creates a survival plot:

/* A COVARIATES= data set is not specified, so the average value for LogBUN is used in the survival plot */
proc phreg data=Myeloma plots(overlay)=survival; /* use plots(overlay CL) to get confidence limits */
   model Time*VStatus(0)=LogBUN;
run;

Notice the title of the plot specifies that the survivor function is evaluated for LogBUN=1.3929. Why this strange value? Because this program did not use the BASELINE statement to specify a COVARIATES= data set. Consequently, the average value for LogBUN is used in the survival plot. You can use PROC MEANS to confirm the mean value of the LogBUN variable and to output other statistics such as the sample quartiles:

proc means data=Myeloma Q1 Median Mean Q3;
   var LogBUN;
run;

The next section shows how to specify the quartiles as slicing values for LogBUN when you create a survival plot.

Specify slicing values for a survival plot

PROC PHREG has a special syntax for producing a sliced fit plot. On the BASELINE statement, use the COVARIATES= option to name a SAS data set in which you have specified values at which to slice the covariates. If you do not specify a data set, or if your data set does not list some of the variables in the model, then the model is evaluated at the mean values for the continuous variables and at the reference levels for the CLASS variables.

Suppose you want to use the PHREG model to compare the probability of survival for three hypothetical patients whose LogBUN values are 1.15, 1.32, and 1.57, respectively. These values are the sample quartiles of the LogBUN distribution for the patients in the study. The following DATA step creates a SAS data set that contains the quartile values. If you add the BASELINE statement and the COVARIATES= option, then PROC PHREG automatically incorporates those values into the survival graph that it creates:

data InRisks;
length Id $9;
input Id LogBUN;
datalines;
Q1     1.15 
Median 1.32 
Q3     1.57 
;
 
proc phreg data=Myeloma plots(overlay)=survival; /* use plots(overlay CL) to get confidence limits */
   model Time*VStatus(0)=LogBUN;
   baseline covariates=Inrisks / rowid=Id;
run;

The survival graph now displays three curves, one for each specified value of the LogBUN variable. The curves are labeled by using the ROWID= option on the BASELINE statement. If you omit the option, the values of the LogBUN variable are displayed.

Create your own custom survival plot

This article was motivated by a SAS programmer who wanted to customize the survival plot. For simplicity, let's suppose that you want to modify the title, add a grid, increase the width of the lines, and add an inset. In that case, you can use the OUT= option to write the relevant predicted values to a SAS data set. The SURVIVAL= option specifies the name for the variable that contains the survival probabilities. However, if you use the keyword _ALL_, then the procedure outputs estimates, standard errors, and confidence limits, and provides default names. For the _ALL_ keyword, "Survival" is used for the name of the variable that contains the probability estimates. You can then use PROC SGPLOT to create a custom survival plot, as follows:

/* write output to a SAS data set */
proc phreg data=Myeloma noprint;
   model Time*VStatus(0)=LogBUN;
   baseline covariates=Inrisks out=PHOut survival=_all_ / rowid=Id;
run;
 
title "Customized Survival Curves";
proc sgplot data=PHOut;
   *band x=Time lower=LowerSurvival upper=UpperSurvival / group=Id transparency=0.75;
   step x=Time y=Survival / group=Id lineattrs=(thickness=2);
   inset ("Q1:"="LogBUN=1.15" 
          "Median:"="LogBUN=1.32" 
          "Q3:"="LogBUN=1.57") / border opaque;
   xaxis grid; yaxis grid;
run;

The graph is shown at the top of this article.

Including a classification variable

The Myeloma data contains a categorical variable (Frac) that indicates whether the patient had bone fractures at the time of diagnosis. If you want to include that variable in the analysis, you can do so. You can add that variable to the COVARIATES= data set to control which values are overlaid on the survival plot, as follows:

data InRisks2;
length Id $9;
input Id LogBUN Frac;
datalines;
Q1:0     1.15 0
Q1:1     1.15 1
Q3:0     1.57 0
Q3:1     1.57 1
;
 
proc phreg data=Myeloma plots(overlay)=survival;
   class Frac(ref='0');
   model Time*VStatus(0)=LogBUN Frac;
   baseline covariates=Inrisks2 out=PHOut2 survival=_all_/rowid=Id; 
run;

This process can be extended for other covariates in the model. When you add new variables to the model, you can also add them to the COVARIATES= data set, if you want.

Summary

PROC PHREG supports a convenient way to create a sliced survival plot. You can create a SAS data set that contains the names and values of covariates in the model. Variables that are not in the data set are assigned their mean values (if continuous) or their reference values (if categorical). The procedure evaluates the model at those values to create a survival plot for each hypothetical (or real!) patient who has those characteristics.

If you use the PLOTS= option on the PROC PHREG statement, the procedure will create graphs for you. Otherwise, you can use the OUT= option on the BASELINE statement to write the relevant variables to a SAS data set so that you can create your own graphs.

The post Sliced survival graphs in SAS appeared first on The DO Loop.

August 18, 2021
 

A previous article discusses the geometry of weighted averages and shows how choosing different weights can lead to different rankings of the subjects. As an example, I showed how college programs might rank applicants by using a weighted average of factors such as test scores. "The best" applicant is determined by the choice of weights. If you change the weights, you change the definition of "the best."

The same ideas can be used to rank sports teams at tournaments. For some sports, the governing body dictates how to rank teams based on the team's performance in individual events. However, I can think of at least one popular international sporting event that awards medals (gold, silver, and bronze) to individual athletes, but does not endorse a method for converting the number of medals won into a team score. Accordingly, fans have suggested multiple ranking methods. This article discusses various ranking methods and shows how the rankings change for each method.

Some sports have established weights

For some sports, the governing body has published how to rank teams based on the performance of team members in individual events. Some sports use different scoring systems at different levels of competition, but a few representative systems are listed below.

  • In the group stage of World Cup soccer, teams get 3 points for a win, 1 point for a draw, and 0 points for a loss. Thus, team ranks are based on the weights (3, 1, 0).
  • In some swimming leagues, the team of the first-place winner in an individual event is awarded 6 points. Subsequent places are awarded 4, 3, 2, and 1 point. Swimmers who do not score in the top five do not score points. So team scores (for individual events) are based on the weights (6,4,3,2,1,0,0,0), if you use an eight-lane pool.

Notice a key characteristic of these weight vectors: More weight is given for finishing first. In fact, the weights tend to be decreasing (or at least not increasing) so that the first-place team gets more points than the second-place team, which gets more points than the third-place team, and so on.

Medals at international competitions

One popular sporting tournament awards medals to the top-three individuals but does not endorse a formula that converts medals into a team score. But that doesn't prevent fans from coming up with their own formulas! Some fans argue that "the best" team is the one with the most gold medals. Others argue that the team that has the most overall medals is the best. Others favor a weighted scheme in which a gold medal is worth more than a silver, which is worth more than a bronze. This last scheme has many variations, depending upon how much weight you assign to each medal.

Suppose that the i_th team wins Gi gold medals, Si silver medals, and Bi bronze medals. In a weighting scheme, you choose the weights for gold, silver, and bronze medals. If the vector w = (wG, wS, wB) contains the weights, then the score for the i_th team is computed as Scorei = wG*Gi + wS*Si + wB*Bi. In addition, we impose the constraints that wG ≥ wS ≥ wB ≥ 0.
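
In matrix form, if M is the N x 3 matrix of medal counts (columns for gold, silver, and bronze), then the vector of team scores is the matrix-vector product M*w. The following SAS/IML statements sketch the computation for three hypothetical teams; the medal counts are made up for illustration:

proc iml;
/* hypothetical medal counts: columns are Gold, Silver, Bronze */
M = {10  5  3,
      8  9  1,
      6  6 12};
w = {5, 3, 1} / 9;    /* for example, the London 1908 weights, scaled to sum to 1 */
Score = M * w;        /* weighted score for each team */
print Score;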

Equal weights: A popular weighting method

Many media outlets report the total medals awarded, so let's start by looking at the weighting scheme with the weight vector w = (1, 1, 1). For data, let's use the medal counts from an international competition that took place in Tokyo, Japan. You can use PROC RANK in SAS to rank the teams according to the total number of medals, as follows:

/* Data for medals awarded at the 2020 Summer Games (Tokyo, Japan)
   for the 25 teams who won the most medals.
   Data from Wikipedia: https://bit.ly/3yQuz0g
*/
data Medals2020;
length Team $15;
input Team 1-15 Gold Silver Bronze;
Total = Gold + Silver + Bronze;   /* equal weights: w = (1, 1, 1) */
Order = _N_;
datalines;
United States  39       41      33      
P.R. of China  38       32      18
ROC (Russia)   20       28      23
Great Britain  22       21      22
Japan          27       14      17
Australia      17       7       22
Italy          10       10      20
Germany        10       11      16
Netherlands    10       12      14
France         10       12      11
Canada         7        6       11
Brazil         7        6       8
New Zealand    7        6       7
Hungary        6        7       7
Rep. of Korea  6        4       10
Ukraine        1        6       12
Spain          3        8       6
Cuba           7        3       5
Poland         4        5       5
Switzerland    3        4       6
Turkey         2        2       9
Chinese Taipei 2        4       6
Czech Republic 4        4       3
Denmark        3        4       4
Kenya          4        4       2
;
 
proc rank data=Medals2020 out=Rank ties=Low descending;
   var Total;
   ranks Rank;
run;
 
ods graphics /width=400px height=600px;
title "Teams Ranked by Total Medals";
proc sgplot data=Rank;
   hbar Team / response=Total datalabel=Rank;
   yaxis discreteorder=data display=(nolabel);
   xaxis grid label="Total Medals";
run;

The graph displays the total medal count for 25 participating teams. The number to the right of each bar is the rank for the team when you choose equal weights for the medals. This is merely one weighting method that you can use.

Under this set of weights, a team that wins 10 gold medals (and nothing else) gets the same rank as a team that wins 10 bronze medals (and nothing else). To some, this seems unfair to the first team. Consequently, it makes sense to consider weights for which gold is weighted more than silver, which is weighted more than bronze.

Alternative weighting methods

Let's see how these rankings change if you choose a different set of weights. If you intend to compare different methods, it is useful to scale the weight vectors so that the "total weight" is unity. Thus, you would modify the previous weight vector to w = (1, 1, 1)/3, which is interpreted as "1/3 of the score comes from the number of gold medals, 1/3 comes from the number of silver medals, and 1/3 comes from the number of bronze medals." The following methods are popular choices for weights. For more about methods for weighting medal counts, see Wood, R. (2010) "Comparison of Weighted Ranking Systems."

  • Most Medals: w=(1,1,1)/3.
  • Most Gold Medals: w=(1,0,0).
  • Arithmetic weights: w=(3,2,1)/6. This method values a gold medal as equal to one silver and one bronze.
  • Geometric weights: w=(4,2,1)/7. For this method, a gold medal equals two silvers, and a silver medal equals two bronzes.
  • London 1908: w=(5,3,1)/9. This set of weights was used at the 1908 London Games.

You can use the SAS DATA step to compute the scores for each team under each different weighting scheme:

data WeightedScores;
length Wt $24;
set Medals2020;
Method=1; Wt="Most Medals: w=(1,1,1)/3"; Score = (1*Gold + 1*silver + 1*Bronze)/3;
output;
Method=2; Wt="Most Gold: w=(1,0,0)";     Score = 1*Gold + 0*silver + 0*Bronze;
output;
Method=3; Wt="Arithmetic: w=(3,2,1)/6";  Score = (3*Gold + 2*silver + 1*Bronze)/6;
output;
Method=4; Wt="Geometric: w=(4,2,1)/7";   Score = (4*Gold + 2*silver + 1*Bronze)/7;
output;
Method=5; Wt="London 1908: w=(5,3,1)/9"; Score = (5*Gold + 3*silver + 1*Bronze)/9;
output;
run;
 
/* Use PROC RANK to run a BY-group analysis for these weighting methods */
proc sort data=WeightedScores;   by Method;  run;
proc rank data=WeightedScores out=Ranks ties=Low descending;
   by Method;
   var Score;
   ranks Rank;
run;
 
/* prepare data for graphing */
proc sort data=Ranks;   by Method Order;  run;
 
ods graphics /width=900px height=600px;
title "Teams Ranked by Alternative Weighting Schemes";
proc sgpanel data=Ranks;
   where Method^=1;
   panelby Wt / novarname rows=1 onepanel sort=data;
   hbar Team / response=Score datalabel=Rank;
   colaxis grid label="Weighted Score";
   rowaxis discreteorder=data display=(nolabel);
run;

Click on the panel to enlarge the graph. In the panel, the teams are listed according to the total number of medals. In each panel, the length of the bar shows the team scores under a different weighting scheme. The numbers to the right of the bars indicate the team ranks. For these data, all methods consistently rank USA and China as the top teams. However, the ranks of other teams move around as you change the weighting criteria. Pay particular attention to teams that have an unequal distribution of medal types, such as Ukraine (1 gold; 6 silver; 12 bronze), Cuba (7 gold, 3 silver, and 5 bronze), and Turkey (2 gold, 2 silver, 9 bronze). For methods that give gold much more weight than bronze, Cuba will rank higher, and Ukraine and Turkey will rank lower, relative to using equal weights.

  • If you use the "Most Gold" ranking, Japan's team leapfrogs into third place, followed by the UK. A problem with the "Most Gold" ranking is that it results in many ties. For example, four teams are tied for seventh place. Should Italy (10 gold; 10 silver; 20 bronze) rank the same as France (10 gold; 12 silver; 11 bronze)? Maybe so, maybe not!
  • If you use the "Arithmetic Ranking," the teams at the top of the list do not change. It is not until you compute the scores for Cuba and Spain that you start to see teams switch places in the rankings. Teams that move down in the rankings are those who won a relatively large number of bronze medals.
  • If you use the "Geometric Ranking," Japan and the UK switch places at the top of the ranks, but mostly the changes occur lower in the rankings. As expected, Ukraine, Cuba, and Turkey are heavily affected.
  • The "London 1908" method is very similar to the "Arithmetic" ranking. The London method uses an arithmetic progression of weights, but the step size is 2 instead of 1.

Summary

Rankings are everywhere. Behind every ranking is a vector of weights that assigns the relative importance of the factors. As demonstrated in these articles, if you change the weights, you will change the scores, which often changes the rankings. Consequently, when you see a ranking reported in the media, you should think about what criteria (and biases!) are used to create the ranking. Are they the same criteria that you value? Furthermore, you should realize that a slight tweak to the weights will often change the rankings. A team that is ranked 10th in one set of rankings might be ranked considerably higher or lower if you use slightly different criteria. When the rankings are based on statistical quantities (such as means and proportions), you can add confidence limits that quantify the uncertainty in the rankings.

Although the example uses sports teams, the same analysis is generally applicable for ranking other quantities: stocks, loan applicants, suppliers, job applicants, and so forth.

The post A comparison of different weighting schemes for ranking sports teams appeared first on The DO Loop.