data analysis

3月 272023
 

A previous article discusses the issue of a confounding variable and uses correlation to give an example. The example shows that the correlation between two variables might be affected by a third variable, which is called a confounding variable. The article mentions that you can use the PARTIAL statement in PROC CORR and other SAS procedures to account for the presence of the variable. Some problems have more than one confounding variable. The PARTIAL statement is an easy way to regress the original variables onto the confounding variable(s) and to compute a statistic on the residual values.

In the previous example, the correlation between the original variables (HEIGHT and WEIGHT) is 0.88. The partial correlation (which is the correlation between the residuals after regressing the variables on AGE) is 0.7, which is somewhat less. Remarkably, controlling for a confounding variable can not only affect the magnitude of a correlation, but it can also even change the SIGN of the correlation! This effect is known as Simpson's paradox. Simpson's paradox is that aggregated data can show relationships that are not present (or are even reversed!) in subpopulations of the data. It shows the importance of accounting for confounding variables when you do a statistical analysis. This article provides an example of Simpson's paradox.

Data for Simpson's paradox

Simpson's paradox always involves a confounding variable. (Which some people call a "lurking variable.) When you aggregate the data (ignoring the confounding variable), you get one result, and when you account for the confounding variable you obtain a completely different result. In many examples, the confounding variable is a categorical variable, but in the following example, the confounding variable is a continuous variable, age.

Consider the following experiment. Researchers are testing a herbicide that kills a particular weed. The experiment has multiple plots. Twelve plots contain weeds that are one week old, 12 plots contain weeds that are two weeks old, and so forth, up to weeds that are 10 weeks old. The young weeds are easier to kill and require less herbicide. The researchers want to know what proportion of the weeds will survive after being sprayed with various concentrations of the herbicide.

The following SAS DATA set defines the data and graphs the results of the experiment:

data Simpson;
label Dose = "Concentration (g/L)"
      Prob = "Probability of Survival";
input Dose Prob Age @@;
Prob = Prob / 100;    /* convert from percentage to proportion */
ID = _N_;
datalines;
0.0225 54.5 1 0.0325 59.5 1 0.0275 61.0 1 0.0325 58.5 1 0.0200 57.5 1 
0.0200 51.0 1 0.0250 55.5 1 0.0225 61.5 1 0.0325 58.5 1 0.0200 56.5 1 
0.0225 56.0 1 0.0250 54.0 1 0.0300 57.0 2 0.0225 51.5 2 0.0275 47.0 2 
0.0250 54.5 2 0.0375 53.5 2 0.0300 54.5 2 0.0350 55.0 2 0.0375 59.0 2 
0.0275 53.5 2 0.0350 57.0 2 0.0325 53.5 2 0.0250 52.5 2 0.0275 48.5 3 
0.0325 49.0 3 0.0350 47.5 3 0.0300 51.5 3 0.0325 53.5 3 0.0450 54.5 3 
0.0350 53.5 3 0.0375 56.0 3 0.0350 51.5 3 0.0350 50.5 3 0.0300 46.5 3 
0.0400 49.0 3 0.0375 48.0 4 0.0350 44.5 4 0.0375 43.5 4 0.0425 49.5 4 
0.0375 51.5 4 0.0500 52.0 4 0.0350 48.0 4 0.0375 44.5 4 0.0425 51.0 4 
0.0375 50.0 4 0.0475 48.0 4 0.0400 45.0 4 0.0400 40.5 5 0.0450 45.0 5 
0.0400 49.5 5 0.0400 42.0 5 0.0425 42.5 5 0.0500 48.5 5 0.0425 42.0 5 
0.0475 47.0 5 0.0400 44.0 5 0.0525 45.0 5 0.0500 48.5 5 0.0525 46.5 5 
0.0475 38.0 6 0.0500 38.0 6 0.0525 46.5 6 0.0475 44.5 6 0.0425 40.0 6 
0.0625 45.0 6 0.0525 41.5 6 0.0475 39.5 6 0.0525 44.0 6 0.0500 39.5 6 
0.0475 44.0 6 0.0550 43.5 6 0.0550 38.5 7 0.0550 39.0 7 0.0500 36.0 7 
0.0550 35.0 7 0.0475 35.0 7 0.0525 39.0 7 0.0625 40.0 7 0.0625 42.0 7 
0.0600 44.5 7 0.0475 39.0 7 0.0575 37.0 7 0.0525 42.5 7 0.0550 36.0 8 
0.0650 32.5 8 0.0675 38.5 8 0.0600 34.5 8 0.0625 35.0 8 0.0575 35.0 8 
0.0550 39.0 8 0.0550 33.0 8 0.0550 33.0 8 0.0600 37.5 8 0.0600 34.5 8 
0.0700 43.0 8 0.0650 35.0 9 0.0650 32.0 9 0.0600 32.0 9 0.0600 29.5 9 
0.0600 30.5 9 0.0600 31.0 9 0.0700 40.0 9 0.0625 29.5 9 0.0625 34.5 9 
0.0750 31.5 9 0.0700 34.0 9 0.0675 36.0 9 0.0700 30.0 10 0.0700 25.0 10 
0.0700 30.5 10 0.0700 28.5 10 0.0725 32.0 10 0.0700 26.5 10 0.0625 30.5 10 
0.0650 27.5 10 0.0775 32.0 10 0.0650 28.5 10 0.0725 35.0 10 0.0800 34.0 10 
;
 
proc corr data=Simpson noprob plots=scatter;
   var Dose Prob;
run;

The graph shows a scatter plot of the proportion of weeds that survived the herbicide at various concentrations. The correlation between the concentration of the dose and the survival probability is -0.85, which demonstrates a strong negative correlation.

In other words, the more poison you use, the fewer weeds die.

What? Clearly, something is wrong! This conclusion does not make sense. How can applying MORE poison result in a smaller proportion of dead weeds?

The problem is that we have aggregated data, ignoring the fact that some plots had young weeds (and low concentrations of herbicide) whereas other plots had older weeds (and higher concentrations). If you include the age of the weeds in the scatter plot by coloring the markers, you see that the young weeds are in the upper left corners and the older weeds are in the lower right corner:

title "Color Markers by Age";
proc sgplot data=Simpson;
   scatter x=Dose y=Prob / colorresponse=Age
                       markerattrs=(symbol=CircleFilled size=14) FilledOutlinedMarkers
                       colormodel=TwoColorRamp;
run;

Accounting for the confounding variable

If we account for the confounding variable by using the PARTIAL statement in PROC CORR to compute a partial correlation, we get the following graph and statistic:

proc corr data=Simpson noprob plots=scatter;
   var Dose Prob;
   partial Age;
run;

The scatter plot is the plot of the residuals of the original variables after being regressed onto age. If we account for the age of the weeds, the correlation between the dosage and the proportion of weeds that die is 0.52, which is positive, as expected. This positive relationship between the dose and the proportion of dead weeds makes intuitive sense.

If we revisit the original data, we can visualize what is happening in the subpopulations that have 1-week-old weeds, 2-week-old weeds, and so forth:

title "Regression by Age";
proc sgplot data=Simpson;
   reg x=Dose y=Prob / Group=Age;
run;

The graph shows that in each age group, the relationship between the dosage and the death rate is positive, as you would expect. A graph like this appears in the Wikipedia article about Simpson's paradox.

Summary

Simpson's paradox is well-known to statisticians. It occurs when the data contains a confounding variable. If you do not account for the confounding variable, the data appears to be related in one way, but when you account for the confounding variable, the relationship is very different. This article shows two variables are negatively correlated if you ignore age, but they are positively correlated if you account for age.

The post Simpson's paradox and confounding variables appeared first on The DO Loop.

3月 222023
 

A data analyst wanted to estimate the correlation between two variables, but he was concerned about the influence of a confounding variable that is correlated with them. The correlation might affect the apparent relationship between main two variables in the study. A common confounding variable is age because young people and older people tend to have different attitudes, behaviors, resources, and health issues.

This article gives a brief overview of partial correlation, which is a way to adjust or "control" for the effect of other variables in a study.

In this article, a confounding variable is a variable that is measured in the study and therefore can be incorporated into a model. This is different than a lurking variable, which is a variable that is not measured in the study. These definitions are common, but not universally used. The process of incorporating a confounding variable into a statistical model is called controlling for the variable.

Age as a confounding variable

Age is the classic confounding variable, so let's use a small SAS data set that gives the heights, weights, and ages of 19 school-age children. To make the analysis more general, I define macro variables Y1 and Y2 for the names of the two variables for which we want to find the correlation. For these data, the main variables are Y1=Height and Y2=Weight, and the confounding variable is Age.

To begin, let's compute the correlation between Y1 and Y2 without controlling for any other variables:

%let DSName = sashelp.Class;
%let Y1 = Height;
%let Y2 = Weight;
 
proc corr data=&DSName noprob plots(maxpoints=none)=scatter(ellipse=none);
   var &Y1;
   with &Y2;
run;

The output from the procedure indicates that Y1 and Y2 are strongly correlated, with a correlation of 0.88. But the researcher suspects that the ages of the students are contributing to the strong association between height and weight. You can test this idea graphically by creating a scatter plot of the data and coloring the markers according to the value of a third variable, as follows.

%let ControlVars = Age;         /* list one or more control variables */
 
title "Color Markers by a Confounding Variable";
proc sgplot data=&DSName;
   scatter x=&Y1 y=&Y2 / colorresponse=%scan(&ControlVars, 1)   /* color by first control var */
                         markerattrs=(symbol=CircleFilled size=14) FilledOutlinedMarkers
                         colormodel=TwoColorRamp;
run;

The graph shows that low values of height and weight are generally associated with low values of age, which are light blue in color. Similarly, high values of height and weight are generally associated with high values of age, which are colored dark blue. This indicates that the age variable might be affecting the correlation, or even might be responsible for it. One way to "adjust" for the age variable is to use partial correlation.

What is partial correlation?

A partial correlation is a way to adjust a statistic to account for one or more additional covariates. The partial correlation between variables Y1 and Y2 while adjusting for the covariates X1, X2, X3, ... is computed as follows:

  1. Regress Y1 onto the covariates and calculate the residuals for the model. Let R1 be the variable that contains the residuals for the model.
  2. Regress Y1 onto the covariates. Let R2 be the column of residuals.
  3. Compute the correlation between R1 and R2. This is the partial correlation between Y1 and Y2 after adjusting for the covariates.

How to compute partial correlation in SAS

SAS provides an easy way to compute the partial correlation. PROC CORR supports the PARTIAL statement. On the PARTIAL statement, you list the covariates that you want to account for. PROC CORR automatically computes the regression models and provides the correlation of the residuals. In addition to the partial Pearson correlation, PROC CORR can report the partial versions of Spearman's rank correlation, Kendall's association, and the partial variance and standard deviation. The partial mean is always 0, so it is not reported.

The following call to PROC CORR shows the syntax of the PARTIAL statement:

proc corr data=&DSName plots(maxpoints=none)=scatter(ellipse=none);
   var &Y1;
   with &Y2;
   partial &ControlVars; 
run;

The output includes tables and a scatter plot of the residuals of Y2 versus the residuals of Y1. Only the graph is shown here. The graph includes an inset that tells you that the partial correlation is 0.7, which is less than the unadjusted correlation (0.88). Although the axes are labeled by using the original variable names, the quantities that are graphed are residuals from the two regressions. Notice that both axes are centered around zero, and the range of each axis is different from the original range.

How should you interpret this graph and the partial correlation statistic? The statistic tells you that after adjusting for age, the correlation between the heights and weights of students is about 0.7. The graph shows the scatter plot of the residual values of the heights and weights after regressing those variables onto the age variable.

How to manually calculate partial correlations

You can obtain the partial correlation manually by performing each step of the computation. This is not necessary in SAS, but it enables you to verify the computations that are performed by the PARTIAL statement in PROC CORR. To verify the output, you can manually perform the following steps:

  1. Use PROC REG to regress Y1 onto the covariates. Use the OUTPUT statement to save the residuals for the model.
  2. Use PROC REG to regress Y2 onto the covariates. Save the residuals for this model, too.
  3. Merge the two output data sets.
  4. Use PROC CORR to compute the (raw) correlation between the residual variables. This is the partial correlation between Y1 and Y2 after adjusting for the covariates.

The following SAS statements perform this analysis:

/* 1. Regress Y1 onto covariates and save residuals */
proc reg data=&DSName noprint;
   model &Y1 = &ControlVars;
   output out=RegOut1 R=Part_&Y1;
quit;
/* 2. Regress Y2 onto covariates and save residuals */
proc reg data=&DSName noprint;
   model &Y2 = &ControlVars;
   output out=RegOut2 R=Part_&Y2;
quit;
/* 3. Merge the two sets of residuals */
data RegOut;
merge RegOut1(keep=&ControlVars Part_&Y1)
      RegOut2(keep=Part_&Y2);
label Part_&Y1 = "Residual of &Y1 ~ (&ControlVars)"
      Part_&Y2 = "Residual of &Y2 ~ (&ControlVars)";
run;
/* 4. Display the correlation between the residuals */
proc corr data=RegOut plots=scatter(ellipse=none);
   var Part_&Y1;
   with Part_&Y2;
run;

As expected, the graph of the residual values is the same graph that was created by using the PARTIAL statement. As expected, the inset shows that the correlation between the residuals is the same value that was reported by using the PARTIAL statement.

Summary

It can be useful to account for or "adjust" a statistic to account for other variables that might be strongly correlated with the main variables in the analysis. One way to adjust for confounding variables is to regress the main variables onto the confounding variables and look at the statistic for the residual values. This is called a "partial" statistic and is supported in SAS by using the PARTIAL statement in several procedures. This article shows how to use the PARTIAL statement in PROC CORR to compute the partial correlation. It also shows how to manually reproduce the computation by explicitly performing the regressions and calculating the correlation of the residuals.

Appendix: PROC REG also provides (squared) partial correlations

In SAS, there are often multiple ways to get the same statistic. It turns out that you can get the SQUARED Pearson partial correlation from PROC REG by using the PCORR2 option on the MODEL statement. I don't often use this option, but for completeness, the following statements compute the squared partial correlation between Y2 and Y1:

proc reg data=&DSName plots=none;
   model &Y2 = &Y1 &ControlVars / pcorr2;
   ods select ParameterEstimates;
quit;

The "Type II Squared Partial Correlation" between Y2 and Y1 is 0.49656. If you take the square root of that number, you get ±0.7047. The tabular output from PROC REG does not enable you to determine whether the partial correlation is positive or negative.

The post Partial correlation: controlling for confounding variables appeared first on The DO Loop.

3月 202023
 

In a previous article about Markov transition matrices, I mentioned that you can estimate a Markov transition matrix by using historical data that are collected over a certain length of time. A SAS programmer asked how you can estimate a transition matrix in SAS. The answer is that you can use PROC FREQ to tabulate the number of transitions from one state to another. The procedure outputs counts and row percentages, either of which can be used to construct an estimate of the transition matrix.

Transition of families through economic classes

Let's start with some data. A charity provides education, healthcare, and agricultural assistance to an impoverished town in Central America. The charity wants to estimate the transition of families through various economic categories (or states) based on the total family income:

  • State 1 is used for the poorest families who earn less than $2 per day per person.
  • State 2 is for families who earn between $2-$10 per day per person.
  • State 3 is for families who earn between $10-$20 per day per person.
  • State 4 is for families who earn more than $20 per day per person.

The charity has access to the economic status of 160 families who have been in the program for five years. The following SAS DATA step defines the beginning and ending states for these families:

/* State 1: families who earn less than $2 per day
   State 2: families who earn between $2-$10 per day
   State 3: families who earn between $10-$20 per day
   State 4: families who earn more than $20 per day
*/
data History;
retain ID;
input BeginState EndState @@;
ID + 1;
datalines;
1 2  2 2  1 3  3 3  3 3  3 3  1 1  3 2  4 4  3 3 
4 4  1 1  3 2  1 1  1 3  3 3  2 2  2 2  2 2  3 2 
2 3  1 3  1 1  1 2  4 3  1 1  3 4  1 3  3 3  1 2 
1 2  3 3  1 3  3 4  2 2  1 2  3 2  1 2  1 1  3 2 
1 3  1 1  1 1  1 1  1 3  1 3  3 3  1 1  2 2  4 4 
1 1  2 3  1 1  1 2  2 2  2 2  1 3  2 2  1 1  1 2 
3 3  1 3  4 4  1 3  3 4  1 1  1 2  2 2  1 2  3 2 
1 1  3 3  3 3  1 2  1 1  1 2  3 3  2 2  1 3  3 2 
1 1  1 2  1 1  4 2  1 2  1 3  1 2  1 1  2 1  1 1 
2 3  1 2  2 2  1 1  3 4  1 1  1 1  2 2  3 3  4 3 
3 2  4 3  1 1  2 1  2 3  2 2  1 2  4 4  1 2  2 1 
2 1  2 2  1 1  2 3  4 4  1 2  1 1  2 2  1 2  4 2 
1 1  2 2  1 1  1 2  1 1  2 1  2 1  1 2  2 3  2 2 
3 3  4 3  1 1  2 2  1 1  2 1  1 1  2 2  1 1  1 1 
1 1  2 2  1 1  3 2  1 3  3 2  3 3  4 4  1 1  4 2 
3 3  4 4  3 2  4 4  2 2  1 3  3 3  4 4  4 3  1 2 
;
 
proc print data=History(obs=10) noobs;
run;

The output from PROC PRINT shows the beginning and ending states for 10 families. The first family was in State 1 at the beginning of the program but was in State 2 at the end of the program. The second family was in State 2 at the beginning and remained there. The third family was in State 1 at the beginning of the program but was in State 3 at the end, and so forth. You can use PROC FREQ to tabulate the matrix of counts for the transitions from one category into another, as follows:

proc freq data=History ;
   tables BeginState * EndState / out=freqOut sparse nocol nopercent outpct;
run;

The output shows the counts and the row percentages for the data. The first row of the output is for the 75 families who started the program in State 1. Of those families, 37 (49%) remained in State 1 at the end of the program, 23 (31%) had progressed to State 2, and 15 (20%) had progressed to State 3. The second row of the output is for the 35 families who started the program in State 2. Of those families, 7 (20%) regressed to State 1, 22 (63%) remained in State 2, and 6 (17%) advanced to State 3. The other rows of the output are interpreted similarly.

You can use the row percentages to estimate the transition matrix. Merely divide each percentage by 100 to obtain a proportion. The proportion in the (i,j)th cell estimates the probability that a family that was in State i at the beginning of the program is in State j at the end of the program.

Reading the probabilities into a transition matrix

Most SAS programmers use SAS IML software to work with Markov transition matrices. The output from PROC FREQ is in "long form" in a data set that has 16 rows. You can read the estimates into a SAS IML vector and then reshape them into a 4 x 4 matrix. You can create the matrix in two ways: you can read the raw counts into a matrix and then divide each row by the row sum, or you can read the row percentages directly and then divide by 100 to obtain probabilities.

proc iml;
use freqOut;
   read all var {'BeginState' 'EndState' 'Count' 'Pct_Row'};  /* read the states, counts, and row percentages */
close;
 
N = sqrt(nrow(Count));    /* this should be an integer or else something is wrong */
names = EndState[1:N];    /* there should be N unique states */
 
/* estimate transition matrix by using counts of transitions */
C = shape(Count, N, N);   /* matrix of raw counts for each transition */
M = C / C[,+];            /* divide each cell by total counts for the row */
print M[r=names c=names];
 
/* or read the PCT_ROW column directly */
M = shape(Pct_Row, N, N);       /* raw counts */
M = M / 100;                    /* convert from percentages to proportions */
print M[r=names c=names];

The matrix of counts shows that 23+15+6+4 = 48 out of 160 families improved their states during the program. Only 7+11+3+5 = 26 families ended the program in a worse state than they began.

The estimates for the transition probability matrix are the same for both calculations, so only one output is shown. For the poorest families (the first row), about 50% did not improve their economic state whereas the other 50% did. For the families that began in State 2, 20% slipped back into extreme poverty (unfortunately), 63% stayed in that state, and 17% increased their state. The remaining rows have similar interpretations.

Summary

This article shows how to use PROC FREQ in SAS to construct an estimate of a Markov transition matrix from historical data. For each subject in the study, you need to know the state of the subject at the beginning and at the end of a time period. You can then construct a matrix of counts for the transition of subjects between states. By dividing each row by its row sum, you obtain empirical probability estimates.

The post Estimate a Markov transition matrix from historical data appeared first on The DO Loop.

3月 152023
 

Most homeowners know that large home improvement projects can take longer than you expect. Whether it's remodeling a kitchen, adding a deck, or landscaping a yard, big projects are expensive and subject to a lot of uncertainty. Factors such as weather, the availability of labor, and the supply of materials, can all contribute to uncertainty in the duration of a project.

If you ask an expert (for example, an experienced general contractor) to predict the length of time required to complete a project, he or she might respond as follows:

  • 10% of the time, a project like this takes 17 work days or less.
  • 50% of the time, we can complete it in 24 work days or less.
  • 90% of the time, we finish the work by the 35th work day.

Whether the contractor realizes it or not, these sentences are estimating three quantiles of a continuous probability distribution for the completion of the project.

Now, if you are planning a garden party on your new deck, you don't want to schedule the party before the deck is finished. One way to predict when you can have the party is to fit a probability distribution to the contractor's estimates. But what distribution should you use? The contractor's estimates do not describe a symmetric distribution, such as the normal distribution. Without domain-specific knowledge, there is no reason to prefer one distribution over the other. Furthermore, you don't have any real data to fit the model, you have only the three estimated quantiles.

This is an excellent opportunity to use the flexible metalog family of distributions. This article shows how to convert the expert's estimates into a distribution that you can compute with and simulate data from.

Create a distribution from expert opinion

In this example, the expert predicted the quantiles (completion time) for three cumulative probabilities (10%-50%-90%). Mathematically, the expert has provided three points that are on (or near) the cumulative distribution for the completion time. Equivalently, the expert has provided three points on (or near) the quantile function. It is easy to fit a model to these values by using the metalog distribution. As discussed in a previous article, SAS supports the metalog distribution through a package of SAS IML functions that you can download from GitHub.

The following program assumes that you have downloaded and installed the metalog functions by using the instructions in the previous article or by following the instructions in Appendix A of the documentation for the metalog package, which is in the file MetalogDoc.pdf. This example uses a three-term metalog model. I assume that the project is never completed in less than 10 days, so I fit a semibounded model that has support on the interval [10, ∞). You can use the ML_CreateFromCDF function to create a metalog object from the estimates. (An estimate of this type is called a symmetric-percentile triplet (SPT) because the quantile are provided for the probabilities (α, 1/2, 1-α), where 0 < α < 1/2. In this example, α = 0.1.)

proc iml;
/* This LOAD statement assumes that the metalog functions are defined as in  Appendix A of the documentation.
   See the MetalogDoc.pdf file at https://github.com/sassoftware/sas-iml-packages/tree/main/Metalog */
load module=_all_;         
 
prob = {0.1, 0.5, 0.90};   /* cumulative probabilities; note symmetric-percentile triplet */
x    = {17,   24,  35};    /* associated quantiles */
 
order = 3;                 /* three terms in model */
bounds = {10 .};           /* support on [10, infinity) */
 
SPT = ML_CreateFromCDF(x, prob, order, bounds );  /* create model from SPT */
title "3-Term Metalog Model of Expert Opinion";
p = do(0.01, 0.99, 0.01);
run ML_PlotECDF(SPT, p);  /* graph the model CDF and the expert's estimates */

For these three data points, a three-term metalog model passes through the expert's estimates, so the model is a perfect fit. If you prefer to view the probability density for the model, you can use the ML_PlotPDF call, as follows:

title "PDF of 3-Term Metalog Model of Completion Times";
run ML_PlotPDF(SPT, p);

The PDF indicates the most probable duration of the project (the mode) is about 20-25 days.

The expert gave estimates for three probability values. But what if you want estimates for other probabilities? For example, let's calculate dates for which the model predicts a 95% and a 99% chance that the deck will be completed. We can get those estimates by evaluating the quantile function of the metalog model. Let's evaluate the quantile function at 0.95 and 0.99, as well as at a few other values:

/* estimates for other percentiles */
p = {0.05, 0.25, 0.75, 0.95, 0.99};
estTime = ML_Quantile(SPT, p);
print p[F=PERCENT9.] estTime[F=6.2];

According to the model, there is a 95% chance that the project will be finished by the 40th day. There is a 99% chance that the project will be completed within 56 days. You can use those estimates to schedule a party on your new deck!

Summary

The metalog distribution can be used to construct a probability distribution from an expert's opinion. This example used three estimates for the completion of a home-improvement project. The same technique enables you to fit a metalog distribution to four, five, or more quantile estimates from an expert. The metalog model provides an easy way to convert an expert's opinion into a probability distribution that can be used for simulation and modeling.

The post Fitting a distribution to an expert's opinion: An application of the metalog distribution appeared first on The DO Loop.

3月 132023
 

A previous article describes the metalog distribution (Keelin, 2016). The metalog distribution is a flexible family of distributions that can model a wide range of shapes for data distributions. The metalog system can model bounded, semibounded, and unbounded continuous distributions. This article shows how to use the metalog distribution in SAS by downloading a library of SAS IML functions from a GitHub site.

The article contains the following sections:

  • An overview of the metalog library of SAS IML functions
  • How to download the functions from GitHub and include the functions in a program
  • How to use a metalog model to fit an unbounded model to data
  • A discussion of infeasible models

For a complete discussion of the metalog functions in SAS, see "A SAS IML Library for the Metalog Distribution" (Wicklin, 2023).

The metalog library of functions in SAS

The metalog library is a collection of SAS IML modules. The modules are object-oriented, which means that you must first construct a "metalog object," usually from raw data. The object contains information about the model. All subsequent operations are performed by passing the metalog object as an argument to functions. You can classify the functions into four categories:

  • Constructors create a metalog object from data. By default, the constructors fit a five-term unbounded metalog model.
  • Query functions enable you to query properties of the model.
  • Computational functions enable you to compute the PDF and quantiles of the distribution and to generate random variates.
  • Plotting routines create graphs of the CDF and PDF of the metalog distribution. The plotting routines are available when you use PROC IML.

Before you create a metalog object, you must choose how to model the data. There are four types of models: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. Then, you must decide the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k > 6) can fit multimodal data distributions, but be careful not to overfit the data by choosing k too large.

The most common way to create a metalog object is to fit a model to the empirical quantiles of the data, which estimates the k parameters in the model. For an unbounded metalog model, the quantile function is
\( Q(p) = a_1 + a_2 L(p) + a_3 c(p) L(p) + a_4 c(p) + a_5 c(p)^2 + a_6 c(p)^2 L(p) + a_7 c(p)^3 + \ldots \)
where L(p) = Logit(p) is the quantile function of the logistic distribution and c(p) = p – 1/2 is a linear function. You use ordinary least squares regression to fit the parameters (\(a_1, a_2, \ldots, a_k\)) in the model.

The semibounded and bounded versions are metalog models of log-transformed variables.

  • If X > L, fit a metalog model to Z = log(X – L)
  • If L < X < U, fit a metalog model to Z = log((X – L) / (U – X))

How to download the metalog functions from GitHub

The process of downloading SAS code from a GitHub site is explained in the article, "How to use Git to share SAS programs." The process is also explained in Appendix A in the metalog documentation. The following SAS program downloads the metalog functions to the WORK libref. You could use a permanent libref, if you prefer:

/* download metalog package from GitHub */
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-iml-packages;  /* clone repo to WORK, or use permanent libref */
 
/* clone repository; if repository exists, skip download */
data _null_;
if fileexist("&repoPath.") then 
   put 'Repository already exists; skipping the clone operation'; 
else do;
   put "Cloning repository 'sas-iml-packages'";
   rc = gitfn_clone("https://github.com/sassoftware/sas-iml-packages/", "&repoPath." ); 
end;
run;
 
/* Use %INCLUDE to read source code and STORE functions to current storage library */
proc iml;
%include "&repoPath./Metalog/ML_proc.sas";   /* each file ends with a STORE statement */
%include "&repoPath./Metalog/ML_define.sas";
quit;

Each file ends with a STORE statement, so running the above program results in storing the metalog functions in the current storage library. The first file (ML_PROC.sas) should be included only when you are running the program in PROC IML. It contains the plotting routines. The second file (ML_define.sas) should be included in both PROC IML and the iml action.

Fit an unbounded metalog model to data

The Sashelp.heart data set contains data on patients in a heart study. The data set contains a variable (Systolic) for the systolic blood pressure of 5,209 patients. The following call to PROC SGPLOT creates a histogram of the data and overlays a kernel density estimate.

title 'Systolic Blood Pressure';
proc sgplot data=sashelp.heart;
   histogram Systolic;
   density Systolic / type=kernel;
   keylegend /location=inside;
run;

The kernel density estimate (KDE) is a nonparametric model. The histogram and the KDE indicate that the data are not normally distributed, and it is not clear whether the data can be modeled by using one of the common "named" probability distributions such as gamma or lognormal. In situations such as this, the metalog distribution can be an effective tool for modeling and simulation. The following SAS IML statements load the metalog library of functions, then fit a five-term unbounded metalog distribution to the data. After you fit the model, it is a good idea to run the ML_Summary subroutine, which displays a summary of the model and displays the parameter estimates.

proc iml;
load module=_all_;     /* load the metalog library */
use sashelp.heart;
   read all var "Systolic";
close;
 
MLObj = ML_CreateFromData(Systolic, 5);  /* 5-term unbounded model */
run ML_Summary(MLObj);

The ML_Summary subroutine displays two tables. The first table tells you that the model is a 5-term unbounded model, and it is feasible. Not every model is feasible, which is why it is important to examine the summary of the model. (Feasibility is discussed in a subsequent section.) The second table provides the parameter estimates for the model.

The MLObj variable is an object that encapsulates relevant information about the metalog model. After you create a metalog object, you can pass it to other functions to query, compute, or plot the model. For example, in PROC IML, you can use the ML_PlotECDF subroutine to show the agreement between the cumulative distribution of the model and the empirical cumulative distribution of the data, as follows:

title '5-Term Metalog Model for Systolic Blood Pressure';
run ML_PlotECDF(MLObj);

The model is fit to 5,209 observations. For this blog post, I drew the CDF of the model in red so that it would stand out. The model seems to fit the data well. If you prefer to view the density function, you can run the statement
    run ML_PlotPDF(MLObj);
Again, note that the MLObj variable is passed as the first argument to the function.

One way to assess the goodness of fit is to compare the sample quantiles to the quantiles predicted by the model. You can use the ML_Quantile function to obtain the predicted quantiles, as follows:

/* estimates of quantiles */
Prob = {0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99};
call qntl(SampleEst, Systolic, Prob);   /* sample quantiles */
ModelEst = ML_Quantile(MLObj, Prob);    /* model quantiles */
print Prob SampleEst ModelEst;

The model quantiles agree well with the sample quantiles. Note that the median estimate from the model (the 0.5 quantile) is equal to the a1 parameter estimate. This is always the case.

An advantage of the metalog model is that it has an explicit quantile function. As a consequence, it is easy to simulate from the model by using the inverse CDF method. You can use the ML_Rand function to simulate from the model, as follows:

/* simulate 500 new blood pressure values from the model */
call randseed(1234);
title 'Simulated Systolic Values';
BP = ML_Rand(MLObj, 500);
call histogram(BP);

Notice that the shape of this histogram is similar to the histogram of the original data. The simulated data seem to have similar properties as the original data.

The metalog documentation provides the syntax and documentation for every public method in the library.

Infeasible metalog functions

As discussed in a previous article, not every metalog model is a valid distribution. An infeasible model is one for which the quantile function is not monotone increasing. This can happen if you have a small data set that has extreme outliers or gaps, and you attempt to fit a metalog model that has many parameters. The model will overfit the data, which results in an infeasible model.

I was unable to create an infeasible metalog model for the blood-pressure data, which has thousands of observations and no extreme observations. However, if I use only the first 20 observations, I can demonstrate an infeasible model by fitting a six-term metalog model. The following SAS IML statements create a new data vector that contains only 20 observations. A six-term metalog model is fit to those data.

/* What if we use a much smaller data set? Set N = 20. */
S = Systolic[1:20];               /* N = 20 */
MLObj = ML_CreateFromData(S, 6);  /* 6-term model */
run ML_Summary(MLObj);

As shown by the output of the ML_Summary call, the six-term metalog model is not feasible. The following graph shows the quantile function of the model overlaid on the empirical quantiles. Notice that the quantile function of the model is not monotone increasing, which explains why the model is infeasible. If you encounter this situation in practice, Keelin (2016) suggests reducing the number of terms in the model.

Summary

This article describes how to use the metalog distribution in SAS. The metalog distribution is a flexible family that can model a wide range of distributional shapes. You can download the Metalog package from the SAS Software GitHub site. The package contains documentation and a collection of SAS IML modules. You can call the modules in a SAS IML program to fit a metalog model. After fitting the model, you can compute properties of the model and simulate data from the model. More information and examples are available in the documentation of the Metalog package.

The post Use the metalog distribution in SAS appeared first on The DO Loop.

3月 062023
 

Undergraduate textbooks on probability and statistics typically prove theorems that show how the variance of a sum of random variables is related to the variance of the original variables and the covariance between them. For example, the Wikipedia article on Variance contains an equation for the sum of two random variables, X and Y:
\( \operatorname {Var} (X+Y)=\operatorname {Var} (X)+\operatorname {Var} (Y)+2\,\operatorname {Cov} (X,Y) \)

A SAS programmer wondered whether equations like this are also true for vectors of data. In other words, if X and Y are columns in a data set, do the sample variance and covariance statistics satisfy the same equation? How can you use SAS to verify (or disprove) the equation for a sample of data? This article shows how to use SAS to verify that the equation (and another similar equation) is valid for vectors of data.

It is possible to verify the equation by using PROC CORR and the DATA step, but it is much simpler to use SAS IML software because the IML language enables you to directly access cells of a variance-covariance matrix.

Create the sum of columns

Let's start by choosing some data to analyze. The following DATA step renames some variables in the Sashelp.Iris data set to X1, X2, and X3. The program also creates the variables Y12 = X1 + X2, Y13 = X1 + X3, and Y23 = X2 + X3. The program then reads the data into SAS IML matrices and computes the variance-covariance matrix for the original variables (X) and their pairwise sums (Y):

/* verify variance of a sum (or linear combination) of variables */
data Want;
set sashelp.iris(where=(Species="Versicolor"));
X1 = PetalLength;
X2 = PetalWidth;
X3 = SepalLength;
Y12 = X1 + X2;
Y13 = X1 + X3;
Y23 = X2 + X3;
keep X1-X3 Y12 Y13 Y23;
run;
 
proc iml;
use Want;
   read all var {'X1' 'X2' 'X3'} into X[c=XNames];
   read all var {'Y12' 'Y13' 'Y23'} into Y;
close;
YNames = {'X1+X2' 'X1+X3' 'X2+X3'};
 
SX = cov(X);
SY = cov(Y);
print SX[c=XNames r=XNames F=best5. L='Cov(X)'],
      SY[c=YNames r=YNames F=best5. L='Cov(Sums)'];

The variance of a sum

Let's use this information to verify the identity
\( \operatorname {Var} (X1+X2)=\operatorname {Var} (X1)+\operatorname {Var} (X2)+2\,\operatorname {Cov} (X1,X2) \)

From the displayed (rounded) values of the covariance matrices, you can mentally calculate that the equation could be true. The variances of the original variables are along the diagonal of the first matrix, and the covariances are the off-diagonal elements. The variance of the sum is the [1,1] cell of the second matrix. A little mental arithmetic indicates that 40.6 ≈ 22 + 4 + 2*7.

The following SAS IML statements extract the relevant values from cells in the variance-covariance matrices. The program then subtracts the right side of the equation from the left side. If the difference is 0, then the equation is verified:

/* Confirm that the sample variance of (x1+x2) satisfies
      Var(x1 +x2) = Var(x1) + Var(x2) + 2*Cov(x1, x2)
   See https://en.wikipedia.org/wiki/Variance#Propagation
*/
VarY12= SY[1,1];
VarX1 = SX[1,1];
VarX2 = SX[2,2];
Cov12 = SX[1,2];
Eqn1 = VarY12 - (VarX1 + VarX2 + 2*Cov12);
print Eqn1;

Success! The left and right sides of the equation are equal to numerical precision. This validates the equation for the data we are using. Notice that this same technique could be used to analyze the variance of a general linear combination of other variables.

The covariance of sums

Let's verify one more equation. The Wikipedia article about Covariance states the following formula for four random variables, X, Y, Z, and W:
\( \operatorname {Cov} (X+Y,Z+W)=\operatorname {Cov} (X,Z)+\operatorname {Cov} (X,W)+\operatorname {Cov} (Y,Z)+\operatorname {Cov} (Y,W) \)

Since our example data only has three variables, let's simplify the equation by setting W=X. Then (changing variable names) the equation we want to verify is \( \operatorname {Cov} (X1+X2,X1+X3)=\operatorname {Cov} (X1,X3)+\operatorname {Cov} (X1,X1)+\operatorname {Cov} (X2,X3)+\operatorname {Cov} (X1,X2) \)

The following SAS IML statements extract the relevant cells from the covariance matrices. The program subtracts the right side of the equation from the left side and prints the difference.

/* Confirm that the sample covariance of (x1+x2) and (x1 + x3) satisfies
     Cov(x1+x2, x1+x3) = Cov(x1, x1) + Cov(x1, x2) + Cov(x1, x3) + Cov(x2, x3)
   See https://en.wikipedia.org/wiki/Covariance#Covariance_of_linear_combinations
*/
CovY12_Y13 = SY[1,2];
Cov11 = SX[1,1];      /* = Var(X1) */
Cov12 = SX[1,2];
Cov13 = SX[1,3];
Cov23 = SX[2,3];
Eqn2 = CovY12_Y13 - (Cov11 + Cov12 + Cov13 + Cov23);
print Eqn2;

Once again, the difference is essentially zero, which validates the equation.

Summary

A SAS programmer asked whether she could use SAS to verify certain equations. She wanted to verify that certain formulas for the variance and covariance of random variables are also true for sample statistics for empirical data. This article shows how to use SAS IML software to test certain equations that relate the variance and covariance of sums to the variances and covariances of a set of data variables.

The post The variance of the sums of variables appeared first on The DO Loop.

2月 222023
 

The metalog family of distributions (Keelin, Decision Analysis, 2016) is a flexible family that can model a wide range of continuous univariate data distributions when the data-generating mechanism is unknown. This article provides an overview of the metalog distributions. A subsequent article shows how to download and use a library of SAS IML functions that enable you to use the metalog distribution to model data in SAS.

The metalog distribution

There are dozens of continuous probability distributions that are commonly used to model univariate data. Examples include the normal, lognormal, Weibull, gamma, and beta distributions. (These are sometimes called "named" distributions.) Given a set of data, which distribution should you use? Ideally, you would use domain-specific knowledge to decide which distribution models the data-generating process. In practice, the process that generates the data is often unknown. Consequently, some people fit many different distributions to the data and use a goodness-of-fit statistic to choose the distribution that best fits the data.

Unfortunately, this "scattershot method" is neither reliable nor principled. Furthermore, sometimes the usual well-known distributions are not sufficiently flexible to model a set of data. This is shown by the graph to the right, which shows three different distributions fit to the same data. It is not clear that one model is superior to the others.

So that modelers do not have to fit many "named" distributions and choose the best, researchers have developed flexible systems of distributions that can model a wide range of shapes. Popular systems include the following:

  • The Pearson system: Karl Pearson used the normal distribution and seven other families to construct a system that can match any sample skewness and kurtosis. In SAS, you can use PROC SIMSYSTEM to fit a Pearson distribution to data.
  • The Johnson system: Norman Johnson used the normal distribution, the lognormal distribution, an unbounded distribution, and a bounded distribution to match any sample skewness and kurtosis. In SAS, you can use PROC UNIVARIATE or PROC SIMSYSTEM to fit a Johnson distribution to data.
  • The Fleishman system: The book Simulating Data with SAS (Wicklin, 2013) includes SAS IML modules that can fit the Fleishman system (Fleishman, 1978) to data.

The metalog system is a more recent family of distributions that can model many different shapes for bounded, semibounded, and unbounded continuous distributions. To use the metalog system, do the following:

  1. Choose the type of the distribution from the four available types: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. This choice is often based on domain-specific knowledge of the data-generating process.
  2. Choose the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k ≥ 7) can fit data distributions for which the density appears to have multiple peaks.
  3. Fit the quantile function of the metalog distribution to the data. This estimates the k parameters in the model.

For example, the following graph shows the PDF of a 5-term unbounded metalog model overlaid on the same histogram that was previously shown. This model seems to fit the data better than the "named" distributions.

Advantages and drawbacks of using the metalog distribution

The metalog distribution has the following advantages:

  • Because the metalog distribution can have many parameters, it is very flexible and can fit a wide range of shapes, including multimodal distributions.
  • Keelin (2016) uses ordinary least squares (OLS) to fit the data to the quantile function of the metalog distribution. Thus, the fitting process is simple to implement.
  • Because the quantile function has an explicit formula, it is simple to simulate data from the model by using the inverse CDF method of simulating data.

The metalog function has one primary disadvantage: the fitted metalog coefficients might not correspond to a valid distribution. An invalid model is called infeasible. Recall that a quantile function is always monotonically increasing as a function of the probability on the open interval (0,1). However, for an infeasible model, the quantile function is not monotonic. Equivalently, the probability density function (PDF) of the model contains nonpositive values.

Infeasibility is somewhat complicated, but one way to think about infeasibility is to think about the fact that the moments for a metalog model are related to the coefficients. If the coefficients lead to impossible moments (for example, a negative variance), then the model is infeasible.

When fitting a metalog model to data, you should always verify that the resulting model is feasible. If it is not, reduce the number of terms in the model. Some data cannot be fit when k > 2. For example, data that have one or more extreme observations often lead to an infeasible set of parameter estimates.

The following panel of graph shows six models for the same data. The models show increasing complexity as the number of terms increases. If you include six terms, the model becomes bimodal. However, the 9-term model is infeasible because the PDF near x=15 has a negative value, which is inside the circle in the last cell of the panel.

Definitions of the metalog distribution

Most common continuous probability distributions are parameterized on the data scale. That is, if X is a continuous random variable, then the CDF is a function that is defined on the set of outcomes for X. You can write the CDF as some increasing function p = F(x).

From the definition of a continuous CDF, you can obtain the quantile function and the PDF. The quantile function is defined as the inverse CDF: \(x = F^{-1}(p)\) for p in (0,1). The PDF is defined as the derivative with respect to x of the CDF function.

Keelin (2016) specifies the definitions of the metalog quantile function and PDF function. The metalog distribution reverses the usual convention by parameterizing the quantile function as x = Q(p). From the quantile function, you can obtain the CDF, which is the inverse of the quantile function. From the CDF, you can obtain the PDF. From the quantile function, you can generate random variates. Thus, knowing the quantile function is the key to working with the metalog distribution.

The unbounded metalog distribution

When you fit an unbounded metalog model, you estimate the parameters for which the model's predicted values (p, Q(p)) are close (in a least squares sense) to the data pairs (p, x). Here, p is a vector of cumulative probabilities, x is a vector of observed data values, and Q(p) is the quantile function of a metalog distribution.

The predicted values are obtained by using a linear regression model. The unbounded model has a design matrix that uses the following basis functions:

  • a vector of 1s
  • a vector that contains the values logit(p) = log(p/(1-p))
  • a vector that contains the values c(p) = p - 1/2

Specifically, the columns of the design matrix involve powers of the vector c(p) and interactions with the term logit(p). If M = [M1 | M2 | ... | Mk] is the design matrix, then the columns are as follows:

  • M1 is a column of 1s.
  • M2 is the column logit(p).
  • M3 is the interaction term c(p) logit(p).
  • M4 is the linear term c(p).
  • M5 is the quadratic term c(p)2.
  • M6 is the interaction term cc(p)2 logit(p).
  • Additional columns are of the form c(p)j or c(p)j logit(p).

The following image shows the basis functions M2-M7 for the metalog regression.

Given the design matrix, M, the least squares parameter estimates are the elements of the vector, b, which solves the matrix equations \(M^\prime M b = M^\prime x\). An estimate of the quantile function is given by Q(p) = M*b. By differentiating this expression with respect to p, you can obtain a formula for the PDF for the model (Keelin, 2016, Eq. 6, p. 254).

From these equations, you can see that not every set of coefficients corresponds to an increasing quantile function. For example, when k=2, Q(p) = b1 + b2 logit(p). To be an increasing function of p, it is necessary that b2 be strictly positive.

The semibounded and bounded metalog distributions

The semibounded and bounded metalog distributions are obtained by transforming the unbounded metalog model.

The semibounded metalog distribution on the interval [L, ∞) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(x-L). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution on [L, ∞) is then defined as Q_L(p) = L + exp(Q(p)) for p in (0,1), and Q_L(0) = L.

Similarly, the quantile function for the semibounded distribution on (-∞, U) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(U-x). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution is then defined as Q_U(p) = U - exp(-Q(p)) for p in (0,1), and Q_U(1) = U.

Finally, the quantile function for the bounded metalog distribution on an interval [L, U] is obtained by fitting an unbounded k-term distribution to the transformed data z = log((x-L)/(U-x)). This gives the quantile function, Q(p), for z. The quantile function for the bounded distribution is then defined by Q_B(p) = (L + U exp(Q(p))) / (1 + exp(Q(p))) for p in (0,1), and the boundary conditions Q_B(0) = L and Q_B(1) = U.

As mentioned previously, for every distribution, you can obtain the PDF function from the quantile function.

Random variates from the metalog distribution

Recall that you can generate a random sample from any continuous probability distribution by using the inverse CDF method. You generate a random variate, u, from the uniform distribution U(0,1), then compute \(x = F^{-1}(u)\), where F is the CDF function and \(F^{-1}\) is the inverse CDF, which is the quantile function. You can prove that x is a random variate that is distributed according to F.

For most distributions, the inverse CDF is not known explicitly, which means that you must solve for x as the root of the nonlinear equation F(x) - u = 0. However, the metalog distribution supports direct simulation because the quantile function is defined explicitly. For the unbounded metalog distribution, you can generate random variates by generating u ~ U(0,1) and computing x = Q(u), where Q is the quantile function. The semibounded and bounded cases are treated similarly.

Summary

The metalog distribution is a flexible distribution and can fit a wide range of shapes of data distributions, including multimodal distributions. You can fit a model to data by using ordinary least squares regression. Unfortunately, not every model is feasible, which means that the model is not a valid distribution. For an infeasible model, the CDF is not monotone and the PDF is not strictly positive. Nevertheless, the metalog distribution is a powerful option for fitting data when the underlying data-generating method is unknown.

In a future article, I will show how to use a library of SAS IML functions to fit a metalog distribution to data. All the images in this article were generated in SAS by using the library.

The post What is the metalog distribution? appeared first on The DO Loop.

2月 222023
 

The metalog family of distributions (Keelin, Decision Analysis, 2016) is a flexible family that can model a wide range of continuous univariate data distributions when the data-generating mechanism is unknown. This article provides an overview of the metalog distributions. A subsequent article shows how to download and use a library of SAS IML functions that enable you to use the metalog distribution to model data in SAS.

The metalog distribution

There are dozens of continuous probability distributions that are commonly used to model univariate data. Examples include the normal, lognormal, Weibull, gamma, and beta distributions. (These are sometimes called "named" distributions.) Given a set of data, which distribution should you use? Ideally, you would use domain-specific knowledge to decide which distribution models the data-generating process. In practice, the process that generates the data is often unknown. Consequently, some people fit many different distributions to the data and use a goodness-of-fit statistic to choose the distribution that best fits the data.

Unfortunately, this "scattershot method" is neither reliable nor principled. Furthermore, sometimes the usual well-known distributions are not sufficiently flexible to model a set of data. This is shown by the graph to the right, which shows three different distributions fit to the same data. It is not clear that one model is superior to the others.

So that modelers do not have to fit many "named" distributions and choose the best, researchers have developed flexible systems of distributions that can model a wide range of shapes. Popular systems include the following:

  • The Pearson system: Karl Pearson used the normal distribution and seven other families to construct a system that can match any sample skewness and kurtosis. In SAS, you can use PROC SIMSYSTEM to fit a Pearson distribution to data.
  • The Johnson system: Norman Johnson used the normal distribution, the lognormal distribution, an unbounded distribution, and a bounded distribution to match any sample skewness and kurtosis. In SAS, you can use PROC UNIVARIATE or PROC SIMSYSTEM to fit a Johnson distribution to data.
  • The Fleishman system: The book Simulating Data with SAS (Wicklin, 2013) includes SAS IML modules that can fit the Fleishman system (Fleishman, 1978) to data.

The metalog system is a more recent family of distributions that can model many different shapes for bounded, semibounded, and unbounded continuous distributions. To use the metalog system, do the following:

  1. Choose the type of the distribution from the four available types: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. This choice is often based on domain-specific knowledge of the data-generating process.
  2. Choose the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k ≥ 7) can fit data distributions for which the density appears to have multiple peaks.
  3. Fit the quantile function of the metalog distribution to the data. This estimates the k parameters in the model.

For example, the following graph shows the PDF of a 5-term unbounded metalog model overlaid on the same histogram that was previously shown. This model seems to fit the data better than the "named" distributions.

Advantages and drawbacks of using the metalog distribution

The metalog distribution has the following advantages:

  • Because the metalog distribution can have many parameters, it is very flexible and can fit a wide range of shapes, including multimodal distributions.
  • Keelin (2016) uses ordinary least squares (OLS) to fit the data to the quantile function of the metalog distribution. Thus, the fitting process is simple to implement.
  • Because the quantile function has an explicit formula, it is simple to simulate data from the model by using the inverse CDF method of simulating data.

The metalog function has one primary disadvantage: the fitted metalog coefficients might not correspond to a valid distribution. An invalid model is called infeasible. Recall that a quantile function is always monotonically increasing as a function of the probability on the open interval (0,1). However, for an infeasible model, the quantile function is not monotonic. Equivalently, the probability density function (PDF) of the model contains nonpositive values.

Infeasibility is somewhat complicated, but one way to think about infeasibility is to think about the fact that the moments for a metalog model are related to the coefficients. If the coefficients lead to impossible moments (for example, a negative variance), then the model is infeasible.

When fitting a metalog model to data, you should always verify that the resulting model is feasible. If it is not, reduce the number of terms in the model. Some data cannot be fit when k > 2. For example, data that have one or more extreme observations often lead to an infeasible set of parameter estimates.

The following panel of graph shows six models for the same data. The models show increasing complexity as the number of terms increases. If you include six terms, the model becomes bimodal. However, the 9-term model is infeasible because the PDF near x=15 has a negative value, which is inside the circle in the last cell of the panel.

Definitions of the metalog distribution

Most common continuous probability distributions are parameterized on the data scale. That is, if X is a continuous random variable, then the CDF is a function that is defined on the set of outcomes for X. You can write the CDF as some increasing function p = F(x).

From the definition of a continuous CDF, you can obtain the quantile function and the PDF. The quantile function is defined as the inverse CDF: \(x = F^{-1}(p)\) for p in (0,1). The PDF is defined as the derivative with respect to x of the CDF function.

Keelin (2016) specifies the definitions of the metalog quantile function and PDF function. The metalog distribution reverses the usual convention by parameterizing the quantile function as x = Q(p). From the quantile function, you can obtain the CDF, which is the inverse of the quantile function. From the CDF, you can obtain the PDF. From the quantile function, you can generate random variates. Thus, knowing the quantile function is the key to working with the metalog distribution.

The unbounded metalog distribution

When you fit an unbounded metalog model, you estimate the parameters for which the model's predicted values (p, Q(p)) are close (in a least squares sense) to the data pairs (p, x). Here, p is a vector of cumulative probabilities, x is a vector of observed data values, and Q(p) is the quantile function of a metalog distribution.

The predicted values are obtained by using a linear regression model. The unbounded model has a design matrix that uses the following basis functions:

  • a vector of 1s
  • a vector that contains the values logit(p) = log(p/(1-p))
  • a vector that contains the values c(p) = p - 1/2

Specifically, the columns of the design matrix involve powers of the vector c(p) and interactions with the term logit(p). If M = [M1 | M2 | ... | Mk] is the design matrix, then the columns are as follows:

  • M1 is a column of 1s.
  • M2 is the column logit(p).
  • M3 is the interaction term c(p) logit(p).
  • M4 is the linear term c(p).
  • M5 is the quadratic term c(p)2.
  • M6 is the interaction term cc(p)2 logit(p).
  • Additional columns are of the form c(p)j or c(p)j logit(p).

The following image shows the basis functions M2-M7 for the metalog regression.

Given the design matrix, M, the least squares parameter estimates are the elements of the vector, b, which solves the matrix equations \(M^\prime M b = M^\prime x\). An estimate of the quantile function is given by Q(p) = M*b. By differentiating this expression with respect to p, you can obtain a formula for the PDF for the model (Keelin, 2016, Eq. 6, p. 254).

From these equations, you can see that not every set of coefficients corresponds to an increasing quantile function. For example, when k=2, Q(p) = b1 + b2 logit(p). To be an increasing function of p, it is necessary that b2 be strictly positive.

The semibounded and bounded metalog distributions

The semibounded and bounded metalog distributions are obtained by transforming the unbounded metalog model.

The semibounded metalog distribution on the interval [L, ∞) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(x-L). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution on [L, ∞) is then defined as Q_L(p) = L + exp(Q(p)) for p in (0,1), and Q_L(0) = L.

Similarly, the quantile function for the semibounded distribution on (-∞, U) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(U-x). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution is then defined as Q_U(p) = U - exp(-Q(p)) for p in (0,1), and Q_U(1) = U.

Finally, the quantile function for the bounded metalog distribution on an interval [L, U] is obtained by fitting an unbounded k-term distribution to the transformed data z = log((x-L)/(U-x)). This gives the quantile function, Q(p), for z. The quantile function for the bounded distribution is then defined by Q_B(p) = (L + U exp(Q(p))) / (1 + exp(Q(p))) for p in (0,1), and the boundary conditions Q_B(0) = L and Q_B(1) = U.

As mentioned previously, for every distribution, you can obtain the PDF function from the quantile function.

Random variates from the metalog distribution

Recall that you can generate a random sample from any continuous probability distribution by using the inverse CDF method. You generate a random variate, u, from the uniform distribution U(0,1), then compute \(x = F^{-1}(u)\), where F is the CDF function and \(F^{-1}\) is the inverse CDF, which is the quantile function. You can prove that x is a random variate that is distributed according to F.

For most distributions, the inverse CDF is not known explicitly, which means that you must solve for x as the root of the nonlinear equation F(x) - u = 0. However, the metalog distribution supports direct simulation because the quantile function is defined explicitly. For the unbounded metalog distribution, you can generate random variates by generating u ~ U(0,1) and computing x = Q(u), where Q is the quantile function. The semibounded and bounded cases are treated similarly.

Summary

The metalog distribution is a flexible distribution and can fit a wide range of shapes of data distributions, including multimodal distributions. You can fit a model to data by using ordinary least squares regression. Unfortunately, not every model is feasible, which means that the model is not a valid distribution. For an infeasible model, the CDF is not monotone and the PDF is not strictly positive. Nevertheless, the metalog distribution is a powerful option for fitting data when the underlying data-generating method is unknown.

In a future article, I will show how to use a library of SAS IML functions to fit a metalog distribution to data. All the images in this article were generated in SAS by using the library.

The post What is the metalog distribution? appeared first on The DO Loop.

11月 282022
 

A SAS programmer was trying to simulate poker hands. He was having difficulty because the sampling scheme for simulating card games requires that you sample without replacement for each hand. In statistics, this is called "simple random sampling."

If done properly, it is straightforward to simulate poker hands in SAS. This article shows how to generate a standard deck of 52 cards. It then shows how to generate many card hands for a game that involves several players. This can be done in two ways: The SAMPLE function in the SAS/IML language and PROC SURVEYSELECT in SAS/STAT software.

Encode a deck of cards

The first step is to generate a standard deck of 52 playing cards. There are many ways to do this, but the following DATA step iterates over 13 values (A, 2, 3, ..., Q, K) and over four suits (clubs, diamonds, hearts, and spades). The step creates a data set that has 52 cards. To make the output more compact, I concatenate the value of the card with the first letter of the suit to obtain a two-character string for each card. For example, "5C" is the "five of clubs" and "AS" is the "ace of spades." So that each card can be represented by using two characters, I use "T" instead of "10." For example, "TH" is the "ten of hearts." The

data Cards;
length Value $2 Suit $8 Card $3;
array Values[13] $ _temporary_ ('A','2','3','4','5','6','7','8','9','T','J','Q','K');
array Suits[4] $ _temporary_ ('Clubs', 'Diamonds', 'Hearts', 'Spades');
do v = 1 to 13;
   do s = 1 to 4;
      Value = Values[v]; Suit = Suits[s]; Card = cats(Value,substr(Suit,1,1));
      output;
   end;
end;
run;
 
proc print data=Cards(obs=8); 
   var v s Value Suit Card;
run;
TABLE

The cards are generated in order. The output shows the first eight cards in the deck, which are the aces and twos. For some types of poker hands (such as determining pairs and straights), the V variable can be useful, and for other analyses (such as determining flushes), the S variable can be useful.

Deal cards by using SAS/IML

When humans deal cards, we use a two-step method: shuffle the deck and then deal the top cards to players in a round-robin fashion. A computer simulation can simluate dealing cards by using a one-step method: deal random cards (without replacement) to every player. SAS provides built-in sampling methods to make this process easy to program.

In SAS/IML, the default sampling method for the SAMPLE function is sampling without replacement. If there are P players and each player is to receive C cards, then a deal consists of P*C cards, drawn without replacement from the deck. The deck does not need to be shuffled because the SAMPLE function outputs the cards in random order. Because the cards are in random order, it does not matter how you assign the cards to the players. The following SAS/IML program uses P=3 players and generates C=5 cards for each player. The program simulates one deal by assigning the first C cards to the first player, the second C cards to the second player, and so on:

%let nCards   = 5;    /* cards per player */
%let nPlayers = 3;    /* number of players */
%let nDeals   = 10;   /* number of deals to simulate */
 
/* simulate dealing many card hands in SAS/IML */
proc iml;
call randseed(1234);                       /* set random number seed */
use Cards;  read all var "Card";  close;   /* read the deck of cards into a vector */
 
nCardsPerDeal = &nCards * &nPlayers;       /* each row is a complete deal */
D = sample(Card, nCardsPerDeal, "WOR" );   /* sample without replacement from the deck */
Deal = shape(D, &nPlayers, &nCards);       /* reshape into nPlayers x nCards matrix */
 
/* print one deal */
cardNames   = "C1":"C&nCards";
playerNames = "P1":"P&nPlayers";
print Deal[c=cardNames r=playerNames];
TABLE

For this deal, the first player has two tens, two fives, and a king. The second player has a four, a five, a nine, an ace, and a two. You could sort the cards in each row, but I have left each player's cards unsorted so that you can see the random nature of the assignment.

Simulate multiple deals

The previous program simulates one deal. What if you want to simulate multiple deals? One way is to loop over the number of deals, but there is a more efficient method. The second argument to the SAMPLE function can be a two-element vector. The first element specifies the number of cards for each deal, and the second element specifies the number of deals. Thus, for example, if you want to simulate 10 deals, you can use the following statement to generate a matrix that has 10 rows. Each row is a sample (without replacement) from a 52-card deck:

/* use one call to simulate many deals */
D = sample(Card, nCardsPerDeal//&nDeals, "WOR" );   /* simulate many deals */
 
/* Ex: The 8th row contains the 8th deal. Display it. */
Deal8 = shape(D[8,], &nPlayers, &nCards);
print Deal8[c=cardNames r=playerNames];
TABLE

To demonstrate that each row is a deal, I have extracted, reshaped, and displayed the eighth row. The eighth deal is unusual because each player is dealt a "good" hand. The first player has a pair of eights, the second player has two pairs (threes and sixes), and the third player has a pair of sevens. If you want to print (or analyze) the 10 deals, you can loop over the rows, as follows:

do i = 1 to &nDeals;
   Deal = shape(D[i,], &nPlayers, &nCards);
   /* analyze or print the i_th deal */
   print i, Deal[c=cardNames r=playerNames];
end;

Deal random hands by using PROC SURVEYSELECT

You can also simulate card deals by using the SURVEYSELECT procedure. You should use the following options:

  • Use the METHOD=SRS option to specify sampling without replacement for each deal. To obtain the cards in random order, use the OUTRANDOM option.
  • Use the N= option to specify the number of cards in each deal.
  • Use the REPS= option to specify the total number of deals.

An example is shown in the following call to PROC SURVEYSELECT:

/* similar approach for PROC SURVEYSELECT */
proc surveyselect data=Cards seed=123 NOPRINT
     method=SRS                   /* sample without replacement */
     outrandom                    /* randomize the order of the cards */
     N=%eval(&nCards * &nPlayers) /* num cards per deal */
     reps=&nDeals                 /* total number of deals */
     out=AllCards;
run;

The output data set (AllCards) contains a variable named Replicate, which identifies the cards in each deal. Because the cards for each deal are in a random order, you can arbitrarily assign the cards to players. You can use a round-robin method or assign the first C cards to the first player, the next C cards to the second player, and so on. This is done by using the following DATA step:

/* assign the cards to players */
data AllDeals;
set AllCards;
by Replicate;
if first.Replicate then do;
   Cnt = 0;
end;
Player = 1 + floor(Cnt/&nCards);
Cnt + 1;
drop Cnt;
run;
OUTPUT

The simulated data are in long form. For some analyses, it might be more convenient to convert it into wide form. You can do this by using the DATA step or by using PROC TRANSPOSE, as below:

/* OPTIONAL: Convert from long form to wide form */
proc transpose data=AllDeals 
               out=Deals(rename=(col1-col&nCards=C1-C&nCards) drop=_NAME_);
   var Card;
   by Replicate Player;
run;
 
proc print data=Deals;
   where Replicate=8;
   var Replicate Player C:;
run;
OUTPUT

To demonstrate the results, I display the result of the eighth deal (Replicate=8). For this deal, the first player has several face cards but no pairs, the second player has a pair of threes, and the third player has a pair of fours.

Summary

SAS provides many tools for simulation studies. For simulations of card games, this article shows how to create a data set that represents a standard deck of 52 playing cards. From the cards, you can use the SAMPLE function in SAS/IML software or the SURVEYSELECT procedure to simulate many deals, where each deal is a sample without replacement from the deck. For each deal, you can assign the cards to the players in a systematic manner.

The post Simulate poker hands in SAS appeared first on The DO Loop.

11月 162022
 

A profile plot is a way to display multivariate values for many subjects. The optimal linear profile plot was introduced by John Hartigan in his book Clustering Algorithms (1975). In Michael Friendly's book (SAS System for Statistical Graphics, 1991), Friendly shows how to construct an optimal linear profile by using the SAS/IML language. He displays the plot by using traditional SAS/GRAPH methods: the old GPLOT procedure and an annotation data set. If you do not have a copy of his book, you can download Friendly's %LINPRO macro from his GitHub site.

SAS graphics have changed a lot since 1991. It is now much easier to create the optimal linear profile plot by using modern graphics in SAS. This article reproduces Friendly's SAS/IML analysis to create the data for the plot but uses PROC SGPLOT to create the final graph. This approach does not require an annotate data set.

The optimal linear profile plot

Friendly (1991) illustrates the optimal linear profile plot by using the incidence rate for seven types of crime in 16 US cities in 1970. I have previously shown how to visualize the crime data by using line plots and heat maps. A profile plot for the raw data is shown to the right.

The optimal linear profile plot modifies the profile plot in two ways. It computes an optimal ordering of the categories (crimes) and transforms the raw responses (rates) so that the transformed data are as close to linear as possible. Let's focus on the meaning of the words "linear" and "optimal":

  • Linear: The goal of the linear profile plot is to show linear trends rather than exact data values. The method models the data to determine which cities tend to have relatively low rates of some crimes and relatively high rates of other crimes. Thus, the linear profile plot shows a linear model for the crime trends.
  • Optimal: In creating the optimal linear profiles, you can choose any positions for the categories on the X axis. You can also (affinely) scale the data for each variable in many ways. The algorithm chooses positions and scales that make the transformed data as linear as possible.

The optimal linear profile is constructed by using the first two principal coordinates (PC) for the data (in wide form). The optimal positions are found as the ratio of the elements of the first two eigenvectors. The slopes and intercepts for the linear profiles are obtained by projecting the centered and scaled data onto the eigenspace, so they are related to the PC scores. See Friendly (1991) or the SAS/IML code for details.

The following DATA step creates the crime data in wide form. The SAS/IML program is taken from the %LINPRO macro and is similar to the code in Friendly (1991). It contains a lot of print statements, which you can comment out if you do not want to see the intermediate calculations. I made a few minor changes and three major changes:

  • Friendly's code computed the covariance and standard deviations by using N (the sample size) as the denominator. My code uses the N-1 as the denominator so that the computations of the PCs agree with PROC PRINCOMP.
  • Friendly's code writes a data set (LINFIT), which is used to create an annotate data set. This data set is no longer needed, but I have retained the statements that create it. I have added code that creates two macro variables (POSNAMES and POSVALUES) that you can use to specify the positions of the crime categories.
  • The program uses PROC SGPLOT to create the optimal linear profile plot. The labels for the lines are displayed automatically by using the CURVELABEL option on the SERIES statement. The VALUES= and VALUESDISPPLAY= options are used to position the categories at their optimal positions.
/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city $16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* Program from Friendly (1991, p. 424-427) and slightly modified 
   by Rick Wicklin (see comments). You can download the LINPRO MACRO
   from https://github.com/friendly/SAS-macros/blob/master/linpro.sas
 
   The program assumes the data are in wide form:
   There are k response variables. Each row is the response variables 
   for one subject. The name of the subject is contained in an ID variable.
 
   The output for plotting is stored in the data set LINPLOT.
   Information about the linear regressions is stored in the data set LINFIT.
*/
 
%let data = crime;    /* name of data set */
%let var = _NUM_;     /* list of variable names or the _NUM_ keyword */
%let id = City;       /* subject names */
proc iml;
   *--- Get input matrix and dimensions;
   use &data;
   if upcase("&var") = "_NUM_" then
      read all var _NUM_ into X[colname=vars];
   else
      read all var {&var} into X[colname=vars];
   n = nrow(X);
   p = ncol(X);
   m = mean(X);
   D = X - m;         *- centered cols = deviations from means;
 
   if "&id" = " " then
      ID = T(1:nrow(X));
   else
      read all var{&id} into ID;
   if type(ID)='N' then ID = compress(char(id));
   close;
 
   *--- Compute covariance matrix of x;
   /* Modification by Wicklin: Friendly's COV and SD are different 
      because he uses a divisor of n for the COV matrix instead of n-1 */
   C = cov(X);   
   sd = std(X);
   /* If you want to exactly reproduce Friendly's macro, use the following: 
   c = d` * d /  n;               *- variance-covariance matrix;
   sd = sqrt( vecdiag( c))`;      *- standard deviations;
   */
 
   *--- Standardize if vars have sig diff scales;
   ratio = max(sd) / min(sd);
   if ratio > 3 then do;
      s = sd;             * s is a scale factor;
      C = cov2corr(C);
      Clbl = "Correlation Matrix";
   end;
   else do;
      s = j(1, p, 1);
      Clbl = "Covariance Matrix";
   end;
   print C[colname=vars rowname=vars f=7.3 L=Clbl];
 
   *--- Eigenvalues & vectors of C;
   call eigen(val, vec, C);
 
   *--- Optional: Display information about the eigenvectors;
   prop = val / val[+];
   cum = cusum(prop);
   T = val || dif(val,-1) || prop || cum;
   cl = {'Eigenvalue' 'Difference' 'Proportion' 'Cumulative'}; 
   lbl = "Eigenvalues of the " + Clbl;
   print T[colname=cl f=8.3 L=lbl];
 
   *--- Scale values;
   e1 = vec[ , 1] # sign( vec[<> , 1]);
   e2 = vec[ , 2];
   pos = e2 / e1;
   Ds = D;                /* the raw difference matrix */
   D = Ds / ( s # e1` );  /* a scaled difference matrix */
 
   *--- For case i, fitted line is  Y = F1(I) + F2(I) * X   ;
   f1 = ( D * e1 ) / e1[##];  /* ssq(e1) = ssq(e2) = 1 */
   f2 = ( D * e2 ) / e2[##];
   f = f1 || f2;
 
   *--- Optionally display the results;
   scfac = sd` # e1;   * NOTE: Friendly rounds this to the nearest 0.1 unit;
   table = m` || sd` || e1 || e2 || scfac || pos;
   ct = { 'Mean' 'Std_Dev' 'Eigvec1' 'Eigvec2' 'Scale' 'Position'};
   print table [ rowname=vars colname=ct f=8.2];
 
   *--- Rearrange columns;
   r = rank( pos);
   zz = table;  table [r, ] = zz;
   zz = vars;   vars [ ,r] = zz;
   zz = pos;    pos [r, ] = zz;
   zz = scfac;  scfac [r, ] = zz;
   zz = D;      D [ ,r] = zz;
   *--- Optionally display the results;
   print table[rowname=vars colname=ct f=8.2 L="Variables reordered by position"];
   lt = { 'Intercept' 'Slope'};
   print f[rowname=&id colname=lt f=7.2 L="Case lines"];
 
   *--- Fitted values, residuals;
   fit = f1 * j( 1 , p) + f2 * pos`;
   resid = D - fit;
   *--- Optionally display the results;
   print fit [ rowname=&id colname=vars format=7.3 L="Fitted values"];
   print resid [ rowname=&id colname=vars format=7.3 L="Residuals"];
 
   *--- Optionally display summary statistics for fit;
   sse = resid[##];   ssf = fit[##];
   sst = D[##];       vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   sse = (ds-fit)[##];   sst = ds[##];
   vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   *--- Construct output array for annotate data set -
          residuals bordered by fitted scale values;
   v1 = val[ 1 ] || {0};
   v2 = {0} || val[ 2 ];
   xout = ( resid || f ) //
          ( pos` || v1 ) //
          ( scfac` || v2 );
   rl = colvec(ID) // {'Variable','Scale'};
   cl = colvec(vars) // {'Intercept','Slope'};
   create LINFIT from xout[ rowname=rl colname=cl ];
      append from xout[ rowname= rl ];
   close;
   free rl cl xout;
 
   *--- Output the array to be plotted;
   do col = 1 to p;
       rows = j(n,1,pos[col,]) ||  fit[,col] || resid[,col];
       pout = pout // rows;
       rl = rl // shape(ID,1);
   end;
   cl = { 'Position' 'Fit' 'Residual'};
   create LINPLOT from pout[ rowname=rl colname=cl ];
      append from pout[ rowname=rl ];
   close;
 
   /* Modification by Wicklin: create two macro variables for 
      labeling the plotting positions. For example:
     POSNAME = 'larceny' 'burglary' 'auto_thf' 'rape' 'robbery' 'assault' 'murder'
     POSVALUES= -1.698 -1.014 -0.374 0.1389 0.5016 0.5805 2.1392
   */
   M1 = compress("'" + vars + "'");
   M1 = rowcat(M1 + " ");
   M2 = rowcat(char(pos`,6) + " ");
   call symputx("posNames", M1);
   call symputx("posValues",M2);
quit;
 
%put &=posNames;
%put &=posValues;
 
/* Use the POSNAMES and POSVALUES macro variables to add custom 
   tick labels and values. See
   https://blogs.sas.com/content/iml/2020/03/23/custom-tick-marks-sas-graph.html
*/
ods graphics / height=640px width=640px;
title "Optimal Linear Profiles for City Crime Rates";
footnote J=L "Friendly (1991, p. 430) Output 8.11";
proc sgplot data=LINPLOT(rename=(RL=&ID));
  label fit="Fitted Crime Index" position="Crime (Scaled Position)";
  series x=position y=fit / group=city curvelabel curvelabelloc=outside;
  xaxis values = (&posValues) valuesdisplay=(&posNames) fitpolicy=stagger;  
run;
footnote;title;

The optimal linear profile plot is shown for the crimes data. Unlike Friendly, I do not show the points along these lines. I think adding markers gives the false impression that the method has perfectly linearized the data.

The X axis shows the optimal positions of the crimes. They are roughly ordered so that crimes against property are on the left and violent crimes against persons are on the right. As is often the case, the PC interpretation does not perfectly agree with our intuitions. For these data, the analysis has placed rape to the left of robbery and assault. Notice that I use the VALUES= and VALUESDISPLAY= to specify the positions of the labels, and I use the FITPOLICY=STAGGER option to prevent the labels from colliding.

The cities can be classified according to their linear trends. For brevity, let's call larceny and burglary "property crimes," and let's call robbery, assault, and murder "violent crimes." Then you can observe the following classifications for the cities:

  • The lines for some cities (Dallas, Chicago, Houston) have a large positive slope. This indicates that the crime rates for property crimes are low relative to the rates of violent crimes.
  • The lines for other cities (Los Angeles, New York, Denver) have a large positive slope. This indicates that the crime rates for property crimes are high relative to the rates of violent crimes.
  • The lines for other cities (Kansas City, Tucson, Hartford) are relatively flat. This indicates that the crime rates for property crimes are about the same as the rates of violent crimes.
  • When the line for one city is above the line for another city, it means that the first city tends to have higher crime rates across the board than the second city. For example, the line for Los Angeles is above the line for New York. If you look at the raw data, you will see that the crime rates in LA is mostly higher, although NYC has higher rates for robbery and larceny. Similarly, the rates for Portland are generally higher than the rates for Honolulu, except for auto theft.
  • The ordering of the city labels is based primarily on the violent crime rates.

The following color image is copied from Friendly (1991, Appendix 3, p. 673) so that you can compare the SAS/GRAPH output to the newer output.

An alternative to the optimal linear profile plot

I'll be honest, I am not a big fan of the optimal linear profile plot. I think it can give a false impression about the data. It displays a linear model but does not provide diagnostic tools to assess issues such as outliers or deviations from the model. For example, the line for Boston shows no indication that auto theft in that city was extremely high in 1970. In general, there is no reason to think that one set of positions for the categories (crimes) leads to a linear profile for all subjects (cities).

If you want to show the data instead of a linear model, see my previous article, "Profile plots in SAS," and consider using the heat map near the end of the article. If you want to observe similarities and differences between the cities based on the multivariate data, I think performing a principal component analysis is a better choice. The following call to PROC PRINCOMP does a good job of showing similarities and differences between the categories and subjects. The analysis uses the correlation matrix, which is equivalent to saying that the crime rates have been standardized.

ods graphics/reset;
proc princomp data=crime out=Scores N=2 plot(only)=(PatternProfile Pattern Score);
   var murder rape robbery assault burglary larceny auto_thf;
   id City;
run;

For this analysis, the "Pattern Profile Plot" (not shown) shows that the first PC can be interpreted as the total crime rate. The second PC is a contrast between property crimes and violent crimes. The component pattern plot shows that the position of the variables along the second PC is nearly the same as the positions computed by the optimal linear profile algorithm. You can see property crimes in the lower half of the plot and violent crimes in the upper half. The "Score Plot" enables you to see the following characteristics of cities:

  • Cities to the right have high overall crime rates. Cities to the left have low overall crime rates.
  • Cities near the top have higher rates of violent crime relative to property crimes. Cities near the bottom have higher rates of property crimes. Cities near the middle have crime rates that are roughly equal for crimes.

In my opinion, the score plot provides the same information as the optimal linear profile plot. However, interpreting the score plot requires knowledge of principal components, which is an advanced skill. The optimal linear profile plot tries to convey this information in a simpler display that is easier to explain to a nontechnical audience.

Summary

This article reproduces Friendly's (1991) implementation of Hartigan's optimal linear profile plot. Friendly distributes the %LINPRO macro, which uses SAS/IML to carry out the computations and uses SAS/GRAPH to construct the linear profile plot. This article reproduces the SAS/IML computations but replaces the SAS/GRAPH calls with a call to PROG SGPLOT.

This article also discusses the advantages and disadvantages of the optimal linear profile plot. In my opinion, there are better ways to explore the similarities between subjects across many variables, including heat maps and score plots in principal component analysis.

The post Optimal linear profile plots in SAS appeared first on The DO Loop.