Statistical Thinking

March 25, 2020
 

During an outbreak of a disease, such as the coronavirus (COVID-19) pandemic, the media shows daily graphs that convey the spread of the disease. The following two graphs appear frequently:

  • New cases for each day (or week). This information is usually shown as a histogram or needle plot. The graph is sometimes called a frequency graph.
  • The total number of cases plotted against time. Usually, this graph is a line graph. The graph is sometimes called a cumulative frequency graph.

An example of each graph is shown above. The two graphs are related and actually contain the same information. However, the cumulative frequency graph is less familiar and is harder to interpret. This article discusses how to read a cumulative frequency graph. The shape of the cumulative curve indicates whether the daily number of cases is increasing, decreasing, or staying the same.

For this article, I created five examples that show the spread of a hypothetical disease. The numbers used in this article do not reflect any specific disease or outbreak.

How to read a cumulative frequency graph

When the underlying quantity is nonnegative (as for new cases of a disease), the cumulative curve never decreases. It either increases (when new cases are reported) or it stays the same (if no new cases are reported).

When the underlying quantity (new cases) is a count, the cumulative curve is technically a step function, but it is usually shown as a continuous curve by connecting each day's cumulative total. A cumulative curve for many days (more than 40) often looks smooth, so you can describe its shape by using the following descriptive terms:

  • When the number of new cases is increasing, the cumulative curve is "concave up." In general, a concave-up curve is U-shaped, like this: ∪. Because a cumulative frequency curve is nondecreasing, it looks like the right side of the ∪ symbol.
  • When the number of new cases is staying the same, the cumulative curve is linear. The slope of the curve indicates the number of new cases per day.
  • When the number of new cases is decreasing, the cumulative curve is "concave down." In general, a concave-down curve looks like an upside-down U, like this: ∩. Because a cumulative frequency curve is nondecreasing, a concave-down curve looks like the left side of the ∩ symbol.

A typical cumulative curve is somewhat S-shaped, as shown to the right. The initial portion of the curve (the red region) is concave up, which indicates that the number of new cases is increasing. The cumulative curve is nearly linear between Days 35 and 68 (the yellow region), which indicates that the number of new cases each day is approximately constant. After Day 68, the cumulative curve is concave down, which indicates that the number of daily cases is decreasing. Each interval can be short, long, or nonexistent.
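
To make the relationship between the two graphs concrete, the following SAS statements are a minimal sketch (not from the original article) that simulates a bell-shaped series of daily counts, accumulates the running total, and plots both the frequency graph and the cumulative frequency graph. The shape of the daily counts is an arbitrary choice for illustration.

/* Sketch: build a cumulative frequency curve from hypothetical daily counts */
data Cases;
do Day = 1 to 100;
   NewCases = round( 1000*pdf("Normal", Day, 50, 15) );  /* hypothetical daily counts */
   CumCases + NewCases;                                  /* running (cumulative) total */
   output;
end;
run;
 
proc sgplot data=Cases;
   needle x=Day y=NewCases;          /* frequency graph: new cases per day */
run;
proc sgplot data=Cases;
   series x=Day y=CumCases;          /* cumulative frequency graph */
   yaxis label="Total Cases";
run;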

The cumulative curve looks flat near Day 100. When the cumulative curve is exactly horizontal (zero slope), it indicates that there are no new cases.

Sometimes you might see a related graph that displays the logarithm of the cumulative cases. Near the end of this article, I briefly discuss how to interpret a log-scale graph.

Examples of frequency graphs

It is useful to look at the shape of the cumulative frequency curve for five different hypothetical scenarios. This section shows the cases-per-day frequency graphs; the cumulative frequency curves are shown in subsequent sections.

In each scenario, a population experiences a total of 1,000 cases of a disease over a 100-day time period. For the sake of discussion, suppose that the health care system can treat up to 20 new cases per day. The graphs to the right indicate that some scenarios will overwhelm the health care system whereas others will not. The five scenarios are:

  • Constant rate of new cases: In the top graph, the community experiences about 10 new cases per day for each of the 100 days. Because the number of cases per day is small, the health care system can treat all the infected cases.
  • Early peak: In the second graph, the number of new cases quickly rises for 10 days before gradually declining over the next 50 days. Because more than 20 new cases develop on Days 5–25, the health care system is overwhelmed on those days.
  • Delayed peak: In the third graph, the number of new cases gradually rises, levels out, and gradually declines. There are only a few days in which the number of new cases exceeds the capacity of the health care system. Epidemiologists call this scenario "flattening the curve" of the previous scenario. By practicing good hygiene and avoiding social interactions, a community can delay the spread of a disease.
  • Secondary outbreak: In the fourth graph, the first outbreak is essentially resolved when a second outbreak appears. This can happen, for example, if a new infected person enters a community after the first outbreak ends. To prevent this scenario, public health officials might impose travel bans on certain communities.
  • Exponential growth: In the fifth graph, the number of new cases increases exponentially. The health care system is eventually overwhelmed, and the graph does not indicate when the spread of the disease might end.

The graphs in this section are frequency graphs. The next sections show and interpret a cumulative frequency graph for each scenario.

Constant rate of new cases

In the first scenario, new cases appear at a constant rate. Consequently, the cumulative frequency chart looks like a straight line. The slope of the line is the rate at which new cases appear. For example, in this scenario, the number of new cases each day is approximately 10. Consequently, the cumulative curve has an average slope ("rise over run") that is close to 10.

Early peak

In the second scenario, new cases appear very quickly at first, then gradually decline. Consequently, the first portion of the cumulative curve is concave up and the second portion is concave down. In this scenario, the number of new cases dwindles to zero, as indicated by the nearly horizontal cumulative curve.

At any moment in time, you can use the slope of the cumulative curve to estimate the number of new cases that are occurring at that moment. Days when the slope of the cumulative curve is large (such as Day 10) correspond to days on which many new cases are reported. Where the cumulative curve is horizontal (zero slope, such as after Day 60), there are very few new cases.

Delayed peak

In the third scenario, new cases appear gradually, level out, and then decline. This is reflected in the cumulative curve. Initially, the cumulative curve is concave up. It then straightens out and appears linear for 10–15 days. Finally, it turns concave down, which indicates that the number of new cases is trending down. Near the end of the 100-day period, the cumulative curve is nearly horizontal because very few new cases are being reported.

Secondary outbreak

In the fourth scenario, there are two outbreaks. During the first outbreak, the cumulative curve is concave up or down as the new cases increase or decrease, respectively. The cumulative curve is nearly horizontal near Day 50, but then goes through a smaller concave up/down cycle as the second outbreak appears. Near the end of the 100-day period, the cumulative curve is once again nearly horizontal as the second wave ends.

Exponential growth

The fifth scenario demonstrates exponential growth. Initially, the number of new cases increases very gradually, as indicated by the small slope of the cumulative frequency curve. However, between Days 60 and 70, the number of new cases begins to increase dramatically. The lower and upper curves are both increasing at an exponential rate, but the scale of the vertical axis for the cumulative curve (upper graph) is much greater than for the graph of new cases (lower graph). This type of growth is more likely in a population that does not use quarantines and "social distancing" to reduce the spread of new cases.

This last example demonstrates why it is important to label the vertical axis. At first glance, the upper and lower graphs look very similar. Both exhibit exponential growth. One way to tell them apart is to remember that a cumulative frequency graph never decreases. In contrast, if you look closely at the lower graph, you can see that some bars (Days 71 and 91) are shorter than the previous day's bar.

Be aware of log-scale axes

The previous analysis assumes that the vertical axis plots the cumulative counts on a linear scale. Scientific articles might display the logarithm of the total counts. The graph is on a log scale if the vertical axis says "log scale" or if the tick values are powers of 10 such as 10, 100, 1000, and so forth. If the graph uses a log scale:

  • A straight line indicates that new cases are increasing at an exponential rate. The slope of the line indicates how quickly cases will double, with steep lines indicating a short doubling time (see the sketch after this list).
  • A concave-down curve indicates that new cases are increasing at a rate that is less than exponential. Log-scale graphs make it difficult to distinguish between a slowly increasing rate and a decreasing rate.
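
As a quick illustration of the doubling-time claim (a sketch, not part of the original article): on a log10 scale, a straight line with slope m (log10 cases per day) corresponds to a doubling time of log10(2)/m days. The slopes below are made-up values.

data _null_;
do m = 0.05, 0.10, 0.30;              /* assumed slopes on a log10 scale */
   DoublingTime = log10(2) / m;       /* days required for cases to double */
   put m= DoublingTime= 5.2;
end;
run;

For example, a slope of 0.3 corresponds to a tenfold increase roughly every 3.3 days, or a doubling about every day.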

Summary

In summary, this article shows how to interpret a cumulative frequency graph. A cumulative frequency graph is provided for five scenarios that describe the spread of a hypothetical disease. In each scenario, the shape of the cumulative frequency graph indicates how the disease is spreading:

  • When the cumulative curve is concave up, the number of new cases is increasing.
  • When the cumulative curve is linear, the number of new cases is not changing.
  • When the cumulative curve is concave down, the number of new cases is decreasing.
  • When the cumulative curve is horizontal, there are no new cases being reported.

Although the application in this article is the spread of a fictitious disease, the ideas apply widely. Anytime you see a graph of a cumulative quantity (sales, units produced, number of traffic accidents, ...), you can use the ideas in this article to interpret the cumulative frequency graph and use its shape to infer the trends in the underlying quantity. Statisticians use these ideas to relate a cumulative distribution function (CDF) for a continuous random variable to its probability density function (PDF).

The post How to read a cumulative frequency graph appeared first on The DO Loop.

March 16, 2020
 

Books about statistics and machine learning often discuss the tradeoff between bias and variance for an estimator. These discussions are often motivated by a sophisticated predictive model such as a regression or a decision tree. But the basic idea can be seen in much simpler situations. This article presents a simple situation that is discussed in a short paper by Dan Jeske (1993). Namely, if a random process produces integers with a known set of probabilities, what method should you use to predict future values of the process?

I will start by briefly summarizing Jeske's result, which uses probability theory to derive the best biased and unbiased estimators. I then present a SAS program that simulates the problem and compares two estimators, one biased and one unbiased.

A random process that produces integers

Suppose a gambler asks you to predict the next roll of a six-sided die. He will reward you based on how close your guess is to the actual value he rolls. No matter what number you pick, you only have a 1/6 chance of being correct. But if the strategy is to be close to the value rolled, you can compute the expected value of the six faces, which is 3.5. Assuming that the gambler doesn't let you guess 3.5 (which is not a valid outcome), one good strategy is to round the expected value to the nearest integer. For dice, that means you would guess ROUND(3.5) = 4. Another good strategy is to randomly guess either 3 or 4 with equal probability.

Jeske's paper generalizes this problem. Suppose a random process produces the integers 1, 2, ..., N, with probabilities p1, p2, ..., pN, where the sum of the probabilities is 1. (This random distribution is sometimes called the "table distribution.") If your goal is to minimize the mean squared error (MSE) between your guess and a series of future random values, Jeske shows that the optimal solution is to guess the value that is closest to the expected value of the random variable. I call this method the ROUNDING estimator. In general, this method is biased, but it has the smallest expected MSE. Recall that the MSE is a measure of the quality of an estimator (smaller is better).

An alternative method is to randomly guess either of the two integers that are closest to the expected value, giving extra weight to the integer that is closer to the expected value. I call this method the RANDOM estimator. The random estimator is unbiased, but it has a higher MSE.

An example

The following example is from Jeske's paper. A discrete process generates a random variable, X, which can take the values 1, 2, and 3 according to the following probabilities:

  • P(X=1) = 0.2, which means that the value 1 appears with probability 0.2.
  • P(X=2) = 0.3, which means that the value 2 appears with probability 0.3.
  • P(X=3) = 0.5, which means that the value 3 appears with probability 0.5.

A graph of the probabilities is shown to the right. The expected value of this random variable is E(X) = 1(0.2) + 2(0.3) + 3(0.5) = 2.3. However, your guess must be one of the feasible values of X, so you can't guess 2.3. The best prediction (in the MSE sense) is to round the expected value. Since ROUND(2.3) is 2, the best guess for this example is 2.

Recall that an estimator for X is biased if its expected value is different from the expected value of X. Since E(X) ≠ 2, the rounding estimator is biased.

You can construct an unbiased estimator by randomly choosing the values 2 and 3, which are the two closest integers to E(X). Because E(X) = 2.3 is closer to 2 than to 3, you want to choose 2 more often than 3. You can make sure that the guesses average to 2.3 by guessing 2 with probability 0.7 and guessing 3 with probability 0.3. Then the weighted average of the guesses is 2(0.7) + 3(0.3) = 2.3, and this method produces an unbiased estimate. The random estimator is unbiased, but it will have a larger MSE.
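
Before running the simulation, you can compute the theoretical MSE of each estimator directly from the probabilities. The following DATA step is a quick check (a sketch, not part of Jeske's paper or the original code) for this three-point example:

data _null_;
array prob[3] _temporary_ (0.2, 0.3, 0.5);
do x = 1 to 3;
   MSE_Round + prob[x]*(x - 2)**2;                        /* always guess 2 */
   MSE_Rand  + prob[x]*( 0.7*(x-2)**2 + 0.3*(x-3)**2 );   /* guess 2 w.p. 0.7, 3 w.p. 0.3 */
end;
put MSE_Round= 5.3 MSE_Rand= 5.3;     /* 0.700 and 0.820 */
run;

The rounding estimator has the smaller theoretical MSE (0.7 versus 0.82), which is the pattern that the simulation below should reproduce.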

Simulate the prediction of a random integer

Jeske proves these facts for an arbitrary table distribution, but let's use SAS to simulate the problem for the previous example. The first step is to compute the expected values of X. This is done by the following DATA step, which puts the expected value into a macro variable named MEAN:

/* Compute the expected value of X where 
   P(X=1) = 0.2
   P(X=2) = 0.3
   P(X=3) = 0.5
*/
data _null_;
array prob[3] _temporary_ (0.2, 0.3, 0.5);
do i = 1 to dim(prob);
   ExpectedValue + i*prob[i];       /* Sum of i*prob[i] */
end;
call symputx("Mean", ExpectedValue);
run;
 
%put &=Mean;
MEAN=2.3

The next step is to predict future values of X. For the rounding estimator, the predicted value is always 2. For the random estimator, let k be the greatest integer less than E(X) and let F = E(X) - k be the fractional part of E(X). To get an unbiased estimator, you can randomly choose k with probability 1-F and randomly choose k+1 with probability F. This is done in the following DATA step, which makes the predictions, generates a realization of X, and computes the residual difference for each method for 1,000 random values of X:

/* If you know mean=E(X)=expected value of X, Jeske (1993) shows that round(mean) is the best 
   MSE predictor, but it is biased.
   Randomly guessing the two closest integers is the best UNBIASED MSE predictor
   https://www.academia.edu/15728006/Predicting_the_value_of_an_integer-valued_random_variable
 
   Use these two predictors for 1000 future random variates.
*/
%let NumGuesses = 1000;
data Guess(keep = x PredRound diffRound PredRand diffRand);
call streaminit(12345);
array prob[3] _temporary_ (0.2, 0.3, 0.5);  /* P(X=i) */
 
/* z = floor(z) + frac(z) where frac(z) >= 0 */
/* https://blogs.sas.com/content/iml/2020/02/10/fractional-part-of-a-number-sas.html */
k = floor(&Mean);
Frac = &Mean - k;                        /* fractional part of E(X) */
do i = 1 to &NumGuesses;
   PredRound = round(&Mean);             /* guess the nearest integer */
   PredRand = k + rand("Bern", Frac);    /* random guesses between k and k+1, weighted by Frac */
   /* The guesses are made. Now generate a new instance of X and compute residual difference */
   x = rand("Table", of prob[*]);
   diffRound = x - PredRound;            /* rounding estimate */
   diffRand  = x - PredRand;             /* unbiased estimate */
   output;
end;
run;
 
/* sure enough, ROUND is the best predictor in the MSE sense */
proc means data=Guess n USS mean;
   var diffRound DiffRand;
run;

The output from PROC MEANS shows the results of generating 1,000 random integers from X. The uncorrected sum of squares (USS) column shows the sum of the squared residuals for each estimator. (The MSE estimate is USS / 1000 for these data.) The table shows that the USS (and MSE) for the rounding estimator is smaller than for the random estimator. On the other hand, the mean of the residuals is not close to zero for the rounding method because it is a biased method. In contrast, the mean of the residuals for the random method, which is unbiased, is close to zero.

It might be easier to see the bias of the estimators if you look at the predicted values themselves, rather than at the residuals. The following call to PROC MEANS computes the sample mean for X and the two methods of predicting X:

/* the rounding method is biased; the random guess is unbiased */
proc means data=Guess n mean stddev;
   var x PredRound PredRand;
run;

This output shows that the simulated values of X have a sample mean of 2.34, which is close to the expected value. In contrast, the rounding method always predicts 2, so the sample mean for that column is exactly 2.0. The sample mean for the unbiased random method is 2.32, which is close to the expected value.

In summary, you can use SAS to simulate a simple example that compares two methods of predicting the value of a discrete random process. One method is biased but has the lowest MSE. The other is unbiased but has a larger MSE. In statistics and machine learning, practitioners often choose between an unbiased method (such as ordinary least squares regression) and a biased method (such as ridge regression or LASSO regression). The example in this article provides a very simple situation that you can use to think about these issues.

The post Predict a random integer: The tradeoff between bias and variance appeared first on The DO Loop.

February 26, 2020
 

The ROC curve is a graphical method that summarizes how well a binary classifier can discriminate between two populations, often called the "negative" population (individuals who do not have a disease or characteristic) and the "positive" population (individuals who do have it). As shown in a previous article, there is a theoretical model, called the binormal model, that describes the fundamental features in binary classification. The model assumes a set of scores that are normally distributed for each population, and the mean of the scores for the negative population is less than the mean of scores for the positive population. The figure to the right (which was discussed in the previous article) shows a threshold value (the vertical line) that a researcher can use to classify an individual as belonging to the positive or negative population, according to whether his score is greater than or less than the threshold, respectively.

In most applications, any reasonable choice of the threshold will misclassify some individuals. Members of the negative population can be misclassified, which results in a false positive (FP). Members of the positive population can be misclassified, which results in a false negative (FN). Correctly classified individuals are true negatives (TN) and true positives (TP).

Visualize the binary classification method

One way to assess the predictive accuracy of the classifier is to use the proportions of the populations that are classified correctly or are misclassified. Because the total area under a normal curve is 1, the threshold parameter divides the area into two proportions. It is instructive to look at how the proportions change as the threshold value varies. The proportions are usually called "rates." The four regions correspond to the True Negative Rate (TNR), False Positive Rate (FPR), False Negative Rate (FNR), and True Positive Rate (TPR).

For the binormal model, you can use the standard deviations of the populations to choose a suitable range for the threshold parameter. The following SAS DATA step uses the normal cumulative distribution function (CDF) to compute the proportion of each population that lies to the left and to the right of the threshold parameter for a range of values. These proportions are then plotted against the threshold parameter.

%let mu_N    = 0;       /* mean of Negative population */
%let sigma_N = 1;       /* std dev of Negative population */
%let mu_P    = 2;       /* mean of Positive population */
%let sigma_P = 0.75;    /* std dev of Positive population */
 
/* TNR = True Negative Rate (TNR)  = area to the left of the threshold for the Negative pop
   FPR = False Positive Rate (FPR) = area to the right of the threshold for the Negative pop
   FNR = False Negative Rate (FNR) = area to the left of the threshold for the Positive pop
   TPR = True Positive Rate (TPR)  = area to the right of the threshold for the Positive pop  
*/
data ClassRates;
do t = -3 to 4 by 0.1;   /* threshold cutoff value (could use mean +/- 3*StdDev) */
  TNR = cdf("Normal", t, &mu_N, &sigma_N);
  FPR = 1 - TNR;
  FNR = cdf("Normal", t, &mu_P, &sigma_P);
  TPR = 1 - FNR;
  output;
end;
run;
 
title "Classification Rates as a Function of the Threshold";
%macro opt(lab); 
   name="&lab" legendlabel="&lab" lineattrs=(thickness=3); 
%mend;
proc sgplot data=ClassRates;
   series x=t y=TNR / %opt(TNR);
   series x=t y=FPR / %opt(FPR);
   series x=t y=FNR / %opt(FNR); 
   series x=t y=TPR / %opt(TPR); 
   keylegend "TNR" "FNR" / position=NE location=inside across=1;
   keylegend "TPR" "FPR" / position=SE location=inside across=1;
   xaxis offsetmax=0.2 label="Threshold";
   yaxis label="Classification Rates";
run;

The graph shows how the classification and misclassification rates vary as you change the threshold parameter. A few facts are evident:

  • Two of the curves are redundant because FPR = 1 – TNR and TPR = 1 – FNR. Thus, it suffices to plot only two curves. A common choice is to display only the FPR and TPR curves.
  • When the threshold parameter is much less than the population means, essentially all individuals are predicted to belong to the positive population. Thus, the FPR and the TPR are both essentially 1.
  • As the parameter increases, both rates decrease monotonically.
  • When the threshold parameter is much greater than the population means, essentially no individuals are predicted to belong to the positive population. Thus, the FPR and TPR are both essentially 0.

The ROC curve

The graph in the previous section shows the FPR and TPR as functions of t, the threshold parameter. Alternatively, you can plot the parametric curve ROC(t) = (FPR(t), TPR(t)) for t ∈ (-∞, ∞). Because the FPR and TPR quantities are proportions, the curve (called the ROC curve) is always contained in the unit square [0, 1] x [0, 1]. As discussed previously, as the parameter t → -∞, the curve ROC(t) → (1, 1). As the parameter t → ∞, the curve ROC(t) → (0, 0). The main advantage of the ROC curve is that it is independent of the scale of the population scores. In fact, the standard ROC curve does not display the threshold parameter. This means that you can compare the ROC curves from different models that might use different scores to classify the negative and positive populations.

The following call to PROC SGPLOT creates an ROC curve for the binormal model by plotting the TPR (on the vertical axis) against the FPR (on the horizontal axis). The resulting ROC curve is shown to the right.

title "ROC Curve";
title2;
proc sgplot data=ClassRates aspect=1 noautolegend;
   series x=FPR y=TPR / lineattrs=(thickness=2);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;
   yaxis grid;
run;

The standard ROC curve does not display the values of the threshold parameter. However, for instructional purposes, it can be enlightening to plot the values of a few selected threshold parameters. An example is shown in the following ROC curve. Displaying the ROC curve this way emphasizes that each point on the ROC curve corresponds to a different threshold parameter. For example, when t=1, the cutoff parameter is 1 and the classification is accomplished by using the vertical line and binormal populations that are shown at the beginning of this article.
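
The code for that annotated graph is not included above, but one possible sketch is to flag a few thresholds in the ClassRates data and overlay them on the curve. The choice of t = 0, 1, and 2 is arbitrary.

data ROCPlot;
set ClassRates;
if round(t, 0.1) in (0, 1, 2) then do;   /* flag a few selected thresholds */
   FPR2 = FPR;  TPR2 = TPR;
   tLabel = cats("t=", round(t, 0.1));
end;
run;
 
title "ROC Curve with Selected Thresholds";
proc sgplot data=ROCPlot aspect=1 noautolegend;
   series x=FPR y=TPR / lineattrs=(thickness=2);
   scatter x=FPR2 y=TPR2 / datalabel=tLabel markerattrs=(symbol=CircleFilled);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid label="FPR";
   yaxis grid label="TPR";
run;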

Interpretation of the ROC curve

The ROC curve shows the tradeoff between correctly classifying those who have a disease/condition and those who do not. For concreteness, suppose you are trying to classify people who have cancer based on medical tests. Wherever you place the threshold cutoff, you will make two kinds of errors: You will not identify some people who actually have cancer and you will mistakenly tell other people that they have cancer when, in fact, they do not. The first error is very bad; the second error is also bad but not life-threatening. Consider three choices for the threshold parameter in the binormal model:

  • If you use the threshold value t=2, the previous ROC curve indicates that about half of those who have cancer are correctly classified (TPR=0.5) while misclassifying very few people who do not have cancer (FPR=0.02). This value of the threshold is probably not optimal because the test only identifies half of those individuals who actually have cancer.
  • If you use t=1, the ROC curve indicates that about 91% of those who have cancer are correctly classified (TPR=0.91) while misclassifying about 16% of those who do not have cancer (FPR=0.16). This value of the threshold seems more reasonable because it detects most cancers while not alarming too many people who do not have cancer.
  • As you decrease the threshold parameter, the detection rate only increases slightly, but the proportion of false positives increases rapidly. If you use t=0, the classifier identifies 99.6% of the people who have cancer, but it also mistakenly tells 50% of the non-cancer patients that they have cancer.

In general, the ROC curve helps researchers to understand the trade-offs and costs associated with false positive and false negatives.
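
The rates quoted in the preceding list come directly from the binormal model. As a quick check (a sketch, not shown in the original article), you can recompute them from the normal CDF by using the same population parameters:

data _null_;
do t = 0, 1, 2;
   FPR = 1 - cdf("Normal", t, &mu_N, &sigma_N);   /* false positive rate at threshold t */
   TPR = 1 - cdf("Normal", t, &mu_P, &sigma_P);   /* true positive rate at threshold t */
   put t= FPR= 5.3 TPR= 5.3;
end;
run;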

Concluding remarks

In summary, the binormal ROC curve illustrates fundamental features of the binary classification problem. Typically, you use a statistical model to generate scores for the negative and positive populations. The binormal model assumes that the scores are normally distributed and that the mean of the negative scores is less than the mean of the positive scores. With that assumption, it is easy to use the normal CDF function to compute the FPR and TPR for any value of a threshold parameter. You can graph the FPR and TPR as functions of the threshold parameter, or you can create an ROC curve, which is a parametric curve that displays both rates as the parameter varies.

The binormal model is a useful theoretical model and is more applicable than you might think. If the variables in the classification problem are multivariate normal, then any linear classifier results in normally distributed scores. In addition, Krzanowski and Hand (2009, p. 34-35) state that the ROC curve is unchanged by any monotonic increasing transformation of scores, which means that the binormal model applies to any set of scores that can be transformed to normality. This is a large set, indeed, since it includes the complete Johnson system of distributions.

In practice, we do not know the distribution of scores for the population. Instead, we have to estimate the FPR and TPR by using collected data. PROC LOGISTIC in SAS can estimate an ROC curve for data by using a logistic regression classifier. Furthermore, PROC LOGISTIC can automatically create an empirical ROC curve from any set of paired observed and predicted values.

The post The binormal model for ROC curves appeared first on The DO Loop.

February 24, 2020
 

The purpose of this article is to show how to use SAS to create a graph that illustrates a basic idea in a binary classification analysis, such as discriminant analysis and logistic regression. The graph, shown at right, shows two populations. Subjects in the "negative" population do not have some disease (or characteristic) whereas individuals in the "positive" population do have it. There is a function (a statistical model) that associates a score with each individual, and the distribution of the scores is shown. A researcher wants to use a threshold value (the vertical line) to classify individuals. An individual is predicted to be negative (does not have the disease) or positive (does have the disease) according to whether the individual's score is lower than or higher than the cutoff threshold, respectively.

Unless the threshold value perfectly discriminates between the populations, some individuals will be classified correctly, and others will be classified incorrectly. There are four possibilities:

  • A subject that belongs to the negative population might be classified as "negative." This is a correct classification, so this case is called a "true negative" (TN).
  • A subject that belongs to the negative population might be classified as "positive." This is a wrong classification, so this case is called a "false positive" (FP).
  • A subject that belongs to the positive population might be classified as "negative." This is a wrong classification, so this case is called a "false negative" (FN).
  • A subject that belongs to the positive population might be classified as "positive." This is a correct classification, so this case is called a "true positive" (TP).

Typically, these concepts are visualized by using a panel of histograms based on a finite sample of data. However, this visualization uses the populations themselves, rather than data. In particular, this visualization assumes two normal populations, a situation that is called the binormal model for binary discrimination (Krzanowski and Hand, ROC Curves for Continuous Data, 2009, p. 31-35).

A graph of the populations

In building up any complex graph, it is best to start with a simpler version. A simple version enables you to develop and debug your program and to experiment with various visualizations. This section creates a version of the graph that does not have the threshold line or the four categories (TN, FP, FN, and TP).

The following DATA step uses the PDF function to generate the density curves for two populations. For this graph, the negative population is chosen to be N(0, 1) whereas the positive population is N(2, 0.75). By convention, the mean of the negative population is chosen to be less than the mean of the positive population. I could have used two separate DO loops to iterate over the X values for the distributions, but instead I used (temporary) arrays to store the parameters for each population.

/* 1. Create data for the negative and positive populations by using a binormal model. 
      The data are in "long form." 
      An indicator variable (Class) has the values "Negative" and "Positive." */
%let mu_N    = 0;       /* mean of Negative population */
%let sigma_N = 1;       /* std dev of Negative population */
%let mu_P    = 2;       /* mean of Positive population */
%let sigma_P = 0.75;    /* std dev of Positive population */
 
data Binormal(drop=i);
array mu[2]    _temporary_ (&mu_N, &mu_P);
array sigma[2] _temporary_ (&sigma_N, &sigma_P);
array c[2] $   _temporary_ ("Negative", "Positive");
 
do i = 1 to 2;
   Class = c[i];
   do x = mu[i] - 3*sigma[i] to mu[i] + 3*sigma[i] by 0.05;
      pdf = pdf("Normal", x, mu[i], sigma[i]);
      output;
   end;
end;
run;

The first few observations are shown below:

                       Class        x             pdf
                      Negative    -3.00    .004431848
                      Negative    -2.95    .005142641
                      Negative    -2.90    .005952532

Because the data are in "long form," you can use PROC SGPANEL to create a basic graph. You need to use PANELBY Class to display the negative population in one graph and the positive population in another. Here are a few design decisions that I made for the visualization:

  • SAS will assign default colors to the two populations, but you can use the STYLEATTRS statement to assign specific colors to each population curve.
  • You can use the BAND statement to fill in the area under the population density curves.
  • The SGPANEL procedure will display row headers or column headers to identify the positive and negative populations, but I used the NOHEADER option to suppress these headers and used the INSET statement to add the information in the upper left corner of each graph.
  • Since this graph is a schematic diagram, I suppress the ticks and values on the axes by using the DISPLAY=(noticks novalues) option on the ROWAXIS and COLAXIS statements.

ods graphics / width=480px height=360px;
title "The 'Negative' and 'Positive' Populations";
proc sgpanel data=Binormal noautolegend;
   styleattrs datacolors=(SteelBlue LightBrown);
   panelby Class      / layout=rowlattice onepanel noheader;
   inset Class        / position=topleft textattrs=(size=14) nolabel;
   band x=x upper=pdf lower=0 / group=Class;
   series x=x y=pdf   / lineattrs=(color=black);
   rowaxis offsetmin=0 display=(noticks novalues) label="Density";
   colaxis display=(noticks novalues) label="Score";
run;

Adding regions for true and false classifications

The previous section graphs the populations. To add the regions that are correctly and incorrectly classified by a given threshold value, you need to modify the DATA step that creates the density curves. In addition to the Class indicator variable (which has the values "Negative" and "Positive"), you need to add an indicator variable that has four values: "TN", "FP", "FN", and "TP". This second indicator variable will be used to assign colors for each region. The four regions depend on the value of the threshold parameter, which means that the DO loop that iterates over the X values should be split into two parts: the part less than the threshold and the part greater than the threshold. This is shown by the following:

%let cutoff  = 1;       /* value of the threshold parameter */
 
data Binormal2(drop=i);
array mu[2]     _temporary_ (&mu_N, &mu_P);
array sigma[2]  _temporary_ (&sigma_N, &sigma_P);
array c[2] $    _temporary_ ("Negative", "Positive");
array T[2, 2] $ _temporary_ ("TN", "FP", "FN", "TP");
 
do i = 1 to 2;
   Class = c[i];
   Type = T[i, 1];
   do x = mu[i] - 3*sigma[i] to &cutoff by 0.01;
      pdf = pdf("Normal", x, mu[i], sigma[i]);
      output;
   end;
   Type = T[i, 2];
   do x = &cutoff to mu[i] + 3*sigma[i] by 0.01;
      pdf = pdf("Normal", x, mu[i], sigma[i]);
      output;
   end;
end;
run;

The first few observations are shown below:

                   Class      Type      x             pdf
                  Negative     TN     -3.00    .004431848
                  Negative     TN     -2.99    .004566590
                  Negative     TN     -2.98    .004704958

The graph should display labels for the four regions. I will use the TEXT statement to place the labels. For the height of the labels, I will use 90% of the maximum height of the density curves. For the horizontal positions, I will offset the text by +/-50% of the standard deviation of the distribution. I use the STYLEATTRS statement to assign the colors to the four regions.

/* find the maximum height of the graph */
proc means data=Binormal2 max noprint;
   var pdf;
   output out=OutLab max=Max;
run;
 
/* use +/- s*StdDev to space the labels */
data label(drop=Max);
set OutLab(keep=Max);
y = 0.9 * Max;            /* 90% of the maximum height */
Class = "Negative"; 
x = &cutoff - 0.5*&sigma_N; Text = "TN"; output;
x = &cutoff + 0.5*&sigma_N; Text = "FP"; output;
Class = "Positive"; 
x = &cutoff - 0.5*&sigma_P; Text = "FN"; output;
x = &cutoff + 0.5*&sigma_P; Text = "TP"; output;
run;
 
data All;
   set Binormal2 label;      /* concatenate the two data sets */
run;
 
ods graphics / width=480px height=360px;
title "Relationship Between Threshold and Classification";
proc sgpanel data=All noautolegend;
   styleattrs datacolors=(SteelBlue LightBlue Cream LightBrown);
   panelby Class      / layout=rowlattice onepanel noheader;
   inset Class        / position=topleft textattrs=(size=14) nolabel;
   band x=x upper=pdf lower=0 / group=Type;
   series x=x y=pdf   / lineattrs=(color=black);
   refline &cutoff    / axis=X;
   text x=x y=y text=Text / textattrs=(size=14);
   rowaxis offsetmin=0 display=(noticks novalues) label="Density";
   colaxis display=(noticks novalues) label="Score";
run;

The graph is shown at the top of this article.

In summary, this article shows how to create a graph that illustrates a fundamental relationship in the binary classification problem.

The post Visualization of a binary classification analysis appeared first on The DO Loop.

January 29, 2020
 

In a previous article, I showed how to perform collinearity diagnostics in SAS by using the COLLIN option in the MODEL statement in PROC REG. For models that contain an intercept term, I noted that there has been considerable debate about whether the data vectors should be mean-centered prior to performing the collinearity diagnostics. In other words, if X is the design matrix used for the regression, should you use X to analyze collinearity or should you use the centered data X – mean(X)? The REG procedure provides options for either choice. The COLLIN option uses the X matrix to assess collinearity; the COLLINOINT option uses the centered data.
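
For reference, here is a sketch of the syntax that requests both sets of diagnostics (assuming the Fitness data and the model from the previous article, with Oxygen as the response variable):

proc reg data=Fitness plots=none;
   model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse / collin collinoint;
quit;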

As Belsley (1984, p. 76) states, "centering will typically seem to improve the conditioning." However, he argues that running collinearity diagnostics on centered data "gives us information about the wrong problem." He goes on to say, "mean-centering typically removes from the data the interpretability that makes conditioning diagnostics meaningful."

This article looks at how centering the data affects the collinearity diagnostics. Throughout this article, when I say "collinearity diagnostics," I am referring to the variance-decomposition algorithm that is implemented by the COLLIN option in PROC REG, which was described in the previous article. Nothing in this article applies to the VIF or TOL options in PROC REG, which provide alternative diagnostics.

The article has two main sections:

  • The mathematics behind the COLLIN (and COLLINOINT) options in PROC REG.
  • An example of an ill-conditioned linear system that becomes perfectly conditioned if you center the data.

The arguments in this article are taken from the references at the end of this article. This article assumes that you have already read my previous article about collinearity diagnostics.

The mathematics of the COLLIN option

The COLLIN option implements the regression-coefficient variance decomposition due to Belsley and presented in Belsley, Kuh, and Welsch (1980), henceforth, BKW. The collinearity diagnostics algorithm (also known as an analysis of structure) performs the following steps:

  1. Let X be the data matrix. If the model includes an intercept, X has a column of ones. BKW recommend that you NOT center X, but if you choose to center X, do it at this step. As a reminder, the COLLIN option in PROC REG does not center the data whereas the COLLINOINT option centers the data.
  2. Scale X so that each column has unit length (unit variance).
  3. Compute the singular value decomposition of X = UDV`.
  4. From the diagonal matrix, D, compute the eigenvalues and condition indices of X`X.
  5. Compute P, the matrix of variance-decomposition proportions as described in BKW, p. 105-107.
  6. From this information, you can determine whether the regression model suffers from harmful collinearity.

To make sure that I completely understand an algorithm, I like to implement it in the SAS/IML matrix language. The following SAS/IML statements implement the analysis-of-structure method. You can run the program on the same Fitness data that were used in the previous article. The results are the same as from PROC REG.

proc iml;
start CollinStruct(lambda, cond, P,         /* output variables */
                   XVar, opt);              /* input variables  */
 
   /* 1. optionally center the data */
   if upcase(opt) = "NOINT" then 
      X = XVar - mean(XVar);                /* COLLINOINT: do not add intercept, center */
   else
      X = j(nrow(XVar), 1, 1) || XVar;      /* COLLIN: add intercept, do not center */
 
   /* 2. Scale X to have unit column length (unit variance) */
   Z = X / sqrt(X[##, ]);
   /* 3. Obtain the SVD of X and calculate condition indices and the P matrix */
   call svd(U, D, V, Z);
   /* 4. compute the eigenvalues and condition indices of X`X */
   lambda = D##2;                           /* eigenvalues are square of singular values */
   cond = sqrt(lambda[1] / lambda);         /* condition indices */
 
   /* 5. Compute P = matrix of variance-decomposition proportions */
   phi = V##2 / lambda`;          /* divide squared columns by eigenvalues (proportions of each PC) */
   phi_k = phi[,+];               /* for each component, sum across columns */
   P = T( phi / phi_k );          /* create proportions and transpose the result */
finish;
 
/* Perform Regression-Coefficient Variance Decomposition of BKW */
varNames = {RunTime Age Weight RunPulse MaxPulse RestPulse};
use fitness;
   read all var varNames into XVar;
close;
 
/* perform COLLIN analysis */
call CollinStruct(lambda, cond, P,  XVar, "INT");
print "----- Do Not Center the Data (COLLIN) -----", lambda cond;
 
/* perform COLLINOINT analysis */
call CollinStruct(lambda0, cond0, P0,  XVar, "NOINT");
print "----- Center the Data (COLLINOINT) -----", lambda0 cond0;

The first table shows the eigenvalues and (more importantly) the condition indices for the original (uncentered) data. You can see that there are three condition indices that exceed 30, which indicates that there might be as many as three sets of nearly collinear relationships among the variables. My previous analysis showed two important sets of relationships:

  • Age is moderately collinear with the intercept term.
  • RunPulse is strongly collinear with MaxPulse.

In the second table, which analyzes the structure of the centered data, none of the condition indices are large. An interpretation of the second table is that the variables are not collinear. This contradicts the first analysis.

Why does centering the data change the condition indices so much? This phenomenon was studied by Belsley who showed that "centering will typically seem to improve the conditioning," sometimes by a large factor (Belsley, 1984, p. 76). Belsley says that the second table "gives us information about the wrong problem; namely, it tells us about the sensitivity of the LS solution... to numerically small relative changes in the centered data. And since the magnitude of the centered changes is usually uninterpretable," so also are the condition indices for the centered data.

Ill-conditioned data that becomes perfectly conditioned by centering

Belsley (1984) presents a small data set (N=20) for which the original variables are highly collinear (maximum condition index is 1,242) whereas the centered data is perfectly conditioned (all condition indices are 1). Belsley could have used a much smaller example, as shown in Chennamaneni et al. (2008). I used their ideas to construct the following example.

Suppose that the (uncentered) data matrix is
X = A + ε B
where A is any N x k matrix that has constant columns, B is a centered orthogonal matrix, and ε > 0 is a small number, such as 0.001. Clearly, X is a small perturbation of a rank-deficient and ill-conditioned matrix (A). The condition indices for X can be made arbitrarily large by making ε arbitrarily small. I think everyone would agree that the columns of X are highly collinear. As shown below, the analysis-of-structure algorithm on X reveals the collinearities.

But what happens if you center the data? Because A has constant columns, the mean-centered version of A is the zero matrix. Centering B does not change it because the columns are already centered. Therefore, the centered version of X is ε B, which is a perfectly conditioned orthogonal matrix! This construction is valid for arbitrarily large data, but the following statements implement this construction for a small 3 x 2 matrix.

A = { 1  2,         /* linearly dependent ==> infinite condition index) */
      1  2,
      1  2};
B = {-1  1,         /* orthogonal columns ==> perfect condition number (1) */
      0 -2,
      1  1};
eps = 0.001;        /* the smaller eps is, the more ill-conditioned X is */
X = A + eps * B;    /* small perturbation of a rank deficient matrix */
 
/* The columns of X are highly collinear. The columns X - mean(X) are perfectly conditioned. */
Xc = X - mean(X);
print X, Xc;

This example reveals how "mean-centering can remove from the data the information needed to assess conditioning correctly" (Belsley, 1984, p. 74). As expected, if you run the analysis-of-structure diagnostics on this small example, the collinearity is detected in the original data. However, if you center the data prior to running the diagnostics, the results do not indicate that the data are collinear:

/* The columns of the X matrix are highly collinear, but only 
   the analysis of the uncentered data reveals the collinearity */
call CollinStruct(lambda, cond, P, X, "INT");
print lambda cond P[c={"Intercept" "X1" "X2"}];  /* as ill-conditioned as you want */
 
call CollinStruct(lambda0, cond0, P0, X, "NOINT");
print lambda0 cond0 P0[c={"X1" "X2"}];           /* perfectly conditioned */

In the first table (which is equivalent to the COLLIN option in PROC REG), the strong collinearities are apparent. In the second table (which is equivalent to the COLLINOINT option), the collinearities are not apparent. As Belsley (1984, p. 75) says, an example like this "demonstrates that it matters very much in what form the data are taken in order to assess meaningfully the conditioning of a LS problem, and that centered data are not usually the correct form."

Geometrically, the situation is similar to the following diagram, which is part of Figure 2 on p. 7 of Chennamaneni et al. (2008). The figure shows two highly collinear vectors (U and V). However, the mean-centered vectors are not close to being collinear and can even be orthogonal (perfectly conditioned).

U and V are highly collinear. The mean-centered vectors are orthogonal.

Summary

In summary, this article has presented Belsley's arguments about why collinearity diagnostics should be performed on the original data. As Belsley (1984, p. 75) says, there are situations in which centering the data is useful, but "assessing conditioning is not one of them." The example in the second section demonstrates why Belsley's arguments are compelling: the data are clearly strongly collinear, yet if you apply the algorithm to the mean-centered data, you do not get any indication that the problem exists. The analysis of the fitness data shows that the same situation can occur in real data.

These examples convince me that the analysis-of-structure algorithm reveals collinearity only when you apply it to the original data. If you agree, then you should use the COLLIN option in PROC REG to perform collinearity diagnostics.

However, not everyone agrees with Belsley. If you are not convinced, you can use the COLLINOINT option in PROC REG to perform collinearity diagnostics on the centered data. However, be aware that the estimates for the centered data are still subject to inflated variances and sensitive parameter estimates (Belsley, 1984, p. 74), even if this diagnostic procedure does not reveal that fact.
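For reference, here is a minimal sketch of requesting both diagnostics in PROC REG. The data set name and the full variable list are assumptions based on the classic fitness example; only Age, RunPulse, and MaxPulse are named in the discussion above, so adjust the MODEL statement to match your own data:

/* Sketch: collinearity diagnostics in PROC REG (data set and variable list assumed) */
proc reg data=fitness plots=none;
   model Oxygen = Age Weight RunTime RunPulse RestPulse MaxPulse
         / collin        /* diagnostics on the uncentered data (includes the intercept) */
           collinoint;   /* the same diagnostics on the centered data */
quit;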

Further reading

The post Collinearity diagnostics: Should the data be centered? appeared first on The DO Loop.

January 15, 2020
 

In my book Simulating Data with SAS, I show how to use a graphical tool, called the moment-ratio diagram, to characterize and compare continuous probability distributions based on their skewness and kurtosis (Wicklin, 2013, Chapter 16). The idea behind the moment-ratio diagram is that skewness and kurtosis are essential for understanding the shapes of many common univariate distributions. The skewness and kurtosis are sometimes called moment ratios because they are ratios of centralized moments of a distribution. A moment-ratio diagram displays the locus of possible (skewness, kurtosis) pairs for many common distributions. This article displays a simple moment-ratio diagram and shows the locations of a few well-known distributions.
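For reference, here are the standard definitions of these two moment ratios in terms of the central moments of a distribution. Throughout this article, "kurtosis" means the excess kurtosis, which equals 0 for the normal distribution:

\[
\mu_k = E[(X-\mu)^k], \qquad
\text{skewness} = \frac{\mu_3}{\mu_2^{3/2}}, \qquad
\text{excess kurtosis} = \frac{\mu_4}{\mu_2^{2}} - 3 .
\]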

Constructing the moment-ratio diagram

When you specify the parameters for a probability distribution, you get a density curve that has a particular shape. (The shape does not depend on the mean or variance, so all distributions in this article are assumed to be centered and scaled so that they have zero mean and unit variance.) The shape is partially characterized by the skewness and kurtosis values, although those two moments do not fully define the shape of the distribution.

For most probability distributions, the skewness and kurtosis depend on the distribution parameters. In general, the following is true:

  • A distribution that has no shape parameters corresponds to a point in the moment-ratio (M-R) diagram.
  • A distribution that has one shape parameter corresponds to a curve in the M-R diagram.
  • A distribution that has two shape parameters corresponds to a region.

However, notice the phrase "shape parameters." A shape parameter changes the shape of the distribution, and not every parameter changes the shape. For example, location and scale parameters do not change the shape. The normal distribution has two parameters (location and scale), but all normal distributions are "bell-shaped." Because all normal curves have zero skewness and zero kurtosis, the normal distribution is represented by the point (0, 0) in the moment-ratio diagram. Similarly, the exponential distribution has a scale parameter, but all standardized exponential distributions have the same shape; the exponential distribution is represented by the point (2, 6) in the (skewness, kurtosis) coordinates.

In contrast, distributions such as the gamma or beta distributions have shape parameters that change the shape of the density function. The gamma distribution has one shape parameter and corresponds with a curve in the moment-ratio diagram. The beta distribution has two shape parameters and corresponds with a region in the moment-ratio diagram.

A basic moment-ratio diagram

This section lists a few common univariate distributions and specifies how the skewness and (excess) kurtosis depend on the parameters for the distributions. A simplified moment-ratio diagram appears to the right and displays points, curves, or regions for each distribution. Some researchers show only the right half of the diagram because many common distributions have positive skewness. By historical convention, the kurtosis axis points down. The letters and colors in the diagram are explained in the following list.

  • Beta distribution: The beta distribution contains two parameters that affect the shape. The skewness and kurtosis range over a region (grey) that is bounded by two quadratic curves.
  • Exponential distribution: Regardless of the value of the scale parameter, the skewness is always 2 and the kurtosis is always 6. Thus, the exponential distribution corresponds to a single point (E) on the M-R diagram.
  • Gamma distribution: For the parameter α, the skewness is 2/sqrt(α) and the kurtosis is 6/α. Thus, the gamma distribution corresponds to a curve (magenta) on the M-R diagram. The centered gamma distribution converges to the normal distribution as α → ∞, which is why the gamma curve approaches the Normal point (N). The gamma distribution is also a limit of the beta distribution, which is why the gamma curve bounds the beta region.
  • Gumbel distribution: The skewness is 1.14 and the kurtosis is 2.4. Thus, the Gumbel distribution corresponds to a single point (G).
  • Lognormal distribution: The skewness is (exp(σ²) + 2) sqrt(exp(σ²) - 1). The kurtosis is exp(4σ²) + 2 exp(3σ²) + 3 exp(2σ²) - 6. Thus, the lognormal distribution corresponds to a curve (green) in the M-R diagram.
  • Normal distribution: Every normal distribution has the same shape. The skewness and kurtosis are both 0 so there is a single point (N). The normal distribution is a limiting form of several distributions, including the beta, gamma, lognormal, and t.
  • t distribution: The skewness and kurtosis of the t distribution are both defined only when the degrees of freedom (ν) exceed 4. For those distributions, the skewness is 0 and the kurtosis is 6/(ν - 4). Thus, the t distribution corresponds to a sequence of points (T5, T6, T7, ...) that converge to the normal distribution as ν → ∞.
  • Johnson SB distribution: The Johnson SB distribution is a flexible distribution that can obtain any skewness and kurtosis combination "above" the lognormal distribution. Notice that this region overlaps with the beta region and the gamma curve. The overlap means that there are parameters for the SB distribution that have the same skewness and kurtosis as certain beta and gamma distributions. This does not imply that every distribution above the lognormal curve is a member of the SB family! It merely means that there is an SB distribution that has the same skewness and kurtosis.
  • Johnson SU distribution: The Johnson SU distribution is a flexible distribution that can obtain any skewness and kurtosis combination "below" the lognormal distribution.
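To make the formulas above concrete, the following sketch evaluates the (skewness, kurtosis) formulas for the gamma and lognormal families over a range of shape parameters and plots the two curves. The data set name, variable names, and parameter ranges are illustrative choices, not part of the original program:

/* Sketch: trace the gamma and lognormal curves of the moment-ratio diagram
   by using the skewness and excess-kurtosis formulas listed above */
data MRCurves;
length Family $ 9;
do alpha = 0.5 to 20 by 0.1;              /* gamma shape parameter */
   Family = "Gamma";
   Skew = 2/sqrt(alpha);
   Kurt = 6/alpha;
   output;
end;
do sigma = 0.05 to 1 by 0.01;             /* lognormal shape parameter */
   Family = "Lognormal";
   v = exp(sigma**2);
   Skew = (v + 2)*sqrt(v - 1);            /* (exp(sigma^2)+2) sqrt(exp(sigma^2)-1)             */
   Kurt = v**4 + 2*v**3 + 3*v**2 - 6;     /* exp(4 sigma^2)+2 exp(3 sigma^2)+3 exp(2 sigma^2)-6 */
   output;
end;
drop v;
run;
 
proc sgplot data=MRCurves;
   series x=Skew y=Kurt / group=Family;
   xaxis label="Skewness";
   yaxis label="Excess Kurtosis" reverse;  /* by convention, the kurtosis axis points down */
run;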

A more complex moment-ratio diagram

The moment-ratio diagram was known to early researchers such as Pearson in the 1800s. In 2010, E. Vargo, R. Pasupathy, and L. Leemis published a tour-de-force moment-ratio diagram that included dozens of probability distributions on one diagram ("Moment-Ratio Diagrams for Univariate Distributions", JQT, 2010). Read their paper to learn more about moment-ratio diagrams. Leemis provides a large color version of the moment-ratio diagram on his website and posted it in a 2019 article in Significance magazine. A low-resolution version is shown below. Click the image to see the original version.

Low-resolution version of moment-ratio diagram in Vargo, Pasupathy, and Leemis (JQT, 2010).

What is the moment-ratio diagram used for?

The moment-ratio diagram is a useful theoretical tool for understanding the relationships between continuous univariate distributions. But how is the moment-ratio diagram useful to a data analyst? As I show in my book, you can use the moment-ratio diagram as a tool for modeling univariate distributions. Given a set of data, you can compute the sample skewness and kurtosis. By plotting the sample moments on the moment-ratio diagram, you can select candidate distributions to model the data. You can also use the moment-ratio diagram to organize simulation studies for which you want to test an algorithm or statistical method on data that have a wide range of skewness and kurtosis values.
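For example, here is a minimal sketch of computing the sample moment ratios; the data set and variable names (MyData and x) are placeholders. Note that PROC MEANS reports kurtosis relative to the normal distribution, which matches the excess-kurtosis convention in this article:

/* Sketch: sample skewness and (excess) kurtosis for a variable.
   Replace MyData and x with your own data set and variable. */
proc means data=MyData skewness kurtosis maxdec=3;
   var x;
run;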

An appendix to my book provides a SAS program that generates a moment-ratio diagram and overlays one or more data values. I used a small modification of that program to create the basic moment-ratio diagram in this article.

The post The moment-ratio diagram appeared first on The DO Loop.

December 3, 2019
 

Longitudinal data are used in many health-related studies in which individuals are measured at multiple points in time to monitor changes in a response variable, such as weight, cholesterol, or blood pressure. There are many excellent articles and books that describe the advantages of a mixed model for analyzing longitudinal data. Recently, I encountered an introductory article that summarizes the essential issues in a little more than five pages! You can download the article for free: "A Primer in Longitudinal Data Analysis", by G. Fitzmaurice and C. Ravichandran (2008), Circulation, 118(19), p. 2005-2010.

The article analyzes a set of longitudinal data in two ways. First, the authors use a traditional linear model to perform an "analysis of response profiles." Then, the authors discuss how a mixed model can correct some of the deficiencies of the analysis. This blog post analyzes the same data by using PROC GLM in SAS. A subsequent blog post analyzes the same data by using PROC MIXED in SAS.

Longitudinal Data: Treatment of lead-exposed children

Fitzmaurice and Ravichandran analyze data for a randomized trial involving toddlers who were exposed to high levels of lead. The article analyzes a subset of 100 children. Half the children were given a treatment (called succimer) and the other half were given a placebo. The blood lead levels were measured for each child at baseline (week 0), week 1, week 4, and week 6. The main scientific question is "whether the two treatment groups differ in their patterns of change from baseline in mean blood lead levels" (p. 2006).

The children in the subset were measured at all four time points. There are no missing values or mistimed measurements. (This situation is fairly unusual in longitudinal data, which is often plagued by missed appointments or individuals who leave the study.) The following "spaghetti plot" shows the individual measurements for the 100 children at each time point. Each line represents a child. Blue lines indicate that the child was in the placebo group; red lines indicate the experimental group that was given succimer.

You can download the SAS program that contains the data and the programs that create the graphs and tables in this article.

Analysis of response profiles

In an analysis of response profiles, you compare the mean response-by-time profile for the treatment and placebo groups. You can visualize the mean response over time for each group by using the VLINE statement in PROC SGPLOT. The following graph shows the mean value and standard error for each time point in each treatment group:
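A minimal sketch of the kind of VLINE call that produces such a plot follows; the data set tlc and the variables y, Time, and Treatment match the PROC GLM step below, while the styling options are illustrative:

/* Sketch: mean response at each time point, by treatment group,
   with standard-error bars */
proc sgplot data=tlc;
   vline Time / response=y stat=mean limitstat=stderr
                group=Treatment markers;
   yaxis label="Mean Blood Lead Level";
run;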

If the treatment is ineffective, the line segments for the two treatment groups will be approximately parallel. The graph shows that this is not the case for these data. The visualization indicates that the mean blood-lead value for the treatment group (Treatment='A') is lower than the value for the placebo group at 1 and 4 weeks.

You can use PROC GLM to confirm that these differences are statistically significant and to estimate the effect that taking succimer had on the mean blood-lead level:

proc glm data=tlc;
   class Time(ref='0') Treatment(ref='P');
   model y = Treatment Time Treatment*Time / solution;
   output out=GLMOut Predicted=Pred;
quit;

According to the Type 3 statistics, all three effects in the model are significant. The parameter estimates (outlined in red) indicate that, after 1 week, the mean blood-lead level for children in the Treatment='A' group is 11.4 mcg/dL lower than the mean for children in the placebo group. Similarly, after four weeks the mean of the experimental group is 8.8 mcg/dL lower than the mean of the placebo group. These are both significant differences, so the response-profile analysis provides a positive answer to the research question: the profile for the treatment group is lower than the profile for the placebo group.

Advantages and disadvantages of the analysis of response profiles

As discussed in Fitzmaurice and Ravichandran (2008), the analysis of the response profile has several advantages:

  • It is familiar to researchers who have experience with ANOVA.
  • It does not require any advanced statistical modeling, such as modeling the covariance of the repeated measurements.

However, this simple method suffers from several statistical problems:

  • Longitudinal data do not satisfy the assumptions of linear regression and ANOVA. Because the data contain repeated measures from the same individuals, the residual errors are neither independent nor homoscedastic (constant variance).
  • Some participants in a study might miss an appointment or drop out of the study. Others might be measured at time points that were not part of the design (for example, at 2 or 3 weeks). These two problems are known as incomplete data and mistimed measurements, respectively. Although the first can be handled by using an unbalanced ANOVA, the second is a problem that does not have a simple solution within an ANOVA model that uses discrete time points.
  • The response-profile analysis does not enable you to model each individual's response as a function of time.

Prediction of individual response profiles

The inability to model individual trajectories is often the reason that researchers abandon the response-profile analysis in favor of a more complicated mixed model. To be clear, the GLM model can make predictions, but the predicted values for every child in the placebo group are the same. Similarly, the predicted values for every child in the experimental group are the same. This is shown in the following panel of graphs, which shows the predicted response curves for six children in the study, three from each treatment group.

/* Look at predictions for each individual. They are identical within each treatment group! */
proc sort data=GLMOut(where=(ID in (1,2,4,5,6,7))) out=GLMSubset;
   by Treatment ID;
run;
 
title2 "Fixed Effect Model";
proc sgpanel data=GLMSubset dattrmap=Order;
   panelby Treatment ID;
   scatter x=Time y=y / group=Treatment attrid=Treat;
   series x=Time y=Pred / group=Treatment attrid=Treat;
run;

Notice that the predicted responses are the same across the top row (ID=2, 5, and 6). These children were all in the experimental group. Although the predicted values seem to fit the actual observed response for ID=2, the predicted responses for ID=5 and ID=6 are much higher than the observed responses. Although the predicted response is the best prediction for an "average patient," it does not account for individual differences in the study participants.

The same is true for the patients in the placebo group, three of which are plotted in the second row. The predicted values are "too low" for ID=1 and are "too high" for ID=4.

If modeling the individual profiles is important, then clearly this method is not sufficient. If you want to model the individual profiles, you can use a linear mixed model. The mixed model also addresses other deficiencies of the response-profile analysis. The mixed model is described in the next blog post.
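As a preview, one common way to fit such a model in SAS is a repeated-measures specification in PROC MIXED. The following sketch uses an unstructured covariance matrix for the four visits; it is only one possible specification and is not necessarily the model that the follow-up post develops:

/* Sketch: one possible repeated-measures model for these data (preview only) */
proc mixed data=tlc;
   class ID Treatment Time;
   model y = Treatment Time Treatment*Time / solution;
   repeated Time / subject=ID type=un;   /* unstructured covariance across the four visits */
run;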

You can download the SAS code and data for the response-profile analysis.

Further reading

This blog post is a brief summary of the article "A Primer in Longitudinal Data Analysis" (Fitzmaurice and Ravichandran, 2008). See the article for more details. These data and ideas are also discussed in the book Applied Longitudinal Analysis (2012, 2nd Ed) by G. Fitzmaurice, N. Laird, and J. Ware. You can download the data (and SAS programs) from the book's web site.

The post Longitudinal data: The response-profile model appeared first on The DO Loop.

November 20, 2019
 

In a linear regression model, the predicted values are on the same scale as the response variable. You can plot the observed and predicted responses to visualize how well the model agrees with the data. However, for generalized linear models, there is a potential source of confusion. Recall that a generalized linear model (GLIM) has two components: a linear predictor, which is a linear combination of the explanatory variables, and a transformation (called the inverse link function) that maps the linear predictor onto the scale of the data. Consequently, SAS regression procedures support two types of predicted values and prediction limits. In the SAS documentation, the first type is called "predictions on the linear scale" whereas the second type is called "predictions on the data scale."

For many SAS procedures, the default is to compute predicted values on the linear scale. However, for GLIMs that model nonnormal response variables, it is more intuitive to predict on the data scale. The ILINK option, which is shorthand for "apply the inverse link transformation," converts the predicted values to the data scale. This article shows how the ILINK option works by providing an example for a logistic regression model, which is the most familiar generalized linear model.

Review of generalized linear models

The SAS documentation provides an overview of GLIMs and link functions. The documentation for PROC GENMOD provides a list of link functions for common regression models, including logistic regression, Poisson regression, and negative binomial regression.

Briefly, the linear predictor is
η = X*β
where X is the design matrix and β is the vector of regression coefficients. The link function (g) is a monotonic function that relates the linear predictor to the conditional mean of the response. Sometimes the symbol μ is used to denote the conditional mean of the response (μ = E[Y|x]), which leads to the formula
g(μ) = X*β

In SAS, you will often see options and variable names (in output data sets) that contain the substring 'XBETA'. When you see 'XBETA', it indicates that the statistic or variable is related to the LINEAR predictor. Because the link function, g, is monotonic, it has an inverse, g⁻¹. For generalized linear models, the inverse link function maps the linear-scale predictions to data-scale predictions: if η = xβ is a predicted value on the linear scale, then g⁻¹(η) is the predicted value for x on the data scale.

When the response variable is binary, the GLIM is the logistic model. If you use the convention that Y=1 indicates an event and Y=0 indicates the absence of an event, then the "data scale" is [0, 1] and the GLIM predicts the probability that the event occurs. For the logistic GLIM, the link function is the logit function:
g(μ) = logit(μ) = log( μ / (1 - μ) )
The inverse of the logit function is called the logistic function:
g⁻¹(η) = logistic(η) = 1 / (1 + exp(-η))
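SAS provides both of these transformations as DATA step functions (LOGIT and LOGISTIC), and applying them in sequence returns the original value. In the following quick check, the data set name and the probability value are arbitrary:

/* Sketch: the LOGIT and LOGISTIC functions are inverses of each other */
data CheckInverse;
   p   = 0.3;            /* an arbitrary probability on the data scale */
   eta = logit(p);       /* linear scale: log(p/(1-p))                 */
   p2  = logistic(eta);  /* back to the data scale; p2 equals p        */
run;
 
proc print data=CheckInverse; run;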

To demonstrate the ILINK option, the next sections perform the following tasks:

  1. Use PROC LOGISTIC to fit a logistic model to data. You can use the STORE statement to store the model to an item store.
  2. Use the SCORE statement in PROC PLM to score new data. This example scores data by using the ILINK option.
  3. Score the data again, but this time do not use the ILINK option. Apply the logistic transformation to the linear estimates to demonstrate the relationship between the linear scale and the data scale.

The transformation between the linear scale and the data scale is illustrated by the following graph:

Fit a logistic model

The following data are from the documentation for PROC LOGISTIC. The model predicts the probability of Pain="Yes" (the event) for patients in a study, based on the patients' sex, age, and treatment method ('A', 'B', or 'P'). The STORE statement in PROC LOGISTIC creates an item store that can be used for many purposes, including scoring new observations.

data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   DROP Duration;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;
 
title 'Logistic Model on Neuralgia';
proc logistic data=Neuralgia;
   class Sex Treatment;
   model Pain(Event='Yes')= Sex Age Treatment;
   store PainModel / label='Neuralgia Study';  /* store model for post-fit analysis */
run;

Score new data by using the ILINK option

There are many reasons to use PROC PLM, but an important purpose of PROC PLM is to score new observations. Given information about new patients, you can use PROC PLM to predict the probability of pain if these patients are given a specific treatment. The following DATA step defines the characteristics of four patients who will receive Treatment B. The call to PROC PLM scores and uses the ILINK option to predict the probability of pain:

/* Use PLM to score new patients */
data NewPatients;
   input Treatment $ Sex $ Age Duration;
   DROP Duration;
   datalines;
B M 67 15 
B F 73  5 
B M 74 12 
B F 79 16 
;
 
/* predictions on the DATA scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

The Predicted column contains probabilities in the interval [0, 1]. The LCLM and UCLM columns contain 95% confidence limits for the predicted probabilities. For example, the interval for the 67-year-old man is approximately [0.03, 0.48].

These values and intervals are transformations of analogous quantities on the linear scale. The logit transformation maps the predicted probabilities to the linear estimates. The inverse logit (logistic) transformation maps the linear estimates to the predicted probabilities.

Linear estimates and the logistic transformation

The linear scale is important because effects are additive on this scale. If you are testing the difference of means between groups, the tests are performed on the linear scale. For example, the ESTIMATE, LSMEANS, and LSMESTIMATE statements in SAS perform hypothesis testing on the linear estimates. Each of these statements supports the ILINK option, which enables you to display predicted values on the data scale.
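For example, the following sketch uses the item store from this article to compute the LS-means for the Treatment variable; the ILINK option adds columns that show the same estimates transformed to the probability scale. The specific options shown here are illustrative:

/* Sketch: LS-means on the linear scale; the ILINK option adds the
   corresponding estimates on the data (probability) scale */
proc plm restore=PainModel;
   lsmeans Treatment / diff ilink cl;
run;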

To demonstrate the connection between the predicted values on the linear and data scale, the following call to PROC PLM scores the same data according to the same model. However, this time the ILINK option is omitted, so the predictions are on the linear scale.

/* predictions on the LINEAR scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreXBeta(
         rename=(Predicted=XBeta LCLM=LowerXB UCLM=UpperXB))
         predicted lclm uclm;         /* ILINK not used, so linear predictor */
run;
 
proc print data=ScoreXBeta; run;

I have renamed the variables that PROC PLM creates for the estimates on the linear scale. The XBeta column contains the predicted values on the linear scale; these are the values you would obtain by applying the parameter estimates from PROC LOGISTIC to the observations in the NewPatients data. The LowerXB and UpperXB columns contain the corresponding 95% confidence limits for each patient.

To demonstrate that the linear estimates are related to the estimates in the previous section, the following SAS DATA step uses the logistic (inverse logit) transformation to convert the linear estimates onto the predicted probabilities:

/* Use the logistic (inverse logit) to transform the linear estimates
     (XBeta) into probability estimates in [0,1], which is the data scale.
   You can use the logit transformation to go the other way. */
data LinearToData;
   set ScoreXBeta;   /* predictions on linear scale */
   PredProb   = logistic(XBeta);
   LCLProb    = logistic(LowerXB);
   UCLProb    = logistic(UpperXB); 
run;
 
proc print data=LinearToData;
   var Treatment Sex Age PredProb LCLProb UCLProb;
run;

The transformation of the linear estimates gives the same values as the estimates that were obtained by using the ILINK option in the previous section.

Summary

In summary, there are two different scales for predicting values for a generalized linear model. When you report predicted values, it is important to specify the scale you are using. The data scale makes intuitive sense because it is the same scale as the response variable. You can use the ILINK option in SAS to get predicted values and prediction intervals on the data scale.

The post Predicted values in generalized linear models: The ILINK option in SAS appeared first on The DO Loop.

October 9, 2019
 

In a previous article, I mentioned that the VLINE statement in PROC SGPLOT is an easy way to graph the mean response at a set of discrete time points. You can choose among three options for the length of the "error bars": the standard deviation of the data, the standard error of the mean, or a confidence interval for the mean. This article explains and compares these three options. Which one you choose depends on what information you want to convey to your audience. As I will show, some of the statistics are easier to interpret than others. At the end of this article, I tell you which statistic I recommend.

Sample data

The following DATA step simulates data at four time points. The data at each time point are normally distributed, but the mean, standard deviation, and sample size of the data vary for each time point.

data Sim;
label t = "Time";
array mu[4]    _temporary_ (80 78 78 79); /* mean */
array sigma[4] _temporary_ ( 1  2  2  3); /* std dev */
array N[4]     _temporary_ (36 32 28 25); /* sample size */
call streaminit(12345);
do t = 1 to dim(mu);
   do i = 1 to N[t];
      y = rand("Normal", mu[t], sigma[t]); /* Y ~ N(mu[t], sigma[t]) */
      output;
   end;
end;
run;
 
title "Response by Time";
ods graphics / width=400px height=250px;
proc sgplot data=Sim;
   vbox y / category=t connect=mean;
run;
Box plots of data at four time points

The box plot shows the schematic distribution of the data at each time point. The boxes use the interquartile range and whiskers to indicate the spread of the data. A line connects the means of the responses at each time point.

A box plot might not be appropriate if your audience is not statistically savvy. A simpler display is a plot of the mean for each time point and error bars that indicate the variation in the data. But what statistic should you use for the heights of the error bars? What is the best way to show the variation in the response variable?

Relationships between sample standard deviation, SEM, and CLM

Before I show how to plot and interpret the various error bars, I want to review the relationships between the sample standard deviation, the standard error of the mean (SEM), and the (half) width of the confidence interval for the mean (CLM). These statistics are all based on the sample standard deviation (SD). The SEM and width of the CLM are multiples of the standard deviation, where the multiplier depends on the sample size:

  • The SEM equals SD / sqrt(N). That is, the standard error of the mean is the standard deviation divided by the square root of the sample size.
  • The width of the CLM is a multiple of the SEM. For large samples, the multiplier for a 95% confidence interval is approximately 1.96. In general, if the significance level is α and you want 100(1-α)% confidence limits, then the multiplier is the 1-α/2 quantile of the t distribution with N-1 degrees of freedom, as computed in the short sketch that follows this list.
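A minimal sketch of computing the multiplier in a DATA step; the significance level and sample size here are example values:

/* Sketch: the CLM multiplier is a quantile of the t distribution */
data Multiplier;
   alpha = 0.05;  N = 36;                       /* example values          */
   tMult = quantile("t", 1 - alpha/2, N - 1);   /* about 2.03 when N = 36  */
run;
 
proc print data=Multiplier noobs; run;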

You can use PROC MEANS and a short DATA step to display the relevant statistics that show how these three statistics are related:

/* Optional: Compute SD, SEM, and half-width of CLM (not needed for plotting) */
proc means data=Sim noprint;
   class t;
   var y;
   output out=MeanOut N=N stderr=SEM stddev=SD lclm=LCLM uclm=UCLM;
run;
 
data Summary;
set MeanOut(where=(t^=.));
CLMWidth = (UCLM-LCLM)/2;   /* half-width of CLM interval */
run;
 
proc print data=Summary noobs label;
   format SD SEM CLMWidth 6.3;
   var T SD N SEM CLMWidth;
run;
Relationships between StdDev, SEM, and CLM

The table shows the standard deviation (SD) and the sample size (N) for each time point. The SEM column is equal to SD / sqrt(N). The CLMWidth value is a little more than twice the SEM value. (The multiplier depends on N; for these data, it ranges from 2.03 to 2.06.)

As shown in the next section, the values in the SD, SEM, and CLMWidth columns are the lengths of the error bars when you use the STDDEV, STDERR, and CLM values (respectively) for the LIMITSTAT= option on the VLINE statement in PROC SGPLOT.

Visualize and interpret the choices of error bars

Let's plot all three options for the error bars on the same scale, then discuss how to interpret each graph. Several interpretations use the 68-95-99.7 rule for normally distributed data. The following statements create the three line plots with error bars:

%macro PlotMeanAndVariation(limitstat=, label=);
   title "VLINE Statement: LIMITSTAT = &limitstat";
   proc sgplot data=Sim noautolegend;
      vline t / response=y stat=mean limitstat=&limitstat markers;
      yaxis label="&label" values=(75 to 82) grid;
   run;
%mend;
 
title "Mean Response by Time Point";
%PlotMeanAndVariation(limitstat=STDDEV, label=Mean +/- Std Dev);
%PlotMeanAndVariation(limitstat=STDERR, label=Mean +/- SEM);
%PlotMeanAndVariation(limitstat=CLM,    label=Mean and CLM);
Display error bars for the means by using the standard deviation

Use the standard deviations for the error bars

In the first graph, the length of the error bars is the standard deviation at each time point. This is the easiest graph to explain because the standard deviation is directly related to the data. The standard deviation is a measure of the variation in the data. If the data at each time point are normally distributed, then (1) about 68% of the data have values within the extent of the error bars, and (2) almost all the data lie within three times the extent of the error bars.

The main advantage of this graph is that a "standard deviation" is a term that is familiar to a lay audience. The disadvantage is that the graph does not display the accuracy of the mean computation. For that, you need one of the other statistics.

Display error bars for the means by using the standard error of the mean

Use the standard error for the error bars

In the second graph, the length of the error bars is the standard error of the mean (SEM). This is harder to explain to a lay audience because it is an inferential statistic. A qualitative explanation is that the SEM shows the accuracy of the mean computation. Small SEMs imply better accuracy than larger SEMs.

A quantitative explanation requires using advanced concepts such as "the sampling distribution of the statistic" and "repeating the experiment many times." For the record, the SEM is an estimate of the standard deviation of the sampling distribution of the mean. Recall that the sampling distribution of the mean can be understood in terms of repeatedly drawing random samples from the population and computing the mean for each sample. The standard error is defined as the standard deviation of the distribution of the sample means.

The exact meaning of the SEM might be difficult to explain to a lay audience, but the qualitative explanation is often sufficient.
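If you want to make the idea of "repeating the experiment" concrete, a small simulation helps: draw many samples, compute the mean of each, and compare the standard deviation of those sample means to the SEM formula. The sketch below reuses the parameters of the first time point (mu=80, sigma=1, N=36); the number of samples and the random-number seed are arbitrary:

/* Sketch: the std dev of many sample means is close to sigma/sqrt(N) */
%let N = 36;
data ManyMeans;
call streaminit(54321);
do SampleID = 1 to 5000;                 /* draw 5000 independent samples      */
   do i = 1 to &N;
      y = rand("Normal", 80, 1);         /* mu=80, sigma=1, as at time point 1 */
      output;
   end;
end;
run;
 
proc means data=ManyMeans noprint;
   by SampleID;                          /* data are already in SampleID order */
   var y;
   output out=SampleMeans mean=SampleMean;
run;
 
/* The std dev of the sample means should be close to 1/sqrt(36) = 0.167 */
proc means data=SampleMeans std maxdec=3;
   var SampleMean;
run;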

Display error bars for the means by using confidence intervals

Use a confidence interval of the mean for the error bars

In the third graph, the length of the error bars is a 95% confidence interval for the mean. This graph also displays the accuracy of the mean, but these intervals are about twice as long as the intervals for the SEM.

The confidence interval for the mean is hard to explain to a lay audience. Many people incorrectly think that "there is a 95% chance that the population mean is in this interval." That statement is wrong: either the population mean is in the interval or it isn't. There is no probability involved! The words "95% confidence" refer to repeating the experiment many times on random samples and computing a confidence interval for each sample. The true population mean will be in about 95% of the confidence intervals.

Conclusions

In summary, there are three common statistics that are used to overlay error bars on a line plot of the mean: the standard deviation of the data, the standard error of the mean, and a 95% confidence interval for the mean. The error bars convey the variation in the data and the accuracy of the mean estimate. Which one you use depends on the sophistication of your audience and the message that you are trying to convey.

My recommendation? Despite the fact that confidence intervals can be misinterpreted, I think that the CLM is the best choice for the size of the error bars (the third graph). If I am presenting to a statistical audience, the audience already understands CLMs. For a less sophisticated audience, I do not dwell on the probabilistic interpretation of the CLM but merely say that the error bars "indicate the accuracy of the mean."

As explained previously, each choice has advantages and disadvantages. What choice do you make and why? You can share your thoughts by leaving a comment.

The post What statistic should you use to display error bars for a mean? appeared first on The DO Loop.