Rick Wicklin

April 6, 2020
 

I have written several articles about how to work with continuous probability distributions in SAS. I always emphasize that it is important to be able to compute the four essential functions for working with a statistical distribution. Namely, you need to know how to generate random values, how to compute the PDF, how to compute the CDF, and how to compute quantiles. In this article, I describe how to compute each of the four quantities for the geometric distribution, which is a DISCRETE probability distribution. The graphs that visualize a discrete distribution are slightly different than for continuous distributions. Also, the geometric distribution has two different definitions, and I show how to work with either definition in SAS.

The definition of the geometric distribution

A Bernoulli trial is an experiment that has two results, usually referred to as a "failure" or a "success." The success occurs with probability p and the failure occurs with probability 1-p. "Success" means that a specific event occurred whereas "failure" indicates that the event did not occur. Because the event can be negative (death, recurrence of cancer, ...) sometimes a "success" is not a cause for celebration!

The geometric distribution has two definitions:

  1. The number of trials until the first success in a sequence of independent Bernoulli trials. The possible values are 1, 2, 3, ....
  2. The number of failures before the first success in a sequence of independent Bernoulli trials. The possible values are 0, 1, 2, ....

Obviously, the two definitions are closely related. If X is a geometric random variable according to the first definition, then Y=X-1 is a geometric random variable according to the second definition. For example, define "heads" as the event that you want to monitor. If you toss a coin and it first shows heads on the third toss, then the number of trials until the first success is 3 and the number of failures is 2. Whenever you work with the geometric distribution (or its generalization, the negative binomial distribution), you should check to see which definition is being used.

The definition of the geometric distribution in SAS software

It is regrettable that SAS was not consistent in choosing a definition. The first definition is used by the RAND function to generate random variates. The second definition is used by the PDF function, the CDF function, and the QUANTILE function. In this article, I will use the "number of trials," which is the first definition. In my experience, this definition is more useful in applications. I will point out how to adjust the syntax of the SAS functions so that they work for either definition.
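For example, here is a minimal sketch (my illustration, not part of the original program) that shows the relationship concretely. If n is the number of trials (first definition) and k = n-1 is the number of failures (second definition), then the PDF, CDF, and QUANTILE functions expect k, whereas the RAND function returns n:

data _null_;
call streaminit(123);
p = 1/6;                           /* probability of success */
n = 3;                             /* number of trials until first success */
k = n - 1;                         /* number of failures before first success */
/* PDF, CDF, and QUANTILE use the "number of failures" definition */
pdf1 = pdf("Geometric", n-1, p);   /* P(X = n) under the "trials" definition */
pdf2 = pdf("Geometric", k, p);     /* P(Y = k) under the "failures" definition; equals pdf1 */
/* RAND uses the "number of trials" definition */
x = rand("Geometric", p);          /* trials until first success */
y = x - 1;                         /* failures before first success */
put pdf1= pdf2= x= y=;
run;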

The probability mass function for the geometric distribution

For a discrete probability distribution, the probability mass function (PMF) gives the probability of each possible value of the random variable. So for the geometric distribution, we want to compute and visualize the probabilities for n=1, 2, 3,... trials.

The probabilities depend on the parameter p, which is the probability of success. I will use three different values to illustrate how the geometric distribution depends on the parameter:

  • p = 1/2 = 0.5: the probability of "heads" when you toss a coin
  • p = 1/6 ≈ 0.166: the probability of rolling a 6 with a six-sided die
  • p = 1/13 ≈ 0.077: the probability of drawing an ace from a shuffled deck of 52 cards

The following SAS DATA step uses the PDF function to compute the probabilities for these three cases. The parameters in the PDF function are the number of failures and the probability of success. If you let n be the number of trials until success, then n-1 is the number of failures before success. Thus, the following statements use n-1 as a parameter. Since I will need the cumulative probabilities in the next section, I use the CDF function to compute them in the same DATA step:

data Geometric;
do p = 0.5, 0.166, 0.077;
   /* We want the probability that the event occurs on the n_th trial.
      This is the same as the probability of n-1 failures before the event. */
   do n = 1 to 16;                     
      pdf = pdf("Geometric", n-1, p);  /* probability that event occurs on the n_th trial */
      cdf = cdf("Geometric", n-1, p);  /* probability that event occurs on or before the n_th trial */
      output;
   end;
end;
run;
 
title "Probability Mass Function for Geom(p)";
proc sgplot data=Geometric;
   scatter x=n y=pdf / group=p markerattrs=(symbol=CircleFilled);
   series  x=n y=pdf / group=p lineattrs=(color=Gray);
   label n="Number of Trials" pdf="Probability of Event";
   xaxis grid integer values=(0 to 16 by 2) valueshint;
run;

If you are visualizing the PMF for one value of p, I suggest that you use a bar chart. Because I am showing multiple PMFs in one graph, I decided to use a scatter plot to indicate the probabilities, and I used gray lines to visually connect the probabilities that belong to the same distribution. This is the same technique that is used on the Wikipedia page for the geometric distribution.
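If you do want a bar chart for a single parameter value, the following sketch (which reuses the Geometric data set from above) shows one way to create it:

title "PMF for Geom(0.166)";
proc sgplot data=Geometric;
   where p = 0.166;
   vbar n / response=pdf;           /* bar chart of the PMF for one value of p */
   xaxis label="Number of Trials";
   yaxis label="Probability of Event";
run;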

For large values of p, the PMF decreases rapidly. When the probability of success is large, the event usually occurs early; it is rare that you would have to wait many trials before the event occurs. This geometric rate of decrease is why the distribution is named the "geometric" distribution! For small values of p, the PMF decreases slowly, and it is common to wait many trials before the first success occurs.

The cumulative probability function

If you want to know the cumulative probability that the event occurs on or before the nth trial, you can use the CDF function. The CDF curve was computed in the previous section for all three probabilities. The following call to PROC SGPLOT creates a graph. If I were plotting a single distribution, I would use a bar chart. Because I am plotting several distributions, the call uses the STEP statement to create a step plot for the discrete horizontal axis. However, if you want to show the CDF for many trials (maybe 100 or more), then you can use the SERIES statement because at that scale the curve will look smooth to the eye.

title "Cumulative Distribution for Geom(p)";
title2 "Probability That Event Occurs on or before n_th Trial";
proc sgplot data=Geometric;
   step x=n y=cdf / group=p curvelabel markers markerattrs=(symbol=CircleFilled);
   label n="Number of Trials" cdf="Cumulative Probability";
   xaxis grid integer values=(0 to 16 by 2) valueshint;
   yaxis grid;
run;

The cumulative probability increases quickly when the probability of the event is high. When the probability is low, the cumulative probability is initially almost linear. You can prove this by using a Taylor series expansion of the CDF, as follows. The formula for the CDF is F(n) = 1 – (1 – p)^n. When p is much less than 1 (written p ≪ 1), then (1 – p)^n ≈ 1 – n p + .... Consequently, the CDF is nearly linear (with slope p) as a function of n when p ≪ 1.

Quantiles of the geometric distribution

When the density function (PDF) of a continuous distribution is positive, the CDF is strictly increasing. Consequently, the inverse CDF function is continuous and increasing. For discrete distributions, the CDF function is a step function, and the quantile is the smallest value for which the CDF is greater than or equal to the given probability. Consequently, the quantile function is also a step function. The following DATA step uses the QUANTILE function to compute the quantiles for p=1/6 and for evenly spaced values of the cumulative probability.

data Quantile;
prob = 0.166;
do p = 0.01 to 0.99 by 0.005;
   n = quantile("Geometric", p, prob); * number of failures until event;
   n = n+1;                            * number of trials until event;
   output;
end;
run;
 
title "Quantiles for Geom(0.166)";
proc sgplot data=Quantile;
   scatter x=p y=n / markerattrs=(symbol=CircleFilled);
   label n="Number of Trials" p="Cumulative Probability";
   xaxis grid;
   yaxis grid;
run;

Random variates

For a discrete distribution, it is common to use a bar chart to show a random sample. For a large sample, you might choose a histogram. However, I think it is useful to plot the individual values in the sample, especially if some of the random variates are extreme values. Imagine 20 students who each roll a six-sided die until it shows a 6. The following DATA step simulates this experiment.

%let numSamples = 20;           *sample of 20 people;
data RandGeom;
call streaminit(12345);         * set the random number seed;
p = 0.166;                      * 1/6 ~ 0.166;
do ID = 1 to &numSamples;
   x = rand("Geometric", p);    *number of trials until event occurs;
   /* if you want the number of failures before the event: x = x-1 */
   Zero = 0;
   output;
end;
run;

In the simulated random sample, the values range from 1 (got a 6 on the first roll) to 26 (waited a long time before a 6 appeared). You can create a bar chart that shows these values. You can also add reference lines to the chart to indicate the mean and median of the Geom(1/6) distribution. For a Geom(p) distribution, the expected value is 1/p and the median of the distribution is ceil( -1/log2(1-p) ). When p = 1/6, the expected value is 6 and the median value is 4.

When a bar chart contains a few very long bars, you might want to "clip" the bars at some reference value so that the smaller bars are visible. The HIGHLOW statement in PROC SGPLOT supports the CLIPCAP option, which truncates the bar and places an arrow to indicate that the true length of the bar is longer than indicated in the graph. If you add labels to the end of each bar, the reader can see the true value even though the bar is truncated in length, as follows:

proc sort data=RandGeom;
   by x;
run;
/* compute mean and median of the Geom(1/6) distribution */
data Stats;
p = 0.166;
PopMean = 1/p;                    PopMedian = ceil( -1/log2(1-p) );
call symputx("PopMean", PopMean); call symputx("PopMedian", PopMedian);
run;
 
/* use CLIPCAP and the MAX= option to prevent long bars */
title "&numSamples Random Rolls of a Six-sided Die";
title2 "Random Sample of Geom(0.166)";
proc sgplot data=RandGeom;
   highlow y=ID low=Zero high=x / type=bar highlabel=x    CLIPCAP; 
   refline &PopMean / axis=x label="Mean" labelpos=min; 
   refline &PopMedian / axis=x label="Population Median" labelpos=min; 
   xaxis grid label="Number of Trials until Event" minor  MAX=10;  
   yaxis type=discrete discreteorder=data reverse valueattrs=(size=7);
run;

This chart shows that four students got "lucky" and observed a 6 on their first roll. Half the students observed a 6 in four or fewer rolls, which is the median value of the Geom(1/6) distribution. Some students rolled many times before seeing a 6. The graph is truncated at 10, and students who required more than 10 rolls are indicated by an arrow.

Conclusions

This article shows how to compute the four essential functions for the geometric distribution. The geometric distribution is a discrete probability distribution. Consequently, some concepts are different than for continuous distributions. SAS provides functions for the PMF, CDF, quantiles, and random variates. However, you need to be careful because there are two common ways to define the geometric distribution. The SAS statements in this article show how to define the geometric distribution as the number of trials until an event occurs. However, with minor modifications, the same program can handle the second definition, which is the number of failures before the event.

The post The geometric distribution in SAS appeared first on The DO Loop.

April 1, 2020
 

During an epidemic, such as the coronavirus pandemic of 2020, the media often shows graphs of the cumulative numbers of confirmed cases for different countries. Often these graphs use a logarithmic scale for the vertical axis. In these graphs, a straight line indicates that new cases are increasing at an exponential rate. The slope of the line indicates how quickly cases will double, with steep lines indicating a short doubling time. The doubling time is the length of time required to double the number of confirmed cases, assuming nothing changes.

This article shows one way to estimate the doubling time by using the most recent data. The method uses linear regression to estimate the slope (m) of a curve, then estimates the doubling time as log(2) / m.

The data in this article are the cumulative counts for COVID-19 cases in four countries (Italy, the United States, Canada, and South Korea) for the dates 03Mar2020 to 27Mar2020. You can download the data and SAS program for this article.

A log-scale visualization of the cumulative cases

The data set contains four variables:

  • The Region variable specifies the country.
  • The Day variable specifies the number of days since 03Mar2020.
  • The Cumul variable specifies the cumulative counts of confirmed cases of COVID-19.
  • The Log10Cumul variable specifies the base-10 logarithm of the confirmed cases. In SAS, you can use the LOG10 function to compute the base-10 logarithm.

You can use PROC SGPLOT to visualize these data. The following graph plots the total number of cases, but uses the TYPE=LOG and LOGBASE=10 options to specify a base-10 logarithmic axis for the counts:

title "Cumulative Counts (log scale)";
proc sgplot data=Virus;
  where Cumul > 0;
  series x=Day y=Cumul / group=Region curvelabel;
  xaxis grid;
  yaxis type=LOG logbase=10 grid
        values=(100 500 1000 5000 10000 50000 100000) ValuesHint;
run;

This graph is sometimes called a semi-log graph because only one axis is displayed on a log scale. A straight line on a semi-log graph indicates exponential growth. However, not all exponential growth is equal. The slope of the line indicates how quickly the growth is occurring, and the doubling time is one way to measure the growth. A line with a steep slope indicates that the underlying quantity (confirmed cases) will double in a short period of time. A line with a flat slope indicates that the underlying quantity is not growing as quickly and will take a long time to double. For these data, the visualization reveals the following facts:

  • The curves for the United States and Canada have steep slopes.
  • The curve for South Korea is much flatter, which indicates that the number of confirmed cases is growing very slowly in that country.
  • The slope of the curve for Italy looks similar to the US curve for days 0–6, but then the Italy curve begins to flatten. Although the US and Italy had the same number of cases on Day 24, the slope of the Italy curve was less than the slope of the US curve. The interpretation is that (on Day 24) the estimated doubling time for US cases is shorter than for Italy.

An estimate of the slope at the end of each curve

Some researchers fit a linear regression to all values on the curve in order to estimate an average slope. This is usually not a good idea because interventions (such as stay-at-home orders) cause the curves to bend over time. This is clearly seen in the curves for Italy and South Korea.

You can get a better estimate for the current rate of growth if you fit a regression line by using only recent data values. I suggest using data from several previous days, such as the previous 5 or 7 days. You can use the REG procedure in SAS to estimate the slope of each line based on the five most recent observations:

%let MaxDay = 24;
proc reg data=Virus outest=Est noprint;
  where Day >= %eval(&MaxDay-4);           *previous 5 days: Day 20, 21, 22, 23, and 24;
  by Region notsorted;
  model log10Cumul = Day;
quit;

The Est data set contains estimates of the slope (and intercept) for the line that best fits the recent data. The estimates are shown in a subsequent section.

An estimate of the doubling time

You can use these estimated slopes to estimate the doubling time for each curve. If a quantity Y increases from Y0 at time t0 to 2*Y0 at some future time t0 + Δt, the value Δt is the doubling time. The next paragraph shows that the doubling time at t0 is log(2) / m, where m is an estimate of the slope at t0.

The idea is to use the tangent line at t0 to estimate the doubling time. Let log(Y) = m*t + b be the equation of the tangent line at t0 on the semi-log graph. When Y increases from Y0 to 2*Y0, the logarithm increases from log(Y0) to log(2*Y0) = log(2)+log(Y0). Since the slope is "rise over run," the tangent line reaches the doubled value when
m = [(log(2) + log(Y0)) - log(Y0)] / Δt = log(2) / Δt.
Solving for Δt gives
Δt = log(2) / m,
where m is the slope of a regression line for the semi-log curve at t0. This formula estimates the doubling time, which does not depend on the value of Y, only on the slope at t0.

Estimate the doubling time from the slope

The following SAS DATA step estimates the doubling time by using the slope estimates at the end of each curve (Day 24):

data DoublingTime;
set Est(rename=(Day=Slope));
label Slope = "Slope of Line" DoublingTime = "Doubling Time";
DoublingTime = log10(2) / Slope;
keep Region Intercept Slope DoublingTime;
run;
 
proc print data=DoublingTime label noobs;
  format Slope DoublingTime 6.3;
  var Region Slope DoublingTime;
run;

For a pandemic, short doubling times are bad and long doubling times are good. Based on the 27Mar2020 data, the table estimates the doubling time for Italy to be 9 days. In contrast, the estimate for the US doubling time is about 3.3 days, and the estimate for Canada is about 2.5 days. The estimate for South Korea is 67 days, but for such a long time period the assumption that "the situation stays the same" is surely not valid.

Visualizing the doubling time

You can visualize the doubling times by adding an arrow to the end of each curve:

  • The base of the arrow is located at the most recent data point.
  • The direction of the arrow is determined by the estimated slope of the curve.
  • The horizontal extent of the arrow is the doubling time.
  • The tip of the arrow is at twice the current count.

The arrows are shown in the following visualization, which excludes South Korea:
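One way to create the arrows is with the VECTOR statement in PROC SGPLOT. The following sketch is not the code that produced the graph: the data set Last and the variables Day0 and Cumul0 are hypothetical names for the final observed point for each region, and the DoublingTime data set is the one computed in the previous section.

/* Sketch: draw doubling-time arrows by using the VECTOR statement */
data Arrows;
merge Last DoublingTime;          /* assume both are sorted by Region */
by Region;
DayTip   = Day0 + DoublingTime;   /* horizontal extent = doubling time */
CumulTip = 2*Cumul0;              /* arrow tip is at twice the current count */
run;
 
data Plot;                        /* concatenate the curves and the arrows */
set Virus Arrows;
run;
 
title "Predicted Doubling of Cumulative Counts";
proc sgplot data=Plot noautolegend;
   where Region ^= "South Korea";
   series x=Day y=Cumul / group=Region curvelabel;
   vector x=DayTip y=CumulTip / xorigin=Day0 yorigin=Cumul0 group=Region;
   xaxis grid;
   yaxis type=log logbase=10 grid;
run;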

This graph shows how quickly the US and Canadian counts are predicted to double. The tip of each arrow indicates the time at which the number of cases is predicted to double. For the US and Canada, this is only a few days. The arrow for Italy indicates a longer time before the Italian cases double. Again, these calculations assume that the number of cases continues to grow at the estimated rate on Day 24. Because the curve for Italy appears to be flattening, the Italian estimate will (hopefully) be overly pessimistic.

Summary

In summary, you can use the slope of a cumulative curve (on a log scale) to estimate the doubling time for the underlying quantity. To find the slope at the most recent observation, you can fit a linear regression line to recent data. The doubling time is given by log(2)/m, where m is the estimate of the slope of the cumulative curve in a semi-log graph. If you want to visualize the doubling time on the graph, you can add an arrow to the end of each curve.

The post Estimates of doubling time for exponential growth appeared first on The DO Loop.

March 30, 2020
 

A cumulative curve shows the total amount of some quantity at multiple points in time. Examples include:

  • Total sales of songs, movies, or books, beginning when the item is released.
  • Total views of blog posts, beginning when the post is published.
  • Total cases of a disease for different countries, beginning when the disease passes some threshold (such as the 100th case) in each country.

If you plot multiple cumulative curves on the same graph, you can compare the different songs, blog posts, or countries. How do the sales of one song compare with another song that was released last week? How does the spread of a disease in one country compare with the spread in a different country that experienced the same disease months ago?

Robert Allison noticed that when you plot multiple cumulative curves on the same graph, the graph looks like a series of tall smokestacks, each billowing smoke. You can imagine a breeze blowing from the left, which makes the smoke drift to the right. Therefore, he named this graph a "smokestack plot." A smokestack plot overlays several cumulative curves, which might start at different points in time. An example of a smokestack plot is shown to the right.

This article shows how to use SAS to create a smokestack plot like this one. It discusses how to get data in a suitable form, how to generate the cumulative quantities, and how to plot the cumulative curves. To illustrate the process, this article uses views of blog posts during early 2020.

You can download the data and SAS program for this article.

The form of the data

The data for a smokestack plot should be in "long form." That is, the data should contain three variables:

  • An ID variable that identifies each subject. For these data, each post has a unique ID number.
  • A date or time variable. For these data, the DATE variable identifies days that the blog post was viewed, beginning with the day it was published.
  • A variable that records the number of events for each subject at each time point. For these data, the VALUE variable records the number of views for each blog post for each day.

The data should be sorted by the time variable within each value of the ID variable.
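For example, assuming the data are in a set named Have (the name used in the DATA step later in this article), you can sort them as follows:

proc sort data=Have;
   by ID Date;                    /* sort by Date within each value of ID */
run;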

If you are interested in the frequency of views, you can plot the number of views versus time to obtain a frequency plot. For example, the following plot shows the number of views for each day for the blog post with ID=25559, which was published on 13JAN2020.
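A minimal sketch of how you might create such a frequency plot from the long-form data:

title "Daily Views for Post ID=25559";
proc sgplot data=Have;
   where ID = 25559;
   needle x=Date y=Value;         /* daily views as a needle plot */
   xaxis grid display=(nolabel);
   yaxis grid label="Views per Day";
run;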

The frequency plot shows that the post received almost 100 views when it was first published, but the daily views quickly fell to about 10 per day until the end of January, when the blog post was mentioned in a newsletter. That caused a spike in views. The spike eventually faded, and the post settled into a steady state in which it receives a few (3 or 4) views per day. (Diseases and songs also experience spikes. A disease might spike when newly infected persons enter an area. Songs spike if they are featured in a movie or TV show.)

Other blog posts follow a similar trajectory. There is an initial spike in views after publication, and the daily views eventually settle into a steady state. For some posts, the daily views remain relatively high even long after they are published. Others attract only a few views per day after the initial spike.

Create a smokestack plot

You can use a smokestack plot of the cumulative views to monitor the steady-state behavior and to compare different subjects.

The first step is to compute the cumulative quantities for each ID. Assuming that the data are sorted by the Date variable within each value of the ID variable, the following DATA step computes the cumulative number of views for each value of the ID variable.

/* Compute cumulative quantities */
data Cumul;
set Have;                  /* assume sorted by Date for each ID variable */
by ID notsorted;           /* use NOTSORTED option if not sorted by ID */
if first.ID then do;
   Cumul=0;                /* initialize the cumulative count */
   output;                 /* Optional: to display 0, include the OUTPUT statements */
end;
Cumul + Value;             /* increment the cumulative count */
output;
run;

You can now plot the cumulative quantity versus time for each ID value. In SAS, you can use the SERIES statement and specify the GROUP= option to obtain a different curve for each ID value:

ods graphics / reset;
title "Smokestack Plot";
title2 "Overlay Multiple Cumulative Curves";
proc sgplot data=Cumul;
   series x=Date y=Cumul / group=ID lineattrs=(pattern=solid) curvelabel; 
   xaxis grid display=(nolabel);
   yaxis grid label="Cumulative Views";
run;

The graph is shown at the top of this article. The graph shows that the posts are published at regular intervals (every Monday). They initially get a flurry of views, but after a week or two the views taper off to a rate that is different for each post. The slope of the cumulative curve indicates the rate at which a blog gets new views.

The blog posts in this data set get about 100 views on the first day. That's the "smokestack," which rises vertically. After that, new views look like smoke. During the first few days, the smoke rises quickly, which corresponds to a greater-than-average number of views. As time goes on, there are fewer views, so the trails of smoke become less steep. The long-term views are mostly generated from internet search engines at a post-specific rate. You can see in the graph that some slopes (smoke trails) are steeper than others.

On occasion, a post will be referenced in a newsletter, viral tweet, or subsequent blog post, which can result in a secondary spike in views. This is seen for the post that has ID=25559. In the smokestack plot, this looks like an updraft that lifts the smoke trail dramatically upward before it once again settles into a (possibly new) long-term slope.

Create a common origin

You can use a variant of the smokestack plot to compare multiple cumulative curves as if they all started on the same day. To do that, you can create a new variable (DAY), which is the number of days since publication. You can use that variable to plot the cumulative curves with a common origin, as follows:

data Origin;
retain BaseDate;
set Cumul;                 /* start with the cumulative data */
by ID notsorted;           /* use the NOTSORTED option if not sorted by ID */
if first.ID then           /* remember the first day for this ID */
   BaseDate = Date;
Day = Date - BaseDate;     /* days since first day */
drop BaseDate;
run;
 
title2 "Cumulative Curves with a Common Origin";
proc sgplot data=Origin;
   series x=Day y=Cumul / group=ID lineattrs=(pattern=solid) curvelabel; 
   xaxis grid label="Days Since Publication" values=(0 to 100 by 10) valueshint;
   yaxis grid label="Cumulative Views" values=(0 to 1000 by 100) valueshint;
run;

This graph enables you to compare the number of views for each post n days after publication. Notice that the older posts have longer smoke trails, whereas the more recent posts have shorter trails. The "common origin" smokestack plot provides several informative facts:

  • ID=25559 (dark blue, top) was popular even before it got a big "updraft" about 15 days after publication.
  • Although ID=25609 (dark red, middle) and ID=25079 (brown, middle) have fewer cumulative views than ID=25559, you can see that their long-term slope is steeper, which indicates that they get more daily views. If these trends hold, they may overtake ID=25559 in total views after a few more months.
  • Although ID=25779 (olive green) and ID=25753 (magenta) started out somewhat strong, they are now generating only 1-2 views per day. In contrast, ID=25672 (teal, bottom) started out slowly but generates 4-5 views every day, so it might eventually surpass those other two posts in total views.

You can make other choices for a common origin. For example, if some posts have a slower start than others, you could let "Day 0" be the first day on which each post surpasses 200 views.
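The following sketch implements that idea by modifying the previous DATA step: BaseDate becomes the first date on which the cumulative count reaches 200, and earlier days are dropped.

data Origin200;
retain BaseDate;
set Cumul;
by ID notsorted;
if first.ID then BaseDate = .;    /* reset for each post */
if BaseDate = . and Cumul >= 200 then
   BaseDate = Date;               /* first day with 200+ cumulative views */
if BaseDate ^= . then do;
   Day = Date - BaseDate;         /* days since reaching 200 views */
   output;
end;
drop BaseDate;
run;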

Log-scale smokestack plots

I intentionally omitted blog posts that have more than 1000 cumulative views because I wanted to be able to see the trajectories of the less popular items. However, in your data, you might have one or two subjects whose cumulative quantities are an order of magnitude larger than the others. In this case, you might want to plot the cumulative curves on a logarithmic scale, which is equivalent to plotting the logarithm of the cumulative counts.

The SGPLOT procedure makes it easy to plot data on the log scale: simply add TYPE=LOG to either the YAXIS or XAXIS statement. By default, the scale is the natural logarithm (base e), which is appropriate for most scientific applications. For social science and business applications, you can use LOGBASE=10 to plot log10 of the data. The data for a log-scale graph must be strictly positive, so the following call to PROC SGPLOT uses a WHERE Cumul>0 to drop any zeros in the cumulative counts. You can use a log scale for the original smokestack plot or for the version that has a common origin:

title2 "Cumulative Curves (log scale)";
proc sgplot data=Origin;
   where Cumul > 0;
   series x=Day y=Cumul / group=ID lineattrs=(pattern=solid) curvelabel; 
   xaxis grid label="Days Since Publication" values=(0 to 100 by 10) valueshint;
   yaxis grid type=log logbase=10     /* LOG10 transformation of axis */
         label="Cumulative Views" values=(100 to 1000 by 100) valueshint;
run;

As I said, a log scale enables you to see very large values (such as 10,000 or more) and very small values (such as 100 or less) in one plot. For these data, the main advantage of a log scale is that you can more easily see values for the curves near Day=0, where the cumulative counts are small (less than 200).

Summary

This article shows how to use SAS to create a smokestack plot, which is a graph that overlays multiple cumulative curves. Often, the cumulative curves will start at different points in time. However, you can translate the curves to a common origin by defining a new variable such as "the number of days since publication." This enables you to directly compare two curves at the same point in their evolution. Lastly, if one curve is much taller than the others, you can use a log-scale axis to visualize all curves on a common scale.

The smokestack plot can be used for many different applications. During the coronavirus pandemic of 2020, smokestack plots were a common way to compare the cumulative totals of infected persons for different countries.

The post Smokestack plots: A visualization technique for comparing cumulative curves appeared first on The DO Loop.

March 25, 2020
 

During an outbreak of a disease, such as the coronavirus (COVID-19) pandemic, the media shows daily graphs that convey the spread of the disease. The following two graphs appear frequently:

  • New cases for each day (or week). This information is usually shown as a histogram or needle plot. The graph is sometimes called a frequency graph.
  • The total number of cases plotted against time. Usually, this graph is a line graph. The graph is sometimes called a cumulative frequency graph.

An example of each graph is shown above. The two graphs are related and actually contain the same information. However, the cumulative frequency graph is less familiar and is harder to interpret. This article discusses how to read a cumulative frequency graph. The shape of the cumulative curve indicates whether the daily number of cases is increasing, decreasing, or staying the same.

For this article, I created five examples that show the spread of a hypothetical disease. The numbers used in this article do not reflect any specific disease or outbreak.

How to read a cumulative frequency graph

When the underlying quantity is nonnegative (as for new cases of a disease), the cumulative curve never decreases. It either increases (when new cases are reported) or it stays the same (if no new cases are reported).

When the underlying quantity (new cases) is a count, the cumulative curve is technically a step function, but it is usually shown as a continuous curve by connecting each day's cumulative total. A cumulative curve for many days (more than 40) often looks smooth, so you can describe its shape by using the following descriptive terms:

  • When the number of new cases is increasing, the cumulative curve is "concave up." In general, a concave-up curve is U-shaped, like this: ∪. Because a cumulative frequency curve is nondecreasing, it looks like the right side of the ∪ symbol.
  • When the number of new cases is staying the same, the cumulative curve is linear. The slope of the curve indicates the number of new cases.
  • When the number of new cases is decreasing, the cumulative curve is "concave down." In general, a concave-down curve looks like an upside-down U, like this: ∩. Because a cumulative frequency curve is nondecreasing, a concave-down curve looks like the left side of the ∩ symbol.

A typical cumulative curve is somewhat S-shaped, as shown to the right. The initial portion of the curve (the red region) is concave up, which indicates that the number of new cases is increasing. The cumulative curve is nearly linear between Days 35 and 68 (the yellow region), which indicates that the number of new cases each day is approximately constant. After Day 68, the cumulative curve is concave down, which indicates that the number of daily cases is decreasing. Each interval can be short, long, or nonexistent.

The cumulative curve looks flat near Day 100. When the cumulative curve is exactly horizontal (zero slope), it indicates that there are no new cases.

Sometimes you might see a related graph that displays the logarithm of the cumulative cases. Near the end of this article, I briefly discuss how to interpret a log-scale graph.

Examples of frequency graphs

It is useful to look at the shape of the cumulative frequency curve for five different hypothetical scenarios. This section shows the cases-per-day frequency graphs; the cumulative frequency curves are shown in subsequent sections.

In each scenario, a population experiences a total of 1,000 cases of a disease over a 100-day time period. For the sake of discussion, suppose that the health care system can treat up to 20 new cases per day. The graphs to the right indicate that some scenarios will overwhelm the health care system whereas others will not. The five scenarios are:

  • Constant rate of new cases: In the top graph, the community experiences about 10 new cases per day for each of the 100 days. Because the number of cases per day is small, the health care system can treat all the infected cases.
  • Early peak: In the second graph, the number of new cases quickly rises for 10 days before gradually declining over the next 50 days. Because more than 20 new cases develop on Days 5–25, the health care system is overwhelmed on those days.
  • Delayed peak: In the third graph, the number of new cases gradually rises, levels out, and gradually declines. There are only a few days in which the number of new cases exceeds the capacity of the health care system. Epidemiologists call this scenario "flattening the curve" of the previous scenario. By practicing good hygiene and avoiding social interactions, a community can delay the spread of a disease.
  • Secondary outbreak: In the fourth graph, the first outbreak is essentially resolved when a second outbreak appears. This can happen, for example, if a new infected person enters a community after the first outbreak ends. To prevent this scenario, public health officials might impose travel bans on certain communities.
  • Exponential growth: In the fifth graph, the number of new cases increases exponentially. The health care system is eventually overwhelmed, and the graph does not indicate when the spread of the disease might end.

The graphs in this section are frequency graphs. The next sections show and interpret a cumulative frequency graph for each scenario.

Constant rate of new cases

In the first scenario, new cases appear at a constant rate. Consequently, the cumulative frequency chart looks like a straight line. The slope of the line is the rate at which new cases appear. For example, in this scenario, the number of new cases each day is approximately 10. Consequently, the cumulative curve has an average slope ("rise over run") that is close to 10.
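To make the connection between daily counts and the cumulative curve concrete, the following sketch (the numbers are made up, as are all data in this article) simulates about 10 new cases per day and accumulates them. The resulting cumulative curve is nearly a straight line with slope 10.

data ConstantRate;
call streaminit(1);
do Day = 1 to 100;
   NewCases = rand("Poisson", 10);   /* roughly 10 new cases per day */
   Cumul + NewCases;                 /* running total (sum statement) */
   output;
end;
run;
 
title "Cumulative Cases for a Constant Rate of New Cases";
proc sgplot data=ConstantRate;
   series x=Day y=Cumul;
   xaxis grid;
   yaxis grid label="Total Cases";
run;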

Early peak

In the second scenario, new cases appear very quickly at first, then gradually decline. Consequently, the first portion of the cumulative curve is concave up and the second portion is concave down. In this scenario, the number of new cases dwindles to zero, as indicated by the nearly horizontal cumulative curve.

At any moment in time, you can use the slope of the cumulative curve to estimate the number of new cases that are occurring at that moment. Days when the slope of the cumulative curve is large (such as Day 10) correspond to days on which many new cases are reported. Where the cumulative curve is horizontal (zero slope, such as after Day 60), there are very few new cases.

Delayed peak

In the third scenario, new cases appear gradually, level out, and then decline. This is reflected in the cumulative curve. Initially, the cumulative curve is concave up. It then straightens out and appears linear for 10–15 days. Finally, it turns concave down, which indicates that the number of new cases is trending down. Near the end of the 100-day period, the cumulative curve is nearly horizontal because very few new cases are being reported.

Secondary outbreak

In the fourth scenario, there are two outbreaks. During the first outbreak, the cumulative curve is concave up or down as the new cases increase or decrease, respectively. The cumulative curve is nearly horizontal near Day 50, but then goes through a smaller concave up/down cycle as the second outbreak appears. Near the end of the 100-day period, the cumulative curve is once again nearly horizontal as the second wave ends.

Exponential growth

The fifth scenario demonstrates exponential growth. Initially, the number of new cases increases very gradually, as indicated by the small slope of the cumulative frequency curve. However, between Days 60–70, the number of new cases begins to increase dramatically. The lower and upper curves are both increasing at an exponential rate, but the scale of the vertical axis for the cumulative curve (upper graph) is much greater than for the graph of new cases (lower graph). This type of growth is more likely in a population that does not use quarantines and "social distancing" to reduce the spread of new cases.

This last example demonstrates why it is important to label the vertical axis. At first glance, the upper and lower graphs look very similar. Both exhibit exponential growth. One way to tell them apart is to remember that a cumulative frequency graph never decreases. In contrast, if you look closely at the lower graph, you can see that some bars (Days 71 and 91) are shorter than the previous day's bar.

Be aware of log-scale axes

The previous analysis assumes that the vertical axis plots the cumulative counts on a linear scale. Scientific articles might display the logarithm of the total counts. The graph is on a log scale if the vertical axis says "log scale" or if the tick values are powers of 10 such as 10, 100, 1000, and so forth. If the graph uses a log scale:

  • A straight line indicates that new cases are increasing at an exponential rate. The slope of the line indicates how quickly cases will double, with steep lines indicating a short doubling time.
  • A concave-down curve indicates that new cases are increasing at a rate that is less than exponential. Log-scale graphs make it difficult to distinguish between a slowly increasing rate and a decreasing rate.

Summary

In summary, this article shows how to interpret a cumulative frequency graph. A cumulative frequency graph is provided for five scenarios that describe the spread of a hypothetical disease. In each scenario, the shape of the cumulative frequency graph indicates how the disease is spreading:

  • When the cumulative curve is concave up, the number of new cases is increasing.
  • When the cumulative curve is linear, the number of new cases is not changing.
  • When the cumulative curve is concave down, the number of new cases is decreasing.
  • When the cumulative curve is horizontal, there are no new cases being reported.

Although the application in this article is the spread of a fictitious disease, the ideas apply widely. Anytime you see a graph of a cumulative quantity (sales, units produced, number of traffic accidents, ...), you can use the ideas in this article to interpret the cumulative frequency graph and use its shape to infer the trends in the underlying quantity. Statisticians use these ideas to relate a cumulative distribution function (CDF) for a continuous random variable to its probability density function (PDF).

The post How to read a cumulative frequency graph appeared first on The DO Loop.

March 23, 2020
 

When you create a graph by using the SGPLOT procedure in SAS, usually the default tick locations are acceptable. Sometimes, however, you might want to specify a set of custom tick values for one or both axes. This article shows three examples:

  • Specify evenly spaced values.
  • Specify tick values that are not evenly spaced.
  • Specify custom labels for the ticks, including symbols such as π.

You can accomplish these tasks by using options on the XAXIS and YAXIS statements. The VALUES= option enables you to specify tick locations; the VALUESDISPLAY= option enables you to specify text strings to display on an axis.

Use the VALUES= option to specify tick locations

You can use the VALUES= option on the XAXIS or YAXIS statement to specify a set of values at which to display ticks. If you want to specify a set of evenly spaced values, you can use the (start TO stop BY increment) syntax. For example, the following call to PROC SGPLOT specifies that ticks on the Y axis should be displayed at the locations -0.5, 0, 0.5, 1, ..., 3.

/* create data for the graph of the function y = x*exp(x) */
data Function;
do x = -5 to 1 by 0.05;
   y = x*exp(x);
   output;
end;
run;
 
ods graphics / width=480px height=360px;
title "Graph of y = x*exp(x)";
proc sgplot data=Function;
   series x=x y=y;
   yaxis grid values=(-0.5 to 3 by 0.5);   /* specify tick locations */
   xaxis grid;
run;

Notice that the Y axis has tick marks at the values that are specified by the VALUES= option.

Specify tick mark locations and labels

Sometimes you might want to highlight a specific value on a graph. For example, you might want to indicate where a graph has a maximum, minimum, or inflection point. The graph in the previous section has a minimum value at (x,y) = (-1, -1/e) ≈ (-1, -0.368). You can use the VALUES= option to specify a set of tick locations that includes -0.368. Because nobody who reads the graph will associate -0.368 with the number -1/e, you might want to display the text string "-1/e" on the axis instead of -0.368.

The VALUESDISPLAY= option enables you to specify a text string for each value in the VALUES= list. The strings are used as labels for each tick mark. For example, the following call to PROC SGPLOT uses the VALUESDISPLAY= option to display the string "-1/e" at the location y = -0.368. To further emphasize that the value corresponds to the minimum value of the function, I use the DROPLINE statement to display line segments that extend from the point (-1, -0.368) to each axis.

%let eInv = %sysfunc(exp(-1)); /* e inverse = exp(-1) */
title "Graph of y = x*exp(x)";
proc sgplot data=Function;
   series x=x y=y;
   dropline x= -1 y= -&eInv / dropto=both lineattrs=(color=coral);
   /* Draw coordinate axes in dark grey */
   refline 0 / axis=y lineattrs=(color=darkgrey);
   refline 0 / axis=x lineattrs=(color=darkgrey);
   /* ticks on the Y axis include -1/e */
   yaxis grid  max=1.5
         values        = (-1   -&eInv  0   0.5   1   1.5 )   /* locations */
         valuesdisplay = ("-1" "-1/e" "0" "0.5" "1" "1.5");  /* strings to display */
   xaxis grid;
run;

Use symbols for tick labels

The previous example displays the string "-1/e" at a tick mark. You can also display symbols on the axes. Back in 2011, Dan Heath discussed how to specify symbols by using Unicode values. Since SAS 9.4M3, you can also use Unicode symbols in user-defined formats. The following example defines the Unicode symbol for pi (π) and uses the symbol in the VALUESDISPLAY= option to label tick marks at π/2, π, 3π/2, and 2π.

/* create the graph of y = sin(x) on [0, 2*pi] */
data Trig;
pi = constant('pi');  drop pi;
do x = 0 to 2*pi by pi/25;
   y = sin(x);
   output;
end;
run;
 
%let piChar = (*ESC*){Unicode pi};              /* define pi symbol */
title "Graph of y = sin(x)";
proc sgplot data=Trig noautolegend;
   series x=x y=y;
   refline 0 / axis=y lineattrs=(color=darkgrey);
   xaxis grid 
         values        = (0    1.57        3.14      4.71         6.28 )     /* 0, pi/2, pi, 3pi/2, 2pi */
         valuesdisplay = ("0" "&piChar/2" "&piChar" "3&piChar/2" "2&piChar");  
run;

In this example, I used the VALUES= and VALUESDISPLAY= option on the XAXIS statement. Consequently, the X axis displays the symbols π/2, π, 3π/2, and 2π.

In summary, you can use the VALUES= option to specify a list of locations for ticks on an axis. You can use the (start TO stop BY increment) notation to specify evenly spaced ticks. For unevenly spaced ticks, you can list the ticks manually, such as (1 5 10 25 50 100). If you want to display text instead of numbers, you can use the VALUESDISPLAY= option, which enables you to specify arbitrary strings. You can use Unicode symbols in the VALUESDISPLAY= list.
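For example, the following sketch reuses the Function data set from the first example and displays unevenly spaced ticks on the X axis:

title "Graph of y = x*exp(x)";
proc sgplot data=Function;
   series x=x y=y;
   xaxis grid values=(-5 -3 -2 -1 -0.5 0 1);   /* unevenly spaced tick locations */
   yaxis grid;
run;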

The post Add custom tick marks to a SAS graph appeared first on The DO Loop.

March 18, 2020
 

A SAS/IML programmer asked about the best way to print multiple SAS/IML variables when each variable needs a different format. He wanted the output to resemble the "Parameter Estimates" table that is produced by PROC REG and other SAS/STAT procedures.

This article shows four ways to print SAS/IML vectors in a table in which each variable has a different format. The methods are:

  1. Use the SAS/IML PRINT statement and assign a format to each column. Optionally, you can also assign a label to each variable.
  2. Create a SAS/IML table and use the TablePrint statement. You can use the TableSetVarFormat subroutine to assign formats to columns.
  3. If you want to format the variables according to a SAS/STAT template, you can use the TablePrint subroutine and specify the name of the template.
  4. Write the vectors to a data set and use PROC PRINT.

An example table

Suppose you have computed a linear regression in the SAS/IML language. You have six vectors (each with three observations) that you want to print in table that resembles the ParameterEstimates table that is produced by PROC REG and other SAS/STAT procedures. Here are the vectors:

proc iml;
factors = {'Intercept', 'MPG_Highway','Weight'};      /* Variable */
df      = {1,           1,            1};             /* DF = degrees of freedom */
est     = {-3.73649,    0.87086,      0.00011747};    /* Estimate */
SE      = {1.25091,     0.02446,      0.0001851};     /* StdErr = standard error */
t       = {-2.99,       35.6,         0.63};          /* tValue = t statistic for testing that the parameter is zero */
pvalue  = {0.0029803,   1.36E-129,    0.525902};      /* Probt = two-sided p-value for the hypothesis test */

The following PRINT statement displays these vectors:
     PRINT factors df est SE t pvalue;
However, there are two problems with the output:

  1. The column headers are the names of the variables. You might want to display more meaningful column headers.
  2. By default, each vector is printed by using the BEST9. format. However, it can be desirable to print each vector in a different format.

For this table, the programmer wanted to use the same formats that PROC REG uses for the ParameterEstimates table: the 2.0 format for the DF column, the D11.3 format for the Estimate and StdErr columns, the 7.2 format for the tValue column, and the PVALUE6.4 format for the Probt column.

1. The PRINT statement

IML programmers would use the PRINT statement to print all six columns in a single table. You can use the LABEL= option (or L=, for short) on the PRINT statement to specify the column header. You can use the FORMAT= option (or F=, for short) to specify the format. Thus, you can use the following PRINT statement to create a table in which each column uses its own format:

/* 1: Use PRINT statement. Assign format to each column. 
      You cannot get a spanning header when you print individual columns */
print "-------- Parameter Estimates --------";
print factors[L='Variable']
      df[L='DF'        F=2.0]
      est[L='Estimate' F=D11.3]
      SE[L='StdErr'    F=D11.3]
      t[L='tValue'     F=7.2]
      pvalue[L='Probt' F=PVALUE6.4];

The result is acceptable. The PRINT statement enables you to apply formats to each column. Unfortunately, there is no way to put a spanning header that says "Parameter Estimates" across the top of the table. This example uses a separate PRINT statement to emulate a header.

2. The TablePrint statement

If you have SAS/IML 14.2 or later (or the iml action in SAS Viya 3.5 or later), you can put the variables into a SAS/IML table and use the TablePrint statement to display the table. The IML language supports the TableSetVarFormat subroutine, which you can use to set the formats for individual columns, as shown in the following statements:

/* 2: Put variables into a table and use the TABLEPRINT statement. */
Tbl = TableCreate('Variable', factors);
call TableAddVar(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                        df || est    || SE    || t     || pvalue);
call TableSetVarFormat(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                            {'2.0' 'D11.3'   'D11.3'  '7.2'    'PVALUE6.4'});
call TablePrint(Tbl) ID='Variable' label='Parameter Estimates';

You can see that the TablePrint routine does an excellent job printing the table. You can use the ID= option to specify that the Variable column should be used as row headers. The LABEL= option puts a spanning header across the top of the table.

3. TablePrint can use a template

In the previous section, I used the TableSetVarFormat function to manually apply formats to specific columns. However, if your goal is to apply the same formats that are used by a SAS/STAT procedure, there is an easier way. You can use the TEMPLATE= option on the TablePrint subroutine to specify the name of an ODS template. You need to make sure that the names of the variables in the table match the names that the template expects; if they do, the template automatically formats the variables and displays a spanning header for the table.

/* 3. An easier way to duplicate the ParameterEstimates table from a SAS/STAT procedure:
      use the TEMPLATE= option. (Make sure the names of the variables are what the template expects.) */
Tbl = TableCreate('Variable', factors);
call TableAddVar(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                        df || est    || SE    || t     || pvalue);
call TablePrint(Tbl) label='Parameter Estimates'
                     template="Stat.REG.ParameterEstimates";

Remember: You can use ODS TRACE ON or the SAS/STAT documentation to find the names of the ODS tables that a procedure creates.
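
If you have not used ODS TRACE before, the following sketch shows the idiom. (The PROC REG call on Sashelp.Class is my example, not from the original post.) While the trace is on, the log lists the name, path, and template of every table that the procedure produces, including Stat.REG.ParameterEstimates:

ods trace on;                    /* write table names and templates to the log */
proc reg data=sashelp.class;
   model weight = height;
run; quit;
ods trace off;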

4. Use PROC PRINT

If none of the previous methods satisfy your needs, you can always write the data to a SAS data set and then use Base SAS methods to display the table. For example, the following statements create a SAS data set and use PROC PRINT to display the table. The LABEL statement is used to specify the column headers. The FORMAT statement is used to apply formats to the columns:

/* 4. Write a SAS data set and use PROC PRINT */
create PE var {'factors' 'df' 'est' 'SE' 't' 'pvalue'};
append;
close;
QUIT;
 
proc print data=PE noobs label;
   label factors='Variable' df='DF' est='Estimate' SE='StdErr' t='tValue' pvalue='Probt';
   format df 2.0 est SE D11.3 t 7.2 pvalue PVALUE6.4;
   var factors df est SE t pvalue;
run;

Further reading

To learn more about SAS/IML tables and other non-matrix data structures, see Wicklin (2017), "More Than Matrices: SAS/IML Software Supports New Data Structures."

The post Print SAS/IML variables with formats appeared first on The DO Loop.

3月 162020
 

Books about statistics and machine learning often discuss the tradeoff between bias and variance for an estimator. These discussions are often motivated by a sophisticated predictive model such as a regression or a decision tree. But the basic idea can be seen in much simpler situations. This article presents a simple situation that is discussed in a short paper by Dan Jeske (1993). Namely, if a random process produces integers with a known set of probabilities, what method should you use to predict future values of the process?

I will start by briefly summarizing Jeske's result, which uses probability theory to derive the best biased and unbiased estimators. I then present a SAS program that simulates the problem and compares two estimators, one biased and one unbiased.

A random process that produces integers

Suppose a gambler asks you to predict the next roll of a six-sided die. He will reward you based on how close your guess is to the actual value he rolls. No matter what number you pick, you have only a 1/6 chance of being exactly correct. But because the reward is based on closeness to the value rolled, you can compute the expected value of the six faces, which is 3.5. Assuming that the gambler doesn't let you guess 3.5 (which is not a valid outcome), one good strategy is to round the expected value to the nearest integer. For dice, that means you would guess ROUND(3.5) = 4. Another good strategy is to randomly guess either 3 or 4 with equal probability.

Jeske's paper generalizes this problem. Suppose a random process produces the integers 1, 2, ..., N, with probabilities p1, p2, ..., pN, where the sum of the probabilities is 1. (This random distribution is sometimes called the "table distribution.") If your goal is to minimize the mean squared error (MSE) between your guess and a series of future random values, Jeske shows that the optimal solution is to guess the value that is closest to the expected value of the random variable. I call this method the ROUNDING estimator. In general, this method is biased, but it has the smallest expected MSE. Recall that the MSE is a measure of the quality of an estimator (smaller is better).

An alternative method is to randomly guess either of the two integers that are closest to the expected value, giving extra weight to the integer that is closer to the expected value. I call this method the RANDOM estimator. The random estimator is unbiased, but it has a higher MSE.

An example

The following example is from Jeske's paper. A discrete process generates a random variable, X, which can take the values 1, 2, and 3 according to the following probabilities:

  • P(X=1) = 0.2, which means that the value 1 appears with probability 0.2.
  • P(X=2) = 0.3, which means that the value 2 appears with probability 0.3.
  • P(X=3) = 0.5, which means that the value 3 appears with probability 0.5.

A graph of the probabilities is shown to the right. The expected value of this random variable is E(X) = 1(0.2) + 2(0.3) + 3(0.5) = 2.3. However, your guess must be one of the feasible values of X, so you can't guess 2.3. The best prediction (in the MSE sense) is to round the expected value. Since ROUND(2.3) is 2, the best guess for this example is 2.

Recall that an estimator for X is biased if its expected value is different from the expected value of X. Since E(X) ≠ 2, the rounding estimator is biased.

You can construct an unbiased estimator by randomly choosing the values 2 and 3, which are the two closest integers to E(X). Because E(X) = 2.3 is closer to 2 than to 3, you want to choose 2 more often than 3. You can make sure that the guesses average to 2.3 by guessing 2 with probability 0.7 and guessing 3 with probability 0.3. Then the weighted average of the guesses is 2(0.7) + 3(0.3) = 2.3, and this method produces an unbiased estimate. The random estimator is unbiased, but it will have a larger MSE.

Simulate the prediction of a random integer

Jeske proves these facts for an arbitrary table distribution, but let's use SAS to simulate the problem for the previous example. The first step is to compute the expected values of X. This is done by the following DATA step, which puts the expected value into a macro variable named MEAN:

/* Compute the expected value of X where 
   P(X=1) = 0.2
   P(X=2) = 0.3
   P(X=3) = 0.5
*/
data _null_;
array prob[3] _temporary_ (0.2, 0.3, 0.5);
do i = 1 to dim(prob);
   ExpectedValue + i*prob[i];       /* Sum of i*prob[i] */
end;
call symputx("Mean", ExpectedValue);
run;
 
%put &=Mean;
MEAN=2.3

The next step is to predict future values of X. For the rounding estimator, the predicted value is always 2. For the random estimator, let k be the greatest integer less than or equal to E(X) and let F = E(X) - k be the fractional part of E(X). To get an unbiased estimator, you can randomly choose k with probability 1-F and randomly choose k+1 with probability F. This is done in the following DATA step, which makes the predictions, generates a realization of X, and computes the residual difference for each method for 1,000 random values of X:

/* If you know mean=E(X)=expected value of X, Jeske (1993) shows that round(mean) is the best 
   MSE predictor, but it is biased.
   Randomly guessing the two closest integers is the best UNBIASED MSE predictor
   https://www.academia.edu/15728006/Predicting_the_value_of_an_integer-valued_random_variable
 
   Use these two predictors for 1000 future random variates.
*/
%let NumGuesses = 1000;
data Guess(keep = x PredRound diffRound PredRand diffRand);
call streaminit(12345);
array prob[3] _temporary_ (0.2, 0.3, 0.5);  /* P(X=i) */
 
/* z = floor(z) + frac(z) where frac(z) >= 0 */
/* https://blogs.sas.com/content/iml/2020/02/10/fractional-part-of-a-number-sas.html */
k = floor(&Mean);
Frac = &Mean - k;                        /* fractional part: distance from k to E(X) */
do i = 1 to &NumGuesses;
   PredRound = round(&Mean);             /* guess the nearest integer */
   PredRand = k + rand("Bern", Frac);    /* random guesses between k and k+1, weighted by Frac */
   /* The guesses are made. Now generate a new instance of X and compute residual difference */
   x = rand("Table", of prob[*]);
   diffRound = x - PredRound;            /* rounding estimate */
   diffRand  = x - PredRand;             /* unbiased estimate */
   output;
end;
run;
 
/* sure enough, ROUND is the best predictor in the MSE sense */
proc means data=Guess n USS mean;
   var diffRound DiffRand;
run;

The output from PROC MEANS shows the results of generating 1,000 random integers from X. The uncorrected sum of squares (USS) column shows the sum of the squared residuals for each estimator. (The MSE estimate is USS / 1000 for these data.) The table shows that the USS (and MSE) for the rounding estimator is smaller than for the random estimator. On the other hand, the mean of the residuals is not close to zero for the rounding method because that method is biased. In contrast, the mean of the residuals for the random method, which is unbiased, is close to zero.
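
If you would rather see the MSE itself instead of dividing the USS by hand, the following PROC SQL step (a small sketch of my own, not part of the original analysis) computes the mean squared residual for each method directly:

proc sql;
   select mean(diffRound*diffRound) as MSE_Round,   /* MSE for the biased rounding estimator  */
          mean(diffRand*diffRand)   as MSE_Rand     /* MSE for the unbiased random estimator  */
   from Guess;
quit;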

It might be easier to see the bias of the estimators if you look at the predicted values themselves, rather than at the residuals. The following call to PROC MEANS computes the sample mean for X and the two methods of predicting X:

/* the rounding method is biased; the random guess is unbiased */
proc means data=Guess n mean stddev;
   var x PredRound PredRand;
run;

This output shows that the simulated values of X have a sample mean of 2.34, which is close to the expected value. In contrast, the rounding method always predicts 2, so the sample mean for that column is exactly 2.0. The sample mean for the unbiased random method is 2.32, which is close to the expected value.

In summary, you can use SAS to simulate a simple example that compares two methods of predicting the value of a discrete random process. One method is biased but has the lowest MSE. The other is unbiased but has a larger MSE. In statistics and machine learning, practitioners often choose between an unbiased method (such as ordinary least squares regression) and a biased method (such as ridge regression or LASSO regression). The example in this article provides a very simple situation that you can use to think about these issues.

The post Predict a random integer: The tradeoff between bias and variance appeared first on The DO Loop.

3月 112020
 
Regular polygons approximate a circle

Recently, I saw a graphic on Twitter by @neilrkaye that showed the rapid convergence of a regular polygon to a circle as you increase the number of sides for the polygon. The author remarked that polygons that have 40 or more sides "all look like circles to me." That is, a regular polygon with a large number of sides is visually indistinguishable from a circle.

I had two reactions to the graphic. The first was "Cool! I want to create a similar figure in SAS!" (I have done so on the right; click to enlarge.) The second reaction was that the idea is both mathematically trivial and mathematically sophisticated. It is trivial because it was known to Archimedes more than 2,000 years ago and is readily understood by using high school math. But it is sophisticated because it is a particular instance of a mathematical limit, which is the foundation of calculus and the mathematical field of analysis.

The figure also demonstrates the fact that you can approximate a smooth object (a circle, a curve, a surface, ...) by a large number of small, local, linear approximations. It is not an exaggeration to say that this concept is among the most important in applied mathematics. It is used in numerical methods to solve nonlinear equations, integrals, differential equations, and more. It is used in numerical optimization. It forms the foundation of computer graphics. A standard technique in computational algorithms that involve a nonlinear function is to approximate the function by a first-order Taylor polynomial, a process known as linearization.

An approximation of pi

Archimedes used this process (local approximation by a linear function) to approximate pi (π), which is the ratio of the circumference of a circle to its diameter. He used two sets of regular polygons: one that is inscribed in the unit circle and the other that circumscribes the unit circle. The circumference of an inscribed polygon is less than the circumference of the circle. The circumference of a circumscribed polygon is greater than the circumference of the circle. They converge to a common value, which is the circumference of the circle. If you apply this process to a unit circle, you approximate the value 2π. Archimedes used the same construction to approximate the area of the unit circle, which is π.

You can use Archimedes's ideas to approximate π by using trigonometry and regular polygons, as shown in the following paragraph.

Suppose that a regular polygon has n sides. Then the central angle between adjacent vertices is θ = 2π/n radians. The following diagram illustrates the geometry of inscribed and circumscribed polygons. The right triangle that is shown is half of the triangle between adjacent vertices. Consequently,

  • The half-length of a side is b, where b = sin(θ/2) for the inscribed polygon and b = tan(θ/2) for the circumscribed polygon.
  • The height of the triangle (the apothem) is h, where h = cos(θ/2) for the inscribed polygon and h = 1 for the circumscribed polygon.
  • The circumference of the regular polygon is 2 n b.
  • The area of the regular polygon is 2 n ( b h/2 ) = n b h.
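
As a quick numeric check of these formulas (my addition, not part of the original derivation), you can evaluate them for a hexagon. For n = 6, the half-angle is θ/2 = π/6, and the inscribed and circumscribed circumferences should bracket 2π ≈ 6.283:

data _null_;
pi = constant('pi');
n = 6;
C_In  = 2*n*sin(pi/n);    /* 12*sin(pi/6) = 6.000, less than 2*pi    */
C_Out = 2*n*tan(pi/n);    /* 12*tan(pi/6) = 6.928, greater than 2*pi */
put C_In= C_Out=;
run;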

Approximate pi by using Archimedes's method

Although Archimedes did not have access to a modern computer, you can easily write a SAS DATA step program to reproduce Archimedes's approximations to the circumference and area of the unit circle, as shown below:

/* area and circumference */
data ApproxCircle;
pi = constant('pi');           /* Hah! We must know pi in order to evaluate the trig functions! */
south = 3*pi/2;                /* angle that is straight down */
do n = 3 to 100;
   angle = 2*pi/n;
   theta = south + angle/2;    /* first vertex for this n-gon */
   /* Circumference and Area for circumscribed polygon */
   b = tan(angle/2);
   h = 1;
   C_Out = 2*n * b;
   A_Out = 2*n * 0.5*b*h;
   /* Circumference and Area for inscribed polygon */
   b = cos(theta);          /* = sin(angle/2), because theta = 3*pi/2 + angle/2 */
   h = abs( sin(theta) );   /* = cos(angle/2) */
   C_In = 2*n * b;
   A_In = 2*n * 0.5*b*h;
   /* difference between the circumscribed and inscribed circles */
   CDiff = C_Out - C_In;
   ADiff = A_Out - A_In;
   output;
end;
label CDiff="Circumference Difference" ADiff="Area Difference";
keep n C_In C_Out A_In A_Out CDiff ADiff;
run;
 
/* graph the circumference and area of the n-gons as a function of n */
ods graphics / reset height=300px width=640px;
%let curveoptions = curvelabel curvelabelpos=min;
title "Circumference of Regular Polygons";
proc sgplot data=ApproxCircle;
   series x=n y=C_Out / &curveoptions;
   series x=n y=C_In  / &curveoptions;
   refline 6.2831853 / axis=y;              /* reference line at 2 pi */
   xaxis grid values=(0 to 100 by 10) valueshint;
   yaxis values=(2 to 10) valueshint label="Regular Polygon Approximation";
run;
 
title "Area of Regular Polygons";
proc sgplot data=ApproxCircle;
   series x=n y=A_Out / &curveoptions;
   series x=n y=A_In  / &curveoptions;
   refline 3.1415927 / axis=y;              /* reference line at pi */
   xaxis grid values=(0 to 100 by 10) valueshint;
   yaxis  values=(2 to 10) valueshint label="Regular Polygon Approximation";
run;

As you can see from the graph, the circumference and area of the regular polygons converge quickly to 2π and π, respectively. After n = 40 sides, the curves are visually indistinguishable from their limits, which matches what we noticed when looking at the grid of regular polygons.

The DATA step also computes the difference between the measurements of the circumscribed and inscribed polygons. You can print out a few of the differences to determine how close these estimates are to the circumference and area of the unit circle:

 
proc print data=ApproxCircle noobs;
   where n in (40, 50, 56, 60, 80, 100);
   var n CDiff ADiff;
run;

The table shows that the difference between the circumferences of the circumscribed and inscribed polygons is about 0.02 when n = 40. For n = 56, the difference is less than 0.01, which means that the circumference of a regular polygon approximates the circumference of the unit circle to two decimal places when n ≥ 56. If you use a regular 100-gon, the circumference is within 0.003 of the circumference of the unit circle. Although it is not shown in the table, you need to use 177 sides before the difference is within 0.001, meaning that a 177-gon approximates the circumference of the unit circle to three decimal places.
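
You can verify the claim about 177 sides with a short computation (a sketch of my own that extends the formulas beyond n = 100):

/* find the smallest n for which the circumference difference is less than 0.001 */
data _null_;
pi = constant('pi');
do n = 100 to 300 until (CDiff < 0.001);
   CDiff = 2*n*tan(pi/n) - 2*n*sin(pi/n);   /* circumscribed minus inscribed */
end;
put n= CDiff=;                              /* expect n=177 */
run;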

Similar results hold for the area of the polygons and the area of a unit circle.

In conclusion, not only does a regular n-gon look very similar to a circle when n is large, but you can also quantify how quickly the circumference and area of an n-gon converge to the values 2π and π, respectively. For n=56, the polygon values are accurate to two decimal places; for n=177, the polygon values are accurate to three decimal places.

Approximating a smooth curve by a series of discrete approximations is the foundation of calculus and modern numerical methods. The idea had its start in ancient Greece, but the world had to wait for Newton, Leibniz, and others to provide the mathematical machinery (limits, convergence, ...) to understand the concepts rigorously.

The post Polygons, pi, and linear approximations appeared first on The DO Loop.

3月 092020
 

In a previous article, I discussed the binormal model for a binary classification problem. This model assumes a set of scores that are normally distributed for each population, and the mean of the scores for the Negative population is less than the mean of scores for the Positive population. I showed how you can construct an exact ROC curve for the population.

Of course, in the real world, you do not know the distribution of the population, so you cannot obtain an exact ROC curve. However, you can estimate the ROC curve by using a random sample from the population.

This article draws a random sample from the binormal model and constructs the empirical ROC curve by using PROC LOGISTIC in SAS. The example shows an important truth that is sometimes overlooked: a sample ROC curve is a statistical estimate. Like all estimates, it is subject to sampling variation. You can visualize the variation by simulating multiple random samples from the population and overlaying the true ROC curve on the sample estimates.

Simulate data from a binormal model

The easiest way to simulate data from a binormal model is to simulate the scores. Recall the assumptions of the binormal model: all variables are continuous and there is a function that associates a score with each individual. The distribution of scores is assumed to be normal for the positive and negative populations. Thus, you can simulate the scores themselves, rather than the underlying data.

For this article, the scores for the Negative population (those who do not have a disease or condition) are distributed as N(0, 1). The scores for the Positive population (those who do have the disease or condition) are distributed as N(1.5, 0.75), where the second parameter is the standard deviation. The ROC curve for the population is shown to the right. It was computed by using the techniques from a previous article about binormal ROC curves.
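
As an aside (this remark and the computation are my addition, not part of the original article), the binormal model has a closed-form AUC: if the Negative and Positive scores are N(μN, σN) and N(μP, σP), then AUC = Φ( (μP - μN) / sqrt(σN² + σP²) ), where Φ is the standard normal CDF. For these parameters, the population AUC is Φ(1.5/1.25) = Φ(1.2) ≈ 0.885:

data _null_;
/* closed-form AUC for the binormal model: Phi( (mu_P - mu_N) / sqrt(sigma_N**2 + sigma_P**2) ) */
AUC = cdf("Normal", (1.5 - 0) / sqrt(1**2 + 0.75**2));   /* = Phi(1.2) */
put AUC=;                                                /* about 0.885 */
run;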

The following SAS DATA step simulates a sample from this model. It samples nN = 50 scores from the Negative population and nP = 25 scores from the Positive population. The distribution of the scores in the sample is graphed below:

%let mu_N    = 0;       /* mean of Negative population */
%let sigma_N = 1;       /* std dev of Negative population */
%let mu_P    = 1.5;     /* mean of Positive population */
%let sigma_P = 0.75;    /* std dev of Positive population */
 
%let n_N = 50;     /* number of individuals from the Negative population */
%let n_P = 25;     /* number of individuals from the Positive population */
 
/* simulate one sample from the binormal model */
data BinormalSample;
call streaminit(12345);
y = 1;             /* positive population */
do i = 1 to &n_P;
   x = rand("Normal", &mu_P, &sigma_P); output;
end;
y = 0;             /* negative population */
do i = 1 to &n_N;
   x = rand("Normal", &mu_N, &sigma_N); output;
end;
drop i;
run;
 
title "Distribution of Scores for 'Negative' and 'Positive' Samples";
ods graphics / width=480px height=360px subpixel;
proc sgpanel data=BinormalSample noautolegend;
   styleattrs datacolors=(SteelBlue LightBrown);
   panelby y      / layout=rowlattice onepanel noheader;
   inset y        / position=topleft textattrs=(size=14) separator="=";
   histogram x    / group=y;
   rowaxis offsetmin=0;
   colaxis label="Score";
run;

Create an ROC plot for a sample

You can create an ROC curve by first creating a statistical model that classifies each observation into one of the two classes. You can then call PROC LOGISTIC in SAS to create the ROC curve, which summarizes the misclassification matrix (also called the confusion matrix) at various cutoff values for a threshold parameter. Although you can create an ROC curve for any predictive model, the following statements fit a logistic regression model. You can use the OUTROC= option on the MODEL statement to write the values of the sample ROC curve to a SAS data set. You can then overlay the ROC curves for the sample and for the population to see how they compare:

/* use logistic regression to classify individuals */
proc logistic data=BinormalSample noprint;
   model y(event='1') = x / outroc=ROC1;
run;
 
/* merge in the population ROC curve */
data ROC2;
   set ROC1 PopROC;
run;
 
title "Population ROC Curve vs. Estimate";
ods graphics / width=480px height=480px;
proc sgplot data=ROC2 aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / lineattrs=(color=blue) name="est" legendlabel="Estimate";
   series x=FPR y=TPR / lineattrs=(color=black) name="pop" legendlabel="Population";
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   keylegend "est" "pop" / location=inside position=bottomright across=1 opaque;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

The graph shows the sample ROC curve (the blue, piecewise-constant curve) and the population ROC curve (the black, smooth curve). The sample ROC curve is an estimate of the population curve. For this random sample, you can see that most estimates of the true positive rate are too large (for a given value of the threshold parameter) and most estimates of the false positive rate are too low. In summary, the ROC estimate for this sample is overly optimistic about the ability of the classifier to discriminate between the positive and negative populations.

The beauty of the binormal model is that we know the true ROC curve. We can compare estimates to the truth. This can be useful when evaluating competing models: Instead of comparing sample ROC curves with each other, we can compare them to the ROC curve for the population.

Variation in the ROC estimates

If we simulate many samples from the binormal model and plot many ROC estimates, we can get a feel for the variation in the ROC estimates in random samples that have nN = 50 and nP = 25 observations. The following DATA step simulates B = 100 samples. The subsequent call to PROC LOGISTIC uses BY-group analysis to fit all B samples and generate the ROC curves. The ROC curves for five samples are shown below:

/* 100 samples from the binormal model */
%let NumSamples = 100;
data BinormalSim;
call streaminit(12345);
do SampleID = 1 to &NumSamples;
   y = 1;             /* sample from positive population */
   do i = 1 to &n_P;
      x = rand("Normal", &mu_P, &sigma_P); output;
   end;
   y = 0;             /* sample from negative population */
   do i = 1 to &n_N;
      x = rand("Normal", &mu_N, &sigma_N); output;
   end;
end;
drop i;
run;
 
proc logistic data=BinormalSim noprint;
   by SampleID;
   model y(event='1') = x / outroc=ROCs;
run;
 
title "5 Sample ROC Curves";
proc sgplot data=ROCs aspect=1 noautolegend;
   where SampleID <= 5;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

There is a moderate amount of variation among these curves. The brown curve looks different from the magenta curve, even though each curve results from fitting the same model to a random sample from the same population. The differences between the curves are due only to random variation in the data (the scores).

You can use partial transparency to overlay all 100 ROC curves on the same graph. The following DATA step concatenates the empirical estimates and the ROC curve for the population, and the subsequent graph overlays all the curves:

/* merge in the population ROC curve */
data AllROC;
   set ROCs PopROC;
run;
 
title "ROC Curve for Binormal Model";
title2 "and Empirical ROC Curves for &NumSamples Random Samples";
proc sgplot data=AllROC aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID transparency=0.9
                                lineattrs=(color=blue pattern=solid thickness=2);
   series x=FPR y=TPR / lineattrs=(color=black thickness=2);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

The graph shows 100 sample ROC curves in the background (blue) and the population ROC curve in the foreground (black). The ROC estimates show considerable variability.
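
One way to quantify that variability (a follow-up sketch of my own, not part of the original analysis) is to summarize the area under the curve (AUC) across the samples. PROC LOGISTIC reports the AUC as the "c" statistic in its Association table, which you can capture with ODS OUTPUT. The variable names Label2 and nValue2 below assume the standard two-column layout of the Association output data set:

ods exclude all;                        /* suppress the printed output for 100 fits */
proc logistic data=BinormalSim;
   by SampleID;
   model y(event='1') = x;
   ods output Association=Assoc;        /* one Association table per sample */
run;
ods exclude none;
 
proc means data=Assoc(where=(Label2="c")) n mean std min max;
   var nValue2;                         /* the AUC (c statistic) estimates */
run;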

Summary

In summary, this article shows how to simulate samples from a binormal model. You can use PROC LOGISTIC to generate ROC curves for each sample. By looking at the variation in the ROC curves, you can get a sense for how these estimates can vary due to random variation in the data.

The post ROC curves for a binormal sample appeared first on The DO Loop.