rick wicklin

3月 222023
 

A data analyst wanted to estimate the correlation between two variables, but he was concerned about the influence of a confounding variable that is correlated with them. The correlation might affect the apparent relationship between main two variables in the study. A common confounding variable is age because young people and older people tend to have different attitudes, behaviors, resources, and health issues.

This article gives a brief overview of partial correlation, which is a way to adjust or "control" for the effect of other variables in a study.

In this article, a confounding variable is a variable that is measured in the study and therefore can be incorporated into a model. This is different than a lurking variable, which is a variable that is not measured in the study. These definitions are common, but not universally used. The process of incorporating a confounding variable into a statistical model is called controlling for the variable.

Age as a confounding variable

Age is the classic confounding variable, so let's use a small SAS data set that gives the heights, weights, and ages of 19 school-age children. To make the analysis more general, I define macro variables Y1 and Y2 for the names of the two variables for which we want to find the correlation. For these data, the main variables are Y1=Height and Y2=Weight, and the confounding variable is Age.

To begin, let's compute the correlation between Y1 and Y2 without controlling for any other variables:

%let DSName = sashelp.Class;
%let Y1 = Height;
%let Y2 = Weight;
 
proc corr data=&DSName noprob plots(maxpoints=none)=scatter(ellipse=none);
   var &Y1;
   with &Y2;
run;

The output from the procedure indicates that Y1 and Y2 are strongly correlated, with a correlation of 0.88. But the researcher suspects that the ages of the students are contributing to the strong association between height and weight. You can test this idea graphically by creating a scatter plot of the data and coloring the markers according to the value of a third variable, as follows.

%let ControlVars = Age;         /* list one or more control variables */
 
title "Color Markers by a Confounding Variable";
proc sgplot data=&DSName;
   scatter x=&Y1 y=&Y2 / colorresponse=%scan(&ControlVars, 1)   /* color by first control var */
                         markerattrs=(symbol=CircleFilled size=14) FilledOutlinedMarkers
                         colormodel=TwoColorRamp;
run;

The graph shows that low values of height and weight are generally associated with low values of age, which are light blue in color. Similarly, high values of height and weight are generally associated with high values of age, which are colored dark blue. This indicates that the age variable might be affecting the correlation, or even might be responsible for it. One way to "adjust" for the age variable is to use partial correlation.

What is partial correlation?

A partial correlation is a way to adjust a statistic to account for one or more additional covariates. The partial correlation between variables Y1 and Y2 while adjusting for the covariates X1, X2, X3, ... is computed as follows:

  1. Regress Y1 onto the covariates and calculate the residuals for the model. Let R1 be the variable that contains the residuals for the model.
  2. Regress Y1 onto the covariates. Let R2 be the column of residuals.
  3. Compute the correlation between R1 and R2. This is the partial correlation between Y1 and Y2 after adjusting for the covariates.

How to compute partial correlation in SAS

SAS provides an easy way to compute the partial correlation. PROC CORR supports the PARTIAL statement. On the PARTIAL statement, you list the covariates that you want to account for. PROC CORR automatically computes the regression models and provides the correlation of the residuals. In addition to the partial Pearson correlation, PROC CORR can report the partial versions of Spearman's rank correlation, Kendall's association, and the partial variance and standard deviation. The partial mean is always 0, so it is not reported.

The following call to PROC CORR shows the syntax of the PARTIAL statement:

proc corr data=&DSName plots(maxpoints=none)=scatter(ellipse=none);
   var &Y1;
   with &Y2;
   partial &ControlVars; 
run;

The output includes tables and a scatter plot of the residuals of Y2 versus the residuals of Y1. Only the graph is shown here. The graph includes an inset that tells you that the partial correlation is 0.7, which is less than the unadjusted correlation (0.88). Although the axes are labeled by using the original variable names, the quantities that are graphed are residuals from the two regressions. Notice that both axes are centered around zero, and the range of each axis is different from the original range.

How should you interpret this graph and the partial correlation statistic? The statistic tells you that after adjusting for age, the correlation between the heights and weights of students is about 0.7. The graph shows the scatter plot of the residual values of the heights and weights after regressing those variables onto the age variable.

How to manually calculate partial correlations

You can obtain the partial correlation manually by performing each step of the computation. This is not necessary in SAS, but it enables you to verify the computations that are performed by the PARTIAL statement in PROC CORR. To verify the output, you can manually perform the following steps:

  1. Use PROC REG to regress Y1 onto the covariates. Use the OUTPUT statement to save the residuals for the model.
  2. Use PROC REG to regress Y2 onto the covariates. Save the residuals for this model, too.
  3. Merge the two output data sets.
  4. Use PROC CORR to compute the (raw) correlation between the residual variables. This is the partial correlation between Y1 and Y2 after adjusting for the covariates.

The following SAS statements perform this analysis:

/* 1. Regress Y1 onto covariates and save residuals */
proc reg data=&DSName noprint;
   model &Y1 = &ControlVars;
   output out=RegOut1 R=Part_&Y1;
quit;
/* 2. Regress Y2 onto covariates and save residuals */
proc reg data=&DSName noprint;
   model &Y2 = &ControlVars;
   output out=RegOut2 R=Part_&Y2;
quit;
/* 3. Merge the two sets of residuals */
data RegOut;
merge RegOut1(keep=&ControlVars Part_&Y1)
      RegOut2(keep=Part_&Y2);
label Part_&Y1 = "Residual of &Y1 ~ (&ControlVars)"
      Part_&Y2 = "Residual of &Y2 ~ (&ControlVars)";
run;
/* 4. Display the correlation between the residuals */
proc corr data=RegOut plots=scatter(ellipse=none);
   var Part_&Y1;
   with Part_&Y2;
run;

As expected, the graph of the residual values is the same graph that was created by using the PARTIAL statement. As expected, the inset shows that the correlation between the residuals is the same value that was reported by using the PARTIAL statement.

Summary

It can be useful to account for or "adjust" a statistic to account for other variables that might be strongly correlated with the main variables in the analysis. One way to adjust for confounding variables is to regress the main variables onto the confounding variables and look at the statistic for the residual values. This is called a "partial" statistic and is supported in SAS by using the PARTIAL statement in several procedures. This article shows how to use the PARTIAL statement in PROC CORR to compute the partial correlation. It also shows how to manually reproduce the computation by explicitly performing the regressions and calculating the correlation of the residuals.

Appendix: PROC REG also provides (squared) partial correlations

In SAS, there are often multiple ways to get the same statistic. It turns out that you can get the SQUARED Pearson partial correlation from PROC REG by using the PCORR2 option on the MODEL statement. I don't often use this option, but for completeness, the following statements compute the squared partial correlation between Y2 and Y1:

proc reg data=&DSName plots=none;
   model &Y2 = &Y1 &ControlVars / pcorr2;
   ods select ParameterEstimates;
quit;

The "Type II Squared Partial Correlation" between Y2 and Y1 is 0.49656. If you take the square root of that number, you get ±0.7047. The tabular output from PROC REG does not enable you to determine whether the partial correlation is positive or negative.

The post Partial correlation: controlling for confounding variables appeared first on The DO Loop.

3月 202023
 

In a previous article about Markov transition matrices, I mentioned that you can estimate a Markov transition matrix by using historical data that are collected over a certain length of time. A SAS programmer asked how you can estimate a transition matrix in SAS. The answer is that you can use PROC FREQ to tabulate the number of transitions from one state to another. The procedure outputs counts and row percentages, either of which can be used to construct an estimate of the transition matrix.

Transition of families through economic classes

Let's start with some data. A charity provides education, healthcare, and agricultural assistance to an impoverished town in Central America. The charity wants to estimate the transition of families through various economic categories (or states) based on the total family income:

  • State 1 is used for the poorest families who earn less than $2 per day per person.
  • State 2 is for families who earn between $2-$10 per day per person.
  • State 3 is for families who earn between $10-$20 per day per person.
  • State 4 is for families who earn more than $20 per day per person.

The charity has access to the economic status of 160 families who have been in the program for five years. The following SAS DATA step defines the beginning and ending states for these families:

/* State 1: families who earn less than $2 per day
   State 2: families who earn between $2-$10 per day
   State 3: families who earn between $10-$20 per day
   State 4: families who earn more than $20 per day
*/
data History;
retain ID;
input BeginState EndState @@;
ID + 1;
datalines;
1 2  2 2  1 3  3 3  3 3  3 3  1 1  3 2  4 4  3 3 
4 4  1 1  3 2  1 1  1 3  3 3  2 2  2 2  2 2  3 2 
2 3  1 3  1 1  1 2  4 3  1 1  3 4  1 3  3 3  1 2 
1 2  3 3  1 3  3 4  2 2  1 2  3 2  1 2  1 1  3 2 
1 3  1 1  1 1  1 1  1 3  1 3  3 3  1 1  2 2  4 4 
1 1  2 3  1 1  1 2  2 2  2 2  1 3  2 2  1 1  1 2 
3 3  1 3  4 4  1 3  3 4  1 1  1 2  2 2  1 2  3 2 
1 1  3 3  3 3  1 2  1 1  1 2  3 3  2 2  1 3  3 2 
1 1  1 2  1 1  4 2  1 2  1 3  1 2  1 1  2 1  1 1 
2 3  1 2  2 2  1 1  3 4  1 1  1 1  2 2  3 3  4 3 
3 2  4 3  1 1  2 1  2 3  2 2  1 2  4 4  1 2  2 1 
2 1  2 2  1 1  2 3  4 4  1 2  1 1  2 2  1 2  4 2 
1 1  2 2  1 1  1 2  1 1  2 1  2 1  1 2  2 3  2 2 
3 3  4 3  1 1  2 2  1 1  2 1  1 1  2 2  1 1  1 1 
1 1  2 2  1 1  3 2  1 3  3 2  3 3  4 4  1 1  4 2 
3 3  4 4  3 2  4 4  2 2  1 3  3 3  4 4  4 3  1 2 
;
 
proc print data=History(obs=10) noobs;
run;

The output from PROC PRINT shows the beginning and ending states for 10 families. The first family was in State 1 at the beginning of the program but was in State 2 at the end of the program. The second family was in State 2 at the beginning and remained there. The third family was in State 1 at the beginning of the program but was in State 3 at the end, and so forth. You can use PROC FREQ to tabulate the matrix of counts for the transitions from one category into another, as follows:

proc freq data=History ;
   tables BeginState * EndState / out=freqOut sparse nocol nopercent outpct;
run;

The output shows the counts and the row percentages for the data. The first row of the output is for the 75 families who started the program in State 1. Of those families, 37 (49%) remained in State 1 at the end of the program, 23 (31%) had progressed to State 2, and 15 (20%) had progressed to State 3. The second row of the output is for the 35 families who started the program in State 2. Of those families, 7 (20%) regressed to State 1, 22 (63%) remained in State 2, and 6 (17%) advanced to State 3. The other rows of the output are interpreted similarly.

You can use the row percentages to estimate the transition matrix. Merely divide each percentage by 100 to obtain a proportion. The proportion in the (i,j)th cell estimates the probability that a family that was in State i at the beginning of the program is in State j at the end of the program.

Reading the probabilities into a transition matrix

Most SAS programmers use SAS IML software to work with Markov transition matrices. The output from PROC FREQ is in "long form" in a data set that has 16 rows. You can read the estimates into a SAS IML vector and then reshape them into a 4 x 4 matrix. You can create the matrix in two ways: you can read the raw counts into a matrix and then divide each row by the row sum, or you can read the row percentages directly and then divide by 100 to obtain probabilities.

proc iml;
use freqOut;
   read all var {'BeginState' 'EndState' 'Count' 'Pct_Row'};  /* read the states, counts, and row percentages */
close;
 
N = sqrt(nrow(Count));    /* this should be an integer or else something is wrong */
names = EndState[1:N];    /* there should be N unique states */
 
/* estimate transition matrix by using counts of transitions */
C = shape(Count, N, N);   /* matrix of raw counts for each transition */
M = C / C[,+];            /* divide each cell by total counts for the row */
print M[r=names c=names];
 
/* or read the PCT_ROW column directly */
M = shape(Pct_Row, N, N);       /* raw counts */
M = M / 100;                    /* convert from percentages to proportions */
print M[r=names c=names];

The matrix of counts shows that 23+15+6+4 = 48 out of 160 families improved their states during the program. Only 7+11+3+5 = 26 families ended the program in a worse state than they began.

The estimates for the transition probability matrix are the same for both calculations, so only one output is shown. For the poorest families (the first row), about 50% did not improve their economic state whereas the other 50% did. For the families that began in State 2, 20% slipped back into extreme poverty (unfortunately), 63% stayed in that state, and 17% increased their state. The remaining rows have similar interpretations.

Summary

This article shows how to use PROC FREQ in SAS to construct an estimate of a Markov transition matrix from historical data. For each subject in the study, you need to know the state of the subject at the beginning and at the end of a time period. You can then construct a matrix of counts for the transition of subjects between states. By dividing each row by its row sum, you obtain empirical probability estimates.

The post Estimate a Markov transition matrix from historical data appeared first on The DO Loop.

3月 152023
 

Most homeowners know that large home improvement projects can take longer than you expect. Whether it's remodeling a kitchen, adding a deck, or landscaping a yard, big projects are expensive and subject to a lot of uncertainty. Factors such as weather, the availability of labor, and the supply of materials, can all contribute to uncertainty in the duration of a project.

If you ask an expert (for example, an experienced general contractor) to predict the length of time required to complete a project, he or she might respond as follows:

  • 10% of the time, a project like this takes 17 work days or less.
  • 50% of the time, we can complete it in 24 work days or less.
  • 90% of the time, we finish the work by the 35th work day.

Whether the contractor realizes it or not, these sentences are estimating three quantiles of a continuous probability distribution for the completion of the project.

Now, if you are planning a garden party on your new deck, you don't want to schedule the party before the deck is finished. One way to predict when you can have the party is to fit a probability distribution to the contractor's estimates. But what distribution should you use? The contractor's estimates do not describe a symmetric distribution, such as the normal distribution. Without domain-specific knowledge, there is no reason to prefer one distribution over the other. Furthermore, you don't have any real data to fit the model, you have only the three estimated quantiles.

This is an excellent opportunity to use the flexible metalog family of distributions. This article shows how to convert the expert's estimates into a distribution that you can compute with and simulate data from.

Create a distribution from expert opinion

In this example, the expert predicted the quantiles (completion time) for three cumulative probabilities (10%-50%-90%). Mathematically, the expert has provided three points that are on (or near) the cumulative distribution for the completion time. Equivalently, the expert has provided three points on (or near) the quantile function. It is easy to fit a model to these values by using the metalog distribution. As discussed in a previous article, SAS supports the metalog distribution through a package of SAS IML functions that you can download from GitHub.

The following program assumes that you have downloaded and installed the metalog functions by using the instructions in the previous article or by following the instructions in Appendix A of the documentation for the metalog package, which is in the file MetalogDoc.pdf. This example uses a three-term metalog model. I assume that the project is never completed in less than 10 days, so I fit a semibounded model that has support on the interval [10, ∞). You can use the ML_CreateFromCDF function to create a metalog object from the estimates. (An estimate of this type is called a symmetric-percentile triplet (SPT) because the quantile are provided for the probabilities (α, 1/2, 1-α), where 0 < α < 1/2. In this example, α = 0.1.)

proc iml;
/* This LOAD statement assumes that the metalog functions are defined as in  Appendix A of the documentation.
   See the MetalogDoc.pdf file at https://github.com/sassoftware/sas-iml-packages/tree/main/Metalog */
load module=_all_;         
 
prob = {0.1, 0.5, 0.90};   /* cumulative probabilities; note symmetric-percentile triplet */
x    = {17,   24,  35};    /* associated quantiles */
 
order = 3;                 /* three terms in model */
bounds = {10 .};           /* support on [10, infinity) */
 
SPT = ML_CreateFromCDF(x, prob, order, bounds );  /* create model from SPT */
title "3-Term Metalog Model of Expert Opinion";
p = do(0.01, 0.99, 0.01);
run ML_PlotECDF(SPT, p);  /* graph the model CDF and the expert's estimates */

For these three data points, a three-term metalog model passes through the expert's estimates, so the model is a perfect fit. If you prefer to view the probability density for the model, you can use the ML_PlotPDF call, as follows:

title "PDF of 3-Term Metalog Model of Completion Times";
run ML_PlotPDF(SPT, p);

The PDF indicates the most probable duration of the project (the mode) is about 20-25 days.

The expert gave estimates for three probability values. But what if you want estimates for other probabilities? For example, let's calculate dates for which the model predicts a 95% and a 99% chance that the deck will be completed. We can get those estimates by evaluating the quantile function of the metalog model. Let's evaluate the quantile function at 0.95 and 0.99, as well as at a few other values:

/* estimates for other percentiles */
p = {0.05, 0.25, 0.75, 0.95, 0.99};
estTime = ML_Quantile(SPT, p);
print p[F=PERCENT9.] estTime[F=6.2];

According to the model, there is a 95% chance that the project will be finished by the 40th day. There is a 99% chance that the project will be completed within 56 days. You can use those estimates to schedule a party on your new deck!

Summary

The metalog distribution can be used to construct a probability distribution from an expert's opinion. This example used three estimates for the completion of a home-improvement project. The same technique enables you to fit a metalog distribution to four, five, or more quantile estimates from an expert. The metalog model provides an easy way to convert an expert's opinion into a probability distribution that can be used for simulation and modeling.

The post Fitting a distribution to an expert's opinion: An application of the metalog distribution appeared first on The DO Loop.

3月 132023
 

A previous article describes the metalog distribution (Keelin, 2016). The metalog distribution is a flexible family of distributions that can model a wide range of shapes for data distributions. The metalog system can model bounded, semibounded, and unbounded continuous distributions. This article shows how to use the metalog distribution in SAS by downloading a library of SAS IML functions from a GitHub site.

The article contains the following sections:

  • An overview of the metalog library of SAS IML functions
  • How to download the functions from GitHub and include the functions in a program
  • How to use a metalog model to fit an unbounded model to data
  • A discussion of infeasible models

For a complete discussion of the metalog functions in SAS, see "A SAS IML Library for the Metalog Distribution" (Wicklin, 2023).

The metalog library of functions in SAS

The metalog library is a collection of SAS IML modules. The modules are object-oriented, which means that you must first construct a "metalog object," usually from raw data. The object contains information about the model. All subsequent operations are performed by passing the metalog object as an argument to functions. You can classify the functions into four categories:

  • Constructors create a metalog object from data. By default, the constructors fit a five-term unbounded metalog model.
  • Query functions enable you to query properties of the model.
  • Computational functions enable you to compute the PDF and quantiles of the distribution and to generate random variates.
  • Plotting routines create graphs of the CDF and PDF of the metalog distribution. The plotting routines are available when you use PROC IML.

Before you create a metalog object, you must choose how to model the data. There are four types of models: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. Then, you must decide the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k > 6) can fit multimodal data distributions, but be careful not to overfit the data by choosing k too large.

The most common way to create a metalog object is to fit a model to the empirical quantiles of the data, which estimates the k parameters in the model. For an unbounded metalog model, the quantile function is
\( Q(p) = a_1 + a_2 L(p) + a_3 c(p) L(p) + a_4 c(p) + a_5 c(p)^2 + a_6 c(p)^2 L(p) + a_7 c(p)^3 + \ldots \)
where L(p) = Logit(p) is the quantile function of the logistic distribution and c(p) = p – 1/2 is a linear function. You use ordinary least squares regression to fit the parameters (\(a_1, a_2, \ldots, a_k\)) in the model.

The semibounded and bounded versions are metalog models of log-transformed variables.

  • If X > L, fit a metalog model to Z = log(X – L)
  • If L < X < U, fit a metalog model to Z = log((X – L) / (U – X))

How to download the metalog functions from GitHub

The process of downloading SAS code from a GitHub site is explained in the article, "How to use Git to share SAS programs." The process is also explained in Appendix A in the metalog documentation. The following SAS program downloads the metalog functions to the WORK libref. You could use a permanent libref, if you prefer:

/* download metalog package from GitHub */
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-iml-packages;  /* clone repo to WORK, or use permanent libref */
 
/* clone repository; if repository exists, skip download */
data _null_;
if fileexist("&repoPath.") then 
   put 'Repository already exists; skipping the clone operation'; 
else do;
   put "Cloning repository 'sas-iml-packages'";
   rc = gitfn_clone("https://github.com/sassoftware/sas-iml-packages/", "&repoPath." ); 
end;
run;
 
/* Use %INCLUDE to read source code and STORE functions to current storage library */
proc iml;
%include "&repoPath./Metalog/ML_proc.sas";   /* each file ends with a STORE statement */
%include "&repoPath./Metalog/ML_define.sas";
quit;

Each file ends with a STORE statement, so running the above program results in storing the metalog functions in the current storage library. The first file (ML_PROC.sas) should be included only when you are running the program in PROC IML. It contains the plotting routines. The second file (ML_define.sas) should be included in both PROC IML and the iml action.

Fit an unbounded metalog model to data

The Sashelp.heart data set contains data on patients in a heart study. The data set contains a variable (Systolic) for the systolic blood pressure of 5,209 patients. The following call to PROC SGPLOT creates a histogram of the data and overlays a kernel density estimate.

title 'Systolic Blood Pressure';
proc sgplot data=sashelp.heart;
   histogram Systolic;
   density Systolic / type=kernel;
   keylegend /location=inside;
run;

The kernel density estimate (KDE) is a nonparametric model. The histogram and the KDE indicate that the data are not normally distributed, and it is not clear whether the data can be modeled by using one of the common "named" probability distributions such as gamma or lognormal. In situations such as this, the metalog distribution can be an effective tool for modeling and simulation. The following SAS IML statements load the metalog library of functions, then fit a five-term unbounded metalog distribution to the data. After you fit the model, it is a good idea to run the ML_Summary subroutine, which displays a summary of the model and displays the parameter estimates.

proc iml;
load module=_all_;     /* load the metalog library */
use sashelp.heart;
   read all var "Systolic";
close;
 
MLObj = ML_CreateFromData(Systolic, 5);  /* 5-term unbounded model */
run ML_Summary(MLObj);

The ML_Summary subroutine displays two tables. The first table tells you that the model is a 5-term unbounded model, and it is feasible. Not every model is feasible, which is why it is important to examine the summary of the model. (Feasibility is discussed in a subsequent section.) The second table provides the parameter estimates for the model.

The MLObj variable is an object that encapsulates relevant information about the metalog model. After you create a metalog object, you can pass it to other functions to query, compute, or plot the model. For example, in PROC IML, you can use the ML_PlotECDF subroutine to show the agreement between the cumulative distribution of the model and the empirical cumulative distribution of the data, as follows:

title '5-Term Metalog Model for Systolic Blood Pressure';
run ML_PlotECDF(MLObj);

The model is fit to 5,209 observations. For this blog post, I drew the CDF of the model in red so that it would stand out. The model seems to fit the data well. If you prefer to view the density function, you can run the statement
    run ML_PlotPDF(MLObj);
Again, note that the MLObj variable is passed as the first argument to the function.

One way to assess the goodness of fit is to compare the sample quantiles to the quantiles predicted by the model. You can use the ML_Quantile function to obtain the predicted quantiles, as follows:

/* estimates of quantiles */
Prob = {0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99};
call qntl(SampleEst, Systolic, Prob);   /* sample quantiles */
ModelEst = ML_Quantile(MLObj, Prob);    /* model quantiles */
print Prob SampleEst ModelEst;

The model quantiles agree well with the sample quantiles. Note that the median estimate from the model (the 0.5 quantile) is equal to the a1 parameter estimate. This is always the case.

An advantage of the metalog model is that it has an explicit quantile function. As a consequence, it is easy to simulate from the model by using the inverse CDF method. You can use the ML_Rand function to simulate from the model, as follows:

/* simulate 500 new blood pressure values from the model */
call randseed(1234);
title 'Simulated Systolic Values';
BP = ML_Rand(MLObj, 500);
call histogram(BP);

Notice that the shape of this histogram is similar to the histogram of the original data. The simulated data seem to have similar properties as the original data.

The metalog documentation provides the syntax and documentation for every public method in the library.

Infeasible metalog functions

As discussed in a previous article, not every metalog model is a valid distribution. An infeasible model is one for which the quantile function is not monotone increasing. This can happen if you have a small data set that has extreme outliers or gaps, and you attempt to fit a metalog model that has many parameters. The model will overfit the data, which results in an infeasible model.

I was unable to create an infeasible metalog model for the blood-pressure data, which has thousands of observations and no extreme observations. However, if I use only the first 20 observations, I can demonstrate an infeasible model by fitting a six-term metalog model. The following SAS IML statements create a new data vector that contains only 20 observations. A six-term metalog model is fit to those data.

/* What if we use a much smaller data set? Set N = 20. */
S = Systolic[1:20];               /* N = 20 */
MLObj = ML_CreateFromData(S, 6);  /* 6-term model */
run ML_Summary(MLObj);

As shown by the output of the ML_Summary call, the six-term metalog model is not feasible. The following graph shows the quantile function of the model overlaid on the empirical quantiles. Notice that the quantile function of the model is not monotone increasing, which explains why the model is infeasible. If you encounter this situation in practice, Keelin (2016) suggests reducing the number of terms in the model.

Summary

This article describes how to use the metalog distribution in SAS. The metalog distribution is a flexible family that can model a wide range of distributional shapes. You can download the Metalog package from the SAS Software GitHub site. The package contains documentation and a collection of SAS IML modules. You can call the modules in a SAS IML program to fit a metalog model. After fitting the model, you can compute properties of the model and simulate data from the model. More information and examples are available in the documentation of the Metalog package.

The post Use the metalog distribution in SAS appeared first on The DO Loop.

3月 082023
 

Happy Pi Day! Every year on March 14th (written 3/14 in the US), people in the mathematical sciences celebrate "all things pi-related" because 3.14 is the three-decimal approximation to π ≈ 3.14159265358979....

Modern computer methods and algorithms enable us to calculate 100 trillion digits of π. However, I think it is fascinating to learn about how mathematicians in the pre-computer age were able to approximate π to many decimal places through cleverness and manual paper-and-pencil calculations.

One of best mathematicians of all time was Isaac Newton, who calculated 16 digits of π by hand way back in 1666. Newton accomplished this by combining geometry with several mathematical techniques that were new at the time. These techniques included the new field of integral calculus (which Newton himself invented), and an early use of a truncated infinite series to approximate a function. Newton's formulation of the problem is understandable by any high-school geometry student. Although the manual calculations are tedious, they can be understood by anyone who has taken calculus. In this article, I use a computer to simplify the long calculations in Newton's approximation. The computer calculation avoids the infinite series and provides a 14-decimal approximation to π.

I learned about Newton's approximation from Matt Parker and the "Think Maths" group in the UK, who made a wonderful video about Newton's approximation. They also provide resources for math teachers who want to present this topic to their students.

The geometry of Newton's calculation

The figure at the right shows the geometry behind Newton's calculation of pi. First, draw a semicircle of radius 1/2 centered at the point C(1/2, 0) in the Cartesian plane. Then draw a vertical line segment from A(1/4, 0) until it hits the circle at B. High school geometry shows that the triangle ABC is a 30-60-90 triangle, and the line segment AB intersects the circle at B(1/4, sqrt(3)/4).

Newton knew the following facts:

  1. The angle OCB is 60 degrees. Therefore, the area of the sector OCB is π/6 r2, where r is the radius of the circle. Because the radius is r = 1/2, the area of the sector OCB is π / 24.
  2. The area of the triangle ABC is easily found to be (1/2)*base*height = (1/2)(1/4)(sqrt(3)/4) = sqrt(3)/32. Newton could calculate this quantity to an arbitrary number of decimal places because the square root algorithm was known by the ancient Babylonians and by the Greeks.
  3. The area of the crosshatched region (denoted M) can be found by calculating a definite integral. The equation of the semicircle is \(f(x) = \sqrt{\tfrac{1}{4} - (x-\tfrac{1}{2})^2}\), which simplifies to \(f(x) = \sqrt{x} \sqrt{1-x}\). Therefore, the area can be calculated as \( M = \int_0^{1/4} \sqrt{x} \sqrt{1-x} \, dx\).

From the figure, we see that Area(OCB) = M + Area(ABC). Plugging in the formulas for these quantities, we get
π /24 = M + sqrt(3)/32. Multiplying both sides by 24 gives the equation that Newton used to approximate π:
      π = 24*M + 3 sqrt(3)/4

Newton approximated M to many decimal places, thus providing a decimal approximation for π. Newton used a truncated infinite series expansion to approximate M, but the next section shows an alternative approach: you can use numerical integration to approximate M.

A modern version of Newton's approximation

By using modern computational techniques, you can evaluate the integral (M) numerically. If you evaluate M to high precision, you get a high-precision approximation for π.

Most modern computer languages include a function to evaluate a definite interval on a closed interval. The SAS/IML language provides the QUAD subroutine, which enables you to compute M to many decimal digits by using numerical quadrature (integration). Here is the SAS/IML program that carries out Newton's approximation of pi:

/* Newton's approximation of pi. For details, see https://think-maths.co.uk/newton-pi */
proc iml;
/* integrand is f(x) = sqrt(0.25 - (x-0.5)##2)
                     = sqrt(x*(1-x)) */
start SemiCircle(x);
   return( sqrt(x - x##2) );
finish;
 
/* numerical approximation of M = area under semicircle on [0, 1/4] */
call quad(M, "SemiCircle", {0, 0.25}) EPS=1E-15;
 
/* AreaSector = pi/6 * r^2 = pi / 24 
   Newton used AreaSector = M + AreaTri, so
     pi / 24 = M + AreaTri
     pi = 24*(M + AreaTri)
*/
AreaTri = 1/2 * (1/4) * (sqrt(3) / 4); /* area of triangle ABC */
piApprox = 24*(M + AreaTri);
print piApprox[F=16.14 L="Newton's Approx"];

The program outputs an approximation to pi that is accurate to 14 decimal digits. This value is the standard double-precision floating-point approximation to pi. The program gives the same result as the SAS function constant('pi').

Newton's approximation by using infinite series

Notice that my approximation only captures the value of pi to 14 decimal digits, whereas Newton achieved 16-digit accuracy. How did Newton get more accuracy? He didn't use a numerical method for integrating the integral. Instead, he used the binomial expansion of the term sqrt(1 - x), which replaces the term by an infinite series. He kept 22 terms of the infinite series and integrated (by hand!) the resulting expression.

I am not going to repeat Newton's tour de force calculation, but you can watch a video in which the Think Maths group in the UK repeats Newton's calculation by hand using long division and extremely large pieces of paper! The group also created a document that shows the details. The main idea is to recall the binomial expansion:
\( (1+z)^{\frac {1}{2}} = 1 +{\tfrac {1}{2}}z -{\tfrac {1}{8}}z^{2} +{\tfrac {1}{16}}z^{3} -{\tfrac {5}{128}}z^{4} +{\tfrac {7}{256}}z^{5} -\cdots = \sum _{n=0}^{\infty }{\frac {(-1)^{n-1}(2n)!}{4^{n}(n!)^{2}(2n-1)}}z^{n} \)

In this case, we want to replace z by (-x), which changes the sign of terms that contain odd powers of z. The result is an infinite series in which each term (except the first) has a negative sign:

\( (1-x)^{\frac {1}{2}} = 1 -{ \tfrac {1}{2}}x -{\tfrac {1}{8}}x^{2} -{\tfrac {1}{16}}x^{3} -{\tfrac {5}{128}}x^{4} -{\tfrac {7}{256}}x^{5} -\cdots \)

Newton kept 22 terms of this infinite series and integrated the function term-by-term on the interval [0, 1/4]. The result is 22 fractions that can be evaluated by long division. However, the numerator and denominator of the higher-order terms are HUGE! For example, the 20th term is the fraction 119,409,675 / (41*275). Although, in principle, you could calculate Newton's approximation by hand (and, remember, he did!), it would not be a pleasant way to spend an evening.

Summary

In 1666, Newton computed 16 digits of π by constructing a geometric figure in which the value of pi is related to the numerical approximation of an integral. Newton approximated the integral by expanding a function in an infinite series, truncating the series at 22 terms, and evaluating each term by using long division. Although the formulation and general idea is accessible to calculus students, the calculations are long and tedious. Instead, you can use modern numerical methods to evaluate the integral, thus computing a 14-digit approximation to π.

References

The post How Newton calculated pi to 16 decimal places appeared first on The DO Loop.

3月 062023
 

Undergraduate textbooks on probability and statistics typically prove theorems that show how the variance of a sum of random variables is related to the variance of the original variables and the covariance between them. For example, the Wikipedia article on Variance contains an equation for the sum of two random variables, X and Y:
\( \operatorname {Var} (X+Y)=\operatorname {Var} (X)+\operatorname {Var} (Y)+2\,\operatorname {Cov} (X,Y) \)

A SAS programmer wondered whether equations like this are also true for vectors of data. In other words, if X and Y are columns in a data set, do the sample variance and covariance statistics satisfy the same equation? How can you use SAS to verify (or disprove) the equation for a sample of data? This article shows how to use SAS to verify that the equation (and another similar equation) is valid for vectors of data.

It is possible to verify the equation by using PROC CORR and the DATA step, but it is much simpler to use SAS IML software because the IML language enables you to directly access cells of a variance-covariance matrix.

Create the sum of columns

Let's start by choosing some data to analyze. The following DATA step renames some variables in the Sashelp.Iris data set to X1, X2, and X3. The program also creates the variables Y12 = X1 + X2, Y13 = X1 + X3, and Y23 = X2 + X3. The program then reads the data into SAS IML matrices and computes the variance-covariance matrix for the original variables (X) and their pairwise sums (Y):

/* verify variance of a sum (or linear combination) of variables */
data Want;
set sashelp.iris(where=(Species="Versicolor"));
X1 = PetalLength;
X2 = PetalWidth;
X3 = SepalLength;
Y12 = X1 + X2;
Y13 = X1 + X3;
Y23 = X2 + X3;
keep X1-X3 Y12 Y13 Y23;
run;
 
proc iml;
use Want;
   read all var {'X1' 'X2' 'X3'} into X[c=XNames];
   read all var {'Y12' 'Y13' 'Y23'} into Y;
close;
YNames = {'X1+X2' 'X1+X3' 'X2+X3'};
 
SX = cov(X);
SY = cov(Y);
print SX[c=XNames r=XNames F=best5. L='Cov(X)'],
      SY[c=YNames r=YNames F=best5. L='Cov(Sums)'];

The variance of a sum

Let's use this information to verify the identity
\( \operatorname {Var} (X1+X2)=\operatorname {Var} (X1)+\operatorname {Var} (X2)+2\,\operatorname {Cov} (X1,X2) \)

From the displayed (rounded) values of the covariance matrices, you can mentally calculate that the equation could be true. The variances of the original variables are along the diagonal of the first matrix, and the covariances are the off-diagonal elements. The variance of the sum is the [1,1] cell of the second matrix. A little mental arithmetic indicates that 40.6 ≈ 22 + 4 + 2*7.

The following SAS IML statements extract the relevant values from cells in the variance-covariance matrices. The program then subtracts the right side of the equation from the left side. If the difference is 0, then the equation is verified:

/* Confirm that the sample variance of (x1+x2) satisfies
      Var(x1 +x2) = Var(x1) + Var(x2) + 2*Cov(x1, x2)
   See https://en.wikipedia.org/wiki/Variance#Propagation
*/
VarY12= SY[1,1];
VarX1 = SX[1,1];
VarX2 = SX[2,2];
Cov12 = SX[1,2];
Eqn1 = VarY12 - (VarX1 + VarX2 + 2*Cov12);
print Eqn1;

Success! The left and right sides of the equation are equal to numerical precision. This validates the equation for the data we are using. Notice that this same technique could be used to analyze the variance of a general linear combination of other variables.

The covariance of sums

Let's verify one more equation. The Wikipedia article about Covariance states the following formula for four random variables, X, Y, Z, and W:
\( \operatorname {Cov} (X+Y,Z+W)=\operatorname {Cov} (X,Z)+\operatorname {Cov} (X,W)+\operatorname {Cov} (Y,Z)+\operatorname {Cov} (Y,W) \)

Since our example data only has three variables, let's simplify the equation by setting W=X. Then (changing variable names) the equation we want to verify is \( \operatorname {Cov} (X1+X2,X1+X3)=\operatorname {Cov} (X1,X3)+\operatorname {Cov} (X1,X1)+\operatorname {Cov} (X2,X3)+\operatorname {Cov} (X1,X2) \)

The following SAS IML statements extract the relevant cells from the covariance matrices. The program subtracts the right side of the equation from the left side and prints the difference.

/* Confirm that the sample covariance of (x1+x2) and (x1 + x3) satisfies
     Cov(x1+x2, x1+x3) = Cov(x1, x1) + Cov(x1, x2) + Cov(x1, x3) + Cov(x2, x3)
   See https://en.wikipedia.org/wiki/Covariance#Covariance_of_linear_combinations
*/
CovY12_Y13 = SY[1,2];
Cov11 = SX[1,1];      /* = Var(X1) */
Cov12 = SX[1,2];
Cov13 = SX[1,3];
Cov23 = SX[2,3];
Eqn2 = CovY12_Y13 - (Cov11 + Cov12 + Cov13 + Cov23);
print Eqn2;

Once again, the difference is essentially zero, which validates the equation.

Summary

A SAS programmer asked whether she could use SAS to verify certain equations. She wanted to verify that certain formulas for the variance and covariance of random variables are also true for sample statistics for empirical data. This article shows how to use SAS IML software to test certain equations that relate the variance and covariance of sums to the variances and covariances of a set of data variables.

The post The variance of the sums of variables appeared first on The DO Loop.

3月 012023
 

A SAS programmer wanted to compute the distribution of X-Y, where X and Y are two beta-distributed random variables. Pham-Gia and Turkkan (1993) derive a formula for the PDF of this distribution. Unfortunately, the PDF involves evaluating a two-dimensional generalized hypergeometric function, which is not available in all programming languages. Hypergeometric functions are not supported natively in SAS, but this article shows how to evaluate the generalized hypergeometric function for a range of parameter values. By using the generalized hypergeometric function, you can evaluate the PDF of the difference between two beta-distributed variables.

The difference between two beta random variables

Before doing any computations, let's visualize what we are trying to compute. Let X ~ Beta(a1, b1) and Y ~ Beta(a1, b1) be two beta-distributed random variables. We want to determine the distribution of the quantity d = X-Y. One way to approach this problem is by using simulation: Simulate random variates X and Y, compute the quantity X-Y, and plot a histogram of the distribution of d. Because each beta variable has values in the interval (0, 1), the difference has values in the interval (-1, 1).

The following simulation generates 100,000 pairs of beta variates: X ~ Beta(0.5, 0.5) and Y ~ Beta(1, 1). Both X and Y are U-shaped on (0,1). The following simulation generates the differences, and the histogram visualizes the distribution of d = X-Y:

/* Case 2 from Pham-Gia and Turkkan, 1993, p. 1765 */
%let a1 = 0.5;
%let b1 = 0.5;
%let a2 = 1;
%let b2 = 1;
%let N = 1E5;
data BetaDiff;
call streaminit(123);
do i = 1 to &N;
   x = rand("beta", &a1, &b1);
   y = rand("beta", &a2, &b2);
   d = x - y;
   output;
end;
drop i;
run;
 
title "Distribution of X-Y";
title2 "X~beta(&a1,&b1); Y~beta(&a2,&b2)";
proc sgplot data=BetaDiff;
   histogram d;
run;

For these values of the beta parameters, the distribution of the differences between the two beta variables looks like an "onion dome" that tops many Russian Orthodox churches in Ukraine and Russia. For other choices of parameters, the distribution can look quite different. The remainder of this article defines the PDF for the distribution of the differences. The formula for the PDF requires evaluating a two-dimensional generalized hypergeometric distribution.

Appell's hypergeometric function

A previous article discusses Gauss's hypergeometric function, which is a one-dimensional function that has three parameters. For certain parameter values, you can compute Gauss's hypergeometric function by computing a definite integral. The two-dimensional generalized hypergeometric function that is used by Pham-Gia and Turkkan (1993), is called Appell's hypergeometric function (denoted F1 by mathematicians). Appell's function can be evaluated by solving a definite integral that looks very similar to the integral encountered in evaluating the 1-D function. Appell's F1 contains four parameters (a,b1,b2,c) and two variables (x,y). In statistical applications, the variables and parameters are real-valued. For the parameter values c > a > 0, Appell's F1 function can be evaluated by computing the following integral:
\(F_{1}(a,b_{1},b_{2},c;x,y)={\frac {1}{B(a, c-a)}} \int _{0}^{1}u^{a-1}(1-u)^{c-a-1}(1-x u)^{-b_{1}}(1-y u)^{-b_{2}}\,du\)
where B(s,t) is the complete beta function, which is available in SAS by using the BETA function. Both arguments to the BETA function must be positive, so evaluating the BETA function requires that c > a > 0. Notice that the integration variable, u, does not appear in the answer. Appell's hypergeometric function is defined for |x| < 1 and |y| < 1. Notice that the integrand is unbounded when either x → 1 or y → 1 (assuming b1 > 0 and b2 > 0).

The following SAS IML program defines a function that uses the QUAD function to evaluate the definite integral, thereby evaluating Appell's hypergeometric function for the parameters (a,b1,b2,c) = (2,1,1,3). A table shows the values of the function at a few (x,y) points. The graph shows a contour plot of the function evaluated on the region [-0.95, 0.9] x [-0.95, 0.9].

/* Appell hypergeometric function of 2 vars 
   F1(a,b1,b2; c; x,y) is a function of (x,y) with parms = a // b1 // b2 // c;
   F1 is defined on the domain {(x,y) | |x|<1 and |y|<1}.
   You can evaluate F1 by using an integral for c > a > 0, as shown at
   https://en.wikipedia.org/wiki/Appell_series#Integral_representations
*/
proc iml;
/* Integrand for the QUAD function */
start F1Integ(u)  global(g_parm, g_x, g_y);
   a=g_parm[1]; b1=g_parm[2]; b2=g_parm[3]; c=g_parm[4];
   x=g_x; y=g_y;
   numer = u##(a-1) # (1-u)##(c-a-1);
   denom = (1-u#x)##b1 # (1-u#y)##b2;
   return( numer / denom );
finish;
 
/* Evaluate the Appell F1 hypergeometric function when c > a > 0
   and |x|<1 and |y|<1
   Pass in parm = {a, b1, b2, c} and
   z = (x1 y1,
        x2 y2,
        ...
        xn yn}; */
start AppellF1(z, parm)  global(g_parm, g_x, g_y);
   /* transfer parameters to global symbols */
   g_parm = parm;
   a = parm[1]; b1 = parm[2]; b2 = parm[3]; c = parm[4];
   n = nrow(z);
   F = j(n, 1, .);
   if a <= 0 | c <= a then do;
      /* print error message or use PrintToLOg function:
         https://blogs.sas.com/content/iml/2023/01/25/printtolog-iml.html */
      print "This implementation of the F1 function requires c > a > 0.",
            parm[r={'a','b1','b2','c'}];
      return( F );
   end;
   /* loop over the (x,y) values */
   do i = 1 to n;
      /* defined only when |x| < 1 */
      g_x = z[i,1]; g_y = z[i,2];
      if g_x^=. & g_y^=. & 
         abs(g_x)<1 & abs(g_y)<1 then do;  
         call quad(val, "F1Integ", {0 1}) MSG="NO";
         F[i] = val;
      end;
   end;
   return( F / Beta(a, c-a) );    /* scale the integral */
finish;
 
/* Example of calling AppellF1 */
parm = {2, 1, 1, 3};
parm = {2, 1, 1, 3};
xy = { 0.9  0,
       0.7  0.3,
      -0.5  0.2,
      -0.9 -0.5,
       0    0};
F1 = AppellF1(xy, parm);
print xy[L="" c={'x' 'y'}] F1;

The PDF of beta differences

Pham-Gia and Turkkan (1993) derive the PDF of the distribution for the difference between two beta random variables, X ~ Beta(a1,b1) and Y ~ Beta(a2,b2). The PDF is defined piecewise. There are different formulas, depending on whether the difference, d, is negative, zero, or positive. The formulas use powers of d, (1-d), (1-d2), the Appell hypergeometric function, and the complete beta function. The formulas are specified in the following program, which computes the PDF. Notice that linear combinations of the beta parameters are used to construct the parameters for Appell's hypergeometric function.

/* Use Appell's hypergeometric function to evaluate the PDF
   of the distribution of the difference X-Y between 
   X ~ Beta(a1,b1) and Y ~ Beta(a2,b2)
*/
start PDFBetaDiff(_d, _parm);
   a1=_parm[1]; b1=_parm[2]; a2=_parm[3]; b2=_parm[4];
   n = nrow(_d)*ncol(_d);
   PDF = j(n, 1, .);
   if a1<=0 | b1<=0 | a2<=0 | b2<=0 then do;
      print "All parameters must be positive", a1 b1 a2 b2;
      return(PDF);
   end;
   A = Beta(a1,b1) * Beta(a2,b2);
   do i = 1 to n;
      d = _d[i];
      *print d a1 b1 a2 b2;
      if d = . then 
         PDF[i] = .;
      if abs(d) < 1 then do;
         /* Formulas from Pham-Gia and Turkkan, 1993 */
         if abs(d)<1e-14 then do;
            if a1+a2 > 1 & b1+b2 > 1 then 
               PDF[i] = Beta(a1+a2-1, b1+b2-1) / A;
            *print "d=0" (a1+a2-1)[L='a1+a2-1'] (b1+b2-1)[L='b1+b2-1'] (PDF[i])[L='PDF'];
         end;
         else if d > 0 then do;  /* (0,1] */
            parm = b1            //
                   a1+b1+a2+b2-2 //
                   1-a1          //
                   a2+b1;
            xy = 1-d || 1-d##2;
            F1 = AppellF1(xy, parm);
            PDF[i] = Beta(a2,b1) # d##(b1+b2-1) # (1-d)##(a2+b1-1) # F1 / A;
            *print "d>0" xy (PDF[i])[L='PDF'];
         end;
         else do;                /* [-1,0) */
            parm = b2            //
                   1-a2          //
                   a1+b1+a2+b2-2 //
                   a1+b2;
            xy = 1-d##2 || 1+d;
            F1 = AppellF1(xy, parm);
            PDF[i] = Beta(a1,b2) # (-d)##(b1+b2-1) # (1+d)##(a1+b2-1) # F1 / A;
            *print "d<0" xy (PDF[i])[L='PDF'];
         end;
      end;
   end;
   return PDF;
finish;
 
print "*** Case 2 in Pham-Gia and Turkkan, p. 1767 ***";
a1 = 0.5;
b1 = 0.5;
a2 = 1;
b2 = 1;
parm = a1//b1//a2//b2;
 
d = {-0.9, -0.5, -0.25, 0, 0.25, 0.5, 0.9};
PDF = PDFBetaDiff(d, parm);
print d PDF;

Notice that the parameters are the same as in the simulation earlier in this article. The following graph visualizes the PDF on the interval (-1, 1):

/* graph the distribution of the difference */
d = T(do(-0.975, 0.975, 0.025));
PDF = PDFBetaDiff(d, parm);
 
title "X-Y for X ~ Beta(0.5,0.5) and Y ~ Beta(1,1)";
title2 "Fig 1, p. 1766";
call series(d, PDF) grid={x y};

The PDF, which is defined piecewise, shows the "onion dome" shape that was noticed for the distribution of the simulated data. The following graph overlays the PDF and the histogram to confirm that the two graphs agree.

Not every combination of beta parameters results in a non-smooth PDF. For example, if you define X ~ beta(3,5) and Y ~ beta(2, 8), then you can compute the PDF of the difference, d = X-Y, by changing the parameters as follows:

/* Case 5 from Pham-Gia and Turkkan, 1993, p. 1767 */
a1 = 3;
b1 = 5;
a2 = 2;
b2 = 8;
parm = a1//b1//a2//b2;

If you rerun the simulation and overlay the PDF for these parameters, you obtain the following graph:

Summary

The distribution of X-Y, where X and Y are two beta-distributed random variables, has an explicit formula (Pham-Gia and Turkkan, 1993). Unfortunately, the PDF involves evaluating a two-dimensional generalized hypergeometric function, which is a complicated special function. Hypergeometric functions are not supported natively in SAS, but this article shows how to evaluate the generalized hypergeometric function for a range of parameter values, which enables you to evaluate the PDF of the difference between two beta-distributed variables.

You can download the following SAS programs, which generate the tables and graphs in this article:

References

The post The distribution of the difference between two beta random variables appeared first on The DO Loop.

2月 272023
 

In applied mathematics, there is a large collection of "special functions." These function appera in certain applications, often as the solution to a differential equation, but also in the definition of probability distributions. For example, I have written about Bessel functions, the complete and incomplete gamma function, and the complete and incomplete beta function, and the Lambert W function, just to name a few special functions that are encountered in statistics.

You can be blissfully ignorant of a special function for many years, then one day you encounter a problem in which the function is essential. This happened recently when a SAS programmer told me about a problem that requires computing with a two-dimensional hypergeometric function. I had never heard of the function, but I did some research to figure out how to compute the function in SAS. (For a mathematical overview, see Beukers, 2014.) This article shows how to use numerical integration in SAS to compute a one-dimensional hypergeometric function, which is called Gauss's hypergeometric function and denoted \(\,_2F_1\). A subsequent article will discuss a two-dimensional hypergeometric function, which is needed to compute the PDF of a certain probability distribution. I am not an expert on the hypergeometric function, but this article shares what I have learned.

Hypergeometric functions are sometimes defined as functions of complex numbers, but the applications in statistics involve only real numbers. Consequently, this article evaluates the functions for real arguments. The hypergeometric functions in this article converge when the magnitude of each variable is less than 1.

A one-dimensional hypergeometric function

For the definition of Gauss's hypergeometric function, see the Wikipedia article about the 1-D hypergeometric series. Hypergeometric functions are defined as an infinite series. The one-dimensional hypergeometric functions have three parameters, (a,b;c). The most useful hypergeometric series is denoted by \(\,_{2}F_{1}(a,b;c; x)\) or sometimes just as \(F(a,b;c; x)\), where |x| < 1. Interestingly, the convention for a hypergeometric function is to list the parameters first and the variable (x) last. The two parameters (a,b) are called the upper parameters; the parameter c is called the lower parameter. The \(\,_2F_1\) function is a special case of the generalized hypergeometric function \(\,_pF_q\), which has p upper parameters and q lower parameters.

Compute Gauss's hypergeometric function in SAS

An important special case of the parameters is c > b > 0 and a > 0. If c > b > 0, then you can write the hypergeometric function as a definite integral. The defining equation is
\(B(b,c-b) * \,_{2}F_{1}(a,b;c; x)=\int _{0}^{1}u^{b-1}(1-u)^{c-b-1} / (1-x u)^{a}\,du\)
where B(s,t) is the complete beta function, which is available in SAS by using the BETA function. Both arguments to the BETA function must be positive, so evaluating the BETA function requires that c > b > 0. Notice that the integration variable, u, does not appear in the answer. The definite integral is a function of the parameters (a,b,c) and of the variable x.

You can perform numerical integration in SAS by using the QUAD function in SAS IML software. Notice that when a > 0, the integrand becomes unbounded as x → 1. Because of this, the QUAD function will display some warnings in the log as it seeks an accurate solution. According to my tests, the value of this integral seems to be computed correctly.

The following SAS IML program uses the QUAD function to evaluate the definite integral by using the user-defined HG1 function. The GLOBAL statement enables you to pass values of the parameters (a,b,c) and the variable x. I use G_ as a prefix to denote global variables. After evaluating the integral for each value of x, the function Hypergeometric2F1 divides the integral by the beta function, thus obtaining the value of Gauss's hypergeometric function, \(\,_2F_1\).

/* 1-D hypergeometric function 1F2(a,b;c;x) */
proc iml;
/* Integrand for the QUAD function. Parameters are in the GLOBAL clause. */
start HG1(u)  global(g_parm, g_x);
   a=g_parm[1]; b=g_parm[2]; c=g_parm[3]; x=g_x;
   /* Note: integrand unbounded as x->1 */
   F = u##(b-1) # (1-u)##(c-b-1) / (1-x#u)##a;
   return( F );
finish;
 
/* Evaluate the 2F1 hypergeometric function when c > b > 0 and |x|<1 */
start Hypergeometric2F1(x, parm)  global(g_parm, g_x);
   /* transfer the parameters to global symbols for QUAD */
   g_parm = parm;
   a = parm[1]; b = parm[2]; c = parm[3]; 
   n = nrow(x)* ncol(x);
   F = j(n, 1, .);
   if b <= 0 | c <= b then do;
      /* print error message or use PrintToLOg function:
         https://blogs.sas.com/content/iml/2023/01/25/printtolog-iml.html */
      print "This implementation of the F1 function requires c > b > 0.",
            parm[r={'a','b','c'}];
      return( F );
   end;
   /* loop over the x values */
   do i = 1 to n;
      if x[i]^=. & abs(x[i])<1 then do;  /* defined only when |x| < 1 */
         g_x = x[i];
         call quad(val, "HG1", {0 1}) MSG="NO";
         F[i] = val;
      end;
   end;
   return( F / Beta(b, c-b) );    /* scale the integral */
finish;
 
/* test the function */
a = 0.5;
b = 0.5;
c = 1;
parm = a//b//c;
 
x = {-0.5, -0.25, 0, 0.25, 0.5, 0.9};
F = Hypergeometric2F1(x, parm);
print x F[L='2F1(0.5,0.5;1;x)'];

The table shows a few values of Gauss's hypergeometric function for the parameters (a,b;c)=(0.5,0.5;1). You can visualize the function on the open interval (-1, 1) by using the following statements:

x = T( do(-0.99, 0.99, 0.01) );
F = Hypergeometric2F1(x, parm);
 
title "2F1(0.5,0.5; 1; x)";
call series(x, F) grid={x y} label={'x' '2F1(x)'};

Summary

Although I am not an expert on hypergeometric functions, this article shares a few tips for computing Gauss's hypergeometric function, \(\,_{2}F_{1}(a,b;c; x)\) in SAS for real arguments and for the special case c > b > 0. For this parameter range, you can compute the hypergeometric function as a definite integral. As shown in the next blog post, you can use this computation to compute the PDF of a certain probability distribution.

References

The post Compute hypergeometric functions in SAS appeared first on The DO Loop.

2月 222023
 

The metalog family of distributions (Keelin, Decision Analysis, 2016) is a flexible family that can model a wide range of continuous univariate data distributions when the data-generating mechanism is unknown. This article provides an overview of the metalog distributions. A subsequent article shows how to download and use a library of SAS IML functions that enable you to use the metalog distribution to model data in SAS.

The metalog distribution

There are dozens of continuous probability distributions that are commonly used to model univariate data. Examples include the normal, lognormal, Weibull, gamma, and beta distributions. (These are sometimes called "named" distributions.) Given a set of data, which distribution should you use? Ideally, you would use domain-specific knowledge to decide which distribution models the data-generating process. In practice, the process that generates the data is often unknown. Consequently, some people fit many different distributions to the data and use a goodness-of-fit statistic to choose the distribution that best fits the data.

Unfortunately, this "scattershot method" is neither reliable nor principled. Furthermore, sometimes the usual well-known distributions are not sufficiently flexible to model a set of data. This is shown by the graph to the right, which shows three different distributions fit to the same data. It is not clear that one model is superior to the others.

So that modelers do not have to fit many "named" distributions and choose the best, researchers have developed flexible systems of distributions that can model a wide range of shapes. Popular systems include the following:

  • The Pearson system: Karl Pearson used the normal distribution and seven other families to construct a system that can match any sample skewness and kurtosis. In SAS, you can use PROC SIMSYSTEM to fit a Pearson distribution to data.
  • The Johnson system: Norman Johnson used the normal distribution, the lognormal distribution, an unbounded distribution, and a bounded distribution to match any sample skewness and kurtosis. In SAS, you can use PROC UNIVARIATE or PROC SIMSYSTEM to fit a Johnson distribution to data.
  • The Fleishman system: The book Simulating Data with SAS (Wicklin, 2013) includes SAS IML modules that can fit the Fleishman system (Fleishman, 1978) to data.

The metalog system is a more recent family of distributions that can model many different shapes for bounded, semibounded, and unbounded continuous distributions. To use the metalog system, do the following:

  1. Choose the type of the distribution from the four available types: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. This choice is often based on domain-specific knowledge of the data-generating process.
  2. Choose the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k ≥ 7) can fit data distributions for which the density appears to have multiple peaks.
  3. Fit the quantile function of the metalog distribution to the data. This estimates the k parameters in the model.

For example, the following graph shows the PDF of a 5-term unbounded metalog model overlaid on the same histogram that was previously shown. This model seems to fit the data better than the "named" distributions.

Advantages and drawbacks of using the metalog distribution

The metalog distribution has the following advantages:

  • Because the metalog distribution can have many parameters, it is very flexible and can fit a wide range of shapes, including multimodal distributions.
  • Keelin (2016) uses ordinary least squares (OLS) to fit the data to the quantile function of the metalog distribution. Thus, the fitting process is simple to implement.
  • Because the quantile function has an explicit formula, it is simple to simulate data from the model by using the inverse CDF method of simulating data.

The metalog function has one primary disadvantage: the fitted metalog coefficients might not correspond to a valid distribution. An invalid model is called infeasible. Recall that a quantile function is always monotonically increasing as a function of the probability on the open interval (0,1). However, for an infeasible model, the quantile function is not monotonic. Equivalently, the probability density function (PDF) of the model contains nonpositive values.

Infeasibility is somewhat complicated, but one way to think about infeasibility is to think about the fact that the moments for a metalog model are related to the coefficients. If the coefficients lead to impossible moments (for example, a negative variance), then the model is infeasible.

When fitting a metalog model to data, you should always verify that the resulting model is feasible. If it is not, reduce the number of terms in the model. Some data cannot be fit when k > 2. For example, data that have one or more extreme observations often lead to an infeasible set of parameter estimates.

The following panel of graph shows six models for the same data. The models show increasing complexity as the number of terms increases. If you include six terms, the model becomes bimodal. However, the 9-term model is infeasible because the PDF near x=15 has a negative value, which is inside the circle in the last cell of the panel.

Definitions of the metalog distribution

Most common continuous probability distributions are parameterized on the data scale. That is, if X is a continuous random variable, then the CDF is a function that is defined on the set of outcomes for X. You can write the CDF as some increasing function p = F(x).

From the definition of a continuous CDF, you can obtain the quantile function and the PDF. The quantile function is defined as the inverse CDF: \(x = F^{-1}(p)\) for p in (0,1). The PDF is defined as the derivative with respect to x of the CDF function.

Keelin (2016) specifies the definitions of the metalog quantile function and PDF function. The metalog distribution reverses the usual convention by parameterizing the quantile function as x = Q(p). From the quantile function, you can obtain the CDF, which is the inverse of the quantile function. From the CDF, you can obtain the PDF. From the quantile function, you can generate random variates. Thus, knowing the quantile function is the key to working with the metalog distribution.

The unbounded metalog distribution

When you fit an unbounded metalog model, you estimate the parameters for which the model's predicted values (p, Q(p)) are close (in a least squares sense) to the data pairs (p, x). Here, p is a vector of cumulative probabilities, x is a vector of observed data values, and Q(p) is the quantile function of a metalog distribution.

The predicted values are obtained by using a linear regression model. The unbounded model has a design matrix that uses the following basis functions:

  • a vector of 1s
  • a vector that contains the values logit(p) = log(p/(1-p))
  • a vector that contains the values c(p) = p - 1/2

Specifically, the columns of the design matrix involve powers of the vector c(p) and interactions with the term logit(p). If M = [M1 | M2 | ... | Mk] is the design matrix, then the columns are as follows:

  • M1 is a column of 1s.
  • M2 is the column logit(p).
  • M3 is the interaction term c(p) logit(p).
  • M4 is the linear term c(p).
  • M5 is the quadratic term c(p)2.
  • M6 is the interaction term cc(p)2 logit(p).
  • Additional columns are of the form c(p)j or c(p)j logit(p).

The following image shows the basis functions M2-M7 for the metalog regression.

Given the design matrix, M, the least squares parameter estimates are the elements of the vector, b, which solves the matrix equations \(M^\prime M b = M^\prime x\). An estimate of the quantile function is given by Q(p) = M*b. By differentiating this expression with respect to p, you can obtain a formula for the PDF for the model (Keelin, 2016, Eq. 6, p. 254).

From these equations, you can see that not every set of coefficients corresponds to an increasing quantile function. For example, when k=2, Q(p) = b1 + b2 logit(p). To be an increasing function of p, it is necessary that b2 be strictly positive.

The semibounded and bounded metalog distributions

The semibounded and bounded metalog distributions are obtained by transforming the unbounded metalog model.

The semibounded metalog distribution on the interval [L, ∞) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(x-L). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution on [L, ∞) is then defined as Q_L(p) = L + exp(Q(p)) for p in (0,1), and Q_L(0) = L.

Similarly, the quantile function for the semibounded distribution on (-∞, U) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(U-x). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution is then defined as Q_U(p) = U - exp(-Q(p)) for p in (0,1), and Q_U(1) = U.

Finally, the quantile function for the bounded metalog distribution on an interval [L, U] is obtained by fitting an unbounded k-term distribution to the transformed data z = log((x-L)/(U-x)). This gives the quantile function, Q(p), for z. The quantile function for the bounded distribution is then defined by Q_B(p) = (L + U exp(Q(p))) / (1 + exp(Q(p))) for p in (0,1), and the boundary conditions Q_B(0) = L and Q_B(1) = U.

As mentioned previously, for every distribution, you can obtain the PDF function from the quantile function.

Random variates from the metalog distribution

Recall that you can generate a random sample from any continuous probability distribution by using the inverse CDF method. You generate a random variate, u, from the uniform distribution U(0,1), then compute \(x = F^{-1}(u)\), where F is the CDF function and \(F^{-1}\) is the inverse CDF, which is the quantile function. You can prove that x is a random variate that is distributed according to F.

For most distributions, the inverse CDF is not known explicitly, which means that you must solve for x as the root of the nonlinear equation F(x) - u = 0. However, the metalog distribution supports direct simulation because the quantile function is defined explicitly. For the unbounded metalog distribution, you can generate random variates by generating u ~ U(0,1) and computing x = Q(u), where Q is the quantile function. The semibounded and bounded cases are treated similarly.

Summary

The metalog distribution is a flexible distribution and can fit a wide range of shapes of data distributions, including multimodal distributions. You can fit a model to data by using ordinary least squares regression. Unfortunately, not every model is feasible, which means that the model is not a valid distribution. For an infeasible model, the CDF is not monotone and the PDF is not strictly positive. Nevertheless, the metalog distribution is a powerful option for fitting data when the underlying data-generating method is unknown.

In a future article, I will show how to use a library of SAS IML functions to fit a metalog distribution to data. All the images in this article were generated in SAS by using the library.

The post What is the metalog distribution? appeared first on The DO Loop.

2月 222023
 

The metalog family of distributions (Keelin, Decision Analysis, 2016) is a flexible family that can model a wide range of continuous univariate data distributions when the data-generating mechanism is unknown. This article provides an overview of the metalog distributions. A subsequent article shows how to download and use a library of SAS IML functions that enable you to use the metalog distribution to model data in SAS.

The metalog distribution

There are dozens of continuous probability distributions that are commonly used to model univariate data. Examples include the normal, lognormal, Weibull, gamma, and beta distributions. (These are sometimes called "named" distributions.) Given a set of data, which distribution should you use? Ideally, you would use domain-specific knowledge to decide which distribution models the data-generating process. In practice, the process that generates the data is often unknown. Consequently, some people fit many different distributions to the data and use a goodness-of-fit statistic to choose the distribution that best fits the data.

Unfortunately, this "scattershot method" is neither reliable nor principled. Furthermore, sometimes the usual well-known distributions are not sufficiently flexible to model a set of data. This is shown by the graph to the right, which shows three different distributions fit to the same data. It is not clear that one model is superior to the others.

So that modelers do not have to fit many "named" distributions and choose the best, researchers have developed flexible systems of distributions that can model a wide range of shapes. Popular systems include the following:

  • The Pearson system: Karl Pearson used the normal distribution and seven other families to construct a system that can match any sample skewness and kurtosis. In SAS, you can use PROC SIMSYSTEM to fit a Pearson distribution to data.
  • The Johnson system: Norman Johnson used the normal distribution, the lognormal distribution, an unbounded distribution, and a bounded distribution to match any sample skewness and kurtosis. In SAS, you can use PROC UNIVARIATE or PROC SIMSYSTEM to fit a Johnson distribution to data.
  • The Fleishman system: The book Simulating Data with SAS (Wicklin, 2013) includes SAS IML modules that can fit the Fleishman system (Fleishman, 1978) to data.

The metalog system is a more recent family of distributions that can model many different shapes for bounded, semibounded, and unbounded continuous distributions. To use the metalog system, do the following:

  1. Choose the type of the distribution from the four available types: unbounded, semibounded with a lower bound, semibounded with an upper bound, or bounded. This choice is often based on domain-specific knowledge of the data-generating process.
  2. Choose the number of terms (k) to use in the metalog model. A small number of terms (3 ≤ k ≤ 6) results in a smooth model. A larger number of terms (k ≥ 7) can fit data distributions for which the density appears to have multiple peaks.
  3. Fit the quantile function of the metalog distribution to the data. This estimates the k parameters in the model.

For example, the following graph shows the PDF of a 5-term unbounded metalog model overlaid on the same histogram that was previously shown. This model seems to fit the data better than the "named" distributions.

Advantages and drawbacks of using the metalog distribution

The metalog distribution has the following advantages:

  • Because the metalog distribution can have many parameters, it is very flexible and can fit a wide range of shapes, including multimodal distributions.
  • Keelin (2016) uses ordinary least squares (OLS) to fit the data to the quantile function of the metalog distribution. Thus, the fitting process is simple to implement.
  • Because the quantile function has an explicit formula, it is simple to simulate data from the model by using the inverse CDF method of simulating data.

The metalog function has one primary disadvantage: the fitted metalog coefficients might not correspond to a valid distribution. An invalid model is called infeasible. Recall that a quantile function is always monotonically increasing as a function of the probability on the open interval (0,1). However, for an infeasible model, the quantile function is not monotonic. Equivalently, the probability density function (PDF) of the model contains nonpositive values.

Infeasibility is somewhat complicated, but one way to think about infeasibility is to think about the fact that the moments for a metalog model are related to the coefficients. If the coefficients lead to impossible moments (for example, a negative variance), then the model is infeasible.

When fitting a metalog model to data, you should always verify that the resulting model is feasible. If it is not, reduce the number of terms in the model. Some data cannot be fit when k > 2. For example, data that have one or more extreme observations often lead to an infeasible set of parameter estimates.

The following panel of graph shows six models for the same data. The models show increasing complexity as the number of terms increases. If you include six terms, the model becomes bimodal. However, the 9-term model is infeasible because the PDF near x=15 has a negative value, which is inside the circle in the last cell of the panel.

Definitions of the metalog distribution

Most common continuous probability distributions are parameterized on the data scale. That is, if X is a continuous random variable, then the CDF is a function that is defined on the set of outcomes for X. You can write the CDF as some increasing function p = F(x).

From the definition of a continuous CDF, you can obtain the quantile function and the PDF. The quantile function is defined as the inverse CDF: \(x = F^{-1}(p)\) for p in (0,1). The PDF is defined as the derivative with respect to x of the CDF function.

Keelin (2016) specifies the definitions of the metalog quantile function and PDF function. The metalog distribution reverses the usual convention by parameterizing the quantile function as x = Q(p). From the quantile function, you can obtain the CDF, which is the inverse of the quantile function. From the CDF, you can obtain the PDF. From the quantile function, you can generate random variates. Thus, knowing the quantile function is the key to working with the metalog distribution.

The unbounded metalog distribution

When you fit an unbounded metalog model, you estimate the parameters for which the model's predicted values (p, Q(p)) are close (in a least squares sense) to the data pairs (p, x). Here, p is a vector of cumulative probabilities, x is a vector of observed data values, and Q(p) is the quantile function of a metalog distribution.

The predicted values are obtained by using a linear regression model. The unbounded model has a design matrix that uses the following basis functions:

  • a vector of 1s
  • a vector that contains the values logit(p) = log(p/(1-p))
  • a vector that contains the values c(p) = p - 1/2

Specifically, the columns of the design matrix involve powers of the vector c(p) and interactions with the term logit(p). If M = [M1 | M2 | ... | Mk] is the design matrix, then the columns are as follows:

  • M1 is a column of 1s.
  • M2 is the column logit(p).
  • M3 is the interaction term c(p) logit(p).
  • M4 is the linear term c(p).
  • M5 is the quadratic term c(p)2.
  • M6 is the interaction term cc(p)2 logit(p).
  • Additional columns are of the form c(p)j or c(p)j logit(p).

The following image shows the basis functions M2-M7 for the metalog regression.

Given the design matrix, M, the least squares parameter estimates are the elements of the vector, b, which solves the matrix equations \(M^\prime M b = M^\prime x\). An estimate of the quantile function is given by Q(p) = M*b. By differentiating this expression with respect to p, you can obtain a formula for the PDF for the model (Keelin, 2016, Eq. 6, p. 254).

From these equations, you can see that not every set of coefficients corresponds to an increasing quantile function. For example, when k=2, Q(p) = b1 + b2 logit(p). To be an increasing function of p, it is necessary that b2 be strictly positive.

The semibounded and bounded metalog distributions

The semibounded and bounded metalog distributions are obtained by transforming the unbounded metalog model.

The semibounded metalog distribution on the interval [L, ∞) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(x-L). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution on [L, ∞) is then defined as Q_L(p) = L + exp(Q(p)) for p in (0,1), and Q_L(0) = L.

Similarly, the quantile function for the semibounded distribution on (-∞, U) is obtained by fitting an unbounded k-term distribution to the transformed data z = log(U-x). This gives the quantile function, Q(p), for z. The quantile function for the semibounded distribution is then defined as Q_U(p) = U - exp(-Q(p)) for p in (0,1), and Q_U(1) = U.

Finally, the quantile function for the bounded metalog distribution on an interval [L, U] is obtained by fitting an unbounded k-term distribution to the transformed data z = log((x-L)/(U-x)). This gives the quantile function, Q(p), for z. The quantile function for the bounded distribution is then defined by Q_B(p) = (L + U exp(Q(p))) / (1 + exp(Q(p))) for p in (0,1), and the boundary conditions Q_B(0) = L and Q_B(1) = U.

As mentioned previously, for every distribution, you can obtain the PDF function from the quantile function.

Random variates from the metalog distribution

Recall that you can generate a random sample from any continuous probability distribution by using the inverse CDF method. You generate a random variate, u, from the uniform distribution U(0,1), then compute \(x = F^{-1}(u)\), where F is the CDF function and \(F^{-1}\) is the inverse CDF, which is the quantile function. You can prove that x is a random variate that is distributed according to F.

For most distributions, the inverse CDF is not known explicitly, which means that you must solve for x as the root of the nonlinear equation F(x) - u = 0. However, the metalog distribution supports direct simulation because the quantile function is defined explicitly. For the unbounded metalog distribution, you can generate random variates by generating u ~ U(0,1) and computing x = Q(u), where Q is the quantile function. The semibounded and bounded cases are treated similarly.

Summary

The metalog distribution is a flexible distribution and can fit a wide range of shapes of data distributions, including multimodal distributions. You can fit a model to data by using ordinary least squares regression. Unfortunately, not every model is feasible, which means that the model is not a valid distribution. For an infeasible model, the CDF is not monotone and the PDF is not strictly positive. Nevertheless, the metalog distribution is a powerful option for fitting data when the underlying data-generating method is unknown.

In a future article, I will show how to use a library of SAS IML functions to fit a metalog distribution to data. All the images in this article were generated in SAS by using the library.

The post What is the metalog distribution? appeared first on The DO Loop.