Statistical Programming

May 16, 2022
 

In categorical data analysis, it is common to analyze tables of counts. For example, a researcher might gather data for 18 boys and 12 girls who apply for a summer enrichment program. The researcher might be interested in whether the proportion of boys that are admitted is different from the proportion of girls. For this analysis, the data that you type into statistical software (such as SAS) do not typically consist of 30 observations. Rather, the data often consist of four observations and a frequency variable (also called a count variable). The first two observations record the number of boys that are admitted and not admitted, respectively. The next two observations record the corresponding frequencies for the girls.
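
For example, a hypothetical version of that four-observation data set might look like the following. Only the totals (18 boys and 12 girls) come from the example; the admitted/not-admitted counts (10, 8, 7, and 5) are invented for illustration:

data Admit;
input Gender $ Admitted $ Count;    /* Count is the frequency variable */
datalines;
Boy  Yes 10
Boy  No   8
Girl Yes  7
Girl No   5
;

proc freq data=Admit;
   tables Gender*Admitted / chisq;  /* compare admission proportions */
   weight Count;                    /* each row represents COUNT applicants */
run;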

This representation of the data is sometimes called frequency data. It is used when you can group individual subjects into categories such as boy/girl and admitted/not admitted. Despite its convenience, there are times when it is useful to expand or "unroll" the data to enumerate the individual subjects. Examples include bootstrap estimates and analyzing the data in SAS procedures that do not support a FREQ statement.

This article shows how to use the SAS DATA step to unroll frequency data. In addition, this article shows how to expand data that are in events/trials format.

Data in frequency form

The following example is from the documentation of the FREQ procedure. The goal is to compute some statistics for a 2 x 2 frequency table for 23 patients. Eight patients were on a low-cholesterol diet and 15 were on a high-cholesterol diet. The table shows how many subjects in each group developed heart disease. The data are used to estimate the relative risk of developing heart disease in the two groups:

/* Doc Example: PROC FREQ analysis of a 2x2 Contingency Table */
proc format;
   value ExpFmt 1='High Cholesterol Diet'  0='Low Cholesterol Diet';
   value RspFmt 1='Yes'  0='No';
run;
 
data FatComp;
format Exposure ExpFmt. Response RspFmt.;
input Exposure Response Count;
label Response='Heart Disease';
datalines;
0 0  6
0 1  2
1 0  4
1 1 11
;
 
title 'Case-Control Study of High Fat/Cholesterol Diet';
proc freq data=FatComp;
   tables Exposure*Response / relrisk norow nocol nopercent;
   weight Count;            /* each row represents COUNT patients */
run;

Notice that the data are input in terms of group frequencies. The output is shown. The analysis estimates the risk of heart disease for a subject on a high-cholesterol diet as 2.8 times greater than the risk for a subject on a low-cholesterol diet.

How to unroll frequency data in SAS

You can use the DATA step in SAS to unroll the frequency data. The COUNT variable tells you how many patients are in each group. You can therefore write a loop that expands the data: iterate from 1 to the COUNT value and put an OUTPUT statement in the body of the loop. If desired, you can include an ID variable to enumerate the subjects in each group, as follows:

data RawFat;
set FatComp;
Group + 1;          /* identify each category */
do ID = 1 to Count; /* identify subjects in each category */
   output;
end;
drop Count;
run;
 
proc print data=RawFat(obs=10);
run;

Whereas the original data set has four rows, the RawFat data has 23 rows. The first 10 rows are shown. The first six observations are the expansion of the subjects in the first group (Exposure=Low; Response=No). The next two observations are the subjects in the second group (Exposure=Low; Response=Yes), and so forth. The GROUP variable identifies the original groups. The ID variable enumerates patients within each group.

The second data set can be analyzed by any SAS procedure. In this form, the data do not require using a FREQ statement (or the WEIGHT statement in PROC FREQ), as shown by the following call to PROC FREQ:

/* run the same analysis on the expanded data */
proc freq data=RawFat;
   tables Exposure*Response / relrisk norow nocol nopercent;
   /* no WEIGHT statement because there is one row for each patient */
run;

The output (not shown) is the same as for the frequency data. Both data sets contain the same information, but the second might be useful for certain tasks such as a bootstrap analysis.
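
For example, here is a minimal sketch (not from the original article) of drawing bootstrap resamples with PROC SURVEYSELECT, which resamples rows and therefore requires one row per patient:

/* draw 1000 bootstrap resamples (with replacement) from the expanded data */
proc surveyselect data=RawFat out=BootFat seed=123
     method=urs samprate=1 reps=1000 outhits;
run;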

Data in events/trials form

A related data format is called the events/trials format. In the events/trials format, each observation has two columns that specify counts. One column specifies the number of times that a certain event occurs, and the other column specifies the number of trials (or attempts) that were conducted. This format is used for logistic regression and for other generalized linear regression models that predict the proportion of events.

The following example is from the documentation for the LOGISTIC procedure. The data describe the results of 19 batches of ingots in a designed experiment. Each batch was soaked and heated for specified combinations of times. The ingots were then inspected to determine which were not ready for rolling. The following DATA step records the results in the events/trials syntax. The R variable records the number of ingots that are not ready for rolling, and the N variable contains the number of ingots in the batch at the given values of the HEAT and SOAK variables. Consequently, R/N is the proportion of ingots that were not ready. For example, the values R=4 and N=44 mean that 4 ingots (9%) were not ready for rolling in a batch that contained 44. Regression procedures such as PROC LOGISTIC enable you to analyze data by using an events/trials syntax:

/* events/trials syntax */
data ingots;
   input Heat Soak r n @@;
   datalines;
7 1.0 0 10  14 1.0 0 31  27 1.0 1 56  51 1.0 3 13
7 1.7 0 17  14 1.7 0 43  27 1.7 4 44  51 1.7 0  1
7 2.2 0  7  14 2.2 2 33  27 2.2 0 21  51 2.2 0  1
7 2.8 0 12  14 2.8 0 31  27 2.8 1 22  51 4.0 0  1
7 4.0 0  9  14 4.0 0 19  27 4.0 1 16
;
 
proc logistic data=ingots;
   model r/n = Heat Soak;
   ods select NObs ResponseProfile ParameterEstimates;
run;

The "NObs" table shows that there are 19 rows of data, which represents 387 units. The "Response Profile" table counts and displays the number of events and the number of trials. For these data, 12 units were not ready for rolling. The "ParameterEstimates" table provides estimates for the Intercept and the coefficients of the effects in the logistic regression model.

How to expand event/trials data

You can expand these data by creating 387 observations, one for each unit. The following DATA step creates a variable named BATCH that enumerates the original 19 batches of ingots. The ID variable enumerates the units in each batch. Some of the units were ready for rolling and some were not. You can create an indicator variable (NOTREADY) that arbitrarily sets the first R ingots as "not ready" (NOTREADY=1) and the remaining ingots as ready (NOTREADY=0).

/* expand or unroll the events/trials data */
data RawIngots;
set ingots;
BatchID + 1;
do ID = 1 to n;
   NotReady = (ID <= r);
   output;
end;
run;
 
/* analyze the binary variable NOTREADY */
proc logistic data=RawIngots;
   model NotReady(event='1') = Heat Soak;
   ods select NObs ResponseProfile ParameterEstimates;
run;

The "NOBs" and "Response Profile" tables are different from before, but they contain the same information. The "ParameterEstimates" table is the same and is not shown.

Summary

In categorical data analysis, it is common to analyze data that are in frequency form. Each observation represents some number of subjects, and the number is given by the value of a Count variable. In some situations, it is useful to "unroll" or expand the data. If an observation has frequency C, it results in C observations in the expanded data set. This article shows how to use the SAS DATA step to expand frequency data and data that are in events/trials syntax.

The post How to unroll frequency data appeared first on The DO Loop.

May 2, 2022
 

Recently, I wrote about Bartlett's test for sphericity. The purpose of this hypothesis test is to determine whether the variables in the data are uncorrelated. It works by testing whether the sample correlation matrix is close to the identity matrix.

Often statistics textbooks or articles include a statement such as "under the null hypothesis, the test statistic is distributed as a chi-square statistic with DF degrees of freedom." That sentence contains a lot of information! Using algebraic formulas and probability theory to describe a sampling distribution can be very complicated. But it is straightforward to describe a sampling distribution by using simulation and statistical graphics.

This article discusses how to use simulation to understand the sampling distribution of a statistic under a null hypothesis. You can find many univariate examples of simulating a sampling distribution for univariate data, but this article shows the sampling distribution of a statistic for a hypothesis test on multivariate data. The hypothesis test in this article is Bartlett's sphericity test, but the ideas apply to other statistics. It also shows a useful trick for simulating correlation matrices for multivariate normal (MVN) data.

The simulation approach to null distributions

Simulation is well-suited for explaining the sampling distribution of a statistic under the null hypothesis. For Bartlett's sphericity test, the null hypothesis is that the data are a random sample of size N from a multivariate normal population of p uncorrelated variables. Equivalently, the correlation matrix for the population is the p x p identity matrix. Under this hypothesis, the test statistic
\(T = -\log(\det(R)) (N-1-(2p+5)/6)\)
has a chi-square distribution with p(p–1)/2 degrees of freedom, where R is the sample correlation matrix. In other words, if we randomly sample many times from an uncorrelated MVN distribution, the statistics for each sample will follow a chi-square distribution. Let's use simulation to visualize the null distribution for Bartlett's test.

Here is a useful fact: You do not have to generate MVN data if the statistic is related to the covariance or correlation of the data. Instead, you can DIRECTLY simulate a correlation matrix for MVN data by using the Wishart distribution. The following SAS/IML statements use the Wishart distribution to simulate 10,000 correlation matrices for MVN(0, Σ) samples, where Σ is a diagonal covariance matrix:

proc iml;
call randseed(12345);
p = 3;                               /* number of MVN variables */
N = 50;                              /* MVN sample size   */
Sigma = diag(1:p);                   /* population covariance */
NumSamples = 10000;                  /* number of samples in simulation */
 
/* Simulate correlation matrices for MVN data by using the Wishart distribution.
   Each row of A is a scatter matrix; each row of B is a covariance matrix */
A = RandWishart(NumSamples, N-1, Sigma); /* N-1 degrees of freedom */
B = A / (N-1);                           /* rescale to form covariance matrix */
 
/* print one row to show an example */
C = shape(B[1,], p);                     /* reshape 1st row to p x p matrix */
R = cov2corr(C);                         /* convert covariance to correlation */
print C, R;

The output shows one random covariance matrix (C) and its associated correlation matrix (R) from among the 10,000 random matrices. The B matrix is 10000 x 9, and each row is a sample covariance matrix for a MVN(0, Σ) sample that has N=50 observations.

Recall that the determinant of a correlation matrix is always in [0,1]. The determinant equals 1 only when R=I(p). Therefore, the expression -log(det(R)) is close to 0 when R is close to the identity matrix and gets more and more positive as R is farther and farther away from the identity matrix. (It is undefined if R is singular.) So if we apply Bartlett's formula to each of the random matrices, we expect to get a lot of statistics that are close to 0 (because R should be close to the identity) and a few that are far away. The following SAS IML program carries out this method and plots the 10,000 statistics that result:

T = j(numSamples, 1);
do i = 1 to NumSamples;
   cov = shape(B[i,], p, p);             /* convert each row to square covariance matrix */
   R = cov2corr(cov);                    /* convert covariance to correlation */
   T[i] = -log(det(R)) * (N-1-(2*p+5)/6);
end;
 
title "Bartlett's Sphericity Statistic";
title2 "&numSamples MVN(0, Sigma) Samples for Diagonal Covariance";
call histogram(T) rebin={0.25 0.5};

The histogram shows the null distribution, which is the distribution of Bartlett's statistic under the null hypothesis. As expected, most statistics are close to 0. Only a few are far from 0, where "far" is a relative term that depends on the dimension of the data (p).

Knowing that the null distribution is a chi-square distribution with DF=p(p-1)/2 degrees of freedom helps to provide a quantitative value to "far from 0." You can use the 95th percentile of the chi-square(DF) distribution to decide whether a sample correlation matrix is "far from 0":

/* critical value of chi-square(3) */
DF = p*(p-1)/2;
crit = quantile("ChiSq", 0.95, DF);  /* one-sided: Pr(chiSq < crit)=0.95 */
print crit;

You can overlay the chi-square(3) distribution on the null distribution and add the critical value to obtain the following figure:
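
One way to create the overlay (a sketch of one possible approach, not necessarily the code used for the original figure) is to write the simulated statistics and the chi-square density to data sets and then graph them with PROC SGPLOT:

/* sketch: overlay the chi-square(DF) density and the critical value on the null distribution */
create Bartlett var {T};  append;  close;      /* the simulated statistics */
x = do(0.05, 20, 0.05);
Density = pdf("ChiSq", x, DF);                 /* asymptotic chi-square density */
create Curve var {x Density};  append;  close;

submit crit;
data NullDist;  set Bartlett Curve;  run;
proc sgplot data=NullDist noautolegend;
   histogram T / scale=density;
   series x=x y=Density;
   refline &crit / axis=x label="Critical Value";
run;
endsubmit;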

This graph summarizes the Bartlett sphericity test. The histogram approximates the null distribution by computing Bartlett's statistic on 10,000 random samples from an uncorrelated trivariate normal distribution. The curve is the asymptotic chi-square(DF=3) distribution. The vertical line is the critical value for testing the hypothesis at the 0.05 significance level. The next section uses this graph and Bartlett's test to determine whether real data are a sample from an uncorrelated MVN distribution.

Bartlett's test on real data

The previous graph shows the null distribution. If you run the test on a sample that contains three variables and 50 observations, you will get a value of the test statistic. If the value is greater than 7.8, it is unlikely that the data are from an uncorrelated multivariate normal distribution.

A previous article showed how to use PROC FACTOR to run Bartlett's test in SAS. Let's run PROC FACTOR on 50 observations and three variables of Fisher's Iris data:

proc factor data=sashelp.iris(obs=50) method=ML heywood;
   var SepalLength SepalWidth PetalLength ;
   ods select SignifTests; /* output only Bartlett's test */
run;

The value of Bartlett's statistic on these data is 41.35. The X axis for the null distribution only goes up to 20, so this value is literally "off the chart"! You would reject the null hypothesis for these data and conclude that these data do not come from the null distribution.

When researchers see this result, they often assume that one or more variables in the data are correlated. However, it could also be the case that the data are not multivariate normal since normality is an assumption that was used to generate the null distribution.

Summary

This article shows how to simulate the null distribution for Bartlett's sphericity test. The null distribution is obtained by simulating many data samples from an uncorrelated multivariate normal distribution and graphing the distribution of the test statistics. For Bartlett's test, you get a chi-square distribution.

The ideas in this article apply more generally to other hypothesis tests. An important use of simulation is to approximate the null distribution for tests when an exact form of the distribution is not known.

This article also shows that you can use the Wishart distribution to avoid having to simulate MVN data. If the goal of the simulation is to obtain a covariance or correlation matrix for MVN data, you can use the Wishart distribution to simulate the matrix directly.

The post Simulate the null distribution for a hypothesis test appeared first on The DO Loop.

April 20, 2022
 

Recently, I showed how to use a heat map to visualize measurements over time for a set of patients in a longitudinal study. The visualization is sometimes called a lasagna plot because it presents an alternative to the usual spaghetti plot. A reader asked whether a similar visualization can be created for subjects if the response is an ordinal variable, such as a count. Yes! And the heat map approach is a substantial improvement over a spaghetti plot in this situation.

This article pulls together several techniques from previous articles, including the use of SAS formats, discrete attribute maps, and heat maps.

This article uses the following data, which represent the counts of malaria cases at five clinics over a 14-week time period:

data Clinical;
input SiteID @;
do Week = 1 to 14;
   input Count @;
   output;
end;
/* ID Wk1  Wk2  Wk3  Wk4 ... Wk14 */
datalines;
001  1 0 0 0 0 0 0 3 1 3 3 0 3 0
002  0 0 0 1 1 2 1 2 2 1 1 0 2 2
003  1 . . 1 0 1 0 3 . 1 0 3 2 1
004  1 1 . 1 0 1 2 2 3 2 1 0 . 0
005  1 1 1 . 0 0 0 1 0 1 2 4 5 1
;
 
title "Spaghetti Plot of Counts for Five Clinics";
proc sgplot data=Clinical;
   series x=Week y=Count / group=SiteID; 
   xaxis integer values=(1 to 14) valueshint;
run;

The line plot is not an effective way to visualize these data. In fact, it is almost useless. Because the counts are discrete integer values, and most counts are in the range [0, 3], the graph cannot clearly show the weekly values for any one clinic. The following sections develop a heat map that visualizes these data better.

Format the raw values

The following call to PROC FORMAT defines a format that associates character strings with values of the COUNT variable:

proc format;
value CountFmt
      . = "Not Counted"
      0 = "None"
      1 = "1"
      2 = "2"
      3 = "3" 
      4 - high = "4+";
run;

You can use this format to encode values and display them in a legend. Notice that you could also use a format to combine counts, such as using the word "Few" to describe 2 or 3 counts.
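
For example, a hypothetical variation of the format that combines counts into descriptive categories might look like this:

proc format;
value CountGroupFmt
      . = "Not Counted"
      0 = "None"
      1 = "One"
      2 - 3    = "Few"
      4 - high = "Many";
run;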

Associate colors to formatted values

A discrete attribute map ensures that the colors will not change if the data change. For ordinal data, it also ensures that the legend will be in ordinal order, as opposed to "data order" or alphabetical order.

You can use a discrete attribute map to associate graphical attributes with the formatted value of a variable. Examples of graphical attributes include marker colors, marker symbols, line colors, and line patterns. For a heat map, you want to associate the "fill color" of each cell with a formatted value. The following DATA step creates the mapping between values and colors. Notice that I use the PUTN function to apply the format to raw data values. This ensures that the mapping correctly associates formatted values with colors. The raw values are stored in an array (VAL) as are the colors (COLOR). This makes it easy to modify the map in the future or to adapt it to other situations.

data MyAttrs;
length Value $11 FillColor $20;
retain ID 'MalariaCount'               /* name of map */
     Show 'AttrMap';                   /* always show all groups in legend */
 
/* output the formatted value and color for a missing value */
Value = putn(., "CountFmt.");          /* formatted value */
FillColor = "LightCyan";               /* color for missing value */
output;
/* output the formatted values and colors for nonmissing values */
array   val[5]     _temporary_ (0 1 2 3 4);
array color[5] $20 _temporary_ ('White' 'CXFFFFB2' 'CXFECC5C' 'CXFD8D3C' 'CXE31A1C');
do i = 1 to dim(val);
   Value = putn(val[i], "CountFmt.");  /* formatted value for this raw value */
   FillColor = color[i];               /* color for this formatted value */
   output;
end;
drop i;
run;

Create a discrete heat map

Now you can create a heat map that uses the format and discrete attribute map from the previous sections. To use the map, you must specify two pieces of information:
  • Use the DATTRMAP= option on the PROC SGPLOT statement to specify the name of the data set that contains the map.
  • Because a data set can contain multiple maps, use the ATTRID= option on the HEATMAPPARM statement to specify the value of the ID variable that contains the attributes for these data.
title "Heat Map of Malaria Data";
proc sgplot data=Clinical DATTRMAP=MyAttrs; /* <== the data set that contains attributes */
   format Count CountFmt.;                  /* <== apply the format to bin data */
   heatmapparm x=Week y=SiteID colorgroup=Count / outline outlineattrs=(color=gray)
               ATTRID=MalariaCount;         /* <== the discrete attribute map for these data */
   discretelegend;
   refline (1.5 to 5.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(1 to 14) valueshint;
run;

The heat map makes the data much clearer than the spaghetti plot. For each clinic and each week, you can determine how many cases of malaria were seen. As mentioned earlier, you can use this same technique to classify counts into categories such as None, Few, Moderate, Many, and so forth.

Summary

You can use SAS formats to bin numerical data into ordinal categories. You can use a discrete attribute map to associate colors and other graphical attributes with categories. By combining these techniques, you can create a heat map that enables you to visualize ordinal responses for subjects over time. The heat map is much better at visualizing the data than a spaghetti plot is.

The post Use a heat map to visualize an ordinal response in longitudinal data appeared first on The DO Loop.

April 18, 2022
 

What is McNemar's test? How do you run the McNemar test in SAS? Why might other statistical software report a value for McNemar's test that is different from the SAS value? SAS supports an exact version of the McNemar test, but when should you use it? This article answers these questions about McNemar's test in SAS.

The McNemar test

The McNemar test (also called McNemar's test) is a way to test whether the off-diagonal elements in a 2 x 2 frequency table are significantly different. The data are assumed to be matched-pairs data, which means that they represent the responses of one set of subjects measured before and after a treatment or intervention. Situations in which you could use McNemar's test include the following:

  • Students are given a test to assess whether they can read at grade level. They go through an enrichment program and are tested again.
  • Patients are asked whether they are feeling pain. They are given a treatment and are asked about pain again.

In both examples, the same subjects are tested twice. You can use the McNemar test to assess whether there is evidence that the proportion of negative results (not reading at grade level, not feeling pain) is different after the intervention as compared to before the intervention.

Let's use some educational data. A total of 68 students were given a test before and after an eight-week enrichment program (DeMars, 2015). On the pre-test, 51 students failed and 17 passed. On the post-test, 41 students failed and 27 passed. The following 2 x 2 table shows the counts of students who failed both tests, passed one but failed the other, or passed both tests:

                  Post-Test: Fail   Post-Test: Pass
  Pre-Test: Fail        38                13
  Pre-Test: Pass         3                14

McNemar's test determines whether the off-diagonal elements (3 and 13) are significantly different. Equivalently, it determines whether their difference is significantly different from 0. In general, label the cells of a 2 x 2 table as follows:

                  Post-Test: Fail   Post-Test: Pass
  Pre-Test: Fail         a                 b
  Pre-Test: Pass         c                 d

The null hypothesis for McNemar's test is that the intervention has no effect, which means that the probabilities for the cells b and c are equal.

McNemar's test in SAS

You can use the AGREE option on the TABLES statement in PROC FREQ to run the McNemar test in SAS. The test statistic is the quantity M = (b - c)² / (b + c). When b and c are both large, the statistic is distributed as a chi-squared variable with DF=1 degrees of freedom. The following SAS program defines the data and uses the AGREE option to run the McNemar test on this example:

/* McNemar's test: https://en.wikipedia.org/wiki/McNemar's_test 
   Data from DeMars 2015 https://www.thejuliagroup.com/documents/De_Mars_WUSS2015.pdf */
data Students;
input PassPre PassPost Count;
datalines;
0 0 38 
0 1 13
1 0  3
1 1 14
;
 
proc freq data=Students;
  tables PassPre*PassPost / nopercent norow nocol agree;
  weight Count;
  ods select CrossTabFreqs McNemarsTest;
run;

Notice how the data are entered. Each row of the input data set represents a group of students of size COUNT. The binary values of the PassPre and PassPost variables indicate whether the students passed (1) or did not pass (0) each test. The WEIGHT statement in PROC FREQ is used to specify the count for each possible pair of outcomes. You could also enter the data by using 68 rows, where each row represents a student, as sketched below.
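
A minimal sketch (not part of the original article) of that alternative uses a DO loop to expand the frequency data into one row per student and then runs the same analysis without a WEIGHT statement:

data StudentsRaw;
set Students;
do ID = 1 to Count;    /* one output row per student in this group */
   output;
end;
drop Count;
run;

proc freq data=StudentsRaw;
  tables PassPre*PassPost / nopercent norow nocol agree;
  /* no WEIGHT statement: each row represents one student */
run;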

The value of the McNemar statistic is M = (13 - 3)² / (13 + 3) = 6.25. The p-value is 0.0124, which means that it is unlikely to observe a statistic this large by chance if the intervention has no effect. Think of it this way: if you distribute 16 students at random to the two off-diagonal cells, there is only a small chance that the distribution would be this lopsided (13 vs 3) if the chance of landing in either cell is 50-50. Therefore, you reject that hypothesis and conclude that the intervention had an effect on the students who failed one test and passed the other.

The exact McNemar test

The assumption that McNemar's statistic is chi-square distributed is an asymptotic result for large values of the b and c frequencies. If (b + c) is small (Wikipedia suggests less than 25), then the sampling distribution for the statistic might not follow a chi-square distribution. In that case, you should use an exact test. For the data in this example, the sum of the off-diagonal frequencies is b + c = 16. In PROC FREQ in SAS, you can run an exact test by using the EXACT MCNEM statement, as follows:

proc freq data=Students;
  tables PassPre*PassPost / nopercent norow nocol agree;
  weight Count;
  exact McNem;
  ods select McNemarsTest;
run;

The output shows that the p-value for the exact test is larger than the p-value under the chi-squared assumption.

Why might other software report a different answer?

This article was inspired by a SAS customer who compared the McNemar test in SAS to the results from another software package. The other software reported a different answer. Whenever you compare software, it is important to understand the default settings and the definitions of the statistics. After some research, we discovered that the other software was not giving the result of McNemar's test, but of a variation known as Edwards's correction to McNemar's test. Whereas the test statistic for McNemar's test is M = (b - c)² / (b + c), Edwards's correction uses the statistic M2 = (|b - c| - 1)² / (b + c). Because the M2 statistic subtracts 1 from the numerator, it is always less than McNemar's statistic. Consequently, the associated p-value for Edwards's correction is larger than the p-value for McNemar's statistic.

The formulas for these tests are easily implemented in the SAS DATA step or in the SAS/IML language. The following call to PROC IML computes the statistic and p-values:

/* compute McNemar's chi-square test manually */
proc iml;
use Students;
   read all var {PassPre PassPost Count};
close;
 
DF = 1;
Tab = shape(Count, 2, 2);         /* make a 2 x 2 table */
B = Tab[1,2];                     /* 1st row; 2nd col */
C = Tab[2,1];                     /* 2nd row; 1st col */
M  = (B-C)**2 / (B+C);            /* McNemar test statistic */
M2 = (abs(B-C) - 1)**2 /(B + C);  /* Edwards's correction */
pValMc = 1 - cdf("ChiSq", M, DF); /* p-value McNemar */
pValEd = 1 - cdf("ChiSq", M2, DF);/* p-value Edwards */
 
print (M ||DF ||pValMc)[c={'McNemar' 'DF' 'Pr > ChiSq'} L="McNemar's Test"];
print (M2 ||DF ||pValEd)[c={'Edwards' 'DF' 'Pr > ChiSq'} L="Edwards's Correction"];

If your software reports a value for McNemar's test that is smaller than the value reported by SAS, it is possible that the software is applying Edwards's correction.

The McNemar test is easy to code manually, and the exact McNemar test is not much harder. Let S = b + c be the sum of the off-diagonal elements. (S=16 in this example.) The null hypothesis is that there is no difference in the proportion of students in the off-diagonal elements. Under this hypothesis, the probability of seeing the value k in one cell and S-k in the other cell is given by the binomial density PDF("Binom", k, 0.5, S). The p-value is the sum of the probabilities for values of k that are more extreme than the observed values. For this problem, it is the sum of the probabilities for k=0, 1, 2, 3, 13, 14, 15, 16.

The sum of a discrete PDF can be computed by using the cumulative distribution (CDF). Because the binomial distribution is symmetric, we only need to compute the CDF for k ≤ 3 (the left tail) and multiply the result by 2 to obtain the probability for the right tail. The following SAS/IML statements implement the exact test:

/* compute McNemar's exact test manually */
S = B + C;                         /* sum of off-diagonal elements */
k = min(B,C);                      /* find extreme value */
ExactpVal = 2*cdf("Binomial", k, 0.5, S);
result = (M ||DF ||pValMc ||ExactpVal);
print result[c={'McNemar' 'DF' 'Pr > ChiSq' 'Exact Pr >= ChiSq'} L="McNemar's Test"];

The output confirms the result of the exact p-value from PROC FREQ. Notice that the p-value from Edwards's correction is quite close to the exact p-value.

Summary

This article shows how to perform the McNemar test in SAS. You can get asymptotic and exact p-values. The exact p-value is useful when the sum of the off-diagonal frequencies is less than 25. Be aware that some software might compute a variation of the test that uses Edwards's correction formula. The statistic for Edwards's correction is always less than McNemar's statistic. Although PROC FREQ does not implement Edwards's correction, it is easy to compute manually, as shown in this article.

The post The McNemar test in SAS appeared first on The DO Loop.

March 23, 2022
 

This article shows how to compute properties of a discrete probability distribution from basic definitions. You can use the definitions to compute the mean, variance, and median of a discrete probability distribution when there is no simple formula for those quantities.

This article is motivated by two computational questions about discrete probability distributions. The first question came from a SAS customer who asked how to calculate the mean of a discrete probability distribution. The second question concerns the median of the binomial distribution, which is an example of a discrete probability distribution. When I was looking up the binomial distribution on Wikipedia, I read the following sentence: "In general, there is no single formula to find the median for a binomial distribution." This led me to think about how software computes the median of discrete distributions.

Moments of a discrete probability distribution

Let X be a discrete random variable with density function f. We assume that X has a well-defined mean, variance, skewness, and kurtosis.

The mean value of X is the first moment of f. The mean is computed as a weighted average of the density values: μ = Σi xi f(xi),
where the sum is over all possible values of X.

The variance, skewness, and kurtosis of a probability distribution are all central moments. They are also defined as weighted sums where the weights are powers of the deviation from the mean. For example, the variance of any discrete distribution is the second central moment: σ² = Σi (xi – μ)² f(xi).

Let's see how to compute the mean and the variance by using SAS. To demonstrate, let's use the binomial distribution with parameters p=0.33 and n=21. That is, X ~ Binom(p, n) gives the number of successes in n=21 independent trials where the probability of success is p=0.33 for each trial. For these parameters, X can take values in the range {0,1,2,...,21}. A graph of the density function is shown to the right.

The mean and variance of the binomial distribution are known to be μ = np and σ2 = np(1-p), respectively, so we can check that our computations are correct.

You can use the SAS DATA step to compute these quantities, but I will demonstrate the formulas by using the IML procedure:

proc iml;
p = 0.33;                      /* prob of success for each trial */
n = 21;                        /* number of independent trials */
 
/* apply the definitions */
x = 0:n;                       /* all possible values of X */
pdf = pdf("Binom", x, p, n);   /* density=pdf(), or use density formula */
meanDef = sum(x#pdf);          /* or inner product x`*pdf */
varDef  = sum((x-meanDef)##2 # pdf);
 
/* Check by using the formulas for Binom(p, n):
   mean = n*p   -and-   variance = n*p*(1-p)
*/
mu = n*p;
var = n*p*(1-p);
print p n meanDef mu varDef var;

In the program, the mean and variance are computed from first principles by using the definition of the mean and variance. You can perform this computation for any discrete distribution. For the binomial distribution, there are simple formulas for the mean and variance. The output shows that the formulas agree with the computation.

In this example, the range of the discrete random variable is finite, so it is easy to sum over all possible values of X. If X can take on infinitely many values, the computation becomes more complicated. Examples of distributions that have an infinite domain for the density include the geometric distribution and the Poisson distribution. For distributions like these, the density must decrease geometrically, so there is some large number N such that for |X|>N the probability in the tails of the distribution is negligible. This reduces the computation to a finite sum.
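
As a hypothetical illustration (not part of the original article), the following statements continue the PROC IML session and approximate the mean of a Poisson(λ=4) distribution by truncating the infinite sum at a cutoff for which the tail probability is negligible. The exact mean is λ=4:

/* truncate the infinite sum for a Poisson(4) distribution */
lambda = 4;
Ntail = 1;
do while (1 - cdf("Poisson", Ntail, lambda) > 1e-12);
   Ntail = Ntail + 1;              /* find a cutoff with negligible tail probability */
end;
xPois   = 0:Ntail;                 /* truncated domain */
pdfPois = pdf("Poisson", xPois, lambda);
meanPois = sum(xPois # pdfPois);   /* approximates the infinite weighted sum */
print Ntail meanPois lambda;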

The definition of the median of a discrete distribution

In many textbooks, the median for a discrete distribution is defined as the value X=m such that at least 50% of the probability is less than or equal to m and at least 50% of the probability is greater than or equal to m. In symbols, P(X≤m) ≥ 1/2 and P(X≥m) ≥ 1/2.

Unfortunately, this definition might not produce a unique median. For example, the binomial distribution Binom(p=0.5, n=21) does not have a unique median because:

  • m=10 is a median because P(X≤10) = 0.5 and P(X≥10) ≥ 0.5.
  • By symmetry, m=11 is a median because P(X≤11) ≥ 0.5 and P(X≥11) = 0.5.

By convention, most software reports the median to be the LEAST such number that satisfies the definition. So, for the Binom(p=0.5, n=21) distribution, most software reports 10, not 11, as the median.

With this convention, you can compute the median from first principles by summing the densities and reporting the first value of X for which the cumulative density equals or exceeds 0.5. The following statements compute the median for Binom(p=0.33, n=21) by using the definition and check the answer by using the SAS QUANTILE function:

/* use the definition of median to find the first value of X for which CDF >= 0.5 */
cdf = cusum(pdf);
GEHalf = loc(cdf >= 0.5);
medianDef = x[ GEHalf[1] ];  /* first value where CDF >= 1/2 */
 
/* check by calling the built-in QUANTILE function */
median = quantile("Binom", 0.5, p, n); 
print p n medianDef median;

Summary

Although many well-known probability distributions have simple formulas for the mean, variance, and median of the distribution, not all distributions have formulas. This article shows how to compute the mean, variance, and median of a discrete probability distribution from basic definitions. If the random variable X has density function f, then:

  • The mean is μ = Σi xi f(xi)
  • The variance is σ² = Σi (xi – μ)² f(xi)
  • The median, m, is the smallest value of X for which P(X≤m) ≥ 1/2.

The article shows how to compute these quantities in SAS by using SAS/IML software. I leave the DATA step computation for an exercise.

The post Compute properties of discrete probability distributions appeared first on The DO Loop.


March 21, 2022
 

Statistical programmers need to access numerical constants that help us to write robust and accurate programs. Specifically, it is necessary to know when it is safe to perform numerical operations such as raising a number to a power without exceeding the largest number that is representable in finite-precision arithmetic. This article discusses five constants that every statistical programmer must know: PI, MACEPS, EXACTINT, BIG, and SMALL. In the SAS language, you can find the values of these constants by using the CONSTANT function in Base SAS.

The following table shows the values of the constants that are discussed in this article:

data Constants;
length Const $10;
format Value Best12.;
input Const @@;
Value = constant(Const);
datalines;
PI MACEPS EXACTINT 
BIG LOGBIG SQRTBIG SMALL
;
 
proc print noobs; run;

Pi and other mathematical constants

The CONSTANT function provides values for various mathematical constants. Of these, the most important constant is π ≈ 3.14159..., but you can also obtain other mathematical constants such as the natural base (e) and the golden ratio (φ). To get an approximation for π, use pi = constant('pi'). The number π is essential for working with angles and trigonometric functions. For an example, see the article, "Polygons, pi, and linear approximations," which uses π to create regular polygons.

Machine epsilon

Arguably the most important constant in computer science is machine epsilon. This number enables you to perform floating-point comparisons such as deciding whether two numbers are equal or determining the rank of a numerical matrix. To get machine epsilon, use eps = constant('maceps').
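
For example, here is a minimal hypothetical sketch of such a comparison. (Using SQRT(MACEPS) as the tolerance is one common convention, not a universal rule.)

data _null_;
   eps = constant('maceps');
   x = 0.1 + 0.2;                  /* not exactly 0.3 in binary arithmetic */
   y = 0.3;
   if abs(x - y) < sqrt(eps) then put "x and y are numerically equal";
   else put "x and y are different";
run;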

The largest representable integer

It is important to know the largest number that can be represented accurately in finite precision on your machine. This number is available as bigInt = constant('exactint'). This number enables you to find the largest factorial that you can compute exactly (18!) and the largest row of Pascal's triangle (56) that you can faithfully represent in double precision.
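
A hypothetical check of those claims (the FACT function computes factorials):

data _null_;
   bigInt = constant('exactint');  /* 2**53 = 9,007,199,254,740,992 */
   f18 = fact(18);                 /* 6,402,373,705,728,000 is less than EXACTINT */
   f19 = fact(19);                 /* exceeds EXACTINT, so it is not represented exactly */
   put bigInt= f18= f19=;
run;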

The largest representable double

Except for π, the constants I use the most are related to BIG ≈ 1.8E308. I primarily use LOGBIG and SQRTBIG, which you can compute as
logbig = constant('logbig');
sqrtbig = constant('sqrtbig');
These constants are useful for preventing overflow when you perform arithmetic operations on large numbers:

  • The quantity exp(x) can only be computed when x is less than LOGBIG ≈ 709.78.
  • The LOGBIG option supports the BASE suboption (BASE > 1), which you can use to ensure that raising a number to a power does not overflow. For example, constant('logbig', 2) returns 1024 because 2**1024 is the largest power of 2 that does not exceed BIG.
  • The SQRTBIG option tells you whether you can square a number without overflowing.
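
For example, a minimal hypothetical sketch of the first item in the preceding list guards against overflow before calling the EXP function:

data _null_;
   logbig = constant('logbig');
   x = 800;                        /* exp(800) would overflow */
   if x < logbig then y = exp(x);
   else put "exp(" x ") would overflow; the largest safe argument is " logbig;
run;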

The smallest representable double

What is the smallest positive floating-point number on your computer? It is given by small = constant('small'). There are also LOGSMALL and SQRTSMALL versions, which you can use to prevent underflow. I don't use these constants as frequently as their 'BIG' counterparts. In my experience, underflow is usually not a problem in SAS.
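
A similar hypothetical guard uses LOGSMALL to avoid underflow:

data _null_;
   small    = constant('small');
   logsmall = constant('logsmall');
   x = -750;                       /* exp(-750) would underflow */
   if x > logsmall then y = exp(x);
   else y = 0;                     /* treat the underflowed result as 0 */
   put small= logsmall= y=;
run;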

Summary

This article discusses five constants that every statistical programmer must know: PI, MACEPS, EXACTINT, BIG, and SMALL. Whether you need mathematical constants (such as π) will depend on the programs that you write. The MACEPS constant is used to compare floating-point numbers. The other constants are used by computer scientists and numerical analysts to ensure that programs can correctly compute with very large (or very small) numbers without encountering floating-point overflow (or underflow).

The post Five constants every statistical programmer should know appeared first on The DO Loop.

March 16, 2022
 

A previous article showed how to use SAS to compute finite-difference derivatives of smooth vector-valued multivariate functions. The article uses the NLPFDD subroutine in SAS/IML to compute the finite-difference derivatives. The article states that the third output argument of the NLPFDD subroutine "contains the matrix product J`*J, where J is the Jacobian matrix. This product is used in some numerical methods, such as the Gauss-Newton algorithm, to minimize the value of a vector-valued function."

This article expands on that statement and shows that you can use the SAS/IML matrix language to implement the Gauss-Newton method by writing only a few lines of code.

The NLPFDD subroutine has been in SAS/IML software since SAS version 6. In SAS Viya, you can use the FDDSOLVE subroutine, which is a newer finite-difference function that has a syntax similar to the NLPFDD subroutine.

A review of least-squares minimization

Statisticians often use least-squares minimization in the context of linear or nonlinear least-squares (LS) regression. In LS regression, you have an observed set of response values {Y1, Y2, ..., Yn}, and you fit a parametric model to obtain a set of predicted values {P1, P2, ..., Pn}. These predictions depend on parameters, and you try to find the parameters that minimize the sum of the squared residuals, Σi (Pi - Yi)², where the sum is over i=1, 2, ..., n.

In this formula, the parameters are not explicitly written, but this formulation is the least-squares minimization of a function that maps each parameter to the sum of squared residuals. Abstractly, if F: Rn → Rm is a smooth vector-valued function with m components, then you can write it as F(x) = (F1(x), F2(x), ..., Fm(x)). In most applications, the dimension of the domain is less than the dimension of the range: n < m. The goal of least-squares minimization is to find a value of x that minimizes the sum of the squared components of F(x). In vector form, we search for a value of x that (locally) minimizes || F(x) ||². Equivalently, the problem is often formulated so that you minimize (1/2) || F(x) ||². The factor of 1/2 simplifies the expression for the derivative of the objective function.
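
To see why the factor of 1/2 is convenient, write the objective function as g(x) = (1/2) || F(x) ||². A standard identity (not stated explicitly in this article) expresses the gradient of g in terms of the Jacobian matrix J of F:
\(\nabla g(x) = J(x)^{T} F(x)\)
This quantity appears (with a minus sign) on the right-hand side of the Gauss-Newton linear system that is solved later in this article.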

Visualize a least-squares function

The previous article used the following vector-valued function:

proc iml;
 
q = {0, 2, 1};               /* define a point in R^3 */
start VecF(st) global(q);    /* compute vector from q to a point on a cylinder */
   s = st[1]; t = st[2];
   F = j(3, 1, .);
   F[1] = cos(s) - q[1];
   F[2] = sin(s) - q[2];
   F[3] =     t  - q[3];
   return( F );
finish;

Geometrically, this function represents the vector from q = (0, 2, 1) to a point on a cylinder in R3. Therefore, the norm of the vector-valued function is minimized when (s,t) = (pi/2, 1), since that is the point on the cylinder closest to the point q, which is not on the cylinder.

Let's visualize the function (1/2) || F(s,t) ||² for (s,t) values near (pi/2, 1). You can use the EXPANDGRID function to create a uniform grid of points in the (s,t) domain. For each point on the grid, you can use the following statements to compute (one-half) the sum of the squares of the vector-valued function. To compute the sum of squared components, you can use the SSQ function or the NORM function, but I have decided to use the inner-product F`*F.

/* create heat map of VecF */
s = do(1, 2.1, 0.02);       /* horiz grid points */
t = do(0.5, 1.5, 0.02);     /* vert grid points */
st = expandgrid(s, t);
z = j(nrow(st), 1);
do i = 1 to nrow(st);
   F = VecF( st[i,] ); 
   z[i] = 0.5 * F`*F;       /* 1/2 sum of squares of components */
end;
 
result = st||z;
create Heat from result[c={'s' 't' 'z'}]; append from result; close;
 
submit;
title "1/2 || F(x,y) ||^2";
proc sgplot data=Heat;
   heatmapparm x=s y=t colorresponse=z / colormodel=ThreeColorRamp;
   refline 1.571/axis=x label='pi/2' labelpos=min; refline 1/axis=y;
run;
endsubmit;

The graph shows the values of z = (1/2) || F(s,t) ||² on a uniform grid. The reference lines intersect at (pi/2, 1), which is the location of the minimum value of z.

The Gauss-Newton method for minimizing least-squares problems

One way to solve a least-squares minimization is to expand the expression (1/2) ||F(s,t)||2 in terms of the component functions. You get a scalar function of (s,t), so you can use a traditional optimization method, such as the Newton-Raphson method, which you can call by using the NLPNRA subroutine in SAS/IML. Alternatively, you can use the NLPLM subroutine to minimize the least-squares problem directly by using the vector-valued function. In this section, I show a third method: using matrix operations in SAS/IML to implement a basic Gauss-Newton method from first principles.

The Gauss-Newton method is an iterative method that does not require using any second derivatives. It begins with an initial guess, then modifies the guess by using information in the Jacobian matrix. Since each row in the Jacobian matrix is a gradient of a component function, the Gauss-Newton method is similar to a gradient descent method for a scalar-valued function. It tries to move "downhill" towards a local minimum.

Rather than derive the formula from first principles, I will merely present the Gauss-Newton algorithm, which you can find in Wikipedia and in course notes for numerical analysis (Cornell, 2017):

  1. Make an initial guess for the minimizer, x0. Let x = x0 be the current guess.
  2. Evaluate the vector F ≡ F(x) and the Jacobian J ≡ [ ∂Fi / ∂xj ] for i=1,2,..,m and j=1,2...,n. The NLPFDD subroutine in SAS/IML enables you to evaluate the function and approximate the Jacobian matrix with a single call.
  3. Find a direction that decreases || F(x) || by solving the linear system (J`J) h = -J`*F for h.
  4. Take a step in the direction of h and update the guess: xnew = x + α h for some α in (0, 1).
  5. Go to Step 2 until the problem converges.

For any point x in the domain of F, the NLPFDD subroutine returns three results: F, J, and the matrix product J`*J. These are exactly the quantities that are needed to perform the Gauss-Newton method! The following statements implement the algorithm for an initial guess x0 = (2, 0.5):

/* Gauss-Newton line-search method for minimizing || VecF(x) || */
x0 = {2  0.5};       /* initial guess for (s,t) */
parm = {3, 00, .};   /* function has 3 components, use forward difference */
nSteps = 10;         /* take 10 steps regardless of convergence */
alpha = 0.6;         /* step size for line search */
 
path = j(nSteps+1, 2);  /* save and plot the approximations */
path[1,] = x0;
/* start the Gauss-Newton method */
x = x0;
do i = 1 to nSteps;
   call nlpfdd(F, J, JtJ, "VecF", x) par=parm; /* get F, J, and J`J at x */
   h = solve(JtJ, -J`*F);   /* h points downhill */
   x = x + alpha*h`;        /* update the current guess */
   path[i+1,] = x;          /* remember this point */
end;
 
/* write the iterates to a data set */
create path from path[c={'px' 'py'}]; append from path; close;
 
submit alpha;
data All; set Heat path; run;
 
title "Path of Gauss-Newton Minimization";
title2 "alpha = &alpha";
proc sgplot data=All noautolegend;
   heatmapparm x=s y=t colorresponse=z / colormodel=ThreeColorRamp;
   refline 1.571/axis=x label='pi/2' labelpos=min; refline 1/axis=y;
   series x=px y=py / markers;
run;
endsubmit;

The graph overlays the path of the Gauss-Newton approximations on the heat map of (1/2) || F(s,t) ||². For each iterate, you can see that the search direction appears to be "downhill." A more sophisticated algorithm would monitor the norm of h to determine when to stop. You can also use different values of the step length, α, or even vary the step length as you approach the minimum. I leave those interesting variations as an exercise.
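
As one small example of those variations (a hypothetical sketch, not the author's code), you could replace the fixed number of steps with a convergence test that monitors the norm of h:

/* hypothetical variation: iterate until the step is small */
maxIter = 100;
tol = 1e-6;
x = x0;
do iter = 1 to maxIter until (converged);
   call nlpfdd(F, J, JtJ, "VecF", x) par=parm;  /* get F, J, and J`J at x */
   h = solve(JtJ, -J`*F);                       /* Gauss-Newton direction */
   x = x + alpha*h`;                            /* update the current guess */
   converged = (norm(h) < tol);                 /* stop when the step is small */
end;
print iter, x[c={'s' 't'} L="Converged estimate"];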

The main point I want to emphasize is that the NLPFDD subroutine returns all the information you need to implement the Gauss-Newton method. Several times, I have been asked why the NLPFDD subroutine returns the matrix J`*J, which seems like a strange quantity. Now you know that J`*J is one of the terms in the Gauss-Newton method and other similar least-squares optimization methods.

Verify the Gauss-Newton Solution

As I mentioned earlier, SAS/IML software supports two subroutines for solving least-squares minimization problems. The NLPHQN subroutine is a hybrid quasi-Newton algorithm that is similar to (but more sophisticated than) the Gauss-Newton method. Let's use the NLPHQN subroutine to solve the same problem and compare the solutions:

/* solve same problem by using NLPHQN (hybrid quasi-Newton method) */
optn = {3 1};     /* tell NLPQN that VecF has three components; print some partial results */
call NLPHQN(rc, xSoln, "VecF", x0) OPT=optn; 
print (path[11,])[c={'s' 't'} L="Soln from Gauss-Newton"],
       xSoln[c={'s' 't'} L="Soln from NLPHQN"];

The solution from the hybrid quasi-Newton method is very close to the Gauss-Newton solution. If your goal is to solve a least-squares minimization, use the NLPHQN (or NLPLM) subroutine. But you might want to implement your own minimization method for special problems or for educational purposes.

Summary

In summary, this article shows three things:

  • For a vector-valued function, the NLPFDD subroutine evaluates the function and uses finite-difference derivatives to approximate the Jacobian (J) at a point in the domain. The subroutine also returns the matrix product J`*J.
  • The three quantities from the NLPFDD subroutine are exactly the three quantities needed to implement the Gauss-Newton method. By using the NLPFDD subroutine and the matrix language, you can implement the Gauss-Newton method in a few lines of code.
  • The answer from the Gauss-Newton method is very similar to the answer from calling a built-in least-squares solver, such as the NLPHQN subroutine.

Further reading

The post Least-squares optimization and the Gauss-Newton method appeared first on The DO Loop.

March 7, 2022
 

I previously showed how to use SAS to compute finite-difference derivatives for smooth scalar-valued functions of several variables. You can use the NLPFDD subroutine in SAS/IML software to approximate the gradient vector (first derivatives) and the Hessian matrix (second derivatives). The computation uses finite-difference derivatives to approximate the derivatives.

The NLPFDD subroutine also supports approximating the first derivatives of a smooth vector-valued function. If F: Rn → Rm is a vector-valued function with m components, then you can write it as F(x) = (F1(x), F2(x), ..., Fm(x)). The Jacobian matrix, J, is the matrix whose (i,j)th entry is J[i,j] = ∂Fi / ∂xj. That is, the i_th row is the gradient of the i_th component function, Fi, i=1,2,..,m and j=1,2...,n. The NLPFDD subroutine enables you to approximate the Jacobian matrix at any point in the domain of F.

An example of a vector-valued function

A simple example of a vector-valued function is the parameterization of a cylinder, which is a mapping from R2 → R3 given by the function
C(s,t) = (cos(s), sin(s), t)
For any specified values of (s,t), the vector C(s,t) is the vector from the origin to a point on the cylinder. It is sometimes useful to consider basing the vectors at a point q, in which case the function F(s,t) = C(s,t) - q is the vector from q to a point on the cylinder. You can define this function in SAS/IML by using the following statements. You can then use the NLPFDD subroutine to compute the Jacobian matrix of the function at any value of (s,t), as follows:

/* Example of a vector-valued function f: R^n -> R^m for n=2 and m=3.
   Map from (s,t) in [0, 2*pi) x R to the cylinder in R^3.
   The vector is from q = (0, 2, 1) to F(s,t)  */
proc iml;
 
q = {0, 2, 1};               /* define a point in R^3 */
/* vector from q to a point on a cylinder */
start VecF(st) global(q);
   s = st[1]; t = st[2];
   F = j(3, 1, .);
   F[1] = cos(s) - q[1];
   F[2] = sin(s) - q[2];
   F[3] =     t  - q[3];
   return( F );
finish;
 
/* || VecF || has local min at (s,t) = (pi/2, 1), so let's evaluate derivatives at (pi/2, 1) */
pi = constant('pi');
x =  pi/2 || 1;              /* value of (s,t) at which to evaluate derivatives */
/* Parameters: the function has 3 components, use forward difference, choose step according to scale of F */
parm = {3, 00, .};           
call nlpfdd(f, J, JtJ, "VecF", x) par=parm;  /* return f(x), J=DF(x), and J`*J */
 
print f[r={'F_1' 'F_2' 'F_3'} L='F(pi/2, 1)'], 
      J[r={'F_1' 'F_2' 'F_3'} c={'d/ds' 'd/dt'} L='DF(pi/2, 1) [Fwd Diff]'];

The program evaluates the function and the Jacobian at the point (s,t)=(π/2, 1). You must use the PAR= keyword when the function is vector-valued. The PAR= argument is a three-element vector that contains the following information. If you specify a missing value, the default value is used.

  • The first element is the number of components (m) for the function. By default, the NLPFDD subroutine expects a scalar-valued function (m=1), so you must specify PARM[1]=3 for this problem.
  • The second element is the finite-difference method. The subroutine supports several schemes, but I always use either 0 (forward difference) or 1 (central difference). By default, PARM[2]=0. (A sketch after this list shows how to request central differences.)
  • The third element provides a way to choose the step size for the finite-difference method. I recommend that you use the default value.
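
For example, the following statements are a minimal sketch (not part of the original program; they assume the PROC IML session above is still active, and the names parmC, fC, JC, and JtJC are mine) that requests central-difference approximations for the same vector-valued function. Only the second element of the parameter vector changes:

/* sketch: same call, but request central differences (PARM[2]=1) */
parmC = {3, 1, .};           /* 3 components; central difference; default step size */
call nlpfdd(fC, JC, JtJC, "VecF", x) par=parmC;
print JC[r={'F_1' 'F_2' 'F_3'} c={'d/ds' 'd/dt'} L='DF(pi/2, 1) [Cen Diff]'];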

The output from the forward-difference call shows that the value of the function at (s,t)=(π/2, 1) is {0, -1, 0}. The Jacobian at that point is also shown. The gradient of the i-th component function (F_i) is contained in the i-th row of the Jacobian matrix; a sketch after the following list verifies these values against the exact Jacobian.

  • The gradient of the first component function at (π/2, 1) is {-1 0}.
  • The gradient of the second component function at (π/2, 1) is {0 0}, which means that F_2 is at a local extremum. (In this case, F_2 = sin(s) - 2 attains its maximum value, -1, which is also where |F_2| is smallest.)
  • The gradient of the third component function is {0 1}.
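
Because the component functions are simple trigonometric expressions, you can verify the finite-difference values against the exact Jacobian. The following statements are a minimal sketch (they assume the PROC IML session above is still active; the name Jexact is mine):

/* exact Jacobian of F(s,t) = (cos(s)-q1, sin(s)-q2, t-q3), evaluated at (s,t) = (pi/2, 1) */
s = x[1];
Jexact = (-sin(s) || 0) //
         ( cos(s) || 0) //
         (      0 || 1);
print (J - Jexact)[L='FD Jacobian minus exact Jacobian'];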

The NLPFDD subroutine also returns a third result, which is named JtJ in the program. It contains the matrix product J`*J, where J is the Jacobian matrix. This product is used in some numerical methods, such as the Gauss-Newton algorithm, to minimize the value of a vector-valued function. Unless you are writing your own algorithm to perform a least-squares minimization, you probably won't need the third output argument.

Summary

The NLPFDD subroutine in SAS/IML uses finite-difference derivatives to approximate the derivatives of multivariate functions. A previous article shows how to use the NLPFDD subroutine to approximate the gradient and Hessian of a scalar-valued function. This article shows how to approximate the Jacobian of a vector-valued function. Numerical approximations for derivatives are useful in all sorts of numerical routines, including optimization, root-finding, and solving nonlinear least-squares problems.

In SAS Viya, you can use the FDDSOLVE subroutine, which is a newer implementation that has a similar syntax.

The post Finite-difference derivatives of vector-valued functions appeared first on The DO Loop.

March 02, 2022
 

Many applications in mathematics and statistics require the numerical computation of the derivatives of smooth multivariate functions. For simple algebraic and trigonometric functions, you often can write down expressions for the first and second partial derivatives. However, for complicated functions, the formulas can get unwieldy (and some applications do not have explicit analytical derivatives). In those cases, numerical analysts use finite-difference methods to approximate the partial derivatives of a function at a point of its domain. This article is about how to compute first-order and second-order finite-difference derivatives for smooth functions in SAS.

Formulas for finite-difference derivatives

There are many ways to approximate the derivative of a function at a point, but the most common formulas are for the forward-difference method and the central-difference method. In SAS/IML software, you can use the NLPFDD subroutine to compute a finite-difference derivative (FDD).

This article shows how to compute finite-difference derivatives for a smooth scalar-valued function f: R^n → R at a point x0 in its domain. The first derivatives are the elements of the gradient vector, grad(f). The i-th element of the gradient vector is ∂f/∂x_i for i = 1, 2, ..., n. The second derivatives are the elements of the Hessian matrix, Hes(f). The (i,j)th element of the Hessian matrix is ∂²f / (∂x_i ∂x_j) for i,j = 1, 2, ..., n. In both cases, the derivatives are evaluated at x0, so grad(f)(x0) is a numerical row vector and Hes(f)(x0) is a numerical symmetric matrix.
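
As a reminder (these are the standard textbook formulas; the NLPFDD subroutine chooses the step size h automatically, as described later in this article), the forward-difference and central-difference approximations to the i-th first partial derivative at x0 are

\( \frac{\partial f}{\partial x_i}(x_0) \approx \frac{f(x_0 + h e_i) - f(x_0)}{h} \quad \text{(forward difference)} \)
\( \frac{\partial f}{\partial x_i}(x_0) \approx \frac{f(x_0 + h e_i) - f(x_0 - h e_i)}{2h} \quad \text{(central difference)} \)

where e_i is the i-th coordinate direction. The forward-difference formula is cheaper (one extra function evaluation per derivative) but has error O(h); the central-difference formula requires two extra evaluations but has error O(h²).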

For this article, I will use the following cubic function of two variables:
\(f(x,y) = x^3 - y^2 - x + 0.5 \)
A heat map of the function is shown to the right. The function has two critical points at which the gradient is the zero vector: a local maximum at (-1/sqrt(3), 0) and a saddle point at (+1/sqrt(3), 0). The formula for the gradient is grad(f) = [ \(3 x^2 - 1\), \(-2 y\)] and the elements of the Hessian matrix are H[1,1] = 6 x, H[1,2] = H[2,1] = 0, and H[2,2] = -2. You can evaluate these formulas at a specified point to obtain the analytical derivatives at that point.

Finite-difference derivatives in SAS

In SAS, you can use the NLPFDD subroutine in PROC IML to compute finite difference derivatives. (In SAS Viya, you can use the FDDSOLVE subroutine, which is a newer implementation.) By default, the NLPFDD subroutine uses forward-difference formulas to approximate the derivatives. The subroutine returns three output arguments: the value of the function, gradient, and Hessian, respectively, evaluated at a specified point. The following SAS/IML program defines the cubic function and calls the NLPFDD subroutine to approximate the derivatives:

proc iml;
start Func(xy);
   x = xy[,1]; y = xy[,2];
   z = x##3 - y##2 - x + 0.5;
   return( z );
finish;
 
/* Function has local max at (x_max, 0) where x_max = -1/sqrt(3) = -0.57735 */
x_max =  -1/sqrt(3) || 0;
call nlpfdd(f_max, grad_max, Hes_max, "Func", x_max);
 
reset fuzz=2e-6;       /* print 0 for any number less than FUZZ */
print f_max, grad_max, Hes_max;

The output shows the value of the function and its derivatives at x0 = (-1/sqrt(3), 0), which is a local maximum. The value of the function is 0.88. The gradient at the local maximum is the zero vector. The Hessian is a diagonal matrix. The forward-difference derivatives differ from the corresponding analytical values by 2E-6 or less.
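
To make the comparison explicit, you can evaluate the analytical formulas from the previous section at the same point. The following statements are a minimal sketch (they continue the PROC IML session above; the names grad_exact and Hes_exact are mine):

/* analytical derivatives at x_max = (-1/sqrt(3), 0), for comparison */
x = x_max[1];  y = x_max[2];
grad_exact = (3*x##2 - 1) || (-2*y);    /* gradient: [3x^2 - 1, -2y] */
Hes_exact  = (6*x || 0) // (0 || -2);   /* Hessian: [[6x, 0], [0, -2]] */
print (grad_max - grad_exact)[L='Gradient: FD minus exact'],
      (Hes_max  - Hes_exact)[L='Hessian: FD minus exact'];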

Use finite-difference derivatives to classify extrema

One use of the derivatives is to determine whether a point is an extreme value. Points where the gradient of the function vanishes are called critical points. A critical point can be a local extremum or a saddle point. The eigenvalues of the Hessian matrix determine whether a critical point is a local extremum. The eigenvalues are used in the second-derivative test:

  1. If the eigenvalues are all positive, the critical point is a local minimum.
  2. If the eigenvalues are all negative, the critical point is a local maximum.
  3. If some eigenvalues are positive and others are negative, the critical point is a saddle point.
  4. If any eigenvalue is zero, the test is inconclusive.

For the current example, the Hessian is a diagonal matrix. By inspection, the eigenvalues are all negative, which proves that the point x0 = (-1/sqrt(3), 0) is a local maximum. You can repeat the computations for x_s = 1/sqrt(3) || 0, which is a saddle point.
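
For example, the following statements are a minimal sketch (they continue the same PROC IML session; the names f_s, grad_s, Hes_s, and eval are mine) that applies the second-derivative test at the saddle point:

/* second-derivative test at the saddle point x_s = (1/sqrt(3), 0) */
x_s = 1/sqrt(3) || 0;
call nlpfdd(f_s, grad_s, Hes_s, "Func", x_s);
eval = eigval(Hes_s);     /* eigenvalues of the (approximate) Hessian */
print eval;               /* one positive and one negative eigenvalue => saddle point */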

Forward- versus central-difference derivatives

You can evaluate the derivatives at any point, not merely at the critical points. For example, you can evaluate the derivatives of the function at x1 = (-1, 0.5), which is not an extremum:

x1 = {-1 0.5};
call nlpfdd(FwdD_f, FwdD_g, FwdD_H, "Func", x1);
print FwdD_g, FwdD_H;

By default, the NLPFDD subroutine uses a forward-difference method to approximate the derivatives. You can also use a central difference scheme. The documentation describes six different ways to approximate derivatives, but the most common way is to use the objective function to choose a step size and to use either a forward or central difference. You can use the PAR= keyword to specify the method: PAR[2] = 0 specifies forward difference and PAR[2]=1 specifies central difference:

parm = {.,  00, .};                /* forward diff based on objective function (default) */
call nlpfdd(FwdD_f, FwdD_g, FwdD_H, "Func", x1) par=parm;
parm = {.,  01, .};                /* central diff based on objective function */
call nlpfdd(CenD_f, CenD_g, CenD_H, "Func", x1) par=parm;
 
grad_results = (FwdD_g // CenD_g);
print grad_results[r={"Fwd FD" "Cen FD"} c={"df/dx" "df/dy"}];
HesDiff = FwdD_H - CenD_H;
reset fuzz=2e-16;       /* print small numbers */
print HesDiff;

The first table shows the gradient computed by the forward-difference method (first row) and the central-difference method (second row). The second table shows the difference between the two Hessian approximations; the maximum difference is about 4.5E-7. In general, the forward-difference method is faster but less accurate; the central-difference method is slower but more accurate.

Summary

This article shows how to use the NLPFDD subroutine in SAS/IML to approximate the first-order and second-order partial derivatives of a multivariate function at any point in its domain. The subroutine uses finite-difference methods to approximate the derivatives. You can use a forward-difference method (which is fast but less accurate) or a central-difference method (which is slower but more accurate). The derivatives are useful in many applications, including classifying critical points as minima, maxima, and saddle points.

The post Finite-difference derivatives in SAS appeared first on The DO Loop.