data analysis

May 16, 2022
 

In categorical data analysis, it is common to analyze tables of counts. For example, a researcher might gather data for 18 boys and 12 girls who apply for a summer enrichment program. The researcher might be interested in whether the proportion of boys that are admitted is different from the proportion of girls. For this analysis, the data that you type into statistical software (such as SAS) does not typically consist of 30 observations. Rather, the data often consists of four observations and a frequency variable (also called a count variable). The first two observations record the number of boys that are admitted and not admitted, respectively. The next two observations record the corresponding frequencies for the girls.
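
For instance, the admissions data might be entered in frequency form as in the following sketch. The variable names (Gender, Admitted, Count) and the split of the 18 boys and 12 girls into admitted and not admitted are hypothetical; they merely illustrate the structure. You would then specify the Count variable on a FREQ or WEIGHT statement, as shown later in this article.

/* hypothetical sketch of data in frequency form; the admitted/not admitted
   counts are made up to illustrate the structure */
data Admissions;
input Gender $ Admitted $ Count;
datalines;
Boy  Yes 10
Boy  No   8
Girl Yes  7
Girl No   5
;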

This representation of the data is sometimes called frequency data. It is used when you can group individual subjects into categories such as boy/girl and admitted/not admitted. Despite its convenience, there are times when it is useful to expand or "unroll" the data to enumerate the individual subjects. Examples include bootstrap estimates and analyzing the data in SAS procedures that do not support a FREQ statement.

This article shows how to use the SAS DATA step to unroll frequency data. In addition, this article shows how to expand data that are in events/trials format.

Data in frequency form

The following example is from the documentation of the FREQ procedure. The goal is to compute some statistics for a 2 x 2 frequency table for 23 patients. Eight patients were on a low-cholesterol diet and 15 were on a high-cholesterol diet. The table shows how many subjects in each group developed heart disease. The data are used to estimate the relative risk of developing heart disease in the two groups:

/* Doc Example: PROC FREQ analysis of a 2x2 Contingency Table */
proc format;
   value ExpFmt 1='High Cholesterol Diet'  0='Low Cholesterol Diet';
   value RspFmt 1='Yes'  0='No';
run;
 
data FatComp;
format Exposure ExpFmt. Response RspFmt.;
input Exposure Response Count;
label Response='Heart Disease';
datalines;
0 0  6
0 1  2
1 0  4
1 1 11
;
 
title 'Case-Control Study of High Fat/Cholesterol Diet';
proc freq data=FatComp;
   tables Exposure*Response / relrisk norow nocol nopercent;
   weight Count;            /* each row represents COUNT patients */
run;

Notice that the data are input in terms of group frequencies. The output is shown. The analysis estimates the risk of heart disease for a subject on a high-cholesterol diet as 2.8 times greater than the risk for a subject on a low-cholesterol diet.

How to unroll frequency data in SAS

You can use the DATA step in SAS to unroll the frequency data. The COUNT variable tells you how many patients are in each group. You can therefore write a loop that expands the data: iterate from 1 to the COUNT value and put an OUTPUT statement in the body of the loop. If desired, you can include an ID variable to enumerate the subjects in each group, as follows:

data RawFat;
set FatComp;
Group + 1;          /* identify each category */
do ID = 1 to Count; /* identify subjects in each category */
   output;
end;
drop Count;
run;
 
proc print data=RawFat(obs=10);
run;

Whereas the original data set has four rows, the RawFat data has 23 rows. The first 10 rows are shown. The first six observations are the expansion of the subjects in the first group (Exposure=Low; Response=No). The next two observations are the subjects in the second group (Exposure=Low; Response=Yes), and so forth. The GROUP variable identifies the original groups. The ID variable enumerates patients within each group.

The second data set can be analyzed by any SAS procedure. In this form, the data do not require using a FREQ statement (or the WEIGHT statement in PROC FREQ), as shown by the following call to PROC FREQ:

/* run the same analysis on the expanded data */
proc freq data=RawFat;
   tables Exposure*Response / relrisk norow nocol nopercent;
   /* no WEIGHT statement because there is one row for each patient */
run;

The output (not shown) is the same as for the frequency data. Both data sets contain the same information, but the second might be useful for certain tasks such as a bootstrap analysis.

Data in events/trials form

A related data format is called the events/trials format. In the events/trials format, each observation has two columns that specify counts. One column specifies the number of times that a certain event occurs, and the other column specifies the number of trials (or attempts) that were conducted. This format is used for logistic regression and for other generalized linear regression models that predict the proportion of events.

The following example is from the documentation for the LOGISTIC procedure. The data describe the results for 19 batches of ingots in a designed experiment. Each batch was soaked and heated for specified combinations of times. The ingots were then inspected to determine which were not ready for rolling. The following DATA step records the results in the events/trials syntax. The R variable records the number of ingots that are not ready for rolling, and the N variable contains the number of ingots in the batch at the given values of the HEAT and SOAK variables. Consequently, R/N is the proportion of ingots that were not ready. For example, the values R=4 and N=44 mean that 4 ingots (9%) were not ready for rolling in a batch that contained 44. Regression procedures such as PROC LOGISTIC enable you to analyze data by using an events/trials syntax:

/* events/trials syntax */
data ingots;
   input Heat Soak r n @@;
   datalines;
7 1.0 0 10  14 1.0 0 31  27 1.0 1 56  51 1.0 3 13
7 1.7 0 17  14 1.7 0 43  27 1.7 4 44  51 1.7 0  1
7 2.2 0  7  14 2.2 2 33  27 2.2 0 21  51 2.2 0  1
7 2.8 0 12  14 2.8 0 31  27 2.8 1 22  51 4.0 0  1
7 4.0 0  9  14 4.0 0 19  27 4.0 1 16
;
 
proc logistic data=ingots;
   model r/n = Heat Soak;
   ods select NObs ResponseProfile ParameterEstimates;
run;

The "NObs" table shows that there are 19 rows of data, which represents 387 units. The "Response Profile" table counts and displays the number of events and the number of trials. For these data, 12 units were not ready for rolling. The "ParameterEstimates" table provides estimates for the Intercept and the coefficients of the effects in the logistic regression model.

How to expand events/trials data

You can expand these data by creating 387 observations, one for each unit. The following DATA step creates a variable named BATCH that enumerates the original 19 batches of ingots. The ID variable enumerates the units in each batch. Some of the units were ready for rolling and some were not. You can create an indicator variable (NOTREADY) that arbitrarily sets the first R ingots as "not ready" (NOTREADY=1) and the remaining ingots as ready (NOTREADY=0).

/* expand or unroll the events/trials data */
data RawIngots;
set ingots;
BatchID + 1;
do ID = 1 to n;
   NotReady = (ID <= r);
   output;
end;
run;
 
/* analyze the binary variable NOTREADY */
proc logistic data=RawIngots;
   model NotReady(event='1') = Heat Soak;
   ods select NObs ResponseProfile ParameterEstimates;
run;

The "NOBs" and "Response Profile" tables are different from before, but they contain the same information. The "ParameterEstimates" table is the same and is not shown.

Summary

In categorical data analysis, it is common to analyze data that are in frequency form. Each observation represents some number of subjects, and the number is given by the value of a Count variable. In some situations, it is useful to "unroll" or expand the data. If an observation has frequency C, it results in C observations in the expanded data set. This article shows how to use the SAS DATA step to expand frequency data and data that are in events/trials syntax.

The post How to unroll frequency data appeared first on The DO Loop.

May 4, 2022
 

In The Essential Guide to Bootstrapping in SAS, I note that there are many SAS procedures that support bootstrap estimates without requiring the analyst to write a program. I have previously written about using bootstrap options in the TTEST procedure. This article discusses the NLIN procedure, which can fit nonlinear models to data by using a least-squares method. The NLIN procedure supports the BOOTSTRAP statement, which enables you to compute bootstrap estimates of the model parameters. In nonlinear models, the sampling distribution of a regression estimate is often non-normal, so bootstrap confidence intervals can be more useful than the traditional Wald-type confidence intervals that assume the estimates are normally distributed.

Data for a dose-response curve

The documentation for PROC NLIN contains an example of fitting a parametric dose-response curve to data. The example only has seven observations, but I have created an additional seven (fake) observations to make the example richer. The graph to the right visualizes the data and a parametric dose-response curve that is fit by using PROC NLIN. The curve represents the expected value of the log-logistic model
\(f(x) = \delta + \frac{\alpha - \delta }{1+\gamma \exp \left\{ \beta \ln (x)\right\} }\)
where x is the dose. The response is assumed to be bounded in the interval [δ, α], where α is the limiting response as x → 0 and δ is the limiting response as x → ∞. The parameters β and γ determine the shape of the dose-response curve. In the following example, δ=0, and the data suggest that α ≈ 100.
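
Because \(\exp\{ \beta \ln(x)\} = x^{\beta}\), the model can be written as \(f(x) = \delta + \frac{\alpha - \delta}{1+\gamma x^{\beta}}\). For positive values of β and γ, the term \(\gamma x^{\beta}\) vanishes as x → 0 and grows without bound as x → ∞, so the expected response decreases from α at small doses toward δ at large doses.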

The graph to the right shows the data and the dose-response curve that best fits these data in a least-squares sense.

The following DATA step defines the data. The call to PROC NLIN specifies the model and produces the graph as well as a table of parameter estimates and the correlation between the regression parameters:

data DoseResponse;
   input dose y @@;
   logdose = log(dose);
datalines;
0.009   96.56   0.035   94.12   0.07    89.76
0.15    60.21   0.20    39.95   0.28    21.88
0.50     7.46   0.01    98.1    0.05    90.2
0.10    83.6    0.15    55.1    0.30    32.5
0.40    12.8    0.45     9.6
;
 
proc nlin data=DoseResponse plots(stats=none)=(fitplot); /* request the fit plot */
   parameters alpha=100 beta=3 gamma=300;                /* three parameter model */
   delta = 0;                                            /* a constant; not a parameter */
   Switch = 1/(1+gamma*exp(beta*log(dose)));             /* log-logistic function */
   model y = delta + (alpha - delta)*Switch;
run;

Look at the statistics for the standard errors, the confidence intervals (CIs), and the correlation of the parameters. Notice that these columns contain the word "Approximate" (or "Approx") in the column headers. That is because these statistics are computed by using the formulas from linear regression models. Under the assumptions of ordinary linear regression, the estimates for the regression coefficients are (asymptotically) distributed according to a multivariate normal distribution. For nonlinear regression models, you do not know the sampling distribution of the parameter estimates in advance. However, you can use bootstrap methods to explore the sampling distribution and to improve the estimates of standard error, confidence intervals, and the correlation of the parameters.

The BOOTSTRAP statement in PROC NLIN

You could carry out the bootstrap manually, but PROC NLIN supports the BOOTSTRAP statement, which automatically runs a bootstrap analysis for the regression coefficients. The BOOTSTRAP statement supports the following options:

  • Use the SEED= option to specify the random-number stream.
  • Use the NSAMPLES= option to specify the number of bootstrap samples that you want to use to make inferences. I suggest at least 1000, and 5000 or 10000 are better values.
  • Use the BOOTCI option to request bootstrap confidence intervals for the regression parameters. You can use the BOOTCI(BC) suboption to request the bias-corrected method. Use the BOOTCI(PERC) suboption for the traditional percentile-based confidence intervals. I recommend the bias-corrected option, which is the default.
  • Use the BOOTCOV option to display a table that shows the estimated covariance between parameters.
  • Use the BOOTPLOTS option to produce histograms of the bootstrap statistics for each regression parameter.

The following call to PROC NLIN repeats the previous analysis, but this time requests a bootstrap analysis to improve the estimates of standard error, CIs, and covariance between parameters:

proc nlin data=DoseResponse plots(stats=none)=(fitplot); /* request the fit plot */
   parameters alpha=100 beta=3 gamma=300;                /* three parameter model */
   delta = 0;                                            /* a constant; not a parameter */
   Switch = 1/(1+gamma*exp(beta*log(dose)));             /* log-logistic function */
   model y = delta + (alpha - delta)*Switch;
   BOOTSTRAP / seed=12345 nsamples=5000 bootci(/*BC*/) bootcov bootplots;
run;

The Parameter Estimates table is the most important output because it shows the bootstrap estimates for the standard error and CIs. For these data, you can see that the estimates for the alpha and beta parameters are not very different from the approximate estimates. However, the estimates for the gamma parameter are quite different. The bootstrap standard error is almost 2 units larger. The bootstrap bias-corrected confidence interval is about 5 units shorter and is shifted to the right as compared to the traditional CI.

You can study the bootstrap distribution of the gamma parameter to understand the differences. The following histogram is produced automatically by PROC NLIN because we used the BOOTPLOTS option. You can see that the distribution of the gamma estimates is not normal. It shows a moderate amount of positive skewness and kurtosis. Thus, the bootstrap estimates are noticeably different from the traditional estimates.

The distributions for the alpha and beta statistics are not shown, but they display only small deviations from normality. Thus, the bootstrap estimates for the alpha and beta parameters are not very different from the traditional estimates.

I do not show the bootstrap estimates of the covariance of the parameters. However, if you convert the bootstrap estimates of covariance to a correlation matrix, you will find that the bootstrap estimates are close to the approximate estimates that are shown earlier.

Summary

The BOOTSTRAP statement in PROC NLIN makes it easy to perform a bootstrap analysis of the regression estimates for a nonlinear regression model. The BOOTSTRAP statement will automatically produce bootstrap estimates of the standard error, confidence intervals, and covariance of parameters. In addition, you can use the BOOTPLOTS option to visualize the bootstrap distribution of the estimates. As shown in this example, sometimes the distribution of a parameter estimate is non-normal, so a bootstrap analysis produces better inferential statistics for the regression analysis.

The post Bootstrap estimates for nonlinear regression models in SAS appeared first on The DO Loop.

April 27, 2022
 

When you have many correlated variables, principal component analysis (PCA) is a classical technique to reduce the dimensionality of the problem. The PCA finds a smaller dimensional linear subspace that explains most of the variability in the data. There are many statistical tools that help you decide how many principal components to keep.

However, not all data can benefit from a PCA. If the variables are weakly correlated, it might not be possible to reduce the dimension of the problem. Bartlett's sphericity test provides information about whether the correlations in the data are strong enough to use a dimension-reduction technique such as principal components or common factor analysis.

This article discusses Bartlett's sphericity test (Bartlett, 1951) and shows how to run it in SAS. I provide two examples: one uses strongly correlated data for which a PCA would be an appropriate analysis; the other uses simulated data for which the variables are uncorrelated. For the second example, Bartlett's test tells us that a dimension-reduction technique will not provide any useful benefit. You need all dimensions to explain the variance in the data.

Choose the correct "Bartlett test"

Be aware that there are two tests that have Bartlett's name, and they are similar. The better-known test is Bartlett's test for equal variance, which is part of an ANOVA analysis. It tests the hypothesis that the variances across several groups are equal. The technical term is "homogeneity of variance," or HOV. The GLM procedure in SAS can perform Bartlett's test for homogeneity of variance. The documentation has a discussion and example that shows how to use Bartlett's HOV test.
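
For reference, here is a minimal sketch of how to run Bartlett's HOV test (not the sphericity test) by using the HOVTEST= option in the MEANS statement of PROC GLM. The Sashelp.Heart data set and its variables are used only for illustration; any one-way classification will do.

proc glm data=sashelp.heart;
   class BP_Status;
   model Cholesterol = BP_Status;
   means BP_Status / hovtest=bartlett;   /* Bartlett's test for equal variances across groups */
run; quit;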

Bartlett's sphericity test is different. Loosely speaking, the test asks whether a correlation matrix is the identity matrix. If so, the variables are uncorrelated, and you cannot perform a PCA to reduce the dimensionality of the data. More formally, Bartlett's sphericity test is a test of whether the data are a random sample from a multivariate normal population MVN(μ, Σ) where the covariance matrix Σ is a diagonal matrix. Equivalently, the variables in the population are MVN and uncorrelated.

Bartlett's sphericity test for correlated data

Let's recall some facts about correlation matrices that are necessary to understand Bartlett's sphericity test. Assume the data matrix, X, has p variables and N > p observations. First, the determinant of a correlation matrix is always in the range [0,1]. The determinant equals 1 only when the correlation matrix is the identity matrix, which happens only when the columns of X are uncorrelated. The determinant equals 0 only when the data are linearly dependent, which means that a column is equal to a linear combination of the other columns. Thus, for nondegenerate data, the determinant is always in the range (0,1] and the logarithm of the determinant is defined. Bartlett's sphericity statistic is
\(T = -\log(\det(R)) (N-1-(2p+5)/6)\)
Under the null hypothesis that the data are a random sample from the MVN(μ, Σ) distribution where the covariance matrix Σ is a diagonal matrix, Bartlett (1951) showed that this statistic has a chi-square distribution with p(p–1)/2 degrees of freedom.
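
To make the formula concrete, the following SAS/IML statements sketch how you could compute the statistic and its p-value directly from a sample correlation matrix. The statements assume complete (nonmissing) numeric data; the Sashelp.Iris data are used merely as an example.

proc iml;
use sashelp.iris;
   read all var _NUM_ into X;               /* the four numeric variables */
close;
N = nrow(X);  p = ncol(X);
R = corr(X);                                /* sample correlation matrix */
T = -log(det(R)) * (N - 1 - (2*p+5)/6);     /* Bartlett's sphericity statistic */
DF = p*(p-1)/2;                             /* degrees of freedom */
pValue = 1 - cdf("ChiSq", T, DF);
print T DF pValue;
quit;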

If you look up "how to obtain Bartlett's test of sphericity in SAS," you will be directed to a SAS Usage note that consists of five brief sentences and a partial example. Here is a more complete example. Suppose you want to know whether a set of variables are correlated strongly enough that you can use a dimension-reduction technique to analyze the data. You can use PROC FACTOR in SAS to produce Bartlett's test. You need to specify the METHOD=ML (maximum likelihood estimation) and HEYWOOD options. The following example uses the Sashelp.Iris data, which contains four highly correlated variables. For these data, three correlation coefficients are greater than 0.8. The following statements produce Bartlett's sphericity test:

proc factor data=sashelp.iris method=ML heywood;
   ods select SignifTests; /* output only Bartlett's test */
run;

Bartlett's test is shown on the row labeled "H0: No common factors." These data have p=4 variables, so there are DF=6 degrees of freedom for the test. The value of the test statistic is 706.96. Under the null hypothesis, the probability of observing a statistic this large or larger by chance alone is exceedingly small. Therefore, you reject the null hypothesis and conclude that it is reasonable to consider applying a dimension-reduction technique to these data.

Bartlett's sphericity test for uncorrelated data

What does Bartlett's sphericity test look like if your sample is from an uncorrelated MVN distribution? Let's simulate three-dimensional uncorrelated MVN data and run the test:

data UncorrMVN;
call streaminit(123);
do i = 1 to 50;
   x1 = rand("Normal", 0, 1);
   x2 = rand("Normal", 0, 2);
   x3 = rand("Normal", 0, 3);
   output;
end;
drop i;
run;
 
proc factor data=UncorrMVN method=ML heywood;
   ods select SignifTests;
run;

For these data (which are simulated according to the null hypothesis), the p-value is not small, so there is not enough evidence to reject the null hypothesis that the variables in the population are uncorrelated. For these data, you need three dimensions to fully capture the variance in the data. Therefore, a dimension-reduction technique is not a useful way to analyze these data.

Summary

You can run Bartlett's sphericity test in SAS by using PROC FACTOR and the METHOD=ML HEYWOOD options. The test is shown in the SignifTests ODS table. The null hypothesis is that the data are a sample from an uncorrelated MVN distribution. If the p-value of the test is less than your significance level (such as 0.05), you reject the null hypothesis. In this case, it might be useful to use a dimension-reduction technique to identify a linear subspace that explains most of the variation in the data. If the p-value of the sphericity test is greater than your significance level, then the data are not very correlated. It is not useful to apply a dimension-reduction technique.

One final comment: The test is known as a "sphericity test" because data samples from an MVN distribution with a covariance matrix of the form σ²I will form a spherical point cloud. Bartlett's paper (1951) is chiefly concerned with the case of a covariance matrix for which each variable has an equal variance. In that sense, this problem is similar to the other "homogeneity of variance" problem: Bartlett's HOV test for ANOVA.

Historical Note: SAS uses the Bartlett statistic that was shown earlier. However, Bartlett's original paper actually uses the statistic \(-\log(\det(R))(N-(2p+5)/6)\). I do not know who suggested subtracting 1 in the second factor. Obviously, they are equivalent asymptotically.

The post On Bartlett's sphericity test for correlation appeared first on The DO Loop.

April 20, 2022
 

Recently, I showed how to use a heat map to visualize measurements over time for a set of patients in a longitudinal study. The visualization is sometimes called a lasagna plot because it presents an alternative to the usual spaghetti plot. A reader asked whether a similar visualization can be created for subjects if the response is an ordinal variable, such as a count. Yes! And the heat map approach is a substantial improvement over a spaghetti plot in this situation.

This article pulls together techniques from several previous articles.

This article uses the following data, which represent the counts of malaria cases at five clinics over a 14-week time period:

data Clinical;
input SiteID @;
do Week = 1 to 14;
   input Count @;
   output;
end;
/* ID Wk1  Wk2  Wk3  Wk4 ... Wk14 */
datalines;
001  1 0 0 0 0 0 0 3 1 3 3 0 3 0
002  0 0 0 1 1 2 1 2 2 1 1 0 2 2
003  1 . . 1 0 1 0 3 . 1 0 3 2 1
004  1 1 . 1 0 1 2 2 3 2 1 0 . 0
005  1 1 1 . 0 0 0 1 0 1 2 4 5 1
;
 
title "Spaghetti Plot of Counts for Five Clinics";
proc sgplot data=Clinical;
   series x=Week y=Count / group=SiteID; 
   xaxis integer values=(1 to 14) valueshint;
run;

The line plot is not an effective way to visualize these data. In fact, it is almost useless. Because the counts are discrete integer values, and most counts are in the range [0, 3], the graph cannot clearly show the weekly values for any one clinic. The following sections develop a heat map that visualizes these data better.

Format the raw values

The following call to PROC FORMAT defines a format that associates character strings with values of the COUNT variable:

proc format;
value CountFmt
      . = "Not Counted"
      0 = "None"
      1 = "1"
      2 = "2"
      3 = "3" 
      4 - high = "4+";
run;

You can use this format to encode values and display them in a legend. Notice that you could also use a format to combine counts, such as using the word "Few" to describe 2 or 3 counts.
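
For example, a hypothetical variant of the format might combine the middle counts into a single category (the ranges and labels below are illustrative):

proc format;
value CountGrpFmt
      . = "Not Counted"
      0 = "None"
      1 = "1"
      2 - 3 = "Few"
      4 - high = "4+";
run;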

Associate colors to formatted values

A discrete attribute map ensures that the colors will not change if the data change. For ordinal data, it also ensures that the legend will be in ordinal order, as opposed to "data order" or alphabetical order.

You can use a discrete attribute map to associate graphical attributes with the formatted value of a variable. Examples of graphical attributes include marker colors, marker symbols, line colors, and line patterns. For a heat map, you want to associate the "fill color" of each cell with a formatted value. The following DATA step creates the mapping between values and colors. Notice that I use the PUTN function to apply the format to raw data values. This ensures that the mapping correctly associates formatted values with colors. The raw values are stored in an array (VAL) as are the colors (COLOR). This makes it easy to modify the map in the future or to adapt it to other situations.

data MyAttrs;
length Value $11 FillColor $20;
retain ID 'MalariaCount'               /* name of map */
     Show 'AttrMap';                   /* always show all groups in legend */
 
/* output the formatted value and color for a missing value */
Value = putn(., "CountFmt.");          /* formatted value */
FillColor = "LightCyan";               /* color for missing value */
output;
/* output the formatted values and colors for nonmissing values */
array   val[5]     _temporary_ (0 1 2 3 4);
array color[5] $20 _temporary_ ('White' 'CXFFFFB2' 'CXFECC5C' 'CXFD8D3C' 'CXE31A1C');
do i = 1 to dim(val);
   Value = putn(val[i], "CountFmt.");  /* formatted value for this raw value */
   FillColor = color[i];               /* color for this formatted value */
   output;
end;
drop i;
run;

Create a discrete heat map

Now you can create a heat map that uses the format and discrete attribute map from the previous sections. To use the map, you must specify two pieces of information:
  • Use the DATTRMAP= option on the PROC SGPLOT statement to specify the name of the data set that contains the map.
  • Because a data set can contain multiple maps, use the ATTRID= option on the HEATMAPPARM statement to specify the value of the ID variable that contains the attributes for these data.
title "Heat Map of Malaria Data";
proc sgplot data=Clinical DATTRMAP=MyAttrs; /* <== the data set that contains attributes */
   format Count CountFmt.;                  /* <== apply the format to bin data */
   heatmapparm x=Week y=SiteID colorgroup=Count / outline outlineattrs=(color=gray)
               ATTRID=MalariaCount;         /* <== the discrete attribute map for these data */
   discretelegend;
   refline (1.5 to 5.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(1 to 14) valueshint;
run;

The heat map makes the data much clearer than the spaghetti map. For each clinic and each week, you can determine how many cases of malaria were seen. As mentioned earlier, you can use this same technique to classify counts into categories such as None, Few, Moderate, Many, and so forth.

Summary

You can use SAS formats to bin numerical data into ordinal categories. You can use a discrete attribute map to associate colors and other graphical attributes with categories. By combining these techniques, you can create a heat map that enables you to visualize ordinal responses for subjects over time. The heat map is much better at visualizing the data than a spaghetti plot is.

The post Use a heat map to visualize an ordinal response in longitudinal data appeared first on The DO Loop.

April 18, 2022
 

What is McNemar's test? How do you run the McNemar test in SAS? Why might other statistical software report a value for McNemar's test that is different from the SAS value? SAS supports an exact version of the McNemar test, but when should you use it? This article answers these questions about McNemar's test in SAS.

The McNemar test

The McNemar test (also called McNemar's test) is a way to test whether the off-diagonal elements in a 2 x 2 frequency table are significantly different. The data are assumed to be matched-pairs data, which means that they represent the responses of one set of subjects measured before and after a treatment or intervention. Situations in which you could use McNemar's test include the following:

  • Students are given a test to assess whether they can read at grade level. They go through an enrichment program and are tested again.
  • Patients are asked whether they are feeling pain. They are given a treatment and are asked about pain again.

In both examples, the same subjects are tested twice. You can use the McNemar test to assess whether there is evidence that the proportion of negative results (not reading at grade level, not feeling pain) is different after the intervention as compared to before the intervention.

Let's use some educational data. A total of 68 students were given a test before and after an eight-week enrichment program (DeMars, 2015). On the pre-test, 51 students failed and 17 passed. On the post-test, 41 students failed and 27 passed. The following 2 x 2 table shows the counts of students who failed both tests, passed one but failed the other, or passed both tests:

McNemar's test determines whether the off-diagonal elements (3 and 13) are significantly different or, equivalently, whether their difference is significantly different from 0. In general, consider a 2 x 2 table whose cells are labeled a, b, c, and d.
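
One conventional labeling, with the pre-intervention result as rows and the post-intervention result as columns (this layout matches the way the data are arranged in the SAS programs later in this article), is the following:

                   After: Negative   After: Positive
Before: Negative          a                 b
Before: Positive          c                 d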

The null hypothesis for McNemar's test is that the intervention has no effect, which means that the probabilities for the off-diagonal cells b and c are equal.

McNemar's test in SAS

You can use the AGREE option on the TABLES statement in PROC FREQ to run the McNemar test in SAS. The test statistic is the quantity M = (b - c)² / (b + c). When b and c are both large, the statistic is distributed as a chi-squared variable with DF=1 degrees of freedom. The following SAS program defines the data and uses the AGREE option to run the McNemar test on this example:

/* McNemar's test: https://en.wikipedia.org/wiki/McNemar's_test 
   Data from DeMars 2015 https://www.thejuliagroup.com/documents/De_Mars_WUSS2015.pdf */
data Students;
input PassPre PassPost Count;
datalines;
0 0 38 
0 1 13
1 0  3
1 1 14
;
 
proc freq data=Students;
  tables PassPre*PassPost / nopercent norow nocol agree;
  weight Count;
  ods select CrossTabFreqs McNemarsTest;
run;

Notice how the data are entered. Each row of the input data set represents a group of students of size COUNT. The binary values of the PassPre and PassPost variables indicate whether the students passed (1) or did not pass (0) each test. The WEIGHT statement in PROC FREQ specifies the count for each possible pair of outcomes. You could also enter the data by using 68 rows, where each row represents a student.

The value of the McNemar statistic is M = (13 - 3)² / (13 + 3) = 6.25. The p-value is 0.0124, which means that it is unlikely to observe a statistic this large by chance if the intervention has no effect. Think of it this way: if you distribute 16 students at random to the two off-diagonal cells, there is only a small chance that the distribution would be this lopsided (13 vs 3) if the chance of landing in either cell is 50-50. Therefore, you reject that hypothesis and conclude that the intervention had an effect on the students who failed one test and passed the other.
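
As a quick numerical check of that intuition, the following sketch uses the binomial distribution to compute the probability of a split at least as lopsided as 13 versus 3 when each of the 16 students independently lands in either off-diagonal cell with probability 1/2:

data _null_;
   p = 2*cdf("Binomial", 3, 0.5, 16);  /* two-sided binomial probability */
   put p=;                             /* approximately 0.021 */
run;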

The exact McNemar test

The assumption that McNemar's statistic is chi-square distributed is an asymptotic result for large values of the b and c frequencies. If (b + c) is small (Wikipedia suggests less than 25), then the sampling distribution for the statistic might not follow a chi-square distribution. In that case, you should use an exact test. For the data in this example, the sum of the off-diagonal frequencies is b + c = 16. In PROC FREQ in SAS, you can run an exact test by using the EXACT MCNEM statement, as follows:

proc freq data=Students;
  tables PassPre*PassPost / nopercent norow nocol agree;
  weight Count;
  exact McNem;
  ods select McNemarsTest;
run;

The output shows that the p-value for the exact test is larger than the p-value under the chi-squared assumption.

Why might other software report a different answer?

This article was inspired by a SAS customer who compared the McNemar test in SAS to the results from another software package. The other software reported a different answer. Whenever you compare software, it is important to understand the default settings and the definitions of the statistics. After some research, we discovered that the other software was not giving the result of McNemar's test, but of a variation known as Edwards's correction to McNemar's test. Whereas the test statistic for McNemar's test is M = (b - c)² / (b + c), Edwards's correction is the statistic M2 = (|b - c| - 1)² / (b + c). Because the M2 statistic subtracts 1 from the numerator, it is always less than McNemar's statistic. Consequently, the associated p-value for Edwards's correction is larger than the p-value for McNemar's statistic.

The formulas for these tests are easily implemented in the SAS DATA step or in the SAS/IML language. The following call to PROC IML computes the statistic and p-values:

/* compute McNemar's chi-square test manually */
proc iml;
use Students;
   read all var {PassPre PassPost Count};
close;
 
DF = 1;
Tab = shape(Count, 2, 2);         /* make a 2 x 2 table */
B = Tab[1,2];                     /* 1st row; 2nd col */
C = Tab[2,1];                     /* 2nd row; 1st col */
M  = (B-C)**2 / (B+C);            /* McNemar test statistic */
M2 = (abs(B-C) - 1)**2 /(B + C);  /* Edwards's correction */
pValMc = 1 - cdf("ChiSq", M, DF); /* p-value McNemar */
pValEd = 1 - cdf("ChiSq", M2, DF);/* p-value Edwards */
 
print (M ||DF ||pValMc)[c={'McNemar' 'DF' 'Pr > ChiSq'} L="McNemar's Test"];
print (M2 ||DF ||pValEd)[c={'Edwards' 'DF' 'Pr > ChiSq'} L="Edwards's Correction"];

If your software reports a value for McNemar's test that is smaller than the value reported by SAS, it is possible that the software is applying Edwards's correction.

The McNemar test is easy to code manually, and the exact McNemar test is not much harder. Let S = b + c be the sum of the off-diagonal elements. (S=16 in this example.) The null hypothesis is that there is no difference in the proportion of students in the off-diagonal elements. Under this hypothesis, the probability of seeing the value k in one cell and S-k in the other cell is given by the binomial density PDF("Binom", k, 0.5, S). The p-value is the sum of the probabilities for values of k that are more extreme than the observed values. For this problem, it is the sum of the probabilities for k=0, 1, 2, 3, 13, 14, 15, 16.

The sum of a discrete PDF can be computed by using the cumulative distribution (CDF). Because the binomial distribution is symmetric, we only need to compute the CDF for k ≤ 3 (the left tail) and multiply the result by 2 to obtain the probability for the right tail. The following SAS/IML statements implement the exact test:

/* compute McNemar's exact test manually */
S = B + C;                         /* sum of off-diagonal elements */
k = min(B,C);                      /* find extreme value */
ExactpVal = 2*cdf("Binomial", k, 0.5, S);
result = (M ||DF ||pValMc ||ExactpVal);
print result[c={'McNemar' 'DF' 'Pr > ChiSq' 'Exact Pr >= ChiSq'} L="McNemar's Test"];

The output confirms the result of the exact p-value from PROC FREQ. Notice that the p-value from Edwards's correction is quite close to the exact p-value.

Summary

This article shows how to perform the McNemar test in SAS. You can get asymptotic and exact p-values. The exact p-value is useful when the sum of the off-diagonal frequencies is less than 25. Be aware that some software might compute a variation of the test that uses Edwards's correction formula. The statistic for Edwards's correction is always less than McNemar's statistic. Although PROC FREQ does not implement Edwards's correction, it is easy to compute manually, as shown in this article.

The post The McNemar test in SAS appeared first on The DO Loop.

March 28, 2022
 

Longitudinal data are measurements for a set of subjects at multiple points in time. Also called "panel data" or "repeated measures data," this kind of data is common in clinical trials in which patients are tracked over time. Recently, a SAS programmer asked how to visualize missing values in a set of longitudinal data. In his data, the patient data was recorded every week during a 10-week trial, but some patients missed one or more appointments. The programmer wanted to visualize the missed appointments.

This article shows three related topics:

  • A DATA step technique that enables you to input longitudinal data when each subject has a different number of repeated measurements.
  • Two ways to visualize longitudinal data: a spaghetti plot and a heat map (sometimes called a "lasagna plot").
  • How to add missing values to a set of longitudinal data so that you can perform analysis on the missing values and visualize them.

Input longitudinal data: A DATA step technique

The SAS programmer shared a table that showed the structure of his data. To create the data in a SAS data set, I will demonstrate a DATA step technique that deserves to be better known. It involves using a trailing @ to create multiple observations from a single line of input. You can read about the trailing @ in the documentation of the INPUT statement. Basically, it enables you to use multiple INPUT statements to read a single line of data. The following DATA step illustrates the trailing @ technique:

/************************/
/* The trailing @ is useful for reading repeated measurements when each 
   subject has a different number of measurements (or is measured at different times) */
data Have;
input patientID NumVisits @;   /* note trailing @ */
do i = 1 to NumVisits;
   input Week Value @;
   output;
end;
drop i;
/* ID  N  Wk Val Wk Val Wk Val Wk Val Wk Val Wk Val ... */
datalines;
 1001 8   0 12   1 13    2 13   6 13   7 14   8 14.5 9 15  10 13.5 
 1002 5   0 11.5 1 12.5  3 11   9 9.5  10 8   
 1003 6   0 12   3 11    5 10.5 6 11   9 10.5 10 9   
 1004 9   0 11   1 11    2 11   4 7.5  5 6.5  7 7    8 7.5 9 5.5  10 4 
 1005 11  0 10   1 10.5  2 11   3 9    4 7    5 7.5  6 7   7 7.5   8 4  9 6.5  10 5.5
;

The DATA step creates data for five patients in a study. Each record starts with the patient ID and the number of visits for that patient. The first INPUT statement reads that information and uses the trailing @ to hold the pointer until the next INPUT statement. Because we know the number of visits for the patient, we can use a DO loop to read the (Week, Value) pairs for each visit, and use the OUTPUT statement to create one observation for each visit.

All patients have the initial baseline measurement (Week=0) but some patients did not show up for a subsequent weekly visit. Only one patient has measurements for all 11 weeks. One patient has only five measurements, and they are widely spaced apart in time. Overall, these patients attended 39 appointments and missed 16.

This is the form of the data that the SAS programmer was given. Notice that there are no SAS missing values (.) in the data. Instead, the missing appointments are implicitly defined by their absence. As we will see later, it is sometimes useful to explicitly represent the missed appointments by using missing values. But first, let's see how to visualize these data in their current form.

Spaghetti and lasagna plots

The traditional way to visualize longitudinal data is to create a spaghetti plot, as follows:

title "Spaghetti Plot of Patient Data";
proc sgplot data=Have;
   series x=Week y=Value / group=PatientID markers curvelabel;
   xaxis grid integer values=(0 to 10) valueshint;
   yaxis grid;
run;

As usual, it is difficult to visualize missing data. The locations of the missing appointments are not easy to see in this line plot. You have to look for line segments that span multiple weeks and do not have markers for one or more weeks, as seen in the lines for PatientID=1001 and PatientID=1002. However, when two lines are near each other (for example, PatientID=1004 and PatientID=1005), it is more difficult to see the missing markers.

Now imagine a study that has 20 or 50 patients. Many lines will overlap, and the missed appointments will be very difficult to see. Although the spaghetti plot is a good way to visualize nonmissing data, it is not good at showing patterns of missing values.

The lasagna plot is an alternative way to visualize longitudinal data. A lasagna plot is a heat map. Each subject is a row. Each time point is a column. Whereas spaghetti plots can be used for unevenly spaced time points, the lasagna plot is most useful when the measurements are taken at evenly spaced time intervals.

You can use the HEATMAPPARM statement in PROC SGPLOT to create a lasagna plot for these data. The following example creates the lasagna plot and uses a white-yellow-orange-red color map to visualize the measurements for each time point.

%let WhiteYeOrRed = (CXFFFFFF CXFFFFB2 CXFECC5C CXFD8D3C CXE31A1C);
/* lasagna plot: https://blogs.sas.com/content/iml/2016/06/08/lasagna-plot-in-sas.html */
title "Lasagna Plot of Patient Data";
proc sgplot data=Have;
heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
     colormodel=&WhiteYeOrRed; 
refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
xaxis integer values=(0 to 10) valueshint;
run;

The lasagna plot does an excellent job of visualizing the data for each patient. You can see gaps for PatientID=1001 and PatientID=1002. These gaps are missing data.

However, there is a potential problem with this graph. I intentionally used white as one of the colors in the color model because I want to emphasize a problem that can occur when you create a heat map that has missing cells. In this graph, the color white is used for two different purposes! It is the color of the background, and it is the color for the lowest value of the measurement (Value=4.0). When a value is missing, the heat map does not display a cell, and the background color shows through as for PatientID=1001. On the other hand, PatientID=1005 does not have any missing values, but the patient does have a measurement (Week=8) for which Value=4.0. That cell is also white!

There are a few ways to handle this situation:

  • Use a color model that does not include white.
  • Change the color of the background to a color that is not in the color model.
  • Add the missing values to the data and display cells that contain missing values in a different color.

The first option (change the color model) is the easiest but might not be possible if your boss or client insists on a color model that uses white. The subsequent sections explore the other two options.

Change the background color

The STYLEATTRS statement in PROC SGPLOT enables you to control the colors of various graphical elements. You can specify the background color by using the WALLCOLOR= option. The following call sets the background color to gray:

/* quick and easy way: change the background color of the "wall" in the plot */
title "Change Background Color to Gray";
proc sgplot data=Have;
   styleattrs wallcolor=Gray;
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        colormodel=&WhiteYeOrRed; 
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
run;

Now the gaps in the heat map are not white. The missing appointments are "visible" via the gray background. You can use the WALLCOLOR= option for all graphs. Notice, however, that the background color shows around the edges of the heat map. If you want to get rid of that edge effect, you can use the options OFFSETMIN=0 and OFFSETMAX=0 on the XAXIS and YAXIS statements. For example:

   yaxis offsetmin=0 offsetmax=0;

Display missing values as cells

The structure of the data makes it difficult to perform an analysis on the number and pattern of missed appointments. In its current form, the data set has 39 observations to represent the 39 visits from among the 55 scheduled appointments. If you want to analyze the missing values, it is better to restructure the data to include SAS missing values in the data.

An alternative structure for this data is for all patients to have 11 observations, but for the Value variable to be missing for any week in which the patient missed his appointment. The following DATA step creates a new data set for this alternative structure. First, a DATA step creates 11 weeks of missing values for each patient. The values of the PatientID variable are read by using a BY statement and the values of the FIRST.PatientID automatic variable. Then, this larger data set is merged with the observed data. Any observed values overwrite the missing values.

/* create sequence of missing values for each patient */
data AllMissing;
set Have;
by PatientID; /* assume sorted by PatientID */
if first.PatientID then do;
   do Week = 0 to 10;   
      Value = .;  output;
   end;
end;
run;
/* merge with observed data */
data Want;
   merge AllMissing Have;
   by PatientID Week;
run;

If you create a heat map for the data in this new format, every cell will be plotted. The cells that contain missing values will be displayed by using the color for the GraphMissing style element in the current ODS style. For my ODS style (HTMLBlue), the color for missing cells is gray.

/* the color for missing cells is the GraphMissing color */
title "Missing Values Displayed in GraphMissing Color";
proc sgplot data=Want;
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        colormodel=&WhiteYeOrRed; 
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
   legenditem type=fill name='missItem' / fillattrs=GraphMissing label="Missing Data";
   keylegend 'missItem';
run;

I like this graph because it uses SAS missing values to visualize the patterns of missed appointments. If you look carefully, you can see that there are 55 cells, one for each appointment. The gray cells represent the 16 missed appointments whereas the other colors represent the 39 clinical measurements. This visualization makes it easy to answer questions such as "which patients missed two or more consecutive appointments?"

Summary

This article discusses some ways to visualize missing values in longitudinal data. The traditional spaghetti plot does not do a good job of visualizing missing values. A heat map (sometimes called a lasagna plot) is a better choice. Depending on the structure of your data, you might need to add missing values to the data. By explicitly including missing values, the patterns of missing values can be visualized more easily.

This article also demonstrates how to use a trailing @ in a DATA step. This enables you to create multiple observations from a single line of input. Thanks to Warren Kuhfeld, who showed me how to read data by using this useful technique many years ago.

The post Visualize missing values in longitudinal data appeared first on The DO Loop.

February 14, 2022
 

This article implements Passing-Bablok regression in SAS. Passing-Bablok regression is a one-variable regression technique that is used to compare measurements from different instruments or medical devices. The two variables (X and Y) are both measured with error. Consequently, you cannot use ordinary linear regression, which assumes that one variable (X) is measured without error. Passing-Bablok regression is a robust nonparametric regression method that does not make assumptions about the distribution of the expected values or the error terms in the model. It also does not assume that the errors are homoskedastic.

Several readers have asked about Passing-Bablok regression because I wrote an article about Deming regression, which is similar. On the SAS Support Communities, there was a lively discussion in response to a program by KSharp in which he shows how to use Base SAS procedures to perform a Passing-Bablok regression. In response to this interest, this article shares a SAS/IML program that performs Passing-Bablok regression.

An overview of Passing-Bablok regression

Passing-Bablok regression (Passing and Bablok, 1983, J. Clin. Chem. Clin. Biochem., henceforth denote PB) fits a line through X and Y measurements. The X and Y variables measure the same quantity in the same units. Usually, X is the "gold standard" technique, whereas Y is a new method that is cheaper, more accurate, less invasive, or has some other favorable attributes. Whereas Deming regression assumes normal errors, Passing-Bablok regression can handle any error distribution provided that the measurement errors for both methods follow the same distribution. Also, the variance of each method does not need to be constant across the range of data. All that is required is that the ratio of the variances is constant. The Passing-Bablok method is robust to outliers in X or Y.

Assume you have N pairs of observations {(xi, yi), i=1..N}, where X and Y are different ways to measure the same quantity. Passing-Bablok regression determines whether the measurements methods are equivalent by estimating a line y = a + b*x. The estimate of the slope (b) is based on the median of pairs of slopes Sij = (yi - yj) / (xi - xj) for i < j. You must be careful when xi=xj. PB suggest assigning a large positive value when xi=xj and yi > yj and a large negative value when xi=xj and yi < yj. If xi=xj and yi=yj, the difference is not computed or used.

The Passing-Bablok method estimates the regression line y = a + b*x by using the following steps:
  1. Compute all the pairwise slopes. For vertical slopes, use large positive or negative values, as appropriate.
  2. Exclude slopes that are exactly -1. The reason is technical, but it involves ensuring the same conclusions whether you regress Y onto X or X onto Y.
  3. Ideally, we would estimate the slope of the regression line (b) as the median of all the pairwise slopes of the points (xi, yi) and (xj, yj), where i < j. This is a robust statistic and is similar to Theil-Sen regression. However, the slopes are not independent, so the median can be a biased estimator. Instead, estimate the slope as the shifted median of the pairwise slopes.
  4. Estimate the intercept, a, as the median of the values {yi - b*xi}.
  5. Use a combination of symmetry, rank-order statistics, and asymptotics to obtain confidence intervals (CIs) for b and a.
  6. You can use the confidence intervals (CIs) to assess whether the two methods are equivalent. In terms of the regression estimates, the CI for the intercept (a) must contain 0 and the CI for the slope (b) must contain 1. This article uses only the CIs proposed by Passing and Bablok. Some software support jackknife or bootstrap CIs, but it is sometimes a bad idea to use resampling methods to estimate the CI of a median, so I do not recommend bootstrap CIs for Passing-Bablok regression.

Implement Passing-Bablok regression in SAS

Because the Passing-Bablok method relies on computing vectors and medians, the SAS/IML language is a natural language for implementing the algorithm. The implementation uses two helper functions from previous articles: a function that extracts the strictly lower-triangular elements of a matrix, and a function that computes all pairwise differences of the elements of a vector.

The following modules implement the Passing-Bablok algorithm in SAS and store the modules for future use:

%let INFTY = 1E12;         /* define a big number to use as "infinity" for vertical slopes */
proc iml;
/* Extract only the values X[i,j] where i < j.
   https://blogs.sas.com/content/iml/2012/08/16/extract-the-lower-triangular-elements-of-a-matrix.html */
   start StrictLowerTriangular(X);
   v = cusum( 1 || (ncol(X):2) );
   return( remove(vech(X), v) );
finish;
/* Compute pairwise differences:
   https://blogs.sas.com/content/iml/2011/02/09/computing-pairwise-differences-efficiently-of-course.html */ 
start PairwiseDiff(x);   /* x is a column vector */
   n = nrow(x);   
   m = shape(x, n, n);   /* duplicate data */
   diff = m` - m;        /* pairwise differences */
   return( StrictLowerTriangular(diff) );
finish;
 
/* Implement the Passing-Bablok (1983) regression method */
start PassingBablok(_x, _y, alpha=0.05, debug=0);
   keepIdx = loc(_x^=. & _y^=.);      /* 1. remove missing values */
   x = _x[keepIdx]; y = _y[keepIdx];
   nObs = nrow(x);                    /* sample size of nonmissing values */
 
   /* 1. Compute all the pairwise slopes. For vertical, use large positive or negative values. */
   DX = T(PairwiseDiff(x));           /* x[i] - x[j] */
   DY = T(PairwiseDiff(y));           /* y[i] - y[j] */
 
   /* "Identical pairs of measurements with x_i=x_j and y_i=y_j
      do not contribute to the estimation of beta." (PB, p 711) */
   idx = loc(DX^=0 | DY^=0);          /* exclude 0/0 slopes (DX=0 and DY=0) */
   G = DX[idx] || DY[idx] || j(ncol(idx), 1, .); /* last col is DY/DX */
   idx = loc( G[,1] ^= 0 );           /* find where DX ^= 0 */
   G[idx,3] = G[idx,2] / G[idx,1];    /* form Sij = DY / DX for DX^=0 */
 
   /* Special cases:

February 14, 2022
 

This article implements Passing-Bablok regression in SAS. Passing-Bablok regression is a linear regression technique that is used to compare two methods of measurement, such as measurements from different instruments or medical devices. Both variables (X and Y) are measured with error. Consequently, you cannot use ordinary linear regression, which assumes that one variable (X) is measured without error. Passing-Bablok regression is a robust nonparametric regression method that does not make assumptions about the distribution of the expected values or the error terms in the model. It also does not assume that the errors are homoskedastic.

Several readers have asked about Passing-Bablok regression because I wrote an article about Deming regression, which is similar. On the SAS Support Communities, there was a lively discussion in response to a program by KSharp in which he shows how to use Base SAS procedures to perform a Passing-Bablok regression. In response to this interest, this article shares a SAS/IML program that performs Passing-Bablok regression.

An overview of Passing-Bablok regression

Passing-Bablok regression (Passing and Bablok, 1983, J. Clin. Chem. Clin. Biochem., henceforth denoted PB) fits a line through X and Y measurements. The X and Y variables measure the same quantity in the same units. Usually, X is the "gold standard" technique, whereas Y is a new method that is cheaper, more accurate, less invasive, or has some other favorable attributes. Whereas Deming regression assumes normal errors, Passing-Bablok regression can handle any error distribution, provided that the measurement errors for both methods follow the same distribution. Also, the variance of each method does not need to be constant across the range of data. All that is required is that the ratio of the variances is constant. The Passing-Bablok method is robust to outliers in X or Y.

Assume you have N pairs of observations {(xi, yi), i=1..N}, where X and Y are different ways to measure the same quantity. Passing-Bablok regression determines whether the measurement methods are equivalent by estimating a line y = a + b*x. The estimate of the slope (b) is based on the median of pairs of slopes Sij = (yi - yj) / (xi - xj) for i < j. You must be careful when xi=xj. PB suggest assigning a large positive value when xi=xj and yi > yj and a large negative value when xi=xj and yi < yj. If xi=xj and yi=yj, the pair does not contribute to the estimate.

The Passing-Bablok method estimates the regression line y = a + b*x by using the following steps:
  1. Compute all the pairwise slopes. For vertical slopes, use large positive or negative values, as appropriate.
  2. Exclude slopes that are exactly -1. The reason is technical, but it involves ensuring the same conclusions whether you regress Y onto X or X onto Y.
  3. Ideally, we would estimate the slope of the regression line (b) as the median of all the pairwise slopes of the points (xi, yi) and (xj, yj), where i < j. This is a robust statistic and is similar to Theil-Sen regression. However, the slopes are not independent, so the median can be a biased estimator. Instead, estimate the slope as the shifted median of the pairwise slopes. (A small numerical sketch after this list illustrates the shift.)
  4. Estimate the intercept, a, as the median of the values {yi - b*xi}.
  5. Use a combination of symmetry, rank-order statistics, and asymptotics to obtain confidence intervals (CIs) for b and a.
  6. You can use the confidence intervals (CIs) to assess whether the two methods are equivalent. In terms of the regression estimates, the CI for the intercept (a) must contain 0 and the CI for the slope (b) must contain 1. This article uses only the CIs proposed by Passing and Bablok. Some software supports jackknife or bootstrap CIs, but it is sometimes a bad idea to use resampling methods to estimate the CI of a median, so I do not recommend bootstrap CIs for Passing-Bablok regression.
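To make the "shifted median" in step 3 concrete, here is a minimal sketch in SAS/IML. The slope values are hypothetical (they are not computed from any data in this article); the sketch only shows how the estimate skips past the K pairwise slopes that are less than -1:

proc iml;
S = {-3, -0.5, 0.8, 1.0, 1.2, 1.5, 2.0};  /* hypothetical sorted pairwise slopes (none equal to -1) */
N = nrow(S);                /* N = 7 valid slopes */
K = sum(S < -1);            /* K = 1 slope is less than -1 */
b = S[(N+1)/2 + K];         /* N is odd: shifted median is S[5] = 1.2, not the ordinary median S[4] = 1.0 */
print K b;
quit;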

Implement Passing-Bablok regression in SAS

Because the Passing-Bablok method relies on computing pairwise slopes and medians, the SAS/IML language is a natural language for implementing the algorithm. The implementation uses two helper functions from previous articles, StrictLowerTriangular and PairwiseDiff, which are defined at the top of the following program.

The following modules implement the Passing-Bablok algorithm in SAS and store the modules for future use:

%let INFTY = 1E12;         /* define a big number to use as "infinity" for vertical slopes */
proc iml;
/* Extract only the values X[i,j] where i < j.
   https://blogs.sas.com/content/iml/2012/08/16/extract-the-lower-triangular-elements-of-a-matrix.html */
start StrictLowerTriangular(X);
   v = cusum( 1 || (ncol(X):2) );
   return( remove(vech(X), v) );
finish;
/* Compute pairwise differences:
   https://blogs.sas.com/content/iml/2011/02/09/computing-pairwise-differences-efficiently-of-course.html */ 
start PairwiseDiff(x);   /* x is a column vector */
   n = nrow(x);   
   m = shape(x, n, n);   /* duplicate data */
   diff = m` - m;        /* pairwise differences */
   return( StrictLowerTriangular(diff) );
finish;
 
/* Implement the Passing-Bablok (1983) regression method */
start PassingBablok(_x, _y, alpha=0.05, debug=0);
   keepIdx = loc(_x^=. & _y^=.);      /* 1. remove missing values */
   x = _x[keepIdx]; y = _y[keepIdx];
   nObs = nrow(x);                    /* sample size of nonmissing values */
 
   /* 1. Compute all the pairwise slopes. For vertical, use large positive or negative values. */
   DX = T(PairwiseDiff(x));           /* x[i] - x[j] */
   DY = T(PairwiseDiff(y));           /* y[i] - y[j] */
 
   /* "Identical pairs of measurements with x_i=x_j and y_i=y_j
      do not contribute to the estimation of beta." (PB, p 711) */
   idx = loc(DX^=0 | DY^=0);          /* exclude 0/0 slopes (DX=0 and DY=0) */
   G = DX[idx] || DY[idx] || j(ncol(idx), 1, .); /* last col is DY/DX */
   idx = loc( G[,1] ^= 0 );           /* find where DX ^= 0 */
   G[idx,3] = G[idx,2] / G[idx,1];    /* form Sij = DY / DX for DX^=0 */
 
   /* Special cases:
      1) x_i=x_j, but y_i^=y_j   ==> Sij = +/- infinity
      2) x_i^=x_j, but y_i=y_j   ==> Sij = 0
      "The occurrence of any of these special cases has a probability 
      of zero (experimental data should exhibit these cases very rarely)."
 
      Passing and Bablok do not specify how to handle, but one way is
      to keep these values within the set of valid slopes. Use large 
      positive or negative numbers for +infty and -infty, respectively.
   */
   posIdx = loc(G[,1]=0 & G[,2]>0);    /* DX=0 & DY>0 */
   if ncol(posIdx) then G[posIdx,3] =  &INFTY;
   negIdx = loc(G[,1]=0 & G[,2]<0);    /* DX=0 & DY<0 */
   if ncol(negIdx) then G[negIdx,3] = -&INFTY;
   if debug then PRINT (T(1:nrow(G)) || G)[c={"N" "DX" "DY" "S"}];
 
   /* 2. Exclude slopes that are exactly -1 */
   /* We want to be able to interchange the X and Y variables. 
      "For reasons of symmetry, any Sij with a value of -1 is also discarded." */
   idx = loc(G[,3] ^= -1); 
   S = G[idx,3];
   N = nrow(S);                      /* number of valid slopes */
   call sort(S);                     /* sort slopes ==> pctl computations easier */
   if debug then print (S`)[L='sort(S)'];
 
   /* 3. Estimate the slope as the shifted median of the pairwise slopes. */
   K = sum(S < -1);                  /* shift median by K values */
   b = choose(mod(N,2),              /* estimate of slope */
              S[(N+1)/2+K ],         /* if N odd, shifted median */
              mean(S[N/2+K+{0,1}])); /* if N even, average of adjacent values */
   /* 4. Estimate the intercept as the median of the values {yi – bxi}. */
   a = median(y - b*x);              /* estimate of the intercept, a */
 
   /* 5. Estimate confidence intervals. */
   C = quantile("Normal", 1-alpha/2) * sqrt(nObs*(nObs-1)*(2*nObs+5)/18);
   M1 = round((N - C)/2);
   M2 = N - M1 + 1;
   if debug then PRINT K C M1 M2 (M1+K)[L="M1+K"] (M2+K)[L="M2+K"];
   CI_b = {. .};                     /* initialize CI to missing */
   if 1 <= M1+K & M1+K <= N then CI_b[1] = S[M1+K];
   if 1 <= M2+K & M2+K <= N then CI_b[2] = S[M2+K];
   CI_a = median(y - CI_b[,2:1]@x);  /* aL=med(y-bU*x); aU=med(y-bL*x); */
 
   /* return a list that contains the results */
   outList = [#'a'=a, #'b'=b, #'CI_a'=CI_a, #'CI_b'=CI_b]; /* pack into list */
   return( outList );
finish;
 
store module=(StrictLowerTriangular PairwiseDiff PassingBablok);
quit;

Notice that the PassingBablok function returns a list. A list is a convenient way to return multiple statistics from a function.

Perform Passing-Bablok regression in SAS

Having saved the SAS/IML modules, let's see how to call the PassingBablok routine and analyze the results. The following DATA step defines 102 pairs of observations. These data were posted to the SAS Support Communities by @soulmate. To support running regressions on multiple data sets, I use macro variables to store the name of the data set and the variable names.

data PBData;
label x="Method 1" y="Method 2";
input x y @@;
datalines;
0.07 0.07  0.04 0.10  0.07 0.08  0.05 0.10  0.06 0.08
0.06 0.07  0.03 0.05  0.04 0.05  0.05 0.08  0.05 0.10
0.12 0.21  0.10 0.17  0.03 0.11  0.18 0.24  0.29 0.33
0.05 0.05  0.04 0.05  0.11 0.21  0.38 0.36  0.04 0.05
0.03 0.06  0.06 0.07  0.46 0.27  0.08 0.05  0.02 0.09
0.12 0.23  0.16 0.25  0.07 0.10  0.07 0.07  0.04 0.06
0.11 0.23  0.23 0.21  0.15 0.20  0.05 0.07  0.04 0.14
0.05 0.11  0.17 0.26  8.43 8.12  1.86 1.47  0.17 0.27
8.29 7.71  5.98 5.26  0.57 0.43  0.07 0.10  0.06 0.11
0.21 0.24  8.29 7.84  0.28 0.23  8.25 7.79  0.18 0.25
0.17 0.22  0.15 0.21  8.33 7.67  5.58 5.05  0.09 0.14
1.00 3.72  0.09 0.13  3.88 3.54  0.51 0.42  4.09 3.64
0.89 0.71  5.61 5.18  4.52 4.15  0.09 0.12  0.05 0.05
0.05 0.05  0.04 0.07  0.05 0.07  0.09 0.11  0.03 0.05
2.27 2.08  1.50 1.21  5.05 4.46  0.22 0.25  2.13 1.93
0.05 0.08  4.09 3.61  1.46 1.13  1.20 0.97  0.02 0.05
1.00 1.00  3.39 3.11  1.00 0.05  2.07 1.83  6.68 6.06
3.00 2.97  0.06 0.09  7.17 6.55  1.00 1.00  2.00 0.05
2.91 2.45  3.92 3.36  1.00 1.00  7.20 6.88  6.42 5.83
2.38 2.04  1.97 1.76  4.72 4.37  1.64 1.25  5.48 4.77
5.54 5.16  0.02 0.06
;
 
%let DSName = PBData;    /* name of data set */
%let XVar = x;           /* name of X variable (Method 1) */
%let YVar = y;           /* name of Y variable (Method 2) */
 
title "Measurements from Two Methods";
proc sgplot data=&DSName;
   scatter x=&XVar y=&YVar;
   lineparm x=0 y=0 slope=1 / clip;    /* identity line: Intercept=0, Slope=1 */
   xaxis grid;
   yaxis grid;
run;

The graph indicates that the two measurements are linearly related and strongly correlated. It looks like Method 1 (X) tends to produce higher measurements than Method 2 (Y) because most points lie below (to the right of) the identity line. Three outliers are visible. The Passing-Bablok regression should be robust to these outliers.

The following SAS/IML program analyzes the data by calling the PassingBablok function:

proc iml;
load module=(StrictLowerTriangular PairwiseDiff PassingBablok);
use &DSName; 
   read all var {"&XVar"} into x;
   read all var {"&YVar"} into y;
close;
 
R = PassingBablok(x, y);
ParameterEstimates = (R$'a' || R$'CI_a') //        /* create table from the list items */
                     (R$'b' || R$'CI_b');
print ParameterEstimates[c={'Estimate' 'LCL' 'UCL'} r={'Intercept' 'Slope'}];

The program reads the X and Y variables and calls the PassingBablok function, which returns a list of statistics. You can use the ListPrint subroutine to print the items in the list, or you can extract the items and put them into a matrix, as in the example.
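For example, after the call above, the following statement is a minimal sketch of the first approach (it assumes that your SAS/IML release supports the ListPrint subroutine for lists):

call ListPrint(R);       /* print every item (a, b, CI_a, CI_b) in the returned list */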

The output shows that the Passing-Bablok regression line is Y = 0.028 + 0.912*X. The confidence interval for the slope parameter (b) does NOT include 1, which means that we reject the null hypothesis that the measurements are related by a line that has unit slope. Moreover, even if the CI for the slope had contained 1, the CI for the intercept (a) does not include 0, so we would reach the same conclusion. For these data, we reject the hypothesis that the measurements from the two methods are equivalent.

Of course, you could encapsulate the program into a SAS macro that performs Passing-Bablok regression.
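Here is a minimal sketch of such a macro. The macro name and parameter names are arbitrary, and the sketch assumes that the SAS/IML modules have already been stored, as in the earlier program:

%macro PassingBablokReg(DSName=, XVar=, YVar=);
proc iml;
load module=(StrictLowerTriangular PairwiseDiff PassingBablok);
use &DSName;
   read all var {"&XVar"} into x;
   read all var {"&YVar"} into y;
close;
R = PassingBablok(x, y);
ParameterEstimates = (R$'a' || R$'CI_a') //
                     (R$'b' || R$'CI_b');
print ParameterEstimates[c={'Estimate' 'LCL' 'UCL'} r={'Intercept' 'Slope'}];
quit;
%mend PassingBablokReg;
 
/* example call for the data in this article */
%PassingBablokReg(DSName=PBData, XVar=x, YVar=y);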

A second example of Passing-Bablok regression in SAS

For completeness, the following example shows results for two methods that are equivalent. The output shows that the CI for the slope includes 1 and the CI for the intercept includes 0:

%let DSName = PBData2;    /* name of data set */
%let XVar = m1;           /* name of X variable (Method 1) */
%let YVar = m2;           /* name of Y variable (Method 2) */
 
data &DSName;
label m1="Method 1" m2="Method 2";
input m1 m2 @@;
datalines;
10.1 10.1  10.5 10.6  12.8 13.1   8.7  8.7  10.8 11.5 
10.6 10.4   9.6  9.9  11.3 11.3   7.7  7.5  11.5 11.9 
 7.9  7.8   9.9  9.7   9.3  9.9  16.3 16.2   9.4  9.8 
11.0 11.8   9.1  8.7  12.2 11.9  13.4 12.6   9.2  9.5 
 8.8  9.1   9.3  9.2   8.5  8.7   9.6  9.7  13.5 13.1 
 9.4  9.1   9.5  9.6  10.8 11.3  14.6 14.8   9.7  9.3 
16.4 16.5   8.1  8.6   8.3  8.1   9.5  9.6  20.3 20.4 
 8.6  8.5   9.1  8.8   8.8  8.7   9.2  9.9   8.1  8.1 
 9.2  9.0  17.0 17.0  10.2 10.0  10.0  9.8   6.5  6.6 
 7.9  7.6  11.3 10.7  14.2 14.1  11.9 12.7   9.9  9.4 
;
 
proc iml;
load module=(StrictLowerTriangular PairwiseDiff PassingBablok);
use &DSName; 
   read all var {"&XVar"} into x;
   read all var {"&YVar"} into y;
close;
 
R = PassingBablok(x, y);
ParameterEstimates = (R$'a' || R$'CI_a') // 
                     (R$'b' || R$'CI_b');
print ParameterEstimates[c={'Estimate' 'LCL' 'UCL'} r={'Intercept' 'Slope'}];
QUIT;

For these data, the confidence interval for the Slope term is [0.98, 1.06], which contains 1. The confidence interval for the Intercept is [-0.67, 0.23], which contains 0. Thus, these data suggest that the two methods are equivalent ways to measure the same quantity.

Summary

This article shows how to implement Passing-Bablok regression in SAS by using the SAS/IML language. It also explains how Passing-Bablok regression works and why different software might provide different confidence intervals. You can download the complete SAS program that implements Passing-Bablok regression and creates the graphs and tables in this article.


Appendix: Comparing the results of Passing-Bablok regression

The PB paper is slightly ambiguous about how to handle certain special situations in the data. The ambiguity is in the calculation of confidence intervals. Consequently, if you run the same data through different software, you are likely to obtain the same point estimates, but the confidence intervals might be slightly different (maybe in the third or fourth decimal place), especially when there are repeated X values that have multiple Y values. For example, I have compared my implementation to MedCalc software and to the mcr package in R. The authors of the mcr package made some choices that are different from the PB paper. By looking at the code, I can see that they handle tied data values differently from the way I handle them.

The post Passing-Bablok regression in SAS appeared first on The DO Loop.

January 26, 2022
 

Sometimes it is useful to know the extreme values in data. You might need to know the Top 5 or Top 10 smallest data values, or the Top 5 or Top 10 largest data values. There are many ways to do this in SAS, but this article shows examples that use four different methods to find the top k or bottom k data values:

  1. Use the NEXTROBS= option in PROC UNIVARIATE to output the N EXTReme OBServations. This is by far the easiest method.
  2. Sort the data in ascending or descending order. Display the top k rows.
  3. Transpose the data to make a wide data set. Use the SMALLEST and LARGEST functions in the SAS DATA step to obtain the k largest or smallest data values.
  4. If the data contain tied values, you might want to find the largest or smallest unique values. For that task, you can use the NEXTRVAL= option in PROC UNIVARIATE, or you can use PROC RANK.

This article discusses how to find the most extreme data values for a continuous (interval) variable. Each method has different ways to handle missing values and tied values.

A related question is finding the most frequently observed categories for a categorical variable. If you are interested in making a Top 10 list or bar chart of a categorical variable, see the article, "An easy way to make a "Top 10" table and bar chart in SAS."

Example data

The following SAS DATA step defines a variable that has 100 observations. One value is missing, each nonmissing value is in the interval [0, 4], and some of the values are repeated.

/* 100 values. One missing value. Nonmissing values in [0, 4].
   Tied values include 0.02 (two times) and 0.03 (three times)
*/
data Have;
input x @@;
datalines;
0.54 3.33 2.55 0.95 1.11 1.02 1.09 1.78 2.87 2.53
   . 1.49 0.74 0.02 0.29 0.17 1.08 0.21 4.00 0.83
0.09 2.25 2.56 1.93 0.34 0.31 0.08 0.29 0.65 0.08
0.78 0.08 0.83 0.01 0.46 1.22 0.53 0.07 0.08 0.46
0.59 0.04 0.26 0.68 0.10 1.91 0.31 0.24 0.05 1.04
0.24 0.10 0.09 0.03 0.30 2.98 0.33 0.18 0.22 2.50
2.41 2.66 0.22 0.60 2.32 1.17 1.14 0.05 0.56 0.42
2.87 0.03 0.40 0.44 0.23 1.64 0.31 1.00 1.96 0.08
0.10 0.06 1.24 0.03 0.63 1.14 0.07 0.06 0.13 1.83
1.14 0.18 0.13 0.75 0.69 1.62 2.27 1.50 0.02 1.04 
;

There is a missing value for the X variable. The value 0.02 appears twice. The value 0.03 appears three times. The five smallest nonmissing values are {0.01, 0.02, 0.02, 0.03, 0.03}.

Use PROC UNIVARIATE

The simplest way to see the k largest and smallest observations is to use the NEXTROBS= option on the PROC UNIVARIATE statement. You can specify the number of extreme observations that you wish to see. For example, the following statements ask for k=5 extreme observations:

%let nTop = 5;  /* display the Top 5 largest and smallest values */
 
proc univariate data=Have NExtrObs=&nTop;
   var x;
   ods select ExtremeObs;
run;

The table shows that the Top 5 smallest values are 0.01, 0.02, 0.02, 0.03, and 0.03. It provides the observation numbers (rows) for each value. Similarly, the table shows that the Top 5 largest values are 2.87, 2.87, 2.98, 3.33, and 4.00. The output also shows that tied values are handled arbitrarily. Although the value 0.03 appears three times in the data, the table does not reveal that information.

However, you can also use PROC UNIVARIATE to find the k smallest and largest UNIQUE values. To do this, specify the NEXTRVAL= option, as follows:

proc univariate data=Have NExtrObs=&nTop NExtrVal=&nTop ;
   var x;
   ods select ExtremeObs ExtremeValues;
run;

The second table shows that the value 0.02 appears twice in the data. The value 0.03 appears three times, and so on. This table is useful when you have many tied values in the data.

PROC UNIVARIATE is the simplest and most flexible way to display extreme values. By adding additional variables to the VAR statement, it is easy to report this information for several continuous variables.

Although PROC UNIVARIATE is the best way, I often see other suggestions made on discussion forums. For completeness, the rest of the article discusses alternative ways to report the smallest and largest observations in univariate data.

The sorting method

The sorting method is conceptually easy. If you sort the data in increasing order, the first few observations are the smallest. If you sort the data in decreasing order, the first few observations are the largest. You can use PROC SORT and PROC PRINT to reveal the largest and smallest values for the X variable:

%let nTop = 5;  /* display the Top 5 largest and smallest values */
 
proc sort data=Have out=SortInc;  by x;  run;
proc sort data=Have out=SortDec;  by descending x;  run;
 
proc print data=SortInc(obs=&nTop keep=x rename=(x=Smallest)); run;
proc print data=SortDec(obs=&nTop keep=x rename=(x=Largest)); run;

The tables show the smallest and largest values as computed by sorting the data. By convention, a missing value in SAS is smaller than any double-precision floating-point value. Thus, the list of smallest values includes the missing value. The output also shows that tied values are handled arbitrarily, as for the NEXTROBS= option in PROC UNIVARIATE.

Notice that PROC UNIVARIATE automatically excluded missing observations. If you use the sorting method and do not want the missing value to appear, you can use a WHERE clause to filter it out:

proc sort data=Have out=SortInc;
   where not missing(x);         /* exclude missing values from the smallest values */
   by x;  
run;

The sorting method is conceptually easy to implement. A disadvantage is that it creates two copies of the data: one for the ascending case and another for the descending case. After you create the data sets, it is easy to ask for the Top 5, Top 10, or Top 20 values without requiring additional computation. However, it is not well suited if you want to report extreme values for multiple variables.

The SMALLEST and LARGEST functions

The SAS DATA step supports the SMALLEST and LARGEST functions, which return the kth smallest and kth largest values in a list of values, respectively. That is, the SMALLEST and LARGEST functions are designed to work on rows of data, whereas the sorting method works on columns.

You can transpose the values, put them in a temporary array, and use the SMALLEST and LARGEST functions to find the extreme values in the data. To use this technique, you first need to know how many values exist so that you can allocate an array of the correct size. The following DATA step uses a standard technique to discover the number of rows in a data set. Then the X variable is read into a temporary array. After the data are read, the SMALLEST and LARGEST functions are called multiple times to obtain the results:

/* discover the number of rows in a data set */
data _NULL_;
if 0 then 
   set Have NOBS=nRows;
call symputx('nRows', nRows);  /* put number of rows into macro variable */
stop;
run;
%put &=nRows;
 
/* DATA step supports the SMALLEST and LARGEST functions for rows */
data SmallLarge;
array v[&nRows] _temporary_;
set Have nobs=nObs end=eof;
v[_N_] = x;
if eof then do;
   do k = 1 to &nTop;
      Smallest = smallest(k, of v[*]);
      Largest = largest(k, of v[*]);
      output;
   end;
end;
keep k Smallest Largest;
run;
 
proc print data=SmallLarge;
run;

The SMALLEST and LARGEST functions are great for finding extreme values in a row of data. However, for data in a column, this example shows that this DATA step method is more complicated to implement than the procedure-based methods. Unlike the sorting method, it does not create copies of the data set on a disk. However, it does keep a copy in RAM. A disadvantage to this method is that each call to the SMALLEST and LARGEST functions traverses the array. Thus, this method can be inefficient. If you use it to generate a list of the 50 smallest values, the array is traversed 50 times.

The ranking method

You can use PROC RANK to assign a rank to each value. PROC RANK has several methods for ranking tied values. For this application, use the TIES=DENSE option. The following statements show how to display the smallest values for a variable. If you use the DESCENDING option on the PROC RANK statement, you obtain the largest values:

proc rank data=Have out=RankOutInc ties=DENSE /* Optional: DESCENDING */;
   var x;
   ranks Rank;
run;
 
proc print data=RankOutInc;
   where Rank <= &nTop & not missing(Rank);
run;

Because tied values are assigned the same ranks, this method is similar to using the NEXTRVAL= option in PROC UNIVARIATE. You can see, however, that PROC RANK does not reorder the data values. Although the table shows the five smallest values in the data, they are not displayed in a sorted order.
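If you prefer to see those smallest values in sorted order, one option (a minimal sketch; the output data set name is arbitrary) is to subset and sort the PROC RANK output by the rank variable:

proc sort data=RankOutInc out=TopSmallest;
   where Rank <= &nTop and not missing(Rank);   /* keep the nTop smallest ranks */
   by Rank;                                     /* sort so the smallest values appear first */
run;
 
proc print data=TopSmallest; run;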

Summary

In summary, there are several ways to use SAS to find the Top 5 (or Top 10) smallest and largest values in data. I recommend using the NEXTROBS= option on the PROC UNIVARIATE statement. Not only is it easy to use, but you can display the smallest/largest values for multiple variables. For comparison, this article presents several alternative methods including sorting, using the SMALLEST and LARGEST functions, and using PROC RANK. These methods are less convenient than PROC UNIVARIATE.

The post 4 ways to find the k smallest and largest data values in SAS appeared first on The DO Loop.

January 24, 2022
 

How can you estimate percentiles in SAS Viya? This article shows how to call the percentile action from PROC CAS to estimate percentiles of variables in a CAS data table. You can use the same technique to estimate weighted percentiles. Percentiles and quantiles are essentially the same (the pth quantile is the 100*pth percentile for p in [0, 1]), so this article also shows how to estimate quantiles in SAS Viya.

I previously mentioned that some SAS procedures are CAS-enabled, which means that they will call a CAS action that runs on the CAS server if you specify a CAS table on the DATA= option. However, I also mentioned that some Base SAS procedures are "hybrid," which means that they might run on the CAS server or on the SAS Compute Server. But be careful: some Base SAS procedures (including the DATA step) look at the options that you specify to decide whether they can process the data on the CAS server. If not, they will pull the data down from the server to the SAS client and compute the result there. This is probably not what you want. It is inefficient to copy data from the CAS server if you can perform the computation on the server where the data are.

PROC MEANS is a hybrid procedure

A good example of a hybrid procedure is PROC MEANS. If the data are in a CAS table, PROC MEANS will look at the options that you specify. If you request basic descriptive statistics (N, MIN, MAX, MEAN, STD, etc), the procedure will call the aggregate action and compute the statistics in CAS. However, for some procedure options (and, I confess, I don't know which ones) the procedure decides that the requested statistics should be computed on the SAS client and uses the CAS LIBNAME engine to access the data. This results in reading the data from the CAS table and performing the computation on the SAS client.

Let's construct an example for which PROC MEANS computes the statistics in CAS. The following program loads the data in the Sashelp.Cars data set into a CAS table. It then calls PROC MEANS to compute descriptive statistics:

cas;                /* connect to CAS server */
libname mylib cas;  /* active caslib, whatever it is */
 
proc casutil;
   load data=Sashelp.Cars casout='Cars' replace; /* load data into the active caslib */
quit;
 
proc means data=mylib.Cars nolabels min mean max;   /* runs action on CAS server */
   class Origin;
   var Cylinders MPG_City;
run;

The procedure outputs a familiar table that displays the mean, min, and max statistics for two variables, each grouped according to levels of the classification variable. In addition, the procedure displays the following note in the log:

NOTE: The CAS aggregation.aggregate action will be used to perform the initial summarization.

This note informs you that the computation took place on the CAS server. It also tells you that PROC MEANS used the aggregate action to perform the computation.

The behavior of PROC MEANS changes if you request percentiles. For some reason (unknown to me), the percentiles are not computed on the CAS server. Instead, the data are read from the CAS table by using the CAS libname engine, and the computation takes place on the SAS client:

proc means data=mylib.Cars nolabels P25 P50 P75;  /* run entirely on the SAS compute server */
   class Origin;
   var Cylinders MPG_City;
run;

In addition to the output, the procedure writes the following note to the log:

NOTE: There were 428 observations read from the data set MYLIB.CARS.

This note implicitly tells you that the computation did not occur on the CAS server.

Computing percentiles on the CAS server

Although PROC MEANS does not automatically compute percentiles on the CAS server, you can use a CAS action to estimate the percentiles. By definition, an action always computes on the CAS server. I often look at the "Action Sets by Name" documentation when I am trying to find an action that will perform an analysis. In this case, you can search that page for the word "percentile" and find the documentation for the syntax of the percentile action. There are separate tabs for calling the action from SAS (the "CASL syntax"), from Lua, from Python, and from R. This article uses the CASL syntax, which tells you how to call the action from PROC CAS in SAS.

Let's run a few examples. You can use the percentile action (in the percentile action set) to compute percentiles of variables in CAS tables. Recall that you can call an action by specifying its full two-level name (ActionSet.ActionName) each time, or by using the LOADACTIONSET statement to load the actions into your CAS session. After loading the action set, you can refer to the action by using only its name.
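For example, the following sketch calls the action by its two-level name, so it does not require a LOADACTIONSET statement. It computes only the median of one variable and is shown just to illustrate the calling convention:

proc cas;
   percentile.percentile /              /* two-level name: ActionSet.ActionName */
      table={name="Cars", vars={"MPG_City"}}
      values={50};                      /* the median */
run;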

The following call to PROC CAS loads the percentile action set, then calls the percentile action. The table= parameter is required. It is used to specify the CAS table that contains the data. Optionally, you can use additional parameters to specify the variables in the analysis, the percentiles to estimate, and more. The following call is similar to the previous PROC MEANS call. It estimates the same percentiles for the same variables.

/* load the percentile action set and make a basic call */
proc cas;
   loadactionset 'percentile';           /* load the action set */
   percentile /                          /* call the percentile action */
      table={name="Cars",                    /* name of table (in active caslib) */
             vars={"Cylinders" "MPG_City"},  /* name of analysis variables */
             groupby={"Origin"}              /* (optional) name of classification variables */
            }
      values={25 50 75}                  /* specify the percentiles */
      ;                                  /* end of syntax for the action */
run;

The output from the percentile action is in "long form" rather than "wide form," but the estimates are the same.

You might notice that the output contains a column labeled "Converged," and that the rows display "Yes." By default, the percentile action uses an iterative process (method="ITERATIVE") to estimate the percentiles. The documentation states that the iterative process "is very memory efficient to compute percentiles." If you want to use a different estimation method, you can change some parameters. Most importantly, you can use the values= parameter to specify any percentile values in [0, 100]. (By convention, the 0th percentile is the sample minimum and the 100th percentile is the sample maximum.) For example, the following statements use the default estimation method that PROC MEANS uses and add additional values to the list of percentiles.

/* You can specify other options such as percentiles and method */
proc cas;
   percentile /                          /* call the percentile action */
      table={name="Cars",                    /* name of table (in active caslib) */
             vars={"Cylinders" "MPG_City"}   /* name of analysis variables */
            }
      values={10  17.5  50  82.5  90}    /* specify the percentiles */
      pctldef = 5                        /* percentile definition; 5 is the PROC MEANS default */
      method = "Exact"                   /* method for estimation */
      ;                                  /* end of syntax for the action */
run;

The percentile action also supports a weight= parameter, which you can use to specify a weight variable to compute weighted percentiles.
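As a minimal sketch of the syntax, the following call passes a weight variable to the action. The use of the Weight variable in the Cars table (a vehicle weight) is purely illustrative, and the placement of the weight= parameter is my reading of the action's documentation rather than something shown earlier in this article:

proc cas;
   percentile /                          /* assumes the percentile action set was loaded earlier in this session */
      table={name="Cars",
             vars={"MPG_City"}
            }
      values={25 50 75}
      weight="Weight"                    /* illustrative weight variable */
      ;
run;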

Summary

This article shows how to call the percentile action from PROC CAS to compute percentiles of variables in a CAS data table. The percentile action can analyze multiple variables and can estimate any percentiles that you specify. You can use the groupby= parameter inside the table= specification to estimate the percentiles for joint levels of categorical variables, which is similar to using the CLASS statement in PROC MEANS. The action also supports a weight= parameter for specifying a weight variable.

An action will always perform its computations on the CAS server (where the data are). In contrast, some Base SAS procedures are "hybrid" procedures that may or may not compute on the CAS server. Consequently, I prefer to call actions when I need to ensure that the computations will be performed in CAS.

The post Estimate percentiles in SAS Viya appeared first on The DO Loop.