Rick Wicklin

October 21, 2019
 

Computing rates and proportions is a common task in data analysis. When you are computing several proportions, it is helpful to visualize how the rates vary among subgroups of the population. Examples of proportions that depend on subgroups include:

  • Mortality rates for various types of cancers
  • Incarceration rates by race
  • Four-year graduation rates by academic major

The first two examples are somewhat depressing, so I will use graduation rates for this article.

Uncertainty in estimates

An important fact to remember is that the uncertainty in an estimate depends on the sample size. If a small college has 8 physics majors and 5 of them graduate in four years, the graduation rate in physics is 5/8 ≈ 0.6. However, because of the small sample size, the uncertainty in that estimate is much greater than for a larger group, such as an English department that graduates 50 out of 80 students. Specifically, if the estimate of a binomial proportion is p, the standard error of the estimate is sqrt(p(1-p)/n), where n is the sample size. Thus for the physics students, the standard error is approximately sqrt(0.6*0.4/8) = 0.17, whereas for the English majors, the standard error is approximately sqrt(0.6*0.4/80) = 0.05.
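If you want to verify these computations, the following DATA step is a quick check. It uses the exact proportions (5/8 = 50/80 = 0.625), so the standard errors differ slightly from the rounded computations above:

data StdErrCheck;
input Dept $ Grads Total;
p = Grads / Total;              /* empirical proportion */
StdErr = sqrt(p*(1-p)/Total);   /* standard error of the proportion */
datalines;
Physics  5  8
English 50 80
;
 
proc print data=StdErrCheck noobs; run;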

Therefore, it is a good idea to incorporate some visual aspect of the uncertainty into any graph of proportions and rates. For analyses that involve dozens or hundreds of groups, you can use a funnel plot of proportions, which I have used to analyze adoption rates for children and immunization rates for kindergartens in North Carolina. When you have a smaller number of groups, a simple alternative is a dot plot with error bars that indicate either the standard error or a 95% confidence interval for the estimate. As I've explained, I prefer to display the confidence interval.

Sample data: Graduation rates

The Chronicle of Higher Education web site enables you to find the graduation rates for US colleges and universities. You can find the average graduation rate by state (50 groups) or by college (hundreds of groups). You can also find the graduation rate by race (five groups) for any individual college. Because most colleges have fewer Hispanic, Asian, and Native American students, it is important to indicate the sample size or the uncertainty in the empirical estimates.

I don't want to embarrass any small college, so the following data are fake but are typical of the group sizes that you might see in real data. Suppose a college has six majors, labeled as A, B, C, D, E, and F. The following SAS DATA step defines the number of students who graduated in four years (Grads) and the number of students in each cohort (Total).

data Grads;
input Major $ Grads Total @@;
datalines;
A 10 22  B 10 32  C 17 25
D  4  7  E  8 14  F 16 28
;

Manual computations of confidence intervals for proportions

If you use a simple Wald confidence interval, it is easy to write a short DATA step to compute the empirical proportions and a 95% confidence interval for each major:

data GradRate;
set Grads;
p = Grads / Total;               /* empirical proportion */
StdErr = sqrt(p*(1-p)/Total);    /* standard error */
/* use Wald 95% CIs */
z = quantile("normal", 1-0.05/2);
LCL = max(0,  p - z*StdErr);     /* LCL can't be less than 0 */
UCL = min(1,  p + z*StdErr);     /* UCL can't be greater than 1 */
label p = "Proportion" LCL="Lower 95% CL" UCL="Upper 95% CL";
run;
 
proc print data=GradRate noobs label;
   var Major Grads Total p LCL UCL;
run;

The output shows that although majors D, E, and F have the same four-year graduation rate (57%), the estimate for the D group, which has only seven students, has twice as much variability as the estimate for the F group, which has four times as many students.

Automating the computations by using PROC FREQ

Although it is easy enough to write a DATA step for the Wald CI, other types of confidence intervals are more complicated. The BINOMIAL option in the TABLES statement of PROC FREQ enables you to compute many different confidence intervals (CIs), including the Wald CI. In order to use these data in PROC FREQ, you need to convert the data from Event-Trials format to Event-Nonevent format. For each major, let Graduated="Yes" indicate the count of students who graduated in four years and let Graduated="No" indicate the count of the remaining students. The following DATA step converts the data; the subsequent call to PROC FREQ estimates the binomial proportion for each group:

/* convert data to Event/Nonevent format */
data GradFreq;
set Grads;
Graduated = "Yes"; Count = Grads;       output;
Graduated = "No "; Count = Total-Grads; output;
run;
 
/* Use PROC FREQ to analyze each group separately and compute the binomial CIs */
proc freq data=GradFreq noprint;
   by notsorted Major; 
   tables Graduated / binomial(level='Yes' CL=wald); /* choose from among many confidence intervals */
   weight Count;
   output out=FreqOut binomial;
run;
 
proc print data=FreqOut noobs label;
   var Major N _BIN_ L_BIN U_BIN ;
   label _BIN_ = "Proportion" L_BIN="Lower 95% CL" U_BIN="Upper 95% CL";
run;

The output is not shown because the estimates and CIs from PROC FREQ are identical to the estimates from the "manual" calculations in the previous section. However, by using PROC FREQ you can easily compute more sophisticated confidence intervals.

Visualizing binomial proportions

As indicated earlier, it is useful to plot the proportions and confidence intervals. When you plot several proportions on the same graph, I recommend that you sort the data in some way, such as by the estimated proportions. If there are two groups that have the same proportion, you can use the size of the group to break the tie.

It can also be helpful to draw a reference line for the overall rate, regardless of group membership. You can get the overall proportion by repeating the previous call to PROC FREQ but without using the BY statement, as sketched below. For these data, the overall proportion of students who graduate in four years is 65/128 = 0.5078. Lastly, I think it is a good idea to add a table that shows the number of students in each major. You can use the YAXISTABLE statement to add that information to the graph.
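The overall-proportion computation is a one-statement change (a sketch; the output data set name FreqOverall is arbitrary):

/* overall proportion: same analysis, but without the BY statement */
proc freq data=GradFreq noprint;
   tables Graduated / binomial(level='Yes' CL=wald);
   weight Count;
   output out=FreqOverall binomial;   /* _BIN_ = 65/128 = 0.5078 */
run;

The following statements sort the estimates and create the dot plot with the axis table: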

/* sort by estimated proportion; break ties by using CI */
proc sort data=FreqOut;
   by _BIN_ N;
run;
 
title "Graduation Rates by College Major";
proc sgplot data=FreqOut;
   label _BIN_ = "Proportion" N="Number of Students";
   scatter y=Major x=_BIN_ / xerrorlower=L_BIN xerrorupper=U_BIN;
   yaxistable N / y=Major location=inside position=left valueattrs=(size=9);  
   refline 0.5078 / axis=x labelloc=inside label="Overall";
   yaxis discreteorder=data offsetmax=0.1;       /* preserve order of categories */
   xaxis grid values=(0 to 1 by 0.1) valueshint;
run;

The graph shows the range of graduation rates. The "error bars" are 95% CIs, which show that majors that have few students have larger uncertainty than majors that have more students. If there are 10 or more categories, I recommend that you use alternating color bands to make it easier for the reader to associate intervals with the majors.
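For example, one way to get alternating bands (a sketch; the attribute values are a matter of taste) is to modify the YAXIS statement in the previous PROC SGPLOT call:

/* in the previous PROC SGPLOT call, replace the YAXIS statement with: */
yaxis discreteorder=data offsetmax=0.1
      colorbands=even colorbandsattrs=(color=gray transparency=0.9);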

In summary, this article shows how to use PROC FREQ to estimate proportions and confidence intervals for groups of binary data. A great way to convey the proportions to others is to graph the proportions and CIs. By including the sample size on the graph, readers can connect the uncertainty in the estimates to the sample size.

The post Compute and visualize binomial proportions in SAS appeared first on The DO Loop.

October 16, 2019
 

The EFFECT statement is supported by more than a dozen SAS/STAT regression procedures. Among other things, it enables you to generate spline effects that you can use to fit nonlinear relationships in data. Recently there was a discussion on the SAS Support Communities about how to interpret the parameter estimates of spline effects. This article answers that question by visualizing the spline effects.

An overview of generated effects

Spline effects are powerful because they enable you to use parametric models to fit nonlinear relationships between an independent variable and a response. Using spline effects is not much different from using polynomial effects to fit nonlinear relationships. Suppose that a response variable, Y, appears to depend on an explanatory variable, X, in a complicated nonlinear fashion. If the relationship looks quadratic or cubic, you might try to capture the relationship by introducing polynomial effects. Instead of trying to model Y by X, you might try to use X, X^2, and X^3.

Strictly speaking, polynomial effects do not need to be centered at the origin. You could translate the polynomial by some amount, k, and use shifted polynomial effects such as (X-k), (X-k)^2, and (X-k)^3. Or you could combine these shifted polynomials with polynomials at the origin. Or use polynomials that are shifted by different amounts, such as by the constants k1, k2, and k3.

Spline effects are similar to (shifted) polynomial effects. The constants (such as k1, k2, k3) that are used to shift the polynomials are called knots. Knots that are within the range of the data are called interior knots. Knots that are outside the range of the data are called exterior knots or boundary knots. You can read about the various kinds of spline effects that are supported by the EFFECT statement in SAS. Rather than rehash the mathematics, this article shows how you can use SAS to visualize a regression that uses splines. The visualization clarifies the meaning of the parameter estimates for the spline effects.

Output and visualize spline effects

This section shows how to output the spline effects into a SAS data set and plot the spline effects. Suppose you want to predict the MPG_City variable (miles per gallon in the city) based on the engine size. Because we will be plotting curves, the following statements sort the data by the EngineSize variable. Then the OUTDESIGN= option on the PROC GLMSELECT statement writes the spline effects to the Splines data set. For this example, I am using restricted cubic splines and four evenly spaced internal knots, but the same ideas apply to any choice of spline effects.

/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures: GLIMMIX, GLMSELECT, LOGISTIC, PHREG, ... */
title "Restricted TPF Splines";
title2 "Four Internal Knots";
proc glmselect data=cars outdesign(addinputvars fullmodel)=Splines; /* data set contains spline effects */
   effect spl = spline(EngineSize / details       /* define spline effects */
                naturalcubic basis=tpf(noint)     /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));             /* 4 evenly spaced interior knots */
   model mpg_city = spl / selection=none;         /* fit model by using spline effects */
   ods select ParameterEstimates SplineKnots;
   ods output ParameterEstimates=PE;
quit;

The SplineKnots table shows the locations of the internal knots. There are four equally spaced knots because the procedure used the KNOTMETHOD=EQUAL(4) option. The ParameterEstimates table shows estimates for the regression coefficients for the spline effects, which are named "Spl 1", "Spl 2", and so forth. In the Splines data set, the corresponding variables are named Spl_1, Spl_2, and so forth.

But what do these spline effects look like? The following statements plot the spline effects versus the EngineSize variable, which is the variable from which the effects are generated:

proc sgplot data=Splines;
   series x=EngineSize y=Intercept / curvelabel;
   series x=EngineSize y=spl_1 / curvelabel;
   series x=EngineSize y=spl_2 / curvelabel;
   series x=EngineSize y=spl_3 / curvelabel;
   refline 2.7 4.1 5.5 / axis=x lineattrs=(color="lightgray");
   refline 6.9 / axis=x label="upper knot" labelloc=inside lineattrs=(color="lightgray");
   yaxis label="Spline Effect";
run;

As stated in the documentation for the NATURALCUBIC option, these spline effects include "an intercept, the polynomial X, and n – 2 functions that are all linear beyond the largest knot," where n is the number of knots. This example uses n=4 knots, so Spl_2 and Spl_3 are the cubic splines. You will also see different spline effects if you change to one of the other supported spline methods, such as B-splines or the truncated power functions. Try it!
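For example, a sketch of one variation: if you replace the EFFECT statement in the earlier PROC GLMSELECT call with the following statement, you get a cubic B-spline basis instead of natural cubic splines (the rest of the call is unchanged):

   effect spl = spline(EngineSize / details
                basis=bspline degree=3        /* cubic B-spline basis */
                knotmethod=equal(4));         /* 4 evenly spaced interior knots */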

The graph shows that the natural cubic splines are reminiscent of polynomial effects, but there are a few differences:

  • The spline effects (spl_2 and spl_3) are shifted away from the origin. The spl_2 effect is shifted by 2.7 units, which is the location of the first internal knot. The spl_3 effect is shifted by 4.1 units, which is the location of the second internal knot.
  • The spline effects are 0 when EngineSize is less than the first knot position (2.7). Not all splines look like this, but these effects are based on truncated power functions (the TPF option).
  • The spline effects are linear when EngineSize is greater than the last knot position (6.9). Not all splines look like this, but these effects are restricted splines.

Predicted values are linear combinations of the spline effects

Visualizing the shapes of the spline effects enables you to make sense of the ParameterEstimates table. As in all linear regression, the predicted value is a linear combination of the design variables. In this case, the predicted values are formed by
Pred = 34.96 - 5*Spl_1 + 2.2*Spl_2 - 3.9*Spl_3
You can use a SAS DATA step or PROC IML to compute that linear combination of the spline effects and graph the predicted curve with the locations of the knots overlaid.
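The following statements are a sketch of the DATA step approach. The coefficients are the rounded estimates shown above; for exact values, read them from the PE data set that the earlier PROC GLMSELECT call created:

data Pred;
set Splines;      /* contains the spline design columns plus the input variables */
Pred = 34.96 - 5*Spl_1 + 2.2*Spl_2 - 3.9*Spl_3;
run;
 
proc sgplot data=Pred noautolegend;
   series x=EngineSize y=Pred;
   refline 2.7 4.1 5.5 6.9 / axis=x lineattrs=(color="lightgray");
   yaxis label="Predicted MPG (City)";
run;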

The coefficient for Spl_1 is the largest effect (after the intercept). In the graph, you can see the general trend has an approximate slope of -5. The coefficient for the Spl_2 effect is 2.2, and you can see that the predicted values change slope between the second and third knots due to adding the Spl_2 effect. Without the Spl_3 effect, the predicted values would continue to rise after the third knot, but by adding in a negative multiple of Spl_3, the predicted values turn down again after the third knot.

Notice that the prediction function for the restricted cubic spline regression is linear before the first knot and after the last knot. The prediction function models nonlinear relationships between the interior knots.

Summary

In summary, the EFFECT statement in SAS regression procedures can generate spline effects for a continuous explanatory variable. The EFFECT statement supports many different types of splines. This article gives an example of using natural cubic splines (also called restricted cubic splines), which are based on the truncated power function (TPF) splines of degree 3. By outputting the spline effects to a data set and graphing them, you can get a better understanding of the meaning of the estimates of the regression coefficients. The predicted values are a linear combination of the spline effects, so the magnitude and sign of the regression coefficients indicate how the spline effects combine to predict the response.

The post Visualize a regression with splines appeared first on The DO Loop.

October 14, 2019
 

I recently wrote about how to use PROC TTEST in SAS/STAT software to compute the geometric mean and related statistics. This prompted a SAS programmer to ask a related question. Suppose you have dozens (or hundreds) of variables and you want to compute the geometric mean of each. What is the best way to obtain these geometric means?

As I mentioned in the previous post, the SAS/IML language supports the GEOMEAN function, so you can compute the geometric means by iterating over each column in a data matrix. If you do not have SAS/IML software, you can use PROC UNIVARIATE in Base SAS. The UNIVARIATE procedure supports the OUTTABLE= option, which creates a SAS data set that contains many univariate statistics, including the geometric mean.
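If you do have SAS/IML, a minimal sketch of that column-by-column approach follows. It assumes the numeric variables are positive and drops missing values before each computation; it uses the Sashelp.Cars data that the next section analyzes:

proc iml;
use Sashelp.Cars;
read all var _NUM_ into X[colname=varNames];  /* read all numeric variables */
close;
 
GM = j(1, ncol(X), .);           /* allocate a row vector for the results */
do j = 1 to ncol(X);
   v = X[ ,j];
   v = v[loc(v ^= .)];           /* drop missing values before computing */
   GM[j] = geomean(v);           /* geometric mean of the j_th column */
end;
print GM[colname=varNames];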

For example, suppose you want to compute the geometric means for all numeric variables in the Sashelp.Cars data set. You can use the OUTTABLE= option to write the output statistics to a data set and then print only the column that contains the geometric mean, as follows:

proc univariate data=Sashelp.Cars outtable=DescStats noprint;
   var _NUMERIC_;
run;
 
proc print data=DescStats noobs;
   var _var_ _GEOMEAN_;
run;

This method also works if your data contain a classification variable and you want to compute the geometric mean for each level of the classification variable. For example, the following statements compute the geometric means for two variables for each level of the Origin variable, which has the values "Asia", "Europe", and "USA":

proc univariate data=Sashelp.Cars outtable=DescStatsClass noprint;
   class Origin;
   var MPG_City Horsepower;
run;
 
proc print data=DescStatsClass noobs;
   var _var_ Origin _GEOMEAN_;
run;

In summary, if you want to use Base SAS to compute the geometric mean (or any of almost 50 other descriptive statistics) for many variables, use the OUTTABLE= option of PROC UNIVARIATE.

The post Compute the geometric mean for many variables in SAS appeared first on The DO Loop.

October 9, 2019
 

In a previous article, I mentioned that the VLINE statement in PROC SGPLOT is an easy way to graph the mean response at a set of discrete time points. I mentioned that you can choose three options for the length of the "error bars": the standard deviation of the data, the standard error of the mean, or a confidence interval for the mean. This article explains and compares these three options. Which one you choose depends on what information you want to convey to your audience. As I will show, some of the statistics are easier to interpret than others. At the end of this article, I tell you which statistic I recommend.

Sample data

The following DATA step simulates data at four time points. The data at each time point are normally distributed, but the mean, standard deviation, and sample size of the data vary for each time point.

data Sim;
label t = "Time";
array mu[4]    _temporary_ (80 78 78 79); /* mean */
array sigma[4] _temporary_ ( 1  2  2  3); /* std dev */
array N[4]     _temporary_ (36 32 28 25); /* sample size */
call streaminit(12345);
do t = 1 to dim(mu);
   do i = 1 to N[t];
      y = rand("Normal", mu[t], sigma[t]); /* Y ~ N(mu[i], sigma[i]) */
      output;
   end;
end;
run;
 
title "Response by Time";
ods graphics / width=400px height=250px;
proc sgplot data=Sim;
   vbox y / category=t connect=mean;
run;
[Figure: Box plots of the data at four time points]

The box plot shows the schematic distribution of the data at each time point. The boxes use the interquartile range and whiskers to indicate the spread of the data. A line connects the means of the responses at each time point.

A box plot might not be appropriate if your audience is not statistically savvy. A simpler display is a plot of the mean for each time point and error bars that indicate the variation in the data. But what statistic should you use for the heights of the error bars? What is the best way to show the variation in the response variable?

Relationships between sample standard deviation, SEM, and CLM

Before I show how to plot and interpret the various error bars, I want to review the relationships between the sample standard deviation, the standard error of the mean (SEM), and the (half) width of the confidence interval for the mean (CLM). These statistics are all based on the sample standard deviation (SD). The SEM and width of the CLM are multiples of the standard deviation, where the multiplier depends on the sample size:

  • The SEM equals SD / sqrt(N). That is, the standard error of the mean is the standard deviation divided by the square root of the sample size.
  • The width of the CLM is a multiple of the SEM. For large samples, the multiplier for a 95% confidence interval is approximately 1.96. In general, suppose the significance level is α and you are interested in 100(1-α)% confidence limits. Then the multiplier is a quantile of the t distribution with N-1 degrees of freedom, often denoted by t_{1-α/2, N-1}.

You can use PROC MEANS and a short DATA step to display the relevant statistics that show how these three statistics are related:

/* Optional: Compute SD, SEM, and half-width of CLM (not needed for plotting) */
proc means data=Sim noprint;
   class t;
   var y;
   output out=MeanOut N=N stderr=SEM stddev=SD lclm=LCLM uclm=UCLM;
run;
 
data Summary;
set MeanOut(where=(t^=.));
CLMWidth = (UCLM-LCLM)/2;   /* half-width of CLM interval */
run;
 
proc print data=Summary noobs label;
   format SD SEM CLMWidth 6.3;
   var T SD N SEM CLMWidth;
run;
[Table: Relationships between SD, SEM, and the half-width of the CLM]

The table shows the standard deviation (SD) and the sample size (N) for each time point. The SEM column is equal to SD / sqrt(N). The CLMWidth value is a little more than twice the SEM value. (The multiplier depends on N; for these data, it ranges from 2.03 to 2.06.)
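You can verify the multipliers directly. The following DATA step is a quick check that computes the quantile t_{1-α/2, N-1} for each of the four sample sizes:

data Multiplier;
alpha = 0.05;
do N = 36, 32, 28, 25;
   t = quantile("t", 1-alpha/2, N-1);   /* multiplier for a 95% CLM */
   output;
end;
run;
 
proc print data=Multiplier noobs; var N t; run;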

As shown in the next section, the values in the SD, SEM, and CLMWidth columns are the lengths of the error bars when you use the STDDEV, STDERR, and CLM options (respectively) to the LIMITSTAT= option on the VLINE statement in PROC SGPLOT.

Visualize and interpret the choices of error bars

Let's plot all three options for the error bars on the same scale, then discuss how to interpret each graph. Several interpretations use the 68-95-99.7 rule for normally distributed data. The following statements create the three line plots with error bars:

%macro PlotMeanAndVariation(limitstat=, label=);
   title "VLINE Statement: LIMITSTAT = &limitstat";
   proc sgplot data=Sim noautolegend;
      vline t / response=y stat=mean limitstat=&limitstat markers;
      yaxis label="&label" values=(75 to 82) grid;
   run;
%mend;
 
title "Mean Response by Time Point";
%PlotMeanAndVariation(limitstat=STDDEV, label=Mean +/- Std Dev);
%PlotMeanAndVariation(limitstat=STDERR, label=Mean +/- SEM);
%PlotMeanAndVariation(limitstat=CLM,    label=Mean and CLM);
[Figure: Error bars for the means based on the standard deviation]

Use the standard deviations for the error bars

In the first graph, the length of the error bars is the standard deviation at each time point. This is the easiest graph to explain because the standard deviation is directly related to the data. The standard deviation is a measure of the variation in the data. If the data at each time point are normally distributed, then (1) about 68% of the data have values within the extent of the error bars, and (2) almost all the data lie within three times the extent of the error bars.

The main advantage of this graph is that a "standard deviation" is a term that is familiar to a lay audience. The disadvantage is that the graph does not display the accuracy of the mean computation. For that, you need one of the other statistics.

[Figure: Error bars for the means based on the standard error of the mean]

Use the standard error for the error bars

In the second graph, the length of the error bars is the standard error of the mean (SEM). This is harder to explain to a lay audience because it is an inferential statistic. A qualitative explanation is that the SEM shows the accuracy of the mean computation. Small SEMs imply better accuracy than larger SEMs.

A quantitative explanation requires using advanced concepts such as "the sampling distribution of the statistic" and "repeating the experiment many times." For the record, the SEM is an estimate of the standard deviation of the sampling distribution of the mean. Recall that the sampling distribution of the mean can be understood in terms of repeatedly drawing random samples from the population and computing the mean for each sample. The standard error is defined as the standard deviation of the distribution of the sample means.
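A short simulation is a sketch that makes this concrete. It draws 1,000 samples of size 36 from N(80, 1) (the same parameters as the first time point), computes the mean of each sample, and then computes the standard deviation of those 1,000 means. The result should be close to the theoretical SEM, which is 1/sqrt(36) ≈ 0.167:

data SampDist;
call streaminit(54321);
do SampleID = 1 to 1000;
   do i = 1 to 36;
      y = rand("Normal", 80, 1);
      output;
   end;
end;
run;
 
proc means data=SampDist noprint;
   by SampleID;
   var y;
   output out=SampleMeans mean=SampleMean;
run;
 
proc means data=SampleMeans N mean std;
   var SampleMean;    /* the STD of the sample means estimates the SEM */
run;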

The exact meaning of the SEM might be difficult to explain to a lay audience, but the qualitative explanation is often sufficient.

[Figure: Error bars for the means based on 95% confidence intervals]

Use a confidence interval of the mean for the error bars

In the third graph, the length of the error bars is a 95% confidence interval for the mean. This graph also displays the accuracy of the mean, but these intervals are about twice as long as the intervals for the SEM.

The confidence interval for the mean is hard to explain to a lay audience. Many people incorrectly think that "there is a 95% chance that the population mean is in this interval." That statement is wrong: either the population mean is in the interval or it isn't. There is no probability involved! The words "95% confidence" refer to repeating the experiment many times on random samples and computing a confidence interval for each sample. The true population mean will be in about 95% of the confidence intervals.

Conclusions

In summary, there are three common statistics that are used to overlay error bars on a line plot of the mean: the standard deviation of the data, the standard error of the mean, and a 95% confidence interval for the mean. The error bars convey the variation in the data and the accuracy of the mean estimate. Which one you use depends on the sophistication of your audience and the message that you are trying to convey.

My recommendation? Despite the fact that confidence intervals can be misinterpreted, I think that the CLM is the best choice for the size of the error bars (the third graph). If I am presenting to a statistical audience, the audience understands the CLMs. For a less sophisticated audience, I do not dwell on the probabilistic interpretation of the CLM but merely say that the error bars "indicate the accuracy of the mean."

As explained previously, each choice has advantages and disadvantages. What choice do you make and why? You can share your thoughts by leaving a comment.

The post What statistic should you use to display error bars for a mean? appeared first on The DO Loop.

October 7, 2019
 

It is always great to read an old paper or blog post and think, "This task is so much easier in SAS 9.4!" I had that thought recently when I stumbled on a 2007 paper by Wei Cheng titled "Graphical Representation of Mean Measurement over Time." A substantial portion of the eight-page paper is SAS code for creating a graph of the mean responses over time for patients in two arms of a clinical trial. (An arm is a group of participants who receive an intervention or who receive no intervention, such as an experimental group and the control group.)

The graph to the right is a modern version of one graph that Cheng created. This graph is created by using PROC SGPLOT. This article shows how to create this and other graphs that visualize the mean response by time for groups in a clinical trial.

This article assumes that the data are measured at discrete time points. If time is a continuous variable, you can model the mean response by using a regression model, and you can use the EFFECTPLOT statement to graph the predicted mean response versus time.
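For continuous time, a minimal sketch (the data set name and the quadratic model are hypothetical, not from Cheng's paper) fits a regression model, stores it, and uses the EFFECTPLOT statement in PROC PLM:

proc glm data=Have;               /* 'Have' is a hypothetical data set */
   model y = Time Time*Time;      /* model the mean response as quadratic in time */
   store out=RegModel;            /* save the fitted model as an item store */
quit;
 
proc plm restore=RegModel;
   effectplot fit(x=Time) / clm;  /* predicted mean response vs. time, with confidence band */
run;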

Sample clinical data

Cheng did not include his sample data, but the following DATA step defines fake data for 11 patients, five in one arm and six in the other. The data produce graphs that are similar to the graphs in Cheng's paper.

data study;
input Armcd $ SubjID $ y1-y5;                /* read data in wide form */
label VisitNum = 'Visit' Armcd = "Treatment";
VisitNum=1; y=y1; output;                    /* immediately transform data to long form */
VisitNum=2; y=y2; output;
VisitNum=3; y=y3; output;
VisitNum=4; y=y4; output;
VisitNum=5; y=y5; output;
drop y1-y5;
datalines;
A 001 135 138 135 134  .
A 002 142 140 141 139 138
A 003 140 137 136 135 133
A 004 131 131 130 131 130
A 005 128 125  .  121 121
B 006 125 120 115 110 105
B 007 139 134 128 128 122
B 008 136 129 126 120 111
B 009 128 125 127 133 136
B 010 120 114 112 110  96
B 011 129 122 120 119  .
;

Use the VLINE statement for mean and variation

The VLINE statement in PROC SGPLOT can summarize data across groups. When you use the RESPONSE= and STAT= options, it can display the mean, median, count, or percentage of a response variable. You can add "error bars" to the graph by using the LIMITSTAT= option. Following Cheng, the error bars indicate the standard error of the mean (SEM). The following statements create the line plot shown at the top of this article:

/* simplest way to visualize means over time for each group */
title "Mean Response by Arm";
proc sgplot data=study;
   vline VisitNum / response=y group=Armcd stat=mean limitstat=stderr;
   yaxis label='Mean +/- SEM';
run;

That was easy! Notice that the VLINE statement computes the mean and standard error for Y for each value of VisitNum and Armcd variables.

This graph shows the standard error of the mean, but you could also show confidence limits for the mean (LIMITSTAT=CLM) or indicate the extent of one or more standard deviations (LIMITSTAT=STDDEV and use the NUMSTD= option).
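For example, the following variation of the VLINE statement (a sketch; only the options change) displays the extent of two standard deviations:

   vline VisitNum / response=y group=Armcd stat=mean
                    limitstat=stddev numstd=2;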

An alternative plot: Box plots by time

Cheng's graph is appropriate when the intended audience for the graph includes people who might not be experts in statistics. For a more sophisticated audience, you could create a series of box plots and connect the means of the box plots. In this plot, the CATEGORY= option is used to specify the time-like variable and the GROUP= option is used to specify the arms of the study. (Learn about the difference between categories and groups in box plots.)

/* box plots connected by means */
title "Response by Arm";
proc sgplot data=study;
   vbox y / category=VisitNum group=Armcd groupdisplay=cluster 
            connect=mean clusterwidth=0.35;
run;

Whereas the first graph emphasizes the mean value of the responses, the box plot emphasizes the individual responses. The mean responses are connected by lines. The boxes show the interquartile range (Q1 and Q3) as well as the median response. Whiskers and outliers indicate the spread of the data.

Graph summarized statistics

In the previous sections, the VLINE and VBOX statements automatically summarized the data for each time point and for each arm of the study. This is very convenient, but the SGPLOT statements support only a limited number of statistics such as the mean and median. For more control over the statistics, you can use PROC MEANS or PROC UNIVARIATE to summarize the data and then use the SERIES statement to plot the statistics and (optionally) use the SCATTER statement to plot error bars for the statistic.

PROC MEANS supports dozens of descriptive statistics, but, for easy comparison, I will show how to create the same graph by using summarized data. The following call to PROC MEANS creates an output data set that contains statistics for each visit/arm combination.

proc means data=study N mean stderr stddev lclm uclm NDEC=2;
   class Armcd VisitNum;
   var y;
   output out=MeanOut N=N mean=Mean stderr=SEM stddev=SD lclm=LCLM uclm=UCLM;
run;

The output data set (MeanOut) contains all the information in the table, plus additional "marginal" information that summarizes the means across all arms (for each visit), across all visits (for each arm), and for the entire study. When you use the MeanOut data set, you should use a WHERE clause to specify which information you want to analyze. For this example, we want only the information for the Armcd/VisitNum combinations. You can run a simple DATA step to subset the output and to create variables for the values Mean +/- SEM, as follows:

/* compute lower/upper bounds as Mean +/- SEM */
data Summary;
set MeanOut(where=(Armcd^=" " & VisitNum^=.));
LowerSEM = Mean - SEM;
UpperSEM = Mean + SEM;
run;
 
/* create a graph of summary statistics that is similar to the VLINE graph */
title2 "Presummarized Data";
proc sgplot data=Summary;
series  x=VisitNum y=Mean / group=Armcd;
scatter x=VisitNum y=Mean / group=Armcd
        yerrorlower=LowerSEM yerrorupper=UpperSEM;
run;

You can use this technique to create graphs of other statistics versus time.

Adding tabular information to a mean-versus-time graph

You can augment a mean-versus-time graph by adding additional information about the study at each time point. In Cheng's paper, much of the code was devoted to adding information about the number of patients that were measured at each time point.

In SAS 9.4, you can use the XAXISTABLE statement to add one or more rows of information to a graph. The output from PROC MEANS includes a variable named N, which gives the number of nonmissing measurements at each time. The following statements add information about the number of patients. The CLASS= option subsets the counts by the arm, and the COLORGROUP= option displays the text in the group colors.

title2 "Table with Participant Counts";
proc sgplot data=Summary;
series  x=VisitNum y=Mean / group=Armcd;
scatter x=VisitNum y=Mean / group=Armcd
        yerrorlower=LowerSEM yerrorupper=UpperSEM;
xaxistable N / location=inside class=Armcd colorgroup=Armcd
               title="Number of Patients" 
               valueattrs=(size=10) labelattrs=(size=10);
yaxis label='mean +/- SEM';
run;

In summary, SAS 9.4 makes it easy to graph the mean response versus time for various arms of a clinical study. Cheng wrote his paper in 2007 using SAS 9.1.3, but there have been TONS of additions to the ODS Statistical Graphics system since then. This article shows that you can let PROC SGPLOT summarize the data and plot it by using the VLINE statement or the VBOX statement. Or you can summarize the data yourself and plot it by using the SERIES and SCATTER statements. For the summarized data, you can overlay tables of statistics such as the number of patients at each time point. Whichever method you choose, the SGPLOT procedure makes it easy to create the graphs of statistics versus time.

The post Graph the mean response versus time in SAS appeared first on The DO Loop.

October 2, 2019
 

I frequently see questions on SAS discussion forums about how to compute the geometric mean and related quantities in SAS. Unfortunately, the answers to these questions are sometimes confusing or even wrong. In addition, some published papers and web sites that claim to show how to calculate the geometric mean in SAS contain wrong or misleading information.

This article shows how to compute the geometric mean, the geometric standard deviation, and the geometric coefficient of variation in SAS. It first shows how to use PROC TTEST to compute the geometric mean and the geometric coefficient of variation. It then shows how to compute several geometric statistics in the SAS/IML language.

For an introduction to the geometric mean, see "What is a geometric mean." For information about the (arithmetic) coefficient of variation (CV) and its applications, see the article "What is the coefficient of variation?"

Compute the geometric mean and geometric CV in SAS

As discussed in my previous article, the geometric mean arises naturally when positive numbers are being multiplied and you want to find the average multiplier. Although the geometric mean can be used to estimate the "center" of any set of positive numbers, it is frequently used to estimate average values in a set of ratios or to compute an average growth rate.

The TTEST procedure is the easiest way to compute the geometric mean (GM) and geometric CV (GCV) of positive data. To demonstrate this, the following DATA step simulates 100 random observations from a lognormal distribution. PROC SGPLOT shows a histogram of the data and overlays a vertical line at the location of the geometric mean.

%let N = 100;
data Have;
call streaminit(12345);
do i = 1 to &N;
   x = round( rand("LogNormal", 3, 0.8), 0.1);    /* generate positive values */
   output;
end;
run;
 
title "Geometric Mean of Skewed Positive Data";
proc sgplot data=Have;
   histogram x / binwidth=10 binstart=5 showbins;
   refline 20.2 / axis=x label="Geometric/Mean" splitchar="/" labelloc=inside
                  lineattrs=GraphData2(thickness=3);
   xaxis values=(0 to 140 by 10);
   yaxis offsetmax=0.1;
run;

Where is the "center" of these data? That depends on your definition. The mode of this skewed distribution is close to x=15, but the arithmetic mean is about 26.4. The mean is pulled upwards by the long right tail. It is a mathematical fact that the geometric mean of data is always less than the arithmetic mean. For these data, the geometric mean is 20.2.

To compute the geometric mean and geometric CV, you can use the DIST=LOGNORMAL option on the PROC TTEST statement, as follows:

proc ttest data=Have dist=lognormal; 
   var x;
   ods select ConfLimits;
run;

The geometric mean, which is 20.2 for these data, estimates the "center" of the data. Notice that the procedure does not report the geometric standard deviation (or variance), but instead reports the geometric coefficient of variation (GCV), which has the value 0.887 for this example. The documentation for the TTEST procedure explains why the GCV is the better measure of variation: "For lognormal data, the CV is the natural measure of variability (rather than the standard deviation) because the CV is invariant to multiplication of [the data] by a constant."

You might wonder whether data need to be lognormally distributed to use this table. The answer is that the data do not need to be lognormally distributed to use the geometric mean and geometric CV. However, the 95% confidence intervals for these quantities assume log-normality.

Definitions of geometric statistics

As T. Kirkwood points out in a letter to the editors of Biometrics (Kirkwood, 1979), if data are lognormally distributed as LN(μ, σ), then

  • The quantity GM = exp(μ) is the geometric mean. It is estimated from a sample by the quantity exp(m), where m is the arithmetic mean of the log-transformed data.
  • The quantity GSD = exp(σ) is defined to be the geometric standard deviation. The sample estimate is exp(s), where s is the standard deviation of the log-transformed data.
  • The geometric standard error (GSE) is defined by exponentiating the standard error of the mean of the log-transformed data. Geometric confidence intervals are handled similarly.
  • Kirkwood's proposal for the geometric coefficient of variation (GCV) is not generally used. Instead, the accepted definition of the GCV is GCV = sqrt(exp(σ^2) - 1), which is the definition that is used in SAS. The estimate for the GCV is sqrt(exp(s^2) - 1).

You can use these formulas to compute the geometric statistics for any positive data. However, only for lognormal data do the statistics have a solid theoretical basis: transform to normality, compute a statistic, apply the inverse transform.

Compute the geometric mean in SAS/IML

You can use the SAS/IML language to compute the geometric mean and other "geometric statistics" such as the geometric standard deviation and the geometric CV. The GEOMEAN function is a built-in SAS/IML function, but the other statistics are implemented by explicitly computing statistics of the log-transformed data, as described in the previous section:

proc iml;
use Have; read all var "x"; close;  /* read in positive data */
GM = geomean(x);               /* built-in GEOMEAN function */
print GM;
 
/* To estimate the geometric mean and geometric StdDev, compute
   arithmetic estimates of log(X), then EXP transform the results. */
n = nrow(x);
z = log(x);                  /* log-transformed data */
m = mean(z);                 /* arithmetic mean of log(X) */
s = std(z);                  /* arithmetic std dev of log(X) */
GM2 = exp(m);                /* same answer as GEOMEAN function */
GSD = exp(s);                /* geometric std dev */
GCV = sqrt(exp(s**2) - 1);   /* geometric CV */
print GM2 GSD GCV;

Note that the GM and GCV match the output from PROC TTEST.

What does the geometric standard deviation mean? As for the arithmetic mean, you need to start by thinking about the location of the geometric mean (20.2). If the data are normally distributed, then about 68% of the data are within one standard deviation of the mean, which is the interval [m-s, m+s]. For lognormal data, about 68% of the data should be in the interval [GM/GSD, GM*GSD] and, in fact, 65 out of 100 of the simulated observations are in that interval. Similarly, about 95% of lognormal data should be in the interval [GM/GSD^2, GM*GSD^2]. For the simulated data, 94 out of 100 observations are in that interval, as the following computation shows:
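Continuing the PROC IML session from the previous section, the following statements are a sketch of that computation; they count how many of the 100 observations fall within each interval:

in1 = (x >= GM/GSD    & x <= GM*GSD);      /* within one geometric SD */
in2 = (x >= GM/GSD**2 & x <= GM*GSD**2);   /* within two geometric SDs */
print (sum(in1))[label="Num within 1 GSD"] (sum(in2))[label="Num within 2 GSDs"];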

I am not aware of a similar interpretation of the geometric coefficient of variation. The GCV is usually used to compare two samples. As opposed to the confidence intervals in the previous paragraph, the GCV does not make any reference to the geometric mean of the data.

Other ways to compute the geometric mean

The methods in this article are the simplest ways to compute the geometric mean in SAS, but there are other ways.

  • You can use the DATA step to log-transform the data, use PROC MEANS to compute the descriptive statistics of the log-transformed data, then use the DATA step to exponentiate the results, as sketched after this list.
  • PROC SURVEYMEANS can compute the geometric mean (with confidence intervals) and the standard error of the geometric mean for survey responses. However, the variance of survey data is not the same as the variance of a random sample, so you should not use the standard error statistic unless you have survey data.
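For the first bullet, a minimal sketch (using the Have data set and the variable x from earlier in this article) is:

data LogX;
set Have;
z = log(x);                /* log-transform the positive data */
run;
 
proc means data=LogX noprint;
   var z;
   output out=LogStats mean=m stddev=s;
run;
 
data GeoStats;
set LogStats;
GM  = exp(m);              /* geometric mean */
GSD = exp(s);              /* geometric standard deviation */
GCV = sqrt(exp(s**2)-1);   /* geometric CV */
run;
 
proc print data=GeoStats noobs;
   var GM GSD GCV;
run;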

As I said earlier, there is some bad information out there on the internet about this topic, so beware. A site that seems to get all the formulas correct and present the information in a reasonable way is Alex Kritchevsky's blog.

You can download the complete SAS program that I used to compute the GM, GSD, and GCV. The program also shows how to compute confidence intervals for these quantities.

The post Compute the geometric mean, geometric standard deviation, and geometric CV in SAS appeared first on The DO Loop.

September 30, 2019
 

There are several different kinds of means. They all try to find an average value from among a set of numbers. Although the most popular mean is the arithmetic mean, the geometric mean can be useful for problems in statistics, finance, and biology. A common application of the geometric mean is to find an average growth rate for an asset or for a population.

What is the geometric mean?

The geometric mean of n nonnegative numbers is the nth root of the product of the numbers:
GM = (x1 * x2 * ... * xn)^(1/n) = (Π xi)^(1/n)
When the numbers are all positive, the geometric mean is equivalent to computing the arithmetic mean of the log-transformed data and then using the exponential function to back-transform the result: GM = exp( (1/n) Σ log(xi) ).

Physical interpretation of the geometric mean

The geometric mean has a useful interpretation in terms of the volume of an n-dimensional rectangular solid. If GM is the geometric mean of n positive numbers, then GM is the length of the side of an n-dimensional cube that has the same volume as the rectangular solid with side lengths x1, x2, ..., xn. For example, a rectangular solid with sides 1.5, 2, and 3 has a volume V = 9. The geometric mean of those three numbers is ((1.5)(2)(3))^(1/3) ≈ 2.08. The volume of a cube with sides of length 2.08 is approximately 9.
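You can check this computation by using the GEOMEAN function in Base SAS (a quick verification, with the side lengths hard-coded):

data _null_;
GM = geomean(1.5, 2, 3);   /* geometric mean of the side lengths: about 2.08 */
Vol = GM**3;               /* volume of the cube: equals 9 */
put GM= Vol=;
run;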

This interpretation in terms of volume is analogous to interpreting the arithmetic mean in terms of length. Namely, if you have n line segments of lengths x1, x2, ..., xn, then the total length of the segments is the same as the length of n copies of a segment of length AM, where AM is the arithmetic mean.

What is the geometric mean good for?

The geometric mean can be used to estimate the "center" of any set of positive numbers but is frequently used to estimate an average value in problems that deal with growth rates or ratios. For example, the geometric mean is an average growth rate for an asset or for a population. The following example uses the language of finance, although you can replace "initial investment" by "initial population" if you are interested in population growth.

In precalculus, growth rates are introduced in terms of the growth of an initial investment amount (the principal, P) compounded yearly at a fixed interest rate, r. After n years, the investment is worth
A_n = P(1 + r)^n
The quantity 1 + r sometimes confuses students. It appears because when you add the principal (P) and the interest (Pr), you get P(1 + r).

In precalculus, the interest rate is assumed to be a constant, which is fine for fixed-rate investments like bank CDs. However, many investments have a growth rate that is not fixed but varies from year to year. If the growth rate of the investment is r1 during the first year, r2 during the second year, and rn during the nth year, then after n years the investment is worth
A_n = P(1 + r1)(1 + r2)...(1 + rn)
  = P (Π xi), where xi = 1 + ri.

What is the average growth rate for the investment? One interpretation of the average growth rate is the fixed rate that would give the same return after n years. That hypothetical fixed-rate growth is found by using the geometric mean of the values x1, x2, ..., xn. That is, if GM is the geometric mean of the xi, the value
A_n = P*GM^n,
which assumes a fixed interest rate, is exactly the same as for the varying-rate computation.

An example of the geometric mean: The growth rate of gold

Let's apply the ideas in the preceding section. Suppose that you bought $1000 of gold on Jan 1, 2010. The following table gives the yearly rate of return for gold during the years 2010–2018, along with the value of the $1000 investment at the end of each year.

According to the table, the value of the investment after 9 years is $1160.91, which represents a total return of about 16%. What is the fixed-rate that would give the same return after 9 years when compounded annually? That is found by computing the geometric mean of the numbers in the third column:
GM = (1.2774 * 1.1165 * ... * 0.9885)^(1/9) = 1.01672
In other words, the investment in gold yielded the same return as a fixed-rate bank CD at 1.672% that is compounded yearly for 9 years. The end-of-year values for both investments are shown in the following graph.

The geometric mean in statistics

The geometric mean arises naturally in situations in which quantities are multiplied together. This happens so often that there is a probability distribution, called the lognormal distribution, that models this situation. If Z has a normal distribution, then you can obtain a lognormal distribution by applying the exponential transformation: X = exp(Z) is lognormal. The Wikipedia article for the lognormal distribution states, "The lognormal distribution is important in the description of natural phenomena... because many natural growth processes are driven by the accumulation of many small percentage changes." For lognormal data, the geometric mean is often more useful than the arithmetic mean. In my next blog post, I will show how to compute the geometric mean and other associated statistics in SAS.

The post What is a geometric mean? appeared first on The DO Loop.

September 25, 2019
 

One of the strengths of the SAS/IML language is its flexibility. Recently, a SAS programmer asked how to generalize a program in a previous article. The original program solved one optimization problem. The reader said that she wants to solve this type of problem 300 times, each time using a different set of parameters. Essentially, she wants to loop over a set of parameter values, solve the corresponding optimization problem, and save each solution to a data set.

Yes, you can do this in SAS/IML, and the technique is not limited to solving a series of optimization problems. Any time you have a parameterized family of problems, you can implement this idea. First, you figure out how to solve the problem for one set of parameters, then you can loop over many sets of parameters and solve each problem in turn.

Solve the problem one time

As I say in my article, "Ten tips before you run an optimization," you should always develop and debug the solution of ONE problem before you attempt to solve MANY problems. This section describes the problem and solves it for one set of parameters.

For this article, the goal is to find the values (x1, x2) that maximize a function of two variables:
F(x1, x2; a, b) = 1 - (x1 - a)^2 - (x2 - b)^2 + exp(-(x1^2 + x2^2)/2)
The function has two parameters, (a, b), which are not involved in the optimization. The function is a sum of two terms: a quadratic function and an exponentially decreasing function that looks like the bivariate normal density.

Because the exponential term rapidly approaches zero when (x1, x2) moves away from the origin, this function is a perturbation of a quadratic function. The maximum will occur somewhat close to the value (x1, x2) = (a, b), which is the value for which the quadratic term is maximal. The following graph shows a contour plot for the function when (a, b) = (-1, 2). The maximum value occurs at approximately (x1, x2) = (-0.95, 1.9).

The following program defines the function in the SAS/IML language. In the definition, the parameters (a, b) are put in the GLOBAL statement because they are constant during each optimization. The SAS/IML language provides several nonlinear optimization routines. This program uses the Newton-Raphson optimizer (the NLPNRA subroutine) to solve the problem for (a, b) = (-1, 2).

proc iml;
start Func(x) global (a, b);
   return 1 - (x[,1] - a)##2 - (x[,2] - b)##2 +
            exp(-0.5*(x[,1]##2 + x[,2]##2));
finish;
 
a = -1; b = 2;      /* set GLOBAL parameters */
/* test functions for one set of parameters */
opt = {1,           /* find maximum of function   */
       2};          /* print a little bit of output */
x0 = {0 0};         /* initial guess for solution */
call nlpnra(rc, result, "Func", x0, opt);    /* find maximal solution */
print result[c={"x1" "x2"}];

As claimed earlier, when (a, b) = (-1, 2), the maximum value of the function occurs at approximately (x1, x2) = (-0.95, 1.9).
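If you want to verify that the optimizer found a critical point, you can use the NLPFDD subroutine to evaluate finite-difference derivatives at the solution. The gradient should be approximately zero:

/* evaluate the function, gradient, and Hessian at the solution */
call nlpfdd(f, grad, hess, "Func", result);
print f, grad;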

Solve the problem for many parameters

After you have successfully solved the problem for one set of parameters, you can iterate over a sequence of parameters. The following DATA step specifies five sets of parameters, but it could specify 300 or 3,000. These parameters are read into a SAS/IML matrix. When you solve a problem many times in a loop, it is a good idea to suppress any output. The SAS/IML program suppresses the tables and iteration history for each optimization step and saves the solution and the convergence status:

/* define a set of parameter values */
data Params;
input a b;
datalines;
-1  2
 0  1
 0  4
 1  0
 3  2
 ;
 
proc iml;
start Func(x) global (a, b);
   return 1 - (x[,1] - a)##2 - (x[,2] - b)##2 +
            exp(-0.5*(x[,1]##2 + x[,2]##2));
finish;
 
/* read parameters into a matrix */
varNames = {'a' 'b'}; 
use Params;  read all var varNames into Parms;  close;
 
/* loop over parameters and solve problem */
opt = {1,          /* find maximum of function   */
       0};         /* no print */
Soln = j(nrow(Parms), 2, .);
returnCode = j(nrow(Parms), 1, .);
do i = 1 to nrow(Parms);
   /* assign GLOBAL variables */
   a = Parms[i, 1];  b = Parms[i, 2]; 
   /* For now, use same guess for every parameter. */
   x0 = {0 0};        /* initial guess for solution */
   call nlpnra(rc, result, "Func", x0, opt);
   returnCode[i] = rc;   /* save convergence status and solution vector */
   Soln[i,] = result;
end;
print Parms[c=varNames] returnCode Soln[c={x1 x2} format=Best8.];

The output shows the parameter values (a,b), the status of the optimization, and the optimal solution for (x1, x2). The third column (returnCode) has the value 3 or 6 for these optimizations. You can look up the exact meaning of the return codes, but the main thing to remember is that a positive return code indicates a successful optimization. A negative return code indicates that the optimization terminated without finding an optimal solution. For this example, all five problems were solved successfully.
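Because the return codes are saved in a vector, you can check the whole batch programmatically. The following statements (a minimal sketch that runs after the loop) flag any optimization that did not converge:

failed = loc(returnCode < 0);        /* indices of failed optimizations, if any */
if ncol(failed) > 0 then
   print "These parameters did not converge:", (Parms[failed,])[c=varNames];
else
   print "All optimizations converged.";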

Choosing an initial guess based on the parameters

If the problem is not solved successfully for a certain parameter value, it might be that the initial guess was not very good. It is often possible to approximate the objective function to obtain a good initial guess for the solution. This not only helps ensure convergence, but it often improves the speed of convergence. If you want to solve 300 problems, having a good guess will speed up the total time.

For this example, the objective functions are perturbations of a quadratic function. It is easy to show that the quadratic function has an optimal solution at (x1, x2) = (a, b), and you can verify from the previous output table that the optimal solutions are close to (a, b). Consequently, if you use (a, b) as an initial guess, rather than a generic value like (0, 0), then each problem will converge in only a few iterations. For this problem, the GetInitialGuess function returns (a, b), but in general the function might compute the guess as a function of the parameters or even solve a simpler set of equations.

/* Choose an initial guess based on the parameters. */
start GetInitialGuess(Parms);
   return Parms;  /* for this problem, the solution is near (x1,x2)=(a,b) */
finish;
 
do i = 1 to nrow(Parms);
   /* assign GLOBAL variables */
   a = Parms[i, 1]; 
   b = Parms[i, 2]; 
   /* Choose an initial guess based on the parameters. */
   x0 = GetInitialGuess(Parms[i,]); /* initial guess for solution */
   call nlpnra(rc, result, "Func", x0, opt);
   returnCode[i] = rc;
   Soln[i,] = result;
end;
print Parms[c=varNames] returnCode Soln[c={x1 x2} format=Best8.];

The solutions and return codes are similar to those in the previous example. Notice that changing the initial guess can change the convergence status (the return code) and can slightly change the solution values.

For this simple problem, choosing a better initial guess provides only a small boost to performance. If you use (0,0) as an initial guess, you can solve 3,000 problems in about 2 seconds. If you use the GetInitialGuess function, it takes about 1.8 seconds to solve the same set of optimizations. For other problems, providing a good initial guess might be more important.
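For the record, here is a sketch of how you might time the batch of optimizations by using the TIME function. The exact timings will vary with hardware and the number of parameter sets:

t0 = time();                      /* time of day, in seconds */
do i = 1 to nrow(Parms);
   a = Parms[i, 1];  b = Parms[i, 2];
   x0 = GetInitialGuess(Parms[i,]);
   call nlpnra(rc, result, "Func", x0, opt);
   returnCode[i] = rc;
   Soln[i,] = result;
end;
elapsedTime = time() - t0;        /* elapsed seconds for the batch */
print elapsedTime;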

In conclusion, the SAS/IML language makes it easy to solve multiple optimization problems, where each problem uses a different set of parameter values. To improve convergence, you can sometimes use the parameter values to compute an initial guess.

The post Solve many optimization problems appeared first on The DO Loop.

9月 232019
 

Although I do not typically blog about undocumented SAS options, I'll make an exception this time. For many years, I have known that the CONTENTS and COMPARE procedures support the BRIEF and SHORT options, but I always forget which option goes with which procedure. For the record, here are the documented options:

The CONTENTS procedure supports the SHORT option, which displays a condensed list of the variables in a data set.
The COMPARE procedure supports the BRIEF option (an alias for BRIEFSUMMARY), which displays a short comparison summary.

SAS provides an autocomplete feature that programmers can use to remind themselves which options are supported in each procedure. Both SAS Studio and SAS Enterprise Guide support the autocomplete feature, which displays a list of options as you type SAS code. Unfortunately, I turn off that feature because I find it distracting. Consequently, I cannot remember whether to use SHORT or BRIEF.

My "solution" to this dilemma is to randomly guess an option, run the program, and fix the error if I guessed wrong. I should be wrong 50% of the time, but I noticed that I am wrong only about 25% of the time. Upon investigating, I discovered that some seemingly "wrong" code actually works. Although it is not documented, it turns out that the COMPARE procedure supports the SHORT option as an alias for BRIEFSUMMARY.

This is awesome! No longer do I need to randomly guess the option. I can use the SHORT option in both PROC CONTENTS and PROC COMPARE! As a bonus, the SHORT option is also supported by PROC OPTIONS and PROC DATASETS.

If you have never used the SHORT option before, it's a great time saver because it produces condensed output, as shown in the following examples. For PROC CONTENTS, I like to use the options VARNUM SHORT when I want to display a list of the variables in a data set:

proc contents data=Sashelp.Cars varnum SHORT;      /* SHORT is a documented option */
run;

For PROC COMPARE, I use the undocumented SHORT option (which is an alias for BRIEF) when I want to display whether two data sets are equal:

data cars; set Sashelp.cars; run;
 
proc compare base=sashelp.cars compare=cars SHORT; /* SHORT is not documented, but it works */
run;

As I mentioned, the SHORT option is also supported in PROC OPTIONS. For completeness, here is an example that writes the values of several options to the SAS log:

proc options option=(linesize pagesize memsize _LAST_) SHORT; run;    /* SHORT is a documented option */
 LINESIZE=75 PAGESIZE=24 MEMSIZE=17179869184 _LAST_=WORK.CARS
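For PROC DATASETS, the SHORT option appears on the CONTENTS statement. As a sketch, the following statements display only the variable names for the Cars data set that was created in the earlier PROC COMPARE example:

proc datasets library=work nolist;
   contents data=cars varnum short;   /* condensed list of variables */
quit;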

In short (see what I did there?), you can use the SHORT option in several Base SAS procedures. If you can't remember that the SHORT option applies to PROC CONTENTS and the BRIEF option applies to PROC COMPARE, just use the SHORT option in both procedures and in PROC OPTIONS, too.

The post Use the SHORT option in Base SAS procedures to reduce output appeared first on The DO Loop.