Statistical Programming

October 16, 2019
 

The EFFECT statement is supported by more than a dozen SAS/STAT regression procedures. Among other things, it enables you to generate spline effects that you can use to fit nonlinear relationships in data. Recently there was a discussion on the SAS Support Communities about how to interpret the parameter estimates of spline effects. This article answers that question by visualizing the spline effects.

An overview of generated effects

Spline effects are powerful because they enable you to use parametric models to fit nonlinear relationships between an independent variable and a response. Using spline effects is not much different from using polynomial effects to fit nonlinear relationships. Suppose that a response variable, Y, appears to depend on an explanatory variable, X, in a complicated nonlinear fashion. If the relationship looks quadratic or cubic, you might try to capture the relationship by introducing polynomial effects. Instead of trying to model Y by X, you might try to use X, X², and X³.

Strictly speaking, polynomial effects do not need to be centered at the origin. You could translate the polynomial by some amount, k, and use shifted polynomial effects such as (X-k), (X-k)², and (X-k)³. You could also combine these shifted polynomials with polynomials at the origin, or use polynomials that are shifted by different amounts, such as the constants k1, k2, and k3.
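
For instance, here is a minimal DATA step sketch of shifted polynomial effects. (The data set name Have and the shift value k=3 are hypothetical; they are only for illustration.)

data PolyEffects;
   set Have;             /* hypothetical data set that contains the variable X */
   k = 3;                /* hypothetical shift (knot) value */
   x1 = (x - k);         /* shifted linear effect */
   x2 = (x - k)**2;      /* shifted quadratic effect */
   x3 = (x - k)**3;      /* shifted cubic effect */
run;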

Spline effects are similar to (shifted) polynomial effects. The constants (such as k1, k2, k3) that are used to shift the polynomials are called knots. Knots that are within the range of the data are called interior knots. Knots that are outside the range of the data are called exterior knots or boundary knots. You can read about the various kinds of spline effects that are supported by the EFFECT statement in SAS. Rather than rehash the mathematics, this article shows how you can use SAS to visualize a regression that uses splines. The visualization clarifies the meaning of the parameter estimates for the spline effects.

Output and visualize spline effects

This section shows how to output the spline effects into a SAS data set and plot the spline effects. Suppose you want to predict the MPG_City variable (miles per gallon in the city) based on the engine size. Because we will be plotting curves, the following statements sort the data by the EngineSize variable. Then the OUTDESIGN= option on the PROC GLMSELECT statement writes the spline effects to the Splines data set. For this example, I am using restricted cubic splines and four evenly spaced internal knots, but the same ideas apply to any choice of spline effects.

/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures: GLIMMIX, GLMSELECT, LOGISTIC, PHREG, ... */
title "Restricted TPF Splines";
title2 "Four Internal Knots";
proc glmselect data=cars outdesign(addinputvars fullmodel)=Splines; /* data set contains spline effects */
   effect spl = spline(EngineSize / details       /* define spline effects */
                naturalcubic basis=tpf(noint)     /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));             /* 4 evenly spaced interior knots */
   model mpg_city = spl / selection=none;         /* fit model by using spline effects */
   ods select ParameterEstimates SplineKnots;
   ods output ParameterEstimates=PE;
quit;

The SplineKnots table shows the locations of the internal knots. There are four equally spaced knots because the procedure used the KNOTMETHOD=EQUAL(4) option. The ParameterEstimates table shows estimates for the regression coefficients for the spline effects, which are named "Spl 1", "Spl 2", and so forth. In the Splines data set, the corresponding variables are named Spl_1, Spl_2, and so forth.

But what do these spline effects look like? The following statements plot the spline effects versus the EngineSize variable, which is the variable from which the effects are generated:

proc sgplot data=Splines;
   series x=EngineSize y=Intercept / curvelabel;
   series x=EngineSize y=spl_1 / curvelabel;
   series x=EngineSize y=spl_2 / curvelabel;
   series x=EngineSize y=spl_3 / curvelabel;
   refline 2.7 4.1 5.5 / axis=x lineattrs=(color="lightgray");
   refline 6.9 / axis=x label="upper knot" labelloc=inside lineattrs=(color="lightgray");
   yaxis label="Spline Effect";
run;

As stated in the documentation for the NATURALCUBIC option, these spline effects include "an intercept, the polynomial X, and n – 2 functions that are all linear beyond the largest knot," where n is the number of knots. This example uses n=4 knots, so Spl_2 and Spl_3 are the cubic splines. You will also see different spline effects if you change to one of the other supported spline methods, such as B-splines or the truncated power functions. Try it!

The graph shows that the natural cubic splines are reminiscent of polynomial effects, but there are a few differences:

  • The spline effects (spl_2 and spl_3) are shifted away from the origin. The spl_2 effect is shifted by 2.7 units, which is the location of the first internal knot. The spl_3 effect is shifted by 4.1 units, which is the location of the second internal knot.
  • The spline effects are 0 when EngineSize is less than the first knot position (2.7). Not all splines look like this, but these effects are based on truncated power functions (the TPF option).
  • The spline effects are linear when EngineSize is greater than the last knot position (6.9). Not all splines look like this, but these effects are restricted splines.

Predicted values are linear combinations of the spline effects

Visualizing the shapes of the spline effects enables you to make sense of the ParameterEstimates table. As in all linear regression, the predicted value is a linear combination of the design variables. In this case, the predicted values are formed by
Pred = 34.96 – 5*Spl_1 + 2.2*Spl_2 – 3.9*Spl_3
You can use the SAS DATA step or PROC IML to compute that linear combination of the spline effects and to graph the predicted curve with the locations of the knots overlaid.
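
For example, the following is a minimal DATA step sketch. The coefficients are the rounded values from the ParameterEstimates table, and the reference lines are drawn at the knot locations reported in the SplineKnots table:

data Pred;
   set Splines;          /* contains Intercept, spl_1-spl_3, and EngineSize */
   Pred = 34.96 - 5*spl_1 + 2.2*spl_2 - 3.9*spl_3;
run;
 
title "Predicted MPG_City as a Linear Combination of Spline Effects";
proc sgplot data=Pred;
   series x=EngineSize y=Pred;
   refline 2.7 4.1 5.5 6.9 / axis=x lineattrs=(color="lightgray");
   yaxis label="Predicted MPG (City)";
run;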

The coefficient for Spl_1 is the largest effect (after the intercept). In the graph, you can see the general trend has an approximate slope of -5. The coefficient for the Spl_2 effect is 2.2, and you can see that the predicted values change slope between the second and third knots due to adding the Spl_2 effect. Without the Spl_3 effect, the predicted values would continue to rise after the third knot, but by adding in a negative multiple of Spl_3, the predicted values turn down again after the third knot.

Notice that the prediction function for the restricted cubic spline regression is linear before the first knot and after the last knot. The prediction function models nonlinear relationships between the interior knots.

Summary

In summary, the EFFECT statement in SAS regression procedures can generate spline effects for a continuous explanatory variable. The EFFECT statement supports many different types of splines. This article gives an example of using natural cubic splines (also called restricted cubic splines), which are based on the truncated power function (TPF) splines of degree 3. By outputting the spline effects to a data set and graphing them, you can get a better understanding of the meaning of the estimates of the regression coefficients. The predicted values are a linear combination of the spline effects, so the magnitude and sign of the regression coefficients indicate how the spline effects combine to predict the response.

The post Visualize a regression with splines appeared first on The DO Loop.

October 2, 2019
 

I frequently see questions on SAS discussion forums about how to compute the geometric mean and related quantities in SAS. Unfortunately, the answers to these questions are sometimes confusing or even wrong. In addition, some published papers and web sites that claim to show how to calculate the geometric mean in SAS contain wrong or misleading information.

This article shows how to compute the geometric mean, the geometric standard deviation, and the geometric coefficient of variation in SAS. It first shows how to use PROC TTEST to compute the geometric mean and the geometric coefficient of variation. It then shows how to compute several geometric statistics in the SAS/IML language.

For an introduction to the geometric mean, see "What is a geometric mean." For information about the (arithmetic) coefficient of variation (CV) and its applications, see the article "What is the coefficient of variation?"

Compute the geometric mean and geometric CV in SAS

As discussed in my previous article, the geometric mean arises naturally when positive numbers are being multiplied and you want to find the average multiplier. Although the geometric mean can be used to estimate the "center" of any set of positive numbers, it is frequently used to estimate average values in a set of ratios or to compute an average growth rate.

The TTEST procedure is the easiest way to compute the geometric mean (GM) and geometric CV (GCV) of positive data. To demonstrate this, the following DATA step simulates 100 random observations from a lognormal distribution. PROC SGPLOT shows a histogram of the data and overlays a vertical line at the location of the geometric mean.

%let N = 100;
data Have;
call streaminit(12345);
do i = 1 to &N;
   x = round( rand("LogNormal", 3, 0.8), 0.1);    /* generate positive values */
   output;
end;
run;
 
title "Geometric Mean of Skewed Positive Data";
proc sgplot data=Have;
   histogram x / binwidth=10 binstart=5 showbins;
   refline 20.2 / axis=x label="Geometric/Mean" splitchar="/" labelloc=inside
                  lineattrs=GraphData2(thickness=3);
   xaxis values=(0 to 140 by 10);
   yaxis offsetmax=0.1;
run;

Where is the "center" of these data? That depends on your definition. The mode of this skewed distribution is close to x=15, but the arithmetic mean is about 26.4. The mean is pulled upwards by the long right tail. It is a mathematical fact that the geometric mean of data never exceeds the arithmetic mean. For these data, the geometric mean is 20.2.

To compute the geometric mean and geometric CV, you can use the DIST=LOGNORMAL option on the PROC TTEST statement, as follows:

proc ttest data=Have dist=lognormal; 
   var x;
   ods select ConfLimits;
run;

The geometric mean, which is 20.2 for these data, estimates the "center" of the data. Notice that the procedure does not report the geometric standard deviation (or variance), but instead reports the geometric coefficient of variation (GCV), which has the value 0.887 for this example. The documentation for the TTEST procedure explains why the GCV is the better measure of variation: "For lognormal data, the CV is the natural measure of variability (rather than the standard deviation) because the CV is invariant to multiplication of [the data] by a constant."

You might wonder whether data need to be lognormally distributed to use this table. The answer is that the data do not need to be lognormally distributed to use the geometric mean and geometric CV. However, the 95% confidence intervals for these quantities assume log-normality.

Definitions of geometric statistics

As T. Kirkwood points out in a letter to the editors of Biometrics (Kirkwood, 1979), if data are lognormally distributed as LN(μ, σ), then

  • The quantity GM = exp(μ) is the geometric mean. It is estimated from a sample by the quantity exp(m), where m is the arithmetic mean of the log-transformed data.
  • The quantity GSD = exp(σ) is defined to be the geometric standard deviation. The sample estimate is exp(s), where s is the standard deviation of the log-transformed data.
  • The geometric standard error (GSE) is defined by exponentiating the standard error of the mean of the log-transformed data. Geometric confidence intervals are handled similarly.
  • Kirkwood's proposal for the geometric coefficient of variation (GCV) is not generally used. Instead, the accepted definition of the GCV is GCV = sqrt(exp(σ²) – 1), which is the definition that is used in SAS. The estimate for the GCV is sqrt(exp(s²) – 1).

You can use these formulas to compute the geometric statistics for any positive data. However, only for lognormal data do the statistics have a solid theoretical basis: transform to normality, compute a statistic, apply the inverse transform.

Compute the geometric mean in SAS/IML

You can use the SAS/IML language to compute the geometric mean and other "geometric statistics" such as the geometric standard deviation and the geometric CV. The GEOMEAN function is a built-in SAS/IML function, but the other statistics are implemented by explicitly computing statistics of the log-transformed data, as described in the previous section:

proc iml;
use Have; read all var "x"; close;  /* read in positive data */
GM = geomean(x);               /* built-in GEOMEAN function */
print GM;
 
/* To estimate the geometric mean and geometric StdDev, compute
   arithmetic estimates of log(X), then EXP transform the results. */
n = nrow(x);
z = log(x);                  /* log-transformed data */
m = mean(z);                 /* arithmetic mean of log(X) */
s = std(z);                  /* arithmetic std dev of log(X) */
GM2 = exp(m);                /* same answer as GEOMEAN function */
GSD = exp(s);                /* geometric std dev */
GCV = sqrt(exp(s**2) - 1);   /* geometric CV */
print GM2 GSD GCV;

Note that the GM and GCV match the output from PROC TTEST.

What does the geometric standard deviation mean? As with the arithmetic standard deviation, you need to start by thinking about the location of the geometric mean (20.2). If the data are normally distributed, then about 68% of the data are within one standard deviation of the mean, which is the interval [m-s, m+s]. For lognormal data, about 68% of the data should be in the interval [GM/GSD, GM*GSD] and, in fact, 65 out of 100 of the simulated observations are in that interval. Similarly, about 95% of lognormal data should be in the interval [GM/GSD², GM*GSD²]. For the simulated data, 94 out of 100 observations are in the interval, as shown below:
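
Continuing the PROC IML session from the previous section (which defined x, GM2, and GSD), a minimal sketch of that check counts how many observations fall inside each interval:

in1 = sum( (x > GM2/GSD)    & (x < GM2*GSD) );     /* expect roughly 68% of the 100 obs */
in2 = sum( (x > GM2/GSD##2) & (x < GM2*GSD##2) );  /* expect roughly 95% of the 100 obs */
print in1 in2;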

I am not aware of a similar interpretation of the geometric coefficient of variation. The GCV is usually used to compare two samples. As opposed to the intervals in the previous paragraph, the GCV does not make any reference to the geometric mean of the data.

Other ways to compute the geometric mean

The methods in this article are the simplest ways to compute the geometric mean in SAS, but there are other ways.

  • You can use the DATA step to log-transform the data, use PROC MEANS to compute the descriptive statistics of the log-transformed data, then use the DATA step to exponentiate the results. A sketch of this approach follows this list.
  • PROC SURVEYMEANS can compute the geometric mean (with confidence intervals) and the standard error of the geometric mean for survey responses. However, the variance of survey data is not the same as the variance of a random sample, so you should not use the standard error statistic unless you have survey data.
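
Here is a minimal sketch of the first approach, assuming the Have data set from earlier. (The names LogHave, LogStats, and GeoStats are arbitrary.)

data LogHave;
   set Have;
   LogX = log(x);                /* log-transform the positive data */
run;
 
proc means data=LogHave noprint;
   var LogX;
   output out=LogStats mean=m stddev=s;
run;
 
data GeoStats;
   set LogStats;
   GM  = exp(m);                 /* geometric mean */
   GSD = exp(s);                 /* geometric standard deviation */
   GCV = sqrt(exp(s**2) - 1);    /* geometric coefficient of variation */
run;
 
proc print data=GeoStats noobs;  var GM GSD GCV;  run;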

As I said earlier, there is some bad information out there on the internet about this topic, so beware. A site that seems to get all the formulas correct and present the information in a reasonable way is Alex Kritchevsky's blog.

You can download the complete SAS program that I used to compute the GM, GSD, and GCV. The program also shows how to compute confidence intervals for these quantities.

The post Compute the geometric mean, geometric standard deviation, and geometric CV in SAS appeared first on The DO Loop.

September 3, 2019
 

An important application of the dot product (inner product) of two vectors is to determine the angle between the vectors. If u and v are two vectors, then
cos(θ) = (u ⋅ v) / (|u| |v|)
You could apply the inverse cosine function if you wanted to find θ in [0, π], but since the cosine function is a monotonic (decreasing) transformation on [0, π], it is usually sufficient to know the cosine of the angle between two vectors. The expression (u ⋅ v) / (|u| |v|) is called the cosine similarity between the vectors u and v. It is a value in [-1, 1]. This article discusses the cosine similarity, why it is useful, and how you can compute it in SAS.
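
For example, here is a small SAS/IML sketch that applies the formula to two vectors. (The values are the same as the vectors B and C that are defined later in this article, and the result reappears in the PROC DISTANCE output below.)

proc iml;
u = {3 5};                                 /* same values as vector B below */
v = {3 2.8};                               /* same values as vector C below */
cosSim = (u * v`) / (norm(u) * norm(v));   /* (u . v) / (|u| |v|) */
print cosSim;                              /* approximately 0.961 */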

What is the cosine similarity?

For multivariate numeric data, you can compute the cosine similarity of the rows or of the columns. The cosine similarity of the rows tells you which subjects are similar to each other. The cosine similarity of the columns tells you which variables are similar to each other. For example, if you have clinical data, the rows might represent patients and the variables might represent measurements such as blood pressure, cholesterol, body-mass index, and so forth. The row similarity compares patients to each other; the column similarity compares vectors of measurements.

Vectors that are very similar to each other have a cosine similarity that is close to 1. Vectors that are nearly orthogonal have a cosine similarity near 0. Vectors that point in opposite directions have a cosine similarity of –1. However, in practice, the cosine similarity is often used on vectors that have nonnegative values. For those vectors, the angle between them is never more than 90 degrees and so the cosine similarity is between 0 and 1.

To illustrate how to compute the cosine similarity in SAS, the following statements create a simple data set that has four observations and two variables. The four row vectors are plotted:

data Vectors;
length Name $1;
input Name x y;
datalines;
A  0.5 1
B  3   5
C  3   2.8
D  5   1
;
 
ods graphics / width=400px height=400px;
title "Four Row Vectors";
proc sgplot data=Vectors aspect=1;
   vector x=x y=y / datalabel=Name datalabelattrs=(size=14);
   xaxis grid;  yaxis grid;
run;
Four vectors. Find the cosine similarity.

The cosine similarity of observations

If you look at the previous graph of vectors and think that vector A is unlike the other vectors, then you are using the magnitude (length) of the vectors to form that opinion. The cosine similarity does not use the magnitude of the vectors to decide which vectors are alike. Instead, it uses only the direction of the vectors. You can visualize the cosine similarity by plotting the normalized vectors that have unit length, as shown in the next graph.

Four normalized vectors. The cosine similarity is the angle between these vectors.

The second graph shows the vectors that are used to compute the cosine similarity. For these vectors, A and B are most similar to each other. Vector C is more similar to B than to D, and D is least similar to the others. You can use PROC DISTANCE and the METHOD=COSINE option to compute the cosine similarity between observations. If your data set has N observations, the result of PROC DISTANCE is an N x N matrix of cosine similarity values. The (i,j)th value is the similarity between the i-th vector and the j-th vector.

proc distance data=Vectors out=Cos method=COSINE shape=square;
   var ratio(_NUMERIC_);
   id Name;
run;
proc print data=Cos noobs; run;
Cosine similarity between four vectors

The results of the DISTANCE procedure confirm what we already knew from the geometry. Namely, A and B are most similar to each other (cosine similarity of 0.997), C is more similar to B (0.961) than to D (0.85), and D is not very similar to the other vectors (similarities range from 0.61 to 0.85).

Notice that the cosine similarity is not a linear function of the angle between vectors. The angle between vectors B and A is 4 degrees, which has a cosine of 0.997. The angle between vectors B and C is almost four times as big (16 degrees) but the cosine similarity is 0.961, which is not very different from 0.997 (and certainly not four times as small). Notice also that the graph of the cosine function is flat near θ=0. Consequently, the cosine similarity does not vary much between the vectors in this example.

Compute cosine similarity in SAS/IML

SAS/STAT does not include a procedure that computes the cosine similarity of variables, but you can use the SAS/IML language to compute both row and column similarity. The following PROC IML statements define functions that compute the cosine similarity for rows and for columns. To support missing values in the data, the functions use listwise deletion to remove any rows that contain a missing value. All functions are saved for future use.

/* Compute cosine similarity matrices in SAS/IML */
proc iml;
/* exclude any row with a missing value */
start ExtractCompleteCases(X);
   if all(X ^= .) then return X;
   idx = loc(countmiss(X, "row")=0);
   if ncol(idx)>0 then return( X[idx, ] );
   else                return( {} ); 
finish;
 
/* compute the cosine similarity of columns (variables) */
start CosSimCols(X, checkForMissing=1);
   if checkForMissing then do;    /* by default, check for missing and exclude */
      Z = ExtractCompleteCases(X);
      Y = Z / sqrt(Z[##,]);       /* stdize each column */
   end;
      else Y = X / sqrt(X[##,]);  /* skip the check if you know all values are valid */
   cosY = Y` * Y;                 /* pairwise inner products */
   /* because of finite precision, elements could be 1+eps or -1-eps */
   idx = loc(cosY> 1); if ncol(idx)>0 then cosY[idx]= 1; 
   idx = loc(cosY<-1); if ncol(idx)>0 then cosY[idx]=-1; 
   return cosY;
finish;
 
/* compute the cosine similarity of rows (observations) */
start CosSimRows(X);
   Z = ExtractCompleteCases(X);  /* check for missing and exclude */
   return T(CosSimCols(Z`, 0));  /* transpose and call CosSimCols */
finish;
store module=(ExtractCompleteCases CosSimCols CosSimRows);

Visualize the cosine similarity matrix

When you compare k vectors, the cosine similarity matrix is k x k. When k is larger than 5, you probably want to visualize the similarity matrix by using heat maps. The following DATA step extracts two subsets of vehicles from the Sashelp.Cars data set. The first subset contains vehicles that have weak engines (low horsepower) whereas the second subset contains vehicles that have powerful engines (high horsepower). You can use a heat map to visualize the cosine similarity matrix between these vehicles:

data Vehicles;
   set sashelp.cars(where=(Origin='USA'));
   if Horsepower < 140 OR Horsepower >= 310;
run;
proc sort data=Vehicles; by Type; run;  /* sort obs by vehicle type */
 
ods graphics / reset;
proc iml;
load module=(CosSimCols CosSimRows);
use Vehicles;
read all var _NUM_ into X[c=varNames r=Model]; 
read all var {Model Type}; close;
 
labl = compress(Type + ":" + substr(Model, 1, 10));
cosRY = CosSimRows(X);
call heatmapcont(cosRY) xvalues=labl yvalues=labl title="Cosine Similarity between Vehicles";
Cosine similarity between attributes of 20 vehicles

The heat map shows the cosine similarities of the attributes of 20 vehicles. Vehicles of a specific type (SUV, sedan, sports car,...) tend to be similar to each other and different from vehicles of other types. The four sports cars stand out. They are very dissimilar to the sedans, and they are also dissimilar to the SUVs and wagons.

Notice the small range for the cosine similarity values. Even the most dissimilar vehicles have a cosine similarity of 0.99.

Cosine similarity of columns

You can treat each row of the data as a vector of dimension p. Similarly (no pun intended!), you can treat each column as a vector of length N. You can use the CosSimCols function, defined in the previous section, to compute the cosine similarity matrix of numerical columns. The math is the same but is applied to the transpose of the data matrix. To demonstrate the function, the following statements compute and visualize the column similarity for the Vehicles data. There are 10 numerical variables in the data.

cosCY = CosSimCols(X);
call heatmapcont(cosCY) xvalues=varNames yvalues=varNames title="Cosine of Angle Between Variables";
Cosine similarity between variables in the Vehicles data set

The heat map for the columns is shown. You can see that the MPG_City and MPG_Highway variables are dissimilar to most other variables, but similar to each other. Other sets of similar variables include the variables that measure cost (MSRP and Invoice), the variables that measure power (EngineSize, Cylinders, Horsepower), and the variables that measure size (Wheelbase, Length, Weight).

The connection between cosine similarity and correlation

The similarity matrix of the variables shows which variables are similar and dissimilar. In that sense, the matrix might remind you of a correlation matrix. However, there is an important difference: The correlation matrix displays the pairwise inner products of centered variables. The cosine similarity does not center the variables. Although the correlation is scale-invariant and affine invariant, the cosine similarity is not affine invariant: If you add or subtract a constant from a variable, its cosine similarity with other variables will change.
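
Continuing the PROC IML session above (X contains the numeric variables of the Vehicles data), the following sketch checks this connection numerically: if you center each column, the cosine similarity matrix equals the Pearson correlation matrix.

Xc = X - mean(X);                                /* center each column */
maxDiff = max( abs( CosSimCols(Xc) - corr(X) ) );
print maxDiff;                                   /* essentially zero, up to rounding error */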

When the data represent positive quantities (as in the Vehicles data), the cosine similarity between two vectors can never be negative. For example, consider the relationship between the fuel economy variables (MPG_City and MPG_Highway) and the "power" and "size" variables. Intuitively, the fuel economy is negatively correlated with power and size (for example, Horsepower and Weight). However, the cosine similarity is positive, which shows another difference between correlation and cosine similarity.

Why is the cosine similarity useful?

The fact that the cosine similarity does not center the data is its biggest strength (and also its biggest weakness). The cosine similarity is most often used in applications in which the variables represent counts or indicator variables. Often, each row represents a document such as a recipe, a book, or a song. The columns indicate an attribute such as an ingredient, word, or musical technique.

For text documents, the number of columns (which represent important words) can be in the thousands. The goal is to discover which documents are structurally similar to each other, perhaps because they share similar content. Recommender engines can use the cosine similarity to suggest new books or articles that are similar to one that was previously read. The same technique can recommend similar recipes or similar songs.

The advantage of the cosine similarity is its speed and the fact that it is very useful for sparse data. In a future article, I'll provide a simple example that demonstrates how you can use the cosine similarity to find recipes that are similar or dissimilar to each other.

The post Cosine similarity of vectors appeared first on The DO Loop.

August 26, 2019
 

Most SAS programmers know how to use PROC APPEND or the SET statement in the DATA step to unconditionally append new observations to an existing data set. However, sometimes you need to scan the data to determine whether or not to append observations. In this situation, many SAS programmers choose one of the following methods:

  • Inside a DATA step, use the SYMPUT call to create a macro variable that indicates whether to append observations. After the DATA step ends, use %IF-%THEN processing to check the value of the macro variable and conditionally append the observations.
  • Use the DATA step to determine whether to append data and append data in the same DATA step. This is especially useful if the values for the new observations depend on the data that you scanned.

This article shows the second method. It shows how to use the SAS DATA step to scan through observations and remember certain values. If a condition is met, it uses the values to append new observations to the end of the data by using end-of-file processing.

A motivating example

Often SAS programmers need to implement complicated data-dependent logic. A simple example is "If the XYZ variable contains a certain value but doesn't contain a different value, then do something." On the SAS discussion forums, the experts often suggest scanning through the data with a DATA step and keeping one or more "flag variables" that indicate which conditions have been satisfied. At the end of the DATA step, you can look at the values of the flag variables to determine what action to take.

Last week I encountered a situation where I needed to conditionally append observations to input data. Although the solution is easy if you use the SAS/IML language (which enables you to scan an entire vector of values), I needed to solve the problem by using the DATA step, which processes only one observation at a time. The problem had the following form:

  • The data are in long form. The character variable TYPE always contains the values 'Min' and 'Max'. The numeric variable VALUE contains numbers.
  • The TYPE variable might or might not contain the values 'LowVal' and 'HighVal'. If they do appear, they always appear after the 'Min' and 'Max' values.

The goal is to create an output data set that always contains the four values 'Min', 'Max', 'LowVal', and 'HighVal'. This goal is summarized by the figure to the right. The following list describes how to generate the 'LowVal' and 'HighVal' observations if they do not exist.

  • If TYPE='HighVal' does not appear in the data, create it and copy the value of the TYPE='Max' observation to use for the VALUE variable.
  • Similarly, if TYPE='LowVal' does not appear in the data, create it. Copy the value of the TYPE='Min' observation to use for the VALUE variable.

The figure shows the four situations that can occur. The input data set always contains the 'Min' and 'Max' values but can contain none, one, or two of the other values. The goal is to produce the data set on the right, which always contains all four values. The next section presents a solution, so stop reading here if you want to solve the problem on your own!

The END= option and end-of-file processing

To solve the problems, I used two facts about the SAS DATA step:

  1. You can use the END= option on the SET statement to create a temporary binary indicator variable that has the value 1 for only the last observation of the input data.
  2. The SAS DATA step contains an implicit loop over all observations in the input data. If you do not use an OUTPUT statement, the DATA step performs an implicit output for each observation. However, if the program contains an OUTPUT statement anywhere in the program, then the implicit output is disabled. Therefore, once you use an OUTPUT statement, you must use an OUTPUT statement at every place that you want to write an observation to the output data set.

The following program scans the input data. It remembers the values of the 'Min' and 'Max' observations, in case it needs them. It uses indicator variables to determine whether the data contains the 'LowVal' and 'HighVal' observations. After the input data are read, the program uses an end-of-file indicator variable (EOF) to determine whether or not to add observations for 'LowVal' and 'HighVal'.

Because the program uses an OUTPUT statement to (conditionally) create new observations, you must also put an OUTPUT statement after the SET statement to ensure that the original observations are all written.

/* Include all 4 test cases. Use WHERE clause to test each case. */
data Test;
length Type $7;
input Group Type $ Value;
datalines;
1 Min      -3
1 Max       3
1 LowVal   -2
1 HighVal   2
2 Min      -3
2 Max       3
2 HighVal   2
3 Min      -3
3 Max       3
3 LowVal   -2
4 Min      -3
4 Max       3
;
 
/* Input order is always 'Min' and 'Max' optionally followed by 
   'LowVal' and 'HighVal', if they exist. */
%let dsname = Test(where=(Group=4));    /* use 1,2,3,4 to test all cases */
 
data Want;
drop HighFound LowFound Min Max;   /* temporary variables */
retain HighFound LowFound 0        /* binary indicator variables: Initialize to 0 (false) */
       Min Max .;                  /* VALUE of 'Min' and 'Max' obs: Initialize to missing */
set &dsname end=EOF;               /* EOF is temporary indicator variable */
output;                            /* need OUTPUT because of EOF processing */
if Type = 'Min' then 
   min = Value;              /* remember the Min value */
else if Type = 'Max' then
   max = Value;              /* remember the Max value */
else if Type = 'LowVal' then 
   LowFound = 1;             /* Low value found; no need to create it */
else if Type = 'HighVal' then 
   HighFound = 1;            /* High value found; no need to create it */
 
/* end-of-file processing: conditionally append new observations */
if EOF then do;
   if ^LowFound then do;     /* Low value not found. Add it. */
      Type = "LowVal"; Value = Min; output;
   end;
   if ^HighFound then do;    /* High value not found. Add it. */
      Type = "HighVal"; Value = Max; output;
   end;
end;
run;
 
proc print data=Want; 
   var Group Type Value;
run;

The result is shown for input data that contains only the 'Min' and 'Max' observations but not the 'LowVal' or 'HighVal' observations. The output shows that the 'LowVal' or 'HighVal' observations were correctly appended to the input data, and that values for the VALUE column were copied from the 'Min' and 'Max' observations, respectively. You can verify that the other three input data sets are also correctly handled.

Use caution with the DELETE and subsetting IF statements

When performing end-of-file processing, be careful if you use a DELETE statement or a subsetting IF statement. For details and examples, see "The Perils of End-of-File Processing when Subsetting Data" (Landry, 2009). Landry summarizes the problem as follows (p. 2): "The problem occurs only when the last record in the dataset is deleted.... The reason this happens is that when a record is deleted..., SAS stops processing and returns to the next iteration of the DATA step. Thus, any executable statements placed after the [DELETE or subsetting IF statements] do not get executed."
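
A minimal sketch of one way to avoid the problem is shown below. (The Have data set and the Status variable are hypothetical.) Because the end-of-file logic appears before the DELETE statement, it executes even when the last record is discarded:

data Want2;
set Have end=EOF;                     /* hypothetical input data */
if EOF then do;                       /* end-of-file logic runs first ...    */
   put "NOTE: Last record reached.";
end;
if Status = "Exclude" then delete;    /* ... so this DELETE cannot bypass it */
run;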

Summary

In summary, this article shows how to use the SAS DATA step to conditionally add observations to the end of the input data. This is useful for data-dependent logic when the observations that you need to append depend on the values of the data. You can perform end-of-file processing by using the END= option on the SET statement to create an indicator variable that has the value 1 for the last observation in the input data. You can use the OUTPUT statement to append additional observations, but remember that you also need to use the OUTPUT statement after the SET statement if you want to output the original data.

Do you have a favorite way to conditionally append data? Do you know of other potential pitfalls with end-of-file processing? Leave a comment.

The post Conditionally append observations to a SAS data set appeared first on The DO Loop.

August 19, 2019
 

One of my friends likes to remind me that "there is no such thing as a free lunch," which he abbreviates by "TINSTAAFL" (or TANSTAAFL). The TINSTAAFL principle applies to computer programming because you often end up paying a cost (in performance) when you call a convenience function that simplifies your program.

I was thinking about TINSTAAFL recently when I was calling a Base SAS function from the SAS/IML matrix language. The SAS/IML language supports hundreds of built-in functions that operate on vectors and matrices. However, you can also call hundreds of functions in Base SAS and pass in vectors for the parameters. It is awesome and convenient to be able to call the virtual smorgasbord of functions in Base SAS, such as probability functions, string-matching functions, trig functions, financial functions, and more. Of course, there is no such thing as a free lunch, so I wondered about the overhead costs associated with calling a Base SAS function from SAS/IML. Base SAS functions typically are designed to operate on scalar values, so the IML language has to call the underlying function many times, once for each value of the parameter vector. It is more expensive to call a function a million times (each time passing in a scalar parameter) than it is to call a function one time and pass in a vector that contains a million parameters.

To determine the overhead costs, I decided to test the cost of calling the MISSING function in Base SAS. The IML language has a built-in syntax (b = (X=.)) for creating a binary variable that indicates which elements of a vector are missing. The call to the MISSING function (b = missing(X)) is equivalent, but requires calling a Base SAS function many times, once for each element of X. The native SAS/IML syntax will be faster than calling a Base SAS function (TINSTAAFL!), but how much faster?

The following program incorporates many of my tips for measuring the performance of a SAS computation. The test is run on large vectors of various sizes. Each computation (which is very fast, even on large vectors) is repeated 50 times. The results are presented in a graph. The following program measures the performance for a character vector that contains all missing values.

/* Compare performance of IML syntax
   b = (X = " ");
   to performance of calling Base SAS MISSING function 
   b = missing(X);
*/
proc iml;
numRep = 50;                            /* repeat each computation 50 times */
sizes = {1E4, 2.5E4, 5E4, 10E4, 20E4};  /* number of elements in vector */
labl = {"Size" "T_IML" "T_Missing"};
Results = j(nrow(sizes), 3);
Results[,1] = sizes;
 
/* measure performance for character data */
do i = 1 to nrow(sizes);
   A = j(sizes[i], 1, " ");            /* every element is missing */
   t0 = time();
   do k = 1 to numRep;
      b = (A = " ");                   /* use built-in IML syntax */
   end;
   Results[i, 2] = (time() - t0) / numRep;
 
   t0 = time();
   do k = 1 to numRep;
      b = missing(A);                  /* call Base SAS function */
   end;
   Results[i, 3] = (time() - t0) / numRep;
end;
 
title "Timing Results for (X=' ') vs missing(X) in SAS/IML";
title2 "Character Data";
long = (sizes // sizes) || (Results[,2] // Results[,3]);   /* convert from wide to long for graphing */
Group = j(nrow(sizes), 1, "T_IML") // j(nrow(sizes), 1, "T_Missing"); 
call series(long[,1], long[,2]) group=Group grid={x y} label={"Size" "Time (s)"} 
            option="markers curvelabel" other="format X comma8.;";

The graph shows that the absolute times for creating a binary indicator variable are very short for both methods. Even for 200,000 observations, creating a binary indicator variable takes less than five milliseconds. However, on a relative scale, the built-in SAS/IML syntax is more than twice as fast as calling the Base SAS MISSING function.

You can run a similar test for numeric values. For numeric values, the SAS/IML syntax is about 10-20 times faster than the call to the MISSING function, but, again, the absolute times are less than five milliseconds.
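
For reference, here is a minimal sketch of the numeric version of the test for a single vector size (200,000 elements). The structure mirrors the character test above:

proc iml;
numRep = 50;                     /* repeat each computation 50 times */
A = j(20E4, 1, .);               /* numeric vector; every element is missing */
 
t0 = time();
do k = 1 to numRep;
   b = (A = .);                  /* built-in IML syntax */
end;
t_IML = (time() - t0) / numRep;
 
t0 = time();
do k = 1 to numRep;
   b = missing(A);               /* call the Base SAS MISSING function */
end;
t_Missing = (time() - t0) / numRep;
print t_IML t_Missing;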

So, what's the cost of calling a Base SAS function from SAS/IML? It's not free, but it's very cheap in absolute terms! Of course, the cost depends on the number of elements that you are sending to the Base SAS function. However, in general, there is hardly any cost associated with calling a Base SAS function from SAS/IML. So enjoy the lunch buffet! Not only is it convenient and plentiful, but it's also very cheap!

The post Timing performance in SAS/IML: Built-in functions versus Base SAS functions appeared first on The DO Loop.

August 14, 2019
 

Many programmers are familiar with "short-circuit" evaluation in an IF-THEN statement. Short circuit means that a program does not evaluate the remainder of a logical expression if the value of the expression is already logically determined. The SAS DATA step supports short-circuiting for simple logical expressions in IF-THEN statements and WHERE clauses (Polzin 1994, p. 1579; Gilsen 1997). For example, in the following logical-AND expression, the condition for the variable Y does not need to be checked if the condition for variable X is true:

data _null_;
set A end=eof;
if x>0 & y>0 then   /* Y is evaluated only if X>0 */
   count + 1;
if eof then 
   put count=;
run;

Order the conditions in a logical statement by likelihood

SAS programmers can optimize their IF-THEN and WHERE clauses if they can estimate the probability of each separate condition in a logical expression:

  • In a logical AND statement, put the least likely events first and the most likely events last. For example, suppose you want to find patients at a VA hospital that are male, over the age of 50, and have kidney cancer. You know that kidney cancer is a rare form of cancer. You also know that most patients at the VA hospital are male. To optimize a WHERE clause, you should put the least probable conditions first:
    WHERE Disease="Kidney Cancer" & Age>50 & Sex="Male";
  • In a logical OR statement, put the most likely events first and the least likely events last. For example, suppose you want to find all patients at a VA hospital that are either male, or over the age of 50, or have kidney cancer. To optimize a WHERE clause, you should use
    WHERE Sex="Male" | Age>50 | Disease="Kidney Cancer";

The SAS documentation does not discuss the conditions for which a logical expression does or does not short circuit. Polzin (1994, p. 1579) points out that when you put function calls in the logical expression, SAS evaluates certain function calls that produce side effects. Common functions that have side effects include random number functions and user-defined functions (via PROC FCMP) that have output arguments. The LAG and DIF functions can also produce side effects, but it appears that expressions that involve the LAG and DIF functions are short-circuited. You can force a function evaluation by calling the function prior to an IF-THEN statement. You can use nested IF-THEN/ELSE statements to ensure that functions are not evaluated unless prior conditions are satisfied.
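
For example, the following sketch shows the nested IF-THEN idea. (The data set A and the user-defined function myFunc are hypothetical.) The inner function call executes only when the outer condition is true, so its side effects cannot occur otherwise:

data _null_;
set A end=eof;
if x > 0 then do;             /* outer condition is checked first */
   if myFunc(y) > 0 then      /* hypothetical function with side effects */
      count + 1;              /* the function is called only when x > 0 */
end;
if eof then 
   put count=;
run;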

Logical ligatures

The SAS/IML language does not support short-circuiting in IF-THEN statements, but it performs several similar optimizations that are designed to speed up your code execution. One optimization involves the ANY and ALL functions, which test whether any or all (respectively) elements of a matrix satisfy a logical condition. A common usage is to test whether a missing value appear anywhere in a vector, as shown in the following SAS/IML statement:

bAny = any(y = .);   /* TRUE if any element of Y is missing */
/* The complementary test is   bAll = all(y ^= .);  which is TRUE if no element is missing */

The SAS/IML language treats simple logical expressions like these as a single function call, not as two operations. I call this a logical ligature because two operations are combined into one. (A computer scientist might just call this a parser optimization.)

You might assume that the expression ANY(Y=.) is evaluated by using a two-step process. In the first step, the Boolean expression y=. is evaluated and the result is assigned to a temporary binary matrix, which is the same size as Y. In the second step, the temporary matrix is sent to the ANY function, which evaluates the binary matrix and returns TRUE if any element is nonzero. However, it turns out that SAS/IML does not use a temporary matrix. The SAS/IML parser recognizes that the expression inside the ANY function is a simple logical expression. The program can evaluate the function by looking at each element of Y and returning TRUE as soon as it finds a missing value. In other words, it short-circuits the computation. If no value is missing, the expression evaluates to FALSE.

Short circuiting an operation can save considerable time. In the following SAS/IML program, the vector X contains 100 million elements, all equal to 1. The vector Y also contains 100 million elements, but the first element of the vector is a missing value. Consequently, the computation for Y is essentially instantaneous whereas the computation for X takes a tenth of a second:

proc iml;
numReps = 10;      /* run computations 10 times and report average */
N = 1E8;           /* create vector with 100 million elements */
x = j(N, 1, 1);    /* all elements of x equal 1 */
y = x; y[1] = .;   /* the first element of y is missing */
 
/* the ALL and ANY functions short-circuit when the 
   argument is a simple logical expression */
/* these function calls examine only the first elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(y = .);   /* TRUE; stops at y[1] */
   bAll = all(y ^= .);  /* FALSE; stops at y[1] */
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];
 
/* these function calls must examine all elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(x = .);   
   bAll = all(x ^= .);
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];

Although the evaluation of X does not short circuit, it still uses the logical ligature to evaluate the expression. Consequently, the evaluation is much faster than the naive two-step process that is shown explicitly by the following statements, which require about 0.3 seconds and twice as much memory:

   /* two-step process: slower */
   b1 = (y=.);                /* form the binary vector */
   bAny = any(b1);            /* TRUE for y[1] */

In summary, the SAS DATA step uses short-circuit evaluation in IF-THEN statements and WHERE clauses that use simple logical expressions. If the expression contains several subexpressions, you can optimize your program by estimating the probability that each subexpression is true. In the SAS/IML language, the ANY and ALL functions not only short circuit, but when their argument is a simple Boolean expression, the language treats the function call as a logical ligature and evaluates the call in an efficient manner that does not require any temporary memory.

Short circuits can be perplexing if you don't expect them. Equally confusing is expecting a statement to short circuit, but it doesn't. If you have a funny story about short-circuit operators, in any language, leave a comment.

The post Short-circuit evaluation and logical ligatures in SAS appeared first on The DO Loop.

July 31, 2019
 

Sometimes a little thing can make a big difference. I am enjoying a new enhancement of SAS/IML 15.1, which enables you to use a numeric vector as the column header or row header when you print a SAS/IML matrix. Prior to SAS/IML 15.1, you had to first use the CHAR or PUTN function to manually apply a format to the numeric values. You could then use the formatted values as column or row headers. Now you can skip the formatting step.

A common situation in which I want to use numeric values as a column header is when I am using the TABULATE function to compute the frequencies for a discrete numerical variable. For example, the Cylinders variable in the Sashelp.Cars data set indicates the number of cylinders in each vehicle. You might want to know how many vehicles in the data are four-cylinder, six-cylinder, eight-cylinder, and so forth. In SAS/IML 15.1, you can use a numeric variable (such as the levels of the Cylinders variable) directly in the COLNAME= option of the PRINT statement, as follows:

proc iml;
use Sashelp.Cars;
read all var "Cylinders" into X;
close;
 
call tabulate(level, freq, X);      /* count number of obs for each level of X */
/* New in SAS/IML 15.1: use numeric values as COLNAME= option to PRINT statement */
print freq[colname=level];

Prior to SAS/IML 15.1, you had to convert numbers to character values by applying a format. This is commonly done by using the CHAR function, which applies a W.D format. This technique is still useful, but optional. You can also use the PUTN function to apply any format you want, such as COMMA., DATE., TIME., and so forth. For example, if you like Roman numerals, you could apply the ROMAN. format!

labels = char(level, 2);         /* apply w.d format */
print freq[colname=labels];
 
labels = putn(level, "ROMAN.");  /* second arg is name of format */
print freq[colname=labels];

You can also use numeric values for the ROWNAME= option to obtain row headers in SAS/IML 15.1. Sure, it's a small enhancement, but I like it!
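
For example, continuing the PROC IML session above, here is a minimal sketch that uses the same numeric vector to label the rows of a column of frequencies:

counts = freq`;                  /* transpose the frequencies into a column vector */
print counts[rowname=level];     /* numeric values supply the row headers */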

The post Use numeric values for column headers when printing a matrix appeared first on The DO Loop.

July 24, 2019
 
Probability densities for Gumbel distributions. The parameter values correspond to the parameters that model the distribution of the maximum value in a sample of size n drawn from the standard normal distribution.

SAS supports more than 25 common probability distributions for the PDF, CDF, QUANTILE, and RAND functions. Of course, there are infinitely many distributions, so not every possible distribution is supported. If you need a less-common distribution, I've shown how to extend the functionality of Base SAS (by using PROC FCMP) or the SAS/IML language by defining your own functions. This article shows how to implement the Gumbel distribution in SAS. The density functions for four parameter values of the Gumbel distribution are shown to the right. As I recently discussed, the Gumbel distribution is used to model the maximum (or minimum) in a sample of size n.

Essential functions for the Gumbel Distribution

If you are going to work with a probability distribution, you need at least four essential functions: The CDF (cumulative distribution), the PDF (probability density), the QUANTILE function (inverse CDF), and a way to generate random values from the distribution. Because the RAND function in SAS supports the Gumbel distribution, you only need to define the CDF, PDF, and QUANTILE functions.

These functions are trivial to implement for the Gumbel distribution because each has an explicit formula. The following list defines each function. To match the documentation for the RAND function and PROC UNIVARIATE (which both support the Gumbel distribution), μ is used for the location parameter and σ is used for the scale parameter:

  • CDF: F(x; μ, σ) = exp(-exp(-z)), where z = (x - μ) / σ
  • PDF: f(x; μ, σ) = exp(-z-exp(-z)) / σ, where z = (x - μ) / σ
  • QUANTILE: F⁻¹(p; μ, σ) = μ - σ log(-log(p)), where LOG is the natural logarithm.

The RAND function in Base SAS and the RANDGEN function in SAS/IML support generating random Gumbel variates. Alternatively, you could generate u ~ U(0,1) and then use the QUANTILE formula to transform u into a random Gumbel variate.
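
For example, here is a minimal DATA step sketch of that alternative, which uses the quantile formula from the preceding list to transform uniform variates into Gumbel variates. (The seed and parameter values are arbitrary.)

data GumbelByInvCDF;
call streaminit(54321);              /* arbitrary seed for this sketch */
mu = 1.28;  sigma = 0.5;
do i = 1 to 5000;
   u = rand("Uniform");              /* u ~ U(0,1) */
   x = mu - sigma*log(-log(u));      /* quantile transform gives Gumbel(mu, sigma) */
   output;
end;
keep x;
run;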

Using these definitions, you can implement the functions as inline functions, or you can create a function call, such as in the following SAS/IML program:

proc iml;
start CDFGumbel(x, mu, sigma);
   z = (x-mu)/sigma;
   return exp(-exp(-z));                   * CDF of Gumbel distribution;
finish;
 
start PDFGumbel(x, mu, sigma);
   z = (x-mu)/sigma;
   return exp(-z-exp(-z)) / sigma; 
finish;
 
start QuantileGumbel(p, mu, sigma);
   return mu - sigma # log(-log(p));
finish;
 
/* Example: Call the PDFGumbel function */
mu = 3.09;                       /* params for max value of a sample of size */
sigma = 0.286;                   /*       n=1000 from the std normal distrib */
x = do(2, 5, 0.025);
PDF = PDFGumbel(x, mu, sigma);
title "Gumbel PDF";
call series(x, PDF) grid={x y};

The graph shows the density for the Gumbel(3.09, 0.286) distribution, which models the distribution of the maximum value in a sample of size n=1000 drawn from the standard normal distribution. You can see that the maximum value is typically between 2.5 and 4.5, with values near 3 being the most likely.

Plot a family of Gumbel curves

I recently discussed the best way to draw a family of curves in SAS by using PROC SGPLOT. The following DATA step defines the PDF and CDF for a family of Gumbel distributions. You can use this data set to display several curves that indicate how the distribution changes for various values of the parameters.

data Gumbel;
array n[4] _temporary_     (10   100  1000 1E6);   /* sample size */
array mu[4] _temporary_    (1.28 2.33 3.09 4.75);  /* Gumbel location parameter */
array beta [4] _temporary_ (0.5  0.36 0.29 0.2);   /* Gumbel scale parameter */
do i = 1 to dim(beta);                       /* parameters in the outer loop */
   m = mu[i]; b = beta[i];
   Params = catt("mu=", m, "; beta=", b);    /* concatenate parameters */
   do x = 0 to 6 by 0.01;
      z = (x-m)/b;
      CDF = exp( -exp(-z));                  /* CDF for Gumbel */
      PDF = exp(-z - exp(-z)) / b;           /* PDF for Gumbel */
      output;
   end;
end;
drop z;
run;
 
/* Use ODS GRAPHICS / ATTRPRIORITY=NONE 
   if you want to force the line attributes to vary in the HTML destination. */
title "The Gumbel(mu, beta) Cumulative Distribution";
proc sgplot data=Gumbel;
   label cdf="Probability";
   series x=x y=cdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid;  yaxis grid;
run;
 
title "The Gumbel(mu, beta) Probability Density";
proc sgplot data=Gumbel;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid offsetmin=0 offsetmax=0;  yaxis grid;
run;
 
title "The Gumbel(mu, beta) Quantile Function";
proc sgplot data=Gumbel;
   label cdf="Probability";
   series x=cdf y=x / group=Params lineattrs=(thickness=2);
   keylegend / position=right sortorder=reverseauto;
   xaxis grid;  yaxis grid;
run;

The graph for the CDF is shown. A graph for the PDF is shown at the top of this article. I augmented the PDF curve with labels that show the sample size for which the Gumbel distribution is associated. The leftmost curve is the distribution of the maximum in a sample of size n=10; the rightmost curve is for a sample of size one million.

Random values and fitting parameters to data

For completeness, I will mention that you can use the RAND function to generate random values and use PROC UNIVARIATE to fit Gumbel parameters to data. For example, the following DATA step generates 5,000 random variates from Gumbel(1.28, 0.5). The call to PROC UNIVARIATE fits a Gumbel model to the data. The estimates for (μ, σ) are (1.29, 0.51).

data GumbelRand;
call streaminit(12345);
mu = 1.28;
beta = 0.5;
do i = 1 to 5000;
   x = rand("Gumbel", mu, beta);
   output;
end;
keep x;
run;
 
proc univariate data=GumbelRand;
   histogram x / Gumbel;
run;

In summary, although the PDF, CDF, and QUANTILE functions in SAS do not support the Gumbel distribution, it is trivial to implement these functions. This article visualizes the PDF and CDF of the Gumbel distributions for several parameter values. It also shows how to simulate random values from a Gumbel distribution and how to fit the Gumbel model to data.

The post Implement the Gumbel distribution in SAS appeared first on The DO Loop.

June 26, 2019
 

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such as standard errors, hypothesis tests, and p-values are less important for Big Data because, if the sample is big enough, all effects are significant and all hypothesis tests are rejected! Accordingly, many of the HP procedures do not support the same statistical tests as their non-HP cousins (for example, PROC LOGISTIC).

A SAS programmer recently posted an interesting question on the SAS Support Community. She is using PROC HPLOGISTIC for variable selection on a large data set. She obtained a final model, but she also wanted to estimate the covariance matrix for the parameters. PROC HPLOGISTIC does not support that statistic but PROC LOGISTIC does (the COVB option on the MODEL statement). She asks whether it is possible to get the covariance matrix for the final model from PROC LOGISTIC, seeing as PROC HPLOGISTIC has already fit the model? Furthermore, PROC LOGISTIC supports computing and graphing odds ratios, so is it possible to get those statistics, too?

It is an intriguing question. The answer is "yes," although PROC LOGISTIC still has to perform some work. The main idea is that you can tell PROC LOGISTIC to use the parameter estimates found by PROC HPLOGISTIC.

What portion of a logistic regression takes the most time?

The main computational burden in logistic regression is threefold:

  • Reading the data and levelizing the CLASS variables in order to form a design matrix. This is done once.
  • Forming the sum of squares and crossproducts matrix (SSCP, also called the X`X matrix). This is done once.
  • Maximizing the likelihood function. The parameter estimates require running a nonlinear optimization. During the optimization, you need to evaluate the likelihood function, its gradient, and its Hessian many times. This is an expensive operation, and it must be done for each iteration of the optimization method.

The amount of time required by each step depends on the number of observations, the number of effects in the model, the number of levels in the classification variables, and the number of computational threads available on your system. You can use the Timing table in the PROC HPLOGISTIC output to determine how much time is spent during each step for your data/model combination. For example, the following results are for a simulated data set that has 500,000 observations, 30 continuous variables, four CLASS variables (each with 3 levels), and a main-effects model. It is running on a desktop PC with four cores. The iteration history is not shown, but this model converged in six iterations, which is relatively fast.

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi OUTEST;        /* OUTEST adds the ParmName variable to the estimates */
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   ods output ParameterEstimates=PE;        /* save the estimates for later use with INEST= */
   performance details;                     /* request the Timing table */
run;

For these data and model, the Timing table tells you that the HPLOGISTIC procedure fit the model in 3.68 seconds. Of that time, about 17% was spent reading the data, levelizing the CLASS variables, and accumulating the SSCP matrix. The bulk of the time was spent evaluating the loglikelihood function and its derivatives during the six iterations of the optimization process.

Reduce time in an optimization by improving the initial guess

If you want to improve the performance of the model fitting, about the only thing you can control is the number of iterations required to fit the model. Both PROC HPLOGISTIC and PROC LOGISTIC support an INEST= option that enables you to provide an initial guess for the parameter estimates. If the guess is close to the final estimates, the optimization method will require fewer iterations to converge.

Here's the key idea: If you provide the parameter estimates from a previous run, then you don't need to run the optimization algorithm at all! You still need to read the data, form the SSCP matrix, and evaluate the loglikelihood function at the (final) parameter estimates. But you don't need any iterations of the optimization method because you know that the estimates are optimal.

How much time can you save by using this trick? Let's find out. Notice that the previous call used the ODS OUTPUT statement to output the final parameter estimates. (It also used the OUTEST option to add the ParmName variable to the output.) You can run PROC TRANSPOSE to convert the ParameterEstimates table into a data set that can be read by the INEST= option on PROC HPLOGISTIC or PROC LOGISTIC. For either procedure, set the option MAXITER=0 to bypass the optimization process, as follows:

/* create INEST= data set from ParameterEstimates */
proc transpose data=PE out=inest(type=EST) label=_TYPE_; /* name the label variable _TYPE_ */
   label Estimate=PARMS;   /* so that _TYPE_='PARMS' in the transposed data set */
   var Estimate;
   id ParmName;            /* parameter names become variable names */
run;
 
ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi INEST=inest MAXITER=0;  /* start at the previous estimates; skip the optimization */
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   performance details;
run;

The time required to read the data, levelize, and form the SSCP is unchanged. However, the time spent evaluating the loglikelihood and its derivatives is about 1/(1+NumIters) of the time from the first run, where NumIters is the number of optimization iterations (6, for these data). The second call to the HPLOGISTIC procedure still has to evaluate the loglikelihood function to form statistics such as the AIC and BIC. It also needs to evaluate the Hessian to compute standard errors of the estimates. But the total time is greatly decreased by skipping the optimization step.

Use PROC HPLOGISTIC estimates to jump-start PROC LOGISTIC

You can use the same trick with PROC LOGISTIC. You can use the INEST= and MAXITER=0 options to greatly reduce the total time for the procedure. At the same time, you can request statistics such as COVB and ODDSRATIO that are not available in PROC HPLOGISTIC, as shown in the following example:

proc logistic data=simLogi INEST=inest 
   plots(only MAXPOINTS=NONE)=oddsratio(range=clip);
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0 COVB;
   oddsratio c1;
run;

The PROC LOGISTIC step takes about 4.5 seconds. It produces odds ratios and plots for the model effects and displays the covariance matrix of the betas (COVB). By using the parameter estimates that were obtained by PROC HPLOGISTIC, it was able to avoid the expensive optimization iterations.

You can also use the STORE statement in PROC LOGISTIC to save the model to an item store. You can then use PROC PLM to create additional graphs and to run additional post-fit analyses.
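
For example, here is a minimal sketch; the item-store name (LogiStore) is arbitrary and not from the original article. It adds a STORE statement to the PROC LOGISTIC call and then uses PROC PLM to score the data without refitting the model:

proc logistic data=simLogi INEST=inest;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0;
   store LogiStore;                          /* save the fitted model to an item store */
run;
 
proc plm restore=LogiStore;
   score data=simLogi out=Pred predicted / ilink;  /* predicted probabilities on the data scale */
run;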

In summary, PROC LOGISTIC can compute statistics and hypothesis tests that are not available in PROC HPLOGISTIC. However, it is possible to fit a model by using PROC HPLOGISTIC and then use the INEST= and MAXITER=0 options to pass the parameter estimates to PROC LOGISTIC. This enables PROC LOGISTIC to skip the optimization iterations, which saves substantial computational time.

You can download the SAS program that creates the tables and graphs in this article. The program contains a DATA step that simulates logistic data of any size.

The post Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC appeared first on The DO Loop.

May 20, 2019
 

Recently I wrote about how to compute the Kolmogorov D statistic, which is used to determine whether a sample has a particular distribution. One of the beautiful facts about modern computational statistics is that if you can compute a statistic, you can use simulation to estimate the sampling distribution of that statistic. That means that instead of looking up critical values of the D statistic in a table, you can estimate the critical value by using empirical quantiles from the simulation.

This is a wonderfully liberating result! No longer are we statisticians constrained by the entries in a table in the appendix of a textbook. In fact, you could claim that modern computation has essentially killed the standard statistical table.

Obtain critical values by using simulation

Before we compute anything, let's recall a little statistical theory. If you get a headache thinking about null hypotheses and sampling distributions, you might want to skip the next two paragraphs!

When you run a hypothesis test, you compare a statistic (computed from data) to a hypothetical distribution (called the null distribution). If the observed statistic is way out in a tail of the null distribution, you reject the hypothesis that the statistic came from that distribution. In other words, the data does not seem to have the characteristic that you are testing for. Statistical tables use "critical values" to designate when a statistic is in the extreme tail. A critical value is a quantile of the null distribution; if the observed statistic is greater than the critical value, then the statistic is in the tail. (Technically, I've described a one-tailed test.)

One of the uses for simulation is to approximate the sampling distribution of a statistic when the true distribution is not known or is known only asymptotically. You can generate a large number of samples from the null hypothesis and compute the statistic on each sample. The union of the statistics approximates the true sampling distribution (under the null hypothesis) so you can use the quantiles to estimate the critical values of the null distribution.

Critical values of the Kolmogorov D distribution

You can use simulation to estimate the critical value for the Kolmogorov-Smirnov statistical test for normality. For the data in my previous article, the null hypothesis is that the sample data follow a N(59, 5) distribution. The alternative hypothesis is that they do not. The previous article computed a test statistic of D = 0.131 for the data (N = 30). If the null hypothesis is true, is that an unusual value to observe? Let's simulate 40,000 samples of size N = 30 from N(59,5) and compute the D statistic for each. Rather than use PROC UNIVARIATE, which computes dozens of statistics for each sample, you can use the SAS/IML computation from the previous article, which is very fast. The following simulation runs in a fraction of a second.

/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu    = 59;
%let sigma =  5;
%let N     = 30;
%let NumSamples = 40000;
 
proc iml;
call randseed(73);
N = &N;
i = T( 1:N );                           /* ranks */
u = i/N;                                /* ECDF height at right-hand endpoints */
um1 = (i-1)/N;                          /* ECDF height at left-hand endpoints  */
 
y = j(N, &NumSamples, .);               /* columns of Y are samples of size N */
call randgen(y, "Normal", &mu, &sigma); /* fill with random N(mu, sigma)      */
D = j(&NumSamples, 1, .);               /* allocate vector for results        */
 
do k = 1 to ncol(y);                    /* for each sample:                   */
   x = y[,k];                           /*    get sample x ~ N(mu, sigma)     */
   call sort(x);                        /*    sort sample                     */
   F = cdf("Normal", x, &mu, &sigma);   /*    CDF of reference distribution   */
   D[k] = max( F - um1, u - F );        /*    D = max( D_minus, D_plus )      */
end;
 
title "Monte Carlo Estimate of Sampling Distribution of Kolmogorov's D Statistic";
title2 "N = 30; N_MC = &NumSamples";
call histogram(D) other=
     "refline 0.131 / axis=x label='Sample D' labelloc=inside lineattrs=(color=red);";

The test statistic is right smack dab in the middle of the null distribution, so there is no reason to doubt that the sample is distributed as N(59, 5).

How big would the test statistic need to be to be considered extreme? To test the hypothesis at the α significance level, you can compute the 1 – α quantile of the null distribution. The following statements compute the critical value for α = 0.05 and N = 30:

/* estimate critical value as the 1 - alpha quantile */
alpha = 0.05;
call qntl(Dcrit_MC, D, 1-alpha);
print Dcrit_MC;

The estimated critical value for a sample of size 30 is 0.242. This compares favorably with the exact critical value from a statistical table, which gives Dcrit = 0.2417 for N = 30.

You can also use the null distribution to compute a p value for an observed statistic. The p value is estimated as the proportion of statistics in the simulation that exceed the observed value. For example, if you observe data that has a D statistic of 0.28, the estimated p value is obtained by the following statements:

Dobs = 0.28;                        /* hypothetical observed statistic */
pValue = sum(D >= Dobs) / nrow(D);  /* proportion of simulated D statistics that exceed Dobs */
print Dobs pValue;

This same technique works for any sample size, N, although most tables list critical values only for N ≤ 30. For N > 35, you can use asymptotic formulas developed by Smirnov (1948), in which the critical value has the form c(α)/sqrt(N) and the constant c(α) depends only on α.
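
The formulas were displayed as an image in the original article and are not reproduced here. A commonly cited form of the approximation (stated here as an assumption, not taken from that image) is c(α) = sqrt( -ln(α/2) / 2 ), so that D_crit ≈ sqrt( -ln(α/2) / (2N) ). The following SAS/IML sketch evaluates this approximation for α = 0.05 and N = 30:

proc iml;
alpha = 0.05;   N = 30;
c = sqrt( -log(alpha/2) / 2 );     /* constant that depends only on alpha; about 1.358 */
Dcrit_asymp = c / sqrt(N);         /* asymptotic critical value; about 0.248 for N = 30 */
print c Dcrit_asymp;               /* compare with the exact tabulated value 0.2417     */
quit;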

The Kolmogorov D statistic does not depend on the reference distribution

It is reasonable to assume that the results of this article apply only to a normal reference distribution. However, Kolmogorov proved that the sampling distribution of the D statistic is actually independent of the reference distribution. In other words, the distribution (and critical values) are the same regardless of the continuous reference distribution: beta, exponential, gamma, lognormal, normal, and so forth. That is a surprising result, which explains why there is only one statistical table for the critical values of the Kolmogorov D statistic, as opposed to having different tables for different reference distributions.
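
As a quick check of this distribution-free property (this example is a sketch, not part of the original article), you can rerun the Monte Carlo simulation with an exponential reference distribution. The estimated critical value for N = 30 is again approximately 0.24:

proc iml;
call randseed(73);
N = 30;   NumSamples = 40000;
i = T(1:N);   u = i/N;   um1 = (i-1)/N;          /* ECDF heights, as before            */
y = j(N, NumSamples, .);
call randgen(y, "Exponential");                  /* samples from the Expo(1) reference */
D = j(NumSamples, 1, .);
do k = 1 to NumSamples;
   x = y[,k];
   call sort(x);
   F = cdf("Exponential", x);                    /* CDF of the reference distribution  */
   D[k] = max( F - um1, u - F );
end;
call qntl(Dcrit, D, 0.95);
print Dcrit;                                     /* approximately 0.24, same as for N(59,5) */
quit;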

In summary, you can use simulation to estimate the critical values for the Kolmogorov D statistic. In a vectorized language such as SAS/IML, the entire simulation requires only about a dozen statements and runs extremely fast.

The post Critical values of the Kolmogorov-Smirnov test appeared first on The DO Loop.