Statistical Programming

March 18, 2020
 

A SAS/IML programmer asked about the best way to print multiple SAS/IML variables when each variable needs a different format. He wanted the output to resemble the "Parameter Estimates" table that is produced by PROC REG and other SAS/STAT procedures.

This article shows four ways to print SAS/IML vectors in a table in which each variable has a different format. The methods are:

  1. Use the SAS/IML PRINT statement and assign a format to each column. Optionally, you can also assign a label to each variable.
  2. Create a SAS/IML table and use the TablePrint statement. You can use the TableSetVarFormat subroutine to assign formats to columns.
  3. If you want to format the variables according to a SAS/STAT template, you can use the TablePrint subroutine and specify the name of the template.
  4. Write the vectors to a data set and use PROC PRINT.

An example table

Suppose you have computed a linear regression in the SAS/IML language. You have six vectors (each with three observations) that you want to print in a table that resembles the ParameterEstimates table that is produced by PROC REG and other SAS/STAT procedures. Here are the vectors:

proc iml;
factors = {'Intercept', 'MPG_Highway','Weight'};      /* Variable */
df      = {1,           1,            1};             /* DF = degrees of freedom */
est     = {-3.73649,    0.87086,      0.00011747};    /* Estimate */
SE      = {1.25091,     0.02446,      0.0001851};     /* StdErr = standard error */
t       = {-2.99,       35.6,         0.63};          /* tValue = t statistic for the test that the parameter is zero */
pvalue  = {0.0029803,   1.36E-129,    0.525902};      /* Probt = two-sided p-value for hypothesis test */

The following PRINT statement displays these vectors:
     PRINT factors df est SE t pvalue;
However, there are two problems with the output:

  1. The column headers are the names of the variables. You might want to display more meaningful column headers.
  2. By default, each vector is printed by using the BEST9. format. However, it can be desirable to print each vector in a different format.

For this table, the programmer wanted to use the same formats that PROC REG uses for the ParameterEstimates table: a 2.0 format for the DF column, a D11.3 format for the Estimate and StdErr columns, a 7.2 format for the tValue column, and a PVALUE6.4 format for the Probt column.

1. The PRINT statement

IML programmers would use the PRINT statement to print all six columns in a single table. You can use the LABEL= option (or L=, for short) on the PRINT statement to specify the column header. You can use the FORMAT= option (or F=, for short) to specify the format. Thus, you can use the following PRINT statement to create a table in which each column uses its own format:

/* 1: Use PRINT statement. Assign format to each column. 
      You cannot get a spanning header when you print individual columns */
print "-------- Parameter Estimates --------";
print factors[L='Variable']
      df[L='DF'        F=2.0]
      est[L='Estimate' F=D11.3]
      SE[L='StdErr'    F=D11.3]
      t[L='tValue'     F=7.2]
      pvalue[L='Probt' F=PVALUE6.4];

The result is acceptable. The PRINT statement enables you to apply formats to each column. Unfortunately, there is no way to put a spanning header that says "Parameter Estimates" across the top of the table, so this example uses a separate PRINT statement to emulate a header.

2. The TablePrint statement

If you have SAS/IML 14.2 or later (or the iml action in SAS Viya 3.5 or later), you can put the variables into a SAS/IML table and use the TablePrint statement to display the table. The IML language supports the TableSetVarFormat statement, which you can use to set the formats for individual columns, as shown in the following statements:

/* 2: Put variables into a table and use the TABLEPRINT statement. */
Tbl = TableCreate('Variable', factors);
call TableAddVar(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                        df || est    || SE    || t     || pvalue);
call TableSetVarFormat(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                            {'2.0' 'D11.3'   'D11.3'  '7.2'    'PVALUE6.4'});
call TablePrint(Tbl) ID='Variable' label='Parameter Estimates';

You can see that the TablePrint routine does an excellent job printing the table. You can use the ID= option to specify that the Variable column should be used as row headers. The LABEL= option puts a spanning header across the top of the table.

3. TablePrint can use a template

In the previous section, I used the TableSetVarFormat function to manually apply formats to specific columns. However, if your goal is to apply the same formats that are used by a SAS/STAT procedure, there is an easier way. You can use the TEMPLATE= option on the TablePrint subroutine to specify the name of an ODS template. You need to make sure that the names of variables in the table are the same as the names that are used in the template, but if you do that then the template will automatically format the variables and display a spanning header for the table.

/* 3. An easier way to duplicate the ParameterEstimates table from a SAS/STAT procedure. 
      Use the TEMPLATE= option. (Make sure the names of the variables are what the template expects.) */
Tbl = TableCreate('Variable', factors);
call TableAddVar(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                        df || est    || SE    || t     || pvalue);
call TablePrint(Tbl) label='Parameter Estimates'
                     template="Stat.REG.ParameterEstimates";

Remember: You can use ODS TRACE ON or the SAS/STAT documentation to find the names of the ODS tables that a procedure creates.
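
For example, the following sketch (the regression model here is chosen only for illustration) runs PROC REG with tracing turned on. The SAS log then reports the name, label, and template of each output object; for the ParameterEstimates table, the log shows Template: Stat.REG.ParameterEstimates.

ods trace on;
proc reg data=Sashelp.Cars;
   model MSRP = MPG_Highway Weight;
quit;
ods trace off;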

4. Use PROC PRINT

If none of the previous methods satisfy your needs, you can always write the data to a SAS data set and then use Base SAS methods to display the table. For example, the following statements create a SAS data set and use PROC PRINT to display the table. The LABEL statement is used to specify the column headers. The FORMAT statement is used to apply formats to the columns:

/* 4. Write a SAS data set and use PROC PRINT */
create PE var {'factors' 'df' 'est' 'SE' 't' 'pvalue'};
append;
close;
QUIT;
 
proc print data=PE noobs label;
   label factors='Variable' df='DF' est='Estimate' SE='StdErr' t='tValue' pvalue='Probt';
   format df 2.0 est SE D11.3 t 7.2 pvalue PVALUE6.4;
   var factors df est SE t pvalue;
run;

Further reading

To learn more about SAS/IML tables and other non-matrix data structures, see Wicklin (2017), "More Than Matrices: SAS/IML Software Supports New Data Structures."

The post Print SAS/IML variables with formats appeared first on The DO Loop.

March 16, 2020
 

Books about statistics and machine learning often discuss the tradeoff between bias and variance for an estimator. These discussions are often motivated by a sophisticated predictive model such as a regression or a decision tree. But the basic idea can be seen in much simpler situations. This article presents a simple situation that is discussed in a short paper by Dan Jeske (1993). Namely, if a random process produces integers with a known set of probabilities, what method should you use to predict future values of the process?

I will start by briefly summarizing Jeske's result, which uses probability theory to derive the best biased and unbiased estimators. I then present a SAS program that simulates the problem and compares two estimators, one biased and one unbiased.

A random process that produces integers

Suppose a gambler asks you to predict the next roll of a six-sided die. He will reward you based on how close your guess is to the actual value he rolls. No matter what number you pick, you only have a 1/6 chance of being correct. But if the strategy is to be close to the value rolled, you can compute the expected value of the six faces, which is 3.5. Assuming that the gambler doesn't let you guess 3.5 (which is not a valid outcome), one good strategy is to round the expected value to the nearest integer. For dice, that means you would guess ROUND(3.5) = 4. Another good strategy is to randomly guess either 3 or 4 with equal probability.
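
The following DATA step (a quick sketch; it is not part of Jeske's paper) simulates many die rolls and compares the two strategies. Because 3.5 is exactly halfway between 3 and 4, the two strategies have the same expected MSE for a fair die, so the two USS statistics should be close:

/* compare the two guessing strategies for a fair six-sided die */
data DieGuess(keep= x diffRound diffRand);
call streaminit(1);
do i = 1 to 10000;
   x = ceil(6*rand("Uniform"));                  /* roll of a fair die */
   diffRound = x - 4;                            /* always guess round(3.5) = 4 */
   diffRand  = x - (3 + rand("Bernoulli", 0.5)); /* guess 3 or 4 with equal probability */
   output;
end;
run;
 
proc means data=DieGuess n USS mean;
   var diffRound diffRand;
run;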

Jeske's paper generalizes this problem. Suppose a random process produces the integers 1, 2, ..., N, with probabilities p1, p2, ..., pN, where the sum of the probabilities is 1. (This random distribution is sometimes called the "table distribution.") If your goal is to minimize the mean squared error (MSE) between your guess and a series of future random values, Jeske shows that the optimal solution is to guess the value that is closest to the expected value of the random variable. I call this method the ROUNDING estimator. In general, this method is biased, but it has the smallest expected MSE. Recall that the MSE is a measure of the quality of an estimator (smaller is better).

An alternative method is to randomly guess either of the two integers that are closest to the expected value, giving extra weight to the integer that is closer to the expected value. I call this method the RANDOM estimator. The random estimator is unbiased, but it has a higher MSE.

An example

The following example is from Jeske's paper. A discrete process generates a random variable, X, which can take the values 1, 2, and 3 according to the following probabilities:

  • P(X=1) = 0.2, which means that the value 1 appears with probability 0.2.
  • P(X=2) = 0.3, which means that the value 2 appears with probability 0.3.
  • P(X=3) = 0.5, which means that the value 3 appears with probability 0.5.

A graph of the probabilities is shown to the right. The expected value of this random variable is E(X) = 1(0.2) + 2(0.3) + 3(0.5) = 2.3. However, your guess must be one of the feasible values of X, so you can't guess 2.3. The best prediction (in the MSE sense) is to round the expected value. Since ROUND(2.3) is 2, the best guess for this example is 2.

Recall that an estimator for X is biased if its expected value is different from the expected value of X. Since E(X) ≠ 2, the rounding estimator is biased.

You can construct an unbiased estimator by randomly choosing the values 2 and 3, which are the two closest integers to E(X). Because E(X) = 2.3 is closer to 2 than to 3, you want to choose 2 more often than 3. You can make sure that the guesses average to 2.3 by guessing 2 with probability 0.7 and guessing 3 with probability 0.3. Then the weighted average of the guesses is 2(0.7) + 3(0.3) = 2.3, and this method produces an unbiased estimate. The random estimator is unbiased, but it will have a larger MSE.

Simulate the prediction of a random integer

Jeske proves these facts for an arbitrary table distribution, but let's use SAS to simulate the problem for the previous example. The first step is to compute the expected value of X. This is done by the following DATA step, which puts the expected value into a macro variable named MEAN:

/* Compute the expected value of X where 
   P(X=1) = 0.2
   P(X=2) = 0.3
   P(X=3) = 0.5
*/
data _null_;
array prob[3] _temporary_ (0.2, 0.3, 0.5);
do i = 1 to dim(prob);
   ExpectedValue + i*prob[i];       /* Sum of i*prob[i] */
end;
call symputx("Mean", ExpectedValue);
run;
 
%put &=Mean;
MEAN=2.3

The next step is to predict future values of X. For the rounding estimator, the predicted value is always 2. For the random estimator, let k be the greatest integer less than E(X) and let F = E(X) - k be the fractional part of E(X). To get an unbiased estimator, you can randomly choose k with probability 1-F and randomly choose k+1 with probability F. This is done in the following DATA step, which makes the predictions, generates a realization of X, and computes the residual difference for each method for 1,000 random values of X:

/* If you know mean=E(X)=expected value of X, Jeske (1993) shows that round(mean) is the best 
   MSE predictor, but it is biased.
   Randomly guessing the two closest integers is the best UNBIASED MSE predictor
   https://www.academia.edu/15728006/Predicting_the_value_of_an_integer-valued_random_variable
 
   Use these two predictors for 1000 future random variates.
*/
%let NumGuesses = 1000;
data Guess(keep = x PredRound diffRound PredRand diffRand);
call streaminit(12345);
array prob[3] _temporary_ (0.2, 0.3, 0.5);  /* P(X=i) */
 
/* z = floor(z) + frac(z) where frac(z) >= 0 */
/* https://blogs.sas.com/content/iml/2020/02/10/fractional-part-of-a-number-sas.html */
k = floor(&Mean);
Frac = &Mean - k;                        /* fractional part of E(X) */
do i = 1 to &NumGuesses;
   PredRound = round(&Mean);             /* guess the nearest integer */
   PredRand = k + rand("Bern", Frac);    /* random guesses between k and k+1, weighted by Frac */
   /* The guesses are made. Now generate a new instance of X and compute residual difference */
   x = rand("Table", of prob[*]);
   diffRound = x - PredRound;            /* rounding estimate */
   diffRand  = x - PredRand;             /* unbiased estimate */
   output;
end;
run;
 
/* sure enough, ROUND is the best predictor in the MSE sense */
proc means data=Guess n USS mean;
   var diffRound DiffRand;
run;

The output from PROC MEANS shows the results of generating 1,000 random integers from X. The uncorrected sum of squares (USS) column shows the sum of the squared residuals for each estimator. (The MSE estimate is USS / 1000 for these data.) The table shows that the USS (and MSE) for the rounding estimator is smaller than for the random estimator. On the other hand, the mean of the residuals is not close to zero for the rounding method because it is a biased method. In contrast, the mean of the residuals for the random method, which is unbiased, is close to zero.

It might be easier to see the bias of the estimators if you look at the predicted values themselves, rather than at the residuals. The following call to PROC MEANS computes the sample mean for X and the two methods of predicting X:

/* the rounding method is biased; the random guess is unbiased */
proc means data=Guess n mean stddev;
   var x PredRound PredRand;
run;

This output shows that the simulated values of X have a sample mean of 2.34, which is close to the expected value. In contrast, the rounding method always predicts 2, so the sample mean for that column is exactly 2.0. The sample mean for the unbiased random method is 2.32, which is close to the expected value.

In summary, you can use SAS to simulate a simple example that compares two methods of predicting the value of a discrete random process. One method is biased but has the lowest MSE. The other is unbiased but has a larger MSE. In statistics and machine learning, practitioners often choose between an unbiased method (such as ordinary least squares regression) and a biased method (such as ridge regression or LASSO regression). The example in this article provides a very simple situation that you can use to think about these issues.

The post Predict a random integer: The tradeoff between bias and variance appeared first on The DO Loop.

March 9, 2020
 

In a previous article, I discussed the binormal model for a binary classification problem. This model assumes a set of scores that are normally distributed for each population, and the mean of the scores for the Negative population is less than the mean of scores for the Positive population. I showed how you can construct an exact ROC curve for the population.

Of course, in the real world, you do not know the distribution of the population, so you cannot obtain an exact ROC curve. However, you can estimate the ROC curve by using a random sample from the population.

This article draws a random sample from the binormal model and constructs the empirical ROC curve by using PROC LOGISTIC in SAS. The example shows an important truth that is sometimes overlooked: a sample ROC curve is a statistical estimate. Like all estimates, it is subject to sampling variation. You can visualize the variation by simulating multiple random samples from the population and overlaying the true ROC curve on the sample estimates.

Simulate data from a binormal model

The easiest way to simulate data from a binormal model is to simulate the scores. Recall the assumptions of the binormal model: all variables are continuous and there is a function that associates a score with each individual. The distribution of scores is assumed to be normal for the positive and negative populations. Thus, you can simulate the scores themselves, rather than the underlying data.

For this article, the scores for the Negative population (those who do not have a disease or condition) are distributed as N(0, 1). The scores for the Positive population (those who do have the disease or condition) are distributed as N(1.5, 0.75). The ROC curve for the population is shown to the right. It was computed by using the techniques from a previous article about binormal ROC curves.

The following SAS DATA step simulates a sample from this model. It samples n_N = 50 scores from the Negative population and n_P = 25 scores from the Positive population. The distribution of the scores in the sample is graphed below:

%let mu_N    = 0;       /* mean of Negative population */
%let sigma_N = 1;       /* std dev of Negative population */
%let mu_P    = 1.5;     /* mean of Positive population */
%let sigma_P = 0.75;    /* std dev of Positive population */
 
%let n_N = 50;     /* number of individuals from the Negative population */
%let n_P = 25;     /* number of individuals from the Positive population */
 
/* simulate one sample from the binormal model */
data BinormalSample;
call streaminit(12345);
y = 1;             /* positive population */
do i = 1 to &n_P;
   x = rand("Normal", &mu_P, &sigma_P); output;
end;
y = 0;             /* negative population */
do i = 1 to &n_N;
   x = rand("Normal", &mu_N, &sigma_N); output;
end;
drop i;
run;
 
title "Distribution of Scores for 'Negative' and 'Positive' Samples";
ods graphics / width=480px height=360px subpixel;
proc sgpanel data=BinormalSample noautolegend;
   styleattrs datacolors=(SteelBlue LightBrown);
   panelby y      / layout=rowlattice onepanel noheader;
   inset y        / position=topleft textattrs=(size=14) separator="=";
   histogram x    / group=y;
   rowaxis offsetmin=0;
   colaxis label="Score";
run;

Create an ROC plot for a sample

You can create an ROC curve by first creating a statistical model that classifies each observation into one of the two classes. You can then call PROC LOGISTIC in SAS to create the ROC curve, which summarizes the misclassification matrix (also called the confusion matrix) at various cutoff values for a threshold parameter. Although you can create an ROC curve for any predictive model, the following statements fit a logistic regression model. You can use the OUTROC= option on the MODEL statement to write the values of the sample ROC curve to a SAS data set. You can then overlay the ROC curves for the sample and for the population to see how they compare:

/* use logistic regression to classify individuals */
proc logistic data=BinormalSample noprint;
   model y(event='1') = x / outroc=ROC1;
run;
 
/* merge in the population ROC curve (the PopROC data set was computed in the previous article) */
data ROC2;
   set ROC1 PopROC;
run;
 
title "Population ROC Curve vs. Estimate";
ods graphics / width=480px height=480px;
proc sgplot data=ROC2 aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / lineattrs=(color=blue) name="est" legendlabel="Estimate";
   series x=FPR y=TPR / lineattrs=(color=black) name="pop" legendlabel="Population";
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   keylegend "est" "pop" / location=inside position=bottomright across=1 opaque;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

The graph shows the sample ROC curve (the blue, piecewise-constant curve) and the population ROC curve (the black, smooth curve). The sample ROC curve is an estimate of the population curve. For this random sample, you can see that most estimates of the true positive rate are too large (for a given value of the threshold parameter) and most estimates of the false positive rate are too low. In summary, the ROC estimate for this sample is overly optimistic about the ability of the classifier to discriminate between the positive and negative populations.

The beauty of the binormal model is that we know the true ROC curve. We can compare estimates to the truth. This can be useful when evaluating competing models: Instead of comparing sample ROC curves with each other, we can compare them to the ROC curve for the population.

Variation in the ROC estimates

If we simulate many samples from the binormal model and plot many ROC estimates, we can get a feel for the variation in the ROC estimates in random samples that have n_N = 50 and n_P = 25 observations. The following DATA step simulates B = 100 samples. The subsequent call to PROC LOGISTIC uses BY-group analysis to fit all B samples and generate the ROC curves. The ROC curves for five samples are shown below:

/* 100 samples from the binormal model */
%let NumSamples = 100;
data BinormalSim;
call streaminit(12345);
do SampleID = 1 to &NumSamples;
   y = 1;             /* sample from positive population */
   do i = 1 to &n_P;
      x = rand("Normal", &mu_P, &sigma_P); output;
   end;
   y = 0;             /* sample from negative population */
   do i = 1 to &n_N;
      x = rand("Normal", &mu_N, &sigma_N); output;
   end;
end;
drop i;
run;
 
proc logistic data=BinormalSim noprint;
   by SampleID;
   model y(event='1') = x / outroc=ROCs;
run;
 
title "5 Sample ROC Curves";
proc sgplot data=ROCs aspect=1 noautolegend;
   where SampleID <= 5;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

There is a moderate amount of variation between these curves. The brown curve looks different from the magenta curve, even though each curve results from fitting the same model to a random sample from the same population. The differences between the curves are due only to random variation in the data (scores).

You can use partial transparency to overlay all 100 ROC curves on the same graph. The following DATA step overlays the empirical estimates and the ROC curve for the population:

/* merge in the population ROC curve */
data AllROC;
   set ROCs PopROC;
run;
 
title "ROC Curve for Binormal Model";
title2 "and Empirical ROC Curves for &NumSamples Random Samples";
proc sgplot data=AllROC aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID transparency=0.9
                                lineattrs=(color=blue pattern=solid thickness=2);
   series x=FPR y=TPR / lineattrs=(color=black thickness=2);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

The graph shows 100 sample ROC curves in the background (blue) and the population ROC curve in the foreground (black). The ROC estimates show considerable variability.

Summary

In summary, this article shows how to simulate samples from a binormal model. You can use PROC LOGISTIC to generate ROC curves for each sample. By looking at the variation in the ROC curves, you can get a sense for how these estimates can vary due to random variation in the data.

The post ROC curves for a binormal sample appeared first on The DO Loop.

March 4, 2020
 

Suppose that a data set contains a set of parameter values. For each row of parameters, you need to perform some computation. A recent discussion on the SAS Support Communities mentions an important point: if there are duplicate rows in the data, a program might repeat the same computation several times. This is inefficient. An efficient alternative is to perform the computations once and store them in a list. This article describes this issue and implements the efficient alternative by using lists in SAS/IML software. Lists were introduced in SAS/IML 14.2 (SAS 9.4M4). A natural syntax for creating lists was introduced in SAS/IML 14.3 (SAS 9.4M5).

This article assumes that the matrices are not known until you read a data set that contains the parameters. The program must be able to handle computing, storing, and accessing an arbitrary number of matrices: 4, 10, or even 100. Obviously, if you know in advance the number of matrices that you need, then you can just create those matrices and give them names such as X1, X2, X3, and X4.

Example: Using powers of a matrix

To give a concrete example, suppose the following data set contains multiple values for a parameter, p:

data Powers;
input p @@;
datalines;
9 1 5 2 2 5 9 1 2 5 2 1 
;

For each value of p, you need to compute X = A^p = A*A*...*A for a matrix, A, and then use X in a subsequent computation. Here's how you might perform that computation in a straightforward way in SAS/IML without using lists:

proc iml;
/* use a small matrix, but in a real application the matrix might be large. */
A = {2 1, 1 3};
 
use Powers;                    /* read the parameters from the data set */
read all var "p";
close; 
 
/* The naive method: Compute the power A**p[i] each time you need it. */
do i = 1 to nrow(p);
  power = p[i];
  X = A**power;                /* compute each power as needed */
  /* compute with X ... */
end;

The program computes A^p for each value of p in the data set. Because the data set contains duplicate values of p, the program computes A^2 four times, A^5 three times, and A^9 two times. If A is a large matrix, it is more efficient to compute and store these matrices and reuse them as needed.

Store all possible matrices in a list

If all powers up to the maximum are likely to be needed, the easiest way to store matrix powers in a list is to find the largest value of p (call it m) and then create a list that has m items. The first item in the list is A, the second item is A^2, and so forth, up to A^m. This is implemented by using the following statements:

/* Method 1: Find upper bound of power, m. 
             Compute all matrices A^p for 1 <= p <= m. 
             Store these matrices in a list so you don't need to recompute. */
m = max(p);                    /* largest power in data */
L = ListCreate( m );
do i = 1 to m;
  L$i = A**i;                  /* compute all powers up to maximum; store A^i as i_th item */
end;
 
do i = 1 to nrow(p);           /* extract and use the matrices, as needed */
  power = p[i];
  X = L$power;                 /* or   X = ListGetItem(L, power); */
  /* compute with X ... */
end;

For these data, the list contains nine elements because 9 is the largest value in the data. The following statements show how to display a compact version of the list, which shows the matrices A^p for p=1,2,...,9.

package load ListUtil;         /* load the Struct and ListPrint subroutines */
run struct(L);
Structure of a nine-item list. The p_th item stores A^p.

The table shows the first few elements of every item in the list. In this case, the p_th item stores the 2 x 2 matrix A^p. The elements of the matrix are displayed in row-major order.

Store only the necessary matrices in a list

Notice that the data only contain four values of p: 1, 2, 5, and 9. The method in the previous section computes and stores unnecessary matrices such as A^3, A^4, and A^8. This is inefficient. Depending on the maximum value of p (such as max(p)=100), this method might be wasteful in terms of memory and computational resources.

A better solution is to compute and store only the matrices that are needed for the computation. For this example, you only need to compute and store four matrices. You can use the UNIQUE function to find the unique values of p (in sorted order). The following statements compute and store A^p for only the unique values of p in the data set:

/* Method 2: Compute matrices A^p for the unique values of p in the data.
             Store only these matrices in a named list. */
u = unique(p);                 /* find the unique values of p */
nu = ncol(u);                  /* how many unique values? */
L = ListCreate( nu );
do i = 1 to nu;
  power = u[i];                /* compute only the powers in the data; store in a named list */
  L$i =  [#"Power"  = power,   /* optional: store as a named list */
          #"Matrix" = A**power]; 
end;
 
run struct(L);
Structure of a four-item list, where each item is a named list.

For this example, each item in the list L is itself a list. I used a "named list" with items named "Power" and "Matrix" so that the items in L have context. The STRUCT subroutine shows that L contains only four items.

How can you access the matrix for, say, A^5? There are several ways, but the easiest is to use the u matrix that contains the unique values of p in sorted order. You can use the LOC function to look up the index of the item that you want. For example, A^5 is stored as the third item in the list because 5 is the third element of u. Because each item in L is itself a list, you can extract the "Matrix" item as follows:

do i = 1 to nrow(p);
  power = p[i];
  k = loc(u = power);          /* the k_th item is A^p */
  X = L$k$"Matrix";            /* extract it */
  /* compute with X ... */
  *print i power X;
end;

In summary, this article describes how to use a list to store matrices so that you can compute them once and reuse them many times. The article assumes that the number of matrices is not known in advance but is obtained by reading a data set at run time. The example stores powers of a matrix, but this technique is generally applicable.

If you have never used lists in SAS/IML, see the article "Create lists by using a natural syntax in SAS/IML" or watch this video that describes the list syntax.

The post Store pre-computed matrices in a list appeared first on The DO Loop.

February 26, 2020
 

The ROC curve is a graphical method that summarizes how well a binary classifier can discriminate between two populations, often called the "negative" population (individuals who do not have a disease or characteristic) and the "positive" population (individuals who do have it). As shown in a previous article, there is a theoretical model, called the binormal model, that describes the fundamental features in binary classification. The model assumes a set of scores that are normally distributed for each population, and the mean of the scores for the negative population is less than the mean of scores for the positive population. The figure to the right (which was discussed in the previous article) shows a threshold value (the vertical line) that a researcher can use to classify an individual as belonging to the positive or negative population, according to whether his score is greater than or less than the threshold, respectively.

In most applications, any reasonable choice of the threshold will misclassify some individuals. Members of the negative population can be misclassified, which results in a false positive (FP). Members of the positive population can be misclassified, which results in a false negative (FN). Correctly classified individuals are true negatives (TN) and true positives (TP).

Visualize the binary classification method

One way to assess the predictive accuracy of the classifier is to use the proportions of the populations that are classified correctly or are misclassified. Because the total area under a normal curve is 1, the threshold parameter divides the area into two proportions. It is instructive to look at how the proportions change as the threshold value varies. The proportions are usually called "rates." The four regions correspond to the True Negative Rate (TNR), False Positive Rate (FPR), False Negative Rate (FNR), and True Positive Rate (TPR).

For the binormal model, you can use the standard deviations of the populations to choose a suitable range for the threshold parameter. The following SAS DATA step uses the normal cumulative distribution function (CDF) to compute the proportion of each population that lies to the left and to the right of the threshold parameter for a range of values. These proportions are then plotted against the threshold parameters.

%let mu_N    = 0;       /* mean of Negative population */
%let sigma_N = 1;       /* std dev of Negative population */
%let mu_P    = 2;       /* mean of Positive population */
%let sigma_P = 0.75;    /* std dev of Positive population */
 
/* TNR = True Negative Rate (TNR)  = area to the left of the threshold for the Negative pop
   FPR = False Positive Rate (FPR) = area to the right of the threshold for the Negative pop
   FNR = False Negative Rate (FNR) = area to the left of the threshold for the Positive pop
   TPR = True Positive Rate (TPR)  = area to the right of the threshold for the Positive pop  
*/
data ClassRates;
do t = -3 to 4 by 0.1;   /* threshold cutoff value (could use mean +/- 3*StdDev) */
  TNR = cdf("Normal", t, &mu_N, &sigma_N);
  FPR = 1 - TNR;
  FNR = cdf("Normal", t, &mu_P, &sigma_P);
  TPR = 1 - FNR;
  output;
end;
run;
 
title "Classification Rates as a Function of the Threshold";
%macro opt(lab); 
   name="&lab" legendlabel="&lab" lineattrs=(thickness=3); 
%mend;
proc sgplot data=ClassRates;
   series x=t y=TNR / %opt(TNR);
   series x=t y=FPR / %opt(FPR);
   series x=t y=FNR / %opt(FNR); 
   series x=t y=TPR / %opt(TPR); 
   keylegend "TNR" "FNR" / position=NE location=inside across=1;
   keylegend "TPR" "FPR" / position=SE location=inside across=1;
   xaxis offsetmax=0.2 label="Threshold";
   yaxis label="Classification Rates";
run;

The graph shows how the classification and misclassification rates vary as you change the threshold parameter. A few facts are evident:

  • Two of the curves are redundant because FPR = 1 – TNR and TPR = 1 – FNR. Thus, it suffices to plot only two curves. A common choice is to display only the FPR and TPR curves.
  • When the threshold parameter is much less than the population means, essentially all individuals are predicted to belong to the positive population. Thus, the FPR and the TPR are both essentially 1.
  • As the parameter increases, both rates decrease monotonically.
  • When the threshold parameter is much greater than the population means, essentially no individuals are predicted to belong to the positive population. Thus, the FPR and TPR are both essentially 0.

The ROC curve

The graph in the previous section shows the FPR and TPR as functions of t, the threshold parameter. Alternatively, you can plot the parametric curve ROC(t) = (FPR(t), TPR(t)), for t ∈ (-∞, ∞). Because the FPR and TPR quantities are proportions, the curve (called the ROC curve) is always contained in the unit square [0, 1] x [0, 1]. As discussed previously, as the parameter t → -∞, the curve ROC(t) → (1, 1). As the parameter t → ∞, the curve ROC(t) → (0, 0). The main advantage of the ROC curve is that the ROC curve is independent of the scale of the population scores. In fact, the standard ROC curve does not display the threshold parameter. This means that you can compare the ROC curves from different models that might use different scores to classify the negative and positive populations.

The following call to PROC SGPLOT creates an ROC curve for the binormal model by plotting the TPR (on the vertical axis) against the FPR (on the horizontal axis). The resulting ROC curve is shown to the right.

title "ROC Curve";
title2;
proc sgplot data=ClassRates aspect=1 noautolegend;
   series x=FPR y=TPR / lineattrs=(thickness=2);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;
   yaxis grid;
run;

The standard ROC curve does not display the values of the threshold parameter. However, for instructional purposes, it can be enlightening to plot the values of a few selected threshold parameters. An example is shown in the following ROC curve. Displaying the ROC curve this way emphasizes that each point on the ROC curve corresponds to a different threshold parameter. For example, the point that corresponds to t=1 results from the classification that uses the vertical line and the binormal populations shown at the beginning of this article.
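
The following statements sketch one way to create such a graph (this is not the code that produced the original figure). The ROUND function guards against floating-point error in the DO-loop increment that created the ClassRates data set:

/* mark the ROC points that correspond to the thresholds t=0, 1, and 2 */
data ROCLabel;
   set ClassRates;
   if round(t, 0.1) in (0 1 2) then do;
      FPR2 = FPR;  TPR2 = TPR;  Label = cats("t=", round(t, 0.1));
   end;
run;
 
title "ROC Curve with Selected Threshold Values";
proc sgplot data=ROCLabel aspect=1 noautolegend;
   series x=FPR y=TPR / lineattrs=(thickness=2);
   scatter x=FPR2 y=TPR2 / datalabel=Label markerattrs=(symbol=CircleFilled);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;  yaxis grid;
run;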

Interpretation of the ROC curve

The ROC curve shows the tradeoff between correctly classifying those who have a disease/condition and those who do not. For concreteness, suppose you are trying to classify people who have cancer based on medical tests. Wherever you place the threshold cutoff, you will make two kinds of errors: You will not identify some people who actually have cancer and you will mistakenly tell other people that they have cancer when, in fact, they do not. The first error is very bad; the second error is also bad but not life-threatening. Consider three choices for the threshold parameter in the binormal model:

  • If you use the threshold value t=2, the previous ROC curve indicates that about half of those who have cancer are correctly classified (TPR=0.5) while misclassifying very few people who do not have cancer (FPR=0.02). This value of the threshold is probably not optimal because the test only identifies half of those individuals who actually have cancer.
  • If you use t=1, the ROC curve indicates that about 91% of those who have cancer are correctly classified (TPR=0.91) while misclassifying about 16% of those who do not have cancer (FPR=0.16). This value of the threshold seems more reasonable because it detects most cancers while not alarming too many people who do not have cancer.
  • As you decrease the threshold parameter, the detection rate only increases slightly, but the proportion of false positives increases rapidly. If you use t=0, the classifier identifies 99.6% of the people who have cancer, but it also mistakenly tells 50% of the non-cancer patients that they have cancer.

In general, the ROC curve helps researchers to understand the trade-offs and costs associated with false positive and false negatives.
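
You can verify the rates that are quoted in the preceding list by printing the corresponding observations from the ClassRates data set that was computed earlier:

proc print data=ClassRates noobs;
   where round(t, 0.1) in (0 1 2);
   var t FPR TPR;
run;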

Concluding remarks

In summary, the binormal ROC curve illustrates fundamental features of the binary classification problem. Typically, you use a statistical model to generate scores for the negative and positive populations. The binormal model assumes that the scores are normally distributed and that the mean of the negative scores is less than the mean of the positive scores. With that assumption, it is easy to use the normal CDF function to compute the FPR and TPR for any value of a threshold parameter. You can graph the FPR and TPR as functions of the threshold parameter, or you can create an ROC curve, which is a parametric curve that displays both rates as the parameter varies.

The binormal model is a useful theoretical model and is more applicable than you might think. If the variables in the classification problem are multivariate normal, then any linear classifier results in normally distributed scores. In addition, Krzanowski and Hand (2009, p. 34-35) state that the ROC curve is unchanged by any monotonic increasing transformation of scores, which means that the binormal model applies to any set of scores that can be transformed to normality. This is a large set, indeed, since it includes the complete Johnson system of distributions.

In practice, we do not know the distribution of scores for the population. Instead, we have to estimate the FPR and TPR by using collected data. PROC LOGISTIC in SAS can estimate an ROC curve for data by using a logistic regression classifier. Furthermore, PROC LOGISTIC can automatically create an empirical ROC curve from any set of paired observed and predicted values.
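
For example, the following sketch (the data set and variable names are hypothetical) uses the NOFIT option and the ROC statement to plot an empirical ROC curve for a column of predicted probabilities:

/* y is an observed binary response; Pred contains predicted probabilities */
proc logistic data=Have;
   model y(event='1') = Pred / nofit;
   roc 'Predicted Probabilities' pred=Pred;
   ods select ROCcurve;
run;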

The post The binormal model for ROC curves appeared first on The DO Loop.

February 19, 2020
 

Are you a statistical programmer whose company has adopted SAS Viya? If so, you probably know that the DATA step can run in parallel in SAS Cloud Analytic Services (CAS). As Secosky (2017) says, "running in a single thread in SAS is different from running in many threads in CAS." He goes on to state, "you cannot take any DATA step, change the librefs used, and have it run correctly in parallel. You ... have to know what your program is doing to make sure you know what it does when it runs in parallel."

This article discusses one aspect of "know what your program is doing." Specifically, to run in parallel, the DATA step must use only functions and statements that are "CAS-enabled." Most DATA step functions run in CAS. However, there is a set of "SAS-only" functions that do not run in CAS. This article discusses these functions and provides a link to a list of the SAS-only functions. It also shows how you can get SAS to tell you whether your DATA step contains a SAS-only function.

DATA steps that run in CAS

By default, the DATA step will attempt to run in parallel when the step satisfies three conditions (Bultman and Secosky, 2018, p. 2):

  1. All librefs in the step are CAS engine librefs to the same CAS session. (A minimal sketch of this setup appears after this list.)
  2. All statements in the step are supported by the CAS DATA step.
  3. All functions, CALL routines, formats, and informats in the step are available in CAS.
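
Here is a minimal sketch of the first condition (the session and libref names are arbitrary):

cas MySess;                            /* create a CAS session                   */
libname MyCASLib cas sessref=MySess;   /* CAS engine libref for the same session */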

The present article is concerned with the third condition. How can you know in advance whether all functions and CALL routines are available in CAS?

A list of DATA step functions that do not run in CAS

For every DATA step function, the SAS Viya documentation indicates whether the function is supported on CAS or not. For example, the following screenshots of the Viya 3.5 documentation show the documentation for the RANUNI function and the RAND function:

Notice that the documentation for the RANUNI function says, "not supported in a DATA step that runs in CAS," whereas the RAND function is in the "CAS category," which means that it is supported in CAS. This means that if you use the RANUNI function in a DATA step, the DATA step will not run in CAS. (Similarly, for the other old-style random number functions, such as RANNOR.) Instead, it will try to run in SAS. This could result in copying input data from CAS, running the program in a single thread, and copying the final data set into a CAS table. Copying all that data is not efficient.

Fortunately, you do not need to look up every function to determine if it is a CAS-enabled or SAS-only function. The documentation now includes a list, by category, of the Base SAS functions (and CALL routines) that do not run in CAS. The following screenshot shows the top of the documentation page.

Can you always rewrite a DATA step to run in CAS?

For the example in the previous section (the RANUNI function), there is a newer function (the RAND function) that has the same functionality and is supported in CAS. Thus, if you have an old SAS program that uses the RANUNI function, you can replace that call with RAND("UNIFORM") and the modified DATA step will run in CAS. Unfortunately, not all functions have a CAS-enabled replacement. There are about 200 Base SAS functions that do not run in CAS, and most of them cannot be replaced by an equivalent function that runs in CAS.
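
For the RANUNI example, the rewrite is a one-line change, as in the following sketch (MyCASLib is a CAS engine libref):

data MyCASLib.Want;
   call streaminit(1234);
   x = rand("Uniform");     /* CAS-enabled replacement for x = ranuni(1234) */
run;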

The curious reader might wonder why certain classes of functions cannot run in CAS. Here are a few sets of functions that do not run in CAS, along with a few reasons:

  • Functions specific to the English language, such as the SOUNDEX and SPEDIS functions. Also, functions that are specific to single-byte character sets (especially I18N Level 0). Most of these functions are not applicable to an international audience that uses UTF-8 encoding.
  • Functions and statements for reading and writing text files. For example, INFILE, INPUT, and FOPEN/FCLOSE. There are other ways to import text files into CAS.
  • Macro-related functions such as SYMPUT and SYMGET. Remember: There are no macro variables in CAS! The macro pre-processor is a SAS-specific feature, and one of the principles of SAS Viya is that programmers who use other languages (such as Python or Lua) should have equal access to the functionality in Viya.
  • Old-style functions for generating random numbers from probability distributions. Use the RAND function instead.
  • Functions that rely on a single-threaded execution on a data set that has ordered rows. Examples include DIF, LAG, and some permutation/combination functions such as ALLCOMB. Remember: A CAS data table does not have an inherent order.
  • The US-centric state and ZIP code functions.
  • Functions for working with Git repositories.

How to force the DATA step to stop if it is not able to run in CAS

When a DATA step runs in CAS, you will see a note in the log that says:
NOTE: Running DATA step in Cloud Analytic Services.

If the DATA step runs in SAS, no note is displayed. Suppose that you intend for a DATA step to run in CAS but you make a mistake and include a SAS-only function. What happens? The default behavior is to (silently) run the DATA step in SAS and then copy the (possibly huge) data into a CAS table. As discussed previously, this is not efficient.

You might want to tell the DATA step that it should run in CAS or else report an error. You can use the SESSREF= option to specify that the DATA step must run in a CAS session. For example, if MySess is the name of your CAS session, you can submit the following DATA step:

/* use the SESSREF= option to force the DATA step to report an ERROR if it cannot run in CAS */
data MyCASLib.Want / sessref=MySess;
   x = ranuni(1234);    /* this function is not supported in the CAS DATA step */
run;
NOTE: Running DATA step in Cloud Analytic Services.
ERROR: The function RANUNI is unknown, or cannot be accessed.
ERROR: The action stopped due to errors.

The log is shown. The NOTE says that the step was submitted to a CAS session. The first ERROR message reports that your program contains a function that is not available. The second ERROR message reports that the DATA step stopped running. A DATA step that runs on CAS calls the dataStep.runCode action, which is why the message says "the action stopped."

This is a useful trick. The SESSREF= option forces the DATA step to run in CAS. If it cannot, it tells you which function is not CAS-enabled.

Other ways to monitor where the DATA steps runs

The DATA step documentation in SAS Viya contains more information about how to control and monitor where the DATA step runs. In particular, it discusses how to use the MSGLEVEL= system option to get detailed information about where a DATA step ran and in how many threads. The documentation also includes additional examples and best practices for running the DATA step in CAS. I recommend the SAS Cloud Analytic Services: DATA Step Programming documentation as the first step towards learning the advantages and potential pitfalls of running a DATA step in CAS.
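
For example, the following statement requests the additional notes:

options msglevel=i;   /* log where the DATA step ran and how many threads it used */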

Summary

The main purpose of this article is to provide a list of Base SAS functions (and CALL routines) that do not run in CAS. If you include one of these functions in a DATA step, the DATA step cannot run in CAS. This can be inefficient. You can use the SESSREF= option on the DATA statement to force the DATA step to run in CAS. If it cannot run in CAS, it stops with an error and informs you which function is not supported in CAS.

The post A list of SAS DATA step functions that do not run in CAS appeared first on The DO Loop.

January 27, 2020
 

The Johnson system (Johnson, 1949) contains a family of four distributions: the normal distribution, the lognormal distribution, the SB distribution (which models bounded distributions), and the SU distribution (which models unbounded distributions). Note that 'B' stands for 'bounded' and 'U' stands for 'unbounded.' A previous article explains the purpose of the Johnson system and shows how to work with the Johnson SB distribution, including how to compute the probability density function (PDF), generate random variates, and estimate parameters from data. This article provides similar information for the Johnson SU distribution.

What is the Johnson SU distribution?

The SU distribution contains a location parameter, θ, and a scale parameter, σ > 0. The SU density is positive for all real numbers. The two shape parameters for the SU distribution are called δ and γ in the SAS documentation for PROC UNIVARIATE. By convention, the δ parameter is taken to be positive (δ > 0).

If X is a random variable from the Johnson SU distribution, then the variable
Z = γ + δ sinh^(-1)( (X-θ) / σ )
is standard normal. The transformation to normality is invertible. Because sinh^(-1)(t) = arsinh(t) = log(t + sqrt(1+t^2)), some authors specify the transformation in terms of logs and square roots.

Generate random variates from the SU distribution

The relationship between the SU and the normal distribution provides an easy way to generate random variates from the SU distribution. To be explicit, define Y = (Z-γ) / δ, where Z ∼ N(0,1). Then the following transformation defines a Johnson SU random variable:
X = θ + σ sinh(Y),
where sinh(t) = (exp(t) – exp(-t)) / 2.

The following SAS DATA step generates a random sample from a Johnson SU distribution:

/* Johnson SU(location=theta, scale=sigma, shape=delta, shape=gamma) */
data SU(keep= X);
call streaminit(1);
theta = 0;  sigma = 1;   gamma = -1.1;   delta = 1.5;  
do i = 1 to 1000;
   Y = (rand("Normal")-gamma) / delta;
   X = theta + sigma * sinh(Y);
   output;
end;
run;

You can use PROC UNIVARIATE in SAS to overlay the SU density on the histogram of the random variates:

proc univariate data=SU;
   histogram x / SU(theta=0 sigma=1 gamma=-1.1 delta=1.5);
   ods select Histogram;
run;

The PDF of the SU distribution

The formula for the SU density function is given in the PROC UNIVARIATE documentation (set h = v = 1 in the formula). You can evaluate the probability density function (PDF) on the interval [θ – 4 σ, θ + 4 σ] for visualization purposes. So that you can easily compare various shape parameters, the following examples use θ=0 and σ=1. The θ parameter translates the density curve and the σ parameter scales it, but these parameters do not change the shape of the curve. Therefore, only the two shape parameters are varied in the following program, which generates the PDF of the SU distribution for four combinations of parameters.

/* Visualize a few SU distributions */
data SU_pdf;
array _theta[4] _temporary_ (0 0 0 0);
array _sigma[4] _temporary_ (1 1 1 1);
array _gamma[4] _temporary_ (-1.1 -1.1  0.5  0.5);
array _delta[4] _temporary_ ( 1.5  0.8  0.8  0.5 );
sqrt2pi = sqrt(2*constant('pi'));
do i = 1 to dim(_theta);
   theta=_theta[i]; sigma=_sigma[i]; gamma=_gamma[i]; delta=_delta[i]; 
   Params = catt("gamma=", gamma, "; delta=", delta);
   do x = theta-4*sigma to theta+4*sigma by 0.01;
      c = delta / (sigma * sqrt2pi);
      w = (x-theta) / sigma;
      z = gamma + delta*arsinh(w);
      PDF = c / sqrt(1+w**2) * exp( -0.5*z**2 );
      output;
   end;
end;
drop c w z sqrt2pi;
run;
 
title "The Johnson SU Probability Density";
title2 "theta=0; sigma=1";
proc sgplot data=SU_pdf;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   xaxis grid offsetmin=0 offsetmax=0;  yaxis grid;
run;

The SU distributions have relatively large kurtosis. The graph shows that the SU distribution contains a variety of unimodal curves. The blue curve (γ=–1.1, δ=1.5) and the red curve (γ=–1.1, δ=0.8) both exhibit positive skewness whereas the other curves are closer to being symmetric.

The SU distributions are useful for modeling "heavy-tailed" distributions. A random sample from a heavy-tailed distribution typically contains more observations that are far from the mean than a random normal sample of the same size. Thus you might want to use a Johnson SU distribution to model a phenomenon in which extreme events occur somewhat frequently.

The CDF and quantiles of the SU distribution

The CDF and quantile function of the SU distribution follow from the transformation to normality. Because Z = γ + δ arsinh( (X-θ)/σ ) is standard normal and the transformation is monotone increasing (recall that δ > 0), the CDF is CDF(x) = Φ( γ + δ arsinh( (x-θ)/σ ) ), where Φ is the standard normal CDF. Inverting the transformation gives the quantile function: the quantile for a probability p is θ + σ sinh( (Φ^(-1)(p) - γ)/δ ).
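
The following DATA step (a minimal sketch that uses the shape parameters from the earlier example) evaluates both formulas by calling the CDF and QUANTILE functions for the normal distribution:

data SU_CDF_Quantile;
theta = 0;  sigma = 1;  gamma = -1.1;  delta = 1.5;
x = 2;                            /* evaluate the CDF at x */
CDF = cdf("Normal", gamma + delta*arsinh((x-theta)/sigma));
p = 0.5;                          /* evaluate the quantile for probability p */
Quantile = theta + sigma*sinh( (quantile("Normal", p) - gamma)/delta );
run;
 
proc print data=SU_CDF_Quantile noobs;
   var x CDF p Quantile;
run;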

Fitting parameters of the SU distribution

If you have data, you can use PROC UNIVARIATE to estimate the four parameters of the SU distribution that best fit the data. In an earlier section, we simulated a random sample from an SU distribution. The following call to PROC UNIVARIATE estimates the shape parameters for these simulated data:

proc univariate data=SU;
   histogram x / SU(theta=EST sigma=EST gamma=EST delta=EST
                    fitmethod=percentiles);
run;

The HISTOGRAM statement supports three methods for fitting the parameters from data. By default, the procedure uses the FITMETHOD=PERCENTILE method (Slifker and Shapiro, 1980) to estimate the parameters. You can also use FITMETHOD=MLE to obtain a maximum likelihood estimate or FITMETHOD=MOMENTS to estimate the parameters by matching the sample moments. For these data, I've used the percentile method. The method of moments does not provide a feasible solution for these data.

Summary

The Johnson SU distribution is a family that models unbounded distributions. It is especially useful for modeling distributions that have heavy tails. This article shows how to simulate random values from the SU distribution and how to visualize the probability density function (PDF). It also shows how to use PROC UNIVARIATE to fit parameters of the SU distribution to data.

The post The Johnson SU distribution appeared first on The DO Loop.

January 20, 2020
 

From the early days of probability and statistics, researchers have tried to organize and categorize parametric probability distributions. For example, Pearson (1895, 1901, and 1916) developed a system of seven distributions, which was later called the Pearson system. The main idea behind a "system" of distributions is that for each feasible combination of skewness and kurtosis, there is a member of the system that achieves the combination. A smaller system is the Johnson system (Johnson, 1949), which contains only four distributions: the normal distribution, the lognormal distribution, the SB distribution (which models bounded distributions), and the SU distribution (which models unbounded distributions).

Every member of the Johnson system can be transformed into a normal distribution by using elementary functions. Because the transformation is invertible, it is also true that every member of the Johnson system is a transformation of a normal distribution.

The Johnson SB and SU distributions have four parameters and are extremely flexible, which means that they can fit a wide range of distribution shapes. As John von Neumann once said, "With four parameters, I can fit an elephant and with five I can make him wiggle his trunk." As I indicated in an article about the moment-ratio diagram, if you specify the skewness and kurtosis of a distribution, you can find a unique member of the Johnson system that has the same skewness and kurtosis.

This article shows how to compute the four essential functions for the Johnson SB distribution: the PDF, the CDF, the quantile function, and a method to generate random variates. It also shows how to fit the parameters of the distribution to data by using PROC UNIVARIATE in SAS.

What is the Johnson SB distribution?

The SB distribution contains a threshold parameter, θ, and a scale parameter, σ > 0. The SB distribution has positive density (support) on the interval (θ, θ + σ). The two shape parameters for the SB distribution are called δ and γ in the SAS documentation for PROC UNIVARIATE. By convention, the parameter δ is taken to be positive (δ > 0).

If X is a random variable from the Johnson SB distribution, then the variable
Z = γ + δ log( (X-θ) / (θ + σ - X) )
is standard normal. Because the transformation to normality is invertible, you can use properties of the normal distribution to understand properties of the SB distribution. The relationship also enables you to generate random variates of the SB distribution from random normal variates. To be explicit, define Y = (Z-γ) / δ, where Z ∼ N(0,1). Then the following transformation defines a Johnson SB random variable:
X = ( σ exp(Y) + θ (exp(Y) + 1) ) / (exp(Y) + 1);

Interestingly, if σ=1 and θ=0, then this is the logistic transformation. So one way to get a member of the Johnson system is as a logistic transformation of a normal random variable.

Generate random variates from the SB distribution

The relationship between the SB and the normal distribution provides an easy way to generate random variates from the SB distribution. You simply generate a normal variate and then transform it, as follows:

/* Johnson SB(threshold=theta, scale=sigma, shape=gamma, shape=delta) */
data SB(keep= X);
call streaminit(1);
theta = 0;   sigma = 1;   gamma = -1.1;    delta = 1.5;
do i = 1 to 1000;
   Y = (rand("Normal")-gamma) / delta;
   expY = exp(Y);
   X = ( sigma*expY + theta*(expY + 1) ) / (expY + 1);
   output;
end;
run;
 
/* set bins: https://blogs.sas.com/content/iml/2014/08/25/bins-for-histograms.html */
proc univariate data=SB;
   histogram X / SB(theta=0 sigma=1 gamma=-1.1 delta=1.5)
                 endpoints=(0 to 1 by 0.05);
   ods select Histogram;
run;

I used PROC UNIVARIATE in SAS to overlay the SB density on the histogram of the random variates. The result follows:

Density curve for a Johnson SB distribution overlaid on a histogram of bounded data
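
Because an SB variable is, by construction, an invertible transformation of a normal variable, you can sanity-check the simulation by applying the normalizing transformation to the simulated values. The following statements are a minimal sketch that uses the same parameter values as the simulation; the NORMAL option in PROC UNIVARIATE requests goodness-of-fit tests for normality:

/* a sketch: transform the simulated SB data and test for normality */
data Normalized;
   set SB;                          /* X from the simulation above */
   theta=0; sigma=1; gamma=-1.1; delta=1.5;
   Z = gamma + delta*log( (X-theta)/(theta+sigma-X) );
   drop theta sigma gamma delta;
run;
 
proc univariate data=Normalized normal;
   var Z;
   ods select TestsForNormality;
run;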

The PDF of the SB distribution

The formula for the SB density function is given in the PROC UNIVARIATE documentation (set h = v = 1 in the formula). You can evaluate the probability density function (PDF) on the interval (θ, θ + σ) in order to visualize it. So that you can easily compare various shape parameters, the following examples use θ=0 and σ=1 and plot the density on the interval (0, 1). The θ parameter translates the density curve and the σ parameter scales it, but these parameters do not change the shape of the curve. Therefore, only the two shape parameters are varied in the following program, which generates the PDF of the SB distribution for four combinations of parameters.

/* Visualize a few SB distributions */
data SBPDF;   /* use a distinct name so the simulated SB data set is not overwritten */
array _theta[4] _temporary_ (0 0 0 0);
array _sigma[4] _temporary_ (1 1 1 1);
array _gamma[4] _temporary_ (-1.1 -1.1  0.5  0.5);
array _delta[4] _temporary_ ( 1.5  0.8  0.8  0.05 );
sqrt2pi = sqrt(2*constant('pi'));
do i = 1 to dim(_theta);
   theta=_theta[i]; sigma=_sigma[i]; gamma=_gamma[i]; delta=_delta[i]; 
   Params = catt("gamma=", gamma, "; delta=", delta);
   do x = theta+0.005 to theta+sigma-0.005 by 0.005;  /* avoid the endpoints, where the formula divides by zero */
      c = delta / (sigma * sqrt2pi);
      w = (x-theta) / sigma;
      t = (x - theta) / (theta + sigma - x);
      PDF = c / (w*(1-w)) * exp( -0.5*(gamma + delta*log(t))**2 );
      output;
   end;
end;
drop c w t sqrt2pi;
run;
 
title "The Johnson SB Probability Density";
title2 "theta=0; sigma=1";
proc sgplot data=SBPDF;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   xaxis grid offsetmin=0 offsetmax=0;  yaxis grid;
run;
Density curves for four Johnson SB distributions

The graph shows that the SB distribution supports a variety of shapes. The blue curve (γ=–1.1, δ=1.5) and the red curve (γ=–1.1, δ=0.8) both exhibit negative skewness, whereas the green curve (γ=0.5, δ=0.8) has positive skewness. Those three curves are unimodal, whereas the brown curve (γ=0.5, δ=0.05) is a U-shaped distribution. These curves might remind you of the beta distribution, which is a more familiar family of bounded distributions.

The CDF and quantiles of the SB distribution

I am not aware of explicit formulas for the cumulative distribution function (CDF) and quantile function of the SB distribution. You can use numerical integration of the PDF to evaluate the cumulative distribution. You can use numerical root-finding methods to evaluate the inverse CDF (quantile) function, but it will be an expensive computation.
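
One way to implement these ideas in SAS/IML is sketched below. The QUAD subroutine integrates the PDF to compute the CDF, and the FROOT function numerically inverts the CDF to compute a quantile. This is my own sketch, not a built-in SAS routine: it assumes θ=0 and σ=1, and the module names (SBPDF, SBCDF, SBRootEqn) are mine:

proc iml;
/* a sketch of numerical CDF and quantile computations, assuming theta=0, sigma=1 */
start SBPDF(x) global(g_gamma, g_delta);
   w = x;                                      /* w = (x-theta)/sigma */
   t = x / (1 - x);                            /* t = w / (1-w) */
   return g_delta/sqrt(2*constant('pi')) / (w#(1-w)) #
          exp( -0.5*(g_gamma + g_delta*log(t))##2 );
finish;
 
start SBCDF(x0);
   call quad(F, "SBPDF", 1e-8 || x0);          /* integrate PDF; avoid the singular endpoint */
   return F;
finish;
 
start SBRootEqn(x) global(g_p);                /* solve CDF(x) = p */
   return SBCDF(x) - g_p;
finish;
 
g_gamma = -1.1;  g_delta = 1.5;
CDF = SBCDF(0.75);                             /* CDF at x=0.75 */
g_p = 0.5;
median = froot("SBRootEqn", {0.001 0.999});    /* quantile for p=0.5 */
print CDF median;
QUIT;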

Fitting parameters of the SB distribution

If you have data, you can use PROC UNIVARIATE to estimate four parameters of the SB distribution that fit the data. In practice, often the threshold (and sometimes the scale) is known from domain-specific knowledge. For example, if you are modeling student test scores, you might know that the scores are always in the range [0, 100], so θ=0 and σ=100.

In an earlier section, we simulated a data set for which all values are in the interval [0, 1]. The following call to PROC UNIVARIATE estimates the shape parameters for these simulated data:

proc univariate data=SB;
   histogram X / SB(theta=0 sigma=1 fitmethod=moments) 
                 endpoints=(0 to 1 by 0.05);
run;
Fit parameters to the Johnson SB distribution by using PROC UNIVARIATE in SAS

By default, the HISTOGRAM statement uses the FITMETHOD=PERCENTILE method to fit the data. You can also use FITMETHOD=MLE to obtain a maximum likelihood estimate, or FITMETHOD=MOMENTS to estimate the parameters by matching the sample moments. In this example, I show how to use the FITMETHOD=MOMENTS option, which, for these data, seems to provide a better fit than the method of percentiles.
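
For comparison, you can refit by changing only the FITMETHOD= value. For example, the following sketch requests the maximum likelihood fit for the same simulated data:

proc univariate data=SB;
   histogram X / SB(theta=0 sigma=1 fitmethod=MLE)
                 endpoints=(0 to 1 by 0.05);
run;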

Summary

The Johnson system is a four-parameter system that contains four families of distributions. If you choose any feasible combination of skewness and kurtosis, you can find a member of the Johnson system that has that same skewness and kurtosis. The SB distribution is a family that models bounded distributions. This article shows how to simulate random values from the SB distribution and how to visualize the probability density function (PDF). It also shows how to use PROC UNIVARIATE to fit parameters of the SB distribution to data.

The post The Johnson SB distribution appeared first on The DO Loop.

11月 252019
 

What is an efficient way to evaluate a multivariate quadratic polynomial in p variables? The answer is to use matrix computations! A multivariate quadratic polynomial can be written as the sum of a purely quadratic term (degree 2), a purely linear term (degree 1), and a constant term (degree 0). The purely quadratic term is called a quadratic form. This article shows how to use matrix computations to efficiently evaluate a multivariate quadratic polynomial.

Quadratic polynomials and matrix expressions

Visualization of a quadratic polynomial in SAS

As I wrote in a previous article about optimizing a quadratic function, the matrix of second derivatives and the gradient of first derivatives appear in the matrix representation of a quadratic polynomial. I will use the same example as in my previous article, namely the following quadratic polynomial in two variables:
f(x,y) = 9x² + xy + 4y² - 12x - 4y + 6
       = (1/2) x` Q x + L` x + 6
where Q = {18 1, 1 8} is a 2 x 2 symmetric matrix, L = {-12, -4} is a column vector, and x = {x, y} is a column vector that represents the coordinates at which to evaluate the quadratic function.

The graph at the right visualizes this quadratic function by using a heat map. Small values of the function are shown in white. Larger values of the function are shown in blues and greens. The largest values are shown in reds and blacks. The global minimum of this function is approximately (x, y) = (0.64, 0.42), and that point is indicated by a star.

This example generalizes. Every quadratic function in p variables can be written as the sum of a quadratic form (1/2 x` Q x), a linear term (L` x) and a constant. Here Q is a p x p symmetric matrix of second derivatives (the Hessian matrix of f) and L and x are p-dimensional column vectors.

Evaluate quadratic polynomial in SAS

Because the computation involves vectors and matrices, the SAS/IML language is the natural place to use matrices to evaluate a quadratic function. The following SAS/IML statements implement a simple function to evaluate a quadratic polynomial (given in terms of Q, L, and a constant) at an arbitrary two-dimensional vector, x:

proc iml;
/* Evaluate  f(x) = 0.5 * x` * Q * x + L`*x + const, where
   Q is p x p symmetric matrix, 
   L and x are column vectors with p elements.
   This version evaluates ONE vector x and returns a scalar value. */
start EvalQuad(x, Q, L, const=0);
   return 0.5 * x`*Q*x + L`*x + const;
finish;
 
/* compute Q and L for f(x,y) = 9*x##2 + x#y + 4*y##2 - 12*x - 4*y + 6 */
Q = {18 1,             /* matrix of second derivatives */
      1 8};
L = { -12, -4};        /* use column vectors for L and x */
const = 6;
x0 = {0, -1};          /* evaluate polynomial at this point */
f = EvalQuad(x0, Q, L, const);
print f[L="f(0,-1)"];

Evaluate a quadratic polynomial at multiple points

When I implement a function in the SAS/IML language, I try to "vectorize" it so that it can evaluate multiple points in a single call. Often you can use matrix operations to vectorize a function evaluation, but I don't see how to make the math work for this problem. The natural way to evaluate a quadratic polynomial at k vectors X1, X2, ..., Xk, is to pack those vectors into a p x k matrix X such that each column of X is a point at which to evaluate the polynomial. Unfortunately, the matrix computation of the quadratic form M = 0.5 * X`*Q*X results in a k x k matrix. Only the k diagonal elements are needed for evaluating the polynomial on the k input vectors, so although it is possible to compute M, doing so would be very inefficient.
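
To make the cost concrete, the following statements (a sketch that continues the PROC IML session above) pack five points into the columns of a matrix X, compute the full k x k matrix, and keep only its diagonal:

/* a sketch of the "full matrix" computation described in the text:
   the columns of the p x k matrix X are the evaluation points */
X = {-1 -0.5 0  0.5 1 ,
     -3 -2  -1  0   1 };
M = 0.5 * X`*Q*X;                 /* k x k matrix; only the diagonal is needed */
f = vecdiag(M)` + L`*X + const;   /* row vector of the k polynomial values */
print f;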

In this case, it seems more efficient to loop over the columns of X. The following statements define a SAS/IML module that evaluates a quadratic polynomial at every column of X and returns a row vector of the results. The module is demonstrated by calling it on a matrix that has five columns.

/* Evaluate the quadratic function at each column of X and return a row vector. */
start EvalQuadVec(X, Q, L, const=0);
   f = j(1, ncol(X), .);
   do i = 1 to ncol(X);
      v = X[,i];
      f[i] = 0.5 * v`*Q*v + L`*v + const;
   end;
   return f;
finish;
 
/*    X1  X2  X3 X4  X5  */
vx = {-1 -0.5 0  0.5 1 ,
      -3 -2  -1  0   1 };
f = EvalQuadVec(vx, Q, L, const);   /* pass the optional argument positionally */
print (vx // f)[r={'x' 'y' 'f(x,y)'} c=('X1':'X5')];

Evaluate a quadratic polynomial on a uniform grid of points

You can use the EvalQuadVec function to evaluate a quadratic polynomial on any set of multiple points. In particular, you can use the ExpandGrid function to construct a regular 2-D grid of points. By evaluating the function at each point on the grid, you can visualize the function. The following statements create a heat map of the function on a regular grid. The heat map is shown at the top of this article.

x = do(-1, 1, 0.1); 
y = do(-3, 1.5, 0.1);
xy = expandgrid(x, y);             /* 966 x 2 matrix */
f = EvalQuadVec(xy`, Q, L, const); /* evaluate polynomial at all points */
 
/* write results to a SAS data set and visualize the function by using a heat map */
M = xy || f`;
create Heatmap from M[c={'x' 'y' 'f'}];  append from M;  close;
QUIT;
 
data optimal;
   xx=0.64; yy=0.42;  /* optional: add the optimal (x,y) value */
run;
data All;  set Heatmap optimal; run;
 
title "Heat Map of Quadratic Function";
proc sgplot data=All;
   heatmapparm x=x y=y colorresponse=f / colormodel= (WHITE CYAN YELLOW RED BLACK);
   scatter x=xx y=yy / markerattrs=(symbol=StarFilled);
   xaxis offsetmin=0 offsetmax=0;
   yaxis offsetmin=0 offsetmax=0;
run;

Quadratic approximations

Perhaps you do not often use quadratic polynomials. This technique is useful even for general nonlinear functions because it enables you to find the best quadratic approximation to a multivariate function at any point x0. The multivariate Taylor series at the point x0, truncated at second order, is
f(x) ≈ f(x0) + L` · (x - x0) + (1/2) (x - x0)` · Q · (x - x0)
where L = ∇f(x0) is the gradient of f evaluated at x0 and Q = D²f(x0) is the symmetric Hessian matrix of second derivatives of f evaluated at x0.
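
If you want to construct the quadratic approximation programmatically, the NLPFDD subroutine in SAS/IML returns the function value, gradient, and Hessian at a point by using finite differences. The following sketch applies it to a hypothetical nonlinear function:

proc iml;
start Func(x);
   return exp(-x[1]##2) + (x[1] - x[2])##2;    /* a hypothetical nonlinear function */
finish;
 
x0 = {1 -1};                                   /* expansion point (row vector) */
call nlpfdd(f0, L, Q, "Func", x0);             /* f0 = f(x0); L = gradient; Q = Hessian */
 
x = {1.1 -0.9};                                /* a nearby point */
d = x - x0;
fQuad = f0 + L*d` + 0.5*d*Q*d`;                /* quadratic approximation at x */
fTrue = Func(x);
print fTrue fQuad;
QUIT;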

Summary

In summary, you can use matrix computations to evaluate a multivariate quadratic polynomial. This article shows how to evaluate a quadratic polynomial at multiple points. For a polynomial of two variables, you can use this technique to visualize quadratic functions.

The post Evaluate a quadratic polynomial in SAS appeared first on The DO Loop.

10月 162019
 

The EFFECT statement is supported by more than a dozen SAS/STAT regression procedures. Among other things, it enables you to generate spline effects that you can use to fit nonlinear relationships in data. Recently there was a discussion on the SAS Support Communities about how to interpret the parameter estimates of spline effects. This article answers that question by visualizing the spline effects.

An overview of generated effects

Spline effects are powerful because they enable you to use parametric models to fit nonlinear relationships between an independent variable and a response. Using spline effects is not much different than using polynomial effects to fit nonlinear relationships. Suppose that a response variable, Y, appears to depend on an explanatory variable, X, in a complicated nonlinear fashion. If the relationship looks quadratic or cubic, you might try to capture the relationship by introducing polynomial effects. Instead of modeling Y by X alone, you might use X, X², and X³.

Strictly speaking, polynomial effects do not need to be centered at the origin. You could translate the polynomial by some amount, k, and use shifted polynomial effects such as (X-k), (X-k)², and (X-k)³. Or you could combine these shifted polynomials with polynomials at the origin. You could even use polynomials that are shifted by different amounts, such as the constants k1, k2, and k3, as in the sketch below.
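
As a small illustration, the following DATA step builds shifted polynomial effects by hand; the shift constants k1, k2, and k3 are hypothetical values chosen only for this example:

/* a sketch: shifted polynomial effects with hypothetical shift constants */
data PolyEffects;
   set sashelp.cars(keep=EngineSize);
   k1 = 2;  k2 = 4;  k3 = 6;                /* hypothetical shifts */
   P1 = (EngineSize - k1);
   P2 = (EngineSize - k2)**2;
   P3 = (EngineSize - k3)**3;
run;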

Spline effects are similar to (shifted) polynomial effects. The constants (such as k1, k2, k3) that are used to shift the polynomials are called knots. Knots that are within the range of the data are called interior knots. Knots that are outside the range of the data are called exterior knots or boundary knots. You can read about the various kinds of spline effects that are supported by the EFFECT statement in SAS. Rather than rehash the mathematics, this article shows how you can use SAS to visualize a regression that uses splines. The visualization clarifies the meaning of the parameter estimates for the spline effects.

Output and visualize spline effects

This section shows how to output the spline effects into a SAS data set and plot the spline effects. Suppose you want to predict the MPG_City variable (miles per gallon in the city) based on the engine size. Because we will be plotting curves, the following statements sort the data by the EngineSize variable. Then the OUTDESIGN= option on the PROC GLMSELECT statement writes the spline effects to the Splines data set. For this example, I am using restricted cubic splines and four evenly spaced internal knots, but the same ideas apply to any choice of spline effects.

/* sort the input data by the spline variable so that curves plot correctly */
proc sort data=sashelp.cars out=cars;
   by EngineSize;
run;
 
/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures: GLIMMIX, GLMSELECT, LOGISTIC, PHREG, ... */
title "Restricted TPF Splines";
title2 "Four Internal Knots";
proc glmselect data=cars outdesign(addinputvars fullmodel)=Splines; /* data set contains spline effects */
   effect spl = spline(EngineSize / details       /* define spline effects */
                naturalcubic basis=tpf(noint)     /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));             /* 4 evenly spaced interior knots */
   model mpg_city = spl / selection=none;         /* fit model by using spline effects */
   ods select ParameterEstimates SplineKnots;
   ods output ParameterEstimates=PE;
quit;

The SplineKnots table shows the locations of the internal knots. There are four equally spaced knots because the procedure used the KNOTMETHOD=EQUAL(4) option. The ParameterEstimates table shows estimates for the regression coefficients for the spline effects, which are named "Spl 1", "Spl 2", and so forth. In the Splines data set, the corresponding variables are named Spl_1, Spl_2, and so forth.

But what do these spline effects look like? The following statements plot the spline effects versus the EngineSize variable, which is the variable from which the effects are generated:

proc sgplot data=Splines;
   series x=EngineSize y=Intercept / curvelabel;
   series x=EngineSize y=spl_1 / curvelabel;
   series x=EngineSize y=spl_2 / curvelabel;
   series x=EngineSize y=spl_3 / curvelabel;
   refline 2.7 4.1 5.5 / axis=x lineattrs=(color="lightgray");
   refline 6.9 / axis=x label="upper knot" labelloc=inside lineattrs=(color="lightgray");
   yaxis label="Spline Effect";
run;

As stated in the documentation for the NATURALCUBIC option, these spline effects include "an intercept, the polynomial X, and n – 2 functions that are all linear beyond the largest knot," where n is the number of knots. This example uses n=4 knots, so Spl_2 and Spl_3 are the cubic splines. You will also see different spline effects if you change to one of the other supported spline methods, such as B-splines or the truncated power functions. Try it!
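
For example, to request the default B-spline basis instead of the natural cubic TPF basis, you can change only the EFFECT statement; the rest of the PROC GLMSELECT call stays the same. A sketch of the modified statement is:

   effect spl = spline(EngineSize / details basis=bspline knotmethod=equal(4));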

The graph shows that the natural cubic splines are reminiscent of polynomial effects, but there are a few differences:

  • The spline effects (spl_2 and spl_3) are shifted away from the origin. The spl_2 effect is shifted by 2.7 units, which is the location of the first internal knot. The spl_3 effect is shifted by 4.1 units, which is the location of the second internal knot.
  • The spline effects are 0 when EngineSize is less than the first knot position (2.7). Not all splines look like this, but these effects are based on truncated power functions (the TPF option).
  • The spline effects are linear when EngineSize is greater than the last knot position (6.9). Not all splines look like this, but these effects are restricted splines.

Predicted values are linear combinations of the spline effects

Visualizing the shapes of the spline effects enables you to make sense of the ParameterEstimates table. As in all linear regression, the predicted value is a linear combination of the design variables. In this case, the predicted values are formed by
Pred = 34.96 – 5*Spl_1 + 2.2*Spl_2 – 3.9*Spl_3
You can use the SAS DATA step or PROC IML to compute that linear combination of the spline effects, as in the sketch below. The following graph shows the predicted curve and overlays the locations of the knots:
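
The following DATA step is a minimal sketch of that computation. It uses the rounded coefficients quoted above (the exact estimates are in the PE data set that the ODS OUTPUT statement created) and plots the predicted curve along with the knot locations:

/* a sketch: form predicted values from the output design matrix */
data Pred;
   set Splines;                  /* contains EngineSize, spl_1, spl_2, spl_3 */
   Pred = 34.96 - 5*spl_1 + 2.2*spl_2 - 3.9*spl_3;
run;
 
proc sgplot data=Pred;
   series x=EngineSize y=Pred;
   refline 2.7 4.1 5.5 6.9 / axis=x lineattrs=(color="lightgray");
   yaxis label="Predicted MPG_City";
run;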

The coefficient for Spl_1 is the largest effect (after the intercept). In the graph, you can see the general trend has an approximate slope of -5. The coefficient for the Spl_2 effect is 2.2, and you can see that the predicted values change slope between the second and third knots due to adding the Spl_2 effect. Without the Spl_3 effect, the predicted values would continue to rise after the third knot, but by adding in a negative multiple of Spl_3, the predicted values turn down again after the third knot.

Notice that the prediction function for the restricted cubic spline regression is linear before the first knot and after the last knot. The prediction function models nonlinear relationships between the interior knots.

Summary

In summary, the EFFECT statement in SAS regression procedures can generate spline effects for a continuous explanatory variable. The EFFECT statement supports many different types of splines. This article gives an example of using natural cubic splines (also called restricted cubic splines), which are based on the truncated power function (TPF) splines of degree 3. By outputting the spline effects to a data set and graphing them, you can get a better understanding of the meaning of the estimates of the regression coefficients. The predicted values are a linear combination of the spline effects, so the magnitude and sign of the regression coefficients indicate how the spline effects combine to predict the response.

The post Visualize a regression with splines appeared first on The DO Loop.