Aa previous article discussed the mathematical properties of the singular value decomposition (SVD) and showed how to use the SVD subroutine in SAS/IML software. This article uses the SVD to construct a low-rank approximation to an image. Applications include image compression and denoising an image.

### Construct a grayscale image

The value of each pixel in a grayscale image can be stored in a matrix where each element of the matrix is a value between 0 (off) and 1 (full intensity). I want to create a small example that is easy to view, so I'll create a small matrix that contains information for a low-resolution image of the capital letters "SVD." Each letter will be five pixels high and three pixels wide, arranges in a 7 x 13 array of 0/1 values. The following SAS/IML program creates the array and displays a heat map of the data:

```ods graphics / width=390px height=210px; proc iml; SVD = {0 0 0 0 0 0 0 0 0 0 0 0 0, 0 1 1 1 0 1 0 1 0 1 1 0 0, 0 1 0 0 0 1 0 1 0 1 0 1 0, 0 1 1 1 0 1 0 1 0 1 0 1 0, 0 0 0 1 0 1 0 1 0 1 0 1 0, 0 1 1 1 0 0 1 0 0 1 1 0 0, 0 0 0 0 0 0 0 0 0 0 0 0 0 }; A = SVD; /* ColorRamp: White Gray1 Gray2 Gray3 Black */ ramp = {CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000}; call heatmapcont(A) colorramp=ramp showlegend=0 title="Original Image" range={0,1};```

### A low-rank approximation to an image

Because the data matrix contains only five non-zero rows, the rank of the A matrix cannot be more than 5. The following statements compute the SVD of the data matrix and create a plot of the singular values. The plot of the singular values is similar to the scree plot in principal component analysis, and you can use the plot to help choose the number of components to retain when approximating the data.

```call svd(U, D, V, A); /* A = U*diag(D)*V` */ title "Singular Values"; call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"};```

For this example, it looks like retaining three or five components would be good choices for approximating the data. To see how low-rank approximations work, let's generate and view the rank-1, rank-2, and rank-3 approximations:

```keep = 1:3; /* use keep=1:5 to see all approximations */ do i = 1 to ncol(keep); idx = 1:keep[i]; /* components to keep */ Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`; /* rank k approximation */ Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, stdize into [0,1] */ s = "Rank = " + char(keep[i],1); call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1}; end;```

The rank-1 approximation does a good job of determining the columns and do not contain that contain the letters. The approximation also picks out two rows that contain the horizontal strokes of the capital "S."

The rank-2 approximation refines the image and adds additional details. You can begin to make out the letters "SVD." In particular, all three horizontal strokes for the "S" are visible and you can begin to see the hole in the capital "D."

The rank-3 approximation contains enough details that someone unfamiliar with the message can read it. The "S" is reconstructed almost perfectly and the space inside the "V" and "D" is almost empty. Even though this data is five-dimensional, this three-dimensional approximation is very good.

Not only is a low-rank approximation easier to work with than the original five-dimensional data, but a low-rank approximation represents a compression of the data. The original image contains 7 x 13 = 91 values. For the rank-3 approximation, three columns of the U matrix contain 33 numbers and three columns of VT contain 15 numbers. So the total number of values required to represent the rank-3 approximation is only 48, which is almost half the number of values as for the original image.

### Denoising an image

You can use the singular value decomposition and low-rank approximations to try to eliminate random noise that has corrupted an image. Every TV detective series has shown an episode in which the police obtain a blurry image of a suspect's face or license plate. The detective asks the computer technician if she can enhance the image. With the push of a button, the blurry image is replaced by a crystal clear image that enables the police to identify and apprehend the criminal.

The image reconstruction algorithms used in modern law enforcement are more sophisticated than the SVD. Nevertheless, the SVD can do a reasonable job of removing small random noise from an image, thereby making certain features easier to see. The SVD has to have enough data to work with, so the following statements duplicate the "SVD" image four times before adding random Gaussian noise to the data. The noise has a standard deviation equal to 25% of the range of the data. The noisy data is displayed as a heat map on the range [-0.25, 1.25] by using a color ramp that includes yellow for negative values and blue for values greater than 1.

```call randseed(12345); A = (SVD // SVD) || (SVD // SVD); /* duplicate the image */ A = A + randfun( dimension(A), "Normal", 0, 0.25); /* add Gaussian noise */ /* Yellow White Gray1 Gray2 Gray3 Black Blue */ ramp2 = {CXFFEDA0 CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000 CX3182BD}; call heatmapcont(A) colorramp=ramp2 showlegend=0 title="Noisy Image" range={-0.5, 1.5};```

I think most police officers would be able to read this message in spite of the added noise, but let's see if the SVD can clean it up. The following statements compute the SVD and create a plot of the singular values:

```call svd(U, D, V, A); /* A = U*diag(D)*V` */ call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"};```

There are 14 non-zero singular values. In theory, the main signal is contained in the components that correspond to the largest singular values whereas the noise is captured by the components for the smallest singular values. For these data, the plot of the singular values suggests that three or five (or nine) components might capture the main signal while ignoring the noise. The following statements create and display the rank-3 and rank-5 approximations. Only the rank-5 approximation is shown.

```keep = {3 5}; /* which components to examine? */ do i = 1 to ncol(keep); idx = 1:keep[i]; Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`; Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, standardize into [0,1] */ s = "Rank = " + char(keep[i],2); call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1}; end;```

The denoised low-rank image is not as clear as Hollywood depicts, but it is quite readable. It is certainly good enough to identify a potential suspect. I can hear the detective mutter, "Book 'em, Danno! Murder One."

In summary, the singular value decomposition (SVD) enables you to approximate a data matrix by using a low-rank approximation. This article uses a small example for which the full data matrix is rank-5. A plot of the singular values can help you choose the number of components to retain. For this example, a rank-3 approximation represents the essential features of the data matrix.

For similar analyses and examples that use the singular value decomposition, see

In SAS software, the SVD is heavily used in the SAS Text Miner product. For an overview of how a company can use text mining to analyze customer support issues, see Sanders and DeVault (2004) "Using SAS at SAS: The Mining of SAS Technical Support."

The post The singular value decomposition and low-rank approximations appeared first on The DO Loop.

Visualizing the correlations between variables often provides insight into the relationships between variables. I've previously written about how to use a heat map to visualize a correlation matrix in SAS/IML, and Chris Hemedinger showed how to use Base SAS to visualize correlations between variables.

Recently a SAS programmer asked how to construct a bar chart that displays the pairwise correlations between variables. This visualization enables you to quickly identify pairs of variables that have large negative correlations, large positive correlations, and insignificant correlations.

In SAS, PROC CORR can computes the correlations between variables, which are stored in matrix form in the output data set. The following call to PROC CORR analyzes the correlations between all pairs of numeric variables in the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. Because of missing values, some pairwise correlations use more observations than others.

```ods exclude all; proc corr data=sashelp.Heart; /* pairwise correlation */ var _NUMERIC_; ods output PearsonCorr = Corr; /* write correlations, p-values, and sample sizes to data set */ run; ods exclude none;```

The CORR data set contains the correlation matrix, p-values, and samples sizes. The statistics are stored in "wide form," with few rows and many columns. As I previously discussed, you can use the HEATMAPCONT subroutine in SAS/IML to quickly visualize the correlation matrix:

```proc iml; use Corr; read all var "Variable" into ColNames; /* get names of variables */ read all var (ColNames) into mCorr; /* matrix of correlations */ ProbNames = "P"+ColNames; /* variables for p-values are named PX, PY, PZ, etc */ read all var (ProbNames) into mProb; /* matrix of p-values */ close Corr;   call HeatmapCont(mCorr) xvalues=ColNames yvalues=ColNames colorramp="ThreeColor" range={-1 1} title="Pairwise Correlation Matrix";```

The heat map gives an overall impression of the correlations between variables, but it has some shortcomings. First, you can't determine the magnitudes of the correlations with much precision. Second, it is difficult to compare the relative sizes of correlations. For example, which is stronger: the correlation between systolic and diastolic blood pressure or the correlation between weight and MRW? (MRW is a body-weight index.)

These shortcomings are resolved if you present the pairwise correlations as a bar chart. To create a bar chart, it is necessary to convert the output into "long form." Each row in the new data set will represent a pairwise correlation. To identify the row, you should also create a new variable that identifies the two variables whose correlation is represented. Because the correlation matrix is symmetric and has 1 on the diagonal, the long-form data set only needs the statistics for the lower-triangular portion of the correlation matrix.

Let's extract the data in SAS/IML. The following statements construct a new ID variable that identifies each new row and extract the correlations and p-values for the lower-triangular elements. The statistics are written to a SAS data set called CorrPairs. (In Base SAS, you can transform the lower-triangular statistics by using the DATA step and arrays, similar to the approach in this SAS note; feel free to post your Base SAS code in the comments.)

```numCols = ncol(mCorr); /* number of variables */ numPairs = numCols*(numCols-1) / 2; length = 2*nleng(ColNames) + 5; /* max length of new ID variable */ PairNames = j(NumPairs, 1, BlankStr(length)); i = 1; do row= 2 to numCols; /* construct the pairwise names */ do col = 1 to row-1; PairNames[i] = strip(ColNames[col]) + " vs. " + strip(ColNames[row]); i = i + 1; end; end;   lowerIdx = loc(row(mCorr) > col(mCorr)); /* indices of lower-triangular elements */ Corr = mCorr[ lowerIdx ]; Prob = mProb[ lowerIdx ]; Significant = choose(Prob > 0.05, "No ", "Yes"); /* use alpha=0.05 signif level */   create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"}; append; close; QUIT;```

You can use the HBAR statement in PROC SGPLOT to construct the bar chart. This bar chart contains 45 rows, so you need to make the graph tall and use a small font to fit all the labels without overlapping. The call to PROC SORT and the DISCRETEORDER=DATA option on the YAXIS statement ensure that the categories are displayed in order of increasing correlation.

```proc sort data=CorrPairs; by Corr; run;   ods graphics / width=600px height=800px; title "Pairwise Correlations"; proc sgplot data=CorrPairs; hbar PairNames / response=Corr group=Significant; refline 0 / axis=x; yaxis discreteorder=data display=(nolabel) labelattrs=(size=6pt) fitpolicy=none offsetmin=0.012 offsetmax=0.012 /* half of 1/k, where k=number of catgories */ colorbands=even colorbandsattrs=(color=gray transparency=0.9); xaxis grid display=(nolabel); keylegend / position=topright location=inside across=1; run;```

The bar chart (click to enlarge) enables you to see which pairs of variables are highly correlated (positively and negatively) and which have correlations that are not significantly different from 0. You can use additional colors or reference lines if you want to visually emphasize other features, such as the correlations that are larger than 0.25 in absolute value.

The bar chart is not perfect. This example, which analyzes 10 variables, is very tall with 45 rows. Among k variables there are k(k-1)/2 correlations, so the number of pairwise correlations (rows) increases quadratically with the number of variables. In practice, this chart would be unreasonably tall when there are 14 or 15 variables (about 100 rows).

Nevertheless, for 10 or fewer variables, a bar chart of the pairwise correlations provides an alternative visualization that has some advantages over a heat map of the correlation matrix. What do you think? Would this graph be useful in your work? Leave a comment.

The post Use a bar chart to visualize pairwise correlations appeared first on The DO Loop.

When someone refers to the correlation between two variables, they are probably referring to the Pearson correlation, which is the standard statistic that is taught in elementary statistics courses. Elementary courses do not usually mention that there are other measures of correlation.

Why would anyone want a different estimate of correlation? Well, the Pearson correlation, which is also known as the product-moment correlation, uses empirical moments of the data (means and standard deviations) to estimate the linear association between two variables. However, means and standard deviations can be unduly influenced by outliers in the data, so the Pearson correlation is not a robust statistic.

A simple robust alternative to the Pearson correlation is called the Spearman rank correlation, which is defined as the Pearson correlation of the ranks of each variable. (If a variable contains tied values, replace those values by their average rank.) The Spearman rank correlation is simple to compute and conceptually easy to understand. Some advantages of the rank correlation are

• The rank correlation is always in the interval [-1, 1]. For "tame" data, the Spearman and Pearson correlations are close to each other. In fact, if X and Y are bivariate normal random variables with Pearson correlation ρ, then the Spearman correlation is 6/π arcsin(ρ/2), which is very close to the identity function on [-1, 1].
• The rank correlation is robust to outliers. For example, the data set X={1, 2, 2, 5} has the same ranks as the set Y={1, 2, 2, 500}. Therefore for any third variable Z, the rank correlation between X and Z is the same as the rank correlation between Y and Z.
• The rank correlation is invariant under any monotonic increasing transformation of the data, such as LOG, EXP, and SQRT. In the previous example, the rank correlation between Z and X is the same as the rank correlation between Z and the log-transform of X, which is {log(1), log(2), log(2), log(5)}. This is in contrast to the Pearson correlation, which is only invariant under affine transformations with positive scaling factors (X → a*X + b, where a > 0).
• The rank correlation can be used for any ordinal variable. For example, if the variable X has the ordinal values {"Very Unsatisfied", "Unsatisfied", "Satisfied", "Very Satisfied"}, and the variable Y has the ordinal values {"Low", "Medium", "High"}, then you can compute a rank correlation between X and Y.

### Compute rank correlation in SAS

PROC CORR in SAS supports several measures of correlation, including the Pearson and Spearman correlations. For data without outliers, the two measures are often similar. For example, the following call to PROC CORR computes the Spearman rank correlation between three variables in the Sashelp.Class data set:

```/* Compute PEARSON and SPEARMAN rank correlation by using PROC CORR in SAS */ proc corr data=sashelp.class noprob nosimple PEARSON SPEARMAN; var height weight age; run;```

According to both statistics, these variables are very positively correlated, with correlations in the range [0.7, 0.88]. Notice that the rank correlations (the lower table) are similar to the Pearson correlations for these data. However, if the data contain outliers, the rank correlation estimate is less influenced by the magnitude of the outliers.

### Compute rank correlation manually

As mentioned earlier, the Spearman rank correlation is conceptually easy to understand. It consists of two steps: compute the ranks of each variable and compute the Pearson correlation between the ranks. It is instructive to reproduce each step in the Spearman computation. You can use PROC RANK in SAS to compute the ranks of the variables, then use PROC CORR with the PEARSON option to compute the Pearson correlation of the ranks. If the data do not contain any missing values, then the following statements implement to two steps that compute the Spearman rank correlation:

```/* Compute the Spearman rank correlation "manually" by explicitly computing ranks */ /* First compute ranks; use average rank for ties */ proc rank data=sashelp.class out=classRank ties=mean; var height weight age; ranks RankHeight RankWeight RankAge; run;   /* Then compute Pearson correlation on the ranks */ proc corr data=classRank noprob nosimple PEARSON; var RankHeight RankWeight RankAge; run;```

The resulting table of correlations is the same as in the previous section and is not shown. Although PROC CORR can compute the rank correlation directly, it is comforting that these two steps produce the same answer. Furthermore, this two-step method can be useful if you decide to implement a rank-based statistic that is not produced by any SAS procedure. This two-step method is also the way to compute the Spearman correlation of character ordinal variables because PROC CORR does not analyze character variables. However, PROC RANK supports both character and numeric variables.

If you have missing values in your data, then make sure you delete the observations that contain missing values before you call PROC RANK. Equivalently, you can use a WHERE statement to omit the missing values. For example, you could insert the following statement into the PROC RANK statements:
where height^=. & weight^=. & age^=.;

### Compute rank correlation in SAS/IML software

In the SAS/IML language, the CORR function computes the Spearman rank correlation directly, as follows. The results are the same as the results from PROC CORR, and are not shown.

```proc iml; use sashelp.class; read all var {height weight age} into X; close;   RankCorr = corr(X, "Spearman"); /* compute rank correlation */```

If you ever need to compute a rank-based statistic manually, you can also use the RANKTIE function to compute the ranks of the elements in a numerical vector, such as
ranktie(X[ ,1], "Mean");

### Summary

The Spearman rank correlation is a robust measure of the linear association between variables. It is related to the classical Pearson correlation because it is defined as the Pearson correlation between the ranks of the individual variables. It has some very nice properties, including being robust to outliers and being invariant under monotonic increasing transformations of the data. For other measures of correlation that are supported in SAS, see the PROC CORR documentation.

The post What is rank correlation? appeared first on The DO Loop.

Recently, I was asked whether SAS can perform a principal component analysis (PCA) that is robust to the presence of outliers in the data. A PCA requires a data matrix, an estimate for the center of the data, and an estimate for the variance/covariance of the variables. Classically, these estimates are the mean and the Pearson covariance matrix, respectively, but if you replace the classical estimates by robust estimates, you can obtain a robust PCA.

This article shows how to implement a classical (non-robust) PCA by using the SAS/IML language and how to modify that classical analysis to create a robust PCA.

### A classical principal component analysis in SAS

Let's use PROC PRINTCOMP perform a simple PCA. The PROC PRINCOMP results will be the basis of comparison when we implement the PCA in PROC IML. The following example is taken from

 ```proc princomp data=Crime /* add COV option for covariance analysis */ outstat=PCAStats_CORR out=Components_CORR plots=score(ncomp=4); var Murder Rape Robbery Assault Burglary Larceny Auto_Theft; id State; ods select Eigenvectors ScreePlot '2 by 1' '4 by 3'; run;   proc print data=Components_CORR(obs=5); var Prin:; run;```

To save space, the output from PROC PRINCOMP is not shown, but it includes a table of the eigenvalues and principal component vectors (eigenvectors) of the correlation matrix, as well as a plot of the scores of the observations, which are the projection of the observations onto the principal components. The procedure writes two data sets: the eigenvalues and principal components are contained in the PCAStats_CORR data set and the scores are contained in the Components_CORR data set.

### A classical principal component analysis in SAS/IML

Assume that the data consists of n observations and p variables and assume all values are nonmissing. If you are comfortable with multivariate analysis, a principal component analysis is straightforward: the principal components are the eigenvectors of a covariance or correlation matrix, and the scores are the projection of the centered data onto the eigenvector basis. However, if it has been a few years since you studied PCA, Abdi and Williams (2010) provides a nice overview of the mathematics. The following matrix computations implement the classical PCA in the SAS/IML language:

```proc iml; reset EIGEN93; /* use "9.3" algorithm, not vendor BLAS (SAS 9.4m3) */ varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"}; use Crime; read all var varNames into X; /* X = data matrix (assume nonmissing) */ close;   S = cov(X); /* estimate of covariance matrix */ R = cov2corr(S); /* = corr(X) */ call eigen(D, V, R); /* D=eigenvalues; columns of V are PCs */ scale = T( sqrt(vecdiag(S)) ); /* = std(X) */ B = (X - mean(X)) / scale; /* center and scale data about the mean */ scores = B * V; /* project standardized data onto the PCs */   print V[r=varNames c=("Prin1":"Prin7") L="Eigenvectors"]; print (scores[1:5,])[c=("Prin1":"Prin7") format=best9. L="Scores"];```

The principal components (eigenvectors) and scores for these data are identical to the same quantities that were produced by PROC PRINCOMP. In the preceding program I could have directly computed R = corr(X) and scale = std(X), but I generated those quantities from the covariance matrix because that is the approach used in the next section, which computes a robust PCA.

If you want to compute the PCA from the covariance matrix, you would use S in the EIGEN call, and define B = X - mean(X) as the centered (but not scaled) data matrix.

### A robust principal component analysis

This section is based on a similar robust PCA computation in Wicklin (2010). The main idea behind a robust PCA is that if there are outliers in the data, the covariance matrix will be unduly influenced by those observations. Therefore you should use a robust estimate of the covariance matrix for the eigenvector computation. Also, the arithmetic mean is unduly influenced by the outliers, so you should center the data by using a robust estimate of center before you form the scores.

SAS/IML software provides

 ```/* get robust estimates of location and covariance */ N = nrow(X); p = ncol(X); /* number of obs and variables */ optn = j(8,1,.); /* default options for MCD */ optn[4]= floor(0.75*N); /* override default: use 75% of data for robust estimates */ call MCD(rc,est,dist,optn,x); /* compute robust estimates */ RobCov = est[3:2+p, ]; /* robust estimate of covariance */ RobLoc = est[1, ]; /* robust estimate of location */   /* use robust estimates to perform PCA */ RobCorr = cov2corr(RobCov); /* robust correlation */ call eigen(D, V, RobCorr); /* D=eigenvalues; columns of V are PCs */ RobScale = T( sqrt(vecdiag(RobCov)) ); /* robust estimates of scale */ B = (x-RobLoc) / RobScale; /* center and scale data */ scores = B * V; /* project standardized data onto the PCs */```

Notice that the SAS/IML code for the robust PCA is very similar to the classical PCA. The only difference is that the robust estimates of covariance and location (from the MCD call) are used instead of the classical estimates.

If you want to compute the robust PCA from the covariance matrix, you would use RobCov in the EIGEN call, and define B = X - RobLoc as the centered (but not scaled) data matrix.

### Comparing the classical and robust PCA

You can create a score plot to compare the classical scores to the robust scores. The Getting Started example in the PROC PRINCOMP documentation shows the classical scores for the first three components. The documentation suggests that Nevada, Massachusetts, and New York could be outliers for the crime data.

You can write the robust scores to a SAS data set and used the SGPLOT procedure to plot the scores of the classical and robust PCA on the same scale. The first and third component scores are shown to the right, with abbreviated state names serving as labels for the markers. (Click to enlarge.) You can see that the robust first-component scores for California and Nevada are separated from the other states, which makes them outliers in that dimension. (The first principal component measures overall crime rate.) The robust third-component scores for New York and Massachusetts are also well-separated and are outliers for the third component. (The third principal component appears to contrast murder with rape and auto theft with other theft.)

Because the crime data does not have huge outliers, the robust PCA is a perturbation of the classical PCA, which makes it possible to compare the analyses. For data that have extreme outliers, the robust analysis might not resemble its classical counterpart.

If you run an analysis like this on your own data, remember that eigenvectors are not unique and so there is no guarantee that the eigenvectors for the robust correlation matrix will be geometrically aligned with the eigenvectors from the classical correlation matrix. For these data, I multiplied the second and fourth robust components by -1 because that seems to make the score plots easier to compare.

### Summary

In summary, you can implement a robust principal component analysis by using robust estimates for the correlation (or covariance) matrix and for the "center" of the data. The MCD subroutine in SAS/IML language is one way to compute a robust estimate of the covariance and location of multivariate data. The SAS/IML language also provides ways to compute eigenvectors (principal components) and project the (centered and scaled) data onto the principal components.

As discussed in Hubert, Rousseeuw, and Branden (2005), the MCD algorithm is most useful for 100 or fewer variables. They describe an alternative computation that can support a robust PCA in higher dimensions.

### References

The post Robust principal component analysis in SAS appeared first on The DO Loop.

Last week I blogged about the broken-stick problem in probability, which reminded me that the broken-stick model is one of the many techniques that have been proposed for choosing the number of principal components to retain during a principal component analysis. Recall that for a principal component analysis (PCA) of p variables, a goal is to represent most of the variation in the data by using k new variables, where hopefully k is much smaller than p. Thus PCA is known as a dimension-reduction algorithm.

Many researchers have proposed methods for choosing the number of principal components. Some methods are heuristic, others are statistical. No method is perfect. Often different techniques result in different suggestions.

This article uses SAS to implement the broken stick model and compares that method with three other simple rules for dimension reduction. A few references are provided at the end of this article.

### A principal component analysis by using PROC PRINCOMP

Let's start with an example. In SAS, you can use the PRINCOMP procedure to conduct a principal component analysis. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation. The program analyzes seven crime rates for the 50 US states in 1977. (The documentation shows how to generate the data set.) The following call generates a scree plot, which shows the proportion of variance explained by each component. It also writes the Eigenvalues table to a SAS data set:

```proc princomp data=Crime plots=scree; var Murder Rape Robbery Assault Burglary Larceny Auto_Theft; id State; ods output Eigenvalues=Evals; run;```

The panel shows two graphs that plot the numbers in the "Eigenvalues of the Correlation Matrix" table. The plot on the left is the scree plot, which is a graph of the eigenvalues. The sum of the eigenvalues is 7, which is the number of variables in the analysis. If you divide each eigenvalue by 7, you obtain the proportion of variance that each principal component explains. The graph on the right plots the proportions and the cumulative proportions.

### The scree plot as a guide to retaining components

The scree plot is my favorite graphical method for deciding how many principal components to keep. If the scree plot contains an "elbow" (a sharp change in the slopes of adjacent line segments), that location might indicate a good number of principal components (PCs) to retain. For this example, the scree plot shows a large change in slopes at the second eigenvalue and a smaller change at the fourth eigenvalue. From the graph of the cumulative proportions, you can see that the first two PCs explain 76% of the variance in the data, whereas the first four PCs explain 91%.

If "detect the elbow" is too imprecise for you, a more precise algorithm is to start at the right-hand side of the scree plot and look at the points that lie (approximately) on a straight line. The leftmost point along the trend line indicates the number of components to retain. (In geology, "scree" is rubble at the base of a cliff; the markers along the linear trend represent the rubble that can be discarded.) For the example data, the markers for components 4–7 are linear, so components 1–4 would be kept. This rule (and the scree plot) was proposed by Cattell (1966) and revised by Cattell and Jaspers (1967).

### How does the broken-stick model choose components?

D. A. Jackson (1993) says that the broken-stick method is one of the better methods for choosing the number of PCs. The method provides "a good combination of simplicity of calculation and accurate evaluation of dimensionality relative to the other statistical approaches" (p. 2212). The broken-stick model retains components that explain more variance than would be expected by randomly dividing the variance into p parts. As I discussed last week, if you randomly divide a quantity into p parts, the expected proportion of the kth largest piece is (1/p)Σ(1/i) where the summation is over the values i=k..p. For example, if p=7 then
E1 = (1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.37,
E2 = (1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.228,
E3 = (1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 7 = 0.156, and so forth.

I think of the "expected proportions" as corresponding to a null model that contains uncorrelated (noise) variables. If you plot the eigenvalues of the correlation matrix against the broken-stick proportions, the observed proportions that are higher than the expected proportions indicate of how many principal component to keep.

### The broken-stick model for retaining components

The plot to the right shows the scree plot overlaid on a dashed curve that indicates the expected proportions that result from a broken-stick model. An application of the broken-stick model keeps one PC because only the first observed proportion of variance is higher than the corresponding broken-stick proportion.

How can you compute the points on the dashed curve? The expected proportions in the broken-stick model for p variables are proportional to the cumulative sums of the sequence of ratios {1/p, 1/(p-1), ..., 1}. You can use the CUSUM function in SAS/IML to compute a cumulative sum of a sequence, as shown below. Notice that the previous call to PROC PRINCOMP used the ODS OUTPUT statement to create a SAS data set that contains the values in the Eigenvalue table. The SAS/IML program reads in that data and compares the expected proportions to the observed proportions. The number of components to retain is computed as the largest integer k for which the first k components each explain more variance than the broken-stick model (null model).

```proc iml; use Evals; read all var {"Number" "Proportion"}; close;   /* Broken Stick (Joliffe 1986; J. E. Jackson, p. 47) */ /* For random p-1 points in [0,1], expected lengths of the p subintervals are: */ p = nrow(Proportion); g = cusum(1 / T(p:1)) / p; /* expected lengths of intervals (smallest to largest) */ ExpectedLen = g[p:1]; /* reverse order: largest to smallest */   keep = 0; /* find first k for which ExpectedLen[i] < Proportion[i] if i<=k */ do i = 1 to p while(ExpectedLen[i] < Proportion[i]); keep = i; end; print Proportion ExpectedLen keep;```

As seen in the graph, only the first component is retained under the broken-stick model.

### Average of eigenvalues

The average-eigenvalue test (Kaiser-Guttman test) retains the eigenvalues that exceed the average eigenvalue. For a p x p correlation matrix, the sum of the eigenvalues is p, so the average value of the eigenvalues is 1. To account for sampling variability, Jolliffe (1972) suggested a more liberal criterion: retain eigenvalues greater than 0.7 times the average eigenvalue. These two suggestions are implemented below:

```/* Average Root (Kaiser 1960; Guttman 1954; J. E. Jackson, p. 47) */ mean = mean(Proportion); keepAvg = loc( Proportion >= mean )[<>];   /* Scaled Average Root (Joliffe 1972; J. E. Jackson, p. 47-48) */ keepScaled = loc( Proportion >= 0.7*mean )[<>]; print keepAvg keepScaled;```

### Create the broken-stick graph

For completeness, the following statement write the broken-stick proportions to a SAS data set and call PROC SGPLOT to overlay the proportion of variance for the observed data on the broken-stick model:

```/* write expected proportions for broken-stick model */ create S var {"Number" "Proportion" "ExpectedLen"}; append; close; quit;   title "Broken Stick Method for Retaining Principal Components"; proc sgplot data=S; label ExpectedLen = "Broken-Stick Rule" Proportion = "Proportion of Variance" Number = "Number of Components"; series x=Number y=ExpectedLen / lineattrs=(pattern=dash); series x=Number y=Proportion / lineattrs=(thickness=2); keylegend / location=inside position=topright across=1; xaxis grid; yaxis label = "Proportion of Variance"; run;```

### Summary

Sometimes people ask why PROC PRINCOMP doesn't automatically choose the "correct" number of PCs to use for dimension reduction. This article describes four popular heuristic rules, all which give different answers! The rules in this article are the scree test (2 or 4 components), the broken-stick rule (1 component), the average eigenvalue rule (2 components), and the scaled eigenvalue rule (3 components).

So how should a practicing statistician decide how many PCs to retain? First, remember that these guidelines do not tell you how many components to keep, they merely make suggestions. Second, recognize that any reduction of dimension requires a trade-off between accuracy (high dimensions) and interpretability (low dimensions). Third, these rules—although helpful—cannot replace domain-specific knowledge of the data. Try each suggestion and see if the resulting model contains the features in the data that are important for your analysis.

The post Dimension reduction: Guidelines for retaining principal components appeared first on The DO Loop.

Skewness is a measure of the asymmetry of a univariate distribution. I have previously shown how to compute the skewness for data distributions in SAS. The previous article computes Pearson's definition of skewness, which is based on the standardized third central moment of the data.

Moment-based statistics are sensitive to extreme outliers. A single extreme observation can radically change the mean, standard deviation, and skewness of data. It is not surprising, therefore, that there are alternative definitions of skewness. One robust definition of skewness that is intuitive and easy to compute is a quantile definition, which is also known as the Bowley skewness or Galton skewness.

### A quantile definition of skewness

The quantile definition of skewness uses Q1 (the lower quartile value), Q2 (the median value), and Q3 (the upper quartile value). You can measure skewness as the difference between the lengths of the upper quartile (Q3-Q2) and the lower quartile (Q2-Q1), normalized by the length of the interquartile range (Q3-Q1). In symbols, the quantile skewness γQ is

You can visualize this definition by using the figure to the right. For a symmetric distribution, the quantile skewness is 0 because the length Q3-Q2 is equal to the length Q2-Q1. If the right length (Q3-Q2) is larger than the left length (Q2-Q1), then the quantile skewness is positive. If the left length is larger, then the quantile skewness is negative. For the extreme cases when Q1=Q2 or Q2=Q3, the quantile skewness is ±1. Consequently, whereas the Pearson skewness can be any real value, the quantile skewness is bounded in the interval [-1, 1]. The quantile skewness is not defined if Q1=Q3, just as the Pearson skewness is not defined when the variance of the data is 0.

There is an intuitive interpretation for the quantile skewness formula. Recall that the relative difference between two quantities R and L can be defined as their difference divided by their average value. In symbols, RelDiff = (R - L) / ((R+L)/2). If you choose R to be the length Q3-Q2 and L to be the length Q2-Q1, then quantile skewness is half the relative difference between the lengths.

### Compute the quantile skewness in SAS

It is instructive to simulate some skewed data and compute the two measures of skewness. The following SAS/IML statements simulate 1000 observations from a Gamma(a=4) distribution. The Pearson skewness of a Gamma(a) distribution is 2/sqrt(a), so the Pearson skewness for a Gamma(4) distribution is 1. For a large sample, the sample skewness should be close to the theoretical value. The QNTL call computes the quantiles of a sample.

```/* compute the quantile skewness for data */ proc iml; call randseed(12345); x = j(1000, 1); call randgen(x, "Gamma", 4);   skewPearson = skewness(x); /* Pearson skewness */ call qntl(q, x, {0.25 0.5 0.75}); /* sample quartiles */ skewQuantile = (q[3] -2*q[2] + q[1]) / (q[3] - q[1]); print skewPearson skewQuantile;```

For this sample, the Pearson skewness is 1.03 and the quantile skewness is 0.174. If you generate a different random sample from the same Gamma(4) distribution, the statistics will change slightly.

### Relationship between quantile skewness and Pearson skewness

In general, there is no simple relationship between quantile skewness and Pearson skewness for a data distribution. (This is not surprising: there is also no simple relationship between a median and a mean, nor between the interquartile range and the standard deviation.) Nevertheless, it is interesting to compare the Pearson skewness to the quantile skewness for a particular probability distribution.

For many probability distributions, the Pearson skewness is a function of the parameters of the distribution. To compute the quantile skewness for a probability distribution, you can use the quantiles for the distribution. The following SAS/IML statements compute the skewness for the Gamma(a) distribution for varying values of a.

```/* For Gamma(a), the Pearson skewness is skewP = 2 / sqrt(a). Use the QUANTILE function to compute the quantile skewness for the distribution. */ skewP = do(0.02, 10, 0.02); /* Pearson skewness for distribution */ a = 4 / skewP##2; /* invert skewness formula for the Gamma(a) distribution */ skewQ = j(1, ncol(skewP)); /* allocate vector for results */ do i = 1 to ncol(skewP); Q1 = quantile("Gamma", 0.25, a[i]); Q2 = quantile("Gamma", 0.50, a[i]); Q3 = quantile("Gamma", 0.75, a[i]); skewQ[i] = (Q3 -2*Q2 + Q1) / (Q3 - Q1); /* quantile skewness for distribution */ end;   title "Pearson vs. Quantile Skewness"; title2 "Gamma(a) Distributions"; call series(skewP, skewQ) grid={x y} label={"Pearson Skewness" "Quantile Skewness"};```

The graph shows a nonlinear relationship between the two skewness measures. This graph is for the Gamma distribution; other distributions would have a different shape. If a distribution has a parameter value for which the distribution is symmetric, then the graph will go through the point (0,0). For highly skewed distributions, the quantile skewness will approach ±1 as the Pearson skewness approaches ±∞.

### Alternative quantile definitions

Several researchers have noted that there is nothing special about using the first and third quartiles to measure skewness. An alternative formula (sometimes called Kelly's coefficient of skewness) is to use deciles: γKelly = ((P90 - P50) - (P50 - P10)) / (P90 - P10). Hinkley (1975) considered the q_th and (1-q)_th quantiles for arbitrary values of q.

### Conclusions

The quantile definition of skewness is easy to compute. In fact, you can compute the statistic by hand without a calculator for small data sets. Consequently, the quantile definition provides an easy way to quickly estimate the skewness of data. Since the definition uses only quantiles, the quantile skewness is robust to extreme outliers.

At the same time, the Bowley-Galton quantile definition has several disadvantages. It uses only the central 50% of the data to estimate the skewness. Two different data sets that have the same quartile statistics will have the same quantile skewness, regardless of the shape of the tails of the distribution. And, as mentioned previously, the use of the 25th and 75th percentiles are somewhat arbitrary.

Although the Pearson skewness is widely used in the statistical community, it is worth mentioning that the quantile definition is ideal for use with a box-and-whisker plot. The Q1, Q2, and Q2 quartiles are part of every box plot. Therefore you can visually estimate the quantile skewness as the relative difference between the lengths of the upper and lower boxes.

The post A quantile definition for skewness appeared first on The DO Loop.

An important problem in machine learning is the "classification problem." In this supervised learning problem, you build a statistical model that predicts a set of categorical outcomes (responses) based on a set of input features (explanatory variables). You do this by training the model on data for which the outcomes are known. For example, researchers might want to predict the outcomes "Lived" or "Died" for patients with a certain disease. They can use data from a clinical trial to build a statistical model that uses demographic and medical measurements to predict the probability of each outcome.

SAS software provides several procedures for building parametric classification models, including the LOGISTIC and DISCRIM procedures. SAS also provides various nonparametric models, such as spline effects, additive models, and neural networks.

For each input, the statistical model predicts an outcome. Thus the model divides the input space into disjoint regions for which the first outcome is the most probable, for which the second outcome is the most probable, and so forth. In many textbooks and papers, the classification problem is illustrated by using a two-dimensional graph that shows the prediction regions overlaid with the training data, as shown in the adjacent image which visualizes a binary outcome and a linear boundary between regions. (Click to enlarge.)

1. The polygon method: A parametric model provides a formula for the boundary between regions. You can use the formula to construct polygonal regions.
2. The contour plot method: If there are two outcomes and the model provides probabilities for the first outcome, then the 0.5 contour divides the feature space into disjoint prediction regions.
3. The background grid method: You can evaluate the model on a grid of points and color each point according to the predicted outcome. You can use small markers to produce a faint indication of the prediction regions, or you can use large markers if you want to tile the graph with color.

This article uses logistic regression to discriminate between two outcomes, but the principles apply to other methods as well. The SAS documentation for the DISCRIM procedure contains some macros that visualize the prediction regions for the output from PROC DISCRIM.

### A logistic model to discriminate two outcomes

To illustrate the classification problem, consider some simulated data in which the Y variable is a binary outcome and the X1 and X2 variable are continuous explanatory variables. The following call to PROC LOGISTIC fits a logistic model and displays the parameter estimates. The STORE statement creates an item store that enables you to evaluate (score) the model on future observations. The DATA step creates a grid of evenly spaced points in the (x1, x2) coordinates, and the call to PROC PLM scores the model at those locations. In the PRED data set, GX and GY are the coordinates on the regular grid and PREDICTED is the probability that Y=1.

```proc logistic data=LogisticData; model y(Event='1') = x1 x2; store work.LogiModel; /* save model to item store */ run;   data Grid; /* create grid in (x1,x2) coords */ do x1 = 0 to 1 by 0.02; do x2 = -7.5 to 7.5 by 0.3; output; end; end; run;   proc plm restore=work.LogiModel; /* use PROC PLM to score model on a grid */ score data=Grid out=Pred(rename=(x1=gx x2=gy)) / ilink; /* evaluate the model on new data */ run;```

### The polygon method

This method is only useful for simple parametric models. Recall that the logistic function is 0.5 when its argument is zero, so the level set for 0 of the linear predictor divides the input space into prediction regions. For the parameter estimates shown to the right, the level set {(x1,x2) | 2.3565 -4.7618*x1 + 0.7959*x2 = 0} is the boundary between the two prediction regions. This level set is the graph of the linear function x2 = (-2.3565 + 4.7618*x1)/0.7959. You can compute two polygons that represent the regions: let x1 vary between [0,1] (the horizontal range of the data) and use the formula to evaluate x2, or assign x2 to be the minimum or maximum vertical value of the data.

After you have computed polygonal regions, you can use the POLYGON statement in PROC SGPLOT to visualize the regions. The graph is shown at the top of this article. The drawbacks of this method are that it requires a parametric model for which one variable is an explicit function of the other. However, it creates a beautiful image!

### The contour plot method

Given an input value, many statistical models produce probabilities for each outcome. If there are only two outcomes, you can plot a contour plot of the probability of the first outcome. The 0.5 contour divides the feature space into disjoint regions.

There are two ways to create such a contour plot. The easiest way is to use the EFFECTPLOT statement, which is supported in many SAS/STAT regression procedures. The following statements show how to use the EFFECTPLOT statement in PROC LOGISTIC to create a contour plot, as shown to the right:

```proc logistic data=LogisticData; model y(Event='1') = x1 x2; effectplot contour(x=x1 y=x2); /* 2. contour plot with scatter plot overlay */ run;```

Unfortunately, not every SAS procedure supports the EFFECTPLOT statement. An alternative is to score the model on a regular grid of points and use the Graph Template Language (GTL) to create a contour plot of the probability surface. You can read my previous article about how to use the GTL to create a contour plot.

The drawback of this method is that it only applies to binary outcomes. The advantage is that it is easy to implement, especially if the modeling procedure supports the EFFECTPLOT statement.

### The background grid method

In this method, you score the model on a grid of points to obtain the predicted outcome at each grid point. You then create a scatter plot of the grid, where the markers are colored by the outcome, as shown in the graph to the right.

When you create this graph, you get to choose how large to make the dots in the background. The image to the right uses small markers, which is the technique used by Hastie, Tibshirani, and Friedman in their book The Elements of Statistical Learning. If you use square markers and increase the size of the markers, eventually the markers tile the entire background, which makes it look like the polygon plot at the beginning of this article. You might need to adjust the vertical and horizontal pixels of the graph to get the background markers to tile without overlapping each other.

This method has several advantages. It is the most general method and can be used for any procedure and for any number of outcome categories. It is easy to implement because it merely uses the model to predict the outcomes on a grid of points. The disadvantage is that choosing the size of the background markers is a matter of trial and error; you might need several attempts before you create a graph that looks good.

### Summary

This article has shown several techniques for visualizing the predicted outcomes for a model that has two independent variables. The first model is limited to simple parametric models, the second is restricted to binary outcomes, and the third is a general technique that requires scoring the model on a regular grid of inputs. Whichever method you choose, PROC SGPLOT and the Graph Template Language in SAS can help you to visualize different methods for the classification problem in machine learning.

The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.

A SAS customer asked how to use SAS to conduct a Z test for the equality of two proportions. He was directed to the SAS Usage Note "Testing the equality of two or more proportions from independent samples." The note says to "specify the CHISQ option in the TABLES statement of PROC FREQ to compute this test," and then adds "this is equivalent to the well-known Z test for comparing two independent proportions."

You might wonder why a chi-square test for association is equivalent to a Z test for the equality of proportions. You might also wonder if there is a direct way to test the equality of proportions. This article implements the well-known test for proportions in the DATA step and compares the results to the chi-square test results. It also shows how to get this test directly from PROC FREQ by using the RISKDIFF option.

### A chi-square test for association in SAS

The SAS Usage Note poses the following problem: Suppose you want to compare the proportions responding "Yes" to a question in independent samples of 100 men and 100 women. The number of men responding "Yes" is observed to be 30 and the number of women responding Yes was 45.

You can create the data by using the following DATA step, then call PROC FREQ to analyze the association between the response variable and gender.

```data Prop; length Group \$12 Response \$3; input Group Response N; datalines; Men Yes 30 Men No 70 Women Yes 45 Women No 55 ;   proc freq data=Prop order=data; weight N; tables Group*Response / chisq; run;```

As explained in the PROC FREQ documentation, the Pearson chi-square statistic indicates an association between the variables in the 2 x 2 table. The results show that the chi-square statistic (for 1 degree of freedom) is 4.8, which corresponds to a p-value of 0.0285. The test indicates that we should reject the null hypothesis of no association at the 0.05 significance level.

As stated in the SAS Usage Note, this association test is equivalent to a Z test for whether the proportion of males who responded "Yes" equals the proportion of females who responded "Yes." The equivalence relies on a fact from probability theory: a chi-square random variable with 1 degree of freedom is the square of a random variable from the standard normal distribution. Thus the square root of the chi-square statistic is the Z statistic (up to a sign) that you get from the test of equality of two proportion. Therefore the Z statistic should be z = ±sqrt(4.8) = ±2.19. The p-value is unchanged.

### Z test for the equality of two proportions: A DATA step implmentation

For comparison, you can implement the classical Z test by applying the formulas from a textbook or from the course material from Penn State, which includes a section about comparing two proportions. The following DATA step implements the Z test for equality of proportions:

```/* Implement the Z test for pre-summarized statistics. Specify the group proportions and sizes. For formulas, see https://onlinecourses.science.psu.edu/stat414/node/268 */ %let alpha = 0.05; %let N1 = 100; /* total trials in Group1 */ %let Event1= 30; /* Number of events in Group1 */ %let N2 = 100; /* total trials in Group2 */ %let Event2= 45; /* Number of events in Group2 */   %let Side = 2; /* use L, U, or 2 for lower, upper, or two-sided test */ title "Test of H0: p1=p2 vs Ha: p1^=p2"; /* change for Side=L or U */   data zTestProp; p1Hat = &Event1 / &N1; /* observed proportion in Group1 */ var1 = p1Hat*(1-p1Hat) / &N1; /* variance in Group1 */ p2Hat = &Event2 / &N2; /* observed proportion in Group2 */ var2 = p2Hat*(1-p2Hat) / &N2; /* variance in Group2 */ /* use pooled estimate of p for test */ Diff = p1Hat - p2Hat; /* estimate of p1 = p2 */ pHat = (&Event1 + &Event2) / (&N1 + &N2); pVar = pHat*(1-pHat)*(1/&N1 + 1/&N2); /* pooled variance */ SE = sqrt(pVar); /* estimate of standard error */ Z = Diff / SE;   Side = "&Side"; if Side="L" then /* one-sided, lower tail */ pValue = cdf("normal", z); else if Side="U" then /* one-sided, upper tail */ pValue = sdf("normal", Z); /* SDF = 1 - CDF */ else if Side="2" then pValue = 2*(1-cdf("normal", abs(Z))); /* two-sided */ format pValue PVALUE6.4 Z 7.4; label pValue="Pr < Z"; drop var1 var2 pHat pVar; run;   proc print data=zTestProp label noobs; run;```

The DATA step obtains a test statistic of Z = –2.19, which is one of the square roots of the chi-square statistic in the PROC FREQ output. Notice also that the p-value from the DATA step matches the p-value from the PROC FREQ output.

### Test equality of proportions by using PROC FREQ

There is actually a direct way to test for the equality of two independent proportions: use the RISKDIFF option in the TABLES statement in PROC FREQ. In the documentation, binomial proportions are called "risks," so a "risk difference" is a difference in proportions. (Also, a "relative risk" (the RELRISK option) measures the ratio of two proportions.) Equality of proportions is equivalent to testing whether the difference of proportions (risks) is zero.

As shown in the documentation, PROC FREQ supports many options for comparing proprtions. You can use the following suboptions to reproduce the classical equality of proportions test:

1. EQUAL requests an equality test for the difference in proportion. By default, the Wald interval (METHOD=WALD) is used, but you can choose other intervals.
2. VAR=NULL specifies how to estimate the variance for the Wald interval.
3. (optional) CL=WALD outputs the Wald confidence interval for the difference.

Combining these options gives the following direct computation of the difference between two proportions:

```proc freq data=Prop order=data; weight N; tables Group*Response / riskdiff(equal var=null cl=wald); /* Wald test for equality */ run;```

The 95% (Wald) confidence interval is shown in the first table. The confidence interval is centered on the point estimate of the difference (-0.15). The interval does not contain 0, so the difference is significantly different from 0 at the 0.05 significance level.

The second table show the result of the Wald equality test. The "ASE (H0)" row gives the estimate for the (asymptotic) standard error, assuming the null hypothesis. The Z score and the two-sided p-value match the values from the DATA step computation, and the interpretation is the same.

### Summary

In summary, the SAS Usage Note correctly states that the chi-square test of association is equivalent to the Z test for the equality of proportion. To run the Z test explicitly, this article uses the SAS DATA step to implement the test when you have summary statistics. As promised, the Z statistic is one of the square roots of the chi-square statistic and the p-values are the same. The DATA step removes some of the mystery regarding the equivalence between these two tests.

However, writing DATA step code cannot match the convenience of a procedure. For raw or pre-summarized data, you can use the RISKDIFF option in PROC FREQ to run the same test (recast as a difference of proportions or "risks"). To get exactly the same confidence intervals and statistics as the classical test (which is called the Wald test), you need to add a few suboptions. The resulting output matches the DATA step computations.

The post Test for the equality of two proportions in SAS appeared first on The DO Loop.

Suppose you roll six identical six-sided dice. Chance are that you will see at least one repeated number. The probability that you will see six unique numbers is very small: only 6! / 6^6 ≈ 0.015.

This example can be generalized. If you draw a random sample with replacement from a set of n items, duplicate values occur much more frequently than most people think. This is the source of the famous "birthday matching problem" in probability, which shows that the probability of duplicate birthdays among a small group of people is higher than most people would guess. (Among 23 people, the probability of a duplicate birthday is more than 50%.)

This fact has implications for bootstrap resampling. Recall that if a sample has n observations, then a bootstrap sample is obtained by sampling n times with replacement from the data. Since most bootstrap samples contain a duplicate of at least one observation, it is also true that most samples omit at least one observation. That raises the question: On average, how many of the original observations are not present in an average bootstrap sample?

### The average bootstrap sample

I'll give you the answer: an average bootstrap sample contains 63.2% of the original observations and omits 36.8%. The book by Chernick and LaBudde (2011, p. 199) states the following result about bootstrap resamples: "If the sample size is large and we [generate many]  bootstrap samples, we will find that on average, approximately 36.8% of the original observations will be missing from the individual bootstrap samples. Another way to look at this is that for any particular observation, approximately 36.8% of the bootstrap samples will not contain it." (Emphasis added.)

You can use elementary probability to derive this result. Suppose the original data contains n observations. A bootstrap sample is generated by sampling with replacement from the data. The probability that a particular observation is not chosen from a set of n observations is 1 - 1/n, so the probability that the observation is not chosen n times is (1 - 1/n)^n. This is the probability that the observation does not appear in a bootstrap sample.

You might remember from calculus that the limit as n → ∞ of (1 - 1/n)^n is 1/e. Therefore, when n is large, the probability that an observation is not chosen is approximately 1/e ≈ 0.368.

### Simulation: the proportion of observations in bootstrap samples

If you prefer simulation to calculation, you can simulate many bootstrap samples and count the proportion of samples that do not contain observation 1, observation 2, and so forth. The average of those proportions should be close to 1/e.

The theoretical result is only valid for large values of n, but let's start with n=6 so that we can print out intermediate results during the simulation. The following SAS/IML program uses the SAMPLE function to simulate rolling six dice a total of 10 times. Each column of the matrix M contains the result of one trial:

```options linesize=128; proc iml; call randseed(54321); n = 6; NumSamples = 10; x = T(1:n); /* the original sample {1,2,...,n} */ M = sample(x, NumSamples//n); /* each column is a draw with replacement */```

The table shows that each sample (column) contains duplicates. The first column does not contain the values {4,5,6}. The second column does not contain the value 6.

Given these samples, what proportion of samples does not contain 1? What proportion does not contain 2, and so forth? For this small simulation, we can answer these questions by visual inspection. The number 1 does not appear in the sample S5, so it does not appear in 0.1 of the samples. The number 2 does not appears in S3, S5, S8, S9, or S10, so it does not appear in 0.5 of the samples. The following SAS/IML statements count the proportion of columns that do not contain each number and then takes the average of those proportions:

```/* how many samples do not contain x[1], x[2], etc */ cnt = j(n, 1, .); /* allocate space for results */ do i = 1 to n; Y = (M=x[i]); /* binary matrix */ s = Y[+, ]; /* count for each sample (column) */ cnt[i] = sum(s=0); /* number of samples that do not contain x[i] */ end; prop = cnt / NumSamples; avg_prop = mean(prop); print avg_prop;```

The result says that, on the average, a sample does not contain 0.35 of the data values. Let's increase the sample size and the number of bootstrap samples. Change the parameters to the following and rerun the program:

```n = 75; NumSamples = 10000;```

The new estimate is 0.365, which is close to the theoretical value of 1/e. If you like, you can plot a histogram of the n proportions that make up the average:

```title "Proportion of Samples That Do Not Contain an Observation"; title2 "n=75; NumSamples=10000"; call histogram(prop) label="Proprtion" other="refline "+char(avg_prop)+"/axis=x;";```

Elementary statistical theory tells us that the proportions are approximately normally distributed with mean p=1/e and standard deviation sqrt(p(1-p)/NumSamples) ≈ 0.00482. The mean and standard deviation of the simulated proportions are very close to the theoretical values.

In conclusion, when you draw n items with replacement from a large sample of size n, on average the sample contains 63.2% of the original observations and omits 36.8%. In other words, the average bootstrap sample omits 36.8% of the original data.

The post The average bootstrap sample omits 36.8% of the data appeared first on The DO Loop.

In a previous article, I showed two ways to define a log-likelihood function in SAS. This article shows two ways to compute maximum likelihood estimates (MLEs) in SAS: the nonlinear optimization subroutines in SAS/IML and the NLMIXED procedure in SAS/STAT. To illustrate these methods, I will use the same data sets from my previous post. One data set contains binomial data, the other contains data that are lognormally distributed.

### Maximum likelihood estimates for binomial data from SAS/IML

I previously wrote a step-by-step description of how to compute maximum likelihood estimates in SAS/IML. SAS/IML contains many algorithms for nonlinear optimization, including the NLPNRA subroutine, which implements the Newton-Raphson method.

In my previous article I used the LOGPDF function to define the log-likelihood function for the binomial data. The following statements define bounds for the parameter (0 < p < 1) and provides an initial guess of p0=0.5:

```/* Before running program, create Binomial and LN data sets from previous post */   /* Example 1: MLE for binomial data */ /* Method 1: Use SAS/IML optimization routines */ proc iml; /* log-likelihood function for binomial data */ start Binom_LL(p) global(x, NTrials); LL = sum( logpdf("Binomial", x, p, NTrials) ); return( LL ); finish;   NTrials = 10; /* number of trials (fixed) */ use Binomial; read all var "x"; close;   /* set constraint matrix, options, and initial guess for optimization */ con = { 0, /* lower bounds: 0 < p */ 1}; /* upper bounds: p < 1 */ opt = {1, /* find maximum of function */ 2}; /* print some output */ p0 = 0.5; /* initial guess for solution */ call nlpnra(rc, p_MLE, "Binom_LL", p0, opt, con); print p_MLE;```

The NLPNRA subroutine computes that the maximum of the log-likelihood function occurs for p=0.56, which agrees with the graph in the previous article. We conclude that the parameter p=0.56 (with NTrials=10) is "most likely" to be the binomial distribution parameter that generated the data.

### Maximum likelihood estimates for binomial data from PROC NLMIXED

If you've never used PROC NLMIXED before, you might wonder why I am using that procedure, since this problem is not a mixed modeling regression. However, you can use the NLMIXED procedure for general maximum likelihood estimation. In fact, I sometimes joke that SAS could have named the procedure "PROC MLE" because it is so useful for solving maximum likelihood problems.

PROC NLMIXED has built-in support for computing maximum likelihood estimates of data that follow the Bernoulli (binary), binomial, Poisson, negative binomial, normal, and gamma distributions. (You can also use PROC GENMOD to fit these distributions; I have shown an example of fitting Poisson data.)

The syntax for PROC NLMIXED is very simple for the Binomial data. You use the PARMS statement to supply an initial guess for the parameter p. On the MODEL statement, you declare that you want to model the X variable as Binom(p) where NTrials=10. Be sure to ALWAYS check the documentation for the correct syntax for a binomial distribution. Some functions like PDF, CDF, and RAND use the p parameter as the first argument: Binom(p, NTrials). Some procedure (like the MCMC and NLMIXED) use the p parameter as the second argument: Binom(NTrials, p).

```/* Method 2: Use PROC NLMIXED solve using built-in modeling syntax */ proc nlmixed data=Binomial; parms p = 0.5; * initial value for parameter; NTrials = 10; model x ~ binomial(NTrials, p); run;```

Notice that the output from PROC NLMIXED contains the parameter estimate, standard error, and 95% confidence intervals. The parameter estimate is the same value (0.56) as was found by the NLPNRA routine in SAS/IML. The confidence interval confirms what we previously saw in the graph of the log-likelihood function: the function is somewhat flat near the optimum, so a 95% confidence interval is wide: [0.49, 0.63].

### Maximum likelihood estimates for lognormal data

You can use similar syntax to compute MLEs for lognormal data. The SAS/IML syntax is similar to the binomial example, so it is omitted. To view it, download the complete SAS program that computes these maximum likelihood estimates.

PROC NLMIXED does not support the lognormal distribution as a built-in distribution, which means that you need to explicitly write out the log-likelihood function and specify it in the GENERAL function on the MODEL statement. Whereas in SAS/IML you have to use the SUM function to sum the log-likelihood over all observations, the syntax for PROC NLMIXED is simpler. Just as the DATA step has an implicit loop over all observations, the NLMIXED procedure implicitly sums the log-likelihood over all observations. You can use the LOGPDF function, or you can explicitly write the log-density formula for each observation.

If you look up the lognormal distribution in the list of "Standard Definition" in the PROC MCMC documentation, you will see that one parameterization of the lognormal PDF in terms of the log-mean μ and log-standard-deviation σ is
f(x; μ, σ) = (1/(sqrt(2π σ x) exp(-(log(x)-μ)**2 / (2σ**2))
When you take the logarithm of this quantity, you get two terms, or three if you use the rules of logarithms to isolate quantities that do not depend on the parameters:

```proc nlmixed data=LN; parms mu 1 sigma 1; * initial values of parameters; bounds 0 < sigma; * bounds on parameters; sqrt2pi = sqrt(2*constant('pi')); LL = -log(sigma) - log(sqrt2pi*x) /* this term is constant w/r/t (mu, sigma) */ - (log(x)-mu)**2 / (2*sigma**2); /* Alternative: LL = logpdf("Lognormal", x, mu, sigma); */ model x ~ general(LL); run;```

The parameter estimates are shown, along with standard errors and 95% confidence intervals. The maximum likelihood estimates for the lognormal data are (μ, σ) = (1.97, 0.50). You will get the same answer if you use the LOGPDF function (inside the comment) instead of the "manual calculation." You will also get the same estimates if you omit the term log(sqrt2pi*x) because that term does not depend on the MLE parameters.

In conclusion, you can use nonlinear optimization in the SAS/IML language to compute MLEs. This approach is especially useful when the computation is part of a larger computational program in SAS/IML. Alternatively, the NLMIXED procedure makes it easy to compute MLEs for discrete and continuous distributions. For some simple distributions, the log-likelihood functions are built into PROC NLMIXED. For others, you can specify the log likelihood yourself and find the maximum likelihood estimates by using the GENERAL function.

The post Two ways to compute maximum likelihood estimates in SAS appeared first on The DO Loop.