Statistical Programming

September 18, 2017
 
Toe bone connected to the foot bone,
Foot bone connected to the leg bone,
Leg bone connected to the knee bone,...
             — American Spiritual, "Dem Bones"

Last week I read an interesting article on Robert Kosara's data visualization blog. Kosara connected the geographic centers of the US zip codes in the lower 48 states, in order, from 01001 (Agawam, MA) to 99403 (Clarkston, WA). Since the Sashelp.Zipcode data set is one of the sample data sets that is distributed with SAS, it is a simple matter to recreate the graph with SAS. The following SAS statements sort the data, exclude certain American territories, and graph the locations of the US zip codes in the contiguous US. (Click on a graph to enlarge it.)

proc sort data=sashelp.zipcode(where=(StateCode 
     NOT IN ("PR", "FM", "GU", "MH", "MP", "PW", "VI")  /* exclude territories */
     AND ZIP_Class = " "))                              /* exclude "special" zip codes */
     out=zipcode(keep=ZIP x y City StateCode);
by zip;
run;
 
title "Path of US Zip Codes in Numerical Order";
proc sgplot data=zipcode noborder;
   where StateCode NOT IN ("AK", "HI");                 /* contiguous US */
   series x=X y=Y / group=StateCode;
   xaxis  display=none;
   yaxis  display=none;
run;
Map formed by connecting the US ZIP codes in order

Note that the coordinates are "raw" longitudes and latitudes; no projection is used. The graph is interesting. On this scale, the western states clearly show that the zip codes form small clusters within the states. This fact is also true for the Eastern states, but it is harder to see because of the greater density of zip codes in those states. Kosara includes an interactive map that enables you to zoom in on regions of interest. To create a static "zoomed-in" version in SAS, you can use WHERE processing to subset the data. You can also add state boundaries and labels to obtain a close-up map, such as this map of Utah (UT), Colorado (CO), Arizona (AZ), and New Mexico (NM):
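For example, the following sketch restricts the graph to the four states; overlaying state boundaries and labels requires an additional polygon layer, which is not shown here:

title "Path of Zip Codes: UT, CO, AZ, and NM";
proc sgplot data=zipcode noborder;
   where StateCode IN ("UT", "CO", "AZ", "NM");   /* subset to four states */
   series x=X y=Y / group=StateCode;
   xaxis  display=none;
   yaxis  display=none;
run;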

Connect zip codes in order for four western US states

This graph shows that there are about a dozen zip-code clusters within each of these states. This reflects the hierarchical nature of zip codes. By design, the first digit in a zip code specifies a region of the country, such as the Southeast or the Midwest. The next two digits determine the Sectional Center Facility (SCF) code, which is a region within a state. From looking at the map, I conjecture that those clusters of jagged line segments are approximations of the SCFs.

You can use the following SAS code to generate the SCF from the zip code. The subsequent call to PROC FREQ counts the number of zip codes in each SCF in Utah. Some SCFs have many postal delivery zones within them (for example, 840xx) whereas others have relatively few (for example, 844xx). Note that the SCFs are not necessarily contiguous: Utah does not (yet) have zip codes of the form 842xx.

data Zip1;
   length SCF $5.;
   set zipcode;
   FloorZip = floor(zip/100);         /* round down to nearest 100 */
   SCF = putn(FloorZip,"Z3.")||"xx";  /* Sectional Center Facility, eg, 841xx */
   keep x y zip StateCode City SCF;
run;
proc freq data=Zip1;
   where StateCode='UT';
   tables SCF / nocum;
run;
Counts of zip codes within each SCF code in Utah

If you choose a point within each Sectional Center Facility and connect those points in order, you can obtain a much less cluttered diagram that shows the basic progression of the hierarchy of zip codes. The SCFs can zig-zag across a state and do not necessarily follow a geographical progression such as north-south or east-west. The following image connects the location of the first zip code in each SCF region in Utah. The individual zip-code centers are shown as markers that are colored by the SCF.
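A minimal sketch of that computation is shown below (the intermediate data set names are arbitrary). It keeps the first zip code in each SCF and connects those locations; the overlay of all zip-code centers, colored by SCF, is omitted for brevity.

proc sort data=Zip1 out=Zip1Sort;
   by SCF zip;
run;
 
data SCFPath;                 /* first zip code in each SCF in Utah */
   set Zip1Sort;
   by SCF;
   where StateCode = 'UT';
   if first.SCF;
run;
 
title "Path of Sectional Center Facilities in Utah";
proc sgplot data=SCFPath noautolegend;
   series x=x y=y / markers datalabel=SCF;
   xaxis  display=none;
   yaxis  display=none;
run;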

Map that connects the Sectional Center Facility (SCF) codes in Utah

For states that have more than a dozen SCFs, the five-character labels can obscure the path of the SCFs. If you don't care about the actual zip-code prefixes but you just want to visualize the progression, you can label positions along the path by integers. For example, there are 25 SCFs in Florida. The following graph visualizes the regions. The first SCF (320xx) is labeled '1' and the last SCF (349xx) is labeled '25'.

Map that connects the Sectional Center Facility (SCF) codes in Florida

Lastly, the following graph shows the progression of Sectional Center Facilities at the national level. You can see certain large "jumps" across multiple states. These are present in the original map of zip codes but are obscured by the complexity of thousands of crisscrossing line segments. Two large jumps that are apparent are a diagonal line from Montana in the Pacific Northwest (prefix '59') down to Illinois (prefix '60'). Another big jump is from Nebraska in the Midwest (prefix '69') down to Louisiana (prefix '70') in the South-Central region.

Map showing US Sectional Center Facility (SCF) codes connected in order

In summary, this article uses SAS to reproduce the fascinating image on Robert Kosara's blog. Kosara's image is the path obtained by connecting the geographic centers of each zip code. By zooming in on individual states, you can identify clusters of zip codes, which correspond to Sectional Center Facilities (SCFs). By choosing one zip code from each SCF and connecting those locations in order, you obtain a simplified version of the path that connects major postal zones.

If you would like to explore these data yourself, you can download the SAS program that analyzes the zip code data.

The post The path of zip codes appeared first on The DO Loop.

August 16, 2017
 

Visualizing correlations often provides insight into the relationships between variables. I've previously written about how to use a heat map to visualize a correlation matrix in SAS/IML, and Chris Hemedinger showed how to use Base SAS to visualize correlations between variables.

Recently a SAS programmer asked how to construct a bar chart that displays the pairwise correlations between variables. This visualization enables you to quickly identify pairs of variables that have large negative correlations, large positive correlations, and insignificant correlations.

In SAS, PROC CORR can compute the correlations between variables, which are stored in matrix form in the output data set. The following call to PROC CORR analyzes the correlations between all pairs of numeric variables in the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. Because of missing values, some pairwise correlations use more observations than others.

ods exclude all;
proc corr data=sashelp.Heart;      /* pairwise correlation */
   var _NUMERIC_;
   ods output PearsonCorr = Corr;  /* write correlations, p-values, and sample sizes to data set */
run;
ods exclude none;

The CORR data set contains the correlation matrix, p-values, and sample sizes. The statistics are stored in "wide form," with few rows and many columns. As I previously discussed, you can use the HEATMAPCONT subroutine in SAS/IML to quickly visualize the correlation matrix:

proc iml;
use Corr;
   read all var "Variable" into ColNames;  /* get names of variables */
   read all var (ColNames) into mCorr;     /* matrix of correlations */
   ProbNames = "P"+ColNames;               /* variables for p-values are named PX, PY, PZ, etc */
   read all var (ProbNames) into mProb;    /* matrix of p-values */
close Corr;
 
call HeatmapCont(mCorr) xvalues=ColNames yvalues=ColNames
     colorramp="ThreeColor" range={-1 1} title="Pairwise Correlation Matrix";
Heat map of correlations between variables

The heat map gives an overall impression of the correlations between variables, but it has some shortcomings. First, you can't determine the magnitudes of the correlations with much precision. Second, it is difficult to compare the relative sizes of correlations. For example, which is stronger: the correlation between systolic and diastolic blood pressure or the correlation between weight and MRW? (MRW is a body-weight index.)

These shortcomings are resolved if you present the pairwise correlations as a bar chart. To create a bar chart, it is necessary to convert the output into "long form." Each row in the new data set will represent a pairwise correlation. To identify the row, you should also create a new variable that identifies the two variables whose correlation is represented. Because the correlation matrix is symmetric and has 1 on the diagonal, the long-form data set only needs the statistics for the lower-triangular portion of the correlation matrix.

Let's extract the data in SAS/IML. The following statements construct a new ID variable that identifies each new row and extract the correlations and p-values for the lower-triangular elements. The statistics are written to a SAS data set called CorrPairs. (In Base SAS, you can transform the lower-triangular statistics by using the DATA step and arrays, similar to the approach in this SAS note; feel free to post your Base SAS code in the comments.)

numCols = ncol(mCorr);                /* number of variables */
numPairs = numCols*(numCols-1) / 2;
length = 2*nleng(ColNames) + 5;       /* max length of new ID variable */
PairNames = j(NumPairs, 1, BlankStr(length));
i = 1;
do row= 2 to numCols;                 /* construct the pairwise names */
   do col = 1 to row-1;
      PairNames[i] = strip(ColNames[col]) + " vs. " + strip(ColNames[row]);
      i = i + 1;
   end;
end;
 
lowerIdx = loc(row(mCorr) > col(mCorr));  /* indices of lower-triangular elements */
Corr = mCorr[ lowerIdx ];
Prob = mProb[ lowerIdx ];
Significant = choose(Prob > 0.05, "No ", "Yes");  /* use alpha=0.05 signif level */
 
create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"};
append;
close;
QUIT;

You can use the HBAR statement in PROC SGPLOT to construct the bar chart. This bar chart contains 45 rows, so you need to make the graph tall and use a small font to fit all the labels without overlapping. The call to PROC SORT and the DISCRETEORDER=DATA option on the YAXIS statement ensure that the categories are displayed in order of increasing correlation.

proc sort data=CorrPairs;  by Corr;  run;
 
ods graphics / width=600px height=800px;
title "Pairwise Correlations";
proc sgplot data=CorrPairs;
hbar PairNames / response=Corr group=Significant;
refline 0 / axis=x;
yaxis discreteorder=data display=(nolabel) 
      valueattrs=(size=6pt) fitpolicy=none 
      offsetmin=0.012 offsetmax=0.012  /* half of 1/k, where k=number of categories */
      colorbands=even colorbandsattrs=(color=gray transparency=0.9);
xaxis grid display=(nolabel);
keylegend / position=topright location=inside across=1;
run;
Bar chart of pairwise correlations between variables

The bar chart (click to enlarge) enables you to see which pairs of variables are highly correlated (positively and negatively) and which have correlations that are not significantly different from 0. You can use additional colors or reference lines if you want to visually emphasize other features, such as the correlations that are larger than 0.25 in absolute value.
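For example, you could add dashed reference lines at ±0.25 to the previous PROC SGPLOT call, something like this:

refline -0.25 0.25 / axis=x lineattrs=(pattern=dash);   /* emphasize |corr| > 0.25 */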

The bar chart is not perfect. This example, which analyzes 10 variables, produces a chart with 45 rows, so it is very tall. Among k variables there are k(k-1)/2 correlations, so the number of pairwise correlations (rows) increases quadratically with the number of variables. In practice, this chart would be unreasonably tall when there are 14 or 15 variables (about 100 rows).

Nevertheless, for 10 or fewer variables, a bar chart of the pairwise correlations provides an alternative visualization that has some advantages over a heat map of the correlation matrix. What do you think? Would this graph be useful in your work? Leave a comment.

The post Use a bar chart to visualize pairwise correlations appeared first on The DO Loop.

August 9, 2017
 

Recently, I was asked whether SAS can perform a principal component analysis (PCA) that is robust to the presence of outliers in the data. A PCA requires a data matrix, an estimate for the center of the data, and an estimate for the variance/covariance of the variables. Classically, these estimates are the mean and the Pearson covariance matrix, respectively, but if you replace the classical estimates by robust estimates, you can obtain a robust PCA.

This article shows how to implement a classical (non-robust) PCA by using the SAS/IML language and how to modify that classical analysis to create a robust PCA.

A classical principal component analysis in SAS

In SAS, you can perform a classical principal component analysis by using the PRINCOMP procedure or the FACTOR procedure.

Let's use PROC PRINCOMP to perform a simple PCA. The PROC PRINCOMP results will be the basis of comparison when we implement the PCA in PROC IML. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation, which analyzes crime rates in the US states:

proc princomp data=Crime /* add COV option for covariance analysis */
              outstat=PCAStats_CORR out=Components_CORR 
              plots=score(ncomp=4);
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
   id State;
   ods select Eigenvectors ScreePlot '2 by 1' '4 by 3';
run;
 
proc print data=Components_CORR(obs=5);
   var Prin:;
run;

To save space, the output from PROC PRINCOMP is not shown, but it includes a table of the eigenvalues and principal component vectors (eigenvectors) of the correlation matrix, as well as a plot of the scores of the observations, which are the projection of the observations onto the principal components. The procedure writes two data sets: the eigenvalues and principal components are contained in the PCAStats_CORR data set and the scores are contained in the Components_CORR data set.

A classical principal component analysis in SAS/IML

Assume that the data consists of n observations and p variables and assume all values are nonmissing. If you are comfortable with multivariate analysis, a principal component analysis is straightforward: the principal components are the eigenvectors of a covariance or correlation matrix, and the scores are the projection of the centered data onto the eigenvector basis. However, if it has been a few years since you studied PCA, Abdi and Williams (2010) provides a nice overview of the mathematics. The following matrix computations implement the classical PCA in the SAS/IML language:

proc iml;
reset EIGEN93;       /* use "9.3" algorithm, not vendor BLAS (SAS 9.4m3) */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
  read all var varNames into X;  /* X = data matrix (assume nonmissing) */
close;
 
S = cov(X);                      /* estimate of covariance matrix */
R = cov2corr(S);                 /* = corr(X) */
call eigen(D, V, R);             /* D=eigenvalues; columns of V are PCs */
scale = T( sqrt(vecdiag(S)) );   /* = std(X) */
B = (X - mean(X)) / scale;       /* center and scale data about the mean */
scores = B * V;                  /* project standardized data onto the PCs */
 
print V[r=varNames c=("Prin1":"Prin7") L="Eigenvectors"];
print (scores[1:5,])[c=("Prin1":"Prin7") format=best9. L="Scores"];
Classical principal component analysis in SAS

The principal components (eigenvectors) and scores for these data are identical to the same quantities that were produced by PROC PRINCOMP. In the preceding program I could have directly computed R = corr(X) and scale = std(X), but I generated those quantities from the covariance matrix because that is the approach used in the next section, which computes a robust PCA.

If you want to compute the PCA from the covariance matrix, you would use S in the EIGEN call, and define B = X - mean(X) as the centered (but not scaled) data matrix.
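In terms of the matrices that are already defined, that variation requires only a small change, sketched here:

call eigen(D2, V2, S);      /* eigenvectors of the covariance matrix */
B2 = X - mean(X);           /* center, but do not scale, the data */
scores2 = B2 * V2;          /* scores for the covariance-based PCA */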

A robust principal component analysis

This section is based on a similar robust PCA computation in Wicklin (2010). The main idea behind a robust PCA is that if there are outliers in the data, the covariance matrix will be unduly influenced by those observations. Therefore you should use a robust estimate of the covariance matrix for the eigenvector computation. Also, the arithmetic mean is unduly influenced by the outliers, so you should center the data by using a robust estimate of center before you form the scores.

SAS/IML software provides the MCD subroutine, which computes a robust estimate of the location and covariance of multivariate data by using the minimum covariance determinant (MCD) method. The following statements compute the robust estimates and use them for the PCA:

/* get robust estimates of location and covariance */
N = nrow(X); p = ncol(X);        /* number of obs and variables */
optn = j(8,1,.);                 /* default options for MCD */
optn[4]= floor(0.75*N);          /* override default: use 75% of data for robust estimates */
call MCD(rc,est,dist,optn,x);    /* compute robust estimates */
RobCov = est[3:2+p, ];           /* robust estimate of covariance */
RobLoc = est[1, ];               /* robust estimate of location */
 
/* use robust estimates to perform PCA */
RobCorr = cov2corr(RobCov);      /* robust correlation */
call eigen(D, V, RobCorr);       /* D=eigenvalues; columns of V are PCs */
RobScale = T( sqrt(vecdiag(RobCov)) ); /* robust estimates of scale */
B = (x-RobLoc) / RobScale;       /* center and scale data */
scores = B * V;                  /* project standardized data onto the PCs */

Notice that the SAS/IML code for the robust PCA is very similar to the classical PCA. The only difference is that the robust estimates of covariance and location (from the MCD call) are used instead of the classical estimates.

If you want to compute the robust PCA from the covariance matrix, you would use RobCov in the EIGEN call, and define B = X - RobLoc as the centered (but not scaled) data matrix.

Comparing the classical and robust PCA

Classical and robust principal component scores for crime data, computed in SAS

You can create a score plot to compare the classical scores to the robust scores. The Getting Started example in the PROC PRINCOMP documentation shows the classical scores for the first three components. The documentation suggests that Nevada, Massachusetts, and New York could be outliers for the crime data.

You can write the robust scores to a SAS data set and use the SGPLOT procedure to plot the scores of the classical and robust PCA on the same scale. The first and third component scores are shown to the right, with abbreviated state names serving as labels for the markers. (Click to enlarge.) You can see that the robust first-component scores for California and Nevada are separated from the other states, which makes them outliers in that dimension. (The first principal component measures overall crime rate.) The robust third-component scores for New York and Massachusetts are also well-separated and are outliers for the third component. (The third principal component appears to contrast murder with rape and auto theft with other theft.)
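A minimal sketch of that step follows. It writes the robust scores (and the state names, for labels) to a data set and plots the first and third components; overlaying the classical scores on the same axes is omitted here.

/* sketch: write robust scores to a data set and plot components 1 and 3 */
use Crime;  read all var {"State"};  close;   /* state names for labels */
Prin1 = scores[,1];   Prin3 = scores[,3];
create RobustScores var {"State" "Prin1" "Prin3"};
append;
close RobustScores;
quit;
 
proc sgplot data=RobustScores;
   scatter x=Prin1 y=Prin3 / datalabel=State;
   refline 0 / axis=x;
   refline 0 / axis=y;
run;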

Because the crime data does not have huge outliers, the robust PCA is a perturbation of the classical PCA, which makes it possible to compare the analyses. For data that have extreme outliers, the robust analysis might not resemble its classical counterpart.

If you run an analysis like this on your own data, remember that eigenvectors are not unique and so there is no guarantee that the eigenvectors for the robust correlation matrix will be geometrically aligned with the eigenvectors from the classical correlation matrix. For these data, I multiplied the second and fourth robust components by -1 because that seems to make the score plots easier to compare.
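In SAS/IML, that sign adjustment is a one-line modification of the eigenvector matrix, inserted before the robust scores are formed:

V[ ,{2 4}] = -V[ ,{2 4}];   /* align the signs of the 2nd and 4th robust PCs */
scores = B * V;             /* recompute the robust scores */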

Summary

In summary, you can implement a robust principal component analysis by using robust estimates for the correlation (or covariance) matrix and for the "center" of the data. The MCD subroutine in SAS/IML language is one way to compute a robust estimate of the covariance and location of multivariate data. The SAS/IML language also provides ways to compute eigenvectors (principal components) and project the (centered and scaled) data onto the principal components.

You can download the SAS program that computes the analysis and creates the graphs.

As discussed in Hubert, Rousseeuw, and Branden (2005), the MCD algorithm is most useful for 100 or fewer variables. They describe an alternative computation that can support a robust PCA in higher dimensions.

The post Robust principal component analysis in SAS appeared first on The DO Loop.

July 5, 2017
 

A SAS customer asked how to use SAS to conduct a Z test for the equality of two proportions. He was directed to the SAS Usage Note "Testing the equality of two or more proportions from independent samples." The note says to "specify the CHISQ option in the TABLES statement of PROC FREQ to compute this test," and then adds "this is equivalent to the well-known Z test for comparing two independent proportions."

You might wonder why a chi-square test for association is equivalent to a Z test for the equality of proportions. You might also wonder if there is a direct way to test the equality of proportions. This article implements the well-known test for proportions in the DATA step and compares the results to the chi-square test results. It also shows how to get this test directly from PROC FREQ by using the RISKDIFF option.

A chi-square test for association in SAS

The SAS Usage Note poses the following problem: Suppose you want to compare the proportions responding "Yes" to a question in independent samples of 100 men and 100 women. The number of men responding "Yes" is observed to be 30, and the number of women responding "Yes" is 45.

You can create the data by using the following DATA step, then call PROC FREQ to analyze the association between the response variable and gender.

data Prop;
length Group $12 Response $3;
input Group Response N;
datalines;
Men          Yes  30
Men          No   70
Women        Yes  45
Women        No   55
;
 
proc freq data=Prop order=data;
   weight N;
   tables Group*Response / chisq;
run;
Test for association of two categorical variables in SAS

As explained in the PROC FREQ documentation, the Pearson chi-square statistic indicates an association between the variables in the 2 x 2 table. The results show that the chi-square statistic (for 1 degree of freedom) is 4.8, which corresponds to a p-value of 0.0285. The test indicates that we should reject the null hypothesis of no association at the 0.05 significance level.

As stated in the SAS Usage Note, this association test is equivalent to a Z test for whether the proportion of males who responded "Yes" equals the proportion of females who responded "Yes." The equivalence relies on a fact from probability theory: a chi-square random variable with 1 degree of freedom is the square of a random variable from the standard normal distribution. Thus the square root of the chi-square statistic is the Z statistic (up to a sign) that you get from the test of equality of two proportions. Therefore the Z statistic should be z = ±sqrt(4.8) = ±2.19. The p-value is unchanged.

Z test for the equality of two proportions: A DATA step implementation

For comparison, you can implement the classical Z test by applying the formulas from a textbook or from the course material from Penn State, which includes a section about comparing two proportions. The following DATA step implements the Z test for equality of proportions:

/* Implement the Z test for pre-summarized statistics. Specify the group proportions and sizes. 
   For formulas, see https://onlinecourses.science.psu.edu/stat414/node/268 */
%let alpha = 0.05;
%let N1    = 100;   /* total trials in Group1 */
%let Event1=  30;   /* Number of events in Group1  */
%let N2    = 100;   /* total trials in Group2 */
%let Event2=  45;   /* Number of events in Group2  */
 
%let Side  =   2;   /* use L, U, or 2 for lower, upper, or two-sided test */
title "Test of H0: p1=p2 vs Ha: p1^=p2"; /* change for Side=L or U */
 
data zTestProp;
p1Hat = &Event1 / &N1;                /* observed proportion in Group1 */
var1  = p1Hat*(1-p1Hat) / &N1;        /* variance in Group1 */
p2Hat = &Event2 / &N2;                /* observed proportion in Group2 */
var2  = p2Hat*(1-p2Hat) / &N2;        /* variance in Group2 */
/* use pooled estimate of p for test */
Diff = p1Hat - p2Hat;                 /* estimate of p1 - p2 */
pHat = (&Event1 + &Event2) / (&N1 + &N2);
pVar = pHat*(1-pHat)*(1/&N1 + 1/&N2); /* pooled variance */
SE   = sqrt(pVar);                    /* estimate of standard error */
Z = Diff / SE;    
 
Side = "&Side";
if Side="L" then                      /* one-sided, lower tail */
   pValue = cdf("normal", z);
else if Side="U" then                 /* one-sided, upper tail */
   pValue = sdf("normal", Z);         /* SDF = 1 - CDF */
else if Side="2" then
   pValue = 2*(1-cdf("normal", abs(Z))); /* two-sided */
format pValue PVALUE6.4 Z 7.4;
label pValue="Pr < Z";
drop var1 var2 pHat pVar;
run;
 
proc print data=zTestProp label noobs; run;
Test for equality of two independent proportions in SAS

The DATA step obtains a test statistic of Z = –2.19, which is one of the square roots of the chi-square statistic in the PROC FREQ output. Notice also that the p-value from the DATA step matches the p-value from the PROC FREQ output.

Test equality of proportions by using PROC FREQ

There is actually a direct way to test for the equality of two independent proportions: use the RISKDIFF option in the TABLES statement in PROC FREQ. In the documentation, binomial proportions are called "risks," so a "risk difference" is a difference in proportions. (Also, a "relative risk" (the RELRISK option) measures the ratio of two proportions.) Equality of proportions is equivalent to testing whether the difference of proportions (risks) is zero.

As shown in the documentation, PROC FREQ supports many options for comparing proportions. You can use the following suboptions to reproduce the classical equality-of-proportions test:

  1. EQUAL requests a test of the equality of the two proportions (that is, that the risk difference is zero). By default, the Wald method (METHOD=WALD) is used, but you can choose other methods.
  2. VAR=NULL specifies that the variance of the test statistic is estimated under the null hypothesis of equal proportions (that is, by using the pooled proportion).
  3. (optional) CL=WALD outputs the Wald confidence interval for the difference.

Combining these options gives the following direct computation of the difference between two proportions:

proc freq data=Prop order=data;
   weight N;
   tables Group*Response / riskdiff(equal var=null cl=wald); /* Wald test for equality */
run;
Test for difference of proportions in SAS

The 95% (Wald) confidence interval is shown in the first table. The confidence interval is centered on the point estimate of the difference (-0.15). The interval does not contain 0, so the difference is significantly different from 0 at the 0.05 significance level.

The second table shows the result of the Wald equality test. The "ASE (H0)" row gives the estimate for the (asymptotic) standard error, assuming the null hypothesis. The Z score and the two-sided p-value match the values from the DATA step computation, and the interpretation is the same.

Summary

In summary, the SAS Usage Note correctly states that the chi-square test of association is equivalent to the Z test for the equality of proportion. To run the Z test explicitly, this article uses the SAS DATA step to implement the test when you have summary statistics. As promised, the Z statistic is one of the square roots of the chi-square statistic and the p-values are the same. The DATA step removes some of the mystery regarding the equivalence between these two tests.

However, writing DATA step code cannot match the convenience of a procedure. For raw or pre-summarized data, you can use the RISKDIFF option in PROC FREQ to run the same test (recast as a difference of proportions or "risks"). To get exactly the same confidence intervals and statistics as the classical test (which is called the Wald test), you need to add a few suboptions. The resulting output matches the DATA step computations.

The post Test for the equality of two proportions in SAS appeared first on The DO Loop.

June 26, 2017
 

My presentation at SAS Global Forum 2017 was "More Than Matrices: SAS/IML Software Supports New Data Structures." The paper was published in the conference proceedings several months ago, but I recently recorded a short video that gives an overview of using the new data structures in SAS/IML 14.2:


If your browser does not support embedded video, you can go directly to the video on YouTube.

I know that customers will be excited about programming with lists and tables. Now you can create complex data structures in the SAS/IML language, including in-memory data tables, arrays of matrices, lists of inhomogeneous data, and hierarchical data structures.

If you prefer to read about lists and tables, the following blog posts give an overview of these data structures:

For more details about tables and lists, see the following chapters in the SAS/IML User's Guide.

The post Video: Create and use lists and tables in SAS/IML appeared first on The DO Loop.

June 14, 2017
 

In a previous article, I showed two ways to define a log-likelihood function in SAS. This article shows two ways to compute maximum likelihood estimates (MLEs) in SAS: the nonlinear optimization subroutines in SAS/IML and the NLMIXED procedure in SAS/STAT. To illustrate these methods, I will use the same data sets from my previous post. One data set contains binomial data, the other contains data that are lognormally distributed.

Maximum likelihood estimates for binomial data from SAS/IML

I previously wrote a step-by-step description of how to compute maximum likelihood estimates in SAS/IML. SAS/IML contains many algorithms for nonlinear optimization, including the NLPNRA subroutine, which implements the Newton-Raphson method.

In my previous article I used the LOGPDF function to define the log-likelihood function for the binomial data. The following statements define bounds for the parameter (0 < p < 1) and provide an initial guess of p0=0.5:

/* Before running program, create Binomial and LN data sets from previous post */
 
/* Example 1: MLE for binomial data */
/* Method 1: Use SAS/IML optimization routines */
proc iml;
/* log-likelihood function for binomial data */
start Binom_LL(p) global(x, NTrials);
   LL = sum( logpdf("Binomial", x, p, NTrials) );
   return( LL );
finish;
 
NTrials = 10;    /* number of trials (fixed) */
use Binomial; read all var "x"; close;
 
/* set constraint matrix, options, and initial guess for optimization */
con = { 0,      /* lower bounds: 0 < p     */
        1};     /* upper bounds:     p < 1 */
opt = {1,       /* find maximum of function   */
       2};      /* print some output      */
p0  = 0.5;      /* initial guess for solution */
call nlpnra(rc, p_MLE, "Binom_LL", p0, opt, con);
print p_MLE;
Maximum likelihood estimate for binomial data

The NLPNRA subroutine computes that the maximum of the log-likelihood function occurs for p=0.56, which agrees with the graph in the previous article. We conclude that the parameter p=0.56 (with NTrials=10) is "most likely" to be the binomial distribution parameter that generated the data.

Maximum likelihood estimates for binomial data from PROC NLMIXED

If you've never used PROC NLMIXED before, you might wonder why I am using that procedure, since this problem is not a mixed modeling regression. However, you can use the NLMIXED procedure for general maximum likelihood estimation. In fact, I sometimes joke that SAS could have named the procedure "PROC MLE" because it is so useful for solving maximum likelihood problems.

PROC NLMIXED has built-in support for computing maximum likelihood estimates of data that follow the Bernoulli (binary), binomial, Poisson, negative binomial, normal, and gamma distributions. (You can also use PROC GENMOD to fit these distributions; I have shown an example of fitting Poisson data.)

The syntax for PROC NLMIXED is very simple for the binomial data. You use the PARMS statement to supply an initial guess for the parameter p. On the MODEL statement, you declare that you want to model the X variable as Binom(p) where NTrials=10. Be sure to ALWAYS check the documentation for the correct syntax for a binomial distribution. Some functions like PDF, CDF, and RAND use the p parameter as the first argument: Binom(p, NTrials). Some procedures (like PROC MCMC and PROC NLMIXED) use the p parameter as the second argument: Binom(NTrials, p).

/* Method 2: Use PROC NLMIXED solve using built-in modeling syntax */
proc nlmixed data=Binomial;
   parms p = 0.5;             * initial value for parameter;
   NTrials = 10;
   model x ~ binomial(NTrials, p);
run;
Maximum likelihood estimates for binomial data by using PROC NLMIXED in SAS

Notice that the output from PROC NLMIXED contains the parameter estimate, standard error, and 95% confidence intervals. The parameter estimate is the same value (0.56) as was found by the NLPNRA routine in SAS/IML. The confidence interval confirms what we previously saw in the graph of the log-likelihood function: the function is somewhat flat near the optimum, so a 95% confidence interval is wide: [0.49, 0.63].

Maximum likelihood estimates for lognormal data

You can use similar syntax to compute MLEs for lognormal data. The SAS/IML syntax is similar to the binomial example, so it is omitted. To view it, download the complete SAS program that computes these maximum likelihood estimates.
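For reference, here is a minimal sketch of that SAS/IML computation, which follows the same pattern as the binomial example. (The small positive lower bound on sigma is a choice made for this sketch to keep the optimizer away from sigma=0.)

/* Sketch: MLE for lognormal data via SAS/IML optimization */
proc iml;
start LogNorm_LL(param) global(x);           /* param = {mu sigma} */
   LL = sum( logpdf("Lognormal", x, param[1], param[2]) );
   return( LL );
finish;
 
use LN; read all var "x"; close;
 
con = {. 1e-6,   /* lower bounds: sigma > 0        */
       .    .};  /* upper bounds: none             */
opt = {1,        /* find maximum of function       */
       2};       /* print some output              */
p0  = {1 1};     /* initial guess for (mu, sigma)  */
call nlpnra(rc, p_MLE, "LogNorm_LL", p0, opt, con);
print p_MLE;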

PROC NLMIXED does not support the lognormal distribution as a built-in distribution, which means that you need to explicitly write out the log-likelihood function and specify it in the GENERAL function on the MODEL statement. Whereas in SAS/IML you have to use the SUM function to sum the log-likelihood over all observations, the syntax for PROC NLMIXED is simpler. Just as the DATA step has an implicit loop over all observations, the NLMIXED procedure implicitly sums the log-likelihood over all observations. You can use the LOGPDF function, or you can explicitly write the log-density formula for each observation.

If you look up the lognormal distribution in the list of "Standard Definition" in the PROC MCMC documentation, you will see that one parameterization of the lognormal PDF in terms of the log-mean μ and log-standard-deviation σ is
f(x; μ, σ) = 1/(sqrt(2π) σ x) exp( -(log(x)-μ)**2 / (2σ**2) )
When you take the logarithm of this quantity, you get two terms, or three if you use the rules of logarithms to isolate quantities that do not depend on the parameters:

proc nlmixed data=LN;
parms mu 1 sigma 1;                 * initial values of parameters;
bounds 0 < sigma;                   * bounds on parameters;
sqrt2pi = sqrt(2*constant('pi'));
LL = -log(sigma) 
     - log(sqrt2pi*x)               /* this term is constant w/r/t (mu, sigma) */
     - (log(x)-mu)**2  / (2*sigma**2);
/* Alternative: LL = logpdf("Lognormal", x, mu, sigma); */
model x ~ general(LL);
run;
Maximum likelihood estimates for lognormal data by using PROC NLMIXED in SAS

The parameter estimates are shown, along with standard errors and 95% confidence intervals. The maximum likelihood estimates for the lognormal data are (μ, σ) = (1.97, 0.50). You will get the same answer if you use the LOGPDF function (inside the comment) instead of the "manual calculation." You will also get the same estimates if you omit the term log(sqrt2pi*x) because that term does not depend on the MLE parameters.

In conclusion, you can use nonlinear optimization in the SAS/IML language to compute MLEs. This approach is especially useful when the computation is part of a larger computational program in SAS/IML. Alternatively, the NLMIXED procedure makes it easy to compute MLEs for discrete and continuous distributions. For some simple distributions, the log-likelihood functions are built into PROC NLMIXED. For others, you can specify the log likelihood yourself and find the maximum likelihood estimates by using the GENERAL function.

The post Two ways to compute maximum likelihood estimates in SAS appeared first on The DO Loop.

April 12, 2017
 

At SAS Global Forum last week, I saw a poster that used SAS/IML to optimize a quadratic objective function that arises in financial portfolio management (Xia, Eberhardt, and Kastin, 2017). The authors used the Newton-Raphson optimizer (NLPNRA routine) in SAS/IML to optimize a hypothetical portfolio of assets.

The Newton-Raphson algorithm is one of my favorite optimizers. However, for a quadratic objective function there is a simpler choice. You can use the NLPQUA function in SAS/IML to optimize a quadratic objective function.

Optimal value of quadratic function of two variables

Optimize an unconstrained quadratic objective function in SAS/IML

Let's see how this works by specifying a quadratic polynomial in two variables. The function is
f(x,y) = (9*x##2 + x#y + 4*y##2) - 12*x - 4*y + 6
       = 0.5*x` H x + L x + 6
where H = {18  1, 1  8} is a 2x2 symmetric matrix, L = {-12  -4} is a row vector, and x = {x, y} is a column vector.

To use the NLPQUA subroutine, you need to write down the Hessian matrix, H, which is the symmetric matrix of second derivatives. Of course, for a quadratic function the second derivatives are constants. In the following SAS/IML program, the matrix H contains the second derivatives and the vector lin contains the coefficients for the linear and constant terms of f:

/* objective function is
   f(x,y) = (3*x-2)##2 + (2*y-1)##2 + x#y + 1
          = (9*x##2  + x#y + 4*y##2)  + -12*x  - 4*y + 6
   grad(f) = g(x,y) = [ 18*x + y - 12,  x + 8*y - 4 ]
   hess(f)  = [ dg1/dx  dg2/dx ]  =  [ 18  1 ]
              [ dg1/dy  dg2/dy ]     [  1  8 ]
*/
proc iml;
H = {18 1,             /* matrix of second derivatives */
      1 8};
lin = { -12 -4 6 };    /* vector of linear terms and the constant term */

The NLPQUA subroutine enables you to specify the Hessian matrix and the linear coefficients. From those values, the routine uses an efficient solver to find the global minimum of the quadratic polynomial, which occurs at x=92/143, y=60/143:

x0 = j(1, 2, 1);       /* initial guess */
opt = {0 1};           /* minimize, print final parameters */
call nlpqua(rc, xOpt, H, x0, opt) lin=lin;
print xOpt; 
print xOpt[format=Fract.];
Optimal values of a quadratic function
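Because this quadratic function is convex, you can double-check the answer: the unconstrained minimum satisfies the linear system H*x = -L`, which you can solve directly, as in the following sketch:

Lvec = lin[ ,1:2];            /* linear coefficients {-12 -4} */
xCheck = solve(H, -Lvec`);    /* solves 18*x + y = 12 and x + 8*y = 4 */
print xCheck[format=Fract.];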

Solve a constrained quadratic program in SAS/IML

You can also use the NLPQUA subroutine to solve a constrained problem. Suppose that you are only interested in values of (x,y) in the unit square and you require that the solution satisfies the linear constraint x + y = 1. In that case you can define a matrix that specifies the linear constraints as follows:

blc = {0 0 . .,   /* lower bound: 0 <= x_i        */
       1 1 . .,   /* upper bound:      x_i <= 1   */
       1 1 0 1};  /* linear constraint x +  y = 1 */
call nlpqua(rc, conOpt, H, x0, opt, blc) lin=lin;
print conOpt[format=Fract.];

Notice how the linear constraints are specified for this (and every other) NLP function. Let p be the number of parameters in the problem. The first p elements of the first row specify the lower bounds for the parameters; use a missing value for variables that are not bounded below. The last two columns of the first row are not used, so you should always set those elements to missing values. Similarly, the first p elements of the second row specify the upper bound for the variables. The last two columns of the second row are not used, so set those elements to missing values.

Additional rows specify coefficients for the linear constraint. An equality constraint of the form a1*x1 + a2*x2 + ... + an *xn = c is encoded as a row of coefficients {a1 a2 ... an 0 c}, where the 0 in the penultimate location indicates equality. A "less than" inequality of the form a1*x1 + a2*x2 + ... + an *xn ≤ c is encoded as the row {a1 a2 ... an  -1  c}, where the -1 indicates the "less than or equal" symbol. Similarly, a value of +1 in the penultimate location indicates the "greater than or equal" symbol (≥). See the documentation for the NLP routines for additional details about specifying constraints.
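For example, a hypothetical inequality constraint such as x + 2*y ≤ 3, combined with the same variable bounds, would be encoded as the following matrix:

blcIneq = {0 0 . .,    /* lower bounds: 0 <= x_i          */
           1 1 . .,    /* upper bounds:      x_i <= 1     */
           1 2 -1 3};  /* linear constraint x + 2*y <= 3  */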

Solve quadratic programs in PROC OPTMODEL

If you have access to SAS/OR software, PROC OPTMODEL provides a simple and natural language for solving simple and complex optimization problems. For this problem, PROC OPTMODEL detects that the objective function is quadratic and automatically calls the quadratic programming (QP) solver:

/* variable names x and y */
proc optmodel;
   /* unconstrained quadratic optimization */
   var x, y;
   min f = 9*x^2 + x*y + 4*y^2 - 12*x - 4*y + 6;
   solve;                    /* call QP solver */
   print x Fract. y Fract.;
 
   /* constrained quadratic optimization */
   x.lb = 0; x.ub = 1;        /* bounds on variables */
   y.lb = 0; y.ub = 1;
   con SumToOne:              /* linear constraint */
      x + y = 1;
   solve;                     /* call QP solver */
   print x Fract. y Fract.;
quit;

In summary, if you want to optimize a quadratic objective function (with or without constraints), use the NLPQUA subroutine in SAS/IML, which is a fast and efficient way to maximize or minimize a quadratic function of many variables. If you have access to SAS/OR software, PROC OPTMODEL provides an even simpler syntax.

The post Quadratic optimization in SAS appeared first on The DO Loop.

April 10, 2017
 

Many intervals in statistics have the form p ± δ, where p is a point estimate and δ is the radius (or half-width) of the interval. (For example, many two-sided confidence intervals have this form, where δ is proportional to the standard error.) Many years ago I wrote an article that mentioned that you can construct these intervals in the SAS/IML language by using a concatenation operator (|| or //). The concatenation creates a two-element vector, like this:

proc iml;
mu = 50; 
delta = 1.5;
CI = mu - delta  || mu + delta;   /* horizontal concatenation ==> 1x2 vector */

Last week it occurred to me that there is a simple trick that is even easier: use the fact that SAS/IML is a matrix-vector language to encode the "±" sign as a vector {-1, 1}. When SAS/IML sees a scalar multiplied by a vector, the result will be a vector:

CI = mu + {-1  1}*delta;         /* vector operation ==> 1x2 vector */
print CI;

You can extend this example to compute many intervals by using a single statement. For example, in elementary statistics we learn the "68-95-99.7 rule" for the normal distribution. The rule says that in a random sample drawn from a normal population, about 68% of the observations will be within 1 standard deviation of the mean, about 95% will be within 2 standard deviations, and about 99.7% will be within 3 standard deviations of the mean. You can construct those intervals by using a "multiplier matrix" whose first row is {-1, +1}, whose second row is {-2, +2}, and whose third row is {-3, +3}. The following SAS/IML statements construct the three intervals for the 68-95-99.7 rule for a normal population with mean 50 and standard deviation 8:

mu = 50;  sigma = 8;
m = {-1 1,
     -2 2,
     -3 3};
 
Intervals = mu + m*sigma;
ApproxPct = {"68%", "95%", "99.7"};
print Intervals[rowname=ApproxPct];

Just for fun, let's simulate a large sample from the normal population and empirically confirm the 68-95-99.7 rule. You can use the RANDFUN function to generate a random sample and use the BIN function to detect which observations are in each interval:

call randseed(12345);
n = 10000;                                /* sample size */
x = randfun(n, "Normal", mu, sigma);      /* simulate normal sample */
ObservedPct = j(3,1,0);
do i = 1 to 3;
   b = bin(x, Intervals[i,]);             /* b[i]=1 if x[i] in interval */
   ObservedPct[i] = sum(b) / n;           /* percentage of x in interval */
end;
 
results = Intervals || {0.68, 0.95, 0.997} || ObservedPct;
print results[colname={"Lower" "Upper" "Predicted" "Observed"}
              label="Probability of Normal Variate in Intervals: X ~ N(50, 8)"];

The simulation confirms the 68-95-99.7 rule. Remember that the rule is a mnemonic device. You can compute the exact probabilities by using the CDF function. In SAS/IML, the exact computation is p = cdf("Normal", m[,2]) - cdf("Normal", m[,1]);

In summary, the SAS/IML language provides an easy syntax to construct intervals that are symmetric about a central value. You can use a vector such as {-1, 1} to construct an interval of the form p ± δ, or you can use a k x 2 matrix to construct k symmetric intervals.

The post A simple trick to construct symmetric intervals appeared first on The DO Loop.

April 5, 2017
 

Most regression models try to model a response variable by using a smooth function of the explanatory variables. However, if the data are generated from some nonsmooth process, then it makes sense to use a regression function that is not smooth. A simple way to model a discontinuous process in SAS is to use spline effects and specify repeated values for the knots.

Discontinuous processes: More common than you might think

The classical ANOVA is one way to analyze data that are collected before and after a known event. For example, you might record gas mileage for a car before and after a tune-up. You might collect patient data before and after they undergo a medical or surgical treatment. You might have data about real estate prices before and after some natural disaster. In all these cases, you might suspect that the response changes abruptly because of the event.

To give a simple example, suppose that a driver records the fuel economy (in miles per gallon) for a car for 12 weeks. Because the car engine is coughing and knocking, the owner brings the car to a mechanic for maintenance. After the maintenance, the car seems to run better and he records the fuel economy for another six weeks. The hypothetical data are below:

data MPG;
input mpg @@;
week = _N_;
period = ifc(week > 12, "After ", "Before");
label mpg="Miles per Gallon";
datalines;
30.5 28.1 27.1 31.2 25.2 31.1 27.7 28.2 29.6 30.6 28.9 25.9 
30.6 33.0 31.2 29.7 32.7 31.1 
;

Notice that the data contains a binary indicator variable (period) that records whether the data are from before or after the tune-up. You can use PROC GLM to perform a simple ANOVA analysis to determine whether there was a significant change in the mean fuel economy after the maintenance. The following call to PROC GLM runs an ANOVA and indicates that the mean fuel economy is about 2.7 mpg better after the tune-up.

proc glm data=MPG plots=fitplot;
   class period / ref=first;
   model mpg = period /solution;
   output out=out predicted=Pred;
quit;

Graphically, this regression analysis is usually visualized by using two box plots. (PROC GLM creates the box plots automatically when ODS graphics are enabled.) However, because the independent variable is time, you could also use a series plot to show the observed data and the mean response before and after the maintenance. By using the GROUP= option on the SERIES statement, you can get two lines for the "before" and "after" time periods.

title "Piecewise Constant Regression with Jump Discontinuity";
proc sgplot data=Out;
   block x=week block=period / transparency=0.8;
   scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black);
   series x=week y=pred / group=period lineattrs=(thickness=3) ;
run;
Piecewise constant regression function with jump discontinuity

The graph shows that the model has a jump discontinuity at the time point at which the maintenance intervention occurred. If you include the WEEK variable in the analysis, you can model the response as a linear function of time, with a jump at the time of the tune-up.
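One minimal way to fit such a model (a sketch, using the same data set) adds WEEK as a continuous effect so that the model is linear in time with a jump after week 12:

proc glm data=MPG;
   class period / ref=first;
   model mpg = week period / solution;   /* common slope in time, plus a jump at the tune-up */
   output out=OutLinear predicted=Pred;
quit;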

All this is probably very familiar. However, did you know that you can use splines to model the data as a continuous function that has a kink or "corner" at the time of the maintenance event? You can use this feature when the model is continuous, but the slope changes at a known time.

Splines for nonsmooth models

Several SAS procedures support the EFFECT statement, which enables you to build spline effects. The paper "Rediscovering SAS/IML Software" (Wicklin 2010, p. 4) has an example where splines are used to construct a highly nonlinear curve for a scatter plot.

A spline effect is determined by the placement of certain points called "knots." Often knots are evenly spaced within the range of the explanatory variable, but the EFFECT statement supports many other ways to position the knots. In fact, the documentation for the EFFECT statement says: "If you remove the restriction that the knots of a spline must be distinct and allow repeated knots, then you can obtain functions with less smoothness and even discontinuities at the repeated knot location. For a spline of degree d and a repeated knot with multiplicity m ≤ d, the piecewise polynomials that join such a knot are required to have only d – m matching derivatives."

The degree of a linear regression is d=1, so if you specify a knot position once you obtain a piecewise linear function that contains a "kink" at the knot. The following call to PROC GLIMMIX demonstrates this technique. (I use GLIMMIX because neither PROC GLM nor PROC GENMOD support the EFFECT statement.) You can manually specify the position of knots by using the KNOTMETHOD=LIST(list) option on the EFFECT statement.

proc glimmix data=MPG;
   effect spl = spline(week / degree=1 knotmethod=list(1 13 18));     /* p-w linear */
   *effect spl = spline(week / degree=2 knotmethod=list(1 13 13 18));  /* p-w quadratic */
   model mpg = spl / solution;  
   output out=out predicted=Pred;
quit;
 
title "Piecewise Linear Regression with Kink";
proc sgplot data=Out noautolegend;
   block x=week block=period / transparency=0.8;
   scatter x=week y=mpg / markerattrs=(symbol=CircleFilled color=black);
   series x=week y=pred / lineattrs=(thickness=3) ;
run;
Piecewise Linear Regression with Kink

The graph shows that the model is piecewise linear, but that the slope of the model changes at week=13. In contrast, the second EFFECT statement in the PROC GLIMMIX code (which is commented out) specifies piecewise quadratic polynomials (d=2) and repeats the knot at week=13. That results in two quadratic models that give the same predicted value at week=13 but the model is not smooth at that location. Try it out!

If you are using a SAS procedure that does not support the EFFECT statement, you can use the GLIMMIX procedure to output the dummy variables that are associated with the spline effects, as sketched below. A nice paper by David Pasta (2003) describes how to use dummy variables in a variety of models. The paper was written before the EFFECT statement; many of the ideas in the paper are easier to implement by using the EFFECT statement.
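The following sketch (data set name chosen for illustration) uses the OUTDESIGN= option of PROC GLIMMIX to write the spline basis columns to a data set, which you could then use as explanatory variables in a procedure that lacks the EFFECT statement:

proc glimmix data=MPG outdesign=SplineBasis;
   effect spl = spline(week / degree=1 knotmethod=list(1 13 18));
   model mpg = spl;
run;
 
proc print data=SplineBasis(obs=5); run;   /* inspect the spline basis (dummy) columns */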

Lastly, the TRANSREG procedure in SAS supports spline effects but has its own syntax. See the TRANSREG documentation, which includes an example of repeating knots to build a regression model for discontinuous data.

Have you ever needed to construct a nonsmooth regression model? Tell your story by leaving a comment.

The post Nonsmooth models and spline effects appeared first on The DO Loop.