data analysis

5月 162018

In my article about how to construct calibration plots for logistic regression models in SAS, I mentioned that there are several popular variations of the calibration plot. The previous article showed how to construct a loess-based calibration curve. Austin and Steyerberg (2013) recommend the loess-based curve on the basis of an extensive simulation study. However, some practitioners prefer to use a decile calibration plot. This article shows how to construct a decile-based calibration curve in SAS.

The decile calibration plot

The decile calibration plot is a graphical analog of the Hosmer-Lemeshow goodness-of-fit test for logistic regression models. The subjects are divided into 10 groups by using the deciles of the predicted probability of the fitted logistic model. Within each group, you compute the mean predicted probability and the mean of the empirical binary response. A calibration plot is a scatter plot of these 10 ordered pairs, although most calibration plots also include the 95% confidence interval for the proportion of the binary responses within each group. Many calibration plots connect the 10 ordered pairs with piecewise line segments, others use a loess curve or a least squares line to smooth the points.

Create the decile calibration plot in SAS

The previous article simulated 500 observations from a logistic regression model logit(p) = b0 + b1*x + b2*x2 where x ~ U(-3, 3). The following call to PROC LOGISTIC fits a linear model to these simulated data. That is, the model is intentionally misspecified. A call to PROC RANK creates a new variable (Decile) that identifies the deciles of the predicted probabilities for the model. This variable is used to compute the means of the predicted probabilities and the empirical proportions (and 95% confidence intervals) for each decile:

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim noprint;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
/* To construct the decile calibration plot, identify deciles of the predicted prob. */
proc rank data=LogiOut out=LogiDecile groups=10;
   var PredProb;
   ranks Decile;
/* Then compute the mean predicted prob and the empirical proportions (and CI) for each decile */
proc means data=LogiDecile noprint;
   class Decile;
   types Decile;
   var y PredProb;
   output out=LogiDecileOut mean=yMean PredProbMean
          lclm=yLower uclm=yUpper;
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiDecileOut noautolegend aspect=1;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   *loess x=PredProbMean y=yMean;  /* if you want a smoother based on deciles */
   series x=PredProbMean y=yMean;  /* if you to connect the deciles */
   scatter x=PredProbMean y=yMean / yerrorlower=yLower yerrorupper=yUpper;
   yaxis label="Observed Probability of Outcome";
   xaxis label="Predicted Probability of Outcome";
Decile calibration curve for a misspecified logistic regression model

The diagonal line is the line of perfect calibration. In a well-calibrated model, the 10 markers should lie close to the diagonal line. For this example, the graph indicates that the linear model does not fit the data well. For the first decile of the predicted probability (the lowest predicted-risk group), the observed probability of the event is much higher than the mean predicted probability. For the fourth, sixth, and seventh deciles, the observed probability is much lower than the mean predicted probability. For the tenth decile (the highest predicted-risk group), the observed probability is higher than predicted. By the way, this kind of calibration is sometimes called internal calibration because the same observations are used to fit and assess the model.

The decile calibration plot for a correctly specified model

You can fit a quadratic model to the data to see how the calibration plot changes for a correctly specified model. The results are shown below. In this graph, all markers are close to the diagonal line, which indicates a very close agreement between the predicted and observed probabilities of the event.

Decile calibration curve for a correctly specified logistic regression model

Should you use the decile calibration curve?

The decile-based calibration plot is popular, perhaps because it is so simple that it can be constructed by hand. Nevertheless, Austin and Steyerberg (2013) suggest using the loess-based calibration plot instead of the decile-based plot. Reasons include the following:

  • The use of deciles results in estimates that "display greater variability than is evident in the loess-based method" (p. 524).
  • Several researchers have argued that the use of 10 deciles is arbitrary. Why not use five? Or 15? In fact, the results of the Hosmer-Lemeshow test "can depend markedly on the number of groups, and there's no theory to guide the choice of that number." (P. Allison, 2013. "Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression")

Many leading researchers in logistic regression do not recommend the Hosmer-Lemeshow test for these reasons. The decile-based calibration curve shares the same drawbacks. Since SAS can easily create the loess-based calibration curve (see the previous article), there seems to be little reason to prefer the decile-based version.

The post Decile calibration plots in SAS appeared first on The DO Loop.

5月 142018

A logistic regression model is a way to predict the probability of a binary response based on values of explanatory variables. It is important to be able to assess the accuracy of a predictive model. This article shows how to construct a calibration plot in SAS. A calibration plot is a goodness-of-fit diagnostic graph. It enables you to qualitatively compare a model's predicted probability of an event to the empirical probability.

Calibration curves

There are two standard ways to assess the accuracy of a predictive model for a binary response: discrimination and calibration. To assess discrimination, you can use the ROC curve. Discrimination involves counting the number of true positives, false positive, true negatives, and false negatives at various threshold values. In contrast, calibration curves compare the predicted probability of the response to the empirical probability. Calibration can help diagnose lack of fit. It also helps to rank subjects according to risk.

This article discusses internal calibration, which is the agreement on the sample used to fit the model. External calibration, which involves comparing the predicted and observed probabilities on a sample that was not used to fit the model, is not discussed.

In the literature, there are two types of calibration plots. One uses a smooth curve to compare the predicted and empirical probabilities. The curve is usually a loess curve, but sometimes a linear regression curve is used. The other (to be discussed in a future article) splits the data into deciles. An extensive simulation study by Austin and Steyerberg (2013) concluded that loess-based calibration curves have several advantages. This article discusses how to use a loess fit to construct a calibration curve.

A calibration curve for a misspecified model

Calibration curves help to diagnose lack of fit. This example simulates data according to a quadratic model but create a linear fit to that data. The calibration curve should indicate that the model is misspecified.

Step 1: Simulate data for a quadratic logistic regression model

Austin and Steyerberg (2013) use the following simulated data, which is based on the earlier work of Hosmer et al. (1997). Simulate X ~ U(-3, 3) and define η = b0 + b1*X + b2*X2) to be a linear predictor, where b0 = -2.337, b1 = 0.8569, b2 = 0.3011. The logistic model is logit(p) = η, where p = Pr(Y=1 | x) is the probability of the binary event Y=1. The following SAS DATA step simulates N=500 observations from this logistic regression model:

/* simulate sample of size N=500 that follows a quadratic model from Hosmer et al. (1997) */
%let N = 500;                              /* sample size */ 
data LogiSim(keep=Y x);
call streaminit(1234);                     /* set the random number seed */
do i = 1 to &N;                            /* for each observation in the sample */
   /* Hosmer et al. chose x ~ U(-3,3). Use rand("Uniform",-3,3) for modern versions of SAS */
   x = -3 + 6*rand("Uniform");
   eta = -2.337 + 0.8569*x + 0.3011*x**2;  /* quadratic model used by Hosmer et al. */
   mu = logistic(eta);
   Y = rand("Bernoulli", mu);

For this simulated data, the "true" model is known to be quadratic. Of course, in practice, the true underlying model is unknown.

Step 2: Fit a logistic model

The next step is to fit a logistic regression model and save the predicted probabilities. The following call to PROC LOGISTIC intentionally fits a linear model. The calibration plot will indicate that the model is misspecified.

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim plots=effect;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
Plot of predicted probabilities for a logistic regression model in SAS

As shown above, PROC LOGISTIC can automatically create a fit plot for a regression that involves one continuous variable. The fit plot shows the observed responses, which are plotted at Y=0 (failure) or Y=1 (success). The predicted probabilities are shown as a sigmoidal curve. The predicted probabilities are in the range (0, 0.8). The next section creates a calibration plot, which is a graph of the predicted probability versus the observed response.

Create the calibration plot

The output from PROC LOGISTIC includes the predicted probabilities and the observed responses. Copas (1983), Harrell et al. (1984), and others have suggested assessing calibration by plotting a scatter plot smoother and overlaying a diagonal line that represents the line of perfect calibration. If the smoother lies close to the diagonal, the model is well calibrated. If there is a systematic deviation from the diagonal line, that indicates that the model might be misspecified.

The following statements sort the data by the predicted probabilities and create the calibration plot. (The sort is optional for the loess smoother, but is useful for other analyses.)

proc sort data=LogiOut;  by PredProb;  run;
/* let the data choose the smoothing parameter */
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiOut noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm;   /* smoothing value by AICC (0.657) */
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
Calibration plot for a misspecified logistic model

The loess smoother does not lie close to the diagonal line, which indicates that the linear model is not a good fit for the quadratic data. The loess curve is systematically above the diagonal line for very small and very large values of the predicted probability. This indicates that the empirical probability (the proportion of events) is higher than predicted on these intervals. In contrast, when the predicted probability is in the range [0.1, 0.45], the empirical probability is lower than predicted.

Several variations of the calibration plot are possible:

  • Add the NOMARKERS option to the LOESS statement to suppress the observed responses.
  • Omit the CLM option if you do not want to see a confidence band.
  • Use the SMOOTH=0.75 option if you want to specify the value for the loess smoothing parameter. The value 0.75 is the value that Austin and Steyerberg used in their simulation study, in part because that is the default parameter value in the R programming language.
  • Use a REG statement instead of the LOESS statement if you want a polynomial smoother.

A calibration plot for a correctly specified model

In the previous section, the model was intentionally misspecified. The calibration curve for the example did not lie close to the diagonal "perfect calibration" line, which indicates a lack of fit. Let's see how the calibration curve looks for the same data if you include a quadratic term in the model. First, generate the predicted probabilities for the quadratic model, then create the calibration curve for the model:

proc logistic data=LogiSim noprint;
   model Y(event='1') = x x*x;  /* fit a quadratic model */
   output out=LogiOut2 predicted=PredProb;
proc sort data=LogiOut2;  by PredProb;  run;
title "Calibration Plot for a Correct Model";
proc sgplot data=LogiOut2 noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   yaxis grid; xaxis grid;
Calibration plot for a correctly specified logistic model

This calibration plot indicates that the quadratic model fits the data well. The calibration curve is close to the diagonal reference line, which is the line of perfect calibration. The curve indicates that the predicted and empirical probabilities are close to each other for low-risk, moderate-risk, and high-risk subjects.

It is acceptable for the calibration curve to have small dips, bends, and wiggles. The extent that the curve wiggles is determined by the (automatically chosen) smoothing parameter for the loess algorithm, The value of the smoothing parameter is not displayed by PROC SGPLOT, but you can use PROC LOESS to display the smoothing parameter and other statistics for the loess smoother.

According to Austin and Steyerberg (2013), the calibration curve can detect misspecified models when the magnitude of the omitted term (or terms) is moderate to strong. The calibration curve also behaves better for large samples and when the incidence of outcomes is close to 50% (as opposed to having rare outcomes). Lastly, there is a connection between discrimination and calibration: The calibration curve is most effective in models for which the discrimination (as measured by the C-statistic or area under the ROC curve) is good.

In summary, this article shows how to construct a loess-based calibration curve for logistic regression models in SAS. To create a calibration curve, use PROC LOGISTIC to output the predicted probabilities for the model and plot a loess curve that regresses the observed responses onto the predicted probabilities.

My next article shows how to construct a popular alternative to the loess-based calibration curve. In the alternative version, the subjects are ranked into deciles according to the predicted probability or risk.

Further reading

The post Calibration plots in SAS appeared first on The DO Loop.

5月 022018

Order matters. When you create a graph that has a categorical axis (such as a bar chart), it is important to consider the order in which the categories appear. Most software defaults to alphabetical order, which typically gives no insight into how the categories relate to each other.

Alphabetical order rarely adds any insight to the data visualization. For a one-dimensional graph (such as a bar chart) software often provides alternative orderings. For example, in SAS you can use the CATEGORYORDER= option to sort categories of a bar chart by frequency. PROC SGPLOT also supports the DISCRETEORDER= option on the XAXIS and YAXIS statements, which you can use to specify the order of the categories.

It is much harder to choose the order of variables for a two-dimensional display such as a heat map (for example, of a correlation matrix) or a matrix of scatter plots. Both of these displays are often improved by strategically choosing an order for the variables so that adjacent levels have similar properties. Catherine Hurley ("Clustering Visualizations of Multidimensional Data", JCGS, 2004) discusses several ways to choose a particular order for p variables from among the p! possible permutations. This article implements the single-link clustering technique for ordering variables. The same technique is applicable for ordering levels of a nominal categorical variable.

Order variables in two-dimensional displays

I first implemented the single-link clustering method in 2008, which was before I started writing this blog. Last week I remembered my decade-old SAS/IML program and decided to blog about it. This article discusses how to use the program; see Hurley's paper for details about how the program works.

To illustrate ordering a set of variables, the following program creates a heat map of the correlation matrix for variables from the Sashelp.Cars data set. The variables are characteristics of motor vehicles. The rows and columns of the matrix display the variables in alphabetical order. The (i,j)th cell of the heat map visualizes the correlation between the i_th and j_th variable:

proc iml;
/* create correlation matric for variables in alphabetical order */
varNames = {'Cylinders' 'EngineSize' 'Horsepower' 'Invoice' 'Length' 
            'MPG_City' 'MPG_Highway' 'MSRP' 'Weight' 'Wheelbase'};
use Sashelp.Cars;  read all var varNames into Y;  close;
corr = corr(Y);
call HeatmapCont(corr) xvalues=varNames yvalues=varNames 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation Matrix: Variables in Alphabetical Order";
Heat map of correlation matrix; variables in alphabetical order

The heat map shows that the fuel economy variables (MPG_City and MPG_Highway) are negatively correlated with most other variables. The size of the vehicle and the power of the engine are positively correlated with each other. The correlations with the price variables (Invoice and MSRP) are small.

Although this 10 x 10 heat map visualizes all pairwise correlations, it is possible to permute the variables so that highly correlated variables are adjacent to each other. The single-link cluster method rearranges the variables by using a symmetric "merit matrix." The merit matrix can be a correlation matrix, a distance matrix, or some other matrix that indicates how the i_th category is related to the j_th category.

The following statements assume that the SingleLinkFunction is stored in a module library. The statements load and call the SingleLinkCluster function. The "merit matrix" is the correlation matrix so the algorithm tries to put strongly positively correlated variables next to each other. The function returns a permutation of the indices. The permutation induces a new order on the variables. The correlation matrix for the new order is then displayed:

/* load the module for single-link clustering */
load module=(SingleLinkCluster);
permutation = SingleLinkCluster( corr );  /* permutation of 1:p */
V = varNames[permutation];                /* new order for variables */
R = corr[permutation, permutation];       /* new order for matrix */
call HeatmapCont(R) xvalues=V yvalues=V 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation: Variables in Clustered Order";
Heat map of correlation matrix; variables in clustered order

In this permuted matrix, the variables are ordered according to their pairwise correlation. For this example, the single-link cluster algorithm groups together the three "size" variables (Length, Wheelbase, and Weight), the three "power" variables (EngineSize, Cylinders, and Horsepower), the two "price" variables (MSRP and Invoice), and the two "fuel economy" variables (MPG_Highway and MPG_City). In this representation, strong positive correlations tend to be along the diagonal, as evidenced by the dark green colors near the matrix diagonal.

The same order is useful if you want to construct a huge scatter plot matrix for these variables. Pairs of variables that are "interesting" tend to appear near the diagonal. In this case, the definition of "interesting" is "highly correlated," but you could choose some other statistic to build the "merit matrix" that is used to cluster the variables. The scatter plot matrix is shown below. Click to enlarge.

Scatter plot matrix of 10 variables. The highly correlated variable tend to appear together so that diagonal scatter plot exhibit high correlation.

Other merit matrices

I remembered Hurley's article while I was working on an article about homophily in married couples. Recall that homophily is the tendency for similar individuals to associate, or in this case to marry partners that have similar academic interests. My heat map used the data order for the disciplines, but I wondered whether the order could be improved by using single-link clustering. The data in the heat map are not symmetric, presumably because women and men using different criteria to choose their partners. However, you can construct the "average homophily matrix" in a natural way. If A is any square matrix then (A` + A)/2 is a symmetric matrix that averages the lower- and upper-triangular elements of A. The following image shows the "average homophily" matrix with the college majors ordered by clusters:

The heat map shows the "averaged data," but in a real study you would probably want to display the real data. You can see that green cells and red cells tend to occur in blocks and that most of the dark green cells are close to the diagonal. In particular, the order of adjacent majors on an axis gives information about affinity. For example, here are a few of the adjacent categories:

  • Biology and Medical/Health
  • Engineering tech, computers, and engineering
  • Interdisciplinary studies, philosophy, and theology/religion
  • Math/statistics and physical sciences
  • Communications and business


In conclusion, when you create a graph that has a nominal categorical axis, you should consider whether the graph can be improved by ordering the categories. The default order, which is often alphabetical, makes it easy to locate a particular category, but there are other orders that use the data to reveal additional structure. This article uses the single-link cluster method, but Hurley (2004) mentions several other methods.

The post Order variables in a heat map or scatter plot matrix appeared first on The DO Loop.

4月 302018

Some say that opposites attract. Others say that birds of a feather flock together. Which is it? Phillip N. Cohen, a professor of sociology at the University of Maryland, recently posted an interesting visualization that indicates that married couples who are college graduates tend to be birds of a feather. This article discusses Cohen's data and demonstrates some SAS techniques for assigning colors to a heat map, namely how to log-transform the response and how to bin the response variable by quantiles.

Cohen's data and heat map

Cohen's (2018) haet map of the ratio of observed to expected count of marriages between women and men, by college major

Cohen cross-tabulated the college majors of 27,806 married couples from 2009–2016. The majors are classified into 28 disciplines such as Architecture, Business, Computer Science, Engineering, and so forth. The data are counts in a 28 x 28 table. If the (i,j)th cell in the table has the value C[i,j], then there are C[i,j] married couples in the data for which the wife had the i_th major and the husband has the j_th major.

Cohen used a heat map (shown to the right) to visualize associations among the wife's and husband's majors. This heat map shows the ratio of observed counts to expected counts, where the expected counts are computed by assuming the wife's major and the husband's major are independent. In this heat map, the woman's major is listed on the vertical axis; the man's major is on the horizontal axis. Green indicates more marriages than expected for that combination of majors, yellow indicates about as many marriages as expected, and red indicates fewer marriages than expected. The prevalence of green along the diagonal cells in the heat map indicates that many college-educated women marry a man who studied a closely related discipline. The couples are birds of a feather.

Cohen's heat map displays ratios. A ratio of 1 means that the observed number of married couples equals the expected number. For these data, the ratio ranges from 0 (no observed married couples) to 31. The highest ratio was observed for women and men who both majored in Theology and Religious Studies. The observed number of couples is 31 times more than expected under independence. (See the 19th element along the diagonal.)

A heat map of relative differences

You might wonder why Cohen used the ratio instead of the difference between observed and expected. One reason to form a ratio is that it adjusts for the unequal popularity of majors. Business, engineering, and education are popular majors. Many fewer people major in religion and ethnic studies. If you don't account for the differences in popularity, you will see only the size of the majors, not the size of the differences.

Heat map of relative differences observed to expected counts of marriages between women and men, by college major

A different way to view the data is to visualize the relative difference: (Observed – Expected) / Expected. The relative differences range from -1 (no observed couples) to 30 (for Theology majors). In percentage terms, this range is from -100% to 3000%. If you create a heat map of the relative difference and linearly associate colors to the relative differences, only the very largest differences are visible, as shown in the heat map to the right.

This heat map is clearly not useful, but it illustrates a very common problem that occurs when you use a linear scale to assign colors to heat maps and choropleth maps. Often the data vary by several orders of magnitude, which means that a linear scale is not going to reveal subtle features in the data. As shown in this example, you typically see a small number of cells that have large values and the remaining cell values are indistinguishable.

There are two common ways to handle this situation:

  • If your audience is mathematically savvy, use a nonlinear transformation to map the values into a linear color ramp. A logarithmic transformation is the most common way to accomplish this.
  • Otherwise, bin the data into a small number of discrete categories and use a discrete color ramp to assign colors to each of the binned values. For widely varying data, quantiles are often an effective binning strategy.

Log transformation of counts

A logarithmic transform (base 10) is often the simplest way to visualize data that span several orders of magnitudes. When the data contain both positive and negative values (such as the relative differences), you can use the log-modulus transform to transform the data. The log-modulus transform is the transformation x → sign(x) * log10(|x| + 1). It preserves the sign of the data while logarithmically scaling it. The following heat map shows the marriage data where the colors are proportional to the log-modulus transform of the relative difference between the observed and expected values.

Heat map of the log-modulus transform of the relative differences between observed and expected counts of marriages between women and men, by college major

As in Cohen's heat map of the ratios, mapping a continuous color ramp onto the log-transformed values divides the married couples into three visual categories: The reddish-orange cells indicate fewer marriages than expected, the cream-colored cells indicate about as many marriages as expected, and the light green and dark green cells indicate many more marriages than expected. The drawback of the log transformation is that the scale of the gradient color ramp is not easy to understand. I like to add tooltips to the HTML version of the heat map so that the user can hover the mouse pointer over a cell to see the underlying values.

Analysis of the marriage data

What can we learn about homophily (the tendency for similar individuals to associate) from studying this heat map? Here are a few observations. Feel free to add your own ideas in the comments:

  • The strongest homophily is along the diagonal elements. In particular, Theology/Religion majors are highly likely to marry each other, as are agriculture, environment, and architecture majors.
  • Female engineers, computer scientists, and mathematicians tend to marry like-minded men. Notice the large number of red cells in those rows for the non-mathematical disciplines.
  • Women in medical/health field or business are eclectic. Those rows do not have many dark-shaded cells, which indicates a general willingness to associate with men inside and outside of their subject area.

Discrete heat map by quantiles

It is easiest to interpret a heat map when the colors correspond to the actual counts or simple statistics such as the relative difference. The simplest way to assign colors to the cells in the heat map is to bin the data. This results in a discrete set of colors rather than the continuous color ramp in the previous example. You can bin the data by using five to seven quantiles. Alternatively, you can use domain-specific or clinical ranges. (For example, you could use clinical "cut points" to bin blood pressure data into "normal," "elevated," and "hypertensive" categories. In SAS, you can use PROC FORMAT to create a customized format for the response variable and color according to the formatted values.

The following heat map uses cut points to bin the data according to the relative difference between observed and expected values. As before, red and orange cells indicate observed differences that are negative. The cream and light green cells indicate relative differences that are close to zero. The darker greens indicate observed differences that are a substantial fraction of the expected values. The darkest green indicates cells for which the difference is more than 100% greater than the expected value.

When you use a discrete color ramp, all values within a specified interval are mapped to the same color. In this heat map, most cells on the diagonal have the same color; you cannot use the heat map to determine which cell has the largest relative difference.

Heat map of the binned relative differences between observed and expected counts of marriages between women and men, by college major


There is much more to say about these data and about ways to use heat maps to discover features in the data. However, I'll stop here. The main ideas of this article are:

  • The data are interesting. In general, the married couples in this study tended to have similar college majors. The strength of that relationship varied according to the discipline and also by gender (the heat map is not symmetric). For more about the data, see Professor Cohen's web page for this study.
  • When the response variable varies over several orders of magnitude, a linear mapping of colors to values is not adequate. One way to accommodate widely varying data is to log-transform the response. If the response has both positive and negative values, you can use the log-modulus transform.
  • For a less sophisticated audience, you can assign colors by binning the data and using a discrete color ramp.

You can download the SAS program that I used to create this heat maps.

The post Assign colors in heat maps: A study of married couples and college majors appeared first on The DO Loop.

4月 252018

SAS programmers on SAS discussion forums sometimes ask how to run thousands of regressions of the form Y = B0 + B1*X_i, where i=1,2,.... A similar question asks how to solve thousands of regressions of the form Y_i = B0 + B1*X for thousands of response variables. I have previously written about how to solve the first problem by converting the data from wide form to a long form and using the BY statement in PROC REG to solve the thousands of regression problems in a single call to the procedure. My recent description of the SWEEP operator (which is implemented in SAS/IML software), provides an alternative technique that enables you to analyze both regression problems when the data are in the wide format.

Why would you anyone need to run 1,000 regressions?

Why anyone wants to solve thousands of linear regression problems? I can think of a few reasons:

  1. In the exploratory phase of an analysis, you might want to know which of thousands of variables are linearly related to a response. Running the regressions is a quick way to filter or "screen out" variables that do not have a strong linear association with the response.
  2. In a simulation study, each Y_i might be an independent draw from the same distribution. Running thousands of regressions enables you to approximate the sampling distribution of the parameter estimates.
  3. In a bootstrap analysis of regression estimates, you can perform case resampling or model-based resampling. In case resampling, you generate many X_i by resampling from the original X variable. In model-based resampling, you keep the X fixed and resample thousands of Y_i. In either case, Running thousands of regressions enables you to estimate the precision of the parameter estimates.

Many univariate regressions

In a previous article, I showed how to simulate data that satisfies a regression model. I created a data set that contains explanatory variables X1-X1000 and a single response variable, Y. You can download the SAS program that computes the data and performs the regression.

The following SAS/IML program reads the simulated data into a large matrix, M. For each regression, it forms the three-column matrix A from the intercept column, the k_th explanatory variable, and the variable Y. It then forms the sum of squares and cross products (SSCP) matrix (A`*A) and uses the SWEEP function to solve the least squares regression problem. The two parameter estimates (intercept and slope) for each explanatory variable are saved in a 2 x 1000 array. A few parameter estimates are displayed.

/* program to compute parameter estimates for Y = b0 + b1*X_k, k=1,2,... */
proc iml;
XVarNames = "x1":"x&nCont";      /* names of explanatory variables */
p = ncol(XVarNames );            /* number of X variables, excluding intercept */
varNames = XVarNames || "Y";     /* name of all data variables */
use Wide;  read all var varNames into M;  close; /* read data into M */
A = j(nrow(M), 3, 1);            /* columns for intercept, X_k, Y */
A[ ,3] = M[ , ncol(M)];          /* put Y in 3rd col */
ParamEst =  j(2, p, .);          /* allocate 2 x p matrix for estimates */
do k = 1 to ncol(XVarNames);
   A[ ,2] = M[ ,k];              /* X_k in 2nd col */
   S1 = sweep(A`*A, {1 2});      /* sweep in intercept and X_k */
   ParamEst[ ,k] = S1[{1 2}, 3]; /* estimates for k_th model Y = b0 + b1*X_k */
print (paramEst[,1:6])[r={"Intercept" "Slope"} c=XVarNames];
Run thousands of regressions in SAS and obtain parameter estimates

You could perform a similar loop for models that contain multiple variables, such as all two-variable main-effect models of the form Y = b0 + b1*X_k + b2*X_j, where k ≠ j. You can use the ALLCOMB function in SAS/IML to choose the combinations of columns to sweep.

One large multivariate regression

You can also use the SWEEP operator to perform the regression of many responses onto a single explanatory variable. This case is easy in PROC REG because the procedure supports multiple response variables. The analysis is also easy if you use the SWEEP function in SAS/IML. It only requires a single function call!

The following DATA step simulates 1,000 response variables according to the model Y_i = b0 + b1*X + ε, where b0 = 39.07, b1 = 0.902, and ε ~ N(0, 5.71) is a normally distributed random error term. These values are the parameter estimates for the regression of SepalLength onto SepalWidth for the virginica species of iris flowers in the Sashelp.Iris data set. For more about simulating regression models, see Chapter 11 of Wicklin (2013).

/* based on Wicklin (2013, p. 203) */
%let NumSim = 1000;                  /* number of Y variables */
data RegSim(drop=RMSE b0 b1 k Species);
set Sashelp.Iris(where =(Species="Virginica") keep=Species SepalWidth
RMSE = 5.71; b0 = 39.07; b1 = 0.902; /* param estimates for model SepalLength = SepalWidth */
array Y[&NumSim];
call streaminit(321);
do k = 1 to &NumSim;
   Y[k] = b0 + b1*X + rand("Normal", 0, RMSE); /* simulate responses from model */

After generating the data, you can compute all 1,000 parameter estimates by using a single call to the SWEEP function in SAS/IML. The following program forms a matrix M for which the first column is an intercept term, the second column is the X variable, and the next 1,000 columns are the simulated responses. Sweeping the first two rows of the M`*M matrix computes the 1,000 parameter estimates, which are then graphed in a scatter plot.

/* program to compute parameter estimates for Y_k = b0 + b1*X, k=1,2,... */
proc iml;
YVarNames = "Y1":"Y&numSim";     /* names of explanatory variables */
varNames = "X" || YVarNames;     /* name of all data variables */
use RegSim;  read all var varNames into M;  close;
M = j(nrow(M), 1, 1) || M;       /* add intercept column */
S1 = sweep(M`*M, {1 2});         /* sweep in intercept and X */
ParamEst = S1[{1 2}, 3:ncol(M)]; /* estimates for models Y_i = X */
title "Parameter Estimates for 1,000 Simulated Responses";
call scatter(ParamEst[1,], ParamEst[2,]) label={"Intercept" "X"}
     other="refline 39.07/axis=x; refline 0.902/axis=y;";
Run thousands of regression in SAS and obtain parameter estimates

The graph shows the parameter estimates for each of the 1,000 linear regressions. The distribution of the estimates provides more information about the sampling distribution than the estimate of standard error that PROC REG produces when you run a regression model. The graph visually demonstrates the "covariance of the estimates," which PROC REG estimates if you use the COVB option on the MODEL statement.

In summary, you can use the SWEEP operator (implemented in the SWEEP function in SAS/IML) to efficiently compute thousands of regression estimates for wide data. This article demonstrates how to compute models that have many explanatory variables (Y = b0 + b1* X_i) and models that have many response variables (Y_i = b0 + b1 * X). Are there drawbacks to this approach? Sure. You don't automatically get the dozens of ancillary regression statistics like R squared, adjusted R squared, p-values, and so forth. Also, PROC REG automatically handles missing values, whereas in SAS/IML you must first extract the complete cases before you try to form the SSCP matrix. Nevertheless, this computation can be useful for simulation studies in which the data are simulated and analyzed within SAS/IML.

Download the SAS programs for this article.

The post An easier way to run thousands of regressions appeared first on The DO Loop.

4月 232018

You've probably heard about the "80-20 Rule," which describes many natural and manmade phenomena. This rule is sometimes called the "Pareto Principle" because it was discovered by Vilfredo Pareto (1848–1923) who used it to describe the unequal distribution of wealth. Specifically, in his study, 80% of the wealth was held by 20% of the population. The unequal distribution of effort, resources, and other quantities can often be described in terms of the Pareto distribution, which has a parameter that controls the inequity. Whereas some data seem to obey an 80-20 rule, other data are better described as following a "70-30 rule" or a "60-40 rule," and so on. (Although the two numbers do not need to add to 100, it makes the statement pleasingly symmetric.)

I thought about the Pareto distribution recently when I was looking at some statistics for the SAS blogs. I found myself wondering whether 80% of the traffic to a blog is generated by 20% of its posts. (Definition: A blog is a collection of articles. Each individual article is a post.) As you can imagine, some posts are more popular than others. Some posts are timeless. They appear in internet searches when someone searches for a particular statistical or programming technique, even though they were published long ago. Other posts (such as Christmas-themed posts) do not get much traffic from internet searches. They generate traffic for a week or two and then fade into oblivion.

To better understand how various blogs at SAS follow the Pareto Principle, I downloaded data for seven blogs during a particular time period. I then kept only the top 100 posts for each blog.

The following line plot shows the pageviews for one blog. The horizontal axis indicates the posts for this blog, ranked by popularity. (The most popular post is 1, the next most popular post is 2, and so forth.) This blog has four very popular posts that each generated more than 7,000 pageviews during the time period. Another group of four posts were slightly less popular (between 4,000 and 5,000 pageviews). After those eight "blockbuster" posts are the rank-and-file posts that were viewed less than 3,000 times during the time period.

Long-tailed distribution of pageviews for a blog

The pageviews for the other blogs look similar. This distribution is also common in book-sales data: most books sell only a few thousand copies whereas the best-sellers (think John Grisham) sell hundreds of millions of copies. Movie revenue is another example that follows this distribution.

The distribution is a long-tailed distribution, a fact that becomes evident if you graph a histogram of the underlying quantity. For the blog, the following histogram shows the distribution of the pageviews for the top 100 posts:

Long-tailed distribution of pageviews for a blog

Notice the "power law" nature of the distribution for the first few bars of the histogram. The height of each bar is about 0.4 of the previous height. About 57% of the blog posts had less than 1,000 pageviews. Another 22% had between 1,000 and 2,000 pageviews. The number of rank-and-file posts in each category decreases like a power law, but then the blockbuster posts start to appear. These popular posts give the distribution a very long tail.

Because some blogs (like the one pictured) attract thousands of readers whereas other blogs have fewer readers, you need to standardize the data if you want to compare the distributions for several blogs. Recall that the Pareto Principle is a statement about cumulative percentages. The following graph shows the cumulative percentages of pageviews for seven blogs at SAS (based on the Top 100 posts):

Cumulative percentages of pageviews for seven blogs

The graph shows a dashed line with slope –1 that is overlaid on the cumulative percentage curves. The places where the dashed line intersects a curve are the "Pareto locations" for which Y% of the pageviews are generated by the first (1-Y)% most popular blog posts. In general, these blogs appear to satisfy a "70-30 rule" because about 70% of the pageviews are generated by the 30 / 100 most popular posts. There is some variation between blogs, with the upper curve following a "72-28 rule" and the lower curve satisfying a "63-37 rule."

All this is approximate and is based on only the top 100 posts for each blog. For a more rigorous analysis, you could use PROC UNIVARIATE or PROC NLMIXED to fit the parameters of the Pareto distribution to the data for each blog. However, I am happy to stop here. In general, blogs at SAS satisfy an approximate "70-30 rule," where 70% of the traffic is generated by the top 30% of the posts.

The post The 80-20 rule for blogs appeared first on The DO Loop.

4月 042018

Correlation is a statistic that measures how closely two variables are related to each other. The most popular definition of correlation is the Pearson product-moment correlation, which is a measurement of the linear relationship between two variables. Many textbooks stress the linear nature of the Pearson correlation and emphasize that a zero value for the Pearson correlation does not imply that the variables are independent. A classic example is to define X to be a random variable on [-1, 1] and define Y=X2. X and Y are clearly not independent, yet you can show that the Pearson correlation between X and Y is 0.

In 2007, G. Szekely, M. Rizzo, and N. Bakirov published a paper in the The Annals of Statistics called "Measuring and Testing Dependence by Correlation of Distances." This paper defines a distance-based correlation that can detect nonlinear relationships between variables. This blog post describes the distance correlation and implements it in SAS.

An overview of distance correlation

It is impossible to adequately summarize a 27-page paper from the Annals in a few paragraphs, but I'll try. The Szekely-Rizzo-Bakirov paper defines the distance covariance between two random variables. It shows how to estimate the distance correlation from data samples. A practical implication is that you can estimate the distance correlation by computing two matrices: the matrix of pairwise distances between observations in a sample from X and the analogous distance matrix for observations from Y. If the elements in these matrices co-vary together, we say that X and Y have a large distance correlation. If they do not, they have a small distance correlation.

For motivation, recall that the Pearson covariance between X and Y (which is usually defined as the inner product of two centered vectors) can be written in terms of the raw observations:
Classical covariance formula

The terms (xi – xj) and (yi – yj) can be thought of as the one-dimensional signed distances between the i_th and j_th observations. Szekely et al. replace those terms with centered Euclidean distances D(xi, xj) and define the distance covarariance as follows:
Distance covariance formula

The distance covariance between random vectors X and Y has the following properties:

  • X and Y are independent if and only if dCov(X,Y) = 0.
  • You can define the distance variance dVar(X) = dCov(X,X) and the distance correlation as dCor(X,Y) = dCov(X,Y) / sqrt( dVar(X) dVar(Y) ) when both variances are positive.
  • 0 ≤ dCor(X,Y) ≤ 1 for all X and Y. Note this is different from the Pearson correlation, for which negative correlation is possible.
  • dCov(X,Y) is defined for random variables in arbitrary dimensions! Because you can compute the distance between observations in any dimensions, you can compute the distance covariance regardless of dimensions. For example, X can be a sample from a 3-dimensional distribution and Y can be a sample from a 5-dimensional distribution.
  • You can use the distance correlation to define a statistical test for independence. I don't have space in this article to discuss this fact further.

Distance correlation in SAS

The following SAS/IML program defines two functions. The first "double centers" a distance matrix by subtracting the row and column marginals. The second is the distCorr function, which computes the Szekely-Rizzo-Bakirov distance covariance, variances, and correlation for two samples that each have n rows. (Recall that X and Y can have more than one column.) The function returns a list of statistics. This lists syntax is new to SAS/IML 14.3, so if you are running an older version of SAS, modify the function to return a vector.

proc iml;
start AdjustDist(A);    /* double centers matrix by subtracting row and column marginals */
   rowMean = A[:, ];
   colMean = rowMean`;  /* the same, by symmetry */
   grandMean = rowMean[:];
   A_Adj = A - rowMean - colMean + grandMean;
   return (A_Adj);
/* distance correlation: G. Szekely, M. Rizzo, and N. Bakirov, 2007, Annals of Statistics, 35(6) */
start DistCorr(x, y);
   DX = distance(x);    DY = distance(y); 
   DX = AdjustDist(DX); DY = AdjustDist(DY);
   V2XY = (DX # DY)[:];  /* mean of product of distances */
   V2X  = (DX # DX)[:];  /* mean of squared (adjusted) distance */
   V2Y  = (DY # DY)[:];  /* mean of squared (adjusted) distance */
   dCov = sqrt( V2XY );     /* distance covariance estimate */
   denom = sqrt(V2X * V2Y); /* product of std deviations */
   if denom > 0 then R2 = V2XY / denom;    /* R^2 = (dCor)^2 */
   else              R2 = 0;
   dCor = sqrt(R2);     
   T = nrow(DX)*V2XY;   /* test statistic p. 2783. Reject indep when T>=z */
   /* return List of resutls: */
   L = [#"dCov"=dCov,  #"dCor"=dCor,  #"dVarX"=sqrt(V2X), #"dVarY"=sqrt(V2Y), #"T"=T];
   /* or return a vector: L = dCov || dCor || sqrt(V2X) || sqrt(V2Y) || T; */
   return L;

Let's test the DistCorr function on two 4-element vectors. The following (X,Y) ordered pairs lie one the intersection of the unit circle and the coordinate axes. The Pearson correlation for these observations is 0 because there is no linear association. In contrast, the distance-based correlation is nonzero. The distance correlation detects a relationship between these points (namely, that they lie along the unit circle) and therefore the variables are not independent.

x = {1,0,-1, 0};
y = {0,1, 0,-1};
results = DistCorr(x, y);
/* get itenms from results */
dCov=results$"dCov";  dCor=results$"dCor";  dVarX=results$"dVarX"; dVarY=results$"dVarY";
/* or from vector: dCov=results[1];  dCor=results[2];  dVarX=results[3]; dVarY=results[4]; */
print dCov dCor dVarX dVarY;
Distance correlation for points on a circle

Examples of distance correlations

Let's look at a few other examples of distance correlations in simulated data. To make it easier to compare the Pearson correlation and the distance correlation, you can define two helper functions that return only those quantities.

Classic example: Y is quadratic function of X

The first example is the classic example (mentioned in the first paragraph of this article) that shows that a Pearson correlation of 0 does not imply independence of variables. The vector X is distributed in [-1,1] and the vector Y is defined as X2:

/* helper functions */
start PearsonCorr(x,y);
   return( corr(x||y)[1,2] );
start DCorr(x,y);
   results = DistCorr(X, Y);
   return( results$"dCor" );  /* if DistCorr returns vector: return( results[2] ); */
x = do(-1, 1, 0.1)`;    /* X is in [-1, 1] */
y = x##2;               /* Y = X^2 */
PCor = PearsonCorr(x,y);
DCor = DCorr(x,y);
print PCor DCor;
Example where Pearson correlation equals 0 but distance correlation is nonzero

As promised, the Pearson correlation is zero but the distance correlation is nonzero. You can use the distance correlation as part of a formal hypothesis test to conclude that X and Y are not independent.

Distance correlation for multivariate normal data

You might wonder how the distance correlation compares with the Pearson correlation for bivariate normal data. Szekely et al. prove that the distance correlation is always less than the absolute value of the population parameter: dCor(X,Y) ≤ |ρ|. The following statements generate a random sample from a bivariate normal distribution with Pearson correlation ρ for a range of positive and negative ρ values.

N = 500;                   /* sample size */
mu = {0 0};  Sigma = I(2); /* parameters for bivaraite normal distrib */
rho = do(-0.9, 0.9, 0.1);  /* grid of correlations */
PCor = j(1, ncol(rho), .); /* allocate vectors for results */
DCor = j(1, ncol(rho), .);
call randseed(54321);
do i = 1 to ncol(rho);     /* for each rho, simulate bivariate normal data */
   Sigma[1,2] = rho[i]; Sigma[2,1] = rho[i]; /* population covariance */
   Z = RandNormal(N, mu, Sigma);        /* bivariate normal sample */
   PCor[i] = PearsonCorr(Z[,1], Z[,2]); /* Pearson correlation */
   DCor[i] = DCorr(Z[,1], Z[,2]);       /* distance correlation */

If you graph the Pearson and distance correlation against the parameter values, you obtain the following graph:

Graph of Pearson correlation and distance correlation for samples of bivariate normal data with correlation rho

You can see that the distance correlation is always positive. It is close to, but in many cases less than, the absolute value of the Pearson estimate. Nevertheless, it is comforting that the distance correlation is closely related to the Pearson correlation for correlated normal data.

Distance correlation for data of different dimensions

As the last example, let's examine the most surprising property of the distance correlation, which is that it enables you to compute correlations between variables of different dimensions. In contrast, the Pearson correlation is defined only for univariate variables. The following statements generate two independent random normal samples with 1000 observations. The variable X is a bivariate normal sample. The variable Y is a univariate normal sample. The distance correlation for the sample is close to 0. (Because the samples were drawn independently, the distance correlation for the populations is zero.)

/* Correlation betwee a 2-D distribution and a 1-D distribution */
call randseed(12345, 1);          /* reset random number seed */
N = 1000;
mu = {0 0};
Sigma = { 1.0 -0.5, 
         -0.5  1.0};
X = RandNormal(N, mu, Sigma);     /* sample from 2-D distribution */
Y = randfun(N, "Normal", 0, 0.6); /* uncorrelated 1-D sample */
DCor = DCorr(X, Y);
print DCor;
Distance correlation for samples of different dimensions

Limitations of the distance correlation

The distance correlation is an intriguing idea. You can use it to test whether two variables (actually, sets of variables) are independent. As I see it, the biggest drawbacks of distance correlation are

  • The distance correlation is always positive because distances are always positive. Most analysts are used to seeing negative correlations when two variables demonstrate a negative linear relationship.
  • The distance correlation for a sample of size n must compute the n(n–1)/2 pairwise distances between observations. This implies that the distance correlation is an O(n2) operation, as opposed to Pearson correlation, which is a much faster O(n) operation. The implementation in this article explicitly forms the n x n distance matrix, which can become very large. For example, if n = 100,000 observations, each distance matrix requires more than 74 GB of RAM. There are ways to use less memory, but the distance correlation is still a relatively expensive computation.


In summary, this article discusses the Szekely-Rizzo-Bakirov distance-based covariance and correlation. The distance correlation can be used to create a statistical test of independence between two variables or sets of variables. The idea is interesting and appealing for small data sets. Unfortunately, the performance of the algorithm is quadratic in the number of observations, so the algorithm does not scale well to big data.

You can download the SAS/IML program that creates the computations and graphs in this article. If you do not have SAS/IML, T. Billings (2016) wrote a SAS macro that uses PROC DISTANCE to compute the distance correlation between two vectors. Rizzo and Szekely implemented their method in the 'energy' package of the R software product.

The post Distance correlation appeared first on The DO Loop.

3月 142018

One of my favorite magazines, Significance, printed an intriguing image of a symmetric matrix that shows repetition in a song's lyrics. The image was created by Colin Morris, who has created many similar images. When I saw these images, I knew that I wanted to duplicate the analysis in SAS!

Visualize repetition in lyrics: A simple example

The analysis is easy. Suppose that a song (or any text source) contains N words. Define the repetition matrix to be the N x N matrix where the (i,j)th cell has the value 1 if the i_th word is the same as the j_th word. Otherwise, the (i,j)th cell equals 0. Now visualize the matrix by using a heat map: Black indicates cells where the matrix is 1 and white for 0. A SAS program that performs this analysis is available at the end of this article.

To illustrate this algorithm, consider the nursery rhyme, "Row, Row, Row the Boat":

Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily
Life is but a dream.

There are 18 words in this song. Words 1–3 are repeated, as are words 1-–13. You can use the SAS DATA steps to read the words of the song into a variable and use other SAS functions to strip out any punctuation. You can then use SAS/IML software to construct and visualize the repetition matrix. The details are shown at the end of this article.

Visualize repetition in song lyrics: 'Row, Row, Row Your Boat'

The repetition matrix for the song "Row, Row, Row, Your Boat" is shown to the right. For this example I could have put the actual words along the side and bottom of the matrix, but that is not feasible for songs that have hundreds of words. Instead, the matrix has a numerical axis where the number indicates the position of each word in the song.

Every repetition matrix has 1s on the diagonal. In this song, the words "row" and "merrily" are repeated. Consequently, there is a 3 x 3 block of 1s at the top left and a 4 x 4 block of 1s in the middle of the matrix. (Click to enlarge.)

As mentioned, this song has very little repetition. One way to quantify the amount of repetition is to compute the proportion of 1s in the upper triangular portion of the repetition matrix. The upper triangular portion of an N x N matrix has N(N–1)/2 elements. For this song, N=18, so there are 153 cells and 9 of them are 1s. Therefore the "repetition score" is 9 / 153 = 0.059.

Another simple example of repetition in song lyrics

I wrote a SAS/IML function that creates and visualizes the repetition matrix and returns the repetition score. In order to visualize songs that might have hundreds of words, I suppress the outlines (the grid) in the heat map. To illustrate the output of the function, the following image visualizes the words of the song "Here We Go Round the Mulberry Bush":

Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.
Visualize repetition in song lyrics: 'Here We Go Round the Mulberry Bush'

The repetition score for this song is 0.087. You can see diagonal "stripes" that correspond to the repeating phrases "here we go round" and "the mulberry bush". In fact, if you study only the first seven rows, you can "see" almost the entire structure of the song. The first seven words contain all lyrics except for four words ("so", "early", "in", "morning").

Visualize Classic Song Lyrics

Let's visualize the repetitions in the lyrics of several classic songs.

Hey Jude (The Beatles)

When I saw Morris's examples, the first song I wanted to visualize was "Hey Jude" by the Beatles. Not only does the title phrase repeat throughout the song, but the final chorus ("Nah nah nah nah nah nah, nah nah nah, hey Jude") repeats more than a dozen times. This results in a very dense block in the lower right corner of the repetition matrix and a very high repetition score of 0.183. The following image visualizes "Hey Jude":

Visualize repetition in song lyrics: 'Hey Jude'

Love Shack (The B-52s)

The second song that I wanted to visualize was "Love Shack" by The B-52s. In addition to a title that repeats almost 40 times, the song contains a sequence near the end in which the phrase "Bang bang bang on the door baby" is alternated with various interjections. The following visualization of the repetition matrix indicates that there is a lot of variation interspersed with regular repetition. The repetition score is 0.035.

Visualize repetition in song lyrics: 'Love Shack'

Call Me (Blondie)

Lastly, I wanted to visualize the song "Call Me" by Blondie. This classic song has only 241 words, yet the title is repeated 41 times! In other words, about 1/3 of the song consists of those two words! Furthermore, there is a bridge in the middle of the song in which the phrase "oh oh oh oh oh" is alternated with other phrases (some in Italian and French) that appear only once in the song. The repetition score is 0.077. The song is visualized below:

Visualize repetition in song lyrics: 'Call Me'

How to create a repetition matrix in SAS

If you think this is a fun topic, you can construct these images yourself by using SAS. If you discover a song that has an interesting repetition matrix, post a comment!

Here's the basic idea of how to construct and visualize a repetition matrix. First, use the DATA step to read each word, use the COMPRESS function to remove any punctuation, and standardize the input by transforming all words to lowercase:
data Lyrics;
length word $20;
input word @@;
word = lowcase( compress(word, ,'ps') ); /* remove punctuation and spaces */
Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.

In SAS/IML software you can use the ELEMENT function to find the locations in the i_th row that have the value 1. After you construct a repetition matrix, you can use the HEATMAPDISC subroutine to display it. For example, the following SAS/IML program reads the words of the song into a vector and visualizes the repetition matrix. It also returns the repetition score, which is the proportion of 1s in the upper triangular portion of the matrix.

ods graphics / width=500 height=500 NXYBINSMAX=1000000;
proc iml;
/* define a function that creates and visualizes the repetition matrix */
start VizLyrics(DSName, Title);
   use (DSName); read all var _CHAR_ into Word;  close;
   N = nrow(Word);
   M = j(N,N,0);                       /* allocate N x N matrix */
   do i = 1 to N;
      M[,i] = element(Word, Word[i]);  /* construct i_th row */
   run heatmapdisc(M) title=Title
       colorramp={white black} displayoutlines=0 showlegend=0;
   /* compute the proportion of 1s in the upper triangular portion of the matrix */
   upperIdx = loc(col(M)>row(M));
   return ( M[upperIdx][:] );  /* proportion of words that are repeated */
score = VizLyrics("Lyrics", "Here We Go Round the Mulberry Bush");
print score;

If you want to reproduce the images in this post, you can download the SAS program for this article. In addition, the program creates repetition matrices for "We Didn't Start the Fire" (Billy Joel) and a portion of Martin Luthor King Jr.'s "I Have a Dream" speech. You can modify the program and enter lyrics for your favorite songs.

The post Visualize repetition in song lyrics appeared first on The DO Loop.

2月 142018

When I first learned to program in SAS, I remember being confused about the difference between CLASS statements and BY statements. A novice SAS programmer recently asked when to use one instead of the other, so this article explains the difference between the CLASS statement and BY variables in SAS procedures.

The BY statement and the CLASS statement in SAS both enable you to specify one or more categorical variables whose levels define subgroups of the data. (For simplicity, we consider only a single categorical variable.) The primary difference is that the BY statement computes many analyses, each on a subset of the data, whereas the CLASS statement computes a single analysis of all the data. Specifically,

  • The BY statement repeats an analysis on every subgroup. The subgroups are treated as independent samples. If a BY variable defines k groups, the output will contains k copies of every table and graph, one copy for the first group, one copy for the second group, and so on.
  • The CLASS statement enables you to include a categorical variable as part of an analysis. Often the CLASS variable is used to compare the groups, such as in a t test or an ANOVA analysis. In regression models, the CLASS statement enables you to estimate parameters for the levels of a categorical variable, thereby estimating the effect of each level on the response. Another use of a CLASS variable is to define categories for a classification task, such as a discriminant analysis.

To illustrate the differences between an analysis that uses a BY statement and one that uses a CLASS statement, let's create a subset (called Cars) of the Sashelp.Cars data. The levels of the Origin variable indicate whether a vehicle is manufactured in "Asia", "Europe", or the "USA". For efficiency reasons, most classical SAS procedures require that you sort the data when you use a BY statement. Therefore, a call to PROC SORT creates a sorted version of the data called CarsSorted, which will be used for the BY-group analyses.

data Cars;
   set Sashelp.Cars;
   where cylinders in (4,6,8) and type ^= 'Hybrid'; 
proc sort data=Cars out=CarsSorted; 
   by Origin; 

Descriptive statistics for grouped data

When you generate descriptive statistics for groups of data, the univariate statistics are identical whether you use a CLASS statement or a BY statement. What changes is the way that the statistics are displayed. When you use the CLASS statement, you get one table that contains all statistics or one graph that shows the distribution of each subgroup. However, when you use the BY statement you get multiple tables and graphs.

The following statements use the CLASS statement to produce descriptive statistics. PROC UNIVARIATE displays one (paneled) graph that shows a comparative histogram for the vehicles that are made in Asia, Europe, and USA. PROC MEANS displays one table that contains descriptive statistics:

proc univariate data=Cars;
   class Origin;
   var Horsepower;
   histogram Horsepower / nrows=3; /* must use NROWS= to get panel */
   ods select histogram;
proc means data=Cars N Mean Std;
   class Origin;
   var Horsepower Weight Mpg_Highway;

In contrast, if you run a BY-group analysis on the levels of the Origin variable, you will see three times as many tables and graphs. Each analysis is preceded by a label that identifies each BY group. Notice that the BY-group analysis uses the sorted data.

proc means data=CarsSorted N Mean Std;
   by Origin;
   var Horsepower Weight Mpg_Highway;

Always remember that the output from a BY statement is equivalent to the output from running the procedure multiple times on subsets of the data. For example, the previous statistics could also be generated by calling PROC MEANS three times, each call with a different WHERE clause, as follows:

proc means N Mean Std data=CarsSorted( where=(origin='Asia') );
   var Horsepower Weight Mpg_Highway;
proc means N Mean Std data=CarsSorted( where=(origin='Europe') );
   var Horsepower Weight Mpg_Highway;
proc means N Mean Std data=CarsSorted( where=(origin='USA') );
   var Horsepower Weight Mpg_Highway;

In fact, if you ever find yourself repeating an analysis many times (perhaps by using a macro loop), you should consider whether you can rewrite your program to be more efficient by using a BY statement.

Comparing groups: Use the CLASS statement

As a general rule, you should use a CLASS statement when you want to compare or contrast groups. For example, the following call to PROC GLM performs an ANOVA analysis on the horsepower (response variable) for the three groups defined by the Origin variable. The procedure automatically creates a graph that displays three boxplots, one for each group. The procedure also computes parameter estimates for the levels of the CLASS variable (not shown).

proc glm data=Cars; /* by default, create graph with side-by-side boxplots */
   class Origin;
   model Horsepower = Origin / solution;

You can specify multiple variables on the CLASS statement to include multiple categorical variables in a model. Any variables that are not listed on the CLASS statement are assumed to be continuous. Thus the following call to PROC GLM analyzes a model that has one continuous and one classification variable. The procedure automatically produces a graph that overlays the three regression curves on the data:

ods graphics /antialias=on;
title "CLASS Variable Regression: One Model with Multiple Parameters";
proc GLM data=Cars plots=FitPlot;
   class Origin;
   model Horsepower = Origin | Weight / solution;
   ods select ParameterEstimates ANCOVAPlot;

In contrast, if you use a BY statement, the Origin variable cannot be part of the model but is used only to subset the data. If you use a BY statement, you obtain three different models of the form Horsepower = Weight. You get three parameter estimates tables and three graphs, each showing one regression line overlaid on a subset of the data.

Predicted Values: CLASS VARIABLE versus BY Variable

When you use a BY statement and fit three models of the form Horsepower = Weight, the procedure fits a total of six parameters. Notice that when you use the CLASS statement and fit the model Horsepower = Origin | Weight, you also fit six free parameters. It turns out that these two methods produce the same predicted values. In fact, you can combine the parameter estimates (for the GLM parameterization) for the CLASS model to obtain the parameter estimates from the BY-variable analysis, as shown below. Each parameter estimate for the BY-variable models are obtained as the sum of two estimates for the CLASS-variable analysis:

For many regression models, the predicted values for the BY-variable analyses are the same as for a particular model that uses a CLASS variable. As shown above, you can even see how the parameters are related when you use a GLM or reference parameterization. However, the CLASS variable formulation can fit models (such as the equal-slope model Horsepower = Origin Weight) that are not available when you use a BY variable to fit three separate models. Furthermore, the CLASS statement provides parameter estimates so that you can see the effect of the groups on the response variable. It is more difficult to compare the models that are produced by using the BY statement.

Other CLASS-like statements in SAS

Some SAS procedures use other syntax to analyze groups. In particular, the SGPLOT procedure calls classification variables "group variables." If you want to overlay graphs for multiple groups, you can use the GROUP= option on many SGPLOT statements. (Some statements support the CATEGORY= option, which is similar.) For example, to replicate the two-variable regression analysis from PROC GLM, you can use the following statements in PROC SGPLOT:

proc sgplot data=Cars;
   reg y=Horsepower x=Weight / group=Origin; /* Horsepower = Origin | Weight */


In summary, use the BY statement in SAS procedures when you want to repeat an analysis for every level of one or more categorical variables. The variables define the subsets but are not otherwise part of the analysis. In classical SAS procedures, the data must be sorted by the BY variables. A BY-group analysis can produce many tables and graphs, so you might want to suppress the ODS output and write the results to a SAS data set.

Use the CLASS statement when you want to include a categorical variable in a model. A CLASS statement often enables you to compare or contrast subgroups. For example, in regression models you can evaluate the relative effect of each level on the response variable.

In some cases, the BY statement and the CLASS statement produce identical statistics. However, the CLASS statement enables you to fit a wider variety of models.

The post The difference between CLASS statements and BY statements in SAS appeared first on The DO Loop.

1月 102018

Last week I wrote about the 10 most popular articles from The DO Loop in 2017. My most popular articles tend to be about elementary statistics or SAS programming tips. Less popular are the articles about advanced statistical and programming techniques. However, these technical articles fill an important niche. Not everyone needs to know how to interpret a diffogram that visually compares the differences between means between groups, but those who do often send me a note of thanks, which motivates me to research and write similar articles.

Statistical programmers might want to review the following technical articles from 2017. This ain't summertime, and the reading ain't easy, but I think these articles are worth the effort. I've broken them into three categories. Enjoy!

Statistical Concepts

Visualization of regression that uses a weight variable in SAS

Statistical Data Analysis and Visualization

Visualize multivariate regression model by slicing the continuous variables. Graph created by using the EFFECTPLOT SLICEFIT statement in SAS.

Advanced SAS Programming Techniques

If you are searching for a way to enhance your SAS knowledge in this New Year, I think these 10 articles from The DO Loop are worth a second look. Was there an article from 2017 that you found particularly useful? Leave a comment.

The post 10 posts from 2017 that deserve a second look appeared first on The DO Loop.