Tips and Techniques

8月 072019
 
2-D binning of counts

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.

SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.

Equal-width binning in SAS

The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.

Quantile binning in SAS

In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints. Here are several resources for quantile binning:

Binning by using arbitrary cutpoints in SAS

Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.

2-D binning and bivariate histograms in SAS

A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).

In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.

The post The essential guide to binning in SAS appeared first on The DO Loop.

5月 132019
 

In SAS/IML programs, a common task is to write values in a matrix to a SAS data set. For some programs, the values you want to write are in a matrix and you use the CREATE FROM/APPEND FROM syntax to create the data set, as follows:

proc iml;
X = {1  2  3, 
     4  5  6, 
     7  8  9, 
    10 11 12};
create MyData from X[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from X;                                /* write all rows of X */
close;                                        /* close the data set */

In other programs, the results are computed inside an iterative DO loop. If you can figure out how many observations are generated inside the loop, it is smart to allocate room for the results prior to the loop, assign the rows inside the loop, and then write to a data set after the loop.

However, sometimes you do not know in advance how many results will be generated inside a loop. Examples include certain kinds of simulations and algorithms that iterate until convergence. An example is shown in the following program. Each iteration of the loop generates a different number of rows, which are appended to the Z matrix. If you do not know in advance how many rows Z will eventually contain, you cannot allocate the Z matrix prior to the loop. Instead, a common technique is to use vertical concatenation to append each new result to the previous results, as follows:

/* sometimes it is hard to determine in advance how many rows are in the final result */
free Z;
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Y = j(k , 3, n);         /* k x 3 matrix */
   Z = Z // Y;              /* vertical concatenation of results */
end;
create MyData2 from Z[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from Z;                                 /* write all rows */
close;                                         /* close the data set */

Concatenation within a loop tends to be inefficient. As I like to say, "friends don't let friends concatenate results inside a loop!"

If your ultimate goal is to write the observations to a data set, you can write each sub-result to the data set from inside the DO loop! The APPEND FROM statement writes whatever data are in the specified matrix, and you can call the APPEND FROM statement multiple times. Each call will write the contents of the matrix to the open data set. You can update the matrix or even change the number of rows in the matrix. For example, the following program opens the data set prior to the DO loop, appends to the data set multiple times (each time with a different number of rows), and then closes the data set after the loop ends.

/* alternative: create data set, write to it during the loop, then close it */
Z = {. . .};                /* tell CREATE stmt that data will contain three numerical variables */
create MyData3 from Z[colname={'A' 'B' 'C'}];   /* open before the loop. The TYPE of the variables are known. */
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Z = j(k , 3, n);         /* k x 3 matrix */
   append from Z;           /* write each block of data */
end;
close;                      /* close the data set */

The following output shows the contents of the MyData3 data set, which is identical to the MyData2 data set:

Notice that the CREATE statement must know the number and type (numerical or character) of the data set variables so that it can set up the data set for writing. If you are writing character variables, you also need to specify the length of the variables. I typically use missing values to tell the CREATE statement the number and type of the variables. These values are not written to the data set. It is the APPEND statement that writes data.

I previously wrote about this technique in the article "Writing data in chunks," which was focused on writing large data set that might not fit into RAM. However, the same technique is useful for writing data when the total number of rows is not known until run time. I also use it when running simulations that generate multivariate data. This technique provides a way to write data from inside a DO loop and to avoid concatenating matrices within the loop.

The post Write to a SAS data set from inside a SAS/IML loop appeared first on The DO Loop.

4月 152019
 

A quadratic form is a second-degree polynomial that does not have any linear or constant terms. For multivariate polynomials, you can quickly evaluate a quadratic form by using the matrix expression
x` A x
This computation is straightforward in a matrix language such as SAS/IML. However, some computations in statistics require that you evaluate a quadratic form that looks like
x` A-1 x
where A is symmetric and positive definite. Typically, you know A, but not the inverse of A. This article shows how to compute both kinds of quadratic forms efficiently.

What is a quadratic form?

For multivariate polynomials, you can represent the purely quadratic terms by a symmetric matrix, A. The polynomial is q(x) = x` A x, where x is an d x 1 vector and A is a d x d symmetric matrix. For example, if A is the matrix {9 -2, -2 5} and x = {x1, x2} is a column vector, the expression x` A x equals the second degree polynomial q(x1, x2) = 9*x12 - 4*x1 x2 + 5*x22. A contour plot of this polynomial is shown below.

Contour plot of the quadratic form x` A x for A = {9 -2, -2 5}

Probably the most familiar quadratic form is the squared Euclidean distance. If you let A be the d-dimensional identity matrix, then the squared Euclidean distance of a vector x from the origin is x` Id x = x` x = Σi xi2. You can obtain a weighted squared distance by using a diagonal matrix that has positive values. For example, if W = diag({2, 4}), then x` W x = 2*x12 + 4*x22. If you add in off-diagonal elements, you get cross terms in the polynomial.

Efficient evaluation of a quadratic form

If the matrix A is dense, then you can use matrix multiplication to evaluate the quadratic form. The following symmetric 3 x 3 matrix defines a quadratic polynomial in 3 variables. The multiplication evaluates the polynomial at (x1, x2, x3) = (-1. 2. 0.5).

proc iml;
/* 
   q(x1, x2, x3) = 4*x1**2 + 6*x2**2 + 9*x3**2 +
                   2*3*x1*x2 + 2*2*x1*x3 + 2*1*x2*x3
*/
A = {4 3 2,
     3 6 1,
     2 1 9};
x = {-1, 2, 0.5};
q1 = x`*A*x;
print q1;

When you are dealing with large matrices, always remember that you should never explicitly form a large diagonal matrix. Multiplying with a large diagonal matrix is a waste of time and memory. Instead, you can use elementwise multiplication to evaluate the quadratic form more efficiently:

w = {4, 6, 9};
q2 = x`*(w#x);     /* more efficient than q = x`*diag(w)*x; */
print q2;

Quadratic forms with positive definite matrices

In statistics, the matrix is often symmetric positive definite (SPD). The matrix A might be a covariance matrix (for a nondegenerate system), the inverse of a covariance matrix, or the Hessian evaluated at the minimum of a function. (Recall that the inverse of a symmetric positive definite (SPD) matrix is SPD.) An important example is the squared Mahalanobis distance x` Σ-1 x, which is a quadratic form.

As I have previously written, you can use a trick in linear algebra to efficiently compute the Mahalanobis distance. The trick is to compute the Cholesky decomposition of the SPD matrix. I'll use a large Toeplitz matrix, which is guaranteed to be symmetric and positive definite, to demonstrate the technique. The function EvalSPDQuadForm evaluates a quadratic form defined by the SPD matrix A at the coordinates given by x:

/* Evaluate quadratic form q = x`*A*x, where A is symmetric positive definite.
   Let A = U`*U be the Cholesky decomposition of A.
   Then q = x`(U`*U)*x = (U*x)`(Ux)
   So define z = U*x and compute q = z`*z
*/
start EvalSPDQuadForm(A, x);
   U = root(A);           /* Cholesky root */
   z = U*x;
   return (z` * z);       /* dot product of vectors */
finish;
 
/* Run on large example */
call randseed(123);
N = 1000;
A = toeplitz(N:1);
x = randfun(N, "Normal");
 
q3 = EvalSPDQuadForm(A, x);  /* efficient */
qCheck = x`*A*x;             /* check computation by using a slower method */
 
print q3 qCheck;

You can use a similar trick to evaluate the quadratic form x` A-1 x. I previously used this trick to evaluate the Mahalanobis distance efficiently. It combines a Cholesky decomposition (the ROOT function in SAS/IML) and the TRISOLV function for solving triangular systems.

/* Evaluate quadratic form x`*inv(A)*x, where  A is symmetric positive definite.
   Let w be the solution of A*w = x and A=U`U be the Cholesky decomposition.
   To compute  q = x` * inv(A) * x = x` * w, you need to solve for w.
   (U`U)*w = x, so
        First solve triangular system U` z = x   for z,
        then solve triangular system  U w = z   for w 
*/
start EvalInvQuadForm(A, x);
   U = root(A);              /* Cholesky root */
   z = trisolv(2, U, x);     /* solve linear system */
   w = trisolv(1, U, z);  
   return (x` * w);          /* dot product of vectors */
finish;
 
/* test the function */
q4 = EvalInvQuadForm(A, x);  /* efficient */
qCheck = x`*Solve(A,x);      /* check computation by using a slower method */
print q4 qCheck;

You might wonder how much time is saved by using the Cholesky root and triangular solvers, rather than by computing the general solution by using the SOLVE function. The following graph compares the average time to evaluate the same quadratic form by using both methods. You can see that for large matrices, the ROOT-TRISOLV method is many times faster than the straightforward method that uses SOLVE.

In summary, you can use several tricks in numerical linear algebra to efficiently evaluate a quadratic form, which is a multivariate quadratic polynomial that contains only second-degree terms. These techniques can greatly improve the speed of your computational programs.

The post Efficient evaluation of a quadratic form appeared first on The DO Loop.

4月 032019
 

I've previously written about how to deal with nonconvergence when fitting generalized linear regression models. Most generalized linear and mixed models use an iterative optimization process, such as maximum likelihood estimation, to fit parameters. The optimization might not converge, either because the initial guess is poor or because the model is not a good fit to the data. SAS regression procedures for which this might happen include PROC LOGISTIC, GENMOD, MIXED, GLMMIX, and NLMIXED.

For mixed models, several problems can occur if you have a misspecified model. One issue results in the following note in the SAS log:
NOTE: Estimated G matrix is not positive definite.
This article describes what the note means, why it might occur, and what to do about it. If you encounter this note during a BY-group analysis or simulation study, this article shows how to identify the samples that have the problem.

There are two excellent references that discuss issues related to convergence in the SAS mixed model procedures:

What is the G matrix?

Before we discuss convergence, let's review what the G matrix is and why it needs to be positive definite. The matrix formulation of a mixed model is
Y = X β + Z γ + ε
where β is a vector of fixed-effect parameters. The random effects are assumed to be random realizations from multivariate normal distributions. In particular, γ ~ MVN(0, G) and ε ~ MVN(0, R), where G and R are covariance matrices. The variance-covariance matrix G is often used to specify subject-specific effects, whereas R specifies residual effects. A goal of mixed models is to specify the structure of the G and/or R matrices and estimate the variance-covariance parameters.

Because G is a covariance matrix, G must be positive semidefinite. A nondegenerate covariance matrix will be fully positive definite. However, estimates of G might not have this property. SAS alerts you if the estimate is not positive definite.

As stated in Kiernan (2018, p. ), "It is important that you do not ignore this message."

Reasons the estimated G matrix is not positive definite

A SAS Usage Note states that the PROC MIXED message means that "one or more variance components on the RANDOM statement is/are estimated to be zero and could/should be removed from the model." Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) describe several reasons why an estimated G matrix can fail to be positive definite. Two of the more common reasons include:

  • There is not enough variation in the response. After controlling for the fixed effects, there isn't any (or much) variation for the random effects. You might want to remove the random effect from the model. (KTG, p. 9)
  • The model is misspecified. You might need to modify the model or change the covariance structure to use fewer parameters. (Kiernan, p. 18)

Convergence issues in simulation studies or BY-group analyses

If you encounter the note "Estimated G matrix is not positive definite" for real data, you should modify the model or collect more data. In a simulation study, however, there might be simulated samples that do not fit the model even when the data is generated from the model! This can happen for very small data sets and for studies in which the variance components are very small. In either case, you might want to identify the samples for which the model fails to converge, either to exclude them from the analysis or to analyze them by using a different model. The same situation can occur for few BY groups during a traditional BY-group analysis.

The following SAS program illustrates how to detect the samples for which the estimated G matrix is not positive definite. The 'SP' data are from an example in the PROC MIXED documentation. The data set named 'ByData' is constructed for this blog post. It contains four copies of the real data, along with an ID variable with values 1, 2, 3, and 4. For the first copy, the response variable is set to 0, which means that there is no variance in the response. For the third copy, the response variable is simulated from a normal distribution. It has variance, but does not conform to the model. The call to PROC MIXED fits the same random-effects model to all four samples:

/* Example from PROC MIXED documentation: https://bit.ly/2YEmpGw  */
data sp;
input Block A B Y @@;
datalines;
1 1 1  56  1 1 2  41  1 2 1  50  1 2 2  36  1 3 1  39  1 3 2  35
2 1 1  30  2 1 2  25  2 2 1  36  2 2 2  28  2 3 1  33  2 3 2  30
3 1 1  32  3 1 2  24  3 2 1  31  3 2 2  27  3 3 1  15  3 3 2  19
4 1 1  30  4 1 2  25  4 2 1  35  4 2 2  30  4 3 1  17  4 3 2  18
;
 
/* concatenate four copies of the data, but modify Y for two copies */
data ByData;
call streaminit(1);
set sp(in=a1) sp(in=a2) sp(in=a3) sp(in=a4);
SampleID = a1 + 2*a2 + 3*a3 + 4*a4;
if a1 then Y = 0;                         /* response is constant (no variation) */
if a3 then Y = rand("Normal", 31, 10);    /* response is indep of factors */
run;
 
/* suppress ODS output: https://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations.html */
%macro ODSOff(); ods graphics off; ods exclude all; ods noresults;  %mend;
%macro ODSOn();  ods graphics on; ods exclude none; ods results;    %mend;
 
%ODSOff
proc mixed data=ByData;
   by sampleID;
   class A B Block;
   model Y = A B A*B;
   random Block A*Block;
   ods output ConvergenceStatus=ConvergeStatus;
run;
%ODSOn

The SAS log displays various notes about convergence. The second and fourth samples (the doc example) converged without difficulty. The log shows a WARNING for the first group. The third group displays the note Estimated G matrix is not positive definite.

NOTE: An infinite likelihood is assumed in iteration 0 because of a nonpositive residual variance estimate.
WARNING: Stopped because of infinite likelihood.
NOTE: The above message was for the following BY group:
      SampleID=1
 
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group:
      SampleID=2
NOTE: Convergence criteria met.
 
NOTE: Estimated G matrix is not positive definite.
NOTE: The above message was for the following BY group:
      SampleID=3
 
NOTE: Convergence criteria met.
NOTE: The above message was for the following BY group:
      SampleID=4

Notice that the ODS OUTPUT statement writes the ConvergenceStatus table for each BY group to a SAS data set called ConvergeStatus. For each BY group, the ConvergenceStatus table is appended to a SAS data set called ConvergeStatus. The result of the PROC PRINT statement shows the columns of the table for each of the four groups:

proc print data=ConvergeStatus;  run;
Convergence status table in PROC MIXED, which shows which BY groups had convergence or modeling problems

The SampleID column is the value of the BY-group variable. The Reason column explains why the optimization process terminated. The Status column is 0 if the model converged and nonzero otherwise. Note that the third model converged, even though the G matrix was not positive definite!

To detect nonpositive definite matrices, you need to look at the pdG column, The pdG indicates which models had a positive definite G matrix (pdG=1) or did not (pdG=0). Although I do not discuss it in this article, the pdH column is an indicator variable that has value 0 if the SAS log displays the message NOTE: Convergence criteria met but final hessian is not positive definite.

The pdG column tells you which models did not have a positive definite variance matrix. You can merge the ConverenceStatus table with the original data and exclude (or keep) the samples that did not converge or that had invalid variance estimates, as shown in the following DATA step:

/* keep samples that converge and for which gradient and Hessian are positive definite */
data GoodSamples;
merge ByData ConvergeStatus;
by SampleID;
if Status=0 & pdG & pdH;
run;

If you run PROC MIXED on the samples in the GoodSamples data set, all samples will converge without any scary-looking notes.

In summary, this article discusses the note Estimated G matrix is not positive definite, which sometimes occurs when using PROC MIXED or PROC GLMMIX in SAS. You should not ignore this note. This note can indicate that the model is misspecified. This article shows how you can detect this problem programmatically during a simulation study or a BY-group analysis. You can use the ConvergenceStatus table to identify the BY groups for which this the problem occurs. Kiernan, Tao, and Gibbs (2012) and Kiernan (2018) provide more insight into this note and other problems that you might encounter when fitting mixed models.

The post Convergence in mixed models: When the estimated G matrix is not positive definite appeared first on The DO Loop.

1月 092019
 

Numbers don't lie, but sometimes they don't reveal the full story. Last week I wrote about the most popular articles from The DO Loop in 2018. The popular articles are inevitably about elementary topics in SAS programming or statistics because those topics have broad appeal. However, I also write about advanced topics, which are less popular but fill an important niche in the SAS community. Not everyone needs to know how to fit a Pareto distribution in SAS or how to compute distance-based measures of correlation in SAS. Nevertheless, these topics are interesting to think about.

I believe that learning should not stop when we leave school. If you, too, are a lifelong learner, the following topics deserve a second look. I've included articles from four different categories.

Data Visualization

  • Fringe plot: When fitting a logistic model, you can plot the predicted probabilities versus a continuous covariate or versus the empirical probability. You can use a fringe plot to overlay the data on the plot of predicted probabilities. The SAS developer of PROC LOGISTIC liked this article a lot, so look for fringe plots in a future release of SAS/STAT software!
  • Order variables in a correlation matrix or scatter plot matrix: When displaying a graph that shows many variables (such as a scatter plot matrix), you can make the graph more understandable by ordering the variables so that similar variables are adjacent to each other. The article uses single-link clustering to order the variables, as suggested by Hurley (2004).
  • A stacked band plot, created in SAS by using PROC SGPLOT
  • Stacked band plot: You can use PROC SGPLOT to automatically create a stacked bar plot. However, when the bars represent an ordered categorical variable (such as months or years), you might want to create a stacked band plot instead. This article shows how to create a stacked band plot in SAS.

Statistics and Data Analysis

Random numbers and resampling methods

Process flow diagram shows how to resample data to create a bootstrap distribution.

Optimization

These articles are technical but provide tips and techniques that you might find useful. Choose a few topics that are unfamiliar and teach yourself something new in this New Year!

Do you have a favorite article from 2018 that I did not include on the list? Share it in a comment!

The post 10 posts from 2018 that deserve a second look appeared first on The DO Loop.

1月 022019
 

Last year, I wrote more than 100 posts for The DO Loop blog. Of these, the most popular articles were about data visualization, SAS programming tips, and statistical data analysis. Here are the most popular articles from 2018 in each category.

Data Visualization

General SAS programming techniques

Statistics and Data Analysis

I write this blog because I love to learn new things and share what I know with others. If you want to learn something new, read (or re-read!) these popular articles from 2018. Then share this page with one of your colleagues. Happy New Year! I hope we both have many opportunities to learn and share in 2019!

The post Top posts from <em>The DO Loop</em> in 2018 appeared first on The DO Loop.

12月 032018
 

When a graph includes several markers or line styles, it is often useful to create a legend that explains the relationship between the data and the symbols, color, and line styles in the graph. The SGPLOT procedure does a good job of automatically creating and placing a legend for most graphs. However, sometimes it is useful to override the procedure's default choices. This article describes five tips that you can use to customize the content and placement of legends. The tips are:

  1. Suppress the legend by using the NOAUTOLEGEND option.
  2. Choose which components of the graph appear in the legend by using a KEYLEGEND statement and the NAME= option.
  3. Position the legend by using the LOCATION= and POSITION= option on the KEYLEGEND statement.
  4. Exclude one or more items from a legend by using the EXCLUDE= option on the KEYLEGEND statement (requires SAS 9.4M3).
  5. Consolidate one or more items by using the LEGENDITEM statement (requires SAS 9.4M5).

1. Suppress the legend

By default, the SGPLOT procedure displays a legend when there are multiple plots that are overlaid in the graph. This can be caused by multiple statements or by using the GROUP= option on a statement. If the information in the default legend is redundant, and you might want to suppress it. For example, the following legend is unnecessary because the title explains the data and the regression line. You can uncomment the NOAUTOLEGEND option to suppress the legend.

title "Linear Regression for Weight and Height";
title2 "The legend is unnecessary";
proc sgplot data=Sashelp.Class /* NOAUTOLEGEND */;
   scatter x=Height y=Weight;
   reg x=Height y=Weight / nomarkers;
   footnote J=L "Use the NOAUTOLEGEND option to suppress the legend";
run;
footnote;
You can use the NOAUTOLEGEND option on the PROC SGPLOT statement in SAS to suppress a legend

2. Choose which components appear in the legend

In some graphs that overlay multiple components, some components are self -explanatory and do not need to appear in the legend. You can choose which components appear in the legend by using the NAME= option on the statements and using the KEYLEGEND statement to specify the contents of the legend. For example, the following statements create a graph that consists of a scatter plot, a confidence ellipse, and a regression line. If you only want the confidence ellipse and regression line to appear in the legend, use the NAME= option to identify each component and use the KEYLEGEND statement to specify the contents of the legend:

title "Weight versus Height";
title2 "Overlay Least Squares Fit and Confidence Ellipse";
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / name="scatter";
   ellipse x=Height y=Weight / name="ellipse";
   reg x=Height y=Weight     / name="reg" nomarkers lineattrs=GraphData2;
   keylegend "reg" "ellipse"; /* list item in the order you want them */
run;
In the SGPLOT procedure in SAS, you can use the NAME= option to name components and the KEYLEGEND statement to specify which components should appear in a legend

3. Position the legend

The KEYLEGEND statement supports the LOCATION= and POSTITION= options, which enable you to place the legend almost anywhere in the graph. The LOCATION= option controls whether the legend appears inside or outside of the graph area. The POSITION= option controls the placement of the legend on the graph (left, right, top, bottom,...). However, I can never remember which option controls which attribute! Therefore, I created a mnemonic, which I hope will help you remember, too:

  • The LOCATION= option contains the substring 'CAT'. A CAT likes to go INSIDE and OUTSIDE the house. Therefore, the valid keywords for the LOCATION= option are INSIDE and OUTSIDE.
  • The POSITION= option contains the substring 'SIT'. You can SIT on the LEFT or RIGHT side of a couch. (Also, "position" can be used as a verb to mean "place on a page.") Therefore, the valid keywords for the POSITION= option are BOTTOM, BOTTOMLEFT, BOTTOMRIGHT, LEFT, RIGHT, TOP, TOPLEFT, and TOPRIGHT. (Some other graphical elements support a CENTER position, but not the legend.)

The following graph is the same as in the previous example, except that the location of the legend is inside the graph area and the position of the legend is in the lower-right corner. When you move the legend to the left or right side of the graph, it is often useful to use the ACROSS=1 option to force the legend to list the items vertically. Similarly, if you position the legend at the top or bottom of a graph, you might want to use the DOWN=1 option to list the items horizontally.

   keylegend "reg" "ellipse" / location=inside position=bottomright across=1;
In the SGPLOT procedure in SAS, you can use the LOCATION= and POSITION= options to position a legend

4. Exclude items from a legend

When you use the GROUP= option to display groups, you might want to exclude some of the group categories from the legend. The KEYLEGEND statement supports the EXCLUDE= option that you can use to exclude certain items. Three situations come to mind:

  • The group levels contain missing values. You might want to exclude the missing values from the legend by using KEYLEGEND / EXCLUDE=(" ");.
  • The purpose of the graph is to focus on one or two subgroups. If so, it might make sense to label only those groups. For example, if the purpose of a graph is to show income disparity between blacks and whites, you might decide not to include Asians or Hispanics in the legend: EXCLUDE=("Asian" "Hispanic");.
  • The group is binary. If a graph shows the results of a clinical trial and the legend includes the marker shape for the patients who died, it should be clear that the other marker shape represents patients who survived: EXCLUDE=("Alive");. An example is shown below
ods graphics / attrpriority=none;
title "Patient Status";
proc sgplot data=Sashelp.Heart(obs=200 where=(Systolic<=200));
   styleattrs datasymbols=(X CircleFilled);
   scatter x=Systolic y=Diastolic / group=Status;
   keylegend / exclude=("Alive");
run;
In the SGPLOT procedure in SAS, you can use the EXCLUDE= option on the KEYLEGEND statement to suppress a legend item, such as a category in a group

5. Customize items in a legend

The previous section shows how to exclude one or more levels in a categorical variable that is specified on the GROUP= option. You also might want to customize the items that appear in the legend in order to combine, for example, marker and line attributes. A situation where this comes up is when you want to overlay a group of curves on a scatter plot.

The LEGENDITEM statement (supported in SAS 9.4M5) enables you to specify what combination of markers and line patterns you want to appear for every item in a legend. It is a "super customization" statement that gives you complete control over the legend items.

The following statements show how to use the LEGENDITEM statement to create a customized legend. By default, if you use the REG statement with the GROUP= option, the legend will show only the colors and line patterns for the regression lines. In the following example, I have used the ATTRPRIORITY=NONE option to force the marker symbols to differ between groups. I want the legend to show not only the colors and patterns of the regression lines but also the marker symbols for each group:

/* ensure order of BP_Status is High, Normal, Optimal */
proc sort data=Sashelp.Heart(obs=200 where=(Systolic<=200)) out=Heart;
   by BP_Status;
run;
 
ods graphics / attrpriority=none;
title "Patients by Blood Pressure Status";
proc sgplot data=Heart;
   styleattrs datalinepatterns=(solid) ;
   reg x=Systolic y=Diastolic / group=BP_Status;   
   legenditem type=markerline name="H" /
      label="High" lineattrs=GraphData1 markerattrs=GraphData1; 
   legenditem type=markerline name="N" /
      label="Normal" lineattrs=GraphData2 markerattrs=GraphData2; 
   legenditem type=markerline name="O" /
      label="Optimal" lineattrs=GraphData3 markerattrs=GraphData3; 
   keylegend "O" "N" "H" / title="BP Status"; 
run;
In the SGPLOT procedure in SAS, you can use the LEGENDITEM statement to create customized items in a legend

In summary, PROC SGPLOT in SAS supports several ways to create, suppress, position, and customize the items in a legend. Do you have a favorite way to customize a legend in PROC SGPLOT? Leave a comment!

The post 5 tips for customizing legends in PROC SGPLOT in SAS appeared first on The DO Loop.

11月 192018
 

You might know that you can use the ODS SELECT statement to display only some of the tables and graphs that are created by a SAS procedure. But did you know that you can use a WHERE clause on the ODS SELECT statement to display tables that match a pattern? This article shows how to use wildcards, regular expressions, and pattern matching to select ODS tables in SAS.

ODS SELECT: Filter the output from SAS procedures

A SAS procedure might produce a dozen or more tables. You might be interested in displaying a subset of those tables. Recall that you can use the ODS TRACE ON statement to obtain a list of all the tables and graphs that a procedure creates. You can then use the ODS SELECT or the ODS EXCLUDE statement to control which tables and graphs are displayed.

Here's an example from the SAS/STAT documentation. The following PROC LOGISTIC call creates 27 tables and graphs, most of which are related to ROC curves. The ODS TRACE ON statement displays the names of each output object in the SAS log:

data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
;
 
ods graphics on;
ods trace on;
proc logistic data=roc;
   model popind(event='0') = alb tp totscore / nofit;
   roc 'Albumin' alb;
   roc 'K-G Score' totscore;
   roc 'Total Protein' tp;
   roccontrast reference('K-G Score') / estimate e;
run;

The SAS log displays the names of the tables and graphs. A portion of log is shown below:

Output Added:
-------------
Name:       OddsRatios
Label:      Odds Ratios
Template:   Stat.Logistic.OddsRatios
Path:       Logistic.ROC3.OddsRatios
-------------

Output Added:
-------------
Name:       ROCCurve
Label:      ROC Curve
Template:   Stat.Logistic.Graphics.ROC
Path:       Logistic.ROC3.ROCCurve
-------------

Output Added:
-------------
Name:       ROCOverlay
Label:      ROC Curves
Template:   Stat.Logistic.Graphics.ROCOverlay
Path:       Logistic.ROCComparisons.ROCOverlay
-------------

Output Added:
-------------
Name:       ROCAssociation
Label:      ROC Association Statistics
Template:   Stat.Logistic.ROCAssociation
Path:       Logistic.ROCComparisons.ROCAssociation
-------------

Output Added:
-------------
Name:       ROCContrastCoeff
Label:      ROC Contrast Coefficients
Template:   Stat.Logistic.ROCContrastCoeff
Path:       Logistic.ROCComparisons.ROCContrastCoeff
-------------

Only a few of the 27 ODS objects are shown here. Notice that each ODS object has four properties: a name, a label, a template, and a path. Most of the time, the name is used on the ODS SELECT statement to filter the output. For example, if you want to display only the ROC curves and the overlay of the ROC curves, you can put the following statement prior to the RUN statement in the procedure:

   ods select ROCCurve ROCOverlay;     /* specify the names literally */

Use a WHERE clause in the ODS SELECT statement

Often the ODS objects that you want to display are related to each other. In the LOGISTIC example, you might want to display all the information about ROC curves. Fortunately, the SAS developers often use a common prefix or suffix, such as 'ROC', in the names of the ODS objects. That means that you can display all ROC-related tables and graphs be selecting the ODS objects whose name (or path) contains 'ROC' as a substring.

You can use the WHERE clause to select ODS objects whose name (or label or path) matches a particular pattern. The object's name is available in a special variable named _NAME_. Similarly, the object's label and path are available in variables named _LABEL_ and _PATH_, respectively. You cannot match patterns in the template string; there is no _TEMPLATE_ variable.

In SAS, the following operators and functions are useful for matching strings:

  • The CONTAINS keyword matches strings that contains a specified substring. The question mark (?) is an equivalent way to specify the CONTAINS operator.
  • The LIKE keyword matches strings to a pattern. The underscore (_) is a wildcard that matches any character. The percent sign (%) is a wildcard that matches one or more characters.
  • The "begins with" operators (=: and in:) match strings that begin with a certain pattern or set of patterns, respectively.
  • SAS functions such as FIND, INDEX, and SUBSTR can be used to match patterns.
  • The SAS PRXMATCH function, which enables you to use Perl regular expressions to match patterns.

For example, the following statements select ODS tables and graphs from the previous PROC LOGISTIC call. You can put one of these statements before the RUN statement in the procedure:

   /* use any one of the following statements inside the PROC LOGISTIC call */
   ods select where=(_name_ =: 'ROC');                 /* name starts with 'ROC' */
   ods select where=(_name_ like 'ROC%');              /* name starts with 'ROC' */
   ods select where=(_path_ ? 'ROC');                  /* path contains 'ROC' */
   ods select where=(_label_ ? 'ROC');                 /* label contains 'ROC' */
   ods select where=(_name_ in: ('Odds', 'ROC'));      /* name starts with 'Odds' or 'ROC' */
   ods select where=(substr(_name_,4,8)='Contrast');   /* name has subtring 'Contrast' at position 4 */

For additional examples of using pattern matching to select ODS objects, see Warren Kuhfeld's graphics-focused blog post and the section of the SAS/STAT User's Guide that discusses selecting ODS graphics.

Use PRXMATCH to match regular expressions

Although the CONTAIN and LIKE operators are often sufficient for selecting a table, SAS provides the powerful PRXMATCH function for more complex pattern-matching tasks. The PRXMATCH function uses Perl regular expressions to match strings. SAS provides a Perl Regular Expression "cheat sheet" that summarizes the syntax and commons search queries for the PRXMATCH function.

You can put any of the following statements inside the PROC LOGISTIC call:

   /* use any one of the following PRXMATCH expressions inside the PROC LOGISTIC call */
   ods select where=(prxmatch('/ROC/',       _name_)); /* name contains 'ROC' anywhere */
   ods select where=(prxmatch('/^ROC/',      _name_)); /* name starts with 'ROC' */
   ods select where=(prxmatch('/Odds|^ROC/', _name_)); /* name contains 'Odds' anywhere or 'ROC' at the beginning */
   ods select where=(prxmatch('/ROC/',      _name_)=0);     /* name does NOT contain 'ROC' anywhere */
   ods select where=(prxmatch('/Logistic\.ROC2/', _path_)); /* escape special wildcard character '.' */

In summary, the WHERE= option on the ODS SELECT (and ODS EXCLUDE) statement is quite powerful. Many SAS programmers know how to list the names of tables and graphs on the ODS SELECT statement to display only a subset of the output. However, the WHERE= option enables you to use wildcards and regular expressions to select objects whose names or paths match a certain pattern. This can be a quick and efficient way to select tables that are related to each other and share a common prefix or suffix in their name.

The post Select ODS tables by using wildcards and regular expressions in SAS appeared first on The DO Loop.

10月 222018
 

A SAS programmer asked how to rearrange elements of a matrix. The rearrangement he wanted was rather complicated: certain blocks of data needed to move relative to other blocks, but the values within each block were to remain unchanged. It turned out that the mathematical operation he needed is called a block transpose. The BTRAN function in SAS/IML performs block-transpose operations, so the complicated rearrangement was easy to implement with a judicious call to the BTRAN function.

This article discusses the block-transpose operation and gives an example. It also shows how a block transpose can conveniently transform wide data into long data and vice versa.

The block-transpose operation

In general, a block matrix can contain blocks of various sizes. However, the BTRAN function requires that all blocks be the same size. Specifically, suppose A is a block matrix where each block is an n x m matrix. That means that A is an (r n) x (c m) matrix for some whole numbers r and c. The BTRAN function transposes the blocks to create a (c n) x (r m) block matrix.

The idea is more easily explained by a picture. The following image shows a matrix A that is composed of six blocks (of the same size). The block-transpose operation rearranges the blocks but leaves the elements within the blocks unchanged.

A block-transpose operation transposes the blocks but leaves the contents of each block unchanged

The BTRAN function in SAS/IML

Let's see how the block transpose works on an example in the SAS/IML language. The following statements create a 4 x 12 matrix. I have overlaid some grid lines to help you visualize this matrix as a 2 x 3 block matrix, where each block is 2 x 4.

proc iml;
x = repeat(1:12,2);
x = x // (10+x);
print x;
A 4x12 matrix visualized as a 2x3 block matrix. Each block is 2x4.

You can use the BTRAN function to apply a block transpose. The first argument is the matrix that contains the data. The second and third arguments specify the size of the blocks:

y = btran(x, 2, 4);    /* block size is 2x4 */
print y;
A 6x6 matrix visualized as a 3x2 block matrix. Each block is 2x4.

The output is a 6 x 6 matrix, visualized as a 3 x 2 block matrix.

Block transpose for wide-to-long transforms

If the number of rows in the block equals the number of rows in the data, then the BTRAN function stacks columns. In particular, if the block size is n x 1, the BTRAN function stacks multiple variables into a single column, just like the SHAPECOL function. For example, the Sashelp.Iris data contains four variables named SepalLength, SepalWidth, PetalLength, and PetalWidth. The following SAS/IML statements stack the four variables into a single column and create a new column that identifies the name of the original variable:

proc iml;
/* wide to long */
varNames = {"SepalLength" "SepalWidth" "PetalLength" "PetalWidth"};
use Sashelp.Iris;
   read all var "Species";        /* 150 x 3 */
   read all var varNames into X;  /* 150 x 4 */
close;
n = nrow(X); m = ncol(X);         /* n = 150; m = 4 */
 
ID = repeat(Species, m);          /* repeat the vars that are not transposed */
Var = shapecol(repeat(varNames, n), n*m); /* create column that contains variable names */
Value = btran(X, n, 1);           /* vector with 150*4 rows */ 
 
create test var {ID "Var" "Value"}; append; close;

Block transpose for wide-to-long transforms

In a similar way, you can transpose balanced data from long form to wide form. (Recall that "balanced data" means that each group has the same number of observations.) For example, the Sashelp.Iris data contains a Species variable. The first 50 observations have the value 'Setosa', the next 50 have the value 'Versicolor', and the last 50 have the value 'Virginica'. You can use a WHERE clause or a BY statement to analyze each species separately, but suppose you want to create a new data set that contains only 50 observations and has variables named SepalLengthSetosa, SepalLengthVersicolor, SepalLengthVirginica, and so forth for the other variables. The BTRAN function makes this easy: just specify a block size of 50 x 4, as follows.

/* create pairwise combinations of var names and group levels */
BlockRows = 50;
w = btran(X, BlockRows, m);                  /* 50 x 12 matrix */
 
s = Species[ do(1, n, BlockRows) ];          /* s = {'Setosa' 'Versicolor', 'Virginica'} */
vNames = rowcatc(expandgrid(varNames, s));   /* combine var names and species */
print vNames, w[c=vNames L="Measurenents (cm)"];

In summary, the BTRAN function is a useful function when you need to rearrange blocks of data without changing the values in a block. For balanced designs, you can use the BTRAN function to convert data between wide and long formats.

The post Transpose blocks to reshape data appeared first on The DO Loop.

10月 082018
 

This article compares several ways to find the elements that are common to multiple sets. I test which method is the fastest in the SAS/IML language. However, all algorithms are intrinsically fast, which raises an important question: when is it worth the time and effort to optimize an algorithm?

The idea for this topic came from reading a blog post by 'mikefc' at the "Cool but Useless" blog about the relative performance of various ways to intersect vectors in R. Mikefc used vectors that contained between 10,000 and 100,000 elements and whose intersection was only a few dozen elements. In this article, I increase the sizes of the vectors by an order of magnitude and time the performance of the intersection function in SAS/IML. I also ensure that the sets share a large set of common elements so that the intersection is sizeable.

The intersection of large sets

In general, finding the intersection between sets is a fast operation. In languages such as R, MATLAB, and SAS/IML, you can store the sets in vectors and use built-in functions to form the intersection. Because the intersection of two sets is always smaller than (or equal to) the size of the original sets, computing the "intersection of intersections" is usually faster than computing the intersection of the original (larger) sets.

The following SAS/IML program creates eight SAS/IML vectors that have between 200,000 and 550,000 elements. The elements in the vectors are positive integers, although they could also be character strings or non-integers. The vectors contain a large subset of common elements so that the intersection is sizeable. The vectors A, B, ..., H are created as follows:

proc iml;
call randseed(123);
NIntersect = 5e4; 
common = sample(1:NIntersect, NIntersect, "replace"); /* elements common to all sets */
 
source = 1:1e6;        /* elements are positive integers less than 1 million */
BaseN = 5e5;           /* approx magnitude of vectors (sets) */
A = sample(source,     BaseN, "replace") || common; /* include common elements */
B = sample(source, 0.9*BaseN, "replace") || common;
C = sample(source, 0.8*BaseN, "replace") || common;
D = sample(source, 0.7*BaseN, "replace") || common;
E = sample(source, 0.6*BaseN, "replace") || common;
F = sample(source, 0.5*BaseN, "replace") || common;
G = sample(source, 0.4*BaseN, "replace") || common;
H = sample(source, 0.3*BaseN, "replace") || common;

The intersection of these vectors contains about 31,600 elements. You can implement the following methods that find the intersection of these vectors:

  1. The XSECT function in SAS/IML accepts up to 15 arguments. You can therefore make a single call to find the intersection.
  2. If you have dozens or hundreds of sets to intersect, you can loop over the sets and call the XSECT function multiple times. (Recall that the VALUE function enables you to loop over the names of the sets and access the vectors by names.) For example, you can let W1 be the intersection of the first two sets, let W2 be the intersection of W1 with the third set, and so forth. Because the size of the intersection is typically smaller than the size of the sets, later intersections require less time to compute than earlier intersections. (Although I only implement pairwise intersections, you could conceivably intersect more than two sets at a time.)
  3. As 'mikefc' mentions, it is quicker to intersect small sets than to intersect large sets. Thus you can preprocess the names of the sets and sort them by size. This heuristic might make a difference when the set sizes vary greatly.

The following SAS/IML statements implement these three methods and compute the time required to intersect each:

/* 1. one call to the built-in method */
t0  = time();
   w = xsect(a,b,c,d,e,f,g,h);
tBuiltin = time() - t0;
 
/* 2. loop over variable names and form pairwise intersections */
varName = "a":"h";
t0  = time();
   w = value(varName[1]);
   do i = 2 to ncol(varName);
     w = xsect(w, value(varName[i]));  /* compute pairwise intersection */
   end;
tPairwise = time() - t0;
 
/* 3. Sort by size of sets, then loop */
varName = "a":"h";
t0  = time();
   len = j(ncol(varName), 1);    
   do i = 1 to ncol(varName);
     len[i] = ncol(value(varName[i])); /* number of elements in each set */
   end;
   call sortndx(idx, len);             /* sort smallest to largest */
   sortName = varName[,idx];
   w = value(sortName[1]);
   do i = 2 to ncol(sortName);
     w = xsect(w, value(sortName[i])); /* compute pairwise intersection */
   end;
tSort = time() - t0;
 
print tBuiltin tPairwise tSort;

For this example data, each method takes about 0.5 seconds to find the intersection of eight sets that contain a total of 3,000,000 elements. If you rerun the analysis, the times will vary by a few hundredths of a second, so the times are essentially equal. ('mikefc' also discusses a few other methods, including a "tally method." The tally method is fast, but it relies on the elements being positive integers.)

Whether to optimize or not

There is a quote that I like: "In theory, there is no difference between theory and practice. But, in practice, there is." (Published by Walter J. Savitch, who claims to have overheard it at a computer science conference.) For the intersection problem, you can argue theoretically that pre-sorting by length is an inexpensive way to potentially obtain a boost in performance. However, in practice (for this example data), pre-sorting improves the performance by less than 1%. Furthermore, the absolute times for this operation are short, so saving a 1% of 0.5 seconds might not be worth the effort.

Even if you never compute the intersection of sets in SAS, there are two take-aways from this exercise:

  • When you measure the performance of algorithms, use example data that is similar to the real data that you expect to encounter. The performance of this problem depends on the size of the sets, the number of sets, and the contents of the sets. The results that I show here are based on my choices for these parameters. Different parameters might yield different results.
  • Always consider the total time for an algorithm before you start to optimize it. If your analysis takes 30 seconds, there is no need to optimize a step in the analysis that takes 0.5 seconds. No matter how well you optimize that step, it is not going to substantially shrink the total run time of your analysis.

The post The intersection of multiple sets appeared first on The DO Loop.