Rick Wicklin

December 11, 2019
 

Binary matrices are used for many purposes. I have previously written about how to use binary matrices to visualize missing values in a data matrix. They are also used to indicate the co-occurrence of two events. In ecology, binary matrices are used to indicate which species of an animal are present in which ecological site. For example, if you remember your study of Darwin's finches in high school biology class, you might recall that certain finches (species) are present or absent on various Galapagos islands (sites). You can use a binary matrix to indicate which finches are present on which islands.

Recently I was involved in a project that required performing a permutation test on rows of a binary matrix. As part of the project, I had to solve three smaller problems involving rows of a binary matrix:

  1. Given two rows, find the locations at which the rows differ.
  2. Given two binary matrices, determine how similar the matrices are by computing the proportion of elements that are the same.
  3. Given two rows, swap some of the elements that differ.

This article shows how to solve each problem by using the SAS/IML matrix language. A future article will discuss permutation tests for binary matrices. For clarity, I introduce the following macro that uses a temporary variable to swap two sets of values:

/* swap values of A and B. You can use this macro in the DATA step or in the SAS/IML language */
%macro SWAP(A, B, TMP=_tmp);
   &TMP = &A; &A = &B; &B = &TMP;
%mend;
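
For example, the following DATA step is a quick sanity check that the macro swaps two values:

data _null_;
   a = 1; b = 2;
   %SWAP(a, b);
   put a= b=;     /* a=2 b=1 */
run;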

Where do binary matrices differ?

The SAS/IML matrix language enables you to treat matrices as high-level objects. You often can answer questions about matrices without writing loops that iterate over the elements of the matrices. For example, if you have two matrices of the same dimensions, you can determine the cells at which the matrices are unequal by using the "not equal" (^=) operator. The following SAS/IML statements define two 2 x 10 matrices and use the "not equal" operator to find the cells that are different:

proc iml;
A = {0 0 1 0 0 1 0 1 0 0 ,
     1 0 0 0 0 0 1 1 0 1 };
B = {0 0 1 0 0 0 0 1 0 1 ,
     1 0 0 0 0 1 1 1 0 0 };
colLabel = "Sp1":("Sp"+strip(char(ncol(A))));
rowLabel = "A1":("A"+strip(char(nrow(A))));
 
/* 1. Find locations where binary matrices differ */
Diff = (A^=B);
print Diff[c=colLabel r=rowLabel];

The matrices A and B are similar. They both have three 1s in the first row and four 1s in the second row. However, the output shows that the matrices are different for the four elements in the sixth and tenth columns. Although I used entire matrices for this example, the same operations work on row vectors.

The proportion of elements in common

You can use various statistics to measure the similarity between the A and B matrices. A simple statistic is the proportion of elements that are in common. These matrices have 20 elements, and 16/20 = 0.8 are common to both matrices. You can compute the proportion in common by using the expression (A=B)[:], or you can use the following statements if you have previously computed the Diff matrix:

/* 2. the proportion of elements in B that are the same as in A */
propSame = 1 - Diff[:];
print propSame;

As a reminder, the mean subscript reduction operator ([:]) computes the mean value of the elements of the matrix. For a binary matrix, the mean value is the proportion of ones.
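
Equivalently, you can compute the proportion in a single statement without forming the Diff matrix:

propSame = (A=B)[:];   /* mean of the binary matrix (A=B) = proportion of matching cells */
print propSame;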

Swap elements of rows

The first two tasks were easy. A more complicated task is swapping values that differ between rows. The swapping operation is not difficult, but it requires finding the k locations where the rows differ and then swapping all or some of those values. In a permutation test, the number of elements that you swap is a random integer between 1 and k, but for simplicity, this example uses the SWAP macro to swap two cells that differ. For clarity, the following example uses temporary variables such as x1, x2, d1, and d2 to swap elements in the matrix A:

/* specify the rows whose value you want to swap */
i1 = 1;                         /* index of first row to compare and swap */
i2 = 2;                         /* index of second row to compare and swap */
 
/* For clarity, use temp vars x1 & x2 instead of A[i1, ] and A[i2, ] */
x1 = A[ i1, ];                  /* get one row */
x2 = A[ i2, ];                  /* get another row */
idx = loc( x1^=x2 );            /* find the locations where rows differ */
if ncol(idx) > 0 then do;       /* do the rows differ? */
   d1 = x1[ ,idx];              /* values at the locations that differ */
   d2 = x2[ ,idx];
   print (d1//d2)[r={'r1' 'r2'} L='Original'];
 
   /* For a permutation test, choose a random number of locations to swap */
   /* numSwaps = randfun(1, "Integer", 1, ncol(idx));
      idxSwap = sample(1:ncol(idx), numSwaps, "NoReplace");  */
   idxSwap = {2 4};              /* for now, hard-code locations to swap */
   %SWAP(d1[idxSwap], d2[idxSwap]);
   print (d1//d2)[r={'d1' 'd2'} L='New'];
   /* update matrix */
   A[ i1, idx] = d1;
   A[ i2, idx] = d2;
end;
print A[c=colLabel r=rowLabel];

The vectors x1 and x2 are the rows of A to compare. The vectors d1 and d2 are the subvectors that contain only the elements of x1 and x2 that differ. The example swaps the second and fourth columns of d1 and d2. The new values are then inserted back into the matrix A. You can compare the final matrix to the original to see that the process swapped two elements in each of two rows.

Although the examples are for binary matrices, these techniques work for any numerical matrices.

The post Swap elements in binary matrices appeared first on The DO Loop.

December 9, 2019
 

Recently I showed how to visualize and analyze longitudinal data in which subjects are measured at multiple time points. A very common situation is that the data are collected at two time points. For example, in medicine it is very common to measure some quantity (blood pressure, cholesterol, white-blood cell count,...) for each patient before a treatment and then measure the same quantity after an intervention.

This situation leads to pairs of observations. It is useful to visualize the pre/post measurements in a spaghetti plot, a scatter plot, or a histogram of the differences. This article points out that you do not need to create these plots manually: the TTEST procedure can produce all these plots automatically, as well as provide a statistical analysis of paired data.

Visualize pre/post data

When given measurements before and after an intervention (sometimes called "pre/post data"), I like to create a scatter plot of the "before" variable versus the "after" variable. To demonstrate, the following SAS statements define a subset of the data that were analyzed in the previous article. The subset contains 50 children who had high levels of lead in their blood. Lead is toxic, especially in children. The children were given a compound (called succimer) to reduce the blood-lead level and the levels were re-measured one week later.

data PrePost;
   input id lead0 lead1 @@;
   label lead0 = "Baseline Blood Lead Level (mcg/dL)"
         lead1 = "Post-Treatment Blood Lead Level (mcg/dL)";
datalines;
  2   26.5   14.8   3   25.8   23.0  5   20.4    2.8  6   20.4    5.4
 12   24.8   23.1  14   27.9    6.3 19   35.3   25.5 20   28.6   15.8
 22   29.6   15.8  23   21.5    6.5 25   21.8   12.0 26   23.0    4.2
 27   22.2   11.5  29   25.0    3.9 31   26.0   21.4 32   19.7   13.2
 36   29.6   17.5  39   24.4   16.4 40   33.7   14.9 43   26.7    6.4
 44   26.8   20.4  45   20.2   10.6 48   20.2   17.5 49   24.5   10.0
 53   27.1   14.9  54   34.7   39.0 57   24.5    5.1 64   27.7    4.0
 65   24.3   24.3  66   36.6   23.3 68   34.0   10.7 69   32.6   19.0
 70   29.2    9.2  71   26.4   15.3 72   21.8   10.6 79   21.1    5.6
 82   22.1   21.0  85   28.9   12.5 87   19.8   11.6 89   23.5    7.9
 90   29.1   16.8  91   30.3    3.5 93   30.6   28.2 94   22.4    7.1
 95   31.2   10.8  96   31.4    3.9 97   41.1   15.1 98   29.4   22.1
 99   21.9    7.6 100   20.7    8.1
;
 
title "Pre/Post Measurements";
title2 "Created Manually";
proc sgplot data=PrePost aspect=1 noautolegend;
   scatter x=Lead0 y=Lead1 / jitter;
   lineparm x=0 y=0 slope=1 / clip;
   xaxis grid;
   yaxis grid;
run;

The graph shows the initial measurement (on the X axis) versus the post-treatment measurement (on the Y axis) for each patient in the study. The diagonal line is the identity line. A marker above the line indicates a patient whose measurement increased after treatment. A marker below the line indicates a patient whose measurement decreased. For these data, most children experienced a decrease in blood-lead levels, which indicates that the treatment was effective.
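
If you want to quantify that observation, a short DATA step and PROC FREQ count how many children increased or decreased. (This is a minimal sketch based on the PrePost data; the Direction data set is mine, not part of the original program.)

data Direction;
   set PrePost;
   length Change $9;
   if      lead1 > lead0 then Change = "Increased";
   else if lead1 < lead0 then Change = "Decreased";
   else                       Change = "No change";
run;

proc freq data=Direction;
   tables Change / nocum;
run;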

Use PROC TTEST to visualize pre/post data

The goal of ODS statistical graphics in SAS procedures is to provide an analyst with standard visualizations of data that can accompany and enhance a statistical analysis. For pre/post data, a common analysis is to perform a paired t test for the mean difference between the pre- and post-intervention measurements. It turns out that if you run a PROC TTEST analysis, the output not only includes statistical tables but also three common graphs that visualize the data. If the data are in "wide form," the PROC TTEST syntax is simple:

ods graphics on;
proc ttest data=PrePost;
   paired Lead0*Lead1;
run;

Think about this syntax: When you type three PROC TTEST statements, you get a statistical analysis AND three standard visualizations! The graphs appear automatically when you turn on ODS graphics, without requiring any effort on your part.

The first graph from PROC TTEST is a histogram of the difference (pre-treatment minus post-treatment). The graph is actually a panel, with a small horizontal box plot underneath that gives another view of the distribution of the differences. You can see that the differences range from -5 to 28. The median change appears to be 13 and the interquartile range is from 8 to 18. If the differences are approximately normally distributed, the light green rectangle indicates a 95% confidence interval for the mean.
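
Although PROC TTEST creates this panel for you, a rough manual version of the difference histogram is easy to sketch. (The Diff data set is my own name; PROC TTEST computes the difference as pre-treatment minus post-treatment.)

data Diff;
   set PrePost;
   diff = lead0 - lead1;    /* pre-treatment minus post-treatment */
run;

proc sgplot data=Diff;
   histogram diff;
   xaxis label="Difference in Blood Lead Level (mcg/dL)";
run;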

The second graph is a spaghetti plot of the patient profiles. Each line segment represents a patient. The lines connect the pre-treatment values (on the left) to the post-treatment values (on the right). You can see that most (but not all) patients experienced a decrease in blood-lead levels. The red line in the graph indicates the mean pre- and post-treatment levels.

The third graph is a version of the scatter plot that I created earlier by using PROC SGPLOT. Like my version, the graph includes a diagonal reference line. In addition, the graph displays a symbol that shows the pre- and post-treatment mean values of the response variable. The vertical distance between the mean marker and the reference line indicates the average change since the treatment.

PROC TTEST also creates a fourth graph (not shown), which is a normal Q-Q plot of the difference. This graph is useful if you are assessing the normality of the distribution of differences.
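
If you prefer to create that plot yourself, PROC UNIVARIATE produces a comparable Q-Q plot. This sketch reuses the Diff data set from the previous sketch:

proc univariate data=Diff noprint;
   var diff;
   qqplot diff / normal(mu=est sigma=est);
run;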

Summary

PROC TTEST creates excellent graphs that enable you to visualize pre/post data. With only a few SAS statements, you can produce three common graphs that visualize the change, presumably due to the treatment. Using PROC TTEST is quicker and easier than creating the graphs yourself. Furthermore, one of the graphs (the histogram/box plot panel) is a graph that is not easily created by using PROC SGPLOT. So next time you want to visualize paired data, reach for PROC TTEST.

The post Visualize data before and after a treatment appeared first on The DO Loop.

December 5, 2019
 

This is a second article about analyzing longitudinal data, which features measurements that are repeatedly taken on subjects at several points in time. The previous article discusses a response-profile analysis, which uses an ANOVA method to determine differences between the means of an experimental group and a placebo group. The response-profile analysis has limitations, including the fact that longitudinal data are autocorrelated and so do not satisfy the independence assumption of ANOVA. Furthermore, the method does not enable you to model the response profile of individual subjects; it produces only a mean response-by-time profile.

This article shows how a mixed model can produce a response profile for each subject. A mixed model also addresses other limitations of the response-profile analysis. This blog post is based on the introductory article, "A Primer in Longitudinal Data Analysis", by G. Fitzmaurice and C. Ravichandran (2008), Circulation, 118(19), p. 2005-2010.

The data (from Fitzmaurice and Ravichandran, 2008) are the blood lead levels for 100 inner-city children who were exposed to lead in their homes. Half were in an experimental group and were given a compound called succimer as treatment. The other half were in a placebo group. Blood-lead levels were measured for each child at baseline (week 0), week 1, week 4, and week 6. The data are visualized in the previous article. You can download the SAS program that creates the data and graphs in this article.

Advantages of the mixed model for longitudinal data

The main advantage of a mixed-effect model is that each subject is assumed to have his or her own mean response curve that explains how the response changes over time. The individual curves are a combination of two parts: "fixed effects," which are common to the population and shared by all subjects, and "random effects," which are specific to each subject. The term "mixed" implies that the model incorporates both fixed and random effects.

You can use a mixed model to do the following:

  1. Model the individual response-by-time curves.
  2. Model autocorrelation in the data. This is not discussed further in this blog post.
  3. Model time as a continuous variable, which is useful for data that includes mistimed observations and parametric models of time, such as a quadratic model or a piecewise linear model.

The book Applied Longitudinal Analysis (G. Fitzmaurice, N. Laird, and J. Ware, 2011, 2nd Ed.) discusses almost a dozen ways to model the data for blood-lead level in children. This blog post briefly shows how to implement three models in SAS that incorporate random intercepts. The models are the response-profile model, a quadratic model, and a piecewise linear model.

Visualize mixed models

I've previously written about how to visualize mixed models in SAS. One of the techniques is to create a spaghetti plot that shows the predicted response profiles for each subject in the study. Because we will examine three different models, the following statements define a macro that will sort the predicted values and plot them in a spaghetti plot:

%macro SortAndPlot(DSName);
proc sort data=&DSName;
   by descending Treatment ID Time;
run;
 
proc sgplot data=&DSName dattrmap=Order;
   series x=Time y=Pred / group=ID groupLC=Treatment break lineattrs=(pattern=solid)
                       attrid=Treat;
   legenditem type=line name="P" / label="Placebo (P)" lineattrs=GraphData1; 
   legenditem type=line name="A" / label="Succimer (A)" lineattrs=GraphData2; 
   keylegend "A" "P";
   xaxis values=(0 1 4 6) grid;
   yaxis label="Predicted Blood Lead Level (mcg/dL)";
run;
%mend;

A response-profile model with a random intercept

In the response-profile analysis, the data were analyzed by using PROC GLM, although these data do not satisfy the assumptions of PROC GLM. This article uses PROC MIXED in SAS/STAT software for the analyses. You can use the REPEATED statement in PROC MIXED to specify that the measurements for individuals are autocorrelated. We will use an unstructured covariance matrix for the model (TYPE=UN), but Fitzmaurice, Laird, and Ware (2011) discuss other options.

In the response-profile analysis, the model predicts the mean response for each treatment group. However, the baseline measurements for each subject are all different. For example, some start the trial with a blood-lead level that is higher than the mean; others start lower than the mean. To adjust for this variation among subjects, you can use the RANDOM statement with an INTERCEPT effect in PROC MIXED. Use the OUTPRED= option on the MODEL statement to write the predicted values for each subject to a data set, then plot the results. The following statements repeat the response-profile model of the previous blog post but include an intercept for each subject.

/* Repeat the response-profile analysis, but 
   use the RANDOM statement to add random intercept for each subject */
proc mixed data=TLCView;
   class id Time(ref='0') Treatment(ref='P');
   model y = Treatment Time Treatment*Time / s chisq outpred=MixedOut;
   repeated Time / type=un subject=id r;   /* measurements are repeated for subjects */
   random intercept / subject=id;          /* each subject gets its own intercept */
run;
 
title "Predicted Individual Growth Curves";
title2 "Random Intercept Model";
%SortAndPlot(MixedOut);
Random intercept model for response profiles

In this model, the shape of the response-profile curve is the same for all subjects in each treatment group. However, the curves are shifted up or down to better match the subject's individual profile.

A mixed model with a quadratic response curve

From the shape of the predicted response curve in the previous section, you might conjecture that a quadratic model might fit the data. You can fit a quadratic model in PROC MIXED by treating Time as a continuous variable. However, Time is also used in the REPEATED statement, and that statement requires a discrete CLASS variable. A resolution to this problem is to create a copy of the Time variable (call it T). You can include T in the CLASS and REPEATED statements and use Time in the MODEL statement as a continuous effect. The third analysis will require a linear spline variable, so the DATA VIEW also creates that variable, called T1.

/* Make "discrete time" (t) to use in REPEATED statement.
   Make spline effect with knot at t=1. */
data TLCView / view=TLCView;
set tlc;
t = Time;                       /* discrete copy of time */
T1 = ifn(Time<1, 0, Time - 1);  /* knot at Time=1 for PWL analysis */
run;

You can now use Time as a continuous effect and T to specify that measurements are repeated for the subjects.

/* Model time as continuous and use a quadratic model in Time. 
   For more about quadratic growth models, see
   https://support.sas.com/resources/papers/proceedings/proceedings/sugi27/p253-27.pdf */
proc mixed data=TLCView;
   class id t(ref='0') Treatment(ref='P');
   model y = Treatment Time Time*Time Treatment*Time / s outpred=MixedOutQuad;
   repeated t / type=un subject=id r;      /* measurements are repeated for subjects */
   random intercept / subject=id;          /* each subject gets its own intercept */
run;
 
title2 "Random Intercept; Quadratic in Time";
%SortAndPlot(MixedOutQuad);
Quadratic model for response profiles with random intercepts

The graph shows the individual predicted responses for each subject, but the quadratic model does not seem to capture the dramatic drop in the blood-lead level at T=1.

A mixed model with a piecewise linear response curve

The next model uses a piecewise linear model instead of a quadratic model. I've previously written about how to use spline effects in SAS to model data by using piecewise polynomials.

For the four time points, the mean response profile seems to go down for the experimental (succimer) group until T=1, and then increase at approximately a constant rate. Is this behavior caused by an underlying biochemical process? I don't know, but if you believe (based on domain knowledge) that this reflects an underlying process, you can incorporate that belief in a piecewise linear model. The first linear segment models the blood-lead level for T ≤ 1; the other segment models the blood-lead level for T > 1.

/* Piecewise linear (PWL) model with knot at Time=1.
   For more about PWL models, see Hwang (2015) 
   "Hands-on Tutorial for Piecewise Linear Mixed-effects Models Using SAS PROC MIXED"
   https://www.lexjansen.com/pharmasug-cn/2015/ST/PharmaSUG-China-2015-ST08.pdf    */
proc mixed data=TLCView;
   class id t(ref='0') Treatment(ref='P');
   model y = Treatment Time T1 Treatment*Time Treatment*T1 / s outpred=MixedOutPWL;
   repeated t / type=un subject=id r;      /* measurements are repeated for subjects */
   random intercept / subject=id;          /* each subject gets its own intercept */
run;
 
title2 "Random Intercept; Piecewise Linear Model";
%SortAndPlot(MixedOutPWL);
Piecewise linear model for response profiles with random intercepts

This graph shows a model that is piecewise linear. It assumes that the blood-lead level falls constantly during the first week of the treatment, then either falls or rises constantly during the remainder of the study. You could use the slopes of the lines to report the average rate of change during each time period.
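
For example, here is a sketch that adds ESTIMATE statements to the PWL model to report those slopes for the succimer group. The interaction coefficients must be listed in the order of the CLASS levels; with ref='P', I assume the 'A' level comes first, but you should verify the order against the Class Level Information table in the output.

proc mixed data=TLCView;
   class id t(ref='0') Treatment(ref='P');
   model y = Treatment Time T1 Treatment*Time Treatment*T1 / s;
   repeated t / type=un subject=id r;
   random intercept / subject=id;
   estimate 'Succimer slope, Time <= 1' Time 1 Treatment*Time 1 0;
   estimate 'Succimer slope, Time > 1'  Time 1 T1 1 Treatment*Time 1 0 Treatment*T1 1 0;
run;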

Further reading

There are many papers and books about mixed models in SAS. This article presents data and ideas that are discussed in detail in the book Applied Longitudinal Analysis (2012, 2nd Ed) by G. Fitzmaurice, N. Laird, and J. Ware. For an informative article about piecewise-linear mixed models, see Hwang (2015), "Hands-on Tutorial for Piecewise Linear Mixed-effects Models Using SAS PROC MIXED." For a comprehensive discussion of mixed models and repeated-measures analysis, I recommend SAS for Mixed Models, either the 2nd edition or the new edition.

Many people have questions about how to model longitudinal data in SAS. Post your questions to the SAS Support Communities, which has a dedicated community for statistical analysis.

The post Longitudinal data: The mixed model appeared first on The DO Loop.

December 3, 2019
 

Longitudinal data are used in many health-related studies in which individuals are measured at multiple points in time to monitor changes in a response variable, such as weight, cholesterol, or blood pressure. There are many excellent articles and books that describe the advantages of a mixed model for analyzing longitudinal data. Recently, I encountered an introductory article that summarizes the essential issues in a little more than five pages! You can download the article for free: "A Primer in Longitudinal Data Analysis", by G. Fitzmaurice and C. Ravichandran (2008), Circulation, 118(19), p. 2005-2010.

The article analyzes a set of longitudinal data in two ways. First, the authors use a traditional linear model to perform an "analysis of response profiles." Then, the authors discuss how a mixed model can correct some of the deficiencies of the analysis. This blog post analyzes the same data by using PROC GLM in SAS. A subsequent blog post analyzes the same data by using PROC MIXED in SAS.

Longitudinal Data: Treatment of lead-exposed children

Fitzmaurice and Ravichandran analyze data for a randomized trial involving toddlers who were exposed to high levels of lead. The article analyzes a subset of 100 children. Half the children were given a treatment (called succimer) and the other half were given a placebo. The blood lead levels were measured for each child at baseline (week 0), week 1, week 4, and week 6. The main scientific question is "whether the two treatment groups differ in their patterns of change from baseline in mean blood lead levels" (p. 2006).

The children in the subset were measured at all four time points. There are no missing values or mistimed measurements. (This situation is fairly unusual in longitudinal data, which is often plagued by missed appointments or individuals who leave the study.) The following "spaghetti plot" shows the individual measurements for the 100 children at each time point. Each line represents a child. Blue lines indicate that the child was in the placebo group; red lines indicate the experimental group that was given succimer.

You can download the SAS program that contains the data and the programs that create the graphs and tables in this article.

Analysis of response profiles

In an analysis of response profiles, you compare the mean response-by-time profile for the treatment and placebo groups. You can visualize the mean response over time for each group by using the VBAR statement in PROC SGPLOT. The following graph shows the mean value and standard error for each time point for each treatment group:
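
The code for that bar chart is not shown above; the following is a minimal sketch. (It assumes the tlc data set from the downloadable program, which contains the variables y, Time, and Treatment.)

proc sgplot data=tlc;
   vbar Time / response=y stat=mean limitstat=stderr
               group=Treatment groupdisplay=cluster;
   yaxis label="Mean Blood Lead Level (mcg/dL)";
run;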

If the treatment is ineffective, the line segments for the two treatment groups will be approximately parallel. The graph shows that this is not the case for these data. The visualization indicates that the mean blood-lead value for the treatment group (Treatment='A') is lower than for the placebo group at 1 and 4 weeks.

You can use PROC GLM to confirm that these differences are statistically significant and to estimate the effect that taking succimer had on the mean blood-lead level:

proc glm data=tlc;
   class Time(ref='0') Treatment(ref='P');
   model y = Treatment Time Treatment*Time / solution;
   output out=GLMOut Predicted=Pred;
quit;

According to the Type 3 statistics, all three effects in the model are significant. The parameter estimates (outlined in red) indicate that the mean blood-lead level for children in the Treatment='A' group is 11.4 mcg/dL lower than the children in the placebo group after 1 week. Similarly, after four weeks the mean of the experimental group is 8.8 mcg/dL lower than the placebo group. These are both significant differences, so the response-profile analysis provides a positive answer to the research question: the profile for the treatment group is lower than the placebo group.

Advantages and disadvantages of the analysis of response profiles

As discussed in Fitzmaurice and Ravichandran (2008), the analysis of the response profile has several advantages:

  • It is familiar to researchers who have experience with ANOVA.
  • It does not require any advanced statistical modeling, such as modeling the covariance of the repeated measurements.

However, this simple method suffers from several statistical problems:

  • Longitudinal data do not satisfy the assumptions of linear regression and ANOVA. Because the data contain repeated measures from the same individuals, the residual errors are neither independent nor homoscedastic (constant variance).
  • Some participants in a study might miss an appointment or drop out of the study. Others might be measured at time points that were not part of the design (for example, at 2 or 3 weeks). These two problems are known as incomplete data and mistimed measurements, respectively. Although the first can be handled by using an unbalanced ANOVA, the second is a problem that does not have a simple solution within an ANOVA model that uses discrete time points.
  • The response-profile analysis does not enable you to model each individual's response as a function of time.

Prediction of individual response profiles

The inability to model individual trajectories is often the reason that researchers abandon the response-profile analysis in favor of a more complicated mixed model. To be clear, the GLM model can make predictions, but the predicted values for every child in the placebo group are the same. Similarly, the predicted values for every child in the experimental group are the same. This is shown in the following panel of graphs, which shows the predicted response curves for six children in the study, three from each treatment group.

/* Look at predictions for each individual. They are identical! */
proc sort data=GLMOut(where=(ID in (1,2,4,5,6,7))) out=GLMSubset;
   by Treatment ID;
run;
 
title2 "Fixed Effect Model";
proc sgpanel data=GLMSubset dattrmap=Order;
   panelby Treatment ID;
   scatter x=Time y=y / group=Treatment attrid=Treat;
   series x=Time y=Pred / group=Treatment attrid=Treat;
run;

Notice that the predicted responses are the same across the top row (ID=2, 5, and 6). These children were all in the experimental group. Although the predicted values seem to fit the actual observed response for ID=2, the predicted responses for ID=5 and ID=6 are much higher than the observed responses. Although the predicted response is the best prediction for an "average patient," it does not account for individual differences in the study participants.

The same is true for the patients in the placebo group, three of which are plotted in the second row. The predicted values are "too low" for ID=1 and are "too high" for ID=4.

If modeling the individual profiles is important, then clearly this method is not sufficient. If you want to model the individual profiles, you can use a linear mixed model. The mixed model also addresses other deficiencies of the response-profile analysis. The mixed model is described in the next blog post.

You can download the SAS code and data for the response-profile analysis.

Further reading

This blog post is a brief summary of the article "A Primer in Longitudinal Data Analysis" (Fitzmaurice and Ravichandran, 2008). See the article for more details. These data and ideas are also discussed in the book Applied Longitudinal Analysis (2012, 2nd Ed) by G. Fitzmaurice, N. Laird, and J. Ware. You can download the data (and SAS programs) from the book's web site.

The post Longitudinal data: The response-profile model appeared first on The DO Loop.

November 27, 2019
 
Visualization of a quadratic function and a linear subspace

This article discusses how to restrict a multivariate function to a linear subspace. This is a useful technique in many situations, including visualizing an objective function that is constrained by linear equalities. For example, the graph to the right is from a previous article about how to evaluate quadratic polynomials. The graph shows a heat map for a quadratic polynomial of two variables. The diagonal line represents a linear constraint between the X and Y variables. If you restrict the polynomial to the diagonal line, you obtain a one-dimensional function that you can easily graph and visualize. By repeating this process for other lines, you can construct a "stack" of lower-dimensional slices that enable you to understand the graph of the original bivariate function.

This example generalizes. For a function of many variables, you can restrict the function to a lower-dimensional subspace. By visualizing the restricted function, you can understand the high-dimensional function better. I previously demonstrated this technique for multidimensional regression models by using the SLICEFIT option in the EFFECTPLOT statement in SAS. This article shows how to use ideas from vector calculus to restrict a function to a parameterized linear subspace.

Visualize high-dimensional data and functions

There are basically two techniques for visualizing high-dimensional objects: projection and slicing. Many methods in multivariate statistics compute a linear subspace such that the projection of the data onto the subspace has a desirable property. One example is principal component analysis, which endeavors to find a low dimensional subspace that captures most of the variation in the data. Another example is linear discriminant analysis, which aims to find a linear subspace that maximally separates groups in the data. In both cases, the data are projected onto the computed subspaces to reveal characteristics of the data. There are also statistical methods that attempt to find nonlinear subspaces.

Whereas projection is used for data, slicing is often used to visualize the graphs of functions. In regression, you model a response variable as a function of the explanatory variables. Slicing the function along a linear subspace of the domain can reveal important characteristics about the function. That is the method used by the SLICEFIT option in the EFFECTPLOT statement, which enables you to visualize multidimensional regression models. The most common "slice" for a regression model is to specify constant values for all but one continuous variable in the model.

When visualizing an objective function that is constrained by a set of linear equalities, the idea is similar. The main difference is that the "slice" is usually determined by a general linear combination of variables.

Restrict a function to a one-dimensional subspace

A common exercise in multivariate calculus is to restrict a bivariate function to a line. (This is equivalent to a "slice" of the bivariate function.) Suppose that you choose a point, x0, and a unit vector, u, which represents the "direction". Then a parametric line through x0 in the direction of u is given by the vector expression v(t) = x0 + t u, where t is any real number. If you restrict a multivariate function to the image of v, you obtain a one-dimensional function.

For example, consider the quadratic polynomial in two variables:
f(x,y) = 9x² + xy + 4y² − 12x − 4y + 6
The function is visualized by the heat map at the top of this article. Suppose x0 = {0, -1} and choose u to be the unit vector in the direction of the vector {3, 1}. (This choice corresponds to a linear constraint of the form x − 3y = 3: points on the line have the form (x, y) = (3s, -1 + s), so x − 3y = 3s + 3 − 3s = 3.)

The following SAS/IML program constructs the parameterized line v(t) for a uniform set of t values. The quadratic function is evaluated at this set of values and the resulting one-dimensional function is graphed:

proc iml;
/* Evaluate the quadratic function at each column of X and return a row vector. */
start Func(XY);
   x = XY[1,]; y = XY[2,];
   return (9*x##2  + x#y + 4*y##2) - 12*x - 4*y + 6;
finish;
 
x0 = {0, -1};          /* evaluate polynomial at this point */
d  = {3, 1};           /* vector that determines direction */
u  = d / norm(d);      /* unit vector (direction) */
 
t = do(-2, 2, 0.1);    /* parameter values */
v = x0 + t @ u;        /* evenly spaced points along the linear subspace */
f = Func(v);           /* evaluate the 2-D function along the 1-D subspace */
title "Quadratic Form Evaluated on Linear Subspace";
call series(t, f) grid={x y};
Quadratic function restricted to linear subspace

The graph shows the restriction of the 2-D function to the 1-D subspace. According to the graph, the restricted function reaches a minimum value for t ≈ 0.92. The corresponding (x, y) values and the corresponding function value are shown below:

tMin = 0.92;           /* t* = parameter near the minimum */
vMin = x0 + tMin * u;  /* corresponding (x(t*), y(t*)) */
fMin = Func(vMin);     /* f(x(t*), y(t*)) */
print tMin vMin fMin;

If you look back to the heat map at the top of this article, you will see that the point (x,y) = (0.87, -0.71) corresponds to the location at which the function achieves a minimum value when restricted to the linear subspace. The value of the function at that (x,y) value is approximately 6.6.

The SAS/IML program uses the Kronecker direct-product operator (@) to create points along the line. The operation is v = x0 + t @ u. The symbols x0 and u are both column vectors with two elements. The Kronecker product creates a 2 x k matrix, where k is the number of elements in the row vector t. The Kronecker product enables you to vectorize the program instead of looping over the elements of t and forming the individual vectors x0 + t[i]*u.
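
For comparison, here is the loop that the Kronecker statement replaces; both produce the same 2 x k matrix of points:

v2 = j(nrow(x0), ncol(t), .);     /* allocate a 2 x k matrix */
do i = 1 to ncol(t);
   v2[,i] = x0 + t[i]*u;          /* i_th point along the line */
end;
/* v2 equals v = x0 + t @ u */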

You can generalize this example to evaluate a multivariate function on a two-dimensional subspace. If u1 and u2 are two orthogonal, p-dimensional, unit vectors, the expression x0 + s*u1 + t*u2 spans a two-dimensional subspace of Rp. You can use the ExpandGrid function in SAS/IML to generate ordered pairs (s, t) on a two-dimensional grid.
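
Here is a sketch of the two-dimensional version for the current example. (In 2-D the "subspace" is the whole plane, but the same statements apply to vectors of any dimension; the vector u2 is chosen orthogonal to u.)

u2 = {-1, 3} / sqrt(10);                        /* unit vector orthogonal to u */
st = expandgrid( do(-2,2,0.5), do(-2,2,0.5) );  /* (s,t) parameter pairs */
V  = x0 + (u @ st[,1]`) + (u2 @ st[,2]`);       /* each column is x0 + s*u + t*u2 */
fV = Func(V);                                   /* evaluate f on the 2-D grid */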

Summary

In summary, you can slice a multivariate function along a linear subspace. Usually, 1-D or 2-D subspaces are used and the resulting (restricted) function is graphed. This helps you to understand the function. You can choose a series of slices (often parallel to each other) and "stack" the slices in order to visualize the multivariate function. Typically, this technique works best for functions that have three or four continuous variables.

You can download the SAS program that creates the graphs in this article.

The post Evaluate a function on a linear subspace appeared first on The DO Loop.

November 25, 2019
 

What is an efficient way to evaluate a multivariate quadratic polynomial in p variables? The answer is to use matrix computations! A multivariate quadratic polynomial can be written as the sum of a purely quadratic term (degree 2), a purely linear term (degree 1), and a constant term (degree 0). The purely quadratic term is called a quadratic form. This article shows how to use matrix computations to efficiently evaluate a multivariate quadratic polynomial.

Quadratic polynomials and matrix expressions

Visualization of a quadratic polynomial in SAS

As I wrote in a previous article about optimizing a quadratic function, the matrix of second derivatives and the gradient of first derivatives appear in the matrix representation of a quadratic polynomial. I will use the same example as in my previous article. Namely, the following quadratic polynomial in two variables:
f(x,y) = 9x² + xy + 4y² − 12x − 4y + 6
       = (1/2) x` Q x + L` x + 6
where Q = {18 1, 1 8} is a 2 x 2 symmetric matrix, L = {-12, -4} is a column vector, and x = {x, y} is a column vector that represents the coordinates at which to evaluate the quadratic function.

The graph at the right visualizes this quadratic function by using a heat map. Small values of the function are shown in white. Larger values of the function are shown in blues and greens. The largest values are shown in reds and blacks. The global minimum of this function is approximately (x, y) = (0.64, 0.42), and that point is indicated by a star.

This example generalizes. Every quadratic function in p variables can be written as the sum of a quadratic form (1/2 x` Q x), a linear term (L` x) and a constant. Here Q is a p x p symmetric matrix of second derivatives (the Hessian matrix of f) and L and x are p-dimensional column vectors.

Evaluate quadratic polynomial in SAS

Because the computation involves vectors and matrices, the SAS/IML language is the natural place to use matrices to evaluate a quadratic function. The following SAS/IML statements implement a simple function to evaluate a quadratic polynomial (given in terms of Q, L, and a constant) at an arbitrary two-dimensional vector, x:

proc iml;
/* Evaluate  f(x) = 0.5 * x` * Q * x + L`*x + const, where
   Q is p x p symmetric matrix, 
   L and x are col vector with p elements.
   This version evaluates ONE vector x and returns a scalar value. */
start EvalQuad(x, Q, L, const=0);
   return 0.5 * x`*Q*x + L`*x + const;
finish;
 
/* compute Q and L for f(x,y) = 9*x##2 + x#y + 4*y##2 - 12*x - 4*y + 6 */
Q = {18 1,             /* matrix of second derivatives */
      1 8};
L = { -12, -4};        /* use column vectors for L and x */
const = 6;
x0 = {0, -1};          /* evaluate polynomial at this point */
f = EvalQuad(x0, Q, L, const);
print f[L="f(0,-1)"];

Evaluate a quadratic polynomial at multiple points

When I implement a function in the SAS/IML language, I try to "vectorize" it so that it can evaluate multiple points in a single call. Often you can use matrix operations to vectorize a function evaluation, but I don't see how to make the math work for this problem. The natural way to evaluate a quadratic polynomial at k vectors X1, X2, ..., Xk is to pack those vectors into a p x k matrix X such that each column of X is a point at which to evaluate the polynomial. Unfortunately, the matrix computation of the quadratic form M = 0.5 * X`*Q*X results in a k x k matrix. Only the k diagonal elements are needed for evaluating the polynomial on the k input vectors, so although it is possible to compute M, doing so would be very inefficient.

In this case, it seems more efficient to loop over the columns of X. The following statements define a SAS/IML module that evaluates a quadratic polynomial at every column of X and returns a row vector of the results. The module is demonstrated by calling it on a matrix that has 5 columns.

/* Evaluate the quadratic function at each column of X and return a row vector. */
start EvalQuadVec(X, Q, L, const=0);
   f = j(1, ncol(X), .);
   do i = 1 to ncol(X);
      v = X[,i];
      f[i] = 0.5 * v`*Q*v + L`*v + const;
   end;
   return f;
finish;
 
/*    X1  X2  X3 X4  X5  */
vx = {-1 -0.5 0  0.5 1 ,
      -3 -2  -1  0   1 };
f = EvalQuadVec(vx, Q, L);    /* omit the 4th argument to use the default const=0 */
print (vx // f)[r={'x' 'y' 'f(x,y)'} c=('X1':'X5')];

Evaluate a quadratic polynomial on a uniform grid of points

You can use the EvalQuadVec function to evaluate a quadratic polynomial on any set of multiple points. In particular, you can use the ExpandGrid function to construct a regular 2-D grid of points. By evaluating the function at each point on the grid, you can visualize the function. The following statements create a heat map of the function on a regular grid. The heat map is shown at the top of this article.

x = do(-1, 1, 0.1); 
y = do(-3, 1.5, 0.1);
xy = expandgrid(x, y);             /* 966 x 2 matrix */
f = EvalQuadVec(xy`, Q, L, const); /* evaluate polynomial at all points */
 
/* write results to a SAS data set and visualize the function by using a heat map */
M = xy || f`;
create Heatmap from M[c={'x' 'y' 'f'}];  append from M;  close;
QUIT;
 
data optimal;
   xx=0.64; yy=0.42;  /* optional: add the optimal (x,y) value */
run;
data All;  set Heatmap optimal; run;
 
title "Heat Map of Quadratic Function";
proc sgplot data=All;
   heatmapparm x=x y=y colorresponse=f / colormodel= (WHITE CYAN YELLOW RED BLACK);
   scatter x=xx y=yy / markerattrs=(symbol=StarFilled);
   xaxis offsetmin=0 offsetmax=0;
   yaxis offsetmin=0 offsetmax=0;
run;

Quadratic approximations

Perhaps you do not often use quadratic polynomials. This technique is useful even for general nonlinear functions because it enables you to find the best quadratic approximation to a multivariate function at any point x0. The multivariate Taylor series at the point x0, truncated at second order, is
f(x) ≈ f(x0) + L` · (xx0) + (1/2) (xx0)` · Q · (xx0)
where L = ∇f(x0) is the gradient of f evaluated at x0 and Q = D²f(x0) is the symmetric Hessian matrix of second derivatives of f evaluated at x0.
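
As a sketch of this idea, you can use the NLPFDD subroutine in SAS/IML to compute finite-difference derivatives, which gives L and Q for a smooth function at any point. (NLPFDD expects a module that takes a row vector and returns a scalar, which is a different convention than the column-vector modules above.)

proc iml;
start F2(z);                            /* same polynomial; row-vector argument */
   return 9*z[1]##2 + z[1]*z[2] + 4*z[2]##2 - 12*z[1] - 4*z[2] + 6;
finish;

x0 = {0 -1};                            /* expansion point (row vector) */
call nlpfdd(f0, grad, Hess, "F2", x0);  /* finite-difference derivatives */
print f0, grad, Hess;                   /* grad estimates L` and Hess estimates Q at x0 */
quit;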

Summary

In summary, you can use matrix computations to evaluate a multivariate quadratic polynomial. This article shows how to evaluate a quadratic polynomial at multiple points. For a polynomial of two variables, you can use this technique to visualize quadratic functions.

The post Evaluate a quadratic polynomial in SAS appeared first on The DO Loop.

November 20, 2019
 

In a linear regression model, the predicted values are on the same scale as the response variable. You can plot the observed and predicted responses to visualize how well the model agrees with the data. However, for generalized linear models, there is a potential source of confusion. Recall that a generalized linear model (GLIM) has two components: a linear predictor, which is a linear combination of the explanatory variables, and a transformation (called the inverse link function) that maps the linear predictor onto the scale of the data. Consequently, SAS regression procedures support two types of predicted values and prediction limits. In the SAS documentation, the first type is called "predictions on the linear scale" whereas the second type is called "predictions on the data scale."

For many SAS procedures, the default is to compute predicted values on the linear scale. However, for GLIMs that model nonnormal response variables, it is more intuitive to predict on the data scale. The ILINK option, which is shorthand for "apply the inverse link transformation," converts the predicted values to the data scale. This article shows how the ILINK option works by providing an example for a logistic regression model, which is the most familiar generalized linear model.

Review of generalized linear models

The SAS documentation provides an overview of GLIMs and link functions. The documentation for PROC GENMOD provides a list of link functions for common regression models, including logistic regression, Poisson regression, and negative binomial regression.

Briefly, the linear predictor is
η = X*β
where X is the design matrix and β is the vector of regression coefficients. The link function (g) is a monotonic function that relates the linear predictor to the conditional mean of the response. Sometimes the symbol μ is used to denote the conditional mean of the response (μ = E[Y|x]), which leads to the formula
g(μ) = X*β

In SAS, you will often see options and variables names (in output data sets) that contains the substring 'XBETA'. When you see 'XBETA', it indicates that the statistic or variable is related to the LINEAR predictor. Because the link function, g, is monotonic, it has an inverse, g-1. For generalized linear models, the inverse link function maps the linear-scale predictions to data-scale predictions: if η = x β is a predicted value on the linear scale, then g-1(η) is the predicted value for x on the data scale.

When the response variable is binary, the GLIM is the logistic model. If you use the convention that Y=1 indicates an event and Y=0 indicates the absence of an event, then the "data scale" is [0, 1] and the GLIM predicts the probability that the event occurs. For the logistic GLIM, the link function is the logit function:
g(μ) = logit(μ) = log( μ / (1 - μ) )
The inverse of the logit function is called the logistic function:
g-1(η) = logistic(η) = 1 / (1 + exp(-η))
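
The DATA step provides the LOGIT and LOGISTIC functions, so you can verify the round trip between the two scales with a quick sketch:

data _null_;
   mu  = 0.25;
   eta = logit(mu);       /* log(mu/(1-mu)): data scale -> linear scale */
   p   = logistic(eta);   /* 1/(1+exp(-eta)): linear scale -> data scale */
   put eta= p=;           /* p equals the original mu */
run;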

To demonstrate the ILINK option, the next sections perform the following tasks:

  1. Use PROC LOGISTIC to fit a logistic model to data. You can use the STORE statement to store the model to an item store.
  2. Use the SCORE statement in PROC PLM to score new data. This example scores data by using the ILINK option.
  3. Score the data again, but this time do not use the ILINK option. Apply the logistic transformation to the linear estimates to demonstrate that relationship between the linear scale and the data scale.

The transformation between the linear scale and the data scale is illustrated by the following graph:

Fit a logistic model

The following data are from the documentation for PROC LOGISTIC. The model predicts the probability of Pain="Yes" (the event) for patients in a study, based on the patients' sex, age, and treatment method ('A', 'B', or 'P'). The STORE statement in PROC LOGISTIC creates an item store that can be used for many purposes, including scoring new observations.

data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   DROP Duration;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;
 
title 'Logistic Model on Neuralgia';
proc logistic data=Neuralgia;
   class Sex Treatment;
   model Pain(Event='Yes')= Sex Age Treatment;
   store PainModel / label='Neuralgia Study';  /* store model for post-fit analysis */
run;

Score new data by using the ILINK option

There are many reasons to use PROC PLM, but an important purpose of PROC PLM is to score new observations. Given information about new patients, you can use PROC PLM to predict the probability of pain if these patients are given a specific treatment. The following DATA step defines the characteristics of four patients who will receive Treatment B. The call to PROC PLM scores and uses the ILINK option to predict the probability of pain:

/* Use PLM to score new patients */
data NewPatients;
   input Treatment $ Sex $ Age Duration;
   DROP Duration;
   datalines;
B M 67 15 
B F 73  5 
B M 74 12 
B F 79 16 
;
 
/* predictions on the DATA scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreILink
         predicted lclm uclm / ilink; /* ILINK gives probabilities */
run;
 
proc print data=ScoreILink; run;

The Predicted column contains probabilities in the interval [0, 1]. The 95% prediction limits for the predictions are given by the LCLM and UCLM columns. For example, the prediction interval for the 67-year-old man is approximately [0.03, 0.48].

These values and intervals are transformations of analogous quantities on the linear scale. The logit transformation maps the predicted probabilities to the linear estimates. The inverse logit (logistic) transformation maps the linear estimates to the predicted probabilities.

Linear estimates and the logistic transformation

The linear scale is important because effects are additive on this scale. If you are testing the difference of means between groups, the tests are performed on the linear scale. For example, the ESTIMATE, LSMEANS, and LSMESTIMATE statements in SAS perform hypothesis testing on the linear estimates. Each of these statements supports the ILINK option, which enables you to display predicted values on the data scale.
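
For example, the following sketch uses the PainModel item store from above. The LSMEANS statement computes the treatment means on the linear scale, and the ILINK option reports them (and their confidence limits) as probabilities:

proc plm restore=PainModel;
   lsmeans Treatment / cl ilink;   /* means computed on linear scale, reported as probabilities */
run;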

To demonstrate the connection between the predicted values on the linear and data scale, the following call to PROC PLM scores the same data according to the same model. However, this time the ILINK option is omitted, so the predictions are on the linear scale.

/* predictions on the LINEAR scale */
proc plm restore=PainModel noprint;
   score data=NewPatients out=ScoreXBeta(
         rename=(Predicted=XBeta LCLM=LowerXB UCLM=UpperXB))
         predicted lclm uclm;         /* ILINK not used, so linear predictor */
run;
 
proc print data=ScoreXBeta; run;

I have renamed the variables that PROC PLM creates for the estimates on the linear scale. The XBeta column shows the predicted values. The LowerXB and UpperXB columns show the prediction interval for each patient. The XBeta column shows the values you would obtain if you use the parameter estimates table from PROC LOGISTIC and apply those estimates to the observations in the NewPatients data.

To demonstrate that the linear estimates are related to the estimates in the previous section, the following SAS DATA step uses the logistic (inverse logit) transformation to convert the linear estimates onto the predicted probabilities:

/* Use the logistic (inverse logit) to transform the linear estimates
     (XBeta) into probability estimates in [0,1], which is the data scale.
   You can use the logit transformation to go the other way. */
data LinearToData;
   set ScoreXBeta;   /* predictions on linear scale */
   PredProb   = logistic(XBeta);
   LCLProb    = logistic(LowerXB);
   UCLProb    = logistic(UpperXB); 
run;
 
proc print data=LinearToData;
   var Treatment Sex Age PredProb LCLProb UCLProb;
run;

The transformation of the linear estimates gives the same values as the estimates that were obtained by using the ILINK option in the previous section.
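
If you want a programmatic check, a sketch like the following call to PROC COMPARE (not part of the original program; the CRITERION= tolerance is arbitrary) confirms that the two sets of estimates agree to within numerical precision:

/* Sketch: compare the ILINK predictions with the transformed linear
   estimates. The CRITERION= value is an arbitrary numerical tolerance. */
proc compare base=ScoreILink compare=LinearToData criterion=1e-8;
   var  Predicted LCLM    UCLM;
   with PredProb  LCLProb UCLProb;
run;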

Summary

In summary, there are two different scales for predicting values for a generalized linear model. When you report predicted values, it is important to specify which scale you are using. The data scale makes intuitive sense because it is the same scale as the response variable. You can use the ILINK option in SAS to get predicted values and confidence intervals on the data scale.

The post Predicted values in generalized linear models: The ILINK option in SAS appeared first on The DO Loop.

November 18, 2019
 

My colleague, Mike Drutar, recently showed how to create a "strip plot" that shows the distribution of temperatures for each calendar month at a particular location. Mike created the strip plot in SAS Visual Analytics by using a point-and-click interface. This article shows how to create a similar graph by using SAS programming statements and the SGPLOT procedure in Base SAS. Along the way, I'll point out some tips and best practices for creating a strip plot.

Daily temperature data for Albany, NY

The data in this article is 25 years of daily temperatures in Albany, NY, from 1995 to 2019. I have analyzed this data previously when discussing how to model periodic data. The following DATA step downloads the data from the internet and adds a SAS date variable:

/* Read the data directly from the Internet */
filename webfile url "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYALBANY.txt" 
 /* some corporate users might need to add  proxy='http://...:80' */;
data TempData;
   infile webfile;
   input month day year Temperature;
   format Date date9.;
   Date = MDY(month, day, year);
   if Temperature=-99 then delete;   /* -99 is used for missing values */
run;

A basic strip plot

In a basic strip plot, a continuous variable is plotted against levels of a categorical variable. If the values of the continuous variable are distinct, this technique theoretically can show all data values. In practice, however, there will be overplotting of markers, especially for large data sets. You can use jittering and semi-transparent markers to reduce the effect of overplotting. The SCATTER statement in PROC SGPLOT supports the JITTER option and the TRANSPARENCY= option.

Drutar's strip plot displays temperatures for each month of the year. For the Albany temperature data, you might assume that you need to create a new categorical variable that has the values 'Jan', 'Feb', ..., 'Dec'. However, you do not need to create a new variable. You can use the FORMAT statement and the MONNAMEw. format to convert the Date variable into discrete values "on the fly." This technique is described in the article "Use SAS formats to bin numerical variables." If you use the TYPE=DISCRETE option on the XAXIS statement, you obtain a basic strip plot of temperature versus each calendar month.

/* Create a strip plot.
   Use a format to bin dates into months:
   https://blogs.sas.com/content/iml/2016/08/08/sas-formats-bin-numerical-variables.html
*/
proc sgplot data=TempData;
   format Date MONNAME3.;                 /* bin Date into 12 months */
   scatter x=Date y=Temperature / 
           jitter transparency=0.85       /* handle overplotting */
           markerattrs=(symbol=CircleFilled) legendlabel="Daily Temperature";
   xaxis type=discrete display=(nolabel); /* create categorical axis */
   yaxis grid label="Temperature (F)"; 
run;
Basic strip plot in SAS

You can see dark areas in the graph. These are high-density regions in which many semi-transparent markers overlap because the daily temperatures are similar. For some applications, it is useful to further emphasize these regions by overlaying statistical estimates that show the average and range of each strip, as shown in the next section.

Of course, if your data already contains a categorical variable, you can create a strip plot directly. You will not need to use the FORMAT trick.
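
For example, the following minimal sketch (which uses the Sashelp.Iris data rather than the temperature data) plots a measurement against the categorical Species variable:

/* Sketch: a strip plot for data that already contain a categorical
   variable (Species in Sashelp.Iris). No format is needed. */
proc sgplot data=Sashelp.Iris;
   scatter x=Species y=SepalLength / jitter transparency=0.5
           markerattrs=(symbol=CircleFilled);
   yaxis grid label="Sepal Length (mm)";
run;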

Overlay a visualization of the center and variation

To indicate the center of each month's temperature distribution, Drutar displays the median value for each month. He also overlays a line segment that shows the data range (min to max). In my strip plot, I will overlay the median but will use the interquartile range (Q1 to Q3) to display the variation in the data. You can use PROC MEANS to create a SAS data set that contains the statistics for each month:

/* write statistics for each month to a data set */
proc means data=Tempdata noprint;
   format Date MONNAME3.;            /* bin Date into 12 months */
   class Date;                       /* output the statistics for each month */
   var Temperature;
   output out=MeanOut(where=(_TYPE_=1)) median=Median Q1=Q1 Q3=Q3;
run;

To create a new graph that overlays the statistics, append the statistics and the data. You can then use a high-low plot to show the variation in the data and a second SCATTER statement to overlay the median values, as follows:

data StripPlot;
   set TempData MeanOut;                  /* append statistics to data */
run;
 
proc sgplot data=StripPlot;
   format Date MONNAME3.;                 /* bin Date into 12 months */
   scatter x=Date y=Temperature / 
           jitter transparency=0.85       /* handle overplotting */
           markerattrs=(symbol=CircleFilled) legendlabel="Daily Temperature";
   highlow x=date low=Q1 high=Q3 /        /* use high-low plot to display range of data */
           legendlabel="IQR" lineattrs=GraphData3(thickness=5);
   scatter x=date y=Median /              /* plot the median value for each strip */
           markerattrs=GraphData2(size=12 symbol=CircleFilled);
   xaxis type=discrete display=(nolabel); /* create categorical axis */
   yaxis grid label="Temperature (F)"; 
run;
Strip plot in SAS with overlays of median and interquartile range statistics

In summary, you can use the SCATTER statement in the SGPLOT procedure to create a basic strip plot. One axis displays the continuous variable; the other displays a discrete (categorical) variable. You can use the JITTER option to reduce overplotting. For data that contain thousands of observations, you might also want to use the TRANSPARENCY= option to display semi-transparent markers. Typically, you will use higher transparency values for larger data sets. Finally, you can use PROC MEANS to create an output data set that contains summary statistics for each strip. This article computes and displays the median and interquartile range for each strip, but you could also use the mean and standard deviation, as sketched below.
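
The following sketch computes the statistics for the mean-and-standard-deviation alternative. The name MeanStdOut is arbitrary; the Low and High variables could replace Q1 and Q3 in the HIGHLOW statement of the earlier program:

/* Sketch: compute the mean and standard deviation for each month.
   Low and High can replace Q1 and Q3 in the earlier HIGHLOW statement. */
proc means data=TempData noprint;
   format Date MONNAME3.;            /* bin Date into 12 months */
   class Date;
   var Temperature;
   output out=MeanStdOut(where=(_TYPE_=1)) mean=Mean std=Std;
run;
 
data MeanStdOut;
   set MeanStdOut;
   Low  = Mean - Std;                /* mean +/- one standard deviation */
   High = Mean + Std;
run;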

The post Create a strip plot in SAS appeared first on The DO Loop.

November 13, 2019
 

Biplots are two-dimensional plots that help to visualize relationships in high dimensional data. A previous article discusses how to interpret biplots for continuous variables. The biplot projects observations and variables onto the span of the first two principal components. The observations are plotted as markers; the variables are plotted as vectors. The observations and/or vectors are not usually on the same scale, so they need to be rescaled so that they fit on the same plot. There are four common scalings (GH, COV, JK, and SYM), which are discussed in the previous article.

This article shows how to create biplots in SAS. In particular, the goal is to create the biplots by using modern ODS statistical graphics. You can obtain biplots that use the traditional SAS/GRAPH system by using the %BIPLOT macro by Michael Friendly. The %BIPLOT macro is very powerful and flexible; it is discussed later in this article.

There are four ways to create biplots in SAS by using ODS statistical graphics:

  • You can use PROC PRINQUAL in SAS/STAT software to create the COV biplot.
  • If you have a license for SAS/GRAPH software (and SAS/IML software), you can use Friendly's %BIPLOT macro and use the OUT= option in the macro to save the coordinates of the markers and vectors. You can then use PROC SGPLOT to create a modern version of Friendly's biplot.
  • You can use the matrix computations in SAS/IML to "manually" compute the coordinates of the markers and vectors. (These same computations are performed internally by the %BIPLOT macro.) The WriteBiplot module writes the coordinates to a SAS data set, which you can then plot by using PROC SGPLOT.
  • You can use the Biplot module in SAS/IML to create a biplot directly.

For consistency with the previous article, all methods in this article standardize the input variables to have mean zero and unit variance (use the STD=STD option in the %BIPLOT macro). All biplots show projections of the same four-dimensional Fisher's iris data. The following DATA step creates a blank label variable; if you do not supply an ID variable, some biplots label the observations by their observation numbers.

data iris;
   set Sashelp.iris;
   id = " ";        /* create an empty label variable */
run;

Use PROC PRINQUAL to compute the COV biplot

The PRINQUAL procedure can perform a multidimensional preference analysis, which is visualized by using a MDPREF plot. The MDPREF plot is closely related to a biplot (Jackson (1991), A User's Guide to Principal Components, p. 204). You can get PROC PRINQUAL to produce a COV biplot by doing the following:

  • Use the N=2 option to specify that you want to compute two principal components.
  • Use the MDPREF=1 option to specify that the procedure should not rescale the vectors in the biplot. By default, MDPREF=2.5, which makes the vectors appear 2.5 times longer than they should be. (More on scaling vectors later.)
  • Use the IDENTITY transformation so that the variables are not transformed in a nonlinear manner.

The following PROC PRINQUAL statements produce a COV biplot (click to enlarge):

proc prinqual data=iris plots=(MDPref) 
              n=2       /* project onto Prin1 and Prin2 */
              mdpref=1; /* use COV scaling */
   transform identity(SepalLength SepalWidth PetalLength PetalWidth);  /* identity transform */
   id ID;
   ods select MDPrefPlot;
run;
COV biplot, created in SAS by using PROC PRINQUAL

Use Friendly's %BIPLOT macro

Friendly's books [SAS System for Statistical Graphics (1991) and Visualizing Categorical Data (2000)] introduced many SAS data analysts to the power of using visualization to accompany statistical analysis, and especially the analysis of multivariate data. His macros use traditional SAS/GRAPH graphics from the 1990s. In the mid-2000s, SAS introduced ODS statistical graphics, which were released with SAS 9.2. Although the %BIPLOT macro does not use ODS statistical graphics directly, the macro supports the OUT= option, which enables you to create an output data set that contains all the coordinates for creating a biplot.

The following example assumes that you have downloaded the %BIPLOT macro and the %EQUATE macro from Michael Friendly's web site.

/* A. Use %BIPLOT macro, which uses SAS/IML to compute the biplot coordinates. 
      Use the OUT= option to get the coordinates for the markers and vectors.
   B. Transpose the data from long to wide form.
   C. Use PROC SGPLOT to create the biplot
*/
%let FACTYPE = SYM;   /* options are GH, COV, JK, SYM */
title "Biplot: &FACTYPE, STD";
%biplot(data=iris, 
        var=SepalLength SepalWidth PetalLength PetalWidth, 
        id=id,
        factype=&FACTYPE,  /* GH, COV, JK, SYM */
        std=std,           /* NONE, MEAN, STD  */
        scale=1,           /* if you do not specify SCALE=1, vectors are auto-scaled */
        out=biplotFriendly,/* write SAS data set with results */ 
        symbols=circle dot, inc=1);
 
/* transpose from long to wide */
data Biplot;
set biplotFriendly(where=(_TYPE_='OBS') rename=(dim1=Prin1 dim2=Prin2 _Name_=_ID_))
    biplotFriendly(where=(_TYPE_='VAR') rename=(dim1=vx dim2=vy _Name_=_Variable_));
run;
 
proc sgplot data=Biplot aspect=1 noautolegend;
   refline 0 / axis=x; refline 0 / axis=y;
   scatter x=Prin1 y=Prin2 / datalabel=_ID_;
   vector  x=vx    y=vy    / datalabel=_Variable_
                             lineattrs=GraphData2 datalabelattrs=GraphData2;
   xaxis grid offsetmin=0.1 offsetmax=0.2;
   yaxis grid;
run;
SYM biplot, created in SAS by using ODS statistical graphics

Because you are using PROC SGPLOT to display the biplot, you can easily configure the graph. For example, I added grid lines, which are not part of the output from the %BIPLOT macro. You could easily change attributes such as the size of the fonts or add additional features such as an inset. With a little more work, you can merge the original data and the biplot data and color-code the markers by a grouping variable (such as Species) or by a continuous response variable.
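
The following sketch shows one way to do that. It is not part of the %BIPLOT macro output; it assumes that the OBS rows of the Biplot data set (created above) are in the same order as the rows of Sashelp.Iris:

/* Sketch: color-code the biplot markers by Species. Assumes the scored
   observations in the Biplot data set are in the same order as Sashelp.Iris. */
data BiplotGroups;
   merge Biplot Sashelp.Iris(keep=Species);   /* one-to-one merge */
run;
 
proc sgplot data=BiplotGroups aspect=1;
   refline 0 / axis=x; refline 0 / axis=y;
   scatter x=Prin1 y=Prin2 / group=Species;   /* markers colored by group */
   vector  x=vx    y=vy    / datalabel=_Variable_
                             lineattrs=GraphData2 datalabelattrs=GraphData2;
   xaxis grid offsetmin=0.1 offsetmax=0.2;
   yaxis grid;
run;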

Notice that the %BIPLOT macro supports a SCALE= option. The SCALE= option applies an additional linear scaling to the vectors. You can use this option to increase or decrease the lengths of the vectors in the biplot. For example, in the SYM biplot, shown above, the vectors are long relative to the range of the data. If you want to display vectors that are only 25% as long, you can specify SCALE=0.25. You can specify numbers greater than 1 to increase the vector lengths. For example, SCALE=2 doubles the lengths of the vectors. If you omit the SCALE= option or set SCALE=0, the %BIPLOT macro automatically scales the vectors to the range of the data. If you use the SCALE= option, you should tell the reader that you did so.

SAS/IML modules that compute biplots

The %BIPLOT macro uses SAS/IML software to compute the locations of the markers and vectors for each type of biplot. I wrote three SAS/IML modules that perform the three steps of creating a biplot:

  • The CalcBiplot module computes the projections of the observations and variables onto the first few principal components. This module (formerly named CalcPrinCompBiplot) was written in the mid-2000s and has been distributed as part of the SAS/IML Studio application. It returns the scores and vectors as SAS/IML matrices.
  • The WriteBiplot module calls the CalcBiplot module and then writes the scores to a SAS data set called _SCORES and the vectors (loadings) to a SAS data set called _VECTORS. It also creates two macro variables, MinAxis and MaxAxis, which you can use if you want to equate the horizontal and vertical scales of the biplot axes.
  • The Biplot function calls the WriteBiplot module and then calls PROC SGPLOT to create a biplot. It is the "raw SAS/IML" version of the %BIPLOT macro.

You can use the CalcBiplot module to compute the scores and vectors and return them in IML matrices. You can use the WriteBiplot module if you want that information in SAS data sets so that you can create your own custom biplot. You can use the Biplot module to create standard biplots. The Biplot and WriteBiplot modules are demonstrated in the next sections.

Use the Biplot module in SAS/IML

The syntax of the Biplot module is similar to the %BIPLOT macro for most arguments. The input arguments are as follows:

  • X: The numerical data matrix
  • ID: A character vector of values used to label rows of X. If you pass in an empty matrix, observation numbers are used to label the markers. This argument is ignored if labelPoints=0.
  • varNames: A character vector that contains the names of the columns of X.
  • FacType: The type of biplot: 'GH', 'COV', 'JK', or 'SYM'.
  • StdMethod: How the original variables are scaled: 'None', 'Mean', or 'Std'.
  • Scale: A numerical scalar that specifies additional scaling applied to vectors. By default, SCALE=1, which means the vectors are not scaled. To shrink the vectors, specify a value less than 1. To lengthen the vectors, specify a value greater than 1. (Note: The %BIPLOT macro uses SCALE=0 as its default.)
  • labelPoints: A binary 0/1 value. If 0 (the default) points are not labeled. If 1, points are labeled by the ID values. (Note: The %BIPLOT macro always labels points.)

The last two arguments are optional. You can specify them as keyword-value pairs outside of the parentheses. The following examples show how you can call the Biplot module in a SAS/IML program to create a biplot:

ods graphics / width=480px height=480px;
proc iml;
/* assumes the modules have been previously stored */
load module=(CalcBiplot WriteBiplot Biplot);
use sashelp.iris;
read all var _NUM_ into X[rowname=Species colname=varNames];
close;
 
title "COV Biplot with Scaled Vectors and Labels";
run Biplot(X, Species, varNames, "COV", "Std") labelPoints=1;   /* label obs */
 
title "JK Biplot: Relationships between Observations";
run Biplot(X, NULL, varNames, "JK", "Std");
 
title "JK Biplot: Automatic Scaling of Vectors";
run Biplot(X, NULL, varNames, "JK", "Std") scale=0;            /* auto scale; empty ID var */
 
title "SYM Biplot: Vectors Scaled by 0.25";
run Biplot(X, NULL, varNames, "SYM", "Std") scale=0.25;        /* scale vectors by 0.25 */
SYM biplot with auto-scaled vectors. Biplot created by using ODS statistical graphics

The program creates four biplots, but only the last one is shown. The last plot uses the SCALE=0.25 option to rescale the vectors of the SYM biplot. You can compare this biplot to the SYM biplot in the previous section, which did not rescale the length of the vectors.

Use the WriteBiplot module in SAS/IML

If you prefer to write an output data set and then create the biplot yourself, use the WriteBiplot module. After loading the modules and the data (see the previous section), you can write the biplot coordinates to the _Scores and _Vectors data sets, as follows. A simple DATA step appends the two data sets into a form that is easy to graph:

run WriteBiplot(X, NULL, varNames, "JK", "Std") scale=0;   /* auto scale vectors */
QUIT;
 
data Biplot;
   set _Scores _Vectors;    /* append the two data sets created by the WriteBiplot module */
run;
 
title "JK Biplot: Automatic Scaling of Vectors";
title2 "FacType=JK; Std=Std";
proc sgplot data=Biplot aspect=1 noautolegend;
   refline 0 / axis=x; refline 0 / axis=y;
   scatter x=Prin1 y=Prin2;
   vector  x=vx    y=vy    / datalabel=_Variable_
                             lineattrs=GraphData2 datalabelattrs=GraphData2;
   xaxis grid offsetmin=0.1 offsetmax=0.1 min=&minAxis max=&maxAxis;
   yaxis grid min=&minAxis max=&maxAxis;
run;
JK biplot, created in SAS by using ODS statistical graphics

In the program that accompanies this article, there is an additional example in which the biplot data is merged with the original data so that you can color-code the observations by using the Species variable.

Summary

This article shows four ways to use modern ODS statistical graphics to create a biplot in SAS. You can create a COV biplot by using the PRINQUAL procedure. If you have a license for SAS/IML and SAS/GRAPH, you can use Friendly's %BIPLOT macro to write the biplot coordinates to a SAS data set, then use PROC SGPLOT to create the biplot. This article also presents SAS/IML modules that compute the same biplots as the %BIPLOT macro. The WriteBiplot module writes the data to two SAS data sets (_Scores and _Vectors), which can be appended and used to plot a biplot. This gives you complete control over the attributes of the biplot. Or, if you prefer, you can use the Biplot module in SAS/IML to automatically create biplots that are similar to Friendly's but are displayed by using ODS statistical graphics.

You can download the complete SAS program that is used in this article. For convenience, I have also created a separate file that defines the SAS/IML modules that create biplots.

The post Create biplots in SAS appeared first on The DO Loop.