rick wicklin

January 15, 2020
 

In my book Simulating Data with SAS, I show how to use a graphical tool, called the moment-ratio diagram, to characterize and compare continuous probability distributions based on their skewness and kurtosis (Wicklin, 2013, Chapter 16). The idea behind the moment-ratio diagram is that skewness and kurtosis are essential for understanding the shapes of many common univariate distributions. The skewness and kurtosis are sometimes called moment ratios because they are ratios of centralized moments of a distribution. A moment-ratio diagram displays the locus of possible (skewness, kurtosis) pairs for many common distributions. This article displays a simple moment-ratio diagram and shows the locations of a few well-known distributions.

Constructing the moment-ratio diagram

When you specify the parameters for a probability distribution, you get a density curve that has a particular shape. (The shape does not depend on the mean or variance, so all distributions in this article are assumed to be centered and scaled so that they have zero mean and unit variance.) The shape is partially characterized by the skewness and kurtosis values, although those two moments do not fully define the shape of the distribution.

For most probability distributions, the skewness and kurtosis depend on the distribution parameters. In general, the following is true:

  • A distribution that has no shape parameters corresponds to a point in the moment-ratio (M-R) diagram.
  • A distribution that has one shape parameter corresponds to a curve in the M-R diagram.
  • A distribution that has two shape parameters corresponds to a region.

However, notice the phrase "shape parameters." A shape parameter changes the shape of the distribution, and not every parameter changes the shape. For example, location and scale parameters do not change the shape. The normal distribution has two parameters (location and scale), but all normal distributions are "bell-shaped." Because all normal curves have zero skewness and zero kurtosis, the normal distribution is represented by the point (0, 0) in the moment-ratio diagram. Similarly, the exponential distribution has a scale parameter, but all standardized exponential distributions have the same shape; the exponential distribution is represented by the point (2, 6) in the (skewness, kurtosis) coordinates.

In contrast, distributions such as the gamma and beta distributions have shape parameters that change the shape of the density function. The gamma distribution has one shape parameter and corresponds to a curve in the moment-ratio diagram. The beta distribution has two shape parameters and corresponds to a region in the moment-ratio diagram.

A basic moment-ratio diagram

This section lists a few common univariate distributions and specifies how the skewness and (excess) kurtosis depend on the parameters for the distributions. A simplified moment-ratio diagram appears to the right and displays points, curves, or regions for each distribution. Some researchers show only the right half of the diagram because many common distributions have positive skewness. By historical convention, the kurtosis axis points down. The letters and colors in the diagram are explained in the following list.

  • Beta distribution: The beta distribution contains two parameters that affect the shape. The skewness and kurtosis range over a region (grey) that is bounded by two quadratic curves.
  • Exponential distribution: Regardless of the value of the scale parameter, the skewness is always 2 and the kurtosis is always 6. Thus, the exponential distribution corresponds to a single point (E) on the M-R diagram.
  • Gamma distribution: For the parameter α, the skewness is 2/sqrt(α) and the kurtosis is 6/α. Thus, the gamma distribution corresponds to a curve (magenta) on the M-R diagram. The centered gamma distribution converges to the normal distribution as α → ∞, which is why the gamma curve approaches the Normal point (N). The gamma distribution is also a limit of the beta distribution, which is why the gamma curve bounds the beta region.
  • Gumbel distribution: The skewness is 1.14 and the kurtosis is 2.4. Thus, the Gumbel distribution corresponds to a single point (G).
  • Lognormal distribution: The skewness is (exp(σ²) + 2) sqrt(exp(σ²) - 1). The kurtosis is exp(4σ²) + 2 exp(3σ²) + 3 exp(2σ²) - 6. Thus, the lognormal distribution corresponds to a curve (green) in the M-R diagram.
  • Normal distribution: Every normal distribution has the same shape. The skewness and kurtosis are both 0 so there is a single point (N). The normal distribution is a limiting form of several distributions, including the beta, gamma, lognormal, and t.
  • t distribution: The moments for the t distribution are defined only for more than four degrees of freedom (ν). For those distributions, the skewness is 0 and the kurtosis is 6/(ν - 4). Thus, the t distribution corresponds to a sequence of points (T5, T6, T7, ...) that converge to the normal distribution as ν → ∞.
  • Johnson SB distribution: The Johnson SB distribution is a flexible distribution that can obtain any skewness and kurtosis combination "above" the lognormal distribution. Notice that this region overlaps with the beta region and the gamma curve. The overlap means that there are parameters for the SB distribution that have the same skewness and kurtosis as certain beta and gamma distributions. This does not imply that every distribution above the lognormal curve is a member of the SB family! It merely means that there is an SB distribution that has the same skewness and kurtosis.
  • Johnson SU distribution: The Johnson SU distribution is a flexible distribution that can obtain any skewness and kurtosis combination "below" the lognormal distribution.
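These closed-form values are easy to check numerically. As an illustration (not part of the original article), the following Python snippet uses scipy.stats, which reports excess kurtosis, to verify several of the points and curves listed above:

```python
import numpy as np
from scipy import stats

# Point distributions: (skewness, excess kurtosis) are fixed
print(stats.norm.stats(moments="sk"))    # normal distribution: skewness 0, kurtosis 0
print(stats.expon.stats(moments="sk"))   # exponential distribution: skewness 2, kurtosis 6

# Gamma: a curve traced out by the shape parameter alpha
for alpha in [1, 4, 16, 100]:
    skew, kurt = stats.gamma.stats(alpha, moments="sk")
    assert np.isclose(skew, 2 / np.sqrt(alpha))   # skewness = 2/sqrt(alpha)
    assert np.isclose(kurt, 6 / alpha)            # kurtosis = 6/alpha

# t distribution: a sequence of points that converge to the normal point (0, 0)
for nu in [5, 6, 7, 30]:
    skew, kurt = stats.t.stats(nu, moments="sk")
    assert np.isclose(kurt, 6 / (nu - 4))         # kurtosis = 6/(nu - 4)
```

As the assertions show, the gamma curve and the t-distribution points match the formulas in the list above.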

A more complex moment-ratio diagram

The moment-ratio diagram was known to early researchers such as Pearson in the 1800s. In 2010, E. Vargo, R. Pasupathy, and L. Leemis published a tour-de-force moment-ratio diagram that included dozens of probability distributions on one diagram ("Moment-Ratio Diagrams for Univariate Distributions", JQT, 2010). Read their paper to learn more about moment-ratio diagrams. Leemis provides a large color version of the moment-ratio diagram on his website and posted it in a 2019 article in Significance magazine. A low-resolution version is shown below. Click the image to see the original version.

Low-resolution version of moment-ratio diagram in Vargo, Pasupathy, and Leemis (JQT, 2010).

What is the moment-ratio diagram used for?

The moment-ratio diagram is a useful theoretical tool for understanding the relationships between continuous univariate distributions. But how is the moment-ratio diagram useful to a data analyst? As I show in my book, you can use the moment-ratio diagram as a tool for modeling univariate distributions. Given a set of data, you can compute the sample skewness and kurtosis. By plotting the sample moments on the moment-ratio diagram, you can select candidate distributions to model the data. You can also use the moment-ratio diagram to organize simulation studies for which you want to test an algorithm or statistical method on data that have a wide range of skewness and kurtosis values.
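As a sketch of that workflow, the following Python snippet (an illustration in place of the SAS program from the book; the simulated data are hypothetical) computes the sample moment ratios and finds the closest point on the theoretical gamma curve:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(4.0, size=2000)            # stand-in for the observed data

# Sample skewness and excess kurtosis (bias-corrected)
skew = stats.skew(x, bias=False)
kurt = stats.kurtosis(x, bias=False)

# Distance from the sample point to the gamma curve (2/sqrt(a), 6/a):
# a small distance suggests the gamma family as a candidate model
alphas = np.linspace(0.5, 50, 500)
dist = np.hypot(skew - 2 / np.sqrt(alphas), kurt - 6 / alphas)
print(f"sample point = ({skew:.2f}, {kurt:.2f}), "
      f"closest gamma alpha = {alphas[dist.argmin()]:.1f}")
```

In the same way, you could measure the distance from the sample point to other curves or regions on the diagram and rank the candidate families.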

An appendix to my book provides a SAS program that generates a moment-ratio diagram and overlays one or more data values. I used a small modification of that program to create the basic moment-ratio diagram in this article.

The post The moment-ratio diagram appeared first on The DO Loop.

January 13, 2020
 

Did you add "learn something new" to your list of New Year's resolutions? Last week, I wrote about the most popular articles from The DO Loop in 2019. The most popular articles are about elementary topics in SAS programming or univariate statistics because those topics have broad appeal.

Advanced topics and multivariate statistics are less popular but no less important. If you want to learn something new, check out this "Editor's Choice" list of articles that will broaden your statistical knowledge and enhance your SAS programming skills. I've grouped the articles into three broad categories.

Regression statistics

Schematic diagram of outliers in bivariate normal data. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

High-dimensional analyses and visualization:

Low-dimensional data visualization

The tips and techniques in these articles are useful, so read a few articles today and teach yourself something new in this New Year!

Do you have a favorite article from 2019 that did not make either list? Share it in a comment!

The post 10 posts from 2019 that deserve a second look appeared first on The DO Loop.

January 8, 2020
 

Many SAS procedures can automatically create a graph that overlays multiple prediction curves and their prediction limits. This graph (sometimes called a "fit plot" or a "sliced fit plot") is useful when you want to visualize a model in which a continuous response variable depends on one continuous explanatory variable and one categorical (classification) variable. You can use the EFFECTPLOT statement in PROC PLM to create similar visualizations of other kinds of regression models. This article shows three ways to create a sliced fit plot: direct from the procedure, by using the EFFECTPLOT statement in PROC PLM, and by writing the predictions to a data set and using PROC SGPLOT to graph the results.

The data for this example are from the PROC LOGISTIC documentation. The response variable is the presence or absence of pain in seniors who were split into two treatment groups (A and B) and a placebo group (P).

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;

Automatically create a sliced fit plot

Many SAS regression procedures support the PLOTS= option on the PROC statement. For PROC LOGISTIC, the option that creates a sliced fit plot is the PLOTS=EFFECTPLOT option, and you can add prediction limits to the graph by using the CLBAND suboption, as follows:

proc logistic data=Neuralgia alpha=0.2 plots(only)=effectplot(clband);
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
run;

That was easy! The procedure automatically creates a title, legend, axis labels, and so forth. By using the PLOTS= option, you get a very nice plot that shows the predictions and prediction limits for the model.

Create a sliced fit plot by using PROC PLM

One of the nice things about the STORE statement in SAS regression procedures is that it enables you to create graphs and perform other post-fitting analyses without rerunning the procedure. Maybe you intend to examine many models before deciding on the best model. You can run goodness-of-fit statistics for the models and then use PROC PLM to create a sliced fit plot for only the final model. To do this, use the STORE statement in the regression procedure and then "restore" the model in PROC PLM, which can perform several post-fitting analyses, including creating a sliced fit plot, as follows:

proc logistic data=Neuralgia alpha=0.2 noprint;
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
   store PainModel / label='Neuralgia Study';
run;
 
proc plm restore=PainModel noinfo;
   effectplot slicefit(x=Age sliceby=Treatment) / clm;
run;

The sliced fit plot is identical to the one that is produced by PROC LOGISTIC and is not shown.

Create a sliced fit plot manually

For many situations, the statistical graphics that are automatically produced are adequate. However, at times you might want to customize the graph by changing the title, the placement of the legend, the colors, and so forth. Sometimes companies mandate color-schemes and fonts that every report must use. For this purpose, SAS supports ODS styles and templates, which you can use to permanently change the output of SAS procedures. However, in many situations, you just want to make a small one-time modification. In that situation, it is usually simplest to write the predictions to a SAS data set and then use PROC SGPLOT to create the graph.

It is not hard to create a sliced fit plot. For these data, you can perform three steps:

  1. Write the predicted values and upper/lower prediction limits to a SAS data set.
  2. Sort the data by the classification variable and by the continuous variable.
  3. Use the BAND statement with the TRANSPARENCY= option to plot the confidence bands. Use the SERIES statement to plot the predicted values.

You can use the full power of PROC SGPLOT to modify the plot. For example, the following statements label the curves, move the legend, and change the title and Y axis label:

proc logistic data=Neuralgia alpha=0.2 noprint;
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
   /* 1. Use a procedure or DATA step to write Pred, Lower, and Upper limits */
   output out=LogiOut pred=Pred lower=Lower upper=Upper;
run;
 
/* 2. Be sure to SORT! */
proc sort data=LogiOut;
   by Treatment Age;
run;
 
/* 3. Use a BAND statement. If more than one band, use transparency */
title "Predicted Probabilities with 80% Confidence Limits";
title2 "Three Treatment Groups";
proc sgplot data=LogiOut;
   band x=Age lower=Lower upper=Upper / group=Treatment transparency=0.75 name="L";
   series x=Age y=Pred / group=Treatment curvelabel;
   xaxis grid;
   yaxis grid label="Predicted Probability of Pain" max=1;
   keylegend "L" / location=inside position=NW title="Treatment" across=1 opaque;
run;

The graph is shown at the top of this article. It has a customized title, label, and legend. In addition, the curves on this graph differ from the curves on the previous graph. The OUTPUT statement evaluates the model at the observed values of Age and Treatment. Notice that the A and B treatment groups do not have any patients that are over the age of 80; therefore, those prediction curves do not extend to the right-hand side of the graph.

In summary, there are three ways to visualize predictions and confidence bands for a regression model in SAS. This example used PROC LOGISTIC, but many other regression procedures support similar options. In most SAS/STAT procedures, you can use the PLOTS= option to obtain a fit plot or a sliced fit plot. More than a dozen procedures support the STORE statement, which enables you to use PROC PLM to create the visualization. Lastly, all regression procedures support some way to output predicted values to a SAS data set. You can sort the data, then use the BAND statement (with transparency) and the SERIES statement to create the sliced fit plot.

The post 3 ways to add confidence limits to regression curves in SAS appeared first on The DO Loop.

January 6, 2020
 

Last year, I wrote more than 100 posts for The DO Loop blog. The most popular articles were about SAS programming tips for data analysis, statistical analysis, and data visualization. Here are the most popular articles from 2019 in each category.

SAS programming tips

The Essential Guide to Binning

  • Create training, testing, and validation data sets: This post shows how to create training, validation, and test data sets in SAS. This technique is popular in data science and machine learning because you typically want to fit the model on one set of data and then evaluate the goodness of fit by using a different set of data.
  • 5 reasons to use PROC FORMAT to recode variables in SAS: Often SAS programmers use PROC SQL or the SAS DATA step to create new data variables to recode raw data. This is not always necessary. It is more efficient to use PROC FORMAT to recode the raw data. Learn five reasons why you should use PROC FORMAT to recode variables.
  • Conditionally append observations to a data set: In preparing data for graphing, you might want to add additional data to the end of a data set. (For example, to plot reference lines or text.) You can do this by using one DATA step to create the new observations and another DATA step to merge the new observations to the end of the original data. However, you can also use the END= option and end-of-file processing to append new observations to data. Read about how to use the END= option to append data. The comments at the end of the article show how to perform similar computations by using a hash table, PROC SQL, and a DOW loop.
  • Use PROC HPBIN to bin numerical variables: In machine learning, it is common to bin numerical variables into a set of discrete values. You can use PROC HPBIN to bin multiple variables in a single pass through the data. PROC HPBIN can bin data into equal-length bins (called bucket binning) or by using quantiles of the data. This article and the PROC FORMAT article are both referenced in my Essential Guide to Binning in SAS.

Statistical analyses

Geometric Meaning of Kolmogorov's D

Data visualization

Visualize an Interaction Effect

I always enjoy learning new programming methods, new statistical ideas, and new data visualization techniques. If you like to learn new things, too, read (or re-read!) these popular articles from 2019. Then share this page with a friend. I hope we both have many opportunities to learn and share together in the new year.

The post Top posts from <em>The DO Loop</em> in 2019 appeared first on The DO Loop.

December 18, 2019
 
Conditional bin plot in SAS

A 2-D "bin plot" counts the number of observations in each cell in a regular 2-D grid. The 2-D bin plot is essentially a 2-D version of a histogram: it provides an estimate for the density of a 2-D distribution. As I discuss in the article, "The essential guide to binning in SAS," sometimes you are more interested in displaying bins that contain (approximately) an equal number of observations. This leads to a quantile bin plot.

A SAS programmer on the SAS Support Communities recently requested a variation of the quantile bin plot. In the variation, the bins for the Y variable are computed as usual, but the bins for the X variable are conditional on the Y quantiles. This leads to a situation where the bins do not form a regular grid, but rather vary in both height and width, as shown to the right. I call this a conditional quantile bin plot. In the graph, there are 12 rectangles. Each contains approximately the same number of observations. The three rectangles across each row contain a quartile of Y values.

This article shows how to construct a conditional quantile bin plot in SAS by first finding the quantiles for the Y variable and then finding the quantiles for the X variable (conditioned on the Y) for each Y quantile. You can solve this problem by using the SAS/IML language (as shown in the Support Community post) or by using PROC RANK and the DATA step, as shown in this article.
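For comparison, the same two-stage binning can be sketched outside of SAS. The following Python snippet (illustrative only; it uses pandas rather than PROC RANK, and the simulated data are hypothetical) bins Y into four unconditional quantile groups and then bins X into three groups within each Y group:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=5000),
                   "y": rng.normal(size=5000)})

# 1. Bin y into 4 unconditional quantile groups (like PROC RANK with GROUPS=4)
df["ybin"] = pd.qcut(df["y"], q=4, labels=False)

# 2. Within each y group, bin x into 3 conditional quantile groups
df["xbin"] = df.groupby("ybin")["x"].transform(
    lambda s: pd.qcut(s, q=3, labels=False))

# Each of the 12 cells contains roughly N/12 ~ 417 observations
counts = df.groupby(["ybin", "xbin"]).size()
print(counts)
```

Because the simulated data have no tied values, the 12 cell counts are nearly equal; with ties, some cells can be noticeably larger, as the article discusses below.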

Although it is straightforward to compute conditional quantiles, it is less clear how to display the rectangles that represent the cells in the graph. On the Support Community, one solution uses the VECTOR statement in PROC SGPLOT, whereas another solution uses the POLYLINE command in an annotate data set. In this article, I present a third alternative, which is to use a POLYGON statement in PROC SGPLOT. I think that the POLYGON statement is the most powerful option because you can easily display the rectangles as filled, unfilled, in the foreground, in the background, and you can color the polygons in many informative ways.

Compute the unconditional and conditional quantiles

To compute quantiles, you can use the QNTL function in PROC IML or you can call PROC RANK and use the GROUPS= option. The following statements use PROC RANK. ("KSharp" used this method for his answer on the Support Community.) One noteworthy idea in the following program is to use the CATX or CATT function to form a grouping variable that combines the group names for the X and Y variables.

%let XVar = MomWtGain; /* name of X variable */
%let YVar = Weight;    /* name of Y variable */
%let NumXDiv = 3;      /* number of subdivisions (quantiles) for the X variable */
%let NumYDiv = 4;      /* number of subdivisions (quantiles) for the Y variable */
 
/* create the example data */
data Have;
set sashelp.bweight(obs=5000 keep=&XVar &YVar);
if ^cmiss(&XVar, &YVar);   /* only use complete cases */
run;
 
/* group the Y values into quantiles */
proc rank data=Have out=RankY groups=&NumYDiv ties=high;
   var &YVar;
   ranks rank_&YVar;
run;
proc sort data=RankY;  by rank_&YVar;  run;
 
/* Conditioned on each Y quantile, group the X values into quantiles */
proc rank data=RankY out=RankBoth groups=&NumXDiv ties=high;
   by rank_&YVar;
   var &XVar;
   ranks rank_&XVar;
run;
proc sort data=RankBoth;  by rank_&YVar rank_&XVar;  run;
 
/* Trick: Use the CATT function to create a group for each unique (RankY, RankX) combination */
data Points;
set RankBoth;
Group = catt(rank_&YVar, rank_&XVar);
run;
 
proc sgplot data=Points;
   scatter x=&XVar y=&YVar / group=Group markerattrs=(symbol=CircleFilled) transparency=0.75;
   /* These reference lines are hard-coded for these data. They are the unconditional Y quantiles. */
   refline 3061 3402 3716 / axis=Y;
run;

The SCATTER statement uses the GROUP= option to color each observation according to the quantile bin to which it belongs. The horizontal reference lines are the 25th, 50th, and 75th percentiles of the Y variable. These lines separate the Y variable into four groups. Within each group, the X variable is divided into three groups. The remainder of the program shows how to construct and plot the 12 rectangles that bound the groups.

Create polygons for the quantile bins

You can use PROC MEANS and a BY statement to find the minimum and maximum values of the X and Y variables for each cell. As a bonus, PROC MEANS also tells you how many observations are in each cell. If the data values are distinct, each cell will have approximately N / 12 observations, where N=5000 is the total number of observations in the data set. If the data have duplicate values, remember that some quantile groups might have more observations than others.

/* Create polygons from ranks. Each polygon is a rectangle of the form
   [min(X), max(X)] x [min(Y), max(Y)] 
*/
/* generate min/max for each group. Results are in wide form. */
proc means data=Points noprint;
   by Group;
   var &XVar &YVar;
   output out=Limits(drop=_Type_) min= max= / autoname;  
run;
 
proc print data=Limits noobs;
run;
Coordinates of the rectangular bins in a conditional quantile bin plot

For these data, there are between 362 and 500 observations in each cell. If there were no tied values, you would expect 5000/12 ≈ 417 observations in each cell.

The Limits data set contains all the information that you need to construct the cells of the conditional quantile bin plot. PROC MEANS writes the ranges of the variables in a wide format. In order to use the POLYGON statement in PROC SGPLOT, you need to convert the data to long form and use the GROUP variable as an ID variable for the polygon plot:

/* generate polygons from min/max of groups. Write data in long form. */
data Rectangles;
set Limits;
xx = &XVar._min; yy = &YVar._min; output;
xx = &XVar._max; yy = &YVar._min; output;
xx = &XVar._max; yy = &YVar._max; output;
xx = &XVar._min; yy = &YVar._max; output;
xx = &XVar._min; yy = &YVar._min; output;   /* repeat the first obs to close the polygon */
drop &XVar._min &XVar._max &YVar._min &YVar._max;
run;
 
/* combine the data and the polygons. Plot the results */
data All;
set Points Rectangles;
label _FREQ_ = "Frequency";
run;

Display the conditional quantile bins

The advantage of creating polygons is that you can experiment with many ways to visualize the conditional quantile plot. For example, the simplest quantile plot merely overlays rectangles on a scatter plot of the bivariate data, as follows:

title "Conditional Quantile Bin Plot";
proc sgplot data=All noautolegend;
   scatter x=&XVar y=&YVar / markerattrs=(symbol=CircleFilled) transparency=0.9;
   polygon x=xx y=yy ID=Group;  /* plot in the foreground, no fill */
run;

On the other hand, you can also display the rectangles in the background and use the GROUP= option to display the rectangles in different colors.

proc sgplot data=All noautolegend;
   polygon x=xx y=yy ID=Group / fill outline group=Group; /* background, filled */
   scatter x=&XVar y=&YVar / markerattrs=(symbol=CircleFilled) transparency=0.9;
run;

The graph that results from these statements is shown at the top of this article.

To give a third example, you can plot the rectangles in the background and use the COLORRESPONSE= option to color the rectangles according to the number of observations inside each bin. This is shown in the following graph, which uses the default three-color color ramp to assign colors to rectangles.

proc sgplot data=All;
   polygon x=xx y=yy ID=Group / fill outline colorresponse=_FREQ_; /* background, filled */
   scatter x=&XVar y=&YVar / markerattrs=(symbol=CircleFilled) transparency=0.9;
   gradlegend;
run;

Summary

In summary, this article shows how to construct a 2-D quantile bin plot where the quantiles in the horizontal direction are conditional on the quantiles in the vertical direction. In such a plot, the cells are irregularly spaced but have approximately the same number of observations in each cell. You can use polygons to create the rectangles. The polygons give you a lot of flexibility about how to display the conditional quantile plot.

You can download the complete SAS program that creates the conditional quantile bin plot.

The post Create a conditional quantile bin plot in SAS appeared first on The DO Loop.

December 16, 2019
 
Rockin' around the Christmas tree
At the Christmas party hop.
     – Brenda Lee
Christmas tree image with added noise

Last Christmas, I saw a fun blog post that used optimization methods to de-noise an image of a Christmas tree. Although there are specialized algorithms that remove random noise from an image, I am not going to use a specialized de-noising algorithm. Instead, I will use the SVD algorithm, which is a general-purpose matrix algorithm that has many applications. One application is to construct a low-rank approximation of a matrix. This article applies the SVD algorithm to a noisy image of a Christmas tree. The result is recognizable as a Christmas tree, but much of the noise has been eliminated.

A noisy Christmas tree image

The image to the right is a heat map of a binary (0/1) matrix that has 101 rows and 100 columns. First, I put the value 1 in certain cells so that the heat map resembles a Christmas tree. About 41% of the cells have the value 1. We will call these cells "the tree".

To add random noise to the image, I randomly switched 10% of the cells. The result is an image that is recognizable as a Christmas tree but is very noisy.

Apply the SVD to a matrix

As mentioned earlier, I will attempt to de-noise the matrix (image) by using the general-purpose SVD algorithm, which is available in SAS/IML software. If M is the 101 x 100 binary matrix, then the following SAS/IML statements factor the matrix by using a singular-value decomposition. A series plot displays the first 20 singular values (ordered by magnitude) of the noisy Christmas tree matrix:

call svd(U, D, V, M);                             /* M = U*diag(D)*V` */
call series(1:20, D[1:20]) grid={x y} xvalues=1:20 label={"Component" "Singular Value"};

The graph (which is similar to a scree plot in principal component analysis) indicates that the first four singular values are somewhat larger than the remainder. The following statements approximate M by using a rank-4 matrix. The rank-4 matrix is not a binary matrix, but you can visualize the rank-4 approximation matrix by using a continuous heat map. For convenience, the matrix elements are shifted and scaled into the interval [0,1].

keep = 4;        /* how many components to keep? */
idx = 1:keep;
Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`;
Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, standardize into [0,1] */
s = "Rank " + strip(char(keep)) + " Approximation of Noisy Christmas Tree Image";
call heatmapcont(Ak) colorramp=ramp displayoutlines=0 showlegend=0 title=s range={0, 1};

Apply a threshold cutoff to classify pixels

The continuous heat map shows a dark image of the Christmas tree surrounded by a light-green "fog". The dark-green pixels represent cells that have values near 1. The light-green pixels have smaller values. You can use a histogram to show the distribution of values in the rank-4 matrix:

ods graphics / width=400px height=300px;
call histogram(Ak);

You can think of the cell values as being a "score" that predicts whether each cell belongs to the Christmas tree (green) or not (white). The histogram indicates that the values in the matrix appear to be a mixture of two distributions. The high values in the histogram correspond to dark-green pixels (cells); the low values correspond to light-green pixels. To remove the light-green "fog" in the rank-4 image, you can use a "threshold value" to convert the continuous values to binary (0/1) values. Essentially, this operation predicts which cells are part of the tree and which are not.

For every choice of the threshold parameter, there will be correct and incorrect predictions:

  • A "true positive" is a cell that belongs to the tree and is colored green.
  • A "false positive" is a cell that does not belong to the tree but is colored green.
  • A "true negative" is a cell that does not belong to the tree and is colored white.
  • A "false negative" is a cell that belongs to the tree but is colored white.

By looking at the histogram, you might guess that a threshold value of 0.5 will do a good job of separating the low and high values. The following statements use 0.5 to convert the rank-4 matrix into a binary matrix, which you can think of as the predicted values of the de-noising process:

t = 0.5;  /* threshold parameter */
s = "Discrete Filter: Threshold = " + strip(char(t,4,2)) ;
call HeatmapDisc(Ak > t) colorramp=ramp displayoutlines=0 showlegend=0 title=s;

Considering that 10% of the original image was corrupted, the heat map of the reconstructed matrix is impressive. It looks like a Christmas tree, but has much less noise. The reconstructed matrix agrees with the original (pre-noise) matrix for 98% of the cells. In addition:

  • There are only a few false positives (green dots) that are far from the tree.
  • There are some false negatives in the center of the tree, but many fewer than were in the original noisy image.
  • The locations where the image is the most ragged are along the edge and trunk of the Christmas tree. In that region, the de-noising could not tell the difference between tree and non-tree.

The effect of the threshold parameter

You might wonder what the reconstruction would look like for different choices of the cutoff threshold. The following statements create two other heat maps, one for the threshold value 0.4 and the other for the threshold value 0.6. As you might have guessed, smaller threshold values result in more false positives (green pixels away from the tree), whereas higher threshold values result in more false negatives (white pixels inside the tree).

do t = 0.4 to 0.6 by 0.2;
   s = "Discrete Filter: Threshold = " + strip(char(t,4,2)) ;
   call HeatmapDisc(Ak > t) colorramp=ramp displayoutlines=0 showlegend=0 title=s;
end;

Summary

Although I have presented this experiment in terms of an image of a Christmas tree, it is really a result about matrices and low-rank approximations. I started with a binary 0/1 matrix, and I added noise to 10% of the cells. The SVD factorization enables you to approximate the matrix by using a rank-4 approximation. By using a cutoff threshold, you can recover an approximation of the original matrix before the random noise was added. Although the SVD algorithm is a general-purpose algorithm that was not designed for de-noising images, it does a good job of eliminating the noise and estimating the original matrix.
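For readers who want to see the whole pipeline in one place, here is a sketch of the experiment in Python/NumPy rather than SAS/IML. The toy pattern, noise level, and random seed are my own choices for illustration, not the article's Christmas-tree matrix:

```python
import numpy as np

# Build a small low-rank binary "image", flip about 10% of its cells,
# then reconstruct it from a rank-4 SVD approximation plus a 0.5 threshold.
rng = np.random.default_rng(1)
A0 = (np.add.outer(np.arange(20), np.arange(20)) % 4 == 0).astype(float)  # toy pattern
noise = rng.random(A0.shape) < 0.10            # corrupt roughly 10% of cells
A = np.abs(A0 - noise)                          # flip the corrupted cells

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 4
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]             # rank-4 approximation
B = (Ak > 0.5).astype(float)                    # discrete filter: threshold at 0.5

agreement = np.mean(B == A0)                    # proportion of cells recovered
print(f"agreement with original: {agreement:.2f}")
```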

If you are interested in exploring these ideas for yourself, you can download the complete SAS program that creates the graphs in this article.

The post Math-ing around the Christmas tree: Can the SVD de-noise an image? appeared first on The DO Loop.

December 11, 2019
 

Binary matrices are used for many purposes. I have previously written about how to use binary matrices to visualize missing values in a data matrix. They are also used to indicate the co-occurrence of two events. In ecology, binary matrices are used to indicate which species of animal are present in which ecological site. For example, if you remember your study of Darwin's finches in high school biology class, you might recall that certain finches (species) are present or absent on various Galapagos islands (sites). You can use a binary matrix to indicate which finches are present on which islands.

Recently I was involved in a project that required performing a permutation test on rows of a binary matrix. As part of the project, I had to solve three smaller problems involving rows of a binary matrix:

  1. Given two rows, find the locations at which the rows differ.
  2. Given two binary matrices, determine how similar the matrices are by computing the proportion of elements that are the same.
  3. Given two rows, swap some of the elements that differ.

This article shows how to solve each problem by using the SAS/IML matrix language. A future article will discuss permutation tests for binary matrices. For clarity, I introduce the following macro that uses a temporary variable to swap two sets of values:

/* swap values of A and B. You can use this macro in the DATA step or in the SAS/IML language */
%macro SWAP(A, B, TMP=_tmp);
   &TMP = &A; &A = &B; &B = &TMP;
%mend;

Where do binary matrices differ?

The SAS/IML matrix language enables you to treat matrices as high-level objects. You often can answer questions about matrices without writing loops that iterate over the elements of the matrices. For example, if you have two matrices of the same dimensions, you can determine the cells at which the matrices are unequal by using the "not equal" (^=) operator. The following SAS/IML statements define two 2 x 10 matrices and use the "not equal" operator to find the cells that are different:

proc iml;
A = {0 0 1 0 0 1 0 1 0 0 ,
     1 0 0 0 0 0 1 1 0 1 };
B = {0 0 1 0 0 0 0 1 0 1 ,
     1 0 0 0 0 1 1 1 0 0 };
colLabel = "Sp1":("Sp"+strip(char(ncol(A))));
rowLabel = "A1":("A"+strip(char(nrow(A))));
 
/* 1. Find locations where binary matrices differ */
Diff = (A^=B);
print Diff[c=colLabel r=rowLabel];

The matrices A and B are similar. They both have three 1s in the first row and four 1s in the second row. However, the output shows that the matrices differ in four elements, which are located in the sixth and tenth columns. Although I used entire matrices for this example, the same operations work on row vectors.

The proportion of elements in common

You can use various statistics to measure the similarity between the A and B matrices. A simple statistic is the proportion of elements that are in common. These matrices have 20 elements, and 16/20 = 0.8 of them are common to both matrices. You can compute the proportion in common by using the expression (A=B)[:], or you can use the following statements if you have previously computed the Diff matrix:

/* 2. the proportion of elements in B that are the same as in A */
propDiff = 1 - Diff[:];
print propDiff;

As a reminder, the mean subscript reduction operator ([:]) computes the mean value of the elements of the matrix. For a binary matrix, the mean value is the proportion of ones.
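For illustration, here is the same computation in Python (the article uses SAS/IML). The matrices are the A and B defined earlier; averaging the 0/1 agreement indicators gives the proportion in common:

```python
# Proportion of elements common to two binary matrices. In SAS/IML this is
# the expression (A=B)[:] or, equivalently, 1 - Diff[:].
A = [[0,0,1,0,0,1,0,1,0,0],
     [1,0,0,0,0,0,1,1,0,1]]
B = [[0,0,1,0,0,0,0,1,0,1],
     [1,0,0,0,0,1,1,1,0,0]]

cells = [(a, b) for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
prop_same = sum(a == b for a, b in cells) / len(cells)
print(prop_same)  # -> 0.8 (16 of the 20 elements agree)
```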

Swap elements of rows

The first two tasks were easy. A more complicated task is swapping values that differ between rows. The swapping operation is not difficult, but it requires finding the k locations where the rows differ and then swapping all or some of those values. In a permutation test, the number of elements that you swap is a random integer between 1 and k, but for simplicity, this example uses the SWAP macro to swap two cells that differ. For clarity, the following example uses temporary variables such as x1, x2, d1, and d2 to swap elements in the matrix A:

/* specify the rows whose value you want to swap */
i1 = 1;                         /* index of first row to compare and swap */
i2 = 2;                         /* index of second row to compare and swap */
 
/* For clarity, use temp vars x1 & x2 instead of A[i1, ] and A[i2, ] */
x1 = A[ i1, ];                  /* get one row */
x2 = A[ i2, ];                  /* get another row */
idx = loc( x1^=x2 );            /* find the locations where rows differ */
if ncol(idx) > 0 then do;       /* do the rows differ? */
   d1 = x1[ ,idx];              /* values at the locations that differ */
   d2 = x2[ ,idx];
   print (d1//d2)[r={'r1' 'r2'} L='Original'];
 
   /* For a permutation test, choose a random number of locations to swap */
   /* numSwaps = randfun(1, "Integer", 1, ncol(idx));
      idxSwap = sample(1:ncol(idx), numSwaps, "NoReplace");  */
   idxSwap = {2 4};              /* for now, hard-code locations to swap */
   %SWAP(d1[idxSwap], d2[idxSwap]);
   print (d1//d2)[r={'d1' 'd2'} L='New'];
   /* update matrix */
   A[ i1, idx] = d1;
   A[ i2, idx] = d2;
end;
print A[c=colLabel r=rowLabel];

The vectors x1 and x2 are the rows of A to compare. The vectors d1 and d2 are the subvectors that contain only the elements of x1 and x2 that differ. The example swaps the second and fourth columns of d1 and d2. The new values are then inserted back into the matrix A. You can compare the final matrix to the original to see that the process swapped two elements in each of two rows.
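The same row-swapping logic can be sketched in Python for illustration. The rows are the two rows of A from earlier, and the swap locations are hard-coded, as in the SAS/IML example:

```python
# Swap selected differing elements between two binary rows:
# find the locations where the rows differ, then swap a subset of them.
x1 = [0,0,1,0,0,1,0,1,0,0]
x2 = [1,0,0,0,0,0,1,1,0,1]

idx = [j in range(0) or j for j in range(len(x1)) if x1[j] != x2[j]]  # differing locations
swap_at = idx[:2]                    # here: the first two, not a random subset
for j in swap_at:
    x1[j], x2[j] = x2[j], x1[j]      # swap in place

print(idx)      # -> [0, 2, 5, 6, 9]
print(x1, x2)
```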

Although the examples are for binary matrices, these techniques work for any numerical matrices.

The post Swap elements in binary matrices appeared first on The DO Loop.

December 9, 2019
 

Recently I showed how to visualize and analyze longitudinal data in which subjects are measured at multiple time points. A very common situation is that the data are collected at two time points. For example, in medicine it is very common to measure some quantity (blood pressure, cholesterol, white-blood cell count,...) for each patient before a treatment and then measure the same quantity after an intervention.

This situation leads to pairs of observations. It is useful to visualize the pre/post measurements in a spaghetti plot, a scatter plot, or a histogram of the differences. This article points out that you do not need to create these plots manually: the TTEST procedure can produce all these plots automatically, as well as provide a statistical analysis of paired data.

Visualize pre/post data

When given measurements before and after an intervention (sometimes called "pre/post data"), I like to create a scatter plot of the "before" variable versus the "after" variable. To demonstrate, the following SAS statements define a subset of the data that were analyzed in the previous article. The subset contains 50 children who had high levels of lead in their blood. Lead is toxic, especially in children. The children were given a compound (called succimer) to reduce the blood-lead level and the levels were re-measured one week later.

data PrePost;
   input id lead0 lead1 @@;
   label lead0 = "Baseline Blood Lead Level (mcg/dL)"
         lead1 = "Post-Treatment Blood Lead Level (mcg/dL)";
datalines;
  2   26.5   14.8   3   25.8   23.0  5   20.4    2.8  6   20.4    5.4
 12   24.8   23.1  14   27.9    6.3 19   35.3   25.5 20   28.6   15.8
 22   29.6   15.8  23   21.5    6.5 25   21.8   12.0 26   23.0    4.2
 27   22.2   11.5  29   25.0    3.9 31   26.0   21.4 32   19.7   13.2
 36   29.6   17.5  39   24.4   16.4 40   33.7   14.9 43   26.7    6.4
 44   26.8   20.4  45   20.2   10.6 48   20.2   17.5 49   24.5   10.0
 53   27.1   14.9  54   34.7   39.0 57   24.5    5.1 64   27.7    4.0
 65   24.3   24.3  66   36.6   23.3 68   34.0   10.7 69   32.6   19.0
 70   29.2    9.2  71   26.4   15.3 72   21.8   10.6 79   21.1    5.6
 82   22.1   21.0  85   28.9   12.5 87   19.8   11.6 89   23.5    7.9
 90   29.1   16.8  91   30.3    3.5 93   30.6   28.2 94   22.4    7.1
 95   31.2   10.8  96   31.4    3.9 97   41.1   15.1 98   29.4   22.1
 99   21.9    7.6 100   20.7    8.1
;
 
title "Pre/Post Measurements";
title2 "Created Manually";
proc sgplot data=PrePost aspect=1 noautolegend;
   scatter x=Lead0 y=Lead1 / jitter;
   lineparm x=0 y=0 slope=1 / clip;
   xaxis grid;
   yaxis grid;
run;

The graph shows the initial measurement (on the X axis) versus the post-treatment measurement (on the Y axis) for each patient in the study. The diagonal line is the identity line. A marker above the line indicates a patient whose measurement increased after treatment. A marker below the line indicates a patient whose measurement decreased. For these data, most children experienced a decrease in blood-lead levels, which indicates that the treatment was effective.

Use PROC TTEST to visualize pre/post data

The goal of ODS statistical graphics in SAS procedures is to provide an analyst with standard visualizations of data that can accompany and enhance a statistical analysis. For pre/post data, a common analysis is to perform a paired t test for the mean difference between the pre- and post-intervention measurements. It turns out that if you run a PROC TTEST analysis, the output not only includes statistical tables but also three common graphs that visualize the data. If the data are in "wide form," the PROC TTEST syntax is simple:

ods graphics on;
proc ttest data=PrePost;
   paired Lead0*Lead1;
run;

Think about this syntax: When you type three PROC TTEST statements, you get a statistical analysis AND three standard visualizations! The graphs appear automatically when you turn on ODS graphics, without requiring any effort on your part.

The first graph from PROC TTEST is a histogram of the difference (pre-treatment minus post-treatment). The graph is actually a panel, with a small horizontal box plot underneath that gives another view of the distribution of the differences. You can see that the differences range from -5 to 28. The median change appears to be 13 and the interquartile range is from 8 to 18. If the differences are approximately normally distributed, the light green rectangle indicates a 95% confidence interval for the mean.
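As a check on intuition, the core paired-t computation is simple enough to sketch by hand. The following Python illustration uses only the first eight (lead0, lead1) pairs from the data set; PROC TTEST remains the right tool in SAS:

```python
import math
import statistics as st

# The core paired-t computation that PROC TTEST performs, on the
# first eight (lead0, lead1) pairs from the PrePost data set.
pairs = [(26.5, 14.8), (25.8, 23.0), (20.4, 2.8), (20.4, 5.4),
         (24.8, 23.1), (27.9, 6.3), (35.3, 25.5), (28.6, 15.8)]
diffs = [pre - post for pre, post in pairs]   # pre-treatment minus post-treatment

n = len(diffs)
dbar = st.mean(diffs)                # mean difference
sd = st.stdev(diffs)                 # sample std dev of the differences
t = dbar / (sd / math.sqrt(n))       # compare to a t quantile with n-1 df
print(f"mean difference = {dbar:.3f}, t = {t:.2f}")
```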

The second graph is a spaghetti plot of the patient profiles. Each line segment represents a patient. The lines connect the pre-treatment values (on the left) to the post-treatment values (on the right). You can see that most (but not all) patients experienced a decrease in blood-lead levels. The red line in the graph indicates the mean pre- and post-treatment levels.

The third graph is a version of the scatter plot that I created earlier by using PROC SGPLOT. Like my version, the graph includes a diagonal reference line. In addition, the graph displays a symbol that shows the pre- and post-treatment mean values of the response variable. The vertical distance between the mean marker and the reference line indicates the average change since the treatment.

PROC TTEST also creates a fourth graph (not shown), which is a normal Q-Q plot of the difference. This graph is useful if you are assessing the normality of the distribution of differences.

Summary

PROC TTEST creates excellent graphs that enable you to visualize pre/post data. With only a few SAS statements, you can produce three common graphs that visualize the change, presumably due to the treatment. Using PROC TTEST is quicker and easier than creating the graphs yourself. Furthermore, one of the graphs (the histogram/box plot panel) is a graph that is not easily created by using PROC SGPLOT. So next time you want to visualize paired data, reach for PROC TTEST.

The post Visualize data before and after a treatment appeared first on The DO Loop.

December 5, 2019
 

This is a second article about analyzing longitudinal data, which features measurements that are repeatedly taken on subjects at several points in time. The previous article discusses a response-profile analysis, which uses an ANOVA method to determine differences between the means of an experimental group and a placebo group. The response-profile analysis has limitations, including the fact that longitudinal data are autocorrelated and so do not satisfy the independence assumption of ANOVA. Furthermore, the method does not enable you to model the response profile of individual subjects; it produces only a mean response-by-time profile.

This article shows how a mixed model can produce a response profile for each subject. A mixed model also addresses other limitations of the response-profile analysis. This blog post is based on the introductory article, "A Primer in Longitudinal Data Analysis", by G. Fitzmaurice and C. Ravichandran (2008), Circulation, 118(19), p. 2005-2010.

The data (from Fitzmaurice and Ravichandran, 2008) are the blood lead levels for 100 inner-city children who were exposed to lead in their homes. Half were in an experimental group and were given a compound called succimer as treatment. The other half were in a placebo group. Blood-lead levels were measured for each child at baseline (week 0), week 1, week 4, and week 6. The data are visualized in the previous article. You can download the SAS program that creates the data and graphs in this article.

Advantages of the mixed model for longitudinal data

The main advantage of a mixed-effect model is that each subject is assumed to have his or her own mean response curve that explains how the response changes over time. The individual curves are a combination of two parts: "fixed effects," which are common to the population and shared by all subjects, and "random effects," which are specific to each subject. The term "mixed" implies that the model incorporates both fixed and random effects.
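A toy simulation may make this decomposition concrete. The following Python sketch (all numbers are hypothetical, not fitted values from these data) builds each subject's curve as a shared fixed-effect line plus a subject-specific random intercept:

```python
import random

# Toy mixed model with random intercepts: every subject shares the fixed
# effects (population line), but each gets its own vertical offset.
random.seed(0)
beta0, beta1 = 25.0, -2.0          # hypothetical fixed effects
times = [0, 1, 4, 6]               # the four measurement occasions

curves = {}
for subj in range(5):
    b_i = random.gauss(0, 3)       # random intercept for this subject
    curves[subj] = [beta0 + b_i + beta1 * t for t in times]

# All subjects share the slope; only the intercept differs.
for subj, ys in curves.items():
    print(subj, [round(y, 1) for y in ys])
```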

You can use a mixed model to do the following:

  1. Model the individual response-by-time curves.
  2. Model autocorrelation in the data. This is not discussed further in this blog post.
  3. Model time as a continuous variable, which is useful for data that include mistimed observations and which enables parametric models of time, such as a quadratic model or a piecewise linear model.

The book Applied Longitudinal Analysis (G. Fitzmaurice, N. Laird, and J. Ware, 2011, 2nd Ed.) discusses almost a dozen ways to model the data for blood-lead level in children. This blog post briefly shows how to implement three models in SAS that incorporate random intercepts. The models are the response-profile model, a quadratic model, and a piecewise linear model.

Visualize mixed models

I've previously written about how to visualize mixed models in SAS. One of the techniques is to create a spaghetti plot that shows the predicted response profiles for each subject in the study. Because we will examine three different models, the following statements define a macro that will sort the predicted values and plot them in a spaghetti plot:

%macro SortAndPlot(DSName);
proc sort data=&DSName;
   by descending Treatment ID Time;
run;
 
proc sgplot data=&DSName dattrmap=Order;
   series x=Time y=Pred / group=ID groupLC=Treatment break lineattrs=(pattern=solid)
                       attrid=Treat;
   legenditem type=line name="P" / label="Placebo (P)" lineattrs=GraphData1; 
   legenditem type=line name="A" / label="Succimer (A)" lineattrs=GraphData2; 
   keylegend "A" "P";
   xaxis values=(0 1 4 6) grid;
   yaxis label="Predicted Blood Lead Level (mcg/dL)";
run;
%mend;

A response-profile model with a random intercept

In the response-profile analysis, the data were analyzed by using PROC GLM, although these data do not satisfy the assumptions of PROC GLM. This article uses PROC MIXED in SAS/STAT software for the analyses. You can use the REPEATED statement in PROC MIXED to specify that the measurements for individuals are autocorrelated. We will use an unstructured covariance matrix for the model (TYPE=UN), but Fitzmaurice, Laird, and Ware (2011) discuss other options.

In the response-profile analysis, the model predicts the mean response for each treatment group. However, the baseline measurements for each subject are all different. For example, some start the trial with a blood-lead level that is higher than the mean; others start lower. To adjust for this variation among subjects, you can use the RANDOM statement in PROC MIXED to specify a random intercept for each subject. Use the OUTPRED= option on the MODEL statement to write the predicted values for each subject to a data set, then plot the results. The following statements repeat the response-profile model of the previous blog post but include an intercept for each subject.

/* Repeat the response-profile analysis, but 
   use the RANDOM statement to add random intercept for each subject */
proc mixed data=TLCView;
   class id Time(ref='0') Treatment(ref='P');
   model y = Treatment Time Treatment*Time / s chisq outpred=MixedOut;
   repeated Time / type=un subject=id r;   /* measurements are repeated for subjects */
   random intercept / subject=id;          /* each subject gets its own intercept */
run;
 
title "Predicted Individual Growth Curves";
title2 "Random Intercept Model";
%SortAndPlot(MixedOut);
Random intercept model for response profiles

In this model, the shape of the response-profile curve is the same for all subjects in each treatment group. However, the curves are shifted up or down to better match the subject's individual profile.

A mixed model with a quadratic response curve

From the shape of the predicted response curve in the previous section, you might conjecture that a quadratic model might fit the data. You can fit a quadratic model in PROC MIXED by treating Time as a continuous variable. However, Time is also used in the REPEATED statement, and that statement requires a discrete CLASS variable. A resolution to this problem is to create a copy of the Time variable (call it T). You can include T in the CLASS and REPEATED statements and use Time in the MODEL statement as a continuous effect. The third analysis will require a linear spline variable, so the DATA step view also creates that variable, called T1.

/* Make "discrete time" (t) to use in REPEATED statement.
   Make spline effect with knot at t=1. */
data TLCView / view=TLCView;
set tlc;
t = Time;                       /* discrete copy of time */
T1 = ifn(Time<1, 0, Time - 1);  /* knot at Time=1 for PWL analysis */
run;
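The spline variable T1 is just a hinge ("broken stick") term with a knot at Time=1. For illustration, here is the same transform in Python:

```python
# The piecewise-linear basis created above: T1 = 0 for Time < 1, else Time - 1.
# A hinge function with a knot at Time = 1.
def hinge(time, knot=1.0):
    return max(0.0, time - knot)

times = [0, 1, 4, 6]                   # the four measurement occasions
print([hinge(t) for t in times])       # -> [0.0, 0.0, 3.0, 5.0]
```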

You can now use Time as a continuous effect and T to specify that measurements are repeated for the subjects.

/* Model time as continuous and use a quadratic model in Time. 
   For more about quadratic growth models, see
   https://support.sas.com/resources/papers/proceedings/proceedings/sugi27/p253-27.pdf */
proc mixed data=TLCView;
   class id t(ref='0') Treatment(ref='P');
   model y = Treatment Time Time*Time Treatment*Time / s outpred=MixedOutQuad;
   repeated t / type=un subject=id r;      /* measurements are repeated for subjects */
   random intercept / subject=id;          /* each subject gets its own intercept */
run;
 
title2 "Random Intercept; Quadratic in Time";
%SortAndPlot(MixedOutQuad);
Quadratic model for response profiles with random intercepts

The graph shows the individual predicted responses for each subject, but the quadratic model does not seem to capture the dramatic drop in the blood-lead level at T=1.

A mixed model with a piecewise linear response curve

The next model uses a piecewise linear model instead of a quadratic model. I've previously written about how to use spline effects in SAS to model data by using piecewise polynomials.

For the four time points, the mean response profile seems to go down for the experimental (succimer) group until T=1, and then increase at approximately a constant rate. Is this behavior caused by an underlying biochemical process? I don't know, but if you believe (based on domain knowledge) that this reflects an underlying process, you can incorporate that belief in a piecewise linear model. The first linear segment models the blood-lead level for T ≤ 1; the other segment models the blood-lead level for T > 1.

/* Piecewise linear (PWL) model with knot at Time=1.
   For more about PWL models, see Hwang (2015) 
   "Hands-on Tutorial for Piecewise Linear Mixed-effects Models Using SAS PROC MIXED"
   https://www.lexjansen.com/pharmasug-cn/2015/ST/PharmaSUG-China-2015-ST08.pdf    */
proc mixed data=TLCView;
   class id t(ref='0') Treatment(ref='P');
   model y = Treatment Time T1 Treatment*Time Treatment*T1 / s outpred=MixedOutPWL;
   repeated t / type=un subject=id r;      /* measurements are repeated for subjects */
   random intercept / subject=id;          /* each subject gets its own intercept */
run;
 
title2 "Random Intercept; Piecewise Linear Model";
%SortAndPlot(MixedOutPWL);
Piecewise linear model for response profiles with random intercepts

This graph shows a model that is piecewise linear. It assumes that the blood-lead level falls at a constant rate during the first week of the treatment, then either falls or rises at a constant rate during the remainder of the study. You could use the slopes of the line segments to report the average rate of change during each time period.

Further reading

There are many papers and books about mixed models in SAS. This article presents data and ideas that are discussed in detail in the book Applied Longitudinal Analysis (2011, 2nd Ed.) by G. Fitzmaurice, N. Laird, and J. Ware. For an informative article about piecewise-linear mixed models, see Hwang (2015), "Hands-on Tutorial for Piecewise Linear Mixed-effects Models Using SAS PROC MIXED." For a comprehensive discussion of mixed models and repeated-measures analysis, I recommend SAS for Mixed Models, either the 2nd edition or the new edition.

Many people have questions about how to model longitudinal data in SAS. Post your questions to the SAS Support Communities, which has a dedicated community for statistical analysis.

The post Longitudinal data: The mixed model appeared first on The DO Loop.