December 15, 2020
 

Editor's note: This blog post is the third in a series of posts, originally published here by our partner News Literacy Project, exploring the role of data in understanding our world. Every day people use data to better understand the world. This helps them make decisions and measure impacts. But how do we take raw [...]

Evaluating media claims: Just because it's 'based on data' doesn't make it true was published on SAS Voices by Jen Sabourin

December 15, 2020
 

Let’s flash back to a simpler year. I don’t want to date myself, so think circa 1990s. I remember sitting with my now husband watching Ken Burns’ documentary Baseball when I was first introduced to Doris Kearns Goodwin. She didn’t just know baseball – it was part of her DNA. She was smart, funny and a storyteller. I became a fan that day, and only came [...]

Unite, cultivate, replenish: 3 lessons from Doris Kearns Goodwin was published on SAS Voices by Jenn Chase

December 14, 2020
 

Do you need to see how long patients have been treated for? Would you like to know if a patient’s dose has changed, or if the patient experienced any dose interruptions? If so, you can use a Napoleon plot, also known as a swimmer plot, in conjunction with your exposure data set to find your answers. We demonstrate how to find the answer in our recent book SAS® Graphics for Clinical Trials by Example.

You may be wondering what a Napoleon plot is. Have you ever heard of the map of Napoleon's Russian campaign? It was a map that displayed six types of data, including troop movement, temperature, latitude, and longitude, on one graph (Wikipedia). In the clinical setting, we try to mimic this approach by displaying several different types of safety data on one graph: hence, the name "Napoleon plot." The plot is also known as a swimmer plot because each patient's data is displayed in its own row, which resembles swimming lanes.

Code

Now that you know what a Napoleon plot is, how do you produce it? In essence, you simply write GTL code to produce the graph you need. The key GTL statements for generating a Napoleon plot are DISCRETEATTRMAP, HIGHLOWPLOT, SCATTERPLOT, and DISCRETELEGEND. Other plot statements are used as well, but these four appear in virtually every Napoleon plot. In our recent book, one chapter carefully walks you through each step of producing the Napoleon plot. Program 1, below, gives a small teaser of some of the code used to produce it.

Program 1: Code for Napoleon Plot That Highlights Dose Interruptions

discreteattrmap name = "Dose_Group";
   value "54" / fillattrs = (color = orange)
                lineattrs = (color = orange pattern = solid);
   value "81" / fillattrs = (color = red)
                lineattrs = (color = red pattern = solid);
enddiscreteattrmap;

discreteattrvar attrvar = id_dose_group var = exdose attrmap = "Dose_Group";

legenditem type = marker name = "54_marker" /
   markerattrs = (symbol = squarefilled color = orange)
   label = "Xan 54mg";

< Other legenditem statements >

layout overlay / yaxisopts = (type = discrete
                              display = (line label)
                              label = "Patient");

   highlowplot y = number
               high = eval(aendy/30.4375)
               low = eval(astdy/30.4375) /
      group = id_dose_group
      type = bar
      lineattrs = graphoutlines
      barwidth = 0.2;
   scatterplot y = number x = eval((max_aendy + 10)/30.4375) /
      markerattrs = (symbol = completed size = 12px);
   discretelegend "54_marker" "81_marker" "completed_marker" /
      type = marker
      autoalign = (bottomright) across = 1
      location = inside title = "Dose";
endlayout;

Output

Without further ado, Output 1 shows an example of a Napoleon plot. Because there are many patients, the patient labels have been suppressed. You can also see that the patient who has been on the study the longest had a dose delay, indicated by the white space between the red and orange bars. While this example illustrates a simple Napoleon plot with only two types of data, dose exposure and treatment, the book has more complex examples of swimmer plots.

Output 1: Napoleon Plot that Highlights Dose Interruptions

Napoleon plot with orange and red bars showing dose exposure and treatment

How to create a Napoleon plot with Graph Template Language (GTL) was published on SAS Users.

December 14, 2020
 

A segmented regression model is a piecewise regression model that has two or more sub-models, each defined on a separate domain for the explanatory variables. For simplicity, assume the model has one continuous explanatory variable, X. The simplest segmented regression model assumes that the response is modeled by one parametric model when X is less than some threshold value and by a different parametric model when X is greater than the threshold value. The threshold value is also called a breakpoint, a cutpoint, or a join point.

A previous article shows that for simple piecewise polynomial models, you can use a spline effect (via the EFFECT statement in SAS regression procedures) to fit some segmented regression models. The method relies on two assumptions. First, it assumes that you know the location of the breakpoint. Second, it assumes that the model has the same parametric form on each interval. For example, the model might be piecewise constant, piecewise linear, or piecewise quadratic.

If you need to estimate the location of the breakpoint from the data, or if you are modeling the response differently on each segment, you cannot use a single spline effect. Instead, you need to use a SAS procedure such as PROC NLIN that enables you to specify the model on each segment and to estimate the breakpoint. For simplicity, this article shows how to estimate a model that is quadratic on the first segment and constant on the second segment. (This is called a plateau model.) The model also estimates the location of the breakpoint.

An example of a segmented model

A SAS customer recently asked about segmented models on the SAS Support Communities. Suppose that a surgeon wants to model how long it takes her to perform a certain procedure over time. When the surgeon first started performing the procedure, it took about 3 hours (180 minutes). As she refined her technique, that time decreased. The surgeon wants to predict how long the surgery takes now, and she wants to estimate when the time reached its current plateau. The data are shown below and visualized in a scatter plot. The length of the procedure (in minutes) is recorded for 25 surgeries over a 16-month period.

data Have;
  input SurgeryNo Date :mmddyy10. duration @@;
  format Date mmddyy10.;
datalines;
1       3/20/2019       182   2       5/16/2019       150
3       5/30/2019       223   4       6/6/2019        142
5       6/11/2019       111   6       7/11/2019       164
7       7/26/2019        83   8       8/22/2019       144
9       8/29/2019       162  10       9/19/2019        83
11      10/10/2019       70  12       10/17/2019      114
13      10/31/2019      113  14       11/7/2019        97
15      11/21/2019       83  16       12/5/2019       111
17      12/5/2019        73  18       12/12/2019       87
19      12/19/2019       86  20       1/9/2020        102
21      1/16/2020       124  22       1/23/2020        95
23      1/30/2020       134  24       3/5/2020        121
25      6/4/2020         60
;
 
title "Time to Perform a Surgery";
proc sgplot data=Have;
  scatter x=Date y=Duration;
run;

From the graph, it looks like the duration for the procedure decreased until maybe November or December 2019. The goal of a segmented model is to find the breakpoint and to model the duration before and after the breakpoint. For this purpose, you can use a segmented plateau model that is a quadratic model prior to the breakpoint and a constant model after the breakpoint.

Formulate a segmented regression model

A segmented plateau model is one of the examples in the PROC NLIN documentation. The documentation shows how to use constraints in the problem to eliminate one or more parameters. For example, a common assumption is that the two segments are joined smoothly at the breakpoint location, x0. If f(x) is the predictive model to the left of the breakpoint and g(x) is the model to the right, then continuity dictates that f(x0) = g(x0) and smoothness dictates that f′(x0) = g′(x0). In many cases (such as when the models are low-degree polynomials), the two constraints enable you to reparameterize the models to eliminate two of the parameters.

The PROC NLIN documentation shows the details. Suppose f(x) is a quadratic function f(x) = α + βx + γx² and g(x) is a constant function g(x) = c. Then you can use the constraints to reparameterize the problem so that (α, β, γ) are free parameters, and the other two parameters are determined by the formulas:
x0 = -β / (2γ)
c = α − β² / (4γ)
You can use the ESTIMATE statement in PROC NLIN to obtain estimates and standard errors for the x0 and c parameters.
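As a quick numerical sanity check of these formulas (a Python sketch, not part of the SAS program; the parameter values are illustrative), you can verify that the breakpoint is the vertex of the quadratic and that the two segments join continuously and smoothly there:

```python
# Check that x0 = -beta/(2*gamma) is the vertex of the quadratic
# f(x) = alpha + beta*x + gamma*x^2, and that f(x0) equals the
# plateau c = alpha - beta^2/(4*gamma). Values are illustrative.
alpha, beta, gamma = 184.0, -0.5, 0.001

x0 = -beta / (2 * gamma)             # breakpoint (vertex of the parabola)
c  = alpha - beta**2 / (4 * gamma)   # plateau value

f = lambda x: alpha + beta * x + gamma * x**2
assert abs(f(x0) - c) < 1e-9         # continuity: f(x0) = g(x0) = c
eps = 1e-6
slope = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
assert abs(slope) < 1e-6             # smoothness: f'(x0) = 0 = g'(x0)
print(round(x0, 6), round(c, 6))     # breakpoint ~250 days, plateau ~121.5
```

Because the smoothness constraint forces the join to occur at the vertex of the parabola, the plateau value is simply the minimum of the quadratic.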

Provide initial guesses for the parameters

As is often the case, the hard part is to guess initial values for the parameters. You must supply an initial guess on the PARMS statement in PROC NLIN. One way to create a guess is to use a related "reduced model" to provide the estimates. For example, you can use PROC GLM to fit a global quadratic model to the data, as follows:

/* rescale by using x="days since start" as the variable */
%let RefDate = '20MAR2019'd;
data New;
set Have;
rename duration=y;
x = date - &RefDate;  /* days since first record */
run;
 
proc glm data=New;
   model y = x x*x;
run;

These parameter estimates are used in the next section to specify the initial parameter values for the segmented model.
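The same reduced-model trick can be sketched in Python (synthetic data of my own making, not the surgeon's data): fit a global quadratic and read the coefficients off as starting values for the nonlinear fit.

```python
# Illustrative sketch: a global quadratic fit (the "reduced model")
# supplies starting values for the nonlinear plateau fit.
# The data below are synthetic, generated to resemble the example.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 450, 25)                       # days since first surgery
y = 180 - 0.55*x + 0.00095*x**2 + rng.normal(0, 15, x.size)

gamma, beta, alpha = np.polyfit(x, y, deg=2)      # coefficients, highest degree first
print(f"alpha={alpha:.1f} beta={beta:.3f} gamma={gamma:.6f}")
```

The three estimates play the same role as the PROC GLM estimates: rough but good-enough starting values for the PARMS statement.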

Fit a segmented regression model in SAS

Recall that a SAS date is represented by the number of days since 01JAN1960. Thus, for these data, the Date values are approximately 22,000. Numerically speaking, it is often better to use smaller numbers in a regression problem, so I will rescale the explanatory variable to be the number of days since the first surgery. You can use the parameter estimates from the quadratic model as the initial values for the segmented model:

title 'Segmented Model with Plateau';
proc nlin data=New plots=fit noitprint;
   parms alpha=184 beta= -0.5 gamma= 0.001;
 
   x0 = -0.5*beta / gamma;
   if (x < x0) then
        mean = alpha + beta*x  + gamma*x*x;    /* quadratic model for x < x0 */
   else mean = alpha + beta*x0 + gamma*x0*x0;  /* constant model for x >= x0 */
   model y = mean;
 
   estimate 'plateau'    alpha + beta*x0 + gamma*x0*x0;
   estimate 'breakpoint' -0.5*beta / gamma;
   output out=NLinOut predicted=Pred L95M=Lower U95M=Upper;
   ods select ParameterEstimates AdditionalEstimates FitPlot;
run;
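For comparison, here is a sketch of the same reparameterized fit in Python with SciPy (synthetic data again; `curve_fit` stands in for PROC NLIN and is not the author's method):

```python
# Sketch: fit the smooth quadratic-plateau model with SciPy. The smoothness
# constraint is built in, because the model is the quadratic clamped at its
# own vertex. Data are synthetic, shaped like the surgery example.
import numpy as np
from scipy.optimize import curve_fit

def plateau(x, alpha, beta, gamma):
    x0 = -0.5 * beta / gamma              # breakpoint implied by f'(x0) = 0
    xc = np.minimum(x, x0)                # constant (vertex value) for x >= x0
    return alpha + beta*xc + gamma*xc**2

rng = np.random.default_rng(1)
x = np.linspace(0, 450, 25)
y = plateau(x, 180, -0.55, 0.00095) + rng.normal(0, 15, x.size)

(alpha, beta, gamma), _ = curve_fit(plateau, x, y, p0=[184, -0.5, 0.001])
print("breakpoint (days):", -0.5 * beta / gamma)
print("plateau (minutes):", alpha - beta**2 / (4 * gamma))
```

As in the NLIN program, the breakpoint and plateau are derived quantities, computed from the three free parameters after the fit.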

The output from PROC NLIN includes estimates for the quadratic regression coefficients and for the breakpoint and the plateau value. According to the second table, the surgeon can now perform this surgical procedure in about 98 minutes, on average. The 95% confidence interval [77, 119] suggests that the surgeon might want to schedule two hours for this procedure so that she is not late for her next appointment.

According to the estimate for the breakpoint (x0), she achieved her plateau after about 287 days of practice. However, the confidence interval is quite large, so there is considerable uncertainty in this estimate.

The output from PROC NLIN includes a plot that overlays the predictions on the observed data. However, that graph is in terms of the number of days since the first surgery. If you want to return to the original scale, you can graph the predicted values versus the Date. You can also add reference lines that indicate the plateau value and the estimated breakpoint, as follows:

/* convert x back to original scale (Date) */
data _NULL_;
   plateauDate = 287 + &RefDate;    /* translate back to Date scale */
   call symput('x0', plateauDate);  /* store breakpoint in macro variable */
   put plateauDate DATE10.;
run;
 
/* plot the predicted values and CLM on the original scale */
proc sgplot data=NLinOut noautolegend;
   band x=Date lower=Lower upper=Upper;
   refline 97.97  / axis=y label="Plateau"    labelpos=max;
   refline &x0 / axis=x label="01JAN2020" labelpos=max;
   scatter x=Date y=y;
   series  x=Date y=Pred;
   yaxis label='Duration';
run;

The graph shows the breakpoint estimate, which happens to be 01JAN2020. It also shows the variability in the data and the wide prediction limits.

Summary

This article shows how to fit a simple segmented model in SAS. The model has one breakpoint, which is estimated from the data. The model is quadratic before the breakpoint, constant after the breakpoint, and joins smoothly at the breakpoint. The constraints on continuity and smoothness reduce the problem from five parameters to three free parameters. The article shows how to use PROC NLIN in SAS to solve segmented models and how to visualize the result.

You can use a segmented model for many other data sets. For example, if you are a runner and routinely run a 5k distance, you can use a segmented model to monitor your times. Are your times decreasing, or did you reach a plateau? When did you reach the plateau?

The post Segmented regression models in SAS appeared first on The DO Loop.

December 11, 2020
 

In recent years, we've seen some astronomical contracts given to professional athletes. Major League Baseball (MLB) has certainly had its share. One of its first notable “megadeals” was when Alex Rodriguez, the Seattle Mariners’ power-hitting shortstop, left the team to join the Texas Rangers in 2001. The Rangers committed [...]

Competing in the Big Leagues with or without Big Money was published on SAS Voices by Pete Berryman

December 9, 2020
 

Today, SAS customers visit a variety of places for their customer support needs. This multiple-environment approach is tedious and confusing. My SAS solves this issue by combining these various customer service locations into a single environment. Meet My SAS!

What is My SAS?

My SAS is a brand-new customer experience page. This new location takes a variety of customer service places and puts them in one interface. The goal of My SAS is to ensure all SAS customers have the best possible experience available in the marketplace.

Finding My SAS

You can access My SAS in a variety of ways:

  • My SAS can be accessed via its web address: my.sas.com,
  • via the Software Order Email (SOE),
  • or at the top right-hand side of the sas.com page by clicking on the person icon.

What does My SAS do?

My SAS consists of four major components. Let’s explore each page.

For all SAS Customers (v9 and Viya 3):

1. Overview page

From here you can see and update your SAS profile, manage and review your orders, or learn more about SAS Viya and My SAS. On the top ribbon of this page, you can dive deeper into My SAS: view outstanding technical tracks via 'My Services', view or sign up for SAS Cloud offerings via 'My Cloud', and, for Viya 2020 customers, access deployment order assets via 'My Viya Orders'.

Overview Page

2. My Services page

The My Services page shows open support tracks, the track ID, and the status of the tracks associated with the SAS user. Users can also create a new technical track from this location.

My Services Page

3. My Cloud page

The My Cloud page is the location for signing up, launching, and accessing SAS Cloud Offerings. This is also where you can preview new product trials and see when any existing trial periods will end.

My Cloud Page

For SAS Viya 2020 customers:

4. My Orders page

The My Orders page displays information for each of your SAS Viya 2020 orders. For each order, you can download deployment assets and access documentation for the selected version. You can also manage user access to the orders.

My Orders Page

In summary, My SAS is your single location for managing your SAS profile, technical tracks, and cloud offerings. For Viya 2020, you can manage your order deployment assets and users' access to these orders. To learn more about My SAS, please check out this tour video:

What are you waiting for? Explore My SAS today!

Meet My SAS was published on SAS Users.

December 9, 2020
 

One purpose of principal component analysis (PCA) is to reduce the number of important variables in a data analysis. Thus, PCA is known as a dimension-reduction algorithm. I have written about four simple rules for deciding how many principal components (PCs) to keep.

There are other methods for deciding how many PCs to keep. Recently a SAS customer asked about a method known as Horn's method (Horn, 1965), also called parallel analysis. This is a simulation-based method for deciding how many PCs to keep. If the original data consists of N observations and p variables, Horn's method is as follows:

  • Generate B sets of random data with N observations and p variables. The variables should be normally distributed and uncorrelated. That is, the data matrix, X, is a random sample from a multivariate normal (MVN) distribution with an identity correlation matrix: X ~ MVN(0, I(p)).
  • Compute the corresponding B correlation matrices and the eigenvalues for each correlation matrix.
  • Estimate the 95th percentile of each eigenvalue distribution. That is, estimate the 95th percentile of the largest eigenvalue, the 95th percentile of the second largest eigenvalue, and so forth.
  • Compare the observed eigenvalues to the 95th percentiles of the simulated eigenvalues. If the observed eigenvalue is larger, keep it. Otherwise, discard it.
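The steps above can be sketched in a few lines of Python (my illustration, independent of the SAS implementation; the function name and defaults are mine):

```python
# A sketch of Horn's parallel analysis: simulate eigenvalues of correlation
# matrices of uncorrelated normal data, then take the 95th percentile of
# each eigenvalue's null distribution as the retention threshold.
import numpy as np

def horn_thresholds(N, p, B=1000, q=0.95, seed=54321):
    rng = np.random.default_rng(seed)
    eig = np.empty((B, p))
    for b in range(B):
        X = rng.standard_normal((N, p))      # sample from MVN(0, I(p))
        R = np.corrcoef(X, rowvar=False)     # sample correlation matrix
        eig[b] = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending eigenvalues
    return np.quantile(eig, q, axis=0)       # 95th percentile, per position

crit = horn_thresholds(N=50, p=7)            # dimensions match the crime data
print(crit)   # keep component k only if its observed eigenvalue exceeds crit[k]
```

A component is retained only when it explains more variance than the corresponding component of pure noise would at the chosen percentile.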

I do not know why the adjective "parallel" is used for Horn's analysis. Nothing in the analysis is geometrically parallel to anything else. Although you can use parallel computations to perform a simulation study, I doubt Horn was thinking about that in 1965. My best guess is that Horn's method is a secondary analysis that is performed "off to the side" or "in parallel" to the primary principal component analysis.

Parallel analysis in SAS

You do not need to write your own simulation method to use Horn's method (parallel analysis). Horn's parallel analysis is implemented in SAS (as of SAS/STAT 14.3 in SAS 9.4M5) by using the PARALLEL option in PROC FACTOR. The following call to PROC FACTOR uses data about US crime rates. The data are from the Getting Started example in PROC PRINCOMP.

proc factor data=Crime method=Principal 
   parallel(nsims=1000 seed=54321) nfactors=parallel plots=parallel;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;

The PLOTS=PARALLEL option creates the visualization of the parallel analysis. The solid line shows the eigenvalues for the observed correlation matrix. The dotted line shows the 95th percentile of the simulated data. When the observed eigenvalue is greater than the corresponding 95th percentile, you keep the factor. Otherwise, you discard the factor. The graph shows that only one principal component would be kept according to Horn's method. This graph is a variation of the scree plot, which is a plot of the observed eigenvalues.

The same information is presented in tabular form in the "ParallelAnalysis" table. The first row is the only row for which the observed eigenvalue is greater than the 95th percentile (the "critical value") of the simulated eigenvalues.

Interpretation of the parallel analysis

Statisticians often use statistical tests based on a null hypothesis. In Horn's method, the simulation provides the "null distribution" of the eigenvalues of the correlation matrix under the hypothesis that the variables are uncorrelated. Horn's method says that we should only accept a factor as important if it explains more variance than would be expected from uncorrelated data.

Although the PARALLEL option is supported in PROC FACTOR, some researchers suggest that parallel analysis is valid only for PCA. Saccenti and Timmerman (2017) write, "Because Horn’s parallel analysis is associated with PCA, rather than [common factor analysis], its use to indicate the number of common factors is inconsistent (Ford, MacCallum, & Tait, 1986; Humphreys, 1975)." I am not an expert in factor analysis, but a basic principle of simulation is to ensure that the "null distribution" is appropriate to the analysis. For PCA, the null distribution in Horn's method (eigenvalues of a sample correlation matrix) is appropriate. However, in some common factor models, the important matrix is a "reduced correlation matrix," which does not have 1s on the diagonal.

The advantages of knowing how to write a simulation

Although the PARALLEL option in PROC FACTOR runs a simulation and summarizes the results, there are several advantages to implementing a parallel analysis yourself. For example, you can perform the analysis on the covariance (rather than correlation) matrix. Or you can substitute a robust correlation matrix as part of a robust principal component analysis.

I decided to run my own simulation because I was curious about the distribution of the eigenvalues. The graph that PROC FACTOR creates shows only the upper 95th percentiles of the eigenvalue distribution. I wanted to overlay a confidence band that indicates the distribution of the eigenvalues. The band would visualize the uncertainty in the eigenvalues of the simulated data. How wide is the band? Would you get different results if you use the median eigenvalue instead of the 95th percentile?

Such a graph is shown to the right. The confidence band was created by using a technique similar to the one I used to visualize uncertainty in predictions for linear regression models. The graph shows the distribution of each eigenvalue and connects the quantiles with straight lines. The confidence band fits well with the existing graph, even though the X axis is discrete.

Implement Horn's simulation method

Here's an interesting fact about the simulation in Horn's method. Most implementations generate B random samples, X ~ MVN(0, I(p)), but you don't actually NEED the random samples! All you need are the correlation matrices for the random samples. It turns out that you can simulate the correlation matrices directly by using the Wishart distribution. SAS/IML software includes the RANDWISHART function, which simulates matrices from the Wishart distribution. You can transform those matrices into correlation matrices, find the eigenvalues, and compute the quantiles in just a few lines of PROC IML:

/* Parallel Analysis (Horn 1965) */
proc iml;
/* 1. Read the data and compute the observed eigenvalues of the correlation */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
   read all var varNames into X;
   read all var "State";
close;
p = ncol(X);
N = nrow(X);
 
m = corr(X);                  /* observed correlation matrix */
Eigenvalue = eigval(m);       /* observed eigenvalues */
 
/* 2. Generate random correlation matrices from MVN(0,I(p)) data
      and compute the eigenvalues. Each row of W is a p x p scatter 
      matrix for a random sample of size N where X ~ MVN(0, I(p)) */
nSim = 1000;
call randseed(12345);
W = RandWishart(nSim, N-1, I(p));  /* each row stores a p x p matrix */
S = W / (N-1);                /* rescale to form covariance matrix */
simEigen = j(nSim, p);        /* store eigenvalues in rows */
do i = 1 to nSim;
   cov = shape(S[i,], p, p);  /* reshape the i_th row into p x p */
   R = cov2corr(cov);         /* i_th correlation matrix */
   simEigen[i,] = T(eigval(R));  
end;
 
/* 3. find 95th percentile for each eigenvalue */
alpha = 0.05;
call qntl(crit, simEigen, 1-alpha);
 
results = T(1:nrow(Eigenvalue)) || Eigenvalue || crit`;
print results[c={"Number" "Observed Eigenvalue" "Simul Crit Val"} F=BestD8.];

The table is qualitatively the same as the one produced by PROC FACTOR. Both tables are the results of simulations, so you should expect to see small differences in the third column, which shows the 95th percentile of the distributions of the eigenvalues.
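The same Wishart shortcut translates to Python via `scipy.stats.wishart` (a sketch, not the author's IML program): draw scatter matrices directly, rescale them, convert to correlation matrices, and collect the eigenvalues, without ever generating the raw MVN samples.

```python
# Wishart shortcut: simulate correlation matrices of MVN(0, I(p)) samples
# directly from the Wishart distribution, then compute their eigenvalues.
# Dimensions (N=50 observations, p=7 variables) match the crime data.
import numpy as np
from scipy.stats import wishart

N, p, B = 50, 7, 1000
rng = np.random.default_rng(12345)
W = wishart.rvs(df=N - 1, scale=np.eye(p), size=B, random_state=rng)

eig = np.empty((B, p))
for b in range(B):
    S = W[b] / (N - 1)                  # rescale scatter matrix to covariance
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)              # cov2corr
    eig[b] = np.sort(np.linalg.eigvalsh(R))[::-1]

crit = np.quantile(eig, 0.95, axis=0)   # 95th percentile of each eigenvalue
print(crit)
```

Because the trace of each correlation matrix is exactly p, the eigenvalues in every simulated row sum to p, which is a handy check on the computation.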

Visualize the distribution of eigenvalues

The eigenvalues are stored as rows of the simEigen matrix, so you can estimate the 5th, 10th, ..., 95th percentiles and overlay band plots on the eigenvalue (scree) plot, as follows:

/* 4. Create a graph that illustrates Horn's method: 
   Factor Number vs distribution of eigenvalues. Write results in long form. */
/* 4a. Write the observed eigenvalues and the 95th percentile */
create Horn from results[c={"Factor" "Eigenvalue" "SimulCritVal"}];
   append from results;
close;
 
/* 4b. Visualize the uncertainty in the simulated eigenvalues. For details, see
   https://blogs.sas.com/content/iml/2020/10/12/visualize-uncertainty-regression-predictions.html 
*/
a = do(0.05, 0.45, 0.05);          /* significance levels */
call qntl(Lower, simEigen, a);     /* lower qntls         */
call qntl(Upper, simEigen, 1-a);   /* upper qntls         */
Factor = col(Lower);               /* 1,2,3,...,1,2,3,... */
alpha = repeat(a`, 1, p);             
create EigenQntls var {"Factor" "alpha" "Lower" "Upper"};
   append;
close;
QUIT;
 
proc sort data=EigenQntls;
   by alpha Factor;
run;
data All;
   set Horn EigenQntls;
run;
 
title "Horn's Method (1965) for Choosing the Number of Factors";
title2 "Also called Parallel Analysis";
proc sgplot data=All noautolegend;
   band x=Factor lower=Lower upper=Upper/ group=alpha fillattrs=(color=gray) transparency=0.9;
   series x=Factor y=Eigenvalue / markers name='Eigen' legendlabel='Observed Eigenvalue';
   series x=Factor y=SimulCritVal / markers lineattrs=(pattern=dot) 
          name='Sim' legendlabel='Simulated Crit Value';
   keylegend 'Eigen' 'Sim' / across=1;
run;

The graph is shown in the previous section. The darkest part of the band shows the median eigenvalue. You can see that the "null distribution" of eigenvalues is rather narrow, even though the data contain only 50 observations. I thought perhaps it would be wider. Because the band is narrow, it doesn't matter much whether you choose the 95th percentile as a critical value or some other value (90th percentile, 80th percentile, and so forth). For these data, any reasonable choice for a percentile will still lead to rejecting the second factor and keeping only one principal component. Because the band is narrow, the results will not be unduly affected by whether you use few or many Monte Carlo simulations. In this article, both simulations used B=1000 simulations.

Summary

In summary, PROC FACTOR supports the PARALLEL and PLOTS=PARALLEL options for performing a "parallel analysis," which is Horn's method for choosing the number of principal components to retain. PROC FACTOR creates a table and graph that summarize Horn's method. You can also run the simulation yourself. If you use SAS/IML, you can simulate the correlation matrices directly, which is more efficient than simulating the data. If you run the simulation yourself, you can add additional features to the scree plot, such as a confidence band that shows the null distribution of the eigenvalues.

The post Horn's method: A simulation-based method for retaining principal components appeared first on The DO Loop.