October 16, 2019
 

The EFFECT statement is supported by more than a dozen SAS/STAT regression procedures. Among other things, it enables you to generate spline effects that you can use to fit nonlinear relationships in data. Recently there was a discussion on the SAS Support Communities about how to interpret the parameter estimates of spline effects. This article answers that question by visualizing the spline effects.

An overview of generated effects

Spline effects are powerful because they enable you to use parametric models to fit nonlinear relationships between an independent variable and a response. Using spline effects is not much different than using polynomial effects to fit nonlinear relationships. Suppose that a response variable, Y, appears to depend on an explanatory variable, X, in a complicated nonlinear fashion. If the relationship looks quadratic or cubic, you might try to capture the relationship by introducing polynomial effects. Instead of trying to model Y by X, you might try to use X, X², and X³.
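
The EFFECT statement can generate polynomial effects directly. Here is a minimal sketch of that idea (not from the original article; it uses the Sashelp.Cars data that appear later):

proc glmselect data=sashelp.cars;
   effect poly = polynomial(EngineSize / degree=3);  /* generates X, X^2, and X^3 effects */
   model mpg_city = poly / selection=none;
run;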

Strictly speaking, polynomial effects do not need to be centered at the origin. You could translate the polynomial by some amount, k, and use shifted polynomial effects such as (X-k), (X-k)², and (X-k)³. Or you could combine these shifted polynomials with polynomials at the origin. Or you could use polynomials that are shifted by different amounts, such as by the constants k1, k2, and k3.

Spline effects are similar to (shifted) polynomial effects. The constants (such as k1, k2, k3) that are used to shift the polynomials are called knots. Knots that are within the range of the data are called interior knots. Knots that are outside the range of the data are called exterior knots or boundary knots. You can read about the various kinds of spline effects that are supported by the EFFECT statement in SAS. Rather than rehash the mathematics, this article shows how you can use SAS to visualize a regression that uses splines. The visualization clarifies the meaning of the parameter estimates for the spline effects.

Output and visualize spline effects

This section shows how to output the spline effects into a SAS data set and plot the spline effects. Suppose you want to predict the MPG_City variable (miles per gallon in the city) based on the engine size. Because we will be plotting curves, the following statements sort the data by the EngineSize variable. Then the OUTDESIGN= option on the PROC GLMSELECT statement writes the spline effects to the Splines data set. For this example, I am using restricted cubic splines and four evenly spaced internal knots, but the same ideas apply to any choice of spline effects.

/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures: GLIMMIX, GLMSELECT, LOGISTIC, PHREG, ... */
proc sort data=sashelp.cars out=cars;   /* sort by the X variable because we will plot curves */
   by EngineSize;
run;
 
title "Restricted TPF Splines";
title2 "Four Internal Knots";
proc glmselect data=cars outdesign(addinputvars fullmodel)=Splines; /* data set contains spline effects */
   effect spl = spline(EngineSize / details       /* define spline effects */
                naturalcubic basis=tpf(noint)     /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));             /* 4 evenly spaced interior knots */
   model mpg_city = spl / selection=none;         /* fit model by using spline effects */
   ods select ParameterEstimates SplineKnots;
   ods output ParameterEstimates=PE;
quit;

The SplineKnots table shows the locations of the internal knots. There are four equally spaced knots because the procedure used the KNOTMETHOD=EQUAL(4) option. The ParameterEstimates table shows estimates for the regression coefficients for the spline effects, which are named "Spl 1", "Spl 2", and so forth. In the Splines data set, the corresponding variables are named Spl_1, Spl_2, and so forth.

But what do these spline effects look like? The following statements plot the spline effects versus the EngineSize variable, which is the variable from which the effects are generated:

proc sgplot data=Splines;
   series x=EngineSize y=Intercept / curvelabel;
   series x=EngineSize y=spl_1 / curvelabel;
   series x=EngineSize y=spl_2 / curvelabel;
   series x=EngineSize y=spl_3 / curvelabel;
   refline 2.7 4.1 5.5 / axis=x lineattrs=(color="lightgray");
   refline 6.9 / axis=x label="upper knot" labelloc=inside lineattrs=(color="lightgray");
   yaxis label="Spline Effect";
run;

As stated in the documentation for the NATURALCUBIC option, these spline effects include "an intercept, the polynomial X, and n – 2 functions that are all linear beyond the largest knot," where n is the number of knots. This example uses n=4 knots, so Spl_2 and Spl_3 are the cubic splines. You will also see different spline effects if you change to one of the other supported spline methods, such as B-splines or the truncated power functions. Try it!
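
For example, here is a hedged sketch of a B-spline variant that you could substitute for the EFFECT statement in the earlier PROC GLMSELECT step (same knots, cubic degree); the resulting basis functions and parameter estimates will differ from the TPF example:

effect spl = spline(EngineSize / details
             basis=bspline degree=3      /* B-spline basis instead of truncated power functions */
             knotmethod=equal(4));       /* same four evenly spaced interior knots */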

The graph shows that the natural cubic splines are reminiscent of polynomial effects, but there are a few differences:

  • The spline effects (spl_2 and spl_3) are shifted away from the origin. The spl_2 effect is shifted by 2.7 units, which is the location of the first internal knot. The spl_3 effect is shifted by 4.1 units, which is the location of the second internal knot.
  • The spline effects are 0 when EngineSize is less than the first knot position (2.7). Not all splines look like this, but these effects are based on truncated power functions (the TPF option).
  • The spline effects are linear when EngineSize is greater than the last knot position (6.9). Not all splines look like this, but these effects are restricted splines.

Predicted values are linear combinations of the spline effects

Visualizing the shapes of the spline effects enables you to make sense of the ParameterEstimates table. As in all linear regression, the predicted value is a linear combination of the design variables. In this case, the predicted values are formed by
Pred = 34.96 – 5*Spl_1 + 2.2*Spl_2 – 3.9*Spl_3
You can use the SAS DATA step or PROC IML to compute that linear combination of the spline effects. The following graph shows the predicted curve and overlays the locations of the knots:
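
Here is a minimal sketch of the DATA step approach, using the rounded coefficients shown above; the SGPLOT step recreates a graph like the one described:

data Pred;
   set Splines;   /* contains Intercept, Spl_1-Spl_3, and (via ADDINPUTVARS) EngineSize and MPG_City */
   Pred = 34.96 - 5*Spl_1 + 2.2*Spl_2 - 3.9*Spl_3;   /* linear combination of the spline effects */
run;

proc sgplot data=Pred noautolegend;
   scatter x=EngineSize y=mpg_city / transparency=0.7;
   series  x=EngineSize y=Pred;
   refline 2.7 4.1 5.5 6.9 / axis=x lineattrs=(color="lightgray");  /* knot locations */
run;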

The coefficient for Spl_1 is the largest effect (after the intercept). In the graph, you can see the general trend has an approximate slope of -5. The coefficient for the Spl_2 effect is 2.2, and you can see that the predicted values change slope between the second and third knots due to adding the Spl_2 effect. Without the Spl_3 effect, the predicted values would continue to rise after the third knot, but by adding in a negative multiple of Spl_3, the predicted values turn down again after the third knot.

Notice that the prediction function for the restricted cubic spline regression is linear before the first knot and after the last knot. The prediction function models nonlinear relationships between the interior knots.

Summary

In summary, the EFFECT statement in SAS regression procedures can generate spline effects for a continuous explanatory variable. The EFFECT statement supports many different types of splines. This article gives an example of using natural cubic splines (also called restricted cubic splines), which are based on the truncated power function (TPF) splines of degree 3. By outputting the spline effects to a data set and graphing them, you can get a better understanding of the meaning of the estimates of the regression coefficients. The predicted values are a linear combination of the spline effects, so the magnitude and sign of the regression coefficients indicate how the spline effects combine to predict the response.

The post Visualize a regression with splines appeared first on The DO Loop.

October 15, 2019
 

In a previous post, I discussed using logs to troubleshoot problems in your Viya environment. In this post, I will look at some additional ways to troubleshoot using some of the tools provided by the Viya Operations Infrastructure. With applications, servers and numerous micro-services all working together and generating their own logs in Viya, it can be difficult to find relevant logs. In order to manage the large number of logs and to enable you to locate messages of interest, the operations infrastructure provides components to collect and store log messages.

The collection process is illustrated in the diagram below.

Coordinated by the operations infrastructure:

  • sas-watch log continuously collects and sends log messages to the RabbitMQ exchange
  • sas-stream pulls the messages from RabbitMQ and writes them to disk as a tab-separated value (TSV) file
  • Every five minutes, the sas-ops-agentsrv runs the DatamartEtl task to extract log messages from the TSV file and load them into the VIYALOGS CAS-indexed search table

SAS Environment Manager uses the information in the VIYALOGS and VIYALOGS_SOURCES tables to display log messages and graphs that show the frequency and trends of messages. The SAS Environment Manager Logs interface makes it easy to search and analyze log messages. Using the interface, you can view, subset, and search logs. The interface provides filtering capabilities on the left-hand side and displays the messages on the right. By default, the filter is set to display all messages from all applications and services from the last 30 minutes.

You can modify the filter to extend or shorten the timeframe, subset the level of messages displayed or the source (service/application) that the messages are coming from. You can also search for any text within a message.

Many administrators would prefer a command-line interface, and the good news is there is one.

sas-ops is a command-line interface that allows you to monitor the operations infrastructure in a SAS Viya deployment.

I have found the sas-ops logs command very useful for troubleshooting problems. The sas-ops logs command can be used to stream log messages that are generated by SAS Viya applications and services. The messages can be streamed to a terminal window or piped to a file. The sas-ops logs command is located at /opt/sas/viya/home/bin and can be run from any machine in a Viya environment that is included in the CommandLine host group.

When would you use sas-ops logs to stream log messages? Some potential scenarios are to:

  • troubleshoot a poorly performing report or analysis
  • debug problems in the environment such as logon issues
  • monitor access to resources

In these cases, using sas-ops logs you can stream the log messages from all services to a single file or terminal.

In its simplest form, the command live-streams all log messages from a Viya environment to the terminal. Pressing Ctrl+C stops the streaming.

./sas-ops logs

Partial output from the stream is shown below.

If you want to save the output, you can redirect the stream to a file.

./sas-ops logs > /tmp/mylog.log

You can also get more creative and achieve more complex tasks. You can change the format of the message output by using --format. For example, to create a JSON file that could be read by another process, use:

./sas-ops logs --format pretty > mylogs.json

You can also:

  • stream messages for just a specific Viya service
  • filter logs messages by text in a regular expression
  • stream for a specific duration

The duration is specified using the format 0h0m0s0ms, but you can also use individual parts of the specification, for example 30s for 30 seconds or 5m for 5 minutes.

Consider the situation where we want to monitor access to a particular CAS table over a specific period of time. The command below will output to a file all messages that contain the table name HR_SUMMARY for a period of 5 minutes.

./sas-ops logs --match HR_SUMMARY --timeout 5m > /tmp/hr_summary_access.log

The output shows all the CAS actions that were performed on the table during the time period.

You can subset the stream to one service.

Consider a case where a user is having an issue logging in and you suspect a problem with the LDAP setup. To check, first enable DEBUG logging on com.sas.identities. Then stream the log messages from the identities service.

./sas-ops logs --format pretty --source identities > logonerrors.json

Viewing the output shows that there is something wrong with the LDAP query.

I think you will agree that sas-ops logs is a very useful tool for monitoring and troubleshooting issues in a Viya environment. For more information, check out the following resources:

I would like to thank Bryan Ellington for his helpful input with this post.

Capturing log messages from Viya deployments was published on SAS Users.

October 14, 2019
 

I recently wrote about how to use PROC TTEST in SAS/STAT software to compute the geometric mean and related statistics. This prompted a SAS programmer to ask a related question. Suppose you have dozens (or hundreds) of variables and you want to compute the geometric mean of each. What is the best way to obtain these geometric means?

As I mentioned in the previous post, the SAS/IML language supports the GEOMEAN function, so you can compute the geometric means by iterating over each column in a data matrix. If you do not have SAS/IML software, you can use PROC UNIVARIATE in Base SAS. The UNIVARIATE procedure supports the OUTTABLE= option, which creates a SAS data set that contains many univariate statistics, including the geometric mean.
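
For the SAS/IML approach, here is a minimal sketch that loops over the numeric columns of the Sashelp.Cars data (used in the next example). It assumes all values are positive, as the geometric mean requires, and excludes missing values before calling GEOMEAN:

proc iml;
use Sashelp.Cars;
read all var _NUM_ into X[colname=varNames];
close;

gm = j(ncol(X), 1, .);
do i = 1 to ncol(X);
   v = X[, i];
   v = v[loc(v ^= .)];     /* keep only the nonmissing values */
   gm[i] = geomean(v);     /* geometric mean of the i_th column */
end;
print gm[rowname=varNames label="Geometric Mean"];
quit;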

For example, suppose you want to compute the geometric means for all numeric variables in the Sashelp.Cars data set. You can use the OUTTABLE= option to write the output statistics to a data set and then print only the column that contains the geometric mean, as follows:

proc univariate data=Sashelp.Cars outtable=DescStats noprint;
   var _NUMERIC_;
run;
 
proc print data=DescStats noobs;
   var _var_ _GEOMEAN_;
run;

This method also works if your data contain a classification variable and you want to compute the geometric mean for each level of the classification variable. For example, the following statements compute the geometric means for two variables for each level of the Origin variable, which has the values "Asia", "Europe", and "USA":

proc univariate data=Sashelp.Cars outtable=DescStatsClass noprint;
   class Origin;
   var MPG_City Horsepower;
run;
 
proc print data=DescStatsClass noobs;
   var _var_ Origin _GEOMEAN_;
run;

In summary, if you want to use Base SAS to compute the geometric mean (or any of almost 50 other descriptive statistics) for many variables, use the OUTTABLE= option of PROC UNIVARIATE.

The post Compute the geometric mean for many variables in SAS appeared first on The DO Loop.

October 10, 2019
 

DATA Step BY Statements

DATA Step is a very powerful language that SAS and Open Source programmers leverage to stage data for the analytical life cycle. A popular technique is to use the DESCENDING option on BY variables to identify the largest value. Let’s review the example in Figure 1:

  • On line 74 we are using the descending option on the BY statement for the numeric variable MSRP. The reason we are doing this is so we can identify the most expensive car for each make of car in our data set.
  • On line 79 we group our data by MAKE of car.
  • On line 80 we leverage the FIRST. variable in a subsetting IF statement to output the first record for each MAKE. In Figure 2 we can review the results.


Figure 1. Descending BY Statement
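
Here is a minimal sketch of the kind of program that Figure 1 describes, reconstructed from the bullets above and assuming the Sashelp.Cars data:

proc sort data=sashelp.cars out=cars;
   by make descending msrp;      /* DESCENDING on MSRP puts the most expensive car first */
run;

data most_expensive;
   set cars;
   by make descending msrp;      /* group the data by MAKE */
   if first.make;                /* output the first (most expensive) record for each MAKE */
run;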


Figure 2. Listing of Most Expensive Cars by MAKE

What is CAS?

CAS is SAS Viya’s in-memory engine that processes data and logic in a distributed computing paradigm. When working with CAS tables we can simulate the DESCENDING BY statement by creating a CAS View which will then become the source table to our DATA Step. Let’s review Figure 3:

  • On line 79 we will leverage the CASL (SAS® Cloud Analytic Services Language) action set TABLE with the action VIEW to create the CAS View that will be used as our source table in the DATA Step.
  • On lines 80 and 81 we will store our CAS View in the CASUSER CASLIB with the name of DESCENDING.
  • On lines 82 and 83 we use the TABLES parameter to specify the input CAS table for our CAS View.
  • On line 84 we use the VARLIST parameter to identify the columns from the input table that we want in our CAS View.
  • On line 85 we create a new variable for our CAS View using the computedVars parameter.
  • On line 86 we provide the math for our new variable N_MSRP. N_MSRP is the negated value of the input CAS table variable MSRP. Note: This simulation only works for numeric variables. For character data, I suggest using LAST. processing, which you can review in this blog post.


Figure 3. Simulating DESCENDING BY Statement for Numeric Variables

Now that we have our CAS View with its new variable N_MSRP, we can move on to the DATA Step code in Figure 3.

  • On line 92 the SET statement specifies the source for our DATA Step: the CAS View CASUSER.DESCENDING.
  • On line 93 we leverage the BY statement to group our data in ascending order of the CAS View variables MAKE and N_MSRP. Because N_MSRP is in ascending order, our original variable MSRP is in DESCENDING order.
  • On line 94 we use a subsetting IF statement to output the first occurrence of each MAKE.

Figure 4 is a listing of our new CAS table CASUSER.DESCENDING2 and displays the most expensive car for each make of car.


Figure 4. Listing of Most Expensive Cars by MAKE

Template for Creating a CAS View

/* Create a CAS view */
/* For each DESCENDING numeric create a new variable(s) */
/* The value of the new variable(s) is the negated value */
/* of the original DESCENDING BY numeric variable(s) */
proc cas;
   table.view / replace = true
   caslib='casuser'
   name='descending'
   tables={{
      name='cars'
      varlist={'msrp' 'make'},
      computedVars={{name='n_msrp'}},
      computedVarsProgram='n_msrp = -(msrp)'
   }};
run;
quit;
 
data casuser.descending2;
   set casuser.descending;
   by make n_msrp ;
   if first.make ;
run;
 
proc print data=casuser.descending2;
title "Most Expensive Cars";
run;

Conclusion

It is a very common coding technique to process data with a DESCENDING BY statement using DATA Step. With Viya 3.5 the DESCENDING BY statement is supported for numeric and character data in DATA Step, with this caveat: DESCENDING works on all but the first BY variable on the BY statement. For earlier versions of SAS Viya, this simulation is the best practice for numeric data that you want in DESCENDING order.
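
Under that caveat, here is a sketch of the native Viya 3.5 approach (hypothetical output table name; it assumes the CARS table is loaded into the CASUSER caslib):

data casuser.descending3;
   set casuser.cars;
   by make descending msrp;   /* allowed in Viya 3.5: MSRP is not the first BY variable */
   if first.make;             /* output the most expensive car for each MAKE */
run;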

How to Simulate DATA Step DESCENDING BY Statements in SAS® Cloud Analytic Services (CAS) was published on SAS Users.

October 9, 2019
 

What can data tell us about the easiest hole at your favorite golf course? Or which hole contributes the most to mastering the course? A golf instructor once told me golf is not my sport, and that my swing is hopeless, but that didn’t stop me from analyzing golf data. [...]

How to win the SAS Championship was published on SAS Voices by Frank Silva.

October 9, 2019
 

As a fellow student, I know that making sure you get the right books for learning a new skill can be tough. To get you started off right, I would like to share the top SAS books that professors are requesting for students learning SAS. With this inside sneak peek, you can see what books instructors and professors are using to give new SAS users a jump-start with their SAS programming skills.

1. Learning SAS by Example: A Programmer's Guide, Second Edition

At the top of the list is Ron Cody’s Learning SAS by Example: A Programmer’s Guide, Second Edition. This book teaches SAS programming to new SAS users by building from very basic concepts to more advanced topics. Many programmers prefer examples rather than reference-type syntax, and so this book uses short examples to explain each topic. The new edition of this classic has been updated to SAS 9.4 and includes new chapters on PROC SGPLOT and Perl regular expressions. Check out this free excerpt for a glimpse into the way the book can help you summarize your data.

2. An Introduction to SAS University Edition

I cannot recommend this book highly enough for anyone starting out in data analysis. This book earns a place on my desk, within easy reach. - Christopher Battiston, Wait Times Coordinator, Women's College Hospital

The second most requested book will help you get up and running with the free SAS University Edition using Ron Cody’s easy-to-follow, step-by-step guide. This book is aimed at beginners who want to use the point-and-click interactive environment of SAS Studio, write their own SAS programs, or both.

The first part of the book shows you how to perform basic tasks, such as producing a report, summarizing data, producing charts and graphs, and using the SAS Studio built-in tasks. The second part of the book shows you how to write your own SAS programs, and how to use SAS procedures to perform a variety of tasks. In order to get familiar with the SAS Studio environment, this book also shows you how to access dozens of interesting data sets that are included with the product.

For more insights into this great book, check out Ron Cody’s useful tips for SAS University Edition in this recent SAS blog.

3. The Little SAS Book: A Primer, Fifth Edition

Our third book is a classic that just keeps getting better. The Little SAS Book is essential for anyone learning SAS programming. Lora Delwiche and Susan Slaughter offer a user-friendly approach so readers can quickly and easily learn the most commonly used features of the SAS language. Each topic is presented in a self-contained two-page layout complete with examples and graphics. Also, make sure to check out some more tips on learning SAS from the authors in their blog post.

We are also excited to announce that the newest edition of The Little SAS Book is coming out this Fall! The sixth edition will be interface independent, so it won’t matter if you are using SAS Studio, SAS Enterprise Guide, or the SAS windowing environment as your programming interface. In this new edition, the authors have included more examples of creating and using permanent SAS data sets, as well as using PROC IMPORT to read data. The new edition also deemphasizes reading raw data files using the INPUT statement—a topic that is no longer covered in the new base SAS programmer certification exam. Check out the upcoming titles page for more information!

4. SAS Certification Prep Guide: Statistical Business Analysis Using SAS 9

Number four is a must-have study guide for the SAS Certified Statistical Business Analyst Using SAS 9 exam. Written for both new and experienced SAS programmers, the SAS Certification Prep Guide: Statistical Business Analysis Using SAS 9 is an in-depth prep guide for the SAS Certified Statistical Business Analyst Using SAS 9: Regression and Modeling exam. The authors step through identifying the business question, generating results with SAS, and interpreting the output in a business context. The case study approach uses both real and simulated data to master the content of the certification exam. Each chapter also includes a quiz aimed at testing the reader’s comprehension of the material presented. To learn more about this great guide, watch an interview with co-author Joni Shreve.

5. SAS for Mixed Models: Introduction and Basic Applications

Models are a vital part of analyzing research data. It seems only fitting, then, that this popular SAS title would be our fifth most popular book requested by SAS instructors. Mixed models are now becoming a core part of undergraduate and graduate programs in statistics and data science. This book is great for those with intermediate-level knowledge of SAS and covers the latest capabilities for a variety of SAS applications. Be sure to read the review of this book by Austin Lincoln, a technical writer at SAS, for great insights into a book he calls a “survival guide” for creating mixed models.

Want more?

I hope this list will help in your search for a SAS book that will get you to the next step in your SAS education goals. To learn more about SAS Press, check out our up-and-coming titles, and to receive exclusive discounts make sure to subscribe to our newsletter.

Top 5 SAS Books for Students was published on SAS Users.

October 9, 2019
 

In a previous article, I mentioned that the VLINE statement in PROC SGPLOT is an easy way to graph the mean response at a set of discrete time points. I mentioned that you can choose three options for the length of the "error bars": the standard deviation of the data, the standard error of the mean, or a confidence interval for the mean. This article explains and compares these three options. Which one you choose depends on what information you want to convey to your audience. As I will show, some of the statistics are easier to interpret than others. At the end of this article, I tell you which statistic I recommend.

Sample data

The following DATA step simulates data at four time points. The data at each time point are normally distributed, but the mean, standard deviation, and sample size of the data vary for each time point.

data Sim;
label t = "Time";
array mu[4]    _temporary_ (80 78 78 79); /* mean */
array sigma[4] _temporary_ ( 1  2  2  3); /* std dev */
array N[4]     _temporary_ (36 32 28 25); /* sample size */
call streaminit(12345);
do t = 1 to dim(mu);
   do i = 1 to N[t];
      y = rand("Normal", mu[t], sigma[t]); /* Y ~ N(mu[t], sigma[t]) */
      output;
   end;
end;
run;
 
title "Response by Time";
ods graphics / width=400px height=250px;
proc sgplot data=Sim;
   vbox y / category=t connect=mean;
run;
Box plots of data at four time points

The box plot shows the schematic distribution of the data at each time point. The boxes use the interquartile range and whiskers to indicate the spread of the data. A line connects the means of the responses at each time point.

A box plot might not be appropriate if your audience is not statistically savvy. A simpler display is a plot of the mean for each time point and error bars that indicate the variation in the data. But what statistic should you use for the heights of the error bars? What is the best way to show the variation in the response variable?

Relationships between sample standard deviation, SEM, and CLM

Before I show how to plot and interpret the various error bars, I want to review the relationships between the sample standard deviation, the standard error of the mean (SEM), and the (half) width of the confidence interval for the mean (CLM). These statistics are all based on the sample standard deviation (SD). The SEM and width of the CLM are multiples of the standard deviation, where the multiplier depends on the sample size:

  • The SEM equals SD / sqrt(N). That is, the standard error of the mean is the standard deviation divided by the square root of the sample size.
  • The width of the CLM is a multiple of the SEM. For large samples, the multiplier for a 95% confidence interval is approximately 1.96. In general, suppose the significance level is α and you are interested in 100(1-α)% confidence limits. Then the multiplier is a quantile of the t distribution with N-1 degrees of freedom, often denoted by t*(1-α/2, N-1).
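
To make these relationships concrete, here is a quick sketch that computes the multiplier, the SEM, and the CLM half-width for the first time point of the simulated data (N=36), using the simulation parameter SD=1 rather than the sample estimate:

data _null_;
   alpha = 0.05;
   N  = 36;                                    /* sample size at the first time point */
   SD = 1;                                     /* std dev used to simulate that time point */
   SEM = SD / sqrt(N);                         /* standard error of the mean */
   tStar = quantile("t", 1 - alpha/2, N - 1);  /* quantile of the t distribution */
   CLMWidth = tStar * SEM;                     /* half-width of the 95% CLM */
   put tStar= SEM= CLMWidth=;
run;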

You can use PROC MEANS and a short DATA step to display the relevant statistics that show how these three statistics are related:

/* Optional: Compute SD, SEM, and half-width of CLM (not needed for plotting) */
proc means data=Sim noprint;
   class t;
   var y;
   output out=MeanOut N=N stderr=SEM stddev=SD lclm=LCLM uclm=UCLM;
run;
 
data Summary;
set MeanOut(where=(t^=.));
CLMWidth = (UCLM-LCLM)/2;   /* half-width of CLM interval */
run;
 
proc print data=Summary noobs label;
   format SD SEM CLMWidth 6.3;
   var T SD N SEM CLMWidth;
run;
Relationships between StdDev, SEM, and CLM

The table shows the standard deviation (SD) and the sample size (N) for each time point. The SEM column is equal to SD / sqrt(N). The CLMWidth value is a little more than twice the SEM value. (The multiplier depends on N; for these data, it ranges from 2.03 to 2.06.)

As shown in the next section, the values in the SD, SEM, and CLMWidth columns are the lengths of the error bars when you use the STDDEV, STDERR, and CLM options (respectively) to the LIMITSTAT= option on the VLINE statement in PROC SGPLOT.

Visualize and interpret the choices of error bars

Let's plot all three options for the error bars on the same scale, then discuss how to interpret each graph. Several interpretations use the 68-95-99.7 rule for normally distributed data. The following statements create the three line plots with error bars:

%macro PlotMeanAndVariation(limitstat=, label=);
   title "VLINE Statement: LIMITSTAT = &limitstat";
   proc sgplot data=Sim noautolegend;
      vline t / response=y stat=mean limitstat=&limitstat markers;
      yaxis label="&label" values=(75 to 82) grid;
   run;
%mend;
 
title "Mean Response by Time Point";
%PlotMeanAndVariation(limitstat=STDDEV, label=Mean +/- Std Dev);
%PlotMeanAndVariation(limitstat=STDERR, label=Mean +/- SEM);
%PlotMeanAndVariation(limitstat=CLM,    label=Mean and CLM);
Display error bars for the means by using the standard deviation

Use the standard deviations for the error bars

In the first graph, the length of the error bars is the standard deviation at each time point. This is the easiest graph to explain because the standard deviation is directly related to the data. The standard deviation is a measure of the variation in the data. If the data at each time point are normally distributed, then (1) about 68% of the data have values within the extent of the error bars, and (2) almost all the data lie within three times the extent of the error bars.

The main advantage of this graph is that a "standard deviation" is a term that is familiar to a lay audience. The disadvantage is that the graph does not display the accuracy of the mean computation. For that, you need one of the other statistics.

Display error bars for the means by using the standard error of the mean

Use the standard error for the error bars

In the second graph, the length of the error bars is the standard error of the mean (SEM). This is harder to explain to a lay audience because it is an inferential statistic. A qualitative explanation is that the SEM shows the accuracy of the mean computation. Small SEMs imply better accuracy than large SEMs.

A quantitative explanation requires using advanced concepts such as "the sampling distribution of the statistic" and "repeating the experiment many times." For the record, the SEM is an estimate of the standard deviation of the sampling distribution of the mean. Recall that the sampling distribution of the mean can be understood in terms of repeatedly drawing random samples from the population and computing the mean for each sample. The standard error is defined as the standard deviation of the distribution of the sample means.
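
You can demonstrate this by simulation. The following sketch assumes the parameters of the first time point (mean 80, std dev 1, N=36); it draws many random samples and computes the standard deviation of the sample means, which should be close to SD/sqrt(N) = 1/6 ≈ 0.167:

data SampleMeans(keep=SampleMean);
   call streaminit(54321);
   do rep = 1 to 10000;                     /* repeat the "experiment" many times */
      total = 0;
      do i = 1 to 36;
         total = total + rand("Normal", 80, 1);
      end;
      SampleMean = total / 36;              /* mean of one random sample */
      output;
   end;
run;

proc means data=SampleMeans mean std;       /* STD should be close to 1/sqrt(36) = 0.167 */
run;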

The exact meaning of the SEM might be difficult to explain to a lay audience, but the qualitative explanation is often sufficient.

Display error bars for the means by using confidence intervals

Use a confidence interval of the mean for the error bars

In the third graph, the length of the error bars is a 95% confidence interval for the mean. This graph also displays the accuracy of the mean, but these intervals are about twice as long as the intervals for the SEM.

The confidence interval for the mean is hard to explain to a lay audience. Many people incorrectly think that "there is a 95% chance that the population mean is in this interval." That statement is wrong: either the population mean is in the interval or it isn't. There is no probability involved! The words "95% confidence" refer to repeating the experiment many times on random samples and computing a confidence interval for each sample. The true population mean will be in about 95% of the confidence intervals.

Conclusions

In summary, there are three common statistics that are used to overlay error bars on a line plot of the mean: the standard deviation of the data, the standard error of the mean, and a 95% confidence interval for the mean. The error bars convey the variation in the data and the accuracy of the mean estimate. Which one you use depends on the sophistication of your audience and the message that you are trying to convey.

My recommendation? Despite the fact that confidence intervals can be misinterpreted, I think that the CLM is the best choice for the size of the error bars (the third graph). If I am presenting to a statistical audience, the audience understands the CLMs. For a less sophisticated audience, I do not dwell on the probabilistic interpretation of the CLM but merely say that the error bars "indicate the accuracy of the mean."

As explained previously, each choice has advantages and disadvantages. What choice do you make and why? You can share your thoughts by leaving a comment.

The post What statistic should you use to display error bars for a mean? appeared first on The DO Loop.

October 7, 2019
 

It is always great to read an old paper or blog post and think, "This task is so much easier in SAS 9.4!" I had that thought recently when I stumbled on a 2007 paper by Wei Cheng titled "Graphical Representation of Mean Measurement over Time." A substantial portion of the eight-page paper is SAS code for creating a graph of the mean responses over time for patients in two arms of a clinical trial. (An arm is a group of participants who receive an intervention or who receive no intervention, such as an experimental group and the control group.)

The graph to the right is a modern version of one graph that Cheng created. This graph is created by using PROC SGPLOT. This article shows how to create this and other graphs that visualize the mean response by time for groups in a clinical trial.

This article assumes that the data are measured at discrete time points. If time is a continuous variable, you can model the mean response by using a regression model, and you can use the EFFECTPLOT statement to graph the predicted mean response versus time.
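
For example, here is a hedged sketch of that continuous-time approach, using the Study data defined in the next section and a hypothetical quadratic trend for each arm (this model is illustrative, not from Cheng's paper):

proc genmod data=study;
   class Armcd;
   model y = Armcd VisitNum Armcd*VisitNum VisitNum*VisitNum;  /* VisitNum treated as continuous */
   effectplot slicefit(x=VisitNum sliceby=Armcd);              /* predicted mean response vs. time */
run;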

Sample clinical data

Cheng did not include his sample data, but the following DATA step defines fake data for 11 patients, five in one arm and six in the other. The data produce graphs that are similar to the graphs in Cheng's paper.

data study;
input Armcd $ SubjID $ y1-y5;                /* read data in wide form */
label VisitNum = 'Visit' Armcd = "Treatment";
VisitNum=1; y=y1; output;                    /* immediately transform data to long form */
VisitNum=2; y=y2; output;
VisitNum=3; y=y3; output;
VisitNum=4; y=y4; output;
VisitNum=5; y=y5; output;
drop y1-y5;
datalines;
A 001 135 138 135 134  .
A 002 142 140 141 139 138
A 003 140 137 136 135 133
A 004 131 131 130 131 130
A 005 128 125  .  121 121
B 006 125 120 115 110 105
B 007 139 134 128 128 122
B 008 136 129 126 120 111
B 009 128 125 127 133 136
B 010 120 114 112 110  96
B 011 129 122 120 119  .
;

Use the VLINE statement for mean and variation

The VLINE statement in PROC SGPLOT can summarize data across groups. When you use the RESPONSE= and STAT= options, it can display the mean, median, count, or percentage of a response variable. You can add "error bars" to the graph by using the LIMITSTAT= option. Following Cheng, the error bars indicate the standard error of the mean (SEM). The following statements create the line plot shown at the top of this article:

/* simplest way to visualize means over time for each group */
title "Mean Response by Arm";
proc sgplot data=study;
   vline VisitNum / response=y group=Armcd stat=mean limitstat=stderr;
   yaxis label='Mean +/- SEM';
run;

That was easy! Notice that the VLINE statement computes the mean and standard error of Y for each combination of the VisitNum and Armcd variables.

This graph shows the standard error of the mean, but you could also show confidence limits for the mean (LIMITSTAT=CLM) or indicate the extent of one or more standard deviations (LIMITSTAT=STDDEV and use the NUMSTD= option).
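
For example, here is a sketch of the one-standard-deviation variant:

proc sgplot data=study;
   vline VisitNum / response=y group=Armcd stat=mean
                    limitstat=stddev numstd=1;   /* error bars extend +/- 1 std dev */
   yaxis label='Mean +/- Std Dev';
run;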

An alternative plot: Box plots by time

Cheng's graph is appropriate when the intended audience for the graph includes people who might not be experts in statistics. For a more sophisticated audience, you could create a series of box plots and connect the means of the box plots. In this plot, the CATEGORY= option is used to specify the time-like variable and the GROUP= option is used to specify the arms of the study. (Learn about the difference between categories and groups in box plots.)

/* box plots connected by means */
title "Response by Arm";
proc sgplot data=study;
   vbox y / category=VisitNum group=Armcd groupdisplay=cluster 
            connect=mean clusterwidth=0.35;
run;

Whereas the first graph emphasizes the mean value of the responses, the box plot emphasizes the individual responses. The mean responses are connected by lines. The boxes show the interquartile range (Q1 and Q3) as well as the median response. Whiskers and outliers indicate the spread of the data.

Graph summarized statistics

In the previous sections, the VLINE and VBOX statements automatically summarized the data for each time point and for each arm of the study. This is very convenient, but the SGPLOT statements support only a limited number of statistics such as the mean and median. For more control over the statistics, you can use PROC MEANS or PROC UNIVARIATE to summarize the data and then use the SERIES statement to plot the statistics and (optionally) use the SCATTER statement to plot error bars for the statistic.

PROC MEANS supports dozens of descriptive statistics, but, for easy comparison, I will show how to create the same graph by using summarized data. The following call to PROC MEANS creates an output data set that contains statistics for each visit/arm combination.

proc means data=study N mean stderr stddev lclm uclm NDEC=2;
   class Armcd VisitNum;
   var y;
   output out=MeanOut N=N mean=Mean stderr=SEM stddev=SD lclm=LCLM uclm=UCLM;
run;

The output data set (MeanOut) contains all the information in the table, plus additional "marginal" information that summarizes the means across all arms (for each visit), across all visits (for each arm), and for the entire study. When you use the MeanOut data set, you should use a WHERE clause to specify which information you want to analyze. For this example, we want only the information for the Armcd/VisitNum combinations. You can run a simple DATA step to subset the output and to create variables for the values Mean +/- SEM, as follows:

/* compute lower/upper bounds as Mean +/- SEM */
data Summary;
set MeanOut(where=(Armcd^=" " & VisitNum^=.));
LowerSEM = Mean - SEM;
UpperSEM = Mean + SEM;
run;
 
/* create a graph of summary statistics that is similar to the VLINE graph */
title2 "Presummarized Data";
proc sgplot data=Summary;
series  x=VisitNum y=Mean / group=Armcd;
scatter x=VisitNum y=Mean / group=Armcd
        yerrorlower=LowerSEM yerrorupper=UpperSEM;
run;

You can use this technique to create graphs of other statistics versus time.

Adding tabular information to a mean-versus-time graph

You can augment a mean-versus-time graph by adding additional information about the study at each time point. In Cheng's paper, much of the code was devoted to adding information about the number of patients that were measured at each time point.

In SAS 9.4, you can use the XAXISTABLE statement to add one or more rows of information to a graph. The output from PROC MEANS includes a variable named N, which gives the number of nonmissing measurements at each time. The following statements add information about the number of patients. The CLASS= option subsets the counts by the arm, and the COLORGROUP= option displays the text in the group colors.

title2 "Table with Participant Counts";
proc sgplot data=Summary;
series  x=VisitNum y=Mean / group=Armcd;
scatter x=VisitNum y=Mean / group=Armcd
        yerrorlower=LowerSEM yerrorupper=UpperSEM;
xaxistable N / location=inside class=Armcd colorgroup=Armcd
               title="Number of Patients" 
               valueattrs=(size=10) labelattrs=(size=10);
yaxis label='mean +/- SEM';
run;

In summary, SAS 9.4 makes it is easy to graph the mean response versus time for various arms of a clinical study. Cheng wrote his paper in 2007 using SAS 9.1.3, but there have been TONS of additions to the ODS Statistical Graphics system since then. This article shows that you can let PROC SGPLOT summarize the data and plot it by using the VLINE statement or the VBOX statement. Or you can summarize the data yourself and plot it by using the SERIES and SCATTER statements. For the summarized data, you can overlay tables of statistics such as the number of patients at each time point. Whichever method you choose, the SGPLOT procedure makes it easy to create the graphs of statistics versus time.

The post Graph the mean response versus time in SAS appeared first on The DO Loop.

October 2, 2019
 

The final phase in analytical model deployment is the perfect unsolved mystery. Why are 50% of analytics models never deployed? And why does it take three months or more to complete 90% of deployed models? What happened or didn’t happen to allow analytical insights to reach their potential? It’s a [...]

Mystery solved: The case for operationalizing analytics was published on SAS Voices by Lindsey Coombs.