October 9, 2019
 

What can data tell us about the easiest hole at your favorite golf course? Or which hole contributes the most to mastering the course? A golf instructor once told me golf is not my sport, and that my swing is hopeless, but that didn’t stop me from analyzing golf data. [...]

How to win the SAS Championship was published on SAS Voices by Frank Silva

October 9, 2019
 

As a fellow student, I know that making sure you get the right books for learning a new skill can be tough. To get you started off right, I would like to share the top SAS books that professors are requesting for students learning SAS. With this inside sneak peek, you can see what books instructors and professors are using to give new SAS users a jump-start with their SAS programming skills.

1. Learning SAS by Example: A Programmer's Guide, Second Edition

At the top of the list is Ron Cody’s Learning SAS by Example: A Programmer’s Guide, Second Edition. This book teaches SAS programming to new SAS users by building from very basic concepts to more advanced topics. Many programmers prefer examples rather than reference-type syntax, and so this book uses short examples to explain each topic. The new edition of this classic has been updated to SAS 9.4 and includes new chapters on PROC SGPLOT and Perl regular expressions. Check out this free excerpt for a glimpse into the way the book can help you summarize your data.

2. An Introduction to SAS University Edition

I cannot recommend this book highly enough for anyone starting out in data analysis. This book earns a place on my desk, within easy reach. - Christopher Battiston, Wait Times Coordinator, Women's College Hospital

The second most requested book will help you get up-and-running with the free SAS University Edition using Ron Cody’s easy-to-follow, step-by-step guide. This book is aimed at beginners who want to either use the point-and-click interactive environment of SAS Studio, or who want to write their own SAS programs, or both.

The first part of the book shows you how to perform basic tasks, such as producing a report, summarizing data, producing charts and graphs, and using the SAS Studio built-in tasks. The second part of the book shows you how to write your own SAS programs, and how to use SAS procedures to perform a variety of tasks. In order to get familiar with the SAS Studio environment, this book also shows you how to access dozens of interesting data sets that are included with the product.

For more insights into this great book, check out Ron Cody’s useful tips for SAS University Edition in this recent SAS blog.

3. The Little SAS Book: A Primer, Fifth Edition

Our third book is a classic that just keeps getting better. The Little SAS Book is essential for anyone learning SAS programming. Lora Delwiche and Susan Slaughter offer a user-friendly approach so readers can quickly and easily learn the most commonly used features of the SAS language. Each topic is presented in a self-contained two-page layout complete with examples and graphics. Also, make sure to check out some more tips on learning SAS from the authors in their blog post.

We are also excited to announce that the newest edition of The Little SAS Book is coming out this Fall! The sixth edition will be interface independent, so it won’t matter if you are using SAS Studio, SAS Enterprise Guide, or the SAS windowing environment as your programming interface. In this new edition, the authors have included more examples of creating and using permanent SAS data sets, as well as using PROC IMPORT to read data. The new edition also deemphasizes reading raw data files using the INPUT statement—a topic that is no longer covered in the new base SAS programmer certification exam. Check out the upcoming titles page for more information!

4. SAS Certification Prep Guide: Statistical Business Analysis Using SAS 9

Number four is a must-have study guide for the SAS Certified Statistical Business Analyst Using SAS 9 exam. Written for both new and experienced SAS programmers, the SAS Certification Prep Guide: Statistical Business Analysis Using SAS 9 is an in-depth prep guide for the SAS Certified Statistical Business Analyst Using SAS 9: Regression and Modeling exam. The authors step through identifying the business question, generating results with SAS, and interpreting the output in a business context. The case study approach uses both real and simulated data to help readers master the content of the certification exam. Each chapter also includes a quiz aimed at testing the reader’s comprehension of the material presented. To learn more about this great guide, watch an interview with co-author Joni Shreve.

5. SAS for Mixed Models: Introduction and Basic Applications

Models are a vital part of analyzing research data. It seems only fitting, then, that this popular SAS title would be our fifth most popular book requested by SAS instructors. Mixed models are now becoming a core part of undergraduate and graduate programs in statistics and data science. This book is great for those with intermediate-level knowledge of SAS and covers the latest capabilities for a variety of SAS applications. Be sure to read the review of this book by Austin Lincoln, a technical writer at SAS, for great insights into a book he calls a “survival guide” for creating mixed models.

Want more?

I hope this list will help in your search for a SAS book that will get you to the next step in your SAS education goals. To learn more about SAS Press, check out our up-and-coming titles, and to receive exclusive discounts make sure to subscribe to our newsletter.

Top 5 SAS Books for Students was published on SAS Users.

October 9, 2019
 

In a previous article, I mentioned that the VLINE statement in PROC SGPLOT is an easy way to graph the mean response at a set of discrete time points. I mentioned that you can choose three options for the length of the "error bars": the standard deviation of the data, the standard error of the mean, or a confidence interval for the mean. This article explains and compares these three options. Which one you choose depends on what information you want to convey to your audience. As I will show, some of the statistics are easier to interpret than others. At the end of this article, I tell you which statistic I recommend.

Sample data

The following DATA step simulates data at four time points. The data at each time point are normally distributed, but the mean, standard deviation, and sample size of the data vary for each time point.

data Sim;
label t = "Time";
array mu[4]    _temporary_ (80 78 78 79); /* mean */
array sigma[4] _temporary_ ( 1  2  2  3); /* std dev */
array N[4]     _temporary_ (36 32 28 25); /* sample size */
call streaminit(12345);
do t = 1 to dim(mu);
   do i = 1 to N[t];
      y = rand("Normal", mu[t], sigma[t]); /* Y ~ N(mu[t], sigma[t]) */
      output;
   end;
end;
run;
 
title "Response by Time";
ods graphics / width=400px height=250px;
proc sgplot data=Sim;
   vbox y / category=t connect=mean;
run;
[Figure: Box plots of data at four time points]

The box plot shows the schematic distribution of the data at each time point. The boxes use the interquartile range and whiskers to indicate the spread of the data. A line connects the means of the responses at each time point.

A box plot might not be appropriate if your audience is not statistically savvy. A simpler display is a plot of the mean for each time point and error bars that indicate the variation in the data. But what statistic should you use for the heights of the error bars? What is the best way to show the variation in the response variable?

Relationships between sample standard deviation, SEM, and CLM

Before I show how to plot and interpret the various error bars, I want to review the relationships between the sample standard deviation, the standard error of the mean (SEM), and the (half) width of the confidence interval for the mean (CLM). These statistics are all based on the sample standard deviation (SD). The SEM and width of the CLM are multiples of the standard deviation, where the multiplier depends on the sample size:

  • The SEM equals SD / sqrt(N). That is, the standard error of the mean is the standard deviation divided by the square root of the sample size.
  • The width of CLM is a multiple of the SEM. For large samples, the multiple for a 95% confidence interval is approximately 1.96. In general, suppose the significance level is α and you are interested in 100(1-α)% confidence limits. Then the multiplier is a quantile of the t distribution with N-1 degrees of freedom, often denoted by $t^*_{1-\alpha/2,\,N-1}$.
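To see how the multiplier in the second bullet varies with the sample size, here is a minimal sketch that evaluates the t quantile for the four sample sizes used in the simulation. The data set name Multiplier and the variable names alpha and tCrit are made up for this illustration.

```sas
/* Sketch: multiplier for a 100(1-alpha)% confidence interval of the mean.
   Multiplier, alpha, and tCrit are illustrative names, not from the article. */
data Multiplier;
alpha = 0.05;                                /* 95% confidence limits */
do N = 36, 32, 28, 25;                       /* sample sizes in the simulation */
   tCrit = quantile("T", 1 - alpha/2, N-1);  /* t quantile with N-1 degrees of freedom */
   output;
end;
run;

proc print data=Multiplier noobs;
   format tCrit 6.3;
run;
```

As N grows, the value of tCrit decreases toward the normal quantile, which is approximately 1.96.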

You can use PROC MEANS and a short DATA step to display the relevant statistics that show how these three statistics are related:

/* Optional: Compute SD, SEM, and half-width of CLM (not needed for plotting) */
proc means data=Sim noprint;
   class t;
   var y;
   output out=MeanOut N=N stderr=SEM stddev=SD lclm=LCLM uclm=UCLM;
run;
 
data Summary;
set MeanOut(where=(t^=.));
CLMWidth = (UCLM-LCLM)/2;   /* half-width of CLM interval */
run;
 
proc print data=Summary noobs label;
   format SD SEM CLMWidth 6.3;
   var T SD N SEM CLMWidth;
run;
[Table: Relationships between StdDev, SEM, and CLM]

The table shows the standard deviation (SD) and the sample size (N) for each time point. The SEM column is equal to SD / sqrt(N). The CLMWidth value is a little more than twice the SEM value. (The multiplier depends on N; for these data, it ranges from 2.03 to 2.06.)

As shown in the next section, the values in the SD, SEM, and CLMWidth columns are the lengths of the error bars when you specify the STDDEV, STDERR, and CLM values (respectively) for the LIMITSTAT= option on the VLINE statement in PROC SGPLOT.

Visualize and interpret the choices of error bars

Let's plot all three options for the error bars on the same scale, then discuss how to interpret each graph. Several interpretations use the 68-95-99.7 rule for normally distributed data. The following statements create the three line plots with error bars:

%macro PlotMeanAndVariation(limitstat=, label=);
   title "VLINE Statement: LIMITSTAT = &limitstat";
   proc sgplot data=Sim noautolegend;
      vline t / response=y stat=mean limitstat=&limitstat markers;
      yaxis label="&label" values=(75 to 82) grid;
   run;
%mend;
 
title "Mean Response by Time Point";
%PlotMeanAndVariation(limitstat=STDDEV, label=Mean +/- Std Dev);
%PlotMeanAndVariation(limitstat=STDERR, label=Mean +/- SEM);
%PlotMeanAndVariation(limitstat=CLM,    label=Mean and CLM);
[Figure: Display error bars for the means by using the standard deviation]

Use the standard deviations for the error bars

In the first graph, the length of the error bars is the standard deviation at each time point. This is the easiest graph to explain because the standard deviation is directly related to the data. The standard deviation is a measure of the variation in the data. If the data at each time point are normally distributed, then (1) about 68% of the data have values within the extent of the error bars, and (2) almost all the data lie within three times the extent of the error bars.
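You can check the 68-95-99.7 rule empirically on the simulated data. The following PROC SQL step is a rough sketch that computes, for each time point, the proportion of observations that lie within one sample standard deviation of the sample mean. The column name PropWithin1SD is made up for this illustration, and because the samples are small, the proportions will only be near 0.68, not exactly equal to it.

```sas
/* Sketch: proportion of observations within one SD of the mean, by time point.
   The inline view remerges the group mean (m) and std dev (s) with each row. */
proc sql;
select t, 
       count(*) as N,
       mean(abs(y - m) <= s) as PropWithin1SD format=5.3
from (select t, y, mean(y) as m, std(y) as s
      from Sim
      group by t)
group by t;
quit;
```

The expression abs(y - m) <= s evaluates to 1 or 0 for each row, so its mean is the proportion of observations inside the error bars.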

The main advantage of this graph is that a "standard deviation" is a term that is familiar to a lay audience. The disadvantage is that the graph does not display the accuracy of the mean computation. For that, you need one of the other statistics.

[Figure: Display error bars for the means by using the standard error of the mean]

Use the standard error for the error bars

In the second graph, the length of the error bars is the standard error of the mean (SEM). This is harder to explain to a lay audience because it is an inferential statistic. A qualitative explanation is that the SEM shows the accuracy of the mean computation. Small SEMs imply better accuracy than larger SEMs.

A quantitative explanation requires using advanced concepts such as "the sampling distribution of the statistic" and "repeating the experiment many times." For the record, the SEM is an estimate of the standard deviation of the sampling distribution of the mean. Recall that the sampling distribution of the mean can be understood in terms of repeatedly drawing random samples from the population and computing the mean for each sample. The standard error is defined as the standard deviation of the distribution of the sample means.
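The idea of "repeating the experiment many times" can be illustrated with a small simulation. The sketch below assumes the parameters of the first time point (mean 80, standard deviation 1, N=36) and draws 1,000 samples; the data set names Samples and SampleMeans, the seed, and the number of samples are arbitrary choices for this illustration.

```sas
/* Sketch: approximate the sampling distribution of the mean by simulation */
data Samples;
call streaminit(54321);
do SampleID = 1 to 1000;            /* 1,000 repeated "experiments" */
   do i = 1 to 36;                  /* each sample has N=36 observations */
      y = rand("Normal", 80, 1);
      output;
   end;
end;
run;

/* compute the mean of each sample */
proc means data=Samples noprint;
   by SampleID;
   var y;
   output out=SampleMeans mean=SampleMean;
run;

/* the std dev of the sample means estimates the standard error */
proc means data=SampleMeans N mean std;
   var SampleMean;
run;
```

The standard deviation of the 1,000 sample means should be close to sigma/sqrt(N) = 1/6, which is the quantity that the SEM estimates from a single sample.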

The exact meaning of the SEM might be difficult to explain to a lay audience, but the qualitative explanation is often sufficient.

[Figure: Display error bars for the means by using confidence intervals]

Use a confidence interval of the mean for the error bars

In the third graph, the length of the error bars is a 95% confidence interval for the mean. This graph also displays the accuracy of the mean, but these intervals are about twice as long as the intervals for the SEM.

The confidence interval for the mean is hard to explain to a lay audience. Many people incorrectly think that "there is a 95% chance that the population mean is in this interval." That statement is wrong: either the population mean is in the interval or it isn't. There is no probability involved! The words "95% confidence" refer to repeating the experiment many times on random samples and computing a confidence interval for each sample. The true population mean will be in about 95% of the confidence intervals.
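The repeated-sampling interpretation can also be illustrated by simulation. The following sketch assumes the parameters of the fourth time point (mean 79, standard deviation 3, N=25), computes a 95% confidence interval for each of 2,000 samples, and estimates the proportion of intervals that contain the true mean. The data set names, the seed, and the number of samples are arbitrary choices for this illustration.

```sas
/* Sketch: estimate the coverage of the 95% confidence interval for the mean */
data CISim;
call streaminit(2019);
do SampleID = 1 to 2000;
   do i = 1 to 25;
      y = rand("Normal", 79, 3);    /* true population mean is 79 */
      output;
   end;
end;
run;

/* 95% confidence limits for each sample (ALPHA=0.05 is the default) */
proc means data=CISim noprint;
   by SampleID;
   var y;
   output out=CI lclm=Lower uclm=Upper;
run;

data Coverage;
set CI;
Covers = (Lower <= 79 <= Upper);    /* 1 if the interval contains the true mean */
run;

proc means data=Coverage mean;
   var Covers;                      /* proportion of intervals that cover 79 */
run;
```

The mean of the Covers variable should be close to 0.95, although it will vary with the seed and the number of samples.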

Conclusions

In summary, there are three common statistics that are used to overlay error bars on a line plot of the mean: the standard deviation of the data, the standard error of the mean, and a 95% confidence interval for the mean. The error bars convey the variation in the data and the accuracy of the mean estimate. Which one you use depends on the sophistication of your audience and the message that you are trying to convey.

My recommendation? Despite the fact that confidence intervals can be misinterpreted, I think that the CLM is the best choice for the size of the error bars (the third graph). If I am presenting to a statistical audience, the audience understands the CLMs. For a less sophisticated audience, I do not dwell on the probabilistic interpretation of the CLM but merely say that the error bars "indicate the accuracy of the mean."

As explained previously, each choice has advantages and disadvantages. What choice do you make and why? You can share your thoughts by leaving a comment.

The post What statistic should you use to display error bars for a mean? appeared first on The DO Loop.

October 7, 2019
 

It is always great to read an old paper or blog post and think, "This task is so much easier in SAS 9.4!" I had that thought recently when I stumbled on a 2007 paper by Wei Cheng titled "Graphical Representation of Mean Measurement over Time." A substantial portion of the eight-page paper is SAS code for creating a graph of the mean responses over time for patients in two arms of a clinical trial. (An arm is a group of participants who receive an intervention or who receive no intervention, such as an experimental group and the control group.)

The graph to the right is a modern version of one graph that Cheng created. This graph is created by using PROC SGPLOT. This article shows how to create this and other graphs that visualize the mean response by time for groups in a clinical trial.

This article assumes that the data are measured at discrete time points. If time is a continuous variable, you can model the mean response by using a regression model, and you can use the EFFECTPLOT statement to graph the predicted mean response versus time.

Sample clinical data

Cheng did not include his sample data, but the following DATA step defines fake data for 11 patients, five in one arm and six in the other. The data produce graphs that are similar to the graphs in Cheng's paper.

data study;
input Armcd $ SubjID $ y1-y5;                /* read data in wide form */
label VisitNum = 'Visit' Armcd = "Treatment";
VisitNum=1; y=y1; output;                    /* immediately transform data to long form */
VisitNum=2; y=y2; output;
VisitNum=3; y=y3; output;
VisitNum=4; y=y4; output;
VisitNum=5; y=y5; output;
drop y1-y5;
datalines;
A 001 135 138 135 134  .
A 002 142 140 141 139 138
A 003 140 137 136 135 133
A 004 131 131 130 131 130
A 005 128 125  .  121 121
B 006 125 120 115 110 105
B 007 139 134 128 128 122
B 008 136 129 126 120 111
B 009 128 125 127 133 136
B 010 120 114 112 110  96
B 011 129 122 120 119  .
;

Use the VLINE statement for mean and variation

The VLINE statement in PROC SGPLOT can summarize data across groups. When you use the RESPONSE= and STAT= options, it can display the mean, median, count, or percentage of a response variable. You can add "error bars" to the graph by using the LIMITSTAT= option. Following Cheng, the error bars indicate the standard error of the mean (SEM). The following statements create the line plot shown at the top of this article:

/* simplest way to visualize means over time for each group */
title "Mean Response by Arm";
proc sgplot data=study;
   vline VisitNum / response=y group=Armcd stat=mean limitstat=stderr;
   yaxis label='Mean +/- SEM';
run;

That was easy! Notice that the VLINE statement computes the mean and standard error for Y for each combination of the VisitNum and Armcd variables.

This graph shows the standard error of the mean, but you could also show confidence limits for the mean (LIMITSTAT=CLM) or indicate the extent of one or more standard deviations (LIMITSTAT=STDDEV and use the NUMSTD= option).
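For example, the following statements are a short sketch of the two alternatives for the same data. The titles and axis labels are illustrative; the choice of NUMSTD=2 is arbitrary.

```sas
/* error bars are 95% confidence limits for the mean */
title "Mean Response by Arm: Confidence Limits";
proc sgplot data=study;
   vline VisitNum / response=y group=Armcd stat=mean limitstat=clm;
   yaxis label='Mean and 95% CLM';
run;

/* error bars extend two standard deviations from the mean */
title "Mean Response by Arm: Two Standard Deviations";
proc sgplot data=study;
   vline VisitNum / response=y group=Armcd stat=mean 
                    limitstat=stddev numstd=2;
   yaxis label='Mean +/- 2 SD';
run;
```

Because the arms have only five or six patients, both sets of error bars will be much longer than the SEM bars.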

An alternative plot: Box plots by time

Cheng's graph is appropriate when the intended audience for the graph includes people who might not be experts in statistics. For a more sophisticated audience, you could create a series of box plots and connect the means of the box plots. In this plot, the CATEGORY= option is used to specify the time-like variable and the GROUP= option is used to specify the arms of the study. (Learn about the difference between categories and groups in box plots.)

/* box plots connected by means */
title "Response by Arm";
proc sgplot data=study;
   vbox y / category=VisitNum group=Armcd groupdisplay=cluster 
            connect=mean clusterwidth=0.35;
run;

Whereas the first graph emphasizes the mean value of the responses, the box plot emphasizes the individual responses. The mean responses are connected by lines. The boxes show the interquartile range (Q1 and Q3) as well as the median response. Whiskers and outliers indicate the spread of the data.

Graph summarized statistics

In the previous sections, the VLINE and VBOX statements automatically summarized the data for each time point and for each arm of the study. This is very convenient, but the SGPLOT statements support only a limited number of statistics such as the mean and median. For more control over the statistics, you can use PROC MEANS or PROC UNIVARIATE to summarize the data and then use the SERIES statement to plot the statistics and (optionally) use the SCATTER statement to plot error bars for the statistic.

PROC MEANS supports dozens of descriptive statistics, but, for easy comparison, I will show how to create the same graph by using summarized data. The following call to PROC MEANS creates an output data set that contains statistics for each visit/arm combination.

proc means data=study N mean stderr stddev lclm uclm NDEC=2;
   class Armcd VisitNum;
   var y;
   output out=MeanOut N=N mean=Mean stderr=SEM stddev=SD lclm=LCLM uclm=UCLM;
run;

The output data set (MeanOut) contains all the information in the table, plus additional "marginal" information that summarizes the means across all arms (for each visit), across all visits (for each arm), and for the entire study. When you use the MeanOut data set, you should use a WHERE clause to specify which information you want to analyze. For this example, we want only the information for the Armcd/VisitNum combinations. You can run a simple DATA step to subset the output and to create variables for the values Mean +/- SEM, as follows:

/* compute lower/upper bounds as Mean +/- SEM */
data Summary;
set MeanOut(where=(Armcd^=" " & VisitNum^=.));
LowerSEM = Mean - SEM;
UpperSEM = Mean + SEM;
run;
 
/* create a graph of summary statistics that is similar to the VLINE graph */
title2 "Presummarized Data";
proc sgplot data=Summary;
series  x=VisitNum y=Mean / group=Armcd;
scatter x=VisitNum y=Mean / group=Armcd
        yerrorlower=LowerSEM yerrorupper=UpperSEM;
run;

You can use this technique to create graphs of other statistics versus time.
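For instance, the Summary data set already contains the LCLM and UCLM variables from PROC MEANS, so a minimal variation of the previous graph displays 95% confidence limits instead of the SEM. The titles and axis label in this sketch are illustrative.

```sas
/* sketch: plot the mean and 95% confidence limits from presummarized data */
title "Mean Response by Arm";
title2 "Presummarized Data: Confidence Limits";
proc sgplot data=Summary;
series  x=VisitNum y=Mean / group=Armcd;
scatter x=VisitNum y=Mean / group=Armcd
        yerrorlower=LCLM yerrorupper=UCLM;
yaxis label='Mean and 95% CLM';
run;
```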

Adding tabular information to a mean-versus-time graph

You can augment a mean-versus-time graph by adding additional information about the study at each time point. In Cheng's paper, much of the code was devoted to adding information about the number of patients that were measured at each time point.

In SAS 9.4, you can use the XAXISTABLE statement to add one or more rows of information to a graph. The output from PROC MEANS includes a variable named N, which gives the number of nonmissing measurements at each time. The following statements add information about the number of patients. The CLASS= option subsets the counts by the arm, and the COLORGROUP= option displays the text in the group colors.

title2 "Table with Participant Counts";
proc sgplot data=Summary;
series  x=VisitNum y=Mean / group=Armcd;
scatter x=VisitNum y=Mean / group=Armcd
        yerrorlower=LowerSEM yerrorupper=UpperSEM;
xaxistable N / location=inside class=Armcd colorgroup=Armcd
               title="Number of Patients" 
               valueattrs=(size=10) labelattrs=(size=10);
yaxis label='mean +/- SEM';
run;

In summary, SAS 9.4 makes it easy to graph the mean response versus time for various arms of a clinical study. Cheng wrote his paper in 2007 using SAS 9.1.3, but there have been TONS of additions to the ODS Statistical Graphics system since then. This article shows that you can let PROC SGPLOT summarize the data and plot it by using the VLINE statement or the VBOX statement. Or you can summarize the data yourself and plot it by using the SERIES and SCATTER statements. For the summarized data, you can overlay tables of statistics such as the number of patients at each time point. Whichever method you choose, the SGPLOT procedure makes it easy to create the graphs of statistics versus time.

The post Graph the mean response versus time in SAS appeared first on The DO Loop.

October 2, 2019
 

The final phase in analytical model deployment is the perfect unsolved mystery. Why are 50% of analytics models never deployed? And why does it take three months or more to complete 90% of deployed models? What happened or didn’t happen to allow analytical insights to reach their potential? It’s a [...]

Mystery solved: The case for operationalizing analytics was published on SAS Voices by Lindsey Coombs

October 2, 2019
 

I frequently see questions on SAS discussion forums about how to compute the geometric mean and related quantities in SAS. Unfortunately, the answers to these questions are sometimes confusing or even wrong. In addition, some published papers and web sites that claim to show how to calculate the geometric mean in SAS contain wrong or misleading information.

This article shows how to compute the geometric mean, the geometric standard deviation, and the geometric coefficient of variation in SAS. It first shows how to use PROC TTEST to compute the geometric mean and the geometric coefficient of variation. It then shows how to compute several geometric statistics in the SAS/IML language.

For an introduction to the geometric mean, see "What is a geometric mean." For information about the (arithmetic) coefficient of variation (CV) and its applications, see the article "What is the coefficient of variation?"

Compute the geometric mean and geometric CV in SAS

As discussed in my previous article, the geometric mean arises naturally when positive numbers are being multiplied and you want to find the average multiplier. Although the geometric mean can be used to estimate the "center" of any set of positive numbers, it is frequently used to estimate average values in a set of ratios or to compute an average growth rate.

The TTEST procedure is the easiest way to compute the geometric mean (GM) and geometric CV (GCV) of positive data. To demonstrate this, the following DATA step simulates 100 random observations from a lognormal distribution. PROC SGPLOT shows a histogram of the data and overlays a vertical line at the location of the geometric mean.

%let N = 100;
data Have;
call streaminit(12345);
do i = 1 to &N;
   x = round( rand("LogNormal", 3, 0.8), 0.1);    /* generate positive values */
   output;
end;
run;
 
title "Geometric Mean of Skewed Positive Data";
proc sgplot data=Have;
   histogram x / binwidth=10 binstart=5 showbins;
   refline 20.2 / axis=x label="Geometric/Mean" splitchar="/" labelloc=inside
                  lineattrs=GraphData2(thickness=3);
   xaxis values=(0 to 140 by 10);
   yaxis offsetmax=0.1;
run;

Where is the "center" of these data? That depends on your definition. The mode of this skewed distribution is close to x=15, but the arithmetic mean is about 26.4. The mean is pulled upwards by the long right tail. It is a mathematical fact that the geometric mean of positive data is never greater than the arithmetic mean (the two are equal only when all the values are the same). For these data, the geometric mean is 20.2.

To compute the geometric mean and geometric CV, you can use the DIST=LOGNORMAL option on the PROC TTEST statement, as follows:

proc ttest data=Have dist=lognormal; 
   var x;
   ods select ConfLimits;
run;

The geometric mean, which is 20.2 for these data, estimates the "center" of the data. Notice that the procedure does not report the geometric standard deviation (or variance), but instead reports the geometric coefficient of variation (GCV), which has the value 0.887 for this example. The documentation for the TTEST procedure explains why the GCV is the better measure of variation: "For lognormal data, the CV is the natural measure of variability (rather than the standard deviation) because the CV is invariant to multiplication of [the data] by a constant."
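The invariance property is easy to check numerically. The following SAS/IML statements are a small sketch that uses the common formula GCV = sqrt(exp(s^2) - 1), where s is the standard deviation of the log-transformed data. The module name GeoCV and the scale factor 10 are made up for this illustration.

```sas
proc iml;
use Have; read all var "x"; close;

/* GeoCV is a hypothetical helper module, not a built-in function */
start GeoCV(v);
   s = std(log(v));                 /* std dev of the log-transformed data */
   return( sqrt(exp(s##2) - 1) );
finish;

GCV_x   = GeoCV(x);                 /* geometric CV of the original data */
GCV_10x = GeoCV(10*x);              /* geometric CV after rescaling */
print GCV_x GCV_10x;
quit;
```

The two values agree (up to rounding) because log(10*x) = log(10) + log(x), and adding a constant does not change the standard deviation.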

You might wonder whether data need to be lognormally distributed to use this table. The answer is that the data do not need to be lognormally distributed to use the geometric mean and geometric CV. However, the 95% confidence intervals for these quantities assume log-normality.

Definitions of geometric statistics

As T. Kirkwood points out in a letter to the editors of Biometrics (Kirkwood, 1979), if data are lognormally distributed as LN(μ, σ), then

  • The quantity GM = exp(μ) is the geometric mean. It is estimated from a sample by the quantity exp(m), where m is the arithmetic mean of the log-transformed data.
  • The quantity GSD = exp(σ) is defined to be the geometric standard deviation. The sample estimate is exp(s), where s is the standard deviation of the log-transformed data.
  • The geometric standard error (GSE) is defined by exponentiating the standard error of the mean of the log-transformed data. Geometric confidence intervals are handled similarly.
  • Kirkwood's proposal for the geometric coefficient of variation (GCV) is not generally used. Instead, the accepted definition of the GCV is GCV = sqrt(exp(σ^2) – 1), which is the definition that is used in SAS. The estimate for the GCV is sqrt(exp(s^2) – 1).

You can use these formulas to compute the geometric statistics for any positive data. However, only for lognormal data do the statistics have a solid theoretical basis: transform to normality, compute a statistic, apply the inverse transform.

Compute the geometric mean in SAS/IML

You can use the SAS/IML language to compute the geometric mean and other "geometric statistics" such as the geometric standard deviation and the geometric CV. The GEOMEAN function is a built-in SAS/IML function, but the other statistics are implemented by explicitly computing statistics of the log-transformed data, as described in the previous section:

proc iml;
use Have; read all var "x"; close;  /* read in positive data */
GM = geomean(x);               /* built-in GEOMEAN function */
print GM;
 
/* To estimate the geometric mean and geometric StdDev, compute
   arithmetic estimates of log(X), then EXP transform the results. */
n = nrow(x);
z = log(x);                  /* log-transformed data */
m = mean(z);                 /* arithmetic mean of log(X) */
s = std(z);                  /* arithmetic std dev of log(X) */
GM2 = exp(m);                /* same answer as GEOMEAN function */
GSD = exp(s);                /* geometric std dev */
GCV = sqrt(exp(s**2) - 1);   /* geometric CV */
print GM2 GSD GCV;

Note that the GM and GCV match the output from PROC TTEST.

What does the geometric standard deviation mean? As for the arithmetic standard deviation, you need to start by thinking about the location of the center, which in this case is the geometric mean (20.2). If the data are normally distributed, then about 68% of the data are within one standard deviation of the mean, which is the interval [m-s, m+s]. For lognormal data, about 68% of the data should be in the interval [GM/GSD, GM*GSD] and, in fact, 65 out of 100 of the simulated observations are in that interval. Similarly, about 95% of lognormal data should be in the interval [GM/GSD^2, GM*GSD^2]. For the simulated data, 94 out of 100 observations are in the interval.
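The counts quoted above can be checked with a few SAS/IML statements. The following is a minimal sketch that recomputes the GM and GSD and counts the observations in each interval; the variable names in1 and in2 are made up for this illustration.

```sas
proc iml;
use Have; read all var "x"; close;
GM  = geomean(x);                   /* geometric mean */
GSD = exp( std(log(x)) );           /* geometric std dev */

/* count observations in [GM/GSD, GM*GSD] and [GM/GSD^2, GM*GSD^2] */
in1 = sum( (x >= GM/GSD)      & (x <= GM*GSD) );
in2 = sum( (x >= GM/(GSD##2)) & (x <= GM*(GSD##2)) );
print in1 in2;
quit;
```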

I am not aware of a similar interpretation of the geometric coefficient of variation. The GCV is usually used to compare two samples. As opposed to the confidence intervals in the previous paragraph, the GCV does not make any reference to the geometric mean of the data.

Other ways to compute the geometric mean

The methods in this article are the simplest ways to compute the geometric mean in SAS, but there are other ways.

  • You can use the DATA step to log-transform the data, use PROC MEANS to compute the descriptive statistics of the log-transformed data, then use the DATA step to exponentiate the results.
  • PROC SURVEYMEANS can compute the geometric mean (with confidence intervals) and the standard error of the geometric mean for survey responses. However, the variance of survey data is not the same as the variance of a random sample, so you should not use the standard error statistic unless you have survey data.
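The first approach in the list can be sketched as follows. The data set names (LogHave, LogStats, GeoStats) and the variable name logX are made up for this illustration; the formulas are the ones from the section on definitions.

```sas
/* 1. log-transform the positive data */
data LogHave;
set Have;
logX = log(x);
run;

/* 2. arithmetic mean and std dev of the log-transformed data */
proc means data=LogHave noprint;
   var logX;
   output out=LogStats mean=m std=s;
run;

/* 3. exponentiate to obtain the geometric statistics */
data GeoStats;
set LogStats;
GM  = exp(m);                       /* geometric mean */
GSD = exp(s);                       /* geometric std dev */
GCV = sqrt(exp(s**2) - 1);          /* geometric CV */
keep GM GSD GCV;
run;

proc print data=GeoStats noobs;
run;
```

The GM and GCV values should match the output from PROC TTEST with the DIST=LOGNORMAL option.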

As I said earlier, there is some bad information out there on the internet about this topic, so beware. A site that seems to get all the formulas correct and present the information in a reasonable way is Alex Kritchevsky's blog.

You can download the complete SAS program that I used to compute the GM, GSD, and GCV. The program also shows how to compute confidence intervals for these quantities.

The post Compute the geometric mean, geometric standard deviation, and geometric CV in SAS appeared first on The DO Loop.

October 1, 2019
 

Did you know the first SAS® Users Group event took place before SAS was incorporated as a company? In 1976, hundreds of early SAS users gathered in sunny Kissimmee, FL to share tips and offer feedback before SAS was even officially a company. Our users have continued to influence the [...]

Customer experience matters was published on SAS Voices by Randy Guard

October 1, 2019
 

On behalf of the entire global Customer Contact Center, “Happy CX Day!” to all our SAS users!

Customer Experience Day (aka #CXDay2019) is one of our favorite days of the year—when we can reflect on customer interactions, questions and feedback from the past year—and look to the year ahead for ways to enhance our customers’ experience, like expanding our support options and helping drive improvements to our website and self-service offerings.

Plus, it gives us another excuse to celebrate all of our wonderful SAS users and partners! Whoohoo!

The opportunity to join our users on their journeys with SAS—from acquiring SAS, to learning, updating and renewing it, to attending SAS events, and beyond---is one of our favorite aspects of our jobs!

To show our appreciation, we want to share a little love.

 

 

…they are curious and passionate, and ask great questions that help us learn something new each day! - Mary

…they are passionate and dedicated to learning SAS! I love how users are always willing to help with new users’ questions in our SAS Communities groups! - Antionen

…they’re using SAS in such innovative and amazing ways to help make the world a better place. - Tricia

…we get satisfaction by providing the answers to tough questions. It makes our jobs worthwhile knowing that we have added personal value to anyone who has reached out to SAS. - Lida

…by caring for people, we’re a part of making the unimaginable possible. - Keila

…of the exciting ways they’re applying SAS to change lives – from new advancements in cancer research, clinical trials and drug testing, learning about species and ecosystems in efforts to protect endangered species and biodiversity, to impacting young lives by using advanced analytics to measure, as well as impact, student progress in K-12. - Lisa

Not familiar with the Customer Contact Center? We’re the folks who answer your SAS inquiries and point you in the right direction to get the help you need. Well, that’s part of what we do! We don’t just answer questions, we’re also listening to you and looking at ways to make things easier to navigate, simpler to find, and faster to share.

Get to know us better! Fun facts about our team:

  • We’re located in four SAS offices:
  • Collectively, our team speaks and supports 17 languages
  • …and supports nearly 100 countries
  • Some engagement professionals on our team speak more than three languages
  • We collaborate with just about every team at SAS
  • In 2018, we received over 113,000 inquiries worldwide
  • ~55% of SAS customers choose live chat as their communication channel
  • So far this year, we’ve received over 80,000 inquiries from around the world
  • The two most common SAS topics we’re asked about are SAS Training and Analytics U options

If you need any help, want to share feedback, or simply want to talk SAS, please reach out to us!

You can chat with us live on the SAS website, tweet us @SAS_Cares, or contact us via phone, email, or web form.

Want to collaborate with other SAS users? Search for answers or post questions in the SAS Support Communities. This is a great resource for your usage questions! If you're new to SAS, consider frequenting the New SAS User Community, where friendly, knowledgeable volunteers like KurtBremser are eager to help.

Wishing you a fabulous CX Day!

Happy Customer Experience Day! was published on SAS Users.

September 30, 2019
 

There are several different kinds of means. They all try to find an average value from among a set of numbers. Although the most popular mean is the arithmetic mean, the geometric mean can be useful for problems in statistics, finance, and biology. A common application of the geometric mean is to find an average growth rate for an asset or for a population.

What is the geometric mean?

The geometric mean of n nonnegative numbers is the nth root of the product of the numbers:
\[ \mathrm{GM} = (x_1 x_2 \cdots x_n)^{1/n} = \left( \prod_{i=1}^{n} x_i \right)^{1/n} \]
When the numbers are all positive, you can also obtain the geometric mean by computing the arithmetic mean of the log-transformed data and then using the exponential function to back-transform the result: \( \mathrm{GM} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \log x_i \right) \).
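The two formulas above can be compared directly. The following is a minimal sketch in Python (not the SAS program from the companion post); the function names and the sample values are hypothetical and chosen only for illustration.

```python
import math

def geometric_mean_product(xs):
    """n-th root of the product; valid for nonnegative numbers."""
    return math.prod(xs) ** (1.0 / len(xs))

def geometric_mean_log(xs):
    """exp of the arithmetic mean of the logs; requires strictly positive numbers."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

values = [1.0, 4.0, 16.0]          # hypothetical data; the product is 64
gm_product = geometric_mean_product(values)
gm_log = geometric_mean_log(values)

# Both forms agree up to floating-point rounding (the exact answer is 4).
print(gm_product, gm_log)
```

The product form returns 0 when any value is 0, whereas the log form raises an error because log(0) is undefined. For long lists the log form is usually the safer choice, because a running product can overflow or underflow.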

Physical interpretation of the geometric mean

The geometric mean has a useful interpretation in terms of the volume of an n-dimensional rectangular solid. If GM is the geometric mean of n positive numbers, then GM is the length of the side of an n-dimensional cube that has the same volume as the rectangular solid with side lengths \(x_1, x_2, \ldots, x_n\). For example, a rectangular solid with sides 1.5, 2, and 3 has a volume V = 9. The geometric mean of those three numbers is \( ((1.5)(2)(3))^{1/3} \approx 2.08 \). The volume of a cube with sides of length 2.08 is approximately 9.

This interpretation in terms of volume is analogous to interpreting the arithmetic mean in terms of length. Namely, if you have n line segments of lengths \(x_1, x_2, \ldots, x_n\), then the total length of the segments is the same as the length of n copies of a segment of length AM, where AM is the arithmetic mean.
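Both interpretations can be checked numerically for the rectangular solid with sides 1.5, 2, and 3. This is a small Python sketch for illustration; the variable names are assumptions, not part of the original post.

```python
import math

sides = [1.5, 2.0, 3.0]
n = len(sides)

# Volume interpretation: the cube with side GM has the same volume as the solid.
volume = math.prod(sides)
cube_side = volume ** (1.0 / n)        # geometric mean of the side lengths
cube_volume = cube_side ** n

# Length interpretation: n segments of length AM have the same total length.
total_length = sum(sides)
arith_mean = total_length / n

print(f"volume = {volume}, GM = {cube_side:.4f}, cube volume = {cube_volume:.4f}")
print(f"total length = {total_length}, n * AM = {n * arith_mean:.4f}")
```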

What is the geometric mean good for?

The geometric mean can be used to estimate the "center" of any set of positive numbers but is frequently used to estimate an average value in problems that deal with growth rates or ratios. For example, the geometric mean is an average growth rate for an asset or for a population. The following example uses the language of finance, although you can replace "initial investment" by "initial population" if you are interested in population growth.

In precalculus, growth rates are introduced in terms of the growth of an initial investment amount (the principal) compounded yearly at a fixed interest rate, r. After n years, the principal (P) is worth
\[ A_n = P(1 + r)^n \]
The quantity 1 + r sometimes confuses students. It appears because when you add the principal (P) and the interest (Pr), you get P(1 + r).

In precalculus, the interest rate is assumed to be a constant, which is fine for fixed-rate investments like bank CDs. However, many investments have a growth rate that is not fixed but varies from year to year. If the growth rate of the investment is \(r_1\) during the first year, \(r_2\) during the second year, and \(r_n\) during the nth year, then after n years the investment is worth
\[ A_n = P(1 + r_1)(1 + r_2) \cdots (1 + r_n) = P \prod_{i=1}^{n} x_i, \quad \text{where } x_i = 1 + r_i. \]

What is the average growth rate for the investment? One interpretation of the average growth rate is the fixed rate that would give the same return after n years. That hypothetical fixed-rate growth is found by using the geometric mean of the values \(x_1, x_2, \ldots, x_n\). That is, if GM is the geometric mean of the \(x_i\), the value
\[ A_n = P \cdot \mathrm{GM}^n, \]
which assumes a fixed interest rate, is exactly the same as the value from the varying-rate computation.
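To make the equivalence concrete, here is a minimal Python sketch. The principal and the three yearly rates are hypothetical values chosen for illustration; each growth factor is one plus the yearly rate.

```python
import math

principal = 1000.0
rates = [0.10, -0.05, 0.20]            # hypothetical yearly growth rates
factors = [1.0 + r for r in rates]     # growth factor for each year: 1 + rate
n = len(factors)

# Varying-rate computation: multiply the principal by every yearly factor.
final_value = principal * math.prod(factors)

# Fixed-rate computation: compound the geometric mean of the factors n times.
gm = math.prod(factors) ** (1.0 / n)
fixed_rate_value = principal * gm ** n

# For comparison, compounding the arithmetic mean of the factors overstates the result.
am = sum(factors) / n
am_value = principal * am ** n

print(f"varying rates: {final_value:.2f}")
print(f"fixed rate GM - 1 = {gm - 1:.4%}: {fixed_rate_value:.2f}")
print(f"arithmetic mean of factors: {am_value:.2f}")
```

The first two printed values agree up to rounding, which is exactly the property that defines the average growth rate. The third value is larger, which shows why the arithmetic mean is the wrong average for compounded growth.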

An example of the geometric mean: The growth rate of gold

Let's apply the ideas in the preceding section. Suppose that you bought $1000 of gold on Jan 1, 2010. The following table gives the yearly rate of return for gold during the years 2010–2018, along with the value of the $1000 investment at the end of each year.

According to the table, the value of the investment after 9 years is $1160.91, which represents a total return of about 16%. What is the fixed rate that would give the same return after 9 years when compounded annually? That is found by computing the geometric mean of the numbers in the third column:
\[ \mathrm{GM} = (1.2774 \cdot 1.1165 \cdots 0.9885)^{1/9} = 1.01672 \]
In other words, the investment in gold yielded the same return as a fixed-rate bank CD at 1.672% that is compounded yearly for 9 years. The end-of-year values for both investments are shown in the following graph.
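The quoted figures can be checked with a few lines of arithmetic. The sketch below is in Python and uses only the numbers stated in the text (the $1000 principal, the 9-year horizon, the final value of $1160.91, and the geometric mean 1.01672). It does not reproduce the yearly returns from the table.

```python
principal = 1000.0
years = 9
reported_gm = 1.01672          # geometric mean quoted in the text
reported_final = 1160.91       # final value of the gold investment quoted in the text

# End-of-year values of a fixed-rate CD that compounds at the geometric-mean rate.
cd_values = [principal * reported_gm ** k for k in range(1, years + 1)]

# Going the other way: the fixed rate implied by the total 9-year return.
implied_gm = (reported_final / principal) ** (1.0 / years)

for year, value in enumerate(cd_values, start=1):
    print(f"end of year {year}: {value:8.2f}")
print(f"implied yearly factor: {implied_gm:.5f}")
```

The last CD value matches the final gold value to within rounding of the five-digit geometric mean, and the implied yearly factor rounds to the same 1.01672.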

The geometric mean in statistics

The geometric mean arises naturally in situations in which quantities are multiplied together. This happens so often that there is a probability distribution, called the lognormal distribution, that models this situation. If Z has a normal distribution, then you can obtain a lognormal distribution by applying the exponential transformation: X = exp(Z) is lognormal. The Wikipedia article for the lognormal distribution states, "The lognormal distribution is important in the description of natural phenomena... because many natural growth processes are driven by the accumulation of many small percentage changes." For lognormal data, the geometric mean is often more useful than the arithmetic mean. In my next blog post, I will show how to compute the geometric mean and other associated statistics in SAS.
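A short simulation illustrates the connection between the lognormal distribution and the geometric mean. This is a Python sketch under stated assumptions: the seed, the sample size, and the parameters mu = 0 and sigma = 1 are arbitrary choices, not values from the original post.

```python
import math
import random

rng = random.Random(2019)              # arbitrary seed so the run is reproducible
mu, sigma = 0.0, 1.0                   # hypothetical parameters of the normal variable Z
z = [rng.gauss(mu, sigma) for _ in range(10_000)]
x = [math.exp(v) for v in z]           # X = exp(Z) is lognormal

# Geometric mean of X, computed on the log scale.
geo_mean = math.exp(sum(math.log(v) for v in x) / len(x))

# Arithmetic mean of X, which is pulled upward by the long right tail.
arith_mean = sum(x) / len(x)

# Back-transforming the sample mean of Z gives the same number as the geometric mean of X.
back_transformed = math.exp(sum(z) / len(z))

print(f"geometric mean of X:  {geo_mean:.4f}")
print(f"exp(mean of Z):       {back_transformed:.4f}")
print(f"arithmetic mean of X: {arith_mean:.4f}")
```

The geometric mean of the lognormal sample equals the exponential of the sample mean of Z, so it estimates exp(mu), which is the median of the lognormal distribution. The arithmetic mean instead estimates exp(mu + sigma²/2), which is larger whenever sigma is positive.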

The post What is a geometric mean? appeared first on The DO Loop.