
Sometimes SAS programmers ask how to analyze quantiles with SAS. Common questions include:

  • How can I compute 95% confidence intervals for a median in SAS?
  • How can I test whether the medians of two independent samples are significantly different?
  • How can I repeat the previous analyses with other percentiles, such as the 20th percentile or the 90th percentile?

Historically, PROC UNIVARIATE and PROC NPAR1WAY are the two procedures in SAS that analysts have used for univariate analysis. PROC UNIVARIATE performs standard parametric tests. In contrast, PROC NPAR1WAY performs nonparametric tests and distribution-free analyses. An internet search reveals many resources that describe how to use UNIVARIATE and NPAR1WAY for analyzing quantiles.

However, there is an alternative way to analyze univariate quantiles: PROC QUANTREG. Although QUANTREG is designed for quantile regression, the same procedure can easily analyze quantiles of univariate data. All you need to do is omit the regressors from the right side of the MODEL statement and the procedure will analyze the "response" variable.

Be aware that the QUANTREG procedure uses an optimization algorithm to perform its analysis. This can sometimes result in different estimates than a traditional computation. For example, if the data set has an even number of observations and the middle values are a and b, one estimate for the median is the average of the two middle values (a+b)/2. The QUANTREG procedure might provide a different estimate, which could be any value in [a, b]. This difference is most noticeable in small samples. (Don't let this bother you too much. There are many definitions for quantile estimates. SAS supports five different definitions for calculating quantiles.)
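
If you are curious about those definitions, the PCTLDEF= option in PROC UNIVARIATE selects among them. The following statements are a minimal sketch; the data set and the definition number are arbitrary choices:

/* sketch: compute the median under quantile definition 3; PCTLDEF= accepts 1-5 */
proc univariate data=Sashelp.Heart pctldef=3 noprint;
   var Cholesterol;
   output out=Q3 pctlpts=50 pctlpre=Def3_;   /* creates the variable Def3_50 */
run;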

Confidence intervals for percentiles

I have previously shown how to compute confidence intervals for percentiles in SAS by using PROC UNIVARIATE. The following statements compute the 20th, 50th, and 90th percentiles for the cholesterol levels of 5209 patients in a medical study, along with 95% confidence intervals for the quantiles. The computation is shown twice: first with PROC UNIVARIATE, then with PROC QUANTREG.

/* 1. Use PROC UNIVARIATE to get 95% CIs for 20th, 50th, and 90th pctls */
proc univariate data=Sashelp.Heart noprint;
   var Cholesterol;
   output out=pctl pctlpts=20 50 90 pctlpre=p
          cipctldf=(lowerpre=LCL upperpre=UCL);    /* 12.1 options (SAS 9.3m2) */
run;
 
data QUni;  /* rearrange the statistics into a table */
set pctl;
Quantile = 0.2; Estimate = p20; Lower = LCL20; Upper = UCL20; output;
Quantile = 0.5; Estimate = p50; Lower = LCL50; Upper = UCL50; output;
Quantile = 0.9; Estimate = p90; Lower = LCL90; Upper = UCL90; output;
keep Quantile Estimate Lower Upper;
run;
 
title "UNIVARIATE Results"; 
proc print noobs; run;
 
/**************************************/
/* 2. Alternative: Use PROC QUANTREG! */
ods select none; ods output ParameterEstimates=QReg;
proc quantreg data=Sashelp.Heart;
   model Cholesterol = / quantile=0.2 0.5 0.9;
run;
ods select all;
 
title "QUANTREG Results"; 
proc print noobs;
   var Quantile Estimate LowerCL UpperCL;
run;

The output shows that the confidence intervals (CIs) for the quantiles are similar, although the QUANTREG intervals are slightly wider. Although UNIVARIATE can produce CIs for these data, the situation changes if you add a weight variable. The UNIVARIATE procedure supports estimates for weighted quantiles, but does not produce confidence intervals. However, the QUANTREG procedure can provide CIs even for a weighted analysis.
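
For example, the following sketch shows the form of a weighted analysis. The data set Have, the weight variable Wt, and the response Y are hypothetical placeholders, not data from this article:

/* sketch of a weighted quantile analysis; Have, Wt, and Y are hypothetical */
proc quantreg data=Have;
   weight Wt;                          /* observation weights */
   model Y = / quantile=0.2 0.5 0.9;   /* intercept-only model ==> univariate quantiles */
run;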

Test the difference of medians

In general, PROC QUANTREG can compute statistics for quantiles that UNIVARIATE cannot. For example, you can use the ESTIMATE statement in QUANTREG to get a confidence interval for the difference between medians in two independent samples. If the confidence interval does not contain 0, you can conclude that the medians are significantly different.

The adjacent box plot shows the distribution of diastolic blood pressure for male and female patients in a medical study. Reference lines are drawn at the median values for each gender. You might want to estimate the median difference in diastolic blood pressure between male and female patients and compute a confidence interval for the difference. The following call to PROC QUANTREG estimates those quantities:

ods select ParameterEstimates Estimates;
proc quantreg data=Sashelp.Heart;
   class sex;
   model diastolic = sex / quantile=0.5;
   estimate 'Diff in Medians' sex 1 -1 / CL;
run;

The syntax should look familiar to programmers who use PROC GLM to compare the means of groups. However, this computation compares medians of groups. The analysis indicates that the median diastolic blood pressure for female patients is 3 points lower than for male patients. The 95% confidence interval for the difference does not include 0; therefore, the difference is statistically significant. By changing the value of the QUANTILE= option, you can compare quantiles other than the median. No other SAS procedure provides that level of control over quantile estimation.
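
For example, the following sketch compares the 90th percentiles of the two groups by changing only the QUANTILE= value and the label:

proc quantreg data=Sashelp.Heart;
   class Sex;
   model Diastolic = Sex / quantile=0.9;
   estimate 'Diff in 90th Pctl' Sex 1 -1 / CL;
run;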

Conclusions

PROC QUANTREG provides another tool for the SAS programmer who needs to analyze quantiles. Although QUANTREG was written for quantile regression, the procedure can also analyze univariate samples. You can use the ESTIMATE statement to compare quantiles across groups and to obtain confidence intervals for the parameters.

In general, SAS regression procedures enable you to conduct univariate analyses that are not built into any univariate procedure.

The post Quantile estimates and the difference of medians in SAS appeared first on The DO Loop.

Distribution of colors of M&M's

Many introductory courses in probability and statistics encourage students to collect and analyze real data. A popular experiment in categorical data analysis is to give students a bag of M&M® candies and ask them to estimate the proportion of colors in the population from the sample data. In some classes, the students are also asked to perform a chi-square analysis to test whether the colors are uniformly distributed or whether the colors match a hypothetical set of proportions.

M&M's® have a long history at SAS. SAS is the world's largest corporate consumer of M&M's. Every Wednesday a SAS employee visits every breakroom on campus and fills two large containers with M&M's. This article uses SAS software to analyze the classic "distribution of colors" experiment.

The proportion of colors for M&M's

The "plain" M&M candies (now called "milk chocolate M&M's") are produced by the Mars, Inc. company. The distribution of colors in M&M's has a long and colorful history. The colors and proportions occasionally change, and the distribution is different for peanut and other varieties. A few incidents from my lifetime that made the national news are:

  • 1976: Red M&M's are replaced by orange. This was a big story. "Red dye #2" had been discovered to be a carcinogen. Although Mars did not use this dye in their candies, the company changed colors to alleviate customer concerns.
  • 1986: Red M&M's are brought back. Orange stays.
  • 1995: The tan color is replaced by a more vivid color. In a promotional campaign, the public is asked to vote for the replacement color. Ten million vote; blue wins in a landslide.
  • Late 1990s: The M&M web site lists the distribution of colors. Circa 1997, the color distribution was
    30% brown, 20% yellow, 20% red, 10% orange, 10% green, and 10% blue.
    Statisticians and educators rejoice and publish many papers on the topic.
  • 2008: Mars changes the color distribution to
    24% blue, 20% orange, 16% green, 14% yellow, 13% red, 13% brown.
    Some time later, the proportions were removed from the web site and have not been restored.
  • 2017: What is the current distribution of colors? Read on for an interesting story!

Proportions and chi-square test

The breakroom containers at SAS are filled from two-pound bags. So as not to steal all the M&M's in the breakroom, I conducted this experiment over many weeks in late 2016 and early 2017, taking one scoop of M&M's each week. The following data set contains the cumulative counts for each of the six colors in a sample of size N = 712:

data MandMs;
input Color $7. Count;
datalines;
Red     108
Orange  133
Yellow  103
Green   139
Blue    133
Brown    96
;

A bar chart that shows the observed distribution of colors in M&M's is shown at the top of this article.

To estimate the proportion of colors in the population, simply divide each count by the total sample size, or use the FREQ procedure in SAS. PROC FREQ also enables you to run a chi-square test that compares the sample counts to the expected counts under a specified distribution. The most recent published distribution is from 2008, so let's test those proportions:

proc freq data = MandMs order=data;
   weight Count;
   tables Color / nocum chisq
   /* 2008 proportions:  red orange yellow green blue brown */
                  testp=(0.13  0.20  0.14  0.16  0.24  0.13);
run;
Chi-square test for distribution of colors in M&M's

The observed and expected proportions are shown in the table to the right. The chi-square test rejects the null hypothesis at the α = 0.05 significance level. In other words, the distribution of colors for M&M's in this 2017 sample does NOT appear to be the same as the color distribution from 2008! You can see this visually from the bar chart: the red and green bars are too tall and the blue bar is too short compared with the expected values.

You need a large sample to be confident that this empirical deviation is real. After collecting data for a few weeks, I did a preliminary analysis that analyzed about 300 candies. With that smaller sample, the difference between the observed and expected proportions could be attributed to sampling variability and so the chi-square test did not reject the null hypothesis. However, while running that test I noticed that the green and blue colors accounted for the majority of the difference between the observed and theoretical proportions, so I decided to collect more data.

Simultaneous confidence intervals for the M&M proportions

As I explained in a previous article, you can use the sample proportions to construct simultaneous confidence intervals for the population proportions. The following SAS/IML statements load and call the functions from the previous post:

%include "conint.sas";         /* define the MultCI and MultCIPrint modules */
proc iml; 
load module=(MultCI MultCIPrint);  
use MandMs; read all var {"Color" "Count"}; close;
 
alpha = 0.05;
call MultCIPrint(Color, Count, alpha, 2); /* construct CIs using Goodman's method */
Simultaneous confidence limits for proportions of colors in plain M&M's

The table indicates that the published 2008 proportion for blue (0.24) is far outside the 95% confidence interval, and the proportion for green (0.16) is just barely inside its interval. That by itself does not prove that the 2008 proportions are no longer valid (we might have gotten unlucky during sampling), but combined with the earlier chi-square test, it seems unlikely that the 2008 proportions are applicable to these data.

Calling the experts

The published 2008 proportions for green and blue do not seem to match the sample proportions from 2017. For this large sample, the published proportion of blue is too large whereas the published proportion of green is too small.

From reading previous articles, I know that the Customer Care team at M&M/Mars is very friendly and responsive. Apparently they get asked about the distribution of colors quite often, so I sent them a note. The next day they sent a breakdown of the colors for all M&M candies.

Interestingly, plain (and peanut) M&M's are now produced at two different factories in the US, and the factories do not use the same mixture of colors! You need to look on the packaging for the manufacturing code, which is usually stamped inside a rectangle. In the middle of the code will be the letters HKP or CLV. For example, the code might read 632GCLV20.

  • CLV: The Cleveland plant uses the following proportion of colors for plain M&M's:
    Red=0.131, Orange=0.205, Yellow=0.135, Green=0.198, Blue=0.207, and Brown=0.124.
  • HKP: The Hackettstown, NJ, plant uses the following proportion of colors for plain M&M's:
    Red=0.125, Orange=0.25, Yellow=0.125, Green=0.125, Blue=0.25, and Brown=0.125.

Although I did not know about the manufacturing codes when I collected the data, I think it is clear that the bulk of my data came from the CLV plant. You can create a graph that shows the sample proportions, the 95% simultaneous confidence intervals, and vertical hash marks to indicate the CLV population parameters, as follows:
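
The complete program is available for download; the following statements are a sketch of one way to build such a graph by reusing the MultCI function. The output data set name and the marker choices are my assumptions:

proc iml;
load module=(MultCI);
use MandMs;  read all var {"Color" "Count"};  close;
CI = MultCI(Count, 0.05, 2);                 /* Goodman method */
Estimate = CI[,1];  Lower = CI[,2];  Upper = CI[,3];
/* CLV parameters in data order: Red, Orange, Yellow, Green, Blue, Brown */
CLV = {0.131, 0.205, 0.135, 0.198, 0.207, 0.124};
create MMCIs var {"Color" "Estimate" "Lower" "Upper" "CLV"};
append;
close;
quit;
 
proc sgplot data=MMCIs;
   scatter x=Estimate y=Color / xerrorlower=Lower xerrorupper=Upper
           markerattrs=(size=12 symbol=DiamondFilled);
   scatter x=CLV y=Color / markerattrs=(size=12 symbol=IBeam);  /* hash marks */
   xaxis grid label="Proportion";
   yaxis reverse display=(nolabel);
run;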

Observed and Expected proportions of M&M color (2017)

The graph shows that the observed proportions are close to the proportions from the CLV plant. All proportions are well within the 95% simultaneous confidence intervals from the data. If you rerun the PROC FREQ chi-square analysis with the CLV proportions, the test does not reject the null hypothesis.
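
The following statements sketch that computation; the TESTP= values are the CLV proportions listed above, in the order of the data:

proc freq data=MandMs order=data;
   weight Count;
   tables Color / nocum chisq
   /* CLV proportions:   red   orange yellow green  blue  brown */
                  testp=(0.131 0.205  0.135  0.198  0.207 0.124);
run;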

Summary

The experimental evidence indicates that the colors of plain M&M's in 2017 do not match the proportions that were published in 2008.

After contacting the M&M/Mars Customer Care team, I was sent a new set of proportions for 2017. The color proportions now depend on where the candies were manufactured. My data matches the proportion of colors from the Cleveland plant (manufacturing code CLV).

If you are running this analysis yourself, be sure to record whether your candies came from the HKP or CLV plant. If you want to see my analysis, you can download the complete SAS program that analyzes these data.

Educators who use M&M's to teach probability and statistics need to record the manufacturing plant, but this is still a fun (and yummy!) experiment. What do you think? Do you prefer the almost-equal orange-blue-green distribution from the CLV plant? Or do you like the orange-blue dominance from the HKP plant? Or do you just enjoy the crunchy shell and melt-in-your-mouth goodness, regardless of what colors the candies are?

The post The distribution of colors for plain M&M candies appeared first on The DO Loop.


A categorical response variable can take on k different values. If you have a random sample from a multinomial response, the sample proportions estimate the proportion of each category in the population. This article describes how to construct simultaneous confidence intervals for the proportions as described in the 1997 paper "A SAS macro for constructing simultaneous confidence intervals for multinomial proportions" by Warren May and William Johnson (Computer Methods and Programs in Biomedicine, p. 153–162).

Estimates of multinomial proportions

In their paper, May and Johnson present data for a random sample of 220 psychiatric patients who were categorized as either neurotic, depressed, schizophrenic or having a personality disorder. The observed counts for each diagnosis are as follows:

data Psych;
input Category $21. Count;
datalines;
Neurotic              91
Depressed             49
Schizophrenic         37
Personality disorder  43
;

If you divide each count by the total sample size, you obtain estimates for the proportion of patients in each category in the population. However, the researchers wanted to compute simultaneous confidence intervals (CIs) for the parameters. The next section shows several methods for computing the CIs.

Methods of computing simultaneous confidence intervals

May and Johnson discussed six different methods for computing simultaneous CIs. In the following, 1–α is the desired overall coverage probability for the confidence intervals, χ²(α, k−1) is the upper α critical value (that is, the 1–α quantile) of the χ² distribution with k−1 degrees of freedom, and π₁, π₂, ..., πₖ are the true population parameters. The methods and the references for the methods are:

  1. Quesenberry and Hurst (1964): Find the parameters πᵢ that satisfy
    N (pᵢ − πᵢ)² ≤ χ²(α, k−1) πᵢ(1 − πᵢ).
  2. Goodman (1965): Use a Bonferroni adjustment and find the parameters that satisfy
    N (pᵢ − πᵢ)² ≤ χ²(α/k, 1) πᵢ(1 − πᵢ).
  3. Binomial estimate of variance: For a binomial variable, you can bound the variance by using πᵢ(1 − πᵢ) ≤ 1/4. You can construct a conservative CI for the multinomial proportions by finding the parameters that satisfy
    N (pᵢ − πᵢ)² ≤ χ²(α, 1) (1/4).
  4. Fitzpatrick and Scott (1987): You can ignore the magnitude of the proportion when bounding the variance to obtain confidence intervals that are all the same length, regardless of the number of categories (k) or the observed proportions. The formula is
    N (pᵢ − πᵢ)² ≤ c² (1/4),
    where c² = χ²(α/2, 1) for α ≤ 0.016 and c² = (8/9) χ²(α/3, 1) for 0.016 < α ≤ 0.15.
  5. Q and H with sample variance: You can replace the unknown population variances by the sample variances in the Quesenberry and Hurst formula to get
    N (pᵢ − πᵢ)² ≤ χ²(α, k−1) pᵢ(1 − pᵢ).
  6. Goodman with sample variance: You can replace the unknown population variances by the sample variances in the Goodman Bonferroni-adjusted formula to get
    N (pᵢ − πᵢ)² ≤ χ²(α/k, 1) pᵢ(1 − pᵢ).

In a separate paper, May and Johnson used simulations to test the coverage probability of each of these formulas. They conclude that the simple Bonferroni-adjusted formula of Goodman (second in the list) "performs well in most practical situations when the number of categories is greater than 2 and each cell count is greater than 5, provided the number of categories is not too large." In comparison, the methods that use the sample variance (fifth and sixth in the list) are "poor." The remaining methods "perform reasonably well with respect to coverage probability but are often too wide."

A nice feature of the Q&H and Goodman methods (first and second on the list) is that they produce unsymmetric intervals that are always within the interval [0,1]. In contrast, the other intervals are symmetric and might not be a subset of [0,1] for extremely small or large sample proportions.
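
To make the Goodman method concrete, the following SAS/IML statements sketch a direct implementation. They are not the May-Johnson macro; they merely solve the quadratic inequality in the second item for πᵢ:

proc iml;
use Psych;  read all var {"Count"};  close;
N = sum(Count);                          /* total sample size */
p = Count / N;                           /* sample proportions */
k = nrow(Count);                         /* number of categories */
alpha = 0.05;
A = quantile("ChiSq", 1 - alpha/k, 1);   /* Bonferroni-adjusted critical value */
/* roots of (N+A)*pi##2 - (2*N*p + A)*pi + N*p##2 = 0 */
d = sqrt( A # (A + 4*N # p # (1-p)) );
Lower = (2*N*p + A - d) / (2*(N + A));
Upper = (2*N*p + A + d) / (2*(N + A));
print p[format=6.4] Lower[format=6.4] Upper[format=6.4];
quit;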

Computing CIs for multinomial proportions

You can download SAS/IML functions that are based on May and Johnson's paper and macro. The original macro used SAS/IML version 6, so I have updated the program to use a more modern syntax. I wrote two "driver" functions:

  • CALL MultCIPrint(Category, Count, alpha, Method) prints a table that shows the point estimates and simultaneous CIs for the counts, where the arguments to the function are as follows:
    • Category is a vector of the k categories.
    • Count is a vector of the k counts.
    • alpha is the significance level for the (1-alpha) CIs. The default value is alpha=0.05.
    • Method is a number 1, 2, ..., 6 that indicates the method for computing confidence intervals. The previous list shows the method number for each of the six methods. The default value is Method=2, which is the Goodman (1965) Bonferroni-adjusted method.
  • MultCI(Count, alpha, Method) is a function that returns a three-column matrix that contains the point estimates, lower limit, and upper limit for the CIs. The arguments are the same as above, except that the function does not use the Category vector.

Let's demonstrate how to call these functions on the psychiatric data. The following program assumes that the function definitions are stored in a file called conint.sas; you might have to specify a complete path. The PROC IML step loads the definitions and the data and calls the MultCIPrint routine and requests the Goodman method (method=2):

%include "conint.sas";   /* specify full path name to file */
proc iml;
load module=(MultCI MultCIPrint);
use Psych; read all var {"Category" "Count"}; close;
 
alpha = 0.05;
call MultCIPrint(Category, Count, alpha, 2); /* Goodman = 2 */
Estimates and 95% simultaneous confidence intervals for multinomial proportions

The table shows the point estimates and 95% simultaneous CIs for this sample of size 220. If the intervals are wider than you expect, remember the goal: for 95% of random samples of size 220 this method should produce a four-dimensional region that contains all four parameters.

You can visualize the width of these intervals by creating a graph. The easiest way is to write the results to a SAS data set. To get the results in a matrix, call the MultCI function, as follows:

CI = MultCI(Count, alpha, 2);  /*or simply CI = MultCI(Count) */
/* write estimates and CIs to data set */
Estimate = CI[,1];  Lower = CI[,2];  Upper = CI[,3];
create CIs var {"Category" "Estimate" "Lower" "Upper"};
append;
close;
quit;
 
ods graphics / width=600px height=240px;
title 'Simultaneous Confidence Intervals for Multinomial Proportions';
title2 'Method of Goodman (1965)';
proc sgplot data=CIs;
  scatter x=Estimate y=Category / xerrorlower=Lower xerrorupper=upper
          markerattrs=(Size=12  symbol=DiamondFilled)
          errorbarattrs=GraphOutlines(thickness=2);
  xaxis grid label="Proportion" values=(0.1 to 0.5 by 0.05);
  yaxis reverse display=(nolabel);
run;
Graph of estimates and simultaneous confidence intervals for multinomial proportions

The graph shows intervals that are likely to enclose all four parameters simultaneously. The neurotic proportion in the population is probably in the range [0.33, 0.50] and at the same time the depressed proportion is in the range [0.16, 0.30] and so forth. Notice that the CIs are not symmetric about the point estimates; this is most noticeable for the smaller proportions such as the schizophrenic category.

Because the cell counts are all relatively large and because the number of categories is relatively small, Goodman's CIs should perform well.

I will mention that if you use the Fitzpatrick and Scott method (method=4), you will get different CIs from those reported in May and Johnson's paper. The original May and Johnson macro contained a bug that was corrected in a later version (personal communication with Warren May, 25FEB2016).

Conclusions

This article presents a set of SAS/IML functions that implement six methods for computing simultaneous confidence intervals for multinomial proportions. The functions are updated versions of the macro %CONINT, which was presented in May and Johnson (1997). You can use the MultCIPrint function to print a table of statistics and CIs, or you can use the MultCI function to retrieve that information into a SAS/IML matrix.

tags: Statistical Programming

The post Simultaneous confidence intervals for multinomial proportions appeared first on The DO Loop.


A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to repeat the analysis for each unique value of one or more categorical variables. But did you know that a BY-group analysis can sometimes be used to replace macro loops? This article shows how you can efficiently run hundreds or thousands of different regression models by restructuring the data.

One model: Many samples

As I've written before, BY-group analysis is also an efficient way to analyze simulated samples or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.

In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of times are relatively slow.

As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.


Many models: One sample

There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=Xi, where i = 1 to 1000.

One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:

%macro RunReg(DSName, NumVars);
...
%do i = 1 %to &NumVars;                    /* repeat for each x&i */
   proc reg data=&DSName noprint
            outest=PE(rename=(x&i=Value)); /* save parameter estimates */
   model Y = x&i;                          /* model Y = x_i */
   quit;
 
   /* ...then accumulate statistics... */
%end;
%mend;

The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.

If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.

The BY way for many models

An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.

To run a BY-group analysis:

  1. Transpose the data from wide to long form. As part of this process, you need to create a variable (the BY-group variable) that will be unique for each model.
  2. Sort the data by the BY-group variable.
  3. Run the SAS procedure, which uses the BY statement to specify each model.

1. Transpose the data

In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable that is in the i-th element of the array X. If the original data had N observations and p explanatory variables, the LONG data set contains Np observations.

/* 1. transpose from wide (Y, X1 ,...,X100) to long (varNum VarName Y Value) */
data Long;
set Wide;                       /* <== specify data set name HERE         */
array x [*] x1-x&nCont;         /* <== specify explanatory variables HERE */
do varNum = 1 to dim(x);
   VarName = vname(x[varNum]);  /* variable name in char var */
   Value = x[varNum];           /* value for each variable for each obs */
   output;
end;
drop x:;
run;

2. Sort the data

In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:

/* 2. Sort by BY-group variable */
proc sort data=Long;  by VarName;  run;

3. Run the analyses

You can now call a SAS procedure one time to compute all regression models:

/* 3. Call PROC REG and use BY statement to compute all regressions */
proc reg data=Long noprint outest=PE;
by VarName;
model Y = Value;
quit;
 
/* Look at the results */
proc print data=PE(obs=5);
var VarName Intercept Value;
run;

The PE data set contains the parameter estimates for every single-variable regression of Y onto Xi. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.

Conclusions

You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).

This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=Xi, but a similar result would work for GLM models such as Y=A|Xi, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.

Can you think of other ways to use this trick? Leave a comment.

tags: Data Analysis, Getting Started, Statistical Programming

The post An easy way to run thousands of regressions in SAS appeared first on The DO Loop.


On discussion forums, I often see questions that ask how to Winsorize variables in SAS. For example, here are some typical questions from the SAS Support Community:

  • I want an efficient way of replacing (upper) extreme values with (95th) percentile. I have a data set with around 600 variables and want to get rid of extreme values of all 600 variables with 95th percentile.
  • I have several (hundreds of) variables that I need to “Winsorize” at the 95% and 5%. I want all the observations with values greater 95th percentile to take the value of the 95th percentile, and all observations with values less than the 5th percentile to take the value of the 5th percentile.

It is clear from the questions that the programmer wants to modify the extreme values of dozens or hundreds of variables. As we will soon learn, neither of these requests satisfies the standard definition of Winsorization. What is Winsorization of data? What are the pitfalls and what are alternative methods?


What is Winsorization?

The process of replacing a specified number of extreme values with a smaller data value has become known as Winsorization or as Winsorizing the data. Let's start by defining Winsorization.

Winsorization began as a way to "robustify" the sample mean, which is sensitive to extreme values. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value. The mean of this new set of numbers is called the Winsorized mean. If the data are from a symmetric population, the Winsorized mean is a robust unbiased estimate of the population mean.
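
In SAS, PROC UNIVARIATE can compute the Winsorized mean directly. The following sketch (the data set and the cutoff k=5 are arbitrary choices) requests the Winsorized mean after replacing the five smallest and five largest values:

/* sketch: Winsorized mean with k=5 values replaced in each tail */
proc univariate data=Sashelp.Heart winsorized=5;
   var Cholesterol;
   ods select WinsorizedMeans;
run;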

The graph to the right provides a visual comparison. The top graph shows the distribution of the original data set. The bottom graph shows the distribution of Winsorized data for which the five smallest and five largest values have been modified. The extreme values were not deleted but were replaced by the sixth smallest or largest data value.

I consulted the Encyclopedia of Statistical Sciences (Kotz et al. (Eds), 2nd Ed, 2006), which has an article "Trimming and Winsorization" by David Ruppert (Vol 14, p. 8765). According to the article:

  • Winsorization is symmetric: Some people want to modify only the large data values. However, Winsorization is a symmetric process that replaces the k smallest and the k largest data values.
  • Winsorization is based on counts: Some people want to modify values based on quantiles, such as the 5th and 95th percentiles. However, using quantiles might not lead to a symmetric process. Let k1 be the number of values less than the 5th percentile and let k2 be the number of values greater than the 95th percentile. If the data contain repeated values, then k1 might not equal k2, which means that you are potentially changing more values in one tail than in the other.

As shown by the quotes at the top of this article, posts on discussion forums sometimes muddle the definition of Winsorization. If you modify the data in an unsymmetric fashion, you will produce biased statistics.

Winsorization: The good

Why do some people want to Winsorize their data? There are a few reasons:

  • Classical statistics such as the mean and standard deviation are sensitive to extreme values. The purpose of Winsorization is to "robustify" classical statistics by reducing the impact of extreme observations.
  • Winsorization is sometimes used in the automated processing of hundreds or thousands of variables when it is impossible for a human to inspect each and every variable.
  • If you compare a Winsorized statistic with its classical counterpart, you can identify variables that might contain contaminated data or are long-tailed and require special handling in models.

Winsorization: The bad

There is no built-in procedure in SAS that Winsorizes variables, but there are some user-defined SAS macros on the internet that claim to Winsorize variables. BE CAREFUL! Some of these macros do not correctly handle missing values. Others use percentiles to determine the extreme values that are modified. If you must Winsorize, I have written a SAS/IML function that Winsorizes data and correctly handles missing values.

As an alternative to Winsorizing your data, SAS software provides many modern robust statistical methods that have advantages over a simple technique like Winsorization, such as the robust estimation methods in the ROBUSTREG and QUANTREG procedures.

Winsorization: The ugly

If the data contain extreme values, then classical statistics are influenced by those values. However, modifying the data is a draconian measure. Recently I read an article by John Tukey, one of the early investigators of robust estimation. In the article "A survey of sampling from contaminated distributions" (1960), Tukey says (p. 457) that when statisticians encounter a few extreme values in data,

we are likely to think of them as 'strays' [or] 'wild shots' ... and to focus our attention on how normally distributed the rest of the distribution appears to be. One who does this commits two oversights, forgetting Winsor's principle that 'all distributions are normal in the middle,' and forgetting that the distribution relevant to statistical practice is that of the values actually provided and not of the values which ought to have been provided.

A little later in the essay (p. 458), he says

Sets of observations which have been de-tailed by over-vigorous use of a rule for rejecting outliers are inappropriate, since they are not samples.

I love this second quote. All of the nice statistical formulas that are used to make inferences (such as standard errors and confidence intervals) are based on the assumption that the data are a random sample that contains all of the observed values, even extreme values. The tails of a distribution are extremely important, and indiscriminately modifying large and small values invalidates many of the statistical analyses that we take for granted.

Summary

Should you Winsorize data? Tukey argues that indiscriminately modifying data is "inappropriate." In SAS, you can get the Winsorized mean directly from PROC UNIVARIATE. SAS also provides alternative robust methods such as the ones in the ROBUSTREG and QUANTREG procedures.

If you decide to use Winsorization to modify your data, remember that the standard definition calls for the symmetric replacement of the k smallest (largest) values of a variable with the (k+1)st smallest (largest). If you download a program from the internet, be aware that some programs use quantiles and others do not handle missing values correctly.

What are your thoughts about Winsorizing data? Share them in the comments.

tags: Statistical Programming, Statistical Thinking

The post Winsorization: The good, the bad, and the ugly appeared first on The DO Loop.


Suppose you create a scatter plot in SAS with PROC SGPLOT. What color does PROC SGPLOT use for the markers? If you specify the GROUP= option so that markers are colored by a grouping variable, what colors are used to represent the various groups? The following scatter plot shows the colors that are used by default for the HTMLBlue style. They are shades of blue, red, green, brown, and magenta.

data A;        /* example data with groups 1, 2, ..., 5 */
do Color = 1 to 5;
   x = Color; y = Color;  output;
end;
run;
 
title "Marker Colors Used for GROUP= Option";
title2 "HTMLBlue Style";
proc sgplot data=A;
xaxis grid; yaxis grid;
scatter x=x y=y / group=Color markerattrs=(size=24 symbol=SquareFilled);
run;

Notice that these marker colors are not fully saturated colors, so they are not the SAS color names RED, BLUE, GREEN, BROWN, and MAGENTA. So what colors are these? What are their RGB values?


Colors come from styles

Colors are defined by styles, and you can use ODS style elements to set marker colors. A style defines elements called GraphDataDefault, GraphData1, GraphData2, GraphData3, and so forth. Each element contains several attributes such as colors and line patterns. The complete list of style elements and attributes for ODS graphics is in the documentation, but for this article, the important fact is that the GraphDatan:ContrastColor attribute determines the marker color for the nth group. These are the colors in the previous scatter plot.

Display an ODS style template

Styles are defined by ODS templates. You can use the SOURCE statement in PROC TEMPLATE to display a template. In the style template, the GraphDatan:ContrastColor attributes are set by using keywords named gcdata1, gcdata2, gcdata3, etc.

If you display the styles.HTMLBlue template, you will see that the HTMLBlue style inherits from the Statistical style, and it is the Statistical style that defines the contrast colors. The following statements display the contents of the Statistical template to the SAS log:

proc template;
source styles.statistical;
quit;

The template is long, and it is hard to scroll through the log to discover which colors are associated with each attribute. But that's no problem: you can use SAS to find and display only the information about contrast colors.

Display the marker colors as RGB and hexadecimal values

For years Warren Kuhfeld has been showing SAS customers how to view, edit, and use ODS templates to customize the graphs that are produced by SAS statistical procedures. A powerful technique that he uses is to write a template to a file and then use the DATA step to modify the template.

I will not modify the template but merely display information from it. The following DATA step writes the template to a text file and then uses the DATA step to find all instances of the keyword 'gcdata' in the template. For lines that contain the string 'gcdata', the program extracts the color for each keyword. The keyword-value pairs are saved to a data set, which is sorted and displayed:

libname temp "C:/temp";
proc template;
source styles.statistical / file='temp.tmp'; /* write template to text file */
quit;
 
data Colors;
keep Num Name Color R G B;
length Name Color $8;
infile 'temp.tmp';                    /* read from text file */
input;
/* example string:  'gcdata1' = cx445694 */
k = find(_infile_,'gcdata','i');      /* if k=0 then string not found */
if k > 0 then do;                     /* Found line that contains 'gcdata' */
   s = substr(_infile_, k);           /* substring from 'gcdata' to end of line */
   j = index(s, "'");                 /* index of closing quote  */
   Name = substr(s, 1, j-1);          /* keyword                 */
   if j = 7 then Num = 0;             /* string is 'gcdata'      */
   else                               /* extract number 1, 2, ... for strings */
      Num = inputn(substr(s, 7, j-7), "best2.");  /* gcdata1, gcdata2,...     */
   j = index(s, "=");                 /* index of equal sign     */
   Color = compress(substr(s, j+1));  /* color value for keyword */
   R = inputn(substr(Color, 3, 2), "HEX2.");   /* convert hex to RGB */
   G = inputn(substr(Color, 5, 2), "HEX2.");
   B = inputn(substr(Color, 7, 2), "HEX2.");
end;
if k > 0;
run;
 
proc sort data=Colors; by Num; run;
 
proc print data=Colors; 
var Name Color R G B;
run;

Success! The output shows the contrast colors for the HTMLBlue style. The 'gcdata' color is the fill color (a dark blue) for markers when no GROUP= option is specified. The 'gcdatan' colors are used for markers that are colored by group membership. Obviously you could use this same technique to display other style attributes, such as line patterns or bar colors ('gdata').

If you prefer a visual summary of the attributes for an ODS style, see section "ODS Style Comparisons" in the SAS/STAT documentation. That section is part of the chapter "Statistical Graphics Using ODS," which could have been titled "Everything you always wanted to know about ODS graphics but were afraid to ask."

An application of setting marker colors

I prefer to use style elements and discrete attribute maps to set colors for markers. But if you are rushed for time, you might want to use the STYLEATTRS statement to set the colors that are used for the GROUP= option. The STYLEATTRS statement requires a color list of hexadecimal colors or SAS color names. The following call to PROC SGPLOT uses the RGB/hex values for GraphData1:ContrastColor and so forth:

/* use colors for HTMLBlue style */
%let gcdata1 = cx445694;        
%let gcdata2 = cxA23A2E;
%let gcdata3 = cx01665E;
title "Origin in {Europe, USA}";
proc sgplot data=sashelp.cars;
where origin^='Asia' and type^="Hybrid";                /* omit first category */
   styleattrs DataContrastColors = (&gcdata2 &gcdata3); /* use 2nd and 3rd colors */
   scatter x=weight y=mpg_city / group=Origin markerattrs=(symbol=CircleFilled);
   keylegend / location=inside position=TopRight across=1;
run;

It would be great if you could specify a style-independent syntax such as

styleattrs DataContrastColors=(GraphData2:ContrastColor GraphData3:ContrastColor);

Unfortunately, that syntax is not supported. The STYLEATTRS statement requires a list of color values or SAS color names.

Although this trick is interesting, in general I prefer to use styles (rather than hard-coded color values) in production code. However, if you want to know the RGB/hex values for a style, this trick shows how you can get them from an ODS template.

tags: SAS Programming, Statistical Graphics

The post What colors does PROC SGPLOT use for markers? appeared first on The DO Loop.


In a previous article, I showed how to simulate data for a linear regression model with an arbitrary number of continuous explanatory variables. To keep the discussion simple, I simulated a single sample with N observations and p variables. However, to use Monte Carlo methods to approximate the sampling distribution of statistics, you need to simulate many samples from the same regression model.

This article shows how to simulate many samples efficiently. Efficient simulation is the main emphasis of my book Simulating Data with SAS. For a detailed discussion about simulating data from regression models, see chapters 11 and 12.

The SAS DATA step in my previous post contains four steps. To simulate multiple samples, put a DO loop around the steps that generate the error term and the response variable for each observation in the model. The following program modifies the previous program and creates a single data set that contains NumSamples (=100) samples. Each sample is identified by an ordinal variable named SampleID.

/* Simulate many samples from a  linear regression model */
%let N = 50;            /* N = sample size               */
%let nCont = 10;        /* p = number of continuous variables */
%let NumSamples = 100;  /* number of samples                  */
data SimReg(keep= SampleID i Y x:);
call streaminit(54321);
array x[&nCont];        /* explanatory variables are named x1-x&nCont */
 
/* 1. Specify model coefficients. You can hard-code values such as
array beta[0:&nCont] _temporary_ (-4 2 -1.33 1 -0.8 0.67 -0.57 0.5 -0.44 0.4 -0.36);
      or you can use a formula such as the following */
array beta[0:&nCont] _temporary_;
do j = 0 to &nCont;
   beta[j] = 4 * (-1)**(j+1) / (j+1);       /* formula for beta[j] */
end;
 
do i = 1 to &N;              /* for each observation in the sample */
   do j = 1 to &nCont;
      x[j] = rand("Normal"); /* 2. Simulate explanatory variables  */
   end;
 
   eta = beta[0];                       /* model = intercept term  */
   do j = 1 to &nCont;
      eta = eta + beta[j] * x[j];       /*     + sum(beta[j]*x[j]) */
   end;
 
   /* 5. simulate response for each sample */
   do SampleID = 1 to &NumSamples;      /* <== LOOP OVER SAMPLES   */
      epsilon = rand("Normal", 0, 1.5); /* 3. Specify error distrib*/
      Y = eta + epsilon;                /* 4. Y = model + error    */
      output;
   end;
end;
run;

The efficient way to analyze simulated samples with SAS is to use BY-group processing. With BY-group processing you can analyze all samples with a single procedure call. The following statements sort the data by the SampleID variable and call PROC REG to analyze all samples. The NOPRINT option ensures that the procedure does not spew out thousands of tables and graphs. (For procedures that do not support the NOPRINT option, there are other ways to turn off ODS when analyzing simulated data.) The OUTEST= option saves the parameter estimates for all samples to a SAS data set.

proc sort data=SimReg;
   by SampleID i;
run;
 
proc reg data=SimReg outest=PE NOPRINT;
   by SampleID;
   model y = x:;
quit;

The PE data set contains NumSamples rows. Each row contains the p parameter estimates for the analysis of one simulated sample. The distribution of estimates is an approximation to the true (theoretical) sampling distribution of the statistics. The following image visualizes the joint distribution of the estimates of four regression coefficients. You can see that the distribution of the estimates appears to be multivariate normal and centered at the values of the population parameters.
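
For example, the following statements summarize the Monte Carlo distribution of the estimates; this is a minimal sketch, not the full analysis in the downloadable program:

/* summarize the approximate sampling distribution of the estimates */
proc means data=PE N mean std;
   var Intercept x1-x&nCont;
run;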

You can download the SAS program that simulates the data, analyzes it, and produces the graph. The program is very efficient. For 10,000 random samples of size N=50 that contain p=10 variables, it takes about one second to run the Monte Carlo simulation and analyses.

tags: Simulation, Statistical Programming

The post Simulate many samples from a linear regression model appeared first on The DO Loop.


If you are a SAS programmer and use the GROUP= option in PROC SGPLOT, you might have encountered a thorny issue: if you use a WHERE clause to omit certain observations, then the marker colors for groups might change from one plot to another. This happens because the marker colors depend on the data by default. If you change the number of groups (or the order of groups), the marker colors also change.

A simple example demonstrates the problem. The following scatter plots are colored by the ORIGIN variable in the SasHelp.Cars data. (Click to enlarge.) The ORIGIN variable has three levels: Asia, Europe, and USA. On the left, all values of the ORIGIN variable are present in the graph. On the right, the Asian vehicles are excluded by using WHERE ORIGIN^="Asia". Notice that the colors of the markers on the right are not consistent with the values on the left.

Warren Kuhfeld wrote an excellent introduction to legend order and group attributes, and he describes several other ways that group colors can change from one plot to another. To solve this problem, Kuhfeld and other experts recommend that you create a discrete attribute map. A discrete attribute map is a SAS data set that specifies the colors to use for each group value. If you are not familiar with discrete attribute maps, I provide several references at the end of this article.


Automatic creation of a discrete attribute map

The discrete attribute map is powerful, flexible, and enables the programmer to completely determine the legend order and color for all categories. However, I rarely use discrete attribute maps in my work because the process requires the manual creation of a data set. The data set has to contain all categories (spelled and capitalized correctly) and you have to remember (or look up) the structure of the data set. Furthermore, many examples use hard-coded color values such as CXFFAAAA or "LightBlue," whereas I prefer to use the GraphDatan style elements in the current ODS style.

However, I recently realized that PROC FREQ and the SAS DATA step can lessen the burden of creating a discrete attribute map. The documentation for the discrete attribute map mentions that you can define a column named MarkerStyleElement (or MarkerStyle), which specifies the names of styles elements such as GraphData1, GraphData2, and so on. Therefore, you can use PROC FREQ to write the category levels to a data set, and use a simple DATA step to add the MarkerStyleElement variable. For example, you can create a discrete attribute map for the ORIGIN variable, as follows:

/* semi-automatic way to create a DATTRMAP= data set */
%let VarName = Origin;           /* specify name of grouping variable */
proc freq data=sashelp.cars ORDER=FORMATTED;   /* or ORDER=DATA|FREQ  */
   tables &VarName / out=Attrs(rename=(&VarName=Value));
run;
 
data DAttrs;
ID = "&VarName";                 /* or "ID_&VarName" */
set Attrs(keep=Value);
length MarkerStyleElement $11.;
MarkerStyleElement = cats("GraphData", 1+mod(_N_-1, 12)); /* GraphData1, GraphData2, etc */
run;
 
proc print; run;
Structure of discrete attribute maps for DATTRMAP= option in PROC SGPLOT

Voila! The result is a valid discrete attribute data set for the ORIGIN variable. The DATTRS data set contains all the information you need to ensure that the first category is always displayed by using the GraphData1 element, the second category is displayed by using GraphData2, and so on. The program does not require that you manually type the categories or even know how many categories there are. Obviously, you could write a macro that makes it easy to generate these statements.

This data set uses the alphabetical order of the formatted values to determine the group order. However, you can use the ORDER=DATA option in PROC FREQ to order by the order of categories in the data set. You can also use the ORDER=FREQ option to order by the most frequent categories. Because most SAS-supplied styles define 12 style elements, the MOD function is used to handle categorical variable that have more than 12 levels.

Use the discrete attribute map

To use the discrete attribute map, you need to specify the DATTRMAP= option on the PROC SGPLOT statement. You also need to specify the ATTRID= option on every SGPLOT statement that uses the map. Notice that I set the value of the ID variable to be the name of the GROUP= variable. (If that is confusing, you could choose a different value, as noted in the comments of the program.) The following statements are similar to the statements that create the right-hand graph at the top of this article, except this call to PROC SGPLOT uses the DAttrs discrete attribute map:

proc sgplot data=sashelp.cars DATTRMAP=DAttrs;
where origin^='Asia' and type^="Hybrid";
   scatter x=weight y=mpg_city / group=Origin ATTRID=Origin 
                                 markerattrs=(symbol=CircleFilled);
   keylegend / location=inside position=TopRight across=1;
run;
Markers colored by attributes specified in a discrete attribute data set, using PROC SGPLOT and the DATTRMAP= option

Notice that the colors in this scatter plot are the same as for the left-hand graph at the top of this article. The group colors are now consistent, even though the number of groups is different.

Generalizing the automatic creation of a discrete attribute map

The previous section showed how to create a discrete attribute map for one variable. You can use a similar approach to automatically create a discrete attribute map that contains several variables. The main steps are as follows:

  1. Use ODS OUTPUT to save the OneWayFreqs tables from PROC FREQ to a SAS data set.
  2. Use the SUBSTR function to extract the variable name into the ID variable.
  3. Use the COALESCEC function to form a Value column that contains the values of the categorical variables.
  4. Use BY-group processing and the UNSORTED option to assign the style elements GraphDatan.

ods select none;
proc freq data=sashelp.cars;
   tables Type Origin;        /* specify VARS here */ 
   ods output OneWayFreqs=Freqs;
run;
ods select all;
 
data Freqs2;
set Freqs;
length ID $32.;
ID = substr(Table, 6);        /* original values are "Table VarName" */
Value = COALESCEC(F_Type, F_Origin);  /* also specify F_VARS here */
keep ID Value;
run;
 
data DAttrs(drop=count);
set Freqs2;
length MarkerStyleElement $11.;
by ID notsorted;
if first.ID then count = 0;
count + 1;
MarkerStyleElement = cats("GraphData", 1 + mod(count-1, 12));
run;

The preceding program is not completely general, but it shows the main ideas. You can adapt the program to your own data. If you are facile with the SAS macro language, you can even write a macro that generates appropriate code for an arbitrary number of variables. Leave a comment if this technique proves useful in your work or if you have ideas for improving the technique.

tags: Statistical Graphics, Statistical Programming

The post Automate the creation of a discrete attribute map appeared first on The DO Loop.


This article shows how to simulate a data set in SAS that satisfies a least squares regression model for continuous variables.

When you simulate to create "synthetic" (or "fake") data, you (the programmer) control the true parameter values, the form of the model, the sample size, and magnitude of the error term. You can use simulated data as a quick-and-easy way to generate an example. You can use simulation to test the performance of an algorithm on very wide or very long data sets.

The least squares regression model with continuous explanatory variables is one of the simplest regression models. The book Simulating Data with SAS describes how to simulate data from dozens of regression models. I have previously blogged about how to simulate data from a logistic regression model in SAS.

Simulate data that satisfies a linear regression model

It is useful to be able to generate data that fits a known model. Suppose you want to fit a regression model in which the response variable is a linear combination of 10 explanatory variables, plus random noise. Furthermore, suppose you don't need to use real X values; you are happy to generate random values for the explanatory variables.

With the following SAS DATA step, you can do the following:

  1. Specify the sample size.
  2. Specify the number of explanatory variables.
  3. Specify the model parameters, which are the regression coefficients: β0=Intercept, β1, β2, β3, .... If you know the number of explanatory variables, you can hard-code these values. The example below uses the formula βj = 4 (-1)^(j+1) / (j+1), but you can use a different formula or hard-code the parameter values.
  4. Simulate the explanatory variables. In this example, the variables are independent and normally distributed. When augmented with a column that contains all 1s (to capture the model intercept), the explanatory variables form a data matrix, X.
  5. Specify the error distribution, ε ∼ N(0,σ).
  6. Simulate the response variable as Y = MODEL + ERROR = X β + ε.

%let N = 50;        /* 1. Specify sample size */
%let nCont = 10;    /* 2. Specify the number of continuous variables */
 
data SimReg1(keep= Y x:);
call streaminit(54321);              /* set the random number seed */
array x[&nCont];         /* explanatory vars are named x1-x&nCont  */
 
/* 3. Specify model coefficients. You can hard-code values such as
array beta[0:&nCont] _temporary_ (-4 2 -1.33 1 -0.8 0.67 -0.57 0.5 -0.44 0.4 -0.36);
      or you can use a formula such as the following */
array beta[0:&nCont] _temporary_;
do j = 0 to &nCont;
   beta[j] = 4 * (-1)**(j+1) / (j+1);       /* formula for beta[j]  */
end;
 
do i = 1 to &N;              /* for each observation in the sample  */
   do j = 1 to dim(x);
      x[j] = rand("Normal"); /* 4. Simulate explanatory variables   */
   end;
 
   eta = beta[0];                       /* model = intercept term   */
   do j = 1 to &nCont;
      eta = eta + beta[j] * x[j];       /*     + sum(beta[j]*x[j])  */
   end;
   epsilon = rand("Normal", 0, 1.5);    /* 5. Specify error distrib */
   Y = eta + epsilon;                   /* 6. Y = model + error     */
   output;
end;
run;

How do you know whether the simulation is correct?

After you simulate data, it is a good idea to run a regression analysis and examine the parameter estimates. For large samples, the estimates should be close to the true value of the parameters. The following call to PROC REG fits the known model to the simulated data and displays the parameter estimates, confidence intervals for the parameters, and the root mean square error (root MSE) statistic.

proc reg data=SimReg1 plots=none;
   model Y = x: / CLB;
   ods select FitStatistics ParameterEstimates;
quit;

For a simple model a "large sample" does not have to be very large. Although the sample size is merely N = 50, the parameter estimates are close to the parameter values. Furthermore, the root MSE value is close to 1.5, which is the true magnitude of the error term. For this simulation, each 95% confidence interval contains the true value of its parameter, but that does not happen in general.

Simulate data. Compare parameters, estimates, and 95% confidence limits for a 10-variable linear regression of simulated data

For more complex regression models, you might need to generate larger samples to verify that the simulation correctly generates data from the model.

If you write the ParameterEstimates table to a SAS data set, you can create a plot that shows the parameters overlaid on a plot of the estimates and the 95% confidence limits. The plot shows that the parameters and estimates are close to each other, which should give you confidence that the simulated data are correct.
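
One way to capture that table is the ODS OUTPUT statement. The following sketch refits the model and saves the estimates and confidence limits to a data set named PE (the name is arbitrary):

proc reg data=SimReg1 plots=none;
   model Y = x: / CLB;
   ods output ParameterEstimates = PE;   /* save estimates and CLs */
quit;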

Notice that this DATA step can simulate any number of observations and any number of continuous variables. Do you want a sample that has 20, 50, or 100 variables? No problem: just change the value of the nCont macro variable! Do you want a sample that has 1000 observations? Change the value of N.

In summary, the SAS DATA step provides an easy way to simulate data from regression models in which the explanatory variables are uncorrelated and continuous. Download the complete program and modify it to your needs. For example, if you want more significant effects, use sqrt(j+1) in the denominator of the regression coefficient formula.

In a future article, I will show how to generalize this program to efficiently simulate and analyze many samples as part of a Monte Carlo simulation study.

tags: Simulation

The post Simulate data for a linear regression model appeared first on The DO Loop.

Link to documentation for the SAS analytical products

The SAS analytical documentation has a new look.

Beginning with the 14.2 release of the SAS analytical products (which shipped with SAS 9.4m4 in November 2016), the HTML version of the online documentation has moved to a new framework called the Help Center. The URL for the online documentation is easy to remember:
http://support.sas.com/documentation/

This article shows the 14.2 documentation for the SAS analytical products, as highlighted in the adjacent image. Documentation for previous releases is also available.

The 14.2 link takes you to a new page that contains links for the User's Guides for each SAS analytical product, such as SAS/STAT, SAS/ETS, SAS/IML, and so on. When you click on a User's Guide, you are taken to the new Help Center.

An example page for the SAS/STAT documentation is shown in the following image (click to enlarge). As in previous versions of the help system, the Help Center provides drop-down lists (Overview, Getting Started, Syntax, etc.) for quick navigation within a procedure. There are also arrows (now in the upper right corner) that take you to the previous or next page in the book.

SAS Documentation in the Help Center

The following list describes five features of the Help Center that are either new or that extend features of the older HTML format. The locations of these features are highlighted with red rectangles in the previous image.


  1. What's New in 14.2: No matter how the information is delivered, content is king. I really appreciate that SAS publishes a "What's New" chapter, which highlights new features and enhancements for each product. For long-time users of the software, the "What's New" chapter is the first place to go to determine what new options are available in the new release of your favorite SAS product.
  2. Toggle the Table of Contents (TOC): Like the older HTML documentation, the new Help Center shows the TOC for each book and chapter in a left-side pane. The Help Center enables you to toggle the TOC by clicking the icon in the upper left corner (or use CTRL+SHIFT+C). Removing the TOC provides more screen area for the documentation. This is especially important on a small display, such as on a laptop or tablet, where the TOC is hidden by default. Click the icon again to restore the TOC pane.
  3. Enhanced search within a book: In the older HTML doc, search results appear on a separate HTML page. In the new Help Center, the search facility (click the magnifying glass icon or CTRL+SHIFT+S) displays the results in a pop-up scrollable window as shown in the adjacent image. When you click a search result, the Help Center updates to display the new page. If you search again, the search window remembers your previous query. If you want to close the search window, press the ESC key. The search facility supports complex expressions, Boolean operators, and proximity searches.
  4. Links to the SAS Sample Library: My favorite new feature is that the documentation now links directly to the SAS Sample Library. For decades, SAS has distributed the Sample Library, which provides the complete data and programming statements to reproduce advanced examples. However, many SAS programmers do not know how to access the SAS Sample Library. The new 14.2 documentation now links directly to the sample programs for each analytical procedure. Simply click the link at the top of each example to see the complete data and sample program.
  5. Links to Videos: Some SAS programmers prefer to watch videos to learn about new features in SAS. Over the past few years, SAS R&D has recorded dozens of video presentations about the analytical procedures. Now the documentation contains links to these videos, which often provide an overview of a procedure, analysis, or set of options. The Videos tab appears for chapters that contain videos.

In summary, the new Help Center framework provides additional ways for SAS customers to learn about the syntax, options, and output of SAS analytical procedures. At the present time, only analytical products use the Help Center. The documentation for Base SAS continues to be provided in HTML and PDF formats.

Check out the SAS analytical products 14.2 documentation and let me know what you think. Do you like something that I didn't mention? Post a comment.

tags: Tips and Techniques

The post Five reasons to check out the new SAS analytical documentation appeared first on The DO Loop.