In my new book, End-to-End Data Science with SAS: A Hands-On Programming Guide, I use the 1.5 IQR rule to adjust multiple variables.  This program utilizes a macro that loops through a list of variables to make the necessary adjustments and creates an output data set.

One of the most popular ways to adjust for outliers is to use the 1.5 IQR rule. This rule is very straightforward and easy to understand. For any continuous variable, you can simply multiply the interquartile range by the number 1.5. You then add that number to the third quartile. Any values above that threshold are suspected as being an outlier. You can also perform the same calculation on the low end. You can subtract the value of IQR x 1.5 from the first quartile to find low-end outliers.

The process of adjusting for outliers can be tedious if you have several continuous variables that are suspected as having outliers. You will need to run PROC UNIVARIATE on each variable to identify its median, 25th percentile, 75th percentile, and interquartile range. You would then need to develop a program that identifies values above and below the 1.5 IQR rule thresholds and overwrite those values with new values at the threshold.

The following program is a bit complicated, but it automates the process of adjusting a list of continuous variables according to the 1.5 IQR rule. This program consists of three distinct parts:

1. Create a BASE data set that excludes the variables contained in the &outliers global macro. Then create an OUTLIER data set that contains only the unique identifier ROW_NUM and the outlier variables.
2. Create an algorithm that loops through each of the outlier variables contained in the global variable &outliers and apply the 1.5 IQR rule to cap each variable’s range according to its unique 1.5 IQR value.
3. Merge the newly restricted outlier variable with the BASE data set.
```/*Step 1: Create BASE and OUTLIER data sets*/   %let outliers = /*list of variables*/;   DATA MYDATA.BASE; SET MYDATA.LOAN_ADJUST (DROP=&amp;outliers.); ROW_NUM = _N_; RUN;   DATA outliers; SET MYDATA.LOAN_ADJUST (KEEP=&amp;outliers. ROW_NUM); ROW_NUM = _N_; RUN;    /*Step 2: Create loop and apply the 1.5 IQR rule*/   %MACRO loopit(mylist); %LET n = %SYSFUNC(countw(&amp;mylist));   %DO I=1 %TO &amp;n; %LET val = %SCAN(&amp;mylist,&amp;I);   PROC UNIVARIATE DATA = outliers ; VAR &amp;val.; OUTPUT OUT=boxStats MEDIAN=median QRANGE=iqr; run;   data _NULL_; SET boxStats; CALL symput ('median',median); CALL symput ('iqr', iqr); run;   %PUT &amp;median; %PUT &amp;iqr;   DATA out_&amp;val.(KEEP=ROW_NUM &amp;val.); SET outliers;   IF &amp;val. ge &amp;median + 1.5 * &amp;iqr THEN &amp;val. = &amp;median + 1.5 * &amp;iqr; RUN;   /*Step 3: Merge restricted value to BASE data set*/   PROC SQL; CREATE TABLE MYDATA.BASE AS SELECT * FROM MYDATA.BASE AS a LEFT JOIN out_&amp;val. as b on a.ROW_NUM = b.ROW_NUM; QUIT;   %END; %MEND;   %LET list = &amp;outliers; %loopit(&amp;list);```

Notes on the outlier adjustment program:

• A macro variable is created that contains all of the continuous variables that are suspected of having outliers.
• Separate data sets were created: one that contains all of the outlier variables and one that excludes the outlier variables.
• A macro program is developed to contain the process of looping through the list of variables.
• A macro variable (n) is created that counts the number of variables contained in the macro variable.
• A DO loop is created that starts at the first variable and runs the following program on each variable contained in the macro variable.
• PROC UNIVARIATE identifies the variable’s median and interquartile range.
• A macro variable is created to contain the values of the median and interquartile range.
• A DATA step is created to adjust any values that exceed the 1.5 IQR rule on the high end and the low end.
• PROC SQL adds the adjusted variables to the BASE data set.

This program might seem like overkill to you. It could be easier to simply adjust outlier variables one at a time. This is often the case; however, when you have a large number of outlier variables, it is often beneficial to create an algorithm to transform them efficiently and consistently

Adjusting outliers with the 1.5 IQR rule was published on SAS Users.

While skewness and kurtosis are not as often calculated and reported as mean and standard deviation, they can be useful at times. Skewness is the 3rd moment around the mean, and characterizes whether the distribution is symmetric (skewness=0). Kurtosis is a function of the 4th central moment, and characterizes peakedness, where the normal distribution has a value of 3 and smaller values correspond to thinner tails (less peakedness).

Some packages (including SAS) subtract three from the kurtosis, so that the normal distribution has a kurtosis of 0 (this is sometimes called excess kurtosis.

R
`library(moments)library(lattice)ds = read.csv("http://www.math.smith.edu/r/data/help.csv")ds\$gender = ifelse(ds\$female==1, "female", "male")densityplot(~ cesd, data=ds, groups=gender, auto.key=TRUE)`

We see that the distribution of CESD scores is skewed with a long left tail, and appears somewhat less peaked than a normal distribution. This is confirmed by the actual statistics:
`> with(ds, tapply(cesd, gender, skewness))    female       male -0.4906171 -0.2464390 > with(ds, tapply(cesd, gender, kurtosis))    # kurtosis  female     male 2.748968 2.547061 > with(ds, tapply(cesd, gender, kurtosis))-3  # excess kurtosis    female       male -0.2510318 -0.4529394 `

SAS
SAS includes much detail on the moments and other statistics in the output from proc univariate. As usual, the quantity of output can be off-putting for new users and students. Here we extract the moments we need with the ODS system. We also generate kernel density estimates roughly analogous to the densityplot() results shown above.
`ods output moments = cesdmoments;proc univariate data="c:\book\help.sas7bdat";  class female;  var cesd;  histogram cesd / kernel;run;proc print data=cesdmoments;   where label1 = "Skewness";  var female label1 nvalue1 label2 nvalue2;run;`

With the result:
`Obs   FEMALE    Label1         nValue1    Label2         nValue2  4     0      Skewness      -0.247513   Kurtosis      -0.442010 10     1      Skewness      -0.497620   Kurtosis      -0.204928`

We note that the default is to produce unbiased (REML) estimates, rather than the biased method of moments estimator produced by the kurtosis() function (and that SAS presents the excess kurtosis).

In Chapter 2 of the book "Bayesian Computation with R" by Jim Albert, the philosoph of Bayesian statistics is introduced using an example where a parameter regarding proportion of heavy sleeping college students, i.e. those have more than 8 hours sleep a day, is studied. According to Bayes'Rule, the posterior is proportional to the product of prior probability and likelihood given prior, where the proportion factor is the likelihood of data over the whole parameter space. As we can see, the key point here is the specification of prior probability.

In order to illustrate this idea, Jim introduced three types of prior probability:
1. Prior probability over a discrete grid of choice, given a weight which serves as hyper-parameter;
2. Prior probability by a chosen probability. In most cases, the chosen probability is a so called conjugate prior becuase the posterior is the same family as the prior. In the book, a beta probability is chosen and it is conjugate in the sense that the posterior is also a beta distribution, only with different parameter;
3. A dense grid of prior probability choices which is called histogram prior;

Using a discrete grid of prior choices, we first generate a list of prior probabilities, in data set PRIOR, that we believe the true proportion of heavy sleepers should be, and assign a weight, i.e. the magnitude of credibility, for each choice in a data set P, which serves as hyper-parameter. We then consolidate the two data sets into one and plot the prior against its corresponding weight.

``````
/* Chapter 2 of Bayesian Computation with SAS */
/* In this chapter, Jim talked about applying
Bayesian Rule to learn posterior probability
using an example where the proportion of
students sleeping over 8 hours a day is studies.
Bayes Rule:
Posterior(p|data) \prop Prior(p)* likelihood(data|p)
*/
options fullstimer formchar='|----|+|---+=|-/\<>*'  error=3;
/***************************************************/
data prior;
input prior @@;
cards;
2 4 8 8 4 2 1 1 1 1
;
run;

data p;
do p=0.05 to 0.95 by 0.1;
output;
end;
run;

data p;
retain sum_prior 0;
do until (eof1);
set prior end=eof1;
sum_prior+prior;
end;
do until (eof2);
merge p  prior  end=eof2;
prior=prior/sum_prior;
drop sum_prior;
output;
end;
run;

goptions reset=all;
symbol  interpol=Needle  value=none  line=1 color=black;
axis1   label=(angle=90  'Prior Probability') order=(0 to 0.3 by 0.1) minor=none;
axis2   label=('p')   order=(0 to 1 by 0.2)  minor=none;
proc gplot data=p;
plot prior*p /vaxis=axis1  haxis=axis2;
run;quit;
``````

Next, we calculate the posterior probability given prior and likelihood conditional on prior. In the code below, we observe 11 heavy sleepers out of a total sample of 27. The first DATA STEP calculates the likelihood conditional on each prior choice, the posterior is output in the data set P2, where the normalization factor is the sum of conditional likelihood, scaled by max likelihood for numerical stability. You can observe the shape of posterior against hyper-parameter using GPLOT. The INTERPOL=NEEDLE option in SYMBOL statement minic the histogram style line of PLOT function in R.

``````
%let n0=16;
%let n1=11;

data p;
set p  end=eof1  nobs=ntotal;
if p>0 & p<1 then do;
log_like=log(p)*&n1 + log(1-p)*&n0;
end;
else
log_like=-999*(p=0)*(&n1>0) + (p=1)*(&n0>0);
log_posterior=log(prior) + log_like;
if log_posterior>max_like then max_like=log_posterior;
run;

proc means data=p  noprint;
var log_posterior;
output out=_max  max(log_posterior)=max;
run;

data p2;
set _max;
sum_post=0;
do until (eof);
set p  end=eof;
sum_post+exp(log_posterior-max);
end;
do until (eof2);
set p end=eof2;
posterior=exp(log_posterior-max)/sum_post;
output;
keep p prior  posterior;
end;
run;

goptions reset=all;
symbol value=NONE interpol=Needle;
axis1  label=(angle=90  'Posterior Probability');
axis2  label=('p');
proc gplot data=p2;
plot  posterior*p /vaxis=axis1  haxis=axis2;
run;quit;

``````

After tried discrete grid of prior, beta conjugate prior is introduced. Jim mentioned that by matching the believed 50% and 90% percentiles, the parameter of prior beta distribution can be obtained by try-and-error approach. We note that the believed percentiles data indicates that the beta distribution skewed to 0, so that parameter alpha is less than parameter beta and we can use a binary search to nail down the values, which I came down to approximately 3.4 and 7.5. But I will use the {3.4, 7.4} vector as in the book. Since beta distribution is a conjugate prior for proportion, the computation is very easy. The curves for prior, likelihood and posterior are visualized.

``````
/* beta prior */

%let a=3.4;
%let b=7.4;
%let s=11;
%let f=16;

data p;
do i=1 to 500;
p=i/500;
prior=pdf('beta', p, &a, &b);
like=pdf('beta', p, &s, &f);
post=pdf('beta', p, &a+&s, &b+&f);
drop i;
output;
end;
run;

goptions reset=all;
symbol1 interpol=j  color=red  width=1  value=none;
symbol2 interpol=j  color=blue width=2  value=none;
symbol3 interpol=j  color=gree width=3  value=none;
axis1   label=(angle=90  'Density')  order=(0 to 5 by 1) minor=none;
axis2   label=('p')      order=(0 to 1 by 0.2)  minor=none;;
legend label=none   position=(top right inside)  mode=share;
proc gplot data=p;
plot post*p   prior*p  like*p  /overlay
vaxis=axis1
haxis=axis2
legend=legend;
run;quit;

``````

Since we obtain closed form for posterior distribution, we can conduct inference based on exact calculation; alternatively, we can use simulated random sample.
``````
/* generate 1000 random obs from posterior distribution */
data rs;
call streaminit(123456);
do i=1 to 1000;
x=rand('beta', &a+&s, &b+&f);
output;
end;
run;

proc univariate data=rs  noprint;
var x;
histogram /midpoints=0 to 0.8 by 0.05
cbarline=blue  cfill=white
outhistogram=hist;
title "Histogram of ps";
run;
title;

axis1 label=(angle=90 "Frequency");
proc gbarline data=hist;
bar _midpt_/discrete  sumvar=_count_  space=0  axis=axis1 ;
title "Histogram of ps";
run;
title;

/* hist prior */
data midpt;
do pt=0.05 to 0.95 by 0.1;
output;
end;
run;
``````

In the book, Jim also mentioned using so called Histogram prior. The Histogram prior is a densed-version of discrete prior, with the distance between discrete choices filled in with dense choices of the same value, separated only by a small distance. It provides a more refined description of prior believe than the discrete prior. We plot the prior and posterior as well as random sample from posterior to give a visual impression. Note that we use PROC STDIZE method=SUM to normalize prior variable.

``````
data prior;
input prior @@;
datalines;
2 4 8 8 4 2 1 1 1 1
;
run;
proc stdize data=prior  method=sum  out=prior; var prior; run;

data _null_;
set midpt nobs=ntotal1;
call symput('nmid', ntotal1);
set prior nobs=ntotal2;
call symput('nprior', ntotal2);
stop;
run;

data histprior;
array _midpt{&nmid} _temporary_;
array _prior{&nprior} _temporary_;
if _n_=1 then do;
k=1;
do until (eof1);
set midpt  end=eof1;
_midpt[k]=pt; if k=2 then lo=_midpt[k]-_midpt[k-1];
k+1;
end;
do k=1 to &nmid;
_midpt[k]=round(_midpt[k]-lo/2, 0.00001);
put _midpt[k]=;
end;
k=1;
do until (eof2);
set prior  end=eof2;
_prior[k]=prior;
put k=  _prior[k]=;
if k<&nprior then k=k+1;
end;
drop k  lo;
end;

do p=1/500 to 1 by 1/500;
sk=0;
do k=1 to &nmid;
if p>=_midpt[k] then sk=min(&nmid, sk+1);
end;
histprior=_prior[sk];
output;
end;
stop;
run;

goptions reset=all;
symbol interpol=hiloj;
axis1  label=(angle=90  "Prior Density")  order=(0 to 0.3 by 0.05);
axis2  label=("p")                        order=(0 to 1   by 0.2);
proc gplot data=histprior;
plot histprior*p /vaxis=axis1  haxis=axis2;
run;quit;

data histposterior;
set histprior;
like=pdf('BETA',p,  &s+1, &f+1);
post=like*histprior;
run;

%put &s;
goptions reset=all;
symbol interpol=hiloj;
axis1 label=(angle=90  "Posterior Density")  order=(0 to 1 by 0.2);
proc gplot data=histposterior;
plot   post*p/vaxis=axis1;
run;quit;

proc means data=histposterior  sum;
var post;
run;

proc stdize data=histposterior  method=sum  out=data1;
var post;
run;
proc means data=data1  sum;
var post;
run;
``````

In R, the SAMPLE function is able to do a URS sampling with given probability. SAS's PROC SURVEYSELECT seems can't do this with method=URS. But URS sampling with given probability is very easy to code in SAS DATA STEP with the help of FORMAT.

``````
data fmt;
merge histposterior(rename=(post=start))
histposterior(firstobs=2   in=_2    rename=(post=end));
if _2;
run;

data fmt;
set data1   end=eof;
retain start;
retain fmtname 'ursp'  type 'n';
if _n_=1 then start=post;
else do;
end=start+post;  label=p;
keep  fmtname  start  end  label;
output;
start=end;
end;
if eof then do;
hlo='O';
label=.;
output;
end;
run;

proc sort data=fmt  nodupkey  out=fmt;
by start end;
run;

proc format cntlin=fmt; run;

data  samp;
do i=1 to 1000;
p=input(put(ranuni(78686), ursp.), best.);
output;
end;
run;

proc univariate data=samp noprint;
title "Histogram of ps";
histogram  p/midpoints=0.15 to 0.65 by 0.05
outhist=samp_hist
vaxislabel="Frequency"  vscale=count;
run;
title;
``````

Once we obtain posterior distribution of a variable, we can easily predict the value of some statistics given new samples. In the book, depending on the choice of weighting density, two types of predictions are discussed: 1. prior prediction; 2. posterior prediction. The calculation is straight-forward given formula.

``````
/* prediction */
data prior;
input prior @@;
cards;
2 4 8 8 4 2 1 1 1 1
;
run;

data p;
do p=0.05 to 0.95 by 0.1;
output;
end;
run;

data p;
retain sum_prior 0;
do until (eof1);
set prior end=eof1;
sum_prior+prior;
end;
do until (eof2);
merge p  prior  end=eof2;
prior=prior/sum_prior;
drop sum_prior;
output;
end;
run;

/* prediction using discrete prior */
%let m=20;
data pred;
do ys=0 to &m;
pred=0;
do i=1 to 10;
set p   point=i;
pred = pred + prior * pdf('BINOMIAL', ys, p, &m);
put p=  prior=  pred=  ;
end;
output;
end;
stop;
run;

/* prediction using beta(a, b) prior */
%let m=20;
%let a=3.4;
%let b=7.4;
data pred;
do ys=0 to &m;
pred=comb(&m, ys) *exp(logbeta(&a + ys, &b+&m -ys) - logbeta(&a, &b));
output;
end;
run;

/* prediction by simulation */
%let n=1000;
data pred;
do i=1 to &n;
p=rand('BETA', &a, &b);
ys=rand('BINOMIAL', p, &m);
drop i;
output;
end;
run;

proc freq data=pred  noprint;
table ys /out=y_freq;
run;
/*
proc gbarline data=y_freq;
bar y /discrete sumvar=percent;
run;quit;
*/
data y_freq;
set y_freq;
percent=percent/100;
label ys='y';
run;

goptions reset=all  border;
axis1  label=(angle=90  "Predictive Probability")  order=(0 to 0.14 by 0.02);
axis2  label=("y")   order=(0 to 16) offset=(15 points, 15 points) minor=none;
symbol  interpol=needle;
proc gplot data=y_freq;
plot  percent * ys/vaxis=axis1  haxis=axis2  ;
run;quit;

/* summarize discrete outcome with given coverage probability */
/* input dsn is y_freq from PROC FREQ. The following can be used as gauge.
data y_freq;
input y percent;
datalines;
0  0.013
1  0.032
2  0.065
3  0.103
4  0.102
5  0.115
6  0.114
7  0.115
8  0.095
9  0.083
10 0.058
11 0.036
12 0.029
13 0.014
14 0.015
15 0.006
16 0.005
;
run;
*/
%let covprob=0.9;
proc sort data=y_freq  out=y_freq2;
by  percent;
run;

data y_freq2;
set y_freq2;
retain cumprob  0;
cumprob+percent;
if cumprob<(1-&covprob) then select=0;
else select=1;
run;

proc means data=y_freq2  sum;
where select=1;
var percent;
run;
``````

﻿In Summary:

0. There different choices for specifying Prior believe that can be incorporated into Bayesian computation:
a.) A discrete grid of possible values, each value with some probability
b.) An appropriate distribution, where the parameters is determined from existing information
c.) A dense grid of possible values that span the whole range

1. We demonstrated calculation under these three scenarios and their impact on posterior distribution of statistics of interests;

2. We can conduct Sampling With Replacement with given sampling probability by using FORMAT and uniform random variable generator

3. We demonstrated that prediction in a Bayesian framework is straighforward, but depending on whether we use prior distribution or posterior distribution

In Chapter 2 of the book "Bayesian Computation with R" by Jim Albert, the philosoph of Bayesian statistics is introduced using an example where a parameter regarding proportion of heavy sleeping college students, i.e. those have more than 8 hours sleep a day, is studied. According to Bayes'Rule, the posterior is proportional to the product of prior probability and likelihood given prior, where the proportion factor is the likelihood of data over the whole parameter space. As we can see, the key point here is the specification of prior probability.

In order to illustrate this idea, Jim introduced three types of prior probability:
1. Prior probability over a discrete grid of choice, given a weight which serves as hyper-parameter;
2. Prior probability by a chosen probability. In most cases, the chosen probability is a so called conjugate prior becuase the posterior is the same family as the prior. In the book, a beta probability is chosen and it is conjugate in the sense that the posterior is also a beta distribution, only with different parameter;
3. A dense grid of prior probability choices which is called histogram prior;

Using a discrete grid of prior choices, we first generate a list of prior probabilities, in data set PRIOR, that we believe the true proportion of heavy sleepers should be, and assign a weight, i.e. the magnitude of credibility, for each choice in a data set P, which serves as hyper-parameter. We then consolidate the two data sets into one and plot the prior against its corresponding weight.

``/* Chapter 2 of Bayesian Computation with SAS *//* In this chapter, Jim talked about applying    Bayesian Rule to learn posterior probability   using an example where the proportion of    students sleeping over 8 hours a day is studies.   Bayes Rule:     Posterior(p|data) \prop Prior(p)* likelihood(data|p)*/options fullstimer formchar='|----|+|---+=|-/\<>*'  error=3;/***************************************************/data prior;     input prior @@;cards;2 4 8 8 4 2 1 1 1 1;run;data p;     do p=0.05 to 0.95 by 0.1;     output;  end;run;data p;     retain sum_prior 0;     do until (eof1);        set prior end=eof1;        sum_prior+prior;     end;     do until (eof2);        merge p  prior  end=eof2;        prior=prior/sum_prior;        drop sum_prior;        output;     end;run;goptions reset=all;symbol  interpol=Needle  value=none  line=1 color=black;axis1   label=(angle=90  'Prior Probability') order=(0 to 0.3 by 0.1) minor=none;axis2   label=('p')   order=(0 to 1 by 0.2)  minor=none;proc gplot data=p;      plot prior*p /vaxis=axis1  haxis=axis2;run;quit;``

Next, we calculate the posterior probability given prior and likelihood conditional on prior. In the code below, we observe 11 heavy sleepers out of a total sample of 27. The first DATA STEP calculates the likelihood conditional on each prior choice, the posterior is output in the data set P2, where the normalization factor is the sum of conditional likelihood, scaled by max likelihood for numerical stability. You can observe the shape of posterior against hyper-parameter using GPLOT. The INTERPOL=NEEDLE option in SYMBOL statement minic the histogram style line of PLOT function in R.

``%let n0=16;%let n1=11;data p;           set p  end=eof1  nobs=ntotal;     if p>0 & p<1 then do;        log_like=log(p)*&n1 + log(1-p)*&n0;        end;     else         log_like=-999*(p=0)*(&n1>0) + (p=1)*(&n0>0);        log_posterior=log(prior) + log_like;      if log_posterior>max_like then max_like=log_posterior;     run;proc means data=p  noprint;     var log_posterior;     output out=_max  max(log_posterior)=max;run;data p2;     set _max;     sum_post=0;     do until (eof);        set p  end=eof;        sum_post+exp(log_posterior-max);              end;     do until (eof2);        set p end=eof2;        posterior=exp(log_posterior-max)/sum_post;        output;        keep p prior  posterior;     end;run;goptions reset=all;symbol value=NONE interpol=Needle;axis1  label=(angle=90  'Posterior Probability');axis2  label=('p');proc gplot data=p2;      plot  posterior*p /vaxis=axis1  haxis=axis2;run;quit; ``

After tried discrete grid of prior, beta conjugate prior is introduced. Jim mentioned that by matching the believed 50% and 90% percentiles, the parameter of prior beta distribution can be obtained by try-and-error approach. We note that the believed percentiles data indicates that the beta distribution skewed to 0, so that parameter alpha is less than parameter beta and we can use a binary search to nail down the values, which I came down to approximately 3.4 and 7.5. But I will use the {3.4, 7.4} vector as in the book. Since beta distribution is a conjugate prior for proportion, the computation is very easy. The curves for prior, likelihood and posterior are visualized.

``/* beta prior */%let a=3.4;%let b=7.4;%let s=11;%let f=16;data p;      do i=1 to 500;      p=i/500;   prior=pdf('beta', p, &a, &b);   like=pdf('beta', p, &s, &f);   post=pdf('beta', p, &a+&s, &b+&f);   drop i;   output;   end;run;goptions reset=all;symbol1 interpol=j  color=red  width=1  value=none;symbol2 interpol=j  color=blue width=2  value=none;symbol3 interpol=j  color=gree width=3  value=none;axis1   label=(angle=90  'Density')  order=(0 to 5 by 1) minor=none;axis2   label=('p')      order=(0 to 1 by 0.2)  minor=none;;legend label=none   position=(top right inside)  mode=share;proc gplot data=p;      plot post*p   prior*p  like*p  /overlay                                        vaxis=axis1                                        haxis=axis2                                        legend=legend;run;quit;``

Since we obtain closed form for posterior distribution, we can conduct inference based on exact calculation; alternatively, we can use simulated random sample.
``/* generate 1000 random obs from posterior distribution */data rs;      call streaminit(123456);      do i=1 to 1000;         x=rand('beta', &a+&s, &b+&f);   output;   end;run;proc univariate data=rs  noprint;     var x;     histogram /midpoints=0 to 0.8 by 0.05                  cbarline=blue  cfill=white                  outhistogram=hist;     title "Histogram of ps";run;title;axis1 label=(angle=90 "Frequency");proc gbarline data=hist;      bar _midpt_/discrete  sumvar=_count_  space=0  axis=axis1 ;   title "Histogram of ps";run;title;/* hist prior */data midpt;     do pt=0.05 to 0.95 by 0.1;        output;     end;run;``

In the book, Jim also mentioned using so called Histogram prior. The Histogram prior is a densed-version of discrete prior, with the distance between discrete choices filled in with dense choices of the same value, separated only by a small distance. It provides a more refined description of prior believe than the discrete prior. We plot the prior and posterior as well as random sample from posterior to give a visual impression. Note that we use PROC STDIZE method=SUM to normalize prior variable.

``data prior;      input prior @@;datalines;2 4 8 8 4 2 1 1 1 1;run;proc stdize data=prior  method=sum  out=prior; var prior; run;data _null_;     set midpt nobs=ntotal1;     call symput('nmid', ntotal1);     set prior nobs=ntotal2;     call symput('nprior', ntotal2);     stop;run;data histprior;      array _midpt{&nmid} _temporary_;      array _prior{&nprior} _temporary_;      if _n_=1 then do;         k=1;         do until (eof1);             set midpt  end=eof1;            _midpt[k]=pt; if k=2 then lo=_midpt[k]-_midpt[k-1];            k+1;         end;         do k=1 to &nmid;            _midpt[k]=round(_midpt[k]-lo/2, 0.00001);            put _midpt[k]=;         end;          k=1;         do until (eof2);            set prior  end=eof2;            _prior[k]=prior;            put k=  _prior[k]=;            if k<&nprior then k=k+1;              end;         drop k  lo;         end;         do p=1/500 to 1 by 1/500;         sk=0;         do k=1 to &nmid;            if p>=_midpt[k] then sk=min(&nmid, sk+1);          end;         histprior=_prior[sk];         output;     end;     stop;run;goptions reset=all;symbol interpol=hiloj;axis1  label=(angle=90  "Prior Density")  order=(0 to 0.3 by 0.05);axis2  label=("p")                        order=(0 to 1   by 0.2);proc gplot data=histprior;      plot histprior*p /vaxis=axis1  haxis=axis2;run;quit;data histposterior;     set histprior;     like=pdf('BETA',p,  &s+1, &f+1);     post=like*histprior;run;%put &s;goptions reset=all;symbol interpol=hiloj;axis1 label=(angle=90  "Posterior Density")  order=(0 to 1 by 0.2);proc gplot data=histposterior;      plot   post*p/vaxis=axis1;run;quit;proc means data=histposterior  sum;     var post;run;proc stdize data=histposterior  method=sum  out=data1;      var post;run;proc means data=data1  sum;      var post;run;``

In R, the SAMPLE function is able to do a URS sampling with given probability. SAS's PROC SURVEYSELECT seems can't do this with method=URS. But URS sampling with given probability is very easy to code in SAS DATA STEP with the help of FORMAT.

``data fmt;     merge histposterior(rename=(post=start))       histposterior(firstobs=2   in=_2    rename=(post=end));     if _2;run;data fmt;     set data1   end=eof;     retain start;     retain fmtname 'ursp'  type 'n';     if _n_=1 then start=post;     else do;        end=start+post;  label=p;        keep  fmtname  start  end  label;        output;        start=end;     end;     if eof then do;        hlo='O';        label=.;        output;                end;run;proc sort data=fmt  nodupkey  out=fmt;      by start end;run;proc format cntlin=fmt; run;data  samp;      do i=1 to 1000;         p=input(put(ranuni(78686), ursp.), best.);         output;      end;run;proc univariate data=samp noprint;     title "Histogram of ps";     histogram  p/midpoints=0.15 to 0.65 by 0.05                    outhist=samp_hist                    vaxislabel="Frequency"  vscale=count;      run;title;``

Once we obtain posterior distribution of a variable, we can easily predict the value of some statistics given new samples. In the book, depending on the choice of weighting density, two types of predictions are discussed: 1. prior prediction; 2. posterior prediction. The calculation is straight-forward given formula.

``/* prediction */data prior;     input prior @@;cards;2 4 8 8 4 2 1 1 1 1;run;data p;     do p=0.05 to 0.95 by 0.1;     output;  end;run;data p;      retain sum_prior 0;   do until (eof1);      set prior end=eof1;   sum_prior+prior;   end;   do until (eof2);      merge p  prior  end=eof2;   prior=prior/sum_prior;   drop sum_prior;   output;   end;run;/* prediction using discrete prior */%let m=20;data pred;        do ys=0 to &m;               pred=0;            do i=1 to 10;                set p   point=i;      pred = pred + prior * pdf('BINOMIAL', ys, p, &m);    put p=  prior=  pred=  ;   end;   output;     end;  stop;run;/* prediction using beta(a, b) prior */%let m=20;%let a=3.4;%let b=7.4;data pred;     do ys=0 to &m;       pred=comb(&m, ys) *exp(logbeta(&a + ys, &b+&m -ys) - logbeta(&a, &b));       output;     end;run;/* prediction by simulation */%let n=1000;data pred;        do i=1 to &n;      p=rand('BETA', &a, &b);   ys=rand('BINOMIAL', p, &m);   drop i;   output;    end;run;proc freq data=pred  noprint;        table ys /out=y_freq;run;/*proc gbarline data=y_freq;        bar y /discrete sumvar=percent;run;quit;*/data y_freq;        set y_freq;  percent=percent/100;  label ys='y';run;goptions reset=all  border;axis1  label=(angle=90  "Predictive Probability")  order=(0 to 0.14 by 0.02);axis2  label=("y")   order=(0 to 16) offset=(15 points, 15 points) minor=none;symbol  interpol=needle;proc gplot data=y_freq;        plot  percent * ys/vaxis=axis1  haxis=axis2  ;run;quit;/* summarize discrete outcome with given coverage probability *//* input dsn is y_freq from PROC FREQ. The following can be used as gauge.data y_freq;        input y percent;datalines;0  0.0131  0.0322  0.0653  0.1034  0.1025  0.1156  0.1147  0.1158  0.0959  0.08310 0.05811 0.03612 0.02913 0.01414 0.01515 0.00616 0.005;run;*/%let covprob=0.9;proc sort data=y_freq  out=y_freq2;        by  percent;run;data y_freq2;        set y_freq2;  retain cumprob  0;  cumprob+percent;  if cumprob<(1-&covprob) then select=0;  else select=1;run;proc means data=y_freq2  sum;       where select=1;    var percent;run;``

﻿In Summary:

0. There different choices for specifying Prior believe that can be incorporated into Bayesian computation:
a.) A discrete grid of possible values, each value with some probability
b.) An appropriate distribution, where the parameters is determined from existing information
c.) A dense grid of possible values that span the whole range

1. We demonstrated calculation under these three scenarios and their impact on posterior distribution of statistics of interests;

2. We can conduct Sampling With Replacement with given sampling probability by using FORMAT and uniform random variable generator

3. We demonstrated that prediction in a Bayesian framework is straighforward, but depending on whether we use prior distribution or posterior distribution

The book "Bayesian Computation with R" by Jim Albert is an easy to read entry level book on applied Bayesian Statistics. While the book was written for R users, it is not difficult to translate the languages between R and SAS and I believe it is a good way to show Bayesian capability of SAS. In the next several months, I am going to translate all R code in the book into SAS. Readers are encouraged to buy this book to understand what's behind the code. The posts here will only spend minimum effects to explain the statistics underlying while most resource will still be denoted to SAS coding.

This post will cover Chapter 1.

We first follow section 1.2.2, using PROC IMPORT to read in TAB deliminated file, which corresponds to read.table(file, sep='\t', header=T)  in R. We also designated PDF as Output Destination.
``````
/* Chapter 1 of Bayesian Computation with R */

%let folder=C:\Documents\SAS for Bayesian Compu with R;
%let datafolder=&folder\data;

libname data "&datafolder";
options missing='.' formchar = "|----|+|---+=|-/\<>*"  errors=2 fullstimer;

/*@@@@@@ Section 1.2.2 @@@@@*/
/*--> Read Tab Deliminated Data into SAS */
proc import datafile="&datafolder/studentdata.txt"  out=student dbms=dlm   replace;
datarow=2;       /* redundent with GETNAMES= statement */
guessingrows=2;  /* Only use first 2 rows of data to guess the type of data */
delimiter='09'x;  /* this is redudent if we specify DBMS=TAB in PROC IMPORT option */
getnames=YES;  /* This is the same as 'header=True' in R's table.read function */
run;

ods pdf file="&folder/Chapter1.pdf";
/*-->  To see Variable Names and print first row of this data
Use SPLIT="*" to specify the split character being "*", which controls
*/
proc print data=student(obs=1)  split="*"; run;
``````

SAS output looks like:
Section 1.2.3, summarize frequency tables and use Barplot to visualize the counts.

PROC FREQ directly corresponds to table() function in R, but provides much richer functionality. PROC GBARLINE is used to answer what barplot() does in R. The LABEL= statement in SAS is a very handy tool for annotation purpose for PRINT or GRAPH.

PROC MEANS somehow corresponds to summary() function to obtain summary statistics for quantitative variables. You can sepecify MAXDEC= to keep desired number of decimals in print out.
Histogram can be obtained in SAS via either PROC UNIVARIATE or newly introduced PROC SGPLOT. PROC UNIVARIATE provides more control though. As shown below, you can specify desired mid-points while you can't do so in SGPLOT. On the other hand, SGPLOT is very convenient if you want to overlay density curve, either a normal fit or kernel one.
``````

/*@@@@ section 1.2.3 @@@*/
/*--> To summarize and graph a single Batch */
proc freq data=student;
table Drink;
run;

/*--> Barplot on single variable */
title "Original idea, summarize and plot";
proc freq data=student  noprint;
table Drink /out=t;
run;
goptions reset=all noborder;
proc gbarline data=t;
bar  Drink /discrete space=4  sumvar=count;
run;quit;
title;
/*--> Actually, in SAS, it is better to directly use
PROC GBARLINE */
goptions reset=all noborder;
title "In SAS, PROC GBARLINE directly summarize";
proc gbarline data=student;
bar Drink /discrete space=3 ;
run;quit;
title;

data student;
set student;
hours_of_sleep=WakeUp-ToSleep;
/* Label can be useful for annotation purpose */
label hours_of_sleep="Hours Of Sleep";
run;

proc means data=student
/* Request same statistics as in the book */
min q1 median q3 max  nmiss
/* MAXDEC= specifies keeping 2 decimals */
maxdec=2;
var hours_of_sleep;
run;

/*--> Histogram using PROC UNIVARIATE */
proc univariate data=student  noprint;
var hours_of_sleep;
histogram /midpoints=2.5 to 12.5 by 1;
run;

/* new SGPLOT is a handy way to draw histogram and density */
proc sgplot data=student;
histogram hours_of_sleep/scale=count;
density   hours_of_sleep/scale=count  type=normal;
run;
``````

Compare SAS outputs:

Boxplot to visualize distribution of  quantitative variables by some classification variable. Here, formula such as var1*var2 style corresponds to var1~var2 style in R.
``````

/*@@@ section 1.2.4 @@@@*/
/*---> Compare Batches using Boxplot */
proc sort data=student out=studentsorted; by Gender; run;
proc boxplot data=studentsorted;
plot  hours_of_sleep*Gender;
run;

proc means data=student
min q1 median q3 max nmiss
nway  maxdec=2;
class Gender;
var  Haircut;
run;
``````

SAS output looks like:

R has built-in jitter function, however, we have to make our own in SAS. This verision doesn't apply Fuzzy yet, only for demo purpose.
``````

%macro jitter(var, newname, data=last, factor=1, amount=);
/* follow the JITTER function in R, no fuzzy applied yet .
if amount is given, then use this value for disturbance
else if amount is 0, then use the range of &var for disturbance
else if amount is NULL, then use the smallest difference among
distinct values of &var for disturbance. if all values are
the same, then use the value obtained from range

if range of &var is 0, then use the lower range for disturbance
otherwise
*/
%let blank=%str( );
%if &data=' ' %then %let data=last;
%local fid;

data _null_;
fid=round(ranuni(0)*10000, 1);
call symput('fid', compress(fid));
run;
proc means data=&data  noprint;
var &var;
output  out=_util&fid
range(&var)=range
min(&var)=min
max(&var)=max;
run;
data _util&fid;
set _util&fid;
if range^=0 then z=range; else z=min;
if z=0 then z=1;
run;

%if %eval(&amount=&blank) %then %do;
proc sort data=&data.(keep=&var  where=(&var^=.))
out=_xuti&fid  nodupkey;
by &var;
run;
data _duti&fid;
set _xuti&fid  nobs=ntotal  end=eof;
array _x{2} _temporary_;
if ntotal=1 then do;
amount=&factor/50*abs(&var);
keep amount;
output;
stop;
end;
else do;
if _n_=1 then do;
_x[1]=&var; _x[2]=constant('BIG');
end;
else do;
_x[2]=min(_x[2], &var - _x[1]);
_x[1]=&var;
if eof then do;
amount=&factor/5*abs(_x[2]);
keep amount;
output;
end;
end;
end;
run;

%end;
%else %if %eval(&amount=0) %then %do;
data _duti&fid;
set _util&fid;
amount=&factor*z/50;
keep amount;
output;
run;
%end;
%else %do;
data _duti&fid;
amount=&amount;
keep amount;
output;
run;
%end;

proc sql noprint;
select name into :keepvars separated by ' '
from   sashelp.vcolumn
where  libname='WORK'
and  memname=%upcase("&data")
and  memtype="DATA"
;
quit;
data &data;
array _x{1} _temporary_;
if _n_=1 then do;
set _duti&fid;
_x[1]=amount;
end;
set &data;
&newname=&var + ranuni(0)*(2*_x[1])-_x[1];
label &newname="jitter(&var)";
keep &keepvars  &newname;
run;
proc datasets library=work nolist;
delete _duti&fid _xuti&fid _util&fid;
run;quit;
%mend;
``````

Now we can apply jitter function to the two variables studied in the book and check them out. For graphics, we use axis statement to define how we want the two axes to appear in print out, and use symbol statement to define the appearance of symbols for data. Here we ask for black circles, not connected (interpol=none).
``````

%jitter(hours_of_sleep, xhos, data=student, factor=1, amount=);
%jitter(ToSleep, xts, data=student, factor=1, amount=);

goptions border  reset=all;
axis1  label=("jitter(ToSleep)")
major=(height=2  number=5)   minor=none;
axis2  label=(angle=90  "jitter(Hours of Sleep")
major=(height=2  number=5)  minor=none;
symbol1 value=circle  interpol=none  color=black;
proc gplot data=student;
plot xhos*xts/haxis=axis1  vaxis=axis2;
run;quit;
``````

In order to overlay a fitted regression line, we have multiple ways to do it in SAS, and the Statistical Plot procedures are particularly handy to generate this type of figures. Compare the code and corresponding figures (in the same order of LINEPRINTER, ODS GRAPHICS, Manually Drawn, SGPLOT). For manually drawn figures, since we ask for connected dots in order to obtain a line overlay on circles, we have to sort the data by X-axis variable to make the line as straight as possible. The manually drawn figure is also the one closest to the R figure in the book.
``````

/* old school LINEPRINTER option*/
proc reg data=student  lineprinter;
title "Line Printer Style";
model   Hours_of_Sleep=ToSleep /noprint;
var     xhos  xts;
plot pred.*xts='X'  xhos*xts/overlay  symbol='o';
run;quit;
title;

/* Using ODS GRAPHICS */
ods select none; /* to suppress PROC REG output */
title "ODS Graphics";
ods graphics on;
proc reg data=student ;
model   Hours_of_Sleep=ToSleep;
var     xhos  xts;
output  out=student  pred=p;
ods select FitPlot;
run;quit;
ods graphics off;
title;
ods select all;

/* Following standard R approach, tedious in SAS */
proc sort data=student  out=studentsorted;
by ToSleep;
run;
goptions border  reset=all;
axis1  label=("jitter(ToSleep)")      order=(-2 to 6 by 2)  major=(height=2  number=5)   minor=none;
axis2  label=(angle=90  "jitter(Hours of Sleep")    offset=(, 2 pct)  major=(height=2  number=5)  minor=none;
symbol1 value=circle  interpol=none  color=black;
symbol2 value=none  interpol=line  color=black;
proc gplot data=studentsorted;
plot xhos*xts  p*xts/overlay  haxis=axis1  vaxis=axis2;
run;quit;

/* Using build-in REGRESSION capability of GPLOT by specifying
INTERPOL=R<><0><> option in SYMBOL statement.
In the first <>, you identify the type of regression:
In the second <>, you specify if intercept is omitted;
In the third <>, you specify type of CI:
CLM=CI of mean predicted value;
CLI=CI of individual points;
*/
goptions border reset=all;
axis1  label=("jitter(ToSleep)")      order=(-2 to 6 by 2)  major=(height=2  number=5)   minor=none;
axis2  label=(angle=90 "jitter(Hours of Sleep")    offset=(, 2 pct)  major=(height=2  number=5)  minor=none;
symbol1 value=circle  interpol=rlclm95  color=black  ci=blue  co=blue;
proc gplot data=student;
plot xhos*xts/regeqn /* haxis=axis1  vaxis=axis2*/;
run;quit;

/* New SGPLOT has build in regression line capability: REG, SPLINE, LOESS */
proc sgplot data=student;
scatter  x=xts  y=xhos;
reg  x=ToSleep  y=Hours_of_Sleep;
run;
``````

Now, to study the robustness of T-Statistics which relies on several assumptions, such as normality, homoskedasticity, etc. First, we need to define a "function" to calculate T-statistics like in the book.
``````

/* section 1.3 Robustness of T-Statistics */
/* section 1.3.2 Write a function to compute T-Statistics */
%macro tstatistics(vars, dsn=last, outdsn=, class=);
%if &class='' %then %let classstmt=;
%else %let classstmt=class &class;

%let nvars=%sysfunc(count(&vars, ' '));

%let var1=%scan(&vars, 1, ' ');
%let var2=%scan(&vars, 2, ' ');
proc means data=&dsn noprint  nway;
&classstmt;
var &vars;
output  out=tdata(where=(_STAT_ in ('STD', 'MEAN', 'N')));
run;
data &outdsn;
set tdata; by &class _TYPE_;
retain  m  n  mean1 mean2  std:;
select(compress(_STAT_));
when('N') do;
m=&var1;
n=&var2;
end;
when('MEAN') do;
mean1=&var1;
mean2=&var2;
end;
when('STD') do;
std1=&var1;
std2=&var2;
end;
otherwise;
end;
if last._TYPE_ then do;
sp=sqrt(((m-1)*std1**2 + (n-1)*std2**2)/(m+n-2));
t= (mean1-mean2)/(sp * sqrt(1/m + 1/n));
keep &class _TYPE_ m  n std:  mean:  sp t;
output;
end;
run;
%mend;
``````

Just like the "source(file)" command in R, in SAS, we have a counterpart called %include(or in abv. %inc). Suppose you have a SAS file called "tstatistics.sas" in some folder &folder, the following statement will read in the source code from this file. Since this is a SAS file, suffix ".sas" is not necessary.

``````
filename BCR "&folder";
%inc BCR(tstatistics);
``````

The following code pieces conduct the four simulate scenarios illustrated in the book and draw the last figure.
``````
data test;
input x y;
cards;
1  5
4  4
3  7
6  6
5  10
;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=);

/* section 1.3.3 Monte Carlo study on the Robustness of T-Statistics */
%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
retain seed1 99787  seed2 76568;
do iter=1 to &Niter;
do j=1 to max(&n, &m);
if j<=&m then call rannor(seed1, x); else x=.;
if j<=&n then call rannor(seed2, y); else y=.;
keep iter x y;
output;
end;
end;
run;

%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);

data _null_;
set tdata2  end=eof;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof then do;
call symput('n_reject', n_reject/_n_);
end;
run;
%put &n_reject;

/* section 1.3.4 Study the robustness of T-Statistics */

%let alpha=0.1;
%let m=10;
%let n=10;
%let Niter=10000;
data test;
retain seed1 987687  seed2 76568;
array x[4];
array y[4];
do iter=1 to &Niter;
do j=1 to max(&n, &m);
do k=1 to 4;
select (k);
when(1) do;
if j<=&m then call rannor(seed1, x[k]); else x[k]=.;
if j<=&n then call rannor(seed2, y[k]); else y[k]=.;
end;
when(2) do;
if j<=&m then call rannor(seed1, x[k]); else x[k]=.;
if j<=&n then call rannor(seed2, y[k]); else y[k]=.;
y[k]=y[k]*10;
end;
when(3) do;
if j<=&m then x[k]=rand('T', 4); else x[k]=.;
if j<=&n then y[k]=rand('T', 4); else y[k]=.;
end;
otherwise do;
if j<=&m then call rannor(seed1, x[k]); else x[k]=.;
if j<=&n then call ranexp(seed2, y[k]); else y[k]=.;
x[k]=10+x[k]*2;  y[k]=y[k]/0.1;
end;
end;
end;
keep iter x1-x4 y1-y4;
output;
end;
end;
run;

%tstatistics(x1 y1, dsn=test, outdsn=tdata1, class=iter);

%tstatistics(x2 y2, dsn=test, outdsn=tdata2, class=iter);

%tstatistics(x3 y3, dsn=test, outdsn=tdata3, class=iter);

%tstatistics(x4 y4, dsn=test, outdsn=tdata4, class=iter);

data _null_;
n_reject=0;
do until (eof1);
set tdata1  end=eof1  nobs=n1;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof1 then do;
*call symput('n_reject', n_reject/_n_);
p_reject=n_reject/n1;
put "NOTE>> Prob(Rejection)=" p_reject  8.4;
end;
end;
n_reject=0;
do until (eof2);
set tdata2  end=eof2  nobs=n2;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof2 then do;
*call symput('n_reject', n_reject/_n_);
p_reject=n_reject/n2;
put "NOTE>> Prob(Rejection)=" p_reject  8.4;
end;
end;

do until (eof3);
set tdata3  end=eof3 nobs=n3;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof3 then do;
*call symput('n_reject', n_reject/_n_);
p_reject=n_reject/n3;
put "NOTE>> Prob(Rejection)=" p_reject  8.4;
end;
end;

do until (eof4);
set tdata4  end=eof4  nobs=n4;
retain n_reject 0;
if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;
if eof4 then do;
*call symput('n_reject', n_reject/_n_);
p_reject=n_reject/n4;
put "NOTE>> Prob(Rejection)=" p_reject  8.4;
end;
end;
run;

/* draw the overlay density curves of empirical T-statistics and T-distribution */
proc sort data=tdata2;
by t;
run;
data tdata2;
set tdata2;
dt=pdf('T', t, 18);
run;

ods select none;
ods output Controls=Controls;
ods output UnivariateStatistics=UniStat;
proc kde data=tdata2  ;
univar  t /out=density  method=snr  unistats;
run;
ods select all;

data density;
set density;
dt=pdf('T', value, 18);
run;

data _null_;
set UniStat;
if Descr='Bandwidth' then call symput ('bw', compress(round(t, 0.0001)));
run;

goptions reset=all border;

title "Comparing Empirical and Theoretical Densities";
legend label=none   position=(top right inside)  mode=share;
axis1  label=("N=&Niter Bandwidth=&bw")
order=(-4 to 8 by 2) offset=(1,1)
major=(height=2) minor=none;
axis2  label=(angle=90  "Density")
order=(0 to 0.4 by 0.1)  offset=(0, 0.1)
major=(height=2)  minor=none;
symbol1  interpol=join  color=black  value=none  width=3;
symbol2  interpol=join  color=grey   value=none  width=1;
proc gplot data=density;
plot density*value  dt*value/overlay  legend=legend
haxis=axis1  vaxis=axis2  ;
label dt='t(18)' density='Exact';
run;quit;

ods pdf close;
``````

Key Points Summary:

0. We can use FORMCHAR option to eliminate  weird characters

1. We can use PROC IMPORT to import deliminated data and it provided flexible options to control the output;

2. We can use PROC FREQ to summarize categorical data

3. We can use PROC MEANS and PROC UNIVARIATE to generate descriptive statistics for numerical data

4. We can study the dispersion of numerical variable by categorical variable using BOXPLOT

5. We provided a JITTER function for better graphical presentation of heavily overlapped data

6. We showed that there are multiple ways to visualize scatter plots together with fitted regression lines:

• 6.1 LINEPRINTER in PROC REG
• 6.2 Using PROC REG to obtain regression line and overlay it on the scatter plot
• 6.3 Using ODS Graphics in PROC REG
• 6.4 Using INTERPOL=R<><><> option in SYMBOL statement
• 6.5 Using PROC SGPLOT
7. We demonstrated how to conduct robust T-test in SAS following the code in the book.

The book "Bayesian Computation with R" by Jim Albert is an easy to read entry level book on applied Bayesian Statistics. While the book was written for R users, it is not difficult to translate the languages between R and SAS and I believe it is a good way to show Bayesian capability of SAS. In the next several months, I am going to translate all R code in the book into SAS. Readers are encouraged to buy this book to understand what's behind the code. The posts here will only spend minimum effects to explain the statistics underlying while most resource will still be denoted to SAS coding.

This post will cover Chapter 1.

We first follow section 1.2.2, using PROC IMPORT to read in TAB deliminated file, which corresponds to read.table(file, sep='\t', header=T)  in R. We also designated PDF as Output Destination.
``/* Chapter 1 of Bayesian Computation with R */%let folder=C:\Documents\SAS for Bayesian Compu with R;%let datafolder=&folder\data;libname data "&datafolder";options missing='.' formchar = "|----|+|---+=|-/\<>*"  errors=2 fullstimer;/*@@@@@@ Section 1.2.2 @@@@@*//*--> Read Tab Deliminated Data into SAS */proc import datafile="&datafolder/studentdata.txt"  out=student dbms=dlm   replace;      datarow=2;       /* redundent with GETNAMES= statement */   guessingrows=2;  /* Only use first 2 rows of data to guess the type of data */      delimiter='09'x;  /* this is redudent if we specify DBMS=TAB in PROC IMPORT option */   getnames=YES;  /* This is the same as 'header=True' in R's table.read function */run;ods pdf file="&folder/Chapter1.pdf";/*-->  To see Variable Names and print first row of this data    Use SPLIT="*" to specify the split character being "*", which controls    line breaks in column headings*/proc print data=student(obs=1)  split="*"; run;``

SAS output looks like:
Section 1.2.3, summarize frequency tables and use Barplot to visualize the counts.

PROC FREQ directly corresponds to table() function in R, but provides much richer functionality. PROC GBARLINE is used to answer what barplot() does in R. The LABEL= statement in SAS is a very handy tool for annotation purpose for PRINT or GRAPH.

PROC MEANS somehow corresponds to summary() function to obtain summary statistics for quantitative variables. You can sepecify MAXDEC= to keep desired number of decimals in print out.
Histogram can be obtained in SAS via either PROC UNIVARIATE or newly introduced PROC SGPLOT. PROC UNIVARIATE provides more control though. As shown below, you can specify desired mid-points while you can't do so in SGPLOT. On the other hand, SGPLOT is very convenient if you want to overlay density curve, either a normal fit or kernel one.
``/*@@@@ section 1.2.3 @@@*//*--> To summarize and graph a single Batch */proc freq data=student;     table Drink;run;/*--> Barplot on single variable */title "Original idea, summarize and plot";proc freq data=student  noprint;      table Drink /out=t;run;goptions reset=all noborder;proc gbarline data=t;      bar  Drink /discrete space=4  sumvar=count;run;quit;title;/*--> Actually, in SAS, it is better to directly use     PROC GBARLINE */goptions reset=all noborder;title "In SAS, PROC GBARLINE directly summarize";proc gbarline data=student;      bar Drink /discrete space=3 ;run;quit;title;data student;      set student;   hours_of_sleep=WakeUp-ToSleep;   /* Label can be useful for annotation purpose */   label hours_of_sleep="Hours Of Sleep"; run;proc means data=student            /* Request same statistics as in the book */           min q1 median q3 max  nmiss                        /* MAXDEC= specifies keeping 2 decimals */           maxdec=2;                              var hours_of_sleep;run;/*--> Histogram using PROC UNIVARIATE */proc univariate data=student  noprint;      var hours_of_sleep;   histogram /midpoints=2.5 to 12.5 by 1;run;/* new SGPLOT is a handy way to draw histogram and density */proc sgplot data=student;      histogram hours_of_sleep/scale=count;           density   hours_of_sleep/scale=count  type=normal;run;``

Compare SAS outputs:

Boxplot to visualize distribution of  quantitative variables by some classification variable. Here, formula such as var1*var2 style corresponds to var1~var2 style in R.
``/*@@@ section 1.2.4 @@@@*//*---> Compare Batches using Boxplot */proc sort data=student out=studentsorted; by Gender; run;proc boxplot data=studentsorted;      plot  hours_of_sleep*Gender;run;   proc means data=student              min q1 median q3 max nmiss              nway  maxdec=2;     class Gender;  var  Haircut;run;``

SAS output looks like:

R has built-in jitter function, however, we have to make our own in SAS. This verision doesn't apply Fuzzy yet, only for demo purpose.
``%macro jitter(var, newname, data=last, factor=1, amount=);/* follow the JITTER function in R, no fuzzy applied yet .    if amount is given, then use this value for disturbance   else if amount is 0, then use the range of &var for disturbance   else if amount is NULL, then use the smallest difference among        distinct values of &var for disturbance. if all values are         the same, then use the value obtained from range   if range of &var is 0, then use the lower range for disturbance   otherwise */%let blank=%str( );%if &data=' ' %then %let data=last;%local fid;data _null_;     fid=round(ranuni(0)*10000, 1);  call symput('fid', compress(fid));run;proc means data=&data  noprint;     var &var;  output  out=_util&fid            range(&var)=range            min(&var)=min            max(&var)=max;run;data _util&fid;     set _util&fid;  if range^=0 then z=range; else z=min;  if z=0 then z=1;run;%if %eval(&amount=&blank) %then %do;    proc sort data=&data.(keep=&var  where=(&var^=.))                 out=_xuti&fid  nodupkey;       by &var; run; data _duti&fid;      set _xuti&fid  nobs=ntotal  end=eof;         array _x{2} _temporary_;   if ntotal=1 then do;      amount=&factor/50*abs(&var);   keep amount;   output;   stop;   end;   else do;      if _n_=1 then do;          _x[1]=&var; _x[2]=constant('BIG');       end;   else do;               _x[2]=min(_x[2], &var - _x[1]);      _x[1]=&var;      if eof then do;         amount=&factor/5*abs(_x[2]);      keep amount;      output;      end;   end;   end;  run;    %end;%else %if %eval(&amount=0) %then %do;    data _duti&fid;      set _util&fid;   amount=&factor*z/50;   keep amount;   output; run;     %end;%else %do;    data _duti&fid;      amount=&amount;   keep amount;   output; run;%end;proc sql noprint;     select name into :keepvars separated by ' '  from   sashelp.vcolumn  where  libname='WORK'     and  memname=%upcase("&data")     and  memtype="DATA"  ;quit;data &data;       array _x{1} _temporary_;     if _n_=1 then do;     set _duti&fid;  _x[1]=amount;  end;     set &data;  &newname=&var + ranuni(0)*(2*_x[1])-_x[1];  label &newname="jitter(&var)";  keep &keepvars  &newname;run;proc datasets library=work nolist;     delete _duti&fid _xuti&fid _util&fid;run;quit;%mend; ``

Now we can apply jitter function to the two variables studied in the book and check them out. For graphics, we use axis statement to define how we want the two axes to appear in print out, and use symbol statement to define the appearance of symbols for data. Here we ask for black circles, not connected (interpol=none).
``%jitter(hours_of_sleep, xhos, data=student, factor=1, amount=); %jitter(ToSleep, xts, data=student, factor=1, amount=);goptions border  reset=all;axis1  label=("jitter(ToSleep)")           major=(height=2  number=5)   minor=none;axis2  label=(angle=90  "jitter(Hours of Sleep")            major=(height=2  number=5)  minor=none;symbol1 value=circle  interpol=none  color=black;proc gplot data=student;      plot xhos*xts/haxis=axis1  vaxis=axis2;run;quit;      ``

In order to overlay a fitted regression line, we have multiple ways to do it in SAS, and the Statistical Plot procedures are particularly handy to generate this type of figures. Compare the code and corresponding figures (in the same order of LINEPRINTER, ODS GRAPHICS, Manually Drawn, SGPLOT). For manually drawn figures, since we ask for connected dots in order to obtain a line overlay on circles, we have to sort the data by X-axis variable to make the line as straight as possible. The manually drawn figure is also the one closest to the R figure in the book.
``/* old school LINEPRINTER option*/proc reg data=student  lineprinter;      title "Line Printer Style";      model   Hours_of_Sleep=ToSleep /noprint;   var     xhos  xts;      plot pred.*xts='X'  xhos*xts/overlay  symbol='o';run;quit;title;/* Using ODS GRAPHICS */ods select none; /* to suppress PROC REG output */title "ODS Graphics";ods graphics on;proc reg data=student ;      model   Hours_of_Sleep=ToSleep;   var     xhos  xts;   output  out=student  pred=p;        ods select FitPlot; run;quit;ods graphics off;title;ods select all;/* Following standard R approach, tedious in SAS */proc sort data=student  out=studentsorted;      by ToSleep;run;goptions border  reset=all;axis1  label=("jitter(ToSleep)")      order=(-2 to 6 by 2)  major=(height=2  number=5)   minor=none;axis2  label=(angle=90  "jitter(Hours of Sleep")    offset=(, 2 pct)  major=(height=2  number=5)  minor=none;symbol1 value=circle  interpol=none  color=black;symbol2 value=none  interpol=line  color=black;proc gplot data=studentsorted;      plot xhos*xts  p*xts/overlay  haxis=axis1  vaxis=axis2;   run;quit;      /* Using build-in REGRESSION capability of GPLOT by specifying   INTERPOL=R<><0><> option in SYMBOL statement.       In the first <>, you identify the type of regression:          L=linear, Q=quadratic, C=cubic      In the second <>, you specify if intercept is omitted;      In the third <>, you specify type of CI:         CLM=CI of mean predicted value;         CLI=CI of individual points; */goptions border reset=all;axis1  label=("jitter(ToSleep)")      order=(-2 to 6 by 2)  major=(height=2  number=5)   minor=none;axis2  label=(angle=90 "jitter(Hours of Sleep")    offset=(, 2 pct)  major=(height=2  number=5)  minor=none;symbol1 value=circle  interpol=rlclm95  color=black  ci=blue  co=blue;proc gplot data=student;     plot xhos*xts/regeqn /* haxis=axis1  vaxis=axis2*/;run;quit;/* New SGPLOT has build in regression line capability: REG, SPLINE, LOESS */proc sgplot data=student;      scatter  x=xts  y=xhos;   reg  x=ToSleep  y=Hours_of_Sleep;run;``

Now, to study the robustness of T-Statistics which relies on several assumptions, such as normality, homoskedasticity, etc. First, we need to define a "function" to calculate T-statistics like in the book.
``/* section 1.3 Robustness of T-Statistics *//* section 1.3.2 Write a function to compute T-Statistics */%macro tstatistics(vars, dsn=last, outdsn=, class=);%if &class='' %then %let classstmt=;%else %let classstmt=class &class;%let nvars=%sysfunc(count(&vars, ' '));%let var1=%scan(&vars, 1, ' ');%let var2=%scan(&vars, 2, ' ');proc means data=&dsn noprint  nway;     &classstmt;  var &vars;  output  out=tdata(where=(_STAT_ in ('STD', 'MEAN', 'N')));run;data &outdsn;     set tdata; by &class _TYPE_;  retain  m  n  mean1 mean2  std:;  select(compress(_STAT_));          when('N') do;        m=&var1;     n=&var2;    end;    when('MEAN') do;              mean1=&var1;     mean2=&var2;    end;          when('STD') do;        std1=&var1;     std2=&var2;    end;    otherwise;  end;     if last._TYPE_ then do;        sp=sqrt(((m-1)*std1**2 + (n-1)*std2**2)/(m+n-2));     t= (mean1-mean2)/(sp * sqrt(1/m + 1/n));  keep &class _TYPE_ m  n std:  mean:  sp t;  output;  end;run;%mend;``

Just like the "source(file)" command in R, in SAS, we have a counterpart called %include(or in abv. %inc). Suppose you have a SAS file called "tstatistics.sas" in some folder &folder, the following statement will read in the source code from this file. Since this is a SAS file, suffix ".sas" is not necessary.

``filename BCR "&folder";%inc BCR(tstatistics);``

The following code pieces conduct the four simulate scenarios illustrated in the book and draw the last figure.
``data test;      input x y;cards;1  54  43  76  65  10;run;%tstatistics(x y, dsn=test, outdsn=tdata2, class=);/* section 1.3.3 Monte Carlo study on the Robustness of T-Statistics */%let alpha=0.1;%let m=10;%let n=10;%let Niter=10000;data test;      retain seed1 99787  seed2 76568;   do iter=1 to &Niter;      do j=1 to max(&n, &m);            if j<=&m then call rannor(seed1, x); else x=.;   if j<=&n then call rannor(seed2, y); else y=.;   keep iter x y;   output;   end;    end;run;%tstatistics(x y, dsn=test, outdsn=tdata2, class=iter);data _null_;      set tdata2  end=eof;   retain n_reject 0;   if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;   if eof then do;      call symput('n_reject', n_reject/_n_);   end;run;%put &n_reject;/* section 1.3.4 Study the robustness of T-Statistics */%let alpha=0.1;%let m=10;%let n=10;%let Niter=10000;data test;      retain seed1 987687  seed2 76568;   array x[4];   array y[4];      do iter=1 to &Niter;         do j=1 to max(&n, &m);      do k=1 to 4;      select (k);         when(1) do;                      if j<=&m then call rannor(seed1, x[k]); else x[k]=.;                            if j<=&n then call rannor(seed2, y[k]); else y[k]=.;      end;      when(2) do;                      if j<=&m then call rannor(seed1, x[k]); else x[k]=.;                      if j<=&n then call rannor(seed2, y[k]); else y[k]=.;                      y[k]=y[k]*10;      end;      when(3) do;                      if j<=&m then x[k]=rand('T', 4); else x[k]=.;                      if j<=&n then y[k]=rand('T', 4); else y[k]=.;        end;      otherwise do;                      if j<=&m then call rannor(seed1, x[k]); else x[k]=.;                      if j<=&n then call ranexp(seed2, y[k]); else y[k]=.;                      x[k]=10+x[k]*2;  y[k]=y[k]/0.1;      end;     end;              end;            keep iter x1-x4 y1-y4;            output;         end;      end;run;%tstatistics(x1 y1, dsn=test, outdsn=tdata1, class=iter);%tstatistics(x2 y2, dsn=test, outdsn=tdata2, class=iter);%tstatistics(x3 y3, dsn=test, outdsn=tdata3, class=iter);%tstatistics(x4 y4, dsn=test, outdsn=tdata4, class=iter);data _null_;      n_reject=0;      do until (eof1);         set tdata1  end=eof1  nobs=n1;         retain n_reject 0;         if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;         if eof1 then do;             *call symput('n_reject', n_reject/_n_);       p_reject=n_reject/n1;       put "NOTE>> Prob(Rejection)=" p_reject  8.4;         end;  end;  n_reject=0;  do until (eof2);         set tdata2  end=eof2  nobs=n2;         retain n_reject 0;         if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;         if eof2 then do;             *call symput('n_reject', n_reject/_n_);       p_reject=n_reject/n2;       put "NOTE>> Prob(Rejection)=" p_reject  8.4;         end;  end;  do until (eof3);         set tdata3  end=eof3 nobs=n3;         retain n_reject 0;         if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;         if eof3 then do;             *call symput('n_reject', n_reject/_n_);       p_reject=n_reject/n3;       put "NOTE>> Prob(Rejection)=" p_reject  8.4;         end;  end;  do until (eof4);         set tdata4  end=eof4  nobs=n4;         retain n_reject 0;         if abs(t)>quantile('T', 1-&alpha/2, n+m-2) then n_reject+1;         if eof4 then do;             *call symput('n_reject', n_reject/_n_);       p_reject=n_reject/n4;       put "NOTE>> Prob(Rejection)=" p_reject  8.4;         end;  end;run;/* draw the overlay density curves of empirical T-statistics and T-distribution */proc sort data=tdata2;      by t;run;data tdata2;      set tdata2;   dt=pdf('T', t, 18);run;ods select none;ods output Controls=Controls;ods output UnivariateStatistics=UniStat;proc kde data=tdata2  ;      univar  t /out=density  method=snr  unistats;run;ods select all;data density;      set density;   dt=pdf('T', value, 18);run;data _null_;      set UniStat;   if Descr='Bandwidth' then call symput ('bw', compress(round(t, 0.0001)));run;goptions reset=all border;title "Comparing Empirical and Theoretical Densities";legend label=none   position=(top right inside)  mode=share;axis1  label=("N=&Niter Bandwidth=&bw")        order=(-4 to 8 by 2) offset=(1,1)         major=(height=2) minor=none;axis2  label=(angle=90  "Density")            order=(0 to 0.4 by 0.1)  offset=(0, 0.1)        major=(height=2)  minor=none;symbol1  interpol=join  color=black  value=none  width=3;symbol2  interpol=join  color=grey   value=none  width=1;proc gplot data=density;      plot density*value  dt*value/overlay  legend=legend                                   haxis=axis1  vaxis=axis2  ;      label dt='t(18)' density='Exact';   run;quit;ods pdf close;``

Key Points Summary:

0. We can use FORMCHAR option to eliminate  weird characters

1. We can use PROC IMPORT to import deliminated data and it provided flexible options to control the output;

2. We can use PROC FREQ to summarize categorical data

3. We can use PROC MEANS and PROC UNIVARIATE to generate descriptive statistics for numerical data

4. We can study the dispersion of numerical variable by categorical variable using BOXPLOT

5. We provided a JITTER function for better graphical presentation of heavily overlapped data

6. We showed that there are multiple ways to visualize scatter plots together with fitted regression lines:

• 6.1 LINEPRINTER in PROC REG
• 6.2 Using PROC REG to obtain regression line and overlay it on the scatter plot
• 6.3 Using ODS Graphics in PROC REG
• 6.4 Using INTERPOL=R<><><> option in SYMBOL statement
• 6.5 Using PROC SGPLOT
7. We demonstrated how to conduct robust T-test in SAS following the code in the book.