random number generation

5月 082018

WARNING: This blog post references Avengers: Infinity War and contains story spoilers. But it also contains useful information about random number generators (RNGs) -- tempting! If you haven't yet seen the movie, you should make peace with this inner conflict before reading on.

Throughout the movie, Thanos makes it clear that his goal is to eliminate half of the population of every civilization in the universe. With the power of all six infinity stones imbued into his gauntlet, he'll be able to accomplish this with a "snap of his fingers." By the end of the film, Thanos has all of the stones, and then he literally snaps his fingers. (Really? I kept thinking that this was just a figure of speech he used to indicate how simple this will be -- but I guess it works more like the ruby slippers in The Wizard of Oz. Some clicking was required.)

So, Thanos snaps his huge fingers and -- POOF -- there goes half of us. Apparently the universe already had some sort of population-reduction subroutine just waiting for a hacker like Thanos to access it. Who put that there? Not a good plan, universe designer. (Check here to see if you survived the snap.)

But how did Thanos (or the universe) determine which of us was wiped from existence and which of us was spared? I have to assume that it was a seriously high-performing, massively parallel random number generator. And if Thanos had access to 9.4 Maintenance 5 or later (part of the Power [to Know] stone?), then he would have his choice of algorithms.

(Tony Stark has been to SAS headquarters, but we haven't seen Thanos around here. Still, he's welcome to download SAS University Edition.)

Your own RNG gauntlet, built into SAS

I know a little bit about this topic because I talked with Rick Wicklin about RNGs. As Rick discusses in his blog post, a recent release of SAS added support for several new/updated RNG algorithms, including Mersenne twister, PCG, Threefry, and one that introduces hardware-based entropy for "extra randomness." If you want to save yourself some reading, watch our 10-minute discussion here.

Implementing my own random Avengers terminator

I was going to write a SAS program to simulate Thanos' "snap," but I don't have a list of every single person in the universe (thanks GDPR!). However, courtesy of IMDB.com, I do have a list of the approximately 100 credited characters in the Infinity War movie. I wrote a DATA step to pull each name into a data set and "randomly decide" each fate by using the new PCG algorithm and the RAND function with a Bernoulli (binomial) distribution. I learned that trick from Rick's post about simulating coin flips. (I hope I did this correctly; Rick will tell me if I didn't.)

%let algorithm = PCG;
data characters;
  call streaminit(2018,"&algorithm.");
  infile datalines dsd;
  retain x 0 y 1;
  length Name $ 60 spared 8 x 8 y 8;
  input Name;
  Spared = rand("Bernoulli", 0.5);
  if x > 10 then
    do; y+1; x = 1;end;
Tony Stark / Iron Man
Bruce Banner / Hulk
Steve Rogers / Captain America
/* about 96 more */

After all of the outcomes were generated, I used PROC FREQ to check the distribution. In this run, only 48% were spared, not an even 50%. Well, that's randomness for you. On the universal scale, I'm not sure that anyone is keeping track.

How many spared

Using a trick I learned from Sample 54315: Customize your symbols with the SYMBOLCHAR statement in PROC SGPLOT, I created a scatter plot of the outcomes. I included special Unicode characters to signify the result in terms that even Hulk can understand. Hearts represent survivors; frowny faces represent the vanished heroes. Here's the code:

data thanosmap;
  input id $ value $ markercolor $ markersymbol $;
status 0 black frowny
status 1 red heart
ods graphics / height=400 width=400 imagemap=on;
proc sgplot data=Characters noautolegend dattrmap=thanosmap;
  styleattrs wallcolor=white;
  scatter x=x y=y / markerattrs=(size=40) 
    group=spared tip=(Name Spared) attrid=status;
  symbolchar name=heart char='2665'x;
  symbolchar name=frowny char='2639'x;
  xaxis integer display=(novalues) label="Did Thanos Kill You? Red=Dead" 
    labelattrs=(family="Comic Sans MS" size=14pt);
    /* Comic Sans -- get it ???? */
  yaxis integer display=none;

Scatter plot of spared

For those of you who can read, you might appreciate a table with the rundown. For this one, I used a trick that I saw on SAS Support Communities to add strike-through text to a report. It's a simple COMPUTE column with a style directive, in a PROC REPORT step.

proc report data=Characters nowd;
  column Name spared;
  define spared / 'Spared' display;
  compute Spared;
    if spared=1 then
      call define(_row_,"style",
    if spared=0 then
      call define(_row_,"style",
        "style={color=red textdecoration=line_through}");

Table of results

Remember, my results here were generated with SAS and don't match the results from the film. (I feel like I need to say that to preempt a few comments.) The complete code for this blog post is available on my public Gist.

Learn more about RNGs

Just as the end of Avengers: Infinity War has sent throngs of viewers to the Internet to find out What's Next, I expect that readers of this blog are eager to learn more about these modern random number generators. Here are the go-to articles from Rick that are worth your review:

Unanswered questions

Before Thanos completed his gauntlet, his main hobby was traveling around the cosmos reducing the population of each civilization "the hard way." With the gauntlet in hand when he snapped his fingers, did he eliminate one-half of the remaining population? Or did the universe's algorithm spare those civilizations that had already been culled? Was this a random sample with replacement or not? In the film, Thanos did not express concern about these details (typical upper management attitude), but the grunt-workers of the universe need to know the parameters for this project. Coders need exact specifications, or else you can expect less-than-heroic results from your infinity gauntlet. I'm pretty sure it says so in the owner's manual.

The post Which random number generator did Thanos use? appeared first on The SAS Dummy.

1月 292018

What is a random number generator? What are the random-number generators in SAS, and how can you use them to generate random numbers from probability distributions? In SAS 9.4M5, you can use the STREAMINIT function to select from eight random-number generators (RNGs), including five new RNGs. After choosing an RNG, you can use the RAND function to obtain random values from probability distributions.

What is a random-number generator?

A random-number generator usually refers to a deterministic algorithm that generates uniformly distributed random numbers. The algorithm (sometimes called a pseudorandom-number generator, or PRNG) is initialized by an integer, called the seed, which sets the initial state of the PRNG. Each iteration of the algorithm produces a pseudorandom number and advances the internal state. The seed completely determines the sequence of random numbers. Although it is deterministic, a modern PRNG generates extremely long sequences of numbers whose statistical properties are virtually indistinguishable from a truly random uniform process.

An RNG can also refer to a hardware-based device (sometimes called a "truly random RNG") that samples from an entropy source, which is often thermal noise within the silicon of a computer chip. In the Intel implementation, the random entropy source is used to initialize a PRNG in the chip, which generates a short stream of pseudorandom values before resampling from the entropy source and repeating the process. A hardware-based RNG is not deterministic or reproducible.

An RNG generates uniformly distributed numbers. If you request random numbers from a nonuniform probability distribution (such as the normal or exponential distributions), SAS will automatically apply mathematical transformations that convert the uniform numbers into nonuniform random variates.

Using random numbers in the real world

In the real world, random numbers are used as the basis for statistical simulations. Financial analysts use simulations of the economy to evaluate the risk of a portfolio. Insurance companies use simulations of natural disasters to assess the impact on their customers. Government workers use simulations of population growth to predict the infrastructure and services that will be required by future generations. In all fields, data analysts generate random values for statistical sampling, bootstrap and resampling methods, Monte Carlo estimation, and data simulation.

For these techniques to be statistically valid, the values that are produced by statistical software must contain a high degree of randomness. The new random-number generators in SAS provide analysts with fast, high-quality, state-of-the-art random numbers.

New random-number generators in SAS

Prior to SAS 9.4M5, SAS supported the Mersenne twister random-number generator (introduced in SAS 9.1) and a hardware-based RNG (supported on Intel CPUs in SAS 9.4M3). SAS 9.4M5 introduces new random-number generators and new variants of the Mersenne twister, as follows:

  • PCG: A 64-bit permuted congruential generator (O’Neill 2014) with good statistical properties.
  • TF2: A 2x64-bit counter-based RNG that is based on the Threefish encryption function in the Random123 library (Salmon et al. 2011). The generator is also known as the 2x64 Threefry generator.
  • TF4: A 4x64-bit counter-based RNG that is also known as the 4x64 Threefry generator.
  • RDRAND: A hardware-based RNG (Intel, 2014) that is repeatedly reseeded by using random values from thermal noise in the chip. The RNG requires an Intel processor (Ivy Bridge and later) that supports the RDRAND instruction.
  • MT1998: (deprecated) The original 1998 32-bit Mersenne twister algorithm (Matsumoto and Nishimura, 1998). This was the default RNG for the RAND function prior to SAS 9.4M3. For a small number of seed values (those exactly divisible by 8,192), this RNG does not produce a sufficiently random stream of numbers. Consequently, other RNGS are preferable.
  • MTHYBRID: (default) A hybrid method that improves the MT1998 method by using the MT2002 initialization for seeds that are exactly divisible by 8,192. This is the default RNG, beginning with the SAS 9.4M3.
  • MT2002: The 2002 32-bit Mersenne twister algorithm (Matsumoto and Nishimura 2002).
  • MT64: A 64-bit version of the 2002 Mersenne twister algorithm.

Choose a random-number generator in SAS

In SAS, you can use the STREAMINIT subroutine to choose an RNG and to set the seed value. You can then use the RAND function to produce high-quality streams of random numbers for many common distributions. The first argument to the RAND function specifies the name of the distribution. Subsequent arguments specify parameters for the distribution.

If you want to use the default RNG (which is 'MTHYBRID'), you can specify only the seed. For example, the following DATA step initializes the default PRNG with the seed value 12345 and then generates five observations from the uniform distribution on (0,1) and from the standard normal distribution:

data RandMT(drop=i);
   call streaminit(12345);   /* choose default RNG (MTHYBRID) and seed=12345 */
   do i = 1 to 5;
      uMT = rand('Uniform'); /* generate random uniform: U ~ U(0,1)           */
      nMT = rand('Normal');  /* generate random normal: N ~ N(mu=0, sigma=1) */

If you want to choose a different RNG, you can specify the name of the RNG followed by the seed. For example, the following DATA step initializes the PCG method with the seed value 12345 and then generates five observations from the uniform and normal distributions:

data RandPCG(drop=i);
   call streaminit('PCG', 12345); /* SAS 9.4M5: choose PCG method, same seed */
   do i = 1 to 5;
      uPCG = rand('Uniform'); nPCG = rand('Normal'); output;

A hardware-based method is initialized by the entropy source, so you do not need to specify a seed value. For example, the following DATA step initializes the RDRAND method and then generates five observations. The three data sets are then merged and printed:

data RandHardware(drop=i);
   call streaminit('RDRAND');     /* SAS 9.4M3: Hardware method, no seed */
   do i = 1 to 5;
      uHardware = rand('Uniform'); nHardware = rand('Normal'); output;
data All;
   merge RandMT RandPCG RandHardware;
title "Uniform Random Variates for Three RNGs";
proc print data=All noobs; var u:; run;
title "Normal Random Variates for Three RNGs";
proc print data=All noobs; var n:; run;

The output shows that each RNG generates a different sequence of numbers. If you run this program yourself, you will generate the same sequences for the first two columns (the MTHYBRID and PCG streams), but your numbers for the RDRAND RNG will be different. Each column of the first table is an independent random sample from the uniform distribution. Similarly, each column of the second table is an independent random sample from the standard normal distribution.


In SAS 9.4M5, you can choose from eight different random-number generators. You use the STREAMINIT function to select the RNG and use the RAND function to obtain random values. In a future blog post, I will compare the attributes of the different RNGs and discuss the advantages of each. I will also show how to use the new random-number generators to generate random numbers in parallel by running a DS2 program in multiple threads.

The information in this article is taken from my forthcoming SAS Global Forum 2018 paper, "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018).


The post How to use random-number generators in SAS appeared first on The DO Loop.

12月 132010
In recent weeks, we've explored methods to fit logistic regression models when a state of quasi-complete separation exists. We considered Firth's penalized likelihood approach, exact logistic regression, and Bayesian models using Markov chain Monte Carlo (MCMC).

Today we'll show how to build a Monte Carlo experiment to compare these approaches. Suppose we have 100 observations with x=0 and 100 with x=1, and suppose that the Pr(Y=1|X=0) = 0.001, while the Pr(Y=1|X=1) = 0.05. Thus the true odds ratio is (0.05/0.95)/(0.001/0.999) = 52.8 and the log odds ratio we want to find is 3.96. But we will rarely observe any y=1 when x=0. Which of these approaches is most likely to give us acceptable results?

Note that in all of the MCMC analyses we use only 6000 iterations, which is likely too few to trust in practice.

The code is long enough here that we annotate within rather than write much text around the code.


All the SAS procedures used accept the events/trials syntax (section 4.1.1), so we'll generate example data sets as two observations of binomial random variates with the probabilities noted above. We also make extensive use of the ODS system to suppress all printed output (section A.7.1) and to save desired pieces of output as SAS data sets. The latter usage requires first using the ods trace on/listing statement to find the name of the output before saving it. Finally, we use the by statement (section A.6.2) to replicate the analysis for each simulated data set.

data rlog;
do trial = 1 to 100;
/* each "trial" is a simulated data set with two observations
containing the observed number of events with x=0 or x=1 */
x=0; events = ranbin(0,100,.001); n=100; output;
x=1; events = ranbin(0,100,.05); n=100; output;

ods select none; /* omit _all_ printed output */

ods output parameterestimates=glm; /* save the estimated betas */
proc logist data = rlog;
by trial;
model events / n=x; /* ordinary logistic regression */

ods output parameterestimates=firth; /* save the estimated betas */
/* note the output data set has the same name
as in the uncorrected glm */
proc logist data = rlog;
by trial;
model events / n = x / firth; /* do the firth bias correction */

ods output exactparmest=exact;
/* the exact estimates have a different name under ODS */
proc logist data=rlog;
by trial;
model events / n = x;
exact x / estimate; /* do the exact estimation */

data prior;
input _type_ $ Intercept x;
Var 25 25
Mean 0 0

ods output postsummaries=mcmc;
proc genmod data = rlog;
by trial;
model events / n = x / dist=bin;
bayes nbi=1000 nmc=6000
coeffprior=normal(input=prior) diagnostics=none
/* do the Bayes regression, using the prior made in the
previous data step */

Now I have four data sets with parameter estimates in them. I could use them separately, but I'd like to merge them together. I can do this with the merge statement (section 1.5.7) in a data step. I also need to drop the lines with the estimated intercepts and rename the variables that hold the parameter estimates. The latter is necessary because the names are duplicated across the output data sets and desirable in that it allows names that are meaningful. In any event, I can use the where and rename data set options to include these modifications as I do the merge. I'll also add the number of events when x=0 and when x=1, which requires merging in the original data twice.

data lregsep;
glm (where = (variable = "x") rename = (estimate = glm))
firth (where = (variable = "x") rename = (estimate = firth))
exact (rename = (estimate = exact))
mcmc (where = (parameter = "x") rename = (mean=mcmc))
rlog (where = (x = 1) rename = (events = events1))
rlog (where = (x = 0) rename = (events = events0));
by trial;

ods select all; /* now I want to see the output! */
/* check to make sure the output dataset looks right */
proc print data = lregsep (obs = 5) ;
var trial glm firth exact mcmc;

/* what do the estimates look like? */
proc means data=lregsep;
var glm firth exact mcmc;

With the following output.

Obs trial glm firth exact mcmc

1 1 12.7866 2.7803 2.3186 3.9635
2 2 12.8287 3.1494 2.7223 4.0304
3 3 10.7192 1.6296 0.8885 2.5613
4 4 11.7458 2.2378 1.6906 3.3409
5 5 10.7192 1.6296 0.8885 2.5115

Variable Mean Std Dev
glm 10.6971252 3.4362801
firth 2.2666700 0.5716097
exact 1.8237047 0.5646224
mcmc 3.1388274 0.9620103

The ordinary logistic estimates are entirely implausible, while the three alternate approaches are more acceptable. The MCMC result has the least bias, but it's unclear to what degree this is a happy coincidence between the odds ratio and the prior precision. The Firth approach appears to be less biased than the exact logistic regression

The R version is roughly analogous to the SAS version. The notable differences are that 1) I want the "weights" version of the data (see example 8.15) for the glm() and logistf() functions and need the events/trials syntax for the elrm() function and the expanded (one row per observation) version for the MCMClogit() funtion. The sapply() function (section B.5.3) serves a similar function to the by statement in SAS. Finally, rather than spelunking through the ods trace output to find the parameter estimates, I used the str() function (section 1.3.2) to figure out where they are stored in the output objects and indexes (rather than data set options) to pull out the one estimate I need.

# make sure the needed packages are present
# the runlogist() function generates a dataset and runs each analysis
# the parameter "trial" keeps track of which time we're calling runlogist()
runlogist = function(trial) {
# the result vector will hold the estimates temporarily
result = matrix(0,4)
# generate the number of events once
events.0 =rbinom(1,100, .001) # for x = 0
events.1 = rbinom(1,100, .05) # for x = 1
# following for glm and logistf "weights" format
xw = c(0,0,1,1)
yw = c(0,1,0,1)
ww = c(100 - events.0, events.0, 100 - events.1,events.1)
# run the glm and logistf, grab the estimates, and stick
# them into the results vector
result[1] =
glm(yw ~ xw, weights=ww, binomial)$coefficients[2]
result[2] = logistf(yw ~ xw, weights=ww)$coefficients[2]
# elrm() needs a data frame in the events/trials syntax
elrmdata = data.frame(events=c(events.0,events.1), x =c(0,1),
trials = c(100,100))
# run it and grab the estimate
result[3]=elrm(events/trials ~ x, interest = ~ x, iter = 6000,
burnIn = 1000, data = elrmdata, r = 2)$coeffs
# MCMClogit() needs expanded data
x = c(rep(0,100), rep(1,100))
y = c(rep(0,100-events.0), rep(1,events.0),
rep(0, 100-events.1), rep(1, events.1))
# run it and grab the mean of the MCMC posteriors
result[4] = summary(MCMClogit(y~as.factor(x), burnin=1000,
mcmc=6000, b0=0, B0=.04,
seed = list(c(781306, 78632467, 364981736, 6545634, 7654654,
# send back the four estimates, plus the number of events
# when x=0 and x=1
return(c(trial, events.0, events.1, result))

Note the construction of the seed= option to the MCMClogit() function. This allows a different seed in every call without actually using sequential seeds.

Now we're ready to call the function repeatedly. We'll do that with the sapply() function, but we need to nest that inside a t() function call to get the estimates to appear as columns rather than rows, and we'll also make it a data frame in the same command. Note that the parameters we change within the sapply() function are merely a list of trial numbers. Finally, we'll add descriptive names for the columns with the names() function (section 1.3.4).

res2 = as.data.frame(t(sapply(1:10, runlogist)))
names(res2) <- c("trial","events.0","events.1", "glm",
"firth", "exact-ish", "MCMC")
mean(res2[,4:7], na.rm=TRUE)

trial events.0 events.1 glm firth exact-ish MCMC
1 1 0 6 18.559624 2.6265073 2.6269087 3.643560
2 2 1 3 1.119021 0.8676031 1.1822296 1.036173
3 3 0 5 18.366720 2.4489268 2.1308186 3.555314
4 4 0 5 18.366720 2.4489268 2.0452446 3.513743
5 5 0 2 17.419339 1.6295391 0.9021854 2.629160
6 6 0 9 17.997524 3.0382577 2.1573979 4.017105

glm firth exact-ish MCMC
17.333356 2.278344 1.813203 3.268243

The results are notably similar to SAS, except for the unacceptable glm() results.

In most Monte Carlo experimental settings, one would also be interested in examining the confidence limits for the parameter estimates. Notes and code for doing this can be found here. In a later entry we'll consider plots for the results generated above. As a final note, there are few combinations of event numbers with any mass worth considering. One could compute the probability of each of these and the associated parameter estimates, deriving a more analytic answer to the question. However, this would be difficult to replicate for arbitrary event probabilities and Ns, and very awkward for continuous covariates, while the above approach could be extended with trivial ease.