simulation

2月 032021
 

In a previous article, I showed how to generate random points uniformly inside a d-dimensional sphere. In that article, I stated the following fact:

If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit sphere.

I was reminded of this fact recently when I wrote about how to simulate a random walk in high dimensions. Each step of a random walk needs to generate a random direction and a random step length. For a 2-D random walk, it is easy to generate a random direction as an angle, θ, chosen from the uniform distribution on the interval [0,2π). However, for higher-dimensional random walks, I do not suggest using random angles to determine directions. Although it is mathematically possible to use spherical coordinates in d-dimensions, computations in a spherical coordinate system are extremely messy.

A better alternative is to use random unit vectors to determine a direction for each step in a random walk. (In fact, a unit vector is sometimes called a direction vector.) Since a unit vector is a vector from the origin to a point on the unit sphere, this article shows how to generate random points uniformly on the unit sphere. I show the computation twice: once by using the SAS/IML matrix language and again by using the SAS DATA step. In order to visualize the data, I show how to create a contour plot of ungridded data in SAS.

Random points on a circle

For simplicity, let's see how to generate random points on the unit circle in 2-D. The following SAS/IML program simulates N=1000 points from d=2 independent normal distributions. The notation X[ ,##] is a subscript reduction operator that returns a row vector that contains the sum of the squared elements of each row of X. Thus the expression sqrt(X[,##]) gives the Euclidean distance of each row to the origin. If you divide the coordinates by the distance, you obtain a point on the unit circle:

%let N = 1000;         /* number of steps in random walk */
%let d = 2;            /* S^{d-1} sphere embedded in R^d */
%let r = 1;            /* radius of sphere */
 
proc iml;
call randseed(12345);
r = &r; d = &d;
x = randfun(&N // d, "Normal");   /* N points from MVN(0, I(d)) */
norm = sqrt( x[,##] );            /* Euclidean distance to origin */
x = r * x / norm;                 /* scale point so that distance to origin is r */
title "&n Random Points on Circle";
call scatter(x[,1], x[,2]) grid={x y}
             procopt="aspect=1" option="transparency=0.8";
QUIT;

As promised, the resulting points are on the circle of radius r=1. Recall that uniformly at random does not mean evenly spaced. Point patterns that are generated by a uniform process will typically have gaps and clusters.

Random points on a sphere

You can also use the SAS DATA step to generate the random points on a sphere. You can generate each coordinate independently from a normal distribution and use the EUCLID function in Base SAS to compute the Euclidean distance from a point to the origin. You can then scale the coordinates so that the point is on the sphere of radius r. The following DATA step generates d-dimensional points for any d:

%let N = 2000;         /* number of steps in random walk */
%let d = 3;            /* S^{d-1} sphere embedded in R^d */
%let r = 1;            /* radius of sphere */
data RandSphere;
array x[&d];
call streaminit(12345);
do i = 1 to &N;
   do j = 1 to &d;
      x[j] = rand("Normal");      /* random point from MVN(0, I(d)) */
   end;
   norm = Euclid( of x[*] );      /* Euclidean distance to origin */
   do j = 1 to &d;
      x[j] = &r * x[j] / norm;    /* scale point so that distance to origin is r */
   end;
   output;
end;
drop j norm;
run;

How can we visualize the points to assure ourselves that the program is running correctly? One way is to plot the first two coordinates and use colors to represent the value of the points in the third coordinate direction. For example, the following call to PROC SGPLOT creates a scatter plot of (x1,x2) and uses the COLORRESPONSE=x3 option to color the markers. The default color ramp is the ThreeAltColor ramp, which (for my ODS style) is a blue-black-red color ramp. That means that points near the "south pole" (x3 = -1) will be colored blue, points near the equator will be colored black, and points near the "north pole" (x3 = 1) will be colored red. I have also used the TRANSPARENCY= option so that the density of the points is shown.

title "&n Random Points on a Sphere";
proc sgplot data=RandSphere aspect=1;
   scatter x=x1 y=x2 / colorresponse=x3 transparency=0.5 markerattrs=(symbol=CircleFilled);
   xaxis grid; yaxis grid;
run;

Switching to (x,y,z) instead of (x1,x2,x3) notation, the graph shows that observations near the (x,y)-plane (for which x2 + y2 &asymp 1) have a dark color because the z-value is close 1 zero. In contrast, observations near for which x2 + y2 &asymp 0 have either a pure blue or pure red color since those observations are near one of the poles.

A contour plot of ungridded data

Another way to visualize the data would be to construct a contour plot of the points in the upper hemisphere of the sphere. The exact contours of the upper hemisphere are concentric circles (lines of constant latitude) centered at (x,y)=(0,0). For these simulated data, a contour plot should display contours that are approximately concentric circles.

In a previous article, I showed how to use PROC TEMPLATE and PROC SGRENDER in SAS to create a contour plot. However, in that article the data were aligned on an evenly spaced grid in the (x,y) plane. The data in this article have random positions. Consequently, on the CONTOURPLOTPARM statement in PROC TEMPLATE, you must use the GRIDDED=FALSE option so that the plot interpolates the data onto a rectangular grid. The following Graph Template Language (GTL) defines a contour plot for irregularly spaced data and creates the contour plot for the random point on the upper hemisphere:

/* Set GRIDDED=FALSE if the points are not arranged on a regular grid */
proc template;
define statgraph ContourPlotParmNoGrid;
dynamic _X _Y _Z _TITLE;
begingraph;
   entrytitle _TITLE;
   layout overlay;
      contourplotparm x=_X y=_Y z=_Z / GRIDDED=FALSE /* for irregular data */
        contourtype=fill nhint=12  colormodel=twocolorramp name="Contour";
      continuouslegend "Contour" / title=_Z;
   endlayout;
endgraph;
end;
run;
 
ods graphics / width=480px height=480px;
proc sgrender data=RandSphere template=ContourPlotParmNoGrid;
where x3 >= 0;
dynamic _TITLE="Contour Plot of &n Random Points on Upper Hemisphere"
        _X="x1" _Y="x2" _Z="x3";
run;

The data are first interpolated onto an evenly spaced grid and then the contours are estimated. The contours are approximately concentric circles. From the contour plot, you can conclude that the data form a dome or hemisphere above the (x,y) plane.

Summary

This article shows how to use SAS to generate random points on the sphere in any dimension. You can use this technique to generate random directions for a random walk. In addition, the article shows how to use the GRIDDED=FALSE option on the CONTOURPLOTPARM statement in GTL to create a contour plot from irregularly spaced ("ungridded") data in SAS.

The post Generate random points on a sphere appeared first on The DO Loop.

2月 012021
 

Imagine an animal that is searching for food in a vast environment where food is scarce. If no prey is nearby, the animal's senses (such as smell and sight) are useless. In that case, a reasonable search strategy is a random walk. The animal can choose a random direction, walk/swim/fly in that direction for a while, and hope that it encounters food. If not, it can choose another random direction and try again.

According to a story in Quanta magazine, an optimal search strategy is to use a variant of a random walk that is known as a Lévy flight. This article describes a Lévy flight, simulates it in SAS, and compares it to Gaussian random walk.

Random walks

For the purpose of this article, a 2-D random walk is defined as follows. From the current position,

  • Choose a random direction by choosing an angle, θ, uniformly in [0, 2π).
  • Choose a random distance, r.
  • Move that distance in that direction.

The amount of territory that gets explored during a random walk depends on the distribution of the distances. For example, if you always choose a unit distance, you obtain the "drunkard's walk." If you choose distances from a normal distribution, you obtain a Gaussian walk. In both scenarios, the animal primarily searches near where it has been previously. These random walks that always use short distances are good for finding food in a food-rich environment.

In a food-scarce environment, it is a waste of time and energy to always move a short distance from the previous position. It turns out that a better strategy is to occasionally move a long distance. That puts the animal in a new location, which it can explore by moving small distances again. According to the Quanta article, this behavior has been observed "to greater or lesser extents" in many animals that hunt at sea, including albatross, sharks, turtles, penguins, and tuna (Sims, et al., Nature, 2008).

Levy walks and the Levy distribution

A popular model for this behavior is the Lévy walk, which is often called a Lévy flight because it was first applied to the flight path of an albatross over open water. In a Lévy flight, the distance is a random draw from a heavy-tailed distribution. In this article, I will simulate a Lévy flight by drawing distances from a Lévy distribution, which is a special case of the inverse gamma distribution.

The Lévy distribution Levy(μ, c) has a location parameter (μ) and a shape parameter (c). It is a special case of the inverse gamma distribution, which is represented by IGamma(α, β). If X ~ Levy(μ, c) is a random variable, then X – μ ~ IGamma(1/2, β/2). In this article, μ=0, so a random draw from the Levy(0,c) distribution is a random draw from IGamma(1/2, c/2) distribution. Thus, you can use the SAS program in the previous article to generate a random sample from the Lévy distribution.

The inverse gamma distribution with α=1/2 has very heavy tails. This means that a random variate can be arbitrarily large. When modeling a physical or biological system, there are often practical bounds on the maximum value of a quantity. For example, if you are modeling the flight of an albatross, you know that there is a maximum distance that an albatross can fly in a given time. Thus, researchers often use a truncated Lévy distribution to model the distances that an animal travels in one unit of time. For example, an albatross can fly up to 800 km in one day (!!), so a model of albatross distances should truncate the distances at 800.

Simulate a Gaussian and Levy Walk

You can use the SAS DATA step to simulate random walks. The program in this section generates a random direction in [0, 2π), then generates a random distance. For the Gaussian walk, the distance is the absolute value of a random N(0, σ) variate. For the Lévy walk, the distances are sampled from the Levy(0, 0.3*σ) distribution. The constant 0.3 is chosen so that the two distance distributions have approximately the same median.

For convenience, you can define a macro that makes it easy to generate random variates from the Lévy distribution. The following program uses the %RandIGamma macro (explained in the previous article), which generates a random variate from the inverse gamma distribution.

The following program simulates four individuals (animals) who start at the corners of a unit square with coordinates (0,0), (1,0), (1,1), and (0,1). At each step, the animals choose a random direction and move a random distance in that direction. In one scenario, the distances generated as the absolute value of a normal distribution. In another scenario, the distances are Levy distributed.

/* useful macros: https://blogs.sas.com/content/iml/2021/01/27/inverse-gamma-distribution-sas.html */
%macro RandIGamma(alpha, beta);          /* IGamma(alpha,beta) */
   (1 / rand('Gamma', &alpha, 1/(&beta)))
%mend;
%macro RandLevy(mu, scale);              /* Levy(mu,c) */
   (&mu + %RandIGamma(0.5, (&scale)/2));
%mend;
%macro RandTruncLevy(mu, scale, max);    /* Truncated Levy */
   min((&mu + %RandIGamma(0.5, (&scale)/2)), &max);
%mend;
 
%let N = 1000;         /* number of steps in random walk */
%let sigma = 0.1;      /* scale parameter for N(0,sigma) */
%let c = (0.3*&sigma); /* makes medians similar */
 
data Walks;
array x0[4] _temporary_ (0 1 1 0);  /* initial positions */
array y0[4] _temporary_ (0 0 1 1);
call streaminit(12345);
twopi = 2*constant('pi');
c = &c;                    /* Levy parameter */
do subject = 1 to dim(x0);
   t=0;                    /* initial position for subject */
   x = x0[subject];   y = y0[subject]; 
   r=.; theta=.;           /* r = random distance; theta=random angle */
   Walk = "Gaussian";  output;
   Walk = "Levy";      output;
   xN = x; yN = y;         /* (xN,yN) are coords for Gaussian walk */
   xL = x; yL = y;         /* (xL,yL) are coords for Levy walk */
   do t = 1 to &N;         /* perform steps of a random walk */
      theta = rand("Uniform", 0, twopi);  /* random angle */
      ux = cos(theta); uy = sin(theta);   /* u=(ux,uy) is random unit vector */
      /* take a Gaussian step; remember location */
      Walk = "Gaussian";
      r = abs( rand("Normal", 0, &sigma) ); /* ~ abs(N(0,sigma)) */
      xN = xN + r*ux;   yN = yN + r*uy;
      /* output the Gaussian walk step */
      x = xN; y = yN; output;
 
      /* take a (truncated) Levy step; remember location */
      Walk = "Levy";
      r = %RandTruncLevy(0, c, 2);  /* Levy(0,c) truncated into [0,2] */
      xL = xL + r*ux;   yL = yL + r*uy;
      /* output the Levy walk step */
      x = xL; y = yL; output;
   end;
end;
drop twopi c xN yN xL yL ux uy;
run;

Let's compare the step sizes for the two types of random walks. The following call to PROC SGPANEL creates histograms of the step lengths for each type of random walk:

title "Distribution of Step Lengths";
proc sgpanel data=Walks;
   label r = "Step Length";
   panelby Walk;
   histogram r / binwidth=0.1 binstart=0.05;
run;

The graph shows that all step lengths for the Gaussian random walk are less than 0.4, which represents four standard deviations for the normal distribution. In contrast, the distribution of distances for the Lévy walk has a long tail. About 15% of the distances are greater than 0.75. About 9% of the step lengths are the maximum possible value, which is 2.

How does that difference affect the amount of territory covered by an animal that is foraging for food? The following statements show the paths of the four animals for each type of random walk:

title "Position of Subjects";
title2 "50 Steps of a Random Walk";
proc sgpanel data=Walks aspect=1;
   where t <= 50;
   panelby Walk / onepanel;
   series x=x y=y / group=Subject;
   colaxis grid; rowaxis grid;
run;

The differences in the paths are evident. For the Gaussian random walk, the simulated animals never leave a small neighborhood around their starting positions. After 50 steps, none of the animals who are taking a Gaussian walk have moved more than one unit away from their starting position. In contrast, the animals who take Lévy walk have explored a much greater region. You can see that the typical motion is several small steps followed by a larger step that brings the animal into a new region to explore.

Generalizing to higher dimensions

It is easy to extend the random walk to higher dimensions. The trick is to represent the "random direction" by a unit vector (u) instead of by an angle (θ). In a previous article about random point within a d-dimensional ball, I remark that "If Y is drawn from the uncorrelated multivariate normal distribution, then u = Y / ||Y|| has the uniform distribution on the unit d-sphere." You can use this trick to generate random unit vectors, then step a distance r in that direction, where r is from a specified distribution of distances.

Summary

An article in Quanta magazine describes how some animals use a swim/flight path that can be modeled by a (truncated) Lévy random walk. This type of movement is a great way to search for food in a vast food-scarce environment. This article defines the Lévy distribution as a sub-family of the inverse gamma distribution. It shows how to generate random variates from the Lévy distribution in SAS. Finally, it simulates a Gaussian random walk and a Lévy walk and visually compares the area explored by each.

The post Gaussian random walks and Levy flights appeared first on The DO Loop.

12月 172020
 

Do you know that you can create a vector that has a specific correlation with another vector? That is, given a vector, x, and a correlation coefficient, ρ, you can find a vector, y, such that corr(x, y) = ρ. The vectors x and y can have an arbitrary number of elements, n > 2. One application of this technique is to create a scatter plot that shows correlated data for any correlation in the interval (-1, 1). For example, you can create a scatter plot with n points for which the correlation is exactly a specified value, as shown at the end of this article.

The algorithm combines a mixture of statistics and basic linear algebra. The following facts are useful:

  • Statistical correlation is based on centered and normalized vectors. When you center a vector, it usually changes the direction of the vector. Therefore, the calculations use centered vectors.
  • Correlation is related to the angle between the centered vectors. If the angle is θ, the correlation between the vectors is cos(θ).
  • Projection is the key to finding a vector that has a specified correlation. In linear algebra, the projection of a vector w onto a unit vector u is given by the expression (w`u)*u.
  • Affine transformations do not affect correlation. For any real number, α, and for any β > 0, the vector α + β y has the same correlation with x as y does. For simplicity, the SAS program in this article returns a centered unit vector. You can scale and translate the vector to obtain other solutions.

The geometry of a correlated vector

Given a centered vector, u, there are infinitely-many vectors that have correlation ρ with u. Geometrically, you can choose any vector on a positive cone in the same direction as u, where the cone has angle θ and cos(θ)=ρ. This is shown graphically in the figure below. The plane marked \(\mathbf{u}^{\perp}\) is the orthogonal complement to the vector u. If you extend the cone through the plane, you obtain the cone of vectors that are negatively correlated with x

One way to obtain a correlated vector is to start with a guess, z. The vector z can be uniquely represented as the sum \(\mathbf{y} = \mathbf{w} + \mathbf{w}^{\perp}\), where w is the projection of z onto the span of u, and \(\mathbf{w}^{\perp}\) is the projection of z onto the orthogonal complement.

The following figure shows the geometry of the right triangle with angle θ such that cos(θ) = ρ. If you want the vector y to be unit length, you can read off the formula for y from the figure. The formula is
\(\mathbf{y} = \rho \mathbf{w} / \lVert\mathbf{w}\rVert + \sqrt{1 - \rho^2} \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert \)
In the figure, \(\mathbf{v}_1 = \mathbf{w} / \lVert\mathbf{w}\rVert\) and \(\mathbf{v}_2 = \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert\).

Compute a correlated vector

It is straightforward to implement this projection in a matrix-vector language such as SAS/IML. The following program defines two helper functions (Center and UnitVec) and uses them to implement the projection algorithm. The function CorrVec1 takes three arguments: the vector x, a correlation coefficient ρ, and an initial guess. The function centers and scales the vectors into the vectors u and z. The vector z is projected onto the span of u. Finally, the function uses trigonometry and the fact that cos(θ) = ρ to return a unit vector that has the required correlation with x.

/* Given a vector, x, and a correlation, rho, find y such that corr(x,y) = rho */
proc iml;
/* center a column vector by subtracting its mean */
start Center(v);
   return ( v - mean(v) );
finish;
/* create a unit vector in the direction of a column vector */
start UnitVec(v);
   return ( v / norm(v) );
finish;
 
/* Find a vector, y, such that corr(x,y) = rho. The initial guess can be almost 
   any vector that is not in span(x), orthog to span(x), and not in span(1) */
start CorrVec1(x, rho, guess);
   /* 1. Center the x and z vectors. Scale them to unit length. */
   u = UnitVec( Center(x)     );
   z = UnitVec( Center(guess) );
 
   /* 2. Project z onto the span(u) and the orthog complement of span(u) */
   w = (z`*u) * u;
   wPerp = z - w;       
 
   /* 3. The requirement that cos(theta)=rho results in a right triangle 
         where y (the hypotenuse) has unit length and the legs
         have lengths rho and sqrt(1-rho^2), respectively */
   v1 = rho * UnitVec(w);
   v2 = sqrt(1 - rho**2) * UnitVec(wPerp);
   y = v1 + v2;
 
   /* 4. Check the sign of y`*u. Flip the sign of y, if necessary */
   if sign(y`*u) ^= sign(rho) then 
      y = -y;
   return ( y );
finish;

The purpose of the function is to project the guess onto the green cone in the figure. However, if the guess is in the opposite direction from x, the algorithm will compute a vector, y, that has the opposite correlation. The function detects this case and flips y, if necessary.

The following statements call the function for a vector, x, and requests a unit vector that has correlation ρ = 0.543 with x:

/* Example: Call the CorrVec1 function */
x = {1,2,3};
rho = 0.543;
guess = {0, 1, -1};  
y = CorrVec1(x, rho, guess);
corr = corr(x||y);
print x y, corr;

As requested, the correlation coefficient between x and y is 0.543. This process will work provided that the guess satisfies a few mild assumptions. Specifically, the guess cannot be in the span of x or in the orthogonal complement of x. The guess also cannot be a multiple of the 1 vector. Otherwise, the process will work for positive and negative correlations.

The function returns a vector that has unit length and 0 mean. However, you can translate the vector and scale it by any positive quantity without changing its correlation with x, as shown by the following example:

/* because correlation is a relationship between standardized vectors, 
   you can translate and scale Y any way you want */
y2 = 100 + 23*y;      /* rescale and translate */
corr = corr(x||y2);   /* the correlation will not change */
print corr;

When y is a centered unit vector, the vector β*y has L2 norm β. If you want to create a vector whose standard deviation is β, use β*sqrt(n-1)*y, where n is the number of elements in y.

Random vectors with a given correlation

One application of this technique is to create a random vector that has a specified correlation with a given vector, x. For example, in the following program, the x vector contains the heights of 19 students in the Sashelp.Class data set. The program generates a random guess from the standard normal distribution and passes that guess to the CorrVec1 function and requests a vector that has the correlation 0.678 with x. The result is a centered unit vector.

use sashelp.class;
   read all var {"Height"} into X;
close;
 
rho = 0.678;
call randseed(123);
guess = randfun(nrow(x), "Normal");
y = CorrVec1(x, rho, guess);
 
mean = 100;
std = 23*sqrt(nrow(x)-1);
v = mean + std*y;
title "Correlation = 0.678";
title2 "Random Normal Vector";
call scatter(X, v) grid={x y};
A random vector that has a given correlation with a given vector. Computation by using SAS.

The graph shows a scatter plot between x and the random vector, v. The correlation in the scatter plot is 0.678. The sample mean of the vector v is 100. The sample standard deviation is 23.

If you make a second call to the RANDFUN function, you can get another random vector that has the same properties. Or you can repeat the process for a range of ρ values to visualize data that have a range of correlations. For example, the following graph shows a panel of scatter plots for ρ = -0.75, -0.25, 0.25, and 0.75. The X variable is the same for each plot. The Y variable is a random vector that was rescaled to have mean 100 and standard deviation 23, as above.

Random vectors that have a given correlation with a given vector. Computations in SAS.

The random guess does not need to be from the normal distribution. You can use any distribution.

Summary

This article shows how to create a vector that has a specified correlation with a given vector. That is, given a vector, x, and a correlation coefficient, ρ, find a vector, y, such that corr(x, y) = ρ. The algorithm in this article produces a centered vector that has unit length. You can multiply the vector by β > 0 to obtain a vector whose norm is β. You can multiply the vector by β*sqrt(n-1) to obtain a vector whose standard deviation is β.

There are infinitely-many vectors that have correlation ρ with x. The algorithm uses a guess to produce a particular vector for y. You can use a random guess to obtain a random vector that has a specified correlation with x.

The post Find a vector that has a specified correlation with another vector appeared first on The DO Loop.

11月 022020
 

When there are two equivalent ways to do something, I advocate choosing the one that is simpler and more efficient. Sometimes, I encounter a SAS program that simulates random numbers in a way that is neither simple nor efficient. This article demonstrates two improvements that you can make to your SAS code if you are simulating binary variables or categorical variables.

Simulate a random binary variable

The following DATA step simulates N random binary (0 or 1) values for the X variable. The probability of generating a 1 is p=0.6 and the probability of generating a 0 is 0.4. Can you think of ways that this program can be simplified and improved?

%let N = 8;
%let seed = 54321;
 
data Binary1(drop=p);
call streaminit(&seed);
p = 0.6;
do i = 1 to &N;
   u = rand("Uniform");
   if (u < p) then
      x=1;
   else
      x=0;
   output;
end;
run;
 
proc print data=Binary1 noobs; run;

The goal is to generate a 1 or a 0 for X. To accomplish this, the program generates a random uniform variate, u, which is in the interval (0, 1). If u < p, it assigns the value 1, otherwise it assigns the value 0.

Although the program is mathematically correct, the program can be simplified. It is not necessary for this program to generate and store the u variable. Yes, you can use the DROP statement to prevent the variable from appearing in the output data set, but a simpler way is to use the Bernoulli distribution to generate X directly. The Bernoulli(p) distribution generates a 1 with probability p and generates a 0 with probability 1-p. Thus, the following DATA step is equivalent to the first, but is both simpler and more efficient. Both programs generate the same random binary values.

data Binary2(drop=p);
call streaminit(&seed);
p = 0.6;
do i = 1 to &N;
   x = rand("Bernoulli", p);     /* Bern(p) returns 1  with probability p */
   output;
end;
run;
 
proc print data=Binary2 noobs; run;

Simulate a random categorical variable

The following DATA step simulates N random categorical values for the X variable. If p = {0.1, 0.1, 0.2, 0.1, 0.3, 0.2} is a vector of probabilities, then the probability of generating the value i is p[i]. For example, the probability of generating a 3 is 0.2. Again, the program generates a random uniform variate and uses the cumulative probabilities ({0.1, 0.2, 0.4, 0.5, 0.8, 1}) as cutpoints to determine what value to assign to X, based on the value of u. (This is called the inverse CDF method.)

/* p = {0.1, 0.1, 0.2, 0.1, 0.3, 0.2} */
/* Use the cumulative probability as cutpoints for assigning values to X */
%let c1 = 0.1;
%let c2 = 0.2;
%let c3 = 0.4;
%let c4 = 0.5;
%let c5 = 0.8;
%let c6 = 1;
 
data Categorical1;
call streaminit(&seed);
do i = 1 to &N;
   u = rand("Uniform");
   if (u <=&c1) then
      y=1;
   else if (u <=&c2) then
      y=2;
   else if (u <=&c3) then
      y=3;
   else if (u <=&c4) then
      y=4;
   else if (u <=&c5) then
      y=5;
   else
      y=6;
   output;
end;
run;
 
proc print data=Categorical1 noobs; run;

If you want to generate more than six categories, this indirect method becomes untenable. Suppose you want to generate a categorical variable that has 100 categories. Do you really want to use a super-long IF-THEN/ELSE statement to assign the values of X based on some uniform variate, u? Of course not! Just as you can use the Bernoulli distribution to directly generate a random variable that has two levels, you can use the Table distribution to directly generate a random variable that has k levels, as follows:

data Categorical2;
call streaminit(&seed);
array p[6] _temporary_ (0.1, 0.1, 0.2, 0.1, 0.3, 0.2);
do i = 1 to &N;
   x = rand("Table", of p[*]); /* Table(p) returns i with probability p[i] */
   output;
end;
 
proc print data=Categorical2 noobs; run;

Summary

In summary, this article shows two tips for simulating discrete random variables:

  1. Use the Bernoulli distribution to generate random binary variates.
  2. Use the Table distribution to generate random categorical variates.

These distributions enable you to directly generate categorical values based on supplied probabilities. They are more efficient than the oft-used method of assigning values based on a uniform random variate.

The post Tips to simulate binary and categorical variables appeared first on The DO Loop.

10月 282020
 

The skewness of a distribution indicates whether a distribution is symmetric or not. The Wikipedia article about skewness discusses two common definitions for the sample skewness, including the definition used by SAS. In the middle of the article, you will discover the following sentence: In general, the [estimators]are both biased estimators of the population skewness. The article goes on to say that the estimators are not biased for symmetric distributions. Similar statements are true for the sample kurtosis.

This statement might initially surprise you. After all, the statistics that we use to estimate the mean and standard deviation are unbiased. Although biased estimates are not inherently "bad," it is useful to get an intuitive feel for how biased an estimator might be.

Let's demonstrate the bias in the skewness statistic by running a Monte Carlo simulation. Choose an unsymmetric univariate distribution for which the population skewness is known. For example, the exponential distribution has skewness equal to 2. Then do the following:

  1. Choose a sample size N. Generate B random samples from the chosen distribution.
  2. Compute the sample skewness for each sample.
  3. Compute the average of the skewness values, which is the Monte Carlo estimate of the sample skewness. The difference between the Monte Carlo estimate and the parameter value is an estimate of the bias of the statistic.

In this article, I will generate B=10,000 random samples of size N=100 from the exponential distribution. The simulation shows that the expected value of the skewness is NOT close to the population parameter. Hence, the skewness statistic is biased.

A Monte Carlo simulation of skewness

The following DATA step simulates B random samples of size N from the exponential distribution. The call to PROC MEANS computes the sample skewness for each sample. The call to PROC SGPLOT displays the approximate sampling distribution of the skewness. The graph overlays a vertical reference line at 2, which is the skewness parameter for the exponential distribution, and also overlays a reference line at the Monte Carlo estimate of the expected value.

%let NumSamples = 10000;
%let N = 100;
/* 1. Simulate B random samples from exponential distribution */
data Exp;
call streaminit(1);
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      x = rand("Expo");
      output;
   end;
end;
run;
 
/* 2. Estimate skewness (and other stats) for each sample */
proc means data=Exp noprint;
   by SampleID;
   var x;
   output out=MCEst mean=Mean stddev=StdDev skew=Skewness kurt=Kurtosis;
run;
 
/* 3. Graph the sampling distribution and overlay parameter value */
title "Monte Carlo Distribution of Sample Skewness";
title2 "N = &N; B = &NumSamples";
proc sgplot data=MCEst;
   histogram Skewness;
   refline 2 / axis=x lineattrs=(thickness=3 color=DarkRed) 
             labelattrs=(color=DarkRed) label="Parameter";
   refline 1.818 / axis=x lineattrs=(thickness=3 color=DarkBlue) label="Monte Carlo Estimate"
             labelattrs=(color=DarkBlue) labelloc=inside ;
run;
 
/* 4. Display the Monte Carlo estimate of the statistics */ 
proc means data=MCEst ndec=3 mean stddev;
   var Mean StdDev Skewness Kurtosis;
run;
Monte Carlo distribution of skewness statistic (B=10000, N=100)

For the exponential distribution, the skewness parameter has the value 2. However, according to the Monte Carlo simulation, the expected value of the sample skewness is about 1.82 for these samples of size 100. Thus, the bias is approximately 0.18, which is about 9% of the true value.

The kurtosis statistic is also biased. The output from PROC MEANS includes the Monte Carlo estimates for the expected value of the sample mean, standard deviation, skewness, and (excess) kurtosis. For the exponential distribution, the parameter values are 1, 1, 2, and 6, respectively. The Monte Carlo estimates for the sample mean and standard deviation are close to the parameter values because these are unbiased estimators. However, the estimates for the skewness and kurtosis are biased towards zero.

Summary

This article uses Monte Carlo simulation to demonstrate bias in the commonly used definitions of skewness and kurtosis. For skewed distributions, the expected value of the sample skewness is biased towards zero. The bias is greater for highly skewed distributions. The skewness statistic for a symmetric distribution is unbiased.

The post The sample skewness is a biased statistic appeared first on The DO Loop.

10月 212020
 

The triangulation theorem for polygons says that every simple polygon can be triangulated. In fact, if the polygon has V vertices, you can decompose it into V-2 non-overlapping triangles. In this article, a "polygon" always means a simple polygon. Also, a "random point" means one that is drawn at random from the uniform distribution.

The triangularization of a polygon is useful in many ways, but one application is to generate uniform random points in a polygon or a collection of polygons. Because polygons can be decomposed into triangles, the problem reduces to a simpler one: Given a list of k triangles, generate uniform random points in the union of the triangles. I have already shown how to generate random points in a triangle, so you can apply this method to generate random points in a polygon or collection of polygons.

An algorithm to generate random points in a polygon

Suppose that a polygon or any other planar region is decomposed into k triangles T1, T2, ..., Tk. If you want to generate N random points uniformly in the region, the number of points in any triangle should be proportional to the area of the triangle divided by the total area of the polygon.

One way to accomplish this is to use a two-step process. First, choose a triangle by using a probability proportional to the relative areas. Next, generate a random point in that triangle. This two-step approach is suitable for the SAS DATA step. At the end of this process, you have generated Ni observations in triangle Ti.

An equivalent formulation is to realize that the vector {N1, N2, ..., Nk} is a random draw from the multinomial distribution with parameters p = {p1, p2, ..., pk}, where pi = Area(Ti) / (Σj Area(Tj)). This second formulation is better for a vector languages such as the SAS/IML language.

Therefore, the following algorithm generates random points in a polygon:

  1. Decompose the polygon into triangles T1, T2, ..., Tk.
  2. Compute the areas: Ai = Area(Ti).
  3. Generate one random draw from the multinomial distribution with probability vector p = {p1, p2, ..., pk}, where pi = Ai / (Σj Aj). This gives a vector of numbers {N1, N2, ..., Nk}.
  4. Use the algorithm from the previous article to generate Ni random points in the triangle Ti.

Notice that Steps 2-4 of this algorithm apply to ANY collection of triangles. To make the algorithm flexible, I will implement the first step (the decomposition) in one function and the remaining steps in a second function.

Triangulate a convex polygon in SAS

There are various general methods for triangulating a polygon, but for convex polygons, there is a simple method. From among the V vertices, choose any vertex and call it P1. Enumerate the remaining vertices consecutively in a counter-clockwise direction: P2, P3, ..., Pk, where k = V-2. Because the polygon is convex, the following triangles decompose the polygon:

  • T1 = {P1, P2, P3}
  • T2 = {P1, P3, P4}, and so forth, up to
  • Tk-2 = {P1, Pk-1, Pk}

The following SAS/IML function decomposes a convex polygon into triangles. The triangles are returned in a SAS/IML list. The function is called on a convex hexagon and the resulting decomposition is shown below. The function uses the PolyIsConvex function, which is part of the Polygon package. You can download and install the Polygon package. You need to load the Polygon package before you call the function.

/* assume the polygon package is installed */
proc iml;
package load polygon;     /* load the polygon package */
 
/* Decompose a convex polygon into triangles. Return a list
   that contains the vertices for the triangles.
   This function uses a function in the Polygon package, which must be loaded.
*/
start TriangulateConvex(P);            /* input parameter(N x 2): vertices of polygon */
   isConvex = PolyIsConvex(P);
   if ^isConvex then
      return ( [] );                 /* The polygon is not convex */
   numTri = nrow(P) - 2;             /* number of triangles in convex polygon */
   L = ListCreate(numTri);           /* create list to store triangles */
   idx = 2:3;
   do i = 1 to ListLen(L);
      L$i = P[1,] // P[idx,];
      idx = idx + 1;
   end;
   return (L);
finish;
 
/* Specify a convex polygon and visualize the triangulation.  */
P = { 2 1 ,
      3 1 ,
      4 2 ,
      5 4 ,
      3 6 ,
      1 4 ,
      1 2 };
L = TriangulateConvex(P);
Decomposition of a convex polygon into triangles

To illustrate the process, I've included a graph that shows a decomposition of the convex hexagon into triangles. The triangles are returned in a list. The next section shows how to generate uniform points at random inside the union of the triangles in this list.

Generate random points in a collection of triangles

This section generates random points in a union of triangles. The following function takes two arguments: the number of points to generate (N) and a list of triangles (L). The algorithm computes the relative areas of the triangles and uses them to determine the probability that a point will be generated in each. It then uses the RandUnifTriangle function from the previous article to generate the random points.

/* Given a list of triangles (L), generate N random points in the union,
   where the number of points is proportional to Area(triangle) / Area(all triangles)
 
   This function uses functions in the Polygon package, which must be loaded.
*/
start RandUnifManyTriangles(N, L);
   numTri = ListLen(L);
   /* compute areas of each triangle in the list */
   AreaTri = j(1, numTri,.);         /* create vector to store areas */
   do i = 1 to numTri;
      AreaTri[i] = PolyArea(L$i);    /* PolyArea is in the Polygon package */
   end;
   /* Numbers of points in the triangles are multinomial with
      probability proportional to Area(triangle)/Area(polygon)
   */
   NTri = RandMultinomial(1, N, AreaTri/sum(AreaTri));
   cumulN = 0 || cusum(NTri);        /* cumulative counts; use as indices */
   z = j(N, 3, .);                   /* columns are (x,y,TriangleID) */
   do i = 1 to numTri;
      k = (cumulN[i]+1):cumulN[i+1]; /* the next NTri[i] elements */
      z[k, 1:2] = RandUnifTriangle(L$i, NTri[i]);
      z[k, 3] = i;                   /* store the triangle ID */
   end;
   return z;
finish;
 
/* The RandUnifTriangle function is defined at
   https://blogs.sas.com/content/iml/2020/10/19/random-points-in-triangle.html
*/
load module=(RandUnifTriangle);
call randseed(12345);
N = 2000;
z = RandUnifManyTriangles(N, L);

The z vector is an N x 3 matrix. The first two columns contain the (x,y) coordinates of N random points. The third column contains the ID number (values 1,2,...,k) that indicates the triangle that each point is inside of. You can use the PolyDraw function in the Polygon package to visualize the distribution of the points within the polygon:

title "Random Points in a Polygon";
title2 "Colors Assigned Based on Triangulation";
call PolyDraw(P, z);
Random uniform points in a polygon

The color of each point indicates which triangle the point is inside. You can see that triangles with relatively small areas (blue and purple) have fewer points than triangles with larger areas (green and brown).

Summary

In summary, this article shows how to generate random points inside a planar polygon. The first step is to decompose the polygon into triangles. You can use the relative areas of the triangles to determine the probability that a random point is in each triangle. Finally, you can generate random points in the union of the triangles. (Note: The algorithm works for any collection of planar triangles.)

This article uses functions in the Polygon package. Installing and loading a package is a way to define a set of related functions that you want to share. It is an alternative to using %INCLUDE to include the module definitions into your program.

You can download the complete SAS program that performs the computations and creates the graphs in this article.

The post Generate random points in a polygon appeared first on The DO Loop.

10月 212020
 

The triangulation theorem for polygons says that every simple polygon can be triangulated. In fact, if the polygon has V vertices, you can decompose it into V-2 non-overlapping triangles. In this article, a "polygon" always means a simple polygon. Also, a "random point" means one that is drawn at random from the uniform distribution.

The triangularization of a polygon is useful in many ways, but one application is to generate uniform random points in a polygon or a collection of polygons. Because polygons can be decomposed into triangles, the problem reduces to a simpler one: Given a list of k triangles, generate uniform random points in the union of the triangles. I have already shown how to generate random points in a triangle, so you can apply this method to generate random points in a polygon or collection of polygons.

An algorithm to generate random points in a polygon

Suppose that a polygon or any other planar region is decomposed into k triangles T1, T2, ..., Tk. If you want to generate N random points uniformly in the region, the number of points in any triangle should be proportional to the area of the triangle divided by the total area of the polygon.

One way to accomplish this is to use a two-step process. First, choose a triangle by using a probability proportional to the relative areas. Next, generate a random point in that triangle. This two-step approach is suitable for the SAS DATA step. At the end of this process, you have generated Ni observations in triangle Ti.

An equivalent formulation is to realize that the vector {N1, N2, ..., Nk} is a random draw from the multinomial distribution with parameters p = {p1, p2, ..., pk}, where pi = Area(Ti) / (Σj Area(Tj)). This second formulation is better for a vector languages such as the SAS/IML language.

Therefore, the following algorithm generates random points in a polygon:

  1. Decompose the polygon into triangles T1, T2, ..., Tk.
  2. Compute the areas: Ai = Area(Ti).
  3. Generate one random draw from the multinomial distribution with probability vector p = {p1, p2, ..., pk}, where pi = Ai / (Σj Aj). This gives a vector of numbers {N1, N2, ..., Nk}.
  4. Use the algorithm from the previous article to generate Ni random points in the triangle Ti.

Notice that Steps 2-4 of this algorithm apply to ANY collection of triangles. To make the algorithm flexible, I will implement the first step (the decomposition) in one function and the remaining steps in a second function.

Triangulate a convex polygon in SAS

There are various general methods for triangulating a polygon, but for convex polygons, there is a simple method. From among the V vertices, choose any vertex and call it P1. Enumerate the remaining vertices consecutively in a counter-clockwise direction: P2, P3, ..., Pk, where k = V-2. Because the polygon is convex, the following triangles decompose the polygon:

  • T1 = {P1, P2, P3}
  • T2 = {P1, P3, P4}, and so forth, up to
  • Tk-2 = {P1, Pk-1, Pk}

The following SAS/IML function decomposes a convex polygon into triangles. The triangles are returned in a SAS/IML list. The function is called on a convex hexagon and the resulting decomposition is shown below. The function uses the PolyIsConvex function, which is part of the Polygon package. You can download and install the Polygon package. You need to load the Polygon package before you call the function.

/* assume the polygon package is installed */
proc iml;
package load polygon;     /* load the polygon package */
 
/* Decompose a convex polygon into triangles. Return a list
   that contains the vertices for the triangles.
   This function uses a function in the Polygon package, which must be loaded.
*/
start TriangulateConvex(P);            /* input parameter(N x 2): vertices of polygon */
   isConvex = PolyIsConvex(P);
   if ^isConvex then
      return ( [] );                 /* The polygon is not convex */
   numTri = nrow(P) - 2;             /* number of triangles in convex polygon */
   L = ListCreate(numTri);           /* create list to store triangles */
   idx = 2:3;
   do i = 1 to ListLen(L);
      L$i = P[1,] // P[idx,];
      idx = idx + 1;
   end;
   return (L);
finish;
 
/* Specify a convex polygon and visualize the triangulation.  */
P = { 2 1 ,
      3 1 ,
      4 2 ,
      5 4 ,
      3 6 ,
      1 4 ,
      1 2 };
L = TriangulateConvex(P);
Decomposition of a convex polygon into triangles

To illustrate the process, I've included a graph that shows a decomposition of the convex hexagon into triangles. The triangles are returned in a list. The next section shows how to generate uniform points at random inside the union of the triangles in this list.

Generate random points in a collection of triangles

This section generates random points in a union of triangles. The following function takes two arguments: the number of points to generate (N) and a list of triangles (L). The algorithm computes the relative areas of the triangles and uses them to determine the probability that a point will be generated in each. It then uses the RandUnifTriangle function from the previous article to generate the random points.

/* Given a list of triangles (L), generate N random points in the union,
   where the number of points is proportional to Area(triangle) / Area(all triangles)
 
   This function uses functions in the Polygon package, which must be loaded.
*/
start RandUnifManyTriangles(N, L);
   numTri = ListLen(L);
   /* compute areas of each triangle in the list */
   AreaTri = j(1, numTri,.);         /* create vector to store areas */
   do i = 1 to numTri;
      AreaTri[i] = PolyArea(L$i);    /* PolyArea is in the Polygon package */
   end;
   /* Numbers of points in the triangles are multinomial with
      probability proportional to Area(triangle)/Area(polygon)
   */
   NTri = RandMultinomial(1, N, AreaTri/sum(AreaTri));
   cumulN = 0 || cusum(NTri);        /* cumulative counts; use as indices */
   z = j(N, 3, .);                   /* columns are (x,y,TriangleID) */
   do i = 1 to numTri;
      k = (cumulN[i]+1):cumulN[i+1]; /* the next NTri[i] elements */
      z[k, 1:2] = RandUnifTriangle(L$i, NTri[i]);
      z[k, 3] = i;                   /* store the triangle ID */
   end;
   return z;
finish;
 
/* The RandUnifTriangle function is defined at
   https://blogs.sas.com/content/iml/2020/10/19/random-points-in-triangle.html
*/
load module=(RandUnifTriangle);
call randseed(12345);
N = 2000;
z = RandUnifManyTriangles(N, L);

The z vector is an N x 3 matrix. The first two columns contain the (x,y) coordinates of N random points. The third column contains the ID number (values 1,2,...,k) that indicates the triangle that each point is inside of. You can use the PolyDraw function in the Polygon package to visualize the distribution of the points within the polygon:

title "Random Points in a Polygon";
title2 "Colors Assigned Based on Triangulation";
call PolyDraw(P, z);
Random uniform points in a polygon

The color of each point indicates which triangle the point is inside. You can see that triangles with relatively small areas (blue and purple) have fewer points than triangles with larger areas (green and brown).

Summary

In summary, this article shows how to generate random points inside a planar polygon. The first step is to decompose the polygon into triangles. You can use the relative areas of the triangles to determine the probability that a random point is in each triangle. Finally, you can generate random points in the union of the triangles. (Note: The algorithm works for any collection of planar triangles.)

This article uses functions in the Polygon package. Installing and loading a package is a way to define a set of related functions that you want to share. It is an alternative to using %INCLUDE to include the module definitions into your program.

You can download the complete SAS program that performs the computations and creates the graphs in this article.

The post Generate random points in a polygon appeared first on The DO Loop.

10月 192020
 
Random uniform points in a triangle

How can you efficiently generate N random uniform points in a triangular region of the plane? There is a very cool algorithm (which I call the reflection method) that makes the process easy. I no longer remember where I saw this algorithm, but it is different from the "weighted average" method in Devroye (1986, p. 569-570). This article describes and implements the reflection algorithm for generating random points in a triangle from the uniform distribution. The graph to the right shows 1,000 random points in the triangle with vertices P1=(1, 2), P2=(4, 4), and P3=(2, 0). The method works for any kind of triangle: acute, obtuse, equilateral, and so forth.

In this article, "random points" means that the points are drawn randomly from the uniform distribution.

Random points in a parallelogram

The easiest way to understand the algorithm is to think about generating points in a parallelogram. For simplicity, translate the parallelogram so that one vertex is at the origin. Two sides of the parallelogram share that vertex. Let a and b be the vectors from the origin to the adjacent vertices.

To produce a random point in the parallelogram, generate u1, u2 ~ U(0,1) and form the vector sum
    p = u1*a + u2*b
This is the 2-D parameterization of the parallelogram, so for random u1 and u2, the point p is uniformly distributed in the parallelogram. The geometry is shown in the following figure.

Random uniform points in a parallelogram

The following SAS/IML program generates 1,000 random points in the parallelogram. The graph is shown above.

proc iml;
n = 1000;
call randseed(1234);
 
/* random points in a parallelgram */
a = {3  2};                       /* vector along one side */
b = {1 -2};                       /* vector along adjacent side */
u = randfun(n // 2,  "Uniform");  /* u[,1], u[,2] ~ U(0,1) */
w = u[,1]@a + u[,2]@b;            /* linear combination of a and b */ 
title "Random Points in Parallelogram";
call scatter(w[,1], w[,2]) grid={x,y};

The only mysterious part of the program is the use of the Kronecker product (the '@' operator) to form linear combinations of the a and b vectors. The details of the Kronecker product operator are described in a separate article. Briefly, it is an efficient way to generate linear combinations of a and b without writing a loop.

The reflection step

A useful fact about random uniform variates is that if u ~ U(0,1), then also v = 1 - u ~ U(0,1). You can use this fact to convert N points in a parallelogram into N points in a triangle.

Let u1, u2 ~ U(0,1) be random variates in (0,1). If u1 + u2 ≤ 1, then the vector u1*a + u2*b is in the triangle with sides a and b. If u1 + u2 > 1, then define v1 = 1 - u1 and v2 = 1 - u2, which are also random uniform variates. You can verify that the sum v1 + v2 ≤ 1, which means that the vector v1*a + v2*b is in the triangle with sides a and b.

This is shown in the following graph. The blue points are the points for which u1 + u2 ≤ 1. The red points are for u1 + u2 > 1. When you form v1 and v2, the red triangle get reflected twice and ends up on top of the blue triangle. The two reflections are equivalent to a 180 degree rotation about the center of the parallelogram, which might be easier to visualize.

The method of reflection for generating uniform random points in a triangle

Generate random points in a triangle

With this background, you can now generate random points in any triangle. Let P1, P2, and P3 be the vertices of the triangle. The algorithm to generate random points in the triangle is as follows:

  1. Define the vectors a = P2 - P1 and b = P3 - P1. The vectors define the sides of the triangle when it is translated to the origin.
  2. Generate random uniform values u1, u2 ~ U(0,1)
  3. If u1 + u2 > 1, apply the transformation u1 → 1 - u1 and u2 → 1 - u2.
  4. Form w = u1 a + u2 b, which is a random point in the triangle at the origin.
  5. The point w + P1 is a random point in the original triangle.

The following SAS/IML program implements this algorithm and runs it for the triangle with vertices P1=(1, 2), P2=(4, 4), and P3=(2, 0).

/* generate random uniform sample in triangle with vertices
   P1 = (x0,y0), P2 = (x1,y1), and P3 = (x2,y2)
   The triangle is specified as a 3x2 matrix, where each row is a vertex.
*/
start randUnifTriangle(P, n);
   a = P[2,] - P[1,];         /* translate triangle to origin */
   b = P[3,] - P[1,];         /* a and b are vectors at the origin */
   u = randfun(n // 2,  "Uniform");
   idx = loc(u[,+] >= 1);     /* identify points outside of the triangle */
   if ncol(idx)>0 then 
      u[idx,] = 1 - u[idx,];  /* transform variates into the triangle */
   w = u[,1]@a + u[,2]@b;     /* linear combination of a and b vectors */ 
   return( P[1,] + w );       /* translate triangle back to original position */
finish;
store module=(randUnifTriangle);
 
/* triangle contains three vertices */
call randseed(1234,1);
P = {1 2,      /* P1 */
     4 4,      /* P2 */
     2 0};     /* P3 */
n = 1000;
w = randUnifTriangle(P, n);
 
title "Random Points in Triangle";
ods graphics / width=480px height=480px;
call scatter(w[,1], w[,2]) grid={x,y};

The graph of the 1,000 random points appears at the top of this program.

Showing scatter plots with polygons

As written, the programs in this article create scatter plots that show the random points. To improve the exposition, I used the polygon package to draw graphs that overlay the scatter plot and a polygon. You can download and install the polygon package if you have PROC IML with SAS 9.4m3 or later. You can download the complete SAS program that performs all the computations and creates all the graphs in this article.

Summary

In summary, this article shows how to generate random uniform points in a triangle by using the reflection algorithm. The reflection algorithm is based on generating random points in a parallelogram. If you draw the diagonal of a parallelogram, you get two congruent triangles. The algorithm reflects (twice) all points in one triangle into the other triangle. The algorithm is implemented in SAS by using the SAS/IML language, although you could also use the SAS DATA step.

The post Generate random points in a triangle appeared first on The DO Loop.

9月 282020
 

The Poisson-binomial distribution is a generalization of the binomial distribution.

  • For the binomial distribution, you carry out N independent and identical Bernoulli trials. Each trial has a probability, p, of success. The total number of successes, which can be between 0 and N, is a binomial random variable. The distribution of that random variable is the binomial distribution.
  • The Poisson-binomial distribution is similar, but the probability of success can vary among the Bernoulli trials. That is, you carry out N independent but not identical Bernoulli trials. The j_th trial has a probability, pj, of success. The total number of successes, which can be between 0 and N, is a random variable. The distribution of that random variable is the Poisson-binomial distribution.

This article shows how to generate a random sample from the Poisson-binomial distribution in SAS. Simulating a random sample is a great way to begin exploring a new distribution because the empirical density and empirical cumulative distribution enable you to see the shape of the distribution and how it depends on parameters.

Generate a random binomial sample

Before looking at the Poisson-binomial distribution, let's review the more familiar binomial distribution. The binomial distribution has two parameters: the probability of success (p) and the number of Bernoulli trials (N). The output from a binomial distribution is a random variable, k. The random variable is an integer between 0 and N and represents the number of successes among the N Bernoulli trials. If you perform many draws from the binomial distribution, the sample will look similar to the underlying probability distribution, which has mean N*p and variance N*p*(1-p).

SAS supports sampling from the binomial distribution by using the RAND function. The following SAS DATA step generates a random sample of 1,000 observations from the binomial distribution and plots the distribution of the sample:

/* Generate a random sample from the binomial distribution */
/* The Easy Way: call rand("Binom", p, N) */
%let m = 1000;              /* number of obs in random sample */
data BinomSample;
call streaminit(123);
p = 0.8; N = 10;            /* p = prob of success; N = num trials */
label k="Number of Successes";
do i = 1 to &m;
   k = rand("Binom", p, N); /* k = number of successes in N trials */
   output;
end;
keep i k;
run;
 
title "Binomial Sample";
footnote J=L "Sample Size = &m";
proc sgplot data=BinomSample;
   vbar k;
   yaxis grid;
run;
A random sample from the binomial distribution

This is the standard way to generate a random sample from the binomial distribution. In fact, with appropriate modifications, this program shows the standard way to simulate a random sample of size m from ANY of the built-in probability distributions in SAS.

Another way to simulate binomial data

Let's pretend for a moment that SAS does not support the binomial distribution. You could still simulate binomial data by making N calls to the Bernoulli distribution and counting the number of successes. The following program generates a binomial random variable by summing the results of N Bernoulli random variables:

/* The Alternative Way: Make N calls to rand("Bernoulli", p) */
data BinomSample2(keep=i k);
call streaminit(123);
p = 0.8; N = 10;
label k="Number of Successes";
do i = 1 to &m;
   k = 0;                            /* initialize k=0 */
   do j = 1 to N;                    /* sum of N Bernoulli variables */
      k = k + rand("Bernoulli", p);  /* every trial has the same probability, p */
   end;
   output;
end;
run;

The output data set is also a valid sample from the Binom(p, N) distribution. The program shows that you can replace the single call to RAND("Binom",p,N) with N calls to RAND("Bernoulli",p). You then must add up all the binary (0 or 1) random variables from the Bernoulli distribution.

Simulate Poisson-binomial data

The program in the previous section can be modified to generate data from the Poisson-binomial distribution. Instead of using the same probability for all Bernoulli trials, you can define an array of probabilities and use them to generate the Bernoulli random variables. This is shown by the following program, which generates the number of successes in a sequence of 10 Bernoulli trials where the probability of success varies among the trials:

/* generate a random sample from the Poisson-binomial distribution 
   that has 10 parameters: p1, p2, p3, ..., p10 */
data PoisBinomSample(keep=i k);
/* p[j] = probability of success for the j_th trial, i=1,2,...,10 */
array p[10] (0.2 0.2 0.3 0.3 0.4 0.6 0.7 0.8 0.8 0.9);  /* N = dim(p) */
call streaminit(123);
label k="Number of Successes";
do i = 1 to &m;
   k = 0;                              /* initialize k=0 */
   do j = 1 to dim(p);                 /* sum of N Bernoulli variables */
      k = k + rand("Bernoulli", p[j]); /* j_th trial has probability p[j] */
   end;
   output;
end;
run;
 
title "Poisson-Binomial Sample";
proc sgplot data=PoisBinomSample;
   vbar k;
   yaxis grid;
run;
A random sample from the Poisson-binomial distribution

The graph shows the distribution of a Poisson-binomial random sample. Each observation in the sample is the result of running the 10 trials and recording the number of successes. You can see from the graph that many of the trials resulted in 5 successes, although 4 or 6 are also very likely. For these parameters, it is rare to see 0 or 1 success, although both occurred during the 1,000 sets of trials. Seeing 10 successes is mathematically possible but did not occur in this simulation.

In this example, the vector of probabilities has both high and low probabilities. The probability of success is 0.2 for one trial and 0.9 for another. If you change the parameters in the Poisson-binomial distribution, you can get distributions that have different shapes. For example, if all the probabilities are small (close to zero), then the distribution will be positively skewed and the probability of seeing 0, 1, or 2 successes is high. Similarly, if all the probabilities are large (close to one), then the distribution will be negatively skewed and there is a high probability of seeing 8, 9, or 10 successes. If all probabilities are equal, then you get a binomial distribution.

Simulate Poisson-binomial data by using SAS/IML

The SAS/IML language makes it easy to encapsulate the Poisson-binomial simulation into a function. The function has two input arguments: the number of observations to simulate (m) and the vector of probabilities (p). The output of the function is a vector of m integers. For example, the following SAS/IML function implements the simulation of Poisson-binomial data:

proc iml;
/* Simulate from the Poisson-Binomial distribution. 
   Input: m = number of observations in sample
          p = column vector of N probabilities. The probability 
              of success for the j_th Bernoulli trial is p_j.
   Output: row vector of m realizations of X ~ PoisBinom(p) 
*/
start RandPoisBinom(m, _p);
   p = colvec(_p);
   b = j(nrow(p), m);     /* each column is a binary indicator var */
   call randgen(b, "Bernoulli", p); 
   return( b[+, ] );      /* return numSuccesses = sum each column */
finish;
 
/* The Poisson-binomial has N parameters: p1, p2, ..., pN */
p = {0.2 0.2 0.3 0.3 0.4 0.6 0.7 0.8 0.8 0.9};  /* 10 trials */
call randseed(123);
S = RandPoisBinom(&m, p);
call bar(S) grid="y" label="Number of Successes";
A random sample from the Poisson-binomial distribution

The SAS/IML program uses the RANDGEN function to fill up an N x m matrix with values from the Bernoulli distribution. The RANDGEN function supports a vector of parameters, which means that you can easily specify that each column should have a different probability of success. After filling the matrix with binary values, you can use the summation subscript reduction operator to obtain the number of successes (1s) in each column. The result is a row vector that contains m integers, each of which is the number of successes from a set of N Bernoulli trials with the given probabilities.

The PROC IML program uses the same parameters as the DATA step simulation. Accordingly, the sample distributions are similar.

The expected value of the Poisson-binomial distribution is the sum of the vector of probabilities. The variance is the sum of the individual Bernoulli variances. You can compute the mean and variance of the distribution and compare it to the sample mean and variance:

/* Expected values: mean and variance of the Poisson-binomial distribution */
mu = sum(p);
var = sum( (1-p)#p );
/* sample estimates of mean and variance */
XBar = mean(S`);
s2 = var(S`);
Moments = (mu//XBar) || (var//s2);
print Moments[f=5.2 r={"Distribution" "Sample Estimate"} c={"Mean" "Variance"}];
Expected values and sample statistics for the Poisson-binomial distribution

The expected number of successes in the Poisson-binomial distribution with these parameters is 5.2. In the sample, the average number of successes is 5.17. The variance of the Poisson-binomial distribution is 1.84. The variance of this sample is 1.75. The sample statistics are close to their expected values, which is what you expect to happen for a large random sample.

Summary

This article shows how to simulate data from the Poisson-binomial distribution. A random variable that follows the Poisson-binomial distribution gives the total number of success in N Bernoulli trials, where the j_th trial has the probability pj of success. The example in this article uses a 10-parameter vector of probabilities. If all parameter values are identical (p), then the Poisson-binomial distribution reduces to the standard Binom(p, 10) distribution. However, the Poisson-binomial distribution allows the probabilities to be different.

You can download the SAS program that computes the quantities and creates the graphs in this article.

8月 262020
 

In the paper "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018), I discussed an example that uses the new STREAMREWIND subroutine in Base SAS 9.4M5. As its name implies, the STREAMREWIND subroutine rewinds a random number stream, essentially resetting the stream to the beginning. I struggled to create a compelling example for the STREAMREWIND routine because using the subroutine "results in dependent streams of numbers" and because "it is usually not necessary in simulation studies" (p. 12). Regarding an application, I asserted that the subroutine "is convenient for testing."

But recently I was thinking about two-factor authentication and realized that I could use the STREAMREWIND subroutine to emulate generating a random token that changes every 30 seconds. I think it is a cool example, and it gives me the opportunity to revisit some of the newer features of random-number generation in SAS, including new generators and random number keys.

A brief overview of two-factor authentication

I am not an expert on two-factor authentication (TFA), but I use it to access my work computer, my bank accounts, and other sensitive accounts. The main idea behind TFA is that before you can access a secure account, you must authenticate yourself in two ways:

  • Provide a valid username and password.
  • Provide information that depends on a physical device that you own and that you have previously registered.

Most people use a smartphone as the physical device, but it can also be a PC or laptop. If you do an internet search for "two factor authentication tokens," you can find many images like the one on the right. This is the display from a software program that runs on a PC, laptop, or phone. The "Credential ID" field is a long string that is unique to each device. (For simplicity, I've replaced the long string with "12345.") The "Security Code" field displays a pseudorandom number that changes every 30 seconds. The Security Code depends on the device and on the time of day (within a 30-second interval). In the image, you can see a small clock and the number 28, which indicates that the Security Code will be valid for another 28 seconds before a new number is generated.

After you provide a valid username and password, the account challenges you to type in the current Security Code for your registered device. When you submit the Security Code, the remote server checks whether the code is valid for your device and for the current time of day. If so, you can access your account.

Two-factor random number streams

I love the fact that the Security Code is pseudorandom and yet verifiable. And it occurred to me that I can use the main idea of TFA to demonstrate some of the newer features in the SAS random number generators (RNGs).

Long-time SAS programmers know that each stream is determined by a random number seed. But a newer feature is that you can also set a "key" for a random number stream. For several of the new RNGs, streams that have the same seed but different keys are independent. You can use this fact to emulate the TFA app:

  • The Credential ID (which is unique to each device) is the "seed" for an RNG.
  • The time of day is the "key" for an RNG. Because the Security Code must be valid for 30 seconds, round the time to the nearest 30-second boundary.
  • Usually each call to the RAND function advances the state of the RNG so that the next call to RAND produces a new pseudorandom number. For this application, we want to get the same number for any call within a 30-second period. One way to do this is to reset the random number stream before each call so that RAND always returns the FIRST number in the stream for the (seed, time) combination.

Using a key to change a random-number stream

Before worrying about using the time of day as the key value, let's look at a simpler program that returns the first pseudorandom number from independent streams that have the same seed but different key values. I will use PROC FCMP to write a function that can be called from the SAS DATA step. Within the DATA step, I set the seed value and use the "Threefry 2" (TF2) RNG. I then call the Rnd6Int function for six different key values.

proc fcmp outlib=work.TFAFunc.Access;
   /* this function sets the key of a random-numbers stream and 
      returns the first 6-digit pseudorandom number in that stream */
   function Rnd6Int(Key);
      call stream(Key);               /* set the Key for the stream */
      call streamrewind(Key);         /* rewind stream with this Key */
      x = rand("Integer", 0, 999999); /* first 6-digit random number in stream */
      return( x );
   endsub;
quit;
 
options cmplib=(work.TFAFunc);       /* DATA step looks here for unresolved functions */
data Test;
DeviceID = 12345;                    /* ID for some device */
call streaminit('TF2', DeviceID);    /* set RNG and seed (once per data step) */
do Key = 1 to 6;
   SecCode = Rnd6Int(Key);           /* get random number from seed and key values */
   /* Call the function again. Should produce the same value b/c of STREAMREWIND */
   SecCodeDup = Rnd6Int(Key);  
   output;
end;
keep DeviceID Key SecCode:;
format SecCode SecCodeDup Z6.;
run;
 
proc print data=Test noobs; run;

Each key generates a different pseudorandom six-digit integer. Notice that the program calls the Rnd6Int function twice for each seed value. The function returns the same number each time because the random number stream for the (seed, key) combination gets reset by the STREAMREWIND call during each call. Without the STREAMREWIND call, the function would return a different value for each call.

Using a time value as a key

With a slight modification, the program in the previous section can be made to emulate the program/app that generates a new TFA token every 30 seconds. However, so that we don't have to wait so long, the following program sets the time interval (the DT macro) to 10 seconds instead of 30. Instead of talking about a 30-second interval or a 10-second interval, I will use the term "DT-second interval," where DT can be any time interval.

The program below gets the "key" by looking at the current datetime value and rounding it to the nearest DT-second interval. This value (the RefTime variable) is sent to the Rnd6Int function to generate a pseudorandom Security Code. To demonstrate that the program generates a new Security Code every DT seconds, I call the Rnd6Int function 10 times, waiting 3 seconds between each call. The results are printed below:

%let DT = 10;                  /* change the Security Code every DT seconds */
 
/* The following DATA step takes 30 seconds to run because it
   performs 10 iterations and waits 3 secs between iterations */
data TFA_Device;
keep DeviceID Time SecCode;
DeviceID = 12345;
call streaminit('TF2', DeviceID);   /* set the RNG and seed */
do i = 1 to 10;
   t = datetime();                  /* get the current time */
   /* round to the nearest DT seconds and save the "reference time" */
   RefTime = round(t, &DT); 
   SecCode = Rnd6Int(RefTime);      /* get a random Security Code */
   Time = timepart(t);              /* output only the time */
   call sleep(3, 1);                /* delay 3 seconds; unit=1 sec */
   output;
end;
format Time TIME10. SecCode Z6.;
run;
 
proc print data=TFA_Device noobs; 
   var DeviceId Time SecCode;
run;

The output shows that the program generated three different Security Codes. Each code is constant for a DT-second period (here, DT=10) and then changes to a new value. For example, when the seconds are in the interval [05, 15), the Security Code has the same value. The Security Code is also constant when the seconds are in the interval [15, 25) and so forth. A program like this emulates the behavior of an app that generates a new pseudorandom Security Code every DT seconds.

Different seeds for different devices

For TFA, every device has a unique Device ID. Because the Device ID is used to set the random number seed, the pseudorandom numbers that are generated on one device will be different than the numbers generated on another device. The following program uses the Device ID as the seed value for the RNG and the time of day for the key value. I wrapped a macro around the program and called it for three hypothetical values of the Device ID.

%macro GenerateCode(ID, DT);
data GenCode;
   keep DeviceID Time SecCode;
   format DeviceID 10. Time TIME10. SecCode Z6.;
   DeviceID = &ID;
   call streaminit('TF2', DeviceID); /* set the seed from the device */
   t = datetime();                   /* look at the current time */
   /* round to the nearest DT seconds and save the "reference time" */
   RefTime = round(t, &DT);          /* round to nearest DT seconds */
   SecCode = Rnd6Int(RefTime);       /* get a random Security Code */
   Time = timepart(t);               /* output only the time */
run;
 
proc print data=GenCode noobs; run;
%mend;
 
/* each device has a unique ID */
%GenerateCode(12345, 30);
%GenerateCode(24680, 30);
%GenerateCode(97531, 30);

As expected, the program produces different Security Codes for different Device IDs, even though the time (key) value is the same.

Summary

In summary, you can use features of the SAS random number generators in SAS 9.4M5 to emulate the behavior of a TFA token generator. The SAS program in this article uses the Device ID as the "seed" and the time of day as a "key" to select an independent stream. (Round the time into a certain time interval.) For this application, you don't want the RAND function to advance the state of the RNG, so you can use the STREAMREWIND call to rewind the stream before each call. In this way, you can generate a pseudorandom Security Code that depends on the device and is valid for a certain length of time.

The post Rewinding random number streams: An application appeared first on The DO Loop.