simulation

October 21, 2020
 

The triangulation theorem for polygons says that every simple polygon can be triangulated. In fact, if the polygon has V vertices, you can decompose it into V-2 non-overlapping triangles. In this article, a "polygon" always means a simple polygon. Also, a "random point" means one that is drawn at random from the uniform distribution.

The triangulation of a polygon is useful in many ways, but one application is to generate uniform random points in a polygon or a collection of polygons. Because polygons can be decomposed into triangles, the problem reduces to a simpler one: Given a list of k triangles, generate uniform random points in the union of the triangles. I have already shown how to generate random points in a triangle, so you can apply this method to generate random points in a polygon or collection of polygons.

An algorithm to generate random points in a polygon

Suppose that a polygon or any other planar region is decomposed into k triangles T1, T2, ..., Tk. If you want to generate N random points uniformly in the region, the number of points in any triangle should be proportional to the area of the triangle divided by the total area of the polygon.

One way to accomplish this is to use a two-step process. First, choose a triangle with probability proportional to its relative area. Next, generate a random point in that triangle. This two-step approach is suitable for the SAS DATA step, as sketched below. At the end of this process, you have generated Ni observations in triangle Ti.
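As a concrete illustration, here is a minimal DATA step sketch of the two-step process. The two triangles, their vertices, and their relative areas (1/4 and 3/4) are made-up values for illustration; the point-in-triangle step uses the reflection method from the previous article.

data RandPolyPts(keep = x y TriID);
call streaminit(123);
array prob[2] _temporary_ (0.25 0.75);       /* Area(Ti) / total area */
array vx[2,3] _temporary_ (0 1 0   1 3 1);   /* x-coords of each triangle's vertices */
array vy[2,3] _temporary_ (0 0 2   0 0 3);   /* y-coords of each triangle's vertices */
do i = 1 to 1000;
   TriID = rand("Table", of prob[*]);        /* step 1: choose a triangle */
   u1 = rand("Uniform");  u2 = rand("Uniform");
   if u1 + u2 > 1 then do;                   /* reflect the point into the triangle */
      u1 = 1 - u1;  u2 = 1 - u2;
   end;
   /* step 2: random point in the chosen triangle */
   x = vx[TriID,1] + u1*(vx[TriID,2]-vx[TriID,1]) + u2*(vx[TriID,3]-vx[TriID,1]);
   y = vy[TriID,1] + u1*(vy[TriID,2]-vy[TriID,1]) + u2*(vy[TriID,3]-vy[TriID,1]);
   output;
end;
run;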

An equivalent formulation is to realize that the vector {N1, N2, ..., Nk} is a random draw from the multinomial distribution with parameters p = {p1, p2, ..., pk}, where pi = Area(Ti) / (Σj Area(Tj)). This second formulation is better suited to a vector language such as the SAS/IML language.
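For example, a small SAS/IML sketch of one multinomial draw (the relative areas are made-up values for illustration) might look like this:

proc iml;
call randseed(123);
p = {1 2 3} / 6;                  /* relative areas of three triangles */
N = RandMultinomial(1, 600, p);   /* one draw: counts of 600 points per triangle */
print N;                          /* roughly {100 200 300} */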

Therefore, the following algorithm generates random points in a polygon:

  1. Decompose the polygon into triangles T1, T2, ..., Tk.
  2. Compute the areas: Ai = Area(Ti).
  3. Generate one random draw from the multinomial distribution with probability vector p = {p1, p2, ..., pk}, where pi = Ai / (Σj Aj). This gives a vector of numbers {N1, N2, ..., Nk}.
  4. Use the algorithm from the previous article to generate Ni random points in the triangle Ti.

Notice that Steps 2-4 of this algorithm apply to ANY collection of triangles. To make the algorithm flexible, I will implement the first step (the decomposition) in one function and the remaining steps in a second function.

Triangulate a convex polygon in SAS

There are various general methods for triangulating a polygon, but for convex polygons, there is a simple method. From among the V vertices, choose any vertex and call it P1. Enumerate the remaining vertices consecutively in a counter-clockwise direction: P2, P3, ..., PV. Because the polygon is convex, the following V-2 triangles decompose the polygon:

  • T1 = {P1, P2, P3}
  • T2 = {P1, P3, P4}, and so forth, up to
  • TV-2 = {P1, PV-1, PV}

The following SAS/IML function decomposes a convex polygon into triangles. The triangles are returned in a SAS/IML list. The function is called on a convex heptagon, and the resulting decomposition is shown below. The function uses the PolyIsConvex function, which is part of the Polygon package. You can download and install the Polygon package; you must load it before you call the function.

/* assume the polygon package is installed */
proc iml;
package load polygon;     /* load the polygon package */
 
/* Decompose a convex polygon into triangles. Return a list
   that contains the vertices for the triangles.
   This function uses a function in the Polygon package, which must be loaded.
*/
start TriangulateConvex(P);          /* input parameter (N x 2): vertices of polygon */
   isConvex = PolyIsConvex(P);
   if ^isConvex then
      return ( [] );                 /* The polygon is not convex */
   numTri = nrow(P) - 2;             /* number of triangles in convex polygon */
   L = ListCreate(numTri);           /* create list to store triangles */
   idx = 2:3;
   do i = 1 to ListLen(L);
      L$i = P[1,] // P[idx,];
      idx = idx + 1;
   end;
   return (L);
finish;
 
/* Specify a convex polygon and visualize the triangulation.  */
P = { 2 1 ,
      3 1 ,
      4 2 ,
      5 4 ,
      3 6 ,
      1 4 ,
      1 2 };
L = TriangulateConvex(P);
[Figure: Decomposition of a convex polygon into triangles]

To illustrate the process, I've included a graph that shows a decomposition of the convex heptagon into triangles. The triangles are returned in a list. The next section shows how to generate uniform points at random inside the union of the triangles in this list.

Generate random points in a collection of triangles

This section generates random points in a union of triangles. The following function takes two arguments: the number of points to generate (N) and a list of triangles (L). The algorithm computes the relative areas of the triangles and uses them to determine the probability that a point will be generated in each. It then uses the RandUnifTriangle function from the previous article to generate the random points.

/* Given a list of triangles (L), generate N random points in the union,
   where the number of points is proportional to Area(triangle) / Area(all triangles)
 
   This function uses functions in the Polygon package, which must be loaded.
*/
start RandUnifManyTriangles(N, L);
   numTri = ListLen(L);
   /* compute areas of each triangle in the list */
   AreaTri = j(1, numTri,.);         /* create vector to store areas */
   do i = 1 to numTri;
      AreaTri[i] = PolyArea(L$i);    /* PolyArea is in the Polygon package */
   end;
   /* Numbers of points in the triangles are multinomial with
      probability proportional to Area(triangle)/Area(polygon)
   */
   NTri = RandMultinomial(1, N, AreaTri/sum(AreaTri));
   cumulN = 0 || cusum(NTri);        /* cumulative counts; use as indices */
   z = j(N, 3, .);                   /* columns are (x,y,TriangleID) */
   do i = 1 to numTri;
      k = (cumulN[i]+1):cumulN[i+1]; /* the next NTri[i] elements */
      z[k, 1:2] = RandUnifTriangle(L$i, NTri[i]);
      z[k, 3] = i;                   /* store the triangle ID */
   end;
   return z;
finish;
 
/* The RandUnifTriangle function is defined at
   https://blogs.sas.com/content/iml/2020/10/19/random-points-in-triangle.html
*/
load module=(RandUnifTriangle);
call randseed(12345);
N = 2000;
z = RandUnifManyTriangles(N, L);

The result, z, is an N x 3 matrix. The first two columns contain the (x,y) coordinates of N random points. The third column contains the ID number (values 1,2,...,k) that indicates which triangle contains each point. You can use the PolyDraw function in the Polygon package to visualize the distribution of the points within the polygon:

title "Random Points in a Polygon";
title2 "Colors Assigned Based on Triangulation";
call PolyDraw(P, z);
[Figure: Random uniform points in a polygon]

The color of each point indicates which triangle the point is inside. You can see that triangles with relatively small areas (blue and purple) have fewer points than triangles with larger areas (green and brown).

Summary

In summary, this article shows how to generate random points inside a planar polygon. The first step is to decompose the polygon into triangles. You can use the relative areas of the triangles to determine the probability that a random point is in each triangle. Finally, you can generate random points in the union of the triangles. (Note: The algorithm works for any collection of planar triangles.)

This article uses functions in the Polygon package. Installing and loading a package is a way to define a set of related functions that you want to share. It is an alternative to using %INCLUDE to include the module definitions into your program.
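For example, a typical one-time installation and per-session load might look like the following (the ZIP-file path is a hypothetical placeholder; yours will differ):

proc iml;
package install "/home/userid/polygon.zip";  /* one-time: hypothetical download location */
package load polygon;                        /* run in each session that calls the functions */
quit;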

You can download the complete SAS program that performs the computations and creates the graphs in this article.

The post Generate random points in a polygon appeared first on The DO Loop.

October 19, 2020
 
[Figure: Random uniform points in a triangle]

How can you efficiently generate N random uniform points in a triangular region of the plane? There is a very cool algorithm (which I call the reflection method) that makes the process easy. I no longer remember where I saw this algorithm, but it is different from the "weighted average" method in Devroye (1986, p. 569-570). This article describes and implements the reflection algorithm for generating random points in a triangle from the uniform distribution. The graph to the right shows 1,000 random points in the triangle with vertices P1=(1, 2), P2=(4, 4), and P3=(2, 0). The method works for any kind of triangle: acute, obtuse, equilateral, and so forth.

In this article, "random points" means that the points are drawn randomly from the uniform distribution.

Random points in a parallelogram

The easiest way to understand the algorithm is to think about generating points in a parallelogram. For simplicity, translate the parallelogram so that one vertex is at the origin. Two sides of the parallelogram share that vertex. Let a and b be the vectors from the origin to the adjacent vertices.

To produce a random point in the parallelogram, generate u1, u2 ~ U(0,1) and form the vector sum
    p = u1*a + u2*b
This is the 2-D parameterization of the parallelogram, so for random u1 and u2, the point p is uniformly distributed in the parallelogram. The geometry is shown in the following figure.

[Figure: Random uniform points in a parallelogram]

The following SAS/IML program generates 1,000 random points in the parallelogram. The graph is shown above.

proc iml;
n = 1000;
call randseed(1234);
 
/* random points in a parallelogram */
a = {3  2};                       /* vector along one side */
b = {1 -2};                       /* vector along adjacent side */
u = randfun(n // 2,  "Uniform");  /* u[,1], u[,2] ~ U(0,1) */
w = u[,1]@a + u[,2]@b;            /* linear combination of a and b */ 
title "Random Points in Parallelogram";
call scatter(w[,1], w[,2]) grid={x,y};

The only mysterious part of the program is the use of the Kronecker product (the '@' operator) to form linear combinations of the a and b vectors. The details of the Kronecker product operator are described in a separate article. Briefly, it is an efficient way to generate linear combinations of a and b without writing a loop.
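For example, here is a tiny sketch (with made-up values) of how the Kronecker product scales a row vector by each element of a column vector:

proc iml;
u = {0.2, 0.5};   /* 2 x 1 column of scalars */
a = {3 2};        /* 1 x 2 row vector */
w = u @ a;        /* 2 x 2 matrix: row i equals u[i] # a */
print w;          /* {0.6 0.4, 1.5 1.0} */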

The reflection step

A useful fact about random uniform variates is that if u ~ U(0,1), then also v = 1 - u ~ U(0,1). You can use this fact to convert N points in a parallelogram into N points in a triangle.

Let u1, u2 ~ U(0,1) be random variates in (0,1). If u1 + u2 ≤ 1, then the vector u1*a + u2*b is in the triangle with sides a and b. If u1 + u2 > 1, then define v1 = 1 - u1 and v2 = 1 - u2, which are also random uniform variates. The sum satisfies v1 + v2 = 2 - (u1 + u2) < 1, which means that the vector v1*a + v2*b is in the triangle with sides a and b.

This is shown in the following graph. The blue points are the points for which u1 + u2 ≤ 1. The red points are for u1 + u2 > 1. When you form v1 and v2, the red triangle gets reflected twice and ends up on top of the blue triangle. The two reflections are equivalent to a 180-degree rotation about the center of the parallelogram, which might be easier to visualize.

[Figure: The method of reflection for generating uniform random points in a triangle]

Generate random points in a triangle

With this background, you can now generate random points in any triangle. Let P1, P2, and P3 be the vertices of the triangle. The algorithm to generate random points in the triangle is as follows:

  1. Define the vectors a = P2 - P1 and b = P3 - P1. The vectors define the sides of the triangle when it is translated to the origin.
  2. Generate random uniform values u1, u2 ~ U(0,1).
  3. If u1 + u2 > 1, apply the transformation u1 → 1 - u1 and u2 → 1 - u2.
  4. Form w = u1*a + u2*b, which is a random point in the triangle at the origin.
  5. The point w + P1 is a random point in the original triangle.

The following SAS/IML program implements this algorithm and runs it for the triangle with vertices P1=(1, 2), P2=(4, 4), and P3=(2, 0).

/* generate random uniform sample in triangle with vertices
   P1 = (x0,y0), P2 = (x1,y1), and P3 = (x2,y2)
   The triangle is specified as a 3x2 matrix, where each row is a vertex.
*/
start randUnifTriangle(P, n);
   a = P[2,] - P[1,];         /* translate triangle to origin */
   b = P[3,] - P[1,];         /* a and b are vectors at the origin */
   u = randfun(n // 2,  "Uniform");
   idx = loc(u[,+] >= 1);     /* identify points outside of the triangle */
   if ncol(idx)>0 then 
      u[idx,] = 1 - u[idx,];  /* transform variates into the triangle */
   w = u[,1]@a + u[,2]@b;     /* linear combination of a and b vectors */ 
   return( P[1,] + w );       /* translate triangle back to original position */
finish;
store module=(randUnifTriangle);
 
/* triangle contains three vertices */
call randseed(1234,1);
P = {1 2,      /* P1 */
     4 4,      /* P2 */
     2 0};     /* P3 */
n = 1000;
w = randUnifTriangle(P, n);
 
title "Random Points in Triangle";
ods graphics / width=480px height=480px;
call scatter(w[,1], w[,2]) grid={x,y};

The graph of the 1,000 random points appears at the top of this article.

Showing scatter plots with polygons

As written, the programs in this article create scatter plots that show the random points. To improve the exposition, I used the polygon package to draw graphs that overlay the scatter plot and a polygon. You can download and install the polygon package if you have PROC IML in SAS 9.4M3 or later. You can download the complete SAS program that performs all the computations and creates all the graphs in this article.
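For example, assuming the polygon package is loaded and the program above has run, a minimal sketch of such an overlay (using the same PolyDraw calling convention shown earlier on this page) is:

ID = j(n, 1, 1);            /* put all points in one group */
call PolyDraw(P, w || ID);  /* overlay triangle P and the (x, y, ID) points */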

Summary

In summary, this article shows how to generate random uniform points in a triangle by using the reflection algorithm. The reflection algorithm is based on generating random points in a parallelogram. If you draw the diagonal of a parallelogram, you get two congruent triangles. The algorithm reflects (twice) all points in one triangle into the other triangle. The algorithm is implemented in SAS by using the SAS/IML language, although you could also use the SAS DATA step.

The post Generate random points in a triangle appeared first on The DO Loop.

September 28, 2020
 

The Poisson-binomial distribution is a generalization of the binomial distribution.

  • For the binomial distribution, you carry out N independent and identical Bernoulli trials. Each trial has a probability, p, of success. The total number of successes, which can be between 0 and N, is a binomial random variable. The distribution of that random variable is the binomial distribution.
  • The Poisson-binomial distribution is similar, but the probability of success can vary among the Bernoulli trials. That is, you carry out N independent but not identical Bernoulli trials. The j_th trial has a probability, pj, of success. The total number of successes, which can be between 0 and N, is a random variable. The distribution of that random variable is the Poisson-binomial distribution.

This article shows how to generate a random sample from the Poisson-binomial distribution in SAS. Simulating a random sample is a great way to begin exploring a new distribution because the empirical density and empirical cumulative distribution enable you to see the shape of the distribution and how it depends on parameters.

Generate a random binomial sample

Before looking at the Poisson-binomial distribution, let's review the more familiar binomial distribution. The binomial distribution has two parameters: the probability of success (p) and the number of Bernoulli trials (N). The output from a binomial distribution is a random variable, k. The random variable is an integer between 0 and N and represents the number of successes among the N Bernoulli trials. If you perform many draws from the binomial distribution, the sample will look similar to the underlying probability distribution, which has mean N*p and variance N*p*(1-p).

SAS supports sampling from the binomial distribution by using the RAND function. The following SAS DATA step generates a random sample of 1,000 observations from the binomial distribution and plots the distribution of the sample:

/* Generate a random sample from the binomial distribution */
/* The Easy Way: call rand("Binom", p, N) */
%let m = 1000;              /* number of obs in random sample */
data BinomSample;
call streaminit(123);
p = 0.8; N = 10;            /* p = prob of success; N = num trials */
label k="Number of Successes";
do i = 1 to &m;
   k = rand("Binom", p, N); /* k = number of successes in N trials */
   output;
end;
keep i k;
run;
 
title "Binomial Sample";
footnote J=L "Sample Size = &m";
proc sgplot data=BinomSample;
   vbar k;
   yaxis grid;
run;
[Figure: A random sample from the binomial distribution]

This is the standard way to generate a random sample from the binomial distribution. In fact, with appropriate modifications, this program shows the standard way to simulate a random sample of size m from ANY of the built-in probability distributions in SAS.

Another way to simulate binomial data

Let's pretend for a moment that SAS does not support the binomial distribution. You could still simulate binomial data by making N calls to the Bernoulli distribution and counting the number of successes. The following program generates a binomial random variable by summing the results of N Bernoulli random variables:

/* The Alternative Way: Make N calls to rand("Bernoulli", p) */
data BinomSample2(keep=i k);
call streaminit(123);
p = 0.8; N = 10;
label k="Number of Successes";
do i = 1 to &m;
   k = 0;                            /* initialize k=0 */
   do j = 1 to N;                    /* sum of N Bernoulli variables */
      k = k + rand("Bernoulli", p);  /* every trial has the same probability, p */
   end;
   output;
end;
run;

The output data set is also a valid sample from the Binom(p, N) distribution. The program shows that you can replace the single call to RAND("Binom",p,N) with N calls to RAND("Bernoulli",p). You then must add up all the binary (0 or 1) random variables from the Bernoulli distribution.

Simulate Poisson-binomial data

The program in the previous section can be modified to generate data from the Poisson-binomial distribution. Instead of using the same probability for all Bernoulli trials, you can define an array of probabilities and use them to generate the Bernoulli random variables. This is shown by the following program, which generates the number of successes in a sequence of 10 Bernoulli trials where the probability of success varies among the trials:

/* generate a random sample from the Poisson-binomial distribution 
   that has 10 parameters: p1, p2, p3, ..., p10 */
data PoisBinomSample(keep=i k);
/* p[j] = probability of success for the j_th trial, j=1,2,...,10 */
array p[10] (0.2 0.2 0.3 0.3 0.4 0.6 0.7 0.8 0.8 0.9);  /* N = dim(p) */
call streaminit(123);
label k="Number of Successes";
do i = 1 to &m;
   k = 0;                              /* initialize k=0 */
   do j = 1 to dim(p);                 /* sum of N Bernoulli variables */
      k = k + rand("Bernoulli", p[j]); /* j_th trial has probability p[j] */
   end;
   output;
end;
run;
 
title "Poisson-Binomial Sample";
proc sgplot data=PoisBinomSample;
   vbar k;
   yaxis grid;
run;
[Figure: A random sample from the Poisson-binomial distribution]

The graph shows the distribution of a Poisson-binomial random sample. Each observation in the sample is the result of running the 10 trials and recording the number of successes. You can see from the graph that many of the trials resulted in 5 successes, although 4 or 6 are also very likely. For these parameters, it is rare to see 0 or 1 success, although both occurred during the 1,000 sets of trials. Seeing 10 successes is mathematically possible but did not occur in this simulation.

In this example, the vector of probabilities has both high and low probabilities. The probability of success is 0.2 for one trial and 0.9 for another. If you change the parameters in the Poisson-binomial distribution, you can get distributions that have different shapes. For example, if all the probabilities are small (close to zero), then the distribution will be positively skewed and the probability of seeing 0, 1, or 2 successes is high. Similarly, if all the probabilities are large (close to one), then the distribution will be negatively skewed and there is a high probability of seeing 8, 9, or 10 successes. If all probabilities are equal, then you get a binomial distribution.

Simulate Poisson-binomial data by using SAS/IML

The SAS/IML language makes it easy to encapsulate the Poisson-binomial simulation into a function. The function has two input arguments: the number of observations to simulate (m) and the vector of probabilities (p). The output of the function is a vector of m integers. For example, the following SAS/IML function implements the simulation of Poisson-binomial data:

proc iml;
/* Simulate from the Poisson-Binomial distribution. 
   Input: m = number of observations in sample
          p = column vector of N probabilities. The probability 
              of success for the j_th Bernoulli trial is p_j.
   Output: row vector of m realizations of X ~ PoisBinom(p) 
*/
start RandPoisBinom(m, _p);
   p = colvec(_p);
   b = j(nrow(p), m);     /* each column is a binary indicator var */
   call randgen(b, "Bernoulli", p); 
   return( b[+, ] );      /* return numSuccesses = sum each column */
finish;
 
/* The Poisson-binomial has N parameters: p1, p2, ..., pN */
p = {0.2 0.2 0.3 0.3 0.4 0.6 0.7 0.8 0.8 0.9};  /* 10 trials */
call randseed(123);
S = RandPoisBinom(&m, p);
call bar(S) grid="y" label="Number of Successes";
[Figure: A random sample from the Poisson-binomial distribution]

The SAS/IML program uses the RANDGEN function to fill up an N x m matrix with values from the Bernoulli distribution. The RANDGEN function supports a vector of parameters, which means that you can easily specify that each column should have a different probability of success. After filling the matrix with binary values, you can use the summation subscript reduction operator to obtain the number of successes (1s) in each column. The result is a row vector that contains m integers, each of which is the number of successes from a set of N Bernoulli trials with the given probabilities.
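For example, here is a tiny illustration (with made-up values, run in the same PROC IML session) of the summation subscript reduction operator:

b = {1 0 1,
     0 1 1};      /* 2 x 3 binary matrix: each column is one set of 2 trials */
s = b[+, ];       /* sum down each column: s = {1 1 2} */
print s;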

The PROC IML program uses the same parameters as the DATA step simulation. Accordingly, the sample distributions are similar.

The expected value of the Poisson-binomial distribution is the sum of the vector of probabilities: μ = Σj pj. The variance is the sum of the individual Bernoulli variances: σ² = Σj pj(1 - pj). You can compute the mean and variance of the distribution and compare them to the sample mean and variance:

/* Expected values: mean and variance of the Poisson-binomial distribution */
mu = sum(p);
var = sum( (1-p)#p );
/* sample estimates of mean and variance */
XBar = mean(S`);
s2 = var(S`);
Moments = (mu//XBar) || (var//s2);
print Moments[f=5.2 r={"Distribution" "Sample Estimate"} c={"Mean" "Variance"}];
[Figure: Expected values and sample statistics for the Poisson-binomial distribution]

The expected number of successes in the Poisson-binomial distribution with these parameters is 5.2. In the sample, the average number of successes is 5.17. The variance of the Poisson-binomial distribution is 1.84. The variance of this sample is 1.75. The sample statistics are close to their expected values, which is what you expect to happen for a large random sample.

Summary

This article shows how to simulate data from the Poisson-binomial distribution. A random variable that follows the Poisson-binomial distribution gives the total number of successes in N Bernoulli trials, where the j_th trial has the probability pj of success. The example in this article uses a 10-parameter vector of probabilities. If all parameter values are identical (p), then the Poisson-binomial distribution reduces to the standard Binom(p, 10) distribution. However, the Poisson-binomial distribution allows the probabilities to be different.

You can download the SAS program that computes the quantities and creates the graphs in this article.

August 26, 2020
 

In the paper "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018), I discussed an example that uses the new STREAMREWIND subroutine in Base SAS 9.4M5. As its name implies, the STREAMREWIND subroutine rewinds a random number stream, essentially resetting the stream to the beginning. I struggled to create a compelling example for the STREAMREWIND routine because using the subroutine "results in dependent streams of numbers" and because "it is usually not necessary in simulation studies" (p. 12). Regarding an application, I asserted that the subroutine "is convenient for testing."

But recently I was thinking about two-factor authentication and realized that I could use the STREAMREWIND subroutine to emulate generating a random token that changes every 30 seconds. I think it is a cool example, and it gives me the opportunity to revisit some of the newer features of random-number generation in SAS, including new generators and random number keys.

A brief overview of two-factor authentication

I am not an expert on two-factor authentication (TFA), but I use it to access my work computer, my bank accounts, and other sensitive accounts. The main idea behind TFA is that before you can access a secure account, you must authenticate yourself in two ways:

  • Provide a valid username and password.
  • Provide information that depends on a physical device that you own and that you have previously registered.

Most people use a smartphone as the physical device, but it can also be a PC or laptop. If you do an internet search for "two factor authentication tokens," you can find many images like the one on the right. This is the display from a software program that runs on a PC, laptop, or phone. The "Credential ID" field is a long string that is unique to each device. (For simplicity, I've replaced the long string with "12345.") The "Security Code" field displays a pseudorandom number that changes every 30 seconds. The Security Code depends on the device and on the time of day (within a 30-second interval). In the image, you can see a small clock and the number 28, which indicates that the Security Code will be valid for another 28 seconds before a new number is generated.

After you provide a valid username and password, the account challenges you to type in the current Security Code for your registered device. When you submit the Security Code, the remote server checks whether the code is valid for your device and for the current time of day. If so, you can access your account.

Two-factor random number streams

I love the fact that the Security Code is pseudorandom and yet verifiable. And it occurred to me that I can use the main idea of TFA to demonstrate some of the newer features in the SAS random number generators (RNGs).

Long-time SAS programmers know that each stream is determined by a random number seed. But a newer feature is that you can also set a "key" for a random number stream. For several of the new RNGs, streams that have the same seed but different keys are independent. You can use this fact to emulate the TFA app:

  • The Credential ID (which is unique to each device) is the "seed" for an RNG.
  • The time of day is the "key" for an RNG. Because the Security Code must be valid for 30 seconds, round the time to the nearest 30-second boundary.
  • Usually each call to the RAND function advances the state of the RNG so that the next call to RAND produces a new pseudorandom number. For this application, we want to get the same number for any call within a 30-second period. One way to do this is to reset the random number stream before each call so that RAND always returns the FIRST number in the stream for the (seed, time) combination.

Using a key to change a random-number stream

Before worrying about using the time of day as the key value, let's look at a simpler program that returns the first pseudorandom number from independent streams that have the same seed but different key values. I will use PROC FCMP to write a function that can be called from the SAS DATA step. Within the DATA step, I set the seed value and use the "Threefry 2" (TF2) RNG. I then call the Rnd6Int function for six different key values.

proc fcmp outlib=work.TFAFunc.Access;
   /* this function sets the key of a random-numbers stream and 
      returns the first 6-digit pseudorandom number in that stream */
   function Rnd6Int(Key);
      call stream(Key);               /* set the Key for the stream */
      call streamrewind(Key);         /* rewind stream with this Key */
      x = rand("Integer", 0, 999999); /* first 6-digit random number in stream */
      return( x );
   endsub;
quit;
 
options cmplib=(work.TFAFunc);       /* DATA step looks here for unresolved functions */
data Test;
DeviceID = 12345;                    /* ID for some device */
call streaminit('TF2', DeviceID);    /* set RNG and seed (once per data step) */
do Key = 1 to 6;
   SecCode = Rnd6Int(Key);           /* get random number from seed and key values */
   /* Call the function again. Should produce the same value b/c of STREAMREWIND */
   SecCodeDup = Rnd6Int(Key);  
   output;
end;
keep DeviceID Key SecCode:;
format SecCode SecCodeDup Z6.;
run;
 
proc print data=Test noobs; run;

Each key generates a different pseudorandom six-digit integer. Notice that the program calls the Rnd6Int function twice for each seed value. The function returns the same number each time because the random number stream for the (seed, key) combination gets reset by the STREAMREWIND call during each call. Without the STREAMREWIND call, the function would return a different value for each call.

Using a time value as a key

With a slight modification, the program in the previous section can be made to emulate the program/app that generates a new TFA token every 30 seconds. However, so that we don't have to wait so long, the following program sets the time interval (the DT macro) to 10 seconds instead of 30. Instead of talking about a 30-second interval or a 10-second interval, I will use the term "DT-second interval," where DT can be any time interval.

The program below gets the "key" by looking at the current datetime value and rounding it to the nearest DT-second interval. This value (the RefTime variable) is sent to the Rnd6Int function to generate a pseudorandom Security Code. To demonstrate that the program generates a new Security Code every DT seconds, I call the Rnd6Int function 10 times, waiting 3 seconds between each call. The results are printed below:

%let DT = 10;                  /* change the Security Code every DT seconds */
 
/* The following DATA step takes 30 seconds to run because it
   performs 10 iterations and waits 3 secs between iterations */
data TFA_Device;
keep DeviceID Time SecCode;
DeviceID = 12345;
call streaminit('TF2', DeviceID);   /* set the RNG and seed */
do i = 1 to 10;
   t = datetime();                  /* get the current time */
   /* round to the nearest DT seconds and save the "reference time" */
   RefTime = round(t, &DT); 
   SecCode = Rnd6Int(RefTime);      /* get a random Security Code */
   Time = timepart(t);              /* output only the time */
   call sleep(3, 1);                /* delay 3 seconds; unit=1 sec */
   output;
end;
format Time TIME10. SecCode Z6.;
run;
 
proc print data=TFA_Device noobs; 
   var DeviceId Time SecCode;
run;

The output shows that the program generated three different Security Codes. Each code is constant for a DT-second period (here, DT=10) and then changes to a new value. For example, when the seconds are in the interval [05, 15), the Security Code has the same value. The Security Code is also constant when the seconds are in the interval [15, 25) and so forth. A program like this emulates the behavior of an app that generates a new pseudorandom Security Code every DT seconds.

Different seeds for different devices

For TFA, every device has a unique Device ID. Because the Device ID is used to set the random number seed, the pseudorandom numbers that are generated on one device will be different than the numbers generated on another device. The following program uses the Device ID as the seed value for the RNG and the time of day for the key value. I wrapped a macro around the program and called it for three hypothetical values of the Device ID.

%macro GenerateCode(ID, DT);
data GenCode;
   keep DeviceID Time SecCode;
   format DeviceID 10. Time TIME10. SecCode Z6.;
   DeviceID = &ID;
   call streaminit('TF2', DeviceID); /* set the seed from the device */
   t = datetime();                   /* look at the current time */
   /* round to the nearest DT seconds and save the "reference time" */
   RefTime = round(t, &DT);          /* round to nearest DT seconds */
   SecCode = Rnd6Int(RefTime);       /* get a random Security Code */
   Time = timepart(t);               /* output only the time */
run;
 
proc print data=GenCode noobs; run;
%mend;
 
/* each device has a unique ID */
%GenerateCode(12345, 30);
%GenerateCode(24680, 30);
%GenerateCode(97531, 30);

As expected, the program produces different Security Codes for different Device IDs, even though the time (key) value is the same.

Summary

In summary, you can use features of the SAS random number generators in SAS 9.4M5 to emulate the behavior of a TFA token generator. The SAS program in this article uses the Device ID as the "seed" and the time of day as a "key" to select an independent stream. (Round the time into a certain time interval.) For this application, you don't want the RAND function to advance the state of the RNG, so you can use the STREAMREWIND call to rewind the stream before each call. In this way, you can generate a pseudorandom Security Code that depends on the device and is valid for a certain length of time.

The post Rewinding random number streams: An application appeared first on The DO Loop.

March 16, 2020
 

Books about statistics and machine learning often discuss the tradeoff between bias and variance for an estimator. These discussions are often motivated by a sophisticated predictive model such as a regression or a decision tree. But the basic idea can be seen in much simpler situations. This article presents a simple situation that is discussed in a short paper by Dan Jeske (1993). Namely, if a random process produces integers with a known set of probabilities, what method should you use to predict future values of the process?

I will start by briefly summarizing Jeske's result, which uses probability theory to derive the best biased and unbiased estimators. I then present a SAS program that simulates the problem and compares two estimators, one biased and one unbiased.

A random process that produces integers

Suppose a gambler asks you to predict the next roll of a six-sided die. He will reward you based on how close your guess is to the actual value he rolls. No matter what number you pick, you only have a 1/6 chance of being correct. But if the strategy is to be close to the value rolled, you can compute the expected value of the six faces, which is 3.5. Assuming that the gambler doesn't let you guess 3.5 (which is not a valid outcome), one good strategy is to round the expected value to the nearest integer. For dice, that means you would guess ROUND(3.5) = 4. Another good strategy is to randomly guess either 3 or 4 with equal probability.

Jeske's paper generalizes this problem. Suppose a random process produces the integers 1, 2, ..., N, with probabilities p1, p2, ..., pN, where the sum of the probabilities is 1. (This random distribution is sometimes called the "table distribution.") If your goal is to minimize the mean squared error (MSE) between your guess and a series of future random values, Jeske shows that the optimal solution is to guess the value that is closest to the expected value of the random variable. I call this method the ROUNDING estimator. In general, this method is biased, but it has the smallest expected MSE. Recall that the MSE is a measure of the quality of an estimator (smaller is better).

An alternative method is to randomly guess either of the two integers that are closest to the expected value, giving extra weight to the integer that is closer to the expected value. I call this method the RANDOM estimator. The random estimator is unbiased, but it has a higher MSE.

An example

The following example is from Jeske's paper. A discrete process generates a random variable, X, which can take the values 1, 2, and 3 according to the following probabilities:

  • P(X=1) = 0.2, which means that the value 1 appears with probability 0.2.
  • P(X=2) = 0.3, which means that the value 2 appears with probability 0.3.
  • P(X=3) = 0.5, which means that the value 3 appears with probability 0.5.

A graph of the probabilities is shown to the right. The expected value of this random variable is E(X) = 1(0.2) + 2(0.3) + 3(0.5) = 2.3. However, your guess must be one of the feasible values of X, so you can't guess 2.3. The best prediction (in the MSE sense) is to round the expected value. Since ROUND(2.3) is 2, the best guess for this example is 2.

Recall that an estimator for X is biased if its expected value is different from the expected value of X. Since E(X) ≠ 2, the rounding estimator is biased.

You can construct an unbiased estimator by randomly choosing the values 2 and 3, which are the two closest integers to E(X). Because E(X) = 2.3 is closer to 2 than to 3, you want to choose 2 more often than 3. You can make sure that the guesses average to 2.3 by guessing 2 with probability 0.7 and guessing 3 with probability 0.3. Then the weighted average of the guesses is 2(0.7) + 3(0.3) = 2.3, and this method produces an unbiased estimate. The random estimator is unbiased, but it will have a larger MSE.
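A quick sanity check, derived from the standard MSE decomposition (this derivation is mine, not quoted from Jeske's paper): for a fixed guess, c, the mean squared error is

$$ \operatorname{MSE}(c) = E[(X-c)^2] = \operatorname{Var}(X) + \bigl(E[X]-c\bigr)^2 , $$

so among integer guesses, the MSE is minimized by rounding E[X]. The random estimator, G, is independent of X and unbiased, so its MSE is Var(X) + Var(G), where Var(G) = F(1-F) and F is the fractional part of E[X]. For this example, Var(X) = 5.9 - 2.3² = 0.61. Therefore the rounding estimator has expected MSE 0.61 + 0.3² = 0.70, whereas the random estimator has expected MSE 0.61 + (0.7)(0.3) = 0.82. The simulation below should produce estimates near these values.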

Simulate the prediction of a random integer

Jeske proves these facts for an arbitrary table distribution, but let's use SAS to simulate the problem for the previous example. The first step is to compute the expected value of X. This is done by the following DATA step, which puts the expected value into a macro variable named MEAN:

/* Compute the expected value of X where 
   P(X=1) = 0.2
   P(X=2) = 0.3
   P(X=3) = 0.5
*/
data _null_;
array prob[3] _temporary_ (0.2, 0.3, 0.5);
do i = 1 to dim(prob);
   ExpectedValue + i*prob[i];       /* Sum of i*prob[i] */
end;
call symputx("Mean", ExpectedValue);
run;
 
%put &=Mean;
MEAN=2.3

The next step is to predict future values of X. For the rounding estimator, the predicted value is always 2. For the random estimator, let k = floor(E(X)) be the greatest integer less than or equal to E(X), and let F = E(X) - k be the fractional part of E(X). To get an unbiased estimator, you can randomly choose k with probability 1-F and randomly choose k+1 with probability F. This is done in the following DATA step, which makes the predictions, generates a realization of X, and computes the residual difference for each method for 1,000 random values of X:

/* If you know mean=E(X)=expected value of X, Jeske (1993) shows that round(mean) is the best 
   MSE predictor, but it is biased.
   Randomly guessing the two closest integers is the best UNBIASED MSE predictor
   https://www.academia.edu/15728006/Predicting_the_value_of_an_integer-valued_random_variable
 
   Use these two predictors for 1000 future random variates.
*/
%let NumGuesses = 1000;
data Guess(keep = x PredRound diffRound PredRand diffRand);
call streaminit(12345);
array prob[3] _temporary_ (0.2, 0.3, 0.5);  /* P(X=i) */
 
/* z = floor(z) + frac(z) where frac(z) >= 0 */
/* https://blogs.sas.com/content/iml/2020/02/10/fractional-part-of-a-number-sas.html */
k = floor(&Mean);
Frac = &Mean - k;                        /* fractional part of E(X) */
do i = 1 to &NumGuesses;
   PredRound = round(&Mean);             /* guess the nearest integer */
   PredRand = k + rand("Bern", Frac);    /* random guesses between k and k+1, weighted by Frac */
   /* The guesses are made. Now generate a new instance of X and compute residual difference */
   x = rand("Table", of prob[*]);
   diffRound = x - PredRound;            /* rounding estimate */
   diffRand  = x - PredRand;             /* unbiased estimate */
   output;
end;
run;
 
/* sure enough, ROUND is the best predictor in the MSE sense */
proc means data=Guess n USS mean;
   var diffRound DiffRand;
run;

The output from PROC MEANS shows the results of generating 1,000 random integers from X. The uncorrected sum of squares (USS) column shows the sum of the squared residuals for each estimator. (The MSE estimate is USS / 1000 for these data.) The table shows that the USS (and MSE) for the rounding estimator is smaller than for the random estimator. On the other hand, the mean of the residuals is not close to zero for the rounding method because it is a biased method. In contrast, the mean of the residuals for the random method, which is unbiased, is close to zero.

It might be easier to see the bias of the estimators if you look at the predicted values themselves, rather than at the residuals. The following call to PROC MEANS computes the sample mean for X and the two methods of predicting X:

/* the rounding method is biased; the random guess is unbiased */
proc means data=Guess n mean stddev;
   var x PredRound PredRand;
run;

This output shows that the simulated values of X have a sample mean of 2.34, which is close to the expected value. In contrast, the rounding method always predicts 2, so the sample mean for that column is exactly 2.0. The sample mean for the unbiased random method is 2.32, which is close to the expected value.

In summary, you can use SAS to simulate a simple example that compares two methods of predicting the value of a discrete random process. One method is biased but has the lowest MSE. The other is unbiased but has a larger MSE. In statistics and machine learning, practitioners often choose between an unbiased method (such as ordinary least squares regression) and a biased method (such as ridge regression or LASSO regression). The example in this article provides a very simple situation that you can use to think about these issues.

The post Predict a random integer: The tradeoff between bias and variance appeared first on The DO Loop.

March 9, 2020
 

In a previous article, I discussed the binormal model for a binary classification problem. This model assumes a set of scores that are normally distributed for each population, and the mean of the scores for the Negative population is less than the mean of scores for the Positive population. I showed how you can construct an exact ROC curve for the population.

Of course, in the real world, you do not know the distribution of the population, so you cannot obtain an exact ROC curve. However, you can estimate the ROC curve by using a random sample from the population.

This article draws a random sample from the binormal model and constructs the empirical ROC curve by using PROC LOGISTIC in SAS. The example shows an important truth that is sometimes overlooked: a sample ROC curve is a statistical estimate. Like all estimates, it is subject to sampling variation. You can visualize the variation by simulating multiple random samples from the population and overlaying the true ROC curve on the sample estimates.

Simulate data from a binormal model

The easiest way to simulate data from a binormal model is to simulate the scores. Recall the assumptions of the binormal model: all variables are continuous and there is a function that associates a score with each individual. The distribution of scores is assumed to be normal for the positive and negative populations. Thus, you can simulate the scores themselves, rather than the underlying data.

For this article, the scores for the Negative population (those who do not have a disease or condition) are distributed as N(0, 1). The scores for the Positive population (those who do have the disease or condition) are distributed as N(1.5, 0.75). The ROC curve for the population is shown to the right. It was computed by using the techniques from a previous article about binormal ROC curves.

The following SAS DATA step simulates a sample from this model. It samples nN = 50 scores from the Negative population and nP = 25 scores from the Positive population. The distribution of the scores in the sample is graphed below:

%let mu_N    = 0;       /* mean of Negative population */
%let sigma_N = 1;       /* std dev of Negative population */
%let mu_P    = 1.5;     /* mean of Positive population */
%let sigma_P = 0.75;    /* std dev of Positive population */
 
%let n_N = 50;     /* number of individuals from the Negative population */
%let n_P = 25;     /* number of individuals from the Positive population */
 
/* simulate one sample from the binormal model */
data BinormalSample;
call streaminit(12345);
y = 1;             /* positive population */
do i = 1 to &n_P;
   x = rand("Normal", &mu_P, &sigma_P); output;
end;
y = 0;             /* negative population */
do i = 1 to &n_N;
   x = rand("Normal", &mu_N, &sigma_N); output;
end;
drop i;
run;
 
title "Distribution of Scores for 'Negative' and 'Positive' Samples";
ods graphics / width=480px height=360px subpixel;
proc sgpanel data=BinormalSample noautolegend;
   styleattrs datacolors=(SteelBlue LightBrown);
   panelby y      / layout=rowlattice onepanel noheader;
   inset y        / position=topleft textattrs=(size=14) separator="=";
   histogram x    / group=y;
   rowaxis offsetmin=0;
   colaxis label="Score";
run;

Create an ROC plot for a sample

You can create an ROC curve by first creating a statistical model that classifies each observation into one of the two classes. You can then call PROC LOGISTIC in SAS to create the ROC curve, which summarizes the misclassification matrix (also called the confusion matrix) at various cutoff values for a threshold parameter. Although you can create an ROC curve for any predictive model, the following statements fit a logistic regression model. You can use the OUTROC= option on the MODEL statement to write the values of the sample ROC curve to a SAS data set. You can then overlay the ROC curves for the sample and for the population to see how they compare:

/* use logistic regression to classify individuals */
proc logistic data=BinormalSample noprint;
   model y(event='1') = x / outroc=ROC1;
run;
 
/* merge in the population ROC curve (the PopROC data set was created
   by using the techniques from the previous article) */
data ROC2;
   set ROC1 PopROC;
run;
 
title "Population ROC Curve vs. Estimate";
ods graphics / width=480px height=480px;
proc sgplot data=ROC2 aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / lineattrs=(color=blue) name="est" legendlabel="Estimate";
   series x=FPR y=TPR / lineattrs=(color=black) name="pop" legendlabel="Population";
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   keylegend "est" "pop" / location=inside position=bottomright across=1 opaque;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

The graph shows the sample ROC curve (the blue, piecewise-constant curve) and the population ROC curve (the black, smooth curve). The sample ROC curve is an estimate of the population curve. For this random sample, you can see that most estimates of the true positive rate are too large (given a value for the threshold parameter) and most estimates of the false positive rate are too low. In summary, the ROC estimate for this sample is overly optimistic about the ability of the classifier to discriminate between the positive and negative populations.

The beauty of the binormal model is that we know the true ROC curve. We can compare estimates to the truth. This can be useful when evaluating competing models: Instead of comparing sample ROC curves with each other, we can compare them to the ROC curve for the population.

Variation in the ROC estimates

If we simulate many samples from the binormal model and plot many ROC estimates, we can get a feel for the variation in the ROC estimates in random samples that have nN = 50 and nP = 25 observations. The following DATA step simulates B = 100 samples. The subsequent call to PROC LOGISTIC uses BY-group analysis to fit all B samples and generate the ROC curves. The ROC curves for five samples are shown below:

/* 100 samples from the binormal model */
%let NumSamples = 100;
data BinormalSim;
call streaminit(12345);
do SampleID = 1 to &NumSamples;
   y = 1;             /* sample from positive population */
   do i = 1 to &n_P;
      x = rand("Normal", &mu_P, &sigma_P); output;
   end;
   y = 0;             /* sample from negative population */
   do i = 1 to &n_N;
      x = rand("Normal", &mu_N, &sigma_N); output;
   end;
end;
drop i;
run;
 
proc logistic data=BinormalSim noprint;
   by SampleID;
   model y(event='1') = x / outroc=ROCs;
run;
 
title "5 Sample ROC Curves";
proc sgplot data=ROCs aspect=1 noautolegend;
   where SampleID <= 5;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

There is a moderate amount of variation among these curves. The brown curve looks different from the magenta curve, even though each curve results from fitting the same model to a random sample from the same population. The differences between the curves are due only to random variation in the data (scores).

You can use partial transparency to overlay all 100 ROC curves on the same graph. The following DATA step concatenates the empirical estimates and the population ROC curve; the subsequent PROC SGPLOT call overlays them:

/* merge in the population ROC curve */
data AllROC;
   set ROCs PopROC;
run;
 
title "ROC Curve for Binormal Model";
title2 "and Empirical ROC Curves for &NumSamples Random Samples";
proc sgplot data=AllROC aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID transparency=0.9
                                lineattrs=(color=blue pattern=solid thickness=2);
   series x=FPR y=TPR / lineattrs=(color=black thickness=2);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

The graph shows 100 sample ROC curves in the background (blue) and the population ROC curve in the foreground (black). The ROC estimates show considerable variability.

Summary

In summary, this article shows how to simulate samples from a binormal model. You can use PROC LOGISTIC to generate ROC curves for each sample. By looking at the variation in the ROC curves, you can get a sense of how much these estimates vary due to random variation in the data.

The post ROC curves for a binormal sample appeared first on The DO Loop.

7月 222019
 
Illustration of the 68-95-99.7 rule

Is 4 an extreme value for the standard normal distribution? In high school, students learn the famous 68-95-99.7 rule, which is a way to remember that 99.7 percent of random observations from a normal distribution are within three standard deviations of the mean. For the standard normal distribution, the probability that a random value is bigger than 3 is 0.0013. The probability that a random value is bigger than 4 is even smaller: about 0.00003, or 3 x 10^-5.
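
As a quick check of those tail probabilities, you can compute them directly. This is a minimal sketch that uses the CDF function:

data _null_;
   p3 = 1 - cdf("Normal", 3);   /* P(X > 3) = 0.00135 */
   p4 = 1 - cdf("Normal", 4);   /* P(X > 4) = 3.17E-5 */
   put p3= p4=;
run;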

So, if you draw randomly from a standard normal distribution, it must be very rare to see an extreme value greater than 4, right? Well, yes and no. Although it is improbable that any ONE observation is more extreme than 4, if you draw MANY independent observations, the probability that the sample contains an extreme value increases with the sample size. If p is the probability that one observation is less than 4, then p^n is the probability that n independent observations are all less than 4. Thus 1 - p^n is the probability that some value is greater than 4. You can use the CDF function in SAS to compute these probabilities in a sample of size n:

/* What is a "large value" of a normal sample? The answer depends on the size of the sample. */
/* Use CDF to find probability that a random value from N(0,1) exceeds 4 */
proc iml;
P_NotGT4 = cdf("Normal", 4);   /* P(x < 4) */
 
/* Probability of an extreme obs in a sample that contains n independent observations */
n = {1, 100, 1000, 10000};     /* sample sizes */
P_NotGT4 = P_NotGT4##n;        /* P(all values are < 4); ## is elementwise power */
P_GT4 = 1 - P_NotGT4;          /* P(any value is > 4)   */
print n P_NotGT4 P_GT4;
Probability of a value greater than 4 in a sample that contains n independent observations from the standard normal distribution

The third column of the table shows that the probability of seeing an observation whose value is greater than 4 increases with the size of the sample. In a sample of size 10,000, the probability is 0.27, which implies that about one out of every four samples of that size will contain a maximum value that exceeds 4.
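
A direct simulation supports this computation. The following sketch (the seed and the 1,000 simulated samples are arbitrary choices of mine) estimates the probability that a sample of size 10,000 contains a value greater than 4:

proc iml;
call randseed(54321);          /* arbitrary seed for this sketch */
n = 10000;                     /* sample size */
numSim = 1000;                 /* number of simulated samples */
x = j(n, numSim);
call randgen(x, "Normal");     /* each column is a N(0,1) sample */
maxes = x[<>, ];               /* maximum of each column */
propGT4 = (maxes > 4)[:];      /* proportion of samples whose max exceeds 4 */
print propGT4;                 /* should be near 0.27 */
quit;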

The distribution of the maximum in a Gaussian sample: A simulation approach

You can use a simulation to approximate the distribution of the maximum value of a normal sample of size n. For definiteness, choose n = 1,000 and sample from a standard normal distribution N(0,1). The following SAS/IML program simulates 5,000 samples of size n and computes the maximum value of each sample. You can then graph the distribution of the maximum values to understand how the maximum value varies in random samples of size n.

/* What is the distribution of the maximum value in a 
   sample of size n drawn from the standard normal distribution? */
call randseed(12345);
n = 1000;                  /* sample size */
numSim = 5000;             /* number of simulations */
x = j(n, numSim);          /* each column will be a sample */
call randgen(x, "Normal"); /* generate numSim samples */
max = x[<>, ];             /* find max of each sample (column) */
 
Title "Distribution of Maximum Value of a Normal Sample";
title2 "n = 1000";
call histogram(max);
 
/* compute some descriptive statistics */
mean = mean(max`);
call qntl(Q, max`, {0.25 0.5 0.75});
print (mean // Q)[rowname={"Mean" "P25" "Median" "P75"}];
Simulate 5000 samples of size n=1000. Plot the maximum value of each sample.

Based on this simulation, the expected maximum value of a sample of size n = 1,000 is about 3.2. The table indicates that about 25% of the samples have a maximum value that is less than 3. About half have a maximum value less than 3.2. About 75% of the samples have a maximum value that is less than 3.4. The graph shows the distribution of maxima in these samples. The maximum value of a sample ranged from 2.3 to 5.2.

The distribution of a maximum (or minimum) value in a sample is studied in an area of statistics that is known as extreme value theory. It turns out that you can derive the sampling distribution of the maximum of a sample by using the Gumbel distribution, which is also known as the "extreme value distribution of Type 1." You can use the Gumbel distribution to describe the distribution of the maximum of a sample of size n. The Gumbel distribution actually describes the maximum for many distributions, but for simplicity I will only refer to the normal distribution.

The distribution of the maximum in a Gaussian sample: The Gumbel distribution

This section does two things. First, it uses PROC UNIVARIATE to fit the parameters of a Gumbel distribution to the maximum values from the simulated samples. The Gumbel(3.07, 0.29) distribution is the distribution that maximizes the likelihood function for the simulated data. Second, it uses a theoretical formula to show that a Gumbel(3.09, 0.29) distribution is the distribution that models the maximum of a normally distributed sample of size n = 1,000. Thus, the results from the simulation and the theory are similar.

You can write the 5,000 maximum values from the simulation to a SAS data set and use PROC UNIVARIATE to estimate the MLE parameters for the Gumbel distribution, as follows:

create MaxVals var "max"; append; close;
QUIT;
 
/* Fit a Gumbel distribution, which models the distribution of maximum values */
proc univariate data=MaxVals;
   histogram max / gumbel;
   ods select Histogram ParameterEstimates FitQuantiles;
run;

The output from PROC UNIVARIATE shows that the Gumbel(3.07, 0.29) distribution is a good fit to the distribution of the simulated maxima. But where do those parameter values come from? How are the parameters of the Gumbel distribution related to the sample size of the standard normal distribution?

It was difficult to find an online reference that shows how to derive the Gumbel parameters for a normal sample of size n. I finally found a formula in the book Extreme Value Distributions (Kotz and Nadarajah, 2000, p. 9). For any cumulative distribution F that satisfies certain conditions, you can use the quantile function of the distribution to estimate the Gumbel parameters. The result is that the location parameter (μ) is μ = F^(-1)(1 - 1/n) and the scale parameter (σ) is σ = F^(-1)(1 - 1/(n e)) - μ, where e is the base of the natural logarithm. The following SAS/IML program uses the (1 - 1/n)th quantile of the normal distribution to derive the Gumbel parameters for a normal sample of size n:

proc iml;
n = 1000;
/* Compute parameters of Gumbel distribution when the sample is normal of size n.
   SAS calls the parameters (mu, sigma). Wikipedia calls them (mu, beta). Other
   references use (alpha, beta). */
mu_n    = quantile("Normal", 1-1/n);                /* location parameter */
sigma_n = quantile("Normal", 1-1/n*exp(-1)) - mu_n; /* scale parameter */
print n mu_n sigma_n;
 
/* what is the mean (expected value) and median of this Gumbel distribution? */
gamma = constant("Euler");             /* Euler–Mascheroni constant = 0.5772157 */
mean = mu_n + gamma*sigma_n;           /* expected value of maximum */
median = mu_n - sigma_n * log(log(2)); /* median of maximum value distribution */
print n mean median;

Notice that these parameters are very close to the MLE estimates from the simulated normal samples. The results tell you that the expected maximum in a standard normal sample of size n = 1,000 is 3.26, and about 50% of such samples have a maximum value of 3.19 or less.
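
As a sanity check on the median, you can evaluate the Gumbel CDF, F(x) = exp(-exp(-(x-μ)/σ)), at x = 3.19. This sketch continues the PROC IML session above:

x = 3.19;                               /* reported median of the maximum */
F = exp( -exp( -(x - mu_n)/sigma_n ) ); /* Gumbel CDF */
print F;                                /* approximately 0.5 */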

You can use these same formulas to find the expected and median values of the maximum in samples of other sizes:

/* If the sample size is n, what is expected maximum */
n = {1E4, 2E4, 1E5, 2E5, 1E6, 2E6};
mu_n    = quantile("Normal", 1-1/n);                /* location parameter */
sigma_n = quantile("Normal", 1-1/n*exp(-1)) - mu_n; /* scale parameter */
mean = mu_n + gamma*sigma_n;           /* expected value of maximum */
median = mu_n - sigma_n * log(log(2)); /* median of maximum value distribution */
print n mean median;

The table shows that you would expect to see a maximum value of 4 in a sample of size 20,000. If there are two million observations in a sample, you would expect to see a maximum value of 5!

You can graph this data over a range of sample sizes. The following graph shows the expected value of the maximum value in a sample of size n (drawn from a standard normal distribution) for large values of n.

You can create similar images for quantiles. The pth quantile of the Gumbel distribution is q = mu_n - sigma_n*log(-log(p)).
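
For example, the following statements (a sketch that continues the PROC IML session above) compute the 95th percentile of the maximum for the same sample sizes:

p = 0.95;
q95 = mu_n - sigma_n # log(-log(p));   /* # is elementwise multiplication */
print n q95;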

So, is 4 an unlikely value for the standard normal distribution? Yes, but for sufficiently large samples that value is likely to be observed. You can use the Gumbel distribution to model the distribution of the maximum in a normal sample of size n to determine how likely it is that the sample contains an extreme value. The larger the sample, the more likely it is to observe an extreme value. Although I do not discuss the general case, the Gumbel distribution can also model the maximum value for samples drawn from some non-normal distributions.

The post Extreme values: What is an extreme value for normally distributed data? appeared first on The DO Loop.

5月 202019
 

Recently I wrote about how to compute the Kolmogorov D statistic, which is used to determine whether a sample has a particular distribution. One of the beautiful facts about modern computational statistics is that if you can compute a statistic, you can use simulation to estimate the sampling distribution of that statistic. That means that instead of looking up critical values of the D statistic in a table, you can estimate the critical value by using empirical quantiles from the simulation.

This is a wonderfully liberating result! No longer are we statisticians constrained by the entries in a table in the appendix of a textbook. In fact, you could claim that modern computation has essentially killed the standard statistical table.

Obtain critical values by using simulation

Before we compute anything, let's recall a little statistical theory. If you get a headache thinking about null hypotheses and sampling distributions, you might want to skip the next two paragraphs!

When you run a hypothesis test, you compare a statistic (computed from data) to a hypothetical distribution (called the null distribution). If the observed statistic is way out in a tail of the null distribution, you reject the hypothesis that the statistic came from that distribution. In other words, the data does not seem to have the characteristic that you are testing for. Statistical tables use "critical values" to designate when a statistic is in the extreme tail. A critical value is a quantile of the null distribution; if the observed statistic is greater than the critical value, then the statistic is in the tail. (Technically, I've described a one-tailed test.)

One of the uses for simulation is to approximate the sampling distribution of a statistic when the true distribution is not known or is known only asymptotically. You can generate a large number of samples from the null hypothesis and compute the statistic on each sample. The union of the statistics approximates the true sampling distribution (under the null hypothesis) so you can use the quantiles to estimate the critical values of the null distribution.

Critical values of the Kolmogorov D distribution

You can use simulation to estimate the critical value for the Kolmogorov-Smirnov statistical test for normality. For the data in my previous article, the null hypothesis is that the sample data follow a N(59, 5) distribution. The alternative hypothesis is that they do not. The previous article computed a test statistic of D = 0.131 for the data (N = 30). If the null hypothesis is true, is that an unusual value to observe? Let's simulate 40,000 samples of size N = 30 from N(59,5) and compute the D statistic for each. Rather than use PROC UNIVARIATE, which computes dozens of statistics for each sample, you can use the SAS/IML computation from the previous article, which is very fast. The following simulation runs in a fraction of a second.

/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu    = 59;
%let sigma =  5;
%let N     = 30;
%let NumSamples = 40000;
 
proc iml;
call randseed(73);
N = &N;
i = T( 1:N );                           /* ranks */
u = i/N;                                /* ECDF height at right-hand endpoints */
um1 = (i-1)/N;                          /* ECDF height at left-hand endpoints  */
 
y = j(N, &NumSamples, .);               /* columns of Y are samples of size N */
call randgen(y, "Normal", &mu, &sigma); /* fill with random N(mu, sigma)      */
D = j(&NumSamples, 1, .);               /* allocate vector for results        */
 
do k = 1 to ncol(y);                    /* for each sample:                   */
   x = y[,k];                           /*    get sample x ~ N(mu, sigma)     */
   call sort(x);                        /*    sort sample                     */
   F = cdf("Normal", x, &mu, &sigma);   /*    CDF of reference distribution   */
   D[k] = max( F - um1, u - F );        /*    D = max( D_minus, D_plus )      */
end;
 
title "Monte Carlo Estimate of Sampling Distribution of Kolmogorov's D Statistic";
title2 "N = 30; N_MC = &NumSamples";
call histogram(D) other=
     "refline 0.131 / axis=x label='Sample D' labelloc=inside lineattrs=(color=red);";

The test statistic is right smack dab in the middle of the null distribution, so there is no reason to doubt that the sample is distributed as N(59, 5).

How big would the test statistic need to be to be considered extreme? To test the hypothesis at the α significance level, you can compute the 1 – α quantile of the null distribution. The following statements compute the critical value for α = 0.05 and N = 30:

/* estimate critical value as the 1 - alpha quantile */
alpha = 0.05;
call qntl(Dcrit_MC, D, 1-alpha);
print Dcrit_MC;

The estimated critical value for a sample of size 30 is 0.242. This compares favorably with the exact critical value from a statistical table, which gives Dcrit = 0.2417 for N = 30.

You can also use the null distribution to compute a p value for an observed statistic. The p value is estimated as the proportion of statistics in the simulation that exceed the observed value. For example, if you observe data that has a D statistic of 0.28, the estimated p value is obtained by the following statements:

Dobs = 0.28;                        /* hypothetical observed statistic */
pValue = sum(D >= Dobs) / nrow(D);  /* proportion of simulated statistics that exceed Dobs */
print Dobs pValue;

This same technique works for any sample size, N, although most tables list critical values only for N ≤ 30. For N > 35, you can use an asymptotic formula, developed by Smirnov (1948), that depends only on α: the critical value is approximately sqrt( -ln(α/2) / (2N) ).
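
The following sketch evaluates that formula for several sample sizes. (The formula is my reconstruction of Smirnov's asymptotic result; note that for N = 30 it gives 0.248, slightly larger than the exact tabulated value of 0.2417.)

/* sketch: asymptotic critical values from Smirnov's formula */
proc iml;
alpha = 0.05;
n = {30, 50, 100, 500};
Dcrit = sqrt( -log(alpha/2) / (2*n) );  /* elementwise; LOG is the natural log */
print n Dcrit;
quit;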

The Kolmogorov D statistic does not depend on the reference distribution

It is reasonable to assume that the results of this article apply only to a normal reference distribution. However, Kolmogorov proved that the sampling distribution of the D statistic is actually independent of the reference distribution. In other words, the distribution (and critical values) are the same regardless of the continuous reference distribution: beta, exponential, gamma, lognormal, normal, and so forth. That is a surprising result, which explains why there is only one statistical table for the critical values of the Kolmogorov D statistic, as opposed to having different tables for different reference distributions.
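
You can check this claim by modifying the earlier simulation: draw the samples from an exponential distribution and use the exponential CDF as the reference. The following sketch (mine, not code from the original article) estimates the same 95th-percentile critical value:

/* sketch: the null distribution of D is the same for an exponential reference */
proc iml;
call randseed(73);
N = 30;  NumSamples = 40000;
i = T(1:N);
u = i/N;                               /* ECDF height at right-hand endpoints */
um1 = (i-1)/N;                         /* ECDF height at left-hand endpoints */
y = j(N, NumSamples, .);
call randgen(y, "Exponential");        /* samples from an Exp(1) reference distribution */
D = j(NumSamples, 1, .);
do k = 1 to NumSamples;
   x = y[,k];
   call sort(x);
   F = cdf("Exponential", x);          /* CDF of the exponential reference distribution */
   D[k] = max(F - um1, u - F);
end;
call qntl(Dcrit, D, 0.95);
print Dcrit;                           /* close to 0.242, the same as for the normal reference */
quit;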

In summary, you can use simulation to estimate the critical values for the Kolmogorov D statistic. In a vectorized language such as SAS/IML, the entire simulation requires only about a dozen statements and runs extremely fast.

The post Critical values of the Kolmogorov-Smirnov test appeared first on The DO Loop.

5月 062019
 

Here's a simulation tip: When you simulate a fixed-effect generalized linear regression model, don't add a random normal error to the linear predictor. Only the response variable should be random. This tip applies to models that apply a link function to a linear predictor, including logistic regression, Poisson regression, and negative binomial regression.

Recall that a generalized linear model has three components:

  1. A linear predictor η = X β, which is a linear combination of the regressors.
  2. An invertible link function, g, which transforms the linear predictor to the expected value of the response variable. If μ = E(Y), then g(μ) = η, or equivalently μ = g^(-1)(η).
  3. A random component, which specifies the conditional distribution of the response variable, Y, given the values of the independent variables.

Notice that only the response variable is randomly generated. In a previous article about simulating data from a logistic regression model, I showed that the following SAS DATA step statements can be used to simulate data for a logistic regression model. The statements model a binary response variable, Y, which depends linearly on two explanatory variables X1 and X2:

   /* CORRECT way to simulate data from a logistic model with parameters (-2.7, -0.03, 0.07) */
   eta = -2.7 - 0.03*x1 + 0.07*x2;   /* linear predictor */
   mu = logistic(eta);               /* transform by inverse logit */
   y = rand("Bernoulli", mu);        /* simulate binary response with probability mu */

Notice that the randomness occurs only during the last step when you generate the response variable. Sometimes I see a simulation program in which the programmer adds a random term to the linear predictor, as follows:

   /* WRONG way to simulate logistic data. This is a latent-variable model. */
   eta = -2.7 - 0.03*x1 + 0.07*x2 + RAND("NORMAL", 0, 0.8);   /* WRONG: Leads to a misspecified model! */
   ...

Perhaps the programmer copied these statements from a simulation of a linear regression model, but it is not correct for a fixed-effect generalized linear model. When you simulate data from a generalized linear model, use the first set of statements, not the second.
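
The same pattern applies to other generalized linear models. For example, here is a minimal sketch of the correct way to simulate from a Poisson model; the parameter values are arbitrary, chosen only for illustration:

   /* CORRECT way to simulate from a Poisson model (hypothetical parameters) */
   eta = 0.5 + 0.2*x1 - 0.1*x2;      /* linear predictor */
   mu  = exp(eta);                   /* inverse of the log link */
   y   = rand("Poisson", mu);        /* only the response is random */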

Why is the second statement wrong? Because it has too much randomness. The model that generates the data includes a latent (unobserved) variable. The model you are trying to simulate is specified in a SAS regression procedure as MODEL Y = X1 X2, but the model for the latent-variable simulation (the second one) should be MODEL Y = X1 X2 X3, where X3 is the unobserved normally distributed variable.

What happens if you add a random term to the linear predictor

I haven't figured out all the mathematical ramifications of (incorrectly) adding a random term to the linear predictor prior to applying the logistic transform, but I ran a simulation that shows that the latent-variable model leads to biased parameter estimates when you fit the simulated data.

You can download the SAS program that generates data from two models: from the correct model (the first simulation steps) and from the latent-variable model (the second simulation). I generated 100 samples (each containing 5057 observations), then used PROC LOGISTIC to generate the resulting 100 sets of parameter estimates by using the statement MODEL Y = X1 X2. The results are shown in the following scatter plot matrix.

The blue markers are the parameter estimates from the correctly specified simulation. The reference lines in the upper right cells indicate the true values of the parameters in the simulation: (β0, β1, β2) = (-2.7, -0.03, 0.07). You can see that the true parameter values are in the center of the cloud of blue markers, which indicates that the parameter estimates are unbiased.

In contrast, the red markers show that the parameter estimates for the misspecified latent-variable model are biased. The simulated data does not come from the model that is being fit. This simulation used 0.8 for the standard deviation of the error term in the linear predictor. If you use a smaller value, the center of the red clouds will be closer to the true parameter values. If you use a larger value, the clouds will move farther apart.

For additional evidence that the data from the second simulation does not fit the model Y = X1 X2, the following graphs show the calibration plots for one data set from each simulation. The plot on the left shows nearly perfect calibration, which is not surprising because the data were simulated from the same model that is fitted! The plot on the right shows the calibration plot for the latent-variable model. That plot shows substantial deviations from a straight line, which indicates that the model is misspecified for the second set of data.

In summary, be careful when you simulate data for a generalized fixed-effect linear model. The randomness only appears during the last step when you simulate the response variable, conditional on the linear predictor. You should not add a random term to the linear predictor.

I'll leave you with a thought that is trivial but important: You can use the framework of the generalized linear model to simulate a linear regression model. For a linear model, the link function is the identity function and the response distribution is normal. That means that a linear model can be simulated by using the following:

   /* Alternative way to simulate a linear model with parameters (-2.7, -0.03, 0.07) */
   eta = -2.7 - 0.03*x1 + 0.07*x2;   /* linear predictor */
   mu = eta;                         /* identity link function */
   y = rand("Normal", mu, 0.7);      /* simulate Y as normal response with RMSE = 0.7 */

Thus simulating a linear model fits into the framework of simulating a generalized linear model, as it should!

Download the SAS program that generates the images in this article.

The post How to simulate data from a generalized linear model appeared first on The DO Loop.