Math

May 26, 2020
 

If you have been learning about machine learning or mathematical statistics, you might have heard about the Kullback–Leibler divergence. The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. It measures how much one distribution differs from a reference distribution. This article explains the Kullback–Leibler divergence and shows how to compute it for discrete probability distributions.

Recall that there are many statistical methods that indicate how much two distributions differ. For example, a maximum likelihood estimate involves finding parameters for a reference distribution that is similar to the data. Statistics such as the Kolmogorov-Smirnov statistic are used in goodness-of-fit tests to compare a data distribution to a reference distribution.

What is the Kullback–Leibler divergence?

Let f and g be probability mass functions that have the same domain. The Kullback–Leibler (K-L) divergence is the sum
KL(f, g) = Σ_x f(x) log( f(x)/g(x) )
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.) The K-L divergence measures the similarity between the distribution defined by g and the reference distribution defined by f.

For this sum to be well defined, the distribution g must be strictly positive on the support of f. That is, the Kullback–Leibler divergence is defined only when g(x) > 0 for all x in the support of f.

Some researchers prefer the argument to the log function to have f(x) in the denominator. Flipping the ratio introduces a negative sign, so an equivalent formula is
KL(f, g) = –Σ_x f(x) log( g(x)/f(x) )

Notice that if the two density functions (f and g) are the same, then the logarithm of the ratio is 0. Therefore, the K-L divergence is zero when the two distributions are equal. The K-L divergence is positive if the distributions are different.

The Kullback–Leibler divergence was developed as a tool for information theory, but it is frequently used in machine learning. The divergence has several interpretations. In information theory, it measures the information loss when f is approximated by g. In statistics and machine learning, f is often the observed distribution and g is a model.

An example of Kullback–Leibler divergence

As an example, suppose you roll a six-sided die 100 times and record the proportion of 1s, 2s, 3s, etc. You might want to compare this empirical distribution to the uniform distribution, which is the distribution of a fair die for which the probability of each face appearing is 1/6. The following SAS/IML statements compute the Kullback–Leibler (K-L) divergence between the empirical density and the uniform density:

proc iml;
/* K-L divergence is defined for positive discrete densities */
/* Face: 1   2   3   4   5   6 */
f    = {20, 14, 19, 19, 12, 16} / 100;  /* empirical density; 100 rolls of die */
g    = { 1,  1,  1,  1,  1,  1} / 6;    /* uniform density */
KL_fg = -sum( f#log(g/f) );             /* K-L divergence using natural log */
print KL_fg;

The K-L divergence is very small, which indicates that the two distributions are similar.

Although this example compares an empirical distribution to a theoretical distribution, you need to be aware of the limitations of the K-L divergence. The K-L divergence compares two distributions and assumes that the density functions are exact. The K-L divergence does not account for the size of the sample in the previous example. The computation is the same regardless of whether the first density is based on 100 rolls or a million rolls. Thus, the K-L divergence is not a replacement for traditional statistical goodness-of-fit tests.

The Kullback–Leibler divergence is not symmetric

Let's compare a different distribution to the uniform distribution. Let h(x)=9/30 if x=1,2,3 and let h(x)=1/30 if x=4,5,6. The following statements compute the K-L divergence between h and g and between g and h. In the first computation, the step distribution (h) is the reference distribution. In the second computation, the uniform distribution is the reference distribution.

h = { 9,  9,  9,  1,  1,  1} /30;    /* step density */
g = { 1,  1,  1,  1,  1,  1} / 6;    /* uniform density */
KL_hg = -sum( h#log(g/h) );          /* h is reference distribution */
print KL_hg;
 
/* show that K-L div is not symmetric */
KL_gh = -sum( g#log(h/g) );          /* g is reference distribution */
print KL_gh;

First, notice that the numbers are larger than for the example in the previous section. The f density function is approximately constant, whereas h is not. So the distribution for f is more similar to a uniform distribution than the step distribution is. Second, notice that the K-L divergence is not symmetric. In the first computation (KL_hg), the reference distribution is h, which means that the log terms are weighted by the values of h. The weights from h give a lot of weight to the first three categories (1,2,3) and very little weight to the last three categories (4,5,6). In contrast, g is the reference distribution for the second computation (KL_gh). Because g is the uniform density, the log terms are weighted equally in the second computation.

The fact that the summation is over the support of f means that you can compute the K-L divergence between an empirical distribution (which always has finite support) and a model that has infinite support. However, you cannot use just any distribution for g. Mathematically, f must be absolutely continuous with respect to g. (Another expression is that f is dominated by g.) This means that for every value of x such that f(x)>0, it is also true that g(x)>0.

A SAS function for the Kullback–Leibler divergence

It is convenient to write a function, KLDiv, that computes the Kullback–Leibler divergence for vectors that give the density for two discrete densities. The call KLDiv(f, g) should compute the weighted sum of log( g(x)/f(x) ), where x ranges over elements of the support of f. By default, the function verifies that g > 0 on the support of f and returns a missing value if that condition does not hold. The following SAS/IML function implements the Kullback–Leibler divergence.

/* The Kullback–Leibler divergence between two discrete densities f and g.
   The f distribution is the reference distribution, which means that 
   the sum is probability-weighted by f. 
   If f(x0)>0 at some x0, the model must allow it. You cannot have g(x0)=0.
*/
start KLDiv(f, g, validate=1);
   if validate then do;        /* if g might be zero, validate */
      idx = loc(g<=0);         /* find locations where g <= 0 */
      if ncol(idx) > 0 then do; 
         if any(f[idx]>0) then do; /* g is not a good model for f */
            *print "ERROR: g(x)=0 when f(x)>0";
            return( . );
         end;
      end;
   end;
   if any(f<=0) then do;        /* f = 0 somewhere */
      idx = loc(f>0);           /* support of f */
      fp = f[idx];              /* restrict f and g to support of f */
      return -sum( fp#log(g[idx]/fp) );  /* sum over support of f */
   end;
   /* else, f > 0 everywhere */
   return -sum( f#log(g/f) );  /* K-L divergence using natural logarithm */
finish;
 
/* test the KLDiv function */
f = {0, 0.10, 0.40, 0.40, 0.1};
g = {0, 0.60, 0.25, 0.15, 0};
KL_pq = KLDiv(f, g);  /* g is not a valid model for f; K-L div not defined */
KL_qp = KLDiv(g, f);  /* f is valid model for g. Sum is over support of g */
print KL_pq KL_qp;

The first call returns a missing value because the sum over the support of f encounters the invalid expression log(0) as the fifth term of the sum. The density g cannot be a model for f because g(5)=0 (no 5s are permitted) whereas f(5)>0 (5s were observed). The second call returns a positive value because the sum over the support of g is valid. In this case, f says that 5s are permitted, but g says that no 5s were observed.
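
As a quick sanity check (these statements are not part of the original example), you can verify that KLDiv reproduces the value that was computed directly for the die example in an earlier section:

/* sanity check: KLDiv should reproduce the direct computation for the die example */
f = {20, 14, 19, 19, 12, 16} / 100;   /* empirical density of 100 rolls */
g = j(6, 1, 1/6);                     /* uniform density */
KL_fg2 = KLDiv(f, g);                 /* equals -sum( f#log(g/f) ) */
print KL_fg2;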

Summary

This article explains the Kullback–Leibler divergence for discrete distributions. A simple example shows that the K-L divergence is not symmetric. This is explained by understanding that the K-L divergence involves a probability-weighted sum where the weights come from the first argument (the reference distribution). Lastly, the article gives an example of implementing the Kullback–Leibler divergence in a matrix-vector language such as SAS/IML. This article focused on discrete distributions. The next article shows how the K-L divergence changes as a function of the parameters in a model.

The post The Kullback–Leibler divergence between discrete probability distributions appeared first on The DO Loop.

May 11, 2020
 

I recently showed how to use linear interpolation in SAS. Linear interpolation is a common way to interpolate between a set of planar points, but the interpolating function (the interpolant) is not smooth. If you want a smoother interpolant, you can use cubic spline interpolation. This article describes how to use a cubic spline interpolation in SAS.

As mentioned in my previous post, an interpolation requires two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. These sample data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

The following SAS DATA steps define the data for this example. The POINTS data set contains the sample data, which are shown as blue markers on the graph to the right. The SCORE data set contains the scoring data, which are shown as red tick marks along the horizontal axis.

/* Example data for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

On the graph, the blue curve is the cubic spline interpolant. Every point that you interpolate will be on that curve. The red asterisks are the interpolated values for the values in the SCORE data set. Notice that points -1 and 10.5 are not interpolated because they are outside of the data range. The following section shows how to compute the cubic spline interpolation in SAS.

Cubic spline interpolation in SAS

A linear interpolation uses a linear function on each interval between the data points. In general, the linear segments do not meet smoothly: the resulting interpolant is continuous but not smooth. In contrast, spline interpolation uses a polynomial function on each interval and chooses the polynomials so that the interpolant is smooth where adjacent polynomials meet. For polynomials of degree k, you can match the first k – 1 derivatives at each data point.

A cubic spline is composed of piecewise cubic polynomials whose first and second derivatives match at each data point. Typically, the second derivatives at the minimum and maximum of the data are set to zero. This kind of spline is known as a "natural cubic spline" with knots placed at each data point.

I have previously shown how to use the SPLINE call in SAS/IML to compute a smoothing spline. A smoothing spline is not an interpolant because it does not pass through the original data points. However, you can get interpolation by using the SMOOTH=0 option. Adding the TYPE='zero' option results in a natural cubic spline.

For more control over the interpolation, you can use the SPLINEC function ('C' for coefficients) to fit the cubic splines to the data and obtain a matrix of coefficients. You can then use that matrix in the SPLINEV function ('V' for value) to evaluate the interpolant at the locations in the scoring data.

The following SAS/IML function (CubicInterp) computes the spline coefficients from the sample data and then interpolates the scoring data. The details of the computation are provided in the comments, but you do not need to know the details in order to use the function to interpolate data:

/* Cubic interpolating spline in SAS.
   The interpolation is based on the values (x1,y1), (x2,y2), ..., (xn, yn).
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are interpolated.
*/
proc iml;
start CubicInterp(x, y, t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(t>=min(x) & t<=max(x));     /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   /* fit the cubic model to the data */
   call splinec(splPred, coeff, endSlopes, x||y) smooth=0 type="zero";
 
   p = j(nrow(t)*ncol(t), 1, .);       /* allocate output (prediction) vector */
   call sortndx(ndx, colvec(t));       /* SPLINEV wants sorted data, so get sort index for t */
   sort_t = t[ndx];                    /* sorted version of t */
   sort_pred = splinev(coeff, sort_t); /* evaluate model at (sorted) points of t */
   p[ndx] = sort_pred[,2];             /* "unsort" by using the inverse sort index */
   return( p );
finish;
 
/* example of cubic spline interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = CubicInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

The visualization of the interpolation is similar to the code in the previous article, so the code is not shown here. However, you can download the SAS program that performs the cubic interpolation and creates the graph at the top of this article.

Although cubic spline interpolation is slower than linear interpolation, it is still fast: The CubicInterp program takes about 0.75 seconds to fit 1000 data points and interpolate one million scoring values.
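
If you want to run a similar timing experiment, the following sketch generates data of the same size. It assumes that you submit the statements inside the same PROC IML session (before the QUIT statement) so that the CubicInterp module is still defined; the elapsed time depends on your hardware:

/* hypothetical timing sketch: 1000 data points, one million scoring values */
call randseed(1);
xBig = T(1:1000);                     /* strictly increasing x values */
yBig = j(1000, 1);
call randgen(yBig, "Normal");         /* arbitrary y values */
tBig = j(1e6, 1);
call randgen(tBig, "Uniform");        /* U(0,1) variates */
tBig = 1 + 999*tBig;                  /* rescale into the range of x */
t0 = time();
predBig = CubicInterp(xBig, yBig, tBig);
elapsed = time() - t0;                /* elapsed seconds */
print elapsed;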

Summary

In summary, the SAS/IML language provides the computational tools for cubic spline interpolation. The CubicInterp function in this article encapsulates the functionality so that you can perform cubic spline interpolation of your data in an efficient manner.

The post Cubic spline interpolation in SAS appeared first on The DO Loop.

April 27, 2020
 

I've previously written about how to generate points that are uniformly distributed in the unit disk. A seemingly unrelated topic is the distribution of eigenvalues (in the complex plane) of various kinds of random matrices. However, I recently learned that these topics are somewhat related!

A mathematical result called the "Circular Law" states that the (complex) eigenvalues of a (scaled) random n x n matrix are uniformly distributed in a disk as n → ∞. For this article, a random matrix is one whose entries are independent random variates from a specified distribution that has mean 0 and unit variance. A more precise statement of the circular law is found in the link, but, loosely speaking, if A_n is a sequence of random matrices of increasing size, the distribution of the eigenvalues of (1/sqrt(n)) A_n converges to the uniform distribution in the unit disk as n → ∞.

This article uses SAS to visualize the eigenvalues of several large random matrices. The graph to the right shows the complex eigenvalues for an n x n matrix whose entries are normal random variates from N(0, 1/sqrt(n)), where n = 2000. These eigenvalues appear to be uniformly distributed, as the circular law dictates, and most are inside the unit disk.

To demonstrate the circular law, you can fill a large n x n matrix with independent variates from any probability distribution that has mean 0 and unit variance. The matrix is then scaled by dividing each entry by sqrt(n). Equivalently, you can save a step and fill the matrix with random variates from a distribution that has 0 mean and standard deviation 1/sqrt(n).

Eigenvalues of random matrices

Let me be perfectly clear: There are simple and efficient algorithms to simulate uniformly distributed variates in the unit disk. No one should use the eigenvalues of a large matrix as a way to simulate uniform variates in a disk!

The following SAS/IML program creates a random 2000 x 2000 matrix where each element is a random variate from the N(0, 1/sqrt(n)) distribution, which has mean 0 and standard deviation 1/sqrt(n). It then computes the eigenvalues (which are complex) and writes them to a data set. Because I intend to repeat these computations for other distributions, I use a macro variable (DSName) to hold the name of the data set and I use a macro (RandCall) to specify the IML statement that fills a matrix, M, with random variates:

/* Demonstrate the circular law: eigenvalues are uniform in unit disk */
title "Eigenvalues of n x n Matrix";
 
/* title, data set name, and IML statement for a distribution */
title2 "Elements ~ N(0, 1/sqrt(n))";
%let DSName = Normal;
%macro RandCall;
   call randgen(M, "Normal", 0, 1/sqrt(n));
%mend;
 
proc iml;
call randseed(12345);
n = 2000;                            /* sample size = dim of matrix */
M = j(n, n);
%RandCall;
v = eigval(M);
create &DSName from v[c={"Re" "Im"}]; append from v; close; 
quit;

To visualize the eigenvalues, you can create a scatter plot that displays the real part (Re) and the imaginary part (Im) of each eigenvalue. I want to overlay a unit circle on the graph, so the following DATA step creates a piecewise linear approximation of a circle and uses the POLYGON statement in PROC SGPLOT to overlay the eigenvalues and the circle:

data Circle;
retain ID 1  k 100;
pi = constant('pi');
do theta = 0 to 2*pi by 2*pi/k;
   px = cos(theta);   py = sin(theta);   output;
end;
keep ID px py;
run;
 
data All;
set &DSName Circle;
label Re = "Real Part of Eigenvalue"
      Im = "Imaginary Part of Eigenvalue";
run;
 
ods graphics / width=480px height=480px;
proc sgplot data=All noautolegend;
   polygon x=px y=py ID=ID;   /* the unit circle */
   scatter x=Re y=Im;         /* the complex eigenvalues */
   xaxis grid; yaxis grid;
run;

The graph is shown at the top of this article. It certainly appears that the eigenvalues of the 2000 x 2000 matrix are uniformly distributed. However, if you look carefully, a few eigenvalues appear to be outside of the unit disk. You can compute the squared distance from the origin for each eigenvalue (Re^2 + Im^2) and compare that quantity to 1 to count how many of the 2000 eigenvalues are outside the unit circle:

data InOut;
set &DSName;
InDisk = (Re**2 + Im**2 <= 1);   /* indicator variable */
run;
 
proc freq data=InOut order=freq;
   tables InDisk / nocum ;
run;

For this random matrix, 12 eigenvalues are not inside the unit disk. Thus, although the eigenvalues appear to be approximately uniformly distributed, it is only in the limit as n → ∞ that all eigenvalues are inside the closed disk.

Other probability distributions

The remarkable thing about the circular law is that it is true for ANY probability distribution that has mean 0 and standard deviation 1/sqrt(n). To demonstrate, you can graph the eigenvalues for matrices that are filled with random variates from two other probability distributions:

  • U(-sqrt(3/n), sqrt(3/n)): For a uniform distribution on [a, b], the mean is (a + b)/2 and the standard deviation is (b - a) / sqrt(12).
  • A discrete distribution for the random variable X/sqrt(n), where X can take on the values -3, 0, and 1 with probabilities 1/12, 8/12, and 3/12, respectively. For a discrete distribution, the mean is μ = Σ_i p_i x_i and the variance is Σ_i p_i (x_i - μ)^2, where p_i = P(X = x_i).
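
If you want to verify that the discrete distribution satisfies the hypotheses of the circular law, the following short computation (a sketch that is not part of the original demonstration) confirms that X has mean 0 and unit variance:

proc iml;
p  = {1, 8, 3} / 12;            /* probabilities P(X=-3), P(X=0), P(X=1) */
x  = {-3, 0, 1};                /* values of X */
mu = sum(p # x);                /* mean: equals 0 */
sigma2 = sum(p # (x - mu)##2);  /* variance: equals 1 */
print mu sigma2;
quit;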

To plot the eigenvalues of the matrix filled with random variates from the uniform distribution, you can modify the macro variables in the previous program:

/* matrix of random uniform variates */
title2 "Elements ~ U(-sqrt(3/n), sqrt(3/n))";
%let DSName = Uniform;
%macro RandCall;
   call randgen(M, "Uniform", -sqrt(3/n), sqrt(3/n));
%mend;

To plot the eigenvalues of the matrix filled with random variates from the discrete distribution, modify the macro variables as follows:

/* matrix of discrete random uniform variates */
title2 "Elements ~ P{X=-3)=1/12, P(X=0)=8/12, P(X=1)=3/12";
%let DSName = Discrete;
%macro RandCall;
   call randgen(M, "Table", {0.0833333, 0.6666667, 0.25});  /* returns values 1, 2, 3 */
   vals = {-3, 0, 1} / sqrt(n);     /* substitute values from discrete distribution */
   M = shape( vals[M], n, n );      /* reshape into n x n matrix */
%mend;

The graph for each matrix is shown below. To my eye, they look similar. I had wondered whether I would be able to determine which eigenvalues came from a matrix that has only three discrete entries (-3/sqrt(2000), 0, and 1/sqrt(2000)). It is amazing that the spectrum of a matrix that has only three unique elements is so similar to the spectrum of a matrix in which almost every element is distinct!

In summary, I used SAS to visualize the Circular Law, which is an interesting theorem about the distribution of the eigenvalues of a random matrix. The theorem states that the distribution converges to the uniform distribution on the unit disk. The elements of the matrix are independent random variates from any probability distribution that has mean 0 and standard deviation 1/sqrt(n). The SAS/IML language makes it convenient to create the random matrices and compute their eigenvalues.

The post The circular law for eigenvalues appeared first on The DO Loop.

March 11, 2020
 
Regular polygons approximate a circle

Recently, I saw a graphic on Twitter by @neilrkaye that showed the rapid convergence of a regular polygon to a circle as you increase the number of sides for the polygon. The author remarked that polygons that have 40 or more sides "all look like circles to me." That is, a regular polygon with a large number of sides is visually indistinguishable from a circle.

I had two reactions to the graphic. The first was "Cool! I want to create a similar figure in SAS!" (I have done so on the right; click to enlarge.) The second reaction was that the idea is both mathematically trivial and mathematically sophisticated. It is trivial because it was known to Archimedes more than 2,000 years ago and is readily understood by using high school math. But it is sophisticated because it is a particular instance of a mathematical limit, which is the foundation of calculus and the mathematical field of analysis.

The figure also demonstrates the fact that you can approximate a smooth object (a circle, a curve, a surface, ...) by a large number of small, local, linear approximations. It is not an exaggeration to say that that concept is among the most important concepts in applied mathematics. It is used in numerical methods to solve nonlinear equations, integrals, differential equations, and more. It is used in numerical optimization. It forms the foundations of computer graphics. A standard technique in computational algorithms that involves a nonlinear function is to approximate the function by the first term of the Taylor expansion—a process known as linearization.

An approximation of pi

Archimedes used this process (local approximation by a linear function) to approximate pi (π), which is the ratio of the circumference of a circle to its diameter. He used two sets of regular polygons: one that is inscribed in the unit circle and the other that circumscribes the unit circle. The circumference of an inscribed polygon is less than the circumference of the circle. The circumference of a circumscribed polygon is greater than the circumference of the circle. They converge to a common value, which is the circumference of the circle. If you apply this process to a unit circle, you approximate the value 2π. Archimedes used the same construction to approximate the area of the unit circle, which is π.

You can use Archimedes's ideas to approximate π by using trigonometry and regular polygons, as shown in the following paragraph.

Suppose that a regular polygon has n sides. Then the central angle between adjacent vertices is θ = 2π/n radians. The following diagram illustrates the geometry of inscribed and circumscribed polygons. The right triangle that is shown is half of the triangle between adjacent vertices. Consequently,

  • The half-length of a side is b, where b = sin(θ/2) for the inscribed polygon and b = tan(θ/2) for the circumscribed polygon.
  • The height of the triangle is h, where h = cos(θ/2) for the inscribed polygon and h = 1 for the circumscribed polygon.
  • The circumference of the regular polygon is 2 n b.
  • The area of the regular polygon is 2 n ( b h/2 ).

Approximate pi by using Archimedes's method

Although Archimedes did not have access to a modern computer, you can easily write a SAS DATA step program to reproduce Archimedes's approximations to the circumference and area of the unit circle, as shown below:

/* area and circumference */
data ApproxCircle;
pi = constant('pi');           /* Hah! We must know pi in order to evaluate the trig functions! */
south = 3*pi/2;                /* angle that is straight down */
do n = 3 to 100;
   angle = 2*pi/n;
   theta = south + angle/2;    /* first vertex for this n-gon */
   /* Circumference and Area for circumscribed polygon */
   b = tan(angle/2);
   h = 1;
   C_Out = 2*n * b;
   A_Out = 2*n * 0.5*b*h;
   /* Circumference and Area for inscribed polygon */
   b = cos(theta);
   h = abs( sin(theta) );
   C_In = 2*n * b;
   A_In = 2*n * 0.5*b*h;
   /* difference between the circumscribed and inscribed circles */
   CDiff = C_Out - C_In;
   ADiff = A_Out - A_In;
   output;
end;
label CDiff="Circumference Difference" ADiff="Area Difference";
keep n C_In C_Out A_In A_Out CDiff ADiff;
run;
 
/* graph the circumference and area of the n-gons as a function of n */
ods graphics / reset height=300px width=640px;
%let curveoptions = curvelabel curvelabelpos=min;
title "Circumference of Regular Polygons";
proc sgplot data=ApproxCircle;
   series x=n y=C_Out / &curveoptions;
   series x=n y=C_In  / &curveoptions;
   refline 6.2831853 / axis=y;              /* reference line at 2 pi */
   xaxis grid values=(0 to 100 by 10) valueshint;
   yaxis values=(2 to 10) valueshint label="Regular Polygon Approximation";
run;
 
title "Area of Regular Polygons";
proc sgplot data=ApproxCircle;
   series x=n y=A_Out / &curveoptions;
   series x=n y=A_In  / &curveoptions;
   refline 3.1415927 / axis=y;              /* reference line at pi */
   xaxis grid values=(0 to 100 by 10) valueshint;
   yaxis  values=(2 to 10) valueshint label="Regular Polygon Approximation";
run;

As you can see from the graph, the circumference and area of the regular polygons converge quickly to 2π and π, respectively. After n = 40 sides, the curves are visually indistinguishable from their limits, which is the same result that we noticed visually when looking at the grid of regular polygons.

The DATA step also computes the difference between the measurements of the circumscribed and inscribed polygons. You can print out a few of the differences to determine how close these estimates are to the circumference and area of the unit circle:

 
proc print data=ApproxCircle noobs;
   where n in (40, 50, 56, 60, 80, 100);
   var n CDiff ADiff;
run;

The table shows that the difference between the circumference of the circumscribed and inscribed polygons is about 0.02 when n = 40. For n = 56, the difference is less than 0.01, which means that the circumference of a regular polygon approximates the circumference of the unit circle to two decimal places when n ≥ 56. If you use a regular 100-gon, the circumference is within 0.003 of the circumference of the unit circle. Although it is not shown, it turns out that you need to use 177 sides before the difference is within 0.001, meaning that a 177-gon approximates the circumference of the unit circle to three decimal places.
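
You can let the DATA step find that threshold. The following sketch (not part of the original analysis) iterates until the circumference difference first drops below 0.001 and reports n = 177:

/* find the smallest n for which the circumference difference is < 0.001 */
data _null_;
pi = constant('pi');
do n = 3 to 10000 until (CDiff < 0.001);
   CDiff = 2*n*tan(pi/n) - 2*n*sin(pi/n);   /* C_Out - C_In */
end;
put "Smallest n: " n=;    /* n=177 */
run;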

Similar results hold for the area of the polygons and the area of a unit circle.

In conclusion, not only does a regular n-gon look very similar to the circle when n is large, but you can quantify how quickly the circumference and areas of an n-gon converges to the values 2 π and π, respectively. For n=56, the polygon values are accurate to two decimal places; for n=177, the polygon values are accurate to three decimal places.

Approximating a smooth curve by a series of discrete approximations is the foundation of calculus and modern numerical methods. The idea had its start in ancient Greece, but the world had to wait for Newton, Leibniz, and others to provide the mathematical machinery (limits, convergence, ...) to understand the concepts rigorously.

The post Polygons, pi, and linear approximations appeared first on The DO Loop.

November 27, 2019
 
Visualization of a quadratic function and a linear subspace

This article discusses how to restrict a multivariate function to a linear subspace. This is a useful technique in many situations, including visualizing an objective function that is constrained by linear equalities. For example, the graph to the right is from a previous article about how to evaluate quadratic polynomials. The graph shows a heat map for a quadratic polynomial of two variables. The diagonal line represents a linear constraint between the X and Y variables. If you restrict the polynomial to the diagonal line, you obtain a one-dimensional function that you can easily graph and visualize. By repeating this process for other lines, you can construct a "stack" of lower-dimensional slices that enable you to understand the graph of the original bivariate function.

This example generalizes. For a function of many variables, you can restrict the function to a lower-dimensional subspace. By visualizing the restricted function, you can understand the high-dimensional function better. I previously demonstrated this technique for multidimensional regression models by using the SLICEFIT option in the EFFECTPLOT statement in SAS. This article shows how to use ideas from vector calculus to restrict a function to a parameterized linear subspace.

Visualize high-dimensional data and functions

There are basically two techniques for visualizing high-dimensional objects: projection and slicing. Many methods in multivariate statistics compute a linear subspace such that the projection of the data onto the subspace has a desirable property. One example is principal component analysis, which endeavors to find a low dimensional subspace that captures most of the variation in the data. Another example is linear discriminant analysis, which aims to find a linear subspace that maximally separates groups in the data. In both cases, the data are projected onto the computed subspaces to reveal characteristics of the data. There are also statistical methods that attempt to find nonlinear subspaces.

Whereas projection is used for data, slicing is often used to visualize the graphs of functions. In regression, you model a response variable as a function of the explanatory variables. Slicing the function along a linear subspace of the domain can reveal important characteristics about the function. That is the method used by the SLICEFIT option in the EFFECTPLOT statement, which enables you to visualize multidimensional regression models. The most common "slice" for a regression model is to specify constant values for all but one continuous variable in the model.

When visualizing an objective function that is constrained by a set of linear equalities, the idea is similar. The main difference is that the "slice" is usually determined by a general linear combination of variables.

Restrict a function to a one-dimensional subspace

A common exercise in multivariate calculus is to restrict a bivariate function to a line. (This is equivalent to a "slice" of the bivariate function.) Suppose that you choose a point, x0, and a unit vector, u, which represents the "direction". Then a parametric line through x0 in the direction of u is given by the vector expression v(t) = x0 + t u, where t is any real number. If you restrict a multivariate function to the image of v, you obtain a one-dimensional function.

For example, consider the quadratic polynomial in two variables:
f(x,y) = 9x^2 + x*y + 4y^2 - 12x - 4y + 6
The function is visualized by the heat map at the top of this article. Suppose x0 = {0, -1} and choose u to be the unit vector in the direction of the vector {3, 1}. (This choice corresponds to a linear constraint of the form x – 3y = 3.)

The following SAS/IML program constructs the parameterized line v(t) for a uniform set of t values. The quadratic function is evaluated at this set of values and the resulting one-dimensional function is graphed:

proc iml;
/* Evaluate the quadratic function at each column of X and return a row vector. */
start Func(XY);
   x = XY[1,]; y = XY[2,];
   return (9*x##2  + x#y + 4*y##2) - 12*x - 4*y + 6;
finish;
 
x0 = {0, -1};          /* evaluate polynomial at this point */
d  = {3, 1};           /* vector that determines direction */
u  = d / norm(d);      /* unit vector (direction) */
 
t = do(-2, 2, 0.1);    /* parameter values */
v = x0 + t @ u;        /* evenly spaced points along the linear subspace */
f = Func(v);           /* evaluate the 2-D function along the 1-D subspace */
title "Quadratic Form Evaluated on Linear Subspace";
call series(t, f) grid={x y};
Quadratic function restricted to linear subspace

The graph shows the restriction of the 2-D function to the 1-D subspace. According to the graph, the restricted function reaches a minimum value for t ≈ 0.92. The corresponding (x, y) values and the function value at that point are shown below:

tMin = 0.92;           /* t* = parameter near the minimum */
vMin = x0 + tMin * u;  /* corresponding (x(t*), y(t*)) */
fMin = Func(vMin);     /* f(x(t*), y(t*)) */
print tMin vMin fMin;

If you look back to the heat map at the top of this article, you will see that the point (x,y) = (0.87, -0.71) corresponds to the location at which the function achieves a minimum value when restricted to the linear subspace. The value of the function at that point is approximately 6.6.

The SAS/IML program uses the Kronecker direct-product operator (@) to create points along the line. The operation is v = x0 + t @ u. The symbols x0 and u are both column vectors with two elements. The Kronecker product creates a 2 x k matrix, where k is the number of elements in the row vector t. The Kronecker product enables you to vectorize the program instead of looping over the elements of t and forming the individual vectors x0 + t[i]*u.
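
To convince yourself that the Kronecker trick produces the same points as an explicit loop, you can run a short comparison. This sketch assumes the x0 and u vectors that were defined earlier in the PROC IML session:

/* compare the vectorized Kronecker computation with an explicit loop */
t  = do(-2, 2, 0.5);
V1 = x0 + t @ u;              /* vectorized: 2 x k matrix of points */
V2 = j(2, ncol(t), .);
do i = 1 to ncol(t);
   V2[,i] = x0 + t[i]*u;      /* form one point at a time */
end;
maxDiff = max(abs(V1 - V2));  /* 0: the two computations agree */
print maxDiff;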

You can generalize this example to evaluate a multivariate function on a two-dimensional subspace. If u1 and u2 are two orthogonal, p-dimensional, unit vectors, the expression x0 + s*u1 + t*u2 spans a two-dimensional subspace of R^p. You can use the ExpandGrid function in SAS/IML to generate ordered pairs (s, t) on a two-dimensional grid.
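
The following sketch shows the idea for a hypothetical function of p = 3 variables; the base point and the orthonormal vectors u1 and u2 are illustrative choices, not values from the original example:

/* sketch: generate points on a 2-D linear subspace of R^p */
p  = 3;
x0 = {0, -1, 1};                                /* base point in R^p */
u1 = {1, 0, 0};  u2 = {0, 1, 0};                /* orthonormal directions */
st = expandgrid( do(-2,2,0.5), do(-2,2,0.5) );  /* grid of (s,t) pairs */
V  = x0` + st[,1]*u1` + st[,2]*u2`;             /* each row is x0 + s*u1 + t*u2 */
/* evaluate your p-variable function at each row of V */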

Summary

In summary, you can slice a multivariate function along a linear subspace. Usually, 1-D or 2-D subspaces are used and the resulting (restricted) function is graphed. This helps you to understand the function. You can choose a series of slices (often parallel to each other) and "stack" the slices in order to visualize the multivariate function. Typically, this technique works best for functions that have three or four continuous variables.

You can download the SAS program that creates the graphs in this article.

The post Evaluate a function on a linear subspace appeared first on The DO Loop.

September 30, 2019
 

There are several different kinds of means. They all try to find an average value from among a set of numbers. Although the most popular mean is the arithmetic mean, the geometric mean can be useful for problems in statistics, finance, and biology. A common application of the geometric mean is to find an average growth rate for an asset or for a population.

What is the geometric mean?

The geometric mean of n nonnegative numbers is the nth root of the product of the numbers:
GM = (x1 * x2 * ... * xn)^(1/n) = (Π xi)^(1/n)
When the numbers are all positive, the geometric mean is equivalent to computing the arithmetic mean of the log-transformed data and then using the exponential function to back-transform the result: GM = exp( (1/n) Σ log(xi) ).
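
The following short SAS/IML sketch (not part of the original article) computes the geometric mean both ways for a small set of numbers:

proc iml;
x = {1.5, 2, 3};
GM1 = prod(x)##(1/nrow(x));   /* nth root of the product */
GM2 = exp( mean(log(x)) );    /* back-transformed mean of the log data */
print GM1 GM2;                /* both are approximately 2.08 */
quit;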

Physical interpretation of the geometric mean

The geometric mean has a useful interpretation in terms of the volume of an n-dimensional rectangular solid. If GM is the geometric mean of n positive numbers, then GM is the length of the side of an n-dimensional cube that has the same volume as the rectangular solid with side lengths x1, x2, ..., xn. For example, a rectangular solid with sides 1.5, 2, and 3 has a volume V = 9. The geometric mean of those three numbers is ((1.5)(2)(3))^(1/3) ≈ 2.08. The volume of a cube with sides of length 2.08 is 9.

This interpretation in terms of volume is analogous to interpreting the arithmetic mean in terms of length. Namely, if you have n line segments of lengths x1, x2, ..., xn, then the total length of the segments is the same as the length of n copies of a segment of length AM, where AM is the arithmetic mean.

What is the geometric mean good for?

The geometric mean can be used to estimate the "center" of any set of positive numbers but is frequently used to estimate an average value in problems that deal with growth rates or ratios. For example, the geometric mean is an average growth rate for an asset or for a population. The following example uses the language of finance, although you can replace "initial investment" by "initial population" if you are interested in population growth.

In precalculus, growth rates are introduced in terms of the growth of an initial investment amount (the principal) compounded yearly at a fixed interest rate, r. After n years, the principal (P) is worth
A_n = P(1 + r)^n
The quantity 1 + r sometimes confuses students. It appears because when you add the principal (P) and the interest (Pr), you get P(1 + r).

In precalculus, the interest rate is assumed to be a constant, which is fine for fixed-rate investments like bank CDs. However, many investments have a growth rate that is not fixed but varies from year to year. If the growth rate of the investment is r1 during the first year, r2 during the second year, and rn during the nth year, then after n years the investment is worth
A_n = P(1 + r1)(1 + r2)...(1 + rn)
  = P (Π xi), where xi = 1 + ri.

What is the average growth rate for the investment? One interpretation of the average growth rate is the fixed rate that would give the same return after n years. That hypothetical fixed-rate growth is found by using the geometric mean of the values x1, x2, ..., xn. That is, if GM is the geometric mean of the xi, the value
A_n = P*GM^n,
which assumes a fixed interest rate, is exactly the same as for the varying-rate computation.
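
A quick numerical sketch (with made-up growth rates) verifies that the geometric mean of the growth factors yields the same final value as the varying-rate computation:

proc iml;
r = {0.10, -0.05, 0.03};      /* hypothetical yearly growth rates */
x = 1 + r;                    /* yearly growth factors */
GM = exp( mean(log(x)) );     /* geometric mean of the factors */
P = 1000;                     /* initial investment */
A_vary  = P * prod(x);        /* value after varying-rate growth */
A_fixed = P * GM##nrow(r);    /* value at the equivalent fixed rate */
print GM A_vary A_fixed;      /* A_vary equals A_fixed */
quit;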

An example of the geometric mean: The growth rate of gold

Let's apply the ideas in the preceding section. Suppose that you bought $1000 of gold on Jan 1, 2010. The following table gives the yearly rate of return for gold during the years 2010–2018, along with the value of the $1000 investment at the end of each year.

According to the table, the value of the investment after 9 years is $1160.91, which represents a total return of about 16%. What is the fixed-rate that would give the same return after 9 years when compounded annually? That is found by computing the geometric mean of the numbers in the third column:
GM = (1.2774 * 1.1165 * ... * 0.9885)^(1/9) = 1.01672
In other words, the investment in gold yielded the same return as a fixed-rate bank CD at 1.672% that is compounded yearly for 9 years. The end-of-year values for both investments are shown in the following graph.

The geometric mean in statistics

The geometric mean arises naturally in situations in which quantities are multiplied together. This happens so often that there is a probability distribution, called the lognormal distribution, that models this situation. If Z has a normal distribution, then you can obtain a lognormal distribution by applying the exponential transformation: X = exp(Z) is lognormal. The Wikipedia article for the lognormal distribution states, "The lognormal distribution is important in the description of natural phenomena... because many natural growth processes are driven by the accumulation of many small percentage changes." For lognormal data, the geometric mean is often more useful than the arithmetic mean. In my next blog post, I will show how to compute the geometric mean and other associated statistics in SAS.

The post What is a geometric mean? appeared first on The DO Loop.

August 12, 2019
 
What is this math good for, anyway?
     –Every student, everywhere

I am a professional applied mathematician, yet many of the mathematical and statistical techniques that I use every day are not from advanced university courses but are based on simple ideas taught in high school or even in grade school. I've written many blog posts in which the solution to an interesting problem requires little more than high-school math. Even when the solution requires advanced techniques, high-school math often provides the basis for solving the problem.

In celebration of the upcoming school year, here are 12 articles that show connections between advanced topics and mathematical ideas that are introduced in high-school or earlier. If you have a school-age child, read some of the articles to prepare yourself for the inevitable mid-year whine, "But, Mom/Dad, when will I ever use this stuff?!"

Grade-school math

Obviously, most adults use basic arithmetic, fractions, decimals, and percents, but here are some less obvious uses of grade-school math:

High-school math

Algebra, linear transformations, geometry, and trigonometry are the main topics in high school mathematics. These topics are the bread-and-butter of applied mathematics:

  • Linear transformations: Anytime you create a graph on the computer screen, a linear transformation transforms the data from physical units (weight, cost, time) into pixel values. Although modern graphical software performs that linear transformation for you, a situation in which you have to manually apply a linear transformation is when you want to display data in two units (pounds and kilograms, dollars and Euros, etc). You can use a simple linear transformation to align the tick labels on the two axes.
  • Intersections: In high school, you learn to compute the intersection between two lines. You can extend the problem to find the intersection of two line segments. I needed that result to solve a problem in probability theory.
  • Solve a system of equations: In Algebra II, you learn to solve a system of equations. Solving linear systems is the foundation of linear algebra. Solving nonlinear systems is among the most important skills in applied mathematics.
  • Find the roots of a nonlinear equation: Numerically finding the roots of a function is taught in pre-calculus. It is the basis for all "inverse problems" in which you want to find inputs to a function that produce a specified output. In statistics, a common "inverse problem" is finding the quantile of a cumulative probability distribution.
  • Binomial coefficients: In algebra, many teachers use the mnemonic FOIL (First, Outer, Inner, Last) to teach students how to compute the quadratic expansion of (a + b)^2. Later, students learn the binomial expansion of an arbitrary power, (a + b)^n. The coefficients in this expansion are called the binomial coefficients and appear in Pascal's triangle as well as in many discrete probability distributions such as the negative binomial and hypergeometric distributions.
  • Pythagorean triples: In trigonometry, a huge number of homework problems involve using right triangles with side lengths that are proportional to (3, 4, 5) and (5, 12, 13). These are two examples of Pythagorean triples: right triangles whose side lengths are all integers. It turns out that you can use linear transformations to generate all primitive triples from the single triple (3, 4, 5). A high school student can understand this process, although the process is most naturally expressed in terms of matrix multiplication, which is not always taught in high school.
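
For instance, the following SAS/IML sketch applies the three matrices that are classically used to generate the tree of primitive Pythagorean triples (the Berggren matrices) to the triple (3, 4, 5):

proc iml;
/* matrices that generate the tree of primitive Pythagorean triples */
U1 = { 1 -2 2,   2 -1 2,   2 -2 3};
U2 = { 1  2 2,   2  1 2,   2  2 3};
U3 = {-1  2 2,  -2  1 2,  -2  2 3};
v = {3, 4, 5};
T1 = U1*v;  T2 = U2*v;  T3 = U3*v;
print T1 T2 T3;   /* (5,12,13), (21,20,29), (15,8,17) */
quit;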

High-school statistics

Many high schools offer a unit on probability and statistics, and some students take AP Statistics.

Einstein famously said, "everything should be made as simple as possible, but not simpler." It is surprising to me how often an advanced technique can be simplified and explained by using elementary math. I don't claim that "everything I needed to know about math I learned in kindergarten," but I often return to elementary techniques when I describe how to solve non-elementary problems.

What about you? What are some elementary math or statistics concepts that you use regularly in your professional life? Are there fundamental topics that you learned in high school that are deeper and more useful than you realized at the time? Leave a comment.

The post The math you learned in school: Yes, it’s useful! appeared first on The DO Loop.

May 22, 2019
 

The eigenvalues of a matrix are not easy to compute. It is remarkable, therefore, that with relatively simple mental arithmetic, you can obtain bounds for the eigenvalues of a matrix of any size. The bounds are provided by using a marvelous mathematical result known as Gershgorin's Disc Theorem. For certain matrices, you can use Gershgorin's theorem to quickly determine that the matrix is nonsingular or positive definite.

The Gershgorin Disc Theorem appears in Golub and van Loan (p. 357, 4th Ed; p. 320, 3rd Ed), where it is called the Gershgorin Circle Theorem. The theorem states that the eigenvalues of any N x N matrix, A, are contained in the union of N discs in the complex plane. The center of the ith disc is the ith diagonal element of A. The radius of the ith disc is the sum of the absolute values of the off-diagonal elements in the ith row. In symbols,
D_i = {z ∈ C | |z - A_ii| ≤ r_i }
where r_i = Σ_{j ≠ i} |A_ij|. Although the theorem holds for matrices with complex values, this article only uses real-valued matrices.

An example of Gershgorin discs is shown to the right. The discs are shown for the following 4 x 4 symmetric matrix:

At first glance, it seems inconceivable that we can know anything about the eigenvalues without actually computing them. However, two mathematical theorems tell us quite a lot about the eigenvalues of this matrix, just by inspection. First, because the matrix is real and symmetric, the Spectral Theorem tells us that all eigenvalues are real. Second, the Gershgorin Disc Theorem says that the four eigenvalues are contained in the union of the following discs:

  • The first row produces a disc centered at x = 200. The disc has radius |30| + |-15| + |5| = 50.
  • The second row produces a disc centered at x = 100 with radius |30| + |5| + |5| = 40.
  • The third row produces a disc centered at x = 55 with radius |-15| + |5| + |0| = 20.
  • The last row produces a disc centered at x = 15 with radius |5| + |5| + |0| = 10.

Although the eigenvalues for this matrix are real, the Gershgorin discs are in the complex plane. The discs are visualized in the graph at the top of this article. The true eigenvalues of the matrix are shown inside the discs.

For this example, each disc contains an eigenvalue, but that is not true in general. (For example, the matrix A = {1 −1, 2 −1} does not have any eigenvalues in the disc centered at x=1.) What is true, however, is that disjoint unions of discs must contain as many eigenvalues as the number of discs in each disjoint region. For this matrix, the discs centered at x=15 and x=200 are disjoint. Therefore they each contain an eigenvalue. The union of the other two discs must contain two eigenvalues, but, in general, the eigenvalues can be anywhere in the union of the discs.

The visualization shows that the eigenvalues for this matrix are all positive. That means that the matrix is not only symmetric but also positive definite. You can predict that fact from the Gershgorin discs because no disc intersects the negative X axis.

Of course, you don't have to perform the disc calculations in your head. You can write a program that computes the centers and radii of the Gershgorin discs, as shown by the following SAS/IML program, which also computes the eigenvalues for the matrix:

proc iml;
A = { 200  30 -15  5,
       30 100   5  5,
      -15   5  55  0, 
        5   5   0 15};
 
evals = eigval(A);                 /* compute the eigenvalues */
center = vecdiag(A);               /* centers = diagonal elements */
radius = abs(A)[,+] - abs(center); /* sum of abs values of off-diagonal elements of each row */
discs = center || radius || round(evals,0.01);
print discs[c={"Center" "Radius" "Eigenvalue"} L="Gershgorin Discs"];

Diagonally dominant matrices

For this example, the matrix is strictly diagonally dominant. A strictly diagonally dominant matrix is one for which the magnitude of each diagonal element exceeds the sum of the magnitudes of the other elements in the row. In symbols, |A_ii| > Σ_{j ≠ i} |A_ij| for each i. Geometrically, this means that no Gershgorin disc intersects the origin, which implies that the matrix is nonsingular. So, by inspection, you can determine that this matrix is nonsingular.
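
You can also check diagonal dominance programmatically. The following statements are a sketch that assumes the matrix A from the earlier PROC IML session is still defined:

/* check strict diagonal dominance: |A[i,i]| > sum of off-diagonal magnitudes */
r = abs(A)[,+] - abs(vecdiag(A));       /* off-diagonal row sums */
isDiagDom = all( abs(vecdiag(A)) > r );
print isDiagDom;                        /* 1 for this matrix */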

Gershgorin discs for correlation matrices

The Gershgorin theorem is most useful when the diagonal elements are distinct. For repeated diagonal elements, it might not tell you much about the location of the eigenvalues. For example, all diagonal elements for a correlation matrix are 1. Consequently, all Gershgorin discs are centered at (1, 0) in the complex plane. The following graph shows the Gershgorin discs and the eigenvalues for a 10 x 10 correlation matrix. The eigenvalues of any 10 x 10 correlation matrix must be real and in the interval [0, 10], so the only new information from the Gershgorin discs is a smaller upper bound on the maximum eigenvalue.

Gershgorin discs for unsymmetric matrices

Gershgorin's theorem can be useful for unsymmetric matrices, which can have complex eigenvalues. The SAS/IML documentation contains the following 8 x 8 block-diagonal matrix, which has two pairs of complex eigenvalues:

A = {-1  2  0       0       0       0       0  0,
     -2 -1  0       0       0       0       0  0,
      0  0  0.2379  0.5145  0.1201  0.1275  0  0,
      0  0  0.1943  0.4954  0.1230  0.1873  0  0,
      0  0  0.1827  0.4955  0.1350  0.1868  0  0,
      0  0  0.1084  0.4218  0.1045  0.3653  0  0,
      0  0  0       0       0       0       2  2,
      0  0  0       0       0       0      -2  0 };

The matrix has four smaller Gershgorin discs and three larger discs (radius 2) that are centered at (-1,0), (2,0), and (0,0), respectively. The discs and the actual eigenvalues of this matrix are shown in the following graph. Not only does the Gershgorin theorem bound the magnitude of the real part of the eigenvalues, but it is clear that the imaginary part cannot exceed 2. In fact, this matrix has eigenvalues -1 ± 2 i, which are on the boundary of one of the discs, which shows that the Gershgorin bound is tight.

Conclusions

In summary, the Gershgorin Disc Theorem provides a way to visualize the possible location of eigenvalues in the complex plane. You can use the theorem to provide bounds for the largest and smallest eigenvalues.

I was never taught this theorem in school. I learned it from a talented mathematical friend at SAS. I use this theorem to create examples of matrices that have particular properties, which can be very useful for developing and testing software.

This theorem also helped me to understand the geometry behind "ridging", which is a statistical technique in which positive multiples of the identity are added to a nearly singular X`X matrix. The Gershgorin Disc Theorem shows that the effect of ridging a matrix is to translate all of the Gershgorin discs to the right, which moves the eigenvalues away from zero while preserving their relative positions.
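
The following sketch illustrates that geometry by ridging the 4 x 4 example matrix. It assumes that A, center, and radius are still defined in the earlier PROC IML session:

/* ridging: add a positive multiple of the identity to A */
lambda = 10;
B = A + lambda*I(nrow(A));
centerB = vecdiag(B);                   /* centers shift right by lambda */
radiusB = abs(B)[,+] - abs(centerB);    /* radii are unchanged */
print (center || centerB || radius || radiusB)
      [c={"center(A)" "center(B)" "radius(A)" "radius(B)"}];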

You can download the SAS program that I used to create the images in this article.

Further reading

There are several papers on the internet about Gershgorin discs. It is a favorite topic for advanced undergraduate projects in mathematics.

The post Gershgorin discs and the location of eigenvalues appeared first on The DO Loop.

September 24, 2018
 

Last week I compared the overhand shuffle to the riffle shuffle. I used random operations to simulate both kinds of shuffles and then compared how well they mix cards. The article prompted my colleague and fellow blogger, Rob Pratt, to ask if I was familiar with a bit of shuffling trivia: if you perform perfect riffle shuffles on a standard 52-card deck, the cards return to their original order after exactly eight shuffles! This mathematical curiosity is illustrated in a beautiful animation by James Miles, which shows the results of eight perfect shuffles.

After being reminded of this interesting fact, I wondered how the result generalizes to decks of various sizes. That is, if you use a deck with N cards, what is the minimum number of perfect riffle shuffles (call it P(N)) that you need to restore the deck to its original order? I decided to write a SAS program to discover the answer. The result is summarized in the following graph, which plots P(N) versus the number of cards in a deck. All points are on or below the identity line, which implies that at most N – 1 shuffles are required for a deck that contains N cards. If you want to learn more about the graph and its interesting patterns, read on.

Number of perfect shuffles for deck of N cards to restore the original order

Coding the perfect riffle shuffle in SAS

The perfect riffle shuffle is easier to program than the Gilbert-Shannon-Reeds (GSR) model, which uses randomness to model how real people shuffle real cards. In a perfect riffle shuffle, you cut the deck into two equal stacks. Then you interlace the cards from the top stack with the cards from the bottom stack. The new deck ordering is TBTBTB..., where T indicates a card from the top half of the original deck and B indicates a card from the bottom half.

There are actually two perfect riffle shuffles for an even number of cards: you can interlace the cards TBTBTB..., which is called an "out shuffle," or you can interlace the cards BTBTBT..., which is called an "in shuffle." For odd-numbered decks, there is also a choice of where to cut the deck: does the top part of the deck get the extra card or does the bottom part? My program always uses the "out shuffle" convention (the top card stays on top) and gives the top stack the extra card when there is an odd number of cards. Therefore, the shuffled deck looks like TBTB...TB for even-numbered decks and TBTB...TBT for odd-numbered decks. The following SAS/IML function takes a row vector that represents a card deck and performs a perfect riffle shuffle to reorder the cards in the deck:

proc iml;
start PerfectRiffle(deck);
   N = ncol(deck);                        /* send in deck as a row vector */
   shuffle = j(1,N,.);                    /* allocate new deck */
   nT = ceil(N/2);                        /* cut into two stacks; "Top" gets extra card when N odd */
   nB = N - nT;                           /* nT = size of Top; nB = size of Bottom */
   shuffle[, do(1,N,2)] = deck[,1:nT];   /* top cards go into odd positions */
   shuffle[, do(2,N,2)] = deck[,nT+1:N]; /* bottom cards go into even positions */
   return(shuffle);
finish;
 
/* test the algorithm on a deck that has N = 10 cards */
origDeck = 1:10;
deck = PerfectRiffle(origDeck);
print (origdeck//deck)[r={"Orig Deck" "Perfect Riffle"}];
Perfect shuffle on deck of N=10 cards

The output shows one perfect riffle of a deck that has 10 cards that are originally in the order 1, 2, ..., 10. The shuffled order is 1, 6, 2, 7, 3, 8, 4, 9, 5, 10: cards from the top stack occupy the odd positions and cards from the bottom stack occupy the even positions.

How many perfect riffle shuffles to restore order?

If you call the PerfectRiffle function repeatedly on the same deck, you can simulate repeated perfect riffle shuffles. After each shuffle, you can test whether the deck is in its original order. If so, you can stop shuffling. If not, you can shuffle again. The following SAS/IML loop tests decks that contain up to 1,000 cards. The vector nUntilRestore keeps track of how many perfect riffle shuffles were required for each deck size.

decksize = 1:1000;                       /* N = 1, 2, 3, ..., 1000 */
nUntilRestore = j(1, ncol(decksize), .); /* the results vector */
do k = 2 to ncol(decksize);              /* for the deck of size N */
   origDeck = 1:decksize[k];             /* original order is 1:N */
   deck = OrigDeck;                      /* set order of deck */
   origOrder = 0;                        /* flag variable */
   do nShuffles = 1 to decksize[k] until (origOrder); /* repeat shuffling */
      deck = PerfectRiffle( deck );      /* shuffle deck */
      origOrder = all(deck = origDeck);  /* until the order of the cards is restored */
   end;
   nUntilRestore[k] = nShuffles;         /* store the number of shuffles */
end;
 
/* print the results for N=1,..,99 */
x = j(10, 10, .);
x[2:100] = nUntilRestore[1:99];          /* present in 10x10 array */
rr = putn( do(0, 90, 10), "Z2.");        /* each row has 10 elements */
cc = ("0":"9");
print x[r=rr c=cc L="Number of Perfect Shuffles to Restore Order"];
 
title "Number of Perfect Shuffles to Restore Deck Order";
call scatter(deckSize, nUntilRestore) grid={x y} other="lineparm x=0 y=0 slope=1;"
     label={"N: Number of Cards in Deck" "P(N): Number of Perfect Shuffles to Restore Order"}
     procopt="noautolegend";
Number of perfect shuffles required to restore order in a deck of N cards, N = 1, ..., 99

The table shows the number of perfect riffle shuffles required for decks that have at most 99 cards. The results are in a 10 x 10 grid. The first row shows decks that have fewer than 10 cards, the second row shows deck sizes 10-19, and so on. For example, to find the result for a 52-card deck, move down to the row labeled "50" and over to the column labeled "2". The result in that cell is 8, which is the number of perfect shuffles required to restore a 52-card deck to its original order.

Remarks on the number of perfect riffle shuffles

The graph at the top of this article shows the number of shuffles for decks that contain up to 1,000 cards. To me, the most surprising feature of the graph is the number of points that fall near diagonal lines of various rational slopes. The most apparent diagonal features are the lines whose slopes are approximately equal to 1, 1/2, and 1/3.

A complete mathematical description of this problem is beyond the scope of this blog article, but here are a few hints and links to whet your appetite.

  • The integer sequence P(N) is related to the "multiplicative order of 2 mod 2n+1" in the On-Line Encyclopedia of Integer Sequences (OEIS). The encyclopedia includes a graph that is similar to the one here, but is computed for N ≤ 10,000. That integer sequence applies to the number of perfect riffle shuffles of an EVEN number of cards.
  • The link to the OEIS shows a quick way to explicitly find P(N) for even values of N. P(N) is the smallest value of k for which 2^k is congruent to 1 (mod N–1). For example, 8 is the smallest number for which 2^8 is congruent to 1 (mod 51), as you can compute explicitly: the value of mod(2**8, 51) is 1. A short sketch after this list shows one way to compute this order.
  • The points in the graph for which P(N) = N-1 are all prime numbers: 2, 3, 5, 11, 13, 19, 29, 37, 53, 59, 61, 67, .... However, notice that not all prime numbers are on this list.
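
Here is a minimal sketch (my addition; MultOrder2 is a hypothetical helper, not from the original program) that computes the multiplicative order by repeated modular doubling, which avoids forming large powers of 2:

/* sketch: P(N) for even N is the multiplicative order of 2 (mod N-1) */
start MultOrder2(N);                 /* assumes N is even and N >= 4 */
   r = 2;  k = 1;
   do while( r ^= 1 );
      r = mod(2*r, N-1);             /* modular doubling */
      k = k + 1;
   end;
   return(k);
finish;
ord = MultOrder2(52);                /* = 8 for a standard 52-card deck */
print ord;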

There is much more to be said about this topic, but I'll stop here. If you have something to add, please leave a comment.

The post How many perfect riffle shuffles are required to restore a deck to its initial order? appeared first on The DO Loop.