Just for Fun

3月 132019

It's time to celebrate Pi Day! Every year on March 14th (written 3/14 in the US), math-loving folks celebrate "all things pi-related" because 3.14 is the three-decimal approximation to the mathematical constant, π. Although children learn that pi is approximately 3.14159..., the actual definition of π is the ratio of a circle's circumference to its diameter. Equivalently, it is distance around half of the unit circle. (The unit circle has a unit radius, so its diameter is 2.) The value for pi, therefore, depends on the definition of a circle.

But we all know what a circle looks like, don't we? How can there be more than one circle?

Generalizing the circle

A circle is defined as the locus of points in the plane that are a given distance from a given point. This definition depends on the definition of a "distance," and it turns out that there are infinitely many ways to measure the distance between two points in the plane. The Euclidean distance between two points is the most familiar distances, but there are other definitions. For two points a = (x1, y1) and b = (x2, y2), you can define the "Lp distance" between a and b by the formula
Dp = ( |x1 – x2|p + |y1 – y2|p )1/p
This definition defines a distance metric for every value of p ≥ 1. If you set p=2 in the formula, you get the usual L2 (Euclidean) distance. If you set p=1, you get the L1 metric, which is known as the "taxicab" or "city block" distance.

You might think that the Euclidean distance is the only relevant distance, but it turns out that some of these other distances have practical applications in statistics, machine learning, linear algebra, and many fields of applied mathematics. For example, the 2-norm (L2) distance is used in least-squares regression whereas the 1-norm (L1) distance is used in robust regression and quantile regression. A combination of the two distances is used for ridge regression, LASSO regression, and "elastic net" regression.

Here's the connection to pi: If you can define infinitely many distance formulas, then there are infinitely many unit circles, one for each value of p ≥ 1. And if there are infinitely many circles, there might be infinitely many values of pi. (Spoiler alert: There are!)

Would the real circle please stand up?

You can easily solve for y as a function of x and draw the unit circle for a representative set of values for p. The following graph was generated by the SAS step and PROC SGPLOT. You can download the SAS program that generates the graphs in this article.

The L1 unit circle is a diamond (the top half is shown), the L2 unit circle is the familiar round shape, and as p gets large the unit circle for the Lp distance approaches the boundary of the square defined by the four points (±1, ±1). For more information about Lp circles and metrics, see the Wikipedia article "Lp Space: The p-norm in finite dimensions."

Here comes the surprise: Just as each Lp metric has its own unit circle, each metric has its own numerical value for pi, which is the length of the unit semicircle as measured by that metric.

π(p): The length of the unit semicircle for the Lp distance metric

So far, we've only used geometry, but it's time to use a little calculus. This presentation is based on Keller and Vakil (2009, p. 931-935), who give more details about the formulas in this section.

For a curve that is represented as a graph (y as a function of x), you can obtain the length of the curve by integrating the arclength. In Calculus 2, the arclength formula is derived for Euclidean distance, but it is straightforward to give the formula for the Lp distance:
s(p) = ∫ (1 + |dy/dx|p)1/p dx

To obtain a value for pi in the Lp metric, you can integrate the arclength for the upper half of the Lp unit circle. Equivalently, by symmetry, you can integrate one-eighth of the unit circle and multiply by 4. A convenient choice for the limits of integration is [0, 2-1/p] because 2-1/p is the x value where the 45-degree line intersects the unit circle for the Lp metric.

Substituting for the derivative gives the following formula (Keller and Vakil, 2009, p. 932):
π(p) = 4 ∫ (1 + u(x))1/p dx, where u(x) = |x-p - 1|1-p and the interval of integration is [0, 2-1/p].

A pi for each Lp metric

For each value of p, you get a different value for pi. You can use your favorite numerical integration routine to approximate π(p) by integrating the formula for various values of p ≥ 1. I used SAS/IML, which supports the QUAD function for numerical integration. The arclength computation for a variety of values for p is summarized by the following graph. The graph shows the computation of π(p), which is the length of the semicircle in the Lp metric, versus values of p for p in [1, 11].

The graph shows that the L1 value for pi is 4. The value decreases rapidly as p approaches 2 and reaches a minimum value when p=2 and the value of pi is 3.14159.... For p > 2, the graph of π(p) increases slowly. You can show that π(p) asymptotically approaches the value 4 as p approaches infinity.

On Pi Day, some places have contests to see who can recite the most digits of pi. I encourage you to enter the contest and say "Pi, in the L1 metric, is FOUR point zero, zero, zero, zero, ...." If they refuse to give you the prize, tell them to read this article! 😉

Reflections on pi

One the one hand, this article shows that there is nothing special about the value 3.14159.... For an Lp metric, the ratio of the circumference of a circle to its diameter can be any value between π and 4. On the other hand, the graph shows that π is the unique minimizer of the graph. Among an infinitude of circles and metrics, the well-known Euclidean distance is the only Lp metric for which pi is 3.14159....

If you ask me, our value of π is special, without a doubt!


Download the SAS program that creates the graphs in this article.

The post The value of pi depends on how you measure distance appeared first on The DO Loop.

12月 102018
The best way to spread Christmas cheer
is singing loud for all to hear!
-Buddy in Elf

In the Christmas movie Elf (2003), Jovie (played by Zooey Deschanel) must "spread Christmas cheer" to help Santa. She chooses to sing "Santa Claus is coming to town," and soon all of New York City is singing along.

The best sing-along songs are short and have lyrics that repeat. Jovie's choice, "Santa Claus is coming to town," satisfies both criteria. The musical structure of the song is simple:

  • Verse 1: You better watch out / You better not cry / Better not pout / I'm telling you why
  • Tag line: Santa Claus is coming to town
  • Verse 2: He's making a list / And checking it twice; / Gonna find out / Who's naughty and nice
  • Tag line repeats
  • Bridge: He sees you when you're sleeping / He knows when you're awake / He knows if you've been bad or good / So be good for goodness sake! / O!
  • Verse 1 repeats
  • Partial tags and final tag: Santa Claus is coming / Santa Claus is coming / Santa Claus is coming to town

There is a fun way to visualize repetition in song lyrics. For a song that has N words, you can define the repetition matrix to be the N x N matrix where the (i,j)th cell has the value 1 if the i_th word is the same as the j_th word. Otherwise, the (i,j)th cell equals 0. You can visualize the matrix by using a two-color heat map. Colin Morris has a web site devoted to these visualizations.

The following image visualizes the lyrics of "Santa Claus is coming to town." I have added some vertical and horizontal lines to divide the lyrics into seven sections: the verses (V1 and V2), the tag line (S), and the bridge (B).

The image shows the structure of the repetition in the song lyrics:

  • The first verse contains the repetition of the words 'you', 'better', and 'not'.
  • The second verse repeats only the word 'out' from Verse 1.
  • The bridge repeats the word 'you', which appeared three times in Verse 1. It also repeats several words ('when', 'knows', 'good', ...) within the bridge.
  • The tag line "Santa Claus is coming [to town]" is repeated a total of five times.

Now that you understand what a repetition matrix looks like and how to interpret it, let's visualize a few other classic Christmas songs that contain repetitive lyrics! To help "spread Christmas cheer," I'll use shades of red and green to visualize the lyrics, rather than the boring white and black colors.

The Twelve Days of Christmas

If you make a list of Christmas songs that have repetition, chances are "The Twelve Days of Christmas" will be at the top of the list. The song is formulaic: each new verse adds a few new words before repeating the words from the previous verse. As a result, the repetition matrix is almost boring in its regularity. Here is the visualization of the classic song (click to enlarge):

Little Drummer Boy

Another highly repetitive Christmas song is "The Little Drummer Boy," which features an onomatopoeic phrase (Pa rum pum pum pum) that alternates with the other lyrics. A visualization of the classic song is shown below:

Silver Bells

In addition to repeating the title, "Silver Bells" repeats several phrases. Most notably, the phrase "Soon it will be Christmas Day" is repeated multiple times at the end of the song. Because only certain phrases are repeated, the visualization has a pleasing structure that complements the song's lyrical qualities:

Silent Night

To contrast the hustle, bustle, and commercialism of Christmas, I enjoy hearing songs that are musically simple. One of my favorites is "Silent Night." Each verse is distinct, yet each begins with "Silent night, holy night!" and ends by repeating a phrase. The resulting visualization is devoid of clutter. It is visually empty and matches the lyrical imagery, "all is calm, all is bright."

Your turn!

You can download the SAS program that creates these images. The program also computes visualizations of some contemporary songs such as "Last Christmas" by Wham!, "Someday at Christmas" (Stevie Wonder version), "Rockin' Around the Christmas Tree" (Brenda Lee version), and "Happy XMas (War Is Over)" by John Lennon and Yoko Ono. If you have access to SAS, you can even add your own favorite lyrics to the program! If you don't have access to SAS, Colin Morris's website enables you to paste in the lyrics and see the visualization.

In a little-known "deleted scene" from Elf, Buddy says that the second-best way to spread Christmas cheer is posting images for all to share! So post a comment and share your favorite visualization of a Christmas song!

Happy holidays to all my readers. I am grateful for you. Merry Christmas to all, and to all a good night!

The post Visualize Christmas songs appeared first on The DO Loop.

9月 242018

Last week I compared the overhand shuffle to the riffle shuffle. I used random operations to simulate both kinds of shuffles and then compared how well they mix cards. The article caused one my colleague and fellow blogger, Rob Pratt, to ask if I was familiar with a bit of shuffling trivia: if you perform a perfect riffle shuffle, the cards return to their original order after exactly eight perfect shuffles! This mathematical curiosity is illustrated in a beautiful animation by James Miles, which shows the results of eight perfect shuffles.

After being reminded of this interesting fact, I wondered how the result generalizes to decks of various sizes. That is, if you use a deck with N cards, what is the minimum number of perfect riffle shuffles (call it P(N)) that you need to restore the deck to its original order? I decided to run a SAS program to discover the answer. The result is summarized in the following graph, which plots P(N) versus the number of cards in a deck. All points are below the identity line, which implies that at most N – 1 shuffles are required for a deck that contains N cards. If you want to learn more about the graph and its interesting patterns, read on.

Number of perfect shuffles for deck of N cards to restore the original order

Coding the perfect riffle shuffle in SAS

The perfect riffle shuffle is easier to program than the Gilbert-Shannon-Reeds (GSR) model, which uses randomness to model how real people shuffle real cards. In a perfect riffle shuffle, you cut the deck exactly two equal parts. Then you interlace the cards from the top stack with the cards from the bottom stack. The new deck ordering is TBTBTB..., where T indicates a card from the top half of the original deck and B indicates a card from the bottom half.

There are actually two perfect riffle shuffles for an even number of cards: you can interlace the cards TBTBTB..., which is called an "out shuffle," or you can interlace the cards BTBTBT..., which is called an "in shuffle." For odd-numbered decks, there is also a choice of where to cut the deck: does the top part of the deck get the extra card or does the bottom part? My program always uses the "out shuffle" convention (the top card stays on top) and gives the top stack the extra card when there is an odd number of cards. Therefore, the shuffled deck looks like TBTB...TB for even-numbered decks and TBTB...TBT for odd-numbered decks. The following SAS/IML function takes a row vector that represents a card deck and performs a perfect riffle shuffle to reorder the cards in the deck:

proc iml;
start PerfectRiffle(deck);
   N = ncol(deck);                        /* send in deck as a row vector */
   shuffle = j(1,N,.);                    /* allocate new deck */
   nT = ceil(N/2);                        /* cut into two stacks; "Top" gets extra card when N odd */
   nB = N - nT;                           /* nT = size of Top; nB = size of Bottom */
   shuffle[, do(1,N,2)] = deck[,1:nT];   /* top cards go into odd positions */
   shuffle[, do(2,N,2)] = deck[,nT+1:N]; /* bottom cards go into even positions */
/* test the algorithm on a deck that has N = 10 cards */
origDeck = 1:10;
deck = PerfectRiffle(origDeck);
print (origdeck//deck)[r={"Orig Deck" "Perfect Riffle"}];
Perfect shuffle on deck of N=10 cards

The output shows one perfect riffle of a deck that has 10 cards that are originally in the order 1, 2, ..., 10.

How many perfect riffle shuffles to restore order?

If you call the PerfectSuffle function repeatedly on the same deck, you can simulate perfect riffle shuffles. After each shuffle, you can test whether the order of the deck is in the original order. If so, you can stop shuffling. If not, you can shuffle again. The following SAS/IML loops test decks that contain up to 1,000 cards. The array nUntilRestore keeps track of how many perfect riffle shuffles were done for each deck size.

decksize = 1:1000;                       /* N = 1, 2, 3, ..., 1000 */
nUntilRestore = j(1, ncol(decksize), .); /* the results vector */
do k = 2 to ncol(decksize);              /* for the deck of size N */
   origDeck = 1:decksize[k];             /* original order is 1:N */
   deck = OrigDeck;                      /* set order of deck */
   origOrder = 0;                        /* flag variable */
   do nShuffles = 1 to decksize[k] until (origOrder); /* repeat shuffling */
      deck = PerfectRiffle( deck );      /* shuffle deck */
      origOrder = all(deck = origDeck);  /* until the order of the cards are restored */
   nUntilRestore[k] = nShuffles;         /* store the number of shuffles */
/* print the results for N=1,..,99 */
x = j(10, 10, .);
x[2:100] = nUntilRestore[1:99];          /* present in 10x10 array */
rr = putn( do(0, 90, 10), "Z2.");        /* each row has 10 elements */
cc = ("0":"9");
print x[r=rr c=cc L="Number of Perfect Shuffles to Restore Order"];
title "Number of Perfect Shuffles to Restore Deck Order";
call scatter(deckSize, nUntilRestore) grid={x y} other="lineparm x=0 y=0 slope=1;"
     label={"N: Number of Cards in Deck" "P(N): Number of Perfect Shuffles to Restore Order"}
Number of perfect shuffles required to restore order in a deck of N cards, N-1..99

The table shows the number of perfect riffle shuffles required for decks that have fewer than 99 cards. The results are in a 10x10 grid. The first row shows decks that have less than 10 cards, the second row shows sizes 10-19, and so on. For example, to find the result for a 52-card deck, move down to the row labeled "50" and over to the column labeled "2". The result in that cell is 8, which is the number of perfect shuffles required to restore a 52-card deck to its original order.

Remarks on the number of perfect riffle shuffles

The graph at the top of this article shows the number of shuffles for decks that contain up to 1,000 cards. To me, the most surprising feature of the graph is the number of points that fall near diagonal lines of various rational slopes. The most apparent diagonal features are the lines whose slopes are approximately equal to 1, 1/2, and 1/3.

A complete mathematical description of this problem is beyond the scope of this blog article, but here are a few hints and links to whet your appetite.

  • The integer sequence P(N) is related to the "multiplicative order of 2 mod 2n+1" in the On-Line Encyclopedia of Integer Sequences (OEIS). The encyclopedia includes a graph that is similar to the one here, but is computed for N ≤ 10,000. That integer sequence applies to the number of perfect riffle shuffles of an EVEN number of cards.
  • The link to the OEIS shows a quick way to explicitly find P(N) for even values of N. P(N) is the smallest value of k for which 2k is congruent to 1 (mod N–1). For example, 8 is the smallest number for which 28 is congruent to 1 (mod 51) as you can compute explicitly: the value of mod(2**8, 51) is 1.
  • The points in the graph for which P(N) = N-1 are all prime numbers: 2, 3, 5, 11, 13, 19, 29, 37, 53, 59, 61, 67, .... However, notice that not all prime numbers are on this list.

There is much more to be said about this topic, but I'll stop here. If you have something to add, please leave a comment.

The post How many perfect riffle shuffles are required to restore a deck to its initial order? appeared first on The DO Loop.

7月 162018

My colleague Robert Allison recently blogged about using the diameter of Texas as a unit of measurement. The largest distance across Texas is about 801 miles, so Robert wanted to find the set of all points such that the distance from the point to Texas is less than or equal to 801 miles.

Robert's implementation was complicated by the fact that he was interested in points on the round earth that are within 801 miles from Texas as measured along a geodesic. However, the idea of "thickening" or "inflating" a polygonal shape is related to a concept in computational geometry called the offset polygon or the inflated polygon. A general algorithm to inflate a polygon is complicated, but this article demonstrates the basic ideas that are involved. This article discusses offset regions for convex and nonconvex polygons in the plane. The article concludes by drawing a planar region for a Texas-shaped polygon that has been inflated by the diameter of the polygon. And, of course, I supply the SAS programs for all computations and images.

Offset regions as a union of circles and rectangles

Assume that a simple polygon is defined by listing its vertices in counterclockwise order. (Recall that a simple polygon is a closed, nonintersecting, shape that has no holes.) You can define the offset region of radius r as the union of the following shapes:

  • The polygon itself
  • Circles of radius r centered at each vertex
  • Rectangles of width r positions outside the polygon along each edge

The following graphic shows the offset region (r = 0.5) for a convex "house-shaped" polygon. The left side of the image shows the polygon with an overlay of circles centered at each vertex and outward-pointing rectangles along each edge. The right side of the graphic shows the union of the offset regions (blue) and the original polygon (red):

The image on the right shows why the process is sometimes called an "inflating" a polygon. For a convex polygon, the edges are pushed out by a distance r and the vertices become rounded. For large values of r, the offset region becomes a nearly circular blob, although the boundary is always the union of line segments and arcs of circles.

You can draw a similar image for a nonconvex polygon. The inflated region near a convex (left turning) vertex looks the same as before. However, for the nonconvex (right turning) vertices, the circles do not contribute to the offset region. Computing the offset region for a nonconvex polygon is tricky because if the distance r is greater than the minimum distance between vertices, nonlocal effects can occur. For example, the following graphic shows a nonconvex polygon that has two "prongs." Let r0 be the distance between the prongs. When you inflate the polygon by an amount r > r0/2, the offset region can contain a hole, as shown. Furthermore, the boundary of the offset regions is not a simple polygon. For larger values of r, the hole can disappear. This demonstrates why it is difficult to construct the boundary of an offset region for nonconvex polygons.

Inflating a Texas-shaped polygon

The shape of the Texas mainland is nonconvex. I used PROC GREDUCE on the MAPS.US data set in SAS to approximate the shape of Texas by a 36-sided polygon. The polygon is in a standardized coordinate system and has a diameter (maximum distance between vertices) of r = 0.2036. I then constructed the inflated region by using the same technique as shown above. The polygon and its inflated region are shown below.

The image on the left, which shows 36 circles and 36 rectangles, is almost indecipherable. However, the image on the right is almost an exact replica of the region that appears in Robert Allison's post. Remember, though, that the distances in Robert's article are geodesic distances on a sphere whereas these distances are Euclidean distances in the plane. For the planar problem, you can classify a point as within the offset region by testing whether it is inside the polygon itself, inside any of the 36 rectangles, or within a distance r of a vertex. That computation is relatively fast because it is linear in the number of vertices in the polygon.

I don't want to dwell on the computation, but I do want to mention that it requires fewer than 20 SAS/IML statements! The key part of the computation uses vector operations to construct the outward-facing normal vector of length r to each edge of the polygon. If v is the vector that connects the i_th and (i+1)_th vertex of the polygon, then the outward-facing normal vector is given by the concise vector expression r * (v / ||v||) * M, where M is a rotation matrix that rotates by 90 degrees. You can download the SAS program that computes all the images in this article.

In conclusion, you can use a SAS program to construct the offset region for an arbitrary simple polygon. The offset region is the union of circles, rectangles, and the original polygon, which means that it is easy to test whether an arbitrary point in the plane is in the offset region. That is, you can test whether any point is within a distance r to an arbitrary polygon.

The post Offset regions: Find all points within a specified distance from a polygon appeared first on The DO Loop.

12月 142016
Animated image of falling Koch snowflakes
Out of the bosom of the Air,
      Out of the cloud-folds of her garments shaken,
Over the woodlands brown and bare,
      Over the harvest-fields forsaken,
            Silent, and soft, and slow
            Descends the snow.

"Snow-flakes" by Henry Wadsworth Longfellow

Happy holidays to all my readers! In my last post I showed how to create a well-known fractal called the Koch snowflake. The snowflake is aptly named because it has six-fold symmetry. But as Longfellow noted, a real snowflake is not stationary, but descends "silent, and soft, and slow."

As a gift to my readers, I decided to create an animated greeting card entirely in SAS. The animated GIF (click to enlarge) uses some of the SAS techniques that I have blogged about in 2016. The falling and rotating snowflakes were created by using matrix computations in the SAS/IML language. The animated GIF was created by using ODS graphics and PROC SGPLOT.

Create an animated greeting card with #SAS
Click To Tweet

Techniques used to create the animation

If you want to learn how I created this animated GIF with SAS, read on. I've blogged about all the essential techniques in previous posts:

You can download the SAS program that creates the greeting card. Let me know if you adapt this program to create other animated images.

If you like SAS statistical programming or want to learn more about it, subscribe to this blog. In most articles I show how to use SAS for serious pursuits like statistical modeling, data analysis, optimization, and more. But programming with SAS can also be fun, and sometimes it takes a less-serious application to make people say, "Wow! That's cool! I didn't know SAS could do that!"

tags: Fractal, Just for Fun

The post Animate snowfall in SAS appeared first on The DO Loop.

12月 102016
Koch Snowflake in SAS

I have a fondness for fractals.

In previous articles, I've used SAS to create some of my favorite fractals, including a fractal Christmas tree and the "devil's staircase" (Cantor ) function. Because winter is almost here, I think it is time to construct the Koch snowflake fractal in SAS.

A Koch snowflake is shown to the right. The Koch snowflake has a simple construction. Begin with an equilateral triangle. For every line segment, replace the segment by four shorter segments, each of which is one-third the length of the original segment. The two middle segments are offset by 60 degrees from the direction of the original segment. You then iterate this process, with each step replacing every line segment with four others.

The Koch division process

One step in the construction of a Koch Snowflake

For simplicity, the graph at right shows the first step of the construction on a single line segment that has one endpoint at (0,0) and another endpoint at (1,0). The original horizontal line segment of length one is replaced by four line segments of length 1/3. The first and fourth segments are in the same direction as the original segment (horizontal). The second segment is rotated -60 degrees from the original direction. It, too, is 1/3 the original length. The third segment connects the second and fourth line segments.

The following program defines a function that uses the vector capabilities of SAS/IML software to carry out the fundamental step on a line segment. The function takes two row vectors as arguments, which are the (x,y) coordinates of the endpoints of the line segment. The function returns a 5 x 2 matrix, where the rows contain the endpoints of the four line segments that result from the Koch construction.

Two statements in the function might be unfamiliar. The direct product operator (@) quickly constructs points that are k/3 units along the original line segment for k=0, 1, 2, and 3. The ATAN2 function computes the direction angle for a two-dimensional vector.

proc iml;
/* divide a line segment (2 pts) into 4 segments (5 pts).  
   Create middle point by rotating through -60 degrees */ 
start KochDivide(A, E);                    /* (x,y) coords of endpoints */
   segs = j(5, 2, .);                      /* matrix to hold 4 shorter segs */
   v = (E-A) / 3;                          /* vector 1/3 as long as orig */
   segs[{1 2 4 5}, ] = A + v @ T(0:3);     /* endpoints of new segs */
   /* Now compute middle point. Use ATAN2 to find direction angle. */
   theta = -constant("pi")/3 + atan2(v[2], v[1]); /* change angle by pi/3 */
   w = cos(theta) || sin(theta);           /* vector to middle point */
   segs[3,] = segs[2,] + norm(v)*w;
   return segs;
/* test on line segment from (0,0) to (1,0) */
s = KochDivide({0 0}, {1 0});
title  "Fundamental Step in the Koch Snowflake Construction";
ods graphics / width=600  height=300; 
call series(s[,1], s[,2]) procopt="aspect=0.333";

Construct the Koch snowflake in SAS

Now that we have defined a function that can replace a segment by four shorter segments, let's define a function that applies that function repeatedly to each segment of a polygon. The following function takes two arguments: the polygon and the number of iterations. An N-sided closed polygon is represented as an (N+1) x 2 matrix of vertices.

/* create Koch Snowflake from and equilateral triangle */
start KochPoly(P0, iters=5);    
P = P0;
do j=1 to iters;
   N = nrow(P) - 1;             /* old number of segments */
   newP = j(4*N+1, 2);          /* new number of segments + 1 */
   do i=1 to N;                 /* for each segment... */
      idx = (4*i-3):(4*i+1);                    /* rows for 4 new segments */
      newP[idx, ] = KochDivide(P[i,], P[i+1,]); /* generate new segments */
   P = newP;                    /* update polygon and iterate */
return P;

To test the function, define an equilateral triangle. The call to the KochPoly function creates a Koch snowflake from the equilateral triangle.

/* create equilateral triangle as base for snowflake */
pi = constant("pi");
angles = -pi/6 // pi/2 // 7*pi/6;  /* angles for equilateral triangle */
P = cos(angles) || sin(angles);    /* vertices of equilateral triangle */
P = P // P[1,];                    /* append first vertex to close triangle */
K = KochPoly(P);                   /* create the Koch snowflake */

The Koch snowflake that results from this iteration is shown at the top of this article. For completeness, here are the statements that write the snowflake to a SAS data set and graph it by using PROC SGPLOT:

S = j(nrow(K), 3, 1);      /* add ID variable with constant value 1 */
S[ ,1:2] = K;
create Koch from S[colname={X Y ID}];  append from S;  close;
ods graphics / width=400px height=400px;
footnote justify=Center "Koch Snowflake";
proc sgplot data=Koch;
   styleattrs wallcolor=CXD6EAF8;               /* light blue */
   polygon x=x y=y ID=ID / outline fill fillattrs=(color=white);
   xaxis display=none;
   yaxis display=none;

Generalizations of the Koch snowflake

The KochPoly function does not check whether the polygon argument is an equilateral triangle. Consequently, use the function to "Koch-ify" any polygon! For example pass in the rectangle with vertices P = {0 0, 2 0, 2 1, 0 1, 0 0}; to create a "Koch box."

Enjoy playing with the program. Let me know if you generate an interesting shape!

tags: Fractal, Just for Fun

The post Create a Koch snowflake with SAS appeared first on The DO Loop.

7月 052016

Last week I blogged about how to draw the Cantor function in SAS. The Cantor function is used in mathematics as a pathological example of a function that is constant almost everywhere yet somehow manages to "climb upwards," thus earning the nickname "the devil's staircase."

The Cantor function has three properties: (1) it is continuous, (2) it is non-decreasing, and (3) F(0)=0 and F(1)=1. There is a theorem that says that any function with these properties is the cumulative distribution function (CDF) for some random variable. In theory, therefore, you can generate a large number of random variates (a large sample), then use PROC UNIVARIATE to plot the empirical CDF. The empirical CDF should be the devil's staircase function.

A random sample from the Cantor set

My last article showed that every point in the Cantor set can be written as the infinite sum x = Σi ai3-i where the ai equal 0 or 2. You can approximate the Cantor set by truncating the series at a certain number of terms. You can generate random points from the (approximate) Cantor set by choosing the coefficients randomly from the Bernoulli distribution with probability 0.5. The following DATA step generates 10,000 points from the Cantor set by using random values for the first 20 coefficients:

data Cantor;
call streaminit(123456);
k = 20;             /* maximum number of coefficients */
n = 10000;          /* sample size to generate */
do i = 1 to n;      /* generate n random points in Cantor set */
   x = 0;
   do j = 1 to k;   /* each point is sum of powers of 1/3 */
      b = 2 * rand("Bernoulli", 0.5);  /* random value 0 or 2 */
      x + b * 3**(-j);                 /* power of 1/3 */
keep x;
proc univariate data=Cantor;
   cdf x;

The call to PROC UNIVARIATE displays the empirical CDF for the random sample of points. And voilà! The Cantor distribution function appears! As a bonus, the output from PROC UNIVARIATE (not shown) indicates that the mean of the distribution is 1/2 (which is obvious by symmetry) and the variance is 1/8, which you might not have expected.

This short SAS program does two things: it shows how to simulate a random sample uniformly from the (approximate) Cantor set, and it indicates that the devil's staircase function is the distribution function for this a uniform random variable.

Did you know that the Cantor staircase function is a cumulative distribution? Simulate it in #SAS!
Click To Tweet

A random sample in SAS/IML

Alternatively, you can generate the same random sample by using SAS/IML and matrix computations. My previous article drew the Cantor function by systematically using all combinations of 0s and 2s to construct the elements of the Cantor set. However, you can use the same matrix-based approach but generate random coefficients, therefore obtaining a random sample from the Cantor set:

proc iml;
k = 20;          /* maximum number of coefficients */
n = 10##4;       /* sample size to generate */
B = 2 * randfun(n||k, "Bernoulli", 0.5);  /* random n x k matrix of 0/2 */
v = 3##(-(1:k)); /* powers of 1/3 */
x = B * v`;      /* sum of random terms that are either 0*power or 2*power */

Recall that ECDF is a step function that plots the ith ordered datum at the height i/n. You can approximate the empirical CDF in SAS/IML by using the SERIES subroutine. Technically, the lines that appear in the line plot have a nonzero slope, but the approximation looks very similar to the PROC UNIVARIATE output:

call sort(x, 1);             /* order elements in the Cantor set */
x = 0 // x // 1;             /* append end points */
y = do(0, 1, 1/(nrow(x)-1)); /* empirical CDF */
title "Empirical Distribution of a Uniform Sample from the Cantor Set";
call series(x, y);

Interpretation in probability

It is interesting that you can actually define the Cantor set in terms of rolling a fair six-sided die. Suppose that you roll a die infinitely often and we adopt the following rules:

  • If the die shows a 1 or 2, you get zero points.
  • If the die shows a 3 or 4, you get one point.
  • If the die shows a 5 or 6, you get two points.

This game is closely related to the Cantor set. Recall that the Cantor set can be written as the set of all base-3 decimal values in [0,1] for which the decimal expansion does not contain a 1. In this game, each point in the Cantor set corresponds to a sequence of rolls that never contain a 3 or 4. (Equivalently, the score is always even.) Obviously this would never happen in real life, which is an intuitive way of saying that the Cantor set has "measure zero."


The first time I saw the Cantor function depicted as an empirical distribution function was when I saw a very compact MATLAB formula like this:

stairs([min(x) sort(x)],0:1/length(x):1) % Plot the c.d.f of x

The previous formula is equivalent to my SAS/IML program, but in my program I broke the formula apart so that the components could be understood more easily. This formula appears in a set of probability demos by Peter Doyle. I had the privilege to interact with Professor Doyle at The Geometry Center (U. MN) in the early 1990s, so perhaps he was responsible for showing me the Cantor distribution.

These UNC course notes from Jan Hannig also discuss the Cantor distribution, but the original source is not cited.

tags: Just for Fun, Math

The post Cantor sets, the devil's staircase, and probability appeared first on The DO Loop.

6月 292016

I was a freshman in college the first time I saw the Cantor middle-thirds set and the related Cantor "Devil's staircase" function. (Shown at left.) These constructions expanded my mind and led me to study fractals, real analysis, topology, and other mathematical areas.

The Cantor function and the Cantor middle-thirds set are often used as counter-examples to mathematical conjectures. The Cantor set is defined by a recursive process that requires infinitely many steps. However, you can approximate these pathological objects in a matrix-vector language such as SAS/IML with only a few lines of code!

Construction of the Cantor middle-thirds set

The Cantor middle-thirds set is defined by the following iterative algorithm. The algorithm starts with the closed interval [0,1], then does the following:

  1. In the first step, remove the open middle-thirds of the interval. The two remaining intervals are [0,1/3] and [2/3, 1], which each have length 1/3.
  2. In the ith step, remove the open middle-thirds of all intervals from the (i-1)th step. You now have twice as many intervals and each interval is 1/3 as long as the intervals in the previous step.
  3. Continue this process forever.

After two steps you have four intervals: [0,1/9], [2/9,1/3], [2/3, 7/9], and [8/9,1]. After three steps you have eight intervals of length 1/27. After k steps you have 2k intervals of length 3-k.

The Cantor set is the set of all points that never get removed during this infinite process. The Cantor set clearly contains infinitely many points because it contains the endpoints of the intervals that are removed: 0, 1, 1/3, 2/3, 1/9. 2/9, 7/9, 8/9, and so forth. Less intuitive is the fact that the cardinality of the Cantor set is uncountably infinite even though it is a set of measure zero.

Construction of the Cantor function

The Cantor function F: [0,1] → [0,1] can be defined iteratively in a way that reflects the construction of the Cantor middle-thirds set. The function is shown at the top of this article.

  • At step 0, define the function on the endpoints of the [0,1] interval by F(0)=0 and F(1)=1.
  • At step 1, define the function on the closed interval [1/3, 2/3] to be 1/2.
  • At step 2, define the function on [1/9, 2/9] to be 1/4. Define the function on [7/9, 8/9] to be 3/4.
  • Continue this construction. At each step, the function is defined on the closure of the middle-thirds intervals so that the function value is halfway between adjacent values. The function looks like a staircase with steps of difference heights and widths. The function is constant on every middle-thirds interval, but nevertheless is continuous and monotonically nondecreasing.

Visualize the Cantor staircase function in #SAS.
Click To Tweet

Visualizing the Cantor function in SAS

This is a SAS-related blog, so I want to visualize the Cantor function in SAS. The middle-third intervals during the kth step of the construction have length 3-k, so you can stop the construction after a small number of iterations and get a decent approximation. I'll use k=8 steps.

Although the Cantor set and function were defined geometrically, they have an equivalent definition in terms of decimal expansion. The Cantor set is the set of decimal values that can be written in base 3 without using the '1' digit. In other words, elements of the Cantor set have the form x = 0.a1a2a3... (base 3), where ai equals 0 or 2.

An equivalent definition in terms of fractions is x = Σi ai3-i where ai equals 0 or 2. Although the sum is infinite, you can approximate the Cantor set by truncating the series after finitely many terms. A sum like this can be expressed as an inner product x = a*v` where a is a k-element row vector that contains 0s and 2s and v is a vector that contains the elements {1/3, 1/9, 1/27, ..., 1/3-k}.

You can define B to be a matrix with k columns and 2k rows that contains all combinations of 0s and 2s. Then the matrix product B*v is an approximation to the Cantor set after k steps of the construction. It contains the right-hand endpoints of the middle-third intervals.

In SAS/IML you can use the EXPANDGRID function to create a matrix whose rows contain all combinations of 0s and 2s. The ## operator raises an element to a power. Therefore the following statements construct and visualize the Cantor function. With a little more effort, you can write a few more statements that improve the approximation and add fractional tick marks to the axes, as shown in the graph at the top of this article.

proc iml;
/* rows of B contain all 8-digit combinations of 0s and 2s */ 
B = expandgrid({0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2}, {0 2});
B = B[2:nrow(B),];      /* remove first row of zeros */
k = ncol(B);            /* k = 8 */
v = 3##(-(1:k));       /* vector of powers 3^{-i} */
t = B * v`;            /* x values: right-hand endpts of middle-third intervals */
u = 2##(-(1:k));       /* vector of powers 2^{-i} */
f = B/2 * u`;          /* y values: Cantor function on Cantor set */
call series(t, f);     /* graph the Cantor function */

I think this is a very cool construction. Although the Cantor function is defined iteratively, there are no loops in this program. The loops are replaced by matrix multiplication and vectors. The power of a matrix language is that it enables you to compute complex quantities with only a few lines of programming.

Do you have a favorite example from math or statistics that changed the way that you look at the world? Leave a comment.


This short article cannot discuss all the mathematically interesting features of the Cantor set and Cantor function. The following references are provided for the readers who want additional information:

tags: Just for Fun, vectorization

The post Visualize the Cantor function in SAS appeared first on The DO Loop.

12月 162015
Lo how a rose e'er blooming
From tender stem hath sprung.

As I write this blog post, a radio station is playing Chrismas music. One of my favorite Christmas songs is the old German hymn that many of us know as "Lo, How a Rose E're Blooming." I was humming the song recently when someone asked me how to use SAS to draw curves in polar coordinates. I immediately thought about "polar roses," mathematical parametric curves of the form r = cos(k θ).

In the spirit of Christmas, I present "Lo, how a polar rose e'er blooming, from SGPLOT hath sprung."

Plotting polar coordinates in SAS

It is easy to graph a polar curve in SAS. Here's a quick overview, or a "polar express" tour, if you prefer.

Many familiar polar equations are parametric curves. The simplest are relations are of the form r = f(θ), where θ is the polar angle. The following SAS DATA step creates evenly spaced values of theta in the range [0, 2π]. For each value of theta, it computes r = cos(k θ), which is the equation for a rose with k petals. The polar coordinates (r, θ) are converted to Euclidean coordinates through the usual transformation x = r*cos(theta) and y = r*sin(theta).

After creating the points along the curve, you can visualize it by using PROC SGPLOT. The SERIES statement is used to display the parametric curve in Euclidean coordinates. The XAXIS and YAXIS statements use the MIN= and MAX= options to ensure that the image appears in a square region, and the ASPECT=1 option is used to ensure that the aspect ratio of the plot does not distort the geometry of the rose.

%let k = 5;
/* draw the curve r=cos(k * theta) in polar coords, which is a k-leaved rose */
data Rose;
do theta = 0 to 2*constant("pi") by 0.01;
   r = cos(&k * theta);       /* rose */
   x = r*cos(theta);          /* convert to Euclidean coords */
   y = r*sin(theta);
title "Polar Rose: r = cos(&k theta)";;
proc sgplot data=Rose aspect=1;
   refline 0 / axis=x;
   refline 0 / axis=y;
   series x=x y=y;
   xaxis min=-1 max=1;
   yaxis min=-1 max=1;

Graphing generalized roses

The Wikipedia page for the polar rose contains a brief discussion about what the curve r = cos(k θ) looks like when k is a rational number n/d. When k is not an integer, most of the curves do not look like a rose—or any other kind of flower! Only a myopic mathematician would call some of these images roses.

Regardless, let's see how to get some of these "roses" to bloom in SAS. The following DATA step generates all positive rational numbers of the form n/d, where n ≤ 7 and d ≤ 9 are integers. For each value of k, the program computes the generalize rose equation and converts the values from polar to Euclidean coordinates:

data Roses;
do n = 1 to 7;
   do d = 1 to 9;
      k = n / d;
      /* draw the rose r=cos(n/d * theta) */
      do theta = 0 to 2*lcm(n,d)*constant("pi") by 0.05;
         r = cos(k * theta);      /* generalized rose */
         x = r*cos(theta);        /* convert to Euclidean coords */
         y = r*sin(theta);

It is not any harder to visualize 63 roses than it is to visualize one. The SGPANEL procedures was designed to display multiple cells, each with the same type of graph. Therefore, to display an entire bouquet of roses, you merely have to specify the N and D variables on the PANELBY statement, and SAS handles the rest:

title "Polar Roses: r = cos(n/d theta)";
proc sgpanel data=Roses aspect=1;
   panelby n d / layout=lattice onepanel rowheaderpos=left;
   refline 0 / axis=x transparency=0.5;
   refline 0 / axis=y transparency=0.5;
   series x=x y=y;
   colaxis display=none;
   rowaxis display=none;

You can see certain numeric patterns in the lattice display. In particular, when n/d is a reducible fraction (such as 2/4, 3/6, and 4/8), the image is identical to the reduced fraction (such as 1/2). Do you see other patterns? Leave a comment.

These roses are beautiful, like the song that inspired me to create them. Since I get up early in the morning to write my blog, a few additional lyrics seem appropriate:

It came a blossom bright.
Amid the cold of winter
When half spent was the night.

Happy holidays to all my readers. I am taking a short vacation and will return in 2016. Until then, keep your eyes open: math, statistics, and beauty are all around you, ever blooming!

tags: Just for Fun, Math

The post Lo, how a polar rose e'er blooming appeared first on The DO Loop.

9月 042015

In my previous blog post, I showed how you can use SAS to program a "weaving" algorithm that takes an image, cuts it into strips, and weaves the strips together to create mathematical art. I used matrices and heat maps for the computations and visualization.

At the end of the post, I presented an image of woven strips (shown at left) and challenged the reader to "deduce the original image from the weave." This article shows that you can "unweave" the image, reassemble the strips, and use interpolation to obtain an approximation to the original image. You can download the SAS/IML program that computes and visualizes the preimage.

Unweaving a weave

The first step in reconstructing the original image is to "pull apart" the horizontal and vertical strips, rotate the horizontal ones, put them next to the vertical strips, and close all the gaps. This requires writing the inverse of the weave algorithm. When you do this, you obtain the following image:


Notice that there are gaps in our knowledge about the original image. Due to the weaving process, there are places where a vertical strip was covered by a horizontal strip, or the other way around. These missing values are shown in white. Mathematically, the weaving algorithm represents a many-to-one function, so there is no inverse function that can restore the preimage from the image.

Nevertheless, the unweaving reveals the main features of the original image. And we can do more. If we know that the original image was created from a continuous mathematical function, you can use interpolation to try to fill in the gaps. For this image you can use the information from nearby cells (north, south, east, and west) to estimate a value for the missing cells. The following uses a simple estimation scheme, which is to assign each missing cell the average value of its neighbors. You can proceed in five steps:

  1. Approximate the corners. For the lower left corner, use the average of the cell above and the cell to the right. For the lower right corner, use the average of the cell above and the cell to the left.
  2. Approximate the left edge. For each missing cell, average the cells above, below, and to the right.
  3. Approximate the right edge. For each missing cell, average the cells above, below, and to the left.
  4. Approximate the bottom edge. For each missing cell, average the cells above, to the left, and to the right.
  5. Approximate the interior cells. For each missing cell, average the cells above, below, to the left, and to the right.

After implementing this process, you obtain the following image, which approximates the original image:


How well does this image approximate the original image? Very well! The mean square error is very small, and the maximum error values are within 20% of the true values. For comparison, here is the original image. You can see that it is very similar to the reconstructed image.


The approximation could be improved by using a more sophisticated interpolation scheme, such as bivariate linear interpolation.

Feel free to download the SAS/IML program and play with other weaves. How well can you recover an original image from its weave?
tags: Just for Fun

The post Mathematical art (part 2): Unweaving matrices appeared first on The DO Loop.