7月 162018

My colleague Robert Allison recently blogged about using the diameter of Texas as a unit of measurement. The largest distance across Texas is about 801 miles, so Robert wanted to find the set of all points such that the distance from the point to Texas is less than or equal to 801 miles.

Robert's implementation was complicated by the fact that he was interested in points on the round earth that are within 801 miles from Texas as measured along a geodesic. However, the idea of "thickening" or "inflating" a polygonal shape is related to a concept in computational geometry called the offset polygon or the inflated polygon. A general algorithm to inflate a polygon is complicated, but this article demonstrates the basic ideas that are involved. This article discusses offset regions for convex and nonconvex polygons in the plane. The article concludes by drawing a planar region for a Texas-shaped polygon that has been inflated by the diameter of the polygon. And, of course, I supply the SAS programs for all computations and images.

Offset regions as a union of circles and rectangles

Assume that a simple polygon is defined by listing its vertices in counterclockwise order. (Recall that a simple polygon is a closed, nonintersecting, shape that has no holes.) You can define the offset region of radius r as the union of the following shapes:

  • The polygon itself
  • Circles of radius r centered at each vertex
  • Rectangles of width r positions outside the polygon along each edge

The following graphic shows the offset region (r = 0.5) for a convex "house-shaped" polygon. The left side of the image shows the polygon with an overlay of circles centered at each vertex and outward-pointing rectangles along each edge. The right side of the graphic shows the union of the offset regions (blue) and the original polygon (red):

The image on the right shows why the process is sometimes called an "inflating" a polygon. For a convex polygon, the edges are pushed out by a distance r and the vertices become rounded. For large values of r, the offset region becomes a nearly circular blob, although the boundary is always the union of line segments and arcs of circles.

You can draw a similar image for a nonconvex polygon. The inflated region near a convex (left turning) vertex looks the same as before. However, for the nonconvex (right turning) vertices, the circles do not contribute to the offset region. Computing the offset region for a nonconvex polygon is tricky because if the distance r is greater than the minimum distance between vertices, nonlocal effects can occur. For example, the following graphic shows a nonconvex polygon that has two "prongs." Let r0 be the distance between the prongs. When you inflate the polygon by an amount r > r0/2, the offset region can contain a hole, as shown. Furthermore, the boundary of the offset regions is not a simple polygon. For larger values of r, the hole can disappear. This demonstrates why it is difficult to construct the boundary of an offset region for nonconvex polygons.

Inflating a Texas-shaped polygon

The shape of the Texas mainland is nonconvex. I used PROC GREDUCE on the MAPS.US data set in SAS to approximate the shape of Texas by a 36-sided polygon. The polygon is in a standardized coordinate system and has a diameter (maximum distance between vertices) of r = 0.2036. I then constructed the inflated region by using the same technique as shown above. The polygon and its inflated region are shown below.

The image on the left, which shows 36 circles and 36 rectangles, is almost indecipherable. However, the image on the right is almost an exact replica of the region that appears in Robert Allison's post. Remember, though, that the distances in Robert's article are geodesic distances on a sphere whereas these distances are Euclidean distances in the plane. For the planar problem, you can classify a point as within the offset region by testing whether it is inside the polygon itself, inside any of the 36 rectangles, or within a distance r of a vertex. That computation is relatively fast because it is linear in the number of vertices in the polygon.

I don't want to dwell on the computation, but I do want to mention that it requires fewer than 20 SAS/IML statements! The key part of the computation uses vector operations to construct the outward-facing normal vector of length r to each edge of the polygon. If v is the vector that connects the i_th and (i+1)_th vertex of the polygon, then the outward-facing normal vector is given by the concise vector expression r * (v / ||v||) * M, where M is a rotation matrix that rotates by 90 degrees. You can download the SAS program that computes all the images in this article.

In conclusion, you can use a SAS program to construct the offset region for an arbitrary simple polygon. The offset region is the union of circles, rectangles, and the original polygon, which means that it is easy to test whether an arbitrary point in the plane is in the offset region. That is, you can test whether any point is within a distance r to an arbitrary polygon.

The post Offset regions: Find all points within a specified distance from a polygon appeared first on The DO Loop.

7月 092018

Back in high school, you probably learned to find the intersection of two lines in the plane. The intersection requires solving a system of two linear equations. There are three cases: (1) the lines intersect in a unique point, (2) the lines are parallel and do not intersect, or (3) the lines are coincident. Thus, for two lines, the intersection problem has either 1, 0, or infinitely many solutions. Most students quickly learn that the lines always intersect when their slopes are different, whereas the special cases (parallel or coincident) occur when the lines have the same slope.

Recently I had to find the intersection between two line segments in the plane. Line segments have finite extent, so segments with different slopes may or may not intersect. For example, the following panel of graphs shows three pairs of line segments in the plane. In the first panel, the segments intersect. In the second panel, the segments have the same slopes as in the first panel, but these segments do not intersect. In the third panel, the segments intersect in an interval. This article shows how to construct a linear system of equations that distinguishes between the three cases and compute an intersection point, if it exists.

Parameterization of line segments

Let p1 and p2 be the endpoints of one segment and let q1 and q2 be the endpoints of the other. Recall that a parametrization of the first segment is (1-s)*p1 + s*p2, where s ∈ [0,1] and the endpoints are treated as 2-D vectors. Similarly, a parametrization of the second segment is (1-t)*q1 + t*q2, where t ∈ [0,1]. Consequently, the segments intersect if and only if there exists values for (s,t) in the unit square such that
(1-s)*p1 + s*p2 = (1-t)*q1 + t*q2
You can rearrange the terms to rewrite the equation as
(p2-p1)*s + (q1-q2)*t = q1 - p1

This is a vector equation which can be rewritten in terms of matrices and vectors. Define the 2 x 2 matrix A whose first column contains the elements of (p2-p1) and whose second column contains the elements of (q1-q2). Define b = q1 - p1 and x = (s,t). If the solution of the linear system A*x = b is in the unit square, then the segments intersect. If the solution is not in the unit square, the segments do not intersect. If the segments have the same slope, then the matrix A is singular and you need to perform additional tests to determine whether the segments intersect.

A vector solution for the intersection of segments

As shown above, the intersection of two planar line segments is neatly expressed in terms of a matrix-vector system. In SAS, the SAS/IML language provides a natural syntax for expressing and solving matrix-vector equations. The following SAS/IML function constructs and solves a linear system. For simplicity, this version does not handle the degenerate case of two segments that have the same slope. That case is handled in the next section.

start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
/* Test 1: intersection at (0.95, 1.25) */
p1 = {1.8 2.1};   p2 = {0.8 1.1};
q1 = {1 1.25};    q2 = {0 1.25};
z = IntersectSegsSimple(p1,p2,q1,q2);
print z;
/* Test 2: no intersection */
p1 = {-1 0.5};  p2 = {1 0.5};
q1 = {0 1};     q2 = {0 2};
v = IntersectSegsSimple(p1, p2, q1, q2);
print v;

The function contains only a few statements. The function is called to solve the examples in the first two panels of the previous graph. The SOLVE function solves the linear system (assuming that a solution exists), and the IF-THEN statement tests whether the solution is in the unit square [0,1] x [0,1]. If so, the function returns the point of intersection. If not, the function returns a pair of missing values.

The general case: Handling overlapping line segments

For many applications, the function in the previous section is sufficient because it handles the generic cases. For completeness the following module also handles segments that have identical slopes. The DET function determines whether the segments have the same slope. If so, the segments could be parallel or collinear. To determine whether collinear segments intersect, you can test for three conditions:

  • The endpoint q1 is inside the segment [p1, p2].
  • The endpoint q2 is inside the segment [p1, p2].
  • The endpoint p1 is inside the segment [q1, q2].

Notice that the condition "p2 is inside [q1,q2]" does not need to be checked separately because it is already handled by the existing checks. If any of the three conditions are true, there are infinitely many solutions (or the segments share an endpoint). If none of the conditions hold, the segments do not intersect. For overlapping segments, the following function returns an endpoint of the intersection interval.

/* handle all cases: determine intersection of two planar line segments 
   [p1, p2] and [q1, q2] */
start Intersect2DSegs(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2);
   if det(A)^=0 then do;         /* nonsingular system: 0 or 1 intersection */
      x = solve(A, b);           /* x = (s,t) */
      if all(0<=x && x<=1) then  /* if x is in [0,1] x [0,1] */
         return (1-x[1])*p1 + x[1]*p2;  /* return intersection */
      else                       /* segments do not intersect */
         return ({. .});         /* return missing values */
   /* segments are collinear: 0 or infinitely many intersections */
   denom = choose(p2-p1=0, ., p2-p1);  /* protect against division by 0 */
   s = (q1 - p1) / denom;        /* Is q1 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q1;
   s = (q2 - p1) / denom;        /* Is q2 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q2;
   denom = choose(q2-q1=0, ., q2-q1);  /* protect against division by 0 */
   s = (p1 - q1) / denom;        /* Is p1 in [q1, q2]? */
   if any(0<=s && s<=1) then
      return p1;
   return ({. .});               /* segments are disjoint */
/* test overlapping segments; return endpoint of one segment */
p1 = {-1 1};  p2 = {1 1};
q1 = {0 1};   q2 = {2 1};
w = Intersect2DSegs(p1, p2, q1, q2);
print w;

In summary, by using matrices, vectors, and linear algebra, you can easily solve for the intersection of two line segments or determine that the segments do not intersect. The general case needs some special logic to handle degenerate configurations, but the code that solves the generic cases is straightforward when expressed in a vectorized language such as SAS/IML.

The post The intersection of two line segments appeared first on The DO Loop.

11月 092016

This article uses graphical techniques to visualize one of my favorite geometric objects: the surface of a three-dimensional torus. Along the way, this article demonstrates techniques that are useful for visualizing more mundane 3-D point clouds that arise in statistical data analysis.

Projections of a 3-D torus: Four visualization techniques for high-dimensional data. #DataViz
Click To Tweet

Define points on a torus

A torus is the product of two circles, so it can be parameterized by two angles, usually called θ and φ. You can create a regular grid of points in the square (θ, φ) ∈ [0, 2π] x [0, 2π] and map the points into Euclidean three-space as shown in the following SAS DATA step:

data Torus;
R = 8;       /* radius to center of tube */
A = 3;       /* radius of tube */
pi = constant('pi');
step = 2*pi/50;
/* create torus as parametric image of [0, 2*pi] x [0,2*pi] */
do theta = 0 to 2*pi-step by step;
   do phi = 0 to 2*pi-step by 2*pi/50;
      x = (R + A*cos(phi)) * cos(theta);
      y = (R + A*cos(phi)) * sin(theta);
      z =      A*sin(phi);
keep x y z;
title "Projections of a Standard Torus";
proc sgscatter data=Torus;
   matrix x y z;
Projections of torus onto coordinate planes

The SGSCATTER procedure displays the projection of the torus onto the three coordinate planes. In the (x,y) plane you can see that the object has a hole, but the projection onto the other coordinate planes are not very insightful because there is a lot of overplotting.

Rotating scatter plot in SAS/IML Studio

One way to avoid overplotting is to visualize the torus as a 3-D cloud of points. The SAS/IML Studio environment supports a rotating 3-D scatter plot, as shown to the left. In SAS/IML Studio you can interactively rotate the 3-D cloud of points, change the marker colors, and perform other techniques in exploratory data analysis.

Alternatively, if you want to look at planar projections with PROC SGSCATTER, you can rotate the torus so that the projections onto the coordinate planes are not degenerate.

Rotating a cloud of points

My previous article defined SAS/IML functions that create rotation matrices. The following SAS/IML program is almost identical to the program in the previous article, so I will not explain each statement. The program reads the Torus data set, rotates the points in a particular way, and writes the rotated coordinates to a SAS data set.

proc iml;
/* choose rotation matrix as product of canonical rotations */
load module=Rot3D;         /* see */
pi = constant('pi');
Rz = Rot3D(-pi/6, "Z");    /* rotation matrix for (x,y) plane */
Rx = Rot3D(-pi/3, "X");    /* rotation matrix for (y,z) plane */ 
Ry = Rot3D( pi/6, "Y");    /* rotation matrix for (x,z) plane */
P = Rx*Ry*Rz;              /* cumulative rotation */
use Torus;                 /* read data (points on a torus) */
read all var {x y z} into M;
close Torus;
Rot = M * P`;              /* apply rotation and write to data set */
create RotTorus from Rot[colname={"Px" "Py" "Pz"}];
append from Rot;

You can use PROC SGSCATTER to project these rotated points onto the coordinate planes. The TRANSPARENCY= option creates semi-transparent markers that give the illusion of a ghostly see-through torus. This can be an effective technique for visualizing any point cloud, since it enables you to see regions in which overplotting occurs. The statements also use the COLORRESPONSE= option to color the markers by the (original) Z variable. The COLORRESPONSE= option requires SAS 9.4m3.

data Coords;
merge Torus RotTorus;
title "Projections of a Rotated Torus";
proc sgscatter data=Coords;
matrix Px Py Pz / markerattrs=(size=12 symbol=CircleFilled)
       colorresponse=Z  colormodel=ThreeColorRamp; /* COLORRESPONSE= requires 9.4m3 */
Projections of rotated torus onto coordinate planes

The transparent markers serve a second function. The torus is composed of a sequence of circles. Therefore, the last circles (near θ = 2π) will obscure the first circles (near θ = 0) if opaque markers are used. The parametric construction is very apparent if you remove the TRANSPARENCY= option.

If you want to plot without transparency, you should sort the data, which is a standard technique in the graphics community where it is called z-ordering or depth sorting. For example, in the (Px,Py) plane you can sort by the Pz variable so that negative Pz values are plotted first and positive Pz values are plotted on top of the earlier markers. Unfortunately, you can only use this sorting trick to plot one pair of coordinates at a time. The following code demonstrates this trick for the (Px, Py) plane:

proc sort data=Coords out=SortCoords;
   by Pz;
title "Projection of a Rotated Torus";
title2 "Sorted by Distance Above Coordinate Plane";
proc sgplot data=SortCoords;
   scatter x=Px y=Py / markerattrs=(size=12 symbol=CircleFilled)
           colorresponse=Z  colormodel=ThreeColorRamp;
Projection of torus onto coordinate plane

In conclusion, PROC SGSCATTER enables you to visualize high-dimensional data via projections onto coordinate planes. I demonstrated the techniques on a 3-D torus, but the techniques apply generally to any high-dimensional data. Visualization tricks in this article include:

  • If the projections of the data onto coordinate planes are degenerate, rotate the points.
  • Use semi-transparency to reduce the effect of overplotting.
  • Use the COLORRESPONSE= option to color the markers by one of the variables. This requires SAS 9.4m4.
  • If you do not want to use transparency, sort the data in a direction perpendicular to the projected plane. That makes the rendering look more realistic.

Your high-dimensional point clouds might not look as cool as this torus, but if you use these visualization tips, the data will be easier to interpret and understand. The SGSCATTER procedure in SAS is an easy-to-use routine for investigating relationships among several variables.

tags: Math, SAS/IML Studio, Statistical Graphics

The post Visualize a torus in SAS appeared first on The DO Loop.

7月 052016

Last week I blogged about how to draw the Cantor function in SAS. The Cantor function is used in mathematics as a pathological example of a function that is constant almost everywhere yet somehow manages to "climb upwards," thus earning the nickname "the devil's staircase."

The Cantor function has three properties: (1) it is continuous, (2) it is non-decreasing, and (3) F(0)=0 and F(1)=1. There is a theorem that says that any function with these properties is the cumulative distribution function (CDF) for some random variable. In theory, therefore, you can generate a large number of random variates (a large sample), then use PROC UNIVARIATE to plot the empirical CDF. The empirical CDF should be the devil's staircase function.

A random sample from the Cantor set

My last article showed that every point in the Cantor set can be written as the infinite sum x = Σi ai3-i where the ai equal 0 or 2. You can approximate the Cantor set by truncating the series at a certain number of terms. You can generate random points from the (approximate) Cantor set by choosing the coefficients randomly from the Bernoulli distribution with probability 0.5. The following DATA step generates 10,000 points from the Cantor set by using random values for the first 20 coefficients:

data Cantor;
call streaminit(123456);
k = 20;             /* maximum number of coefficients */
n = 10000;          /* sample size to generate */
do i = 1 to n;      /* generate n random points in Cantor set */
   x = 0;
   do j = 1 to k;   /* each point is sum of powers of 1/3 */
      b = 2 * rand("Bernoulli", 0.5);  /* random value 0 or 2 */
      x + b * 3**(-j);                 /* power of 1/3 */
keep x;
proc univariate data=Cantor;
   cdf x;

The call to PROC UNIVARIATE displays the empirical CDF for the random sample of points. And voilà! The Cantor distribution function appears! As a bonus, the output from PROC UNIVARIATE (not shown) indicates that the mean of the distribution is 1/2 (which is obvious by symmetry) and the variance is 1/8, which you might not have expected.

This short SAS program does two things: it shows how to simulate a random sample uniformly from the (approximate) Cantor set, and it indicates that the devil's staircase function is the distribution function for this a uniform random variable.

Did you know that the Cantor staircase function is a cumulative distribution? Simulate it in #SAS!
Click To Tweet

A random sample in SAS/IML

Alternatively, you can generate the same random sample by using SAS/IML and matrix computations. My previous article drew the Cantor function by systematically using all combinations of 0s and 2s to construct the elements of the Cantor set. However, you can use the same matrix-based approach but generate random coefficients, therefore obtaining a random sample from the Cantor set:

proc iml;
k = 20;          /* maximum number of coefficients */
n = 10##4;       /* sample size to generate */
B = 2 * randfun(n||k, "Bernoulli", 0.5);  /* random n x k matrix of 0/2 */
v = 3##(-(1:k)); /* powers of 1/3 */
x = B * v`;      /* sum of random terms that are either 0*power or 2*power */

Recall that ECDF is a step function that plots the ith ordered datum at the height i/n. You can approximate the empirical CDF in SAS/IML by using the SERIES subroutine. Technically, the lines that appear in the line plot have a nonzero slope, but the approximation looks very similar to the PROC UNIVARIATE output:

call sort(x, 1);             /* order elements in the Cantor set */
x = 0 // x // 1;             /* append end points */
y = do(0, 1, 1/(nrow(x)-1)); /* empirical CDF */
title "Empirical Distribution of a Uniform Sample from the Cantor Set";
call series(x, y);

Interpretation in probability

It is interesting that you can actually define the Cantor set in terms of rolling a fair six-sided die. Suppose that you roll a die infinitely often and we adopt the following rules:

  • If the die shows a 1 or 2, you get zero points.
  • If the die shows a 3 or 4, you get one point.
  • If the die shows a 5 or 6, you get two points.

This game is closely related to the Cantor set. Recall that the Cantor set can be written as the set of all base-3 decimal values in [0,1] for which the decimal expansion does not contain a 1. In this game, each point in the Cantor set corresponds to a sequence of rolls that never contain a 3 or 4. (Equivalently, the score is always even.) Obviously this would never happen in real life, which is an intuitive way of saying that the Cantor set has "measure zero."


The first time I saw the Cantor function depicted as an empirical distribution function was when I saw a very compact MATLAB formula like this:

stairs([min(x) sort(x)],0:1/length(x):1) % Plot the c.d.f of x

The previous formula is equivalent to my SAS/IML program, but in my program I broke the formula apart so that the components could be understood more easily. This formula appears in a set of probability demos by Peter Doyle. I had the privilege to interact with Professor Doyle at The Geometry Center (U. MN) in the early 1990s, so perhaps he was responsible for showing me the Cantor distribution.

These UNC course notes from Jan Hannig also discuss the Cantor distribution, but the original source is not cited.

tags: Just for Fun, Math

The post Cantor sets, the devil's staircase, and probability appeared first on The DO Loop.

6月 032016

In a world filled with jargon, it’s refreshing to hear from a subject-matter expert who can communicate in a direct and uncomplicated fashion – so that even a layperson would understand. You could say this is Randall Munroe’s mission. Munroe is masterful at using math, science and comics to make […]

The post Complicated stuff in simple words from Randall Munroe appeared first on JMP Blog.

1月 252016

A moving average (also called a rolling average) is a satistical technique that is used to smooth a time series. Moving averages are used in finance, economics, and quality control. You can overlay a moving average curve on a time series to visualize how each value compares to a rolling average of previous values. For example, the following graph shows the monthly closing price of IBM stock over a 20-year period. Three kinds of moving averages are overlaid on a scatter plot of the data.

Moving average of stock price

The IBM stock price increased in some time periods and decreased in others. The moving-average curves help to visualize these trends and identify these time periods. For a simple moving average, the smoothness of a curve is determined by the number of time points, k, that is used to compute the moving average. Small values of k result in curves that reflect the short-term ups and downs of the data; large values of k undulate less. For stock charts that show daily prices, the 30-day moving average and the 5-day moving average are popular choices.

How do you define a moving average?

The most common moving averages are the simple moving average (MA), the weighted moving average (WMA), and the exponentially weighted moving average (EWMA). The following list provides a brief description and mathematical formula for these kinds of moving averages. See the Wikipedia article on moving averages for additional details.

Let {y0, y1, ..., yt, ...} be the time series that you want to smooth, where yt is the value of the response at time t.

  • The simple moving average at time t is the arithmetic mean of the series at yt and the previous k-1 time points. In symbols,
          MA(t; k) = (1/k) Σ yi
    where the summation is over the k values {yt-k+1, ..., yt}.
  • The weighted moving average (WMA) at time t is a weighted average of the series at yt and the previous k-1 time points. Typically the weights monotonically decrease so that data from "long ago" contribute less to the average than recent data. If the weights sum to unity (Σ wi = 1) then
          WMA(t; k) = Σ wi yi
    If the weights do not sum to unity, then divide that expression by Σ wi.
  • The exponentially weighted moving average (EWMA) does not use a finite rolling window. Instead of the parameter k, the EWMA uses a decay parameter α, where 0 < α < 1. The smoothed value at time t is defined recursively as
          EWMA(t; α) = α yt + (1 - α) EWMA(t-1; α)
    You can "unwind" this equation to obtain the EWMA as a WMA where the weights decrease geometrically. The choice of α determines the smoothness of the EWMA. A value of α ≈ 1 implies that older data contribute very little to the average. Conversely, small values of α imply that older data contribute to the moving average almost as much as newer data.

Each of these definitions contains an ambiguity for the first few values of the moving average. For example, if t < k, then there are fewer than k previous values in the MA and WMA methods. Some practitioners assign missing values to the first k-1 values, whereas others average the values even when fewer than k previous data points exist. For the EWMA, the recursive definition requires a value for EWMA(0; α), which is often chosen to be y0.

My next blog post shows how to compute various moving averages in SAS. The article shows how to create the IBM stock price example, which is a time series plot overlaid with MA, WMA, and EWMA curves.

tags: Data Analysis, Getting Started, Math

The post What is a moving average? appeared first on The DO Loop.

1月 132016

Recently I blogged about how to compute a weighted mean and showed that you can use a weighted mean to compute the center of mass for a system of N point masses in the plane. That led me to think about a related problem: computing the center of mass (called the centroid) for a planar polygon.

If you cut a convex polygon out of stiff cardboard, the centroid is the position where the polygon would balance on the tip of a needle. However, for non-convex polygons, the centroid might not be located in the interior of the polygon.

For a general planar figure, the mathematical formula for computing the centroid requires computing an integral. However, for polygons there are some well-known equations that involve only the locations of the vertices (provided that the vertices are enumerated in a consistent clockwise or counter-clockwise manner). The Wikipedia article about the centroid give a useful computational formula. It turns out that the (signed) area of a simple polygon with vertices (x0, y0), (x1, y1), ..., (xN-1, yN-1), is given by


In the formula, the vertex (xN, yN) is identified with the first vertex, (x0, y0). The centroid of the polygon occurs at the point (Cx, Cy), where

centroidEqn1 centroidEqn2

A polygon whose vertices are enumerated in a counter-clockwise manner will have a positive value for A. The formula for A will be negative when vertices are enumerated in a clockwise manner.

In the SAS language, arrays use 1-based indexing by convention. Therefore, when implementing the formula in SAS, it is customary for the vertices to be enumerated from 1 to N. The limits in the summation formulas are modified similarly. The (N+1)st vertex is identified with the first vertex.

Centroids are sometimes used as a position to label a country or state on a map. For SAS customers who have a license for SAS/GRAPH, the official SAS %CENTROID macro computes the centroid of a polygonal region. It is designed for use with map data sets, which have variables named ID, X, and Y. My colleague Sanjay Matange provides a similar macro in his blog post "A Macro for Polygon Area and Center."

A vectorized SAS/IML function that computes a centroid

Implementing the centroid formulas in SAS/IML is a good exercise in vectorization. Recall that function is vectorized if it avoids unnecessary loops and instead uses vector and matrix operations to perform the computations. For the area and centroid formulas, you can form a vector that contains all of the values that are to be summed. You can use the elementwise multiplication operator (#) to compute the elementwise product of two vectors. The SUM function adds up all elements in a vector.

The following function computes the centroid of a single polygon in a vectorized manner. The first few statements build the vectors from the values of the vertices; the last four statements compute the area and centroid by using vectorized operations. There are no loops.

proc iml;
/* _PolyCentroid: Return the centroid of a simple polygon.
   P  is an N x 2 matrix of (x,y) values for consectutive vertices. N > 2. */
start _PolyCentroid(P);
   lagIdx = 2:nrow(P) || 1;              /* vertices 2, 3, ..., N, 1 */
   xi   = P[ ,1];       yi = P[ ,2];     /* x_i and y_i for i=1..N   */
   xip1 = xi[lagIdx]; yip1 = yi[lagIdx]; /* x_{i+1} and y_{i+1}, x_{N+1}=x_1 */
   d = xi#yip1 - xip1#yi;                /* vector of difference of products */
   A = 0.5 * sum(d);                     /* signed area of polygon */
   cx = sum((xi + xip1) # d) / (6*A);    /* X coord of centroid */
   cy = sum((yi + yip1) # d) / (6*A);    /* Y coord of centroid */
   return( cx || cy );
/* test the function by using an L-shaped polygon */
L = {0 0, 6 0, 6 1, 2 1, 2 3, 0 3};
centroid = _PolyCentroid(L);
print centroid;

For this L-shaped polygon, the centroid is located outside the polygon. This example shows that the centroid might not be the best position for a label for a polygon. Examples of this phenomenon are easy to find on real maps. For example, the states of Louisiana and Florida have centroids that are near the boundaries of the states.

Compute the centroid of many polygons

The %CENTROID macro can compute the centroids for multiple polygons, assuming that there is an identifying categorical variable (called the ID variable) whose unique values distinguish one polygon from another.

If you want to compute the centroids of multiple polygons in SAS/IML, you can use the UNIQUEBY function to extract the rows that correspond to each polygon. For each polygon, the previous _PolyCentroid function is called, as follows:

/* If the polygon P has two columns, return the centroid. If it has three 
   columns, assume that the third column is an ID variable that identifies
   distinct polygons. Return the centroids of the multiple polygons. */
start PolyCentroid(P);
   if ncol(P)=2 then  return( _PolyCentroid(P) );
   ID = P[,3];
   u = uniqueby(ID);         /* starting index for each group */
   result = j(nrow(u), 2);   /* allocate vector to hold results */
   u = u // (nrow(ID)+1);    /* append (N+1) to end of indices */
   do i = 1 to nrow(u)-1;    /* for each group... */
      idx = u[i]:(u[i+1]-1); /* get rows in group */
      result[i,] = _PolyCentroid( P[idx, 1:2] ); 
   return( result );
/* test it on an example */
rect = {0 0,  2 0,  2 1,  0 1};
ID = j(nrow(L), 1, 1) // j(nrow(rect), 1, 2);
P = (L // rect) || ID;
centroids = PolyCentroid(P);
print centroids;

In summary, you can construct an efficient function in SAS/IML to compute the centroid of a simple polygon. The construction shows how to vectorize a computation by putting the vertices in vectors, rather than the less efficient alternative, which is to iterate over the vertices in a loop. Because polygons are often used to display geographical regions, I also included a function that iterates over polygons and computes the centroid of each.

tags: Math, vectorization

The post Compute the centroid of a polygon in SAS appeared first on The DO Loop.

12月 162015
Lo how a rose e'er blooming
From tender stem hath sprung.

As I write this blog post, a radio station is playing Chrismas music. One of my favorite Christmas songs is the old German hymn that many of us know as "Lo, How a Rose E're Blooming." I was humming the song recently when someone asked me how to use SAS to draw curves in polar coordinates. I immediately thought about "polar roses," mathematical parametric curves of the form r = cos(k θ).

In the spirit of Christmas, I present "Lo, how a polar rose e'er blooming, from SGPLOT hath sprung."

Plotting polar coordinates in SAS

It is easy to graph a polar curve in SAS. Here's a quick overview, or a "polar express" tour, if you prefer.

Many familiar polar equations are parametric curves. The simplest are relations are of the form r = f(θ), where θ is the polar angle. The following SAS DATA step creates evenly spaced values of theta in the range [0, 2π]. For each value of theta, it computes r = cos(k θ), which is the equation for a rose with k petals. The polar coordinates (r, θ) are converted to Euclidean coordinates through the usual transformation x = r*cos(theta) and y = r*sin(theta).

After creating the points along the curve, you can visualize it by using PROC SGPLOT. The SERIES statement is used to display the parametric curve in Euclidean coordinates. The XAXIS and YAXIS statements use the MIN= and MAX= options to ensure that the image appears in a square region, and the ASPECT=1 option is used to ensure that the aspect ratio of the plot does not distort the geometry of the rose.

%let k = 5;
/* draw the curve r=cos(k * theta) in polar coords, which is a k-leaved rose */
data Rose;
do theta = 0 to 2*constant("pi") by 0.01;
   r = cos(&k * theta);       /* rose */
   x = r*cos(theta);          /* convert to Euclidean coords */
   y = r*sin(theta);
title "Polar Rose: r = cos(&k theta)";;
proc sgplot data=Rose aspect=1;
   refline 0 / axis=x;
   refline 0 / axis=y;
   series x=x y=y;
   xaxis min=-1 max=1;
   yaxis min=-1 max=1;

Graphing generalized roses

The Wikipedia page for the polar rose contains a brief discussion about what the curve r = cos(k θ) looks like when k is a rational number n/d. When k is not an integer, most of the curves do not look like a rose—or any other kind of flower! Only a myopic mathematician would call some of these images roses.

Regardless, let's see how to get some of these "roses" to bloom in SAS. The following DATA step generates all positive rational numbers of the form n/d, where n ≤ 7 and d ≤ 9 are integers. For each value of k, the program computes the generalize rose equation and converts the values from polar to Euclidean coordinates:

data Roses;
do n = 1 to 7;
   do d = 1 to 9;
      k = n / d;
      /* draw the rose r=cos(n/d * theta) */
      do theta = 0 to 2*lcm(n,d)*constant("pi") by 0.05;
         r = cos(k * theta);      /* generalized rose */
         x = r*cos(theta);        /* convert to Euclidean coords */
         y = r*sin(theta);

It is not any harder to visualize 63 roses than it is to visualize one. The SGPANEL procedures was designed to display multiple cells, each with the same type of graph. Therefore, to display an entire bouquet of roses, you merely have to specify the N and D variables on the PANELBY statement, and SAS handles the rest:

title "Polar Roses: r = cos(n/d theta)";
proc sgpanel data=Roses aspect=1;
   panelby n d / layout=lattice onepanel rowheaderpos=left;
   refline 0 / axis=x transparency=0.5;
   refline 0 / axis=y transparency=0.5;
   series x=x y=y;
   colaxis display=none;
   rowaxis display=none;

You can see certain numeric patterns in the lattice display. In particular, when n/d is a reducible fraction (such as 2/4, 3/6, and 4/8), the image is identical to the reduced fraction (such as 1/2). Do you see other patterns? Leave a comment.

These roses are beautiful, like the song that inspired me to create them. Since I get up early in the morning to write my blog, a few additional lyrics seem appropriate:

It came a blossom bright.
Amid the cold of winter
When half spent was the night.

Happy holidays to all my readers. I am taking a short vacation and will return in 2016. Until then, keep your eyes open: math, statistics, and beauty are all around you, ever blooming!

tags: Just for Fun, Math

The post Lo, how a polar rose e'er blooming appeared first on The DO Loop.

6月 242015
Image from Wikipedia:

In my article about finding an initial guess for root-finding algorithms, I stated that Newton's root-finding method "might not converge or might converge to a root that is far away from the root that you wanted to find." A reader wanted more information about that statement.

I have previously shown how to implement Newton's method in SAS. The behavior of Newton's method depends on the initial guess. If you provide a guess that is sufficiently close to a simple root, Newton's method will converge quadratically to the nearby root. However, if your guess is near a critical point of the function, Newton's method will produce a "next guess" that is far away from the initial guess. Further iterations might converge to an arbitrary root, might endlessly cycle in a periodic or aperiodic manner, or might diverge to infinity.

The dynamics of Newton iteration can be quite complex. If you owned a PC in the 80's and early 90's, you might have spent countless hours computing Mandelbrot sets and Julia sets. You might have seen pictures like the one at the beginning of this article, which show the domains of attraction for Newton's iteration for a cubic polynomial. In the picture, each point in the complex plane is colored according to which root Newton's method converges to when it begins at that point. The points that eventually converge to a root are the Fatou set, whereas the points that do not converge form the Julia set.

The sensitivity of Newton's method

You can perform the same kind of computer experiments for Newton's method applied to a real function. (You can download the SAS/IML programs that I used to create the graphs in this article.) Consider the polynomial f(x) = x5 – 2 x4 – 10 x3 + 20 x2 + 9 x – 18. The roots of the polynomial are {-3, -1, 1, 2, 3}. The polynomial has critical points (where the derivative vanishes) near -2.3, -0.2, 1.5, and 2.6. Recall that Newton's method involves iteration of the rational function N(x) = xf(x)/f'(x), which has singularities at the critical points of f.

You can ask the following question: For each point, x, to which root does Newton's method converge when x is the initial guess? You can also keep track of how many iterations it takes for Newton's method to converge to a root.


If you apply Newton's method to 250 initial conditions on the interval [-4, 4], you get the results that are summarized in the needle plot to the left. (Click to enlarge.) The color of the needle at x indicates the root to which x converges under Newton's method. The height of the needle indicates the number of iterations required to converge.

You can see that initial guesses that are close to a root converge to the nearby root in five or fewer iterations. Near the critical points, Newton's method requires more iterations to converge, often more than 10 and sometimes more than 20 iterations. The regions near the critical points are interlaced bands of color, which indicates that the dynamics of Newton's method is complicated in those regions. A small change in the initial guess can result in a big difference in the Newton iterations.


The dynamics near 0 seem interesting, so let's zoom into that region. In particular, repeat the previous computation for 250 initial conditions on the interval [-0.5, 0.5]. The needle plot to the left reveals additional details. All roots can be reached from initial guesses in this region, even though the nearest roots are at x = -1 and x = 1 (roots #2 and #3). Again, there are regions for which many iterations are required and there is an interlacing of colors, which indicates that newton's method is sensitive to the initial guess.

You can zoom again into a multicolored region and repeat the computations on a new interval. The behavior continues ad infinitum. You can find two initial guesses that differ by an arbitrarily small amount, yet their iteration under Newton's method eventually diverge and result in different roots. This is known as sensitive dependence on initial conditions, or, more poetically, as the butterfly effect.


Newton's method is one of my favorite root-finding techniques. However, it is important to understand that the famous quadratic convergence of Newton's method applies to initial guesses that are close to a root. For an arbitrary initial guess, Newton's method can be result in divergence, periodic orbits, or convergence to a far-away root. Fractals and chaos are fun topics to explore, but not when you need to find a root as part of your work. Therefore I recommend using a pre-search method, as described in my previous article, to obtain a good initial guess.

Further details

tags: Math, Numerical Analysis

The post The sensitivity of Newton's method to an initial guess appeared first on The DO Loop.

6月 112015

"Daddy, help! Help me! Come quick!"

I heard my daughter's screams from the upstairs bathroom and bounded up the stairs two at a time. Was she hurt? Bleeding? Was the toilet overflowing?

When I arrived in the doorway, she pointed at the wall and at the floor. The wall was splattered with black nail polish. On the floor laid a broken bottle in an expanding pool of black ooze.

"It slipped," she sobbed.

As a parent, I know that there are times when I should not raise my voice. I knew intellectually that this was one of those times. But staring at that wall, seeing what I was seeing, I could not prevent myself from yelling.

"Oh my goodness!" I exclaimed. "Is that a logarithmic spiral?"


There on the wall was a spiral formed from the drops of black nail polish as the bottle had spiraled out of my daughter's hands and ejected its contents. Logarithmic spirals appear in many natural phenomena, such as the spiral arms of galaxies, the bands of hurricanes, and the curve of a mollusk's shell. However, I had not expected to see such a beautiful spiral of splattered nail polish on my wall.

I took a picture of the spiral, and we began a futile attempt to clean the wall. The nail polish proved impossible to wipe from the wall, although we were able to clean the tile floor.

How do you fit a logarithmic spiral?

As I was cleaning up the mess, my mind was racing. Is this really a logarithmic spiral? Could I construct a "spiral of best fit" that passes through the splattered drops?

I remembered that I once fit a circle to data by minimizing the radial distance between the data points and a circle with center (x0, y0) and radius R. This problem seems similar, but there are two differences. In the circle problem, I used the implicit equation of a circle (x-x0)2 + (y-y0)2 = R2. The logarithmic spiral is usually given explicitly as a polar (parametric) curve with the standard form R(θ) = a*exp(bθ), where a and b are unknown parameters. The second problem is that the spiral has multiple branches. For example, in the vertical direction there are several polar angles (π/2 and 5π/2), each with a different radius.

To construct a "spiral of best fit," you can do the following:

  1. Choose a coordinate system and use the photo to find the (x,y) coordinates of points along the spiral curve.
  2. Create an objective function whose minimum will produce parameters for the spiral that best fits the data. This problem has four parameters: The center of the spiral (x0, y0) and the two parameters a and b the amplitude and decay rate for the logarithmic spiral.
  3. Fit the curve to the data and see if the splatter marks on the wall appear to be modeled by the curve.

Finding points along the curve


As punishment for my daughter's carelessness, I told her that she had to help me input data from the photograph. She claimed that this punishment was "cruel and unusual," but she did it anyway.

I printed out the photo and used a ruler to draw a grid of lines 1 centimeter apart across the page. I chose the origin to be a big blob of nail polish at the end of spiral, about 1/3 up from the bottom of the image. I then marked points at 5 mm intervals along the curve and read out the coordinates (to the nearest 1 mm) as my daughter typed them into SAS. The resulting coordinates are shown in the scatter plot. A large portion of the nail polish landed on the door-frame molding, which is raised 1.5 cm above the wall. That explains the apparent discontinuity in the upper left portion of the graph.

Create an objective function

To find the logarithmic spiral that best fit the data, you can minimize a function that computes the sum of the squared distances (in the radial direction) between the data and the predicted points along the spiral. Because the center of the spiral is unknown, you have fit four parameters: the center of the spiral (x0, y0) and the parameters a and b.

For given values (x0, y0), you can center the (x, y) coordinates by subtracting the coordinates of the center point. In a previous blog post, I showed how to obtain an increasing sequence of polar angles from the coordinates of (centered) data that are ordered along a parametric curve. Let θi be the polar angle that corresponds to the point (xi, yi). By using the fact that Euclidean and polar coordinates are related by x = R cos(θ) and y = R sin(θ), you can write down the predicted Euclidean coordinates of the logarithmic spiral:
ui = a cos(thetai) exp(b*thetai)
vi = a sin(thetai) exp(b*thetai)

The objective is therefore to minimize the radial distance between the predicted and observed values:
F(x0, y0, a, b) = Σi   (xi-ui)2 + (yi-vi)2

Fit a logarithmic spiral to the data

To optimize the function I decided to use the trust region method (the NLPTR call) in SAS/IML software, just as I did when fitting a circle to data. Based on the scatter plot, I guessed that the center was approximately (x0, y0) = (1, -1), and that a ≈ 26 and b ≈ -0.5.


You can download the complete SAS program, which uses SAS/IML to optimize the parameters. The logarithmic spiral that best fits the data is shown overlaid on the scatter plot of the data. The curve fits remarkably well, with the obvious exception of the nail polish that splattered on the door frame (trim) instead of on the wall. If that section of data were moved upwards by 1–1.5 cm, then the fit would be even better. For the record, the parameter estimates are (x0, y0) = (0.69, -1.06), a = 22.34, and b = -0.35.

All models are wrong...including this one

As George Box famously said, "All models are wrong, but some are useful." However, this spiral model might be both wrong and useless.

I was so stunned by the appearance of this spiral on my wall that I didn't stop to think about whether a logarithmic spiral is a curve that can appear when a rotating bottle falls under the effects of gravity. The physics of falling objects is well understood.

Of course, I have no idea how the bottle was tumbling when it left my daughter's hands, nor do I know what angle the bottle was moving relative to the wall. Nevertheless, you can make some simplifying assumptions (bottle falling straight down, rotating in the horizontal plane as it falls,...) and compute the path that the nail polish would trace out in that simple situation. The physics-based model (not shown) also looks similar to the pattern on my wall. It's been a long time since freshman physics, and this is not a physics blog, so I invite the interested reader to download the SAS program and read my attempt to derive a physics-based model.

The main difference between the physics-based path and the logarithmic spiral is the function that is responsible for the radial decay of the spiral. In the logarithmic spiral, the decay function is the exponential function, which is a decreasing function that never reaches zero. In the physics-based model, the decay function is related to the distance between the bottle and the wall, which is a decreasing function that reaches zero in finite time. Any decreasing function will lead to a spiral, some of which have names. For example, a linear decreasing function is an Archimedean spiral.

I think that these data can be fit by many families of spirals because the bottle rotated only about 1.25 times before it hit the wall. To distinguish between the spirals, the data should show at least two full rotations so that the decay function can be more accurately modeled. However, no matter what model you choose, you can use the same optimization ideas to fit the parameters in the model.

To me it doesn't matter whether the spiral on my wall was logarithmic, Archimedean, or something else. It was cool to see, and I got to spend a little time talking about math with my daughter, which is a special treat. And, of course, I enjoyed the statistical modeling and SAS programming that I used to analyze these data.

Now I must stop writing. I have to paint that wall!

tags: Data Analysis, Math

The post The spiral of splatter appeared first on The DO Loop.