9月 102018
 

The customer journey is at the forefront of every discussion about modern marketing. The idea that customers move in premeditated or, at the very least, marketer-meditated paths between well-defined states is alluring (and comforting) to a marketing professional. Of course, even a cursory examination of a web path analytics (Sankey) [...]

AI-based customer journey optimization was published on Customer Intelligence Blog.

9月 102018
 

Given a rectangular grid with unit spacing, what is the expected distance between two random vertices, where distance is measured in the L1 metric? (Here "random" means "uniformly at random.") I recently needed this answer for some small grids, such as the one to the right, which is a 7 x 6 grid. The graph shows that the L1 distance between the points (2,6) and (5,2) is 7, the length of the shortest path that connects the two points. The L1 metric is sometimes called the "city block" or "taxicab" metric because it measures the distance along the grid instead of "as the bird flies," which is the Euclidean distance.

The answer to the analogous question for the continuous case (a solid rectangle) is difficult to compute. The main result is that the expected distance is less than half of the diameter of the rectangle. In particular, among all rectangles of a given area, a square is the rectangle that minimizes the expected distance between random points.

Although I don't know a formula for the expected distance on a discrete regular grid, the grids in my application were fairly small so this article shows how to compute all pairwise distances and explicitly find the average (expected) value. The DISTANCE function in SAS/IML makes the computation simple because it supports the L1 metric. It is also simple to perform computer experiments to show that among all grids that have N*M vertices, the grid that is closest to being square minimizes the expected L1 distance.

L1 distance on a regular grid

An N x M grid contains NM vertices. Therefore the matrix of pairwise distances is NM x NM. Without loss of generality, assume that the vertices have X coordinates between 1 and N and Y coordinates between 1 and M. Then the following SAS/IML function defines the distance matrix for vertices on an N x M grid. To demonstrate the computation, the distance matrix for a 4 x 4 grid is shown, along with the average distance between the 16 vertices:

proc iml;
start DistMat(rows, cols);
   s = expandgrid(1:rows, 1:cols);   /* create (x, y) ordered pairs */
   return distance(s, "CityBlock");  /* pairwise L1 distance matrix */
finish;
 
/* test function on a 4 x 4 grid */
D = DistMat(4, 4);
AvgDist = mean(colvec(D));
print D[L="L1 Distance Matrix for 4x4 Grid"];
print AvgDist[L="Avg Distance on 4x4 Grid"];
L1 Distance matrix for vertices on a regular 4 x 4 grid Average L1 distance on a 4 x 4 grid

For an N x M grid, the L1 diameter of the grid is the L1 distance between opposite corners. That distance is always (N-1)+(M-1), which equates to 6 units for a 4 x 4 grid. As for the continuous case, the expected L1 distance is less than half the diameter. In this case, E(L1 distance) = 2.5.

A comparison of distances for grids of different aspect ratios

As indicated previously, the expected distance between two random vertices on a grid depends on the aspect ratio of the grid. A grid that is nearly square has a smaller expected distance than a short-and-wide grid that contains the same number of vertices. You can illustrate this fact by computing the distance matrices for grids that each contain 36 vertices. The following computation computes the distances for five grids: a 1 x 36 grid, a 2 x 18 grid, a 3 x 12 grid, a 4 x 9 grid, and a 6 x 6 grid.

/* average L1 distance on 36 x 36 grid in several configurations */
N=36;
rows = {1, 2, 3, 4, 6};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
/* show average distance as a decimal and as a fraction */
numer = AvgDist*108; AvgDistFract = char(numer) + " / 108"; 
print rows cols AvgDist AvgDistFract;

The table shows that short-and-wide tables have an average distance that is much greater than a nearly square grid that contains the same number of vertices. When the points are arranged in a 6 x 6 grid, the distance matrix naturally decomposes into a block matrix of 6 x 6 symmetric blocks, where each block corresponds to a row. When the points are arranged in a 3 x 12 grid, the distance matrix decomposes into a block matrix of 12 x 12 blocks The following heat maps visualize the patterns in the distance matrices:

D = DistMat(6, 6);
call heatmapcont(D) title="L1 Distance Between Vertices in a 6 x 6 Grid";
D = DistMat(3, 12);
call heatmapcont(D) title="L1 Distance Between Vertices in a 3 x 12 Grid";
Visualization of L1 distance matrix for items arranged on a 6 x 6 grid Visualization of L1 distance matrix for items arranged on a 3 x 12 grid

Expected (average) distance versus number of rows

You can demonstrate how the average distance depends on the number of rows by choosing the number of vertices to be a highly composite number such as 5,040. A highly composite number has many factors. The following computation computes the average distance between points on 28 grids that each contain 5,040 vertices. A line chart then displays the average distance as a function of the number of rows in the grid:

N = 5040;               /* 5040 is a highly composite number */
rows = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20, 
       21, 24, 28, 30, 35, 36, 40, 42, 45, 48, 56, 60, 63, 70};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
 
title "Average L1 Distance Between 5040 Objects in a Regular Grid";
call series(rows, AvgDist) grid={x y} other="yaxis offsetmin=0 grid;";
Expected L1 distance between vertices on a regular grid with 5,040 points. The curve demonstraes that nearly square grids have a much lower average distance than short and wide grids.

Each point on the graph represents the average distance for an N x M grid where NM = 5,040. The horizontal axis displays the value of N (rows). The graph shows that nearly square grids (rows greater than 60) have a much lower average distance than very short and wide grids (rows less than 10). The scale of the graph makes it seem like there is very little difference between the average distance in a grid with 40 versus 70 rows, but that is somewhat of an illusion. A 40 x 126 grid (aspect ratio = 3.15) has an average distance of 55.3; a 70 x 72 grid (aspect ratio = 1.03) has an average distance of 47.3.

Applications and conclusions

In summary, you can use the DISTANCE function in SAS/IML to explicitly compute the expected L1 distance (the "city block" distance) between random points on a regular grid. You can minimize the average pairwise distance by making the grid as square as possible.

City planning provides a real-world application of the L1 distance. If you are tasked with designing a city with N buildings along a grid, then the average distance between buildings is smallest when the grid is square. Of course, in practice, some buildings have a higher probability of being visited than others (such as a school or a grocery). You should position those buildings in the center of town to shorten the average distance that people travel.

The post Distances on rectangular grids appeared first on The DO Loop.

9月 072018
 

I usually create very technical graphs, that just focus on conveying the information in a concise and straightforward manner (no wasted colors, and nothing to distract you from the data). But sometimes, depending on your audience and the purpose of the graph, you might need to create a graph that [...]

The post Creating industry-specific infographics (eg, 3d printing) appeared first on SAS Learning Post.

9月 072018
 

I usually create very technical graphs, that just focus on conveying the information in a concise and straightforward manner (no wasted colors, and nothing to distract you from the data). But sometimes, depending on your audience and the purpose of the graph, you might need to create a graph that [...]

The post Creating industry-specific infographics (eg, 3d printing) appeared first on SAS Learning Post.

9月 072018
 

PC Magazine defines the broad industry term Software as a Service (SaaS) as, “Software that is rented rather than purchased. Instead of buying applications and paying for periodic upgrades, SaaS is subscription based, and upgrades are automatic during the subscription period.” SaaS, according to the same source, is ideally suited [...]

Does software as a service work for analytics? was published on SAS Voices by David Annis

9月 062018
 

Continued fractions show up in surprising places. They are used in the numerical approximations of certain functions, including the evaluation of the normal cumulative distribution function (normal CDF) for large values of x (El-bolkiny, 1995, p. 75-77) and in approximating the Lambert W function, which has applications in the modeling of heavy-tailed error processes. This article shows how to use SAS to convert a rational number into a finite continued fraction expansion and vice versa.

What is a continued fraction?

I discussed continued fractions in a previous post about the contiunued fraction expansion of pi. Recall that every real number x can be represented as a continued fraction:

In this continued fraction, a0 is the integer portion of the number and the ai for i > 0 are positive integers that represent the noninteger portion. Because all of the fractions have 1 in the numerator, the continued fraction can be compactly represented by specifying only the integers: x = [a0; a1, a2, a3, ...]. Every rational number is represented by a finite sequence; every irrational number requires an infinite sequence. Notice that the decimal representation of a number does NOT have this nice property: 1/3 is rational but its decimal representation is infinite.

Convert a finite continued fraction to a rational number

My article about the continued fraction expansion of pi contains a few lines of SAS code that compute the decimal representation of a finite continued fraction. Since every finite continued fraction corresponds to a rational number, you can modify the algorithm to compute the rational number instead of a decimal approximation.

The algorithm is straightforward. Given a vector
a = [a0; a1, a2, ..., ak],
start at the right side, form the fraction sk = 1 / ak. Then move to the left and compute the fraction sk-1 = ak-1 + sk. Continue moving to the left until all fractions are added. In the SAS/IML language (or in the SAS DATA step), you can represent each fraction as a two-element vector where the first element is the numerator of the fraction and the second is the denominator of the fraction. This leads to the following algorithm that computes a rational number from a finite continued fraction expansion:

proc iml;
start ContinuedFrac(_a);
   a = colvec(_a);
   f = a[nrow(a)] // 1;         /* trick: start with reciprocal */
   do i = nrow(a)-1 to 1 by -1; /* evaluate from right to left */
      f = f[2:1];               /* take reciprocal */
      f[1] = a[i]*f[2] + f[1];  /* compute new numerator */
   end;
   return f;
finish;
 
a = {3, 4, 12, 4}; 
f = ContinuedFrac(a);      /* 649 / 200 */
pi = {3, 7,15,1,292,1,1,1,2,1,3,1};
f_pi = ContinuedFrac(pi);  /* 5419351 / 1725033 */
e = {2, 1,2,1,1,4,1,1,6,1,1,8};
f_e = ContinuedFrac(e);    /* 23225 / 8544 */
print (f||f_pi||f_e)[r={"Numer" "Denom"} c={"f" "f_pi" "f_e"}];
Convert (finite) continued fractions to rational numbers

The examples show a few continued fraction expansions and the rational numbers that they represent:

  • The continued fraction expansion (cfe) [3; 4, 12, 4] corresponds to the fraction 649 / 200.
  • The 12-term partial cfe for pi corresponds to the fraction 5419351 / 1725033. You might have learned in grade school that 22/7 is an approximation to pi. So to is this fraction.
  • The partial cfe for e (using 12 terms) corresponds to the fraction 23225 / 8544.

Rational numbers to continued fractions

You can program the inverse algorithm to produce the cfe from a rational number. The algorithm is called the Euclidean algorithm and consists of finding the integer part, subtracting the integer part from the original fraction to obtain a remainder (expressed as a fraction), and then repeating this process. The process terminates for every rational number. In the SAS/IML language, the Euclidean algorithm looks like this:

start EuclidAlgRational(val, nSteps=10);
   a = j(nSteps, 1, .);
   x = val;
   done = 0;
   do i = 1 to nSteps while(^done);
      a[i] = floor(x[1] / x[2]);
      x[1] = x[1] - a[i]*x[2];  /* remainder */
      if x[1]=0 then done=1;
      x = x[2:1];               /* reciprocal */
   end;
   idx = loc(a=.);
   if ncol(idx)>0 then a = colvec(remove(a, idx));
   return a;
finish;
 
v = {649, 200};          /* = [3,4,12,4] */
a = EuclidAlgRational(v);
v = {5419351, 1725033};  /* pi = [3; 7,15,1,292,1,1,1,2,1,3,1,...] */
a_pi = EuclidAlgRational(v);
v = {23225, 8544};       /* e = [2; 1,2,1,1,4,1,1,6,1,1,8,...] */
a_e = EuclidAlgRational(v);
print a a_pi a_e;
Convert rational numbers to (finite) continued fractions

Each column of the table shows the continued fraction representations of a rational number. The output shows that the ContinuedFrac and EuclidAlgRational functions are inverse operations: each function undoes the action of the other.

Two amazing properties of the continued fraction representation

There are some very interesting mathematical properties of continued fractions. Two statistical properties are discussed in Barrow (2000), "Chaos in Numberland: The secret life of continued fractions." Both properties are illustrated by the first 97 terms in the continued fraction expansion of pi, which are
π = [3; 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, 2, 2, 2, 1, 84, 2, 1, 1,
15, 3, 13, 1, 4, 2, 6, 6, 99, 1, 2, 2, 6, 3, 5, 1, 1, 6, 8, 1, 7, 1, 2, 3, 7,
1, 2, 1, 1, 12, 1, 1, 1, 3, 1, 1, 8, 1, 1, 2, 1, 6, 1, 1, 5, 2, 2, 3, 1, 2,
4, 4, 16, 1, 161, 45, 1, 22, 1, 2, 2, 1, 4, 1, 2, 24, 1, 2, 1, 3, 1, 2, 1, ...]

The first property concerns the probability distribution of the ak for sufficiently large values of k. Notice that the continued fractions in this article mostly contain small integers such as 1, 2, 3, and 4. It turns out that this is generally true. For almost every irrational number in (0,1) and for k sufficiently large, the k_th term ak is more likely to be a small integer than a large integer. In fact, the Gauss-Kuzmin theorem states that the distribution of ak (thought of as a random variable) asymptotically approaches the probability mass function P(ak = t) = -log2(1 - 1/(1+t)2), which is shown below for t ≤ 20. Thus for a random irrational number, although the value of, say, the 1000th term in the continued fraction expansion could be any positive integer, it is probably less than 20!

The second interesting mathematical property is Khinchin's Theorem, which states that for almost every irrational number in (0,1), the geometric mean of the first t digits in the continued fraction expansion converges to a constant called Khinchin's constant, which has the value 2.685452001.... For example, the first 97 terms in the continued fraction expansion of pi have a geometric mean of 2.685252.

Gauss-Kuzmin distribution of the coefficients of a continued fraction expnsion

The post The continued fraction representation of a rational number appeared first on The DO Loop.

9月 062018
 

Many of the most beautiful areas in the US are owned by the government, to preserve them and allow access for everyone to enjoy them. And most US schools are traditionally closed during the summer, which provides families a great opportunity to go visit state and federal lands (parks, forests, [...]

The post How much land does the government own in each US state? appeared first on SAS Learning Post.