Rick Wicklin

January 12, 2022
 

Some colors have names, such as "Red," "Magenta," and "Dark Olive Green." But the most common way to specify a color is to use a hexadecimal value such as CX556B2F. It is not obvious that "Dark Olive Green" and CX556B2F represent the same color, but they do! I like to use color names (when possible) instead of hexadecimal values because the names make a program more readable. For example, a color ramp that is defined by using the names ("DarkSeaGreen" "SandyBrown" "Tomato" "Sienna") is easier to interpret than the equivalent color ramp that is defined by using the hexadecimal values (CX8FBC8F CXF4A460 CXFF6347 CXA0522D).

This article shows how to find a "named color" that is close to any color that you specify. Shakespeare asked, "What's in a name?" To paraphrase his response, this article shows that the name of "Rose" looks just as sweet as CXFF6060 but is easier to use!

Colors in SAS

When you create a graph in SAS, there are three ways to specify colors: use a pre-defined color name from the SAS registry, use the SAS-naming convention to specify hues, or use hexadecimal values to specify an 8-bit color for the RGB color model. An example of a pre-defined color name is "DarkOliveGreen," an example of a hue-based color is "Dark Moderate Green," and an example of a hexadecimal value is CX556B2F. Each hexadecimal value encodes the three RGB values for the color. For example, the hexadecimal values 55, 6B, and 2F correspond to the decimal integers 85, 107, and 47, so CX556B2F can be thought of as the RGB triplet (85, 107, 47).
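The conversion is easy to verify. The following DATA step (a minimal sketch, not part of the original program) uses the HEX2. informat to extract the RGB coordinates of CX556B2F:

data _null_;
   hex = "CX556B2F";                          /* hexadecimal color value */
   R = inputn(substr(hex, 3, 2), "HEX2.");    /* '55' -> 85  */
   G = inputn(substr(hex, 5, 2), "HEX2.");    /* '6B' -> 107 */
   B = inputn(substr(hex, 7, 2), "HEX2.");    /* '2F' -> 47  */
   put R= G= B=;                              /* R=85 G=107 B=47 */
run;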

In my SAS registry, there are 151 pre-defined color names, whereas there are 256³ ≈ 16.7 million 8-bit RGB colors. Clearly, there are many RGB colors that do not have names! I thought it would be interesting to write a program that finds the closest pre-defined name to any RGB color that you specify. You can think of each color as a three-dimensional point (R, G, B), where R, G, and B are integers and 0 ≤ R,G,B ≤ 255. Thus, the space of all 8-bit colors is a three-dimensional integer lattice. Colors that are close to each other (in the Euclidean metric) have similar shades. Consequently, you can find the named color that is closest to any given color by using the following steps:

  1. Load the pre-defined color names and their RGB values.
  2. For any specified hexadecimal value, convert it to an RGB value.
  3. In RGB coordinates, find the pre-defined color name that is closest (in the Euclidean metric) to the specified value.

For example, if you specify an unnamed color such as CXE99F62=RGB(233, 159, 98), the program can tell you that "SandyBrown"=RGB(244, 164, 96) is the closest pre-defined color to CXE99F62. If you want to make your program more readable (and don't mind modifying the hues a little), you can replace CXE99F62 with "SandyBrown" in your program.
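How close are these two colors? In the Euclidean metric on RGB coordinates, the distance between CXE99F62 and "SandyBrown" is
\(\sqrt{(233-244)^2 + (159-164)^2 + (98-96)^2} = \sqrt{150} \approx 12.2\)
units, which matches the distance reported later in this article and is small relative to the RGB cube, whose edges have length 255.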

Read colors from the SAS registry

For the reference set, I will use the pre-defined colors in the SAS registry, but you could use any other set of names and RGB values. The SAS documentation shows how to use PROC REGISTRY to list the colors in your SAS registry.

The following program modifies the documentation example and writes the registry colors to a temporary text file:

filename _colors TEMP;    /* create a temporary text file */
 
/* write text file with colors from the registry */
proc registry startat='HKEY_SYSTEM_ROOT\COLORNAMES' list export=_colors;
run; 
 
/* In the flat file, the colors look like this:
"AliceBlue"= hex: F0,F8,FF
"AntiqueWhite"= hex: FA,EB,D7
"Aqua"= hex: 00,FF,FF
"Aquamarine"= hex: 7F,FD,D4
"Azure"= hex: F0,FF,FF
"Beige"= hex: F5,F5,DC
"Bisque"= hex: FF,E4,C4
"Black"= hex: 00,00,00
...
*/

You can use a DATA step to read the registry values and create a SAS data set that contains the names, the hexadecimal representation, and the RGB coordinates for each pre-defined color:

data RegistryRGB;
   infile _colors end=eof;         /* read from text file; last line sets EOF flag to true */
   input;                          /* read one line at a time into _infile_ */
 
   length ColorName $32 hex $8; 
   retain hex "CX000000";
   s = _infile_;
   k = findw(s, 'hex:');          /* does the string 'hex:' appear? */
   if k then do;                  /* this line contains a color */
      i = findc(s, '=', 2);       /* find the '=' that follows the closing quotation mark (") */
      ColorName = substr(s, 2, i-3);            /* name is between the quotes */
      /* build up the hex value from a comma-delimited value like 'FA,EB,D7' */
      substr(hex, 3, 2) = substr(s, k+5 , 2);
      substr(hex, 5, 2) = substr(s, k+8 , 2);
      substr(hex, 7, 2) = substr(s, k+11, 2);
 
      R = inputn(substr(hex, 3, 2), "HEX2."); /* get RGB coordinates from hex */
      G = inputn(substr(hex, 5, 2), "HEX2.");
      B = inputn(substr(hex, 7, 2), "HEX2.");
   end;
   if k;
   drop k i s;
run;
 
proc print data=RegistryRGB(obs=8); run;

The above program works in SAS 9 and also in SAS Viya if you submit the program through SAS Studio. My versions of SAS each have 151 pre-defined colors. The output from PROC PRINT shows that the RegistryRGB data set contains the ColorName, Hex, R, G, and B variables, which describe each pre-defined color.

Find the closest "named color"

The RegistryRGB data set enables you to answer the following question: Given an RGB color, what "named color" is it closest to?

For example, in the article "A statistical palette of Christmas colors," I created a palette of colors that had the values {CX545733, CX498B60, CX94AF77, CXE99F62, CXF46B35, CXAA471D}. These colors are shown to the right, but it would be challenging to look solely at the hexadecimal values and know what colors they represent. However, if I told you that the colors were close to named colors such as "DarkOliveGreen," "SandyBrown," and "Tomato," you would have a clue about what colors the hexadecimal values represent.

The following program uses two functions from previous articles:

proc iml;
/* function to convert an array of colors from hexadecimal to RGB 
   https://blogs.sas.com/content/iml/2014/10/06/hexadecimal-to-rgb.html */
start Hex2RGB(_hex);
   hex = colvec(_hex);        /* convert to column vector */
   rgb = j(nrow(hex),3);      /* allocate three-column matrix for results */
   do i = 1 to nrow(hex);     /* for each color, translate hex to decimal */
      rgb[i,] = inputn(substr(hex[i], {3 5 7}, 2), "HEX2.");
   end;
   return( rgb);
finish;
 
/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  S    an (n x d) data matrix
           R    an (m x d) matrix of reference points
           k    specifies the number of nearest neighbors (k>=1) 
   OUTPUT: idx  an (n x k) matrix of row numbers. idx[,j] contains the
                row numbers (in R) of the j_th closest elements to S
           dist an (n x k) matrix. dist[,j] contains the distances
                between S and the j_th closest elements in R
   https://blogs.sas.com/content/iml/2016/09/28/distance-between-two-group.html
*/
start PairwiseNearestNbr(idx, dist, S, R, k=1);
   n = nrow(S);
   idx = j(n, k, .);
   dist = j(n, k, .);
   D = distance(S, R);          /* n x m */
   do j = 1 to k;
      dist[,j] = D[ ,><];       /* smallest distance in each row */
      idx[,j] = D[ ,>:<];       /* column of smallest distance in each row */
      if j < k then do;         /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:n)||idx[,j]);
         D[ndx] = .;            /* set elements to missing */
      end;      
   end;
finish;

With those two functions defined, the remainder of the program is easy: read the reference RGB colors, define a palette of colors, and find the closest reference color to each specified color.

/* read the set of reference colors, which have names */
use RegistryRGB;
read all var {ColorName hex};
close;
RegRGB = Hex2RGB(hex);         /* RGB values for the colors in the SAS registry */
 
/* define the hex values that you want to test */
HaveHex = {CX545733, CX498B60, CX94AF77, CXE99F62, CXF46B35, CXAA471D};
HaveRGB = Hex2RGB(HaveHex);    /* convert test values to RGB coordinates */
 
run PairwiseNearestNbr(ClosestIdx, Dist, HaveRGB, RegRGB);
ClosestName = ColorName[ClosestIdx];  /* names of closest reference colors */
ClosestHex = hex[ClosestIdx];         /* hex values for closest reference colors */
print HaveHex ClosestHex ClosestName Dist;

The table shows the closest reference color to each specified color. For example, the color CX545733 is closest to the reference color "DarkOliveGreen." How close are they? In three-dimensional RGB coordinates, they are about 20.4 units apart. If you want to see the difference in each coordinate direction, you can print the difference between the RGB values:

/* how different is each coordinate? */
Diff = HaveRGB - RegRGB[Closestidx,];
print Diff[c={R G B}];

You can see that the red and blue coordinates of CX545733 and "DarkOliveGreen" are almost identical. The green coordinates differ by 20 units or about 8%. The "SandyBrown" color is a very good approximation to CXE99F62 because the distance between those colors is about 12.2 units. Every RGB coordinate of "SandyBrown" is within 11 units of the corresponding coordinate of CXE99F62.

You can display both palettes adjacent to each other to compare how well the reference colors approximate the test colors:

/* visualize the palettes */
ods graphics / width=640px height=160px;
k = nrow(HaveHex);
run HeatmapDisc(1:k, HaveHex) title="Original Palette" ShowLegend=0 
               xvalues=HaveHex yvalues="Colors";
run HeatmapDisc(1:k, ClosestHex) title="Nearby Palette of Registry Colors" ShowLegend=0 
               xvalues=ClosestName yvalues="Colors";

The eye can detect small differences in shades, but the overall impression is that the palette of named colors is very similar to the original palette. The palette of named colors is more informative in the sense that people can visualize "SeaGreen" and "Tomato" without seeing the palette.

Summary

This article discusses how to create a SAS data set that contains the names and RGB values of a set of "named colors." For this article, I used the named colors in the SAS registry. You can use these named colors as reference colors. Given any other color, you can find the reference color that is closest to the specified color. This enables you to describe the color as being "close to SeaGreen" or "close to SandyBrown," which might help you when you discuss colors with your colleagues.

This article is about approximating colors by using a set of reference colors. If you want to visualize the reference colors themselves, Robert Allison has shown how to display a color swatch for each color in a SAS data set.

The post How to assign a name to a color appeared first on The DO Loop.

January 10, 2022
 

On this blog, I write about a diverse set of topics that are relevant to statistical programming and data visualization. In a previous article, I presented some of the most popular blog posts from 2021. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst. However, I also write articles that discuss more advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles deserve a second look. I have grouped them into four categories: SAS programming, statistics, optimization, and data simulation.

SAS programming

For years, I've been writing articles about how to accomplish tasks in Base SAS. In addition, I now write about how to program basic tasks in SAS Viya.

Use the DOLIST Syntax to Specify Tick Marks on a Graph

Probability and statistics

A Family of Density Curves for the Inverse Gamma Distribution

Probability and statistics provide the theoretical basis for modern data analysis. You cannot understand data science, machine learning, or artificial intelligence without understanding the basic principles of statistics. Most readers are familiar with common probability distributions, Pearson correlation, and least-squares regression. The following articles discuss some of the lesser-known statistical cousins of these topics:

  • The inverse gamma distribution: To use any probability distribution, you need to know four essential functions: the density, the cumulative probability, the quantiles, and how to generate random variates. You might be familiar with the gamma distribution, but what is the inverse gamma distribution and how do you define the four essential functions in SAS? (See the sketch after this list.) Similarly, what is the generalized gamma distribution?
  • The Hoeffding D statistic: The Hoeffding D statistic measures the association between two variables. How do you compute the Hoeffding D statistic in SAS, and how do you interpret the results?
  • Weibull regression: In ordinary least squares regression, the response variable is assumed to be modeled by a set of explanatory variables and normally distributed errors. If the error terms are Weibull distributed, you can estimate parameters for the Weibull distribution as part of a regression analysis, but you need to transform the regression estimates to obtain the usual Weibull parameters.
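For the ordinary gamma distribution, the four essential functions are built into Base SAS, as the following sketch shows (the shape parameter is an arbitrary choice); the inverse gamma and generalized gamma distributions require the transformations that are described in the linked articles:

data _null_;
   call streaminit(123);
   alpha = 3;                          /* shape parameter (arbitrary choice) */
   f = pdf("Gamma", 1.5, alpha);       /* density at x=1.5 */
   F = cdf("Gamma", 1.5, alpha);       /* cumulative probability at x=1.5 */
   q = quantile("Gamma", 0.75, alpha); /* 75th percentile */
   x = rand("Gamma", alpha);           /* random variate */
   put f= F= q= x=;
run;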

Optimization

Optimization is at the heart of many statistical techniques, such as maximum likelihood estimation. But sometimes you need to solve a "pure" optimization problem. SAS supports many methods for solving optimization problems:

Multivariate simulation and bootstrapping

It is straightforward to simulate univariate data. It is harder to simulate multivariate data while preserving important relations between the variables, such as correlation. Similarly, it can be challenging to analyze the bootstrap distribution of a multivariate statistic, such as a correlation matrix:

The Geometry of the Iman-Conover Transformation

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2021? If so, leave a comment and tell me what topic you found interesting or useful.

The post 12 blog posts from 2021 that deserve a second look appeared first on The DO Loop.

January 5, 2022
 

You can use the Cholesky decomposition of a covariance matrix to simulate data from a correlated multivariate normal distribution. This method is encapsulated in the RANDNORMAL function in SAS/IML software, but you can also perform the computations manually by calling the ROOT function to get the Cholesky root and then using matrix multiplication to produce correlated multivariate normal (MVN) data, as follows:

%let d = 12;           /* number of variables */
%let N = 500;          /* number of observations */
proc iml;
d = &d;                /* want to generate d variables that are MVN */
N = &N;                /* sample size */
 
/* easy to generate uncorrelated MVN variables */
call randseed(1234);
z = j(d, N);
call randgen(z, "Normal"); /* z is MVN(0,I(d)) */
 
/* use Cholesky root to transform to correlated variables */
Sigma = toeplitz(d:1);   /* example of covariance matrix */
G = root(Sigma);         /* the Cholesky root of the full covariance matrix */
x = G`*z;                /* x is MVN(0, Sigma) */
title "Two Components of MVN(0,Sigma) Data";
call scatter(x[1,], x[2,]) label={x1 x2};

The simulation creates d=12 correlated variables and 500 observations. The first two variables are plotted against each other so that you can see that they are correlated. The other pairs of variables are also correlated according to the entries in the Σ covariance matrix.

Generating a huge number of MVN variables

In the comments to one of my previous articles, a SAS programmer asked whether it is possible to generate MVN data even if the covariance matrix is so large that it cannot be factored by the ROOT function. For example, suppose you want to generate d=20,000 MVN variables. A d x d covariance matrix of that size requires about 3 GB of RAM (20,000² double-precision values at 8 bytes each), and on some operating systems you cannot form the Cholesky root of a matrix that large. However, I have developed an algorithm that enables you to deal with the covariance matrix (Σ) as a block matrix.

To introduce the algorithm, represent the Σ matrix as a 2 x 2 block matrix. (See the last section for a more general representation.) For a 2 x 2 block algorithm, you only need to form the Cholesky roots of matrices of size (d/2), which are 10000 x 10000 matrices for this example. Matrices of that size require only 0.75 GB and can be quickly factored. Of course, there is no such thing as a free lunch: The new algorithm is more complicated and requires additional computations, albeit on smaller matrices.

Let's look at how to simulate MVN data by treating Σ as a block matrix of size d/2. For ease of exposition, assume d is even and each block is of size d/2. However, the algorithm easily handles the case where d is odd: choose the upper blocks to have floor(d/2) rows and the lower blocks to have ceil(d/2) rows. The SAS code (below) handles the general case. (In fact, the algorithm doesn't care about the block size. For any p in (0, 1), one block can be size pd and the other size (1-p)d.)

In block form, the covariance matrix is
\(\Sigma = \begin{bmatrix}A & B\\B^\prime & C\end{bmatrix}\)
There is a matrix decomposition called the \(LDL^\prime\) decomposition that enables you to write Σ as the product of three block matrices:

\(\begin{bmatrix}A & B\\B^\prime & C\end{bmatrix} = \begin{bmatrix}I & 0\\B^\prime A^{-1} & I\end{bmatrix} \begin{bmatrix}A & 0\\0 & S\end{bmatrix} \begin{bmatrix}I & A^{-1}B\\0 & I\end{bmatrix} \)
where \(S = C - B^\prime A^{-1} B\) is the Schur complement of the block A. There is a theorem that says that if Σ is symmetric positive definite (SPD), then so is every principal submatrix, so A is SPD. Thus, we can use the Cholesky decomposition and write \(A = G^\prime_A G_A\) for an upper triangular matrix, \(G_A\). By reordering the variables, C is SPD as well. I don't have a proof that S is PD, but it seems to be true in the examples I've tried, so let's list that as an assumption and assume that \(S = G^\prime_S G_S\) for an upper triangular matrix \(G_S\).

Using these definitions, the Cholesky decomposition of Σ can be written in block form as
\(\Sigma = \begin{bmatrix}G^\prime_A & 0\\B^\prime G^{-1}_A & G^\prime_S\end{bmatrix} \begin{bmatrix}G_A & (G^\prime_A)^{-1} B \\ 0 & G_S\end{bmatrix} \)

To simulate MVN data, we let z be uncorrelated MVN(0, I) data and define
\(x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix}G^\prime_A & 0\\B^\prime G^{-1}_A & G^\prime_S\end{bmatrix} \begin{bmatrix}z_1 \\ z_2\end{bmatrix} = \begin{bmatrix}G^\prime_A z_1 \\ B^\prime G^{-1}_A z_1 + G^\prime_S z_2\end{bmatrix} \)

The previous equation indicates that you can generate correlated MVN(0, Σ) data if you know the Cholesky decomposition of the upper-left block of Σ (which is \(G_A\)) and the Cholesky decomposition of the Schur complement (which is \(G_S\)). Each of the matrices in these equations is of size d/2, which means that the computations are performed on matrices that are half the size of Σ. However, notice that generating the x2 components requires some extra work: the Schur complement involves the inverse of A, and the formula for x2 also involves the inverse of \(G_A\). These issues are not insurmountable, but it means that the block algorithm is more complicated than the original algorithm, which merely multiplies z by the Cholesky root of Σ.

A SAS program for block simulation of MVN data

The SAS/IML program in a previous section shows how to generate correlated MVN(0, Σ) data (x) from uncorrelated MVN(0, I) data (z) by using the Cholesky root (G) of Σ. This section shows how to get exactly the same x values by performing block operations. For the block operations, each matrix is size d/2, so has 1/4 of the elements of Σ.

The first step is to represent Σ and z in block form.

/* The Main Idea: get the same MVN data by performing block 
   operations on matrices that are size d/2 */
/* 1. Represent Sigma={A B, B` C} and z={z1, z2} in block form */
d1 = ceil(d/2);           /* half the variables */
d2 = d - d1;              /* the other half of the vars */
dp1 = d1 + 1;
 
A = Sigma[1:d1,  1:d1];   /* break up the symmetric (d x d) covariance */
B = Sigma[1:d1,  dp1:d];  /* matrix into blocks of size d/2 */
C = Sigma[dp1:d, dp1:d];
z1 = z[1:d1, ];           /* extract the first d1 obs for MVN(0,I) data */
z2 = z[dp1:d, ];          /* extract the last d2 obs for MVN(0,I) data */

It is easy to generate x1, which contains the first d/2 components of the MVN(0, Σ) simulated data. You simply use the Cholesky decomposition of A, which is the upper-left block of Σ:

/* 2. Compute the Cholesky root of A and compute x1 from z1 */
G_A = root(A);            /* Cholesky of upper left block */
x1 = G_A`*z1;             /* generate first half of variables */

It is not as easy to generate x2, which contains the last d/2 components. In block form, \(x_2 = v + u\), where \(v = B^\prime G_A^{-1} z_1\) and \(u = G_S^\prime z_2\), and where \(S = G_S^\prime G_S\) is the Cholesky decomposition of the Schur complement. This is shown in the following statements:

/* 3. Generate the second half of variables */
S = C - B`*inv(A)*B;      /* S is the Schur complement */
G_S = root(S);            /* Cholesky root of Schur complement */
v = B`*inv(G_A)*z1;
u = G_S`*z2;
x2 = v + u;
 
/* verify that the computation gives exactly the same simulated data */
print (max(abs(x[1:d1,] - x1)))[L="Diff(x - x1)"];
print (max(abs(x[dp1:d,] - x2)))[L="Diff(x - x2)"];

The output shows that the block computation, which uses multiple matrices of size d/2, gives exactly the same answer as the original computation, which uses matrices of size d.

Analysis of the block algorithm

Now that the algorithm runs on a small example, you can increase the dimensions of the matrices. For example, try rerunning the algorithm with the following parameters:

%let d = 5000;    /* number of variables */
%let N = 1000;    /* number of observations */

When I run this larger program on my PC, it takes about 2.3 seconds to simulate the MVN data by using the full Cholesky decomposition. It takes about 17 seconds to simulate the data by using the block method. Most of the time for the block method is spent on the computation v = B`*inv(G_A)*z1, which takes about 15 seconds. So if you want to improve the performance, you should focus on improving that computation.

You can improve the performance of the block algorithm in several ways:
  • You don't need to find the inverse of A. You have the Cholesky root of A, which is triangular, so you can define H = inv(G_A`)*B and compute the Schur complement as C - H`*H, as sketched after this list.
  • You don't need to explicitly form any inverses. Ultimately, all these matrices are going to operate on the vector z2, which means that you can simulate the data by using only matrix multiplication and solving linear equations.
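The following SAS/IML statements sketch the first improvement. The variable H and the use of the TRISOLV function are my additions, under the assumption that TRISOLV(2, R, b) solves the triangular system R`*x = b for an upper triangular matrix R:

/* sketch: replace explicit inverses with triangular solves */
H = trisolv(2, G_A, B);    /* G_A`*H = B, so H = inv(G_A`)*B          */
S = C - H`*H;              /* Schur complement: H`*H = B`*inv(A)*B    */
G_S = root(S);             /* Cholesky root of the Schur complement   */
v = H`*z1;                 /* H` = B`*inv(G_A), so v = B`*inv(G_A)*z1 */
x2 = v + G_S`*z2;          /* second half of the MVN variables        */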

I tried several techniques, but I could not make the block algorithm competitive with the performance of the original Cholesky algorithm. The issue is that the block algorithm computes two Cholesky decompositions, solves some linear equations, and performs matrix multiplication. Even though each computation is on a matrix that is half the size of the original matrix, the additional computations result in a longer runtime.

Generalization to additional blocks

The algorithm generalizes to other block sizes. I will outline the case for a 3 x 3 block matrix. The general case of k x k blocks follows by induction.

The figure to the right shows the Σ covariance matrix as a 3 x 3 block matrix. By using the 2 x 2 block algorithm, you can obtain multivariate normal vectors x1 and x2, which correspond to the covariances in the four blocks in the upper-left corner of the matrix. You can then view the upper four blocks as a single (larger) block, thus obtaining a 2 x 2 block matrix where the upper-left block has size 2d/3 and the lower-right block (Σ33) has size d/3.
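In symbols, the 3 x 3 block form of the covariance matrix is
\(\Sigma = \begin{bmatrix}\Sigma_{11} & \Sigma_{12} & \Sigma_{13}\\ \Sigma_{12}^\prime & \Sigma_{22} & \Sigma_{23}\\ \Sigma_{13}^\prime & \Sigma_{23}^\prime & \Sigma_{33}\end{bmatrix}\)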

There are many details to work through, but the main idea is that you can repeat the computation on this new block matrix where the block sizes are unequal. To proceed, you need the Cholesky decomposition of the large upper-left block, but we have already shown how to compute that in terms of the Cholesky root of Σ11, its inverse, and the Cholesky root of the Schur complement of Σ11. You also need to form the Schur complement of the large upper-left block, but that is feasible because we have its Cholesky decomposition.

I have not implemented the details because it is not clear to me whether three or more blocks provides a practical advantage. Even if I work out the details for k=3, the algebra for k > 3 blocks seems to be daunting.

Summary

It is well known that you can use the Cholesky decomposition of a covariance matrix to simulate data from a correlated multivariate normal distribution. This article shows how to break up the task by using a block Cholesky method. The method is implemented for k=2 blocks. The block method requires more matrix operations, but each operation is performed on a smaller matrix. Although the block method takes longer to run, it can be used to generate a large number of variables by generating k variables at a time, where k is small enough that the k x k matrices can be easily handled.

The block method generalizes to k > 2 blocks, but it is not clear whether more blocks provide any practical benefits.

The post A block-Cholesky method to simulate multivariate normal data appeared first on The DO Loop.

January 3, 2022
 

Last year, I wrote almost 100 posts for The DO Loop blog. My most popular articles were about data visualization, statistics and data analysis, and simulation and bootstrapping. If you missed any of these gems when they were first published, here are some of the most popular articles from 2021:

SAS programming and data visualization

Panel of Regression Diagnostic Plots in SAS

  • Display the first or last observations in data: Whether your data are in a SAS data set or a SAS/IML matrix, this article describes how to print the top rows (and bottom rows!) of a rectangular data set.
  • Customize titles in a visualization of BY groups: Have you ever used the BY statement to graph data across groups, such as Males/Females or Control/Experimental groups? If so, you might want to learn how to use the #BYVAR and #BYVAL keywords to customize the titles that appear on each graph.
Horizontal Bar Chart in SAS

  • Reasons to prefer a horizontal bar chart: Bar charts are used to visualize the counts of categorical data. Vertical charts are used more often, but there are advantages to using a horizontal bar chart, especially if you are displaying many categories or categories that have long labels. This article shows how to create a horizontal bar chart in SAS and gives examples for which a horizontal chart is preferable.
  • Why you should visualize distributions: It is common to report the means (or differences between means) for different groups in a study. However, means and standard deviations tell only part of the story. This article shows four examples where the mean difference between group scores is five points. However, when you plot the data for each example, you discover additional information about how the groups differ.

Simulation and Bootstrapping

Since I am the guy who wrote the book on statistical simulation in SAS, it is not surprising that my simulation articles are popular. Simulation helps analysts understand expected values, sampling variation, and standard errors for statistics.

Bootstrapping Residuals for a Time Series Regression Model

Did you resolve to learn something new in the New Year? Reading these articles requires some effort, but they provide tips and techniques that make the effort worthwhile. So, read (or re-read!) these popular articles from 2021. To ensure you don't miss a single article in 2022, consider subscribing to The DO Loop.

The post Top 10 posts from The DO Loop in 2021 appeared first on The DO Loop.

December 15, 2021
 

Suppose you are creating a craft project for the Christmas holidays, and you want to choose a palette of Christmas colors to give it a cheery holiday appearance. You could use one of the many online collections of color palettes to choose a Christmas-themed palette. However, I didn't want to merely take a palette that someone else designed. As a statistician, I wanted to create my own Christmas palette by using some sort of mathematical technique that creates a "statistically perfect" set of Christmas colors from other palettes!

Artists use their talents to create aesthetically pleasing and harmonious palettes of colors. But why should artists have all the fun? In this article, I show how a statistician might create a palette of Christmas colors! Starting from existing palettes of colors, the technique uses principal component analysis to select a linear subspace of colors. From within that subspace, you can create your own palette of colors. An example is the green-gold-red palette that is shown below. It was made by math, but I think it is pretty!

The same technique could also be used to create autumn palettes, spring palettes, and so forth, but don't take this too seriously: this article is intended to be geeky fun for the holidays. I'll let others debate whether this mathematical technique is useful for creating art.

Outline of the method

Here is the general idea behind creating a statistical palette of colors:

  1. Collect 6-10 existing palettes that you like. This process should produce 40-60 colors that are related to your theme (such as "Christmas"). I have written about how to read and visualize palettes in SAS.
  2. Each color can be represented as a triplet of values in RGB space. You can use principal component (PC) analysis to project the colors into a plane that captures most of the variation in the colors. The graph at the right shows the principal component scores for 41 colors in seven Christmas palettes.
  3. Each palette is represented as a path through this linear subspace. So, you can choose a new sequence of points in this subspace and then REVERSE the projection to obtain a new palette of RGB colors.
  4. There are infinitely many paths you could choose, but one choice is to use vertices of a regular polygon to create a "statistically perfect" set of Christmas colors.

Paths through principal component space

For each palette, if you connect the sequence of colors in the palette, you obtain a path through the plane of the principal component scores. If you try to display the paths for all seven palettes, you will get a jumbled mess of lines that crisscross the graph. To make it easier to visualize the paths for each palette, the following graph limits the display to two palettes. The "Jingle Bell" palette has six colors. It starts with two green colors, moves to golden colors, and ends with two red colors. The "Unwrapping My Gifts" palette has five colors. It starts with two green colors, moves to a silver color, and ends with a red and dark burgundy color. The other palettes trace similar paths through the plane of scores.

Create your own path

If each palette corresponds to a path, then each path corresponds to a palette. In other words, if you choose a sequence of five or six coordinates in the space of principal component scores, the path defines a new Christmas palette.

The ancient Greeks thought that a circle was the most perfect shape. Regular polygons, which approximate the circle, are also seen as perfect in their symmetry and beauty. Therefore, it makes sense to use a regular hexagon to create a new Christmas palette. The following graph demonstrates how you can choose a new Christmas palette. The circle represents a path through the space of colors. The gray stars represent six points on the circle, which are also vertices on a regular hexagon. I will define a new palette by using math to figure out what colors to associate with the coordinates of the six stars. I've ordered the six points starting from the point that I know will be green (because it is in the lower-left corner) and ending with the point that I know will be red (because it is in the upper-left corner).
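For concreteness, the following SAS/IML statements sketch one way to generate six evenly spaced points on a circle in PC coordinates, starting in the lower-left quadrant. The center and radius are assumed values for illustration; they are not the values used for the figures in this article:

proc iml;
center = {0 0};   radius = 2;            /* assumed values for illustration */
/* angles for a regular hexagon, starting in the lower-left quadrant */
theta = T( do(210, 510, 60) ) * constant('pi') / 180;
path = repeat(center, 6, 1) + radius*(cos(theta) || sin(theta));
print path[c={"PC1" "PC2"}];             /* six vertices define a new palette path */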

Inverting the principal component transformation

Until now, we've only considered the projection of the RGB colors onto the plane of the first two PCs. Of course, this projection is not invertible: There are many colors that share the same two PC scores. However, in general, the transformation from RGB space to the space of all principal component scores is invertible. You can represent the transformation as a matrix by using the spectral decomposition of the correlation matrix of the colors. To embed the plane of the first two PCs, you need to set a value for the coordinate of the third PC. The obvious statistical choice is to use PC3=0, which represents the average score in the third PC. In other words, you can associate the first two PC scores with the plane that has ordered triplets (PC1, PC2, 0).
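In symbols (my notation, not the article's): let \(\mu\) be the vector of means, let \(D\) be the diagonal matrix of standard deviations, and let \(V\) be the matrix whose columns are eigenvectors of the correlation matrix of the colors. The projection onto PC coordinates is \(s = V^\prime D^{-1}(x - \mu)\). Because \(V\) is orthogonal, a score vector of the form \(s = (s_1, s_2, 0)^\prime\) maps back to RGB space as
\(x = \mu + D V s\)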

With that choice, it is straightforward to apply the inverse PC transformation. The inverse transformation maps PC scores into colors in RGB space. You can use the SAS IML language to carry out this transformation. If you visualize a dense set of points on the circle, you get the following spectrum of colors:

Any smaller palette will contain a subset of these colors. For example, if you choose six points on the circle that are vertices of a regular hexagon, you get the palette that is shown at the top of this article. The new palette is based on the original seven palettes but is distinct from them. The new palette incorporates the properties of the original palettes but is created by defining a new path through the PC scores.

Summary

In summary, this article shows how to create a new palette of colors that is based on statistical properties of a set of colors. You can perform a principal component analysis in RGB space to obtain a plane of PC scores. Every path through that plane corresponds to a new palette of colors. For my palette, I chose a sequence of vertices on a regular hexagon.

This method is not guaranteed to give an aesthetically pleasing palette of colors, but I think it does in this example. And if you don't like my palette, you can choose any other path through the set of PC scores to create a new palette! If you would like to experiment with this technique, you can download the SAS program that I used to generate the figures in this article.

The post A statistical palette of Christmas colors appeared first on The DO Loop.

December 13, 2021
 

In a previous article, I visualized seven Christmas-themed palettes of colors, as shown to the right. You can see that the palettes include many red, green, and golden colors. Clearly, the colors in the Christmas palettes are not a random sample from the space of RGB colors. Rather, they represent a specific subset of Christmas colors. I thought it would be fun to use principal component analysis to compute and visualize the linear subspace that best captures the set of Christmas colors!

Convert hexadecimal colors to RGB

Each color can be represented as an ordered triplet of unsigned integers in RGB space. The colors in SAS are 8-bit colors, which means that each coordinate in RGB space has a value between 0 and 255. The following DATA step reads the colors in the Christmas palettes and uses the HEX2. informat to convert the hexadecimal values to their equivalent RGB values. As explained in the previous article, the DATA step also creates an ID value for each color and creates a macro variable (AllColors) that contains the list of all colors.

/* read Christmas palettes. Convert hex to RGB */
data RGB;
length Name $30 color $8;
length palette $450; /* must be big enough to hold all colors */
retain ID 1 palette;
input Name 1-22 n @;
do col = 1 to n;
   input color @;
   R = inputn(substr(color, 3, 2), "HEX2."); /* get RGB colors from hex */
   G = inputn(substr(color, 5, 2), "HEX2.");
   B = inputn(substr(color, 7, 2), "HEX2.");
   palette = catx(' ', palette, color);
   output;
   ID + 1;                             /* assign a unique ID to each color */
end;
call symput('AllColors', palette);   /* concatenate colors; store in macro */
drop n col palette;
/* Palette Name      |n| color1 | color2 | color3 | color4 | ... */
datalines;
Jingle Bell           6 CX44690D CX698308 CXF3C543 CXFFEDC7 CXCA2405 CX9E1007
Holiday Red and Gold  6 CXCF1931 CXAD132D CXD9991A CXEAA61E CXF2BC13 CX216A1B
Green and Gold        6 CX03744B CX15885A CX1E9965 CXFBE646 CXFBC34D CXF69F44
Unwrapping My Gifts   5 CX2A5B53 CX5EB69D CXECEBF1 CXD34250 CX5F1326
Christmas Wrapping    6 CX237249 CX3B885C CXE5D6B5 CXE3CD8E CXDA111E CXC00214
Christmas Wedding     6 CX325C39 CX9C9549 CXDBAA46 CXFFE9D9 CXFF4A4A CXDB2121
Real Christmas Tree   6 CX779645 CX497542 CX274530 CX6E3C3B CXBF403B CXEDB361
;

The result of this DATA step is a data set that contains 41 rows. The R, G, and B variables contain the red, green, and blue components, respectively, which are in the range 0 to 255. Thus, the data contains 41 triplets of RGB values.

Let's use PROC PRINCOMP in SAS to run a principal component analysis of these 41 observations:

proc princomp data=RGB N=2 out=PCOut plots(only)=(Score Pattern);
   var R G B;
   ID color;
run;

Recall that each principal component is a linear combination of the original variables (R, G, and B). The table shows that the first principal component (PC) is the linear combination
PC1 = 0.39*R + 0.66*G + 0.64*B
The first PC is a weighted sum of the amount of color across all three variables. More weight is given to the green and blue components than to the red component. Along the axis of the first PC, black is on the extreme left since black has the RGB value (0, 0, 0). Similarly, white is on the extreme right since white has the RGB value (255, 255, 255).

The second principal component is the linear combination
PC2 = 0.91*R - 0.21*G - 0.35*B
The second PC is a contrast between the red coordinate and the green and blue coordinates. Along the axis of the second PC, colors that have a lot of green and blue (such as cyan, which is (0, 255, 255)) have extreme negative values whereas colors that are mostly red have extreme positive values.

An additional table (not shown) indicates that 89% of the variation in the data is explained by using the first two principal components. You could add a third principal component if you want to have a way to separate the green and blue colors.

The PRINCOMP procedure creates a graph that projects each RGB value onto the principal component axes. Because I put the COLOR variable on the ID statement, each marker is labeled by using its hexadecimal value:

This graph shows how the colors in the Christmas palettes are distributed in the space of the first two PCs. However, wouldn't it be cool if the markers in this plot were the colors that the markers represent? The next section adds color to the scatter plot by using PROC SGPLOT.

Adding color to the score plot

The previous call to PROC PRINCOMP includes the OUT= option, which writes the principal component scores to a SAS data set. The names of the scores are PRIN1 and PRIN2. If you use the PRIN1 and PRIN2 variables to create a scatter plot, you can re-create the score plot from PROC PRINCOMP. However, I am going to add a twist. As shown in my previous post, I will use the GROUP= option to specify that the marker colors be assigned according to the value of the ID variable, whose values are the integers 1, 2, ..., 41. I will use the STYLEATTRS statement and the DATACONTRASTCOLORS= option to specify the colors that should be used for each group. The result is a scatter plot in which each marker is colored according to a specified list of colors.

title "Principal Component Scores for Christmas Colors";
proc sgplot data=PCOut noautolegend aspect=1;
   label Prin1="Component 1 (60.37%)" Prin2="Component 2 (28.62%)";
   styleattrs datacontrastcolors=(&AllColors);
   scatter x=Prin1 y=Prin2 / markerattrs=(symbol=CircleFilled size=16) group=ID;
   refline 0 / axis=x;
   refline 0 / axis=y;
run;

The graph shows the distribution of the colors, using the colors themselves as a visual cue. The graph shows that the first PC differentiates between dark colors (on the left) and light colors (on the right). The second PC differentiates between blue-green colors (on the bottom) and red colors (on the top).

Summary

I almost titled this article, "A Statistician Looks at Christmas Colors." I started with a set of Christmas-themed palettes with 41 colors that were primarily red, green, and gold. By performing a principal component analysis, you can project the RGB coordinates of these colors onto the plane that captures most of the variation in the data. If you color the markers in this projection, you obtain a scatter plot that shows green colors in one region, red colors in another region, and gold and silver colors in yet another.

The graph shows the various Christmas colors in one image. To me, the image seems more coherent and organized than the original strips of colors. I like that similar colors are grouped together, and that the three groups of colors (red, green, and gold) are visibly distinct.

The post A principal component analysis of color palettes appeared first on The DO Loop.

December 8, 2021
 

In data visualization, colors can represent the values of a variable in a choropleth map, a heatmap, or a scatter plot. But how do you visualize a palette of colors from the RGB or hexadecimal values of the colors? One way is to use the HEATMAPDISC subroutine in SAS/IML, which enables you to specify a color ramp as part of the syntax. If you do not have SAS/IML software, you can use the HEATMAPPARM statement in PROC SGPLOT and use a discrete attribute map to define which color is used for each cell in the heat map. If you do not yet know about discrete attribute maps, I highly recommend that you take a moment to learn about them; they are very useful!

If you do not want to use a discrete attribute map and you do not have access to SAS/IML, there is a third way to visualize color palettes: use the POLYGON statement in PROC SGPLOT. As discussed in this article, each polygon has a unique value of an ID variable. You can use the GROUP= option to associate a color with each value of the ID variable. You can use the DATACOLORS= option on the STYLEATTRS statement to specify the colors for each polygon. This article uses rectangular polygons and arranges them so that you can visualize a palette of colors. Because I am writing this article in the Christmas season, I will use a set of Christmas-themed red-and-green palettes to demonstrate the technique. The final visualization is shown to the right.

How to use the POLYGON statement to visualize a palette

The following list outlines the main steps for using the POLYGON statement in PROC SGPLOT to visualize a set of palettes:

  1. Use the DATA step to read the colors in each palette.
  2. Assign a unique ID to each color.
  3. Concatenate all the colors and store the set of colors in a macro variable. This macro variable will be used to assign a color to each polygon.
  4. Create a polygon for each color. This example uses rectangles. The i_th color for the j_th palette is represented by a rectangle that is centered at the coordinate (i,j).
  5. Optionally, use the TEXT statement in PROC SGPLOT to overlay the hexadecimal values of the colors on the graph.
  6. Optionally, use the TEXT statement to display the name of each palette.
  7. Use the POLYGON and TEXT statements to display the palettes of colors.

Read the colors in each palette

I found several Christmas-themed palettes at www.schemecolor.com. The specific palettes are cited in the comments of the following program, which implements the first three steps of the outline:

ods graphics/reset;
/* Acknowledgements:
   Jingle Bell palette:  https://www.schemecolor.com/jingle-bells.php
   Holiday Red and Gold: https://www.schemecolor.com/holiday-red-gold.php
   Green and Gold:       https://www.schemecolor.com/green-and-gold.php
   Unwrapping My Gifts:  https://www.schemecolor.com/unwrapping-my-gifts.php
   Christmas Wrapping:   https://www.schemecolor.com/christmas-wrapping.php
   Christmas Wedding:    https://www.schemecolor.com/christmas-wedding.php
   Real Christmas Tree:  https://www.schemecolor.com/real-christmas-tree.php
*/
/* 1. Use the DATA step to read the colors in each palette. */
data Palettes;
length Name $30 color $8;
length palette $450; /* must be big enough to hold all colors */
retain ID 1 row 1 palette;
input Name 1-22 n @;
do col = 1 to n;
   input color @;
   palette = catx(' ', palette, color);
   output;
   ID + 1;    /* Assign a unique ID to each color. */
end;
row + 1;
call symput('AllPalettes', palette);  /* 3. Concatenate colors; store in macro. */
drop palette;
/* Palette Name      |n| color1 | color2 | color3 | color4 | ... */
datalines;
Jingle Bell           6 CX44690D CX698308 CXF3C543 CXFFEDC7 CXCA2405 CX9E1007
Holiday Red and Gold  6 CXCF1931 CXAD132D CXD9991A CXEAA61E CXF2BC13 CX216A1B
Green and Gold        6 CX03744B CX15885A CX1E9965 CXFBE646 CXFBC34D CXF69F44
Unwrapping My Gifts   5 CX2A5B53 CX5EB69D CXECEBF1 CXD34250 CX5F1326
Christmas Wrapping    6 CX237249 CX3B885C CXE5D6B5 CXE3CD8E CXDA111E CXC00214
Christmas Wedding     6 CX325C39 CX9C9549 CXDBAA46 CXFFE9D9 CXFF4A4A CXDB2121
Real Christmas Tree   6 CX779645 CX497542 CX274530 CX6E3C3B CXBF403B CXEDB361
;
 
%put &AllPalettes;   /* check that all colors are here */

The following sections implement the visualization in three stages: an initial visualization that uses a scatter plot, an improved visualization that uses rectangular polygons, and a final visualization.

An initial visualization

After running this DATA step, the Palettes data set contains all the information you need to visualize the colors in each palette. In fact, you can use a scatter plot to visualize the data and make sure it was read correctly. The following call to PROC SGPLOT creates a graph for which each row represents a palette of colors. Recall that the STYLEATTRS statement supports two different options for specifying how colors are assigned to groups when you use the GROUP= option:

  • The DATACOLORS= option specifies the fill color for bars and polygons.
  • The DATACONTRASTCOLORS= option specifies the colors for lines and markers.

For a scatter plot, you can use the DATACONTRASTCOLORS= option to assign group colors, as follows:

%let gray   = CXF8F8F8;
title "Initial Visualization of Christmas Palettes";
proc sgplot data=palettes noautolegend noborder;
   styleattrs backcolor=&gray wallcolor=&gray datacontrastcolors=(&AllPalettes);
   scatter x=col y=row / group=ID markerattrs=(symbol=SquareFilled size=45);
   text x=col y=row text=color;
run;

This is not a professional-looking graph, but it does provide a quick-and-easy way to visualize the colors. Each row is a palette of colors. The colors are labeled by using their hexadecimal values. In the next section, the markers are replaced by rectangular polygons.

A better visualization of palettes

In the data set, the (col, row) coordinates are the center of each polygon. The following DATA step creates new variables (x, y) for the vertices of rectangles:

/* 4. Create a polygon for each color. 
      The i_th color for the j_th palette is represented by a rectangle 
      that is centered at the coordinate (i,j). */
data Poly;
set Palettes;
w = 0.5; h = 0.48;                /* width and height of rectangle */
x = col - w; y = row - h; output; /* four vertices of square */
x = col + w; y = row - h; output;
x = col + w; y = row + h; output;
x = col - w; y = row + h; output;
drop w h;
run;

With that minor change, you can use a POLYGON statement to display colored rectangles. (Be sure to use the DATACOLORS= option on the STYLEATTRS statement.) The resulting visualization looks much better:

%let gray   = CXF8F8F8;
title "Second Visualization of Christmas Palettes";
title2 "A Polygon Plot";
proc sgplot data=Poly noautolegend;
      styleattrs wallcolor=&gray datacolors=(&AllPalettes);
   polygon x=x y=y ID=ID / group=ID fill;
   text x=col y=row text=color;
run;

The final visualization

Although both previous graphs show the palettes of colors, you can improve the graphs by making minor changes:

  • Because each rectangular polygon is defined by using four rows in the data, the hexadecimal value of the colors is plotted four times. You can see in the previous graphs that the color values appear to be boldfaced. You only need to plot the value once for each value of the ID variable.
  • Add the name of the palette to the graph.
  • Do not display the axes. They are not necessary.
The following DATA steps implement these changes:

/* 5. Optionally, use the TEXT statement to overlay the hex values of the
      colors; plot each hex value only once for each color (ID) */
data Temp;
set Poly;
by ID;
if ^first.ID then  color=" ";
run;
 
/* 6. Optionally, use the TEXT statement to display the name of each palette */
data PlotPalettes;
set Temp;
by Name notsorted;
label = Name;
if first.Name then do;
   lx = col - 0.6;    ly = row;
end;
else do;
   label=" ";  lx = .;  ly = .;
end;
run;
 
/* 7. Use the POLYGON and TEXT statements to display the palettes of colors. */
%let gray   = CXF8F8F8;
title "Visualization of Christmas Palettes";
proc sgplot data=PlotPalettes noautolegend noborder;
   styleattrs backcolor=&gray wallcolor=&gray datacolors=(&AllPalettes);
   polygon x=x y=y ID=ID / group=ID fill;
   text x=col y=row text=color;
   text x=lx y=ly text=label / position=left textattrs=(size=12);
   xaxis display=none offsetmax=0;
   yaxis display=none thresholdmin=0 thresholdmax=0;
run;

The final graph is shown at the top of this article. It enables you to see that most of the palettes are green-to-red and include either a golden or creamy-white color. Three palettes are different from the others:

  • The "Real Christmas Tree" palette ends with gold instead of red.
  • The "Green and Gold" palette does not contain any red.
  • The "Holiday Red and Gold" palette starts with red and ends with green.

Summary

This article shows how to use a polygon plot to visualize a set of Christmas-themed color palettes. By using the STYLEATTRS statement in PROC SGPLOT, you can specify how colors are assigned to groups when you use the GROUP= option. You can use the DATACOLORS= option to specify the fill color for polygons. You can use the POLYGON statement and the GROUP=ID option to assign each color to a rectangular polygon, thus visualizing each palette as a strip of colors.

The post Visualize palettes of colors in SAS appeared first on The DO Loop.

December 6, 2021
 

While discussing how to compute convex hulls in SAS with a colleague, we wondered how the size of the convex hull compares to the size of the sample. For most distributions of points, I claimed that the size of the convex hull is much less than the size of the original sample. This article looks at the expected value of the proportion p = k / N, where k is the size of the convex hull for a set of N planar points. The proportion depends on the distribution of the points. This article shows the results for three two-dimensional distributions: the uniform distribution on a rectangle, the normal distribution, and the uniform distribution on a disk.

Monte Carlo estimates of the expected number of points on a convex hull

You can use Monte Carlo simulation to estimate the expected number of points on a convex hull. First, specify the distribution of the points and the size of the sample, N. Then do the following:

  • Simulate N random points from the distribution and compute the convex hull of these points. This gives you a value, k, for this one sample.
  • Repeat the simulation B times. You will have B values of k: {k1, k2, ..., kB}.
  • The average of the B values is an estimate of the expected value for the number of points on the convex hull for a random set of size N.
  • If you prefer to think in terms of proportions, the ratio k / N is an estimate of the expected proportion of points on the convex hull.

Simulate uniform random samples for N=200

Before you run a full simulation study, you should always run a smaller simulation to ensure that your implementation is correct. The following SAS/IML statements generate 1000 samples of size N=200 from the uniform distribution on the unit square. For each sample, you can compute the number of points on the convex hull. You can then plot the distribution of the number of points on the convex hull, as follows:

proc iml;
call randseed(1);
 
/* 1. Convex hull of N=200 random uniformly distributed 
      points in the unit square [0,1] x [0,1] */
NRep = 1000;            /* number of Monte Carlo samples */
Distrib = "Uniform";
N = 200;                /* sample size */
k = j(NRep, 1, 0);      /* store the number of points on CH */
X = j(N, 2, 0);         /* allocate matrix for sample */
do j = 1 to NRep;
   call randgen(X, Distrib);  /* random sample */
   Indices = cvexhull(X);
   k[j] = sum(Indices>0);     /* the positive indices are on CH */
end;
 
Ek = mean(k);           /* MC estimate of expected value */
print Ek;
title "k = Number of Points on Convex Hull; N = 200";
call bar(k) other="inset ('E[k] =' = '13.88') / border;";

The bar chart shows the distribution of the size of the convex hull for the 1,000 simulated samples. For most samples, the convex hull contains between 12 and 15 points. The expected value is the average of these values, which is 13.88 for this simulation.

If you want to compare samples of different sizes, it is useful to standardize the estimated quantity. Rather than predict the number of points on the convex hull, it is useful to predict the proportion of the sample that lies on the convex hull. Furthermore, we should report not only the expected value but also a confidence interval. This makes it clearer that the proportion for any one sample is likely to be within a range of values:

p = k / N;              /* proportion of points on the CH */
mean = mean(p);         /* expected proportion */
call qntl(CI, p, {0.025, 0.975}); /* 95% confidence interval */
print mean CI;
 
refStmt = 'refline ' + char(mean//CI) + '/ axis=X lineattrs=(color=red);';
title "Proportion of Points on Convex Hull, 1000 Random Samples";
title2 "X ~ U(Square); N = 200";
call histogram(p) other=refStmt label='Proportion';

For this distribution of points, you should expect about 7% of the sample to lie on the convex hull. An empirical 95% confidence interval is between 5% and 9%.

According to an unpublished preprint by Har-Peled ("On the Expected Complexity of Random Convex Hulls", 2011), the expected number of points on the convex hull of a random uniform sample from the unit square is asymptotically O(log(n)).

Simulate samples for many sizes

The previous section simulated B samples for size N=200. This section repeats the simulation for values of N between 300 and 8,000. Instead of plotting histograms, I plot the expected value and a 95% confidence interval for each value of N. So that the simulation runs more quickly, I have set B=250. The following SAS/IML program simulates B samples for various values of N, all drawn from the uniform distribution on the unit square:

/* 2. simulation for X ~  U(Square), N = 300..8000 */
NRep = 250;  /* Monte Carlo simulation: use 250 samples of size N */
Distrib = "Uniform";
N = do(300, 900, 100) || do(1000, 8000, 500);  /* sample sizes */
k = j(NRep, ncol(N), 0);         /* store the number of points on CH */
do i = 1 to ncol(N);
   X = j(N[i], 2, 0);            /* allocate matrix for sample */
   do j = 1 to NRep;
      call randgen(X, Distrib);  /* random sample */
      Indices = cvexhull(X);
      k[j,i] = sum(Indices>0);   /* the positive indices are on CH */
   end;
end;
 
p = k / N;                       /* proportion of points on the CH */
mean = mean(p);                  /* expected proportion */
call qntl(CI, p, {0.025, 0.975});  /* 95% confidence interval */

You can use the CREATE statement to write the results to a SAS data set. You can use the SUBMIT/ENDSUBMIT statements to call PROC SGPLOT to visualize the result. Because I want to create the same plot several times, I will define a subroutine that creates the graph:

/* define subroutine that creates a scatter plot with confidence intervals */
start HiLow(N, p, lower, upper);
    Ek = N#p;  /* expected value of k */
    create CH var {N p lower upper Ek}; append; close;
    submit;
      proc sgplot data=CH noautolegend;
        format Ek 2.0;
        highlow x=N low=lower high=upper;
        scatter x=N y=p / datalabel=Ek;
        xaxis grid label='Sample Size (N)' offsetmax=0.08;
        yaxis grid label='Proportion on Convex Hull';
      run;
    endsubmit;
finish;
 
title "Expected Number of Points on Convex Hull of Sample of Size N";
title2 "X ~ U(Square)";
run HiLow(N, mean, CI[1,], CI[2,]);

The graph shows the expected proportion of points on the convex hull for samples of size N. The expected number of points (rounded to the nearest integer) is shown adjacent to each marker. For example, you should expect a sample of size N=2000 to have about 1% of its points (k=20) on the convex hull. As the sample size grows larger, the proportion decreases. For N=8000, the expected proportion is only about 0.3%, or 23.5 points on the convex hull.
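As a quick consistency check (this snippet is my addition, not part of the original program), you can compare these estimates to Har-Peled's O(log(n)) rate: if E[k] ≈ c·log(N), then the ratio E[k]/log(N) should be roughly constant across sample sizes. The following IML statements use the E[k] values reported in the text:

/* added check: simulated E[k] versus the O(log(N)) rate for the square */
EkEst = {13.88 20 23.5};      /* simulated E[k] for N = 200, 2000, 8000 */
NN    = {200 2000 8000};      /* corresponding sample sizes */
ratio = EkEst / log(NN);      /* elementwise; approx 2.62, 2.63, 2.61 */
print ratio[colname={"N=200" "N=2000" "N=8000"}];

The nearly constant ratio is consistent with logarithmic growth.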

Simulate samples from the bivariate normal distribution

The previous graph shows the results for points drawn uniformly at random from the unit square. If you change the distribution of the points, you will change the graph. For example, a bivariate normal distribution has most of its points deep inside the convex hull and very few points near the convex hull. Intuitively, you would expect the ratio k / N to be smaller for bivariate normal data, as compared to uniformly distributed data.

You can analyze bivariate normal data by changing only one line of the previous program, as shown in the snippet below. With that modification, the program creates the following graph:
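/* the only change to the previous simulation: draw each coordinate from
   the standard normal distribution instead of the uniform distribution */
Distrib = "Normal";   /* was: Distrib = "Uniform" */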

For bivariate normal data, the graph shows that the expected proportion of points on the convex hull is much less than for data drawn from the uniform distribution on the square. For a sample of size N=2000, you should expect about 0.6% of its points (k=12) to lie on the convex hull. For N=8000, the expected proportion is only about 0.2%, or 13.7 points on the convex hull.

Simulate samples from the uniform distribution on the disk

Let's examine one last distribution: the uniform distribution on the unit disk. I have previously shown how to generate random uniform points from the disk, or, in fact, from any d-dimensional ball. The following program defines the RandBall subroutine, which generates random samples from the unit disk. This subroutine is used instead of the RANDGEN subroutine:

/* 4. random uniform in disk: X ~ U(Disk)
   https://blogs.sas.com/content/iml/2016/03/30/generate-uniform-2d-ball.html */
start RandBall(X, radius=1);
   N = nrow(X); d = ncol(X);
   call randgen(X, "Normal");       /* X ~ MVN(0, I(d)) */
   u = randfun(N, "Uniform");       /* U ~ U(0,1)       */
   r = radius * u##(1/d);           /* r proportional to d_th root of U */
   X = r # X / sqrt(X[,##]);        /* X[,##] is sum of squares for each row */
finish;
 
NRep = 250;  /* Monte Carlo simulation: use 250 samples of size N */
N = do(300, 900, 100) || do(1000, 8000, 500);  /* sample sizes */
k = j(NRep, ncol(N), 0);         /* store the number of points on CH */
do i = 1 to ncol(N);
   X = j(N[i], 2, 0);            /* allocate matrix for sample */
   do j = 1 to NRep;
      call RandBall(X);          /* random sample */
      Indices = cvexhull(X);
      k[j,i] = sum(Indices>0);   /* the positive indices are on CH */
   end;
end;
 
p = k / N;                       /* proportion of points on the CH */
mean = mean(p);                  /* expected proportion */
call qntl(CI, p, {0.025, 0.975});  /* 95% confidence interval */
 
title "Expected Number of Points on Convex Hull of Sample of Size N";
title2 "X ~ U(Disk)";
run HiLow(N, mean, CI[1,], CI[2,]);

For this distribution, a greater proportion of points are on the convex hull, as compared to the other examples. For a sample of size N=2000, you should expect about 2% of its points (k=42) to lie on the convex hull. For N=8000, the expected proportion is about 0.8%, or 67.7 points on the convex hull.

According to Har-Peled (2011), the expected number of points on the convex hull of a random sample from the uniform distribution on the unit disk is asymptotically O(n^(1/3)).
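As with the square, you can check the simulated estimates against this rate (again, an added sketch that uses the E[k] values reported above): if E[k] ≈ c·N^(1/3), the ratio E[k]/N^(1/3) should be roughly constant.

/* added check: simulated E[k] versus the O(N^(1/3)) rate for the disk */
EkDisk = {42 67.7};                       /* simulated E[k] for N = 2000, 8000 */
ratio  = EkDisk / ({2000 8000}##(1/3));   /* elementwise; approx 3.33, 3.39 */
print ratio[colname={"N=2000" "N=8000"}];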

Linear transformations of convex hulls

Although I ran simulations for the uniform distribution on a square and on a disk, the same results hold for arbitrary rectangles and ellipses. It is not hard to show that if X is a set of points and C(X) is the convex hull, then for any affine transformation T: R^2 → R^2, the set T(C(X)) is the convex hull of T(X). Thus, the number of points on the convex hull is invariant under invertible affine transformations.
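The following sketch (my addition; the matrix A and the vector b are arbitrary choices) demonstrates the invariance numerically by comparing the hull size of a random sample to the hull size of its affine image:

/* sketch: the hull size is invariant under an invertible affine map */
Z = j(500, 2, 0);
call randgen(Z, "Uniform");
A = {2 1, 0 3};              /* invertible linear part */
b = {10 -5};                 /* translation */
W = Z*A` + b;                /* affine image of the sample */
kZ = sum(cvexhull(Z) > 0);   /* hull size of the original sample */
kW = sum(cvexhull(W) > 0);   /* hull size of the transformed sample */
print kZ kW;                 /* the counts match */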

In the same way, the simulation for bivariate normal data (which used uncorrelated variates) remains valid for correlated normal data. This is because the Cholesky transformation is an invertible linear map that converts uncorrelated variates into correlated variates.
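For example (another added sketch; Sigma is an arbitrary correlation matrix), you can generate correlated normal variates by multiplying uncorrelated variates by the Cholesky root, and the hull size does not change:

/* sketch: correlated normal variates are a linear image of uncorrelated ones */
Sigma = {1 0.7, 0.7 1};      /* an example correlation matrix */
U = root(Sigma);             /* upper-triangular Cholesky root: U`*U = Sigma */
Z = j(500, 2, 0);
call randgen(Z, "Normal");   /* uncorrelated bivariate normal sample */
Y = Z*U;                     /* correlated bivariate normal sample */
print (sum(cvexhull(Z)>0))[label="k, uncorrelated"]
      (sum(cvexhull(Y)>0))[label="k, correlated"];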

Summary

This article uses simulation to understand how the size of a convex hull relates to the size of the underlying sample, drawn randomly from a 2-D distribution. Convex hulls are preserved under affine transformations, so you can simulate from a standardized distribution. This article presents results for three distributions: the uniform distribution on a rectangle, the bivariate normal distribution, and the uniform distribution on an elliptical region.

The post The expected number of points on a convex hull appeared first on The DO Loop.


12月 012021
 

Did you know that the loess regression algorithm is not well-defined when you have repeated values among the explanatory variables, and you request a very small smoothing parameter? This is because loess regression at the point x0 is based on using the k nearest neighbors to x0. If x0 has m > k neighbors that are all the same distance from x0, then there is not a unique way to choose a set of k nearest neighbors.

When Cleveland and colleagues developed locally weighted regression (loess), they described the algorithm as "analogous to how a moving average is computed for a time series" (Cleveland and Devlin, JASA, 1988, p. 596). However, the loess algorithm does not require that the data are "measured at equally spaced points in time." Rather, it applies "to the more general case of regression analysis" (p. 596) where the independent variable(s) can be spaced somewhat arbitrarily. However, Cleveland and Devlin (1988) do not discuss the fact that the algorithm is not well-defined when the data have repeated values for the explanatory variable.

This article discusses the issue and a related fact: If the explanatory variable contains repeated values, the predicted values of a loess regression might depend on the order of the data.

Repeated values cause problems in many nonparametric analyses

Repeated (or tied) values can cause problems in nonparametric analyses because many nonparametric analyses use ranks instead of raw data values. I have previously written about how tied values affect rank-based normal scores, which are used in certain nonparametric tests. I have also discussed how binning by quantiles is affected by tied values. In practice, the more tied values the data contain, the more the ties affect the results.

Tied values also affect nearest-neighbor algorithms for which you specify the number of neighbors, such as for spatial data analysis and loess regression, to name just two examples. If you ask for the k closest neighbors to x0, what do you do if there are more than k observations that are the same distance away? For univariate data, this can occur when there are more than k repeated values for the explanatory variable.

A simple loess example that has tied values

Let's construct an example that has N=21 observations but only three unique values of the X variable. The following SAS DATA step constructs the example and uses PROC SGPLOT to fit a loess curve through the scatter plot.

data Simple;
x= -1;
do y = -1 to 1 by 2/9;   /* 10 observations for x = -1 */
   output;
end;
x = 0; y = 0; output;
x= +1;
do y = -1 to 1 by 2/9;   /* 10 observations for x = +1 */
   output;
end;
run;
 
title "Default Loess Fit";
title2 "alpha = 0.929; k=19 Neighbors";
proc sgplot data=Simple;
   loess x=x y=y;
run;

The graph shows the 21 observations and a loess fit. There are 10 observations for which x = -1 and 10 for which x = +1. The loess algorithm searches for a value of the smoothing parameter that minimizes a goodness-of-fit criterion. For these data, the best value of the smoothing parameter is 0.93, which means "use 93% of the observations when you perform local regression estimates." In terms of nearest neighbors, it means "use the nearest 19 observations" to make predictions.
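As an aside, the number of local points that corresponds to a smoothing parameter is (to my understanding) k = floor(α·N). The following DATA step (an added sketch, assuming that rule) checks the values used in this article:

/* hedged sketch: local points implied by a smoothing parameter,
   assuming the usual rule k = floor(alpha*N) for N = 21 */
data _null_;
   N = 21;
   do alpha = 0.93, 0.34, 0.2;
      k = floor(alpha*N);
      put alpha= k=;      /* k = 19, 7, 4 */
   end;
run;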

Already you might be wondering which 19 values are used to evaluate the smoother at each value of x. It's not clear, right? The issue becomes more apparent if you manually force the loess algorithm to use a smoothing parameter such as α = 1/3, which tells the algorithm to use 1/3 of the observations (k=7 observations) to make predictions for each value of x. The following call to PROC SGPLOT displays one loess curve that uses k=7 observations and another curve that uses k=4 observations.

title "Data Are Sorted by X and Y";
title2 "Y Ascending";
proc sgplot data=Simple;
   scatter x=x y=y;
   loess x=x y=y / smooth=0.34  legendlabel="k=7 nbrs" nomarkers name='L1';
   loess x=x y=y / smooth=0.2   legendlabel="k=4 nbrs" nomarkers name='L2';
   keylegend 'L1' 'L2' / title="Nearest Neighbors" location=inside position=S across=1;
run;

I added a blue rectangle and a red rectangle to the left side of the plot because I want to focus on the predicted value at x = -1. The question is, which seven observations were used to make the prediction at x = -1? It seems clear that the predicted value in blue is the mean of the seven observations that are inside the blue rectangle. Those happen to be the first seven observations in the data set.

Mathematically, the predicted value at x = -1 is not well-defined. The loess algorithm wants to predict the value at x = -1 by using the k=7 observations whose x values are nearest to -1. But there are 10 observations whose distance to x = -1 is exactly 0. The loess algorithm can use any of the "10 choose 7" = 120 possible sets of observations that have x = -1. When k=7, the prediction at x = -1 is unique only when there are seven or fewer observations for which x = -1.
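You can verify that count with the COMB function:

data _null_;
   nWays = comb(10, 7);   /* ways to choose 7 of the 10 tied observations */
   put nWays=;            /* nWays=120 */
run;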

The graph shows a similar situation for the smoothing parameter α=0.2, which corresponds to k=4 nearest neighbors. Again, the loess algorithm can choose any set of four observations to estimate the predicted value. For this example, the first four observations in the data set were used, as shown by the light red rectangle on the graph. This is a valid choice, but it is not the only choice.

Warning about this issue

When you use PROC LOESS to fit a loess smoother, the procedure displays a warning in the log to alert you to the fact that the predicted values are not unique:

proc loess data=Simple plots(only)=fitplot;
   model y=x / smooth=(0.34 0.2) direct;
run;
WARNING: At smoothing parameter 0.34, the local SSCP matrix for
         21 fit point(s) is numerically singular. The fitted
         value and standard errors at those points are not
         uniquely defined.
WARNING: At smoothing parameter 0.2, the local SSCP matrix for
         21 fit point(s) is numerically singular. The fitted
         value and standard errors at those points are not
         uniquely defined.

Notice that the warning from PROC LOESS states that "the fitted value" is "not uniquely defined." The LOESS statement in PROC SGPLOT uses the same algorithm as PROC LOESS but does not display a warning. The loess algorithm in other statistical software might issue an error and refuse to carry out the regression.

A loess prediction might depend on the order of the data

Because the fitted value is not uniquely defined, the fitted value might change if you change the order of the data. The previous graph suggests that the predicted value at x = -1 uses the first k observations in the data set. Let's sort the data in a different order and see if the predicted value changes:

proc sort data=Simple out=YDesc;
   by x  descending y;
run;
 
title2 "Y Descending";
proc sgplot data=YDesc;
   scatter x=x y=y;
   loess x=x y=y / smooth=0.34  legendlabel="k=7 nbrs" nomarkers name='L1';
   loess x=x y=y / smooth=0.2   legendlabel="k=4 nbrs" nomarkers name='L2';
   keylegend 'L1' 'L2' / title="Nearest Neighbors" location=inside position=S across=1;
run;

As suspected, the predicted value at x = -1 is different from before. The predicted value is the average of the first k observations, but now the loess algorithm is using a different set of seven observations. This prediction is neither more nor less correct than the earlier predicted value.

What to do if you have repeated values?

I've described a problem, but what can you do about it? The simplest advice is to never use the loess algorithm when the explanatory variable has repeated values. Instead, you can use a spline smoother, such as a penalized B-spline. (PROC SGPLOT supports the PBSPLINE statement for overlaying a penalized B-spline.)
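For the example data, a minimal sketch of the spline alternative might look like the following (the options are illustrative):

/* a spline smoother handles the tied X values without ambiguity */
title "Penalized B-Spline Fit";
proc sgplot data=Simple;
   scatter x=x y=y;
   pbspline x=x y=y / nomarkers;
run;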

If you don't want to avoid the loess algorithm, at least be aware that you will get predicted values that are not unique. Also, realize that the predicted values might depend on the order of the data.

For the example in this article, I intentionally chose data and a small smoothing parameter so that the problem was obvious. If you have only a few repeated values in your data, or if you use larger values of the smoothing parameter, you might not notice the effect of the repeated values.

Every statistical procedure makes certain assumptions about the data. The best way to use a statistical algorithm is to be aware of the assumptions and know what happens if you violate them. For the loess regression algorithm, the algorithm assumes that there are no tied values among the explanatory variables.

Summary

Be cautious when you use the loess algorithm to add a scatter plot smoother. The algorithm is based on ideas from time series analysis and implicitly assumes that there are no repeated values for the explanatory variable. If you have tied values, then the predicted value at certain locations might not be unique and can depend on the order of observations in the data set. The effect is most pronounced when you use a very small value of the smoothing parameter or when the data contain many tied values.

In closing, I'll add that this article discusses the loess algorithm from a mathematical point of view. In practice, implementations such as PROC LOESS in SAS might use computational methods such as interpolation, k-d trees, and blending to improve the computational efficiency of the algorithm. This means that the algorithm does not actually choose the k nearest neighbors directly for each predicted value. However, the general issue that I describe is still relevant, as shown in the artificial example in this article.

The post Beware of repeated values in loess models appeared first on The DO Loop.