Numerical Analysis

October 5, 2020
 

Finite-precision computations can be tricky. You might know, mathematically, that a certain result must be non-negative or must be within a certain interval. However, when you actually compute that result on a computer that uses finite-precision, you might observe that the value is slightly negative or slightly outside of the interval. This frequently happens when you are adding numbers that are not exactly representable in base 2. One of the most famous examples is the sum
    0.1 + 0.1 + 0.1 ≠ 0.3 (finite precision),
which is shown to every student in Computer Science 101. Other examples include:

  • If x is in the interval [0,1], then y = 1 - x¹⁰ is also in the interval [0,1]. That is true in exact precision, but not necessarily true in finite-precision computations when x is close to 1.
  • Although sin²(t) + cos²(t) = 1 in exact arithmetic for all values of t, the equation might not be true in finite precision.

SAS programs that demonstrate the previous finite-precision computations are shown at the end of this article. Situations like these can cause problems when you want to use the result in a function that has a restricted domain. For example, the SQRT function cannot operate on negative numbers. For probability distributions, the quantile function cannot operate on numbers outside the interval [0,1].

This article discusses how to "trap" results that are invalid because they are outside of a known interval. You can use IF-THEN/ELSE logic to catch an invalid result, but SAS provides more compact syntax. This article discusses using the IFN function in Base SAS and using the "elementwise maximum" (<>) and "elementwise minimum" (><) operators in the SAS/IML language.

Trap and map

I previously wrote about the "trap and cap" technique for handling functions like log(x). The idea is to "trap" invalid input values (x ≤ 0) and "cap" the output when you are trying to visualize the function. For the present article, the technique is "trap and map": you need to detect invalid computations and then map the result back into the interval that you know is mathematically correct. For example, if the argument is supposed to represent a probability in the interval [0,1], you must map negative numbers to 0 and map numbers greater than 1 to 1.

How to use the IFN function in SAS

Clearly, you can "trap and map" by using IF-THEN/ELSE logic. For example, suppose you know that a variable, x, must be in the interval [0,1]. You can use the SAS DATA step to trap invalid values and map them into the interval, as follows:

/* trap invalid values of x and map them into [0,1]. Store result in y */
data Map2;
input x @@;
if x<0 then 
   y = 0;
else if x>1 then 
   y = 1;
else 
   y = x;
datalines;
1.05 1 0.9 0.5 0 -1e-16 -1.1
;

This program traps the invalid values, but it requires six lines of IF-THEN/ELSE logic. A more compact syntax is to use the IFN function. The IFN function has three arguments:

  • The first argument is a logical expression, such as x < 0.
  • The second argument is the value to return when the logical expression is true.
  • The third argument is the value to return when the logical expression is false.

For example, you can use the function call IFN(x<0, 0, x) to trap negative values of x and map those values to 0. Similarly, you can use the function call IFN(x>1, 1, x) to trap values of x that are greater than 1 and map those values to 1. To perform both of the trap-and-map operations on one line, you can nest the function calls so that the third argument to the IFN function is itself a function call, as follows:

data Map2;
input x @@;
y = ifn(x<0, 0, ifn(x>1, 1, x));   /* trap and map: force x into [0,1] */
 
/* or, for clarity, split into two simpler calls */
temp = ifn(x>1, 1, x);             /* force x <= 1 */
z = ifn(x<0, 0, temp);             /* force x >= 0 */
datalines;
1.05 1 0.9 0.5 0 -1e-16 -1.1
;
 
proc print noobs; run;

The output shows that the program performed the trap-and-map operation correctly. All values of y are in the interval [0,1].

If you can't quite see how the logic works in the statement that nests the two IFN calls, you can split the logic into two steps as shown later in the program. The first IFN call traps any variables that are greater than 1 and maps them to 1. The second IFN call traps any variables that are less than 0 and maps them to 0. The z variable has the same values as the y variable.

How to use the element min/max operators in SAS/IML

The SAS/IML language contains operators that perform similar computations. If x is a vector of numbers, then

  • The expression (x <> 0) returns a vector that is the elementwise maximum between the elements of x and 0. In other words, any elements of x that are less than 0 get mapped to 0.
  • The expression (x >< 1) returns a vector that is the elementwise minimum between the elements of x and 1. In other words, any elements of x that are greater than 1 get mapped to 1.
  • You can combine the operators. The expression (x <> 0) >< 1 forces all elements of x into the interval [0,1]. Because the elementwise operators are commutative, you can also write the expression as 0 <> x >< 1.

These examples are shown in the following SAS/IML program:

proc iml;
x = {1.05, 1, 0.9, 0.5, 0, -1e-16, -1.1};
y0 = (x <> 0);                     /* force x >= 0 */
y1 = (x >< 1);                     /* force x <= 1 */
y01 = (x <> 0) >< 1;               /* force x into [0,1] */
print x y0 y1 y01;

Summary

In summary, because of finite-precision computations (or just bad input data), it is sometimes necessary to trap invalid values and map them into an interval. Using IF-THEN/ELSE statements is clear and straightforward, but it requires multiple lines of code. You can perform the same logical trap-and-map calculations more compactly by using the IFN function in Base SAS or by using the elementwise minimum and maximum operators in SAS/IML.

Appendix: Examples of floating-point computations to study

The following SAS DATA step programs show typical computations for which the floating-point computation can produce results that are different from the expected mathematical (full-precision) computation.

data FinitePrecision;
z = 0.1 + 0.1 + 0.1;
if z^=0.3 then 
   put 'In finite precision: 0.1 + 0.1 + 0.1 ^= 0.3';
run;
 
data DomainError1;
do t = -4 to 4 by 0.1;
   x = cos(t);
   y = sin(t);
   z = x**2 + y**2;  /* always = 1, mathematically */
   s = sqrt(1 - z);  /* s = sqrt(0) */
   output;
end;
run;
 
data DomainError2;
do x = 0 to 1 by 0.01;
   z = 1 - x**10;  /* always in [0,1], mathematically */
   s = sqrt(z);  
   output;
end;
run;
August 19, 2020
 

Finding the root (or zero) of a nonlinear function is an important computational task. In the case of a one-variable function, you can use the SOLVE function in PROC FCMP to find roots of nonlinear functions in the DATA step. This article shows how to use the SOLVE function to find the roots of a user-defined function from the SAS DATA step.

Some ways to find the root of a nonlinear function in SAS

I have previously blogged about several ways to find the roots of nonlinear equations in SAS.

The DATA step does not have a built-in root-finding function, but you can implement one by using PROC FCMP.

PROC FCMP: Implement user-defined functions in SAS

PROC FCMP enables you to write user-defined functions that can be called from the DATA step and from SAS procedures that support DATA step programming statements (such as PROC NLIN, PROC NLMIXED, and PROC MCMC). To demonstrate finding the roots of a custom function, let's use the same function that I used to show how to find roots by using SAS/IML. The following statements use PROC FCMP to define a function (FUNC1) and to store the function to WORK.FUNCS. You must use the CMPLIB= system option to tell the DATA step to look in WORK.FUNCS for unknown functions. The DATA step evaluates the function for input arguments in the interval [-3, 3]. The SGPLOT procedure creates a graph of the function:

/* use PROC FCMP to create a user-defined function (FUNC1) */
proc fcmp outlib=work.funcs.Roots;
   function Func1(x);         /* create and store a user-defined function */
      return( exp(-x**2) - x**3 + 5*x +1 );
   endsub;
quit;
options cmplib=work.funcs;   /* DATA step will look here for unresolved functions */
 
data PlotIt;   
do x = -3 to 3 by 0.1;       /* evaluate function on [-3, 3] */
   y = Func1(x);
   output;
end;
run;
 
title "Graph of User-Defined Function";
proc sgplot data=PlotIt;
   series x=x y=y;
   refline 0 / axis=y;
   xaxis grid;
run;

From the graph, it appears that the function has three roots (zeros). The approximate locations of the roots are {-2.13, -0.38, 2.33}, as found in my previous article. The next section shows how to use the SOLVE function in PROC FCMP to find these roots.

Use the SOLVE function to find roots

The SOLVE function is not a DATA step function. It is a "special function" in PROC FCMP. According to the documentation for the special functions, "you cannot call these functions directly from the DATA step. To use these functions in a DATA step, you must wrap the special function inside another user-defined FCMP function." You can, however, call the SOLVE function directly from procedures such as NLIN, NLMIXED, and MCMC.

I was confused when I first read the documentation for the SOLVE function, so let me give an overview of the syntax. The SOLVE function enables you to find a value x such that f(x) = y0 for a "target value", y0. Thus, the SOLVE function enables you to find roots of the function g(x) = f(x) – y0. However, in this article, I will set y0 = 0 so that x will be a root of f.

Because the function might have multiple roots, you need to provide a guess (x0) that is close to the root. The SOLVE function will start with your initial guess and apply an iterative algorithm to obtain the root. Thus, you need to provide the following information to the SOLVE function:

  • The name of an FCMP function (such as "Func1") that will evaluate the function.
  • An initial guess, x0, that is near a root. (You can also provide convergence criteria that tells the iterative algorithm when it has found an approximate root.)
  • The parameters to the function. A missing value indicates the (one) parameter that you are solving for. For a function of one variable, you will specify a missing value for x, which tells the SOLVE function to solve for x. In general, a function can take multiple parameters, so use a missing value in the parameter list to indicate which parameter you are solving for.

The following SAS program defines a new FCMP function called Root_Func1. That function takes one argument, which is the initial guess for a root. Inside the function, the SOLVE function finds the root of the "Func1" function (defined earlier). If it was successful, it returns the root to the caller. To test the Root_Func1 function, I call it from a DATA step and pass in the values -2, 0, and +2. I obtained these guesses by looking at the graph of the Func1 function.

/* to call SOLVE in the DATA step, wrap it inside an FCMP function */
proc fcmp outlib=work.funcs.Roots;
   function Root_Func1(x0);
      array opts[5] init absconv relconv maxiter status; /* initialize to missing */
      opts[1] = x0;        /* opts[1] is the initial guess */
      z = solve("Func1",   /* name of function */
                opts,      /* initial condition, convergence criteria, and status */
                0,         /* target: find x so that f(x) = 0 */
                .);        /* Params for f. A missing value indicates param to solve for */
      if status=0 then 
         return( z );      /* root found */
      else return (.);     /* a root was not found */
   endsub;
quit;
 
data Roots1;
input x0 @@;   
root = Root_Func1(x0);     /* x0 is a "guess" for a root of Func1 */
datalines;
-2 0 2
;
 
proc print data=Roots1 noobs; run;

The output shows that the function has found three roots, which are approximately {-2.13, -0.38, 2.33}. These are the same values that I found by using the FROOT function in SAS/IML.
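
The target value does not have to be zero. The following sketch (not part of the original example; the wrapper name Root_Func1_Target and the target argument y0 are hypothetical) passes a general target value as the third argument to the SOLVE function, which enables you to find a value of x such that Func1(x) = y0:

/* Sketch: solve Func1(x) = y0 for a general target value, y0.
   Assumes Func1 and the CMPLIB= option from the earlier steps are in effect. */
proc fcmp outlib=work.funcs.Roots;
   function Root_Func1_Target(x0, y0);
      array opts[5] init absconv relconv maxiter status;
      opts[1] = x0;        /* opts[1] is the initial guess */
      z = solve("Func1",   /* name of function */
                opts,      /* initial condition, convergence criteria, and status */
                y0,        /* target: find x so that Func1(x) = y0 */
                .);        /* missing value indicates the parameter to solve for */
      if status=0 then 
         return( z );      /* solution found */
      else return (.);     /* a solution was not found */
   endsub;
quit;
 
data Target;
x = Root_Func1_Target(2, 3);   /* find x near 2 such that Func1(x) = 3 */
run;
 
proc print data=Target noobs; run;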

The roots of functions that have parameters

Sometimes it is useful to include parameters in a function. For example, the following function of x contains two parameters, a and b:
    g(x; a, b) = exp(-x²) + a*x³ + b*x + 1
Notice that the first function (Func1) is a special case of this function because f(x) = g(x; -1, 5). You can include parameters in the FCMP functions that define the function and that find the roots. When you call the SOLVE function, the last arguments are a list of parameters (x, a, b) and you should give specific values to the a and b parameters and use a missing value to indicate that the x variable is the variable that you want to solve for. This is shown in the following program:

/* you can also define a function that depends on parameters */
proc fcmp outlib=work.funcs.Roots;
   /* define the function. Note Func1(x) = Func2(x; a=-1, b=5) */
   function Func2(x, a, b);
      return( exp(-x**2) + a*x**3 + b*x +1 );
   endsub;
 
   function Root_Func2(x0, a, b);
      array opts[5] init absconv relconv maxiter status; /* initialize missing */
      init = x0;           /* pass in initial guess */
      z = solve("Func2",   /* name of function */
                opts,      /* initial condition, convergence opts, status */
                0,         /* Find x so that f(x) = 0 */
                ., a, b ); /* Params for f. A missing value indicates param to solve for */
      if status=0 then 
         return( z );      /* root found */
      else return (.);     /* a root was not found */
   endsub;
quit;
 
/* Evaluate Func2 for x in [-3,3] for three pairs of (a,b) values:
   (-1,5), (-1,3), and (-2,1) */
data PlotIt;
do x = -3 to 3 by 0.1;
   y = Func2(x, -1, 5); lbl="a=-1; b=5";  output;
   y = Func2(x, -1, 3); lbl="a=-1; b=3";  output;
   y = Func2(x, -2, 1); lbl="a=-2; b=1";  output;
end;
run;
 
title "Graph of User-Defined Functions";
proc sgplot data=PlotIt;
   series x=x y=y / group=lbl;
   refline 0 / axis=y;
   xaxis grid;
   yaxis min=-10 max=10;
run;

The three graphs are shown. You can see that two of the graphs have three roots, but the third graph has only one root. You can call the Root_Func2 function from the DATA step to find the roots for each function:

/* find the roots of Func2 for (a,b) values (-1,5), (-1,3), and (-2,1) */
data Roots2;
input a b x0;
root = Root_Func2(x0, a, b);
datalines;
-1 5 -2 
-1 5  0 
-1 5  2 
-1 3 -2 
-1 3  0 
-1 3  2 
-2 1  0
-2 1  2
;
 
proc print data=Roots2 noobs; run;

Notice that the roots for (a,b) = (-1, 5) are exactly the same as the roots for Func1. The roots for the function when (a,b) = (-1, 3) are approximately {-1.51, -0.64, 1.88}. The root for the function when (a,b) = (-2, 1) is approximately 1.06.

Notice that the initial guess x0 = 0 did not converge to a root for the function that has (a,b) = (-2, 1). It is well known that Newton-type methods do not always converge. This is in contrast to Brent's method (used by the FROOT function in SAS/IML), which always converges to a root when you give it an interval that brackets a root.
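
For comparison, here is a small sketch (mine, not from the original program) that uses the FROOT function in SAS/IML to find the root of the (a,b) = (-2, 1) function. Brent's method needs an interval that brackets a root; the interval [0, 2] works because the function changes sign on it:

proc iml;
start g(x);
   return( exp(-x##2) - 2*x##3 + x + 1 );   /* Func2 with a=-2 and b=1 */
finish;
root = froot("g", {0 2});   /* Brent's method on the bracketing interval [0, 2] */
print root;                 /* the root is near 1.06 */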

Summary

In summary, this article shows how to use the SOLVE function in Base SAS to find the roots (zeros) of a nonlinear function of one variable. You first define a nonlinear function by using PROC FCMP. You can call the SOLVE function directly from some SAS procedures. However, if you want to use it in the DATA step, you need to define a second FCMP function that takes the function parameters and an initial guess and calls the SOLVE function. This article shows two examples of using the SOLVE function to find roots in SAS.

The post Find the root of a function by using the SAS DATA step appeared first on The DO Loop.

June 24, 2020
 

If you have ever run a Kolmogorov-Smirnov test for normality, you have encountered the Kolmogorov D statistic. The Kolmogorov D statistic is used to assess whether a random sample was drawn from a specified distribution. Although it is frequently used to test for normality, the statistic is "distribution free" in the sense that it compares the empirical distribution to any specified distribution. I have previously written about how to interpret the Kolmogorov D statistic and how to compute it in SAS.

If you can compute a statistic, you can use simulation to approximate its sampling distribution and to approximate its critical values. Although I am a fan of simulation, in this article I show how to compute the exact distribution and critical values for Kolmogorov D statistic. The technique used in this article is from Facchinetti (2009), "A Procedure to Find Exact Critical Values of Kolmogorov-Smirnov Test."

An overview of the Facchinetti method

Suppose you have a sample of size n and you perform a Kolmogorov-Smirnov test that compares the empirical distribution to a reference distribution. You obtain the test statistic D. To know whether you should reject the null hypothesis (that the sample came from the reference distribution), you need to be able to compute the probability CDF(D) = Pr(Dₙ ≤ D). Facchinetti shows that you can compute the probability for any D by solving a system of linear equations. If a random sample contains n observations, you can form a certain (2n+2) x (2n+2) matrix whose entries are composed of binomial probabilities. You can use a discrete heat map to display the structure of the matrix that is used in the Facchinetti method. One example is shown below. For n=20, the matrix is 42 x 42. As Facchinetti shows in Figure 6 of her paper, the matrix has a block structure, with triangular matrices in four blocks.

The right-hand side of the linear system is also composed of binomial probabilities. The system is singular, but you can use a generalized inverse to obtain a solution. From the solution, you can compute the probability. For the details, see Facchinetti (2009).
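
The following sketch is a generic example (not the actual Facchinetti system) that shows how the GINV function in SAS/IML computes a generalized inverse, which you can use to solve a consistent but singular linear system:

proc iml;
A = {1 2 3,
     2 4 6,                 /* the second row is twice the first, so A is singular */
     1 0 1};
b = {6, 12, 2};             /* b is consistent with A, so a solution exists */
x = ginv(A) * b;            /* minimum-norm solution from the generalized inverse */
print x (A*x)[L="A*x"];     /* A*x reproduces b */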

Facchinetti includes a MATLAB program in the appendix of her paper. I converted the MATLAB program into SAS/IML and improved the efficiency by vectorizing some of the computations. You can download the SAS/IML program that computes the CDF of the Kolmogorov D statistic and all the graphs in this article. The name of the SAS/IML function is KolmogorovCDF, which requires two arguments, n (sample size) and D (value of the test statistic).

Examples of using the CDF of the Kolmogorov distribution

Suppose you have a sample of size 20. You can use a statistical table to look up the critical value of the Kolmogorov D statistic. For α = 0.05, the (two-sided) critical value for n=20 is D = 0.294. If you load and call the KolmogorovCDF function, you can confirm that the CDF for the Kolmogorov D distribution is very close to 0.95 when D=0.294:

proc iml;
load  module=KolmogorovCDF;      /* load the function that implements the Facchinetti method */
/* simple example of calling the function */
n = 20;
D = 0.29408;
cdf = KolmogorovCDF(n, D);
print n D cdf;

The interpretation of this result is that if you draw a random sample of size 20 from the reference distribution, there is a 5% chance that the Kolmogorov D value will exceed 0.294. A test statistic that exceeds 0.294 implies that you should reject the null hypothesis at the 0.05 significance level.

Visualize the Kolmogorov D distribution

If you can compute the CDF for one value of D, you can compute it for a sequence of D values. In this way, you can visualize the CDF curve. The following SAS/IML statements compute the CDF for the Kolmogorov D distribution when n=20 by running the computation on a sequence of D values in the open interval (0, 1):

D = do(0.01, 0.99, 0.01);             /* sequence of D values */
CDF = j(1, ncol(D), .);
do i = 1 to ncol(D);
   CDF[i] = KolmogorovCDF(n, D[i]);   /* CDF for each D */ 
end;
title "Kolmogorov CDF, n=20";
call series(D, CDF) grid={x y};       /* create a graph of the CDF */

The graph enables you to read probabilities and quantiles for the D₂₀ distribution. The graph shows that almost all random samples of size 20 (drawn from the reference distribution) will have a D statistic of less than 0.4. Geometrically, the D statistic represents the largest gap between the empirical distribution and the reference distribution. A large gap indicates that the sample was probably not drawn from the reference distribution.

Visualize a family of Kolmogorov D distributions

The previous graph was for n=20, but you can repeat the computation for other values of n to obtain a family of CDF curves for the Kolmogorov D statistic. (Facchinetti writes Dₙ to emphasize that the statistic depends on the sample size.) The computation is available in the program that accompanies this article. The results are shown below:

This graph enables you to estimate probabilities and quantiles for several Dₙ distributions.

Visualize a family of density curves

Because the PDF is the derivative of the CDF, you can use a forward difference approximation of the derivative to also plot the density of the distribution. The following density curves are derived from the previous CDF curves:
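
As a sketch of the idea for the n=20 case (it assumes the D and CDF vectors from the earlier step are still defined in the PROC IML session), you can approximate the density by using forward differences of the CDF values:

/* approximate the density as the slope of the CDF (forward differences) */
k    = ncol(D);
PDF  = ( CDF[2:k] - CDF[1:(k-1)] ) / ( D[2:k] - D[1:(k-1)] );
midD = ( D[1:(k-1)] + D[2:k] ) / 2;    /* evaluate the slope at interval midpoints */
title "Approximate Density of the Kolmogorov D Statistic, n=20";
call series(midD, PDF) grid={x y};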

A table of critical values

If you can compute the CDF, you can compute quantiles by using the inverse CDF. You can use a root-finding technique to obtain a quantile.

In the SAS/IML language, the FROOT function can find univariate roots. The critical values of the Dₙ statistic are usually tabulated by listing the sample size down the rows and the significance level (α) or confidence level (1-α) across the columns. Facchinetti (p. 352) includes a table of critical values in her paper. The following SAS/IML statements show how to recreate the table. For simplicity, I show only a subset of the table for n=20, 25, 30, and 35, and for four common values of α.

/* A critical value is the root of the function
   f(D) = KolmogorovCDF(n, D) - (1-alpha)
   for any specified n and alpha value.
*/
start KolmogorovQuantile(D) global (n, alpha);
   return  KolmogorovCDF(n, D) - (1-alpha);
finish;
 
/* create a table of critical values for the following vector of n and alpha values */
nVals = do(20, 35, 5);
alphaVals = {0.001 0.01 0.05 0.10};
CritVal = j(ncol(nVals), ncol(alphaVals), .);
do i = 1 to ncol(nVals);
   n = nVals[i];
   do j = 1 to ncol(alphaVals);
      alpha = alphaVals[j];
      CritVal[i,j] = froot("KolmogorovQuantile", {0.01, 0.99});
   end;
end;
print CritVal[r=nVals c=alphaVals format=7.5 L="Critical Values (n,alpha)"];

For each sample size, the corresponding row of the table shows the critical values of D for which CDF(D) = 1-α. Of course, a printed table is unnecessary when you have a programmatic way to compute the quantiles.

Summary

Facchinetti (2009) shows how to compute the CDF for the Kolmogorov D distribution for any value of n and D. The Kolmogorov D statistic is often used for goodness-of-fit tests. Facchinetti's method consists of forming and solving a system of linear equations. To make her work available to the SAS statistical programmer, I translated her MATLAB program into a SAS/IML program that computes the CDF of the Kolmogorov D distribution. This article demonstrates how to use the KolmogorovCDF in SAS/IML to compute and visualize the CDF and density of the Kolmogorov D statistic. It also shows how to use the computation in conjunction with a root-finding method to obtain quantiles and exact critical values of the Kolmogorov D distribution.

The post The Kolmogorov D distribution and exact critical values appeared first on The DO Loop.

June 1, 2020
 

In a previous article, I discussed the definition of the Kullback-Leibler (K-L) divergence between two discrete probability distributions. For completeness, this article shows how to compute the Kullback-Leibler divergence between two continuous distributions. When f and g are discrete distributions, the K-L divergence is the sum of f(x)*log(f(x)/g(x)) over all x values for which f(x) > 0. When f and g are continuous distributions, the sum becomes an integral:
KL(f,g) = ∫ f(x)*log( f(x)/g(x) ) dx
which is equivalent to
KL(f,g) = ∫ f(x)*( log(f(x)) – log(g(x)) ) dx
The integral is over the support of f, and clearly g must be positive inside the support of f for the integral to be defined.

The Kullback-Leibler divergence between normal distributions

I like to perform numerical integration in SAS by using the QUAD subroutine in the SAS/IML language. You specify the function that you want to integrate (the integrand) and the domain of integration and get back the integral on the domain. Use the special missing value .M to indicate "minus infinity" and the missing value .P to indicate "positive infinity."

As a first example, suppose that f(x) is the pdf of the normal distribution N(μ₁, σ₁) and g(x) is the pdf of the normal distribution N(μ₂, σ₂). From the definition, you can calculate the Kullback-Leibler divergence between normal distributions exactly:
KL between Normals: log(σ₂/σ₁) + (σ₁² – σ₂² + (μ₁ – μ₂)²) / (2*σ₂²)

You can define the integrand by using the PDF and LOGPDF functions in SAS. The following computation integrates on the infinite interval (-∞, ∞), although a "six sigma" interval about μ₁, such as [-6, 6], would give the same result:

proc iml;
/* Example 1: K-L divergence between two normal distributions */
start KLDivNormal(x) global(mu1, mu2, sigma1, sigma2);
   return pdf("Normal", x, mu1, sigma1) #
          (logpdf("Normal", x, mu1, sigma1) - logpdf("Normal", x, mu2, sigma2));
finish;
 
/* f is PDF for N(0,1) and g is PDF for N(2,3) */ 
mu1 = 0; sigma1 = 1;
mu2 = 2; sigma2 = 3;
 
call quad(KL, "KLDivNormal", {.M .P});  /* integral on (-infinity, infinity) */
KLExact = log(sigma2/sigma1) + 
          (sigma1##2 - sigma2##2+ (mu1-mu2)##2) / (2*sigma2##2);
print KL KLExact (KL-KLExact)[L="Difference"];

The output indicates that the numerical integration agrees with the exact result to many decimal places.
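
As a quick check of the "six sigma" claim (a sketch that reuses the KLDivNormal module defined above), you can integrate on the finite interval [-6, 6] instead of the infinite interval:

call quad(KL6, "KLDivNormal", {-6 6});   /* integrate on a finite "six sigma" interval */
print KL6;                               /* agrees with KL to many decimal places */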

The Kullback-Leibler divergence between exponential and gamma distributions

A blog post by John D. Cook computes the K-L divergence between an exponential and a gamma(a=2) distribution. This section performs the same computation in SAS.

This is a good time to acknowledge that numerical integration can be challenging. Numerical integration routines have default settings that enable you to integrate a wide variety of functions, but sometimes you need to modify the default parameters to get a satisfactory result—especially for integration on infinite intervals. I have previously written about how to use the PEAK= option for the QUAD routine to achieve convergence in certain cases by providing additional guidance to the QUAD routine about the location and scale of the integrand. Following the advice I gave in the previous article, a value of PEAK=0.1 seems to be a good choice for the K-L divergence between the exponential and gamma(a=2) distributions: it is inside the domain of integration and is close to the mode of the integrand, which is at x=0.

/* Example 2: K-L divergence between exponential and gamma(2). Example from
   https://www.johndcook.com/blog/2017/11/08/why-is-kullback-leibler-divergence-not-a-distance/
*/
start KLDivExpGamma(x) global(a);
   return pdf("Exponential", x) #
          (logpdf("Exponential", x) - logpdf("Gamma", x, a));
finish;
 
a = 2;
/* Use PEAK= option: https://blogs.sas.com/content/iml/2014/08/13/peak-option-in-quad.html */
call quad(KL2, "KLDivExpGamma", {0 .P}) peak=0.1;
print KL2;

The value of the K-L divergence agrees with the answer in John D. Cook's blog post.

Approximating the Exponential Distribution by the Gamma Distribution

Recall that the K-L divergence is a measure of the dissimilarity between two distributions. For example, the previous example indicates how well the gamma(a=2) distribution approximates the exponential distribution. As I showed for discrete distributions, you can use the K-L divergence to select the parameters for a distribution that make it the most similar to another distribution. You can do this by using optimization methods, but I will merely plot the K-L divergence as a function of the shape parameter (a) to indicate how the K-L divergence depends on the parameter.

You might recall that the gamma distribution includes the exponential distribution as a special case. When the shape parameter a=1, the gamma(1) distribution is an exponential distribution. Therefore, the K-L divergence between the exponential and the gamma(1) distribution should be zero.
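
You can verify this special case numerically with a quick sketch that reuses the KLDivExpGamma module and the PEAK= value from the previous section:

a = 1;                                            /* gamma(1) is the exponential distribution */
call quad(KL1, "KLDivExpGamma", {0 .P}) peak=0.1;
print KL1;                                        /* should be zero, up to numerical error */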

The following program computes the K-L divergence for values of a in the range [0.5, 2] and plots the K-L divergence versus the parameter:

/* Plot KL(Expo, Gamma(a)) for various values of a.
   Note that gamma(a=1) is the exponential distribution. */
aa = do(0.5, 2, 0.01);
KL = j(1, ncol(aa), .);
do i = 1 to ncol(aa);
   a = aa[i];
   call quad(KL_i, "KLDivExpGamma", {0 .P}) peak=0.1;
   KL[i] = KL_i;
end;
title "Kullback-Leibler Divergence Between Exponential and Gamma{a)";
call series(aa, KL) grid={x y} label={'Gamma Shape Parameter (a)' 'K-L Divergence'};

As expected, the graph of the K-L divergence reaches a minimum value at a=1, which is the best approximation to an exponential distribution by the gamma(a) distribution. Note that the K-L divergence equals zero when a=1, which indicates that the distributions are identical when a=1.

Summary

The Kullback-Leibler divergence between two continuous probability distributions is an integral. This article shows how to use the QUAD function in SAS/IML to compute the K-L divergence between two probability distributions.

The post The Kullback–Leibler divergence between continuous probability distributions appeared first on The DO Loop.

May 20, 2020
 

This article shows how to perform two-dimensional bilinear interpolation in SAS by using a SAS/IML function. It is assumed that you have observed the values of a response variable on a regular grid of locations.

A previous article showed how to interpolate inside one rectangular cell. When you have a grid of cells and a point (x, y) at which to interpolate, the first step is to efficiently locate which cell contains (x, y). You can then use the values at the corners of the cell to interpolate at (x, y), as shown in the previous article. For example, the adjacent graph shows the bilinear interpolation function that is defined by 12 points on a 4 x 3 grid.

You can download the SAS program that creates the analyses and graphs in this article.

Bilinear interpolation assumptions

For two-dimensional linear interpolation of points, the following sets of numbers are the inputs for bilinear interpolation:

  1. Grid locations: Let x₁ < x₂ < ... < xₚ be the horizontal coordinates of a regular grid. Let y₁ < y₂ < ... < yₙ be the vertical coordinates.
  2. Response Values: Let Z be an n x p matrix of response values on the grid. The (j, i)th value of Z is the response value at the location (xᵢ, yⱼ). Notice that the columns of Z correspond to the horizontal positions for the X values and the rows of Z correspond to the vertical positions for the Y values. (This might seem "backwards" to you.) The response values should not contain any missing values. The X, Y, and Z values (called the "fitting data") define the bilinear interpolation function on the rectangle [x₁, xₚ] x [y₁, yₙ].
  3. Values to score: Let (s₁, t₁), (s₂, t₂), ..., (sₖ, tₖ) be a set of k new (x, y) values at which you want to interpolate. All values must be within the range of the data: x₁ ≤ sᵢ ≤ xₚ and y₁ ≤ tᵢ ≤ yₙ for all i. These values are called the "scoring data" or the "query data."

The goal of interpolation is to estimate the response at each location (sᵢ, tᵢ) based on the values at the corners of the cell that contains the location. I have written a SAS/IML function called BilinearInterp, which you can freely download and use. The following statements indicate how to define and store the bilinearInterp function:

proc iml;
/* xGrd : vector of Nx points along x axis
   yGrd : vector of Ny points along y axis
   z    : Ny x Nx matrix. The value Z[i,j] is the value at xGrd[j] and yGrd[i]
   t    : k x 2 matrix of points at which to interpolate
   The function returns a k x 1 vector, which is the bilinear interpolation at each row of t. 
*/
start bilinearInterp(xgrd, ygrd, z, t);
   /* download function from 
      https://github.com/sascommunities/the-do-loop-blog/blob/master/interpolation/bilinearInterp.sas 
   */
finish;
store module=bilinearInterp;       /* store the module so it can be loaded later */
QUIT;

You only need to store the module one time. You then load the module in any SAS/IML program that needs to use it. You can learn more about storing and loading SAS/IML modules.

The structure of the fitting data

The fitting data for bilinear interpolation consists of the grid of points (Z) and the X and Y locations of each grid point. The X and Y values do not need to be evenly spaced, but they can be.

Often the fitting data are stored in "long form" as triplets of (X, Y, Z) values. If so, the first step is to read the triplets and convert them into a matrix of Z values where the columns are associated with the X values and the rows are associated with the Y values. For example, the following example data set specifies a response variable (Z) on a 4 x 3 grid. The values of the grid in the Y direction are {0, 1, 1.5, 3} and the values in the X direction are {1, 2, 5}.

As explained in the previous article, the data must be sorted by Y and then by X. The following statements read the data into a SAS/IML matrix and extract the grid points in the X and Y direction:

data Have;
input x y z;
datalines;
1 0   6 
2 0   7 
5 0   5 
1 1   9 
2 1   9 
5 1   7 
1 1.5 10 
2 1.5 9 
5 1.5 6 
1 3   12 
2 3   9 
5 3   8 
;
 
/* If the data are not already sorted, sort by Y and X */
proc sort data=Have; by Y X; run; 
 
proc iml;
use Have; read all var {x y z}; close;
xGrd = unique(x);                      /* find grid points */                     
yGrd = unique(y);
z = shape(z, ncol(yGrd), ncol(xGrd)); /* data must be sorted by y, then x */
print z[r=(char(yGrd)) c=(char(xGrd))];

Again, note that the columns of Z represent the X values and the rows represent the Y values. The xGrd and yGrd vectors contain the grid points along the horizontal and vertical dimensions of the grid, respectively.

The structure of the scoring data

The bilinearInterp function assumes that the points at which to interpolate are stored in a k x 2 matrix. Each row is an (x,y) location at which to interpolate from the fitting data. If an (x,y) location is outside of the fitting data, then the bilinearInterp function returns a missing value. The scoring locations do not need to be sorted.

The following example specifies six valid points and two invalid interpolation points:

t = {0   0,     /* not in data range */
     1   1,
     1   2,
     2.5 2,
     4   0.5,
     4   2,
     5   3,
     6   3};    /* not in data range */
 
/* LOAD the previously stored function */
load module=bilinearInterp;  
F = bilinearInterp(xGrd, yGrd, z, t);
print t[c={'x' 'y'}] F[format=Best4.];

The grid is defined on the rectangle [1,5] x [0,3]. The first and last rows of t are outside of the rectangle, so the interpolation returns missing values at those locations. The locations (1,1) and (5,3) are grid locations (corners of a cell), and the interpolation returns the correct response values. The other locations are in one of the grid cells. The interpolation is found by scaling the appropriate rectangle onto the unit square and applying the bilinear interpolation formula, as shown in the previous article.

Visualizing the bilinear interpolation function

You can use the ExpandGrid function to generate a fine grid of locations at which to interpolate. You can visualize the bilinear interpolation function by using a heat map or by using a surface plot.
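
As a rough sketch of the idea (it assumes the xGrd, yGrd, and z matrices from the earlier step and the loaded bilinearInterp module), you can score a fine grid and display the interpolant with the HEATMAPCONT subroutine:

xFine = do(1, 5, 0.05);                   /* fine grid on the data range [1,5] x [0,3] */
yFine = do(0, 3, 0.05);
s = ExpandGrid(yFine, xFine);             /* X changes fastest, so put it second */
s = s[ ,{2 1}];                           /* reorder the columns to (x, y) */
F = bilinearInterp(xGrd, yGrd, z, s);
Q = shape(F, ncol(yFine), ncol(xFine));   /* rows correspond to Y, columns to X */
call heatmapcont(Q) xValues=xFine yValues=yFine   /* note: the Y axis points down by default */
     title="Bilinear Interpolation on a 4 x 3 Grid";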

The heat map visualization is shown at the top of this article. The colors indicate the interpolated values of the response variable. The markers indicate the observed values of the response variable. The reference lines show the grid that is defined by the fitting data.

The following graph visualizes the interpolation function by using a surface plot. The function is piecewise quadratic. Within each cell, it is linear on lines of constant X and constant Y. The "ridge lines" on the surface correspond to the locations where two adjacent grid cells meet.

Summary

In summary, you can perform bilinear interpolation in SAS by using a SAS/IML function. The function uses fitting data (X, Y, and Z locations on a grid) to interpolate. Inside each cell, the interpolation is quadratic and is linear on lines of constant X and constant Y. For each location point, the interpolation function first determines what cell the point is in. It then uses the corners of the cell to interpolate at the point. You can download the SAS program that performs bilinear interpolation and generates the tables and graphs in this article.

The post Bilinear interpolation in SAS appeared first on The DO Loop.

May 18, 2020
 

I've previously written about linear interpolation in one dimension. Bilinear interpolation is a method for two-dimensional interpolation on a rectangle. If the value of a function is known at the four corners of a rectangle, an interpolation scheme gives you a way to estimate the function at any point in the rectangle's interior. Bilinear interpolation is a weighted average of the values at the four corners of the rectangle. For an (x,y) position inside the rectangle, the weights are determined by the distance between the point and the corners. Corners that are closer to the point get more weight. The generic result is a saddle-shaped quadratic function, as shown in the contour plot to the right.

The Wikipedia article on bilinear interpolation provides a lot of formulas, but the article is needlessly complicated. The only important formula is how to interpolate on the unit square [0,1] x [0,1]. To interpolate on any other rectangle, simply map your rectangle onto the unit square and do the interpolation there. This trick works for bilinear interpolation because the weighted average depends only on the relative position of a point and the corners of the rectangle. Given a rectangle with lower-left corner (x0, y0) and upper right corner (x1, y1), you can map it into the unit square by using the transformation (x, y) → (u, v) where u = (x-x0)/(x1-x0) and v = (y-y0)/(y1-y0).
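
For example, here is a tiny worked sketch of the mapping (the rectangle and the point are arbitrary values chosen for illustration):

data MapToUnitSquare;
x0 = 2;  x1 = 6;                  /* rectangle is [2,6] x [10,20] */
y0 = 10; y1 = 20;
x = 3;   y = 15;                  /* a point inside the rectangle */
u = (x - x0) / (x1 - x0);         /* u = 0.25 */
v = (y - y0) / (y1 - y0);         /* v = 0.50 */
run;
 
proc print data=MapToUnitSquare noobs; var u v; run;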

Bilinear interpolation on the unit square

Suppose that you want to interpolate on the unit square. Assume you know the values of a continuous function at the corners of the square:

  • z00 is the function value at (0,0), the lower left corner
  • z10 is the function value at (1,0), the lower right corner
  • z01 is the function value at (0,1), the upper left corner
  • z11 is the function value at (1,1), the upper right corner

If (x,y) is any point inside the unit square, the interpolation at that point is the following weighted average of the values at the four corners:
F(x,y) = z00*(1-x)*(1-y) + z10*x*(1-y) + z01*(1-x)*y + z11*x*y

Notice that the interpolant is linear with respect to the values of the corners of the square. It is quadratic with respect to the location of the interpolation point. The interpolant is linear along every horizontal line (lines of constant y) and along every vertical line (lines of constant x).

Implement bilinear interpolation on the unit square in SAS

Suppose that the function values at the corners of a unit square are z00 = 0, z10 = 4, z01 = 2, and z11 = 1. For these values, the bilinear interpolant on the unit square is shown at the top of this article. Notice that the value of the interpolant at the corners agrees with the data values. Notice also that the interpolant is linear along each edge of the square. It is not easy to see that the interpolant is linear along each horizontal and vertical line, but that is true.

In SAS, you can use the SAS/IML matrix language to define a function that performs bilinear interpolation on the unit square. The function takes two arguments:

  • z is a four-element vector that specifies the values of the data at the corners of the square. The values are specified in the following order: lower left, lower right, upper left, and upper right. This is the "fitting data."
  • t is a k x 2 matrix, where each row specifies a point at which to interpolate. This is the "scoring data."

In a matrix language such as SAS/IML, the function can be written in vectorized form without using any loops:

proc iml;
start bilinearInterpSquare(z, t);
   z00 = z[1]; z10 = z[2]; z01 = z[3]; z11 = z[4]; 
   x = t[,1];  y = t[,2];
   cx = 1-x;                    /* cx = complement of x */
   return( (z00*cx + z10*x)#(1-y) + 
           (z01*cx + z11*x)#y );
finish;
 
/* specify the fitting data */
z = {0 4,                       /* bottom: values at (0,0) and (1,0) */
     2 1};                      /* top: values at (0,1) and (1,1) */
print z[c={'x=0' 'x=1'} r={'y=0' 'y=1'}];
 
t = {0.5 0,                     /* specify the scoring data */
     0    0.25,
     1    0.66666666,
     0.5  0.75,
     0.75 0.5     };
F = bilinearInterpSquare(z, t); /* test the bilinear interpolation function */
print t[c={'x' 'y'} format=FRACT.] F;

For points on the edge of the square, you can check a few values in your head:

  • The point (1/2, 0) is halfway between the corners with values 0 and 4. Therefore the interpolation gives 2.
  • The point (0, 1/4) is a quarter of the way between the corners with values 0 and 2. Therefore the interpolation gives 0.5.
  • The point (1, 2/3) is two thirds of the way between the corners with values 4 and 1. Therefore the interpolation gives 2.

The other points are in the interior of the square. The interpolant at those points is a weighted average of the corner values. For example, the interpolant at (1/2, 3/4) is 0*(1/2)*(1/4) + 4*(1/2)*(1/4) + 2*(1/2)*(3/4) + 1*(1/2)*(3/4) = 1.625.

Visualize the interpolant function

Let's use the bilinearInterpSquare function to evaluate a grid of points. You can use the ExpandGrid function in SAS/IML to generate a grid of points. When we look at the values of the interpolant on a grid or in a heat map, we want the X values to change for each column and the Y values to change for each row. This means that you should reverse the usual order of the arguments to the ExpandGrid function:

/* for visualization, reverse the arguments to ExpandGrid and then swap 1st and 2nd columns of t */
xGrd = do(0, 1, 0.2);
yGrd = do(0, 1, 0.2);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
print Q[r=(char(YGrd,3)) c=(char(xGrd,3)) label="Bilinear Interpolant"];

The table shows the interpolant evaluated at the coordinates of a regular 6 x 6 grid. The column headers give the coordinate of the X variable and the row headers give the coordinates of the Y variable. You can see that each column and each row is an arithmetic sequence, which shows that the interpolant is linear on vertical and horizontal lines.

You can use a finer grid and a heat map to visualize the interpolant as a surface. Notice that in the previous table, the Y values increase as you go down the rows. If you want the Y axis to point up instead of down, you can reverse the rows of the grid (and the labels) before you create a heat map:

/* increase the grid resolution and visualize by using a heat map */
xGrd = do(0, 1, 0.1);
yGrd = do(0, 1, 0.1);
t = ExpandGrid(yGrd, xGrd);  /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];              /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
 
/* currently, the Y axis is pointing down. Flip it and the labels. */
H = Q[ nrow(Q):1, ];
reverseY = yGrd[ ,ncol(yGrd):1];
call heatmapcont(H) range={0 4} displayoutlines=0
     xValues=xGrd yValues=reverseY
     colorramp=palette("Spectral", 7)[,7:1]
     title="Interpolation at Multiple Points in the Unit Square";

The heat map agrees with the contour plot at the top of this article. The heat map also makes it easier to see that the bilinear interpolant is linear on each row and column.

In summary, you can create a short SAS/IML function that performs bilinear interpolation on the unit square. (You can download the SAS program that generates the tables and graphs in this article.) The interpolant is a quadratic saddle-shaped function inside the square. You can use the ideas in this article to create a function that performs bilinear interpolation on an arbitrary grid of fitting data. The next article shows a general function for bilinear interpolation in SAS.

The post What is bilinear interpolation? appeared first on The DO Loop.

May 13, 2020
 

This article shows how to find local minima and maxima on a regression curve, which means finding points where the slope of the curve is zero. An example appears at the right, which shows locations where the loess smoother in a scatter plot has local minima and maxima. Except for simple cases like quadratic regression, you need to use numerical techniques to locate these values.

In a previous article, I showed how to use SAS to evaluate the slope of a regression curve at specific points. The present article applies that technique by scoring the regression curve on a fine grid of points. You can use finite differences to approximate the slope of the curve at each point on the grid. You can then estimate the locations where the slope is zero.

In this article, I use the LOESS procedure to demonstrate the technique, but the method applies equally well to any one-dimensional regression curve. There are several ways to score a regression model:

  • Most parametric regression procedures in SAS (GLM, GLIMMIX, MIXED, ...) support the STORE statement. The STORE statement saves a representation of the model in a SAS item store. You can use PROC PLM to score a model from an item store.
  • Some nonparametric regression procedures in SAS do not support the STORE statement but support a SCORE statement. PROC ADAPTIVEREG and PROC LOESS are two examples. This article shows how to use the SCORE statement to find points where the regression curve has zero slope.
  • Some regression procedures in SAS do not support either the STORE or SCORE statements. For those procedures, you need to use the missing value trick to score the model.

The technique in this article will not detect inflection points. An inflection point is a location where the curve has zero slope but is not a local min or max. Consequently, this article is really about "how to find a point where a regression curve has a local extremum," but I will use the slightly inaccurate phrase "find points where the slope is zero."

How to find locations where the slope of the curve is zero?

For convenience, I assume the explanatory variable is named X and the response variable is named Y. The goal is to find locations where a nonparametric curve (x, f(x)) has zero slope, where f(x) is the regression model. The general outline follows:

  1. Create a grid of points in the range of the explanatory variable. The grid does not have to be evenly spaced, but it is in this example.
  2. Score the model at the grid locations.
  3. Use finite differencing to approximate the slope of the regression curve at the grid points. If the slope changes sign between consecutive grid points, estimate the location between the grid points where the slope is exactly zero. Use linear interpolation to approximate the response at that location.
  4. Optionally, graph the original data, the regression curve, and the point along the curve where the slope is zero.

Example data

SAS distributes the ENSO data set in the SASHelp library. You can create a DATA step view that renames the explanatory and response variables to X and Y, respectively, so that it is easier to follow the logic of the program:

/* Create VIEW where x is the independent variable and y is the response */
data Have / view=Have;
set Sashelp.Enso(rename=(Month=x Pressure=y));
keep x y;
run;

Create a grid of points

After the data set is created, you can use PROC SQL to find the minimum and maximum values of the explanatory variable. You can create an evenly spaced grid of points for the range of the explanatory variable.

/* Put min and max into macro variables */
proc sql noprint;
  select min(x), max(x) into :min_x, :max_x 
  from Have;
quit;
 
/* Evaluate the model and estimate derivatives at these points */
data Grid;
dx = (&max_x - &min_x)/201;    /* choose the step size wisely */
do x = &min_x to &max_x by dx;   
   output;
end;
drop dx;
run;

Score the model at the grid locations

This is the step that will vary from procedure to procedure. You have to know how to use the procedure to score the regression model on the points in the Grid data set. The LOESS procedure supports a SCORE statement, so the call fits the model and scores the model on the Grid data set:

/* Score the model on the grid */
ods select none;    /* do not display the tables */
proc loess data=Have plots=none;
   model y = x;
   score data=Grid;  /* PROC LOESS does not support an OUT= option */
   /* Most procedures support an OUT= option to save the scored values.
      PROC LOESS displays the scored values in a table, so use ODS to
      save the table to an output data set */
   ods output ScoreResults=ScoreOut;
run;
ods select all;

If a procedure supports the STORE statement, you can use PROC PLM to score the model on the data. The SAS program that accompanies this article includes an example that uses the GAMPL procedure. The GAMPL procedure does not support the STORE or SCORE statements, but you can use the missing value trick to find zero derivatives.
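
As a rough illustration of the missing value trick, the following sketch appends the grid to the data and requests predicted values. This is not the GAMPL example from the accompanying program; the spline specification and variable names are assumptions. Because the grid observations have a missing response, they do not affect the fit, but the OUTPUT statement still computes predicted values for them.

/* Sketch only: the missing value trick for a procedure without STORE or SCORE */
data Combined;
   set Have Grid;      /* Grid has no Y variable, so Y is missing for the grid obs */
run;
 
proc gampl data=Combined;
   model y = spline(x);
   id x;                                /* keep X in the output data set */
   output out=ScoreOut predicted=p_y;   /* predictions at data and grid points */
run;

The ScoreOut data set can then be processed exactly as in the next section.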

Find the locations where the slope is zero

This is the mathematical portion of the computation. You can use a backward difference scheme to estimate the derivative (slope) of the curve. If (x0, y0) and (x1, y1) are two consecutive points along the curve (in the ScoreOut data set), then the slope at (x1, y1) is approximately m = (y1 - y0) / (x1 - x0). When the slope changes sign between consecutive points, it indicates that the slope changed from positive to negative (or vice versa) between the points. If the slope is continuous, it must have been exactly zero somewhere on the interval. You can use a linear approximation to find the point, t, where the slope is zero. You can then use linear interpolation to approximate the point (t, f(t)) at which the curve is a local min or max.

You can use the following SAS DATA step to process the scoring data, approximate the slope, and estimate where the slope of the curve is zero:

/* Compute slope by using finite difference formula. */
data Deriv0;
set ScoreOut;
Slope = dif(p_y) / dif(x);      /* (f(x) - f(x-dx)) / dx */
xPrev = lag(x);  yPrev = lag(p_y);  SlopePrev = lag(Slope);
if n(SlopePrev) AND sign(SlopePrev) ^= sign(Slope) then do;
   /* The slope changes sign between this obs and the previous.
      Assuming linearity on the interval, find (t, f(t))
      where slope is exactly zero */
   t0 = xPrev - SlopePrev * (x - xPrev)/(Slope - SlopePrev); 
   /* use linear interpolation to find the corresponding y value:
      f(t) ~ y0 + (y1-y0)/(x1-x0) * (t - x0)       */
   f_t0 = yPrev + (p_y - yPrev)/(x - xPrev) * (t0 - xPrev);
   if sign(SlopePrev) > 0 then _Type_ = "Max";
   else _Type_ = "Min";
   output; 
end;
keep t0 f_t0 _Type_;
label f_t0 = "f(t0)";
run;
 
proc print data=Deriv0 label;
run;

The table shows that there are seven points at which the loess regression curve has a local min or max.

Graph the results

If you want to display the local extrema on the graph of the regression curve, you can combine the original data, the regression curve, and the local extrema into a single data set. You can then use PROC SGPLOT to overlay the three layers. The resulting graph is shown at the top of this article.

data Combine;
merge Have                           /* data   : (x, y)     */
      ScoreOut(rename=(x=t p_y=p_t)) /* curve  : (t, p_t)   */
      Deriv0;                        /* extrema: (t0, f_t0) */
run;
 
title "Loess Smoother";
title2 "Red Markers Indicate Zero Slope for Smoother";
proc sgplot data=Combine noautolegend;
   scatter x=x y=y;
   series x=t y=p_t / lineattrs=GraphData2;
   scatter x=t0 y=f_t0 / markerattrs=(symbol=circlefilled color=red);
   yaxis grid; 
run;

Summary

In summary, if you can evaluate a regression curve on a grid of points, you can approximate the slope at each point along the curve. By looking for locations where the slope changes sign, you can find local minima and maxima. You can then use a simple linear estimator on the interval to estimate where the slope is exactly zero.

You can download the SAS program that performs the computations in this article.

The post Find points where a regression curve has zero slope appeared first on The DO Loop.

May 11, 2020
 

I recently showed how to use linear interpolation in SAS. Linear interpolation is a common way to interpolate between a set of planar points, but the interpolating function (the interpolant) is not smooth. If you want a smoother interpolant, you can use cubic spline interpolation. This article describes how to perform cubic spline interpolation in SAS.

As mentioned in my previous post, an interpolation requires two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. These sample data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

The following SAS DATA steps define the data for this example. The POINTS data set contains the sample data, which are shown as blue markers on the graph to the right. The SCORE data set contains the scoring data, which are shown as red tick marks along the horizontal axis.

/* Example data for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

On the graph, the blue curve is the cubic spline interpolant. Every point that you interpolate will be on that curve. The red asterisks are the interpolated values for the values in the SCORE data set. Notice that the values -1 and 10.5 are not interpolated because they are outside of the data range. The following section shows how to compute the cubic spline interpolation in SAS.

Cubic spline interpolation in SAS

A linear interpolation uses a linear function on each interval between the data points. In general, the linear segments do not meet smoothly: the resulting interpolant is continuous but not smooth. In contrast, spline interpolation uses a polynomial function on each interval and chooses the polynomials so that the interpolant is smooth where adjacent polynomials meet. For polynomials of degree k, you can match the first k – 1 derivatives at each data point.

A cubic spline is composed of piecewise cubic polynomials whose first and second derivatives match at each data point. Typically, the second derivatives at the minimum and maximum of the data are set to zero. This kind of spline is known as a "natural cubic spline" with knots placed at each data point.

I have previously shown how to use the SPLINE call in SAS/IML to compute a smoothing spline. A smoothing spline is not an interpolant because it does not pass through the original data points. However, you can get interpolation by using the SMOOTH=0 option. Adding the TYPE='zero' option results in a natural cubic spline.

For more control over the interpolation, you can use the SPLINEC function ('C' for coefficients) to fit the cubic splines to the data and obtain a matrix of coefficients. You can then use that matrix in the SPLINEV function ('V' for value) to evaluate the interpolant at the locations in the scoring data.

The following SAS/IML function (CubicInterp) computes the spline coefficients from the sample data and then interpolates the scoring data. The details of the computation are provided in the comments, but you do not need to know the details in order to use the function to interpolate data:

/* Cubic interpolating spline in SAS.
   The interpolation is based on the values (x1,y1), (x2,y2), ..., (xn, yn).
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are interpolated.
*/
proc iml;
start CubicInterp(x, y, t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(t>=min(x) && t<=max(x));    /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   /* fit the cubic model to the data */
   call splinec(splPred, coeff, endSlopes, x||y) smooth=0 type="zero";
 
   p = j(nrow(t)*ncol(t), 1, .);       /* allocate output (prediction) vector */
   call sortndx(ndx, colvec(t));       /* SPLINEV wants sorted data, so get sort index for t */
   sort_t = t[ndx];                    /* sorted version of t */
   sort_pred = splinev(coeff, sort_t); /* evaluate model at (sorted) points of t */
   p[ndx] = sort_pred[,2];             /* "unsort" by using the inverse sort index */
   return( p );
finish;
 
/* example of cubic spline interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = CubicInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

The visualization of the interpolation is similar to the code in the previous article, so the code is not shown here. However, you can download the SAS program that performs the cubic interpolation and creates the graph at the top of this article.

Although cubic spline interpolation is slower than linear interpolation, it is still fast: The CubicInterp program takes about 0.75 seconds to fit 1000 data points and interpolate one million scoring values.

Summary

In summary, the SAS/IML language provides the computational tools for cubic spline interpolation. The CubicInterp function in this article encapsulates the functionality so that you can perform cubic spline interpolation of your data in an efficient manner.

The post Cubic spline interpolation in SAS appeared first on The DO Loop.

May 4, 2020
 

SAS programmers sometimes ask about ways to perform one-dimensional linear interpolation in SAS. This article shows three ways to perform linear interpolation in SAS: PROC IML (in SAS/IML software), PROC EXPAND (in SAS/ETS software), and PROC TRANSREG (in SAS/STAT software). Of these, PROC IML is the simplest to use and has the most flexibility. This article shows how to implement an efficient 1-D linear interpolation algorithm in SAS. You can download the SAS program that creates the analyses and graphs in this article.

Linear interpolation assumptions

For one-dimensional linear interpolation of points in the plane, you need two sets of numbers:

  1. Data: Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n data points. The data should not contain any missing values. The data must be ordered so that x1 < x2 < ... < xn. These values uniquely define the linear interpolation function on [x1, xn]. I call this the "sample data" or "fitting data" because it is used to create the linear interpolation model.
  2. Values to score: Let {t1, t2, ..., tk} be a set of k new values for the X variable. For interpolation, all values must be within the range of the data: x1 ≤ ti ≤ xn for all i. The goal of interpolation is to produce a new Y value for each value of ti. The scoring data is also called the "query data."

Interpolation requires a model. For linear interpolation, the model is the unique piecewise linear function that passes through each sample point and is linear on each interval [xi, xi+1]. The model is usually undefined outside of the range of the data, although there are various (nonunique) ways to extrapolate the model beyond the range of the data. You fit the model by using the data. You then score the model on the set of new values.

The following SAS data sets define example data for linear interpolation. The POINTS data set contains fitting data that define the linear model. The SCORE data set contains the new query points at which we want to interpolate. The linear interpolation is shown to the right. The sample data are shown as blue markers. The model is shown as blue lines. The query values to score are shown as a fringe plot along the X axis. The interpolated values are shown as red markers.

The data used for this example are:

/* Example data for 1-D interpolation */
data Points;  /* these points define the model */
input x y;
datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;
 
data Score; /* these points are to be interpolated */
input t @@;
datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;

For convenience, the fitting data are already sorted by the X variable, which is in the range [0, 10]. The scoring data set does not need to be sorted.

The scoring data for this example contains five special values:

  • Two scoring values (-1 and 10.5) are outside of the range of the data. An interpolation algorithm should return a missing value for these values. (Otherwise, it is extrapolation.)
  • Two scoring values (0 and 1) are duplicates of X values in the data. Ideally, this should not present a problem for the interpolation algorithm.
  • The value 9 appears twice in the scoring data.

Linear interpolation in SAS by using PROC IML

As is often the case, PROC IML enables you to implement a custom algorithm in only a few lines of code. For simplicity, suppose you have a single value, t, that you want to interpolate, based on the data (x1, y1), (x2, y2), ..., (xn, yn). The main steps for linear interpolation are:

  1. Check that the X values are nonmissing and in increasing order: x1 < x2 < ... < xn. Check that t is in the range [x1, xn]. If not, return a missing value.
  2. Find the first interval that contains t. You can use the BIN function in SAS/IML to find the first value i for which x_i <= t <= x_{i+1}. (A small example of the BIN function follows this list.)
  3. Define the left and right endpoint of the interval: xL = x_i and xR = x_{i+1}. Define the corresponding response values: yL = y_i and yR = y_{i+1}.
  4. Let f = (t - xL) / (xR - xL) be the proportion of the interval to the left of t. Then p = (1 - f)*yL + f*yR is the linear interpolation at t.
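
The following small example illustrates the BIN function. The cut points are the X values from the sample data, and the query values are a few of the values from the scoring data:

/* illustrate the BIN function on values from this article's data */
proc iml;
x = {0 1 4 5 7 8 10};        /* sorted X values that define the intervals */
t = {2, 4.8, 0.5, 9};        /* a few query values */
k = bin(t, x);               /* k = {2, 3, 1, 6}: t[i] is in [ x[k[i]], x[k[i]+1] ] */
print t k;
QUIT;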

The steps are implemented in the following SAS/IML function. The function accepts a vector of scoring values, t. Notice that the program does not contain any loops over the elements of t. All statements and operations are vectorized, which is very efficient.

/* Linear interpolation based on the values (x1,y1), (x2,y2),....
   The X  values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are linearly interpolated.
*/
proc iml;
start LinInterp(x, y, _t);
   d = dif(x, 1, 1);                     /* check that x[i+1] > x[i] */
   if any(d<=0) then stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(_t>=min(x) && _t<=max(x));  /* check for valid scoring values */
   if ncol(idx)=0 then stop "ERROR: No values of t are inside the range of x.";
 
   p = j(nrow(_t)*ncol(_t), 1, .);     /* allocate output (prediction) vector */
   t = _t[idx];                        /* subset t values inside range(x) */
   k = bin(t, x);                      /* find interval [x_i, x_{i+1}] that contains t[i] */
   xL = x[k];   yL = y[k];             /* find (xL, yL) and (xR, yR) */
   xR = x[k+1]; yR = y[k+1];
   f = (t - xL) / (xR - xL);           /* f = fraction of interval [xL, xR] */
   p[idx] = (1 - f)#yL + f#yR;        /* interpolate between yL and yR */
   return( p );
finish;
 
/* example of linear interpolation in SAS */
use Points; read all var {'x' 'y'}; close;
use Score; read all var 't'; close;
 
pred = LinInterp(x, y, t);
create PRED var {'t' 'pred'}; append; close;
QUIT;

Visualize a linear interpolation in SAS

The previous program writes the interpolated values to the PRED data set. You can concatenate the original data and the interpolated values to visualize the linear interpolation:

/* Visualize: concatenate data and predicted (interpolated) values */
data All;
set Points Pred;
run;
 
title "Linear Interpolation";
title2 "No Extrapolation";
proc sgplot data=All noautolegend;
   series x=x y=y;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12)
                    name="data" legendlabel="Data";
   scatter x=t y=Pred / markerattrs=(symbol=asterisk size=12 color=red)
                    name="interp" legendlabel="Interpolated Values";
   fringe t / lineattrs=(color=red thickness=2)
                    name="score" legendlabel="Values to Score";
   xaxis grid values=(0 to 10) valueshint label="X";
   yaxis grid label="Y" offsetmin=0.05;
   keylegend "data" "score" "interp";
run;

The graph is shown at the top of this article. A few noteworthy items:

  • The values -1 and 10.5 are not scored because they are outside the range of the data.
  • The values 0 and 1 correspond to a data point. The interpolated value is the corresponding data value.
  • The other values are interpolated onto the straight line segments that connect the data.

Performance of the IML algorithm

The IML algorithm is very fast. On my Windows PC (Pentium i7), the interpolation takes only 0.2 seconds for 1,000 data points and one million scoring values. For 10,000 data points and one million scoring values, the interpolation takes about 0.25 seconds. The SAS program that accompanies this article includes timing code.

Other SAS procedures that can perform linear interpolation

According to a SAS Usage Note, you can perform linear interpolation in SAS by using PROC EXPAND in SAS/ETS software or PROC TRANSREG in SAS/STAT software. Each has some limitations that I don't like:

  • Both procedures use the missing value trick to perform the fitting and scoring in a single call. That means you must concatenate the sample data (which is often small) and the query data (which can be large).
  • PROC EXPAND requires that the combined data be sorted by X. That can be easily accomplished by calling PROC SORT after you concatenate the sample and query data. However, PROC EXPAND does not support duplicate X values! For me, this makes PROC EXPAND unusable. It means that you cannot score the model at points that are in the original data, nor can you have repeated values in the scoring data.
  • If you use PROC TRANSREG for linear interpolation, you must know the number of sample data points, n. You must specify n – 2 in the NKNOTS= option of a SPLINE transformation; a sketch follows this list. Usually, this means that you must perform an extra step (DATA step or PROC MEANS) and store n – 2 in a macro variable.
  • For scoring values outside the range of the data, PROC EXPAND returns a missing value. However, PROC TRANSREG extrapolates. If t < x1, then the extrapolated value at t is y1. Similarly, if t > xn, then the extrapolated value at t is yn.
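
For reference, a minimal sketch of the PROC TRANSREG approach might look like the following. This is only a sketch based on the description above, not the code in the accompanying program. It assumes the POINTS and SCORE data sets from this article, so n = 7 and NKNOTS=5:

/* Sketch only: linear interpolation via PROC TRANSREG and the missing value trick */
data Both;
   set Points Score(rename=(t=x));     /* Y is missing for the query points */
run;
 
proc transreg data=Both;
   model identity(y) = spline(x / degree=1 nknots=5);  /* piecewise linear fit */
   output out=TransRegOut predicted;
run;

As noted above, PROC TRANSREG extrapolates, so the query values -1 and 10.5 receive predicted values instead of missing values.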

I've included examples of using PROC EXPAND and PROC TRANSREG in the SAS program that accompanies this article. You can use these procedures for linear interpolation, but neither is as convenient as PROC IML.

With effort, you can use Base SAS routines such as the DATA step to implement a linear interpolation algorithm. An example is provided by KSharp in a SAS Support Community thread.
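
A minimal sketch of such a DATA step follows. This is not KSharp's program; it assumes that the POINTS data set contains exactly seven observations and uses a simple linear search over the intervals:

/* Sketch only: linear interpolation in a Base SAS DATA step */
data LinInterpDS;
   array xx[7] _temporary_;            /* X values of the sample points */
   array yy[7] _temporary_;            /* Y values of the sample points */
   if _n_ = 1 then do i = 1 to 7;      /* load the sample points once */
      set Points point=i;
      xx[i] = x;  yy[i] = y;
   end;
   set Score;                          /* read a query value, t */
   pred = .;                           /* stays missing if t is out of range */
   do j = 1 to 6;
      if xx[j] <= t <= xx[j+1] then do;
         f = (t - xx[j]) / (xx[j+1] - xx[j]);
         pred = (1-f)*yy[j] + f*yy[j+1];
      end;
   end;
   keep t pred;
run;

Unlike PROC TRANSREG, this sketch returns a missing value for query values that are outside the range of the data.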

Summary

If you want to perform linear interpolation in SAS, the easiest and most efficient method is PROC IML. I have provided a SAS/IML function that implements linear interpolation by using two input vectors (x and y) to define the model and one input vector (t) to specify the points at which to interpolate. The function returns the interpolated values for t.

The post Linear interpolation in SAS appeared first on The DO Loop.