Jun 01, 2020
 

“He spends a lot of time wandering around in circles in the backyard,” my wife said to someone on the telephone. That’s true. Our backyard is only about 1/8th of an acre and I have taken to wandering outside and walking around the fence line. Ostensibly, I am checking to [...]

Data for good: the home edition was published on SAS Voices by Elliot Inman

Jun 01, 2020
 

In a previous article, I discussed the definition of the Kullback-Leibler (K-L) divergence between two discrete probability distributions. For completeness, this article shows how to compute the Kullback-Leibler divergence between two continuous distributions. When f and g are discrete distributions, the K-L divergence is the sum of f(x)*log(f(x)/g(x)) over all x values for which f(x) > 0. When f and g are continuous distributions, the sum becomes an integral:
KL(f,g) = ∫ f(x)*log( f(x)/g(x) ) dx
which is equivalent to
KL(f,g) = ∫ f(x)*( log(f(x)) – log(g(x)) ) dx
The integral is over the support of f, and clearly g must be positive inside the support of f for the integral to be defined.

The Kullback-Leibler divergence between normal distributions

I like to perform numerical integration in SAS by using the QUAD subroutine in the SAS/IML language. You specify the function that you want to integrate (the integrand) and the domain of integration and get back the integral on the domain. Use the special missing value .M to indicate "minus infinity" and the missing value .P to indicate "positive infinity."

As a first example, suppose that f(x) is the pdf of the normal distribution N(μ1, σ1) and g(x) is the pdf of the normal distribution N(μ2, σ2). From the definition, you can calculate the Kullback-Leibler divergence between normal distributions exactly:
KL between Normals: log(σ2/σ1) + (σ1² – σ2² + (μ1 – μ2)²) / (2σ2²)

You can define the integrand by using the PDF and LOGPDF functions in SAS. The following computation integrates the integrand on the infinite interval (-∞, ∞), although a "six sigma" interval about μ1, such as [-6, 6], would give the same result:

proc iml;
/* Example 1: K-L divergence between two normal distributions */
start KLDivNormal(x) global(mu1, mu2, sigma1, sigma2);
   return pdf("Normal", x, mu1, sigma1) #
          (logpdf("Normal", x, mu1, sigma1) - logpdf("Normal", x, mu2, sigma2));
finish;
 
/* f is PDF for N(0,1) and g is PDF for N(2,3) */ 
mu1 = 0; sigma1 = 1;
mu2 = 2; sigma2 = 3;
 
call quad(KL, "KLDivNormal", {.M .P});  /* integral on (-infinity, infinity) */
KLExact = log(sigma2/sigma1) + 
          (sigma1##2 - sigma2##2 + (mu1-mu2)##2) / (2*sigma2##2);
print KL KLExact (KL-KLExact)[L="Difference"];

The output indicates that the numerical integration agrees with the exact result to many decimal places.

The Kullback-Leibler divergence between exponential and gamma distributions

A blog post by John D. Cook computes the K-L divergence between an exponential and a gamma(a=2) distribution. This section performs the same computation in SAS.

This is a good time to acknowledge that numerical integration can be challenging. Numerical integration routines have default settings that enable you to integrate a wide variety of functions, but sometimes you need to modify the default parameters to get a satisfactory result—especially for integration on infinite intervals. I have previously written about how to use the PEAK= option for the QUAD routine to achieve convergence in certain cases by providing additional guidance about the location and scale of the integrand. Following the advice I gave in the previous article, a value of PEAK=0.1 seems to be a good choice for the K-L divergence between the exponential and gamma(a=2) distributions: it is inside the domain of integration and is close to the mode of the integrand, which is at x=0.

/* Example 2: K-L divergence between exponential and gamma(2). Example from
   https://www.johndcook.com/blog/2017/11/08/why-is-kullback-leibler-divergence-not-a-distance/
*/
start KLDivExpGamma(x) global(a);
   return pdf("Exponential", x) #
          (logpdf("Exponential", x) - logpdf("Gamma", x, a));
finish;
 
a = 2;
/* Use PEAK= option: https://blogs.sas.com/content/iml/2014/08/13/peak-option-in-quad.html */
call quad(KL2, "KLDivExpGamma", {0 .P}) peak=0.1;
print KL2;

The value of the K-L divergence agrees with the answer in John D. Cook's blog post.

Approximating the Exponential Distribution by the Gamma Distribution

Recall that the K-L divergence is a measure of the dissimilarity between two distributions. For instance, the previous computation indicates how well the gamma(a=2) distribution approximates the exponential distribution. As I showed for discrete distributions, you can use the K-L divergence to select the parameters for a distribution that make it the most similar to another distribution. You can do this by using optimization methods, but I will merely plot the K-L divergence as a function of the shape parameter (a) to indicate how the K-L divergence depends on the parameter.

You might recall that the gamma distribution includes the exponential distribution as a special case. When the shape parameter a=1, the gamma(1) distribution is an exponential distribution. Therefore, the K-L divergence between the exponential and the gamma(1) distribution should be zero.
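As a quick numerical check, the following sketch (reusing the KLDivExpGamma module and QUAD settings from the previous section) evaluates the divergence at a=1; the result should be zero up to numerical error:

a = 1;
call quad(KL1, "KLDivExpGamma", {0 .P}) peak=0.1;
print KL1;   /* expect 0 (up to integration error) */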

The following program computes the K-L divergence for values of a in the range [0.5, 2] and plots the K-L divergence versus the parameter:

/* Plot KL(Expo, Gamma(a)) for various values of a.
   Note that gamma(a=1) is the exponential distribution. */
aa = do(0.5, 2, 0.01);
KL = j(1, ncol(aa), .);
do i = 1 to ncol(aa);
   a = aa[i];
   call quad(KL_i, "KLDivExpGamma", {0 .P}) peak=0.1;
   KL[i] = KL_i;
end;
title "Kullback-Leibler Divergence Between Exponential and Gamma{a)";
call series(aa, KL) grid={x y} label={'Gamma Shape Parameter (a)' 'K-L Divergence'};

As expected, the graph of the K-L divergence reaches a minimum value at a=1, which is the best approximation to an exponential distribution by the gamma(a) distribution. Note that the K-L divergence equals zero when a=1, which indicates that the distributions are identical when a=1.
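To find the minimizer numerically instead of graphically, you can wrap the QUAD call in an objective function and pass it to a nonlinear optimization routine. The following sketch is my addition, not part of the original program; it assumes the KLDivExpGamma module from above is still defined and uses the NLPNRA subroutine, which computes finite-difference derivatives when you do not supply a gradient module:

start KLDivGammaObj(param) global(a);
   a = param;                       /* set the shape parameter for the integrand */
   call quad(KL, "KLDivExpGamma", {0 .P}) peak=0.1;
   return KL;
finish;
 
optn = {0 0};                       /* minimize; suppress iteration history */
blc  = {0.5, 2};                    /* restrict the search to 0.5 <= a <= 2 */
call nlpnra(rc, aOpt, "KLDivGammaObj", 1.5, optn, blc);
print aOpt;                         /* expect a value near a=1 */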

Summary

The Kullback-Leibler divergence between two continuous probability distributions is an integral. This article shows how to use the QUAD function in SAS/IML to compute the K-L divergence between two probability distributions.

The post The Kullback–Leibler divergence between continuous probability distributions appeared first on The DO Loop.

May 29, 2020
 

While working at the Rutgers Robert Wood Johnson Medical School, I had access to data on over ten million visits to emergency departments in central New Jersey, including ICD-9 (International Classification of Disease – 9th edition) codes along with some patient demographic data.

I also had the ozone level from several central New Jersey monitoring stations for every hour of the day for ten years. I used PROC REG (and ARIMA) to assess the association between ozone levels and the number of admissions to emergency departments diagnosed as asthma. Some of the predictor variables, besides ozone level, were pollen levels and a dichotomous variable indicating if the date fell on a weekend. (On weekdays, patients were more likely to visit the personal physician than on a weekend.) The study showed a significant association between ozone levels and asthma attacks.
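For readers who want to see the shape of such a model, here is a minimal sketch; the data set and variable names (asthma_daily, er_visits, ozone, pollen, weekend) are hypothetical placeholders, not the variables from the study:

proc reg data=asthma_daily;                 /* hypothetical data set name */
   model er_visits = ozone pollen weekend;  /* weekend is a 0/1 indicator */
run;
quit;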

It would have been nice to have the incredible diagnostics that are now produced when you run PROC REG. Imagine if I had SAS Studio back then!

In the program, I used a really interesting trick. (Thank you Paul Grant for showing me this trick so many years ago at a Boston Area SAS User Group meeting.) Here's the problem: there are many possible codes such as 493, 493.9, 493.100, 493.02, and so on that all relate to asthma. The straightforward way to check an ICD-9 code would be to use the SUBSTR function to pick off the first three digits of the code. But why be straightforward when you can be tricky or clever? (Remember Art Carpenter's advice to write clever code that no one can understand so they can't fire you!)

The following program demonstrates the =: operator:

*An interesting trick to read ICD codes;
<strong>Data</strong> ICD_9;
  input ICD : $7. @@;
  if ICD =: "493" the output;
datalines;
493 770.6 999 493.9 493.90 493.100
;
title "Listing of All Asthma Codes";
<strong>proc</strong> <strong>print</strong> data=ICD_9 noobs;
<strong>run</strong>;

 

Normally, when SAS compares two strings of different length, it pads the shorter string with blanks to match the length of the longer string before making the comparison. The =: operator truncates the longer string to the length of the shorter string before making the comparison.
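A two-line DATA step makes the difference concrete. This is my own sketch, not part of the original program:

data _null_;
   ICD = "493.9";
   if ICD =  "493" then put "= matched";    /* not printed: "493  " ^= "493.9" */
   if ICD =: "493" then put "=: matched";   /* printed: both truncate to "493" */
run;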

The usual reason to write a SAS blog is to teach some aspect of SAS programming or to just point out something interesting about SAS. While that is usually my motivation, I have an ulterior motive in writing this blog – I want to plug a new book I have just published on Amazon. It's called 10-8 Awaiting Crew: Memories of a Volunteer EMT. One of the chapters discusses the difficulty of conducting statistical studies in pre-hospital settings. This was my first attempt at a non-technical book. I hope you take a look. (Enter "10-8 awaiting crew" or "Ron Cody" in Amazon search to find the book.) Drop me an email with your thoughts at ron.cody@gmail.com.

Using SAS to estimate the link between ozone and asthma (and a neat trick) was published on SAS Users.

May 29, 2020
 

The COVID-19 pandemic challenged agriculture and supply chains, but the overarching resilience of agriculture around the world speaks to the industry's efficiency, built-in redundancy and indispensability. In the US, flourishing interactions between government, industry and academic stakeholders underscore how ag represents unity and consilience. And there may be no better [...]

From 9 cows to the future of agtech was published on SAS Voices by John Gottula

May 28, 2020
 

The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. An application in machine learning is to measure how distributions in a parametric family differ from a data distribution. This article shows that if you minimize the Kullback–Leibler divergence over a set of parameters, you can find a distribution that is similar to the data distribution. This article focuses on discrete distributions.

The Kullback–Leibler divergence between two discrete distributions

As explained in a previous article, the Kullback–Leibler (K-L) divergence between two discrete probability distributions is the sum
KL(f, g) = Σ_x f(x) log( f(x)/g(x) )
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.) For this sum to be well defined, the distribution g must be strictly positive on the support of f.

One application of the K-L divergence is to measure the similarity between a hypothetical model distribution defined by g and an empirical distribution defined by f.

Example data for the Kullback–Leibler divergence

As an example, suppose a call center averages about 10 calls per hour. An analyst wants to investigate whether the number of calls per hour can be modeled by using a Poisson(λ=10) distribution. To test the hypothesis, the analyst records the number of calls for each hour during 100 hours of operation. The following SAS DATA step reads the data. The call to PROC SGPLOT creates a histogram that shows the distribution of the 100 counts:

data Calls;
input N @@;
label N = "Calls per hour";
datalines;
11 19 11 13 13 8 11 9 9 14 
10 13 8 15 7 9 6 12 7 13 
12 19 6 12 11 12 11 9 15 4 
7 12 12 10 10 16 18 13 13 8 
13 10 9 9 12 13 12 8 13 9 
7 9 10 9 4 10 12 5 4 12 
8 12 14 16 11 7 18 8 10 13 
12 5 11 12 16 9 11 8 11 7 
11 15 8 7 12 16 9 18 9 8 
10 7 11 12 13 15 6 10 10 7 
;
 
title "Number of Calls per Hour";
title2 "Data for 100 Hours";
proc sgplot data=Calls;
   histogram N / scale=proportion binwidth=1;
   xaxis values=(4 to 19) valueshint;
run;

The graph should really be a bar chart, but I used a histogram with BINWIDTH=1 so that the graph reveals that the value 17 does not appear in the data. Furthermore, the values 0, 1, 2, and 3 do not appear in the data. I used the SCALE=PROPORTION option to plot the data distribution on the density scale.

The call center wants to model these data by using a Poisson distribution. The traditional statistical approach is to use maximum likelihood estimation (MLE) to find the parameter, λ, in the Poisson family so that the Poisson(λ) distribution is the best fit to the data. However, let's see how using the Kullback–Leibler divergence leads to a similar result.

The Kullback–Leibler divergence between data and a Poisson distribution

Let's compute the K-L divergence between the empirical frequency distribution and a Poisson(10) distribution. The empirical distribution is the reference distribution; the Poisson(10) distribution is the model. The Poisson distribution has a nonzero probability for all x ≥ 0, but recall that the K-L divergence is computed by summing over the observed values of the empirical distribution, which is the set {4, 5, ..., 19}, excluding the value 17.

proc iml;
/* read the data, which is the reference distribution, f */
use Calls;  read all var "N" into Obs;  close;
call Tabulate(Levels, Freq, Obs);   /* find unique values and frequencies */
Proportion = Freq / nrow(Obs);      /* empirical density of frequency of calls (f) */
 
/* create the model distribution: Poisson(10) */
lambda = 10;   
poisPDF = pdf("Poisson", Levels, lambda); /* Poisson model on support(f) */
 
/* load K-L divergence module or include the definition from: 
 https://blogs.sas.com/content/iml/2020/05/26/kullback-leibler-divergence-discrete.html
*/
load module=KLDiv;
KL = KLDiv(Proportion, poisPDF); 
print KL[format=best5.];

Notice that although the Poisson distribution has infinite support, you only need to evaluate the Poisson density on the (finite) support of the empirical density.

Minimize the Kullback–Leibler divergence between data and a Poisson distribution

The previous section shows how to compute the Kullback–Leibler divergence between an empirical density and a Poisson(10) distribution. You can repeat that computation for a whole range of λ values and plot the divergence versus the Poisson parameter. The following statements compute the K-L divergence for λ on [4, 16] and plot the result. The minimum value of the K-L divergence is achieved near λ = 10.7. At that value of λ, the K-L divergence between the data and the Poisson(10.7) distribution is 0.105.

/* Plot the K-L div versus lambda for a sequence of Poisson(lambda) models */
lambda = do(4, 16, 0.1);
KL = j(1, ncol(lambda), .);
do i = 1 to ncol(lambda);
   poisPDF = pdf("Poisson", Levels, lambda[i]);
   KL[i] = KLDiv(Proportion, poisPDF); 
end;
 
title "K-L Divergence from Poisson(lambda)";
call series(lambda, KL) grid={x y} xvalues=4:16 label={'x' 'K-L Divergence'};

The graph shows the K-L divergence for a sequence of Poisson(λ) models. The Poisson(10.7) model has the smallest divergence from the data distribution, therefore it is the most similar to the data among the Poisson(λ) distributions that were considered. You can use a numerical optimization technique in SAS/IML if you want to find a more accurate value that minimizes the K-L divergences.
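If you want to carry out that optimization, here is a sketch (my addition, not part of the original program). It wraps the divergence computation in an objective function and minimizes it with the NLPNRA subroutine, which uses finite-difference derivatives when you do not supply a gradient module. It assumes the Levels, Proportion, and KLDiv names defined earlier in the session:

start KLPoisObj(lambda) global(Levels, Proportion);
   poisPDF = pdf("Poisson", Levels, lambda);
   return KLDiv(Proportion, poisPDF);
finish;
 
optn = {0 0};                     /* minimize; suppress iteration history */
call nlpnra(rc, lambdaOpt, "KLPoisObj", 10, optn);
print lambdaOpt;                  /* expect a value near the MLE, 10.71 */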

The following graph overlays the PMF for the Poisson(10.7) distribution on the empirical distribution for the number of calls.

The minimal Kullback–Leibler divergence and the maximum likelihood estimate

You might wonder how minimizing the K-L divergence relates to the traditional MLE method for fitting a Poisson model to the data. The following call to PROC GENMOD shows that the MLE estimate is λ = 10.71:

proc genmod data=Calls;
   model N = / dist=poisson;
   output out=PoissonFit p=lambda;
run;
 
proc print data=PoissonFit(obs=1) noobs;
   var lambda;
run;

Is this a coincidence? No. It turns out that there is a connection between the K-L divergence and the negative log-likelihood. You can expand the K-L divergence as KL(f, g) = Σ_x f(x) log(f(x)) – Σ_x f(x) log(g(x)). The first sum does not depend on the model, and when f is the empirical distribution the second sum is proportional to the log-likelihood of the model g. Consequently, minimizing the K-L divergence is equivalent to minimizing the negative log-likelihood, which is equivalent to maximizing the likelihood between the Poisson model and the data.

Summary

This article shows how to compute the Kullback–Leibler divergence between an empirical distribution and a Poisson distribution. The empirical distribution was the observed number of calls per hour for 100 hours in a call center. You can compute the K-L divergence for many parameter values (or use numerical optimization) to find the parameter that minimizes the K-L divergence. This parameter value corresponds to the Poisson distribution that is most similar to the data. It turns out that minimizing the K-L divergence is equivalent to maximizing the likelihood function. Although the parameter estimates are the same, the traditional MLE estimate comes with additional tools for statistical inference, such as estimates for confidence intervals and standard errors.

The post Minimizing the Kullback–Leibler divergence appeared first on The DO Loop.

May 28, 2020
 

SAS toolbox: macro functions
Did you know you could have a single universal function that can replace all the functions in the world? All those sin(x), log(x), … whatever(x) can be replaced by a single super function f(x). Don't believe me? Just make the function name – sin, log, … whatever – another argument to that all-purpose function f, like this: f(x, sin), f(x, log), … f(x, whatever). Now we have to deal with only a single function, and its second argument defines what transformation is applied to the first argument to arrive at this almighty function's value.

How many functions are there in SAS?

Last time I counted there were more than 600 SAS functions, and that is excluding call routines and macro functions. But even that huge number grossly under-represents the actual number of functions available in SAS. That is because there are some functions that are built like the universal multi-purpose super function described above. For example, look at the following functions:

finance() function represents several dozen various financial functions;

finfo() function represents multiple functions returning various information items about files (file size, date created, date modified, access permission, etc.);

dinfo() function returns similar information items about directories;

attrn() function returns numeric attributes of a data set (number of observations, number of variables, etc.)

attrc() function returns character attributes of a data set (engine name, encoding name, character set, etc.)

Each of these functions represents not a single function, but a group of functions, and one of their arguments stipulates specific functionality (an information item or an attribute) that is being requested. You can think of this argument as a function modifier.
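For example, here is a small sketch of the "function modifier" idea by using the OPEN, ATTRN, and CLOSE functions in a DATA step; the second argument to ATTRN selects which attribute to return:

data _null_;
   dsid = open("SASHELP.CARS");    /* get a data set identifier */
   nobs = attrn(dsid, "nlobs");    /* modifier "nlobs": number of logical observations */
   nvar = attrn(dsid, "nvars");    /* modifier "nvars": number of variables */
   put nobs= nvar=;
   rc = close(dsid);
run;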

%sysfunc SAS macro function

%sysfunc() is a super macro function that brings a wealth of SAS functions into the SAS macro language. With very few exceptions, most SAS functions are available in the SAS macro language thanks to %sysfunc().

Moreover, we can build our own user-defined macro functions using SAS-supplied macro functions (such as %eval, %length, %quote, %scan, etc.), as well as hundreds of SAS non-macro functions wrapped in the %sysfunc() super macro function.
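As a tiny illustration (my own example), %sysfunc can call a DATA step function and even apply a format to the returned value:

%let today = %sysfunc(date(), worddate.);  /* call DATE() and format the result */
%put &=today;                              /* e.g., TODAY=May 28, 2020 */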

Building a super macro function to retrieve information about data sets

Armed with such a powerful arsenal, let’s build a multi-purpose macro function that taps into the data tables’ metadata and extracts various information items about those tables.

Let’s make this macro function return any of the following most frequently used values:

  • Number of observations
  • Number of variables
  • Variables list (positional, separated by spaces)
  • Variables list (positional, separated by commas)

Obviously, we can create many more of these information items and attributes, but here I am just showing how to do this so that you can create your own list depending on your needs.

In my earlier blog post, How to create and use SAS macro functions, we had already built a macro function for getting the number of observations; let’s expand on that.

Here is the SAS Macro code that handles extraction of all four specified metadata items:

%macro dsinfo(dset,info);
/* dset - data set name                             */
/* info - modifier (NOBS, NVARS, VARLIST, VARLISTC) */      
   %local dsid result infocaps i;
   %let infocaps = %upcase(&info);
   %let dsid = %sysfunc(open(&dset));
   %if &dsid %then
   %do;
      %if &infocaps=NOBS %then %let result = %sysfunc(attrn(&dsid,nlobs));
      %else %if &infocaps=NVARS %then %let result = %sysfunc(attrn(&dsid,nvars));
      %else %if &infocaps=VARLIST %then
      %do i=1 %to %sysfunc(attrn(&dsid,nvars));
         %let result = &result %sysfunc(varname(&dsid,&i));
      %end;
      %else %if &infocaps=VARLISTC %then
      %do i=1 %to %sysfunc(attrn(&dsid,nvars));
         %if &i eq 1 %then %let result = %sysfunc(varname(&dsid,&i));
         %else %let result = &result,%sysfunc(varname(&dsid,&i));
      %end;
      %let dsid = %sysfunc(close(&dsid));
   %end;
   %else %put %sysfunc(sysmsg());
   &result
%mend dsinfo;

The SAS log will show:

%put NOBS=***%dsinfo(SASHELP.CARS,NOBS)***;
NOBS=***428***
%put NVARS=***%dsinfo(SASHELP.CARS,NVARS)***;
NVARS=***15***
%put VARLIST=***%dsinfo(SASHELP.CARS,VARLIST)***;
VARLIST=***Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length***
%put VARLISTC=***%dsinfo(SASHELP.CARS,VARLISTC)***;
VARLISTC=***Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length***

Macro function code highlights

We used the following statement to make our macro function case-insensitive regarding the info argument:

%let infocaps = %upcase(&info);

Then, depending on the up-cased second argument of our macro function (the modifier), we used the attrn() and varname() functions within %sysfunc() to retrieve and construct our result macro variable.

We place the resulting value, &result, right before the %mend statement so that it is returned to the calling environment.

While info=VARLIST (space-separated variable list) is useful in DATA steps, info=VARLISTC (comma-separated variable list) is useful in PROC SQL.
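For instance, here is a sketch (my example) that uses the comma-separated list in a PROC SQL SELECT clause:

proc sql;
   create table CarsCopy as
   select %dsinfo(SASHELP.CARS, VARLISTC)  /* resolves to Make,Model,Type,... */
   from SASHELP.CARS;
quit;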

Usage example

With this %dsinfo macro function at hand, we can use it in multiple programming scenarios. For example:

/* ending SAS session if no observations to process */
%if %dsinfo(SASHELP.CARS,NOBS)=0 %then %do; endsas; %end;
 
/* further processing */
data MYNEWDATA (keep=%dsinfo(SASHELP.CARS,VARLIST));
   retain %dsinfo(SASHELP.CARS,VARLIST);
   set SASHELP.CARS;
   if _n_=1 then put %dsinfo(SASHELP.CARS,VARLIST);
   /* ... */
run;

Here we first check if there is at least one observation in a data set. If not (0 observations) then we stop the SAS session and don’t do any further processing. Otherwise, when there are some observations to process, we continue.

If SAS code needs multiple calls to the same macro function with the same argument, we can shorten the code by first assigning that macro function’s result to a macro variable and then reference that macro variable instead of repeating macro function invocation. Here is an example:

/* further processing */
%let vlist = %dsinfo(SASHELP.CARS,VARLIST);
data MYNEWDATA (keep=&vlist);
   retain &vlist;
   set SASHELP.CARS;
   if _n_=1 then put &vlist;
   /* ... */
run;

Your thoughts?

Do you see the benefits of these multi-purpose SAS macro functions? Can you suggest other scenarios of their usage? Please share your thoughts in the comments section below.

Multi-purpose macro function for getting information about data sets was published on SAS Users.

May 26, 2020
 

If you have been learning about machine learning or mathematical statistics, you might have heard about the Kullback–Leibler divergence. The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. It measures how much one distribution differs from a reference distribution. This article explains the Kullback–Leibler divergence and shows how to compute it for discrete probability distributions.

Recall that there are many statistical methods that indicate how much two distributions differ. For example, a maximum likelihood estimate involves finding parameters for a reference distribution that is similar to the data. Statistics such as the Kolmogorov-Smirnov statistic are used in goodness-of-fit tests to compare a data distribution to a reference distribution.

What is the Kullback–Leibler divergence?

Let f and g be probability mass functions that have the same domain. The Kullback–Leibler (K-L) divergence is the sum
KL(f, g) = Σ_x f(x) log( f(x)/g(x) )
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.) The K-L divergence measures the similarity between the distribution defined by g and the reference distribution defined by f.

For this sum to be well defined, the distribution g must be strictly positive on the support of f. That is, the Kullback–Leibler divergence is defined only when g(x) > 0 for all x in the support of f.

Some researchers prefer the argument to the log function to have f(x) in the denominator. Flipping the ratio introduces a negative sign, so an equivalent formula is
KL(f, g) = –Σ_x f(x) log( g(x)/f(x) )

Notice that if the two density functions (f and g) are the same, then the logarithm of the ratio is 0. Therefore, the K-L divergence is zero when the two distributions are equal. The K-L divergence is positive if the distributions are different.

The Kullback–Leibler divergence was developed as a tool for information theory, but it is frequently used in machine learning. The divergence has several interpretations. In information theory, it measures the information loss when f is approximated by g. In statistics and machine learning, f is often the observed distribution and g is a model.

An example of Kullback–Leibler divergence

As an example, suppose you roll a six-sided die 100 times and record the proportion of 1s, 2s, 3s, etc. You might want to compare this empirical distribution to the uniform distribution, which is the distribution of a fair die for which the probability of each face appearing is 1/6. The following SAS/IML statements compute the Kullback–Leibler (K-L) divergence between the empirical density and the uniform density:

proc iml;
/* K-L divergence is defined for positive discrete densities */
/* Face: 1   2   3   4   5   6 */
f    = {20, 14, 19, 19, 12, 16} / 100;  /* empirical density; 100 rolls of die */
g    = { 1,  1,  1,  1,  1,  1} / 6;    /* uniform density */
KL_fg = -sum( f#log(g/f) );             /* K-L divergence using natural log */
print KL_fg;

The K-L divergence is very small, which indicates that the two distributions are similar.

Although this example compares an empirical distribution to a theoretical distribution, you need to be aware of the limitations of the K-L divergence. The K-L divergence compares two distributions and assumes that the density functions are exact. The K-L divergence does not account for the size of the sample in the previous example. The computation is the same regardless of whether the first density is based on 100 rolls or a million rolls. Thus, the K-L divergence is not a replacement for traditional statistical goodness-of-fit tests.

The Kullback–Leibler divergence is not symmetric

Let's compare a different distribution to the uniform distribution. Let h(x)=9/30 if x=1,2,3 and let h(x)=1/30 if x=4,5,6. The following statements compute the K-L divergence between h and g and between g and h. In the first computation, the step distribution (h) is the reference distribution. In the second computation, the uniform distribution is the reference distribution.

h = { 9,  9,  9,  1,  1,  1} /30;    /* step density */
g = { 1,  1,  1,  1,  1,  1} / 6;    /* uniform density */
KL_hg = -sum( h#log(g/h) );          /* h is reference distribution */
print KL_hg;
 
/* show that K-L div is not symmetric */
KL_gh = -sum( g#log(h/g) );          /* g is reference distribution */
print KL_gh;

First, notice that the numbers are larger than for the example in the previous section. The f density function is approximately constant, whereas h is not. So the distribution for f is more similar to a uniform distribution than the step distribution is. Second, notice that the K-L divergence is not symmetric. In the first computation (KL_hg), the reference distribution is h, which means that the log terms are weighted by the values of h. The weights from h give a lot of weight to the first three categories (1,2,3) and very little weight to the last three categories (4,5,6). In contrast, g is the reference distribution for the second computation (KL_gh). Because g is the uniform density, the log terms are weighted equally in the second computation.

The fact that the summation is over the support of f means that you can compute the K-L divergence between an empirical distribution (which always has finite support) and a model that has infinite support. However, you cannot use just any distribution for g. Mathematically, f must be absolutely continuous with respect to g. (Another expression is that f is dominated by g.) This means that for every value of x such that f(x)>0, it is also true that g(x)>0.

A SAS function for the Kullback–Leibler divergence

It is convenient to write a function, KLDiv, that computes the Kullback–Leibler divergence for vectors that give the density for two discrete densities. The call KLDiv(f, g) should compute the weighted sum of log( g(x)/f(x) ), where x ranges over elements of the support of f. By default, the function verifies that g > 0 on the support of f and returns a missing value if it isn't. The following SAS/IML function implements the Kullback–Leibler divergence.

/* The Kullback–Leibler divergence between two discrete densities f and g.
   The f distribution is the reference distribution, which means that 
   the sum is probability-weighted by f. 
   If f(x0)>0 at some x0, the model must allow it. You cannot have g(x0)=0.
*/
start KLDiv(f, g, validate=1);
   if validate then do;        /* if g might be zero, validate */
      idx = loc(g<=0);         /* find locations where g = 0 */
      if ncol(idx) > 0 then do; 
         if any(f[idx]>0) then do; /* g is not a good model for f */
            *print "ERROR: g(x)=0 when f(x)>0";
            return( . );
         end;
      end;
   end;
   if any(f<=0) then do;        /* f = 0 somewhere */
      idx = loc(f>0);           /* support of f */
      fp = f[idx];              /* restrict f and g to support of f */
      return -sum( fp#log(g[idx]/fp) );  /* sum over support of f */
   end;
   /* else, f > 0 everywhere */
   return -sum( f#log(g/f) );  /* K-L divergence using natural logarithm */
finish;
 
/* test the KLDiv function */
f = {0, 0.10, 0.40, 0.40, 0.1};
g = {0, 0.60, 0.25, 0.15, 0};
KL_pq = KLDiv(f, g);  /* g is not a valid model for f; K-L div not defined */
KL_qp = KLDiv(g, f);  /* f is valid model for g. Sum is over support of g */
print KL_pq KL_qp;

The first call returns a missing value because the sum over the support of f encounters the invalid expression log(0) as the fifth term of the sum. The density g cannot be a model for f because g(5)=0 (no 5s are permitted) whereas f(5)>0 (5s were observed). The second call returns a positive value because the sum over the support of g is valid. In this case, f says that 5s are permitted, but g says that no 5s were observed.

Summary

This article explains the Kullback–Leibler divergence for discrete distributions. A simple example shows that the K-L divergence is not symmetric. This is explained by understanding that the K-L divergence involves a probability-weighted sum where the weights come from the first argument (the reference distribution). Lastly, the article gives an example of implementing the Kullback–Leibler divergence in a matrix-vector language such as SAS/IML. This article focused on discrete distributions. The next article shows how the K-L divergence changes as a function of the parameters in a model.

The post The Kullback–Leibler divergence between discrete probability distributions appeared first on The DO Loop.

May 23, 2020
 

A radar chart (also called a web chart, spider chart, star chart, polar chart, Kiviat diagram, or irregular polygon chart) displays multivariate data in the form of a two-dimensional plot. It usually involves three or more quantitative variables, each represented on an axis starting from the same point. The relative position and angle of the axes are typically meaningless, but radar charts are often used to reveal correlations, trade-offs, and comparisons. A radar chart is composed of coordinates, tick marks and reference lines, polygons, and data points (Fig. 1).


Fig. 1. Anatomy of a radar chart (from The Data Visualisation Catalogue)

In SAS's second-generation graphics (the G-series procedures), the dedicated PROC GRADAR handles this task, but it has many functional limitations and rather poor visual results; to get anything fancier, you must resort to annotate data sets. Although the third-generation SG procedures have no dedicated radar chart statement, several posts on the official SAS blogs describe how to combine SG statements to draw radar charts, so producing a chart like Fig. 1 is not very difficult.

In the previous post, 《SAS程序优化_一万次循环》 (SAS program optimization: ten thousand loops), I used a radar chart (Fig. 2) to illustrate unbalanced system resource usage: disk I/O utilization was high and limited the computer's performance, a classic trade-off. The same situation could be shown with a bar chart, but the effect is not as good as with a radar chart.

Fig. 2. Computer system resource usage

Key points for drawing the chart:

1. Because this is not a regular XY plot but a polar one, the data must be transformed from XY coordinates to polar coordinates, which requires converting degrees to radians;

2. There is only one polar axis, and the axes become spokes, which can be drawn with the SGPLOT VECTOR statement. The tick marks and reference lines on the axes can be drawn with POLYGON (polygons) or ELLIPSEPARM (circles); the data points are drawn with SCATTER; and the lines connecting the data points are drawn with POLYGON;

3. That leaves labeling the axes. The default DATALABEL placement is restricted and often overlaps some of the tick reference lines, which looks terrible; GRADAR cannot avoid this problem either, short of using annotate, which is quite painful. The SGPLOT TEXT statement handles this task easily and produces a clean, attractive result.

Code part 1: the plotting data

data ex;
do situation = 1 to 0 by -1;
   input cpu memory disk gpu network;
   output;
end;
cards;
20 20 90 10 10
50 50 50 50 50
;
run;
proc sort data=ex out=ex; by situation; run;
proc transpose data=ex out=ex2 name=config prefix=use;
by situation;
run;

If your data already look like Fig. 3, you do not need the conversion above.

Fig. 3. The expected form of the actual data

Code part 2: build the polar-coordinate framework and data. The transformed data are shown in Fig. 4.

data ex3;
   set ex2;
   deg2Rad = constant('pi')/180;
   deg     = 90 - 360/5*(_N_-1);   /* the 5 is determined by the number of variables */
   angle   = deg*deg2Rad;
   rangle  = -360/5*(_N_-1);       /* the 5 is determined by the number of variables */
   rdx = use1*cos(angle);
   rdy = use1*sin(angle);
   rx  = 100*cos(angle); ry = 100*sin(angle);  /* the 100 depends on your data */
   config = upcase(config);
   if situation then situ_ = "实际情况";   /* "actual usage" */
   else situ_ = "理性情况";                /* "ideal usage"  */
run;

Fig. 4. The transformed actual data and the plotting data

Code part 3: drawing the chart

ods listing image_dpi=600 gpath="d:";
ods graphics on / imagename="Figure_" HEIGHT=7cm WIDTH=7cm border=no;
footnote;
title;
proc sgplot data=ex3 ASPECT=1 noborder SUBPIXEL;
STYLEATTRS DATACONTRASTCOLORS=(blue red);
    polygon x=rdx y=rdy id=situation / group=situ_ outline LINEATTRS=(pattern=solid color=blue THICKNESS=2) fill fillattrs=(transparency=0.) name="sxlion";
    vector x=rx y=ry / xOrigin=0 yOrigin=0 NOARROWHEADS LINEATTRS=(pattern=solid color=black THICKNESS=1.5);
    text x=rx y=ry text=config / ROTATE=rangle position=top textattrs=(Color=darkblue Family="Helvetica" Size=10 Weight=Bold);
    ellipseparm SEMIMAJOR=50 SEMIMINOR=50;  /* middle tick circle; adjust as needed */
    ellipseparm SEMIMAJOR=99 SEMIMINOR=99;  /* outer tick circle; adjust as needed  */
    scatter x=rdx y=rdy / markerattrs=(symbol=circlefilled color=red Size=6);
keylegend "sxlion" / noborder;
xaxis DISPLAY=none;
yaxis DISPLAY=none;
run;
ods graphics off;

The SG statements are modular: stacking several of them together produces a radar chart. Because this is a composition, any compatible SG statements can be layered, and they are drawn in the order in which they appear, so you can arrange the statement order as needed.

Original article: "雷达图及SAS绘图 by sxlion" (Radar charts and SAS graphics). When reposting, please credit: SAS资源资讯列表 (saslist.net).

Permalink: http://saslist.net/archives/475