Rick Wicklin

August 19, 2019
 

One of my friends likes to remind me that "there is no such thing as a free lunch," which he abbreviates as "TINSTAAFL" (or TANSTAAFL). The TINSTAAFL principle applies to computer programming because you often end up paying a cost (in performance) when you call a convenience function that simplifies your program.

I was thinking about TINSTAAFL recently when I was calling a Base SAS function from the SAS/IML matrix language. The SAS/IML language supports hundreds of built-in functions that operate on vectors and matrices. However, you can also call hundreds of functions in Base SAS and pass in vectors for the parameters. It is awesome and convenient to be able to call the virtual smorgasbord of functions in Base SAS, such as probability functions, string-matching functions, trig functions, financial functions, and more. Of course, there is no such thing as a free lunch, so I wondered about the overhead costs associated with calling a Base SAS function from SAS/IML. Base SAS functions typically are designed to operate on scalar values, so the IML language has to call the underlying function many times, once for each value of the parameter vector. It is more expensive to call a function a million times (each time passing in a scalar parameter) than it is to call a function one time and pass in a vector that contains a million parameters.

To determine the overhead costs, I decided to test the cost of calling the MISSING function in Base SAS. The IML language has a built-in syntax (b = (X=.)) for creating a binary variable that indicates which elements of a vector are missing. The call to the MISSING function (b = missing(X)) is equivalent, but it requires calling a Base SAS function many times, once for each element of X. The native SAS/IML syntax will be faster than calling a Base SAS function (TINSTAAFL!), but how much faster?

The following program incorporates many of my tips for measuring the performance of a SAS computation. The test is run on large vectors of various sizes. Each computation (which is very fast, even on large vectors) is repeated 50 times. The results are presented in a graph. The following program measures the performance for a character vector that contains all missing values.

/* Compare performance of IML syntax
   b = (X = " ");
   to performance of calling Base SAS MISSING function 
   b = missing(X);
*/
proc iml;
numRep = 50;                            /* repeat each computation 50 times */
sizes = {1E4, 2.5E4, 5E4, 10E4, 20E4};  /* number of elements in vector */
labl = {"Size" "T_IML" "T_Missing"};
Results = j(nrow(sizes), 3);
Results[,1] = sizes;
 
/* measure performance for character data */
do i = 1 to nrow(sizes);
   A = j(sizes[i], 1, " ");            /* every element is missing */
   t0 = time();
   do k = 1 to numRep;
      b = (A = " ");                   /* use built-in IML syntax */
   end;
   Results[i, 2] = (time() - t0) / numRep;
 
   t0 = time();
   do k = 1 to numRep;
      b = missing(A);                  /* call Base SAS function */
   end;
   Results[i, 3] = (time() - t0) / numRep;
end;
 
title "Timing Results for (X=' ') vs missing(X) in SAS/IML";
title2 "Character Data";
long = (sizes // sizes) || (Results[,2] // Results[,3]);   /* convert from wide to long for graphing */
Group = j(nrow(sizes), 1, "T_IML") // j(nrow(sizes), 1, "T_Missing"); 
call series(long[,1], long[,2]) group=Group grid={x y} label={"Size" "Time (s)"} 
            option="markers curvelabel" other="format X comma8.;";

The graph shows that the absolute times for creating a binary indicator variable are very fast for both methods. Even for 200,000 observations, creating a binary indicator variable takes less than five milliseconds. However, on a relative scale, the built-in SAS/IML syntax is more than twice as fast as calling the Base SAS MISSING function.

You can run a similar test for numeric values. For numeric values, the SAS/IML syntax is about 10-20 times faster than the call to the MISSING function, but, again, the absolute times are less than five milliseconds.
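If you want to run the numeric comparison yourself, the following sketch (mine, not the full timing program from above) times a single large numeric vector instead of a range of sizes:

proc iml;
numRep = 50;
x = j(20E4, 1, .);                  /* numeric vector; every element is missing */
 
t0 = time();
do k = 1 to numRep;
   b = (x = .);                     /* use built-in IML syntax */
end;
T_IML = (time() - t0) / numRep;
 
t0 = time();
do k = 1 to numRep;
   b = missing(x);                  /* call the Base SAS MISSING function */
end;
T_Missing = (time() - t0) / numRep;
print T_IML T_Missing;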

So, what's the cost of calling a Base SAS function from SAS/IML? It's not free, but it's very cheap in absolute terms! Of course, the cost depends on the number of elements that you are sending to the Base SAS function. However, in general, there is hardly any cost associated with calling a Base SAS function from SAS/IML. So enjoy the lunch buffet! Not only is it convenient and plentiful, but it's also very cheap!

The post Timing performance in SAS/IML: Built-in functions versus Base SAS functions appeared first on The DO Loop.

August 14, 2019
 

Many programmers are familiar with "short-circuit" evaluation in an IF-THEN statement. Short circuit means that a program does not evaluate the remainder of a logical expression if the value of the expression is already logically determined. The SAS DATA step supports short-circuiting for simple logical expressions in IF-THEN statements and WHERE clauses (Polzin 1994, p. 1579; Gilsen 1997). For example, in the following logical-AND expression, the condition for the variable Y does not need to be checked if the condition for variable X is true:

data _null_;
set A end=eof;
if x>0 & y>0 then   /* Y is evaluated only if X>0 */
   count + 1;
if eof then 
   put count=;
run;

Order the conditions in a logical statement by likelihood

SAS programmers can optimize their IF-THEN and WHERE clauses if they can estimate the probability of each separate condition in a logical expression:

  • In a logical AND statement, put the least likely events first and the most likely events last. For example, suppose you want to find patients at a VA hospital that are male, over the age of 50, and have kidney cancer. You know that kidney cancer is a rare form of cancer. You also know that most patients at the VA hospital are male. To optimize a WHERE clause, you should put the least probable conditions first:
    WHERE Disease="Kidney Cancer" & Age>50 & Sex="Male";
  • In a logical OR statement, put the most likely events first and the least likely events last. For example, suppose you want to find all patients at a VA hospital that are either male, or over the age of 50, or have kidney cancer. To optimize a WHERE clause, you should use
    WHERE Sex="Male" | Age>50 | Disease="Kidney Cancer";

The SAS documentation does not discuss the conditions under which a logical expression does or does not short circuit. Polzin (1994, p. 1579) points out that when you put function calls in a logical expression, SAS evaluates function calls that produce side effects, even when the truth value of the expression is already determined. Common functions that have side effects include random number functions and user-defined functions (via PROC FCMP) that have output arguments. The LAG and DIF functions can also produce side effects, but it appears that expressions that involve the LAG and DIF functions are short-circuited. You can force a function evaluation by calling the function prior to an IF-THEN statement, and you can use nested IF-THEN/ELSE statements to ensure that functions are not evaluated unless prior conditions are satisfied.
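For example, the following sketch (my own construction, not from the cited papers) nests the IF-THEN statements so that the RAND function, which has side effects, is evaluated only when the first condition is true:

data _null_;
set A end=eof;
if x > 0 then do;                   /* outer condition is checked first */
   if rand("Uniform") < 0.5 then    /* RAND is evaluated only when x > 0 */
      count + 1;
end;
if eof then 
   put count=;
run;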

Logical ligatures

The SAS/IML language does not support short-circuiting in IF-THEN statements, but it performs several similar optimizations that are designed to speed up your code execution. One optimization involves the ANY and ALL functions, which test whether any or all (respectively) elements of a matrix satisfy a logical condition. A common usage is to test whether a missing value appears anywhere in a vector, as shown in the following SAS/IML statement:

bAny = any(y = .);   /* TRUE if any element of Y is missing */
/* Equivalently, use   bAll = all(y ^= .);  */

The SAS/IML language treats simple logical expressions like these as a single function call, not as two operations. I call this a logical ligature because two operations are combined into one. (A computer scientist might just call this a parser optimization.)

You might assume that the expression ANY(Y=.) is evaluated by using a two-step process. In the first step, the Boolean expression y=. is evaluated and the result is assigned to a temporary binary matrix, which is the same size as Y. In the second step, the temporary matrix is sent to the ANY function, which evaluates the binary matrix and returns TRUE if any element is nonzero. However, it turns out that SAS/IML does not use a temporary matrix. The SAS/IML parser recognizes that the expression inside the ANY function is a simple logical expression. The program can evaluate the function by looking at each element of Y and returning TRUE as soon as it finds a missing value. In other words, it short-circuits the computation. If no value is missing, the expression evaluates to FALSE.

Short circuiting an operation can save considerable time. In the following SAS/IML program, the vector X contains 100 million elements, all equal to 1. The vector Y also contains 100 million elements, but the first element of the vector is a missing value. Consequently, the computation for Y is essentially instantaneous whereas the computation for X takes a tenth of a second:

proc iml;
numReps = 10;      /* run computations 10 times and report average */
N = 1E8;           /* create vector with 100 million elements */
x = j(N, 1, 1);    /* all elements of x equal 1 */
y = x; y[1] = .;   /* the first element of y is missing */
 
/* the ALL and ANY functions short-circuit when the 
   argument is a simple logical expression */
/* these function calls examine only the first element */
t0 = time();
do i = 1 to numReps;
   bAny = any(y = .);   /* TRUE because y[1] is missing */
   bAll = all(y ^= .);  /* FALSE because y[1] is missing */
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];
 
/* these function calls must examine all elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(x = .);   
   bAll = all(x ^= .);
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];

Although the evaluation of X does not short circuit, it still uses the logical ligature to evaluate the expression. Consequently, the evaluation is much faster than the naive two-step process that is shown explicitly by the following statements, which require about 0.3 seconds and twice as much memory:

   /* two-step process: slower */
   b1 = (y=.);                /* form the binary vector */
   bAny = any(b1);            /* TRUE for y[1] */

In summary, the SAS DATA step uses short-circuit evaluation in IF-THEN statements and WHERE clauses that use simple logical expressions. If the expression contains several subexpressions, you can optimize your program by estimating the probability that each subexpression is true. In the SAS/IML language, the ANY and ALL functions not only short circuit, but when their argument is a simple Boolean expression, the language treats the function call as a logical ligature and evaluates the call in an efficient manner that does not require any temporary memory.

Short circuits can be perplexing if you don't expect them. Equally confusing is expecting a statement to short circuit, but it doesn't. If you have a funny story about short-circuit operators, in any language, leave a comment.

The post Short-circuit evaluation and logical ligatures in SAS appeared first on The DO Loop.

August 12, 2019
 
What is this math good for, anyway?
     –Every student, everywhere

I am a professional applied mathematician, yet many of the mathematical and statistical techniques that I use every day are not from advanced university courses but are based on simple ideas taught in high school or even in grade school. I've written many blog posts in which the solution to an interesting problem requires little more than high-school math. Even when the solution requires advanced techniques, high-school math often provides the basis for solving the problem.

In celebration of the upcoming school year, here are 12 articles that show connections between advanced topics and mathematical ideas that are introduced in high-school or earlier. If you have a school-age child, read some of the articles to prepare yourself for the inevitable mid-year whine, "But, Mom/Dad, when will I ever use this stuff?!"

Grade-school math

Obviously, most adults use basic arithmetic, fractions, decimals, and percents, but here are some less obvious uses of grade-school math:

High-school math

Algebra, linear transformations, geometry, and trigonometry are the main topics in high school mathematics. These topics are the bread-and-butter of applied mathematics:

  • Linear transformations: Anytime you create a graph on the computer screen, a linear transformation transforms the data from physical units (weight, cost, time) into pixel values. Although modern graphical software performs that linear transformation for you, a situation in which you have to manually apply a linear transformation is when you want to display data in two units (pounds and kilograms, dollars and Euros, etc). You can use a simple linear transformation to align the tick labels on the two axes.
  • Intersections: In high school, you learn to compute the intersection between two lines. You can extend the problem to find the intersection of two line segments. I needed that result to solve a problem in probability theory.
  • Solve a system of equations: In Algebra II, you learn to solve a system of equations. Solving linear systems is the foundation of linear algebra. Solving nonlinear systems is among the most important skills in applied mathematics.
  • Find the roots of a nonlinear equation: Numerically finding the roots of a function is taught in pre-calculus. It is the basis for all "inverse problems" in which you want to find inputs to a function that produce a specified output. In statistics, a common "inverse problem" is finding the quantile of a cumulative probability distribution.
  • Binomial coefficients: In algebra, many teachers use the mnemonic FOIL (First, Outer, Inner, Last) to teach students how to compute the quadratic expansion of (a + b)^2. Later, students learn the binomial expansion of an arbitrary power, (a + b)^n. The coefficients in this expansion are called the binomial coefficients and appear in Pascal's triangle as well as in many discrete probability distributions such as the negative binomial and hypergeometric distributions.
  • Pythagorean triples: In trigonometry, a huge number of homework problems involve using right triangles with side lengths that are proportional to (3, 4, 5) and (5, 12, 13). These are two examples of Pythagorean triples: right triangles whose side lengths are all integers. It turns out that you can use linear transformations to generate all primitive triples from the single triple (3, 4, 5). A high school student can understand this process, although the process is most naturally expressed in terms of matrix multiplication, which is not always taught in high school. (A short SAS/IML sketch of this process follows the list.)
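To make that last item concrete, here is a sketch of the matrix-multiplication process in SAS/IML. The three matrices are the classical Barning-Hall matrices; each one maps a primitive Pythagorean triple (written as a column vector) to a new primitive triple, so repeated multiplication starting from (3, 4, 5) generates the entire tree of primitive triples.

proc iml;
/* each matrix maps a primitive triple to a new primitive triple */
A = { 1 -2  2,
      2 -1  2,
      2 -2  3 };
B = { 1  2  2,
      2  1  2,
      2  2  3 };
C = {-1  2  2,
     -2  1  2,
     -2  2  3 };
t = {3, 4, 5};                      /* the root triple */
print (A*t)[label="A*t"]  (B*t)[label="B*t"]  (C*t)[label="C*t"];
/* A*t = (5,12,13);  B*t = (21,20,29);  C*t = (15,8,17) */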

High-school statistics

Many high schools offer a unit on probability and statistics, and some students take AP Statistics.

Einstein famously said, "everything should be made as simple as possible, but not simpler." It is surprising to me how often an advanced technique can be simplified and explained by using elementary math. I don't claim that "everything I needed to know about math I learned in kindergarten," but I often return to elementary techniques when I describe how to solve non-elementary problems.

What about you? What are some elementary math or statistics concepts that you use regularly in your professional life? Are there fundamental topics that you learned in high school that are deeper and more useful than you realized at the time? Leave a comment.

The post The math you learned in school: Yes, it’s useful! appeared first on The DO Loop.


August 7, 2019
 
2-D binning of counts

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.

SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.

Equal-width binning in SAS

The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.
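As a quick illustration (a sketch that uses the Sashelp.Cars data, which is not one of the data sets in the linked articles), the following statements assign k equal-width bins by hand:

/* bucket binning by hand: k equal-width bins for MSRP */
proc means data=Sashelp.Cars noprint;
   var MSRP;
   output out=_range min=minX max=maxX;
run;
 
data Bucket;
   if _n_ = 1 then set _range;          /* make minX and maxX available */
   set Sashelp.Cars(keep=MSRP);
   k = 5;
   width = (maxX - minX) / k;
   Bin = min( floor((MSRP - minX)/width) + 1, k );  /* clamp the max value into bin k */
run;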

Quantile binning in SAS

In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints. Here are several resources for quantile binning:
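One quick way to produce quantile bins (a sketch, independent of the linked resources) is PROC RANK with the GROUPS= option, which assigns each observation to one of k approximately equal-sized groups numbered 0 through k-1:

proc rank data=Sashelp.Cars groups=4 out=QuartileBins;
   var MSRP;
   ranks Bin;        /* Bin = 0, 1, 2, or 3 */
run;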

Binning by using arbitrary cutpoints in SAS

Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.
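For example, the following sketch encodes the Saffir-Simpson categories as a custom format. The wind-speed boundaries (in mph) are my own transcription of the scale; check them against the official definition before using them:

proc format;
value Hurricane
   74  -< 96   = 'Cat 1'
   96  -< 111  = 'Cat 2'
   111 -< 130  = 'Cat 3'
   130 -< 157  = 'Cat 4'
   157 -  high = 'Cat 5';
run;
 
data Storms;
input WindSpeed @@;
Category = put(WindSpeed, Hurricane.);   /* assign bins via the custom format */
datalines;
80 100 115 140 160
;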

2-D binning and bivariate histograms in SAS

A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).
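If you are running SAS 9.4M3 or later, the HEATMAP statement in PROC SGPLOT gives a quick bivariate histogram. The following sketch (the variables are arbitrary) bins the data in two dimensions and colors each cell by its count:

proc sgplot data=Sashelp.Cars;
   heatmap x=Weight y=MPG_City / nxbins=20 nybins=20;
run;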

In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.

The post The essential guide to binning in SAS appeared first on The DO Loop.

August 5, 2019
 

Binning transforms a continuous numerical variable into a discrete variable with a small number of values. When you bin univariate data, you define cut points that define discrete groups. I've previously shown how to use PROC FORMAT in SAS to bin numerical variables and give each group a meaningful name such as 'Low,' 'Medium,' and 'High.' This article uses PROC HPBIN to create bins that are assigned numbers. If you bin the data into k groups, the groups have the integer values 1, 2, 3, ..., k. Missing values are put into the zeroth group.

There are two popular ways to choose the cut points. You can use evenly spaced points, or you can use quantiles of the data. If you use evenly spaced cut points (as in a histogram), the number of observations in each bin will usually vary. Using evenly spaced cut points is called the "bucket binning" method. Alternatively, if you use quantiles as cut points (such as tertiles, quartiles, or deciles), the number of observations in each bin tends to be more balanced. This article shows how to use PROC HPBIN in SAS to perform bucket binning and quantile binning. It also shows how to use the CODE statement in PROC HPBIN to create a DATA step that uses the same cut points to bin future observations.

Create sample data

The following statements create sample data from the Sashelp.Heart data. An ID variable is added to the data so that you can identify each observation. A call to PROC MEANS produces descriptive statistics about two variables: Cholesterol and Systolic blood pressure.

data Heart;
format PatientID Z5.;
set Sashelp.Heart(keep=Sex Cholesterol Systolic);
PatientID = 10000 + _N_;
run;
 
proc means data=Heart nolabels N NMISS Min Max Skewness;
   var Cholesterol Systolic;
run;
Descriptive statistics for two variables

The output shows the range of the data for each variable. It also shows that the Cholesterol variable has 152 missing values. If your analysis requires nonmissing observations, you can use PROC HPIMPUTE to replace the missing values. For this article, I will not replace the missing values so that you can see how PROC HPBIN handles missing values.
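A sketch of such an imputation call appears below; check the PROC HPIMPUTE documentation for the available methods and for the names of the imputed output variables:

proc hpimpute data=Heart out=HeartImp;
   input Cholesterol;
   impute Cholesterol / method=mean;   /* replace missing values with the mean */
run;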

Each variable has a skewed distribution, as indicated by the values of the skewness statistic. This usually indicates that equal-length binning will result in bins in the tail of the distribution that have only a few observations.

Use PROC HPBIN to bin data into equal-length bins

A histogram divides the range of the data by using k evenly spaced cutpoints. The width of each bin is (Max – Min) / k. PROC HPBIN enables you to create new variables that indicate to which bin each observation belongs. You can use the global NUMBIN= option on the PROC HPBIN statement to set the default number of bins for each variable. You can use the INPUT statement to specify which variables to bin. You can override the default number of bins by using the NUMBIN= option on any INPUT statement.

Suppose that you want to bin the Cholesterol data into five bins and the remaining variables into three bins.

  • The range of the Cholesterol data is [96, 568], so the width of the five bins that contain nonmissing values will be 94.4.
  • The range of the Systolic data is [82, 300], so the width of the three bins will be about 72.67.

The following call to PROC HPBIN bins the variables. The output data set, HeartBin, contains the bin numbers for each observation.

/* equally divide the range of each variable (bucket binning) */
proc hpbin data=Heart output=HeartBin numbin=3;  /* global NUMBIN= option */
   input Cholesterol / numbin=5;                 /* override global NUMBIN= option */
   input Systolic;                 
   id PatientID Sex;
run;
Cutpoints and frequency of observations for bucket binning using PROC HPBIN in SAS

Part of the output from PROC HPBIN is shown. (You can suppress the output by using the NOPRINT option.) The first table shows that PROC HPBIN used four threads on my PC to compute the results in parallel. The second table summarizes the transformation that bins the data. For each variable, the second column gives the names of the binned variables in the OUTPUT= data set. The third column shows the cutpoints for each bin. The Frequency and Proportion columns show the frequency and proportion (respectively) of observations in each bin. As expected for these skewed variables, bins in the tail of each variable contain very few observations (less than 1% of the total).

The OUTPUT= option creates an output data set that contains the indicator variables for the bins. You can use PROC FREQ to enumerate the bin values and (again) count the number of observations in each bin:

proc freq data=HeartBin;
   tables BIN_Cholesterol BIN_Systolic / nocum;
run;
Frequency of observations in each bin for bucket binning using PROC HPBIN in SAS

Notice that the Cholesterol variable was split into six bins even though the syntax specified NUMBIN=5. If a variable contains missing values, a separate bin is created for them. In this case, the zeroth bin contains the 152 missing values for the Cholesterol variable.

Bucket binning divides the range of the variables into equal-width intervals. For long-tailed data, the number of observations in each bin might vary widely, as for these data. The next section shows an alternative binning strategy in which the width of the bins vary and each bin contains approximately the same number of observations.

Use PROC HPBIN to bin data by using quantiles

You can use evenly spaced quantiles as cutpoints in an attempt to balance the number of observations in the bins. However, if the data are rounded or have duplicate values, the number of observations in each bin can still vary. PROC HPBIN has two methods for quantile binning. The slower method (the QUANTILE option) computes cutpoints based on the sample quantiles and then bins the observations. The faster method (the PSEUDO_QUANTILE option) uses approximate quantiles to bin the data. The following call uses the PSEUDO_QUANTILE option to bin the data into approximately equal groups:

/* bin by quantiles of each variable */
proc hpbin data=Heart output=HeartBin numbin=3 pseudo_quantile;
   input Cholesterol / numbin=5;    /* override global NUMBIN= option */
   input Systolic;                  /* use global NUMBIN= option */
   id PatientID Sex;
   code file='C:/Temp/BinCode.sas'; /* generate scoring code */
run;
Cutpoints and frequency of observations in each bin for quantile binning using PROC HPBIN in SAS

The output shows that the number of observations in each bin is more balanced. For the Systolic variable, each bin has between 1,697 and 1,773 observations. For the Cholesterol variable, each bin contains between 975 and 1,056 observations. Although not shown in the table, the BIN_Cholesterol variable also contains a bin for the 152 missing values for the Cholesterol variable.

Use PROC HPBIN to write DATA step code to bin future observations

In the previous section, I used the CODE statement to specify a file that contains SAS DATA step code that can be used to bin future observations. The statements in the BinCode.sas file are shown below:

*****************      BIN_Systolic     ********************;
length BIN_Systolic 8;
if missing(Systolic) then do; BIN_Systolic = 0; end;
else if Systolic < 124.0086 then do; BIN_Systolic =     1; end;
else if 124.0086 <= Systolic < 140.0098 then do; BIN_Systolic =     2; end;
else if 140.0098 <= Systolic then do; BIN_Systolic =     3; end;
 
*****************      BIN_Cholesterol     ********************;
length BIN_Cholesterol 8;
if missing(Cholesterol) then do; BIN_Cholesterol = 0; end;
else if Cholesterol < 190.0224 then do; BIN_Cholesterol =     1; end;
else if 190.0224 <= Cholesterol < 213.0088 then do; BIN_Cholesterol =     2; end;
else if 213.0088 <= Cholesterol < 234.0128 then do; BIN_Cholesterol =     3; end;
else if 234.0128 <= Cholesterol < 263.0408 then do; BIN_Cholesterol =     4; end;
else if 263.0408 <= Cholesterol then do; BIN_Cholesterol =     5; end;

You can see from these statements that the zeroth bin is reserved for missing values. Nonmissing values will be split into bins according to the approximate tertiles (NUMBIN=3) or quintiles (NUMBIN=5) of the training data.

The following statements show how to use the file that was created by the CODE statement. New data is contained in the Patients data set. Simply use the SET statement and the %INCLUDE statement to bin the new data, as follows:

data Patients;
length Sex $6;
input PatientID Sex Systolic Cholesterol;
datalines;
13021 Male    96 . 
13022 Male   148 242 
13023 Female 144 217 
13024 Female 164 376 
13025 Female .   248 
13026 Male   190 238 
13027 Female 158 326 
13028 Female 188 266 
;
 
data MakeBins;
set Patients;
%include 'C:/Temp/BinCode.sas';   /* include the binning statements */
run;
 
/* Note: HPBIN puts missing values into bin 0 */
proc print data=MakeBins;  run;
Binning new observations by using PROC HPBIN in SAS

The input data can contain other variables (PatientID, Sex) that are not binned. However, the data should contain the Systolic and Cholesterol variables because the statements in the BinCode.sas file refer to those variables.

Summary

In summary, you can use PROC HPBIN in SAS to create a new discrete variable by binning a continuous variable. This transformation is common in machine learning algorithms. Two common binning methods include bucket binning (equal-length bins) and quantile binning (unequal-length bins). Missing values are put into their own bin (the zeroth bin). The CODE statement in PROC HPBIN enables you to write DATA step code that you can use to bin future observations.

The post How to use PROC HPBIN to bin numerical variables appeared first on The DO Loop.


July 31, 2019
 

Sometimes a little thing can make a big difference. I am enjoying a new enhancement of SAS/IML 15.1, which enables you to use a numeric vector as the column header or row header when you print a SAS/IML matrix. Prior to SAS/IML 15.1, you had to first use the CHAR or PUTN function to manually apply a format to the numeric values. You could then use the formatted values as column or row headers. Now you can skip the formatting step.

A common situation in which I want to use numeric values as a column header is when I am using the TABULATE function to compute the frequencies for a discrete numerical variable. For example, the Cylinders variable in the Sashelp.Cars data set indicates the number of cylinders in each vehicle. You might want to know how many vehicles in the data are four-cylinder, six-cylinder, eight-cylinder, and so forth. In SAS/IML 15.1, you can use a numeric variable (such as the levels of the Cylinders variable) directly in the COLNAME= option of the PRINT statement, as follows:

proc iml;
use Sashelp.Cars;
read all var "Cylinders" into X;
close;
 
call tabulate(level, freq, X);      /* count number of obs for each level of X */
/* New in SAS/IML 15.1: use numeric values as COLNAME= option to PRINT statement */
print freq[colname=level];

Prior to SAS/IML 15.1, you had to convert numbers to character values by applying a format. This is commonly done by using the CHAR function, which applies a W.D format. This technique is still useful, but optional. You can also use the PUTN function to apply any format you want, such as COMMA., DATE., TIME., and so forth. For example, if you like Roman numerals, you could apply the ROMAN. format!

labels = char(level, 2);         /* apply w.d format */
print freq[colname=labels];
 
labels = putn(level, "ROMAN.");  /* second arg is name of format */
print freq[colname=labels];

You can also use numeric values for the ROWNAME= option to obtain row headers in SAS/IML 15.1. Sure, it's a small enhancement, but I like it!
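For the record, the row-header version is a one-line change to the earlier PRINT statement:

print freq[rowname=level];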

The post Use numeric values for column headers when printing a matrix appeared first on The DO Loop.

July 29, 2019
 

When my colleague, Robert Allison, blogged about visualizing the Mandelbrot set, I was reminded of a story from the 1980s, which was the height of the fractal craze. A research group in computational mathematics had been awarded a multimillion-dollar grant to purchase a supercomputer. When the supercomputer arrived and got set up, what was the first program they ran? It was a computation of the Mandelbrot set! They wanted to see how fast they could generate pretty pictures with their supercomputer!

Although fractals are no longer in vogue, the desire for fast computations never goes out of style. In a matrix-vector language such as SAS/IML, you can achieve fast computations by vectorizing the computations, which means operating on large quantities of data and using matrix-vector computations. Sometimes this paradigm requires rethinking an algorithm. This article shows how to vectorize the Mandelbrot set computation. The example provides important lessons for statistical computations, and, yes, you get to see a pretty picture at the end!

The classical algorithm for the Mandelbrot set

As a reminder, the Mandelbrot set is a visualization of a portion of the complex plane. If c is a complex number, you determine the color for c by iterating a complex quadratic map z <- z2 + c, starting with z=0. You keep track of how many iterations it takes before the iteration diverges to infinity, which in practice means that |z| exceeds some radius, such as 5. The parameter values for which the iteration remains bounded belong to the Mandelbrot set. Other points are colored according to the number of iterations before the trajectory exceeded the specified radius.

The classical computation of the Mandelbrot set iterates over the parameter values in a grid, as follows:

For each value c in a grid:
   Set z = 0
   For i = 1 to MaxIter:
      Apply the quadratic map, f, to form the i_th iteration, z_i = f^i(z; c).
      If |z_i| > radius, stop and remember the number of iterations.
   End;
   If the trajectory did not diverge, assume parameter value is in the Mandelbrot set.
End;

An advantage of this algorithm is that it does not require much memory, so it was great for the PCs of the 1980s! For each parameter, the color is determined (and often plotted), and then the parameter is never needed again.

A vectorized algorithm for the Mandelbrot set

In the classical algorithm, all computations are scalar computations. The outer loop is typically over a large number, K, of grid points. The maximum number of iterations is typically 100 or more. Thus, the algorithm performs 100*K scalar computations in order to obtain the colors for the image. For a large grid that contains K=250,000 parameters, that's 25 million scalar computations for the low-memory algorithm.

A vectorized version of this algorithm inverts the two loops. Instead of looping over the parameters and iterating each associated quadratic map, you can store the parameter values in a grid and iterate all K trajectories at once by applying the quadratic map in a vectorized manner. The vectorized algorithm is:

Define c to be a large grid of parameters.
Initialize a large grid z = 0, which will hold the current state.
For i = 1 to MaxIter:
   For all points that have not yet diverged:
      Apply the quadratic map, f, to z to update the current state.
      If any |z| > radius, record the number of iterations for those parameters.
   End;
End;
If a trajectory did not diverge, assume parameter value is in the Mandelbrot set.

The vectorized algorithm performs the same computations as the scalar algorithm, but each iteration of the quadratic map operates on a huge number of current states (z). Also, all states are checked for divergence by using a single vectorized operation. There are many fewer function calls, which reduces overhead costs, but the vectorized program uses a lot of memory to store all the parameters and states.

Let's see how the algorithm might be implemented in the SAS/IML language. First, define vectorized functions for performing the complex quadratic map and the computation of the complex magnitude (absolute value). Because SAS/IML does not provide built-in support for complex numbers, each complex number is represented by a 2-D vector, where the first column is the real part and the second column is the imaginary part.

ods graphics / width=720px height=640px NXYBINSMAX=1200000;  /* enable large heat maps */
proc iml;
/* Complex computations: z and c are (N x 2) matrices that represent complex numbers */
start Magnitude(z);
   return ( z[,##] );           /* squared magnitude: |z|^2 = x^2 + y^2 */
finish;
 
/* complex iteration of quadratic mapping  z -> z^2 + c
   For complex numbers z = x + iy and c = a + ib,
   w = z^2 + c is the complex number (x^2 - y^2 + a) + i(2*x*y + b) */
start Map(z, c);
   w = j(nrow(z), 2, 0);
   w[,1] =   z[,1]#z[,1] - z[,2]#z[,2] + c[,1];
   w[,2] = 2*z[,1]#z[,2] + c[,2];
   return ( w );
finish;

The next block of statements defines a grid of parameters for the parameter, c:

/* Set parameters:
   initial range for grid of points and number of grid divisions
   Radius for divergence and max number of iterations */
nRe = 541;                      /* num pts in Real (horiz) direction */
nIm = 521;                      /* num pts in Imag (vert) direction */
re_min = -2.1;   re_max = 0.6;  /* range in Real direction */
im_min = -1.3;   im_max = 1.3;  /* range in Imag direction */
radius = 5;                     /* when |z| > radius, iteration has diverged */
MaxIter = 100;                  /* maximum number of iterations */
 
d_Re = (Re_max - Re_min) / (nRe-1);   /* horiz step size */
d_Im = (Im_max - Im_min) / (nIm-1);   /* vert step size */
Re = do(re_min, re_max, d_Re);        /* evenly spaced array of horiz values */
Im = do(im_min, im_max, d_Im);        /* evenly spaced array of vert values */
 
c = expandgrid( re, im);       /* make into 2D grid */
z = j(nrow(c), 2, 0);          /* initialize z = 0 for grid of trajectories */
iters = j(nrow(c), 1, .);      /* for each parameter, how many iterations until diverge */

In this example, the parameters for the mapping are chosen in the rectangle with real part [-2.1, 0.6] and imaginary part [-1.3, 1.3]. The parameter grid contains 541 horizontal values and 521 vertical values, for a total of almost 282,000 parameters.

The vectorized program will iterate all 282,000 mappings by using a single call to the Map function. After each iteration, all trajectories are checked to see which ones have left the disk of radius 5 (diverged). The iters vector stores the number of iterations until leaving the disk. During the iterations, a missing value indicates that the trajectory has not yet left the disk. At the end of the program, all trajectories that have not left the disk are set to the maximum number of iterations.

There are two efficiencies you can implement. First, you can pre-process some of the parameters that are known to be inside the "head" or "body" of the bug-shaped Mandelbrot set. Second, for each iteration, you only need to apply the quadratic map to the points that have not yet diverged (that is, iters is missing for that parameter). This is implemented in the following program statements:

/* Pre-process parameters that are known to be in Mandelbrot set.
   Params in "head": inside circle of radius 1/8 centered at (-1, 0) 
   Params in "body": inside rect [-0.5, 0.2] x [-0.5, 0.5] 
*/
idxHead = loc( (c - {-1 0})[,##] < 0.0625 );
if ncol(idxHead) > 0 then  iters[idxHead] = MaxIter;
idxBody = loc(c[,1]> -0.5 & c[,1]< 0.2 & c[,2]> -0.5 & c[,2]< 0.5);
if ncol(idxBody) > 0 then  iters[idxBody] = MaxIter;
 
/* compute the Mandelbrot set */
done = 0;                   /* have all points diverged? */
do i = 1 to MaxIter until(done);       
   j = loc(iters=.);        /* find the points that remain bounded */
   w = z[j,]; a = c[j,];    /* keep only states and parameters that have not diverged */
   w = Map(w, a);           /* map those points forward one iteration */
   z[j,] = w;               /* update current state */
   mag = Magnitude(w);      /* compute magnitude of current states */
   diverged = (mag >= radius) & (iters[j] = .);  /* which points diverged on this iteration? */
   idx = loc(diverged);     /* indices of the diverged points */
   if ncol(idx)> 0 then     /* if diverged, remember the iteration number */
      iters[j[idx]] = i;
   if all(diverged > 0) then 
      done = 1;
end;
/* Points that never diverged are part of the Mandelbrot set. Assign them MaxIter */
idx = loc(iters=.);
if ncol(idx)>0 then 
   iters[idx] = MaxIter;

The last step is to assign colors that visualize the Mandelbrot set. Whereas Robert Allison used a scatter plot, I will use a heat map. The iters vector contains the number of iterations before divergence, or MaxIters if the trajectory never diverged. You can assign a color ramp to the number of iterations, or, as I've done below, to a log-transformation of the iteration number. I've previously discussed several ways to assign colors to a heat map.

/* Color parameter space by using LOG10(number of iterations) */
numIters = log10( shape(iters, nRe, nIm) ); /* shape iterations into matrix */
mandelbrot = numIters`;                     /* plot num iterations for each parameter */
call heatmapcont(mandelbrot) xValues=Re yValues=Im ColorRamp="Temperature" displayoutlines=0
                             showlegend=0 title="Mandelbrot Set from Vector Computations";
Mandelbrot set computed by using vectorized computations

On my PC, the Mandelbrot computation takes about a second. This is not blazingly fast, but it is faster than the low-memory, non-vectorized, computation. If you care about the fastest speeds, use the DATA step, as shown in Robert's blog post.

I'm certainly not the first person to compute the Mandelbrot set, so what's the point of this exercise? The main purpose of this article is to show how to vectorize an iterative algorithm that performs the same computation many times for different parameter values. Rather than iterate over the set of parameter values and perform millions of scalar computations, you create an array (or matrix) of parameter values and perform a small number of vectorized computations. In most matrix-vector programming languages, a vectorized computation is more efficient than looping over each parameter value. There are few function calls and each call performs high-level computations on large matrices and vectors.

Statistical programmers don't usually have a need to compute fractals, but the ideas in this program apply to time series computations, root-finding, optimization, and any computation where you compute the same quantity for many parameter values. In short, although the Mandelbrot set is fun to compute, the ability to vectorize computations in a matrix language is a skill that rewards you with more than pretty pictures. And fast computations never go out of style.

The post Vectorize the computation of the Mandelbrot set in a matrix language appeared first on The DO Loop.

July 24, 2019
 
Probability densities for Gumbel distributions. The parameter values correspond to the parameters that model the distribution of the maximum value in a sample of size n drawn from the standard normal distribution.

SAS supports more than 25 common probability distributions for the PDF, CDF, QUANTILE, and RAND functions. Of course, there are infinitely many distributions, so not every possible distribution is supported. If you need a less-common distribution, I've shown how to extend the functionality of Base SAS (by using PROC FCMP) or the SAS/IML language by defining your own functions. This article shows how to implement the Gumbel distribution in SAS. The density functions for four parameter values of the Gumbel distribution are shown to the right. As I recently discussed, the Gumbel distribution is used to model the maximum (or minimum) in a sample of size n.

Essential functions for the Gumbel Distribution

If you are going to work with a probability distribution, you need at least four essential functions: The CDF (cumulative distribution), the PDF (probability density), the QUANTILE function (inverse CDF), and a way to generate random values from the distribution. Because the RAND function in SAS supports the Gumbel distribution, you only need to define the CDF, PDF, and QUANTILE functions.

These functions are trivial to implement for the Gumbel distribution because each has an explicit formula. The following list defines each function. To match the documentation for the RAND function and PROC UNIVARIATE (which both support the Gumbel distribution), μ is used for the location parameter and σ is used for the scale parameter:

  • CDF: F(x; μ, σ) = exp(-exp(-z)), where z = (x - μ) / σ
  • PDF: f(x; μ, σ) = exp(-z-exp(-z)) / σ, where z = (x - μ) / σ
  • QUANTILE: F^(-1)(p; μ, σ) = μ - σ log(-log(p)), where LOG is the natural logarithm.

The RAND function in Base SAS and the RANDGEN function in SAS/IML support generating random Gumbel variates. Alternatively, you could generate u ~ U(0,1) and then use the QUANTILE formula to transform u into a random Gumbel variate.
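The following DATA step sketches that alternative: draw u ~ U(0,1) and apply the quantile formula. The parameter values are arbitrary:

data GumbelInvCDF;
call streaminit(54321);
mu = 1.28;  sigma = 0.5;
do i = 1 to 5000;
   u = rand("Uniform");
   x = mu - sigma*log(-log(u));   /* quantile transformation of a uniform variate */
   output;
end;
keep x;
run;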

Using these definitions, you can implement the functions as inline functions, or you can create a function call, such as in the following SAS/IML program:

proc iml;
start CDFGumbel(x, mu, sigma);
   z = (x-mu)/sigma;
   return exp(-exp(-z));                   * CDF of Gumbel distribution;
finish;
 
start PDFGumbel(x, mu, sigma);
   z = (x-mu)/sigma;
   return exp(-z-exp(-z)) / sigma; 
finish;
 
start QuantileGumbel(p, mu, sigma);
   return mu - sigma # log(-log(p));
finish;
 
/* Example: Call the PDFGumbel function */
mu = 3.09;                       /* params for max value of a sample of size */
sigma = 0.286;                   /*       n=1000 from the std normal distrib */
x = do(2, 5, 0.025);
PDF = PDFGumbel(x, mu, sigma);
title "Gumbel PDF";
call series(x, PDF) grid={x y};

The graph shows the density for the Gumbel(3.09, 0.286) distribution, which models the distribution of the maximum value in a sample of size n=1000 drawn from the standard normal distribution. You can see that the maximum value is typically between 2.5 and 4.5, with values near 3 being the most likely.

Plot a family of Gumbel curves

I recently discussed the best way to draw a family of curves in SAS by using PROC SGPLOT. The following DATA step defines the PDF and CDF for a family of Gumbel distributions. You can use this data set to display several curves that indicate how the distribution changes for various values of the parameters.

data Gumbel;
array n[4] _temporary_     (10   100  1000 1E6);   /* sample size */
array mu[4] _temporary_    (1.28 2.33 3.09 4.75);  /* Gumbel location parameter */
array beta [4] _temporary_ (0.5  0.36 0.29 0.2);   /* Gumbel scale parameter */
do i = 1 to dim(beta);                       /* parameters in the outer loop */
   m = mu[i]; b = beta[i];
   Params = catt("mu=", m, "; beta=", b);    /* concatenate parameters */
   do x = 0 to 6 by 0.01;
      z = (x-m)/b;
      CDF = exp( -exp(-z));                  /* CDF for Gumbel */
      PDF = exp(-z - exp(-z)) / b;           /* PDF for Gumbel */
      output;
   end;
end;
drop z;
run;
 
/* Use ODS GRAPHICS / ATTRPRIORITY=NONE 
   if you want to force the line attributes to vary in the HTML destination. */
title "The Gumbel(mu, beta) Cumulative Distribution";
proc sgplot data=Gumbel;
   label cdf="Probability";
   series x=x y=cdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid;  yaxis grid;
run;
 
title "The Gumbel(mu, beta) Probability Density";
proc sgplot data=Gumbel;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid offsetmin=0 offsetmax=0;  yaxis grid;
run;
 
title "The Gumbel(mu, beta) Quantile Function";
proc sgplot data=Gumbel;
   label cdf="Probability";
   series x=cdf y=x / group=Params lineattrs=(thickness=2);
   keylegend / position=right sortorder=reverseauto;
   xaxis grid;  yaxis grid;
run;

The graph for the CDF is shown. A graph for the PDF is shown at the top of this article. I augmented the PDF curve with labels that show the sample size for which the Gumbel distribution is associated. The leftmost curve is the distribution of the maximum in a sample of size n=10; the rightmost curve is for a sample of size one million.

Random values and fitting parameters to data

For completeness, I will mention that you can use the RAND function to generate random values and use PROC UNIVARIATE to fit Gumbel parameters to data. For example, the following DATA step generates 5,000 random variates from Gumbel(1.28, 0.5). The call to PROC UNIVARIATE fits a Gumbel model to the data. The estimates for (μ, σ) are (1.29, 0.51).

data GumbelRand;
call streaminit(12345);
mu = 1.28;
beta = 0.5;
do i = 1 to 5000;
   x = rand("Gumbel", mu, beta);
   output;
end;
keep x;
run;
 
proc univariate data=GumbelRand;
   histogram x / Gumbel;
run;

In summary, although the PDF, CDF, and QUANTILE functions in SAS do not support the Gumbel distribution, it is trivial to implement these functions. This article visualizes the PDF and CDF of the Gumbel distributions for several parameter values. It also shows how to simulate random values from a Gumbel distribution and how to fit the Gumbel model to data.

The post Implement the Gumbel distribution in SAS appeared first on The DO Loop.