
September 16, 2019
 

A moving average is a statistical technique that is used to smooth a time series. My colleague, Cindy Wang, wrote an article about the Hull moving average (HMA), which is a time series smoother that is sometimes used as a technical indicator by stock market traders. Cindy showed how to compute the HMA in SAS Visual Analytics by using a GUI "formula builder" to compute new columns in a data table. This article shows how to implement the Hull moving average in the SAS/IML language. The SAS/IML language is an effective way to implement new (or existing) smoothers for time series because it is easy to use lags and rolling windows to extract values from a time series.

I have previously written about how to compute common moving averages in SAS/ETS (PROC EXPAND) and the DATA step. I have also shown how to compute a weighted moving average (WMA) in the SAS/IML language, which includes simple moving averages and exponential moving averages. This article shows how to implement the Hull moving average formula in SAS by leveraging the WMA function in the previous article.

Every choice of weights in a weighted moving average leads to a different smoother. This article uses weights that are linear in time. Specifically, if Y is a time series, the weighted moving average at time index t is obtained by the formula
WMA(t, n) = (1*Y[t-n+1] + 2*Y[t-n+2] + 3*Y[t-n+3] + ... + n*Y[t]) / (1 + 2 + ... + n)
In general, the divisor in the formula is the sum of whatever weights are used.
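
For concreteness, the following small sketch (my own illustration, not part of the programs below) evaluates the formula at the last time point of a short hypothetical series when n = 3, so the average is (1*Y[t-2] + 2*Y[t-1] + 3*Y[t]) / 6:

proc iml;
y = {1, 3, 2, 5, 4};               /* hypothetical series */
n = 3;                             /* window length */
t = nrow(y);                       /* evaluate the formula at the last time point */
w = colvec(1:n) / sum(1:n);        /* linear weights, standardized to sum to 1 */
WMA_t = sum( w # y[(t-n+1):t] );   /* (1*2 + 2*5 + 3*4)/6 = 4 */
print WMA_t;
quit;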

Although this article focuses on the Hull moving average, the techniques are broadly applicable to other moving average computations.

Hull moving average

Cindy's article contains formulas that show how to compute the Hull moving average (HMA). The HMA is the moving average of a linear combination of two other weighted moving averages, one on a shorter time scale and the other on a longer time scale. Given a time series, Y, choose a lag time, n, which is sometimes called the window length. You can compute the Hull moving average by computing four quantities:

  1. Compute Xshort as the weighted moving average of Y by using a short window of length n/2.
  2. Compute Xlong as the weighted moving average of Y by using the longer window of length n.
  3. Define the series X = 2* Xshort – Xlong.
  4. The Hull moving average of Y is the weighted moving average of X by using a window of length sqrt(n).

The following SAS/IML program implements the Hull moving average. The WMA function is explained in my previous blog post. The HMA function computes the Hull moving average by calling the WMA function three times, twice on Y and once on X. It requires only four statements, one for each computed quantity:

proc iml;
/* Weighted moving average of k data values.
   First k-1 values are assigned the weighted mean of all preceding values.
   Inputs:     y     column vector of length N >= k
               wt    column vector of weights. wt[k] is weight for most recent 
                      data; wt[1] is for k time units ago.  The function 
                     internally standardizes the weights so that sum(wt)=1.
   Example calls: 
   WMA  = WMA(y, 1:5);          is weighted moving average of most recent 5 points
   SMA  = WMA(y, {1 1 1 1 1});  is simple moving average of most recent 5 points
   See https://blogs.sas.com/content/iml/2016/02/03/rolling-statistics-sasiml.html
*/
start WMA(y, wt);
   w = colvec(wt) / sum(wt);         /* standardize weights so that sum(w)=1 */
   k = nrow(w);                      /* length of lag */
   MA = j(nrow(y), 1, .);
   /* handle first k values separately */
   do i = 1 to k-1;
      wIdx = (k-i+1):k;              /* index for previous i weights */
      MA[i] = sum(wt[wIdx]#y[1:i]) / sum(wt[wIdx]);  /* weighted average */
   end;
   /* main computation: average of current and previous k-1 data values */
   do i = k to nrow(y);
      idx = (i-k+1):i;               /* rolling window of k data points */
      MA[i] = sum( w#y[idx] );       /* weighted sum of k data values */
   end;
   return ( MA );
finish;
 
/* Hull moving average HMA(y, nLag) at each time point. All computations
   use linear weights 1:k for various choices of k. */
start HMA(y, nLag);
   Xshort = WMA(y, 1:round(nLag/2)); /* short smoother (nLag/2) */
   Xlong = WMA(y, 1:nLag);           /* longer smoother (nLag) */
   X = 2*Xshort - Xlong;             /* combination of smoothers */
   HMA = WMA(X, 1:round(sqrt(nLag)));/* Hull moving average (length sqrt(nLag)) */
   return HMA;
finish;
 
/* test on simple time series */
y = T({0 1 0 1 3 2 3 3});
nLag = 4;
WMA = WMA(y, 1:nLag);  
HullMA = HMA( y, nLag );  
print y WMA HullMA;
Hull moving average compared with weighted moving average

The table shows the smoothed values for the weighted moving average and the Hull moving average applied to a simple time series with eight observations. As a general rule, the Hull moving average tends to be smoother than a raw weighted moving average, and for a given lag parameter it responds more quickly to changes in Y.

However, it is important to realize that the HMA is not a smoother of Y. Rather, it smooths a linear combination of other smoothers. Consequently, the value of the HMA at time t can be outside of the range of the series {Y1, Y2, ..., Yt}. This is seen in the last observation in the table. The HMA has the value 3.08 even though the range of Y is [0, 3]. This is shown in the following graph, which plots the series, the weighted moving average, and the Hull moving average.

Graph of Hull moving average and weighted moving average applied to an example time series

The Hull moving average applied to stock data

This section reproduces the graphs in Cindy's blog post by applying the Hull moving average to the monthly closing price of IBM stock. The following statements compute three different moving averages (Hull, simple, and weighted) and use PROC SGPLOT to overlay the simple and Hull moving averages on a scatter plot that shows the closing price and a high-low plot that shows the range of stock values for each month.

use Sashelp.Stocks where(stock='IBM');       /* read stock data for IBM */
read all var {'Close'} into y; close;
 
nLag = 5;
SMA = WMA(y, j(1,nLag,1));                   /* simple MA */
WMA = WMA(y, 1:nLag);                        /* weighted MA */
HMA = HMA(y, nLag);                          /* Hull MA */
create out var {SMA WMA HMA}; append; close; /* write moving averages to SAS data set */
QUIT;
 
data All;                                    /* merge data and moving averages */
   merge Sashelp.Stocks(where=(stock='IBM')) out;
run;
 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=All noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;
Hull moving average and simple moving average applied to IBM stock price (1986-2006)

It is well known that the simple moving average (green line) lags the data. As Cindy points out, the Hull moving average (orange line) responds more quickly to changes in the data. She compares the two smoothers in the months after June 1992 to emphasize that the HMA is closer to the data than the simple moving average.

In summary, this article shows how to implement a custom time series smoother in SAS. The example uses the Hull moving average and reproduces the computations and graphs shown in Cindy's article. The techniques apply to any custom time series smoother and time series computation. The SAS/IML language makes it easy to compute rolling-window computations because you can easily access segments of a time series.

The post The Hull moving average: Implement a custom time series smoother in SAS appeared first on The DO Loop.

September 5, 2019
 

When you order an item online, the website often recommends other items based on your purchase. In fact, these kinds of "recommendation engines" contributed to the early success of companies like Amazon and Netflix. SAS uses a recommender engine to suggest articles on the SAS Support Communities. Although recommender engines use many techniques, one technique that estimates the similarity of items is the cosine similarity. You can use the cosine similarity to compare songs, documents, articles, recipes, and more.

This blog post demonstrates how to compute and visualize the similarities among recipes. For simplicity, only the ingredients in the recipes are used. You can extend the example by using the quantities of ingredients or by including other variables that describe the recipe, such as how the recipe is cooked (skillet, oven, grill, etc.).

The data: Ingredients for recipes

I looked up the ingredients for six recipes: spaghetti sauce, a spaghetti sauce with meat, an Italian eggplant relish (called caponata), a creole sauce, a salsa, and an enchilada sauce. Three recipes are Italian, one is creole, and two are Mexican. Four are sauces and two (salsa and eggplant relish) are typically appetizers. All have tomatoes (fresh, canned, or paste). The following DATA step defines the ingredients in each recipe by using binary indicator variables. The 34 indicator variables are the ingredients (tomato, garlic, salt, and so forth) and the six rows are the recipes.

data recipes;
   input Recipe $ 1-20
      (Tomato Garlic Salt Onion TomatoPaste OliveOil Celery Broth 
       GreenPepper Cumin Flour BrownSugar BayLeaf GroundBeef 
       BlackPepper ChiliPowder Cilantro Carrot CayennePepper Oregano 
       Oil Parsley PorkSausage RedPepper Paprika Thyme Tomatillo 
       JalapenoPepper WorcestershireSauce Lime
       Eggplant GreenOlives Capers Sugar) (1.);
datalines;
Spag Sauce          1111110000000000000101000000000000
Spag Meat Sauce     1111111010001100010000110000000000
Eggplant Relish     0111110000000000000000000000001111
Creole Sauce        1011111110000010000000001100100000
Salsa               1111000000000000100000000011010000
Enchilada Sauce     1000000101110001001010000000000000
;

If you look carefully at the Eggplant Relish and Enchilada Sauce rows, you will see that they have no ingredients in common. Therefore, those recipes should be the most dissimilar.

The cosine similarity of recipes

In a previous article, I showed that you can use PROC DISTANCE in SAS to compute the cosine similarity of rows. Alternatively, you can use the SAS/IML language to define a function that computes the cosine similarity. The following SAS/IML program loads the function and creates a heat map of the similarity matrix for these six recipes:

proc iml;
use Recipes; read all var _NUM_ into X[c=varNames r=Recipe]; close;
 
load module=(CosSimRows);     /* load module from previous post */
cosSim = CosSimRows(X);
Colors = palette('BRBG', 7);  /* brown-blue-green color ramp */
call heatmapcont(cosSim) xvalues=Recipe yvalues=Recipe range={0 1} 
            colorramp=Colors title="Cosine Similarity of Recipes";
print CosSim[F=4.2 r=Recipe c=Recipe];

The heat map and printed table show the similarity scores for the six recipes. The following list summarizes the similarity matrix, which is based solely on the presence or absence of ingredients:

  • The ingredients in the spaghetti sauce recipe are most similar to the meat sauce and the eggplant relish. They are slightly less similar to the creole sauce and salsa. They are not similar to the enchilada sauce ingredients.
  • The ingredients in the spaghetti meat sauce are most similar to the spaghetti sauce and the creole sauce. They are slightly less similar to the eggplant relish and salsa. They are not similar to the enchilada sauce ingredients.
  • The ingredients in the eggplant relish are similar to both spaghetti sauces. The recipe has zero similarity to the enchilada sauce recipe because they do not share any common ingredients.
  • The ingredients in the creole sauce are similar to both spaghetti sauces.
  • The salsa is most similar to the spaghetti sauce.
  • The enchilada sauce recipe is not similar to any of the other recipes.

Of course, these results depend on what ingredients you use for the recipes. When I planned this study, I expected the salsa and enchilada sauce to be similar, but it turned out that the recipes had only one ingredient in common (tomatoes).

If you do not have very many items to compare, you can use a bar chart to visualize the pairwise similarities. The result is shown below. Even if you have too many items to display the full set of pairwise similarities, you could use a WHERE clause (such as WHERE CosSim > 0.5) to see only the most similar pairs of items.

The bar chart is sorted by the cosine similarity, so it is easy to see the very similar and very dissimilar pairs.
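
The article does not show the statements that create the bar chart, but the following sketch shows one way to do it. It continues the PROC IML session above (so cosSim and Recipe are still defined), writes the pairwise similarities to a data set that I call Pairs (a name of my own choosing), and plots them with PROC SGPLOT:

/* sketch: write the pairwise similarities to a data set and plot them */
n = nrow(cosSim);
Pair = j(n*(n-1)/2, 1, BlankStr(50));           /* labels for the pairs */
Sim  = j(n*(n-1)/2, 1, .);                      /* similarity for each pair */
k = 0;
do i = 1 to n-1;
   do j = i+1 to n;
      k = k + 1;
      Pair[k] = strip(Recipe[i]) + " & " + strip(Recipe[j]);
      Sim[k]  = cosSim[i,j];
   end;
end;
create Pairs var {Pair Sim}; append; close;
quit;

title "Cosine Similarity of Pairs of Recipes";
proc sgplot data=Pairs;
   hbar Pair / response=Sim categoryorder=respdesc;
   xaxis grid label="Cosine Similarity";
run;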

Conclusions

A recommendation engine can use calculations like this to suggest additional recipes that are similar to those that you like. Did you like the spaghetti sauce recipe? Try the eggplant relish! Replace "recipe" with "movie," "book," or "product" and you begin to see how recommender engines work.

An advantage of the cosine similarity is that it preserves the sparsity of the data matrix. The data matrix for these recipes has 204 cells, but only 58 (28%) of the cells are nonzero. If you add additional recipes, the number of variables (the union of the ingredients) might climb into the hundreds, but a typical recipe has only a dozen ingredients, so most of the cells in the data matrix are zero. When you compare two recipes, only the common ingredients contribute to the cosine similarity score. This means that you can compute the cosine similarity very efficiently, and it requires making only a single pass through the data.

You can also compute the cosine similarity of the columns, which tells you which pairs of ingredients appear together in recipes. As you might imagine, olive oil, garlic, and onions are similar to each other because they often appear in recipes together. Cumin and green olive? Not very similar.

You can download the SAS program that computes the cosine similarities and creates the graphs in this article.

The post Use cosine similarity to make recommendations appeared first on The DO Loop.

September 3, 2019
 

An important application of the dot product (inner product) of two vectors is to determine the angle between the vectors. If u and v are two vectors, then
cos(θ) = (u ⋅ v) / (|u| |v|)
You could apply the inverse cosine function if you wanted to find θ in [0, π], but since the cosine function is a monotonic (decreasing) transformation on [0, π], it is usually sufficient to know the cosine of the angle between two vectors. The expression (u ⋅ v) / (|u| |v|) is called the cosine similarity between the vectors u and v. It is a value in [-1, 1]. This article discusses the cosine similarity, why it is useful, and how you can compute it in SAS.
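
As a quick sketch (my own example, not from the article), the following SAS/IML statements apply the formula to two small vectors. The SSQ function computes the sum of squares, so SQRT(SSQ(u)) is the Euclidean norm |u|:

proc iml;
u = {1, 2, 2};
v = {2, 1, 0};
cosSim = (u` * v) / ( sqrt(ssq(u)) * sqrt(ssq(v)) );  /* (u . v) / (|u| |v|) */
print cosSim;                                         /* 4 / (3*sqrt(5)) = 0.596 */
quit;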

What is the cosine similarity?

For multivariate numeric data, you can compute the cosine similarity of the rows or of the columns. The cosine similarity of the rows tells you which subjects are similar to each other. The cosine similarity of the columns tells you which variables are similar to each other. For example, if you have clinical data, the rows might represent patients and the variables might represent measurements such as blood pressure, cholesterol, body-mass index, and so forth. The row similarity compares patients to each other; the column similarity compares vectors of measurements.

Vectors that are very similar to each other have a cosine similarity that is close to 1. Vectors that are nearly orthogonal have a cosine similarity near 0. Vectors that point in opposite directions have a cosine similarity of –1. However, in practice, the cosine similarity is often used on vectors that have nonnegative values. For those vectors, the angle between them is never more than 90 degrees and so the cosine similarity is between 0 and 1.

To illustrate how to compute the cosine similarity in SAS, the following statements create a simple data set that has four observations and two variables. The four row vectors are plotted:

data Vectors;
length Name $1;
input Name x y;
datalines;
A  0.5 1
B  3   5
C  3   2.8
D  5   1
;
 
ods graphics / width=400px height=400px;
title "Four Row Vectors";
proc sgplot data=Vectors aspect=1;
   vector x=x y=y / datalabel=Name datalabelattrs=(size=14);
   xaxis grid;  yaxis grid;
run;
Four vectors. Find the cosine similarity.

The cosine similarity of observations

If you look at the previous graph of vectors and think that vector A is unlike the other vectors, then you are using the magnitude (length) of the vectors to form that opinion. The cosine similarity does not use the magnitude of the vectors to decide which vectors are alike. Instead, it uses only the direction of the vectors. You can visualize the cosine similarity by plotting the normalized vectors that have unit length, as shown in the next graph.
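
The article does not include the statements that create the graph of normalized vectors, but a sketch follows. It divides each vector by its Euclidean length before plotting. The data set name (Unit) and variable names (ux, uy) are my own:

data Unit;
   set Vectors;
   len = sqrt(x**2 + y**2);      /* Euclidean length of each row vector */
   ux = x / len;                 /* components of the unit vector */
   uy = y / len;
run;

title "Four Normalized Vectors";
proc sgplot data=Unit aspect=1;
   vector x=ux y=uy / datalabel=Name datalabelattrs=(size=14);
   xaxis grid;  yaxis grid;
run;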

Four normalized vectors. The cosine similarity is the angle between these vectors.

The second graph shows the vectors that are used to compute the cosine similarity. For these vectors, A and B are most similar to each other. Vector C is more similar to B than to D, and D is least similar to the others. You can use PROC DISTANCE and the METHOD=COSINE option to compute the cosine similarity between observations. If your data set has N observations, the result of PROC DISTANCE is an N x N matrix of cosine similarity values. The (i,j)th value is the similarity between the i-th vector and the j-th vector.

proc distance data=Vectors out=Cos method=COSINE shape=square;
   var ratio(_NUMERIC_);
   id Name;
run;
proc print data=Cos noobs; run;
Cosine similarity between four vectors

The results of the DISTANCE procedure confirm what we already knew from the geometry. Namely, A and B are most similar to each other (cosine similarity of 0.997), C is more similar to B (0.96) than to D (0.85), and D is not very similar to the other vectors (similarities range from 0.61 to 0.85).

Notice that the cosine similarity is not a linear function of the angle between vectors. The angle between vectors B and A is 4 degrees, which has a cosine of 0.997. The angle between vectors B and C is almost four times as big (16 degrees) but the cosine similarity is 0.961, which is not very different from 0.997 (and certainly not four times as small). Notice also that the graph of the cosine function is flat near θ=0. Consequently, the cosine similarity does not vary much between the vectors in this example.

Compute cosine similarity in SAS/IML

SAS/STAT does not include a procedure that computes the cosine similarity of variables, but you can use the SAS/IML language to compute both row and column similarity. The following PROC IML statements define functions that compute the cosine similarity for rows and for columns. To support missing values in the data, the functions use listwise deletion to remove any rows that contain a missing value. All functions are saved for future use.

/* Compute cosine similarity matrices in SAS/IML */
proc iml;
/* exclude any row with a missing value */
start ExtractCompleteCases(X);
   if all(X ^= .) then return X;
   idx = loc(countmiss(X, "row")=0);
   if ncol(idx)>0 then return( X[idx, ] );
   else                return( {} ); 
finish;
 
/* compute the cosine similarity of columns (variables) */
start CosSimCols(X, checkForMissing=1);
   if checkForMissing then do;    /* by default, check for missing and exclude */
      Z = ExtractCompleteCases(X);
      Y = Z / sqrt(Z[##,]);       /* stdize each column */
   end;
      else Y = X / sqrt(X[##,]);  /* skip the check if you know all values are valid */
   cosY = Y` * Y;                 /* pairwise inner products */
   /* because of finite precision, elements could be 1+eps or -1-eps */
   idx = loc(cosY> 1); if ncol(idx)>0 then cosY[idx]= 1; 
   idx = loc(cosY<-1); if ncol(idx)>0 then cosY[idx]=-1; 
   return cosY;
finish;
 
/* compute the cosine similarity of rows (observations) */
start CosSimRows(X);
   Z = ExtractCompleteCases(X);  /* check for missing and exclude */
   return T(CosSimCols(Z`, 0));  /* transpose and call CosSimCols */
finish;
store module=(ExtractCompleteCases CosSimCols CosSimRows);

Visualize the cosine similarity matrix

When you compare k vectors, the cosine similarity matrix is k x k. When k is larger than 5, you probably want to visualize the similarity matrix by using heat maps. The following DATA step extracts two subsets of vehicles from the Sashelp.Cars data set. The first subset contains vehicles that have weak engines (low horsepower) whereas the second subset contains vehicles that have powerful engines (high horsepower). You can use a heat map to visualize the cosine similarity matrix between these vehicles:

data Vehicles;
   set sashelp.cars(where=(Origin='USA'));
   if Horsepower < 140 OR Horsepower >= 310;
run;
proc sort data=Vehicles; by Type; run;  /* sort obs by vehicle type */
 
ods graphics / reset;
proc iml;
load module=(CosSimCols CosSimRows);
use Vehicles;
read all var _NUM_ into X[c=varNames r=Model]; 
read all var {Model Type}; close;
 
labl = compress(Type + ":" + substr(Model, 1, 10));
cosRY = CosSimRows(X);
call heatmapcont(cosRY) xvalues=labl yvalues=labl title="Cosine Similarity between Vehicles";
Cosine similarity between attributes of 20 vehicles

The heat map shows the cosine similarities of the attributes of 20 vehicles. Vehicles of a specific type (SUV, sedan, sports car,...) tend to be similar to each other and different from vehicles of other types. The four sports cars stand out. They are very dissimilar to the sedans, and they are also dissimilar to the SUVs and wagons.

Notice the small range for the cosine similarity values. Even the most dissimilar vehicles have a cosine similarity of 0.99.

Cosine similarity of columns

You can treat each row of the data as a vector of dimension p. Similarly (no pun intended!), you can treat each column as a vector of length N. You can use the CosSimCols function, defined in the previous section, to compute the cosine similarity matrix of numerical columns. The math is the same but is applied to the transpose of the data matrix. To demonstrate the function, the following statements compute and visualize the column similarity for the Vehicles data. There are 10 numerical variables in the data.

cosCY = CosSimCols(X);
call heatmapcont(cosCY) xvalues=varNames yvalues=varNames title="Cosine of Angle Between Variables";
Cosine similarity between variables in the Vehicles data set

The heat map for the columns is shown. You can see that the MPG_City and MPG_Highway variables are dissimilar to most other variables, but similar to each other. Other sets of similar variables include the variables that measure cost (MSRP and Invoice), the variables that measure power (EngineSize, Cylinders, Horsepower), and the variables that measure size (Wheelbase, Length, Weight).

The connection between cosine similarity and correlation

The similarity matrix of the variables shows which variables are similar and dissimilar. In that sense, the matrix might remind you of a correlation matrix. However, there is an important difference: The correlation matrix displays the pairwise inner products of centered variables. The cosine similarity does not center the variables. Although the correlation is scale-invariant and affine invariant, the cosine similarity is not affine invariant: If you add or subtract a constant from a variable, its cosine similarity with other variables will change.

When the data represent positive quantities (as in the Vehicles data), the cosine similarity between two vectors can never be negative. For example, consider the relationship between the fuel economy variables (MPG_City and MPG_Highway) and the "power" and "size" variables. Intuitively, the fuel economy is negatively correlated with power and size (for example, Horsepower and Weight). However, the cosine similarity is positive, which shows another difference between correlation and cosine similarity.
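
You can make the connection concrete by centering the columns before computing the cosine similarity. The following sketch (my addition) continues the PROC IML session for the Vehicles example and assumes that X contains no missing values; the cosine similarity of the centered columns equals the Pearson correlation matrix:

/* sketch: cosine similarity of centered columns equals correlation */
Z = X - mean(X);                 /* subtract the column means */
cosCentered = CosSimCols(Z);     /* cosine similarity of centered variables */
corrX = corr(X);                 /* Pearson correlation matrix */
maxDiff = max(abs(cosCentered - corrX));
print maxDiff;                   /* should be essentially zero */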

Why is the cosine similarity useful?

The fact that the cosine similarity does not center the data is its biggest strength (and also its biggest weakness). The cosine similarity is most often used in applications in which the variables represent counts or indicator variables. Often, each row represents a document such as a recipe, a book, or a song. The columns indicate an attribute such as an ingredient, word, or musical technique.

For text documents, the number of columns (which represent important words) can be in the thousands. The goal is to discover which documents are structurally similar to each other, perhaps because they share similar content. Recommender engines can use the cosine similarity to suggest new books or articles that are similar to one that was previously read. The same technique can recommend similar recipes or similar songs.

The advantage of the cosine similarity is its speed and the fact that it is very useful for sparse data. In a future article, I'll provide a simple example that demonstrates how you can use the cosine similarity to find recipes that are similar or dissimilar to each other.

The post Cosine similarity of vectors appeared first on The DO Loop.

August 26, 2019
 

Most SAS programmers know how to use PROC APPEND or the SET statement in the DATA step to unconditionally append new observations to an existing data set. However, sometimes you need to scan the data to determine whether or not to append observations. In this situation, many SAS programmers choose one of the following methods:

  • Inside a DATA step, use the SYMPUT call to create a macro variable that indicates whether to append observations. After the DATA step ends, use %IF-%THEN processing to check the value of the macro variable and conditionally append the observations. (A sketch of this approach appears after this list.)
  • Use the DATA step to determine whether to append data and append data in the same DATA step. This is especially useful if the values for the new observations depend on the data that you scanned.
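
For comparison, the following sketch outlines the first approach. The data set names (Have, NewObs), variable names, and the condition being checked are all hypothetical; the main ideas are the CALL SYMPUTX routine and the %IF-%THEN logic:

data _null_;                      /* scan the data and set a macro variable */
   set Have end=EOF;
   retain Found 0;
   if Type = 'HighVal' then Found = 1;            /* condition to check */
   if EOF then call symputx('needAppend', ^Found);
run;

%macro AppendIfNeeded;            /* conditionally append prepared observations */
   %if &needAppend %then %do;
      proc append base=Have data=NewObs; run;
   %end;
%mend;
%AppendIfNeeded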

This article shows the second method. It shows how to use the SAS DATA step to scan through observations and remember certain values. If a condition is met, it uses the values to append new observations to the end of the data by using end-of-file processing.

A motivating example

Often SAS programmers need to implement complicated data-dependent logic. A simple example is "If the XYZ variable contains a certain value but doesn't contain a different value, then do something." On the SAS discussion forums, the experts often suggest scanning through the data with a DATA step and keeping one or more "flag variables" that indicate which conditions have been satisfied. At the end of the DATA step, you can look at the values of the flag variables to determine what action to take.

Last week I encountered a situation where I needed to conditionally append observations to input data. Although the solution is easy if you use the SAS/IML language (which enables you to scan an entire vector of values), I needed to solve the problem by using the DATA step, which processes only one observation at a time. The problem had the following form:

  • The data are in long form. The character variable TYPE always contains the values 'Min' and 'Max'. The numeric variable VALUE contains numbers.
  • The TYPE variable might or might not contain the values 'LowVal' and 'HighVal'. If they do appear, they always appear after the 'Min' and 'Max' values.

The goal is to create an output data set that always contains the four values 'Min', 'Max', 'LowVal', and 'HighVal'. This goal is summarized by the figure to the right. The following list describes how to generate the 'LowVal' and 'HighVal' observations if they do not exist.

  • If TYPE='HighVal' does not appear in the data, create it and copy the value of the TYPE='Max' observation to use for the VALUE variable.
  • Similarly, if TYPE='LowVal' does not appear in the data, create it. Copy the value of the TYPE='Min' observation to use for the VALUE variable.

The figure shows the four situations that can occur. The input data set always contains the 'Min' and 'Max' values but can contain none, one, or two of the other values. The goal is to produce the data set on the right, which always contains all four values. The next section presents a solution, so stop reading here if you want to solve the problem on your own!

The END= option and end-of-file processing

To solve the problems, I used two facts about the SAS DATA step:

  1. You can use the END= option on the SET statement to create a temporary binary indicator variable that has the value 1 for only the last observation of the input data.
  2. The SAS DATA step contains an implicit loop over all observations in the input data. If you do not use an OUTPUT statement, the DATA step performs an implicit output for each observation. However, if the program contains an OUTPUT statement anywhere, the implicit output is disabled. Therefore, if you use an OUTPUT statement, you must explicitly use OUTPUT statements everywhere that you want to write an observation to the output data set.

The following program scans the input data. It remembers the values of the 'Min' and 'Max' observations, in case it needs them. It uses indicator variables to determine whether the data contains the 'LowVal' and 'HighVal' observations. After the input data are read, the program uses an end-of-file indicator variable (EOF) to determine whether or not to add observations for 'LowVal' and 'HighVal'.

Because the program uses an OUTPUT statement to (conditionally) create new observations, you must also put an OUTPUT statement after the SET statement to ensure that the original observations are all written.

/* Include all 4 test cases. Use WHERE clause to test each case. */
data Test;
length Type $7;
input Group Type $ Value;
datalines;
1 Min      -3
1 Max       3
1 LowVal   -2
1 HighVal   2
2 Min      -3
2 Max       3
2 HighVal   2
3 Min      -3
3 Max       3
3 LowVal   -2
4 Min      -3
4 Max       3
;
 
/* Input order is always 'Min' and 'Max' optionally followed by 
   'LowVal' and 'HighVal', if they exist. */
%let dsname = Test(where=(Group=4));    /* use 1,2,3,4 to test all cases */
 
data Want;
drop HighFound LowFound Min Max;   /* temporary variables */
retain HighFound LowFound 0        /* binary indicator variables: Initialize to 0 (false) */
       Min Max .;                  /* VALUE of 'Min' and 'Max' obs: Initialize to missing */
set &dsname end=EOF;               /* EOF is temporary indicator variable */
output;                            /* need OUTPUT because of EOF processing */
if Type = 'Min' then 
   min = Value;              /* remember the Min value */
else if Type = 'Max' then
   max = Value;              /* remember the Max value */
else if Type = 'LowVal' then 
   LowFound = 1;             /* Low value found; no need to create it */
else if Type = 'HighVal' then 
   HighFound = 1;            /* High value found; no need to create it */
 
/* end-of-file processing: conditionally append new observations */
if EOF then do;
   if ^LowFound then do;     /* Low value not found. Add it. */
      Type = "LowVal"; Value = Min; output;
   end;
   if ^HighFound then do;    /* High value not found. Add it. */
      Type = "HighVal"; Value = Max; output;
   end;
end;
run;
 
proc print data=Want; 
   var Group Type Value;
run;

The result is shown for input data that contains only the 'Min' and 'Max' observations but not the 'LowVal' or 'HighVal' observations. The output shows that the 'LowVal' or 'HighVal' observations were correctly appended to the input data, and that values for the VALUE column were copied from the 'Min' and 'Max' observations, respectively. You can verify that the other three input data sets are also correctly handled.

Use caution with the DELETE and subsetting IF statements

When performing end-of-file processing, be careful if you use a DELETE statement or a subsetting IF statement. For details and examples, see "The Perils of End-of-File Processing when Subsetting Data" (Landry, 2009). Landry summarizes the problem as follows (p. 2): "The problem occurs only when the last record in the dataset is deleted.... The reason this happens is that when a record is deleted..., SAS stops processing and returns to the next iteration of the DATA step. Thus, any executable statements placed after the [DELETE or subsetting IF statements] do not get executed."

Summary

In summary, this article shows how to use the SAS DATA step to conditionally add observations to the end of the input data. This is useful for data-dependent logic when the observations that you need to append depend on the values of the data. You can perform end-of-file processing by using the END= option on the SET statement to create an indicator variable that has the value 1 for the last observation in the input data. You can use the OUTPUT statement to append additional observations, but remember that you also need to use the OUTPUT statement after the SET statement if you want to output the original data.

Do you have a favorite way to conditionally append data? Do you know of other potential pitfalls with end-of-file processing? Leave a comment.

The post Conditionally append observations to a SAS data set appeared first on The DO Loop.

August 21, 2019
 

An important application of nonlinear optimization is finding parameters of a model that fit data. For some models, the parameters are constrained by the data. A canonical example is the maximum likelihood estimation of a so-called "threshold parameter" for the three-parameter lognormal distribution. For this distribution, the objective function is the log-likelihood function, which includes a term that looks like log(xi - θ), where xi is the i-th data value and θ is the threshold parameter. Notice that you cannot evaluate the objective function when θ is greater than or equal to the minimum data value because the argument of the logarithm is not positive. Similar situations occur for other distributions (such as the Weibull) that contain threshold parameters. The domain of the objective function is restricted by the data.

In SAS, you can fit these three-parameter distributions by using PROC UNIVARIATE. However, it is instructive to fit the lognormal distribution "manually" because it illustrates the general problem, which is how to handle optimization for which parameters depend on the data. This article provides two tips. First, you can use boundary constraints to ensure that the optimal solution enforces the condition θ < min(x). Second, you can define the objective function so that it does not attempt to evaluate expressions like log(xi - θ) if the argument to the LOG function is not positive.

Fit the three-parameter lognormal distribution in SAS

Before discussing the details of optimization, let's look at some data and the MLE parameter estimates for the three-parameter lognormal distribution. The following example is from the PROC UNIVARIATE documentation. The data are measurements of 50 diameters of steel rods that were produced in a factory. The company wants to model the distribution of rod diameters.

data Measures;
   input Diameter @@;
   label Diameter = 'Diameter (mm)';
   datalines;
5.501  5.251  5.404  5.366  5.445  5.576  5.607  5.200  5.977  5.177
5.332  5.399  5.661  5.512  5.252  5.404  5.739  5.525  5.160  5.410
5.823  5.376  5.202  5.470  5.410  5.394  5.146  5.244  5.309  5.480
5.388  5.399  5.360  5.368  5.394  5.248  5.409  5.304  6.239  5.781
5.247  5.907  5.208  5.143  5.304  5.603  5.164  5.209  5.475  5.223
;
 
title 'Lognormal Fit for Diameters';
proc univariate data=Measures noprint;
   histogram Diameter / lognormal(scale=est shape=est threshold=est) odstitle=title;
   ods select ParameterEstimates Histogram;
run;

The UNIVARIATE procedure overlays the MLE fit on a histogram of the data. The ParameterEstimates table shows the estimates. The threshold parameter (θ) is the troublesome parameter when you fit models like this. The other two parameters ("Zeta," which I will call μ in the subsequent sections, and σ) are easier to fit.

An important thing to note is that the estimate for the threshold parameter is 5.069. If you look at the data (or use PROC MEANS), you will see that the minimum data value is 5.143. To be feasible, the value of the threshold parameter must always be less than 5.143 during the optimization process. The log-likelihood function is undefined if θ ≥ min(x).

Tip 1: Constrain the domain of the objective function

Although PROC UNIVARIATE automatically fits the three-parameter lognormal distribution, it is instructive to explicitly write the log-likelihood function and solve the optimization problem "manually." Given N observations {x1, x2, ..., xN}, you can use maximum likelihood estimation (MLE) to fit a three-parameter lognormal distribution to the data. I've previously written about how to use the optimization functions in SAS/IML software to carry out maximum likelihood estimation. The following SAS/IML statements define the log-likelihood function:

proc iml;
start LN_LL1(parm) global (x);
   mu = parm[1]; sigma = parm[2]; theta = parm[3];
   n = nrow(x);
   LL1 = -n*log(sigma) - n*log(sqrt(2*constant('pi')));
   LL2 = -sum( log(x-theta) );                         /* not defined if theta >= min(x) */
   LL3 = -sum( (log(x-theta)-mu)##2 ) / (2*sigma##2);
   return LL1 + LL2 + LL3;
finish;

Recall that for MLE problems, the data are constant values that are known before the optimization begins. The previous function uses the GLOBAL statement to access the data vector, x. The log-likelihood function contains three terms, two of which (LL2 and LL3) are undefined for θ ≥ min(x). Most optimization software enables you to specify constraints on the parameters. In the SAS/IML matrix language, you can define a two-row matrix where the first row indicates lower bounds on the parameters and the second row indicates upper bounds. For this problem, the upper bound for the θ parameter is set to min(x) – ε, where ε is a small number so that the quantity (xi - θ) is always bounded away from 0:

use Measures; read all var "Diameter" into x; close;  /* read data */
/* Parameters of the lognormal(mu, sigma, theta):
      mu   sigma>0  theta<min(x)  */   
con = {.    1e-6      .,           /* lower bounds: sigma > 0 */
       .    .         .};          /* upper bounds */
con[2,3] = min(x) - 1e-6;          /* replace upper bound for theta */
 
/* optimize the log-likelihood function */
x0 = {0 0.5 4};           /* initial guess */
opt = {1,                 /* find maximum of function   */
       0};                /* do not display information about optimization */
call nlpnra(rc, result, "LN_LL1", x0, opt) BLC=con;   /* Newton's method with boundary constraints */
print result[c={'mu' 'sigma' 'theta'} L="Optimum"];

The results are very close to the estimates that were produced by PROC UNIVARIATE. Because of the boundary constraints, the optimization process never evaluates the objective function outside of the feasible domain, as determined by the boundary constraints in the CON matrix.

Tip 2: Trap domain errors in the objective function

Ideally, the log-likelihood function should be robust, meaning that it should be able to handle inputs that are outside of the domain of the function. This enables you to use the objective function in many ways, including visualizing the feasible region and solving problems that have nonlinear constraints. When you have nonlinear constraints, some algorithms might evaluate the function in the infeasible region, so it is important that the objective function does not experience a domain error.

The best way to make the log-likelihood function robust is to trap domain errors. But what value should the software return if an input value is outside the domain of the function?

  • If your optimization routine can handle missing values, return a missing value. This is the best mathematical choice because the function cannot be evaluated.
  • If your optimization routine does not handle missing values, return a very large value. If you are maximizing the function, return a large negative value such as -1E20. If you are minimizing the function, return a large positive value, such as 1E20. Essentially, this adds a penalty term to the objective function. The penalty is zero when the input is inside the domain and has a large magnitude when the input is outside the domain. (A sketch of this approach appears after this list.)
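
For example, the following sketch (my addition; the article itself uses missing values, as shown next) is a penalized version of the log-likelihood that returns a large negative value when the input is out of the domain. Because this is a maximization problem, the penalty is a large negative number:

/* sketch: penalized objective function for optimizers that cannot handle missing values */
start LN_LL_Pen(parm) global (x);
   mu = parm[1]; sigma = parm[2]; theta = parm[3];
   n = nrow(x);
   if theta >= min(x) then return( -1E20 );   /* large negative penalty for out-of-domain input */
   LL1 = -n*log(sigma) - n*log(sqrt(2*constant('pi')));
   LL2 = -sum( log(x-theta) );
   LL3 = -sum( (log(x-theta)-mu)##2 ) / (2*sigma##2);
   return LL1 + LL2 + LL3;
finish;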

The SAS/IML optimization routines support missing values, so the following module redefines the objective function to return a missing value when θ ≥ min(x):

/* Better: Trap out-of-domain errors to ensure that the objective function NEVER fails */
start LN_LL(parm) global (x);
   mu = parm[1]; sigma = parm[2]; theta = parm[3];
   n = nrow(x);
 
   if theta >= min(x) then return .;    /* Trap out-of-domain errors */
 
   LL1 = -n*log(sigma) - n*log(sqrt(2*constant('pi')));
   LL2 = -sum( log(x-theta) );
   LL3 = -sum( (log(x-theta)-mu)##2 ) / (2*sigma##2);
   return LL1 + LL2 + LL3;
finish;

This modified function can be used for many purposes, including optimization with nonlinear constraints, visualization, and so forth.

In summary, this article provides two tips for optimizing a function that has a restricted domain. Often this situation occurs when one of the parameters depends on data, such as the threshold parameter in the lognormal and Weibull distributions. Because the data are constant values that are known before the optimization begins, you can write constraints that depend on the data. You can also access the data during the optimization process to ensure that the objective function does not ever evaluate input values that are outside its domain. If your optimization routine supports missing values, the objective function can return a missing value for invalid input values. Otherwise, the function can return a very large (positive or negative) value for out-of-domain inputs.

The post Two tips for optimizing a function that has a restricted domain appeared first on The DO Loop.

August 19, 2019
 

One of my friends likes to remind me that "there is no such thing as a free lunch," which he abbreviates by "TINSTAAFL" (or TANSTAAFL). The TINSTAAFL principle applies to computer programming because you often end up paying a cost (in performance) when you call a convenience function that simplifies your program.

I was thinking about TINSTAAFL recently when I was calling a Base SAS function from the SAS/IML matrix language. The SAS/IML language supports hundreds of built-in functions that operate on vectors and matrices. However, you can also call hundreds of functions in Base SAS and pass in vectors for the parameters. It is awesome and convenient to be able to call the virtual smorgasbord of functions in Base SAS, such as probability functions, string-matching functions, trig functions, financial functions, and more. Of course, there is no such thing as a free lunch, so I wondered about the overhead costs associated with calling a Base SAS function from SAS/IML. Base SAS functions typically are designed to operate on scalar values, so the IML language has to call the underlying function many times, once for each value of the parameter vector. It is more expensive to call a function a million times (each time passing in a scalar parameter) than it is to call a function one time and pass in a vector that contains a million parameters.

To determine the overhead costs, I decided to test the cost of calling the MISSING function in Base SAS. The IML language has a built-in syntax (b = (X=.)) for creating a binary variable that indicates which elements of a vector are missing. The call to the MISSING function (b = missing(X)) is equivalent, but it requires calling a Base SAS function many times, once for each element of X. The native SAS/IML syntax will be faster than calling a Base SAS function (TINSTAAFL!), but how much faster?

The following program incorporates many of my tips for measuring the performance of a SAS computation. The test is run on large vectors of various sizes. Each computation (which is very fast, even on large vectors) is repeated 50 times. The results are presented in a graph. The following program measures the performance for a character vector that contains all missing values.

/* Compare performance of IML syntax
   b = (X = " ");
   to performance of calling Base SAS MISSING function 
   b = missing(X);
*/
proc iml;
numRep = 50;                            /* repeat each computation 50 times */
sizes = {1E4, 2.5E4, 5E4, 10E4, 20E4};  /* number of elements in vector */
labl = {"Size" "T_IML" "T_Missing"};
Results = j(nrow(sizes), 3);
Results[,1] = sizes;
 
/* measure performance for character data */
do i = 1 to nrow(sizes);
   A = j(sizes[i], 1, " ");            /* every element is missing */
   t0 = time();
   do k = 1 to numRep;
      b = (A = " ");                   /* use built-in IML syntax */
   end;
   Results[i, 2] = (time() - t0) / numRep;
 
   t0 = time();
   do k = 1 to numRep;
      b = missing(A);                  /* call Base SAS function */
   end;
   Results[i, 3] = (time() - t0) / numRep;
end;
 
title "Timing Results for (X=' ') vs missing(X) in SAS/IML";
title2 "Character Data";
long = (sizes // sizes) || (Results[,2] // Results[,3]);   /* convert from wide to long for graphing */
Group = j(nrow(sizes), 1, "T_IML") // j(nrow(sizes), 1, "T_Missing"); 
call series(long[,1], long[,2]) group=Group grid={x y} label={"Size" "Time (s)"} 
            option="markers curvelabel" other="format X comma8.;";

The graph shows that the absolute times for creating a binary indicator variable are very fast for both methods. Even for 200,000 observations, creating a binary indicator variable takes less than five milliseconds. However, on a relative scale, the built-in SAS/IML syntax is more than twice as fast as calling the Base SAS MISSING function.

You can run a similar test for numeric values. For numeric values, the SAS/IML syntax is about 10-20 times faster than the call to the MISSING function, but, again, the absolute times are less than five milliseconds.

So, what's the cost of calling a Base SAS function from SAS/IML? It's not free, but it's very cheap in absolute terms! Of course, the cost depends on the number of elements that you are sending to the Base SAS function. However, in general, there is hardly any cost associated with calling a Base SAS function from SAS/IML. So enjoy the lunch buffet! Not only is it convenient and plentiful, but it's also very cheap!

The post Timing performance in SAS/IML: Built-in functions versus Base SAS functions appeared first on The DO Loop.

August 7, 2019
 
2-D binning of counts

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.

SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.

Equal-width binning in SAS

The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.
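
As a small sketch (my own example), the following DATA step assigns bucket numbers 1, 2, ..., k from that formula. The data set name (Have), variable name (x), and the range are hypothetical:

%let Min = 0;  %let Max = 10;  %let k = 5;       /* hypothetical range and number of bins */
data Binned;
   set Have;
   width = (&Max - &Min) / &k;                   /* width of each bucket */
   bin = ceil( (x - &Min) / width );             /* bins are numbered 1, 2, ..., k */
   if x = &Min then bin = 1;                     /* the left endpoint belongs to bin 1 */
   if bin > &k then bin = &k;                    /* guard for x = Max */
run;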

Quantile binning in SAS

In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints.
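
A convenient way to form quantile bins is PROC RANK with the GROUPS= option, as in the following sketch (my own example; the data set and variable names are hypothetical). GROUPS=4 assigns each observation to a quartile group numbered 0 through 3:

proc rank data=Have out=RankOut groups=4;   /* 4 quantile groups (quartiles) */
   var x;                                   /* variable to bin */
   ranks QuartileBin;                       /* new variable that holds the group number */
run;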

Binning by using arbitrary cutpoints in SAS

Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.
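
A sketch of this approach uses PROC FORMAT to define the cutpoints and the PUT function to assign the groups. The wind-speed thresholds (in mph) approximate the Saffir-Simpson categories; the data set and variable names are hypothetical:

proc format;
   value HurrCat   74 -<  96 = 'Cat1'
                   96 -< 111 = 'Cat2'
                  111 -< 130 = 'Cat3'
                  130 -< 157 = 'Cat4'
                  157 - high = 'Cat5';
run;

data Storms;
   set Have;                                 /* hypothetical data with a WindSpeed variable */
   Category = put(WindSpeed, HurrCat.);      /* assign each observation to a bin */
run;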

2-D binning and bivariate histograms in SAS

A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).

In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.

The post The essential guide to binning in SAS appeared first on The DO Loop.

August 5, 2019
 

Binning transforms a continuous numerical variable into a discrete variable that has a small number of values. When you bin univariate data, you define cut points that define discrete groups. I've previously shown how to use PROC FORMAT in SAS to bin numerical variables and give each group a meaningful name such as 'Low,' 'Medium,' and 'High.' This article uses PROC HPBIN to create bins that are assigned numbers. If you bin the data into k groups, the groups have the integer values 1, 2, 3, ..., k. Missing values are put into the zeroth group.

There are two popular ways to choose the cut points. You can use evenly spaced points, or you can use quantiles of the data. If you use evenly spaced cut points (as in a histogram), the number of observations in each bin will usually vary. Using evenly spaced cut points is called the "bucket binning" method. Alternatively, if you use quantiles as cut points (such as tertiles, quartiles, or deciles), the number of observations in each bin tend to be more balanced. This article shows how to use PROC HPBIN in SAS to perform bucket binning and quantile binning. It also shows how to use the CODE statement in PROC HPBIN to create a DATA step that uses the same cut points to bin future observations.
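
As a preview, the following sketch shows the general pattern for the CODE statement. The data set names and the file name are hypothetical; the CODE statement writes DATA step statements that reproduce the cut points so that you can bin new observations:

proc hpbin data=Train numbin=5;        /* bin the training data */
   input x;
   code file='BinRules.sas';           /* write the binning rules as DATA step code */
run;

data ScoredNew;                        /* apply the same cut points to new data */
   set NewData;
   %include 'BinRules.sas';
run;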

Create sample data

The following statements create sample data from the Sashelp.Heart data. An ID variable is added to the data so that you can identify each observation. A call to PROC MEANS produces descriptive statistics about two variables: Cholesterol and Systolic blood pressure.

data Heart;
format PatientID Z5.;
set Sashelp.Heart(keep=Sex Cholesterol Systolic);
PatientID = 10000 + _N_;
run;
 
proc means data=Heart nolabels N NMISS Min Max Skewness;
   var Cholesterol Systolic;
run;
Descriptive statistics for two variables

The output shows the range of the data for each variable. It also shows that the Cholesterol variable has 152 missing values. If your analysis requires nonmissing observations, you can use PROC HPIMPUTE to replace the missing values. For this article, I will not replace the missing values so that you can see how PROC HPBIN handles missing values.
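If you do want to impute, the following call to PROC HPIMPUTE is a minimal sketch that replaces missing Cholesterol values with the sample mean. The name of the output data set is arbitrary, and the procedure chooses the names of the imputed variables that it writes to that data set:

/* sketch: impute missing values before binning */
proc hpimpute data=Heart out=HeartImp;
   input Cholesterol;
   impute Cholesterol / method=mean;   /* replace missing values with the sample mean */
run;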

Each variable has a skewed distribution, as indicated by the values of the skewness statistic. This usually indicates that equal-length binning will result in bins in the tail of the distribution that have only a few observations.

Use PROC HPBIN to bin data into equal-length bins

A histogram divides the range of the data by using k evenly spaced cutpoints. The width of each bin is (Max – Min) / k. PROC HPBIN enables you to create new variables that indicate to which bin each observation belongs. You can use the global NUMBIN= option on the PROC HPBIN statement to set the default number of bins for each variable. You can use the INPUT statement to specify which variables to bin. You can override the default number of bins by using the NUMBIN= option on any INPUT statement.

Suppose that you want to bin the Cholesterol data into five bins and the remaining variables into three bins.

  • The range of the Cholesterol data is [96, 568], so the width of the five bins that contain nonmissing values will be 94.4.
  • The range of the Systolic data is [82, 300], so the width of the three bins will be approximately 72.67.

The following call to PROC HPBIN bins the variables. The output data set, HeartBin, contains the bin numbers for each observation.

/* equally divide the range of each variable (bucket binning) */
proc hpbin data=Heart output=HeartBin numbin=3;  /* global NUMBIN= option */
   input Cholesterol / numbin=5;                 /* override global NUMBIN= option */
   input Systolic;                               /* use global NUMBIN= option */
   id PatientID Sex;
run;
Cutpoints and frequency of observations for bucket binning using PROC HPBIN in SAS

Part of the output from PROC HPBIN is shown. (You can suppress the output by using the NOPRINT option.) The first table shows that PROC HPBIN used four threads on my PC to compute the results in parallel. The second table summarizes the transformation that bins the data. For each variable, the second column gives the name of the binned variable in the OUTPUT= data set. The third column shows the cutpoints for each bin. The Frequency and Proportion columns show the frequency and proportion (respectively) of observations in each bin. As expected for these skewed variables, bins in the tail of each variable contain very few observations (less than 1% of the total).

The OUTPUT= option creates an output data set that contains the binned variables. You can use PROC FREQ to enumerate the bin values and (again) count the number of observations in each bin:

proc freq data=HeartBin;
   tables BIN_Cholesterol BIN_Systolic / nocum;
run;
Frequency of observations in each bin for bucket binning using PROC HPBIN in SAS

Notice that the Cholesterol variable was split into six bins even though the syntax specified NUMBIN=5. If a variable contains missing values, a separate bin is created for them. In this case, the zeroth bin contains the 152 missing values for the Cholesterol variable.

Bucket binning divides the range of the variables into equal-width intervals. For long-tailed data, the number of observations in each bin might vary widely, as it does for these data. The next section shows an alternative binning strategy in which the widths of the bins vary and each bin contains approximately the same number of observations.

Use PROC HPBIN to bin data by using quantiles

You can use evenly spaced quantiles as cutpoints in an attempt to balance the number of observations in the bins. However, if the data are rounded or have duplicate values, the number of observations in each bin can still vary. PROC HPBIN has two methods for quantile binning. The slower method (the QUANTILE option) computes cutpoints based on the sample quantiles and then bins the observations. The faster method (the PSEUDO_QUANTILE option) uses approximate quantiles to bin the data. The following call uses the PSEUDO_QUANTILE option to bin the data into approximately equal groups:

/* bin by quantiles of each variable */
proc hpbin data=Heart output=HeartBin numbin=3 pseudo_quantile;
   input Cholesterol / numbin=5;    /* override global NUMBIN= option */
   input Systolic;                  /* use global NUMBIN= option */
   id PatientID Sex;
   code file='C:/Temp/BinCode.sas'; /* generate scoring code */
run;
Cutpoints and frequency of observations in each bin for quantile binning using PROC HPBIN in SAS

The output shows that the number of observations in each bin is more balanced. For the Systolic variable, each bin has between 1,697 and 1,773 observations. For the Cholesterol variable, each bin contains between 975 and 1,056 observations. Although not shown in the table, the BIN_Cholesterol variable also contains a bin for the 152 missing values for the Cholesterol variable.

Use PROC HPBIN to write DATA step code to bin future observations

In the previous section, I used the CODE statement to specify a file that contains SAS DATA step code that can be used to bin future observations. The statements in the BinCode.sas file are shown below:

*****************      BIN_Systolic     ********************;
length BIN_Systolic 8;
if missing(Systolic) then do; BIN_Systolic = 0; end;
else if Systolic < 124.0086 then do; BIN_Systolic =     1; end;
else if 124.0086 <= Systolic < 140.0098 then do; BIN_Systolic =     2; end;
else if 140.0098 <= Systolic then do; BIN_Systolic =     3; end;
 
*****************      BIN_Cholesterol     ********************;
length BIN_Cholesterol 8;
if missing(Cholesterol) then do; BIN_Cholesterol = 0; end;
else if Cholesterol < 190.0224 then do; BIN_Cholesterol =     1; end;
else if 190.0224 <= Cholesterol < 213.0088 then do; BIN_Cholesterol =     2; end;
else if 213.0088 <= Cholesterol < 234.0128 then do; BIN_Cholesterol =     3; end;
else if 234.0128 <= Cholesterol < 263.0408 then do; BIN_Cholesterol =     4; end;
else if 263.0408 <= Cholesterol then do; BIN_Cholesterol =     5; end;

You can see from these statements that the zeroth bin is reserved for missing values. Nonmissing values will be split into bins according to the approximate tertiles (NUMBIN=3) or quintiles (NUMBIN=5) of the training data.

The following statements show how to use the file that was created by the CODE statement. New data is contained in the Patients data set. Simply use the SET statement and the %INCLUDE statement to bin the new data, as follows:

data Patients;
length Sex $6;
input PatientID Sex Systolic Cholesterol;
datalines;
13021 Male    96 . 
13022 Male   148 242 
13023 Female 144 217 
13024 Female 164 376 
13025 Female .   248 
13026 Male   190 238 
13027 Female 158 326 
13028 Female 188 266 
;
 
data MakeBins;
set Patients;
%include 'C:/Temp/BinCode.sas';   /* include the binning statements */
run;
 
/* Note: HPBIN puts missing values into bin 0 */
proc print data=MakeBins;  run;
Binning new observations by using PROC HPBIN in SAS

The input data can contain other variables (PatientID, Sex) that are not binned. However, the data should contain the Systolic and Cholesterol variables because the statements in the BinCode.sas file refer to those variables.

Summary

In summary, you can use PROC HPBIN in SAS to create a new discrete variable by binning a continuous variable. This transformation is common in machine learning algorithms. Two common binning methods include bucket binning (equal-length bins) and quantile binning (unequal-length bins). Missing values are put into their own bin (the zeroth bin). The CODE statement in PROC HPBIN enables you to write DATA step code that you can use to bin future observations.

The post How to use PROC HPBIN to bin numerical variables appeared first on The DO Loop.
