July 23, 2019

Guest blogger Khari Villela shares tips to help you skip common pitfalls of building a data lake.

The post 3 tips for building a better data lake appeared first on The Data Roundtable.

July 22, 2019

Is 4 an extreme value for the standard normal distribution? In high school, students learn the famous 68-95-99.7 rule,
which is a way to remember that 99.7 percent of random observations from a normal distribution are within three standard deviations of the mean. For the standard normal distribution, the probability that a random value is bigger than 3 is 0.0013. The probability that a random value is bigger than 4 is even smaller: about 0.00003 or 3 x 10^{-5}.

So, if you draw randomly from a standard normal distribution, it must be very rare to see an extreme value greater than 4, right? Well, yes and no. Although it is improbable that any ONE observation is more extreme than 4, if you draw MANY independent observations, the probability that the sample contains an extreme value increases with the sample size. If *p* is the probability that one observation is less than 4, then *p*^{n} is the probability that *n* independent observations are all less than 4. Thus
1 – *p*^{n} is the probability that some value is greater than 4.
You can use the CDF function in SAS to compute these probabilities in a sample of size *n*:

```sas
/* What is a "large value" of a normal sample? The answer depends on the
   size of the sample. */
/* Use CDF to find the probability that a random value from N(0,1) exceeds 4 */
proc iml;
P_NotGT4 = cdf("Normal", 4);      /* P(x < 4) */
/* probability of an extreme obs in a sample of n independent observations */
n = {1, 100, 1000, 10000};        /* sample sizes */
P_NotGT4 = P_NotGT4##n;           /* P(all n values are < 4); ## is elementwise power */
P_GT4 = 1 - P_NotGT4;             /* P(any value is > 4) */
print n P_NotGT4 P_GT4;
```

The third column of the table shows that the probability of seeing an observation whose value is greater than 4 increases with the size of the sample. In a sample of size 10,000, the probability is 0.27, which implies that about one out of every four samples of that size will contain a maximum value that exceeds 4.
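If you want to double-check these numbers outside of SAS, the same computation is easy in Python. This is my own cross-check, not part of the original program; the standard normal CDF is written in terms of the error function:

```python
from math import erf, sqrt

def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

p = normal_cdf(4.0)                 # P(one observation < 4)
for n in [1, 100, 1000, 10000]:
    p_any_gt4 = 1.0 - p**n          # P(at least one of n observations > 4)
    print(f"n = {n:6d}   P(max > 4) = {p_any_gt4:.5f}")
```

For n = 10,000 this prints a probability of about 0.27, which matches the SAS output.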

You can use a simulation to approximate the distribution of the maximum value of a normal sample of size *n*. For definiteness, choose *n* = 1,000 and sample from a standard normal distribution N(0,1). The following SAS/IML program simulates 5,000 samples of size *n* and computes the maximum value of each sample. You can then graph the distribution of the maximum values to understand how the maximum value varies in random samples of size *n*.

```sas
/* What is the distribution of the maximum value in a sample of size n
   drawn from the standard normal distribution? */
call randseed(12345);
n = 1000;                      /* sample size */
numSim = 5000;                 /* number of simulations */
x = j(n, numSim);              /* each column will be a sample */
call randgen(x, "Normal");     /* generate numSim samples */
max = x[<>, ];                 /* find max of each sample (column) */

title "Distribution of Maximum Value of a Normal Sample";
title2 "n = 1000";
call histogram(max);

/* compute some descriptive statistics */
mean = mean(max`);
call qntl(Q, max`, {0.25 0.5 0.75});
print (mean // Q)[rowname={"Mean" "P25" "Median" "P75"}];
```

Based on this simulation, the expected maximum value of a
sample of size *n* = 1,000 is about 3.2.
The table indicates that about 25% of the samples have a maximum value that is less than 3. About half have a maximum value less than 3.2. About 75% of the samples have a maximum value that is less than 3.4. The graph shows the distribution of maxima in these samples. The maximum value of a sample ranged from 2.3 to 5.2.
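As an aside, you can replay this Monte Carlo experiment in Python with only the standard library. The sample sizes below mirror the SAS program, but the seed is an arbitrary choice and the random number streams differ, so the summary statistics will agree only approximately:

```python
import random
import statistics

random.seed(12345)                  # arbitrary seed; not the SAS stream
n, num_sim = 1000, 5000
# maximum of each simulated N(0,1) sample of size n
maxima = [max(random.gauss(0.0, 1.0) for _ in range(n)) for _ in range(num_sim)]

mean_max = statistics.mean(maxima)
q25, q50, q75 = statistics.quantiles(maxima, n=4)   # quartiles of the maxima
print(f"mean={mean_max:.2f}  P25={q25:.2f}  median={q50:.2f}  P75={q75:.2f}")
```

The printed summary statistics should land close to the SAS values: a mean near 3.2, a lower quartile near 3.0, and an upper quartile near 3.4.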

The distribution of a maximum (or minimum) value in a sample is studied in an area of statistics that is known as *extreme value theory*. It turns out that you can derive the sampling distribution of the maximum of a sample by using the Gumbel distribution, which is also known as the "extreme value distribution of Type 1." You can use the Gumbel distribution to describe the distribution of the maximum of a sample of size *n*. The Gumbel distribution actually describes the maximum for many distributions, but for simplicity I will only refer to the normal distribution.

This section does two things. First, it uses PROC UNIVARIATE to fit the parameters of a Gumbel distribution to the maximum values from the simulated samples. The Gumbel(3.07, 0.29) distribution is the distribution that maximizes the likelihood function for the simulated data. Second, it uses a theoretical formula to show that a Gumbel(3.09, 0.29) distribution is the distribution that models the maximum of a normally distributed sample of size *n* = 1,000. Thus, the results from the simulation and the theory are similar.

You can write the 5,000 maximum values from the simulation to a SAS data set and use PROC UNIVARIATE to estimate the MLE parameters for the Gumbel distribution, as follows:

```sas
/* write the simulated maxima to a data set */
create MaxVals var "max";
append;
close;
QUIT;

/* Fit a Gumbel distribution, which models the distribution of maximum values */
proc univariate data=MaxVals;
   histogram max / gumbel;
   ods select Histogram ParameterEstimates FitQuantiles;
run;
```

The output from PROC UNIVARIATE shows that the Gumbel(3.07, 0.29) distribution is a good fit to the distribution of the simulated maxima. But where do those parameter values come from? How are the parameters of the Gumbel distribution related to the sample size of the standard normal distribution?

It was difficult to find an online reference that shows how to derive the Gumbel parameters for a normal sample of size *n*. I finally found a formula in the book *Extreme Value Distributions* (Kotz and Nadarajah, 2000, p. 9). For any cumulative distribution F that satisfies certain conditions, you can use the quantile function of the distribution to estimate the Gumbel parameters.
The result is that the location parameter (μ) is equal to μ = F^{-1}(1-1/*n*) and the scale parameter (σ) is equal to σ = F^{-1}(1-1/(*ne*)) - μ, where *e* is the base of the natural logarithm.
The following SAS/IML program uses the (1 – 1/*n*)th quantile of the normal distribution to derive the Gumbel parameters for a normal sample of size *n*:

```sas
proc iml;
n = 1000;
/* Compute the parameters of the Gumbel distribution when the sample is
   normal of size n. SAS calls the parameters (mu, sigma). Wikipedia calls
   them (mu, beta). Other references use (alpha, beta). */
mu_n = quantile("Normal", 1-1/n);                     /* location parameter */
sigma_n = quantile("Normal", 1-1/n*exp(-1)) - mu_n;   /* scale parameter */
print n mu_n sigma_n;

/* what are the mean (expected value) and median of this Gumbel distribution? */
gamma = constant("Euler");               /* Euler-Mascheroni constant = 0.5772157 */
mean = mu_n + gamma*sigma_n;             /* expected value of the maximum */
median = mu_n - sigma_n*log(log(2));     /* median of the maximum-value distribution */
print n mean median;
```

Notice that these parameters are very close to the MLE estimates from the simulated normal samples. The results tell you that the expected maximum in a standard normal sample of size 1,000 is about 3.26, and that about 50% of such samples will have a maximum value of 3.19 or less.
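The same quantile formulas are easy to evaluate outside SAS. The following Python sketch is my own cross-check; `statistics.NormalDist` supplies the standard normal quantile function:

```python
from math import e, log
from statistics import NormalDist

n = 1000
inv_cdf = NormalDist().inv_cdf             # standard normal quantile function

mu_n = inv_cdf(1 - 1/n)                    # Gumbel location, about 3.09
sigma_n = inv_cdf(1 - 1/(n*e)) - mu_n      # Gumbel scale, about 0.29
gamma = 0.5772156649015329                 # Euler-Mascheroni constant

mean_max = mu_n + gamma * sigma_n          # expected maximum, about 3.26
median_max = mu_n - sigma_n * log(log(2))  # median maximum, about 3.19
print(mu_n, sigma_n, mean_max, median_max)
```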

You can use these same formulas to find the expected and median values of the maximum in samples of other sizes:

```sas
/* If the sample size is n, what is the expected maximum? */
n = {1E4, 2E4, 1E5, 2E5, 1E6, 2E6};
mu_n = quantile("Normal", 1-1/n);                     /* location parameter */
sigma_n = quantile("Normal", 1-1/n*exp(-1)) - mu_n;   /* scale parameter */
mean = mu_n + gamma*sigma_n;             /* expected value of the maximum */
median = mu_n - sigma_n*log(log(2));     /* median of the maximum-value distribution */
print n mean median;
```

The table shows that you would expect to see a maximum value of 4 in a sample of size 20,000. If there are two million observations in a sample, you would expect to see a maximum value of 5!
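To reproduce this table without SAS, you can wrap the two parameter formulas in a small function. The following Python sketch (a cross-check, not the original program) shows the expected maximum crossing 4 near n = 20,000 and approaching 5 at n = 2,000,000:

```python
from math import e
from statistics import NormalDist

inv_cdf = NormalDist().inv_cdf
gamma = 0.5772156649015329          # Euler-Mascheroni constant

def expected_max(n):
    """Expected maximum of a standard normal sample of size n (Gumbel approximation)."""
    mu = inv_cdf(1 - 1/n)           # Gumbel location
    sigma = inv_cdf(1 - 1/(n*e)) - mu   # Gumbel scale
    return mu + gamma * sigma

for n in [1e4, 2e4, 1e5, 2e5, 1e6, 2e6]:
    print(f"n = {int(n):>9,}   expected maximum = {expected_max(n):.2f}")
```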

You can graph this data over a range of sample sizes. The following graph shows the expected value of the maximum value in a sample of size *n* (drawn from a standard normal distribution) for large values of *n*.

You can create similar images for quantiles. The *p*th quantile for the Gumbel distribution is
*q* = mu_n - sigma_n * log(-log(*p*)).
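Plugging the Gumbel parameters for n = 1,000 into this quantile formula recovers the quartiles seen in the simulation (about 3.0 and 3.45). A Python sketch of the calculation (my own cross-check):

```python
from math import e, log
from statistics import NormalDist

inv_cdf = NormalDist().inv_cdf
n = 1000
mu_n = inv_cdf(1 - 1/n)                    # Gumbel location
sigma_n = inv_cdf(1 - 1/(n*e)) - mu_n      # Gumbel scale

def max_quantile(p):
    """p-th quantile of the (approximate) distribution of the sample maximum."""
    return mu_n - sigma_n * log(-log(p))

print(max_quantile(0.25), max_quantile(0.75))   # quartiles, roughly 3.0 and 3.45
```

Note that p = 0.5 reduces the formula to the median expression used earlier, because -log(0.5) = log(2).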

So, is 4 an unlikely value for the standard normal distribution? Yes, but for sufficiently large samples that value is likely to be observed. You can use the Gumbel distribution to model the distribution of the maximum in a normal sample of size *n* to determine how likely it is that the sample contains an extreme value. The larger the sample, the more likely it is to observe an extreme value. Although I do not discuss the general case, the Gumbel distribution can also model the maximum value for samples drawn from some non-normal distributions.

The post Extreme values: What is an extreme value for normally distributed data? appeared first on The DO Loop.

July 19, 2019

This blog answers three questions related to using PROC RANK with groups and ties. Note that question two actually provides an alternative, using the DATA step, for when PROC RANK cannot produce what you need.

- What does PROC RANK do behind the code when you use the GROUPS= option in the PROC RANK statement?
- What do you do if you want equal groups, regardless of tied values?
- What do you do if you want groups to start at 1 rather than 0?

Most of us just accept the ranked values that are calculated by PROC RANK. But have you ever tried to figure out the calculation that happens behind the code? When you use the GROUPS= option in the PROC RANK statement, the values are assigned to groups that range from 0 to the number-of-groups minus 1, based on tied values. If you have many tied values, you might not obtain the number of groups that you request because observations with the same value are assigned to the same group.

The formula for calculating group values is as follows:

FLOOR(rank*k/(n+1))

In this formula:

- rank is the data value's rank order
- k is the value of the GROUPS= option
- n is the number of nonmissing values

Consider the following example. If you want to see the original value as well as the ranking, use both the VAR and RANKS statements in your PROC RANK step. The RANKS variable contains the rank value. If you use a BY statement, the raw rankings start over in each BY group.

**Example 1**

```sas
data test;
   input x;
   cards;
1
1
1
1
1
1
8
8
;
run;

proc rank data=test groups=5 out=test_rank ties=mean;   /* also try ties=low and ties=high */
   var x;
   ranks rank_x;
run;

proc print data=test_rank;
run;
```

**Output**

| x | rank_x |
|---|---|
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 8 | 4 |
| 8 | 4 |

The following table shows the variable values, the raw ranks, the combined rank under each TIES= option, and the resulting group values:

| X | RAW_RANK | TIES=MEAN | TIES=LOW | TIES=HIGH | RANK_MEAN | RANK_LOW | RANK_HIGH |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 3.5 | 1 | 6 | 1 | 0 | 3 |
| 1 | 2 | 3.5 | 1 | 6 | 1 | 0 | 3 |
| 1 | 3 | 3.5 | 1 | 6 | 1 | 0 | 3 |
| 1 | 4 | 3.5 | 1 | 6 | 1 | 0 | 3 |
| 1 | 5 | 3.5 | 1 | 6 | 1 | 0 | 3 |
| 1 | 6 | 3.5 | 1 | 6 | 1 | 0 | 3 |
| 8 | 7 | 7.5 | 7 | 8 | 4 | 3 | 4 |
| 8 | 8 | 7.5 | 7 | 8 | 4 | 3 | 4 |

Using the formula that is shown previously, k=5 and n=8. Since TIES=MEAN, you sum the raw ranks of the tied values of X and divide by the number of tied observations. For X=1, the rank is (1+2+3+4+5+6)/6 = 21/6 = 3.5. For X=8, the rank is (7+8)/2 = 15/2 = 7.5. Similarly, if you use TIES=LOW, for X=1, the rank is 1; for X=8, the rank is 7. Finally, if you use TIES=HIGH, for X=1, the rank is 6; for X=8, the rank is 8. When you insert those values into the formula for the first observation, you obtain the following results:

TIES=MEAN: Floor(3.5*5/9)=Floor(1.9)=1

TIES=LOW: Floor(1*5/9)=Floor(0.5)=0

TIES=HIGH: Floor(6*5/9)=Floor(3.3)=3
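If you want to verify the whole calculation programmatically, the following Python sketch (not SAS; the helper function `tied_rank` is my own) applies the FLOOR formula to this data set for each TIES= option:

```python
from math import floor

x = [1, 1, 1, 1, 1, 1, 8, 8]
k, n = 5, len(x)                       # GROUPS=5, 8 nonmissing values

def tied_rank(value, method):
    """Combine the raw (1-based) ranks of observations tied with `value` per TIES=."""
    positions = [i + 1 for i, v in enumerate(sorted(x)) if v == value]
    if method == "mean":
        return sum(positions) / len(positions)
    return min(positions) if method == "low" else max(positions)

for method in ("mean", "low", "high"):
    groups = [floor(tied_rank(v, method) * k / (n + 1)) for v in x]
    print(f"TIES={method}: {groups}")
```

The output reproduces the RANK_MEAN, RANK_LOW, and RANK_HIGH columns of the table above.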

Suppose that you want to create groups that have the same number of observations in each one, regardless of tied values. PROC RANK cannot do this. However, you can use the DATA step to accomplish this task.

You need to sort the data set by the ranking variable and then use the same formula in the DATA step, as shown below.

**Example 2**

```sas
proc sort data=test;
   by x;
run;

data ranks;
   set test nobs=numobs;
   group = floor(_n_*5/(numobs+1));
run;

proc print data=ranks;
run;
```

**Output**

| x | group |
|---|---|
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 2 |
| 1 | 2 |
| 1 | 3 |
| 8 | 3 |
| 8 | 4 |
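The DATA step trick works because, after sorting, the observation number _N_ plays the role of an untied raw rank. A Python sketch of the same idea (my own illustration, not SAS):

```python
from math import floor

x = sorted([1, 1, 1, 1, 1, 1, 8, 8])
k, n = 5, len(x)
# the 1-based position stands in for _N_, so tied values no longer share a group
groups = [floor(i * k / (n + 1)) for i in range(1, n + 1)]
print(list(zip(x, groups)))
```

The groups come out as 0, 1, 1, 2, 2, 3, 3, 4, matching the output table above.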

When you use the GROUPS= option, the values that are assigned to the groups start at 0. There is no way to indicate for PROC RANK to start the groups at 1. However, once you have the data set with the ranked values, you can add 1 using DATA step logic, as shown in this example:

**Example 3**

```sas
data test;
   input y;
   cards;
11
22
10
15
25
;
run;

proc sort data=test;
   by y;
run;

proc rank data=test out=test_rank1 groups=3;
   var y;
   ranks rank_y;
run;

data test_rank1;
   set test_rank1;
   rank_y + 1;
run;

proc print data=test_rank1;
run;
```

**Output**

| y | rank_y |
|---|---|
| 10 | 1 |
| 11 | 2 |
| 15 | 2 |
| 22 | 3 |
| 25 | 3 |

PROC RANK has many statistical applications, such as helping you understand how your data is distributed. Hopefully, this blog has provided you with a better understanding of how ranks are determined when tied values are present in the data and when you want to assign those ranks to groups.

How the RANK procedure calculates ranks with groups and ties was published on SAS Users.

July 19, 2019

When a new Moon passes between the Earth and the Sun, the Moon can cast a shadow on certain regions of the Earth. This natural phenomenon creates a solar eclipse, meaning the Moon covers, or eclipses, your view of the Sun if you're in that region. No surprise that in [...]

Ring of fire: Visualizing 5,000 years of solar eclipses was published on SAS Voices by Falko Schulz

July 17, 2019

Have you ever tried to pass comma-delimited values to a SAS macro or to a SAS macro function? How can SAS distinguish the commas that separate parameters or arguments from the commas that are part of the values?

Let’s say you want to extract the first word from the following string of characters (these words represent column names in the SASHELP.CARS data table):

`make, model, type, origin`

If you run the following code:

`%let firstvar = %scan(make, model, type, origin, 1);`

you get the following ERROR in your SAS log:

`ERROR: Macro function %SCAN has too many arguments.`

That is because the %SCAN macro function sees and treats make, model, type, and origin as separate arguments, since the commas between them are interpreted as argument separators.

Even if you “hide” your comma-delimited value within a macro variable, it still won’t do any good since the macro variable gets resolved during macro compilation before being passed on to a macro or macro function for execution.

```sas
%let mylist = make, model, type, origin;
%let firstvar = %scan(&mylist, 1);
```

You will still get the same ERROR:

`ERROR: Macro function %SCAN has too many arguments.`

Try submitting the following code that passes your macro variable value to a SAS macro as a parameter:

```sas
%let mylist = make, model, type, origin;

%macro subset(dsname=, varlist=);
   proc sql;
      select &varlist from &dsname;
   quit;
%mend subset;

%subset(dsname=SASHELP.CARS, varlist=&mylist)
```

You will get another version of the SAS log ERROR:

```
ERROR: All positional parameters must precede keyword parameters.
NOTE: Line generated by the macro variable "MYLIST".
1     type, origin
      ----
      180
ERROR 180-322: Statement is not valid or it is used out of proper order.
```

In this case, macro %subset gets as confused as the %SCAN function above because your macro variable is resolved during macro compilation, and the SAS macro processor sees the macro invocation as:

`%subset(dsname=SASHELP.CARS, varlist=make, model, type, origin)`

treating each comma as a parameter separator.

All this confusion happens because SAS functions’ arguments and SAS macros’ parameters use commas as separators, while resolved macro variables introduce their own comma delimiters into the picture, thus wreaking havoc on your SAS program.

But don’t panic! To fight that chaos, you need to take a vacation. Not a stay-home, do-nothing vacation, but a serious vacation, with a faraway destination and travel arrangements. While a real vacation is preferable, an imaginary one will do too. To start fighting the mess of comma-separated values, pick your destination, book your hotel and flight, and start packing your stuff.

Do you have a “vacation items list”? In my family, we have an individual vacation list for every family member. How many items do you usually take with you? Ten, twenty, a hundred?

Regardless, you don’t show up at the airport checkpoint with a pile of your vacation items. That would be too messy. I don’t think you would even be allowed to board with an unpacked heap of your stuff. You come to the airport neatly rolling a single item that is called a suitcase. Well, I suppose that some of you may have two of them, but I can’t imagine more than that.

You have only started your fantasy vacation, and you haven’t even checked in for your flight, but you already have a solution in sight: a perfect combine-and-conquer solution for passing comma-delimited values. Even if you have not yet realized it, it’s in plain view.

Forget the “thinking outside the box” metaphor. You can’t solve all your problems with a single strategy. Sometimes, you need to turn your thinking on its head to solve, or even to see, the problem.

As for your airport check-in, instead of thinking outside the box, you thought “inside the box” and brought your many items “boxed” as a single item – a suitcase. A container, in a broader sense.

That is exactly how we are going to approach our comma-delimited lists problem. We are going to check them in to a macro or a macro function as a single, boxed item.

Not surprisingly, the SAS macro language provides a variety of these wonder boxes for many special occasions, collectively known as macro quoting functions. Personally, I would prefer calling them “macro masking functions,” as they have nothing to do with “quoting” per se and everything to do with *masking* various characters during macro compilation or macro execution. But that is what “macro quoting” means: masking, or boxing, similar to “quoting” a character string to make it a single entity.

Different macro quoting functions mask different special characters (+ - , / ; = etc.) and mnemonics (AND OR GT EQ etc.) so that the macro facility interprets them as text instead of as language symbols.

There are seven SAS macro quoting functions. Two of them work at macro compilation: %STR() and %NRSTR(). The other five work at macro execution: %QUOTE() and %NRQUOTE(), %BQUOTE() and %NRBQUOTE(), and %SUPERQ().

You may look up what symbols they mask and the timing they apply (macro compilation vs. macro execution) in this macro quoting functions summary. You may also want to look at the following cheat sheet: Deciding When to Use a Macro Quoting Function and Which Function to Use.

As a general rule of thumb, use compilation-time macro quoting functions when you mask text constants (`make, model, type, origin`); use execution-time macro quoting functions when you mask macro or macro variable references containing `&` or `%` (`&mylist`).

**NOTE:** There are many other SAS macro functions that besides their main role also perform macro quoting, e.g. `%QSCAN(), %QSUBSTR()` and others; they all start with `%Q`.

It turns out that any macro quoting function will mask (or “box”) comma-separated values passed to a macro function or a SAS macro. In this case I would suggest using the simplest (and shortest) one, %STR(). %STR() applies during macro compilation and serves as a perfect “box” for our comma-delimited values: it hides (masks) the commas so that the receiving macro function or macro does not confuse them with its own commas that separate arguments or parameters.

With it we can re-write our above examples as:

```sas
%let firstvar = %scan(%str(make, model, type, origin), 1);
%put &=firstvar;
```

SAS log will produce exactly what we expected:

`FIRSTVAR=make`

Similarly, we can call the above SAS macro as:

`%subset(dsname=SASHELP.CARS, varlist=%str(make, model, type, origin) )`

It will run without ERRORs and produce a print of the `SASHELP.CARS` data table with 4 columns specified by the `varlist` parameter value:

When you assign a comma-delimited list as the value of a macro variable, you need to mask the commas within the resolved value during execution. Any of the execution-time macro quoting functions will mask the commas.

Again, when there are multiple possibilities, I would use the shortest one: `%QUOTE()`.

With it we can re-write our above examples as:

```sas
%let mylist = make, model, type, origin;
%let firstvar = %scan(%quote(&mylist), 1);
%subset(dsname=SASHELP.CARS, varlist=%quote(&mylist))
```

But keep in mind that the remaining four execution-time macro quoting functions, `%NRQUOTE(), %BQUOTE(), %NRBQUOTE()`, and `%SUPERQ()`, will work too.

**NOTE:** The syntax of the %SUPERQ() function is quite different from the rest of the pack. The %SUPERQ() macro function takes as its argument either a macro variable name **without an ampersand** or a macro text expression that yields a macro variable name.

I realize that macro quoting is not a trivial matter. That is why I attempted to explain its concept on a very simple yet powerful use case. Hope you will expand on this to empower your SAS coding skills.

Passing comma-delimited values into SAS macros and macro functions was published on SAS Users.

July 17, 2019

In the SAS/IML language, a matrix contains data of one type: numeric or character. If you want to create a SAS data set that contains mixed-type data (numeric *and* character), SAS/IML 15.1 provides support to write multiple matrices to a data set by using a single statement. Specifically, the CREATE FROM and APPEND FROM statements now support writing multiple matrices of any types. SAS/IML 15.1 was released as part of SAS 9.4m6.

With the new enhancements to the CREATE FROM and APPEND FROM statements, you now have four ways to write mixed type data to a SAS data set:

- Use the CREATE and APPEND statement to write data from vectors (of any types) to a SAS data set.
- Put the data into a SAS/IML table (introduced in SAS/IML 14.2 as part of SAS 9.4m4) and use the TableWriteToDataset call to write the data set.
- Use the CREATE FROM and APPEND FROM statements to write two data sets: one that contains all the numeric data and another that contains the character data. Use the DATA step to merge the data sets.
- **New in SAS/IML 15.1:** Use the CREATE FROM statement and APPEND FROM statement to write multiple matrices, as shown in the following section.

In SAS/IML 15.1, you can specify multiple matrices on the CREATE FROM statement. The matrices can be of any type. In the following example, X is a numeric matrix and C is a character matrix:

```sas
/* write numeric and character variables in one call */
proc iml;
NumerVarNames = {'N' 'N2' 'N3'};
X = { 1 2  3,
      2 4  6,
      3 6  9,
      4 8 12};
CharVarNames = {'Animal' 'Flower'};
C = {'Rat'  'Iris',
     'Pig'  'Rose',
     'Goat' 'Daisy',
     'Duck' 'Lily'};

/* SAS/IML 15.1: write multiple matrices of any type to a SAS data set */
AllNames = NumerVarNames || CharVarNames;
create MyData from X C [colname=AllNames];   /* specify multiple matrices */
append from X C;                             /* repeat the matrix names */
close;
QUIT;

proc print data=MyData noobs;
run;
```

Although the new enhancements to the CREATE FROM and APPEND FROM statements enable you to write mixed-type data to a SAS data set, you can also use them to write multiple matrices of the same type. For example, you can use the same technique to write multiple numeric matrices.

Notice that if you want to specify the names of the data set variables, you use a single COLNAME= option at the end of the CREATE FROM statement.

The post Write numeric and character matrices to a data set from SAS/IML appeared first on The DO Loop.

July 16, 2019

“They were the best of asteroids, they were the worst of asteroids.” ~ Charles Dickens Armstrong There are good asteroids, and there are bad asteroids. Good asteroids stay in their own neighborhoods and wait for us to come visit them. Bad asteroids, however, don’t wait for an invitation – they [...]

A tale of two asteroids was published on SAS Voices by Leo Sadovy

July 15, 2019

With all the excitement around SAS’ new software architecture, SAS Viya, we often get asked the question:

What is it and how can it help my company conquer our analytics challenges?

Fortunately, learning more about SAS Viya has never been easier.

SAS Viya extends the SAS® Platform and provides reliable, scalable, and secure analytics inventory management and governance. It allows for faster processing, access to machine learning, plus support for other languages like Python, R, Java, and Lua. In addition, it has support for on-site, cloud, or hybrid environments. It opens SAS to more than just data scientists and allows the SAS platform to be used by business analysts, developers, executives, and more. It truly is the next step in data analytics!

*Exploring SAS® Viya®: Programming and Data Management* covers how to access data files, libraries, and existing code in SAS® Studio. It also includes information on new procedures in SAS Viya, how to write new code, and how to use some of the pre-installed tasks that come with SAS® Visual Data Mining and Machine Learning.

*Exploring SAS® Viya®: Visual Analytics, Statistics, and Investigations* covers data visualization which enables decision-makers to see analytics presented visually so that they can grasp difficult concepts or identify new patterns. SAS offers several solutions for visualizing your data, many of which are powered by SAS Viya. This book includes four visualization solutions powered by SAS Viya: SAS Visual Analytics, SAS Visual Statistics, SAS Visual Text Analytics, and SAS Visual Investigator.

Ready to test out Viya for yourself? Get a free trial and test the power of the Viya engine.

For more updates on new SAS Press books and exclusive discounts subscribe to our SAS Press New Book Newsletter.

Curious about SAS® Viya®? Discover two new free SAS Press e-books! was published on SAS Users.

July 15, 2019

Heat maps have many uses. You can use a heat map to visualize correlation matrices, to visualize longitudinal data ("lasagna plots"), and to visualize counts in any two-dimensional table. As of SAS 9.4m3, you can create heat maps in SAS by using the HEATMAP and HEATMAPPARM statements in PROC SGPLOT. Prior to SAS 9.4m3, you could create heat maps by using the Graph Template Language (GTL) in Base SAS or the HeatmapCont and HeatmapDisc functions in SAS/IML software.

I like to emphasize the difference between a continuous heat map and a discrete heat map. In a continuous heat map, each cell is assigned a color from a continuous color ramp and the graph includes a gradient legend that associates colors with numerical values of the continuous response variable. However, sometimes the response has a small number of discrete values such as 'Low', 'Medium', and 'High'. In that case, you can create a discrete heat map, similar to the one shown to the right. A discrete heat map uses a discrete palette of colors (and a discrete legend) to visualize the response variable.

First, this article shows how to use the HEATMAPPARM statement in PROC SGPLOT to create a continuous heat map, which is the default behavior. Next, it shows how to use a SAS format to bin the response variable into ordinal categories. Third, it creates a discrete heat map, shown at right to visualize the binned responses. Binning the response values and using a discrete heat map is especially useful when the response variable spans several orders of magnitude.

In this article, I use only the HEATMAPPARM statement. The difference between the HEATMAP and the HEATMAPPARM statement is that the HEATMAP statement supports binning the (x, y) values onto a uniform grid. The color in each cell is based on some statistic (frequency, sum, mean,...) that is computed over all the observations in a bin. In contrast, you use the HEATMAPPARM statement when the data are already aggregated onto a uniform grid. For each (x, y) coordinate, you have a single response value that you want to visualize by using color. This is often the case when you use heat maps to visualize tables.

Suppose a store tracks sales of three products ('A', 'B', and 'C') over a 10-week period. You can use a continuous heat map to visualize the quantities sold for each product. Because there are only 10 cells in the horizontal direction, you can optionally use the DISCRETEX option to show all values, as follows:

```sas
data Sales;
   input Product $ @@;
   do Week = 1 to 10;
      input QtySold @@;
      output;
   end;
   label QtySold = "Quantity Sold";
   datalines;
A  5  3  2  7 10  8  5  6  9 11
B  4  1  0  2  0  2  2  1  2  2
C 27 15 18 29 40 20 19 25 31 34
;

ods graphics / width=640px height=400px;
title "Continuous Heat Map";
title2 "Continuous Color Ramp and Legend";
proc sgplot data=Sales;
   heatmapparm x=Week y=Product colorresponse=QtySold / outline discretex;
   text x=Week y=Product text=QtySold / textattrs=(size=12pt) strip;
   gradlegend;
run;
```

To make the heat map easier to understand, I overlaid the quantities sold for each product and each week. The color of each cell is determined by using a three-color color ramp. The darkest blue corresponds to 0 items sold, the white color corresponds to 20 units sold, and the darkest red corresponds to 40 units sold. The colors for other values are linearly interpolated. A gradient legend to the right shows the association between shades of colors and units sold. You can use the COLORMODEL= option to use a different color ramp.

As I have discussed in other articles, you might not want to use a linear color ramp when the response variable is skewed or contains outliers. In this example, the store sells many more units of product 'C' than 'A' or 'B'. Consequently, most of the cells in the heat map are blue (low) and only a few are white (medium) or red (high). If you bin the counts into meaningful ordinal categories, the low and medium values will be easier to discern.

If your response variable is discrete and consists of a small number of groups, you can use a discrete heat map. Syntactically, you specify a discrete heat map by using the COLORGROUP= option (instead of COLORRESPONSE=) on the HEATMAPPARM statement. Instead of the GRADLEGEND statement, add a regular (discrete) legend by using the KEYLEGEND statement.

Let's create a discrete heat map for the Sales data by binning the QtySold variable. You can use a SAS format to bin a continuous variable into ordinal categories. The following call to PROC FORMAT bins the data into five categories by using the cut points 3, 7, 12, and 20.

By default, the SGPLOT procedure will use the data colors in the current style to assign colors to groups, such as blue, red, green, brown, and purple. However, when the categories are ordinal, you might want to use a sequential or diverging color scheme to assign colors to group, similar to what the gradient color ramp provides. You can use the STYLEATTRS statement to assign colors to groups. The following is an initial attempt to create a discrete heat map. However, as you will see, the program contains a logical error:

```sas
proc format;
   value SoldFmt          /* bin into five groups */
      low -< 3   = "Almost None"
        3 -< 7   = "Few"
        7 -< 12  = "Moderate"
       12 -< 20  = "Many"
       20 - high = "Most";
run;

title "Discrete Heat Map";
title2 "Discrete Color Palette and Legend";
/* Attempt to use STYLEATTRS to define discrete colors.
   Does not work because the default group order is "data order". */
proc sgplot data=Sales;
   format QtySold SoldFmt.;    /* use a format to bin the response variable */
   styleattrs datacolors=(ModerateBlue VeryLightBlue CXF8F8F8 VeryLightRed ModerateRed);
   heatmapparm x=Week y=Product colorgroup=QtySold / outline discretex;
   keylegend;
run;
```

The heat map is shown, but it does not reflect the ordinal nature of the counts. I intentionally constructed the example so that the groups appear in the legend "out of order." The "Few" category appears before the "Almost None" category, and the "Most" category appears before the "Many" category. The STYLEATTRS statement correctly assigned colors to the groups, but the groups do not appear in fewest-to-most order.

This problem occurs because the order of the groups (and, therefore, their colors) is determined by the order in which they appear in the data set. There are several solutions to this problem, including sorting the data and adding fake observations to the data set. However, the best solution is to explicitly create a mapping between group values and colors. This is called a "discrete attribute map." A discrete attribute map enables you to associate colors (and other attributes) to groups, regardless of how the groups are sorted or used.
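For completeness, here is a sketch of the sorting workaround mentioned above. It assumes the Sales data set used throughout this post; sorting by the raw QtySold values forces the groups to appear in the data in ordinal order, but the sort must be repeated whenever the data change, which is why the attribute map is a better long-term solution:

```sas
/* Sketch of the sorting workaround (assumes the Sales data set).
   Sorting by the raw response variable makes the groups appear in
   the data in least-to-greatest order. */
proc sort data=Sales out=SalesSorted;
   by QtySold;
run;

proc sgplot data=SalesSorted;
   format QtySold SoldFmt.;   /* bin the response variable */
   styleattrs datacolors=(ModerateBlue VeryLightBlue CXF8F8F8 VeryLightRed ModerateRed);
   heatmapparm x=Week y=Product colorgroup=QtySold / outline discretex;
   keylegend;
run;
```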

If you encounter this "legend order" problem, a discrete attribute map is the most robust solution. The "map" is simply a data set that assigns attributes to each formatted value of the response variable. The PROC SGPLOT documentation for discrete attribute maps provides details about the names of variables in the data set.

For the heat map, the important attribute is the FILLCOLOR attribute of each cell. Thus, you need to create a data set that has five rows and two variables. The names of the primary columns must be Value and FillColor. You can hard-code the formatted values or you can use the PUT function to format the raw values, as shown in the following program. (I like the second option; it works even if you change the strings in PROC FORMAT.) You also might want to define the ID and Show variables. The ID variable is optional if the data set defines only one attribute map. If you set Show="AttrMap", the legend will show all of the possible values, even if the plotted data does not contain all of the groups.
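As an aside, the hard-coded alternative mentioned above might look like the following sketch. (The data set name Order2 is mine; the Value strings must match the formatted values exactly, which is why this approach breaks if you change the labels in PROC FORMAT.)

```sas
/* Sketch of the hard-coded alternative: type the formatted values directly.
   The $11. informat reads the first 11 columns as the Value string. */
data Order2;
length Value $11 FillColor $15;
retain ID 'SortOrder' Show 'AttrMap';
input Value $11. FillColor $;
datalines;
Almost None ModerateBlue
Few         VeryLightBlue
Moderate    CXF8F8F8
Many        VeryLightRed
Most        ModerateRed
;
```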

The following DATA step defines a discrete attribute map. Use the DATTRMAP= option on the PROC SGPLOT statement to use the mapping, as follows:

data Order;                   /* create discrete attribute map */
length Value $11 FillColor $15;
input raw FillColor;
Value = put(raw, SoldFmt.);   /* use format to assign values */
retain ID 'SortOrder'         /* name of map */
       Show 'AttrMap';        /* always show all groups in legend */
datalines;
0  ModerateBlue
3  VeryLightBlue
7  CXF8F8F8
12 VeryLightRed
20 ModerateRed
;

proc sgplot data=Sales dattrmap=Order;  /* use discrete attribute map */
   format QtySold SoldFmt.;
   heatmapparm x=Week y=Product colorgroup=QtySold / outline attrid=SortOrder;
   keylegend;                 /* will use the order in the attribute map */
run;

Success! The heat map uses the custom blue-white-red color ramp for the groups. The order of the items in the legend (and their attributes) is determined by the discrete attribute map. No matter what order the groups appear in the data, the legend will show the items in the correct ordinal order, from least to greatest.

For more information about legend order and discrete attribute maps, see Warren Kuhfeld's article "Legend order and group attributes."

In summary, this article shows how to use the HEATMAPPARM statement in PROC SGPLOT to create heat maps. Use the HEATMAPPARM statement when the (x, y) values are discrete and pre-summarized. By default, the HEATMAPPARM statement creates a continuous heat map and the GRADLEGEND statement displays a gradient legend. If the response variable is discrete, use the COLORGROUP= option on the HEATMAPPARM statement and use the KEYLEGEND statement to add a discrete legend. Remember that the order of the groups is determined by the order in which the groups appear in the data, but you can define a discrete attribute map to ensure that the groups appear in a specified order.

The post Create a discrete heat map with PROC SGPLOT appeared first on The DO Loop.