5月 022018
 

Order matters. When you create a graph that has a categorical axis (such as a bar chart), it is important to consider the order in which the categories appear. Most software defaults to alphabetical order, which typically gives no insight into how the categories relate to each other.

Alphabetical order rarely adds any insight to the data visualization. For a one-dimensional graph (such as a bar chart) software often provides alternative orderings. For example, in SAS you can use the CATEGORYORDER= option to sort categories of a bar chart by frequency. PROC SGPLOT also supports the DISCRETEORDER= option on the XAXIS and YAXIS statements, which you can use to specify the order of the categories.

It is much harder to choose the order of variables for a two-dimensional display such as a heat map (for example, of a correlation matrix) or a matrix of scatter plots. Both of these displays are often improved by strategically choosing an order for the variables so that adjacent levels have similar properties. Catherine Hurley ("Clustering Visualizations of Multidimensional Data", JCGS, 2004) discusses several ways to choose a particular order for p variables from among the p! possible permutations. This article implements the single-link clustering technique for ordering variables. The same technique is applicable for ordering levels of a nominal categorical variable.

Order variables in two-dimensional displays

I first implemented the single-link clustering method in 2008, which was before I started writing this blog. Last week I remembered my decade-old SAS/IML program and decided to blog about it. This article discusses how to use the program; see Hurley's paper for details about how the program works.

To illustrate ordering a set of variables, the following program creates a heat map of the correlation matrix for variables from the Sashelp.Cars data set. The variables are characteristics of motor vehicles. The rows and columns of the matrix display the variables in alphabetical order. The (i,j)th cell of the heat map visualizes the correlation between the i_th and j_th variable:

proc iml;
/* create correlation matric for variables in alphabetical order */
varNames = {'Cylinders' 'EngineSize' 'Horsepower' 'Invoice' 'Length' 
            'MPG_City' 'MPG_Highway' 'MSRP' 'Weight' 'Wheelbase'};
use Sashelp.Cars;  read all var varNames into Y;  close;
corr = corr(Y);
 
call HeatmapCont(corr) xvalues=varNames yvalues=varNames 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation Matrix: Variables in Alphabetical Order";
Heat map of correlation matrix; variables in alphabetical order

The heat map shows that the fuel economy variables (MPG_City and MPG_Highway) are negatively correlated with most other variables. The size of the vehicle and the power of the engine are positively correlated with each other. The correlations with the price variables (Invoice and MSRP) are small.

Although this 10 x 10 heat map visualizes all pairwise correlations, it is possible to permute the variables so that highly correlated variables are adjacent to each other. The single-link cluster method rearranges the variables by using a symmetric "merit matrix." The merit matrix can be a correlation matrix, a distance matrix, or some other matrix that indicates how the i_th category is related to the j_th category.

The following statements assume that the SingleLinkFunction is stored in a module library. The statements load and call the SingleLinkCluster function. The "merit matrix" is the correlation matrix so the algorithm tries to put strongly positively correlated variables next to each other. The function returns a permutation of the indices. The permutation induces a new order on the variables. The correlation matrix for the new order is then displayed:

/* load the module for single-link clustering */
load module=(SingleLinkCluster);
permutation = SingleLinkCluster( corr );  /* permutation of 1:p */
V = varNames[permutation];                /* new order for variables */
R = corr[permutation, permutation];       /* new order for matrix */
call HeatmapCont(R) xvalues=V yvalues=V 
            colorramp=colors range={-1.01 1.01} 
            title="Correlation: Variables in Clustered Order";
Heat map of correlation matrix; variables in clustered order

In this permuted matrix, the variables are ordered according to their pairwise correlation. For this example, the single-link cluster algorithm groups together the three "size" variables (Length, Wheelbase, and Weight), the three "power" variables (EngineSize, Cylinders, and Horsepower), the two "price" variables (MSRP and Invoice), and the two "fuel economy" variables (MPG_Highway and MPG_City). In this representation, strong positive correlations tend to be along the diagonal, as evidenced by the dark green colors near the matrix diagonal.

The same order is useful if you want to construct a huge scatter plot matrix for these variables. Pairs of variables that are "interesting" tend to appear near the diagonal. In this case, the definition of "interesting" is "highly correlated," but you could choose some other statistic to build the "merit matrix" that is used to cluster the variables. The scatter plot matrix is shown below. Click to enlarge.

Scatter plot matrix of 10 variables. The highly correlated variable tend to appear together so that diagonal scatter plot exhibit high correlation.

Other merit matrices

I remembered Hurley's article while I was working on an article about homophily in married couples. Recall that homophily is the tendency for similar individuals to associate, or in this case to marry partners that have similar academic interests. My heat map used the data order for the disciplines, but I wondered whether the order could be improved by using single-link clustering. The data in the heat map are not symmetric, presumably because women and men using different criteria to choose their partners. However, you can construct the "average homophily matrix" in a natural way. If A is any square matrix then (A` + A)/2 is a symmetric matrix that averages the lower- and upper-triangular elements of A. The following image shows the "average homophily" matrix with the college majors ordered by clusters:

The heat map shows the "averaged data," but in a real study you would probably want to display the real data. You can see that green cells and red cells tend to occur in blocks and that most of the dark green cells are close to the diagonal. In particular, the order of adjacent majors on an axis gives information about affinity. For example, here are a few of the adjacent categories:

  • Biology and Medical/Health
  • Engineering tech, computers, and engineering
  • Interdisciplinary studies, philosophy, and theology/religion
  • Math/statistics and physical sciences
  • Communications and business

Summary

In conclusion, when you create a graph that has a nominal categorical axis, you should consider whether the graph can be improved by ordering the categories. The default order, which is often alphabetical, makes it easy to locate a particular category, but there are other orders that use the data to reveal additional structure. This article uses the single-link cluster method, but Hurley (2004) mentions several other methods.

The post Order variables in a heat map or scatter plot matrix appeared first on The DO Loop.

5月 022018
 

During SAS Global Forum 2018, SAS instructor Charu Shankar sat down with four SAS users to get their take on what makes them a SAS user. Read through to find valuable tips they shared and up your SAS game. I’m sure you will come away inspired, as you discover some universal commonalities in being a SAS user.

The post What makes a SAS user? Order, logic and magic: Louise Hadden appeared first on SAS Learning Post.

5月 012018
 

As health care evolves, its entire ecosystem – from payers and providers to pharmaceutical companies and government agencies – seeks to find common ground. More data is available than ever. But transforming information into innovation is challenging. Organizations strive to create shared goals, internally and externally, trying to improve patient [...]

How will IoT and AI drive transformation in health care and life sciences? was published on SAS Voices by Kristine Vick

5月 012018
 

The title of this blog says what you really need to know: SAS Enterprise Guide does have a future, and it's a bright one. Ever since SAS Studio debuted in 2014, onlookers have speculated about its impact on the development of SAS Enterprise Guide.

I think that we have been consistent with our message that SAS Enterprise Guide serves an important purpose -- a power-user interface for SAS on the desktop -- and that the product will continue to get support and new features. But that doesn't stop folks from wondering whether it might meet sudden demise like a favorite Star Wars or Game of Thrones character.

I recently recorded a session with Amy Peters, the SAS product manager for SAS Enterprise Guide and SAS Studio. Amy loves to meet with SAS users and hear their successes, their concerns, and their ideas. Her enthusiasm for SAS Enterprise Guide comes through in this video, even as I bumble my way through the prototype of the Next Big Release.

Coming soon: the features of a modern IDE

In addition to a much-needed makeover and modern appearance, the new version of SAS Enterprise Guide (scheduled for sometime in 2019) addresses many of the key requests that we hear from SAS users. First, the new version blows open the window management capabilities. You can open and view many items -- programs, data, log, results -- at the same time, and arrange those views exactly as you want. You can spread your workspace over multiple displays. And you can tear away or dock each item to suit your working style.

Screenshot of Future EG

(in development) screenshot of SAS Enterprise Guide

Second, you can decide whether you want to work with a SAS Enterprise Guide project -- or just simply write and run code. Currently you must start with a project before you can create or open anything else. The new version allows you leverage a project to organize your work...or not, depending on your need at the moment.

And finally, you can expect more alignment and collaboration features between SAS Studio and SAS Enterprise Guide. We see that more users find themselves using both interfaces for related tasks, and presenting a common experience is important. SAS Studio runs in your browser while SAS Enterprise Guide works on your desktop. Each application has different capabilities related to that, but there's no reason that they need to be so different, right?

For more information about what the future will bring, check out the communities article that recaps the SAS Global Forum 2018 presentation. It includes an attached presentation slide deck with many exciting screenshots and roadmap details. All of this is subject to change, of course (including release dates!), but I think it's safe to say the future is bright for SAS users who love their tools.

The post A productive future for SAS Enterprise Guide appeared first on The SAS Dummy.

4月 302018
 

How old was the oldest person in your family, or the oldest person you personally know? And how do they compare to the oldest people in the world? ... Perhaps you can easily make the comparison, with this cool graph! But before we get started, here's a picture of my [...]

The post Keeping track of the oldest people in the world appeared first on SAS Learning Post.

4月 302018
 

When you ask Mark Yost about what he does at SAS every day, his eyes light up. Mark’s our Assistant Manager for Food Services at SAS, but he’s so much more than that. He’s a champion for diversity and a leader who empowers his team to be the same. For [...]

Embracing all abilities: The ultimate win-win was published on SAS Voices by Allison Bonner

4月 302018
 

Some say that opposites attract. Others say that birds of a feather flock together. Which is it? Phillip N. Cohen, a professor of sociology at the University of Maryland, recently posted an interesting visualization that indicates that married couples who are college graduates tend to be birds of a feather. This article discusses Cohen's data and demonstrates some SAS techniques for assigning colors to a heat map, namely how to log-transform the response and how to bin the response variable by quantiles.

Cohen's data and heat map

Cohen's (2018) haet map of the ratio of observed to expected count of marriages between women and men, by college major

Cohen cross-tabulated the college majors of 27,806 married couples from 2009–2016. The majors are classified into 28 disciplines such as Architecture, Business, Computer Science, Engineering, and so forth. The data are counts in a 28 x 28 table. If the (i,j)th cell in the table has the value C[i,j], then there are C[i,j] married couples in the data for which the wife had the i_th major and the husband has the j_th major.

Cohen used a heat map (shown to the right) to visualize associations among the wife's and husband's majors. This heat map shows the ratio of observed counts to expected counts, where the expected counts are computed by assuming the wife's major and the husband's major are independent. In this heat map, the woman's major is listed on the vertical axis; the man's major is on the horizontal axis. Green indicates more marriages than expected for that combination of majors, yellow indicates about as many marriages as expected, and red indicates fewer marriages than expected. The prevalence of green along the diagonal cells in the heat map indicates that many college-educated women marry a man who studied a closely related discipline. The couples are birds of a feather.

Cohen's heat map displays ratios. A ratio of 1 means that the observed number of married couples equals the expected number. For these data, the ratio ranges from 0 (no observed married couples) to 31. The highest ratio was observed for women and men who both majored in Theology and Religious Studies. The observed number of couples is 31 times more than expected under independence. (See the 19th element along the diagonal.)

A heat map of relative differences

You might wonder why Cohen used the ratio instead of the difference between observed and expected. One reason to form a ratio is that it adjusts for the unequal popularity of majors. Business, engineering, and education are popular majors. Many fewer people major in religion and ethnic studies. If you don't account for the differences in popularity, you will see only the size of the majors, not the size of the differences.

Heat map of relative differences observed to expected counts of marriages between women and men, by college major

A different way to view the data is to visualize the relative difference: (Observed – Expected) / Expected. The relative differences range from -1 (no observed couples) to 30 (for Theology majors). In percentage terms, this range is from -100% to 3000%. If you create a heat map of the relative difference and linearly associate colors to the relative differences, only the very largest differences are visible, as shown in the heat map to the right.

This heat map is clearly not useful, but it illustrates a very common problem that occurs when you use a linear scale to assign colors to heat maps and choropleth maps. Often the data vary by several orders of magnitude, which means that a linear scale is not going to reveal subtle features in the data. As shown in this example, you typically see a small number of cells that have large values and the remaining cell values are indistinguishable.

There are two common ways to handle this situation:

  • If your audience is mathematically savvy, use a nonlinear transformation to map the values into a linear color ramp. A logarithmic transformation is the most common way to accomplish this.
  • Otherwise, bin the data into a small number of discrete categories and use a discrete color ramp to assign colors to each of the binned values. For widely varying data, quantiles are often an effective binning strategy.

Log transformation of counts

A logarithmic transform (base 10) is often the simplest way to visualize data that span several orders of magnitudes. When the data contain both positive and negative values (such as the relative differences), you can use the log-modulus transform to transform the data. The log-modulus transform is the transformation x → sign(x) * log10(|x| + 1). It preserves the sign of the data while logarithmically scaling it. The following heat map shows the marriage data where the colors are proportional to the log-modulus transform of the relative difference between the observed and expected values.

Heat map of the log-modulus transform of the relative differences between observed and expected counts of marriages between women and men, by college major

As in Cohen's heat map of the ratios, mapping a continuous color ramp onto the log-transformed values divides the married couples into three visual categories: The reddish-orange cells indicate fewer marriages than expected, the cream-colored cells indicate about as many marriages as expected, and the light green and dark green cells indicate many more marriages than expected. The drawback of the log transformation is that the scale of the gradient color ramp is not easy to understand. I like to add tooltips to the HTML version of the heat map so that the user can hover the mouse pointer over a cell to see the underlying values.

Analysis of the marriage data

What can we learn about homophily (the tendency for similar individuals to associate) from studying this heat map? Here are a few observations. Feel free to add your own ideas in the comments:

  • The strongest homophily is along the diagonal elements. In particular, Theology/Religion majors are highly likely to marry each other, as are agriculture, environment, and architecture majors.
  • Female engineers, computer scientists, and mathematicians tend to marry like-minded men. Notice the large number of red cells in those rows for the non-mathematical disciplines.
  • Women in medical/health field or business are eclectic. Those rows do not have many dark-shaded cells, which indicates a general willingness to associate with men inside and outside of their subject area.

Discrete heat map by quantiles

It is easiest to interpret a heat map when the colors correspond to the actual counts or simple statistics such as the relative difference. The simplest way to assign colors to the cells in the heat map is to bin the data. This results in a discrete set of colors rather than the continuous color ramp in the previous example. You can bin the data by using five to seven quantiles. Alternatively, you can use domain-specific or clinical ranges. (For example, you could use clinical "cut points" to bin blood pressure data into "normal," "elevated," and "hypertensive" categories. In SAS, you can use PROC FORMAT to create a customized format for the response variable and color according to the formatted values.

The following heat map uses cut points to bin the data according to the relative difference between observed and expected values. As before, red and orange cells indicate observed differences that are negative. The cream and light green cells indicate relative differences that are close to zero. The darker greens indicate observed differences that are a substantial fraction of the expected values. The darkest green indicates cells for which the difference is more than 100% greater than the expected value.

When you use a discrete color ramp, all values within a specified interval are mapped to the same color. In this heat map, most cells on the diagonal have the same color; you cannot use the heat map to determine which cell has the largest relative difference.

Heat map of the binned relative differences between observed and expected counts of marriages between women and men, by college major

Summary

There is much more to say about these data and about ways to use heat maps to discover features in the data. However, I'll stop here. The main ideas of this article are:

  • The data are interesting. In general, the married couples in this study tended to have similar college majors. The strength of that relationship varied according to the discipline and also by gender (the heat map is not symmetric). For more about the data, see Professor Cohen's web page for this study.
  • When the response variable varies over several orders of magnitude, a linear mapping of colors to values is not adequate. One way to accommodate widely varying data is to log-transform the response. If the response has both positive and negative values, you can use the log-modulus transform.
  • For a less sophisticated audience, you can assign colors by binning the data and using a discrete color ramp.

You can download the SAS program that I used to create this heat maps.

The post Assign colors in heat maps: A study of married couples and college majors appeared first on The DO Loop.

4月 272018
 

Analyzing ticket sales and customer data for large sports and entertainment events is a complex endeavor. But SAS Visual Analytics makes it easy, with location analytics, customer segmentation, predictive artificial intelligence (AI) capabilities – and more. This blog post covers a brief overview of these features by using a fictitious event company [...]

Analyze ticket sales using location analytics and customer segmentation in SAS Visual Analytics was published on SAS Voices by Falko Schulz

4月 262018
 

A future of flying cars and Minority Report-styled predictive dashboards may still be some time away, but the possibilities of robotics and Artificial Intelligence (AI)-powered automation are a reality today. From connected cars to smart homes and offices, we see daily how big data and the Internet of Things (IoT) [...]

IoT use cases on display at new innovation centre was published on SAS Voices by Randy Goh