GTL

July 8, 2019
 

The mosaic plot is a graphical visualization of a frequency table. In previous articles, I showed how to create a mosaic plot in SAS by using PROC FREQ and how to define a template in the Graph Template Language (GTL) by using the MOSAICPLOTPARM statement. This article shows how to display additional information on a mosaic plot. The two techniques in this article are

  • For interactive displays, add tool tips (also called infotips or "data tips") to the mosaic plot. When you hover the mouse pointer over a cell, SAS displays information about cell counts and percentages.
  • For static displays, use the GTL annotation facility to overlay cell counts or percentages on a two-way mosaic plot. A result is shown to the right.

Add data tips to any SAS graph

When you analyze data in SAS, many SAS procedures can automatically create graphs that are appropriate for the analysis. Most of these graphs support data tips that provide information about the data when you hover a mouse pointer over a graph component. I like to use this feature for "area graphs" such as bar charts, histograms, and mosaic plots.

It is easy to turn on tool tips: you simply specify the IMAGEMAP=ON option on the ODS GRAPHICS statement. Because PROC FREQ can create a mosaic plot, the following statements draw a mosaic plot with tool tips for the Origin and Type variables in the Sashelp.Cars data set:

/* Use tool tips to see details of a mosaic plot */
ods graphics on / imagemap=ON;         /* enable data tips */
proc freq data=Sashelp.Cars;
   where Type ^= 'Hybrid';
   tables Origin * Type / plots=mosaic
                          out=FreqOut(where=(Percent^=.)); /* output stats for next section */
run;

When you hover the mouse pointer over a cell, the graph displays a tool tip. The tip for the center cell shows that the cell represents Origin=Europe and Type=Sedan. The center cell represents 78 vehicles or 18.4% of the total number of vehicles in the data.
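The same option works for other "area graphs." As a minimal sketch (the choice of a bar chart of the Type variable is mine, for illustration only), the following statements enable data tips for a PROC SGPLOT bar chart and then turn the option off again. Data tips are shown only when the graph is sent to an HTML destination.

```sas
/* A sketch: data tips on a bar chart (HTML destination assumed) */
ods graphics on / imagemap=ON;    /* enable data tips */
proc sgplot data=Sashelp.Cars;
   vbar Type;                     /* hover over a bar to see its frequency */
run;
ods graphics / imagemap=OFF;      /* restore the default when finished */
```

Turning the option off afterward is a design choice: image maps add size to the HTML output, so you might not want them for every graph in a session.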

Create a mosaic plot from the output of PROC FREQ

Unfortunately, the mosaic plot is not supported by PROC SGPLOT in SAS 9.4M6, but the MOSAICPLOTPARM statement in the Graph Template Language (GTL) enables you to create a mosaic plot. The following statements display the PROC FREQ template in the SAS log:

/* view the template for the MosaicPlot in PROC FREQ */
proc template;
   source Base.Freq.Graphics.MosaicPlot;
run;

You can copy the basic structure of the Base.Freq.Graphics.MosaicPlot template to create your own template. You need to add an ANNOTATE statement if you want to support annotation, as follows:

proc template;
  define statgraph mosaicPlotParm;
  dynamic _VERTVAR _HORZVAR _FREQ _TITLE;
    begingraph;
      entrytitle _TITLE;
      layout region;          /* REGION layout, so can't overlay text! */
      MosaicPlotParm category=(_HORZVAR _VERTVAR) count=_FREQ / 
             datatransparency=0.5
             colorgroup=_VERTVAR name="mosaic";
      endlayout;
      annotate;               /* required for annotation */
    endgraph;
  end;
run;
 
proc sgrender data=FreqOut template=mosaicPlotParm;
dynamic _VERTVAR="Origin" _HORZVAR="Type" _FREQ="Count"
        _TITLE="Basic Mosaic Plot with No Labels";
run;

Notice that the FreqOut data (which I created by using the OUT= option on the TABLES statement in PROC FREQ) has the cell counts in a different order than the data object that the PLOTS=MOSAIC option uses. The mosaic plot I created has the vertical axis "pointing up" whereas the vertical axis in the PROC FREQ graph "points down" to match the frequency table that the procedure creates.

My initial idea was to overlay a text plot on the mosaic plot and use the text plot to show the cell counts or percentages. However, the MOSAICPLOTPARM statement must be part of a LAYOUT REGION block. A LAYOUT REGION block supports only one plot, a mosaic plot or a pie chart; you cannot overlay another plot such as a text plot or a scatter plot on a "region plot."

Therefore, the only choice for adding text to a mosaic plot is to use the GTL annotation facility. This is not as easy as I'd hoped because "region plots," which do not have axes, do not support data coordinates. This means that you cannot use the DATAVALUE or WALLPERCENT drawing areas, which are the most useful drawing areas for data-dependent annotations. The only choices for drawing areas are the GRAPHPERCENT and LAYOUTPERCENT areas. Of these, the LAYOUTPERCENT is better because annotations in the layout area do not shift around if you decide to add a title or footnote to your mosaic plot. The horizontal portion of the LAYOUTPERCENT drawing area goes from the vertical axis label to the right edge of the graph region. The vertical portion goes from the horizontal axis label to the top edge of the graph region.

Create an annotation for a mosaic plot

This section describes how to create an annotation data set for a region plot. For an introduction to GTL annotation, see the following articles:

The goal of this section is to annotate a mosaic plot, but the same ideas will work on a pie chart, which is also a "region plot." The following annotation uses the LAYOUTPERCENT drawing space. The annotation consists of a series of 'text' function calls. Each 'text' function must be supplied with the following information:

  • The Label variable specifies the text to display.
  • The x1 and y1 variables specify the coordinates of the label (in the LAYOUTPERCENT drawing space).
  • The Width variable specifies the width of the label and the Anchor variable specifies how the text is anchored (left, right, centered,...) at the (x1, y1) location.

The following data set specifies the center of the text in LAYOUTPERCENT coordinates. In a follow-up article, I will show how to compute these values. For now, just assume that the coordinates are provided. In many annotation examples, coming up with the coordinates is an iterative process of guessing values, plotting them, and then revising the guess.

Regardless of how the coordinates are obtained, you can read the coordinates into an annotation data set and assign the special variable names that SG annotation looks for, such as x1, y1, Label, and Width:

data AnnoData;
length Type $8 Origin $6;
input Type Origin hCenter vCenter Freq Pct;
datalines;
SUV    Asia   16.35 28.75 25 5.8824 
Sedan  Asia   50.45 26.15 94 22.1176 
Sports Asia   83.38 25.61 17 4.0000 
Truck  Asia   91.11 25.00  8 1.8824 
Wagon  Asia   96.82 26.50 11 2.5882 
SUV    Europe 16.35 55.00 10 2.3529 
Sedan  Europe 50.45 55.69 78 18.3529 
Sports Europe 83.38 62.35 23 5.4118 
Truck  Europe 91.11 40.00  0 0.0000 
Wagon  Europe 96.82 61.00 12 2.8235 
SUV    USA    16.35 81.25 25 5.8824 
Sedan  USA    50.45 84.54 90 21.1765 
Sports USA    83.38 91.73  9 2.1176 
Truck  USA    91.11 70.00 16 3.7647 
Wagon  USA    96.82 89.50  7 1.6471 
;
 
data anno;
set AnnoData;
length label $12;
/* use RETAIN stmt to define values that are constant */
retain function 'text' 
       y1space 'layoutpercent' x1space 'layoutpercent'
       width 4         /* text box width = 4% of layout range */
       anchor 'center';
/* for the TEXT function, need (x1, y1) coords and Label */
x1 = hCenter;
y1 = vCenter;
label = put(Freq, 4.); /* use 4. format for count */           
/*
label = put(Pct/100, PERCENT7.1); width=7;
*/
run;
 
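/* the template gets its variables and title from dynamic variables */
proc sgrender data=FreqOut template=mosaicPlotParm sganno=anno;
dynamic _VERTVAR="Origin" _HORZVAR="Type" _FREQ="Count"
        _TITLE="Basic Mosaic Plot with Labels";
run;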

The mosaic plot with annotation is shown at the top of this article. The DATA step that creates the annotation data set includes a comment that shows how you can display percentages instead of counts. Of course, if your counts are large (thousands), you should increase the Width value and the field width of the format in the PUT statement. You can also modify the program to omit labels for small cells. For example, you might not want to label cells that have fewer than 2% of the sample size.
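As a sketch of the last suggestion, the following DATA step keeps only the cells that contain at least 2% of the observations. It assumes that the AnnoData and FreqOut data sets and the mosaicPlotParm template from this article already exist. The data set name annoBig and the 2% cutoff are my own choices for illustration.

```sas
/* A sketch: label only cells that contain at least 2% of the sample */
data annoBig;
set AnnoData;
where Pct >= 2;            /* omit labels for small cells */
length label $12;
retain function 'text'
       y1space 'layoutpercent' x1space 'layoutpercent'
       width 4
       anchor 'center';
x1 = hCenter;
y1 = vCenter;
label = put(Freq, 4.);
run;

proc sgrender data=FreqOut template=mosaicPlotParm sganno=annoBig;
dynamic _VERTVAR="Origin" _HORZVAR="Type" _FREQ="Count"
        _TITLE="Mosaic Plot with Labels for Large Cells";
run;
```

Only the annotation data set is filtered. The mosaic plot itself is still drawn from the full FreqOut data, so the tile sizes do not change.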

The key to this example is that the mosaic plot is a region plot. You cannot overlay plots (such as a text plot) on a region plot. Therefore, you must use an annotation. Furthermore, you must use the LAYOUTPERCENT drawing area, which is somewhat inconvenient. How I wish I could use the WALLPERCENT drawing space with a range of [0, 100] in each direction!

In my next article, I will show how to obtain the locations of the annotation centers from the data.

The post How to add an annotation to a mosaic plot in SAS appeared first on The DO Loop.

October 12, 2015
 

This article shows how to visualize a surface in SAS. You can use the SURFACEPLOTPARM statement in the Graph Template Language (GTL) to create a surface plot. But don't worry, you don't need to know anything about GTL: just copy the code in this article and replace the names of the data set and variables.

In some situations, you don't have to write any code because some SAS procedures create surface plots automatically. For example, the RSREG procedure produces surface plots that are quadratic regression surfaces. The KDE procedure produces surfaces for bivariate density estimates.

An alternative to plotting a surface is to use a contour plot. For many situations a contour plot is ideal, and you never need to worry that a ridge in the foreground of a surface will hide details of a valley behind the ridge. Many SAS procedures (GLM, KRIGE, PLM,...) create contour plots automatically when it is appropriate.
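If no procedure creates the contour plot for you, the GTL can draw one from gridded data. The following is a minimal sketch that uses the CONTOURPLOTPARM statement. The template name ContourTmplt, the data set name Grid, and the test function are assumptions that I made for illustration.

```sas
/* A sketch: filled contour plot of a function on a regular grid */
data Grid;
do x = -1 to 1 by 0.05;
   do y = -1 to 1 by 0.05;
      z = x**3 - y**2 - x + 0.5;
      output;
   end;
end;
run;

proc template;
define statgraph ContourTmplt;
dynamic _X _Y _Z _Title;               /* assign variables at run time */
begingraph;
   entrytitle _Title;
   layout overlay;
      contourplotparm x=_X y=_Y z=_Z /
         name="contour" contourtype=fill;
      continuouslegend "contour";
   endlayout;
endgraph;
end;
run;

proc sgrender data=Grid template=ContourTmplt;
   dynamic _X='x' _Y='y' _Z='z' _Title="Contour Plot of a Cubic Function";
run;
```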

Creating a surface plot in SAS with ODS graphics: The template

Use the GTL to create a surface plot. This section describes how to define a graph template for a surface plot. The next section shows how to display the plot by using the SGRENDER procedure.

The following template is adapted from the documentation for the SURFACEPLOTPARM statement, which explains various options for visualizing the surface:

proc template;                        /* surface plot with continuous color ramp */
define statgraph SurfaceTmplt;
dynamic _X _Y _Z _Title;              /* dynamic variables */
 begingraph;
 entrytitle _Title;                   /* specify title at run time (optional) */
  layout overlay3d;
    surfaceplotparm x=_X y=_Y z=_Z /  /* specify variables at run time */
       name="surface" 
       surfacetype=fill
       colormodel=threecolorramp      /* or =twocolorramp */
       colorresponse=_Z;
    continuouslegend "surface";
  endlayout;
endgraph;
end;
run;

The template describes the layout for a surface plot. Instead of hard-coding the names of variables, the template uses dynamic variables so that you can reuse the template for many different data sets. The dynamic variables (and the title) are assigned values when you use the SGRENDER procedure to render the graph.

You only need to run the previous statements once. The statements create a template called SurfaceTmplt. Whenever you want to create a surface, simply use PROC SGRENDER, as shown in the next section.

Creating a surface plot in SAS: The rendering

To use the template, you need to have values for the surface on a grid of (x,y) locations. The following DATA step generates values for a cubic function of two variables that I previously used to construct heat maps. The surface plot of this function is shown at the top of this article.

/* sample data: a cubic function of two variables */
%let Step = 0.04;
data A;
do x = -1 to 1 by &Step;
   do y = -1 to 1 by &Step;
      z = x**3 - y**2 - x + 0.5;
      output;
   end;
end;
run;
 
proc sgrender data=A template=SurfaceTmplt; 
   dynamic _X='X' _Y='Y' _Z='Z' _Title="Cubic Surface";
run;

For this surface, the COLORMODEL=threecolorramp option was used to color the surface according to a three-color spectrum of colors. For other surfaces, a two-color spectrum (COLORMODEL=twocolorramp) might be more appropriate.
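As a sketch of the two-color alternative, the following template differs from SurfaceTmplt only in the COLORMODEL= option. The template name SurfaceTwoColor and the sample function (a nonnegative "bump") are hypothetical choices; a two-color ramp seems natural here because the response has no meaningful midpoint.

```sas
/* A sketch: the same surface template with a two-color ramp */
proc template;
define statgraph SurfaceTwoColor;
dynamic _X _Y _Z _Title;
 begingraph;
 entrytitle _Title;
  layout overlay3d;
    surfaceplotparm x=_X y=_Y z=_Z /
       name="surface"
       surfacetype=fill
       colormodel=twocolorramp        /* low-to-high ramp, no middle color */
       colorresponse=_Z;
    continuouslegend "surface";
  endlayout;
endgraph;
end;
run;

data Bump;                            /* nonnegative response */
do x = -2 to 2 by 0.1;
   do y = -2 to 2 by 0.1;
      z = exp(-(x**2 + y**2));
      output;
   end;
end;
run;

proc sgrender data=Bump template=SurfaceTwoColor;
   dynamic _X='x' _Y='y' _Z='z' _Title="Gaussian Bump";
run;
```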

Most of the time you have data that is already arranged on a grid of (x, y) values. If you have irregularly spaced data, you can use your favorite regression procedure to fit a surface to the data, then use the PLM procedure to score the model on a regular grid of points. You can download a SAS program that analyzes irregularly spaced data and scores a regression model on a regular grid so that the surface can be visualized. Alternatively, if you have SAS/GRAPH software, you can use the G3GRID procedure to interpolate the values onto a regular grid.
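The G3GRID approach might look like the following sketch. It assumes that you have SAS/GRAPH software and that the SurfaceTmplt template from this article has been compiled. The data set names (Irregular, Gridded), the random seed, and the grid spacing are made up for the example.

```sas
/* A sketch: interpolate irregularly spaced data onto a regular grid */
data Irregular;                       /* 200 random (x,y) locations */
call streaminit(12345);
do i = 1 to 200;
   x = 2*rand("Uniform") - 1;
   y = 2*rand("Uniform") - 1;
   z = x**3 - y**2 - x + 0.5;
   output;
end;
drop i;
run;

proc g3grid data=Irregular out=Gridded;
   grid y*x = z / axis1=-0.9 to 0.9 by 0.05   /* grid values for first variable */
                  axis2=-0.9 to 0.9 by 0.05;  /* grid values for second variable */
run;

proc sgrender data=Gridded template=SurfaceTmplt;
   dynamic _X='x' _Y='y' _Z='z' _Title="Interpolated Surface";
run;
```

The grid stops slightly inside the range of the data so that most grid points are interpolated rather than extrapolated.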

tags: GTL, Statistical Graphics

The post Create a surface plot in SAS appeared first on The DO Loop.

December 11, 2013
 

A heat map is a graphical representation of a matrix that uses colors to represent values in the matrix cells. Heat maps often reveal the structure of a matrix. There are three common applications of visualizing matrices with heat maps:

  • Visualizing a correlation or covariance matrix reveals relationships between variables. Chris Hemedinger has written an article that describes how to visualize correlation matrices by using a heat map.
  • Visualizing a data matrix reveals outliers, missingness patterns, and more. I will discuss this application in a future blog post.
  • The first two applications are usually visualized by using a color ramp with a continuous color gradient. If the matrix contains a small number of discrete values, it is preferable to use a discrete palette of colors. Heat maps with discrete color palettes are useful for visualizing structured covariance matrices and the nonzero pattern of sparse matrices.

This article describes how to use a heat map to visualize matrices that contain a small number of discrete values.

A structured covariance matrix

In my book Simulating Data with SAS, I simulate data from a repeated-measures model that has a block-diagonal covariance structure. The following SAS/IML statements create a 45 x 45 matrix that consists of nine 5 x 5 blocks:

proc iml;
k=5;                        /* number of repeated measurements */
s=9;                        /* number of individuals           */
B = 1.4*j(k,k,1) + 2*I(k);  /* compound symmetric matrix       */
R = I(s) @ B;               /* block-diagonal matrix           */ 
print R;

This matrix is too large to easily view in printed form, but you can create a heat map that visualizes the matrix by assigning colors to the three values in the matrix.

Create a data set for the matrix in "long form"

I need to write the SAS/IML matrix to a data set so that it can be read by PROC SGRENDER, which will create the heat map by using a custom GTL template. It turns out that the HEATMAPPARM statement in the GTL language requires that the data set represent the matrix in "long form," which I have discussed in a previous blog post. For specific details, you can download the SAS program that generates the plots in this article.
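The downloadable program is not reproduced here, but one way to create the long form is sketched below. This is my own construction, not necessarily what the download does. The data set name BlockDiag and the variable names row, col, and X are chosen to match the PROC SGRENDER call later in this article.

```sas
/* A sketch: write an n x p matrix to a data set in long form */
proc iml;
k=5;  s=9;
B = 1.4*j(k,k,1) + 2*I(k);
R = I(s) @ B;                              /* 45 x 45 block-diagonal matrix */

row = repeat(T(1:nrow(R)), 1, ncol(R));    /* row index of each cell    */
col = repeat(1:ncol(R), nrow(R), 1);       /* column index of each cell */
M = colvec(row) || colvec(col) || colvec(R);  /* one data row per cell  */

create BlockDiag from M[colname={"row" "col" "X"}];
append from M;
close BlockDiag;
quit;
```

Each of the three matrices is unrolled in the same (row-major) order, so the i_th row of M contains the row index, column index, and value for a single cell.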

A template for visualizing a matrix with a small number of unique values

The template to visualize the heat map is straightforward. It contains the following noteworthy features:

  • The DYNAMIC statement enables you to specify the names of the data set variables at run time. I like to use this statement so that I can re-use my templates, but you are welcome to hard-code the names of the variables into the template if you prefer.
  • The LAYOUT OVERLAY statement specifies three things.
    1. It specifies the aspect ratio of the plot so that square matrices look square. The aspect ratio interacts with the height and width of the graph as set by the ODS GRAPHICS statement.
    2. It specifies that the axes are discrete, rather than continuous.
    3. It specifies the features of the axes to display. For matrices with fewer than 100 rows or columns, I like to display tick marks and values. For larger matrices, I don't.
  • The HEATMAPPARM statement creates the heat map from the data.
  • The DISCRETELEGEND statement creates a legend that shows the association between the matrix values and the colors.

proc template;
define statgraph HeatmapDisc;
dynamic _X _Y _Z;
begingraph;
   layout overlay/ aspectratio=1  /* optional: for square matrices */
                  xaxisopts=(type=discrete discreteopts=(tickvaluefitpolicy=THIN)
                             display=(line ticks tickvalues))
                  yaxisopts=(type=discrete discreteopts=(tickvaluefitpolicy=THIN)
                             display=(line ticks tickvalues) reverse=true);
      heatmapparm x=_X y=_Y colorgroup=_Z / xbinaxis=false ybinaxis=false
                  name="heatmap" primary=true display=ALL;
      discretelegend "heatmap";
   endlayout;
endgraph;
end;
run;
 
proc sgrender data=BlockDiag template=HeatmapDisc;
   dynamic _X="col" _Y="row" _Z="X";
run;

Clearly, the heat map has an advantage over the printed output. The display is smaller, and the global structure of the matrix is readily apparent. At a glance you can see that the matrix is composed of 5 x 5 blocks that contain a large value on the diagonal and smaller values on the off-diagonal. The remaining matrix values are zero.

Visualizing a sparse or binary matrix

Another common application of visualizing matrices is using a heat map to show the structure of a sparse matrix (zero and nonzero cells) or matrices that occur in experimental designs. For example, Hadamard matrices are used to make orthogonal array experimental designs for two-level factors. The following SAS/IML statement creates a 64 x 64 matrix that contains the values 1 and –1:

X = hadamard(64);           /* 64 x 64 Hadamard matrix */

If you write that matrix (in "long form") to a SAS data set, you can visualize it by using the same GTL template:

proc sgrender data=Hadamard template=HeatmapDisc;
   dynamic _X="col" _Y="row" _Z="X";
run;

Again, the heat map makes the global structure of the matrix apparent. At a glance you can see that the matrix is composed of two values in a pattern that has many symmetries. Closer inspection reveals that the matrix is symmetric (X = X`) and that each row and column, apart from the first (which consists entirely of 1s), has an equal number of positive and negative values. You can also pick out a "self-similar" structure in the sense that the matrix is composed of four 32 x 32 Hadamard blocks, which are themselves composed of four 16 x 16 Hadamard blocks, and so on, recursively.

In this article, I let the SGRENDER procedure pick default colors for the heat maps. The colors come from the current ODS style, which you can change. Alternatively, you can specify colors in your template, which I will demonstrate in a future blog post.

tags: GTL, Statistical Graphics
November 6, 2013
 

The mosaic plot is a graphical visualization of a frequency table. In a previous post, I showed how to use the FREQ procedure to create a mosaic plot. This article shows how to create a mosaic plot by using the MOSAICPLOTPARM statement in the Graph Template Language (GTL). (The MOSAICPLOTPARM statement was added in SAS 9.3m2.) The GTL gives you control over the characteristics of the plot, including how to color each tile.

A basic template for a mosaic plot

The MOSAICPLOTPARM statement produces a mosaic plot from pre-summarized categorical data. Therefore, the first step is to specify a set of categories and the frequencies for each category. I'll use the same Sashelp.Heart data set that I used in my previous post. You can download the program that specifies the order of levels for certain categorical variables. The following statements use PROC FREQ to summarize the table. The summary is written to the FreqOut data set, which is used to create the mosaic plot. (Additional statistics for each cell, such as the expected value and the standardized residual under the hypothesis of no association between blood pressure and weight, are computed in a later section.)

/* summarize the data */
proc freq data=heart;
tables BP_Cat*Weight_Cat / out=FreqOut(where=(Percent^=.));
run;
 
/* create basic mosaic plot with no tile colors */
proc template;
  define statgraph BasicMosaicPlot;
    begingraph;
      layout region;
       MosaicPlotParm category=(Weight_Cat BP_Cat) count=Count;
      endlayout;
    endgraph;
  end;
run;
 
proc sgrender data=FreqOut template=BasicMosaicPlot;
run;

The mosaic plot is the same as the one produced by PROC FREQ in my previous post, except that no colors are assigned to the cells. Also, PROC FREQ reverses the Y axis so that the mosaic plot is in the same order as the frequency table. See my previous post for how to interpret a mosaic plot.

A template for a mosaic plot with custom cell colors

As I said, the GTL enables you to specify colors for the cells. All you need to do is to include a variable in the summary data set that specifies the color. You can specify a discrete palette of colors by using the COLORGROUP= option in the MOSAICPLOTPARM statement. Alternatively, you can specify a continuous spectrum of colors by using the COLORRESPONSE= option in the MOSAICPLOTPARM statement.

A clever use of colors is to color each cell in the mosaic plot by the residual (observed count minus expected count) of a hypothesized model (Friendly, 1999, JCGS). The simplest model is the "independence model," in which the expected count for each cell is the product of the marginal counts for the two variables, divided by the total count. (This is the null hypothesis for the chi-square test for independence.) In order to make the residuals comparable across cells, I will generate standardized residuals. The following PROC FREQ call adds standardized residuals and other statistics to the summary of the data. The summary is written to the FreqList data set, which is used to create the mosaic plot.

proc freq data=heart;
tables BP_Cat*Weight_Cat / norow cellchi2 expected stdres crosslist;
ods output CrossList=FreqList(where=(Expected>0));
run;
 
/* color by response (notice that PROC FREQ reverses Y axis) */
proc template;
  define statgraph mosaicPlotParm;
    begingraph;
      layout region;
       MosaicPlotParm category=(Weight_Cat BP_Cat) count=Frequency /
            colorresponse=StdResidual name="mosaic";
       continuouslegend "mosaic" / title="StdRes";
      endlayout;
    endgraph;
  end;
run;
 
proc sgrender data=FreqList template=mosaicPlotParm;
run;

Notice that I used the COLORRESPONSE= option on the MOSAICPLOTPARM statement to specify that each tile be colored according to the range of standardized residuals. The CONTINUOUSLEGEND statement adds a three-color ramp to the plot and automatically shows the association between colors and the standardized residuals.

From the mosaic plot, you can visually see why the null hypothesis (no association) is rejected for these data. Red is used for cells with large positive deviations from the no-association model, which means a higher-than-expected observed count. Blue is used for large negative residuals. Among the overweight patients, more have high blood pressure than would be expected by the no-association model.
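To see where the expected counts come from, you can reproduce them from the marginal totals. The following sketch is only an illustration; PROC FREQ already reports these values when you use the EXPECTED option. To keep the example self-contained, I use the BP_Status and Weight_Status variables in Sashelp.Heart rather than the ordered categories from the downloadable program. The names Obs, Expect, RowTotal, ColTotal, and Total are hypothetical.

```sas
/* A sketch: expected counts under the independence model,
   Expected = (row total)*(column total) / (grand total)    */
proc freq data=Sashelp.Heart noprint;
   tables BP_Status*Weight_Status / out=Obs(where=(Percent^=.));
run;

proc sql;
   create table Expect as
   select a.BP_Status, a.Weight_Status, a.Count,
          r.RowTotal * c.ColTotal / t.Total as Expected
   from Obs as a,
        (select BP_Status, sum(Count) as RowTotal
            from Obs group by BP_Status) as r,
        (select Weight_Status, sum(Count) as ColTotal
            from Obs group by Weight_Status) as c,
        (select sum(Count) as Total from Obs) as t
   where a.BP_Status = r.BP_Status and
         a.Weight_Status = c.Weight_Status;
quit;

proc print data=Expect noobs; run;
```

A cell whose observed count is larger than its expected count has a positive residual.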

A generalized template for a mosaic plot

The previous template can be generalized in two ways:

  • The template can use dynamic variables instead of hard-coding the variables for this particular medical study.
  • The three-color ramp can be improved by making sure that the range is symmetric and that zero is exactly at the center of the color ramp.

The following improved and generalized template supports these two features. The resulting graph is not shown, since it is similar to the previous graph. However, this template is general enough to be used for a variety of data sets. The template is suitable for handling summaries that are produced by using the CROSSLIST option in the TABLES statement in PROC FREQ.

/* adjust range of values for color ramp; add dynamic variables */
proc template;
   define statgraph mosaicPlotGen;
   dynamic _X _Y _Frequency _Response _Title _LegendTitle; /* dynamic vars */
   begingraph;
      entrytitle _Title;
      /* make sure color ramp is symmetric */
      rangeattrmap name="responserange" ;
      range negmaxabs - maxabs / rangecolormodel=ThreeColorRamp;
      endrangeattrmap ;
 
      rangeattrvar attrvar=rangevar var=_Response attrmap="responserange";
      layout region;
         MosaicPlotParm category=(_X _Y) count=_Frequency /
                        name="mosaic" colorresponse=rangevar;
         continuouslegend "mosaic" / title=_LegendTitle;
      endlayout;
   endgraph;
   end;
run;
 
proc sgrender data=FreqList template=mosaicPlotGen;
dynamic _X="Weight_Cat" _Y="BP_Cat" 
        _Frequency="Frequency" _Response="StdResidual" 
        _Title="Blood Pressure versus Weight" _LegendTitle="StdResid";
run;
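
Because the template uses dynamic variables, you should be able to reuse it for other two-way tables. The following sketch applies it to the Origin and Type variables in Sashelp.Cars. It assumes that the mosaicPlotGen template has been compiled. The data set name CarsList, the titles, and the extra Frequency>0 condition (which drops empty cells) are my own choices.

```sas
/* A sketch: reuse the generalized template for a different table */
proc freq data=Sashelp.Cars;
   where Type ^= 'Hybrid';
   tables Origin*Type / norow cellchi2 expected stdres crosslist;
   ods output CrossList=CarsList(where=(Expected>0 and Frequency>0));
run;

proc sgrender data=CarsList template=mosaicPlotGen;
dynamic _X="Type" _Y="Origin"
        _Frequency="Frequency" _Response="StdResidual"
        _Title="Origin versus Type" _LegendTitle="StdResid";
run;
```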
tags: 12.1, Data Analysis, GTL, Statistical Graphics
November 4, 2013
 

Mosaic plots (Hartigan and Kleiner, 1981; Friendly, 1994, JASA) are used for exploratory data analysis of categorical data. Mosaic plots have been available for decades in SAS products such as JMP, SAS/INSIGHT, and SAS/IML Studio. However, not all SAS customers have access to these specialized products, so I am pleased that mosaic plots have recently been added to Base SAS in two places:

  • The FREQ procedure supports the PLOTS=MOSAIC option on the TABLES statement.
  • The Graph Template Language (GTL) supports the MOSAICPLOTPARM statement.

Both of these features were added in SAS 9.3m2, which is the 12.1 release of the analytics products. This article describes how to create a mosaic plot by using PROC FREQ. My next blog post will describe how to create mosaic plots by using the GTL.

Use mosaic plots to visualize frequency tables

You can use mosaic plots to visualize the cells of a frequency table, also called a contingency table or a crosstabulation table. A mosaic plot consists of regions (called "tiles") whose areas are proportional to the frequencies of the table cells. The widths of the tiles are proportional to the frequencies of the column variable levels. The heights of tiles are proportional to the frequencies of the row levels within the column levels.

The FREQ procedure supports two-variable mosaic plots, which are the most important case. The MOSAICPLOTPARM statement in the GTL supports mosaic plots with up to three categorical variables. JMP and SAS/IML Studio enable you to create mosaic plots with even more variables.

As I showed in a previous blog post, you can use user-defined formats to specify the order of levels of a categorical variable. The Sashelp.Heart data set contains data for 5,209 patients in a medical study of heart disease. You can download the program that specifies the order of levels for certain categorical variables. The following statements use the ordered categories to create a mosaic plot. The plot shows the relationship between categories of blood pressure and body weight for the patients:

ods graphics on;
proc freq data=heart;
tables BP_Cat*Weight_Cat / norow chisq plots=MOSAIC; /* alias for MOSAICPLOT */
run;

The mosaic plot is a graphical depiction of the frequency table. The mosaic plot shows the distribution of the weight categories by dividing the X axis into three intervals. The length of each interval is proportional to the percentage of patients who are underweight (3.5%), normal weight (28%), and overweight (68%), respectively. Within each weight category, the patients are further subdivided. The first column of tiles shows the proportion of patients who have optimal (29%), normal (54%), or high (18%) blood pressure, given that they are underweight. The middle column shows similar information for the patients of normal weight. The last column shows the conditional distribution of blood pressure categories, given that the patients are overweight.

The chi-square test (not shown) tests the hypothesis that there is no association between the weight of patients and their blood pressure. The chi-square test rejects that hypothesis, and the mosaic plot shows why. If there were no association between the variables, the green, red, and blue tiles would be essentially the same height regardless of the weight category of the patients. They are not. Rather, the height of the blue tiles increases from left to right. This shows that high blood pressure is more prevalent in overweight patients. Similarly, the height of the green tiles decreases from left to right. This shows that optimal blood pressure occurs more often in normal and underweight patients than in overweight patients.

The colors in this mosaic plot indicate the levels of the second variable. This enables you to quickly assess how categories of that variable depend on categories of the first variable. There are other ways to color the mosaic plot tiles, and you can use the GTL to specify an alternate set of colors. I describe that approach in my next blog post.

tags: 12.1, Data Analysis, GTL, Statistical Graphics
June 13, 2013
 

If you've watched any of the demos for SAS Visual Analytics (or even tried it yourself!), you have probably seen this nifty exploration of multiple measures.

It's a way to look at how multiple measures are correlated with one another, using a diagonal heat map chart. The "stronger" the color you see in the matrix, the stronger the correlation.

You might have wondered (as I did): can I build a chart like this in Base SAS? The answer is Yes (of course). It won't match the speed and interactivity of SAS Visual Analytics, but you might still find this to be a useful way to explore your data.

The approach

There are four steps to achieving a similar visualization in the 9.3 version of Base SAS. (Remember that ODS Graphics procedures are part of Base SAS in SAS 9.3!)

  1. Use the CORR procedure to create a data set with a correlation matrix. Actually, several SAS procedures can create TYPE=CORR data sets, but I used PROC CORR with Pearson's correlation in my example.
  2. Use the DATA step to rearrange the CORR data set to prepare it for rendering in a heat map.
  3. Define the graph "shell" using the Graph Template Language (GTL) and the HEATMAPPARM statement. You've got a lot of control over the graph appearance when you use GTL.
  4. Use the SGRENDER procedure to create the graph by applying the CORR data you prepared in the first two steps.

Here's an example of the result:

The program

I wrapped up the first two steps in a SAS macro. The macro first runs PROC CORR to create the matrix data, then uses the DATA step to transform the result for the heat map.

Note: By default, the PROC CORR step will treat all of the numeric variables as measures to correlate. That's not always what you want, especially if your data contains categorical columns that just happen to be numbers. You can use DROP= or KEEP= data set options when using the macro to narrow the set of variables that are analyzed. The examples (near the end of this post) show how that's done.

/* Prepare the correlations coeff matrix: Pearson's r method */
%macro prepCorrData(in=,out=);
  /* Run corr matrix for input data, all numeric vars */
  proc corr data=&in. noprint
    pearson
    outp=work._tmpCorr
    vardef=df
  ;
  run;
 
  /* prep data for heat map */
  data &out.;
    keep x y r;
    set work._tmpCorr(where=(_TYPE_="CORR"));
    array v{*} _numeric_;
    x = _NAME_;
    do i = dim(v) to 1 by -1;
      y = vname(v(i));
      r = v(i);
      /* creates a lower triangular matrix */
      if (i<_n_) then
        r=.;
      output;
    end;
  run;
 
  /* clean up the temporary correlation data set */
  proc datasets lib=work nolist nowarn;
    delete _tmpcorr;
  quit;
%mend;
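
To see what the macro produces before you render anything, you can run it on a small table and print the result. The following is a minimal sketch. The choice of Sashelp.Class and the output name class_r are my assumptions for illustration and are not part of the original program. Sashelp.Class has three numeric variables (Age, Height, Weight), so the output should contain nine rows, one for each (x, y) pair.

```sas
/* Sketch: inspect the data set that the macro prepares for the heat map.
   Assumes the prepCorrData macro defined above has been compiled;
   class_r is a hypothetical output name. */
%prepCorrData(in=sashelp.class, out=class_r);

proc print data=class_r noobs;
   var x y r;   /* x and y are variable names; r is the Pearson correlation */
run;
```

The rows where r is missing are the cells that the DATA step blanks out so that only one triangle of the symmetric matrix is colored.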

You have to define the graph "shell" (or template) only once in your program. The template definition can then be reused in as many PROC SGRENDER steps as you want.

This heat map definition uses the fact that correlations are always between -1 and 1. Negative numbers show a negative correlation (for example, heavier cars tend to achieve lower MPG). It's useful to select a range of colors that makes it easier to discern the relationships. In my example, I went for "strong" contrasting colors on the ends with a muted color in the middle.

  /* Create a heat map implementation of a correlation matrix */
ods path work.mystore(update) sashelp.tmplmst(read);
 
proc template;
  define statgraph corrHeatmap;
   dynamic _Title;
    begingraph;
      entrytitle _Title;
      rangeattrmap name='map';
      /* select a series of colors that represent a "diverging"  */
      /* range of values: stronger on the ends, weaker in middle */
      /* Get ideas from http://colorbrewer.org                   */
      range -1 - 1 / rangecolormodel=(cxD8B365 cxF5F5F5 cx5AB4AC);
      endrangeattrmap;
      rangeattrvar var=r attrvar=r attrmap='map';
      layout overlay / 
        xaxisopts=(display=(line ticks tickvalues)) 
        yaxisopts=(display=(line ticks tickvalues));
        heatmapparm x = x y = y colorresponse = r / 
          xbinaxis=false ybinaxis=false
          name = "heatmap" display=all;
        continuouslegend "heatmap" / 
          orient = vertical location = outside title="Pearson Correlation";
      endlayout;
    endgraph;
  end;
run;

You can then use the macro and template together to produce each visualization. Here are some examples:

/* Build the graphs */
ods graphics /height=600 width=800 imagemap;
 
%prepCorrData(in=sashelp.cars,out=cars_r);
proc sgrender data=cars_r template=corrHeatmap;
   dynamic _title="Corr matrix for SASHELP.cars";
run;
 
%prepCorrData(in=sashelp.iris,out=iris_r);
proc sgrender data=iris_r template=corrHeatmap;
   dynamic _title= "Corr matrix for SASHELP.iris";
run;
 
/* example of dropping categorical numerics */
%prepCorrData(
  in=sashelp.pricedata(drop=region date product line),
  out=pricedata_r);
proc sgrender data=pricedata_r template=corrHeatmap;
  dynamic _title="Corr matrix for SASHELP.pricedata";
run;

Download complete program: corrmatrix_gtl.sas for SAS 9.3

Spoiler alert: These steps will only get easier in a future version of SAS 9.4, where similar built-in visualizations are planned for PROC CORR and elsewhere.

Related resources

You can apply a similar "heat-map-style" coloring to ODS tables by creating custom table templates.

If you haven't yet tried SAS Visual Analytics, it's worth a test-drive. Many of the visualizations are inspiring (as this blog post proves).

Finally, while I didn't dissect the GTL heat map definition in detail in this post, you can learn a lot more about GTL from Sanjay Matange and his team at the Graphically Speaking blog.

Acknowledgments

Big thanks to Rick Wicklin, who helped me quite a bit with this example. Rick validated my initial approach, and also provided valuable suggestions to improve the heat map and the statistical meaning of the example. He pointed me to http://colorbrewer.org, which provides examples of useful color ranges that you can apply in maps -- colors that are easy to read and don't distract from the meaning.

Rick told me that he has related work coming up on his blog and within SAS 9.4, so you should watch his blog for additional insights.

tags: business analytics, GTL, ODS Graphics, SAS programming, Visual Analytics