Statistical Graphics

11月 162022
 

A profile plot is a way to display multivariate values for many subjects. The optimal linear profile plot was introduced by John Hartigan in his book Clustering Algorithms (1975). In Michael Friendly's book (SAS System for Statistical Graphics, 1991), Friendly shows how to construct an optimal linear profile by using the SAS/IML language. He displays the plot by using traditional SAS/GRAPH methods: the old GPLOT procedure and an annotation data set. If you do not have a copy of his book, you can download Friendly's %LINPRO macro from his GitHub site.

SAS graphics have changed a lot since 1991. It is now much easier to create the optimal linear profile plot by using modern graphics in SAS. This article reproduces Friendly's SAS/IML analysis to create the data for the plot but uses PROC SGPLOT to create the final graph. This approach does not require an annotate data set.

The optimal linear profile plot

Friendly (1991) illustrates the optimal linear profile plot by using the incidence rate for seven types of crime in 16 US cities in 1970. I have previously shown how to visualize the crime data by using line plots and heat maps. A profile plot for the raw data is shown to the right.

The optimal linear profile plot modifies the profile plot in two ways. It computes an optimal ordering of the categories (crimes) and transforms the raw responses (rates) so that the transformed data are as close to linear as possible. Let's focus on the meaning of the words "linear" and "optimal":

  • Linear: The goal of the linear profile plot is to show linear trends rather than exact data values. The method models the data to determine which cities tend to have relatively low rates of some crimes and relatively high rates of other crimes. Thus, the linear profile plot shows a linear model for the crime trends.
  • Optimal: In creating the optimal linear profiles, you can choose any positions for the categories on the X axis. You can also (affinely) scale the data for each variable in many ways. The algorithm chooses positions and scales that make the transformed data as linear as possible.

The optimal linear profile is constructed by using the first two principal coordinates (PC) for the data (in wide form). The optimal positions are found as the ratio of the elements of the first two eigenvectors. The slopes and intercepts for the linear profiles are obtained by projecting the centered and scaled data onto the eigenspace, so they are related to the PC scores. See Friendly (1991) or the SAS/IML code for details.

The following DATA step creates the crime data in wide form. The SAS/IML program is taken from the %LINPRO macro and is similar to the code in Friendly (1991). It contains a lot of print statements, which you can comment out if you do not want to see the intermediate calculations. I made a few minor changes and three major changes:

  • Friendly's code computed the covariance and standard deviations by using N (the sample size) as the denominator. My code uses the N-1 as the denominator so that the computations of the PCs agree with PROC PRINCOMP.
  • Friendly's code writes a data set (LINFIT), which is used to create an annotate data set. This data set is no longer needed, but I have retained the statements that create it. I have added code that creates two macro variables (POSNAMES and POSVALUES) that you can use to specify the positions of the crime categories.
  • The program uses PROC SGPLOT to create the optimal linear profile plot. The labels for the lines are displayed automatically by using the CURVELABEL option on the SERIES statement. The VALUES= and VALUESDISPPLAY= options are used to position the categories at their optimal positions.
/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city $16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* Program from Friendly (1991, p. 424-427) and slightly modified 
   by Rick Wicklin (see comments). You can download the LINPRO MACRO
   from https://github.com/friendly/SAS-macros/blob/master/linpro.sas
 
   The program assumes the data are in wide form:
   There are k response variables. Each row is the response variables 
   for one subject. The name of the subject is contained in an ID variable.
 
   The output for plotting is stored in the data set LINPLOT.
   Information about the linear regressions is stored in the data set LINFIT.
*/
 
%let data = crime;    /* name of data set */
%let var = _NUM_;     /* list of variable names or the _NUM_ keyword */
%let id = City;       /* subject names */
proc iml;
   *--- Get input matrix and dimensions;
   use &data;
   if upcase("&var") = "_NUM_" then
      read all var _NUM_ into X[colname=vars];
   else
      read all var {&var} into X[colname=vars];
   n = nrow(X);
   p = ncol(X);
   m = mean(X);
   D = X - m;         *- centered cols = deviations from means;
 
   if "&id" = " " then
      ID = T(1:nrow(X));
   else
      read all var{&id} into ID;
   if type(ID)='N' then ID = compress(char(id));
   close;
 
   *--- Compute covariance matrix of x;
   /* Modification by Wicklin: Friendly's COV and SD are different 
      because he uses a divisor of n for the COV matrix instead of n-1 */
   C = cov(X);   
   sd = std(X);
   /* If you want to exactly reproduce Friendly's macro, use the following: 
   c = d` * d /  n;               *- variance-covariance matrix;
   sd = sqrt( vecdiag( c))`;      *- standard deviations;
   */
 
   *--- Standardize if vars have sig diff scales;
   ratio = max(sd) / min(sd);
   if ratio > 3 then do;
      s = sd;             * s is a scale factor;
      C = cov2corr(C);
      Clbl = "Correlation Matrix";
   end;
   else do;
      s = j(1, p, 1);
      Clbl = "Covariance Matrix";
   end;
   print C[colname=vars rowname=vars f=7.3 L=Clbl];
 
   *--- Eigenvalues & vectors of C;
   call eigen(val, vec, C);
 
   *--- Optional: Display information about the eigenvectors;
   prop = val / val[+];
   cum = cusum(prop);
   T = val || dif(val,-1) || prop || cum;
   cl = {'Eigenvalue' 'Difference' 'Proportion' 'Cumulative'}; 
   lbl = "Eigenvalues of the " + Clbl;
   print T[colname=cl f=8.3 L=lbl];
 
   *--- Scale values;
   e1 = vec[ , 1] # sign( vec[<> , 1]);
   e2 = vec[ , 2];
   pos = e2 / e1;
   Ds = D;                /* the raw difference matrix */
   D = Ds / ( s # e1` );  /* a scaled difference matrix */
 
   *--- For case i, fitted line is  Y = F1(I) + F2(I) * X   ;
   f1 = ( D * e1 ) / e1[##];  /* ssq(e1) = ssq(e2) = 1 */
   f2 = ( D * e2 ) / e2[##];
   f = f1 || f2;
 
   *--- Optionally display the results;
   scfac = sd` # e1;   * NOTE: Friendly rounds this to the nearest 0.1 unit;
   table = m` || sd` || e1 || e2 || scfac || pos;
   ct = { 'Mean' 'Std_Dev' 'Eigvec1' 'Eigvec2' 'Scale' 'Position'};
   print table [ rowname=vars colname=ct f=8.2];
 
   *--- Rearrange columns;
   r = rank( pos);
   zz = table;  table [r, ] = zz;
   zz = vars;   vars [ ,r] = zz;
   zz = pos;    pos [r, ] = zz;
   zz = scfac;  scfac [r, ] = zz;
   zz = D;      D [ ,r] = zz;
   *--- Optionally display the results;
   print table[rowname=vars colname=ct f=8.2 L="Variables reordered by position"];
   lt = { 'Intercept' 'Slope'};
   print f[rowname=&id colname=lt f=7.2 L="Case lines"];
 
   *--- Fitted values, residuals;
   fit = f1 * j( 1 , p) + f2 * pos`;
   resid = D - fit;
   *--- Optionally display the results;
   print fit [ rowname=&id colname=vars format=7.3 L="Fitted values"];
   print resid [ rowname=&id colname=vars format=7.3 L="Residuals"];
 
   *--- Optionally display summary statistics for fit;
   sse = resid[##];   ssf = fit[##];
   sst = D[##];       vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   sse = (ds-fit)[##];   sst = ds[##];
   vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   *--- Construct output array for annotate data set -
          residuals bordered by fitted scale values;
   v1 = val[ 1 ] || {0};
   v2 = {0} || val[ 2 ];
   xout = ( resid || f ) //
          ( pos` || v1 ) //
          ( scfac` || v2 );
   rl = colvec(ID) // {'Variable','Scale'};
   cl = colvec(vars) // {'Intercept','Slope'};
   create LINFIT from xout[ rowname=rl colname=cl ];
      append from xout[ rowname= rl ];
   close;
   free rl cl xout;
 
   *--- Output the array to be plotted;
   do col = 1 to p;
       rows = j(n,1,pos[col,]) ||  fit[,col] || resid[,col];
       pout = pout // rows;
       rl = rl // shape(ID,1);
   end;
   cl = { 'Position' 'Fit' 'Residual'};
   create LINPLOT from pout[ rowname=rl colname=cl ];
      append from pout[ rowname=rl ];
   close;
 
   /* Modification by Wicklin: create two macro variables for 
      labeling the plotting positions. For example:
     POSNAME = 'larceny' 'burglary' 'auto_thf' 'rape' 'robbery' 'assault' 'murder'
     POSVALUES= -1.698 -1.014 -0.374 0.1389 0.5016 0.5805 2.1392
   */
   M1 = compress("'" + vars + "'");
   M1 = rowcat(M1 + " ");
   M2 = rowcat(char(pos`,6) + " ");
   call symputx("posNames", M1);
   call symputx("posValues",M2);
quit;
 
%put &=posNames;
%put &=posValues;
 
/* Use the POSNAMES and POSVALUES macro variables to add custom 
   tick labels and values. See
   https://blogs.sas.com/content/iml/2020/03/23/custom-tick-marks-sas-graph.html
*/
ods graphics / height=640px width=640px;
title "Optimal Linear Profiles for City Crime Rates";
footnote J=L "Friendly (1991, p. 430) Output 8.11";
proc sgplot data=LINPLOT(rename=(RL=&ID));
  label fit="Fitted Crime Index" position="Crime (Scaled Position)";
  series x=position y=fit / group=city curvelabel curvelabelloc=outside;
  xaxis values = (&posValues) valuesdisplay=(&posNames) fitpolicy=stagger;  
run;
footnote;title;

The optimal linear profile plot is shown for the crimes data. Unlike Friendly, I do not show the points along these lines. I think adding markers gives the false impression that the method has perfectly linearized the data.

The X axis shows the optimal positions of the crimes. They are roughly ordered so that crimes against property are on the left and violent crimes against persons are on the right. As is often the case, the PC interpretation does not perfectly agree with our intuitions. For these data, the analysis has placed rape to the left of robbery and assault. Notice that I use the VALUES= and VALUESDISPLAY= to specify the positions of the labels, and I use the FITPOLICY=STAGGER option to prevent the labels from colliding.

The cities can be classified according to their linear trends. For brevity, let's call larceny and burglary "property crimes," and let's call robbery, assault, and murder "violent crimes." Then you can observe the following classifications for the cities:

  • The lines for some cities (Dallas, Chicago, Houston) have a large positive slope. This indicates that the crime rates for property crimes are low relative to the rates of violent crimes.
  • The lines for other cities (Los Angeles, New York, Denver) have a large positive slope. This indicates that the crime rates for property crimes are high relative to the rates of violent crimes.
  • The lines for other cities (Kansas City, Tucson, Hartford) are relatively flat. This indicates that the crime rates for property crimes are about the same as the rates of violent crimes.
  • When the line for one city is above the line for another city, it means that the first city tends to have higher crime rates across the board than the second city. For example, the line for Los Angeles is above the line for New York. If you look at the raw data, you will see that the crime rates in LA is mostly higher, although NYC has higher rates for robbery and larceny. Similarly, the rates for Portland are generally higher than the rates for Honolulu, except for auto theft.
  • The ordering of the city labels is based primarily on the violent crime rates.

The following color image is copied from Friendly (1991, Appendix 3, p. 673) so that you can compare the SAS/GRAPH output to the newer output.

An alternative to the optimal linear profile plot

I'll be honest, I am not a big fan of the optimal linear profile plot. I think it can give a false impression about the data. It displays a linear model but does not provide diagnostic tools to assess issues such as outliers or deviations from the model. For example, the line for Boston shows no indication that auto theft in that city was extremely high in 1970. In general, there is no reason to think that one set of positions for the categories (crimes) leads to a linear profile for all subjects (cities).

If you want to show the data instead of a linear model, see my previous article, "Profile plots in SAS," and consider using the heat map near the end of the article. If you want to observe similarities and differences between the cities based on the multivariate data, I think performing a principal component analysis is a better choice. The following call to PROC PRINCOMP does a good job of showing similarities and differences between the categories and subjects. The analysis uses the correlation matrix, which is equivalent to saying that the crime rates have been standardized.

ods graphics/reset;
proc princomp data=crime out=Scores N=2 plot(only)=(PatternProfile Pattern Score);
   var murder rape robbery assault burglary larceny auto_thf;
   id City;
run;

For this analysis, the "Pattern Profile Plot" (not shown) shows that the first PC can be interpreted as the total crime rate. The second PC is a contrast between property crimes and violent crimes. The component pattern plot shows that the position of the variables along the second PC is nearly the same as the positions computed by the optimal linear profile algorithm. You can see property crimes in the lower half of the plot and violent crimes in the upper half. The "Score Plot" enables you to see the following characteristics of cities:

  • Cities to the right have high overall crime rates. Cities to the left have low overall crime rates.
  • Cities near the top have higher rates of violent crime relative to property crimes. Cities near the bottom have higher rates of property crimes. Cities near the middle have crime rates that are roughly equal for crimes.

In my opinion, the score plot provides the same information as the optimal linear profile plot. However, interpreting the score plot requires knowledge of principal components, which is an advanced skill. The optimal linear profile plot tries to convey this information in a simpler display that is easier to explain to a nontechnical audience.

Summary

This article reproduces Friendly's (1991) implementation of Hartigan's optimal linear profile plot. Friendly distributes the %LINPRO macro, which uses SAS/IML to carry out the computations and uses SAS/GRAPH to construct the linear profile plot. This article reproduces the SAS/IML computations but replaces the SAS/GRAPH calls with a call to PROG SGPLOT.

This article also discusses the advantages and disadvantages of the optimal linear profile plot. In my opinion, there are better ways to explore the similarities between subjects across many variables, including heat maps and score plots in principal component analysis.

The post Optimal linear profile plots in SAS appeared first on The DO Loop.

11月 162022
 

A profile plot is a way to display multivariate values for many subjects. The optimal linear profile plot was introduced by John Hartigan in his book Clustering Algorithms (1975). In Michael Friendly's book (SAS System for Statistical Graphics, 1991), Friendly shows how to construct an optimal linear profile by using the SAS/IML language. He displays the plot by using traditional SAS/GRAPH methods: the old GPLOT procedure and an annotation data set. If you do not have a copy of his book, you can download Friendly's %LINPRO macro from his GitHub site.

SAS graphics have changed a lot since 1991. It is now much easier to create the optimal linear profile plot by using modern graphics in SAS. This article reproduces Friendly's SAS/IML analysis to create the data for the plot but uses PROC SGPLOT to create the final graph. This approach does not require an annotate data set.

The optimal linear profile plot

Friendly (1991) illustrates the optimal linear profile plot by using the incidence rate for seven types of crime in 16 US cities in 1970. I have previously shown how to visualize the crime data by using line plots and heat maps. A profile plot for the raw data is shown to the right.

The optimal linear profile plot modifies the profile plot in two ways. It computes an optimal ordering of the categories (crimes) and transforms the raw responses (rates) so that the transformed data are as close to linear as possible. Let's focus on the meaning of the words "linear" and "optimal":

  • Linear: The goal of the linear profile plot is to show linear trends rather than exact data values. The method models the data to determine which cities tend to have relatively low rates of some crimes and relatively high rates of other crimes. Thus, the linear profile plot shows a linear model for the crime trends.
  • Optimal: In creating the optimal linear profiles, you can choose any positions for the categories on the X axis. You can also (affinely) scale the data for each variable in many ways. The algorithm chooses positions and scales that make the transformed data as linear as possible.

The optimal linear profile is constructed by using the first two principal coordinates (PC) for the data (in wide form). The optimal positions are found as the ratio of the elements of the first two eigenvectors. The slopes and intercepts for the linear profiles are obtained by projecting the centered and scaled data onto the eigenspace, so they are related to the PC scores. See Friendly (1991) or the SAS/IML code for details.

The following DATA step creates the crime data in wide form. The SAS/IML program is taken from the %LINPRO macro and is similar to the code in Friendly (1991). It contains a lot of print statements, which you can comment out if you do not want to see the intermediate calculations. I made a few minor changes and three major changes:

  • Friendly's code computed the covariance and standard deviations by using N (the sample size) as the denominator. My code uses the N-1 as the denominator so that the computations of the PCs agree with PROC PRINCOMP.
  • Friendly's code writes a data set (LINFIT), which is used to create an annotate data set. This data set is no longer needed, but I have retained the statements that create it. I have added code that creates two macro variables (POSNAMES and POSVALUES) that you can use to specify the positions of the crime categories.
  • The program uses PROC SGPLOT to create the optimal linear profile plot. The labels for the lines are displayed automatically by using the CURVELABEL option on the SERIES statement. The VALUES= and VALUESDISPPLAY= options are used to position the categories at their optimal positions.
/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city $16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* Program from Friendly (1991, p. 424-427) and slightly modified 
   by Rick Wicklin (see comments). You can download the LINPRO MACRO
   from https://github.com/friendly/SAS-macros/blob/master/linpro.sas
 
   The program assumes the data are in wide form:
   There are k response variables. Each row is the response variables 
   for one subject. The name of the subject is contained in an ID variable.
 
   The output for plotting is stored in the data set LINPLOT.
   Information about the linear regressions is stored in the data set LINFIT.
*/
 
%let data = crime;    /* name of data set */
%let var = _NUM_;     /* list of variable names or the _NUM_ keyword */
%let id = City;       /* subject names */
proc iml;
   *--- Get input matrix and dimensions;
   use &data;
   if upcase("&var") = "_NUM_" then
      read all var _NUM_ into X[colname=vars];
   else
      read all var {&var} into X[colname=vars];
   n = nrow(X);
   p = ncol(X);
   m = mean(X);
   D = X - m;         *- centered cols = deviations from means;
 
   if "&id" = " " then
      ID = T(1:nrow(X));
   else
      read all var{&id} into ID;
   if type(ID)='N' then ID = compress(char(id));
   close;
 
   *--- Compute covariance matrix of x;
   /* Modification by Wicklin: Friendly's COV and SD are different 
      because he uses a divisor of n for the COV matrix instead of n-1 */
   C = cov(X);   
   sd = std(X);
   /* If you want to exactly reproduce Friendly's macro, use the following: 
   c = d` * d /  n;               *- variance-covariance matrix;
   sd = sqrt( vecdiag( c))`;      *- standard deviations;
   */
 
   *--- Standardize if vars have sig diff scales;
   ratio = max(sd) / min(sd);
   if ratio > 3 then do;
      s = sd;             * s is a scale factor;
      C = cov2corr(C);
      Clbl = "Correlation Matrix";
   end;
   else do;
      s = j(1, p, 1);
      Clbl = "Covariance Matrix";
   end;
   print C[colname=vars rowname=vars f=7.3 L=Clbl];
 
   *--- Eigenvalues & vectors of C;
   call eigen(val, vec, C);
 
   *--- Optional: Display information about the eigenvectors;
   prop = val / val[+];
   cum = cusum(prop);
   T = val || dif(val,-1) || prop || cum;
   cl = {'Eigenvalue' 'Difference' 'Proportion' 'Cumulative'}; 
   lbl = "Eigenvalues of the " + Clbl;
   print T[colname=cl f=8.3 L=lbl];
 
   *--- Scale values;
   e1 = vec[ , 1] # sign( vec[<> , 1]);
   e2 = vec[ , 2];
   pos = e2 / e1;
   Ds = D;                /* the raw difference matrix */
   D = Ds / ( s # e1` );  /* a scaled difference matrix */
 
   *--- For case i, fitted line is  Y = F1(I) + F2(I) * X   ;
   f1 = ( D * e1 ) / e1[##];  /* ssq(e1) = ssq(e2) = 1 */
   f2 = ( D * e2 ) / e2[##];
   f = f1 || f2;
 
   *--- Optionally display the results;
   scfac = sd` # e1;   * NOTE: Friendly rounds this to the nearest 0.1 unit;
   table = m` || sd` || e1 || e2 || scfac || pos;
   ct = { 'Mean' 'Std_Dev' 'Eigvec1' 'Eigvec2' 'Scale' 'Position'};
   print table [ rowname=vars colname=ct f=8.2];
 
   *--- Rearrange columns;
   r = rank( pos);
   zz = table;  table [r, ] = zz;
   zz = vars;   vars [ ,r] = zz;
   zz = pos;    pos [r, ] = zz;
   zz = scfac;  scfac [r, ] = zz;
   zz = D;      D [ ,r] = zz;
   *--- Optionally display the results;
   print table[rowname=vars colname=ct f=8.2 L="Variables reordered by position"];
   lt = { 'Intercept' 'Slope'};
   print f[rowname=&id colname=lt f=7.2 L="Case lines"];
 
   *--- Fitted values, residuals;
   fit = f1 * j( 1 , p) + f2 * pos`;
   resid = D - fit;
   *--- Optionally display the results;
   print fit [ rowname=&id colname=vars format=7.3 L="Fitted values"];
   print resid [ rowname=&id colname=vars format=7.3 L="Residuals"];
 
   *--- Optionally display summary statistics for fit;
   sse = resid[##];   ssf = fit[##];
   sst = D[##];       vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   sse = (ds-fit)[##];   sst = ds[##];
   vaf = ssf / (sse+ssf);
   print sse ssf sst vaf;
 
   *--- Construct output array for annotate data set -
          residuals bordered by fitted scale values;
   v1 = val[ 1 ] || {0};
   v2 = {0} || val[ 2 ];
   xout = ( resid || f ) //
          ( pos` || v1 ) //
          ( scfac` || v2 );
   rl = colvec(ID) // {'Variable','Scale'};
   cl = colvec(vars) // {'Intercept','Slope'};
   create LINFIT from xout[ rowname=rl colname=cl ];
      append from xout[ rowname= rl ];
   close;
   free rl cl xout;
 
   *--- Output the array to be plotted;
   do col = 1 to p;
       rows = j(n,1,pos[col,]) ||  fit[,col] || resid[,col];
       pout = pout // rows;
       rl = rl // shape(ID,1);
   end;
   cl = { 'Position' 'Fit' 'Residual'};
   create LINPLOT from pout[ rowname=rl colname=cl ];
      append from pout[ rowname=rl ];
   close;
 
   /* Modification by Wicklin: create two macro variables for 
      labeling the plotting positions. For example:
     POSNAME = 'larceny' 'burglary' 'auto_thf' 'rape' 'robbery' 'assault' 'murder'
     POSVALUES= -1.698 -1.014 -0.374 0.1389 0.5016 0.5805 2.1392
   */
   M1 = compress("'" + vars + "'");
   M1 = rowcat(M1 + " ");
   M2 = rowcat(char(pos`,6) + " ");
   call symputx("posNames", M1);
   call symputx("posValues",M2);
quit;
 
%put &=posNames;
%put &=posValues;
 
/* Use the POSNAMES and POSVALUES macro variables to add custom 
   tick labels and values. See
   https://blogs.sas.com/content/iml/2020/03/23/custom-tick-marks-sas-graph.html
*/
ods graphics / height=640px width=640px;
title "Optimal Linear Profiles for City Crime Rates";
footnote J=L "Friendly (1991, p. 430) Output 8.11";
proc sgplot data=LINPLOT(rename=(RL=&ID));
  label fit="Fitted Crime Index" position="Crime (Scaled Position)";
  series x=position y=fit / group=city curvelabel curvelabelloc=outside;
  xaxis values = (&posValues) valuesdisplay=(&posNames) fitpolicy=stagger;  
run;
footnote;title;

The optimal linear profile plot is shown for the crimes data. Unlike Friendly, I do not show the points along these lines. I think adding markers gives the false impression that the method has perfectly linearized the data.

The X axis shows the optimal positions of the crimes. They are roughly ordered so that crimes against property are on the left and violent crimes against persons are on the right. As is often the case, the PC interpretation does not perfectly agree with our intuitions. For these data, the analysis has placed rape to the left of robbery and assault. Notice that I use the VALUES= and VALUESDISPLAY= to specify the positions of the labels, and I use the FITPOLICY=STAGGER option to prevent the labels from colliding.

The cities can be classified according to their linear trends. For brevity, let's call larceny and burglary "property crimes," and let's call robbery, assault, and murder "violent crimes." Then you can observe the following classifications for the cities:

  • The lines for some cities (Dallas, Chicago, Houston) have a large positive slope. This indicates that the crime rates for property crimes are low relative to the rates of violent crimes.
  • The lines for other cities (Los Angeles, New York, Denver) have a large positive slope. This indicates that the crime rates for property crimes are high relative to the rates of violent crimes.
  • The lines for other cities (Kansas City, Tucson, Hartford) are relatively flat. This indicates that the crime rates for property crimes are about the same as the rates of violent crimes.
  • When the line for one city is above the line for another city, it means that the first city tends to have higher crime rates across the board than the second city. For example, the line for Los Angeles is above the line for New York. If you look at the raw data, you will see that the crime rates in LA is mostly higher, although NYC has higher rates for robbery and larceny. Similarly, the rates for Portland are generally higher than the rates for Honolulu, except for auto theft.
  • The ordering of the city labels is based primarily on the violent crime rates.

The following color image is copied from Friendly (1991, Appendix 3, p. 673) so that you can compare the SAS/GRAPH output to the newer output.

An alternative to the optimal linear profile plot

I'll be honest, I am not a big fan of the optimal linear profile plot. I think it can give a false impression about the data. It displays a linear model but does not provide diagnostic tools to assess issues such as outliers or deviations from the model. For example, the line for Boston shows no indication that auto theft in that city was extremely high in 1970. In general, there is no reason to think that one set of positions for the categories (crimes) leads to a linear profile for all subjects (cities).

If you want to show the data instead of a linear model, see my previous article, "Profile plots in SAS," and consider using the heat map near the end of the article. If you want to observe similarities and differences between the cities based on the multivariate data, I think performing a principal component analysis is a better choice. The following call to PROC PRINCOMP does a good job of showing similarities and differences between the categories and subjects. The analysis uses the correlation matrix, which is equivalent to saying that the crime rates have been standardized.

ods graphics/reset;
proc princomp data=crime out=Scores N=2 plot(only)=(PatternProfile Pattern Score);
   var murder rape robbery assault burglary larceny auto_thf;
   id City;
run;

For this analysis, the "Pattern Profile Plot" (not shown) shows that the first PC can be interpreted as the total crime rate. The second PC is a contrast between property crimes and violent crimes. The component pattern plot shows that the position of the variables along the second PC is nearly the same as the positions computed by the optimal linear profile algorithm. You can see property crimes in the lower half of the plot and violent crimes in the upper half. The "Score Plot" enables you to see the following characteristics of cities:

  • Cities to the right have high overall crime rates. Cities to the left have low overall crime rates.
  • Cities near the top have higher rates of violent crime relative to property crimes. Cities near the bottom have higher rates of property crimes. Cities near the middle have crime rates that are roughly equal for crimes.

In my opinion, the score plot provides the same information as the optimal linear profile plot. However, interpreting the score plot requires knowledge of principal components, which is an advanced skill. The optimal linear profile plot tries to convey this information in a simpler display that is easier to explain to a nontechnical audience.

Summary

This article reproduces Friendly's (1991) implementation of Hartigan's optimal linear profile plot. Friendly distributes the %LINPRO macro, which uses SAS/IML to carry out the computations and uses SAS/GRAPH to construct the linear profile plot. This article reproduces the SAS/IML computations but replaces the SAS/GRAPH calls with a call to PROG SGPLOT.

This article also discusses the advantages and disadvantages of the optimal linear profile plot. In my opinion, there are better ways to explore the similarities between subjects across many variables, including heat maps and score plots in principal component analysis.

The post Optimal linear profile plots in SAS appeared first on The DO Loop.

11月 142022
 

A profile plot is a compact way to visualize many variables for a set of subjects. It enables you to investigate which subjects are similar to or different from other subjects. Visually, a profile plot can take many forms. This article shows several profile plots: a line plot of the original data, a line plot of standardized data, and a heat map.

This article uses crime data to illustrate the various kinds of plots. The line plot at the right shows the incidence rate for seven types of crime in 16 US cities in 1970. The profile plot looks somewhat like a spaghetti plot, but a spaghetti plot typically features a continuous variable (often, time) along the vertical axis, whereas the horizontal axis in a profile plot features discrete categories. The vertical axis shows a response variable, which is often a count, a rate, or a proportion. The line segments enable you to trace the rates for each city. You can use the graph to discover which cities have high crime rates, which have low rates, and which are high for some crimes but low for others.

In the example, the subjects are cities; the variables are the seven crime rates. However, a profile plot is applicable to many fields. For example, in a clinical setting, the subjects could be patients and the variables could be the results of certain lab tests, such as blood levels of sugar, cholesterol, iron, and so forth. Be aware, however, that there are several different graphs that are known as "patient profile plots." Some are more complicated than the profile plots in this article.

A simple profile plot

For the city crime data, all variables are measured as rates per 100K people. Because all variables are measured on the same scale, you can plot the raw data. You could use a box plot if you want to show the distribution of rates for each crime. However, the purpose of a profile plot is to be able to trace each subject's relative rates across all variables. To do that, let's use a line plot.

The data are originally from the United States Statistical Abstracts (1970) and were analyzed by John Hartigan (1975) in his book Clustering Algorithms. The data were reanalyzed by Michael Friendly (SAS System for Statistical Graphics, 1991). This article follows Friendly's analysis for the line plots. The heat map in the last section is not considered in Friendly (1991).

The following SAS DATA step defines the data in the wide format (seven variables and 16 rows). However, to plot the data as a line plot in which colors identify the cities, it is useful to convert the data from wide to long format. For this example, I use PROC TRANSPOSE. You can then use the SERIES statement in PROC SGPLOT to visualize the rates for each crime and for each city.

/* Data from Hartigan (1975), reproduced in Friendly (SAS System for Statistical Graphics, 1991)
   http://friendly.apps01.yorku.ca/psy6140/examples/cluster/linpro2.sas
*/
data crime;
   input city &$16. murder rape robbery assault burglary larceny auto_thf;
datalines;
Atlanta         16.5  24.8  106  147  1112   905  494
Boston           4.2  13.3  122   90   982   669  954
Chicago         11.6  24.7  340  242   808   609  645
Dallas          18.1  34.2  184  293  1668   901  602
Denver           6.9  41.5  173  191  1534  1368  780
Detroit         13.0  35.7  477  220  1566  1183  788
Hartford         2.5   8.8   68  103  1017   724  468
Honolulu         3.6  12.7   42   28  1457  1102  637
Houston         16.8  26.6  289  186  1509  787   697
Kansas City     10.8  43.2  255  226  1494  955   765
Los Angeles      9.7  51.8  286  355  1902  1386  862
New Orleans     10.3  39.7  266  283  1056  1036  776
New York         9.4  19.4  522  267  1674  1392  848
Portland         5.0  23.0  157  144  1530  1281  488
Tucson           5.1  22.9   85  148  1206   757  483
Washington      12.5  27.6  524  217  1496  1003  739
;
 
/* convert to long format and use alphabetical ordering along X axis */
proc transpose data=crime out=crime2(rename=(col1=rate)) name=crime;
   by city;
run;
proc sort data=crime2;  by crime;  run;
 
title "Profile Plot of Crime Rates in US Cities in 1970";
footnote J=L "Friendly (1991, p. 421) Output 8.9";
proc sgplot data=crime2;
  label rate="Crime Rate (per 100K population)" crime="Crime";
  series x=crime y=rate / group=city markers;
run;

The graph is shown at the top of this article. It is not very useful. The graph does not enable you to trace the crime rate patterns for each city. To quote Friendly (1991, p. 422): "In plotting the raw data, the high-frequency crimes such as burglary completely dominate the display, and the low-frequency crimes (murder, rape) are compressed to the bottom of the plot. Because of this, the possibility that the cities may differ in their patterns is effectively hidden."

Whenever you are trying to graph data that are on different scales, it is useful to transform the scale of the data. You could do this globally (for example, a logarithmic transformation) or you can focus on relative aspects. Following Friendly (1991), the next section standardizes the data, thereby focusing the attention on which cities have relatively high or low crime rates for various crimes.

A profile plot for standardized data

The most common way to standardize data is to use an affine transformation for each variable so that the standardized variables have mean 0 and unit variance. You can use PROC STANDARD in SAS to standardize the variables. You can specify the name of variables by using the VAR statement, or you can use the keyword _NUMERIC_ to standardize all numeric variables in the data. The following statements standardize the original wide data and then convert the standardized data to long format for graphing.

proc standard data=crime mean=0 std=1 out=stdcrime;
   var _numeric_;
run;
/* convert to long format and ensure a consistent ordering along X axis */
proc transpose data=stdcrime out=stdcrime2(rename=(col1=rate)) name=crime;
   by city;
run;
proc sort data=stdcrime2;  by crime;  run;
 
title "Profile Plot of Standardized Crime Rates in 1970";
footnote J=L "Friendly (1991, p. 423) Output 8.10";
proc sgplot data=stdcrime2;
  label rate="Crime Rate (standard score)" crime="Crime";
  series x=crime y=rate / group=city markers;
run;

Unfortunately, for these data, this graph is not much better than the first profile plot. Friendly (1991, p. 423) calls the plot "dismal" because it is "hard to see any similarities among the cities in their patterns of crime." With a close inspection, you can (barely) make out the following facts:

  • For most crimes, Los Angeles, Detroit, and New York have high rates.
  • For most crimes, Tucson and Hartford have low rates.
  • The crime rate in Boston is low except for auto theft.

This plot is similar to a parallel coordinates plot. There are three ways to improve the standardized profile plot:

  • Rearrange the order of the categories along the X axis by putting similar categories next to each other. I created one possible re-ordering, in which property crimes are placed on the left and crimes against persons are placed on the right. The profile plot is slightly improved, but it is still hard to read.
  • Change the affine transformation for the variables. Instead of always using mean 0 and unit variance, perhaps there is a better way to transform the response values? And perhaps using nonuniform spacing for the categories could make the profiles easier to read? These two ideas are implemented in the "optimal linear profile plot" (Hartigan, 1975; Friendly, 1991). I will show how to create the optimal linear profile plot in a subsequent article.
  • Abandon the spaghetti-like line plot in favor of a lasagna plot, which uses a heat map instead of a line plot. This is shown in the next section.

A lasagna plot of profiles

The main advantage of a lasagna plot is that data are neatly displayed in rows and columns. For each subject, you can examine the distribution of the response variable across categories. Furthermore, a rectangular heat map is symmetric: for each category, you can examine the distribution of the response variable across subjects! For these data, the subjects are cities, the categories are crimes, and the response variable is either the raw crime rate or the standardized rate. The disadvantage of a heat map is that the response values are represented by colors, and it can be difficult to perceive changes in colors for some color ramps. On the other hand, you can always print the data in tabular form if you need to know the exact values of the response variables.

The following statements create a lasagna plot for the standardized data in this example. Like the line plot, the heat map uses the data in long form:

%let colormap = (CX0571B0 CX92C5DE CXF7F7F7 CXF4A582 CXCA0020);
/* lasagna plot: https://blogs.sas.com/content/iml/2016/06/08/lasagna-plot-in-sas.html 
   and https://blogs.sas.com/content/iml/2022/03/28/viz-missing-values-longitudinal.html */
title "Lasagna Plot of City Profiles";
title2 "Standardized Crime Rates in 1970";
proc sgplot data=stdcrime2;
   label city="City" crime="Crime" rate="Standardized Rate";
   heatmapparm x=crime y=city colorresponse=rate / outline outlineattrs=(color=gray) colormodel=&colormap; 
run;

This is a better visualization of the crime profiles for each city. You can easily see the standardized rates and detect patterns among the cities. You can see that Detroit, Kansas City, Los Angeles, New York, and Washington (DC) have high relative rates. You can see that Atlanta, Boston, Hartford, Portland, and Tucson tend to have relatively low rates. You can see the crimes for which Atlanta, Boston, and Portland do not have low rates.

This display uses an alphabetical ordering of both the categories (crimes) and the subject (cities). However, you can sort the rows and columns of a lasagna plot to improve the visualization. For example, the following heat map sorts the cities by the average standardized crime rate across all categories. It also arranges the horizontal axis so that nonviolent crimes against property are on the left and violent crimes against people on the right:

This arrangement makes it easier to see outliers such as the orange and red cells in the lower left of the graph. Both Portland and Honolulu have rates of property crime that are higher than you might suspect, given their overall low crime rates. Similarly, Boston has relatively high rates of auto theft compared to its otherwise low crime rates. In the same way, the relatively low rate of rapes in New York City is apparent.

Summary

This article shows three ways to create a profile plot in SAS. A profile plot enables you to visualize the values of many variables for a set of subjects. The simplest profile plot is a line plot of the raw values, where each line corresponds to a subject and the discrete horizontal axis displays the names of variables. However, if the variables are measured in different units or on different scales, it is useful to rescale the values of the variables. One way is to create a line plot of the standardized variables, which displays the relative values (high or low) of each variable. A third display is a heat map (sometimes called a lasagna plot), which uses color to display the high or low values of each variable.

In this article, the categories are uniformly spaced along the horizontal axis. In a subsequent article, I will show an alternative plot that uses nonuniform spacing.

The post Profile plots in SAS appeared first on The DO Loop.

10月 262022
 

A SAS programmer asked how to create a graph that shows whether missing values in one variable are associated with certain values of another variable. For example, a patient who is supposed to monitor his blood glucose daily might have more missing measurements near holidays and in the summer months due to vacations. In this example, the "missingness" in the glucose variable is related to the date variable. In general, we say that missing values depend on another variable when the probability of obtaining a missing value in one variable depends on the value of another variable.

It can be useful to visualize the dependency if you are trying to impute values to replace the missing values. In missing-value imputation, the analyst must decide how to model the missing values. Three popular assumptions are that the values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). You can read more about these missing-value assumptions. The visualization in this article can help you decide whether to treat the data as MCAR or MAR.

Generate fake data and missing values

Let's generate some fake data. The following DATA step creates four variables:

  • The DATE variable does not contain any missing values. It contains daily values between 01JAN2018 and 31DEC2020.
  • The X1 variable is missing only in 2020. In 2020, there is an 8% probability that X1 is missing.
  • The X2 variable is missing only on Sundays. On Sundays, there is a 10% probability that X2 is missing.
  • The X3 variable is nonmissing from March through October. During the winter months, there is a 5% probability that X3 is missing.
/* generate FAKE data to use as an example */
data Have;
call streaminit(1234);
format Date DATE9.;
do Date = '01JAN2018'd to '31DEC2020'd;
   x1 = rand("Normal");
   x2 = rand("Normal");
   x3 = rand("Normal");
   /* assign missing values:
      x1 is missing only in 2020
      x2 is missing only on Sundays
      x3 is missing only in Nov, Dec, Jan, and Feb
   */
   if year(Date)=2020            & rand("Bernoulli", 0.08) then x1 = .;
   if weekday(date)=1            & rand("Bernoulli", 0.10) then x2 = .;
   if month(date) in (11,12,1,2) & rand("Bernoulli", 0.05) then x3 = .;
   output;
end;
run;

The goal of this article is to create a graph that indicates how the missing values in the X1-X3 variables depend on time. In a previous article, I showed how to use heat maps in SAS/IML to visualize missing values. In the next section, I use the SAS DATA step to transform the data into a form that you can visualize by using the HEATMAPPARM statement in PROC SGPLOT.

Visualize patterns of missing values

To visualize the patterns of missing values, I use the following ideas:

/* Plot missing values of X1-X3 versus the values of Date. */
%let IndepVar = Date;           /* variable to show on Y axis */
%let DepVars  = x1 x2 x3;       /* visualize missing values for these variables */
 
/* 1. Filter the data to keep only observations that have missing values.
   2. Convert from wide to long. New variables VARNAME and ISMISSING contain the 
      information to plot.   */
data Want;
set Have;
length varName $32;
array xVars[3] &DepVars;        /* response variables (inputs) */
ObsNum = _N_;                   /* if you want the original row number */
if nmiss(of xVars[*]) > 0;      /* keep only obs that have missing values */
do i = 1 to dim(xVars);
   varName = vname(xVars[i]);   /* assign the name of the original variable */
   isMissing = nmiss(xVars[i]); /* binary indicator variable */
   output;
end;
keep ObsNum &IndepVar varName isMissing;
run;
 
/* 3. Define discrete attribute map: missing values are black and nonmissing values are white. */
data AttrMap;
  retain id 'HeatMiss';
  length fillcolor $20;
  value=0; fillcolor='White'; output;  /* nonmissing = white */
  value=1; fillcolor='Black'; output;  /* missing = black */
run;
 
/* 4. Create a heat map that shows the missing values for each variable for each date. */
/*    Thanks to Dan Heath for suggesting that I use INTERVAL=QUARTER
      for major ticks and MINORINTERVAL=MONTH for minor ticks */
ods graphics / width=400px height=600px;
title "Visualize Missing Values";
proc sgplot data=Want dattrmap=AttrMap;
   heatmapparm x=varName y=&IndepVar colorgroup=isMissing / attrid=HeatMiss;
   xaxis display=(nolabel);
   yaxis display=(nolabel) type=time interval=quarter valuesformat=monyy7. 
                           minor minorinterval=month;
run;

The heat map is shown to the right. It displays dates along the vertical axis. The horizontal axis displays the three numerical variables that have missing values. You can use the graph to determine whether the pattern of missing values depend on the Date variable. In the program, I used macro variables to store the name of the X and Y variables so that the program is easier to reuse.

What does the heat map show?

  • For the X1 variable, the heat map clearly shows that the variable does not have any missing values prior to 2020, but then has quite a few in 2020. The graph is consistent with the fact that the missing values in 2020 are distributed uniformly. You can see gaps and clusters, which are common features of uniform randomness.
  • For the X2 variable, the heat map is less clear. It is possible that the missingness occurs uniformly at random for any date. You would need to run an additional analysis to discover that all the missing values occur on Sundays.
  • For the X3 variable, the heat map suggests that the missing values occur in the winter months. However, you should probably run an additional analysis to discover that all the missing values occur between November and February.

You can use PROC FREQ to perform additional analyses on the X2 and X3 variables. The following calls to PROC FREQ use the MONTHw. and WEEKDAYw. formats to format the Date variable. PROC FREQ can perform a two-way analysis to investigate whether missingness in X2 and X3 is associated with days or months, respectively. You can use a mosaic plot to show the results, as follows:

ods graphics / reset;
proc freq data=Want order=formatted;
   where varName="x2";
   format Date WEEKDAY.;
   label IsMissing = "Missing Value for X2" Date = "Day of Week";
   tables IsMissing * Date / plots=MosaicPlot;
run;
 
proc freq data=Want order=formatted;
   where varName="x3";
   format Date MONTH.;
   label IsMissing = "Missing Value for X3" Date = "Month of Year";
   tables IsMissing * Date / plots=MosaicPlot;
run;

For brevity, only the mosaic plot for the month of the year is shown. The mosaic plot shows that all the missing values for X3 occur in November through February. A similar plot shows that the missing values for X2 occur on Sunday.

Summary

This article shows how to use a heat map to plot patterns of missing values for several variables versus a date. By plotting the missing values of X1, X2, X3,..., you can assess whether the missing values depend on time. This can help you to decide whether the missing values occur completely at random (MCAR). The missing values in this article are not MCAR. However, that fact is not always evident from the missing value pattern plot, and you might need to run additional analyses on the missing values.

Although this analysis used a date as the independent variable, that is not required. You can perform similar visualizations for other choices of an independent variable.

The post Visualize dependencies of missing values appeared first on The DO Loop.

9月 142022
 

I recently showed how to represent positive integers in any base and gave examples of base 2 (binary), base 8 (octal), and base 16 (hexadecimal). One fun application is that you can use base 26 to associate a positive integer to every string of English characters. This article shows how to use base 26 to map integers to strings and vice versa. Here's a sneak peek: the string CAT (base 26) is associated with the integer 2398 (base 10).

What is base 26?

For any base b, you can express an integer as a sum of powers of b, such as
\(\sum\nolimits_{i=0}^p {c_i} b^i\)
where the \(c_i\) are integers \(0 \leq c_i < b\). From the coefficients, you can build a string that represents the number in the given base. Traditionally, we use the symbols 0, 1, ..., 9 to represent the first 10 coefficients and use letters of the English alphabet for higher coefficients. However, in base 26, it makes sense to break with tradition and use English characters for all coefficients. This is done in a natural way by associating the symbols {A, B, C, ..., Z} with the coefficient values {0, 1, 2, ..., 25}. Notice that the symbols for the coefficients in base 26 are zero-based (A=0) and not one-based (A≠1), which is different than you might have seen in other applications.

For example, the number 2398 (base 10) can be written as the sum 3*262 + 14*261 + 6*260. If you use English letters to represent the coefficients, then this number equals DOG (base 26) because 3→D, 14→O, and 6→G. In a similar way, the number 1371 (base 10) can be written as 2*262 + 0*261 + 19*260, which equals CAT (base 26) because 2→C, 0→A, and 19→T.

Recall that for base 10 numbers, we typically do not write the numbers with leading zeros. For example, when considering three-digit numbers in base 10, we do not write the numbers 0-99. But if we use leading zeros, we can write these integers as three-digit numbers: 000, 001, 002, ..., 099. In a similar way, you can represent all three-character strings in base 26 (such as AAA, ABC, and ANT) if you include one or more leading zeros. In base 26, a "leading zero" means that the string starts with A. Unfortunately, if you include leading zeros, you lose a unique representation of the integers because A = AA = AAA, and similarly Z = AZ = AAZ. However, it is a small price to pay. To represent character strings that start with the letter A, you must allow leading zeros.

A SAS program to represent integers in base 26

It is straightforward to adapt the SAS DATA step program in my previous article to base 26. (See the previous article for an explanation of the algorithm.) In this version, I represent each integer in the range [0, 17575] in base 26 by using a three-digit string. The number 17575 (base 10) is the largest integer that can be represented by using a three-digit string because 17575 (base 10) = 25*262 + 25*261 + 25*260 = ZZZ (base 26).

The following statements put a few integers in the range [0, 17575] into a SAS data set:

/* Example data: Base 10 integers in the [0, 17575] */
data Base10;
input x @@;      /* x >= 0 */
datalines;
0 25 28 17575 16197 13030 1371 341 11511 903 13030 2398
;

The following DATA step converts these integers to three-character strings in base 26.

/* For simplicity, only consider three-digit strings. The strings
   will contain 'leading zero', which means strings like
   ABC (base 26) = 28 (base 10)
   Three-digit strings correspond to integers 0 - 17575. */
%let maxCoef = 3;      /* number of characters in string that represents the number */
%let base = 26;        /* base for the representation */
%let valueList = ('A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 
                  'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z');
 
data Base26;
array values[&base] $ _temporary_ &valueList;  /* characters to use to encode values */
array c[0:%eval(&maxCoef-1)]  _temporary_;     /* integer coefficients c[0], c[1], ... */
length ID $&maxCoef;                           /* string for base b */
b = &base;                                     /* base for representation */
 
set Base10;                                    /* x is a positive integer; represent in base b */
/* compute the coefficients that represent x in Base b */
y = x;
do k = 0 to &maxCoef-1;
   c[k] = mod(y, b);                           /* remainder when r is divided by b */
   y = floor(y / b);                           /* whole part of division */
   substr(ID,&maxCoef-k,1) = values[c[k]+1];    /* represent coefficients as string */
end;
drop b y k;
run;
 
proc print data=Base26 noobs label;
   label x="Base 10" ID="Base 26";
   var x ID;
run;

After converting a few arbitrary base 10 numbers to base 26, I show a few special numbers that correspond to three-character English words. The last seven numbers in the data set are integers that, when represented in base 26, correspond to the sentence THE CAT AND RAT BIT THE DOG!

Convert from base 26 to base 10

In the previous section, I used base 10 numbers that converted to a complete English sentence: THE CAT AND RAT BIT THE DOG. Obviously, I started with the sentence that I wanted to "find," and then figured out which decimal digits would produce the sentence. In other words, I started with the base 26 representation and computed the base 10 (decimal) representation.

For compactness, I will use SAS/IML software to compute the integer value for each base 26 representation. You could use the DATA step, if you prefer. The SAS/IML language supports a little-known feature (the MATTRIB statement) that enables you to index a matrix by using a string instead of an integer. This enables you to put the decimal numbers 0-25 into an array and index them according to the corresponding base 26 characters. This feature is demonstrated in the following example:

proc iml;
n = 0:25;              /* decimal values */
L = "A":"Z";           /* base 26 symbols */
mattrib n[colname=L];  /* enable indexing n by the letters A-Z */
C = n["C"];
A = n["A"];
T = n["T"];
print C A T;
D = n["D"];
O = n["O"];
G = n["G"];
print D O G;

The output shows that an expression like n['D'] results in the numerical value of the symbol 'D'. For any base 26 string, you can use the SUBSTR function to extract each character in the string. You can use the MATTRIB trick to find the corresponding base 10 value. You can use the position of each character and the definition of base 26 to find the integer represented by each string:

/* convert base 26 strings to integers */
str = {"AAA" "AAZ" "ABC" "ZZZ" "XYZ" "THE" "CAT" "AND" "RAT" "BIT" "THE" "DOG"}`;
Decimal = j(nrow(str),1);
do j = 1 to nrow(str);      
   s = str[j];               /* get the j_th string */
   k = length(s);            /* how many characters? */
   x = 0;
   do i = 0 to k-1;          /* for each character in the string */
      c = substr(s, k-i, 1); /* extract the character at position k-i */
      x = x + n[c]*26**i;    /* use n[c] as the coefficient in the base 26 representation */
   end;
   Decimal[j] = x;           /* save the integer value for the j_th string */
end;
 
print str[L="Base 26"] Decimal;

The output shows the base 10 representation for a set of three-digit base 26 strings. These are the values that I used in the first part of this article. (NOTE: You can vectorize this program and eliminate the inner loop! I leave that as an exercise.)

Summary

This article shows a fun fact: You can use base 26 to associate an integer to every string of English characters. In base 26, the string 'CAT' is associated with 2398 (base 10) and the string 'DOG' is associated with the number 1371 (base 10). This article uses three-digit character strings to demonstrate the method, but the algorithms apply to character strings that contain an arbitrary number of characters.

The post Base 26: A mapping from integers to strings appeared first on The DO Loop.

8月 292022
 

A SAS programmer was trying to understand how PROC SGPLOT orders categories and segments in a stacked bar chart. As with all problems, it is often useful to start with a simpler version of the problem. After you understand the simpler situation, you can apply that understanding to the more complicated situation.

This article shows how PROC SGPLOT in SAS orders categories in a bar chart in three scenarios:

  • Order the categories alphabetically. This is the default order.
  • Order the categories by the height of the bars, which is called the frequency order. This is accomplished by using the CATEGORYORDER= option on the VBAR (or HBAR) statement.
  • Order the categories in a user-specified manner. This is accomplished by using the VALUES= option on the XAXIS statement.

After you understand these three ways to order categories for a simple bar chart, you can investigate how these orderings work for a stacked bar chart.

Three ways to order bar charts

For data, let's use vehicles in the Sashelp.Cars data set and create bar charts that visualize the number of SUVs, sports cars, wagons, and trucks. This data also contains the Origin variable, which specifies the regions (Asia, USA, or Europe) that each vehicle was manufactured:

/* create example data */
data Have;
   set sashelp.cars;
   where Type in ('SUV' 'Sports' 'Truck' 'Wagon');
   keep Type Origin;
run;

Let's visualize the number of SUVs, sports cars, wagons, and trucks. You can use PROC SGPLOT to order the categories of a bar charts in three ways: alphabetical order, ascending (or descending) order by frequency, and a user-specified order. Each bar chart shows the same data, but the order of the bars is different.

ods graphics / width=300px height=240px;
/* three ways to order categories in a bar chart */
title "Categories in Alphabetical Order";
proc sgplot data=Have;
   vbar Type;
   xaxis display=(nolabel);
run;
 
title "Categories in Frequency Order";
proc sgplot data=Have;
   vbar Type / categoryorder=respdesc;
   xaxis display=(nolabel);
run;
 
title "Categories in Arbitrary Order";
proc sgplot data=Have;
   vbar Type;
   xaxis display=(nolabel) values=('Wagon' 'Sports' 'SUV' 'Truck' );
run;

For these data, you can divide each category into subgroups by using the Origin variable. The next section shows how these plots change if you add a GROUP= variable to the VBAR statement.

Order of clustered and stacked bar charts

When you use the GROUP= option on the VBAR statement, you have an option to display the subgroups as a cluster of bars (GROUPDISPLAY=CLUSTER) or as a stacked bar chart (GROUPDISPLAY=STACK). The clustered option is easier to understand, so let's start with that. You can use the GROUP=Origin option to create clustered bar charts that display the number of cars for each category in each of three subsets: vehicles that were manufactured in 'Asia', 'USA', or 'Europe'.

When you introduce groups and different orderings of the data, you need to ensure that the colors of the groups are consistent across graphs. One way to do this is to create and use a discrete attribute map that associates each subgroup with a color: the bars for Origin='Asia' are red, the bars for Origin='USA' are blue, and the bars for Origin='Europe' are gold.

/* create a discrete attribute map to associate a color with values of the Origin variable */
data BarMap;
length ID $10 value $6 linecolor $ 9 fillcolor $ 9;
input ID $ value $ linecolor $ fillcolor $;
datalines;
BarAttr Asia   DarkGray FireBrick
BarAttr USA    DarkGray DarkBlue
BarAttr Europe DarkGray Gold
;

You can now create bar charts that consistently use these colors regardless of the order of the bars. To make the output easy to see, the following program uses the ODS LAYOUT GRIDDED statement to arrange the output in one row that contains three graphs:

/* for GROUP=Origin, examine the three ways to order categories in a bar chart */
%let method = CLUSTER;   /* use CLUSTER or STACK */
ODS LAYOUT GRIDDED columns=3 advance=table column_gutter=8px;
 
title "Categories in Alphabetical Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr group=Origin groupdisplay=&method;
   xaxis display=(nolabel);
run;
 
title "Categories in Frequency Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr categoryorder=respdesc group=Origin groupdisplay=&method;
   xaxis display=(nolabel);
run;
 
title "Categories in Arbitrary Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr group=Origin groupdisplay=&method;
   xaxis display=(nolabel) values=('Wagon' 'Sports' 'SUV' 'Truck' );
run;
ODS LAYOUT END;

The graphs show that the ordering method (alphabetical or frequency) also extends to the subgroups. For example:

  • In the first graph, the categories are in alphabetical order and so are the subgroups.
  • In the second graph, the categories are in frequency order and so are the subgroups. The order of the subgroups changes depending on the relative frequencies within each category. For example, for the SUV category, the subgroups are ordered as 'Asia', 'USA', and 'Europe'. However, for the Sports category, the subgroups are ordered as 'Europe', 'Asia', and 'USA' because most of the sports cars are European.
  • The third graph manually sets the order for the categories, but the subgroups are plotted in alphabetical order, which is the default. You can also choose reverse-alphabetical order by using the GROUPORDER=DESCENDING option, or you can use the ordering of the groups in the data set (GROUPORDER=DATA).

Ordering categories and groups in stacked bar charts

There is a simple way to understand the order of bars and bar segments in a stacked bar chart in PROC SGPLOT. First, create a clustered bar chart, as shown in the previous section. Then, change GROUPDISPLAY=CLUSTER to GROUPDISPLAY=STACK and rerun the program. In the previous section, I wrote the code so that I only need to change one line:
%let method = STACK; /* use CLUSTER or STACK */
With that change, you can re-run the program to obtain the following graphs:

The graphs show that the order of segments in a stacked bar chart is the same as the order of bars in a clustered bar chart (from left to right). Specifically:

  • In the first stacked bar chart, the subgroups are in alphabetical order from bottom to top.
  • In the second stacked bar chart, the subgroups are in frequency order. For example, for the SUV category, the subgroups are ordered as 'Asia' (the bottom segment), 'USA' (the middle segment), and 'Europe' (the top segment). However, for the Sports category, the subgroups are ordered as 'Europe', 'Asia', and 'USA'. Be careful if you use the CATEGORYORDER=RESPDESC option: it can be confusing to view a graph in which the order of the segments differs between bars.
  • The third stacked bar chart modifies the order for the categories, but the subgroups are plotted from bottom to top in alphabetical order. This provides a viewer with consistency across the bars.

Summary

This article shows how to understand how PROC SGPLOT in SAS orders bars in a bar chart. There are essentially three ways to order bars: alphabetically, by frequency, or by specifying the order manually. When you use the GROUP= option, you get either a clustered bar chart or a stacked bar chart. The order of the subgroups is best understood by looking at the clustered bar chart. The order of bars in the clusters (from left to right) is the same as the order of the segments in a stacked bar chart (from bottom to top). Be aware that the CATEGORYORDER= option also orders the subgroups. This can be confusing to the viewer because segments in the stacked bars might "move around" depending on their relative frequencies.

The post Order the bars in a bar chart with PROC SGPLOT appeared first on The DO Loop.

8月 242022
 

A SAS programmer asked how to display long labels at irregular locations along the horizontal axis of scatter plot. The labels indicate various phases of a clinical study. This article discusses the problem and shows how to use the FITPOLICY=STAGGER option on the XAXIS or X2AXIS statement to avoid collisions of the tick labels. As a bonus, the Appendix shows how to solve the problem when the locations of the labels are stored in a SAS data set.

The challenge of non-overlapping labels

One of the challenges in statistical graphics is that long labels in plots can overlap, which makes the labels difficult to read. There are a few standard tricks for dealing with long or densely packed labels in graphs:

This article demonstrates a new technique for your graphical toolbox: How to use the FITPOLICY= option on the XAXIS or X2AXIS statements to prevent tick labels from overlapping. Specifically, this article shows the FITPOLICY=STAGGER option.

Illustrate the problem

To understand the problem, let's construct some hypothetical data about a weight-loss study. The analyst wants to display the weekly mean weight loss of the participants in the program along with reference lines and labels that visualize various phases and events in the study. The following program uses the REFLINE statement in PROC SGPLOT to specify the location of the events. By default, the labels for the events are displayed at the top of the scatter plot:

data Have;
label MeanValue = "Mean Weight Loss (kg)";
input Day MeanValue @@;
datalines;
0  0  7 1  14 2.2  21 2.7  28 3.1  
35 3.3 42 3.6 49 4.1 56 5.0 63 5.6
70 5.9 77 6.3 84 6.5
;
 
title "Mean Weight Loss in Program";
title2 "First Attempt";
proc sgplot data=Have noautolegend;
   scatter x=Day y=MeanValue;
   refline (0 21 25 42 60 84) / axis=x
            label=('Phase 1' 'Phase 2' 'Adjustment' 'Phase 3' 'End Treatment' 'End Study');
run;

For this set of labels and for the default size of the plot, the REFLINE statement decided to plot the labels vertically because the 'Phase 2' and the 'Adjustment' labels are close to each other and would overlap if the labels are displayed horizontally. The plot isn't terrible looking, but we can do better. I would prefer for the labels to appear horizontally rather than to be rotated by 90 degrees.

Sometimes, you can handle long labels on a horizontal axis by simply making the graph wider. For example, you might try to use
ODS GRAPHICS / width=800px height=400px;
to see whether a wider plot enables the reference labels to display side-by-side without overlapping. For these data, the distance between the 'Phase 2' event (Day=21) and the 'Adjustment' event (Day=25) is very close, so making the plot longer does not fix the problem.

Similarly, you can try to use the LABELATTRS= option to decrease the font size of the labels. The SIZE=6 option is the smallest font size that I can read. However, adding the LABELATTRS=(Size=6) option to the REFLINE statement does not fix the problem for these data.

The problem is that the REFLINE statement has limited support for displaying the labels. It checks to see if they can be displayed horizontally without colliding, and, if not, it rotates the labels 90 degrees. In contrast, PROC SGPLOT provides more support for the tick label on an axis. The XAXIS and X2AXIS statements support the FITPOLICY= option, which provides more options for controlling how to handle overlapping labels. The next section removes the REFLINE statement and uses two X axes: one to show the days and another to show the events.

The FITPOLICY=STAGGER option

As mentioned, the XAXIS and X2AXIS statements support the FITPOLICY= option, which supports more than a dozen ways to control the display of the ticks labels. For these data, I will use FITPOLICY=STAGGER, which alternates the placement of the labels in a two-row display. See the documentation for other useful options.

To visualize both the Day values and the events, you can use two axes, one below the scatter plot and one above the plot. In the following graph, the upper axis displays the events, and the lower axis displays the Day. (You could easily make the opposite choice.) The following techniques are used to create the plot:

  • To create the X2 axis, you must create a plot that uses the X2 axis. In this case, I create an invisible scatter plot. It is invisible because the SIZE=0 option tells the plot that the markers have no width. The invisible plot ensures that the X and X2 axes have the same range.
  • You can use the VALUES= option to specify the locations of the tick marks.
  • You can use the VALUESDISPLAY= option to specify the strings that are associated with the tick marks.
  • By adding the GRID option to the X2AXIS statement, you get a vertical line at each tick mark.
title2 "Uses FITPOLICY=STAGGER to Stagger Labels";
proc sgplot data=Have noautolegend;
   scatter x=Day y=MeanValue;
   /* add an invisible scatter plot to the X2 axis (set SIZE=0) */
   scatter x=Day y=MeanValue / x2axis markerattrs=(size=0);
   x2axis display=(nolabel) grid FITPOLICY=stagger 
         values        = (0 21 28 42 60 84)
         valuesdisplay = ('Phase 1' 'Phase 2' 'Adjustment' 'Phase 3' 'End Treatment' 'End Study');
run;

The SAS analyst was very happy to see this graph. Both the days and the events in the study are apparent. None of the tick labels overlap. The text is displayed horizontally.

Summary

This example shows how to use the FITPOLICY=STAGGER option to avoid overlap when you display long tick labels on an axis. The example uses two X axes: one to display the data and another to display related events. To use a second axis (called the X2 axis), you must create a plot that uses the second axis. This article creates an invisible scatter plot, which ensures that the X and X2 axes have the same scale.

In this article, I have hard-coded the locations of the ticks and the labels for each tick mark. However, if this information is in a SAS data set, you can read the data into macro variables and plot the events automatically. This trick is shown in the Appendix.

Appendix

Sometimes the location and labels for the events are stored in a SAS data set. If so, you can read that information into a SAS macro variable and use the macro as the value for the VALUES= and VALUESDISPLAY= options on the XAXIS or X2AXIS statements. You can use PROC SQL and the SELECT INTO (COLON) statement to create a macro variable that contains the data. There is one trick to learn: The values in the VALUESDISPLAY= option must be strings, so when you read the data you should add quotation marks around the strings, as follows:

data Ticks;
length Label $15;
input value Label 5-18;
Label = "'" || trim(Label) || "'";    /* add single quotes to both sides of each string */
datalines;
0   Phase 1
21  Phase 2
25  Adjustment
42  Phase 3
60  End Treatment
84  End Study
;
 
/* put the list of values into macro variables */
proc sql noprint;                              
 select value into :TickList separated by ' '
 from Ticks;
 select Label into :LabelList separated by ' '
 from Ticks;
quit;
%put &=TickList;
%put &=LabelList;
 
/* use the macro variables in the VALUES= and VALUESDISPLAY= options */
proc sgplot data=Have noautolegend;
   scatter x=Day y=MeanValue;
   scatter x=Day y=MeanValue / x2axis markerattrs=(size=0);/* invisible */
   x2axis display=(nolabel) grid FITPOLICY=stagger 
         values        = (&tickList)
         valuesdisplay = (&LabelList);  
run;

The graph is the same as the previous example, which hard-coded the tick values and strings.

The post How to stagger labels on an axis in PROC SGPLOT appeared first on The DO Loop.

6月 202022
 

For a linear regression model, a useful but underutilized diagnostic tool is the partial regression leverage plot. Also called the partial regression plot, this plot visualizes the parameter estimates table for the regression. For each effect in the model, you can visualize the following statistics:

  • The estimate for each regression coefficient in the model.
  • The hypothesis tests β0=0, β1=0, ..., where βi is the regression coefficient for the i_th effect in the model.
  • Outliers and high-leverage points.

This article discusses partial regression plots, how to interpret them, and how to create them in SAS. If you are performing a regression that uses k effects and an intercept term, you will get k+1 partial regression plots.

Example data for partial regression leverage plots

The following SAS DATA step uses Fisher's iris data. To make it easier to discuss the roles of various variables, the DATA step renames the variables. The variable Y is the response, and the explanatory variables are x1, x2, and x3. (Explanatory variables are also called regressors.)

In SAS, you can create a panel of partial regression plots automatically in PROC REG. Make sure to enable ODS GRAPHICS. Then you can use the PARTIAL option on the MODEL statement PROC REG statement to create the panel. To reduce the number of graphs that are produced, the following call to PROC REG uses the PLOTS(ONLY)=(PARTIAL) option to display only the partial regression leverage plots.

data Have;
set sashelp.iris(rename=(PetalLength = Y
                 PetalWidth  = x1
                 SepalLength = x2
                 SepalWidth  = x3)
                 where=(Species="Versicolor"));
ID = _N_;
label Y= x1= x2= x3=;
run;
 
ods graphics on;
title "Basic Partial Regression Leverage Plots";
proc reg data=Have plots(only)=(PartialPlot);
   model Y = x1 x2 x3 / clb partial;
   ods select ParameterEstimates PartialPlot;
quit;

Let's call the parameter for the Intercept term the 0th coefficient, the parameter for x1 the 1st coefficient, and so on. Accordingly, we'll call the upper left plot the 0th plot, the upper right plot the 1st plot, and so on.

A partial regression leverage plot is a scatter plot that shows the residuals for a specific regressions model. In the i_th plot (i=0,1,2,3), the vertical axis plots the residuals for the regression model where Y is regressed onto the explanatory variables but omits the i_th variable. The horizontal axis plots the residuals for the regression model where the i_th variable is regressed onto the other explanatory variables. For example:

  • The scatter plot with the "Intercept" title is the 0th plot. The vertical axis plots the residual values for the model that regresses Y onto the no-intercept model with regressors x1, x2, and x3. The horizontal axis plots the residual values for the model that regresses the Intercept column in the design matrix onto the regressors x1, x2, and x3. Thus, the regressors in this plot omit the Intercept variable.
  • The scatter plot with the "x1" title is the 1st plot. The vertical axis plots the residual values for the model that regresses Y onto the model with regressors Intercept, x2, and x3. The horizontal axis plots the residual values for the model that regresses x1 onto the regressors Intercept, x2, and x3. Thus, the regressors in this plot omit the x1 variable.

These plots are called "partial" regression plots because each plot is based on a regression model that contains only part of the full set of regressors. The i_th plot omits the i_th variable from the set of regressors.

Interpretation of a partial regression leverage plot

Each partial regression plot includes a regression line. It is this line that makes the plot useful for visualizing the parameter estimates. The line passes through the point (0, 0) in each plot. The slope of the regression line in the i_th plot is the parameter estimate for the i_th regression coefficient (βi) in the full model. If the regression line is close to being horizontal, that is evidence for the null hypothesis βi=0.

To demonstrate these facts, look at the partial regression plot for the Intercept. The partial plot has a regression line that is very flat (almost horizontal). This is because the parameter estimate for the Intercept is 1.65 with a standard error of 4. The 95% confidence interval is [-6.4, 9.7]. Notice that this interval contains 0, which means that the flatness of the regression line in the partial regression plot supports "accepting" (failing to reject) the null hypothesis that the Intercept parameter is 0.

In a similar way, look at the partial regression plot for the x1 variable. The partial plot has a regression line that is not flat. This is because the parameter estimate for x1 is 1.36 with a standard error of 0.24. The 95% confidence interval is [0.89, 1.83], which does not contain 0. Consequently, the steepness of the slope of the regression line in the partial regression plot visualizes the fact that we would reject the null hypothesis that the x1 coefficient is 0.

The partial regression plots for the x2 and x3 variables are similar. The regression line in the x2 plot has a steep slope, so the confidence interval for the x2 parameter does not contain 0. The regression line for the x3 variable has a negative slope because the parameter estimate for x3 is negative. The line is very flat, which indicates that the confidence interval for the x3 parameter contains 0.

John Sall ("Leverage Plots for General Linear Hypotheses", TAS, 1990) showed that you could add a confidence band to the partial regression plot such that if the line segment for Y=0 is not completely inside the band, then you reject the null hypothesis that the regression coefficient is 0.

Identify outliers by using partial regression leverage plots

Here is a remarkable mathematical fact about the regression line in a partial regression plot: the residual for each observation in the scatter plot is identical to the residual for the same observation in the full regression model! Think about what this means. The full regression model in this example is a set of 50 points in four-dimensional space. The regression surface is a 3-D hyperplane over the {x1, x2, x3} variables. Each observation has a residual, which is obtained by subtracting the predicted value from the observed value of Y. The remarkable fact is that in each partial regression plot, the residuals between the regression lines and the 2-D scatter points are exactly the same as the residuals in the full regression model. Amazing!

One implication of this fact is that you can identify points where the residual value is very small or very large. The small residuals indicate that the model fits these points well; the large residuals are outliers for the model.

Let's identify some outliers for the full model and then locate those observations in each of the partial regression plots. If you run the full regression model and analyze the residual values, you can determine that the observations that have the largest (magnitude of) residuals are ID=15, ID=39, ID=41, and ID=44. Furthermore, the next section will look at high-leverage points, which are ID=2 and ID=3. Unfortunately, the PLOTS=PARTIALPLOT option does not support the LABEL suboption, so we need to output the partial regression data and create the plots manually. The following DATA step adds a label variable to the data and reruns the regression model. The PARTIALDATA option on the MODEL statement creates an ODS table (PartialPlotData) that you can write to a SAS data set by using the ODS OUTPUT statement. That data set contains the coordinates that you need to create all of the partial regression plots manually:

/* add indicator variable for outliers and high-leverage points */
data Have2;
set Have;
if ID in (2,3) then Special="Leverage";
else if ID in (15,39,41,44) then Special="Outlier ";
else Special = "None    ";
if Special="None" then Label=.;
else Label=ID;
run;
 
proc reg data=Have2 plots(only)=PartialPlot;
   model Y = x1 x2 x3 / clb partial partialdata;
   ID ID Special Label;
   ods exclude PartialPlotData;
   ods output PartialPlotData=PD;
quit;

You can use the PD data set to create the partial regression plot. The variables for the horizontal axis have names such as Part_Intercept, Part_x1, and so forth. The variables for the vertical axis have names such as Part_Y_Intercept, Part_Y_x1, and so forth. Therefore, it is easy to write a macro that creates the partial regression plot for any variable.

%macro MakePartialPlot(VarName);
   title "Partial Leverage Plot for &VarName";
   proc sgplot data=PD ;
      styleattrs datasymbols=(CircleFilled X Plus);
      refline 0 / axis=y;
      scatter y=part_Y_&VarName x=part_&VarName / datalabel=Label group=Special;
      reg     y=part_Y_&VarName x=part_&VarName / nomarkers;
   run;
%mend;

The following ODS GRAPHICS statement uses the PUSH and POP options to temporarily set the ATTRPRIORITY option to NONE so that the labeled points appear in different colors and symbols. The program then creates all four partial regression plots and restores the default options:

ods graphics / PUSH AttrPriority=None width=360px height=270px;
%MakePartialPlot(Intercept);
%MakePartialPlot(x1);
%MakePartialPlot(x2);
%MakePartialPlot(x3);
ods graphics / POP;      /* restore ODS GRAPHICS options */

The first two plots are shown. The outliers (ID in (15,39,41,44)) will have large residuals in EVERY partial regression plot. In each scatter plot, you can see that the green markers with the "plus" (+) symbol are far from the regression line. Therefore, they are outliers in every scatter plot.

Identify high-leverage points by using partial regression leverage plots

In a similar way, the partial regression plots enable you to see whether a high-leverage point has extreme values in any partial coordinate. For this example, two high-leverage points are ID=2 and ID=3, and they are displayed as red X-shaped markers.

The previous section showed two partial regression plots. In the partial plot for the Intercept, the two influential observations do not look unusual. However, in the x1 partial plot, you can see that both observations have extreme values (both positive) for the "partial x1" variable.

Let's look at the other two partial regression plots:


The partial regression plot for x2 shows that the "partial x2" coordinate is extreme (very negative) for ID=3. The partial regression plot for x3 shows that the "partial x3" coordinate is extreme (very negative) for ID=2. Remember that these extreme values are not the coordinates of the variables themselves, but of the residuals when you regress each variable onto the other regressors.

It is not easy to explain in a few sentences how the high-leverage points appear in the partial regression plots. I think the details are described in Belsley, Kuh, and Welsch (Regression Diagnostics, 1980), but I cannot check that book right now because I am working from home. But these extreme values are why the Wikipedia article about partial regression plots states, "the influences of individual data values on the estimation of a coefficient are easy to see in this plot."

Summary

In summary, the partial regression leverage plots provide a way to visualize several important features of a high-dimensional regression problem. The PROC REG documentation includes a brief description of how to create partial regression leverage plots in SAS. As shown in this article, the slope of a partial regression line equals the parameter estimate, and the relative flatness of the line enables you to visualize the null hypothesis βi=0.

This article describes how to interpret the plots to learn more about the regression model. For example, outliers in the full model are also outliers in every partial regression plot. Observations that have small residuals in the full model also have small residuals in every partial regression plot. The high-leverage points will often show up as extreme values in one or more partial regression plots. To examine outliers and high-leverage plots in SAS, you can use the PARTIALDATA option to write the partial regression coordinates to a data set, then add labels for the observations of interest.

Partial regression leverage plots are a useful tool for analyzing the fit of a regression model. They are most useful when the number of observations is not too big. I recommend them when the sample size is 500 or less.

The post Partial leverage plots appeared first on The DO Loop.

6月 152022
 

The ODS GRAPHICS statement in SAS supports more than 30 options that enable you to configure the attributes of graphs that you create in SAS. Did you know that you can display the current set of graphical options? Furthermore, did you know that you can temporarily set certain options and then restore the options to their previous values? This article shows how to use the SHOW, PUSH, POP, and RESET options on the ODS GRAPHICS statement.

SHOW the ODS GRAPHICS options

ODS graphics have a default set of characteristics, but you can use the ODS GRAPHICS statement to override the default characteristics for graphs. Probably the most familiar options are the WIDTH= and HEIGHT= options, which set a graph's horizontal and vertical dimensions, respectively. For example, the following ODS GRAPHICS statement resets all options to their default values (RESET) and then sets the dimensions of future graphs. You can use the SHOW option on the ODS GRAPHICS statement to list the default value of several graphical options:

ods graphics / reset width=480px height=360px;
ods graphics / SHOW;

The SAS log shows the value of many options. The first few lines are shown below:

ODS Graphics Settings
---------------------
Output format:                     STATIC
By line:                           NOBYLINE
Antialias:                         ON
Maximum Loess observations:        5000
Image width:                       480px
Image height:                      360px
Maximum stack depth:               1024
Stack depth:                       0
... other options omitted ...

The log reports that the width and height of future graphs are set to 480 pixels and 360 pixels, respectively. Note that the "stack depth" is set to 0, which means that no options have been pushed onto the stack. I will say more about the stack depth in a later section.

To demonstrate that the new options are in effect, the following call to PROC SGPLOT creates a scatter plot of two variables in the SasHelp.Iris data set and uses the GROUP= option to visualize the three different species of flowers in the data:

title "Indicate Groups by Using GROUP= Option";
title2 "HTML Destination Uses AttrPriority=COLOR";
proc sgplot data=sashelp.iris;
   scatter x=PetalWidth y=SepalWidth/ group=Species;
run;

This image shows what the graph looks like in the HTML destination, which uses the ATTRPRIORITY=COLOR option by default. When you use the ATTRPRIORITY=COLOR option, groups are visualized by changing the color of markers (or lines). Thus, some markers are blue, others are reddish, and others are green. There is another option (ATTRPRIORITY=NONE), which is used by other ODS destinations, which visualizes groups by changing the color and symbol of markers (and the color and line pattern for lines). For more information about the ATTRPRIORITY= option, see "Getting started with SGPLOT: Style Attributes."

PUSH new options onto the stack

If you override the default value of an option, it will affect all future graphs until you reset the option or end the SAS session. Often, this is exactly what you want to happen. However, sometimes you want to temporarily override options, create some graphs, and then restore the existing set of options. For example, if you are producing code for others to use (such as writing a SAS macro), it is a good programming practice to not change any options that the user has set. For this situation, you can use the PUSH and POP options on the ODS GRAPHICS statement.

Let's see how the PUSH option works. The following statement temporarily overrides the ATTRPRIORITY= option by pushing the option NONE onto the current stack of options. (If you are not familiar with the stack data structure, read the Wikipedia article about stacks.) The call to PROC SGPLOT then creates a scatter plot. The scatter plot uses both colors and symbols to visualize the species of flowers.

ods graphics / PUSH AttrPriority=None;    /* temporarily change option to NONE */
 
title "Indicate Groups by Using Symbols";
title2 "Use AttrPriority=NONE";
proc sgplot data=sashelp.iris;
   scatter x=PetalWidth y=SepalWidth/ group=Species;
run;

You can confirm that the options have changed by using the SHOW option to see the new values of the ODS GRAPHICS options:

ods graphics / SHOW;
ODS Graphics Settings
---------------------
Output format:                     STATIC
By line:                           NOBYLINE
Antialias:                         ON
Maximum Loess observations:        5000
Image width:                       480px
Image height:                      360px
Attribute priority:                NONE
Maximum stack depth:               1024
Stack depth:                       1
... other options omitted ...

The SAS log confirms that the ATTRPRIORITY= option is set to NONE. In addition, the "stack depth" value is now 1, which means that you have pushed options onto the stack. Additional ODS GRAPHICS statements will affect the state of this stack, but they are "temporary" in the sense that you can use the POP option to restore the previous options, as shown in the next section.

POP the stack to restore old options

If you use the PUSH option to create a new set of options, you can use the POP option to restore the previous options. For example, the following statement pops the stack and displays the current options to the SAS log:

ods graphics / POP;           /* restore previous options */
ods graphics / SHOW;
ODS Graphics Settings
---------------------
Output format:                     STATIC
By line:                           NOBYLINE
Antialias:                         ON
Maximum Loess observations:        5000
Image width:                       480px
Image height:                      360px
Maximum stack depth:               1024
Stack depth:                       0
... other options omitted ...

The log shows that the ATTRPRIORITY= option is no longer set, which means that each destination will use its default behavior. In addition, the stack depth is back to 0, which means that the options that are displayed are the "base" options at the bottom of the stack. Notice that the image width and height are not the default values because those options were set before the first PUSH option was set.

RESET all or some options

Even though the width and height options were set before the first PUSH operation, you can still restore the width and height to their default values. The RESET= option can reset individual options, as follows:

ods graphics / reset=width reset=height;    /* restore width and height to default values */

You can restore all options to their default values by using the RESET=ALL option. Alternatively, you can use the RESET option without an equal sign.

/* restore all options to default values */
ods graphics / reset=all;    /* same as ODS GRAPHICS / RESET; */

Summary

The ODS GRAPHICS statement enables you to set options that affect the characteristics of future graphs. If you want to temporarily change the options, you can use the PUSH option to add options to the stack. All future graphs will use those options. You can restore the previous set of options by using the POP option to pop the stack. If you are writing code for others to use (for example, a SAS macro), it is good programming practice to push your options onto a stack and then pop the options after your program completes. In this way, you will not overwrite options that the use of your program has explicitly set.

If you set an option but later want to restore it to its default value, you can use the RESET= option.

The post PUSH, POP, and reset options for ODS graphics appeared first on The DO Loop.

4月 202022
 

Recently, I showed how to use a heat map to visualize measurements over time for a set of patients in a longitudinal study. The visualization is sometimes called a lasagna plot because it presents an alternative to the usual spaghetti plot. A reader asked whether a similar visualization can be created for subjects if the response is an ordinal variable, such as a count. Yes! And the heat map approach is a substantial improvement over a spaghetti plot in this situation.

This article pulls together several techniques from previous articles:

This article uses the following data, which represent the counts of malaria cases at five clinics over a 14-week time period:

data Clinical;
input SiteID @;
do Week = 1 to 14;
   input Count @;
   output;
end;
/* ID Wk1  Wk2  Wk3  Wk4 ... Wk14 */
datalines;
001  1 0 0 0 0 0 0 3 1 3 3 0 3 0
002  0 0 0 1 1 2 1 2 2 1 1 0 2 2
003  1 . . 1 0 1 0 3 . 1 0 3 2 1
004  1 1 . 1 0 1 2 2 3 2 1 0 . 0
005  1 1 1 . 0 0 0 1 0 1 2 4 5 1
;
 
title "Spaghetti Plot of Counts for Five Clinics";
proc sgplot data=Clinical;
   series x=Week y=Count / group=SiteID; 
   xaxis integer values=(1 to 14) valueshint;
run;

The line plot is not an effective way to visualize these data. In fact, it is almost useless. Because the counts are discrete integer values, and most counts are in the range [0, 3], the graph cannot clearly show the weekly values for any one clinic. The following sections develop a heat map that visualizes these data better.

Format the raw values

The following call to PROC FORMAT defines a format that associates character strings with values of the COUNT variable:

proc format;
value CountFmt
      . = "Not Counted"
      0 = "None"
      1 = "1"
      2 = "2"
      3 = "3" 
      4 - high = "4+";
run;

You can use this format to encode values and display them in a legend. Notice that you could also use a format to combine counts, such as using the word "Few" to describe 2 or 3 counts.

Associate colors to formatted values

A discrete attribute map ensures that the colors will not change if the data change. For ordinal data, it also ensures that the legend will be in ordinal order, as opposed to "data order" or alphabetical order.

You can use a discrete data map to associate graphical attributes with the formatted value of a variable. Examples of graphical attributes include marker colors, marker symbols, line colors, and line patterns. For a heat map, you want to associate the "fill color" of each cell with a formatted value. The following DATA step creates the mapping between values and colors. Notice that I use the PUTN function to apply the format to raw data values. This ensures that the mapping correctly associates formatted values with colors. The raw values are stored in an array (VAL) as are the colors (COLOR). This makes it easy to modify the map in the future or to adapt it to other situations.

data MyAttrs;
length Value $11 FillColor $20;
retain ID 'MalariaCount'               /* name of map */
     Show 'AttrMap';                   /* always show all groups in legend */
 
/* output the formatted value and color for a missing value */
Value = putn(., "CountFmt.");          /* formatted value */
FillColor = "LightCyan";               /* color for missing value */
output;
/* output the formatted values and colors for nonmissing values */
array   val[5]     _temporary_ (0 1 2 3 4);
array color[5] $20 _temporary_ ('White' 'CXFFFFB2' 'CXFECC5C' 'CXFD8D3C' 'CXE31A1C');
do i = 1 to dim(val);
   Value = putn(val[i], "CountFmt.");  /* formatted value for this raw value */
   FillColor = color[i];               /* color for this formatted value */
   output;
end;
drop i;
run;

Create a discrete heat map

Now you can create a heat map that uses the format and discrete attribute map from the previous sections. To use the map, you must specify two pieces of information:
  • Use the DATTRPMAP= option on the PROC SGPLOT statement to specify the name of the data set that contains the map.
  • Because a data set can contain multiple maps, use the ATTRID= option on the HEATMAPPARM statement to specify the value of the ID variable that contains the attributes for these data.
title "Heat Map of Malaria Data";
proc sgplot data=Clinical DATTRMAP=MyAttrs; /* <== the data set that contains attributes */
   format Count CountFmt.;                  /* <== apply the format to bin data */
   heatmapparm x=Week y=SiteID colorgroup=Count / outline outlineattrs=(color=gray)
               ATTRID=MalariaCount;         /* <== the discrete attribute map for these data */
   discretelegend;
   refline (1.5 to 5.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(1 to 14) valueshint;
run;

The heat map makes the data much clearer than the spaghetti map. For each clinic and each week, you can determine how many cases of malaria were seen. As mentioned earlier, you can use this same technique to classify counts into categories such as None, Few, Moderate, Many, and so forth.

Summary

You can use SAS formats to bin numerical data into ordinal categories. You can use a discrete attribute map to associate colors and other graphical attributes with categories. By combining these techniques, you can create a heat map that enables you to visualize ordinal responses for subjects over time. The heat map is much better at visualizing the data than a spaghetti plot is.

The post Use a heat map to visualize an ordinal response in longitudinal data appeared first on The DO Loop.