7月 102019
 

I recently showed how to create an annotation data set that will overlay cell counts or percentages on a mosaic plot. A mosaic plot is a visual representation of a cross-tabulation of observed frequencies for two categorical variables. The mosaic plot with cell counts is shown to the right. The previous article focused on how to create the mosaic plot and how to apply an annotation, assuming that the centers of each cell were provided. This article shows how to compute the center of the cells from the output of PROC FREQ.

A motivating example

In a two-way mosaic plot, the widths of the bars represent the proportions of the levels of the categorical variables. It makes sense, therefore, to use [0, 100] as the coordinate system for the mosaic plots. For example, suppose the horizontal variable has three levels (A, B, and C) and the proportion of A is 20%, the proportion of B is 30%, and the proportion of C is 50%. Then the widths of the bars in the coordinate system are 20, 30, and 50, respectively. The first bar covers the range [0, 20], the second bar covers [20, 50], and the third bar covers [50, 100]. The widths of the bars will occupy 20%, 30%, and 50% of the width of the mosaic plot. In general, if the widths of the bars are w1, w2, and w3, the bars cover the intervals [0, w1], [w1, w1+w2], and [w1+w2, 100]. The centers of the bars are at w1/2, w1 + w2/2, and w1 + w2 + w3/2, respectively.

Each bar is composed of smaller vertical bars. If you adopt a [0, 100] coordinate system for the vertical dimension, the heights of the sub-bars are the conditional proportions of the vertical variable within each level of the horizontal variable.

Obtain the proportions and conditional proportions from PROC FREQ

You can use PROC FREQ to obtain the proportions of each level and joint level. You can use the proportions to compute the center of each cell in a [0, 100] x [0, 100] coordinate system. I recommend using the CROSSLIST and SPARSE options, as shown in the following example, which shows a mosaic plot for the Origin and Type variables in the Sashelp.Cars data. For portability (and, hopefully, clarity), I have defined two macros that contain the name of the horizontal and vertical variables.

%let hVar = Type;             /* name of horizontal variable */
%let vVar = Origin;           /* name of vertical variable */
 
proc freq data=Sashelp.Cars;
   where Type ^= 'Hybrid';
   tables &vVar * &hVar / crosslist sparse 
                          out=FreqOut(where=(Percent^=.));
   ods output CrossList=FreqList;    /* output CROSSLIST table */
run;

The CROSSLIST option creates a data set where the formatted values of the horizontal and vertical variables (Origin and Type, respectively) begin with the prefix "F_", which stands for "formatted values." These are always character variables. You can use WHERE clauses to subset the data that you need to compute the centers of the mosaic cells:

  • When F_Origin equals "Total," the Frequency and Percent variables contain the information you need to compute the width of the mosaic bars.
  • The other rows contain the information needed to compute the heights of the bars for each stacked bar. The ColPercent variable contains the conditional proportions for each bar.

The following PROC PRINT statements show the relevant information for this example:

title "Horizontal Percentages: &hVar";
proc print data=FreqList;
   where F_&vVar='Total' & F_&hVar^='Total';
   var F_&vVar F_&hVar Frequency Percent;
run;
 
title "Vertical Percentages: &vVar";
proc print data=FreqList;
   where F_&vVar ^= 'Total' & F_&hVar ^= 'Total';
   by F_&vVar notsorted;
   var F_&vVar F_&hVar Frequency Percent ColPercent;
run;

Find the centers of each cell

As shown in the previous section, the FreqList data set contains all the information you need to find the center of each cell in the mosaic plot. Mathematically, it is not hard to convert that information into the coordinates of the cell centers. Programmatically, the process is complicated. The following SAS/IML program computes the centers by using the following steps:

  1. Read the percentages for the levels of the horizontal variable.
  2. Use these and the CUSUM function to find the horizontal centers of the bars.
  3. For each level of the horizontal variable, read the percentages for the levels of the vertical variable.
  4. Use these to find the vertical centers of each stacked bar.
  5. Write this information to a SAS data set in the long form. Include the observed frequencies and percentages for each cell.
/* Read CROSSLIST data set and write data set that contains centers of cells in mosaic plot.
  Main idea: If a categorical variable has three levels and observed proportions are
  20, 30, and 50, then the midpoints of the bars are
       10 = 20/2
       35 = 20 + 30/2
       75 = 20 + 30 + 50/2
*/
proc iml;
/* 1. read percentages for the horizontal variable */
use FreqList;
read all var {Percent F_&hVar F_&vVar}
         where (F_&vVar='Total' & F_&hVar^='Total');
hPos = Percent;            
nH = nrow(hPos);
hLabel = F_&hVar;
 
/* 2. horizontal centers
   h1/2,  h1 + h2/2, h1 + h2 + h3/2, h1 + h2 + h3 + h4/2, ... */
halfW = hPos / 2;
hCenter = halfW + cusum(0 // hPos[1:(nH-1)]);
print (hCenter`)[c=hLabel L="hCenter" F=5.2];
 
/* 3. For each column, read cell percentages and frequencies */
read all var {Frequency Percent ColPercent F_&vVar F_&hVar}
     where (F_&vVar^='Total' &  F_&hVar ^= 'Total');
close;
FhVar = F_&hVar;              /* levels of horiz var */
FvVar = F_&vVar;              /* lavels of vert var */
*print FhVar fVVar Frequency Percent ColPercent;
 
/* 4. Get the counts and percentages for each cell.
   Vertical centers are 
   v1/2,  v1 + v2/2, v1 + v2 + v3/2, ... */
vLabel = shape( FvVar, 0, nH )[,1];
vPos = shape( ColPercent, 0, nH );
nV = nrow(vPos);
halfW = vPos / 2;
vCenters = j(nrow(vPos), ncol(vPos), .);
do i = 1 to nH;
   vCenters[,i] = halfW[,i] + cusum(0 // vPos[1:(nV-1),i]);
end;
print vCenters[r=vLabel c=hLabel F=5.2];
 
/* 5. convert to a long format: (hPos, vPos, Freq, Pct) 
   and write to SAS data set */
hCenters = repeat(hCenter, nV);
CellFreq = shape( Frequency, 0, nH );
CellPct = shape( Percent, 0, nH );
result = colvec(hCenters) || colvec(vCenters) || 
         colvec(CellFreq) || colvec(CellPct);
*print FhVar FvVar result[c={"hCenter" "vCenter" "Freq" "Pct"}];
 
/* Optional: You might want to get rid of any labels that account to fewer 
   than 2% (or 1%) of the total. The criterion is up to you. For example: 
   idx = loc( result[,4] > 2 );  * keep if greater than 2% of total;
   result = result[idx, ];
*/
 
/* write character and numeric vars separately, then merge together */
/* Character vars: pairwise labels of horiz and vertical categories */
create labels var {FhVar FvVar}; append; close;
/* Numeric vars: centers of cells, counts, and percentages */
create centers from result[c={"hCenter" "vCenter" "Freq" "Pct"}];
   append from result;
close;
QUIT;
 
data annoData;
   merge labels centers;
run;

Convert centers into an annotation data set

The variables and values in an SG annotation data set must use special names so that the annotation facility knows how to interpret the data. For an introduction to GTL annotation, see the GTL documentation or Warren Kuhfeld's free e-book Advanced ODS Graphics Examples. To overlay text, you need to include the following information:

  • The Label variable specifies the text to display.
  • The x1 and y1 variables specify the coordinates of the label (in the LAYOUTPERCENT drawing space).
  • The Width variable specifies the width of the label and the Anchor variable specifies how the text is anchored (left, right, centered,...) at the (x1, y1) location.

However, as discussed in my previous article, a "region plot" such as the mosaic plot does not support the GRAPHPERCENT or WALLPERCENT drawing areas. You have to use the LAYOUTPERCENT drawing area, which includes space for the axes. Therefore, you cannot merely use the centers as the (x1, x2) coordinates. Instead, you need to linearly transform the centers so that they correctly align with the mosaic cells.

I do not know how to do this step in a general way that will accommodate all situations. You need to look at the graph (including its physical dimensions, font sizes, the length of the labels, etc.) and make a guess about how to linearly transform the center data. In the following program, I estimate that the axis area is about 10% of the horizontal and vertical portion of the layout drawing region. Therefore, I shrink the centers by 10% (that is, multiply by 0.9) and translate the result by 10%. You might need to use different values for your data.

The following statements create the annotation data step. For completeness, I've also included the statements that define the mosaic template and create the graph with the annotation.

/* If we could use the WALLPERCENT drawing space, we could
   use (hCenter, vCenter) as the drawing coordinates:
   x1 = hCenter;
   y1 = vCenter;
 
   Unfortunately, we have to use LAYOUTPERCENT, so perform a linear
   transformation from "wall coordinates" to layout coordinates.
*/
data anno;
set AnnoData;
length label $12;
/* use RETAIN stmt to define values that are constant */
retain function 'text' 
       y1space 'layoutpercent' x1space 'layoutpercent'
       width 4        /* make larger if you are plotting percentages */
       anchor 'center';
/* Guess a linear transform to LAYOUTPERCENT coordinates.
   Need to move inside graph area, so shrink and 
   translate to correct for left and bottom axes areas */
x1 = 0.9*hCenter + 10;
y1 = 0.9*vCenter + 10;   
label = put(Freq, 5.);
run;
 
/* Note: PROC FREQ reverses Y axis b/c it sorts the FreqOut data in descending order. */
proc template;
  define statgraph mosaicPlotParm;
  dynamic _VERTVAR _HORZVAR _FREQ _TITLE;
    begingraph;
      entrytitle _TITLE;
      layout region;    /* REGION layout, so can't overlay text! */
      MosaicPlotParm category=(_HORZVAR _VERTVAR) count=_FREQ / 
             datatransparency=0.5
             colorgroup=_VERTVAR name="mosaic";
      endlayout;
      annotate;         /* required for annotation */
    endgraph;
  end;
run;
 
proc sgrender data=FreqOut template=mosaicPlotParm sganno=anno;
dynamic _VERTVAR="Origin" _HORZVAR="Type" _FREQ="Count"
        _TITLE="Basic Mosaic Plot with Counts";
run;

The mosaic plot with text annotations is shown at the top of this program.

In the previous article, I showed how you can use the GTL annotation facility to overlay frequency counts or percentages on a mosaic plot in SAS. In this article, I show how to write a program that computes the centers of the cells from the percentages that PROC FREQ provides when you use the CROSSLIST SPARSE option. Unfortunately, after you compute the centers, you cannot use them directly because the mosaic plot does not support the WALLPERCENT drawing space. Instead, you must use the LAYOUTPERCENT drawing space, which means you need to linearly transform the cell centers.

You can download the SAS program that computes the centers of mosaic cells and uses those coordinates to annotate the mosaic plot.

The post Find the center of each cell in a mosaic plot appeared first on The DO Loop.

7月 082019
 

The mosaic plot is a graphical visualization of a frequency table. In previous articles, I showed how to create a mosaic plot in SAS by using PROC FREQ and how to define a template in the Graph Template Language (GTL) by using the MOSAICPARM statement. This article shows how to display additional information on a mosaic plot. The two techniques in this article are

  • For interactive displays, add tool tips (also called infotips or "data tips") to the mosaic plot. When you hover the mouse pointer over a cell, SAS displays information about cell counts and percentages.
  • For static displays, use the GTL annotation facility to overlay cell counts or percentages on a two-way mosaic plot. A result is shown to the right.

Add data tips to any SAS graph

When you analyze data in SAS, many SAS procedures can automatically create graphs that are appropriate for the analysis. Most of these graphs support data tips that provide information about the data when you hover a mouse pointer over a graph component. I like to use this feature for "area graphs" such as bar charts, histograms, and mosaic plots.

It is easy to turn on tool tips: you simply specify the IMAGEMAP=ON option on the ODS GRAPHICS statement. Because PROC FREQ can create a mosaic plot, the following statements draw a mosaic plot with tool tips for the Origin and Type variables in the Sashelp.Cars data set:

/* Use tool tips to see details of a mosaic plot */
ods graphics on / imagemap=ON;         /* enable data tips */
proc freq data=Sashelp.Cars;
   where Type ^= 'Hybrid';
   tables Origin * Type / plots=mosaic
                          out=FreqOut(where=(Percent^=.)); /* output stats for next section */
run;

When you hover the mouse pointer over a cell, the graph displays a tool tip. The tip for the center cell shows that the cell represents Origin=Europe and Type=Sedan. The center cell represents 78 vehicles or 18.4% of the total number of vehicles in the data.

Create a mosaic plot from the output of PROC FREQ

Unfortunately, the mosaic plot is not supported by PROC SGPLOT in SAS 9.4M6, but the MOSAICPARM statement in the Graph Template Language (GTL) enables you to create a mosaic plot. The following statements display the PROC FREQ template in the SAS log:

/* view the template for the MosaicPlot in PROC FREQ */
proc template;
   source Base.Freq.Graphics.MosaicPlot;
run;

You can copy the basic structure of the Base.Freq.Graphics.MosaicPlot template to create your own template. You need to add an ANNOTATE statement if you want to support annotation, as follows:

proc template;
  define statgraph mosaicPlotParm;
  dynamic _VERTVAR _HORZVAR _FREQ _TITLE;
    begingraph;
      entrytitle _TITLE;
      layout region;          /* REGION layout, so can't overlay text! */
      MosaicPlotParm category=(_HORZVAR _VERTVAR) count=_FREQ / 
             datatransparency=0.5
             colorgroup=_VERTVAR name="mosaic";
      endlayout;
      annotate;               /* required for annotation */
    endgraph;
  end;
run;
 
proc sgrender data=FreqOut template=mosaicPlotParm;
dynamic _VERTVAR="Origin" _HORZVAR="Type" _FREQ="Count"
        _TITLE="Basic Mosaic Plot with No Labels";
run;

Notice that the FreqOut data (which I created by using the OUT= option on the TABLES statement in PROC FREQ) has the cell counts in a different order than the data object that the PLOTS=MOSAIC option uses. The mosaic plot I created has the vertical axis "pointing up" whereas the vertical axis in the PROC FREQ graph "points down" to match the frequency table that the procedure creates.

My initial idea was to overlay a text plot on the mosaic plot and use the text plot to show the cell counts or percentages. However, the MOSAICPLOTPARM statement must be part of a LAYOUT REGION block. A LAYOUT REGION block supports only one plot, a mosaic plot or a pie chart; you cannot overlay another plot such as a text plot or a scatter plot on a "region plot."

Therefore, the only choice for adding text to a mosaic plot is to use the GTL annotation facility. This is not as easy as I'd hoped because "region plots," which do not have axes, do not support data coordinates. This means that you cannot use the DATAVALUE or WALLPERCENT drawing areas, which are the most useful drawing areas for data-dependent annotations. The only choices for drawing areas are the GRAPHPERCENT and LAYOUTPERCENT areas. Of these, the LAYOUTPERCENT is better because annotations in the layout area do not shift around if you decide to add a title or footnote to your mosaic plot. The horizontal portion of the LAYOUTPERCENT drawing area goes from the vertical axis label to the right edge of the graph region. The vertical portion goes from the horizontal axis label to the top edge of the graph region.

Create an annotation for a mosaic plot

This section describes how to create an annotation data set for a region plot. For an introduction to GTL annotation, see the following articles:

The goal of this section is to annotate a mosaic plot, but the same ideas will work on a pie chart, which is also a "region plot." The following annotation uses the LAYOUTPERCENT drawing space. The annotation consists of a series of 'text' function calls. Each 'text' function must be supplied with the following information:

  • The Label variable specifies the text to display.
  • The x1 and y1 variables specify the coordinates of the label (in the LAYOUTPERCENT drawing space).
  • The Width variable specifies the width of the label and the Anchor variable specifies how the text is anchored (left, right, centered,...) at the (x1, y1) location.

The following data set specifies the center of the text in LAYOUTPERCENT coordinates. In a follow-up article, I will show how to compute these values. For now, just assume that the coordinates are provided. In many annotation examples, coming up with the coordinate is an iterative process of guessing values, plotting them, and then revising the guess.

Regardless of how the coordinates are obtained, you can read the coordinates into an annotation data set and assign the special variable names that SG annotation looks for, such as x1, x2, Label, and Width:

data AnnoData;
length Type $8 Origin $6;
input Type Origin hCenter vCenter Freq Pct;
datalines;
SUV    Asia   16.35 28.75 25 5.8824 
Sedan  Asia   50.45 26.15 94 22.1176 
Sports Asia   83.38 25.61 17 4.0000 
Truck  Asia   91.11 25.00  8 1.8824 
Wagon  Asia   96.82 26.50 11 2.5882 
SUV    Europe 16.35 55.00 10 2.3529 
Sedan  Europe 50.45 55.69 78 18.3529 
Sports Europe 83.38 62.35 23 5.4118 
Truck  Europe 91.11 40.00  0 0.0000 
Wagon  Europe 96.82 61.00 12 2.8235 
SUV    USA    16.35 81.25 25 5.8824 
Sedan  USA    50.45 84.54 90 21.1765 
Sports USA    83.38 91.73  9 2.1176 
Truck  USA    91.11 70.00 16 3.7647 
Wagon  USA    96.82 89.50  7 1.6471 
;
 
data anno;
set AnnoData;
length label $12;
/* use RETAIN stmt to define values that are constant */
retain function 'text' 
       y1space 'layoutpercent' x1space 'layoutpercent'
       width 4         /* text box width = 4% of layout range */
       anchor 'center';
/* for the TEXT function, need (x1, y1) coords and Label */
x1 = hCenter;
y1 = vCenter;
label = put(Freq, 4.); /* use 4. format for count */           
/*
label = put(Pct/100, PERCENT7.1); width=7;
*/
run;
 
title "Basic Mosaic Plot with Labels";
proc sgrender data=FreqOut template=mosaicPlotParm sganno=anno;
run;

The mosaic plot with annotation is shown at the top of this article. The DATA step that creates the annotation data set includes a comment that shows how you can display percentages instead of counts. Of course, if your counts are large (thousands), you should increase the Width value and the field width of the format in the PUT statement. You can also modify the program to omit labels for small cells. For example, you might not want to label cells that have fewer than 2% of the sample size.

The key to this example is that the mosaic plot is a region plot. You cannot overlay plots (such as a text plot) on a region plot. Therefore, you must use an annotation. Furthermore, you must use the LAYOUTPERCENT drawing area, which is somewhat inconvenient. How I wish I could use the WALLPERCENT drawing space with a range of [0, 100] in each direction!

In my next article, I will show how to obtain the locations of the annotation centers from the data.

The post How to add an annotation to a mosaic plot in SAS appeared first on The DO Loop.

7月 032019
 

An informat helps you read data into a SAS data set. SAS supports more than 100 informats. The most common informats are related to dates and times and make it easy to read an input string such as 28JAN2001 and convert it to a SAS date such as 15003. Yet no matter how many built-in informats SAS provides, programmers will inevitably encounter data that are difficult to read. That is why PROC FORMAT in SAS enables you to create a custom informat. A custom informat enables you to process input strings in whatever way is most convenient for your application. This article shows how to create a custom informat in SAS in three easy steps:

  1. Write DATA step code that parses an input string and converts it into a form that is more convenient to work with.
  2. Use PROC FCMP to define a function that implements the same string-to-value conversion. Often you can cut and paste from your DATA step program into PROC FCMP.
  3. Use the INVALUE statement in PROC FORMAT to define a custom informat.

Technically, you can skip Step 1, but I find it much easier to develop and debug a program in the DATA step because I can use the PUT statement or PROC PRINT to look at intermediate results.

A motivating example: Convert durations into seconds

While writing a recent article about how to visualize race times, I researched ways to read times into SAS. For time-of-day data, you can use the TIMEw. or ANYDTTME. informats to read a string in the form HH:MM:SS and convert it to a SAS time value. For example, the time "1:23:45" (AM) is converted to the number of seconds since midnight, which is 5025.

Race times are slightly different because they represents durations of time. Short races such as a 5K are typically recorded in minutes, and sprints are recorded in seconds. Both times contain fractions of seconds, so the data is typically represented as SS.fff or MM:SS.ff. You can use the HHMMSS. informat to read these data and store them as seconds. The following DATA step reads a sprint time (12.345 s) and a 5K time (18:33.24) into a SAS time variable. You can use the HOUR, MINUTE, and SECOND functions to make sure the data are read correctly:

/* You can use the HHMMSS informat to read durations into SAS */
data RaceTimes;
input Time HHMMSS.;               /* read duration = length of time */
format Time TIME10.3;             /* there is no HHMMSS. format, so use TIME. format */
rawTime = Time;
hours = hour(Time);               /* check the results by print hour, min, and sec */
minutes = minute(Time);
seconds = second(Time);
datalines;
12.345
18:33.24 
19:23
;
 
proc print noobs; run;
Use HHMMSS informat to read race times into SAS

The only problem with the HHMMSS. informat is that it reads the string 19:23 as "19 hours, 23 minutes," whereas I prefer "19 minutes, 23 seconds." (It's an ambiguous case. For clarity, the string could be entered as 0:19:23 or 19.23.0.) If a SAS informat does not interpret strings the way you prefer, you can write your own SAS informat. The main challenge is writing DATA step code that parses strings such as "12.345" and "18:33.24" and computes the number of seconds that each string represents. The next section presents a DATA step program that converts these strings to seconds. You can use the program to define a new informat that parses race-time data.

Step 1: Write a DATA step that converts a string into a number

This article is about how to create an informat, not about how to parse strings in SAS, so I'll let the comments in the program speak for themselves. Basically, the following program reads from the right side of a string until it encounters a colon (or the beginning of the string). It extracts the string following the colon and converts it to a number, which is the number of seconds. It repeats the process to find the number of minutes (if any) and the number of hours (if any). It then converts the sum of the hours, minutes, and seconds to seconds and stores it in a variable named Time.

/* Parse strings. Obtain time, in seconds, for durations written in the following forms:
          SS.fff
       MM:SS.fff
   HHH:MM:SS.fff
*/
data RaceTimes(drop= i j z);
length str $13;                     /* HHH.MM.SS.fff */
input str;                          /* read raw string */
s=0; m=0; h=0;
/* read seconds :SS.fff */
i = find(str, ':', -length(str)+1); /* find colon from right */
z = substr(str, i+1);               /* extract seconds */
s = input(z, best6.);               /* convert to number */
/* read minutes :MM: */
if i > 0 then do;
   j = find(str, ':', -(i-1));      /* find colon from right */
   z = substr(str, j+1, i-j-1);     /* extract minutes */
   m = input(z, best6.);            /* convert to number */
   /* read hours HHH:, if any */
   if j > 0 then do;
      z = substr(str, 1, j-1);      /* extract hours */
      h = input(z, best6.);         /* convert to number */
   end;
end;
Time = 3600*h + 60*m + s;           /* convert to seconds */
datalines;
   11.345
 0:11.345
 2:39.66 
18:33.24 
19:23
49:02.89 
72:02.89 
1:12:02.89
100:71:00
;
 
proc print data=RaceTimes noobs; 
   format time 10.3;
run;
Convert race times into seconds with SAS

The DATA step parses each string and converts it to a time, in seconds. The output shows the results. This table is the same as you would get by using the HHMMSS. informat to read the data, except that the string 19:23 is now interpreted as "19 minutes, 23 seconds."

Notice that you can see the intermediate expressions such as s, m, and h. This is useful when you are trying to debug the program.

Step 2: Use PROC FCMP to define a function

You can use PROC FCMP to define a custom function in Base SAS. This example defines a function named Duration, which takes a string as an input argument. Copy and paste the DATA step code and return the Time variable. You can use the OPTIONS CMPLIB= system option to tell SAS where to find the Duration function. You can then call the new function from the DATA step, as follows:

proc fcmp outlib=work.functions.MathFuncs; 
function Duration(str $); 
   s=0; m=0; h=0;
   /* read seconds :SS.fff */
   i = find(str, ':', -length(str)+1); /* find colon from right */
   z = substr(str, i+1);               /* extract seconds */
   s = input(z, best6.);               /* convert to number */
   /* read minutes :MM: */
   if i > 0 then do;
      j = find(str, ':', -(i-1));      /* find colon from right */
      z = substr(str, j+1, i-j-1);     /* extract minutes */
      m = input(z, best6.);            /* convert to number */
      /* read hours, if any */
      if j > 0 then do;
         z = substr(str, 1, j-1);      /* extract hours */
         h = input(z, best6.);         /* convert to number */
      end;
   end;
   Time = 3600*h + 60*m + s;           /* convert to seconds */
   return( Time );
endsub; 
quit; 
 
options cmplib=work.functions; /* where to find Duration() function */
data RaceTimes2;
length str $13;
input str;
/* call Duration function to convert string to a time value */
Time = Duration(str);          
datalines;
   11.345
 0:11.345
 2:39.66 
18:33.24 
19:23
49:02.89 
72:02.89 
1:12:02.89
100:71:00 
;
Convert race times seconds with SAS

Step 3: Use PROC FORMAT to define a custom informat

I have written several articles about how to use PROC FORMAT to create your own custom formats in SAS. The process of creating an informat is similar, but because parsing strings is complicated it is often best to write the code into an FCMP function and then define the informat in terms of the function. This is done in the following call to PROC FORMAT, which uses the INVALUE statement to define an informat named DURATION.:

/* define the DURATION. informat */
proc format; 
   invalue duration (default=13) other=[Duration()]; 
run; 
 
/* use new DURATION informat to read duration times into SAS */
data RaceTimes3;
input Time : DURATION.;
datalines;
   11.345
 0:11.345
 2:39.66 
18:33.24 
19:23
49:02.89 
72:02.89
1:12:02.89
100:71:00 
;
 
proc print data=RaceTimes3; 
   format Time 10.3;
run;
Use custom informat to read race times into SAS

Success! The program uses the DURATION. informat to read strings of the form "SS.fff", "MM:SS.fff", and "HHH:MM:SS.fff". It converts the strings to seconds. You can use the statement FORMAT Time TIME10.3 in PROC PRINT if you want to format the seconds as hours, minutes, and seconds.

In summary, you can use the INVALUE statement in PROC FORMAT to create a custom informat in SAS. An informat enables you to read a string or pattern and convert it to a value that is easier to work with. Although SAS supports the HHMMSS. informat for reading strings that represent durations of time, I chose to create my own informat that converts race times to seconds. You can use PROC FCMP to define a function that parses a string and returns a numerical value. You can then use PROC FORMAT to define the informat, which calls the FCMP function.

The post Create a custom informat in SAS appeared first on The DO Loop.

7月 032019
 

One of my favorite parts of summer is a relaxing weekend by the pool. Summer is the time I get to finally catch up on my reading list, which has been building over the year. So, if expanding your knowledge is a goal of yours this summer, SAS Press has a shelf full of new titles for you to explore. To help navigate your selection we asked some of our authors what SAS books were on their reading lists for this summer?

Teresa Jade


Teresa Jade, co-author of SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, has already started The DS2 Procedure: SAS Programming Methods at Work by Peter Eberhardt. Teresa reports that the book “is a concise, well-written book with good examples. If you know a little bit about the SAS DATA step, then you can leverage what you know to more quickly get up to speed with DS2 and understand the differences and benefits.”
 
 
 

Derek Morgan

Derek Morgan, author of The Essential Guide to SAS® Dates and Times, Second Edition, tells us his go-to books this summer are Art Carpenter’s Complete Guide to the SAS® REPORT Procedure and Kirk Lafler's PROC SQL: Beyond the Basics Using SAS®, Third Edition. He also notes that he “learned how to use hash objects from Don Henderson’s Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study.”
 

Chris Holland

Chris Holland co-author of Implementing CDISC Using SAS®: An End-to-End Guide, Revised Second Edition, recommends Richard Zink’s JMP and SAS book, Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP® and SAS®, which describes how to improve efficiency while reducing costs in trials with centralized monitoring techniques.
 
 
 
 
 

And our recommendations this summer?

Download our two new free e-books which illustrate the features and capabilities of SAS® Viya®, and SAS® Visual Analytics on SAS® Viya®.

Want to be notified when new books become available? Sign up to receive information about new books delivered right to your inbox.

Summer reading – Book recommendations from SAS Press authors was published on SAS Users.

7月 012019
 

Math and statistics are everywhere, and I always rejoice when I spot a rather sophisticated statistical idea "in the wild." For example, I am always pleased when I see a graph that shows the distribution of race times in a typical race (such as a 5K), as shown to the right. The finishing times are plotted against the order in which runners crossed the finish line. This is a great visualization because you can see the times for each participant, the range of times between the top finishers and the laggards, and how close each runner's time was to the time of the person who placed ahead of her.

What amazes me is that this graph essentially shows the cumulative distribution of race times. The cumulative distribution is not generally used outside of scientific publications. Yet here it is, easily understood and with no accompanying explanation! Graphs like this are often used to visualize race times for triathlons, marathons, 5Ks, and more.

The distribution of race times

The graph is noteworthy for what it is and also for what it isn't. It isn't a histogram. If you give a statistician a set of measurements and ask for the distribution, you are likely to get a histogram. For small data sets, you might also get a fringe plot below the histogram, as shown below:

The graph shows the distribution of race time for the same race. The histogram bins the times into one-minute intervals. You can easily see that a small percentage of runners (about 2%) finished the race under 19 minutes and that about 40% of the runners finished between 20 and 22 minutes. You can see that only a few runners exceeded 26 minutes.

Since the histogram is such a standard plot, you would think it would be the "best" graph to use, but the time-versus-rank graph has several advantages:

  • The time-versus-rank graph connects the times to the place. What was the time for the 20th runner? It's easy to determine. How many runners finished under 22 minutes? Also easy to find.
  • You can see every runner's time. In a race, there might be a tenth of a second between times. The fringe plot suffers from overplotting. The histogram bins many times into a single bar. The time-versus-rank graph displays one marker per runner and the markers do not overlap unless there are hundreds of runners.
  • The time-versus-rank graph shows packs of runners. In long-distance races, there is often a "lead pack," a "trailing pack," and other clumps of runners of equal abilities. On the time-versus-rank graph, these packs show ups as groups of nearly horizontal markers. In the fringe plot, the vertical lines overlap and are harder to see.
  • Leaders and laggards stand out in the time-versus-rank graph because the markers are isolated. If someone wins a race by 20 seconds (a huge lead!), you can see that clearly in the first graph. In contrast, the histogram lumps together all the leaders into one bar.

The cumulative distribution of race times

The time-versus-rank graph is not exactly equal to the standard graph of the empirical cumulative distribution, but it's close. You can use PROC UNIVARIATE to create the following graph of the cumulative distribution of race times. The graph is known as a CDF plot.

The CDF plot has the same shape as the time-versus-rank graph, but you need to flip the axes. (Geometrically, flip the CDF plot across its diagonal.) The CDF plot differs in three minor ways:

  • The ECDF is plotted as a step function. The time-versus-rank graph is plotted as a scatter plot.
  • The ECDF has its axes reversed: the times are on the horizontal axis and the order is plotted vertically.
  • The ECDF standardizes the "order statistic" into a percentage, rather than using a rank. The runner who finishes 20th out of 66 runners is plotted at the 30.3 percentage point.

So, yes, there are minor differences. But I still smile whenever I see the time-versus-rank graph. Although the racers might not know or care, the plot contains the same information as a plot of the cumulative distribution of the race times.

You can download the SAS program that contains the data and creates all the graphs in this article.

The post Visualize race times in SAS appeared first on The DO Loop.

6月 282019
 

Congratulations on being chosen to speak at an event! Let the anxiety preparation begin. But wait.

Did you know that social media can help you out? Yes, even now, while you plan. What's more, it can be instrumental in maximizing your entire presenter experience before, during and after your presentation. Here are some ideas to get you thinking.

Before

1. Solicit ideas online.

Most of your connections won't attend your event, but many are probably interested in your topic. Don't hesitate to get help from your network while you work on your paper or presentation. Ask them questions. Get their feedback. (And use the event hashtag -- say, #sasgf or #sasusers for example -- when you do it.)

2. Use social media for research.

Online properties like Quora, SAS Communities, Medium, SlideShare and even LinkedIn can lead to statistics, influencers or research you never knew existed. Type keywords or phrases in the basic search field on any of these websites. You never know what (or who) you might find.

3. Polish your LinkedIn (and/or Twitter) profile. (People will be looking.)

Need a checklist? Start with the Example SAS User LinkedIn Profile on communities.sas.com or Buffer's 7 Key Ingredients of a Great Twitter Bio.

4. Schedule a handful of posts.

One week before the conference or perhaps while you're en route, schedule a few posts to your social media accounts. You'll be too busy at the conference to do this. Free tools like Buffer or Hootsuite allow you to schedule posts throughout the week.

During

5. Skim activity around the event hashtag feed to like, reply, share or comment.

Don't know how? Enter the event hashtag, for instance "#sasgf" (no quotation marks), in the main search fields on Twitter and LinkedIn. Doing this is good for a few reasons:

  • It's easy. Especially since you'll be so busy during the event.
  • People (even strangers) appreciate when you interact with their event posts.
  • Social activity during an event is a sure-fire way to gain followers.

6. Post the occasional photo or a useful tip from a particularly inspirational session.

You'll be so busy during the event, it will be hard to find time to post. If you can, do it in small pieces. A favorite stat. A meaningful quote. A beautiful view of the venue. (Remember, use the event hashtag or other topic-specific hashtags when you do.)

After

7. Connect on LinkedIn or SAS Communities.

Immediately after the event (ideally, in less than 24 hours), connect with fellow conference goers on LinkedIn. Be sure to personalize your invitation with a brief note in case they forgot your name. Don't want to wait? Connect with them in person using the LinkedIn QR code trick.

Is your new friend fairly technical? If so, find and follow his or her activity on communities.sas.com (See subhead "How do I search for people?").

8. Add your paper or presentation to your LinkedIn profile (and direct people to it).

There are three sections of your profile where you can add media (in the form of hyperlinks, documents, PowerPoint slides, etc.): your Summary, Experience and Education sections. Professionals: Add your paper or presentation slides to your Summary or Experience sections; students: consider your Education section.

Pro tip: For additional profile views, create a post to point connections to it on your profile or mention it during your presentation.

 

 

 

 

 

9. Write a useful blog post.

Alison Bolen wrote about this in 2012, yet her message remains perfectly relevant: How to transform your live event blogging into evergreen content. The bottom line? Readers care about the content, not the conference.

Nine #SocialMedia Speaker Tips to Use Before, During and After Events was published on SAS Users.

6月 282019
 

Congratulations on being chosen to speak at an event! Let the anxiety preparation begin. But wait.

Did you know that social media can help you out? Yes, even now, while you plan. What's more, it can be instrumental in maximizing your entire presenter experience before, during and after your presentation. Here are some ideas to get you thinking.

Before

1. Solicit ideas online.

Most of your connections won't attend your event, but many are probably interested in your topic. Don't hesitate to get help from your network while you work on your paper or presentation. Ask them questions. Get their feedback. (And use the event hashtag -- say, #sasgf or #sasusers for example -- when you do it.)

2. Use social media for research.

Online properties like Quora, SAS Communities, Medium, SlideShare and even LinkedIn can lead to statistics, influencers or research you never knew existed. Type keywords or phrases in the basic search field on any of these websites. You never know what (or who) you might find.

3. Polish your LinkedIn (and/or Twitter) profile. (People will be looking.)

Need a checklist? Start with the Example SAS User LinkedIn Profile on communities.sas.com or Buffer's 7 Key Ingredients of a Great Twitter Bio.

4. Schedule a handful of posts.

One week before the conference or perhaps while you're en route, schedule a few posts to your social media accounts. You'll be too busy at the conference to do this. Free tools like Buffer or Hootsuite allow you to schedule posts throughout the week.

During

5. Skim activity around the event hashtag feed to like, reply, share or comment.

Don't know how? Enter the event hashtag, for instance "#sasgf" (no quotation marks), in the main search fields on Twitter and LinkedIn. Doing this is good for a few reasons:

  • It's easy. Especially since you'll be so busy during the event.
  • People (even strangers) appreciate when you interact with their event posts.
  • Social activity during an event is a sure-fire way to gain followers.

6. Post the occasional photo or a useful tip from a particularly inspirational session.

You'll be so busy during the event, it will be hard to find time to post. If you can, do it in small pieces. A favorite stat. A meaningful quote. A beautiful view of the venue. (Remember, use the event hashtag or other topic-specific hashtags when you do.)

After

7. Connect on LinkedIn or SAS Communities.

Immediately after the event (ideally, in less than 24 hours), connect with fellow conference goers on LinkedIn. Be sure to personalize your invitation with a brief note in case they forgot your name. Don't want to wait? Connect with them in person using the LinkedIn QR code trick.

Is your new friend fairly technical? If so, find and follow his or her activity on communities.sas.com (See subhead "How do I search for people?").

8. Add your paper or presentation to your LinkedIn profile (and direct people to it).

There are three sections of your profile where you can add media (in the form of hyperlinks, documents, PowerPoint slides, etc.): your Summary, Experience and Education sections. Professionals: Add your paper or presentation slides to your Summary or Experience sections; students: consider your Education section.

Pro tip: For additional profile views, create a post to point connections to it on your profile or mention it during your presentation.

 

 

 

 

 

9. Write a useful blog post.

Alison Bolen wrote about this in 2012, yet her message remains perfectly relevant: How to transform your live event blogging into evergreen content. The bottom line? Readers care about the content, not the conference.

Nine #SocialMedia Speaker Tips to Use Before, During and After Events was published on SAS Users.

6月 272019
 

When Jack Shostak and I first started thinking about writing a SAS book on implementing CDISC (Clinical Data Interchange Standards Consortium) standards, we held one truth to be self-evident: that at least some parts of the book would be outdated before it was even published. Thanks to some lucky timing and the ability to make some minor tweaks just prior to publishing, the extent of the “outdated-ness” with our first edition of Implementing CDISC Using SAS: An End-to-End Guide, was fairly limited…for a few weeks at least.

Shortly after publishing, the final version 2.0 of the Define-XML specification came out and we knew there was some work to do in the future. So, after a bit of a writing break, we rolled up our sleeves again and began updating our %make_define macro and the associated metadata spreadsheets for the second edition of our book. Quite a few other changes were also in the works!

That edition came out in November of 2016. However, CDISC standards didn’t stop for us. True to form, even before publishing, we realized that we weren’t implementing NCI codes, aka “C-codes”, in our metadata-controlled terminology.

This was painfully obvious thanks to a check that started coming up in the Pinnacle 21 reports: “Missing NCI Code for Term in Codelist”. Some users shared this feedback with us, and we took action (thank you, users!).

So with some motivation from Jack, I started working on implementing C-codes. But I wanted it to be slick. The codes are all on the NCI website spreadsheets, so why should we expect users to enter them all into their study-specific metadata spreadsheets, right? Why not just read those spreadsheets, also available in XML format, and automatically merge the C-codes into the study-specific data? Well, I can tell you why it wasn’t that easy.

Jack, meanwhile, was a little more motivated than I to get a solution in place. So, as he frequently did back when we worked together at PRA Health Sciences in the early ’90s, he showed me how it was done. Thanks to him, in April of this year, Implementing CDISC Using SAS: An End-to-End Guide, Revised Second Edition hit the virtual bookstore shelves. The updated macros and meta-data spreadsheets can be downloaded on the SAS website. See the C-codes for yourself and let us know what you think!

Want to get a sneak peek? Also check our book excerpt of Chapter 1: Implementation Strategies! Hoping to learn more about SAS Books? Subscribe to our newsletter for the latest discounts and news.

Other Resources:

SAS and CDISC

Move beyond the ‘whys’ of CDISC and bridge the gap between theory and practice by Jack Shostak

Working with ever-changing CDISC standards was published on SAS Users.

6月 262019
 

"There's a way to do it better - find it." - Thomas A. Edison

Finding a better SAS code

When it comes to SAS coding, this quote by Thomas A. Edison is my best advisor. Time permitting, I love finding better ways of implementing SAS code.

But what code feature means “better” – brevity, clarity or efficiency? It all depends on the purpose of your code. When code is to illustrate a coding concept or technique, clarity is a paramount. However, when processing large data volumes in near real-time, code efficiency becomes critical, not just a luxury or convenience. And brevity won’t hurt in either case. Ideally, your code should be a combination of all three features - brevity, clarity and efficiency.

Parsing a character string

In this blog post we will solve a problem of parsing a character string to find a position of n-th occurrence of a group of characters (substring) in that string.

The closest out-of-box solution to this problem is SAS’ FIND() function. Except this function searches only for a single/first instance of specified substring of characters within a character string. Close enough, and with some do-looping we can easily construct what we want.

After some internet and soul searching to find the Nth occurrence of a substring within a string, I came up with the following DATA STEP code snippet:

   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;

Here, s is a text string (character variable) to be parsed; x is a character variable holding a group of characters that we are searching for within s; p is a position of x value found within s; n is an instance number.

If there is no n-th instance of x within s found, then the code returns p=0.

In this code, each do-loop iteration searches for x within s starting from position p+1 where p is position found in prior iteration: p = find(s,x,p+1);.

Notice, if there is no prior-to n instance of x within s, the do-loop ends prematurely, based on until(p=0) condition, thus cutting the number of loops to the minimal necessary.

Reverse string search

Since find() function allows for a string search in a reverse direction (from right to left) by making the third augment negative, the above code snippet can be easily modified to do just that: find Nth instance (from right to left) of a group of characters within a string. Here is how you can do that:

   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;

The difference here is that we start from position length(s)+1 instead of 0, and each iteration searches substring x within string s starting from position –(p-1)=-p+1 from right to left.

Testing SAS code

You can run the following SAS code to test and see how these searches work:

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;
   put p=;
 
   /* from right to left */
   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;
   put p=;
run;

FINDNTH() function

We can also combine the above left-to-right and right-to-left searches into a single user-defined SAS function by means of SAS Function Compiler (PROC FCMP) procedure:

proc fcmp outlib=sasuser.functions.findnth;
   function findnth(str $, sub $, n);
      p = ifn(n>=0,0,length(str)+1);
      do i=1 to abs(n) until(p=0);
         p = find(str,sub,sign(n)*p+1);
      end;
      return (p);
   endsub;
run;

We conveniently named it findnth() to match the Tableau FINDNTH(string, substring, occurrence) function that returns the position of the nth occurrence of substring within the specified string, where the occurrence argument defines n.

Except our findnth() function allows for both, positive (for left-to-right searches) as well as negative (for right-to-left searches) third argument while Tableau’s function only allows for left-to-right searches.

Here is an example of the findnth() function usage:

options cmplib=sasuser.functions;
data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p=findnth(s,x,n);
   put p=;
 
   /* from right to left */
   p=findnth(s,x,-n);
   put p=;
run;

Using Perl regular expression

As an alternative solution I also implemented SAS code for finding n-th occurrence of a substring within a string using Perl regular expression (regex or prx):

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* using regex */
   xid = prxparse('/'||x||'/o');
   p = 0;
   do i=1 to n until(p=0);
      from = p + 1;
      call prxnext(xid, p + 1, length(s), s, p, len);
   end;
   put p=;
run;

However, efficiency benchmarking tests demonstrated that the above solutions using FIND() function or FINDNTH() SAS user-written function run roughly twice faster than this regex solution.

Challenge

Can you come up with an even better solution to the problem of finding Nth instance of a sub-string within a string? Please share your thoughts and solutions with us. Thomas A. Edison would have been proud of you!

Finding n-th instance of a substring within a string was published on SAS Users.