rick wicklin

3月 032021
 

I have previously shown how you can use the FRACTw. format in SAS to display numbers as fractions. But did you know that you can also use the format to obtain the numerator and denominator of the fraction as numbers in a program? All you need to do is to extract the numerator and denominator from the formatted string and convert them to integers.

The FRACTw. format

If you are not familiar with the FRACTw. format in SAS, here is an example. Suppose a clinical trial has 125 subjects in the control group and 175 subjects in the experimental group. In the control group, 80 subjects experienced pain as compared to only 20 subjects who experienced it in the experiment group. In terms of proportions, 80/125 = 0.64 of the subjects experienced pain as compared to 20/175 = 0.1142857... in the experimental group.

Sometimes it is useful to express numbers like 0.1142857... in terms of a fraction. In this case, we know that the correct fraction is 20/175 or the reduced version of it, 4/35. However, SAS can convert decimals to fractions that are less obvious. The following example uses the FRACTw. format in SAS to display the reduced fractional form of a decimal value:

data Trial;
input Group $ Outcomes Size;
RawRate = Outcomes / Size;
Rate = RawRate;            /* make two copies so we can display decimal & fraction side-by-side */
datalines;
Control 80 125
Test    20 175
;
 
proc print data=Trial noobs;
   format RawRate 10.7 Rate FRACT32.;
run;

In this article, I always use the FRACT32.format because 32 is the largest field width that is supported. That ensures that you can convert the widest range of decimal values to fractions.

By the way, your mathematical friends can probably convert 0.1142857... to a fraction in their head by using a trick! The trick is shown at the end of this article.

Fractions outside of [0,1)

The FRACTw. format will represent improper fractions as a whole number plus a fraction. For example, the number 1.4 is formatted as "1+2/5". If you want to extract the whole part and the fractional parts separately, you need to find the fractional part of the number. There are two common ways to represent a number as its whole and fractional parts. For this article, I will choose to "round to negative infinity" so that the fractional part is always positive.

Use the FRACTw. format to extract the numerator and denominator

Suppose you want to extract the numerator and denominator of the fraction. Because the numerator and denominator are integers, you can do this by extracting substrings from the formatted FRACTw. value (which is a string). Let's generate some random whole numbers and fractions to use as data:

data Have;
call streaminit(1234);
do i = 1 to 10;
   Int = rand("Integer", -2, 2);            /* integer part */
   n = rand("Integer", 1, 100);             /* numerator */
   d = rand("Integer", 1, 100);             /* denominator */
   if n>d then do;  tmp=n; n=d; d=tmp; end; /* swap so that n<=d */
   fractPart = n / d;                       /* decimal version of n/d */
   x = Int + fractPart;  
   output;
end;
format x 10.7 fractPart FRACT32.;
keep Int n d x fractPart;
run;
 
proc print data=Have; 
   var Int n d x fractPart;
run;

Notice that the X value contains some values that are easy to convert to fractions (for example, 1.4, 2.75) but other values (1.4432990) whose fractional form (43/97) is less obvious. The goal is to use SAS to represent the X values as a whole number plus a fraction in reduced form without peeking at the N and D variables. As you can see from the previous table, the FRACT32. format performs the work for you. You can use the following steps to extract the numerator and denominator into numerical variables:

  • Decompose the number into its integer and fractional parts.
  • For the fractional part, use the PUT statement and the FRACTw. format to create a string of the form "NUMER/DENOM".
  • Find the position of the '/' character. The substring to the left contains the numerator; the substring to the right contains the denominator.
  • Use the INPUT function and the w.d informat to extract numerical values from the substrings.

The following DATA step carries out the method to represent each decimal value as an integer and a fraction. The Numer and Denom variables contain the numerator and denominator, respectively, of the fractional part of X.

data NumerDenom;
set Have(keep=x);                      /* read in x, the data to convert */
/* x = floor(x) + frac(x) where frac(x) >= 0 
   https://blogs.sas.com/content/iml/2020/02/10/fractional-part-of-a-number-sas.html
*/
Int = floor(x);                        /* integer part of x */
f = x - Int;                           /* fractional part, f, is always in [0,1) */
s = put(f, FRACT32.);                  /* convert to fraction as a string */
slashPos = find(s,'/');                /* find the '/' character */
if slashPos=0 then do;                 /* if value can't be represented as fraction */
   numer = input(s, 32.); denom = 1;   /* for example: 0 = 0/1 */
end;
else do;
   numer_str = substr(s, 1, slashPos-1);  /* get string to the left of '/' */
   denom_str = substr(s, slashPos+1);     /* get string to the right of '/' */
   numer = input(numer_str, 32.);         /* convert numerator string to integer */
   denom = input(denom_str, 32.);         /* convert denominator string to integer */
end;
drop f s slashPos numer_str denom_str;
run;
 
proc print data=NumerDenom; run;

Success! The method correctly converted decimal values into an integer part and a fraction. The numerator and denominator of the fraction are stored in separate numerical variables.

Summary

SAS contains many formats that convert numbers into strings. One interesting format is the FRACTw. format, which enables you to display a decimal value as an integer and a fraction. If you want to extract the numerator and denominator of the fraction, you can extract the numerator and denominator from substrings of the formatted value. You can then use the INPUT function to convert the strings to numeric values.

Appendix: Mental calculation of 0.1142857...

My mathematical readers might scoff, "We don't need SAS to know that 0.1142857... equals 4/35! We can do that calculation in our heads!" The trick is recognizing that the numeric sequence "142857" is part of the repeating decimal expansion of the fraction 1/7 = 0.142857.... If you know that fact, then you can split the decimal 0.1142857... into two parts: 0.1 + 0.0142857.... This enables you to recognize the decimal as 1/10 + (1/7)/10 = 8/70, which you can easily reduce to 4/35 in your head.

The post Convert decimals to fractions in SAS appeared first on The DO Loop.

3月 012021
 

I recently wrote about a simple statistical formula that approximates the wind chill temperature, which is the cumulative effect of air temperature and wind on the human body. The formula uses two independent variables (air temperature and wind speed) to predict the wind chill temperature. This article describes how to use SAS to create line charts and heat maps that visualize the formula. For the heat map, I show how to overlay numerical values and to choose colors for the numbers so that the text is more readable.

An example of a wind chill chart is shown below. You can use the techniques in this article to visualize any response surface.

A formula for wind chill

The wind chill calculations that are used currently in the US, Canada, and the UK are based on a 2001 revision of an earlier wind chill table. The numbers in the wind chill table are based on equations from physics and mathematical modeling. However, you can fit a simple linear regression model that includes the main effects and an interaction term.

The formula for the wind chill temperature in US units is
\(WCT_{US} = 35.74 + 0.6215*T_F - 35.75*V_{mph}^{0.16} + 0.4275*T_F*V_{mph}^{0.16}\)
where TF is the air temperature in degrees Fahrenheit, and Vmph is the wind speed in miles per hour.

The formula in SI units is
\(WCT_{SI} = 13.12 + 0.6215*T_C - 11.37*V_{kph}^{0.16} + 0.3965*T_C*V_{kph}^{0.16}\)
where TC is the air temperature in degrees Celsius, and Vkph is the wind speed in kilometers per hour.

Visualize slices of the wind chill surface

Given a function of two variables, f(x1, x2), I often visualize the graph of the function by using two techniques. The first technique is to use a sliced fit plot. To create a sliced fit plot, specify a sequence of values for one variable and then overlay the sequence of graphs as a function of the remaining variable. For example, if you choose the values x1=0, 10, 20, and 30, you would overlay the graphs of the functions of x2: f(0, x2), f(10, x2), f(20, x2), and f(30, x2).

The following SAS DATA step evaluates the wind chill function (in US units) at a regular grid of values for the air temperature and the wind speed. The subsequent call to PROC SGPLOT overlays four curves that show the wind chill temperature as a function of the wind speed:

/* Wind chill formula for US customary units.
   The formula gives the apparent temperature (F) from 
   ambient temperature (F) and wind speed (mph).
   Source NOAA: https://www.weather.gov/safety/cold-wind-chill-chart    
*/
data WindChillUS;
label T = "Ambient Temperature (F)"
      V = "Wind Speed (mph)"
      WCT = "Wind Chill Temperature (F)";
do T = 40 to -40 by -5;
   do V = 5 to 40 by 5;
      WCT = 35.74 + 0.6215*T - 35.75*V**0.16 + 0.4275*T*V**0.16;
      output;
   end;
end;
run;
 
ods graphics / width=640px height=480px;
title "Wind Chill Chart (US Units)";
title2 "Dependence on Wind Speed";
footnote J=L "Source NOAA: https://www.weather.gov/safety/cold-wind-chill-chart";
proc sgplot data=WindChillUS;
   label T = "Temperature (F)";
   where T in (30 20 10 0);
   series x=V y=WCT / group=T lineattrs=(thickness=2);
   keylegend / position=E ;
   xaxis grid values=(-40 to 40 by 5) valueshint; 
   yaxis grid values=(-80 to 40 by 5) valueshint;
run;

For any air temperature, the apparent temperature decreases as a function of the wind speed. Stronger winds make it seem colder. Notice that the curves are not parallel, which indicates an interaction between temperature and wind speed. For example, if the air temperature is 30° F, a 25-mph wind makes it seem about 13 degrees colder. However, if the air temperature is 0° F, a 25-mph wind makes it seem about 24 degrees colder.

Create a heat map of the wind chill surface

A second way to visualize a response surface is to create a heat map. To be more useful, you can overlay the wind chill temperatures on the heat map.

When I look at the wind-chill chart that is provided by the National Weather Service (NWS) (shown at the right), I notice three things:

  • The NWS chart contains only four colors. The colors correspond to the danger of frostbite for exposed skin.
  • The temperature axis (horizontal dimension) is reversed. The extremely cold temperatures are on the right side of the graph and the temperatures decrease from left to right. The wind speed axis is also reversed.
  • The NWS chart uses light colors for moderately cold temperatures and dark colors for extremely cold temperatures. Black text is overlaid on the light colors and white text is overlaid on the dark colors. This way, the color of the text contrasts with the color of the background, which makes the text more readable.

I decided to deviate from the NWS chart in two ways. First, instead of four colors, I decided to use a spectrum of colors that represent the wind chill temperature, not the risk of frostbite. Second, I decided to use the traditional orientation of the axes, where temperatures increase from left to right and wind speeds increase from bottom to top. However, I followed the NWS example for choosing the color of the text.

You can use a trick to get the text to display as either white or black. The following DATA step assigns a binary variable according to whether the temperature is less than some threshold value. If you use the TEXT statement in PROC SGPLOT to display the text, you can use the COLORRESPONSE= option and a white-to-black color ramp to generate text that is white or black according to whether the wind chill temperature is above or below the threshold value. (You could also use the GROUP= option and the STYLEATTRS statement.) To be effective, you also need to choose the color ramp so that lighter colors are used above the threshold and darker colors are used below the threshold. Like the NWS, I use dark purples and blues for extremely cold temperatures, but I added lighter colors (off-white, yellow, and orange) for the milder temperatures.

The following SAS program generates the wind chill chart in US units. Notice the trick that generates white/black text for different temperatures. The chart is shown at the top of this article.

title "Wind Chill Chart (US Units)";
footnote J=L "Source NOAA: https://www.weather.gov/safety/cold-wind-chill-chart";
 
/* use threshold to decide between white text and dark text */
data WindChillUS2;
set WindChillUS;
textColor = (WCT > -19.5);  /* use to color text */
run;
 
proc sgplot data=WindChillUS2;
   format WCT 3.0;
   heatmapparm x=T y=V colorresponse=WCT / name="w"
      colormodel=(DarkPurple Purple DarkBlue DarkCyan LightGray LightYellow LightOrange);
   /* trick: white text on a dark background and dark text on a light background */
   text x=T y=V text=WCT / colorresponse=textColor colormodel=(white black) textattrs=(weight=bold);
   gradlegend "w";
   xaxis values=(-40 to 40 by 5) valueshint; 
   yaxis values=(-100 to 40 by 5) valueshint;
run;

Creating a wind chill chart in SI units

Because most of the world uses SI (metric) units, the following program repeats the process for temperatures in Celsius and wind speeds in kilometers per hour (kph). The range of temperatures and wind speeds are approximately the same as for the US chart.

data WindChillSI;
label TC = "Ambient Temperature (C)"
      VK = "Wind Speed (kph)"
      WCT = "Wind Chill Temperature (C)";
do TC = 6 to -39 by -3;
   do VK = 5 to 65 by 5;
      WCT = 13.12 + 0.6215*TC - 11.37*VK**0.16 + 0.3965*TC*VK**0.16;
      output;
   end;
end;
 
/* use white text on a dark background and dark text on a light background */
data WindChillSI2;
set WindChillSI;
textColor = (WCT > -29.5);  /* use to color text */
run;
 
ods graphics / width=640px height=480px;
title "Wind Chill Chart (SI Units)";
proc sgplot data=WindChillSI2;
   format WCT 3.0;
   heatmapparm x=TC y=VK colorresponse=WCT / name="w"
      colormodel=(DarkPurple Purple DarkBlue DarkCyan LightGray LightYellow LightOrange);
   text x=TC y=VK text=WCT / colorresponse=textColor colormodel=(white black) textattrs=(weight=bold);
   gradlegend "w";
   xaxis values=(-39 to 6 by 3) valueshint; 
   yaxis values=(5 to 100 by 5) valueshint;
run;

Summary

This article shows how to create a wind chill chart in SAS. One technique slices the response surface and overlays the one-dimensional graph. Another creates a heat map of the wind chill temperature and overlays text. The colors of the heat map and the text are chosen so that the text is readable on both light and dark backgrounds. You can use the techniques in this article to visualize any response surface.

You can download the complete SAS program that creates these graphs.

The post Create a wind chill chart in SAS appeared first on The DO Loop.

2月 242021
 

In cold and blustery conditions, the weather forecast often includes two temperatures: the actual air temperature and the wind chill temperature. The wind chill temperature conveys the cumulative effect of air temperature and wind on the human body. The goal of the wind-chill scale is to communicate the effect of the wind by giving an equivalent hypothetical scenario in which the temperature is colder but there is no wind. For example, if the wind chill temperature is 0 degrees, it means that the outdoor air temperature and wind combine to make you feel as cold as if it was a calm, clear, night with a temperature of 0°.

This article briefly discusses wind chill calculations and presents a statistical formula for predicting wind chill temperatures. It then creates two wind chill charts by using the formula. The first chart (shown below; click to enlarge) is for US customary units (temperatures in Fahrenheit and wind speeds in miles per hour (mph)). A second chart is shown for SI units (temperatures in Celsius and wind speeds in kilometers per hour (kph)). The end of the article links to information about how scientists developed the wind chill chart. For readers who are interested in how the charts were made, I link to a SAS program.

What is wind chill?

A challenge for meteorologists is how to communicate the risk of severe weather. The wind chill table was developed to communicate the cumulative risk of extreme cold and windy conditions. The table displays the "wind chill equivalent temperature" (WCT or WCET), for a grid of values of air temperature and wind speed.

The wind chill table is based on sophisticated calculations of heat transfer from the human face in the presence of wind, which greatly increases the loss of heat due to convection. It is based on a lot of modeling assumptions (clear night, no sun, the human face is approximately a cylinder,...), and a lot of experimental data to estimate thermal properties of humans. The wind chill calculations that are used currently in the US, Canada, and the UK are based on a 2001 revision of an earlier wind chill table.

A formula for wind chill

The numbers in the official wind chill table are based on equations from physics and mathematical modeling. The equations are very complicated, and you cannot get the numbers in the table by using a simple formula. However, because a simple formula is very useful for understanding how the apparent temperature depends on wind speed, researchers constructed a linear regression model for which the apparent temperature (WCT) is regressed onto the air temperature and a power of the wind speed. The model includes the main effects and an interaction term.

The formula is accurate to within 1° over the range of physically relevant temperatures and wind speeds. The formula is a valid approximation only for temperatures below 10° C (50° F) and for wind speeds above 5 kph (3 mph). In particular, if you plug in V=0, you do not get the air temperature.

The formula for the wind chill temperature in US units is available on a web page for the National Weather Service:
\(WCT_{US} = 35.74 + 0.6215*T_F - 35.75*V_{mph}^{0.16} + 0.4275*T_F*V_{mph}^{0.16}\)
where TF is the air temperature in degrees Fahrenheit, and Vmph is the wind speed in miles per hour.

The previous formula is a conversion from the official metric formula:
\(WCT_{SI} = 13.12 + 0.6215*T_C - 11.37*V_{kph}^{0.16} + 0.3965*T_C*V_{kph}^{0.16}\)
where TC is the air temperature in degrees Celsius, and Vkph is the wind speed in kilometers per hour. You can get one formula from the other by using the equations TC = 5/9*TF – 32 and Vkph=Vmph / 1.609.

Notice that the exponent for the wind speed is rather strange (0.16), but presumably this is based on either a statistical transformation (such as a Box-Cox transformation) or by fitting the exponent as a parameter in the model. Despite my efforts, I was unable to find details about how the regression formula was fitted.

A little known fact is that the wind speeds in the formula are measured at 10 meters (standard anemometer height). The formula uses a wind-profile model to adjust the wind speed to approximate the speed at the height of an average person.

Visualizing wind chill for a given air temperature

Winds removes heat from a layer next to your skin. If you specify the air temperature but allow the wind speed to vary, a graph of the wind chill temperature (WTC) shows how much colder it feels based on the speed of the wind. The following graph shows the WTC (in US units) for four sub-freezing air temperatures: 30°, 20°, 10°, and 0° F.

In each case, you can see that stronger winds make it seem colder. The effect of the wind is nonlinear and is stronger for colder temperatures. For example, if the air temperature is 30° F, a 25-mph wind makes it seem about 13 degrees colder. However, if the air temperature is 0° F, a 25-mph wind makes it seem about 24 degrees colder.

Creating a wind chill chart in US units

You can evaluate the formula for the wind chill temperature on a regular grid to obtain a chart that enables you to estimate the wind chill from the current conditions. The chart for US units is shown at the top of this article. To use the chart, find the approximate outdoor temperature along the horizontal axis, then move up the chart into the row for the approximate wind speed. For example, an outdoor temperature of 5° F and a wind speed of 20 mph results in a wind chill temperature of -15° F.

Creating a wind chill chart in SI units

You can create a similar chart for SI (metric) units, as shown below. Because the Celsius degree is almost twice as large as the Fahrenheit degree, I used increments of 3° C along the horizontal axis. However, I continue to use 5 as an increment for wind speed, even though 5 kph is only about 3 mph. The ranges for the axes are chosen so that the range of temperatures and wind speeds are approximately the same as for the US chart.

Summary and further reading

This article discusses wind chill temperature. The computations that calculate the numbers in the official wind chill chart are very complicated. However, you can fit a linear regression to the numbers in the official chart to obtain an approximate formula that is easy to evaluate. This article presents several wind chill charts that are based on the regression formula.

The history of wind chill computations is fascinating. The Wikipedia article for "Wind Chill" has a few sentences about the history. For an interesting account of the development of the most recent wind chill chart (2001), I highly recommend reading Osczevski and Bluestein, "The New Wind Chill Equivalent Temperature Chart," 2005. They explain some of the complications and approximations used to construct the chart. They note that "the public seems to have a strong preference for the equivalent temperature" chart, which seems simple but hides a lot of complexity. They conclude that the chart is "a deceptive simplification" that only seems to be easy to understand.

A future article will discuss how I constructed the wind chill charts in SAS. You can download the complete SAS program that creates these graphs.

The post The wind chill chart appeared first on The DO Loop.

2月 222021
 

A previous article describes how to use the SGPANEL procedure to visualize subgroups of data. It focuses on using headers to display information about each graph. In the example, the data are time series for the price of several stocks, and the headers include information about whether the stock price increased, decreased, or stayed the same during a time period. The previous article discusses several advantages and disadvantages of using the SGPANEL procedure for this task.

An alternative approach is to use the BY statement in the SGPLOT procedure to process each subgroup separately. This article shows how to use the #BYVAR and #BYVAL keywords in SAS titles to display information about the data in each subgroup.

The example data

I will use the same data as for the previous article. In real life, you would use a separate analysis to determine whether each stock increased or decreased, but I will hard-code this information for the three stocks in the example data.

The following DATA step creates a subset of the Sashelp.Stocks data. The STOCK variable contains the name of three stocks: IBM, Intel, and Microsoft. The OPEN variable contains the opening stock price for these companies for each month. The DATA step restricts the data to the time period Jan 1998 – May 2000. The TREND variable indicates whether the stock price increased or decreased during the time period.

data Have;
   set Sashelp.Stocks;
   where '01Jan1998'd <= Date <= '30May2000'd;
   /* prepare data to display information */
   if      Stock='IBM'       then Trend='Neutral   ';
   else if Stock='Intel'     then Trend='Increasing';
   else if Stock='Microsoft' then Trend='Decreasing';
run;
/* NOTE: The Sashelp.Stock data set is already sorted by Stock and by Date. 
   Be sure to sort your data if you want to use the BY statement. For example:
proc sort data=Have;
   by Stock Date;
run;
*/

You must sort the data for BY-group processing. The example data are already sorted by the STOCK variable, which is the grouping variable. And within each stock, the data are sorted by the DATE variable, which is important for visualizing the stock prices versus time.

Titles for BY-group analysis

When you run a BY-group analysis, SAS automatically creates a title that indicates the name and value of the BY-group variable(s). (This occurs whenever the BYLINE option is on, and it is on by default.) SAS looks at how many titles you have specified and uses the next available title to display the BY-group information. For example, without doing anything special, you can use the standard BY-group analysis to graph the prices for all three stocks in the data set:

/* assume OPTION BYLINE is set */
title "Stock Price Jan 1998 - May 2000";  /* the BY-line will appear in the TITLE2 position */
proc sgplot data=Have;
   by Stock;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;

To save space, I have truncated the output. Each graph shows only a subset of the data. The TITLE2 line displays the name of the BY-group variable (Stock) and the value of the variable. All this happens automatically.

Customize titles: The #BYVAL substitution

The TITLE and TITLEn statements in SAS support substituting the values of a BY-group variable. You can insert the name of a BY-group variable by using the #BYVARn keyword. You can insert the name of a BY-group value by using the #BYVALn keyword. When using these text substitutions, you should specify OPTIONS NOBYLINE to suppress the automatic generation of subtitles.

By default, the BY statement generates the plots one after another, as shown in the previous example. However, you can use the ODS LAYOUT GRIDDED statement to arrange the graphs in a lattice. Essentially, you are using ODS to replicate the layout that PROC SGPANEL handles automatically. In the following example, I let the vertical scale of the axes vary according to the values for each BY group. If you prefer, you can use the MIN= and MAX= options on the YAXIS statement to specify a range of values for each axis.

/* layout the graphs. Use the #BYVALn values to build the titles */
ods graphics / width=300px height=250px;         /* make small to fit on page */
options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
title "BY-Group Analysis of the #byvar1 Variable";/* substitute variable name for #BYVAR */
title2 "Time Series for #byval1";                /* substitute name of stock for #BYVAL */
 
proc sgplot data=Have;
   by Stock;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Now you can see the power of the #BYVAL keyword. (Click the graph to enlarge it.) It gives you great flexibility in creating a custom subtitle that contains the value of the BY-group variable. The keywords #BYVAR and #BYVAL are an alias for #BYVAR1 and #BYVAL1, just like TITLE is an alias for TITLE1. The next example uses a second BY-group variable.

Because the BY statement supports multiple BY-group variables and because you can specify the NOTSORTED option for variables that are not sorted, you can include the TREND variable as a second BY=group variable. You can then use both the #BYVAL1 and #BYVAL2 keywords to further customize the titles:

options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
title2 "The Time Series for #byval1 Is #byval2"; /* substitute stock name and trend value */
 
proc sgplot data=Have;
   by Stock TREND notsorted;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Controlling attributes in a BY-group

I have one more tip. You can use a discrete attribute map to link the attributes of markers and lines to the value of a variable in the data. For example, suppose you want to color the lines in these plots according to whether the stock price increased, decreased, or stayed the same. The following DATA step creates a discrete attribute map that assigns the line colors based on the value in the TREND variable. On the PROC SGPLOT statement, you can use the DATTRMAP= option, which makes the data map available to the procedure. You can add the ATTRID= option to the SERIES statement. Because the colors are determined by the GROUP=TREND option, the procedure will look at the attribute map to determine which color to use for each line.

/* Create a discrete attribute map. Line color is determined by the TREND value. */
data Attrs;
length Value $20 LineColor $20;
ID = "StockTrend";
Value='Neutral';    LineColor = "DarkBlue    "; output;
Value='Increasing'; LineColor = "DarkGreen   "; output;
Value='Decreasing'; LineColor = "DarkRed     "; output;
run;
 
options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
 
proc sgplot data=Have noautolegend DATTRMAP=Attrs;
   by Stock Trend notsorted;
   series x=Date y=Open / group=Trend ATTRID=StockTrend lineattrs=(thickness=2);
   yaxis grid label="Stock Price";
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Summary and further reading

This article shows how to customize the title and attributes in graphs that are generated as part of a BY-group analysis. You can use the #BYVARn and #BYVALn keywords to insert information about the BY groups into titles. You can use a discrete attribute map to link attributes in a graph (such as line color) to the values of a variable in the data. Although creating a sequence of graphs by using a BY-group analysis is a powerful technique, I often prefer to use PROC SGPANEL, for the reasons discussed in a previous article. PROC SGPANEL provides support for controlling many features of the graphs and the layout of the graphs.

If you are interested in the SAS Macro solution to creating titles, you can read the original thread and solution on the SAS Support Communities. For more information about using the #BYVAR and #BYVAL keywords in SAS titles, see

The post How to use the #BYVAR and #BYVAL keywords to customize graph titles in SAS appeared first on The DO Loop.

2月 172021
 

Many characteristics of a graph are determined by the underlying data at run time. A familiar example is when you use colors to indicate different groups in the data. If the data have three groups, you see three colors. If the data have four groups, you see four colors. The data analyst does not need to know in advance how many groups are in the data. The marker colors are an example of a "data-driven" attribute that is determined at run time.

In contrast, some features of the graph are typically specified before the graph is created. For example, in SAS, the TITLE and FOOTNOTE statements are usually hard-coded strings. (But I will show a technique that enables you to generate titles and footnotes dynamically from data!)

Recently, a SAS programmer at the SAS Support Communities wanted to construct titles dynamically. Specifically, he was interested in visualizing the price of many stocks over time, such as the time series shown in the panel to the right. For each stock, he wanted to display the name of the stock and whether it had gained in value, lost value, or remained about the same during a given time period. A helpful macro programmer showed how you can use SAS macro to accomplish this task. However, in many cases, you can use PROC SGPANEL or PROC SGPLOT to display the same information, and the code is much easier to write. This article shows a few tricks for displaying information on a graph in a data-driven manner. So that the article is not too long, this article discusses the SGPANEL procedure. A second article discusses how to use the BY statement in PROC SGPLOT.

Visualizing data in each group

A common task in data analysis is to visualize the data according to subgroups. One or more discrete variables (called grouping variables) are used to determine the observations that belong to each subgroup. For example, a data set might contain a variable named Stock whose values are strings such as "IBM", "Intel", and "Microsoft". These strings determine which observations belong to each group.

This article shows how to use PROC SGPANEL to create a panel of k plots. The header for each plot indicates the grouping value. Optionally, you can use the TEXT statement or the INSET statement to display information (often labels or statistics) that are specific to each group.

The data for this article is a subset of the Sashelp.Stocks data. The STOCK variable contains the name of three stocks: IBM, Intel, and Microsoft. The OPEN variable contains the opening stock price for these companies for each month. Although the example only has three stocks, the techniques in this article are applicable to an arbitrary number of stocks. The following DATA step restricts the data to the time period Jan 1998 – May 2000. It also creates a label for each subgroup that indicates whether the stock price increased or decreased during the time period. In practice, this label would be determined by a separate analysis (maybe a linear regression or a DATA step computation), but I am hard-coding it for simplicity.

data Have;
   set Sashelp.Stocks;
   where '01Jan1998'd <= Date <= '30May2000'd;
   /* prepare data to display information */
   if      Stock='IBM'       then Trend='Neutral   ';
   else if Stock='Intel'     then Trend='Increasing';
   else if Stock='Microsoft' then Trend='Decreasing';
   Label = catx(": ", Stock, Trend);  /* example "IBM: Neutral" */
run;
/* NOTE: The Sashelp.Stock data set is already sorted by Stock and by Date. 
   Be sure to sort your data if you want to use the BY statement. 
proc sort data=Have;
   by Stock Date;
run;
*/

The example data are already sorted by the STOCK variable, which is the grouping variable. Within each stock, the data are sorted by the DATE variable. You do not need to sort the data if you are using PROC SGPANEL, but you sort if you intend to do BY-group processing.

A panel of stocks with headers

If you want to create the same plot for many sub-groups, you should start with the SGPANEL procedure, which was created for exactly this purpose! The following call to PROC SGPANEL is short and easy to read. It creates a panel that contains three line plots and labels each according to the value of the PANELBY variable.

title "Stock Price Jan 1998 - May 2000";
proc sgpanel data=Have;
   panelby Stock / columns=1;     /* vertical stack of panels */
   series x=Date y=Open;
   inset Trend / opaque;
   rowaxis grid label="Stock Price";
   colaxis display=(nolabel);
run;

The output is shown at the top of this article. Each panel has a row header (a "title" of sorts) that identifies the subgroups. An INSET statement displays information about the data. In this example, the data set contains a variable (Trend) that indicates the performance of the stock over the specified time period. It could also display statistics.

There are three advantages to using PROC SGPANEL to visualize and compare subgroups of the data:

  • You can use the UNISCALE= option on the PANELBY statement to specify whether each plot in the panel should use a common axis or whether the axes should reflect the data range in each subgroup. In this example, all vertical axes use the range [50,210]. The minimum value is determined by the lowest Intel price whereas the maximum value is determined by the highest IBM price. This automatic scaling is invaluable for comparing data across groups.
  • You can use options on the PANELBY statement to arrange the plots in a panel. This example uses the COLUMNS=1 option to stack the plots vertically. However, you could also use ROWS=1 to arrange the plots horizontally. You can use both options together to arrange the plots in a lattice.
  • You can control the placement and attributes of the headers. An example is shown in the next section.

Individual graphs of stocks

I prefer to put related plots in a panel because it is a great way to organize the output and to compare the data across subgroups. However, the SAS programmer wanted individual graphs for each subgroup, not a panel of plots. No problem, PROC SGPANEL can do that, too! You can make PROC SGPANEL create individual graphs by telling it that each panel should contain only one graph. You can also control the attributes of the row headers so that they look like the titles that are generated by using the TITLE or TITLE2 statements in SAS.

The following call to PROC SGPANEL is similar to the first call, except the options COLUMNS=1 ROWS=1 force each panel to contain only one plot. The options NOHEADERBORDER and HEADERBACKCOLOR=WHITE make the row headers look more like a title. Lastly, instead of displaying the stock names in the headers, I display a label that identifies the stock and its performance. Notice that the label starts with the PANELBY value and then provide additional information.

ods graphics / width=300px height=200px;
title;
footnote J=L "Stock Price Jan 1998 - May 2000";
 
proc sgpanel data=Have;
   panelby Label / columns=1 rows=1   /* a "panel" that contains one graph */
           headerattrs=GraphTitleText
           novarname noheaderborder headerbackcolor=White; /* make headers look like TITLE */
   series x=Date y=Open;
   rowaxis grid label="Stock Price";
   colaxis display=(nolabel);
run;

These statements put one plot in each "panel" and use the row headers to simulate a title. Even though the plots are no longer part of a panel, they can still share a common axis range, as in this example. As mentioned earlier, the UNISCALE= option enables you to specify which axes are shared.

There are two limitations to using PROC SGPANEL in this way:

  • The headers are not really titles. If you use the TITLE, TITLE2, and TITLE3 statements, you can have titles and multiple subtitles.
  • The values of the PANELBY variable must be unique for each stock. Notice that the PANELBY variable in this example is the name of the stock followed by a string that has additional information about the performance of the stock. This ensures that each panel shows data for exactly one stock. You cannot, for example, use the TREND variable (which has values 'Neutral', 'Increasing', and 'Decreasing') on the PANELBY statement because that would create a panel that has at most three cells. All 'Increasing' stocks would be overlaid in a single cell, and the same for the 'Decreasing' and 'Neutral' stocks.

Summary

The SGPANEL procedure creates a panel of plots. There is one plot for every subgroup of a grouping variable. (Note: You can specify two grouping variables to get a lattice of plots.) By default, the SGPANEL procedure uses the values of the grouping variable as headers. However, as shown in this article, you can display any string as a header. In particular, you can pre-process the data to generate a label that depends on the data. In this way, you can create labels that indicate features in the data, such as whether a stock gained or lost value in a time interval.

Although I prefer using PROC SGPANEL for constructing plots of subgroups, an alternative is to use the BY statement in PROC SGPLOT. The next article presents the pros and cons of using BY-group processing and shows how to construct titles for the plots that depend on the data in the plots.

The post Data-driven titles for graphs appeared first on The DO Loop.

2月 152021
 

I've previously written about how to generate all pairwise interactions for a regression model in SAS. For a model that contains continuous effects, the easiest way is to use the EFFECT statement in PROC GLMSELECT to generate second-degree "polynomial effects." However, a SAS programmer was running a simulation study and wanted to be able to generate all pairwise interaction effects within SAS/IML. One way to solve the programmer's problem is to use the HDIR function in SAS/IML. This article introduces the HDIR function and shows how to use it to generate pairwise (quadratic) interactions for continuous effects.

Generate pairwise interactions by using the EFFECT statement

Before trying to solve a problem with a new tool, I like to use an existing method to solve the problem. Then I can compare the old and new answers and make sure they are the same.

For sample data, let's use the numerical variables in the Sashelp.Class data. So that the techniques are clearer, I will rename the variables to X1, X2, and X3. We want to generate all six pairwise interactions, including the "pure quadratic" terms where a variable interacts with itself. The names of the interaction effects will be X1_2, X1_X2, x1_X3, X2_2, X2_X3, and X3_2. The names with "_2" at the end are pure quadratic effects; the others are interactions.

One way to generate a design matrix that has all pairwise interactions is to use the EFFECT statement in PROC GLMSELECT. The interactions are written to the OUTDESIGN= data set. The columns of the design matrix have interpretable names, which are stored in the _GLSMOD macro variable.

data Have;         /* create the example data */
set sashelp.class(rename=(Age=X1 Height=X2 Weight=X3));
if _N_=3 then X1=.;    /* add a missing value */
keep _NUMERIC_;
run;
 
/* Use PROC GLMSELECT to generate all pairwise interactions.
   Step 1: Add a fake response variable */
%let DSIn  = Have;         /* the name of the input data set */
%let DSOut = Want;         /* the name of the output data set */
data AddFakeY / view=AddFakeY;
   set &DSIn;
   _Y = 0;      /* add a fake response variable */
run;
 
/* Step 2: Use the EFFECT statement to generate the degree-2 effects, as shown in 
      https://blogs.sas.com/content/iml/2017/09/07/polynomial-effects-regression-sas.html */
proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=&DSOut(drop=_Y);
   effect poly2 = polynomial( X1 X2 X3 / degree=2);
   model _Y = poly2 /  noint selection=none;  /* equiv to X1*X1 X1*X2 X1*X3 X2*X2 X2*X3 X3*X3 */
run;
%put &=_GLSMOD;    /* look at names of the interaction effects  in the OUTDESIGN= data */
 
proc print data=&DSOut(obs=5);
   var X1_2 X1_X2 X1_X3 X2_2 X2_X3 X3_2; /* names are generated automatically */
run;

The output from PROC GLMSELECT contains the "pure quadratic" interactions (X1_2, X2_2, and X3_2) and the cross-variable interactions (X1_X2, X1_X3, and X2_X3). If you have k variables, there will be k pure quadratic terms and "k choose 2" cross-variable terms. Hence, the total number of quadratic interactions is "(k+1) choose 2," which is (k+1)*k/2. Here, k=3 and there are six quadratic interactions.

Notice how PROC GLMSELECT handles the missing value in the third observation: because the X1 value is missing, the procedure puts a missing value into all interaction effects.

The horizontal direct product between matrices

The horizontal direct product between matrices A and B is formed by the elementwise multiplication of their columns. The operation is most often used to form interactions between dummy variables for two categorical variables. If A is the design matrix for the categorical variable C1 and B is the design matrix for the categorical variable C2, then HDIR(A,B) is the design matrix for the interaction effect C1*C2.

The following simple example shows how the HDIR function works. The HDIR function returns a matrix whose columns are formed by elementwise multiplication of the columns of A and the matrix B. The first set of columns is the product A[,1]#B, the second set is A[,2]#B, and so forth. If A has k1 columns and B has k2 columns, then HDIR(A,B) has k1*k2 columns.

proc iml;
/* horizontal direct product multiplies each column of A by each column of B */
A = {1 2,
     2 3,
     4 5};
B = {0  1 2,
     1  1 3,
     2 -1 4};
C = hdir(A,B);
print C[c={'A1_B1' 'A1_B2' 'A1_B3' 'A2_B1' 'A2_B2' 'A2_B3'} L='hdir(A,B)'];

Interactions of continuous variables

The previous section shows that if X is a data matrix of continuous variables, the function call HDIR(X,X) generates all pairwise combinations of columns of X. Unfortunately, the matrix HDIR(X,X) contains more columns than we need. if X contains k columns, we need the "(k+1) choose 2" =(k+1)k/2 columns of interactions, but the matrix HDIR(X,X) contains k*k columns. The problem is that HDIR(X,X) contains columns for X1*X2 and for X2*X1, even though those columns are the same. The same holds for other crossed terms such as X1*X3 and X3*X1, X2*X3 and X3*X2, and so forth.

If you want to use the HDIR function to generate all pairwise interactions, you have a choice: You can generate ALL products of pairs and delete the redundant ones, or you can compute only the unique pairs. I will show the latter because it is more efficient.

varNames = 'X1':'X3';
use Have;
   read all var varNames into X;
close;
 
/* Compute only the columns we need */
A1 = HDIR(X[,1], X[,1:3]);    /* interaction of X1 with X1, X2, X3 */
A2 = HDIR(X[,2], X[,2:3]);    /* interaction of X2 with X2 X3 */
A3 = HDIR(X[,3], X[,3]);      /* interaction of X3 with X3 */
A = A1 || A2 || A3;
 
/* get the HEAD module from 
   https://blogs.sas.com/content/iml/2021/02/10/print-top-rows-of-data-sas.html
*/
load module=(Head);
run Head(A) colname={'X1_2' 'X1_X2' 'X1_X3' 'X2_2' 'X2_X3' 'X3_2'};

The HEAD module displays only the top five rows of the matrix of interactions. This output is almost the same as the output from PROC GLMSELECT. A difference is how missing values propagate. For the third row, X1 is missing. PROC GLMSELECT puts a missing value in all columns of the pairwise interactions. In contrast, the SAS/IML output only has missing values in the first three columns because those are the only columns that involve the X1 variable. If your data contain missing values, you might want to use the CompleteCases function to delete the rows that have missing values before performing additional analyses.

A function that computes interactions of continuous variables

The previous section computes all quadratic interactions for a matrix that has three variables. It is straightforward to generalize the idea to k variables. You simply compute the interactions HDIR(X[,i],X[,i:k]) for i=1,2,...,k. This is done in the following program. During the computation, it is convenient to also generate names for the interactions. The following program generates the same names as PROC GLMSELECT. To make the code easier to understand, I encapsulate the logic for names into a helper function called GetNamePairs. The helper function is called as part of the InteractPairs module, which returns both the matrix of interactions and the character vector that names the columns:

/* Helper function: Form names of interaction effects.
   Input: s is a scalar name. t is a character vector of names. 
   Return row vector of pairs of names "s_2" or "s_t[i]" 
   Example: GetNamePairs('X1', 'X1':'X3') returns {'X1_2' 'X1_X2' 'X1_X3'} */
start GetNamePairs(s, t);
   k = nrow(t)*ncol(t);
   b = blankstr(nleng(s)+nleng(t)+1);  /* max length of interaction name */
   pairs = j(1, k, b);
   do i = 1 to k;
      if s=t[i] then pairs[i] = strip(s) + "_2";
      else           pairs[i] = catx("_", strip(s), strip(t[i])); 
   end;
   return pairs;
finish;
 
/* Generate all quadratic interactions for continuous variables.
   Input: design matrix X and a vector of column names.
   Output: OutX:     a matrix whose columns are the pairwise interactions
           OutNames: the names of interactions, such as Age_2 and Height_Age */
start InteractPairs(OutX, OutNames, X, Names);
   k = ncol(X);
   numInteract = comb(k+1,2);  /* = k + comb(k,2) */
   OutX = j(nrow(X), numInteract, .);
   OutNames = j(1, numInteract, BlankStr(2*nleng(Names)+1));
   col = 1;              /* initial column to fill */
   do i = 1 to k;
      m = k-i+1;         /* number of interaction for this loop */
      OutX[,col:col+m-1]     = HDIR(X[,i], X[,i:k]);
      OutNames[,col:col+m-1] = GetNamePairs(Names[i], Names[i:k]);
      col = col + m;
   end;
finish;
 
run InteractPairs(OutX, OutNames, X, varNames);
run Head(OutX) colname=OutNames;

The output is identical to the earlier output. The input arguments to this module can have an arbitrary number of columns. In this example, the design matrix does not include an intercept column (a column that is all ones). Consequently, the output is the set of all quadratic interactions. If your design matrix includes an intercept column, the output will contain an intercept column and all main effects in addition to the quadratic effects.

Technically, you don't need to use the HDIR function in the InteractPairs module. Instead, you can use the familiar '#' operator to perform the elementwise multiplication. However, if you try to generalize the module to handle the interaction between categorical variables and continuous variables, the HDIR function will be useful.

Summary

This article introduces the HDIR function in SAS/IML, which computes the horizontal direct product of matrices. You can use the HDIR function to generate interaction effects in regression models. This article shows how to generate all quadratic interactions among a set of continuous variables.

The post Generate all quadratic interactions in a regression model appeared first on The DO Loop.

2月 102021
 

One of the first things I learned in SAS was how to use PROC PRINT to display parts of a data set. I usually do not need to see all the data, so my favorite way to use PROC PRINT is to use the OBS= data set option to display the first few rows. For example, I often display the first five rows of a SAS data set as follows:

proc print data=Sashelp.Class(obs=5);
    * VAR Weight Height Age;  /* optional: the VAR statement specifies variables */
run;

By using the OBS= data set option, you can display only a few observations. This enables you to take a quick peek at the values of your data. As shown in the comment, you can optionally use the VAR statement to display only certain columns. (Use the FIRSTOBS= option if you want more control over the rows that are printed.)

Display the rows of a data table in SAS/IML

In a SAS/IML program, data are either stored in a table or in a matrix. If the data are in a table, you can use the TABLEPRINT subroutine to display the data. The NUMOBS= option enables you to display only a few rows:

proc iml;
TblClass = TableCreateFromDataset("sashelp", "class");
run TablePrint(TblClass) numobs=5;

The TABLEPRINT subroutine supports many options for printing, including the VAR= option for specifying only certain columns.

Display the rows of a matrix in SAS/IML

How can you display only a portion of a SAS/IML matrix? I often write statements like this to print only the first few rows:

/* read numerical data into the X matrix */
use Sashelp.Class; read all var _NUM_ into X[c=varNames]; close;
print (X[1:5,]);    /* print only rows 1 through 5 */

This works, but there are a few things that I don't like. My primary complaint is that X[1:5,] is a temporary matrix and therefore has no name. The rows are printed, but there is no header that tells me where the data came from. My second complaint is that the output does not indicate which rows are being displayed. Consequently, sometimes I include information in the label and add row and column headers:

print (X[1:5,])[rowname=(1:5) colname=varNames label="Top of X"];

The output now includes the information that I want, but that is a LOT of typing, especially if I want to display similar information for other matrices. Even if I use the abbreviated version of the PRINT options (R=, C=, and L=), it is cumbersome to type.

By the way, this PRINT statement demonstrates a new feature of SAS/IML 15.1 (which was released with SAS 9.4M6), which is that the ROWNAME= and COLNAME= options on the PRINT statement support numerical vectors. If you have an earlier version of SAS/IML, you can use rowname=(char(1:5)).

HEAD: A module to print the top rows of a matrix

There's a saying among computer programmers: if you find yourself writing the same statements again and again, create a function to do it. So, let's write a SAS/IML module to print the top rows of a matrix. Because there is a UNIX command called 'head' that displays the top lines of a file, I will use the same name.

Many years ago, I blogged about how to write a HEAD subroutine, although my emphasis was on how to use default arguments in SAS/IML functions. The following routine is a richer version of the previous function:

/* Print the first n rows of a matrix. Optionally, display names of columns */
start Head(x, n=5, colname=);
   m = min(n, nrow(x));         /* make sure n isn't too big */
   idx = 1:m;                   /* the rows to print */
   name = parentname("x");      /* name of symbol in calling environment */
   if name=" " then name = "Temp";  /* the parent name of a temporary variable is " "*/
   labl = "head(" + name + ") rows=" + strip(char(m));  /* construct the label */
   if isSkipped(colname) then  /* print the top rows */
      print (x[idx,])[r=idx label=labl];
   else 
      print (x[idx,])[r=idx c=colname label=labl];
finish;
 
run Head(X) colname=varNames;  /* example: call the HEAD module */

The HEAD subroutine uses three features of user-defined modules that you might not know about:

The result is a short way to display the top few rows of a matrix.

TAIL: A module to print the bottom rows of a matrix

Although I usually want to print the top row of a matrix, it is easy to modify the HEAD module to display the last n rows of a matrix. The following module, called TAIL, is almost identical to the HEAD module.

/* Print the last n rows of a matrix. Optionally, display names of columns */
start Tail(x, n=5, colname=);
   m = min(n, nrow(x));         /* make sure n isn't too big */
   idx = (nrow(x)-m+1):nrow(x); /* the rows to print */
   name = parentname("x");      /* name of symbol in calling environment */
   if name=" " then name = "Temp";  /* the parent name of a temporary variable is " "*/
   labl = "tail(" + name + ") rows=" + strip(char(m));  /* construct the label */
   if isSkipped(colname) then  /* print the bottom rows */
      print (x[idx,])[r=idx label=labl];
   else 
      print (x[idx,])[r=idx c=colname label=labl];
finish;
 
run Tail(X) colname=varNames;

Summary

Most SAS programmers know how to use the OBS= option in PROC PRINT to display only a few rows of a SAS data set. When writing and debugging programs in the SAS/IML matrix language, you might want to print a few rows of a matrix. This article presents the HEAD module, which displays the top rows of a matrix. For completeness, the article also defines the TAIL module, which displays the bottom rows of a matrix. If you find these modules useful, you can incorporate them into your SAS/IML programs.

For more tips and techniques related to SAS/IML modules, see the article "Everything you wanted to know about writing SAS/IML modules."

The post Print the top rows of your SAS data appeared first on The DO Loop.

2月 082021
 

Look at the following matrices. Do you notice anything that these matrices have in common?

If you noticed that the rows of each matrix are arithmetic progressions, good for you. For each row, there is a constant difference (also called the "increment") between adjacent elements. For these examples:

  • In the first matrix, each row is an arithmetic progression with a difference of 1. That is, the value of the (j+1)th column is one more than the value of the j_th column.
  • In the second matrix, each row is an arithmetic progression with a difference of 2. That is, the value of the (j+1)th column is two more than the value of the j_th column.
  • In the third matrix, each row is an arithmetic progression, but the increments for the rows are not identical. The increment for the first row is -3. The increment for the second row is +1. The increments for the remaining rows are -2 and +3, respectively.

However, there is another characteristic that these matrices have in common: they are all singular matrices. It turns out this is not a coincidence. An n x n matrix (n>2) is singular if each row is an arithmetic progression.

Why are these matrices singular?

Recall that a matrix will be singular if one of the columns can be written as a linear combination of other columns. For these three matrices, you can use mental arithmetic to verify that the third column is a linear combination of the first two columns. Specifically, you can verify that the third column can be written as
X3 = 2 X2 – X1
where X1, X2, X3, ... are the columns of a matrix. Thus, these matrices are singular.

It turns out that this relationship holds in general for any matrix whose rows are arithmetic progressions. Suppose the elements of the i_th row satisfy the progression with increment di so that the elements of the i_th row satisfy X[i, j+1] = X[i, j] + d[i]  for j=1,2,...,n. Let d be the vector of differences. Then
X2 = X1 + d
which you can solve for d to obtain
d = X2 – X1.
By substitution,
X3 = X2 + d = X2 + (X2 – X1), or
X3 = 2*X2 – X1.
Therefore, the third column is always a linear combination of the first two columns, and the coefficients are independent of the differences in the arithmetic progressions of the rows. It doesn't matter what increments are used in the arithmetic progressions of the rows.

In fact, even more is true. You can use similar arguments to show that the j_th column can be written as a linear combination of the first two columns for any j > 2. Thus, regardless of the matrix size, matrices whose rows are arithmetic progressions are rank-2 matrices: they have only two linearly independent columns.

Why should you care about this?

I decided to write about this topic for three reasons:

  1. It's mathematically interesting, and I never noticed it before.
  2. When I write a program, I often need to quickly create an example matrix to test the program. I often use a matrix such as {1 2 3, 4 5 6, 7 8 9} or SHAPE(1:25,5,5). This article shows that both matrices are singular, which is important to know if the program requires a nonsingular matrix.
  3. Lastly, I wanted to emphasize the difference between a "random matrix" and an "arbitrary matrix." In a previous article, I discussed the probability that a random matrix is singular and concluded that a random matrix is almost always nonsingular. Here, "random matrix" means that the matrix elements are drawn randomly from a continuous probability distribution such as the uniform or normal distribution.

    However, the other day I received a comment on that blog post. Essentially, the comment said, "I wrote down this matrix at random, but it is singular. How do you explain that?" The matrix was similar to the matrices at the top of this article. The answer is that the elements of the matrix were not generated randomly. The rows of the matrix form an arithmetic progression (very NON-random!), which, as we have seen, implies that the matrix is singular.

The post A matrix is singular if its rows are arithmetic sequences appeared first on The DO Loop.

2月 032021
 

In a previous article, I showed how to generate random points uniformly inside a d-dimensional sphere. In that article, I stated the following fact:

If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit sphere.

I was reminded of this fact recently when I wrote about how to simulate a random walk in high dimensions. Each step of a random walk needs to generate a random direction and a random step length. For a 2-D random walk, it is easy to generate a random direction as an angle, θ, chosen from the uniform distribution on the interval [0,2π). However, for higher-dimensional random walks, I do not suggest using random angles to determine directions. Although it is mathematically possible to use spherical coordinates in d-dimensions, computations in a spherical coordinate system are extremely messy.

A better alternative is to use random unit vectors to determine a direction for each step in a random walk. (In fact, a unit vector is sometimes called a direction vector.) Since a unit vector is a vector from the origin to a point on the unit sphere, this article shows how to generate random points uniformly on the unit sphere. I show the computation twice: once by using the SAS/IML matrix language and again by using the SAS DATA step. In order to visualize the data, I show how to create a contour plot of ungridded data in SAS.

Random points on a circle

For simplicity, let's see how to generate random points on the unit circle in 2-D. The following SAS/IML program simulates N=1000 points from d=2 independent normal distributions. The notation X[ ,##] is a subscript reduction operator that returns a row vector that contains the sum of the squared elements of each row of X. Thus the expression sqrt(X[,##]) gives the Euclidean distance of each row to the origin. If you divide the coordinates by the distance, you obtain a point on the unit circle:

%let N = 1000;         /* number of steps in random walk */
%let d = 2;            /* S^{d-1} sphere embedded in R^d */
%let r = 1;            /* radius of sphere */
 
proc iml;
call randseed(12345);
r = &r; d = &d;
x = randfun(&N // d, "Normal");   /* N points from MVN(0, I(d)) */
norm = sqrt( x[,##] );            /* Euclidean distance to origin */
x = r * x / norm;                 /* scale point so that distance to origin is r */
title "&n Random Points on Circle";
call scatter(x[,1], x[,2]) grid={x y}
             procopt="aspect=1" option="transparency=0.8";
QUIT;

As promised, the resulting points are on the circle of radius r=1. Recall that uniformly at random does not mean evenly spaced. Point patterns that are generated by a uniform process will typically have gaps and clusters.

Random points on a sphere

You can also use the SAS DATA step to generate the random points on a sphere. You can generate each coordinate independently from a normal distribution and use the EUCLID function in Base SAS to compute the Euclidean distance from a point to the origin. You can then scale the coordinates so that the point is on the sphere of radius r. The following DATA step generates d-dimensional points for any d:

%let N = 2000;         /* number of steps in random walk */
%let d = 3;            /* S^{d-1} sphere embedded in R^d */
%let r = 1;            /* radius of sphere */
data RandSphere;
array x[&d];
call streaminit(12345);
do i = 1 to &N;
   do j = 1 to &d;
      x[j] = rand("Normal");      /* random point from MVN(0, I(d)) */
   end;
   norm = Euclid( of x[*] );      /* Euclidean distance to origin */
   do j = 1 to &d;
      x[j] = &r * x[j] / norm;    /* scale point so that distance to origin is r */
   end;
   output;
end;
drop j norm;
run;

How can we visualize the points to assure ourselves that the program is running correctly? One way is to plot the first two coordinates and use colors to represent the value of the points in the third coordinate direction. For example, the following call to PROC SGPLOT creates a scatter plot of (x1,x2) and uses the COLORRESPONSE=x3 option to color the markers. The default color ramp is the ThreeAltColor ramp, which (for my ODS style) is a blue-black-red color ramp. That means that points near the "south pole" (x3 = -1) will be colored blue, points near the equator will be colored black, and points near the "north pole" (x3 = 1) will be colored red. I have also used the TRANSPARENCY= option so that the density of the points is shown.

title "&n Random Points on a Sphere";
proc sgplot data=RandSphere aspect=1;
   scatter x=x1 y=x2 / colorresponse=x3 transparency=0.5 markerattrs=(symbol=CircleFilled);
   xaxis grid; yaxis grid;
run;

Switching to (x,y,z) instead of (x1,x2,x3) notation, the graph shows that observations near the (x,y)-plane (for which x2 + y2 &asymp 1) have a dark color because the z-value is close 1 zero. In contrast, observations near for which x2 + y2 &asymp 0 have either a pure blue or pure red color since those observations are near one of the poles.

A contour plot of ungridded data

Another way to visualize the data would be to construct a contour plot of the points in the upper hemisphere of the sphere. The exact contours of the upper hemisphere are concentric circles (lines of constant latitude) centered at (x,y)=(0,0). For these simulated data, a contour plot should display contours that are approximately concentric circles.

In a previous article, I showed how to use PROC TEMPLATE and PROC SGRENDER in SAS to create a contour plot. However, in that article the data were aligned on an evenly spaced grid in the (x,y) plane. The data in this article have random positions. Consequently, on the CONTOURPLOTPARM statement in PROC TEMPLATE, you must use the GRIDDED=FALSE option so that the plot interpolates the data onto a rectangular grid. The following Graph Template Language (GTL) defines a contour plot for irregularly spaced data and creates the contour plot for the random point on the upper hemisphere:

/* Set GRIDDED=FALSE if the points are not arranged on a regular grid */
proc template;
define statgraph ContourPlotParmNoGrid;
dynamic _X _Y _Z _TITLE;
begingraph;
   entrytitle _TITLE;
   layout overlay;
      contourplotparm x=_X y=_Y z=_Z / GRIDDED=FALSE /* for irregular data */
        contourtype=fill nhint=12  colormodel=twocolorramp name="Contour";
      continuouslegend "Contour" / title=_Z;
   endlayout;
endgraph;
end;
run;
 
ods graphics / width=480px height=480px;
proc sgrender data=RandSphere template=ContourPlotParmNoGrid;
where x3 >= 0;
dynamic _TITLE="Contour Plot of &n Random Points on Upper Hemisphere"
        _X="x1" _Y="x2" _Z="x3";
run;

The data are first interpolated onto an evenly spaced grid and then the contours are estimated. The contours are approximately concentric circles. From the contour plot, you can conclude that the data form a dome or hemisphere above the (x,y) plane.

Summary

This article shows how to use SAS to generate random points on the sphere in any dimension. You can use this technique to generate random directions for a random walk. In addition, the article shows how to use the GRIDDED=FALSE option on the CONTOURPLOTPARM statement in GTL to create a contour plot from irregularly spaced ("ungridded") data in SAS.

The post Generate random points on a sphere appeared first on The DO Loop.

2月 012021
 

Imagine an animal that is searching for food in a vast environment where food is scarce. If no prey is nearby, the animal's senses (such as smell and sight) are useless. In that case, a reasonable search strategy is a random walk. The animal can choose a random direction, walk/swim/fly in that direction for a while, and hope that it encounters food. If not, it can choose another random direction and try again.

According to a story in Quanta magazine, an optimal search strategy is to use a variant of a random walk that is known as a Lévy flight. This article describes a Lévy flight, simulates it in SAS, and compares it to Gaussian random walk.

Random walks

For the purpose of this article, a 2-D random walk is defined as follows. From the current position,

  • Choose a random direction by choosing an angle, θ, uniformly in [0, 2π).
  • Choose a random distance, r.
  • Move that distance in that direction.

The amount of territory that gets explored during a random walk depends on the distribution of the distances. For example, if you always choose a unit distance, you obtain the "drunkard's walk." If you choose distances from a normal distribution, you obtain a Gaussian walk. In both scenarios, the animal primarily searches near where it has been previously. These random walks that always use short distances are good for finding food in a food-rich environment.

In a food-scarce environment, it is a waste of time and energy to always move a short distance from the previous position. It turns out that a better strategy is to occasionally move a long distance. That puts the animal in a new location, which it can explore by moving small distances again. According to the Quanta article, this behavior has been observed "to greater or lesser extents" in many animals that hunt at sea, including albatross, sharks, turtles, penguins, and tuna (Sims, et al., Nature, 2008).

Levy walks and the Levy distribution

A popular model for this behavior is the Lévy walk, which is often called a Lévy flight because it was first applied to the flight path of an albatross over open water. In a Lévy flight, the distance is a random draw from a heavy-tailed distribution. In this article, I will simulate a Lévy flight by drawing distances from a Lévy distribution, which is a special case of the inverse gamma distribution.

The Lévy distribution Levy(μ, c) has a location parameter (μ) and a shape parameter (c). It is a special case of the inverse gamma distribution, which is represented by IGamma(α, β). If X ~ Levy(μ, c) is a random variable, then X – μ ~ IGamma(1/2, β/2). In this article, μ=0, so a random draw from the Levy(0,c) distribution is a random draw from IGamma(1/2, c/2) distribution. Thus, you can use the SAS program in the previous article to generate a random sample from the Lévy distribution.

The inverse gamma distribution with α=1/2 has very heavy tails. This means that a random variate can be arbitrarily large. When modeling a physical or biological system, there are often practical bounds on the maximum value of a quantity. For example, if you are modeling the flight of an albatross, you know that there is a maximum distance that an albatross can fly in a given time. Thus, researchers often use a truncated Lévy distribution to model the distances that an animal travels in one unit of time. For example, an albatross can fly up to 800 km in one day (!!), so a model of albatross distances should truncate the distances at 800.

Simulate a Gaussian and Levy Walk

You can use the SAS DATA step to simulate random walks. The program in this section generates a random direction in [0, 2π), then generates a random distance. For the Gaussian walk, the distance is the absolute value of a random N(0, σ) variate. For the Lévy walk, the distances are sampled from the Levy(0, 0.3*σ) distribution. The constant 0.3 is chosen so that the two distance distributions have approximately the same median.

For convenience, you can define a macro that makes it easy to generate random variates from the Lévy distribution. The following program uses the %RandIGamma macro (explained in the previous article), which generates a random variate from the inverse gamma distribution.

The following program simulates four individuals (animals) who start at the corners of a unit square with coordinates (0,0), (1,0), (1,1), and (0,1). At each step, the animals choose a random direction and move a random distance in that direction. In one scenario, the distances generated as the absolute value of a normal distribution. In another scenario, the distances are Levy distributed.

/* useful macros: https://blogs.sas.com/content/iml/2021/01/27/inverse-gamma-distribution-sas.html */
%macro RandIGamma(alpha, beta);          /* IGamma(alpha,beta) */
   (1 / rand('Gamma', &alpha, 1/(&beta)))
%mend;
%macro RandLevy(mu, scale);              /* Levy(mu,c) */
   (&mu + %RandIGamma(0.5, (&scale)/2));
%mend;
%macro RandTruncLevy(mu, scale, max);    /* Truncated Levy */
   min((&mu + %RandIGamma(0.5, (&scale)/2)), &max);
%mend;
 
%let N = 1000;         /* number of steps in random walk */
%let sigma = 0.1;      /* scale parameter for N(0,sigma) */
%let c = (0.3*&sigma); /* makes medians similar */
 
data Walks;
array x0[4] _temporary_ (0 1 1 0);  /* initial positions */
array y0[4] _temporary_ (0 0 1 1);
call streaminit(12345);
twopi = 2*constant('pi');
c = &c;                    /* Levy parameter */
do subject = 1 to dim(x0);
   t=0;                    /* initial position for subject */
   x = x0[subject];   y = y0[subject]; 
   r=.; theta=.;           /* r = random distance; theta=random angle */
   Walk = "Gaussian";  output;
   Walk = "Levy";      output;
   xN = x; yN = y;         /* (xN,yN) are coords for Gaussian walk */
   xL = x; yL = y;         /* (xL,yL) are coords for Levy walk */
   do t = 1 to &N;         /* perform steps of a random walk */
      theta = rand("Uniform", 0, twopi);  /* random angle */
      ux = cos(theta); uy = sin(theta);   /* u=(ux,uy) is random unit vector */
      /* take a Gaussian step; remember location */
      Walk = "Gaussian";
      r = abs( rand("Normal", 0, &sigma) ); /* ~ abs(N(0,sigma)) */
      xN = xN + r*ux;   yN = yN + r*uy;
      /* output the Gaussian walk step */
      x = xN; y = yN; output;
 
      /* take a (truncated) Levy step; remember location */
      Walk = "Levy";
      r = %RandTruncLevy(0, c, 2);  /* Levy(0,c) truncated into [0,2] */
      xL = xL + r*ux;   yL = yL + r*uy;
      /* output the Levy walk step */
      x = xL; y = yL; output;
   end;
end;
drop twopi c xN yN xL yL ux uy;
run;

Let's compare the step sizes for the two types of random walks. The following call to PROC SGPANEL creates histograms of the step lengths for each type of random walk:

title "Distribution of Step Lengths";
proc sgpanel data=Walks;
   label r = "Step Length";
   panelby Walk;
   histogram r / binwidth=0.1 binstart=0.05;
run;

The graph shows that all step lengths for the Gaussian random walk are less than 0.4, which represents four standard deviations for the normal distribution. In contrast, the distribution of distances for the Lévy walk has a long tail. About 15% of the distances are greater than 0.75. About 9% of the step lengths are the maximum possible value, which is 2.

How does that difference affect the amount of territory covered by an animal that is foraging for food? The following statements show the paths of the four animals for each type of random walk:

title "Position of Subjects";
title2 "50 Steps of a Random Walk";
proc sgpanel data=Walks aspect=1;
   where t <= 50;
   panelby Walk / onepanel;
   series x=x y=y / group=Subject;
   colaxis grid; rowaxis grid;
run;

The differences in the paths are evident. For the Gaussian random walk, the simulated animals never leave a small neighborhood around their starting positions. After 50 steps, none of the animals who are taking a Gaussian walk have moved more than one unit away from their starting position. In contrast, the animals who take Lévy walk have explored a much greater region. You can see that the typical motion is several small steps followed by a larger step that brings the animal into a new region to explore.

Generalizing to higher dimensions

It is easy to extend the random walk to higher dimensions. The trick is to represent the "random direction" by a unit vector (u) instead of by an angle (θ). In a previous article about random point within a d-dimensional ball, I remark that "If Y is drawn from the uncorrelated multivariate normal distribution, then u = Y / ||Y|| has the uniform distribution on the unit d-sphere." You can use this trick to generate random unit vectors, then step a distance r in that direction, where r is from a specified distribution of distances.

Summary

An article in Quanta magazine describes how some animals use a swim/flight path that can be modeled by a (truncated) Lévy random walk. This type of movement is a great way to search for food in a vast food-scarce environment. This article defines the Lévy distribution as a sub-family of the inverse gamma distribution. It shows how to generate random variates from the Lévy distribution in SAS. Finally, it simulates a Gaussian random walk and a Lévy walk and visually compares the area explored by each.

The post Gaussian random walks and Levy flights appeared first on The DO Loop.