2月 252021
 

The people, the energy, the quality of the content, the demos, the networking opportunities…whew, all of these things combine to make SAS Global Forum great every year. And that is no exception this year.

Preparations are in full swing for an unforgettable conference. I hope you’ve seen the notifications that we set the date, actually multiple dates around the world so that you can enjoy the content in your region and in your time zone. No one needs to set their alarm for 1:00am to attend the conference!

Go ahead and save the date(s)…you don’t want to miss this event!

Content, content, content

We are working hard to replicate the energy and excitement of a live conference in the virtual world. But we know content is king, so we have some amazing speakers and content lined up to make the conference relevant for you. There will be more than 150 breakout sessions for business leaders and SAS users, plus the demos will allow you to see firsthand the innovative solutions from SAS, and the people who make them. I, for one, am looking forward to attending live sessions that will allow attendees the opportunity to ask presenters questions and have them respond in real time.

Our keynote speakers, while still under wraps for now, will have you on the edge of your seats (or couches…no judgement here!).

Networking and entertainment

You read that correctly. We will have live entertainment that'll have you glued to the screen. And you’ll be able to network with SAS experts and peers alike. But you don’t have to wait until the conference begins to network, the SAS Global Forum virtual community is up and running. Join the group to start engaging with other attendees, and maybe take a guess or two at who the live entertainment might be.

A big thank you

We are working hard to bring you the best conference possible, but this isn’t a one-woman show. It takes a team, so I would like to introduce and thank the conference teams for 2021. The Content Advisory Team ensures the Users Program sessions meet the needs of our diverse global audience. The Content Delivery Team ensures that conference presenters and authors have the tools and resources needed to provide high-quality presentations and papers. And, finally, the SAS Advisers help us in a multitude of ways. Thank you all for your time and effort so far!

Registration opens in April, so stay tuned for that announcement. I look forward to “seeing” you all in May.

What makes SAS Global Forum great? was published on SAS Users.

2月 242021
 

In cold and blustery conditions, the weather forecast often includes two temperatures: the actual air temperature and the wind chill temperature. The wind chill temperature conveys the cumulative effect of air temperature and wind on the human body. The goal of the wind-chill scale is to communicate the effect of the wind by giving an equivalent hypothetical scenario in which the temperature is colder but there is no wind. For example, if the wind chill temperature is 0 degrees, it means that the outdoor air temperature and wind combine to make you feel as cold as if it was a calm, clear, night with a temperature of 0°.

This article briefly discusses wind chill calculations and presents a statistical formula for predicting wind chill temperatures. It then creates two wind chill charts by using the formula. The first chart (shown below; click to enlarge) is for US customary units (temperatures in Fahrenheit and wind speeds in miles per hour (mph)). A second chart is shown for SI units (temperatures in Celsius and wind speeds in kilometers per hour (kph)). The end of the article links to information about how scientists developed the wind chill chart. For readers who are interested in how the charts were made, I link to a SAS program.

What is wind chill?

A challenge for meteorologists is how to communicate the risk of severe weather. The wind chill table was developed to communicate the cumulative risk of extreme cold and windy conditions. The table displays the "wind chill equivalent temperature" (WCT or WCET), for a grid of values of air temperature and wind speed.

The wind chill table is based on sophisticated calculations of heat transfer from the human face in the presence of wind, which greatly increases the loss of heat due to convection. It is based on a lot of modeling assumptions (clear night, no sun, the human face is approximately a cylinder,...), and a lot of experimental data to estimate thermal properties of humans. The wind chill calculations that are used currently in the US, Canada, and the UK are based on a 2001 revision of an earlier wind chill table.

A formula for wind chill

The numbers in the official wind chill table are based on equations from physics and mathematical modeling. The equations are very complicated, and you cannot get the numbers in the table by using a simple formula. However, because a simple formula is very useful for understanding how the apparent temperature depends on wind speed, researchers constructed a linear regression model for which the apparent temperature (WCT) is regressed onto the air temperature and a power of the wind speed. The model includes the main effects and an interaction term.

The formula is accurate to within 1° over the range of physically relevant temperatures and wind speeds. The formula is a valid approximation only for temperatures below 10° C (50° F) and for wind speeds above 5 kph (3 mph). In particular, if you plug in V=0, you do not get the air temperature.

The formula for the wind chill temperature in US units is available on a web page for the National Weather Service:
\(WCT_{US} = 35.74 + 0.6215*T_F - 35.75*V_{mph}^{0.16} + 0.4275*T_F*V_{mph}^{0.16}\)
where TF is the air temperature in degrees Fahrenheit, and Vmph is the wind speed in miles per hour.

The previous formula is a conversion from the official metric formula:
\(WCT_{SI} = 13.12 + 0.6215*T_C - 11.37*V_{kph}^{0.16} + 0.3965*T_C*V_{kph}^{0.16}\)
where TC is the air temperature in degrees Celsius, and Vkph is the wind speed in kilometers per hour. You can get one formula from the other by using the equations TC = 5/9*TF – 32 and Vkph=Vmph / 1.609.

Notice that the exponent for the wind speed is rather strange (0.16), but presumably this is based on either a statistical transformation (such as a Box-Cox transformation) or by fitting the exponent as a parameter in the model. Despite my efforts, I was unable to find details about how the regression formula was fitted.

A little known fact is that the wind speeds in the formula are measured at 10 meters (standard anemometer height). The formula uses a wind-profile model to adjust the wind speed to approximate the speed at the height of an average person.

Visualizing wind chill for a given air temperature

Winds removes heat from a layer next to your skin. If you specify the air temperature but allow the wind speed to vary, a graph of the wind chill temperature (WTC) shows how much colder it feels based on the speed of the wind. The following graph shows the WTC (in US units) for four sub-freezing air temperatures: 30°, 20°, 10°, and 0° F.

In each case, you can see that stronger winds make it seem colder. The effect of the wind is nonlinear and is stronger for colder temperatures. For example, if the air temperature is 30° F, a 25-mph wind makes it seem about 13 degrees colder. However, if the air temperature is 0° F, a 25-mph wind makes it seem about 24 degrees colder.

Creating a wind chill chart in US units

You can evaluate the formula for the wind chill temperature on a regular grid to obtain a chart that enables you to estimate the wind chill from the current conditions. The chart for US units is shown at the top of this article. To use the chart, find the approximate outdoor temperature along the horizontal axis, then move up the chart into the row for the approximate wind speed. For example, an outdoor temperature of 5° F and a wind speed of 20 mph results in a wind chill temperature of -15° F.

Creating a wind chill chart in SI units

You can create a similar chart for SI (metric) units, as shown below. Because the Celsius degree is almost twice as large as the Fahrenheit degree, I used increments of 3° C along the horizontal axis. However, I continue to use 5 as an increment for wind speed, even though 5 kph is only about 3 mph. The ranges for the axes are chosen so that the range of temperatures and wind speeds are approximately the same as for the US chart.

Summary and further reading

This article discusses wind chill temperature. The computations that calculate the numbers in the official wind chill chart are very complicated. However, you can fit a linear regression to the numbers in the official chart to obtain an approximate formula that is easy to evaluate. This article presents several wind chill charts that are based on the regression formula.

The history of wind chill computations is fascinating. The Wikipedia article for "Wind Chill" has a few sentences about the history. For an interesting account of the development of the most recent wind chill chart (2001), I highly recommend reading Osczevski and Bluestein, "The New Wind Chill Equivalent Temperature Chart," 2005. They explain some of the complications and approximations used to construct the chart. They note that "the public seems to have a strong preference for the equivalent temperature" chart, which seems simple but hides a lot of complexity. They conclude that the chart is "a deceptive simplification" that only seems to be easy to understand.

A future article will discuss how I constructed the wind chill charts in SAS. You can download the complete SAS program that creates these graphs.

The post The wind chill chart appeared first on The DO Loop.

2月 222021
 

Removing a piece from character string In my previous post, we addressed the problem of inserting substrings into SAS character strings. In this post we will solve a reverse problem of deleting substrings from SAS strings.

These two complementary tasks are commonly used for character data manipulation during data cleansing and preparation to transform data to a shape suitable for analysis, text mining, reporting, modeling and decision making.

As in the previous case of substring insertion, we will cover substring deletion for both, character variables and macro variables as both data objects are strings.

The following diagram illustrates what we are going to achieve by deleting a substring from a string:

Removing a substring from SAS string illustration

Have you noticed a logical paradox? We take away a “pieceof” cake and get the whole thing as result! 😊

Now, let’s get serious.

Deleting all instances of a substring from a character variable

Let’s suppose we have a variable STR whose values are sprinkled with some undesirable substring ‘<br>’ which we inherited from some HTML code where tag <br> denotes a line break. For our purposes, we want to remove all instances of those pesky <br>’s. First, let’s create a source data set imitating the described “contaminated” data:

data HAVE;
   infile datalines truncover;
   input STR $100.;
   datalines;
Some strings<br> have unwanted sub<br>strings in them<br>
<br>A s<br>entence must not be cont<br>aminated with unwanted subs<br>trings
Several line<br> breaks<br> are inserted here<br><br><br>
<br>Resulting st<br>ring must be n<br>eat and f<br>ree from un<br>desirable substrings
Ugly unwanted substrings<br><br> must <br>be<br> removed
<br>Let's remove them <br>using S<br>A<br>S language
Ex<br>periment is a<br>bout to b<br>egin
<br>Simpli<br>city may sur<br>prise you<br><br>
;

This DATA step creates WORK.HAVE data set that looks pretty ugly and is hardly usable:
Source data to be cleansed
The following code, however, cleans it up removing all those unwanted substrings ‘<br>’:

data WANT (keep=NEW_STR);
   length NEW_STR $100;
   SUB = '<br>';
   set HAVE;
   NEW_STR = transtrn(STR,'<br>',trimn(''));
run;

After this code runs, the data set WANT will look totally clean and usable:
Cleaned data

Code highlights

  • We use .

The TRANSTRN function is similar to TRANWRD function which replaces all occurrences of a substring in a character string. While TRANWRD uses a single blank when the replacement string has a length of zero, TRANSTRN does allow the replacement string to have a length of zero which essentially means removing.

  • TRIM() function which removes trailing blanks from a character string and returns one blank if the string is missing. However, when it comes to removing (which is essentially replacement with zero length substring) the ability of TRIMN function to return a zero-length string makes all the difference.

Deleting all instances of a substring from a SAS macro variable

For macro variables, I can see two distinct methods of removing all occurrences of undesirable substring.

Method 1: Using SAS data step

Here is a code example:

%let STR = Some strings<br> have unwanted sub<br>strings in them<br>;
%let SUB = <br>;
 
data _null_;
   NEW_STR = transtrn("&STR","&SUB",trimn(''));
   call symputx('NEW',NEW_STR);
run;
 
%put &=STR;
%put &=NEW;

In this code, we stick our macro variable value &STR in double quotes in the transtrn() function as the first argument (source). The macro variable value &SUB, also double quoted, is placed as a second argument. After variable NEW_STR is produced free from the &SUB substrings, we create a macro variable NEW using

%let STR = Some strings<br> have unwanted sub<br>strings in them<br>;
%let SUB = <br>;
 
%let NEW = %sysfunc(transtrn(&STR,&SUB,%sysfunc(trimn(%str()))));
 
%put &=STR;
%put &=NEW;

Deleting selected instance of a substring from a character variable

In many cases we need to remove not all substring instances form a string, but rather a specific occurrence of a substring. For example, in the following sentence (which is a quote by Albert Einstein) “I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.” the second word “sometimes” was added by mistake. It needs to be removed. Here is a code example presenting two solutions of how such a deletion can be done:

data A;
   length STR STR1 STR2 $250;
   STR = 'I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.';
   SUB = 'sometimes';
   STR_LEN = length(STR);
   SUB_LEN = length(SUB);
   POS = find(STR,SUB,-STR_LEN);
   STR1 = catx(' ', substr(STR,1,POS-1), substr(STR,POS+SUB_LEN)); /* solution 1 */
   STR2 = kupdate(STR,POS,SUB_LEN+1);                              /* solution 2 */
   put STR1= / STR2=;
run;

The code will produce two correct identical values of this quote in the SAS log (notice, that the second instance of word “sometimes” is gone):

STR1=I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am.
STR2=I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am.

Code highlights

  • FIND() function determines position POS of the substring SUB to be deleted in the string STR. In this particular example, we used the fact, that the second occurrence of word “sometimes” is the first occurrence of this word when counted from right to left. That is indicated by the negative 3-rd argument (-STR_LEN) which means that FIND function searches STR for SUB starting from position STR_LEN from right to left.

Solution 1

This is the most traditional solution that cuts out two pieces of the string – before and after the substring being deleted – and then concatenates them together thus removing that substring:

  • substr(STR,1,POS-1) extracts the first part of the source string STR before the substring to be deleted: from position 1 to position POS-1.
  • substr(STR,POS+SUB_LEN) extracts the second part of the source string STR after the substring to be deleted: from position POS+SUB_LEN till the end of STR value (since the third argument, length, is not specified).
  • Solution 2

    Finding n-th instance of a substring within a string .

Deleting selected instance of a substring from a SAS macro variable

Here is a code example of how to solve the same problem as it relates to SAS macro variables. For brevity, we provide just one solution using %sysfunc and KUPDATE() function:

%let STR = I believe in intuitions and inspirations. I sometimes feel that I am right. I sometimes do not know that I am.;
%let SUB = sometimes;
%let POS = %sysfunc(find(&STR,&SUB,-%length(&STR)));
%let STR2 = %sysfunc(kupdate(&STR,&POS,%eval(%length(&SUB)+1)));
%put "&STR2";

This should produce the following corrected Einstein’s quote in the SAS log:

"I believe in intuitions and inspirations. I sometimes feel that I am right. I do not know that I am."

Additional Resources for SAS character strings processing

Your thoughts?

Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.

Deleting a substring from a SAS string was published on SAS Users.

2月 222021
 

A previous article describes how to use the SGPANEL procedure to visualize subgroups of data. It focuses on using headers to display information about each graph. In the example, the data are time series for the price of several stocks, and the headers include information about whether the stock price increased, decreased, or stayed the same during a time period. The previous article discusses several advantages and disadvantages of using the SGPANEL procedure for this task.

An alternative approach is to use the BY statement in the SGPLOT procedure to process each subgroup separately. This article shows how to use the #BYVAR and #BYVAL keywords in SAS titles to display information about the data in each subgroup.

The example data

I will use the same data as for the previous article. In real life, you would use a separate analysis to determine whether each stock increased or decreased, but I will hard-code this information for the three stocks in the example data.

The following DATA step creates a subset of the Sashelp.Stocks data. The STOCK variable contains the name of three stocks: IBM, Intel, and Microsoft. The OPEN variable contains the opening stock price for these companies for each month. The DATA step restricts the data to the time period Jan 1998 – May 2000. The TREND variable indicates whether the stock price increased or decreased during the time period.

data Have;
   set Sashelp.Stocks;
   where '01Jan1998'd <= Date <= '30May2000'd;
   /* prepare data to display information */
   if      Stock='IBM'       then Trend='Neutral   ';
   else if Stock='Intel'     then Trend='Increasing';
   else if Stock='Microsoft' then Trend='Decreasing';
run;
/* NOTE: The Sashelp.Stock data set is already sorted by Stock and by Date. 
   Be sure to sort your data if you want to use the BY statement. For example:
proc sort data=Have;
   by Stock Date;
run;
*/

You must sort the data for BY-group processing. The example data are already sorted by the STOCK variable, which is the grouping variable. And within each stock, the data are sorted by the DATE variable, which is important for visualizing the stock prices versus time.

Titles for BY-group analysis

When you run a BY-group analysis, SAS automatically creates a title that indicates the name and value of the BY-group variable(s). (This occurs whenever the BYLINE option is on, and it is on by default.) SAS looks at how many titles you have specified and uses the next available title to display the BY-group information. For example, without doing anything special, you can use the standard BY-group analysis to graph the prices for all three stocks in the data set:

/* assume OPTION BYLINE is set */
title "Stock Price Jan 1998 - May 2000";  /* the BY-line will appear in the TITLE2 position */
proc sgplot data=Have;
   by Stock;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;

To save space, I have truncated the output. Each graph shows only a subset of the data. The TITLE2 line displays the name of the BY-group variable (Stock) and the value of the variable. All this happens automatically.

Customize titles: The #BYVAL substitution

The TITLE and TITLEn statements in SAS support substituting the values of a BY-group variable. You can insert the name of a BY-group variable by using the #BYVARn keyword. You can insert the name of a BY-group value by using the #BYVALn keyword. When using these text substitutions, you should specify OPTIONS NOBYLINE to suppress the automatic generation of subtitles.

By default, the BY statement generates the plots one after another, as shown in the previous example. However, you can use the ODS LAYOUT GRIDDED statement to arrange the graphs in a lattice. Essentially, you are using ODS to replicate the layout that PROC SGPANEL handles automatically. In the following example, I let the vertical scale of the axes vary according to the values for each BY group. If you prefer, you can use the MIN= and MAX= options on the YAXIS statement to specify a range of values for each axis.

/* layout the graphs. Use the #BYVALn values to build the titles */
ods graphics / width=300px height=250px;         /* make small to fit on page */
options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
title "BY-Group Analysis of the #byvar1 Variable";/* substitute variable name for #BYVAR */
title2 "Time Series for #byval1";                /* substitute name of stock for #BYVAL */
 
proc sgplot data=Have;
   by Stock;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Now you can see the power of the #BYVAL keyword. (Click the graph to enlarge it.) It gives you great flexibility in creating a custom subtitle that contains the value of the BY-group variable. The keywords #BYVAR and #BYVAL are an alias for #BYVAR1 and #BYVAL1, just like TITLE is an alias for TITLE1. The next example uses a second BY-group variable.

Because the BY statement supports multiple BY-group variables and because you can specify the NOTSORTED option for variables that are not sorted, you can include the TREND variable as a second BY=group variable. You can then use both the #BYVAL1 and #BYVAL2 keywords to further customize the titles:

options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
title2 "The Time Series for #byval1 Is #byval2"; /* substitute stock name and trend value */
 
proc sgplot data=Have;
   by Stock TREND notsorted;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Controlling attributes in a BY-group

I have one more tip. You can use a discrete attribute map to link the attributes of markers and lines to the value of a variable in the data. For example, suppose you want to color the lines in these plots according to whether the stock price increased, decreased, or stayed the same. The following DATA step creates a discrete attribute map that assigns the line colors based on the value in the TREND variable. On the PROC SGPLOT statement, you can use the DATTRMAP= option, which makes the data map available to the procedure. You can add the ATTRID= option to the SERIES statement. Because the colors are determined by the GROUP=TREND option, the procedure will look at the attribute map to determine which color to use for each line.

/* Create a discrete attribute map. Line color is determined by the TREND value. */
data Attrs;
length Value $20 LineColor $20;
ID = "StockTrend";
Value='Neutral';    LineColor = "DarkBlue    "; output;
Value='Increasing'; LineColor = "DarkGreen   "; output;
Value='Decreasing'; LineColor = "DarkRed     "; output;
run;
 
options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
 
proc sgplot data=Have noautolegend DATTRMAP=Attrs;
   by Stock Trend notsorted;
   series x=Date y=Open / group=Trend ATTRID=StockTrend lineattrs=(thickness=2);
   yaxis grid label="Stock Price";
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Summary and further reading

This article shows how to customize the title and attributes in graphs that are generated as part of a BY-group analysis. You can use the #BYVARn and #BYVALn keywords to insert information about the BY groups into titles. You can use a discrete attribute map to link attributes in a graph (such as line color) to the values of a variable in the data. Although creating a sequence of graphs by using a BY-group analysis is a powerful technique, I often prefer to use PROC SGPANEL, for the reasons discussed in a previous article. PROC SGPANEL provides support for controlling many features of the graphs and the layout of the graphs.

If you are interested in the SAS Macro solution to creating titles, you can read the original thread and solution on the SAS Support Communities. For more information about using the #BYVAR and #BYVAL keywords in SAS titles, see

The post How to use the #BYVAR and #BYVAL keywords to customize graph titles in SAS appeared first on The DO Loop.

2月 192021
 

This post is part of our Young Data Scientists series, featuring the motivation, work and advice of the next generation of data scientists. Be sure to check back for future posts, or read the whole series by clicking on the image to the right. This July, the inaugural Academic SAS [...]

Taking initiative leads to rewards for this young data scientist was published on SAS Voices by Jelena Stankovic

2月 172021
 

Many characteristics of a graph are determined by the underlying data at run time. A familiar example is when you use colors to indicate different groups in the data. If the data have three groups, you see three colors. If the data have four groups, you see four colors. The data analyst does not need to know in advance how many groups are in the data. The marker colors are an example of a "data-driven" attribute that is determined at run time.

In contrast, some features of the graph are typically specified before the graph is created. For example, in SAS, the TITLE and FOOTNOTE statements are usually hard-coded strings. (But I will show a technique that enables you to generate titles and footnotes dynamically from data!)

Recently, a SAS programmer at the SAS Support Communities wanted to construct titles dynamically. Specifically, he was interested in visualizing the price of many stocks over time, such as the time series shown in the panel to the right. For each stock, he wanted to display the name of the stock and whether it had gained in value, lost value, or remained about the same during a given time period. A helpful macro programmer showed how you can use SAS macro to accomplish this task. However, in many cases, you can use PROC SGPANEL or PROC SGPLOT to display the same information, and the code is much easier to write. This article shows a few tricks for displaying information on a graph in a data-driven manner. So that the article is not too long, this article discusses the SGPANEL procedure. A second article discusses how to use the BY statement in PROC SGPLOT.

Visualizing data in each group

A common task in data analysis is to visualize the data according to subgroups. One or more discrete variables (called grouping variables) are used to determine the observations that belong to each subgroup. For example, a data set might contain a variable named Stock whose values are strings such as "IBM", "Intel", and "Microsoft". These strings determine which observations belong to each group.

This article shows how to use PROC SGPANEL to create a panel of k plots. The header for each plot indicates the grouping value. Optionally, you can use the TEXT statement or the INSET statement to display information (often labels or statistics) that are specific to each group.

The data for this article is a subset of the Sashelp.Stocks data. The STOCK variable contains the name of three stocks: IBM, Intel, and Microsoft. The OPEN variable contains the opening stock price for these companies for each month. Although the example only has three stocks, the techniques in this article are applicable to an arbitrary number of stocks. The following DATA step restricts the data to the time period Jan 1998 – May 2000. It also creates a label for each subgroup that indicates whether the stock price increased or decreased during the time period. In practice, this label would be determined by a separate analysis (maybe a linear regression or a DATA step computation), but I am hard-coding it for simplicity.

data Have;
   set Sashelp.Stocks;
   where '01Jan1998'd <= Date <= '30May2000'd;
   /* prepare data to display information */
   if      Stock='IBM'       then Trend='Neutral   ';
   else if Stock='Intel'     then Trend='Increasing';
   else if Stock='Microsoft' then Trend='Decreasing';
   Label = catx(": ", Stock, Trend);  /* example "IBM: Neutral" */
run;
/* NOTE: The Sashelp.Stock data set is already sorted by Stock and by Date. 
   Be sure to sort your data if you want to use the BY statement. 
proc sort data=Have;
   by Stock Date;
run;
*/

The example data are already sorted by the STOCK variable, which is the grouping variable. Within each stock, the data are sorted by the DATE variable. You do not need to sort the data if you are using PROC SGPANEL, but you sort if you intend to do BY-group processing.

A panel of stocks with headers

If you want to create the same plot for many sub-groups, you should start with the SGPANEL procedure, which was created for exactly this purpose! The following call to PROC SGPANEL is short and easy to read. It creates a panel that contains three line plots and labels each according to the value of the PANELBY variable.

title "Stock Price Jan 1998 - May 2000";
proc sgpanel data=Have;
   panelby Stock / columns=1;     /* vertical stack of panels */
   series x=Date y=Open;
   inset Trend / opaque;
   rowaxis grid label="Stock Price";
   colaxis display=(nolabel);
run;

The output is shown at the top of this article. Each panel has a row header (a "title" of sorts) that identifies the subgroups. An INSET statement displays information about the data. In this example, the data set contains a variable (Trend) that indicates the performance of the stock over the specified time period. It could also display statistics.

There are three advantages to using PROC SGPANEL to visualize and compare subgroups of the data:

  • You can use the UNISCALE= option on the PANELBY statement to specify whether each plot in the panel should use a common axis or whether the axes should reflect the data range in each subgroup. In this example, all vertical axes use the range [50,210]. The minimum value is determined by the lowest Intel price whereas the maximum value is determined by the highest IBM price. This automatic scaling is invaluable for comparing data across groups.
  • You can use options on the PANELBY statement to arrange the plots in a panel. This example uses the COLUMNS=1 option to stack the plots vertically. However, you could also use ROWS=1 to arrange the plots horizontally. You can use both options together to arrange the plots in a lattice.
  • You can control the placement and attributes of the headers. An example is shown in the next section.

Individual graphs of stocks

I prefer to put related plots in a panel because it is a great way to organize the output and to compare the data across subgroups. However, the SAS programmer wanted individual graphs for each subgroup, not a panel of plots. No problem, PROC SGPANEL can do that, too! You can make PROC SGPANEL create individual graphs by telling it that each panel should contain only one graph. You can also control the attributes of the row headers so that they look like the titles that are generated by using the TITLE or TITLE2 statements in SAS.

The following call to PROC SGPANEL is similar to the first call, except the options COLUMNS=1 ROWS=1 force each panel to contain only one plot. The options NOHEADERBORDER and HEADERBACKCOLOR=WHITE make the row headers look more like a title. Lastly, instead of displaying the stock names in the headers, I display a label that identifies the stock and its performance. Notice that the label starts with the PANELBY value and then provide additional information.

ods graphics / width=300px height=200px;
title;
footnote J=L "Stock Price Jan 1998 - May 2000";
 
proc sgpanel data=Have;
   panelby Label / columns=1 rows=1   /* a "panel" that contains one graph */
           headerattrs=GraphTitleText
           novarname noheaderborder headerbackcolor=White; /* make headers look like TITLE */
   series x=Date y=Open;
   rowaxis grid label="Stock Price";
   colaxis display=(nolabel);
run;

These statements put one plot in each "panel" and use the row headers to simulate a title. Even though the plots are no longer part of a panel, they can still share a common axis range, as in this example. As mentioned earlier, the UNISCALE= option enables you to specify which axes are shared.

There are two limitations to using PROC SGPANEL in this way:

  • The headers are not really titles. If you use the TITLE, TITLE2, and TITLE3 statements, you can have titles and multiple subtitles.
  • The values of the PANELBY variable must be unique for each stock. Notice that the PANELBY variable in this example is the name of the stock followed by a string that has additional information about the performance of the stock. This ensures that each panel shows data for exactly one stock. You cannot, for example, use the TREND variable (which has values 'Neutral', 'Increasing', and 'Decreasing') on the PANELBY statement because that would create a panel that has at most three cells. All 'Increasing' stocks would be overlaid in a single cell, and the same for the 'Decreasing' and 'Neutral' stocks.

Summary

The SGPANEL procedure creates a panel of plots. There is one plot for every subgroup of a grouping variable. (Note: You can specify two grouping variables to get a lattice of plots.) By default, the SGPANEL procedure uses the values of the grouping variable as headers. However, as shown in this article, you can display any string as a header. In particular, you can pre-process the data to generate a label that depends on the data. In this way, you can create labels that indicate features in the data, such as whether a stock gained or lost value in a time interval.

Although I prefer using PROC SGPANEL for constructing plots of subgroups, an alternative is to use the BY statement in PROC SGPLOT. The next article presents the pros and cons of using BY-group processing and shows how to construct titles for the plots that depend on the data in the plots.

The post Data-driven titles for graphs appeared first on The DO Loop.

2月 162021
 

Insert a piece into a stringSAS provides an extensive set of tools for data cleansing and preparation – transforming data to a shape suitable for analysis, text mining, reporting, modeling and ultimately decision making.

In this post we will cover one of the common tasks of character data manipulation – inserting a substring into a SAS character string.

A diagram below illustrates what we are going to achieve:

Illustration for inserting a substring into a string

SAS character strings come in two different incarnations: character variables and macro variables. Since these two are quite different SAS language objects, let’s cover them one by one separately.

Inserting a substring into a character variable

Here is our task: we have a SAS character variable (string) and we want to insert in it a value of another character variable (substring) starting at a particular specified position.

Let’s say we have a string BASE in which we want to insert a COUNTRY name right before word "stays" to make different variation of the resultant phrase. Here is an example of how this can be easily done:

data COUNTRIES;
   length COUNTRY $20;
   input COUNTRY;
   datalines;
Spain
Argentina
Slovenia
Romania
USA
Luxembourg
Egypt
Switzerland
;
 
data NEW (keep=COUNTRY PHRASE);
   BASE = 'The rain in stays mainly in the plain';
   INSPOS = find(BASE,'stays');
   set COUNTRIES;
   length PHRASE $50;
   PHRASE = catx(' ',substr(BASE,1,INSPOS-1),COUNTRY,substr(BASE,INSPOS));
run;

This code dynamically creates variable PHRASE out of values of variable BASE and the values of variable COUNTRY, thus making it data-driven.

After this code runs, the data set NEW will look like this:

Results after inserting a substring into a character string

Here are the code highlights:

  • maximum length of varying-length character variables is 536,870,911 characters (UTF-8 encoding).

    Inserting a substring into a SAS macro variable

    Let’s solve a similar task, but now instead of SAS variables we will operate with SAS macro variables, since they are strings too.

    Here is our problem to solve: we have a SAS macro variable (string) and we want to insert in it a value of another macro variable (substring) starting at a particular specified position.

    Let’s say we have a macro variable BASE with value of The rain in stays mainly in the plain in which we want to insert a country name defined by macro variable COUNTRY with value of Spain right before word stays. Here is an example of how this can be done:

    %let BASE = The rain in stays mainly in the plain;
    %let COUNTRY = Spain;
    %let W = stays;
     
    %let INSPOS = %index(&BASE,&W);
    %let PHRASE = %substr(&BASE,1,%eval(&INSPOS-1))&COUNTRY %substr(&BASE,&INSPOS);
    %put ***&PHRASE***;

    This code will insert the country name in the appropriate place within the BASE macro variable which will be printed in the SAS log by %put statement:

    ***The rain in Spain stays mainly in the plain***

    Here are the code highlights:

    • %substr() macro function to extract two parts of its first argument (&BASE) - before and after insertion:
      • %substr(&BASE,1,%eval(&INSPOS-1))captures the first part of &BASE (before insertion): substring of &BASE starting from the position 1 with a length of %eval(&INSPOS-1).
      • %substr(&BASE,&INSPOS) captures the second part of &BASE (after insertion): substring of &BASE starting from the position &INSPOS till the end of &BASE (since the third argument is not specified).
    • In case of macro variables, we don’t need any concatenation functions – we just list the component pieces of the macro variable value in a proper order with desired separators (blanks in this case).

    NOTE: Unlike for SAS variables, you don’t need to assign the length of SAS macro variables which are automatically defined by their assigned values. The maximum length of SAS macro variables is 65,534 bytes.

    Inserting multiple instances of a substring into a SAS character string

    Sometimes you need to insert a substring into several places (positions p1, p2, …, pn) of a character string. In this case you can use the above strategy repeatedly or iteratively with one little caveat: start inserting from the highest position and moving backwards to the lowest position. This will preserve your pre-determined positions because positions are counted from left to right and inserting a substring at a higher position won’t change the lower position number. Otherwise, after insertion of a substring into lower position, all your higher positions will shift by the length of the inserted substring.

    Additional Resources for SAS character strings processing

    Your thoughts?

    Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.

    Inserting a substring into a SAS string was published on SAS Users.

  • 2月 152021
     

    I've previously written about how to generate all pairwise interactions for a regression model in SAS. For a model that contains continuous effects, the easiest way is to use the EFFECT statement in PROC GLMSELECT to generate second-degree "polynomial effects." However, a SAS programmer was running a simulation study and wanted to be able to generate all pairwise interaction effects within SAS/IML. One way to solve the programmer's problem is to use the HDIR function in SAS/IML. This article introduces the HDIR function and shows how to use it to generate pairwise (quadratic) interactions for continuous effects.

    Generate pairwise interactions by using the EFFECT statement

    Before trying to solve a problem with a new tool, I like to use an existing method to solve the problem. Then I can compare the old and new answers and make sure they are the same.

    For sample data, let's use the numerical variables in the Sashelp.Class data. So that the techniques are clearer, I will rename the variables to X1, X2, and X3. We want to generate all six pairwise interactions, including the "pure quadratic" terms where a variable interacts with itself. The names of the interaction effects will be X1_2, X1_X2, x1_X3, X2_2, X2_X3, and X3_2. The names with "_2" at the end are pure quadratic effects; the others are interactions.

    One way to generate a design matrix that has all pairwise interactions is to use the EFFECT statement in PROC GLMSELECT. The interactions are written to the OUTDESIGN= data set. The columns of the design matrix have interpretable names, which are stored in the _GLSMOD macro variable.

    data Have;         /* create the example data */
    set sashelp.class(rename=(Age=X1 Height=X2 Weight=X3));
    if _N_=3 then X1=.;    /* add a missing value */
    keep _NUMERIC_;
    run;
     
    /* Use PROC GLMSELECT to generate all pairwise interactions.
       Step 1: Add a fake response variable */
    %let DSIn  = Have;         /* the name of the input data set */
    %let DSOut = Want;         /* the name of the output data set */
    data AddFakeY / view=AddFakeY;
       set &DSIn;
       _Y = 0;      /* add a fake response variable */
    run;
     
    /* Step 2: Use the EFFECT statement to generate the degree-2 effects, as shown in 
          https://blogs.sas.com/content/iml/2017/09/07/polynomial-effects-regression-sas.html */
    proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=&DSOut(drop=_Y);
       effect poly2 = polynomial( X1 X2 X3 / degree=2);
       model _Y = poly2 /  noint selection=none;  /* equiv to X1*X1 X1*X2 X1*X3 X2*X2 X2*X3 X3*X3 */
    run;
    %put &=_GLSMOD;    /* look at names of the interaction effects  in the OUTDESIGN= data */
     
    proc print data=&DSOut(obs=5);
       var X1_2 X1_X2 X1_X3 X2_2 X2_X3 X3_2; /* names are generated automatically */
    run;

    The output from PROC GLMSELECT contains the "pure quadratic" interactions (X1_2, X2_2, and X3_2) and the cross-variable interactions (X1_X2, X1_X3, and X2_X3). If you have k variables, there will be k pure quadratic terms and "k choose 2" cross-variable terms. Hence, the total number of quadratic interactions is "(k+1) choose 2," which is (k+1)*k/2. Here, k=3 and there are six quadratic interactions.

    Notice how PROC GLMSELECT handles the missing value in the third observation: because the X1 value is missing, the procedure puts a missing value into all interaction effects.

    The horizontal direct product between matrices

    The horizontal direct product between matrices A and B is formed by the elementwise multiplication of their columns. The operation is most often used to form interactions between dummy variables for two categorical variables. If A is the design matrix for the categorical variable C1 and B is the design matrix for the categorical variable C2, then HDIR(A,B) is the design matrix for the interaction effect C1*C2.

    The following simple example shows how the HDIR function works. The HDIR function returns a matrix whose columns are formed by elementwise multiplication of the columns of A and the matrix B. The first set of columns is the product A[,1]#B, the second set is A[,2]#B, and so forth. If A has k1 columns and B has k2 columns, then HDIR(A,B) has k1*k2 columns.

    proc iml;
    /* horizontal direct product multiplies each column of A by each column of B */
    A = {1 2,
         2 3,
         4 5};
    B = {0  1 2,
         1  1 3,
         2 -1 4};
    C = hdir(A,B);
    print C[c={'A1_B1' 'A1_B2' 'A1_B3' 'A2_B1' 'A2_B2' 'A2_B3'} L='hdir(A,B)'];

    Interactions of continuous variables

    The previous section shows that if X is a data matrix of continuous variables, the function call HDIR(X,X) generates all pairwise combinations of columns of X. Unfortunately, the matrix HDIR(X,X) contains more columns than we need. if X contains k columns, we need the "(k+1) choose 2" =(k+1)k/2 columns of interactions, but the matrix HDIR(X,X) contains k*k columns. The problem is that HDIR(X,X) contains columns for X1*X2 and for X2*X1, even though those columns are the same. The same holds for other crossed terms such as X1*X3 and X3*X1, X2*X3 and X3*X2, and so forth.

    If you want to use the HDIR function to generate all pairwise interactions, you have a choice: You can generate ALL products of pairs and delete the redundant ones, or you can compute only the unique pairs. I will show the latter because it is more efficient.

    varNames = 'X1':'X3';
    use Have;
       read all var varNames into X;
    close;
     
    /* Compute only the columns we need */
    A1 = HDIR(X[,1], X[,1:3]);    /* interaction of X1 with X1, X2, X3 */
    A2 = HDIR(X[,2], X[,2:3]);    /* interaction of X2 with X2 X3 */
    A3 = HDIR(X[,3], X[,3]);      /* interaction of X3 with X3 */
    A = A1 || A2 || A3;
     
    /* get the HEAD module from 
       https://blogs.sas.com/content/iml/2021/02/10/print-top-rows-of-data-sas.html
    */
    load module=(Head);
    run Head(A) colname={'X1_2' 'X1_X2' 'X1_X3' 'X2_2' 'X2_X3' 'X3_2'};

    The HEAD module displays only the top five rows of the matrix of interactions. This output is almost the same as the output from PROC GLMSELECT. A difference is how missing values propagate. For the third row, X1 is missing. PROC GLMSELECT puts a missing value in all columns of the pairwise interactions. In contrast, the SAS/IML output only has missing values in the first three columns because those are the only columns that involve the X1 variable. If your data contain missing values, you might want to use the CompleteCases function to delete the rows that have missing values before performing additional analyses.

    A function that computes interactions of continuous variables

    The previous section computes all quadratic interactions for a matrix that has three variables. It is straightforward to generalize the idea to k variables. You simply compute the interactions HDIR(X[,i],X[,i:k]) for i=1,2,...,k. This is done in the following program. During the computation, it is convenient to also generate names for the interactions. The following program generates the same names as PROC GLMSELECT. To make the code easier to understand, I encapsulate the logic for names into a helper function called GetNamePairs. The helper function is called as part of the InteractPairs module, which returns both the matrix of interactions and the character vector that names the columns:

    /* Helper function: Form names of interaction effects.
       Input: s is a scalar name. t is a character vector of names. 
       Return row vector of pairs of names "s_2" or "s_t[i]" 
       Example: GetNamePairs('X1', 'X1':'X3') returns {'X1_2' 'X1_X2' 'X1_X3'} */
    start GetNamePairs(s, t);
       k = nrow(t)*ncol(t);
       b = blankstr(nleng(s)+nleng(t)+1);  /* max length of interaction name */
       pairs = j(1, k, b);
       do i = 1 to k;
          if s=t[i] then pairs[i] = strip(s) + "_2";
          else           pairs[i] = catx("_", strip(s), strip(t[i])); 
       end;
       return pairs;
    finish;
     
    /* Generate all quadratic interactions for continuous variables.
       Input: design matrix X and a vector of column names.
       Output: OutX:     a matrix whose columns are the pairwise interactions
               OutNames: the names of interactions, such as Age_2 and Height_Age */
    start InteractPairs(OutX, OutNames, X, Names);
       k = ncol(X);
       numInteract = comb(k+1,2);  /* = k + comb(k,2) */
       OutX = j(nrow(X), numInteract, .);
       OutNames = j(1, numInteract, BlankStr(2*nleng(Names)+1));
       col = 1;              /* initial column to fill */
       do i = 1 to k;
          m = k-i+1;         /* number of interaction for this loop */
          OutX[,col:col+m-1]     = HDIR(X[,i], X[,i:k]);
          OutNames[,col:col+m-1] = GetNamePairs(Names[i], Names[i:k]);
          col = col + m;
       end;
    finish;
     
    run InteractPairs(OutX, OutNames, X, varNames);
    run Head(OutX) colname=OutNames;

    The output is identical to the earlier output. The input arguments to this module can have an arbitrary number of columns. In this example, the design matrix does not include an intercept column (a column that is all ones). Consequently, the output is the set of all quadratic interactions. If your design matrix includes an intercept column, the output will contain an intercept column and all main effects in addition to the quadratic effects.

    Technically, you don't need to use the HDIR function in the InteractPairs module. Instead, you can use the familiar '#' operator to perform the elementwise multiplication. However, if you try to generalize the module to handle the interaction between categorical variables and continuous variables, the HDIR function will be useful.

    Summary

    This article introduces the HDIR function in SAS/IML, which computes the horizontal direct product of matrices. You can use the HDIR function to generate interaction effects in regression models. This article shows how to generate all quadratic interactions among a set of continuous variables.

    The post Generate all quadratic interactions in a regression model appeared first on The DO Loop.