sas programming

2月 222021
 

A previous article describes how to use the SGPANEL procedure to visualize subgroups of data. It focuses on using headers to display information about each graph. In the example, the data are time series for the price of several stocks, and the headers include information about whether the stock price increased, decreased, or stayed the same during a time period. The previous article discusses several advantages and disadvantages of using the SGPANEL procedure for this task.

An alternative approach is to use the BY statement in the SGPLOT procedure to process each subgroup separately. This article shows how to use the #BYVAR and #BYVAL keywords in SAS titles to display information about the data in each subgroup.

The example data

I will use the same data as for the previous article. In real life, you would use a separate analysis to determine whether each stock increased or decreased, but I will hard-code this information for the three stocks in the example data.

The following DATA step creates a subset of the Sashelp.Stocks data. The STOCK variable contains the name of three stocks: IBM, Intel, and Microsoft. The OPEN variable contains the opening stock price for these companies for each month. The DATA step restricts the data to the time period Jan 1998 – May 2000. The TREND variable indicates whether the stock price increased or decreased during the time period.

data Have;
   set Sashelp.Stocks;
   where '01Jan1998'd <= Date <= '30May2000'd;
   /* prepare data to display information */
   if      Stock='IBM'       then Trend='Neutral   ';
   else if Stock='Intel'     then Trend='Increasing';
   else if Stock='Microsoft' then Trend='Decreasing';
run;
/* NOTE: The Sashelp.Stock data set is already sorted by Stock and by Date. 
   Be sure to sort your data if you want to use the BY statement. For example:
proc sort data=Have;
   by Stock Date;
run;
*/

You must sort the data for BY-group processing. The example data are already sorted by the STOCK variable, which is the grouping variable. And within each stock, the data are sorted by the DATE variable, which is important for visualizing the stock prices versus time.

Titles for BY-group analysis

When you run a BY-group analysis, SAS automatically creates a title that indicates the name and value of the BY-group variable(s). (This occurs whenever the BYLINE option is on, and it is on by default.) SAS looks at how many titles you have specified and uses the next available title to display the BY-group information. For example, without doing anything special, you can use the standard BY-group analysis to graph the prices for all three stocks in the data set:

/* assume OPTION BYLINE is set */
title "Stock Price Jan 1998 - May 2000";  /* the BY-line will appear in the TITLE2 position */
proc sgplot data=Have;
   by Stock;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;

To save space, I have truncated the output. Each graph shows only a subset of the data. The TITLE2 line displays the name of the BY-group variable (Stock) and the value of the variable. All this happens automatically.

Customize titles: The #BYVAL substitution

The TITLE and TITLEn statements in SAS support substituting the values of a BY-group variable. You can insert the name of a BY-group variable by using the #BYVARn keyword. You can insert the name of a BY-group value by using the #BYVALn keyword. When using these text substitutions, you should specify OPTIONS NOBYLINE to suppress the automatic generation of subtitles.

By default, the BY statement generates the plots one after another, as shown in the previous example. However, you can use the ODS LAYOUT GRIDDED statement to arrange the graphs in a lattice. Essentially, you are using ODS to replicate the layout that PROC SGPANEL handles automatically. In the following example, I let the vertical scale of the axes vary according to the values for each BY group. If you prefer, you can use the MIN= and MAX= options on the YAXIS statement to specify a range of values for each axis.

/* layout the graphs. Use the #BYVALn values to build the titles */
ods graphics / width=300px height=250px;         /* make small to fit on page */
options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
title "BY-Group Analysis of the #byvar1 Variable";/* substitute variable name for #BYVAR */
title2 "Time Series for #byval1";                /* substitute name of stock for #BYVAL */
 
proc sgplot data=Have;
   by Stock;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Now you can see the power of the #BYVAL keyword. (Click the graph to enlarge it.) It gives you great flexibility in creating a custom subtitle that contains the value of the BY-group variable. The keywords #BYVAR and #BYVAL are an alias for #BYVAR1 and #BYVAL1, just like TITLE is an alias for TITLE1. The next example uses a second BY-group variable.

Because the BY statement supports multiple BY-group variables and because you can specify the NOTSORTED option for variables that are not sorted, you can include the TREND variable as a second BY=group variable. You can then use both the #BYVAL1 and #BYVAL2 keywords to further customize the titles:

options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
title2 "The Time Series for #byval1 Is #byval2"; /* substitute stock name and trend value */
 
proc sgplot data=Have;
   by Stock TREND notsorted;
   series x=Date y=Open / lineattrs=(thickness=2);
   yaxis grid label="Stock Price"; /* optional: min=50 max=210 */
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Controlling attributes in a BY-group

I have one more tip. You can use a discrete attribute map to link the attributes of markers and lines to the value of a variable in the data. For example, suppose you want to color the lines in these plots according to whether the stock price increased, decreased, or stayed the same. The following DATA step creates a discrete attribute map that assigns the line colors based on the value in the TREND variable. On the PROC SGPLOT statement, you can use the DATTRMAP= option, which makes the data map available to the procedure. You can add the ATTRID= option to the SERIES statement. Because the colors are determined by the GROUP=TREND option, the procedure will look at the attribute map to determine which color to use for each line.

/* Create a discrete attribute map. Line color is determined by the TREND value. */
data Attrs;
length Value $20 LineColor $20;
ID = "StockTrend";
Value='Neutral';    LineColor = "DarkBlue    "; output;
Value='Increasing'; LineColor = "DarkGreen   "; output;
Value='Decreasing'; LineColor = "DarkRed     "; output;
run;
 
options nobyline;                                /* suppress Stock=Value title */
ods layout gridded columns=3 advance=table;      /* layout in three columns */
 
proc sgplot data=Have noautolegend DATTRMAP=Attrs;
   by Stock Trend notsorted;
   series x=Date y=Open / group=Trend ATTRID=StockTrend lineattrs=(thickness=2);
   yaxis grid label="Stock Price";
   xaxis display=(nolabel);
run;
 
ods layout end;                                  /* end the gridded layout */
options byline;                                  /* turn the option on again */

Summary and further reading

This article shows how to customize the title and attributes in graphs that are generated as part of a BY-group analysis. You can use the #BYVARn and #BYVALn keywords to insert information about the BY groups into titles. You can use a discrete attribute map to link attributes in a graph (such as line color) to the values of a variable in the data. Although creating a sequence of graphs by using a BY-group analysis is a powerful technique, I often prefer to use PROC SGPANEL, for the reasons discussed in a previous article. PROC SGPANEL provides support for controlling many features of the graphs and the layout of the graphs.

If you are interested in the SAS Macro solution to creating titles, you can read the original thread and solution on the SAS Support Communities. For more information about using the #BYVAR and #BYVAL keywords in SAS titles, see

The post How to use the #BYVAR and #BYVAL keywords to customize graph titles in SAS appeared first on The DO Loop.

2月 102021
 

One of the first things I learned in SAS was how to use PROC PRINT to display parts of a data set. I usually do not need to see all the data, so my favorite way to use PROC PRINT is to use the OBS= data set option to display the first few rows. For example, I often display the first five rows of a SAS data set as follows:

proc print data=Sashelp.Class(obs=5);
    * VAR Weight Height Age;  /* optional: the VAR statement specifies variables */
run;

By using the OBS= data set option, you can display only a few observations. This enables you to take a quick peek at the values of your data. As shown in the comment, you can optionally use the VAR statement to display only certain columns. (Use the FIRSTOBS= option if you want more control over the rows that are printed.)

Display the rows of a data table in SAS/IML

In a SAS/IML program, data are either stored in a table or in a matrix. If the data are in a table, you can use the TABLEPRINT subroutine to display the data. The NUMOBS= option enables you to display only a few rows:

proc iml;
TblClass = TableCreateFromDataset("sashelp", "class");
run TablePrint(TblClass) numobs=5;

The TABLEPRINT subroutine supports many options for printing, including the VAR= option for specifying only certain columns.

Display the rows of a matrix in SAS/IML

How can you display only a portion of a SAS/IML matrix? I often write statements like this to print only the first few rows:

/* read numerical data into the X matrix */
use Sashelp.Class; read all var _NUM_ into X[c=varNames]; close;
print (X[1:5,]);    /* print only rows 1 through 5 */

This works, but there are a few things that I don't like. My primary complaint is that X[1:5,] is a temporary matrix and therefore has no name. The rows are printed, but there is no header that tells me where the data came from. My second complaint is that the output does not indicate which rows are being displayed. Consequently, sometimes I include information in the label and add row and column headers:

print (X[1:5,])[rowname=(1:5) colname=varNames label="Top of X"];

The output now includes the information that I want, but that is a LOT of typing, especially if I want to display similar information for other matrices. Even if I use the abbreviated version of the PRINT options (R=, C=, and L=), it is cumbersome to type.

By the way, this PRINT statement demonstrates a new feature of SAS/IML 15.1 (which was released with SAS 9.4M6), which is that the ROWNAME= and COLNAME= options on the PRINT statement support numerical vectors. If you have an earlier version of SAS/IML, you can use rowname=(char(1:5)).

HEAD: A module to print the top rows of a matrix

There's a saying among computer programmers: if you find yourself writing the same statements again and again, create a function to do it. So, let's write a SAS/IML module to print the top rows of a matrix. Because there is a UNIX command called 'head' that displays the top lines of a file, I will use the same name.

Many years ago, I blogged about how to write a HEAD subroutine, although my emphasis was on how to use default arguments in SAS/IML functions. The following routine is a richer version of the previous function:

/* Print the first n rows of a matrix. Optionally, display names of columns */
start Head(x, n=5, colname=);
   m = min(n, nrow(x));         /* make sure n isn't too big */
   idx = 1:m;                   /* the rows to print */
   name = parentname("x");      /* name of symbol in calling environment */
   if name=" " then name = "Temp";  /* the parent name of a temporary variable is " "*/
   labl = "head(" + name + ") rows=" + strip(char(m));  /* construct the label */
   if isSkipped(colname) then  /* print the top rows */
      print (x[idx,])[r=idx label=labl];
   else 
      print (x[idx,])[r=idx c=colname label=labl];
finish;
 
run Head(X) colname=varNames;  /* example: call the HEAD module */

The HEAD subroutine uses three features of user-defined modules that you might not know about:

The result is a short way to display the top few rows of a matrix.

TAIL: A module to print the bottom rows of a matrix

Although I usually want to print the top row of a matrix, it is easy to modify the HEAD module to display the last n rows of a matrix. The following module, called TAIL, is almost identical to the HEAD module.

/* Print the last n rows of a matrix. Optionally, display names of columns */
start Tail(x, n=5, colname=);
   m = min(n, nrow(x));         /* make sure n isn't too big */
   idx = (nrow(x)-m+1):nrow(x); /* the rows to print */
   name = parentname("x");      /* name of symbol in calling environment */
   if name=" " then name = "Temp";  /* the parent name of a temporary variable is " "*/
   labl = "tail(" + name + ") rows=" + strip(char(m));  /* construct the label */
   if isSkipped(colname) then  /* print the bottom rows */
      print (x[idx,])[r=idx label=labl];
   else 
      print (x[idx,])[r=idx c=colname label=labl];
finish;
 
run Tail(X) colname=varNames;

Summary

Most SAS programmers know how to use the OBS= option in PROC PRINT to display only a few rows of a SAS data set. When writing and debugging programs in the SAS/IML matrix language, you might want to print a few rows of a matrix. This article presents the HEAD module, which displays the top rows of a matrix. For completeness, the article also defines the TAIL module, which displays the bottom rows of a matrix. If you find these modules useful, you can incorporate them into your SAS/IML programs.

For more tips and techniques related to SAS/IML modules, see the article "Everything you wanted to know about writing SAS/IML modules."

The post Print the top rows of your SAS data appeared first on The DO Loop.

1月 182021
 

Recommended soundtrack for this blog post: Netflix Trip by AJR.

This week's news confirms what I already knew: The Office was the most-streamed television show of 2020. According to reports that I've seen, the show was streamed for 57 billion minutes during this extraordinary year. I'm guessing that's in part because we've all been shut in and working from home; we crave our missing office interactions. We lived vicariously (and perhaps dysfunctionally) through watching Dunder Mifflin staff. But another major factor was the looming deadline of the departure of The Office from Netflix as of January 1, 2021. It was a well-publicized event, so Netflix viewers had to get their binge on while they could.

People in my house are fans of the show, and they account for nearly 6,000 of those 57 billion streaming minutes. I can be this precise (nerd alert!) because I'm in the habit of analyzing our Netflix activity by using SAS. In fact, I can tell you that since late 2017, we've streamed 576 episodes of The Office. We streamed 297 episodes in 2020. (Since the show has only 201 episodes we clearly we have a few repeats in there.)

I built a heatmap that shows the frequency and intensity of our streaming of this popular show. In this graph each row is a month, each square is a day. White squares are Office-free. A square with any red indicates at least one virtual visit with the Scranton crew; the darker the shade, the more episodes streamed during that day. You can see that Sept 15, 2020 was a particular big binge with 17 episodes. (Each episode is about 20-21 minutes, so it's definitely achievable.)

netflix trip through The Office

Heatmap of our household streaming of The Office

How to build the heatmap

To build this heatmap, I started with my Netflix viewing history (downloaded from my Netflix account as CSV files). I filtered to just "The Office (U.S.)" titles, and then merged with a complete "calendar" of dates between late 2017 and the start of 2021. Summarized and merged, the data looks something like this:


With all of the data summarized in this way such that there is only one observation per X and Y value, I can use the HEATMAPPARM statement in PROC SGPLOT to visualize it. (If I needed the procedure to summarize/bin the data for me, I would use the HEATMAP statement. Thanks to Rick Wicklin for this tip!)

proc sgplot data=ofc_viewing;
 title height=2.5 "The Office - a Netflix Journey";
 title2 height=2 "&episodes. episodes streamed on &days. days, over 3 years";
 label Episodes="Episodes per day";
 format monyear monyy7.;
 heatmapparm x=day y=monyear 
   colorresponse=episodes / x2axis
    outline
   colormodel=(white  CXfcae91 CXfb6a4a CXde2d26 CXa50f15) ;
 yaxis  minor reverse display=(nolabel) 
  values=(&amp;allmon.)
  ;
 x2axis values=(1 to 31 by 1) 
   display=(nolabel)  ;
run;

You can see the full code -- with all of the data prep -- on my GitHub repository here. You may even run the code in your own SAS environment -- it will fetch my Netflix viewing data from another GitHub location where I've stashed it.

Distribution of Seasons (not "seasonal distribution")

If you examine the heatmap I produced, you can almost see our Office enthusiasm in three different bursts. These relate directly to our 3 children and the moments they discovered the show. First was early 2018 (middle child), then late 2019 (youngest child), then late 2020 (oldest child, now 22 years old, striving to catch up).

The Office ran for 9 seasons, and our kids have their favorite seasons and episodes -- hence the repeated viewings. I used PROC FREQ to show the distribution of episode views across the seasons:


Season 1 is remarkably low for two reasons. First and most importantly, it contains the fewest episodes. Second, many viewers agree that Season 1 is the "cringiest" content, and can be uncomfortable to watch. (This Reddit user leaned into the cringe with his data visualization of "that's what she said" jokes.)

From the data (and from listening to my kids), I know that Season 2 is a favorite. Of the 60 episodes we streamed at least 4 times, 19 of them were in Season 2.

More than streaming, it's an Office lifestyle

Office fandom goes beyond just watching the show. Our kids continue to embrace The Office in other mediums as well. We have t-shirts depicting the memes for "FALSE." and "Schrute Farms." We listen to The Office Ladies podcast, hosted by two stars of the show. In 2018 our daughter's Odyssey of the Mind team created a parody skit based on The Office (a weather-based office named Thunder Mifflin) -- and advanced to world finals.

Rarely does a day go by without some reference to an iconic phrase or life lesson that we gleaned from The Office. We're grateful for the shared experience, and we'll miss our friends from the Dunder Mifflin Paper Company.

The post Visualizing our Netflix Trip through <em>The Office</em> appeared first on The SAS Dummy.

1月 182021
 

Have you ever heard of the DOLIST syntax? You might know the syntax even if you are not familiar with the name. The DOLIST syntax is a way to specify a list of numerical values to an option in a SAS procedure. Applications include:

  • Specify the end points for bins of a histogram
  • Specify percentiles to be output to a data set
  • Specify tick marks for a custom axis on a graph
  • Specify the location of reference lines on a graph
  • Specify a list of parameters for an algorithm. Examples include smoothing parameters (the SMOOTH= option in PROC LOESS), sample sizes (the NTOTAL= option in PROC POWER), and initial guess for parameters in an optimization (the PARMS statement in PROC NLMIXED and PROC NLIN)

This article demonstrates how to use the DOLIST syntax to specify a list of values in SAS procedures. It shows how to use a single statement to specify individual values and also a sequence of values.

The DOLIST syntax enables you to write a single statement that specifies individual values and one or more sequences of values. The DOLIST syntax should be in the toolbox of every intermediate-level SAS programmer!

The DOLIST syntax in the SAS DATA step

According to the documentation of PROC POWER, the syntax described in this article is sometimes called the DOLIST syntax because it is based on the syntax for the iterative DO loop in the DATA step.

The most common syntax for a DO loop is DO x = start TO stop BY increment. For example, DO x = 10 TO 90 BY 20; iterates over the sequence of values 10, 30, 50, 70, and 90. If the increment is 1, you can omit the BY increment portion of the statement. However, you can also specify values as a common-separated list, such as DO x = 10, 30, 50, 70, 90;, which generates the same values. What you might not know is that you can combine these two methods. For example, in the following DATA step, the values are specified by using two comma-separated lists and three sequences. For clarity, I have placed each list on a separate line, but that is not necessary:

/* the DOLIST syntax for a DO loop in the DATA step */
data A;
do pctl = 5,                  /* individual value(s)  */
          10 to 50 by 20,     /* a sequence of values */
          54.3, 69.1,         /* individual value(s)  */
          80 to 90 by 5,      /* another sequence     */
          60 to 40 by -20;    /* yet another sequence */
   output;
end;
run;
 
proc print; run;

The output (not shown) is a list of values: 5, 10, 30, 50, 54.3, 69.1, 80, 85, 90, 60, 40. Notice that the values do not need to be in sorted order, although they often are.

The expressions to the right of the equal sign are what I mean by the "DOLIST syntax." You can use the same syntax to specify a list of options in many SAS procedures. When the SAS documentation says that an option takes a "list of values," you can often use a comma-separated list, a space-separated list, and the syntax start TO stop BY increment. (Or a combination of these expressions!) The following sections provide a few examples, but there are literally hundreds of options in SAS that support the DOLIST syntax!

Some procedures (for example, PROC SGPLOT) require the DOLIST values to be in parentheses. Consequently, I have adopted the convention of always using parentheses around DOLIST values, even if the parentheses are not strictly required. As far as I know, it is never wrong to put the DOLIST inside parentheses, and it keeps me from having to remember whether parentheses are required. The examples in this article all use parentheses to enclose DOLIST values.

Histogram bins and percentiles

You can use the DOLIST syntax to specify the endpoints of bins in a histogram. For example, in PROC UNIVARIATE, the ENDPOINTS= option in the HISTOGRAM statement supports a DOLIST. Because histograms use evenly spaced bins, usually you will specify only one sequence, as follows:

proc univariate data=sashelp.cars;
   var weight;
   histogram weight / endpoints=(1800 to 7200 by 600);   /* DOLIST sequence expression */
run;

You can also use the DOLIST syntax to specify percentiles. For example, the PCTLPTS= option on the OUTPUT statement enables you to specify which percentiles of the data should be written to a data set:

proc univariate data=sashelp.cars;
   var MPG_City;
   output out=UniOut pctlpre=P_  pctlpts=(50 75, 95 to 100 by 2.5);  /* DOLIST */
run;

Notice that this example specifies both individual percentiles (50 and 75) and a sequence of percentiles (95, 97.5, 100).

Tick marks and reference lines

The SGPLOT procedure enables you to specify the locations of tick marks on the axis of a graph. Most of the time you will specify an evenly spaced set of values, but (just for fun) the following example shows how you can use the DOLIST syntax to combine evenly spaced values and a few custom values:

title "Specify Ticks on the Y Axis";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=Mpg_City;
   yaxis grid values=(10 to 40 by 5, 50 60); /* DOLIST; commas optional */
run;

As shown in the previous example, the GRID option on the XAXIS and YAXIS statements enables you to display reference lines at each tick location. However, sometimes you want to display reference lines independently from the tick marks. In that case, you can use the REFLINE statement, as follows:

title "Many Reference Lines";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=MPG_City;
   refline (1800 to 6000 by 600, 7000) / axis=x;  /* many reference lines */
run;

Statistical procedures

Many statistical procedures have options that support lists. In most cases, you can use the DOLIST syntax to provide values for the list.

I have already written about how to use the DOLIST syntax to specify initial guesses for the PARM statement in PROC NLMIXED and PROC NLIN. The documentation for the POWER procedure discusses how to specify lists of values and uses the term "DOLIST" in its discussion.

Some statistical procedures enable you to specify multiple parameter values, and the analysis is repeated for each parameter in the list. One example is the SMOOTH= option in the MODEL statement of the LOESS procedure. The SMOOTH= option specifies values of the loess smoothing parameter. The following call to PROC LOESS fits four loess smoothers to the data. The call to PROC SGPLOT overlays the smoothers on a scatter plot of the data:

proc loess data=sashelp.cars plots=none;
   model MPG_City = Weight / smooth=(0.1 to 0.5 by 0.2, 0.75); /* value-list */
   output out=LoessOut P=Pred;
run;
 
proc sort data=LoessOut; by SmoothingParameter Weight; run;
proc sgplot data=LoessOut;
   scatter x=Weight y=MPG_City / transparency=0.9;
   series x=Weight y=Pred / group=SmoothingParameter 
          curvelabel curvelabelpos=min;
run;

Summary

In summary, this article describes the DOLIST syntax in SAS, which enables you to simultaneously specify individual values and evenly spaced sequences of values. A sequence is specified by using the start TO step BY increment syntax. The DOLIST syntax is valid in many SAS procedures and in the DATA step. In some procedures (such as PROC SGPLOT), the syntax needs to be inside parentheses. For readability, you can use commas to separate individual values and sequences.

Many SAS procedures accept the special syntax even if it is not explicitly mentioned in the documentation. In the documentation for an option, look for terms such as value-list or numlist or value-1 <...value-n>, which indicate that the option supports the DOLIST syntax.

The post The DOLIST syntax: Specify a list of numerical values in SAS appeared first on The DO Loop.

12月 172020
 

There’s nothing worse than being in the middle of a task and getting stuck. Being able to find quick tips and tricks to help you solve the task at hand, or simply entertain your curiosity, is key to maintaining your efficiency and building everyday skills. But how do you get quick information that’s ALSO engaging? By adding some personality to traditionally routine tutorials, you can learn and may even have fun at the same time. Cue the SAS Users YouTube channel.

With more than 50 videos that show personality published to-date and over 10,000 hours watched, there’s no shortage of learning going on. Our team of experts love to share their knowledge and passion (with personal flavor!) to give you solutions to those everyday tasks.

What better way to round out the year than provide a roundup of our most popular videos from 2020? Check out these crowd favorites:

Most viewed

  1. How to convert character to numeric in SAS
  2. How to import data from Excel to SAS
  3. How to export SAS data to Excel

Most hours watched

  1. How to import data from Excel to SAS
  2. How to convert character to numeric in SAS
  3. Simple Linear Regression in SAS
  4. How to export SAS data to Excel
  5. How to Create Macro Variables and Use Macro Functions
  6. The SAS Exam Experience | See a Performance-Based Question in Action
  7. How it Import CSV files into SAS
  8. SAS Certification Exam: 4 tips for success
  9. SAS Date Functions FAQs
  10. Merging Data Sets in SAS Using SQL

Latest hits

  1. Combining Data in SAS: DATA Step vs SQL
  2. How to Concatenate Values in SAS
  3. How to Market to Customers Based on Online Behavior
  4. How to Plan an Optimal Tour of London Using Network Optimization
  5. Multiple Linear Regression in SAS
  6. How to Build Customized Object Detection Models

Looking forward to 2021

We’ve got you covered! SAS will continue to publish videos throughout 2021. Subscribe now to the SAS Users YouTube channel, so you can be notified when we’re publishing new videos. Be on the lookout for some of the following topics:

  • Transforming variables in SAS
  • Tips for working with SAS Technical Support
  • How to use Git with SAS

2020 roundup: SAS Users YouTube channel how to tutorials was published on SAS Users.

11月 202020
 

If you’re like me and the rest of the conference team, you’ve probably attended more virtual events this year than you ever thought possible. You can see the general evolution of virtual events by watching the early ones from April or May and compare them to the recent ones. We at SAS Global Forum are studying the virtual event world, and we’re learning what works and what needs to be tweaked. We’re using that knowledge to plan the best possible virtual SAS Global Forum 2021.

Everything is virtual these days, so what do we mean by virtual?

Planning a good virtual event takes time, and we’re working through the process now. One thing is certain -- we know the importance of providing quality content and an engaging experience for our attendees. We want to provide attendees with the opportunity as always, but virtually, to continue to learn from other SAS users, hear about new and exciting developments from SAS, and connect and network with experts, peers, partners and SAS. Yes, I said network. We realize it won’t be the same as a live event, but we are hopeful we can provide attendees with an incredible experience where you connect, learn and share with others.

Call for content is open

One of the differences between SAS Global Forum and other conferences is that SAS users are front and center, and the soul of the conference. We can’t have an event without user content. And that’s where you come in! The call for content opened November 17 and lasts through December 21, 2020. Selected presenters will be notified in January 2021. Presentations will be different in 2021; they will be 30 minutes in length, including time for Q&A when able. And since everything is virtual, video is a key component to your content submission. We ask for a 3-minute video along with your title and abstract.

The Student Symposium is back

Calling all postsecondary students -- there’s still time to build a team for the Student Symposium. If you are interested in data science and want to showcase your skills, grab a teammate or two and a faculty advisor and put your thinking caps on. Applications are due by December 21, 2020.

Learn more

I encourage you to visit the SAS Global Forum website for up-to-date information, follow #SASGF on social channels and join the SAS communities group to engage with the conference team and other attendees.

Connect, learn and share during virtual SAS Global Forum 2021 was published on SAS Users.

11月 102020
 

The code and data that drive analytics projects are important assets to the organizations that sponsor them. As such, there is a growing trend to manage these items in the source management systems of record. For most companies these days, that means Git. The specific system might be GitHub Enterprise, GitLab, or Bitbucket -- all platforms that are based on Git.

Many SAS products support direct integration with Git. This includes SAS Studio, SAS Enterprise Guide, and the SAS programming language. (That last one checks a lot of boxes for ways to use Git and SAS together.) While we have good documentation and videos to help you learn about Git and SAS, we often get questions around "best practices" -- what is the best/correct way to organize your SAS projects in Git?

In this article I'll dodge that question, but I'll still try to provide some helpful advice in the process.

Ask the Expert resource: Using SAS® With Git: Bring a DevOps Mindset to Your SAS® Code

Guidelines for managing SAS projects in Git

It’s difficult for us to prescribe exactly how to organize project repositories in source control. Your best approach will depend so much on the type of work, the company organization, and the culture of collaboration. But I can provide some guidance -- mainly things to do and things to avoid -- based on experience.

Do not create one huge repository

DO NOT build one huge repository that contains everything you currently maintain. Your work only grows over time and you'll come to regret/revisit the internal organization of a huge project. Once established, it can be tricky to change the folder structure and organization. If you later try to break a large project into smaller pieces, it can be difficult or impossible to maintain the integrity of source management benefits like file histories and differences.

Design with collaboration in mind

DO NOT organize projects based only on the teams that maintain them. And of course, don't organize projects based on individual team members.

  • Good repo names: risk-adjustment-model, engagement-campaigns
  • Bad repo names: joes-code, claims-dept

All teams reorganize over time, and you don't want to have to reorganize all of your code each time that happens. And code projects change hands, so keep the structure personnel-agnostic if you can. Major refactoring of code can introduce errors, and you don't want to risk that just because you got a new VP or someone changed departments.

Instead, DO organize projects based on function/work that the code accomplishes. Think modular...but don't make projects too granular (or you'll have a million projects). I personally maintain several SAS code projects. The one thing they have in common is that I'm the main contributor -- but I organize them into functional repos that theoretically (oh please oh please) someone else could step in to take over.

The Git view of my YouTube API project in SAS Enterprise Guide

Up with reuse, down with ownership

This might seem a bit communist, but collaboration works best when we don't regard code that we write as "our turf." DO NOT cling to notions of code "ownership." It makes sense for teams/subject-matter experts to have primary responsibility for a project, but systems like Git are designed to help with transparency and collaboration. Be open to another team member suggesting and merging (with review and approval) a change that improves things. GitHub, GitLab, and Bitbucket all support mechanisms for issue tracking and merge requests. These allow changes to be suggested, submitted, revised, and approved in an efficient, transparent way.

DO use source control to enable code reuse. Many teams have foundational "shared code" for standard operations, coded in SAS macros or shared statements. Consider placing these into their own project that other projects and teams can import. You can even use Git functions within SAS to fetch and include this code directly from your Git repository:

/* create a temp folder to hold the shared code */
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/shared-code;
libname repo "&repoPath.";
libname repo clear;
 
/* Fetch latest code from Git */
data _null_;
 rc = git_clone( 
   "https://gitlab.mycompany.com/sas-projects/shared-code/",
   "&repoPath.");
run;
 
options source2;
/* run the code in this session */
%include "&repoPath./bootstrap-macros.sas";

If you rely on a repository for shared code and components, make sure that tests are in place so changes can be validated and will not break downstream systems. You can even automate tests with continuous integration tools like Jenkins.

DO document how projects relate to each other, dependencies, and prepare guidance for new team members to get started quickly. For most of us, we feel more accountable when we know that our code will be placed in central repositories visible to our peers. It may inspire cleaner code, more complete documentation, and a robust on-boarding process for new team members. Use the Markdown files (README.md and others) in a repository to keep your documentation close to the code.

My SAS code to check Pagespeed Insights, with documentation

Work with Git features (and not against them)

Once your project files are in a Git repository, you might need to change your way of working so that you aren't going against the grain of Git benefits.

DO NOT work on code changes in a shared directory with multiple team members –- you'll step on each other. The advantage of Git is that it's a distributed workflow and each developer can work with their own copy of the repository, and merge/accept changes from others at their own pace.

DO use Git branching to organize and isolate changes until you are ready to merge them with the main branch. It takes a little bit of learning and practice, but when you adopt a branching approach you'll find it much easier to manage -- it beats keeping multiple copies of your code with slightly different file and folder names to mark "works in progress."

DO consider learning and using Git tools such as Git Bash (command line), Git GUI, and a code IDE like VS Code. These don't replace the SAS-provided coding tools with their Git integration, but they can supplement your workflow and make it easier to manage content among several projects.

Learning more

When you're ready to learn more about working with Git and SAS, we have many webinars, videos, and documentation resources:

The post How to organize your SAS projects in Git appeared first on The SAS Dummy.

11月 102020
 

As a SAS consultant I have been an avid user of SAS Enterprise Guide for as long as I can remember. It has been not just my go-to tool, but that of many of the SAS customers I have worked with over the years.

It is easy to use, the interface intuitive, a Swiss Army knife when it comes to data analysis. Whether you’re looking to access SAS data or import good old Excel locally, join data together or perform data analysis, a few clicks and ta-dah, you’re there! Alternatively, if you insist on coding or like me, use a bit of both, the ta-dah point still holds.

SAS Enterprise Guide, or EG as it is commonly known as, is a mature SAS product with many years of R&D, an established user base, a reliable and trusted product. So why move to SAS Studio? Why should I leave the comfort of what works?

For the last nine months I have been working with one of the UK’s largest supermarket answering that exact question as they make that journey from SAS Enterprise Guide to SAS Studio. EG is used widely across several supermarket operations, including:

  • supply chain (to look at wastage and stock availability)
  • marketing analytics (to look at customer behaviour and build successful campaigns)
  • fraud detection (to detect misuse of vouchers).

What is SAS Studio?

Firstly, let's answer the "what is SAS Studio" question. It is the browser-based interface for SAS programmers to run code or use predefined tasks to automatically generate SAS code. Since there is nothing to install on your desktop, you can access it from almost any machine: Windows or Mac. And SAS Studio is brought to you by the same SAS R&D developers who maintain SAS Enterprise Guide.

SAS Studio with Ignite (dark) theme

1. Still does the regular stuff

It allows you to access your data, libraries and existing programs and import a range of data sources including Excel and CSV. You can code or use the tasks to perform analysis. You can build queries to join data, create simple and complex expressions, filter and sort data.

But it does much more than that... So what cool things can you do with SAS Studio?

2. Use the processing power of SAS Viya

SAS Studio (v5.2 onwards) works on SAS Viya. Previously SAS 9 had the compute server aka the workspace server as the processing engine. SAS Viya has CAS, the next generation SAS run time environment which makes use of both memory and disk. It is distributed, fault tolerant, elastic and can work on problems larger than the available RAM. It is all centrally managed, secure, auditable and governed.

3. Cool new functionality

SAS Studio comes with many enhancements and cool new functionality:

  • Custom tasks. You can easily build your own custom tasks (software developer skills not required) so others without SAS coding skills can utilise them. Learn more in this Ask the Expert session.
  • Code snippets. It comes with pre-defined code snippets, commonly used bits of code that you can save and reuse. Additionally, you can create your own which you can share with colleagues. Coders love that these code snippets can be used with keystroke abbreviations.
  • Background submit.  This allows you to run code in the background whilst you continue to work.
  • DATA step debugger. First added into SAS Enterprise Guide, SAS Studio now offers an interactive DATA step debugger as well.
  • Flexible layout for your workspace, You can have multiple tabs open for each program, and open multiple datasets and items.
  • FEDSQL. The query window

    DATA step debugger in SAS Studio

4. Seamlessly access the full suite of SAS Viya capabilities

A key benefit of SAS Studio is the ease of which you can move from writing code to doing some data discovery, visualisation and model building. Previously in the SAS 9 world you may have used EG to access and join your data and then move to SAS Enterprise Miner, a different interface, installed separately to build a model. Those days are long gone.

To illustrate the point, if I wanted to build a campaign to see who would respond to a supermarket voucher, I could access my customer data and join that to my transaction and products data in SAS Studio. I could then move into SAS Visual Analytics to identify the key variables I would need to build an analytical model and even the best model to build. From there I would move to SAS Visual Data Mining and Machine Learning to build the model. I could very easily use the intuitive point-and-click pipeline interface to build several models, incorporating an R or Python to find the best model. This would all be done within one browser-based interface and the data being loaded only once.

This tutorial from Christa Cody illustrates this coding workflow in action.

The Road to SAS Studio

SAS Studio clearly has a huge number of benefits, it does the regular stuff you would expect, but additionally brings a host of cool new functionality and the processing power of SAS Viya, not to mention allowing you to move seamlessly to the next steps of the analytical and decisioning journey including model building, creating visualisations, etc.

Change management + technical enablement = success

Though adoption of modern technology can bring significant benefits to enterprise organisations as this supermarket is seeing, it is not without its challenges. Change is never easy and the transition from EG to Studio will take time. Especially with a mature, well liked and versatile product like EG.

The cultural challenge that new technology provides should not be underestimated and can provide a barrier to adoption. Newer technology requires new approaches, a different way of working across diverse user communities many of whom have well established working practices that may in some cases, resist change. The key is to invest the time with the communities, explain how newer technology can support their activities more efficiently and provide them with broader capability.

Learn more

Visit the Learn and Support center for SAS Studio.

Moving from SAS Enterprise Guide to SAS Studio was published on SAS Users.

11月 022020
 

When there are two equivalent ways to do something, I advocate choosing the one that is simpler and more efficient. Sometimes, I encounter a SAS program that simulates random numbers in a way that is neither simple nor efficient. This article demonstrates two improvements that you can make to your SAS code if you are simulating binary variables or categorical variables.

Simulate a random binary variable

The following DATA step simulates N random binary (0 or 1) values for the X variable. The probability of generating a 1 is p=0.6 and the probability of generating a 0 is 0.4. Can you think of ways that this program can be simplified and improved?

%let N = 8;
%let seed = 54321;
 
data Binary1(drop=p);
call streaminit(&seed);
p = 0.6;
do i = 1 to &N;
   u = rand("Uniform");
   if (u < p) then
      x=1;
   else
      x=0;
   output;
end;
run;
 
proc print data=Binary1 noobs; run;

The goal is to generate a 1 or a 0 for X. To accomplish this, the program generates a random uniform variate, u, which is in the interval (0, 1). If u < p, it assigns the value 1, otherwise it assigns the value 0.

Although the program is mathematically correct, the program can be simplified. It is not necessary for this program to generate and store the u variable. Yes, you can use the DROP statement to prevent the variable from appearing in the output data set, but a simpler way is to use the Bernoulli distribution to generate X directly. The Bernoulli(p) distribution generates a 1 with probability p and generates a 0 with probability 1-p. Thus, the following DATA step is equivalent to the first, but is both simpler and more efficient. Both programs generate the same random binary values.

data Binary2(drop=p);
call streaminit(&seed);
p = 0.6;
do i = 1 to &N;
   x = rand("Bernoulli", p);     /* Bern(p) returns 1  with probability p */
   output;
end;
run;
 
proc print data=Binary2 noobs; run;

Simulate a random categorical variable

The following DATA step simulates N random categorical values for the X variable. If p = {0.1, 0.1, 0.2, 0.1, 0.3, 0.2} is a vector of probabilities, then the probability of generating the value i is p[i]. For example, the probability of generating a 3 is 0.2. Again, the program generates a random uniform variate and uses the cumulative probabilities ({0.1, 0.2, 0.4, 0.5, 0.8, 1}) as cutpoints to determine what value to assign to X, based on the value of u. (This is called the inverse CDF method.)

/* p = {0.1, 0.1, 0.2, 0.1, 0.3, 0.2} */
/* Use the cumulative probability as cutpoints for assigning values to X */
%let c1 = 0.1;
%let c2 = 0.2;
%let c3 = 0.4;
%let c4 = 0.5;
%let c5 = 0.8;
%let c6 = 1;
 
data Categorical1;
call streaminit(&seed);
do i = 1 to &N;
   u = rand("Uniform");
   if (u <=&c1) then
      y=1;
   else if (u <=&c2) then
      y=2;
   else if (u <=&c3) then
      y=3;
   else if (u <=&c4) then
      y=4;
   else if (u <=&c5) then
      y=5;
   else
      y=6;
   output;
end;
run;
 
proc print data=Categorical1 noobs; run;

If you want to generate more than six categories, this indirect method becomes untenable. Suppose you want to generate a categorical variable that has 100 categories. Do you really want to use a super-long IF-THEN/ELSE statement to assign the values of X based on some uniform variate, u? Of course not! Just as you can use the Bernoulli distribution to directly generate a random variable that has two levels, you can use the Table distribution to directly generate a random variable that has k levels, as follows:

data Categorical2;
call streaminit(&seed);
array p[6] _temporary_ (0.1, 0.1, 0.2, 0.1, 0.3, 0.2);
do i = 1 to &N;
   x = rand("Table", of p[*]); /* Table(p) returns i with probability p[i] */
   output;
end;
 
proc print data=Categorical2 noobs; run;

Summary

In summary, this article shows two tips for simulating discrete random variables:

  1. Use the Bernoulli distribution to generate random binary variates.
  2. Use the Table distribution to generate random categorical variates.

These distributions enable you to directly generate categorical values based on supplied probabilities. They are more efficient than the oft-used method of assigning values based on a uniform random variate.

The post Tips to simulate binary and categorical variables appeared first on The DO Loop.

10月 052020
 

Finite-precision computations can be tricky. You might know, mathematically, that a certain result must be non-negative or must be within a certain interval. However, when you actually compute that result on a computer that uses finite-precision, you might observe that the value is slightly negative or slightly outside of the interval. This frequently happens when you are adding numbers that are not exactly representable in base 2. One of the most famous examples is the sum
    0.1 + 0.1 + 0.1 ≠ 0.3 (finite precision),
which is shown to every student in Computer Science 101. Other examples include:

  • If x is in the interval [0,1], then y = 1 - x10 is also in the interval [0,1]. That is true in exact precision, but not necessarily true in finite-precision computations when x is close to 1.
  • Although sin2(t) + cos2(t) = 1 in exact arithmetic for all values of t, the equation might not be true in finite precision.

SAS programs that demonstrate the previous finite-precision computations are shown at the end of this article. Situations like these can cause problems when you want to use the result in a function that has a restricted domain. For example, the SQRT function cannot operate on negative numbers. For probability distributions, the quantile function cannot operate on numbers outside the interval [0,1].

This article discusses how to "trap" results that are invalid because they are outside of a known interval. You can use IF-THEN/ELSE logic to catch an invalid result, but SAS provides more compact syntax. This article discusses using the IFN function in Base SAS and using the "elementwise minimum" (<>) and "elementwise maximum" (><) operators in the SAS/IML language.

Trap and map

I previously wrote about the "trap and cap" technique for handling functions like log(x). The idea is to "trap" invalid input values (x ≤ 0) and "cap" the output when you are trying to visualize the function. For the present article, the technique is "trap and map": you need to detect invalid computations and then map the result back into the interval that you know is mathematically correct. For example, if the argument is supposed to represent a probability in the interval [0,1], you must map negative numbers to 0 and map numbers greater than 1 to 1.

How to use the IFN function in SAS

Clearly, you can "trap and map" by using IF-THEN/ELSE logic. For example, suppose you know that a variable, x, must be in the interval [0,1]. You can use the SAS DATA step to trap invalid values and map them into the interval, as follows:

/* trap invalid values of x and map them into [0,1]. Store result in y */
data Map2;
input x @@;
if x<0 then 
   y = 0;
else if x>1 then 
   y = 1;
else 
   y = x;
datalines;
1.05 1 0.9 0.5 0 -1e-16 -1.1
;

This program traps the invalid values, but it requires six lines of IF-THEN/ELSE logic. A more compact syntax is to use the IFN function. The IFN function has three arguments:

  • The first argument is a logical expression, such as x < 0.
  • The second argument is the value to return when the logical expression is true.
  • The third argument is the value to return when the logical expression is false.

For example, you can use the function call IFN(x<0, 0, x) to trap negative values of x and map those values to 0. Similarly, you can use the function call IFN(x>1, 1, x) to trap values of x that are greater than 1 and map those values to 1. To perform both of the trap-and-map operations on one line, you can nest the function calls so that the third argument to the IFN function is itself a function call, as follows:

data Map2;
input x @@;
y = ifn(x<0, 0, ifn(x>1, 1, x));   /* trap and map: force x into [0,1] */
 
/* or, for clarity, split into two simpler calls */
temp = ifn(x>1, 1, x);             /* force x <= 1 */
z = ifn(x<0, 0, temp);             /* force x >= 0 */
datalines;
1.05 1 0.9 0.5 0 -1e-16 -1.1
;
 
proc print noobs; run;

The output shows that the program performed the trap-and-map operation correctly. All values of y are in the interval [0,1].

If you can't quite see how the logic works in the statement that nests the two IFN calls, you can split the logic into two steps as shown later in the program. The first IFN call traps any variables that are greater than 1 and maps them to 1. The second IFN call traps any variables that are less than 0 and maps them to 0. The z variable has the same values as the y variable.

How to use the element min/max operators in SAS/IML

The SAS/IML language contains operators that perform similar computations. If x is a vector of numbers, then

  • The expression (x <> 0) returns a vector that is the elementwise maximum between the elements of x and 0. In other words, any elements of x that are less than 0 get mapped to 0.
  • The expression (x >< 1) returns a vector that is the elementwise minimum between the elements of x and 1. In other words, any elements of x that are greater than 1 get mapped to 1.
  • You can combine the operators. The expression (x <> 0) >< 1 forces all elements of x into the interval [0,1]. Because the elementwise operators are commutative, you can also write the expression as 0 <> x >< 1.

These examples are shown in the following SAS/IML program:

proc iml;
x = {1.05, 1, 0.9, 0.5, 0, -1e-16, -1.1};
y0 = (x <> 0);                     /* force x >= 0 */
y1 = (x >< 1);                     /* force x <= 1 */
y01 = (x <> 0) >< 1;               /* force x into [0,1] */
print x y0 y1 y01;

Summary

In summary, because of finite-precision computations (or just bad input data), it is sometimes necessary to trap invalid values and map them into an interval. Using IF-THEN/ELSE statements is clear and straightforward, but require multiple lines of code. You can perform the same logical trap-and-map calculations more compactly by using the IFN function in Base SAS or by using the elementwise minimum or maximum operators in SAS/IML.

Appendix: Examples of floating-point computations to study

The following SAS DATA step programs show typical computations for which the floating-point computation can produce results that are different from the expected mathematical (full-precision) computation.

data FinitePrecision;
z = 0.1 + 0.1 + 0.1;
if z^=0.3 then 
   put 'In finite precision: 0.1 + 0.1 + 0.1 ^= 0.3';
run;
 
data DomainError1;
do t = -4 to 4 by 0.1;
   x = cos(t);
   y = sin(t);
   z = x**2 + y**2;  /* always = 1, mathematically */
   s = sqrt(1 - z);  /* s = sqrt(0) */
   output;
end;
run;
 
data DomainError2;
do x = 0 to 1 by 0.01;
   z = 1 - x**10;  /* always in [0,1], mathematically */
   s = sqrt(z);  
   output;
end;
run;