regular expressions

February 8, 2019
 

Since we added the new "Recommended by SAS" widget in the SAS Support Communities, I often find myself diverted to topics that I probably would not have found otherwise. This is how I landed on this question and solution from 2011 -- "How to convert 5ft. 4in. (Char) into inches (Num)". While the question was deftly answered by my friend (and SAS partner) Jan K. from the Netherlands, the topic inspired me to take it a step further here.

Jan began his response by throwing a little shade on the USA:

Short of moving to a country that has a decent metric system in place, I suggest using a regular expression.

On behalf of my nation I just want to say that, for the record, we tried. But we did not get very far with our metrication, so we are stuck with the imperial system for most non-scientific endeavors.

Matching patterns with a regular expression

Regular expressions are a powerful method for finding specific patterns in text. The syntax of regular expressions can be a challenge for those getting started, but once you've solved a few pattern-recognition problems by using them, you'll never go back to your old methods.

Beginning with the solution offered by Jan, I extended the program to read a "ft. in." measurement, parse out the component number values, express the total value in inches, and then convert the measurement to centimeters. I know that even with my changes, we can think of patterns that might not be matched.

Here's my program:

data measure;
 length 
     original $ 25
     feet 8 inches 8 
     total_inches 8 total_cm 8;
 /* constant regex is parsed just once */
 re = prxparse('/(\d*)ft.(\s*)((\d?\.?\d?)in.)?/'); 
 input;
 original = _infile_;
 if prxmatch(re, original) then do;
  feet =   input ( prxposn(re, 1, original), best12.);
  inches = input ( prxposn(re, 4, original), best12.);
  if missing(inches) and not missing(feet) then inches=0;
 end;
 else 
   original = "[NO MATCH] " || original;
 total_inches = (feet*12) + inches;
 total_cm = total_inches * 2.54;
 drop re;
cards;
5ft. 4in.
4ft 0in.
6ft. 10in.
3ft.2in.
4ft.
6ft.     1.5in.
20ft. 11in.
25ft. 6.5in.
Soooooo big
;
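
If you want to eyeball the converted values, a quick PROC PRINT of the MEASURE data set will do; this step is my own addition, and the format choice is just a suggestion:

/* not part of the original program: a quick look at the converted values */
proc print data=measure noobs;
  var original feet inches total_inches total_cm;
  format total_cm 8.1;
run;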

Other tools to help with regular expressions

The Internet offers a plethora of online tools to help developers build and test regular expression syntax. Here's a screenshot from RegExr.com, which I used to test some aspects of my program.

Tools like these provide wonderful insight into the capture groups and regex directives that will influence the pattern matching. They are part tutorial, part workbench for regex crafters at all levels.

Many of these tools also include community contributions for matching common patterns. For example, many of us often need to match/parse data with e-mail addresses, phone numbers, and other tokens as data fields. Sites like RegExr.com include the syntax for many of these common patterns that you can copy and use in your SAS programs.
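
To make that concrete, here's a rough sketch of how a borrowed pattern might be dropped into a DATA step. The data set and variable names are made up, and the email pattern is deliberately simplified -- nowhere near a complete RFC-grade matcher:

/* Sketch: flag records whose text contains something email-like.
   MYDATA and TEXT are hypothetical names; the pattern is simplified. */
data flagged;
  set mydata;
  retain email_re;
  if _n_ = 1 then email_re = prxparse('/[\w.+-]+@[\w-]+(\.[\w-]+)+/');
  has_email = (prxmatch(email_re, text) > 0);
  drop email_re;
run;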

The post Convert a text-based measurement to a number in SAS appeared first on The SAS Dummy.

July 3, 2017
 

If you spend a lot of time in SAS Enterprise Guide (as I do), you probably get to know its features pretty well. But we don't always take the time to explore as we should, so there might be a few golden nuggets of editor knowledge that have escaped you so far. Here are 10 program editor features that I've found essential while writing, editing, and debugging SAS programs. How many of these do you already know and use?

1. Turn on the line numbers

We programmers like to count lines of code. The SAS log often uses line numbers to reference problems in WARNINGs and ERRORs. So of course, you should have line numbers displayed in the program editor. But they aren't on by default. Go to Program → Editor Options and select "Show line numbers" to turn them on.

line numbers

2. Get the tabs out (or leave them in)

Tabs or spaces? Your choice here can have a significant effect on your earning potential, and perhaps even on your love life. Most code editors have options that support your choice, regardless of which camp you choose. SAS Enterprise Guide offers these:

  • Tab size - width of a tab character, represented in number of spaces. Default is 4, but I like to use 2 as it makes my program lines less wide.
  • Insert spaces for tabs - when you press the TAB key, don't add a TAB character but instead add the specified number of space characters.
  • Replace tabs with spaces on file open - a perfect passive-aggressive option when working with team members who disagree with your TAB world view. This option will change TAB characters to spaces when you open the program file. If you must retain the TAB characters...well, my main advice is do not rely on TAB characters in your code file. But if you must for some crazy reason, don't select this option and sign a pact with your teammates for the same.

tabs vs spaces

3. Define abbreviations for commonly used code

The most common code snippet that I reuse is a LIBNAME statement that points to my project data. To save on typing and mistakes, I've saved this as an editor abbreviation. When I begin to type the alias I've assigned to the snippet, the program editor offers to complete it for me. The custom abbreviation is presented alongside all of the other built-in syntax suggestions.

abbrev animation

See more about editor abbreviations in this article.
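
A saved snippet doesn't have to be fancy. A hypothetical LIBNAME abbreviation (the alias, libref, and path here are all made up) might expand to something like this:

/* hypothetical text stored behind an abbreviation such as "projlib" */
libname projdata "C:\Projects\CurrentStudy\data";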

4. Let the editor format your code

As shown in the vigorous "TABS vs spaces" debate, programmers care deeply about how their code is formatted. Individuals and teams adopt various standards for line breaks and indenting, and these are usually particular to the programming language. In general, SAS doesn't care how your code is laid out -- statements are delimited by semicolons, and that's the only cue that SAS needs. However, your teammates (and your future YOU, rereading your code) might appreciate something a little more readable.

Press Ctrl+I to format your entire program, applying some reasonable readability rules to indent code lines with conditionals and looping logic. Or select just a portion of the program and press Ctrl+I to affect a smaller part of the program. You can adjust some of the formatting rules by visiting Program → Editor Options, the Indenter tab.

autoformat code

5. Zoom out for the big picture

Some SAS programs are long -- hundreds (or thousands!) of lines of code. Sometimes it's helpful to get a bird's-eye view of your code to understand its structure and to help you navigate. The Zoom feature is super helpful for this. Simply press Ctrl+- (control-minus) until you get the view you need. Press Ctrl++ (control-plus) to zoom back in, or press Ctrl+0 to get to the 100% view.

This trick works for SAS logs as well, and also data sets and ODS output (including text listing, which uses the program editor in a special mode for viewing SAS output).

zoom out

6. Change the program editor font

Want to waste an afternoon? Search the Internet for "best font for programmers" and experiment with all of the results that you find. I discovered Consolas (built into Microsoft Windows) a decade ago, and I've yet to find anything better. I use it for all of my "fixed font" needs: programming, terminal windows, command consoles, etc. But you can choose your own favorite -- just don't feel that you're stuck with the default "Courier" that seems to be standard issue.

Change your font in Program → Editor Options, Appearance tab. You'll find lots of elements that you can tweak for typeface, size and color.

7. Select columns of content with block selection

Even though column block selection -- also known as "Alt+Select" -- is a standard feature in most advanced text editors, many programmers don't know about it. It's the perfect trick for selecting just a few columns of your text without including the content that's on the rest of the line. In SAS programming, this can be handy for selecting columns of values from the text listing output and pasting somewhere else, such as into a DATALINES block. It takes a little practice to master the Alt+Select, but once you do you'll find all sorts of uses for it. To get started, simply hold down the Alt key and click-drag to highlight a vertical column of text within the editor.

column selection animation
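
To picture where those pasted columns might end up, here's a small, made-up example of a DATALINES block; the variable names and values are invented:

/* hypothetical target for columns pasted via Alt+Select */
data pasted;
  input name $ height weight;
  datalines;
Alfred 69 112.5
Alice  56  84.0
;
run;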

8. Find (and replace) using regular expressions

Regular expressions are a powerful, if confusing, method for finding and replacing text that matches certain patterns. The Find/Find and Replace window in SAS Enterprise Guide supports "Regular expression search" as a checkbox option.

Here's an example. Suppose I wanted to find all occurrences of 3 numbers after the thousands separator (comma) at the end of each data line -- and I wanted to turn those digits into zeros. (I don't know why--but just stick with me here.) A regex pattern to match this is ",\d\d\d\n" (comma, followed by 3 occurrences of numeric digits, followed by a line ending). Here's an animation of this in action.

regex replace animation
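
If you'd rather script that substitution than do it interactively in the editor, PRXCHANGE can do the same thing in a DATA step. This is just a sketch -- HAVE and LINE are stand-in names, and because a character variable has no trailing newline, the pattern anchors to the end of the string instead of \n:

/* Sketch: zero out the three digits after the final thousands separator.
   HAVE and LINE are hypothetical; $ stands in for the \n used in the editor. */
data fixed;
  set have;
  line = prxchange('s/,\d{3}\s*$/,000/', 1, line);
run;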

For more, select Help→SAS Enterprise Guide help and search for "regular expressions". The help topics contain several examples of useful patterns.

9. Scroll just part of your document using a split view

Do you find yourself scrolling back and forth in your program view? Trying to remember what was in that DATA step at the top of your program so you can reference the proper variable in another part of your code? Instead of dealing with "scrolling whiplash", you can split the program editor view to keep one part of your code always visible while you work on another code segment that's hundreds of lines away from it.

split view

There are several ways to split your view of SAS code, log output, and listing. Check out the article here for details.

10. Break out to your other favorite editor

Please don't tell anyone, but I have a secret: SAS Enterprise Guide is not my default application associated with .SAS files. When I double-click on a .SAS file in Windows Explorer, I like to use Notepad++ to provide a quick view of what's in that program file. Don't get me wrong: I use SAS Enterprise Guide for all of my serious SAS programming work. With syntax suggestions, color coding, built-in DATA step debugger, and more -- there just isn't a better, more full-featured environment. (No, I'm not trying to troll you, diehard SAS display manager users -- you keep using what makes you happy.) But Notepad++ has a deep set of text editing features, and sometimes I like to use it for hardcore find/replace functions, deeper inspection of special characters in my files, and more.

You can launch your program into your other favorite editor from SAS Enterprise Guide. Simply right-click on the program node in your process flow, select Open → Open <program name> with Windows Default. And make sure your other editor is registered in Windows as the default "Open with" action for SAS programs. Note: this trick works only with SAS programs that you've saved locally on your Windows file system.

Open with default

More than editing -- this is your workbench

The program editor isn't just about "editing programs." It's also the launchpad for several other programmer-centric features, such as debugging your DATA step, comparing your SAS programs, viewing program history and source control, and more. If you use SAS Enterprise Guide, take the time to learn about all of its programming features -- you'll become a more productive programmer as a result.

The post Ten SAS Enterprise Guide program editor tricks appeared first on The SAS Dummy.

April 7, 2016
 

I know what you're thinking: two "Boaty McBoatface" articles within two weeks? And we're past April Fool's Day?

But since I posted my original analysis about the "Name our ship" phenomenon that's happening in the UK right now, a new contender has appeared: Poppy-Mai.

The cause of Poppy-Mai, a critically ill infant who has captured the imagination of many British citizens (and indeed, of the world), has made a very large dent in the lead that Boaty McBoatface holds.

[Screenshot: current vote standings]
Yes, "Boaty" still has a-better-than 4:1 lead. But that's a lot closer than the 10:1 lead (over "Henry Worsley") from just over a week ago. Check out the box plot now: you can actually make out a few more dots. Voting is open for another 10 days -- and as we have seen, a lot can happen in that time.

[Box plot: distribution of votes across all submissions]
As I take this second look at the submissions (now almost 6300) and voting data (almost 350,000 votes cast), I've found a few more entries that made me chuckle. Some of them struck me by their word play, and others cater to my nerdy sensibilities. Here they are (capitalization retained):

While I'm on this topic, I want to give a shout-out to regex101, the online regular expression tester. I was able to develop and test my regular expressions before dropping them into a PRXPARSE function call. I found that I had to adjust my regular expression to cast a wider net for valid titles from the names submissions data. Previously, I wasn't capturing all of the punctuation. While that's probably because I didn't expect punctuation to be part of a ship's name, that assumption doesn't stop people from suggesting and voting on such names. My new regex match:

  title_regex = prxparse("/'title':\s?""([a-zA-Z0-9'.\-_#\s$%&()@!]+)/");

I could probably optimize by specifying an exception pattern instead of an inclusion pattern...but this isn't the sort of project where I worry about that.

Will I write about Boaty McBoatface again? What will my next Boaty article reveal? Stay tuned!

tags: Boaty McBoatface, regular expressions, SAS programming, SGPLOT

The post Boaty McBoatface is on the run appeared first on The SAS Dummy.

March 26, 2016
 

In a voting contest, is it possible for a huge population to get behind a ridiculous candidate with such force that no other contestant can possibly catch up? The answer is: Yes.

Just ask the folks at NERC, the environmental research organization in the UK. They are commissioning a new vessel for polar research, and they decided to crowdsource the naming process. Anyone in the world is welcome to visit their NameOurShip web site and suggest a name or vote on an existing name submission.

As of today, the leading name is "RRS Boaty McBoatface." ("RRS" is the standard prefix for a Royal Research Ship.) This wonderfully creative name is winning the race by more than just a little bit: it has 10 times as many votes as the next highest vote-getter, "RRS Henry Worsley".

I wondered whether the raw data for this poll might be available, and I was pleased to find it embedded in the web page that shows the current entries. The raw data is in JSON format, embedded in the source of the HTML page. I saved the web page source to my local machine, copied out just the JSON line with the submissions data, then used SAS to parse the results. Here's my code:

filename records "c:\projects\votedata.txt";

data votes (keep=title likes);
 length likes 8;
 format likes comma20.;
 label likes="Votes";
 length len 8;
 infile records;
  if _n_ = 1 then
    do;
      retain likes_regex title_regex;
      likes_regex = prxparse("/'likes':\s?([0-9]*)/");
      title_regex = prxparse("/'title':\s?""([a-zA-Z0-9'\s]+)/");
    end;
 input;

 position = prxmatch(likes_regex,_infile_);
  if (position ^= 0) then
    do;
      call prxposn(likes_regex, 1, start, len);
      likes = substr(_infile_,start,len);
    end;
 start=0; len=0;

 position = prxmatch(title_regex,_infile_);
  if (position ^= 0) then
    do;
      call prxposn(title_regex, 1, start, len);
      title = substr(_infile_,start,len);
    end;
run;
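
As an aside: if you're on a recent SAS release (9.4M4 or later) and you save just the JSON portion to its own file, the JSON LIBNAME engine can read it with no regular expressions at all. A sketch, with a made-up file name:

/* Sketch: read a saved JSON file directly (SAS 9.4M4 or later); the path is hypothetical */
libname posts json "c:\projects\votedata.json";
proc datasets lib=posts; quit;   /* list the data sets the engine creates */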

With the data in SAS, I used PROC FREQ to show the current tally:

title "Vote tally for NERC's Name Our Ship campaign";
proc freq data=votes order=freq;
table title;
weight likes;
run;

[PROC FREQ output: vote tally by submission title]
The numbers are compelling: good ol' Boaty Mac has over 42% of the nearly 200,000 votes. The arguably more-respectable "Henry Worsley" entry is tracking at just 4%. I'm not an expert on polling and sample sizes, but even I can tell that Boaty McBoatface is going to be tough to beat.

To drive the point home a bit more, let's look at a box plot of the votes distribution.

title "Distribution of votes for ALL submissions";
proc sgplot data=votes;
hbox likes;
xaxis valueattrs=(size=12pt);
run;

In this output, we have a clear outlier:
[Box plot: distribution of votes for all submissions]
If we exclude Boaty, then it shows a slightly closer race among the other runners up (which include some good serious entries, plus some whimsical entries, such as "Boatimus Prime"):

title "Distribution of votes for ALL submissions except Boaty McBoatface";
proc sgplot data=votes(where=(title^="Boaty McBoatface"));
hbox likes;
xaxis valueattrs=(size=12pt);
run;

[Box plot: distribution of votes for all submissions except Boaty McBoatface]
See the difference in the automatic axis values between the two graphs? The tick marks show 80,000 vs. 8,000 as the top values.

Digging further, I wondered whether there were some recurring themes in the entries. I decided to calculate word frequencies using a technique I found on our SAS Support Communities (thanks to Cynthia Zender for sharing):

/* Tally the words across all submissions */
data wdcount(keep=word);
    set votes;
    i = 1;
    origword = scan(title,i);
    word = compress(lowcase(origword),'?');
    wordord = i;
    do until (origword = ' ');
        /* exclude the most common words */
        if word not in ('a','the','of','and') then output;
        i + 1;
        wordord = i;
        origword = scan(title,i);
        word = compress(lowcase(origword),'?');
    end;
run;
 
proc sql;
   create table work.wordcounts as 
   select t1.word, 
          /* count_of_word */
            (count(t1.word)) as word_count
      from work.wdcount t1
      group by t1.word
      order by word_count desc;
quit;
title "Frequently occurring words in boat name submissions";
proc print data=wordcounts(obs=25);
run;

The top words evoke the northern, cold nature of the boat's mission. Here are the top 25 words and their counts:

  1    polar         352 
  2    ice           193 
  3    explorer      110 
  4    arctic         86 
  5    red            69 
  6    sir            55 
  7    john           54 
  8    lady           46 
  9    sea            42 
 10    ocean          42 
 11    scott          41 
 12    bear           39 
 13    aurora         38 
 14    artic          37 
 15    queen          37 
 16    captain        36 
 17    james          36 
 18    endeavour      35 
 19    william        35 
 20    star           34 
 21    spirit         34 
 22    new            26 
 23    antarctic      26 
 24    boat           25 
 25    cold           25 

I don't know when voting closes, so maybe whimsy will yet be outvoted by a more serious entry. Or maybe NERC will exercise their right to "take this under advisement" and set a certain standard for the finalist names. Whatever the outcome, I'm sure we haven't heard the last of Boaty...

tags: regular expressions, SAS programming, SGPLOT

The post And it's Boaty McBoatface by an order of magnitude appeared first on The SAS Dummy.

August 22, 2012
 

Regular expressions provide a powerful method to find patterns in a string of text. However, the syntax for regular expressions is somewhat cryptic and difficult to devise. This is why, by my reckoning, approximately 97% of the regular expressions used in code today were copied and pasted from somewhere else. (Who knows who the original authors were? Some experts believe they were copied from ancient cave paintings in New Mexico.)

You can use regular expressions in your SAS programs, via the PRX* family of functions. These include PRXPARSE and PRXMATCH, among others. The classic example for regular expressions is to validate and standardize data values that might have been entered in different ways, such as a phone number or a zip code (with or without the plus-4 notation).

In this post I'll present another, less-trodden example. It's a regular expression that validates the syntax of a SAS variable name. (Now, I'm talking about the regular old traditional SAS variable names, and not those abominations that you can use with OPTIONS VALIDVARNAME=ANY.)

SAS variable names, as you know, can be 1 to 32 characters long, begin with a letter or underscore, and then contain letters, numbers, or underscores in any combination after that. If all you need is a way to validate such names, stop reading here and go learn about the NVALID function, which does exactly this. But if you want to learn a little bit about regular expressions, read on.
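
If the yes/no answer is all you need, NVALID really is a one-liner. A minimal sketch (the data set and variable names here are mine, not from this post):

/* minimal sketch of the NVALID route; CANDIDATE is a made-up variable */
data quickcheck;
  candidate = "1var";
  valid = ifc(nvalid(strip(candidate), 'v7'), "YES", "NO");
  put candidate= valid=;
run;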

The following program shows the regular expression used in context:

data vars (keep=varname valid);
  length varname $ 50;
  input varname 1-50 ;
  re = prxparse('/^(?=.{1,32}$)([a-zA-Z_][a-zA-Z0-9_]*)$/');
  pos = prxmatch(re, trim(varname));
  valid = ifc(pos>0, "YES","NO");
cards;
var1
no space
1var
_temp
thisVarIsOver32charactersInLength
thisContainsFunkyChar!
_
yes_underscore_is_valid_name
run;

And the results: [table listing each candidate name with its YES/NO valid flag]

Here's a closer look at the regular expression in "diagram" form, lightly annotated. (This reminds me a little of how we used to diagram sentences in 4th grade. Does anybody do that anymore?)


Among the more interesting aspects of this expression is the lookahead portion, which checks that the value is between 1 and 32 characters right out of the gate. If that test fails, the expression fails to match immediately. Of course, you could use the LENGTHN function to check that, but it's nice to include all of the tests in a single expression. I mean, if you're going to write a cryptic expression, you might as well go all the way.
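
For comparison, the two-step alternative mentioned above -- LENGTHN for the length check, a plainer regex for the characters -- might look like this sketch. These statements would sit inside the same DATA step in place of the single pattern; RE2, OKAY, and VALID2 are my names:

  /* sketch: split the length check (LENGTHN) from the character check */
  re2 = prxparse('/^[a-zA-Z_][a-zA-Z0-9_]*$/');
  okay = (1 <= lengthn(trim(varname)) <= 32) and prxmatch(re2, trim(varname));
  valid2 = ifc(okay, "YES", "NO");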

Unless you live in an alternate reality where normal text looks like cartoon-style swear words, there really isn't much that's "regular" about regular expressions. But they are extremely powerful as a data validation and cleansing tool, and they exist in just about every programming language, so it's a transferable skill. If you take the time to learn how to use them effectively, it's sure to pay off.

tags: prxmatch, regular expressions, SAS programming
February 25, 2011
 
A more or less anonymous reader commented on our last post, where we were reading data from a file with a varying number of fields. The format of the file was:


1 Las Vegas, NV --- 53.3 --- --- 1
2 Sacramento, CA --- 42.3 --- --- 2

The complication in the number of fields related to spaces in the city field (which could vary from one to three words).

The reader's elegant solution took full advantage of R's regular expressions: a powerful and concise language for processing text.

file <- readLines("http://www.math.smith.edu/r/data/ella.txt")
file <- gsub("^([0-9]* )(.*),( .*)$", "\\1'\\2'\\3", file)
tc <- textConnection(file)
processed <- read.table(tc, sep=" ", na.string="---")
close(tc)

The main work is done by the gsub() function, which processes each line of the input file and puts the city values in quotes (so that each city is seen as a single field when read.table() is run).

While not straightforward to parse, the regular expression pattern can be broken into parts. The string ^([0-9]* ) matches any digits (characters 0-9) at the beginning of the line (indicated by the "^"), followed by a space; the "*" means that zero or more such 0-9 characters may appear. The string (.*), matches any number of characters followed by a comma, while the last pattern matches any characters after the next space to the end of the line. The second argument to gsub() (the text between the quotes, after the comma) supplies the replacement. To reproduce the text captured between the parens, we use the "\\n" syntax, where n is the number of the capture group; because the comma in the second clause "(.*)," sits outside the parens, it is not reproduced.

It may be slightly easier to understand the code if we note that the third clause is unnecessary and split the remaining clauses into two separate gsub() commands, as follows.

file <- readLines("http://www.math.smith.edu/r/data/ella.txt")
file <- gsub("^([0-9]* )", "\\1'", file)
file <- gsub("(.*),", "\\1'", file)
tc <- textConnection(file)
processed <- read.table(tc, sep=" ", na.string="---")
close(tc)

The first two elements of the file vector become:

"1 'Las Vegas' NV --- 53.3 --- --- 1"
"2 'Sacramento' CA --- 42.3 --- --- 2"

The use of the na.string option to read.table() is a more appropriate approach to recoding the missing values than we used previously. Overall, we're impressed with the commenter's use of regular expressions in this example, and are thinking more about Nolan and Temple Lang's focus on them as part of a modern statistical computing curriculum.