April 19, 2018

There were 97 e-posters in The Quad demo room at SAS Global Forum this year. And the one that caught my eye was Ted Conway's "Periodic Table of Introductory SAS ODS Graphics Examples." Here's a picture of Ted fielding some questions from an interested user... He created a nice/fun graphic, [...]

The post A periodic table to help you with your SAS ODS graphics! appeared first on SAS Learning Post.

April 6, 2017

For those of you who don't have SAS/Graph's Proc GMap, I recently showed how to 'fake' a variety of maps using Proc SGplot polygons. So far I've written blogs on creating: pretty maps, gradient shaded choropleth maps, and maps with markers at zip codes. And now (by special request from [...]

The post Drawing paths on a map using SGplot appeared first on SAS Learning Post.

March 30, 2017

Users frequently ask how to plot their data as markers on a map. There are several ways to do this using SAS software. If you're a Visual Analytics user, you can do it using a point-and-click interface. But if you're a coder, you might need a little help... In this [...]

The post Plotting markers on a map at zip code locations, using GMap or SGplot appeared first on SAS Learning Post.

March 23, 2017

If you're a fan of SAS' ODS Graphics, you probably know that it does pretty much everything except geographical maps. But it's flexible enough that you can "fake it 'till you make it"! This example describes how to fake a geographical (choropleth) heat map using Proc SGplot polygons. In my [...]

The post Bringing the heat! - Creating heat maps with proc sgplot ... appeared first on SAS Learning Post.

March 21, 2017

If you give an artist some tools, they can create a pretty picture. Sure, they probably have a preferred tool - but they can probably do a pretty decent job no matter what you give them (paint, colored pencils, watercolor, charcoal, etc). And creating pretty graphs in SAS is no [...]

The post How to create a 'pretty' map with Proc SGplot appeared first on SAS Learning Post.

February 10, 2017

During my morning commute I heard an interesting news story about the merits and risks of the $100 bill. Apparently there are a lot of them in circulation, but no one knows exactly where they are. They are seldom used for legitimate business transactions because when a transaction reaches into the hundreds of dollars, the business parties tend to resort to a digital transaction (or maybe a good-old-fashioned personal check).

One stat from the story stuck with me: 80 percent of the currency value that is in circulation right now is in $100 bills. When I hear a number like that my first instinct is to find the data and verify it. It was easy to find on the US Federal Reserve: Currency in Circulation web site.

I could have copied and pasted the table for use in SAS, but the Fed already makes an ASCII version of the table available for nerds like me. With a bit of SAS code I was able to download the data, transpose it for analysis, calculate the currency values from the counts of the various notes, and create a graph that verifies what I heard on the news. Indeed, much of our nation's cash wealth is floating around on a bunch of Benjamins.

The percentage of wealth represented in the $100 bill has grown over the past 20 years, perhaps at the expense of the far-more-ATM-friendly $20 bill. In 1996, the ratio of hundreds-to-twenties was 60/20. Today it's 79/12.

If we look at the percent breakdown by bill counts (instead of value), you can see the shifts a bit differently.

I created those two plots by calculating the percentages with PROC FREQ, and then using PROC SGPLOT to graph Year, using the percentages as the response value and stacking the data by Denomination. I used GROUPORDER=Data to keep the data colors and legend consistent across the different graphs. The raw values are interesting to examine as well, since they show the trends for the past 20 years more clearly. Here are the two representations (cash value and bill counts) with their actual values and not just the percentages.

Raw values of currency in circulation

Raw counts of currency in circulation
An interesting aside: the $2 bill remains a novelty. I had the opportunity to spend a few of these recently, and the person receiving them had to be convinced that they were real money. It made me want to secure more $2 bills (a.k.a. "some Jeffersons") for future transactions -- that's the sort of trouble-making consumer that I like to be.

In case you're interested, here's the SAS program that I used -- try it yourself! (If using SAS University Edition, you'll have to download the text data first instead of relying on the FILENAME URL statement. Notes in the program.)
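As a rough sketch of the PROC FREQ and PROC SGPLOT steps described above (not the linked program itself), the following assumes a hypothetical data set `work.currency` with one row per year and denomination and a numeric `value` column. The data set name, column names, and numbers are all made up for illustration.

```sas
/* Toy data: made-up values for illustration only, not the Fed's figures */
data work.currency;
  input year denomination $ value;
  datalines;
2001 $100 50
2001 $20 30
2001 Other 20
2002 $100 60
2002 $20 25
2002 Other 15
;
run;

/* WEIGHT makes each row count as VALUE observations;            */
/* OUTPCT adds PCT_ROW, the percent within each year (table row) */
proc freq data=work.currency noprint;
  tables year*denomination / out=work.pct outpct;
  weight value;
run;

/* Stack the denominations within each year;                  */
/* GROUPORDER=DATA keeps the group order as found in the data */
proc sgplot data=work.pct;
  vbar year / response=pct_row group=denomination
              groupdisplay=stack grouporder=data;
  yaxis label="Percent of value in circulation";
run;
```

Swapping the `value` column for a bill-count column in the WEIGHT statement would give the breakdown by number of notes instead of by dollar value.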

tags: SGPLOT

The post Visualized: US Currency in circulation, past and present appeared first on The SAS Dummy.

November 2, 2016

SAS Community member @tc (a.k.a. Ted Conway) has found a new toy: ODS Graphics. Using PROC SGPLOT and GTL (Graph Template Language), along with some creative data prep steps, Ted has created several fun examples that show off what you can do with a bit of creativity, some math knowledge, and open data.

And bonus -- since most of his examples work with SAS University Edition, it's easy for you to try them yourself. Here are some of my favorites.

Learn to draw a Jack-O-Lantern

Using the GIF output device and free data, Ted shows how to use GTL (PROC TEMPLATE and PROC SGRENDER) to animate this Halloween icon.

learn to draw a Jack-O-Lantern

The United Polygons of America

Usually map charts with SAS require specialized procedures and map data, but here's a technique that can plot a stylized version of the USA and convey some interesting data. (You might have seen this one featured in a SAS Tech Report newsletter. Do you subscribe?)

United Polygons of America

A look at Katie Ledecky's dominance

Using a vector plot, Ted shows how this championship swimmer dominated her event during the summer games in Rio. This example contains a lot of text information too; and that's a cool trick in PROC SGPLOT with the AXISTABLE statement. Click on the image for a closer look.

Katie Ledecky dominates

Demonstrating the Bublé Sort

This example is nerdy on so many levels. It's a take on the Computer Science 101 concept of "bubble sort," an algorithm for placing a collection of items in a desired order. In this case, the items consist of Christmas songs recorded by Michael Bublé, that dreamy crooner from Canada.

See the songs sort things out
Ted posts these examples (and more) in the SAS/GRAPH and ODS Graphics section of SAS Support Communities. That's a great place to learn SAS graphing techniques, from simple to advanced, and to see what other practitioners are doing. Experts like Ted hang out there, and the SAS visualization developers often post answers to the tricky questions.

More from @tc

In addition to his community posts, Ted is an award-winning contributor to SAS Global Forum with some very popular presentations. Here are a few of his papers.

tags: ODS Graphics, SAS Communities, SGPLOT

The post Binge on this series: Fun with ODS Graphics appeared first on The SAS Dummy.

April 7, 2016

I know what you're thinking: two "Boaty McBoatface" articles within two weeks? And we're past April Fool's Day?

But since I posted my original analysis about the "Name our ship" phenomenon that's happening in the UK right now, a new contender has appeared: Poppy-Mai.

The cause of Poppy-Mai, a critically ill infant who has captured the imagination of many British citizens (and indeed, of the world), has made a very large dent in the lead that Boaty McBoatface holds.

Yes, "Boaty" still has a better-than-4:1 lead. But that's a lot closer than the 10:1 lead (over "Henry Worsley") from just over a week ago. Check out the box plot now: you can actually make out a few more dots. Voting is open for another 10 days -- and as we have seen, a lot can happen in that time.

As I take this second look at the submissions (now almost 6300) and voting data (almost 350,000 votes cast), I've found a few more entries that made me chuckle. Some of them struck me by their word play, and others cater to my nerdy sensibilities. Here they are (capitalization retained):

While I'm on this topic, I want to give a shout-out to regex101, the online regular expression tester. I was able to develop and test my regular expressions before dropping them into a PRXPARSE function call. I found that I had to adjust my regular expression to cast a wider net for valid titles from the names submissions data. Previously, I wasn't capturing all of the punctuation. While that's probably because I didn't expect punctuation to be part of a ship's name, that assumption doesn't stop people from suggesting and voting on such names. My new regex match:

  title_regex = prxparse("/'title':\s?""([a-zA-Z0-9'.\-_#\s$%&()@!]+)/");

I could probably optimize by specifying an exception pattern instead of an inclusion pattern...but this isn't the sort of project where I worry about that.
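For illustration, here is a minimal sketch of what an exception pattern could look like: rather than listing every allowed character, it captures everything up to the next double quote. The sample record in `line` is hypothetical, and the real page source may be shaped differently.

```sas
data _null_;
  length title $ 200;
  /* Hypothetical sample record; the actual JSON may differ */
  line = "'title': ""Boaty McBoatface!"", 'likes': 12";
  /* [^""]+ matches one or more characters that are not a double quote */
  /* (a doubled quote inside a SAS string literal is one quote)        */
  title_regex = prxparse("/'title':\s?""([^""]+)/");
  if prxmatch(title_regex, line) then
    do;
      /* function form of PRXPOSN returns capture buffer 1 as text */
      title = prxposn(title_regex, 1, line);
      put title=;
    end;
run;
```

The trade-off is that a negated class accepts anything, including characters that would never belong in a ship's name, so it suits cases where the closing delimiter is reliable.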

Will I write about Boaty McBoatface again? What will my next Boaty article reveal? Stay tuned!

tags: Boaty McBoatface, regular expressions, SAS programming, SGPLOT

The post Boaty McBoatface is on the run appeared first on The SAS Dummy.

March 26, 2016

In a voting contest, is it possible for a huge population to get behind a ridiculous candidate with such force that no other contestant can possibly catch up? The answer is: Yes.

Just ask the folks at NERC, the environmental research organization in the UK. They are commissioning a new vessel for polar research, and they decided to crowdsource the naming process. Anyone in the world is welcome to visit their NameOurShip web site and suggest a name or vote on an existing name submission.

As of today, the leading name is "RRS Boaty McBoatface." ("RRS" is the standard prefix for a Royal Research Ship.) This wonderfully creative name is winning the race by more than just a little bit: it has 10 times as many votes as the next highest vote getter, "RRS Henry Worsley".

I wondered whether the raw data for this poll might be available, and I was pleased to find it embedded in the web page that shows the current entries. The raw data is in JSON format, embedded in the source of the HTML page. I saved the web page source to my local machine, copied out just the JSON line with the submissions data, then used SAS to parse the results. Here's my code:

filename records "c:\projects\votedata.txt";

data votes (keep=title likes);
 /* the title length is an assumed maximum; adjust as needed */
 length title $ 200 likes 8;
 format likes comma20.;
 label likes="Votes";
 length len 8;
 infile records;
 /* compile the regular expressions once and retain their IDs */
  if _n_ = 1 then
    do;
      retain likes_regex title_regex;
      likes_regex = prxparse("/'likes':\s?([0-9]*)/");
      title_regex = prxparse("/'title':\s?""([a-zA-Z0-9'\s]+)/");
    end;
 /* read the next record into the _infile_ buffer */
 input;

 position = prxmatch(likes_regex,_infile_);
  if (position ^= 0) then
    do;
      call prxposn(likes_regex, 1, start, len);
      likes = substr(_infile_,start,len);
    end;
 start=0; len=0;

 position = prxmatch(title_regex,_infile_);
  if (position ^= 0) then
    do;
      call prxposn(title_regex, 1, start, len);
      title = substr(_infile_,start,len);
    end;
run;

With the data in SAS, I used PROC FREQ to show the current tally:

title "Vote tally for NERC's Name Our Ship campaign";
proc freq data=votes order=freq;
table title;
/* count each submission once per vote it received */
weight likes;
run;

The numbers are compelling: good ol' Boaty Mac has over 42% of the nearly 200,000 votes. The arguably more-respectable "Henry Worsley" entry is tracking at just 4%. I'm not an expert on polling and sample sizes, but even I can tell that Boaty McBoatface is going to be tough to beat.

To drive the point home a bit more, let's look at a box plot of the votes distribution.

title "Distribution of votes for ALL submissions";
proc sgplot data=votes;
hbox likes;
xaxis valueattrs=(size=12pt);
run;

In this output, we have a clear outlier:
If we exclude Boaty, then it shows a slightly closer race among the other runners-up (which include some good serious entries, plus some whimsical entries, such as "Boatimus Prime"):

title "Distribution of votes for ALL submissions except Boaty McBoatface";
proc sgplot data=votes(where=(title^="Boaty McBoatface"));
hbox likes;
xaxis valueattrs=(size=12pt);
run;

See the difference in the automatic axis values between the two graphs? The tick marks show 80,000 vs. 8,000 as the top values.

Digging further, I wondered whether there were some recurring themes in the entries. I decided to calculate word frequencies using a technique I found on our SAS Support Communities (thanks to Cynthia Zender for sharing):

/* Tally the words across all submissions */
data wdcount(keep=word);
    set votes;
    i = 1;
    origword = scan(title,i);
    word = compress(lowcase(origword),'?');
    wordord = i;
    do until (origword = ' ');
        /* exclude the most common words */
        if word not in ('a','the','of','and') then output;
        i + 1;
        wordord = i;
        origword = scan(title,i);
        word = compress(lowcase(origword),'?');
    end;
run;

proc sql;
   create table work.wordcounts as 
   select t1.word, 
          /* count_of_word */
            (count(t1.word)) as word_count
      from work.wdcount t1
      group by t1.word
      order by word_count desc;
quit;

title "Frequently occurring words in boat name submissions";
proc print data=wordcounts(obs=25);
run;

The top words evoke the northern, cold nature of the boat's mission. Here are the top 25 words and their counts:

  1    polar         352 
  2    ice           193 
  3    explorer      110 
  4    arctic         86 
  5    red            69 
  6    sir            55 
  7    john           54 
  8    lady           46 
  9    sea            42 
 10    ocean          42 
 11    scott          41 
 12    bear           39 
 13    aurora         38 
 14    artic          37 
 15    queen          37 
 16    captain        36 
 17    james          36 
 18    endeavour      35 
 19    william        35 
 20    star           34 
 21    spirit         34 
 22    new            26 
 23    antarctic      26 
 24    boat           25 
 25    cold           25 

I don't know when voting closes, so maybe whimsy will yet be outvoted by a more serious entry. Or maybe NERC will exercise their right to "take this under advisement" and set a certain standard for the finalist names. Whatever the outcome, I'm sure we haven't heard the last of Boaty...

tags: regular expressions, SAS programming, SGPLOT

The post And it's Boaty McBoatface by an order of magnitude appeared first on The SAS Dummy.

May 1, 2014

Here at SAS, we've come a long way with how we deal with blog spam on

Last year at this time, I was sifting through dozens of spam messages per day in order to salvage the one or two genuine comments that originate from real readers. I was just a human trying to keep up with machine-generated spam, a time-consuming -- and somewhat frustrating -- activity.

I created SAS-based reports that showed the impact not just on me, but on the hundreds of other SAS blog contributors. Thankfully in May, our internal IT support configured an industry-standard spam-filtering mechanism called Akismet. The spam flow ceased immediately, as if somebody turned a spigot.

Recently, the spigot has been turned on slightly, allowing a few spam messages to leak through. The spammers are like an ever-evolving virus, constantly adapting their techniques to punch through our defenses. I wondered: with the leaks I'm seeing, how effective is Akismet for us today?

It turns out that Akismet is very effective. Despite the few messages that I see leak through daily, Akismet is catching over 95% of the spam messages that target our blogs. Over the last 16 days I saw 40 spam notifications hit my inbox. Without Akismet, I would have seen 741 messages in the same period. Yikes.

I can see some of this within my WordPress blog administration screen, since the Akismet plugin embeds a sort of dashboard there. This is useful for me, but we have over 30 blogs hosted at As far as I know, we don't have a view into how Akismet is performing for our entire set of blog properties.

Enter SAS. By using SAS to connect to the WordPress database (something I already do for other reports), I can scrape and aggregate the Akismet metrics for a more global report on spam activity. Here's an example of a chart that I created:


In this chart, the height of the blue bar indicates the total number of incoming spam on each day, across all blogs. The smaller green bar indicates the number of "author-identified spam" messages -- those that the filter did not catch and that the blog moderator had to mark as spam. The number at the top of each bar indicates a "percent missed" for the day. And finally, the text at the bottom of the chart provides a summary: total spam caught, total missed, and a percentage. Since we receive so much spam, the IT folks configured Akismet to purge our spam data every 3 weeks or so. That's why the chart shows only a 16-day study period. (See an example of the full report here.)
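The per-day "percent missed" value and the summary totals described above could be derived along these lines. This is only a sketch: the counts are made up, and it assumes the miss rate is the number of uncaught spam messages divided by all spam received that day, which the post does not spell out. The data set, variable, and macro variable names are borrowed from the SGPLOT code in this post.

```sas
/* Made-up daily counts, for illustration only */
data reckoning;
  input comment_date :datetime18. spam_caught spam_allowed;
  /* assumed definition: share of the day's spam that the filter missed */
  fail_rate = spam_allowed / (spam_caught + spam_allowed);
  datalines;
01MAY2014:00:00:00 120 4
02MAY2014:00:00:00 95 6
;
run;

/* Summary values for the axis label, stored in macro variables */
proc sql noprint;
  select sum(spam_caught) format=comma12.,
         sum(spam_allowed) format=comma12.,
         sum(spam_allowed) / (sum(spam_caught) + sum(spam_allowed))
           format=percentn8.1
    into :total_caught trimmed,
         :total_allowed trimmed,
         :overallfail trimmed
    from reckoning;
quit;

%put &=total_caught &=total_allowed &=overallfail;
```

Storing the totals in macro variables lets the chart label pick them up at run time, so the summary text always matches the data being plotted.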

Interested in the SGPLOT code behind this chart? It looks something like this (data prep steps omitted, of course):

proc sgplot data=reckoning;
  format comment_date dtdate5. fail_rate percentn6.2;
  label spam_caught='Spam caught' spam_allowed='NOT caught';
  /* wide bar: spam caught by the filter, labeled with the miss rate */
  vbar comment_date / response=spam_caught dataskin=pressed
  datalabel=fail_rate datalabelfitpolicy=none 
    datalabelattrs=(color=black size=10pt);
  /* narrower overlaid bar: spam that slipped past the filter */
  vbar comment_date / 
     response=spam_allowed barwidth=.6 dataskin=pressed;
  yaxis label="Akismet ruling" grid;
  xaxis valueattrs=(size=9) 
    label="Spam comments (&total_caught. caught, &total_allowed. allowed - &overallfail. missed overall)";
run;

Assuming that it takes a blog moderator just a few seconds to evaluate and dispose of each spam message, Akismet has saved us a collective several hours over the weeks we see here. However, as we can see from the metrics accumulated over time, this really adds up:

I don't think that Akismet costs us very much, and -- in my view -- the time that we save is definitely worth the expense. Spammers will always be with us and are just part of the cost of publishing content. But dealing with it manually? Ain't nobody got time for that.

tags: Akismet, SGPLOT, spam, wordpress