3月 262016
 

In a voting contest, is it possible for a huge population to get behind a ridiculous candidate with such force that no other contestant can possibly catch up? The answer is: Yes.

Just ask the folks at NERC, the environmental research organization in the UK. They are commissioning a new vessel for polar research, and they decided to crowdsource the naming process. Anyone in the world is welcome to visit their NameOurShip web site and suggest a name or vote on an existing name submission.

As of today, the leading name is "RRS Boaty McBoatface." ("RRS" is standard prefix for a Royal Research Ship.) This wonderfully creative name is winning the race by more than just a little bit: it has 10 times the number of votes as the next highest vote getter, "RRS Henry Worsley".

I wondered whether the raw data for this poll might be available, and I was pleased to find it embedded in the web page that shows the current entries. The raw data is in JSON format, embedded in the source of the HTML page. I saved the web page source to my local machine, copied out just the JSON line with the submissions data, then used SAS to parse the results. Here's my code:

filename records "c:projectsvotedata.txt";

data votes (keep=title likes);
 length likes 8;
 format likes comma20.;
 label likes="Votes";
 length len 8;
 infile records;
  if _n_ = 1 then
    do;
      retain likes_regex title_regex;
      likes_regex = prxparse("/'likes':s?([0-9]*)/");
      title_regex = prxparse("/'title':s?""([a-zA-Z0-9's]+)/");
    end;
 input;

 position = prxmatch(likes_regex,_infile_);
  if (position ^= 0) then
    do;
      call prxposn(likes_regex, 1, start, len);
      likes = substr(_infile_,start,len);
    end;
 start=0; len=0;

 position = prxmatch(title_regex,_infile_);
  if (position ^= 0) then
    do;
      call prxposn(title_regex, 1, start, len);
      title = substr(_infile_,start,len);
    end;
run;

With the data in SAS, I used PROC FREQ to show the current tally:

title "Vote tally for NERC's Name Our Ship campaign";
proc freq data=votes order=freq;
table title;
weight likes;
run;

boatfacefreq
The numbers are compelling: good ol' Boaty Mac has over 42% of the nearly 200,000 votes. The arguably more-respectable "Henry Worsley" entry is tracking at just 4%. I'm not an expert on polling and sample sizes, but even I can tell that Boaty McBoatface is going to be tough to beat.

To drive the point home a bit more, let's look at a box plot of the votes distribution.

title "Distribution of votes for ALL submissions";
proc sgplot data=votes;
hbox likes;
xaxis valueattrs=(size=12pt);
run;

In this output, we have a clear outlier:
hboxall
If we exclude Boaty, then it shows a slightly closer race among the other runners up (which include some good serious entries, plus some whimsical entries, such as "Boatimus Prime"):

title "Distribution of votes for ALL submissions except Boaty McBoatface";
proc sgplot data=votes(where=(title^="Boaty McBoatface"));
hbox likes;
xaxis valueattrs=(size=12pt);
run;

hboxtrim
See the difference between the automatic axis values between the two graphs? The tick marks show 80,000 vs. 8,000 as the top values.

Digging further, I wondered whether there were some recurring themes in the entries. I decided to calculate word frequencies using a technique I found on our SAS Support Communities (thanks to Cynthia Zender for sharing):

/* Tally the words across all submissions */
data wdcount(keep=word);
    set votes;
    i = 1;
    origword = scan(title,i);
    word = compress(lowcase(origword),'?');
    wordord = i;
    do until (origword = ' ');
        /* exclude the most common words */
        if word not in ('a','the','of','and') then output;
        i + 1;
        wordord = i;
        origword = scan(title,i);
        word = compress(lowcase(origword),'?');
    end;
run;
 
proc sql;
   create table work.wordcounts as 
   select t1.word, 
          /* count_of_word */
            (count(t1.word)) as word_count
      from work.wdcount t1
      group by t1.word
      order by word_count desc;
quit;
title "Frequently occurring words in boat name submissions";
proc print data=wordcounts(obs=25);
run;

The top words evoke the northern, cold nature of the boat's mission. Here are the top 25 words and their counts:

  1    polar         352 
  2    ice           193 
  3    explorer      110 
  4    arctic         86 
  5    red            69 
  6    sir            55 
  7    john           54 
  8    lady           46 
  9    sea            42 
 10    ocean          42 
 11    scott          41 
 12    bear           39 
 13    aurora         38 
 14    artic          37 
 15    queen          37 
 16    captain        36 
 17    james          36 
 18    endeavour      35 
 19    william        35 
 20    star           34 
 21    spirit         34 
 22    new            26 
 23    antarctic      26 
 24    boat           25 
 25    cold           25 

I don't know when voting closes, so maybe whimsy will yet be outvoted by a more serious entry. Or maybe NERC will exercise their right to "take this under advisement" and set a certain standard for the finalist names. Whatever the outcome, I'm sure we haven't heard the last of Boaty...

tags: regular expressions, SAS programming, SGPLOT

The post And it's Boaty McBoatface by an order of magnitude appeared first on The SAS Dummy.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)