Chris Hemedinger

4月 202019
 

Do you have a favorite television show? Or a favorite movie franchise that you follow? If you call yourself a "fan," just how much of a fan are you? Are you merely a spectator, or do you take your fanaticism to the next level by creating something new?

When it comes to fandom for franchises like Game of Thrones, the Marvel movies, or Stranger Things, there's a new kind of nerd in town. And this nerd brings data science skills. You've heard of the "second screen" experience for watching television, right? That's where fans watch a show (or sporting event or awards ceremony), but also keep up with Twitter or Facebook so they can commune with other fans of the show on social media. These fan-data-scientists bring a third screen: their favorite data workbench IDE.

I was recently lured into into a rabbit hole of Game of Thrones data by a tweet. The Twitter user was reacting to a data visualization of character screen time during the show. The visualization was built in a different tool, but the person was wondering whether it could be done in SAS. I knew the answer was Yes...as long as we could get the data. That turned out to be the easiest part.

WARNING: While this blog post does not reveal any plot points from the show, the data does contain spoilers! No spoilers in what I'm showing here, but if you run my code examples there might be data points that you cannot "unsee." I was personally conflicted about this, since I'm a fan of the show but I'm not yet keeping up with the latest episodes. I had to avert my eyes for the most recent data.

Data is Coming

A GitHub user named Jeffrey Lancaster has shared a repository for all aspects of data around Game of Thrones. He also has similar repos for Stranger Things and Marvel universe. Inside that repo there's a JSON file with episode-level data for all episodes and seasons of the show. With a few lines of code, I was able to read the data directly from the repo into SAS:

filename eps temp;
 
/* Big thanks to this GoT data nerd for assembling this data */
proc http
 url="https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/episodes.json"
 out=eps
 method="GET";
run;
 
/* slurp this in with the JSON engine */
libname episode JSON fileref=eps;

Note that I've shared all of my code for my steps in my own GitHub repo (just trying to pay it forward). Everything should work in Base SAS, including in SAS University Edition.

The JSON library reads the data into a series of related tables that show all of the important things that can happen to characters within a scene. Game of Thrones fans know that death, sex, and marriage (in that order) make up the inflection points in the show.

Building the character-scene data

With a little bit of data prep using SQL, I was able to show the details of the on-screen time per character, per scene. These are the basis of the visualization I was trying to create.

/* Build details of scenes and characters who appear in them */
PROC SQL;
   CREATE TABLE WORK.character_scenes AS 
   SELECT t1.seasonNum, 
          t1.episodeNum,
          t2.ordinal_scenes as scene_id, 
          input(t2.sceneStart,time.) as time_start format=time., 
          input(t2.sceneEnd,time.) as time_end format=time., 
          (calculated time_end) - (calculated time_start) as duration format=time.,
          t3.name
      FROM EPISODE.EPISODES t1, 
           EPISODE.EPISODES_SCENES t2, 
           EPISODE.SCENES_CHARACTERS t3
      WHERE (t1.ordinal_episodes = t2.ordinal_episodes AND 
             t2.ordinal_scenes = t3.ordinal_scenes);
QUIT;

With a few more data prep steps (see my code on GitHub), I was able to summarize the screen time for scene locations:

You can see that The Crownlands dominate as a location. In the show that's a big region and a sort of headquarters for The Seven Kingdoms, and the show data actually includes "sub-locations" that can help us to break that down. Here's the makeup of that 18+ hours of time in The Crownlands:

Screen time for characters

My goal is to show how much screen time each of the major characters receives, and how that changes over time. I began by creating a series of charts using PROC SGPLOT. These were created using a single SGPLOT step using a BY group, segmented by show episode. They appear in a grid because I used ODS LAYOUT GRIDDED to arrange them.

Here's the code segment that creates these dozens of charts. Again, see my GitHub for the intermediate data prep work.

/* Create a gridded presentation of Episode graphs CUMULATIVE timings */
ods graphics / width=500 height=300 imagefmt=svg noborder;
ods layout gridded columns=3 advance=bygroup;
proc sgplot data=all_times noautolegend ;
  hbar name / response=cumulative 
    categoryorder=respdesc  
    colorresponse=total_screen_time dataskin=crisp
    datalabel=name datalabelpos=right datalabelattrs=(size=10pt)
    seglabel seglabelattrs=(weight=bold size=10pt color=white) ;
   ;
  by epLabel notsorted;
  format cumulative time.;
  label epLabel="Ep";
  where rank<=10;
  xaxis display=(nolabel)  grid ;
  yaxis display=none grid ;
run;
ods layout end;
ods html5 close;

Creating an animated timeline

The example shared on Twitter showed an animation of screen time, per character, over the complete series of episodes. So instead of a huge grid with many plots, need to produce a single file with layers for each episode. In SAS we can produce an animated GIF or animated SVG (scalable vector graphics) file. The SVG is a much smaller file format, but you need a browser or a special viewer to "play" it. Still, that's the path I followed:

/* Create a single animated SVG file for all episodes */
options printerpath=svg animate=start animduration=1 
  svgfadein=.25 svgfadeout=.25 svgfademode=overlap
  nodate nonumber; 
 
/* change this file path to something that works for you */
ODS PRINTER file="c:\temp\got_cumulative.svg" style=daisy;
 
/* For SAS University Edition
ODS PRINTER file="/folders/myfolders/got_cumulative.svg" style=daisy;
*/
 
proc sgplot data=all_times noautolegend ;
  hbar name / response=cumulative 
    categoryorder=respdesc 
    colorresponse=total_screen_time dataskin=crisp
    datalabel=name datalabelpos=right datalabelattrs=(size=10pt)
    seglabel seglabelattrs=(weight=bold size=10pt color=white) ;
   ;
  by epLabel notsorted;
  format cumulative time.;
  label epLabel="Ep";
  where rank<=10;
  xaxis label="Cumulative screen time (HH:MM:SS)" grid ;
  yaxis display=none grid ;
run;
options animation=stop;
ods printer close;

Here's the result (hosted on my GitHub repo -- but as a GIF for compatibility.)

I code and I know things

Like the Game of Thrones characters, my visualization is imperfect in many ways. As I was just reviewing it I discovered a few data prep missteps that I should correct. I used some features of PROC SGPLOT that I've learned only a little about, and so others might suggest improvements. And my next mission should be to bring this data in SAS Visual Analytics, where the real "data viz maesters" who work with me can work their magic. I'm just hoping that I can stay ahead of the spoilers.

The post Deeper enjoyment of your favorite shows -- through data appeared first on The SAS Dummy.

4月 112019
 

This blog post could be subtitled "To Catch a Thief" or maybe "Go ahead. Steal this blog. I dare you."* That's because I've used this technique several times to catch and report other web sites who lift the blog content from blogs.sas.com and present it as their own.

Syndicating blog content is an honorable practice, made possible by RSS feeds that virtually every blog platform supports. With syndicated feeds, your blog content can appear on someone else's site, but this always includes attribution and a link back to the original source. However, if you copy content from another website and publish it natively on your own web property, without attribution or citations...well, that's called plagiarism. And the Digital Millennium Copyright Act (DCMA) provides authors with recourse to have stolen content removed from infringing sites -- if you can establish that you're the rightful copyright owner.

Establishing ownership is a tedious task, especially when someone steals dozens or hundreds of articles. You must provide links to each example of infringing content, along with links to the original authorized content. Fortunately, as I've discussed before, I have ready access to the data about all 17,000+ blog posts that we've published at SAS (see How SAS uses SAS to Analyze SAS Blogs). In this article, I'll show you how I gathered that same information from the infringing websites so that I could file the DCMA "paperwork" almost automatically.

The complete programs from this article are available on GitHub.

Read a JSON feed using the JSON engine

In my experience, the people who steal our blog content don't splurge on fancy custom web sites. They tend to use free or low-cost web site platforms, and the most popular of these include WordPress (operated by Automattic) and Blogspot (operated by Google). Both of these platforms support API-like syndication using feeds.

Blogspot sites can generate article feeds in either XML or JSON. I prefer JSON when it's available, as I find that the JSON libname engine in SAS requires fewer "clues" in order to generate useful tables. (See documentation for the JSON engine.) While you can supply a JSON map file that tells SAS how to assemble your tables and data types, I find it just as easy to read the data as-is and post-process it to join the fields I need and convert data fields. (For an example that uses a JSON map, see Reading data with the SAS JSON libname engine.)

Since I don't want to draw attention to the specific infringing sites, I'll use an example of a popular (legitimate!) Blogspot site named "Maps Mania". If you're into data and maps (who isn't?) you might like their content. In this code I use PROC HTTP to fetch the RSS feed, using "alt=json" to request JSON format and "max-results=100" to retrieve a larger-than-default batch of published posts.

/* Read JSON feed into a local file. */
/* Use Blogspot parameters to get 100 posts at a time */
filename resp temp;
proc http
 url='https://googlemapsmania.blogspot.com/feeds/posts/default?alt=json&max-results=100'
 method="get"
 out=resp;
run;
 
libname rss json fileref=resp;

This JSON libname breaks the data into a series of tables that relate to each other via common keys.

RSS feed tables

With a little bit of exploration in SAS Enterprise Guide and the Query Builder, I was able to design a PROC SQL step to assemble just the fields and records I needed: post title and post URL.

/* Join the relevant feed entry items to make a single table */
/* with post titles and URLs */
proc sql;
   create table work.blogspot as 
   select t2._t as rss_title,
          t1.href as rss_href          
      from rss.entry_link t1
           inner join rss.entry_title t2 on (t1.ordinal_entry = t2.ordinal_entry)
      where t1.type = 'text/html' and t1.rel = 'alternate';
quit;
 
libname rss clear;

RSS output from blogspot

Read an XML feed using the XMLv2 engine

WordPress sites generate XML-based feeds by default. Site owners can install a WordPress plugin to generate JSON feeds as well, but most sites don't bother with that. Like the JSON feeds, the XML feed can contain many fields that relate to each other. I find that with XML, the best approach is to use the SAS XML Mapper application to explore the XML and "design" the final data tables that you need. You use SAS XML Mapper to create a map file, which you can then feed into the SAS XMLv2 engine to instruct SAS how to read the data. (See documentation for the XMLv2 engine.)

SAS XML Mapper is available as a free download from support.sas.com. Download it as a ZIP file (on Windows), and extract the ZIP file to a temporary folder. Then run setup.exe in the root of that folder to install the app on your system.

To design the map, I use an example of the XML feed from the blog that I want to examine. Once again, I'll choose a popular WordPress blog instead of the actual infringing sites. In this case, let's look at the Star Wars News site. I point my browser at the feed address is https://www.starwars.com/news/feed and save as an XML file. Then, I use SAS XML Mapper to Open XML (File menu), and examine the result.

I found everything that I needed in "item" subset of the feed. I dragged that group over to the right pane to include in the map. That creates a data set container named "item." Then dragged just the title, link, and pubDate fields into that data set to include in the final result.

The SAS XML Mapper generates a SAS program that you can include to define the map, and that's what I've done with the following code. It uses DATA step to create the map file just as I need it.

filename rssmap temp;
data _null_;
 infile datalines;
 file rssmap;
 input;
 put _infile_;
 datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
  <NAMESPACES count="0"/>
  <!-- ############################################################ -->
  <TABLE name="item">
    <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
    <COLUMN name="title">
      <PATH syntax="XPath">/rss/channel/item/title</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>250</LENGTH>
    </COLUMN>
    <COLUMN name="link">
      <PATH syntax="XPath">/rss/channel/item/link</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
    <COLUMN name="pubDate">
      <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>40</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
;
run;

Because WordPress feeds return just most recent 25 items by default, I need to use the "pageid=" directive to go deeper into the archive and return older items. I used a simple SAS macro loop to iterate through 5 pages (125 items) in this example. Note how I specified the XMLv2 libname with the XMLMAP= option to include my custom map. That ensures that SAS will read the XML and build the table as I've designed it.

My final DATA step in this part is to recast the pubDate field (a text field by default) into a proper SAS date.

/* WordPress feeds return data in pages, 25 entries at a time        */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
  filename feed temp;
  proc http
   method="get"
   url="https://www.starwars.com/news/feed?paged=&i."
   out=feed;
  run;
 
  libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
 
  data posts_&i.;
   set result.item;
  run;
  %end;
%mend;
 
%getItems;
 
/* Assemble all pages of entries                       */
/* Cast the date field into a proper SAS date          */
/* Have to strip out the default day name abbreviation */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> 10APR2019      */
data allPosts ;
 set posts_:;
 length sasPubdate 8;
 sasPubdate = input( substr(pubDate,4),anydtdtm.);
 format sasPubdate dtdate9.;
 drop pubDate;
run;

Reporting the results

After gathering the data I need from RSS feeds, I use SAS to match that with the WordPress data that I have about our blogs. I can then generate a table that I can easily submit in a DCMA form.

Usually, matching by "article title" is the easiest method. However, sometimes the infringing site will alter the titles a little bit or even make small adjustments to the body of the article. (This reminds me of my college days in computer science, when struggling students would resort to cheating by copying someone else's program, but change just the variable names. It's a weak effort.) With the data in SAS, I've used several other techniques to detect the "distance" of a potentially infringing post from the original work.

Maybe you want to see that code. But you can't expect me to reveal all of my secrets, can you?


* props to Robert "I dare you to knock it off" Conrad.

The post Read RSS feeds with SAS using XML or JSON appeared first on The SAS Dummy.

2月 082019
 

Since we added the new "Recommended by SAS" widget in the SAS Support Communities, I often find myself diverted to topics that I probably would not have found otherwise. This is how I landed on this question and solution from 2011 -- "How to convert 5ft. 4in. (Char) into inches (Num)". While the question was deftly answered by my friend (and SAS partner) Jan K. from the Netherlands, the topic inspired me to take it a step further here.

Jan began his response by throwing a little shade on the USA:

Short of moving to a country that has a decent metric system in place, I suggest using a regular expression.

On behalf of my nation I just want say that, for the record, we tried. But we did not get very far with our metrication, so we are stuck with the imperial system for most non-scientific endeavors.

Matching patterns with a regular expression

Regular expressions are a powerful method for finding specific patterns in text. The syntax of regular expressions can be a challenge for those getting started, but once you've solved a few pattern-recognition problems by using them, you'll never go back to your old methods.

Beginning with the solution offered by Jan, I extended this program to read in a "ft. in." measurement, convert to the component number values, express the total value in inches, and then convert the measurement to centimeters. I know that even with my changes, we can think of patterns that might not be matched. But allow me to describe the updates:

Here's my program, followed by the result:

data measure;
 length 
     original $ 25
     feet 8 inches 8 
     total_inches 8 total_cm 8;
 /* constant regex is parsed just once */
 re = prxparse('/(\d*)ft.(\s*)((\d?\.?\d?)in.)?/'); 
 input;
 original = _infile_;
 if prxmatch(re, original) then do;
  feet =   input ( prxposn(re, 1, original), best12.);
  inches = input ( prxposn(re, 4, original), best12.);
  if missing(inches) and not missing(feet) then inches=0;
 end;
 else 
   original = "[NO MATCH] " || original;
 total_inches = (feet*12) + inches;
 total_cm = total_inches * 2.54;
 drop re;
cards;
5ft. 4in.
4ft 0in.
6ft. 10in.
3ft.2in.
4ft.
6ft.     1.5in.
20ft. 11in.
25ft. 6.5in.
Soooooo big
;

Other tools to help with regular expressions

The Internet offers a plethora of online tools to help developers build and test regular expression syntax. Here's a screenshot from RegExr.com, which I used to test some aspects of my program.

Tools like these provide wonderful insight into the capture groups and regex directives that will influence the pattern matching. They are part tutorial, part workbench for regex crafters at all levels.

Many of these tools also include community contributions for matching common patterns. For example, many of us often need to match/parse data with e-mail addresses, phone numbers, and other tokens as data fields. Sites like RegExr.com include the syntax for many of these common patterns that you can copy and use in your SAS programs.

See Also

The post Convert a text-based measurement to a number in SAS appeared first on The SAS Dummy.

1月 182019
 

It seems that everyone knows about GitHub -- the service that hosts many popular open source code projects. The underpinnings of GitHub are based on Git, which is itself an open-source implementation of a source management system. Git was originally built to help developers collaborate on Linux (yet another famous open source project) -- but now we all use it for all types of projects.

There are other free and for-pay services that use Git, like Bitbucket and GitLab. And there are countless products that embed Git for its versioning and collaboration features. In 2014, SAS developers added built-in Git support for SAS Enterprise Guide.

Since then, Git (and GitHub) have grown to play an even larger role in data science operations and DevOps in general. Automation is a key component for production work -- including check-in, check-out, commit, and rollback. In response, SAS has added Git integration to more SAS products, including:

  • the Base SAS programming language, via a collection of SAS functions.
  • SAS Data Integration Studio, via a new source control plugin
  • SAS Studio (experimental in v3.8)

You can use this Git integration with any service that supports Git (GitHub, GitLab, etc.), or with your own private Git servers and even just local Git repositories.

SAS functions for Git

Git infrastructure and functions were added to SAS 9.4 Maintenance 6. The new SAS functions all have the helpful prefix of "GITFN_" (signifying "Git fun!", I assume). Here's a partial list:

GITFN_CLONE  Clones a Git repository (for example, from GitHub) into a directory on the SAS server.
GITFN_COMMIT  Commits staged files to the local repository
GITFN_DIFF Returns the number of diffs between two commits in the local repository and creates a diff record object for the local repository.
GITFN_PUSH  Pushes the committed files in the local repository to the remote repository.
GITFN_NEW_BRANCH  Creates a Git branch

 

The function names make sense if you're familiar with Git lingo. If you're new to Git, you'll need to learn the terms that go with the commands: clone, repo, commit, stage, blame, and more. This handbook provided by GitHub is friendly and easy to read. (Or you can start with this xkcd comic.)

You can

data _null_;
 version = gitfn_version();
 put version=;             
 
 rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
   "c:\Projects\sas-dummy-blog");
 put rc=;
run;

In one line, this function fetches an entire collection of code files from your source control system. Here's a more concrete example that fetches the code to a work space, then runs a program from that repository. (This is safe for you to try -- here's the code that will be pulled/run. It even works from SAS University Edition.)

options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-dummy-blog;
libname repo "&repoPath.";
libname repo clear;
 
/* Fetch latest code from GitHub */
data _null_;
 rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
   "&repoPath.");
 put rc=;
run;
 
/* run the code in this session */
%include "&repoPath./rng_example_thanos.sas";

You could use the other GITFN functions to stage and commit the output from your SAS jobs, including log files, data sets, ODS results -- whatever you need to keep and version.

Using Git in SAS Data Integration Studio

SAS Data Integration Studio has supported source control integration for many years, but only for CVS and Subversion (still in wide use, but they aren't media darlings like GitHub). By popular request, the latest version of SAS Data Integration Studio adds support for a Git plug-in.

Example of Git in SAS DI Studio

See the documentation for details:

Read more about setup and use in the available here as part of our "Custom Tasks Tuesday" series.

Using Git in SAS Enterprise Guide

This isn't new, but I'll include it for completeness. SAS Enterprise Guide supports built-in Git repository support for SAS programs that are stored in your project file. You can use this feature without having to set up any external Git servers or repositories. Also, SAS Enterprise Guide can recognize when you reference programs that are managed in an external Git repository. This integration enables features like program history, compare differences, commit, and more. Read more and see a demo of this in action here.

program history

If you use SAS Enterprise Guide to edit and run SAS programs that are managed in an external Git repository, here's an important tip. Change your project file properties to "Use paths relative to the project for programs and importable files." You'll find this checkbox in File->Project Properties.

With this enabled, you can store the project file (EGP) and any SAS programs together in Git, organized into subfolders if you want. As long as these are cloned into a similar structure on any system you use, the file paths will resolve automatically.

The post Using built-in Git operations in SAS appeared first on The SAS Dummy.

12月 212018
 

Good news -- the SAS program that you wrote and put into production 10 years ago still works. Hey, it's SAS, so you probably take that for granted. But are those techniques from 2008 still the best way to accomplish your task? SAS 9.4, first released in 2013 and now refreshed with its sixth maintenance release, continues to extend the SAS programming language. New features allow you to simplify your code, make it run faster, and erase some of that technical debt you've been carrying due to previous workarounds or limitations.

The reason that I'm writing this post now is to recognize the next chapter for my long-time colleague, Rick Langston. After 38 years at SAS (and more time as a SAS user before that) Rick is retiring from his role. Many of you know him as a major steward of the SAS programming language. And many of the tips that I've shared on this blog are made possible by Rick's work. Here's an interview that I hosted with Rick in 2013, just before SAS 9.4 was first released.


Five cool features of the SAS language: the details

In the above video, Rick talks about three SAS features that were first introduced in 2013. I've added a couple of more recent items to the list to round it out.

FILENAME ZIP access method

This brings the ability to read and write compressed ZIP files, and GZIP files, directly into the SAS language. Use this feature to replace the clunky (and not always feasible) calls into external tools such as gzip, WinZip, or 7-Zip. In SAS, a "native" FILENAME access method is more portable and robust than calling out to an external tool with FILENAME PIPE.

For more information, check out the many SAS blog posts with examples that I've shared over the years.

DOSUBL function

Have you ever wanted to run another SAS procedure from inside of a DATA step? Rick calls this "submitting SAS code on the side", as it allows you to run a SAS step or statement from within a currently running step. You can learn more this SAS Global Forum paper by Rick. I've also written a post with a specific example in SAS Enterprise Guide.

LOCKDOWN system option and statement

This one will excite SAS administrators. You can set the LOCKDOWN system option in a batch SAS session or SAS Workspace server to limit some of the "dangerous" functions of SAS and, more importantly, limit the file areas in which the SAS session will operate. Read more in this article, Fencing in your SAS users with LOCKDOWN.

Creating and managing directories within SAS

This technique combines two features into a one-two punch of file folder management. Use the DLCREATEDIR option and the LIBNAME statement to create a new directory, and then use the DLGCDIR function to change the current directory of the SAS session.

In the past, you would have had to issue operating system commands to create a new directory and then switch ('cd') into it. That approach is not portable across different operating systems, and it requires access to the operating system shell -- not available in many SAS sessions these days. See these blog posts for more information and examples about the new techniques:

Using %IF/%THEN/%ELSE in open code

Perhaps the most life-changing of all of these SAS language updates, you can now use simple if-then-else logic for program flow, outside of the confines of the SAS macro language. It's what makes defensive programming like this possible -- without having to wrap the logic in %MACRO/%MEND:

%if %symexist(config_root) %then %do;
  filename config "&config_root./config.json";
  libname config json fileref=config;
  data _null_;
   set config.root;
   call symputx('tenant_id',tenant_id,'G');
   call symputx('client_id',client_id,'G');
   call symputx('redirect_uri',redirect_uri,'G');
   call symputx('resource',resource,'G');
  run;
%end;
%else %do;
  %put ERROR: You must define the CONFIG_ROOT macro variable.; 
%end;

Read more here: Using %IF/%THEN/%ELSE in SAS programs.

A legacy in the making

I'm going to miss having Rick Langston as a SAS colleague. There aren't many other people who know how to spin up a version of SAS from 30 years ago to help me track down a curious question. However, I'm not worried about the future of the SAS language. Rick has been excellent about sharing his knowledge for decades (just check his annual contributions to SAS Global Forum), and his team is well-suited to carry on the work of extending the SAS programming language for the next generation of SAS users. Thanks to Rick for helping to build such a solid foundation.

The post Five SAS programming language features you should be using appeared first on The SAS Dummy.

12月 122018
 

In her role as Product Manager for SAS Platform Technologies (including the SAS Add-In for Microsoft Office), my colleague Amy Peters hears this question often. With many organizations adopting Microsoft Office 365 -- the "cloud" version of Office -- what does this mean for other processes that integrate with Microsoft Office applications?

Microsoft has used different names for these similar offerings: Office 2016, Office 365, Microsoft 365, Office Online. The bottom line is that most users of a "365" package in the cloud, also have access to the Microsoft Office tools on their Windows desktop. They can use the full version of Excel, PowerPoint, Word, etc., and they also have access to these same tools via a web browser. At SAS, we recently experienced this transition ourselves. Have the Office applications on our desktops vanished? No, they have not. While more of our data is now on the cloud (looking at you, OneDrive), it's not really changing how we work, especially when creating/maintaining content. (Like many organizations, we already had one foot in this world by using Microsoft SharePoint for collaboration.)

Collaboration on the web. Full control on the desktop

Let's look at an example of how I use SAS with Microsoft Office. First, I create a report in SAS Visual Analytics. Then I open Excel on my desktop and use the SAS Add-In for Microsoft Office to embed the shared report into my spreadsheet. Want to see what that looks like in action? Check out this video Tech Talk with SAS developer Tim Beese.

Now suppose that I share this content in Microsoft OneDrive, and my colleague views it in Excel in a web browser. Yep, the content is still there. The difference is that the content is not dynamic like it is on my desktop. So what do you do when you want to edit that spreadsheet displaying in the browser? You select Open in Excel and the document opens on your desktop. Voila! The content is dynamic and you have all the functionality the SAS Add-In for Microsoft Office provides.

How is Microsoft Office 365 changing your workflow?

Today, the expectation of most users working with "Office Online" applications in their browsers is that it's primarily for viewing and basic editing. Will this change? Probably. We're researching how to provide more of the SAS Add-In for Microsoft Office function in a browser app. If you or your colleagues need this browser-based function – you want to do something specific in Excel with your SAS content -- we want to hear from you. And do you have a plan to move completely to browser-based Office apps? Currently you can't create SAS content from a browser-based Office app. If that's a pressing need, we would like to know. For now, we're not hearing of use cases where some form of the desktop app isn't still in the picture.

SAS integration with these everyday productivity tools, like Microsoft Office, is important to us. Don't forget about these SAS programming methods to create and read your Microsoft Office content:

How are you using Microsoft Office 365 with SAS? How do you think this workflow will change for you in the next year or two? Leave a comment -- we would love to hear from you.

The post Does SAS support Microsoft Office 365? appeared first on The SAS Dummy.

11月 292018
 

If your work environment is like ours here at SAS, you're seeing more of your data and applications move to the cloud. It's not yet a complete replacement for having local files on your desktop machine, but with cloud storage and apps -- like Microsoft OneDrive -- I can now access my work documents from any browser and any device, including my smartphone. I can update now my spreadsheets while waiting in the dentist office. Oh joy.

For those of us who use SAS to read and create Microsoft Excel documents, cloud-based files can add an extra wrinkle when we automate the process. It also adds some exciting possibilities! The Microsoft Office 365 suite offers APIs to discover, fetch, and update our documents using code. In this article, I'll show you how to use SAS programs to reach into your Microsoft OneDrive cloud to read and update your files. Note: All of this assumes that you already have a Microsoft Office 365 account -- perhaps provisioned by your IT support team -- and that you're using it to manage your documents.

Before I go on, I have to give major credit to Joseph Henry, the SAS developer who maintains PROC HTTP. Joseph did the heavy lifting for putting together the code and examples in this article. He also regularly adds new features to PROC HTTP that make it a more natural fit for calling REST APIs that require special authentication flows, such as OAuth2.

Using SAS with Microsoft OneDrive: an overview

Microsoft Office 365 uses an OAuth2-style authentication flow to grant access and permissions to third-party apps. When we're writing SAS programs to access Microsoft OneDrive, we're also writing a third-party app. This requires several setup steps, a few of which cannot be automated. Fortunately, these need to be done just once, or at least infrequently. Here's an outline of the steps:

  1. Register a new client application at the Microsoft Application Registration Portal. (You will need to sign in with your Office 365 credentials, which might be your primary organization credentials if you have single-signon with Active Directory.)
  2. Using your browser while you are signed into Office 365, navigate to a special web address to obtain an authorization code for your application.
  3. With your authorization code in hand, plug this into a SAS program (PROC HTTP step) to retrieve an OAuth2 access token (and a refresh token).
  4. With the access token, you can now use PROC HTTP and the Microsoft Office 365 APIs to retrieve your OneDrive folders and files, download files, upload files, and replace files.

You'll have to complete Step 1 just once for your application or project. Steps 2 and 3 can be done just once, or at least just occasionally. The access token is valid for a limited time (usually 1 hour), but you can always exchange the refresh token for a new valid access token. This refresh token step can be automated in your program, usually run just once per session. Occasionally that refresh token can be revoked (and thus made invalid) when certain events occur (such as you changing your account password). When that happens, you'll need to repeat steps 2 and 3 to get a new set of access/refresh tokens.

Step 1: Register your application

Visit the Microsoft Application Registration portal to register your new app. You'll sign in with your Microsoft Office 365 credentials.

Microsoft Application Registration portal

Note: At the time of this writing, Microsoft was testing a new web experience for application developers. I opted for the "preview experience" with the hope that my screenshots would have a longer shelf life for future readers.

Click New Registration to get started. This presents you with a form where you can complete the details that define your app. Mainly, you're giving it a name and defining its scope. You'll probably want to limit its use to just your organization (your company) unless you're collaborating with colleagues who work elsewhere.

"Register an application" form

When you create an app, you'll receive a Client ID (unique to your app) and Tenant ID (unique to your organization). You'll need these values to obtain your authorization code and tokens later. The application portal provides a sort of control center for all aspects of your app. (Note: I masked out my client ID and tenant ID in this screenshot.)

Details for my sample application

Specifying your app permissions

Your app will need specific permissions in order to function. In my example, I want my SAS program to read documents from my OneDrive, and also add new docs and update existing docs. The permissions I need are:

  • Files.ReadWrite.All: Allows the app to read, create, update and delete all files that you can access.
  • User.Read: Allows you to sign in to the app with your organizational account and let the app read your profile.

To add these to your app, click the API Permissions tab in the control center. To be clear, these are not permissions that your app will automatically have. These are the permissions that will be requested when you "sign into" the app for the first time, and that you'll have to agree to in order for the app to run.

Adding permissions that the app needs

Creating a configuration file

There are a few app-specific values that we'll need to reference throughout the SAS programs we're writing. I decided to create a configuration file for these settings rather than hard-code them into my SAS statements. This will make it easier for other people to reuse my code in their own applications.

I created a file named conf.json that looks like this (but with different tenant_id and client_id values):

{
  "tenant_id": "206db638-6adb-41b9-b20c-95d8d04abcbe",
  "client_id": "8fb7804a-8dfd-40d8-bf5b-d02c2cbc56f3",
  "redirect_uri": "https://login.microsoftonline.com/common/oauth2/nativeclient",
  "resource" : "https://graph.microsoft.com"
}

By "externalizing" the IDs specific to my account/instance, I can use SAS code to read the values at run time. Note: This code, like all of the code in this article, uses features from SAS 9.4 Maintenance 5.

/*
  Set the variables that will be needed through the code
  We'll need these for authorization and also for runtime 
  use of the service.
 
  Reading these from a config.json file so that the values
  are easy to adapt for different users or projects.
*/
 
%if %symexist(config_root) %then %do;
  filename config "&config_root./config.json";
  libname config json fileref=config;
  data _null_;
   set config.root;
   call symputx('tenant_id',tenant_id,'G');
   call symputx('client_id',client_id,'G');
   call symputx('redirect_uri',redirect_uri,'G');
   call symputx('resource',resource,'G');
  run;
%end;
%else %do;
  %put ERROR: You must define the CONFIG_ROOT macro variable.; 
%end;

Step 2: Obtain an authorization code

Now that I've defined the application, it's time to "sign into it" and grant it the permission to read and manage content in OneDrive. This step needs to be completed from a web browser while I am signed into my Microsoft Office 365 account. The web address is very long...but we can use a SAS program to generate it for us.

/* location of my config file */
%let config_root=/folders/myfolders/onedrive;
 
%include "&config_root./onedrive_config.sas";
 
/* Run this line to build the authorization URL */
%let authorize_url=https://login.microsoftonline.com/&tenant_id./oauth2/authorize?client_id=&client_id.%nrstr(&response_type)=code%nrstr(&redirect_uri)=&redirect_uri.%nrstr(&resource)=&resource.;
options nosource;
%put Paste this URL into your web browser:;
%put -- START -------;
%put &authorize_url;
%put ---END ---------;
options source;

This produces these output lines in the SAS log:

 Paste this URL into your web browser:
 -- START -------
https://login.microsoftonline.com/206db638-6adb-41b9-b20c-95d8d04abcbe/oauth2/authorize?client_id=8fb7804a-8dfd-40d8-bf5b-d02c2cbc56
f3&response_type=code&redirect_uri=https://login.microsoftonline.com/common/oauth2/nativeclient&resource=https://graph.microsoft.com
 ---END ---------

Copy and paste the URL (all on one line, no spaces) into the address bar of your web browser. When you press Enter, you'll be prompted to grant the required permissions:

Once you click Accept, the browser will redirect to what looks like a blank page, but the URL contains the authorization code that we need:

Copy the value that appears after the code= in the URL. It's going to be a very long string -- over 700 characters. We'll need that value for the next step.

Step 3: Obtain an access token

My colleague Joseph wrote a few convenient utility macros that can help manage the access token and refresh token within your SAS session. These macros include:

  • %get_token - get the initial access and refresh tokens, given an authorization code. Remember, an access token will expire in about 60 minutes. But the refresh token can be used to get a renewed access token.
  • %refresh - exchange a valid refresh token for a new access token
  • %process_token_file - read/update an external token file so that these values persist beyond your current SAS session.

I'm not going to walk through the macro code in this article, but the SAS programs are straightforward and well-documented. See "How to get this example code" at the end of this article.

With these macros in place, we can paste the (very long) authorization code we retrieved in the previous step into a macro variable. Then we can run the %get_token macro to generate the tokens and store them in a local file.

%let config_root=/folders/myfolders/onedrive;
 
%include "&config_root./onedrive_config.sas";
%include "&config_root./onedrive_macros.sas";
 
filename token "&config_root./token.json";
%let auth_code=AQABAAIAAAC5una0EUFgTIF8ElaxtWjTqwohjyfG; * and much more;
 
/*
  Now that we have an authorization code we can get the access token
  This step will write the tokens.json file that we can use in our
  production programs.
*/
%get_token(&client_id.,&auth_code,&resource.,token,tenant=&tenant_id);

Running this step will create a new file, token.json, in your designated config folder. Here's an screenshot of what my version looks like right now:

It's very important that you keep this file secure. With the information in this file (your refresh token) and your conf.json file (with your client ID and tenant ID), anyone can use these code techniques to impersonate you and access your Microsoft Office 365 data. There are techniques for storing these files such that only you can see them.

Using Microsoft Office 365 APIs to access OneDrive from SAS

Whew! It took near 1500 words to get this far, so thanks for sticking with me. The good news is that these steps take much longer to describe than to actually execute. Plus, creating apps is fun! (Right?)

From the screenshots I've shared, you probably already noticed that these services are working on Microsoft Azure, which is Microsoft's cloud platform for applications. For the remainder of this article, I'll be using methods from the Microsoft Graph API. This REST-based API provides access to almost all of Microsoft's hosted services. For my examples, I'll be using methods within the Files component of the API: Drives and Drive Items (folders and files).

Initializing and refreshing the access token in a new session

Now that we have the access and refresh tokens, we can get down to business with some actual OneDrive interactions. Here's how to initialize your SAS session with the tokens.

%let config_root=/folders/myfolders/onedrive;
 
%include "&config_root./onedrive_config.sas";
%include "&config_root./onedrive_macros.sas";
 
/*
  Our json file that contains the oauth token information
*/
filename token "&config_root./token.json";
 
%process_token_file(token);
 
/* If this is first use for the session, we'll likely need to refresh  */
/* the token.  This will also call process_token_file again and update */
/* our token.json file.                                                */
%refresh(&client_id.,&refresh_token.,&resource.,token,tenant=&tenant_id.);
 
/*
  At this point we have a valid access token and we can start using the API.
*/

If all goes well, we'll have our access token, and it will be stored in a macro variable named &access_token. It's going to be another long and illegible (>700 characters) value.

(Ever hear of the "infinite monkey theorem? That a monkey hitting a typewriter for an infinite amount of time is certain to produce a certain text, such as the complete works of Shakespeare? Well, that monkey is not going to produce this access token. Plus, who has a typewriter anymore?)

Retrieving the top-level drive identifier

We'll need to explore the OneDrive system from the top-down, using code. First, we need the identifier for the root drive. It's possible for you to have multiple root drives, and if that's the case for you, you'll need to modify this code a bit. This code queries the service for your drives, and stores the identifier for just the first drive in a macro variable. We'll need that identifier later to retrieve a list of top-level items.

/*
First we need the ID of the "drive" we are going to use.
to list the drives the current user has access to you can do this
*/
filename resp TEMP;
/* Note: oauth_bearer option added in 9.4M5 */
proc http url="https://graph.microsoft.com/v1.0/me/drives/"
     oauth_bearer="&access_token"
     out = resp;
	 run;
 
libname jresp json fileref=resp;
 
/*
 I only have access to 1 drive, but if you have multiple you can filter 
 the set with a where clause on the name value.
 
 This creates a data set with the one record for the drive.
*/
data drive;
 set jresp.value;
run;
 
/* store the ID value for the drive in a macro variable */
proc sql noprint;
 select id into: driveId from drive;
quit;

Note that this code uses the new OAUTH_BEARER option in PROC HTTP -- a convenient addition when working with OAuth2-compliant APIs. This is shorthand -- and more intuitive syntax -- for placing "Authorization: Bearer TOKEN-VALUE" in the HTTP headers.

Retrieve a list of top-level folders/files

With the drive identifier in hand, I can use the /children verb on the Microsoft Graph API to get a list of all of the top-level objects in that drive. These represent the folders and files that are at the root.

/*
 To list the items in the drive, use the /children verb with the drive ID
*/
filename resp TEMP;
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/root/children"
     oauth_bearer="&access_token"
     out = resp;
	 run;
 
libname jresp json fileref=resp;
 
/* Create a data set with the top-level paths/files in the drive */
data paths;
 set jresp.value;
run;

Here's what I'm keeping in my OneDrive right now. It's not too disorganized, is it?

List the files in a particular folder

If I'm interested in exploring a particular folder, I'll need to find the folder identifier as it's known to OneDrive. Using PROC SQL and SELECT INTO, I can find the folder by its name and store its ID in another macro variable. Then, I use the /children verb again, but this time with the folder ID instead of the "root" constant.

/*
 At this point, if you want to act on any of the items, you just replace "root" 
 with the ID of the item. So to list the items in the "SASGF" folder I have:
  - find the ID for that folder
  - list the items within by using the "/children" verb
*/
 
/* Find the ID of the folder I want */
proc sql noprint;
 select id into: folderId from paths
  where name="SASGF";
quit;
 
filename resp TEMP;
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&folderId./children"
     oauth_bearer="&access_token"
     out = resp;
	 run;
 
/* This creates a data set of the items in that folder, 
   which might include other folders.
*/
libname jresp json fileref=resp;
data folderItems;
 set jresp.value;
run;

Here are the items from my SASGF folder. Can you tell that I don't throw anything away?

Download a file from OneDrive and import into SAS

I know that I keep a spreadsheet named "sas_tech_talks_18.xlsx" in this SASGF folder. With the /content verb, I can download the file from OneDrive and store it in the file system that is local to my SAS session. Then, I can use PROC IMPORT to read it into a SAS data set.

/*
 With a list of the items in this folder, we can download
 any item of interest by using the /content verb 
*/
 
/* Find the item with a certain name */
proc sql noprint;
 select id into: fileId from folderItems
  where name="sas_tech_talks_18.xlsx";
quit;
 
filename fileout "&config_root./sas_tech_talks_18.xlsx";
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&fileId./content"
     oauth_bearer="&access_token"
     out = fileout;
	 run;
 
/* Import the first sheet into a SAS data set */
proc import file=fileout 
 out=sasgf
 dbms=xlsx replace;
run;

Boom! I've just downloaded my data from the cloud and brought it into my SAS session.

Add a new file to OneDrive

We can build wonderful documents from SAS too, and it's important to be able to share those. By using the PUT method with the /content verb, we can copy a file from the local SAS session into a target folder on OneDrive. Most often, this will probably be an Excel spreadsheet or maybe a PDF report. (But hey, maybe it's a good opportunity to try out

/*
  We can upload a new file to that same folder with the PUT method and /content verb
  Notice the : after the folderId and the target filename
*/
 
/* Create a simple Excel file to upload */
%let targetFile=iris.xlsx;
filename tosave "%sysfunc(getoption(WORK))/&targetFile.";
ods excel(id=upload) file=tosave;
proc print data=sashelp.iris;
run;
ods excel(id=upload) close;
 
filename details temp;
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&folderId.:/&targetFile.:/content"
  method="PUT"
  in=tosave
  out=details
  oauth_bearer="&access_token";
run;
 
/*
  This returns a json response that describes the item uploaded.
  This step pulls out the main file attributes from that response.
*/
libname attrs json fileref=details;
data newfileDetails (keep=filename createdDate modifiedDate filesize);
 length filename $ 100 createdDate 8 modifiedDate 8 filesize 8;
 set attrs.root;
 filename = name;
 modifiedDate = input(lastModifiedDateTime,anydtdtm.);
 createdDate  = input(createdDateTime,anydtdtm.);
 format createdDate datetime20. modifiedDate datetime20.;
 filesize = size;
run;

Replace/update a file in OneDrive

If you want to replace an existing file, then you'll want to perform the additional step of retrieving the unique ID for that file from OneDrive. When you PUT the new version of the file into place, its history and sharing properties should remain intact. Here is my code for navigating the folder/file structure in my OneDrive and finally replacing an existing file.

/*
  If you want to replace a file instead of making a new file 
  then you need to upload it with the existing file ID.  If you
  don't replace it with the existing ID, some sharing properties
  and history could be lost.
*/
/* Create a simple Excel file to upload */
%let targetFile=iris.xlsx;
filename tosave "%sysfunc(getoption(WORK))/&targetFile.";
ods excel(id=upload) file=tosave;
proc print data=sashelp.iris;
run;
ods excel(id=upload) close;
 
/* Navigate the folder and file IDs from my OneDrive */
proc sql noprint;
 select id into: folderId from paths
  where name="SASGF";
quit;
 
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&folderId./children"
     oauth_bearer="&access_token"
     out = resp;
	 run;
 
libname jresp json fileref=resp;
data folderItems;
 set jresp.value;
run;
 
/* Find the ID of the existing file */
proc sql noprint;
 select id into: fileId from folderItems
  where name="iris.xlsx";
quit;
 
libname attrs json fileref=details;
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&fileId./content"
 method="PUT"
 in=tosave
 out=details
 oauth_bearer="&access_token";
run;

As you can see from my OneDrive history for this file, I've tested this program a few times -- resulting in 23 revisions of this file in its history!

How to get this example code

You can find the source files for these examples on GitHub.

I've organized this code into 4 different files in order to make it easy to reuse:

  • onedrive_config.sas - read the fields from the conf.json and set them as global macro variables. This includes your client_id and tenant_id.
  • onedrive_setup.sas - the SAS statements that represent code you will need to run just once to get your authorization code and first access code.
  • onedrive_macros.sas - three utility macros that help you to create, refresh, and manage your access token and refresh token in your token.json file
  • onedrive_example_use.sas - sample SAS steps that I used in this article. They won't quite work for you as-is, since you don't have the same files that I do. (Unless you do have the same files, in which case...creepy.) My hope is that you can read and adapt them for your own content.

I also included a template for the conf.json file, with obvious placeholders for the client_id and tenant_id that you'll substitute with your own values. You'll also need to change the statements that define &CONFIG_LOC -- the location of your configuration directory where you're storing these files. I developed these examples in SAS University Edition -- yes, this works there! I also ran the code from my full SAS environment via SAS Enterprise Guide.

More about using REST APIs from SAS

This has been a monster article -- in terms of its length. But I hope it's clear enough to follow and has sufficient detail for you to try this on your own. If you have questions, post in the comments.

I've published a number of other articles about using REST APIs from SAS -- it's one of my favorite things to do. Check out:

The post Using SAS to access and update files on Microsoft OneDrive appeared first on The SAS Dummy.

11月 152018
 

JSON is a popular format for data exchange between APIs and some modern databases. It's also used as a way to archive audit logs in many systems. Because JSON can represent hierarchical relationships of data fields, many people consider it to be superior to the CSV format -- although it's certainly not yet universal.

I learned recently that newline-delimited JSON, also called JSONL or JSON Lines, is growing in popularity. In a JSONL file, each line of text represents a valid JSON object -- building up to a series of records. But there is no hierarchical relationship among these lines, so when taken as a whole the JSONL file is not valid JSON. That is, a JSON parser can process each line individually, but it cannot process the file all at once.

In SAS, use the JSON libname engine to parse valid JSON files. But neither of these can create or parse JSONL files directly. Here's a simple example of a JSONL file. Each line, enclosed in braces, represents valid JSON. But if you paste the entire body into a validation tool like JSONLint, the parsing fails.

JSONL sample, not valid in JSON parsers

If we needed these records to be true JSON, we need a hierarchy. This requires us to set off the rows with more braces and brackets and separate them with commas, like this:

Valid JSON with hierarchies

Creating JSONL with PROC JSON and DATA step

In a recent SAS Support Communities thread, a SAS user was struggling to use PROC JSON and a SAS data set to create a JSONL file for use with the Amazon Redshift database. PROC JSON can't create the finished file directly, but we can use PROC JSON to create the individual JSON object records. Our solution looks like this:

  • Use PROC JSON to read each record of the source data set and create a new JSON file. DATA step and CALL EXECUTE can generate these steps for us.
  • Using DATA step, post-process the collection of JSON files and append these into a final JSONL file.

Here's the code we used. You need to change only the output file name and the source SAS data set.

/* Build a JSONL (newline-delimited JSON) file */
/* from the records in a SAS data set          */
filename final "c:\temp\final.jsonl" ;
%let datasource = sashelp.class;
 
/* Create a new subfolder in WORK to hold */
/* temp JSON files, avoiding conflicts    */
options dlcreatedir;
%let workpath = %sysfunc(getoption(WORK))/json;
libname json "&workpath.";
libname json clear;
 
/* Will create a run a separate PROC JSON step */
/* for each record.  This might take a while   */
/* for very large data.                        */
/* Each iteration will create a new JSON file  */
data _null_;
 set &datasource.;
 call execute(catt('filename out "',"&workpath./out",_n_,'.json";'));
 call execute('proc json out=out nosastags ;');
 call execute("export &datasource.(obs="||_n_||" firstobs="||_n_||");");
 call execute('run;');
run;
 
/* This will concatenate the collection of JSON files */
/* into a single JSONL file                           */
data _null_;
 file final encoding='utf-8' termstr=cr;
 infile "&workpath./out*.json";
 input;
 /* trim the start and end [ ] characters */
 final = substr(_infile_,2,length(_infile_)-2);
 put final;
run;

From what I've read, it's a common practice to compress JSONL files with gzip for storage or faster transfers. That's a simple step to apply in our example, because SAS supports a GZIP method in SAS 9.4 Maintenance 5. To create a gzipped final result, change the first FILENAME statement to something like:

filename final ZIP "c:\temp\final.jsonl.gz" GZIP;

The JSONL format is new to me and I haven't needed to use it in any of my applications. If you use JSONL in your work, I'd love to hear your feedback about whether this approach would create the types of files you need.

The post Create newline-delimited JSON (or JSONL) with SAS appeared first on The SAS Dummy.

11月 062018
 

A few weeks ago I posted a cliffhanger-of-a-blog-post. I left my readers in suspense about which of my physical activities are represented in different sets of accelerometer data that I captured. In the absence of more details from me, the internet fan theories have been going wild. Well, it's time for the big reveal! I've created a SAS Visual Analytics report that shows each of these activity streams with the proper label:

Accelerometer measurements per activity -- click to enlarge!

Were your guesses confirmed? Any surprises? Were you more impressed with my safe driving or with my reckless behavior on the trampoline?

Collecting and preparing accelerometer data

You might remember that this entire experiment was inspired by a presentation from Analytics Experience 2018. That's when I learned about an insurance company that built a smartphone app to collect data about driving behavior, and that the app relies heavily on accelerometer readings. I didn't have time or expertise to build my own version of such an app, but I found that there are several good free apps that can collect and export this data. I used an app called AccDataRec on my Android phone.

Each "recording session" generates a TSV file -- a tab-separated file that contains a timestamp and a measurement for each of the accelerometer axes (X, Y, and Z). In my previous post, I shared tips about how to import multiple TSV files in a single step. Here's the final version of the program that I wrote to import these data:

filename tsvs "./accel/*.tsv";
libname out "./accel";
 
data out.accel;
  length 
    casefile $ 100 /* to write to data set */
    counter 8 
    timestamp 8 
    timestamp_sec 8
    x 8 y 8 z 8 
    filename $ 25        
    tsvfile $ 100 /* to hold the value */
  ;
  format timestamp datetime22.3 timestamp_sec datetime20.;
 
  /* store the name of the current infile */
  infile tsvs filename=tsvfile expandtabs;
  casefile=tsvfile;
  input counter timestamp x y z filename;
 
  /* convert epoch time into SAS time */
  timestamp=dhms('01jan1970'd, 0, 0, timestamp / 1000);
 
  /* create a timestamp with the precision of one second */
  timestamp_sec = intnx('second',timestamp,0);
run;

Some notes:

  • I converted the timestamp value from the data file (an epoch time value) to a native SAS datetime value by using this trick.
  • Following advice from readers on my last post, I changed the DLM= option to a more simple EXPANDTABS option on the INFILE statement.
  • Some of the SAS time-series analysis doesn't like the more-precise timestamp values with fractions of seconds. I computed a less precise field, rounding down to the second, just in case.
  • For my reports in this post, I really need only 5 fields: counter (the ordinal sequence of measurements), x, y, z, and the filename (mapping to activity).

The new integrated SAS Viya environment makes it simple to move from one task to another, without needing to understand the SAS product boundaries. I used the Manage Data function (that's SAS Data Management, but does that matter?) to upload the ACCEL data set and make it available for use in my reports. Here's a preview:

Creating a SAS Visual Analytics report

With the data now available and loaded into memory, I jumped to the Explore and Visualize Data activity. This is where I can use my data to create a new SAS Visual Analytics report.

At first, I was tempted to create a Time Series Plot. My data does contain time values, and I want to examine the progression of my measurements over time. However, I found the options of the Time Series Plot to be too constraining for my task, and it turns out that for this task the actual time values really aren't that important. What's important is the sequence of the measurements I've collected, and that's captured as an ordinal in the counter value. So, I selected the Line Plot instead. This allowed for more options in the categorical views -- including a lattice row arrangement that made it easy to see the different activity patterns at a glance. This screen capture shows the Role assignments that I selected for the plot.

Adding a closer view at each activity

With the overview Line Plot complete, it's time to add another view that allows us to see just a single activity and provide a close-up view of its pattern. I added a second page to my report and dropped another Line Plot onto the canvas. I assigned "counter" to the category and the x, y, and z values to the Measures. But instead of adding a Lattice Row value, I added a Button Bar to the top of the canvas. My idea is to use the Button Bar -- which is good for navigating among a small number of values -- as a way to trigger a filter for the accelerometer data.

I assigned "filename" to the Category value in the Button Bar role pane. Then I used the Button Bar options menu (the vertical dots on the right) to add a New filter from selection, selecting "Include only selection".

With this Button Bar control and its filter in place, I can now switch among the data values for the different activities. Here's my "drive home" data -- it looks sort of exciting, but I can promise you that it was a nice, boring ride home through typical Raleigh traffic.

Phone mounted in my car for the drive home

The readings from the "kitchen table" activity surprised me at first. This activity was simply 5 minutes of my phone lying flat on my kitchen table. I expected all readings to hover around zero, but the z axis showed a relatively flat line closer to 10 meters-per-second-per-second. Then I remembered: gravity. This sensor registers Earth's gravity, which we are taught is 9.8 meters-per-second-per-second. The readings from my phone hovered around 9.6 -- maybe my house is in a special low-gravity zone, or the readings are a bit off.

Phone at rest on my kitchen table

Finally, let's take a closer look at my trampoline workout. Since I was holding my phone upright, it looks like the x-axis felt the brunt of the acceleration forces. According to these readings, my phone was subjected to a g-force of 7 or 8 times that of Earth's gravity -- but just for a split second. And since my phone was in my hand and my arm was flailing around (I am not a graceful rebounder), my phone was probably experiencing more force than my body was.

Bounding on the trampoline as high as I can

Some love for the Windows 10 app

My favorite method to view SAS Visual Analytics reports is through the SAS Visual Analytics application that's available for Windows 10 and Windows mobile devices. Even on my desktop, where I have a full web browser to help me, I like the look and feel of the specialized Windows 10 app. The report screen captures for this article were rendered in the Windows 10 app. Check out this article for more information about the app. You can try the app for free, even without your own SAS Viya environment. The app is hardwired with a connection to the SAS demo reports at SAS.com.

See also

This is the third (and probably final) article in my series about accelerometer data. See these previous posts for more of the fun background information:

The post Reporting on accelerometer data with SAS Visual Analytics appeared first on The SAS Dummy.

10月 092018
 

As part of my research for a different article, I recently collected data about my driving commute home via an accelerometer recorder app on my phone. The app generates a simple TSV file. (A TSV file is like a CSV file, but instead of a comma separator, it uses a TAB character to separate the values.) The raw data looks like this:

Related from Analytics Experience 2018: Using your smartphone accelerometer to build a safe driving profile

With SAS, it's simple to import the file into a data set. Here's my DATA step code that uses the INFILE statement to identify the file and how to read it. Note that the DLM= option references the hexadecimal value for the TAB character in ASCII (09x), the delimiter for fields in this data.

data drive;
  infile "/home/chris.hemedinger/tsv/drivehome.tsv" 
   dlm='09'x;
  length counter 8 
         timestamp 8 
         x 8 y 8 z 8 
         filename $ 25;
  input counter timestamp x y z filename;
run;

In my research, I didn't stop with just my drive home. In addition to my commute, I collected data about 4 other activities, and thus accumulated a collection of TSV files. Here's my file directory in my SAS OnDemand for Academics account:

To import each of these data files into SAS, I could simply copy and paste my code 4 times and then replace the name of the file for each case that I collected. After all, copy-and-paste is a tried and true method for writing large volumes of code. But as the number of code lines grows, so does the maintenance work. If I want to add any additional logic into my DATA step, that change would need to be applied 5 times. And if I later come back and add more files to my TSV collection, I'll need to copy-and-paste the same code blocks for my additional cases.

Using a wildcard on the INFILE statement

I can read all of my TSV files in a single step by *.tsv, which tells SAS to match on all of the TSV files in the folder and process each of them in turn. I also changed the name of the data set from "drive" to the more generic "accel".

data accel;
  infile "/home/chris.hemedinger/tsv/*.tsv" 
   dlm='09'x;
  length counter 8 
         timestamp 8 
         x 8 y 8 z 8 
         filename $ 25;
  input counter timestamp x y z filename;
run;

The SAS log shows which files have been processed and added into my data set.

With a single data set that has all of my accelerometer readings, I can easily segment these with a WHERE clause in later processing. It's convenient that my accelerometer app also captured the name of each TSV file so that I can keep these cases distinct. A quick PROC FREQ shows the allocation of records for each case that I collected.

Add the filename into the data set

If your input data files don't contain an eponymous field name, then you will need to use a different method to keep track of which records come from which files. The

filename tsvs "/home/chris.hemedinger/tsv/*.tsv";
data accel;
  length casefile $ 100 /* to write to data set */
     counter 8 
     timestamp 8 
     x 8 y 8 z 8 
     filename $ 25	       
     tsvfile $ 100 /* to hold the value */	       
   ;
 
   /* store the name of the current infile */       
   infile tsvs filename=tsvfile 
    dlm='09'x ; 
  casefile=tsvfile;
  input counter timestamp x y z filename;	
run;

In the output, you'll notice that we now have the fully qualified file name that SAS processed using INFILE.

Managing data files: fewer files is better

Because we started this task with 5 distinct input files, it might be tempting to store the records in separate tables: one for each accelerometer case. While there might be good reasons to do that for some types of data, I believe that we have more flexibility when we keep all of these records together in a single data set. (But if you must split a single data set into many, here's a method to do it.)

In this single data set, we still have the information that keeps the records distinct (the name of the original files), so we haven't lost anything. SAS procedures support CLASS and BY statements that allow us to simplify our code when reporting across different groups of data. We'll have fewer blocks of repetitive code, and we can accomplish more across all of these cases before we have to resort to SAS macro logic to repeat operations for each file.

As a simple example, I can create a simple visualization with a single PROC SGPANEL step.

ods graphics / width=1600 height=400;
proc sgpanel data=accel;
 panelby filename / columns=5 noheader;
 series x=counter y=x;
 series x=counter y=y;
 series x=counter y=z;
 colaxis display=none minor;
 rowaxis label="m/s**2" grid;
 where counter<11000;
run;

Take a look at these 5 series plots. Using just what you know of the file names and these plots, can you guess which panel represents which accelerometer case?

Leave your guess in the comments section. I'll explore these data further in a future blog post!

The post How to read multiple text files in SAS appeared first on The SAS Dummy.