time series data

7月 032018

At SAS, we love data. Data is central to our corporate vision: to transform a world of data into a world of intelligence. We're also famous for enjoying M&Ms, but to us they are more than a sweet snack. They're also another source of data.

My colleague Pete Privitera, with a team of like-minded "makers," built a device that they named SnackBot. SnackBot is an internet-connected sensor that measures the flow of M&Ms in a particular SAS break room. There's a lot to love about this project. You can learn more by watching its origin story in this video:

As the number of M&Ms changes, SnackBot takes a reading and records the M&M count in a database. Most readings reflect a decrease in candy pieces, as my colleagues help themselves to a treat. But once per week, the reading shows a drastic increase -- as our facilities staff restocks the canister. SnackBot has its own website. It also has its own API, and you know what that means (right?). It means that we can use SAS to read and analyze the sensor data.

Reading sensor data into SAS with the SnackBot API

The SnackBot system offers a REST API with a JSON data response. Like any REST API, we can use PROC HTTP to fetch the data, and the JSON library engine to parse the response into a SAS data set.

%let start = '20MAY2018:0:0:0'dt;
/* format the start/end per the API needs */
%let start_time= %sysfunc(putn(&start.,is8601dt26.));
%let end_time=   %sysfunc(datetime(),is8601dt26.);
/* Call the SnackBot API from snackbot.net */
filename resp temp;
proc http
/* JSON libname engine to read the result       */
/* Simple record layout, all in the ROOT member */
libname mms json fileref=resp;
data mmlevels;
  set mms.root;

I've written about how to use SAS with REST APIs in several other blog posts, so I won't dwell on this part of the process here. This short program retrieves the raw data from SnackBot, which represents a series of M&M "levels" (count of remaining pieces) and a timestamp for each measurement. It's a good start. Though there are only two fields to work with here, there's quite a bit we can do with these data.

raw SnackBot data

Add features to the raw sensor data

With a few additional DATA step statements and some built-in SAS formats, we can derive several interesting characteristics of these data for use in further analysis.

First, we need to convert the character-formatted datetime field to a proper SAS datetime value. That's easily achieved with the INPUT function and the ANYDTDTM informat. (Rick Wicklin wrote a helpful article about how how the ANYDT* informats work.)

data mmlevels;
  set mms.root;
  drop ordinal_root timestamp;
  /* Convert the TIMESTAMP field to native value -- it's a character */
  datetime = input(timestamp, anydtdtm.);
  date = datepart(datetime);
  time = timepart(datetime);
  dow = date;
  qhour = round(datetime,'0:15:0'T);
  format  datetime datetime20. 
          qhour datetime20.
          date date9.
          time timeampm10.
          dow downame.;

For convenience, I duplicated the datetime value a few times and applied different formats so we can get different views of the same value: datetime, just the date, just the time-of-day, and the day-of-week. I also used the ROUND function to "round" the raw datetime value to the nearest quarter hour. I'll explain why I've done that in a later step, but the ROUNDing of a time value is one of the documented unusual uses of the ROUND function.

SnackBot data with features

Even with this small amount of data preparation, we can begin to analyze the characteristics of these data. For example, let's look at the descriptive stats for the data classified by day-of-week:

title "SnackBot readings per day-of-week";
proc means data=mmlevels mean stddev max min;
 var pieces;
 class dow;

SnackBot by day of week

The "N Obs" column shows the number of measurements taken over the entire "study period" broken down by day-of-week. If a measurement is a proxy for a "number-of-pieces-changed" event, then we can see that most events happen on Wednesday, Thursday, and Friday. From this, can you guess which day the M&M canister is refilled?

Let's take another slice through these data, but this time looking at time-of-day. For this, I used PROC FREQ to count the measurements by hour. I applied the HOUR2. format, which allows the SAS procedure to group these data into hour-long intervals with no need for additional data prep. ( I've written previously about how to use SAS formats to derive new categories without expensive data rewriting.) Then I used PROC SGPLOT to produce a step plot for the 24-hour cycle.

/* Count of readings per hour of the day */ 
title "SnackBot readings per hour";
proc freq data=mmlevels ;
 table time / out=perhour;
 format time hour2.;
ods graphics / height=400 width=800;
title "SnackBot readings per hour";
proc sgplot data=perhour des="Readings per hour of day";
 step x=time y=count;
 xaxis min='0:0:0't max='24:0:0't label="Time of day" grid;
 yaxis label="Servings";

SnackBot hour step

From the chart, we can see that most M&M "events" happen around 11am, and then again between 2pm and 4pm. From personal experience, I can confirm that those are the times when I hear the M&Ms calling to me.

Expand the time series to regular intervals

The SnackBot website can tell you how many M&Ms are remaining right now. But what if you want to know how many were remaining last Friday? Or on any typical Monday morning?

The sensor data that we've analyzed so far is sparse -- that is, there are data entries for each "change" event, but not for every discrete time interval in the study period. I don't know how the SnackBot sensor records its readings -- it might sample the M&M levels every minute, or every second. Regardless, the API reports (and probably stores) only the records that represent a change. If SnackBot records that the final 24 pieces were depleted at 25JUN2018:07:45:00 (a Monday morning) bringing the count to 0, how many M&Ms remain at 1pm later that day? The data don't tell us explicitly with a recorded reading. But we can assume at that point that the count was still 0. The next recorded reading occurs at 27JUN2018:10:30:00 (on a Wednesday, bringing the count to 1332 -- oh joy!).

If we want to create a useful time series visualization of the M&M candy counts over time, we need to expand the time series from these sparse recordings to regular intervals. SAS offers a few sophisticated time series procedures to accomplish this: PROC EXPAND, PROC TIMESERIES, and PROC TIMEDATA. Each of these offer powerful econometrics methods for interpolation and forecasting -- and that's more than we need for this situation. For my example, I took a more low-tech approach.

First, I created an empty data set with datetime entries at quarter-hour intervals, covering the study period of the data we're looking at.

/* Empty data set with 15 minute interval slots    */
/* Regular intervals for the entire "study" period */
data timeslots;
  last = datetime();
  length qhour 8;
  format qhour datetime20;
  drop last i;
  do i = &start. to last by '0:15:00't;
    qhour = i;

Then I used a DATA step to merge these empty slots with the actual event data that I had rounded to the nearest quarter hour (remember that?):

/* Merge the sample data with the timeslots */
data expand;
  merge mmlevels(keep=pieces qhour) timeslots;
  by qhour;

Finally, I used a variation of a last-observation-carried-forward (LOCF) approach to fill in the remaining empty slots. If a reading at 20MAY2018:11:15:00 reports 132 pieces remaining, then that value should be RETAINed for each 15-minute slot until the next reading at 20MAY2018:17:30:00. (That reading is 82 pieces -- meaning somebody helped themselves to 50 pieces. Recommended serving size for plain M&Ms is 20 pieces, but I'm not passing judgement.) I also recorded a text value for the day-of-week to help with the final visualization.

/* for empty timeslots, carry the sample data   */
/* forward, so we always have a count of pieces */
/* Variation on a LOCF technique                */
data final;
  set expand;
  length day $ 3;
  /* 3-char value for day of week */
  retain hold;
  if not missing(pieces) then
  else pieces=hold;
  drop hold;
  if not missing(pieces);

Now I have data that represents the regular intervals that we need.

SnackBot regular intervals

Putting it all together

For my final visualization, I created a series plot for the study period. It shows the rise and fall of M&Ms levels in one SAS break room over several weeks. For additional "color", I annotated the plot with a block chart to delineate the days of the week.

title 'Plain M&M pieces on S1 tracked by SnackBot';
ods graphics / height=300 width=1600;
proc sgplot data=final des='M&M pieces tracked by SnackBot';
  /* plot the data as a series */ 
  series x=qhour y=pieces / lineattrs=(color=navy thickness=3px);
  /* Yes, these are the "official" M&M colors               */
  /* Will be applied in data-order, so works best when data */
  /* begins on a Sunday                                     */
  styleattrs datacolors=(red orange yellow green blue CX593B18 red);
  /* block areas to indicate days-of-week                   */
  block x=qhour block=day / transparency=0.65
    valueattrs=(weight=bold size=10pt color=navy);
  xaxis minor display=(nolabel);
  yaxis display=(nolabel) grid max=1600 minor;

You can see the pattern. M&Ms are typically filled on Wednesday to the canister capacity of about 1400 pieces. We usually enter into the weekend with 0 remaining, but there are exceptions. The week of May 27 was our Memorial Day holiday, which explains the lack of activity on Monday (and even Tuesday) during that week as SAS folks took advantage of a slow week with their vacation plans.

SnackBot visualization

More about SAS and M&Ms data

You can download the complete code for this example from my public Gist on GitHub. The example code should work with SAS University Edition and SAS OnDemand for Academics, as well as with any SAS environment that can reach the internet with PROC HTTP.

For more M&M data fun, check out Rick Wicklin's article about the distribution of colors in plain M&Ms. SnackBot does not (yet) report on how many and which color of M&Ms are taken per serving, but using statistics, we can predict that!

The post The Internet of Snacks: SnackBot data and what it reveals about SAS life appeared first on The SAS Dummy.

3月 122012

Did you oversleep this morning? If you live in the United States of America, Monday morning seems to have arrived just a bit earlier, accompanied by a bit more "dark" than usual.

That's because as good time-fearing citizens, we have all set our clocks ahead by one hour so as to conserve more daylight for the end of the day. It means that on Saturday night/Sunday morning, we give up an hour of sleep.

We also lose additional time as we wander around our houses and set our many dozens of clocks ahead by 60 minutes. Several years ago, in an effort to save my clock-setting labor, I purchased one of those "never-have-to-set-it-again" alarm clocks. It featured an internal battery (to guard against power outages) and sophisticated circuitry that already knew about Daylight Savings Time. With this new clock, I would never have to worry about waking up too late (or too early) on the two designated Sunday mornings each year.

But guess what? Shortly after I bought the clock, Congress changed the Laws of Time and effectively extended DST by about a month. My "smart" clock was now remarkably stupid. Not only did it fail to automatically adjust the time when it was supposed to, but it actually continued to adjust the time according to the old schedule, thus interfering with my sleep patterns on two errant weekends per year while offering no help on the "official" time-change dates. John D. Cook says that "DST is a huge mess", and for a while in my household, it was extra messy.

Now I try to make sure that every new clock that I buy can connect to the Internet. Why? Because Internet-connected devices track time in UTC, and not in local time. For display purposes, our devices provide the arithmetic service of calculating the UTC value into "local" time. And although that formula might change a couple of times per year (UTC-5 instead of UTC-6), the devices take care of all of that for me.

UTC is used for most computer-related activity that features time stamps, thus providing an objective measure of time without local bias. But sometimes we encounter data that doesn't use UTC and we must adjust our analysis accordingly. That's why it's important to know how to calculate the UTC offset within your SAS programs. And you can also use SAS to plan ahead, and note exactly which future Sundays will be affected by DST. That is, until the DST debate results in more changes.

tags: time series data, utc
9月 202011

Like millions of other Americans, I recently was asked to make a decision of tremendous importance to my household -- a decision that would affect the welfare of everyone in my family. That decision, of course, was whether to continue to receive Netflix movies by mail, or opt for the less-expensive "streaming only" subscription.

Let me just say this up front: we love our Netflix subscription.  We subscribed way back in 2005 on the low-cost "one-disc-at-a-time" plan, and since then we've seen over 180 movies that we received on-loan, delivered via the US Postal Service.  Many of these were movies that we never would have seen otherwise: older films, independent films, and many other titles that would have been difficult to find at a local video rental shop.

But having arrived at this crossroads, it's a good time to try and measure just how much the DVD-by-mail option would cost us, and then we can decide what action to take.  And of course, I used SAS Enterprise Guide to analyze my Netflix history, and thus created additional insight about the movie-rental pattern of my household.

Getting my account history into SAS

One of the things that I like about Netflix is how they allow you to see your entire account history online.   At the time that I'm writing this, this URL will get you to your complete DVD shipping activity (although this could change, as Netflix is restructuring their DVD-by-mail business):


In order for that URL to work, you must be already signed in to your Netflix account in your web browser.  While there are several ways to turn this web page into data, I found the easiest method within Microsoft Excel.  On the Data ribbon menu, select Get External Data->From Web.  On the New Web Query window, paste the URL in the Address field and click Go.  You'll see a preview of the web page where you can select the table of your account history to import as data.

When the content is transferred into the Excel spreadsheet, I saved the file (as NetflixHistory.xlsx), and closed Microsoft Excel.  The spreadsheet doesn't look like data that's ready to analyze yet (lots of extra rows and space as you can see in the example below), but that's okay.  I can fix all of that easily in SAS.

With the data now in an Excel spreadsheet, I fired up SAS Enterprise Guide and selected File->Import Data. After just a few clicks through the wizard, I've got the data in a work data set.

Cleaning the data and calculating the value

The data records for the account history are very simple, containing just four fields for each movie: DVD_Title, Rating (whether we liked it), Shipped (date when Netflix shipped the movie out to me), and Returned (date when Netflix received the movie back from me).  My goal for this project is to measure value, and there are no measures in this data...yet.  I also need to filter out the "garbage" rows -- those values that had some purpose in the HTML page, but don't add anything to my analysis.

I'm going to accomplish all of this within the SAS Enterprise Guide query builder, building it all into a single step.  First, I need to design a few filters to clean up the data, as shown here:

The first three filters will drop all of the rows that don't contain information about a DVD title or shipment.  The last two filters will drop any records that reflect multi-disc shipments, or the occasional replacement shipment from when I reported a damaged disc.  Those are rare events, and they don't contain any information that I need to include in my analysis.

Next, I want to calculate some measures.  The most obvious measure to calculate is "How many days did we have the movie" -- the difference between the Shipped Date and Received Date.  And while that number will be interesting, by itself it doesn't convey value or cost.  I want a number that I can express in dollar terms.  To come up with that number, I will employ the tried-and-true method used by data hackers all over the world: I will Make Something Up.

In this case, I'm going to create a formula that reflects my cost for each movie.  That formula is:

(Netflix Monthly Fee / Days In a Month) * Days We Had the Movie = Cost of the Movie

Using the query builder, I calculated new columns with these values.  I assumed the fee was $10/month (varied over time, but this constant is good enough) and that there are 30 days in a month (again, a "good enough" constant).  Here are the new columns in the query builder:

After applying these filters and calculations, I finally have a result set that looks interesting to analyze:

By sorting this data by CostPerMovie, I can see that the "cheapest movies" were those that we had out for only 3 days, which is the fastest possible turnaround (example: receive in the mailbox on Monday, watch Monday night, mail out on Tuesday, Netflix receives on Wednesday and ships the next DVD in our queue).  By my reckoning, those DVDs cost just $1 to watch.  The most expensive movie in my list came to $26.33, a Mad Men DVD that sat for 79 days while we obviously had other things to do besides watch movies.

Visualizing the results

To visualize the "Days Out" as a time series, I used the SGSCATTER procedure to generate a simple plot.  You can see that at the start of our Netflix subscription, we were enthusiastic about watching the movies immediately after we received them, and then returning them in order to release the next title from our queue.  These are where the DaysOut values are closer to zero.  But as time goes on and Life Gets Busy, there are more occurrences of "extended-period loans", with higher values for DaysOut.

Because I've calculated the cost/movie with my sophisticated model, I can plot the cost over time by using the SERIES statement in PROC SGPLOT, with this result:

This plot makes it easy to see that I've had a few "high cost" DVDs.  But it's still difficult to determine an actual trend from this, because the plot is -- and this is a technical term -- "too jumpy".  To remedy that, I used another task in SAS Enterprise Guide -- one that I probably have no business using because I don't fully understand it.  I used the Prepare Time Series Data task (found under the Tasks->Time Series menu) to accomplish two things:

  • Calculate the moving average of the CostPerMovie over each 10-movie interval, in an effort to "smooth out" the variance among these values.
  • Interpolate the CostPerMovie value for all dates that are covered by these data, so that on any given day I can see the "going rate" of my CostPerMovie, even if that date is not a Shipped Date or Received Date.

This magic happens behind the scenes by using PROC EXPAND, part of SAS/ETS.  And although PROC EXPAND creates some nice plots by using ODS statistical graphics, I created my own series plot again by using PROC SGPLOT:

This plot confirms what I already know: our movies have become more expensive over the past 6 years of my subscription.  But more importantly, it tells me by how much: from an initial cost of $3-4, it's now up to nearly $6 per movie -- based solely on our pattern of use.

Important note: The data I collected and analyzed covers only the DVDs we've had shipped to us.  It does not include any movies or shows that we've watched by streaming the content over the Internet.  The "instant watch" feature is an important component of the Netflix model, and we do use this quite a bit.  I know that this accounts for much of the decrease in frequency for our DVD watching.  But by changing their pricing model, Netflix basically asked the question: how much is it worth to you to continue receiving movies by mail, independent of the streaming content?

And I answered that question: it's not worth $6 per DVD to me (as I reckon it, given my pattern of use).  Like millions of others, I've opted out of the DVD-by-mail service.  But we've kept the streaming service!  In a future post, I'll take a look at how we use the streaming content and what value we receive from it. [UPDATE: Here it is, the analysis of my streaming account.]

tags: moving average, Netflix, proc expand, SAS Enterprise Guide, SGPLOT, sgscatter, time series data