analytics conference

June 13, 2020

I’m going to be honest. When we decided back in March to pivot our largest event of the year from an in-person event to a virtual one, I wasn’t sure what we were getting ourselves into. I didn’t know how we could replace the in-person networking, the unlimited learning opportunities [...]

6 new ways to connect at virtual events was published on SAS Voices by Jenn Chase

May 16, 2020

Dr. Ryan McGarry might have the most uncanny timing of any documentary producer in history. Just weeks before the novel coronavirus began to saturate headlines, the emergency medicine physician’s Netflix documentary series Pandemic hit the screens of millions of viewers. Six months ago, many of the subjects featured in the [...]

What can we learn about COVID-19 from an executive producer of Pandemic? was published on SAS Voices by Jenn Chase

November 6, 2018

A few weeks ago I posted a cliffhanger-of-a-blog-post. I left my readers in suspense about which of my physical activities are represented in different sets of accelerometer data that I captured. In the absence of more details from me, the internet fan theories have been going wild. Well, it's time for the big reveal! I've created a SAS Visual Analytics report that shows each of these activity streams with the proper label:

Accelerometer measurements per activity

Were your guesses confirmed? Any surprises? Were you more impressed with my safe driving or with my reckless behavior on the trampoline?

Collecting and preparing accelerometer data

You might remember that this entire experiment was inspired by a presentation from Analytics Experience 2018. That's when I learned about an insurance company that built a smartphone app to collect data about driving behavior, and that the app relies heavily on accelerometer readings. I didn't have time or expertise to build my own version of such an app, but I found that there are several good free apps that can collect and export this data. I used an app called AccDataRec on my Android phone.

Each "recording session" generates a TSV file -- a tab-separated file that contains a timestamp and a measurement for each of the accelerometer axes (X, Y, and Z). In my previous post, I shared tips about how to import multiple TSV files in a single step. Here's the final version of the program that I wrote to import these data:

filename tsvs "./accel/*.tsv";
libname out "./accel";
data out.accel;
  length
    casefile $ 100 /* to write to data set */
    counter 8 
    timestamp 8 
    timestamp_sec 8
    x 8 y 8 z 8 
    filename $ 25        
    tsvfile $ 100 /* to hold the value */
  ;
  format timestamp datetime22.3 timestamp_sec datetime20.;
  /* store the name of the current infile */
  infile tsvs filename=tsvfile expandtabs;
  input counter timestamp x y z filename;
  /* keep the name of the file that this record came from */
  casefile=tsvfile;
  /* convert epoch time into SAS time */
  timestamp=dhms('01jan1970'd, 0, 0, timestamp / 1000);
  /* create a timestamp with the precision of one second */
  timestamp_sec = intnx('second',timestamp,0);
run;

Some notes:

  • I converted the timestamp value from the data file (an epoch time value) to a native SAS datetime value by using this trick.
  • Following advice from readers on my last post, I replaced the DLM= option with the simpler EXPANDTABS option on the INFILE statement.
  • Some SAS time-series procedures don't handle the more-precise timestamp values with fractions of seconds well. I computed a less precise field, rounding down to the second, just in case.
  • For my reports in this post, I really need only 5 fields: counter (the ordinal sequence of measurements), x, y, z, and the filename (mapping to activity).
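
If you want to check the two timestamp steps outside of SAS, here is a minimal Python sketch of the same idea: convert epoch milliseconds to a datetime, then drop the fractional seconds. The sample value is made up for illustration and does not come from the recorded files, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def epoch_ms_to_datetime(ms):
    """Convert milliseconds since 1970-01-01 UTC into a datetime."""
    return EPOCH + timedelta(milliseconds=ms)

def floor_to_second(ts):
    """Drop fractional seconds, similar in spirit to INTNX('second', ts, 0)."""
    return ts.replace(microsecond=0)

sample_ms = 1538427600123  # hypothetical reading, not from the real files
ts = epoch_ms_to_datetime(sample_ms)
print(ts.isoformat())
print(floor_to_second(ts).isoformat())
```

The division by 1000 in the SAS program plays the same role as `timedelta(milliseconds=...)` here: the app records milliseconds, while the datetime arithmetic works in seconds.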

The new integrated SAS Viya environment makes it simple to move from one task to another, without needing to understand the SAS product boundaries. I used the Manage Data function (that's SAS Data Management, but does that matter?) to upload the ACCEL data set and make it available for use in my reports. Here's a preview:

Creating a SAS Visual Analytics report

With the data now available and loaded into memory, I jumped to the Explore and Visualize Data activity. This is where I can use my data to create a new SAS Visual Analytics report.

At first, I was tempted to create a Time Series Plot. My data does contain time values, and I want to examine the progression of my measurements over time. However, I found the options of the Time Series Plot to be too constraining for my task, and it turns out that for this task the actual time values really aren't that important. What's important is the sequence of the measurements I've collected, and that's captured as an ordinal in the counter value. So, I selected the Line Plot instead. This allowed for more options in the categorical views -- including a lattice row arrangement that made it easy to see the different activity patterns at a glance. This screen capture shows the Role assignments that I selected for the plot.

Adding a closer view at each activity

With the overview Line Plot complete, it's time to add another view that allows us to see just a single activity and provide a close-up view of its pattern. I added a second page to my report and dropped another Line Plot onto the canvas. I assigned "counter" to the category and the x, y, and z values to the Measures. But instead of adding a Lattice Row value, I added a Button Bar to the top of the canvas. My idea is to use the Button Bar -- which is good for navigating among a small number of values -- as a way to trigger a filter for the accelerometer data.

I assigned "filename" to the Category value in the Button Bar role pane. Then I used the Button Bar options menu (the vertical dots on the right) to add a New filter from selection, selecting "Include only selection".

With this Button Bar control and its filter in place, I can now switch among the data values for the different activities. Here's my "drive home" data -- it looks sort of exciting, but I can promise you that it was a nice, boring ride home through typical Raleigh traffic.

Phone mounted in my car for the drive home

The readings from the "kitchen table" activity surprised me at first. This activity was simply 5 minutes of my phone lying flat on my kitchen table. I expected all readings to hover around zero, but the z axis showed a relatively flat line closer to 10 meters-per-second-per-second. Then I remembered: gravity. This sensor registers Earth's gravity, which we are taught is 9.8 meters-per-second-per-second. The readings from my phone hovered around 9.6 -- maybe my house is in a special low-gravity zone, or the readings are a bit off.

Phone at rest on my kitchen table
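
To compare a resting reading with Earth's gravity regardless of how the phone is oriented, one option is to compute the length of the (x, y, z) vector and express it in multiples of g. The following Python sketch assumes a made-up resting sample that resembles the kitchen-table readings; the helper names are hypothetical.

```python
import math

STANDARD_GRAVITY = 9.80665  # m/s^2, the conventional value of 1 g

def magnitude(x, y, z):
    """Length of the acceleration vector, independent of phone orientation."""
    return math.sqrt(x * x + y * y + z * z)

def to_g(accel_ms2):
    """Express an acceleration in multiples of standard gravity."""
    return accel_ms2 / STANDARD_GRAVITY

# Hypothetical resting sample: phone flat, so nearly everything is on z.
resting = (0.05, -0.10, 9.60)
total = magnitude(*resting)
print(f"{total:.2f} m/s^2 = {to_g(total):.3f} g")
```

The same `to_g` conversion is how a peak reading on one axis can be restated as "7 or 8 times Earth's gravity."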

Finally, let's take a closer look at my trampoline workout. Since I was holding my phone upright, it looks like the x-axis felt the brunt of the acceleration forces. According to these readings, my phone was subjected to a g-force of 7 or 8 times that of Earth's gravity -- but just for a split second. And since my phone was in my hand and my arm was flailing around (I am not a graceful rebounder), my phone was probably experiencing more force than my body was.

Bounding on the trampoline as high as I can

Some love for the Windows 10 app

My favorite method to view SAS Visual Analytics reports is through the SAS Visual Analytics application that's available for Windows 10 and Windows mobile devices. Even on my desktop, where I have a full web browser to help me, I like the look and feel of the specialized Windows 10 app. The report screen captures for this article were rendered in the Windows 10 app. Check out this article for more information about the app. You can try the app for free, even without your own SAS Viya environment. The app is hardwired with a connection to the SAS demo reports at

See also

This is the third (and probably final) article in my series about accelerometer data. See these previous posts for more of the fun background information:

The post Reporting on accelerometer data with SAS Visual Analytics appeared first on The SAS Dummy.

October 17, 2018

What can you learn about wildfires when you provide a room full of analysts with 7 years of US wildfire data and the tools they need to analyze it? A lot. At a recent data dive, we split 35 data scientists into 9 teams, provided multiple data sets containing information [...]

3 wildfire predictions from a recent SAS hackathon that might surprise you was published on SAS Voices by Alison Bolen

October 9, 2018

As part of my research for a different article, I recently collected data about my driving commute home via an accelerometer recorder app on my phone. The app generates a simple TSV file. (A TSV file is like a CSV file, but instead of a comma separator, it uses a TAB character to separate the values.) The raw data looks like this:

Related from Analytics Experience 2018: Using your smartphone accelerometer to build a safe driving profile

With SAS, it's simple to import the file into a data set. Here's my DATA step code that uses the INFILE statement to identify the file and how to read it. Note that the DLM= option references the hexadecimal value for the TAB character in ASCII (09x), the delimiter for fields in this data.

data drive;
  infile "/home/chris.hemedinger/tsv/drivehome.tsv" 
    dlm='09'x; /* TAB is the field delimiter */
  length counter 8 
         timestamp 8 
         x 8 y 8 z 8 
         filename $ 25;
  input counter timestamp x y z filename;
run;

In my research, I didn't stop with just my drive home. In addition to my commute, I collected data about 4 other activities, and thus accumulated a collection of TSV files. Here's my file directory in my SAS OnDemand for Academics account:

To import each of these data files into SAS, I could simply copy and paste my code 4 times and then replace the name of the file for each case that I collected. After all, copy-and-paste is a tried and true method for writing large volumes of code. But as the number of code lines grows, so does the maintenance work. If I want to add any additional logic into my DATA step, that change would need to be applied 5 times. And if I later come back and add more files to my TSV collection, I'll need to copy-and-paste the same code blocks for my additional cases.

Using a wildcard on the INFILE statement

I can read all of my TSV files in a single step by using a wildcard on the INFILE statement. I replaced the file name with *.tsv, which tells SAS to match on all of the TSV files in the folder and process each of them in turn. I also changed the name of the data set from "drive" to the more generic "accel".

data accel;
  infile "/home/chris.hemedinger/tsv/*.tsv" 
    dlm='09'x; /* TAB is the field delimiter */
  length counter 8 
         timestamp 8 
         x 8 y 8 z 8 
         filename $ 25;
  input counter timestamp x y z filename;
run;

The SAS log shows which files have been processed and added into my data set.

With a single data set that has all of my accelerometer readings, I can easily segment these with a WHERE clause in later processing. It's convenient that my accelerometer app also captured the name of each TSV file so that I can keep these cases distinct. A quick PROC FREQ shows the allocation of records for each case that I collected.
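
For readers who think in other languages, the two operations mentioned here (counting records per case and filtering to one case) can be sketched in a few lines of Python. The records and case names below are invented for illustration; they only mimic the layout of the combined data set.

```python
from collections import Counter

# Hypothetical records: (counter, x, y, z, filename).
# Real rows would come from the imported TSV files.
records = [
    (1, 0.1, 9.5, 1.2, "drive_home"),
    (2, 0.2, 9.4, 1.1, "drive_home"),
    (1, 0.0, 0.1, 9.6, "kitchen_table"),
]

def count_by_case(rows):
    """Tally how many records belong to each file name (the last field)."""
    return Counter(row[-1] for row in rows)

def select_case(rows, name):
    """Keep only the rows for one case, like a WHERE clause on filename."""
    return [row for row in rows if row[-1] == name]

for name, n in sorted(count_by_case(records).items()):
    print(name, n)
```

Because every record carries its case name, neither operation needs the data to be split into separate tables first.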

Add the filename into the data set

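If your input data files don't contain an eponymous field name, then you will need to use a different method to keep track of which records come from which files. The FILENAME= option on the INFILE statement stores the name of the file currently being read in a DATA step variable (tsvfile in this program).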

filename tsvs "/home/chris.hemedinger/tsv/*.tsv";
data accel;
  length casefile $ 100 /* to write to data set */
     counter 8 
     timestamp 8 
     x 8 y 8 z 8 
     filename $ 25
     tsvfile $ 100 /* to hold the value */
  ;
  /* store the name of the current infile */
  infile tsvs filename=tsvfile 
    dlm='09'x ;
  input counter timestamp x y z filename;
  /* the FILENAME= variable is not saved, so copy it to casefile */
  casefile=tsvfile;
run;

In the output, you'll notice that we now have the fully qualified file name that SAS processed using INFILE.
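
The wildcard-plus-file-name pattern is not unique to SAS. As a rough Python sketch of the same idea, the function below reads every *.tsv file in a folder and tags each row with the path it came from. It assumes the files have no header row and start with the five fields counter, timestamp, x, y, and z; the function name and dictionary keys are hypothetical.

```python
import csv
import glob
import os

def read_tsv_folder(folder):
    """Read every *.tsv file in `folder`, tagging each row with its source path."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.tsv"))):
        with open(path, newline="") as handle:
            for fields in csv.reader(handle, delimiter="\t"):
                if not fields:
                    continue  # skip blank lines
                counter, timestamp, x, y, z = fields[:5]
                rows.append({
                    "casefile": path,  # full path of the file being read
                    "counter": int(counter),
                    "timestamp": int(timestamp),
                    "x": float(x), "y": float(y), "z": float(z),
                })
    return rows
```

A call such as `read_tsv_folder("./accel")` would return one list covering all files, which mirrors the single-data-set approach used here.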

Managing data files: fewer files is better

Because we started this task with 5 distinct input files, it might be tempting to store the records in separate tables: one for each accelerometer case. While there might be good reasons to do that for some types of data, I believe that we have more flexibility when we keep all of these records together in a single data set. (But if you must split a single data set into many, here's a method to do it.)

In this single data set, we still have the information that keeps the records distinct (the name of the original files), so we haven't lost anything. SAS procedures support CLASS and BY statements that allow us to simplify our code when reporting across different groups of data. We'll have fewer blocks of repetitive code, and we can accomplish more across all of these cases before we have to resort to SAS macro logic to repeat operations for each file.

As a simple example, I can create a simple visualization with a single PROC SGPANEL step.

ods graphics / width=1600 height=400;
proc sgpanel data=accel;
 panelby filename / columns=5 noheader;
 series x=counter y=x;
 series x=counter y=y;
 series x=counter y=z;
 colaxis display=none minor;
 rowaxis label="m/s**2" grid;
 where counter<11000;
run;

Take a look at these 5 series plots. Using just what you know of the file names and these plots, can you guess which panel represents which accelerometer case?

Leave your guess in the comments section. I'll explore these data further in a future blog post!

The post How to read multiple text files in SAS appeared first on The SAS Dummy.

October 1, 2018

The solar farm at SAS world headquarters is a treasure trove of data. Jessica Peter, Senior User Experience Designer at SAS, had an idea about using that treasure in an art installation to show how data can tell a story. Her idea became a reality when she and others at SAS [...]

What can you learn from an AI art installation? was published on SAS Voices by Lane Whatley

September 26, 2018

Here's a challenge.  You're a passenger in an automobile, and you've been asked to evaluate whether the driver's habits behind the wheel are "safe" or "risky."  But there's a catch: you have to collect all of your information with your eyes closed.

Think about it -- with your eyes shut, you're denied important information such as your location, traffic conditions, speed limits and traffic signals, and weather conditions.  Sightless, your only source of data comes from your sense of motion as the vehicle accelerates, slows down, and turns.

Sunish Menon, a Ph.D. researcher at State Farm Insurance, faced this challenge with his team as they designed the data collection scheme for State Farm's Drive Safe and Save program.  Sunish shared his experience and ideas with attendees at the Analytics Experience 2018 conference in San Diego.

Accelerometer: simple measurements with rich results

Sunish's team knew that they were going to build a smartphone app to support the Drive Safe and Save program.  After all, a smartphone can collect a ton of information: location with GPS, phone use during a trip, traffic conditions, trip duration and speed, and more.  But accessing these details has a cost.  Every sensor on a phone consumes precious battery life, and potential users might not be comfortable sharing their location constantly with an insurance company -- even if there is a premium discount at stake.  So what's the minimum amount of information you can collect and still assemble a meaningful profile?  Maybe capturing the changes in speed and direction is enough.

Like people, your smartphone also has a "sense of motion" -- it's called an accelerometer.  As you might guess from its name, an accelerometer is a small electrical sensor that measures acceleration.  For a quick physics refresher, let's review the difference between speed, velocity, and acceleration:

Speed: How fast an object is moving, usually expressed as distance over time (example: 10 meters per second)
Velocity: How fast an object is moving and in which direction (example: 10 meters per second, to the east)
Acceleration: The rate of change in the velocity of an object. Since it’s a rate of change, it’s expressed as distance over time (speed), per unit of time.  For example, to change speed from 0 to 60 miles-per-hour in 10 seconds, an object must accelerate at 2.682 meters per second per second, or 2.682 m/s².
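
The 0-to-60 figure is easy to verify. Here is a small Python sketch of the arithmetic, assuming constant acceleration over the 10 seconds; the function name is hypothetical.

```python
MPH_TO_MS = 1609.344 / 3600  # meters per mile divided by seconds per hour

def average_acceleration(v_start_mph, v_end_mph, seconds):
    """Average acceleration in m/s^2 for a speed change given in miles per hour."""
    delta_v = (v_end_mph - v_start_mph) * MPH_TO_MS
    return delta_v / seconds

print(round(average_acceleration(0, 60, 10), 3))  # prints 2.682
```

Sixty miles per hour is about 26.82 meters per second, and spreading that change over 10 seconds gives roughly 2.68 meters per second gained every second.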


Your smartphone measures acceleration across three axes, traditionally labeled as x, y, and z.  The measurements are sampled multiple times per second.  Each measurement reflects acceleration across one of the axes.  Taken together, you can get a sense for the phone's overall direction.

I've included Sunish's diagram of how these axes are oriented on a smartphone.  The x axis is horizontal along the face of the phone, and the y axis is vertical along the face.  The z axis is along the perpendicular plane passing through the center of the phone.  Depending on the direction of the phone's movement, acceleration values might have a positive or negative value.

Capturing my commute data

Inspired by Sunish's presentation, I decided to get a bit of hands-on practice with accelerometer data. I installed a free app on my phone to capture the raw data from the accelerometer.  Here's what the data values look like, as measured from the start of my driving commute from work to home.

In these data, the first value is a record counter.  The second "big number" value is a timestamp value in Unix epoch format.  That's the number of milliseconds since midnight on January 1, 1970.  And the next three values are the acceleration measurements for the x, y, and z axes, respectively.  Acceleration is measured in meters per second squared, or m/s².  For reference, keep in mind that Earth's gravity -- the force that keeps us grounded (literally) -- is about 9.8 m/s² (1g).

The data from my commute contains over 85,000 measurements, captured over about 30 minutes (it was a busy Friday afternoon).  I used the SERIES plot in PROC SGPLOT to create a simple visualization.  Can you tell where the longest stoplight occurs?  (It's right near the shopping mall -- I really don't like that intersection.)

commute accelerometer

Teasing out "events" from the data

In my commute as represented in the above chart, it seems simple enough to locate the mundane events of accelerating, braking, and waiting in traffic.  There are a few spikes and dips that might represent more dramatic braking events, or perhaps a fast start from a traffic light (my car has some pep!).  Let's use some histograms to look at these measurements another way.

histograms of x y z

Most of my commute is uninteresting, as I'm driving at a steady speed or waiting in traffic.  The histogram shows the x axis measurements are centered around 0.  But why don't the y and z axes behave the same?  During my drive, my phone is positioned nearly vertical in a dashboard holder, with perhaps a 30-degree forward tilt.  Gravity works on all of us at about 9.8 m/s².  With my phone at the vertical-ish tilt, you can see most of that force applied to the y axis, with some shared with the z axis.
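
A little trigonometry shows roughly how gravity would be shared between the two axes at that angle. This Python sketch assumes an idealized, motionless phone tilted exactly 30 degrees from vertical and ignores the sign conventions of any particular device; the function name is hypothetical.

```python
import math

G = 9.8  # m/s^2, the rounded value used in the text

def gravity_components(tilt_degrees):
    """Split gravity between the y and z axes for a phone tilted from vertical.

    tilt_degrees = 0 means perfectly upright (all gravity on y);
    tilt_degrees = 90 means lying flat (all gravity on z).
    """
    tilt = math.radians(tilt_degrees)
    return G * math.cos(tilt), G * math.sin(tilt)

y, z = gravity_components(30)
print(f"y = {y:.2f} m/s^2, z = {z:.2f} m/s^2")
```

Under these assumptions the y axis carries about 8.5 m/s² and the z axis about 4.9 m/s², which is consistent with histograms where y and z are centered away from zero while x is not.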

Since the data collected represents a time series, it makes sense to apply a time series analysis to see if we can decompose its components and make the interesting events more obvious.  In Sunish's case, his team used [...].  PROC TIMESERIES also offers a SPECTRA statement for spectrum analysis, which provides similar options.

Here's a tip: if you are trying this on your own and you get stuck, post a question to the SAS forecasting/time series community.  Experts are eager to answer!

Confounding factors when analyzing a drive

During a drive, the measurements from the accelerometer "start at zero" (or their natural baseline) only when the phone is lying flat, with the top of the phone pointed toward the front of the car.  But who keeps their phone stationed like that?  When I'm driving alone in my car, my phone is usually in a holder mounted on the dash, positioned nearly vertical, tilted slightly.  Or it's in my pants pocket.

Sunish presented a series of techniques to help control for this -- all of them applying more math than I am qualified to describe.  The smartphone also has a gyroscope sensor, which can measure the phone's "tilt" along any of its axes (labeled as pitch, roll, and yaw).  Combining these measurements with the acceleration readings, as well as controlling for the force of gravity, can help create a more accurate picture of your driving experience.
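
To give a feel for what "controlling for the force of gravity" can mean, here is one common, general-purpose approach: estimate the slowly changing gravity component with a simple low-pass filter and subtract it from each reading. This is only an illustrative Python sketch and is not necessarily the method that Sunish's team used; the smoothing factor and sample readings are arbitrary assumptions.

```python
def remove_gravity(samples, alpha=0.8):
    """Separate slow-changing gravity from faster motion on one axis.

    samples: acceleration readings for a single axis, in m/s^2.
    alpha:   smoothing factor (0 < alpha < 1); 0.8 is an arbitrary choice.
    Returns the readings with the running gravity estimate subtracted.
    """
    gravity = samples[0] if samples else 0.0  # seed with the first reading
    linear = []
    for value in samples:
        gravity = alpha * gravity + (1 - alpha) * value  # low-pass filter
        linear.append(value - gravity)                   # what is left is motion
    return linear

# Hypothetical y-axis readings: steady gravity with one brief jolt.
readings = [8.5, 8.5, 8.5, 12.5, 8.5, 8.5]
print([round(v, 2) for v in remove_gravity(readings)])
```

The steady readings come out near zero and the jolt stands out, regardless of how the phone happens to be tilted. A production approach would likely also use the gyroscope readings described above.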

When I'm not alone in the car, the phone might not stay in one place.  A passenger might pick it up to find directions, or to reference IMDB to settle a bet.  All of those movements will also register on the accelerometer, and how will a "safe driving app" judge these actions?  That's a challenge for analytics.

Safe driver versus risky driver: more than just measurement

Please do not rush to judgement about my driving behavior from this one sample.  In fact, even if you had hundreds of samples of my driving, it would probably be difficult to fairly judge whether I am a high-risk driver.

For insurance companies, assessment of risk is influenced more by how similar you are to known risky populations.  That's why young drivers tend to command higher premiums.  It's not just because they are young, exactly, but it's because insurance companies have to pay out more claims due to accidents caused by young drivers.  The cause might be due to their inexperience and immaturity, but that's almost beside the point.  It's a numbers game.

By collecting data from millions of car trips across a wide range of customers, an insurance company can apply machine learning to discern the patterns of drivers who make claims versus those who don't.  If your driving patterns are scored as too similar to those of other drivers who cause accidents...well, don't expect to receive a discount when you share your driving data.

Programs like State Farm's Drive Safe and Save accomplish more than just "proving" that you're a good driver.  The program incents you to be more conscious of your driving behavior, especially while you have that app running and collecting data.  State Farm provides periodic reports to program subscribers that show how your driving behavior compares (favorably or not) to other drivers in the pool.  The gamification and feedback aspect of the program might do just as much to improve driving as the promise of a discount.


Using your smartphone accelerometer to build a safe driving profile was published on SAS Users.

September 24, 2018

If you were to ask Tricia Wang, PhD, about real business growth, she would tell you that it lies outside the boundaries of the known. Not everything valuable is measurable, she would say. And big data is hiding new customers in the market from you. Wang is more than just [...]

Don't just do digital. Be digital. was published on SAS Voices by Stephanie DiRico

September 20, 2018

Referred to as the instigator of innovation, Mick Ebeling works to expand the human possibilities of technology through his company, Not Impossible Labs, whose mission is to change the world through technology and story. Powered by a team of thinkers, doers, creators and hackers, Not Impossible Labs works on inspirational [...]

What could you create if nothing were impossible? was published on SAS Voices by Anjelica Cummings

September 19, 2018

Like most people, I believed that the process of diagnosing and treating cancer begins with a biopsy.  If cancer is suspected, a doctor will extract a small tissue sample -- usually a tiny cylindrical "core sample" -- and examine it for cancer cells.  No cancer cells found -- that's good news!  But if cancer cells are present, then you have decisions to make about treatment.

A young woman named Richa Sehgal taught me that it's not so simple.  There aren't just two types of cells (cancerous and non-cancerous).  There are actually several types of cancer cells, and these do not all have the same importance when it comes to effective cancer treatment.  I learned this from Richa during her presentation at Analytics Experience 2018 -- a remarkable talk for several reasons, not the least of which is this: Richa Sehgal is a high school student, just 18 years old.  I'll have to check the record books, but this might make her the youngest-ever presenter at this premier analytics event.

Last year, Richa served as a student intern at the Canary Center at Stanford for Cancer Early Detection.  That's where she learned about the biology of cancer. She was allowed (encouraged!) to attend all lab meetings – and the experience opened her eyes to the challenges of cancer detection.

The importance of cell types and how cancer works

Unlike many technical conference talks that I've attended, Richa did not dive directly into the math or the code that support the techniques she was presenting.  Instead, Richa dedicated the first 25 minutes of her talk to teach the audience how cancer works.  And that primer was essential to help the (standing-room only!) audience to understand the relevance and value of her analytical solution.

What we call "cancer" is actually a collection of different types of cells.  Richa focused on three types: cancer stem cells (CSCs), transient amplifying cells (TACs), and terminally differentiated cells (TDCs).  CSCs are the most rare type within a tumor, making up just a few percent of the total mix of cells.  But because of their self-renewing qualities and their ability to grow all other types of cancer cells, these are very important to treat.  CSCs require targeted therapy -- that is, you can't use the same type of treatment for all cell types.  TACs usually require their own treatment, depending on the stage of the disease and the ability of a patient to tolerate the therapy.  The presence of TACs can activate CSCs to grow more cancer cells, so if you can't eradicate the CSCs (and that's difficult to manage with 100% certainty, as we'll see) then it is important to treat the TACs.  TDCs represent cancer cells that are no longer capable of dividing, and so generally don't require a treatment -- they will die off on their own.

(I know that my explanation here represents a simplistic view of cancer -- but it was enough of a framework to help me to understand the rest of Richa's talk.)

Richa Sehgal presents to a standing-room-only crowd at #AnalyticsX

The inexact science of biopsies

Now that we understand that cancer is made up of a variety of cell types, it makes sense to hope that when we extract a biopsy, we get a sample that represents this cell type variability.  Richa used an example of sampling a chocolate chip cookie.  If you were to use a needle to extract a core sample from a chocolate chip cookie...but didn't manage to extract any portions of the (disappointingly rare) chocolate chips, you might conclude that the cookie was a simple sugar cookie.  And as a result, you might treat that cookie differently.  (If you encountered a raisin instead...well..that might require a different treatment altogether.  Blech.)

But, as Richa told us, we don't yet know enough about the distribution and proximity of the different cell types for different types of cancers.  This makes it difficult to design better biopsies.  Richa is optimistic that it's just a matter of time -- medical science will crack this and we'll one day have good models of cancer makeup.  And when that day comes, Richa has a statistical method to make biopsies better.

Using SAS and Python to model cancer cell clusters

Most high school students wouldn't think to pick up SAS for use in their science fair projects, but Richa has an edge: her uncle works for SAS as a research statistician.  However, you don't need an inside connection to get access to SAS for learning.  In Richa's case, she used SAS University Edition hosted on AWS -- nothing to install, easy to access, and free to use for any learner.

Since she didn't have real data that represent the makeup of a tumor, Richa created simulations of the cancer cells, their different types and proximity to each other in a 3D model.  With this data in hand, she could use cluster analysis (PROC CLUSTER with Ward's method and then PROC TREE) to analyze a distance matrix that she computed.  The result shows how closely cancer cells of the same type are positioned to one another.  With that information, it's possible to design a biopsy that captures a highly variable collection of cells.
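
To make the "distance matrix" idea concrete, here is a small Python sketch that places a handful of simulated cells in a 3D cube and computes the pairwise Euclidean distances between them. This is not Richa's simulation: the cell positions are uniformly random, the three types are drawn with equal probability (real tumors contain far fewer CSCs), and all names are hypothetical. The clustering step itself is not shown.

```python
import math
import random

def simulate_cells(n, seed=0):
    """Return n hypothetical cells as (type, (x, y, z)) tuples in a unit cube."""
    rng = random.Random(seed)
    types = ["CSC", "TAC", "TDC"]
    return [(rng.choice(types), (rng.random(), rng.random(), rng.random()))
            for _ in range(n)]

def distance_matrix(points):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(p, q) for q in points] for p in points]

cells = simulate_cells(5)
matrix = distance_matrix([position for _, position in cells])
for row in matrix:
    print(" ".join(f"{d:.2f}" for d in row))
```

A matrix like this, with zeros on the diagonal and symmetric distances elsewhere, is the kind of input that a hierarchical clustering method can work from.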

Richa then used the Python package plotly to visualize the 3D model of her cell map.  (I didn't have the heart to tell her that she could accomplish this in SAS with PROC SGPLOT -- some things you just have to learn for yourself.)

A bright future -- for all of us

Clearly, Richa is an extremely accomplished young woman.  When I asked about her college plans for next year, she told me that she has a long list of "stretch schools" that she's looking at.  I'm having a difficult time understanding what constitutes a "stretch" for Richa -- I'm certain that any institution would love to have her.

Richa's accomplishments make me feel optimism for her, but also for the rest of us.  As a father of three daughters, I'm encouraged to see young women enter technical fields and be successful.  SAS is among the elite technology companies that work to close the analytics skills gap by providing free software, education, and mentoring.  Throughout the Analytics Experience 2018 conference, I've heard from many attendees who also saw Richa's talk -- they were similarly impressed and inspired.  Presentations like Richa's deliver on the conference tagline: "Analytics redefines innovation. You redefine the future."

Using machine learning to improve tumor biopsies was published on SAS Users.