
October 9, 2018

As part of my research for a different article, I recently collected data about my driving commute home via an accelerometer recorder app on my phone. The app generates a simple TSV file. (A TSV file is like a CSV file, but instead of a comma separator, it uses a TAB character to separate the values.) Each record contains a record counter, a timestamp, the x, y, and z acceleration measurements, and the name of the source file.

Related from Analytics Experience 2018: Using your smartphone accelerometer to build a safe driving profile

With SAS, it's simple to import the file into a data set. Here's my DATA step code that uses the INFILE statement to identify the file and how to read it. Note that the DLM= option references the hexadecimal value for the TAB character in ASCII (09x), the delimiter for fields in this data.

data drive;
  infile "/home/chris.hemedinger/tsv/drivehome.tsv" 
   dlm='09'x;
  length counter 8 
         timestamp 8 
         x 8 y 8 z 8 
         filename $ 25;
  input counter timestamp x y z filename;
run;

In my research, I didn't stop with just my drive home. In addition to my commute, I collected data about 4 other activities, and thus accumulated a collection of TSV files. Here's my file directory in my SAS OnDemand for Academics account:

To import each of these data files into SAS, I could simply copy and paste my code 4 times and then replace the name of the file for each case that I collected. After all, copy-and-paste is a tried and true method for writing large volumes of code. But as the number of code lines grows, so does the maintenance work. If I want to add any additional logic into my DATA step, that change would need to be applied 5 times. And if I later come back and add more files to my TSV collection, I'll need to copy-and-paste the same code blocks for my additional cases.

Using a wildcard on the INFILE statement

I can read all of my TSV files in a single step by specifying a wildcard in the file path: *.tsv. This tells SAS to match all of the TSV files in the folder and process each of them in turn. I also changed the name of the data set from "drive" to the more generic "accel".

data accel;
  infile "/home/chris.hemedinger/tsv/*.tsv" 
   dlm='09'x;
  length counter 8 
         timestamp 8 
         x 8 y 8 z 8 
         filename $ 25;
  input counter timestamp x y z filename;
run;

The SAS log shows which files have been processed and added into my data set.

With a single data set that has all of my accelerometer readings, I can easily segment these with a WHERE clause in later processing. It's convenient that my accelerometer app also captured the name of each TSV file so that I can keep these cases distinct. A quick PROC FREQ shows the allocation of records for each case that I collected.
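Here's a minimal sketch of that frequency check, assuming the combined data set is named "accel" as above:

proc freq data=accel;
  tables filename;   /* one row per source TSV file */
run;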

Add the filename into the data set

If your input data files don't contain an eponymous field name, then you will need a different method to keep track of which records come from which files. The FILENAME= option on the INFILE statement stores the name of the file currently being read in a variable that you designate. Because SAS doesn't write that variable to the output data set automatically, you copy its value into a regular variable.

filename tsvs "/home/chris.hemedinger/tsv/*.tsv";
data accel;
  length casefile $ 100 /* to write to the data set */
         counter 8
         timestamp 8
         x 8 y 8 z 8
         filename $ 25
         tsvfile $ 100 /* to hold the value from INFILE */
  ;

  /* store the name of the current infile */
  infile tsvs filename=tsvfile
    dlm='09'x;
  casefile=tsvfile;
  input counter timestamp x y z filename;
run;

In the output, you'll notice that we now have the fully qualified file name that SAS processed using INFILE.

Managing data files: fewer files is better

Because we started this task with 5 distinct input files, it might be tempting to store the records in separate tables: one for each accelerometer case. While there might be good reasons to do that for some types of data, I believe that we have more flexibility when we keep all of these records together in a single data set. (But if you must split a single data set into many, here's a method to do it.)

In this single data set, we still have the information that keeps the records distinct (the name of the original files), so we haven't lost anything. SAS procedures support CLASS and BY statements that allow us to simplify our code when reporting across different groups of data. We'll have fewer blocks of repetitive code, and we can accomplish more across all of these cases before we have to resort to SAS macro logic to repeat operations for each file.

As a simple example, I can create a simple visualization with a single PROC SGPANEL step.

ods graphics / width=1600 height=400;
proc sgpanel data=accel;
 panelby filename / columns=5 noheader;
 series x=counter y=x;
 series x=counter y=y;
 series x=counter y=z;
 colaxis display=none minor;
 rowaxis label="m/s**2" grid;
 where counter<11000;
run;

Take a look at these 5 series plots. Using just what you know of the file names and these plots, can you guess which panel represents which accelerometer case?

Leave your guess in the comments section. I'll explore these data further in a future blog post!

The post How to read multiple text files in SAS appeared first on The SAS Dummy.

October 1, 2018

The solar farm at SAS world headquarters is a treasure trove of data. Jessica Peter, Senior User Experience Designer at SAS, had an idea about using that treasure in an art installation to show how data can tell a story. Her idea became a reality when she and others at SAS [...]

What can you learn from an AI art installation? was published on SAS Voices by Lane Whatley

September 26, 2018

Here's a challenge.  You're a passenger in an automobile, and you've been asked to evaluate whether the driver's habits behind the wheel are "safe" or "risky."  But there's a catch: you have to collect all of your information with your eyes closed.

Think about it -- with your eyes shut, you're denied important information such as your location, traffic conditions, speed limits and traffic signals, and weather conditions.  Sightless, your only source of data comes from your sense of motion as the vehicle accelerates, slows down, and turns.

Sunish Menon, a PhD researcher at State Farm Insurance, faced this challenge with his team as they designed the data collection scheme for State Farm's Drive Safe and Save program.  Sunish shared his experience and ideas with attendees at the Analytics Experience 2018 conference in San Diego.

Accelerometer: simple measurements with rich results

Sunish's team knew that they were going to build a smartphone app to support the Drive Safe and Save program.  After all, a smartphone can collect a ton of information: location with GPS, phone use during a trip, traffic conditions, trip duration and speed, and more.  But accessing these details has a cost.  Every sensor on a phone consumes precious battery life, and potential users might not be comfortable sharing their location constantly with an insurance company -- even if there is a premium discount at stake.  So what's the minimum amount of information you can collect and still assemble a meaningful profile?  Maybe capturing the changes in speed and direction is enough.

Like people, your smartphone also has a "sense of motion" -- it's called an accelerometer.  As you might guess from its name, an accelerometer is a small electrical sensor that measures acceleration.  For a quick physics refresher, let's review the difference between speed, velocity, and acceleration:

  • Speed: how fast an object is moving, usually expressed as distance over time (example: 10 meters per second).
  • Velocity: how fast an object is moving and in which direction (example: 10 meters per second, to the east).
  • Acceleration: the rate of change in the velocity of an object. Since it's a rate of change, it's expressed as distance over time (speed), per unit of time. For example, to change speed from 0 to 60 miles per hour in 10 seconds, an object must accelerate at 2.682 meters per second per second, or 2.682 m/s².
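To check that last figure (60 miles per hour is about 26.82 meters per second):

\[ a = \frac{\Delta v}{\Delta t} = \frac{26.82\,\text{m/s} - 0\,\text{m/s}}{10\,\text{s}} \approx 2.682\,\text{m/s}^2 \]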

 

Your smartphone measures acceleration across three axes, traditionally labeled as x, y, and z.  The measurements are sampled multiple times per second.  Each measurement reflects acceleration across one of the axes.  Taken together, you can get a sense for the phone's overall direction.

I've included Sunish's diagram of how these axes are oriented on a smartphone.  The x axis is horizontal along the face of the phone, and the y axis is vertical along the face.  The z axis is perpendicular to the face of the phone, passing through its center.  Depending on the direction of the phone's movement, acceleration values might have a positive or negative value.

Capturing my commute data

Inspired by Sunish's presentation, I decided to get a bit of hands-on practice with accelerometer data. I installed a free app on my phone to capture the raw data from the accelerometer.  Here's what the data values look like, as measured from the start of my driving commute from work to home.

In these data, the first value is a record counter.  The second "big number" value is a timestamp value in Unix epoch format.  That's the number of milliseconds since midnight on January 1, 1970.  And the next three values are the acceleration measurements for the x, y, and z axes, respectively.  Acceleration is measured in meters-per-second squared, or m/s².  For reference, keep in mind that Earth's gravity -- the force that keeps us grounded (literally) -- is about 9.8 m/s² (1g).

The data from my commute contains over 85,000 measurements, captured over about 30 minutes (it was a busy Friday afternoon).  I used the SERIES plot in PROC SGPLOT to create a simple visualization.  Can you tell where the longest stoplight occurs?  (It's right near the shopping mall -- I really don't like that intersection.)
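Here's a minimal sketch of that plot, assuming the commute readings are in the "drive" data set created by the earlier import step:

proc sgplot data=drive;
  series x=counter y=x;
  series x=counter y=y;
  series x=counter y=z;
  xaxis display=none;       /* the raw counter values aren't meaningful to read */
  yaxis label="m/s**2" grid;
run;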

[Figure: accelerometer readings from my commute, plotted as series]

Teasing out "events" from the data

In my commute as represented in the above chart, it seems simple enough to locate the mundane events of accelerating, braking, and waiting in traffic.  There are a few spikes and dips that might represent more dramatic braking events, or perhaps a fast start from a traffic light (my car has some pep!).  Let's use some histograms to look at these measurements another way.

[Figure: histograms of the x, y, and z measurements]

Most of my commute is uninteresting, as I'm driving at a steady speed or waiting in traffic.  The histogram shows the x axis measurements are centered around 0.  But why don't the y and z axes behave the same?  During my drive, my phone is positioned nearly vertical in a dashboard holder, with perhaps a 30-degree forward tilt.  Gravity works on all of us at about 9.8 m/s².  With my phone at the vertical-ish tilt, you can see most of that force applied to the y axis, with some shared with the z axis.

Since the data collected represents a time series, it makes sense to apply a time series analysis to see if we can decompose its components and make the interesting events more obvious.  That's the approach Sunish's team took; in SAS, PROC TIMESERIES offers a SPECTRA statement for spectrum analysis with similar options.
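I haven't reproduced Sunish's analysis, but here's a hedged sketch of what a spectrum-analysis step could look like with PROC TIMESERIES (part of SAS/ETS), assuming the readings are in the "drive" data set:

proc timeseries data=drive outspectra=work.spectra;
  var y;                  /* analyze the y-axis measurements */
  spectra p s / adjmean;  /* periodogram and spectral density, mean-adjusted */
run;

Peaks in the resulting periodogram can point to strong repeating cycles in the signal -- such as the rhythm of stop-and-go traffic.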

Here's a tip: if you are trying this on your own and you get stuck, post a question to the SAS forecasting/time series community.  Experts are eager to answer!

Confounding factors when analyzing a drive

During a drive, the measurements from the accelerometer "start at zero" (or their natural baseline) only when the phone is lying flat, with the top of the phone pointed toward the front of the car.  But who keeps their phone stationed like that?  When I'm driving alone in my car, my phone is usually in a holder mounted on the dash, positioned nearly vertical, tilted slightly.  Or it's in my pants pocket.

Sunish presented a series of techniques to help control for this -- all of them applying more math than I am qualified to describe.  The smartphone also has a gyroscope sensor, which can measure the phone's "tilt" along any of its axes (labeled as pitch, roll, and yaw).  Combining these measurements with the acceleration readings, as well as controlling for the force of gravity, can help create a more accurate picture of your driving experience.

When I'm not alone in the car, the phone might not stay in one place.  A passenger might pick it up to find directions, or to reference IMDB to settle a bet.  All of those movements will also register on the accelerometer, and how will a "safe driving app" judge these actions?  That's a challenge for analytics.

Safe driver versus risky driver: more than just measurement

Please do not rush to judgement about my driving behavior from this one sample.  In fact, even if you had hundreds of samples of my driving, it would probably be difficult to fairly judge whether I am a high-risk driver.

For insurance companies, assessment of risk is influenced more by how similar you are to known risky populations.  That's why young drivers tend to command higher premiums.  It's not just because they are young, exactly, but it's because insurance companies have to pay out more claims due to accidents caused by young drivers.  The cause might be due to their inexperience and immaturity, but that's almost beside the point.  It's a numbers game.

By collecting data from millions of car trips across a wide range of customers, an insurance company can apply machine learning to discern the patterns of drivers who make claims versus those who don't.  If your driving patterns are scored as too similar to those of other drivers who cause accidents...well, don't expect to receive a discount when you share your driving data.

Programs like State Farm's Drive Safe and Save accomplish more than just "proving" that you're a good driver.  The program incents you to be more conscious of your driving behavior, especially while you have that app running and collecting data.  State Farm provides periodic reports to program subscribers that show how your driving behavior compares (favorably or not) to other drivers in the pool.  The gamification and feedback aspect of the program might do just as much to improve driving as the promise of a discount.

 

Using your smartphone accelerometer to build a safe driving profile was published on SAS Users.

September 24, 2018

If you were to ask Tricia Wang, PhD, about real business growth, she would tell you that it lies outside the boundaries of the known. Not everything valuable is measurable, she would say. And big data is hiding new customers in the market from you. Wang is more than just [...]

Don't just do digital. Be digital. was published on SAS Voices by Stephanie DiRico

September 20, 2018

Referred to as the instigator of innovation, Mick Ebeling works to expand the human possibilities of technology through his company, Not Impossible Labs, whose mission is to change the world through technology and story. Powered by a team of thinkers, doers, creators and hackers, Not Impossible Labs works on inspirational [...]

What could you create if nothing were impossible? was published on SAS Voices by Anjelica Cummings

September 19, 2018

Like most people, I believed that the process of diagnosing and treating cancer begins with a biopsy.  If cancer is suspected, a doctor will extract a small tissue sample -- usually a tiny cylindrical "core sample" -- and examine it for cancer cells.  No cancer cells found -- that's good news!  But if cancer cells are present, then you have decisions to make about treatment.

A young woman named Richa Sehgal taught me that it's not so simple.  There aren't just two types of cells (cancerous and non-cancerous).  There are actually several types of cancer cells, and these do not all have the same importance when it comes to effective cancer treatment.  I learned this from Richa during her presentation at Analytics Experience 2018 -- a remarkable talk for several reasons, not the least of which is this: Richa Sehgal is a high school student, just 18 years old.  I'll have to check the record books, but this might make her the youngest-ever presenter at this premier analytics event.

Last year, Richa served as a student intern at the Canary Center at Stanford for Cancer Early Detection.  That's where she learned about the biology of cancer. She was allowed (encouraged!) to attend all lab meetings – and the experience opened her eyes to the challenges of cancer detection.

The importance of cell types and how cancer works

Unlike many technical conference talks that I've attended, Richa did not dive directly into the math or the code that support the techniques she was presenting.  Instead, Richa dedicated the first 25 minutes of her talk to teach the audience how cancer works.  And that primer was essential to help the (standing-room only!) audience to understand the relevance and value of her analytical solution.

What we call "cancer" is actually a collection of different types of cells.  Richa focused on three types: cancer stem cells (CSCs), transient amplifying cells (TACs), and terminally differentiated cells (TDCs).  CSCs are the most rare type within a tumor, making up just a few percent of the total mix of cells.  But because of their self-renewing qualities and their ability to grow all other types of cancer cells, these are very important to treat.  CSCs require targeted therapy -- that is, you can't use the same type of treatment for all cell types.  TACs usually require their own treatment, depending on the stage of the disease and the ability of a patient to tolerate the therapy.  The presence of TACs can activate CSCs to grow more cancer cells, so if you can't eradicate the CSCs (and that's difficult to manage with 100% certainty, as we'll see) then it is important to treat the TACs.  TDCs represent cancer cells that are no longer capable of dividing, and so generally don't require a treatment -- they will die off on their own.

(I know that my explanation here represents a simplistic view of cancer -- but it was enough of a framework to help me to understand the rest of Richa's talk.)

Richa Sehgal presents to a standing-room-only crowd at #AnalyticsX

The inexact science of biopsies

Now that we understand that cancer is made up of a variety of cell types, it makes sense to hope that when we extract a biopsy, we get a sample that represents this cell-type variability.  Richa used an example of sampling a chocolate chip cookie.  If you were to use a needle to extract a core sample from a chocolate chip cookie...but didn't manage to extract any portions of the (disappointingly rare) chocolate chips, you might conclude that the cookie was a simple sugar cookie.  And as a result, you might treat that cookie differently.  (If you encountered a raisin instead...well..that might require a different treatment altogether.  Blech.)

But, as Richa told us, we don't yet know enough about the distribution and proximity of the different cell types for different types of cancers.  This makes it difficult to design better biopsies.  Richa is optimistic that it's just a matter of time -- medical science will crack this and we'll one day have good models of cancer makeup.  And when that day comes, Richa has a statistical method to make biopsies better.

Using SAS and Python to model cancer cell clusters

Most high school students wouldn't think to pick up SAS for use in their science fair projects, but Richa has an edge: her uncle works for SAS as a research statistician.  However, you don't need an inside connection to get access to SAS for learning.  In Richa's case, she used SAS University Edition hosted on AWS -- nothing to install, easy to access, and free to use for any learner.

Since she didn't have real data that represent the makeup of a tumor, Richa created simulations of the cancer cells, their different types and proximity to each other in a 3D model.  With this data in hand, she could use cluster analysis (PROC CLUSTER with Ward's method and then PROC TREE) to analyze a distance matrix that she computed.  The result shows how closely cancer cells of the same type are positioned to one another.  With that information, it's possible to design a biopsy that captures a highly variable collection of cells.
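Here's a hedged sketch of those clustering steps with hypothetical data set and variable names.  Richa fed PROC CLUSTER the distance matrix she computed (a TYPE=DISTANCE data set), but for raw coordinate data PROC CLUSTER can also compute Euclidean distances itself:

proc cluster data=work.cells method=ward outtree=work.tree noprint;
  var x y z;      /* simulated 3D coordinates of each cell */
  id cell_id;     /* hypothetical cell identifier */
run;

proc tree data=work.tree out=work.clusters nclusters=3 noprint;
  id cell_id;     /* assign each cell to one of 3 clusters */
run;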

Richa then used the Python package plotly to visualize the 3D model of her cell map.  (I didn't have the heart to tell her that she could accomplish this in SAS with PROC SGPLOT -- some things you just have to learn for yourself.)

A bright future -- for all of us

Clearly, Richa is an extremely accomplished young woman.  When I asked about her college plans for next year, she told me that she has a long list of "stretch schools" that she's looking at.  I'm having a difficult time understanding what constitutes a "stretch" for Richa -- I'm certain that any institution would love to have her.

Richa's accomplishments make me feel optimism for her, but also for the rest of us.  As a father of three daughters, I'm encouraged to see young women enter technical fields and be successful.  SAS is among the elite technology companies that work to close the analytics skills gap by providing free software, education, and mentoring.  Throughout the Analytics Experience 2018 conference, I've heard from many attendees who also saw Richa's talk -- they were similarly impressed and inspired.  Presentations like Richa's deliver on the conference tagline: "Analytics redefines innovation. You redefine the future."

Using machine learning to improve tumor biopsies was published on SAS Users.

September 18, 2018

To succeed in any data-focused hackathon, you need a robust set of tools and skills – as well as a can-do attitude.  Here's what you can expect from any hackathon:

  • Messy data.   It might come from a variety of sources, and won't necessarily be organized for analytics or reporting.  That's your job.
  • Nebulous problem set. Usually the goal of a hackathon is to generate insights, improve a situation, or optimize a process. But you don't know going into it which insights you need, which process is ripe for optimization, or which situations can be improved by using data.  Hackathons are as much about discovering opportunities as they are about solving problems.
  • Team members with different viewpoints. This is a big strength of hackathons, and it can also present the biggest challenge.  Team members bring different skills and ideas.  To be successful, you need to be open to those ideas and to letting team members contribute in the way that best uses their skills.  Think of yourselves as the Ocean's Eleven of data analytics.

In my experience, hackathons are often a great melting pot of different tools and technologies.  Whatever tech biases you might have in your day job (Windows versus Linux, SAS versus Python, JSON versus CSV) – these melt away when your teammates show up ready to contribute to a common goal using the tools that they each know best.

My favorite hackathon tools

At the Analytics Experience 2018 Hackathon, attendees have the entire suite of SAS tools available.  From Base SAS, to SAS Enterprise Guide, to SAS Studio, to SAS Enterprise Miner and the entire SAS Viya framework -- including SAS Visual Analytics, SAS Visual Text Analytics, SAS Data Mining and Machine Learning.  As we say here in San Diego, it's the whole enchilada.  As the facilitators were presenting the whirlwind tour of all of these goodies, I could see the attendees salivating.  Or maybe that was just me.

When it comes to getting my hands dirty with unknown data, my favorite path begins with SAS Enterprise Guide.  If you know me, this won't surprise you.  Here's why I like it.

Import Data task: Import any data

Hackathon data almost always comes as CSV or Excel spreadsheets.  The Import Data task can ingest CSV, fixed-width text, and Excel spreadsheets of any version.  Of course most "hackers" worth their salt can write code to read these file types, but the Import Data task helps you to discover what's in the file almost instantly.  You can review all of the field names and types, tweak them as you like, and click Finish to produce a data set.  There's no faster method of turning raw data into a SAS data set that feeds the next step.

See Tricks for importing text files and Importing Excel files using SAS Enterprise Guide for more details about the ins-and-outs of this task.  If you want to ultimately turn this step into repeatable code (a great idea for hackathons), then it's important to know how this task works.

Note: if your data is coming from a web service or API, then it's probably in JSON format.  There's no point-and-click task to read that, but a couple of SAS program lines will do the trick.
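For example, here's a hedged sketch with a hypothetical endpoint.  PROC HTTP fetches the response, and the JSON libname engine (available in SAS 9.4 maintenance 4 and later) maps it to tables:

filename resp temp;
proc http
  url="https://example.com/api/records"   /* hypothetical REST endpoint */
  method="GET"
  out=resp;
run;

libname api json fileref=resp;       /* expose the JSON content as data sets */
proc print data=api.alldata(obs=10); /* ALLDATA lists every element the engine found */
run;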

Query Builder: Filter, compute, summarize, and join

The Query Builder in SAS Enterprise Guide is a one-stop shop for data management.  Use this for quick filtering, data cleansing, simple recoding, and summarizing across groups. Later, when you have multiple data sources, the Query Builder provides simple methods to join these – merge on the fly.
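Behind the scenes, the Query Builder generates PROC SQL.  Here's a hedged sketch, with hypothetical table and column names, of the filter-join-summarize pattern it produces:

proc sql;
  create table work.sales_summary as
  select t2.region_name,
         count(*) as order_count,         /* summarize across groups */
         sum(t1.amount) as total_amount
    from work.orders t1
         inner join work.regions t2
         on t1.region_id = t2.region_id   /* join two sources */
   where t1.amount > 0                    /* quick filtering */
   group by t2.region_name;
quit;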

Before heading into your next hackathon, it's worth exploring and practicing your skills with the Query Builder.  It can do so much -- but some of the functions are a bit hidden.  Limber up before you hack!

See this paper by Jennifer First-Kluge for an in-depth tour of the tool.

Characterize Data: Quick data characteristics, with ability to dive deeper

If you've never seen your data before, you'll appreciate this one-click method to report on variable types, frequencies, distinct values, and distributions.  The Describe->Characterize Data task provides a good start.

Using SAS Studio? There's a Characterize Data task in there as well.  See Marje Fecht's paper: Easing into Data Exploration, Reporting, and Analytics Using SAS Enterprise Guide for more about this and other tasks.

Data tasks: Advanced data reworking: long to wide, wide to long

"Long" data is typically best for reporting, while "wide" data is more suited for analytics and modeling  The process of restructuring data from long to wide (or wide to long) is called Transpose.  SAS Enterprise Guide has special tasks called "Split Data" (for making wide tables) and "Stack Data" (for making long data).  Each method has some special requirements for a successful transformation, so it's worth your time to practice with these tasks before you need them.

Program Editor: Flexible coding environment

The program editor in SAS Enterprise Guide is my favorite place to write and modify SAS code.  Here are my favorite tricks for staying productive in this environment including code formatting, shown below.

[Figure: automatic code formatting in the program editor]

Have another favorite editor?  You can use SAS Enterprise Guide to open your code in your default Windows editor too.  That's a great option when you need to do super-fancy text manipulation.  (We won't go into the "best programming editor" debate here, but I've got my defaults set up for Notepad++.)

Export and share with others

The hackathon "units of sharing" are code (of course) and data.  SAS Enterprise Guide provides several simple methods to share data in a way that just about any other tool can consume:

  • Export data as CSV (CSV is the lingua franca of data sharing)
  • Export data as Excel (if that's what your teammates are using)
  • Send to Excel -- actually my favorite way to generate ad-hoc Excel data, as it automates Microsoft Excel and pipes the data you're looking at directly into a new sheet.
  • Copy / paste with headers -- low-tech, but this gets you exactly the columns and fields that you want to share with another team member.

When it comes to sharing code, you can use File->Export All Code to capture all SAS code from your project or process flow.  However, I prefer to assemble my own "standalone" code piecemeal, so that I can make sure it's going to run the same for someone else as it does for me.  To accomplish this, I create a new SAS program node and copy the code for each step that I want to share into it...one after another.  Then I test by running that code in a new SAS session.  Validating your code in this way helps to reduce friction when you're sharing your work with others.

Hacking your own personal growth

The obvious benefit of hackathons is that at the end of a short, intense period of work, you have new insights and solutions that you didn't have before – and might never have arrived at on your own.  But the personal benefit comes in the people you meet and the techniques that you learn.  I find that I'm able to approach my day job with fresh perspective and ideas – the creativity keeps flowing, and I'm energized to apply what I've learned in my business.

The post Essential SAS tools to bring to your next hackathon appeared first on The SAS Dummy.

August 18, 2018

Get ready to have your mind blown. Whether or not you plan to attend Analytics Experience in San Diego on September 17, you'll be inspired by the speakers we have lined up to keynote the event. There's an inventor. A data scientist. A world class athlete. And a photographer. They've [...]

4 inspiring videos from Analytics Experience keynote speakers was published on SAS Voices by Alison Bolen

August 14, 2018

Even though I’ve worked at SAS for nearly 30 years, I still get excited when great things come together for our customers! This year we are hosting the very first hackathon at our Analytics Experience conference in San Diego - the AnalyticsX Hackathon. Analytics Experience is in its third year [...]

The post Announcing the AnalyticsX Hackathon: Are you up for it? appeared first on SAS Learning Post.

July 25, 2018

Wondering what makes this conference special?  Over the years I’ve heard from many attendees that it’s the best way to get the most out of their analytics investments. Analytics Experience is a learning-focused conference featuring networking opportunities, training, certification exams and analytics presentations for all skill levels. #AnalyticsX will give you [...]

4 resources to help convince your boss to send you to SAS Analytics Experience was published on SAS Voices by Kristine Vick