Chris Hemedinger

November 15, 2018
 

JSON is a popular format for data exchange between APIs and some modern databases. It's also used as a way to archive audit logs in many systems. Because JSON can represent hierarchical relationships of data fields, many people consider it to be superior to the CSV format -- although it's certainly not yet universal.

I learned recently that newline-delimited JSON, also called JSONL or JSON Lines, is growing in popularity. In a JSONL file, each line of text represents a valid JSON object -- building up to a series of records. But there is no hierarchical relationship among these lines, so when taken as a whole the JSONL file is not valid JSON. That is, a JSON parser can process each line individually, but it cannot process the file all at once.

In SAS, you can use PROC JSON to create valid JSON files and the JSON libname engine to parse them. But neither of these can create or parse JSONL files directly. Here's a simple example of a JSONL file. Each line, enclosed in braces, represents valid JSON. But if you paste the entire body into a validation tool like JSONLint, the parsing fails.

JSONL sample, not valid in JSON parsers
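For illustration, a few made-up records (patterned after the SASHELP.CLASS data we'll use later) would look like this in JSONL form -- one complete JSON object per line:

{"Name": "Alfred", "Sex": "M", "Age": 14}
{"Name": "Alice", "Sex": "F", "Age": 13}
{"Name": "Barbara", "Sex": "F", "Age": 13}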

If we need these records to be true JSON, we need a hierarchy. This requires us to set off the rows with additional braces and brackets and separate them with commas, like this:

Valid JSON with hierarchies
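Here are those same made-up records expressed as valid JSON -- the objects are collected into a top-level array and separated by commas:

[
  {"Name": "Alfred", "Sex": "M", "Age": 14},
  {"Name": "Alice", "Sex": "F", "Age": 13},
  {"Name": "Barbara", "Sex": "F", "Age": 13}
]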

Creating JSONL with PROC JSON and DATA step

In a recent SAS Support Communities thread, a SAS user was struggling to use PROC JSON and a SAS data set to create a JSONL file for use with the Amazon Redshift database. PROC JSON can't create the finished file directly, but we can use PROC JSON to create the individual JSON object records. Our solution looks like this:

  • Use PROC JSON to read each record of the source data set and create a new JSON file. DATA step and CALL EXECUTE can generate these steps for us.
  • Using DATA step, post-process the collection of JSON files and append these into a final JSONL file.

Here's the code we used. You need to change only the output file name and the source SAS data set.

/* Build a JSONL (newline-delimited JSON) file */
/* from the records in a SAS data set          */
filename final "c:\temp\final.jsonl" ;
%let datasource = sashelp.class;
 
/* Create a new subfolder in WORK to hold */
/* temp JSON files, avoiding conflicts    */
options dlcreatedir;
%let workpath = %sysfunc(getoption(WORK))/json;
libname json "&workpath.";
libname json clear;
 
/* Will create and run a separate PROC JSON step */
/* for each record.  This might take a while   */
/* for very large data.                        */
/* Each iteration will create a new JSON file  */
data _null_;
 set &datasource.;
 call execute(catt('filename out "',"&workpath./out",_n_,'.json";'));
 call execute('proc json out=out nosastags ;');
 call execute("export &datasource.(obs="||_n_||" firstobs="||_n_||");");
 call execute('run;');
run;
 
/* This will concatenate the collection of JSON files */
/* into a single JSONL file                           */
data _null_;
 file final encoding='utf-8' termstr=cr;
 infile "&workpath./out*.json";
 input;
 /* trim the start and end [ ] characters */
 final = substr(_infile_,2,length(_infile_)-2);
 put final;
run;

From what I've read, it's a common practice to compress JSONL files with gzip for storage or faster transfers. That's a simple step to apply in our example, because SAS 9.4 Maintenance 5 added GZIP support to the FILENAME ZIP access method. To create a gzipped final result, change the first FILENAME statement to something like:

filename final ZIP "c:\temp\final.jsonl.gz" GZIP;

The JSONL format is new to me and I haven't needed to use it in any of my applications. If you use JSONL in your work, I'd love to hear your feedback about whether this approach would create the types of files you need.

The post Create newline-delimited JSON (or JSONL) with SAS appeared first on The SAS Dummy.

November 6, 2018
 

A few weeks ago I posted a cliffhanger-of-a-blog-post. I left my readers in suspense about which of my physical activities are represented in different sets of accelerometer data that I captured. In the absence of more details from me, the internet fan theories have been going wild. Well, it's time for the big reveal! I've created a SAS Visual Analytics report that shows each of these activity streams with the proper label:

Accelerometer measurements per activity -- click to enlarge!

Were your guesses confirmed? Any surprises? Were you more impressed with my safe driving or with my reckless behavior on the trampoline?

Collecting and preparing accelerometer data

You might remember that this entire experiment was inspired by a presentation from Analytics Experience 2018. That's when I learned about an insurance company that built a smartphone app to collect data about driving behavior, and that the app relies heavily on accelerometer readings. I didn't have time or expertise to build my own version of such an app, but I found that there are several good free apps that can collect and export this data. I used an app called AccDataRec on my Android phone.

Each "recording session" generates a TSV file -- a tab-separated file that contains a timestamp and a measurement for each of the accelerometer axes (X, Y, and Z). In my previous post, I shared tips about how to import multiple TSV files in a single step. Here's the final version of the program that I wrote to import these data:

filename tsvs "./accel/*.tsv";
libname out "./accel";
 
data out.accel;
  length 
    casefile $ 100 /* to write to data set */
    counter 8 
    timestamp 8 
    timestamp_sec 8
    x 8 y 8 z 8 
    filename $ 25        
    tsvfile $ 100 /* to hold the value */
  ;
  format timestamp datetime22.3 timestamp_sec datetime20.;
 
  /* store the name of the current infile */
  infile tsvs filename=tsvfile expandtabs;
  casefile=tsvfile;
  input counter timestamp x y z filename;
 
  /* convert epoch time into SAS time */
  timestamp=dhms('01jan1970'd, 0, 0, timestamp / 1000);
 
  /* create a timestamp with the precision of one second */
  timestamp_sec = intnx('second',timestamp,0);
run;

Some notes:

  • I converted the timestamp value from the data file (an epoch time value) to a native SAS datetime value by using this trick.
  • Following advice from readers on my last post, I changed the DLM= option to the simpler EXPANDTABS option on the INFILE statement.
  • Some of the SAS time-series analysis doesn't like the more-precise timestamp values with fractions of seconds. I computed a less precise field, rounding down to the second, just in case.
  • For my reports in this post, I really need only 5 fields: counter (the ordinal sequence of measurements), x, y, z, and the filename (mapping to activity).

The new integrated SAS Viya environment makes it simple to move from one task to another, without needing to understand the SAS product boundaries. I used the Manage Data function (that's SAS Data Management, but does that matter?) to upload the ACCEL data set and make it available for use in my reports. Here's a preview:

Creating a SAS Visual Analytics report

With the data now available and loaded into memory, I jumped to the Explore and Visualize Data activity. This is where I can use my data to create a new SAS Visual Analytics report.

At first, I was tempted to create a Time Series Plot. My data does contain time values, and I want to examine the progression of my measurements over time. However, I found the options of the Time Series Plot to be too constraining for my task, and it turns out that for this task the actual time values really aren't that important. What's important is the sequence of the measurements I've collected, and that's captured as an ordinal in the counter value. So, I selected the Line Plot instead. This allowed for more options in the categorical views -- including a lattice row arrangement that made it easy to see the different activity patterns at a glance. This screen capture shows the Role assignments that I selected for the plot.

Adding a closer view at each activity

With the overview Line Plot complete, it's time to add another view that allows us to see just a single activity and provide a close-up view of its pattern. I added a second page to my report and dropped another Line Plot onto the canvas. I assigned "counter" to the category and the x, y, and z values to the Measures. But instead of adding a Lattice Row value, I added a Button Bar to the top of the canvas. My idea is to use the Button Bar -- which is good for navigating among a small number of values -- as a way to trigger a filter for the accelerometer data.

I assigned "filename" to the Category value in the Button Bar role pane. Then I used the Button Bar options menu (the vertical dots on the right) to add a New filter from selection, selecting "Include only selection".

With this Button Bar control and its filter in place, I can now switch among the data values for the different activities. Here's my "drive home" data -- it looks sort of exciting, but I can promise you that it was a nice, boring ride home through typical Raleigh traffic.

Phone mounted in my car for the drive home

The readings from the "kitchen table" activity surprised me at first. This activity was simply 5 minutes of my phone lying flat on my kitchen table. I expected all readings to hover around zero, but the z axis showed a relatively flat line closer to 10 meters-per-second-per-second. Then I remembered: gravity. This sensor registers Earth's gravity, which we are taught is 9.8 meters-per-second-per-second. The readings from my phone hovered around 9.6 -- maybe my house is in a special low-gravity zone, or the readings are a bit off.

Phone at rest on my kitchen table

Finally, let's take a closer look at my trampoline workout. Since I was holding my phone upright, it looks like the x-axis felt the brunt of the acceleration forces. According to these readings, my phone was subjected to a g-force of 7 or 8 times that of Earth's gravity -- but just for a split second. And since my phone was in my hand and my arm was flailing around (I am not a graceful rebounder), my phone was probably experiencing more force than my body was.

Bounding on the trampoline as high as I can

Some love for the Windows 10 app

My favorite method to view SAS Visual Analytics reports is through the SAS Visual Analytics application that's available for Windows 10 and Windows mobile devices. Even on my desktop, where I have a full web browser to help me, I like the look and feel of the specialized Windows 10 app. The report screen captures for this article were rendered in the Windows 10 app. Check out this article for more information about the app. You can try the app for free, even without your own SAS Viya environment. The app is hardwired with a connection to the SAS demo reports at SAS.com.

See also

This is the third (and probably final) article in my series about accelerometer data. See these previous posts for more of the fun background information:

The post Reporting on accelerometer data with SAS Visual Analytics appeared first on The SAS Dummy.

October 9, 2018
 

As part of my research for a different article, I recently collected data about my driving commute home via an accelerometer recorder app on my phone. The app generates a simple TSV file. (A TSV file is like a CSV file, but instead of a comma separator, it uses a TAB character to separate the values.) The raw data looks like this:

Related from Analytics Experience 2018: Using your smartphone accelerometer to build a safe driving profile

With SAS, it's simple to import the file into a data set. Here's my DATA step code that uses the INFILE statement to identify the file and how to read it. Note that the DLM= option references the hexadecimal value for the TAB character in ASCII (09x), the delimiter for fields in this data.

data drive;
  infile "/home/chris.hemedinger/tsv/drivehome.tsv" 
   dlm='09'x;
  length counter 8 
         timestamp 8 
         x 8 y 8 z 8 
         filename $ 25;
  input counter timestamp x y z filename;
run;

In my research, I didn't stop with just my drive home. In addition to my commute, I collected data about 4 other activities, and thus accumulated a collection of TSV files. Here's my file directory in my SAS OnDemand for Academics account:

To import each of these data files into SAS, I could simply copy and paste my code 4 times and then replace the name of the file for each case that I collected. After all, copy-and-paste is a tried and true method for writing large volumes of code. But as the number of code lines grows, so does the maintenance work. If I want to add any additional logic into my DATA step, that change would need to be applied 5 times. And if I later come back and add more files to my TSV collection, I'll need to copy-and-paste the same code blocks for my additional cases.

Using a wildcard on the INFILE statement

I can read all of my TSV files in a single step by specifying a wildcard in the file path on the INFILE statement: *.tsv tells SAS to match all of the TSV files in the folder and process each of them in turn. I also changed the name of the data set from "drive" to the more generic "accel".

data accel;
  infile "/home/chris.hemedinger/tsv/*.tsv" 
   dlm='09'x;
  length counter 8 
         timestamp 8 
         x 8 y 8 z 8 
         filename $ 25;
  input counter timestamp x y z filename;
run;

The SAS log shows which files have been processed and added into my data set.

With a single data set that has all of my accelerometer readings, I can easily segment these with a WHERE clause in later processing. It's convenient that my accelerometer app also captured the name of each TSV file so that I can keep these cases distinct. A quick PROC FREQ shows the allocation of records for each case that I collected.
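A minimal sketch of that step, using the combined ACCEL data and its filename field, might look like this:

/* Count the records contributed by each TSV file */
proc freq data=accel;
  tables filename / nocum;
run;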

Add the filename into the data set

If your input data files don't contain an eponymous field name, then you will need to use a different method to keep track of which records come from which files. The FILENAME= option on the INFILE statement stores the name of the current input file in a variable, which you can then copy into a field that is kept in the output data set:

filename tsvs "/home/chris.hemedinger/tsv/*.tsv";
data accel;
  length casefile $ 100 /* to write to data set */
     counter 8 
     timestamp 8 
     x 8 y 8 z 8 
     filename $ 25	       
     tsvfile $ 100 /* to hold the value */	       
   ;
 
   /* store the name of the current infile */       
   infile tsvs filename=tsvfile 
    dlm='09'x ; 
  casefile=tsvfile;
  input counter timestamp x y z filename;	
run;

In the output, you'll notice that we now have the fully qualified file name that SAS processed using INFILE.

Managing data files: fewer files is better

Because we started this task with 5 distinct input files, it might be tempting to store the records in separate tables: one for each accelerometer case. While there might be good reasons to do that for some types of data, I believe that we have more flexibility when we keep all of these records together in a single data set. (But if you must split a single data set into many, here's a method to do it.)

In this single data set, we still have the information that keeps the records distinct (the name of the original files), so we haven't lost anything. SAS procedures support CLASS and BY statements that allow us to simplify our code when reporting across different groups of data. We'll have fewer blocks of repetitive code, and we can accomplish more across all of these cases before we have to resort to SAS macro logic to repeat operations for each file.

As a simple example, I can create a quick visualization with a single PROC SGPANEL step.

ods graphics / width=1600 height=400;
proc sgpanel data=accel;
 panelby filename / columns=5 noheader;
 series x=counter y=x;
 series x=counter y=y;
 series x=counter y=z;
 colaxis display=none minor;
 rowaxis label="m/s**2" grid;
 where counter<11000;
run;

Take a look at these 5 series plots. Using just what you know of the file names and these plots, can you guess which panel represents which accelerometer case?

Leave your guess in the comments section. I'll explore these data further in a future blog post!

The post How to read multiple text files in SAS appeared first on The SAS Dummy.

September 26, 2018
 

Here's a challenge.  You're a passenger in an automobile, and you've been asked to evaluate whether the driver's habits behind the wheel are "safe" or "risky."  But there's a catch: you have to collect all of your information with your eyes closed.

Think about it -- with your eyes shut, you're denied important information such as your location, traffic conditions, speed limits and traffic signals, and weather conditions.  Sightless, your only source of data comes from your sense of motion as the vehicle accelerates, slows down, and turns.

Sunish Menon, a Ph.D. researcher at State Farm Insurance, faced this challenge with his team as they designed the data collection scheme for State Farm's Drive Safe and Save program.  Sunish shared his experience and ideas with attendees at the Analytics Experience 2018 conference in San Diego.

Accelerometer: simple measurements with rich results

Sunish's team knew that they were going to build a smartphone app to support the Drive Safe and Save program.  After all, a smartphone can collect a ton of information: location with GPS, phone use during a trip, traffic conditions, trip duration and speed, and more.  But accessing these details has a cost.  Every sensor on a phone consumes precious battery life, and potential users might not be comfortable sharing their location constantly with an insurance company -- even if there is a premium discount at stake.  So what's the minimum amount of information you can collect and still assemble a meaningful profile?  Maybe capturing the changes in speed and direction is enough.

Like people, your smartphone also has a "sense of motion" -- it's called an accelerometer.  As you might guess from its name, an accelerometer is a small electrical sensor that measures acceleration.  For a quick physics refresher, let's review the difference between speed, velocity, and acceleration:

  • Speed: How fast an object is moving, usually expressed as distance over time (example: 10 meters per second)
  • Velocity: How fast an object is moving and in which direction (example: 10 meters per second, to the east)
  • Acceleration: The rate of change in the velocity of an object. Since it's a rate of change, it's expressed as distance over time (speed), per unit of time.  For example, to change speed from 0 to 60 miles-per-hour in 10 seconds, an object must accelerate at 2.682 meters per second per second, or 2.682 m/s2.

 

Your smartphone measures acceleration across three axes, traditionally labeled as x, y, and z.  The measurements are sampled multiple times per second.  Each measurement reflects acceleration across one of the axes.  Taken together, you can get a sense for the phone's overall direction.

I've included Sunish's diagram of how these axes are oriented on a smartphone.  The x axis is horizontal along the face of the phone, and the y axis is vertical along the face.  The z axis is along the perpendicular plane passing through the center of the phone.  Depending on the direction of the phone's movement, acceleration values might have a positive or negative value.

Capturing my commute data

Inspired by Sunish's presentation, I decided to get a bit of hands-on practice with accelerometer data. I installed a free app on my phone to capture the raw data from the accelerometer.  Here's what the data values look like, as measured from the start of my driving commute from work to home.

In these data, the first value is a record counter.  The second "big number" value is a timestamp in Unix epoch format.  That's the number of milliseconds since midnight on January 1, 1970.  And the next three values are the acceleration measurements for the x, y, and z axes respectively.  Acceleration is measured in meters-per-second squared, or m/s2.  For reference, keep in mind that Earth's gravity -- the force that keeps us grounded (literally) -- is about 9.8 m/s2 (1g).
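If you want to convert such a timestamp to a native SAS datetime value, a quick sketch of the arithmetic looks like this (the epoch value here is made up for the example):

data _null_;
  /* example epoch timestamp, in milliseconds (made-up value) */
  epoch_ms = 1537963200000;
  /* seconds since 01JAN1970, shifted to the SAS datetime origin */
  sasdt = dhms('01jan1970'd, 0, 0, epoch_ms / 1000);
  put sasdt= datetime20.;
run;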

The data from my commute contains over 85,000 measurements, captured over about 30 minutes (it was a busy Friday afternoon).  I used the SERIES plot in PROC SGPLOT to create a simple visualization.  Can you tell where the longest stoplight occurs?  (It's right near the shopping mall -- I really don't like that intersection.)

commute accelerometer

Teasing out "events" from the data

In my commute as represented in the above chart, it seems simple enough to locate the mundane events of accelerating, braking, and waiting in traffic.  There are a few spikes and dips that might represent more dramatic braking events, or perhaps a fast start from a traffic light (my car has some pep!).  Let's use some histograms to look at these measurements another way.

histograms of x y z

Most of my commute is uninteresting, as I'm driving at a steady speed or waiting in traffic.  The histogram shows the x axis measurements are centered around 0.  But why don't the y and z axes behave the same?  During my drive,  my phone is positioned nearly vertical in a dashboard holder, with perhaps a 30-degree forward tilt.  Gravity works on all of us at about 9.8m/s2.  With my phone at the vertical-ish tilt, you can see most of that force applied to the y axis, with some shared with the z axis.

Since the data collected represents a time series, it makes sense to apply a time series analysis to see if we can decompose its components and make the interesting events more obvious.  In Sunish's case, his team applied exactly this kind of decomposition.  In SAS, PROC TIMESERIES can help here, and it also offers a SPECTRA statement for spectrum analysis, among similar options.
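To give a flavor of what that can look like in code, here's a minimal sketch (not Sunish's actual program): it accumulates the raw readings into one-second intervals and requests a classical decomposition for one axis.  It assumes the timestamp has already been converted to a SAS datetime value.

/* Minimal sketch: accumulate readings to 1-second intervals */
/* and request a classical decomposition of the x-axis       */
proc timeseries data=drive outdecomp=decomp;
  id timestamp interval=second accumulate=mean;
  var x;
run;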

Here's a tip: if you are trying this on your own and you get stuck, post a question to the SAS forecasting/time series community.  Experts are eager to answer!

Confounding factors when analyzing a drive

During a drive, the measurements from the accelerometer "start at zero" (or their natural baseline) only when the phone is lying flat, with the top of the phone pointed toward the front of the car.  But who keeps their phone stationed like that?  When I'm driving alone in my car, my phone is usually in a holder mounted on the dash, positioned nearly vertical, tilted slightly.  Or it's in my pants pocket.

Sunish presented a series of techniques to help control for this -- all of them applying more math than I am qualified to describe.  The smartphone also has a gyroscope sensor, which can measure the phone's "tilt" along any of its axes (labeled as pitch, roll, and yaw).  Combining these measurements with the acceleration readings, as well as controlling for the force of gravity, can help create a more accurate picture of your driving experience.

When I'm not alone in the car, the phone might not stay in one place.  A passenger might pick it up to find directions, or to reference IMDB to settle a bet.  All of those movements will also register on the accelerometer, and how will a "safe driving app" judge these actions?  That's a challenge for analytics.

Safe driver versus risky driver: more than just measurement

Please do not rush to judgement about my driving behavior from this one sample.  In fact, even if you had hundreds of samples of my driving, it would probably be difficult to fairly judge whether I am a high-risk driver.

For insurance companies, assessment of risk is influenced more by how similar you are to known risky populations.  That's why young drivers tend to command higher premiums.  It's not just because they are young, exactly, but it's because insurance companies have to pay out more claims due to accidents caused by young drivers.  The cause might be due to their inexperience and immaturity, but that's almost beside the point.  It's a numbers game.

By collecting data from millions of car trips across a wide range of customers, an insurance company can apply machine learning to discern the patterns of drivers who make claims versus those who don't.  If your driving patterns are scored as too similar to those of other drivers who cause accidents...well, don't expect to receive a discount when you share your driving data.

Programs like State Farm's Drive Safe and Save accomplish more than just "proving" that you're a good driver.  The program incents you to be more conscious of your driving behavior, especially while you have that app running and collecting data.  State Farm provides periodic reports to program subscribers that show how your driving behavior compares (favorably or not) to other drivers in the pool.  The gamification and feedback aspect of the program might do just as much to improve driving as the promise of a discount.

 

Using your smartphone accelerometer to build a safe driving profile was published on SAS Users.

September 19, 2018
 

Like most people, I believed that the process of diagnosing and treating cancer begins with a biopsy.  If cancer is suspected, a doctor will extract a small tissue sample -- usually a tiny cylindrical "core sample" -- and examine it for cancer cells.  No cancer cells found -- that's good news!  But if cancer cells are present, then you have decisions to make about treatment.

A young woman named Richa Sehgal taught me that it's not so simple.  There aren't just two types of cells (cancerous and non-cancerous).  There are actually several types of cancer cells, and these do not all have the same importance when it comes to effective cancer treatment.  I learned this from Richa during her presentation at Analytics Experience 2018 -- a remarkable talk for several reasons, not the least of which is this: Richa Sehgal is a high school student, just 18 years old.  I'll have to check the record books, but this might make her the youngest-ever presenter at this premier analytics event.

Last year, Richa served as a student intern at the Canary Center at Stanford for Cancer Early Detection.  That's where she learned about the biology of cancer. She was allowed (encouraged!) to attend all lab meetings – and the experience opened her eyes to the challenges of cancer detection.

The importance of cell types and how cancer works

Unlike many technical conference talks that I've attended, Richa did not dive directly into the math or the code that support the techniques she was presenting.  Instead, Richa dedicated the first 25 minutes of her talk to teach the audience how cancer works.  And that primer was essential to help the (standing-room only!) audience to understand the relevance and value of her analytical solution.

What we call "cancer" is actually a collection of different types of cells.  Richa focused on three types: cancer stem cells (CSCs), transient amplifying cells (TACs), and terminally differentiated cells (TDCs).  CSCs are the most rare type within a tumor, making up just a few percent of the total mix of cells.  But because of their self-renewing qualities and their ability to grow all other types of cancer cells, these are very important to treat.  CSCs require targeted therapy -- that is, you can't use the same type of treatment for all cell types.  TACs usually require their own treatment, depending on the stage of the disease and the ability of a patient to tolerate the therapy.  The presence of TACs can activate CSCs to grow more cancer cells, so if you can't eradicate the CSCs (and that's difficult to manage with 100% certainty, as we'll see) then it is important to treat the TACs.  TDCs represent cancer cells that are no longer capable of dividing, and so generally don't require a treatment -- they will die off on their own.

(I know that my explanation here represents a simplistic view of cancer -- but it was enough of a framework to help me to understand the rest of Richa's talk.)

Richa Sehgal presents to a standing-room-only crowd at #AnalyticsX

The inexact science of biopsies

Now that we understand that cancer is made up of a variety of cell types, it makes sense to hope that when we extract a biopsy, we get a sample that represents this cell type variability.  Richa used an example of sampling a chocolate chip cookie.  If you were to use a needle to extract a core sample from a chocolate chip cookie...but didn't manage to extract any portions of the (disappointingly rare) chocolate chips, you might conclude that the cookie was a simple sugar cookie.  And as a result, you might treat that cookie differently.  (If you encountered a raisin instead...well...that might require a different treatment altogether.  Blech.)

But, as Richa told us, we don't yet know enough about the distribution and proximity of the different cell types for different types of cancers.  This makes it difficult to design better biopsies.  Richa is optimistic that it's just a matter of time -- medical science will crack this and we'll one day have good models of cancer makeup.  And when that day comes, Richa has a statistical method to make biopsies better.

Using SAS and Python to model cancer cell clusters

Most high school students wouldn't think to pick up SAS for use in their science fair projects, but Richa has an edge: her uncle works for SAS as a research statistician.  However, you don't need an inside connection to get access to SAS for learning.  In Richa's case, she used SAS University Edition hosted on AWS -- nothing to install, easy to access, and free to use for any learner.

Since she didn't have real data to represent the makeup of a tumor, Richa created simulations of the cancer cells -- their different types and proximity to each other -- in a 3D model.  With this data in hand, she could use cluster analysis (PROC CLUSTER with Ward's method and then PROC TREE) to analyze a distance matrix that she computed.  The result shows how closely cancer cells of the same type are positioned to one another.  With that information, it's possible to design a biopsy that captures a highly variable collection of cells.
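To make that concrete, here's a minimal sketch of the clustering steps (not Richa's actual program).  It assumes a data set named CELLS with simulated x/y/z coordinates and a CELL_ID identifier; unlike Richa's approach, it lets PROC CLUSTER compute Euclidean distances from the coordinates instead of supplying a precomputed distance matrix.

/* Minimal sketch with an assumed data set CELLS (x, y, z, cell_id) */
proc cluster data=cells method=ward outtree=tree noprint;
  var x y z;
  id cell_id;
run;
 
/* Cut the tree into a chosen number of clusters */
proc tree data=tree nclusters=5 out=clustered noprint;
  id cell_id;
run;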

Richa then used the Python package plotly to visualize the 3D model of her cell map.  (I didn't have the heart to tell her that she could accomplish this in SAS with PROC SGPLOT -- some things you just have to learn for yourself.)

A bright future -- for all of us

Clearly, Richa is an extremely accomplished young woman.  When I asked about her college plans for next year, she told me that she has a long list of "stretch schools" that she's looking at.  I'm having a difficult time understanding what constitutes a "stretch" for Richa -- I'm certain that any institution would love to have her.

Richa's accomplishments make me feel optimism for her, but also for the rest of us.  As a father of three daughters, I'm encouraged to see young women enter technical fields and be successful.  SAS is among the elite technology companies that work to close the analytics skills gap by providing free software, education, and mentoring.  Throughout the Analytics Experience 2018 conference, I've heard from many attendees who also saw Richa's talk -- they were similarly impressed and inspired.  Presentations like Richa's deliver on the conference tagline: "Analytics redefines innovation. You redefine the future."

Using machine learning to improve tumor biopsies was published on SAS Users.

September 18, 2018
 

To succeed in any data-focused hackathon, you need a robust set of tools and skills – as well as a can-do attitude.  Here's what you can expect from any hackathon:

  • Messy data.   It might come from a variety of sources, and won't necessarily be organized for analytics or reporting.  That's your job.
  • Nebulous problem set. Usually the goal of a hackathon is to generate insights, improve a situation, or optimize a process. But you don't know going into it which insights you need, which process is ripe for optimization, or which situations can be improved by using data.  Hackathons are as much about discovering opportunities as they are about solving problems.
  • Team members with different viewpoints. This is a big strength of hackathons, and it can also present the biggest challenge.  Team members bring different skills and ideas.  To be successful, you need to be open to those ideas and to allow team members to contribute in the way that best uses their skills.  Think of yourselves as the Ocean's Eleven of data analytics.

In my experience, hackathons are often a great melting pot of different tools and technologies.  Whatever tech biases you might have in your day job (Windows versus Linux, SAS versus Python, JSON versus CSV) – these melt away when your teammates show up ready to contribute to a common goal using the tools that they each know best.

My favorite hackathon tools

At the Analytics Experience 2018 Hackathon, attendees have the entire suite of SAS tools available.  From Base SAS, to SAS Enterprise Guide, to SAS Studio, to SAS Enterprise Miner and the entire SAS Viya framework -- including SAS Visual Analytics, SAS Visual Text Analytics, SAS Data Mining and Machine Learning.  As we say here in San Diego, it's the whole enchilada.  As the facilitators were presenting the whirlwind tour of all of these goodies, I could see the attendees salivating.  Or maybe that was just me.

When it comes to getting my hands dirty with unknown data, my favorite path begins with SAS Enterprise Guide.  If you know me, this won't surprise you.  Here's why I like it.

Import Data task: Import any data

Hackathon data almost always comes as CSV or Excel spreadsheets.  The Import Data task can ingest CSV, fixed-width text, and Excel spreadsheets of any version.  Of course most "hackers" worth their salt can write code to read these file types, but the Import Data task helps you to discover what's in the file almost instantly.  You can review all of the field names and types, tweak them as you like, and click Finish to produce a data set.  There's no faster method of turning raw data into a SAS data set that feeds the next step.

See Tricks for importing text files and Importing Excel files using SAS Enterprise Guide for more details about the ins-and-outs of this task.  If you want to ultimately turn this step into repeatable code (a great idea for hackathons), then it's important to know how this task works.

Note: if your data is coming from a web service or API, then it's probably in JSON format.  There's no point-and-click task to read that, but a couple of SAS program lines will do the trick.
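In case it helps, here's roughly what those lines look like -- the URL is a made-up placeholder, and your JSON structure will determine which tables the engine creates:

/* Sketch: fetch JSON from a (hypothetical) API endpoint */
filename resp temp;
proc http
  url="https://example.com/api/data"
  method="GET"
  out=resp;
run;
 
/* Let the JSON libname engine map the response to tables */
libname apidata json fileref=resp;
/* See which members the engine created from the JSON */
proc datasets lib=apidata; quit;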

Query Builder: Filter, compute, summarize, and join

The Query Builder in SAS Enterprise Guide is a one-stop shop for data management.  Use this for quick filtering, data cleansing, simple recoding, and summarizing across groups. Later, when you have multiple data sources, the Query Builder provides simple methods to join these – merge on the fly.

Before heading into your next hackathon, it's worth exploring and practicing your skills with the Query Builder.  It can do so much -- but some of the functions are a bit hidden.  Limber up before you hack!

See this paper by Jennifer First-Kluge for an in-depth tour of the tool.

Characterize Data: Quick data characteristics, with ability to dive deeper

If you've never seen your data before, you'll appreciate this one-click method to report on variable types, frequencies, distinct values, and distributions.  The Describe->Characterize Data task provides a good start.

Using SAS Studio? There's a Characterize Data task in there as well.  See Marje Fecht's paper: Easing into Data Exploration, Reporting, and Analytics Using SAS Enterprise Guide for more about this and other tasks.

Data tasks: Advanced data reworking: long to wide, wide to long

"Long" data is typically best for reporting, while "wide" data is more suited for analytics and modeling  The process of restructuring data from long to wide (or wide to long) is called Transpose.  SAS Enterprise Guide has special tasks called "Split Data" (for making wide tables) and "Stack Data" (for making long data).  Each method has some special requirements for a successful transformation, so it's worth your time to practice with these tasks before you need them.

Program Editor: Flexible coding environment

The program editor in SAS Enterprise Guide is my favorite place to write and modify SAS code.  Here are my favorite tricks for staying productive in this environment including code formatting, shown below.

autoformat code

Have another favorite editor?  You can use SAS Enterprise Guide to open your code in your default Windows editor too.  That's a great option when you need to do super-fancy text manipulation.  (We won't go into the "best programming editor" debate here, but I've got my defaults set up for Notepad++.)

Export and share with others

The hackathon "units of sharing" are code (of course) and data.  SAS Enterprise Guide provides several simple methods to share data in a way that just about any other tool can consume:

  • Export data as CSV (CSV is the lingua franca of data sharing)
  • Export data as Excel (if that's what your teammates are using)
  • Send to Excel -- actually my favorite way to generate ad-hoc Excel data, as it automates Microsoft Excel and pipes the data you're looking at directly into a new sheet.
  • Copy / paste with headers -- low-tech, but this gets you exactly the columns and fields that you want to share with another team member.

When it comes to sharing code, you can use File->Export All Code to capture all SAS code from your project or process flow.  However, I prefer to assemble my own "standalone" code piecemeal, so that I can make sure it's going to run the same for someone else as it does for me.  To accomplish this, I create a new SAS program node and copy the code for each step that I want to share into it...one after another.  Then I test by running that code in a new SAS session.  Validating your code in this way helps to reduce friction when you're sharing your work with others.

Hacking your own personal growth

The obvious benefit of hackathons is that at the end of a short, intense period of work, you have new insights and solutions that you didn't have before -- and that you might never have arrived at on your own.  But the personal benefit comes in the people you meet and the techniques that you learn.  I find that I'm able to approach my day job with fresh perspective and ideas -- the creativity keeps flowing, and I'm energized to apply what I've learned in my business.

The post Essential SAS tools to bring to your next hackathon appeared first on The SAS Dummy.

August 29, 2018
 

The concept of "current working directory" is important within any SAS program that reads or creates external files. In SAS, when you reference a file location with a relative path (for example, "./projects/mydata.pdf"), that file reference resolves to an absolute path by way of the working directory. You can control the initial working directory by modifying the shell scripts that launch the SAS process, or by setting a system option at startup. In an earlier article, I shared a simple SAS macro that allows you to learn the current working directory. The macro uses a trick to assign a SAS fileref to the current path ('.'), grab the full path of that fileref by using the PATHNAME function, and then clear the fileref. Read the article for the full source (it's only about 7 lines). Here's how you would use it:

56         %put Current path is %curdir;
Current path is C:\WINDOWS\system32
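(For reference, here is a minimal sketch of such a macro -- my reconstruction of the fileref/PATHNAME trick described above, not necessarily the exact code from that article.)

%macro curdir;
  %local rc fref path;
  /* assign a temporary fileref to the current path */
  %let rc = %sysfunc(filename(fref,.));
  /* resolve the fileref to its full physical path */
  %let path = %sysfunc(pathname(&fref));
  /* clear the temporary fileref */
  %let rc = %sysfunc(filename(fref));
  &path
%mend curdir;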

As you might infer from my example here, I'm running this on a managed Windows environment. Most users cannot write to the "C:\WINDOWS\system32" path (and would not want to), so any relative file paths in my SAS code would cause errors. Maybe you've seen something like this:

25         ods html file="./test.html";
NOTE: Writing HTML Body file: ./test.html
ERROR: Insufficient authorization to access C:\WINDOWS\system32\test.html.
ERROR: No body file. HTML output will not be created.

If I want to use a relative path, I need to change the current working directory. Fortunately, there's a simple way to do that.

Change the current directory in SAS

Use the DLGCDIR function to change the current working directory for your SAS session. You can call it with %SYSFUNC, as shown in this example:

/* working path for my projects */
%let rc = %sysfunc(dlgcdir('u:/projects'));
 
ods html file="./test.html";
proc print data=sashelp.class; run;
ods html close;

I can use my account-specific environment variables to make these paths work for all users. For example, on Windows I can reference the USERPROFILE environment variable. (On Unix, I can use the HOME environment variable instead.)

/* working path for my projects */
%let user = %sysget(USERPROFILE);
%let rc = %sysfunc(dlgcdir("&user./Documents"));
 
/* create an output data folder if needed */
options dlcreatedir;
libname outdata "./data";
 
ods html file="./test.html";
data outdata.class;
 set sashelp.class;
run;
proc print data=outdata.class; run;
ods html close;

Here's my log output. Notice how the HTML file and the output data folder are both created at locations relative to my home directory.

25         /* working path for my projects */
26         %let user = %sysget(USERPROFILE);
27         %let rc = %sysfunc(dlgcdir("&user./Documents"));
NOTE: The current working directory is now "C:\Users\sascrh\Documents".
28         
29         options dlcreatedir;
30         libname outdata "./data";
NOTE: Library OUTDATA was created.
NOTE: Libref OUTDATA was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: C:\Users\sascrh\Documents\data
31         
32         ods html file="./test.html";
NOTE: Writing HTML Body file: ./test.html
33         data outdata.class;
34          set sashelp.class;
35         run;

If using SAS Enterprise Guide, you can add DLGCDIR function steps to the startup statements that run when you connect to SAS, ensuring that your working directory starts in a valid location for SAS output. You can specify those statements in Tools->Options->SAS Programs, "Submit SAS code when server is connected." A SAS administrator can also add code to the AUTOEXEC file that runs when the SAS session begins, thus helping to manage this for larger groups of SAS users.

See also

The post Manage the current directory within your SAS program appeared first on The SAS Dummy.

July 20, 2018
 

"Code golf" is a fun programming pastime that challenges you to solve a problem with the least amount of code possible. Like regular golf, the goal is to use fewest code "strokes" to hit the mark. Here's a recent challenge that was posted to me via Twitter.

While I feel that I can solve nearly any problem (that I can understand) using SAS, my knowledge of the SAS language is quite limited when compared to that of many experts.  And so, I reached out to the SAS Support Communities for help on this one.

The answers were quick, creative, and diverse.  I'll share a few of them here.

The winner, in terms of concision, came from FreelanceReinhard.  He supplied a macro-function one-liner:

%sysfunc(findc(123456789,00112010302,b));

With this entry, FreelanceReinhard defied a natural algorithmic instinct to treat this as a numerical digit comparison problem, and instead approached it as a simple pattern matching problem.  The highest digit must come from a finite set (0..9).  The FINDC function searches the candidate string "123456789" for any character that appears in the input string; with the b modifier it scans from right to left, so the position it returns corresponds to the largest digit present.  A similarly compact pattern-matching entry came from novinosrin, built on the COMPRESS function: COMPRESS is often used to eliminate matching characters from a string, but the k directive inverts the action to keep only the matching characters instead.

If you wanted to use the more traditional approach of looping through values, comparing, and keeping just the maximum value, then you can hardly do better than the code offered by hashman.

do j = 1 to length (str) ; 
  d = d <> input (char (str, j), 1.) ;
end ;

Experienced SAS programmers will remember that the <> operator acts as the MAX operator within a DATA step expression (even though it means "not equal to" in other contexts, such as WHERE clauses and PROC SQL).  So each pass through the loop keeps the larger of the running maximum and the current digit.

AhmedAl_Attar offered the most dangerous approach, using memory manipulation techniques to populate members of an array:

array ct [20] $1 _temporary_;
call pokelong (str,addrlong(ct[1]),length(str));
c=max(of ct{*});

Visit the communities topic to see all of the entries and chime in with your own.  Here's a complete program that collects several of these approaches:

data max;
  str = '00112010302';
 
  /* novinosrin's approach */
  a=first(compress('9876543210',str,'k'));
 
  /* FreelanceReinhard's approach */
  b=findc('123456789',str,-9);
 
  /* AhmedAl_Attar's approach using POKELONG */
  array ct [20] $1 _temporary_;
  call pokelong (str,addrlong(ct[1]),length(str));
  c=max(of ct{*});
 
  /* loop approach from hashman */
  /* remember that <> is MAX    */
  do j = 1 to length (str) ;            
    d = d <> input (char (str, j), 1.) ;
  end ;
 
drop j;
run;
 
/* FreelanceReinhard's approach in a one-liner macro function */
%let str=00112010302;
%put max=%sysfunc(findc(123456789,&str.,b));
 
/* IML approach from ChanceTGardener */
/* Requires SAS/IML to run           */
proc iml;
  str='000112010302';
  maximum=max((substr(str,1:length(str),1)));
  print maximum;
quit;

The post SAS code golf: find the max digit in a string of digits appeared first on The SAS Dummy.

July 6, 2018
 

SAS programmers have long wanted the ability to control the flow of their SAS programs without having to resort to complex SAS macro programming. With SAS 9.4 Maintenance 5, it's now supported! You can now use %IF-%THEN-%ELSE conditional logic in open code -- that is, outside of a macro definition. This capability recently came to light on SAS Support Communities. (Thanks to Super User Tom for asking about it.)

Prior to this change, if you wanted to check a condition -- say, whether a data set exists -- before running a PROC, you had to code it within a macro routine. It would look something like this:

/* capture conditional logic in macro */
%macro PrintIfExists();
 %if %sysfunc(exist(work.result)) %then
  %do;
    proc means data=work.result;
    run;
  %end; 
%else
  %do;
    %PUT WARNING: Missing WORK.RESULT - report process skipped.;
  %end;
%mend;
 
/* call the macro */
%PrintIfExists();

Now you can simplify this code to remove the %MACRO/%MEND wrapper and the macro call:

/* If a file exists, take an action */
/* else fail gracefully */
%if %sysfunc(exist(work.result)) %then
  %do;
    proc means data=work.result;
    run;
  %end;
%else
  %do;
    %PUT WARNING: Missing WORK.RESULT - report process skipped.;
  %end;

Here are some additional ideas for how to use this feature. I'm sure you'll be able to think of many more!

Run "debug-level" code only when in debug mode

When developing your code, it's now easier to leave debugging statements in and turn them on with a simple flag.

/* Conditionally produce debugging information */
%let _DEBUG = 0; /* set to 1 for debugging */
%if &_DEBUG. %then
  %do;
    proc print data=sashelp.class(obs=10);
    run;
  %end;

If you have code that's under construction and should never be run while you work on other parts of your program, you can now "IF 0" out the entire block. As a longtime C and C++ programmer, this reminds me of the "#if 0 / #endif" preprocessor directives as an alternative for commenting out blocks of code. Glad to see this in SAS!

/* skip processing of blocks of code */
/* like #if 0 / #endif in C/C++      */
%if 0 %then
  %do;
    proc ToBeDetermined;
      READMYMIND = Yes;
    run;
  %end;

Run code only on a certain day of the week

I have batch jobs that run daily, but that send e-mail to people only one day per week. Now this is easier to express inline with conditional logic.

/*If it's Monday, send a weekly report by email */
%if %sysfunc(today(),weekday1.)=2 %then
  %do;
    options emailsys=smtp emailhost=myhost.company.com;
    filename output email
      subject = "Weekly report for &SYSDATE."
      from = "SAS Dummy <sasdummy@sas.com>"
      to = "knowledgethirster@curious.net"
      ct ='text/html';
 
  ods tagsets.msoffice2k(id=email) 
    file=OUTPUT(title="Important Report!")
    style=seaside;
   title "The Weekly Buzz";
   proc print data=amazing.data;
   run;
  ods tagsets.msoffice2k(id=email) close;
  %end;

Check a system environment variable before running code

For batch jobs especially, system environment variables can be a rich source of information about the conditions under which your code is running. You can glean user ID information, path settings, network settings, and so much more. If your SAS program needs to pick up cues from the running environment, this is a useful method to accomplish that.

/* Check for system environment vars before running code */
%if %sysfunc(sysexist(ORACLE_HOME)) %then
  %do;
    %put NOTE: ORACLE client is installed.;
    /* assign an Oracle library */
    libname ora oracle path=corp schema=alldata authdomain=oracle;
  %end;

Limitations of %IF/%THEN in open code

As awesome as this feature is, there are a few rules that apply to the use of the construct in open code. These are different from what's allowed within a %MACRO wrapper.

First rule: your %IF/%THEN must be followed by a %DO/%END block for the statements that you want to conditionally execute. The same is true for any statements that follow the optional %ELSE branch of the condition.

And second: no nesting of multiple %IF/%THEN constructs in open code. If you need that flexibility, you can do that within a %MACRO wrapper instead.

And remember, this works only in SAS 9.4 Maintenance 5 and later. That includes the most recent release of SAS University Edition, so if you don't have the latest SAS release in your workplace, this gives you a way to kick the tires on this feature if you can't wait to try it.

The post Using %IF-%THEN-%ELSE in SAS programs appeared first on The SAS Dummy.

July 3, 2018
 

At SAS, we love data. Data is central to our corporate vision: to transform a world of data into a world of intelligence. We're also famous for enjoying M&Ms, but to us they are more than a sweet snack. They're also another source of data.

My colleague Pete Privitera, with a team of like-minded "makers," built a device that they named SnackBot. SnackBot is an internet-connected sensor that measures the flow of M&Ms in a particular SAS break room. There's a lot to love about this project. You can learn more by watching its origin story in this video:

As the number of M&Ms changes, SnackBot takes a reading and records the M&M count in a database. Most readings reflect a decrease in candy pieces, as my colleagues help themselves to a treat. But once per week, the reading shows a drastic increase -- as our facilities staff restocks the canister. SnackBot has its own website. It also has its own API, and you know what that means (right?). It means that we can use SAS to read and analyze the sensor data.

Reading sensor data into SAS with the SnackBot API

The SnackBot system offers a REST API with a JSON data response. Like any REST API, we can use PROC HTTP to fetch the data, and the JSON library engine to parse the response into a SAS data set.

%let start = '20MAY2018:0:0:0'dt;
 
/* format the start/end per the API needs */
%let start_time= %sysfunc(putn(&start.,is8601dt26.));
%let end_time=   %sysfunc(datetime(),is8601dt26.);
 
/* Call the SnackBot API from snackbot.net */
filename resp temp;
proc http
  method="GET"
  url="http://snackbot.net/snackdata?start_time=&start_time.%str(&)end_time=&end_time.%str(&)utc_offset_minutes=-240"
  out=resp;
run;
 
/* JSON libname engine to read the result       */
/* Simple record layout, all in the ROOT member */
libname mms json fileref=resp;
data mmlevels;
  set mms.root;
run;

I've written about how to use SAS with REST APIs in several other blog posts, so I won't dwell on this part of the process here. This short program retrieves the raw data from SnackBot, which represents a series of M&M "levels" (count of remaining pieces) and a timestamp for each measurement. It's a good start. Though there are only two fields to work with here, there's quite a bit we can do with these data.

raw SnackBot data

Add features to the raw sensor data

With a few additional DATA step statements and some built-in SAS formats, we can derive several interesting characteristics of these data for use in further analysis.

First, we need to convert the character-formatted datetime field to a proper SAS datetime value. That's easily achieved with the INPUT function and the ANYDTDTM informat. (Rick Wicklin wrote a helpful article about how the ANYDT* informats work.)

data mmlevels;
  set mms.root;
  drop ordinal_root timestamp;
  /* Convert the TIMESTAMP field to native value -- it's a character */
  datetime = input(timestamp, anydtdtm.);
  date = datepart(datetime);
  time = timepart(datetime);
  dow = date;
  qhour = round(datetime,'0:15:0'T);
  format  datetime datetime20. 
          qhour datetime20.
          date date9.
          time timeampm10.
          dow downame.;
run;

For convenience, I duplicated the datetime value a few times and applied different formats so we can get different views of the same value: datetime, just the date, just the time-of-day, and the day-of-week. I also used the ROUND function to "round" the raw datetime value to the nearest quarter hour. I'll explain why I've done that in a later step, but the ROUNDing of a time value is one of the documented unusual uses of the ROUND function.

SnackBot data with features

Even with this small amount of data preparation, we can begin to analyze the characteristics of these data. For example, let's look at the descriptive stats for the data classified by day-of-week:

title "SnackBot readings per day-of-week";
proc means data=mmlevels mean stddev max min;
 var pieces;
 class dow;
run;

SnackBot by day of week

The "N Obs" column shows the number of measurements taken over the entire "study period" broken down by day-of-week. If a measurement is a proxy for a "number-of-pieces-changed" event, then we can see that most events happen on Wednesday, Thursday, and Friday. From this, can you guess which day the M&M canister is refilled?

Let's take another slice through these data, but this time looking at time-of-day. For this, I used PROC FREQ to count the measurements by hour. I applied the HOUR2. format, which allows the SAS procedure to group these data into hour-long intervals with no need for additional data prep. (I've written previously about how to use SAS formats to derive new categories without expensive data rewriting.) Then I used PROC SGPLOT to produce a step plot for the 24-hour cycle.

/* Count of readings per hour of the day */ 
title "SnackBot readings per hour";
proc freq data=mmlevels ;
 table time / out=perhour;
 format time hour2.;
run;
 
ods graphics / height=400 width=800;
 
title "SnackBot readings per hour";
proc sgplot data=perhour des="Readings per hour of day";
 step x=time y=count;
 xaxis min='0:0:0't max='24:0:0't label="Time of day" grid;
 yaxis label="Servings";
run;

SnackBot hour step

From the chart, we can see that most M&M "events" happen around 11am, and then again between 2pm and 4pm. From personal experience, I can confirm that those are the times when I hear the M&Ms calling to me.

Expand the time series to regular intervals

The SnackBot website can tell you how many M&Ms are remaining right now. But what if you want to know how many were remaining last Friday? Or on any typical Monday morning?

The sensor data that we've analyzed so far is sparse -- that is, there are data entries for each "change" event, but not for every discrete time interval in the study period. I don't know how the SnackBot sensor records its readings -- it might sample the M&M levels every minute, or every second. Regardless, the API reports (and probably stores) only the records that represent a change. If SnackBot records that the final 24 pieces were depleted at 25JUN2018:07:45:00 (a Monday morning) bringing the count to 0, how many M&Ms remain at 1pm later that day? The data don't tell us explicitly with a recorded reading. But we can assume at that point that the count was still 0. The next recorded reading occurs at 27JUN2018:10:30:00 (on a Wednesday, bringing the count to 1332 -- oh joy!).

If we want to create a useful time series visualization of the M&M candy counts over time, we need to expand the time series from these sparse recordings to regular intervals. SAS offers a few sophisticated time series procedures to accomplish this: PROC EXPAND, PROC TIMESERIES, and PROC TIMEDATA. Each of these offer powerful econometrics methods for interpolation and forecasting -- and that's more than we need for this situation. For my example, I took a more low-tech approach.

First, I created an empty data set with datetime entries at quarter-hour intervals, covering the study period of the data we're looking at.

/* Empty data set with 15 minute interval slots    */
/* Regular intervals for the entire "study" period */
data timeslots;
  last = datetime();
  length qhour 8;
  format qhour datetime20.;
  drop last i;
  do i = &start. to last by '0:15:00't;
    qhour = i;
    output;
  end;
run;

Then I used a DATA step to merge these empty slots with the actual event data that I had rounded to the nearest quarter hour (remember that?):

/* Merge the sample data with the timeslots */
data expand;
  merge mmlevels(keep=pieces qhour) timeslots;
  by qhour;
run;

Finally, I used a variation of a last-observation-carried-forward (LOCF) approach to fill in the remaining empty slots. If a reading at 20MAY2018:11:15:00 reports 132 pieces remaining, then that value should be RETAINed for each 15-minute slot until the next reading at 20MAY2018:17:30:00. (That reading is 82 pieces -- meaning somebody helped themselves to 50 pieces. Recommended serving size for plain M&Ms is 20 pieces, but I'm not passing judgement.) I also recorded a text value for the day-of-week to help with the final visualization.

/* for empty timeslots, carry the sample data   */
/* forward, so we always have a count of pieces */
/* Variation on a LOCF technique                */
data final;
  set expand;
  length day $ 3;
  /* 3-char value for day of week */
  day=put(datepart(qhour),weekdate3.);
  retain hold;
  if not missing(pieces) then
    hold=pieces;
  else pieces=hold;
  drop hold;
  if not missing(pieces);
run;

Now I have data that represents the regular intervals that we need.

SnackBot regular intervals

Putting it all together

For my final visualization, I created a series plot for the study period. It shows the rise and fall of M&Ms levels in one SAS break room over several weeks. For additional "color", I annotated the plot with a block chart to delineate the days of the week.

title 'Plain M&M pieces on S1 tracked by SnackBot';
ods graphics / height=300 width=1600;
 
proc sgplot data=final des='M&M pieces tracked by SnackBot';
 
  /* plot the data as a series */ 
  series x=qhour y=pieces / lineattrs=(color=navy thickness=3px);
 
  /* Yes, these are the "official" M&M colors               */
  /* Will be applied in data-order, so works best when data */
  /* begins on a Sunday                                     */
  styleattrs datacolors=(red orange yellow green blue CX593B18 red);
  /* block areas to indicate days-of-week                   */
  block x=qhour block=day / transparency=0.65
    valueattrs=(weight=bold size=10pt color=navy);
 
  xaxis minor display=(nolabel);
  yaxis display=(nolabel) grid max=1600 minor;
run;

You can see the pattern. M&Ms are typically filled on Wednesday to the canister capacity of about 1400 pieces. We usually enter into the weekend with 0 remaining, but there are exceptions. The week of May 27 was our Memorial Day holiday, which explains the lack of activity on Monday (and even Tuesday) during that week as SAS folks took advantage of a slow week with their vacation plans.

SnackBot visualization

More about SAS and M&Ms data

You can download the complete code for this example from my public Gist on GitHub. The example code should work with SAS University Edition and SAS OnDemand for Academics, as well as with any SAS environment that can reach the internet with PROC HTTP.

For more M&M data fun, check out Rick Wicklin's article about the distribution of colors in plain M&Ms. SnackBot does not (yet) report on how many and which color of M&Ms are taken per serving, but using statistics, we can predict that!

The post The Internet of Snacks: SnackBot data and what it reveals about SAS life appeared first on The SAS Dummy.