9月 182018

To succeed in any data-focused hackathon, you need a robust set of tools and skills – as well as a can-do attitude.  Here's what you can expect from any hackathon:

  • Messy data.   It might come from a variety of sources, and won't necessarily be organized for analytics or reporting.  That's your job.
  • Nebulous problem set. Usually the goal of a hackathon is to generate insights, improve a situation, or optimize a process. But you don't know going into it which insights you need, which process is ripe for optimization, or which situations can be improved by using data.  Hackathons are as much about discovering opportunities as they are about solving problems.
  • Team members with different viewpoints. This is a big strength of hackathons, and it can also present the biggest challenge.  Team members bring different skills and ideas.  To be successful, you need to be open to those ideas and to allowing team members to contribute in the way that best uses their skills.  Think of yourselves as the Oceans Eleven of data analytics.

In my experience, hackathons are often a great melting pot of different tools and technologies.  Whatever tech biases you might have in your day job (Windows versus Linux, SAS versus Python, JSON versus CSV) – these melt away when your teammates show up ready to contribute to a common goal using the tools that they each know best.

My favorite hackathon tools

At the Analytics Experience 2018 Hackathon, attendees have the entire suite of SAS tools available.  From Base SAS, to SAS Enterprise Guide, to SAS Studio, to SAS Enterprise Miner and the entire SAS Viya framework -- including SAS Visual Analytics, SAS Visual Text Analytics, SAS Data Mining and Machine Learning.  As we say here in San Diego, it's the whole enchilada.  As the facilitators were presenting the whirlwind tour of all of these goodies, I could see the attendees salivating.  Or maybe that was just me.

When it comes to getting my hands dirty with unknown data, my favorite path begins with SAS Enterprise Guide.  If you know me, this won't surprise you.  Here's why I like it.

Import Data task: Import any data

Hackathon data almost always comes as CSV or Excel spreadsheets.  The Import Data task can ingest CSV, fixed-width text, and Excel spreadsheets of any version.  Of course most "hackers" worth their salt can write code to read these file types, but the Import Data task helps you to discover what's in the file almost instantly.  You can review all of the field names and types, tweak them as you like, and click Finish to produce a data set.  There's no faster method of turning raw data into a SAS data set that feeds the next step.

See Tricks for importing text files and Importing Excel files using SAS Enterprise Guide for more details about the ins-and-outs of this task.  If you want to ultimately turn this step into repeatable code (a great idea for hackathons), then it's important to know how this task works.

Note: if your data is coming from a web service or API, then it's probably in JSON format.  There's no point-and-click task to read that, but a couple of SAS program lines will do the trick.

Query Builder: Filter, compute, summarize, and join

The Query Builder in SAS Enterprise Guide is a one-stop shop for data management.  Use this for quick filtering, data cleansing, simple recoding, and summarizing across groups. Later, when you have multiple data sources, the Query Builder provides simple methods to join these – merge on the fly.

Before heading into your next hackathon, it's worth exploring and practicing your skills with the Query Builder.  It can do so much -- but some of the functions are a bit hidden.  Limber up before you hack!

See this paper by Jennifer First-Kluge for an in-depth tour of the tool.

Characterize Data: Quick data characteristics, with ability to dive deeper

If you've never seen your data before, you'll appreciate this one-click method to report on variable types, frequencies, distinct values, and distributions.  The Describe->Characterize Data task provides a good start.

Using SAS Studio? There's a Characterize Data task in there as well.  See Marje Fecht's paper: Easing into Data Exploration, Reporting, and Analytics Using SAS Enterprise Guide for more about this and other tasks.

Data tasks: Advanced data reworking: long to wide, wide to long

"Long" data is typically best for reporting, while "wide" data is more suited for analytics and modeling  The process of restructuring data from long to wide (or wide to long) is called Transpose.  SAS Enterprise Guide has special tasks called "Split Data" (for making wide tables) and "Stack Data" (for making long data).  Each method has some special requirements for a successful transformation, so it's worth your time to practice with these tasks before you need them.

Program Editor: Flexible coding environment

The program editor in SAS Enterprise Guide is my favorite place to write and modify SAS code.  Here are my favorite tricks for staying productive in this environment including code formatting, shown below.

autoformat code

Have another favorite editor?  You can use SAS Enterprise Guide to open your code in your default Windows editor too.  That's a great option when you need to do super-fancy text manipulation.  (We won't go into the "best programming editor" debate here, but I've got my defaults set up for Notepad++.)

Export and share with others

The hackathon "units of sharing" are code (of course) and data.  SAS Enterprise Guide provides several simple methods to share data in a way that just about any other tool can consume:

  • Export data as CSV (CSV is the lingua franca of data sharing)
  • Export data as Excel (if that's what your teammates are using)
  • Send to Excel -- actually my favorite way to generate ad-hoc Excel data, as it automates Microsoft Excel and pipes the data your looking at directly into a new sheet.
  • Copy / paste with headers -- low-tech, but this gets you exactly the columns and fields that you want to share with another team member.

When it comes to sharing code, you can use File->Export All Code to capture all SAS code from your project or process flow.  However, I prefer to assemble my own "standalone" code piecemeal, so that I can make sure it's going to run the same for someone else as it does for me.  To accomplish this, I create a new SAS program node and copy the code for each step that I want to share into it...one after another.  Then I test by running that code in a new SAS session.  Validating your code in this way helps to reduce friction when you're sharing your work with others.

Hacking your own personal growth

The obvious benefit of hackathons is that at the end of a short, intense period of work, you have new insights and solutions that didn't have before – and might never have arrived at on your own.  But the personal benefit comes in the people you meet and the techniques that you learn.  I find that I'm able to approach my day job with fresh perspective and ideas – the creativity keeps flowing, and I'm energized to apply what I've learned in my business.

The post Essential SAS tools to bring to your next hackathon appeared first on The SAS Dummy.

9月 172018

The SAS/IML language and the MATLAB language are similar. Both provide a natural syntax for performing high-level computations on vectors and matrices, including basic linear algebra subroutines. Sometimes a SAS programmer will convert an algorithm from MATLAB into SAS/IML. Because the languages are not identical, I am sometimes asked, "what is the SAS/IML function that is equivalent to the XYZ function in MATLAB?" One function I am often asked about is the linspace function in MATLAB, which generates a row vector of n evenly spaced points in a closed interval. Although I have written about how to generate evenly spaced points in SAS/IML (and in the DATA step, too!), the name of the SAS/IML function that performs this operation (the DO function) is not very descriptive. Understandably, someone who browses the documentation might pass by the DO function without realizing that it is the function that generates a linearly spaced vector. This article shows how to construct a SAS/IML function that is equivalent to the MATLAB linspace function.

Generate equally spaced points in an interval

Syntactically, the main difference between the DO function in SAS/IML and the linspace function in MATLAB is that the third argument to the DO function is a step size (an increment), whereas the third function to the linspace function is the number of points to generate in an interval. But that's no problem: to generate n evenly spaced points on the interval [a, b], you can use a step size of (b – a)/(n – 1). Therefore, the following SAS/IML function is a drop-in replacement for the MATLAB linspace function:

proc iml;
/* generate n evenly spaced points (a linearly spaced vector) in the interval [a,b] */
start linspace(a, b, numPts=100);
   n = floor(numPts);               /* if n is not an integer, truncate */
   if n < 1 then return( {} );      /* return empty matrix */
   else if n=1 then return( b );    /* return upper endpoint */
   return( do(a, b, (b-a)/(n-1)) ); /* return n equally spaced points */

A typical use for the linspace function is to generate points in the domain of a function so that you can quickly visualize the function on an interval. For example, the following statements visualize the function exp( -x2 ) on the domain [-3, 3]:

x = linspace(-3, 3);                /* by default, 100 points in [-3,3] */
title "y = exp( -x^2 )";
call series(x, exp(-x##2));         /* graph the function */
Graph of y = exp( -x^2 )

Reminder: "10.0 times 0.1 is hardly ever 1.0"

This is a good time to remind everyone of the programmer's maxim (from Kernighan and Plauger, 1974, The Elements of Programming Style) that "10.0 times 0.1 is hardly ever 1.0." Similarly, "5 times 0.2 is hardly ever 1.0." The maxim holds because many finite decimal values in base 10 have a binary representation that is infinite and repeating. For example, 0.1 and 0.2 are represented by repeating decimals in base 2. Specifically, 0.210 = 0.00110011...2. Thus, just as 3 * (0.3333333) is not equal to 1 in base 10, so too is 5 * 0.00110011...2 not equal to 1 in base 2.

A consequence of this fact is that you should avoid testing floating-point values for equality. For example, if you generate evenly spaced points in the interval [-1, 1] with a step size of 0.2, do not expect that 0.0 is one of the points that are generated, as shown by the following statements:

z = do(-1, 1, 0.2);
/* find all points that are integers */
idx = loc( z = int(z) );               /* test for equality (bad idea) */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];  /* oops! 0.0 is not there! */
print z;                               /* show that 0.0 is not one of the points */

When you query for all values for which z = int(z), only the values -1 and +1 are found. If you print out the values in the vector, you'll see that the middle value is an extremely tiny but nonzero value (-5.55e=17). This is not a bug but is a consequence of the fact that 0.2 is represented as a repeating value in binary.

So how can you find the points in a vector that "should be" integers (in exact arithmetic) but might be slightly different than an integer in floating-point arithmetic? The standard approach is to choose a small distance (such as 1e-12 or 1e-14) and look for floating-point numbers that are within that distance from an integer. In SAS, you can use the ROUND function or check the absolute value of the difference, as follows:

eps = 1e-12;
w = round(z, eps);                            /* Round to nearest eps */
idx = loc( int(w) = w);                       /* find points are within epsilon of integer */
print idx;                                    
idx = loc( abs(int(z) - z) < eps );           /* find points whose distance to integer is less than eps */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];

In summary, this article shows how to define a SAS/IML function that is equivalent to the MATLAB linspace function. It also reminds us that some finite decimal values (such as 0.1 and 0.2) do not have finite binary representations. When these values are used to generate an arithmetic sequence, the resulting vector of values might be different from what you expect. A wise practice is to never test a floating-point value for equality, but instead to test whether a floating-point value is within a small distance from a target value.

The post Linearly spaced vectors in SAS appeared first on The DO Loop.

9月 142018

In my previous post I wrote about the Atari video game, Breakout, and how an AI technique (reinforcement learning, or RL) outperformed a human player. I also drew an analogy between Breakout and customer journey optimization. In Breakout, the environment is what you see on the screen – the blocks, [...]

Reinforcement learning brings AI to customer journeys was published on Customer Intelligence Blog.

9月 142018

What’s that buzz on the power line? It’s AI, of course! But what is AIoT, you ask?  It is applying artificial intelligence (AI) to data from smart devices connected by the Internet of Things (IoT). AI is the science of training systems to perform human tasks through learning and automation. [...]

Five ways that AIoT delivers value to utilities was published on SAS Voices by Alyssa Farrell

9月 132018

With the release of SAS Viya 3.4, you can easily build large-scale machine learning models and seamlessly publish and run models to Hadoop, or other external databases such as Teradata, without the data ever leaving the Hadoop environment. In this process, SAS Viya:

1) Converts the model into MapReduce Code.

2) Executes the MapReduce code.

3) Returns a new, scored dataset in Hadoop.

SAS Viya is a new, distributed in-memory product that allows users to easily build predictive models at scale. Using the SAS Model Studio interface, I can build complex models without the need to write large amounts of underlying code.

For this blog post, I'll go through the steps to build my model using a telecommunications dataset to predict customer churn. Under the “Data” tab, I can see all of my variables, assign the proper roles, and view the dataset.














With the data prepared, I build a pipeline to perform data preprocessing steps such as imputation and binning and build several predictive models, including Regression, Neural Networks and Gradient Boosting. Pipelines are powerful because they automate the heavy lifting of the model building process, allowing you to solve problems faster. In addition, pipelines are re-usable across different users and datasets, allowing the adoption of best practices across an organization.


After building the models, I combine the models into one ensemble model with ease, and compare their performance on the validation sample. I determine that the gradient boosting model is the most accurate based on the misclassification rate. You can pick from a large number of accuracy criteria, including KS Statistic, AUC, MCR or F1.


After having identified the best model, I  publish the model to Hadoop. This allows me to perform future scoring at the data source, meaning data does not have to leave Hadoop. I could have configured the system to publish the model directly from SAS Model Studio; however, I publish and score the model via SAS code for maximum flexibility. With SAS Studio, I can easily control, and change, where I write my resulting models and datasets in Hadoop.

In the “Compare Models” tab, I then download the score code, which provides me with the following:

  1. sas file containing DS2 code that performs all the data preprocessing steps, such as binning and imputation, in the pipeline above. I load this .sas file into a location that can be viewed in SAS Studio.
  2. A .sashdat file in the “Models” Caslib, that is a binary representation of our model called ASTORE, used to score our model in Hadoop.


Opening up the dm_epscore.sas file in SAS Studio, the comments in the top tell me the ASTORE file needed to publish the model.




This scoring file allows the data preparation within the pipeline above to be published to Hadoop as well. In this case, the file is binning the variables before building the Gradient Boosting.





The scoring file then invokes the ASTORE file needed to score the model in Hadoop.




Now, I switch to SAS Studio to publish and score my model in Hadoop.  The full code can be found here.

Below is the syntax to publish the model.  I'm sure to set the classpath variable to the appropriate jar and config files for my Hadoop cluster. Note that you will need permission to read and write from the modeldir directory. Publishing the model converts the .sas scoring code and the ATORE file into MapReduce code for execution in the cluster.











This publishing code will create a directory called “telco_churn” in my home directory in HDFS, /user/ankram. In a SAS Viya environment co-located with Hadoop, the “CASUSERHDFS” Caslib is by default pointed to this location, allowing me to ensure the “telco_churn” file was successfully published.





The next step is to score the model in Hadoop. The code below scores the “looking_glass_v4” table in Hive and create a new table called “looking_glass_v4_scored”, without the data ever leaving Hadoop.









If everything is configured properly, the log should show that the SAS Embedded Process executed correctly.


Using a previously setup Caslib called “Hivelib” that points to the default schema in the Hive Server, I can now load the “Looking_Glass_v4_scored” dataset into CAS to view the table.





Using the Table Viewer, I can then see the predicted probabilities of churn for each individual.


To conclude, many organizations have very large datasets, often times terabytes or larger, and often find that minimizing data movement is critical to successfully putting models into production. The in-database technologies for Hadoop on SAS Viya allow you and your fellow data scientists to easily prepare data and score large-scale models entirely in Hadoop, with the data never leaving the environment. You can now focus on solving more problems and are no longer at the mercy of large datasets and network latency.





Publishing and running models to Hadoop in SAS Viya was published on SAS Users.

9月 122018

Have you ever tried to type a movie title by using a TV remote control? Both Netflix and Amazon Video provide an interface (a virtual keyboard) that enables you to use the four arrow keys of a standard remote control to type letters. The letters are arranged in a regular grid and to navigate from one letter to another can require you to press the arrow keys many times. Fortunately, the software displays partial matches as you choose each letter, so you rarely need to type the entire title. Nevertheless, I was curious: Which interface requires the fewest number of key presses (on average) to type a movie title by using only the arrow keys?

The layout of the navigation screens

The following images show the layout of the navigation screen for Netflix and for Amazon Video.

The Netflix grid has 26 letters and 10 numbers arranged in a 7 x 6 grid. The letters are arranged in alphabetical order. The first row contains large keys for the Space character (shown in a light yellow color) and the Backspace key (dark yellow). Each of those keys occupies three columns of the grid, which means that you can get to the Space key by pressing Up Arrow from the A, B, or C key. When you first arrive at the Netflix navigation screen, the cursor is positioned on the A key (shown in pink).

The letters in the Amazon Video grid are arranged in a 3 x 11 grid according to the standard QWERTY keyboard layout. When you first arrive at the navigation screen, the cursor is positioned on the Q key. The Space character can be typed by using a large key in the last row (shown in a light yellow color) that spans columns 8, 9, and 10. The Space character can be accessed from the M key (press Right Arrow) or from the K, L, or '123' keys on the second row. The '123' key (shown in green) navigates to a second keyboard that contains numbers and punctuation. The numbers are arranged in a 1 x 11 grid. When you arrive at that keyboard, the cursor is on the 'ABC' key, which takes you back to the keyboard that contains letters. (Note: The real navigation screen places the '123' key under the 0 key. However, the configuration in the image is equivalent because in each case you must press one arrow key to get to the 0 (zero) key.) For simplicity, this article ignores punctuation in movie titles.

Which navigation interface is more efficient?

I recently wrote a mathematical discussion about navigating through grids by moving only Up, Down, Left, and Right. The article shows that nearly square grids are more efficient than short and wide grids, assuming that the letters that you type are chosen at random. A 7 x 6 grid requires an average of 4.23 key presses per character whereas a 4 x 11 grid requires an average of 4.89 key presses per character. Although the difference might not seem very big, the average length of a movie title is about 15 characters (including spaces). For a 15-character title, the mathematics suggests that using the Netflix interface requires about 10 fewer key presses (on average) than the Amazon Video interface.

If you wonder why I did not include the Hulu interface in this comparison, it is because the Hulu "keyboard" is a 1 x 28 grid that contains all letters and the space and backspace keys. Theory predicts an average of 9.32 key presses per character, which is almost twice as many key presses as for the Netflix interface.

Comparing two interfaces for selecting movie titles

You might wonder how well this theoretical model matches reality. Movie titles are not a jumble of random letters! How do the Netflix and Amazon Video interfaces compare when they are used to type actual movie titles?

To test this question, I downloaded the titles of 1,000 highly-rated movies. I wrote a SAS program that calculates the number of the arrow keys that are needed to type each movie title for each interface. This section summarizes the results.

The expression "press the arrow key," is a bit long, so I will abbreviate it as "keypress" (one word). The "number of times that you need to press the arrow keys to specify a movie title" is similarly condensed to "the number of keypresses."

For these 1,000 movie titles, the Netflix interface requires an average of 50.9 keypresses per title or 3.32 keypresses per character. the Amazon Video interface requires an average of 61.4 keypresses per title or 4.01 keypresses per character. Thus, on average, the Netflix interface requires 10.56 fewer keypresses per title, which closely agrees with the mathematical prediction that consider only the shape of the keyboard interface. A paired t test indicates that the difference between the means is statistically significant. The difference between medians is similar: 45 for the Netflix interface and 56 for the Amazon interface.

The following comparative histogram (click to enlarge) shows the distribution of the number of keypresses for each of the 1,000 movie titles for each interface. The upper histogram shows that most titles require between 30 and 80 keypresses in the Amazon interface, with a few requiring more than 140 keypresses. In contrast, the lower histogram indicates that most titles require between 20 and 60 keypresses in the Netflix interface; relatively fewer titles require more than 140 keypresses.

You can also use a scatter plot to compare the number of keypresses that are required for each interface. Each marker in the following scatter plot shows the number of keypresses for a title in the Amazon interface (horizontal axis) and the Netflix interface (vertical axis). Markers that are below and to the right of the diagonal (45-degree) line are titles for which the Netflix interface requires fewer keypresses. Markers that are above and to the left of the diagonal line are titles for which the Amazon interface is more efficient. You can see that most markers are below the diagonal line. In fact, 804 titles require fewer keypresses in the Netflix interface, only 177 favor the Amazon interface, and 19 require the same number of keypresses in both interfaces. Clearly, the Netflix layout of the virtual keyboard is more efficient for specifying movie titles.

Movie titles that require the most and the fewest key presses

The scatter plot and histograms reveal that there are a few movies whose titles require many keypresses. Here is a list of the 10 titles that require the most keypresses when using the Amazon interface:

Most of the titles are long. However, one (4 Months, 3 Weeks and 2 Days) is not overly long but instead requires shifting back and forth between the two keyboards in the Amazon interface. That results in a large number of keypresses in the Amazon interface (178) and a large difference between the keypresses required by each interface. In fact, the absolute difference for that title (75) is the largest difference among the 1,000 titles.

You can also look at the movie titles that require few keypresses. The following table shows titles that require fewer than 10 keypresses in either interface. The titles that require the fewest keypresses in the Netflix interface are M, Moon, PK, and Up. The titles that require the fewest keypresses in the Amazon interface are Saw, M, Creed, and Up. It is interesting that Saw, which has three letters, requires fewer keypresses than M, which has one letter. That is because the S, A, and W letters are all located in the upper left of the QWERTY keyboard whereas the letter M is in the lower left corner of the keyboard. (Recall that the cursor begins on the Q letter in the upper left corner.).


In summary, both Netflix and Amazon Video provide an interface that enables customers to select movie titles by using the four arrow keys on a TV remote control. The Netflix interface is a 7 x 6 grid of letters; the Amazon interface is a 3 x 11 QWERTY keyboard and a separate keyboard for numbers. In practice, both interfaces display partial matches and you only need to type a few characters. However, it is interesting to statistically compare the interfaces in terms of efficiency. For a set of 1,000 movie titles, the Netflix interface requires, on average, 10.6 fewer keypresses than the Amazon interface to completely type the titles. This article also lists the movie titles that require the most and the fewest number of key presses.

If you would like to duplicate or extend this analysis, you can download the SAS program that contains the data.

The post Two interfaces for typing text by using a TV remote control appeared first on The DO Loop.

9月 112018

The data I was analyzing was about “trust.” Maybe that’s what got me thinking about Stephen Sondheim, the Broadway composer and lyricist of musicals like Sunday in the Park with George and Into the Woods and the lyricist for West Side Story. Trust is a heavy emotional topic.  Developmental psychologists [...]

Data visualization, the musical was published on SAS Voices by Elliot Inman

9月 112018

Here in North Carolina (NC), we're pretty much resigned to the fact that many of the hurricanes in the Atlantic Ocean are going to visit us. NC sticks out farther into the ocean than most of our neighboring states, and that just makes a tempting target for the hurricanes. But [...]

The post Hurricane Florence - and other category 4 hurricanes that hit NC appeared first on SAS Learning Post.