September 17, 2018
 

The SAS/IML language and the MATLAB language are similar. Both provide a natural syntax for performing high-level computations on vectors and matrices, including basic linear algebra subroutines. Sometimes a SAS programmer will convert an algorithm from MATLAB into SAS/IML. Because the languages are not identical, I am sometimes asked, "what is the SAS/IML function that is equivalent to the XYZ function in MATLAB?" One function I am often asked about is the linspace function in MATLAB, which generates a row vector of n evenly spaced points in a closed interval. Although I have written about how to generate evenly spaced points in SAS/IML (and in the DATA step, too!), the name of the SAS/IML function that performs this operation (the DO function) is not very descriptive. Understandably, someone who browses the documentation might pass by the DO function without realizing that it is the function that generates a linearly spaced vector. This article shows how to construct a SAS/IML function that is equivalent to the MATLAB linspace function.

Generate equally spaced points in an interval

Syntactically, the main difference between the DO function in SAS/IML and the linspace function in MATLAB is that the third argument to the DO function is a step size (an increment), whereas the third argument to the linspace function is the number of points to generate in an interval. But that's no problem: to generate n evenly spaced points on the interval [a, b], you can use a step size of (b – a)/(n – 1). Therefore, the following SAS/IML function is a drop-in replacement for the MATLAB linspace function:

proc iml;
/* generate n evenly spaced points (a linearly spaced vector) in the interval [a,b] */
start linspace(a, b, numPts=100);
   n = floor(numPts);               /* if n is not an integer, truncate */
   if n < 1 then return( {} );      /* return empty matrix */
   else if n=1 then return( b );    /* return upper endpoint */
   return( do(a, b, (b-a)/(n-1)) ); /* return n equally spaced points */
finish;
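
For example, a quick check of the function (continuing the PROC IML session above):

v = linspace(0, 1, 5);              /* five evenly spaced points in [0,1] */
print v;                            /* expect 0  0.25  0.5  0.75  1 */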

A typical use for the linspace function is to generate points in the domain of a function so that you can quickly visualize the function on an interval. For example, the following statements visualize the function exp(-x^2) on the domain [-3, 3]:

x = linspace(-3, 3);                /* by default, 100 points in [-3,3] */
title "y = exp( -x^2 )";
call series(x, exp(-x##2));         /* graph the function */
Graph of y = exp( -x^2 )

Reminder: "10.0 times 0.1 is hardly ever 1.0"

This is a good time to remind everyone of the programmer's maxim (from Kernighan and Plauger, 1974, The Elements of Programming Style) that "10.0 times 0.1 is hardly ever 1.0." Similarly, "5 times 0.2 is hardly ever 1.0." The maxim holds because many finite decimal values in base 10 have a binary representation that is infinite and repeating. For example, 0.1 and 0.2 have repeating expansions in base 2. Specifically, 0.2 in base 10 equals 0.00110011... (repeating) in base 2. Thus, just as 3 * 0.3333333 (a truncated representation of 1/3) does not equal 1 in base 10, 5 times a truncated binary representation of 0.2 does not equal 1 in base 2.
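
As a quick illustration of the maxim, you can add 0.1 to itself ten times and compare the sum to 1 (a minimal check, continuing the PROC IML session above):

x = 0;
do i = 1 to 10;
   x = x + 0.1;                     /* accumulate ten increments of 0.1 */
end;
print (x=1)[L="Is x equal to 1?"] (x-1)[L="x - 1"];   /* the test is false; x differs from 1 by about -1.1e-16 */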

A consequence of this fact is that you should avoid testing floating-point values for equality. For example, if you generate evenly spaced points in the interval [-1, 1] with a step size of 0.2, do not expect that 0.0 is one of the points that are generated, as shown by the following statements:

z = do(-1, 1, 0.2);
/* find all points that are integers */
idx = loc( z = int(z) );               /* test for equality (bad idea) */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];  /* oops! 0.0 is not there! */
 
print z;                               /* show that 0.0 is not one of the points */

When you query for all values for which z = int(z), only the values -1 and +1 are found. If you print out the values in the vector, you'll see that the middle value is an extremely tiny but nonzero value (-5.55e-17). This is not a bug but a consequence of the fact that 0.2 is represented by a repeating (hence truncated) value in binary.

So how can you find the points in a vector that "should be" integers (in exact arithmetic) but might be slightly different than an integer in floating-point arithmetic? The standard approach is to choose a small distance (such as 1e-12 or 1e-14) and look for floating-point numbers that are within that distance from an integer. In SAS, you can use the ROUND function or check the absolute value of the difference, as follows:

eps = 1e-12;
w = round(z, eps);                            /* round to the nearest multiple of eps */
idx = loc( int(w) = w );                      /* find points that are within eps of an integer */
print idx;                                    
 
idx = loc( abs(int(z) - z) < eps );           /* find points whose distance to integer is less than eps */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];

In summary, this article shows how to define a SAS/IML function that is equivalent to the MATLAB linspace function. It also reminds us that some finite decimal values (such as 0.1 and 0.2) do not have finite binary representations. When these values are used to generate an arithmetic sequence, the resulting vector of values might be different from what you expect. A wise practice is to never test a floating-point value for equality, but instead to test whether a floating-point value is within a small distance from a target value.

The post Linearly spaced vectors in SAS appeared first on The DO Loop.

September 14, 2018
 

In my previous post I wrote about the Atari video game, Breakout, and how an AI technique (reinforcement learning, or RL) outperformed a human player. I also drew an analogy between Breakout and customer journey optimization. In Breakout, the environment is what you see on the screen – the blocks, [...]

Reinforcement learning brings AI to customer journeys was published on Customer Intelligence Blog.

September 14, 2018
 

What’s that buzz on the power line? It’s AI, of course! But what is AIoT, you ask?  It is applying artificial intelligence (AI) to data from smart devices connected by the Internet of Things (IoT). AI is the science of training systems to perform human tasks through learning and automation. [...]

Five ways that AIoT delivers value to utilities was published on SAS Voices by Alyssa Farrell

September 13, 2018
 

With the release of SAS Viya 3.4, you can easily build large-scale machine learning models and seamlessly publish and run models to Hadoop, or other external databases such as Teradata, without the data ever leaving the Hadoop environment. In this process, SAS Viya:

1) Converts the model into MapReduce code.

2) Executes the MapReduce code.

3) Returns a new, scored dataset in Hadoop.

SAS Viya is a new, distributed in-memory product that allows users to easily build predictive models at scale. Using the SAS Model Studio interface, I can build complex models without the need to write large amounts of underlying code.

For this blog post, I'll go through the steps to build my model using a telecommunications dataset to predict customer churn. Under the “Data” tab, I can see all of my variables, assign the proper roles, and view the dataset.

With the data prepared, I build a pipeline to perform data preprocessing steps such as imputation and binning and build several predictive models, including Regression, Neural Networks and Gradient Boosting. Pipelines are powerful because they automate the heavy lifting of the model building process, allowing you to solve problems faster. In addition, pipelines are re-usable across different users and datasets, allowing the adoption of best practices across an organization.

After building the models, I can easily combine them into one ensemble model and compare their performance on the validation sample. I determine that the gradient boosting model is the most accurate based on the misclassification rate. You can pick from a large number of accuracy criteria, including the KS statistic, AUC, MCR, or F1.

After identifying the best model, I publish the model to Hadoop. This allows me to perform future scoring at the data source, meaning data does not have to leave Hadoop. I could have configured the system to publish the model directly from SAS Model Studio; however, I publish and score the model via SAS code for maximum flexibility. With SAS Studio, I can easily control, and change, where I write my resulting models and datasets in Hadoop.

In the “Compare Models” tab, I then download the score code, which provides me with the following:

  1. A .sas file containing DS2 code that performs all the data preprocessing steps, such as binning and imputation, from the pipeline above. I load this .sas file into a location that can be viewed in SAS Studio.
  2. A .sashdat file in the "Models" caslib that is a binary representation of the model, called an ASTORE, which is used to score the model in Hadoop.

Opening the dm_epscore.sas file in SAS Studio, I see that the comments at the top tell me which ASTORE file is needed to publish the model.

This scoring file allows the data preparation steps within the pipeline above to be published to Hadoop as well. In this case, the file bins the variables before applying the gradient boosting model.

The scoring file then invokes the ASTORE file needed to score the model in Hadoop.

Now, I switch to SAS Studio to publish and score my model in Hadoop.  The full code can be found here.

Below is the syntax to publish the model. I make sure to set the classpath variable to the appropriate jar and config files for my Hadoop cluster. Note that you will need permission to read from and write to the modeldir directory. Publishing the model converts the .sas scoring code and the ASTORE file into MapReduce code for execution in the cluster.
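
A rough sketch of what this publish step can look like uses the PUBLISHMODEL statement of PROC SCOREACCEL (part of the SAS In-Database Technologies for SAS Viya). The specific option names, file names, and paths below are assumptions for illustration rather than the exact code used here; verify them against the documentation for your release.

/* hypothetical sketch of the publish step; option names and values are assumptions */
cas mysess;                                           /* start a CAS session */
%let classpath = /opt/hadoop/jars:/opt/hadoop/conf;   /* assumed location of Hadoop jar and config files */

proc scoreaccel sessref=mysess;
   publishmodel
      target=hadoop
      modelname="telco_churn"                     /* name of the published model */
      storefiles="gradboost_astore.sashdat"       /* ASTORE file from the Models caslib (assumed name) */
      programfile="/path/to/dm_epscore.sas"       /* DS2 preprocessing and scoring code (assumed path) */
      modeldir="/user/ankram"                     /* HDFS directory with read and write permission */
      username="ankram"
      classpath="&classpath."
      ;
quit;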

This publishing code will create a directory called "telco_churn" in my home directory in HDFS, /user/ankram. In a SAS Viya environment co-located with Hadoop, the "CASUSERHDFS" caslib points to this location by default, allowing me to confirm that the "telco_churn" directory was successfully published.
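
One quick way to check this from SAS Studio (a minimal sketch; the caslib name comes from the text above, and your personal caslib may display as CASUSERHDFS(userid)):

proc casutil;
   list files incaslib="casuserhdfs";   /* look for the telco_churn directory */
run;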

The next step is to score the model in Hadoop. The code below scores the "looking_glass_v4" table in Hive and creates a new table called "looking_glass_v4_scored", without the data ever leaving Hadoop.
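
As with the publish step, the following is only a sketch of the scoring call, based on the RUNMODEL statement of PROC SCOREACCEL; the option names are assumptions to be checked against the documentation.

/* hypothetical sketch of the in-database scoring step; option names are assumptions */
proc scoreaccel sessref=mysess;
   runmodel
      target=hadoop
      modelname="telco_churn"
      modeldir="/user/ankram"
      username="ankram"
      schema="default"                            /* Hive schema that contains the input table */
      intable="looking_glass_v4"                  /* input Hive table to score */
      outtable="looking_glass_v4_scored"          /* scored output table (assumed option name) */
      forceoverwrite=yes
      classpath="&classpath."
      ;
quit;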

If everything is configured properly, the log should show that the SAS Embedded Process executed correctly.

Using a previously set up caslib called "Hivelib" that points to the default schema in the Hive server, I can now load the "Looking_Glass_v4_scored" dataset into CAS to view the table.
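
A minimal sketch of that load step with PROC CASUTIL (the caslib and table names come from the text; the output caslib is an assumption):

proc casutil;
   load casdata="looking_glass_v4_scored" incaslib="hivelib"
        casout="looking_glass_v4_scored" outcaslib="casuser";   /* output caslib is an assumption */
run;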

Using the Table Viewer, I can then see the predicted probabilities of churn for each individual.

To conclude, many organizations have very large datasets, often terabytes or larger, and they often find that minimizing data movement is critical to successfully putting models into production. The in-database technologies for Hadoop on SAS Viya allow you and your fellow data scientists to easily prepare data and score large-scale models entirely in Hadoop, with the data never leaving the environment. You can now focus on solving more problems and are no longer at the mercy of large datasets and network latency.

Publishing and running models to Hadoop in SAS Viya was published on SAS Users.

September 12, 2018
 

Have you ever tried to type a movie title by using a TV remote control? Both Netflix and Amazon Video provide an interface (a virtual keyboard) that enables you to use the four arrow keys of a standard remote control to type letters. The letters are arranged in a regular grid and to navigate from one letter to another can require you to press the arrow keys many times. Fortunately, the software displays partial matches as you choose each letter, so you rarely need to type the entire title. Nevertheless, I was curious: Which interface requires the fewest number of key presses (on average) to type a movie title by using only the arrow keys?

The layout of the navigation screens

The following images show the layout of the navigation screen for Netflix and for Amazon Video.

The Netflix grid has 26 letters and 10 numbers arranged in a 7 x 6 grid. The letters are arranged in alphabetical order. The first row contains large keys for the Space character (shown in a light yellow color) and the Backspace key (dark yellow). Each of those keys occupies three columns of the grid, which means that you can get to the Space key by pressing Up Arrow from the A, B, or C key. When you first arrive at the Netflix navigation screen, the cursor is positioned on the A key (shown in pink).

The letters in the Amazon Video grid are arranged in a 3 x 11 grid according to the standard QWERTY keyboard layout. When you first arrive at the navigation screen, the cursor is positioned on the Q key. The Space character can be typed by using a large key in the last row (shown in a light yellow color) that spans columns 8, 9, and 10. The Space character can be accessed from the M key (press Right Arrow) or from the K, L, or '123' keys on the second row. The '123' key (shown in green) navigates to a second keyboard that contains numbers and punctuation. The numbers are arranged in a 1 x 11 grid. When you arrive at that keyboard, the cursor is on the 'ABC' key, which takes you back to the keyboard that contains letters. (Note: The real navigation screen places the '123' key under the 0 key. However, the configuration in the image is equivalent because in each case you must press one arrow key to get to the 0 (zero) key.) For simplicity, this article ignores punctuation in movie titles.

Which navigation interface is more efficient?

I recently wrote a mathematical discussion about navigating through grids by moving only Up, Down, Left, and Right. The article shows that nearly square grids are more efficient than short and wide grids, assuming that the letters that you type are chosen at random. A 7 x 6 grid requires an average of 4.23 key presses per character whereas a 4 x 11 grid requires an average of 4.89 key presses per character. Although the difference might not seem very big, the average length of a movie title is about 15 characters (including spaces). For a 15-character title, the mathematics suggests that using the Netflix interface requires about 10 fewer key presses (on average) than the Amazon Video interface.

If you wonder why I did not include the Hulu interface in this comparison, it is because the Hulu "keyboard" is a 1 x 28 grid that contains all letters and the space and backspace keys. Theory predicts an average of 9.32 key presses per character, which is almost twice as many key presses as for the Netflix interface.
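
One way to reproduce these per-character predictions is to compute the mean of all pairwise city-block distances between keys on each grid, which is the expected number of arrow presses when characters are chosen uniformly at random. The following SAS/IML sketch uses the EXPANDGRID and DISTANCE functions; the grid dimensions come from the discussion above.

proc iml;
/* mean pairwise L1 ("city block") distance between keys on an R x C grid,
   which equals the expected arrow presses per character for random typing */
start AvgGridDist(R, C);
   s = expandgrid(1:R, 1:C);            /* (row, col) coordinates of every key */
   D = distance(s, "CityBlock");        /* matrix of pairwise L1 distances */
   return ( mean(colvec(D)) );
finish;

print (AvgGridDist(7, 6))[L="7 x 6 grid"  format=6.2]
      (AvgGridDist(4, 11))[L="4 x 11 grid" format=6.2]
      (AvgGridDist(1, 28))[L="1 x 28 grid" format=6.2];

The printed values are approximately 4.23, 4.89, and 9.32, which match the numbers quoted above.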

Comparing two interfaces for selecting movie titles

You might wonder how well this theoretical model matches reality. Movie titles are not a jumble of random letters! How do the Netflix and Amazon Video interfaces compare when they are used to type actual movie titles?

To test this question, I downloaded the titles of 1,000 highly-rated movies. I wrote a SAS program that calculates the number of arrow-key presses needed to type each movie title in each interface. This section summarizes the results.
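
The program itself is linked at the end of this post, but the core computation is easy to sketch in SAS/IML: map each character to its (row, column) position on a grid and add up the L1 distances between consecutive characters. The function below is a simplified illustration (uppercase letters only, no Space or number keys), not the actual program used for the analysis.

proc iml;
/* illustrative sketch: count arrow presses to type a title on a letters-only grid.
   'layout' lists the keys in row-major order; 'nCols' is the number of columns. */
start GridPos(k, nCols);
   r = ceil(k / nCols);                   /* row of the k_th key */
   c = k - (r-1)*nCols;                   /* column of the k_th key */
   return ( r || c );
finish;

start CountKeypresses(title, layout, nCols);
   s = upcase(title);
   total = 0;
   currPos = GridPos(1, nCols);           /* cursor starts on the first key in the layout */
   do i = 1 to length(s);
      k = loc( layout = substr(s, i, 1) );   /* position of this character, if it is on the grid */
      if ncol(k) > 0 then do;
         newPos = GridPos(k[1], nCols);
         total = total + sum(abs(newPos - currPos));   /* L1 (city block) moves */
         currPos = newPos;
      end;                                /* characters not on the grid (such as spaces) are skipped */
   end;
   return ( total );
finish;

/* alphabetical 6-column layout, similar to the letter portion of the Netflix grid */
layout = {A B C D E F G H I J K L M N O P Q R S T U V W X Y Z};
print (CountKeypresses("Moon", layout, 6))[L="Arrow presses for 'Moon'"];   /* 5 presses */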

The expression "press the arrow key" is a bit long, so I will abbreviate it as "keypress" (one word). The "number of times that you need to press the arrow keys to specify a movie title" is similarly condensed to "the number of keypresses."

For these 1,000 movie titles, the Netflix interface requires an average of 50.9 keypresses per title, or 3.32 keypresses per character. The Amazon Video interface requires an average of 61.4 keypresses per title, or 4.01 keypresses per character. Thus, on average, the Netflix interface requires 10.56 fewer keypresses per title, which closely agrees with the mathematical prediction that considers only the shape of the keyboard interface. A paired t test indicates that the difference between the means is statistically significant. The difference between medians is similar: 45 for the Netflix interface and 56 for the Amazon interface.
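
For reference, a paired t test of this kind can be run with PROC TTEST. The sketch below assumes a dataset named Titles with one row per movie and variables Amazon and Netflix that hold the keypress counts; those names are assumptions for illustration.

proc ttest data=Titles;
   paired Amazon*Netflix;      /* tests whether the mean per-title difference in keypresses is zero */
run;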

The following comparative histogram (click to enlarge) shows the distribution of the number of keypresses for each of the 1,000 movie titles for each interface. The upper histogram shows that most titles require between 30 and 80 keypresses in the Amazon interface, with a few requiring more than 140 keypresses. In contrast, the lower histogram indicates that most titles require between 20 and 60 keypresses in the Netflix interface; relatively fewer titles require more than 140 keypresses.

You can also use a scatter plot to compare the number of keypresses that are required for each interface. Each marker in the following scatter plot shows the number of keypresses for a title in the Amazon interface (horizontal axis) and the Netflix interface (vertical axis). Markers that are below and to the right of the diagonal (45-degree) line are titles for which the Netflix interface requires fewer keypresses. Markers that are above and to the left of the diagonal line are titles for which the Amazon interface is more efficient. You can see that most markers are below the diagonal line. In fact, 804 titles require fewer keypresses in the Netflix interface, only 177 favor the Amazon interface, and 19 require the same number of keypresses in both interfaces. Clearly, the Netflix layout of the virtual keyboard is more efficient for specifying movie titles.
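
A scatter plot like the one described can be created with PROC SGPLOT; again, the dataset and variable names below are illustrative assumptions.

proc sgplot data=Titles;
   scatter x=Amazon y=Netflix;               /* one marker per movie title */
   lineparm x=0 y=0 slope=1;                 /* 45-degree diagonal reference line */
   xaxis label="Keypresses (Amazon interface)";
   yaxis label="Keypresses (Netflix interface)";
run;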

Movie titles that require the most and the fewest key presses

The scatter plot and histograms reveal that there are a few movies whose titles require many keypresses. Here is a list of the 10 titles that require the most keypresses when using the Amazon interface:

Most of the titles are long. However, one (4 Months, 3 Weeks and 2 Days) is not overly long but instead requires shifting back and forth between the two keyboards in the Amazon interface. That results in a large number of keypresses in the Amazon interface (178) and a large difference between the keypresses required by each interface. In fact, the absolute difference for that title (75) is the largest difference among the 1,000 titles.

You can also look at the movie titles that require few keypresses. The following table shows titles that require fewer than 10 keypresses in either interface. The titles that require the fewest keypresses in the Netflix interface are M, Moon, PK, and Up. The titles that require the fewest keypresses in the Amazon interface are Saw, M, Creed, and Up. It is interesting that Saw, which has three letters, requires fewer keypresses than M, which has one letter. That is because the S, A, and W keys are all located in the upper left of the QWERTY keyboard, whereas the letter M is in the bottom row, far from the starting position. (Recall that the cursor begins on the Q key in the upper left corner.)

Summary

In summary, both Netflix and Amazon Video provide an interface that enables customers to select movie titles by using the four arrow keys on a TV remote control. The Netflix interface is a 7 x 6 grid of letters; the Amazon interface is a 3 x 11 QWERTY keyboard and a separate keyboard for numbers. In practice, both interfaces display partial matches and you only need to type a few characters. However, it is interesting to statistically compare the interfaces in terms of efficiency. For a set of 1,000 movie titles, the Netflix interface requires, on average, 10.6 fewer keypresses than the Amazon interface to completely type the titles. This article also lists the movie titles that require the most and the fewest number of key presses.

If you would like to duplicate or extend this analysis, you can download the SAS program that contains the data.

The post Two interfaces for typing text by using a TV remote control appeared first on The DO Loop.

September 11, 2018
 

The data I was analyzing was about “trust.” Maybe that’s what got me thinking about Stephen Sondheim, the Broadway composer and lyricist of musicals like Sunday in the Park with George and Into the Woods and the lyricist for West Side Story. Trust is a heavy emotional topic.  Developmental psychologists [...]

Data visualization, the musical was published on SAS Voices by Elliot Inman

September 11, 2018
 

Here in North Carolina (NC), we're pretty much resigned to the fact that many of the hurricanes in the Atlantic Ocean are going to visit us. NC sticks out farther into the ocean than most of our neighboring states, and that just makes a tempting target for the hurricanes. But [...]

The post Hurricane Florence - and other category 4 hurricanes that hit NC appeared first on SAS Learning Post.

September 10, 2018
 

The customer journey is at the forefront of every discussion about modern marketing. The idea that customers move in premeditated or, at the very least, marketer-meditated paths between well-defined states is alluring (and comforting) to a marketing professional. Of course, even a cursory examination of a web path analytics (Sankey) [...]

AI-based customer journey optimization was published on Customer Intelligence Blog.

September 10, 2018
 

Given a rectangular grid with unit spacing, what is the expected distance between two random vertices, where distance is measured in the L1 metric? (Here "random" means "uniformly at random.") I recently needed this answer for some small grids, such as the one to the right, which is a 7 x 6 grid. The graph shows that the L1 distance between the points (2,6) and (5,2) is 7, the length of the shortest path that connects the two points. The L1 metric is sometimes called the "city block" or "taxicab" metric because it measures the distance along the grid instead of "as the bird flies," which is the Euclidean distance.

The answer to the analogous question for the continuous case (a solid rectangle) is difficult to compute. The main result is that the expected distance is less than half of the diameter of the rectangle. In particular, among all rectangles of a given area, a square is the rectangle that minimizes the expected distance between random points.

Although I don't know a formula for the expected distance on a discrete regular grid, the grids in my application were fairly small so this article shows how to compute all pairwise distances and explicitly find the average (expected) value. The DISTANCE function in SAS/IML makes the computation simple because it supports the L1 metric. It is also simple to perform computer experiments to show that among all grids that have N*M vertices, the grid that is closest to being square minimizes the expected L1 distance.

L1 distance on a regular grid

An N x M grid contains NM vertices. Therefore the matrix of pairwise distances is NM x NM. Without loss of generality, assume that the vertices have X coordinates between 1 and N and Y coordinates between 1 and M. Then the following SAS/IML function defines the distance matrix for vertices on an N x M grid. To demonstrate the computation, the distance matrix for a 4 x 4 grid is shown, along with the average distance between the 16 vertices:

proc iml;
start DistMat(rows, cols);
   s = expandgrid(1:rows, 1:cols);   /* create (x, y) ordered pairs */
   return distance(s, "CityBlock");  /* pairwise L1 distance matrix */
finish;
 
/* test function on a 4 x 4 grid */
D = DistMat(4, 4);
AvgDist = mean(colvec(D));
print D[L="L1 Distance Matrix for 4x4 Grid"];
print AvgDist[L="Avg Distance on 4x4 Grid"];
L1 distance matrix for vertices on a regular 4 x 4 grid; average L1 distance on a 4 x 4 grid

For an N x M grid, the L1 diameter of the grid is the L1 distance between opposite corners. That distance is always (N-1)+(M-1), which equates to 6 units for a 4 x 4 grid. As for the continuous case, the expected L1 distance is less than half the diameter. In this case, E(L1 distance) = 2.5.
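
Continuing the session above, a quick check confirms that the expected distance is less than half the diameter:

diam = (4-1) + (4-1);                                 /* L1 diameter of the 4 x 4 grid */
print diam (AvgDist/diam)[L="AvgDist / diameter"];    /* about 0.417, which is less than 0.5 */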

A comparison of distances for grids of different aspect ratios

As indicated previously, the expected distance between two random vertices on a grid depends on the aspect ratio of the grid. A grid that is nearly square has a smaller expected distance than a short-and-wide grid that contains the same number of vertices. You can illustrate this fact by computing the distance matrices for grids that each contain 36 vertices. The following computation computes the distances for five grids: a 1 x 36 grid, a 2 x 18 grid, a 3 x 12 grid, a 4 x 9 grid, and a 6 x 6 grid.

/* average L1 distance for 36 vertices arranged in grids of several shapes */
N=36;
rows = {1, 2, 3, 4, 6};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
/* show average distance as a decimal and as a fraction */
numer = AvgDist*108; AvgDistFract = char(numer) + " / 108"; 
print rows cols AvgDist AvgDistFract;

The table shows that short-and-wide grids have an average distance that is much greater than that of a nearly square grid with the same number of vertices. When the points are arranged in a 6 x 6 grid, the distance matrix naturally decomposes into a block matrix of 6 x 6 symmetric blocks, where each block corresponds to a row. When the points are arranged in a 3 x 12 grid, the distance matrix decomposes into a block matrix of 12 x 12 blocks. The following heat maps visualize the patterns in the distance matrices:

D = DistMat(6, 6);
call heatmapcont(D) title="L1 Distance Between Vertices in a 6 x 6 Grid";
D = DistMat(3, 12);
call heatmapcont(D) title="L1 Distance Between Vertices in a 3 x 12 Grid";
Visualization of the L1 distance matrices for items arranged on a 6 x 6 grid and on a 3 x 12 grid

Expected (average) distance versus number of rows

You can demonstrate how the average distance depends on the number of rows by choosing the number of vertices to be a highly composite number such as 5,040. A highly composite number has many factors. The following computation computes the average distance between points on 30 grids that each contain 5,040 vertices. A line chart then displays the average distance as a function of the number of rows in the grid:

N = 5040;               /* 5040 is a highly composite number */
rows = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20, 
       21, 24, 28, 30, 35, 36, 40, 42, 45, 48, 56, 60, 63, 70};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
 
title "Average L1 Distance Between 5040 Objects in a Regular Grid";
call series(rows, AvgDist) grid={x y} other="yaxis offsetmin=0 grid;";
Expected L1 distance between vertices on a regular grid with 5,040 points. The curve demonstrates that nearly square grids have a much lower average distance than short and wide grids.

Each point on the graph represents the average distance for an N x M grid where NM = 5,040. The horizontal axis displays the value of N (rows). The graph shows that nearly square grids (rows greater than 60) have a much lower average distance than very short and wide grids (rows less than 10). The scale of the graph makes it seem like there is very little difference between the average distance in a grid with 40 versus 70 rows, but that is somewhat of an illusion. A 40 x 126 grid (aspect ratio = 3.15) has an average distance of 55.3; a 70 x 72 grid (aspect ratio = 1.03) has an average distance of 47.3.

Applications and conclusions

In summary, you can use the DISTANCE function in SAS/IML to explicitly compute the expected L1 distance (the "city block" distance) between random points on a regular grid. You can minimize the average pairwise distance by making the grid as square as possible.

City planning provides a real-world application of the L1 distance. If you are tasked with designing a city with N buildings along a grid, then the average distance between buildings is smallest when the grid is square. Of course, in practice, some buildings have a higher probability of being visited than others (such as a school or a grocery). You should position those buildings in the center of town to shorten the average distance that people travel.

The post Distances on rectangular grids appeared first on The DO Loop.