July 18, 2017
 

St. Louis Union Station welcomed its first passenger train on Sept. 2, 1894, at 1:45 p.m., and became one of the largest and busiest passenger rail terminals in the world. Back in those days, North American railroads widely used a system called Timetable and Train Order Operation to establish [...]

The post Finding important predictors: Using your data to explain what’s going on appeared first on SAS Learning Post.

July 18, 2017
 

In a digital world where billions of customers make trillions of visits across multi-channel marketing environments, big data has drawn researchers' attention all over the world. Customers leave behind a huge trail of data in digital channels, and given exploding data volumes, it is becoming extremely difficult to find the right data to support the right decision.

This can be a big issue for brands. Traditional databases have not been efficient enough to capture the sheer volume or complexity of the datasets we accumulate on the web, on social media and elsewhere.

A leading consulting firm, for example, boasts that one of its clients has 35 million customers and 9 million unique visitors daily on its website, leaving behind a huge amount of shopper data every second. Tools for segmenting this much data to support targeted marketing activities are not readily available. To make matters more complicated, the data can be both structured and unstructured, which makes traditional methods of analysing data unsuitable.

Tackling market segmentation

Market segmentation is the process by which market researchers identify key attributes about customers and potential customers, which can then be used to create distinct target market groups. Without a sound segmentation base, advertising and sales teams can lose large amounts of money targeting the wrong set of customers.

Well-known methods of segmenting consumer markets include geographic segmentation, demographic segmentation, behavioural segmentation, multi-variable account segmentation and others. Common statistical approaches to segmenting various markets include:

  • Clustering algorithms such as K-Means clustering
  • Statistical mixture models such as Latent Class Analysis
  • Ensemble approaches such as Random Forests

Most of these methods assume that the number of clusters is known in advance, which in reality is rarely the case. Several approaches exist for estimating the number of clusters, but strong evidence about the quality of the resulting clusters is often lacking.
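
For example, here is a minimal SAS sketch of the k-means approach. The customers data set and its RFM-style variables (recency, frequency, monetary) are hypothetical placeholders; PROC FASTCLUS carries out the k-means clustering, and the CCC option of PROC CLUSTER is one common (but not definitive) heuristic for suggesting a plausible number of clusters:

proc fastclus data=customers maxclusters=4 out=segments;  /* k-means with k=4 */
   var recency frequency monetary;       /* hypothetical RFM variables */
run;

proc cluster data=customers method=ward ccc print=10;     /* CCC as a guide for choosing k */
   var recency frequency monetary;
run;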

To add to the above issues, clusters can be domain-specific, meaning they are built to solve particular domain problems such as:

  • Delimitation of species of plants or animals in biology.
  • Medical classification of diseases.
  • Discovery and segmentation of settlements and periods in archaeology.
  • Image segmentation and object recognition.
  • Social stratification.
  • Market segmentation.
  • Efficient organization of databases for search queries.

There are also quite general tasks for which clustering is applied in many subject areas:

  • Exploratory data analysis looking for “interesting patterns” without prescribing any specific interpretation, potentially creating new research questions and hypotheses.
  • Information reduction and structuring of sets of entities from any subject area for simplification, effective communication, or effective access/action such as complexity reduction for further data analysis, or classification systems.
  • Investigating the correspondence of a clustering in specific data with other groupings or characteristics, either hypothesized or derived from other data.

What is meant by a “cluster” can differ a lot depending on the application, and the cluster definition and methodology have to be adapted to the specific aim of clustering in the application of interest.

Van Mechelen et al. (1993) set out objective characteristics that “true clusters” should possess, including the following (the first two criteria are illustrated in the sketch after the list):

  • Within-cluster dissimilarities should be small.
  • Between-cluster dissimilarities should be large.
  • Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial process models.
  • Members of a cluster should be well represented by its centroid.
  • The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
  • Clusters should be stable.
  • Clusters should correspond to connected areas in data space with high density.
  • The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
  • It should be possible to characterize the clusters using a small number of variables.
  • Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
  • Features should be approximately independent within clusters.
  • The number of clusters should be low.
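
To make the first two criteria concrete, here is a minimal SAS/IML sketch. It assumes the hypothetical segments data set created by PROC FASTCLUS in the earlier sketch, which contains a CLUSTER variable, and it computes the average within-cluster and between-cluster dissimilarities; you would like the former to be much smaller than the latter:

proc iml;
use segments;                          /* hypothetical clustered data from the sketch above */
read all var {recency frequency monetary} into X;
read all var {cluster} into c;
close;

D = distance(X);                       /* pairwise Euclidean distances between observations */
same = (c = c`);                       /* 1 if two observations share a cluster */
offDiag = 1 - I(nrow(X));              /* exclude self-distances on the diagonal */
avgWithin  = sum(D # same # offDiag) / sum(same # offDiag);
avgBetween = sum(D # (1-same)) / sum(1-same);
print avgWithin avgBetween;            /* want avgWithin << avgBetween */
quit;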

Reference

  1. I. Van Mechelen, J. Hampton, R. S. Michalski, and P. Theuns (1993). Categories and Concepts: Theoretical Views and Inductive Data Analysis. Academic Press, London.

What are the characteristics of “true clusters?” was published on SAS Users.

July 17, 2017
 

SAS power users (and actually, power users of any application) like to customize their environment for maximum productivity. Long-time SAS users remember the KEYS window in SAS display manager, which allows you to assign SAS commands to "hot keys" in your SAS session. These users will invest many hours to come up with the perfect keyboard mappings to suit the type of work that they do.

When using SAS Enterprise Guide, these power users often lament the lack of a similar KEYS window. But these people needn't suffer with the default keys -- a popular tool named AutoHotkey can fill the gap for this and for any other Windows application. I've recommended it to many SAS users over the years, and I've heard positive feedback from those who have adopted it. AutoHotkey is free, and it's lightweight and portable; even users with "locked-down" systems can usually make use of it.

AutoHotkey provides its own powerful scripting language, which allows you to define new behaviors for any key combination that you want to repurpose. When you activate these scripts, AutoHotkey gets first crack at processing your keypress, so you can redirect the built-in key mappings for any Windows application. I'll share two examples of different types of scripts that users have found helpful.

"Unmap" a key that you don't like

In SAS Enterprise Guide, F3 and F8 are both mapped to "Run program". A newer user found the F8 mapping confusing because she had a habit of using that key for something else, and so became quite annoyed when she kept accidentally running her process before she was ready.

The following AutoHotkey script "eats" the F8 keypress. The logic first checks whether the active process is SAS Enterprise Guide (seguide.exe). If so, the script simply swallows the keystroke, effectively vetoing the F8 action; otherwise, it passes F8 through unchanged.

F8::
WinGet, Active_ID, ID, A
WinGet, Active_Process, ProcessName, ahk_id %Active_ID%
if ( Active_Process = "seguide.exe" )
{
  ; eat the keystroke
}
else
{
  Send {F8}    ; pass F8 through to all other applications
}
return

Map a single key to an action that requires multiple keys or clicks

I recently shared a tip to close all open data sets in SAS Enterprise Guide. It's a feature on the Tools menu that launches a special window, and some readers wished for a single key mapping to get the job done. Using AutoHotkey, you can map a series of clicks/keystrokes to a single key.

The following script will select the menu item, activate the "View Open Data Sets" window, and then select Close All.

F12::
WinGet, Active_ID, ID, A
WinGet, Active_Process, ProcessName, ahk_id %Active_ID%
if ( Active_Process = "seguide.exe" )
{
  Sleep, 100
  Send {Alt Down}{Alt Up}{t}    ; activate the menu bar, open the Tools menu
  Sleep, 100
  Send, {v}                     ; select "View Open Data Sets"
  WinActivate, View Open Data Sets ahk_class WindowsForms10.Window.8.app.0.143a722_r12_ad1
  Send, {Tab}                   ; move focus to the Close All button
  Sleep, 100
  Send, {Space}                 ; press the button
  Sleep, 500
  Send, {Esc}                   ; close the window
}
return

You'll see that one of the script commands activates the "View Open Data Sets" window. The window "class" is referenced, and the class name is hardly intuitive. AutoHotkey includes a "Window spy" utility called "Active Window Info" that can help you to find the exact name of the window you need to activate.

Window Spy

AutoHotkey can direct mouse movements and clicks, but those directives might not be reliable in different Windows configurations. In my scripts, I rely on simulated keyboard commands. This script activates the top-level menu with Alt+"t" (for Tools), then "v" (for the "View Open Data Sets" window), then TAB to the "Close All" button, space bar to press the button, then Escape to close the window. Each action takes some time to take effect, so "Sleep" commands are inserted (with times in milliseconds) to allow the actions to complete.

Every action in SAS Enterprise Guide is accessible by the keyboard (even if several keystrokes are required). If you want to see all of the already-defined keyboard mappings, search the SAS Enterprise Guide help for "keyboard shortcuts."

Key help

Automate more with AutoHotkey

In this article, I've only just scratched the surface of how you can customize keys and automate actions in SAS Enterprise Guide. Some of our users have asked us to build in the ability to customize key actions within the application. While that might be a good enhancement within the boundaries of your SAS applications, a tool like AutoHotkey can help you to automate your common tasks within SAS and across other applications that you use. The scripting language presents a bit of a learning curve, but the online help is excellent. And there is a large community of AutoHotkey users who have published hundreds of useful examples.

Have you used AutoHotkey to automate any SAS tasks? If so, please share your tips here in the comments or within the SAS Enterprise Guide community.

The post Customize your keys in SAS Enterprise Guide with AutoHotkey appeared first on The SAS Dummy.

July 17, 2017
 

My previous blog post focused on a graph showing the % of women earning STEM degrees in various fields. While that graph was designed to answer a very specific question, let's now look at the data from a broader perspective. Let's look at the total number of STEM degrees [...]

The post Tracking STEM degrees - a deeper look! appeared first on SAS Learning Post.

July 17, 2017
 

Venturing into the world of work after university can be an intimidating experience. But fear not! Some of SAS' finest graduate recruiters have teamed up to give you some tips on how to fine tune your job applications, shine at open days, and kick-start your career. How to build your [...]

The post Preparing to enter the job market appeared first on SAS Analytics U Blog.

July 17, 2017
 

An important problem in machine learning is the "classification problem." In this supervised learning problem, you build a statistical model that predicts a set of categorical outcomes (responses) based on a set of input features (explanatory variables). You do this by training the model on data for which the outcomes are known. For example, researchers might want to predict the outcomes "Lived" or "Died" for patients with a certain disease. They can use data from a clinical trial to build a statistical model that uses demographic and medical measurements to predict the probability of each outcome.

Prediction regions for the binary classification problem. Graph created in SAS.

SAS software provides several procedures for building parametric classification models, including the LOGISTIC and DISCRIM procedures. SAS also provides various nonparametric models, such as spline effects, additive models, and neural networks.

For each input, the statistical model predicts an outcome. Thus the model divides the input space into disjoint regions: one region in which the first outcome is the most probable, another in which the second outcome is the most probable, and so forth. In many textbooks and papers, the classification problem is illustrated by using a two-dimensional graph that shows the prediction regions overlaid with the training data, as shown in the adjacent image, which visualizes a binary outcome and a linear boundary between regions. (Click to enlarge.)

This article shows three ways to visualize prediction regions in SAS:

  1. The polygon method: A parametric model provides a formula for the boundary between regions. You can use the formula to construct polygonal regions.
  2. The contour plot method: If there are two outcomes and the model provides probabilities for the first outcome, then the 0.5 contour divides the feature space into disjoint prediction regions.
  3. The background grid method: You can evaluate the model on a grid of points and color each point according to the predicted outcome. You can use small markers to produce a faint indication of the prediction regions, or you can use large markers if you want to tile the graph with color.

This article uses logistic regression to discriminate between two outcomes, but the principles apply to other methods as well. The SAS documentation for the DISCRIM procedure contains some macros that visualize the prediction regions for the output from PROC DISCRIM.

A logistic model to discriminate two outcomes

To illustrate the classification problem, consider some simulated data in which the Y variable is a binary outcome and the X1 and X2 variables are continuous explanatory variables. The following call to PROC LOGISTIC fits a logistic model and displays the parameter estimates. The STORE statement creates an item store that enables you to evaluate (score) the model on future observations. The DATA step creates a grid of evenly spaced points in the (x1, x2) coordinates, and the call to PROC PLM scores the model at those locations. In the PRED data set, GX and GY are the coordinates on the regular grid and PREDICTED is the probability that Y=1.

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   store work.LogiModel;                /* save model to item store */
run;
 
data Grid;                              /* create grid in (x1,x2) coords */
do x1 = 0 to 1 by 0.02;
   do x2 = -7.5 to 7.5 by 0.3;
      output;
   end;
end;
run;
 
proc plm restore=work.LogiModel;        /* use PROC PLM to score model on a grid */
   score data=Grid out=Pred(rename=(x1=gx x2=gy)) / ilink;  /* evaluate the model on new data */
run;

The polygon method

Parameter estimates for the logistic model

This method is only useful for simple parametric models. Recall that the logistic function is 0.5 when its argument is zero, so the level set for 0 of the linear predictor divides the input space into prediction regions. For the parameter estimates shown to the right, the level set {(x1,x2) | 2.3565 -4.7618*x1 + 0.7959*x2 = 0} is the boundary between the two prediction regions. This level set is the graph of the linear function x2 = (-2.3565 + 4.7618*x1)/0.7959. You can compute two polygons that represent the regions: let x1 vary between [0,1] (the horizontal range of the data) and use the formula to evaluate x2, or assign x2 to be the minimum or maximum vertical value of the data.

After you have computed the polygonal regions, you can use the POLYGON statement in PROC SGPLOT to visualize them. The graph is shown at the top of this article. The drawback of this method is that it requires a parametric model for which one variable is an explicit function of the other. However, it creates a beautiful image!
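
For reference, here is a minimal sketch of the construction. It uses the parameter estimates quoted above and assumes the vertical data range [-7.5, 7.5] from the Grid DATA step; with these estimates, the region above the boundary line is the region for which Y=1 is the most probable outcome:

data Regions;
/* boundary line: x2 = (-2.3565 + 4.7618*x1) / 0.7959 */
ID = 1;                                        /* region above the line (predict Y=1) */
px = 0; py = (-2.3565 + 4.7618*0)/0.7959; output;
px = 1; py = (-2.3565 + 4.7618*1)/0.7959; output;
px = 1; py =  7.5; output;
px = 0; py =  7.5; output;
ID = 2;                                        /* region below the line (predict Y=0) */
px = 0; py = (-2.3565 + 4.7618*0)/0.7959; output;
px = 1; py = (-2.3565 + 4.7618*1)/0.7959; output;
px = 1; py = -7.5; output;
px = 0; py = -7.5; output;
run;

proc sgplot data=Regions;
   polygon x=px y=py id=ID / fill group=ID transparency=0.75;
   /* concatenate Regions with the training data to overlay a SCATTER statement */
run;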

The contour plot method

Given an input value, many statistical models produce probabilities for each outcome. If there are only two outcomes, you can plot a contour plot of the probability of the first outcome. The 0.5 contour divides the feature space into disjoint regions.

There are two ways to create such a contour plot. The easiest way is to use the EFFECTPLOT statement, which is supported in many SAS/STAT regression procedures. The following statements show how to use the EFFECTPLOT statement in PROC LOGISTIC to create a contour plot, as shown to the right:

proc logistic data=LogisticData;
   model y(Event='1') = x1 x2;          
   effectplot contour(x=x1 y=x2);       /* 2. contour plot with scatter plot overlay */
run;

Unfortunately, not every SAS procedure supports the EFFECTPLOT statement. An alternative is to score the model on a regular grid of points and use the Graph Template Language (GTL) to create a contour plot of the probability surface. You can read my previous article about how to use the GTL to create a contour plot.
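
For reference, here is a minimal GTL sketch that follows the pattern in that article. It reuses the Pred data set that was scored on the regular grid; the template name and the number of contour levels are arbitrary choices:

proc template;
define statgraph ContourScore;
   begingraph;
      layout overlay;
         contourplotparm x=gx y=gy z=Predicted /   /* gridded probabilities */
            contourtype=fill nhint=12 colormodel=twocolorramp name="c";
         continuouslegend "c" / title="Probability";
      endlayout;
   endgraph;
end;
run;

proc sgrender data=Pred template=ContourScore;
run;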

The drawback of this method is that it only applies to binary outcomes. The advantage is that it is easy to implement, especially if the modeling procedure supports the EFFECTPLOT statement.

The background grid method

Prediction region for a classification problem with two outcomes

In this method, you score the model on a grid of points to obtain the predicted outcome at each grid point. You then create a scatter plot of the grid, where the markers are colored by the outcome, as shown in the graph to the right.

When you create this graph, you get to choose how large to make the dots in the background. The image to the right uses small markers, which is the technique used by Hastie, Tibshirani, and Friedman in their book The Elements of Statistical Learning. If you use square markers and increase the size of the markers, eventually the markers tile the entire background, which makes it look like the polygon plot at the beginning of this article. You might need to adjust the vertical and horizontal pixels of the graph to get the background markers to tile without overlapping each other.

This method has several advantages. It is the most general method and can be used for any procedure and for any number of outcome categories. It is easy to implement because it merely uses the model to predict the outcomes on a grid of points. The disadvantage is that choosing the size of the background markers is a matter of trial and error; you might need several attempts before you create a graph that looks good.
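
As a minimal sketch, the following statements reuse the Pred data set from the earlier PROC PLM call and assume a 0.5 cutoff to assign the most probable outcome at each grid point; increase the marker size (and adjust the graph dimensions) to tile the background:

data PredClass;
   set Pred;
   PredY = (Predicted >= 0.5);     /* most probable outcome at each grid point */
run;

proc sgplot data=PredClass;
   scatter x=gx y=gy / group=PredY markerattrs=(symbol=SquareFilled size=4);
   /* overlay the training data with a second SCATTER statement if desired */
run;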

Summary

This article has shown several techniques for visualizing the predicted outcomes for a model that has two independent variables. The first method is limited to simple parametric models, the second is restricted to binary outcomes, and the third is a general technique that requires scoring the model on a regular grid of inputs. Whichever method you choose, PROC SGPLOT and the Graph Template Language in SAS can help you to visualize different methods for the classification problem in machine learning.

You can download the SAS program that produces the graphs in this article. Which image do you like the best? Do you have a better visualization? Leave a comment!

The post 3 ways to visualize prediction regions for classification problems appeared first on The DO Loop.

July 13, 2017
 

For the past several years, efforts have been under way to recruit more women into the STEM (science, technology, engineering, and math) fields. I recently saw an interesting graph showing the percentage of bachelor's degrees conferred to women in the US, and I wondered if I could tweak that graph [...]

The post Are more women getting STEM degrees? appeared first on SAS Learning Post.

July 13, 2017
 

Artificial intelligence promises to transform society on the scale of the industrial, technical, and digital revolutions before it. Machines that can sense, reason and act will accelerate solutions to large-scale problems in a myriad of fields, including science, finance, medicine and education, augmenting human capability and helping us to go further, [...]

5 questions about artificial intelligence with Intel's Pat Richards was published on SAS Voices by Scott Batchelor

July 12, 2017
 

I recently showed how to compute a bootstrap percentile confidence interval in SAS. The percentile interval is a simple "first-order" interval that is formed from quantiles of the bootstrap distribution. However, it has two limitations. First, it does not use the estimate for the original data; it is based only on bootstrap resamples. Second, it does not adjust for skewness in the bootstrap distribution. The so-called bias-corrected and accelerated bootstrap interval (the BCa interval) is a second-order accurate interval that addresses these issues. This article shows how to compute the BCa bootstrap interval in SAS. You can download the complete SAS program that implements the BCa computation.

As in the previous article, let's bootstrap the skewness statistic for the petal widths of 50 randomly selected flowers of the species Iris setosa. The following statements create a data set called SAMPLE and rename the variable of interest to 'X', which is the variable analyzed by the rest of the program:

data sample;
   set sashelp.Iris;     /* <== load your data here */
   where species = "Setosa";
   rename PetalWidth=x;  /* <== rename the analysis variable to 'x' */
run;

BCa interval: The main ideas

The main advantage of the BCa interval is that it corrects for bias and skewness in the distribution of bootstrap estimates. The BCa interval requires that you estimate two parameters. The bias-correction parameter, z0, is related to the proportion of bootstrap estimates that are less than the observed statistic. The acceleration parameter, a, is proportional to the skewness of the bootstrap distribution. You can use the jackknife method to estimate the acceleration parameter.

Assume that the data are independent and identically distributed. Suppose that you have already computed the original statistic and a large number of bootstrap estimates, as shown in the previous article. To compute a BCa confidence interval, you estimate z0 and a and use them to adjust the endpoints of the percentile confidence interval (CI). If the bootstrap distribution is positively skewed, the CI is adjusted to the right. If the bootstrap distribution is negatively skewed, the CI is adjusted to the left.
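
Concretely, if Φ denotes the standard normal CDF, z(p) the p-th standard normal quantile, z0 the bias correction, and a the acceleration, then the BCa interval is formed from the quantiles of the bootstrap distribution at the adjusted levels

   alpha1 = Φ( z0 + (z0 + z(α/2)) / (1 - a(z0 + z(α/2))) )
   alpha2 = Φ( z0 + (z0 + z(1-α/2)) / (1 - a(z0 + z(1-α/2))) )

These are exactly the quantities computed in step 7 of the program later in this article.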

Estimate the bias correction and acceleration

The mathematical details of the BCa adjustment are provided in Chernick and LaBudde (2011) and Davison and Hinkley (1997). My computations were inspired by Appendix D of Martinez and Martinez (2001). To make the presentation simpler, the program analyzes only univariate data.

The following SAS/IML modules encapsulate the computation of these two quantities. As described in the jackknife article, the function 'JackSampMat' returns a matrix whose columns contain the jackknife samples, and the function 'EvalStat' evaluates the statistic on each column of a matrix.

proc iml;
load module=(JackSampMat);             /* load helper function */
 
/* compute bias-correction factor from the proportion of bootstrap estimates 
   that are less than the observed estimate 
*/
start bootBC(bootEst, Est);
   B = ncol(bootEst)*nrow(bootEst);    /* number of bootstrap samples */
   propLess = sum(bootEst < Est)/B;    /* proportion of replicates less than observed stat */
   z0 = quantile("normal", propLess);  /* bias correction */
   return z0;
finish;
 
/* compute acceleration factor, which is related to the skewness of bootstrap estimates.
   Use jackknife replicates to estimate.
*/
start bootAccel(x);
   M = JackSampMat(x);                 /* each column is jackknife sample */
   jStat = EvalStat(M);                /* row vector of jackknife replicates */
   jackEst = mean(jStat`);             /* jackknife estimate */
   num = sum( (jackEst-jStat)##3 );
   den = sum( (jackEst-jStat)##2 );
   ahat = num / (6*den##(3/2));        /* ahat based on jackknife ==> not random */
   return ahat;
finish;

Compute the BCa confidence interval

With those helper functions defined, you can compute the BCa confidence interval. The following SAS/IML statements read the data, generate the bootstrap samples, compute the bootstrap distribution of estimates, and compute the 95% BCa confidence interval:

/* Input: matrix where each column of X is a bootstrap sample. 
   Return a row vector of statistics, one for each column. */
start EvalStat(M); 
   return skewness(M);               /* <== put your computation here */
finish;
 
alpha = 0.05;
B = 5000;                            /* B = number of bootstrap samples */
use sample; read all var "x"; close; /* read univariate data into x */
 
call randseed(1234567);
Est = EvalStat(x);                   /* 1. compute observed statistic */
s = sample(x, B // nrow(x));         /* 2. generate many bootstrap samples (N x B matrix) */
bStat = T( EvalStat(s) );            /* 3. compute the statistic for each bootstrap sample */
bootEst = mean(bStat);               /* 4. summarize bootstrap distrib, such as mean */
z0 = bootBC(bStat, Est);             /* 5. bias-correction factor */
ahat = bootAccel(x);                 /* 6. ahat = acceleration of std error */
print z0 ahat;
 
/* 7. adjust quantiles for 100*(1-alpha)% bootstrap BCa interval */
zL = z0 + quantile("normal", alpha/2);    
alpha1 = cdf("normal", z0 + zL / (1-ahat*zL));
zU = z0 + quantile("normal", 1-alpha/2);
alpha2 = cdf("normal", z0 + zU / (1-ahat*zU));
call qntl(CI, bStat, alpha1//alpha2); /* BCa interval */
 
R = Est || BootEst || CI`;          /* combine results for printing */
print R[c={"Obs" "BootEst" "LowerCL" "UpperCL"} format=8.4 L="95% Bootstrap Bias-Corrected CI (BCa)"];

Bias-corrected and accelerated BCa interval for a bootstrap distribution

The BCa interval is [0.66, 2.29]. For comparison, the bootstrap percentile CI for the bootstrap distribution, which was computed in the previous bootstrap article, is [0.49, 1.96].

Notice that by using the bootBC and bootAccel helper functions, the program is compact and easy to read. One of the advantages of the SAS/IML language is the ease with which you can write user-defined functions that encapsulate sub-computations.

You can visualize the analysis by plotting the bootstrap distribution overlaid with the observed statistic and the 95% BCa confidence interval. Notice that the BCa interval is not symmetric about the bootstrap estimate. Compared to the bootstrap percentile interval (see the previous article), the BCa interval is shifted to the right.

Bootstrap distribution and bias-corrected and accelerated BCa confidence interval

There is another second-order method that is related to the BCa interval. It is called the ABC method and it uses an analytical expression to approximate the endpoints of the BCa interval. See p. 214 of Davison and Hinkley (1997).

In summary, bootstrap computations in the SAS/IML language can be very compact. By writing and re-using helper functions, you can encapsulate some of the tedious calculations into a high-level function, which makes the resulting program easier to read. For univariate data, you can often implement bootstrap computations without writing any loops by using the matrix-vector nature of the SAS/IML language.

If you do not have access to SAS/IML software or if the statistic that you want to bootstrap is produced by a SAS procedure, you can use SAS-supplied macros (%BOOT, %JACK,...) for bootstrapping. The macros include the %BOOTCI macro, which supports the percentile interval, the BCa interval, and others. For further reading, the web page for the macros includes a comparison of the CI methods.

The post The bias-corrected and accelerated (BCa) bootstrap interval appeared first on The DO Loop.