July 19, 2017
 

Skewness is a measure of the asymmetry of a univariate distribution. I have previously shown how to compute the skewness for data distributions in SAS. The previous article computes Pearson's definition of skewness, which is based on the standardized third central moment of the data.

Moment-based statistics are sensitive to extreme outliers. A single extreme observation can radically change the mean, standard deviation, and skewness of data. It is not surprising, therefore, that there are alternative definitions of skewness. One robust definition of skewness that is intuitive and easy to compute is a quantile definition, which is also known as the Bowley skewness or Galton skewness.

A quantile definition of skewness

The quantile definition of skewness uses Q1 (the lower quartile value), Q2 (the median value), and Q3 (the upper quartile value). You can measure skewness as the difference between the lengths of the upper quartile (Q3-Q2) and the lower quartile (Q2-Q1), normalized by the length of the interquartile range (Q3-Q1). In symbols, the quantile skewness γQ is

γQ = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)

You can visualize this definition by using the figure to the right, which shows the lengths Q2-Q1 and Q3-Q2 that define the quantile skewness (Bowley skewness). For a symmetric distribution, the quantile skewness is 0 because the length Q3-Q2 is equal to the length Q2-Q1. If the right length (Q3-Q2) is larger than the left length (Q2-Q1), then the quantile skewness is positive. If the left length is larger, then the quantile skewness is negative. For the extreme cases when Q1=Q2 or Q2=Q3, the quantile skewness is ±1. Consequently, whereas the Pearson skewness can be any real value, the quantile skewness is bounded in the interval [-1, 1]. The quantile skewness is not defined if Q1=Q3, just as the Pearson skewness is not defined when the variance of the data is 0.

There is an intuitive interpretation for the quantile skewness formula. Recall that the relative difference between two quantities R and L can be defined as their difference divided by their average value. In symbols, RelDiff = (R - L) / ((R+L)/2). If you choose R to be the length Q3-Q2 and L to be the length Q2-Q1, then quantile skewness is half the relative difference between the lengths.
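
To see where the factor of one-half comes from, note that the interquartile range is the sum of the two lengths: Q3 - Q1 = (Q3 - Q2) + (Q2 - Q1) = R + L. Substituting R and L into the definition of the quantile skewness gives γQ = (R - L) / (R + L), which is exactly half of RelDiff = 2(R - L) / (R + L).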

Compute the quantile skewness in SAS

It is instructive to simulate some skewed data and compute the two measures of skewness. The following SAS/IML statements simulate 1000 observations from a Gamma(a=4) distribution. The Pearson skewness of a Gamma(a) distribution is 2/sqrt(a), so the Pearson skewness for a Gamma(4) distribution is 1. For a large sample, the sample skewness should be close to the theoretical value. The QNTL call computes the quantiles of a sample.

/* compute the quantile skewness for data */
proc iml;
call randseed(12345);
x = j(1000, 1);
call randgen(x, "Gamma", 4);
 
skewPearson = skewness(x);           /* Pearson skewness */
call qntl(q, x, {0.25 0.5 0.75});    /* sample quartiles */
skewQuantile = (q[3] -2*q[2] + q[1]) / (q[3] - q[1]);
print skewPearson skewQuantile;
[Output: the Pearson and Bowley skewness statistics for the skewed data]

For this sample, the Pearson skewness is 1.03 and the quantile skewness is 0.174. If you generate a different random sample from the same Gamma(4) distribution, the statistics will change slightly.
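
As noted earlier, moment-based statistics are sensitive to extreme observations, whereas the quantile skewness depends only on the quartiles. The following statements are a minimal sketch of that contrast; they assume the vector x from the program above is still in the PROC IML session, and the appended value 50 is an arbitrary extreme value chosen for illustration.

/* sketch: append one arbitrary extreme value and recompute both statistics */
y = x // 50;                            /* add a single outlier to the sample */
skewPearsonOut = skewness(y);
call qntl(qOut, y, {0.25 0.5 0.75});
skewQuantileOut = (qOut[3] - 2*qOut[2] + qOut[1]) / (qOut[3] - qOut[1]);
print skewPearsonOut skewQuantileOut;

You should see the Pearson skewness increase markedly while the quantile skewness is essentially unchanged, because a single observation can dominate the third moment but barely moves the quartiles.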

Relationship between quantile skewness and Pearson skewness

In general, there is no simple relationship between quantile skewness and Pearson skewness for a data distribution. (This is not surprising: there is also no simple relationship between a median and a mean, nor between the interquartile range and the standard deviation.) Nevertheless, it is interesting to compare the Pearson skewness to the quantile skewness for a particular probability distribution.

For many probability distributions, the Pearson skewness is a function of the parameters of the distribution. To compute the quantile skewness for a probability distribution, you can use the quantiles for the distribution. The following SAS/IML statements compute the skewness for the Gamma(a) distribution for varying values of a.

/* For Gamma(a), the Pearson skewness is skewP = 2 / sqrt(a).  
   Use the QUANTILE function to compute the quantile skewness for the distribution. */
skewP = do(0.02, 10, 0.02);                  /* Pearson skewness for distribution */
a = 4 / skewP##2;        /* invert skewness formula for the Gamma(a) distribution */
skewQ = j(1, ncol(skewP));                   /* allocate vector for results       */
do i = 1 to ncol(skewP);
   Q1 = quantile("Gamma", 0.25, a[i]);
   Q2 = quantile("Gamma", 0.50, a[i]);
   Q3 = quantile("Gamma", 0.75, a[i]);
   skewQ[i] = (Q3 -2*Q2 + Q1) / (Q3 - Q1);  /* quantile skewness for distribution */
end;
 
title "Pearson vs. Quantile Skewness";
title2 "Gamma(a) Distributions";
call series(skewP, skewQ) grid={x y} label={"Pearson Skewness" "Quantile Skewness"};
[Figure: Pearson skewness versus quantile skewness for the Gamma distribution]

The graph shows a nonlinear relationship between the two skewness measures. This graph is for the Gamma distribution; other distributions would have a different shape. If a distribution has a parameter value for which the distribution is symmetric, then the graph will go through the point (0,0). For highly skewed distributions, the quantile skewness will approach ±1 as the Pearson skewness approaches ±∞.

Alternative quantile definitions

Several researchers have noted that there is nothing special about using the first and third quartiles to measure skewness. An alternative formula (sometimes called Kelly's coefficient of skewness) uses deciles: γKelly = ((P90 - P50) - (P50 - P10)) / (P90 - P10). Hinkley (1975) considered the qth and (1-q)th quantiles for arbitrary values of q.
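
If you want to try the decile-based version on the simulated data, the following lines are a quick sketch; they assume the vector x from the earlier program is still in the PROC IML session.

/* sketch: Kelly's coefficient of skewness for the simulated Gamma(4) sample */
call qntl(p, x, {0.1 0.5 0.9});         /* 10th, 50th, and 90th percentiles */
skewKelly = (p[3] - 2*p[2] + p[1]) / (p[3] - p[1]);
print skewKelly;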

Conclusions

The quantile definition of skewness is easy to compute. In fact, you can compute the statistic by hand without a calculator for small data sets. Consequently, the quantile definition provides an easy way to quickly estimate the skewness of data. Since the definition uses only quantiles, the quantile skewness is robust to extreme outliers.

At the same time, the Bowley-Galton quantile definition has several disadvantages. It uses only the central 50% of the data to estimate the skewness. Two different data sets that have the same quartile statistics will have the same quantile skewness, regardless of the shape of the tails of the distribution. And, as mentioned previously, the use of the 25th and 75th percentiles is somewhat arbitrary.

Although the Pearson skewness is widely used in the statistical community, it is worth mentioning that the quantile definition is ideal for use with a box-and-whisker plot. The Q1, Q2, and Q3 quartiles are part of every box plot. Therefore you can visually estimate the quantile skewness as the relative difference between the lengths of the upper and lower boxes.

The post A quantile definition for skewness appeared first on The DO Loop.

July 18, 2017
 

I clearly remember the morning last December when I interviewed for SAS. I was visiting my brother in Seattle over winter break, so I was interviewed over Skype by Brandon and Jason (my current manager and mentor respectively). Near the end of the interview, Brandon asked me, “Where do you [...]

The post Sustainability at SAS appeared first on SAS Analytics U Blog.

July 18, 2017
 

Nowadays, whether you write SAS programs or use point-and-click methods to get results, you have choices for how you access SAS. Currently, when most people open Base SAS, they get the traditional SAS windowing environment (aka Display Manager) as their interface. But it doesn’t have to be that way. If [...]

The post Organize your work with SAS® Enterprise Guide® Projects appeared first on SAS Learning Post.

July 18, 2017
 

In part one of this series, Clark Twiddy, Chief Administrative Officer of Twiddy & Company, shared some best practices from the first of three phases of Twiddy’s journey to becoming a data-driven SMB. This post focuses on phases two and three of their journey. Phase two is about action. Now [...]

How to be a data-driven SMB: Part 2 of Twiddy’s Tale was published on SAS Voices by Analise Polsky

July 18, 2017
 

Datasets are rarely ready for analysis, and one of the most prevalent problems is missing data. This post is the first in a short series focusing on how to think about missingness, how JMP13 can help us determine the scope of missing data in a given table, and how to [...]

The post How severe is your missing data problem? appeared first on SAS Learning Post.

July 18, 2017
 

St. Louis Union Station welcomed its first passenger train on Sept. 2, 1894 at 1:45 pm and became one of the largest and busiest passenger rail terminals in the world. Back in those days, the North American railroads widely used a system called Timetable and Train Order Operation to establish [...]

The post Finding important predictors: Using your data to explain what’s going on appeared first on SAS Learning Post.

July 18, 2017
 

In the digital world, where billions of customers make trillions of visits across multi-channel marketing environments, big data has drawn researchers’ attention all over the world. Customers leave behind a huge trail of data in digital channels. Given exploding data volumes, it is becoming extremely difficult to find the right data that can help make the right decision.

This can be a big issue for brands. Traditional databases have not been efficient enough to capture the sheer volume or complexity of the datasets we accumulate on the web, on social media and other places.

A leading consulting firm, for example, boasts that one of its clients has 35 million customers and 9 million unique visitors daily on its website, leaving behind a huge amount of shopper information every second. The right tools for segmenting this large amount of data to help target marketing activities are not readily available. To make matters more complicated, the data can be both structured and unstructured, which makes traditional methods of analysing data unsuitable.

Tackling market segmentation

Market segmentation is a process by which market researchers identify key attributes of customers and potential customers that can be used to create distinct target market groups. Without a market segmentation base, advertising and sales teams can lose a large amount of money by targeting the wrong set of customers.

Some well-known methods of segmenting consumer markets include geographic segmentation, demographic segmentation, behavioural segmentation, multi-variable account segmentation and others. Common statistical approaches for segmenting markets include:

  • Clustering algorithms such as K-Means clustering
  • Statistical mixture models such as Latent Class Analysis
  • Ensemble approaches such as Random Forests

Most of these methods assume that the number of clusters is known, which in reality is never the case. There are several approaches for estimating the number of clusters. However, strong evidence about the quality of the resulting clusters does not exist.
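
For concreteness, here is a minimal sketch of the k-means approach in SAS using PROC FASTCLUS. The data set work.customers and the variables recency, frequency and monetary are hypothetical placeholders, and the MAXCLUSTERS= value is precisely the kind of assumption about the number of clusters discussed above.

/* sketch: k-means segmentation with PROC FASTCLUS (hypothetical data set and variables) */
proc fastclus data=work.customers maxclusters=5 maxiter=100 out=work.segments;
   var recency frequency monetary;      /* hypothetical segmentation variables */
run;

/* profile the segments; FASTCLUS writes the assignment to the CLUSTER variable */
proc means data=work.segments mean;
   class cluster;
   var recency frequency monetary;
run;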

To add to the above issues, clusters can be domain-specific, which means they are built to solve particular domain problems such as:

  • Delimitation of species of plants or animals in biology.
  • Medical classification of diseases.
  • Discovery and segmentation of settlements and periods in archaeology.
  • Image segmentation and object recognition.
  • Social stratification.
  • Market segmentation.
  • Efficient organization of databases for search queries.

There are also quite general tasks for which clustering is applied in many subject areas:

  • Exploratory data analysis looking for “interesting patterns” without prescribing any specific interpretation, potentially creating new research questions and hypotheses.
  • Information reduction and structuring of sets of entities from any subject area for simplification, effective communication, or effective access/action such as complexity reduction for further data analysis, or classification systems.
  • Investigating the correspondence of a clustering in specific data with other groupings or characteristics, either hypothesized or derived from other data.

Depending on the application, what is meant by a “cluster” can differ a lot, and the cluster definition and methodology have to be adapted to the specific aim of clustering in the application of interest.

Van Mechelen et al. (1993) set out objective characteristics that “true clusters” should possess, which include the following (the first two criteria are illustrated in a short sketch after this list):

  • Within-cluster dissimilarities should be small.
  • Between-cluster dissimilarities should be large.
  • Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial process models.
  • Members of a cluster should be well represented by its centroid.
  • The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
  • Clusters should be stable.
  • Clusters should correspond to connected areas in data space with high density.
  • The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
  • It should be possible to characterize the clusters using a small number of variables.
  • Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
  • Features should be approximately independent within clusters.
  • The number of clusters should be low.
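
The first two criteria can be made concrete with a small computation. The following SAS/IML module is a minimal sketch under stated assumptions, not a definitive implementation: X is a hypothetical numeric data matrix with one row per observation, and c is a vector of cluster labels (for example, the CLUSTER variable produced by a clustering procedure). The module decomposes the total sum of squares into within-cluster and between-cluster parts; a good clustering makes the first part small relative to the second.

proc iml;
/* sketch: decompose the total sum of squares into within- and between-cluster parts.
   X is a hypothetical (n x p) numeric matrix; c is an (n x 1) vector of cluster labels. */
start ClusterSS(X, c);
   grand = mean(X);                                /* grand centroid (column means) */
   totalSS = ssq(X - repeat(grand, nrow(X)));      /* total sum of squares          */
   withinSS = 0;
   labels = unique(c);
   do i = 1 to ncol(labels);
      Xi = X[loc(c = labels[i]), ];                /* observations in cluster i     */
      withinSS = withinSS + ssq(Xi - repeat(mean(Xi), nrow(Xi)));
   end;
   return ( withinSS || (totalSS - withinSS) );    /* {within SS, between SS}       */
finish;

For example, after reading a clustering result into X and c, the call wb = ClusterSS(X, c); returns the two sums of squares, and their ratio gives a rough sense of how well separated the clusters are.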

Reference

  1. Van Mechelen, I., Hampton, J., Michalski, R. S., and Theuns, P. (1993). Categories and Concepts: Theoretical Views and Inductive Data Analysis. Academic Press, London.

What are the characteristics of “true clusters?” was published on SAS Users.

July 17, 2017
 

[Screenshot: DMS keys]

SAS power users (and actually, power users of any application) like to customize their environment for maximum productivity. Long-time SAS users remember the KEYS window in SAS display manager, which allows you to assign SAS commands to "hot keys" in your SAS session. These users will invest many hours to come up with the perfect keyboard mappings to suit the type of work that they do.

When using SAS Enterprise Guide, these power users often lament the lack of a similar KEYS window. But these people needn't suffer with the default keys -- a popular tool named AutoHotkey can fill the gap for this and for any other Windows application. I've recommended it to many SAS users over the years, and I've heard positive feedback from those who have adopted it. AutoHotkey is free, and it's lightweight and portable; even users with "locked-down" systems can usually make use of it.

AutoHotkey provides its own powerful scripting language, which allows you to define new behaviors for any key combination that you want to repurpose. When you activate these scripts, AutoHotkey gets first crack at processing your keypress, so you can redirect the built-in key mappings for any Windows application. I'll share two examples of different types of scripts that users have found helpful.

"Unmap" a key that you don't like

In SAS Enterprise Guide, F3 and F8 are both mapped to "Run program". A newer user found the F8 mapping confusing because she had a habit of using that key for something else, and so became quite annoyed when she kept accidentally running her process before she was ready.

The following AutoHotkey script "eats" the F8 keypress. The logic first checks to see if the running process is SAS Enterprise Guide (seguide.exe), and if so, it simply stops processing the action, effectively vetoing the F8 action.

F8::
WinGet, Active_ID, ID, A
WinGet, Active_Process, ProcessName, ahk_id %Active_ID%
if ( Active_Process ="seguide.exe" ) {
  ;eat the keystroke
} 

Map a single key to an action that requires multiple keys or clicks

I recently shared a tip to close all open data sets in SAS Enterprise Guide. It's a feature on the Tools menu that launches a special window, and some readers wished for a single key mapping to get the job done. Using AutoHotkey, you can map a series of clicks/keystrokes to a single key.

The following script will select the menu item, activate the "View Open Data Sets" window, and then select Close All.

F12::
WinGet, Active_ID, ID, A                                   ; get the ID of the active window
WinGet, Active_Process, ProcessName, ahk_id %Active_ID%    ; get the process name of that window
if ( Active_Process ="seguide.exe" )
{
  Sleep, 100
  Send {Alt Down}{Alt Up}{t}       ; activate the menu bar, then open the Tools menu
  Sleep, 100
  Send, {v}                        ; choose "View Open Data Sets"
  WinActivate, View Open Data Sets ahk_class WindowsForms10.Window.8.app.0.143a722_r12_ad1
  Send, {Tab}                      ; move focus to the Close All button
  Sleep, 100
  Send, {Space}                    ; press the button
  Sleep, 500
  Send, {Esc}                      ; close the window
}

You'll see that one of the script commands activates the "View Open Data Sets" window. The window "class" is referenced, and the class name is hardly intuitive. AutoHotkey includes a "Window spy" utility called "Active Window Info" that can help you to find the exact name of the window you need to activate.

[Screenshot: AutoHotkey's Window Spy utility]

AutoHotkey can direct mouse movements and clicks, but those directives might not be reliable in different Windows configurations. In my scripts, I rely on simulated keyboard commands. This script activates the top-level menu with Alt+"t" (for Tools), then "v" (for the "View Open Data Sets" window), then TAB to the "Close All" button, space bar to press the button, then Escape to close the window. Each action takes some time to take effect, so "Sleep" commands are inserted (with times in milliseconds) to allow the actions to complete.

Every action in SAS Enterprise Guide is accessible by the keyboard (even if several keystrokes are required). If you want to see all of the already-defined keyboard mappings, search the SAS Enterprise Guide help for "keyboard shortcuts."

[Screenshot: the keyboard shortcuts help topic]

Automate more with AutoHotkey

In this article, I've only just scratched the surface of how you can customize keys and automate actions in SAS Enterprise Guide. Some of our users have asked us to build in the ability to customize key actions within the application. While that might be a good enhancement within the boundaries of your SAS applications, a tool like AutoHotkey can help you to automate your common tasks within SAS and across other applications that you use. The scripting language presents a bit of a learning curve, but the online help is excellent. And there is a large community of AutoHotkey users who have published hundreds of useful examples.

Have you used AutoHotkey to automate any SAS tasks? If so, please share your tips here in the comments or within the SAS Enterprise Guide community.

The post Customize your keys in SAS Enterprise Guide with AutoHotkey appeared first on The SAS Dummy.

July 17, 2017
 

My previous blog post focused on a graph, showing the % of women earning STEM degrees in various fields. While that graph was designed to answer a very specific question, let's now look at the data from a broader perspective. Let's look at the total number of STEM degrees [...]

The post Tracking STEM degrees - a deeper look! appeared first on SAS Learning Post.