July 18, 2017

In a digital world where billions of customers make trillions of visits across a multi-channel marketing environment, big data has drawn the attention of researchers all over the world. Customers leave behind a huge trail of data in digital channels. As data volumes explode, finding the right data to support the right decision is becoming extremely difficult.

This can be a big issue for brands. Traditional databases have not been efficient enough to capture the sheer volume or complexity of the datasets we accumulate on the web, on social media and in other places.

A leading consulting firm, for example, boasts that one of its clients has 35 million customers and 9 million unique visitors daily on its website, generating a huge amount of shopper data every second. The right tools for segmenting this much data to help target marketing activities are not readily available. To complicate matters, the data can be structured or unstructured, which makes traditional methods of analysing data unsuitable.

Tackling market segmentation

Market segmentation is a process by which market researchers identify key attributes of customers and potential customers that can be used to create distinct target market groups. Without a market segmentation base, advertising and sales teams can lose large amounts of money targeting the wrong set of customers.

Some known methods of segmenting consumer markets include geographic segmentation, demographic segmentation, behavioural segmentation, multi-variable account segmentation and others. Common statistical approaches to segmenting markets include:

  • Clustering algorithms such as K-Means clustering
  • Statistical mixture models such as Latent Class Analysis
  • Ensemble approaches such as Random Forests

Most of these methods assume the number of clusters to be known, which is rarely the case in practice. There are several approaches to estimating the number of clusters. However, strong evidence about the quality of the resulting clusters does not exist.
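
One common heuristic for estimating the number of clusters is to run K-Means for several values of k and watch how the within-cluster sum of squares (WCSS) falls. The sketch below is a minimal, pure-Python illustration under stated assumptions: the customer data is made up, the function names (sq_dist, farthest_point_init, kmeans, wcss) are hypothetical, and the deterministic farthest-point seeding is used only to keep the toy example reproducible. It is not a method taken from this post.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point that is farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical customers: (visits per month, average basket value).
customers = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 1), (20, 2), (21, 1), (21, 2),
]

wcss_by_k = {}
for k in range(1, 6):
    centroids, labels = kmeans(customers, k)
    wcss_by_k[k] = wcss(customers, centroids, labels)
    print(k, round(wcss_by_k[k], 2))
```

On this toy data the WCSS falls sharply up to k = 3, where it reaches 6.0, and only marginally after that, which is the kind of "elbow" the heuristic looks for. Real customer data rarely shows such a clean break, which is one reason the quality of the resulting clusters is hard to establish.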

To add to the above issues, clusters could be domain specific, which means they are built to solve certain domain problems such as:

  • Delimitation of species of plants or animals in biology.
  • Medical classification of diseases.
  • Discovery and segmentation of settlements and periods in archaeology.
  • Image segmentation and object recognition.
  • Social stratification.
  • Market segmentation.
  • Efficient organization of databases for search queries.

There are also quite general tasks for which clustering is applied in many subject areas:

  • Exploratory data analysis looking for “interesting patterns” without prescribing any specific interpretation, potentially creating new research questions and hypotheses.
  • Information reduction and structuring of sets of entities from any subject area for simplification, effective communication, or effective access/action such as complexity reduction for further data analysis, or classification systems.
  • Investigating the correspondence of a clustering in specific data with other groupings or characteristics, either hypothesized or derived from other data.

What is meant by a “cluster” can differ considerably from one application to another, so the cluster definition and methodology have to be adapted to the specific aim of clustering in the application of interest.

Van Mechelen et al. (1993) set out objective characteristics that “true clusters” should possess, including the following:

  • Within-cluster dissimilarities should be small.
  • Between-cluster dissimilarities should be large.
  • Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial process models.
  • Members of a cluster should be well represented by its centroid.
  • The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
  • Clusters should be stable.
  • Clusters should correspond to connected areas in data space with high density.
  • The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
  • It should be possible to characterize the clusters using a small number of variables.
  • Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
  • Features should be approximately independent within clusters.
  • The number of clusters should be low.
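
The first two criteria can be checked numerically for any proposed grouping. The following sketch assumes Euclidean distance as the dissimilarity, uses made-up points, and relies on a hypothetical helper named dissimilarity_summary. It compares a labelling that follows the visible groups with one that ignores them.

```python
import math
from itertools import combinations


def dissimilarity_summary(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        # A pair counts as "within" only if both points share a label.
        (within if lp == lq else between).append(math.dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)


# Made-up 2-D points forming three tight groups (illustrative only).
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 1), (20, 2), (21, 1), (21, 2),
]
good_labels = [0] * 4 + [1] * 4 + [2] * 4  # follows the visible groups
bad_labels = [0, 1, 2] * 4                 # ignores them

good_within, good_between = dissimilarity_summary(points, good_labels)
bad_within, bad_between = dissimilarity_summary(points, bad_labels)

print("good:", round(good_within, 2), round(good_between, 2))
print("bad: ", round(bad_within, 2), round(bad_between, 2))
```

A labelling that satisfies the first two criteria shows a small within-cluster mean relative to the between-cluster mean. The remaining criteria, such as stability or agreement with an external partition, need separate checks.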


  1. Van Mechelen, J. Hampton, R.S. Michalski, P. Theuns (1993). Categories and Concepts: Theoretical Views and Inductive Data Analysis. Academic Press, London.

What are the characteristics of “true clusters?” was published on SAS Users.

July 17, 2017

My previous blog post focused on a graph showing the % of women earning STEM degrees in various fields. While that graph was designed to answer a very specific question, let's now look at the data from a broader perspective. Let's look at the total number of STEM degrees [...]

The post Tracking STEM degrees - a deeper look! appeared first on SAS Learning Post.

July 13, 2017

For the past several years, efforts have been under way to recruit more women into the STEM (science, technology, engineering, and math) fields. I recently saw an interesting graph showing the percentage of bachelor's degrees conferred to women in the US, and I wondered if I could tweak that graph [...]

The post Are more women getting STEM degrees? appeared first on SAS Learning Post.

July 11, 2017

Carbon Dioxide ... CO2. Humans breathe out 2.3 pounds of it per day. It's also produced when we burn organic materials & fossil fuels (such as coal, oil, and natural gas). Plants use it for photosynthesis, which in turn produces oxygen. It is also a greenhouse gas, which many claim [...]

The post U.S. CO2 emissions are on the decline! appeared first on SAS Learning Post.

July 1, 2017

We live in exciting times. Our relationships with machines, objects and things are quickly changing. Since mankind lived in caves, we have pushed our will into passive tools with our hands and our voices. Our mice and our keyboards do exactly as we tell them to, and devices like the [...]

Artificial intelligence: Separating the reality from the hype was published on SAS Voices by Oliver Schabenberger

June 30, 2017

Choosing great colors for a graph is sometimes the most difficult part. And here is yet another thing you need to worry about ... sometimes colors represent different things in different cultures! In this blog post, I improve a graphic to help you get a grasp on those color-to-culture relationships. [...]

The post Colors represent different things, in different cultures appeared first on SAS Learning Post.

June 29, 2017

I'm sure I'm not the only one who has read and contributed to threads on the internet about all the different languages used for data mining. But one aspect that's been left out of most of these comparisons is that SAS is more than a 4th generation programming language (4GL). [...]

SAS is an analytical platform, not just a language was published on SAS Voices by David Pope

June 26, 2017

The role of analytics in combating terrorism. Earlier this spring, I found myself walking through a quiet and peaceful grove of spruce trees south of the small hamlet of Foy outside of Bastogne, Belgium. On travel in Europe, I happened to have some extra time before heading to London. I [...]

The analytics of evil was published on SAS Voices by Steve Bennett

June 23, 2017

Here in the US, we typically use top level domains such as .com, .gov, and .org. I guess we were one of the first countries to start using web domains in a big way, and therefore we kind of got squatter's rights. As other countries started using the web, they [...]

The post A map of country code top-level domains (ccTLD) appeared first on SAS Learning Post.