I’ve had several meetings lately on data management, and especially integration, where the ability to explore alternatives has been critical. And the findings from our internet of things (IoT) early adopters survey confirm that the ecosystem nature of data sources in IoT deployments means we need to expand the traditional […]

It’s data integration, but not as we know it was published on SAS Voices.


Digital intelligence is a trending term in the space of digital marketing analytics that needs to be demystified. Let's begin by defining what a digital marketing analytics platform is:

Digital marketing analytics platforms are technology applications used by customer intelligence ninjas to understand and improve consumer experiences. Prospecting, acquiring, and holding on to digital-savvy customers depends on understanding their multidevice behavior, and derived insight fuels marketing optimization strategies. These platforms come in different flavors, from stand-alone niche offerings, to comprehensive end-to-end vehicles performing functions from data collection through analysis and visualization.

However, not every platform is built equally from an analytical perspective. According to Brian Hopkins, a Forrester analyst, firms that excel at using data and analytics to optimize their digital businesses will together generate $1.2 trillion per annum in revenue by 2020. And digital intelligence — the practice of continuously optimizing customer experiences with online and offline data, advanced analytics and prescriptive insights — supports every insights-driven business. Digital intelligence is the antidote to the weaknesses of analytically immature platforms, leaving the world of siloed reporting behind and maturing towards actionable, predictive marketing. Here are a couple of items to consider:

  • Today's device-crazed consumers flirt with brands across a variety of interactions during a customer life cycle. However, most organizations seem to focus on website activity in one bucket, mobile in another, and social in . . . you see where I'm going. Strategic plans often fall short in applying digital intelligence across all channels — including offline interactions like customer support or product development.
  • Powerful digital intelligence uses timely delivery of prescriptive insights to positively influence customer experiences. This requires integration of data, analytics and the systems that interact with the consumer. Yet many teams manually apply analytics and deliver analysis via endless reports and dashboards that look retroactively at past behavior — begging business leaders to question the true value and potential impact of digital analysis.

As consumer behavioral needs and preferences shift over time, the proportion of digital to non-digital interactions is growing. With the recent release of Customer Intelligence 360, SAS has carefully considered feedback from our customers (and industry analysts) to create technology that supports a modern digital intelligence strategy in guiding an organization to:

  • Enrich your first-party customer data with user level data from web and mobile channels. It's time to graduate from aggregating data for reporting purposes to the collection and retention of granular, customer-level data. It is individual-level data that drives advanced segmentation and continuous optimization of customer interactions through personalization, targeting and recommendations.
  • Keep up with customers through machine learning, data science and advanced analytics. The increasing pace of digital customer interactions requires analytical maturity to optimize marketing and experiences. By enriching first-party customer data with infusions of web and mobile behavior, and more importantly, in the analysis-ready format for sophisticated analytics, 360 Discover invites analysts to use their favorite analytic tool and tear down the limitations of traditional web analytics.
  • Automate targeting, channel orchestration and personalization. Brands struggle with too few resources to support the manual design and data-driven design of customer experiences. Connecting first-party data that encompasses both offline and online attributes with actionable propensity scores and algorithmically-defined segments through digital channel interactions is the agenda. If that sounds mythical, check out a video example of how SAS brings this to life.

The question now is: are you ready? Learn more here about why we are so excited about enabling digital intelligence for our customers, and how this benefits testing, targeting, and optimization of customer experiences.


tags: Customer Engagement, customer intelligence, Customer Intelligence 360, customer journey, data science, Digital Intelligence, machine learning, marketing analytics, personalization, predictive analytics, Predictive Personalization, Prescriptive Analytics

Digital intelligence for optimizing customer engagement was published on Customer Intelligence.


There are so many ways in which a customer’s journey of experiences can be negatively affected, from forms on websites that are unclear or complicated, to inconsistent or non-relevant interactions over many channels. It is important that these negative interactions are measured and reduced to maximize customer engagement and increase customer satisfaction over the long run.

You can tackle this challenge from several directions, from A/B testing and Multi-Armed Bandit tests that optimize interactions at specific points in a journey (these are available in SAS 360), to approaches that optimize full customer journeys across many sequential points.
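To make the multi-armed bandit idea concrete, here is a minimal sketch of one common bandit strategy, epsilon-greedy, written in plain Python. It is not how SAS 360 implements its tests. The conversion rates, visitor count and epsilon value are made-up assumptions for illustration only.

```python
import random

def epsilon_greedy(true_rates, n_visitors=5000, epsilon=0.1, seed=42):
    """Simulate an epsilon-greedy bandit over page variants.

    true_rates: hypothetical conversion rate per variant (unknown in practice).
    Returns (shows, conversions), one entry per variant.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    shows = [0] * k
    conversions = [0] * k
    for _ in range(n_visitors):
        if rng.random() < epsilon or 0 in shows:
            # Explore: pick a variant at random (also used until
            # every variant has been shown at least once).
            arm = rng.randrange(k)
        else:
            # Exploit: pick the variant with the best observed rate so far.
            arm = max(range(k), key=lambda a: conversions[a] / shows[a])
        shows[arm] += 1
        # Simulated visitor response, drawn from the hypothetical true rate.
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
    return shows, conversions

shows, conversions = epsilon_greedy([0.02, 0.05, 0.03])
for variant, (s, c) in enumerate(zip(shows, conversions)):
    print(f"variant {variant}: shown {s} times, {c} conversions")
```

Unlike a fixed-split A/B test, the bandit shifts traffic toward the variant that appears to be winning while the test is still running, at the cost of a noisier estimate for the losing variants.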

Optimizing full analytically-driven customer journeys

This year I worked on a project for a large retailer that believed a significant number of its customer interactions affected response rates, both positively (halo effects) and negatively (cannibalization effects). Such effects are difficult to handle with standard optimization techniques, which assume that contacts are independent, so we used full customer journey optimization to identify the effects and address this complexity.

As always, the first step was to get the consolidated data, at the individual customer level. We were able to accomplish this because we had good, quality data for the project – customer-level demographic data, and contact history data.

The stages of customer journey optimization

The journey optimization was then carried out in three stages:

Stage 1 – Creating analytically driven customer journeys is an important advancement towards truly effective analytically driven omnichannel marketing. We used decision trees (in SAS Enterprise Miner) on the customer history data to map widely varied journeys that customers were taking. Traditionally, decision trees are used on a wider set of data, but by using just the history, the paths of significant activity that led to purchases were identified.

Stage 2 – Next, these analytically driven journey maps were used as inputs for optimization. For every journey identified by the decision trees, a predictive model was created (using SAS Enterprise Miner) to predict spending, so that for every customer and every journey, we could predict how much they would spend.

Stage 3 – Finally, the data was optimized using SAS Marketing Optimization, and constraints were applied to establish a final set of scenarios that the retailer agreed would be appropriate to implement.
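The shape of the stage 2 and stage 3 data can be sketched in a few lines of Python. This is a toy greedy heuristic under stated assumptions, not the algorithm used by SAS Marketing Optimization. The customer IDs, journey names, predicted spend values and capacity limits are all hypothetical.

```python
def assign_journeys(predicted_spend, capacity):
    """Greedy heuristic: give each customer at most one journey.

    predicted_spend: {customer: {journey: predicted spend}} (stage 2 style output)
    capacity: {journey: max number of customers} (a stage 3 style constraint)
    """
    # Rank every (customer, journey) pair by predicted spend, highest first.
    pairs = sorted(
        ((spend, cust, jrny)
         for cust, by_journey in predicted_spend.items()
         for jrny, spend in by_journey.items()),
        reverse=True,
    )
    remaining = dict(capacity)
    assignment = {}
    for spend, cust, jrny in pairs:
        # Take the pair only if the customer is still unassigned
        # and the journey still has room.
        if cust not in assignment and remaining.get(jrny, 0) > 0:
            assignment[cust] = jrny
            remaining[jrny] -= 1
    return assignment

predicted = {
    "c1": {"email_then_store": 120.0, "catalog_only": 90.0},
    "c2": {"email_then_store": 110.0, "catalog_only": 40.0},
    "c3": {"email_then_store": 80.0, "catalog_only": 75.0},
}
plan = assign_journeys(predicted, {"email_then_store": 1, "catalog_only": 2})
total = sum(predicted[c][j] for c, j in plan.items())
print(plan, total)
```

In this toy data the greedy plan totals 235, while giving the single email slot to c2 instead of c1 would total 275. That gap is the reason a proper constrained optimizer is used for the final stage rather than a simple ranking.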

This illustrates how decision trees can be used to map customer journeys and how these journeys can then be optimized, replacing the disjointed and disconnected results of traditional optimization methods. We are also beginning to use the cutting-edge machine learning technique of deep reinforcement learning to further optimize customer journeys. These techniques will be incorporated into SAS Customer Intelligence solutions to ensure that SAS users can explore this complicated and increasingly important area of customer intelligence.

tags: A/B Testing, customer journey, customer journey mapping, multi-armed bandit test, optimization, retail, sas customer intelligence, sas marketing optimization

Customer journey optimization: A real-world example was published on Customer Intelligence.


In my last post I described "4 adaptability attributes for analytical success," and in the past I've discussed the strategic role analytics play in helping organizations succeed now and into the future. Now I'd like to discuss three attributes that define a powerful analytics environment: speed, accuracy and scalability. [NOTE: Any […]

3 attributes that define a powerful analytics environment was published on SAS Voices.


I've been working on a pilot project recently with a client to test out some new NoSQL database frameworks (graph databases in particular). Our goal is to see how a different storage model, representation and presentation can enhance the usability and ease of integration for master data indexes and entity […]

The post Balancing privacy concerns for analytics design appeared first on The Data Roundtable.


SAS® Viya™ 3.1 represents the third generation of high performance computing from SAS. Our journey started a long time ago and, along the way, we have introduced a number of high performance technologies into the SAS software platform.

Introducing Cloud Analytic Services (CAS)

SAS Viya introduces Cloud Analytic Services (CAS) and continues this story of high performance computing. CAS is the runtime engine and microservices environment for data management and analytics in SAS Viya and introduces some new and interesting innovations for customers. CAS is an in-memory technology and is designed for scale and speed. Whilst it can be set up on a single machine, it is more commonly deployed across a number of nodes in a cluster of computers for massively parallel processing (MPP). The parallelism is further increased when we consider using all the cores within each node of the cluster for multi-threaded, analytic workload execution. In an MPP environment, having a number of nodes doesn’t mean that using all of them is always the most efficient choice for analytic processing. CAS maintains node-to-node communication in the cluster and uses an internal algorithm to determine the optimal distribution and number of nodes to run a given process.

However, processing in-memory can be expensive, so what happens if your data doesn’t fit into memory? Well, CAS has that covered. CAS will automatically spill data to disk in such a way that only the data that are required for processing are loaded into the memory of the system. The rest of the data are memory-mapped to the filesystem in an efficient way for loading into memory when required. This way of working means that CAS can handle data that are larger than the available memory that has been assigned.

The CAS in-memory engine is made up of a number of components - namely the CAS controller and, in an MPP distributed environment, CAS worker nodes. Depending on your deployment architecture and data sources, data can be read into CAS either in serial or parallel.

What about resilience to data loss if a node in an MPP cluster becomes unavailable? Well, CAS has that covered too. CAS maintains a replicate of the data within the environment. The number of replicates can be configured, but the default is to maintain one extra copy of the data within the environment. This is done efficiently by having the replicate data blocks cached to disk as opposed to consuming resident memory.

One of the most interesting developments with the introduction of CAS is the way that an end user can interact with SAS Viya. CAS actions are a new programming construct and with CAS, if you are a Python, Java, SAS or Lua developer you can communicate with CAS using an interactive computing environment such as a Jupyter Notebook. One of the benefits of this is that a Python developer, for example, can utilize SAS analytics on a high performance, in-memory distributed architecture, all from their Python programming interface. In addition, we have introduced open REST APIs which means you can call native CAS actions and submit code to the CAS server directly from a Web application or other programs written in any language that supports REST.

Whilst CAS represents the most recent step in our high performance journey, SAS Viya does not replace SAS 9. These two platforms can co-exist, even on the same hardware, and indeed can communicate with one another to leverage the full range of technology and innovations from SAS. To find out more about CAS, take a look at the early preview trial. Or, if you would like to explore the capabilities of SAS Viya with respect to your current environment and business objectives, speak to your local SAS representative about arranging a ‘Path to SAS Viya workshop’ with SAS.

Many thanks to Fiona McNeill, Mark Schneider and Larry LaRusso for their input and review of this article.


tags: global te, Global Technology Practice, high-performance analytics, SAS Grid Manager, SAS Visual Analytics, SAS Visual Statistics, SAS Viya

A journey of SAS high performance was published on SAS Users.


Recently a colleague told me Google had published new, interesting data sets at BigQuery. I found a lot of Reddit data as well, so I quickly tried running BigQuery with these text data to see what I could produce. After getting some pretty interesting results, I wanted to see if I could implement the same analysis with SAS, and whether SAS Text Mining would give deeper insights than simple queries. So, I tried SAS with Reddit comments data and I’d like to share my analyses and findings with you.

Analysis 1: Significant Words

To get started with BigQuery, I googled what others were sharing regarding BigQuery and Reddit, and I found USING BIGQUERY WITH REDDIT DATA. In this article, the author posted a query for extracting significant words from the Politics subreddit. I then wrote a SAS program to mimic this query and got the following data from the July Reddit comments. The result is not exactly the same as the one from BigQuery, because I downloaded the Reddit data from another website and used the SAS Text Parsing action to parse the comments into tokens rather than just splitting on white space.

Analysis 2: Daily Submissions

The words Trump and Hillary in the list raised my interest and begged for further analysis. So, I did a daily analysis to understand how hot Trump and Hillary were during this month. I filtered all comments mentioning Trump or Hillary in the Politics subreddit and counted total submissions per day. The resulting time series plot is shown below.
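The filter-and-count step can be sketched as follows. This is a plain-Python illustration, not the SAS program used for the analysis. The field names (`subreddit`, `body`, `created_utc`), the simple substring match and the toy comments are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone

KEYWORDS = ("trump", "hillary")  # matched case-insensitively as substrings

def daily_mentions(comments, subreddit="politics"):
    """Count comments per UTC day that mention any keyword in one subreddit."""
    counts = Counter()
    for c in comments:
        if c["subreddit"].lower() != subreddit:
            continue
        body = c["body"].lower()
        if any(k in body for k in KEYWORDS):
            day = datetime.fromtimestamp(c["created_utc"], tz=timezone.utc).date()
            counts[day.isoformat()] += 1
    return dict(sorted(counts.items()))

def ts(day, hour):
    """Helper for the toy data: epoch seconds for a July 2016 day (UTC)."""
    return datetime(2016, 7, day, hour, tzinfo=timezone.utc).timestamp()

# Hypothetical comments, only to show the shape of the input.
sample = [
    {"subreddit": "politics", "body": "Trump rally tonight", "created_utc": ts(5, 9)},
    {"subreddit": "politics", "body": "Hillary and the FBI", "created_utc": ts(5, 15)},
    {"subreddit": "politics", "body": "Nice weather today", "created_utc": ts(5, 16)},
    {"subreddit": "sports", "body": "Trump at the game", "created_utc": ts(5, 17)},
    {"subreddit": "politics", "body": "hillary speech", "created_utc": ts(6, 8)},
]
daily = daily_mentions(sample)
print(daily)
```

Plotting the resulting day-to-count mapping as a time series is what makes the spikes visible.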

I found several spikes in the plot, which happened on 2016/7/5, 2016/7/12, 2016/7/21, and 2016/7/26.

Analysis 3: Topics Time Line

I wondered what Reddit users were concerned about on these specific days, so I extracted the top 10 topics from all comments submitted in July 2016 within the Politics subreddit and got the following data. These topics clearly focused on several aspects, such as voting, presidential candidates, parties, and hot news such as Hillary’s email probe.

The topics showed what people were concerned about over the whole month, but I needed further investigation to explain which topic contributed most to each of the four spikes. The topics’ time series plot helped me find the answer.

Some topics’ time series trends are very close, and it is hard to determine which topic contributed most, so I picked the top contributing topic based on daily percentage growth. The top growth topic on July 05 is “emails, dnc, +server, hillary, +classify”, which grew 256.23 times.
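Ranking topics by day-over-day growth can be sketched like this. The exact growth formula behind the reported figure is not given here, so the relative-change definition below is an assumption, and the topic names and counts are invented.

```python
def top_growth_topic(daily_counts, day_index):
    """Return (topic, growth) for the topic with the largest relative
    increase between day_index - 1 and day_index.

    daily_counts: {topic: [count on day 0, count on day 1, ...]}
    Growth is (current - previous) / previous, an assumed definition.
    """
    if day_index < 1:
        raise ValueError("day_index must be at least 1")
    best_topic, best_growth = None, float("-inf")
    for topic, counts in daily_counts.items():
        previous, current = counts[day_index - 1], counts[day_index]
        if previous == 0:
            continue  # growth is undefined when the previous day had no comments
        growth = (current - previous) / previous
        if growth > best_growth:
            best_topic, best_growth = topic, growth
    return best_topic, best_growth

# Hypothetical daily comment counts per topic for two consecutive days.
counts = {
    "emails, server": [10, 50],
    "vote, party": [200, 260],
}
winner = top_growth_topic(counts, day_index=1)
print(winner)
```

A relative measure like this favors a small topic that suddenly jumps over a large topic that rises modestly, which is the point when raw counts look too similar to separate by eye.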

Its time series plot also shows a high spike on July 05. Then, I googled “July 5, 2016 emails dnc server hillary classify” and got the following news.

There is no doubt the spike on July 05 is related to the FBI’s decision about Clinton’s email probe. To confirm this, I extracted the top 20 Reddit comments submitted on July 05, ranked by Reddit score. Below I quote part of the top comment; the link in that comment also appeared in the Google search results.

"...under normal circumstances, security clearances would be revoked. " This is your FBI. EDIT: I took paraphrased quote, this is the actual quote as per https://www.fbi.gov/news/pressrel/press-releases/statement-by-fbi-director-james-b.-comey-on-the-investigation-of-secretary-hillary-clintons-use-of-a-personal-e-mail-system - "

A similar analysis of the other three days produced the following hot topics.

Interestingly, one person did a sentiment analysis with Twitter data, and the tweet submission trend for July looks the same as Reddit's.

And in this blog, he listed several important events that happened in July.

  • July 5th: the FBI says it’s going to end Clinton’s email probe and will not recommend prosecution.
  • July 12th: Bernie Sanders endorses Hillary Clinton for president.
  • July 21st: Donald Trump accepts the Republican nomination.
  • July 25-28: Clinton accepts the nomination at the DNC.

This suggests that different social media data show similar response trends to the same events.

Now I know why these spikes happened. However, more questions came to my mind.

  • Who started posting this news?
  • Were there cyber armies?
  • Who were opinion leaders in the politics community?

I believe all these questions can be answered by analyzing the data with SAS.

tags: SAS R&D, SAS Text Analytics

Analyzing Trump v. Clinton text data at Reddit was published on SAS Users.


Tell me if you’ve heard this before:  Your company hired (or re-titled) a talented data scientist and they have great skills and no data. Or they're marginalized by IT because they're misunderstood. They're offered “cleansed” data that will fit into the hardware provisioned. What they want is “all” relevant data […]

4 tips for the utility data scientist was published on SAS Voices.


Are you one of those people who get easily bored at amusement parks? Would you like something to do while your friends/family are waiting in line for a ride? Perhaps I have an alternative idea to keep you busy: survey markers! When surveyors are measuring and marking areas for […]

The post Something for geeks and nerds to do at Disney World! appeared first on SAS Learning Post.


This article shows how to simulate a data set in SAS that satisfies a least squares regression model for continuous variables.

When you simulate to create "synthetic" (or "fake") data, you (the programmer) control the true parameter values, the form of the model, the sample size, and the magnitude of the error term. You can use simulated data as a quick-and-easy way to generate an example. You can use simulation to test the performance of an algorithm on very wide or very long data sets.

The least squares regression model with continuous explanatory variables is one of the simplest regression models. The book Simulating Data with SAS describes how to simulate data from dozens of regression models. I have previously blogged about how to simulate data from a logistic regression model in SAS.

Simulate data that satisfies a linear regression model

It is useful to be able to generate data that fits a known model. Suppose you want to fit a regression model in which the response variable is a linear combination of 10 explanatory variables, plus random noise. Furthermore, suppose you don't need to use real X values; you are happy to generate random values for the explanatory variables.

The following SAS DATA step performs these steps:

  1. Specify the sample size.
  2. Specify the number of explanatory variables.
  3. Specify the model parameters, which are the regression coefficients: β0=Intercept, β1, β2, β3, .... If you know the number of explanatory variables, you can hard-code these values. The example below uses the formula βj = 4 (-1)^(j+1) / (j+1), but you can use a different formula or hard-code the parameter values.
  4. Simulate the explanatory variables. In this example, the variables are independent and normally distributed. When augmented with a column that contains all 1s (to capture the model intercept), the explanatory variables form a data matrix, X.
  5. Specify the error distribution, ε ∼ N(0,σ).
  6. Simulate the response variable as Y = MODEL + ERROR = X β + ε.

%let N = 50;        /* 1. Specify sample size */
%let nCont = 10;    /* 2. Specify the number of continuous variables */
data SimReg1(keep= Y x:);
call streaminit(54321);              /* set the random number seed */
array x[&nCont];         /* explanatory vars are named x1-x&nCont  */
/* 3. Specify model coefficients. You can hard-code values such as
array beta[0:&nCont] _temporary_ (-4 2 -1.33 1 -0.8 0.67 -0.57 0.5 -0.44 0.4 -0.36);
      or you can use a formula such as the following */
array beta[0:&nCont] _temporary_;
do j = 0 to &nCont;
   beta[j] = 4 * (-1)**(j+1) / (j+1);       /* formula for beta[j]  */
end;
do i = 1 to &N;              /* for each observation in the sample  */
   do j = 1 to dim(x);
      x[j] = rand("Normal"); /* 4. Simulate explanatory variables   */
   end;
   eta = beta[0];                       /* model = intercept term   */
   do j = 1 to &nCont;
      eta = eta + beta[j] * x[j];       /*     + sum(beta[j]*x[j])  */
   end;
   epsilon = rand("Normal", 0, 1.5);    /* 5. Specify error distrib */
   Y = eta + epsilon;                   /* 6. Y = model + error     */
   output;                   /* write one observation per iteration */
end;
run;

How do you know whether the simulation is correct?

After you simulate data, it is a good idea to run a regression analysis and examine the parameter estimates. For large samples, the estimates should be close to the true value of the parameters. The following call to PROC REG fits the known model to the simulated data and displays the parameter estimates, confidence intervals for the parameters, and the root mean square error (root MSE) statistic.

proc reg data=SimReg1 plots=none;
   model Y = x: / CLB;
   ods select FitStatistics ParameterEstimates;
quit;

For a simple model, a "large sample" does not have to be very large. Although the sample size is merely N = 50, the parameter estimates are close to the parameter values. Furthermore, the root MSE value is close to 1.5, which is the true magnitude of the error term. For this simulation, each 95% confidence interval contains the true value of the corresponding parameter, but that does not happen in general.

Figure: Parameters, estimates, and 95% confidence limits for a 10-variable linear regression of simulated data.

For more complex regression models, you might need to generate larger samples to verify that the simulation correctly generates data from the model.

If you write the ParameterEstimates table to a SAS data set, you can create a plot that shows the parameters overlaid on a plot of the estimates and the 95% confidence limits. The plot shows that the parameters and estimates are close to each other, which should give you confidence that the simulated data are correct.

Notice that this DATA step can simulate any number of observations and any number of continuous variables. Do you want a sample that has 20, 50, or 100 variables? No problem: just change the value of the nCont macro variable! Do you want a sample that has 1000 observations? Change the value of N.

In summary, the SAS DATA step provides an easy way to simulate data from regression models in which the explanatory variables are uncorrelated and continuous. Download the complete program and modify it to your needs. For example, if you want more significant effects, use sqrt(j+1) in the denominator of the regression coefficient formula.
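To see what the sqrt(j+1) suggestion does to the coefficients, here is a small arithmetic check. It is written in Python purely to tabulate the numbers. In the SAS program the change would be a one-line edit to the beta[j] formula. The function name `betas` is a hypothetical helper, not part of the original program.

```python
import math

def betas(n_cont, denom=lambda j: j + 1):
    """Coefficients beta[0..n_cont] = 4 * (-1)**(j+1) / denom(j)."""
    return [4 * (-1) ** (j + 1) / denom(j) for j in range(n_cont + 1)]

original = betas(10)                                         # denominator j+1
slower_decay = betas(10, denom=lambda j: math.sqrt(j + 1))   # denominator sqrt(j+1)

for j, (b1, b2) in enumerate(zip(original, slower_decay)):
    print(f"beta[{j:2d}]  j+1: {b1:6.2f}   sqrt(j+1): {b2:6.2f}")
```

With the square-root denominator the coefficients shrink more slowly as j grows, so the later explanatory variables have larger effects relative to the error term and are more likely to be detected as significant.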

In a future article, I will show how to generalize this program to efficiently simulate and analyze many samples as part of a Monte Carlo simulation study.

tags: Simulation

The post Simulate data for a linear regression model appeared first on The DO Loop.