Phil Simon shares his thoughts on this simple yet often-overlooked question.
The post How do I know what data is available? appeared first on The Data Roundtable.
With North Korea's growing missile capabilities in the news lately, I thought it would be interesting to create a map showing how far (or close) they are from other parts of the world. I first did a few searches on the Web, to see what maps are already out there. [...]
The post How far are you from North Korea? appeared first on SAS Learning Post.
When someone refers to the correlation between two variables, they are probably referring to the Pearson correlation, which is the standard statistic that is taught in elementary statistics courses. Elementary courses do not usually mention that there are other measures of correlation.
Why would anyone want a different estimate of correlation? Well, the Pearson correlation, which is also known as the product-moment correlation, uses empirical moments of the data (means and standard deviations) to estimate the linear association between two variables. However, means and standard deviations can be unduly influenced by outliers in the data, so the Pearson correlation is not a robust statistic.
A simple robust alternative to the Pearson correlation is the Spearman rank correlation, which is defined as the Pearson correlation of the ranks of each variable. (If a variable contains tied values, replace those values by their average rank.) The Spearman rank correlation is simple to compute and conceptually easy to understand. Among its advantages, the rank correlation is robust to outliers and is invariant under monotonically increasing transformations of the data.
PROC CORR in SAS supports several measures of correlation, including the Pearson and Spearman correlations. For data without outliers, the two measures are often similar. For example, the following call to PROC CORR computes the Spearman rank correlation between three variables in the Sashelp.Class data set:
/* Compute PEARSON and SPEARMAN rank correlation by using PROC CORR in SAS */
proc corr data=sashelp.class noprob nosimple PEARSON SPEARMAN;
   var height weight age;
run;
According to both statistics, these variables are strongly positively correlated, with correlations in the range [0.7, 0.88]. Notice that the rank correlations (the lower table) are similar to the Pearson correlations for these data. However, if the data contain outliers, the rank correlation is less influenced by the magnitudes of the outliers.
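To see the difference, here is a small Python sketch on made-up data: a single gross outlier barely moves the Spearman statistic but drags the Pearson correlation down sharply.

```python
import numpy as np

def spearman(x, y):
    # rank via double argsort (valid because these data contain no ties)
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

x = np.arange(10.0)
y = np.array([0.0, 1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 9.1])
y_out = y.copy()
y_out[-1] = 100.0                          # one gross outlier

pearson     = np.corrcoef(x, y)[0, 1]      # ~0.999: nearly linear data
pearson_out = np.corrcoef(x, y_out)[0, 1]  # drops sharply (~0.59)
rho     = spearman(x, y)                   # 1.0: the data are monotone
rho_out = spearman(x, y_out)               # still 1.0: the ranks are unchanged
```

Because the outlier is still the largest y value, its rank is unchanged, so the rank correlation does not move at all.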
As mentioned earlier, the Spearman rank correlation is conceptually easy to understand. It consists of two steps: compute the ranks of each variable and compute the Pearson correlation between the ranks. It is instructive to reproduce each step in the Spearman computation. You can use PROC RANK in SAS to compute the ranks of the variables, then use PROC CORR with the PEARSON option to compute the Pearson correlation of the ranks. If the data do not contain any missing values, then the following statements implement the two steps that compute the Spearman rank correlation:
/* Compute the Spearman rank correlation "manually" by explicitly computing ranks */
/* First compute ranks; use average rank for ties */
proc rank data=sashelp.class out=classRank ties=mean;
   var height weight age;
   ranks RankHeight RankWeight RankAge;
run;

/* Then compute Pearson correlation on the ranks */
proc corr data=classRank noprob nosimple PEARSON;
   var RankHeight RankWeight RankAge;
run;
The resulting table of correlations is the same as in the previous section and is not shown. Although PROC CORR can compute the rank correlation directly, it is comforting that these two steps produce the same answer. Furthermore, this two-step method can be useful if you decide to implement a rank-based statistic that is not produced by any SAS procedure. This two-step method is also the way to compute the Spearman correlation of character ordinal variables because PROC CORR does not analyze character variables. However, PROC RANK supports both character and numeric variables.
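The two-step method is also easy to express outside SAS. Here is a Python sketch (the helper names are mine) that mirrors the TIES=MEAN behavior of PROC RANK and then correlates the ranks:

```python
import numpy as np

def rank_ties_mean(v):
    # step 1: 1-based ranks, assigning tied values their average rank
    # (mirrors PROC RANK with TIES=MEAN)
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0   # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks

def spearman(x, y):
    # step 2: Pearson correlation of the ranks
    return np.corrcoef(rank_ties_mean(x), rank_ties_mean(y))[0, 1]

rank_ties_mean([10, 20, 20, 30])   # -> [1.0, 2.5, 2.5, 4.0]
```

The tied values 20 and 20 occupy ranks 2 and 3, so each receives the average rank 2.5, just as TIES=MEAN assigns it.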
If you have missing values in your data, then make sure you delete the observations that contain missing values before you call PROC RANK. Equivalently, you can use a WHERE statement to omit the missing values. For example, you could insert the following statement into the PROC RANK statements:
where height^=. & weight^=. & age^=.;
In the SAS/IML language, the CORR function computes the Spearman rank correlation directly, as follows. The results are the same as the results from PROC CORR, and are not shown.
proc iml;
use sashelp.class;
read all var {height weight age} into X;
close;
RankCorr = corr(X, "Spearman");   /* compute rank correlation */
If you ever need to compute a rank-based statistic manually, you can also use the RANKTIE function to compute the ranks of the elements in a numerical vector, such as
ranktie(X[ ,1], "Mean");
The Spearman rank correlation is a robust measure of the monotonic association between variables. It is related to the classical Pearson correlation because it is defined as the Pearson correlation between the ranks of the individual variables. It has some very nice properties, including being robust to outliers and being invariant under monotonically increasing transformations of the data. For other measures of correlation that are supported in SAS, see the PROC CORR documentation.
The post What is rank correlation? appeared first on The DO Loop.
Optimization is a core competency for digital marketers. As customer interactions spread across fragmented touch points and consumers demand seamless and relevant experiences, content-oriented marketers have been forced to re-evaluate their strategies for engagement. But the complexity, pace and volume of modern digital marketing easily overwhelms traditional planning and design approaches that rely on historical conventions, myopic single-channel perspectives and sequential act-and-learn iteration.
SAS Customer Intelligence 360 Engage was released last year to address our client needs for a variety of modern marketing challenges. Part of the software's capabilities revolve around:
Regardless of the method, testing is attractive because it is efficient, measurable and serves as a machete cutting through the noise and assumptions associated with delivering effective experiences. The question is: How does a marketer know what to test?
There are so many possibilities. Let's be honest: if there's one thing marketers are good at, it's being creative. Ideas flow out of brainstorming meetings, bright minds flourish with motivation and campaign concepts are born. As a data and analytics geek, I've worked with ad agencies and client-side marketing teams on the importance of connecting the dots between the world of predictive analytics (and, more recently, machine learning) and the creative process. Take a moment to reflect on the concept of ideation.
Is it feasible to have too many ideas to practically try them all? How do you prioritize? Wouldn't it be awesome if a statistical model could help?
Let's break this down:
Here is the really sweet part. The space of visual analytics has matured dramatically. Creative minds dreaming of the next digital experience cannot be held back by hard-to-understand statistical Greek. Nor can I condone the idea that, just because a magical analytic easy button is accessible in your marketing cloud, one doesn't need to understand what's going on behind the scenes. That last sentence is my personal opinion, and feel free to dive into my mind here.
Want a simple example? Of course you do. I'm sitting in a meeting with a bunch of creatives. They are debating which pages of their website they should run optimization tests on. Should it be one of the top 10 most visited pages? That's an easy web analytics report to run. However, are those the 10 most important pages with respect to a conversion goal? That's where the analyst can step up and help. Here's a snapshot of a gradient boosting machine learning model of what drives conversions, which I built in a few clicks with SAS Visual Data Mining and Machine Learning using sas.com website data collected by SAS Customer Intelligence 360 Discover.
I know what you're thinking. Cool data viz picture. So what? Take a closer look at this...
The model prioritizes what is important. This is critical, as I have transparently highlighted (with statistical rigor, I might add) that site visitor interest in our SAS Customer Intelligence product page pops as an important predictor of what drives conversions. Now what?
The creative masterminds and I agree we should test various ideas on how to optimize the performance of this important web page. A/B test? Multivariate test? As my SAS colleague Malcolm Lightbody stated:
"Multivariate testing is the way to go when you want to understand how multiple web page elements interact with each other to influence goal conversion rate. A web page is a complex assortment of content and it is intuitive to expect that the whole is greater than the sum of the parts. So, why is MVT less prominent in the web marketer’s toolkit?
One major reason – cost. In terms of traffic and opportunity cost, there is a combinatoric explosion in unique versions of a page as the number of elements and their associated levels increase. For example, a page with four content spots, each of which have four possible creatives, leads to a total of 256 distinct versions of that page to test.
If you want to be confident in the test results, then you need each combination, or variant, to be shown to a reasonable sample size of visitors. In this case, assume this to be 10,000 visitors per variant, leading to 2.56 million visitors for the entire test. That might take 100 or more days on a reasonably busy site. But by that time, not only will the marketer have lost interest – the test results will likely be irrelevant."
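The arithmetic behind that combinatorial explosion is simple to verify; here is a tiny Python sketch (the function name is mine):

```python
def full_factorial_cost(spots, creatives_per_spot, visitors_per_variant):
    # every combination of creatives across content spots is a distinct variant
    variants = creatives_per_spot ** spots
    return variants, variants * visitors_per_variant

full_factorial_cost(4, 4, 10_000)   # -> (256, 2560000)
```

Four spots with four creatives each yields 256 variants and, at 10,000 visitors per variant, 2.56 million visitors for the full test; shrinking to three creatives per spot still leaves 81 variants.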
SAS Customer Intelligence 360 provides a business-user interface which allows the user to:
Continuing with my story, we decide to set up a test on the sas.com customer intelligence product page with four content spots, and three creatives per spot. This results in 81 total variants and an estimated sample size of 1,073,000 visits to get a significant read at a 90 percent confidence level.
Notice that Optimize button in the image? Let's talk about the amazing special sauce beneath it. Methodical experimentation has many applications for efficient and effective information gathering. To reveal or model relationships between an input, or factor, and an output, or response, the best approach is to deliberately change the former and see whether the latter changes, too. Actively manipulating factors according to a pre-specified design is the best way to gain useful, new understanding.
However, whenever there is more than one factor – that is, in almost all real-world situations – a design that changes just one factor at a time is inefficient. To properly uncover how factors jointly affect the response, marketers have numerous flavors of multivariate test designs to consider. Factorial experimental designs are the most familiar, such as full factorial, fractional factorial, and mixed-level factorial. The challenge here is that each method has strict requirements.
This leads to designs that, for example, are not orthogonal or that have irregular design spaces. Over a number of years SAS has developed a solution to this problem. This is contained within the OPTEX procedure, and allows testing of designs for which:
The OPTEX procedure can generate an efﬁcient experimental design for any of these situations and website (or mobile app) multivariate testing is an ideal candidate because it applies:
The OPTEX procedure is highly flexible and has many input parameters and options. This means that it can cover different digital marketing scenarios, and its use can be tuned as circumstances demand. Customer Intelligence 360 provides the analytic heavy lifting behind the scenes, and the marketer only needs to make choices for business-relevant parameters. Watch what happens when I press that Optimize button:
Suddenly that scary sample size of 1,073,000 has been reduced to 142,502 visits to perform my test. The immediate benefit is that the impractical multivariate test has become feasible. However, if only a subset of the combinations is being shown, how can the marketer understand what would happen for an untested variant? Simple! SAS Customer Intelligence 360 fits a model using the results of the tested variants and uses it to predict the outcomes for untested combinations. In this way, the marketer can simulate the entire multivariate test and draw reliable conclusions in the process.
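SAS does not publish the exact model behind that button, but the idea can be illustrated with a toy Python sketch: run only a fraction of the full factorial, fit an additive (main-effects) model to the tested variants, and predict the untested ones. The design sizes match the story (4 spots, 3 creatives, 81 variants); everything else – the effect sizes, the response, the names – is made up:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
spots, levels = 4, 3
full = np.array(list(itertools.product(range(levels), repeat=spots)))  # 81 variants

def one_hot(design):
    # intercept plus one dummy column per (spot, creative) pair
    cols = [np.eye(levels)[design[:, s]] for s in range(spots)]
    return np.hstack([np.ones((len(design), 1))] + cols)

# hypothetical additive "true" conversion lift of each creative in each spot
effects = rng.normal(0, 0.02, size=(spots, levels))
true_rate = 0.05 + sum(effects[s][full[:, s]] for s in range(spots))

# run the test on only 40 of the 81 variants ...
tested = rng.choice(len(full), size=40, replace=False)
beta, *_ = np.linalg.lstsq(one_hot(full[tested]), true_rate[tested], rcond=None)

# ... then predict the outcome of every variant, tested or untested
pred = one_hot(full) @ beta
```

Because the true response here is exactly additive, the fitted model recovers every untested variant; with real, noisy data the predictions would of course carry uncertainty.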
So you're telling me we can dream big in the creative process and unleash our superpowers? That's right my friends, you can even preview as many variants of the test's recipe as you desire.
The majority of today’s technologies for digital personalization have generally failed to effectively use predictive analytics to offer customers a contextualized digital experience. Many of today’s offerings are based on simple rules-based recommendations, segmentation and targeting that are usually limited to a single customer touch point. Despite some use of predictive techniques, digital experience delivery platforms are behind in incorporating machine learning to contextualize digital customer experiences.
At the end of the day, connecting the dots between data science and testing, no matter which flavor you select, is a method I advocate. The challenge I pose to every marketing analyst reading this:
Can you tell a good enough data story to inspire the creative minded?
How does a marketer know what to test? was published on Customer Intelligence Blog.
How can you tell if your marketing is working? How can you determine the cost and return of your campaigns? How can you decide what to do next? An effective way to answer these questions is to monitor a set of key performance indicators, or KPIs.
KPIs are the basic statistics that give you a clear idea of how your website (or app) is performing. KPIs vary by predetermined business objectives, and measure progress towards those specific objectives. In the famous words of Avinash Kaushik, KPIs should be uncomplex, relevant, timely, and useful.
An example that fits this description, with applicability to profit, nonprofit, and e-commerce business models, would be the almighty conversion rate. In digital analytics this metric is interpreted as the proportion of visitors to a website or app who take action to go beyond a casual content view or site visit, as a result of subtle or direct requests from marketers, advertisers, and content creators.
Although successful conversions can be defined differently based on your use case, it is easy to see why this KPI is uncomplex, relevant, timely, and useful. We can even splinter this metric into two types:
Macro conversion – Someone completes an action that is important to your business (like making you some money).
Micro conversion – An indicator that a visitor is moving towards a macro conversion (like progressing through a multi-step sales funnel to eventually make you some money).
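As arithmetic, the KPI itself is just a proportion; a trivial Python sketch (the function name and figures are mine):

```python
def conversion_rate(conversions, visitors):
    # proportion of visitors who complete the defined action
    return conversions / visitors if visitors else 0.0

conversion_rate(250, 10_000)   # -> 0.025, i.e. a 2.5% conversion rate
```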
Regardless of the conversion type, I have always found that reporting on this KPI is a popular request that middle management and executives make of analysts. However, it isn't difficult to anticipate what is coming next from the most important person in your business world:
"How can we improve our conversion rate going forward?"
You can report, slice, dice, and segment away in your web analytics platform, but needles in haystacks are not easily discovered unless we adapt. I know change can be difficult, but allow me to make the case for machine learning and hyperparameters within the discipline of digital analytics. A trendy subject for some, a scary subject for others, but my intent is to lend a practitioner's viewpoint. Analytical decision trees are an excellent way to begin because of their frequent usage within marketing applications, primarily due to their approachability and ease of interpretation.
Whether your use case is supervised segmentation or propensity scoring, this form of predictive analytics can be labeled machine learning because of the algorithm's approach to analyzing data. Have you ever researched how trees actually learn before arriving at a final result? It's beautiful math. However, it doesn't end there. We are living in a moment when more sophisticated machine learning algorithms have emerged that can comparatively increase predictive accuracy, precision and – most importantly – marketing-centric KPIs, while being just as easy to construct.
Using the same data inputs across different analysis types like Forests, Gradient Boosting, and Neural Networks, analysts can compare model fit statistics to determine which approach will have the most meaningful impact on your organization's objectives. Terms like cumulative lift or misclassification may not mean much to you, but they are the keys to selecting the math that best answers how conversion rate can be improved by transparently disclosing accurate views of variable importance.
So is that it? Can I just drag and drop my way through the world of visual analytics to optimize against KPIs? Well, there is a tradeoff to discuss here. For some organizations, simply using a machine learning algorithm enabled by an easy-to-use software interface will improve conversion rate tactics on a mobile app screen experience compared to not using any analytic method. But an algorithm cannot be expected to perform well as a one-size-fits-all approach for every type of business problem. It is reasonable to ask whether opportunity is being left on the table, and to motivate analysts to refine the math for the use case. Learning how an algorithm arrives at a final result should not be scary just because it can get a little technical. It's actually quite the opposite, and I love learning how machine learning can be elegant. This is why I want to talk about hyperparameters!
Anyone who has ever built a predictive model understands the iterative nature of adjusting various property settings of an algorithm in an effort to optimize the analysis results. As we endlessly try to improve predictive accuracy, the process becomes painfully repetitive and manual. Because an analyst can spend hours, days, or longer on this task alone, the approach defies our ability as humans to practically arrive at an optimized final solution. Sometimes referred to as autotuning, hyperparameter optimization addresses this issue by exploring different combinations of algorithm options, training a model for each combination in an effort to find the best model. Imagine running thousands of iterations of a website conversion propensity model across different property threshold ranges in a single execution. As a result, these models can improve significantly on the important fit statistics that relate directly to your KPIs.
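Conceptually, autotuning is a search over combinations of algorithm settings. The simplest flavor is an exhaustive grid search, sketched below in Python; the grid, the option names, and the synthetic scoring function are all illustrative, not SAS's implementation:

```python
import itertools

# hypothetical tuning grid for a decision tree
grid = {
    "max_depth": [2, 4, 6, 8],
    "min_leaf_size": [5, 10, 25],
    "split_criterion": ["gini", "entropy"],
}

def misclassification(params):
    # stand-in for "train a model with these settings and return its
    # validation misclassification rate"; synthetic so the sketch runs
    return (abs(params["max_depth"] - 6) * 0.01
            + abs(params["min_leaf_size"] - 10) * 0.001
            + (0.005 if params["split_criterion"] == "gini" else 0.0))

def grid_search(grid):
    best, best_score = None, float("inf")
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = misclassification(params)
        if score < best_score:
            best, best_score = params, score
    return best, best_score

best, score = grid_search(grid)   # evaluates all 24 candidate models in one pass
```

Even this toy grid generates 24 candidate models; realistic grids over continuous thresholds explode into the thousands, which is exactly why automating the search matters.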
At the end of running an analysis with hyperparameters, the best recipe will be identified. Just like any other modeling project, the ability to action off of the insight is no different, from traditional model score code to next-best-action recommendations infused into your mobile app's personalization technology. That's genuinely exciting, courtesy of recent innovations in distributed analytical engines with feature-rich building blocks for machine-learning activities.
If the subject of hyperparameters is new to you, I encourage you to watch this short video.
This will be one of the main themes of my presentations at Analytics Experience 2017 in Washington DC. Using digital data collected by SAS Customer Intelligence 360 and analyzing it with SAS Visual Data Mining and Machine Learning on SAS Viya, I want to share the excitement I am feeling about digital intelligence and predictive personalization. I hope you'll consider joining the SAS family for an awesome agenda between September 18th-20th in our nation's capital.
Hyperparameters, digital analytics, and key performance indicators was published on Customer Intelligence Blog.
Recently, I interviewed three SAS customers to understand firsthand how each is using data visualization and analytics in education. In this series of blog posts, I’ll take you on a journey to learn how these customers are turning their data into insights to be a more data-informed and analytical organization. [...]
Education Analytics: How SAS is used in education was published on SAS Voices by Georgia Mariani
The State of Illinois faces an unprecedented budget crisis, with more than $15 billion in unpaid bills. While experts will argue over the exact causes of states' financial struggles, many are pointing to the problem of state leaders avoiding long-term budgetary problems for short-term fixes. Illinois is not alone in [...]
Battling the government budget crisis with analytics was published on SAS Voices by Trent Smith
With all the recent talk about some people wanting to move from the US to Canada, I got to wondering how cold, and how far north Canada is. And after a few Google searches, I was surprised to learn that 27 US states are actually farther north than the southernmost point [...]
The post So, 27 US states are farther north than Canada, eh? appeared first on SAS Learning Post.
In this post I describe the important tasks of data preparation, exploration and binning. These three steps enable you to know your data well and build accurate predictive models. First you need to clean your data. Cleaning includes eliminating variables which have uneven spread across the target variable. I give an [...]
The post 3 steps to prepare your data for accurate predictive models in SAS Enterprise Miner appeared first on SAS Learning Post.
Recently, I was asked whether SAS can perform a principal component analysis (PCA) that is robust to the presence of outliers in the data. A PCA requires a data matrix, an estimate for the center of the data, and an estimate for the variance/covariance of the variables. Classically, these estimates are the mean and the Pearson covariance matrix, respectively, but if you replace the classical estimates by robust estimates, you can obtain a robust PCA.
This article shows how to implement a classical (non-robust) PCA by using the SAS/IML language and how to modify that classical analysis to create a robust PCA.
Let's use PROC PRINCOMP to perform a simple PCA. The PROC PRINCOMP results will be the basis of comparison when we implement the PCA in PROC IML. The following example is taken from the Getting Started example in the PROC PRINCOMP documentation.
proc princomp data=Crime            /* add COV option for covariance analysis */
              outstat=PCAStats_CORR out=Components_CORR
              plots=score(ncomp=4);
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
   id State;
   ods select Eigenvectors ScreePlot '2 by 1' '4 by 3';
run;

proc print data=Components_CORR(obs=5);
   var Prin:;
run;
To save space, the output from PROC PRINCOMP is not shown, but it includes a table of the eigenvalues and principal component vectors (eigenvectors) of the correlation matrix, as well as a plot of the scores of the observations, which are the projection of the observations onto the principal components. The procedure writes two data sets: the eigenvalues and principal components are contained in the PCAStats_CORR data set and the scores are contained in the Components_CORR data set.
Assume that the data consists of n observations and p variables and assume all values are nonmissing. If you are comfortable with multivariate analysis, a principal component analysis is straightforward: the principal components are the eigenvectors of a covariance or correlation matrix, and the scores are the projection of the centered data onto the eigenvector basis. However, if it has been a few years since you studied PCA, Abdi and Williams (2010) provides a nice overview of the mathematics. The following matrix computations implement the classical PCA in the SAS/IML language:
proc iml;
reset EIGEN93;                   /* use "9.3" algorithm, not vendor BLAS (SAS 9.4m3) */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
read all var varNames into X;    /* X = data matrix (assume nonmissing) */
close;

S = cov(X);                      /* estimate of covariance matrix */
R = cov2corr(S);                 /* = corr(X) */
call eigen(D, V, R);             /* D=eigenvalues; columns of V are PCs */
scale = T( sqrt(vecdiag(S)) );   /* = std(X) */
B = (X - mean(X)) / scale;       /* center and scale data about the mean */
scores = B * V;                  /* project standardized data onto the PCs */

print V[r=varNames c=("Prin1":"Prin7") L="Eigenvectors"];
print (scores[1:5,])[c=("Prin1":"Prin7") format=best9. L="Scores"];
The principal components (eigenvectors) and scores for these data are identical to the same quantities that were produced by PROC PRINCOMP. In the preceding program I could have directly computed R = corr(X) and scale = std(X), but I generated those quantities from the covariance matrix because that is the approach used in the next section, which computes a robust PCA.
If you want to compute the PCA from the covariance matrix, you would use S in the EIGEN call, and define B = X - mean(X) as the centered (but not scaled) data matrix.
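The same matrix computations are language-agnostic. Here is a Python/NumPy sketch of the classical correlation-based PCA on made-up data; the variable names mirror the IML program:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))   # toy correlated data

S = np.cov(X, rowvar=False)          # estimate of covariance matrix
scale = np.sqrt(np.diag(S))          # = std(X)
R = S / np.outer(scale, scale)       # cov2corr
evals, V = np.linalg.eigh(R)         # eigh returns ascending eigenvalues ...
order = np.argsort(evals)[::-1]      # ... so sort descending, like CALL EIGEN
evals, V = evals[order], V[:, order]
B = (X - X.mean(axis=0)) / scale     # center and scale data about the mean
scores = B @ V                       # project standardized data onto the PCs
```

A quick sanity check: the sample covariance of the score columns is diagonal and reproduces the eigenvalues, which sum to the number of variables.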
This section is based on a similar robust PCA computation in Wicklin (2010). The main idea behind a robust PCA is that if there are outliers in the data, the covariance matrix will be unduly influenced by those observations. Therefore you should use a robust estimate of the covariance matrix for the eigenvector computation. Also, the arithmetic mean is unduly influenced by the outliers, so you should center the data by using a robust estimate of center before you form the scores.
/* get robust estimates of location and covariance */
N = nrow(X);  p = ncol(X);              /* number of obs and variables */
optn = j(8,1,.);                        /* default options for MCD */
optn[4] = floor(0.75*N);                /* override default: use 75% of data for robust estimates */
call MCD(rc, est, dist, optn, X);       /* compute robust estimates */
RobCov = est[3:2+p, ];                  /* robust estimate of covariance */
RobLoc = est[1, ];                      /* robust estimate of location */

/* use robust estimates to perform PCA */
RobCorr = cov2corr(RobCov);             /* robust correlation */
call eigen(D, V, RobCorr);              /* D=eigenvalues; columns of V are PCs */
RobScale = T( sqrt(vecdiag(RobCov)) );  /* robust estimates of scale */
B = (X - RobLoc) / RobScale;            /* center and scale data */
scores = B * V;                         /* project standardized data onto the PCs */
Notice that the SAS/IML code for the robust PCA is very similar to the classical PCA. The only difference is that the robust estimates of covariance and location (from the MCD call) are used instead of the classical estimates.
If you want to compute the robust PCA from the covariance matrix, you would use RobCov in the EIGEN call, and define B = X - RobLoc as the centered (but not scaled) data matrix.
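In the NumPy setting, swapping in robust estimates is the same one-for-one substitution. The MCD algorithm is not reimplemented here; as a crude stand-in, this sketch keeps the 75% of observations closest to the coordinatewise median and estimates location and covariance from them:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[:10] += 10.0                          # plant ten gross outliers

# crude robust location/covariance (a stand-in for the MCD estimate):
# keep the 75% of points closest to the coordinatewise median
med = np.median(X, axis=0)
d = np.linalg.norm(X - med, axis=1)
keep = d <= np.quantile(d, 0.75)
RobLoc = X[keep].mean(axis=0)           # barely moved by the outliers
RobCov = np.cov(X[keep], rowvar=False)

# identical PCA steps, with the robust estimates swapped in
RobScale = np.sqrt(np.diag(RobCov))
RobCorr = RobCov / np.outer(RobScale, RobScale)
evals, V = np.linalg.eigh(RobCorr)
evals, V = evals[::-1], V[:, ::-1]
scores = ((X - RobLoc) / RobScale) @ V
```

The classical mean of each column is dragged about one unit toward the planted outliers, while the trimmed estimate stays near zero, which is the whole point of robust centering.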
You can create a score plot to compare the classical scores to the robust scores. The Getting Started example in the PROC PRINCOMP documentation shows the classical scores for the first three components. The documentation suggests that Nevada, Massachusetts, and New York could be outliers for the crime data.
You can write the robust scores to a SAS data set and use the SGPLOT procedure to plot the scores of the classical and robust PCA on the same scale. The first and third component scores are shown to the right, with abbreviated state names serving as labels for the markers. (Click to enlarge.) You can see that the robust first-component scores for California and Nevada are separated from the other states, which makes them outliers in that dimension. (The first principal component measures overall crime rate.) The robust third-component scores for New York and Massachusetts are also well-separated and are outliers for the third component. (The third principal component appears to contrast murder with rape and auto theft with other theft.)
Because the crime data does not have huge outliers, the robust PCA is a perturbation of the classical PCA, which makes it possible to compare the analyses. For data that have extreme outliers, the robust analysis might not resemble its classical counterpart.
If you run an analysis like this on your own data, remember that eigenvectors are not unique and so there is no guarantee that the eigenvectors for the robust correlation matrix will be geometrically aligned with the eigenvectors from the classical correlation matrix. For these data, I multiplied the second and fourth robust components by -1 because that seems to make the score plots easier to compare.
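That sign flip can be automated. Here is a small NumPy helper (my own, for illustration) that orients each robust eigenvector to agree with its classical counterpart:

```python
import numpy as np

def align_signs(V_rob, V_cls):
    # flip any robust eigenvector whose dot product with the matching
    # classical eigenvector is negative, so the score plots are comparable
    signs = np.sign(np.sum(V_rob * V_cls, axis=0))
    signs[signs == 0] = 1.0
    return V_rob * signs

# e.g., suppose the 2nd and 4th components came out with opposite orientation
V_cls = np.linalg.qr(np.random.default_rng(3).normal(size=(4, 4)))[0]
V_rob = V_cls * np.array([1.0, -1.0, 1.0, -1.0])
aligned = align_signs(V_rob, V_cls)     # matches V_cls again
```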
In summary, you can implement a robust principal component analysis by using robust estimates for the correlation (or covariance) matrix and for the "center" of the data. The MCD subroutine in SAS/IML language is one way to compute a robust estimate of the covariance and location of multivariate data. The SAS/IML language also provides ways to compute eigenvectors (principal components) and project the (centered and scaled) data onto the principal components.
You can download the SAS program that computes the analysis and creates the graphs.
As discussed in Hubert, Rousseeuw, and Branden (2005), the MCD algorithm is most useful for 100 or fewer variables. They describe an alternative computation that can support a robust PCA in higher dimensions.
The post Robust principal component analysis in SAS appeared first on The DO Loop.