In the first of three posts on using automated analysis with SAS Visual Analytics, we explored a typical visualization designed to give telco customer care workers guidance on the customers most receptive to upgrading their plans. While the analysis provided some insight, it lacked analytical depth -- and that increases the risk of wasting time, energy and money on a strategy that may not succeed.
Let’s now look at the same data, but this time deepen the analytical view by putting SAS Visual Analytics' automated analysis into play. We’ll use automated analysis to determine significant variables that impact our key business measure, X-sell and Up-sell Flag.
Less time spent on data discovery, quicker response time
The automated analysis object determines the most important underlying factors for a specific response variable, in our case the X-Sell and Up-Sell flag. After you specify a response variable, most of the remaining data items are added as underlying factors. Variables that are identical to the response variable, variables that have excessive missing values, or variables that have high cardinality are not added as underlying factors. For category responses, you can select the event level (category value) that interests you.
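As a rough illustration of the screening rules just described (this is not SAS's actual implementation, and the threshold values are hypothetical), the logic can be sketched in a few lines of Python:

```python
# Hypothetical sketch of the screening described above: drop candidate
# factors that duplicate the response, have excessive missing values,
# or have high cardinality. Thresholds are illustrative only.
def screen_factors(data, response, max_missing=0.5, max_levels=100):
    """data: dict mapping column name -> list of values (None = missing)."""
    y = data[response]
    keep = []
    for name, col in data.items():
        if name == response or col == y:   # identical to the response
            continue
        if sum(v is None for v in col) / len(col) > max_missing:
            continue                       # excessive missing values
        if len(set(col)) > max_levels:
            continue                       # high cardinality
        keep.append(name)
    return keep
```

Everything that survives a screen like this is kept as an underlying factor for the analysis.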
To run automated analysis, I will go to the data pane, right-click on Xsell and Upsell Flag Category, select Analyze, then Analyze on new page.
Here we see the results for Not Yet Upgraded.
Since we really want to understand what made our customers upgrade so we can learn from it, let's change the results to show upgraded accounts. To do this, I will use the drop-down menu to change the category value to Upgraded.
Now we see the details for Upgraded. Let’s look at each piece of information within this chart.
The top section tells us that the probability of a customer upgrading is 12.13%. It also tells us which other variables in our dataset influence that probability. The strongest influencers are Total Days Over Plan, Days Suspended Last 6M (months), Total Times Over Plan and Delinquent Indicator. Remember from our previous analysis, the correlation matrix determined that Total Days Over Plan, Delinquent Indicator and Days Suspended Last 6M were correlated with our X-sell and Up-sell Flag. So this part of the analysis is quite similar. However, the rest of the automated analysis provides far more information than our previous analysis did -- and it was produced in under a minute.
The next section gives us a visual on how strongly each influencer affects our variable of interest, Xsell and Upsell Flag Category. Total Days Over Plan is the strongest, followed by Days Suspended Last 6M, then Total Times Over Plan, and so on. If we mouse over each of the boxes, we'll see their relative importance.
After SAS Visual Analytics adds the underlying factors, it creates a relative importance score for each underlying factor. The most important underlying factor is assigned a score of 1, and all other scores are proportional to that value.
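That scoring rule is easy to state concretely. As a hypothetical illustration (the raw importance numbers below are made up, and this is not SAS's internal code), scaling so that the top factor scores 1 looks like this:

```python
# Scale raw importance values so the strongest factor scores 1 and all
# other factors are proportional to it, as described above. The raw
# numbers here are hypothetical, purely for illustration.
def relative_importance(raw_scores):
    top = max(raw_scores.values())
    return {name: score / top for name, score in raw_scores.items()}

raw = {"Total Days Over Plan": 8.4,
       "Days Suspended Last 6M": 5.6,
       "Total Times Over Plan": 3.1}
scores = relative_importance(raw)   # top factor -> 1.0, rest proportional
```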
If I mouse over Total Days Over Plan I’ll see the relative importance score for that variable.
Here we see that the relative importance of Total Days Over Plan in influencing a customer to change plans is 1. That means it was the most important factor in predicting our variable of interest, cross-sell and up-sell flag. If I mouse over Days Suspended Last 6M, I can see that the relative importance for that variable is 0.6664.
The percentages along the left-hand side give us the probability (or chance) that each subgroup of customers will upgrade. SAS Visual Analytics shows the top groups and the bottom groups based on probability. The first group of customers is 100% likely to upgrade. These customers have Total Days Over Plan greater than or equal to 33, Days Suspended Last 6M greater than or equal to 6, 6M Avg Minutes on Network Normally Distributed less than -6.9, and Delinquent Indicator of 1, 2, 3 or 4. This means that, going forward, if we have customers who meet these criteria we should target them for an upgrade because they are 100% likely to upgrade. We can also target the next three customer groups.
For measure responses, the results display the four groups that result in the greatest values of the response. The results also display the two groups that result in the smallest values of the response. For category responses, the results display the four groups that contain the greatest percentages of the response. The results also display the two groups that contain the least percentages of the response.
The bottom right chart shows how a variable relates to our variable of interest. Below the chart is a description outlining key findings.
An explanatory plot is included for each underlying factor. The contents of this plot depend on the variable type of both the response variable and the underlying factor.
If I click on Days Suspended Last 6M from the colored button bar, the informative text will be highlighted, and the plot chart will be updated to reflect my selection.
But what if you want to see all the variables analyzed and discover what actions were taken on them? If we maximize the automated analysis object, we see a table at the bottom that outlines the actions taken on the predictors.
Here we see that Census Area Total Males was rejected because it is too strongly correlated with another measure. This reason would be easy to miss, and failing to remove that predictor would affect the results of an analysis or model. Automated analysis really does do the thinking for us and makes models more accurate!
In the final post of this three-part series, we'll see how we can turn the results from this automated analysis into actionable items.
How SAS Visual Analytics' automated analysis takes customer care to the next level - Part 2 was published on SAS Users.
Deming regression (also called errors-in-variables regression) is a total regression method that fits a regression line when the measurements of both the explanatory variable (X) and the response variable (Y) are assumed to be subject to normally distributed errors. Recall that in ordinary least squares regression, the explanatory variable (X) is assumed to be measured without error. Deming regression is explained in a Wikipedia article and in a paper by K. Linnet (1993).
A situation in which both X and Y are measured with errors arises when comparing measurements from different instruments or medical devices. For example, suppose a lab test measures the amount of some substance in a patient's blood. If you want to monitor this substance at regular intervals (for example, hourly), it is expensive, painful, and inconvenient to take the patient's blood multiple times. If someone invents a medical device that goes on the patient's finger and measures the substance indirectly (perhaps by measuring an electrical property such as bioimpedance), then that device would be an improved way to monitor the patient. However, as explained in Deal, Pate, and El Rouby (2009), the FDA would first need to approve the device and determine that it measures the response as accurately as the existing lab test. The FDA encourages the use of Deming regression for method-comparison studies.
Deming regression in SAS
There are several ways to compute a Deming regression in SAS. The SAS FASTats site suggests maximum likelihood estimation (MLE) by using PROC OPTMODEL, PROC IML, or PROC NLMIXED. However, you can solve the MLE equations explicitly to obtain an explicit formula for the regression estimates. Deal, Pate, and El Rouby (2009) present a rather complicated macro, whereas Njoya and Hemyari (2017) use simple SQL statements. Both papers also provide SAS code for estimating the variance of the Deming regression estimates, either by using the jackknife method or by using the bootstrap. However, the resampling schemes in both papers are inefficient because they use a macro loop to perform the jackknife or bootstrap.
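To see how simple the closed form is, here is a minimal Python sketch of the same estimator (following the parameterization in the Wikipedia article, with delta the assumed ratio of the error variances). It is an illustration only, not the SAS code from the papers:

```python
# Closed-form Deming regression estimates (Wikipedia parameterization).
# delta is the assumed ratio of the error variances; delta=1 corresponds
# to orthogonal regression. Illustrative sketch only.
def deming(x, y, delta=1.0):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)    # sample variance of x
    syy = sum((yi - my) ** 2 for yi in y) / (n - 1)    # sample variance of y
    sxy = sum((xi - mx) * (yi - my)
              for xi, yi in zip(x, y)) / (n - 1)       # sample covariance
    c = syy - delta * sxx
    b1 = (c + (c * c + 4 * delta * sxy * sxy) ** 0.5) / (2 * sxy)  # slope
    b0 = my - b1 * mx                                              # intercept
    return b0, b1

# Sanity check: points exactly on y = 2x recover slope 2 and intercept 0
b0, b1 = deming([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```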
The following SAS DATA Step defines pairs of hypothetical measurements for 65 patients, each of whom received the standard lab test (measured in micrograms per deciliter) and the new noninvasive device (measured in kiloohms):
data BloodTest;
label x="micrograms per deciliter" y="kiloOhms";
input x y @@;
datalines;
169.0 45.5  130.8 33.4  109.0 23.8   94.1 19.8   86.3 20.4   78.4 18.7   76.1 16.1   72.2 16.7   70.0 11.9   69.8 14.6   69.5 10.6   68.7 12.7   67.3 16.9
174.7 57.8  137.9 39.0  114.6 30.4   99.8 21.1   90.1 21.7   85.1 25.2   80.7 20.6   78.1 19.3   77.8 20.9   76.0 18.2   77.8 18.3   74.2 15.7   73.1 13.9
182.5 55.5  144.0 38.7  123.8 35.1  107.6 30.6   96.9 25.7   92.8 19.2   87.2 22.4   86.3 18.4   84.4 20.7   83.7 20.6   83.3 20.0   83.9 18.8   82.7 21.8
160.8 49.9  122.7 32.2  102.6 19.2   86.6 14.7   76.1 16.6   69.6 18.8   66.7  7.4   64.4  8.2   63.0 15.5   61.7 13.7   61.2  9.2   62.4 12.0   58.4 15.2
171.3 48.7  136.3 36.1  111.9 28.6   96.5 21.8   90.3 25.6   82.9 16.8   78.1 14.1   76.5 14.2   73.5 11.9   74.4 17.7   73.9 17.6   71.9 10.2   72.0 15.6
;

title "Deming Regression";
title2 "Gold Standard (X) vs New Method (Y)";
proc sgplot data=BloodTest noautolegend;
   scatter x=x y=y;
   lineparm x=0 y=-10.56415 slope=0.354463 / clip;  /* Deming regression estimates */
   xaxis grid label="Lab Test (micrograms per deciliter)";
   yaxis grid label="New Device (kiloohms)";
run;
The scatter plot shows the pairs of measurements for each patient. The linear pattern indicates that the new device is well calibrated with the standard lab test over a range of clinical values. The diagonal line represents the Deming regression estimate, which enables you to convert one measurement into another. For example, a lab test that reads 100 micrograms per deciliter is expected to correspond to 25 kiloohms on the new device and vice versa. (If you want to convert the new readings into the old, you can regress X onto Y and plot X on the vertical axis.)
The following SAS/IML function implements the explicit formulas that compute the slope and intercept of the Deming regression line:
/* Deming Regression in SAS */
proc iml;
start Deming(XY, lambda=);
   /* Equations from https://en.wikipedia.org/wiki/Deming_regression */
   m = mean(XY);
   xMean = m[1];  yMean = m[2];
   S = cov(XY);
   Sxx = S[1,1];  Sxy = S[1,2];  Syy = S[2,2];
   /* if lambda is specified (eg, lambda=1), use it. Otherwise, estimate. */
   if IsEmpty(lambda) then delta = Sxx / Syy;  /* estimate of ratio of variance */
   else delta = lambda;
   c = Syy - delta*Sxx;
   b1 = (c + sqrt(c**2 + 4*delta*Sxy**2)) / (2*Sxy);
   b0 = yMean - b1*xMean;
   return (b0 || b1);
finish;

/* Test the program on the blood test data */
use BloodTest;  read all var {x y} into XY;  close;
b = Deming(XY);
print b[c={'Intercept' 'Slope'} L="Deming Regression"];
The SAS/IML function can estimate the ratio of the variances of the X and Y variable. In the SAS macros by Deal, Pate, and El Rouby (2009) and Njoya and Hemyari (2017), the ratio is a parameter that is determined by the user. The examples in both papers use a ratio of 1, which assumes that the devices have an equal accuracy and use the same units of measurement. In the current example, the lab test and the electrical device use different units. The ratio of the variances for these hypothetical devices is about 7.4.
Standard errors of estimates
You might wonder how accurate the parameter estimates are. Linnet (1993) recommends using the jackknife method to answer that question. I have previously explained how to jackknife estimates in SAS/IML, and the following program is copied from that article:
/* Helper modules for jackknife estimates of standard error and CI for parameters */
/* return the vector {1,2,...,i-1, i+1,...,n}, which excludes the scalar value i */
start SeqExclude(n,i);
   if i=1 then return 2:n;
   if i=n then return 1:n-1;
   return (1:i-1) || (i+1:n);
finish;

/* return the i_th jackknife sample for (n x p) matrix X */
start JackSamp(X,i);
   return X[ SeqExclude(nrow(X), i), ];  /* return data without i_th row */
finish;

/* 1. Compute T = statistic on original data */
T = b;

/* 2. Compute statistic on each leave-one-out jackknife sample */
n = nrow(XY);
T_LOO = j(n,2,.);   /* LOO = "Leave One Out" */
do i = 1 to n;
   J = JackSamp(XY,i);
   T_LOO[i,] = Deming(J);
end;

/* 3. Compute mean of the LOO statistics */
T_Avg = mean( T_LOO );

/* 4. Compute jackknife estimates of standard error and CI */
stdErrJack = sqrt( (n-1)/n * (T_LOO - T_Avg)[##,] );
alpha = 0.05;
tinv = quantile("T", 1-alpha/2, n-2);  /* use df=n-2 b/c both x and y are estimated */
Lower = T - tinv#stdErrJack;
Upper = T + tinv#stdErrJack;
result = T` || T_Avg` || stdErrJack` || Lower` || Upper`;
print result[c={"Estimate" "Mean Jackknife Estimate" "Std Error" "Lower 95% CL" "Upper 95% CL"}
             r={'Intercept' 'Slope'}];
The formulas for the jackknife computation differ slightly from those in the SAS macro by Deal, Pate, and El Rouby (2009). Because both X and Y have errors, the t quantile must be computed by using n–2 degrees of freedom, not n–1.
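The jackknife recipe itself (steps 2 through 4 above) is language-agnostic. As a sketch, here it is in Python for a generic statistic; applied to the sample mean, it reproduces the classical standard error s/√n, which is a handy correctness check. This is an illustration, not the author's SAS/IML code:

```python
# Leave-one-out jackknife standard error for an arbitrary statistic,
# using the same formula as above: sqrt((n-1)/n * sum((T_i - T_avg)^2)).
def jackknife_se(data, stat):
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]  # leave-one-out stats
    avg = sum(loo) / n
    return ((n - 1) / n * sum((t - avg) ** 2 for t in loo)) ** 0.5

# For the sample mean, the jackknife SE matches the classical s/sqrt(n)
se = jackknife_se([1, 2, 3, 4, 5], lambda d: sum(d) / len(d))  # ~0.7071
```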
If X and Y are measured on the same scale, then the methods are well-calibrated when the 95% confidence interval (CI) for the intercept includes 0 and the CI for the slope includes 1. In this example, the devices use different scales. The Deming regression line enables you to convert from one measurement scale to the other; the small standard errors (narrow CIs) indicate that this conversion is accurate.
In summary, you can use a simple set of formulas to implement Deming regression in SAS. This article uses SAS/IML to implement the regression estimates and the jackknife estimate of the standard errors. You can also use the macros that are mentioned in the section "Deming regression in SAS," but the macros are less efficient, and you need to specify the ratio of the variances of the data vectors.
The post Deming regression for comparing different measurement methods appeared first on The DO Loop.
Do you want to start your year feeling great about data and analytics? Then dive into this list and read a few data for good stories that you might have missed in 2018. You'll learn how data scientists are using analytics to save endangered species, prevent suicide, combat cancer - [...]
Top 10 data for good articles of 2018 was published on SAS Voices by Alison Bolen
You're the operations director for a major telco's contact center. Your customer-care workers enjoy solving problems. Turning irate callers into fans makes their day.
They also hate flying blind. They've been begging you for deeper insight into customer data to better serve their callers. They want to know which customers will likely accept offers and upgrades they're authorized to give. Their success = customer satisfaction = your company's success, right?
Automated analytics provides that level of insight, and this post introduces you to it. It will help you begin to think through what it looks like to equip your contact center workers to be heroes. Two subsequent posts will further demonstrate how SAS Visual Analytics leverages automated analytics.
What is automated analytics?
If you're already familiar with business intelligence tools, it's not a stretch to call automated analytics disruptive, significantly changing the way you see BI. In essence, automated analytics uses machine learning to find meaningful relationships between variables. It provides valuable insights in easy-to-understand text generated using natural language.
Automated analytics, which is expanding to include Artificial Intelligence, overcomes barriers to insight-driven business decisions by reducing:
- Time to insights.
- Bias in the analysis.
- The need for more employee training.
An analyst-intensive approach to better insights
Now put your analyst hat on and imagine a day in the life of interpreting data visualizations. Pictured below is a report created to explore and visualize customers' interactions with a telecommunications company. It contains usage information from a subset of customers who have contacted customer care centers. Enhanced by adding cleansed demographics data, this report is being used to target customers for cross-sell or up-sell opportunities.
Note that the Private Label GM channel has the highest upgrade rate, 50%. This could mean that customers who purchased their plans through the Private Label GM channel were not well informed about their options and might have purchased a plan that did not fit their needs. We could investigate this further and see how we can better assist customers who purchase their plans through the Private Label GM channel.
This report also shows us that the unknown handset type had the highest upgrade rate of all phone types. Unknown handset indicates that this customer brought their phone over from another company. So, this high upgrade rate is not surprising as a recent promotion targeted those users to switch their phones and upgrade.
The analysis showing our upgrade rate and total upgrades by plan type shows us that the Lotta Minutes Classic plan had the highest number of upgrades. This is not surprising as it also has the highest number of accounts. However, the Data Bytes Value plan had the highest upgrade rate but very few accounts. We could focus on the Unlimited SL plan customers and offer them upgrades, as they seem to be more likely to upgrade than customers on other plans and there are quite a few customers still on that plan.
From the analysis on the bottom we can see the correlation of other variables to our variable of interest, cross-sell and up-sell flag. We can see that Total Days Over Plan, Delinquent Indicator and Days Suspended Last 6M are weakly correlated to upgrades.
What’s interesting here is that data plan is not correlated to our variable of interest, Xsell and Upsell flag. This tells me that if we had started a campaign targeted on Unlimited SL customers, we probably wouldn’t have much success.
We might also want to target customers based on Total Days Over Plan, Delinquent Indicator or Days Suspended Last 6M, but those variables were only weakly correlated with upgrades.
While this correlation analysis provided great insight and may have prevented us from going down the wrong path, I had to manually choose the variables I wanted to include. I used my own logic and chose variables I thought might influence our variable of interest, Xsell and Upsell Flag.
Risk of mistakes, missed opportunity
But there are many other variables in this dataset. What if one of the other variables that I hadn’t thought of was correlated? I would miss some key findings. Or what if multiple variables in combination better predict our variable of interest, Xsell and Upsell Flag?
To dive deeper we could add a decision tree, or other charts to try to determine where we should focus our future efforts. This would take some time to build and we’d need to interpret the results on our own. However, if we use automated analysis the application would:
- Choose the most relevant categories and measures.
- Perform the most appropriate analytics for our data.
- Provide us with easy-to-understand results.
Upcoming: A closer look at automated analytics
In next week's post, you'll see what happens when we turn loose the power of automated analytics with the SAS Viya Platform and let SAS Visual Analytics analyze all the measures and categories.
What's your experience with automated analytics? Share in the comments.
How SAS Visual Analytics' automated analysis takes customer care to the next level - Part 1 was published on SAS Users.
Last year, I wrote more than 100 posts for The DO Loop blog. Of these, the most popular articles were about data visualization, SAS programming tips, and statistical data analysis. Here are the most popular articles from 2018 in each category.
Data Visualization
- Visualize repetition in song lyrics: In one of my favorite posts of the year, I use SAS to visualize the "repetition matrix" that shows how words and phrases are repeated in the lyrics of some popular songs. I had so much fun writing the post that I revisited it in December to visualize repetition in Christmas songs.
- Rating US Presidents: 166 experts in political science ranked the 44 US presidents on a "greatness" scale. You can use SAS to visualize the ranking of US presidents according to these experts.
- Customize legends in statistical graphs: In two articles, I showed ways to customize the legends in SAS statistical graphs. First, I showed how to combine symbols and line patterns in a legend by using the new LEGENDITEM statement. Later, I showed five ways to customize a legend in the SGPLOT procedure.
General SAS programming techniques
- CLASS versus BY variables: What is the difference between the CLASS statement and BY variables in SAS? When should you use one instead of the other in a SAS procedure?
- How to create a "Top 10" table and graph: Often it is useful to visualize the largest (most frequently occurring) categories of a categorical variable. This article shows how to use PROC FREQ to display only the top 10 levels (or five, or three,...) of a categorical variable.
- How to use DATA _NULL_ step: When you use the keyword '_NULL_' as the name of a data set, the output is not written. This article shows six reasons to use the _NULL_ data set in your SAS programs.
- FIRST.variable and LAST.variable in a BY-group: This article gives several examples of how to use the FIRST.variable and LAST.variable indicator variables for BY-group analysis in the SAS DATA step.
- Specify lists of variables in SAS: The SAS language provides syntax that enables you to specify lists of variables. This article describes six ways to specify a list of variables in SAS, including SAS keywords and syntax such as hyphens, colons, and double hyphens.
Statistics and Data Analysis
- New random number generators in SAS: Have you heard that SAS has a slew of new random number generators? Read about why you might want to use a new random number generator (RNG) instead of the default RNG. Or, you can watch a video about the new SAS random number generators.
- Calibration plots: A calibration plot for a binary response model can help you to rank subjects according to risk. It also is a way to diagnose lack of fit in a model. Learn about calibration curves and how to use a smooth curve to compare the predicted and empirical probabilities for the data. The SAS developer of PROC LOGISTIC liked this article a lot, so look for calibration plots in a future release of SAS/STAT software!
- Bootstrap methods in SAS: I published six bootstrap articles in 2018. The most popular demonstrate how to bootstrap the statistics in a two-sample t test. You can use the BOOTSTRAP statement in PROC TTEST or you can go "old school" and manually perform a bootstrap analysis for the t test.
- Chi-square tests in SAS: The chi-square test is a fundamental hypothesis test for testing the proportions of categorical variables. My first article shows how to compute a two-variable chi-square test for association in SAS. A second article shows how to specify the proportions for a one-way chi-square test of proportions.
I write this blog because I love to learn new things and share what I know with others. If you want to learn something new, read (or re-read!) these popular articles from 2018. Then share this page with one of your colleagues. Happy New Year! I hope we both have many opportunities to learn and share in 2019!
The post Top posts from <em>The DO Loop</em> in 2018 appeared first on The DO Loop.
In my last blog post we defined data scientists – who they are and what they do. In this post, we'll discuss the data engineer, which is someone with a role so important that you'll want this person as your very best friend! A data engineer transforms and integrates data [...]
The post The data engineer: Jack of all trades appeared first on The Data Roundtable.