January 23, 2020
 

When I attend a conference, one of the first things I do is look at the agenda. This gives me a good overview of how my time will be spent.

The next thing I do is find the detailed breakdown of sessions, so I can start building out my own personal agenda. I know my areas of interest, and I want to make sure my time is spent learning as much as possible. I’ve done this at industry conferences, as well as at every SAS GF I’ve attended (and I’ve attended a lot).

I am happy to report that the session catalog for SAS Global Forum 2020 helps me understand what sessions are available so I can make the most of my conference experience. Here are my tips to make the most of the session catalog:

Use filters

For the first time, all programs are represented in the catalog. And anyone can attend any session. You can start broad, then filter by level of expertise, topic of interest, industry, product, or style of presentation. Filters are your friend…use them!

Use the interest list

You can bookmark sessions to your interest list, but beware that these do not automatically save for you. Be sure to use the export feature. Then you can share your chosen sessions with friends, print them, and cross-check them as you’re building your agenda in the mobile app starting March 1. It’s a great tool.

Review sample agendas

Speaking of agendas, it’s sometimes hard to narrow down your options. Use one of the custom-tailored agendas within the session catalog as a starting point for completing your personal agenda. We keep adding new sample agendas, so you will have many great sessions to pick from!

Solicit the help of your SAS point of contact

If you’re a first-timer, or someone who is overwhelmed by too many choices, let your SAS contact help. They can help select sessions that they feel will be beneficial to your learning and networking.

Moral of the story

If you can’t tell, I’m a big fan of the session catalog. It’s totally customizable. The way I use it may not be the way you use it and that’s fine. That’s why there are so many ways to search, filter and save. Just give it a try. Your SAS Global Forum experience will be all the better when you include the session catalog.

And remember, anyone can check out the session catalog. Not planning to attend? Well, boo, go ahead and see what you’re missing. If you’re on the fence about the conference, check out all the great things SAS Global Forum has to offer then go register!

Make the most of SAS Global Forum with the session catalog was published on SAS Users.

January 23, 2020
 

I was recently asked about how to interpret the output from the COLLIN (or COLLINOINT) option on the MODEL statement in PROC REG in SAS. The example in the documentation for PROC REG is correct but is somewhat terse regarding how to use the output to diagnose collinearity and how to determine which variables are collinear. This article uses the same data but goes into more detail about how to interpret the results of the COLLIN and COLLINOINT options.

An overview of collinearity in regression

Collinearity (sometimes called multicollinearity) involves only the explanatory variables. It occurs when a variable is nearly a linear combination of other variables in the model. Equivalently, there is a set of explanatory variables that is linearly dependent in the sense of linear algebra. (Another equivalent statement is that the design matrix and the X`X matrix are not full rank.)

For example, suppose a model contains five regressor variables and the variables are related by X3 = 3*X1 - X2 and X5 = 2*X4. In this case, there are two linear relationships among the regressors: one that involves the variables X1, X2, and X3, and another that involves the variables X4 and X5. In practice, collinearity means that a set of variables are almost linear combinations of each other. For example, the vectors u = X3 - 3*X1 + X2 and v = X5 - 2*X4 are close to the zero vector.

Unfortunately, the words "almost" and "close to" are difficult to quantify. The COLLIN option on the MODEL statement in PROC REG provides a way to analyze the design matrix for potentially harmful collinearities.
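To make "almost" concrete, here is a minimal sketch (not from the PROC REG documentation) that simulates the five regressors described above, adds a little noise to the two linear relationships, and requests the collinearity diagnostics. The variable names, sample size, response, and noise level are illustrative assumptions.

/* Sketch: simulate regressors with two near-linear dependencies and inspect them */
data collinear_sim;
   call streaminit(123);
   do i = 1 to 100;
      x1 = rand("Normal");
      x2 = rand("Normal");
      x4 = rand("Normal");
      x3 = 3*x1 - x2 + 0.01*rand("Normal");   /* nearly 3*X1 - X2 */
      x5 = 2*x4 + 0.01*rand("Normal");        /* nearly 2*X4 */
      y  = x1 + x2 + x4 + rand("Normal");     /* an arbitrary response */
      output;
   end;
   drop i;
run;
 
proc reg data=collinear_sim plots=none;
   model y = x1-x5 / collin;   /* expect two rows with very large condition indices */
quit;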

Why should you avoid collinearity in regression?

The assumptions of ordinary least squares (OLS) regression are not violated if there is collinearity among the independent variables. OLS regression still provides the best linear unbiased estimates of the regression coefficients.

The problem is not the estimates themselves but rather the variance of the estimates. One problem caused by collinearity is that the standard errors of those estimates will be big. This means that the predicted values, although the "best possible," will have wide prediction limits. In other words, you get predictions, but you can't really trust them.

A second problem concerns interpretability. The sign and magnitude of a parameter estimate indicate how the dependent variable changes due to a unit change of the independent variable when the other variables are held constant. However, if X1 is nearly collinear with X2 and X3, it does not make sense to discuss "holding constant" the other variables (X2 and X3) while changing X1. The variables necessarily must change together. Collinearities can even cause some parameter estimates to have "wrong signs" that conflict with your intuitive notion about how the dependent variable should depend on an independent variable.

A third problem with collinearities is numerical rather than statistical. Strong collinearities cause the cross-product matrix (X`X) to become ill-conditioned. Computing the least squares estimates requires solving a linear system that involves the cross-product matrix. Solving an ill-conditioned system can result in relatively large numerical errors. However, in practice, the statistical issues usually outweigh the numerical one. A matrix must be extremely ill-conditioned before the numerical errors become important, whereas the statistical issues are problematic for moderate to large collinearities.

How do you interpret the output of the COLLIN option?

The following example is from the "Collinearity Diagnostics" section of the PROC REG documentation. Various health and fitness measurements were recorded for 31 men, such as time to run 1.5 miles, the resting pulse, the average pulse rate while running, and the maximum pulse rate while running. These measurements are used to predict the oxygen intake rate, which is a measurement of fitness but is difficult to measure directly.

data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
 
proc reg data=fitness plots=none;
   model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse / collin;
   ods select ParameterEstimates CollinDiag;
   ods output CollinDiag = CollinReg;
quit;

The output from the COLLIN option is shown. I have added some colored rectangles to the output to emphasize how to interpret the table. To determine collinearity from the output, do the following:

  • Look at the "Condition Index" column. Large values in this column indicate potential collinearities. Many authors use 30 as a number that warrants further investigation. Other researchers suggest 100. Most researchers agree that no single number can handle all situations.
  • For each row that has a large condition index, look across the columns in the "Proportion of Variation" section of the table. Identify cells that have a value of 0.5 or greater. The columns of these cells indicate which variables contribute to the collinearity. Notice that at least two variables are involved in each collinearity, so look for at least two cells with large values in each row. However, there could be three or more cells that have large values. "Large" is relative to the value 1, which is the sum of each column.

Let's apply these rules to the output for the example (a programmatic version of these checks is sketched after the list):

  • If you use 30 as a cutoff value, there are three rows (marked in red) whose condition numbers exceed the cutoff value. They are rows 5, 6, and 7.
  • For the 5th row (condition index=33.8), there are no cells that exceed 0.5. The two largest cells (in the Weight and RestPulse columns) indicate a small near-collinearity between the Weight and RestPulse measurements. The relationship is not strong enough to worry about.
  • For the 6th row (condition index=82.6), there are two cells that are 0.5 or greater (rounded to four decimals). The cells are in the Intercept and Age columns. This indicates that the Age and Intercept terms are nearly collinear. Collinearities with the intercept term can be hard to interpret. See the comments at the end of this article.
  • For the 7th row (condition index=196.8), there are two cells that are greater than 0.5. The cells are in the RunPulse and MaxPulse columns, which indicates a very strong linear relationship between these two variables.
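If you prefer to apply these rules programmatically, the following sketch scans the CollinReg table that was created with the ODS OUTPUT statement above. It assumes that the output data set contains a CondIndex column and one proportion-of-variation column per term (Intercept, RunTime, Age, and so on); run PROC CONTENTS on CollinReg to confirm the column names for your release.

/* Sketch: flag rows with a large condition index and list the terms whose
   proportion of variation is 0.5 or greater (column names are assumptions) */
data collin_flags;
   set CollinReg;
   where CondIndex > 30;
   array props {*} Intercept RunTime Age Weight RunPulse MaxPulse RestPulse;
   length Culprits $200;
   do j = 1 to dim(props);
      if props[j] >= 0.5 then Culprits = catx(' ', Culprits, vname(props[j]));
   end;
   drop j;
run;
 
proc print data=collin_flags noobs;
   var CondIndex Culprits;
run;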

Your model has collinearities. Now what?

After you identify the cause of the collinearities, what should you do? That is a difficult and controversial question that has many possible answers.

  • Perhaps the simplest solution is to use domain knowledge to omit the "redundant" variables. For example, you might want to drop MaxPulse from the model and refit. However, in this era of Big Data and machine learning, some analysts want an automated solution.
  • You can use dimensionality reduction and an (incomplete) principal component regression.
  • You can use a biased estimation technique such as ridge regression, which allows bias but reduces the variance of the estimates (a minimal sketch follows this list).
  • Some practitioners use variable selection techniques to let the data decide which variables to omit from the model. However, be aware that different variable-selection methods might choose different variables from among the set of nearly collinear variables.
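As an illustration of the ridge option mentioned above, here is a minimal sketch that requests a sequence of ridge estimates for the fitness data by using the RIDGE= option in PROC REG and writes them to an OUTEST= data set. The grid of ridge parameters and the data set name RidgeEst are arbitrary choices; selecting a ridge parameter is its own topic.

/* Sketch: ridge estimates for the fitness data; the ridge grid is arbitrary */
proc reg data=fitness outest=RidgeEst ridge=0 to 0.1 by 0.01;
   model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse;
quit;
 
proc print data=RidgeEst(where=(_TYPE_="RIDGE")) noobs;
   var _RIDGE_ Intercept RunTime Age Weight RunPulse MaxPulse RestPulse;
run;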

Keep the intercept or not?

Equally controversial is the question of whether to include the intercept term in a collinearity diagnostic. The COLLIN option in PROC REG includes the intercept term among the variables to be analyzed for collinearity. The COLLINOINT option excludes the intercept term. Which should you use? Here are two opinions that I found:

  1. Use the intercept term: Belsley, Kuh, and Welsch (Regression Diagnostics, 1980, p. 120) state that omitting the intercept term is "inappropriate in the event that [the design matrix] contains a constant column." They go on to say (p. 157) that "centering the data [when the model has an intercept term] can mask the role of the constant in any near dependencies and produce misleading diagnostic results." These quotes strongly favor using the COLLIN option when the model contains an intercept term.
  2. Do not use the intercept term if it is outside of the data range: Freund and Littell (SAS System for Regression, 3rd Ed, 2000) argue that including the intercept term in the collinearity analysis is not always appropriate. "The intercept is an estimate of the response at the origin, that is, where all independent variables are zero. .... [F]or most applications the intercept represents an extrapolation far beyond the reach of the data. For this reason, the inclusion of the intercept in the study of multicollinearity can be useful only if the intercept has some physical interpretation and is within reach of the actual data space." For the example data, it is impossible for a person to have zero age, weight, or pulse rate; therefore, I suspect Freund and Littell would recommend using the COLLINOINT option instead of the COLLIN option for these data.

So what should you do if the experts disagree? I usually defer to the math. Although I am reluctant to contradict Freund and Littell (both widely published experts and Fellows of the American Statistical Association), the mathematics of the collinearity analysis (which I will discuss in a separate article) seems to favor the opinion of Belsley, Kuh, and Welsch. I use the COLLIN option when my model includes an intercept term.

Do you have an opinion on this matter? Leave a comment.

The post Collinearity in regression: The COLLIN option in PROC REG appeared first on The DO Loop.

January 21, 2020
 

One great thing about being a SAS programmer is that you never run out of new things to learn. SAS often gives us a variety of methods to produce the same result. One good example of this is the DATA step and PROC SQL, both of which manipulate data. The DATA step is extremely powerful and flexible, but PROC SQL has its advantages too. Until recently, my knowledge of PROC SQL was pretty limited. But for the sixth edition of The Little SAS Book, we decided to move the discussion of PROC SQL from an appendix (who reads appendices?) to the body of the book. This gave me an opportunity to learn more about PROC SQL.

When developing my programs, I often find myself needing to calculate the mean (or sum, or median, or whatever) of a variable, and then merge that result back into my SAS data set. That would generally involve at least a couple PROC steps and a DATA step, but using PROC SQL I can achieve the same result all in one step.

Example

Consider this example using the Cars data set in the SASHELP library. Among other things, the data set contains the 2004 MSRP for over 400 models of cars of various makes and types. Suppose you want a data set that contains the make, model, type, and MSRP for each model, along with the median MSRP for all cars of the same make. In addition, you would like a variable that is the difference between the MSRP for that model and the median MSRP for all models of the same make. Here is the PROC SQL code that will create a SAS data set, MedianMSRP, with the desired result:

*Create summary variable for median MSRP by Make;

PROC SQL;
   CREATE TABLE MedianMSRP AS
   SELECT Make, Model, Type, MSRP,
          MEDIAN(MSRP) AS MedMSRP,
          (MSRP - MEDIAN(MSRP)) AS MSRP_VS_Median
   FROM sashelp.cars
   GROUP BY Make;
QUIT;


 

The CREATE TABLE clause simply names the SAS data set to create, while the FROM clause names the SAS data set to read. The SELECT clause lists the variables to keep from the old data set (Make, Model, Type, and MSRP) along with specifications for the new summary variables. The new variable, MedMSRP, is the median of the old MSRP variable, while the new variable MSRP_VS_Median is the MSRP minus the median MSRP. The GROUP BY clause tells SAS to do the calculations within each value of the variable Make. If you leave off the GROUP BY clause, then the calculations would be done over the entire data set. When you run this code, you will get the following message in your SAS log telling you it is doing exactly what you wanted it to do:

NOTE: The query requires remerging summary statistics back with the original data.
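For comparison, here is a sketch of the multi-step approach that the single PROC SQL step replaces. It uses PROC MEANS to compute the median MSRP for each make and then merges that summary back onto the detail rows. The intermediate data set names (MakeMedian, Cars, MedianMSRP2) are arbitrary.

*Traditional approach: summarize, sort, then merge the summary back on;
PROC MEANS DATA = sashelp.cars NOPRINT NWAY;
   CLASS Make;
   VAR MSRP;
   OUTPUT OUT = MakeMedian (DROP = _TYPE_ _FREQ_) MEDIAN = MedMSRP;
RUN;
 
PROC SORT DATA = sashelp.cars OUT = Cars;
   BY Make;
RUN;
 
DATA MedianMSRP2;
   MERGE Cars (KEEP = Make Model Type MSRP) MakeMedian;
   BY Make;
   MSRP_VS_Median = MSRP - MedMSRP;
RUN;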

The following PROC PRINT produces a report showing just the observations for two makes – Porsche and Jeep.

PROC PRINT DATA = MedianMSRP;
  TITLE '2004 Car Prices';
  WHERE Make IN ('Porsche','Jeep');
  FORMAT MedMSRP MSRP_VS_Median DOLLAR8.0;
RUN;

Results

Here are the results:

Now PROC SQL aficionados will tell you that if all you want is a report and you don’t need to create a SAS data set, then you can do it all in just the PROC SQL step. But that is the topic for another blog!
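As a quick preview of that idea, omitting the CREATE TABLE clause makes PROC SQL print the query result as a report instead of creating a data set. A minimal sketch (formats and the WHERE clause mirror the PROC PRINT example above):

PROC SQL;
   TITLE '2004 Car Prices';
   SELECT Make, Model, Type, MSRP FORMAT=DOLLAR8.,
          MEDIAN(MSRP) AS MedMSRP FORMAT=DOLLAR8.,
          (MSRP - MEDIAN(MSRP)) AS MSRP_VS_Median FORMAT=DOLLAR8.
   FROM sashelp.cars
   WHERE Make IN ('Porsche','Jeep')
   GROUP BY Make;
QUIT;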

 

Expand Your SAS Knowledge by Learning PROC SQL was published on SAS Users.

January 21, 2020
 

AI innovations range from unique modelling techniques and computer vision efforts to medical diagnostic tools and self-driving cars. Within that wide range of technology, what should you consider patenting? Which of your discoveries are true intellectual property that necessitate protection? A little thought upfront could help you to know what [...]

Protecting your intellectual property in the brave new world of AI and machine learning was published on SAS Voices by Claudio Broggio

January 20, 2020
 

From the early days of probability and statistics, researchers have tried to organize and categorize parametric probability distributions. For example, Pearson (1895, 1901, and 1916) developed a system of seven distributions, which was later called the Pearson system. The main idea behind a "system" of distributions is that for each feasible combination of skewness and kurtosis, there is a member of the system that achieves the combination. A smaller system is the Johnson system (Johnson, 1949), which contains only four distributions: the normal distribution, the lognormal distribution, the SB distribution (which models bounded distributions), and the SU distribution (which models unbounded distributions).

Every member of the Johnson system can be transformed into a normal distribution by using elementary functions. Because the transformation is invertible, it is also true that every member of the Johnson system is a transformation of a normal distribution.

The Johnson SB and SU distributions have four parameters and are extremely flexible, which means that they can fit a wide range of distribution shapes. As John von Neumann once said, "With four parameters, I can fit an elephant and with five I can make him wiggle his trunk." As I indicated in an article about the moment-ratio diagram, if you specify the skewness and kurtosis of a distribution, you can find a unique member of the Johnson system that has the same skewness and kurtosis.

This article shows how to compute the four essential functions for the Johnson SB distribution: the PDF, the CDF, the quantile function, and a method to generate random variates. It also shows how to fit the parameters of the distribution to data by using PROC UNIVARIATE in SAS.

What is the Johnson SB distribution?

The SB distribution contains a threshold parameter, θ, and a scale parameter, σ > 0. The SB distribution has positive density (support) on the interval (θ, θ + σ). The two shape parameters for the SB distribution are called δ and γ in the SAS documentation for PROC UNIVARIATE. By convention, the parameter δ is taken to be positive (δ > 0).

If X is a random variable from the Johnson SB distribution, then the variable
Z = γ + δ log( (X-θ) / (θ + σ - X) )
is standard normal. Because the transformation to normality is invertible, you can use properties of the normal distribution to understand properties of the SB distribution. The relationship also enables you to generate random variates of the SB distribution from random normal variates. To be explicit, define Y = (Z-γ) / δ, where Z ∼ N(0,1). Then the following transformation defines a Johnson SB random variable:
X = ( σ exp(Y) + θ (exp(Y) + 1) ) / (exp(Y) + 1)

Interestingly, if σ=1 and θ=0, then this is the logistic transformation. So one way to get a member of the Johnson system is as a logistic transformation of a normal random variable.

Generate random variates from the SB distribution

The relationship between the SB and the normal distribution provides an easy way to generate random variates from the SB distribution. You simply generate a normal variate and then transform it, as follows:

/* Johnson SB(threshold=theta, scale=sigma, shape=gamma, shape=delta) */
data SB(keep= X);
call streaminit(1);
theta = 0;   sigma = 1;   gamma = -1.1;    delta = 1.5;
do i = 1 to 1000;
   Y = (rand("Normal")-gamma) / delta;
   expY = exp(Y);
   X = ( sigma*expY + theta*(expY + 1) ) / (expY + 1);
   output;
end;
run;
 
/* set bins: https://blogs.sas.com/content/iml/2014/08/25/bins-for-histograms.html */
proc univariate data=SB;
   histogram X / SB(theta=0, sigma=1, gamma=-1.1, delta=1.5)
                 endpoints=(0 to 1 by 0.05);
   ods select Histogram;
run;

I used PROC UNIVARIATE in SAS to overlay the SB density on the histogram of the random variates. The result follows:

Density curve for a Johnson SB distribution overlaid on a histogram of bounded data

The PDF of the SB distribution

The formula for the SB density function is given in the PROC UNIVARIATE documentation (set h = v = 1 in the formula). You can evaluate the probability density function (PDF) on the interval (θ, θ + σ) in order to visualize the density function. So that you can easily compare various shape parameters, the following examples use θ=0 and σ=1 and plot the density on the interval (0, 1). The θ parameter translates the density curve and the σ parameter scales it, but these parameters do not change the shape of the curve. Therefore, only the two shape parameters are varied in the following program, which generates the PDF of the SB distribution for four combinations of parameters.

/* Visualize a few SB distributions */
data SB;
array _theta[4] _temporary_ (0 0 0 0);
array _sigma[4] _temporary_ (1 1 1 1);
array _gamma[4] _temporary_ (-1.1 -1.1  0.5  0.5);
array _delta[4] _temporary_ ( 1.5  0.8  0.8  0.05 );
sqrt2pi = sqrt(2*constant('pi'));
do i = 1 to dim(_theta);
   theta=_theta[i]; sigma=_sigma[i]; gamma=_gamma[i]; delta=_delta[i]; 
   Params = catt("gamma=", gamma, "; delta=", delta);
   do x = theta+0.005 to theta+sigma-0.005 by 0.005;   /* avoid the endpoints, where the density formula divides by zero */
      c = delta / (sigma * sqrt2pi);
      w = (x-theta) / sigma;
      t = (x - theta) / (theta + sigma - x);
      PDF = c / (w*(1-w)) * exp( -0.5*(gamma + delta*log(t))**2 );
      output;
   end;
end;
drop c w t sqrt2pi;
run;
 
title "The Johnson SB Probability Density";
title2 "theta=0; sigma=1";
proc sgplot data=SB;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   xaxis grid offsetmin=0 offsetmax=0;  yaxis grid;
run;
Density curves for four Johnson SB distributions

The graph shows that the SB distribution supports a variety of shapes. The blue curve (γ=–1.1, δ=1.5) and the red curve (γ=–1.1, δ=0.8) both exhibit negative skewness, whereas the green curve (γ=0.5, δ=0.8) has positive skewness. Those three curves are unimodal, whereas the brown curve (γ=0.5, δ=0.05) is a U-shaped distribution. These curves might remind you of the beta distribution, which is a more familiar family of bounded distributions.

The CDF and quantiles of the SB distribution

Because the transformation to normality is monotone on the support, the cumulative distribution function (CDF) and quantile function of the SB distribution do not require numerical integration or root finding. The CDF at a point x in (θ, θ + σ) is the standard normal CDF evaluated at z = γ + δ log( (x-θ) / (θ + σ - x) ). To compute the quantile for a probability p, let z be the standard normal quantile of p, set y = (z-γ) / δ, and apply the inverse transformation x = ( σ exp(y) + θ (exp(y) + 1) ) / (exp(y) + 1) from the earlier section.
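Here is a short sketch (not part of the original example) that carries out both computations in a DATA step by using the CDF and QUANTILE functions for the normal distribution. The parameter values match the earlier simulation; the point x and the probability p are arbitrary choices.

/* Sketch: CDF and quantile of the SB distribution via the normal transformation */
data SB_CDF_Quantile;
   theta = 0;  sigma = 1;  gamma = -1.1;  delta = 1.5;
   x = 0.4;                                  /* a point in (theta, theta+sigma) */
   Fx = cdf("Normal", gamma + delta*log( (x-theta)/(theta+sigma-x) ));
   p = 0.95;                                 /* a probability */
   y = (quantile("Normal", p) - gamma) / delta;
   Qp = theta + sigma*exp(y)/(1 + exp(y));   /* the p_th quantile */
   put Fx= Qp=;
run;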

Fitting parameters of the SB distribution

If you have data, you can use PROC UNIVARIATE to estimate the four parameters of the SB distribution that fit the data. In practice, the threshold (and sometimes the scale) is often known from domain-specific knowledge. For example, if you are modeling student test scores, you might know that the scores are always in the range [0, 100], so θ=0 and σ=100.

In an earlier section, we simulated a data set for which all values are in the interval [0, 1]. The following call to PROC UNIVARIATE estimates the shape parameters for these simulated data:

proc univariate data=SB;
   histogram X / SB(theta=0, sigma=1, fitmethod=moments) 
                 endpoints=(0 to 1 by 0.05);
run;
Fit parameters to the Johnson SB distribution by using PROC UNIVARIATE in SAS

By default, the HISTOGRAM statement uses the FITMETHOD=PERCENTILE method to fit the data. You can also use FITMETHOD=MLE to obtain a maximum likelihood estimate, or FITMETHOD=MOMENTS to estimate the parameters by matching the sample moments. In this example, I show how to use the FITMETHOD=MOMENTS option, which, for these data, seems to provide a better fit than the method of percentiles.

Summary

The Johnson system is a four-parameter system that contains four families of distributions. If you choose any feasible combination of skewness and kurtosis, you can find a member of the Johnson system that has that same skewness and kurtosis. The SB distribution is a family that models bounded distributions. This article shows how to simulate random values from the SB distribution and how to visualize the probability density function (PDF). It also shows how to use PROC UNIVARIATE to fit parameters of the SB distribution to data.

The post The Johnson SB distribution appeared first on The DO Loop.

January 17, 2020
 

Where in your business process can analytics and AI play a contributing role in enhancing your decision making capability?  At the information interpretation stage.  As a framework for understanding what analytic and AI opportunities may arise, the simple diagram below illustrates the relationships between data, information and knowledge, and how [...]

Opportunities for analytics: Interpretation was published on SAS Voices by Leo Sadovy

January 16, 2020
 

This blog is part of a series on SAS Visual Data Mining and Machine Learning (VDMML).

If you're new to SAS VDMML and you want a brief overview of the features available, check out my last blog post!

This blog will discuss types of missing data and how to use imputation in SAS VDMML to improve your predictions.

 
Imputation is an important aspect of data preprocessing that has the potential to make (or break) your model. The goal of imputation is to replace missing values with values that are close to what the missing value might have been. If successful, you will have more data with which the model can learn. However, you could introduce serious bias and make learning more difficult.

missing data

When considering imputation, you have two main tasks:

  • Try to identify the missing data mechanism (i.e. what caused your data to be missing)
  • Determine which method of imputation to use

Why is your data missing?

Identifying exactly why your data is missing is often a complex task, as the reason may not be apparent or easily identified. However, determining whether there is a pattern to the missingness is crucial to ensuring that your model is as unbiased as possible, and that you have enough data to work with.

One reason imputation may be used is that many machine learning models cannot use missing values and only do complete case analysis, meaning that any observation with missing values will be excluded.

Why is this a problem? Well, look at the percentage of missing data in this dataset...

Percentage of missing data for columns in HMEQ dataset

View the percentage of missing values for each variable in the Data tab within Model Studio.  Two variables are highlighted to demonstrate a high percentage of missingness (DEBTINC - 21%, DEROG - 11%).

If we use complete case analysis with this data, we will lose at least 21% of our observations! Not good.

Missing data can be classified in 3 types:

  • missing completely at random (MCAR),
  • missing at random (MAR),
  • and missing not at random (MNAR).

Missing completely at random (MCAR) indicates the missingness is unrelated to any of the variables in the analysis -- a rare case for missing data. This means that neither the values that are missing nor the other variables associated with those observations are related to the missingness. In this case, complete case analysis should not produce a biased model. However, if you are concerned about losing data, imputing the missing values is still a valid approach.

Data can also be missing at random (MAR), where the missingness does not depend on the unobserved value itself but is related to other variables in the analysis. For example, let’s say a survey was sent out in the mail asking participants about their activity levels. Then, a portion of the surveys from a certain region were lost in the mail. In this case, the missingness is not completely random (it depends on region), but it is unrelated to the variable we are interested in, activity level -- so the data are missing at random.

Dramatization of potentially affected survey participants

An important consideration is that even though the missingness is unrelated to activity level, the data could be biased if participants in that region are particularly fit or lazy. Therefore, if we had data from other people in the affected region, we may be able to use that information in how we impute the data. However, imputing MAR data will bias the results; the extent of the bias depends on your data and on the method of imputation.

In some cases, the data is missing not at random (MNAR) and the missingness could provide information that is useful to prediction. MNAR data is non-ignorable and must be addressed or your model will be biased.  For example, if a survey asks a sensitive question about a person's weight, a higher proportion of people with a heavier weight may withhold that information -- creating bias within the missing data. One possible solution is to create another level or label for the missing value.
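As a small illustration of that last idea, the following sketch creates a missingness flag for a numeric variable and an explicit "Missing" level for a character variable. The library reference (sampsio.hmeq) and the variable names follow the HMEQ example used earlier but are assumptions; adjust them for your data.

/* Sketch: keep the information that a value was missing (useful for MNAR data) */
data hmeq_flagged;
   set sampsio.hmeq;                      /* assumption: the HMEQ sample data */
   miss_debtinc = missing(DEBTINC);       /* 1 if DEBTINC is missing, else 0 */
   length JobCat $12;
   JobCat = coalescec(JOB, "Missing");    /* explicit level for a missing category */
run;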

The bottom line is that whether you should use imputation for missing values depends entirely on your data and what the missing values represent. Be diligent!

If you have determined that imputation may be beneficial for your data, there are several options to choose from. I will not be going into detail on the types of imputation available as the type you use will be highly dependent on your data and goals, but I have listed a few examples below to give you a glimpse of the possibilities.

Examples of types of imputation (a simple programmatic sketch follows this list):

  • Mean, Median
  • Regression
  • Hot-deck
  • Last Observation Carried Forward (LOCF - for longitudinal data)
  • Multiple Imputation
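Outside of Model Studio, you can also impute programmatically in SAS code. For example, PROC STDIZE (in SAS/STAT) with the REPONLY option replaces only the missing values with a location statistic such as the mean or median. A minimal sketch; the data set and variable names follow the HMEQ example and are assumptions:

/* Sketch: median imputation in SAS code with PROC STDIZE and REPONLY */
proc stdize data=sampsio.hmeq out=hmeq_imputed method=median reponly;
   var DEBTINC DEROG;    /* replace missing values with each variable's median */
run;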

How to Impute in SAS VDMML

Now, I will show an example of Imputation in SAS Viya - Model Studio. (SAS VDMML)

First, I created a new project, selecting the Data Mining and Machine Learning project type and the default template.

On the Data tab, I specify how I want certain features to be imputed. This does not do the imputation for us! Think of the specifications on the Data tab as a blueprint for how you want the machine learning pipeline to handle the data.

To start, I selected the CLAGE feature and changed the imputation to "Mean" (see the demonstration below). Then, I selected DELINQ and changed the imputation to "Median". For NINQ, I selected "Minimum" as the imputed value. For DEBTINC and YOJ, I set the imputation to custom constant values (50 and 100, respectively).

Selecting to impute the mean value for the CLAGE variable

Now, I will go over to the Pipeline tab, and add an imputation node (this is where the work gets done!)

Adding an Imputation node to the machine learning pipeline

Selecting the imputation node, I view the options and change the default imputation operations to None, because we only want the previously specified variables to be imputed. However, if you run the default imputation, SAS VDMML automatically imputes missing values of Class (categorical) variables with the Median and Interval (numerical) variables with the Mean.

Updating default imputation values to none for Categorical and Interval variables

Looking at the results, we can see what variables imputation was performed on, what value was used, and how many values for each variable were replaced.

Results of running Imputation node on data

And that's how easy it is to impute variables in SAS Model Studio!

Interested in checking out SAS Viya?

Machine Learning Using SAS Viya is a course that teaches machine learning basics, gives instruction on using SAS Viya VDMML, and provides access to the SAS Viya for Learners software, all for $79. This course is the prerequisite for the SAS Certified Specialist in Machine Learning certification. Going through the course myself, I was able to quickly learn how to use SAS VDMML and received a refresher on many data preprocessing tactics and machine learning concepts.

Want to learn more? 

Stay tuned; I will be posting more blogs with examples of specific features in SAS VDMML. If there are any specific features you would like to know more about, leave a comment below!

Missing data? Quickly impute values using SAS Viya was published on SAS Users.

January 16, 2020
 

Using Customer Lifetime Value in your business decision making is often crucial for success. Customer-centric businesses often spend thousands of dollars acquiring new customers, “on-boarding” new customers, and retaining those customers. If your business margins are thin, then it can often be months or quarters before you start to turn a profit on a particular customer. Additionally, some business models segment the worth of their customers into categories and give different levels of service to the “higher worth” customers. The metric most often used for that is called Customer Lifetime Value (CLV). CLV is simply a balance-sheet look at the total cost spent versus the total revenue earned over a customer’s projected tenure or “life.”
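To make the metric concrete, here is a minimal sketch of a simple CLV calculation. The input table (customers) and the per-customer fields (monthly revenue, monthly cost, expected tenure, acquisition cost) are hypothetical, and real CLV models typically also discount future margin.

/* Sketch: a simple per-customer CLV calculation (all names are hypothetical) */
data customer_clv;
   set customers;                       /* hypothetical per-customer table */
   monthly_margin = monthly_revenue - monthly_cost;
   clv = monthly_margin * expected_tenure_months - acquisition_cost;
run;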

In this blog, we will focus on how a business analyst can build a functional analytical dashboard for a fictional company that is seeing its revenue, margins, and a customer’s lifetime value decrease and what steps they can take to correct that.

We will cover 3 main areas of interest:

  1. First, screenshots of SAS Visual Analytics reports that use Customer Lifetime Value, and how you can replicate them.
  2. Next, we will look at the modeling that we did in the report, with explanations of how we used the results in subsequent modeling.
  3. Lastly, we talk about one example of how we scored and deployed the model, and how you can do the same.

Throughout this blog, I will also highlight areas where SAS augments our software with artificial intelligence to improve your experience.

1. State of the company

First, we will look at the state of the company using the dashboard and take note of any problems.

Our dashboard shows the revenue of our company over the last two years as well as a forecast for the next 6 months. We see that revenue has been on the decline in recent years and churns have been erratically climbing higher.

Our total annual revenue was 112M last year with just over 5,000 customers churning.

So far this year, our revenue is tracking low and sits at only 88M, but the bad news is that we have already tripled last year's churn total.

If these trends continue, we stand to lose a third of our revenue!

2. The problems

Now, let’s investigate as to where the problems are and what can be done about them.

If we look at our current metrics, we can see some interesting points worth investigating.

The butterfly chart on the right shows movement between customer loyalty tiers within each region of the country with the number of upgrades (on the right) and downgrades (on the left).

The vector plots show us information over multiple dimensions. These show us the difference between two values and the direction it is heading. For example, on the left, we see that Revenue is pointed downward while churns (x axis) are increasing.

The vector plot on the right shows us the change in margin from year to year as well as the customer lifetime value.

What’s interesting here is that there are two arrows that are pointing up, indicating a rise in customer lifetime value. Indeed, if we were to click on the map, we would see that these two regions are the same two that have a net increase in Loyalty Tier.

This leads me to believe that a customer’s tier is predictive of margin. Let’s investigate it further.

3. Automated Analysis

We will use the Automated Analysis feature within Visual Analytics to quickly give us the drivers of CLV.

This screenshot shows an analysis that SAS Visual Analytics (VA) performed for me automatically. I simply told VA which variable I was interested in analyzing and within a matter of seconds, it ran a series of decision trees to produce this summary. This is an example of how SAS is incorporating AI into our software to improve your experience.

Here we can see that loyalty tier is indeed the most important factor in determining projected annual margin (or CLV).

4. Influential driver

Once identified, the important driver will be explored across other dimensions to assess how influential this driver might be.

A cursory exploration of Loyalty Tier indicates that yes, loyalty tier, particularly Tier 5, has a major influence on revenue, order count, repeat orders, and margin.

5. CLV comparison models

We will create two competing models for CLV and compare them.

Here on our modeling page are two models that I’ve created to predict CLV. The first one is a Linear Regression and the second is a Gradient Boosting model. I've used Model Comparison to tell me that the Linear Regression model delivers a more accurate prediction and so I use the output of that model as input into a recommendation engine.

6. Recommendation engine

Based on our model learnings and the output of the model, we are going to build a recommendation engine to help us determine what to do with each customer.

Represented here, I built a recommendation engine model using the Factorization Machine algorithm.

Once we implement our model, customers are categorized more appropriately and we can see that it has had an impact on revenue and the number of accounts is back on the rise!

Conclusion

Even though Customer Lifetime Value has been around for years, it is still a valuable metric to utilize in modeling and recommendation engines, as we have seen. We used it in our automated analysis, discovered that it had an impact on revenue, modeled future values of CLV, and then incorporated those results into a recommendation engine that recommended new loyalty tiers for our customers. As a result, we saw positive changes in overall company revenue and churn.

To learn more, please check out these resources:

How to utilize Customer Lifetime Value with SAS Visual Analytics was published on SAS Users.