May 8, 2019

At SAS Global Forum 2019, Daymond Ling presented an interesting discussion of binary classifiers in the financial industry. The discussion is motivated by a practical question: If you deploy a predictive model, how can you assess whether the model is no longer working well and needs to be replaced?

Daymond discussed the following three criteria for choosing a model:

  1. Discrimination: The ability of the binary classifier to predict the class of a labeled observation. The area under an ROC curve is one measure of a binary model's discrimination power. In SAS, you can compute the ROC curve for any predictive model.
  2. Accuracy: The ability of the model to estimate the probability of an event. The calibration curve is a graphical indication of a model's accuracy. In SAS, you can compute a calibration curve manually, or you can use PROC LOGISTIC in SAS/STAT 15.1 to automatically compute a calibration curve.
  3. Stability: Point estimates are often used to choose a model, but you should be aware of the variability of the estimates. This is a basic concept in statistics: When choosing between two unbiased estimators, you should usually choose the one that has smaller variance. SAS procedures provide (asymptotic) standard errors for many statistics such as the area under an ROC curve. If you have reason to doubt the accuracy of an asymptotic estimate, you can use bootstrap methods in SAS to estimate the sampling distribution of the statistic. (A sketch that shows how to request the area and its standard error appears after this list.)
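To make the first and third criteria concrete, the following statements are a minimal sketch of one way to request these estimates in SAS. (The data set Have and the variables Y, X1, and X2 are hypothetical names.)

   /* Sketch (hypothetical names): area under the ROC curve with a 90% CI */
   proc logistic data=Have rocoptions(alpha=0.1);    /* 90% confidence limits */
      model Y(event='1') = X1 X2;
      roc;                                  /* ROC curve for the fitted model */
      ods output ROCAssociation = AreaEst;  /* Area, StdErr, and CI limits */
   run;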

Estimates of model stability

My article about comparing the ROC curves for predictive models contains two competing models: a model created by using PROC LOGISTIC and an "Expert model" that was constructed by asking domain experts for their opinions. (The source of the models is irrelevant; you can use any binary classifier.) You can download the SAS program that produces the following table, which estimates the area under each ROC curve, the standard error, and 90% confidence intervals:

The "Expert" model has a larger Area statistic and a smaller standard error, so you might choose to deploy it as a "champion model."

In his presentation, Daymond asked an important question. Suppose one month later you run the model on a new batch of labeled data and discover that the area under the ROC curve for the new data is only 0.73. Should you be concerned? Does this indicate that the model has degraded and is no longer suitable? Should you cast out this model, re-train all the models (at considerable time and expense), and deploy a new "champion"?

The answer depends on whether you think Area = 0.73 represents a degraded model or whether it can be attributed to sampling variability. The statistic 0.73 is barely more than 1 standard error away from the point estimate, and you will recall that 68% of a normal distribution is within one standard deviation of the mean. From that point of view, the value 0.73 is not surprising. Furthermore, the 90% confidence interval indicates that if you run this model every day for 100 days, you will probably encounter statistics lower than 0.68 merely due to sampling variability. In other words, a solitary low score might not indicate that the model is no longer valid.

Bootstrap estimates of model stability

If "asymptotic normality" makes you nervous, you can use the bootstrap method to obtain estimates of the standard error and the distribution of the Area statistic. The following table summarizes the results of 5,000 bootstrap replications. The results are very close to the asymptotic results in the previous table. In particular, the standard error of the Area statistic is estimated as 0.08 and in 90% of the bootstrap samples, the Area was in the interval [0.676, 0.983]. The conclusion from the bootstrap computation is the same as for the asymptotic estimates: you should expect the Area statistic to bounce around. A value such as 0.73 is not unusual and does not necessarily indicate that the model has degraded.
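For reference, here is a minimal sketch of one way to carry out such a bootstrap in SAS. The data set Have, the variable names, and the seed are hypothetical; PROC SURVEYSELECT generates the bootstrap samples and BY-group processing computes the Area statistic for each sample by refitting the model:

   /* Sketch of a bootstrap of the Area statistic (hypothetical names) */
   proc surveyselect data=Have out=BootSamp seed=12345
                     method=urs samprate=1 reps=5000 outhits; /* resample with replacement */
   run;

   ods select none;                        /* suppress displayed output */
   proc logistic data=BootSamp;
      by Replicate;                        /* fit each bootstrap sample */
      model Y(event='1') = X1 X2;
      ods output Association = Assoc;      /* the row Label2='c' contains the area */
   run;
   ods select all;

   /* the std dev and the 5th/95th percentiles estimate the SE and a 90% interval */
   proc means data=Assoc(where=(Label2='c')) N mean std p5 p95;
      var nValue2;
   run;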

You can use the bootstrap computations to graphically reveal the stability of the two models. The following comparative histogram shows the bootstrap distributions of the Area statistic for the "Expert" and "Logistic" models. You can see that not only is the upper distribution shifted to the right, but it has less variance and therefore greater stability.

I think Daymond's main points are important to remember. Namely, discrimination and accuracy are important for choosing a model, but understanding the stability of the model (the variation of the estimates) is essential for determining when a model is no longer working well and should be replaced. There is no need to replace a model for a "bad score" if that score is within the range of typical statistical variation.

References

Ling, D. (2019), "Measuring Model Stability", Proceedings of the SAS Global Forum 2019 Conference.

Download the complete SAS program that creates the analyses and graphs in this article.

The post Discrimination, accuracy, and stability in binary classifiers appeared first on The DO Loop.

May 8, 2019

I just spent four inspiring days talking to customers about the many ways they are putting analytics into action in their organizations.  From computer vision models that interpret medical images to natural language processing models that analyze supply chain records, SAS users are doing ground-breaking work with analytics and AI. [...]

6 must reads following our biggest event of the year was published on SAS Voices by Oliver Schabenberger

May 6, 2019

App security is top of mind for just about everybody – users, IT folks, business executives. Rightfully so. Mobile apps and the devices on which they reside tend to travel around, unconstrained by the physical boundaries that confine traditional desktop computers.

In chatting with folks who are evaluating the SAS Visual Analytics app for their mobile devices, the conversation eventually turns to security, and the big question comes up:

How is this app secure?

Great question! Here’s a whirlwind tour of the security features that have been built into the SAS Visual Analytics app for Windows 10, Android, and iOS devices. The app is a young kid now, not a toddler anymore: it has been around for about six years. During that growth journey, the app has been beefed up with rock-solid features that address security for Visual Analytics reports viewed from mobile devices.

Before we take a look at the security features in the app, here are a few things you should know:

    • The app is free.
    • No license is needed to use the app.
    • You can download it anytime from the app store, and try out the sample reports in the app.
    • If you already have SAS Visual Analytics deployed in your organization, you can connect to your server, add reports to the app, and start interacting with your reports from your smartphone or tablet. The Help available in the app walks you through these steps.

Now, let’s get back to security for Visual Analytics reports on mobile devices. Here are five things that make the Visual Analytics app robust and secure on mobile devices.

    1. Device Whitelisting: If you want to connect to your SAS Visual Analytics server from the app, your administrator will “whitelist” your mobile device. Your device is first registered as a valid device that can connect to the Visual Analytics server. The whitelist affects devices, not users. If you happen to lose your mobile device, your administrator can remove the device from the whitelist and prevent access to the reports and data. The option to “blacklist” devices is also available.
    2. Cached Reports: After you add Visual Analytics reports to your app, if you don’t want the report data to remain with the report in the app, your administrator can enable the cached report feature. Data is downloaded only when you open and view the report on your mobile device. When you close the report, that data is removed from the device. For enhanced security, thumbnail images for report tiles in your app will not display for cached reports.
    3. Passcode: To prevent anyone other than yourself from opening the Visual Analytics app, you can set a 4-digit passcode for the app. There are two kinds of passcodes: required and optional. A required passcode is mandated by the server – when you connect to the server, you will create a passcode. Then, whenever you open the app or view a report from that server, you must enter the passcode. An optional passcode, on the other hand, is a passcode that you choose to use to lock up the app – it is not required to access the server; it is needed only to open the app. In addition, there are several features for passcode use that solidify security and access to the app: time-out, lock-out, and so forth. I’ll go over these features in an upcoming blog.
    4. SSL/HTTPS: If the Visual Analytics server is set up with SSL/HTTPS, the data viewed in the reports on your mobile device is encrypted.
    5. Offline: If you have been offline for a specified number of days, you must sign in to the server again. Until you do, the app does not download reports, update reports, or open reports for viewing.

Cached Reports

One of the security features we just talked about was the cached report feature. Here’s how cached report thumbnails are displayed in the Visual Analytics app on Windows 10, without any images.

When you tap the thumbnail for the cached report, data is immediately downloaded and the report opens in the app for viewing and interaction:

When you close this cached report in the app, the data is removed from the device and the cached report thumbnail displays in the app without any images.

Thanks for joining me on this whirlwind security tour of the SAS Visual Analytics app. Now you know the many different security mechanisms that are in place to protect your organization’s data and reports accessed from the mobile app.

Five key security features in the SAS Visual Analytics app was published on SAS Users.

May 6, 2019

Here's a simulation tip: When you simulate a fixed-effect generalized linear regression model, don't add a random normal error to the linear predictor. Only the response variable should be random. This tip applies to models that apply a link function to a linear predictor, including logistic regression, Poisson regression, and negative binomial regression.

Recall that a generalized linear model has three components:

  1. A linear predictor η = X β, which is a linear combination of the regressors.
  2. An invertible link function, g, which transforms the linear predictor to the expected value of the response variable. If μ = E(Y), then g(μ) = η, or μ = g⁻¹(η).
  3. A random component, which specifies the conditional distribution of the response variable, Y, given the values of the independent variables.

Notice that only the response variable is randomly generated. In a previous article about simulating data from a logistic regression model, I showed that the following SAS DATA step statements can be used to simulate data for a logistic regression model. The statements model a binary response variable, Y, which depends linearly on two explanatory variables X1 and X2:

   /* CORRECT way to simulate data from a logistic model with parameters (-2.7, -0.03, 0.07) */
   eta = -2.7 - 0.03*x1 + 0.07*x2;   /* linear predictor */
   mu = logistic(eta);               /* transform by inverse logit */
   y = rand("Bernoulli", mu);        /* simulate binary response with probability mu */
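For context, here is a minimal sketch of a complete DATA step that embeds those statements. The sample size, the seed, and the distributions of the regressors are hypothetical choices for illustration:

   /* Sketch of a complete simulation step. The sample size, seed, and
      regressor distributions are hypothetical. */
   data SimLogistic;
   call streaminit(54321);              /* set the random number seed */
   do i = 1 to 1000;
      x1 = 100*rand("Uniform");         /* hypothetical regressor */
      x2 = 100*rand("Uniform");         /* hypothetical regressor */
      eta = -2.7 - 0.03*x1 + 0.07*x2;   /* linear predictor */
      mu = logistic(eta);               /* transform by inverse logit */
      y = rand("Bernoulli", mu);        /* simulate binary response */
      output;
   end;
   run;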

Notice that the randomness occurs only during the last step when you generate the response variable. Sometimes I see a simulation program in which the programmer adds a random term to the linear predictor, as follows:

   /* WRONG way to simulate logistic data. This is a latent-variable model. */
   eta = -2.7 - 0.03*x1 + 0.07*x2 + RAND("NORMAL", 0, 0.8);   /* WRONG: Leads to a misspecified model! */
   ...

Perhaps the programmer copied these statements from a simulation of a linear regression model, but it is not correct for a fixed-effect generalized linear model. When you simulate data from a generalized linear model, use the first set of statements, not the second.

Why is the second statement wrong? Because it has too much randomness. The model that generates the data includes a latent (unobserved) variable. The model you are trying to simulate is specified in a SAS regression procedure as MODEL Y = X1 X2, but the model for the latent-variable simulation (the second one) should be MODEL Y = X1 X2 X3, where X3 is the unobserved normally distributed variable.

What happens if you add a random term to the linear predictor

I haven't figured out all the mathematical ramifications of (incorrectly) adding a random term to the linear predictor prior to applying the logistic transform, but I ran a simulation that shows that the latent-variable model leads to biased parameter estimates when you fit the simulated data.
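A convenient way to run such a study in SAS is to concatenate the simulated samples and fit them all by using BY-group processing. The following statements are a minimal sketch; the data set Sim100 and the identifier variable SampleID are hypothetical names:

   /* Sketch: fit the model to each simulated sample (hypothetical names) */
   ods select none;                          /* suppress displayed output */
   proc logistic data=Sim100;
      by SampleID;                           /* one fit per simulated sample */
      model y(event='1') = x1 x2;
      ods output ParameterEstimates = PE;    /* save all parameter estimates */
   run;
   ods select all;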

You can download the SAS program that generates data from two models: from the correct model (the first simulation steps) and from the latent-variable model (the second simulation). I generated 100 samples (each containing 5057 observations), then used PROC LOGISTIC to generate the resulting 100 sets of parameter estimates by using the statement MODEL Y = X1 X2. The results are shown in the following scatter plot matrix.

The blue markers are the parameter estimates from the correctly specified simulation. The reference lines in the upper right cells indicate the true values of the parameters in the simulation: (β0, β1, β2) = (-2.7, -0.03, 0.07). You can see that the true parameter values are in the center of the cloud of blue markers, which indicates that the parameter estimates are unbiased.

In contrast, the red markers show that the parameter estimates for the misspecified latent-variable model are biased. The simulated data does not come from the model that is being fit. This simulation used 0.8 for the standard deviation of the error term in the linear predictor. If you use a smaller value, the center of the red clouds will be closer to the true parameter values. If you use a larger value, the clouds will move farther apart.

For additional evidence that the data from the second simulation does not fit the model Y = X1 X2, the following graphs show the calibration plots for one data set from each simulation. The plot on the left shows nearly perfect calibration: this is not surprising because the data were simulated from the same model that is fitted! The plot on the right shows the calibration plot for the latent-variable model. It shows substantial deviations from a straight line, which indicates that the model is misspecified for the second set of data.

In summary, be careful when you simulate data for a generalized fixed-effect linear model. The randomness only appears during the last step when you simulate the response variable, conditional on the linear predictor. You should not add a random term to the linear predictor.

I'll leave you with a thought that is trivial but important: You can use the framework of the generalized linear model to simulate a linear regression model. For a linear model, the link function is the identity function and the response distribution is normal. That means that a linear model can be simulated by using the following:

   /* Alternative way to simulate a linear model with parameters (-2.7, -0.03, 0.07) */
   eta = -2.7 - 0.03*x1 + 0.07*x2;   /* linear predictor */
   mu = eta;                         /* identity link function */
   y = rand("Normal", mu, 0.7);      /* simulate Y as normal response with RMSE = 0.7 */

Thus simulating a linear model fits into the framework of simulating a generalized linear model, as it should!

Download the SAS program that generates the images in this article.

The post How to simulate data from a generalized linear model appeared first on The DO Loop.

May 1, 2019

SAS regression procedures support several parameterizations of classification variables. When a categorical variable is used as an explanatory variable in a regression model, the procedure generates dummy variables that are used to construct a design matrix for the model. The process of forming columns in a design matrix is called a parameterization or encoding. In SAS, most regression procedures use either the GLM encoding, the EFFECT encoding, or the REFERENCE encoding. This article summarizes the default and optional encodings for each regression procedure in SAS/STAT. In many SAS procedures, you can use the PARAM= option to change the default encoding.

The documentation section "Parameterization of Model Effects" provides a complete list of the encodings in SAS and shows how the design matrices are constructed from the levels. (The levels are the values of a classification variable.) Pasta (2005) gives examples and further discussion.

Default and optional encodings for SAS regression procedures

The following SAS regression procedures support the CLASS statement or a similar syntax. The columns GLM, REFERENCE, and EFFECT indicate the three most common encodings. The word "Default" indicates the default encoding. For procedures that support the PARAM= option, the PARAM= column indicates the supported encodings. The word All means that the procedure supports the complete list of SAS encodings. Most procedures default to using the GLM encoding; the exceptions are noted in the comments that follow the table.

   Procedure                  GLM      REFERENCE  EFFECT   PARAM=
   ADAPTIVEREG                Default
   ANOVA                      Default
   BGLIMM                     Default  Yes        Yes      GLM | EFFECT | REF
   CATMOD                                         Default
   FMM                        Default
   GAM                        Default
   GAMPL                      Default  Yes                 GLM | REF
   GEE                        Default
   GENMOD                     Default  Yes        Yes      All
   GLIMMIX                    Default
   GLM                        Default
   GLMSELECT                  Default  Yes        Yes      All
   HP regression procedures   Default  Yes                 GLM | REF
   HPMIXED                    Default
   ICPHREG                    Default  Yes        Yes      All
   LIFEREG                    Default
   LOGISTIC                   Yes      Yes        Default  All
   MIXED                      Default
   ORTHOREG                   Default  Yes        Yes      All
   PLS                        Default
   PROBIT                     Default
   PHREG                      Yes      Default    Yes      All
   QUANTLIFE                  Default
   QUANTREG                   Default
   QUANTSELECT                Default  Yes        Yes      All
   RMSTREG                    Default  Yes        Yes      All
   ROBUSTREG                  Default
   SURVEYLOGISTIC             Yes      Yes        Default  All
   SURVEYPHREG                Default  Yes        Yes      All
   SURVEYREG                  Default
   TRANSREG                   Yes      Default    Yes

A few comments:

  • The REFERENCE encoding is the default for PHREG and TRANSREG.
  • The EFFECT encoding is the default for CATMOD, LOGISTIC, and SURVEYLOGISTIC.
  • The HP regression procedures all use the GLM encoding by default and support only PARAM=GLM or PARAM=REF. The HP regression procedures include HPFMM, HPGENSELECT, HPLMIXED, HPLOGISTIC, HPNLMOD, HPPLS, HPQUANTSELECT, and HPREG. In spite of its name, GAMPL is also an HP procedure. In spite of its name, HPMIXED is NOT an HP procedure!
  • PROC LOGISTIC and PROC HPLOGISTIC use different default encodings.
  • CATMOD does not have a CLASS statement because all variables are assumed to be categorical.
  • PROC TRANSREG does not support a CLASS statement. Instead, it uses a CLASS() transformation list. It uses different syntax to support parameter encodings.
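As a reminder of the syntax, the following statements are a minimal sketch (with hypothetical data set and variable names) that overrides the default EFFECT encoding in PROC LOGISTIC and requests reference-cell coding with a specified reference level:

   /* Sketch (hypothetical names): use the PARAM= and REF= options to
      override the default EFFECT encoding in PROC LOGISTIC */
   proc logistic data=Have;
      class Treatment(ref='Placebo') / param=reference;
      model Y(event='1') = Treatment X;
   run;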

How to interpret main effects for the SAS encodings

The GLM parameterization is a singular parameterization. The other encodings are nonsingular. The "Other Parameterizations" section of the documentation gives a simple one-sentence summary of how to interpret the parameter estimates for the main effects in each encoding:

  • The GLM encoding estimates the difference in the effect of each level compared to the reference level. You can use the REF= option to specify the reference level. By default, the reference level is the last ordered level. The design matrix for the GLM encoding is singular.
  • The REFERENCE encoding estimates the difference in the effect of each nonreference level compared to the effect of the reference level. You can use the REF= option to specify the reference level. By default, the reference level is the last ordered level. Notice that the REFERENCE encoding gives the same interpretation as the GLM encoding. The difference is that the design matrix for the REFERENCE encoding excludes the column for the reference level, so the design matrix for the REFERENCE encoding is (usually) nonsingular.
  • The EFFECT encoding estimates the difference in the effect of each nonreference level compared to the average effect over all levels. (The example after this list shows the design columns that each encoding generates.)
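For example, for a classification variable that has the three levels A, B, and C (with C as the reference level), the three encodings generate the following design columns for the main effect:

   Level   GLM        REFERENCE   EFFECT
   A       1 0 0      1 0          1  0
   B       0 1 0      0 1          0  1
   C       0 0 1      0 0         -1 -1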

This article lists the encodings that are supported for each SAS regression procedure. I hope you will find it to be a useful reference. If I've missed your favorite regression procedure, let me know in the comments.

The post Encodings of CLASS variables in SAS regression procedures: A cheat sheet appeared first on The DO Loop.

May 1, 2019

SAS has one of the highest percentages of women working in the technology industry. And yet, a persistent gender gap in technology is cause for concern as the number of women seeking degrees in computing continues to shrink. Why is that? Kicking off day three of SAS Global Forum, SAS [...]

Closing the gender gap in technology was published on SAS Voices by Shannon Heath