May 19, 2018
 

How to change your working directory for SAS

Regardless of the environment in which you run SAS (whether it is SAS® Foundation, SAS® Studio, or SAS® Enterprise Guide®), SAS uses a default location on your host system as a working directory. Unless your code specifies a different directory, this default location is where SAS stores output.

Beginning with SAS® 9.4 TS1M4, you can use a new DATA step function, DLGCDIR, to change the location for your working directory. You can use this function in Microsoft Windows or UNIX/Linux environments.

Make sure that any directory that you specify with the DLGCDIR function is an existing directory that you have Write or Update access to.

Finding Out What Your Current Directory Is

To determine what your current working directory in SAS is, submit the following code:

   data _null_;
      rc=dlgcdir();
      put rc=;
   run;

Changing Your Windows Directory

The following sample code for Windows sets the working directory in SAS as the TEMP folder on your C: drive:

   data _null_; 
      rc=dlgcdir("c:\temp");
      put rc=;
   run;

Changing Your Linux Directory

This sample code (for a Linux environment) changes the working directory in SAS to /u/your/linux/directory:

   data _null_;
      rc=dlgcdir("/u/your/linux/directory");
      put rc=;
   run;

Changing Your Directory: Other Tips

The DLGCDIR function temporarily changes the working directory for the current SAS or client session. However, you can create an autoexec file that contains the DATA step code that uses the DLGCDIR function. The autoexec file then executes the code each time you invoke SAS.
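
As a minimal sketch of what such an autoexec file might contain, the following code changes the directory only if it exists and writes a warning to the log otherwise. The macro variable name workdir and the path are placeholders, and the check assumes that DLGCDIR returns 0 when the change succeeds:

```sas
/* Sketch of autoexec.sas content. The path below is a placeholder. */
%let workdir = c:\temp;

data _null_;
   /* FILEEXIST returns 1 when the path (file or directory) exists */
   if fileexist("&workdir") then do;
      rc = dlgcdir("&workdir");
      /* Assumption: a nonzero return code means the change failed */
      if rc ne 0 then
         put "WARNING: DLGCDIR could not change the working directory. " rc=;
   end;
   else
      put "WARNING: Directory &workdir does not exist. Working directory unchanged.";
run;
```

Checking for the directory first keeps a missing or mistyped path from failing silently at startup.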

In most situations, it is still recommended that you specify the intended target directory for the Output Delivery System (ODS) and in other SAS statements. For example, when you use the ODS HTML statement, you should specify the target directory with the PATH option, as shown here:

   ods html path="c:\temp" (url=none) file="sasoutput.html";

Similarly, with the ODS PDF statement, you should specify the target directory with the FILE option, as shown here:

   ods pdf file="c:\temp\sasoutput.pdf";
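
For context, a complete round trip might look like the following sketch. It uses the sashelp.class sample data set purely for illustration; the important point is that the ODS destination is closed after the procedure runs, so that the file is fully written to the directory that you named:

```sas
/* Open the PDF destination with an explicit target path */
ods pdf file="c:\temp\sasoutput.pdf";

/* Any procedure output produced here is written to the PDF file */
proc print data=sashelp.class;
run;

/* Close the destination so that the file is complete */
ods pdf close;
```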

I hope you've found this post helpful.

How to change your working directory for SAS® with the DLGCDIR DATA step function was published on SAS Users.

May 18, 2018
 

During SAS Global Forum 2018, I sat down with four SAS users to get their take on what makes a SAS user. Read through to find valuable tips they shared and up your SAS game. I’m sure you will come away inspired, as you discover some universal commonalities in being a SAS user.

The post What makes a SAS User? Insight and Community: Josh Horstman appeared first on SAS Learning Post.

May 18, 2018
 

When data meets geography, use cases revolve around mapping and spatial analytics. But what happens when you combine digital analytics and powerful visualization for customer location analysis? Leveraging data collection mechanisms, SAS 360 Discover captures first-party behavioral information across the entire digital customer experience with a brand’s websites and mobile [...]

SAS Customer Intelligence 360: Location analytics meets digital intelligence was published on Customer Intelligence Blog.

May 16, 2018
 

In my article about how to construct calibration plots for logistic regression models in SAS, I mentioned that there are several popular variations of the calibration plot. The previous article showed how to construct a loess-based calibration curve. Austin and Steyerberg (2013) recommend the loess-based curve on the basis of an extensive simulation study. However, some practitioners prefer to use a decile calibration plot. This article shows how to construct a decile-based calibration curve in SAS.

The decile calibration plot

The decile calibration plot is a graphical analog of the Hosmer-Lemeshow goodness-of-fit test for logistic regression models. The subjects are divided into 10 groups by using the deciles of the predicted probability of the fitted logistic model. Within each group, you compute the mean predicted probability and the mean of the empirical binary response. A calibration plot is a scatter plot of these 10 ordered pairs, although most calibration plots also include the 95% confidence interval for the proportion of the binary responses within each group. Many calibration plots connect the 10 ordered pairs with piecewise line segments; others use a loess curve or a least squares line to smooth the points.

Create the decile calibration plot in SAS

The previous article simulated 500 observations from a logistic regression model logit(p) = b0 + b1*x + b2*x^2 where x ~ U(-3, 3). The following call to PROC LOGISTIC fits a linear model to these simulated data. That is, the model is intentionally misspecified. A call to PROC RANK creates a new variable (Decile) that identifies the deciles of the predicted probabilities for the model. This variable is used to compute the means of the predicted probabilities and the empirical proportions (and 95% confidence intervals) for each decile:

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim noprint;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
run;
 
/* To construct the decile calibration plot, identify deciles of the predicted prob. */
proc rank data=LogiOut out=LogiDecile groups=10;
   var PredProb;
   ranks Decile;
run;
 
/* Then compute the mean predicted prob and the empirical proportions (and CI) for each decile */
proc means data=LogiDecile noprint;
   class Decile;
   types Decile;
   var y PredProb;
   output out=LogiDecileOut mean=yMean PredProbMean
          lclm=yLower uclm=yUpper;
run;
 
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiDecileOut noautolegend aspect=1;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   *loess x=PredProbMean y=yMean;  /* if you want a smoother based on deciles */
   series x=PredProbMean y=yMean;  /* if you want to connect the deciles */
   scatter x=PredProbMean y=yMean / yerrorlower=yLower yerrorupper=yUpper;
   yaxis label="Observed Probability of Outcome";
   xaxis label="Predicted Probability of Outcome";
run;

[Figure: Decile calibration curve for a misspecified logistic regression model]

The diagonal line is the line of perfect calibration. In a well-calibrated model, the 10 markers should lie close to the diagonal line. For this example, the graph indicates that the linear model does not fit the data well. For the first decile of the predicted probability (the lowest predicted-risk group), the observed probability of the event is much higher than the mean predicted probability. For the fourth, sixth, and seventh deciles, the observed probability is much lower than the mean predicted probability. For the tenth decile (the highest predicted-risk group), the observed probability is higher than predicted. By the way, this kind of calibration is sometimes called internal calibration because the same observations are used to fit and assess the model.

The decile calibration plot for a correctly specified model

You can fit a quadratic model to the data to see how the calibration plot changes for a correctly specified model. The results are shown below. In this graph, all markers are close to the diagonal line, which indicates a very close agreement between the predicted and observed probabilities of the event.

[Figure: Decile calibration curve for a correctly specified logistic regression model]
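
The code for the quadratic fit is not listed in this section. One way to produce such a plot is sketched below, assuming that the LogiSim data set from the previous article is available. The data set names LogiOutQ, LogiDecileQ, and LogiDecileOutQ are placeholders chosen for this illustration:

```sas
/* Fit a quadratic model and save the predicted probabilities */
proc logistic data=LogiSim noprint;
   model Y(event='1') = x x*x;
   output out=LogiOutQ predicted=PredProb;
run;

/* Identify deciles of the predicted probabilities */
proc rank data=LogiOutQ out=LogiDecileQ groups=10;
   var PredProb;
   ranks Decile;
run;

/* Mean predicted probability and empirical proportion (with CI) per decile */
proc means data=LogiDecileQ noprint;
   class Decile;
   types Decile;
   var y PredProb;
   output out=LogiDecileOutQ mean=yMean PredProbMean
          lclm=yLower uclm=yUpper;
run;

title "Calibration Plot for a Correct Model";
proc sgplot data=LogiDecileOutQ noautolegend aspect=1;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   series x=PredProbMean y=yMean;
   scatter x=PredProbMean y=yMean / yerrorlower=yLower yerrorupper=yUpper;
   yaxis label="Observed Probability of Outcome";
   xaxis label="Predicted Probability of Outcome";
run;
```

The steps are identical to the misspecified case. Only the MODEL statement changes.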

Should you use the decile calibration curve?

The decile-based calibration plot is popular, perhaps because it is so simple that it can be constructed by hand. Nevertheless, Austin and Steyerberg (2013) suggest using the loess-based calibration plot instead of the decile-based plot. Reasons include the following:

  • The use of deciles results in estimates that "display greater variability than is evident in the loess-based method" (p. 524).
  • Several researchers have argued that the use of 10 deciles is arbitrary. Why not use five? Or 15? In fact, the results of the Hosmer-Lemeshow test "can depend markedly on the number of groups, and there's no theory to guide the choice of that number." (P. Allison, 2013. "Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression")
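
To see the effect of the number of groups for yourself, you can wrap the ranking and summarizing steps in a macro and vary the GROUPS= value. The following is only a sketch: the macro name CalibGroups, the variable name Bin, and the output data set names are placeholders, and it assumes that the LogiOut data set from the earlier PROC LOGISTIC step exists:

```sas
/* Hypothetical helper: summarize calibration for an arbitrary number of groups */
%macro CalibGroups(groups=10, out=CalibOut);
   proc rank data=LogiOut out=_Ranked groups=&groups;
      var PredProb;
      ranks Bin;              /* Bin takes the values 0 to &groups-1 */
   run;

   proc means data=_Ranked noprint;
      class Bin;
      types Bin;
      var y PredProb;
      output out=&out mean=yMean PredProbMean
             lclm=yLower uclm=yUpper;
   run;
%mend CalibGroups;

%CalibGroups(groups=5,  out=Calib5);    /* quintiles */
%CalibGroups(groups=15, out=Calib15);   /* 15 groups */
```

Plotting Calib5 and Calib15 with the same PROC SGPLOT statements used for the decile plot lets you compare how the picture changes with the grouping.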

Many leading researchers in logistic regression do not recommend the Hosmer-Lemeshow test for these reasons. The decile-based calibration curve shares the same drawbacks. Since SAS can easily create the loess-based calibration curve (see the previous article), there seems to be little reason to prefer the decile-based version.

The post Decile calibration plots in SAS appeared first on The DO Loop.

May 15, 2018
 

Editor's note: Learn more about SAS 360 Plan, the latest addition to the SAS Customer Intelligence 360 solution. We all know that digital platforms have transformed the role of marketing in the last few years; however, in many organizations, marketing operations hasn’t kept pace with that innovation. Marketers need to be [...]

Purpose-driven marketing planning with SAS Customer Intelligence 360 was published on Customer Intelligence Blog.

May 15, 2018
 

In a recent #SASchat we explored the intersection of artificial intelligence and human creativity. Posing questions like, "How is AI impacting human creativity?" And, "What will it take for the public to fully trust AI?" In the lively online conversation, a few themes prevailed: Automation enabled by AI frees humans [...]

Exploring the intersection of AI and human creativity was published on SAS Voices by Kristine Vick

May 14, 2018
 

A logistic regression model is a way to predict the probability of a binary response based on values of explanatory variables. It is important to be able to assess the accuracy of a predictive model. This article shows how to construct a calibration plot in SAS. A calibration plot is a goodness-of-fit diagnostic graph. It enables you to qualitatively compare a model's predicted probability of an event to the empirical probability.

Calibration curves

There are two standard ways to assess the accuracy of a predictive model for a binary response: discrimination and calibration. To assess discrimination, you can use the ROC curve. Discrimination involves counting the number of true positives, false positives, true negatives, and false negatives at various threshold values. In contrast, calibration curves compare the predicted probability of the response to the empirical probability. Calibration can help diagnose lack of fit. It also helps to rank subjects according to risk.
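
For comparison with the calibration plots that this article constructs, here is a brief sketch of how you might request the ROC curve for a fitted model. It assumes a data set such as the LogiSim data that is simulated later in this article:

```sas
/* Request only the ROC curve from PROC LOGISTIC */
proc logistic data=LogiSim plots(only)=roc;
   model Y(event='1') = x;
run;
```

The procedure output also reports the c statistic, which is the area under the ROC curve.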

This article discusses internal calibration, which is the agreement on the sample used to fit the model. External calibration, which involves comparing the predicted and observed probabilities on a sample that was not used to fit the model, is not discussed.

In the literature, there are two types of calibration plots. One uses a smooth curve to compare the predicted and empirical probabilities. The curve is usually a loess curve, but sometimes a linear regression curve is used. The other (to be discussed in a future article) splits the data into deciles. An extensive simulation study by Austin and Steyerberg (2013) concluded that loess-based calibration curves have several advantages. This article discusses how to use a loess fit to construct a calibration curve.

A calibration curve for a misspecified model

Calibration curves help to diagnose lack of fit. This example simulates data according to a quadratic model but fits a linear model to those data. The calibration curve should indicate that the model is misspecified.

Step 1: Simulate data for a quadratic logistic regression model

Austin and Steyerberg (2013) use the following simulated data, which is based on the earlier work of Hosmer et al. (1997). Simulate X ~ U(-3, 3) and define η = b0 + b1*X + b2*X^2 to be a linear predictor, where b0 = -2.337, b1 = 0.8569, b2 = 0.3011. The logistic model is logit(p) = η, where p = Pr(Y=1 | x) is the probability of the binary event Y=1. The following SAS DATA step simulates N=500 observations from this logistic regression model:

/* simulate sample of size N=500 that follows a quadratic model from Hosmer et al. (1997) */
%let N = 500;                              /* sample size */ 
data LogiSim(keep=Y x);
call streaminit(1234);                     /* set the random number seed */
do i = 1 to &N;                            /* for each observation in the sample */
   /* Hosmer et al. chose x ~ U(-3,3). Use rand("Uniform",-3,3) for modern versions of SAS */
   x = -3 + 6*rand("Uniform");
   eta = -2.337 + 0.8569*x + 0.3011*x**2;  /* quadratic model used by Hosmer et al. */
   mu = logistic(eta);
   Y = rand("Bernoulli", mu);
   output;
end;
run;

For this simulated data, the "true" model is known to be quadratic. Of course, in practice, the true underlying model is unknown.

Step 2: Fit a logistic model

The next step is to fit a logistic regression model and save the predicted probabilities. The following call to PROC LOGISTIC intentionally fits a linear model. The calibration plot will indicate that the model is misspecified.

/* Use PROC LOGISTIC and output the predicted probabilities.
   Intentionally MISSPECIFY the model as linear. */
proc logistic data=LogiSim plots=effect;
   model Y(event='1') = x;
   output out=LogiOut predicted=PredProb; /* save predicted probabilities in data set */
run;

[Figure: Plot of predicted probabilities for a logistic regression model in SAS]

As shown above, PROC LOGISTIC can automatically create a fit plot for a regression that involves one continuous variable. The fit plot shows the observed responses, which are plotted at Y=0 (failure) or Y=1 (success). The predicted probabilities are shown as a sigmoidal curve. The predicted probabilities are in the range (0, 0.8). The next section creates a calibration plot, which is a graph of the predicted probability versus the observed response.

Create the calibration plot

The output from PROC LOGISTIC includes the predicted probabilities and the observed responses. Copas (1983), Harrell et al. (1984), and others have suggested assessing calibration by plotting a scatter plot smoother and overlaying a diagonal line that represents the line of perfect calibration. If the smoother lies close to the diagonal, the model is well calibrated. If there is a systematic deviation from the diagonal line, that indicates that the model might be misspecified.

The following statements sort the data by the predicted probabilities and create the calibration plot. (The sort is optional for the loess smoother, but is useful for other analyses.)

proc sort data=LogiOut;  by PredProb;  run;
 
/* let the data choose the smoothing parameter */
title "Calibration Plot for Misspecified Model";
title2 "True Model Is Quadratic; Fit Is Linear";
proc sgplot data=LogiOut noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm;   /* smoothing value by AICC (0.657) */
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
run;

[Figure: Calibration plot for a misspecified logistic model]

The loess smoother does not lie close to the diagonal line, which indicates that the linear model is not a good fit for the quadratic data. The loess curve is systematically above the diagonal line for very small and very large values of the predicted probability. This indicates that the empirical probability (the proportion of events) is higher than predicted on these intervals. In contrast, when the predicted probability is in the range [0.1, 0.45], the empirical probability is lower than predicted.

Several variations of the calibration plot are possible:

  • Add the NOMARKERS option to the LOESS statement to suppress the observed responses.
  • Omit the CLM option if you do not want to see a confidence band.
  • Use the SMOOTH=0.75 option if you want to specify the value for the loess smoothing parameter. The value 0.75 is the value that Austin and Steyerberg used in their simulation study, in part because that is the default parameter value in the R programming language.
  • Use a REG statement instead of the LOESS statement if you want a polynomial smoother.
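
The variations in the preceding list can be sketched as follows. This assumes that the LogiOut data set from the earlier PROC LOGISTIC step exists. The titles are illustrative only, and DEGREE=2 is one possible choice for the polynomial smoother:

```sas
/* Variation 1: fixed smoothing parameter, no markers, no confidence band */
title "Calibration Plot: Loess with SMOOTH=0.75";
proc sgplot data=LogiOut noautolegend aspect=1;
   loess x=PredProb y=y / smooth=0.75 nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
run;

/* Variation 2: quadratic polynomial smoother instead of loess */
title "Calibration Plot: Polynomial Smoother";
proc sgplot data=LogiOut noautolegend aspect=1;
   reg x=PredProb y=y / degree=2 clm nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
run;
```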

A calibration plot for a correctly specified model

In the previous section, the model was intentionally misspecified. The calibration curve for the example did not lie close to the diagonal "perfect calibration" line, which indicates a lack of fit. Let's see how the calibration curve looks for the same data if you include a quadratic term in the model. First, generate the predicted probabilities for the quadratic model, then create the calibration curve for the model:

proc logistic data=LogiSim noprint;
   model Y(event='1') = x x*x;  /* fit a quadratic model */
   output out=LogiOut2 predicted=PredProb;
run;
 
proc sort data=LogiOut2;  by PredProb;  run;
 
title "Calibration Plot for a Correct Model";
proc sgplot data=LogiOut2 noautolegend aspect=1;
   loess x=PredProb y=y / interpolation=cubic clm nomarkers;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   yaxis grid; xaxis grid;
run;

[Figure: Calibration plot for a correctly specified logistic model]

This calibration plot indicates that the quadratic model fits the data well. The calibration curve is close to the diagonal reference line, which is the line of perfect calibration. The curve indicates that the predicted and empirical probabilities are close to each other for low-risk, moderate-risk, and high-risk subjects.

It is acceptable for the calibration curve to have small dips, bends, and wiggles. The extent to which the curve wiggles is determined by the (automatically chosen) smoothing parameter for the loess algorithm. The value of the smoothing parameter is not displayed by PROC SGPLOT, but you can use PROC LOESS to display the smoothing parameter and other statistics for the loess smoother.
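
A minimal sketch of such a PROC LOESS call follows. It assumes that the LogiOut2 data set from the preceding step exists. The SELECT=AICC and INTERP=CUBIC options are chosen here to mirror the PROC SGPLOT call, but whether the selected value exactly matches the one that PROC SGPLOT uses internally is an assumption:

```sas
/* Display the selected smoothing parameter and fit statistics */
proc loess data=LogiOut2;
   model y = PredProb / select=AICC interp=cubic;
run;
```

The procedure output includes the selected smoothing parameter, which you can then pass to the SMOOTH= option in PROC SGPLOT if you want to reproduce the fit explicitly.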

According to Austin and Steyerberg (2013), the calibration curve can detect misspecified models when the magnitude of the omitted term (or terms) is moderate to strong. The calibration curve also behaves better for large samples and when the incidence of outcomes is close to 50% (as opposed to having rare outcomes). Lastly, there is a connection between discrimination and calibration: The calibration curve is most effective in models for which the discrimination (as measured by the C-statistic or area under the ROC curve) is good.

In summary, this article shows how to construct a loess-based calibration curve for logistic regression models in SAS. To create a calibration curve, use PROC LOGISTIC to output the predicted probabilities for the model and plot a loess curve that regresses the observed responses onto the predicted probabilities.

My next article shows how to construct a popular alternative to the loess-based calibration curve. In the alternative version, the subjects are ranked into deciles according to the predicted probability or risk.

The post Calibration plots in SAS appeared first on The DO Loop.

May 10, 2018
 

For those of you who weren't able to attend SAS Global Forum 2018, you can still learn a lot from the content shared there. Gain knowledge from your SAS family. SAS Global Forum 2018 papers and videos now available.

The post Knowledge from the SAS family: SAS Global Forum 2018 papers and videos now available appeared first on SAS Learning Post.