The endpoint of analytics is not a report or an alert. The endpoint is a decision. Often, those decisions relate to your business: you make them to reduce risk, improve production, or satisfy customers. In health care, however, the decisions made with analytics can be a matter of [...]
Let’s face it. Data sharing between platforms in health care just isn’t easy. Patient data privacy concerns, incompatible file formats, asynchronous identifiers … I’ve heard it all. From the electronic health record (EHR), picture archiving and communication systems (PACS) to discrete processes like pharmacy or departmental information systems, achieving some [...]
Why analytic interoperability matters in health care was published on SAS Voices by Alyssa Farrell
Growing up, I was often reminded to turn off the lights. In my home, this was a way of saving on a key resource (electricity) that we had to pay for when not using it. It was a way of being a good steward with the family’s money and targeting it to run the lights and other things in our home. This allowed us to go about our daily tasks and get them done when we needed to.
These days I have the same goal in my own home, but I’ve automated the task. I have a voice assistant that, with a few select words, will turn off (or on) the lights in the rooms that I use most. The goal and the reasoning are the same, but the automation allows me to take it to another level.
In a similar way, automation today allows us to optimize the use of compute resources in a way that we haven’t been able to do in the past. The degree to which we can switch on and off the systems required to run our compute workloads in cloud environments, scale to use more, or fewer, resources depending on demand, and only pay for what we need, is a clear indicator of just how much infrastructure technology has evolved in recent years.
Like the basic utilities we rely on in our homes, SAS, with its analytics and modeling capabilities has become an essential utility for any business that wants to not only make sense of data but turn it into the power to get things done. And, like any utility necessary to do business, we want it working quickly at the flip of a switch, easily made available anywhere we need it, and helping us be good stewards of our resources.
Containers and related technologies can help us achieve all of this in SAS.
A container can most simply be thought of as a self-contained environment with all the programs, configuration, initial data, and other supporting pieces to run applications. This environment can be treated as a stand-alone unit, ready to turn on and run at any time, much in the way your laptop is a stand-alone system. In fact, this sort of “portable machine” analogy can be a good way to think about containers at a high level – a complete virtual system containing, and configured for running, one or more targeted application(s) – or components of an application.
Docker is one of the oldest, and most well-known, applications for defining and managing containers. Defining a container is done by first defining an image. The image is an immutable (static) set of the software, environment, configuration, etc. that serves as a template for containers to be created from it. Each image, in turn, is composed of layers that apply some setting, software, or data that a container based on the image will need.
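To make the layering concrete, here is a minimal Dockerfile sketch. The base image, packages, and file paths are hypothetical; the point is that each instruction produces one immutable layer of the image:

```dockerfile
# Base layer: a minimal Linux image
FROM ubuntu:22.04

# Layer: install runtime dependencies (hypothetical packages)
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*

# Layers: copy the application and its configuration into the image
COPY app/ /opt/app/
COPY config/app.conf /etc/app/app.conf

# Metadata: the command a container created from this image will run
CMD ["python3", "/opt/app/main.py"]
```

Running `docker build` stacks these layers into an immutable image; running `docker run` then adds the writable "container layer" on top, as described below.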
Have you ever staged a new machine for yourself, your company, a friend or relative? If so, you can relate the “layering” of software you installed and configured to the layers that go into an image. And the image itself might be like the image of software you created on the disk of the system. The applications are stored there, fully configured and ready to go, whenever you turn the system on.
Turning an image into a container is mostly just adding another layer on top of the present image layers. The difference is this “container layer” can be modified as needed – have things written to it, updated, etc. If you think about the idea of creating a user profile along with its space on a system you staged, it’s a similar idea. Like that dedicated user space on the laptop or desktop, the layer that gets added to an image to make a container, is there for the running system to use and customize as needed. This is like how the user area is there to use and customize as needed when a PC is turned on and running.
Containers, Kubernetes, and cloud environments
It is rare that any corporate system today is managed with only a single PC. Likewise, in the world of cloud and containerized environments, it is rare that any software product is run with only a single container. More commonly, applications consist of many containers organized to address specific application areas (such as web interfaces, database management, etc.) and/or architectural designs to optimize resource use and communication paths in the system (microservices).
Having the advantages that are derived from either multiple PCs or multiple containers also requires a way to manage them and ensure reliability and robustness for our applications and customers. For the machines, enterprises typically rely on data centers. Data centers play a key role, ensuring systems are kept up and running, replaced when broken, and are centrally accessible. As well, they may be responsible for bringing more systems online to address increased loads or taking some systems offline to save costs.
For containers, we have applications that function much like a “data center for containers.” The most prominent one today is Kubernetes (also known as “K8S” for the eight letters between “K” and “S”). Kubernetes’ job is to simplify deployment and management of containers and containerized workloads. It does this by automating key needs around containers, such as deployment, scaling, scheduling, healing, monitoring, and more. All of this is managed in a “declarative” way where we no longer must tell the system “how” to get to the state we want – we instead tell it “what” state we want, and it ensures that state is met and preserved.
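A Kubernetes Deployment manifest illustrates this declarative style. You state the desired end state (here, three running copies of a container); the image name and labels below are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3            # the "what": keep three copies running
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:1.0   # hypothetical image
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
```

If a container crashes, Kubernetes notices that the actual state (two replicas) differs from the declared state (three) and starts a replacement; you never script the "how."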
The combination of containers, Kubernetes, and cloud environments provides an evolutionary jump in being able to control and leverage the infrastructure and runtime environments that you run your applications in. And this gives your business a similar jump in being able to provide the business value targeted to meet the environments, scale, and reliability that your customers demand - while having the automatic optimization of resources and the automatic management of workloads that you need to be competitive.
Harness decades of expertise with SAS® Viya® 4.0
Viya 4.0 provides this same evolutionary jump for SAS. Now, your SAS applications and workloads can be run in containers, Kubernetes, and cloud environments natively. Viya 4 builds on the award-winning, best-in-class analytics to allow data scientists, business executives, and decision makers at all levels to harness the decades of SAS expertise running completely in containers, and tightly integrated with Kubernetes and the cloud.
Viya 4.0 brings all the key SAS functionalities you’d expect – modeling, decision-making, forecasting, visualization, and more – to the cloud and enterprise cloud environments, along with the advantages of running in a containerized model. It also leverages the robust container management, monitoring, self-healing, scaling and other aspects of Kubernetes. All of this puts you more in control and makes you less reliant on being in your data center to manage these kinds of activities.
Just remember to turn the lights off.
Automation with Containers: the Power to Get Things Done was published on SAS Users.
A colleague recently posted an article about how to use SAS Visual Analytics to create a circular graph that displays a year's worth of temperature data. Specifically, the graph shows the air temperature for each day in a year relative to some baseline temperature, such as 65F (18C). Days warmer than baseline are displayed in one color (red for warm) whereas days colder than the baseline are displayed in another color (blue for cold). The graph was very pretty. A reader posted a comment asking whether a similar graph could be created by using other graphical tools such as GTL or even PROC SGPLOT. The answer is yes, but I am going to propose a different graph that I think is more flexible and easier to read.
Let's generalize the problem. Suppose you have a time series and you want to compare the values to a baseline (or reference) value. One way to do this is to visualize the data as deviations from the baseline. Data values that are close to the baseline will be small and almost unnoticeable. The eye will be drawn to values that indicate large deviations from the baseline. A "deviation plot" like this can be used for many purposes. Some applications include monitoring blood glucose relative to a target value, showing expenditures relative to a fixed income amount, and, yes, displaying the temperature relative to some comfortable reference value. Deviation plots sometimes accompany a hypothesis test for a one-way frequency distribution.
Linear displays versus circular displays
My colleague's display shows one year's worth of temperatures by plotting the day of the year along a circle. While this makes for an eye-catching display, there are a few shortcomings to this approach:
- It is difficult to read the data values. It is also difficult to compare values that are on opposite sides of a circle. For example, how does March data compare with October data?
- Although a circle can show data for one year, it is less effective for showing 8 or 14 months of data.
- Even for one year's worth of data, it has a problem: It places December 31 next to January 1. In the temperature graph, the series began on 01JAN2018. However, the graph places 31DEC2018 next to 01JAN2018 even though those values are a year apart.
As mentioned earlier, you can use SAS/GRAPH or the statistical graphics (SG) procedure in SAS to display the data in polar coordinates. Sanjay Matange's article shows how to create a polar plot. For some of my thought about circular versus rectangular displays, see "Smoothers for periodic data."
A deviation-from-baseline plot
The graph to the right (click to enlarge) shows an example of a deviation plot (or deviation-from-baseline plot). It is similar to a waterfall chart, but in many waterfall charts the values are shown as percentages, whereas for the deviation plot we will show the observed values. You can see that the values are plotted for each day. The high values are plotted in one color (red) whereas low values are plotted in a different color (blue). A reference line (in this case, at 100) is displayed.
To create a deviation plot, you need to perform these three steps:
- Use the SAS DATA step to encode the data as 'High' or 'Low' by using the reference value. Compute the deviations from the reference value.
- Create a discrete attribute map that maps values to colors. This step is optional. Alternatively, SAS will assign colors based on the current ODS style.
- Use a HIGHLOW plot to graph the deviations from the reference value.
Let's implement these steps on a time series for three months of daily blood glucose values. An elderly male takes oral medications to control his blood glucose level. Each morning he takes his fasting blood glucose level and records it. The doctor has advised him to try to keep the blood glucose level below 100 mg/dL, so the reference value is 100. The following DATA step defines the dates and glucose levels for a three-month period.
data Series;
informat Date date.;
format Date Date.;
input Date y @@;
label y = "Blood Glucose (mg/dL)";
datalines;
01SEP19 100  02SEP19  96  03SEP19  86  04SEP19  93  05SEP19 105
06SEP19 106  07SEP19 123  08SEP19 121  09SEP19 115  10SEP19 108
11SEP19  94  12SEP19  96  13SEP19  95  14SEP19 120  15SEP19 112
16SEP19 104  17SEP19  97  18SEP19 101  19SEP19 108  20SEP19 108
21SEP19 117  22SEP19 103  23SEP19 109  24SEP19  97  25SEP19  93
26SEP19 100  27SEP19  98  28SEP19 122  29SEP19 116  30SEP19  99
01OCT19 102  02OCT19  99  03OCT19  95  04OCT19  99  05OCT19 116
06OCT19 109  07OCT19 106  08OCT19  94  09OCT19 104  10OCT19 112
11OCT19 119  12OCT19 111  13OCT19 104  14OCT19 101  15OCT19  99
16OCT19  92  17OCT19 101  18OCT19 115  19OCT19 109  20OCT19  98
21OCT19  91  22OCT19  92  23OCT19 100  24OCT19 109  25OCT19 102
26OCT19 117  27OCT19 106  28OCT19  98  29OCT19  98  30OCT19  95
31OCT19  97  01NOV19 129  02NOV19 120  03NOV19 117  04NOV19   .
05NOV19 101  06NOV19 105  07NOV19 105  08NOV19 106  09NOV19 118
10NOV19 109  11NOV19 102  12NOV19  98  13NOV19  97  14NOV19   .
15NOV19  92  16NOV19 114  17NOV19 107  18NOV19  98  19NOV19  91
20NOV19  97  21NOV19 109  22NOV19  98  23NOV19  95  24NOV19  95
25NOV19  94  26NOV19   .  27NOV19  98  28NOV19 115  29NOV19 123
30NOV19 114  01DEC19 104  02DEC19  96  03DEC19  97  04DEC19 100
05DEC19  94  06DEC19  93  07DEC19 105  08DEC19   .  09DEC19  88
10DEC19  84  11DEC19 101  12DEC19 122  13DEC19 114  14DEC19 108
15DEC19 103  16DEC19  88  17DEC19  74  18DEC19  92  19DEC19 110
20DEC19 118  21DEC19 106  22DEC19 100  23DEC19 106  24DEC19 107
25DEC19 116  26DEC19 113  27DEC19 113  28DEC19 117  29DEC19 101
30DEC19  96  31DEC19 101
;
Encode the data
The first step is to compute the deviation of each observed value from the reference value. If an observed value is above the reference value, mark it as 'High', otherwise mark it as 'Low'. We will plot a vertical bar that goes from the reference level to the observed value. Because we will use a HIGHLOW statement to display the graph, the DATA step computes two new variables, High and Low.
/* 1. Compute the deviation and encode the data as 'High' or 'Low'
      by using the reference value */
%let RefValue = 100;
data Center;
set Series;
if (y > &RefValue) then Group="High";
else Group="Low";
Low  = min(y, &RefValue);   /* lower end of highlow bar */
High = max(y, &RefValue);   /* upper end of highlow bar */
run;
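For readers outside SAS, the same encoding logic can be sketched in Python (variable names mirror the DATA step; this is an illustration, not the code used to produce the graph):

```python
REF_VALUE = 100  # reference (baseline) blood glucose level

def encode(values, ref=REF_VALUE):
    """Classify each observation as 'High' or 'Low' relative to the
    reference value and compute the endpoints of its high-low bar."""
    rows = []
    for y in values:
        if y is None:            # missing observation, as in the SAS data
            rows.append(None)
            continue
        group = "High" if y > ref else "Low"
        low = min(y, ref)        # lower end of the high-low bar
        high = max(y, ref)       # upper end of the high-low bar
        rows.append((group, low, high))
    return rows

print(encode([100, 96, 123, None]))
```

Each bar runs from the reference level to the observed value, so a value exactly at the reference produces a zero-length bar.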
Map high and low values to colors
If you want SAS to assign colors to the two groups, you can skip this step. However, in many cases you might want to choose which color is plotted for the high and low categories. You can map levels of a group to colors by using a discrete attribute map ("DATTR map", for short) in PROC SGPLOT. Because we are going to use a HIGHLOW statement to graph the data, we need to define a map that has the FillColor and LineColor for the vertical bars. The following DATA step maps the 'High' category to red and the 'Low' category to blue:
/* 2. Create a discrete attribute map that maps values to colors */
data DAttrs;
length c FillColor LineColor $16.;
ID = "HighLow";
Value = "High"; c="DarkRed";  FillColor=c; LineColor=c; output;
Value = "Low";  c="DarkBlue"; FillColor=c; LineColor=c; output;
run;
Create a high-low plot
The final step is to create a high-low plot that shows the deviations from the reference value. You can use the DATTRMAP= option to tell PROC SGPLOT how to assign colors for the group values. Because a data set can contain multiple maps, the ATTRID= option specifies which mapping to use.
/* 3. Use a HIGHLOW plot to graph the deviations from the reference value */
title "Deviations from Reference Value (&RefValue)";
title2 "Morning Fasting Blood Glucose";
ods graphics / width=600px height=400px;
proc sgplot data=Center DATTRMAP=DAttrs noautolegend;
highlow x=Date low=Low high=High / group=Group ATTRID=HighLow;
refline &RefValue / axis=y;
yaxis grid label="Blood Glucose Level";
run;
The graph is shown at the top of this section. It is clear that on most days the patient has high blood sugar. With additional investigation, you can discover that the highest levels are associated with weekends and holidays.
Note that these data would not be appropriate to plot on a circular graph because the data are not for a full year. Furthermore, on this graph it is easy to see specific values and days and to compare days in September with days in December.
A deviation plot for daily average temperatures
My colleague's graph displayed daily average temperatures. The following deviation plot shows average temperatures and a reference value of 65F. The graph shows the daily average temperature in Raleigh, NC, for 2018:
In this graph, it is easy to find the approximate temperature for any range of dates (such as "mid-October") and to compare the temperature for different time periods, such as March versus October. I think the rectangular deviation plot makes an effective visualization of how these data compare to a baseline value.
The post Create a deviation plot to visualize values relative to a baseline appeared first on The DO Loop.
Stored processes were a very popular feature in SAS 9.4. They were used in reporting, analytics and web application development. In Viya, the equivalent feature is jobs. Using jobs the same way as stored processes was enhanced in Viya 3.5. In addition, there is preliminary support for promoting 9.4 stored processes to Viya. In this post, I will examine what is sure to be a popular feature for SAS 9.4 users starting out with Viya.
What is a job? In Viya, that is a complicated question, as there are a number of different types of jobs. In the context of this post, we are talking about jobs that can be created in SAS Studio or in the Job Execution application. A SAS Viya job consists of a program and its definition (metadata related to the program). The job definition includes information such as the job name, the author and when it was created. After you have created a job definition, you have an execution URL that you can share with others at your site. The execution URL can be entered into a web browser and run without opening SAS Studio, or you can run a job directly from SAS Studio. When running a SAS job, SAS Studio uses the SAS Job Execution Web Application.
In addition, you can create a task prompt (XML) or an HTML form to provide a user interface to the job. When the user selects an option in the prompt and submits the job, the data specified in the form or task prompt is passed to a SAS session as global macro variables. The SAS program runs and the results are returned to the web browser. Sounds a lot like a stored process!
For more information on authoring jobs in SAS Studio, see SAS® Studio 5.2 Developer’s Guide: Working with Jobs.
With the release of Viya 3.5, there is preliminary support for the promotion of 9.4 Stored Processes to Viya Jobs. For more information on this functionality, see SAS® Viya® 3.5 Administration: Promotion (Import and Export).
How does it work? Stored processes are exported from SAS 9.4 using the export wizard or on the command-line and the resultant package can be imported to Viya using SAS Environment Manager or the sas-admin transfer CLI. The import process converts the stored processes to job definitions which can be run in SAS Studio or directly via a URL.
The new job definition includes the job metadata, SAS code and task prompts.
1. Job Metadata
The data is not included in the promotion process. However, it must be made available to the Viya compute server so that the job code and any dynamic prompts can access it. There are two Viya server contexts that need to have access to the data:
- Job Execution Context: used when running jobs via a URL
- SAS Studio Context: used when running jobs from SAS Studio
To make the data available, the compute server has to access it via a libname and the table must exist in the library.
To add a libname to the SAS Job Execution Compute context in SAS Environment Manager, select Contexts > Compute contexts and SAS Job Execution Compute context. Then select Edit > Advanced.
In the box labelled, “Enter each autoexec statement on a new line:” add a line for the libname. If you keep the same 8 character libname as in 9.4, you will have less to change in your code and prompts.
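For example, if your SAS 9.4 code used a library named FINDATA, the autoexec line might look like the following (the path is hypothetical and depends on where the data sits in your Viya environment):

```sas
libname findata "/opt/sas/data/finance";
```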
NOTE: the libname could be added to the /opt/sas/viya/config/etc/compsrv/default/autoexec_usermods.sas file on the compute server. While this is perhaps easier than adding to each context that requires it, this would make it available to every Viya compute process.
2. SAS Code
The SAS code that makes up the stored process is copied without modification to the Viya job definition. In most cases, the code will need to be edited so that it will run in the Viya environment. The SAS code in the definition can be modified using SAS Studio or the SAS Job Execution Web Application. Edit the code so that:
- Any libnames in the new job code point to the location of the data in Viya
- Any other SAS 9 metadata-related code is removed from the program
As you test your new jobs in Viya, additional changes may be needed to get them running.
3. Task Prompts
Prompts that were defined in metadata in the SAS 9 stored process are converted to the task prompts and stored within the job definition. The xml for the prompt definition can be accessed and edited from SAS Studio or the SAS Job Execution Web App. For more information on working with task prompts in Viya, see the SAS Studio Developers guide.
If you have shared prompts you want to include, it is recommended that you select, “Include dependent objects” when exporting from SAS 9.4. If you do not select this option, any shared prompts will be omitted from the package. If the shared prompts, libnames and tables are included in the package with the stored processes, the SAS 9 metadata based library and table definitions referenced in dynamic prompts will convert to use a library.table reference in Viya. When this happens, the XML for the prompt will include the libname.tablename of the dataset used to populate the prompt values. For example:
<DataSource active="true" name="DataSource2" defaultValue="FINDATA.FINANCIAL__SUMMARY">
If the libnames and tables are not included in the package, the prompt will show a url mapping to the tables location in 9.4 metadata. For example:
<DataSource active="true" name="DataSource2" url="/gelcorp/financecontent/Data/Source Data/FINANCIAL__SUMMARY(Table)">
For the prompt to work in the latter case, you need to edit it and provide the libname.table reference in the same way as it is shown in the first example.
Including libraries and tables in a package imported to Viya results in the folders that contained those libraries and tables in 9.4 being created in Viya. This may result in Viya folders that are not needed, because data does not reside in folders in Viya. As an administrator, you can choose to either:
- Include the dependent data and tables in the package and clean up any extra folders after promotion
- Exclude the dependent data and tables in the package and edit the data source in prompt xml to reference the libname.table (this is not a great option if you have shared prompts)
Issues encountered converting SAS 9 prompts to the Viya SAS prompt interface will cause warning messages to be included at the beginning of the XML defining the prompt interface.
As I mentioned earlier, you can run the job from SAS Studio or from the Job Execution Web Application. You can also run the job from a URL, just like you could with stored processes in SAS 9.4. To get the URL for the job, view the job's properties in SAS Studio.
Earlier releases of Viya provided a different way to support stored processes: enabling access to the 9.4 stored process server and its stored processes from Viya. This approach is still supported in Viya 3.5 because, while jobs can replace some stored processes, they cannot currently be embedded in a Viya Visual Analytics report.
For more information, please check out:
- SAS Studio 5.2 Developer’s Guide: Working with Jobs
- SAS Studio 5.2 Developers Guide: Task Prompting Interface
- SAS® Viya® 3.5 Administration: Promotion (Import and Export)
- Content promotion in Viya: overview and details
- New functionality for transitioning from Visual Analytics on 9.4 to Viya
Is it getting harder and harder to find empty Excel spreadsheet cells as you run out of columns and rows? Do your spreadsheet cell labels have more letters than the license plate on your car? Do you find yourself waking up in the middle of the night in cold [...]
Are you suffering from demand planner spreadsheet overload? was published on SAS Voices by Charlie Chase
Do you wish you could predict the likelihood that one of your customers will open your marketing email? Or what if you could tell whether a new medical treatment for a patient will have a better outcome than the standard treatment? If you are familiar with propensity modeling, then you know such predictions about future behavior are possible! Propensity models generate a propensity score, which is the probability that a future behavior will occur. Propensity models are used often in machine learning and predictive data analytics, particularly in the fields of marketing, economics, business, and healthcare. These models can detect and remove bias in analysis of real-world, observational data where there is no control group.
SAS provides several approaches for calculating propensity scores. This excerpt from the new book, Real World Health Care Data Analysis: Causal Methods and Implementation Using SAS®, discusses one approach for estimating propensity scores and provides associated SAS code. The example code and data used in the examples is available to download here.
A priori logistic regression model
One approach to estimating a propensity score is to fit a logistic regression model a priori, that is, identify the covariates in the model and fix the model before estimating the propensity score. The main advantage of an a priori model is that it allows researchers to incorporate knowledge external to the data into the model building. For example, if there is evidence that a covariate is correlated to the treatment assignment, then this covariate should be included in the model even if the association between this covariate and the treatment is not strong in the current data. In addition, the a priori model is easy to interpret. The directed acyclic graph approach could be very informative in building a logistic propensity score model a priori, as it clearly points out the relationship between covariates and interventions. The correlation structure between each covariate and the intervention selection is pre-specified and in a fixed form. However, one main challenge of the a priori modeling approach is that it might not provide the optimal balance between treatment and control groups.
Building an a priori model
To build an a priori model for propensity score estimation in SAS, we can use either PROC PSMATCH or PROC LOGISTIC as shown in Program 1. In both cases, the input data set is a one observation per patient data set containing the treatment and baseline covariates from the simulated REFLECTIONS study. Also, in both cases the code will produce an output data set containing the original data set with the additional estimated propensity score for each patient (_ps_).
Program 1: Propensity score estimation: a priori logistic regression
PROC PSMATCH DATA=REFL2 REGION=ALLOBS;
CLASS COHORT GENDER RACE DR_RHEUM DR_PRIMCARE;
PSMODEL COHORT(TREATED='OPIOID') = GENDER RACE AGE BMI_B BPIINTERF_B
        BPIPAIN_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PHYSICALSYMP_B
        SDS_B DR_RHEUM DR_PRIMCARE;
OUTPUT OUT=PS PS=_PS_;
RUN;

PROC LOGISTIC DATA=REFL2;
CLASS COHORT GENDER RACE DR_RHEUM DR_PRIMCARE;
MODEL COHORT = GENDER RACE AGE BMI_B BPIINTERF_B BPIPAIN_B CPFQ_B
      FIQ_B GAD7_B ISIX_B PHQ8_B PHYSICALSYMP_B SDS_B
      DR_RHEUM DR_PRIMCARE;
OUTPUT OUT=PS PREDICTED=PS;
RUN;
Before building a logistic model in SAS, we suggest examining the distribution of the intervention indicator at each level of the categorical variable to rule out the possibility of “complete separation” (or “perfect prediction”), which means that for subjects at some level of some categorical variable, they would all receive one intervention but not the other. Complete separation can occur for several reasons and one common example is when using several categorical variables whose categories are coded by indicators. When the logistic regression model is fit, the estimate of the regression coefficients βs is based on the maximum likelihood estimation, and MLEs under logistic regression modeling do not have a closed form. In other words, the MLE β̂ cannot be written as a function of Xi and Ti. Thus, the MLE of βs are obtained using some numerical analysis algorithms such as the Newton-Raphson method. However, if there is a covariate X that can completely separate the interventions, then the procedure will not converge in SAS. If PROC LOGISTIC was used, the following warning message will be issued.
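The check itself is a simple cross-tabulation: for each level of each categorical covariate, count how many subjects fall in each intervention group; a zero cell signals potential complete separation. A pure-Python sketch of that check (the toy data and variable names are hypothetical):

```python
from collections import Counter

def separation_check(records, covariate, treatment):
    """Return the levels of `covariate` whose subjects all received a
    single intervention -- a symptom of complete separation."""
    counts = Counter((r[covariate], r[treatment]) for r in records)
    levels = {r[covariate] for r in records}
    arms = {r[treatment] for r in records}
    # A level is suspect if some arm has zero subjects at that level
    return sorted(lvl for lvl in levels
                  if any(counts[(lvl, arm)] == 0 for arm in arms))

# Hypothetical toy data: every subject at level 'C' is in the opioid cohort
records = [
    {"race": "A", "cohort": "opioid"}, {"race": "A", "cohort": "other"},
    {"race": "B", "cohort": "other"},  {"race": "B", "cohort": "opioid"},
    {"race": "C", "cohort": "opioid"}, {"race": "C", "cohort": "opioid"},
]
print(separation_check(records, "race", "cohort"))  # level 'C' is flagged
```

In SAS, the equivalent check is a PROC FREQ cross-tabulation of each categorical covariate against the intervention indicator, looking for empty cells.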
WARNING: There is a complete separation of data points. The maximum likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.
Notice that SAS will continue to finish the computation despite issuing warning messages. However, the estimate of such βs are incorrect, and so are the estimated propensity scores. If after examining the intervention distribution at each level of the categorical variables complete separation is found, then efforts should be made to address this issue. One possible solution is to collapse the categorical variable causing the problem. That is, combine the different outcome categories such that the complete separation no longer exists.
Firth logistic regression
Another possible solution is to use Firth logistic regression. It uses a penalized likelihood estimation method. Firth bias-correction is considered an ideal solution to the separation issue for logistic regression (Heinze and Schemper, 2002). In PROC LOGISTIC, we can add an option to run the Firth logistic regression as shown in Program 2.
Program 2: Firth logistic regression
PROC LOGISTIC DATA=REFL2;
CLASS COHORT GENDER RACE DR_RHEUM DR_PRIMCARE;
MODEL COHORT = GENDER RACE DR_RHEUM DR_PRIMCARE BPIInterf_B BPIPain_B
      CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B / FIRTH;
OUTPUT OUT=PS PREDICTED=PS;
RUN;
Heinze G, Schemper M (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine 21.16: 2409-2419.
Propensity Score Estimation with PROC PSMATCH and PROC LOGISTIC was published on SAS Users.
The ROC curve is a graphical method that summarizes how well a binary classifier can discriminate between two populations, often called the "negative" population (individuals who do not have a disease or characteristic) and the "positive" population (individuals who do have it). As shown in a previous article, there is a theoretical model, called the binormal model, that describes the fundamental features in binary classification. The model assumes a set of scores that are normally distributed for each population, and the mean of the scores for the negative population is less than the mean of scores for the positive population. The figure to the right (which was discussed in the previous article) shows a threshold value (the vertical line) that a researcher can use to classify an individual as belonging to the positive or negative population, according to whether his score is greater than or less than the threshold, respectively.
In most applications, any reasonable choice of the threshold will misclassify some individuals. Members of the negative population can be misclassified, which results in a false positive (FP). Members of the positive population can be misclassified, which results in a false negative (FN). Correctly classified individuals are true negatives (TN) and true positives (TP).
Visualize the binary classification method
One way to assess the predictive accuracy of the classifier is to use the proportions of the populations that are classified correctly or are misclassified. Because the total area under a normal curve is 1, the threshold parameter divides the area into two proportions. It is instructive to look at how the proportions change as the threshold value varies. The proportions are usually called "rates." The four regions correspond to the True Negative Rate (TNR), False Positive Rate (FPR), False Negative Rate (FNR), and True Positive Rate (TPR).
For the binormal model, you can use the standard deviations of the populations to choose a suitable range for the threshold parameter. The following SAS DATA step uses the normal cumulative distribution function (CDF) to compute the proportion of each population that lies to the left and to the right of the threshold parameter for a range of values. These proportions are then plotted against the threshold parameters.
%let mu_N    = 0;    /* mean of Negative population */
%let sigma_N = 1;    /* std dev of Negative population */
%let mu_P    = 2;    /* mean of Positive population */
%let sigma_P = 0.75; /* std dev of Positive population */

/* TNR = True Negative Rate  = area to the left  of the threshold for the Negative pop
   FPR = False Positive Rate = area to the right of the threshold for the Negative pop
   FNR = False Negative Rate = area to the left  of the threshold for the Positive pop
   TPR = True Positive Rate  = area to the right of the threshold for the Positive pop */
data ClassRates;
do t = -3 to 4 by 0.1;   /* threshold cutoff value (could use mean +/- 3*StdDev) */
   TNR = cdf("Normal", t, &mu_N, &sigma_N);
   FPR = 1 - TNR;
   FNR = cdf("Normal", t, &mu_P, &sigma_P);
   TPR = 1 - FNR;
   output;
end;
run;

title "Classification Rates as a Function of the Threshold";
%macro opt(lab);
   name="&lab" legendlabel="&lab" lineattrs=(thickness=3);
%mend;
proc sgplot data=ClassRates;
   series x=t y=TNR / %opt(TNR);
   series x=t y=FPR / %opt(FPR);
   series x=t y=FNR / %opt(FNR);
   series x=t y=TPR / %opt(TPR);
   keylegend "TNR" "FNR" / position=NE location=inside across=1;
   keylegend "TPR" "FPR" / position=SE location=inside across=1;
   xaxis offsetmax=0.2 label="Threshold";
   yaxis label="Classification Rates";
run;
The graph shows how the classification and misclassification rates vary as you change the threshold parameter. A few facts are evident:
- Two of the curves are redundant because FPR = 1 - TNR and TPR = 1 - FNR. Thus, it suffices to plot only two curves. A common choice is to display only the FPR and TPR curves.
- When the threshold parameter is much less than the population means, essentially all individuals are predicted to belong to the positive population. Thus, the FPR and the TPR are both essentially 1.
- As the parameter increases, both rates decrease monotonically.
- When the threshold parameter is much greater than the population means, essentially no individuals are predicted to belong to the positive population. Thus, the FPR and TPR are both essentially 0.
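The same rate computations can be sketched outside of SAS. The following Python snippet (not from the article; function names are my own) mirrors the DATA step by using the standard normal CDF, written in terms of the standard library's error function, to compute the four rates for any threshold:

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma), via the stdlib error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

MU_N, SIGMA_N = 0.0, 1.0    # negative population
MU_P, SIGMA_P = 2.0, 0.75   # positive population

def rates(t):
    """Return (TNR, FPR, FNR, TPR) for threshold t under the binormal model."""
    tnr = norm_cdf(t, MU_N, SIGMA_N)   # area to the left of t, negative pop
    fnr = norm_cdf(t, MU_P, SIGMA_P)   # area to the left of t, positive pop
    return tnr, 1.0 - tnr, fnr, 1.0 - fnr

# rates sweep from ~1 down to ~0 as the threshold moves left to right
for t in (-3.0, 0.0, 1.0, 2.0, 4.0):
    tnr, fpr, fnr, tpr = rates(t)
    print(f"t={t:5.1f}  TNR={tnr:.3f}  FPR={fpr:.3f}  FNR={fnr:.3f}  TPR={tpr:.3f}")
```

The printed table reproduces the behavior in the graph: the FPR and TPR curves start near 1 for small thresholds and decrease monotonically toward 0.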
The ROC curve
The graph in the previous section shows the FPR and TPR as functions of t, the threshold parameter. Alternatively, you can plot the parametric curve ROC(t) = (FPR(t), TPR(t)), for t ∈ (-∞, ∞). Because the FPR and TPR quantities are proportions, the curve (called the ROC curve) is always contained in the unit square [0, 1] x [0, 1]. As discussed previously, as the parameter t → -∞, the curve ROC(t) → (1, 1). As the parameter t → ∞, the curve ROC(t) → (0, 0). The main advantage of the ROC curve is that it is independent of the scale of the population scores. In fact, the standard ROC curve does not display the threshold parameter. This means that you can compare the ROC curves from different models that might use different scores to classify the negative and positive populations.
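For the binormal model, the area under the ROC curve (AUC) has a known closed form: AUC = Φ((μ_P − μ_N) / √(σ_N² + σ_P²)). As a sanity check (this computation is mine, not from the article), the following Python sketch traces the parametric curve on a fine grid of thresholds and compares a trapezoid-rule estimate of the area with the closed form:

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma), via the stdlib error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

MU_N, SIGMA_N = 0.0, 1.0
MU_P, SIGMA_P = 2.0, 0.75

# trace ROC(t) = (FPR(t), TPR(t)) on a fine grid of thresholds
ts = [-6 + 0.001 * i for i in range(12001)]
pts = [(1 - norm_cdf(t, MU_N, SIGMA_N), 1 - norm_cdf(t, MU_P, SIGMA_P)) for t in ts]

# trapezoid-rule area under the parametric curve; points run from (1,1) to (0,0)
auc = 0.0
for (x1, y1), (x2, y2) in zip(pts, pts[1:]):
    auc += (x1 - x2) * (y1 + y2) / 2.0

# closed form for the binormal model
auc_exact = norm_cdf((MU_P - MU_N) / sqrt(SIGMA_N**2 + SIGMA_P**2))
print(f"numeric AUC = {auc:.4f}, closed form = {auc_exact:.4f}")
```

For these parameters both values are approximately 0.945, which quantifies how well the two populations can be discriminated.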
The following call to PROC SGPLOT creates an ROC curve for the binormal model by plotting the TPR (on the vertical axis) against the FPR (on the horizontal axis). The resulting ROC curve is shown to the right.
title "ROC Curve";
title2;
proc sgplot data=ClassRates aspect=1 noautolegend;
   series x=FPR y=TPR / lineattrs=(thickness=2);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;
   yaxis grid;
run;
The standard ROC curve does not display the values of the threshold parameter. However, for instructional purposes, it can be enlightening to plot the values of a few selected threshold parameters. An example is shown in the following ROC curve. Displaying the ROC curve this way emphasizes that each point on the curve corresponds to a different threshold parameter. For example, the point labeled t=1 corresponds to the cutoff value of 1, which is the vertical line through the binormal populations that is shown at the beginning of this article.
Interpretation of the ROC curve
The ROC curve shows the tradeoff between correctly classifying those who have a disease/condition and those who do not. For concreteness, suppose you are trying to classify people who have cancer based on medical tests. Wherever you place the threshold cutoff, you will make two kinds of errors: You will not identify some people who actually have cancer and you will mistakenly tell other people that they have cancer when, in fact, they do not. The first error is very bad; the second error is also bad but not life-threatening. Consider three choices for the threshold parameter in the binormal model:
- If you use the threshold value t=2, the previous ROC curve indicates that about half of those who have cancer are correctly classified (TPR=0.5) while misclassifying very few people who do not have cancer (FPR=0.02). This value of the threshold is probably not optimal because the test only identifies half of those individuals who actually have cancer.
- If you use t=1, the ROC curve indicates that about 91% of those who have cancer are correctly classified (TPR=0.91) while misclassifying about 16% of those who do not have cancer (FPR=0.16). This value of the threshold seems more reasonable because it detects most cancers while not alarming too many people who do not have cancer.
- As you decrease the threshold parameter, the detection rate only increases slightly, but the proportion of false positives increases rapidly. If you use t=0, the classifier identifies 99.6% of the people who have cancer, but it also mistakenly tells 50% of the non-cancer patients that they have cancer.
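The operating points quoted above can be verified numerically. The following Python sketch (my own check, not part of the article) computes the (FPR, TPR) pair for each of the three thresholds:

```python
from math import erf, sqrt

def norm_cdf(x, mu, sigma):
    """CDF of N(mu, sigma), via the stdlib error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def operating_point(t, mu_n=0.0, s_n=1.0, mu_p=2.0, s_p=0.75):
    """(FPR, TPR) at threshold t under the binormal model."""
    return 1 - norm_cdf(t, mu_n, s_n), 1 - norm_cdf(t, mu_p, s_p)

for t in (2, 1, 0):
    fpr, tpr = operating_point(t)
    print(f"t={t}: FPR={fpr:.3f}, TPR={tpr:.3f}")
# t=2: FPR ~ 0.023, TPR = 0.500
# t=1: FPR ~ 0.159, TPR ~ 0.909
# t=0: FPR = 0.500, TPR ~ 0.996
```

These values match the rounded numbers cited in the bullet points.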
In general, the ROC curve helps researchers to understand the trade-offs and costs associated with false positives and false negatives.
In summary, the binormal ROC curve illustrates fundamental features of the binary classification problem. Typically, you use a statistical model to generate scores for the negative and positive populations. The binormal model assumes that the scores are normally distributed and that the mean of the negative scores is less than the mean of the positive scores. With that assumption, it is easy to use the normal CDF function to compute the FPR and TPR for any value of a threshold parameter. You can graph the FPR and TPR as functions of the threshold parameter, or you can create an ROC curve, which is a parametric curve that displays both rates as the parameter varies.
The binormal model is a useful theoretical model and is more applicable than you might think. If the variables in the classification problem are multivariate normal, then any linear classifier results in normally distributed scores. In addition, Krzanowski and Hand (2009, p. 34-35) state that the ROC curve is unchanged by any monotonic increasing transformation of scores, which means that the binormal model applies to any set of scores that can be transformed to normality. This is a large set, indeed, since it includes the complete Johnson system of distributions.
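The invariance under monotone transformations is easy to demonstrate empirically. In the following Python sketch (my own illustration; the function and data are hypothetical), the set of (FPR, TPR) points obtained by thresholding a small labeled sample is identical for the raw scores and for the exponentiated scores, because the ROC curve depends only on the ranks of the scores:

```python
from math import exp

def empirical_roc(scores, labels):
    """Set of (FPR, TPR) points from thresholding at each observed score."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = set()
    for t in scores:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t)
        pts.add((fp / neg, tp / pos))
    return pts

scores = [0.2, -1.0, 1.5, 2.3, 0.9, 3.1, -0.4, 1.8]   # hypothetical scores
labels = [0,    0,   1,   1,   0,   1,    0,   1 ]    # 0 = negative, 1 = positive

roc_raw = empirical_roc(scores, labels)
roc_exp = empirical_roc([exp(s) for s in scores], labels)  # monotone transform
print(roc_raw == roc_exp)  # True: the ROC depends only on the ranks of the scores
```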
In practice, we do not know the distribution of scores for the population. Instead, we have to estimate the FPR and TPR by using collected data. PROC LOGISTIC in SAS can estimate an ROC curve for data by using a logistic regression classifier. Furthermore, PROC LOGISTIC can automatically create an empirical ROC curve from any set of paired observed and predicted values.
The purpose of this article is to show how to use SAS to create a graph that illustrates a basic idea in a binary classification analysis, such as discriminant analysis and logistic regression. The graph, shown at right, shows two populations. Subjects in the "negative" population do not have some disease (or characteristic) whereas individuals in the "positive" population do have it. There is a function (a statistical model) that associates a score with each individual, and the distribution of the scores is shown. A researcher wants to use a threshold value (the vertical line) to classify individuals. An individual is predicted to be negative (does not have the disease) or positive (does have the disease) according to whether the individual's score is lower than or higher than the cutoff threshold, respectively.
Unless the threshold value perfectly discriminates between the populations, some individuals will be classified correctly, and others will be classified incorrectly. There are four possibilities:
- A subject that belongs to the negative population might be classified as "negative." This is a correct classification, so this case is called a "true negative" (TN).
- A subject that belongs to the negative population might be classified as "positive." This is a wrong classification, so this case is called a "false positive" (FP).
- A subject that belongs to the positive population might be classified as "negative." This is a wrong classification, so this case is called a "false negative" (FN).
- A subject that belongs to the positive population might be classified as "positive." This is a correct classification, so this case is called a "true positive" (TP).
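For a finite sample, the four outcomes are simply tallies over the observations. The following Python sketch (my own illustration; the function name and data are hypothetical) counts them for the rule "predict positive when the score exceeds the threshold":

```python
def confusion_counts(scores, labels, threshold):
    """Tally (TN, FP, FN, TP) for the rule: score > threshold => positive."""
    tn = fp = fn = tp = 0
    for s, y in zip(scores, labels):
        pred = 1 if s > threshold else 0
        if   y == 0 and pred == 0: tn += 1   # true negative
        elif y == 0 and pred == 1: fp += 1   # false positive
        elif y == 1 and pred == 0: fn += 1   # false negative
        else:                      tp += 1   # true positive
    return tn, fp, fn, tp

scores = [-0.5, 0.3, 1.2, 2.1, 0.8, 2.6]   # hypothetical scores
labels = [  0,   0,   1,   1,   0,   1 ]   # 0 = negative, 1 = positive
print(confusion_counts(scores, labels, threshold=1.0))  # -> (3, 0, 0, 3)
```

Moving the threshold trades one kind of error for the other: raising it to 2.2 in this toy sample turns two true positives into false negatives.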
Typically, these concepts are visualized by using a panel of histograms based on a finite sample of data. However, this visualization uses the populations themselves, rather than data. In particular, this visualization assumes two normal populations, a situation that is called the binormal model for binary discrimination (Krzanowski and Hand, ROC Curves for Continuous Data, 2009, p. 31-35).
A graph of the populations
In building up any complex graph, it is best to start with a simpler version. A simple version enables you to develop and debug your program and to experiment with various visualizations. This section creates a version of the graph that does not have the threshold line or the four categories (TN, FP, FN, and TP).
The following DATA step uses the PDF function to generate the density curves for two populations. For this graph, the negative population is chosen to be N(0, 1) whereas the positive population is N(2, 0.75). By convention, the mean of the negative population is chosen to be less than the mean of the positive population. I could have used two separate DO loops to iterate over the X values for the distributions, but instead I used (temporary) arrays to store the parameters for each population.
/* 1. Create data for the negative and positive populations by using a
      binormal model. The data are in "long form." An indicator variable
      (Class) has the values "Negative" and "Positive." */
%let mu_N    = 0;    /* mean of Negative population */
%let sigma_N = 1;    /* std dev of Negative population */
%let mu_P    = 2;    /* mean of Positive population */
%let sigma_P = 0.75; /* std dev of Positive population */
data Binormal(drop=i);
array mu[2]    _temporary_ (&mu_N, &mu_P);
array sigma[2] _temporary_ (&sigma_N, &sigma_P);
array c[2] $   _temporary_ ("Negative", "Positive");
do i = 1 to 2;
   Class = c[i];
   do x = mu[i] - 3*sigma[i] to mu[i] + 3*sigma[i] by 0.05;
      pdf = pdf("Normal", x, mu[i], sigma[i]);
      output;
   end;
end;
run;
The first few observations are shown below:
Class        x       pdf
Negative   -3.00   .004431848
Negative   -2.95   .005142641
Negative   -2.90   .005952532
Because the data are in "long form," you can use PROC SGPANEL to create a basic graph. You need to use PANELBY Class to display the negative population in one graph and the positive population in another. Here are a few design decisions that I made for the visualization:
- SAS will assign default colors to the two populations, but you can use the STYLEATTRS statement to assign specific colors to each population curve.
- You can use the BAND statement to fill in the area under the population density curves.
- The SGPANEL procedure will display row headers or column headers to identify the positive and negative populations, but I used the NOHEADER option to suppress these headers and used the INSET statement to add the information in the upper left corner of each graph.
- Since this graph is a schematic diagram, I suppress the ticks and values on the axes by using the DISPLAY=(noticks novalues) option on the ROWAXIS and COLAXIS statements.
ods graphics / width=480px height=360px;
title "The 'Negative' and 'Positive' Populations";
proc sgpanel data=Binormal noautolegend;
   styleattrs datacolors=(SteelBlue LightBrown);
   panelby Class / layout=rowlattice onepanel noheader;
   inset Class / position=topleft textattrs=(size=14) nolabel;
   band x=x upper=pdf lower=0 / group=Class;
   series x=x y=pdf / lineattrs=(color=black);
   rowaxis offsetmin=0 display=(noticks novalues) label="Density";
   colaxis display=(noticks novalues) label="Score";
run;
Adding regions for true and false classifications
The previous section graphs the populations. To add the regions that are correctly and incorrectly classified by a given threshold value, you need to modify the DATA step that creates the density curves. In addition to the Class indicator variable (which has the values "Negative" and "Positive"), you need to add an indicator variable that has four values: "TN", "FP", "FN", and "TP". This second indicator variable will be used to assign colors for each region. The four regions depend on the value of the threshold parameter, which means that the DO loop that iterates over the X values should be split into two parts: the part less than the threshold and the part greater than the threshold. This is shown by the following:
%let cutoff = 1;   /* value of the threshold parameter */
data Binormal2(drop=i);
array mu[2]    _temporary_ (&mu_N, &mu_P);
array sigma[2] _temporary_ (&sigma_N, &sigma_P);
array c[2] $   _temporary_ ("Negative", "Positive");
array T[2, 2] $ _temporary_ ("TN", "FP", "FN", "TP");
do i = 1 to 2;
   Class = c[i];
   Type = T[i, 1];
   do x = mu[i] - 3*sigma[i] to &cutoff by 0.01;
      pdf = pdf("Normal", x, mu[i], sigma[i]);
      output;
   end;
   Type = T[i, 2];
   do x = &cutoff to mu[i] + 3*sigma[i] by 0.01;
      pdf = pdf("Normal", x, mu[i], sigma[i]);
      output;
   end;
end;
run;
The first few observations are shown below:
Class      Type     x       pdf
Negative    TN    -3.00   .004431848
Negative    TN    -2.99   .004566590
Negative    TN    -2.98   .004704958
The graph should display labels for the four regions. I will use the TEXT statement to place the labels. For the height of the labels, I will use 90% of the maximum height of the density curves. For the horizontal positions, I will offset the text by +/-50% of the standard deviation of the distribution. I use the STYLEATTRS statement to assign the colors to the four regions.
/* find the maximum height of the density curves */
proc means data=Binormal2 max noprint;
   var pdf;
   output out=OutLab max=Max;
run;

/* use +/- 0.5*StdDev to space the labels */
data label(drop=Max);
set OutLab(keep=Max);
y = 0.9 * Max;   /* 90% of the maximum height */
Class = "Negative";
x = &cutoff - 0.5*&sigma_N;  Text = "TN";  output;
x = &cutoff + 0.5*&sigma_N;  Text = "FP";  output;
Class = "Positive";
x = &cutoff - 0.5*&sigma_P;  Text = "FN";  output;
x = &cutoff + 0.5*&sigma_P;  Text = "TP";  output;
run;

data All;
set Binormal2 label;   /* concatenate the two data sets */
run;

ods graphics / width=480px height=360px;
title "Relationship Between Threshold and Classification";
proc sgpanel data=All noautolegend;
   styleattrs datacolors=(SteelBlue LightBlue Cream LightBrown);
   panelby Class / layout=rowlattice onepanel noheader;
   inset Class / position=topleft textattrs=(size=14) nolabel;
   band x=x upper=pdf lower=0 / group=Type;
   series x=x y=pdf / lineattrs=(color=black);
   refline &cutoff / axis=X;
   text x=x y=y text=Text / textattrs=(size=14);
   rowaxis offsetmin=0 display=(noticks novalues) label="Density";
   colaxis display=(noticks novalues) label="Score";
run;
The graph is shown at the top of this article.
In summary, this article shows how to create a graph that illustrates a fundamental relationship in the binary classification problem.