Nov 03 2016

Electronic health records (EHRs) and the overall advancement of information technology have produced a tsunami of data that must be stored, managed and used. Some had naively hoped that EHRs would bring a simpler, more streamlined industry. Instead, we’re finding that the delivery and management of health care is more […]

Four ways to continue the health care transformation was published on SAS Voices.

Nov 02 2016

Being an Eagle Scout, the data for good movement caught my attention. I wondered if I could apply my computer skills in a way that might help. How about showing people better ways to visualize HIV/AIDS data - that might help doctors better understand the data, and therefore better treat […]

The post Building a better HIV/AIDS map appeared first on SAS Learning Post.

Nov 02 2016

Ronald Snee and Roger Hoerl have written a book called Strategies for Formulations Development. It is intended to help scientists and engineers be successful in creating formulations quickly and efficiently.

The following tip is from this new book, which focuses on providing the essential information needed to successfully conduct formulation studies in the chemical, biotech and pharmaceutical industries:

Although most journal articles present mixture experiments and models that only involve the formulation components, most real applications also involve process variables, such as temperature, pressure, flow rate and so on. How should we modify our experimental and modeling strategies in this case? A key consideration is whether the formulation components and process variables interact. If there is no interaction, then an additive model, fitting the mixture and process effects independently, can be used:

c(x,z) = f(x) + g(z)     (1)

where f(x) is the mixture model and g(z) is the process variable model. Independent designs could also be used. However, in our experience, there is typically interaction between mixture and process variables. What should we do in this case? Such interaction is typically modeled by replacing the additive model in Equation 1 with a multiplicative model:

c(x,z) = f(x)*g(z)     (2)

Note that this multiplicative model is actually non-linear in the parameters. Most authors, including Cornell (2002), therefore suggest multiplying out the individual terms in f(x) and g(z) from Equation 2, creating a linear hybrid model. However, this tends to be a large model, since the number of terms in the linearized version of c(x,z) is the number in f(x) times the number in g(z). In Cornell’s (2002) famous fish patty experiment, there were three mixture variables (7 terms) and three process variables (8 terms), so the linearized c(x,z) had 7*8 = 56 terms, requiring a 56-run hybrid design.

Recent research by Snee et al. (2016) has shown that by considering hybrid models that are non-linear in the parameters, the number of terms required, and therefore the size of designs required, can be significantly reduced, often on the order of 50%. For example, if we fit Equation 2 directly as a non-linear model, then the number of terms to estimate is the number in f(x) plus the number in g(z); 7 + 8 = 15 in the fish patty case. Snee et al. (2016) showed using real data that this approach can often provide reasonable models, allowing use of much smaller fractional hybrid designs. We therefore recommend an overall sequential strategy involving initial use of fractional designs and non-linear models, but with the option of moving to linearized models if necessary.
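To make the term counts concrete, here is a small Python sketch that enumerates the terms for the fish patty case. The excerpt only gives the counts (7 and 8); the specific model forms used here (a special-cubic mixture model for f and a full factorial model with intercept for g) and the variable names x1..x3 and z1..z3 are assumptions chosen to reproduce those counts.

```python
from itertools import combinations

# Assumed names for the three mixture components and three process variables.
mixture = ["x1", "x2", "x3"]
process = ["z1", "z2", "z3"]

# Assumed special-cubic mixture model: 3 linear + 3 binary + 1 ternary blend = 7 terms.
f_terms = ["*".join(c) for r in (1, 2, 3) for c in combinations(mixture, r)]

# Assumed full factorial process model: intercept + 3 main effects
# + 3 two-factor + 1 three-factor interaction = 8 terms.
g_terms = ["1"] + ["*".join(c) for r in (1, 2, 3) for c in combinations(process, r)]

# Linearized hybrid model: every f term crossed with every g term.
linearized_terms = [f if g == "1" else f"{f}*{g}" for f in f_terms for g in g_terms]

# Non-linear (multiplicative) model: parameters of f plus parameters of g,
# counted the same way as in the text.
nonlinear_count = len(f_terms) + len(g_terms)

print(len(f_terms), len(g_terms), len(linearized_terms), nonlinear_count)
```

The crossed list grows as the product of the two model sizes, while the multiplicative form grows only as their sum, which is where the reduction in design size comes from.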

Continue reading »

Nov 02 2016

Occasionally on a discussion forum, a statistical programmer will ask a question like the following:

I am trying to fit a parametric distribution to my data. The sample has a long tail, so I have tried the lognormal, Weibull, and gamma distributions, but nothing seems to fit. Please help!!

In general, there is no reason to expect a particular distribution to fit arbitrary data. However, sometimes the person asking the question has a theoretical reason why the model should fit; for example, the data are supposed to be "lifetime" data that are part of a reliability or survival analysis.

For several situations that I can remember, the problem occurred because the data distribution was skewed to the left, whereas by convention the usual "named" distributions have positive skewness. The following sample of 50 data values provides an example:

data Sample;
input X @@;
datalines;
81 91 90 91 87 56 93 80 80 89 93 87 86 58 81 90 82 71 85 94 
86 79 82 89 87 87 96 76 77 91 87 67 93 84 90 88 78 92 87 86 
61 82 83 92 81 83 87 91 84 72 
;

proc univariate data=Sample;
   histogram X / endpoints=55 to 100 by 5
         odstitle="Distribution of Sample Data";
run;

Sample data that has negative skewness

The descriptive statistics from PROC UNIVARIATE (not shown) indicate that the sample skewness is about -1.5. The histogram confirms that the data distribution has negative skewness. Consequently, the lognormal, Weibull, and gamma distributions will not fit these data well.

A transformation that reverses the data distribution

You can transform the data so that the skewness is positive and the long tail is to the right. To do this correctly requires domain-specific knowledge, but the general idea is to apply a linear transformation of the form Y = c – b X for some constants c and b > 0. If you don't want to change the scale of the data, use b = 1.

For example, suppose that the data are the results of an assessment procedure that assigns a value between 0 (bad) and 100 (good) to each item on an assembly line. An alternative way to score each item is to record the number of points that are deducted by the assessment procedure. For the alternative scoring system, low scores are good and high scores are bad. The conversion between the scoring systems is simply Y = 100 – X. The following DATA step creates the new scores and overlays several parametric models that fit the new transformed data:

data Transform;
set Sample;
Y = 100 - X;
run;

proc univariate data=Transform;
   var Y;
   histogram / endpoints=0 to 45 by 5
         odstitle="Distribution of Reversed Data"
         lognormal(threshold=0)  Weibull(threshold=0)  gamma(threshold=0); 
run;

Transformed data has positive skewness

The transformed data has positive skewness. I used knowledge of the data measurements to choose reasonable values for the linear transformation that flips the data distribution. If you know nothing about the data, you could choose c to be any value greater than the maximum data value for X. That guarantees that the transformed data could be modeled by a distribution that has zero for the threshold parameter. Try to choose a transformation for which the new measurements are easy to understand; different values of c will lead to different estimates for the parameters.
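The effect of the reflection on skewness can be checked outside SAS as well. The following Python sketch uses the 50 sample values from the post and the same Y = 100 – X conversion; the helper name sample_skewness is illustrative, and the formula is the usual bias-adjusted sample skewness, which I believe matches what PROC UNIVARIATE reports by default.

```python
import math

# The 50 sample values from the post.
x = [81, 91, 90, 91, 87, 56, 93, 80, 80, 89, 93, 87, 86, 58, 81, 90, 82, 71, 85, 94,
     86, 79, 82, 89, 87, 87, 96, 76, 77, 91, 87, 67, 93, 84, 90, 88, 78, 92, 87, 86,
     61, 82, 83, 92, 81, 83, 87, 91, 84, 72]

def sample_skewness(values):
    """Bias-adjusted sample skewness (illustrative helper, not a SAS function)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                             # moment-based skewness
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)    # small-sample adjustment

# Reflect the data: Y = c - b*X with c = 100 and b = 1, as in the post.
y = [100 - v for v in x]

skew_x = sample_skewness(x)
skew_y = sample_skewness(y)
print(f"skewness of X: {skew_x:.2f}, skewness of Y: {skew_y:.2f}")
```

Because the transformation is a reflection with b = 1, the skewness changes sign but not magnitude, and every Y value stays positive because c = 100 exceeds the maximum of X.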

In summary, many standard modeling distributions (exponential, lognormal, gamma, Weibull, ...) assume that the data are positively skewed. If your data has negative skewness, try to use a linear transformation to reverse the data before you model it.

tags: Data Analysis

The post Sometimes you need to reverse the data before you fit a distribution appeared first on The DO Loop.

Nov 02 2016

SAS Community member @tc (a.k.a. Ted Conway) has found a new toy: ODS Graphics. Using PROC SGPLOT and GTL (Graph Template Language), along with some creative data prep steps, Ted has created several fun examples that show off what you can do with a bit of creativity, some math knowledge, and open data.

And bonus -- since most of his examples work with SAS University Edition, it's easy for you to try them yourself. Here are some of my favorites.

Learn to draw a Jack-O-Lantern

Using the GIF output device and free data from Math-Aids.com, Ted shows how to use GTL (PROC TEMPLATE and PROC SGRENDER) to animate this Halloween icon.

The United Polygons of America

Usually map charts with SAS require specialized procedures and map data, but here's a technique that can plot a stylized version of the USA and convey some interesting data. (You might have seen this one featured in a SAS Tech Report newsletter. Do you subscribe?)

A look at Katie Ledecky's dominance

Using a vector plot, Ted shows how this championship swimmer dominated her event during the summer games in Rio. This example contains a lot of text information too; and that's a cool trick in PROC SGPLOT with the AXISTABLE statement. Click on the image for a closer look.

Katie Ledecky dominates

Demonstrating the Bublé Sort

This example is nerdy on so many levels. It's a take on the Computer Science 101 concept of "bubble sort," an algorithm for placing a collection of items in a desired order. In this case, the items consist of Christmas songs recorded by Michael Bublé, that dreamy crooner from Canada.

See the songs sort things out

Ted posts these examples (and more) in the SAS/GRAPH and ODS Graphics section of SAS Support Communities. That's a great place to learn SAS graphing techniques, from simple to advanced, and to see what other practitioners are doing. Experts like Ted hang out there, and the SAS visualization developers often post answers to the tricky questions.

More from @tc

In addition to his community posts, Ted is an award-winning contributor to SAS Global Forum with some very popular presentations. Here are a few of his papers.

tags: ODS Graphics, SAS Communities, SGPLOT

The post Binge on this series: Fun with ODS Graphics appeared first on The SAS Dummy.

Nov 02 2016

With DataFlux Data Management 2.7, the major component of SAS Data Quality and other SAS Data Management solutions, every job has a REST API automatically created once moved to the Data Management Server. This is a great feature and enables us to easily call Data Management jobs from programming languages like Python. We can then involve the Quality Knowledge Base (QKB), a pre-built set of data quality rules, and do other Data Quality work that is impossible or challenging to do when using only Python.

In order to make a RESTful call from Python, we first need to get the REST API information for our Data Management job. The best way to get this information is to go to Data Management Server in your browser, where you’ll find respective links for:

  • Batch Jobs
  • Real-Time Data Jobs
  • Real-Time Process Jobs.

From here you can drill through to your job REST API.

Alternatively, you can use a “shortcut” to get the information by calling the job’s REST API metadata URL directly. The URL looks like this:

http://<DM Server>:<port>/<job type>/rest/jobFlowDefns/<job id>/metadata


The <job id> is simply the job name (with subdirectory and extension) Base64 encoded. This is a common method to avoid issues with illegal URL characters like: # % & * { } : < > ? / + or space. You can go to this website to Base64 encode your job name.

If you have many jobs on the Data Management Server it might be quicker to use the “shortcut” instead of drilling through from the top.

Here is an example of getting the REST API information for the Data Management job “ParseAddress.ddf”, which is in the subdirectory Demo of Real-Time Data Services on DM Server:


We Base64 encode the job name “Demo/ParseAddress.ddf” using the website mentioned above…


…and call the URL for the job’s REST API metadata:

http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/RGVtby9QYXJzZUFkZHJlc3MuZGRm/metadata
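The encoding step does not require a website; Python's standard base64 module can do it. The sketch below rebuilds the metadata URL for the example job. The function name job_metadata_url is illustrative, and the use of standard (not URL-safe) Base64 is an assumption based on the encoded value shown in the example.

```python
import base64

def job_metadata_url(server, port, job_type, job_path):
    """Build the REST metadata URL for a Data Management job (illustrative helper)."""
    # The job id is the job name, including subdirectory and extension, Base64 encoded.
    job_id = base64.b64encode(job_path.encode("utf-8")).decode("ascii")
    return f"http://{server}:{port}/{job_type}/rest/jobFlowDefns/{job_id}/metadata"

url = job_metadata_url("DMServer", 21036, "SASDataMgmtRTDataJob", "Demo/ParseAddress.ddf")
print(url)
```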


From here we collect the following information:

The REST API URL and Content-Type information…

…the JSON structure for input data


…which we need in this format when calling the Data Management job from Python:

{"inputs" : {"dataTable" : {"data" : [[ "sample string" ],[ "another string" ]], "metadata" : [{"maxChars" : 255, "name" : "Address", "type" : "string"}]}}}

…and the JSON structure for the data returned by the Data Management job.


When you have this information, the Python code to call the Data Management job would look like this:


The output data from the Data Management job will be in data_raw. We call the built-in JSON decoder from the “requests” module to move the output into a dictionary (data_out), from which we can access the data. The structure of the dictionary follows the REST metadata. We can access the relevant output data via data_out['outputs']['dataTable']['data'].
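A minimal sketch of such a call is shown below. It is not the original listing: the job URL and Content-Type must be taken from the job's REST metadata, so they are left as parameters here, and the use of an HTTP POST with the third-party requests library is an assumption. The names build_payload, extract_rows and call_job are illustrative; data_raw and data_out follow the naming used in the text.

```python
import json

def build_payload(addresses):
    """Wrap input strings in the JSON structure expected by the job."""
    return {"inputs": {"dataTable": {
        "data": [[a] for a in addresses],
        "metadata": [{"maxChars": 255, "name": "Address", "type": "string"}],
    }}}

def extract_rows(data_out):
    """Pull the output rows out of the decoded response dictionary."""
    return data_out["outputs"]["dataTable"]["data"]

def call_job(url, content_type, addresses):
    """Send the input data to the job and return its output rows.

    url and content_type come from the job's REST metadata; POST is assumed.
    """
    import requests  # third-party library, imported here so the helpers above stay stdlib-only

    data_raw = requests.post(
        url,
        data=json.dumps(build_payload(addresses)),
        headers={"Content-Type": content_type},
    )
    data_raw.raise_for_status()   # stop early on an HTTP error
    data_out = data_raw.json()    # built-in JSON decoder of the response object
    return extract_rows(data_out)

payload = build_payload(["sample string", "another string"])
print(json.dumps(payload))
```

Keeping the payload construction separate from the network call makes it easy to check the JSON against the job metadata before sending anything to the server.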



The Python program will produce an output like this…


You can find more information about the DataFlux Data Management REST API here.

Calling Data Management jobs from Python is straightforward and is a convenient way to augment your Python code with the more robust set of Data Quality rules and capabilities found in the SAS Data Quality solution.

Learn more about SAS Data Quality.

tags: data management, DataFlux Data Management Studio, open source, REST API

Calling SAS Data Quality jobs from Python was published on SAS Users.

Nov 02 2016

Leads are the lifeblood of any sales effort. But not all leads are created equal. Some have a high value for an organization and represent a realistic opportunity to win business. Others are early-stage engagements that take months or years of development.

Because of this disparity, the question “What is a lead?” puzzles many organizations. Sales and marketing groups have worked for years to formalize the definition of a lead and what it means within an existing business model. Regardless of your definition, one thing is consistent – marketing has to adapt its strategies to bring in more, better, or just different mixes of leads. The key question is: “How do you get there?”

The challenge

Over the years, the SAS marketing organization built a complex method of passing leads from marketing to sales. The process was similar to what other companies have in place; that is, leads that met a set of rules were qualified and then sent to a salesperson to follow up. The system was effective but difficult to manage, especially when business needs changed.

To build a new model to score and qualify leads, the marketing team looked at existing data and then conferred with their counterparts in sales to reorient the lead management process to accomplish two main goals:

  • Increase the number and percentage of leads that convert to opportunities. This meant identifying the best leads and finding a faster way to pass more highly qualified leads to sales.
  • Improve the outcomes from the lead conversion process. Obviously, high-quality leads are essential to creating a larger pipeline of deals. The team needed a better way to score, and then prioritize, leads.

An added wrinkle was that the project had to be global. For example, a lead in Australia would have the same meaning as a lead in Germany. That way, the company could compare lead performance across geographies and fuel global decisions about what strategies would be more effective.

The approach

While the previous rules-based model was geared more toward quantity, the team opted for a model-based approach to lead scoring that emphasized quality based on likely outcomes. The team developed an analytics-driven model that could evaluate the range of customer behaviors (registrations, website page views, e-mail clicks, and so on) to identify the best leads.

Beyond the quality-versus-quantity discussion, the sales and marketing teams agreed that the timing of the lead handoff to sales was also important. To accomplish this effectively, the model evaluated many behaviors, and once certain criteria were met, the information was added to the customer relationship management (CRM) system. To improve the lead conversion process, the team also focused on converting more sales-ready leads. Not only did the new scoring model evaluate more behavioral data, but that information was passed on as a “digital footprint” for each lead. The salesperson can see interactions for the lead from within the CRM system, giving her important information to guide her initial outreach.

Additionally, the team decided not to send all leads to the CRM system. Because the model does a better job of classifying better leads, those that aren’t routed to sales go to a lead-nurturing program, where the contact receives a cadence of relevant e-mails. The contact’s behavior when receiving those e-mails (click-thrus, registrations, website visits, etc.) is fed back into the model.
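The routing logic described above can be sketched in a few lines of Python. Everything specific here is hypothetical: the post does not say what kind of model SAS used, so a logistic score is assumed, and the weights, intercept, threshold and behavior names are made up for illustration.

```python
import math

# Hypothetical model parameters; a real model would be fit to historical conversions.
WEIGHTS = {"registrations": 0.9, "page_views": 0.05, "email_clicks": 0.3}
INTERCEPT = -3.0
THRESHOLD = 0.5  # hypothetical cut-off for handing a lead to sales

def score_lead(behaviors):
    """Return an assumed logistic score in (0, 1) from behavior counts."""
    z = INTERCEPT + sum(w * behaviors.get(name, 0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def route_lead(behaviors):
    """Send high-scoring leads to the CRM system, the rest to lead nurturing."""
    return "crm" if score_lead(behaviors) >= THRESHOLD else "nurture"

cold = {}
warm = {"registrations": 3, "page_views": 20, "email_clicks": 4}
print(route_lead(cold), route_lead(warm))
```

As a nurtured contact clicks through e-mails or registers for events, the behavior counts grow, the score is recomputed, and the lead can cross the threshold later.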

The results

When the lead-scoring model was still in the early stages, the initial feedback was positive. Salespeople appreciated that the leads were more qualified and reliable. Rather than sifting through dozens of contacts, they know that leads indicate an interest in SAS and its solutions. That was once a luxury for a salesperson. Now, it’s an everyday reality.

To fine tune the model, analysts track the total number of leads passed to sales and the number of leads that convert to opportunities. The marketing team wants to make sure rates continue to rise for both numbers. If there is a plateau or a decline, the analysts receive rapid feedback and can adjust programs as necessary.

SAS marketing analysts can also fine-tune the model as sales requirements change or the market evolves. The model is more flexible than the rules-based approach, allowing the team to rapidly adjust strategies. The team can adjust the lead conversion rate if there is a shift in internal focus or if a sales group has an increase or decrease in capacity.

How SAS can help

We've created a practical ebook on modernizing a marketing organization with marketing analytics: Your guide to modernizing the marketing organization.

SAS Customer Intelligence 360 enables the delivery of contextually relevant emails, ensuring their content is personalized and timely. Emails sent with SAS Customer Intelligence 360 are backed by segmentation, analytics and scoring behind the scenes to help ensure messaging matches the customer journey.

Whether you're just getting started or want to add new skills, we offer a variety of free tutorials and other training options: Learn SAS Customer Intelligence 360


Editor’s note: This post is part of a series excerpted from Adele Sweetwood’s book, The Analytical Marketer: How to Transform Your Marketing Organization. Each post is a real-world case study of how to improve your customers’ experience and optimize your marketing campaigns.

tags: customer journey, email marketing, lead scoring, marketing analytics, marketing campaigns, sales leads, SAS Customer Intelligence 360, segmentation, The Analytical Marketer

Scoring leads to drive more effective sales was published on Customer Intelligence.

Nov 01 2016

When designing an experiment, a common diagnostic is the statistical power of effects. Bradley Jones has written a number of blog posts on this very topic. In essence, what is the probability that we can detect non-negligible effects given a specified model? Of course, there is a set of assumptions/specifications needed in order to do this, such as effect sizes, error, and significance level of tests. I encourage you to read some of those previous blog posts if you’re unfamiliar with the topic.

If our response is continuous, and we are assuming a linear regression model, we can use results from the Power Analysis outline under Design Evaluation. However, what if our response is based on pass/fail data, where we are planning to do 10 trials at each experimental run? For this response, we can fit a logistic regression model, but we cannot use the results in the Design Evaluation outline. Nevertheless, we’re still interested in the power...

What to do?
We could do a literature review to see about estimating the power, and hope to find something that applies (and do so for each specific case that comes up in the future). But it is more straightforward to run a Monte Carlo simulation. To do so, we need to be able to generate responses according to a specified logistic regression model. For each of these generated responses, we fit the model and, for each effect, check if the p-value falls below a certain threshold (say 0.05). This has been possible in previous versions of JMP using JSL, but requires a certain level of comfort with scripting and in particular scripting formulas and extracting information from JMP reports. Also, you need to find the time to write the script. In JMP Pro 13, you can now perform Monte Carlo simulations with just a few mouse-clicks.
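The simulate, fit and check loop can be sketched in plain Python for a deliberately simplified case: a single two-level factor instead of four. With one factor, the logistic fit reduces to comparing two pooled binomial proportions, so a likelihood-ratio chi-square test can be computed in closed form. This is a stand-in for the full model fit, not what JMP does internally, and the coefficient values below are assumptions.

```python
import math
import random

def max_loglik(successes, trials):
    """Binomial log-likelihood at its maximum (0*log 0 treated as 0)."""
    ll = 0.0
    for k in (successes, trials - successes):
        if k > 0:
            ll += k * math.log(k / trials)
    return ll

def lr_pvalue(s_lo, s_hi, n):
    """Likelihood-ratio p-value for 'no factor effect' (chi-square, 1 df)."""
    full = max_loglik(s_lo, n) + max_loglik(s_hi, n)   # separate proportion per level
    null = max_loglik(s_lo + s_hi, 2 * n)              # one pooled proportion
    stat = max(0.0, 2.0 * (full - null))
    return math.erfc(math.sqrt(stat / 2.0))            # P(chi-square with 1 df > stat)

def simulate_power(b0, b1, runs_per_level=4, trials_per_run=10,
                   n_sim=2500, alpha=0.05, seed=1):
    """Empirical power for the factor effect in logit(p) = b0 + b1*x, x = -1 or +1."""
    rng = random.Random(seed)
    n = runs_per_level * trials_per_run                # trials pooled per factor level
    p_lo = 1.0 / (1.0 + math.exp(-(b0 - b1)))
    p_hi = 1.0 / (1.0 + math.exp(-(b0 + b1)))
    rejections = 0
    for _ in range(n_sim):
        s_lo = sum(rng.random() < p_lo for _ in range(n))
        s_hi = sum(rng.random() < p_hi for _ in range(n))
        if lr_pvalue(s_lo, s_hi, n) < alpha:
            rejections += 1
    return rejections / n_sim

print("active factor:  ", simulate_power(b0=0.0, b1=1.0))
print("inactive factor:", simulate_power(b0=0.0, b1=0.0))
```

For an inactive factor (b1 = 0) the rejection rate should sit near the chosen alpha, which is a quick sanity check on the simulation itself.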

That sounds awesome
The first time I saw the new one-click simulate, I was ecstatic, thinking of the possible uses with designed experiments. A key element needed to use the one-click simulate feature is a column containing a formula with a random component. If you read my previous blog post on the revamped Simulate Responses in DOE, then you know we already have a way to generate such a formula without having to write it ourselves.

1. Create the Design, and then Make Table with Simulate Responses checked
In this example, we have four factors (A-D), and plan an eight-run experiment. I’ll assume that you’re comfortable using the Custom Designer, but if not, you can read about the Custom Design platform here. This example can essentially be set up the same way as an example in our documentation.

Before you click the Make Table button, you need to make sure that Simulate Responses has been selected from the hotspot at the top of the Custom Design platform.


2. Set up the simulation
Once the data table is created, we now have to set up our simulation via the Simulate Response dialog described previously. Under Distribution, we select Binomial and set N to 10 (i.e., 10 trials for each row of the design). Here, I’ve chosen a variety of coefficients for A-D, with factor D having a coefficient of 0 (i.e., that factor is inactive). The Simulate Response dialog I will use is:


Clicking the Apply button, we get a Y Simulated column simulating the number of successes out of 10 trials, and a column indicating the number of trials (which is used in Fit Model). For modeling purposes, I copied the Y Simulated column into Y.


If we look at the formula for Y Simulated, we see that it can generate a response vector based on the model given in the Simulate Responses dialog.
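In outline, such a formula computes a linear predictor for each row, converts it to a probability with the inverse logit, and draws a binomial count. The Python sketch below mirrors that idea. The coefficient values and the 8-run design (a half fraction with D = A*B*C) are assumptions, since the actual dialog values and the Custom Design table are not reproduced here; only D's coefficient of 0 and N = 10 come from the text.

```python
import math
import random
from itertools import product

def half_fraction_design():
    """Assumed 8-run design for factors A-D at levels -1/+1, with D = A*B*C."""
    return [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

def simulate_response(design, intercept, coefs, n_trials=10, rng=None):
    """Draw one simulated success count per design row from a logistic model."""
    rng = rng or random.Random()
    counts = []
    for row in design:
        eta = intercept + sum(b * x for b, x in zip(coefs, row))   # linear predictor
        p = 1.0 / (1.0 + math.exp(-eta))                           # inverse logit
        counts.append(sum(rng.random() < p for _ in range(n_trials)))  # binomial draw
    return counts

design = half_fraction_design()
# Hypothetical coefficients for A, B, C, D; D is inactive (0) as in the text.
y_simulated = simulate_response(design, intercept=0.5, coefs=(1.0, 0.8, 0.5, 0.0),
                                rng=random.Random(2016))
print(y_simulated)
```

Each call produces a new response vector, which is what a simulation needs in order to refit the model many times.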


3. Fit the Model
Now that we have a formula for simulating responses, we need to set up the modeling for the simulated responses. In this case, we want to collect p-values for the effects from repeated logistic regression analyses on the simulated responses. We first need to do the analysis for a single response. If we launch the Fit Model platform, we can add the number of trials to the response role (Y), and change the Personality to Generalized Linear Model with a Binomial Distribution. My Fit Model launch looks like this:


Click the Run button to fit the model. The section of the report that we’re interested in is the Parameter Estimates outline.


For the initial simulated response, A, B, and C were found to be active, and D was not (which is correct). Of course, this is just for a single simulation. We could keep simulating a new response vector and keep track of these p-values for each effect, or we could use one-click simulate and let it do this for us.

4. Run the simulation
The column we’re interested in for this blog post is the Prob>ChiSq. We right-click on that column to bring up the menu, and (if you have JMP Pro), at the bottom, above bootstrap, we see an option for Simulate.


The dialog that pops up has a choice for Column to Switch Out and a choice for Column to Switch In. For our simulations, instead of using the response Y, we want to use Y Simulated, as it contains the formula with the Random Binomial. Instead of using Y when we first used Fit Model, we could have used Y Simulated and switched it out with itself. The Number of Samples refers to how many times to simulate a response. Here I’ve left it at 2500.


Now we just click OK, and let it run. After a short wait, we’re presented with a data table containing a column for each effect from the Fit Model dialog (as well as a simulation ID, SimID), and 2501 rows – the first is the original fit, and marked as excluded, while each other row corresponds to the results from one of our 2500 simulated responses. The values are the p-values for each effect from Fit Model.


The one-click Simulate has also pre-populated the data table with a distribution script, and, because it recognizes the results are p-values, another script called Power Analysis. Running the Power Analysis script provides a distribution of the p-values for each effect, as well as a summary titled Simulated Power with the rejection rate at different levels of alpha. For example, if we look at the effect of factor B, we see that at alpha = 0.05, 2103 times out of 2500 the null hypothesis of no effect was rejected for a rejection rate (empirical power) of about 84%.
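The rejection rate is simply the fraction of simulated p-values that fall below alpha. The Python sketch below shows that calculation and adds a binomial standard error as a rough measure of Monte Carlo uncertainty; the standard error is an addition of mine and is not something the text says JMP reports. The helper names are illustrative.

```python
import math

def rejection_rate(p_values, alpha):
    """Fraction of simulated p-values below alpha (empirical power)."""
    return sum(p < alpha for p in p_values) / len(p_values)

def monte_carlo_se(rate, n_sim):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_sim)

# Factor B from the text: 2103 rejections out of 2500 simulations at alpha = 0.05.
power_b = 2103 / 2500
se_b = monte_carlo_se(power_b, 2500)
print(f"empirical power: {power_b:.4f}, Monte Carlo SE: {se_b:.4f}")
```

With 2500 simulations the standard error is well under one percentage point, so the reported 84% is stable enough for comparing candidate designs.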


I typically right-click on one of the Simulated Power tables and select Make Combined Data Table. This creates a data table containing the rejection rates for each term at the four different alpha levels, which makes it easier to view the results in Graph Builder, such as the results for alpha = 0.05.


Now we can see that we have high power to detect the effects for A and B (recall that they had the largest coefficients), while C and the intercept are around 50%. Since D was inactive in our simulation, the rejection rate is around 0.05, as we would expect. We may be concerned with 50% power, assuming that the effect size for C is correct. With the ease of being able to perform these simulations, it’s simple to go back to Simulate Responses and change the number of trials for each row of the design before running another one-click simulation. Likewise, we could create a larger design to see how that affects the power. We could even try modeling using generalized regression with a binomial distribution.

Final thoughts
To me, a key aspect of this new feature is that it allows you to go through different “what if” scenarios with ease. This is especially true if you are in a screening situation, where it’s not unusual to be using model selection techniques when analyzing data. Now you can have empirical power calculations that match the analysis you plan to use, and help alert you to pitfalls that can arise during analysis. While this was possible prior to JMP 13, I typically didn’t find the time to create a custom formula each time I was considering a design. In the short time I’ve been using the one-click simulate, the ease with which I can create the formula and run the simulations has led me to insights I would not have gleaned otherwise, and has become an important tool in my toolbox.

tags: Design of Experiments (DOE), JMP 13, Power Analysis, Simulate Responses, Statistics

The post Empirical power calculations for designed experiments with 1-click simulate in JMP 13 appeared first on JMP Blog.

Nov 01 2016

What would happen if we could ask any type of scientific or clinical question about patients, and then go out and find the data to answer our questions? With "real-world data," we can do just that. Real-world data is all medicinal product data that comes from real-life patients. In contrast, […]

Using real-world data from real-life patients was published on SAS Voices.