Joyce Norris-Montanari poses the question: Is Hadoop/big data technology actually ready for MDM?

The post Is big data just a source? Or is Hadoop ready for MDM? appeared first on The Data Roundtable.

May 04, 2017


While attending SAS Global Forum, a user asked me about creating a map with a zoomed inset map. This is a topic many users might be interested in, so I decided to create an example and share it. But first, I had to decide which map to use. I thought [...]

The post Creating maps with a zoomed inset, in SAS appeared first on SAS Learning Post.

May 03, 2017

I recently saw an interesting data visualization on the flowingdata website, which analyzed & compared the causes of fatal crashes in the US, by month and time-of-day. At first I thought it was a really cool visualization, but after I studied it a while, I realized that I had misinterpreted [...]

The post When do fatal crashes happen? appeared first on SAS Learning Post.

May 03, 2017

If a financial analyst says it is "likely" that a company will be profitable next year, what probability would you ascribe to that statement? If an intelligence report claims that there is "little chance" of a terrorist attack against an embassy, should the ambassador interpret this as a one-in-a-hundred chance, a one-in-ten chance, or some other value?

Analysts often use vague statements like "probably" or "chances are slight" to convey their belief that a future event will or will not occur. Government officials and policy-makers who read reports from analysts must interpret and act on these vague statements. If the reader of a report interprets a phrase differently than the writer intended, that misunderstanding can lead to bad decisions.

In the book *Psychology of Intelligence Analysis* (Heuer, 1999), the author presents "the results of an experiment with 23 NATO military officers accustomed to reading intelligence reports. They were given a number of sentences such as: "It is *highly unlikely* that ...." All the sentences were the same except that the verbal expressions of probability changed. The officers were asked what percentage probability they would attribute to each statement if they read it in an intelligence report."

The results are summarized in the adjacent dot plot from Heuer (Chapter 12), which summarizes how the officers assess the probability of various statements. The graph includes a gray box for some statements. The box is not a statistical box plot. Rather it indicates the probability range according to a nomenclature proposed by Kent (1964), who tried to get the intelligence community to agree that certain phrases would be associated with certain probability ranges.

For some statements (such as "better than even" and "almost no chance") there was general agreement among the officers. For others, there was large variability in the probability estimates. For example, many officers interpreted "probable" as approximately a 75% chance, but quite a few interpreted it as less than 50% chance.

The results of this experiment are interesting on many levels, but I am going to focus on the visualization of the data. I do not have access to the original data, but this experiment was repeated in 2015 when the user "Zonination" got 46 users on Reddit (who were not military experts) to assign probabilities to the statements. His visualization of the resulting data won a 2015 Kantar Information is Beautiful Award. The visualization uses box plots to show the schematic distribution and overlays the 46 individual estimates by using a jittered, semi-transparent, scatter plot. The Zonination plot is shown at the right (click to enlarge). Notice that the "boxes" in this second graph are determined by quantiles of the data, whereas in the first graph they were theoretical ranges.

I decided to remake Zonination's plot by using PROC SGPLOT in SAS. I made several modifications that improve the readability and clarity of the plot.

- I sorted the categories by the median probability. The median is a robust estimate of the "consensus probability" for each statement. The sorted categories indicate the relative order of the statements in terms of perceived likelihood. For example, an "unlikely" event is generally perceived as more probable than an event that has "little chance." For details about sorting the variables in SAS, see my article about how to sort variables by a statistic.
- I removed the colors. Zonination's rainbow-colored chart is aesthetically pleasing, but the colors do not add any new information about the data. However, the colors help the eye track horizontally across the graph, so I used alternating bands to visually differentiate adjacent categories. You can create color bands by using the COLORBANDS= option in the YAXIS statement.
- To reduce overplotting of markers, I used systematic jittering instead of random jittering. In random jittering, each vertical position is randomly offset. In systematic (centered) jittering, the markers are arranged so that they are centered on the "spine" of the box plot. Vertical positions are changed only when the markers would otherwise overlap. You can use the JITTER option in the SCATTER statement to systematically jitter marker positions.
- Zonination's plot displays some markers twice, which I find confusing. Outliers are displayed once by the box plot and a second time by the jittered scatter plot. In my version, I suppress the display of outliers by the box plot by using the NOOUTLIERS option in the HBOX statement.
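The centered-jitter scheme in the last bullet is easy to sketch outside of SAS. The following Python function is an illustration of the general idea only, not SGPLOT's actual algorithm; the bin width, offset size, and function name are arbitrary choices for this sketch. Markers that would overlap (same rounded value) are stacked symmetrically about the category's spine in the order 0, +1, -1, +2, -2, and so on.

```python
from collections import Counter

def centered_jitter(values, bin_width=1.0, dy=0.05):
    """Assign a vertical offset to each marker so that markers which
    would overlap (same rounded value) are stacked symmetrically
    around y=0.  A sketch of the centered-jitter idea, not SGPLOT's
    internal algorithm."""
    bins = Counter()
    offsets = []
    for v in values:
        b = round(v / bin_width)   # markers in the same bin would overlap
        k = bins[b]                # markers already placed in this bin
        # stacking order around the spine: 0, +1, -1, +2, -2, ...
        step = (k + 1) // 2
        sign = 1 if k % 2 == 1 else (-1 if k > 0 else 0)
        offsets.append(sign * step * dy)
        bins[b] += 1
    return offsets
```

Because positions are perturbed only on collision, isolated markers stay exactly on the spine, which is what makes patterns such as rounding to 5% visible.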

You can download the SAS code that creates the data, sorts the variables by median, and creates the plot. The following call to PROC SGPLOT shows the HBOX and SCATTER statements that create the plot:

title "Perceptions of Probability";
proc sgplot data=Long noautolegend;
   hbox _Value_ / category=_Label_ nooutliers nomean nocaps;
   scatter x=_Value_ y=_Label_ / jitter transparency=0.5
                  markerattrs=GraphData2(symbol=circlefilled size=4);
   yaxis reverse discreteorder=data labelpos=top labelattrs=(weight=bold)
         colorbands=even colorbandsattrs=(color=gray transparency=0.9)
         offsetmin=0.0294 offsetmax=0.0294; /* half of 1/k, where k=number of categories */
   xaxis grid values=(0 to 100 by 10);
   label _Value_ = "Assigned Probability (%)" _Label_ = "Statement";
run;

The graph indicates that some responders either didn't understand the task or intentionally gave ridiculous answers. Of the 17 categories, nine contain extreme outliers, such as assigning certainty (100%) to the phrases "probably not," "we doubt," and "little chance." However, the extreme outliers do not affect the statistical conclusions about the distribution of probabilities because box plots (which use quartiles) are robust to outliers.

The SAS graph, which uses systematic jittering, reveals a fact about the data that was hidden in the graphs that used random jittering: most of the data values are multiples of 5%. Although a few people responded with values such as 88.7%, 1%, or 3%, most values (about 80%) are rounded to the nearest 5%. For the phrases "likely" and "we believe," 44 of 46 responses (96%) were multiples of 5%. In contrast, for the phrase "almost no chance," only 18 of 46 responses (39%) were multiples of 5%, because many responses were 1%, 2%, or 3%.
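A claim like "about 80% of values are rounded to the nearest 5%" is easy to check mechanically. A minimal Python sketch (the sample responses below are made up for illustration; they are not the Reddit data):

```python
def share_multiple_of_5(responses):
    """Return the fraction of probability responses (in percent)
    that are exact multiples of 5."""
    hits = sum(1 for r in responses if r % 5 == 0)
    return hits / len(responses)

# hypothetical responses to one phrase
sample = [5, 10, 88.7, 1, 3, 95, 100, 50, 75, 20]
```

Running the same tally per phrase is how the "44 of 46" and "18 of 46" counts above can be reproduced from the raw data.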

As with the military officers in the original study, there is considerable variation in the way that the Reddit users assign a probability to certain phrases. It is interesting that some phrases (for example, "We believe," "Likely," and "Probable") have the same median value but wildly different interquartile ranges. For clarity, speakers and writers should use phrases that have small variation or (even better!) provide their own assessment of probability.

Does something about this perception study surprise you? Do you have an opinion about the best way to visualize these data? Leave a comment.

The post Perceptions of probability appeared first on The DO Loop.

May 03, 2017

The list of SAS credentials keeps growing every year, as more and more SAS users want to validate their application of SAS skills in different business topic areas, such as SAS Data Management, SAS Administration, and more. The field of Big Data is no exception, and the SAS Global Certification [...]

The post Do SAS Big Data credentials equal big professional value? appeared first on SAS Learning Post.

May 03, 2017

.@philsimon chimes in on some oft-overlooked differences.

The post Operational vs. analytical MDM appeared first on The Data Roundtable.

May 02, 2017

For many years the humble spreadsheet has held many different roles and responsibilities supporting finance, marketing, sales -- pretty much every department in your business. There's always someone with a “magic spreadsheet,” but how effective is this culture that always uses the same format to consume data? My view of [...]

The spreadsheet: Friend or foe? was published on SAS Voices by Tim Clark

May 01, 2017

How transferable are features in deep neural networks?
https://arxiv.org/abs/1411.1792

TensorFlow CNN for fast style transfer
https://github.com/lengstrom/fast-style-transfer

https://github.com/HappyShadowWalker/ChineseTextClassify
Chinese text classification, using the Sogou text-classification corpus

https://lukeoakdenrayner.wordpress.com/2017/04/24/the-end-of-human-doctors-understanding-medicine/
The End of Human Doctors – Understanding Medicine

Machine Learning in Science and Industry slides
http://arogozhnikov.github.io/2017/04/20/machine-learning-in-science-and-industry.html
https://github.com/yandexdataschool/MLAtGradDays

all the available code repos for the NIPS 2016's top papers
https://www.reddit.com/r/MachineLearning/comments/5hwqeb/project_all_code_implementations_for_nips_2016/

Best Practices for Applying Deep Learning to Novel Applications
https://arxiv.org/abs/1704.01568v1

Medical Image Analysis with Deep Learning
https://medium.com/@taposhdr/medical-image-analysis-with-deep-learning-i-23d518abf531

https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf

Chinese Rumor Database http://rumor.thunlp.org/

May 01, 2017

A frequently asked question on SAS discussion forums concerns randomly assigning units (often patients in a study) to various experimental groups so that each group has approximately the same number of units. This basic problem is easily solved in SAS by using PROC SURVEYSELECT or a DATA step program.
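To see why the basic problem is easy, here is the idea in a few lines of Python (an illustration of balanced randomization only; the SAS solutions are PROC SURVEYSELECT or a DATA step, and the function name here is made up): shuffle the units, then deal them out round-robin so that group sizes differ by at most one.

```python
import random

def random_groups(units, k, seed=None):
    """Randomly assign units to k groups whose sizes differ by at
    most 1: shuffle a copy of the units, then deal them out
    round-robin to the k groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]
```

Because every permutation of the units is equally likely, each unit has the same chance of landing in any group, which is the property a randomized study needs.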

A more complex problem is when the researcher wants the distribution of a covariate to be approximately equal for each group. For example, a medical researcher might want each group of patients to have approximately the same mean and variance for their cholesterol levels, as shown in the plot to the right. This second problem is much harder because it involves distributing the units (non-randomly) so that the groups satisfy some objective. Conceptually, you want an assignment that minimizes the difference between moments (mean, variance, skewness,...) of the subgroups.
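One way to build intuition for this objective is a simple greedy heuristic, sketched below in Python. To be clear, this is not the optimal-design algorithm that PROC OPTEX uses; it is a toy method that tends to equalize group means of the covariate: sort the units by the covariate in descending order and assign each one to the non-full group with the smallest running total.

```python
def greedy_balance(values, k):
    """Toy heuristic for covariate-balanced assignment: sort covariate
    values in descending order and give each one to the group with the
    smallest current sum, among groups that are not yet full.  Tends
    to equalize group means; PROC OPTEX instead solves an
    optimal-design problem."""
    n = len(values)
    cap = -(-n // k)                  # ceiling(n/k): max units per group
    groups = [[] for _ in range(k)]
    totals = [0.0] * k
    for v in sorted(values, reverse=True):
        # pick the non-full group with the smallest running sum
        i = min((g for g in range(k) if len(groups[g]) < cap),
                key=lambda g: totals[g])
        groups[i].append(v)
        totals[i] += v
    return groups
```

A heuristic like this balances means reasonably well but does nothing explicit about variances or higher moments, which is exactly the gap that the optimal-design approach below fills.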

Creating an experimental design that optimizes some criterion is one of the many uses of the OPTEX procedure in SAS/QC software. In fact, this problem is one of the examples in the PROC OPTEX documentation. The example in this blog post follows the documentation example; see the documentation for details.

To solve this assignment problem with PROC OPTEX, you need to specify three things:

- A data set (call it UNITS) that contains the individual units (often patients) that you want to assign to the treatments. For this example, the units are the living patients in the Sashelp.Heart data set.
- A data set (call it TREATMENTS) that contains the names of the treatment groups. This example uses five groups with values 1, 2, 3, 4, and 5.
- The covariate in the problem. The OPTEX procedure will assign units to groups so that the first *k* moments are approximately equal across treatment groups. This example uses the Cholesterol variable and *k* = 2, which means that the mean and variance of the cholesterol levels for patients in each treatment group will be approximately equal.

The following statements create the two data sets and define a macro variable that contains the name of the Cholesterol variable:

/* Split the living Sashelp.Heart patients into five groups so that
   the mean and variance of cholesterol is the same across groups. */
data Units;                    /* each row is a unit to be assigned */
   set Sashelp.Heart(where=(status = "Alive"));
   keep Sex AgeAtStart Height Weight Diastolic Systolic MRW Cholesterol;
run;

%let NumGroups = 5;            /* number of treatment groups */
data Treatments;
   do Trt = 1 to &NumGroups;   /* Trt is the variable that assigns patients to groups */
      output;
   end;
run;

%let Var = Cholesterol;        /* name of covariate */

As discussed in the PROC OPTEX documentation, the following call creates an output data set (named GROUPS) that assigns each patient to a group (1–5). A call to PROC MEANS displays the mean and variance of each group.

proc optex data=Treatments seed=97531 coding=orthcan;
   class Trt;
   model Trt;                 /* specify treatment model */
   blocks design=Units;       /* specify units */
   model &Var &Var*&Var;      /* fixed covariates: &Var --> mean, &Var*&Var --> variance */
   output out=Groups;         /* merged data: units assigned to groups */
run;

proc means data=Groups mean std;
   class Trt;
   var &Var;
run;

Success! The table shows that each group has approximately the same size and approximately the same mean and variance of the Cholesterol variable. Remember, though, that this assignment scheme is not random, so be careful not to make unjustified inferences from estimates based on these group assignments.

I am not an expert on experimental design, but a knowledgeable colleague tells me that optimal design theory states that the design that minimizes the variance of the treatment effects (adjusting for the first two moments of the covariate) is the design in which treatment means and variances of the covariate are as equal as possible. This is the design that PROC OPTEX has produced.

The following call to PROC SGPLOT uses box plots to visualize the distribution of the Cholesterol variable across treatment groups. The graph is shown at the top of this article. Clearly the distribution of cholesterol is very similar for each group.

proc sgplot data=Groups;
   vbox &Var / category=Trt;
run;

In summary, the OPTEX procedure in SAS/QC software enables you to assign units to groups, where each group has approximately the same distribution of a specified covariate. In this article, the covariate measured cholesterol levels in patients, but you can also group individuals according to income, scholastic aptitude, and so on. For large data sets, the assignment problem is a challenging optimization problem, but PROC OPTEX provides a concise syntax and solves this problem efficiently.

The post Split data into groups that have the same mean and variance appeared first on The DO Loop.

May 01, 2017

Machine learning seems to be the new hot topic these days. Everybody's talking about how machines are beating human players in chess, Jeopardy, and now even Go. In the future, artificial intelligence will drive our cars and our jobs will be taken over by robots. There’s a lot of hype, [...]

Autotuning: How machine learning helps optimize itself was published on SAS Voices by Sascha Schubert