10月 162017
 

Last week Warren Kuhfeld wrote about a graph called the "lines plot" that is produced by SAS/STAT procedures in SAS 9.4M5. (Notice that the "lines plot" has an 's'; it is not a line plot!) The lines plot is produced as part of an analysis that performs multiple comparisons of means. Although the graphical version of the lines plot is new in SAS 9.4M5, the underlying analysis has been available in SAS for decades. If you have an earlier version of SAS, the analysis is presented as a table rather than as a graph.

Warren's focus was on the plot itself, with an emphasis on how to create it. However, the plot is also interesting for the statistical information it provides. This article discusses how to interpret the lines plot in a multiple comparisons of means analysis.

The lines plot in SAS

You can use the LINES option in the LSMEANS statement to request a lines plot in SAS 9.4M5. The following data are taken from Multiple Comparisons and Multiple Tests (p. 42-53 of the First Edition). Researchers are studying the effectiveness of five weight-loss diets, denoted by A, B, C, D, and E. Ten male subjects are randomly assigned to each method. After a fixed length of time, the weight loss of each subject is recorded, as follows:

/* Data and programs from _Multiple Comparisons and Multiple Tests_
   Westfall, Tobias, Rom, Wolfinger, and Hochberg (1999, First Edition) */
data wloss;
do diet = 'A','B','C','D','E';
   do i = 1 to 10;  input WeightLoss @@;  output;  end;
end;
datalines;
datalines;
12.4 10.7 11.9 11.0 12.4 12.3 13.0 12.5 11.2 13.1
 9.1 11.5 11.3  9.7 13.2 10.7 10.6 11.3 11.1 11.7
 8.5 11.6 10.2 10.9  9.0  9.6  9.9 11.3 10.5 11.2
 8.7  9.3  8.2  8.3  9.0  9.4  9.2 12.2  8.5  9.9
12.7 13.2 11.8 11.9 12.2 11.2 13.7 11.8 11.5 11.7
;

You can use PROC GLM to perform a balanced one-way analysis of variance and use the MEANS or LSMEANS statement to request pairwise comparisons of means among the five diet groups:

proc glm data=wloss;
   class diet;
   model WeightLoss = diet;
   *means diet / tukey LINES; 
   lsmeans diet / pdiff=all adjust=tukey LINES;
quit;

In general, I use the LSMEANS statement rather than the MEANS statement because LS-means are more versatile and handle unbalanced data. (More about this in a later section.) The PDIFF=ALL option requests an analysis of all pairwise comparisons between the LS-means of weight loss for the different diets. The ADJUST=TUKEY option is a common way to adjust the widths of confidence intervals to accommodate the multiple comparisons. The analysis produces several graphs and tables, including the following lines plot.

Lines plot for multiple comparisons of means in SAS

How to interpret a lines plot

In the lines plot, the vertical lines visually connect groups whose LS-means are "statistically indistinguishable." Statistically speaking, two means are "statistically indistinguishable" when their pairwise difference is not statistically significant.

If you have k groups, there are k(k-1)/2 pairwise differences that you can examine. The lines plot attempts to summarize those comparisons by connecting groups whose means are statistically indistinguishable. Often there are fewer lines than pairwise comparisons, so the lines plot displays a summary of which groups have similar means.

In the previous analysis, there are five groups, so there are 10 pairwise comparisons of means. The lines plot summarizes the results by using three vertical lines. The leftmost line (blue) indicates that the means of the 'B' and 'C' groups are statistically indistinguishable (they are not significantly different). Similarly, the upper right vertical bar (red) indicates that the means of the pairs ('E','A'), ('E','B'), and ('A','B') are not significantly different from each other. Lastly, the lower right vertical bar (green) indicates that the means for groups 'C' and 'D' are not significantly different. Thus in total, the lines plot indicates that five pairs of means are not significantly different.

The remaining pairs of mean differences (for example, 'E' and 'D') are significantly different. By using only three vertical lines, the lines plot visually associates pairs of means that are essentially the same. Those pairs that are not connected by a vertical line are significantly different.

Advantages and limitations of the lines plot

Advantages of the lines plot include the following:

  • The groups are ordered according to the observed means of the groups.
  • The number of vertical lines is often much smaller than the number of pairwise comparisons between groups.

Notice that the groups in this example are the same size (10 subjects). When the group sizes are equal (the so-called "balanced ANOVA" case), the lines plot can always correctly represent the relationships between the group means. However, that is not always true for unbalanced data. Westfall et al. (1999, p. 69) provide an example in which using the LINES option on the MEANS statement produces a misleading plot.

The situation is less severe when you use the LSMEANS statement, but for unbalanced data it is sometimes impossible for the lines plot to accurately connect all groups that have insignificant mean differences. In those cases, SAS appends a footnote to the plot that alerts you to the situation and lists the additional significances not represented by the plot.

In my next blog post, I will show some alternative graphical displays that are appropriate for multiple comparisons of means for unbalanced groups.

Summary

In summary, the new lines plot in SAS/STAT software is a graphical version of an analysis that has been in SAS for decades. You can create the plot by using the LINES option in the LSMEANS statement. The lines plot indicates which groups have mean differences that are not significant. For balanced data (or nearly balanced), it does a good job of summarizes which differences of means are not significant. For highly unbalanced data, there are other graphs that you can use. Those graphs will be discussed in a future article.

The post Graphs for multiple comparisons of means: The lines plot appeared first on The DO Loop.

10月 132017
 

Every year in early October, the eyes of the world turn to Sweden and Norway, where the Nobel Prize winners are announced to the world. The Nobel Prize is considered the world's most prestigious award. Since 1901, the Prize has been presented to individuals and organizations that have made significant achievements in the fields of physics, chemistry, physiology or medicine, world peace and literature in each year (there were several exceptions during war years). In 1968, Sveriges Riksbank established the Sveriges Riksbank Prize in Economic Sciences in memory of Alfred Nobel, founder of the Nobel Prize. Today, individuals or organizations who are awarded Nobel Prizes and the Prize in Economic Sciences are called Nobel Laureates.

So far, more than 900 Nobel Laureates have been awarded. In this post, I wanted to learn a little more about these impressive individuals. Where were these Nobel Laureates from? Why do they get awarded? Is there any common characteristics you’ll find in these Laureates? Below you’ll find a preliminary analysis of Nobel Laureates using SAS Visual Analytics.

The analysis is based on data from List_of_Nobel_laureates, List of Nobel laureates by university affiliation and Nobel Laureates datasets at Kaggle, which definitely has some missing and inconsistent values. I have cleaned the data to correct for some obvious inconsistency as possible for my analysis.

How many Nobel Laureates have their been so far?

Recently, 12 new Nobel Laureates were awarded by the 2017 Nobel Prizes and Prize in Economic Sciences, and that makes 923 Laureates in total since the first Nobel Prize in 1901. Some Laureates share one prize, so we see more shared Laureates total in below table. While we see 27 organization winners of the Peace prize, most Laureates are individual winners.

analysis of the Nobel Laureates

The chart below shows the overall trend of annual total Nobel Laureates is increasing year-over-year, as more and more winners are sharing the Prize. The purple circle on the plot indicates that there are shared winners in that year. The average number of winners is about eight each year. Yet there was only one winner in 1916 for the Literature Prize. The most winners came in 2001, with 15 Laureates sharing the prizes. I also note from the chart that during the First World War, there were very few Nobel Prizes awarded, and during the Second World War, there were none.

Moreover, we know that most Nobel Laureates are awarded one Nobel Prize, yet I learned from childhood that the female scientist Marie Curie received two Nobel Prizes. If you search the datasets for winners awarded more than one Prize, you’ll find four scientists accomplished this feat. They are: Marie Curie, Linus Pauling, John Bardeen and Frederick Sanger.

Do Nobel Laureates live longer?

The answer is YES, per the research by Prof. Andrew Oswald from University of Warwick. Winning a Nobel prize adds about 1.5 years to the lifespan of Nobel Laureates compared to those who were merely nominated. Of course, it is not because of the monetary benefits that come with the Nobel Prize, but because of ‘the deep links between mind and body’, and that ‘happiness’ may make people live longer, which makes sense to me.

Since I don’t have the data of Nobel Prize nominees, let’s only test the lifespan of the Nobel Laureates and the ages they got awarded. The average life age of all Nobel Laureates is nearly 80, much older than the global average life expectancy of 71.4 years-old (according to World Health Organization 2015). Digging a bit more, we see Martin Luther King is the Nobel laureate (Peace, 1964) who died at youngest age. He was assassinated at 39 years old. Laureates who lived longest are Rita Levi-Montalcini (Medicine, 1986) and Ronald H. Coase (Economics, 1991), who both lived to 103 years old. You may also notice that the distribution of the Laureates’ lifespan is left skewed, the Nobel Prize winners certainly live longer than most.

In addition, something more worth noting:

  • The most laureates with the longest lifespans are from the Economics and Medicine categories. The Nobel Prize winning economists live longer than other categories’ winners on average. The average lifespan of these economists is about 86 years-old, five years longer than the second category of Medicine.
  • Economics winners are winning the awards at the highest age – 67 years-old on average. More digging shows that the oldest awarded age is 90 when Leonid Hurwicz (Economics, 2007) was awarded his Prize. We see the average awarded age of Physics winners is 56, which is 10+ years younger than that of the Economics winners. Thus, we get the impression that economists need more time to have outstanding achievements.
  • If we compare the time span between Laurates’ average awarded ages and their lifespan, the Physics Prize winners enjoy the longest life time after winning the award – about 20 years on average.
  • It is also worth noting that the Nobel Peace winners have the largest span of awarded age, about 70 years’ span. That’s because the youngest Nobel Laureates Malala Yousafzai, who got awarded of Nobel Peace Prize at 17 years-old in 2014.

The chart below is created in SAS Visual Analytics and shows the awarded ages of all individual Nobel Laureates in different prize categories. The reference line is the average awarded age of 59. It is very easy to note that no Nobel Prize was awarded during 1940-1943 due to the Second World War.

From which universities have Nobel Laureates graduated?

Next, let’s look at the educational background of Nobel Laureates. The left chart below obviously shows that much more Nobel winners hold Doctorate degrees than those of Bachelor or Master degrees. If we see the chart for Literature and Peace categories on the right, the difference is not that big. From the data, we know that the educational background of Nobel Laureates in Physics, Chemistry, Medicine and Economics categories (I call these four categories the scientific categories for easier description later) has the higher percentage of doctorate than that of winners in the Literature and Peace categories.

To learn more about the universities the Laureates in these scientific categories are graduated from, I ranked the top 10 university affiliations for the scientific categories in below chart, and their distribution among these categories, as well as the countries in which these universities are located.

The top 10 university affiliations were selected basing on the highest degree of the scientific categories’ Laureates obtained. That is, if one winner held a Master degree from Harvard University and a Doctorate degree from University of Cambridge, he/she is counted in University of Cambridge but not in the Harvard University. From the parallel coordinates plot, you may have noticed that the Physics in University of Cambridge and the Medicine in Harvard University are their greatest majors respectively. On the right, it shows the countries where these top 10 university affiliations are in United States, United Kingdom, France and Germany. The bar charts on the left show the percentage of educational degrees (Doctorate, Master, Bachelor) of each in the scientific categories (according to the available dataset). In the bottom chart, top 10 universities are ranked by their percentages. Perhaps now you have a great university in your mind for future education?

Next, I created the chart below to show the top eight countries having the university affiliations that more Nobel Prize winners graduated from. (Here the chart only shows for scientific categories, thus it excludes the Nobel Literature Prize and Peace prize.). An obvious trend we see from the chart is that the United States has the most Laureates spanning in the scientific categories after the Second World War, while Germany has more Laureates in the scientific categories comparatively before World War II.

Why do the Nobel Laureates get awarded?

Per the ‘nobelprize.org’, in his excerpt of the will, Alfred Nobel (1833-1896) dictates that his entire remaining estate should be used to endow "prizes to those who, during the preceding year, shall have conferred the greatest benefit to mankind." So Alfred's interests are reflected in the Prize, which said “The whole of his remaining realizable estate constitutes a fund, and the annually interest shall be divided into five equal parts, which shall be apportioned as follows: one part to the person who shall have made the most important discovery or invention within the field of physics; one part to the person who shall have made the most important chemical discovery or improvement; one part to the person who shall have made the most important discovery within the domain of physiology or medicine; one part to the person who shall have produced in the field of literature the most outstanding work in an ideal direction; and one part to the person who shall have done the most or the best work for fraternity between nations, for the abolition or reduction of standing armies and for the holding and promotion of peace congresses.”

Since it’s not easy to seek evidence in the datasets that Nobel Laureates are awarded by fulfilling Alfred’s will, what I do is to use SAS Visual Analytics text topics analysis performing some preliminary text analysis of the ‘Motivation’ field in the dataset for a validation to some extent. The ‘Motivation’ is given by ‘nobelprize.org’ for why the Laureate gets awarded. The analysis shows that the most frequently mentioned word is ‘discovery’, while the most 5 frequently appeared words include ‘work’, ‘development’, ‘contribution’, and ‘theory’. And from the topics analysis result, the top 10 topics are about ‘discovery’, ‘human”, “structure”, “economic”,” technique”, etc., which are reflecting Alfred Nobel‘s will in establishing the Prize. Moreover, the sentimental analysis result shows that the statements in the ‘Motivation’ field are mainly neutral (being ‘objective’), even though there are few positive and negative sentimental statements.

 

I hope you’ve found this analysis of Nobel Laureates data interesting. I believe there are still many other perspectives you can analyze to get insights. Is there anything interesting you see?

A preliminary analysis of the Nobel Laureates was published on SAS Users.

10月 132017
 

In this education analytics series of blog posts, we have been on a journey to learn how education customers are turning their data into insights to be a more data-informed and analytical organization. So far on our journey, we learned how education customers are using SAS, the positive impact that SAS [...]

Education analytics: Best practices for using data visualization and analytics was published on SAS Voices by Georgia Mariani

10月 122017
 

I’ve been making note of recent interactions and observations with different companies that I’ve done business with lately. My goal is to share with you real ways in which excellent customer experiences are shaped. Observation #1: Witty conversations rule As I write this, I’m sitting at a Panera Bread restaurant [...]

Two memorable multi-channel customer experiences was published on SAS Voices by Lonnie Miller

10月 122017
 

As analytics advances into more areas of our everyday life there is an increased need to understand just what it is that analytics is accomplishing for us.  I find there is a quicker acceptance of an analytical solution and even excitement if there is a good conceptual understanding of what [...]

The post Transforming Analytics into an Art Form appeared first on SAS Learning Post.

10月 122017
 

With SAS Data Management, you can setup SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

When a data issue is discovered it can be sent automatically or manually to a remediation queue where it can be corrected by designated users.

Let’s look how to setup a remediation service and how to send issue records to Data Remediation.

Register the remediation service.

To register a remediation service in SAS Data Remediation we go to Data Remediation Administrator “Add New Client Application.

Under Properties we supply an ID, which can be the name of the remediation service as long as it is unique, and a Display name, which is the name showing in the Remediation UI.

Under the tab Subject Area, we can register different subject categories for this remediation service.  When calling the remediation service we can categorize different remediation issues by setting different subject areas. We can, for example, use the Subject Area to point to different Data Quality Dimensions like Completeness, Uniqueness, Validity, Accuracy, Consistency.

Under the tab Issues Types, we can register issue categories. This enables us to categorize the different remediation issues. For example, we can point to the affected part of record like Name, Address, Phone Number.

At Task Templates/Select Templates we can set a workflow to be used for each issue type. You can design your own workflow using SAS Workflow Studio or you can use a prepared workflow that comes with Data Remediation. You need to make sure that the desired workflow is loaded on to Workflow Server to link it to the Data Remediation Service. Workflows are not mandatory in SAS Data Remediation but will improve efficiency of the remediation process.

Saving the remediation service will make it available to be called.

Sending issues to Data Remediation.

When you process data, and have identified issues that you want to send to Data Remediation, you can either call Data Remediation from the job immediately where you process the data or you store the issue records in a table first and then, in a second step, create remediation records via a Data Management job.

To send records to Data Remediation you can call remediation REST API form the HTTP Request node in a Data Management job.

Remediation REST API

The REST API expects a JSON structure supplying all required information:

{
	"application": "mandatory",
	"subjectArea": "mandatory",
	"name": "mandatory",
	"description": "",
	"userDefinedFieldLabels": {
		"1": "",
		"2": "",
		"3": ""
	},
	"topics": [{
		"url": "",
		"name": "",
		"userDefinedFields": {
			"1": "",
			"2": "",
			"3": ""
		},
		"key": "",
		"issues": [{
			"name": "mandatory",
			"importance": "",
			"note": "",
			"assignee": {
				"name": ""
			},
			"workflowName": "",
			"dueDate": "",
			"status": ""
		}]
	}]
}

 

JSON structure description:

In a Data Management job, you can create the JSON structure in an Expression node and use field substitution to pass in the necessary values from the issue records. The expression code could look like this:

REM_APPLICATION= "Customer Record"
REM_SUBJECT_AREA= "Completeness"
REM_PACKAGE_NAME= "Data Correction"
REM_PACKAGE_DESCRIPTION= "Mon-Result: " &formatdate(today(),"DD MM YY") 
REM_URL= "http://myserver/Sourcesys/#ID=" &record_id
REM_ITEM_NAME= "Mobile phone number missing"
REM_FIELDLABEL_1= "Source System"
REM_FIELD_1= "CRM"
REM_FIELDLABEL_2= "Redord ID"
REM_FIELD_2= record_id
REM_FIELDLABEL_3= "-"
REM_FIELD_3= ""
REM_KEY= record_id
REM_ISSUE_NAME= "Phone Number"
REM_IMPORTANCE= "high"
REM_ISSUE_NOTE= "Violated data quality rule phone: 4711"
REM_ASSIGNEE= "Ben"
REM_WORKFLOW= "Customer Tag"
REM_DUE-DATE= "2018-11-01"
REM_STATUS= "open"
 
JSON_REQUEST= '
{
  "application":"' &REM_APPLICATION &'",
  "subjectArea":"' &REM_SUBJECT_AREA &'",
  "name":"' &REM_PACKAGE_NAME &'",
  "description":"' &REM_PACKAGE_DESCRIPTION &'",
  "userDefinedFieldLabels": {
    "1":"' &REM_FIELDLABEL_1 &'",
    "2":"' &REM_FIELDLABEL_2 &'",
    "3":"' &REM_FIELDLABEL_3 &'"
  },
  "topics": [{
    "url":"' &REM_URL &'",
    "name":"' &REM_ITEM_NAME &'",
    "userDefinedFields": {
      "1":"' &REM_FIELD_1 &'",
      "2":"' &REM_FIELD_2 &'",
      "3":"' &REM_FIELD_3 &'"
    },
    "key":"' &REM_KEY &'",
    "issues": [{
      "name":"' &REM_ISSUE_NAME &'",
      "importance":"' &REM_IMPORTANCE &'",
      "note":"' &REM_ISSUE_NOTE &'",
      "assignee": {
        "name":"' &REM_ASSIGNEE &'"
      },
      "workflowName":"' &REM_WORKFLOW &'",
      "dueDate":"' &REM_DUE_DATE &'",
      "status":"' &REM_STATUS &'"
    }]
  }]
}'

 

Tip: You could also write a global function to generate the JSON structure.

After creating the JSON structure, you can invoke the web service to create remediation records. In the HTTP Request node, you call the web service as follows:

Address:  http://[server]:[port]/SASDataRemediation/rest/groups
Method: post
Input Filed: The variable containing the JSON structure. I.e. JSON_REQUEST
Output Filed: A field to take the output from the web service. You can use the New button create a filed and set the size to 1000
Under Security… you can set a defined user and password to access Data Remediation.
In the HTTP Request node’s advanced settings set the WSCP_HTTP_CONTENT_TYPE options to application/json

 

 

 

You can now execute the Data Management job to create the remediation records in SAS Data Remediation.

Improving data quality through SAS Data Remediation was published on SAS Users.

10月 112017
 

The article "Fisher's transformation of the correlation coefficient" featured a Monte Carlo simulation that generated sample correlations from bivariate normal data. The simulation used three steps:

  1. Simulate B samples of size N from a bivariate normal distribution with correlation ρ.
  2. Use PROC CORR to compute the sample correlation matrix for each of the B samples.
  3. Use the DATA step to extract the off-diagonal elements from the correlation matrices.

After the three steps, you obtain a distribution of B sample correlation coefficients that approximates the sampling distribution of the Pearson correlation coefficient for bivariate normal data.

There is a simpler way to simulate the correlation estimates: You can directly simulate from the Wishart distribution. Each draw from the Wishart distribution is a sample covariance matrix for a multivariate normal sample of size N. If you convert that covariance matrix to a correlation matrix, you can immediately extract the off-diagonal elements, as shown in the following SAS/IML statements:

%let rho = 0.8;           /* correlation for bivariate normal distribution */
%let N = 20;              /* sample size */
%let NumSamples = 2500;   /* number of simulated samples */
 
/* generate sample correlation coefficients by using Wishart distribution */
proc iml;
call randseed(12345);
NumSamples = &NumSamples;
DF = &N - 1;              /* X ~ N obs from MVN(0, Sigma) */
Sigma = {1     &rho,      /* covariance for MVN samples */
         &rho   1  };
S = RandWishart(NumSamples, DF, Sigma); /* each row is 2x2 matrix */
Corr = j(NumSamples, 1);  /* allocate vector for correlation estimates */
do i = 1 to nrow(S);      /* convert to correlation; extract off-diagonal */
   Corr[i] = cov2corr( shape(S[i,], 2, 2) )[1,2];
end;

You can create a comparative histogram of the sample correlation coefficients. In the following graph, the histogram at the top of the panel displays the distribution of the simulated correlation coefficients from the three-step method. The bottom histogram displays the distribution of correlations coefficients that are generated from the Wishart distribution.

Visually, the histograms appear to be similar. You can use PROC NPAR1WAY to run various hypothesis tests that compare the distributions; all tests support the hypothesis that these two distributions are equivalent.

If you'd like to see the complete analysis, you can download the SAS program that runs both simulations and compares the resulting distributions.

Although the Wishart distribution is more efficient for this simulation, recall that the Wishart distribution assumes that the underlying data distribution is multivariate normal. In contrast, the three-step simulation is more general. It can be used to generate correlation coefficients for any data distribution. So although the three-step simulation is not necessary for multivariate normal data, it is still an important technique to store in your simulation toolbox.

The post Simulate correlations by using the Wishart distribution appeared first on The DO Loop.