March 4, 2019
 

Standardized tests like the SAT and ACT can cause stress for both high school students and their parents, but according to a Wall Street Journal article, the SAT and ACT "provide an invaluable measure of how students are likely to perform in college and beyond." Naturally, students wonder how their individual scores compare with others at their high school, their school district, or their state. I addressed some of those questions in a previous article that visualizes national ACT scores among college-bound students.

The average test scores at a school (or in a school system) are also scrutinized. Administrators and state legislators look at standardized scores to help assess the effectiveness of school districts, principals, and teachers. Parents might use scores to help them decide whether to send their child to the local public school, a charter school, or a private school.

A recent article in the Raleigh News and Observer links to a website of the NC Department of Public Instruction that contains the full set of SAT scores for all 496 public schools in NC. The table is great if you want to find the scores for a particular school, but is not very illuminating if you want an overview of how SAT scores vary at schools across the state. Thus, I decided to visualize the data for all NC schools by using SAS.

The data are only for public schools. You can download the data and the SAS program that I used to create the graphs in this article.

The distribution of total SAT scores

The SAT score contains two components, a score on a math test and a score on a reading and writing test (sometimes referred to as the "English" score). Colleges look at the component scores and the sum of the scores (the "total" score). The following histogram shows the distribution of the average total SAT score for schools in North Carolina:

From this graph, you can determine several facts about the data:

  1. For most NC schools, the average SAT score is about 1100.
  2. About 73% of NC schools have an average SAT score between 1000 and 1200.
  3. There are a few schools that have much higher scores than the others. Those schools are Early College At Guilford (Total=1442), Raleigh Charter High School (Total=1356), and East Chapel Hill High (Total=1290).

Visualize the SAT math and English scores

If you want to compare the distributions of the math and English SAT scores, you can create a comparative histogram, as shown below:

In the comparative histogram, the average English scores for schools are shown in the top row; the math scores are in the bottom row. The English scores are generally higher, with a median value of 543. The median math score is about 15 points lower, with a value of 528.

Of course, the measurements in these two histograms are not independent. In reality, the math and English scores are paired, since every student takes both the math and English tests. Therefore, it makes sense to plot the joint distribution of the scores in a scatter plot, as follows:

In this plot, each school is represented by a marker. The math and English scores determine the placement of the marker. You can see that the math and English scores are highly correlated and linearly related. Schools with high (respectively, low) English scores tend to have high (respectively, low) math scores.

I used SAS to add tool tips to the graph. When I hover the cursor over a marker, the graph displays information about the school and the average test scores. That technique makes it easy to identify the outliers. For example, the image above shows the tool-tip information for the Raleigh Charter High School, which has exceptionally high SAT scores.
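As a rough sketch of the tool-tip technique (not the program used for this article), the following statements use the Sashelp.Class sample data as a stand-in for the school data. The TIP= option on the SCATTER statement lists the variables to show on hover, and the IMAGEMAP option on the ODS GRAPHICS statement enables tool tips in HTML output. The variable names belong to the sample data, not to the SAT data.

```sas
/* Stand-in data: Sashelp.Class, NOT the NC SAT data */
ods graphics / imagemap=on;      /* tool tips appear in HTML output */
title 'Scatter Plot with Tool Tips';
proc sgplot data=sashelp.class;
   scatter x=Height y=Weight / group=Sex
           tip=(Name Age Height Weight);   /* variables shown on hover */
run;
title;
```

For the school data, you would presumably list the school name and the two component scores in the TIP= option and use the GROUP= option to distinguish charter from non-charter schools.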

And speaking of charter schools, legislators and school boards sometimes discuss the merits and disadvantages of using public tax dollars to establish and run charter schools. I have used red triangles to indicate the 39 charter schools in NC, which represent about 8% of the total schools. Overall, the distribution of SAT scores among the charter schools appears to be similar to that of the non-charter schools. I also created a comparative histogram of the charter/non-charter schools (not shown), which confirms that the two distributions are similar.

SAT scores for all NC public high schools

The previous graphs show the distribution of SAT scores for all public NC high schools, but do not enable you to easily compare different school districts. One way to compare school districts is to rank them according to the median scores for the schools within the district. You can create a graph that shows the school scores for each district. Because there are 115 school districts in NC, this graph will be very tall or wide. Nevertheless, it can be useful to rank the school districts and simultaneously see the distribution of scores for each school. The following plot displays the top 75 school districts, which collectively contain 385 high schools:

The top school district is Chapel-Hill/Carrboro, which contains three high-performing high schools. The next few school districts are also small, having three or fewer high schools. Wake County, the second-largest school district in the state, is ninth in terms of the median SAT scores, but you can see considerable variation among the schools in the district. Other large school districts (Charter schools, Guilford, and Charlotte/Mecklenburg) show similar variation in scores.
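The ranking step can be sketched as follows. This is a minimal illustration, not the program for this article: it uses the Sashelp.Cars sample data as a stand-in, where Make plays the role of the school district and MPG_City plays the role of the school's total score. PROC MEANS computes the median for each group, and PROC SORT orders the groups from highest to lowest median.

```sas
/* Stand-in data: Sashelp.Cars. Make stands in for the district;
   MPG_City stands in for the school score. */
proc means data=sashelp.cars noprint nway;
   class Make;
   var MPG_City;
   output out=GroupMedian(drop=_TYPE_) median=MedianValue;
run;

/* rank the groups by their median, largest first */
proc sort data=GroupMedian;
   by descending MedianValue;
run;

/* show the top 10 groups; _FREQ_ is the number of rows in each group */
proc print data=GroupMedian(obs=10) noobs;
run;
```

You could then merge the ranks back onto the original data and use the sorted order for the district axis of the plot.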

Summary

By using data visualization, you can better understand the distribution of SAT scores across public high schools in NC. The graphs in this article enable you to see that there are several high-performing schools, lots of schools in the middle, and a few low-performing schools. Similarly, you can compare the test scores by school districts. The school-district graph enables you to see the variation of scores of schools within a district. Lastly, when you create these graphs in SAS, you can add tool tips that help you identify the individual schools within each school district.

In my next article, I will describe an alternative visualization that displays a box plot for each school district.

Download the data and the SAS program (NCSAT.sas) for this article.

The post Visualize SAT scores in North Carolina appeared first on The DO Loop.

February 28, 2019
 

Across organizations of all types, massive amounts of information are stored in unstructured formats such as video, images, audio, and of course, text. Let’s talk more about text and natural language processing. We know that there is tremendous value buried in call center and chat dialogues, survey comments, product reviews, technical notes, legal contracts, and other sources where context is captured in words versus numbers. But how can we extract the signal we want amidst all the noise?

In this post, we will examine this problem using publicly available descriptions of side effects or adverse events that patients have reported following a vaccination. This Vaccine Adverse Event Reporting System (VAERS) is managed by the CDC and FDA. Among other objectives, these agencies use it to:

* Monitor increases in known adverse events and detect new or unusual vaccine adverse events

* Identify potential patient risk factors, including temporal, demographic, or geographic reporting clusters

Below is a view of the raw data. It contains a text field which holds freeform case notes, along with structured fields which contain the patient’s location, age, sex, date, vaccination details, and flags for serious outcomes such as hospitalization or death.

In this dashboard, notice how we easily can do a search for a keyword “seizure” to filter to patients who have reported this symptom in the comments. However, analysts need much more than just Search. They need to be able to not only investigate all the symptoms an individual patient is experiencing, but also see what patterns are emerging in aggregate so they can detect systemic safety or process issues. To do this, we need to harvest the insights from the freeform text field, and for that we’ll use SAS Visual Text Analytics.

In this solution, we can do many types of text analysis – which you choose depends on the nature of the data and your goals. When we load the data into the solution, it first displays all the variables in the table and detects their types. We could profile the structured fields further to see summary statistics and determine if any data cleansing is appropriate, but for now let’s just build a quick text model for the SYMPTOM_TEXT variable.

After assigning this variable to the “Text” role, SAS Visual Text Analytics automatically builds a pipeline which we can use to string together analytic tasks. In this default pipeline, first we parse the data and identify key entities, and then the solution assigns a sentiment label to each document, discovers topics (i.e. themes) of interest, and categorizes the collection in a meaningful way. Each of these nodes is interactive.

In this post, we’ll show just a tiny piece of overall functionality – how to automatically extract custom entities and relationships using a combination of machine learning and linguistic rules. In the Concepts node, we provide several standard entities to use out of the box. For example, here are the automatic matches to the pre-defined “DATE” concept:

However, for this data, we’re interested in extracting something different – patient symptoms, and where on the body they occurred. Since neither open source Named Entity Recognition (NER) models nor SAS Pre-defined Concepts will do something as domain-specific as this out of the box, it’s up to us to define what we mean by a symptom or a body part under Custom Concepts.

For Body Parts, we started with a list of expected parts from medical dictionaries and subject matter experts. As I iterate through and inspect the results, I might see a keyword or phrase that I missed. In the upcoming version of SAS Visual Text Analytics, I will be able to simply highlight it and right click to add it to the rule set.

We also will be adding a powerful new feature that applies machine learning to suggest additional rules for us. Note that this isn’t a simple thesaurus lookup! Instead, an algorithm is using the matches you’ve already told it are good, combined with the data itself, to learn the pattern you’re interested in. The suggested rules are placed in a new Sandbox area where you can test and evaluate them before adding them to your final definition.

We will also be able to auto-generate fact rules. This will help us pull out meaningful relationships between two entities and suggest a generalized pattern for modeling it. Here, we’ll have the machine determine the best relationship between Body Parts and Localized Symptoms, so that we can answer questions like, “where does it hurt?”, or “what body part was red (or itchy or swollen or tingly, etc.)?”. For this data, the tool suggested a rule which looks for a body part within 6 terms of a symptom, regardless of order, so long as both are contained in the same sentence.

Let’s apply just these few simple rules to our entire dataset and go back to the dashboard view. If we look at the results, we can now see much richer potential for finding insights in the data. I can easily select a single patient and see an entire list of his/her side effects alongside key details about the vaccination. I can also compare the most commonly reported symptoms by age group, gender, or geography, or see which body parts and symptoms may be predictors of a severe outcome like hospitalization or death.

Of course, there is much more we could do with this data. We could extract the name of the vaccine that was administered, the time to symptom onset, the duration of the symptoms, and other important information. However, even this simple example illustrates the technique and power of contextual extraction, and how it can enhance our ability to analyze large collections of complex data. Concept rule generation is currently at the forefront of our research efforts and is in its first, experimental stages. This, along with the sandbox testing environment, will make it even faster and easier for analysts to do this work in SAS Visual Text Analytics. Here are a few other resources to check out if you want to dig in further.

Article: Reduce the cost-barrier of generating labeled text data for machine learning algorithms

Paper: Analyzing Text In-Stream and at the Edge

Automatically extracting key information from textual data was published on SAS Users.

February 27, 2019
 

In this post, we continue our discussion of geography variables, the foundation of Visual Analytics Geo maps. This time we will look at Custom Coordinates.  As with any statistical graph, understanding your data is key.  But when using Custom Coordinates for geographic maps, this understanding becomes even more important.

Use the Custom Coordinate geography variable when your data does not match one of VA’s predefined geography types (see previous post, Fundamentals of SAS Visual Analytics geo maps).  For Custom coordinates, your data set must include latitude and longitude values as separate variables.   These values should be sourced from trustworthy providers and validated for accuracy prior to loading into VA.
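One simple validation you can run before loading is a range check on the coordinates. The following DATA step is a minimal sketch under stated assumptions: the data set name Locations, the variable names, and the three rows are made up purely for illustration and are not the Wake_Co_Pizza data. It flags rows whose latitude or longitude is missing or falls outside the valid WGS84 ranges (latitude in [-90, 90], longitude in [-180, 180]).

```sas
/* Illustrative rows only: the names and coordinates are made up */
data Locations;
   input Name :$12. Latitude Longitude;
   datalines;
SiteA 35.78 -78.64
SiteB 135.78 -78.64
SiteC 35.91 .
;

/* keep only the rows that fail the range check */
data BadCoords;
   set Locations;
   if missing(Latitude) or missing(Longitude)
      or abs(Latitude) > 90 or abs(Longitude) > 180;
run;

proc print data=BadCoords noobs;
run;
```

In this toy example, SiteB (latitude out of range) and SiteC (missing longitude) are flagged. A range check catches only gross errors; it does not confirm that a point is in the right place.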

When using Custom Coordinates, the Coordinate Space must also be considered.  The coordinate space defines the grid used to plot your data.  The underlying map is also based on a grid.  In order for your data to display correctly on a map, these grids must match.  Visual Analytics uses the World Geodetic System (WGS84) as the default coordinate space (grid).  This will work for most scenarios, including the example below.

Once you have selected a dataset and confirmed it contains the required spatial information, you can now create a Custom Geography variable.  In this example, I am using the variable Business Address from the dataset Wake_Co_Pizza.  Let’s get started.

  1. Begin by opening VA and navigating to the Data panel on the left of the application.
  2. Select the dataset and locate the variable that you wish to map. Click the down arrow to the right of the variable and choose ‘Geography’ from the Classification dropdown menu.
  3. The ‘Edit Geography Item’ window appears. Select Custom coordinates in the ‘Geography data type’ dropdown.   Three new dropdown lists appear that are specific to the Custom coordinates data type: ‘Latitude (y)’, ‘Longitude (x)’ and ‘Coordinate Space’.

When using the Custom coordinates data type, we must tell VA where to find the spatial data in our dataset.  We do this using the Latitude (y) and Longitude (x) dropdown lists.  They contain all measures from your dataset.  In this example, the variable ‘Latitude World Geodetic System’ contains our latitude values and the variable  ‘Longitude World Geodetic System’ contains our longitude values.   The ‘Coordinate Space’ dropdown defaults to World Geodetic System (WGS84) and is the correct choice for this example.

  4. Click the OK button to complete the setup once the latitude and longitude variables have been selected from their respective dropdown lists. You should see a new ‘Geography’ section in the Data panel.  The name of the variable (or its edited value) will be displayed beside a globe icon to indicate it is a geography variable.  In this case we see the variable Business Address.

 

Congratulations!  You have now created a custom geography variable and are ready to display it on a map.  To do this, simply drag it from the Data panel and drop it on the report canvas.  The auto-map feature of VA will recognize it as a geography variable and display the data as a bubble map with an OpenStreetMap background.

In this post, we created a custom geography variable using the default Coordinate Space.  Using a custom geography variable gives you the flexibility of mapping data sets that contain valid latitude and longitude values.  Next time, we will take our exploration of the geography variable one step further and explore using custom polygons in your maps.

Using Custom Coordinates for map creation in SAS Visual Analytics was published on SAS Users.

February 27, 2019
 

Box plots are a great way to compare the distributions of several subpopulations of your data. For example, box plots are often used in clinical studies to visualize the response of patients in various cohorts. This article describes three techniques to visualize responses when the cohorts have a nested or hierarchical structure, such as experimental treatments nested inside of clinics. The techniques are:

  • Use PROC SGPLOT (or PROC SGPANEL): Use the VBOX statement to visualize the nested structure. You can use the CATEGORY= option to specify the "outer" variable and the GROUP= option to specify the "inner" (nested) variable.
  • Use PROC GLM: For a model that has exactly two categorical variables, one nested in the other, PROC GLM automatically creates a nested box plot.
  • Use PROC BOXPLOT: You can use PROC BOXPLOT to create a nested box plot. The procedure supports several options that can enhance the visualization.

An example of nested data: Leaves on plants

Did you know that turnip greens are an excellent source of calcium? The following example is from the PROC NESTED documentation and is based on data analyzed in Snedecor and Cochran (Statistical Methods, 6th ed., 1967, p. 286). The original data is from an experiment in which four random turnip green plants are selected, then three leaves are randomly selected on each plant. From each leaf, two 100-mg samples were selected and used to determine the amount of calcium. (The units for calcium are percentage of dry weight.) Because the box plot for a two-element sample is trivial, I made up four additional fake measurements of calcium for each leaf, as shown in the following DATA step:

/* PROC NESTED example. First two measurements are real data. The last four are fake. */
data Turnip;
do Plant=1 to 4;
   do Leaf=1 to 3;
      do Sample=1 to 6;
         input Calcium @@;  output;
      end;
   end;
end;
/* 
|--REAL--|  |----- FAKE ------|  */
datalines;
3.28 3.09   3.26 3.19 3.27 3.01 
3.52 3.48   3.53 3.47 3.50 3.49 
2.88 2.80   2.98 2.70 2.28 2.81 
2.46 2.44   2.58 2.30 2.61 2.48 
1.87 1.92   1.97 1.90 1.86 1.90 
2.19 2.19   2.21 2.17 2.23 2.19 
2.77 2.66   2.79 2.63 2.72 2.69 
3.74 3.44   3.75 3.45 3.73 3.34 
2.55 2.55   2.59 2.51 2.49 2.61 
3.78 3.87   3.79 3.81 3.88 3.76 
4.07 4.12   4.23 4.27 4.01 4.08 
3.31 3.31   3.30 3.34 3.33 3.30 
;
 
title 'Calcium Concentration in Turnip Leaves';
title2 'Leaves Nested Within Plants';
footnote J=L 'Based on Snedecor and Cochran (1967, p. 286)';
proc sgplot data=Turnip;
   vbox Calcium / Group=Leaf category=Plant;
   xaxis discreteorder=data;
run;

[Figure: Nested box plots created by PROC SGPLOT in SAS]

This simple visualization uses the VBOX statement in PROC SGPLOT. As I explained previously, you can use the CATEGORY= and GROUP= options to display the distribution of calcium for the joint levels of the two categorical variables. The CATEGORY= option specifies the horizontal variable; the GROUP= option specifies the levels of a second variable. In this case, the levels of the LEAF variable are nested inside the levels of the PLANT variable. The result is shown. Colors are used to identify the level of the GROUP= variable.

If there were additional levels of nesting (for example, multiple farms), you could use the SGPANEL procedure and include the additional variables on the PANELBY statement.
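Here is a minimal sketch of that idea. The Turnip data does not contain a farm variable, so this example assumes a hypothetical variable named Farm and simply duplicates the measurements for two farms. The duplicated values are for illustration only and are not real data.

```sas
/* Hypothetical: pretend the same measurements were recorded on two farms.
   Assumes the Turnip data set from the DATA step above. */
data TurnipFarms;
set Turnip;
do Farm = 1 to 2;
   output;
end;
run;

proc sgpanel data=TurnipFarms;
   panelby Farm / columns=2;                  /* one cell per farm */
   vbox Calcium / group=Leaf category=Plant;  /* same VBOX options as before */
run;
```

Each cell of the panel repeats the nested box plot for one level of the PANELBY variable.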

Visualize a nested ANOVA model

You can improve the previous graph by visually dividing the leaves in one plant from the leaves in another. You can use PROC GLM to automatically display the divisions when you analyze a model for which the explanatory categorical variables are nested. The following call to PROC GLM specifies a nested model. The "NestPlot" graph is created automatically when you use ODS GRAPHICS:

ods graphics on;
proc glm data=Turnip;
   class Plant Leaf;
   model Calcium = Leaf(Plant);   /* Leaf nested in Plant */
quit;

[Figure: Nested box plots created by PROC GLM in SAS]

As you can see, the graph is the same except that it contains vertical lines that divide one plant from another. I like this version of the graph better.

Box plots for independent units

If you think about it, the color in the previous plot is not necessary. You might even consider it misleading because the leaf values are unrelated across plants. "Leaf 1" for "Plant 1" has no relationship to "Leaf 1" for the other plants. To emphasize that fact, you could relabel the leaves by using the values 1 through 12, where leaves 1–3 are from Plant=1, leaves 4–6 are from Plant=2, and so forth.

Labeling the leaves that way is necessary if you want to use the BOXPLOT procedure to visualize the data. The BOXPLOT procedure supports nested categories and the syntax is similar to the GLM syntax:

data Turnip2;
set Turnip;
LeafID = Leaf + (Plant-1)*3;   /* label leaves 1, 2, 3, ..., 12 */
run;
 
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote;
run;

[Figure: Nested box plots created by PROC BOXPLOT in SAS]

The output from the BOXPLOT procedure uses column headers to indicate the plants. This is similar to the visualization that I used to label categories for tropical storms and hurricanes.

You can use three features of SAS/STAT 15.1 (SAS 9.4M6) to add vertical reference lines, to color the background, and to center the plant labels in the column headers. The options are the BLOCKREF option, the BLOCKREFFILL option, and the BLOCKVALUEPOS= option, respectively.

/* Options for extending the column headers into the plot region.
   These options require SAS/STAT 15.1 */
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / odstitle=title odstitle2=title2 odsfootnote=footnote
                      blockref blockreffill blockvaluepos=center;
run;

[Figure: Nested box plots created by PROC BOXPLOT in SAS, with reference lines and shaded blocks]

I like this graph because it uses colors sparingly and does not require a legend. Also, if the number of samples is large (such as 30 or 50), PROC BOXPLOT will automatically produce a series of graphs, each displaying a portion of the data.
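If you want to control the paneling yourself, the PLOT statement in PROC BOXPLOT supports the NPANELPOS= option, which sets the number of group positions per graph. The following sketch assumes the Turnip2 data set from the earlier DATA step. The value 6 is an arbitrary choice for illustration, which should split the 12 leaves across two graphs.

```sas
/* Assumes the Turnip2 data set created earlier.
   NPANELPOS=6 is an arbitrary illustrative value. */
proc boxplot data=Turnip2;
   plot Calcium*LeafID(Plant) / npanelpos=6;
run;
```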

In summary, this article shows three ways to create box plots of responses for nested categorical data. The simplest is to use the VBOX statement in PROC SGPLOT, although if there are additional categorical variables in the model, you can create a lattice of box plots by using the PANELBY statement in PROC SGPANEL. The GLM procedure enables you to perform an ANOVA for the data and also create a visualization. The BOXPLOT procedure provides nice headers and (in SAS/STAT 15.1) colored strips for each level of the "outer" categories.

The post 3 ways to create nested box plots in SAS appeared first on The DO Loop.
