Data Visualization

4月 232020
 

At the end of March, the German government sponsored a hackathon called #WirVsVirus. The aim was to bring Germany’s collective coding expertise to bear on some of the many problems surrounding COVID-19. In total, more than 27,000 coders joined the challenge, working from home, and programming for 48 hours from [...]

27,000 coders vs. the coronavirus was published on SAS Voices by Tom Sabo

4月 032020
 

These are incredibly tough times, and fortunately, everyone I know is still healthy, safe and doing everything they can to stay that way. I know I'm lucky -- the worst I've suffered so far under coronavirus stay-at-home orders is sports withdrawal. I'm trying to fill that void in my life [...]

Visualizing shots, goals, players and more with SAS was published on SAS Voices by Frank Silva

12月 092019
 

Building on my last post, How to create checklist tables in SAS®, this one shows you how to compare SAS data Check mark and cross mark sets that include common and uncommon columns. You'll learn how to visualize side-by-side columns commonalities and differences in data tables.

As before, we're working with a comparison matrix (aka checklist table) where check-marks / x-marks indicate included / excluded columns.

Data tables will be comparable products while their columns (variables) will represent product features. We'll add background color to highlight which attributes are different in the common columns. Since there might be several different attributes for a given column, we will use a hierarchy typelengthlabel to indicate only the highest mismatched level of hierarchy. For example:

  • If same-named columns have different type (Numeric vs. Character), their corresponding check-mark will be shown on a light-red background, which indicates the highest degree of mismatch.
  • If same-named columns have the same type, a yellow background will indicate any difference in variables length.
  • When same-named variables type and length match, a light-blue background marks any difference in variables label.

SAS code to create color-enhanced comparison matrix

Let’s compare variable attributes in two data tables: one is SAS-supplied SASHELP.CARS, and another WORK.NEWCARS that I derive from the first one, slightly scrambling its column definitions:

data WORK.NEWCARS (drop=temp:);
   set SASHELP.CARS (rename=(Origin=Region EngineSize=temp1 Make=temp2));
   length EngineSize $3 Make $20;
   EngineSize = put(temp1,3.1);
   Make = temp2; 
   label Type='New Car Type';
run;

In this NEWCARS data table, I did the following:

  • Replaced column name Origin with Region
  • Changed type of column EngineSize from Numeric to Character
  • Changed length of column Make from $13 to $20
  • Changed label of column Type from blank to “New Car Type”

Now let’s build the comparison matrix:

proc contents data=SASHELP.CARS noprint out=DS1(keep=Name Type Length Label);
run;
 
proc contents data=WORK.NEWCARS noprint out=DS2(keep=Name Type Length Label);
run;
 
data comparison_matrix;
   merge
      DS1(in=in1 rename=(Type=Typ1 Length=Len1 Label=Lab1))
      DS2(in=in2 rename=(Type=Typ2 Length=Len2 Label=Lab2));
   by Name;
 
   /* set symbol shape: 1=V; 0=X */
   ds1 = 1; ds2 = 1;
   if in1 and not in2 then ds2 = 0; else
   if in2 and not in1 then ds1 = 0;
 
   /* add background color */
   if ds1=ds2=1 then
   select;
      when(Typ1^=Typ2) do; ds1=2; ds2=2; end;
      when(Len1^=Len2) do; ds1=3; ds2=3; end;
      when(Lab1^=Lab2) do; ds1=4; ds2=4; end;
      otherwise; 
   end;
 
   label
      Name = 'Column Name'
      ds1 = 'SASHELP.CARS'
      ds2 = 'WORK.NEWCARS'
      ;
run;
 
proc format;
   value chmark
      0   = '(*ESC*){unicode "2718"x}'
      1-4 = '(*ESC*){unicode "2714"x}'
      ;
   value chcolor
      0   = red
      1-4 = green
      ;
   value bgcolor
      2 = 'cxffccbb'
      3 = 'cxffe177'
      4 = 'cxd4f8d4' 
      ;
run;
 
ods html path='c:\temp' file='comp_marix.html' style=Seaside;
ods escapechar='^';
title 'Data set columns comparison matrix';
 
proc odstext;
   p '<div align="center">Mismatch Legend:'||
     '<span style="background-color:#ffccbb;margin-left:17px">^_^_^_^_</span> Type'||
     '<span style="background-color:#ffe177;margin-left:17px">^_^_^_^_</span> Length'||
     '<span style="background-color:#d4f8d4;margin-left:17px">^_^_^_^_</span> Label</div>'
   / style=[fontsize=9pt];
run;
 
title; 
proc print data=comparison_matrix label noobs;
   var Name / style={fontweight=bold width=100px};
   var ds1 ds2 / style={color=chcolor. backgroundcolor=bgcolor. just=center fontweight=bold width=120px};
   format ds1 ds2 chmark.;
run;
 
ods html close;

Here is a brief explanation of the code:

  1. Two PROC CONTENTS produce alphabetical lists (as datasets) of the data table column names, as well as their attributes (type, length, label)
  2. The DATA STEP merges these 2 lists and creates DS1 and DS2 variables indicating common name (values 1, 2, 3, 4) or uncommon name (value 0).
  3. PROC FORMAT creates 3 user-defined formats chmark, chcolor, bgcolor responsible for checkmark shape, checkmark color, and background color respectively. For checkmark shape, we use Unicode characters, and for colors we use both, color names (e.g. red, green) and hexadecimal RGB color notations (e.g. 'cxFFCCBB').
  4. PROC ODSTEXT’s P statement is used to display color legend for the comparison matrix.
  5. Finally, PROC PRINT with user-defined formats produces our color-enhanced comparison matrix.

Data tables comparison matrix – OUTPUT

The above code will generate the following HTML output with the comparison matrix for variables in two data sets:

Comparison matrix for common/uncommon variables in 2 datasets

Adding more detail to the comparison matrix chart

We can further enhance our output comparison matrix by adding detailed descriptive information about differences between variable attributes. For comprehensive view, we can add a COMMENTS column that spells out differences (attributes mismatches). In addition to the hierarchical logic defining only one mismatch of the highest degree indicated by color highlighting above, comments can include all found discrepancies. Simply add the following two pieces of SAS code:

1. Add the following group of statements to the above DATA Step (right after SELECT statement):

 length Comments $200;
   if ds1>1 then
   do;
      if Typ1^=Typ2 then Comments = catx(' ', Comments, 'Type1=',   Typ1, '; Type2=',   Typ2, ';');
      if Len1^=Len2 then Comments = catx(' ', Comments, 'Length1=', Len1, '; Length2=', Len2, ';');
      if Lab1^=Lab2 then Comments = catx(' ', Comments, 'Label1=',  Lab1, '; Label2=',  Lab2, ';');
   end;

Depending on your needs this Comments can be added unconditionally – you would just need to remove IF-THEN logic keeping only:

length Comments $200;
Comments = catx(' ', Comments, 'Type1=',   Typ1, '; Type2=',   Typ2, ';');
Comments = catx(' ', Comments, 'Length1=', Len1, '; Length2=', Len2, ';');
Comments = catx(' ', Comments, 'Label1=',  Lab1, '; Label2=',  Lab2, ';');

2. Add the following statement to the above PROC PRINT (right before the FORMAT statement):

var comments / style={width=250px};

Then your HTML output will look as follows:

Detailed comparison matrix for common/uncommon variables in 2 datasets

Conclusion

Comparison matrix charts are a convenient tool for data development and metadata validation when you're comparing a data table’s metadata against requirements descriptions.

It allows us to quickly identify tables’ common and uncommon variables, as well as common variable inconsistencies by type, length and other attributes, such as labels and formats.

We can easily add detailed descriptive information when needed.

On a related note

While this post focused on visualizing SAS data sets comparison vis-à-vis common and uncommon columns, it's worth noting SAS websites have plenty of info on finding common variables (or columns) in data sets. For example:

Your thoughts?

Do you find this material useful? What other usages of the checklist tables and color-enhanced comparison matrices can you suggest?

How to compare SAS data tables for common/uncommon columns was published on SAS Users.

11月 202019
 

Colleges and universities have access to enormous stores of data and analytics has the power to help higher education tackle some of its biggest challenges. Larry Burns, Assistant Director of Institutional Research and Information Management (IRIM), Oklahoma State University (OSU) knows a great deal about the power of analytics to [...]

Oklahoma State University’s analytics journey was published on SAS Voices by Georgia Mariani

10月 302019
 

Check mark
When I'm about to make a major purchase, I appreciate being able to compare products features at a glance, side by side. I am sure you have seen these ubiquitous comparison tables with check marks showing which features are characteristic of different products and which are not.

These data visualizations, sometimes called comparison matrixes, are also commonly known as checklist tables or checklist table charts. Such charts are extremely useful, persuasive visuals as they allow us to quickly identify differences as well as commonalities between comparable products or solutions and quickly decide which one of them is more desirable or suitable for our needs.

For example, here is such a table that I created in MS Word:
Checklist table example created in Word

SAS code to create checklist table chart

Thanks to SAS’ ability to use Unicode characters in formatted data, it’s very easy to create such a checklist table in SAS. Just imagine that each cell value with visible check mark is assigned value of 1 and each cell value with no check mark is assigned value of 0. That is exactly the data table that lies behind this data visualization. To print this data table with proper formatting, we will format number 1 to a more visually appealing check mark, and 0 to a “silent” blank. Here is the SAS code to accomplish this:

data CHECKLIST;
   length FEATURE $10;
   input FEATURE $1-10 A B C;
   label
      FEATURE = 'Feature'
      A = 'Product A'
      B = 'Product B'
      c = 'Product C';
   datalines;
Feature 1 1 1 1
Feature 2 1 0 1
Feature 3 0 1 1
Feature N 1 0 1
;
 
proc format;
   value chmark
      1 = '(*ESC*){unicode "2714"x}'
      other = ' ';
   value chcolor
      1 = green;
run;
 
ods html path='c:\temp' file='checklist1.html' style=HTMLBlue;
 
proc print data=CHECKLIST label noobs;
   var FEATURE / style={fontweight=bold};
   var A B C / style={color=chcolor. just=center fontweight=bold};
   format A B C chmark.;
run;
 
ods html close;

If you run this SAS code, your output will look much as the one above created in MS Word:
Checklist table created in SAS
Key elements of the SAS code that produce this checklist table are user-defined formats in the PROC FORMAT. You format the values of 1 to a Unicode 2714 corresponding to a checkmark character ✔ in a user-defined format chmark. Also, the value of 1 is formatted to green color in the chcolor user-defined format. The syntax for using Unicode symbols in user-defined formats is this:

value chmark
1 = '(*ESC*){unicode "2714"x}'

NOTE: ESC here must be upper-case; x at the end stands for “hexadecimal.”

Unicode characters for checklist tables

Unicode or Unicode Transformation Format (UTF) is an international encoding standard by which each letter, digit or symbol is assigned a unique numeric value that applies across different platforms and programs. The Unicode standard is supported by many operating systems and all modern browsers.

It is implemented in HTML, XML, Java, JavaScript, E-mail, ASP, PHP, etc. The most commonly used Unicode encodings standards are UTF-8 and UTF-16. HTML 5 supports both UTF-8 and UTF-16.

You can use this HTML Unicode (UTF-8) Reference to look up and choose symbols you can embed in your report using SAS user-defined formats. They are grouped by categories to make it easier to find the ones you needed.

Here is just a small random sample of the Unicode symbols that can be used to spice up your checklist tables to get their different flavors:

Unicode characters and codes
You can also apply colors to all these symbols the way we did it in the SAS code example above.

Different flavors of checklist tables

By just changing user-defined formats for the symbol shapes and colors we can get quite a variety of different checklist tables.

For example, we can format 0 to ✘ instead of blank and also make it red to explicitly visualize feature exclusion from product (in addition to explicit inclusion). All we need to do is to modify our PROC FORMAT to look like this:

proc format;
   value chmark
      1 = '(*ESC*){unicode "2714"x}'
      0 = '(*ESC*){unicode "2718"x}';
   value chcolor
      1 = green
      0 = red;
run;

SAS output comparison matrix will look a bit more dramatic and persuasive:
SAS-generated checklist table
Or, if you’d like, you can use the following format definition:

proc format;
   value chmark
      1 = '(*ESC*){unicode "2611"x}'
      0 = '(*ESC*){unicode "2612"x}';
   value chcolor
      1 = green
      0 = red;
run;

producing the following SAS-generated ballot-like table checklist:
Ballot-like checklist table created in SAS
Here is another one:

proc format;
   value chmark
      1 = '(*ESC*){unicode "1F5F9"x}'
      0 = '(*ESC*){unicode "20E0"x}';
   value chcolor
      1 = green
      0 = red;
run;

producing the following variation of the checklist table:
Another SAS-generated checklist table
As you can see, the possibilities are endless.

Your thoughts?

Do you find these comparison matrixes or checklist tables useful? Do you envision SAS producing them for your presentation, documentation, data story or marketing materials? What Unicode symbols do you like? Can you come up with some creative usages of symbols and colors? For example, table cells background colors...

How to create checklist tables in SAS® was published on SAS Users.

10月 302019
 

Check mark
When I'm about to make a major purchase, I appreciate being able to compare products features at a glance, side by side. I am sure you have seen these ubiquitous comparison tables with check marks showing which features are characteristic of different products and which are not.

These data visualizations, sometimes called comparison matrixes, are also commonly known as checklist tables or checklist table charts. Such charts are extremely useful, persuasive visuals as they allow us to quickly identify differences as well as commonalities between comparable products or solutions and quickly decide which one of them is more desirable or suitable for our needs.

For example, here is such a table that I created in MS Word:
Checklist table example created in Word

SAS code to create checklist table chart

Thanks to SAS’ ability to use Unicode characters in formatted data, it’s very easy to create such a checklist table in SAS. Just imagine that each cell value with visible check mark is assigned value of 1 and each cell value with no check mark is assigned value of 0. That is exactly the data table that lies behind this data visualization. To print this data table with proper formatting, we will format number 1 to a more visually appealing check mark, and 0 to a “silent” blank. Here is the SAS code to accomplish this:

data CHECKLIST;
   length FEATURE $10;
   input FEATURE $1-10 A B C;
   label
      FEATURE = 'Feature'
      A = 'Product A'
      B = 'Product B'
      c = 'Product C';
   datalines;
Feature 1 1 1 1
Feature 2 1 0 1
Feature 3 0 1 1
Feature N 1 0 1
;
 
proc format;
   value chmark
      1 = '(*ESC*){unicode "2714"x}'
      other = ' ';
   value chcolor
      1 = green;
run;
 
ods html path='c:\temp' file='checklist1.html' style=HTMLBlue;
 
proc print data=CHECKLIST label noobs;
   var FEATURE / style={fontweight=bold};
   var A B C / style={color=chcolor. just=center fontweight=bold};
   format A B C chmark.;
run;
 
ods html close;

If you run this SAS code, your output will look much as the one above created in MS Word:
Checklist table created in SAS
Key elements of the SAS code that produce this checklist table are user-defined formats in the PROC FORMAT. You format the values of 1 to a Unicode 2714 corresponding to a checkmark character ✔ in a user-defined format chmark. Also, the value of 1 is formatted to green color in the chcolor user-defined format. The syntax for using Unicode symbols in user-defined formats is this:

value chmark
1 = '(*ESC*){unicode "2714"x}'

NOTE: ESC here must be upper-case; x at the end stands for “hexadecimal.”

Unicode characters for checklist tables

Unicode or Unicode Transformation Format (UTF) is an international encoding standard by which each letter, digit or symbol is assigned a unique numeric value that applies across different platforms and programs. The Unicode standard is supported by many operating systems and all modern browsers.

It is implemented in HTML, XML, Java, JavaScript, E-mail, ASP, PHP, etc. The most commonly used Unicode encodings standards are UTF-8 and UTF-16. HTML 5 supports both UTF-8 and UTF-16.

You can use this HTML Unicode (UTF-8) Reference to look up and choose symbols you can embed in your report using SAS user-defined formats. They are grouped by categories to make it easier to find the ones you needed.

Here is just a small random sample of the Unicode symbols that can be used to spice up your checklist tables to get their different flavors:

Unicode characters and codes
You can also apply colors to all these symbols the way we did it in the SAS code example above.

Different flavors of checklist tables

By just changing user-defined formats for the symbol shapes and colors we can get quite a variety of different checklist tables.

For example, we can format 0 to ✘ instead of blank and also make it red to explicitly visualize feature exclusion from product (in addition to explicit inclusion). All we need to do is to modify our PROC FORMAT to look like this:

proc format;
   value chmark
      1 = '(*ESC*){unicode "2714"x}'
      0 = '(*ESC*){unicode "2718"x}';
   value chcolor
      1 = green
      0 = red;
run;

SAS output comparison matrix will look a bit more dramatic and persuasive:
SAS-generated checklist table
Or, if you’d like, you can use the following format definition:

proc format;
   value chmark
      1 = '(*ESC*){unicode "2611"x}'
      0 = '(*ESC*){unicode "2612"x}';
   value chcolor
      1 = green
      0 = red;
run;

producing the following SAS-generated ballot-like table checklist:
Ballot-like checklist table created in SAS
Here is another one:

proc format;
   value chmark
      1 = '(*ESC*){unicode "1F5F9"x}'
      0 = '(*ESC*){unicode "20E0"x}';
   value chcolor
      1 = green
      0 = red;
run;

producing the following variation of the checklist table:
Another SAS-generated checklist table
As you can see, the possibilities are endless.

Your thoughts?

Do you find these comparison matrixes or checklist tables useful? Do you envision SAS producing them for your presentation, documentation, data story or marketing materials? What Unicode symbols do you like? Can you come up with some creative usages of symbols and colors? For example, table cells background colors...

How to create checklist tables in SAS® was published on SAS Users.

10月 242019
 

“Analytics Can Save Higher Education. Really.” is a call to action for the higher education community to leverage data and analytics for better decision making at colleges and universities. It stresses the importance of using data and analytics to improve student outcomes, campus operations and much more. Oklahoma State University [...]

Establishing an analytics culture: An interview with Oklahoma State University was published on SAS Voices by Georgia Mariani

8月 282019
 

Moving Average (MA) is a common indicator in stocks, securities and futures trading in financial markets to gauge momentum and confirm trends. MA is often used to smooth out short-term fluctuations and show long-term trends. But most MA indicators have big lags in signaling a changing trend. To be faster to capture a trend reversal, several New MA indicators are now available that more quickly detect trend changes – and of those, the Hull Moving Average (HMA), is one of the most popular. This post demonstrates its superiority.

A closer look at HMA

Developed by Alan Hull, it's faster and thus a more useful signal than others. Its main advantage over general MA indicators is its relative smoothness as it signals change. Commonly-used MA indicators include Simple Moving Average (SMA), Weighted Moving Average (WMA) and so on. SMA calculates the arithmetic mean of the prices, which gives individual value equal weight. WMA averages individual values with some predetermined weights.

Since moving averages are computed from prior data, all MA indicators suffer a significant drawback of being a lagging indicator. Even in a shorter-period of moving average, which has less lag than one with a longer period, a stock price may drop sharply before a MA indicator signals the trend change. The Hull Moving Average (HMA) uses weighted moving average and the square root of the period instead of the actual period itself, which leads it to be more responsive to most recent price activity, whilst maintaining smoothness.

According to Alan Hull, the formula for HMA is:

We see that the major computing components in HMA are three WMAs. Refer to the specification here, we have the corresponding WMA formula as pictured below. In the WMA formula, the weight of each price value is related to the position of the value and the period length. The more recent the higher weights, and the shorter of the period the higher weights.

HMA in action

In the remainder of this post, I will show how to calculate HMA of a stock price using calculated items in SAS Visual Analytics and show that HMA gives faster upward/downward signals than SMA. I use the data from SASHELP. STOCK with ‘IBM’ as an example. The data needs to be sorted by the date and a column (named ‘tid’) added to hold the sequence number before loading into SAS Visual Analytics for calculation. The data preparation codes can be found here. After loading the data into SAS Visual Analytics, we can start by creating the calculated items. Here, I set the period length to 5 in calculation (i.e. =5 in the formula) and calculate HMA for ‘Close’ price of IBM stock for example.

1. Calculate the first WMA like so...

... using the AggregateCells operator in SAS Visual Analytics. I name it as 'WMA(5/2 days)'. Have the data value in, note I’ve rounded the (5⁄2) to an integer of 3. That is, the aggregation is starting from the previous two (-2) row and ending at current row (0). The corresponding formula of the calculated item ‘WMA(5/2 days)’ in SAS VA is:

AggregateCells(_Sum_, ( 'Close'n * 'tid'n ), default, CellIndex(current, -2), CellIndex(current, 0)) / AggregateCells(_Sum_, 'tid'n, default, CellIndex(current, -2), CellIndex(current, 0))

 

2. Similarly, calculate the second in SAS Visual Analytics:

Name it as ‘WMA(5 days)’. The corresponding formula is:
AggregateCells(_Sum_, ( 'Close'n * 'tid'n ), default, CellIndex(current, -4), CellIndex(current, 0)) / AggregateCells(_Sum_, 'tid'n, default, CellIndex(current, -4), CellIndex(current, 0))

3. Now we calculate the HMA, which computes the third WMA using the two WMAs we get from above calculation. In SAS Visual Analytics, if we directly apply a similar approach for the last WMA calculation, it will show message of operands requiring group. So here, I need a workaround to make the aggregation work.

4. To work around the problem, I create an aggregated item named ‘sumtid’ to indicate the row sequence number in an aggregation way. To do this, firstly create a calculated item named ‘One’ with the constant value 1; then use AggregateCells operator creating the ‘sumtid’ to get the current row number: AggregateCells(_Sum_, 'One'n, default, CellIndex(start, 0), CellIndex(current, 0)).

5. Now we can compute the HMA in a similar way as we do for previous two WMAs. Name it as ‘HMA for close (5 days)’. Since int(√(5 ))=2, the starting position of the aggregation is set to the previous row (-1) and the ending position is set to the current row (0). Note the operands are now using the aggregated item ‘sumtid’. The formula for the ‘HMA for close (5 days)’ item is:

AggregateCells(_Sum_, ( ( ( 2 * 'WMA(5/2 days)'n ) - 'WMA(5 days)'n ) * 'sumtid'n ), default, CellIndex(current, -1), CellIndex(current, 0)) / AggregateCells(_Sum_, 'sumtid'n, default, CellIndex(current, -1), CellIndex(current, 0))

So far, we’ve created the Hull Moving Average of IBM stock Close price and saved it in the calculated item ‘HMA for close (5 days)’. We can easily draw its time series plot in SAS Visual Analytics. Now, I'll create a Simple Moving Average of ‘SMA for the close (5 days)’ with an equal weight, and then compare it with the HMA. The formula for ‘SMA for the close (5 days)’ is: AggregateCells(_Average_, 'Close'n, default, CellIndex(current, -4), CellIndex(current, 0))

Now let’s visualize the ‘SMA for the close (5 days)’ and ‘HMA for close (5 days)’ respectively. In below chars, each grey vertical bar shows the monthly price span of IBM stock, and the red lines correspond to SMA and HMA respectively. With the upper SMA line, we see constant lags with price changing and poor smoothness. And with bottom HMA line, we see rapid keep-up with price activities while maintaining good smoothness.

Below is the comparison of the ‘SMA for the close (5 days)’, ‘HMA for close (5 days)’ and the Close price. Besides smoothing out some fluctuations in Close price, the HMA indeed gives better signal than SMA does in indicating a turning point when there is an upward/downward trend reversal. Note the obvious lags of SMA compared to HMA. For example, compare the trends around the reference line in the visualization below. The Close price reached to a local peak at Jun1992 and started to go down from Jul1992. HMA quickly reflected the downward turn with one lag at Aug1992, while SMA still showed the rising trend in the meantime. SMA started to go down with one more lag to give the reversal signal.

Now it’s easy to understand why HMA is a better indicator than SMA to signal the reversal point. What has been your experience with HMA?

How to Calculate Hull Moving Average in SAS Visual Analytics was published on SAS Users.

8月 272019
 

I was born in a country (Brazil) where voting is mandatory. Most of my family still lives there, and now that I live in the US, they ask me about American politics all the time. One thing that often catches them by surprise is that not only is voting not [...]

Examining voter registration data with SAS Visual Analytics was published on SAS Voices by Frank Silva

8月 092019
 

Opening Plenary session, Esri UC 2019

Several of my colleagues and I attended the annual Esri User Conference last month in San Diego - along with 18,000 other Geo professionals.  It was a busy week of meetings, seminars and talks about the latest in GIS and Spatial technologies.  The days were long and exhausting, but it was also exciting and a ton of fun.  As we continue to process, plan and prepare to integrate some of these technologies into SAS Visual Analytics, I thought it would be beneficial to highlight the Esri features available in VA today.

One topic that received a lot of questions during this year’s SAS Global Forum in Dallas was that of Geocoding.  Geocoding is the process of transforming text address data into numeric latitude and longitude values.  Once the latitude and longitude are known, they can be mapped and analyzed spatially.  SAS has offered geocoding capabilities for quite some time as a part of SAS/Graph.  Beginning with SAS v940m5, PROC GEOCODE has moved into BASE SAS.  See my colleague’s blog posts here and here for more information on geocoding from BASE SAS.

But Geocoding is no longer limited to just Base SAS.  You can also geocode from within Visual Analytics, thanks to the integration with the Esri geocoding api.  This feature is part of the Esri Premium agreement, and became available in VA 8.3.   Esri premium features require an existing relationship and credentials with Esri.  This post assumes that relationship exists and your credentials have been validated.  I will discuss the details of the Esri premium features in a future post, but for today the focus is how to use the Esri Geocoding feature from VA with a real-world data set.

1. Getting the data into Visual Analytics

We will be using point data from the City of Dallas for the Public Library branch locations.  You can download the .csv file from the Dallas Open Data portal.  After downloading, it must be imported into VA for geocoding.

  • From the Data tab in VA, select Import > Local File
  • Navigate to the location of the Dallas library .csv file and select it
  • Adjust the default settings, if desired, and click the ‘Import Item’ button
  • Once you see the green success message, the data has been imported into VA and is ready to be geocoded. Click the ‘Cancel’ button

Message indicating successful data import

2. Selecting the data columns to geocode on

Accessing the Geocoding feature in VA follows a similar process to the steps we just performed to import the .csv file.

  • From the Data tab in VA, select Import > Esri > Geocode. Here, you must select the location of the newly imported library data set.  This path will vary depending upon the configuration of your VA instance.  For my installation, it is located at cas-shared-default > Public folder > CITY_OF_DALLAS_LIBRARY_LOCATIONS.  Once located, click the 'Select' button
  • The Geocoding Import window will open. This window should look familiar.  The top half is the same as the Import data we just used to get the .csv file into VA.  Essentially, the geocoding process is a new data import.  It will send selected columns to Esri via a REST api call.  The response will contain the corresponding latitude and longitude values we desire.  They will be added to our existing data set and imported into VA as a new geocoded data set.  The name of the new data set will have _GEO_CODE appended to the end of the original data set name.  This name can be modified as desired.

Geocoding selection dialog window

  • At the bottom of the Geocoding Import window are two list boxes, Available items and Selected items. The Available items box on the left contains all columns in the data set.  Select the column(s) containing the address information you wish to geocode.  Double click or click the right arrow to move them to the Selected items window on the right.  In this example, we are using the Address column.
  • VA concatenates the selected column(s) to generate a sample address for geocoding. Clicking the ‘Test’ button returns coordinates for the sample address and a score representing the confidence level of the results.  In the screenshot above, our score is 71/100 for the test address.  Not bad, but it could be better.  More on this a bit later.
  • To finish the geocoding process, click the ‘Import Item’ at the top of the page, as we did with the original .csv file import. This time, you will be presented with a new dialog window.  Geocoding, as with other Esri premium features require the use of credits.  This dialog indicates how many Esri credits will be used by the geocoding process and will also be discussed in detail in a future post.

Esri credit usage alert dialog

For now, select 'Yes' to continue.  When you see the green success message, the operation is complete.  We are now ready to map our Dallas Library locations.  Click 'Ok' to open the new geocoded data set.

3. Create the geography variable and display the map

Next, we need to create our geography variable from the new geocoded data set.  As part of the geocoding process, four new columns have been added to the new data set: esri_latitude, esri_longitude, esri_score, esri_address.  We only need the esri_latitude and esri_longitude columns for our map.

  • Select the Branch Name category variable and change its Classification to Geography
  • For Geography data type, select Custom Coordinates
  • Select esri_latitude for Latitude
  • Select esri_longitude for Longitude
  • Click 'OK'
  • Drag the Branch Name geography variable to the canvas to create the map

Map of non-unique geocoded addresses

What happened??  Our data set contains Dallas Public library locations, so why are the data points spread across the world?  It’s all in the data.  If you look at the original data a bit deeper, you will notice the Address field we selected for the geocoding only contains the street number and street name of the library location.  It does not contain enough information to make it unique.  Therefore, during the geocoding process, the first instance of that address will be considered a match, regardless of where it is actually located.

Detailed view of incorrect geocoded address

In the image above for the Preston Royal branch, its street number and name were a perfect match to a location in Eugene, Oregon.  Not quite what we were looking for.  So, how do we fix this?  To make our addresses unique, it requires a simple addition to the source data .csv file.

Column selection to ensure unique addresses for geocoding

We need to add a ‘City’ and ‘State’ column to the original .csv file with the values of ‘Dallas’ and ‘Texas’ assigned to all entries.  This will ensure each address is unique and within our area of interest.  Re-import the new .csv file and geocode it using the Address, City and State columns.  The result?  A confidence score of a perfect 100.  Much better than our first attempt!  This will now give us the map we desire for the Dallas Public Library locations.

 

Final geocoded map of Dallas Library branches

In this post, I used real-world data to illustrate two things: the importance of knowing your data set, and how to geocode address information in SAS Visual Analytics.  Public data sets are a great resource but need to be used with a critical eye.  They may still need additional cleansing in order to work for your situation.

The geocoding feature is one example of the premium Esri features currently available in VA.  In future posts, I will go into more detail on other Esri features available, what make these features ‘premium’ and examples of how to use them.  Stay tuned!

Esri integration with SAS Visual Analytics: Geocoding was published on SAS Users.