March 19, 2020

At SAS Press, we agree with the saying “The best things in life are free.” And one of the best things in life is knowledge. That’s why we offer free e-books to help you learn SAS or improve your skills. In this blog post, we will introduce you to one of our amazing titles that is absolutely free.

SAS Programming for R Users

Many data scientists today need to know multiple programming languages, including SAS, R, and Python. If you already know basic statistical concepts and how to program in R but want to learn SAS, then SAS Programming for R Users by Jordan Bakerman was designed specifically for you! This free e-book explains how to write programs in SAS that replicate familiar functions and capabilities in R. The book covers a wide range of topics, including the basics of the SAS programming language, how to import data, how to create new variables, random number generation, linear modeling, Interactive Matrix Language (IML), and many other SAS procedures. It also explains how to write R code directly in the SAS code editor for seamless integration between the two tools.

The book is based on the free, 14-hour SAS Education course of the same name. Keep reading to learn more about the differences between SAS and R.

SAS versus R

R is an object-oriented programming language. Results of a function are stored in an object, and desired results are pulled from the object as needed. SAS revolves around the data table and uses procedures to create and print output. Results can be saved to a new data table.

Let’s briefly compare SAS and R in a general way. The following table outlines some of the major differences between the two.

Here are a few other things about SAS to note:

  • SAS has the flexibility to interact with objects. (However, the book focuses on procedural methods.)
  • SAS does not have a command line. Code must be run in order to return results.

SAS Programs

A SAS program is a sequence of one or more steps. A step is a sequence of SAS statements. There are only two types of steps in SAS: DATA and PROC steps.

  • DATA steps read from an input source and create a SAS data set.
  • PROC steps read and process a SAS data set, often generating an output report. "Procedure" is an umbrella term; procedures are what carry out the analysis. Think of a PROC step as a function in R.

Every step has a beginning and ending boundary. SAS steps begin with either of the following statements:

  • a DATA statement
  • a PROC statement

After a DATA or PROC statement, there can be additional SAS statements containing keywords that request that SAS perform an operation or that give information to the system. Think of them as additional arguments to a procedure. Statements always end with a semicolon!

SAS options are additional arguments, and they are specific to SAS statements. Unfortunately, there is no rule that says what is a statement versus what is an option; understanding the difference comes with a little experience. Options can be used to do the following (see the sketch after this list):

  • generate additional output like results and plots
  • save output to a SAS data table
  • alter the analytical method
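
To make the distinction concrete, here is a minimal sketch using the SASHELP.CLASS practice table (an assumption; any data set works). MEAN, STD, and MAXDEC= are options on the PROC MEANS statement, while CLASS, VAR, and OUTPUT are additional statements:

/* Options on the PROC statement request statistics and formatting */
proc means data=sashelp.class mean std maxdec=2;
   class sex;                           /* statement: group the analysis */
   var height weight;                   /* statement: analysis variables */
   output out=work.class_stats mean=;   /* statement: save output to a
                                           SAS data table (OUT= option) */
run;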

SAS detects the end of a step when it encounters one of the following statements:

  • a RUN statement (for most steps)
  • a QUIT statement (for some procedures)

Most SAS steps end with a RUN statement. Think of the RUN statement as the right parenthesis of an R function. The following table shows an example of a SAS program that has a DATA step and a PROC step. You can see that both SAS steps end with RUN statements, while the R functions begin and end with parentheses.
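
Here is a minimal sketch of such a program, assuming the SASHELP.CLASS practice table, with rough R equivalents noted in comments:

/* DATA step: read an input source and create a SAS data set */
/* (roughly: tall <- subset(class, Height > 60) in R)        */
data work.tall;
   set sashelp.class;
   where height > 60;
run;            /* RUN ends the step, like R's closing parenthesis */

/* PROC step: read and process the data set, produce a report */
/* (roughly: print(tall) in R)                                 */
proc print data=work.tall;
run;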

If you want to learn more about this book or any other free e-books from SAS Press, visit https://support.sas.com/en/books/free-books.html. Subscribe to our newsletter to get the latest information on new books.

Free e-book: SAS Programming for R Users was published on SAS Users.

February 14, 2020

In honor of Valentine’s Day, we thought it would be fitting to present an excerpt from a paper about the LIKE operator, because when you like something a lot, it may lead to love! If you want more, you can read the full paper “Like, Learn to Love SAS® Like” by Louise Hadden, which won best paper at WUSS 2019.

Introduction

SAS provides numerous time- and angst-saving techniques to make the SAS programmer’s life easier. Among those techniques are the ability to search and select data using SAS functions and operators in the data step and PROC SQL, as well as the ability to join data sets based on matches at various levels. This paper explores how LIKE is featured in each one of these techniques and is suitable for all SAS practitioners. I hope that LIKE will become part of your SAS toolbox, too.

Smooth Operators

SAS operators are used to perform a number of functions: arithmetic calculations, comparing or selecting variable values, or logical operations. Operators are loosely grouped as “prefix” (for example, a sign before a variable) or “infix,” which generally perform an operation BETWEEN two variables. Arithmetic operations using SAS operators may include exponentiation (**), multiplication (*), and addition (+), among others. Comparison operators may include greater than (>, GT) and equals (=, EQ), among others. Logical, or Boolean, operators include AND, OR, and NOT, and serve the purpose of grouping SAS operations. Some operations that are performed by SAS operators have been formalized in functions. A good example of this is the concatenation operators (|| and !!) and the more powerful CAT functions, which perform similar, but not identical, operations. LIKE operators are most frequently utilized in the DATA step and in PROC SQL.

There is a category of SAS operators that act as comparison operators under special circumstances, generally in WHERE statements in PROC SQL, the DATA step, and DS2, and in subsetting IF statements in the DATA step. These operators include the LIKE and SOUNDS LIKE operators, as well as the CONTAINS and SAME-AND operators. It is beyond the scope of this short paper to discuss all the smooth operators, but they are definitely worth a look.

LIKE Operator

Character operators are frequently used for “pattern matching,” that is, evaluating whether a variable value equals, does not equal, or sounds like a specified value or pattern. The LIKE operator is a case-sensitive character operator that employs two special “wildcard” characters to specify a pattern: the percent sign (%) indicates any number of characters in a pattern, while the underscore (_) indicates the presence of a single character per underscore in a pattern. The LIKE operator is akin to the GREP utility available on Unix/Linux systems in terms of its ability to search strings.
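
As a quick sketch of both wildcards in a WHERE statement (WORK.PEOPLE and its NAME variable are hypothetical):

data work.matches;
   set work.people;
   where name like 'Sm%';      /* any name starting with 'Sm'            */
   /* where name like 'Sm_th';    exactly one character per underscore:  */
   /* matches 'Smith' or 'Smyth', but not 'Smooth'                       */
run;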

The LIKE operator also includes an escape routine in case you need to search for a string that includes a special character such as the caret, the underscore, or the percent sign. An example of the escape syntax, when looking for the literal string "100%", is:

where yourvar like '100@%' escape '@';

Here the '@' is declared as the escape character, so the percent sign that follows it is treated as a literal character rather than a wildcard.

Additionally, SAS practitioners can use the NOT LIKE operator to select variables WITHOUT a given pattern. Please note that the LIKE operator is case-sensitive. You can use the UPCASE, LOWCASE, or PROPCASE functions to adjust input strings prior to the comparison. You may also combine multiple LIKE conditions with the AND or OR operators.
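
A sketch combining these ideas, again on the hypothetical WORK.PEOPLE table:

data work.subset;
   set work.people;
   where upcase(name) like 'SM%'    /* case-adjusted pattern match */
     and name not like '%test%';    /* NOT LIKE excludes a pattern */
run;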

SOUNDS LIKE Operator

The LIKE operator, described above, searches the actual spelling of operands to make a comparison. The SOUNDS LIKE operator uses phonetic values to determine whether character strings match a given pattern. As with the LIKE operator, the SOUNDS LIKE operator is useful when there are misspellings and similar-sounding names in strings to be compared. The SOUNDS LIKE operator is denoted with the shortcut '=*'. SOUNDS LIKE is based on SAS's SOUNDEX algorithm. Strings are encoded by retaining the original first letter, stripping all letters that are or act as vowels (A, E, H, I, O, U, W, Y), and then assigning numbers to groups: 1 includes B, F, P, and V; 2 includes C, G, J, K, Q, S, X, Z; 3 includes D and T; 4 includes L; 5 includes M and N; and 6 includes R. “Tristn” therefore becomes T6235, as do Tristan, Tristen, Tristian, and Tristin.
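
To see the encoding in action, here is a sketch using the SOUNDEX function that underlies the operator; the log should show T6235 for each spelling:

data _null_;
   length name $ 8 code $ 8;
   do name = 'Tristn', 'Tristan', 'Tristen';
      code = soundex(name);    /* phonetic encoding, e.g. T6235 */
      put name= code=;
   end;
run;

/* The operator itself, on a hypothetical WORK.PEOPLE table: */
/* where name =* 'Tristan';                                  */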

For more on the SOUNDS LIKE operator, please read the documentation.

Joins with the LIKE Operator

It is possible to select records with the LIKE operator in PROC SQL with a WHERE statement, including with joins. For example, the code below selects records from the SASHELP.ZIPCODE file that are in the state of Massachusetts and are for a city that begins with “SPR”.

proc sql;
   create table temp1 as
   select a.city,
          a.countynm,
          a.city2,
          a.statename,
          a.statename2
   from sashelp.zipcode as a
   where upcase(a.city) like 'SPR%'
     and upcase(a.statename) = 'MASSACHUSETTS';
quit;

The test print of table TEMP1 shows only cases for Springfield, Massachusetts.

The code below joins SASHELP.ZIPCODE and a copy of the same file with a renamed key column (city --> geocity), again selecting records for the join that are in the state of Massachusetts and are for a city that begins with “SPR”.

proc sql;
   create table temp2 as
   select a.city,
          b.geocity,
          a.countynm,
          a.statename,
          b.statecode,
          a.x,
          a.y
   from sashelp.zipcode as a,
        zipcode2 as b
   where a.city = b.geocity
     and upcase(a.city) like 'SPR%'
     and b.statecode = 'MA';
quit;

The test print of table TEMP2 shows only cases for Springfield, Massachusetts with additional variables from the joined file.

The LIKE “Condition”

The LIKE operator is sometimes referred to as a “condition,” generally in reference to character comparisons where the prefix of a string is specified in a search. LIKE “conditions” are restricted to the DATA step because the colon modifier is not supported in PROC SQL. The syntax for the LIKE “condition” is:

where firstname =: 'Tr';

This statement would select all the first names in Table 2 above that begin with 'Tr'. To accomplish the same goal in PROC SQL, the LIKE operator can be used with a trailing % in a WHERE statement.
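
Side by side, the two equivalent prefix searches (WORK.PEOPLE and its FIRSTNAME variable are hypothetical):

data work.tr_names;                 /* DATA step: the colon modifier */
   set work.people;
   where firstname =: 'Tr';
run;

proc sql;                           /* PROC SQL: LIKE with a trailing % */
   create table work.tr_names2 as
   select *
   from work.people
   where firstname like 'Tr%';
quit;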

Conclusion

SAS provides practitioners with several useful techniques using LIKE, including the smooth LIKE operator/condition in both the DATA step and PROC SQL. There’s definitely reason to like LIKE in SAS programming.

To learn more about SAS Press, check out our up-and-coming titles, and to receive exclusive discounts make sure to subscribe to our newsletter.

Learn to Love SAS LIKE was published on SAS Users.

January 16, 2020

This blog is part of a series on SAS Visual Data Mining and Machine Learning (VDMML).

If you're new to SAS VDMML and you want a brief overview of the features available, check out my last blog post!

This blog will discuss types of missing data and how to use imputation in SAS VDMML to improve your predictions.

 
Imputation is an important aspect of data preprocessing that has the potential to make (or break) your model. The goal of imputation is to replace missing values with values that are close to what the missing value might have been. If successful, you will have more data with which the model can learn. However, you could introduce serious bias and make learning more difficult.

When considering imputation, you have two main tasks:

  • Try to identify the missing data mechanism (i.e., what caused your data to be missing)
  • Determine which method of imputation to use

Why is your data missing?

Identifying exactly why your data is missing is often a complex task, as the reason may not be apparent or easily identified. However, determining whether there is a pattern to the missingness is crucial to ensuring that your model is as unbiased as possible, and that you have enough data to work with.

One reason imputation may be used is that many machine learning models cannot use missing values and only do complete case analysis, meaning that any observation with missing values will be excluded.

Why is this a problem? Well, look at the percentage of missing data in this dataset...

Percentage of missing data for columns in HMEQ dataset

View the percentage of missing values for each variable in the Data tab within Model Studio.  Two variables are highlighted to demonstrate a high percentage of missingness (DEBTINC - 21%, DEROG - 11%).

If we use complete case analysis with this data, we will lose at least 21% of our observations! Not good.
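
One quick way to check this in code is PROC MEANS with the N and NMISS statistics. This sketch assumes the HMEQ sample table is available as SAMPSIO.HMEQ; substitute your own table if it is not:

/* Counts of non-missing (N) and missing (NMISS) values per variable */
proc means data=sampsio.hmeq n nmiss;
run;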

Missing data can be classified into three types:

  • missing completely at random (MCAR),
  • missing at random (MAR),
  • and missing not at random (MNAR).

Missing completely at random (MCAR) indicates the missingness is unrelated to any of the variables in the analysis, which is a rare case for missing data. This means that neither the values that are missing nor the other variables associated with those observations are related to the missingness. In this case, complete case analysis should not produce a biased model. However, if you are concerned about losing data, imputing the missing values is still a valid approach.

Data can also be missing at random (MAR), where the missing variable value is random and unrelated to the missingness, but the missingness is related to the other variables in the analysis. For example, let’s say a survey was sent out in the mail asking participants about their activity levels, and a portion of surveys from a certain region were lost in the mail. In this case, the data is not missing completely at random, but the missingness is unrelated to the variable we are interested in, activity level.

Dramatization of potential affected survey participants

An important consideration is that even though the missingness is unrelated to activity level, the data could be biased if participants in that region are particularly fit or lazy. Therefore, if we had data from other people in the affected region, we may be able to use that information in how we impute the data. However, imputing MAR data will bias the results; to what extent depends on your data and method of imputation.

In some cases, the data is missing not at random (MNAR) and the missingness could provide information that is useful to prediction. MNAR data is non-ignorable and must be addressed or your model will be biased.  For example, if a survey asks a sensitive question about a person's weight, a higher proportion of people with a heavier weight may withhold that information -- creating bias within the missing data. One possible solution is to create another level or label for the missing value.

The bottom line is that whether you should use imputation for missing values is completely dependent on your data and what the missing value represents. Be diligent!

If you have determined that imputation may be beneficial for your data, there are several options to choose from. I will not go into detail on the types of imputation available, as the type you use will be highly dependent on your data and goals, but I have listed a few examples below to give you a glimpse of the possibilities, with a short code sketch after the list.

Examples of types of imputations:

  • Mean, Median
  • Regression
  • Hot-deck
  • Last Observation Carried Forward (LOCF - for longitudinal data)
  • Multiple Imputation
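
As one coded counterpart to the list above, mean imputation can be sketched with PROC STDIZE from SAS/STAT (METHOD=MEDIAN works the same way). REPONLY replaces only the missing values; the table and variable choices are assumptions:

proc stdize data=sampsio.hmeq out=work.hmeq_imputed
            method=mean reponly;    /* replace missing values with the mean */
   var clage debtinc;               /* hypothetical choice of variables     */
run;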

How to Impute in SAS VDMML

Now, I will show an example of imputation in SAS Viya Model Studio (SAS VDMML).

First, I created a new project, selecting Data Mining and Machine Learning and the default template.

On the Data tab, I will specify how I want certain features to be imputed. This does not do the imputation for us! Think of the specifications on the Data tab as a blueprint for how you want the machine learning pipeline to handle the data.

To start, I selected the CLAGE feature and changed the imputation to "Mean" (see demonstration below). Then, I selected DELINQ and changed the imputation to "Median". For NINQ, I selected "Minimum" as the imputed value. For DEBTINC and YOJ, I set the imputation for both to a custom constant value (50 and 100, respectively).

Selecting to impute the mean value for the CLAGE variable

Now, I will go over to the Pipeline tab and add an Imputation node (this is where the work gets done!).

Adding an Imputation node to the machine learning pipeline

Selecting the Imputation node, I view the options and change the default operations for imputation to none. I changed this because we only want the previously specified variables to be imputed. If you keep the default imputation instead, SAS VDMML automatically imputes all missing values, setting Class (categorical) variables to the median and Interval (numerical) variables to the mean.

Updating default imputation values to none for Categorical and Interval variables

Looking at the results, we can see which variables imputation was performed on, what value was used, and how many values for each variable were replaced.

Results of running the Imputation node on the data

And that's how easy it is to impute variables in SAS Model Studio!

Interested in checking out SAS Viya?

Machine Learning Using SAS Viya is a course that teaches machine learning basics, gives instruction on using SAS Viya VDMML, and provides access to the SAS Viya for Learners software, all for $79. This course is the prerequisite for the SAS Certified Specialist in Machine Learning certification. Going through the course myself, I was able to quickly learn how to use SAS VDMML and received a refresher on many data preprocessing tactics and machine learning concepts.

Want to learn more? 

Stay tuned; I will be posting more blogs with examples of specific features in SAS VDMML. If there are any specific features you would like to know more about, leave a comment below!

Missing data? Quickly impute values using SAS Viya was published on SAS Users.

September 4, 2019

Editor's Note: SAS' Evan Mann contributed to this post.

First came SAS' reputation as a great place to work. Next came a storytellers article series offering a glimpse of the people behind the brand.

Now there's the SAS Users YouTube Channel, where tutorial videos provide a window into some of our personalities—trainers initially, but future contributors will include experts from other areas of SAS.

"Our goal is to make the SAS Users channel on YouTube a ‘must-visit’ channel for those who have the desire to grow their analytical skill set and to learn SAS," said Principal IT Software Developer Michael Penwell, who manages the channel. "It's for both new and experienced users."

My SAS Support Communities teammates Anna Brown and Thiago De Souza produce the videos and Karen Feldman, Senior Technical Learning and Development Specialist, lines up the on-camera talent. Brown explains the idea behind this approach: “Infusing personality into these how-to videos welcomes users into our SAS family in a way we never have before. With this no-script, natural style, we hope that they feel like they’re not being told how to do something in SAS – but rather they’re learning right along with us, together. We’re simply a guide to help users achieve whatever they can dream up in SAS.”

A casual, conversational approach

As Brown describes, instructors walk viewers through the subject matter using a casual, conversational tone - almost vlog style. This approach lets their personalities come through.

Anna Yarbrough, who joined SAS Education last year, travels to teach SAS programming. She likes interacting with SAS users IRL and relishes a good coding challenge. She keeps an eye out for users' questions on her videos and enjoys the digital banter. When this video on merging data sets was published, it received the most user comments right off the bat. Why so much interest in merging data?

"When working with structured data," said Yarbrough, "it’s never going to all be stored in one massive data set. There will be smaller tables that can be linked together based on a common column or columns like social security number, or patient number, or something else that uniquely identifies each row of data."

"When you can bring these tables together," Yarbrough added, "you have the ability to answer more complex business questions. I think people are really interested in covering this topic, as well as converting columns from character to numeric, just because it is something that needs to be done all the time."

While building out the videos, Yarbrough recalled questions she had when she first learned SAS. (Don't be surprised if they become topics for future tutorials.) "It's really important as an instructor to be able to put yourself in the shoes of our newer SAS programmers," she said. "Even though we spend so much time covering this content, we need to remember what it’s like to hear about these topics for the first time so that we can present material in a clear and easy-to-follow way. It’s been awesome interacting with users through YouTube."

Check out the first few


Ari Zitin: Decision Trees and Neural Networks

Kathy Kiraly: SAS to Excel

Kathy Kiraly: Excel to SAS

Anna Rakers: Online Resources

Mark Stevens: Certification Tips

Mark Stevens: Performance-Based Questions

Peter Styliadis: Two Common Scenarios

Anna Yarbrough: Character Functions

Anna Yarbrough: Merge Data Sets

 

Subscribe to the channel

Subscribe, like, and share the SAS Users YouTube channel to get notified of new videos and help fellow SAS users find them.


YouTube.com/SASusers: a pop of personality was published on SAS Users.

August 9, 2019

Opening Plenary session, Esri UC 2019

Several of my colleagues and I attended the annual Esri User Conference last month in San Diego - along with 18,000 other Geo professionals.  It was a busy week of meetings, seminars and talks about the latest in GIS and Spatial technologies.  The days were long and exhausting, but it was also exciting and a ton of fun.  As we continue to process, plan and prepare to integrate some of these technologies into SAS Visual Analytics, I thought it would be beneficial to highlight the Esri features available in VA today.

One topic that received a lot of questions during this year’s SAS Global Forum in Dallas was geocoding.  Geocoding is the process of transforming text address data into numeric latitude and longitude values.  Once the latitude and longitude are known, they can be mapped and analyzed spatially.  SAS has offered geocoding capabilities for quite some time as a part of SAS/GRAPH.  Beginning with SAS 9.4M5, PROC GEOCODE moved into Base SAS.  See my colleague’s blog posts here and here for more information on geocoding from Base SAS.
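
As a minimal sketch of Base SAS geocoding, here is a ZIP-level lookup against the SASHELP.ZIPCODE table; WORK.ADDRESSES is a hypothetical input data set containing a ZIP variable:

proc geocode method=zip             /* match on the ZIP code variable */
   data=work.addresses              /* input addresses (hypothetical) */
   out=work.geocoded                /* output gains X/Y coordinates   */
   lookup=sashelp.zipcode;          /* lookup data shipped with SAS   */
run;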

But geocoding is no longer limited to just Base SAS.  You can also geocode from within Visual Analytics, thanks to integration with the Esri geocoding API.  This feature is part of the Esri Premium agreement and became available in VA 8.3.  Esri premium features require an existing relationship and credentials with Esri.  This post assumes that relationship exists and your credentials have been validated.  I will discuss the details of the Esri premium features in a future post, but for today the focus is how to use the Esri geocoding feature from VA with a real-world data set.

1. Getting the data into Visual Analytics

We will be using point data from the City of Dallas for the Public Library branch locations.  You can download the .csv file from the Dallas Open Data portal.  After downloading, it must be imported into VA for geocoding.

  • From the Data tab in VA, select Import > Local File
  • Navigate to the location of the Dallas library .csv file and select it
  • Adjust the default settings, if desired, and click the ‘Import Item’ button
  • Once you see the green success message, the data has been imported into VA and is ready to be geocoded. Click the ‘Cancel’ button

Message indicating successful data import

2. Selecting the data columns to geocode on

Accessing the Geocoding feature in VA follows a similar process to the steps we just performed to import the .csv file.

  • From the Data tab in VA, select Import > Esri > Geocode. Here, you must select the location of the newly imported library data set.  This path will vary depending upon the configuration of your VA instance.  For my installation, it is located at cas-shared-default > Public folder > CITY_OF_DALLAS_LIBRARY_LOCATIONS.  Once located, click the 'Select' button
  • The Geocoding Import window will open. This window should look familiar: the top half is the same as the import window we just used to get the .csv file into VA.  Essentially, the geocoding process is a new data import.  It will send selected columns to Esri via a REST API call.  The response will contain the corresponding latitude and longitude values we desire.  They will be added to our existing data set and imported into VA as a new geocoded data set.  The name of the new data set will have _GEO_CODE appended to the end of the original data set name.  This name can be modified as desired.

Geocoding selection dialog window

  • At the bottom of the Geocoding Import window are two list boxes, Available items and Selected items. The Available items box on the left contains all columns in the data set.  Select the column(s) containing the address information you wish to geocode.  Double click or click the right arrow to move them to the Selected items window on the right.  In this example, we are using the Address column.
  • VA concatenates the selected column(s) to generate a sample address for geocoding. Clicking the ‘Test’ button returns coordinates for the sample address and a score representing the confidence level of the results.  In the screenshot above, our score is 71/100 for the test address.  Not bad, but it could be better.  More on this a bit later.
  • To finish the geocoding process, click the ‘Import Item’ button at the top of the page, as we did with the original .csv file import. This time, you will be presented with a new dialog window.  Geocoding, like other Esri premium features, requires the use of credits.  This dialog indicates how many Esri credits will be used by the geocoding process; credits will also be discussed in detail in a future post.

Esri credit usage alert dialog

For now, select 'Yes' to continue.  When you see the green success message, the operation is complete.  We are now ready to map our Dallas Library locations.  Click 'Ok' to open the new geocoded data set.

3. Create the geography variable and display the map

Next, we need to create our geography variable from the new geocoded data set.  As part of the geocoding process, four new columns have been added to the new data set: esri_latitude, esri_longitude, esri_score, esri_address.  We only need the esri_latitude and esri_longitude columns for our map.

  • Select the Branch Name category variable and change its Classification to Geography
  • For Geography data type, select Custom Coordinates
  • Select esri_latitude for Latitude
  • Select esri_longitude for Longitude
  • Click 'OK'
  • Drag the Branch Name geography variable to the canvas to create the map

Map of non-unique geocoded addresses

What happened?  Our data set contains Dallas Public Library locations, so why are the data points spread across the world?  It’s all in the data.  If you look at the original data a bit deeper, you will notice the Address field we selected for the geocoding only contains the street number and street name of the library location.  It does not contain enough information to make it unique.  Therefore, during the geocoding process, the first instance of that address will be considered a match, regardless of where it is actually located.

Detailed view of incorrect geocoded address

In the image above for the Preston Royal branch, its street number and name were a perfect match to a location in Eugene, Oregon.  Not quite what we were looking for.  So, how do we fix this?  Making our addresses unique requires a simple addition to the source .csv file.

Column selection to ensure unique addresses for geocoding

We need to add a ‘City’ and ‘State’ column to the original .csv file with the values of ‘Dallas’ and ‘Texas’ assigned to all entries.  This will ensure each address is unique and within our area of interest.  Re-import the new .csv file and geocode it using the Address, City and State columns.  The result?  A confidence score of a perfect 100.  Much better than our first attempt!  This will now give us the map we desire for the Dallas Public Library locations.
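
If you would rather fix the data in SAS than edit the .csv by hand, here is a DATA step sketch (WORK.LIBRARIES is a hypothetical import of the file):

data work.libraries_unique;
   set work.libraries;
   length city $ 6 state $ 5;
   city  = 'Dallas';    /* constant columns make each address unique */
   state = 'Texas';
run;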

 

Final geocoded map of Dallas Library branches

In this post, I used real-world data to illustrate two things: the importance of knowing your data set, and how to geocode address information in SAS Visual Analytics.  Public data sets are a great resource but need to be used with a critical eye.  They may still need additional cleansing in order to work for your situation.

The geocoding feature is one example of the premium Esri features currently available in VA.  In future posts, I will go into more detail on other Esri features available, what makes these features ‘premium’, and examples of how to use them.  Stay tuned!

Esri integration with SAS Visual Analytics: Geocoding was published on SAS Users.

July 12, 2019

Are you a seasoned data scientist looking for a fast, all-inclusive machine learning solution? Curious about machine learning but have little to no programming experience? Interested in using AI to take over the world? Follow my lead and use SAS VDMML to fast track your world domination.

This blog is the beginning of a series on  SAS Visual Data Mining and Machine Learning (VDMML) told from my perspective as a first-time SAS Viya user, Graduate Intern at SAS, and ABD PhD Candidate in Computer Science. I'm writing this series for two main reasons: 1) to express how surprised I am at seeing how easily complex tasks can be completed after doing it the hard way for years and 2) to provide examples to convince you, too.  

SAS VDMML is only one of many products available in SAS Viya®. Its distinguishing feature is its machine learning pipelines, which are created in a single, integrated in-memory environment via a drag-and-drop interface.

In this post, I will provide a high-level overview of a few of the features available in SAS VDMML. In the next posts, I will provide detailed examples and code comparisons for individual features, such as pipeline creation and autotuning. 

Tip: At the bottom of the post, I talk about a course on machine learning using SAS Viya that provides access to the software and teaches machine learning basics. 

Simple custom pipeline using SAS Viya

If you've never used SAS VDMML, here are the top 3 reasons why I think you should check it out. 

A sample of the variety of hyperparameter modifications available.

SAS VDMML creates a simplified approach to machine learning solutions beneficial to people with a wide range of expertise.

Have you been programming for as long as you can remember and are well-versed in the machine learning world?

Why spend all your precious mental energy focusing on tedious programming tasks? Instead, you could be focusing your energy on diving deeper into your data and discovering the extent of its modeling capabilities.

After spending only a week familiarizing myself with the interface, I felt confident I could perform my normal tasks with ease and with better hyperparameter tuning and more comprehensive model evaluations.

If you are wondering whether these simplifications limit the customization of models, think again. For most uses, the customization options available through features such as drop-down menus and editable text boxes match the level of control programming provides, while eliminating unnecessary mental overhead.

 

Open Source Code node as a supervised learning node and example code in code editor

Still itching to program?

You have options: the SAS Code node and the Open Source Code node (available for use with R and Python). Both nodes can be used in any part of the pipeline, including preprocessing, supervised learning, and post-processing.

For example, you may have preprocessing code for extra messy data already written in R. All you need to do is add the Open Source Code node, insert your code, and update the variables to match the provided macros. Or, maybe you want to try the Deep Learning toolkit and CAS action? Drop a SAS Code node into your pipeline, add your code, and you are good to go!

 

Little to no programming experience or not quite a machine learning expert?

The drag and drop interface, wide selection of templates, and the extensive evaluations allow for almost anyone to produce professional-level results in a matter of minutes. While using SAS VDMML might not require expert-level knowledge on machine learning, important projects should have an expert review the approach and results. 

Example of creating a new pipeline using an advanced template

A sample of options for preprocessing and supervised learning.

The days of spending weeks programming scripts for feature extraction, fine-tuning models, and evaluating your model are over!

For example, let's say I'm attempting to impute some variables using R. First, I might store the names of each column separately, based on the type of imputation I want to perform on it. Then, I could create the code for each type of imputation. If I'm only attempting two different types of imputation, I will most likely need less than 10 lines of code. Not much, right?

But, I will also need to test and verify that each variable has been imputed correctly.

Instead, the same task in SAS VDMML would just require you to drop an Imputation node into the pipeline and select via a drop-down menu how to impute the variables - no time wasted.

Additionally, you can quickly compare a variety of supervised learning methods as well as test out the same model with different pre-processing methods using the automatic evaluations provided.  

Looking to save even more time?

You can use the autotuning feature to select the best set of hyperparameters for your model by turning it on in your supervised learning node of choice and hitting run. 

Turn on the autotuning feature inside your supervised learning node, then adjust ranges for the hyperparameters.

After the run is complete, view the supervised learning model's results to see the best configuration of parameters as determined by autotuning. 

Example of the results after using autotuning for a decision tree.

All of this can be accomplished with a few clicks, which eliminates the hours spent debugging scripts and connecting the steps in your workflow.

You’ve spent the last hour transforming and creating features in your code editor of choice. Now, after waiting 30 minutes for your model to run again, you get the same results! How? Wait...you’ve forgotten to update the reference to your new data, AGAIN. (This has definitely not happened to me.)

Fortunately, SAS VDMML only allows you to view results if the pipeline is up-to-date, which ensures that all changes are accounted for. Now, instead of checking and checking again that I passed the right data to the right functions, I can immediately know that my small tweak had no effect on the results. *sigh* Or, on a brighter note, that the drastic improvement is not a fluke!

Updating the Feature Extraction node resets all child nodes below - ensuring that the pipeline stays up-to-date.

Interested in checking out SAS Viya?

Machine Learning Using SAS Viya is a course that teaches machine learning basics, gives instruction on using SAS Viya VDMML, and provides access to the SAS Viya for Learners software, all for $79. This course is the prerequisite for the SAS Certified Specialist in Machine Learning certification. Going through the course myself, I was able to quickly learn how to use SAS VDMML and received a refresher on many data preprocessing tactics and machine learning concepts.

Want to learn more?

Stay tuned!

I will be posting blogs with in-depth examples of specific features in SAS VDMML and adding links to the new blog posts here as they are posted. If there are any specific features you would like to know more about, leave a comment below!

Visual machine learning using SAS Viya: a Graduate Intern’s perspective was published on SAS Users.


May 30, 2019

Human behavior is fascinating. We come in so many shapes, sizes and backgrounds. Doesn’t it make sense that any tests we write also accommodate our wonderful differences?

This picture is of Miko, a northern rescue and a recent addition to my family. He’s learning to live in an urban household and doing great with some training. He’s going through so many new tests as he adapts to life in the city, which is quite different from being free in the northern territories. Watch for a later post on his training successes.

I’m so happy to share how SAS has been helping candidates by offering a variety of certification credentials geared towards testing for differences and preferences in thought. If you are wondering: yes, I’ve been addicted to psychometrics for a while now; anything human behavior-related interests me. I thought I would begin by sharing some different types of testing roles that I have held in the past.

1. Psychometric testing

Before I joined SAS, I worked at CSI. To answer that unspoken thought, dear reader: CSI has been providing financial training and accreditation since 1964, way before CSI the TV show became popular.

My role as Test Manager was super exciting for someone with a curiosity for analytics and helping people succeed. In a team of four, we scored over 200 exams to provide credentials. Psychometrics was the most exciting part of my job: analyzing the performance of test takers to constantly improve our tests. Psychometric tests are used to identify a candidate's skills, knowledge and personality.

2. Multiple-choice testing

While setting multiple-choice exam questions, I learned that it was ideal for the four answer choices to be similar in length and complexity (e.g., if candidates typically chose option A for a question whose correct response was B, we would dig deeper to compare the lengths and language of the options, and then change an option if the review committee agreed).

3. Adaptive testing

Prior to CSI, I worked at the test center of DeVry Institute of Technology. In adaptive testing, the test’s difficulty adapts to candidate performance. A correct response leads to a more complex question. On the flip side, an incorrect response leads to an easier next question, so that, eventually, we could help candidates decide which engineering program would be the right skill fit.

This is where I met the student who asked, “can my boyfriend write my exam?”

4. Performance testing

With SAS at the forefront of analytics, it should come as no surprise that certification exams have evolved to the next level. As a certification candidate you can now try out performance-based testing.

A performance test requires a candidate to actually perform a task, rather than simply answering questions. An example is writing SAS code. Instead of answering a knowledge-level multiple choice exam about SAS code, the candidate is asked to actually write code to arrive at answers.

Certification at SAS

SAS Certified Specialist: Base Programming Using SAS 9.4 is great for those who can demonstrate ease in putting into practice the knowledge learned in the Foundation Programming classes 1 and 2. During this performance-based exam, candidates access a SAS environment. Coding challenges are presented, and candidates must write and execute SAS code to determine the correct answers to a series of questions.

The SAS® Certified Base Programmer for SAS®9 credential remains, but the exam will be retired in June 2019.

While writing this post I came across this on Wikipedia: it shows how the study of adaptive behavior goes back to Darwin’s time. It’s a good read for anyone intrigued by the science and art of testing.

“Charles Darwin was the inspiration behind Sir Francis Galton who led to the creation of psychometrics. In 1859, Darwin published his book The Origin of Species, which pertained to individual differences in animals. This book discussed how individual members in a species differ and how they possess characteristics that are more adaptive and successful or less adaptive and less successful. Those who are adaptive and successful are the ones that survive and give way to the next generation, who would be just as or more adaptive and successful. This idea, studied previously in animals, led to Galton's interest and study of human beings and how they differ one from another, and more importantly, how to measure those differences.”

Are you fascinated by the science and art of human behavior as it relates to testing? Are you as excited as I am about the possibilities of performance-based testing? I would love to hear your comments below.

New at SAS: Psychometric testing was published on SAS Users.

May 21, 2019

If you spend any time working with maps and spatial data, having a fundamental understanding of coordinate systems and map projections becomes necessary.  It’s the foundation of how spatial data and maps work.  These areas invariably evoke trepidation and some angst, even in the most seasoned map professional.  And rightfully so; it can get complicated quickly. Fortunately, most of those worries can be set aside when creating maps with SAS Visual Analytics, without requiring a degree in geodesy.

Visual Analytics includes several different coordinate system definitions configured out of the box.  Like the predefined geography types (see Fundamentals of SAS Visual Analytics geo maps), they are selected from a drop-down list during the geography variable setup.  With the details handled by VA, all you need to know is which coordinate space your data uses, so you can select the appropriate one.

The four Coordinate spaces included with VA are:

  1. World Geodetic System (WGS84)
    Area of coverage: World.  Used by GPS navigation systems and NATO military geodetic surveying.  This is the VA default and should work in most situations.
  2. Web Mercator
    Area of coverage: World.  Format used by Google maps, OpenStreetMap, Bing maps and other web map providers.
  3. British National Grid (OSGB36)
    Area of coverage: United Kingdom – Great Britain, Isle of Man
  4. Singapore Transverse Mercator (SVY21)
    Area of coverage: Singapore onshore/offshore

But what if your data does not use one of these?  For those situations, VA also supports custom coordinate spaces.  With this option, you can specify the definition of your desired coordinate space using industry standard formats for EPSG codes or Proj4 strings.  Before we get into the details of how to use custom coordinate spaces in VA, let’s take a step back and review the basics of coordinate spaces and projections.

Background

A coordinate space is simply a grid designed to cover a specific area of the Earth.  Some have global coverage (WGS84, the default in VA) and others cover relatively small areas (SVY21/Singapore Transverse Mercator).  Each coordinate space is defined by several parameters, including but not limited to:

  • Center coordinates (origin)
  • Coverage area (‘bounds’ or ‘extent’)
  • Unit of measurement (feet or meters)

Comparison of coordinate space definitions included in Visual Analytics -- Source: http://epsg.io

The image above compares the four coordinate space definitions included with VA.  The two on the right, BNG and Singapore Transverse Mercator, have a limited extent.  A red rectangle outlines the area of coverage for each region.  The two on the left, WGS84 and Mercator, are both world maps.  At first glance, they may appear to have the same coverage area, but they are not interchangeable.  The origin for both is located at the intersection of the Equator and the Prime Meridian.  However, the similarities end there.  Notice the extent for WGS84 covers the entire latitude range, from -90 to +90.  Mercator, on the other hand, covers from -85 to +85 latitude, so the first 5 degrees from each Pole are not included.  Another difference is the unit of measurement.  WGS84 is measured in unprojected degrees, which is indicative of a spherical Geographic Coordinate System (GCS).  Mercator uses meters, which implies a Projected Coordinate System (PCS) used for a flat surface, i.e., a screen or paper.

The projection itself is a complex mathematical operation that transforms the spherical surface of the GCS into the flat surface of the PCS.  This transformation introduces distortion in one or more qualities of the map: shape, area, direction, or distance.  The process of map projection compares to peeling an orange. Removing the peel and placing it on a flat surface will cause parts of it to stretch, tear or separate as it flattens. The same thing happens to a map projection.

A flat map will always have some degree of distortion.  The amount of distortion depends on the projection used.  Select a projection that minimizes the distortion in the areas most important to the map.  For example, are you creating a navigation map where direction is critical?  How about a world map to compare the land mass of various countries?  Or maybe a local map of municipal services where all factors are equally important?  These decisions are important if you are collecting and creating your data set from the field.  But, if you are using existing data sets, chances are that decision has already been made for you.  It then becomes a task of understanding what coordinate system was selected and how to use it within VA.

Using a Custom Coordinate Space in VA

When using VA’s custom coordinate space option, it is critical the geography variable and the dataset use the same coordinate space.  This tells VA how to align the grid used by the data with the grid used by the underlying map.  If they align, the data will be placed at the expected location.  If they don’t align, the data will appear in the wrong location or may not be displayed at all.

Illustration of aligning the map and data grids

To illustrate the process of using a custom coordinate space in VA, we will be creating a custom region map of the Oklahoma City School Districts.  The data can be found on the Oklahoma City Open Data Portal.  We will use the Esri shapefile format.  As you may recall from a previous blog post, Creating custom region maps with SAS Visual Analytics, the first step is to import the Esri shapefile data into a SAS dataset.
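
That import can be sketched with PROC MAPIMPORT from SAS/GRAPH; the file path is hypothetical:

proc mapimport datafile="/data/OKC_School_Districts.shp"
               out=work.okc_districts;   /* shapefile becomes a SAS data set */
run;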

Once the shapefile has been successfully imported into SAS, we then must determine the coordinate system of the data.  While WGS84 is common and will work in many situations, it should not be assumed.  The first place to look is at the source, the data provider.  Many Open Data portals will have the coordinate system listed along with the metadata and description of the dataset.  But when using an Esri shapefile, there is an easier way to find what we need.

Locate the directory where you unzipped the original shapefile.  Inside of that directory is a file with a .prj extension.  This file defines the projection and coordinate system used by the shapefile.  Below are the contents of our .prj file with the first parameter highlighted.  We are only interested in this value.  Here, you can see the data has been defined in the Oklahoma State Plane coordinate system -- not in VA’s default WGS84.  So, we must use a custom coordinate system when defining the geography variable.

PROJCS["NAD_1983_StatePlane_Oklahoma_North_FIPS_3501_Feet",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137,298.257222101004]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]], PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",1968500],PARAMETER["False_Northing",0],PARAMETER["Central_Meridian",-98],PARAMETER["Standard_Parallel_1",35.5666666666667],PARAMETER["Standard_Parallel_2",36.7666666666667],PARAMETER["Scale_Factor",1],PARAMETER["Latitude_Of_Origin",35],UNIT["Foot_US",0.304800609601219]]

Next, we need to look up the Oklahoma State Plane coordinate system to find a definition VA understands.  From the main page of the SpatialReference.org website, type ‘Oklahoma State Plane’ into the search box. Four results are returned.  Compare the results with the string highlighted above.  You can see the third option is what we are looking for: NAD 1983 StatePlane Oklahoma North FIPS 3501 Feet.

Selecting the appropriate definition based on the .prj file contents

To get the definitions we need for VA, click the third link for the option NAD 1983 StatePlane Oklahoma North FIPS 3501 Feet.  Here you will see a grey box with a bulleted list of links.  Each of these links represents a definition of the Oklahoma StatePlane coordinate space.

Visual Analytics supports two of the listed formats, EPSG and Proj4.  EPSG stands for European Petroleum Survey Group, an organization that publishes a database of coordinate system and projection information.  The syntax of this format is epsg:<number> or esri:<number>, where <number> is a 4-6 digit code for the desired coordinate system.  In our case, the format we need is the title of the page:

ESRI:102724

The second format supported by VA is Proj4, the third link in the image above.  This format consists of a string of space-delimited name-value pairs.  The Oklahoma StatePlane Proj4 definition we are interested in is:

+proj=lcc +lat_1=35.56666666666667 +lat_2=36.76666666666667 +lat_0=35 +lon_0=-98 +x_0=600000.0000000001 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs

Now that we have identified the coordinate system used by our data set and looked up its definition, we are ready to configure VA to use it.

Using a Projected Coordinate System definition in VA

The following section assumes you are familiar with custom region maps and setting up a polygon provider.  If not, see my previous post on that process, Creating custom region maps with SAS Visual Analytics.  The first step in setting up a geography variable for a custom region map is to start with the polygon provider.  At the bottom of the ‘Edit Polygon Provider’ window, there is an ‘Advanced’ section that is collapsed by default.  Expand it to see the Coordinate Space option.  By default, it is populated with the value EPSG:4326, the EPSG code for WGS84.  Since our Oklahoma City School District data does not use WGS84, we must replace this value with the definition we looked up on SpatialReference.org (ESRI:102724).
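As a reminder from that post, the polygon provider reads its polygon data from CAS, so the imported dataset must be loaded and promoted before the provider can reference it.  A minimal sketch follows; the session name, caslib, and table name are assumptions that will vary by site:

/* Load the imported shapefile dataset into CAS for the polygon provider */
cas mysess;
proc casutil;
   load data=work.okc_districts outcaslib="public"
        casout="OKC_SCHOOL_DISTRICTS" promote;
quit;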

Using the same Custom Coordinate definition for Polygon provider and geography variable

Next, we must configure the geography variable itself with the same coordinate space as the polygon provider.  On the ‘Edit Geography Item’ window, the Coordinate Space option is the last item.  Again, we must change this from the default WGS84: from the dropdown list, select the option ‘Custom’, and a new entry box appears where we can enter the custom coordinate space definition, ESRI:102724.  If configured correctly, you should see your map in the preview thumbnail and a 100% mapped indicator.

Congratulations!  The setup was successful.  Now, simply click OK and drag the geography variable to the canvas.  VA’s auto-map feature will recognize it and display the custom region map.

In this post, I showed how to identify the coordinate system of your Esri shapefile data, look up its EPSG and Proj4 definitions, and configure VA to use it via the custom coordinate space option.  While the focus was on a custom region map, the technique also applies to custom coordinate maps, minus the polygon provider setup.  The support of custom coordinate spaces in VA allows the mapping of practically any spatial dataset, giving you a new level of power and flexibility in your mapping efforts.

Essentials of Map Coordinate Systems and Projections in Visual Analytics was published on SAS Users.

April 08, 2019
 
The catchphrase “everything happens somewhere” is increasingly common these days.  That “somewhere” translates into a location on the Earth: a latitude and longitude.  When one of these “somewheres” is combined with many others, you quickly have a robust spatial data set that becomes actionable with the right analytic tools.

Opportunities for Spatial Analytics are increasing

In today’s world, GPS-enabled devices are ubiquitous, and their use continues to increase daily.  Cell phones, cars, fitness trackers, and cameras are all able to locate and track our position.  As a result, the location analytics market is expected to grow to over USD 16 billion by 2021, a compound annual growth rate of 17.6% from 2016 [1].

Waldo Tobler, an American-Swiss geographer and cartographer, developed his First Law of Geography based on this concept of everything happening somewhere.  He stated, “Everything is related to everything else, but near things are more related than distant things”[2].  As analytic professionals, we are accustomed to working with these correlations using scatterplots, heatmaps, or clustering models.  But what happens when we add a geographic map into the analysis?

Maps unlock a level of insight into our data that traditional graphs cannot offer: personal connection.  As humans, we naturally relate to our surroundings on a spatial level.  That spatial sense builds the perspective and frame of reference through which we view and navigate the world.  We feel a sense of loss when a physical landmark from our childhood -- a building, tree, park, or route we used to walk to school -- is destroyed or changed from the memories we have of it.  In this sense, we are connected, spatially and emotionally, to our surroundings.

We inherently understand how data relates to the world around us, at some level, just by viewing it on a map.  Whether it is a body of water or a mountain affecting a driving route or maybe a trendy area of a city causing housing prices to increase faster than the local average, a map connects us with these facts intuitively.  We come to these basic conclusions based solely on our experiences in the world and knowledge of the physical landmarks in the map.

One of the best examples of this is the 1854 Cholera outbreak in London.  Dr. John Snow was one of the first to use a map to understand the origin of an epidemiological outbreak.  He created a map of the affected London neighborhood by plotting the location of all known Cholera deaths.  In addition to the deaths, he plotted the location of 13 community wells that served as the public water supply.  Using this data, he was able to see a clustering of deaths around a single pump.  Armed with this information, Dr. Snow convinced local officials to remove the handle from the Broad Street pump.  Once it was removed, new cases of Cholera quickly began to diminish.  This helped prove his theory that the outbreak’s origin was not air-borne, as commonly believed at the time, but water-borne. [3]

1854 London Cholera deaths: Tabular data vs. Coordinate map [3]

Let’s look at how Dr. Snow’s map helped mitigate the outbreak and prove his theory.  The image above compares the recorded deaths and community wells in tabular form with a coordinate map.  The clustering of points is obvious on the coordinate map.  Town officials and those familiar with the neighborhood could easily get a sense of where the outbreak was concentrated.  The map told a better story by connecting their personal experience of the area to the locations of the deaths and, ultimately, to the wells -- something a data table or traditional graph could not do.
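Recreating a basic version of Dr. Snow’s coordinate map is straightforward today.  Below is a minimal sketch using the SGMAP procedure in BASE SAS; the dataset and variable names are hypothetical stand-ins for a table of death locations:

/* Plot Cholera death locations over an OpenStreetMap background     */
/* (work.cholera_deaths with longitude/latitude is hypothetical)     */
proc sgmap plotdata=work.cholera_deaths;
   openstreetmap;
   scatter x=longitude y=latitude /
           markerattrs=(color=red symbol=circlefilled);
run;

The community wells could be plotted the same way, and the clustering around the Broad Street pump would stand out just as it did in 1854.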

Maps of London Cholera deaths with modern analytic overlays [3]

Today, with the computing power and modern analytic methods available to us, we can take the analysis even further.  The examples above show the same coordinate map with added Voronoi polygon and cluster analysis overlays.  The concentration around the Broad Street pump becomes even clearer, showing why geographic maps are an important tool to have in your analytic toolbox.
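As a sketch of the cluster-analysis idea, a k-means clustering of the death coordinates could be run with PROC FASTCLUS.  Over a neighborhood this small, treating longitude and latitude as planar coordinates is a reasonable approximation; the dataset name is again a hypothetical stand-in:

/* k-means clustering of death locations;            */
/* cluster assignments are written to the OUT= table */
proc fastclus data=work.cholera_deaths maxclusters=3
              out=work.death_clusters;
   var longitude latitude;
run;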

SAS Global Forum 2019 is being held April 28-May 1, 2019 in Dallas, Texas.  If you are planning to go to this year’s event, be sure to attend one of our presentations on the latest mapping features included in SAS Visual Analytics and BASE SAS.  While you’re there, don’t forget to stop by the SAS Mapping booth located in the QUAD to say ‘Hi!’ and let us help with your spatial data needs.  See you in Dallas!

Introduction to Esri Integration in SAS Visual Analytics

  • Monday, April 29, 4:30-5:30p, Room: Level 1, D162

There’s a Map for That! What’s New and Coming Soon in SAS Mapping Technologies

  • Tuesday April 30, 4:00-4:30p, Room: Level 1, D162

Creating Great Maps in ODS Graphics Using the SGMAP Procedure

  • Wednesday May 01, 11:30a-12:30p, Room: Level 1, D162

[1] https://www.marketsandmarkets.com/Market-Reports/location-analytics-market-177193456.html

[2] https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography

[3] https://www1.udel.edu/johnmack/frec682/cholera/

How the 1854 Cholera outbreak showed us the importance of spatial analysis was published on SAS Users.