SAS R&D

May 21, 2019
 

If you spend any time working with maps and spatial data, a fundamental understanding of coordinate systems and map projections becomes necessary.  It’s the foundation of how spatial data and maps work.  These areas invariably evoke trepidation and some angst, even in the most seasoned map professional, and rightfully so: they can get complicated quickly.  Fortunately, you can set most of those worries aside when creating maps with SAS Visual Analytics; no degree in geodesy is required.

Visual Analytics includes several coordinate system definitions configured out-of-the-box.  Like the Predefined geography types (see Fundamentals of SAS Visual Analytics geo maps), they are selected from a drop-down list during the geography variable setup.  With the details handled by VA, all you need to do is know which coordinate space your data uses and select the appropriate one.

The four Coordinate spaces included with VA are:

  1. World Geodetic System (WGS84)
    Area of coverage: World.  Used by GPS navigation systems and NATO military geodetic surveying.  This is the VA default and should work in most situations.
  2. Web Mercator
    Area of coverage: World.  Format used by Google Maps, OpenStreetMap, Bing Maps and other web map providers.
  3. British National Grid (OSGB36)
    Area of coverage: United Kingdom – Great Britain, Isle of Man
  4. Singapore Transverse Mercator (SVY21)
    Area of coverage: Singapore onshore/offshore

But what if your data does not use one of these?  For those situations, VA also supports custom coordinate spaces.  With this option, you can specify the definition of your desired coordinate space using industry standard formats for EPSG codes or Proj4 strings.  Before we get into the details of how to use custom coordinate spaces in VA, let’s take a step back and review the basics of coordinate spaces and projections.

Background

A coordinate space is simply a grid designed to cover a specific area of the Earth.  Some have global coverage (WGS84, the default in VA) and others cover relatively small areas (SVY21/Singapore Transverse Mercator).  Each coordinate space is defined by several parameters, including but not limited to:

  • Center coordinates (origin)
  • Coverage area (‘bounds’ or ‘extent’)
  • Unit of measurement (feet or meters)
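As a rough illustration, the parameters above can be captured in a small record per coordinate space. The sketch below is a minimal example, not how VA stores its definitions. The class and field names are hypothetical. Only EPSG:4326 is named in this post; the other three codes and the units are taken from the public EPSG registry, and using them to identify VA's built-in entries is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoordinateSpace:
    """Hypothetical record for one coordinate space definition."""
    name: str
    code: str      # authority:number identifier from the EPSG registry
    unit: str      # unit of the coordinate values
    coverage: str  # area of coverage as described in the list above


# The four coordinate spaces listed earlier, with their registry codes.
BUILT_IN = [
    CoordinateSpace("World Geodetic System (WGS84)", "EPSG:4326", "degree", "World"),
    CoordinateSpace("Web Mercator", "EPSG:3857", "metre", "World"),
    CoordinateSpace("British National Grid (OSGB36)", "EPSG:27700", "metre", "United Kingdom"),
    CoordinateSpace("Singapore Transverse Mercator (SVY21)", "EPSG:3414", "metre", "Singapore"),
]

# Coordinate spaces measured in a linear unit are projected; degrees indicate
# an un-projected geographic coordinate system.
projected = [cs.name for cs in BUILT_IN if cs.unit != "degree"]
```

Filtering on the unit is a quick way to separate the one geographic coordinate system (WGS84) from the three projected ones.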

Comparison of coordinate space definitions included in Visual Analytics -- Source: http://epsg.io

The image above compares the four coordinate space definitions included with VA.  The two on the right, BNG and Singapore Transverse Mercator, have a limited extent.  A red rectangle outlines the area of coverage for each region.  The two on the left, WGS84 and Mercator, are both world maps.  At first glance, they may appear to have the same coverage area, but they are not interchangeable.  The origin for both is located at the intersection of the Equator and the Prime Meridian.  However, the similarities end there.  Notice the extent for WGS84 covers the entire latitude range, from -90 to +90.  Mercator, on the other hand, covers roughly -85 to +85 latitude, so about the first 5 degrees from each Pole are not included.  Another difference is the unit of measurement.  WGS84 is measured in un-projected degrees, which is indicative of a spherical Geographic Coordinate System (GCS).  Mercator uses meters, which implies a Projected Coordinate System (PCS) used for a flat surface, i.e., a screen or paper.
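To make the degrees-versus-meters difference concrete, here is a minimal sketch of the standard spherical Web Mercator forward formulas. The function and variable names are my own, and this is not VA's internal code; it only shows why the same point has very different numbers in the two coordinate spaces, and why the Mercator extent stops short of the Poles.

```python
import math

# Sphere radius used by Web Mercator, in meters (the WGS84 semi-major axis).
EARTH_RADIUS_M = 6378137.0


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 longitude/latitude in degrees to Web Mercator meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# The shared origin (Equator and Prime Meridian) maps to approximately (0, 0).
origin_x, origin_y = wgs84_to_web_mercator(0.0, 0.0)

# The eastern edge of the map: 180 degrees of longitude in meters.
edge_x, _ = wgs84_to_web_mercator(180.0, 0.0)

# y grows without bound toward the Poles, so the map is cut off near 85 degrees.
_, y_at_85 = wgs84_to_web_mercator(0.0, 85.0)
_, y_at_89 = wgs84_to_web_mercator(0.0, 89.0)
```

At 85 degrees latitude, y is already close to the value of x at 180 degrees longitude, which is what makes the world map roughly square; a few degrees further north, y is far larger.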

The projection itself is a complex mathematical operation that transforms the spherical surface of the GCS into the flat surface of the PCS.  This transformation introduces distortion in one or more qualities of the map: shape, area, direction, or distance.  The process of map projection compares to peeling an orange. Removing the peel and placing it on a flat surface will cause parts of it to stretch, tear or separate as it flattens. The same thing happens to a map projection.

A flat map will always have some degree of distortion.  The amount of distortion depends on the projection used.  Select a projection that minimizes the distortion in the areas most important to the map.  For example, are you creating a navigation map where direction is critical?  How about a World map to compare land mass of various countries?  Or maybe a local map of Municipality services where all factors are equally important?  These decisions are important if you are collecting and creating your data set from the field.  But, if you are using existing data sets, chances are that decision has already been made for you.  It then becomes a task of understanding what coordinate system was selected and how to use it within VA.

Using a Custom Coordinate Space in VA

When using VA’s custom coordinate space option, it is critical the geography variable and the dataset use the same coordinate space.  This tells VA how to align the grid used by the data with the grid used by the underlying map.  If they align, the data will be placed at the expected location.  If they don’t align, the data will appear in the wrong location or may not be displayed at all.

Illustration of aligning the map and data grids

To illustrate the process of using a custom coordinate space in VA, we will be creating a custom region map of the Oklahoma City School Districts.  The data can be found on the Oklahoma City Open Data Portal.  We will use the Esri shapefile format.  As you may recall from a previous blog post, Creating custom region maps with SAS Visual Analytics, the first step is to import the Esri shapefile data into a SAS dataset.

Once the shapefile has been successfully imported into SAS, we then must determine the coordinate system of the data.  While WGS84 is common and will work in many situations, it should not be assumed.  The first place to look is at the source, the data provider.  Many Open Data portals will have the coordinate system listed along with the metadata and description of the dataset.  But when using an Esri shapefile, there is an easier way to find what we need.

Locate the directory where you unzipped the original shapefile.  Inside of that directory is a file with a .prj extension.  This file defines the projection and coordinate system used by the shapefile.  Below are the contents of our .prj file with the first parameter highlighted.  We are only interested in this value.  Here, you can see the data has been defined in the Oklahoma State Plane coordinate system -- not in VA’s default WGS84.  So, we must use a custom coordinate system when defining the geography variable.

PROJCS["NAD_1983_StatePlane_Oklahoma_North_FIPS_3501_Feet",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137,298.257222101004]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]], PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",1968500],PARAMETER["False_Northing",0],PARAMETER["Central_Meridian",-98],PARAMETER["Standard_Parallel_1",35.5666666666667],PARAMETER["Standard_Parallel_2",36.7666666666667],PARAMETER["Scale_Factor",1],PARAMETER["Latitude_Of_Origin",35],UNIT["Foot_US",0.304800609601219]]
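If you have many shapefiles to check, the first parameter can be pulled out of the .prj text with a few lines of code. This is a minimal sketch using a regular expression rather than a full WKT parser; the function name is hypothetical, and it assumes the file begins with a PROJCS or GEOGCS keyword, as this one does.

```python
import re


def prj_coordinate_system_name(prj_text):
    """Return the quoted name that follows the leading PROJCS[ or GEOGCS[ keyword."""
    match = re.match(r'\s*(PROJCS|GEOGCS)\["([^"]+)"', prj_text)
    if match is None:
        raise ValueError("no PROJCS or GEOGCS name found at the start of the text")
    return match.group(2)


# Beginning of the .prj contents shown above; only the leading part is needed.
prj_start = (
    'PROJCS["NAD_1983_StatePlane_Oklahoma_North_FIPS_3501_Feet",'
    'GEOGCS["GCS_North_American_1983",'
)
name = prj_coordinate_system_name(prj_start)

# Replacing underscores with spaces gives a string that is easier to search for.
search_term = name.replace("_", " ")
```

In practice you would read the text from the .prj file in the directory where the shapefile was unzipped.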

Next, we need to look up the Oklahoma State Plane coordinate system to find a definition VA understands.  From the main page of the SpatialReference.org website, type ‘Oklahoma State Plane’ into the search box. Four results are returned.  Compare the results with the string highlighted above.  You can see the third option is what we are looking for: NAD 1983 StatePlane Oklahoma North FIPS 3501 Feet.

Selecting the appropriate definition based on the .prj file contents

To get the definitions we need for VA, click the third link for the option NAD 1983 StatePlane Oklahoma North FIPS 3501 Feet.  Here you will see a grey box with a bulleted list of links.  Each of these links represents a definition for the Oklahoma StatePlane coordinate space.

Visual Analytics supports two of the listed formats, EPSG and Proj4.  EPSG stands for European Petroleum Survey Group, an organization that publishes a database of coordinate system and projection information.  The syntax of this format is epsg:<number> or esri:<number>, where <number> is the 4-6 digit code for the desired coordinate system.  In our case, the format we need is the title of the page:

ESRI:102724
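Because a typo in the code is an easy way to end up with an empty map, a quick format check before pasting the value into VA can help. The sketch below follows the syntax described above (an authority prefix, a colon, and 4 to 6 digits). The function name is hypothetical, and accepting either letter case is an assumption of this sketch, not a documented VA rule.

```python
import re

# Authority prefix, a colon, then 4 to 6 digits, as described above.
CODE_PATTERN = re.compile(r"^(EPSG|ESRI):(\d{4,6})$", re.IGNORECASE)


def split_crs_code(code):
    """Return (authority, number) for strings such as 'EPSG:4326', or None if malformed."""
    match = CODE_PATTERN.match(code.strip())
    if match is None:
        return None
    return match.group(1).upper(), int(match.group(2))


oklahoma = split_crs_code("ESRI:102724")
default = split_crs_code("epsg:4326")
missing_prefix = split_crs_code("102724")
```

The check only confirms the shape of the string; it cannot tell you whether the number refers to the coordinate system your data actually uses.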

The second format supported by VA is Proj4, the third link in the image above.  This format consists of a string of space-delimited name-value pairs.  The Oklahoma StatePlane Proj4 definition we are interested in is:

+proj=lcc +lat_1=35.56666666666667 +lat_2=36.76666666666667 +lat_0=35 +lon_0=-98 +x_0=600000.0000000001 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs
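One way to gain confidence that this Proj4 string describes the same coordinate system as the .prj file is to compare a few parameters. The sketch below is a minimal example with a hypothetical helper name: it splits the string into name-value pairs and converts the false easting from meters (+x_0) to US survey feet using +to_meter, which should agree with the False_Easting value of 1968500 in the .prj contents.

```python
def parse_proj4(definition):
    """Split a Proj4 string into a dict; bare flags such as +no_defs map to True."""
    params = {}
    for token in definition.split():
        key, _, value = token.lstrip("+").partition("=")
        params[key] = value if value else True
    return params


proj4 = (
    "+proj=lcc +lat_1=35.56666666666667 +lat_2=36.76666666666667 "
    "+lat_0=35 +lon_0=-98 +x_0=600000.0000000001 +y_0=0 "
    "+ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs"
)
params = parse_proj4(proj4)

# Proj4 stores the false easting in meters; the .prj file stores it in US survey feet.
false_easting_feet = float(params["x_0"]) / float(params["to_meter"])
```

The same comparison works for the other values: +lat_1 and +lat_2 correspond to the two standard parallels, +lon_0 to the central meridian, and +lat_0 to the latitude of origin.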

Now that we have identified the coordinate system used by our data set and looked up its definition, we are ready to configure VA to use it.

Using a Projected Coordinate System definition in VA

The following section assumes you are familiar with custom region maps and setting up a polygon provider.  If not, see my previous post on that process, Creating custom region maps with SAS Visual Analytics.  The first step in setting up a geography variable for a custom region map is to start with the polygon provider.  At the bottom of the ‘Edit Polygon Provider’ window, there is an ‘Advanced’ section that is collapsed by default.  Expand it to see the Coordinate Space option.  By default, it is populated with the value EPSG:4326, which is the EPSG code for WGS84.  Since our Oklahoma City School District data does not use WGS84, we need to replace this value with the code that we looked up from SpatialReference.org (ESRI:102724).

Using the same Custom Coordinate definition for Polygon provider and geography variable

Next, we must make sure to configure the geography variable itself with the same coordinate space as the polygon provider.  On the ‘Edit Geography Item’ window, the Coordinate Space option is the last item.  Again, we must change this from the default WGS84 to ESRI:102724.  To do so, select the option ‘Custom’ from the dropdown list.  A new entry box appears where we can enter the custom coordinate space definition.  If configured correctly, you should see your map in the preview thumbnail and a 100% mapped indicator.

Congratulations!  The setup was successful.  Now, simply click OK and drag the geography variable to the canvas.  VA’s auto-map feature will recognize it and display the custom region map.

In this post, I showed how to identify the coordinate system of your Esri shapefile data, look up its EPSG and Proj4 definitions, and configure VA to use it via the Custom Coordinate space option.  While the focus was on a custom region map, the technique also applies to Custom Coordinate maps, minus the polygon provider setup.  The support of custom coordinate spaces in VA allows the mapping of practically any spatial dataset, giving you a new level of power and flexibility in your mapping efforts.

Essentials of Map Coordinate Systems and Projections in Visual Analytics was published on SAS Users.

April 8, 2019
 
The catch phrase “everything happens somewhere” is increasingly common these days.  That “somewhere” translates into a location on the Earth; a latitude and longitude.  When one of these “somewhere’s” is combined with many other “somewhere’s”, you quickly have a robust spatial data set that becomes actionable with the right analytic tools.

Opportunities for Spatial Analytics are increasing

In today’s modern world, GPS-enabled devices are ubiquitous, and their use continues to increase daily.  Cell phones, cars, fitness trackers, and cameras are all able to locate and track our position.  As a result, the location analytics market is expected to grow to over USD 16 Billion by 2021, up 17.6% from 2016 [1].

Waldo Tobler, an American-Swiss geographer and cartographer, developed his First Law of Geography based on this concept of everything happening somewhere.  He stated, “Everything is related to everything else, but near things are more related than distant things”[2].  As analytic professionals, we are accustomed to working with these correlations using scatterplots, heatmaps, or clustering models.  But what happens when we add a geographic map into the analysis?

Maps offer the ability to unlock a new level of insight into our data that traditional graphs do not offer: personal connection.  As humans, we naturally relate to our surroundings on a spatial level.   It helps build our perspective and frame of reference through which we view and navigate the world.  We feel a sense of loss when a physical landmark from our childhood – a building, tree, park, or route we used to walk to school – is destroyed or changed from the memories we have of it.  In this sense, we are connected, spatially and emotionally, to our surroundings.

We inherently understand how data relates to the world around us, at some level, just by viewing it on a map.  Whether it is a body of water or a mountain affecting a driving route or maybe a trendy area of a city causing housing prices to increase faster than the local average, a map connects us with these facts intuitively.  We come to these basic conclusions based solely on our experiences in the world and knowledge of the physical landmarks in the map.

One of the best examples of this is the 1854 Cholera outbreak in London.  Dr. John Snow was one of the first to use a map for understanding the origin of an epidemiological outbreak.  He created a map of the affected London neighborhood by plotting the location of all known Cholera deaths.  In addition to the deaths, he also plotted the location of 13 community wells that served as the public water supply.  Using this data, he was able to see a clustering of deaths around a single pump.  Armed with this information, Dr. Snow was able to convince local officials to remove the handle from the Broad Street pump.  Once removed, new cases of Cholera quickly began to diminish.  This helped prove his theory that the outbreak was water-borne rather than air-borne, as was commonly believed at the time. [3]

1854 London Cholera deaths: Tabular data vs. Coordinate map [3]

Let’s look at how Dr. Snow’s map helped mitigate the outbreak and prove his theory.  The image above compares the data of the recorded deaths and community wells in tabular form to a Coordinate map.  It is obvious from the coordinate map that there is a clustering of points.  Town officials and those familiar with the neighborhood could easily get a sense of where the outbreak was concentrated.  The map told a better story by connecting their personal experience of the area to the locations of the deaths and ultimately to the wells, something a data table or traditional graph could not do.

Maps of London Cholera deaths with modern analytic overlays [3]

Today, with the computing power and modern analytic methods available to us, we can take the analysis even further.  The examples above show the same coordinate map with added Voronoi polygon and cluster analysis overlays.  The concentration around the Broad Street pump becomes even clearer, showing why Geographic Maps are an important tool to have in your analytic toolbox.
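The idea behind the Voronoi overlay can be sketched in a few lines: each point belongs to the cell of whichever pump is closest to it, so counting deaths per nearest pump shows where the cases concentrate. The coordinates below are hypothetical toy values on an arbitrary grid, not Dr. Snow's data, and the pump names are placeholders.

```python
import math
from collections import Counter

# Hypothetical toy coordinates on an arbitrary grid; these are NOT Dr. Snow's data.
pumps = {"Pump A": (0.0, 0.0), "Pump B": (10.0, 0.0), "Pump C": (5.0, 9.0)}
deaths = [(1.0, 1.0), (0.5, -0.8), (2.0, 0.3), (-1.0, 0.5), (9.0, 1.0), (5.5, 8.0)]


def nearest_pump(point):
    """Return the name of the pump closest to the point (the point's Voronoi cell)."""
    return min(pumps, key=lambda name: math.dist(point, pumps[name]))


# Tally the deaths that fall inside each pump's cell.
deaths_per_pump = Counter(nearest_pump(p) for p in deaths)
```

Straight-line distance is a simplification; walking distance along streets would be a closer match to how residents actually chose a pump.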

SAS Global Forum 2019 is being held April 28-May 1, 2019 in Dallas, Texas.  If you are planning to go to this year’s event, be sure to attend one of our presentations on the latest mapping features included in SAS Visual Analytics and BASE SAS.  While you’re there, don’t forget to stop by the SAS Mapping booth located in the QUAD to say ‘Hi!’ and let us help with your spatial data needs.  See you in Dallas!

Introduction to Esri Integration in SAS Visual Analytics

  • Monday, April 29, 4:30-5:30p, Room: Level 1, D162

There’s a Map for That! What’s New and Coming Soon in SAS Mapping Technologies

  • Tuesday April 30, 4:00-4:30p, Room: Level 1, D162

Creating Great Maps in ODS Graphics Using the SGMAP Procedure

  • Wednesday May 01, 11:30a-12:30p, Room: Level 1, D162

[1] https://www.marketsandmarkets.com/Market-Reports/location-analytics-market-177193456.html

[2] https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography

[3] https://www1.udel.edu/johnmack/frec682/cholera/

How the 1854 Cholera outbreak showed us the importance of spatial analysis was published on SAS Users.

April 5, 2019
 

Recently, you may have heard about the release of the new SAS Analytics Cloud. The platform allows fast access to data-science applications in the cloud! Running on the SAS Cloud and using the latest container technology, Analytics Cloud eliminates the need to install, update, or maintain software or related infrastructure.

SAS Machine Learning on SAS Analytics Cloud is designed for SAS and open source data scientists to gain on-demand programmatic access to SAS Viya. All the algorithms provided by SAS Visual Data Mining and Machine Learning (VDMML), SAS Visual Statistics and SAS Visual Analytics are available through the offering. Developers and data scientists access SAS through a programming interface using either the SAS or Python programming languages.

A free trial for Analytics Cloud is available, and registration is simple. The trial environment allows users to manage and collaborate with others, share data, and create runtime models to analyze their data. The system is pre-loaded with sample data for learning, and allows users to upload their own data. My colleague Joe Furbee explains how to register for the trial and takes you on a tour of the system in his article, Zero to SAS in 60 Seconds- SAS Machine Learning on SAS Analytics Cloud.

Luckily, I had the privilege of being the technical writer for the documentation for SAS Analytics Cloud, and through this met two of my now close friends at SAS.

Alyssa Andrews (pictured left) and Mariah Bragg (pictured right) are both Software Developers at SAS who worked on the UI for SAS Analytics Cloud. Mariah works in the Research and Development (R&D) division of SAS while Alyssa works in the Information Technology (IT) division. As you can see, this project ended up being an interesting mix of SAS teams!

As Mariah told me the history, I learned that SAS Analytics Cloud “was a collaborative project between IT and R&D. The IT team presented the container technology idea to Dr. Goodnight but went to R&D because they wanted this idea run like an R&D project.”

As we prepared for the release of SAS Analytics Cloud to the public, I asked Mariah and Alyssa about their experience working on the UI for SAS Analytics Cloud, and about all the work that they had completed to bring this powerful platform to life!


What is SAS Analytics Cloud for you? How do you believe it will help SAS users?

Alyssa: For me, it is SAS getting to do Software as a Service. So now you can click on our SAS Software and it can magically run without having to add the complexity of shipping a technical support agent to the customer’s site to install a bunch of complex software.

Mariah: I agree. This will be a great opportunity for SAS to unify and have all our SAS products on cloud.

Alyssa: Now, you can trial and then pay for SAS products on the fly without having to go through any complexities.

What did you do on the project as UI Developers?

Alyssa: I was lent out to the SAS Analytics Cloud team from another team and given a tour-of-duty because I had a background in Django (a high-level Python Web design tool) which is another type of API framework you can build a UI on top of. Then I met Mariah, who came from an Angular background, and we decided to build the project on Angular. So, I would say Mariah was the lead developer and I was learning from her. She did more of the connecting to the API backend and building the store part out, and I did more of the tweaks and the overlays.

What is something you are proud of creating for SAS Analytics Cloud?

Mariah: I’m really proud to be a part of something that uses Angular. I think I was one of the first people to start using Angular at SAS and I am so excited that we have something out there that is using this new technology. I am also really proud of how our team works together, and I’m really proud of how we architectured the application. We went through multiple redesigns, but they were very manageable, and we really built and designed such that we could pull out components and modify parts without much stress.

Alyssa: That we implemented good design practices. It is a lot more work on the front-end, but it helps so much not to have just snowflake code (a term developers use for code that is not reusable, or is so one-off that it becomes a problem later and adds weight to the program) floating. Each piece of code is there for a reason; it’s very modular.

What are your hopes for the future of SAS Analytics Cloud?

Alyssa: I hope that it continues to grow and that we add even more applications to this new container technology, so that SAS can move even more into the cloud arena. I hope it brings success. It is a really cool platform, so I can’t wait to hear about users and their success with it.

Mariah: I agree with Alyssa. I also hope it is successful so that we keep moving into the Cloud with SAS.

Learning more

As a Developmental Editor with SAS Press, it was a new and engaging experience to work with an innovative technology like SAS Analytics Cloud. I was happy to work with such an exciting team, and I look forward to what is next for SAS Analytics Cloud.

And as a SAS Press team member, I hope you check out the new way to trial SAS Machine Learning with SAS Analytics Cloud. And while you are learning SAS, check out some of our great books that can help you get started with SAS Studio, like Ron Cody’s Biostatistics by Example Using SAS® Studio and also explore Geoff Der and Brian Everitt’s Essential Statistics Using SAS® University Edition.

Already experienced but want to know more about how to integrate R and Python into SAS? Check out Kevin D. Smith’s blogs on R and Python with SAS Viya. Also take a moment to investigate our new books on using open source R and Python with SAS Viya: SAS Viya: The R Perspective by Yue Qi, Kevin D. Smith, and XingXing Meng and SAS Viya: The Python Perspective by Kevin D. Smith and XingXing Meng.

These great books can set you on the right path to learning SAS before you begin your jump into SAS Analytics Cloud, the new way to experience SAS.

SAS® Analytics Cloud—an interview with the women involved was published on SAS Users.

March 27, 2019
 

SAS Visual Analytics supports region maps for Country, US states, and provinces out-of-the-box.  These work well for small scale maps covering the world, a continent, or a single country.  However, other regions are often needed.  Beginning in version 8.3, VA supports custom polygons to display regions such as sales territories, counties, or zip codes.

Region (choropleth) maps use a fill color to show relationships between the regions based upon a response value from your data.  Using custom polygons in VA follows the same steps outlined in previous posts for predefined or custom coordinate geography items, with just a few additional steps.  Here’s the basic flow:

  • Identify your data
  • Import polygon shapefile into SAS dataset
  • Import the shape dataset into VA
  • Create a Custom polygon provider
  • Create the geography item
  • Create and customize the map

Before we begin

VA supports two sources for creating custom polygons: Esri shapefiles and Esri Feature Services.  The goal for this post is to show how to create custom polygons using an Esri shapefile.

Typically, when working with custom polygons, you will have two datasets: the first defines the custom regions (shape data) and the second contains the data you wish to map (business data).  The shape data is derived from an Esri shapefile or feature service.  The business data can be in a shapefile or any format supported by VA (.sas7bdat, .csv, .xls, etc). It contains the information you want to analyze distributed across the regions defined by the shape data.

It is recommended that you verify the imported shape data before using it in your final map.  This will confirm the data is valid and make debugging an issue easier should you encounter any errors.  To verify, use the same dataset for both the shape and business data.  The example below will use this approach.

Access to a GIS application such as Esri’s ArcGIS or QGIS is recommended.  There are two areas where they can help you prepare to use custom polygons in your VA map:

  • Creating a shapefile to define polygons specific to your business need or application
  • Viewing the attribute table of an existing shapefile to determine its unique identifier column

For this example, we will be creating a map of registered Neighborhood Associations in Boise, Idaho. To follow along, download the data from the City of Boise open data site: Boise Neighborhood Associations

1. Identify your data

Shape data

The shape data defining the custom regions needs to be in an Esri shapefile format. These files can be created in a GIS application or obtained from a wide variety of online sources such as: the US Census Bureau (http://www.census.gov); local and state municipalities; state agencies such as the Department of Transportation; and university GIS departments.  Most municipalities now have Open Data portals that provide a wealth of reliable data for public use.  These sources are maintained by dedicated staff and are updated regularly.

Business data

The business data can be specific to your company’s operation or customer base.  Or it can be broad and general using census or demographic information.  It answers the question of What you want to analyze on the map.  The business data must contain a column that aligns with your shape data.  For example: If you want to map the age distribution and spending habits of your target customers across zip codes, then your business data must have a column for zip codes that allows it to be joined to a zip code region in the shape data.
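The join described above can be sketched as a simple key lookup. The rows, column names, and values below are hypothetical and only illustrate the matching rule: a business row is mapped when its zip code appears among the region identifiers in the shape data, and it is left out otherwise.

```python
# Hypothetical shape data: one entry per zip code region (geometry omitted).
shape_regions = {"27513": "polygon for 27513", "27511": "polygon for 27511"}

# Hypothetical business data; column names and values are made up.
business_rows = [
    {"zip": "27513", "median_age": 38, "avg_spend": 52.10},
    {"zip": "27511", "median_age": 41, "avg_spend": 47.25},
    {"zip": "99999", "median_age": 35, "avg_spend": 60.00},  # no matching region
]

# Rows whose zip code matches a region can be drawn; the rest cannot.
joined = [row for row in business_rows if row["zip"] in shape_regions]
unmatched = [row["zip"] for row in business_rows if row["zip"] not in shape_regions]
```

Note that the keys are compared as strings here. A zip code stored as the number 27513 in one dataset and the text "27513" in the other would not match in this sketch, so it is worth checking that both columns hold the same kind of value.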

2. Import polygon data into a SAS dataset

VA 8.3 does not support the native shapefile format. To use a shapefile in VA, you must first import it into SAS.  Included with Viya 3.4, the %shpimprt macro will convert a shapefile into a SAS dataset and load it into CAS.  You can find the documentation for it here: %shpimprt documentation.

Alternatively, the shapefile can be manually imported with these basic steps:

  • Import the shapefile into SAS
  • Add a sequence column to the dataset
  • Reduce the density of the dataset
  • Limit the dataset based on the density value

Additional details and sample code for each of these steps can be found in the text file linked here: Manual shapefile import steps.
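To show what the sequence column is for, here is a minimal sketch in Python; it is not the SAS code from the linked text file or the %shpimprt macro. The vertex rows and function names are hypothetical. It assumes a single running number in row order; whether VA expects the numbering to restart for each polygon is not stated in this post, and a running number preserves the vertex order either way.

```python
# Hypothetical vertex rows in drawing order: (polygon ID, x, y).
vertices = [
    (1, 0.0, 0.0), (1, 0.0, 1.0), (1, 1.0, 1.0),
    (2, 5.0, 5.0), (2, 6.0, 5.0), (2, 6.0, 6.0), (2, 5.0, 6.0),
]


def add_sequence(rows):
    """Append a running sequence number (starting at 1) that records the row order."""
    return [row + (seq,) for seq, row in enumerate(rows, start=1)]


def outline(rows_with_seq, polygon_id):
    """Rebuild one polygon's outline by sorting its vertices on the sequence value."""
    own = [row for row in rows_with_seq if row[0] == polygon_id]
    return [(x, y) for _, x, y, _ in sorted(own, key=lambda row: row[3])]


with_sequence = add_sequence(vertices)

# Even if the rows are later reordered, the sequence value recovers the outline.
reordered = list(reversed(with_sequence))
polygon_2 = outline(reordered, 2)
```

Without the sequence value, a reordered table would connect the vertices in the wrong order and the region would be drawn incorrectly.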

3. Import the shape dataset into VA

Next, we must import the dataset into VA, if using the manual shapefile import process.  To do this, locate the data pane on the left of VA.  From the ‘Open Data Source’ window, select Import > Local File.  Navigate to the location of the SAS dataset created from Step 2 and click the Open button.

Adjust the target location as needed, based on your VA installation, and make note of the location selected.  This path will be required to configure the custom polygon provider. Review and adjust the other options as needed.  Click the blue ‘Import Item’ button at the top of the window to start the import process.  A message will appear indicating the import status. Upon successful import, click the 'OK' button to open the dataset.

Since we are using the same dataset for the shape and business data, we need to make a copy of the category variable that will be used for our map. Right click on ‘ASSOCIATIO’ and select ‘Duplicate’.  Next, let’s change the names of both variables to better distinguish them from one another:

  • Change ‘ASSOCIATIO’ to ‘Business data’
  • Change ‘ASSOCIATIO (1)’ to ‘Shape data’

4. Create the geography item

We are now ready to start creating the geography item.  With Custom polygons, an additional step is required beyond what was described in previous posts with predefined and custom coordinates geography items.  We must define a Custom Polygon provider so VA knows how to locate and display the Boise Neighborhood Associations.  This is needed only once and is part of the geography item setup you are familiar with.

Our goal is to map the regions of the Boise Neighborhood Associations, so we will use ‘Shape data’ for our geography item.  Locate it in the VA data panel and change its Classification type to ‘Geography’.  From the ‘Geography data type’ dropdown, select ‘Custom polygonal shapes’. Several new fields will be displayed.  In the ‘Custom polygon provider’ dropdown, click the ‘Define new polygon provider’ button.

A ‘New Polygon Provider’ window will appear.  All fields shown are required.  The Advanced section has additional options, but they are not needed for this example.

Configure the fields based on the following:

  • Name / Label – Enter ‘Boise Neighborhoods’ for both (these values do not have to be the same)
  • Type – The default CAS Table is the correct option for this example.
  • Server / Library – These values must match those used for the data upload in Step 3.
  • Table – Select the name of the table uploaded in Step 3 (Boise_Neighborhoods)
  • ID Column – The unique identifier column of the dataset. Used to join the shape and business data together. (Select OBJECTID)
  • Sequence Column – This column is created during the import process from Step 2. Needed by VA to display the custom regions. (Select SEQUENCE)

The custom polygon provider is now configured.  All that is needed to finish the geography item setup is to identify the Region ID.  This is the crucial step that will join the shape data to the business data.  The Region ID column must match the ID Column chosen when the custom polygon provider was set up.  Since we are using the same dataset in this example, that value is the same: OBJECTID.

In cases where different datasets are used for the shape and business data, the name of Region ID and ID Column may be different.  The column labels are not important, but their content must match for the join to occur.

Notice that once you select the correct Region ID value, the preview window will display the custom regions from the imported shape data.  The Latitude and Longitude columns are not required in this example.  Click the ‘OK’ button to finish the setup.

5. Create and customize the map

You are now ready to create your map.  Drag the Boise Neighborhoods geography item to the report canvas.  Let’s enhance the appearance of our map by making a few style changes:

  • Set a Color role to shade the Neighborhood Association regions (Roles > Color > Business data)
  • Position the legend on the left of the map (Options > Legend)
  • Adjust the transparency of the fill color to 45% (Options > Map Transparency)
  • Change the map service to Esri World Street Map (Options > Map service)

Final map with custom polygons.

Congratulations!  You have just created your first custom region map.  In this post we discussed how to use the Custom Polygon provider to define your own regions using an Esri shapefile.  Compared to the Predefined and Custom Coordinate options, custom polygons give you additional flexibility and control over how your spatial data is analyzed.

Creating custom region maps with SAS Visual Analytics was published on SAS Users.

February 27, 2019
 

In this post, we continue our discussion of geography variables, the foundation of Visual Analytics Geo maps. This time we will look at Custom Coordinates.  As with any statistical graph, understanding your data is key.  But when using Custom Coordinates for geographic maps, this understanding becomes even more important.

Use the Custom Coordinate geography variable when your data does not match one of VA’s predefined geography types (see previous post, Fundamentals of SAS Visual Analytics geo maps).  For Custom coordinates, your data set must include latitude and longitude values as separate variables.   These values should be sourced from trustworthy providers and validated for accuracy prior to loading into VA.
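A basic range check is one inexpensive way to validate latitude and longitude values before loading them. The sketch below is a minimal example with a hypothetical function name and made-up sample rows; it assumes the values are meant to be WGS84 degrees.

```python
def invalid_coordinates(rows):
    """Return (row index, reason) for rows that cannot be valid WGS84 degrees."""
    problems = []
    for index, (lat, lon) in enumerate(rows):
        if lat is None or lon is None:
            problems.append((index, "missing value"))
        elif not -90.0 <= lat <= 90.0:
            problems.append((index, "latitude out of range"))
        elif not -180.0 <= lon <= 180.0:
            problems.append((index, "longitude out of range"))
    return problems


# Hypothetical sample rows as (latitude, longitude).
# The second row has its two values swapped; the third is missing a longitude.
sample = [(35.78, -78.64), (-122.33, 47.61), (35.90, None)]
problems = invalid_coordinates(sample)
```

A range check catches only some mistakes. A swapped pair whose values both happen to fall within -90 to +90 would pass, so plotting a few known locations remains a useful second check.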

When using Custom Coordinates, the Coordinate Space must also be considered.  The coordinate space defines the grid used to plot your data.  The underlying map is also based on a grid.  In order for your data to display correctly on a map, these grids must match.  Visual Analytics uses the World Geodetic System (WGS84) as the default coordinate space (grid).  This will work for most scenarios, including the example below.
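To see why the grids must match, the Python sketch below (an illustration, not VA's internal code) converts a WGS84 longitude/latitude pair to spherical Web Mercator using the standard formula. The function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by spherical Web Mercator

def wgs84_to_web_mercator(lat_deg, lon_deg):
    """Convert degrees (WGS84) to metres (spherical Web Mercator)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

x, y = wgs84_to_web_mercator(35.78, -78.64)
print(x, y)
```

The same location is a pair of small degree values in one grid and a pair of values in the millions of metres in the other. Plotting one set of numbers on the other grid places the point far from where it belongs.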

Once you have selected a dataset and confirmed it contains the required spatial information, you can now create a Custom Geography variable.  In this example, I am using the variable Business Address from the dataset Wake_Co_Pizza.  Let’s get started.

  1. Begin by opening VA and navigate to the Data panel on the left of the application.
  2. Select the dataset and locate the variable that you wish to map. Click the down arrow to the right of the variable and choose ‘Geography’ from the Classification dropdown menu.
  3. The ‘Edit Geography Item’ window appears. Select Custom coordinates in the ‘Geography data type’ dropdown.   Three new dropdown lists appear that are specific to the Custom coordinates data type: ‘Latitude (y)’, ‘Longitude (x)’ and ‘Coordinate Space’.

When using the Custom coordinates data type, we must tell VA where to find the spatial data in our dataset.  We do this using the Latitude (y) and Longitude (x) dropdown lists.  They contain all measures from your dataset.  In this example, the variable ‘Latitude World Geodetic System’ contains our latitude values and the variable  ‘Longitude World Geodetic System’ contains our longitude values.   The ‘Coordinate Space’ dropdown defaults to World Geodetic System (WGS84) and is the correct choice for this example.

  4. Click the OK button to complete the setup once the latitude and longitude variables have been selected from their respective dropdown lists. You should see a new ‘Geography’ section in the Data panel.  The name of the variable (or its edited value) will be displayed beside a globe icon to indicate it is a geography variable.  In this case we see the variable Business Address.

 

Congratulations!  You have now created a custom geography variable and are ready to display it on a map.  To do this, simply drag it from the Data panel and drop it on the report canvas.  The auto-map feature of VA will recognize it as a geography variable and display the data as a bubble map with an OpenStreetMap background.

In this post, we created a custom geography variable using the default Coordinate Space.  Using a custom geography variable gives you the flexibility of mapping data sets that contain valid latitude and longitude values.  Next time, we will take our exploration of the geography variable one step further and explore using custom polygons in your maps.

Using Custom Coordinates for map creation in SAS Visual Analytics was published on SAS Users.

February 8, 2019
 

Creating a map with SAS Visual Analytics begins with the geographic variable.  The geographic variable is a special type of data variable where each item has a latitude and longitude value.  For maximum flexibility, VA supports three types of geography variables:

  1. Predefined
  2. Custom coordinates
  3. Custom polygons

This is the first in a series of posts that will discuss each type of geography variable and their creation. The predefined geography variable is the easiest and quickest way to begin and will be the focus of this post.

SAS Visual Analytics comes with nine (9) predefined geographic lookup types.  This lookup method requires that your data contains a variable matching one of these nine data types:

  • Country or Region Names – Full proper name of a country or region (ISO 3166-1)
  • Country or Region ISO 2-Letter Codes – Alpha-2 country code (ISO 3166-1)
  • Country or Region ISO Numeric Codes – Numeric-3 country code (ISO 3166-1)
  • Country or Region SAS Map ID Values – SAS ID values from MAPSGFK continent data sets
  • Subdivision (State, Province) Names – Full proper name for level 2 admin regions (ISO 3166-2)
  • Subdivision (State, Province) SAS Map ID Values – SAS ID values from MAPSGFK continent data sets (Level 1)
  • US State Names – Full proper name for US State
  • US State Abbreviations – Two letter US State abbreviation
  • US Zip Codes – A 5-digit US zip code (no regions)

Once you have identified a variable in your dataset matching one of these types, you are ready to begin.  For our example map, the dataset 'Crime' and variable 'State name' will be used.  Let’s get started.

Creating a predefined geography variable in SAS Visual Analytics

  1. Begin by opening VA and navigate to the Data panel on the left of the application.
  2. Select the desired dataset and locate a variable that matches one of the predefined lookup types discussed above. Click the down arrow to the right of the variable and select ‘Geography’ from the Classification dropdown menu.
  3. The ‘Edit Geography Item’ window will open. Depending upon the type of geography variable selected, some of the options on this dialog will vary.  The 'Name' textbox is common for all types and will contain the variable selected from your dataset.  Edit this label as needed to make it more user friendly for your intended audience.
  4. The ‘Geography data type’ drop down list is where you select the desired type of geography variable.  In this example, we are using the default predefined option.
  5. Locate the 'Name or code context' dropdown list.  Select the type of predefined variable that matches the data type of the variable chosen from your data.  Once selected, VA scans your data and does an internal lookup on each data item.  This process identifies latitude and longitude values for each item of your dataset.  Lookup results are shown on the right of the window as a percentage and a thumbnail size map.  The thumbnail map displays the first 100 matches.
  6. If there are any unmatched data items, the first 5 will be displayed.  This may provide a better understanding of your data.  In this example, it is clear from the variable name which type should be selected (US State Names).  However, in most cases the choice will not be this obvious.  The lesson here: know your data!
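The lookup summary described in steps 5 and 6 can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the tiny `known` set, the exact-match rule after trimming blanks, and the function `lookup_summary` are hypothetical and do not describe VA's actual matching logic.

```python
def lookup_summary(values, known_names, max_unmatched=5):
    """Return the match percentage and the first few unmatched values."""
    unmatched = [v for v in values if v.strip() not in known_names]
    matched_count = len(values) - len(unmatched)
    pct = 100.0 * matched_count / len(values) if values else 0.0
    return pct, unmatched[:max_unmatched]

known = {"North Carolina", "Idaho", "Texas"}  # hypothetical, tiny lookup set
pct, first_unmatched = lookup_summary(
    ["Idaho", "Texas", "N. Carolina", "Idaho "], known
)
print(pct, first_unmatched)
```

Here three of four values match, and the abbreviated "N. Carolina" is reported as unmatched, which is the kind of hint the unmatched list gives you about your data.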

Unmatched data items indicators

Once you are satisfied with the matched results, click the OK button to continue.  You should see a new section in the Data panel labeled ‘Geography’.  The name of the variable will be displayed beside a globe icon. This icon represents the geography variable and provides confirmation it was created successfully.

Icon change for geography variable

Now that the geography variable has been created, we are ready to create a map.  To do this, simply drag it from the Data panel and drop it on the VA report canvas.  The auto-map feature of VA will recognize the geography variable and create a bubble map with an OpenStreetMap background.  Congratulations!  You have just created your first map in VA.

Bubble map created with predefined geography variable

The concept of a geography variable was introduced in this post as the foundation for creating all maps in VA.  Using the predefined geography variable is the quickest way to get started with Geo maps.  In situations when the predefined type is not possible, using one of VA's custom geography types becomes necessary.  These scenarios will be discussed in future blog posts.

Fundamentals of SAS Visual Analytics geo maps was published on SAS Users.

July 26, 2018
 

SAS Text Analytics analyzes documents at the document level by default, but sometimes sentence-level analysis yields further insights into the data. Two years ago, the SAS Text Analytics team did some research on sentence-level text analysis and shared their discoveries in an SGF paper, Getting More from the Singular Value Decomposition (SVD): Enhance Your Models with Document, Sentence, and Term Representations. Recently my team started working on a concept extraction project. We need to extract all sentences containing one or two query words, so that linguists don't need to read whole documents in order to write concept extraction rules. This significantly improves their efficiency in rule development and rule tuning.

Sentence boundary detection

Sentence boundary detection is a challenge in Natural Language Processing -- it's more complicated than you might expect. For example, most sentences in English end with a period, but sometimes a period is used to denote an abbreviation or used as a part of an ellipsis. My colleagues Biljana and Teresa wrote an article about the complexities of how a period may be used. If you are interested in this topic, please check out their article Text analytics through linguists' eyes: When is a period not a full stop?

Sentence boundary rules differ between languages, and when you work with multilingual data you might want to write one set of code that handles data in all of those languages. For example, a period in German is used to mark the end of an ordinal number token; in Chinese, the sentence-final period is a different character from the English period; and Thai does not use a period to denote the end of a sentence.

Here are several sentence boundary examples:

Sentences Language Text
1 English Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
2 English I paid $23.45 for this book.
3 English We earn more and more money, but we feel less and less happier. So…what happened to us?
4 Chinese 北京确实人多车多,但是根源在哪里?
5 Chinese 在于首都集中了太多全国性资源。
6 German Was sind die Konsequenzen der Abstimmung vom 12. Juni?
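To show why these examples are hard, here is a deliberately naive splitter in Python (an illustration only, not one of the SAS methods in this post). It breaks after every period, question mark, or exclamation mark that is followed by whitespace. The function name `naive_split` is hypothetical.

```python
import re

def naive_split(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

english_1 = ("Rolls-Royce Motor Cars Inc. said it expects its U.S. sales "
             "to remain steady at about 1,200 cars in 1990.")
english_2 = "I paid $23.45 for this book."
german = "Was sind die Konsequenzen der Abstimmung vom 12. Juni?"

for text in (english_1, english_2, german):
    print(len(naive_split(text)), naive_split(text))
```

The naive rule cuts the first English sentence after "Inc." and "U.S." and cuts the German sentence after the ordinal "12.", even though each is a single sentence. The decimal point in "$23.45" survives only because no whitespace follows it.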

How to tokenize documents into sentences with SAS?

There are several methods to build a sentence tokenizer with SAS Text Analytics. Here I list only three:

  • Method 1: Use CAS action applyConcept and SAS Viya
  • Method 2: Use CAS action tpParse and SAS Viya
  • Method 3: Use SAS Data Step Code and SAS 9

Among the above three methods, I recommend the first method, because it can extract sentences and keep the raw texts intact. With the second method, uppercase letters are changed into lowercase letters after parsing with SAS, and some unseen characters will be replaced with white spaces. The third method is based on traditional SAS 9 technology (not SAS Viya), so it might not scale to large data as well.

In my article, I show the SAS code of only the first two methods. For details of the SAS code for the last method, please check out the paper Getting More from the Singular Value Decomposition (SVD): Enhance Your Models with Document, Sentence, and Term Representations.

Use CAS action applyConcept. The applyConcept action performs concept extraction using a concept extraction model that you compile and validate.

%macro sentenceTokenizer1(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   dsOut=
);
/* Rule for determining sentence boundaries */
data sascas1.concept_rule;
   length rule $ 200;
   ruleId=1;
   rule='ENABLE:SentBoundaries';
   output;
 
   ruleId=2;
   rule='PREDICATE_RULE:SentBoundaries(first,last):(SENT,"_first{_w}","_last{_w}")';
   output;
run;
 
proc cas;
textRuleDevelop.validateConcept / 
   table={name="concept_rule"}
   config='rule'
   ruleId='ruleId'
   language="&language"
   casOut={name='outValidation',replace=TRUE}
;
run;
quit;
 
/* Compile concept rule; */
proc cas;
textRuleDevelop.compileConcept / 
   table={name="concept_rule"}
   config="rule"
   enablePredefined=false
   language="&language"
   casOut={name="outli", replace=TRUE}
;
run;
quit;
 
/* Get Sentences */
proc cas;
textRuleScore.applyConcept / 
   table={name="&dsIn"}
   docId="&docVar"
   text="&textVar"
   language="&language"
   model={name="outli"}
   matchType="best"
   casOut={name="outpos_eli", replace=TRUE}
   factOut={name="&dsOut", replace=TRUE, where="_fact_argument_=''"}
;
run;
quit;
 
proc cas;
   table.dropTable name="concept_rule" quiet=true; run;
   table.dropTable name="outli" quiet=true; run;
   table.dropTable name="outpos_eli" quiet=true; run;
quit; 
%mend sentenceTokenizer1;

Use CAS action tpParse.

%macro sentenceTokenizer2(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   dsOut=
);
/* Parse the data set */
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=false
   tagging=false
   noungroups=false
   entities="none"
   selectAttribute={opType="IGNORE",tagList={}}
   selectPos={opType="IGNORE",tagList={}}
   offset={name="offset",replace=TRUE}
;
run;
 
/* Get Sentences */
proc cas;
table.partition / 
   table={name="offset" 
          groupby={{name="_document_"}, {name="_sentence_"}}
          orderby={{name="_start_"}}
         }
   casout={name="offset" replace=true};
run;
 
datastep.runCode /
code= "
data &dsOut;
   set offset;
   by _document_ _sentence_ _start_;
   length _text_ varchar(20000);
   if first._sentence_ then do;
      _text_='';
      _lag_end_ = -1;
   end;  
   if _start_=_lag_end_+1 then
      _text_=cats(_text_, _term_);
   else
      _text_=trim(_text_)||repeat(' ',_start_-_lag_end_-2)||_term_;
   _lag_end_=_end_;  
   if last._sentence_ then output;
   retain _text_ _lag_end_;
   keep _document_ _sentence_ _text_;
run;
";
run;   
quit;
 
proc cas;
   table.dropTable name="offset" quiet=true; run;
quit; 
%mend sentenceTokenizer2;
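
The DATA step inside sentenceTokenizer2 rebuilds each sentence from its tokens by padding the gaps between token offsets with blanks. The Python sketch below is a simplified rendition of that idea, not a translation of the macro. It assumes the end offset is inclusive (as the `_start_=_lag_end_+1` adjacency test suggests), adds no padding before the first token of a sentence, and uses hypothetical token rows.

```python
from itertools import groupby

def rebuild_sentences(tokens):
    """tokens: dicts with document, sentence, start, end (inclusive), term."""
    def sentence_key(tok):
        return (tok["document"], tok["sentence"])

    ordered = sorted(tokens, key=lambda t: (t["document"], t["sentence"], t["start"]))
    sentences = []
    for (doc, sent), group in groupby(ordered, key=sentence_key):
        text, prev_end = "", None
        for tok in group:
            if prev_end is not None:
                # Re-create the whitespace between the previous token and this one.
                text += " " * (tok["start"] - prev_end - 1)
            text += tok["term"]
            prev_end = tok["end"]
        sentences.append((doc, sent, text))
    return sentences

# Hypothetical offset rows for one lowercased sentence.
terms = [("i", 0, 0), ("paid", 2, 5), ("$23.45", 7, 12), ("for", 14, 16),
         ("this", 18, 21), ("book", 23, 26), (".", 27, 27)]
tokens = [{"document": 2, "sentence": 1, "start": s, "end": e, "term": t}
          for t, s, e in terms]
print(rebuild_sentences(tokens))
```

Because the text is reassembled from parsed terms rather than copied from the source, whatever normalization the parser applied (such as lowercasing) carries over into the rebuilt sentence.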

Here are three examples for using each of these tokenizer methods:

/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
 
/*-------------------------------------*/
/* Example 1: Chinese texts            */
/*-------------------------------------*/
data sascas1.text_zh;
   infile cards dlm='|' missover;
   input _document_ text :$200.;
   cards;
1|北京确实人多车多,但是根源在哪里?在于首都集中了太多全国性资源。
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_zh,
   docVar=_document_,
   textVar=text,
   language=chinese,
   dsOut=sentences_zh1
);
 
%sentenceTokenizer2(
   dsIn=text_zh,
   docVar=_document_,
   textVar=text,
   language=chinese,
   dsOut=sentences_zh2
);
 
/*-------------------------------------*/
/* Example 2: English texts            */
/*-------------------------------------*/
data sascas1.text_en;
   infile cards dlm='|' missover;
   input _document_ text :$500.;
   cards;
1|Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
2|I paid $23.45 for this book.
3|We earn more and more money, but we feel less and less happier. So…what happened to us?
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_en,
   docVar=_document_,
   textVar=text,
   language=english,
   dsOut=sentences_en1
);
 
%sentenceTokenizer2(
   dsIn=text_en,
   docVar=_document_,
   textVar=text,
   language=english,
   dsOut=sentences_en2
);
 
 
/*-------------------------------------*/
/* Example 3: German texts             */
/*-------------------------------------*/
data sascas1.text_de;
   infile cards dlm='|' missover;
   input _document_ text :$600.;
   cards;
1|Was sind die Konsequenzen der Abstimmung vom 12. Juni?
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_de,
   docVar=_document_,
   textVar=text,
   language=german,
   dsOut=sentences_de1
);
 
%sentenceTokenizer2(
   dsIn=text_de,
   docVar=_document_,
   textVar=text,
   language=german,
   dsOut=sentences_de2
);

Table 2 below shows the sentences extracted from the three examples.

Example | Doc | Text | Sentence (Method 1) | Sentence (Method 2)
English | 1 | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990. | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990. | rolls-royce motor cars inc. said it expects its u.s. sales to remain steady at about 1,200 cars in 1990.
English | 2 | I paid $23.45 for this book. | I paid $23.45 for this book. | i paid $23.45 for this book.
English | 3 | We earn more and more money, but we feel less and less happier. So…what happened to us? | We earn more and more money, but we feel less and less happier. | we earn more and more money, but we feel less and less happier.
 |  |  | So…what happened? | so…what happened?
Chinese | 1 | 北京确实人多车多,但是根源在哪里?在于首都集中了太多全国性资源。 | 北京确实人多车多,但是根源在哪里? | 北京确实人多车多,但是根源在哪里?
 |  |  | 在于首都集中了太多全国性资源。 | 在于首都集中了太多全国性资源。
German | 1 | Was sind die Konsequenzen der Abstimmung vom 12. Juni? | Was sind die Konsequenzen der Abstimmung vom 12. Juni? | was sind die konsequenzen der abstimmung vom 12. juni?

From the above table, you can see that there is no difference between the two methods with Chinese textual data, but there are many differences between them with English or German textual data. So which method should you use? It depends on the SAS products that you have available. Method 1 depends on the compileConcept, validateConcept, and applyConcept actions, and requires SAS Visual Text Analytics. Method 2 depends on the tpParse action in SAS Visual Analytics. If you have both products available, then consider your use case. If you are working on text analytics that are case insensitive, such as topic detection or text clustering, you may choose method 2. Otherwise, if the text analytics are case sensitive, such as named entity recognition, you must choose method 1. (And of course, if you don't have SAS Viya, you can use method 3 with SAS 9 and guidance from the cited paper.)

If you have SAS Viya, I suggest trying the above sentence tokenization methods with your data and then running text mining actions on the sentence-level data to see what insights you get.

How to tokenize documents into sentences was published on SAS Users.

July 6, 2018
 

SAS Visual Text Analytics provides dictionary-based and non-domain-specific tokenization functionality for Chinese documents. However, sometimes you still want to get N-gram tokens. This can be especially helpful when the documents are domain-specific and most of the tokens are not included in the SAS-provided Chinese dictionary.

What is an N-gram?

An N-gram is a sequence of N items from a given text with n representing any positive integer starting from 1. When n is 1, it refers to a unigram; when n is 2, it refers to a bigram; when n is 3, it refers to a trigram. For example, suppose we have a text in Chinese "我爱中国。", which means "I love China." Its N-gram sequence looks like the following:

n Size N-gram Sequence
1 [我], [爱], [中], [国], [。]
2 [我爱], [爱中], [中国], [国。]
3 [我爱中], [爱中国], [中国。]

How many N-gram tokens are in a given sentence?

If Token_Count_of_Sentence is the number of tokens in a given sentence, then the number of N-grams is:

Count of N-grams = Token_Count_of_Sentence – ( n - 1 )

The following table shows the N-gram token count of "我爱中国。" with different n sizes.

n Size N-gram Sequence Token Count
1 [我], [爱], [中], [国], [。] 5 = 5- (1-1)
2 [我爱], [爱中], [中国], [国。] 4 = 5- (2-1)
3 [我爱中], [爱中国], [中国。] 3 = 5- (3-1)
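
The count formula can be checked with a few lines of Python (an illustration outside SAS). The function name `ngrams` is hypothetical, and each character is treated as one token, as in the table above.

```python
def ngrams(text, n):
    """Return all character n-grams of exactly size n, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "我爱中国。"
for n in (1, 2, 3):
    grams = ngrams(sentence, n)
    # The count should equal Token_Count_of_Sentence - (n - 1).
    print(n, grams, len(grams), len(sentence) - (n - 1))
```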

In real natural language processing (NLP) tasks, we often want to get unigrams, bigrams and trigrams together when we set N as 3. Similarly, when we set N as 4, we want to get unigrams, bigrams, trigrams, and four-grams together.

N-gram theory is very simple, and under some conditions it has a big advantage over dictionary-based tokenization, especially when the corpus you are working on contains many terms that are not in the dictionary or when you don't have a dictionary at all.
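
Collecting every size from 1 up to N at each position can be sketched in Python as follows (an illustration outside SAS; the function name `ngrams_up_to` is hypothetical).

```python
def ngrams_up_to(text, max_n):
    """At each start position, emit the 1-gram, 2-gram, ... up to max_n."""
    terms = []
    for start in range(len(text)):
        longest = min(max_n, len(text) - start)
        for size in range(1, longest + 1):
            terms.append(text[start:start + size])
    return terms

terms = ngrams_up_to("我爱中国。", 3)
print(len(terms), terms)
```

For a five-character sentence and N set to 3, this gives 5 unigrams, 4 bigrams and 3 trigrams, 12 terms in total.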

How to get N-grams with SAS?

SAS is a powerful programming language when you manipulate data. Below you'll find a program I wrote, using the DATA step to get N-grams.

data data_test;
   infile cards dlm='|' missover;
   input _document_ text :$100.;
cards;
1|我爱中国。
;
run;
 
data NGRAMS;
   set data_test;
   _tmpStr_ = text;
   do while (klength(_tmpStr_)>0);  
      _maxN_=min(klength(_tmpStr_), 3);  
      do _i_=1 to _maxN_;
         _term_ = ksubstr(_tmpStr_, 1, _i_);
         output;  
      end;  
      if klength(_tmpStr_)>1 then _tmpStr_ = ksubstr(_tmpStr_, 2);  
      else _tmpStr_ = '';
   end;
   keep _document_ _term_ _i_;
run;

Let's see the SAS results.

proc sort data=NGRAMS;
   by _document_ _i_;
run;
 
proc print; run;

N-gram results

N-grams tokenization is the first step of NLP tasks. For most NLP tasks the second step is to calculate the term frequency–inverse document frequency (TF-IDF). Here's the approach:

tfidf(t,d,D) = tf(t,d) * idf(t,D)
idf(t,D) = log_e(total number of documents in D / number of documents in D that contain term t)

Where t denotes the terms; d denotes each document; D denotes the collection of documents.
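
As a small worked example of these two formulas, the Python sketch below (an illustration outside SAS) uses the raw count of a term in a document as tf and the natural logarithm for idf. The three short sample sentences and the helper names are assumptions made for the example.

```python
import math
from collections import Counter

def ngrams_up_to(text, max_n):
    """All character n-grams of size 1..max_n at every start position."""
    return [text[s:s + k]
            for s in range(len(text))
            for k in range(1, min(max_n, len(text) - s) + 1)]

docs = {1: "我爱中国。", 2: "我是中国人。", 3: "我是山西人。"}

# tf(t, d): how often term t occurs in document d.
tf = {doc_id: Counter(ngrams_up_to(text, 3)) for doc_id, text in docs.items()}
# Document frequency: in how many documents term t occurs at least once.
df = Counter(term for counts in tf.values() for term in counts)

def tfidf(term, doc_id):
    if tf[doc_id][term] == 0:
        return 0.0
    idf = math.log(len(docs) / df[term])
    return tf[doc_id][term] * idf

for term in ("我", "中国", "爱"):
    print(term, tfidf(term, 1))
```

A term that occurs in every document, such as "我", gets an idf of log(1) = 0, so it carries no weight. A term found in only one document, such as "爱", gets the highest weight.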

Suppose that you need to process lots of documents -- let me show you how to do it using SAS Viya. I used these four steps.

Step 1: Start CAS Server and create a CAS library.

cas casauto host="host.example.com" port=5570;
libname mycas cas;
 
Step 2: Load your data into CAS.

Here, to simplify the code, I use only three sentences for demo purposes.

data mycas.data_test;
   infile cards dlm='|' missover;
   input _document_ text :$100.;
cards;
1|我爱中国。
2|我是中国人。
3|我是山西人。
;
run;

Once the data is loaded to CAS, you may run the following code to check the column information and record count of your corpus.

proc cas;
  table.columnInfo / table="data_test";
run;
 
  table.recordCount / table="data_test";
run;
quit;

Step 3: Tokenize texts into N-grams

%macro TextToNgram(dsin=, docvar=, textvar=, N=, dsout=);
proc cas;
   loadactionset "dataStep";
   dscode =
      "data &dsout;
         set &dsin;
         length _term_ varchar(&N);
         _tmpStr_ = &textvar;
         do while (klength(_tmpStr_)>0);
            _maxN_=min(klength(_tmpStr_), &N);
            do _i_=1 to _maxN_;
              _term_ = ksubstr(_tmpStr_, 1, _i_);
              output;
            end;
            if klength(_tmpStr_)>1 then _tmpStr_ = ksubstr(_tmpStr_, 2); 
            else _tmpStr_ = ''; 
         end;
         keep &docvar _term_;
      run;";
   runCode code = dscode; 
run;
quit;
%mend TextToNgram;
 
%TextToNgram(dsin=data_test, docvar=_document_, textvar=text, N=3, dsout=NGRAMS);

Step 4: Calculate TF-IDF.

%macro NgramTfidfCount(dsin=, docvar=, termvar=, dsout=);
proc cas;
simple.groupBy / table={name="&dsin"}
                 inputs={"&docvar", "&termvar"}
                 aggregator="n" 
                 casout={name="NGRAMS_Count", replace=true};
run;
quit;
 
proc cas;
simple.groupBy / table={name="&dsin"}
                 inputs={"&docvar", "&termvar"}
                 casout={name="term_doc_nodup", replace=true};
run;
 
simple.groupBy / table={name="term_doc_nodup"}
                 inputs={"&docvar"}
                 casout={name="doc_nodup", replace=true};
run;
numRows result=r/ table={name="doc_nodup"};
totalDocs = r.numrows;
run;
 
simple.groupBy / table={name="term_doc_nodup"}
                 inputs={"&termvar"}
                 aggregator="n" 
                 casout={name="term_numdocs", replace=true};
run;
 
mergePgm = 
    "data &dsout;"
      || "merge NGRAMS_Count(keep=&docvar &termvar _score_ rename=(_score_=tf))
            term_numdocs(keep=&termvar _score_ rename=(_score_=numDocs));"
      || "by &termvar;"
      || "idf=log("||totalDocs||"/numDocs);"
      || "tfidf=tf*idf;"
      || "run;";
print mergePgm;
dataStep.runCode / code=mergePgm;
run;
quit;
%mend NgramTfidfCount;
 
%NgramTfidfCount(dsin=NGRAMS, docvar=_document_, termvar=_term_, dsout=NGRAMS_TFIDF);

Now let's see the TFIDF result of the first sentence.

proc print data=mycas.NGRAMS_TFIDF;
   where _document_=1;
run;

ngram results

These N-gram methods are not designed only for Chinese documents; documents in any language can be tokenized this way. However, the tokenization granularity of English documents differs from that of Chinese documents: it is word-based rather than character-based. To handle English documents, you only need to make small changes to my code.

How to get N-grams and TF-IDF count from Chinese documents was published on SAS Users.

March 30, 2018
 

Gradient boosting is one of the most widely used machine learning models in practice, and more and more people use it in Kaggle competitions. Are you interested in seeing how to use a gradient boosting model for classification in SAS Visual Data Mining and Machine Learning? Here I play with the classification of Fisher’s Iris flower dataset using gradient boosting, and this may serve as a starting point for those interested in trying the classification models in the SAS Visual Data Mining and Machine Learning product.

Fisher’s Iris data is a well-known dataset in data mining. Per Wikipedia, Fisher developed a linear discriminant model to distinguish the species from each other by the features provided in the dataset. You may have already seen people run different classification models on this dataset, such as neural networks. What I am interested in is seeing how well the SAS gradient boosting model classifies the species.

#1  Explore the dataset

We can easily load Fisher’s Iris dataset from SASHelp.Iris into SAS Viya. The dataset consists of 50 samples for each of three Iris species (Setosa, Virginica and Versicolor), 150 records in total, with five attributes: Petal Length, Petal Width, Sepal Length, Sepal Width and Iris Species. The dataset itself is already well-formed, with neither missing values nor outliers. Take a quick look at the dataset in SAS Visual Analytics below.

Gradient boosting

From the chart, we see that the iris species ‘Setosa’ can be easily distinguished from the ‘Versicolor’ and ‘Virginica’ species by the length and width of their petals and sepals. However, this is not the case for the latter two species: some of their points overlap closely, which makes it a little hard to distinguish them from each other by these features.

#2  Prepare Data

There is not much effort needed to prepare the data for the prediction. But one thing I’d like to mention here is the standardization of measure variables. By viewing the measure details in SAS Visual Analytics, we see that neither the Petal Length distribution nor the Petal Width distribution is normal. You may wonder if we need to normalize the data before applying it to the model for analysis, but this leads to one great thing I like about the Gradient Boosting model. Users do not need to explicitly standardize quantitative data. Tree-based models should be robust to such problems in an input feature, since the algorithm is based on node splits. (Here is an article discussing a similar problem.)
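
The robustness of node splits to rescaling can be illustrated with a toy Python sketch. This is not the SAS gradient boosting implementation: it is a single threshold split scored by Gini impurity, and the feature values, labels and function names are hypothetical.

```python
import math

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_partition(xs, ys):
    """Return the set of row indices sent left by the best single threshold."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = math.inf, None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold fits between equal values
        left, right = order[:k], order[k:]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

xs = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]   # hypothetical petal lengths
ys = ["a", "a", "b", "b", "c", "c"]   # hypothetical class labels

raw = best_split_partition(xs, ys)
scaled = best_split_partition([x * 10 for x in xs], ys)
logged = best_split_partition([math.log(x) for x in xs], ys)
print(raw == scaled == logged)
```

A split depends only on the ordering of the values, so rescaling or applying any other increasing transformation changes the threshold value but not which rows end up on each side.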

So, my data preparation here is just partitioning the data before starting the classification of iris species. I need to make sure each partition follows the same distribution of species as the iris dataset. This can be achieved easily in SAS Visual Analytics by adding a partition data item: set the Sampling method to ‘Stratified sampling’ and add ‘Iris Species’ as the column to stratify by. I define two partitions, training and validation. I set 60% for training and 40% for validation, with random seed 1234. Thus, a categorical data item ‘Partition’ is added, with a value of 0 for the validation partition and 1 for the training partition. (For easier understanding in the charts, I’ve created a custom category called ‘Partitions’ based on the ‘Partition’ data item values.)
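
Stratified partitioning can be sketched in Python as follows. This is an illustration of the idea only: it does not reproduce the rows that SAS Visual Analytics selects for seed 1234, and the function name `stratified_partition` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_partition(labels, train_fraction, seed):
    """Return a list of flags (1 = training, 0 = validation), sampled per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    partition = [0] * len(labels)
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(train_fraction * len(indices))
        for idx in indices[:n_train]:
            partition[idx] = 1
    return partition

labels = ["Setosa"] * 50 + ["Versicolor"] * 50 + ["Virginica"] * 50
partition = stratified_partition(labels, 0.6, seed=1234)
print(sum(partition), len(partition) - sum(partition))
```

Because the sampling is done within each species separately, every species contributes 30 rows to training and 20 to validation, so both partitions keep the same class proportions as the full dataset.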

The charts below show that the 150 rows in Fisher’s Iris dataset are distributed equally into three species, and the created partitions are sampled with the same percentage among the three species.

#3  Train the gradient boosting model

Training various models in SAS Visual Data Mining and Machine Learning allows us to appreciate the advantages of visualization, and it’s very straight-forward for users. In ‘Objects’ tab, drag and drop the ‘Gradient Boosting’ to the canvas. Assign the ‘Iris Species’ as response variable, and ‘Petal Length, Petal Width, Sepal Length, Sepal width’ as predictors. Then set the ‘Partition’ data item for Partition ID. After that, the system will train the model and show the model assessment. I’ve taken a screenshot for ‘Virginica’ event as below.

The response variable of Iris Species has three event levels – ‘Setosa’, ‘Versicolor’ and ‘Virginica’, and we can choose desired event level to have a look of the model output. In addition, we may switch the assessment plot of Lift to ROC plot, or to Misclassification plot (Note: the misclassification plot is based on event level, thus it will show the ‘Setosa’ and ‘NOT Setosa’ species if we choose the ‘Setosa’ event.). Below is a screenshot with ROC plot and the model assessment statistics.

In practice, training models usually takes a lot of effort in tuning model parameters. SAS Visual Data Mining and Machine Learning provides the ‘Autotune’ feature to help with this. Users may decide some settings, such as maximum iterations, seconds, and evaluations, and the product will choose the optimal values for the hyperparameters of the model. Considering that this dataset only has 150 samples, I won’t bother with hyperparameter tuning.

#4  Make prediction by the model

Now I can start to make predictions from the gradient boosting model for the data in the testing partition. There are several ways to go here. In Visual Data Mining and Machine Learning, on the right-click menu, choose either ‘Export model…’ or ‘Derive predicted…’. The first one exports the model code, so you can run it in SAS Studio with the data to be predicted. The latter one is very straightforward in SAS Visual Data Mining and Machine Learning. It pops up the ‘New Prediction Items’ page, where you may choose to get the predicted value and its probability values for all the levels of Iris Species. These data items will be added to the iris CAS table for further evaluation. Since the iris dataset has three species in the sample, I need to set ‘All levels’ so the prediction gives the classification into three species and their probabilities.

#5  Review the prediction result

In the model assessment tab, we already see the model assessment statistics for model evaluation. We may also switch to ‘Variable Importance’ tab, or ‘Lift’ tab, ‘ROC’ tab, and ‘Misclassification’ tab to see more about the model. Here I’d like to visually compare the predicted species value with the iris species value provided in the dataset.

To show the classification failures visually, I perform the following actions:

  • In SAS Visual Analytics, create a list table to show all 150 rows of the iris dataset. Since there is no primary key in the dataset, the SAS Visual Analytics list table will do aggregation for measure variables by default, so be sure to set the ‘Detail data’ option in the Options tab.
  • Create a calculated item (named ‘equals’) to compare if the values of ‘Iris Species’ and ‘Predicted: Iris Species’ columns are equal: {IF ( 'Iris Species'n = 'Predicted: Iris Species'n ) RETURN 1 ELSE 0. }
  • Define a display rule with the calculated item to highlight the misclassified rows. I’ve sorted the table by above ‘equals’ value so those rows without equal value of ‘Iris Species’ and ‘Predicted : Iris Species’ columns are shown on top.
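
The comparison behind the ‘equals’ item and the sort can be sketched in Python. This is an illustration outside SAS, and the four actual/predicted pairs are hypothetical rather than the model's real output.

```python
# Hypothetical actual and predicted species for four rows.
actual = ["Setosa", "Versicolor", "Virginica", "Versicolor"]
predicted = ["Setosa", "Virginica", "Virginica", "Versicolor"]

rows = [
    {"actual": a, "predicted": p, "equals": 1 if a == p else 0}
    for a, p in zip(actual, predicted)
]

# Sort so that misclassified rows (equals = 0) come first.
rows.sort(key=lambda r: r["equals"])
misclassified = sum(1 for r in rows if r["equals"] == 0)
print(misclassified, rows[0])
```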

We see that four rows are misclassified by the model: three are from the training partition and one is from the validation partition. So far, the result doesn't look bad, right?

We may continue to tune the parameters of the gradient boosting model easily in SAS Visual Data Mining and Machine Learning to improve the model. For example, if I set the leaf size to 2 instead of the default value of 5, the model accuracy improves (too good to be true?). See the screenshot below for a comparison.

Of course, people may like to try tuning other parameters, or to generate more features to refine the model. Either way, it is easy and straightforward to do classification using a gradient boosting model in SAS Visual Data Mining and Machine Learning. In addition, there are many other models in SAS Visual Data Mining and Machine Learning that people may like to run for classification. Would you like to practice with the other models?

Play with classification of Iris data using gradient boosting was published on SAS Users.

March 03, 2018
 

Report data shared by educational institutions, government agencies, healthcare organizations, and human resource departments can contain sensitive or confidential data. Data in such reports are suppressed selectively to protect the identities of individuals or to prevent the report’s audience from easily inferring individual values. The Data Suppression feature in SAS Visual Analytics 8.2 is easy to use when you need to selectively suppress aggregated data values in your reports.

All you need to do is create a calculated data item for Data Suppression and apply it to a report object such as a list table or a crosstab.  You could apply Data Suppression to a variety of report objects, but suppressing data for cells in either list tables or crosstabs is a common practice.

Here are a couple of examples where data suppression is applicable:

  • Universities and schools that release data on their students often apply a cell threshold value to their report data. This reduces the risk of identifying specific students when the number of students in a class falls below the defined threshold and individual values for test scores or other criteria, such as race, could otherwise be determined easily by looking at the data.
  • In official reports with federal statistics that are provided by the Centers for Disease Control and Prevention in the U.S., certain data cells in the reports are suppressed to protect the confidentiality of patients and eliminate the risk of disclosing their identity. Patient data in such reports are suppressed by using a cell suppression threshold value of 16.

Before we jump into data suppression in SAS Visual Analytics, a quick note on understanding two kinds of data suppression.

Data Suppression by Using the withComplement Option

When a calculated data item is created for Data Suppression, SAS Visual Analytics applies the  withComplement option by default, and an additional complementary value is hidden randomly (by displaying an asterisk)  when you suppress the data for a single aggregated value.  This is done to prevent easy inference of the data values by viewing the total, subtotals, or other cell values.

Data Suppression by Using the withoutComplement Setting

If a calculated data item for Data Suppression is created by using the withoutComplement option, SAS Visual Analytics suppresses (by using an asterisk) only the aggregated data values that you chose to suppress, and no other additional complementary values are hidden with asterisks.
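
To make the difference between the two options concrete, here is a minimal Python sketch. It is not how SAS Visual Analytics implements suppression; the function name `suppress`, its parameters, and the sample counts are assumptions made for illustration. The only behavior taken from the description above is that, with the complement option, one additional value is hidden at random when a single value is suppressed.

```python
import random


def suppress(values, should_suppress, with_complement=True, rng=None):
    """Replace suppressed values with '*'.

    Assumption for this sketch: when exactly one value is suppressed and
    with_complement is True, one other value is picked at random and hidden too.
    """
    rng = rng or random.Random()
    hidden = {i for i, value in enumerate(values) if should_suppress(value)}
    if with_complement and len(hidden) == 1:
        candidates = [i for i in range(len(values)) if i not in hidden]
        if candidates:
            hidden.add(rng.choice(candidates))
    return ["*" if i in hidden else value for i, value in enumerate(values)]


# Hypothetical counts, for illustration only.
counts = [13, 120, 85, 40]
is_small = lambda value: value <= 14

without_complement = suppress(counts, is_small, with_complement=False)
with_complement = suppress(counts, is_small, with_complement=True, rng=random.Random(0))

print(without_complement)
print(with_complement)
```

In the first result only the small count is masked; in the second, one extra cell is masked as well, so the hidden value cannot be recovered by subtracting the visible cells from a total.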

Let’s Do It

As an instructional exercise for data suppression, I chose a small subset of the data for high school students and their SAT test scores in the state of North Carolina. I added three list tables to my report. My first list table has no data suppression (so we can see the data that I intend to suppress). My second list table will have data suppression without complementary values, and my third list table will have data suppression with complementary values.

In the first list table, the TESTED column shows the number of students that took the SAT test in each high school. If 14 or fewer students took the SAT test, I want to suppress the display of the number of students in the TESTED column for that high school.

Create the Calculated Data Item for Data Suppression Without Complementary Values

1.  In SAS Visual Analytics, I click on Data, right click on TESTED (the measure upon which my calculated item for data suppression will be created), and select New calculation.

2.  In the Create Calculation dialog, I change the Type to Suppression. By default, SAS Visual Analytics fills in the default value of 5 observations for the Suppress data if count less than: parameter field. I plan to change this value and the condition; for now, I keep the default value so I click OK.

Edit the Calculated Data Item for Data Suppression Without Complementary Values

1.  To edit the calculated item that I just created, I click on Data, right click on the calculated item (TESTED (Data suppression) 1), and choose Edit.

2.  In the Visual mode, I see the calculated item for data suppression.

3.  I click on Text because I want to suppress values of 14 and below in the TESTED column (the number of students that took the test), rather than suppressing by the number of observations (Frequency), which is the default. So I edit the condition for data suppression and save it:

4.   My second list table already has roles assigned to it. Now I add the newly created calculated data item: TESTED (Data Suppression) 1.
This List Table now shows asterisks for values suppressed in the TESTED column for any high school where 14 or fewer students took the SAT test.
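
The edited condition amounts to a simple threshold rule on the measure itself. The Python sketch below mimics that rule on a few rows; the school names and counts are hypothetical, and the function `suppress_tested` is an illustration rather than anything provided by SAS Visual Analytics.

```python
SUPPRESSION_THRESHOLD = 14  # from the article: suppress when TESTED is 14 or fewer


def suppress_tested(rows, threshold=SUPPRESSION_THRESHOLD):
    """Return display copies of the rows with small TESTED values masked by '*'."""
    return [
        {**row, "TESTED": "*" if row["TESTED"] <= threshold else row["TESTED"]}
        for row in rows
    ]


# Hypothetical schools and counts, for illustration only.
schools = [
    {"School": "School A", "TESTED": 14},
    {"School": "School B", "TESTED": 15},
    {"School": "School C", "TESTED": 9},
]

displayed = suppress_tested(schools)
for row in displayed:
    print(row["School"], row["TESTED"])
```

Note that the function builds new rows for display and leaves `schools` unchanged, which mirrors the point that suppression only masks what is shown while the underlying values remain in the data source.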

All values of the TESTED measure that meet my condition are replaced with asterisk characters. It is important to note that although the suppressed values for TESTED are hidden from view with asterisks, they are still present in the data source. Therefore, I should hide the original measure (in this case, TESTED) from view in the report to prevent the accidental use of the TESTED measure for other report objects in the same report (we’ll take a quick look at that at the end).

Create the Calculated Data Item for Data Suppression With Complementary Value

1.  I click on Data, right click on TESTED, and select New calculation.

2.  In the Create Calculation dialog, I change the Type to Suppression and click OK to save this new calculated item.

Edit the Calculated Data Item for Data Suppression With Complementary Value

1.  To edit the calculated item that I just created, I right click on the calculated item for data suppression and choose Edit.

2.  In the Edit Calculated Item dialog, I click Text to see the text version of the calculated data item, and I edit the condition so that data is suppressed for high schools where the total number of students tested equals 13.

My List Table now shows values suppressed in the TESTED column for the high school where 13 students took the SAT test. In addition, another value in the TESTED column is also suppressed randomly by SAS Visual Analytics – in this case, it was for Creswell High School. The random suppression of another value is done to prevent your audience from looking at the Totals column and guessing the number of students that took the SAT test in each high school.
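
A short Python sketch shows why the extra hidden value matters. With a published total and only one masked cell, a reader can recover the masked number by subtraction; with two masked cells, only their sum can be recovered. The numbers and the function `infer_hidden` are hypothetical and serve only to illustrate the arithmetic.

```python
def infer_hidden(displayed, total):
    """Try to recover a masked value from the visible cells and the total.

    Returns the exact hidden value when exactly one cell is masked,
    otherwise None (the individual values cannot be determined).
    """
    hidden_count = sum(1 for value in displayed if value == "*")
    visible_sum = sum(value for value in displayed if value != "*")
    if hidden_count == 1:
        return total - visible_sum
    return None


# Hypothetical counts: 13 + 120 + 85 = 218 students in total.
total_tested = 218

one_masked = infer_hidden(["*", 120, 85], total_tested)
two_masked = infer_hidden(["*", "*", 85], total_tested)

print(one_masked)
print(two_masked)
```

With a single masked cell the hidden count falls out of the total immediately, which is the inference that the complementary suppression is meant to block.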

Be sure to follow the three best practices that are described for data suppression in the SAS Visual Analytics 8.2 documentation:

The TESTED measure does not display anymore.

For details on how to show or hide data items, see the SAS Visual Analytics 8.2 documentation.

Is it sensitive? Mask it with data suppression was published on SAS Users.