The internet is rich with data, and much of that data seems to exist only on web pages, which -- for some crazy reason -- are designed for humans to read. When students/researchers want to apply data science techniques to analyze collect and analyze that data, they often turn to "data scraping." What is "data scraping?" I define it as using a program to fetch the contents of a web page, sift through its contents with data parsing functions, and save its information into data fields with a structure that facilitates analysis.
Python and R users have their favorite packages that they use for scraping data from the web. New SAS users often ask whether there are similar packages available in the SAS language, perhaps not realizing that Base SAS is already well suited to this task -- no special bundles necessary.
The basic steps for data scraping are:
- Fetch the contents of the target web page
- Process the source content of the page -- usually HTML source code -- and parse/save the data fields you need.
- If necessary, repeat for subsequent pages. This applies to those web sites that serve up lots of information in paginated form, and you want to collect all available pages of data.
Let's map these steps to the SAS programming language:
|For this step||Use these features||Get the contents of the web page||FILENAME URL|
|Process/parse the web page contents||DATA step, with parsing functions such as FIND, regular expressions via PRXMATCH. Use SAS informats to convert text to native data types.|
|Repeat across subsequent pages||SAS macro language (%DO %UNTIL processing) or DATA step with asked about scraping data from the Center for Disease Control (CDC) web site.
Step 0: Find the original data source and skip the scrape
I'm writing this article at the end of 2017, and at this point in our digital evolution, web scraping seems like a quaint pastime. Yes, there are still some cases for it, which is why I'm presenting this article. But if you find information that looks like data on a web page, then there probably is a real data source behind it. And with the movement toward open data (especially in government) and data-driven APIs (in social media and commercial sites) -- there is probably a way for you to get to that data source directly.
In my example page for this post, the source page is hosted by CDC.gov, a government agency whose mandate is to share information with other public and private entities. As one expert pointed out, the CDC shares a ton of data at its dedicated data.cdc.gov site. Does this include the specific table the user wanted? Maybe -- I didn't spend a lot of time looking for it. However, even if the answer is No -- it's a simple process to ask for it. Remember that web sites are run by people, and usually these people are keen to share their information. You can save effort and achieve a more reliable result if you can get the official data source.
If you find an official data source to use instead, you might still want to automate its collection and import into SAS. Here are two articles that cover different techniques:
Web scraping is lossy, fragile process. The information on the web page does not include data types, lengths, or constraints metadata. And one tweak to the presentation of the web page can break any automated scraping process. If you have no other alternative and you're willing to accept these limitations, let's proceed to Step 1.
Step 1: Fetch the web page
In this step, we want to achieve the equivalent of the "Save As..." function from your favorite web browser. Just like your web browser, SAS can act as an HTTP client. The two most popular methods are:
I prefer PROC HTTP, and here's why. First, as a separate explicit step it's easier to run just once and then work with the file result over the remainder of your program. You're guaranteed to fetch the file just once. Though a bit more code, it makes it more clear about what happens under the covers. Next reason: PROC HTTP has been refined considerably in SAS 9.4 to run fast and efficient. Its many options make it a versatile tool for all types of internet interactions, so it's a good technique to learn. You'll find many more uses for it.
Here's my code for getting the web page for this example:
Step 2: Parse the web page contents
Before you can write code that parses a web page, it helps to have some idea of how the page is put together. Most web pages are HTML code, and the data that is "locked up" within them are expressed in table tags: <table>, <tr>, <td> and so on. Before writing the first line of code, open the HTML source in your favorite text editor and see what patterns you can find.
For this example from the CDC, the data we're looking for is in a table that has 6 fields. Fortunately for us, the HTML layout is very regular, which means our parsing code won't need to worry about lots of variation and special cases.
The pattern is simple. We can find a unique "marker" for each entry by looking for the "/nndss/conditions" reference. The line following (call it offset-plus-1) has the disease name. And the lines that are offset-plus-7 and offset-plus-8 contain the date ranges that the original poster was asking for. (Does this seem sort of rough and "hacky?" Welcome to the world of web scraping.)
Knowing this, we can write code that does the following:
Here's the code that I came up with, starting with what the original poster shared:
The result still needs some cleanup -- there are a few errant data lines in the data set. However, it should be simple to filter out the records that don't make sense with some additional conditions. I left this as a to-do exercise to the original poster. Here's a sample of the result:
Step 3: Loop and repeat, if necessary
For this example with the CDC data, we're done. All of the data that we need appears on a single page. However, many web sites use a pagination scheme to break the data across multiple pages. This helps the page load faster in the browser, but it's less convenient for greedy scraping applications that want all of the data at once. For web sites that paginate, we need to repeat the process to fetch and parse each page that we need.
Web sites often use URL parameters such as "page=" to indicate which page of results to serve up. Once you know the URL parameter that controls this offset, you can create a SAS macro program that iterates through all of the pages that you want to collect. There are many ways to accomplish this, but I'll share just one here -- and I'll use a special page from the SAS Support Communities (communities.sas.com) as a test.
Aside from the macro looping logic, my code uses two tricks that might be new to some readers.
The result of this program is one long data set that contains the results from all of the pages -- that is, all of lines of HTML code from all of the web pages. You can then use a single DATA step to post-process and parse out the data fields that you want to keep. Like magic, right?
Where to learn more
Each year there are one or two "web scraping" case studies presented at SAS Global Forum. You can find them easily by searching on the keyword "scrape" from the SAS Support site or from lexjansen.com. Here are some that I found interesting: