PROC HTTP

September 6, 2019
 

A few years ago I shared a method to publish content from SAS to a Slack channel. Since that time, our teams at SAS have gone "all in" on collaboration with Microsoft Office 365, including Microsoft Teams. Microsoft Teams is the Office suite's answer to Slack, and it's not a coincidence that it works in nearly the same way.

The lazy method: send e-mail to the channel

Before I cover the "deluxe" method for sending content to a Microsoft Teams channel, I want to make sure you know that there is a simple method that involves no coding, and no need for APIs. The message experience isn't as nice, but it does the job. You can simply "send e-mail" to the channel. If you're automating output from SAS, it's a simple, well-documented process to send e-mail from a SAS program. (Here's an example from me, using FILENAME EMAIL.)

When you send e-mail to a Microsoft Teams channel, the message notice includes the message subject line, sender, and the first bit of the message content. To see the entire message, you must click on the "View original e-mail" link in the notice. This "downloads" the message to your device so that you can open it with a local tool (such as your e-mail reader, Microsoft Outlook). My team uses this method to receive certain alerts from our communities.sas.com platform. Here's an example:

To get the unique e-mail address for a channel, right-click on the channel name and select Get email address. Any message that you send to that e-mail address will be distributed to the team.
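If you haven't automated e-mail from SAS before, here's a minimal sketch using FILENAME EMAIL. The TO= address is a placeholder for the channel address you copied, and it assumes your SAS session is already configured for e-mail (the EMAILSYS and EMAILHOST system options):

/* Minimal sketch -- the TO= address is a placeholder for your   */
/* channel's address. Assumes your SAS session is configured     */
/* for e-mail (EMAILSYS= and EMAILHOST= system options).         */
filename outbox email
  to="your-channel-id.yourorg@amer.teams.ms"
  subject="Alert: nightly job status";

data _null_;
  file outbox;
  put "The nightly SAS job completed successfully.";
run;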

Getting started with a Microsoft Teams webhook

In order to provide a richer, more integrated experience with Microsoft Teams, you can publish content using a webhook. A webhook is a REST API endpoint that allows you to post messages and notifications with more control over the appearance and interactive options within the messages. In SAS, you can publish to a webhook by using PROC HTTP.

To get started, you need to add and configure a webhook for your Microsoft Teams channel:

  1. Right-click on the channel name and select Connectors.
  2. Microsoft Teams offers built-in connectors for many different applications. To find the connector for Incoming Webhook, use the search field to narrow the list. Then click Add to add the connector to the channel.
  3. You must grant certain permissions to the connector to interact with your channel. In this case, you need to allow the webhook to send messages and notifications. Review the permissions and click Install.
  4. On the Configuration page, assign a name to this connector and optionally customize the image. The image will be the avatar that's used when the connector posts content to the channel. When you've completed these changes, select Create.
  5. The connector generates a unique (and very long) URL that serves as the REST API endpoint. You can copy the URL from this field -- you will need it later in your SAS program. You can always come back to these configuration settings to change the connector avatar or re-copy the URL.

At this point, it's a good idea to test that you can publish a basic message from SAS. The "payload" for a Teams message is a JSON-formatted structure, and you can find examples in the Microsoft Teams reference doc. Here's a SAS program that publishes the simplest message. Add your webhook URL and run the code to verify the connector is working for your channel.

filename resp temp;
options noquotelenmax;
proc http
  /* Substitute your webhook URL here */
  url="https://outlook.office.com/webhook/your-unique-webhook-address-it-is-very-long"
  method="POST"
  in=
  '{
      "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
      "type": "AdaptiveCard",
      "version": "1.0",
      "summary": "Test message from SAS",
      "text": "This message was sent by **SAS**!"
  }'
  out=resp;
run;

If successful, this step will post a simple message to your Teams channel.
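For programmatic confirmation in your SAS log, you can check the status macro variables that PROC HTTP sets (available since SAS 9.4 Maint 5). A 200 status means the webhook accepted the message:

/* 200 means the webhook accepted the message */
%put HTTP status: &SYS_PROCHTTP_STATUS_CODE. &SYS_PROCHTTP_STATUS_PHRASE.;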

Design a message card for Microsoft Teams

Now that we have the basic plumbing working, it's time to add some bells and whistles. Microsoft Teams calls these notifications "message cards", which are messages that can include interactive features such as images, data, action buttons, and more.

Designing a simple message

Microsoft Teams supports a large palette of building blocks (expressed in JSON) to create different card experiences. You can experiment with these cards in the MessageCard Playground that Microsoft hosts. The tool provides templates for several card varieties, and you can edit the JSON definitions to tweak and design your own.

For one of my use cases, I designed a simple card to show the status of our recommendation engine on SAS Support Communities. (Read this article for more information about how we built and monitor the recommendation engine.) The engine runs as a service and is accessed with its own API. I wanted a periodic "health check" to post to our internal team that would alert us to any problems. Here's the JSON that I used in the MessageCard Playground to design it.

Much of the JSON is boilerplate for the message. I drew the green blocks to indicate the areas that need to be dynamic -- that is, replaced with values from the real-time API call. Here's what the card looks like when rendered in the Microsoft Teams channel.

Since my API call to the recommendation engine service creates a data set, I can run that data through PROC JSON to create the JSON segment I need:

/* reading the results from my API call to the engine */
libname results json fileref=resp;

/* Prep a simple name-value data set with the results */
data segment (keep=name value);
 set results.root;
 name="Score data updated (UTC)";
 value=astore_creation;
 output;
 name="Topics scored";
 value=left(num_topics);
 output;
 name="Number of users";
 value=left(num_users);
 output;
 name="Process time";
 value=process_time;
 output;
run;

/* use PROC JSON to create the segment */
filename segment temp;
proc json out=segment nosastags pretty;
 export segment;
run;
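With the NOSASTAGS option, PROC JSON writes the data set as a plain array of name/value objects -- roughly this shape (the values here are invented):

[
  {
    "name": "Score data updated (UTC)",
    "value": "2019-09-05T04:15:00"
  },
  {
    "name": "Topics scored",
    "value": "1278"
  }
]

In the MessageCard format, an array like this typically plugs into a section's "facts" list.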

I shared a version of the complete program on GitHub. It should run as is -- but you would need to supply your own webhook endpoint for a channel that you can publish to.

Design a message with actions

I also use Microsoft Teams to share updates about the SAS Software GitHub organization. In a previous article I discussed how I use GitHub APIs to gather data from the GitHub service. Each day, my program summarizes the recent activity from github.com/sassoftware and publishes a message card to the team. Here's an example of a daily update:

This card is fancier than my first example. I added action buttons that can direct the team members to the internal reports for more details and to the GitHub site itself. I used the Microsoft Teams documentation and the MessageCard Playground to design the experience.
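The buttons themselves are defined in the card's potentialAction array. Here's a trimmed sketch in the MessageCard format -- the URLs are placeholders, not the real report addresses:

"potentialAction": [
  {
    "@type": "OpenUri",
    "name": "View full report",
    "targets": [
      { "os": "default", "uri": "https://your-internal-report-site/github-activity" }
    ]
  },
  {
    "@type": "OpenUri",
    "name": "Visit github.com/sassoftware",
    "targets": [
      { "os": "default", "uri": "https://github.com/sassoftware" }
    ]
  }
]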

Messaging apps as part of a DevOps strategy

Like many organizations, we (SAS) invest a considerable amount of time and energy into gathering metrics and building reports about our operations. However, reports are useful only when the intended audience is tuned in and refers to them regularly. With a small additional step, you can use SAS to bring your most interesting data forward to your team -- automatically.

Whether you use Microsoft Teams or Slack, automated alerting and updates are a great opportunity to keep your teams informed. Each of these tools offers fit-for-purpose connectors that can tie in with information from other popular operational systems (Salesforce, GitHub, Yammer, JIRA, and many more). For cases where a built-in connector is not available, the webhook approach allows you to easily create your own.

The post How to publish to a Microsoft Teams channel using SAS appeared first on The SAS Dummy.

November 29, 2018
 

If your work environment is like ours here at SAS, you're seeing more of your data and applications move to the cloud. It's not yet a complete replacement for having local files on your desktop machine, but with cloud storage and apps -- like Microsoft OneDrive -- I can now access my work documents from any browser and any device, including my smartphone. I can now update my spreadsheets while waiting in the dentist's office. Oh joy.

For those of us who use SAS to read and create Microsoft Excel documents, cloud-based files can add an extra wrinkle when we automate the process. They also add some exciting possibilities! The Microsoft Office 365 suite offers APIs to discover, fetch, and update our documents using code. In this article, I'll show you how to use SAS programs to reach into your Microsoft OneDrive cloud to read and update your files. Note: All of this assumes that you already have a Microsoft Office 365 account -- perhaps provisioned by your IT support team -- and that you're using it to manage your documents.

Before I go on, I have to give major credit to Joseph Henry, the SAS developer who maintains PROC HTTP. Joseph did the heavy lifting for putting together the code and examples in this article. He also regularly adds new features to PROC HTTP that make it a more natural fit for calling REST APIs that require special authentication flows, such as OAuth2.

Using SAS with Microsoft OneDrive: an overview

Microsoft Office 365 uses an OAuth2-style authentication flow to grant access and permissions to third-party apps. When we're writing SAS programs to access Microsoft OneDrive, we're also writing a third-party app. This requires several setup steps, a few of which cannot be automated. Fortunately, these need to be done just once, or at least infrequently. Here's an outline of the steps:

  1. Register a new client application at the Microsoft Application Registration Portal. (You will need to sign in with your Office 365 credentials, which might be your primary organization credentials if you have single sign-on with Active Directory.)
  2. Using your browser while you are signed into Office 365, navigate to a special web address to obtain an authorization code for your application.
  3. With your authorization code in hand, plug this into a SAS program (PROC HTTP step) to retrieve an OAuth2 access token (and a refresh token).
  4. With the access token, you can now use PROC HTTP and the Microsoft Office 365 APIs to retrieve your OneDrive folders and files, download files, upload files, and replace files.

You'll have to complete Step 1 just once for your application or project. Steps 2 and 3 can be done just once, or at least just occasionally. The access token is valid for a limited time (usually 1 hour), but you can always exchange the refresh token for a new valid access token. This refresh token step can be automated in your program, usually run just once per session. Occasionally that refresh token can be revoked (and thus made invalid) when certain events occur (such as you changing your account password). When that happens, you'll need to repeat steps 2 and 3 to get a new set of access/refresh tokens.
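To make that token exchange concrete, here's a minimal sketch of the raw refresh call that the utility macros described later wrap. It assumes the &tenant_id, &client_id, &refresh_token, and &resource macro variables have already been set:

/* Sketch of the raw refresh-token exchange. The %refresh macro   */
/* introduced later wraps this same call. Assumes &tenant_id,     */
/* &client_id, &refresh_token, and &resource are already set.     */
filename tokresp temp;
proc http
  url="https://login.microsoftonline.com/&tenant_id./oauth2/token"
  method="POST"
  ct="application/x-www-form-urlencoded"
  in="grant_type=refresh_token%str(&)client_id=&client_id.%str(&)refresh_token=&refresh_token.%str(&)resource=&resource."
  out=tokresp;
run;

/* The JSON response carries a fresh access token */
libname tok json fileref=tokresp;
data _null_;
  set tok.root;
  call symputx('access_token', access_token);
run;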

Step 1: Register your application

Visit the Microsoft Application Registration portal to register your new app. You'll sign in with your Microsoft Office 365 credentials.

Microsoft Application Registration portal

Note: At the time of this writing, Microsoft was testing a new web experience for application developers. I opted for the "preview experience" with the hope that my screenshots would have a longer shelf life for future readers.

Click New Registration to get started. This presents you with a form where you can complete the details that define your app. Mainly, you're giving it a name and defining its scope. You'll probably want to limit its use to just your organization (your company) unless you're collaborating with colleagues who work elsewhere.

"Register an application" form

When you create an app, you'll receive a Client ID (unique to your app) and Tenant ID (unique to your organization). You'll need these values to obtain your authorization code and tokens later. The application portal provides a sort of control center for all aspects of your app. (Note: I masked out my client ID and tenant ID in this screenshot.)

Details for my sample application

Specifying your app permissions

Your app will need specific permissions in order to function. In my example, I want my SAS program to read documents from my OneDrive, and also add new docs and update existing docs. The permissions I need are:

  • Files.ReadWrite.All: Allows the app to read, create, update and delete all files that you can access.
  • User.Read: Allows you to sign in to the app with your organizational account and let the app read your profile.

To add these to your app, click the API Permissions tab in the control center. To be clear, these are not permissions that your app will automatically have. These are the permissions that will be requested when you "sign into" the app for the first time, and that you'll have to agree to in order for the app to run.

Adding permissions that the app needs

Creating a configuration file

There are a few app-specific values that we'll need to reference throughout the SAS programs we're writing. I decided to create a configuration file for these settings rather than hard-code them into my SAS statements. This will make it easier for other people to reuse my code in their own applications.

I created a file named config.json that looks like this (but with different tenant_id and client_id values):

{
  "tenant_id": "206db638-6adb-41b9-b20c-95d8d04abcbe",
  "client_id": "8fb7804a-8dfd-40d8-bf5b-d02c2cbc56f3",
  "redirect_uri": "https://login.microsoftonline.com/common/oauth2/nativeclient",
  "resource" : "https://graph.microsoft.com"
}

By "externalizing" the IDs specific to my account/instance, I can use SAS code to read the values at run time. Note: This code, like all of the code in this article, uses features from SAS 9.4 Maintenance 5.

/*
  Set the variables that will be needed through the code
  We'll need these for authorization and also for runtime 
  use of the service.
 
  Reading these from a config.json file so that the values
  are easy to adapt for different users or projects.
*/
 
%if %symexist(config_root) %then %do;
  filename config "&config_root./config.json";
  libname config json fileref=config;
  data _null_;
   set config.root;
   call symputx('tenant_id',tenant_id,'G');
   call symputx('client_id',client_id,'G');
   call symputx('redirect_uri',redirect_uri,'G');
   call symputx('resource',resource,'G');
  run;
%end;
%else %do;
  %put ERROR: You must define the CONFIG_ROOT macro variable.; 
%end;

Step 2: Obtain an authorization code

Now that I've defined the application, it's time to "sign into it" and grant it the permission to read and manage content in OneDrive. This step needs to be completed from a web browser while I am signed into my Microsoft Office 365 account. The web address is very long...but we can use a SAS program to generate it for us.

/* location of my config file */
%let config_root=/folders/myfolders/onedrive;
 
%include "&config_root./onedrive_config.sas";
 
/* Run this line to build the authorization URL */
%let authorize_url=https://login.microsoftonline.com/&tenant_id./oauth2/authorize?client_id=&client_id.%nrstr(&response_type)=code%nrstr(&redirect_uri)=&redirect_uri.%nrstr(&resource)=&resource.;
options nosource;
%put Paste this URL into your web browser:;
%put -- START -------;
%put &authorize_url;
%put ---END ---------;
options source;

This produces these output lines in the SAS log:

 Paste this URL into your web browser:
 -- START -------
https://login.microsoftonline.com/206db638-6adb-41b9-b20c-95d8d04abcbe/oauth2/authorize?client_id=8fb7804a-8dfd-40d8-bf5b-d02c2cbc56
f3&response_type=code&redirect_uri=https://login.microsoftonline.com/common/oauth2/nativeclient&resource=https://graph.microsoft.com
 ---END ---------

Copy and paste the URL (all on one line, no spaces) into the address bar of your web browser. When you press Enter, you'll be prompted to grant the required permissions:

Once you click Accept, the browser will redirect to what looks like a blank page, but the URL contains the authorization code that we need:

Copy the value that appears after the code= in the URL. It's going to be a very long string -- over 700 characters. We'll need that value for the next step.
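If you'd rather not pick the value out by hand, paste the entire redirect URL into a macro variable and let a short DATA step extract the code. This is just a convenience sketch; the URL below is a placeholder:

/* Paste the full redirect URL inside %nrstr() -- it masks the & signs */
%let redirect_url=%nrstr(https://login.microsoftonline.com/common/oauth2/nativeclient?code=PASTE-FULL-URL-HERE);

data _null_;
  length url code $ 2048;
  url = symget('redirect_url');
  /* keep everything after "code=" up to the next "&", if any */
  code = scan(substr(url, index(url, 'code=') + 5), 1, '&');
  call symputx('auth_code', code);
run;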

Step 3: Obtain an access token

My colleague Joseph wrote a few convenient utility macros that can help manage the access token and refresh token within your SAS session. These macros include:

  • %get_token - get the initial access and refresh tokens, given an authorization code. Remember, an access token will expire in about 60 minutes. But the refresh token can be used to get a renewed access token.
  • %refresh - exchange a valid refresh token for a new access token
  • %process_token_file - read/update an external token file so that these values persist beyond your current SAS session.

I'm not going to walk through the macro code in this article, but the SAS programs are straightforward and well-documented. See "How to get this example code" at the end of this article.

With these macros in place, we can paste the (very long) authorization code we retrieved in the previous step into a macro variable. Then we can run the %get_token macro to generate the tokens and store them in a local file.

%let config_root=/folders/myfolders/onedrive;
 
%include "&config_root./onedrive_config.sas";
%include "&config_root./onedrive_macros.sas";
 
filename token "&config_root./token.json";
%let auth_code=AQABAAIAAAC5una0EUFgTIF8ElaxtWjTqwohjyfG; * and much more;
 
/*
  Now that we have an authorization code we can get the access token
  This step will write the tokens.json file that we can use in our
  production programs.
*/
%get_token(&client_id.,&auth_code,&resource.,token,tenant=&tenant_id);

Running this step will create a new file, token.json, in your designated config folder.
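Assuming the macros store the standard Azure AD token response as-is, the file looks roughly like this (values invented and heavily shortened):

{
  "token_type": "Bearer",
  "expires_in": "3599",
  "resource": "https://graph.microsoft.com",
  "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOi...",
  "refresh_token": "OAQABAAAAAAC5una0EUFgTIF8El..."
}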

It's very important that you keep this file secure. With the information in this file (your refresh token) and your config.json file (with your client ID and tenant ID), anyone can use these code techniques to impersonate you and access your Microsoft Office 365 data. There are techniques for storing these files such that only you can see them.

Using Microsoft Office 365 APIs to access OneDrive from SAS

Whew! It took nearly 1,500 words to get this far, so thanks for sticking with me. The good news is that these steps take much longer to describe than to actually execute. Plus, creating apps is fun! (Right?)

From the screenshots I've shared, you probably already noticed that these services are running on Microsoft Azure, which is Microsoft's cloud platform for applications. For the remainder of this article, I'll be using methods from the Microsoft Graph API. This REST-based API provides access to almost all of Microsoft's hosted services. For my examples, I'll be using methods within the Files component of the API: Drives and Drive Items (folders and files).

Initializing and refreshing the access token in a new session

Now that we have the access and refresh tokens, we can get down to business with some actual OneDrive interactions. Here's how to initialize your SAS session with the tokens.

%let config_root=/folders/myfolders/onedrive;
 
%include "&config_root./onedrive_config.sas";
%include "&config_root./onedrive_macros.sas";
 
/*
  Our json file that contains the oauth token information
*/
filename token "&config_root./token.json";
 
%process_token_file(token);
 
/* If this is first use for the session, we'll likely need to refresh  */
/* the token.  This will also call process_token_file again and update */
/* our token.json file.                                                */
%refresh(&client_id.,&refresh_token.,&resource.,token,tenant=&tenant_id.);
 
/*
  At this point we have a valid access token and we can start using the API.
*/

If all goes well, we'll have our access token, and it will be stored in a macro variable named &access_token. It's going to be another long and illegible (>700 characters) value.

(Ever hear of the "infinite monkey theorem"? That a monkey hitting a typewriter for an infinite amount of time is certain to produce a given text, such as the complete works of Shakespeare? Well, that monkey is not going to produce this access token. Plus, who has a typewriter anymore?)

Retrieving the top-level drive identifier

We'll need to explore the OneDrive system from the top down, using code. First, we need the identifier for the root drive. It's possible for you to have multiple root drives, and if that's the case for you, you'll need to modify this code a bit. This code queries the service for your drives, and stores the identifier for just the first drive in a macro variable. We'll need that identifier later to retrieve a list of top-level items.

/*
 First we need the ID of the "drive" we are going to use.
 To list the drives the current user has access to, you can do this:
*/
filename resp TEMP;
/* Note: oauth_bearer option added in 9.4M5 */
proc http url="https://graph.microsoft.com/v1.0/me/drives/"
     oauth_bearer="&access_token"
     out = resp;
run;
 
libname jresp json fileref=resp;
 
/*
 I only have access to 1 drive, but if you have multiple you can filter 
 the set with a where clause on the name value.
 
 This creates a data set with the one record for the drive.
*/
data drive;
 set jresp.value;
run;
 
/* store the ID value for the drive in a macro variable */
proc sql noprint;
 select id into: driveId from drive;
quit;

Note that this code uses the new OAUTH_BEARER option in PROC HTTP -- a convenient addition when working with OAuth2-compliant APIs. This is shorthand -- and more intuitive syntax -- for placing "Authorization: Bearer TOKEN-VALUE" in the HTTP headers.
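In other words, the step above is equivalent to setting the request header yourself:

/* Equivalent call with an explicit Authorization header */
proc http url="https://graph.microsoft.com/v1.0/me/drives/"
     out=resp;
     headers "Authorization"="Bearer &access_token.";
run;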

Retrieve a list of top-level folders/files

With the drive identifier in hand, I can use the /children verb on the Microsoft Graph API to get a list of all of the top-level objects in that drive. These represent the folders and files that are at the root.

/*
 To list the items in the drive, use the /children verb with the drive ID
*/
filename resp TEMP;
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/root/children"
     oauth_bearer="&access_token"
     out = resp;
run;
 
libname jresp json fileref=resp;
 
/* Create a data set with the top-level paths/files in the drive */
data paths;
 set jresp.value;
run;

Here's what I'm keeping in my OneDrive right now. It's not too disorganized, is it?

List the files in a particular folder

If I'm interested in exploring a particular folder, I'll need to find the folder identifier as it's known to OneDrive. Using PROC SQL and SELECT INTO, I can find the folder by its name and store its ID in another macro variable. Then, I use the /children verb again, but this time with the folder ID instead of the "root" constant.

/*
 At this point, if you want to act on any of the items, you just replace "root" 
 with the ID of the item. So to list the items in the "SASGF" folder I have:
  - find the ID for that folder
  - list the items within by using the "/children" verb
*/
 
/* Find the ID of the folder I want */
proc sql noprint;
 select id into: folderId from paths
  where name="SASGF";
quit;
 
filename resp TEMP;
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&folderId./children"
     oauth_bearer="&access_token"
     out = resp;
run;
 
/* This creates a data set of the items in that folder, 
   which might include other folders.
*/
libname jresp json fileref=resp;
data folderItems;
 set jresp.value;
run;

Here are the items from my SASGF folder. Can you tell that I don't throw anything away?

Download a file from OneDrive and import into SAS

I know that I keep a spreadsheet named "sas_tech_talks_18.xlsx" in this SASGF folder. With the /content verb, I can download the file from OneDrive and store it in the file system that is local to my SAS session. Then, I can use PROC IMPORT to read it into a SAS data set.

/*
 With a list of the items in this folder, we can download
 any item of interest by using the /content verb 
*/
 
/* Find the item with a certain name */
proc sql noprint;
 select id into: fileId from folderItems
  where name="sas_tech_talks_18.xlsx";
quit;
 
filename fileout "&config_root./sas_tech_talks_18.xlsx";
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&fileId./content"
     oauth_bearer="&access_token"
     out = fileout;
run;
 
/* Import the first sheet into a SAS data set */
proc import file=fileout 
 out=sasgf
 dbms=xlsx replace;
run;

Boom! I've just downloaded my data from the cloud and brought it into my SAS session.

Add a new file to OneDrive

We can build wonderful documents from SAS too, and it's important to be able to share those. By using the PUT method with the /content verb, we can copy a file from the local SAS session into a target folder on OneDrive. Most often, this will probably be an Excel spreadsheet or maybe a PDF report. (But hey, maybe it's a good opportunity to try out a different ODS destination.)

/*
  We can upload a new file to that same folder with the PUT method and /content verb
  Notice the : after the folderId and the target filename
*/
 
/* Create a simple Excel file to upload */
%let targetFile=iris.xlsx;
filename tosave "%sysfunc(getoption(WORK))/&targetFile.";
ods excel(id=upload) file=tosave;
proc print data=sashelp.iris;
run;
ods excel(id=upload) close;
 
filename details temp;
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&folderId.:/&targetFile.:/content"
  method="PUT"
  in=tosave
  out=details
  oauth_bearer="&access_token";
run;
 
/*
  This returns a json response that describes the item uploaded.
  This step pulls out the main file attributes from that response.
*/
libname attrs json fileref=details;
data newfileDetails (keep=filename createdDate modifiedDate filesize);
 length filename $ 100 createdDate 8 modifiedDate 8 filesize 8;
 set attrs.root;
 filename = name;
 modifiedDate = input(lastModifiedDateTime,anydtdtm.);
 createdDate  = input(createdDateTime,anydtdtm.);
 format createdDate datetime20. modifiedDate datetime20.;
 filesize = size;
run;

Replace/update a file in OneDrive

If you want to replace an existing file, then you'll want to perform the additional step of retrieving the unique ID for that file from OneDrive. When you PUT the new version of the file into place, its history and sharing properties should remain intact. Here is my code for navigating the folder/file structure in my OneDrive and finally replacing an existing file.

/*
  If you want to replace a file instead of making a new file 
  then you need to upload it with the existing file ID.  If you
  don't replace it with the existing ID, some sharing properties
  and history could be lost.
*/
/* Create a simple Excel file to upload */
%let targetFile=iris.xlsx;
filename tosave "%sysfunc(getoption(WORK))/&targetFile.";
ods excel(id=upload) file=tosave;
proc print data=sashelp.iris;
run;
ods excel(id=upload) close;
 
/* Navigate the folder and file IDs from my OneDrive */
proc sql noprint;
 select id into: folderId from paths
  where name="SASGF";
quit;
 
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&folderId./children"
     oauth_bearer="&access_token"
     out = resp;
run;
 
libname jresp json fileref=resp;
data folderItems;
 set jresp.value;
run;
 
/* Find the ID of the existing file */
proc sql noprint;
 select id into: fileId from folderItems
  where name="iris.xlsx";
quit;
 
proc http url="https://graph.microsoft.com/v1.0/me/drives/&driveId./items/&fileId./content"
 method="PUT"
 in=tosave
 out=details
 oauth_bearer="&access_token";
run;

/* As before, the JSON response describes the uploaded item */
libname attrs json fileref=details;

As you can see from my OneDrive history for this file, I've tested this program a few times -- resulting in 23 revisions of this file in its history!

How to get this example code

You can find the source files for these examples on GitHub.

I've organized this code into 4 different files in order to make it easy to reuse:

  • onedrive_config.sas - read the fields from config.json and set them as global macro variables. This includes your client_id and tenant_id.
  • onedrive_setup.sas - the SAS statements that represent code you will need to run just once to get your authorization code and first access token.
  • onedrive_macros.sas - three utility macros that help you to create, refresh, and manage your access token and refresh token in your token.json file
  • onedrive_example_use.sas - sample SAS steps that I used in this article. They won't quite work for you as-is, since you don't have the same files that I do. (Unless you do have the same files, in which case...creepy.) My hope is that you can read and adapt them for your own content.

I also included a template for the config.json file, with obvious placeholders for the client_id and tenant_id that you'll substitute with your own values. You'll also need to change the statements that define &CONFIG_ROOT -- the location of your configuration directory where you're storing these files. I developed these examples in SAS University Edition -- yes, this works there! I also ran the code from my full SAS environment via SAS Enterprise Guide.

More about using REST APIs from SAS

This has been a monster article -- in terms of its length. But I hope it's clear enough to follow and has sufficient detail for you to try this on your own. If you have questions, post in the comments.

I've published a number of other articles about using REST APIs from SAS -- it's one of my favorite things to do. Check out:

The post Using SAS to access and update files on Microsoft OneDrive appeared first on The SAS Dummy.

November 1, 2018
 

This blog post was also written by SAS' Bari Lawhorn.

We have had several requests from customers who want to use SAS® software to automate the download of data from a website when there is no application programming interface (API) to do it. As an example, the Yahoo Finance website recently changed their service to decommission their API, and this generated an interesting challenge for one of our customers. This SAS programmer wanted to download historical stock price data "unattended," without having to click through a web page. While working on this case, we discovered that the Yahoo Finance website requires a cookie-crumb combination to download. To help you automate downloads from websites that do not have an API, this blog post takes you through how we used the DEBUG feature of PROC HTTP to achieve partial automation, and eventually full automation with this case.

Partial automation

To access the historical data for Apple stock (symbol: AAPL) on the Yahoo Finance website, we use this URL: https://finance.yahoo.com/quote/AAPL/history?p=AAPL

We click Historical Data --> Download Data and get a CSV file with historical stock price data for Apple. We could save this CSV file and read it into SAS. But, we want a process that does not require us to click in the browser.

Because we know the HTTP procedure, we right-click Download Data and then select Copy link address as shown from a screen shot using the Google Chrome browser below:

Note: The context menu that contains Copy link address looks different in each browser.

Using this link address, we expect to get a direct download of the data into a CSV file (note that your crumb= will differ from ours):

filename out "c:\temp\aapl.csv";
 
proc http
 url='https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1535399845&period2=1538078245&interval=1d&events=history&crumb=hKubrf50i1P'
 method="get" out=out;
run;

However, the above code results in the following log message:

NOTE: PROCEDURE HTTP used (Total process time):
real time           0.25 seconds
cpu time            0.14 seconds
 
NOTE: 401 Unauthorized

When we see this note, we know that the investigation needs to go further.

filename out "c:\temp\aapl.csv";
proc http
 url='https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1535399845&period2=1538078245&interval=1d&events=history&crumb=hKubrf50i1P'
 method="get" out=out;
 debug level=3;
run;

When we run the code, here's what we see in the log (snipped for convenience):

> GET
/v7/finance/download/AAPL?period1=1535399845&period2=1538078245&interval=1d&events=history&crumb=h
Kubrf50i1P HTTP/1.1
 
> User-Agent: SAS/9
> Host: query1.finance.yahoo.com
> Accept: */*
> Connection: Keep-Alive
> Cookie: B=fpd0km1dqnqg3&b=3&s=ug
> 
< HTTP/1.1 401 Unauthorized
< WWW-Authenticate: crumb
< Content-Type: application/json;charset=utf-8
 
…more output…
 
< Strict-Transport-Security: max-age=15552000
 
…more output…
 
{    "finance": {        "error": {            "code": "Unauthorized",
"description": "Invalid cookie"        }    }}
NOTE: PROCEDURE HTTP used (Total process time):
      real time           0.27 seconds
      cpu time            0.15 seconds
 
NOTE: 401 Unauthorized

The log snippet reveals that we did not provide the Yahoo Finance website with a valid cookie. It is important to note that the response header for the URL shows crumb for the authentication method (the line that shows WWW-Authenticate: crumb). A little web research helps us determine that the Yahoo site wants a cookie-crumb combination, so we need to also provide the cookie. But, why did we not need this step when we were using the browser? We used a tool called Fiddler to examine the HTTP traffic and discovered that the cookie was cached when we first clicked in the browser on the Yahoo Finance website:

Luckily, starting in SAS® 9.4M3 (TS1M3), PROC HTTP will set cookies and save them across HTTP steps if the response contains a "set-cookie:  <some cookie>" header when it successfully connects to a URL. So, we try this download in two steps. The first step does two things:

  • PROC HTTP sets the cookie for the Yahoo Finance website.
  • It adds the DEBUG statement so that we can obtain the crumb value from the log.
filename out "c:\temp\Output.txt";
 
filename hdrout "c:\temp\Response.txt";
 
proc http
 out=out
 headerout=hdrout
 url="https://finance.yahoo.com/quote/AAPL/history?p=AAPL"
 method="get";
 debug level=3;
run;

Here's our log snippet showing the set-cookie header and the crumb we copy and use in our next PROC HTTP step:

…more output…
< set-cookie: B=2ehn8rhdsf5r2&b=3&s=fe; expires=Wed, 17-Oct-2019 20:11:14 GMT; path=/;
domain=.yahoo.com
 
…more output…
 
Initialized"},"account-switch-uh-0-AccountSwitch":{"status":"initialized"}}},"CrumbStore":{"crumb":
"4fKG9lnt5jw"},"UserStore":{"guid":"","login":"","alias":"","firstName":"","comscoreC14":-1,"isSig

The second step uses the cookie that PROC HTTP cached from Yahoo Finance and, in combination with the full link that includes the appropriate crumb value (from "CrumbStore"), downloads the CSV file into our c:\temp directory.

filename out "c:\temp\aapl.csv";
 
proc http
 url='https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1535399845&period2=1538078245&interval=1d&events=history&crumb=4fKG9lnt5jw'
 method="get"
 out=out;
run;

With the cookie value in place, our download attempt succeeds!

Here is our log snippet:

31
32   proc http
33       out=data
34       headerout=hdrout2
35       url='https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1534602937&peri
35 ! od2=1537281337&interval=1d&events=history&crumb=4fKG9lnt5jw'
36       method="get";
37   run;
 
NOTE: PROCEDURE HTTP used (Total process time):
      real time           0.37 seconds
      cpu time            0.17 seconds
 
NOTE: 200 OK

Full automation

This partial automation requires us to visit the website and right-click on the download link to get the URL. There’s nothing streamlined about that, and SAS programmers want full automation!

So, how can we fully automate the process? In this section, we'll share a "recipe" for how to get the crumb value -- a value that changes with each transaction. To get the current crumb, we use the first PROC HTTP statement to "screen scrape" the URL and to cache the cookie value that comes back in the response. In this example, we store the first response in the Output.txt file, which contains all the content from the page:

filename out "c:\temp\Output.txt";
filename hdrout "c:\temp\Response.txt";
 
proc http 
    out=out 
    headerout=hdrout
    url="https://finance.yahoo.com/quote/AAPL/history?p=AAPL" 
    method="get";
run;

It is a little overwhelming to examine the web page in its entirety. And the HTML page contains some very long lines, some of them over 200,000 characters long! However, we can still use the SAS DATA step to parse the file and retrieve the text or information that might change on a regular basis, such as the crumb value.

In this DATA step we read chunks of the text data and scan the buffer for the "CrumbStore" keyword. Once found, we're able to apply what we know about the text pattern to extract the crumb value.

data crumb (keep=crumb);
  infile out  recfm=n lrecl=32767;
  /* the @@ directive says DON'T advance pointer to next line */
  input txt: $32767. @@;
  pos = find(txt,"CrumbStore");
  if (pos>0) then
    do;
      crumb = dequote(scan(substr(txt,pos),3,':{}'));
      /* cookie value can have unicode characters, so must URLENCODE */
      call symputx('getCrumb',urlencode(trim(crumb)));
      output;
    end;
run;
 
%put &=getCrumb.;

Example result:

 102        %put &=getCrumb.;
 GETCRUMB=PWDb1Ve5.WD

We feel so good about finding the crumb, we're going to treat ourselves to a whole cookie. Anybody care for a glass of milk?

Complete Code for Full Automation

The following code brings it all together. We also added a PROC IMPORT step and a bonus highlow plot to visualize the results. We've adjusted the file paths so that the code works just as well on SAS for Windows or Unix/Linux systems.

/* use WORK location to store our temp files */
filename out "%sysfunc(getoption(WORK))/output.txt";
filename hdrout "%sysfunc(getoption(WORK))/response1.txt";
 
/* This PROC step caches the cookie for the website finance.yahoo.com */
/* and captures the web page for parsing later                        */
proc http 
  out=out
  headerout=hdrout
  url="https://finance.yahoo.com/quote/AAPL/history?p=AAPL" 
  method="get";
run;
 
/* Read the response and capture the cookie value from     */
/* the CrumbStore field.                                   */
/* The file has very long lines, longer than SAS can       */
/* store in a single variable.  So we read in <32k chunks. */
data crumb (keep=crumb);
  infile out  recfm=n lrecl=32767;
  /* the @@ directive says DON'T advance pointer to next line */
  input txt: $32767. @@;
  pos = find(txt,"CrumbStore");
  if (pos>0) then
    do;
      crumb = dequote(scan(substr(txt,pos),3,':{}'));
      /* cookie value can have unicode characters, so must URLENCODE */
      call symputx('getCrumb',urlencode(trim(crumb)));
      output;
    end;
run;
 
%put &=getCrumb.;
 
filename data "%sysfunc(getoption(WORK))/data.csv";
filename hdrout2 "%sysfunc(getoption(WORK))/response2.txt";
 
proc http 
    out=data 
    headerout=hdrout2
    url="https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1535835578%str(&)period2=1538427578%str(&)interval=1d%str(&)events=history%str(&)crumb=&getCrumb."
    method="get";
run;
 
proc import
 file=data
 out=history
 dbms=csv
 replace;
run;
 
proc sgplot data=history;
  highlow x=date high=high low=low / open=open close=close;
  xaxis display=(nolabel) minor;
  yaxis display=(nolabel);
run;


Disclaimer: As we've seen, Yahoo Finance could change their website at any time, so the URLs in this blog post might not be accurate at a later date. Note that, as of the time of this writing, the above code runs error-free with Base SAS 9.4M5. And it also works in SAS University Edition and SAS OnDemand for Academics!

How to automate a data download with PROC HTTP was published on SAS Users.

July 3, 2018
 

At SAS, we love data. Data is central to our corporate vision: to transform a world of data into a world of intelligence. We're also famous for enjoying M&Ms, but to us they are more than a sweet snack. They're also another source of data.

My colleague Pete Privitera, with a team of like-minded "makers," built a device that they named SnackBot. SnackBot is an internet-connected sensor that measures the flow of M&Ms in a particular SAS break room. There's a lot to love about this project. You can learn more by watching its origin story in this video:

As the number of M&Ms changes, SnackBot takes a reading and records the M&M count in a database. Most readings reflect a decrease in candy pieces, as my colleagues help themselves to a treat. But once per week, the reading shows a drastic increase -- as our facilities staff restocks the canister. SnackBot has its own website. It also has its own API, and you know what that means (right?). It means that we can use SAS to read and analyze the sensor data.

Reading sensor data into SAS with the SnackBot API

The SnackBot system offers a REST API with a JSON data response. Like any REST API, we can use PROC HTTP to fetch the data, and the JSON library engine to parse the response into a SAS data set.

%let start = '20MAY2018:0:0:0'dt;
 
/* format the start/end per the API needs */
%let start_time= %sysfunc(putn(&start.,is8601dt26.));
%let end_time=   %sysfunc(datetime(),is8601dt26.);
 
/* Call the SnackBot API from snackbot.net */
filename resp temp;
proc http
  method="GET"
  url="http://snackbot.net/snackdata?start_time=&start_time.%str(&)end_time=&end_time.%str(&)utc_offset_minutes=-240"
  out=resp;
run;
 
/* JSON libname engine to read the result       */
/* Simple record layout, all in the ROOT member */
libname mms json fileref=resp;
data mmlevels;
  set mms.root;
run;

I've written about how to use SAS with REST APIs in several other blog posts, so I won't dwell on this part of the process here. This short program retrieves the raw data from SnackBot, which represents a series of M&M "levels" (count of remaining pieces) and a timestamp for each measurement. It's a good start. Though there are only two fields to work with here, there's quite a bit we can do with these data.

raw SnackBot data

Add features to the raw sensor data

With a few additional DATA step statements and some built-in SAS formats, we can derive several interesting characteristics of these data for use in further analysis.

First, we need to convert the character-formatted datetime field to a proper SAS datetime value. That's easily achieved with the INPUT function and the ANYDTDTM informat. (Rick Wicklin wrote a helpful article about how the ANYDT* informats work.)

data mmlevels;
  set mms.root;
  drop ordinal_root timestamp;
  /* Convert the TIMESTAMP field to native value -- it's a character */
  datetime = input(timestamp, anydtdtm.);
  date = datepart(datetime);
  time = timepart(datetime);
  dow = date;
  qhour = round(datetime,'0:15:0'T);
  format  datetime datetime20. 
          qhour datetime20.
          date date9.
          time timeampm10.
          dow downame.;
run;

For convenience, I duplicated the datetime value a few times and applied different formats so we can get different views of the same value: datetime, just the date, just the time-of-day, and the day-of-week. I also used the ROUND function to "round" the raw datetime value to the nearest quarter hour. I'll explain why I've done that in a later step, but the ROUNDing of a time value is one of the documented unusual uses of the ROUND function.
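Here's a quick check of what that rounding does:

/* Round a datetime to the nearest quarter hour ('0:15:0'T = 900 seconds) */
data _null_;
  dt = '20MAY2018:11:08:42'dt;
  qhour = round(dt, '0:15:0'T);
  put dt= datetime20. qhour= datetime20.;  /* qhour=20MAY2018:11:15:00 */
run;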

SnackBot data with features

Even with this small amount of data preparation, we can begin to analyze the characteristics of these data. For example, let's look at the descriptive stats for the data classified by day-of-week:

title "SnackBot readings per day-of-week";
proc means data=mmlevels mean stddev max min;
 var pieces;
 class dow;
run;

SnackBot by day of week

The "N Obs" column shows the number of measurements taken over the entire "study period" broken down by day-of-week. If a measurement is a proxy for a "number-of-pieces-changed" event, then we can see that most events happen on Wednesday, Thursday, and Friday. From this, can you guess which day the M&M canister is refilled?

Let's take another slice through these data, but this time looking at time-of-day. For this, I used PROC FREQ to count the measurements by hour. I applied the HOUR2. format, which allows the SAS procedure to group these data into hour-long intervals with no need for additional data prep. (I've written previously about how to use SAS formats to derive new categories without expensive data rewriting.) Then I used PROC SGPLOT to produce a step plot for the 24-hour cycle.

/* Count of readings per hour of the day */ 
title "SnackBot readings per hour";
proc freq data=mmlevels ;
 table time / out=perhour;
 format time hour2.;
run;
 
ods graphics / height=400 width=800;
 
title "SnackBot readings per hour";
proc sgplot data=perhour des="Readings per hour of day";
 step x=time y=count;
 xaxis min='0:0:0't max='24:0:0't label="Time of day" grid;
 yaxis label="Servings";
run;

SnackBot hour step

From the chart, we can see that most M&M "events" happen around 11am, and then again between 2pm and 4pm. From personal experience, I can confirm that those are the times when I hear the M&Ms calling to me.

Expand the time series to regular intervals

The SnackBot website can tell you how many M&Ms are remaining right now. But what if you want to know how many were remaining last Friday? Or on any typical Monday morning?

The sensor data that we've analyzed so far is sparse -- that is, there are data entries for each "change" event, but not for every discrete time interval in the study period. I don't know how the SnackBot sensor records its readings -- it might sample the M&M levels every minute, or every second. Regardless, the API reports (and probably stores) only the records that represent a change. If SnackBot records that the final 24 pieces were depleted at 25JUN2018:07:45:00 (a Monday morning) bringing the count to 0, how many M&Ms remain at 1pm later that day? The data don't tell us explicitly with a recorded reading. But we can assume at that point that the count was still 0. The next recorded reading occurs at 27JUN2018:10:30:00 (on a Wednesday, bringing the count to 1332 -- oh joy!).

If we want to create a useful time series visualization of the M&M candy counts over time, we need to expand the time series from these sparse recordings to regular intervals. SAS offers a few sophisticated time series procedures to accomplish this: PROC EXPAND, PROC TIMESERIES, and PROC TIMEDATA. Each of these offer powerful econometrics methods for interpolation and forecasting -- and that's more than we need for this situation. For my example, I took a more low-tech approach.

First, I created an empty data set with datetime entries at quarter-hour intervals, covering the study period of the data we're looking at.

/* Empty data set with 15 minute interval slots    */
/* Regular intervals for the entire "study" period */
data timeslots;
  last = datetime();
  length qhour 8;
  format qhour datetime20.;
  drop last i;
  do i = &start. to last by '0:15:00't;
    qhour = i;
    output;
  end;
run;

Then I used a DATA step to merge these empty slots with the actual event data that I had rounded to the nearest quarter hour (remember that?):

/* Merge the sample data with the timeslots */
data expand;
  merge mmlevels(keep=pieces qhour) timeslots;
  by qhour;
run;

Finally, I used a variation of a last-observation-carried-forward (LOCF) approach to fill in the remaining empty slots. If a reading at 20MAY2018:11:15:00 reports 132 pieces remaining, then that value should be RETAINed for each 15-minute slot until the next reading at 20MAY2018:17:30:00. (That reading is 82 pieces -- meaning somebody helped themselves to 50 pieces. Recommended serving size for plain M&Ms is 20 pieces, but I'm not passing judgement.) I also recorded a text value for the day-of-week to help with the final visualization.

/* for empty timeslots, carry the sample data   */
/* forward, so we always have a count of pieces */
/* Variation on a LOCF technique                */
data final;
  set expand;
  length day $ 3;
  /* 3-char value for day of week */
  day=put(datepart(qhour),weekdate3.);
  retain hold;
  if not missing(pieces) then
    hold=pieces;
  else pieces=hold;
  drop hold;
  if not missing(pieces);
run;

Now I have data that represents the regular intervals that we need.

SnackBot regular intervals

Putting it all together

For my final visualization, I created a series plot for the study period. It shows the rise and fall of M&Ms levels in one SAS break room over several weeks. For additional "color", I annotated the plot with a block chart to delineate the days of the week.

title 'Plain M&M pieces on S1 tracked by SnackBot';
ods graphics / height=300 width=1600;
 
proc sgplot data=final des='M&M pieces tracked by SnackBot';
 
  /* plot the data as a series */ 
  series x=qhour y=pieces / lineattrs=(color=navy thickness=3px);
 
  /* Yes, these are the "official" M&M colors               */
  /* Will be applied in data-order, so works best when data */
  /* begins on a Sunday                                     */
  styleattrs datacolors=(red orange yellow green blue CX593B18 red);
  /* block areas to indicate days-of-week                   */
  block x=qhour block=day / transparency=0.65
    valueattrs=(weight=bold size=10pt color=navy);
 
  xaxis minor display=(nolabel);
  yaxis display=(nolabel) grid max=1600 minor;
run;

You can see the pattern. M&Ms are typically filled on Wednesday to the canister capacity of about 1400 pieces. We usually enter into the weekend with 0 remaining, but there are exceptions. The week of May 27 was our Memorial Day holiday, which explains the lack of activity on Monday (and even Tuesday) during that week as SAS folks took advantage of a slow week with their vacation plans.

SnackBot visualization

More about SAS and M&Ms data

You can download the complete code for this example from my public Gist on GitHub. The example code should work with SAS University Edition and SAS OnDemand for Academics, as well as with any SAS environment that can reach the internet with PROC HTTP.

For more M&M data fun, check out Rick Wicklin's article about the distribution of colors in plain M&Ms. SnackBot does not (yet) report on how many and which color of M&Ms are taken per serving, but using statistics, we can predict that!

The post The Internet of Snacks: SnackBot data and what it reveals about SAS life appeared first on The SAS Dummy.

January 24, 2018
 

Using SAS with REST APIs is fun and rewarding, but it's also complicated. When you're dealing with web services, credentials, data parsing and security, there are a lot of things that can go wrong. It's useful to have a simple program that verifies that the "basic plumbing" is working before you try to push a lot of complex coding through it.

I'm gratified that many of my readers are able to adapt my API examples and achieve similar results. When I receive a message from a frustrated user who can't get things to work, the cause is almost always one of the following:

  • Not using a recent enough version of SAS. PROC HTTP was revised and improved in SAS 9.4 Maint 3. The JSON library engine was added in SAS 9.4 Maint 4 (released in 2016). It's time to upgrade!
  • Cannot access the web service through a corporate firewall, or SSL certificates are not set up. Specify the certificates with the SSLCALISTLOC= option. If you're using SAS University Edition, you'll need the update from December 2017 or later -- that's where SSL support was added. The editions prior to that did not include it.

The following SAS program is a simple plumbing test. It uses a free HTTP test service (httpbin.org) to verify your Internet connectivity from SAS and your ability to use SSL. The endpoint returns a JSON-formatted collection of timestamps in various formats, which the program parses using the JSON library engine. I have successfully run this program from my local SAS on Windows, from SAS University Edition (Dec 2017 release), and from SAS OnDemand for Academics (using SAS Studio).

If you can run this program successfully from your SAS session, then you're ready to attempt the more complex REST API calls. If you encounter any errors while running this simple test, then you will need to resolve these before moving on to the really useful APIs. (Like maybe checking on who is in space right now...)

/* PROC HTTP and JSON libname test          */
/* Requires SAS 9.4m4 or later to run       */
/* SAS University Edition Dec 2017 or later */
 
/* utility macro to echo the JSON out */
%macro echoFile(fn=);
  data _null_;
   infile &fn end=_eof;
   input;
   put _infile_;
  run;
%mend;
 
filename resp "%sysfunc(getoption(WORK))/now.json";
proc http
 url="https://now.httpbin.org/"
 method="GET"
 out=resp;
run;
 
/* Supported with SAS 9.4 Maint 5 */
/*
 %put HTTP Status code = &SYS_PROCHTTP_STATUS_CODE. : &SYS_PROCHTTP_STATUS_PHRASE.; 
*/
 
%echoFile(fn=resp);
 
/* Tell SAS to parse the JSON response */
libname time JSON fileref=resp;
 
title "JSON library structure";
proc datasets lib=time;
quit;
 
/* interpret the various datetime vals and convert to SAS */
data values (keep=from:);
 length from_epoch 8
        from_iso8601 8
        from_rfc2822 8
        from_rfc3339 8;
 
 /* Apply the DT format to a range of vars */
 format from: datetime20.;
 set time.now;
 
 /* epoch = # seconds since 01Jan1970        */
 /* SAS datetime is based on 01Jan1960, so   */
 /* add the offset here for a SAS value      */
 from_epoch = epoch + '01Jan1970:0:0:0'dt;
 
 /* straight conversion to ISO8061           */
 from_iso8601 = input(iso8601,is8601dt.);
 
 /* trim the leading "day of week"      */
 from_rfc2822 = input(substr(rfc2822,5),anydtdtm21.);
 
 /* allow SAS to figure it out */
 from_rfc3339 = input(rfc3339,anydtdtm.);
 
 run;
 
title "Raw values from the JSON response";
proc print data=time.now (drop=ord:);
run;
 
title "Formatted values as SAS datetime";
proc print data=values;
run;
 
libname time clear;
filename resp clear;

The post How to test PROC HTTP and the JSON library engine appeared first on The SAS Dummy.

January 17, 2018
 

I've used SAS with a bunch of different REST APIs: GitHub, Brightcove, Google Analytics, Lithium, LinkedIn, and more. For most of these I have to send user/password or "secret" application tokens to the web service so that it knows who I am and what data I can retrieve. I do not want to keep this secret information in my SAS program files -- that would be a bad idea. If my credentials were part of the program -- even if they were obfuscated and not stored in clear text -- then anyone who managed to get a copy of my program could run it. And they could gain access to my data, as if they were me.

I've written about this topic for SAS-related passwords. In this article, I'll share the approach that I use for API credentials and tokens.

REST APIs: Each service requires different types of secrets

My REST API services don't require just simple user ID and password combos. It depends on the API, but usually the information is in the form of one or more tokens that I've generated using the vendor's developer console, or perhaps that have been granted by an administrator.

For example, to access the Google Analytics API, I need three things: a client ID, a client secret token, and a valid "refresh" token. I can send these three items to the Google OAuth2 API, and in return I'll receive a live "access" token that I can use to request my data. I think of this like checking into a hotel. I show my ID and a credit card at the front desk, and in exchange I receive a room key. Just like my hotel room key, the access token doesn't last forever and cannot be reused on my next visit.

Other APIs are simpler and require just a single token that never expires. That's more like a house key -- it's mine to use forever, or until someone decides to change the locks.

Whether a static token or a token-for-token exchange, I don't want to just leave these keys lying around for just anyone to find and use.

Hide your tokens in a file that only you can read

My simple approach is to store the token values in a text file within my home directory. Then, I change the permissions on the file such that only my account can read it. Whether I submit my program interactively (in SAS Enterprise Guide or SAS Studio) or as a scheduled batch job, it's running under my account. I'm showing the instructions here for UNIX/Linux, but Windows users can accomplish something similar with Windows permissions.

On Linux, I've used the chmod command to specify the octal value that says "only the owner can read/write." That's "chmod 600 filename". The "ls -l" command shows that this permissions mask has been applied.

chmod 600 ./.google_creds.csv
ls -l ./.google_creds.csv
> -rw------- 1 myid mygroup 184 Jan 15 12:41 ./.google_creds.csv
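
On Windows, a rough equivalent (a hypothetical example -- adjust the file path and account name for your environment) is to remove inherited permissions and grant read access only to your own account:

icacls .google_creds.csv /inheritance:r /grant:r "%USERNAME%":R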

I stored my tokens in a standard CSV format because it's easy for SAS to read and it's easy for me to read if I ever need to change it.
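
For reference, the file is just a header row plus one record with the three values, which matches the FIRSTOBS=2 read in the DATA step below. These sample values are made up:

client_id,client_secret,refresh_token
my-id.apps.googleusercontent.com,my-client-secret,my-refresh-token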

Use INFILE to read the tokens dynamically

With this critical data now stored externally, and the file permissions in place, I can use SAS to read the credentials/tokens within my program and store the values in SAS macro variables. In the following SAS program, I assigned a macro variable to my user root folder. Since I might run this program on Linux or Windows, I used this trick to determine the proper path notation. I also used the &SYSUSERID macro variable to make my program more portable. If I want to supply this program to any colleagues (or to you!), the only thing that's needed is to create and store the token CSV files in the proper location.

/* My path is different for UNIX vs Windows */
%let authpath = %sysfunc(ifc(&SYSSCP. = WIN,
	 \\netshare\root\u\&sysuserid.,
	 /u/&sysuserid.));
 
/* This should be a file that only YOU or trusted group members can read */
/* Use "chmod 0600 filename" in UNIX environment */
/* "dotfile" notation is convention for on UNIX for "hidden" */
filename auth "&authpath./.google_creds.csv";
 
/* Read in the secret account keys from another file */
data _null_;
 infile auth firstobs=2 dsd delimiter=',' termstr=crlf;
 length client_id $ 100 client_secret $ 30 refresh_token $ 60;
 input client_id client_secret refresh_token;
 call symputx('client_id',client_id);
 call symputx('client_secret',client_secret);
 call symputx('refresh_token',refresh_token);
run;

When I run this code in my production job, I can see the result:

NOTE: The infile AUTH is:
      Filename=/u/myid/.google_creds.csv,
      Owner Name=myid,Group Name=mygroup,
      Access Permission=rw-------,
      Last Modified=Mon Jan 15 12:41:58 2018,
      File Size (bytes)=184

NOTE: 1 record was read from the infile AUTH.
      The minimum record length was 145.
      The maximum record length was 145.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      user cpu time       0.01 seconds

For this example, my next step is to call the Google API to get my access token. I'll use the macro variables that my program created with CALL SYMPUTX to build the proper API call.

/* Call Google API to exchange the refresh token for an active access token */
%let oauth2=https://www.googleapis.com/oauth2/v4/token;
filename rtoken temp;
proc http
 method="POST"
 url="&oauth2.?client_id=&client_id.%str(&)client_secret=&client_secret.%str(&)grant_type=refresh_token%str(&)refresh_token=&refresh_token."
 out=rtoken;
run;
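
The response in RTOKEN is a small JSON object. Here's a sketch of one way to pull the token value into a macro variable, using the JSON libname engine. It assumes the response contains a top-level access_token field, as Google's OAuth2 responses typically do:

/* A sketch: extract access_token from the OAuth2 JSON response */
/* Assumes a top-level access_token field in the response       */
libname rt JSON fileref=rtoken;
data _null_;
 set rt.root;  /* top-level scalar fields land in the ROOT member */
 call symputx('access_token', access_token);
run;
libname rt clear;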

See the full explanation of this Google Analytics example in this article.

The post How to secure your REST API credentials in SAS programs appeared first on The SAS Dummy.

December 5, 2017
 

The internet is rich with data, and much of that data seems to exist only on web pages, which -- for some crazy reason -- are designed for humans to read. When students/researchers want to apply data science techniques to collect and analyze that data, they often turn to "data scraping." What is "data scraping?" I define it as using a program to fetch the contents of a web page, sift through its contents with data parsing functions, and save its information into data fields with a structure that facilitates analysis.

Python and R users have their favorite packages that they use for scraping data from the web. New SAS users often ask whether there are similar packages available in the SAS language, perhaps not realizing that Base SAS is already well suited to this task -- no special bundles necessary.

The basic steps for data scraping are:

  1. Fetch the contents of the target web page
  2. Process the source content of the page -- usually HTML source code -- and parse/save the data fields you need.
  3. If necessary, repeat for subsequent pages. This applies to those web sites that serve up lots of information in paginated form, and you want to collect all available pages of data.

Let's map these steps to the SAS programming language:

  • Get the contents of the web page: FILENAME URL or PROC HTTP
  • Process/parse the web page contents: DATA step, with parsing functions such as FIND and regular expressions via PRXMATCH. Use SAS informats to convert text to native data types.
  • Repeat across subsequent pages: SAS macro language (%DO loop processing) or DATA step logic.

In this article, I'll work through an example from the SAS Support Communities, where a user asked about scraping data from the Centers for Disease Control (CDC) web site.

Step 0: Find the original data source and skip the scrape

I'm writing this article at the end of 2017, and at this point in our digital evolution, web scraping seems like a quaint pastime. Yes, there are still some cases for it, which is why I'm presenting this article. But if you find information that looks like data on a web page, then there probably is a real data source behind it. And with the movement toward open data (especially in government) and data-driven APIs (in social media and commercial sites) -- there is probably a way for you to get to that data source directly.

In my example page for this post, the source page is hosted by CDC.gov, a government agency whose mandate is to share information with other public and private entities. As one expert pointed out, the CDC shares a ton of data at its dedicated data.cdc.gov site. Does this include the specific table the user wanted? Maybe -- I didn't spend a lot of time looking for it. However, even if the answer is No -- it's a simple process to ask for it. Remember that web sites are run by people, and usually these people are keen to share their information. You can save effort and achieve a more reliable result if you can get the official data source.

If you find an official data source to use instead, you might still want to automate its collection and import into SAS -- I've covered techniques for doing that in other articles on this blog.

Web scraping is a lossy, fragile process. The information on the web page does not include metadata such as data types, lengths, or constraints. And one tweak to the presentation of the web page can break any automated scraping process. If you have no other alternative and you're willing to accept these limitations, let's proceed to Step 1.

Step 1: Fetch the web page

In this step, we want to achieve the equivalent of the "Save As..." function from your favorite web browser. Just like your web browser, SAS can act as an HTTP client. The two most popular methods are:

  • FILENAME URL, which assigns a SAS fileref to the web page content. You can then use INFILE and INPUT to read that file from a DATA step.
  • PROC HTTP, which requires a few more lines of code to get and store the web page content in a single SAS procedure step.

I prefer PROC HTTP, and here's why. First, as a separate, explicit step it's easy to run just once and then work with the file result over the remainder of your program -- you're guaranteed to fetch the file only once. Though it requires a bit more code, it makes clearer what happens under the covers. Next reason: PROC HTTP was refined considerably in SAS 9.4 to run fast and efficiently. Its many options make it a versatile tool for all types of internet interactions, so it's a good technique to learn -- you'll find many more uses for it.

Here's my code for getting the web page for this example:

/* Could use this, might be slower/less robust */
* filename src url "https://wwwn.cdc.gov/nndss/conditions/search/";
 
/* I prefer PROC HTTP for speed and flexibility */
filename src temp;
proc http
 method="GET"
 url="https://wwwn.cdc.gov/nndss/conditions/search/"
 out=src;
run;
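
If you're running SAS 9.4 maint 3 or later, PROC HTTP populates automatic macro variables with the response status. This small defensive check isn't part of the original example, but it's cheap insurance before you try to parse the file:

/* Verify the request succeeded before parsing the response */
%put HTTP status: &SYS_PROCHTTP_STATUS_CODE. &SYS_PROCHTTP_STATUS_PHRASE.;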

Step 2: Parse the web page contents

Before you can write code that parses a web page, it helps to have some idea of how the page is put together. Most web pages are HTML code, and the data that is "locked up" within them are expressed in table tags: <table>, <tr>, <td> and so on. Before writing the first line of code, open the HTML source in your favorite text editor and see what patterns you can find.

For this example from the CDC, the data we're looking for is in a table that has 6 fields. Fortunately for us, the HTML layout is very regular, which means our parsing code won't need to worry about lots of variation and special cases.

 <tr >
	<td style="text-align:left;vertical-align:middle;">
    	<a href="/nndss/conditions/chancroid/">
		Chancroid
		</a>
	</td>
	<td class="tablet-hidden"></td>
	<td class="tablet-hidden"></td>
	<td class="tablet-hidden"><i>Haemophilus ducreyi</i></td>
	<td  >1944 </td>
	<td  > Current</td>
</tr>

The pattern is simple. We can find a unique "marker" for each entry by looking for the "/nndss/conditions" reference. The line following (call it offset-plus-1) has the disease name. And the lines that are offset-plus-7 and offset-plus-8 contain the date ranges that the original poster was asking for. (Does this seem sort of rough and "hacky?" Welcome to the world of web scraping.)

Knowing this, we can write code that reads the HTML source line by line, finds each marker line that contains "/nndss/conditions/", and then uses the POINT= option on the SET statement to "read ahead" and pick up the values at the known offsets.

Here's the code that I came up with, starting with what the original poster shared:

/* Read the entire file and skip the blank lines */
/* the LEN indicator tells us the length of each line */
data rep;
infile src length=len lrecl=32767;
input line $varying32767. len;
 line = strip(line);
 if len>0;
run;
 
/* Parse the lines and keep just condition names */
/* When a condition code is found, grab the line following (full name of condition) */
/* and the 8th line following (Notification To date)                                */
/* Relies on this page's exact layout and line break scheme */
data parsed (keep=condition_code condition_full note_to);
 length condition_code $ 40 condition_full $ 60 note_to $ 20;
 set rep;
 if find(line,"/nndss/conditions/") then do;
   condition_code=scan(line,4,'/');
   pickup= _n_+1 ;
   pickup2 = _n_+8;
   /* "read ahead" one line */
   set rep (rename=(line=condition_full)) point=pickup;
   condition_full = strip(condition_full);
 
   /* "read ahead" 8 lines */
   set rep (rename=(line=note_to)) point=pickup2;
   /* use SCAN to get just the value we want */
   note_to = strip(scan(note_to,2,'<>'));
 
   /* write this record */
   output;
  end;
run;

The result still needs some cleanup -- there are a few errant data lines in the data set. However, it should be simple to filter out the records that don't make sense with some additional conditions. I left this as a to-do exercise for the original poster.
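
As a starting point, here's one possible filter. This is a sketch, not from the original post -- the exact conditions depend on which errant rows show up in your run:

/* Hypothetical cleanup pass: keep rows where the "to" value is */
/* "Current" or is all digits (a year such as 1944)             */
data parsed_clean;
 set parsed;
 if note_to = 'Current' or notdigit(strip(note_to)) = 0;
run;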

Step 3: Loop and repeat, if necessary

For this example with the CDC data, we're done. All of the data that we need appears on a single page. However, many web sites use a pagination scheme to break the data across multiple pages. This helps the page load faster in the browser, but it's less convenient for greedy scraping applications that want all of the data at once. For web sites that paginate, we need to repeat the process to fetch and parse each page that we need.

Web sites often use URL parameters such as "page=" to indicate which page of results to serve up. Once you know the URL parameter that controls this offset, you can create a SAS macro program that iterates through all of the pages that you want to collect. There are many ways to accomplish this, but I'll share just one here -- and I'll use a special page from the SAS Support Communities (communities.sas.com) as a test.

Aside from the macro looping logic, my code uses two tricks that might be new to some readers: the DLCREATEDIR system option, which creates the folder for a LIBNAME path if it doesn't already exist, and the wildcard feature of INFILE, which reads all matching files in a single DATA step.

/* Create a temp folder to hold the HTML files */
options dlcreatedir;
%let htmldir = %sysfunc(getoption(WORK))/html;
libname html "&htmldir.";
libname html clear;
 
/* collect the first few pages of results of this site */
%macro getPages(num=);
 %do i = 1 %to &num.;
   %let url = https://communities.sas.com/t5/custom/page/page-id/activity-hub?page=&i.;
   filename dest "&htmldir./page&i..html";
   proc http 
      method="GET"
	  url= "&url."
	  out=dest;
   run;
 %end;
%mend;
 
/* How many pages to collect */
%getPages(num=3);
 
/* Use the wildcard feature of INFILE to read all */
/* matching HTML files in a single step */
data results;
 infile "&htmldir./*.html" length=len lrecl=32767;
 input line $varying32767. len ;
 line = strip(line);
 if len>0;
run;

The result of this program is one long data set that contains the results from all of the pages -- that is, all of the lines of HTML code from all of the web pages. You can then use a single DATA step to post-process and parse out the data fields that you want to keep. Like magic, right?

Where to learn more

Each year there are one or two "web scraping" case studies presented at SAS Global Forum. You can find them easily by searching on the keyword "scrape" from the SAS Support site or from lexjansen.com.

How JMP 13 sees the CDC page

If you have access to JMP software from SAS, you could try the File->Internet Open... feature. It's useful for a one-time, ad-hoc parsing step that you can later capture into a JSL script for future use. With Internet Open you can specify a URL and select to open it "as HTML," and then JMP will offer a selection of available tables to import as data. This is definitely worth a try if you have JMP on your desktop, as it can lend some insight about the structure of the page you're trying to scrape.

JMP 13 version of the data, imported

The post How to scrape data from a web page using SAS appeared first on The SAS Dummy.