Chris Hemedinger

July 7, 2021
 

When I was a computer science student in the 1980s, our digital alphabet was simple and small. We could express ourselves with the letters A..Z (and lowercase a..z) and numbers (0..9) and a handful of punctuation and symbols. Thanks to the ASCII standard, we could represent any of these characters in a single byte (actually just 7 bits). This allowed for a generous 128 different characters, and we had character slots to spare. (Of course for non-English and especially non-latin characters we had to resort to different code pages...but that was before the Internet forced us to work together. Before Unicode, we lived in a digital Tower of Babel.)

Even with the limited character set, pictorial communication was possible with ASCII through the fun medium of "ASCII art." ASCII art is basically the stone-age version of emojis. For example, consider the shrug emoji: 🤷

Its ASCII-art ancestor is this: ¯\_(ツ)_/¯. While ASCII art currently enjoys a retro renaissance, the emoji has become indispensable in our daily communications.

Emojis before Unicode

Given the ubiquity of emojis in every communication channel, it's sometimes difficult to remember that just a few years ago emoji characters were devised and implemented in vendor-specific offerings. As the sole Android phone user in my house, I remember a time when my iPhone-happy family could express themselves in emojis that I couldn't read in the family group chat. Apple would release new emojis for their users, and then Android (Google) would leapfrog with another set of their own fun symbols. But if you weren't trading messages with users of the same technology, then chunks of your text would be lost in translation.

Enter Unicode. A standard system for encoding characters that allows for multiple bytes of storage, Unicode has seemingly endless runway for adding new characters. More importantly, there is a standards body that sets revisions for Unicode characters periodically so everyone can use the same huge alphabet. In 2015, emoji characters were added into Unicode and have been revised steadily with universal agreement.

This standardization has helped to propel emojis as a main component of communication in every channel. Text messages, Twitter threads, Venmo payments, Facebook messages, Slack messages, GitHub comments -- everything accepts emojis. (Emojis are so ingrained and expected that if you send a Venmo payment without using an emoji and just use plain text, it could be interpreted as a slight or at the least as a miscue.)

For more background about emojis, read How Emojis Work (source: How Stuff Works).

Unicode is essential for emojis. In SAS, the use of Unicode is possible by way of UTF-8 encoding. If you work in a modern SAS environment with a diverse set of data, you should already be using ENCODING=UTF8 as your SAS session encoding. If you use SAS OnDemand for Academics (the free environment for any learner), this is already set for you. And SAS Viya offers only UTF-8 -- which makes sense, because UTF-8 accommodates the widest range of data and is the de facto encoding for modern applications.
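If you're not sure what your session is using, it's easy to check. Both of these are standard ways to inspect the setting:

/* Display the session encoding in the log */
proc options option=encoding;
run;
 
/* Or grab it programmatically */
%put NOTE: This session encoding is %sysfunc(getoption(encoding));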

Emojis as data and processing in SAS

Emojis are everywhere, and their presence can enrich (and complicate) the way that we analyze text data. For example, emojis are often useful cues for sentiment (smiley face! laughing-with-tears face! grimace face! poop!). It's not unusual for a text message to be ALL emojis with no "traditional" words.

The website Unicode.org maintains the complete compendium of emojis as defined in the latest standards. They also provide the emoji definitions as data files, which we can easily read into SAS. This program reads all of the data as published and adds features for just the "basic" emojis:

/* MUST be running with ENCODING=UTF8 */
filename raw temp;
proc http
 url="https://unicode.org/Public/emoji/13.1/emoji-sequences.txt"
 out=raw;
run;
 
ods escapechar='~';
data emojis (drop=line);
length line $ 1000 codepoint_range $ 45 val_start 8 val_end 8 type $ 30 comments $ 65 saschar $ 20 htmlchar $ 25;
infile raw ;
input;
line = _infile_;
if substr(line,1,1)^='#' and line ^= ' ' then do;
 /* read the raw codepoint value - could be single, a range, or a combo of several */
 codepoint_range = scan(line,1,';');
 /* read the type field */
 type = compress(scan(line,2,';'));
 /* text description of this emoji */
 comments = scan(line,3,'#;');
 
 /* for those emojis that have a range of values */
 val_start = input(scan(codepoint_range,1,'. '), hex.);
 if find(codepoint_range,'..') > 0 then do;
  val_end = input(scan(codepoint_range,2,'.'), hex.);
 end;
 else val_end=val_start;
 
 if type = "Basic_Emoji" then do;
  saschar = cat('~{Unicode ',scan(codepoint_range,1,' .'),'}');
  htmlchar = cats('<span>&#x',scan(codepoint_range,1,' .'),';</span>');
 end;
 output;
end;
run;
 
proc print data=emojis; run;

(As usual, all of the SAS code in this article is available on GitHub.)

The "features" I added include the Unicode representation for an emoji character in SAS, which could then be used in any SAS report in ODS or any graphics produced in the SG procedures. I also added the HTML-encoded representation of the emoji, which uses the form &#xNNNN; where NNNN is the Unicode value for the character. Here's the raw data view:

When you run PROC PRINT to an HTML destination, here's the view in the results browser:
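Those generated escape sequences work anywhere ODS inline formatting is honored, such as titles and footnotes. Here's a small, hypothetical example of the idea -- it assumes a UTF-8 session and an ODS destination and font that can render the glyph (1F4C8 is "chart increasing"):

ods escapechar='~';
title "Emoji report ~{unicode 1F4C8}";
proc print data=emojis (obs=10);
 var codepoint_range type comments;
run;
title;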

In search of structured emoji data

The Unicode.org site can serve up the emoji definitions and codes, but this data isn't exactly ready for use within applications. One could work through the list of emojis (thousands of them!) and tag these with descriptive words and meanings. That could take a long time and to be honest, I'm not sure I could accurately interpret many of the emojis myself. So I began the hunt for data files that had this work already completed.

I found the GitHub/gemoji project, a Ruby-language code repository that contains a structured JSON file that describes a recent collection of emojis. From all of the files in the project, I need only one JSON file. Here's a SAS program that downloads the file with PROC HTTP and reads the data with the JSON libname engine:

filename rawj temp;
 proc http
  url="https://raw.githubusercontent.com/github/gemoji/master/db/emoji.json"
  out=rawj;
run;
 
libname emoji json fileref=rawj;

Upon reading these data, I quickly realized the JSON text contains the actual Unicode character for the emoji, and not the decimal or hex value that we might need for using it later in SAS.

I wanted to convert the emoji character to its numeric code. That's when I discovered the UNICODEC function, which can "decode" the Unicode sequence into its numeric values. (Note that some characters use more than one value in a sequence).
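Here's a tiny illustration of the function -- a sketch that assumes a UTF-8 session so the emoji literal survives the trip through your code editor:

/* Decode one emoji character into its Unicode escape sequence */
data _null_;
 length emoji $ 20;
 emoji = '🤷';                      /* person shrugging */
 code = unicodec(emoji, 'esc');     /* escape-sequence form */
 put code=;                         /* expect something like \u0001F937 */
run;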

Here's my complete program, which includes some reworking of the tags and aliases attributes so I can have one record per emoji:

filename rawj temp;
 proc http
  url="https://raw.githubusercontent.com/github/gemoji/master/db/emoji.json"
  out=rawj;
run;
 
libname emoji json fileref=rawj;
 
/* reformat the tags and aliases data for inclusion in a single data set */
data tags;
 length ordinal_root 8 tags $ 60;
 set emoji.tags;
 tags = catx(', ',of tags:);
 keep ordinal_root tags;
run;
 
data aliases;
 length ordinal_root 8 aliases $ 60;
 set emoji.aliases;
 aliases = catx(', ',of aliases:);
 keep ordinal_root aliases;
run;
 
/* Join together in one record per emoji */
proc sql;
 create table full_emoji as 
 select  t1.emoji as emoji_char, 
    unicodec(t1.emoji,'esc') as emoji_code, 
    t1.description, t1.category, t1.unicode_version, 
    case 
      when t1.skin_tones = 1 then t1.skin_tones
      else 0
    end as has_skin_tones,
    t2.tags, t3.aliases
  from emoji.root t1
  left join tags t2 on (t1.ordinal_root = t2.ordinal_root)
  left join aliases t3 on (t1.ordinal_root = t3.ordinal_root)
 ;
quit;
 
proc print data=full_emoji; run;

Here's a snippet of the report that includes some of the more interesting sequences:

The diversity and inclusion aspect of emoji glyphs is ever-expanding. For example, consider the emoji for "family":

  • The basic family emoji code is \u0001F46A (👪)
  • But since families come in all shapes and sizes, you can find a family that better represents you. For example, how about "family: man, man, girl, girl"? The code is \u0001F468\u200D\u0001F468\u200D\u0001F467\u200D\u0001F467, which includes the codes for each component "member" all smooshed together with a "zero-width joiner" (ZWJ) code in between (👨‍👨‍👧‍👧)
  • All of the above, but with a dark-skin-tone modifier (\u0001F3FF) for 2 of the family members: \u0001F468\u0001F3FF\u200D\u0001F468\u200D\u0001F467\u200D\u0001F467\u0001F3FF (👨🏿‍👨‍👧‍👧🏿)
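If you want to assemble one of these sequences in code, here's a sketch. It assumes a UTF-8 session and that the UNICODE function accepts the same 8-hex-digit \u form that UNICODEC emits for characters beyond the Basic Multilingual Plane:

data zwj_family;
 length man girl zwj $ 8 family $ 40;
 /* assumption: UNICODE accepts the 8-digit \u escapes shown above */
 man    = unicode('\u0001F468', 'esc');   /* man */
 girl   = unicode('\u0001F467', 'esc');   /* girl */
 zwj    = unicode('\u200D', 'esc');       /* zero-width joiner */
 family = cats(man, zwj, man, zwj, girl, zwj, girl);
 put family=;                             /* family: man, man, girl, girl */
run;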

Conclusion: Emojis reflect society, and society adapts to emojis

As you might have noticed from that last sequence I shared, a single concept can call for many different emojis. As our society becomes more inclusive around gender, skin color, and differently capable people, emojis are keeping up. Everyone can express the concept in the way that is most meaningful for them. This is just one way that the language of emojis enriches our communication, and in turn our experience feeds back into the process and grows the emoji collection even more.

As emoji-rich data is used for reporting and for training of AI models, it's important for our understanding of emoji context and meaning to keep up with the times. Already we know that emoji use differs across generations and other demographic groups. The use and application of emojis -- separate from the definition of emoji codes -- is yet another dimension to the data.

Our task as data scientists is to bring all of this intelligence and context into the process when we parse, interpret and build training data sets. The mechanics of parsing and producing emoji-rich data is just the start.

If you're encountering emojis in your data and considering them in your reporting and analytics, please let me know how! I'd love to hear from you in the comments.

The post How to work with emojis in SAS appeared first on The SAS Dummy.

January 18, 2021
 

Recommended soundtrack for this blog post: Netflix Trip by AJR.

This week's news confirms what I already knew: The Office was the most-streamed television show of 2020. According to reports that I've seen, the show was streamed for 57 billion minutes during this extraordinary year. I'm guessing that's in part because we've all been shut in and working from home; we crave our missing office interactions. We lived vicariously (and perhaps dysfunctionally) through watching Dunder Mifflin staff. But another major factor was the looming deadline of the departure of The Office from Netflix as of January 1, 2021. It was a well-publicized event, so Netflix viewers had to get their binge on while they could.

People in my house are fans of the show, and they account for nearly 6,000 of those 57 billion streaming minutes. I can be this precise (nerd alert!) because I'm in the habit of analyzing our Netflix activity by using SAS. In fact, I can tell you that since late 2017, we've streamed 576 episodes of The Office. We streamed 297 episodes in 2020. (Since the show has only 201 episodes, we clearly have a few repeats in there.)

I built a heatmap that shows the frequency and intensity of our streaming of this popular show. In this graph each row is a month, each square is a day. White squares are Office-free. A square with any red indicates at least one virtual visit with the Scranton crew; the darker the shade, the more episodes streamed during that day. You can see that Sept 15, 2020 was a particularly big binge with 17 episodes. (Each episode is about 20-21 minutes, so it's definitely achievable.)


Heatmap of our household streaming of The Office

How to build the heatmap

To build this heatmap, I started with my Netflix viewing history (downloaded from my Netflix account as CSV files). I filtered to just "The Office (U.S.)" titles, and then merged with a complete "calendar" of dates between late 2017 and the start of 2021. Summarized and merged, the data looks something like this:
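If you'd like to try the prep yourself, here's a minimal sketch of the summarize step. It assumes the Netflix export file (NetflixViewingHistory.csv) has Title and Date columns and that PROC IMPORT reads Date as a SAS date value; the full program on GitHub also merges in the complete calendar and splits each date into the month and day-of-month values used for the heatmap axes:

proc import datafile="NetflixViewingHistory.csv" out=history
 dbms=csv replace;
run;
 
/* one row per day with a count of episodes streamed */
proc sql;
 create table ofc_daily as
 select date, count(*) as episodes
  from history
  where title contains "The Office (U.S.)"
  group by date;
quit;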


With all of the data summarized in this way such that there is only one observation per X and Y value, I can use the HEATMAPPARM statement in PROC SGPLOT to visualize it. (If I needed the procedure to summarize/bin the data for me, I would use the HEATMAP statement. Thanks to Rick Wicklin for this tip!)

proc sgplot data=ofc_viewing;
 title height=2.5 "The Office - a Netflix Journey";
 title2 height=2 "&episodes. episodes streamed on &days. days, over 3 years";
 label Episodes="Episodes per day";
 format monyear monyy7.;
 heatmapparm x=day y=monyear 
   colorresponse=episodes / x2axis
    outline
   colormodel=(white  CXfcae91 CXfb6a4a CXde2d26 CXa50f15) ;
 yaxis  minor reverse display=(nolabel) 
  values=(&allmon.)
  ;
 x2axis values=(1 to 31 by 1) 
   display=(nolabel)  ;
run;

You can see the full code -- with all of the data prep -- on my GitHub repository here. You may even run the code in your own SAS environment -- it will fetch my Netflix viewing data from another GitHub location where I've stashed it.

Distribution of Seasons (not "seasonal distribution")

If you examine the heatmap I produced, you can almost see our Office enthusiasm in three different bursts. These relate directly to our 3 children and the moments they discovered the show. First was early 2018 (middle child), then late 2019 (youngest child), then late 2020 (oldest child, now 22 years old, striving to catch up).

The Office ran for 9 seasons, and our kids have their favorite seasons and episodes -- hence the repeated viewings. I used PROC FREQ to show the distribution of episode views across the seasons:
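The FREQ step itself is nothing fancy. Here's a sketch, assuming the episode-level data set (ofc_episodes is a made-up name here) carries a Season variable:

proc freq data=ofc_episodes order=internal;
 tables season / nocum;
run;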


Season 1 is remarkably low for two reasons. First and most importantly, it contains the fewest episodes. Second, many viewers agree that Season 1 is the "cringiest" content, and can be uncomfortable to watch. (This Reddit user leaned into the cringe with his data visualization of "that's what she said" jokes.)

From the data (and from listening to my kids), I know that Season 2 is a favorite. Of the 60 episodes we streamed at least 4 times, 19 of them were in Season 2.

More than streaming, it's an Office lifestyle

Office fandom goes beyond just watching the show. Our kids continue to embrace The Office in other mediums as well. We have t-shirts depicting the memes for "FALSE." and "Schrute Farms." We listen to The Office Ladies podcast, hosted by two stars of the show. In 2018 our daughter's Odyssey of the Mind team created a parody skit based on The Office (a weather-based office named Thunder Mifflin) -- and advanced to world finals.

Rarely does a day go by without some reference to an iconic phrase or life lesson that we gleaned from The Office. We're grateful for the shared experience, and we'll miss our friends from the Dunder Mifflin Paper Company.

The post Visualizing our Netflix Trip through The Office appeared first on The SAS Dummy.

November 10, 2020
 

The code and data that drive analytics projects are important assets to the organizations that sponsor them. As such, there is a growing trend to manage these items in the source management systems of record. For most companies these days, that means Git. The specific system might be GitHub Enterprise, GitLab, or Bitbucket -- all platforms that are based on Git.

Many SAS products support direct integration with Git. This includes SAS Studio, SAS Enterprise Guide, and the SAS programming language. (That last one checks a lot of boxes for ways to use Git and SAS together.) While we have good documentation and videos to help you learn about Git and SAS, we often get questions around "best practices" -- what is the best/correct way to organize your SAS projects in Git?

In this article I'll dodge that question, but I'll still try to provide some helpful advice in the process.

Ask the Expert resource: Using SAS® With Git: Bring a DevOps Mindset to Your SAS® Code

Guidelines for managing SAS projects in Git

It’s difficult for us to prescribe exactly how to organize project repositories in source control. Your best approach will depend so much on the type of work, the company organization, and the culture of collaboration. But I can provide some guidance -- mainly things to do and things to avoid -- based on experience.

Do not create one huge repository

DO NOT build one huge repository that contains everything you currently maintain. Your work only grows over time and you'll come to regret/revisit the internal organization of a huge project. Once established, it can be tricky to change the folder structure and organization. If you later try to break a large project into smaller pieces, it can be difficult or impossible to maintain the integrity of source management benefits like file histories and differences.

Design with collaboration in mind

DO NOT organize projects based only on the teams that maintain them. And of course, don't organize projects based on individual team members.

  • Good repo names: risk-adjustment-model, engagement-campaigns
  • Bad repo names: joes-code, claims-dept

All teams reorganize over time, and you don't want to have to reorganize all of your code each time that happens. And code projects change hands, so keep the structure personnel-agnostic if you can. Major refactoring of code can introduce errors, and you don't want to risk that just because you got a new VP or someone changed departments.

Instead, DO organize projects based on function/work that the code accomplishes. Think modular...but don't make projects too granular (or you'll have a million projects). I personally maintain several SAS code projects. The one thing they have in common is that I'm the main contributor -- but I organize them into functional repos that theoretically (oh please oh please) someone else could step in to take over.

The Git view of my YouTube API project in SAS Enterprise Guide

Up with reuse, down with ownership

This might seem a bit communist, but collaboration works best when we don't regard code that we write as "our turf." DO NOT cling to notions of code "ownership." It makes sense for teams/subject-matter experts to have primary responsibility for a project, but systems like Git are designed to help with transparency and collaboration. Be open to another team member suggesting and merging (with review and approval) a change that improves things. GitHub, GitLab, and Bitbucket all support mechanisms for issue tracking and merge requests. These allow changes to be suggested, submitted, revised, and approved in an efficient, transparent way.

DO use source control to enable code reuse. Many teams have foundational "shared code" for standard operations, coded in SAS macros or shared statements. Consider placing these into their own project that other projects and teams can import. You can even use Git functions within SAS to fetch and include this code directly from your Git repository:

/* create a temp folder to hold the shared code */
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/shared-code;
libname repo "&repoPath.";
libname repo clear;
 
/* Fetch latest code from Git */
data _null_;
 rc = git_clone( 
   "https://gitlab.mycompany.com/sas-projects/shared-code/",
   "&repoPath.");
run;
 
options source2;
/* run the code in this session */
%include "&repoPath./bootstrap-macros.sas";

If you rely on a repository for shared code and components, make sure that tests are in place so changes can be validated and will not break downstream systems. You can even automate tests with continuous integration tools like Jenkins.

DO document how projects relate to each other, dependencies, and prepare guidance for new team members to get started quickly. For most of us, we feel more accountable when we know that our code will be placed in central repositories visible to our peers. It may inspire cleaner code, more complete documentation, and a robust on-boarding process for new team members. Use the Markdown files (README.md and others) in a repository to keep your documentation close to the code.

My SAS code to check Pagespeed Insights, with documentation

Work with Git features (and not against them)

Once your project files are in a Git repository, you might need to change your way of working so that you aren't going against the grain of Git benefits.

DO NOT work on code changes in a shared directory with multiple team members -- you'll step on each other. The advantage of Git is that it's a distributed workflow and each developer can work with their own copy of the repository, and merge/accept changes from others at their own pace.

DO use Git branching to organize and isolate changes until you are ready to merge them with the main branch. It takes a little bit of learning and practice, but when you adopt a branching approach you'll find it much easier to manage -- it beats keeping multiple copies of your code with slightly different file and folder names to mark "works in progress."

DO consider learning and using Git tools such as Git Bash (command line), Git GUI, and a code IDE like VS Code. These don't replace the SAS-provided coding tools with their Git integration, but they can supplement your workflow and make it easier to manage content among several projects.

Learning more

When you're ready to learn more about working with Git and SAS, we have many webinars, videos, and documentation resources.

The post How to organize your SAS projects in Git appeared first on The SAS Dummy.

September 6, 2019
 

A few years ago I shared a method to publish content from SAS to a Slack channel. Since that time, our teams at SAS have gone "all in" on collaboration with Microsoft Office 365, including Microsoft Teams. Microsoft Teams is the Office suite's answer to Slack, and it's not a coincidence that it works in nearly the same way.

The lazy method: send e-mail to the channel

Before I cover the "deluxe" method for sending content to a Microsoft Teams channel, I want to make sure you know that there is a simple method that involves no coding, and no need for APIs. The message experience isn't as nice, but it does the job. You can simply "send e-mail" to the channel. If you're automating output from SAS, it's a simple, well-documented process to send e-mail from a SAS program. (Here's an example from me, using FILENAME EMAIL.)

When you send e-mail to a Microsoft Teams channel, the message notice includes the message subject line, sender, and the first bit of the message content. To see the entire message, you must click on the "View original e-mail" link in the notice. This "downloads" the message to your device so that you can open it with a local tool (such as your e-mail reader, Microsoft Outlook). My team uses this method to receive certain alerts from our communities.sas.com platform. Here's an example:

To get the unique e-mail address for a channel, right-click on the channel name and select Get email address. Any message that you send to that e-mail address will be distributed to the team.
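Once you have that address, a minimal FILENAME EMAIL step might look like this. The address below is a placeholder, and your SAS session must already be configured for SMTP mail (the EMAILSYS=SMTP, EMAILHOST= and EMAILPORT= system options):

/* Placeholder address -- paste in the one you copied from Teams */
filename teammsg email
 to="your-channel-token@amer.teams.ms"
 subject="Nightly job status from SAS";
 
data _null_;
 file teammsg;
 put "The nightly report job completed without errors.";
run;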

Getting started with a Microsoft Teams webhook

In order to provide a richer, more integrated experience with Microsoft Teams, you can publish content using a webhook. A webhook is a REST API endpoint that allows you to post messages and notifications with more control over the appearance and interactive options within the messages. In SAS, you can publish to a webhook by using PROC HTTP.

To get started, you need to add and configure a webhook for your Microsoft Teams channel:

  1. Right-click on the channel name and select Connectors.
  2. Microsoft Teams offers built-in connectors for many different applications. To find the connector for Incoming Webhook, use the search field to narrow the list. Then click Add to add the connector to the channel.
  3. You must grant certain permissions to the connector to interact with your channel. In this case, you need to allow the webhook to send messages and notifications. Review the permissions and click Install.
  4. On the Configuration page, assign a name to this connector and optionally customize the image. The image will be the avatar that's used when the connector posts content to the channel. When you've completed these changes, select Create.
  5. The connector generates a unique (and very long) URL that serves as the REST API endpoint. You can copy the URL from this field -- you will need it later in your SAS program. You can always come back to these configuration settings to change the connector avatar or re-copy the URL.

At this point, it's a good idea to test that you can publish a basic message from SAS. The "payload" for a Teams message is a JSON-formatted structure, and you can find examples in the Microsoft Teams reference doc. Here's a SAS program that publishes the simplest message. Add your webhook URL and run the code to verify the connector is working for your channel.

    filename resp temp;
    options noquotelenmax;
    proc http
      /* Substitute your webhook URL here */
      url="https://outlook.office.com/webhook/your-unique-webhook-address-it-is-very-long"
      method="POST"
      in=
      '{
          "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
          "type": "AdaptiveCard",
          "version": "1.0",
          "summary": "Test message from SAS",
          "text": "This message was sent by **SAS**!"
      }'
      out=resp;
    run;

If successful, this step will post a simple message to your Teams channel:

Design a message card for Microsoft Teams

Now that we have the basic plumbing working, it's time to add some bells and whistles. Microsoft Teams calls these notifications "message cards", which are messages that can include interactive features such as images, data, action buttons, and more.

Designing a simple message

Microsoft Teams supports a large palette of building blocks (expressed in JSON) to create different card experiences. You can experiment with these cards in the MessageCard Playground that Microsoft hosts. The tool provides templates for several card varieties, and you can edit the JSON definitions to tweak and design your own.

For one of my use cases, I designed a simple card to show the status of our recommendation engine on SAS Support Communities. (Read this article for more information about how we built and monitor the recommendation engine.) The engine runs as a service and is accessed with its own API. I wanted a periodic "health check" to post to our internal team that would alert us to any problems. Here's the JSON that I used in the MessageCard Playground to design it.

Much of the JSON is boilerplate for the message. I drew the green blocks to indicate the areas that need to be dynamic -- that is, replaced with values from the real-time API call. Here's what the card looks like when rendered in the Microsoft Teams channel.
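As a rough guide, a card that carries a table of "facts" can be posted with the same PROC HTTP pattern as the earlier test message. This is a generic skeleton patterned on Microsoft's MessageCard documentation -- not the exact card I built -- and the fact values are the pieces that need to come from the real-time API response:

filename resp temp;
proc http
 /* same webhook URL as in the earlier example */
 url="https://outlook.office.com/webhook/your-unique-webhook-address-it-is-very-long"
 method="POST"
 in=
 '{
    "@type": "MessageCard",
    "@context": "https://schema.org/extensions",
    "summary": "Recommendation engine health check",
    "themeColor": "0078D7",
    "title": "Recommendation engine status",
    "sections": [
      { "facts": [
          { "name": "Score data updated (UTC)", "value": "(from the API call)" },
          { "name": "Topics scored", "value": "(from the API call)" }
      ] }
    ]
 }'
 out=resp;
run;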

Since my API call to the recommendation engine service creates a data set, I can run that data through PROC JSON to create the JSON segment I need:

    /* reading the results from my API call to the engine */
    libname results json fileref=resp;
     
    /* Prep a simple name-value data set with the results */
    data segment (keep=name value);
     set results.root;
     name="Score data updated (UTC)";
     value= astore_creation;
     output;
     name="Topics scored";
     value=left(num_topics);
     output;
     name="Number of users";
     value= left(num_users);
     output;
     name="Process time";
     value= process_time;
     output;
    run;
     
    /* use PROC JSON to create the segment */
    filename segment temp;
    proc json out=segment nosastags pretty;
     export segment;
    run;

I shared a version of the complete program on GitHub. It should run as is -- but you would need to supply your own webhook endpoint for a channel that you can publish to.

Design a message with actions

I also use Microsoft Teams to share updates about the SAS Software GitHub organization. In a previous article I discussed how I use GitHub APIs to gather data from the GitHub service. Each day, my program summarizes the recent activity from github.com/sassoftware and publishes a message card to the team. Here's an example of a daily update:

This card is fancier than my first example. I added action buttons that can direct the team members to the internal reports for more details and to the GitHub site itself. I used the Microsoft Teams documentation and the MessageCard Playground to design the experience:

Messaging apps as part of a DevOps strategy

Like many organizations, we (SAS) invest a considerable amount of time and energy into gathering metrics and building reports about our operations. However, reports are useful only when the intended audience is tuned in and refers to them regularly. With a small additional step, you can use SAS to bring your most interesting data forward to your team -- automatically.

Whether you use Microsoft Teams or Slack, automated alerting and updates are a great opportunity to keep your teams informed. Each of these tools offers fit-for-purpose connectors that can tie in with information from other popular operational systems (Salesforce, GitHub, Yammer, JIRA, and many more). For cases where a built-in connector is not available, the webhook approach allows you to easily create your own.

The post How to publish to a Microsoft Teams channel using SAS appeared first on The SAS Dummy.

August 19, 2019
 

I'm old enough to remember when USA Today began publication in the early 1980s. As a teenager who was not particularly interested in current events, I remember scanning each edition for the USA Today Snapshots, a mini infographic feature that presented some statistic in a fun and interesting way. Back then, I felt that these stats made me a little bit smarter for the day. I had no reason to question the numbers I saw, nor did I have the tools, skill or data access to check their work.

Today I still enjoy the USA Today Snapshots feature, but for a different reason. An interesting infographic will spark curiosity. And provided that I have time and interest, I can use the tools of data science (SAS, in my case) and public data to pursue more answers.

In the August 7, 2019 issue, USA Today published this graphic about marijuana use in Colorado. Before reading on, I encourage you to study the graphic for a moment and see what questions arise for you.

Source: USA Today Snapshot from Aug 7 2019

I have some notes

For me, as I studied this graphic, several questions came to mind immediately.

  • Why did they publish this graphic? USA Today Snapshots are usually offered without explanation or context -- that's sort of their thing. So why did the editors choose to share these survey results about marijuana use in Colorado? As readers, we must supply our own context. Most of us know that Colorado recently legalized marijuana for recreational use. The graphic seems to answer the question, "Has marijuana use among certain age groups increased since the law changed?" And a much greater leap: "Has marijuana use increased because of the legal change?"
  • Just Colorado? We see trend lines here for Colorado, but there are other states that have legalized marijuana. How does this compare to Maine or Alaska or California? And what about those states where it's not yet legal, like North Carolina?
  • People '26 and older' are also '18 and older'. The reported age categories overlap. '18 and older' includes '18 to 25' and '26 and older'. I believe that the editors added this combined category by aggregating the other two. Why did they do that?
  • Isn't '26 and older' a wide category? '12 to 17' is a 6-year span, and '18 to 25' is an 8-year span. But '26 and older' covers what? 60-plus years?
  • "Coloradoans?" Is that really how people from Colorado refer to themselves? Turns out that's a matter of style preference.

The vagaries of survey results

To its credit, the infographic cites the source for the original data: the National Survey on Drug Use and Health (NSDUH). The organization that conducts the annual survey is the Substance Abuse and Mental Health Services Administration (SAMHSA), which is under the US Department of Health and Human Services. From the survey description: The data provides estimates of substance use and mental illness at the national, state, and sub-state levels. NSDUH data also help to identify the extent of substance use and mental illness among different sub-groups, estimate trends over time, and determine the need for treatment services.

This provides some insight into the purpose of the survey: to help policy makers plan for mental health and substance abuse services. "How many more people are using marijuana for fun?" -- the question I've inferred from the infographic choices -- is perhaps tangential to that charter.

Due to privacy concerns, SAMHSA does not provide the raw survey responses for us to analyze. The survey collects details about the respondent's drug use and mental health treatment, as well as demographic information about gender, age, income level, education level, and place of residence. For a deep dive into the questions and survey flow, you can review the 2019 questionnaire here. SAMHSA uses the survey responses to extrapolate to the overall population, producing weighted counts for each response across recoded categories, and imputing counts and percentages for each aspect of substance use (which drugs, how often, how recent).

SAMHSA provides these survey data in two channels: the Public-use Data Analysis System and the Restricted-use Data Analysis System. The "Public-use" data provides annualized statistics about the substance use, mental health, and demographics responses across the entire country. If you want data that includes locale information (such as the US state of residence), then you have to settle for the "Restricted-use" system -- which does not provide annual data, but instead provides data summarized across multi-year study periods. In short, if you want more detail about one aspect of the survey responses, you must sacrifice detail across other facets of the data.

My version of the infographic

I spent hours reviewing the available survey reports and data, and here's what I learned: I am an amateur when it comes to understanding health survey reports. However, I believe that I successfully reverse-engineered the USA Today Snapshot data source so that I could produce my own version of the chart. I used the "Restricted-use" version of the survey reports, which allowed access to imputed data values across two-year study periods. My version shows the same data points, but with these formatting changes:

  • I set the Y axis range as 0% to 100%, which provides a less-exaggerated slope of the trend lines.
  • I did not compute the "18 and over" data point.
  • I added reference lines (dashed blue) to indicate the end of each two-year study period for which I have data points.

Here's one additional data point that's not in the survey or in the USA Today graphic. Colorado legalized marijuana for recreational use in 2012. In my chart, you can see that marijuana use was on the rise (especially among 18-25 years old) well before that, especially since 2009. Medical use was already permitted then (see Robert Allison's chart of the timeline), and we can presume that Coloradoans (!) were warming up to the idea of recreational use before the law was passed. But the health survey measures only reported use, and does not measure the user's purpose (recreational, medical, or otherwise) or attitudes toward the substance.

Limitations of the survey data and this chart

Like the USA Today version, my graph has some limitations.

  • My chart shows the same three broad age categories as the original. These are recoded age values from the study data. For some of the studies it was possible to get more granular age categories (5 or 6 bins instead of 3), but I could not get this for all years. Again, when you push for more detail on one aspect, the "Restricted-use" version of the data pushes back.
  • The "used in the past 12 months" indicator is computed. The survey report doesn't offer this as a binary value. Instead it offers "Used in the past 30 days" and "Used more than 30 days ago but less than 12 months." So, I added those columns together, and I assume that the USA Today editors did the same (see the short sketch after this list).
  • I'm not showing the confidence intervals for the imputed survey responses. Since this is survey data, the data values are not absolute but instead are estimates accompanied by a percent-confidence that the true values fall in this certain range. The editors probably decided that this is too complex to convey in your standard USA Today Snapshot -- and it might blunt the potential drama of the graphic. Here's what it would look like for the "Used marijuana in past 30 days" response, with the colored band indicating the 95% confidence interval.
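That computed "past 12 months" indicator is nothing more than the sum of the two imputed columns. Here's a one-line sketch; the variable names are hypothetical, since the downloaded CSV uses its own column labels:

/* Hypothetical variable names -- the downloaded CSV uses its own labels */
data survey_calc;
 set survey_raw;
 used_past_12_months = sum(used_past_30_days, used_30days_to_12_months);
run;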

Beyond Colorado: what about other states?

Having done the work to fetch the survey data for Colorado, it was simple to gather and plot the same data for other states. Here's the same graph with data from North Carolina (where marijuana use is illegal) and Maine and California.

While I was limited to the two-year study reports for data at the state level, I was able to get the corresponding data points for every year for the country as a whole:

I noticed that the reported use among those 12-17 years old declined slightly across most states, as well as across the entire country. I don't know what the logistics are for administering such a comprehensive survey to young people, but this made me wonder if something about the survey process had changed over time.

The survey data also provides results for other drugs, like alcohol, tobacco, cocaine, and more. Alcohol has been legal for much longer and is certainly widely used. Here are the results for Alcohol use (imputed 12 months recency) in Colorado. Again I see a decline in the self-reported use among those 12-17 years old. Are fewer young people using alcohol? If true, we don't usually hear about that. Or has something changed in the survey methods with regard to minors?

SAS programs to access NSDUH survey data

On my GitHub repo, you can find my SAS programs to fetch and chart the NSDUH data. The SAMHSA website offers a point-and-click method to select your dimensions: a row, column, and control variable (like a BY group).

I used the interactive report tool to navigate to the data I wanted. After some experimentation, I settled on the "Imputed Marijuana Use Recency" (IRMJRC) value for the report column -- I think that's what USA Today used. Also, I found other public reports that referenced it for similar purposes. The report tool generates a crosstab report and an optional chart, but it also then offers a download option for the CSV version of the data.

I was able to capture that download directive as a URL, and then used PROC HTTP to download the data for each study period. This made it possible to write SAS code to automate the process -- much less tedious than clicking through reports for each study year.

%macro fetchStudy(state=,year=);
  filename study "&workloc./&state._&year..csv";
 
  proc http
   method="GET"
   url="https://rdas.samhsa.gov/api/surveys/NSDUH-&year.-RD02YR/crosstab.csv/?" ||
       "row=CATAG2%str(&)column=IRMJRC%str(&)control=STNAME%str(&)weight=DASWT_1" ||
        "%str(&)run_chisq=false%str(&)filter=STNAME%nrstr(%3D)&state."
   out=study;
  run;
%mend;
 
%let state=COLORADO;
/* Download data for each 2-year study period */
%fetchStudy(state=&state., year=2016-2017);
%fetchStudy(state=&state., year=2015-2016);
%fetchStudy(state=&state., year=2014-2015);
%fetchStudy(state=&state., year=2012-2013);
%fetchStudy(state=&state., year=2010-2011);
%fetchStudy(state=&state., year=2008-2009);
%fetchStudy(state=&state., year=2006-2007);

Each data file represents one two-year study period. To combine these into a single SAS data set, I use the INFILE-with-a-wildcard technique that I've shared here.

 INFILE "&workloc./&state._*.csv"
    filename=fname
    LRECL=32767 FIRSTOBS=2 ENCODING="UTF-8" DLM='2c'x
    MISSOVER DSD;
  INPUT
    state_name   
    recency 
    age_cat 
    /* and so on */

The complete programs are in GitHub -- one version for the state-level two-year study data, and one version for the annual data for the entire country. These programs should work as-is within SAS Enterprise Guide or SAS Studio, including in SAS University Edition. Grab the code and change the STATE macro variable to find the results for your favorite US state.

Conclusion: maintain healthy skepticism

News articles and editorial pieces often use simplified statistics to convey a message or support an argument. There is just something about including numbers that lends credibility to reporting and arguments. Citing statistics is a time-honored and effective method to inform the public and persuade an audience. Responsible journalists will always cite their data sources, so that those with time and interest can fact-check and find additional context beyond what the media might share.

I enjoy features like the USA Today Snapshot, even when they send me down a rabbit hole as this one did. As I tell my children often (and they are weary of hearing it), statistics in the media should not be accepted at face value. But if they make you curious about a topic so that you want to learn more, then I think the editors should be proud of a job well done. It's on the rest of us to follow through to find the deeper answers.

The post A skeptic's guide to statistics in the media appeared first on The SAS Dummy.

July 25, 2019
 

Recommendations on SAS Support Communities

If you visit the SAS Support Communities and sign in with your SAS Profile, you'll experience a little bit of SAS AI with every topic that you view.

While it appears to be a simple web site widget, the "Recommended by SAS" sidebar is made possible by an application of the full Analytics Life Cycle. This includes data collection and prep, model building and test, API wrappers with a gateway for monitoring, model deployment in containers with orchestration in Kubernetes, and model assessment using feedback from click actions on the recommendations. We built this by using a combination of SAS analytics and open source tools -- see the SAS Global Forum paper by my colleague, Jared Dean, for the full list of ingredients.

Jared and I have been working for over a year to bring this recommendation engine to life. We discussed it at SAS Global Forum 2018, and finally near the end of 2018 it went into production on communities.sas.com. The engine scores user visits for new recommendations thousands of times per day. The engine is updated each day with new data and a new scoring model.

Now that the recommendation engine is available, Jared and I met again in front of the camera. This time we discussed how the engine is working and the efforts required to get into production. Like many analytics projects, the hardest part of the journey was that "last mile," but we (and the entire company, actually) were very motivated to bring you a live example of SAS analytics in action. You can watch the full video at (where else?) communities.sas.com. The video is 17 minutes long -- longer than most "explainer"-type videos. But there was a lot to unpack here, and I think you'll agree there is much to learn from the experience. Not ready to binge on our video? I'll use the rest of this article to cover some highlights.

Good recommendations begin with clean data

The approach of our recommendation engine is based upon your viewing behavior, especially as compared to the behavior of others in the community. With this approach, we don't need to capture much information about you personally, nor do we need information about the content you're reading. Rather, we just need the unique IDs (numbers) for each topic that is viewed, and the ID (again, a number) for the logged-in user who viewed it. One benefit of this approach is that we don't have to worry about surfacing any personal information in the recommendation API that we'll ultimately build. That makes the conversation with our IT and Legal colleagues much easier.

Our communities platform captures details about every action -- including page views -- that happens on the site. We use SAS and the community platform APIs to fetch this data every day so that we can build reports about community activity and health. We now save off a special subset of this data to feed our recommendation engine. Here's an example of the transactions we're using. It's millions of records, covering nearly 100,000 topics and nearly 150,000 active users.

Sample data records for the model

Building user-item recommendations with PROC FACTMAC

Starting with these records, Jared uses SAS DATA step to prep the data for further analysis and a pass through the algorithm he selected: factorization machines. As Jared explains in the video, this algorithm shines when the data are represented in sparse matrices. That's what we have here. We have thousands of topics and thousands of community members, and we have a record for each "view" action of a topic by a member. Most members have not viewed most of the topics, and most of the topics have not been viewed by most members. With today's data, that results in a 13 billion cell matrix, but with only 3.3 million view events. Traditional linear algebra methods don't scale to this type of application.

Jared uses PROC FACTMAC (part of SAS Visual Data Mining and Machine Learning) to create an analytics store (ASTORE) for fast scoring. Using the autotuning feature, the FACTMAC procedure selects the best combination of values for factors and iterations. And Jared caps the run time to 3600 seconds (1 hour) -- because we do need this to run in a predictable time window for updating each day.

proc factmac data=mycas.weighted_factmac  outmodel=mycas.factors_out;
   autotune maxtime=3600 objective=MSE 
       TUNINGPARAMETERS=(nfactors(init=20) maxiter(init=200) learnstep(init=0.001) ) ;
   input user_uid conversation_uid /level=nominal;
   target rating /level=interval;
   savestate rstore=mycas.sascomm_rstore;
run;

Using containers to build and containers to score

To update the model with new data each day and then deploy the scoring model as an ASTORE, Jared uses multiple SAS Viya environments. These SAS Viya environments need to "live" only for a short time -- for building the model and then for scoring data. We use Docker containers to spin these up as needed within the cloud environment hosted by SAS IT.

Jared makes the distinction between the "building container," which hosts the full stack of SAS Viya and everything that's needed to prep data and run FACTMAC, and the "scoring container", which contains just the ASTORE and enough code infrastructure (including the SAS Micro Analytics Service, or MAS) to score recommendations. This scoring container is lightweight and is actually run on multiple nodes so that our engine scales to lots of requests. And the fact that it does just the one thing -- score topics for user recommendations -- makes it an easier case for SAS IT to host as a service.

DevOps flow for the recommendation engine

Monitoring API performance and alerting

To access the scoring service, Jared built a simple API using a Python Flask app. The API accepts just one input: the user ID (a number). It returns a list of recommendations and scores. Here's my Postman snippet for testing the engine.
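You can run the same test from SAS with PROC HTTP. The endpoint, parameter, and header below are placeholders -- substitute the real gateway URL and API key for your service:

filename resp temp;
proc http
 url="https://api.example.com/recommend?user_id=12345"
 method="GET"
 out=resp;
 headers "apikey"="your-api-key";
run;
 
/* The response is JSON: a list of recommendations and scores */
libname rec json fileref=resp;
proc print data=rec.alldata (obs=20); run;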

To provision this API as a hosted service that can be called from our community web site, we use an API gateway tool called Apigee. Apigee allows us to control access with API keys, and also monitors the performance of the API. Here's a sample performance report for the past 7 days.

In addition to this dashboard for reporting, we have integrated proactive alerts into Microsoft Teams, the tool we use for collaboration on this project. I scheduled a SAS program that tests the recommendations API daily, and the program then posts to a Teams channel (using the Teams API) with the results. I want to share the specific steps for this Microsoft Teams integration -- that's a topic for another article. But I'll tell you this: the process is very similar to the technique I shared about publishing to a Slack channel with SAS.

Are visitors selecting recommended content?

To make it easier to track recommendation clicks, we added special parameters to the recommended topics URLs to capture the clicks as Google Analytics "events." Here's what that data looks like within the Google Analytics web reporting tool:

You might know that I use SAS with the Google Analytics API to collect web metrics. I've added a new use case for that trick, so now I collect data about the "SAS Recommended Click" events. Each click event contains the unique ID of the recommendation score that the engine generated. Here's what that raw data looks like when I collect it with SAS:

With the data in SAS, we can use that to monitor the health/success of the model in SAS Model Manager, and eventually to improve the algorithm.

Challenges and rewards

This project has been exciting from Day 1. When Jared and I saw the potential for using our own SAS Viya products to improve visitor experience on our communities, we committed ourselves to see it through. Like many analytics applications, this project required buy-in and cooperation from other stakeholders, especially SAS IT. Our friends in IT helped with the API gateway and it's their cloud infrastructure that hosts and orchestrates the containers for the production models. Putting models into production is often referred to as "the last mile" of an analytics project, and it can represent a difficult stretch. It helps when you have the proper tools to manage the scale and the risks.

We've all learned a lot in the process. We learned how to ask for services from IT and to present our case, with both benefits and risks. And we learned to mitigate those risks by applying security measures to our API, and by limiting the execution scope and data of the API container (which lives outside of our firewall).

Thanks to extensive preparation and planning, the engine has been running almost flawlessly for 8 months. You can experience it yourself by visiting SAS Support Communities and logging in with your SAS Profile. The recommendations that you see will be personal to you (whether they are good recommendations...that's another question). We have plans to expand the engine's use to anonymous visitors as well, which will significantly increase the traffic to our little API. Stay tuned!

The post Building a recommendation engine with SAS appeared first on The SAS Dummy.

April 20, 2019
 

Do you have a favorite television show? Or a favorite movie franchise that you follow? If you call yourself a "fan," just how much of a fan are you? Are you merely a spectator, or do you take your fanaticism to the next level by creating something new?

When it comes to fandom for franchises like Game of Thrones, the Marvel movies, or Stranger Things, there's a new kind of nerd in town. And this nerd brings data science skills. You've heard of the "second screen" experience for watching television, right? That's where fans watch a show (or sporting event or awards ceremony), but also keep up with Twitter or Facebook so they can commune with other fans of the show on social media. These fan-data-scientists bring a third screen: their favorite data workbench IDE.

I was recently lured into a rabbit hole of Game of Thrones data by a tweet. The Twitter user was reacting to a data visualization of character screen time during the show. The visualization was built in a different tool, but the person was wondering whether it could be done in SAS. I knew the answer was Yes...as long as we could get the data. That turned out to be the easiest part.

WARNING: While this blog post does not reveal any plot points from the show, the data does contain spoilers! No spoilers in what I'm showing here, but if you run my code examples there might be data points that you cannot "unsee." I was personally conflicted about this, since I'm a fan of the show but I'm not yet keeping up with the latest episodes. I had to avert my eyes for the most recent data.

Data is Coming

A GitHub user named Jeffrey Lancaster has shared a repository for all aspects of data around Game of Thrones. He also has similar repos for Stranger Things and Marvel universe. Inside that repo there's a JSON file with episode-level data for all episodes and seasons of the show. With a few lines of code, I was able to read the data directly from the repo into SAS:

filename eps temp;
 
/* Big thanks to this GoT data nerd for assembling this data */
proc http
 url="https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/episodes.json"
 out=eps
 method="GET";
run;
 
/* slurp this in with the JSON engine */
libname episode JSON fileref=eps;

Note that I've shared all of my code for my steps in my own GitHub repo (just trying to pay it forward). Everything should work in Base SAS, including in SAS University Edition.

The JSON library reads the data into a series of related tables that show all of the important things that can happen to characters within a scene. Game of Thrones fans know that death, sex, and marriage (in that order) make up the inflection points in the show.

Building the character-scene data

With a little bit of data prep using SQL, I was able to show the details of the on-screen time per character, per scene. These are the basis of the visualization I was trying to create.

/* Build details of scenes and characters who appear in them */
PROC SQL;
   CREATE TABLE WORK.character_scenes AS 
   SELECT t1.seasonNum, 
          t1.episodeNum,
          t2.ordinal_scenes as scene_id, 
          input(t2.sceneStart,time.) as time_start format=time., 
          input(t2.sceneEnd,time.) as time_end format=time., 
          (calculated time_end) - (calculated time_start) as duration format=time.,
          t3.name
      FROM EPISODE.EPISODES t1, 
           EPISODE.EPISODES_SCENES t2, 
           EPISODE.SCENES_CHARACTERS t3
      WHERE (t1.ordinal_episodes = t2.ordinal_episodes AND 
             t2.ordinal_scenes = t3.ordinal_scenes);
QUIT;

With a few more data prep steps (see my code on GitHub), I was able to summarize the screen time for scene locations:

You can see that The Crownlands dominate as a location. In the show that's a big region and a sort of headquarters for The Seven Kingdoms, and the show data actually includes "sub-locations" that can help us to break that down. Here's the makeup of that 18+ hours of time in The Crownlands:

Screen time for characters

My goal is to show how much screen time each of the major characters receives, and how that changes over time. I began by creating a series of charts using PROC SGPLOT. These were created with a single SGPLOT step and a BY group, segmented by show episode. They appear in a grid because I used ODS LAYOUT GRIDDED to arrange them.

Here's the code segment that creates these dozens of charts. Again, see my GitHub for the intermediate data prep work.

/* Create a gridded presentation of Episode graphs CUMULATIVE timings */
ods graphics / width=500 height=300 imagefmt=svg noborder;
ods layout gridded columns=3 advance=bygroup;
proc sgplot data=all_times noautolegend ;
  hbar name / response=cumulative 
    categoryorder=respdesc  
    colorresponse=total_screen_time dataskin=crisp
    datalabel=name datalabelpos=right datalabelattrs=(size=10pt)
    seglabel seglabelattrs=(weight=bold size=10pt color=white) ;
   ;
  by epLabel notsorted;
  format cumulative time.;
  label epLabel="Ep";
  where rank<=10;
  xaxis display=(nolabel)  grid ;
  yaxis display=none grid ;
run;
ods layout end;
ods html5 close;

Creating an animated timeline

The example shared on Twitter showed an animation of screen time, per character, over the complete series of episodes. So instead of a huge grid with many plots, I need to produce a single file with layers for each episode. In SAS we can produce an animated GIF or animated SVG (scalable vector graphics) file. The SVG is a much smaller file format, but you need a browser or a special viewer to "play" it. Still, that's the path I followed:

/* Create a single animated SVG file for all episodes */
options printerpath=svg animate=start animduration=1 
  svgfadein=.25 svgfadeout=.25 svgfademode=overlap
  nodate nonumber; 
 
/* change this file path to something that works for you */
ODS PRINTER file="c:\temp\got_cumulative.svg" style=daisy;
 
/* For SAS University Edition
ODS PRINTER file="/folders/myfolders/got_cumulative.svg" style=daisy;
*/
 
proc sgplot data=all_times noautolegend ;
  hbar name / response=cumulative 
    categoryorder=respdesc 
    colorresponse=total_screen_time dataskin=crisp
    datalabel=name datalabelpos=right datalabelattrs=(size=10pt)
    seglabel seglabelattrs=(weight=bold size=10pt color=white) ;
  by epLabel notsorted;
  format cumulative time.;
  label epLabel="Ep";
  where rank<=10;
  xaxis label="Cumulative screen time (HH:MM:SS)" grid ;
  yaxis display=none grid ;
run;
options animation=stop;
ods printer close;

Here's the result (hosted in my GitHub repo -- but as a GIF for compatibility).
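If you'd rather let SAS produce the GIF directly, the same approach works -- swap the printer path and the output file name, and drop the SVG-only fade options. A minimal sketch (the file path is just a placeholder):

/* Sketch: render the same BY-group SGPLOT output as an animated GIF */
options printerpath=gif animation=start animduration=1 nodate nonumber;
ods printer file="c:\temp\got_cumulative.gif";
 
/* ... same PROC SGPLOT step with BY epLabel as shown above ... */
 
options animation=stop;
ods printer close;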

I code and I know things

Like the Game of Thrones characters, my visualization is imperfect in many ways. As I was just reviewing it, I discovered a few data prep missteps that I should correct. I used some features of PROC SGPLOT that I've learned only a little about, and so others might suggest improvements. And my next mission should be to bring this data into SAS Visual Analytics, where the real "data viz maesters" who work with me can work their magic. I'm just hoping that I can stay ahead of the spoilers.

The post Deeper enjoyment of your favorite shows -- through data appeared first on The SAS Dummy.

4月 112019
 

This blog post could be subtitled "To Catch a Thief" or maybe "Go ahead. Steal this blog. I dare you."* That's because I've used this technique several times to catch and report other web sites who lift the blog content from blogs.sas.com and present it as their own.

Syndicating blog content is an honorable practice, made possible by RSS feeds that virtually every blog platform supports. With syndicated feeds, your blog content can appear on someone else's site, but this always includes attribution and a link back to the original source. However, if you copy content from another website and publish it natively on your own web property, without attribution or citations...well, that's called plagiarism. And the Digital Millennium Copyright Act (DMCA) provides authors with recourse to have stolen content removed from infringing sites -- if you can establish that you're the rightful copyright owner.

Establishing ownership is a tedious task, especially when someone steals dozens or hundreds of articles. You must provide links to each example of infringing content, along with links to the original authorized content. Fortunately, as I've discussed before, I have ready access to the data about all 17,000+ blog posts that we've published at SAS (see How SAS uses SAS to Analyze SAS Blogs). In this article, I'll show you how I gathered that same information from the infringing websites so that I could file the DMCA "paperwork" almost automatically.

The complete programs from this article are available on GitHub.

Read a JSON feed using the JSON engine

In my experience, the people who steal our blog content don't splurge on fancy custom web sites. They tend to use free or low-cost web site platforms, and the most popular of these include WordPress (operated by Automattic) and Blogspot (operated by Google). Both of these platforms support API-like syndication using feeds.

Blogspot sites can generate article feeds in either XML or JSON. I prefer JSON when it's available, as I find that the JSON libname engine in SAS requires fewer "clues" in order to generate useful tables. (See documentation for the JSON engine.) While you can supply a JSON map file that tells SAS how to assemble your tables and data types, I find it just as easy to read the data as-is and post-process it to join the fields I need and convert data fields. (For an example that uses a JSON map, see Reading data with the SAS JSON libname engine.)
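For completeness, here's a small sketch of the map-file approach. The AUTOMAP= option asks the engine to write the map file for you, and MAP= points to where it should live; the RESP fileref here is the same one created by the PROC HTTP step below.

/* Sketch: have the JSON engine generate a map you can edit and reuse */
filename jmap temp;
libname rss json fileref=resp map=jmap automap=create;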

Since I don't want to draw attention to the specific infringing sites, I'll use an example of a popular (legitimate!) Blogspot site named "Maps Mania". If you're into data and maps (who isn't?) you might like their content. In this code I use PROC HTTP to fetch the RSS feed, using "alt=json" to request JSON format and "max-results=100" to retrieve a larger-than-default batch of published posts.

/* Read JSON feed into a local file. */
/* Use Blogspot parameters to get 100 posts at a time */
filename resp temp;
proc http
 url='https://googlemapsmania.blogspot.com/feeds/posts/default?alt=json&max-results=100'
 method="get"
 out=resp;
run;
 
libname rss json fileref=resp;

This JSON libname breaks the data into a series of tables that relate to each other via common keys.

[Image: RSS feed tables]
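One of those members, ALLDATA, is a generic key-value view of the entire feed. It's a handy way to discover which fields ended up in which table:

/* Peek at the generic key-value view of the feed */
proc print data=rss.alldata(obs=20);
run;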

With a little bit of exploration in SAS Enterprise Guide and the Query Builder, I was able to design a PROC SQL step to assemble just the fields and records I needed: post title and post URL.

/* Join the relevant feed entry items to make a single table */
/* with post titles and URLs */
proc sql;
   create table work.blogspot as 
   select t2._t as rss_title,
          t1.href as rss_href          
      from rss.entry_link t1
           inner join rss.entry_title t2 on (t1.ordinal_entry = t2.ordinal_entry)
      where t1.type = 'text/html' and t1.rel = 'alternate';
quit;
 
libname rss clear;

[Image: RSS output from blogspot]

Read an XML feed using the XMLv2 engine

WordPress sites generate XML-based feeds by default. Site owners can install a WordPress plugin to generate JSON feeds as well, but most sites don't bother with that. Like the JSON feeds, the XML feed can contain many fields that relate to each other. I find that with XML, the best approach is to use the SAS XML Mapper application to explore the XML and "design" the final data tables that you need. You use SAS XML Mapper to create a map file, which you can then feed into the SAS XMLv2 engine to instruct SAS how to read the data. (See documentation for the XMLv2 engine.)

SAS XML Mapper is available as a free download from support.sas.com. Download it as a ZIP file (on Windows), and extract the ZIP file to a temporary folder. Then run setup.exe in the root of that folder to install the app on your system.

To design the map, I use an example of the XML feed from the blog that I want to examine. Once again, I'll choose a popular WordPress blog instead of the actual infringing sites. In this case, let's look at the Star Wars News site. I point my browser at the feed address, https://www.starwars.com/news/feed, and save the result as an XML file. Then, in SAS XML Mapper, I use Open XML (File menu) and examine the result.

I found everything that I needed in the "item" subset of the feed. I dragged that group over to the right pane to include it in the map, which creates a data set container named "item." Then I dragged just the title, link, and pubDate fields into that data set to include them in the final result.

The SAS XML Mapper generates a SAS program that you can include to define the map, and that's what I've done with the following code. It uses a DATA step to create the map file just as I need it.

filename rssmap temp;
data _null_;
 infile datalines;
 file rssmap;
 input;
 put _infile_;
 datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
  <NAMESPACES count="0"/>
  <!-- ############################################################ -->
  <TABLE name="item">
    <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
    <COLUMN name="title">
      <PATH syntax="XPath">/rss/channel/item/title</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>250</LENGTH>
    </COLUMN>
    <COLUMN name="link">
      <PATH syntax="XPath">/rss/channel/item/link</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
    <COLUMN name="pubDate">
      <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>40</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
;
run;

Because WordPress feeds return just the most recent 25 items by default, I need to use the "paged=" directive to go deeper into the archive and return older items. I used a simple SAS macro loop to iterate through 5 pages (125 items) in this example. Note how I specified the XMLv2 libname with the XMLMAP= option to include my custom map. That ensures that SAS will read the XML and build the table as I've designed it.

My final DATA step in this part is to recast the pubDate field (a text field by default) into a proper SAS date.

/* WordPress feeds return data in pages, 25 entries at a time        */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
  filename feed temp;
  proc http
   method="get"
   url="https://www.starwars.com/news/feed?paged=&i."
   out=feed;
  run;
 
  libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
 
  data posts_&i.;
   set result.item;
  run;
  %end;
%mend;
 
%getItems;
 
/* Assemble all pages of entries                       */
/* Cast the date field into a proper SAS date          */
/* Have to strip out the default day name abbreviation */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> 10APR2019      */
data allPosts ;
 set posts_:;
 length sasPubdate 8;
 sasPubdate = input( substr(pubDate,4),anydtdtm.);
 format sasPubdate dtdate9.;
 drop pubDate;
run;

Reporting the results

After gathering the data I need from RSS feeds, I use SAS to match that with the WordPress data that I have about our blogs. I can then generate a table that I can easily submit in a DMCA form.

Usually, matching by "article title" is the easiest method. However, sometimes the infringing site will alter the titles a little bit or even make small adjustments to the body of the article. (This reminds me of my college days in computer science, when struggling students would resort to cheating by copying someone else's program, but change just the variable names. It's a weak effort.) With the data in SAS, I've used several other techniques to detect the "distance" of a potentially infringing post from the original work.
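As a generic illustration (and not my full approach), the COMPGED function scores the edit distance between two strings, which makes fuzzy title matching straightforward. In this sketch, WORK.MY_POSTS is a hypothetical table standing in for the data I keep about our own published posts:

/* Sketch: fuzzy-match scraped titles against our own post titles. */
/* WORK.MY_POSTS (title, url) is a hypothetical stand-in table.    */
proc sql;
   create table work.suspects as 
   select o.title as original_title,
          o.url as original_url,
          b.rss_title as infringing_title,
          b.rss_href as infringing_url,
          compged(upcase(strip(o.title)), upcase(strip(b.rss_title))) as distance
      from work.my_posts o, work.blogspot b
      where calculated distance < 100   /* arbitrary "close enough" threshold */
      order by distance;
quit;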

Maybe you want to see that code. But you can't expect me to reveal all of my secrets, can you?


* props to Robert "I dare you to knock it off" Conrad.

The post Read RSS feeds with SAS using XML or JSON appeared first on The SAS Dummy.

2月 082019
 

Since we added the new "Recommended by SAS" widget in the SAS Support Communities, I often find myself diverted to topics that I probably would not have found otherwise. This is how I landed on this question and solution from 2011 -- "How to convert 5ft. 4in. (Char) into inches (Num)". While the question was deftly answered by my friend (and SAS partner) Jan K. from the Netherlands, the topic inspired me to take it a step further here.

Jan began his response by throwing a little shade on the USA:

Short of moving to a country that has a decent metric system in place, I suggest using a regular expression.

On behalf of my nation I just want say that, for the record, we tried. But we did not get very far with our metrication, so we are stuck with the imperial system for most non-scientific endeavors.

Matching patterns with a regular expression

Regular expressions are a powerful method for finding specific patterns in text. The syntax of regular expressions can be a challenge for those getting started, but once you've solved a few pattern-recognition problems by using them, you'll never go back to your old methods.

Beginning with the solution offered by Jan, I extended this program to read in a "ft. in." measurement, convert it to the component number values, express the total value in inches, and then convert the measurement to centimeters. I know that even with my changes, we can think of patterns that might not be matched. But here's a summary of the updates: the regular expression now captures the feet and inches values in separate groups (and allows decimal inches), inches default to 0 when a measurement lists only feet, the program computes the total in inches and converts it to centimeters, and records that don't match the pattern are flagged with "[NO MATCH]".

Here's my program, followed by the result:

data measure;
 length 
     original $ 25
     feet 8 inches 8 
     total_inches 8 total_cm 8;
 /* constant regex is parsed just once */
 re = prxparse('/(\d*)ft.(\s*)((\d?\.?\d?)in.)?/'); 
 input;
 original = _infile_;
 if prxmatch(re, original) then do;
  feet =   input ( prxposn(re, 1, original), best12.);
  inches = input ( prxposn(re, 4, original), best12.);
  if missing(inches) and not missing(feet) then inches=0;
 end;
 else 
   original = "[NO MATCH] " || original;
 total_inches = (feet*12) + inches;
 total_cm = total_inches * 2.54;
 drop re;
cards;
5ft. 4in.
4ft 0in.
6ft. 10in.
3ft.2in.
4ft.
6ft.     1.5in.
20ft. 11in.
25ft. 6.5in.
Soooooo big
;

Other tools to help with regular expressions

The Internet offers a plethora of online tools to help developers build and test regular expression syntax. Here's a screenshot from RegExr.com, which I used to test some aspects of my program.

Tools like these provide wonderful insight into the capture groups and regex directives that will influence the pattern matching. They are part tutorial, part workbench for regex crafters at all levels.

Many of these tools also include community contributions for matching common patterns. For example, many of us often need to match/parse data with e-mail addresses, phone numbers, and other tokens as data fields. Sites like RegExr.com include the syntax for many of these common patterns that you can copy and use in your SAS programs.
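For example, here's a small sketch that uses one of those common patterns -- a deliberately basic (not RFC-complete) e-mail matcher -- to flag records that contain an e-mail address:

/* Flag rows that contain an e-mail address (basic pattern) */
data check_email;
   length text $ 100;
   infile datalines truncover;
   input text $char100.;
   re = prxparse('/[\w\.\-]+@[\w\-]+(\.[\w\-]+)+/');
   has_email = prxmatch(re, text) > 0;
   drop re;
datalines;
Contact us at support@example.com for help
No address in this line
;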

The post Convert a text-based measurement to a number in SAS appeared first on The SAS Dummy.

1月 182019
 

It seems that everyone knows about GitHub -- the service that hosts many popular open source code projects. The underpinnings of GitHub are based on Git, which is itself an open-source implementation of a source management system. Git was originally built to help developers collaborate on Linux (yet another famous open source project) -- but now we all use it for all types of projects.

There are other free and for-pay services that use Git, like Bitbucket and GitLab. And there are countless products that embed Git for its versioning and collaboration features. In 2014, SAS developers added built-in Git support for SAS Enterprise Guide.

Since then, Git (and GitHub) have grown to play an even larger role in data science operations and DevOps in general. Automation is a key component for production work -- including check-in, check-out, commit, and rollback. In response, SAS has added Git integration to more SAS products, including:

  • the Base SAS programming language, via a collection of SAS functions.
  • SAS Data Integration Studio, via a new source control plugin.
  • SAS Studio (experimental in v3.8).

You can use this Git integration with any service that supports Git (GitHub, GitLab, etc.), or with your own private Git servers and even just local Git repositories.

SAS functions for Git

Git infrastructure and functions were added to SAS 9.4 Maintenance 6. The new SAS functions all have the helpful prefix of "GITFN_" (signifying "Git fun!", I assume). Here's a partial list:

  • GITFN_CLONE: Clones a Git repository (for example, from GitHub) into a directory on the SAS server.
  • GITFN_COMMIT: Commits staged files to the local repository.
  • GITFN_DIFF: Returns the number of diffs between two commits in the local repository and creates a diff record object for the local repository.
  • GITFN_PUSH: Pushes the committed files in the local repository to the remote repository.
  • GITFN_NEW_BRANCH: Creates a Git branch.


The function names make sense if you're familiar with Git lingo. If you're new to Git, you'll need to learn the terms that go with the commands: clone, repo, commit, stage, blame, and more. This handbook provided by GitHub is friendly and easy to read. (Or you can start with this xkcd comic.)

You can get started with just a couple of function calls. For example, this short step checks the version and clones a public repository from GitHub into a local folder:

data _null_;
 version = gitfn_version();
 put version=;             
 
 rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
   "c:\Projects\sas-dummy-blog");
 put rc=;
run;

In one line, this function fetches an entire collection of code files from your source control system. Here's a more concrete example that fetches the code to a work space, then runs a program from that repository. (This is safe for you to try -- here's the code that will be pulled/run. It even works from SAS University Edition.)

options dlcreatedir;
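/* DLCREATEDIR lets the LIBNAME statement below create the folder    */
/* if it doesn't already exist; the CLEAR simply releases the libref */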
%let repoPath = %sysfunc(getoption(WORK))/sas-dummy-blog;
libname repo "&repoPath.";
libname repo clear;
 
/* Fetch latest code from GitHub */
data _null_;
 rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
   "&repoPath.");
 put rc=;
run;
 
/* run the code in this session */
%include "&repoPath./rng_example_thanos.sas";

You could use the other GITFN functions to stage and commit the output from your SAS jobs, including log files, data sets, ODS results -- whatever you need to keep and version.

Using Git in SAS Data Integration Studio

SAS Data Integration Studio has supported source control integration for many years, but only for CVS and Subversion (still in wide use, but they aren't media darlings like GitHub). By popular request, the latest version of SAS Data Integration Studio adds support for a Git plug-in.

[Image: Example of Git in SAS DI Studio]

See the documentation for details, and read more about setup and use in the article available here, as part of our "Custom Tasks Tuesday" series.

Using Git in SAS Enterprise Guide

This isn't new, but I'll include it for completeness. SAS Enterprise Guide has built-in Git repository support for SAS programs that are stored in your project file. You can use this feature without having to set up any external Git servers or repositories. Also, SAS Enterprise Guide can recognize when you reference programs that are managed in an external Git repository. This integration enables features like program history, compare differences, commit, and more. Read more and see a demo of this in action here.

[Image: program history]

If you use SAS Enterprise Guide to edit and run SAS programs that are managed in an external Git repository, here's an important tip. Change your project file properties to "Use paths relative to the project for programs and importable files." You'll find this checkbox in File->Project Properties.

With this enabled, you can store the project file (EGP) and any SAS programs together in Git, organized into subfolders if you want. As long as these are cloned into a similar structure on any system you use, the file paths will resolve automatically.

The post Using built-in Git operations in SAS appeared first on The SAS Dummy.