September 8, 2019
 


Today, September 8th, is International Literacy Day, a day UNESCO has celebrated since 1967 to emphasize the importance of literacy around the world. Here at SAS, we have decided to highlight data literacy, a critical part of our evolving knowledge as data and analytics continue to dominate the way we do business.

SAS Press author Susan Slaughter defines data literacy as, “understanding that data are not dry, dusty, abstract squiggles on a computer screen, but represent living things: people, plants, animals. You know you are fluent in a foreign language when you are comfortable speaking it and can communicate what you want to say. The same is true for data literacy; it is about reaching a level of comfort, about being able to communicate what is important to you, and about seeing the meaning behind the data.

Everyone knows that technology is becoming more and more a part of everyday life. Without data literacy, people become passive recipients; with data literacy, you can actively engage with technology. SAS calls it ‘the power to know’ and that's an accurate description.”

SAS Press has been helping users become more data literate for almost 30 years! The Little SAS Book, which is about to publish its sixth edition, has been helping programmers learn SAS and analyze their data since 1995.

Free SAS Press e-books

To celebrate International Literacy Day and do our part in promoting data literacy, SAS Press would like to share our free e-books on a range of topics related to data analytics. These books cover topics such as text analytics, data management, AI, and machine learning.

Moving to the cloud?

Looking for information on SAS Viya? Download our two new free e-books on Exploring SAS Viya. Both books cover the features and capabilities of SAS Viya. SAS Viya extends the SAS platform to enable everyone – data scientists, business analysts, developers, and executives alike – to collaborate and realize innovative results faster.

Here is a list of our free e-books on SAS Viya:

Exploring SAS® Viya®: Programming and Data Management
This first book in the series covers how to access data files, libraries, and existing code in SAS® Studio. You will also learn about new procedures in SAS Viya, how to write new code, and how to use some of the pre-installed tasks that come with SAS® Visual Data Mining and Machine Learning.

Exploring SAS® Viya®: Visual Analytics, Statistics, and Investigations
Data visualization enables decision-makers to see analytics presented visually so that they can grasp difficult concepts or identify new patterns. This book includes four visualization solutions powered by SAS Viya: SAS Visual Analytics, SAS Visual Statistics, SAS Visual Text Analytics, and SAS Visual Investigator.

Interested in learning more?

As data literacy becomes increasingly important in our daily lives, knowing where to find new information and learning tools becomes critical to innovation and change. To stay up-to-date on new SAS Press books and our new free e-book releases, subscribe to our monthly newsletter.

Celebrating #InternationalLiteracyDay with Free SAS E-books! was published on SAS Users.

September 7, 2019
 

By 2020, 50% of organizations will lack sufficient AI and data literacy skills to achieve business value. – Gartner

What is data literacy?

Data literacy is the ability to read, work with, analyze, and argue with data. – Wikipedia

Data literacy is the ability to derive meaningful information from data, just as literacy in general is the ability to derive information from the written word. – WhatIs.com

Why is it important?

As data and analytics become core to the enterprise, and data becomes an organizational asset, employees must have at least a basic ability to communicate and understand conversations about data. Just as it is a given that employees are now competent in word processing and spreadsheets, the ability to “speak data” will become an integral aspect of most day-to-day jobs.

Gone will be the days when data scientists, analysts, and statisticians are the only ones “speaking data.” Valerie Logan, Senior Director Analyst, Gartner, says workforce data literacy must treat information as a second language. Just as we expect all employees today to have a basic level of computer literacy, use email, and understand spreadsheets, employees will also need to be able to understand and speak basic data.

Chris Hemedinger, author of SAS for Dummies, touched on this in his blog post, A skeptic's guide to statistics in the media. He is old enough to remember when USA Today began publication in the early 1980s. He remembers scanning each edition for the USA Today Snapshots, a mini infographic feature that presented some statistics in a fun and interesting way. “Back then, I felt that these stats made me a little bit smarter for the day. I had no reason to question the numbers I saw, nor did I have the tools, skill, or data access to check their work.”

Chris warns that because “news articles and editorial pieces often use simplified statistics to convey a message or support an argument,” we will need to learn that “statistics in the media should not be accepted at face value.” Learning to analyze and understand data and statistics will become increasingly vital for future generations.

Best-selling SAS Press author Ron Cody cautions that because augmented technology now allows non-programmers to run complex programs that search databases, summarize data, and conduct statistical tests, it is vital that everyone has a basic understanding of the data and analytics behind the results. “With advances in artificial intelligence, we may be able to tell the computer our problem and let it solve it and tell us the answer.” As AI technology advances so quickly, we will all need to understand the data and avoid introducing bias into our models. Misunderstood data can negatively influence AI algorithms or the interpretation of models.

The future

Tom Fisher, Senior Vice President of Business Development at SAS, explains, “the convergence of model management with data management represents one of the most exciting business opportunities of the future. The merging and blending of these two disciplines should enable the elimination of bias that may occur in the collection and aggregation of data.” Initiatives such as MIT’s Data Nutrition Project address the missing step in the model development pipeline, “assessing data sets based on standard quality measures that are both qualitative and quantitative.” As Fisher concludes, “these kinds of approaches are designed to allow consumers of data, as input to models, to have a more complete understanding of the data that’s being ingested. At the end of the day, the goal of these integrated disciplines is to provide greater accuracy and comfort with the result sets that are being delivered by data scientists and data engineers.”

As the Gartner report quoted earlier notes, as organizations become more data-driven, poor data literacy will become an inhibitor to growth. But not everyone wants to be a statistician or data scientist. This is where the analogy to computer literacy breaks down. We don’t all have to have a statistics degree – AI can help. SAS is developing solutions in which AI augments its most sophisticated and powerful products to give everyone data literacy. For example, SAS® Model Manager looks at the data and the problem to suggest models. It can then choose the best model based on the user’s criteria, test the model, and score. Technology to report and explain the results, and even answer questions, is under development – all in natural language! Think of it as a virtual personal assistant who can “speak data” and translate.

While data literacy will become increasingly important, so too will tools to help moderate and translate the data that will continue to drive our enterprises and our lives.

Resources:
Become more data literate with our library of Getting Started with SAS, Statistics, Machine Learning, and Data Management books. Visit SAS Books.

Explore SAS Analytics Industry Solutions at sas.com/industry.

Why we need to learn how to "speak data" in a data-driven future was published on SAS Users.

September 6, 2019
 

A few years ago I shared a method to publish content from SAS to a Slack channel. Since that time, our teams at SAS have gone "all in" on collaboration with Microsoft Office 365, including Microsoft Teams. Microsoft Teams is the Office suite's answer to Slack, and it's not a coincidence that it works in nearly the same way.

The lazy method: send e-mail to the channel

Before I cover the "deluxe" method for sending content to a Microsoft Teams channel, I want to make sure you know that there is a simple method that involves no coding, and no need for APIs. The message experience isn't as nice, but it does the job. You can simply "send e-mail" to the channel. If you're automating output from SAS, it's a simple, well-documented process to send e-mail from a SAS program. (Here's an example from me, using FILENAME EMAIL.)

When you send e-mail to a Microsoft Teams channel, the message notice includes the message subject line, sender, and the first bit of the message content. To see the entire message, you must click on the "View original e-mail" link in the notice. This "downloads" the message to your device so that you can open it with a local tool (such as your e-mail reader, Microsoft Outlook). My team uses this method to receive certain alerts from our communities.sas.com platform. Here's an example:

To get the unique e-mail address for a channel, right-click on the channel name and select Get email address. Any message that you send to that e-mail address will be distributed to the team.
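
As a minimal sketch of this approach, the following step writes a short status message to a channel by using the FILENAME EMAIL access method. The channel address, subject line, and message text are placeholders invented for illustration. The step also assumes that your SAS session is already configured to send e-mail (for example, through the EMAILSYS and EMAILHOST system options).

```sas
/* Hypothetical address: replace with the value from "Get email address" */
filename teams email
   to="your-channel-address@example.com"
   subject="Nightly job status from SAS";

data _null_;
   file teams;                                  /* each PUT writes one line of the message body */
   put "The nightly job completed.";
   put "Sent automatically from a SAS program.";
run;

filename teams clear;
```

The message is sent when the DATA step finishes, so you can add more PUT statements (or data-driven logic) to build a longer message body.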

Getting started with a Microsoft Teams webhook

In order to provide a richer, more integrated experience with Microsoft Teams, you can publish content using a webhook. A webhook is a REST API endpoint that allows you to post messages and notifications with more control over the appearance and interactive options within the messages. In SAS, you can publish to a webhook by using PROC HTTP.

To get started, you need to add and configure a webhook for your Microsoft Teams channel:

  1. Right-click on the channel name and select Connectors.
  2. Microsoft Teams offers built-in connectors for many different applications. To find the connector for Incoming Webhook, use the search field to narrow the list. Then click Add to add the connector to the channel.
  3. You must grant certain permissions to the connector to interact with your channel. In this case, you need to allow the webhook to send messages and notifications. Review the permissions and click Install.
  4. On the Configuration page, assign a name to this connector and optionally customize the image. The image will be the avatar that's used when the connector posts content to the channel. When you've completed these changes, select Create.
  5. The connector generates a unique (and very long) URL that serves as the REST API endpoint. You can copy the URL from this field -- you will need it later in your SAS program. You can always come back to these configuration settings to change the connector avatar or re-copy the URL.

    At this point, it's a good idea to test that you can publish a basic message from SAS. The "payload" for a Teams message is a JSON-formatted structure, and you can find examples in the Microsoft Teams reference doc. Here's a SAS program that publishes the simplest message. Add your webhook URL and run the code to verify the connector is working for your channel.

    filename resp temp;
    options noquotelenmax;
    proc http
      /* Substitute your webhook URL here */
      url="https://outlook.office.com/webhook/your-unique-webhook-address-it-is-very-long"
      method="POST"
      in=
      '{
          "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
          "type": "AdaptiveCard",
          "version": "1.0",
          "summary": "Test message from SAS",
          "text": "This message was sent by **SAS**!"
      }'
      out=resp;
    run;

    If successful, this step will post a simple message to your Teams channel:
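
You can also check the result from the SAS side. As a sketch, assuming SAS 9.4 Maintenance 5 or later (where PROC HTTP sets the SYS_PROCHTTP_STATUS_CODE and SYS_PROCHTTP_STATUS_PHRASE macro variables), the following lines, run right after the PROC HTTP step above, write the HTTP status and the response body to the log:

```sas
/* Assumes the PROC HTTP step above has just run and the RESP fileref is still assigned */
%put NOTE: HTTP status: &SYS_PROCHTTP_STATUS_CODE &SYS_PROCHTTP_STATUS_PHRASE;

/* Echo whatever the webhook returned to the SAS log */
data _null_;
   infile resp;
   input;
   put _infile_;
run;
```

A status code in the 200 range means the endpoint accepted the request. Anything else usually points to a problem with the URL or the JSON payload.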

    Design a message card for Microsoft Teams

    Now that we have the basic plumbing working, it's time to add some bells and whistles. Microsoft Teams calls these notifications "message cards", which are messages that can include interactive features such as images, data, action buttons, and more.

    Designing a simple message

    Microsoft Teams supports a large palette of building blocks (expressed in JSON) to create different card experiences. You can experiment with these cards in the MessageCard Playground that Microsoft hosts. The tool provides templates for several card varieties, and you can edit the JSON definitions to tweak and design your own.

    For one of my use cases, I designed a simple card to show the status of our recommendation engine on SAS Support Communities. (Read this article for more information about how we built and monitor the recommendation engine.) The engine runs as a service and is accessed with its own API. I wanted a periodic "health check" to post to our internal team that would alert us to any problems. Here's the JSON that I used in the MessageCard Playground to design it.

    Much of the JSON is boilerplate for the message. I drew the green blocks to indicate the areas that need to be dynamic -- that is, replaced with values from the real-time API call. Here's what the card looks like when rendered in the Microsoft Teams channel.

    Since my API call to the recommendation engine service creates a data set, I can run that data through PROC JSON to create the JSON segment I need:

    /* reading the results from my API call to the engine */
    libname results json fileref=resp;
     
    /* Prep a simple name-value data set with the results */
    data segment (keep=name value);
     set results.root;
     name="Score data updated (UTC)";
     value= astore_creation;
     output;
     name="Topics scored";
     value=left(num_topics);
     output;
     name="Number of users";
     value= left(num_users);
     output;
     name="Process time";
     value= process_time;
     output;
    run;
     
    /* use PROC JSON to create the segment */
    filename segment temp;
    proc json out=segment nosastags pretty;
     export segment;
    run;

    I shared a version of the complete program on GitHub. It should run as is -- but you would need to supply your own webhook endpoint for a channel that you can publish to.
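
The GitHub program is the reference for the full flow. As a rough sketch of the remaining step (not necessarily how that program does it), the following code wraps the SEGMENT file created above in a minimal card definition and posts it. The card title and summary text, the CARD and CARDRESP filerefs, and the choice of the MessageCard "facts" layout are assumptions for illustration; substitute your own webhook URL.

```sas
/* Sketch: wrap the name-value JSON segment in a minimal MessageCard */
filename card temp;
data _null_;
   file card;
   infile segment end=eof;          /* SEGMENT was written by PROC JSON above */
   if _n_ = 1 then do;              /* opening boilerplate, written once */
      put '{"@type": "MessageCard",';
      put ' "@context": "http://schema.org/extensions",';
      put ' "summary": "Recommendation engine health check",';
      put ' "sections": [{"activityTitle": "Recommendation engine health check",';
      put '   "facts":';
   end;
   input;
   put _infile_;                    /* copy the JSON array of name-value pairs */
   if eof then put ' }]}';          /* close the section, the array, and the card */
run;

/* Post the assembled card; CARDRESP is a separate fileref for the reply */
filename cardresp temp;
proc http
   url="https://outlook.office.com/webhook/your-unique-webhook-address-it-is-very-long"
   method="POST"
   ct="application/json"
   in=card
   out=cardresp;
run;
```

Because PROC JSON with NOSASTAGS exports the data set as an array of objects with "name" and "value" members, the segment can be dropped in as the value of "facts" without further editing.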

    Design a message with actions

    I also use Microsoft Teams to share updates about the SAS Software GitHub organization. In a previous article I discussed how I use GitHub APIs to gather data from the GitHub service. Each day, my program summarizes the recent activity from github.com/sassoftware and publishes a message card to the team. Here's an example of a daily update:

    This card is fancier than my first example. I added action buttons that can direct the team members to the internal reports for more details and to the GitHub site itself. I used the Microsoft Teams documentation and the MessageCard Playground to design the experience:

    Messaging apps as part of a DevOps strategy

    Like many organizations, we (SAS) invest a considerable amount of time and energy into gathering metrics and building reports about our operations. However, reports are useful only when the intended audience is tuned in and refers to them regularly. With a small additional step, you can use SAS to bring your most interesting data forward to your team -- automatically.

    Whether you use Microsoft Teams or Slack, automated alerting and updates are a great opportunity to keep your teams informed. Each of these tools offers fit-for-purpose connectors that can tie in with information from other popular operational systems (Salesforce, GitHub, Yammer, JIRA, and many more). For cases where a built-in connector is not available, the webhook approach allows you to easily create your own.

The post How to publish to a Microsoft Teams channel using SAS appeared first on The SAS Dummy.

September 5, 2019
 

When you order an item online, the website often recommends other items based on your purchase. In fact, these kinds of "recommendation engines" contributed to the early success of companies like Amazon and Netflix. SAS uses a recommender engine to suggest articles on the SAS Support Communities. Although recommender engines use many techniques, one technique that estimates the similarity of items is the cosine similarity. You can use the cosine similarity to compare songs, documents, articles, recipes, and more.

This blog post demonstrates how to compute and visualize the similarities among recipes. For simplicity, only the ingredients in the recipes are used. You can extend the example by using the quantities of ingredients or by including other variables that describe the recipe, such as how the recipe is cooked (skillet, oven, grill, etc.).

The data: Ingredients for recipes

I looked up the ingredients for six recipes: spaghetti sauce, a spaghetti sauce with meat, an Italian eggplant relish (called caponata), a creole sauce, a salsa, and an enchilada sauce. Three recipes are Italian, one is creole, and two are Mexican. Four are sauces and two (salsa and eggplant relish) are typically appetizers. All have tomatoes (fresh, canned, or paste). The following DATA step defines the ingredients in each recipe by using a binary indicator variable. The 34 variables are the ingredients (tomato, garlic, salt, and so forth) and the six rows are the recipes.

data recipes;
   input Recipe $ 1-20
      (Tomato Garlic Salt Onion TomatoPaste OliveOil Celery Broth 
       GreenPepper Cumin Flour BrownSugar BayLeaf GroundBeef 
       BlackPepper ChiliPowder Cilantro Carrot CayennePepper Oregano 
       Oil Parsley PorkSausage RedPepper Paprika Thyme Tomatillo 
       JalapenoPepper WorcestershireSauce Lime
       Eggplant GreenOlives Capers Sugar) (1.);
datalines;
Spag Sauce          1111110000000000000101000000000000
Spag Meat Sauce     1111111010001100010000110000000000
Eggplant Relish     0111110000000000000000000000001111
Creole Sauce        1011111110000010000000001100100000
Salsa               1111000000000000100000000011010000
Enchilada Sauce     1000000101110001001010000000000000
;

If you look carefully at the Eggplant Relish and Enchilada Sauce rows, you will see that they have no ingredients in common. Therefore, those recipes should be the most dissimilar.

The cosine similarity of recipes

In a previous article, I showed that you can use PROC DISTANCE in SAS to compute the cosine similarity of rows. Alternatively, you can use the SAS/IML language to define a function that computes the cosine similarity. The following SAS/IML program loads the function and creates a heat map of the similarity matrix for these six recipes:

proc iml;
use Recipes; read all var _NUM_ into X[c=varNames r=Recipe]; close;
 
load module=(CosSimRows);     /* load module from previous post */
cosSim = CosSimRows(X);
Colors = palette('BRBG', 7);  /* brown-blue-green color ramp */
call heatmapcont(cosSim) xvalues=Recipe yvalues=Recipe range={0 1} 
            colorramp=Colors title="Cosine Similarity of Recipes";
print CosSim[F=4.2 r=Recipe c=Recipe];
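
The CosSimRows module comes from the earlier post and is not repeated here. As a minimal sketch of what such a function computes (not the module from that post, and assuming that X has no missing values and no rows that are entirely zero), you can scale each row to unit length and form the pairwise inner products. The module name CosSimRowsSketch and the small test matrix are made up for illustration.

```sas
proc iml;
/* Sketch: cosine similarity between the rows of X.
   Assumes no missing values and no all-zero rows. */
start CosSimRowsSketch(X);
   rowNorm = sqrt(X[ ,##]);     /* Euclidean norm of each row (column vector) */
   Y = X / rowNorm;             /* scale each row to unit length */
   return( Y * Y` );            /* (i,j) element = cosine of angle between rows i and j */
finish;

/* Tiny example: rows 1 and 2 share one ingredient; rows 1 and 3 share none */
X = {1 1 0 0,
     1 0 1 0,
     0 0 1 1};
S = CosSimRowsSketch(X);
print S[format=5.2];
quit;
```

For this small matrix, each row has two nonzero entries, so a pair that shares one ingredient has similarity 1/2 and the pair that shares none has similarity 0. The diagonal is always 1.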

The heat map and printed table show the similarity scores for the six recipes. The following list summarizes the similarity matrix, which is based solely on the presence or absence of ingredients:

  • The ingredients in the spaghetti sauce recipe are most similar to the meat sauce and the eggplant relish. They are slightly less similar to the creole sauce and salsa. They are not similar to the enchilada sauce ingredients.
  • The ingredients in the spaghetti meat sauce are most similar to the spaghetti sauce and the creole sauce. They are slightly less similar to the eggplant relish and salsa. They are not similar to the enchilada sauce ingredients.
  • The ingredients in the eggplant relish are similar to both spaghetti sauces. The recipe has zero similarity to the enchilada sauce recipe because they do not share any common ingredients.
  • The ingredients in the creole sauce are similar to both spaghetti sauces.
  • The salsa is most similar to the spaghetti sauce.
  • The enchilada sauce recipe is not similar to any of the other recipes.

Of course, these results depend on what ingredients you use for the recipes. When I planned this study, I expected the salsa and enchilada sauce to be similar, but it turned out that the recipes had only one ingredient in common (tomatoes).

If you do not have very many items to compare, you can use a bar chart to visualize the pairwise similarities. The result is shown below. Even if you have too many items to display the full set of pairwise similarities, you could use a WHERE clause (such as WHERE CosSim > 0.5) to see only the most similar pairs of items.

The bar chart is sorted by the cosine similarity, so it is easy to see the very similar and very dissimilar pairs.
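
The program that creates the bar chart is in the download. As one possible sketch (not necessarily the code behind the chart), you can write the 15 pairwise similarities to a data set in long form and plot them with PROC SGPLOT. The data set name Pairs, the variable names Pair and CosSim, and the inline similarity computation are assumptions for illustration; the step assumes the Recipes data set defined earlier, with no all-zero rows.

```sas
proc iml;
use Recipes; read all var _NUM_ into X[r=Recipe]; close;

Y = X / sqrt(X[ ,##]);                 /* scale each row to unit length */
S = Y * Y`;                            /* cosine similarity matrix */

idx = allcomb(nrow(S), 2);             /* every pair of distinct rows */
CosSim = S[ sub2ndx(dimension(S), idx) ];   /* similarity for each pair */
Pair = strip(Recipe[idx[,1]]) + " / " + strip(Recipe[idx[,2]]);

create Pairs var {"Pair" "CosSim"};    /* long form: one row per pair */
append;
close Pairs;
quit;

proc sgplot data=Pairs;
   /* add a WHERE statement, such as WHERE CosSim > 0.5, to show only similar pairs */
   hbar Pair / response=CosSim categoryorder=respdesc;
   xaxis label="Cosine Similarity" grid;
   yaxis display=(nolabel);
run;
```

Because the similarity matrix is symmetric, each pair needs to appear only once, which is why the sketch uses combinations of rows.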

Conclusions

A recommendation engine can use calculations like this to suggest additional recipes that are similar to those that you like. Did you like the spaghetti sauce recipe? Try the eggplant relish! Replace "recipe" with "movie," "book," or "product" and you begin to see how recommender engines work.

An advantage of the cosine similarity is that it preserves the sparsity of the data matrix. The data matrix for these recipes has 204 cells, but only 58 (28%) of the cells are nonzero. If you add additional recipes, the number of variables (the union of the ingredients) might climb into the hundreds, but a typical recipe has only a dozen ingredients, so most of the cells in the data matrix are zero. When you compare two recipes, only the common ingredients contribute to the cosine similarity score. This means that you can compute the cosine similarity very efficiently, and it requires making only a single pass through the data.

You can also compute the cosine similarity of the columns, which tells you which pairs of ingredients appear together in recipes. As you might imagine, olive oil, garlic, and onions are similar to each other because they often appear in recipes together. Cumin and green olive? Not very similar.

You can download the SAS program that computes the cosine similarities and creates the graphs in this article.

The post Use cosine similarity to make recommendations appeared first on The DO Loop.

September 4, 2019
 

Editor's Note: SAS' Evan Mann contributed to this post.

First came SAS' reputation as a great place to work. Next came a storytellers article series offering a glimpse of the people behind the brand.

Now there's the SAS Users YouTube Channel, where tutorial videos provide a window into some of our personalities—trainers initially, but future contributors will include experts from other areas of SAS.

"Our goal is to make the SAS Users channel on YouTube a ‘must-visit’ channel for those who have the desire to grow their analytical skill set and to learn SAS," said Principal IT Software Developer Michael Penwell, who manages the channel. "It's for both new and experienced users."

My SAS Support Communities teammates Anna Brown and Thiago De Souza produce the videos and Karen Feldman, Senior Technical Learning and Development Specialist, lines up the on-camera talent. Brown explains the idea behind this approach: “Infusing personality into these how-to videos welcomes users into our SAS family in a way we never have before. With this no-script, natural style, we hope that they feel like they’re not being told how to do something in SAS – but rather they’re learning right along with us, together. We’re simply a guide to help users achieve whatever they can dream up in SAS.”

A casual, conversational approach

As Brown describes, instructors walk viewers through the subject matter using a casual, conversational tone - almost vlog style. This approach lets their personalities come through.

Anna Yarbrough, who joined SAS Education last year, travels to teach SAS programming. She likes interacting with SAS users IRL and relishes a good coding challenge. She keeps an eye out for users' questions on her videos and enjoys the digital banter. When this video on merging data sets was published, it received the most user comments right off the bat. Why so much interest in merging data?

"When working with structured data," said Yarbrough, "it’s never going to all be stored in one massive data set. There will be smaller tables that can be linked together based on a common column or columns like social security number, or patient number, or something else that uniquely identifies each row of data."

"When you can bring these tables together," Yarbrough added, "you have the ability to answer more complex business questions. I think people are really interested in covering this topic, as well as converting columns from character to numeric, just because it is something that needs to be done all the time."

While building out the videos, Yarbrough recalled questions she had when she first learned SAS. (Don't be surprised if they become topics for future tutorials.) "It's really important as an instructor to be able to put yourself in the shoes of our newer SAS programmers," she said. "Even though we spend so much time covering this content, we need to remember what it’s like to hear about these topics for the first time so that we can present material in a clear and easy-to-follow way. It’s been awesome interacting with users through YouTube."

Check out the first few


Ari Zitin: Decision Trees and Neural Networks

Kathy Kiraly: SAS to Excel

Kathy Kiraly: Excel to SAS

Anna Rakers: Online Resources

Mark Stevens: Certification Tips

Mark Stevens: Performance-Based Questions

Peter Styliadis: Two Common Scenarios

Anna Yarbrough: Character Functions

Anna Yarbrough: Merge Data Sets

 

Subscribe to the channel

Subscribe, like, and share the SAS Users YouTube channel to get notified of new videos and help fellow SAS users find them.

Ready to Learn SAS? It's Time to Meet Your Instructors!

YouTube.com/SASusers: a pop of personality was published on SAS Users.

September 4, 2019
 

Editor’s note: This article is a continuation of the series by Conor Hogan, a Solutions Architect at SAS, on SAS and database and storage options on cloud technologies. Access all the articles in the series here.

In a previous article in this series, Accessing Databases in the Cloud – SAS Data Connectors and Amazon Web Services, I covered SAS and database as a service (DBaaS) and storage offerings from Amazon Web Services (AWS). Today, I cover the various storage options available on AWS and how to connect to and interact with them from SAS.

Object Storage

Amazon Simple Storage Service (S3) is low-cost, scalable cloud object storage for any type of data in its native format. Individual Amazon S3 objects can range in size from 1 byte all the way to 5 terabytes (TB). Amazon S3 organizes these objects into buckets, and each bucket name is globally unique. You access the bucket directly through an API from anywhere in the world, if granted permissions. By default, a new bucket grants the least access. Amazon advertises 11 9’s, or 99.999999999%, of durability, which means that losing an object is extremely unlikely. Data replicates automatically across availability zones to meet this durability. You can reduce the number of replicas or use one of the various tiers of archive services to reduce your object storage cost. Costs are calculated based on the amount of storage used per month, with added costs for requests and transfers of data.

SAS and S3

Support for Amazon Web Services S3 as a Caslib data source for SAS Cloud Analytic Services (CAS) was added in SAS Viya 3.4. This data source enables you to access SASHDAT files and CSV files in S3. You can use the CASLIB statement or the table.addCaslib action to add a Caslib for S3. SAS is currently exploring native object storage integration with AWS S3 for more file types. For other file types you can copy the data from S3 and then use a SAS Data Connector to load the data into memory. For example, if I had Excel data in S3, I could use PROC S3 to copy the data locally and then load the data into CAS using the SAS Data Connector to PC Files.
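The CASLIB approach described above might look roughly like the following sketch. The session name, caslib name, bucket, object path, file name, and credentials are all placeholders, and the data source option names reflect my reading of the SAS Viya 3.4 documentation, so verify them against the documentation for your release before relying on them.

```sas
/* Sketch only: all names, keys, and paths below are hypothetical placeholders. */
/* Assumes the CASHOST= and CASPORT= options are already set for your site.     */
cas mysess;

/* Add a caslib that points at a folder ("object path") in an S3 bucket */
caslib s3data datasource=(
   srctype="s3",
   accesskeyid="YOUR_ACCESS_KEY_ID",
   secretaccesskey="YOUR_SECRET_ACCESS_KEY",
   region="US_East",
   bucket="my-example-bucket",
   objectpath="/data/"
);

/* Load a CSV file from S3 into memory and list the in-memory tables */
proc casutil incaslib="s3data" outcaslib="s3data";
   load casdata="sales.csv" casout="sales";
   list tables;
quit;
```

In practice, you would likely keep the keys out of the program itself, for example by pointing the caslib at an AWS credentials file instead of hard-coding them.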

Block Storage

Amazon Elastic Block Store (EBS) is the block storage service designed for use with Amazon Elastic Compute Cloud (EC2). The storage is accessible only when it is attached to an operating system. Storage volumes can be treated as an independent disk drive controlled by a server operating system. You would mount an EBS volume to an operating system as if it were a physical disk. EBS volumes are valuable because they are the storage that will persist when you terminate your compute instance. You can choose from four different volume types that supply performance levels at corresponding costs.

SAS and EBS

EBS is used as the permanent SAS data storage and persists through a restart of your SAS environment. The performance choices made when selecting from the different EBS volume types will have a direct impact on the performance that you get from SAS. One thing to consider is using compute instances that have enhanced EBS performance or dedicated solid state drive instance storage. For example, the SAS Viya on AWS QuickStart uses Storage Optimized and Memory Optimized compute instances with local NVMe-based SSDs that are physically connected to the host server. This instance storage is coupled to the lifetime of the instance, which is beneficial for performance.

SAS Cloud Analytic Services (CAS) is an in-memory server that relies on the CAS Disk Cache as the virtual memory storage backend. This is especially true if you are reading data from a database. In this case, make sure you have enough block storage, in the form of EBS volumes, for use as the CAS Disk Cache.

File Storage

Amazon Elastic File System (EFS) provides access to data through a shared file system. EFS is an elastic network file system that grows and shrinks as you add or remove files, so you only pay for the storage you consume. Users create, delete, modify, read, and write files organized logically in a directory structure for intuitive access. This allows simultaneous access for multiple users to a common set of file data managed with user and group permissions. Amazon FSx for Lustre is the high-performance file system service.

SAS and EFS

EFS shared file system storage can be a powerful tool if utilizing a SAS Grid architecture. If you have a requirement in your SAS architecture for a shared location that any node in a group can access and write to, then EFS could meet your requirement. To access the data stored in your network file system you will have to mount the EFS file system. You can mount your Amazon EFS file systems to any EC2 instance, or any on-premises server connected to your Amazon VPC.

BONUS: Serverless

Amazon Athena is a query service for Amazon S3. This service makes it easy to submit queries against the objects stored in S3. You can run analysis on this data using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. Amazon Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet.

SAS and Athena

Amazon Athena is ODBC/JDBC compliant, which means I can use SAS/ACCESS Interface to ODBC or SAS/ACCESS Interface to JDBC to connect using SAS. Download an Amazon Athena ODBC driver and submit code from SAS just like you would for any ODBC data source. Athena is a great tool if you want to use the serverless computing power of Amazon to query data in S3.
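As a minimal sketch of the ODBC route, the following assumes that the Athena ODBC driver is installed and that a data source name has already been configured with your region, S3 output location, and credentials. The DSN name, schema, and table name are hypothetical.

```sas
/* Sketch only: "AthenaDSN", the schema, and the table name are placeholders. */
/* Assumes SAS/ACCESS Interface to ODBC is licensed and the DSN is defined.   */
libname athena odbc dsn="AthenaDSN" schema="default";

/* Query an Athena table through the libref, like any other ODBC source */
proc sql;
   select count(*) as n_rows
   from athena.web_logs;
quit;

libname athena clear;
```

Because Athena charges per query, it can be worth subsetting the data in the query itself rather than pulling entire tables into SAS.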

Finally

Many times, we do not have a choice in the technologies we use or the infrastructure on which they sit. Luckily, if you use AWS, integration with SAS is not a concern. I’ve now covered databases and storage for AWS. In future articles, I’ll cover the same topics for Microsoft Azure and Google Cloud Platform.

Storage in the Cloud – SAS and Amazon Web Services was published on SAS Users.

September 3, 2019
 

The startup ecosystem is dynamic and the flow of venture capital into tech is at an all-time high. Billions of dollars are invested in tech startups every year. Many tech startups market themselves as ‘powered by AI’ and pitch investors with buzzword laden phrases such as, ‘we leverage state of [...]

7 ways SAS empowers startups with artificial intelligence and machine learning was published on SAS Voices by Avinash Sooriyarachchi

September 3, 2019
 

In part one of this blog series, we introduced hybrid marketing as a method that combines both direct and digital marketing capabilities while absorbing insights from machine learning. In part two, we will share perspectives on: How SAS Customer Intelligence 360 completes analytic's last mile. How campaign management processes can easily [...]

SAS Customer Intelligence 360: Hybrid marketing and analytic's last mile [Part 2] was published on Customer Intelligence Blog.

September 3, 2019
 

An important application of the dot product (inner product) of two vectors is to determine the angle between the vectors. If u and v are two vectors, then
cos(θ) = (u ⋅ v) / (|u| |v|)
You could apply the inverse cosine function if you wanted to find θ in [0, π], but since the cosine function is a monotonic decreasing transformation on [0, π], it is usually sufficient to know the cosine of the angle between two vectors. The expression (u ⋅ v) / (|u| |v|) is called the cosine similarity between the vectors u and v. It is a value in [-1, 1]. This article discusses the cosine similarity, why it is useful, and how you can compute it in SAS.

What is the cosine similarity?

For multivariate numeric data, you can compute the cosine similarity of the rows or of the columns. The cosine similarity of the rows tells you which subjects are similar to each other. The cosine similarity of the columns tells you which variables are similar to each other. For example, if you have clinical data, the rows might represent patients and the variables might represent measurements such as blood pressure, cholesterol, body-mass index, and so forth. The row similarity compares patients to each other; the column similarity compares vectors of measurements.

Vectors that are very similar to each other have a cosine similarity that is close to 1. Vectors that are nearly orthogonal have a cosine similarity near 0. Vectors that point in opposite directions have a cosine similarity of –1. However, in practice, the cosine similarity is often used on vectors that have nonnegative values. For those vectors, the angle between them is never more than 90 degrees and so the cosine similarity is between 0 and 1.

To illustrate how to compute the cosine similarity in SAS, the following statements create a simple data set that has four observations and two variables. The four row vectors are plotted:

data Vectors;
length Name $1;
input Name x y;
datalines;
A  0.5 1
B  3   5
C  3   2.8
D  5   1
;
 
ods graphics / width=400px height=400px;
title "Four Row Vectors";
proc sgplot data=Vectors aspect=1;
   vector x=x y=y / datalabel=Name datalabelattrs=(size=14);
   xaxis grid;  yaxis grid;
run;
Four vectors. Find the cosine similarity.

The cosine similarity of observations

If you look at the previous graph of vectors and think that vector A is unlike the other vectors, then you are using the magnitude (length) of the vectors to form that opinion. The cosine similarity does not use the magnitude of the vectors to decide which vectors are alike. Instead, it uses only the direction of the vectors. You can visualize the cosine similarity by plotting the normalized vectors that have unit length, as shown in the next graph.

Four normalized vectors. The cosine similarity is the cosine of the angle between these vectors.
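One way to produce a plot of the normalized vectors is sketched below. The data set name UnitVectors and the variable names len, ux, and uy are my own choices for illustration. The idea is simply to divide each row vector by its Euclidean length.

```sas
/* Scale each row vector of the Vectors data set to unit length */
data UnitVectors;
   set Vectors;
   len = sqrt(x**2 + y**2);   /* Euclidean length of (x, y) */
   ux = x / len;              /* components of the unit vector */
   uy = y / len;
run;

title "Four Normalized Vectors";
proc sgplot data=UnitVectors aspect=1;
   vector x=ux y=uy / datalabel=Name datalabelattrs=(size=14);
   xaxis grid;  yaxis grid;
run;
```

After normalization, all four arrowheads lie on the unit circle, so only the angles between them remain to distinguish the vectors.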

The second graph shows the vectors that are used to compute the cosine similarity. For these vectors, A and B are most similar to each other. Vector C is more similar to B than to D, and D is least similar to the others. You can use PROC DISTANCE and the METHOD=COSINE option to compute the cosine similarity between observations. If your data set has N observations, the result of PROC DISTANCE is an N x N matrix of cosine similarity values. The (i,j)th value is the similarity between the ith vector and the jth vector.

proc distance data=Vectors out=Cos method=COSINE shape=square;
   var ratio(_NUMERIC_);
   id Name;
run;
proc print data=Cos noobs; run;
Cosine similarity between four vectors

The results of the DISTANCE procedure confirm what we already knew from the geometry. Namely, A and B are most similar to each other (cosine similarity of 0.997), C is more similar to B (0.961) than to D (0.85), and D is not very similar to the other vectors (similarities range from 0.61 to 0.85).

Notice that the cosine similarity is not a linear function of the angle between vectors. The angle between vectors B and A is 4 degrees, which has a cosine of 0.997. The angle between vectors B and C is almost four times as big (16 degrees) but the cosine similarity is 0.961, which is not very different from 0.997 (and certainly not four times as small). Notice also that the graph of the cosine function is flat near θ=0. Consequently, the cosine similarity does not vary much between the vectors in this example.
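The two values quoted above can be checked by hand from the data, using A = (0.5, 1), B = (3, 5), and C = (3, 2.8). The angles are rounded to one decimal place.

```latex
\[
\cos\theta_{AB} = \frac{(0.5)(3) + (1)(5)}{\sqrt{0.5^2 + 1^2}\,\sqrt{3^2 + 5^2}}
                = \frac{6.5}{\sqrt{1.25}\,\sqrt{34}} \approx 0.997,
\qquad \theta_{AB} \approx 4.4^\circ
\]
\[
\cos\theta_{BC} = \frac{(3)(3) + (5)(2.8)}{\sqrt{3^2 + 5^2}\,\sqrt{3^2 + 2.8^2}}
                = \frac{23}{\sqrt{34}\,\sqrt{16.84}} \approx 0.961,
\qquad \theta_{BC} \approx 16.0^\circ
\]
```

The angle grows by a factor of almost four, yet the cosine drops by less than 4%, which is the nonlinearity described in the paragraph above.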

Compute cosine similarity in SAS/IML

SAS/STAT does not include a procedure that computes the cosine similarity of variables, but you can use the SAS/IML language to compute both row and column similarity. The following PROC IML statements define functions that compute the cosine similarity for rows and for columns. To support missing values in the data, the functions use listwise deletion to remove any rows that contain a missing value. All functions are saved for future use.

/* Compute cosine similarity matrices in SAS/IML */
proc iml;
/* exclude any row with a missing value */
start ExtractCompleteCases(X);
   if all(X ^= .) then return X;
   idx = loc(countmiss(X, "row")=0);
   if ncol(idx)>0 then return( X[idx, ] );
   else                return( {} ); 
finish;
 
/* compute the cosine similarity of columns (variables) */
start CosSimCols(X, checkForMissing=1);
   if checkForMissing then do;    /* by default, check for missing and exclude */
      Z = ExtractCompleteCases(X);
      Y = Z / sqrt(Z[##,]);       /* stdize each column */
   end;
      else Y = X / sqrt(X[##,]);  /* skip the check if you know all values are valid */
   cosY = Y` * Y;                 /* pairwise inner products */
   /* because of finite precision, elements could be 1+eps or -1-eps */
   idx = loc(cosY> 1); if ncol(idx)>0 then cosY[idx]= 1; 
   idx = loc(cosY<-1); if ncol(idx)>0 then cosY[idx]=-1; 
   return cosY;
finish;
 
/* compute the cosine similarity of rows (observations) */
start CosSimRows(X);
   Z = ExtractCompleteCases(X);  /* check for missing and exclude */
   return T(CosSimCols(Z`, 0));  /* transpose and call CosSimCols */
finish;
store module=(ExtractCompleteCases CosSimCols CosSimRows);
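As a quick check of the stored modules, the following sketch starts a new PROC IML session, loads the modules, and applies CosSimRows to the four-observation Vectors data set from earlier. The matrix name cosR is my own choice. The printed matrix should agree with the PROC DISTANCE output.

```sas
/* Verify the row-similarity module on the small Vectors data set */
proc iml;
load module=(ExtractCompleteCases CosSimCols CosSimRows);

use Vectors;
read all var {x y} into X[rowname=Name];   /* 4 x 2 data matrix; Name holds the labels */
close Vectors;

cosR = CosSimRows(X);                       /* 4 x 4 matrix of cosine similarities */
print cosR[rowname=Name colname=Name format=5.3];
quit;
```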

Visualize the cosine similarity matrix

When you compare k vectors, the cosine similarity matrix is k x k. When k is larger than 5, you probably want to visualize the similarity matrix by using heat maps. The following DATA step extracts two subsets of vehicles from the Sashelp.Cars data set. The first subset contains vehicles that have weak engines (low horsepower) whereas the second subset contains vehicles that have powerful engines (high horsepower). You can use a heat map to visualize the cosine similarity matrix between these vehicles:

data Vehicles;
   set sashelp.cars(where=(Origin='USA'));
   if Horsepower < 140 OR Horsepower >= 310;
run;
proc sort data=Vehicles; by Type; run;  /* sort obs by vehicle type */
 
ods graphics / reset;
proc iml;
load module=(CosSimCols CosSimRows);
use Vehicles;
read all var _NUM_ into X[c=varNames r=Model]; 
read all var {Model Type}; close;
 
labl = compress(Type + ":" + substr(Model, 1, 10));
cosRY = CosSimRows(X);
call heatmapcont(cosRY) xvalues=labl yvalues=labl title="Cosine Similarity between Vehicles";
Cosine similarity between attributes of 20 vehicles

The heat map shows the cosine similarities of the attributes of 20 vehicles. Vehicles of a specific type (SUV, sedan, sports car,...) tend to be similar to each other and different from vehicles of other types. The four sports cars stand out. They are very dissimilar to the sedans, and they are also dissimilar to the SUVs and wagons.

Notice the small range for the cosine similarity values. Even the most dissimilar vehicles have a cosine similarity of 0.99.

Cosine similarity of columns

You can treat each row of data as a vector of dimension p. Similarly (no pun intended!), you can treat each column as a vector of length N. You can use the CosSimCols function, defined in the previous section, to compute the cosine similarity matrix of numerical columns. The math is the same but is applied to the transpose of the data matrix. To demonstrate the function, the following statements compute and visualize the column similarity for the Vehicles data. There are 10 numerical variables in the data.

cosCY = CosSimCols(X);
call heatmapcont(cosCY) xvalues=varNames yvalues=varNames title="Cosine of Angle Between Variables";
Cosine similarity between variables in the Vehicles data set

The heat map for the columns is shown. You can see that the MPG_City and MPG_Highway variables are dissimilar to most other variables, but similar to each other. Other sets of similar variables include the variables that measure cost (MSRP and Invoice), the variables that measure power (EngineSize, Cylinders, Horsepower), and the variables that measure size (Wheelbase, Length, Weight).

The connection between cosine similarity and correlation

The similarity matrix of the variables shows which variables are similar and dissimilar. In that sense, the matrix might remind you of a correlation matrix. However, there is an important difference: The correlation matrix displays the pairwise inner products of variables that have been centered and scaled to unit length. The cosine similarity scales the variables but does not center them. Consequently, although the correlation is unchanged when you multiply a variable by a positive constant or add a constant to it, the cosine similarity is invariant only under positive scaling: If you add or subtract a constant from a variable, its cosine similarity with other variables will change.
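The relationship can be written explicitly. In the notation below (introduced here for illustration), x and y are two columns of length N, x̄ and ȳ are their means, and 1 is the vector of ones. The Pearson correlation is the cosine similarity of the centered columns:

```latex
\[
\operatorname{cos\,sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert},
\qquad
r(x, y) = \frac{(x - \bar{x}\mathbf{1}) \cdot (y - \bar{y}\mathbf{1})}
               {\lVert x - \bar{x}\mathbf{1} \rVert \, \lVert y - \bar{y}\mathbf{1} \rVert}
        = \operatorname{cos\,sim}(x - \bar{x}\mathbf{1},\; y - \bar{y}\mathbf{1})
\]
\[
r(a x + c\mathbf{1},\, y) = r(x, y) \quad \text{for } a > 0,
\qquad
\operatorname{cos\,sim}(a x,\, y) = \operatorname{cos\,sim}(x, y) \quad \text{for } a > 0,
\]
\[
\text{but in general} \quad
\operatorname{cos\,sim}(x + c\mathbf{1},\, y) \neq \operatorname{cos\,sim}(x, y).
\]
```

The two measures therefore coincide exactly when both columns already have mean zero.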

When the data represent positive quantities (as in the Vehicles data), the cosine similarity between two vectors can never be negative. For example, consider the relationship between the fuel economy variables (MPG_City and MPG_Highway) and the "power" and "size" variables. Intuitively, the fuel economy is negatively correlated with power and size (for example, Horsepower and Weight). However, the cosine similarity is positive, which shows another difference between correlation and cosine similarity.

Why is the cosine similarity useful?

The fact that the cosine similarity does not center the data is its biggest strength (and also its biggest weakness). The cosine similarity is most often used in applications in which the variables represent counts or indicator variables. Often, each row represents a document such as a recipe, a book, or a song. The columns indicate an attribute such as an ingredient, word, or musical technique.

For text documents, the number of columns (which represent important words) can be in the thousands. The goal is to discover which documents are structurally similar to each other, perhaps because they share similar content. Recommender engines can use the cosine similarity to suggest new books or articles that are similar to one that was previously read. The same technique can recommend similar recipes or similar songs.
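To make the idea concrete, here is a toy sketch that uses made-up indicator data (three hypothetical recipes and five hypothetical ingredients) and the CosSimRows module that was stored earlier. It is only meant to show the mechanics, not a realistic recommender.

```sas
/* Toy example: rows are recipes, columns indicate whether an ingredient is used. */
/* The recipes and ingredients are invented for illustration.                     */
proc iml;
load module=(ExtractCompleteCases CosSimCols CosSimRows);

recipes = {"Cake", "Cookie", "Pasta sauce"};
/*   flour sugar egg tomato basil */
X = {  1     1    1    0      0,
       1     1    0    0      0,
       0     0    0    1      1 };

cosR = CosSimRows(X);       /* 3 x 3 similarity matrix between recipes */
print cosR[rowname=recipes colname=recipes format=5.3];
quit;
```

For these data, the cake and the cookie share two ingredients, so their similarity is 2/√6 ≈ 0.816, whereas the pasta sauce shares no ingredients with either and has a similarity of 0 with both.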

The advantage of the cosine similarity is its speed and the fact that it is very useful for sparse data. In a future article, I'll provide a simple example that demonstrates how you can use the cosine similarity to find recipes that are similar or dissimilar to each other.

The post Cosine similarity of vectors appeared first on The DO Loop.