Sian Roberts

December 17, 2019
 

The next time you pick up a book, you might want to pause and think about the work that has gone into producing it – and not just from the authors!

The authors of the SAS classic, The Little SAS Book, Sixth Edition, did just that. The Acknowledgement section in the front of a book is usually short – just a few lines to thank family and significant others. In their sixth edition, authors Lora D. Delwiche and Susan J. Slaughter took it to the next level and produced The Little SAS Book Family Tree. The authors explained:

“Over the years, many people have helped to make this book a reality. We are grateful to everyone who has contributed both to this edition, and to editions in the past. It takes a family to produce a book including reviewers, copyeditors, designers, publishing specialists, marketing specialists, and of course our editors.”

So what happens after you sign a book contract?

First, you will be assigned a development editor (DE) who will answer questions and be with you every step of the way – from writing your sample chapter to publication. Your DE will discuss schedules and milestones with you and give you an authoring template, style guidelines, and any software you need to write your book.

Once you have all the resources you need, you'll write your sample chapter. This will help your DE evaluate your writing style, quality of graphics and output, structure, and any potential production issues.

The next step is submitting a draft manuscript for technical review. You'll get feedback from internal and external subject-matter experts, and then you can revise the manuscript based on this feedback. When you and your editor are satisfied with the technical content, your DE will perform a substantive edit on your manuscript, taking particular care with the structure and flow of your writing.

Once in production, your manuscript will be copy edited, and a production specialist will lay out the pages and make sure everything looks good. A graphic designer will work with you to create a cover that encompasses both our branding and your suggestions. Your book will be published in print and e-book versions and made ready for sale.

Finally, after the book is published, a marketing specialist will start promoting your book through our social media channels and other campaigns.

So the next time you pick up a book, spare a thought for the many people who have worked to make it a reality!

For more information about publishing with SAS or for our full catalogue, visit our online bookstore.

How many people does it take to publish a book? was published on SAS Users.

September 7, 2019
 

By 2020, 50% of organizations will lack sufficient AI and data literacy skills to achieve business value. – Gartner

What is data literacy?

Data literacy is the ability to read, work with, analyze, and argue with data. – Wikipedia

Data literacy is the ability to derive meaningful information from data, just as literacy in general is the ability to derive information from the written word. – WhatIs.com

Why is it important?

As data and analytics become core to the enterprise, and data becomes an organizational asset, employees must have at least a basic ability to communicate and understand conversations about data. Just as it is a given that employees are now competent in word processing and spreadsheets, the ability to “speak data” will become an integral aspect of most day-to-day jobs.

Gone will be the days when data scientists, analysts, and statisticians are the only ones “speaking data.” Valerie Logan, Senior Director Analyst, Gartner, says workforce data literacy must treat information as a second language. Just as we expect all employees today to have a basic level of computer literacy, use email, and understand spreadsheets, employees will also need to be able to understand and speak basic data.

Chris Hemedinger, author of SAS for Dummies, touched on this in his blog post “A skeptic’s guide to statistics in the media.” He is old enough to remember when USA Today began publication in the early 1980s. He remembers scanning each edition for the USA Today Snapshots, a mini infographic feature that presented some statistics in a fun and interesting way. “Back then, I felt that these stats made me a little bit smarter for the day. I had no reason to question the numbers I saw, nor did I have the tools, skill, or data access to check their work.”

Chris warns that because “news articles and editorial pieces often use simplified statistics to convey a message or support an argument,” we will need to learn that “statistics in the media should not be accepted at face value.” Learning to analyze and understand data and statistics will become increasingly vital for future generations.

Best-selling SAS Press author Ron Cody cautions that, now that augmented technology allows non-programmers to run complex programs to search databases, summarize data, and conduct statistical tests, it is vital that everyone has a basic understanding of the data and analytics behind the results. “With advances in artificial intelligence, we may be able to tell the computer our problem and let it solve it and tell us the answer.” With AI technology advancing so quickly, we will all need to understand the data and avoid introducing bias into our models. Misunderstood data can negatively influence AI algorithms or the interpretation of models.

The future

Tom Fisher, Senior Vice President of Business Development at SAS, explains, “the convergence of model management with data management represents one of the most exciting business opportunities of the future. The merging and blending of these two disciplines should enable the elimination of bias that may occur in the collection and aggregation of data.” Initiatives such as MIT’s Data Nutrition Project address the missing step in the model development pipeline, “assessing data sets based on standard quality measures that are both qualitative and quantitative.” As Fisher concludes, “these kinds of approaches are designed to allow consumers of data, as input to models, to have a more complete understanding of the data that’s being ingested. At the end of the day, the goal of these integrated disciplines is to provide greater accuracy and comfort with the result sets that are being delivered by data scientists and data engineers.”

The Gartner report quoted earlier notes that as organizations become more data-driven, poor data literacy will become an inhibitor to growth. But not everyone wants to be a statistician or data scientist. This is where the analogy to computer literacy breaks down. We don’t all have to have a statistics degree – AI can help. SAS is building AI into its most sophisticated and powerful solutions to make data literacy accessible to everyone. For example, SAS® Model Manager looks at the data and the problem to suggest models. It can then choose the best model based on the user’s criteria, test the model, and score. Technology to report and explain the results, and even answer questions, is under development – all in natural language! Think of it as a virtual personal assistant that can “speak data” and translate.

While data literacy will become increasingly important, so too will tools to help moderate and translate the data that will continue to drive our enterprises and our lives.

Resources:
Become more data literate with our library of Getting Started with SAS, Statistics, Machine Learning, and Data Management books. Visit SAS Books.

Explore SAS Analytics Industry Solutions at sas.com/industry.

Why we need to learn how to "speak data" in a data-driven future was published on SAS Users.

August 5, 2019
 

As a company, SAS consistently supports #data4good initiatives designed to help those less fortunate around the world. SAS Press team members recently took some time to reflect on the SAS initiatives that inspired them. We thought this would be a good opportunity to introduce some of the team who work so hard on our SAS Press books.

Sian Roberts, Publisher

I lead the SAS Press team and oversee the publication of our books from start to finish, including manuscript acquisition, book development, production, sales, and promotion.

Having lost both my dad and grandmother to cancer, the work SAS is doing to help improve care for cancer patients by tailoring treatments for individuals particularly resonates with me. For example, the wonderful work that is being done with Amsterdam University Medical Center to use computer vision and predictive analytics to improve care for cancer patients is of particular interest to me. My hope is that by using analytics and AI on data gathered from hospitals, research institutes, pharma and biotech companies, patterns can be identified earlier, and survival rates will increase.

 
 

Suzanne Morgen, Developmental Editor

I work with authors to help them develop and write their books, then go to conferences to sell those books and recruit more authors!

At SAS Global Forum, we heard about a pilot program at the New Hanover County Department of Social Services that uses SAS to alert caseworkers to risks for children in their care. I have been a foster parent for several years, so I am excited about any new resources that would help social workers intervene earlier in kids’ lives and hopefully keep them safer and even reduce the need for foster care. I hope SAS is able to partner with many more social services departments and use analytics to help protect more kids in the state and across the country.

Emily Scheviak-Livesay, Senior Business Operations Specialist

As a SAS employee for 22 years, I have learned to wear many hats. At SAS Press, I keep the business running smoothly and manage the metadata of all our books in all formats. I also work with our partners to ensure our titles are available both in the US and globally.

I love this story about JMP working with the Animal Humane Society! I’m a huge fan of “adopt, don’t shop” and it makes me so proud to work at a company where one of our products was used to assist in furthering the cause. For JMP to be able to take a huge amount of data from various sources and turn it into valuable information for The Animal Humane Society is amazing! Helping to care for and save animals is what it’s all about. It truly is a fairy “tail” ending.

 

Missy Hannah, Senior Associate Developmental Editor

I work directly with SAS Press and JMP authors to plan and implement marketing strategies for our books. I grew up with a mother who was not only a Systems Engineer but who also taught me all about technology. Looking back, I spent my entire life watching her code and work with technology and IT, and seeing her do this meant those things came very easily to me. But often, other young women don’t find mentors in the field of data analytics and technology. Data shows that women account for less than 20% of computer science degrees in the U.S. and hold less than 25% of STEM-related jobs. That is why the Women’s In Tech Network at SAS has been something I have really enjoyed having at my company. SAS creating the Women’s Initiative Network (WIN) and all the other work it is doing to increase the number of women in STEM and data fields is something that really matters.

Catherine Connolly, Developmental Editor

I work with authors to develop books that support SAS’ business initiatives. My main areas of focus are JMP, data management, and IoT.

There are so many SAS initiatives through #data4good that make me proud to be a SAS employee. One initiative I read about earlier this year that stuck with me was a partnership between SAS and CAP Science to combat repeated domestic violence. CAP Science developed wearables to be worn by both the domestic violence victim and the offender. The wearable uses SAS software to continuously collect data and report on the offender’s location in real time in an effort to stop future attacks.

 
 

We hope you enjoyed this small insight into some of our team. We are all very proud to work for a company that takes the time to improve the lives of those who need it and uses the power of data and analytics to help the world.

What SAS #data4good initiative has been your favorite? Make sure to comment below!

What really matters: SAS #data4good and the SAS Press team was published on SAS Users.

June 18, 2019
 

What is Item Response Theory?

Item Response Theory (IRT) is a way to analyze responses to tests or questionnaires with the goal of improving measurement accuracy and reliability.

A common application is in testing a student’s ability or knowledge. Today, all major psychological and educational tests are built using IRT. The methodology can significantly improve measurement accuracy and reliability while potentially providing significant reductions in assessment time and effort, especially via computerized adaptive testing. For example, the SAT and GRE both use Item Response Theory for their tests. IRT takes into account not only the number of questions answered correctly but also the difficulty of each question.

In recent years, IRT models have also become increasingly popular in health behavior, quality of life, and clinical research. There are many different models for IRT. Three of the most popular are:

The Rasch model

Two-parameter model

Graded Response model

Early IRT models (such as the Rasch model and two-parameter model) concentrate mainly on dichotomous responses. These models were later extended to incorporate other formats, such as ordinal responses, rating scales, partial credit scoring, and multiple category scoring.
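To make the two dichotomous models a little more concrete, here is a minimal Python sketch of their standard logistic forms. It is not taken from either book mentioned in this post and it is not how PROC IRT is implemented; the function and parameter names are my own choices for illustration.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Rasch model: the chance of a correct answer depends only on
    the gap between the test taker's ability and the item's difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def two_parameter_probability(ability: float, difficulty: float,
                              discrimination: float) -> float:
    """Two-parameter model: adds a discrimination parameter that controls
    how sharply the item separates low-ability from high-ability test takers."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


if __name__ == "__main__":
    # Hypothetical values: an average test taker (ability 0) facing a harder item.
    print(rasch_probability(ability=0.0, difficulty=1.0))
    print(two_parameter_probability(ability=0.0, difficulty=1.0, discrimination=2.0))
```

When ability equals difficulty, both functions return 0.5, and with a discrimination of 1 the two-parameter model reduces to the Rasch model.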

Item Response Theory Models Using SAS

Ron Cody and Jeffrey K. Smith’s book, Test Scoring and Analysis Using SAS, uses SAS PROC IRT to show how to develop your own multiple-choice tests, score students, produce student rosters (in print form or Excel), and explore item response theory (IRT).

Aimed at non-statisticians working in education or training, the book describes item analysis and test reliability in easy-to-understand terms and teaches SAS programming to score tests, perform item analysis, and estimate reliability.

For those with a more statistical background, Bayesian Analysis of Item Response Theory Models Using SAS describes how to estimate and check IRT models using the SAS MCMC procedure. Written especially for psychometricians, scale developers, and practitioners, the book provides numerous annotated programs that you can easily modify for your applications.

Assessment has played, and continues to play, an integral part in our work and educational settings. IRT models continue to be increasingly popular in many other fields, such as medical research, health sciences, quality-of-life research, and even marketing research. With the use of IRT models, you can not only improve scoring accuracy but also economize test administration by adaptively using only the discriminative items.
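One common way to decide which items are the most useful at a given moment is to compare their item information at the current ability estimate and administer the most informative one. The sketch below assumes a two-parameter model; the item bank, its parameter values, and the function names are all hypothetical, and a real adaptive test would also re-estimate ability after every response.

```python
import math


def item_information(ability: float, difficulty: float, discrimination: float) -> float:
    """Fisher information of a two-parameter item at a given ability level."""
    p = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return discrimination ** 2 * p * (1.0 - p)


def pick_next_item(ability_estimate: float, item_bank: dict, administered: set) -> str:
    """Return the name of the unused item that is most informative
    at the current ability estimate."""
    candidates = {name: params for name, params in item_bank.items()
                  if name not in administered}
    return max(candidates,
               key=lambda name: item_information(ability_estimate, *candidates[name]))


# Hypothetical item bank: name -> (difficulty, discrimination).
ITEM_BANK = {
    "easy_item": (-1.5, 1.0),
    "medium_item": (0.0, 1.8),
    "hard_item": (1.5, 1.2),
    "weak_item": (0.0, 0.4),
}

if __name__ == "__main__":
    print(pick_next_item(0.0, ITEM_BANK, administered=set()))
```

Note that the weakly discriminating item is never the first choice, even though its difficulty matches the test taker, which is the sense in which adaptive testing favors discriminative items.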

Interested in learning more? Check out our chapter previews available for free. Want to learn more about SAS Press? Explore our online bookstore and subscribe to our newsletter to get all the latest discounts, news, and more.

Further resources

SAS Blogs:
New at SAS: Psychometric Testing by Charu Shankar
SAS author’s tip: Bayesian analysis of item response theory models

SAS Communities:
SAS Communities: Custom Task Tuesday: SAS Global Forum/PROC IRT Edition!

SAS Global Forum Paper:
Item Response Theory: What It Is and How You Can Use the IRT Procedure to Apply It by Xinming An and Yiu-Fai Yung

SAS Documentation:
The IRT Procedure
SAS/STAT 14.1 User Guide: The IRT Procedure
SAS/STAT 14.2 User Guide: Help Center

Understanding Item Response Theory with SAS was published on SAS Users.

April 22, 2019
 

Imagine a world where satisfying human-computer dialogues exist. With the resurgence of interest in natural language processing (NLP) and natural language understanding (NLU), that day may not be far off.

In order to provide more satisfying interactions with machines, researchers are designing smart systems that use artificial intelligence (AI) to develop better understanding of human requests and intent.

Last year, OpenAI used a machine learning technique called reinforcement learning to teach agents to design their own language. The AI agents were given a simple set of words and the ability to communicate with each other. They were then given a set of goals that were best achieved by cooperating (communicating) with other agents. The agents independently developed a simple ‘grounded’ language.

Grounded vs. inferred language


Human language is said to be grounded in experience. People grasp the meaning of many basic words by interaction – not by learning dictionary definitions by rote. They develop understanding in terms of sensory experience -- for example, words like red, heavy, above.

Abstract word meanings are built in relation to more concretely grounded terms. Grounding allows humans to acquire and understand words and sentences in context.

The opposite of a grounded language is an inferred language. Inferred languages derive meaning from the words themselves and not from what they represent. AI systems trained only on textual data, without real-world representations, lack a true understanding of what the words mean.

What if the AI agent develops its own language we can’t understand?

It happens. Even if the researcher gives the agents simple English words, the agents inevitably diverge to their own, unintelligible language. Recently, researchers at Facebook, Google, and OpenAI all experienced this phenomenon!

Agents are reward driven. If there is no reward for using English (or human language) then the agents will develop a more efficient shorthand for themselves.

That’s cool – why is that a problem?

When researchers at the Facebook Artificial Intelligence Research lab designed chatbots to negotiate with one another using machine learning, they had to tweak one of their models because otherwise the bot-to-bot conversation “led to divergence from human language as the agents developed their own language for negotiating.” They had to use what’s called a fixed supervised model instead.

The problem there is transparency. Machine learning techniques such as deep learning are black box technologies. A lot of data is fed into the AI, in this case a neural network, which trains on it and develops its own rules. The model is then fed new data, from which it produces answers or information. The black box analogy is used because it is very hard, if not impossible in complex models, to know exactly how the AI derives its output (answers). If AI develops its own languages when talking to other AI, the transparency problem compounds. How can we fully trust an AI when we can’t follow how it is making its decisions and what it is telling other AI?

But it does demonstrate how machines are redefining people’s understanding of so many realms once believed to be exclusively human, like language. The Facebook researchers concluded that it offered a fascinating insight into human and machine language. The bots also proved to be very good negotiators, developing intelligent negotiating strategies.

These new insights, in turn, lead to smarter chatbots that have a greater understanding of the real world and the context of human dialog.

At SAS, we’re developing different ways to incorporate chatbots into business dashboards or analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new and less technical users.

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer in a recent SAS Insights article. “Chatbots create a humanlike interaction that makes results accessible to all.” The evolution of NLP toward NLU has a lot of important implications for businesses and consumers alike.

Satisfying human-computer dialogues will soon exist, and will have applications in medicine, law, and the classroom, to name but a few. As the volume of unstructured information continues to grow exponentially, we will benefit from AI’s tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters
White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?
SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham
SAS: What are chatbots?
Blog: Let’s chat about chatbots, by Wayne Thompson

Moving from natural language processing to natural language understanding was published on SAS Users.

April 9, 2019
 

Natural language understanding (NLU) is a subfield of natural language processing (NLP) that enables machine reading comprehension. While both deal with human language, NLU goes beyond the structural understanding of language to interpret intent, resolve context and word ambiguity, and even generate human language on its own. NLU is designed for communicating with non-programmers – to understand their intent and act on it. NLU algorithms tackle the extremely complex problem of semantic interpretation – that is, understanding the intended meaning of spoken or written language, with all its subtleties and human errors, such as mispronunciations or fragmented sentences.

How does it work?

After your data has been analyzed by NLP to identify parts of speech and so on, NLU uses context to discern the meaning of fragmented and run-on sentences and to execute the intent. For example, imagine a voice command to Siri or Alexa:

Siri / Alexa play me a …. um song by ... um …. oh I don’t know …. that band I like …. the one you played yesterday …. The Beach Boys … no the bass player … Dick something …

What are the chances of Siri / Alexa playing a song by Dick Dale? That’s where NLU comes in.

NLU reduces human speech (or text) to a structured ontology – a data model consisting of formal, explicit definitions of the semantics (meaning) and pragmatics (purpose or goal). The algorithms pull out such things as intent, timing, location, and sentiment.

The above example might break down into:

Play song [intent] / yesterday [timing] / Beach Boys [artist] / bass player [artist] / Dick [artist]

By piecing together this information you might just get the song you want!
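As a toy illustration of that breakdown, the Python sketch below strips filler words from the command and matches what is left against a few small keyword lists. Everything in it (the keyword lists, the function name, the output format) is made up for this example; real NLU systems rely on trained language models rather than hand-written lookups.

```python
import re

# Hypothetical lexicons, just large enough for the example command above.
FILLERS = {"um", "uh", "oh"}
INTENT_KEYWORDS = {"play": "play song"}
TIMING_KEYWORDS = {"yesterday", "today", "tonight"}
ARTIST_HINTS = ["beach boys", "bass player", "dick"]


def parse_command(utterance: str) -> dict:
    """Reduce a rambling voice command to intent, timing, and artist hints."""
    text = utterance.lower().replace("\u2019", "'")
    # Keep only word-like tokens, then drop hesitation fillers.
    tokens = [t for t in re.findall(r"[a-z']+", text) if t not in FILLERS]
    cleaned = " " + " ".join(tokens) + " "
    intent = next((label for keyword, label in INTENT_KEYWORDS.items()
                   if keyword in tokens), None)
    timing = [t for t in tokens if t in TIMING_KEYWORDS]
    # Whole-phrase matching: pad with spaces so "dick" does not match inside other words.
    artist_hints = [hint for hint in ARTIST_HINTS if f" {hint} " in cleaned]
    return {"intent": intent, "timing": timing, "artist_hints": artist_hints}


if __name__ == "__main__":
    command = ("Alexa play me a ... um song by ... um ... oh I don't know ... "
               "that band I like ... the one you played yesterday ... "
               "The Beach Boys ... no the bass player ... Dick something ...")
    print(parse_command(command))
```

The sketch only labels the pieces; deciding that "bass player" and "Dick" should override "Beach Boys" is exactly the kind of contextual reasoning that NLU adds.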

NLU has many important implications for businesses and consumers alike. Here are some common applications:

    Conversational interfaces – bots that can enhance the customer experience and deliver efficiency.
    Virtual assistants – natural language powered, allowing for easy engagement using natural dialogue.
    Call steering – allowing customers to explain, in their own words, why they are calling rather than going through predefined menus.
    Smart listener – allowing users to optimize speech output applications.
    Information summarization – algorithms that can ‘read’ long documents and summarize the meaning and/or sentiment.
    Pre-processing for machine learning (ML) – the information extracted can then be fed into a machine learning recommendation engine or predictive model. For example, NLU and ML are used to sift through novels to predict which would make hit movies at the box office!

Imagine the power of an algorithm that can understand the meaning and nuance of human language in many contexts, from medicine to law to the classroom. As the volumes of unstructured information continue to grow exponentially, we will benefit from computers’ tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters

White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?

SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis

Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham

So, you’ve figured out NLP but what’s NLU? was published on SAS Users.

April 3, 2019
 

Structuring a highly unstructured data source

Human language is astoundingly complex and diverse. We express ourselves in infinite ways. It can be very difficult to model and extract meaning from both written and spoken language. Usually the most meaningful analysis uses a number of techniques.

While supervised and unsupervised learning, and specifically deep learning, are widely used for modeling human language, there’s also a need for syntactic and semantic understanding and domain expertise. Natural Language Processing (NLP) is important because it can help to resolve ambiguity and add useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics. The outputs from NLP are then run through data mining and machine learning algorithms to automatically extract key features and relational concepts. Human input in the form of linguistic rules adds to the process, enabling contextual comprehension.

Text analytics provides structure to unstructured data so it can be easily analyzed. In this blog, I would like to focus on two widely used text analytics techniques: information extraction and entity resolution.

Information Extraction

Information Extraction (IE) automatically extracts structured information from an unstructured or semi-structured text source (for example, a text file) to create new structured text data. IE works at the sub-document level, in contrast with techniques such as categorization, which work at the document or record level. Therefore, the results of IE can feed into other analyses, like predictive modeling or topic identification, as features for those processes. IE can also be used to create a new database of information. One example is the recording of key information about terrorist attacks from a group of news articles on terrorism. Any given IE task has a defined template, which is a case frame (or a set of case frames) that holds the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template. Such a database can then be used and analyzed through queries and reports about the data.
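To show what a template with slots can look like in practice, here is a small Python sketch. The slot names follow the example above, but the regular expressions, the class name, and the sample sentence are all invented for illustration; this is not how SAS concept rules are written, and real articles would need far more robust patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttackTemplate:
    """One case frame per document; slots that are not found stay empty."""
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None


# Hypothetical patterns, tuned only to the invented sentence below.
SLOT_PATTERNS = {
    "date": re.compile(r"On (\d{1,2} \w+ \d{4})"),
    "perpetrator": re.compile(r"members of (the [A-Z]\w+(?: [A-Z]\w+)*)"),
    "victim": re.compile(r"attacked (an? [\w ]+?) with"),
    "weapon": re.compile(r"with (an? [\w ]+?)[.,]"),
}


def fill_template(article: str) -> AttackTemplate:
    """Read just enough of the article to fill the template's slots."""
    template = AttackTemplate()
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(article)
        if match:
            setattr(template, slot, match.group(1))
    return template


if __name__ == "__main__":
    # Invented sentence, not a real news report.
    article = ("On 4 May 2017, members of the Example Front attacked "
               "a police convoy with a roadside bomb.")
    print(fill_template(article))
```

Each filled template becomes one row in the new database, which is what makes later queries and reports possible.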

In their new book, SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, authors Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis give some great examples of uses of IE:

"One good use case for IE is for creating a faceted search system. Faceted search allows users to narrow down search results by classifying results by using multiple dimensions, called facets, simultaneously. For example, faceted search may be used when analysts try to determine why and where immigrants may perish. The analysts might want to correlate geographical information with information that describes the causes of the deaths in order to determine what actions to take."

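As a rough sketch of how extracted facets support this kind of narrowing, the Python snippet below filters a handful of made-up records on several facets at once and then counts the values of another facet. The field names (`region`, `cause`) and the helper functions are assumptions for illustration, not part of any SAS product.

```python
from collections import Counter

# Made-up records, shaped like the output an IE step might produce.
records = [
    {"id": 1, "region": "desert", "cause": "dehydration"},
    {"id": 2, "region": "river", "cause": "drowning"},
    {"id": 3, "region": "desert", "cause": "exposure"},
    {"id": 4, "region": "desert", "cause": "dehydration"},
]


def facet_filter(rows, **facets):
    """Keep only the rows that match every requested facet value."""
    return [r for r in rows if all(r.get(k) == v for k, v in facets.items())]


def facet_counts(rows, facet):
    """Count how often each value of one facet occurs in the rows."""
    return Counter(r[facet] for r in rows)


# Narrow by one facet, then see how the remaining rows split on another.
desert = facet_filter(records, region="desert")
cause_counts = facet_counts(desert, "cause")

# Narrow by two facets simultaneously.
matches = facet_filter(records, region="desert", cause="dehydration")
print(cause_counts, [r["id"] for r in matches])
```
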
Another good example, this time of using IE in predictive models, involves analysts at a bank who want to determine why customers close their accounts. They have an active churn model that works fairly well at identifying potential churn, but less well at determining what causes the churn. An IE model could be built to identify different bank policies and offerings, and then track mentions of each during any customer interaction. If a particular policy could be linked to certain churn behavior, then the policy could be modified to reduce the number of lost customers.

Reporting information found as a result of IE can provide deeper insight into trends and uncover details that were buried in the unstructured data. An example of this is an analysis of call center notes at an appliance manufacturing company. The results of IE show a pattern of customer-initiated calls about repairs and breakdowns of a type of refrigerator, and the results highlight particular problems with the doors. This information shows up as a pattern of increasing calls. Because the content of the calls is being analyzed, the company can return to its design team, which can find and remedy the root problem.

Entity Resolution and regular expressions

Entity resolution is the technique of recognizing when two observations relate to the same entity (thing, person, company) despite having been described differently and, conversely, recognizing when two observations do not relate to the same entity despite having been described similarly. For example, you might be listed in one database as S Roberts, Sian Roberts, and S.Roberts. All three refer to the same person but would be treated as different people in an analysis unless they are resolved (combined into one person).
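A minimal sketch of resolving these three name variants, written in Python for illustration. The matching key used here (first initial plus last name) is an assumption chosen for simplicity, not a recommended production rule.

```python
import re
from collections import defaultdict


def name_key(raw: str) -> str:
    """Reduce a name to 'first initial + last name', lowercased."""
    tokens = re.findall(r"[A-Za-z]+", raw)  # splits on spaces and periods
    if not tokens:
        return ""
    if len(tokens) == 1:
        return tokens[0].lower()
    return f"{tokens[0][0]} {tokens[-1]}".lower()


names = ["S Roberts", "Sian Roberts", "S.Roberts"]

# Group the raw spellings under their shared key.
groups = defaultdict(list)
for name in names:
    groups[name_key(name)].append(name)

print(dict(groups))
```

Note that a key this coarse would also merge "Sam Roberts" into the same group, which is exactly the converse problem described above: observations that look similar but are not the same entity. In practice, additional fields such as date of birth or address are needed to tell them apart.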

Entity resolution can be performed as part of a data pre-processing step or as part of text analysis. The first resolves multiple entries for the same entity (it cleans the data); the second resolves references to a single entity in order to extract meaning. An example of the second is pronoun resolution, in which “it” refers to a particular company mentioned earlier in the text. Here is another example:

Assume each numbered item is a separate observation in the input data set:
1. SAS Institute is a great company. Our company has a recreation center and health care center for employees.
2. Our company has won many awards.
3. SAS Institute was founded in 1976.

The scoring output matches are below; note that the document ID associated with each match aligns with the number before the input document where the match was found.

Unstructured data clean-up

In the following section we focus on the pre-processing clean-up of the data. Unstructured data is the most voluminous form of data in the world, and analysts rarely receive it in perfect condition for processing. In other words, textual data needs to be cleaned, transformed, and enhanced before value can be derived from it.

A regular expression is a pattern that the regular expression engine attempts to match in input. In SAS programming, regular expressions are strings of letters and special characters that are recognized by certain built-in SAS functions for the purpose of searching and matching. When you combine regular expressions with other built-in SAS functions and procedures, and with techniques such as entity resolution, you gain powerful capabilities. Matthew Windham, author of Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, gives some great examples in his book of how you might use these techniques to clean your text data. Here we share one of them:

"As you are probably familiar with, data is rarely provided to analysts in a form that is immediately useful. It is frequently necessary to clean, transform, and enhance source data before it can be used—especially textual data."

Extract, Transform, and Load (ETL)

ETL is a general set of processes for extracting data from its source, modifying it to fit your end needs, and loading it into a target location that enables you to best use it (e.g., database, data store, data warehouse). We’re going to begin with a fairly basic example to get us started. Suppose we already have a SAS data set of customer addresses that contains some data quality issues. The method of recording the data is unknown to us, but visual inspection has revealed numerous occurrences of duplicative records. In this example, it is clearly the same individual with slightly different representations of the address and encoding for gender. But how do we fix such problems automatically for all of the records?

First Name | Last Name | DOB | Gender | Street | City | State | Zip
Robert | Smith | 2/5/1967 | M | 123 Fourth Street | Fairfax, | VA | 22030
Robert | Smith | 2/5/1967 | Male | 123 Fourth St. | Fairfax | va | 22030

Using regular expressions, we can algorithmically standardize abbreviations, remove punctuation, and do much more to ensure that each record is directly comparable. In this case, regular expressions enable us to perform more effective record keeping, which ultimately impacts downstream analysis and reporting. We can easily leverage regular expressions to ensure that each record adheres to institutional standards. We can make each occurrence of Gender either “M/F” or “Male/Female,” make every instance of the Street variable use “Street” or “St.” in the address line, make each City variable include or exclude the comma, and abbreviate State as either all caps or all lowercase. This example is quite simple, but it reveals the power of applying some basic data standardization techniques to data sets. By enforcing these standards across the entire data set, we are then able to properly identify duplicative references within the data set. In addition to making our analysis and reporting less error-prone, we can reduce data storage space and duplicative business activities associated with each record (for example, fewer customer catalogs will be mailed out, thus saving money).
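The standardization described above can be sketched as follows. This is an illustration in Python's `re` module rather than the SAS code from the book (in SAS, the PRX family of functions, such as PRXCHANGE, plays a similar role). The function name and the particular standards chosen here ("M/F", "Street", no comma, uppercase state) are assumptions for the example.

```python
import re

GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def standardize(record: dict) -> dict:
    """Return a copy of the record with each field in one agreed form."""
    clean = dict(record)
    # Gender: map full words and codes to a one-letter code.
    gender = clean["Gender"].strip().lower()
    clean["Gender"] = GENDER_CODES.get(gender, clean["Gender"])
    # Street: expand a trailing "St" or "St." to "Street".
    clean["Street"] = re.sub(r"\bSt\.?$", "Street", clean["Street"].strip())
    # City: drop trailing punctuation such as a comma.
    clean["City"] = re.sub(r"[^\w\s]+$", "", clean["City"].strip())
    # State: always uppercase.
    clean["State"] = clean["State"].strip().upper()
    return clean


# The two records from the example table.
row_a = {"First Name": "Robert", "Last Name": "Smith", "DOB": "2/5/1967",
         "Gender": "M", "Street": "123 Fourth Street", "City": "Fairfax,",
         "State": "VA", "Zip": "22030"}
row_b = {"First Name": "Robert", "Last Name": "Smith", "DOB": "2/5/1967",
         "Gender": "Male", "Street": "123 Fourth St.", "City": "Fairfax",
         "State": "va", "Zip": "22030"}

# After standardization the two rows compare equal, so the duplicate is detectable.
is_duplicate = standardize(row_a) == standardize(row_b)
print(is_duplicate)
```

Once every record passes through the same function, a plain equality check or a de-duplication step is enough to identify the repeated customer.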

Your unstructured text data is growing daily, and data without analytics is opportunity yet to be realized. Discover the value in your data with text analytics capabilities from SAS. The SAS Platform fosters collaboration by providing a toolbox where best practice pipelines and methods can be shared. SAS also seamlessly integrates with existing systems and open source technology.

Further Resources:
Natural Language Processing: What it is and why it matters

White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?

SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis

Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham

Text analytics explained was published on SAS Users.

March 8, 2019

In a move to combat "stataphobia" and foster excellence in statistics in developing countries, SAS Press last month donated 70 SAS Press titles to the Serageldin Research Library at the Library of Alexandria in Egypt. The library’s mission is to achieve statistical equity so that a student in Chad has [...]

Breaking down walls for science: SAS Press donates books to the world’s largest research methods library was published on SAS Voices by Sian Roberts