Sian Roberts

June 18, 2019
 

What is Item Response Theory?

Item Response Theory (IRT) is a way to analyze responses to tests or questionnaires with the goal of improving measurement accuracy and reliability.

A common application is testing a student’s ability or knowledge. Today, all major psychological and educational tests are built using IRT. The methodology can significantly improve measurement accuracy and reliability while potentially delivering substantial reductions in assessment time and effort, especially via computerized adaptive testing. For example, the SAT and GRE both use item response theory in their tests. IRT takes into account both the number of questions answered correctly and the difficulty of each question.

In recent years, IRT models have also become increasingly popular in health behavior, quality of life, and clinical research. There are many different models for IRT. Three of the most popular are:

The Rasch model

The two-parameter model

The graded response model

Early IRT models (such as the Rasch model and two-parameter model) concentrate mainly on dichotomous responses. These models were later extended to incorporate other formats, such as ordinal responses, rating scales, partial credit scoring, and multiple category scoring.
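
To make the distinction concrete, here is the two-parameter logistic (2PL) model for a dichotomous item, written in standard notation; the Rasch model is the special case in which every discrimination parameter equals one:

    P(X_{ij} = 1 \mid \theta_i) = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}

Here \theta_i is the ability of person i, b_j is the difficulty of item j, and a_j is the item's discrimination. The graded response model extends this form to ordinal items with more than two response categories.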

Item Response Theory Models Using SAS

Ron Cody and Jeffrey K. Smith’s book, Test Scoring and Analysis Using SAS, uses SAS PROC IRT to show how to develop your own multiple-choice tests, score students, produce student rosters (in print form or in Excel), and explore item response theory.

Aimed at non-statisticians working in education or training, the book describes item analysis and test reliability in easy-to-understand terms and teaches SAS programming to score tests, perform item analysis, and estimate reliability.
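
As a minimal sketch of the kind of step the book works up to, a two-parameter model can be fit with PROC IRT roughly as follows (the data set name and item variables here are hypothetical placeholders):

proc irt data=work.responses;
   var Q1-Q20;                       /* items scored 1 = correct, 0 = incorrect  */
   model Q1-Q20 / resfunc=twop;      /* two-parameter logistic response function */
run;

PROC IRT reports item difficulty and discrimination estimates along with ability estimates for each examinee; see the IRT Procedure documentation linked in the resources below for the full syntax.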

For those with a more statistical background, Bayesian Analysis of Item Response Theory Models Using SAS describes how to estimate and check IRT models using the SAS MCMC procedure. Written especially for psychometricians, scale developers, and practitioners, the book provides numerous annotated programs that you can easily modify for your own applications.
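
For a flavor of the Bayesian approach, here is a minimal sketch of a Rasch model in PROC MCMC. It assumes a hypothetical long-format data set with one row per person-item response (variables person, item coded 1-5, and a 0/1 response y); the priors and sampler settings are illustrative rather than taken from the book:

proc mcmc data=work.resp_long nmc=20000 seed=27513 outpost=work.posterior;
   array b[5];                                       /* item difficulty parameters              */
   parms b1 b2 b3 b4 b5;
   prior b: ~ normal(0, var=10);                     /* vague prior on item difficulties        */
   random theta ~ normal(0, var=1) subject=person;   /* person ability as a random effect       */
   p = logistic(theta - b[item]);                    /* Rasch probability of a correct response */
   model y ~ binary(p);
run;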

Assessment has played, and continues to play, an integral part in our work and educational settings. IRT models also continue to grow in popularity in many other fields, such as medical research, health sciences, quality-of-life research, and even marketing research. With IRT models, you can not only improve scoring accuracy but also make test administration more economical by adaptively presenting only the most discriminative items.

Interested in learning more? Check out our chapter previews available for free. Want to learn more about SAS Press? Explore our online bookstore and subscribe to our newsletter to get all the latest discounts, news, and more.

Further resources

SAS Blogs:
New at SAS: Psychometric Testing by Charu Shankar
SAS author’s tip: Bayesian analysis of item response theory models

SAS Communities:
SAS Communities: Custom Task Tuesday: SAS Global Forum/PROC IRT Edition!

SAS Global Forum Paper:
Item Response Theory: What It Is and How You Can Use the IRT Procedure to Apply It by Xinming An and Yiu-Fai Yung

SAS Documentation:
The IRT Procedure
SAS/STAT 14.1 User Guide: The IRT Procedure
SAS/STAT 14.2 User Guide: Help Center

Understanding Item Response Theory with SAS was published on SAS Users.

April 22, 2019
 

Imagine a world where satisfying human-computer dialogues exist. With the resurgence of interest in natural language processing (NLP) and natural language understanding (NLU), that day may not be far off.

In order to provide more satisfying interactions with machines, researchers are designing smart systems that use artificial intelligence (AI) to develop better understanding of human requests and intent.

Last year, OpenAI used a machine learning technique called reinforcement learning to teach agents to design their own language. The AI agents were given a simple set of words and the ability to communicate with each other. They were then given a set of goals that were best achieved by cooperating (communicating) with other agents. The agents independently developed a simple ‘grounded’ language.

Grounded vs. inferred language


Human language is said to be grounded in experience. People grasp the meaning of many basic words by interaction – not by learning dictionary definitions by rote. They develop understanding in terms of sensory experience -- for example, words like red, heavy, above.

Abstract word meanings are built in relation to more concretely grounded terms. Grounding allows humans to acquire and understand words and sentences in context.

The opposite of a grounded language is an inferred language. Inferred languages derive meaning from the words themselves rather than from what the words represent. AI systems trained only on textual data, with no real-world representations, lack true understanding of what the words mean.

What if the AI agent develops its own language we can’t understand?

It happens. Even if the researchers give the agents simple English words, the agents inevitably diverge to their own, unintelligible language. Researchers at Facebook, Google, and OpenAI have all recently experienced this phenomenon!

Agents are reward driven. If there is no reward for using English (or human language) then the agents will develop a more efficient shorthand for themselves.

That’s cool – why is that a problem?

When researchers at the Facebook Artificial Intelligence Research lab designed chatbots to negotiate with one another using machine learning, they had to tweak one of their models because otherwise the bot-to-bot conversation “led to divergence from human language as the agents developed their own language for negotiating.” They had to use what’s called a fixed supervised model instead.

The problem there is transparency. Machine learning techniques such as deep learning are black-box technologies. A lot of data is fed into the AI, in this case a neural network, to train on and develop its own rules. The model is then fed new data and produces answers or information. The black box analogy is used because it is very hard, if not impossible in complex models, to know exactly how the AI derives its output (answers). If AI develops its own languages when talking to other AI, the transparency problem compounds. How can we fully trust an AI when we can’t follow how it is making its decisions and what it is telling other AI?

But it does demonstrate how machines are redefining people’s understanding of so many realms once believed to be exclusively human, such as language. The Facebook researchers concluded that it offered a fascinating insight into human and machine language. The bots also proved to be very good negotiators, developing intelligent negotiating strategies.

These new insights, in turn, lead to smarter chatbots that have a greater understanding of the real world and the context of human dialog.

At SAS, we’re developing different ways to incorporate chatbots into business dashboards or analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new and less technical users.

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer in a recent SAS Insights article. “Chatbots create a humanlike interaction that makes results accessible to all.” The evolution of NLP toward NLU has a lot of important implications for businesses and consumers alike.

Satisfying human-computer dialogues will soon exist, with applications in medicine, law, and the classroom, to name but a few. As the volume of unstructured information continues to grow exponentially, we will benefit from AI’s tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters
White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?
SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham
SAS: What are chatbots?
Blog: Let’s chat about chatbots, by Wayne Thompson

Moving from natural language processing to natural language understanding was published on SAS Users.

April 9, 2019
 

Natural language understanding (NLU) is a subfield of natural language processing (NLP) that enables machine reading comprehension. While both deal with human language, NLU goes beyond the structural understanding of language to interpret intent, resolve context and word ambiguity, and even generate human language on its own. NLU is designed for communicating with non-programmers – to understand their intent and act on it. NLU algorithms tackle the extremely complex problem of semantic interpretation – that is, understanding the intended meaning of spoken or written language, with all the subtleties of human error, such as mispronunciations or fragmented sentences.

How does it work?

After your data has been analyzed by NLP to identify parts of speech and other structure, NLU uses context to discern the meaning of fragmented and run-on sentences and to act on the intent behind them. For example, imagine a voice command to Siri or Alexa:

Siri / Alexa play me a …. um song by ... um …. oh I don’t know …. that band I like …. the one you played yesterday …. The Beach Boys … no the bass player … Dick something …

What are the chances of Siri / Alexa playing a song by Dick Dale? That’s where NLU comes in.

NLU reduces the human speech (or text) to a structured ontology – a data model comprising a formal, explicit definition of the semantics (meaning) and pragmatics (purpose or goal). The algorithms pull out such things as intent, timing, location, and sentiment.

The above example might break down into:

Play song [intent] / yesterday [timing] / Beach Boys [artist] / bass player [artist] / Dick [artist]

By piecing together this information you might just get the song you want!

NLU has many important implications for businesses and consumers alike. Here are some common applications:

    Conversational interfaces – BOTs that can enhance the customer experience and deliver efficiency.
    Virtual assistants – natural language powered, allowing for easy engagement using natural dialogue.
    Call steering – allowing customers to explain, in their own words, why they are calling rather than going through predefined menus.
    Smart listener – allowing users to optimize speech output applications.
    Information summarization – algorithms that can ‘read’ long documents and summarize the meaning and/or sentiment.
    Pre-processing for machine learning (ML) – the information extracted can then be fed into a machine learning recommendation engine or predictive model. For example, NLU and ML are used to sift through novels to predict which would make hit movies at the box office!

Imagine the power of an algorithm that can understand the meaning and nuance of human language in many contexts, from medicine to law to the classroom. As the volumes of unstructured information continue to grow exponentially, we will benefit from computers’ tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters

White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?

SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis

Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham

So, you’ve figured out NLP but what’s NLU? was published on SAS Users.

April 3, 2019
 

Structuring a highly unstructured data source

Human language is astoundingly complex and diverse. We express ourselves in infinite ways. It can be very difficult to model and extract meaning from both written and spoken language. Usually the most meaningful analysis uses a number of techniques.

While supervised and unsupervised learning, and specifically deep learning, are widely used for modeling human language, there’s also a need for syntactic and semantic understanding and domain expertise. Natural Language Processing (NLP) is important because it can help to resolve ambiguity and add useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics. Machine learning runs outputs from NLP through data mining and machine learning algorithms to automatically extract key features and relational concepts. Human input from linguistic rules adds to the process, enabling contextual comprehension.

Text analytics provides structure to unstructured data so it can be easily analyzed. In this blog, I would like to focus on two widely used text analytics techniques: information extraction and entity resolution.

Information Extraction

Information Extraction (IE) automatically extracts structured information from an unstructured or semi-structured text source -- for example, a text file -- to create new structured text data. IE works at the sub-document level, in contrast with techniques such as categorization that work at the document or record level. Therefore, the results of IE can feed further into other analyses, like predictive modeling or topic identification, as features for those processes. IE can also be used to create a new database of information.

One example is the recording of key information about terrorist attacks from a group of news articles on terrorism. Any given IE task has a defined template, which is a case frame (or set of case frames) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terroristic act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template. Such a database can then be used and analyzed through queries and reports about the data.

In their new book, SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, authors Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis, give some great examples of uses of IE:

"One good use case for IE is for creating a faceted search system. Faceted search allows users to narrow down search results by classifying results by using multiple dimensions, called facets, simultaneously. For example, faceted search may be used when analysts try to determine why and where immigrants may perish. The analysts might want to correlate geographical information with information that describes the causes of the deaths in order to determine what actions to take."

Another good example of using IE in predictive models is analysts at a bank who want to determine why customers close their accounts. They have an active churn model that works fairly well at identifying potential churn, but less well at determining what causes the churn. An IE model could be built to identify different bank policies and offerings, and then track mentions of each during any customer interaction. If a particular policy could be linked to certain churn behavior, then the policy could be modified to reduce the number of lost customers.

Reporting information found as a result of IE can provide deeper insight into trends and uncover details that were buried in the unstructured data. An example of this is an analysis of call center notes at an appliance manufacturing company. The results of IE show a pattern of customer-initiated calls about repairs and breakdowns of a type of refrigerator, and the results highlight particular problems with the doors. This information shows up as a pattern of increasing calls. Because the content of the calls is being analyzed, the company can return to its design team, which can find and remedy the root problem.

Entity Resolution and regular expressions

Entity resolution is the technique of recognizing when two observations relate to the same entity (thing, person, company) despite having been described differently; and, conversely, recognizing when two observations do not relate to the same entity despite having been described similarly. For example, you might be listed in one database as S Roberts, Sian Roberts, or S. Roberts. All refer to the same person but would be treated as different people in an analysis unless they are resolved (combined into one person).

Entity resolution can be performed as part of a data pre-processing step or as part of text analysis. Basically, one resolves multiple entries to the same record (cleaning the data), and the other resolves references to a single entity in order to extract meaning, for example, pronoun resolution: when “it” refers to a particular company mentioned earlier in the text. Here is another example:

Assume each numbered item is a separate observation in the input data set:
1. SAS Institute is a great company. Our company has a recreation center and health care center for employees.
2. Our company has won many awards.
3. SAS Institute was founded in 1976.

In the scoring output, the document ID associated with each match aligns with the number before the input document in which the match was found.
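
Going back to the name-variant example above, here is a minimal pre-processing sketch that uses the COMPGED generalized edit distance function to flag likely duplicates; the data set, variable names, and distance cutoff are hypothetical and would need tuning for real data:

data work.flagged;
   set work.contacts;                                            /* one row per recorded name            */
   clean_name = upcase(compbl(translate(name_raw, ' ', '.')));   /* "S.Roberts" becomes "S ROBERTS"      */
   distance   = compged(clean_name, 'SIAN ROBERTS');             /* generalized edit distance            */
   likely_match = (distance <= 400);                             /* small distance: probably same person */
run;

In practice you would compare candidate pairs of records (for example, joined by a blocking key) rather than comparing everything against a single reference string.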

Unstructured data clean-up

In the following section we focus on the pre-processing clean-up of the data. Unstructured data is the most voluminous form of data in the world, and analysts rarely receive it in perfect condition for processing. In other words, textual data needs to be cleaned, transformed, and enhanced before value can be derived from it.

A regular expression is a pattern that the regular expression engine attempts to match in input text. In SAS programming, regular expressions are strings of letters and special characters that are recognized by certain built-in SAS functions for the purpose of searching and matching. Combined with other built-in SAS functions and procedures, and with techniques such as entity resolution, regular expressions give you tremendous capabilities. Matthew Windham, author of Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, gives some great examples in his book of how you might use these techniques to clean your text data. Here we share one of them:

"As you are probably familiar with, data is rarely provided to analysts in a form that is immediately useful. It is frequently necessary to clean, transform, and enhance source data before it can be used—especially textual data."

Extract, Transform, and Load (ETL) is a general set of processes for extracting data from its source, modifying it to fit your end needs, and loading it into a target location that enables you to best use it (e.g., a database, data store, or data warehouse). We’re going to begin with a fairly basic example to get us started. Suppose we already have a SAS data set of customer addresses that contains some data quality issues. The method of recording the data is unknown to us, but visual inspection has revealed numerous occurrences of duplicative records. In the example below, it is clearly the same individual with slightly different representations of the address and encoding for gender. But how do we fix such problems automatically for all of the records?

First Name   Last Name   DOB        Gender   Street              City      State   Zip
Robert       Smith       2/5/1967   M        123 Fourth Street   Fairfax,  VA      22030
Robert       Smith       2/5/1967   Male     123 Fourth St.      Fairfax   va      22030

Using regular expressions, we can algorithmically standardize abbreviations, remove punctuation, and do much more to ensure that each record is directly comparable. In this case, regular expressions enable us to perform more effective record keeping, which ultimately impacts downstream analysis and reporting. We can easily leverage regular expressions to ensure that each record adheres to institutional standards. We can make each occurrence of Gender either “M/F” or “Male/Female,” make every instance of the Street variable use “Street” or “St.” in the address line, make each City variable include or exclude the comma, and abbreviate State as either all caps or all lowercase. This example is quite simple, but it reveals the power of applying some basic data standardization techniques to data sets. By enforcing these standards across the entire data set, we are then able to properly identify duplicative references within the data set. In addition to making our analysis and reporting less error-prone, we can reduce data storage space and duplicative business activities associated with each record (for example, fewer customer catalogs will be mailed out, thus saving money).
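
Here is a minimal sketch of that kind of standardization with the PRX family of regular expression functions. The input data set and variable names mirror the hypothetical example above, and the rules are illustrative rather than exhaustive:

data work.addresses_std;
   set work.addresses;
   street = prxchange('s/\bSt\.?(?=\s|$)/Street/i', -1, street);   /* St / St.  -> Street    */
   city   = prxchange('s/,\s*$//', -1, strip(city));               /* drop a trailing comma  */
   state  = upcase(strip(state));                                  /* va -> VA               */
   if prxmatch('/^m(ale)?$/i', strip(gender)) then gender = 'M';   /* M / Male -> M          */
   else if prxmatch('/^f(emale)?$/i', strip(gender)) then gender = 'F';
run;

With the records standardized, duplicate rows such as the two above become directly comparable and can be removed with, for example, PROC SORT and the NODUPKEY option.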

Your unstructured text data is growing daily, and data without analytics is opportunity yet to be realized. Discover the value in your data with text analytics capabilities from SAS. The SAS Platform fosters collaboration by providing a toolbox where best practice pipelines and methods can be shared. SAS also seamlessly integrates with existing systems and open source technology.

Further Resources:
Natural Language Processing: What it is and why it matters

White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?

SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis

Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham

Text analytics explained was published on SAS Users.

March 8, 2019
 

In a move to combat "stataphobia" and foster excellence in statistics in developing countries, SAS Press last month donated 70 SAS Press titles to the Serageldin Research Library at the Library of Alexandria in Egypt. The library’s mission is to achieve statistical equity so that a student in Chad has [...]

Breaking down walls for science: SAS Press donates books to the world’s largest research methods library was published on SAS Voices by Sian Roberts

October 15, 2018
 


Old and new SAS users alike learned the tricks of the data trade from our Little SAS Book! We hope these fun tips from our exercise and project book teach you even more about how to master the data analytics game!

From Rebecca Ottesen:

Tip #1: Grouping Quantitative Variables
My favorite tip to share with students and SAS users is how to use PROC FORMAT to group quantitative variables into categories. A format can be created with a VALUE statement that specifies the ranges relevant to the category groupings. Then, this format can be applied with a FORMAT statement during an analysis to group the variable accordingly (don't forget the CLASS statement when applicable). You can also create categorical variables in the DATA step by applying the format in an assignment statement with a PUT function.
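
Here is a minimal sketch of that tip using the SASHELP.HEART sample data set that ships with SAS; the format name and cutpoints are just illustrative:

proc format;
   value agegrp low-<40 = 'Under 40'
                40-<60  = '40 to 59'
                60-high = '60 and over';
run;

/* Apply the format during the analysis: the variable is grouped on the fly */
proc freq data=sashelp.heart;
   tables AgeAtStart;
   format AgeAtStart agegrp.;
run;

/* Or create a categorical variable in the DATA step with PUT */
data work.heart_grouped;
   set sashelp.heart;
   age_group = put(AgeAtStart, agegrp.);
run;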

From Lora Delwiche:

Tip #2: Commenting Blocks of Code
This tip I learned from fellow SAS Press author Alan Wilson at SAS Global Forum 2008 in San Antonio. It might be a bit overly dramatic to say that this tip changed my life, but that’s not far from the truth! So, I am paying this tip forward. Thank you, Alan!

To comment out a whole block of code, simply highlight the lines of code, hold down the control key, and press the forward slash (/). SAS will take those lines of code and turn them into comments by adding a /* to the beginning of each line and an */ at the end of each line.

To convert the commented lines back to code, highlight the lines again, hold down the control and shift keys, and press the forward slash (/). This works in both the SAS windowing environment (Display Manager) and SAS Enterprise Guide.

If you are using SAS Studio as your programming interface, you comment the same way, but to uncomment, just hold down the control key and then press the forward slash.

From Susan J. Slaughter:

Tip #3: Susan's Macro Mottos
There is no question that writing and debugging SAS macros can be a challenge. So I have two "macro mottos" that I use to help keep me on track.

“Remember, you are writing a program that writes a program.”

This is the most important concept to keep in mind whenever you work with SAS macros. If you feel the least bit confused by a macro, repeating this motto can help you to see what is going on. I speak from personal experience here. This is my macro mantra.

“To avoid mangling your macros, always write them one piece at a time.”

This means, write your program in standard SAS code first. When that is working and bug-free, then add your %MACRO and %MEND statements. When they are working, then add your parameters, if any, one at a time. If you make sure that each macro feature you add is working before you add another one, then debugging will be vastly simplified.
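
A minimal sketch of that progression, using the SASHELP sample data sets that ship with SAS (the macro name and parameters are purely illustrative):

/* Step 1: get the standard SAS code working with no macro logic */
proc means data=sashelp.class;
   var Height;
run;

/* Steps 2 and 3: wrap it in %MACRO and %MEND, then add parameters one at a time */
%macro summarize(dsn=sashelp.class, var=Height);
   proc means data=&dsn;
      var &var;
   run;
%mend summarize;

%summarize()                               /* defaults reproduce the original program */
%summarize(dsn=sashelp.heart, var=Weight)  /* then vary one parameter at a time       */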

And, this is the best time ever to learn SAS! When I first encountered SAS, there were only two ways that I could get help. I could either ask another graduate student who might or might not know the answer, or I could go to the computer center and borrow the SAS manual. (There was only one.) Today it's totally different.

I am continually AMAZED by the resources that are available now—many for FREE. Here are four resources that every new SAS user should know about:

1. SAS Studio
This is a wonderful new interface for SAS that runs in a browser and has both programming and point-and-click features. SAS Studio is free for students, professors, and independent learners. You can download the SAS University Edition to run SAS Studio on your own computer, or use SAS OnDemand for Academics via the Internet.

2. Online classes
Two of the most popular self-paced e-learning classes are available for free: SAS Programming 1: Essentials and Statistics 1. These are real classes that people used to pay hundreds of dollars to take.

3. Videos
You can access hundreds of SAS training videos, tutorials, and demos at support.sas.com/training. Topics range from basic (What is SAS?) to advanced (SAS 9.4 Metadata Clustering).

4. Community of SAS users
If you encounter a problem, it is likely that someone else faced a similar situation and figured out how to solve it. On communities.sas.com you can post questions and get answers from SAS users and developers. On the site, www.lexjansen.com, you can find virtually every paper ever presented at a SAS users group conference.

If you want even more tips and tricks, check out our Exercises and Projects for The Little SAS Book, Fifth Edition! Let us know if you enjoyed these tips in the comment boxes below.

New to SAS? Ready to learn more? Check out these tips and tricks from the authors of Exercises and Projects for The Little SAS Book, 5th Edition was published on SAS Users.

October 10, 2018
 

Deep learning (DL) is a subset of neural networks, which have been around since the 1960s. Limited computing resources and the need for large amounts of training data were long the crippling factors for neural networks. But with the growing availability of computing resources such as multi-core machines, graphics processing unit (GPU) accelerators, and specialized hardware, DL is becoming much more practical for business problems.

Financial institutions use a large number of computations to evaluate portfolios and to price securities and financial derivatives. For example, every cell in a spreadsheet potentially implements a different formula. Time is also usually of the essence, so having the fastest possible technology to perform financial calculations with acceptable accuracy is paramount.

In this blog, we talk to Henry Bequet, Director of High-Performance Computing and Machine Learning in the Finance Risk division of SAS, about how he uses DL as a technology to maximize performance.

Henry discusses how the performance of numerical applications can be greatly improved by using DL. Once a DL network is trained to compute analytics, using that DL network becomes drastically faster than more classic methodologies like Monte Carlo simulations.

We asked him to explain deep learning for numerical analysis (DL4NA) and the most common questions he gets asked.

Can you describe the deep learning methodology proposed in DL4NA?

Yes, it starts with writing your analytics in a transparent and scalable way. All content that is released as a solution by the SAS financial risk division uses the "many task computing" (MTC) paradigm. Simply put, when writing your analytics using the many task computing paradigm, you organize code in SAS programs that define task inputs and outputs. A job flow is a set of tasks that will run in parallel, and the job flow will also handle synchronization.

Fig 1.1 A Sequential Job Flow

The job flow in Figure 1.1 visually gives you a hint that the two tasks can be executed in parallel. The addition of the task into the job flow is what defines the potential parallelism, not the task itself. The task designer or implementer doesn’t need to know that the task is being executed at the same time as other tasks. It is not uncommon to have hundreds of tasks in a job flow.

Fig 1.2 A Complex Job Flow

Using that information, the SAS platform and the Infrastructure for Risk Management (IRM) are able to automatically infer the parallelism in your analytics. This allows your analytics to run on tens or hundreds of cores. (Most SAS customers run out of cores before they run out of tasks to run in parallel.) By running SAS code in parallel, on a single machine or on a grid, you gain orders of magnitude of performance improvement.

This methodology also has the benefit of expressing your analytics in the form of Y = f(x), which is precisely what you feed a deep neural network (DNN) to learn. That organization of your analytics allows you to train a DNN to reproduce the results of your analytics originally written in SAS. Once you have the trained DNN, you can use it to score tremendously faster than the original SAS code. You can also use your DNN to push your analytics to the edge. I believe that this is a powerful methodology that offers a wide spectrum of applicability. It is also a good example of deep learning helping data scientists build better and faster models.

Fig 1.3 Example of a DNN with four layers: two visible layers and two hidden layers.

The number of neurons of the input layer is driven by the number of features. The number of neurons of the output layer is driven by the number of classes that we want to recognize, in this case, three. The number of neurons in the hidden layers as well as the number of hidden layers is up to us: those two parameters are model hyper-parameters.
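
To make the Y = f(x) idea concrete, here is a minimal sketch of training a small regression network on input/output pairs produced by the original analytics and then scoring with it. It assumes SAS Viya with an active CAS session (caslib mycas) and the PROC NNET procedure from SAS Visual Data Mining and Machine Learning; the table and variable names are hypothetical, and the book's own examples are considerably more elaborate:

proc nnet data=mycas.train_pairs;           /* rows of (x1-x5, y) generated by the original analytics */
   architecture mlp;                        /* multilayer perceptron                                  */
   hidden 30;                               /* two hidden layers, echoing Figure 1.3                  */
   hidden 30;
   input x1-x5 / level=interval;
   target y / level=interval;               /* interval target: learn y = f(x) as a regression        */
   train;
   score out=mycas.scored;                  /* fast approximate scoring with the trained network      */
run;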

How do I run my SAS program faster using deep learning?

In the financial risk division, I work with banks and insurance companies all over the world that are faced with increasing regulatory requirements like CCAR and IFRS17. Those problems are particularly challenging because they involve big data and big compute.

The good news is that new hardware architectures are emerging with the rise of hybrid computing. Computers are increasingly built as a combination of traditional CPUs and innovative devices like GPUs, TPUs, FPGAs, and ASICs. Those hybrid machines can run significantly faster than legacy computers.

The bad news is that hybrid computers are hard to program, and each of them is specific: code written for a GPU won’t run on an FPGA, and it may not even run on a different generation of the same device. Consequently, software developers and software vendors are reluctant to jump into the fray, and data scientists and statisticians are left out of the performance gains. So there is a gap, a big gap in fact.

To fill that gap is the raison d’être of my new book, Deep Learning for Numerical Applications with SAS. Check it out and visit the SAS Risk Management Community to share your thoughts and concerns on this cross-industry topic.

Deep learning for numerical analysis explained was published on SAS Users.