October 2, 2019
 

I frequently see questions on SAS discussion forums about how to compute the geometric mean and related quantities in SAS. Unfortunately, the answers to these questions are sometimes confusing or even wrong. In addition, some published papers and web sites that claim to show how to calculate the geometric mean in SAS contain wrong or misleading information.

This article shows how to compute the geometric mean, the geometric standard deviation, and the geometric coefficient of variation in SAS. It first shows how to use PROC TTEST to compute the geometric mean and the geometric coefficient of variation. It then shows how to compute several geometric statistics in the SAS/IML language.

For an introduction to the geometric mean, see the article "What is a geometric mean?" For information about the (arithmetic) coefficient of variation (CV) and its applications, see the article "What is the coefficient of variation?"

Compute the geometric mean and geometric CV in SAS

As discussed in my previous article, the geometric mean arises naturally when positive numbers are being multiplied and you want to find the average multiplier. Although the geometric mean can be used to estimate the "center" of any set of positive numbers, it is frequently used to estimate average values in a set of ratios or to compute an average growth rate.

The TTEST procedure is the easiest way to compute the geometric mean (GM) and geometric CV (GCV) of positive data. To demonstrate this, the following DATA step simulates 100 random observations from a lognormal distribution. PROC SGPLOT shows a histogram of the data and overlays a vertical line at the location of the geometric mean.

%let N = 100;
data Have;
call streaminit(12345);
do i = 1 to &N;
   x = round( rand("LogNormal", 3, 0.8), 0.1);    /* generate positive values */
   output;
end;
run;
 
title "Geometric Mean of Skewed Positive Data";
proc sgplot data=Have;
   histogram x / binwidth=10 binstart=5 showbins;
   refline 20.2 / axis=x label="Geometric/Mean" splitchar="/" labelloc=inside
                  lineattrs=GraphData2(thickness=3);
   xaxis values=(0 to 140 by 10);
   yaxis offsetmax=0.1;
run;

Where is the "center" of these data? That depends on your definition. The mode of this skewed distribution is close to x=15, but the arithmetic mean is about 26.4. The mean is pulled upwards by the long right tail. It is a mathematical fact that the geometric mean of data is never greater than the arithmetic mean; the two are equal only when all the data values are identical. For these data, the geometric mean is 20.2.
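If you want to verify the arithmetic statistics for the simulated data, the following PROC MEANS call is a minimal check (the Have data set comes from the DATA step above):

proc means data=Have n mean maxdec=1;
   var x;      /* the arithmetic mean is about 26.4 */
run;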

To compute the geometric mean and geometric CV, you can use the DIST=LOGNORMAL option on the PROC TTEST statement, as follows:

proc ttest data=Have dist=lognormal; 
   var x;
   ods select ConfLimits;
run;

The geometric mean, which is 20.2 for these data, estimates the "center" of the data. Notice that the procedure does not report the geometric standard deviation (or variance), but instead reports the geometric coefficient of variation (GCV), which has the value 0.887 for this example. The documentation for the TTEST procedure explains why the GCV is the better measure of variation: "For lognormal data, the CV is the natural measure of variability (rather than the standard deviation) because the CV is invariant to multiplication of [the data] by a constant."

You might wonder whether the data need to be lognormally distributed to use this table. They do not: the geometric mean and geometric CV are meaningful for any positive data. However, the 95% confidence intervals for these quantities assume lognormality.

Definitions of geometric statistics

As T. Kirkwood points out in a letter to the editors of Biometrics (Kirkwood, 1979), if data are lognormally distributed as LN(μ, σ), then

  • The quantity GM = exp(μ) is the geometric mean. It is estimated from a sample by the quantity exp(m), where m is the arithmetic mean of the log-transformed data.
  • The quantity GSD = exp(σ) is defined to be the geometric standard deviation. The sample estimate is exp(s), where s is the standard deviation of the log-transformed data.
  • The geometric standard error (GSE) is defined by exponentiating the standard error of the mean of the log-transformed data. Geometric confidence intervals are handled similarly.
  • Kirkwood's proposal for the geometric coefficient of variation (GCV) is not generally used. Instead, the accepted definition of the GCV is GCV = sqrt(exp(σ²) – 1), which is the definition that is used in SAS. The estimate for the GCV is sqrt(exp(s²) – 1).

You can use these formulas to compute the geometric statistics for any positive data. However, only for lognormal data do the statistics have a solid theoretical basis: transform to normality, compute a statistic, apply the inverse transform.

Compute the geometric mean in SAS/IML

You can use the SAS/IML language to compute the geometric mean and other "geometric statistics" such as the geometric standard deviation and the geometric CV. The GEOMEAN function is a built-in SAS/IML function, but the other statistics are implemented by explicitly computing statistics of the log-transformed data, as described in the previous section:

proc iml;
use Have; read all var "x"; close;  /* read in positive data */
GM = geomean(x);               /* built-in GEOMEAN function */
print GM;
 
/* To estimate the geometric mean and geometric StdDev, compute
   arithmetic estimates of log(X), then EXP transform the results. */
n = nrow(x);
z = log(x);                  /* log-transformed data */
m = mean(z);                 /* arithmetic mean of log(X) */
s = std(z);                  /* arithmetic std dev of log(X) */
GM2 = exp(m);                /* same answer as GEOMEAN function */
GSD = exp(s);                /* geometric std dev */
GCV = sqrt(exp(s**2) - 1);   /* geometric CV */
print GM2 GSD GCV;

Note that the GM and GCV match the output from PROC TTEST.

What does the geometric standard deviation mean? As with the arithmetic mean, you need to start by thinking about the location of the geometric mean (20.2). If the data are normally distributed, then about 68% of the data are within one standard deviation of the mean, which is the interval [m–s, m+s]. For lognormal data, about 68% of the data should be in the interval [GM/GSD, GM*GSD] and, in fact, 65 out of 100 of the simulated observations are in that interval. Similarly, about 95% of lognormal data should be in the interval [GM/GSD², GM*GSD²]. For the simulated data, 94 out of 100 observations are in the interval, as shown below:
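One way to verify these counts is to compute the proportion of the sample in each interval. The following statements are a minimal sketch that continues the PROC IML session from the previous section, in which GM and GSD were computed:

/* proportion of observations within one and two geometric std devs of GM */
p1 = (x >= GM/GSD    & x <= GM*GSD)[:];      /* expect about 0.68; here 0.65 */
p2 = (x >= GM/GSD**2 & x <= GM*GSD**2)[:];   /* expect about 0.95; here 0.94 */
print p1 p2;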

I am not aware of a similar interpretation of the geometric coefficient of variation. The GCV is usually used to compare two samples. Unlike the intervals in the previous paragraph, the GCV does not make any reference to the geometric mean of the data.

Other ways to compute the geometric mean

The methods in this article are the simplest ways to compute the geometric mean in SAS, but there are other ways.

  • You can use the DATA step to log-transform the data, use PROC MEANS to compute the descriptive statistics of the log-transformed data, then use the DATA step to exponentiate the results, as sketched after this list.
  • PROC SURVEYMEANS can compute the geometric mean (with confidence intervals) and the standard error of the geometric mean for survey responses. However, the variance of survey data is not the same as the variance of a random sample, so you should not use the standard error statistic unless you have survey data.
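Here is a minimal sketch of the DATA step and PROC MEANS approach from the first bullet. The data set and variable names (Have and x) are from the earlier example; the intermediate data set names are arbitrary:

/* log-transform the data */
data LogHave;
   set Have;
   z = log(x);       /* requires x > 0 */
run;

/* arithmetic mean and std dev of log(x) */
proc means data=LogHave noprint;
   var z;
   output out=LogStats mean=m std=s;
run;

/* exponentiate to obtain the geometric statistics */
data GeoStats;
   set LogStats;
   GM  = exp(m);                 /* geometric mean */
   GSD = exp(s);                 /* geometric std dev */
   GCV = sqrt(exp(s**2) - 1);    /* geometric CV */
run;

proc print data=GeoStats noobs;
   var GM GSD GCV;
run;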

As I said earlier, there is some bad information out there on the internet about this topic, so beware. A site that seems to get all the formulas correct and present the information in a reasonable way is Alex Kritchevsky's blog.

You can download the complete SAS program that I used to compute the GM, GSD, and GCV. The program also shows how to compute confidence intervals for these quantities.

The post Compute the geometric mean, geometric standard deviation, and geometric CV in SAS appeared first on The DO Loop.

October 1, 2019
 

Did you know the first SAS® Users Group event took place before SAS was incorporated as a company? In 1976, hundreds of early SAS users gathered in sunny Kissimmee, FL to share tips and offer feedback before SAS was even officially a company. Our users have continued to influence the [...]

Customer experience matters was published on SAS Voices by Randy Guard

October 1, 2019
 

On behalf of the entire global Customer Contact Center, “Happy CX Day!” to all our SAS users!

Customer Experience Day (aka #CXDay2019) is one of our favorite days of the year—when we can reflect on customer interactions, questions and feedback from the past year—and look to the year ahead for ways to enhance our customers’ experience, like expanding our support options and helping drive improvements to our website and self-service offerings.

Plus, it gives us another excuse to celebrate all of our wonderful SAS users and partners! Whoohoo!

The opportunity to join our users on their journeys with SAS—from acquiring SAS, to learning, updating and renewing it, to attending SAS events, and beyond---is one of our favorite aspects of our jobs!

To show our appreciation, we want to share a little love.

 

 

….they are curious and passionate, and ask great questions that help us learn something new each day! - Mary

…they are passionate and dedicated to learning SAS! I love how users are always willing to help with new users’ questions in our SAS Communities groups! - Antionen

…they’re using SAS in such innovative and amazing ways to help make the world a better place. - Tricia

… we get satisfaction by providing the answers to tough questions. It makes our jobs worthwhile knowing that we have added personal value to anyone who has reached out to SAS. - Lida

...by caring for people, we’re a part of making the unimaginable possible. - Keila

…of the exciting ways they’re applying SAS to change lives – from new advancements in cancer research, clinical trials and drug testing, learning about species and ecosystems in efforts to protect endangered species and biodiversity, to impacting young lives by using advanced analytics to measure, as well as impact, student progress in K-12. - Lisa

Not familiar with the Customer Contact Center? We’re the folks who answer your SAS inquiries and point you in the right direction to get the help you need. Well, that’s part of what we do! We don’t just answer questions, we’re also listening to you and looking at ways to make things easier to navigate, simpler to find, and faster to share.

Get to know us better! Fun facts about our team:

  • We’re located in four SAS offices:
  • Collectively, our team speaks and supports 17 languages
  • …and supports nearly 100 countries
  • Some engagement professionals on our team speak more than three languages
  • We collaborate with just about every team at SAS
  • In 2018, we received over 113,000 inquiries worldwide
  • ~55% of SAS customers choose live chat as their communication channel
  • So far this year, we’ve received over 80,000 inquiries from around the world
  • The two most common SAS topics we’re asked about are SAS Training and Analytics U options

If you need any help, want to share feedback, or simply want to talk SAS, please reach out to us!

You can chat with us live on the SAS website, tweet us @SAS_Cares, or contact us via phone, email, or web form.

Want to collaborate with other SAS users? Search for answers or post questions in the SAS Support Communities. This is a great resource for your usage questions! If you're new to SAS, consider frequenting the New SAS User Community, where friendly, knowledgeable volunteers like KurtBremser are eager to help.

Wishing you a fabulous CX Day!

Happy Customer Experience Day! was published on SAS Users.

September 30, 2019
 

There are several different kinds of means. They all try to find an average value from among a set of numbers. Although the most popular mean is the arithmetic mean, the geometric mean can be useful for problems in statistics, finance, and biology. A common application of the geometric mean is to find an average growth rate for an asset or for a population.

What is the geometric mean?

The geometric mean of n nonnegative numbers is the nth root of the product of the numbers:
GM = (x1 * x2 * ... * xn)^(1/n) = (Π xi)^(1/n)
When the numbers are all positive, the geometric mean is equivalent to computing the arithmetic mean of the log-transformed data and then using the exponential function to back-transform the result: GM = exp( (1/n) Σ log(xi) ).
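As a quick numerical check of this equivalence, the following DATA step is a minimal sketch that computes the geometric mean both ways for two arbitrary positive numbers:

data _null_;
   array x[2] (2, 8);             /* two arbitrary positive values */
   prod = 1;  sumlog = 0;
   do i = 1 to dim(x);
      prod = prod * x[i];
      sumlog + log(x[i]);
   end;
   GM1 = prod ** (1/dim(x));      /* nth root of the product */
   GM2 = exp(sumlog / dim(x));    /* back-transformed mean of the logs */
   put GM1= GM2=;                 /* both statements compute GM = 4 */
run;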

Physical interpretation of the geometric mean

The geometric mean has a useful interpretation in terms of the volume of an n-dimensional rectangular solid. If GM is the geometric mean of n positive numbers, then GM is the length of the side of an n-dimensional cube that has the same volume as the rectangular solid with side lengths x1, x2, ..., xn. For example, a rectangular solid with sides 1.5, 2, and 3 has a volume V = 9. The geometric mean of those three numbers is ((1.5)(2)(3))^(1/3) ≈ 2.08. The volume of a cube with sides of length 2.08 is 9.

This interpretation in terms of volume is analogous to interpreting the arithmetic mean in terms of length. Namely, if you have n line segments of lengths x1, x2, ..., xn, then the total length of the segments is the same as the length of n copies of a segment of length AM, where AM is the arithmetic mean.

What is the geometric mean good for?

The geometric mean can be used to estimate the "center" of any set of positive numbers but is frequently used to estimate an average value in problems that deal with growth rates or ratios. For example, the geometric mean is an average growth rate for an asset or for a population. The following example uses the language of finance, although you can replace "initial investment" by "initial population" if you are interested in population growth.

In precalculus, growth rates are introduced in terms of the growth of an initial investment amount (the principal) compounded yearly at a fixed interest rate, r. After n years, the principal (P) is worth
An = P(1 + r)^n
The quantity 1 + r sometimes confuses students. It appears because when you add the principal (P) and the interest (Pr), you get P(1 + r).

In precalculus, the interest rate is assumed to be a constant, which is fine for fixed-rate investments like bank CDs. However, many investments have a growth rate that is not fixed but varies from year to year. If the growth rate of the investment is r1 during the first year, r2 during the second year, ..., and rn during the nth year, then after n years the investment is worth
An = P(1 + r1)(1 + r2)...(1 + rn)
  = P (Π xi), where xi = 1 + ri.

What is the average growth rate for the investment? One interpretation of the average growth rate is the fixed rate that would give the same return after n years. That hypothetical fixed rate is found by using the geometric mean of the values x1, x2, ..., xn. That is, if GM is the geometric mean of the xi, the value
An = P*GM^n,
which assumes a fixed interest rate, is exactly the same as for the varying-rate computation.
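The following DATA step is a minimal sketch of this computation. The yearly rates in the r array are hypothetical values, chosen only for illustration:

data _null_;
   array r[4] (0.10, -0.05, 0.08, 0.02);   /* hypothetical yearly rates */
   P = 1000;                               /* initial investment */
   prod = 1;
   do i = 1 to dim(r);
      prod = prod * (1 + r[i]);            /* multiply the growth factors xi = 1 + ri */
   end;
   GM = prod ** (1/dim(r));                /* geometric mean = average growth factor */
   A_varying = P * prod;                   /* value after the varying rates */
   A_fixed   = P * GM**dim(r);             /* value at the equivalent fixed rate */
   put GM= A_varying= A_fixed=;            /* the two ending values agree */
run;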

An example of the geometric mean: The growth rate of gold

Let's apply the ideas in the preceding section. Suppose that you bought $1000 of gold on Jan 1, 2010. The following table gives the yearly rate of return for gold during the years 2010–2018, along with the value of the $1000 investment at the end of each year.

According to the table, the value of the investment after 9 years is $1160.91, which represents a total return of about 16%. What is the fixed rate that would give the same return after 9 years when compounded annually? That is found by computing the geometric mean of the numbers in the third column:
GM = (1.2774 * 1.1165 * ... * 0.9885)^(1/9) = 1.01672
In other words, the investment in gold yielded the same return as a fixed-rate bank CD at 1.672% that is compounded yearly for 9 years. The end-of-year values for both investments are shown in the following graph.

The geometric mean in statistics

The geometric mean arises naturally in situations in which quantities are multiplied together. This happens so often that there is a probability distribution, called the lognormal distribution, that models this situation. If Z has a normal distribution, then you can obtain a lognormal distribution by applying the exponential transformation: X = exp(Z) is lognormal. The Wikipedia article for the lognormal distribution states, "The lognormal distribution is important in the description of natural phenomena... because many natural growth processes are driven by the accumulation of many small percentage changes." For lognormal data, the geometric mean is often more useful than the arithmetic mean. In my next blog post, I will show how to compute the geometric mean and other associated statistics in SAS.
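For example, the following DATA step is a minimal sketch that generates lognormal data by exponentiating normal variates; the parameters of the normal distribution are arbitrary:

data Lognormal;
   call streaminit(123);
   do i = 1 to 1000;
      Z = rand("Normal", 0, 1);   /* Z is normally distributed */
      X = exp(Z);                 /* X is lognormally distributed */
      output;
   end;
run;

For this simulation, the geometric mean of X estimates exp(0) = 1, which is also the median of the lognormal distribution.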

The post What is a geometric mean? appeared first on The DO Loop.

September 28, 2019
 

This article continues a series that began with Machine learning with SASPy: Exploring and preparing your data (part 1). Part 1 showed you how to explore data using SASPy with Python. Here, in part 2, you will learn how to begin to prepare your data to use it within a machine-learning model.

Review part 1 if needed and ensure you still have the ADULT data set ready to use. (The data set is available from the UCI Machine Learning Repository.) If not, take some time to download and explore the data again, as described in part 1.

Preparing your data

Preparing data is a necessary step before applying the data to a model. There are string values, skewed data, and missing data points to consider. Be sure to handle any missing values in the data set first, so that you can move on to the other preparation methods.

For this exercise, you will explore how to transform skewed features using SASPy and Pandas.

First, you must separate the income data from the data set, because the income feature will later become your target variable to model.

Drop the income data and turn the pandas data frame back into a SAS data object, with the following code:

Now, let's take a second look at the numerical features. You will use SASPy to create a histogram of all numerical features. Typically, the Matplotlib library is used, but SASPy provides great opportunities to visualize the data.

The following graphs represent the expected output.

Taking a look at the numerical features, two values stick out. CAPITAL_GAIN and CAPITAL_LOSS are highly skewed. Highly skewed features can affect your model, as most models try to maintain a normally distributed curve. To fix this, you will apply a logarithmic transformation using pandas and then visualize the change using SASPy.

Transforming skewed features

First, you need to change the SAS data object back into a pandas data frame and assign the skewed features to a list variable:

Then, use pandas to apply the logarithmic transformation and convert the pandas data frame back into a SAS data object:

Display transformed data

Now, you are ready to visualize these changes using SASPy. In the previous section, you used histograms to display the data. To display this transformation, you will use the SASPy SASUTIL class. Specifically, you will use a procedure typically used in SAS, the UNIVARIATE procedure.

To use the SASUTIL class with SASPy, you first need to create a Python object that uses the SASUTIL class:

 

Now, use the univariate function from SASPy:

 

Using the UNIVARIATE procedure, you can set axis limits to the output histograms so that you can see the data in a clearer format. After running the selected code, you can use the dir() function to verify successful submission:

 

Here is the output:

 

 

 

The function calculates various descriptive statistics and plots. However, for this example, the focus is on the histogram.

 

Here are the results:

Wrapping up

You have now transformed the skewed data. Pandas applied the logarithmic transformation and SASPy displayed the histograms.

Up next

In the next and final article of this series, you will continue preparing your data by normalizing numerical features and one-hot encoding categorical features.

Machine learning with SASPy: Exploring and preparing your data (part 2) was published on SAS Users.

September 26, 2019
 

In part 1 of this post, we looked at setting up Spark jobs from SAS Cloud Analytic Services (CAS) to load and save data to and from Hadoop. Now we are moving on to the next step in the analytic cycle, scoring data in Hadoop and executing SAS code as a Spark job. The Spark scoring jobs execute using SAS In-Database Technologies for Hadoop.

The integration of the SAS Embedded Process and Hadoop allows scoring code to run directly on Hadoop. As a result, publishing and scoring of both DS2 and DATA step models occur inside Hadoop. Furthermore, access to Spark data exists through the SAS Workspace Server or the SAS Compute Server using SAS/ACCESS to Hadoop, or from the CAS server using SAS Data Connectors.

Scoring Data from CAS using Spark

SAS PROC SCOREACCEL provides an interface to the CAS server for DS2 and DATA step model publishing and scoring. Model code is published from CAS to Spark and then executed via the SAS Embedded Process.

PROC SCOREACCEL supports a file interface for passing the model components (model program, format XML, and analytic stores). The procedure reads the specified files and passes their contents on to the model-publishing CAS action. In this case, the files must be visible from the SAS client.

The CAS publishModel and runModel actions publish the model code and then execute it to score data in Spark:

 
%let CLUSTER="/opt/sas/viya/config/data/hadoop/lib:
  /opt/sas/viya/config/data/hadoop/lib/spark:/opt/sas/viya/config/data/hadoop/conf";
proc scoreaccel sessref=mysess1;
   publishmodel
      target=hadoop
      modelname="simple01"
      modeltype=DS2
      /* filelocation=local */
      programfile="/demo/code/simple.ds2"
      username="cas"
      modeldir="/user/cas"
      classpath=&CLUSTER.
   ;
   runmodel
      target=hadoop
      modelname="simple01"
      username="cas"
      modeldir="/user/cas"
      server="hadoop.server.com"
      intable="simple01_scoredata"
      outtable="simple01_outdata"
      forceoverwrite=yes
      classpath=&CLUSTER.
      platform=SPARK
   ;
quit;

In the PROC SCOREACCEL example above, a DS2 model is published to Hadoop and executed with the Spark processing engine. The CLASSPATH= parameter specifies a link to the Hadoop cluster. The input and output tables, simple01_scoredata and simple01_outdata, already exist on the Hadoop cluster.

Score data in Spark from CAS using SAS Scoring Accelerator

SAS Scoring Accelerator execution status in YARN as a Spark job from CAS

As you can see in the image above, the model is scored in Spark using the SAS Scoring Accelerator. The Spark job name reflects the input and output tables.

Scoring Data from MVA SAS using Spark

Steps to run a scoring model in Hadoop:

  1. Create a traditional scoring model using SAS Enterprise Miner or an analytic store scoring model, generated using SAS Factory Miner HPFOREST or HPSVM components.
  2. Specify the Hadoop connection attributes: %let indconn= user=myuserid;
  3. Use the INDCONN macro variable to provide credentials to connect to the Hadoop HDFS and Spark. Assign the INDCONN macro before running the %INDHD_PUBLISH_MODEL and the %INDHD_RUN_MODEL macros.
  4. Run the %INDHD_PUBLISH_MODEL macro.
  5. With traditional model scoring, the %INDHD_PUBLISH_MODEL macro performs multiple tasks using some of the files created by the SAS Enterprise Miner Score Code Export node. Using the scoring model program (the score.sas file), the properties file (the score.xml file), and (if the training data includes SAS user-defined formats) a format catalog, the macro performs the following tasks:

    • translates the scoring model into the sasscore_modelname.ds2 file, which runs scoring inside the SAS Embedded Process
    • takes the format catalog, if available, and produces the sasscore_modelname.xml file. This file contains user-defined formats for the published scoring model.
    • uses SAS/ACCESS Interface to Hadoop to copy the sasscore_modelname.ds2 and sasscore_modelname.xml scoring files to HDFS
  6. Run the %INDHD_RUN_MODEL macro.

The %INDHD_RUN_MODEL macro initiates a Spark job that uses the files generated by the %INDHD_PUBLISH_MODEL macro to execute the DS2 program. The Spark job stores the DS2 program output in the HDFS location specified by either the OUTPUTDATADIR= argument or by the corresponding element in the HDMD file.

Here is an example:

 
option set=SAS_HADOOP_CONFIG_PATH="/opt/sas9.4/Config/Lev1/HadoopServer/conf";
option set=SAS_HADOOP_JAR_PATH="/opt/sas9.4/Config/Lev1/HadoopServer/lib:/opt/sas9.4/Config/Lev1/HadoopServer/lib/spark";
 
%let scorename=m6sccode;
%let scoredir=/opt/code/score;
option sastrace=',,,d' sastraceloc=saslog;
option set=HADOOPPLATFORM=SPARK;
 
%let indconn = %str(USER=hive HIVE_SERVER='hadoop.server.com');
%put &indconn;
%INDHD_PUBLISH_MODEL( dir=&scoredir., 
        datastep=&scorename..sas,
        xml=&scorename..xml,
        modeldir=/sasmodels,
        modelname=m6score,
        action=replace);
 
%INDHD_RUN_MODEL(inputtable=sampledata,
    outputtable=sampledata9score,
    scorepgm=/sasmodels/m6score/m6score.ds2,
    trace=yes,
    platform=spark);
Score data in Spark from MVA SAS using Spark

SAS Scoring Accelerator execution status in YARN as a Spark job from MVA SAS

To execute the job in Spark, either set the HADOOPPLATFORM= option to SPARK or set PLATFORM= to SPARK inside the INDHD_RUN_MODEL macro. The SAS Scoring Accelerator uses SAS Embedded Process to execute the Spark job with the job name containing the input table and output table.

Executing user-written DS2 code using Spark

User-written DS2 programs can be complex. When running inside a database, a code accelerator execution plan might require multiple phases. Because the generated Scala program integrates with the SAS Embedded Process program interface to Spark, the many phases of a Code Accelerator job reduce to a single Spark job.

In-Database Code Accelerator

The SAS In-Database Code Accelerator on Spark is a combination of generated Scala programs, Spark SQL statements, HDFS file access, and DS2 programs. The SAS In-Database Code Accelerator for Hadoop enables the publishing of user-written DS2 thread or data programs to Spark, executes them in parallel, and exploits Spark’s massively parallel processing. Examples of DS2 thread programs include large transpositions, computationally complex programs, scoring models, and BY-group processing.

Below is a table of required DS2 options to execute as a Spark job.

  • DS2ACCEL: set to YES
  • HADOOPPLATFORM: set to SPARK

 

There are six different ways to run the code accelerator inside Spark, called CASES. The generation of the Scala program by the SAS Embedded Process Client Interface depends on how the DS2 program is written. In the following example, we are looking at Case 2, which is a thread and a data program, neither of them with a BY statement:

 
proc ds2 ds2accel=yes;
thread work.workthread / overwrite=yes;
   method run();
      set hdplib.cars;
      output;
   end;
endthread;
run;

data hdplib.carsout (overwrite=yes);
   dcl thread work.workthread m;
   dcl double count;
   keep count make model;
   method run();
      set from m;
      count + 1;
      output;
   end;
enddata;
run;
quit;

The entire DS2 program runs in two phases. The DS2 thread program runs during Phase One, and its tasks execute in parallel. The DS2 data program runs during Phase Two using a single task.

Finally

With SAS Scoring Accelerator and Spark integration, users have the power and flexibility to process and score modeled data in Spark. SAS Code Accelerator and Spark integration takes that flexibility further by enabling the processing of any Spark data with DS2 code. Furthermore, it is now possible for businesses to respond to use cases immediately and with higher reliability in the big data space.

Data and Analytics Innovation using SAS & Spark - part 2 was published on SAS Users.

September 26, 2019
 

Mirror, mirror on the wall, whose conference presentations are the best of all?

Ok, well it doesn’t quite go that way in the fairy tale, but remakes and reimaginings of classic tales have been plentiful in books (see The Shadow Queen), on the big screen (see Maleficent, which is about to get a sequel), on the little screen (see the seven seasons of Once Upon a Time) and even on stage and screen (see Into the Woods). So, why not take some liberties in the service of analytics?

For this blog, I have turned our analytics mirror inward and gazed at the social media messages from four SAS conferences: SAS Global Forum 2018 in Denver, Analytics Experience 2018 in San Diego, Analytics Experience 2018 in Milan, and the 2019 Analyst Conference in Naples. While simply counting retweets could provide insight into what was popular, I wanted to look deeper to answer the question: What SAS conference presenters were most praised in social media and how? Information extraction, specifically fact extraction, could help with answering those questions.

Data preparation

Once upon a time, in a land far far away, there was a collection of social media messages, mostly Tweets, that the SAS social media department was kind enough to provide. I didn’t do much in terms of data preparation. I was only interested in unique messages, so I used Excel to remove duplicates based on the “Message” column.

Additionally, I kept only messages for which the language was listed as English, using the “language” column that was already provided in the data. SAS Text Analytics products support 33 languages, but for the purposes of this investigation I chose to focus on English only because the presentations were in English. Then, I imported this data, which was about 4,400 messages, into SAS Visual Text Analytics to explore it and create an information extraction model.

While exploring the data, I noticed that most of the tweets were in fact positive. Additionally, negation, such as “not great” for example, was generally absent. I took this finding into consideration while building my information extraction model: the rules did not have to account for negation, which made for a simpler model. No conniving sorcerer to battle in this tale!

Information extraction model

The magic wand here was SAS Visual Text Analytics. I created a rather simple concepts model with a top-level concept named posPerson, which was extracting pairs of mentions of presenters and positive words occurring within two sentences of the mentions of presenters. The model included several supporting concepts, as shown in this screenshot from SAS Visual Text Analytics concepts node.

Before I explain a little bit about each of the concepts, it is useful to understand how they are related together in the hierarchy represented in the following diagram. The lower-level concepts in the diagram are referenced in the rules of the higher-level ones.

Extending predefined concepts

The magic wand already came with predefined concepts such as nlpPerson and nlpOrganization (thanks, fairy godmother, ahem, SAS linguists). These concepts are included with Visual Text Analytics out of the box and allow users to tap into the knowledge of the SAS linguists for identifying person and organization names. Because Twitter handles, such as @oschabenberger and @randyguard, are not included in these predefined concepts, I expanded the predefined concepts with custom ones. The custom concepts for persons and organizations, customPerson and customOrg, referenced matches from the predefined concepts in addition to rules for combining the symbol @ from the atSymbol concept and various Twitter handles known to belong to persons and organizations, respectively. Here is the simple rule in the atSymbol concept that helps to accomplish this task:

CLASSIFIER:@ 

The screenshot below shows how the atSymbol concept and the personHandle concept are referenced together in the customPerson concept rule and produce matches, such as @RobertoVerganti and @mabel_pooe. Note also how the nlpPerson concept is referenced to produce matches, such as Oliver Schabenberger and Mary Beth Moore, in the same customPerson concept.

If you are interested to learn more about information extraction rules like the ones used in this blog, check out the book SAS Text Analytics for Business Applications: Concept Rules for Information Extraction Models, which my colleagues Teresa Jade and Michael Wallis co-authored with me. It’s a helpful guide for using your own magic wand for information extraction!

Exploratory use of the Sandbox

Visual Text Analytics also comes with its own crystal ball: the Sandbox feature. In the Sandbox, I refined the concept rules iteratively and was able to run the rules for each concept faster than running the entire model. Gazing into this crystal ball, I could quickly see how rule changes for one concept impacted matches.

In an exploratory step, I made the rules in the personHandle concept as general as possible, using part of speech tags such as :N (noun) and :PN (proper noun) in the definitions. As I explored the matches to those rules, I was able to identify matches that were actually organization handles, which I then added as CLASSIFIER rules to the orgHandle concept by double-clicking on a word or phrase and right-clicking to add that string to a rule.

I noticed that some handles were very similar to each other and REGEX rules more efficiently captured the possible combinations. Consult the book referenced above if you’re interested in understanding more about different rule types and how to use them effectively. After moving the rules to the Edit Concept tab, the rules for orgHandle included some of the ones in the following screenshot.

Automatic concept rule generation

Turning now to the second part of the original question, which was what words and phrases people used to praise the presenters, the answers came from two custom concepts: posAdj and posPhrase. The posAdj concept had rules that captured adjectives with positive sentiment, such as the following:

Most of these were captured from the text of the messages in the same manner as the person and organization Twitter handles.

But, the first two were created automatically by way of black magic! When I selected a term from the Textual Elements, as you can see below for the term “great”, the system automatically created the first rule in the concept above, including also the comparative form, “greater,” and the superlative, “greatest.” This is black magic harnessing the power of stemming or lemmatization.

The concept posPhrase built onto the posAdj concept by capturing the nouns that typically follow the adjectives in the first concept as well as a few other strings that have a positive connotation.

Filtering with global rules

Because the rules created overlapping matches, I took advantage of a globalRule concept, which allowed me to distinguish between the poisoned apples and the edible ones. Global rules served the following purposes:

  1. to remove matches from the more generally defined customPerson concept that were also matched for the customOrg concept
  2. to remove matches from the posAdj concept (such as “good”) that were also matched in the posPhrase concept (such as “good talk”)
  3. to remove false positives

As an example of a false positive, consider the following rule:

REMOVE_ITEM:(ALIGNED, "Data for _c{posAdj}", "Data for Good") 

Because the phrase “Data for Good” is a name of a program, the word “good” should not be taken into consideration in evaluating the positive mention. Therefore, the REMOVE_ITEM rule stated that when the posAdj concept match “good” is part of the phrase “Data for Good,” it should be removed from the posAdj concept matches.

Automatic fact rule generation

The top-most concept in the model, posPerson, took advantage of a magic potion called automatic fact rule building, which is another new feature added to the Visual Text Analytics product in the spring of 2019. This feature was used to put together matches from the posAdj and posPhrase concepts with matches from the customPerson concept without constructing the rule myself. It is a very useful feature for newer rule writers who want to explore the use of fact rules.

As input into the cauldron to make this magic potion, I selected the posAdj and customPerson concepts. These are the concepts I wanted the system to relate as facts.

I ran the node and inspected the autogenerated magic potion, i.e. the fact rule.

Then I did the same thing with the posPhrase and customPerson concepts. Each of the two rules that were created by Visual Text Analytics contained the SENT operator.

But I wanted to expand the context of the related concepts and tweaked the recipe a bit by replacing SENT with SENT_2 in order to look for matches within two sentences instead of one. I also replaced the names of the arguments, which the rule generation algorithm called concept1 and concept2, with ones that were more relevant to the task at hand, person and pos. Thus, the following rules were created:

PREDICATE_RULE:(person, pos):(SENT_2, "_person{customPerson}", "_pos{posAdj}")
PREDICATE_RULE:(person, pos):(SENT_2, "_person{customPerson}", "_pos{posPhrase}")

Results

So, what did the magic mirror show? Out of the 4,400 messages, I detected a reference to a person in about 1,650 (37%). In nearly 600 of the messages (14%) I extracted a positive phrase and in over 300 (7%) at least one positive adjective. Finally, only 7% (321) of the messages contained both a reference to a person and a positive comment within two sentences of each other.

I changed all but the posPerson and globalRule concepts to “supporting” so they don’t produce results and I can focus only on the relevant results. This step was akin to adjusting the mirror to focus only on the most important things and tuning out the background. You can learn more about this and other SAS Visual Text Analytics features in the User Guide.

Switching from the interactive view to the results view of the concepts node, I viewed the transactional output table.

With one click, I exported and opened this table in Visual Analytics in order to answer the questions which presenters were mentioned most often and in the context of what words or phrases with positive sentiment.

Visualization

With all of the magic items and preparation out of the way, I was ready to build a sparkly palace for my findings; that is, a report in Visual Analytics. On the left, I added a treemap of the most common matches for the person argument. On the right, I added a word cloud with the most common matches for the pos argument and connected it with the treemap on the left. In both cases I excluded missing values in order to focus on the extracted information. With my trees and clouds in place, I turned to the bottom of the report. I added and connected a list table with the message, which was the entire input text, and keywords, which included the span of text from the match for the first argument to the match for the last argument, for an easy reference to the context for the above visualizations.

Based on the visualization on the left, the person with the most positive social media messages was SAS Chief Operating Officer (COO), Dr. Oliver Schabenberger, who accounted for 12% of the messages that contained both a person and a positive comment. His lead was followed by the featured presenters at the Milan conference, Roberto Verganti, Anders Indset and Giles Hutchins. Next most represented were the featured presenters at the San Diego conference, Robyn Benincasa and Dr. Tricia Wang.

Looking at the visualization on the right, some of the most common phrases expressing praise for all the presenters were “important,” “well done,” “great event,” and “exciting.” Quite a few phrases also contain the term “inspiring,” such as “inspiring videos,” “inspiring keynote,” “inspiring talk,” “inspiring speech,” etc.

Because of the connections that I set up in Visual Analytics between these visualizations, if I want to look at what positive phrases were most commonly associated with each presenter, I can click on their name in the treemap on the left; as a result, the word cloud on the right as well as the list table on the bottom will filter out data from other presenters. For example, the view for Oliver Schabenberger shows that the most common positive phrase associated with tweets about him was “great discussion.”

Conclusions

It is not surprising that the highest accolades in this experiment went to SAS’ COO since he participated in all four conferences and therefore had four times the opportunity to garner positive messages. Similarly, the featured presenters probably had larger audiences than breakout sessions, allowing these presenters more opportunities to be mentioned in social media messages. In this case, the reflection in the mirror is not too surprising. And they all lived happily ever after.

What tale does your social media data tell?

Learn more

A data fairy tale: Which speakers are the best received at conferences? was published on SAS Users.

September 25, 2019
 

One of the strengths of the SAS/IML language is its flexibility. Recently, a SAS programmer asked how to generalize a program in a previous article. The original program solved one optimization problem. The reader said that she wants to solve this type of problem 300 times, each time using a different set of parameters. Essentially, she wants to loop over a set of parameter values, solve the corresponding optimization problem, and save each solution to a data set.

Yes, you can do this in SAS/IML, and the technique is not limited to solving a series of optimization problems. Any time you have a parameterized family of problems, you can implement this idea. First, you figure out how to solve the problem for one set of parameters, then you can loop over many sets of parameters and solve each problem in turn.

Solve the problem one time

As I say in my article, "Ten tips before you run an optimization," you should always develop and debug solving ONE problem before you attempt to solve MANY problems. This section describes the problem and solves it for one set of parameters.

For this article, the goal is to find the values (x1, x2) that maximize a function of two variables:
F(x1, x2; a, b) = 1 – (x1 – a)^2 – (x2 – b)^2 + exp( –(x1^2 + x2^2)/2 )
The function has two parameters, (a, b), which are not involved in the optimization. The function is a sum of two terms: a quadratic function and an exponentially decreasing function that looks like the bivariate normal density.

Because the exponential term rapidly approaches zero when (x1, x2) moves away from the origin, this function is a perturbation of a quadratic function. The maximum will occur somewhat close to the value (x1, x2) = (a, b), which is the value for which the quadratic term is maximal. The following graph shows a contour plot for the function when (a, b) = (-1, 2). The maximum value occurs at approximately (x1, x2) = (-0.95, 1.9).

The following program defines the function in the SAS/IML language. In the definition, the parameters (a, b) are put in the GLOBAL statement because they are constant during each optimization. The SAS/IML language provides several nonlinear optimization routines. This program uses the Newton-Raphson optimizer (the NLPNRA subroutine) to solve the problem for (a, b) = (-1, 2).

proc iml;
start Func(x) global (a, b);
   return 1 - (x[,1] - a)##2 - (x[,2] - b)##2 +
            exp(-0.5*(x[,1]##2 + x[,2]##2));
finish;
 
a = -1; b = 2;      /* set GLOBAL parameters */
/* test functions for one set of parameters */
opt = {1,           /* find maximum of function   */
       2};          /* print a little bit of output */
x0 = {0 0};         /* initial guess for solution */
call nlpnra(rc, result, "Func", x0, opt);    /* find maximal solution */
print result[c={"x1" "x2"}];

As claimed earlier, when (a, b) = (-1, 2), the maximum value of the function occurs at approximately (x1, x2) = (-0.95, 1.9).

Solve the problem for many parameters

After you have successfully solved the problem for one set of parameters, you can iterate over a sequence of parameters. The following DATA step specifies five sets of parameters, but it could specify 300 or 3,000. These parameters are read into a SAS/IML matrix. When you solve a problem many times in a loop, it is a good idea to suppress any output. The SAS/IML program suppresses the tables and iteration history for each optimization step and saves the solution and the convergence status:

/* define a set of parameter values */
data Params;
input a b;
datalines;
-1  2
 0  1
 0  4
 1  0
 3  2
 ;
 
proc iml;
start Func(x) global (a, b);
   return 1 - (x[,1] - a)##2 - (x[,2] - b)##2 +
            exp(-0.5*(x[,1]##2 + x[,2]##2));
finish;
 
/* read parameters into a matrix */
varNames = {'a' 'b'}; 
use Params;  read all var varNames into Parms;  close;
 
/* loop over parameters and solve problem */
opt = {1,          /* find maximum of function   */
       0};         /* no print */
Soln = j(nrow(Parms), 2, .);
returnCode = j(nrow(Parms), 1, .);
do i = 1 to nrow(Parms);
   /* assign GLOBAL variables */
   a = Parms[i, 1];  b = Parms[i, 2]; 
   /* For now, use same guess for every parameter. */
   x0 = {0 0};        /* initial guess for solution */
   call nlpnra(rc, result, "Func", x0, opt);
   returnCode[i] = rc;   /* save convergence status and solution vector */
   Soln[i,] = result;
end;
print Parms[c=varNames] returnCode Soln[c={x1 x2} format=Best8.];

The output shows the parameter values (a,b), the status of the optimization, and the optimal solution for (x1, x2). The third column (returnCode) has the value 3 or 6 for these optimizations. You can look up the exact meaning of the return codes, but the main thing to remember is that a positive return code indicates a successful optimization. A negative return code indicates that the optimization terminated without finding an optimal solution. For this example, all five problems were solved successfully.
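If any of the optimizations fail, you can use the saved return codes to identify the parameters that caused problems. The following statements are a minimal sketch that continues the program above:

failedIdx = loc(returnCode < 0);    /* indices of optimizations that failed */
if ncol(failedIdx) > 0 then
   print (Parms[failedIdx,])[c=varNames L="Parameters that did not converge"];
else
   print "All optimizations converged";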

Choosing an initial guess based on the parameters

If the problem is not solved successfully for a certain parameter value, it might be that the initial guess was not very good. It is often possible to approximate the objective function to obtain a good initial guess for the solution. This not only helps ensure convergence, but it often improves the speed of convergence. If you want to solve 300 problems, having a good guess will speed up the total time.

For this example, the objective functions are perturbations of a quadratic function. It is easy to show that the quadratic function has an optimal solution at (x1, x2) = (a, b), and you can verify from the previous output table that the optimal solutions are close to (a, b). Consequently, if you use (a, b) as an initial guess, rather than a generic value like (0, 0), then each problem will converge in only a few iterations. For this problem, the GetInitialGuess function returns (a, b), but in general the function would return a function of the parameters or even solve a simpler set of equations.

/* Choose an initial guess based on the parameters. */
start GetInitialGuess(Parms);
   return Parms;  /* for this problem, the solution is near (x1,x2)=(a,b) */
finish;
 
do i = 1 to nrow(Parms);
   /* assign GLOBAL variables */
   a = Parms[i, 1]; 
   b = Parms[i, 2]; 
   /* Choose an initial guess based on the parameters. */
   x0 = GetInitialGuess(Parms[i,]); /* initial guess for solution */
   call nlpnra(rc, result, "Func", x0, opt);
   returnCode[i] = rc;
   Soln[i,] = result;
end;
print Parms[c=varNames] returnCode Soln[c={x1 x2} format=Best8.];

The solutions and return codes are similar to the previous example. You can see that changing the initial guess changes the convergence status and slightly changes the solution values.

For this simple problem, choosing a better initial guess provides only a small boost to performance. If you use (0,0) as an initial guess, you can solve 3,000 problems in about 2 seconds. If you use the GetInitialGuess function, it takes about 1.8 seconds to solve the same set of optimizations. For other problems, providing a good initial guess might be more important.

In conclusion, the SAS/IML language makes it easy to solve multiple optimization problems, where each problem uses a different set of parameter values. To improve convergence, you can sometimes use the parameter values to compute an initial guess.

The post Solve many optimization problems appeared first on The DO Loop.