
March 17, 2017
 

Editor’s note: This is the second in a series of articles to help current SAS programmers add SAS Viya to their analytics skillset. In this post, Advisory Solutions Architect Steven Sober explores how to accomplish distributed data management using SAS Viya. Read additional posts in the series.

This article in the SAS Viya series will explore how to accomplish distributed data management using SAS Viya. In my next article, we will discuss how SAS programmers can collaborate with their open source colleagues to leverage SAS Viya for distributed data management.

Distributed Data Management

SAS Viya provides a robust, scalable, cloud-ready distributed data management platform. This platform provides multiple data management techniques that run distributed, i.e., using all cores on all compute nodes defined to the SAS Viya platform. The four techniques we will explore here are DATA Step, PROC DS2, PROC FEDSQL and PROC TRANSPOSE. With these four techniques, SAS programmers and open source programmers can quickly apply complex business rules that stage data for downstream consumption, i.e., analytics, visualizations and reporting.

The rule for getting your code to run distributed is to ensure all source and target tables reside in the in-memory component of SAS Viya, i.e., Cloud Analytic Services (CAS).

Starting CAS

The following statement is an example of starting a new CAS session. In the coding examples that follow we will reference this session using the keyword MYSESS. Also note that this CAS session uses one of the default CAS libraries, CASUSER.
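The statement appeared as a screenshot in the original post; a minimal sketch of what it might look like is below (the session name MYSESS and the session options are illustrative assumptions, not requirements):

/* Start a CAS session named MYSESS; the options shown are illustrative */
cas mysess sessopts=(caslib=casuser timeout=1800 locale="en_US");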

Binding a LIBNAME to a CAS session

Now that we have started a CAS session we can bind a LIBNAME to that session using the following syntax:
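The syntax was shown as an image in the original post; a hedged sketch is below (the libref MYCAS is an arbitrary name of my choosing):

/* Bind the libref MYCAS to the CAS session MYSESS */
libname mycas cas sessref=mysess;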

Note: CASUSER is one of the default CAS libraries created when you start a CAS session. In the following coding examples we will utilize CASUSER for our source and target tables that reside in CAS.

To list all default and end-user CAS libraries, use the following statement:
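That statement was also shown as a screenshot; it would look something like this:

/* List all default and end-user CAS libraries available to the session */
caslib _all_ list;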

Click here for more information on CAS libraries.

PROC DS2 and the THREAD program

For PROC DS2 to run in CAS the rules are:

  • The PROC DS2 DATA program must declare a THREAD program
  • The source and target tables must reside in CAS
  • Unlike DATA Step, with PROC DS2 you use the SESSREF= parameter to identify the CAS session in which the source and target tables reside (see the sketch below)
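A minimal, hedged sketch of a DS2 thread and data program running in CAS follows; the table names, variables and business rule are made up for illustration only:

proc ds2 sessref=mysess;
   /* Thread program: runs on all cores of all CAS worker nodes */
   thread compute_pgm / overwrite=yes;
      dcl double total;
      method run();
         set casuser.orders;          /* hypothetical source table in CAS */
         total = price * quantity;    /* hypothetical business rule */
      end;
   endthread;

   /* Data program: consumes the thread and writes the target CAS table */
   data casuser.orders_staged / overwrite=yes;
      dcl thread compute_pgm t;
      method run();
         set from t;
      end;
   enddata;
run;
quit;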

For PROC TRANSPOSE to run in CAS the rules are:

1. All source and target tables must reside in CAS
   a. Like DATA Step, you use a two-level name to reference these tables (see the sketch below)
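Here is a minimal sketch of PROC TRANSPOSE running in CAS; the table and variable names are hypothetical:

/* Both source and target tables use two-level (caslib.table) names,
   so the transpose runs distributed in CAS */
proc transpose data=casuser.sales_wide out=casuser.sales_long
               name=month prefix=amount;
   by product_id;
   var jan feb mar;
run;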

    Collaborative distributed data management using SAS Viya was published on SAS Users.

March 13, 2017
     

    My high school basketball coach started preparing us for the tournaments in the season’s first practice. He talked about the “long haul” of tournament basketball, and geared our strategies toward a successful run at the end of the season.

I thought about the “long haul” when considering my brackets for this year’s NCAA Tournament, and came to this conclusion: instead of seeking to predict who might win a particular game, I wanted to use analytics to identify which teams were most likely to win multiple games. The question I sought to answer was simply, “Could I look at regular season data and recognize the characteristics inherent in teams that win multiple games in the NCAA Tournament?”

I prepared and extracted features from data representing the last 5 regular seasons. I took tournament data from the same period and counted the number of wins per team (per season). This number would be my target value (0, 1, 2, 3, 4, 5, or 6 wins). Only teams that participated in the tournaments were included in the analysis.

    I used SAS Enterprise Miner’s High Performance Random Forest Node to build 10,000 trees (in less than 14 seconds), and I determined my “top 10 stats” by simply observing which factors were split on the most.
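The model itself was built with the Enterprise Miner node, but for readers who prefer code, a rough equivalent using PROC HPFOREST might look like the sketch below. The dataset and feature names are hypothetical; the procedure's variable importance output is what shows which factors were split on most often.

/* Hypothetical sketch: random forest on regular-season features,
   predicting the number of tournament wins (0-6) */
proc hpforest data=mylib.regular_season_features maxtrees=10000;
   target tourney_wins / level=nominal;
   input win_pct road_win_pct max_margin ato tor fg_pct_ratio
         opp_three_pct ft_pct blocks_per_game / level=interval;
run;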

    Here are the results (remember that statistics represented are from the regular season and not the tournament), my “top 10 statistics to consider.”

1 ---  Winning Percentage. Winners win, right? This becomes increasingly evident the further a team moves into the tournament.

    • Teams that win a single game have an average winning percentage of .729
    • Teams that win 6 games have an average winning percentage of .858
    • No team that has won a Final Four game over the last 5 years has a winning percentage less than .706
    • Teams that won 6 games have a minimum winning percentage of .765.

    2 --- Winning games by wide margins. Teams that advance in the tournament have beaten teams by wide margins during the regular season – this means that in some game over the course of the year, a team let go and won big! From a former player’s perspective, it doesn’t matter “who” you beat by a wide margin, but rather do you have the drive to crush the opponent?

• Teams that won 6 games have beaten some team by 49 points, differentiating themselves from even the 5-win teams by 9 points!

    3 --- The ratio of assists to turnovers (ATO). Teams that take care of and distribute the ball tend to be making assists instead of turnovers. From my perspective, the ATO indicates whether or not a team dictates the action.

    • Over the last 5 years, no team that won 6 games had an ATO less than 1.19!
    • Teams that have won at least 5 had an average ATO of 1.04.
• Teams that won fewer than 5 had average ATOs of less than 1.

4 --- Winning percentage on the road. We’ve already noted that overall winning percentage is important, but it’s also important to win on the road, since tournament games are rarely played on a team’s home floor!

    • Teams that don’t win any tournament games win 52% of their road games
    • Teams that win 1-2 games win 57.8%
    • Teams that win 3-5 win 63%
• Teams that win 6 win 78% of their road games, and average only 2.4 (road) losses per year
    • No team that has won at least 5 games has lost more than 5 on the road (in the last 5 years)!

5 --- The ratio of a team’s field goal percentage to the opposition’s field goal percentage. Winning games on the big stage requires both scoring and defense! A ratio above 1 indicates that you score the ball better than you allow your opposition to score.

    • Teams that win 2 or fewer games have a ratio of 1.12
    • Teams that win 3-5 games have a ratio of 1.18
    • Teams that win 6 games have a ratio of 1.23 – no team that has won 6 games had a ratio of less than 1.19!

    6 --- The ratio of turnovers to the turnovers created (TOR). I recall coaches telling me that a turnover committed by our team was essentially a 4-point play: 2 that we didn’t get, and 2 they did.

    • Teams that win the most tournament games have an average TOR of 0.89. This means they turn the ball over at a minimal rate when compared to the turnovers they create.
• Over the past 5 years, teams that won 6 games have an average TOR .11 better than the rest of the pack, which can be interpreted this way: they force the opposition into turnovers roughly 10% more often than they commit turnovers themselves.

7 --- Just as important as beating teams by wide margins are the close games! Close games build character, and provide preparation for the tournament.

    • Teams that win 6 games play more close games than any other group. The average minimum differential for this group is 1.6 points
• Teams winning fewer games average a differential of 1.8 points.

    8 --- Defending the 3. Teams that win more games in the tournament defend the 3 point shot only slightly better than the other teams, but they are twice as consistent in doing it! So, regardless of who’s coming to play, look for some sticky D beyond the arc!

• On average, teams allow a 3-point field goal percentage of .328
• Teams winning the most tournament games defend only slightly better, at .324; however, the standard deviation is the more interesting statistic, indicating that their consistency in defending the 3-point shot is almost twice as good as the other teams!

9 --- Teams that win are good at the stripe! Free throws close games. Make them and get away with the win!

• Teams that win the most games shoot an average of .730 from the line, while the rest of the pack sits at .700

     
    10 --- Teams that win the most games block shots! They play defense, period.

    • Teams that win the most tournament games average over 5 blocks per game.
    • Teams winning 6 games have blocked at least 3.4 shots per game (over the last 5 years)

    Next steps? Take what’s been learned and apply it to this year’s tournament teams, and then as Larry Bird used to do, ask the question, “who’s playing for second?”

In addition to SAS Enterprise Miner, I used SAS Enterprise Guide to prepare the data for analysis, and I used JMP’s Graph Builder to create the graphics. The data was provided by Kaggle.

    The top 10 statistics to consider when filling out your NCAA brackets was published on SAS Users.

March 11, 2017
     

Ensemble models have been used extensively in credit scoring applications and other areas because they are considered to be more stable and, more importantly, to predict better than single classifiers (see Lessmann et al., 2015). They are also known to reduce model bias and variance (Myoung-Jong et al., 2006; Tsai C-F et al., 2011). The objective of this article is to compare the predictive accuracy of four distinct datasets using two ensemble classifiers (gradient boosting (GB) and random forest (RF)) and two single classifiers (logistic regression (LR) and neural network (NN)) to determine whether ensemble models are, in fact, always better. My analysis did not look into optimizing any of these algorithms or feature engineering, which are the building blocks of arriving at a good predictive model. I decided to base my analysis on these four algorithms because they are the most widely used methods.

    What is the difference between a single and an ensemble classifier?

    Single classifier

Individual classifiers pursue different objectives to develop a (single) classification model. Statistical methods either estimate the posterior probability P(+|x) directly (e.g., logistic regression), or estimate the class-conditional probabilities P(x|+), which they then convert into posterior probabilities using Bayes’ rule (e.g., discriminant analysis). Semi-parametric methods, such as NN or SVM, operate in a similar manner, but support different functional forms and require the modeller to select one specification a priori. The parameters of the resulting model are estimated using nonlinear optimization. Tree-based methods recursively partition a data set so as to separate good and bad loans through a sequence of tests (e.g., is loan amount > threshold). This produces a set of rules that facilitate assessing new loan applications. The specific covariates and threshold values to branch a node follow from minimizing indicators of node impurity such as the Gini coefficient or information gain (Baesens, et al., 2003).

    Ensemble classifier

    Ensemble classifiers pool the predictions of multiple base models. Much empirical and theoretical evidence has shown that model combination increases predictive accuracy (Finlay, 2011; Paleologo, et al., 2010). Ensemble learners create the base models in an independent or dependent manner. For example, the bagging algorithm derives independent base models from bootstrap samples of the original data (Breiman, 1996). Boosting algorithms, on the other hand, grow an ensemble in a dependent fashion. They iteratively add base models that are trained to avoid the errors of the current ensemble (Freund & Schapire, 1996). Several extensions of bagging and boosting have been proposed in the literature (Breiman, 2001; Friedman, 2002; Rodriguez, et al., 2006). The common denominator of homogeneous ensembles is that they develop the base models using the same classification algorithm (Lessmann et al., 2015).


Figure 1: Workflow of single vs. ensemble classifiers (derived from the work of Utami et al., 2014)

    Experiment set-up

    Datasets

Before modelling, I partitioned each dataset into a 70% training and a 30% validation set.

    Table 1: Summary of dataset used for model comparisons

    I used SAS Enterprise Miner as a modelling tool.

    Figure 2: Model flow using Enterprise Miner
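Enterprise Miner is point-and-click, but for readers who prefer code, a rough sketch of the same idea with SAS procedures is below: a 70/30 partition, then one single classifier and one ensemble classifier. The variable names follow the well-known HMEQ sample data, and the procedure options are illustrative rather than tuned.

/* Hypothetical sketch: 70/30 partition, then a single and an ensemble classifier */
proc surveyselect data=mylib.hmeq out=hmeq_part samprate=0.7
                  outall seed=20170311;   /* Selected=1 flags training rows */
run;

data train valid;
   set hmeq_part;
   if selected then output train;
   else output valid;
run;

/* Single classifier: logistic regression */
proc logistic data=train;
   class job reason / param=ref;
   model bad(event='1') = loan mortdue value yoj derog job reason;
   score data=valid out=valid_scored;
run;

/* Ensemble classifier: random forest */
proc hpforest data=train maxtrees=500;
   target bad / level=binary;
   input loan mortdue value yoj derog / level=interval;
   input job reason / level=nominal;
run;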

    Results

Table 2: Results showing misclassification rates for all datasets

    Conclusion

Using misclassification rate as the measure of model performance, RF was the best model for Cardata, Organics_Data and HMEQ, followed closely by NN. NN was the best model for Time_series_data, and it performed better than the GB ensemble model on Organics_Data and Cardata.

My findings partly support the hypothesis that ensemble models naturally do better than single classifiers, but not in all cases. NN, a single classifier, can be very powerful: unlike most classifiers (single or ensemble), which are kernel machines and purely data-driven, NN can generalize to unseen data and act as a universal function approximator (Zhang, et al., 1998).

    According to Kaggle CEO and Founder, Anthony Goldbloom:

    “In the history of Kaggle competitions, there are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks”.

    What are your thoughts?

    Are ensemble classifiers always better than single classifiers? was published on SAS Users.

March 9, 2017
     

    SAS® Federation Server provides a central, virtual environment for administering and securing access to your data. It also allows you to combine data from multiple sources without moving or copying the data. SAS Federation Server Manager, a web-based application, is used to administer SAS Federation Server(s).

Data privacy is a major concern for organizations, and one of the features of SAS Federation Server is that it allows you to effectively and efficiently control access to your data, so you can limit who is able to view sensitive data such as credit card numbers, personal identification numbers, names, etc. In this three-part blog series, I will explore the topic of controlling data access using SAS Federation Server. The series covers the following topics:

SAS Metadata Server is used to perform authentication for users and groups in SAS Federation Server, and SAS Federation Server Manager is used to help control access to the data. Note: Permissions applied for a particular data source cannot be bypassed with SAS Federation Server security. If permissions are denied at the source data, for example on a table, then users will always be denied access to that table, no matter what permissions are set in SAS Federation Server.

    In this post, I will build on the examples from my previous articles and demonstrate how you can use data masking to conceal actual data values from users, but still allow them access for analysis and reporting purposes.

    In previous posts, I gave the Finance Users group access to the SALARY table. Linda is a member of the Finance Users group, so currently she has access to the SALARY table.

    However, I want to restrict her access. She needs access to the Salary info for analytic purposes, but does not need to know the identifying data of IDNUM, so I can hide that column from her. She does need the JOBCODE information for her analytics; however, she does not need to know the actual JOBCODE information associated with the record, so that data can be masked to prevent her from viewing that identifying information.

First, I create a FedSQL view of the SALARY table. FedSQL is the implementation of SQL that SAS Federation Server uses to access relational data. For the view, I set the Security option to “Use the definer’s privileges when accessed,” since I will eventually deny Linda the rights to view the table underlying the view.

    Here is the default code for the view:

I change the code to the following to remove the IDNUM column from the view and mask the JOBCODE column, so Linda will not know the real JOBCODE associated with the salary.
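The original post showed the view code as screenshots. Purely as a hedged illustration, a FedSQL view before and after the change might look roughly like the sketch below; the catalog, schema and column names are made up, and the exact arguments of the TRANC masking function are deliberately omitted (consult the Data Masking section of the SAS Federation Server Manager 4.2: User's Guide for its real signature).

/* Default view: simply exposes every column of the SALARY table */
CREATE VIEW "FEDCAT"."FINANCE"."SALARY_VIEW" AS
   SELECT "IDNUM", "JOBCODE", "SALARY"
   FROM "FEDCAT"."FINANCE"."SALARY";

/* Modified view: drop IDNUM and mask JOBCODE; the TRANC call is a placeholder */
CREATE VIEW "FEDCAT"."FINANCE"."SALARY_VIEW" AS
   SELECT TRANC("JOBCODE" /* , ...masking options... */) AS "JOBCODE",
          "SALARY"
   FROM "FEDCAT"."FINANCE"."SALARY";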

    There are several data masking functions available for use. In this instance, I use the TRANC function to mask the JOBCODE field using transliterated values by replacing the first three characters with other values.  Refer to the Data Masking section of the SAS Federation Server Manager 4.2: User’s Guide for more information on the different data masking functions.

    Now that I have created the FedSQL view, I then need to grant Linda authorization to it.

    Next, I need to deny Linda authorization to the SALARY table, so she won’t be able to access the original table.

    Linda is only able to view the SALARY_VIEW with the IDNUM column removed and the JOBCODE information masked.

    Linda is denied access to the SALARY table.

However, Kate, another member of the Finance team, is able to view the full SALARY table with the IDNUM column and the real (non-masked) information in the JOBCODE column.

In this blog entry, I covered the third part of this series on controlling data access with SAS Federation Server 4.2. Other blogs in the series include:

    For more information on SAS Federation Server visit the:

    Securing sensitive data using SAS Federation Server data masking was published on SAS Users.

March 9, 2017
     

    Several months ago, I posted a blog about calculating moving averages for a measure in the Visual Analytics Designer. Soon after that, I was asked about calculating not only the average, but also the standard deviation over a period of months, when the data might consist of one or more repeated values of a measure for each month of a series of N months.  For the example of N=20 months, we might want to view the average and standard deviation over the last n months, where n is any number between 3 and 20.

The example report shown below allows the user to type in a number, n, between 3 and 20, to display a report consisting of: the amount values for the past n months, the Current Month Amt-Previous values, the average over the last n months, the standard deviation over the last n months, and the absolute value of (Current Month Amt – Previous Month Amt) divided by the standard deviation over the last n months. A display rule is applied to the final Abs column, showing green for a value less than 1 and red for a value greater than or equal to 1.

    The data used in this example had multiple Amount values for each month, so we first used the Visual Data Builder to create a SUM aggregation for Amount for each unique Date value.  This step gives more flexibility in using the amount value for aggregations in the designer.

    When the modified data source is initially added to the report, it contains only the Category data item Month, with a format of MMYYYY, and the measure Amount Sum for Month.

    The data will be displayed in a list table. The first columns added to the table will be Month, displayed with a MMYYYY format, and Amount Sum for Month.

    Specify the properties for the list table as below:

    Since we want to display the last n months, we create a new calculated data item, Numeric Date, calculated as below, using the TREATAS operator on the Month data item:

    Then we create the Current Month Amt-Previous aggregated measure using the RelativePeriod date operator:

    Next, create the Avg over all displayed months aggregated measure as below:

    Then, create the Std.Dev. over all displayed months aggregated measure as shown below:

    Create the Abs (Current-Previous/StdDev) as shown below:

Create a numeric parameter, Number of Months, as shown, with a minimum value of 3 (the smallest value for which a standard deviation makes sense) and a maximum value of 20 (the number of months in our data). You can let the default (Current value) be any value that you choose:

    For the List Table, create a Rank, as shown below. Note that we are creating the rank on the Numeric Date (not the Month data item), and rather than a specific value for count, we are going to use the value of the parameter, Number of Months.

    Create a text input object that enables the user to type in a ‘number of months’ between 3 and 20.

    Associate the Parameter with the Text input object:

    If you wish, you can add display rules to sound an alarm whenever there is an alarming month-to-month difference in comparison to the standard deviation for the months.

So the final result of all of the above is this report, which points out month-to-month differences that might deserve further concern or investigation. Note that the Numeric Date value is included below just to enable you to see what those values look like—you likely would not want to include that calculated data item in your report.

    Calculating standard deviation of a measure in Visual Analytics Designer was published on SAS Users.

March 2, 2017
     

    SAS Global Forum 2017 is just a month away and, if you’re a SAS administrator, it’s a great place to meet your peers, share your experiences and attend presentations on SAS administration tips and tricks.

SAS Global Forum 2017 takes place in Orlando, FL, April 2-5. You can find more information at https://www.sas.com/en_us/events/sas-global-forum/sas-global-forum-2017.html. This schedule is for the entire conference and includes pre- and post-conference events.

    If you’re an administrator, though, I wanted to highlight a few events that would be of particular interest to you:

    On Sunday, April 2nd from 2-4 pm there is a “Helping the SAS Administrator Succeed” event. More details can be found here.

    On Monday, April 3rd from 6:30-8:00 pm the SAS Users Group for Administrators (SUGA) will be hosting a Community Linkup, with panelists on hand to help answer questions from SAS administrators. Location will be in the Dolphin Level – Asia 4.

    There are two post-conference tutorials for the SAS Administrators:

    Introduction to SAS Grid Manager, Wednesday, April 5th from 2:30-6:30pm
    SAS Metadata Security, Thursday, April 6th from 8:00am-noon
    More details can be found here.

For a list of the papers on the topic of SAS Administration, you can visit this link. You will see that SAS Administration has been broken down into Architecture, Deployment, SAS Administration and Security subtopic areas.

    Some of the key papers under each sub-topic area are:

    Architecture
    Twelve Cluster Technologies Available in SAS 9.4
    Deploying SAS on Software-Defined and Virtual Storage Systems
    Shared File Systems:  Determining the Best Choice for your Distributed SAS Foundation Applications
    Do You have a Disaster Recovery Plan for Your SAS Infrastructure

    Deployment
    Pillars of a Successful SAS Implementation with Lessons from Boston Scientific
    Getting the Latest and Greatest from SAS 9.4: Best Practices for Upgrades and Migrations
    Migrating Large, Complex SAS Environments: In-Place versus New Build

    SAS Administration
    SAS Metadata Security 201:  Security Basics for a New Administrator
    SAS Environment Manager: Advanced Topics
    The Top Ten SAS Studio Tips for SAS Grid Manager Administrators
Implementing Capacity Management Policies on a SAS LASR Analytic Server Platform: Can You Afford Not To?
    Auditing in SAS Visual Analytics
    SAS Viya: What it Means for SAS Administration

    Security
    Guidelines for Protecting Your Computer, Network, and Data from Malware Threats
    Getting Started with Designing and Implementing a SAS® 9.4 Metadata and File System Security Design
    SAS® Metadata Security 301: Auditing your SAS Environment
    SAS® Users Audit: An Automated Approach to Metadata Reporting
    SAS® Metadata Security

    In addition to the breakout sessions, there is an Administration Super Demo station where short presentations will be given. The schedule for these presentations is:

    Sunday, April 2nd:
    17:00     Shared File Systems for SAS Grid Manager
    18:00     Where to Place SAS WORK in your SAS Grid Infrastructure

    Monday, April 3rd:
    11:00     Hands-on Secure Socket Layer Configuration for SAS 9.4 Environment Manager
12:00     Introduction to Configuring SAS Metadata Security for Multitenancy
    13:00     SAS Viya Overview
    14:00     Accelerate your SAS Programs with GPUs
    15:00     Authentication and Identity Management with SAS Viya

    Tuesday, April 4th:
    11:00     Accelerate your SAS Programs with GPUs
    12:00     Accelerating your Analytics Adoption with the Analytics Fast Track
    13:00     New Deployment Experience for SAS
    14:00     Managing Authorization in SAS Viya
    15:00     Clustering in SAS Viya
    16:00     A Docker Container Toolbox for the Data Scientist

As you can see, there is a lot for SAS Administrators to learn and many opportunities for SAS Administrators to socialize with fellow SAS Administrators.

    Here’s to seeing you in sunny Florida next month.

    P.S. SAS administrators don’t have to go to SAS Global Forum to get help administering their environment. In addition to SAS Global Forum and the SUGA group mentioned above, you can find out more information on resources for administrators in this blog. You can also visit our new webpage devoted just to users who administer their organization’s SAS environment. You can find that page here.

    Resources for SAS Administrators at SAS Global Forum 2017 … and beyond was published on SAS Users.

March 1, 2017
     

    In 2011, Loughran and McDonald applied a general sentiment word list to accounting and finance topics, and this led to a high rate of misclassification. They found that about three-fourths of the negative words in the Harvard IV TagNeg dictionary of negative words are typically not negative in a financial context. For example, words like “mine”, “cancer”, “tire” or “capital” are often used to refer to a specific industry segment. These words are not predictive of the tone of documents or of financial news and simply add noise to the measurement of sentiment and attenuate its predictive value. So, it is not recommended to use any general sentiment dictionary as is.

    Extracting domain-specific sentiment lexicons in the traditional way is time-consuming and often requires domain expertise. Today, I will show you how to extract a domain-specific sentiment lexicon from movie reviews through a machine learning method and construct SAS sentiment rules with the extracted lexicon to improve sentiment classification performance. I did the experiment with the help from my colleagues Meilan Ji and Teresa Jade, and our experiment with the Stanford Large Movie Review Dataset showed around 8% increase in the overall accuracy with the extracted lexicon. Our experiment also showed that the lexicon coverage and accuracy could be improved a lot with more training data.

SAS Sentiment Analysis Studio released domain-independent Taxonomy Rules for 12 languages and domain-specific Taxonomy Rules for a few languages. For English, SAS has covered 12 domains, including Automotive, Banking, Health and Life Sciences, Hospitality, Insurance, Telecommunications, Public Policy Countries and others. If the domain of your corpus is not covered by these industry rules, your first choice is to use general rules, which sometimes lead to poor classification performance, as Loughran and McDonald found. Automatically extracting domain-specific sentiment lexicons has been studied by researchers, and three methods have been proposed. The first method is to create a domain-specific word list by linguistic experts or domain experts, which may be expensive or time-consuming. The second method is to derive non-English lexicons based on English lexicons and other linguistic resources such as WordNet. The last method is to leverage machine learning to learn lexicons from a domain-specific corpus. This article will show you the third method.

    Because of the emergence of social media, researchers are able to relatively easily get sentiment data from the internet to do experiments. Dr. Saif Mohammad, a researcher in Computational Linguistics, National Research Council Canada, proposed a method to automatically extract sentiment lexicons from tweets. His method provided the best results in SemEval13 by leveraging emoticons in large tweets, using the PMI (pointwise mutual information) between words and tweet sentiment to define the sentiment attributes of words. It is a simple method, but quite powerful. At the ACL 2016 conference, one paper introduced how to use neural networks to learn sentiment scores, and in this paper I found the following simplified formula to calculate a sentiment score.

    Given a set of tweets with their labels, the sentiment score (SS) for a word w was computed as:
    SS(w) = PMI(w, pos) − PMI(w, neg), (1)

    where pos represents the positive label and neg represents the negative label. PMI stands for pointwise mutual information, which is
    PMI(w, pos) = log2((freq(w, pos) * N) / (freq(w) * freq(pos))), (2)

    Here freq(w, pos) is the number of times the word w occurs in positive tweets, freq(w) is the total frequency of word w in the corpus, freq(pos) is the total number of words in positive tweets, and N is the total number of words in the corpus. PMI(w, neg) is calculated in a similar way. Thus, Equation 1 is equal to:
    SS(w) = log2((freq(w, pos) * freq(neg)) / (freq(w, neg) * freq(pos))), (3)

    The movie review data I used was downloaded from Stanford; it is a collection of 50,000 reviews from IMDB. I used 25,000 reviews in train and test datasets respectively. The constructed dataset contains an even number of positive and negative reviews. I used SAS Text Mining to parse the reviews into tokens and wrote a SAS program to calculate sentiment scores.
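The scoring program was not included in the post; a minimal sketch of how Equation 3 could be computed in SAS is below. It assumes a hypothetical table TERM_FREQ with one row per token and the columns term, n_pos (frequency in positive reviews) and n_neg (frequency in negative reviews); the 0.75 cutoff matches the filtering described later in the article.

/* Hypothetical sketch of the sentiment-score computation in Equation 3 */
proc sql;
   /* Totals: freq(pos) and freq(neg) over the whole corpus */
   create table totals as
   select sum(n_pos) as freq_pos, sum(n_neg) as freq_neg
   from term_freq;

   /* SS(w) = log2( (freq(w,pos)*freq(neg)) / (freq(w,neg)*freq(pos)) ) */
   create table sentiment_lexicon as
   select t.term,
          log2( (t.n_pos * s.freq_neg) / (t.n_neg * s.freq_pos) ) as ss
   from term_freq t, totals s
   where t.n_pos > 0 and t.n_neg > 0          /* avoid log of zero */
     and abs(calculated ss) >= 0.75;          /* drop weak-predictive words */
quit;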

In my experiment, I used the train dataset to extract sentiment lexicons and the test dataset to evaluate sentiment classification performance at each sentiment score cutoff value from 0 to 2, in increments of 0.25. Data-driven learning methods frequently have an overfitting problem, so I used the test data to filter out all weak-predictive words whose absolute sentiment scores are less than 0.75. In Figure-1, there is an obvious drop in the accuracy line plot of the test data when the cutoff value is less than 0.75.


    Figure-1 Sentiment Classification Accuracy by Sentiment Score Cutoff

Finally, I got a huge list of 14,397 affective words: 7,850 positive words and 6,547 negative words from movie reviews. The top 50 lexical items from each sentiment category are shown in Figure-2.

    Figure-2 Sentiment Score of Top 50 Lexical Items

Now I have automatically derived the sentiment lexicon, but how accurate is this lexicon and how do I evaluate its accuracy? I googled movie vocabulary and got two lists, from Useful Adjectives for Describing Movies and Words for Movies & TV, with 329 adjectives categorized into positive and negative. 279 of these adjectives have vector data in the GloVe word embedding model downloaded from http://nlp.stanford.edu/projects/glove/, and their t-SNE plot is shown in Figure-3. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus. t-SNE is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten.[1] It is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. So two words are co-located, or located closely, in the scatter plot if their semantic meanings are close or their co-occurrence in the same contexts is high. Besides semantic closeness, I also show sentiment polarity via different colors: red stands for negative and blue stands for positive.

    Figure-3 T-SNE Plot of Movie Vocabularies

    From Figure-3, I find the positive vocabulary and the negative vocabulary are clearly separated into two big clusters, with very little overlap in the plot.

Now, let me check the sentiment scores of the terms. 175 of them were included in my result, and Figure-4 displays the top 50 terms of each category. I compared the sentiment polarity of my result with the list: 168 of 175 terms are correctly labelled as negative or positive, and the overall accuracy is 96%.

    Figure-4 Sentiment Score of Top 50 Movie Terms

There are 7 polarity differences between my predictions and the list, as Table-1 shows.

    Table-1 Sentiment Polarity Difference between Predictions and Actual Labels

One obvious prediction mistake is “coherent.” I checked the raw movie reviews that contain “coherent,” and only 25 of 103 reviews are positive. This is why its sentiment score is negative rather than positive. I went through these reviews and found most of them had a sentiment polarity reversal, such as “The Plot - Even for a Seagal film, the plot is just stupid. I mean it’s not just bad, it’s barely coherent. …” A possible solution to make the sentiment scores more accurate is to use more data or add special handling for polarity reversals. I tried the first method, and it did improve the accuracy significantly.

    So far, I have evaluated the sentiment scores’ accuracy with public linguistic resources and next I will test the prediction effect with SAS Sentiment Analysis Studio. I ran sentiment analysis against the test data with the domain-independent sentiment rules developed by SAS and the domain-specific sentiment rules constructed by machine learning, and compared the performance of two methods. The results showed an 8% increase in the overall accuracy. Table-2 and Table-3 show the detailed information.

    Test data (25,000 docs)

    Table-2 Performance Comparison with Test Data

    Table-3 Overall Performance Comparison with Test Data

    After you get domain-specific sentiment lexicons from your corpora, only a few steps are required in SAS Sentiment Analysis Studio to construct the domain-specific sentiment rules. So, next time you are processing domain-specific text for sentiment, you may want to try this method to get a listing of terms that are positive or negative polarity to augment your SAS domain-independent model.

The detailed steps to construct domain-specific sentiment rules are as follows.

    Step 1. Create a new Sentiment Analysis project.
Step 2. Create intermediate entities named “Positive” and “Negative,” then add the learned lexicons to the two entities respectively.

    Step 3. Besides the learned lexicons, you may add an entity named “Negation” to handle the negated expressions. You can list some negations you are familiar with, such as “not, don’t, can’t” etc.

Step 4. Create positive and negative rules in the Tonal Keyword. Add the rule “CONCEPT: _def{Positive}” to the Positive tab, and the rules “CONCEPT: _def{Negative}” and “CONCEPT: _def{Negation} _def{Positive}” to the Negative tab.

Step 5. Build the rule-based model. Now you can use this model to predict the sentiment of documents.

    How to extract domain-specific sentiment lexicons was published on SAS Users.

February 25, 2017
     

    Let’s have some fun, shall we? Share your video or photo!

    The SAS User Community, albeit spread around the world, is a tight-knit group. We may sit alone in our offices pounding out code, developing applications, tweaking system performance or creating reports,  but the truth is other SAS users (our colleagues at work, in online communities, and at local user group meetings), are always there to assist us, and to socialize with from time to time. We rely on our fellow SAS Users for support and companionship, as well as a resource for new ideas and techniques. Then, once each year, we join users on a global scale by gathering for a few days at SAS Global Forum.

The opportunity to strengthen and extend our bonds with other SAS Users makes SASGF a much sought-after event. We will go to great lengths to attend: demonstrating value to our employer to secure permission, presenting content to receive a registration discount, applying for an award or scholarship, volunteering as a presenter or room coordinator, joining the Conference Team, or even becoming Conference Chair!

    What might these efforts look like if we were to record metaphors for them? What I mean is, how would you represent your effort?

    For example, here is a photo of two determined SAS Users negotiating a portage on Lady Evelyn River (Ontario, Canada) on their way to SAS Global Forum.

    These two must really understand the value of attending!

    So...

    What are you willing to do to get to SAS Global Forum?!

    Share your videos and photos that represent your efforts to get to SASGF in Orlando. We’ll have some fun seeing how our fellow SAS Users spend their non-SAS-coding time. I’m looking forward to seeing new faces and new places.

Simply follow @SASsoftware on Twitter and Instagram, then post your video, photo or GIF. Make sure you tag your post with #GetToSASGF and @SASsoftware.

Share more than one, and encourage your fellow SAS Users to play along. And check back often to see what your peers have shared.

    Who knows, you may even see your picture or video on the Big Screen at SASGF 2017!

     

    What are you willing to do to get to SAS Global Forum? was published on SAS Users.

February 25, 2017
     

As a practitioner of visual analytics, I read the featured blog ‘Visualizations: Comparing Tableau, SPSS, R, Excel, Matlab, JS, Python, SAS’ last year with great interest. In the post, the blogger Tim Matteson asked readers to guess which software was used to create his 18 graphs. My buddy, Emily Gao, suggested that I see how SAS VA does at recreating these visualizations. I agreed.

SAS Visual Analytics (VA) is better known for its interactive visual analysis, but it’s also able to create nice visualizations. Users can easily create professional charts and visualizations without SAS coding. So what I am trying to do in this post is load the corresponding data into the SAS VA environment, and use VA Explorer and Designer to mimic Matteson’s visualizations.

I want to specially thank Robert Allison for his valuable advice during the process of writing this post. Robert Allison is a SAS graph expert, and I have learned a lot from his posts. I read his blog on creating the 18 amazing graphs using pure SAS code, and I copied most of the data from his blog when doing these visualizations, which saved me a lot of time preparing data.

    So, here’s my attempt at recreating Matteson’s 18 visualization using SAS Visual Analytics.

    Chart 1

This visualization is created by using two customized bar charts in VA and putting them together using the precision layout, so it looks like one chart. The customization of the bar charts can be done using the ‘Custom Graph Builder’ in SAS VA: set the reverse order for the X axis, set the axes direction to horizontal, don’t show axis labels for the X axis and Y axis, uncheck the ‘show tick marks’ option, etc. Compared with Matteson’s visualization, my version has the tick values on the X axis displayed as non-negative numbers, as people generally expect a positive value for the frequency.

Another thing is, I used a custom sort for the category to define the order of the items in the bar chart. This can be done by right-clicking on the category and selecting ‘Edit Custom Sort…’ to get the desired order. You may also have noticed that the legend is a bit strange for the Neutral response, since it is split into Neutral_1stHalf and Neutral_2ndHalf, which I needed in order to gracefully show the data symmetrically in the VA visualization.

    Chart 2

VA can easily create a grouped bar chart with the desired sort order for the countries and the questions. However, we cannot put the question texts horizontally atop each group of bars in VA; VA uses a vertical section bar instead, with a tooltip showing the whole question text when the mouse hovers over it. And we can see the value of each bar section interactively in VA when hovering the mouse over it.

    Chart 3

Matteson’s chart looks a bit scattered to me, while Robert’s chart has great label text and markers for the scatterplot matrix. Here I use VA Explorer to create the scatterplot matrix for the data, which omits the diagonal cells and their symmetric counterparts for easier data analysis. It can then be exported to a report, and the color of the data points changed.

    Chart 4

I used the ‘Numeric Series Plot’ to draw this chart of job losses in the recession. It was straightforward; I just adjusted some settings, such as checking ‘Show markers’ in the Properties tab, unchecking ‘Show label’ for the X axis, and unchecking ‘Use filled markers’. To refine the X axis label with different fonts, I needed to use the ‘Precision’ layout instead of the default ‘Tile’ layout, and then drag in a ‘Text’ object to contain the wanted X axis label.

    Chart 5

VA can easily draw grouped bar charts automatically. Disable the X axis label, and set a grey color for the ‘Header background.’ What we need to do here is add some display rules for the mapping of color to value. For the formatted text at the bottom, use the ‘Text’ object. (Note: VA puts the Age_range values at the bottom of the chart.)

    Chart 6

SAS VA does not support 3D charts, so I could not make a chart similar to the one Robert created with SAS code. What I do for this visualization instead is create a network diagram using the Karate Club dataset. The detected communities (0, 1, 2, 3) are shown with different colors. The diagram can be exported as an image in VAE.

***I use the following code to generate the necessary data for the visualization:


    /* Dataset of Zachary’s Karate Club data is from: http://support.sas.com/documentation/cdl/en/procgralg/68145/HTML/default/viewer.htm#procgralg_optgraph_examples07.htm  
    This dataset describes social network friendships in karate club at a U.S. university.
    */
    data LinkSetIn;
       input from to weight @@;
       datalines;
     0  9  1  0 10  1  0 14  1  0 15  1  0 16  1  0 19  1  0 20  1  0 21  1
     0 23  1  0 24  1  0 27  1  0 28  1  0 29  1  0 30  1  0 31  1  0 32  1
     0 33  1  2  1  1  3  1  1  3  2  1  4  1  1  4  2  1  4  3  1  5  1  1
     6  1  1  7  1  1  7  5  1  7  6  1  8  1  1  8  2  1  8  3  1  8  4  1
     9  1  1  9  3  1 10  3  1 11  1  1 11  5  1 11  6  1 12  1  1 13  1  1
    13  4  1 14  1  1 14  2  1 14  3  1 14  4  1 17  6  1 17  7  1 18  1  1
    18  2  1 20  1  1 20  2  1 22  1  1 22  2  1 26 24  1 26 25  1 28  3  1
    28 24  1 28 25  1 29  3  1 30 24  1 30 27  1 31  2  1 31  9  1 32  1  1
    32 25  1 32 26  1 32 29  1 33  3  1 33  9  1 33 15  1 33 16  1 33 19  1
    33 21  1 33 23  1 33 24  1 33 30  1 33 31  1 33 32  1
    ;
    run;
    /* Perform the community detection using resolution levels (1, 0.5) on the Karate Club data. */
    proc optgraph
       data_links            = LinkSetIn
       out_nodes             = NodeSetOut
       graph_internal_format = thin;
       community
          resolution_list    = 1.0 0.5
          out_level          = CommLevelOut
          out_community      = CommOut
          out_overlap        = CommOverlapOut
          out_comm_links     = CommLinksOut;
    run;
     
    /* Create the dataset of detected community (0, 1, 2, 3) for resolution level equals 1.0 */ 
    proc sql;
    	create table mylib.newlink as 
    	select a.from, a.to, b.community_1, c.nodes from LinkSetIn a, NodeSetOut b, CommOut c
     	where a.from=b.node and b.community_1=c.community and c.resolution=1 ;
    quit;

    Chart 7

I created this map using the ‘Geo Coordinate Map’ in VA. I needed to create a geography variable by right-clicking on ‘World-cities’, selecting Geography->Custom…, and setting the Latitude to ‘Unprojected degrees latitude’ and the Longitude to ‘Unprojected degrees longitude.’ To get the black continents in the map, go to VA preferences and check ‘Invert application colors’ under the Theme. Remember to set the ‘Marker size’ to 1, and change the first color of the markers to black so that it will show as white when the application colors are inverted.

    Chart 8

This is a very simple scatter chart in VA. I only set transparency in order to show the overlapping values. The blue text in the upper-left corner uses a text object.

    Chart 9

To get this black-background graph, set the ‘Wall background’ color to black, then change the ‘Line/Marker’ color in the data colors section accordingly. I’ve also checked the ‘Show markers’ option and increased the marker size to 6.

    Chart 10

There is nothing special about creating this scatter plot in VA. I simply create several reference lines, and uncheck ‘Use filled markers’ with a smaller marker size. The transparency of the markers is set to 30%.

    Chart 11

In VA’s current release, if we use a category variable for color, the markers automatically change to different shapes for the different colors. So I created a customized scatterplot using the VA Custom Graph Builder to define the marker as always round. Nothing else, just set the transparency to clearly show the overlapping values. As always, we can add an image object in VA with the precision layout.

    Chart 12

I used the Geo Bubble Map to create this visualization. I needed to create a custom geography variable from the trap variable, using ‘lat_deg’ and ‘lon_deg’ as latitude and longitude respectively. I then renamed the NumMosquitos measure to ‘Total Mosquitos’ and used it for the bubble size. To show the presence of West Nile virus, I use a display rule in VA. I also created an image to show the meaning of the colored icons for the display rule. The precision layout is enabled in order to add text and images to this visualization.

    Chart 13

This visualization is also created with the Geo Bubble Map in VA. First I did some data manipulation to square the magnitude, just for the sake of bubble size resolution, so it shows contrast in size. Then I created some display rules to show the significance of the earthquakes with different colors, and set the transparency of the bubbles to 30% for clarity. I also created an image to show the meaning of the colored icons.

Be aware that some data manipulation is needed for the original longitude data. Since the geographic coordinates use the meridian as the reference, if we want to show the data for the Americas on the right-hand side, we need to add 360 to the longitudes, whose values are negative.
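A minimal sketch of that data manipulation is below; the dataset and variable names are hypothetical:

/* Shift negative longitudes so the Americas plot on the right-hand side,
   and square the magnitude to sharpen the bubble-size contrast */
data mylib.quakes_prepared;
   set mylib.quakes;
   if lon_deg < 0 then lon_deg = lon_deg + 360;
   magnitude_sq = magnitude**2;
run;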

    Chart 14

My understanding is that one of the key points of the visualization Matteson made is to show the control/interaction feature. The great thing is, VA has various control objects for interactive analysis. For the upper part of this visualization, I simply put in a list table object. The trick here is how to use a display rule to mimic the style. Before assigning any data to the list table in VA, I create a display rule with an expression; at this point we can specify the column with any measure value in the expression. (Otherwise, you need to define the display rule for each column with its own expression.) Just define ‘Any measure value’ is missing, or greater than a value, with the proper fill color for the cell. (VA doesn’t support filling a cell with a pattern as Robert did for missing values, so I use grey for missing values to differentiate them from 0, which gets a light color.)

For the lower part, I create a new dataset to hold the intervention items, and put it in a list control and a list table. The horizontal bar chart on the right is a target bar chart with the expected duration as the target value. The label on each bar shows the actual duration.

    Chart 15

VA does not have solid-modeling animation like Matteson used in his original chart, but VA does support animation for bubble plots in an interactive mode. So I made this visualization using Robert’s animation dataset, trying to imitate the famous animation by the late Hans Rosling as a memorial. I set the dates for the animation by creating a date variable with the first day of each year (just for simplicity). One customization here: I use the Custom Graph Builder to add a new role so that the bubble plot can display data labels, and set the country name as the bubble label in VA Designer. Certainly, we can always filter to the countries of interest in VA for further analysis.

VA can’t show only a subset of the bubble labels as Robert did with SAS code. So, in order to clearly show the labels of the countries of interest, I ranked the top 20 countries by average population, and set a filter to show data between 1950 and 2011. I used a screen capture tool to save the animation as a .gif file. Be sure to click the chart to see the animation.

    Chart 16

I think the point of Matteson’s original chart is to show the overview axis in the line chart, since I don’t see anything else special about the line chart. So I drew this time series plot with the overview axis enabled in VA, using the SASHELP.STOCKS dataset. It shows the date on the X axis with tick marks split into months, which can be zoomed in to the day level in VA interactively. The overview axis supports zooming in and out, as well as moving the focused period.

    Chart 17

For this visualization, I use a customized bubble plot (in the Custom Graph Builder, add a Data Label role to the Bubble Plot) so that it will display bubble labels. I use one reference line with the label Gross Avg., and two reference lines for the X and Y axes, which visually creates four quadrants. As usual, I add 4 text objects to hold the labels at each corner in the precision layout.

    Chart 18

I think Matteson made an impressive 3D chart, and Robert recreated a very beautiful 3D chart with pure SAS code. But VA does not have any 3D charts. So for this visualization, I simply load the data into VA, drag the variables to get a visualization in VAE, choose the best fit from the fit line list, and export the visualization to a report. Then I add display rules according to the value of Yield. Since VA shows the display rules in the information panel, I create an image of the colored markers to serve as a legend in the visualization and put it in the precision layout.

    There you have it. Matteson’s 18 visualizations recreated in VA.

    How did I do?

    18 Visualizations created by SAS Visual Analytics was published on SAS Users.

February 24, 2017
     

    Editor's note: Amanda Farnsworth is Head of Visual Journalism at BBC News and a featured speaker at SAS Global Forum 2017, April 2-5, 2017 in Orlando.

There was a best-selling book some years ago called “Men Are from Mars, Women Are from Venus.” It’s a phrase I thought about a lot when I first started my current job, not so much in the gender sense, but because it can be really challenging to bring together teams with very different experiences, skillsets and, above all, cultures.

    Different strokes

    In 2013, I was asked to form a new department – Visual Journalism – bringing together Online Designers, TV Designers and Online Journalists with an aptitude for graphics and visuals as well as Developers who worked with me but not for me. These included people staffing the many Language Services of the BBC World Service. The different teams produced content for the BBC News website, TV 24hr News Channels and Bulletins.

    And boy, were they all different!

The digital folk were very creative, but in a controlled way. They worked with a set visual design language and liked to work through a structured process that involved a lot of user testing, focusing on what the audience would understand and how they would behave when faced with some of our visual content.

    Meanwhile, those with a TV background liked to work in a much more fluid way – creative workshops and experimentation - with less audience focus, as it was so much harder to get proper audience feedback on the visual elements of a TV report that viewers would get a single chance to see and take in.

    And, as everyone who has gone through change knows, it can be a scary and difficult time for many of those involved. I have always found The Transition Curve one of the most useful things I ever learnt in a management course, helping me to identify how different parts of the team might be handling change. I stuck this diagram on a filing cabinet next to my desk when I created the Visual Journalism team.

    And there was one other thing: I had a predominantly TV background and I was now being asked to lead a team that was packed full of digital experts. I’d always prided myself on my technical as well as editorial ability, and now I was less technically skilled than most of the people working for me. How was I going to cope with that?

    New beginnings

The first thing I did was to offer 30-minute one-on-ones with my staff. About 60% agreed. I sent them a questionnaire to fill in in advance and asked them these three questions:

    1.    What Single Thing could we do very quickly that would change things for the better?

    2.    How can the new Visual Journalism team work better together?

    3.    What new tools do you need to do your job?

    It proved to be a treasure trove of information with some interesting thoughts and great suggestions – here are a few examples of Q1 answers:

    “A single management structure for the whole team. Sometimes different disciplines within the team clash as they are pulled in different directions by the priorities of their respective managers. This wastes time and creates unnecessary tensions.”

    “Unfortunately we have a rather corrosive habit of 'rumour control' which is usually of a negative nature, particularly in this time of change and uncertainty. I think 'rumour control' can easily be reduced by providing as much information as possible ( good and bad ) so none is left to be made up!”

    “I would like to see a re-evaluation of the planning area. A map that just happens to be going on air tomorrow, Should that be taking up a slot in planning? Maybe planning should be more focussed on projects that are moving us and our journalism forward. “

    And from question 2:

    “One word: flexibility. The teams need to absorb the concept that we have one goal, the individual outputs need to grasp this to. A respect for the established disciplines is all well and good, but tribalism needs to be left behind.”

    So I had a lot of face time with a lot of staff.  They all appreciated the dedicated time, but it also gave me a chance to meet them individually. The questionnaires gave me a written record of all their top concerns which I could refer to in the coming months and use as a justification or guide for change. And I could say after six months that I had done a lot of the things they had asked for along with other things that I felt needed to be done.

    In addition, I wanted the teams to meet each other.  So, we held a Speed Dating session. We made two long rows of chairs facing each other and sat TV people on one side and online people on the other.  They had one minute to say what they did and one minute to listen to the person opposite them share the same before I sounded a horn and everyone moved down one chair. It was a bit chaotic and a little hysterical to watch, but proved to be a great way of breaking the ice between the teams.

    After a month in which I also immersed myself in the work of the various teams with a series of show and tells and shadowing days, and asked external stakeholders what they wanted from the new department, I drew up my vision.

Its main message was that we were now a cross-platform team that needed to share ideas, information, skills and assets to create great, innovative content across TV and Digital.

    The build phase

Even as we began the process of real change, the outside world suddenly started to move quickly. More and more of our news website traffic started to come from mobile, not desktop devices, and the distinction between what was TV and what was Digital began to blur, with the use of more video and motion graphics online. Social media platforms proliferated and became a key way of reaching an audience that didn’t usually access BBC TV or digital content. We found ourselves on the cutting edge of where TV meets the web. And we had to make the most of it.

    I began a series of internal attachments where online and TV designers learnt each other’s skills. I supplemented that with training so they could learn new software tools and design techniques. The lines between journalists and designers also began to fade, with many editorial people learning motion graphics skills for use on the increasingly important social media platforms.

I also encouraged, and stood the cost of, people spending a month outside the department learning how other parts of the BBC News machine worked, and helped spread the word that the new, dynamic Visual Journalism department wanted to partner up and do big, high-impact, cross-platform projects.

    I revamped our Twitter feed, offered myself and other colleagues for public speaking at conferences and made sure we entered our best work for awards.

Quite quickly this all began to pay dividends. We won a big data journalism prize and we formed some big external partnerships with universities doing interesting research and with official bodies like the Office for National Statistics. We received a big investment for more data journalism from our Global division and from BBC Sport, who wanted to do some big data-led projects around the World Cup and Olympics.

    Social glue was also important. We instituted a now legendary annual pot-luck Christmas lunch where the tables groaned with the amazing food people brought in to share. The Christmas jumpers are always impressive and we hold a raffle and quiz too.

    There was, and still is, a major job to do just listening and looking after the staff. I make a point of praising and rewarding great work. We don’t have a great deal of flexibility on pay at the BBC, but rewards like attending international conferences, getting training opportunities and receiving some retail vouchers from the scheme the BBC runs all help.  I also always facilitate flexible working as much as is humanly possible, not just for women returning to work after maternity leave, but for caregivers, people who want to work part-time and most recently for two new dads who are going to take advantage of the paternal leave scheme and be the sole parent at home for six months while their wives return to work.

    I also write an end-of-year review and look ahead to the next 12 months that I send to all staff.  It outlines achievements and great content we have made but also the aims, objectives and challenges for the year ahead.

    Not all plain sailing

Of course there were, and still are, some issues. As the Transition Curve shows, not everyone is going to follow you and embrace the change you bring. Team members who have been experts in their fields and are happy doing what they do suddenly find they have to learn new things and can feel de-skilled. By definition, they cannot be an immediate expert at something new that they are asked to do, and that can be difficult.

    As roles and responsibilities blurred, we found we had to redefine the production process for online content as people became unsure of their roles.

Meanwhile, such was the external reputation of the team that we suffered a brain drain to Apple, Amazon and Adidas.

    And for me, as the department grew to over 160 people when I took on responsibility for the Picture Editors who edit the video for news and current affairs reports, I had to accept I was going to be more of an enabler and provider of editorial oversight than a practitioner.  Technology was moving so fast, while I had to know and understand it, actually being able to create content myself was going to be a rare occurrence.

    Conclusion

Writing this post has helped me see just how far we’ve come as a department in a few short years. It’s certainly not perfect, and the challenges we face are ever-changing. But we have now won over 25 awards across all platforms, and the cross-platform vision is embedded in the teams, who really enjoy learning from each other and working on projects together.

    And, I have a secret weapon.  I enjoy singing pop songs at my desk everyday and of course Carols at Christmas.

    Trying not to encourage me to sing is something literally everyone can unite behind.

    Bringing teams together was published on SAS Users.