Tech

August 20, 2018
 

As SAS Global Forum 2019 Conference Chair, I am excited to share our plans for an extraordinary SAS experience in Dallas, Texas, April 28-May 1. SAS has always been my analytic tool of choice and has provided me an amazing journey in my professional career.  Attending many SAS global conferences throughout the years, I've found this is the best venue to learn, contribute and network with other SAS colleagues.

The content will not disappoint – submit your abstract now
What presentation topics will you hear at SAS Global Forum? In 2019, expanded topics will include deep dives into text analytics, business visualization, AI/Machine Learning, SAS administration, open source integration and career development discussions. As always, SAS basic programming techniques using the current SAS software environment will be the underpinning of many sessions.

Do you want to be a part of that great content but have never presented at a conference? Now could be the time to learn new skills by participating in the SAS Global Forum Presenter Mentoring Program. The program is available to first-time presenters: you will collaborate with a seasoned expert who guides you in preparing and delivering a professional presentation. The mentors can also help before you even submit your abstract in the call for content, which is now open. Get help crafting the title and abstract for your submission. The open call is available through October 22.

How to get to Dallas
Did you know there are scholarships and awards to help SAS users attend SAS Global Forum? Our award program has been enhanced for New SAS Professionals. This award provides an opportunity for new SAS users to enhance their basic SAS skills as well as learn about the latest technology in their line of business. Hear what industry experts are doing to solve business issues by sharing real-world evidence cases.

In addition, we're offering an enhanced International Professional Award focused on global engagement and participation. This award is available to those outside of the continental 48 states who want to share their expertise and industry solutions from around the world. At the conference, you will have a chance to learn from and network with international professionals who work on analytic projects similar to yours.

Don’t miss out on this valuable experience! Submit your abstract for consideration to be a presenter! I look forward to seeing you in Dallas and hearing about your work.

Call for content now open for SAS Global Forum 2019 was published on SAS Users.

August 17, 2018
 

Data density estimation is often used in statistical analysis as well as in data mining and machine learning. Visualizing a density estimate reveals characteristics of the data such as distribution, skewness, and modality. The most widely used visualizations for data density are box plots, histograms, and kernel density estimates, among other plots. SAS has several procedures that can create such plots. Here, I'll visualize kernel density estimates superimposed on a histogram using SAS Visual Analytics.

A histogram shows the data distribution through contiguous interval bins, which gives a rough view of how densely the values are distributed. However, the bin width (or number of bins) has a significant impact on the shape of a histogram and thus gives viewers different impressions. For example, the two histograms below use the same data, the left one with 6 bins and the right one with 4 bins. Different bin widths suggest different distributions for the same data. In addition, a histogram is not smooth enough to compare visually with mathematical density models. Thus, many people use kernel density estimates, which vary more smoothly across the distribution.
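To make the bin-count effect concrete, here is a minimal Python sketch (not part of the original SAS workflow) that bins one small sample into 6 and then 4 equal-width bins. The sample values and the helper name `histogram_counts` are invented for illustration only.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a bin of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample, made up for this illustration.
sample = [1, 2, 2, 3, 3, 3, 6, 7, 7, 8, 9, 12, 13]

six_bins = histogram_counts(sample, 6)
four_bins = histogram_counts(sample, 4)
print("6 bins:", six_bins)
print("4 bins:", four_bins)
```

With 6 bins the counts rise and fall several times, while with 4 bins the same values collapse into one dominant first bar, which is the kind of change in impression described above.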

Kernel density estimation (KDE) is a widely used non-parametric approach to estimating the probability density of a random variable. Non-parametric means the estimate adjusts to the observations in the data, which makes it more flexible than parametric estimation. To plot a KDE, we need to choose the kernel function and its bandwidth. The kernel function is used to compute the density estimate. The bandwidth, essentially the width of the sliding window used to generate the density, controls the smoothness of the KDE plot. SAS offers several ways to generate kernel density estimates. Here I use PROC UNIVARIATE to create the KDE output as an example (for simplicity, I set c = SJPI to have SAS select the bandwidth by using the Sheather-Jones plug-in method), then make the corresponding visualization in SAS Visual Analytics.
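As a rough illustration of how the kernel function and bandwidth interact, the following Python sketch builds a Gaussian KDE by hand. It is only a sketch under stated assumptions: the sample values are hypothetical, the kernel is fixed to a Gaussian, and the `bandwidth` argument is given directly in data units, so it is not meant to mirror the c= option of PROC UNIVARIATE.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # One Gaussian bump per observation, centered on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical sample with two clusters of values.
sample = [1.0, 2.0, 2.5, 3.0, 7.0, 7.5, 8.0]
narrow = gaussian_kde(sample, bandwidth=0.5)  # small window: bumpy curve
wide = gaussian_kde(sample, bandwidth=3.0)    # large window: smooth curve

# Approximate the area under each curve with a Riemann sum on [-20, 30].
step = 0.01
grid = [-20 + i * step for i in range(int(50 / step) + 1)]
area_narrow = sum(narrow(x) for x in grid) * step
area_wide = sum(wide(x) for x in grid) * step
print(round(area_narrow, 3), round(area_wide, 3))
```

Both curves integrate to roughly 1, but the narrow bandwidth keeps a visible dip between the two clusters while the wide bandwidth smooths it away.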

Visualize the kernel density estimates using SAS code

It is straightforward to compute kernel density estimates using PROC UNIVARIATE. Take the variable MSRP in the SASHELP.CARS data set as an example. The minimum and maximum values of the MSRP column are 10280 and 192465, respectively. I plot the histogram with 15 bins in this example. Below is the code segment I used to construct kernel density estimates of the MSRP column:

title 'Kernel density estimates of MSRP';
proc univariate data = sashelp.cars noprint;	
   histogram MSRP / kernel (c = SJPI) endpoints = 10280 to 192465 by 12145 outkernel = KDE  odstitle = title; 
run;

Run the above code in SAS Studio, and we get the following graph.
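A quick arithmetic check of the ENDPOINTS= values in the PROC UNIVARIATE step: the step of 12145 appears to come from dividing the MSRP range by the 15 bins and dropping the fraction. The Python sketch below only reproduces that arithmetic; it assumes nothing about how SAS treats the final endpoint.

```python
msrp_min, msrp_max, n_bins = 10280, 192465, 15

# Exact bin width if the range were split into 15 equal parts.
exact_width = (msrp_max - msrp_min) / n_bins

# The SAS code uses a whole-number step, which matches the truncated width.
step = int(exact_width)

# Last endpoint generated by "10280 to 192465 by 12145".
last_endpoint = msrp_min + n_bins * step

print(exact_width, step, last_endpoint)
```

Because the step is truncated, the last generated endpoint (192455) falls 10 short of the maximum MSRP, so it may be worth checking the SAS log for any endpoint adjustment.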

Visualize the kernel density estimates using SAS Visual Analytics

  1. In SAS Visual Analytics, load the SASHELP.CARS data set and the KDE data set (from the previous PROC UNIVARIATE step) to the CAS server.
  2. Drag and drop a ‘Precision Container’ onto the canvas, and put a histogram and a numeric series plot in the container.
  3. Assign the corresponding data to the histogram: assign CARS.MSRP as the histogram Measure and ‘Frequency Percent’ as the histogram Frequency. Then set the histogram options as follows:
    Object -> Title: No title;

    Graph Frame: Grid lines: disabled

    Histogram -> Bin range: Measure values; check the ‘Set a fixed bin count’ and set ‘Bin count’ to 15.

    X Axis options:

       Fixed minimum: 10280

       Fixed maximum: 192465

       Axis label: disabled

       Axis Line: enabled

       Tick value: enabled

    Y Axis options:

       Fixed minimum: 0

       Fixed maximum: 0.5

       Axis label: disabled

       Axis Line: disabled

       Tick value: disabled

  4. Assign the corresponding KDE data to the numeric series plot. Define a calculated item, Percent, as (‘Percent of Observations Per Data Unit’n / 100) with the format ‘PERCENT12.2’, and assign it to the ‘Y axis’; assign ‘Data Value’ to the ‘X axis.’ Now set the options of the numeric series plot as follows:
    Object -> Title: No title;

    Style -> Line/Marker: (change the first color to purple)

    Graph Frame -> Grid lines: disabled

    Series -> Line thickness: 2

    X Axis options:

       Axis label: disabled

       Axis Line: disabled

       Tick value: disabled

    Y Axis options:

       Fixed minimum: 0

       Fixed maximum: 0.5

       Axis label: enabled

       Axis Line: enabled

       Tick value: enabled

    Legend:

       Visibility: Off

  5. Now we can start to overlay the two charts. As can be seen in the screenshot below, SAS Visual Analytics 8.3 provides a smart guide with the precision container, which shows grids to help you align the objects in it. If you hold the Ctrl key while dragging the numeric series plot over the histogram, the smart guide displays fine grids to help with basic alignment. Making the overlay precise is a little tricky, though; you may need to fine-tune the Left/Top/Width/Height values in the Layout section of the VA Options panel. The goal is to make the intersections of the two charts' axes coincide.

After that, we can add a text object above the charts we just made. The result is the kernel density estimate superimposed on a histogram shown in the screenshot below, similar to what we got from PROC UNIVARIATE. (If you'd like to use the UNIVAR statement in PROC KDE for data density estimates, you can visualize the result in SAS Visual Analytics in a similar way.)

To go further, I made a KDE with a scatter plot, where the little circles also give an impression of the data density, and another KDE with a needle plot, where the barcode-like lines represent the data density. Both were created in a similar way to the histogram example above.

So far, I’ve shown you how I visualize KDE using SAS Visual Analytics. There are other approaches to visualizing kernel density estimates in SAS Visual Analytics; for example, you can create a custom graph in Graph Builder and import it into SAS Visual Analytics. In any case, KDE is a good visualization for helping you understand more about your data. Why not give it a try?

Visualizing kernel density estimates in SAS Visual Analytics was published on SAS Users.

August 15, 2018
 

SAS Viya 3.4 has some new functionality that provides real help for those who want to transition from SAS Visual Analytics on 9.4 to SAS Viya. In prior releases of SAS Viya you could promote reports and explorations (and a few other supporting objects). In SAS Viya 3.4, promotion support is added for many additional SAS 9.4 resources, making it easier to make the leap to SAS Viya. In this blog, I will review this new functionality.

In SAS Viya 3.4, the following objects participate in promotion from SAS 9.4.

  • Configuration
    • Identities
    • Authorization
    • Data definitions
  • Content
    • Folders
    • Reports
    • Explorations
    • Stored processes
    • Supporting resources (such as themes, images, graphs templates)

The details of support for each resource are unique and are discussed below.

Identities

User and group promotion from SAS 9.4 to SAS Viya supports carrying the authorization settings that are associated with content over to the target environment. Metadata is exported to support the mapping of SAS 9.4 identity metadata (Users and Groups) to SAS Viya identities (Users, Groups and Custom groups).

During promotion of identity metadata:

  • User connections are mapped using metadata DefaultAuth:logonid to SAS Viya identity id
  • Metadata-only groups from SAS 9.4 are converted to SAS Viya Custom groups (except SAS General Servers and SAS System Services)
  • If custom groups of the same name (or sometimes the same purpose but a different name) exist in the target, the group is preserved and any mapped members from the source system are added to the group.

Authorization

Identities are “promoted” to support re-implementation of authorization. You do not have to explicitly export authorization as it is included with libraries, tables, folders and reports when they are exported. Promotion of authorization is optional. If you don’t wish to include authorization, but rather re-implement it in SAS Viya, you can switch this functionality off at import time.

SAS Viya has two authorization systems, the general authorization system for folders and content, and the CAS authorization system for data. These authorization systems are different than the metadata authorization model in SAS 9.4. So what happens when you promote content that includes authorization?

General Authorization (folders and content)

Promotion will attempt to convert SAS 9.4 authorization to rules in the General authorization system.  During the process:

  • Explicit Access Control Entries are converted to SAS Viya Rules
  • Access Control Entries with denials are discarded
  • Access Control Templates are not promoted

In addition, if an object (folder/report):

  • does not exist in the target environment, relevant authorization is set for the object and the access control entries from the source are implemented as rules on the object.
  • exists in the target environment, then access control entries from the source are merged with any pre-existing authorizations in the target environment.

CAS Authorization

The CAS authorization system covers CASlibs and data.  Promotion will attempt to convert SAS 9.4 authorization on libraries and tables to access controls in the CAS authorization system. During the process:

  • Access Control Entries are not promoted unless they are applied directly to a library or table.
  • Access Control Entries are converted to CAS access controls.
  • Row-level permissions are preserved.
  • If an object exists in the target environment no authorization settings are imported.
  • Access Control Templates are not promoted.

For details of how individual permissions for both data and content are mapped from SAS 9.4 to SAS Viya, see the documentation, which has great coverage of the steps to follow.

The Process

To finish off, I'll share a few observations on the process of exporting from SAS 9.4 and importing into SAS Viya. As with SAS 9.4 promotion, you need to import in a specific order. This allows the software to make the relevant connections to dependent resources. For example, if the caslib already exists in the target, then imported tables can be mapped to it. Typically, the order is: identities > library definitions > tables > reports and folders. To support this process, make sure that, during export, you create a separate package for each resource type. Here are some considerations for the export process.

You should export:

  • Identities (users and groups) from the security folders in SAS 9.4 metadata to a separate package.
  • Only groups that you need in the target environment (you can subset any irrelevant SAS 9 groups at export time).
  • LASR and Base Libraries and tables directly from the library definition in the folder tree (this prevents extraneous folders being created in the target environment).
  • Libraries in a separate package from tables so that they may be imported first and be available for mapping when the tables are imported.
  • Content and reports from the base of the folder tree so that all directly applied access control entries will be included in the package.
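The ordering rule above can be sketched as a small Python helper that sorts export packages before import. This is only an illustration: the resource-type labels, the package file names, and the `sort_packages` helper are hypothetical and are not part of any SAS tooling.

```python
# Import order described in the text; earlier types must be imported first.
IMPORT_ORDER = ["identities", "libraries", "tables", "content"]


def sort_packages(packages):
    """Sort (resource_type, filename) pairs into the recommended import order."""
    rank = {rtype: i for i, rtype in enumerate(IMPORT_ORDER)}
    return sorted(packages, key=lambda pkg: rank[pkg[0]])


# Hypothetical packages, one per resource type, listed in arbitrary order.
packages = [
    ("content", "reports_and_folders.spk"),
    ("tables", "tables.spk"),
    ("identities", "users_and_groups.spk"),
    ("libraries", "libraries.spk"),
]

ordered = sort_packages(packages)
for rtype, filename in ordered:
    print(rtype, filename)
```

Keeping one package per resource type is what makes this kind of ordering possible; a single mixed package could not be split at import time.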

Prior to importing, make sure that users and groups are configured correctly in LDAP. As I already mentioned, physical data is not promoted so ensure that required data and formats are accessible to the SAS Viya environment.

The new functionality for promotion is a great start in helping with the transition from SAS 9.4 to SAS Viya. Look for more functionality in future releases.

New functionality for transitioning from SAS Visual Analytics on 9.4 to SAS Viya was published on SAS Users.

August 13, 2018
 

Data in the cloud is easily accessible and can help businesses run more smoothly. SAS Viya runs its calculations on SAS Cloud Analytic Services (CAS). David Shannon of Amadeus Software spoke at SAS Global Forum 2018 and presented his paper, Come On, Baby, Light my SAS Viya: Programming for CAS. (In addition to being an avid SAS user and partner, David must be an avid Doors fan.) This article summarizes David's overview of how to run SAS programs in SAS Viya and how to use CAS sessions and libraries.

If you're using SAS Viya, you're going to need to know the basics of CAS to be able to perform calculations and use SAS Viya to its potential. SAS 9 programs are compatible with SAS Viya, and will run as-is through the CAS engine.

Using CAS sessions and libraries

Use a CAS statement to kick off a session, then use CAS libraries (caslibs) to store data and resources. To start the session, simply code "cas;". Each CAS session is given its own unique identifier (UUID) that you can use to reconnect to the session.


There are a few key statements that can help you master CAS operations. Consider these examples, based on a CAS session that David labeled "speedyanalytics":

  • What CAS sessions do I have running?
    cas _all_ list;
  • Get the version and license specifics from the CAS server hosting my session:
    cas speedyanalytics listabout;
  • I want to sign out of SAS Studio for now, so I will disconnect from my CAS session, but return to it later…
    cas speedyanalytics disconnect;
  • ...later in the same or a different SAS Studio session, I want to reconnect to the CAS session I started earlier using the UUID I previously grabbed from the macro variable or SAS log:
    cas uuid="&speedyanalytics_uuid";
  • At the end of my program(s), shutdown all my CAS sessions to release resources on the server:
    cas _all_ terminate;

Using CAS libraries

CAS libraries (caslibs) are the way to access data that is stored in memory, as well as the related metadata.

From the library, you can load data into CAS tables in a few different ways:

  1. A DATA step can take a sample data set, calculate a new measure, and store the output in memory
  2. Proc COPY can bring existing SAS data into a caslib
  3. Proc CASUTIL loads tables into caslibs

PROC CASUTIL allows you to save your tables (the "classsi" data in David's examples) for future use through the SAVE statement:

proc casutil;
 save casdata="classsi" casout="classsi";
run;

And reload like this in a future session, using the LOAD statement:

proc casutil;
 load casdata="classsi" casout="classsi";
run;

When accessing your CAS libraries, remember that there are multiple levels of scope that can apply. "Session" refers to data from just the current session, whereas "Global" allows you to reach data from all CAS sessions.

Programming in CAS

Showing how to put CAS into action, David shared this diagram of a typical load/save/share flow:

Existing SAS 9 programs and CAS code can both be run in SAS Viya. The calculations and in-memory data handling occur in CAS (SAS Cloud Analytic Services). Before beginning, it's important to understand CAS at a general level so that you can access CAS libraries and your data. For more about CAS architecture, read this paper from CAS developer Jerry Pendergrass.

The performance case for SAS Viya

To close out his paper, David outlined a small experiment he ran to demonstrate the performance advantages of SAS Viya v3.3 over a standard, stand-alone SAS v9.4 environment. The test was basic, but performed reads, writes, and analytics on a 5GB table. The tests revealed about a 50 percent increase in performance between CAS and SAS 9 (see the paper for a detailed table of comparison metrics). SAS Viya is engineered for distributed computing (which works especially well in cloud deployments), so more extensive tests could certainly reveal even further increases in performance in many use cases.

A quick introduction to CAS in SAS Viya was published on SAS Users.

August 9, 2018
 

Recently I’ve been listening to the BBC Radio Series 50 Things That Made the Modern Economy, which was first broadcast in 2016. One of the episodes considers the impact of a simple box (the shipping container) and concludes its invention was a major contributor to the post-war boom in global trade. It’s worth a listen, if you can.

Notwithstanding the tenuous link, containerization is having perhaps an equally significant impact on cloud computing, and I want to share a recent experience that highlights the convenience of containers. I’m not aiming to summarize all of the SAS initiatives in the cloud (including SAS Viya and Cloud Foundry) here; rather, I want to share a few observations about a specific offering for SAS 9.4.

Recently I attended a demonstration by SAS’ Doug Liming on SAS Analytics for Containers. While this product was launched in 2016, until now I confess I’d not appreciated its simplicity or potential. I’d like to use this blog post to share what I saw & learned because this session served as a bit of an epiphany for me.

As a reminder, SAS Analytics for Containers consists of:

    • Foundation SAS (Base, STAT & Graph) ready-packaged to be deployed in a Docker container.
    • SAS Studio.
    • Optional SAS/Access connectors & Accelerators.

In the space of 20 minutes, Doug took us through the ...

The power and potential of simplicity: SAS 9.4 and Containers was published on SAS Users.

August 2, 2018
 

SAS Viya has opened an entirely new set of capabilities, allowing SAS to run analytics on cloud technology in real time. One of the best new features of SAS Viya is its ability to pair with open source platforms, giving developers the freedom to choose their language and implementation while integrating with the power of SAS analytics.

At SAS Global Forum 2018, Sean Ankenbruck and Grace Heyne Lybrand from Zencos Consulting led the talk, SAS Viya: The Beauty of REST in Action. While the paper – and this blog post – outlines the use of Python and SAS Viya, note that SAS Viya integrates with R, Java and Lua as well.

Nonetheless, this Python integration example shows how easy it is to integrate SAS Viya and open source technologies. Here is the basic workflow:

1. A developer creates a web application, in a language of their choice
2. A user enters data in the web application
3. The collected data is passed to Viya via the defined APIs
4. Analysis is performed in Viya using SAS actions
5. Results are passed back to the web application
6. The web application presents the results to the user

About the process

SAS’ Cloud Analytic Services (CAS) acts as a server to analyze data, and REST APIs are used to integrate many programming languages with SAS Viya. REST stands for Representational State Transfer, and is a set of constraints that allows scalability and integration of multiple web-based systems. In layman’s terms, it’s a set of software design patterns that provides handy connector points from one web app to another. The REST API is what developers use to interact with and submit requests through the processing system.

CAS actions are what allow “tasks” to be completed on SAS Viya. These “tasks” are under the categories of Statistics, Analytics, System, and Data Mining and Machine Learning.

Integration with Python

To access CAS through Python, the SAS Scripting Wrapper for Analytics Transfer (SWAT) package is used, letting Python conventions dictate CAS actions. To create this interface, data must be captured through a web application in a format that Python can transmit to SAS Viya.
In order to connect Python and CAS, the following is necessary:

• Hostname
• CAS Port
• Username
• Password
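As a minimal sketch of how those four settings might be gathered and handed to SWAT, the Python below reads them from environment variables and passes them to `swat.CAS`. The environment variable names (`MY_CAS_HOST` and so on) and both helper functions are hypothetical; actually opening the session assumes the `swat` package is installed and a CAS server is reachable.

```python
import os


def cas_connection_params(env=os.environ):
    """Collect the four connection settings from (hypothetical) environment variables."""
    return {
        "hostname": env["MY_CAS_HOST"],
        "port": int(env["MY_CAS_PORT"]),
        "username": env["MY_CAS_USER"],
        "password": env["MY_CAS_PASSWORD"],
    }


def connect(env=os.environ):
    """Open a CAS session; needs the swat package and a reachable CAS server."""
    import swat  # imported lazily so the helper above works without swat installed

    return swat.CAS(**cas_connection_params(env))


# Usage, once the environment variables are set and a server is available:
# conn = connect()
```

Reading credentials from the environment rather than hard-coding them is a design choice of this sketch, not something the paper prescribes.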

Let’s see it in action

As an example, one project about wine preferences collected data through a questionnaire, stored it with Python’s pandas library, and passed it to CAS. When the information was gathered, the decision tree was uploaded to SAS Viya. A model was created with common terms reviewers use to describe wines, feeding into a decision tree. The CAS server scored the users’ responses in real time, and then sent the results back to the users, providing them with suggested wines to match their inputs.

Process for model

Code to utilize tree:

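# conn is assumed to be an existing CAS connection created with SWAT
conn.loadactionset("decisionTree")
conn.decisionTree.dTreeTrain(
      casOut = {"name":"tree_model"},
      inputs = [{vars}],   # placeholder from the original: the list of input variables
      modelId = "DT_wine_variety",
      table = {"caslib":"public", "name": "wines_model_data"},
      target = "variety")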

"Decision" given to user

Conclusion

SAS Viya has opened SAS to a plethora of opportunities, allowing many different programming languages to be interpreted and quickly integrated, giving analysts and data scientists more flexibility.

Additional resources

At Your Service: Using SAS® Viya™ and Python to Create Worker Programs for Real-Time Analytics, Jon Klopfer, Scott Koval, and Mia List
SAS Viya
sas-viya-programming on github
python-swat on github
SAS Global Forum

Additional SAS Viya talks from SAS Global Forum

A Need for Speed: Loading Data via the Cloud, Henry Christoffels
Come On, Baby, Light my SAS® Viya®: Programming for CAS, David Shannon
Just Enough SAS® Cloud Analytic Services: CAS Actions for SAS® Visual Analytics Report Developers, Michael Drutar
Running SAS Viya on Oracle Cloud without Sacrificing Performance, Dan Grant
Command-Line Administration in SAS® Viya®, Danny Hamrick
Five Approaches for High-Performance Data Loading to the SAS® Cloud Analytic Services Server, Rob Collum

How SAS Viya uses REST APIs to integrate with Python was published on SAS Users.

July 26, 2018
 

SAS Text Analytics analyzes documents at the document level by default, but sometimes sentence-level analysis yields further insights into the data. Two years ago, the SAS Text Analytics team did some research on sentence-level text analysis and shared their discoveries in an SGF paper, Getting More from the Singular Value Decomposition (SVD): Enhance Your Models with Document, Sentence, and Term Representations. Recently my team started working on a concept extraction project. We need to extract all sentences containing one or two query words, so that linguists don't need to read whole documents in order to write concept extraction rules. This significantly improves their efficiency in rule development and tuning.

Sentence boundary detection

Sentence boundary detection is a challenge in Natural Language Processing -- it's more complicated than you might expect. For example, most sentences in English end with a period, but sometimes a period is used to denote an abbreviation or used as a part of an ellipsis. My colleagues Biljana and Teresa wrote an article about the complexities of how a period may be used. If you are interested in this topic, please check out their article Text analytics through linguists' eyes: When is a period not a full stop?

Sentence boundary rules differ across languages, and when you work with multilingual data you might want to write one set of code to handle data in all of them. For example, in German a period marks the end of an ordinal number token; in Chinese, the sentence-final period is a different character from the English period; and Thai does not use a period to denote the end of a sentence.

Here are several sentence boundary examples:

Sentence | Language | Text
1 | English | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
2 | English | I paid $23.45 for this book.
3 | English | We earn more and more money, but we feel less and less happier. So…what happened to us?
4 | Chinese | 北京确实人多车多,但是根源在哪里?
5 | Chinese | 在于首都集中了太多全国性资源。
6 | German | Was sind die Konsequenzen der Abstimmung vom 12. Juni?
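To see why punctuation alone is not enough, here is a deliberately naive Python splitter applied to some of the example texts. It is only an illustration of the problem, not how SAS detects sentence boundaries; the `naive_split` helper is hypothetical.

```python
import re


def naive_split(text):
    """Split after '.', '?' or '!' when followed by whitespace (deliberately naive)."""
    return [part for part in re.split(r"(?<=[.?!])\s+", text) if part]


english = ("Rolls-Royce Motor Cars Inc. said it expects its U.S. sales "
           "to remain steady at about 1,200 cars in 1990.")
price = "I paid $23.45 for this book."
german = "Was sind die Konsequenzen der Abstimmung vom 12. Juni?"
chinese = "北京确实人多车多,但是根源在哪里?在于首都集中了太多全国性资源。"

english_parts = naive_split(english)
price_parts = naive_split(price)
german_parts = naive_split(german)
chinese_parts = naive_split(chinese)

for parts in (english_parts, price_parts, german_parts, chinese_parts):
    print(len(parts), parts)
```

The naive rule breaks the single English sentence after "Inc." and "U.S.", breaks the German sentence after the ordinal "12.", and fails to separate the two Chinese sentences because no whitespace follows the question mark. Only the price example survives, since the period in "$23.45" is followed by a digit.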

How to tokenize documents into sentences with SAS?

There are several methods to build a sentence tokenizer with SAS Text Analytics. Here I only list three methods:

  • Method 1: Use CAS action applyConcept and SAS Viya
  • Method 2: Use CAS action tpParse and SAS Viya
  • Method 3: Use SAS Data Step Code and SAS 9

Among the above three methods, I recommend the first method, because it can extract sentences and keep the raw texts intact. With the second method, uppercase letters are changed into lowercase letters after parsing with SAS, and some unseen characters will be replaced with white spaces. The third method is based on traditional SAS 9 technology (not SAS Viya), so it might not scale to large data as well.

In my article, I show the SAS code of only the first two methods. For details of the SAS code for the last method, please check out the paper Getting More from the Singular Value Decomposition (SVD): Enhance Your Models with Document, Sentence, and Term Representations.

Use CAS action applyConcept

The applyConcept action performs concept extraction using a concept extraction model that you compile and validate.

%macro sentenceTokenizer1(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   dsOut=
);
/* Rule for determining sentence boundaries */
data sascas1.concept_rule;
   length rule $ 200;
   ruleId=1;
   rule='ENABLE:SentBoundaries';
   output;
 
   ruleId=2;
   rule='PREDICATE_RULE:SentBoundaries(first,last):(SENT,"_first{_w}","_last{_w}")';
   output;
run;
 
proc cas;
textRuleDevelop.validateConcept / 
   table={name="concept_rule"}
   config='rule'
   ruleId='ruleId'
   language="&language"
   casOut={name='outValidation',replace=TRUE}
;
run;
quit;
 
/* Compile concept rule; */
proc cas;
textRuleDevelop.compileConcept / 
   table={name="concept_rule"}
   config="rule"
   enablePredefined=false
   language="&language"
   casOut={name="outli", replace=TRUE}
;
run;
quit;
 
/* Get Sentences */
proc cas;
textRuleScore.applyConcept / 
   table={name="&dsIn"}
   docId="&docVar"
   text="&textVar"
   language="&language"
   model={name="outli"}
   matchType="best"
   casOut={name="outpos_eli", replace=TRUE}
   factOut={name="&dsOut", replace=TRUE, where="_fact_argument_=''"}
;
run;
quit;
 
proc cas;
   table.dropTable name="concept_rule" quiet=true; run;
   table.dropTable name="outli" quiet=true; run;
   table.dropTable name="outpos_eli" quiet=true; run;
quit; 
%mend sentenceTokenizer1;

Use CAS action tpParse (an NLP parsing technique)

%macro sentenceTokenizer2(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   dsOut=
);
/* Parse the data set */
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=false
   tagging=false
   noungroups=false
   entities="none"
   selectAttribute={opType="IGNORE",tagList={}}
   selectPos={opType="IGNORE",tagList={}}
   offset={name="offset",replace=TRUE}
;
run;
 
/* Get Sentences */
proc cas;
table.partition / 
   table={name="offset" 
          groupby={{name="_document_"}, {name="_sentence_"}}
          orderby={{name="_start_"}}
         }
   casout={name="offset" replace=true};
run;
 
datastep.runCode /
code= "
data &dsOut;
   set offset;
   by _document_ _sentence_ _start_;
   length _text_ varchar(20000);
   if first._sentence_ then do;
      _text_='';
      _lag_end_ = -1;
   end;  
   if _start_=_lag_end_+1 then
      _text_=cats(_text_, _term_);
   else
      _text_=trim(_text_)||repeat(' ',_start_-_lag_end_-2)||_term_;
   _lag_end_=_end_;  
   if last._sentence_ then output;
   retain _text_ _lag_end_;
   keep _document_ _sentence_ _text_;
run;
";
run;   
quit;
 
proc cas;
   table.dropTable name="offset" quiet=true; run;
quit; 
%mend sentenceTokenizer2;
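The DATA step inside sentenceTokenizer2 rebuilds each sentence from term offsets: adjacent terms are concatenated, and any gap between one term's end and the next term's start is filled with spaces. A rough Python equivalent of that idea is sketched below. The token tuples are hypothetical (the real tokenization comes from tpParse), and the sketch does not reproduce every detail of the SAS logic, such as how leading blanks are handled for the first term.

```python
def rebuild_sentence(tokens):
    """Rebuild sentence text from (start, end, term) tuples sorted by start.

    Offsets are inclusive character positions, so adjacent terms satisfy
    start == previous_end + 1; any larger gap is filled with spaces.
    """
    text = ""
    prev_end = None
    for start, end, term in tokens:
        if prev_end is not None:
            # Number of characters skipped between the two terms.
            text += " " * (start - prev_end - 1)
        text += term
        prev_end = end
    return text


# Hypothetical offsets for the lowercased terms of "I paid $23.45".
tokens = [(0, 0, "i"), (2, 5, "paid"), (7, 7, "$"), (8, 12, "23.45")]
rebuilt = rebuild_sentence(tokens)
print(rebuilt)
```

Because the text is reassembled from parsed terms rather than copied from the source, any normalization applied during parsing (such as lowercasing) carries over into the rebuilt sentence.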

Here are three examples for using each of these tokenizer methods:

/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
 
/*-------------------------------------*/
/* Example 1: Chinese texts            */
/*-------------------------------------*/
data sascas1.text_zh;
   infile cards dlm='|' missover;
   input _document_ text :$200.;
   cards;
1|北京确实人多车多,但是根源在哪里?在于首都集中了太多全国性资源。
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_zh,
   docVar=_document_,
   textVar=text,
   language=chinese,
   dsOut=sentences_zh1
);
 
%sentenceTokenizer2(
   dsIn=text_zh,
   docVar=_document_,
   textVar=text,
   language=chinese,
   dsOut=sentences_zh2
);
 
/*-------------------------------------*/
/* Example 2: English texts            */
/*-------------------------------------*/
data sascas1.text_en;
   infile cards dlm='|' missover;
   input _document_ text :$500.;
   cards;
1|Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
2|I paid $23.45 for this book.
3|We earn more and more money, but we feel less and less happier. So…what happened to us?
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_en,
   docVar=_document_,
   textVar=text,
   language=english,
   dsOut=sentences_en1
);
 
%sentenceTokenizer2(
   dsIn=text_en,
   docVar=_document_,
   textVar=text,
   language=english,
   dsOut=sentences_en2
);
 
 
/*-------------------------------------*/
/* Example 3: German texts             */
/*-------------------------------------*/
data sascas1.text_de;
   infile cards dlm='|' missover;
   input _document_ text :$600.;
   cards;
1|Was sind die Konsequenzen der Abstimmung vom 12. Juni?
;
run;   
 
%sentenceTokenizer1(
   dsIn=text_de,
   docVar=_document_,
   textVar=text,
   language=german,
   dsOut=sentences_de1
);
 
%sentenceTokenizer2(
   dsIn=text_de,
   docVar=_document_,
   textVar=text,
   language=german,
   dsOut=sentences_de2
);

Table 2 below shows the sentences extracted from the three examples.

Example: English

  Doc 1
    Text: Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
    Sentence (Method 1): Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
    Sentence (Method 2): rolls-royce motor cars inc. said it expects its u.s. sales to remain steady at about 1,200 cars in 1990.

  Doc 2
    Text: I paid $23.45 for this book.
    Sentence (Method 1): I paid $23.45 for this book.
    Sentence (Method 2): i paid $23.45 for this book.

  Doc 3
    Text: We earn more and more money, but we feel less and less happier. So…what happened to us?
    Sentences (Method 1):
      We earn more and more money, but we feel less and less happier.
      So…what happened?
    Sentences (Method 2):
      we earn more and more money, but we feel less and less happier.
      so…what happened?

Example: Chinese

  Doc 1
    Text: 北京确实人多车多,但是根源在哪里?在于首都集中了太多全国性资源。
    Sentences (Method 1):
      北京确实人多车多,但是根源在哪里?
      在于首都集中了太多全国性资源。
    Sentences (Method 2):
      北京确实人多车多,但是根源在哪里?
      在于首都集中了太多全国性资源。

Example: German

  Doc 1
    Text: Was sind die Konsequenzen der Abstimmung vom 12. Juni?
    Sentence (Method 1): Was sind die Konsequenzen der Abstimmung vom 12. Juni?
    Sentence (Method 2): was sind die konsequenzen der abstimmung vom 12. juni?

From the above table, you can see that there is no difference between the two methods with Chinese textual data, but there are differences with English or German textual data: method 2 returns lowercased sentences. So which method should you use? It depends on the SAS products that you have available. Method 1 depends on the compileConcept, validateConcept, and applyConcept actions, and requires SAS Visual Text Analytics. Method 2 depends on the tpParse action in SAS Visual Analytics. If you have both products available, then consider your use case. If you are working on text analytics that are case insensitive, such as topic detection or text clustering, you may choose method 2. Otherwise, if the text analytics are case sensitive, such as named entity recognition, you must choose method 1. (And of course, if you don't have SAS Viya, you can use method 3 with SAS 9 and guidance from the cited paper.)
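
One reason to rely on either method rather than a simple punctuation rule is that the example texts are deliberately tricky. As a rough illustration (a Python sketch, not part of either SAS method; naive_split is a hypothetical helper), splitting after sentence-ending punctuation followed by whitespace copes with the price in English document 2, but it breaks on the abbreviations "Inc." and "U.S." and on the German date "12. Juni":

```python
import re


def naive_split(text):
    """Split after '.', '?' or '!' whenever whitespace follows (hypothetical helper)."""
    return [part for part in re.split(r"(?<=[.?!])\s+", text.strip()) if part]


# Texts copied from the English and German examples in this post.
english_1 = ("Rolls-Royce Motor Cars Inc. said it expects its U.S. sales "
             "to remain steady at about 1,200 cars in 1990.")
english_2 = "I paid $23.45 for this book."
german_1 = "Was sind die Konsequenzen der Abstimmung vom 12. Juni?"

for doc in (english_1, english_2, german_1):
    pieces = naive_split(doc)
    print(len(pieces), pieces)
```

Each of these documents is a single sentence, so any result with more than one piece is a false split.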

If you have SAS Viya, I suggest trying the above sentence tokenization methods with your data and then running text mining actions on the sentence-level data to see what insights you get.

How to tokenize documents into sentences was published on SAS Users.

July 25, 2018
 

I recently joined SAS in a brand new role: I'm a Developer Advocate.  My job is to help SAS customers who want to access the power of SAS from within other applications, or who might want to build their own applications that leverage SAS analytics.  For my first contribution, I decided to write an article about a quick task that would interest developers and that isn't already heavily documented. So was born this novice's experience in using R (and RStudio) with SAS Viya. This writing will chronicle my journey from the planning stages, all the way to running commands from RStudio on the data stored in SAS Viya. This is just the beginning; we will discuss at the end where I should go next.

Why use SAS Viya with R?

From the start, I asked myself, "What's the use case here? Why would anyone want to do this?" After a bit of research and discussion with my SAS colleagues, the answer became clear.  R is a popular programming language used by data scientists, developers, and analysts – even within organizations that also use SAS.  However, R has some well-known limitations when working with big data, and our SAS customers are often challenged to combine the work of a diverse set of tools into a well-governed analytics lifecycle. Combining developers' familiarity with R programming and the power and flexibility of SAS Viya for data storage, analytical processing, and governance seemed like a perfect exercise.  For the purpose of this scenario, think of SAS Viya as the platform and Cloud Analytic Services (CAS) as the place where all the data is stored and processed.

How I got started with SAS Viya

I did not want to start with the task of deploying my own SAS Viya environment. This is a non-trivial activity, and not something an analyst would tackle, so the major pre-req here is you'll need access to an existing SAS Viya setup.  Fortunately for me, here at SAS we have preconfigured SAS Viya environments available on a private cloud that we can use for demos and testing.  So, SAS Viya is my server-side environment. Beyond that, a client is all I needed. I used a generic Windows machine and got busy loading some software.

What documentation did I use/follow?

I started with the official SAS documentation: SAS Scripting Wrapper for Analytics Transfer (SWAT) for R.

The Process

The first two things I installed were R and RStudio, which I found at these locations:

https://cran.r-project.org/
https://www.rstudio.com/products/rstudio/download/

The installs were uneventful, so I won't list all those steps here. Next, I installed a couple of pre-req R packages and attempted to install the SAS Scripting Wrapper for Analytics Transfer (SWAT) package for R. Think of SWAT as what allows R and SAS to work together. In an R command line, I entered the following commands:

> install.packages('httr')
> install.packages('jsonlite')
> install.packages('https://github.com/sassoftware/R-swat/releases/download/v1.2.1/R-swat-1.2.1-linux64.tar.gz', repos=NULL, type='file')

When attempting the last command, I hit an error:

…
ERROR: dependency 'dplyr' is not available for package 'swat'
* removing 'C:/Program Files/R/R-3.5.1/library/swat'
In R CMD INSTALL
Warning message:
In install.packages("https://github.com/sassoftware/R-swat/releases/download/v1.2.1/R-swat-1.2.1-linux64.tar.gz",  :
installation of package 'C:/Users/sas/AppData/Local/Temp/2/RtmpEXUAuC/downloaded_packages/R-swat-1.2.1-linux64.tar.gz'
  had non-zero exit status

The install failed. Based on the error message, it turns out I had forgotten to install another R package:

> install.packages("dplyr")

(This dependency is documented in the R SWAT documentation, but I missed it. Since this could happen to anyone – right? – I decided to come clean here. Perhaps you'll learn from my misstep.)

After installing the dplyr package in the R session, I reran the swat install and was happy to hit a return code of zero. Success!

To keep this post brief, I decided not to configure an authentication file, so I have to pass user credentials when making connections. I will configure authinfo in a follow-up post.

Testing my RStudio->SAS Viya connection

From RStudio, I ran the following commands to connect to the CAS server:

> library(swat)
> conn <- CAS("mycas.company.com", 8777, protocol='http', user='user', password='password')

Now that I succeeded in connecting my R client to the CAS server, I was ready to load data and start making API calls.

How did I decide on a use case?

I'm in the process of moving houses, so I decided to find a data set on property values in the area to do some basic analysis, to see if I was getting a good deal. I did a quick Google search and downloaded a .csv file from a local government site. At this point, I was all set up, connected, and had data. All I needed now was to run some CAS actions from RStudio.

CAS actions are commands that you submit through RStudio to tell the CAS server to 'do' something. One or more objects are returned to the client -- for example, a collection of data frames. CAS actions are organized into action sets and are invoked via APIs. You can find

> citydata <- cas.read.csv(conn, "C:\\Users\\sas\\Downloads\\property.csv", sep=';')
NOTE: Cloud Analytic Services made the uploaded file available as table PROPERTY in caslib CASUSER(user).

What analysis did I perform?

I purposefully kept my analysis brief, as I just wanted to make sure that I could connect, run a few commands, and get results back.

My RStudio session, including all of the things I tried

Here is a brief series of CAS action commands that I ran from RStudio:

Get the mean value of a variable:

> cas.mean(citydata$TotalSaleValue)
          Column     Mean
1 TotalSaleValue 343806.5

Get the standard deviation of a variable:

> cas.sd(citydata$TotalSaleValue)
          Column      Std
1 TotalSaleValue 185992.9

Get boxplot data for a variable:

> cas.percentile.boxPlot(citydata$TotalSaleValue)
$`BoxPlot`
          Column     Q1     Q2     Q3     Mean WhiskerLo WhiskerHi Min     Max      Std    N
1 TotalSaleValue 239000 320000 418000 343806.5         0    685000   0 2318000 185992.9 5301

Get boxplot data for another variable:

> cas.percentile.boxPlot(citydata$TotalBldgSqFt)
$`BoxPlot`
         Column   Q1   Q2   Q3     Mean WhiskerLo WhiskerHi Min   Max      Std    N
1 TotalBldgSqFt 2522 2922 3492 3131.446      1072      4943 572 13801 1032.024 5301

Did I succeed?

I think so. Let's say the house I want is 3,000 square feet and costs $258,000. As you can see in the box plot data, I'm getting a good deal. The house size is slightly above the median (between Q2 and Q3), while the price is below the median (between Q1 and Q2). Yes, this is not the most in-depth statistical analysis, but I'll get more into that in a future article.
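
To make that comparison explicit, here is a small Python sketch (the helper name quartile_band is hypothetical and not part of SWAT) that places a value relative to the Q1, Q2, and Q3 cut points reported by the box plot action above. Band 1 means below Q1, and band 4 means at or above Q3.

```python
from bisect import bisect_right


def quartile_band(value, q1, q2, q3):
    """Return 1-4: the quarter of the data that the value falls into."""
    return bisect_right([q1, q2, q3], value) + 1


# Cut points copied from the box plot output above.
size_band = quartile_band(3000, 2522, 2922, 3492)             # TotalBldgSqFt
price_band = quartile_band(258000, 239000, 320000, 418000)    # TotalSaleValue

print("size band:", size_band)
print("price band:", price_band)
```

With these cut points, a 3,000-square-foot house lands above the median size, and a $258,000 price lands below the median price.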

What's next?

This activity has really sparked my interest to learn more, and I will continue to expand my analysis, attempt more complex statistical procedures, and create graphs. A follow-up post is already in the works. If this article has piqued your interest in the subject, I'd like to ask you: What would you like to see next? Please comment and I will turn my focus to those topics for a future post.

Using RStudio with SAS Viya was published on SAS Users.

July 25, 2018
 

In SAS Visual Analytics 8.3, a Data View is a reusable and shareable template for a data source. That means that the data view is tied to the data source, and not to the report. If you update a data view, the changes do not automatically propagate to reports that have already applied it.
 
So, what can a data view do for you? Plenty! Here are just a few of the settings and customizations that a data view can save for a data source (taken from the documentation):

  • Data item settings such as names, formats, classifications, and aggregations
  • Data source filters
  • Hierarchies
  • Derived data items
  • Calculated items
  • Custom categories
  • Duplicate data items
  • Show / hide status for data items
  • Unique row identifier selection

Create a Data View

Now you must be wondering, how do you save all these wonderful customizations for your data source? Answer: by creating a Data View.
 
To get started, use the Data Source menu and select Save data view…. In this example, I created a hierarchy for the SASHELP CARS data set but as you can see from the list above you could have created many more calculations, custom categories, etc.
 
 

 
Then give the Data View a name. A few other things you may notice about this Save Data View dialogue are the options for: Default data view and Shared data view.
 
 

Default data view

A default data view is automatically applied whenever the data source is added to the report.
 
Each user can create their own data view of the source data and select their own default data view. This could lead to each user having a personalized default view. But what if you want to share your data views with others on your team? Or have everyone start with the same default view? For that, you must first be an Application Administrator and then use the Shared data view option.

Shared data view

In order to be able to share a data view, you must be an Application Administrator. Then the option to share a data view will be available. Once a data view is shared for a data source, other users with access to that data source will be able to apply that data view.

Apply a Data View

Data views are templates of saved settings, hierarchies, custom categories, calculated data items, and so on, which can be combined in any number of ways. Therefore, it follows that multiple data views can be applied to the same data source. In the example above, I created a new hierarchy for the SASHELP CARS data set. But I could also create a new data view that changes the aggregation of the MPG measures from the default sum to the average.

To apply a data view: open a new report, select your data source, then use the Data Source menu and select Data views…. You will see any individually created data views as well as any shared data views. Highlight the data view you wish to apply, then select Apply. Repeat for all of the data views you wish to apply.

If applying additional data views duplicates any data items, those data items are given an (n) suffix after their names.

Administrator-controlled Default Data View

We've learned what Data Views are and that we can share them. How can we ensure that all the users who select a data source get the same starting point with a particular data view? To set this up, you must be an Application Administrator and the Data View must be Shared.
 
Once these two criteria are met, you can navigate to the report's overflow menu and select Edit administration settings. Then select the data source and which data view to apply as the default for all users.


 
Caution: If the user has already selected a personal default data view, then the personal default data view overrides the administrator-set default data view. Remember that an individual user can apply a personal or another shared data view and override the default data view.
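
The precedence described in this caution can be summarized in a few lines of Python. This is only a sketch of the rule as stated here, not how SAS Visual Analytics implements it; the function name and the view names are hypothetical.

```python
def effective_default_view(personal_default=None, admin_default=None):
    """Return the data view that applies by default for one user.

    A personal default, if the user has set one, overrides the
    administrator-set default. None means no default view applies.
    """
    if personal_default is not None:
        return personal_default
    return admin_default


# Hypothetical view names, for illustration only.
print(effective_default_view(None, "Team hierarchy view"))
print(effective_default_view("My MPG averages", "Team hierarchy view"))
```

The first call falls back to the administrator's choice; the second shows the personal default taking over.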

Conclusion

Data Views are just one of the exciting new features in SAS Visual Analytics 8.3. A few key points to remember:

  • Data Views are tied to a data source, not a report. If a data view is edited, those edits do not propagate to the reports that applied that Data View.
  • A data source can have multiple Data Views applied.
  • Only an Application Administrator can share a data view with other users and define a default data view for a data source for all users. Any personally defined default data view overrides the administrator-set default data view.
  • Data Views are a template of data settings and edits – not a fully robust semantic layer where updates are pushed to all instances of usage. While Data Views can be used to assist in defining commonly used calculations and custom categories, remember that each user can still create their own data views and thus override the administrator-set default.

Using Data Views in SAS Visual Analytics was published on SAS Users.

July 19, 2018
 

Suppose that you want to know the value of a character variable that has the highest frequency count or even the top three highest values. To determine that value, you need to create an output data set and sort the data by the descending Count or _FREQ_ variable. Then you need to print the top n observations using the OBS= option, based on the number of values that you want to see. You can do this easily using any of a variety of procedures that calculate a frequency count (for example, the FREQ Procedure or the MEANS Procedure).

This blog provides two detailed examples: one calculates the top n values for a single variable and one calculates the top n values for all character variables in a data set.

Print the top n observations of a single variable

The following example prints the three values of the Make variable in the Sashelp.Cars data set that have the highest frequency count. By default, PROC FREQ writes a variable called Count to the output data set. The output data set is sorted by this variable in descending order, and the OBS= data set option limits the printed output to the number of observations that you want to see.

proc freq data=sashelp.cars noprint;
tables make / out=counts(drop=percent);
run;
 
proc sort data=counts;
by descending count;
run;
 
proc print data=counts(obs=3);
run;
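
For readers who think in general-purpose code, the same three steps (count, sort by descending count, keep the first n) can be sketched in Python. This is an illustration of the logic only, not a replacement for the SAS code; the list of makes is made up and does not reflect the real counts in Sashelp.Cars.

```python
from collections import Counter


def top_n(values, n=3):
    """Count each value, sort by descending count, and keep the first n."""
    counts = Counter(values)                        # like PROC FREQ with OUT=
    ordered = sorted(counts.items(),
                     key=lambda item: item[1],
                     reverse=True)                  # like BY DESCENDING count
    return ordered[:n]                              # like OBS=n


# Made-up sample values, for illustration only.
makes = ["Toyota", "Ford", "Toyota", "Honda", "Ford",
         "Toyota", "Audi", "Honda", "Ford", "Toyota"]

print(top_n(makes))
```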

Print the top n observations of all character variables in a data set

Suppose that you want to know the top three values for all the character variables in a data set. The process shown in the previous section is not efficient when you have many variables. Suppose you also want to store this information in a data set. You can use macro logic to handle both tasks. The following code uses PROC FREQ to create an output data set for each variable. Further manipulation is done in a DATA step so that all the data sets can be combined. A detailed explanation follows the example code:

%macro top_frequency(lib=,dsn=);
 
/* count character variables in the data set */
proc sql noprint;
select name into :charlist separated by ' '
from dictionary.columns
where libname=%upcase("&lib") and memname=%upcase("&dsn") and type='char';
quit;
 
%put &charlist;
%let cnt=%sysfunc(countw(&charlist,%str( )));
%put &cnt;
 
%do i=1 %to &cnt;
 
/* Loop through each character variable in PROC FREQ */
/* and create a separate output data set.            */
proc freq data=&lib..&dsn noprint;
tables %scan(&charlist,&i) / missing out=out&i(drop=percent 
 rename=(%scan(&charlist,&i)=value));
run;
 
data out&i;
length varname value $100;
set out&i;
varname="%scan(&charlist,&i)";
run;
 
proc sort data=out&i;
by varname descending count;
run;
 
%end;
 
data combine;
set %do i=1 %to &cnt;
out&i(obs=3) /* Keeps top 3 for each variable. */
%end;;
run;
 
proc print data=combine;
run;
 
%mend top_frequency;
 
options mprint mlogic symbolgen;
%top_frequency(lib=SASHELP,dsn=CARS);

I begin my macro definition with two keyword parameters that enable me to substitute the desired library and data set name in my macro invocation:

%macro top_frequency(lib=,dsn=);

The SQL procedure step selects the names of all the character variables in the data set and stores them in a space-delimited macro variable called &CHARLIST. Another macro variable called &CNT stores the number of words (that is, variable names) in this list.

proc sql noprint;
select name into :charlist separated by ' '
from dictionary.columns
where libname=%upcase("&lib") and memname=%upcase("&dsn") and type='char';
quit;
 
%put &charlist;
%let cnt=%sysfunc(countw(&charlist,%str( )));
%put &cnt;

The %DO loop iterates through each variable in the list and generates output data from PROC FREQ by using the OUT= option. The output data set contains two variables: the variable from the TABLES request, which holds the unique values of that variable, and the Count variable, which holds the frequency counts. The TABLES variable is renamed to Value so that all the data sets can be combined in a later step. In a subsequent DATA step, a new variable, called Varname, is created that contains the variable name as a character string. Finally, the data set is sorted by the descending frequency count.

%do i=1 %to &cnt;
 
/* Loop through each character variable in PROC FREQ */ 
/* and create a separate output data set.            */
proc freq data=&lib..&dsn noprint;
tables %scan(&charlist,&i) / missing
out=out&i(drop=percent 
 rename=(%scan(&charlist,&i)=value));
run;
 
data out&i;
length varname value $100;
set out&i;
varname="%scan(&charlist,&i)";
run;
 
proc sort data=out&i;
by varname descending count;
run;
 
%end;
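
As a cross-check on the logic of this loop, here is a Python sketch of the same idea: for each character column, count the values, keep the top n, and tag each row with the column name so that the results can be stacked. The function name mirrors the macro, but the row layout and sample data are assumptions made for illustration.

```python
from collections import Counter


def top_frequency(rows, n=3):
    """Return (varname, value, count) tuples for each string-valued column.

    rows is a list of dicts that share the same keys. A column counts as
    "character" here if its value in the first row is a str.
    """
    if not rows:
        return []
    char_vars = [name for name, value in rows[0].items()
                 if isinstance(value, str)]
    combined = []
    for name in char_vars:
        counts = Counter(row[name] for row in rows)
        for value, count in counts.most_common(n):
            combined.append((name, value, count))
    return combined


# Made-up rows, for illustration only.
rows = [
    {"make": "Toyota", "type": "Sedan", "msrp": 20000},
    {"make": "Toyota", "type": "SUV", "msrp": 30000},
    {"make": "Ford", "type": "Sedan", "msrp": 25000},
    {"make": "Honda", "type": "Sedan", "msrp": 22000},
]

for record in top_frequency(rows):
    print(record)
```

The numeric msrp column is skipped, just as the macro skips numeric variables by filtering on type='char'.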

The final DATA step combines all the data sets into one by using another macro %DO loop in the SET statement. Two semicolons follow %END: the first ends the %END statement and the second ends the SET statement. The OBS= data set option keeps the first three observations of each data set.

data combine;
set %do i=1 %to &cnt;
 out&i(obs=3) /* Keeps top 3 for each variable. */
%end;;
run;
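
It can help to picture the text that the macro loop produces. The Python sketch below imitates that text generation (it is not SAS and does not reproduce the macro processor's exact spacing). The loop contributes only the data set references, which is why a separate semicolon is still needed to close the SET statement. The value cnt=3 is an assumption for the example.

```python
def generated_set_statement(cnt, obs=3):
    """Build the SET statement that the %DO loop expands to, conceptually."""
    # One data set reference per loop iteration: out1(obs=3), out2(obs=3), ...
    pieces = [f"out{i}(obs={obs})" for i in range(1, cnt + 1)]
    # The trailing semicolon plays the role of the second ';' after %END.
    return "set " + " ".join(pieces) + ";"


print(generated_set_statement(3))
```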

Knowing your data is essential in any programming application. The ability to quickly view the top values of any or all variables in a data set can be useful for identifying top sales, targeting specific demographic segments, trying to understand the prevalence of certain illnesses or diseases, and so on. As explained in this blog, a variety of Base SAS procedures along with the SAS macro facility make it easy to accomplish such tasks.

Learn more

These resources show different ways to create "top N" reports in SAS:

Keeping the top frequency count (n) for each character variable in a SAS data set was published on SAS Users.