Tech

3月 20, 2017
 

Create accessible ODS results with SAS

Let’s look at the term “accessible” and how it relates to the SAS world. Accessible output is output that can be read by a screen reader to someone with low or no vision, visualized by someone with low vision or color blindness, or navigated by someone with limited mobility. In January of 2017, the United States Access Board published a final rule that documents federal standards and guidelines for accessibility compliance specific to information and communication technology (ICT). The new standards and guidelines update the Section 508 law that was most recently amended in 1998, and adopt many of the Web Content Accessibility Guidelines (WCAG) 2.0 standards. Here’s a comparison document: Comparison Table of WCAG 2.0 to Existing 508 Standards. The final rule (as a parent, I appreciate this rule name!) is also known as the “508 refresh”.

To help SAS US customers comply with the visual-accessibility regulations outlined by the final rule, ODS developers are providing SAS programmers with the ability to create accessible results to present their SAS data. In SAS® 9.4 TS1M4 (9.4M4), there are three great improvements that I would like to highlight:

  • ACCESSIBLE_GRAPH option (preproduction) in the ODS HTML5 statement
  • SAS® Graphics Accelerator
  • ACCESSIBLE option (preproduction) in the ODS PDF statement

Note about “Preproduction” Status: ACCESSIBLE and ACCESSIBLE_GRAPH

Why are the ACCESSIBLE (in the ODS PDF FILE= statement) and ACCESSIBLE_GRAPH (in the ODS HTML5 FILE= statement) options preproduction? By setting the status as preproduction, the development team has greater flexibility to make changes to the syntax and underlying architecture. The development team has worked hard to provide these new features, and is very eager to hear feedback from the SAS programming community. They also encourage feedback from the compliance teams that work for those of you who are striving to make your SAS results accessible. Please request and install SAS 9.4M4 here, and start using these new features to generate your current results in the new formats. Ask your compliance team to assess the output, and let us know (accessibility@sas.com) how close we are to making your files compliant with the final rule.

ODS HTML5

The ODS HTML5 destination, which was introduced in SAS® 9.4, creates the most accessible output in SAS for consumption of tables and graphs on the web. This destination creates SVG graphs from ODS GRAPHICS results. SVG graphics scale when zoomed, which maintains the visual integrity of the image. To ensure that the results comply with the maximum number of WCAG standards, use one of the following styles:

  • DAISY (recommended)
  • VADARK
  • HIGHCONTRAST
  • JOURNAL2

In most cases, these styles provide a high level of contrast in graphics output and tabular output.[1]
When you use an accessibility checker such as the open-source accessibility testing tool aXe, you can see how the HTML5 destination compares to the HTML4 destination (the default destination in SAS® Foundation). Here is a comparison of a simple PRINT procedure step. Using this code, I generate two HTML files, html4.html and html5.html:

ods html file="html4.html" path="c:\temp";
ods html5 file="html5.html" path="c:\temp" style=daisy;
 
proc print data=sashelp.cars(obs=5);
var type origin;
run;
 
ods _all_ close;

 

An aXe analysis in Mozilla Firefox finds the following violations in the html4.html file:

Here is the analysis of the html5.html file:

Your success might vary with more complex procedures. The final paragraph of the blog post describes how to offer feedback after you test your code in the HTML5 destination using SAS 9.4M4.

SAS® Graphics Accelerator

The addition of the ACCESSIBLE_GRAPH preproduction option to the ODS HTML5 statement adds accessibility metadata (tags) around ODS GRAPHICS images routed to the HTML5 destination[2]. This metadata enables bar charts, time series plots, heat maps, line charts, scatter plots, and histograms to be consumed by an exciting new add-in available for the Google Chrome and (coming soon) Firefox browsers: SAS Graphics Accelerator. SAS Graphics Accelerator provides the following capabilities:

  • The interactive exploration of supported graphics using sound
  • The ability to download data in tabular format to a CSV file
  • Customization of visual and auditory settings for alternative presentations
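To generate output the accelerator can consume, you add the preproduction option to your ODS HTML5 statement. Here is a minimal sketch; the file name and the SGPLOT step are illustrative, and because ACCESSIBLE_GRAPH is preproduction, the exact syntax could change before production release:

```sas
ods html5 file="accessible_graph.html" path="c:\temp"
          style=daisy accessible_graph;

proc sgplot data=sashelp.cars;
   vbar type;                  /* a simple bar chart of vehicle types */
run;

ods html5 close;
```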

Pay attention to the SAS Graphics Accelerator web page because improvements and features are being offered on a regular basis!

ODS PDF ACCESSIBLE

The web is “where it’s at” for most consumers of your organization’s information. However, many sites need PDF files for results that require longer storage times. Prior to SAS 9.4M4, using a screen reader with PDF files created by ODS did not work because those PDF files were not “tagged.” Tags in a PDF file are not visible in Adobe Reader when the file is opened. But when a PDF file is tagged, it contains underlying metadata that enables screen readers to verbalize the results. Here’s an example of the same PROC PRINT step above written to two different PDF files using SAS 9.4M4. I created the tagged.pdf file using the ACCESSIBLE preproduction destination option, so the file includes tags, making it accessible using assistive technology.

ods pdf file="c:\temp\untagged_default.pdf";
ods pdf (id=a) file="c:\temp\tagged.pdf" accessible;
 
proc print data=sashelp.cars(obs=5);
var type origin;
run;
 
ods _all_ close;

 

 

To determine whether a PDF file is tagged, open the file and select View ► Show/Hide ► Navigation Panes ► Tags. For example, I checked the untagged_default.pdf file and saw the following, which means that this file is not useful to a screen reader:

Let’s compare the results from the tagged.pdf file:

A screen reader uses the HTML-like markup shown above to verbalize the file to someone with low or no vision.

Adobe Acrobat Pro has built-in accessibility checkers that enable us to examine the degree of accessibility of our files. You can display this setting by selecting View ► Tools ► Accessibility. A full discussion of the Adobe compliance-check features is outside the scope of this article. But, an initial examination of the tagged.pdf file shows that there are many accessible features included in the file, and that two of the features need a manual check:

Check with staff who are well-versed in compliance at your organization, and let us know if our files meet your standards.

I want to see sample code, and hear more! How do I get access to SAS 9.4M4 and more information about these new features?

Your SAS installation representative can order SAS 9.4M4 using the information on this page, Request a Maintenance Release. If you want to get a preview of SAS 9.4M4 to learn it before your site gets it, you can use SAS® University Edition (www.sas.com/universityedition).

Read these upcoming papers (available in April 2017) for code samples. And, if you are attending SAS Global Forum 2017, plan to attend the following presentations:

Here are links to the documentation and a previously published SAS Global Forum paper on the topic:

How did we do?

We welcome feedback regarding the results that you are generating with SAS 9.4M4, and look forward to offering the ODS statement options in production status with improved features and support for more procedures. Please send feedback to accessibility@sas.com.

 

[1] See “ODS HTML5 Statement Options Related to Accessibility” in Creating Accessible SAS 9.4 Output Using ODS and ODS Graphics for more information.

 

Create accessible ODS results with SAS or "Why you should be running SAS 9.4m4!" was published on SAS Users.

3月 18, 2017
 

Editor’s note: This is the third in a series of articles to help current SAS programmers add SAS Viya to their analytics skillset. In this post, Advisory Solutions Architect Steven Sober explores how to accomplish distributed data management using SAS Viya. Read additional posts in the series.

In my last article I explained how SAS programmers can execute distributed DATA Step, PROC DS2, PROC FEDSQL and PROC TRANSPOSE in SAS Viya’s Cloud Analytic Services (CAS) which speeds up the process of staging data for analytics, visualizations and reporting. In this article we will explore how open source programmers can leverage the same SAS coding techniques while remaining in their comfort zone.

For this post, I will utilize Jupyter Notebook to run the Python script that is leveraging the same code we used in part one of this series.

Importing Package and Starting CAS

First, we import the SAS Scripting Wrapper for Analytics Transfer (SWAT) package, which is the Python client to SAS Cloud Analytic Services (CAS). To download the SWAT package, use this URL: https://github.com/sassoftware/python-swat.

Let’s review the cell “In [16]”:

1.  Import SWAT

a.  Required statement; this loads the SWAT package into our Python client.

2.  s = swat.CAS("viya.host.com", port#, "userid", "password")

a.  Required statement; for our example we will use “s” in our dot notation syntax to send our statements to CAS for processing. “s” is end-user definable (i.e., I could have used “steve =” instead of “s =”).

b.  viya.host.com is the host name of your SAS Viya platform

c.  Port#

i.  Port number used to communicate with CAS

d.  userid

i.  Your user id for the SAS Viya platform

e.  Password

i.  Your password for your userid

3.  indata_dir = "/opt/sasinside/DemoData"

a.  Creating a variable called “indata_dir”. This is a directory on the SAS Viya platform where the source data for our examples is located.

4.  indata     = "cars"

a.  Creating a variable called “indata”, which contains the name of the source table we will load into CAS.

Reviewing cell “Out[16]” we see the information that CAS returns to our client when we connect successfully.
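Putting the pieces of cell “In[16]” together, the setup looks roughly like this; the host name, port, user ID, and password are placeholders that you replace with your site’s values (the port number shown is only an example):

```python
import swat  # SAS Scripting Wrapper for Analytics Transfer

# Connect to CAS; host name, port, user ID, and password are placeholders
s = swat.CAS("viya.host.com", 5570, "userid", "password")

# Directory on the SAS Viya platform where the source data is located
indata_dir = "/opt/sasinside/DemoData"

# Name of the source table we will load into CAS
indata = "cars"
```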

Loading our Source Table and CAS Action Sets

In order to load data into CAS we first need to create a caslib that points to the directory containing our source data, using the table action set's addCaslib action. The call looks like this (the caslib name matches the library referenced later in this post):

s.table.addCaslib(name="casuser",
                  dataSource={"srcType":"PATH"},
                  path=indata_dir)

a.  To send statements to CAS we use dot notation syntax where:

a.  s

i.  The CAS session that we established in cell “in[16]”

b.  table

i.  CAS action set

c.  addCaslib

i.  Action set’s action

d.  name

i.  Specifies the name of the caslib to add.

e.  dataSource

i.  Specifies the data source type and type-specific parameters.

f.  path

i.  Specifies data source-specific information. For PATH and HDFS, this is a file system path. In our example we are referencing the path using the variable “indata_dir” that we established in cell “In[16]”.

Next we load the source table into CAS with the loadTable action. A call along these lines produces the result shown in cell “Out[17]” (the .sas7bdat suffix on the path is an assumption; for a PATH-type caslib, path= is the file name of the data set):

s.table.loadTable(path=indata + ".sas7bdat",
                  casOut={"caslib":"casuser", "name":"cars",
                          "replace":"True"})

a.  As we learned, “s.” is our connection to CAS, “table.” is the CAS action set, and “loadTable” is the action set’s action.

a.  path=

i.  Specifies the file, directory or table name. In our example this is the physical name of the SAS data set being loaded into CAS.

b. casOut=

i.  The CAS library we established in cell “In[17]” using the “addCaslib” action.

a.  “caslib” is a reserved word and is used to reference all CAS libraries
b.  “casuser” is the CAS library we will use in our examples
c.  “name” is the CAS table name
d.  “replace” provides an option to replace the CAS table if it already exists

Reviewing cell “Out[17]” we see the information that CAS returns to our client when we successfully load a table into CAS.

Click here for information on the loadActionSet action.

DATA Step

We are now ready to continue by running DATA Step, PROC DS2, PROC FEDSQL and PROC TRANSPOSE via our Python script.

Now that we understand the dot notation syntax used to send statements to CAS, it becomes extremely simple to leverage the same code our SAS programmers are using.

Reviewing cell “In[19]” we notice we are using the CAS action set “dataStep” and its action “runCode”. Notice that between the (“”” and “””) delimiters we have the same DATA Step code we reviewed in part one of this series. Reviewing cell “Out[19]”, we can see the information CAS sent back about the source (casuser.cars) and target (casuser.cars_data_step) tables used in our DATA Step.
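The pattern in cell “In[19]” can be sketched as follows; the DATA Step body here is a stand-in for the code from part one of the series, and the sketch assumes the session “s” and the casuser.cars table from the earlier cells:

```python
# Run ordinary DATA Step code distributed in CAS via the dataStep action set
out = s.dataStep.runCode(code="""
   data casuser.cars_data_step;
      set casuser.cars;
      /* complex business rules go here */
   run;
""")
```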

With DS2 we utilize the CAS action set “ds2” with its action “runDS2.” In reviewing cell “In[23]” we notice a slight difference in our code: there is no “PROC DS2” prior to the “thread mythread / overwrite = yes;” statement. With the DS2 action set we simply define our DS2 THREAD program and follow that with our DS2 DATA program. Notice in the DS2 DATA program we declare the DS2 THREAD that we just created.

Review the NOTE statements prior to “Out[23]”; these statements validate the DS2 THREAD and DATA programs executed in CAS.
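A sketch of that DS2 pattern, assuming the runDS2 action takes the code in its program= parameter and reusing the casuser.cars table (the THREAD body and target table name are illustrative):

```python
s.loadActionSet("ds2")          # make the ds2 action set available

s.ds2.runDS2(program="""
   thread mythread / overwrite = yes;
      method run();
         set casuser.cars;
      end;
   endthread;

   data casuser.cars_ds2 / overwrite = yes;
      dcl thread mythread t;    /* declare the THREAD program just defined */
      method run();
         set from t;
      end;
   enddata;
""")
```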

With FedSQL we use the CAS action set “fedsql” with its action “execDirect.” The “query=” parameter is where we place our FedSQL statements. By reviewing the NOTE statements we can validate that our FedSQL ran successfully.
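For example, a sketch of a FedSQL call against the casuser.cars table (the query itself is illustrative):

```python
s.loadActionSet("fedSQL")       # make the fedSQL action set available

s.fedsql.execDirect(query="""
   select type, origin, count(*) as n
   from casuser.cars
   group by type, origin
""")
```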

With TRANSPOSE we use the CAS action set “transpose” with its action “transpose.” The syntax differs from PROC TRANSPOSE, but it is very straightforward to map out the parameters to accomplish the transpose you need for your analytics, visualizations and reports.
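As a sketch only, a transpose call might look like the following; the parameter names and values here are my assumptions, so check the transpose action documentation for the exact syntax:

```python
s.loadActionSet("transpose")    # make the transpose action set available

s.transpose.transpose(
   table={"caslib": "casuser", "name": "cars", "groupBy": ["make"]},
   transpose=["msrp"],          # variable whose values are spread into columns
   id=["model"],                # values of this variable become column names
   casOut={"caslib": "casuser", "name": "cars_transposed", "replace": True}
)
```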

Collaborative distributed data management with open source was published on SAS Users.

3月 17, 2017
 

Community detection has been used in multiple fields, such as social networks, biological networks, telecommunication networks, and fraud/terrorist detection. Traditionally, it is performed on an entity link graph in which the vertices represent the entities and the edges indicate links between pairs of entities, and its target is to partition an entity-link graph into communities such that the links within the community subgraphs are more densely connected than the links between communities. Finding communities within an arbitrary network can be a computationally difficult task. The number of communities, if any, within the network is typically unknown and the communities are often of unequal size and/or density. Despite these difficulties, however, several methods for community finding have been developed and employed with varying levels of success.[1] SAS implements the most popular algorithms and SAS-proprietary algorithms of graph and network analysis, and has integrated them with other powerful analytics into SAS Social Network Analysis. SAS graph algorithms are packaged into PROC OPTGRAPH, and with it you can detect communities from a network graph.

In text analytics, researchers have explored applying community detection to textual interaction data and shown its effectiveness, for example on co-authorship networks, textual-interaction networks, and social-tag networks.[2] In this post, I would like to show you how to cluster papers based on the keyword link graph using community detection.

Following the steps that I introduced in my previous blog, you can get paper keywords with SAS Text Analytics. Suppose you already have the paper keywords; then you need to go through the following three steps.

Step 1: Build the network.

The network structure depends on your analysis purpose and data. Take the paper-keyword relationships, for example; there are two ways to construct the network. The first method uses papers as vertices, and a link occurs only when two papers share a keyword or keyword token. The second method treats papers, keywords and keyword tokens as vertices, and links exist only in paper-keyword pairs or paper-keyword_token pairs. There is no direct link between papers.

I compared the community detection results of the two methods, and finally I chose the second method because its result is more reasonable. So in this article, I will focus on the second method only.

In addition, SAS supports weighted graph analysis; in my experiment I used term frequency as the weight. For example, the keywords of paper 114 are “boosting; neural network; word embedding”. After parsing the paper keywords with SAS Text Analytics, we get 6 terms: “boost”, “neural network”, “neural”, “network”, “word”, and “embedding”. Here I turned on the stemming and noun group options, and honored the SAS stoplist for English. Table-1 shows the network data of this paper.

In the text parse step, I set term weight and cell weight to none, because community detection depends on link density, and term frequencies are more effective than weighted values in this tiny data set. As Table-1 shows, term frequency is small too, so there is no need to use a log transform for cell weight.

Step 2: Run community detection to cluster papers and output the detection result.

There are two types of network graphs: directed graphs and undirected graphs. In paper-keyword relationships, the direction from paper to keyword or vice versa makes no difference, so I chose an undirected graph. PROC OPTGRAPH implements three heuristic algorithms for finding communities: the LOUVAIN algorithm proposed in Blondel et al. (2008), the label propagation algorithm proposed in Raghavan, Albert, and Kumara (2007), and the parallel label propagation algorithm developed by SAS (patent pending). The Louvain algorithm aims to optimize modularity, which is one of the most popular merit functions for community detection. Modularity is a measure of the quality of a division of a graph into communities. The modularity of a division is defined to be the fraction of the links that fall within the communities minus the expected fraction if the links were distributed at random, assuming that you do not change the degree of each node.[3] In my experiment, I used Louvain.
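In symbols, the standard definition of modularity described above reads:

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

where A_ij is the link weight between nodes i and j, k_i is the degree of node i, m is the total number of links, and δ(c_i, c_j) is 1 when the two nodes are in the same community and 0 otherwise. A larger Q means more links fall inside communities than random placement would predict.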

Besides the algorithm, you also need to set the resolution value. A larger resolution value produces more communities, each of which contains a smaller number of nodes. I tried three resolution values (0.5, 1, and 1.5), and I finally settled on 1 because the topic of each community was most reasonable. With these settings, I got 18 communities.

Step 3: Explore communities visually and get insights.

Once you have community detection result, you may use Network Diagram of SAS Visual Analytics to visually explore the communities and understand their topics or characteristics.

Take the largest community as an example: there are 14 papers in this community. Nodes annotated with numbers are papers; the others are keyword tokens. Node size is determined by the sum of link weights (term frequency), and node color is determined by community value. From Figure-1, you can easily find its topic: sentiment, which is the largest of all the keyword nodes. After I went through the conference program, I found that they are all papers of the IALP 2016 shared task, which is targeted at predicting valence-arousal ratings of Chinese affective words.

Figure-1 Network Diagram of Papers in Community 0

 

Another example is community 8, and its topic terms are annotation and information.

Figure-2 Network Diagram of Papers in Community 8

The keywords were clustered at the same time, and the keyword communities may be used in your search engine to improve keyword-based recommendation, or to improve search performance by retrieving more relevant documents. I extracted the keywords (noun groups only) of the top 5 communities and displayed them with SAS Visual Analytics. The top 3 keywords of community 0 are sentiment analysis, affective computing, and affective lexicon, which are semantically very close. If you have more data, you may get better results than mine.

Figure-3 Keyword frequency chart of the top 5 communities

If you are interested in this analysis, why not try it with your data? The SAS scripts for clustering papers are below.

* Step 1: Build the paper-keyword network;
proc hptmine data=outlib.paper_keywords;
   doc_id documentName; 
   var keywords;
   parse notagging termwgt=none cellwgt=none
         stop=sashelp.engstop
         outterms=terms
         outparent=parent
         reducef=1;
run;quit;
 
proc sql;
   create table outlib.paper_keyword_network as
   select _document_ as from, term as to, _count_ as weight
   from parent
   left join terms
   on parent._termnum_ = terms.key
   where parent eq .;
quit;
 
* Step 2: Run community detection to cluster papers;
* NOTE: huge network = low resolution level;
proc optgraph
   loglevel = 1
   direction=undirected
   graph_internal_format = thin
   data_links = outlib.paper_keyword_network
   out_nodes  = nodes
   ;
 
   community
      loglevel = 1
      maxiter = 20
      link_removal_ratio = 0
      resolution_list    = 1
      ;
run;
quit;
 
proc sql;
   create table paper_community as
   select distinct Paper_keywords.documentName, keywords, community_1 as community
   from outlib.Paper_keywords
   left join nodes
   on nodes.node = Paper_keywords.documentName
   order by community_1, documentName;
quit;
 
* Step 3: Merge network and community data for VA exploration;
proc sql;
   create table outlib.paper_community_va as
   select paper_keyword_network.*, community
   from outlib.paper_keyword_network
   left join paper_community
   on paper_keyword_network.from = paper_community.documentName;
quit;
 
* Step 4: Keyword communities;
proc sql;
   create table keyword_community as
   select *
   from nodes
   where node not in (select documentName from outlib.paper_keywords)
   order by community_1, node;
quit;
 
proc sql;
   create table outlib.keyword_community_va as
   select keyword_community.*, freq
   from keyword_community
   left join terms
   on keyword_community.node = terms.term
   where parent eq . and role eq 'NOUN_GROUP'
   order by community_1, freq desc;
quit;

 

References

[1]. Communities in Networks
[2]. Automatic Clustering of Social Tag using Community Detection
[3]. SAS(R) OPTGRAPH Procedure 14.1: Graph Algorithms and Network Analysis

Clustering of papers using Community Detection was published on SAS Users.


3月 17, 2017
 

Editor’s note: This is the second in a series of articles to help current SAS programmers add SAS Viya to their analytics skillset. In this post, Advisory Solutions Architect Steven Sober explores how to accomplish distributed data management using SAS Viya. Read additional posts in the series.

This article in the SAS Viya series will explore how to accomplish distributed data management using SAS Viya. In my next article, we will discuss how SAS programmers can collaborate with their open source colleagues to leverage SAS Viya for distributed data management.

Distributed Data Management

SAS Viya provides a robust, scalable, cloud-ready distributed data management platform. This platform provides multiple techniques for data management that run distributed, i.e., using all cores on all compute nodes defined to the SAS Viya platform. The four techniques we will explore here are DATA Step, PROC DS2, PROC FEDSQL and PROC TRANSPOSE. With these four techniques, SAS programmers and open source programmers can quickly apply complex business rules that stage data for downstream consumption, i.e., analytics, visualizations, and reporting.

The rule for getting your code to run distributed is to ensure that all source and target tables reside in the in-memory component of SAS Viya, i.e., Cloud Analytic Services (CAS).

Starting CAS

The following statement is an example of starting a new CAS session. In the coding examples that follow, we will reference this session using the keyword MYSESS. Also note, this CAS session is using one of the default CAS libraries, CASUSER.
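A minimal sketch of that statement, starting a session named MYSESS:

```sas
cas mysess;   /* start a CAS session named MYSESS */
```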

Binding a LIBNAME to a CAS session

Now that we have started a CAS session we can bind a LIBNAME to that session using the following syntax:
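The syntax is shown as an image in the original post; a sketch of such a LIBNAME binding (the libref name MYCAS is an assumption) might be:

```sas
/* Bind a SAS libref to the CASUSER caslib in the MYSESS session */
libname mycas cas caslib=casuser sessref=mysess;
```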

Note: CASUSER is one of the default CAS libraries created when you start a CAS session. In the following coding examples we will utilize CASUSER for our source and target tables that reside in CAS.

To list all default and end-user CAS libraries, use the following statement:
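The listing statement appears as an image in the original post; one way to do this with the CASLIB statement is:

```sas
/* List all caslibs visible to the active CAS session */
caslib _all_ list;
```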

Click here for more information on CAS libraries.

THREAD program

  • The PROC DS2 DATA program must declare a THREAD program
  • The source and target tables must reside in CAS
  • Unlike DATA Step, with PROC DS2 you use the SESSREF= parameter to identify the CAS session in which the source and target tables reside
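The code for these rules is shown as an image in the original post; a hypothetical sketch of the pattern (the table and variable names here are assumptions, not from the original) might look like this:

```sas
proc ds2 sessref=mysess;
   /* THREAD program: runs in parallel on all CAS nodes */
   thread compute / overwrite=yes;
      dcl char(8) price_band;
      method run();
         set casuser.cars;               /* source table resides in CAS */
         if msrp > 40000 then price_band = 'High';
         else price_band = 'Standard';
      end;
   endthread;

   /* DATA program declares the thread and writes the CAS target table */
   data casuser.cars_banded (overwrite=yes);
      dcl thread compute t;
      method run();
         set from t;
      end;
   enddata;
run;
quit;
```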

    For PROC TRANSPOSE to run in CAS, the rules are:

    1. All source and target tables must reside in CAS
       a. Like DATA Step, you use a two-level name to reference these tables
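A sketch of a distributed PROC TRANSPOSE call, under the assumption of a hypothetical CASUSER.SALES table with PRODUCT, MONTH, and AMOUNT columns, might be:

```sas
/* Source and target are both two-level CAS table names */
proc transpose data=casuser.sales out=casuser.sales_t;
   by product;
   id month;
   var amount;
run;
```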

    Collaborative distributed data management using SAS Viya was published on SAS Users.

    March 13, 2017

    My high school basketball coach started preparing us for the tournaments in the season’s first practice. He talked about the “long haul” of tournament basketball, and geared our strategies toward a successful run at the end of the season.

    I thought about the “long haul” when considering my brackets for this year’s NCAA Tournament, and came to this conclusion: instead of seeking to predict who might win a particular game, I wanted to use analytics to identify which teams were most likely to win multiple games. The question that I sought to answer was simply, “Could I look at regular season data and recognize the characteristics inherent in teams that win multiple games in the NCAA Tournament?”

    I prepared and extracted features from data representing the last 5 regular seasons. I took tournament data from the same period and counted the number of wins per team (per season). This number would be my target value (0, 1, 2, 3, 4, 5, or 6 wins). Only teams that participated in the tournaments were included in the analysis.

    I used SAS Enterprise Miner’s High Performance Random Forest Node to build 10,000 trees (in less than 14 seconds), and I determined my “top 10 stats” by simply observing which factors were split on the most.
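In SAS Enterprise Miner this is configured through the node's property panel, but a comparable sketch in SAS code (the dataset and variable names here are hypothetical) might look like:

```sas
/* Grow 10,000 trees; importance is inferred from how often a
   variable is chosen for splitting */
proc hpforest data=mylib.season_stats maxtrees=10000;
   target tourney_wins / level=nominal;
   input win_pct road_win_pct margin ato tor fg_ratio
         opp_3pt_pct ft_pct blocks / level=interval;
run;
```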

    Here are the results (remember that statistics represented are from the regular season and not the tournament), my “top 10 statistics to consider.”

    1 ---  Winning Percentage. Winners win, right?  It is evident this is true the further a team moves into the tournament.

    • Teams that win a single game have an average winning percentage of .729
    • Teams that win 6 games have an average winning percentage of .858
    • No team that has won a Final Four game over the last 5 years has a winning percentage less than .706
    • Teams that won 6 games have a minimum winning percentage of .765.

    2 --- Winning games by wide margins. Teams that advance in the tournament have beaten teams by wide margins during the regular season – this means that in some game over the course of the year, a team let go and won big! From a former player’s perspective, it doesn’t matter “who” you beat by a wide margin, but rather do you have the drive to crush the opponent?

    • Teams that won 6 games have beaten some team by 49 points, differentiating themselves from even the 5-win teams by 9 points!

    3 --- The ratio of assists to turnovers (ATO). Teams that take care of and distribute the ball tend to be making assists instead of turnovers. From my perspective, the ATO indicates whether or not a team dictates the action.

    • Over the last 5 years, no team that won 6 games had an ATO less than 1.19!
    • Teams that have won at least 5 had an average ATO of 1.04.
    • Teams that won fewer than 5 had average ATOs of less than 1.

    4 --- Winning percentage on the road. We’ve already noted that overall winning percentage is important, but it’s also important to win on the road, since tournament games are rarely played on a team’s home floor!

    • Teams that don’t win any tournament games win 52% of their road games
    • Teams that win 1-2 games win 57.8%
    • Teams that win 3-5 win 63%
    • Team that win 6 win 78% of their road games, and average only 2.4 (road) losses per year
    • No team that has won at least 5 games has lost more than 5 on the road (in the last 5 years)!

    5 --- The ratio of a team’s field goal percentage to the opposition’s field goal percentage. Winning games on the big stage requires both scoring and defense! A ratio above 1 indicates that you score the ball better than you allow your opposition to score.

    • Teams that win 2 or fewer games have a ratio of 1.12
    • Teams that win 3-5 games have a ratio of 1.18
    • Teams that win 6 games have a ratio of 1.23 – no team that has won 6 games had a ratio of less than 1.19!

    6 --- The ratio of turnovers to the turnovers created (TOR). I recall coaches telling me that a turnover committed by our team was essentially a 4-point play: 2 that we didn’t get, and 2 they did.

    • Teams that win the most tournament games have an average TOR of 0.89. This means they turn the ball over at a minimal rate when compared to the turnovers they create.
    • Over the past 5 years, teams that won 6 games have an average TOR .11 better than the rest of the pack, which can be interpreted this way: they force the opposition into turnovers roughly 10% more often than they commit turnovers themselves.

    7 --- Just as important as beating teams by wide margins are the close games! Close games build character, and provide preparation for the tournament.

    • Teams that win 6 games play more close games than any other group. The average minimum differential for this group is 1.6 points
    • Teams winning fewer games average a differential of 1.8 points.

    8 --- Defending the 3. Teams that win more games in the tournament defend the 3 point shot only slightly better than the other teams, but they are twice as consistent in doing it! So, regardless of who’s coming to play, look for some sticky D beyond the arc!

    • On average, teams allow a 3-point field goal percentage of .328
    • Teams winning the most tournament games defend only slightly better at .324; however, the standard deviation is the more interesting statistic, indicating that their consistency in defending the 3-point shot is almost twice as good as the other teams’!

    9 --- Teams that win are good at the stripe! Free throws close games. Make them and get away with a win!

    • Teams that win the most games shoot for an average of .730 while the rest of the pack sits at .700

     
    10 --- Teams that win the most games block shots! They play defense, period.

    • Teams that win the most tournament games average over 5 blocks per game.
    • Teams winning 6 games have blocked at least 3.4 shots per game (over the last 5 years)

    Next steps? Take what’s been learned and apply it to this year’s tournament teams, and then as Larry Bird used to do, ask the question, “who’s playing for second?”

    In addition to SAS Enterprise Miner, I used SAS Enterprise Guide to prepare the data for analysis and, I used JMP’s Graph Builder to create the graphics. The data was provided by Kaggle.

    The top 10 statistics to consider when filling out your NCAA brackets was published on SAS Users.

    March 11, 2017

    Ensemble models have been used extensively in credit scoring applications and other areas because they are considered to be more stable and, more importantly, to predict better than single classifiers (see Lessmann et al., 2015). They are also known to reduce model bias and variance (Myoung-Jong et al., 2006; Tsai et al., 2011). The objective of this article is to compare the predictive accuracy on four distinct datasets of two ensemble classifiers (Gradient Boosting (GB) and Random Forest (RF)) and two single classifiers (Logistic Regression (LR) and Neural Network (NN)) to determine if, in fact, ensemble models are always better. My analysis did not look into optimizing any of these algorithms or feature engineering, which are the building blocks of arriving at a good predictive model. I also decided to base my analysis on these four algorithms because they are the most widely used methods.

    What is the difference between a single and an ensemble classifier?

    Single classifier

    Individual classifiers pursue different objectives to develop a (single) classification model. Statistical methods either estimate the posterior probability P(+|x) directly (e.g., logistic regression), or estimate class-conditional probabilities P(x|y), which they then convert into posterior probabilities using Bayes’ rule (e.g., discriminant analysis). Semi-parametric methods, such as NN or SVM, operate in a similar manner, but support different functional forms and require the modeller to select one specification a priori. The parameters of the resulting model are estimated using nonlinear optimization. Tree-based methods recursively partition a data set so as to separate good and bad loans through a sequence of tests (e.g., is loan amount > threshold). This produces a set of rules that facilitate assessing new loan applications. The specific covariates and threshold values to branch a node follow from minimizing indicators of node impurity such as the Gini coefficient or information gain (Baesens, et al., 2003).

    Ensemble classifier

    Ensemble classifiers pool the predictions of multiple base models. Much empirical and theoretical evidence has shown that model combination increases predictive accuracy (Finlay, 2011; Paleologo, et al., 2010). Ensemble learners create the base models in an independent or dependent manner. For example, the bagging algorithm derives independent base models from bootstrap samples of the original data (Breiman, 1996). Boosting algorithms, on the other hand, grow an ensemble in a dependent fashion. They iteratively add base models that are trained to avoid the errors of the current ensemble (Freund & Schapire, 1996). Several extensions of bagging and boosting have been proposed in the literature (Breiman, 2001; Friedman, 2002; Rodriguez, et al., 2006). The common denominator of homogeneous ensembles is that they develop the base models using the same classification algorithm (Lessmann et al., 2015).


    Figure 1: Workflow of single v. ensemble classifiers: derived from the work of Utami, et al., 2014

    Experiment set-up

    Datasets

    Before modelling, I partitioned each dataset into 70% training and 30% validation sets.

    Table 1: Summary of dataset used for model comparisons

    I used SAS Enterprise Miner as a modelling tool.

    Figure 2: Model flow using Enterprise Miner

    Results

    Table 2: Results showing misclassification rates of all dataset

    Conclusion

    Using misclassification rate as the measure of model performance, RF was the best model for Cardata, Organics_Data, and HMEQ, followed closely by NN. NN was the best model for Time_series_data and performed better than the GB ensemble model for Organics_Data and Cardata.

    My findings partly support the hypothesis that ensemble models naturally do better in comparison to single classifiers, but not in all cases. NN, which is a single classifier, can be very powerful, unlike most classifiers (single or ensemble), which are kernel machines and data-driven. NN can generalize from unseen data and act as a universal functional approximator (Zhang, et al., 1998).

    According to Kaggle CEO and Founder, Anthony Goldbloom:

    “In the history of Kaggle competitions, there are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks”.

    What are your thoughts?

    Are ensemble classifiers always better than single classifiers? was published on SAS Users.

    March 9, 2017

    SAS® Federation Server provides a central, virtual environment for administering and securing access to your data. It also allows you to combine data from multiple sources without moving or copying the data. SAS Federation Server Manager, a web-based application, is used to administer SAS Federation Server(s).

    Data privacy is a major concern for organizations, and one of the features of SAS Federation Server is that it allows you to effectively and efficiently control access to your data, so you can limit who is able to view sensitive data such as credit card numbers, personal identification numbers, names, etc. In this three-part blog series, I will explore the topic of controlling data access using SAS Federation Server. The series covers the following topics:

    SAS Metadata Server is used to perform authentication for users and groups in SAS Federation Server, and SAS Federation Server Manager is used to help control access to the data. Note: Permissions applied for a particular data source cannot be bypassed with SAS Federation Server security. If permissions are denied at the source data, for example on a table, then users will always be denied access to that table, no matter what permissions are set in SAS Federation Server.

    In this post, I will build on the examples from my previous articles and demonstrate how you can use data masking to conceal actual data values from users, but still allow them access for analysis and reporting purposes.

    In previous posts, I gave the Finance Users group access to the SALARY table. Linda is a member of the Finance Users group, so currently she has access to the SALARY table.

    However, I want to restrict her access. She needs access to the Salary info for analytic purposes, but does not need to know the identifying data of IDNUM, so I can hide that column from her. She does need the JOBCODE information for her analytics; however, she does not need to know the actual JOBCODE information associated with the record, so that data can be masked to prevent her from viewing that identifying information.

    First, I create a FedSQL view of the SALARY table. FedSQL is the implementation of SQL that SAS Federation Server uses to access relational data. For the view, I set the Security option to “Use the definer’s privileges when accessed,” since I will eventually deny Linda the rights to view the table underlying the view.

    Here is the default code for the view:

    I change the code to the following to remove the IDNUM column from the view and mask the JOBCODE column, so Linda will not know what is the real JOBCODE associated with the Salary.

    There are several data masking functions available for use. In this instance, I use the TRANC function to mask the JOBCODE field using transliterated values by replacing the first three characters with other values.  Refer to the Data Masking section of the SAS Federation Server Manager 4.2: User’s Guide for more information on the different data masking functions.
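The modified view definition appears as a screenshot in the original post; a rough FedSQL sketch of the idea follows. The exact masking call syntax here is an assumption (consult the Data Masking section of the User's Guide for the real signature), and the catalog/schema names are placeholders:

```sql
/* Illustrative only: the SELECT masks JOBCODE and omits IDNUM entirely */
CREATE VIEW catalog1.schema1.SALARY_VIEW AS
   SELECT SYSCAT.DM.MASK('TRANC', t.JOBCODE) AS JOBCODE,
          t.SALARY
   FROM catalog1.schema1.SALARY t
```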

    Now that I have created the FedSQL view, I then need to grant Linda authorization to it.

    Next, I need to deny Linda authorization to the SALARY table, so she won’t be able to access the original table.

    Linda is only able to view the SALARY_VIEW with the IDNUM column removed and the JOBCODE information masked.

    Linda is denied access to the SALARY table.

    However, Kate another member of the Finance team is able to view the full SALARY table with the IDNUM column and the real information (non-masked) in the JOBCODE column.

    In this blog entry, I covered the third part of this series on controlling data access to SAS Federation Server 4.2.  Other blogs in the series include

    For more information on SAS Federation Server visit the:

    Securing sensitive data using SAS Federation Server data masking was published on SAS Users.

    March 9, 2017

    Several months ago, I posted a blog about calculating moving averages for a measure in the Visual Analytics Designer. Soon after that, I was asked about calculating not only the average, but also the standard deviation over a period of months, when the data might consist of one or more repeated values of a measure for each month of a series of N months.  For the example of N=20 months, we might want to view the average and standard deviation over the last n months, where n is any number between 3 and 20.

    The example report shown below allows the user to type in a number, n, between 3 and 20, to display a report consisting of the amount values for the past n months, the values for Current Month Amt-Previous, the Avg over the last n months, the Standard Deviation over the last n months, and the absolute value of (Current Month Amt – Previous Month Amt) divided by the Standard Deviation over the last n months. A display rule is applied to the final Abs column, showing green for a value less than 1 and red for a value greater than or equal to 1.

    The data used in this example had multiple Amount values for each month, so we first used the Visual Data Builder to create a SUM aggregation for Amount for each unique Date value.  This step gives more flexibility in using the amount value for aggregations in the designer.

    When the modified data source is initially added to the report, it contains only the Category data item Month, with a format of MMYYYY, and the measure Amount Sum for Month.

    The data will be displayed in a list table. The first columns added to the table will be Month, displayed with a MMYYYY format, and Amount Sum for Month.

    Specify the properties for the list table as below:

    Since we want to display the last n months, we create a new calculated data item, Numeric Date, calculated as below, using the TREATAS operator on the Month data item:

    Then we create the Current Month Amt-Previous aggregated measure using the RelativePeriod date operator:

    Next, create the Avg over all displayed months aggregated measure as below:

    Then, create the Std.Dev. over all displayed months aggregated measure as shown below:

    Create the Abs (Current-Previous/StdDev) as shown below:

    Create a numeric parameter, Number of Months, as shown, with minimum value of 3 (smallest value that a standard deviation will make sense) and maximum value of 20 (the number of months in our data). You can let the default (Current value) value be any value that you choose:

    For the List Table, create a Rank, as shown below. Note that we are creating the rank on the Numeric Date (not the Month data item), and rather than a specific value for count, we are going to use the value of the parameter, Number of Months.

    Create a text input object that enables the user to type in a ‘number of months’ between 3 and 20.

    Associate the Parameter with the Text input object:

    If you wish, you can add display rules to sound an alarm whenever there is an alarming month-to-month difference in comparison to the standard deviation for the months.

    So the final result of all of the above is this report, which points out month-to-month differences that might deserve further concern or investigation. Note that the Numeric Date value is included below just to enable you to see what those values look like—you likely would not want to include that calculated data item in your report.

    Calculating standard deviation of a measure in Visual Analytics Designer was published on SAS Users.

    March 2, 2017

    SAS Global Forum 2017 is just a month away and, if you’re a SAS administrator, it’s a great place to meet your peers, share your experiences and attend presentations on SAS administration tips and tricks.

    SAS Global Forum 2017 takes place in Orlando, FL, April 2-5. You can find more information at https://www.sas.com/en_us/events/sas-global-forum/sas-global-forum-2017.html. This schedule is for the entire conference and includes pre- and post-conference events.

    If you’re an administrator, though, I wanted to highlight a few events that would be of particular interest to you:

    On Sunday, April 2nd from 2-4 pm there is a “Helping the SAS Administrator Succeed” event. More details can be found here.

    On Monday, April 3rd from 6:30-8:00 pm the SAS Users Group for Administrators (SUGA) will be hosting a Community Linkup, with panelists on hand to help answer questions from SAS administrators. Location will be in the Dolphin Level – Asia 4.

    There are two post-conference tutorials for the SAS Administrators:

    Introduction to SAS Grid Manager, Wednesday, April 5th from 2:30-6:30pm
    SAS Metadata Security, Thursday, April 6th from 8:00am-noon
    More details can be found here.

    For a list of the papers on the topic of SAS Administration, you can visit this link. You will see that SAS Administration has been broken down into Architecture, Deployment, SAS Administration, and Security subtopic areas.

    Some of the key papers under each sub-topic area are:

    Architecture
    Twelve Cluster Technologies Available in SAS 9.4
    Deploying SAS on Software-Defined and Virtual Storage Systems
    Shared File Systems:  Determining the Best Choice for your Distributed SAS Foundation Applications
    Do You have a Disaster Recovery Plan for Your SAS Infrastructure

    Deployment
    Pillars of a Successful SAS Implementation with Lessons from Boston Scientific
    Getting the Latest and Greatest from SAS 9.4: Best Practices for Upgrades and Migrations
    Migrating Large, Complex SAS Environments: In-Place versus New Build

    SAS Administration
    SAS Metadata Security 201:  Security Basics for a New Administrator
    SAS Environment Manager: Advanced Topics
    The Top Ten SAS Studio Tips for SAS Grid Manager Administrators
    Implementing Capacity Management Policies on a SAS LASR Analytic Server Platform: Can You Afford Not To?
    Auditing in SAS Visual Analytics
    SAS Viya: What it Means for SAS Administration

    Security
    Guidelines for Protecting Your Computer, Network, and Data from Malware Threats
    Getting Started with Designing and Implementing a SAS® 9.4 Metadata and File System Security Design
    SAS® Metadata Security 301: Auditing your SAS Environment
    SAS® Users Audit: An Automated Approach to Metadata Reporting
    SAS® Metadata Security

    In addition to the breakout sessions, there is an Administration Super Demo station where short presentations will be given. The schedule for these presentations is:

    Sunday, April 2nd:
    17:00     Shared File Systems for SAS Grid Manager
    18:00     Where to Place SAS WORK in your SAS Grid Infrastructure

    Monday, April 3rd:
    11:00     Hands-on Secure Socket Layer Configuration for SAS 9.4 Environment Manager
    12:00     Introduction to Configuring SAS Metadata Security for Multitenancy
    13:00     SAS Viya Overview
    14:00     Accelerate your SAS Programs with GPUs
    15:00     Authentication and Identity Management with SAS Viya

    Tuesday, April 4th:
    11:00     Accelerate your SAS Programs with GPUs
    12:00     Accelerating your Analytics Adoption with the Analytics Fast Track
    13:00     New Deployment Experience for SAS
    14:00     Managing Authorization in SAS Viya
    15:00     Clustering in SAS Viya
    16:00     A Docker Container Toolbox for the Data Scientist

    As you can see, there is lots for SAS Administrators to learn and opportunities for SAS Administrators to socialize with fellow SAS Administrators.

    Here’s to seeing you in sunny Florida next month.

    P.S. SAS administrators don’t have to go to SAS Global Forum to get help administering their environment. In addition to SAS Global Forum and the SUGA group mentioned above, you can find out more information on resources for administrators in this blog. You can also visit our new webpage devoted just to users who administer their organization’s SAS environment. You can find that page here.

    Resources for SAS Administrators at SAS Global Forum 2017 … and beyond was published on SAS Users.