
December 22, 2016
 

Editor's note: The following post is from Melissa Marshall, Principal at Melissa Marshall Consulting LLC. Melissa is a featured speaker at SAS Global Forum 2017, and on a mission to transform how scientists and technical professionals present their work.

Learn more about Melissa.


Think back to the last technical talk you were an audience member for. What did you think about that talk? Was it engaging and interesting? Boring and overwhelming?  Perhaps it was a topic that was important to you, but it was presented in a way that made it difficult to engage with the content. As an expert in scientific presentations, I often observe a significant “disconnect” between the way a speaker crafts a presentation and the needs of the audience. It is my belief that the way to bridge this gap is for you, as a technical presenter, to become an audience centered speaker vs. a speaker centered speaker.


Here I will provide some quick tips on how to transform your content and slides using your new audience centered speaking approach!

Audience Centered vs. Speaker Centered

The default setting for most presenters is that they are speaker centered—meaning that they make choices in their presentation because those choices work primarily for them as the speaker. Examples include spending a lot of time on the part of the topic that gave you the most difficulty or took the most work; using terms that are familiar to you but are jargon to the audience; putting most of the words you want to say on your slides to remind you what to say, so that your slides are essentially your speaker notes; and standing behind a podium, physically disconnecting yourself from your audience. These choices are common in presentations, but they do not set you up for success, and they are a key reason why many presentations of technical information fail.

A critical insight is to realize that your success as a speaker depends entirely upon your ability to make your audience successful.  You don’t get to decide that you gave a great talk (even if no one understood it)!  That’s because presentations, by their very nature, are always made for an audience.  You need something from your audience—that is why you are giving a talk!  So, it is time to get serious about making your audience successful (so you can be too!).  I might define “audience success” as: your audience understands and views your subject in the way you wanted them to.  Strategically, if you desire to be a successful speaker, then the best thing you can do is go “all in” on making your audience successful!

Audience Centered Content

To make your content more audience centered, ask yourself four critical questions about your audience ahead of time:

  • Who are they?
  • What do they know?
  • Why are they here?
  • What biases do they have?


The answers to these questions will guide how you begin to focus your content. Additionally, as a presenter of technical information, one of the most important questions you need to answer along the way, at many stages in your presentation, is “So what?”.  Too often presenters share complex technical information or findings, but they do not make the direct connection to the audience of how that information is relevant or important to the big picture or overall message.  Remind yourself each time you share a technical finding to also follow up that information with the answer to the question “So what?”.  This will make your content immediately more engaging and relevant to your audience.


Audience Centered Slide Design

Think about the last several presentations that you sat through as an audience member.  How would you describe the slides?  Text heavy? Cluttered? No clear message? Full of bulleted lists?  Audiences consistently complain of “Death by PowerPoint,” which refers to the endless march of speakers through text-filled slide after text-filled slide.  The reason this is so detrimental to audiences is that our brains have a limited “bandwidth” for verbal information.  When we reach that limit, it’s called cognitive overload, and our brains stop processing the information as effectively and efficiently.  When you have a speaker talking (the speaker’s words are verbal information) and slides to read with lots of words on them (more verbal information), you are at a high risk of cognitive overload for the audience.  That is why many audiences “tune out” during presentations or report feeling exhausted after a day of listening to them.

A more effective way to approach slides for your audience is to think about making your slides do something for you that your words cannot. You are giving a talk, so the words part is mostly covered by what you are saying; it is much more powerful to make your slides primarily visual so that they convey information in a more memorable, engaging, and understandable way. This is known in the field of cognitive research as the Picture Superiority Effect.  John Medina’s excellent book Brain Rules states that “Based on research into the Picture Superiority Effect, when we read text alone, we are likely to remember only 10 percent of the information 3 days later. If that information is presented to us as text combined with a relevant image, we are likely to remember 65 percent of the information 3 days later.”

A great slide design strategy that I advocate is called the assertion-evidence design.  This slide design strategy is grounded in research (including Medina’s work mentioned above) and works beautifully for presentations of technical information. The assertion-evidence slide design is characterized by a concise, complete sentence headline (no longer than two lines) that states the main assertion (i.e., what you want the audience to know) of the slide. The body of the slide then consists of visual evidence for that take-away message (charts, graphs, images, equations, etc.). Here is an example of a traditional slide transformed to an assertion-evidence slide:

[Slide images: a traditional slide and its assertion-evidence redesign]

Having trouble banishing bullet lists? One of my favorite quick (and free!) tools for getting yourself past bulleted lists is Nancy Duarte’s Diagrammer tool.  I like this tool because it asks you what relationship you are trying to show between pieces of information and then creates a graphic to express that relationship.  Remember: the best presentations use a variety of visual evidence!  Charts, graphs, pictures, videos, diagrams, etc.  Give your audience lots of visual ways to connect with your content!

Final Thoughts

Next time you present, I encourage you to let every decision you make along the way be guided first by the needs of your audience.  Remember, the success of your audience in understanding your work is how your success as a speaker is measured! For more tips on technical talks, check out my TED Talk entitled “Talk Nerdy To Me.” For questions, comments, or to book a technical presentations workshop at your company or institution, please contact me at melissa@presentyourscience.com.

About Melissa Marshall

Melissa Marshall is on a mission: to transform how scientists and technical professionals present their work. That’s because she believes that even the best science is destined to remain undiscovered unless it’s presented in a clear and compelling way that sparks innovation and drives adoption.

For almost a decade, she’s traveled around the world to work with Fortune 100 corporations, institutions and universities, teaching the proven strategies she’s mastered through her consulting work and during her 10 years as a faculty member at Penn State University.

When you work with Melissa, you will get the practical skills and the natural confidence you need to immediately shift your “information dump”-style presentations into ones that are meaningful, engaging, and inspire people to take action. And the benefits go far beyond any single presentation; working with Melissa, your entire organization will develop a culture of successful communication, one that will help you launch products and ideas more effectively than ever before.

Melissa is also a dynamic speaker who has lectured at Harvard Medical School, the New York Academy of Sciences, and the Centers for Disease Control and Prevention (CDC). For a sneak peek, check out her TED talk, “Talk Nerdy to Me.” It’s been watched by over 1.5 million people (and counting).

Visit Melissa and learn more at www.PresentYourScience.com.

Melissa can be reached at melissa@presentyourscience.com.

tags: papers & presentations, SAS Global Forum

Transform your technical talks with an audience centered approach was published on SAS Users.

December 21, 2016
 

The report-ready SAS Environment Manager Data Mart has been an invaluable addition to SAS 9.4 for SAS administrators. The data mart tables are created and maintained by the SAS Environment Manager Service Architecture Framework and provide a source of data for out-of-the-box reports as well as custom reports that any SAS administrator can easily create. As you can imagine, the size of the tables in the data mart can grow quite large over time, so balancing the desired time span of reporting and the size of the tables on disk requires some thought. The good news: SAS 9.4 M4 has made that job even easier.

The Environment Manager Data Mart (EVDM) has always provided a configuration setting to determine how many days of resource records to keep in the data mart tables. You can see below that in a fresh SAS 9.4 M4 installation, the default setting for “Number of Days of Resource Records in Data Mart” is set to 60 days. This means that EVDM data records older than 60 days are deleted from tables whenever the data mart ETL process executes.

[Screenshot: the “Number of Days of Resource Records in Data Mart” setting in SAS Environment Manager]

The space required to house the Environment Manager Data Mart is split across three primary areas.

  • The ACM library tables contain system level information
  • The APM library tables contain audit and performance data culled from SAS logs
  • The KITS library contains miscellaneous tables created by data mart kits that collect specialty information about HTTP access, SAS data set access, and so on.

Prior to SAS 9.4M4, the ACM and APM libraries duly archived data according to the “Number of Days of Resource Records in Data Mart” setting, but the KITS library did not. For most of the KITS tables this is not such a big deal but for some deployments, the HTTPACCESS table in the KITS library can grow quite large. For administrators who have enabled the VA feed for the Service Architecture Framework, the size of the HTTPACCESS table directly impacts the time it takes to autoload the results of each refresh of the data mart, as well as the amount of memory consumed by the LASR Server used for the Environment Manager Data Mart LASR library.
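
If you want to check how much of that space the KITS tables (HTTPACCESS in particular) are actually consuming, a quick query against the SAS dictionary tables will show you. This is just a minimal sketch: the libref KITSLIB and its path are assumptions, so point the LIBNAME statement at wherever your deployment stores the KITS library.

/* Placeholder path: replace with the KITS library location in your EVDM */
libname kitslib "/path/to/EVDataMart/Kits" access=readonly;
 
/* Row counts and on-disk size for each KITS table, largest first */
proc sql;
   select memname                        label='Table',
          nobs                           label='Rows',
          filesize/1024/1024 as size_mb  format=comma12.1 label='Size (MB)'
   from dictionary.tables
   where libname = 'KITSLIB' and memtype = 'DATA'
   order by size_mb desc;
quit;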

So what is the big change for SAS 9.4 M4?

The KITS library now respects the “Number of Days of Resource Records in Data Mart” setting and removes data older than the threshold.  If you are a SAS administrator, you can now forget about having to separately manage the KITS library which should simplify space management.

SAS administrators may need to adjust the “Number of Days of Resource Records in Data Mart” setting to strike a balance between the date range requirements for reporting and the amount of disk space they have available for storing the EVDM tables.  With SAS 9.4 M4, however, administrators can rest assured that all EVDM tables will self-manage according to their wishes.

More on the Service Architecture Framework.

tags: SAS 9.4, SAS Administrators, SAS Environment Manager, SAS Professional Services

Easier Space Management for EV Data Mart Tables in 9.4M4 was published on SAS Users.

December 20, 2016
 

Joining tables with PROC FORMAT

The title of this post borrows from Stanley Kubrick’s 1964 comedy “Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.” It stars the great Peter Sellers as the titular character as well as George C. Scott and Slim Pickens. The black and white film is strange and brilliant just like Kubrick was. Similarly, as I was experiencing the issue I outline below and was told of this solution, I thought two things. The first was “what a strange but brilliant solution” and the second one was “I’ll try anything as long as it works.”   Thus, a title was born. But enough about that. Why are we here?

Problem

You want to add a couple of columns of information to your already large dataset, but each time you try to join the tables you run out of memory!  For example, you want to append latitude and longitude values from Table B to an existing list of customer phone numbers in Table A.

You’ve tried this and got nowhere fast:

proc sort data = demo.tablea;
by npa nxx;
run;
proc sort data = demo.tableb;
by npa nxx;
run;
 
data demo.aunionb;
merge demo.tablea (in=a) demo.tableb (in=b);
by npa nxx;
if a;
run;

And then you tried this and also got nowhere (albeit a little slower):

proc sql;
   create table demo.aunionb as
   select *
   from demo.tablea a
   left join demo.tableb b on (a.npa = b.npa) and (a.nxx = b.nxx);
quit;

Solution - Joining tables with PROC FORMAT

Use PROC FORMAT!

Here’s how:

First, take Table B and create character equivalents of the fields required in your join (assuming they aren’t characters already). In this example, NPA and NXX are the two fields that you are joining on. They will be your key once you concatenate them.  Next, create character equivalents of the fields that you want appended.

data work.tableb (keep = npa_nxx lat_c long_c); 
set demo.tableb; 
 
npa_c = compress(put(npa, best10.));
nxx_c = compress(put(nxx, best10.));
 
npa_nxx = catx('_',npa_c, nxx_c);
 
lat_c = compress(put(latitude, best14.3)); 
long_c = compress(put(longitude, best14.3)); 
run;

Next, make sure that you have only unique values of your key by running PROC SORT with the NODUPRECS option.

Now, create a table that will be used as the input into PROC FORMAT. In this example, you are creating a table that will contain the formats for the latitude column.

proc sort data = work.tableb noduprecs;
by npa_nxx;
 
data demo.tableb_lat_fmt(keep=fmtname type start label); 
retain fmtname 'lat_f' type 'C'; 
set work.tableb; 
 
if npa_nxx = '._.' then start = 'Other  ';
else start = npa_nxx; 
label = lat_c; 
run;
proc sort data = demo.tableb_lat_fmt;
by start;
run;

This step creates a table that includes the format name (lat_f), the format type (C), the key field (start) and its corresponding latitude value (label).  Sort this table by the ‘start’ column and then repeat this step for every column you wish to append, with each column getting its own unique format and table.
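
For completeness, here is what the matching step for the longitude column could look like. It simply mirrors the latitude step above, and it uses the names (demo.tableb_long_fmt and long_f) that the PROC FORMAT and DATA steps later in this post expect.

data demo.tableb_long_fmt(keep=fmtname type start label); 
retain fmtname 'long_f' type 'C'; 
set work.tableb; 
 
if npa_nxx = '._.' then start = 'Other  ';
else start = npa_nxx; 
label = long_c; 
run;
proc sort data = demo.tableb_long_fmt;
by start;
run;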

Now run PROC FORMAT using the CNTLIN option pointing to the tables that you just created in order to create your format.

proc format cntlin=demo.tableb_lat_fmt; 
run; 
proc format cntlin=demo.tableb_long_fmt; 
run;

Now all you have to do is run your data step to create the resultant dataset with the appended values.

data demo.aunionb (drop = npa_nxx); 
set demo.tablea; 
 
npa_nxx = catx('_',compress(put(npa,best10.)),compress(put(nxx, best10.)));
 
latitude = input(put(npa_nxx, $lat_f.), BEST.); 
longitude = input(put(npa_nxx, $long_f.), BEST.);
 
run;

This step creates three columns: npa_nxx, latitude, and longitude. Npa_nxx is the key built from the NPA and NXX values. Latitude and longitude are then populated by looking up the formatted value of npa_nxx, which is the character equivalent of the original latitude or longitude, and the INPUT function converts that character value back into a numeric field.

The result is a clever way to add columns to a dataset, much like a VLOOKUP function works in Microsoft Excel, without the hassle of running out of memory space.

Notes:

  1. The author realizes there are other, more boring ways of tackling this issue like indexing and using WHERE statements, but where’s the fun in that?
  2. This solution may not be right for you. See your doctor if you experience any of the following symptoms:  memory loss, headache, frustration, Cartesian rage, cranial-keyboard embedment or memory loss.
tags: Global Technology Practice, Joining tables, PROC FORMAT, SAS Programmers, tips & techniques

Dr. Strangeformat or: How I Learned to Stop Joining Tables and Love the PROC was published on SAS Users.

December 17, 2016
 

A multilabel format enables you to assign multiple labels to a value or a range of values. The capability to add multilabel formats was added to the FORMAT procedure in SAS® 8.2.  You assign multiple labels by specifying the MULTILABEL option in the VALUE statement of PROC FORMAT. For example, specifying the MULTILABEL option in the following VALUE statement enables the Agef format to have overlapping ranges.

value agef (multilabel)
11='11'
12='12'
13='13'
11-13='11-13';

Multilabel formats are available for use only in the MEANS, SUMMARY, TABULATE, and REPORT procedures. The code examples that follow show the creation of a simple multilabel format (using PROC FORMAT) and its use in each of these procedures.

First, a PROC FORMAT step creates a multilabel format, Agef, for the Age variable in the Sashelp.Class data set, along with a character format for the Sex variable. The NOTSORTED option is specified to indicate the preferred order of the ranges in the results.

proc format library=work;
value agef (multilabel notsorted)
11='11'
12='12'
13='13'
11-13='11-13'
14='14'
15='15'
16='16'
14-16='14-16';
value $sexf
'F'='Female'
'M'='Male';
run;

Now, the multilabel format is used in the other SAS procedures that are mentioned earlier. In PROC MEANS and PROC TABULATE, the MLF option must be specified in the CLASS statement for the Age variable. In PROC REPORT, the MLF option is specified in the DEFINE statement for Age. The PRELOADFMT and ORDER=DATA options are also specified to preserve the order as defined in the format. The PRELOADFMT option applies only to group and across variables in PROC REPORT.

proc tabulate data=sashelp.class format=8.1;
class age / mlf preloadfmt order=data;
class sex;
var height;
table age, sex*height*mean;
format age agef. sex $sexf.;
title 'PROC TABULATE';
run;
proc means data=sashelp.class nway Mean nonobs maxdec=1 completetypes;
class age / mlf preloadfmt order=data;
class sex;
var height;
format age agef. sex $sexf.;
title 'PROC MEANS';
run;
 
proc report data=sashelp.class NOWD headline completerows;
col age sex height;
define age / group mlf preloadfmt order=data format=agef.;
define sex / group format=$sexf.;
define height / mean format=8.1;
break after age / skip;
title 'PROC REPORT';
run;

The output from each of these procedures is shown below.

[Output from PROC TABULATE, PROC MEANS, and PROC REPORT]

You can use a multilabel format to facilitate the calculation of moving averages, as illustrated in the next example. This example creates a multilabel format using the CNTLIN= option in PROC FORMAT. Then, that format is used to calculate a three-month moving average in PROC SUMMARY.

data sample;  /*  Create the sample data set. */
do sasdate='01JAN2015'D to '31DEC2016'D;
x=ranuni(sasdate)*1234;
if day(sasdate)=1 then output;
end;
run;
 
proc print data=sample;
format sasdate date9.;
title 'Sample data set';
run;
 
data crfmt;  /* Create a CNTLIN data set for a multilabel format. */
keep fmtname start end label HLO;
begin='01JAN2015'D;
final='31DEC2016'D;
fmtname='my3month';
periods=intck('month',begin,final) -2;
do i=0 to periods;
end=intnx('month',final,-i,'E');
start=intnx('month',end,-2);
label=catx('-',put(start,date9.),put(end,date9.));
HLO='M';  /* M indicates "multilabel."  */
output;
end;
run;
 
proc print data=crfmt;
var fmtname start end label HLO;
format start end date9.;
title 'CNTLIN data set';
run;
 
/*  Use the CNTLIN= option to create the format.  */
proc format library=work cntlin=crfmt fmtlib;
select my3month;
title 'FMTLIB results for my3month format';
run;
 
proc summary data=sample NWAY order=data;
class sasdate / MLF;   /*  Use the MLF option.  */
var x;
output out=final (drop=_: ) mean= / autoname;
format sasdate my3month.;
run;
 
proc print data=final noobs;
title 'Three-Month Moving Averages';
run;

The example code above generates the following output:

[Output: three-month moving averages]

For additional information about the PROC FORMAT options that are used in the code examples above, as well as the options that are used in the other procedures, see the Base SAS X.X Procedures Guide for your specific SAS release. The procedures guide is available on the Base SAS Documentation web page (support.sas.com/documentation/onlinedoc/base/index.html).

Creating and Using Multilabel Formats was published on SAS Users.

December 16, 2016
 

Imagine making $50K a day out of thin air. Did you know that NASDAQ routinely processes around 10,000,000 trades a day? What if instead of rounding cents for each transaction, market makers truncated fractions of cents in the amount they owe you? Under the assumption that each transaction, on average, has half a cent that is usually rounded away, this would produce 10,000,000 x $0.005 = $50,000 and nobody would even notice it. I am not saying it's legal, but this example is just an illustration of the power of ordinary truncation.

However, sometimes it is necessary to truncate displayed numeric values to a specified number of decimal places without rounding. For example, if we need to truncate 3.1415926 to 4 decimal places without rounding, the displayed number would be 3.1415 (as compared to the rounded number, 3.1416).

If you think you can truncate numeric values by applying SAS w.d format, think again.

Try running this SAS code:

data _null_;
   x=3.1415926;
   put x= 6.4;
run;

If you expect to get x=3.1415, you will be disappointed. Surprisingly, you will get x=3.1416, which means that the SAS format does not truncate the number, but rounds it. The same is true for the DOLLARw.d and COMMAw.d formats.

After running into this problem, I thought to instead use a SAS function to truncate numbers. The TRUNC function comes to mind. Indeed, if you look up the SAS TRUNC function, you will find that it does truncate numeric values, but (surprise!) not to a specified number of decimal places; rather it truncates to a specified number of bytes, which is not the same thing for numerics. This may be useful for evaluating the precision of numeric values, but has no direct bearing on our problem of truncating numeric values to a set number of decimal places.

What turned out to be interesting is that the Excel TRUNC function does exactly what we need – it truncates values to a set number of decimal places (removes decimals) without rounding:


Truncating numbers in Excel with the TRUNC Function

In general, the technique of number truncation should be limited to reporting purposes when displayed numbers are required to appear truncated. Be careful not to apply truncation before using the numbers in calculations, as you might get some seriously inaccurate results, even worse than when you round numbers before calculations. Unless, of course, your goal is to get inaccurate results, which is quite an honorable goal in fraud detection, simulation and investigation.

I can see two possible solutions to number truncation:

Solution 1: Numeric truncation

Let’s say we need to truncate the number X.XXXXXXX to 4 decimal places, that is, get rid of all decimal digits after the 4th decimal place.

We can do it in 3 steps:

  1. Multiply our number by 10**4, effectively making the decimals part of the whole number (shifting the decimal point 4 positions to the right).
  2. Apply INT() function that truncates the decimal portion, keeping only the whole portion from the previous step.
  3. Divide the result of step 2 by 10**4, effectively restoring the order disturbed by step 1 (shifting the decimal point 4 positions to the left).

Here is SAS code implementing this algorithm:

%let d = 4; /* d must be a whole number: 0, 1, 2... */
 
data _null_;
   x = 3.1415926;
   p = 10**&d;
   y = int(x*p)/p;
   put x= / y=;
run;

If we run this code, the SAS log will show the following (expected and desired) results:

x=3.1415926
y=3.1415

WARNING: While in the SAS code example above the int() function might be substituted with the floor() function, for negative numbers the floor() function would produce incorrect results. For negative numbers, the ceil() function is the correct choice. However, the int() function does exactly what we need for both positive and negative numbers.
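
You can see the difference with a quick check on the negative case; int() simply drops the fractional digits, while floor() moves away from zero:

data _null_;
   x = -3.1415926;
   p = 10**4;
   y_int   = int(x*p)/p;     /* -3.1415: fractional digits simply dropped        */
   y_floor = floor(x*p)/p;   /* -3.1416: floor rounds away from zero here        */
   y_ceil  = ceil(x*p)/p;    /* -3.1415: ceil is the correct choice for negatives */
   put y_int= / y_floor= / y_ceil=;
run;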

Solution 2: Character truncation

Since we use truncated numbers for output only, we can solve our truncation problem by converting the numeric value into a character value, and then using character functions to get rid of the extra digits. Let’s solve the same problem: truncate the number X.XXXXXXX to keep only the first 4 decimal digits.

Using this character approach we can also do it in 3 steps:

  1. Convert the numeric value into a character value and assign it to a new character variable.
  2. Determine the position of the decimal point in the character string.
  3. Substring the initial character string to keep only 4 characters after the decimal point.

Here is SAS code implementing this algorithm:

%let d = 4; /* d must be a whole number: 0, 1, 2... */
 
data _null_;
   x = 3.1415926;
   y = put(x,best.);
   y = substr(y,1,index(y,'.')+&d);
   put x= / y=;
run;

If we run this code, the SAS log will show the following results:

x=3.1415926
y=3.1415

As you can see, these results are correct and identical to the results of numeric truncation.

Both numeric and character truncation methods work for positive and negative numbers.

User-defined functions

We can also implement the above two methods as user-defined functions, say truncn() and truncc(), using PROC FCMP:

proc fcmp outlib = sasuser.functions.truncations;
   function truncn (x,d);
      p = 10**d;
      y = int(x*p)/p;
      return(y);
   endsub;
 
   function truncc (x,d) $ 32;
      y = put(x,best.);
      y = substr(y,1,index(y,'.')+d);
      return(y);
   endsub;
run;

Then we can use those user-defined functions truncn() and truncc() as illustrated in the following SAS code sample:

options cmplib=sasuser.functions;
 
data A;
   length x n 8 c $9;
   input x;
   n = truncn(x,4);
   c = truncc(x,4);
   datalines;
3.1415926
-3.1415926
run;

This code will produce the following A dataset:

x             n          c
3.1415926     3.1415     3.1415
-3.1415926   -3.1415    -3.1415

Notice that variables x and n are of numeric type while variable c is of character type.

Questions

  1. Which of these two methods of decimal number truncation do you like more? Why?
  2. Does it make sense to use these methods as user-defined functions? Why?

 

tags: SAS Programmers

Truncating decimal numbers in SAS without rounding was published on SAS Users.

December 14, 2016
 

Last week, I attended the IALP 2016 conference (20th International Conference on Asian Language Processing) in Taiwan. After the conference, each presenter received a USB drive with all accepted papers in PDF format. So when I got back to Beijing, I began going through the papers to extend my learning. Usually, when I return from a conference, I go through all paper titles and my conference notes, then choose the most interesting articles and dive into them for details. I’ll then summarize important research discoveries into one document. This always takes me several days or more to complete.

This time, I decided to try SAS Text Analytics to help me read papers efficiently. Here’s how I did it.

My first experiment was to generate a word cloud of all papers. I used these three steps.

Step 1: Convert PDF collections into text files.

With the SAS procedure TGFilter and SAS Document Conversion Server, you can convert PDF collections into a SAS dataset. If you don’t have SAS Document Conversion Server, you can download pdftotext for free. Pdftotext converts PDF files into text only, so you need to write SAS code to import all the text files into a dataset. Moreover, if you use pdftotext, you need to check whether each PDF file was converted correctly, and it’s tedious to check the texts one by one, so I hope you look for smart ways to do this check. The TGFilter procedure has language-detection functionality, and the language of any garbage document after conversion is empty rather than 'English,' so I recommend you use TGFilter; you can then filter garbage documents out easily with a WHERE statement of language not equal to 'English.'

Step 2: Parse documents into words and get word frequencies.

Run the SAS procedure HPTMINE or TGPARSE against the document dataset with the stemming option turned on and the English stop-word list released by SAS, and you get the frequencies of all stems.

Step 3: Generate word cloud plot.

Once you have the term frequencies, you can use either SAS Visual Analytics or R to generate the word cloud plot. I like programming, so I used the SAS IML procedure to submit R scripts from SAS.

These steps generated a word cloud with the top 500 words of 66 papers. There were a total of 87 papers and 21 of them could not be converted correctly by SAS Document Conversion Server. 19 papers could not be converted correctly by pdftotext.


Figure-1 Word Cloud of Top 500 Words of 66 Papers

From Figure-1, it is easy to see that the top 10 words were: word, language, chinese, model, system, sentence, corpus, information, feature and method. IALP is an international conference and does not focus on Chinese only. However, there was a shared task at this year’s conference whose purpose was to predict traditional Chinese affective words’ valence and arousal ratings. Moreover, each team that attended the shared task was asked to submit a paper introducing their work, so “Chinese” contributed more than other languages in the word cloud.

You probably think there’s a lot of noise if we use paper bodies to do word cloud analysis, so my second experiment was to generate a word cloud of the paper abstracts through similar processing. You can view these results in Figure-2.


Figure-2 Word Cloud of Top 500 Words of 66 Paper Abstracts

The top 10 words from the paper abstracts were: word, chinese, system, language, method, paper, propose, sentiment, base and corpus. These are quite a bit different from the top 10 words extracted from the paper bodies. Characters, words and sentences are fundamental components of natural languages. For machine learning in NLP (natural language processing), an annotated corpus is the key. Without a corpus, you cannot build any model. However, annotated corpora are rare even in the big data era, and that is why so many researchers put effort into annotation.

Can we do more analyses with SAS? Of course. We may analyze keywords, references, paper influence, paper categorization, etc. I hope I have time to try these interesting analyses and share my work with you in a future blog.

The SAS scripts for the paper word cloud are below.

* Step 1: Convert PDF collections into sas dataset;
* NOTE: You should have SAS Document Conversion Server licensed;
proc tgfilter out=paper(where=(text ne '' and language eq 'English'))
   tgservice=(hostname="yourhost" port=yourport)  
   srcdir="D:\temp\pdf_collections"
   language="english"
   numchars=32000
   extlist="pdf"
   force;
run;quit;
 
 
* Step 2: Parse documents into words and get word frequencies;
* Add document id for each document;
data paper;
   set paper;
   document_id = _n_;
run;
 
proc hptmine data=paper;
   doc_id document_id;
   var text;
   parse notagging nonoungroups termwgt=none cellwgt=none
         stop=sashelp.engstop
         outparent=parent
         outterms=key
         reducef=1;
run;quit;
 
* Get stem freq data;
proc sql noprint;
   create table paper_stem_freq as
   select term as word, freq as freq
   from key
   where missing(parent) eq 1
   order by freq descending;
quit;
 
 
* Step 3: Generate word cloud plot with R;
data topstems;
   set paper_stem_freq(obs=500);
run;
 
proc iml;
call ExportDataSetToR("topstems", "d" );
 
submit / R;
library(wordcloud)
 
# sort terms by frequency
d <- d[order(d$freq, decreasing = TRUE), ]
 
# draw the word cloud of the top terms
wordcloud(d$word, d$freq, random.order = FALSE)
endsubmit;
quit;
tags: SAS R&D, SAS Text Analytics

Fun with SAS Text Analytics: A qualitative analysis of IALP papers was published on SAS Users.

December 13, 2016
 


Editor's note: Amanda Farnsworth is Head of Visual Journalism at BBC News and a featured speaker at SAS Global Forum 2017, April 2-5, 2017 in Orlando.

My days are spent trying to put the best content we can in front of our loyal, heartland audience, while reaching out to others, particularly on social media, who may never usually come to the BBC for their news.

It can sometimes be hard to reach both audiences at the same time.

But recently we hit on a format that does exactly that. We call it The Personal Relevance Calculator. We have made a whole series of these calculators on different topics, including “The Great British Class Calculator” (yes, we Brits are still obsessed with Class!) and “Will a Robot Take Your Job?”

The idea is to take a big data set that tells a story and make it personally relevant to each and every user. Readers simply enter a small amount of personal information – it could be their age or height and weight, or a postcode of where they live – and the result they get back from the calculator is unique, or appears to be unique, to them.  This result is given in a rich, visual way and is very shareable on social media.

The advantages are a much deeper engagement in the subject than we might get by writing a traditional article and they are usually very popular, getting millions of hits, likes and shares. They also appeal to the parts of the audience other BBC content doesn’t reach.

Case Study - Who Is Your Olympic Body Match?

You can find the Olympic Body Match calculator using this link:

At the BBC, we know that the Olympics provide us an opportunity to reach a part of the audience that doesn’t often think of us.  Let’s call them Main Eventers – they are people who don’t like to be left out of those water cooler conversations when a big national or international sporting event is going on.  So they want some way of engaging with a story that they often don’t know much about. Perhaps they are not big sports fans.


Enter our calculator. By putting in your height, weight, date of birth and sex, our calculator matches you with the Olympic athlete most like you. Simple but very engaging!

We took care to make the calculator a rich, visual experience with beautiful illustrations drawn by one of our designers. We also used the colourful, carnival branding that our Marketing department came up with and which was used across all BBC Rio 2016 output during the Olympic and Paralympic Games. This didn’t look like a scary sports story, but more of a fun way to be part of the buzz that surrounds the Olympics.

The Calculator Results in Detail

After putting in the four pieces of personal information, the first result showed you how your height compares with the full range of Olympic athletes in Rio.

The next page did the same with your weight, the third with your age. And finally you were shown the 3 athletes most like you.

[Images: the author’s three Olympic athlete body matches]

You may have guessed that these are images of my own Olympic Body Match – I’m not sure being most like an Estonian Wrestler is quite what I had expected!

Hitting the share button generated a box pre-populated with text that enticed the users who received the results to have a go themselves. A link to the calculator is also embedded in the tweet, along with another attractive illustration.

The data and what we did with it.

The data for this interactive was from the Olympic Data Feed, which is used by the BBC and other broadcasters to show the results of all the different events. As part of this feed, the height, weight and age of over 10,000 athletes was available for my team to repurpose.

So far, so good. But it turned out not all the data was available ahead of time. A lot of the information was collected in the days running up to the Games, as the athletes started arriving in Rio, making things a little tight for our development deadlines. To solve this problem we made some test content using figures from the 2012 London Olympics, which we swapped out later for the Rio figures. (The figures for the British track and field team, of particular interest to us, arrived just as the Games were starting.)

As the real data started to come in, we kept our eyes peeled to see who would be the tallest, shortest, heaviest and lightest athletes, as we wanted to highlight them in our graphics, to show the interesting and extreme range of body types represented at the Games.

But here we had to be careful. As with any dataset this large, there was bound to be the occasional glitch, especially when you’re looking for the outliers. Initially the dataset looked as if it contained a rower weighing an improbable 200 kilos, and a swimmer who was well over seven feet tall.

By checking back with the source, we were able to work out which outliers were incorrect, and which outliers were the right ones for us to focus on.

The shortest athlete was Brazilian gymnast Flavia Saraiva (4ft 4in); the tallest was Li Muhao a Chinese basketball player (7ft 2in).

The data was provided to us through XML feeds.  We matched our readers with the athletes using Euclidean distance. Where someone’s height and weight created more than three matches we picked the athletes whose own birthdate was closest to our reader’s as a way to break ties.
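
For readers who want to see what that matching logic looks like in code, here is a minimal SAS sketch. The table and variable names (reader, athletes, height_cm, weight_kg, birthdate) are hypothetical, and the BBC's actual implementation was not SAS; the sketch only illustrates the Euclidean-distance ranking with a birthdate tie-break described above.

/* Hypothetical data: READER holds one row for the current user,
   ATHLETES holds one row per Olympic athlete                     */
proc sql;
   create table ranked as
   select a.name,
          sqrt( (r.height_cm - a.height_cm)**2
              + (r.weight_kg  - a.weight_kg)**2 ) as distance,
          abs(r.birthdate - a.birthdate)           as age_gap
   from reader as r, athletes as a
   order by distance, age_gap;   /* nearest body match first, birthdate breaks ties */
quit;
 
/* The first few rows are the user's closest Olympic body matches */
proc print data=ranked(obs=3) noobs;
run;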

Audience Reaction

Our Body Match Calculator had 4m browsers, 5.8m page impressions and an engagement time of just over a minute.

The audience was 37% female and 63% male – using the gender people matched with as a proxy.

There was good engagement going down the page, with 66% of browsers filling in the form and getting to their results at the foot of the page.

It also did really well on social media, with this thread on reddit generating nearly 1,000 comments.

tags: big data, papers & presentations, SAS Global Forum

Making data personal: Big data made small was published on SAS Users.

December 12, 2016
 

In a previous blog about SAS Studio I briefly introduced the concept of using the Webwork library instead of the default Work. I also suggested, in the SAS Global Forum 2016 paper Deep Dive with SAS Studio into SAS Grid Manager 9.4, saving intermediate results in the Webwork library, because this special library is automatically assigned at start-up and is shared across all workspace server sessions. In the past few days, I have received some requests to expand on the properties of this library and how it is shared across different sessions. What better way to share this information than writing it up in a blog?

As always, I’d like to start with a reference to the official documentation. SAS(R) Studio 3.5: User’s Guide describes the Webwork library, along with its differences with respect to the Work library, in the section about the Interactive Mode. The main points are:

  • Webwork is the default output library in interactive mode. If you refer to a table without specifying both the libref and the table name, SAS Studio assumes it is stored in the Webwork library.
  • The Webwork library is shared between interactive mode and non-interactive mode. Any data that you create in the Webwork library in one mode can be accessed in the other mode.
  • The Work library is not shared between interactive mode and non-interactive mode. Each workspace server session has its own separate Work library, and data cannot be shared between them.
  • Any data that you save to the Work library in interactive mode cannot be accessed from the Work library in non-interactive mode. Also, you cannot view data in the Work library from the Libraries section of the navigation pane if the data was created in interactive mode.

In addition to this, we can list some additional considerations:

  • The Webwork library is shared between every workspace server session started when using parallel process flows from the Visual Programming perspective.
  • The Webwork library is not shared between different SAS Studio sessions. When using multiple SAS Studio sessions, each one has a different Webwork, just like traditional SAS Foundation sessions do not share their Work libraries.
  • The Webwork library is cleared at the end of the SAS Studio session and its content is temporary in nature, just like the Work library.

Here are the logs of the same lines of code executed in different SAS Studio sessions to show the actual path, on a Windows machine, of the Work and Webwork directories:

First SAS Studio session, non-interactive mode


Same session, interactive mode


Second SAS Studio session, non-interactive mode

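You can reproduce the same check in your own environment with a couple of %PUT statements; this is a minimal sketch that relies only on the WORK and WEBWORK librefs that SAS Studio assigns automatically.

/* Print the physical paths of the Work and Webwork libraries to the log */
%put WORK path:    %sysfunc(pathname(work));
%put WEBWORK path: %sysfunc(pathname(webwork));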

And since a picture is worth a thousand words, the following diagram depicts the relationship between SAS Studio sessions, Work libraries and Webwork libraries.

[Diagram: relationship between SAS Studio sessions, Work libraries, and Webwork libraries]

Finally, I’d like to remind you that, in distributed environments where the workspace server sessions are load balanced across multiple hosts, it is imperative to configure the Webwork library on a shared filesystem, following the instructions explained in the SAS Studio tips for SAS Grid Manager Administrators blog.

tags: SAS Grid Manager, SAS Professional Services, sas studio, SAS Studio Webwork library

SAS Studio Webwork library demystified was published on SAS Users.

December 10, 2016
 

During a recent customer visit, I was asked how to include a calculated variable within SAS Asset Performance Analytics’ (APA) Root Cause Analysis workflow. This is a simple request. Is there a simple approach to do this?

To remind you, in the APA workflow, an ETL Administrator makes a Data Mart available in the solution for the APA users. They can select variables and explore, analyze and create a model based on the columns present in the Data Mart.

But if you want to analyze a calculated column, like a difference between two variables, do you need to change the Data Mart? Yes, if you want the APA Data Selection to include this calculated column. But this takes time, and do you really need to change the Data Mart? No!

A simpler and faster approach to adding a calculated column is modifying the APA Root Cause Analysis workflow. And, this is simple!

SAS Asset Performance Analytics is highly configurable. You can easily customize analytical workflows by modifying the underlying stored processes. Let me show you how to customize an existing analysis and add a calculation step to enhance APA’s Root Cause Analysis with a calculated column.

Benefits

The main purpose of this customized analysis is to avoid requiring SAS Enterprise Guide. The users are rarely SAS experts, so asking them to switch between tools depending on which functionality is available in the APA GUI isn’t recommended. The more you can do within the APA interface via wizard-guided workflows, the easier it will be.

The second benefit is keeping the Data Mart limited to crucial variables. Instead of asking an ETL Administrator to add non-validated and/or infrequently used calculated columns to the Data Mart, let the APA user test and create meaningful tags to enhance workflows as needed. Once the APA user identifies new and meaningful calculated variables, they can easily be added to the Data Mart and made available to APA Explorations and APA Stability Monitoring. Limiting the Data Mart to critical variables ensures the data size stays optimized and is adjusted only as needed.

Root Cause Analysis Use

Using this new Root Cause Analysis is easy: instead of selecting “Root Cause Analysis,” select “Root Cause Analysis Calculated Columns” when required.


Figure 1: Stored Process Folder allows us to choose the appropriate Analysis

Clicking on “Ok” triggers the same steps as the original RCA.


Figure 2: steps triggered by the Root Cause Analysis

After specifying a data selection, the “Filter Data” step contains six new user prompts:

Figure 3: New “Filter Data” interface

These prompts allow the user to define what calculation is needed. Six calculation choices are currently available, and they can be further customized as needed. The calculation types currently available include:

  1. Difference between two variables if you want to study a gap between two tags
  2. Absolute difference of two variables if large gaps are suspicious regardless of the order
  3. Ratio of two variables if you want to control the proportionality
  4. Multiplication of two variables
  5. Summation of two variables
  6. None

By default, the Calculation option is set to “None,” to perform a classical Root Cause Analysis.

Figure 4: Dropdown list with calculation details

After choosing your variables, you can apply the variable coefficients you want to test. By default, the coefficients are set to 1 and APA performs the absolute difference Variable 1 – Variable 2. If the variables don’t have the same order of magnitude, you can apply coefficients to put the variables on the same level. By doing this, the newly created calculated variable fluctuates around 0, which is easy to interpret.

In the example below, the goal is to compute a simple absolute difference between Variable 2 (V.Four_Press_Clap) and Variable 1 (V.Four_Press_C). By default, the newly created column is named DIFF_TAGS. You can rename it to something more descriptive for your purposes; just remember that the new column name must follow standard SAS column naming conventions.

Before processing the step, check that the data is not missing during the time period you’ve selected. If it is missing for the full time period, the DIFF_TAGS variable will not be created and the output will be the same as if you’d selected “None” for the calculation prompt.

Figure 5: DIFF_TAGS variable creation corresponding to the result of Four_Press_Clap – Four_Press_C

Click “Run and Save Step.” As a result, the newly created calculated column is added to your input RCA data and is available in the “Expand and Build Statistics” step. Now you can apply minimum and maximum value bands to the calculated column just as in the original RCA workflow.

Figure 6: minimum and maximum values to consider as atypical

As a result, the calculated column is used like a standard input tag during the full Root Cause Analysis workflow.
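
Behind the scenes, all the customized stored process has to do at this point is derive the new column with a short DATA step before handing the data to the rest of the workflow. The sketch below is illustrative only, not the actual pam_rca_filterData_NEW.sas code; the dataset name rca_input and the coefficient macro variables are assumptions.

/* Illustrative sketch: compute the calculated column from the two selected tags */
%let coef1 = 1;   /* coefficient applied to Variable 1 */
%let coef2 = 1;   /* coefficient applied to Variable 2 */
 
data rca_input;
   set rca_input;
   /* absolute difference of the (optionally weighted) tags from Figure 5 */
   DIFF_TAGS = abs(&coef2 * Four_Press_Clap - &coef1 * Four_Press_C);
run;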

Things to keep in mind

If you have a specific need, SAS Asset Performance Analytics is easily customizable. In France, for example, we offer an unlimited, modular APA interface.

Creating columns within the APA user interface analysis workflow has two main benefits:

  1. The calculation is intuitive for a non SAS user
  2. You don’t need to change the data mart

Only two steps are required to implement Root Cause Analysis including Calculated Columns within APA:

  1. Copy and paste the pam_rca_filterData_NEW.sas file into the SASHome/SASFoundation/9.4/pamsrmv/sasstp directory.
  2. Using the SAS Management Console, import the RCA_CC.spk file at the following location: /Products/SAS Asset Performance Analytics/Analytical Workbench in the Folders tab.

If you have any questions about this process, feel free to add a comment below.

tags: Global Technology Practice, SAS Asset Performance Analytics

Enhancing SAS Asset Performance Analytics’ Root Cause Analysis with Calculated Columns was published on SAS Users.

December 2, 2016
 

Nearly every organization has to deal with big data, and that often means dealing with big data problems. For some organizations, especially government agencies, addressing these problems provides more than a competitive advantage, it helps them ensure public confidence in their work or meet standards mandated by law. In this blog I wanted to share with you how SAS worked with a government revenue collection agency to successfully manage their big data issues and seamlessly integrate with Hadoop and other technologies.

Hadoop Security

We all know Hadoop pretty well, and if you haven’t heard of Hadoop yet, it is about time you invest some resources in learning more about this emerging de facto standard for storage and compute. The core of Apache Hadoop consists of a storage part known as HDFS (Hadoop Distributed File System) and a processing part (called MapReduce). Hadoop splits large files into large blocks and distributes them across the nodes of a cluster.

Hadoop was initially developed to solve web-scale problems like webpage search and indexing at Yahoo. However, the potential of the platform to handle big data and analytics caught the attention of a number of industries. Since the initial use of Hadoop was to count webpages and implement algorithms like PageRank, security was never considered a major requirement, until it started getting used by enterprises across the world.

Security incidents and massive fines have become commonplace and financial institutions, in particular, are doing everything to avoid such incidents. Security should never be an afterthought and should be considered in the initial design of the system. The five core pillars of Enterprise Security are as follows:

[Figure: the five core pillars of enterprise security]

Our customer had the four core pillars covered from Administration to Auditing using tools provided by their Hadoop vendor. While there are options in the open-source community that provide data protection, in this case the organization decided to use a data security company to protect data at rest on top of Cloudera Navigator Encryption. They refer to it as “Double Encryption.”

The challenge

SAS has multiple products around the Hadoop ecosystem to provide the best support for customers. The traditional way of working with Hadoop is SAS/ACCESS, which typically pulls the data from Hadoop using Hive. However, for larger installations where data movement is a concern, SAS provides Embedded Process technology, which allows you to push SAS code inside a Hadoop cluster and run it alongside the data blocks. This is a super-efficient way to access large data sets inside Hadoop by pushing the compute to the data.
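
For context, the SAS/ACCESS route usually starts with a Hadoop LIBNAME pointing at Hive, after which Hive tables can be queried like any other SAS library. This is a hedged sketch: the server name, credentials and table name are placeholders, not details from the customer's environment.

/* Placeholder connection details: replace with your HiveServer2 host and credentials */
libname hdp hadoop server="hive-node.example.com" port=10000
        user=myuser password="XXXXXXXX" schema=default;
 
/* Query a Hive table through the access engine (table name is hypothetical) */
proc sql outobs=20;
   select *
   from hdp.trades;
quit;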

Our customer's data security vendor’s product supports access via Hive UDFs, which means you can tokenize and detokenize data when working with SAS/ACCESS Interface to Hadoop using PROC SQL and other options, essentially out of the box. In addition, new SAS language functions for tokenization and detokenization can be added to Base SAS using the security company’s API (with PROC FCMP and PROC PROTO).

However, the SAS Embedded Process has no default support for our customer's security vendor, so the SAS products that rely on the EP (SAS Code Accelerator, SAS Scoring Accelerator and the LASR-based products) cannot work with data tokenized by the vendor. This was a major challenge for our customer, who wanted to use SAS products like SAS Visual Analytics and SAS Visual Statistics on large volumes of data stored in Hadoop.

The challenge, then, was to make the SAS Embedded Process work with the data security vendor’s software to perform detokenization before passing the data to SAS procedures.

The possible solutions

We considered various solutions before agreeing on a solution that satisfies all current requirements and could be extended to meet the future needs of our customer. Let’s discuss the top two solutions and the final implementation.

Solution 1: SERDE approach

Our first approach was to create a custom Hive SERDE that wraps the data security company’s APIs. With 9.4M3 the SAS Embedded Process (EP) can read & write via SERDE APIs with some possible constraints and limitations including DS2’s SET/MERGE capabilities and potential identity credentials being passed on from SAS to the company’s APIs.


The approach had various drawbacks, the biggest of which was handling the variety of file formats. It was discarded because it would have meant a lot of rework for every new data format released by the Hadoop community. While an organization would generally standardize on a few formats for its use cases, this is nonetheless a limiting factor.

Solution 2: Use HDMD with Custom Input Formats

The second approach was to use HDMD with custom input formats. SAS HDMD supports custom input formats, which allow you to plug in your own reader. A high-level architectural diagram looks something like Figure 2. This approach works with a variety of file formats, and we have tested it with Parquet, Avro and ORC with good results. The objective is to load a dataset onto Hadoop (or use an existing dataset) and generate an HDMD file for it. We plug our custom reader into the HDMD file, and as part of the custom reader we make a number of calls to the data security company’s API. The API calls the specific protect and unprotect procedures of the security vendor to protect and/or unprotect the data, depending on the requirements, and passes the results back to the client.

[Figure 2: high-level architecture of the HDMD approach with a custom input format]

What is an Input/Custom input format BTW?

Data inside Hadoop is typically stored on HDFS (Hadoop Distributed File System). The data needs to be read from the filesystem before being processed. This is achieved using Input Format, which has the following responsibilities.

  • Compute input splits
    • Input splits represent the part of the data that will be processed by each Map phase. A unique input split is passed to the process. At the start of a Map Reduce job, input format will split the data into multiple parts based on logical record boundaries and HDFS block size. To get the input splits, the following method is called:
      • List<InputSplit> getSplits(JobContext ctx)
  • Provide a logic to read the input split
    • Each mapper gets a unique input split to process the data. Input format provides a logic to read the split, which is an implementation of the RecordReader interface. The record reader will read the split and emit <key,value> pairs as an input for each map function. The record reader is created using the following method:
      • RecordReader<K,V> createRecordReader(InputSplit is, TaskAttemptContext ctx)

All the common formats will provide a way to split the data and read records. However, if you want to read a custom data set for which data parsing isn’t available out of the box with Hadoop, you are better off writing a custom input format.

How to write a Custom Input Format?

Writing a custom input format requires Java skills (the language in which Hadoop is written). You have the option to implement the abstract methods of the InputFormat class or to extend one of the pre-existing input formats. In our case, we extended FileInputFormat and overrode a few critical methods:

  • getSplits()
  • getRecordReader()

getSplits() creates the splits from the input data, while getRecordReader() should return an instance of a Java object that can read the custom records, which in our case meant calling the security vendor’s API.

You can use one of the predefined RecordReader classes or implement your own (most likely if you are writing a custom input format). In our case, we implemented the RecordReader interface and its next() method, which is called whenever a new record is read. This is the method where your core business logic is implemented. In our case, we had to write the integration logic by looking at the data, identifying the user who has logged in (available as part of the JobConf object), and then calling the vendor’s APIs to decrypt the data. Sample code can be requested by contacting me directly.

Integrating a custom input format with SAS

Integrating a custom input format is fairly easy with SAS. SAS allows us to plug in custom formats, which are called before the data is processed via SAS Embedded Process using HDMD files.

When you generate an HDMD file using PROC HDMD, you can specify your custom input format as a part of the generated XML file. Please refer to PROC HDMD documentation.

The generated HDMD file would look something like this.

[Example: generated HDMD file referencing the custom input format class]

When loading the data from HDFS, SAS will ensure that the specified input format is called prior to any data processing taking place.

The ultimate solution

The solution was demonstrated using data from the tax authorities and included tokenization of data via Hive UDFs, detokenization of data according to the policies set on the data security appliance, and analytics using SAS Visual Analytics. Only users with permissions on the specific policy were able to view the data in the clear, while users with no permissions saw only the tokenized values. This additional security helped the enterprise protect users’ information from inadvertent access and resulted in widespread use of Big Data technologies within the Enterprise.

Summary

As you can see from the example above, SAS is open for business, and is already providing deep integration with Hadoop and other technologies using custom APIs. The sky is the limit for people willing to explore the capabilities of SAS.

tags: Global Technology Practice, Hadoop, SAS/ACCESS Interface to Hadoop

SAS integration with Hadoop - one success story was published on SAS Users.