January 4, 2017

In my last blog, I showed you how to generate a word cloud of PDF collections. Word clouds show you which terms are mentioned in your documents and the frequency with which they occur. However, word clouds cannot lay out words from a semantic or linguistic perspective. In this blog I’d like to show you how we can overcome this constraint with new methods.

Word embedding has been widely used in Natural Language Processing, where words or phrases from the vocabulary are mapped to vectors of real numbers. There are several open source tools that can be used to build word embedding models. Two of the most popular tools are word2vec and GloVe, and in my experiment I used GloVe. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
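As a side note, the pre-trained GloVe files are plain text: each line holds a word followed by its vector components (the file used later in this post was converted to tab-delimited). A minimal Python sketch of parsing that layout, using toy 2-dimensional vectors for illustration (real GloVe vectors have 50 to 300 components):

```python
# Parse GloVe-style text lines ("word v1 v2 ... vN") into a word -> vector dict.
# The two lines below are invented toy data, not real GloVe vectors.
def parse_glove_line(line):
    parts = line.rstrip().split(" ")
    return parts[0], [float(v) for v in parts[1:]]

lines = ["the 0.1 0.2", "cat -0.3 0.4"]
vectors = dict(parse_glove_line(l) for l in lines)
print(vectors["cat"])  # [-0.3, 0.4]
```

The SAS DATA step with INFILE shown below does the same job for the tab-delimited version of the file.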

Suppose you have obtained the term frequencies from documents with SAS Text Miner and downloaded a pre-trained word embedding model from the GloVe website. Next, you can extract the vectors of your terms using PROC SQL.

libname outlib 'D:\temp';

* Rank terms according to frequencies;
proc sort data=outlib.abstract_stem_freq;
   by descending freq;
run;

data ranking;
   set outlib.abstract_stem_freq;
   ranking = _n_;  /* rank implied by the descending sort */
run;

data glove;
   infile "d:\temp\glove_100d_tab.txt" dlm="09"x firstobs=2;
   input term :$100. vector1-vector100;
run;

proc sql;
   create table outlib.abstract_stem_vector as
   select glove.*, ranking, freq
   from glove, ranking
   where glove.term = ranking.word;
quit;

Now you have term vectors, and there are two ways to project 100-dimensional vector data into two dimensions. One is SAS PROC MDS (multidimensional scaling) and the other is t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. It is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.

Let’s try SAS PROC MDS first. Before running PROC MDS, you need to run PROC DISTANCE to calculate the distance or similarity of each pair of words. According to the GloVe website, the Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. Sometimes, the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human's vocabulary. I used Euclidean distance in my experiment and showed the top 10 words on the scatter plot, as seen in Figure 1.
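For intuition, both metrics are simple to compute directly. Here is a Python sketch using made-up 3-dimensional vectors (real GloVe vectors have 50 to 300 dimensions; PROC DISTANCE below computes the same Euclidean distance for every pair of terms):

```python
import math

# Euclidean distance and cosine similarity between two word vectors.
def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: semantically similar words should sit close together.
frog = [0.5, 0.1, 0.4]
toad = [0.6, 0.1, 0.3]
print(euclidean(frog, toad))  # small distance  -> similar words
print(cosine(frog, toad))     # close to 1      -> similar direction
```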

Figure 1: Word scatter plot with SAS

proc distance data=outlib.abstract_stem_vector method=EUCLID out=distances;
   var interval(vector1-vector100);
   id term;
run;

ods graphics off;
proc mds data=distances level=absolute out=outdim;
   id term;
run;

data top10;
   set outlib.abstract_stem_vector;
   if ranking le 10 then label=term;
   else label='';
   drop ranking freq;
run;

proc sql;
   create table mds_plot as
   select outdim.*, label
   from outdim
   left join top10
   on outdim.term = top10.term;
quit;

ods graphics on;
proc sgplot data = mds_plot;
   scatter x=dim1 y=dim2
       / datalabel = label
         markerattrs=(symbol=plus size=5)
         datalabelattrs=(family="Arial" size=10pt color=red);
run;

SAS does not have a t-SNE implementation, so I used PROC IML to call the Rtsne library in R. There are two R packages that can be used for a t-SNE plot: tsne and Rtsne; Rtsne is reported to be faster. Just as I did with the SAS MDS plot, I showed the top 10 words only, but their font sizes vary according to their frequencies in the documents.

data top10;
   length label $ 20;
   length color 3;
   length font_size 8;
   set outlib.abstract_stem_vector;
   if ranking le 10 then label=term;
   else label='+';
   if ranking le 10 then color=1;
   else color=2;
   font_size = max(freq/25, 0.5);
   drop ranking freq;
run;

proc iml;
   call ExportDataSetToR("top10", "vectors");

   submit /R;
      set.seed(1) # for reproducibility
      # Sketch of the remaining R code (the original listing was truncated):
      library(Rtsne)
      tsne <- Rtsne(as.matrix(vectors[, grep("vector", names(vectors))]))
      plot(tsne$Y, pch='+', col=vectors$color)
      text(tsne$Y, labels=vectors$label, cex=vectors$font_size)
   endsubmit;
quit;

If you compare the two figures, you may feel that the MDS plot is more symmetric than the t-SNE plot, but the t-SNE plot seems more reasonable from a linguistic/semantic perspective. In his blog, Christopher Olah (Colah) explores and compares various dimensionality reduction techniques on the well-known computer vision dataset MNIST. I totally agree with his opinions below.

It’s easy to slip into a mindset of thinking one of these techniques is better than others, but I think they’re all complementary. There’s no way to map high-dimensional data into low dimensions and preserve all the structure. So, an approach must make trade-offs, sacrificing one property to preserve another. PCA tries to preserve linear structure, MDS tries to preserve global geometry, and t-SNE tries to preserve topology (neighborhood structure).

To learn more I would encourage you to read the following articles.

Word scatter plot with SAS was published on SAS Users.

December 27, 2016

We have seen in a previous post of this series how to configure SAS Studio to better manage user preferences in SAS Grid environments. There are additional settings that an administrator can leverage to properly configure a multi-user environment; as you may imagine, these options deserve special considerations when SAS Studio is deployed in SAS Grid environments.

SAS Studio R&D and product management often collect customer feedback and suggestions, especially during events such as SAS Global Forum. We received several requests for SAS Studio to provide administrators with the ability to globally set various options. The goal is to eliminate the need to have all users define them in their user preferences or elsewhere in the application. To support these requests, SAS Studio 3.5 introduced a new configuration option, webdms.globalSettings. This setting specifies the location of a directory containing XML files used to define these global options.

Tip #1

How can I manage this option?

The procedure is the same as we have already seen for the webdms.studioDataParentDirectory property: both are specified in the configuration file in the SAS Studio configuration directory. Refer to the previous blog for additional details, including considerations for environments with clustered mid-tiers.

Tip #2

How do I configure this option?
By default, this option points to the directory path !SASROOT/GlobalStudioSettings. SASROOT translates to the directory where SAS Foundation binaries are installed, such as /opt/sas/sashome/SASFoundation/9.4 on Unix or C:/Program Files/SASHome/SASFoundation/9.4/ on Windows. It is possible to change the webdms.globalSettings property to point to any chosen directory.

The SAS Studio 3.6 documentation provides an additional key detail: in a multi-machine environment, the GlobalStudioSettings directory must be on the machine that hosts the workspace servers used by SAS Studio. In grid environments, this means that this location should be on shared storage accessible by every node.

Tip #3

Configuring Global Folder Shortcuts


In SAS Studio, end users can create folder shortcuts from the Files and Folders section in the navigation pane. An administrator might want to create global shortcuts for all the users, so that each user does not have to create these shortcuts manually. This is achieved by creating a file called shortcuts.xml in the location specified by webdms.globalSettings, as detailed in the SAS Studio documentation.

Tip #4

Configuring Global Repositories

SAS Studio repositories are an easy way to share tasks and snippets between users. An administrator may want to configure one or multiple centralized repositories and make them available to everyone. SAS Studio users could add these repositories through their Preferences window, but it’s easier to create global repositories that are automatically available from the Tasks and Utilities and Snippets sections. Again, this is achieved by creating a file called repositories.xml in the location specified by webdms.globalSettings, as detailed in the SAS Studio documentation.

tags: SAS Administrators, SAS Grid Manager, SAS Professional Services, sas studio

More SAS Studio Tips for SAS Grid Manager Administrators: Global Settings was published on SAS Users.

December 22, 2016

Editor's note: The following post is from Melissa Marshall, Principal at Melissa Marshall Consulting LLC. Melissa is a featured speaker at SAS Global Forum 2017, and on a mission to transform how scientists and technical professionals present their work.

Learn more about Melissa.

Think back to the last technical talk you were an audience member for. What did you think about that talk? Was it engaging and interesting? Boring and overwhelming?  Perhaps it was a topic that was important to you, but it was presented in a way that made it difficult to engage with the content. As an expert in scientific presentations, I often observe a significant “disconnect” between the way a speaker crafts a presentation and the needs of the audience. It is my belief that the way to bridge this gap is for you, as a technical presenter, to become an audience centered speaker vs. a speaker centered speaker.


Here I will provide some quick tips on how to transform your content and slides using your new audience centered speaking approach!

Audience Centered vs. Speaker Centered

The default setting for most presenters is that they are speaker centered—meaning that they make choices in their presentation because it is what works primarily for themselves as a speaker. Examples include:

  • spending a lot of time speaking about the area of the topic that gave you the most difficulty or that you spent the most time working on;
  • using terms that are familiar to you but are jargon to the audience;
  • putting most of the words you want to say on your slides to remind you what to say during the talk, so your slides are basically your speaker notes;
  • standing behind a podium and disconnecting yourself physically from your audience.

These choices are common in presentations, but they do not set you up for success. They are a key reason why many presentations of technical information fail.

A critical insight is to realize that your success as a speaker depends entirely upon your ability to make your audience successful.  You don’t get to decide that you gave a great talk (even if no one understood it)!  That’s because presentations, by their very nature, are always made for an audience.  You need something from your audience—that is why you are giving a talk!  So, it is time to get serious about making your audience successful (so you can be too!).  I might define “audience success” as: your audience understands and views your subject in the way you wanted them to.  Strategically, if you desire to be a successful speaker, then the best thing you do is go “all in” on making your audience successful!

Audience Centered Content

To make your content more audience centered, you can ask yourself 4 critical questions ahead of time about your audience:

  • Who are they?
  • What do they know?
  • Why are they here?
  • What biases do they have?


The answers to these questions will guide how you begin to focus your content. Additionally, as a presenter of technical information, one of the most important questions you need to answer along the way, at many stages in your presentation, is “So what?”.  Too often presenters share complex technical information or findings, but they do not make the direct connection to the audience of how that information is relevant or important to the big picture or overall message.  Remind yourself each time you share a technical finding to also follow up that information with the answer to the question “So what?”.  This will make your content immediately more engaging and relevant to your audience.


Audience Centered Slide Design

Think about the last several presentations that you sat through as an audience member.  How would you describe the slides?  Text heavy? Cluttered? No clear message? Full of bulleted lists?  Audiences consistently complain of “Death by PowerPoint”, which refers to the endless march of speakers through text filled slide after text filled slide.  The reason this is so detrimental to audiences is that our brains have a limited “bandwidth” for verbal information.  When we reach that limit, it’s called cognitive overload and our brains stop processing the information as effectively and efficiently.  When you have a speaker talking (the speaker’s words are verbal information) and then you have slides to read with lots of words on them (also more verbal information), you are at a high risk of cognitive overload for the audience.  Therefore, many audiences “tune out” during presentations or report feeling exhausted after a day of listening to presentations.  This is a result of cognitive overload.

A more effective way to approach slides for your audience is to think about making your slides do something for you that your words cannot. You are giving a talk, so the words part is mostly covered by what you are saying…it is much more powerful to make your slides primarily visual so that they convey information in a more memorable, engaging, and understandable way. This is known in the field of cognitive research as the Picture Superiority Effect.  John Medina’s excellent book Brain Rules states that “Based on research into the Picture Superiority Effect, when we read text alone, we are likely to remember only 10 percent of the information 3 days later. If that information is presented to us as text combined with a relevant image, we are likely to remember 65 percent of the information 3 days later.”

A great slide design strategy that I advocate for is called the assertion-evidence design.  This slide design strategy is grounded in research (including Medina’s, mentioned above) and works beautifully for presentations of technical information. The assertion-evidence slide design is characterized by a concise, complete sentence headline (no longer than 2 lines) that states the main assertion (i.e., what you want the audience to know) of the slide. The body of the slide then consists of visual evidence for that take-away message (charts, graphs, images, equations, etc.). Here is an example of a traditional slide transformed to an assertion-evidence slide:



Having trouble banishing bullet lists? One of my favorite quick (and free!) tools for getting yourself past bulleted lists is Nancy Duarte’s Diagrammer tool.  I like this tool because it asks you what relationship you are trying to show between pieces of information and creates a graphic to show that relationship.  Remember: the best presentations use a variety of visual evidence!  Charts, graphs, pictures, videos, diagrams, etc.  Give your audience lots of visual ways to connect with your content!

Final Thoughts

Next time you present, I encourage you to let every decision you make along the way be guided first by the needs of your audience.  Remember, the success of your audience in understanding your work is how your success as a speaker is measured! For more tips on technical talks, check out my TED Talk entitled “Talk Nerdy To Me.” For questions, comments, or to book a technical presentations workshop at your company or institution, please contact me at

About Melissa Marshall

Melissa Marshall is on a mission: to transform how scientists and technical professionals present their work. That’s because she believes that even the best science is destined to remain undiscovered unless it’s presented in a clear and compelling way that sparks innovation and drives adoption.

For almost a decade, she’s traveled around the world to work with Fortune 100 corporations, institutions and universities, teaching the proven strategies she’s mastered through her consulting work and during her 10 years as a faculty member at Penn State University.

When you work with Melissa, you will get the practical skills and the natural confidence you need to immediately shift your “information dump”-style presentations into ones that are meaningful, engaging, and inspire people to take action. And the benefits go far beyond any single presentation; working with Melissa, your entire organization will develop a culture of successful communication, one that will help you launch products and ideas more effectively than ever before.

Melissa is also a dynamic speaker who has lectured at Harvard Medical School, the New York Academy of Sciences, and the Centers for Disease Control and Prevention (CDC). For a sneak peek, check out her TED talk, “Talk Nerdy to Me.” It’s been watched by over 1.5 million people (and counting).

Visit Melissa and learn more at

Melissa can be reached at

tags: papers & presentations, SAS Global Forum

Transform your technical talks with an audience centered approach was published on SAS Users.

December 21, 2016

The report-ready SAS Environment Manager Data Mart has been an invaluable addition to SAS 9.4 for SAS administrators. The data mart tables are created and maintained by the SAS Environment Manager Service Architecture Framework and provide a source of data for out-of-the box reports as well as custom reports that any SAS administrator can easily create. As you can imagine, the size of the tables in the data mart can grow quite large over time so balancing the desired time span of reporting and the size of the tables on disk requires some thought. The good news: SAS 9.4 M4 has made that job even easier.

The Environment Manager Data Mart (EVDM) has always provided a configuration setting to determine how many days of resource records to keep in the data mart tables. You can see below that in a fresh SAS 9.4 M4 installation, the default setting for “Number of Days of Resource Records in Data Mart” is set to 60 days. This means that EVDM data records older than 60 days are deleted from tables whenever the data mart ETL process executes.

EV Data Mart Tables in 9.4M4

The space required to house the Environment Manager Data Mart is split across three primary areas.

  • The ACM library tables contain system level information
  • The APM library tables contain audit and performance data culled from SAS logs
  • The KITS library tables contain miscellaneous tables created by data mart kits that collect specialty information about HTTP access, SAS data set access, and so on.

Prior to SAS 9.4M4, the ACM and APM libraries duly archived data according to the “Number of Days of Resource Records in Data Mart” setting, but the KITS library did not. For most of the KITS tables this is not such a big deal but for some deployments, the HTTPACCESS table in the KITS library can grow quite large. For administrators who have enabled the VA feed for the Service Architecture Framework, the size of the HTTPACCESS table directly impacts the time it takes to autoload the results of each refresh of the data mart, as well as the amount of memory consumed by the LASR Server used for the Environment Manager Data Mart LASR library.

So what is the big change for SAS 9.4 M4?

The KITS library now respects the “Number of Days of Resource Records in Data Mart” setting and removes data older than the threshold.  If you are a SAS administrator, you can now forget about having to separately manage the KITS library which should simplify space management.

SAS administrators may need to adjust the “Number of Days of Resource Records in Data Mart” setting to strike a balance between the date range requirements for reporting and the amount of disk space they have available for storing the EVDM tables.  With SAS 9.4 M4, however, administrators can rest assured that all EVDM tables will self-manage according to their wishes.

More on the Service Architecture Framework.

tags: SAS 9.4, SAS Administrators, SAS Environment Manager, SAS Professional Services

Easier Space Management for EV Data Mart Tables in 9.4M4 was published on SAS Users.

December 20, 2016

Joining tables with PROC FORMAT

The title of this post borrows from Stanley Kubrick’s 1964 comedy “Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.” It stars the great Peter Sellers as the titular character as well as George C. Scott and Slim Pickens. The black and white film is strange and brilliant just like Kubrick was. Similarly, as I was experiencing the issue I outline below and was told of this solution, I thought two things. The first was “what a strange but brilliant solution” and the second one was “I’ll try anything as long as it works.”   Thus, a title was born. But enough about that. Why are we here?


You want to add a couple of columns of information to your already large dataset, but each time you try to join the tables you run out of memory!  For example, you want to append latitude and longitude values from Table B to an existing list of customer phone numbers in Table A.

You’ve tried this and got nowhere fast:

proc sort data = demo.tablea;
   by npa nxx;
run;
proc sort data = demo.tableb;
   by npa nxx;
run;
data demo.aunionb;
   merge demo.tablea (in=a) demo.tableb (in=b);
   by npa nxx;
   if a;
run;

And then you tried this and also got nowhere (albeit a little slower):

proc sql;
   create table demo.aunionb as
   select *
   from demo.tablea a
   left join demo.tableb b
      on (a.npa = b.npa) and (a.nxx = b.nxx);
quit;

Solution - Joining tables with PROC FORMAT


Here’s how:

First, take Table B and create character equivalents of the fields required in your join (assuming they aren’t characters already). In this example, NPA and NXX are the two fields that you are joining on. They will be your key once you concatenate them.  Next, create character equivalents of the fields that you want appended.

data work.tableb (keep = npa_nxx lat_c long_c);  /* keep the key and the values to append */
   set demo.tableb;
   npa_c = compress(put(npa, best10.));
   nxx_c = compress(put(nxx, best10.));
   npa_nxx = catx('_', npa_c, nxx_c);
   lat_c = compress(put(latitude, best14.3));
   long_c = compress(put(longitude, best14.3));
run;

Next, make sure that you have only unique values of your key. Use PROC SORT with the NODUPRECS option turned on.

Now, create a table that will be used as the input into PROC FORMAT. In this example, you are creating a table that will contain the formats for the latitude column.

proc sort data = work.tableb noduprecs;
   by npa_nxx;
run;

data demo.tableb_lat_fmt (keep = fmtname type start label);
   retain fmtname 'lat_f' type 'C';
   set work.tableb;
   if npa_nxx = '._.' then start = 'Other  ';
   else start = npa_nxx;
   label = lat_c;
run;

proc sort data = demo.tableb_lat_fmt;
   by start;
run;

This step creates a table that includes the format name (lat_f), the format type (C), the key field (start) and its corresponding latitude value (label).  Sort this table by the ‘start’ column and then repeat this step for every column you wish to append, with each column getting its own unique format and table.

Now run PROC FORMAT using the CNTLIN option pointing to the tables that you just created in order to create your format.

proc format cntlin=demo.tableb_lat_fmt;
run;
proc format cntlin=demo.tableb_long_fmt;
run;

Now all you have to do is run your data step to create the resultant dataset with the appended values.

data demo.aunionb (drop = npa_nxx);
   set demo.tablea;
   npa_nxx = catx('_', compress(put(npa, best10.)), compress(put(nxx, best10.)));
   latitude = input(put(npa_nxx, $lat_f.), best.);
   longitude = input(put(npa_nxx, $long_f.), best.);
run;

This step creates 3 columns: npa_nxx, latitude, and longitude. Npa_nxx is the key built from the NPA and NXX values. Latitude and longitude are then populated with the formatted value of npa_nxx, which in this case is the character equivalent of the original latitude or longitude; the INPUT function then converts that value back into a numeric field.

The result is a clever way to add columns to a dataset, much like a VLOOKUP function works in Microsoft Excel, without the hassle of running out of memory space.
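Conceptually, the format acts as an in-memory lookup table keyed on the concatenated NPA_NXX string, which is why no sort-merge (and no blow-up in memory) is needed. For illustration only, here is the equivalent logic as a Python dictionary lookup; the table contents and field names are invented to mirror the SAS example:

```python
# Table B (small): the lookup side. Table A (large): rows to enrich.
table_b = [
    {"npa": 919, "nxx": 531, "latitude": 35.82, "longitude": -78.75},
    {"npa": 212, "nxx": 555, "latitude": 40.71, "longitude": -74.01},
]
table_a = [{"npa": 919, "nxx": 531, "phone": "919-531-0001"}]

# Build the "format": concatenated key -> (lat, long), like lat_f/long_f.
lookup = {f"{b['npa']}_{b['nxx']}": (b["latitude"], b["longitude"])
          for b in table_b}

# Single pass over the large table, no sorting required.
for row in table_a:
    key = f"{row['npa']}_{row['nxx']}"
    row["latitude"], row["longitude"] = lookup.get(key, (None, None))

print(table_a[0]["latitude"])  # 35.82
```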


  1. The author realizes there are other, more boring ways of tackling this issue like indexing and using WHERE statements, but where’s the fun in that?
  2. This solution may not be right for you. See your doctor if you experience any of the following symptoms:  memory loss, headache, frustration, Cartesian rage, cranial-keyboard embedment or memory loss.
tags: Global Technology Practice, Joining tables, PROC FORMAT, SAS Programmers, tips & techniques

Dr. Strangeformat or: How I Learned to Stop Joining Tables and Love the PROC was published on SAS Users.

December 17, 2016

A multilabel format enables you to assign multiple labels to a value or a range of values. The capability to add multilabel formats was added to the FORMAT procedure in SAS® 8.2.  You assign multiple labels by specifying the MULTILABEL option in the VALUE statement of PROC FORMAT. For example, specifying the MULTILABEL option in the following VALUE statement enables the Agef format to have overlapping ranges.

value agef (multilabel)
   /* illustrative overlapping ranges */
   11-13 = 'Pre-teen'
   13-16 = 'Teenager';

Multilabel formats are available for use only in the MEANS, SUMMARY, TABULATE, and REPORT procedures. The code examples that follow show the creation of a simple multilabel format (using PROC FORMAT) and its use in each of these procedures.

First, a PROC FORMAT step creates a multilabel format for the Agef variable in the Sashelp.Class data set, along with a character format for the Sex variable. The NOTSORTED option is specified to indicate the preferred order of the ranges in the results.

proc format library=work;
   value agef (multilabel notsorted)
      /* illustrative overlapping ranges; the original ranges were not shown */
      11-13 = 'Pre-teen'
      13-16 = 'Teenager'
      11-16 = 'All ages';
   value $sexf
      'F' = 'Female'
      'M' = 'Male';
run;

Now, the multilabel format is used in the other SAS procedures that are mentioned earlier. In PROC MEANS and PROC TABULATE, the MLF option must be specified in the CLASS statement for the Age variable. In PROC REPORT, the MLF option is specified in the DEFINE statement for Age. The PRELOADFMT and ORDER=DATA options are also specified to preserve the order as defined in the format. The PRELOADFMT option applies only to group and across variables in PROC REPORT.

proc tabulate data=sashelp.class format=8.1;
   class age / mlf preloadfmt order=data;
   class sex;
   var height;
   table age, sex*height*mean;
   format age agef. sex $sexf.;
run;

proc means data=sashelp.class nway mean nonobs maxdec=1 completetypes;
   class age / mlf preloadfmt order=data;
   class sex;
   var height;
   format age agef. sex $sexf.;
   title 'PROC MEANS';
run;

proc report data=sashelp.class nowd headline completerows;
   col age sex height;
   define age / group mlf preloadfmt order=data format=agef.;
   define sex / group format=$sexf.;
   define height / mean format=8.1;
   break after age / skip;
   title 'PROC REPORT';
run;

The output from each of these procedures is shown below.


You can use a multilabel format to facilitate the calculation of moving averages, as illustrated in the next example. This example creates a multilabel format using the CNTLIN= option in PROC FORMAT. Then, that format is used to calculate a three-month moving average in PROC SUMMARY.
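Before diving into the SAS code, the idea can be sketched in plain terms: each month belongs to up to three overlapping 3-month groups, and the mean of each group is one point of the moving average. A Python sketch of the same aggregation, with toy monthly values for illustration:

```python
from statistics import mean

# Toy monthly values, Jan..Jun. Each month contributes to the 3-month
# windows starting at positions i-2, i-1, and i -- exactly what the
# overlapping multilabel ranges do in the SAS example.
monthly = [10, 20, 30, 40, 50, 60]

windows = {}
for i, value in enumerate(monthly):
    for start in (i - 2, i - 1, i):
        if 0 <= start <= len(monthly) - 3:   # only complete 3-month windows
            windows.setdefault(start, []).append(value)

moving_avg = [mean(windows[s]) for s in sorted(windows)]
print(moving_avg)  # the four 3-month moving averages: 20, 30, 40, 50
```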

data sample;  /* Create the sample data set. */
   do sasdate='01JAN2015'D to '31DEC2016'D;
      x = ranuni(123);  /* example analysis variable; the original value was not shown */
      if day(sasdate)=1 then output;
   end;
run;

proc print data=sample;
   format sasdate date9.;
   title 'Sample data set';
run;

data crfmt;  /* Create a CNTLIN data set for a multilabel format. */
   keep fmtname start end label HLO;
   retain fmtname 'my3month';
   begin='01JAN2015'D;  /* date range assumed to match the sample data */
   final='31DEC2016'D;
   periods=intck('month',begin,final) - 2;
   do i=0 to periods;
      start=intnx('month',begin,i);
      end=intnx('month',begin,i+2,'E');
      label=catx(' - ',put(start,monyy7.),put(end,monyy7.));
      HLO='M';  /* M indicates "multilabel."  */
      output;
   end;
run;

proc print data=crfmt;
   var fmtname start end label HLO;
   format start end date9.;
   title 'CNTLIN data set';
run;

/*  Use the CNTLIN= option to create the format.  */
proc format library=work cntlin=crfmt fmtlib;
   select my3month;
   title 'FMTLIB results for my3month format';
run;

proc summary data=sample NWAY order=data;
   class sasdate / MLF;   /*  Use the MLF option.  */
   var x;
   output out=final (drop=_:) mean= / autoname;
   format sasdate my3month.;
run;

proc print data=final noobs;
   title 'Three-Month Moving Averages';
run;

The example code above generates the following output:


For additional information about the PROC FORMAT options that are used in the code examples above, as well as the options that are used in the other procedures, see the Base SAS X.X Procedures Guide for your specific SAS release. The procedures guide is available on the Base SAS Documentation web page.

Creating and Using Multilabel Formats was published on SAS Users.

December 16, 2016

Imagine making $50K a day out of thin air. Did you know that NASDAQ routinely processes around 10,000,000 trades a day? What if instead of rounding cents for each transaction, market makers truncated fractions of cents in the amount they owe you? Under the assumption that each transaction, on average, has half a cent that is usually rounded away, this would produce 10,000,000 x $0.005 = $50,000 and nobody would even notice it. I am not saying it's legal, but this example is just an illustration of the power of ordinary truncation.

However, sometimes it is necessary to truncate displayed numeric values to a specified number of decimal places without rounding. For example, if we need to truncate 3.1415926 to 4 decimal places without rounding, the displayed number would be 3.1415 (as compared to the rounded number, 3.1416).

If you think you can truncate numeric values by applying the SAS w.d format, think again.

Try running this SAS code:

data _null_;
   x = 3.1415926;
   put x= 6.4;
run;

If you expect to get x=3.1415, you will be disappointed. Surprisingly, you will get x=3.1416, which means that the SAS format does not truncate the number, but rounds it. The same is true for the DOLLARw.d and COMMAw.d formats.

After running into this problem, I thought to instead use a SAS function to truncate numbers. The TRUNC function comes to mind. Indeed, if you look up the SAS TRUNC function, you will find that it does truncate numeric values, but (surprise!) not to a specified number of decimal places; rather it truncates to a specified number of bytes, which is not the same thing for numerics. This may be useful for evaluating the precision of numeric values, but has no direct bearing on our problem of truncating numeric values to a set number of decimal places.
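To see what truncating to a number of bytes actually means for a floating-point value, here is a small Python experiment that keeps only the first n bytes of the 8-byte IEEE-754 double and zeroes the rest — conceptually what a byte-level truncation like TRUNC does. This is a cross-language illustration, not SAS code:

```python
import struct

# Keep the first n bytes of an 8-byte big-endian double, zero the remainder.
def trunc_bytes(x, n):
    raw = struct.pack(">d", x)                  # 8-byte IEEE-754 double
    kept = raw[:n] + b"\x00" * (8 - n)          # zero the trailing bytes
    return struct.unpack(">d", kept)[0]

y = trunc_bytes(3.1415926, 4)
print(y)  # a value just below 3.1415926 -- NOT the 4-decimal truncation 3.1415
```

The result is a nearby, slightly smaller floating-point number, useful for studying numeric precision but clearly not a truncation to 4 decimal places.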

What turned out to be interesting is that the Excel TRUNC function does exactly what we need – it truncates values to a set number of decimal places (removes decimals) without rounding:


Truncating numbers in Excel with the TRUNC Function

In general, the technique of number truncation should be limited to reporting purposes when displayed numbers are required to appear truncated. Be careful not to apply truncation before using the numbers in calculations, as you might get some seriously inaccurate results, even worse than when you round numbers before calculations. Unless, of course, your goal is to get inaccurate results, which is quite an honorable goal in fraud detection, simulation and investigation.

I can see two possible solutions to number truncation:

Solution 1: Numeric truncation

Let’s say we need to truncate the following number X.XXXXXXX, keeping only the first 4 decimal digits (that is, getting rid of all decimal digits after the 4th decimal place).

We can do it in 3 steps:

  1. Multiply our number by 10**4, effectively making the decimals part of a whole number (shifting the decimal point 4 positions to the right).
  2. Apply the INT() function, which truncates the decimal portion, keeping only the whole portion from the previous step.
  3. Divide the result of step 2 by 10**4, effectively restoring the order disturbed by step 1 (shifting the decimal point 4 positions to the left).

Here is SAS code implementing this algorithm:

%let d = 4; /* d must be a whole number: 0, 1, 2... */
data _null_;
   x = 3.1415926;
   p = 10**&d;
   y = int(x*p)/p;
   put x= / y=;
run;

If we run this code, the SAS log will show the following (expected and desired) results:


WARNING: While it may seem that the int() function in the SAS code example above could be substituted with the floor() function, the floor() function would produce incorrect results for negative numbers. For negative numbers, the ceil() function would be the correct choice. The int() function, however, does exactly what we need for both positive and negative numbers.
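This behavior is easy to verify outside SAS as well: Python's math.trunc and math.floor behave like the SAS int() and floor() functions (a cross-language sanity check, not part of the original example):

```python
import math

x, p = -3.1415926, 10**4

# trunc (like SAS int) drops the fraction toward zero: correct for negatives.
print(math.trunc(x * p) / p)   # -3.1415

# floor rounds toward negative infinity: wrong for negatives.
print(math.floor(x * p) / p)   # -3.1416
```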

Solution 2: Character truncation

Since we use truncated numbers for output only, we can solve our truncation problem by converting the numeric value into a character string and then using character functions to get rid of the extra digits. Let’s solve the same problem: truncate the number 3.1415926, keeping only the first four decimal digits.

Using this character approach we can also do it in 3 steps:

  1. Convert the numeric value into a character value and assign it to a new character variable.
  2. Determine the position of the decimal point in the character string.
  3. Substring the initial character string to keep only the first 4 characters after the decimal point.

Here is SAS code implementing this algorithm:

%let d = 4; /* d must be a whole number: 0, 1, 2... */
data _null_;
   x = 3.1415926;
   y = put(x, best.);                    /* numeric-to-character conversion */
   y = substr(y, 1, index(y,'.') + &d);  /* keep &d characters after the decimal point */
   put x= / y=;
run;

If we run this code, the SAS log will show the following results:


As you can see, these results are correct and identical to the results of numeric truncation.

Both numeric and character truncation methods work for positive and negative numbers.
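That claim is easy to cross-check. Here is a quick sketch in Python (not SAS); the string method mirrors the character truncation above, assuming the number's text form contains a plain decimal point:

```python
def trunc_numeric(x, d):
    """Multiply, chop toward zero, divide back (mirrors int(x*p)/p in SAS)."""
    p = 10 ** d
    return int(x * p) / p

def trunc_string(x, d):
    """Convert to text, cut everything after the d-th decimal digit,
    then convert back for comparison (mirrors put/index/substr in SAS)."""
    s = repr(float(x))
    return float(s[: s.index('.') + d + 1])

# Both methods agree for positive and negative inputs
for x in (3.1415926, -3.1415926):
    assert trunc_numeric(x, 4) == trunc_string(x, 4)
```

The assertion passes for both signs because `int()` truncates toward zero, and string slicing simply discards trailing characters regardless of sign.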

User-defined functions

We can also implement the above two methods as user-defined functions, say truncn() and truncc(), using PROC FCMP:

proc fcmp outlib = sasuser.functions.truncations;
   function truncn(x, d);
      p = 10**d;
      y = int(x*p)/p;
      return(y);
   endsub;
   function truncc(x, d) $ 32;  /* character-returning function */
      length y $ 32;
      y = put(x, best.);
      y = substr(y, 1, index(y,'.') + d);
      return(y);
   endsub;
run;

Then we can use those user-defined functions truncn() and truncc() as illustrated in the following SAS code sample:

options cmplib=sasuser.functions;
data A;
   length x n 8 c $9;
   input x;
   n = truncn(x,4);
   c = truncc(x,4);
datalines;  /* sample input values */
3.1415926
-3.1415926
;
run;

This code will produce the following A dataset:


Notice that variables x and n are of numeric type while variable c is of character type.


  1. Which of these two methods of decimal number truncation do you like more? Why?
  2. Does it make sense to use these methods as user-defined functions? Why?


tags: SAS Programmers

Truncating decimal numbers in SAS without rounding was published on SAS Users.

December 14, 2016

Fun with Text Analytics

Last week, I attended the IALP 2016 conference (20th International Conference on Asian Language Processing) in Taiwan. After the conference, each presenter received a USB drive with all accepted papers in PDF format. So when I got back to Beijing, I began going through the papers to extend my learning. Usually, when I return from a conference, I go through all the paper titles and my conference notes, then choose the most interesting articles and dive into them for details. I’ll then summarize the important research discoveries into one document. This always takes me several days or more to complete.

This time, I decided to try SAS Text Analytics to help me read papers efficiently. Here’s how I did it.

My first experiment was to generate a word cloud of all papers. I used these three steps.

Step 1: Convert PDF collections into text files.

With the SAS procedure TGFilter and SAS Document Conversion Server, you can convert PDF collections into a SAS dataset. If you don’t have SAS Document Conversion Server, you can download pdftotext for free. However, pdftotext converts PDF files into plain text only, so you need to write SAS code to import all the text files into a dataset. Moreover, if you use pdftotext, you need to check whether each PDF file was converted correctly. Checking texts one by one is tedious, so I hope you look for smarter ways to do this check. The TGFilter procedure has language-detection functionality, and the detected language of any garbage document after conversion is empty rather than ‘English,’ so I recommend you use TGFilter: you can then filter garbage documents out easily with a WHERE statement on language not equal to ‘English.’

Step 2: Parse documents into words and get word frequencies.

Run the SAS procedure HPTMINE or TGPARSE against the document dataset, with the stemming option turned on and the English stop-word list released by SAS, and you will get the frequencies of all stems.
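The SAS procedures do the heavy lifting here, but the essence of this step – tokenize, drop stop words, count – can be sketched in a few lines of Python. This is a simplified stand-in: the stop-word list below is a tiny sample rather than the full SAS list, and no stemming is applied, unlike HPTMINE.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (the real SAS list is much larger)
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "you", "any"}

def term_frequencies(text):
    """Lowercase, split into word tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

doc = "The corpus is the key. Without a corpus, you cannot build any model."
print(term_frequencies(doc).most_common(3))
# [('corpus', 2), ('key', 1), ('without', 1)]
```

The resulting term-frequency table is exactly the input a word cloud needs: one row per term, with its count.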

Step 3: Generate word cloud plot.

Once you have the term frequencies, you can use either SAS Visual Analytics or R to generate the word cloud plot. I like programming, so I used the SAS procedure IML to submit R scripts from SAS.

These steps generated a word cloud with the top 500 words from 66 papers. There were 87 papers in total; 21 of them could not be converted correctly by SAS Document Conversion Server (for comparison, 19 could not be converted correctly by pdftotext).


Figure-1 Word Cloud of Top 500 Words of 66 Papers

From Figure-1, it is easy to see that the top 10 words were: word, language, chinese, model, system, sentence, corpus, information, feature and method. IALP is an international conference and does not focus on Chinese only. However, there was a shared task at this year’s conference whose purpose was to predict traditional Chinese affective words’ valence and arousal ratings. Moreover, each team that attended the shared task was requested to submit a paper introducing their work, so “Chinese” contributed more than other languages to the word cloud.

You probably think there’s a lot of noise if we use paper bodies for word cloud analysis, so my second experiment was to generate a word cloud of the paper abstracts through similar processing. You can view the results in Figure-2.


Figure-2 Word Cloud of Top 500 Words of 66 Paper Abstracts

The top 10 words from the paper abstracts were: word, chinese, system, language, method, paper, propose, sentiment, base and corpus. These are quite different from the top 10 words extracted from the paper bodies. Characters, words and sentences are the fundamental components of natural languages. For machine learning in NLP (natural language processing), an annotated corpus is the key; without a corpus, you cannot build any model. However, annotated corpora are rare even in the big data era, and that is why so many researchers devote so much effort to annotation.

Can we do more analyses with SAS? Of course. We may analyze keywords, references, paper influence, paper categorization, etc. I hope I have time to try these interesting analyses and share my work with you in a future blog.

The SAS scripts for the paper word cloud are below.

* Step 1: Convert PDF collections into a SAS dataset;
* NOTE: You should have SAS Document Conversion Server licensed;
proc tgfilter out=paper(where=(text ne '' and language eq 'English'))
   tgservice=(hostname="yourhost" port=yourport)
   /* ...additional options (such as the source directory) omitted... */;
run;

* Step 2: Parse documents into words and get word frequencies;
* Add a document id for each document;
data paper;
   set paper;
   document_id = _n_;
run;

proc hptmine data=paper;
   doc_id document_id;
   var text;
   parse notagging nonoungroups termwgt=none cellwgt=none
         outterms=key;  /* the terms dataset queried below */
run;

* Get stem frequency data;
proc sql noprint;
   create table paper_stem_freq as
   select term as word, freq as freq
   from key
   where missing(parent) eq 1
   order by freq descending;
quit;

* Step 3: Generate word cloud plot with R;
data topstems;
   set paper_stem_freq(obs=500);
run;

proc iml;
call ExportDataSetToR("topstems", "d");
submit / R;
# sort by frequency
# ...remaining R word cloud script omitted...
endsubmit;
quit;
tags: SAS R&D, SAS Text Analytics

Fun with SAS Text Analytics: A qualitative analysis of IALP papers was published on SAS Users.

December 13, 2016


Editor's note: Amanda Farnsworth is Head of Visual Journalism at BBC News and a featured speaker at SAS Global Forum 2017, April 2-5, 2017 in Orlando.

My days are spent trying to put the best content we can in front of our loyal, heartland audience, while reaching out to others, particularly on social media, who may never usually come to the BBC for their news.

It can sometimes be hard to reach both audiences at the same time.

But recently we hit on a format that does exactly that. We call it The Personal Relevance Calculator. We have made a whole series of these calculators on different topics, including “The Great British Class Calculator” (yes, we Brits are still obsessed with class!) and “Will a Robot Take Your Job?”

The idea is to take a big data set that tells a story and make it personally relevant to each and every user. Readers simply enter a small amount of personal information – it could be their age or height and weight, or a postcode of where they live – and the result they get back from the calculator is unique, or appears to be unique, to them.  This result is given in a rich, visual way and is very shareable on social media.

The advantages are a much deeper engagement with the subject than we might get by writing a traditional article, and these calculators are usually very popular, getting millions of hits, likes and shares. They also appeal to parts of the audience that other BBC content doesn’t reach.

Case Study - Who Is Your Olympic Body Match?

You can find the Olympic Body Match calculator using this link:

At the BBC, we know that the Olympics provide us an opportunity to reach a part of the audience that doesn’t often think of us.  Let’s call them Main Eventers – they are people who don’t like to be left out of those water cooler conversations when a big national or international sporting event is going on.  So they want some way of engaging with a story that they often don’t know much about. Perhaps they are not big sports fans.


Enter our calculator. By putting in your height, weight, date of birth and sex, our calculator matches you with the Olympic athlete most like you. Simple but very engaging!

We took care to make the calculator a rich, visual experience with beautiful illustrations drawn by one of our designers. We also used the colourful, carnival branding that our Marketing department came up with and which was used across all BBC Rio 2016 output during the Olympic and Paralympic Games. This didn’t look like a scary sports story, but more of a fun way to be part of the buzz that surrounds the Olympics.

The Calculator Results in Detail

After putting in the four pieces of personal information, the first result showed you how your height compares with the full range of Olympic athletes in Rio.

The next page did the same with your weight, the third with your age. And finally you were shown the 3 athletes most like you.


You may have guessed that these are images of my own Olympic Body Match – I’m not sure being most like an Estonian Wrestler is quite what I had expected!

Hitting the share button generated a box pre-populated with text that enticed the users who received the results to have a go themselves. A link to the calculator is also embedded in the tweet, along with another attractive illustration.

The data and what we did with it.

The data for this interactive was from the Olympic Data Feed, which is used by the BBC and other broadcasters to show the results of all the different events. As part of this feed, the height, weight and age of over 10,000 athletes was available for my team to repurpose.

So far, so good. But it turned out not all the data was available ahead of time. A lot of the information was collected in the days running up to the Games, as the athletes started arriving in Rio, making things a little tight for our development deadlines. To solve this problem we made some test content using figures from the 2012 London Olympics, which we swapped out later for the Rio figures. (The figures for the British track and field team, of particular interest to us, arrived just as the Games were starting.)

As the real data started to come in, we kept our eyes peeled to see who would be the tallest, shortest, heaviest and lightest athletes, as we wanted to highlight them in our graphics, to show the interesting and extreme range of body types represented at the Games.

But here we had to be careful. As with any dataset this large, there was bound to be the occasional glitch, especially when you’re looking for the outliers. Initially the dataset looked as if it contained a rower weighing an improbable 200 kilos, and a swimmer well over seven feet tall.

By checking back with the source, we were able to work out which outliers were incorrect, and which outliers were the right ones for us to focus on.

The shortest athlete was the Brazilian gymnast Flavia Saraiva (4ft 4in); the tallest was Li Muhao, a Chinese basketball player (7ft 2in).

The data was provided to us through XML feeds. We matched our readers with the athletes using Euclidean distance. Where someone’s height and weight produced more than three matches, we picked the athletes whose birthdate was closest to our reader’s as a way to break ties.
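The matching logic just described can be sketched in Python with made-up athlete records; the field layout here is illustrative, not the actual Olympic Data Feed schema:

```python
from math import sqrt

# (name, height_cm, weight_kg, age) -- illustrative records only
athletes = [
    ("Sprinter A", 180, 75, 24),
    ("Wrestler B", 178, 76, 31),
    ("Gymnast C", 145, 45, 19),
    ("Rower D", 192, 95, 27),
]

def body_match(height, weight, age, k=3):
    """Rank athletes by Euclidean distance on (height, weight),
    breaking ties with the smallest age difference."""
    def sort_key(a):
        name, h, w, ath_age = a
        dist = sqrt((h - height) ** 2 + (w - weight) ** 2)
        return (dist, abs(ath_age - age))
    return [a[0] for a in sorted(athletes, key=sort_key)[:k]]

print(body_match(179, 76, 30))  # ['Wrestler B', 'Sprinter A', 'Rower D']
```

A real implementation over 10,000 athletes would precompute or index the records, but the two-level sort key (distance first, age gap second) captures the tie-breaking rule described above.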

Audience Reaction

Our Body Match Calculator had 4m browsers, 5.8m page impressions and an engagement time of just over a minute.

The audience was 37% female and 63% male – using the gender people matched with as a proxy.

There was good engagement going down the page, with 66% of browsers filling in the form and getting to their results at the foot of the page.

It also did really well on social media, with this thread on Reddit generating nearly 1,000 comments.

tags: big data, papers & presentations, SAS Global Forum

Making data personal: Big data made small was published on SAS Users.

December 12, 2016

In a previous blog about SAS Studio, I briefly introduced the concept of using the Webwork library instead of the default Work library. I also suggested, in the SAS Global Forum 2016 paper Deep Dive with SAS Studio into SAS Grid Manager 9.4, saving intermediate results in the Webwork library, because this special library is automatically assigned at start-up and is shared across all workspace server sessions. In the past few days, I have received some requests to expand on the properties of this library and on how it is shared across different sessions. What better way to share this information than writing it up in a blog?

As always, I’d like to start with a reference to the official documentation. SAS(R) Studio 3.5: User’s Guide describes the Webwork library, along with its differences with respect to the Work library, in the section about the Interactive Mode. The main points are:

  • Webwork is the default output library in interactive mode. If you refer to a table without specifying both the libref and the table name, SAS Studio assumes it is stored in the Webwork library.
  • The Webwork library is shared between interactive mode and non-interactive mode. Any data that you create in the Webwork library in one mode can be accessed in the other mode.
  • The Work library is not shared between interactive mode and non-interactive mode. Each workspace server session has its own separate Work library, and data cannot be shared between them.
  • Any data that you save to the Work library in interactive mode cannot be accessed from the Work library in non-interactive mode. Also, you cannot view data in the Work library from the Libraries section of the navigation pane if the data was created in interactive mode.

In addition to this, we can list some additional considerations:

  • The Webwork library is shared between every workspace server session started when using parallel process flows from the Visual Programming perspective.
  • The Webwork library is not shared between different SAS Studio sessions. When using multiple SAS Studio sessions, each one has a different Webwork, just like traditional SAS Foundation sessions do not share their Work libraries.
  • The Webwork library is cleared at the end of the SAS Studio session and its content is temporary in nature, just like the Work library.

Here are the logs of the same lines of code executed in different SAS Studio sessions to show the actual path, on a Windows machine, of the Work and Webwork directories:

First SAS Studio session, non-interactive mode


Same session, interactive mode


Second SAS Studio session, non-interactive mode


And since a picture is worth a thousand words, the following diagram depicts the relationship between SAS Studio sessions, Work libraries and Webwork libraries.


Finally, I’d like to remind you that, in distributed environments where the workspace server sessions are load balanced across multiple hosts, it is imperative to configure the Webwork library on a shared filesystem, following the instructions explained in the SAS Studio tips for SAS Grid Manager Administrators blog.

tags: SAS Grid Manager, SAS Professional Services, sas studio, SAS Studio Webwork library

SAS Studio Webwork library demystified was published on SAS Users.