Dynamic programming is a powerful technique for implementing algorithms, and it is often used to solve complex computational problems. Some of its applications are world-changing, such as aligning DNA sequences; others are more "everyday," such as spelling correction. If you search for "dynamic programming," you will find plenty of material, including sample programs written in other programming languages such as Java, C, and Python, but hardly any SAS sample programs. SAS is a powerful language, and of course SAS can do it! This article demonstrates how to implement your own dynamic programming function in SAS, similar in spirit to the built-in SPEDIS function.

### What is edit distance?

Edit distance is a metric used to measure dissimilarity between two strings, by matching one string to the other through insertions, deletions, or substitutions.

### What is minimum edit distance?

Minimum edit distance measures the dissimilarity between two strings through the least number of edit operations.
Take the two strings "BAD" and "BED" as an example. These two words can be matched in multiple ways. One solution has an edit distance of 1, resulting from a single substitution of the letter "E" for "A".

Another solution might be deleting the letter "A" from "BAD" and then inserting the letter "E" between "B" and "D", which results in an edit distance of 2.

There are several variants of edit distance, depending on the costs assigned to the edit operations. For example, given the string pair "BAD" and "BED", if each operation has a cost of 1, the edit distance is 1; if we set the substitution cost to 2 (Levenshtein edit distance), the edit distance is 2.

### What is dynamic programming?

The standard method used to solve minimum edit distance uses a dynamic programming algorithm.

Dynamic programming is a method for solving complex problems by breaking them into simpler sub-problems and solving those recursively. Partial solutions are saved in a table so they can be quickly accessed in later calculations, avoiding repetitive work. By building on each preceding result, we eventually solve the original, challenging problem efficiently.

Here's the algorithm that solves Levenshtein edit distance through dynamic programming:
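Before looking at the SAS implementation, the recurrence can be sketched in Python (a minimal illustration of the same memoized recursion; the function and parameter names are my own, not SAS's):

```python
def edit_distance(query, keyword, sub_cost=2):
    """Minimum edit distance via top-down dynamic programming (memoization).

    sub_cost=2 gives the Levenshtein variant used in this article;
    sub_cost=1 gives the unit-cost variant.
    """
    memo = {}

    def dist(m, n):
        if m == 0:          # query exhausted: insert the remaining keyword chars
            return n
        if n == 0:          # keyword exhausted: delete the remaining query chars
            return m
        if (m, n) not in memo:
            delta = 0 if query[m - 1] == keyword[n - 1] else sub_cost
            memo[(m, n)] = min(
                dist(m - 1, n - 1) + delta,  # substitution (or free match)
                dist(m - 1, n) + 1,          # deletion
                dist(m, n - 1) + 1,          # insertion
            )
        return memo[(m, n)]

    return dist(len(query), len(keyword))
```

For example, `edit_distance("BAD", "BED")` returns 2 with the default substitution cost of 2, and 1 with `sub_cost=1`.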

The following image shows the annotated SAS program that implements the algorithm. The complete code (which you can copy and test for yourself) is at the end of this article.

To demonstrate usage of the edit distance function and validate the intermediate edit distance matrix, I used the string pair "INTENTION" and "EXECUTION", which I copied from Stanford's class material, as an example. (The same resource also shows how the technique applies to DNA sequence alignment.)

```
options cmplib=work.funcs;

data test;
   infile cards missover;
   input word1 : $20. word2 : $20.;
   d=editDistance(word1,word2);
   put d=;
   cards;
INTENTION EXECUTION
;
run;
```

The Levenshtein edit distance between "INTENTION" and "EXECUTION" is 8.

The edit distance table is as follows.

|   | E | X | E | C | U | T | I | O | N |
|---|---|---|---|---|---|---|---|---|---|
| N | 8 | 9 | 10 | 11 | 12 | 11 | 10 | 9 | 8 |
| O | 7 | 8 | 9 | 10 | 11 | 10 | 9 | 8 | 9 |
| I | 6 | 7 | 8 | 9 | 10 | 9 | 8 | 9 | 10 |
| T | 5 | 6 | 7 | 8 | 9 | 8 | 9 | 10 | 11 |
| N | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 10 |
| E | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 9 |
| T | 4 | 5 | 6 | 7 | 8 | 7 | 8 | 9 | 8 |
| N | 3 | 4 | 5 | 6 | 7 | 8 | 7 | 8 | 7 |
| I | 2 | 3 | 4 | 5 | 6 | 7 | 6 | 7 | 8 |

(The rows spell "INTENTION" reading from the bottom up, as the matrix is printed with row 1 at the bottom.)

### More dynamic programming applications

Dynamic programming is a powerful technique that can be used to solve many complex computational problems. Anna Di and I are presenting a paper at the PharmaSUG 2018 China conference demonstrating how to align DNA sequences with SAS FCMP and SAS Viya. If you are interested in this topic, please look for our paper after the conference proceedings are published.

### Appendix: Complete SAS program for editDistance function

```
proc fcmp outlib=work.funcs.spedis;
   function editDistance(query $, keyword $);
      array distance[1,1]/nosymbols;
      m = length(query);
      n = length(keyword);
      call dynamic_array(distance, m+1, n+1);
      do i=1 to m+1;
         do j=1 to n+1;
            distance[i, j]=-1;
         end;
      end;

      i_max=m;
      j_max=n;

      dist = edDistRecursive(query, keyword, distance, m, n);

      do i=i_max to 1 by -1;
         do j=1 to j_max;
            put distance[i, j] best3. @;
         end;
         put;
      end;

      return (dist);
   endsub;

   function edDistRecursive(query $, keyword $, distance[*,*], m, n);
      outargs distance;

      if m = 0 then return (n);
      if n = 0 then return (m);

      if distance[m,n] >= 0 then return (distance[m,n]);

      if (substr(query,m,1) = substr(keyword,n,1)) then delta = 0;
      else delta = 2;

      ans = min(edDistRecursive(query, keyword, distance, m - 1, n - 1) + delta,
                edDistRecursive(query, keyword, distance, m - 1, n) + 1,
                edDistRecursive(query, keyword, distance, m, n - 1) + 1);
      distance[m,n] = ans;
      return (distance[m,n]);
   endsub;
run;
quit;

options cmplib=work.funcs;

data test;
   infile cards missover;
   input word1 : $20. word2 : $20.;
   d=editDistance(word1,word2);
   put d=;
   cards;
INTENTION EXECUTION
;
run;
```

Dynamic programming with SAS FCMP was published on SAS Users.

SAS Text Analytics analyzes documents at the document level by default, but sometimes sentence-level analysis yields further insights into the data. Two years ago, the SAS Text Analytics team did some research on sentence-level text analysis and shared their discoveries in the SGF paper Getting More from the Singular Value Decomposition (SVD): Enhance Your Models with Document, Sentence, and Term Representations. Recently my team started working on a concept extraction project. We need to extract all sentences containing one or two query words, so that linguists don't need to read whole documents in order to write concept extraction rules. This significantly improves their efficiency in rule development and rule tuning.

### Sentence boundary detection

Sentence boundary detection is a challenge in Natural Language Processing -- it's more complicated than you might expect. For example, most sentences in English end with a period, but sometimes a period is used to denote an abbreviation or as part of an ellipsis. My colleagues Biljana and Teresa wrote an article about the complexities of how a period may be used. If you are interested in this topic, please check out their article Text analytics through linguists' eyes: When is a period not a full stop?
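To see why a naive approach falls short, here is a quick Python sketch (an illustration of the problem only, not one of the SAS methods discussed below) that splits on sentence-final punctuation followed by whitespace:

```python
import re

def naive_sentence_split(text):
    # Deliberately naive: break after '.', '?', or '!' followed by whitespace.
    return re.split(r"(?<=[.?!])\s+", text.strip())

text = ("Rolls-Royce Motor Cars Inc. said it expects its U.S. sales "
        "to remain steady at about 1,200 cars in 1990.")
parts = naive_sentence_split(text)
# One real sentence is wrongly split into three pieces,
# after the abbreviation "Inc." and after "U.S."
```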

Sentence boundary rules differ between languages, and when you work with multilingual data you might want to write one set of code to handle data in all of those languages. For example, a period in German can denote the end of an ordinal number token; in Chinese, the sentence-final period differs from the English period; and Thai does not use a period to denote the end of a sentence.

Here are several sentence boundary examples:

| Sentence | Language | Text |
|---|---|---|
| 1 | English | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990. |
| 2 | English | I paid $23.45 for this book. |
| 3 | English | We earn more and more money, but we feel less and less happier. So…what happened to us? |
| 4 | Chinese | 北京确实人多车多，但是根源在哪里？ |
| 5 | Chinese | 在于首都集中了太多全国性资源。 |
| 6 | German | Was sind die Konsequenzen der Abstimmung vom 12. Juni? |

### How to tokenize documents into sentences with SAS?

There are several methods to build a sentence tokenizer with SAS Text Analytics. Here I list only three of them:

• Method 1: Use the concept actions (compileConcept, validateConcept, and applyConcept) with SAS Visual Text Analytics
• Method 2: Use the CAS action tpParse with SAS Viya
• Method 3: Use SAS Data Step code with SAS 9

Among the above three methods, I recommend the first, because it extracts sentences while keeping the raw text intact. With the second method, uppercase letters are changed to lowercase during parsing, and some unseen characters are replaced with white spaces. The third method is based on traditional SAS 9 technology (not SAS Viya), so it might not scale as well to large data.

In my article, I show the SAS code of only the first two methods. For details of the SAS code for the last method, please check out the paper Getting More from the Singular Value Decomposition (SVD): Enhance Your Models with Document, Sentence, and Term Representations.

#### Use the CAS NLP action tpParse

```
%macro sentenceTokenizer2(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   dsOut=
);
/* Parse the data set */
proc cas;
   textparse.tpParse /
      docId="&docVar"
      documents={name="&dsIn"}
      text="&textVar"
      language="&language"
      cellWeight="NONE"
      stemming=false
      tagging=false
      noungroups=false
      entities="none"
      selectAttribute={opType="IGNORE",tagList={}}
      selectPos={opType="IGNORE",tagList={}}
      offset={name="offset",replace=TRUE}
   ;
run;

/* Get Sentences */
proc cas;
   table.partition /
      table={name="offset"
             groupby={{name="_document_"}, {name="_sentence_"}}
             orderby={{name="_start_"}}}
      casout={name="offset" replace=true};
run;

   datastep.runCode /
      code= "
         data &dsOut;
            set offset;
            by _document_ _sentence_ _start_;
            length _text_ varchar(20000);
            if first._sentence_ then do;
               _text_='';
               _lag_end_ = -1;
            end;
            if _start_=_lag_end_+1 then _text_=cats(_text_, _term_);
            else _text_=trim(_text_)||repeat(' ',_start_-_lag_end_-2)||_term_;
            _lag_end_=_end_;
            if last._sentence_ then output;
            retain _text_ _lag_end_;
            keep _document_ _sentence_ _text_;
         run;
      ";
run;
quit;

proc cas;
   table.dropTable name="offset" quiet=true;
run;
quit;
%mend sentenceTokenizer2;
```

Here are three examples for running these tokenizer methods:

```
/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;

/*-------------------------------------*/
/* Example 1: Chinese texts            */
/*-------------------------------------*/
data sascas1.text_zh;
   infile cards dlm='|' missover;
   input _document_ text :$200.;
   cards;
1|北京确实人多车多，但是根源在哪里？在于首都集中了太多全国性资源。
;
run;

%sentenceTokenizer1(
   dsIn=text_zh,
   docVar=_document_,
   textVar=text,
   language=chinese,
   dsOut=sentences_zh1
);

%sentenceTokenizer2(
   dsIn=text_zh,
   docVar=_document_,
   textVar=text,
   language=chinese,
   dsOut=sentences_zh2
);

/*-------------------------------------*/
/* Example 2: English texts            */
/*-------------------------------------*/
data sascas1.text_en;
   infile cards dlm='|' missover;
   input _document_ text :$500.;
   cards;
1|Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
2|I paid $23.45 for this book.
3|We earn more and more money, but we feel less and less happier. So…what happened to us?
;
run;

%sentenceTokenizer1(
   dsIn=text_en,
   docVar=_document_,
   textVar=text,
   language=english,
   dsOut=sentences_en1
);

%sentenceTokenizer2(
   dsIn=text_en,
   docVar=_document_,
   textVar=text,
   language=english,
   dsOut=sentences_en2
);

/*-------------------------------------*/
/* Example 3: German texts             */
/*-------------------------------------*/
data sascas1.text_de;
   infile cards dlm='|' missover;
   input _document_ text :$600.;
   cards;
1|Was sind die Konsequenzen der Abstimmung vom 12. Juni?
;
run;

%sentenceTokenizer1(
   dsIn=text_de,
   docVar=_document_,
   textVar=text,
   language=german,
   dsOut=sentences_de1
);

%sentenceTokenizer2(
   dsIn=text_de,
   docVar=_document_,
   textVar=text,
   language=german,
   dsOut=sentences_de2
);
```

Table 2 below shows the sentences extracted in the three examples.

| Language | Doc | Text | Sentence (Method 1) | Sentence (Method 2) |
|---|---|---|---|---|
| English | 1 | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990. | Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990. | rolls-royce motor cars inc. said it expects its u.s. sales to remain steady at about 1,200 cars in 1990. |
| English | 2 | I paid $23.45 for this book. | I paid $23.45 for this book. | i paid $23.45 for this book. |
| English | 3 | We earn more and more money, but we feel less and less happier. So…what happened to us? | We earn more and more money, but we feel less and less happier.<br>So…what happened? | we earn more and more money, but we feel less and less happier.<br>so…what happened? |
| Chinese | 1 | 北京确实人多车多，但是根源在哪里？在于首都集中了太多全国性资源。 | 北京确实人多车多，但是根源在哪里？<br>在于首都集中了太多全国性资源。 | 北京确实人多车多，但是根源在哪里？<br>在于首都集中了太多全国性资源。 |
| German | 1 | Was sind die Konsequenzen der Abstimmung vom 12. Juni? | Was sind die Konsequenzen der Abstimmung vom 12. Juni? | was sind die konsequenzen der abstimmung vom 12. juni? |

From the table above, you can see that the two methods produce identical results for the Chinese text but differ in many places for the English and German texts. So which method should you use? It depends on the SAS products you have available. Method 1 depends on the compileConcept, validateConcept, and applyConcept actions and requires SAS Visual Text Analytics. Method 2 depends on the tpParse action in SAS Visual Analytics. If you have both products available, consider your use case. If you are working on text analytics that are case insensitive, such as topic detection or text clustering, you may choose Method 2. Otherwise, if your text analytics are case sensitive, such as named entity recognition, you must choose Method 1. (And of course, if you don't have SAS Viya, you can use Method 3 with SAS 9 and guidance from the cited paper.)

If you have SAS Viya, I suggest trying the above sentence tokenization methods on your data and then running text mining actions on the sentence-level data to see what insights you gain.

How to tokenize documents into sentences was published on SAS Users.

Last year, when I went through the SAS Global Forum 2017 paper list, the paper Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis impressed me a lot. In this paper, the author pointed out the common problems caused by traditional sampling methods and proposed four sampling methods for textual data. Recently my team has been working on a project involving a huge volume of documents from a specific field, and we need linguists and domain experts to analyze the textual data and annotate ground truth. So our first question was which documents we should start working on to get a panoramic image of the data with minimum effort. Frankly, I don't have a state-of-the-art method to extract representative documents and measure its effect, so why not try this innovative technique?

The paper proposed four sampling methods; I tried only the first one, which uses cluster membership as strata. Before we step into the details of the SAS program, let me introduce the steps of this method.

• Step 1: Parse textual data into tokens and calculate each term's TF-IDF value
• Step 2: Generate term-by-document matrix
• Step 3: Cluster documents through k-means algorithm
• Step 4: Get top k terms of each cluster
• Step 5: Do stratified sampling by cluster
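To make step 5 concrete, here is a minimal pure-Python sketch of stratified sampling by cluster (the data, function names, and percentages here are hypothetical; in the real program below this work is done by the CAS sampling.stratified action):

```python
import random

def stratified_sample_by_cluster(doc_ids, cluster_of, samp_pct, seed=12345):
    """Randomly draw samp_pct percent of documents from each cluster."""
    rng = random.Random(seed)
    by_cluster = {}
    for doc in doc_ids:
        by_cluster.setdefault(cluster_of[doc], []).append(doc)
    sample = []
    for cluster in sorted(by_cluster):
        docs_in_cluster = by_cluster[cluster]
        # at least one document per cluster, so small clusters stay represented
        k = max(1, round(len(docs_in_cluster) * samp_pct / 100))
        sample.extend(rng.sample(docs_in_cluster, k))
    return sample

# Hypothetical toy data: 100 documents in three clusters of sizes 60, 30, and 10
docs = list(range(100))
labels = {d: (0 if d < 60 else 1 if d < 90 else 2) for d in docs}
picked = stratified_sample_by_cluster(docs, labels, samp_pct=10)
```

Because the sample is drawn per cluster, each cluster contributes documents in proportion to its size.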

I wrote a SAS macro for each step so that you can check the results step by step. If you are not satisfied with the final clustering result, you can tune the parameters of any step and re-run that step and the steps after it. Now let's see how to use SAS Viya to extract samples from movie review data.

The movie review data has 11,855 rows of observations containing 200,963 tokens. After removing stop words, 18,976 terms remain. In this example, I set the dimension size of the term-by-document matrix to 3000, meaning I use the 3000 terms with the highest TF-IDF values in the document collection as its dimensions. Then I use k-means clustering to group documents into K clusters, setting the maximum K to 50 with the kClus action in CAS. The dataSegment action can also cluster documents directly, but it cannot choose the best K; you would need to try the clustering with different K values and pick the best one yourself. In contrast, the kClus action chooses the best K automatically between the defined minimum and maximum K, so I use the kClus action in my implementation.
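The TF-IDF weighting in step 1 follows the formula used in the appendix macro (`_tfidf_ = _tf_*log(totalDocs/_NumDocs_)`). A minimal pure-Python sketch of the same computation, on hypothetical tokenized documents:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document term weights: tf * log(totalDocs / numDocsContainingTerm),
    mirroring the _tfidf_ formula in the appendix macro."""
    total_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))   # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(total_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical tokenized documents (stop words already removed)
docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
w = tfidf_weights(docs)
```

Terms that appear in fewer documents (such as "bad" above) receive higher weights than terms that appear everywhere.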

After running the program (full code at the end of this post), I got 39 clusters and top 10 terms of the first cluster as Table-1 shows.

Table-1 Top 10 terms of Cluster 1

Let's see what samples we get for the first cluster. I got 7 documents, and each document contains either the term "predictable" or the term "emotional."

I set sampPct to 5, which means 5% of the data will be randomly selected from each cluster. In the end I got 582 sample documents. Let's check the sample distribution across clusters.

This clustering method helped us intelligently select a small subset of documents from the piles of document collections, and most importantly it saved us a lot of time and helped us hit the mark.

I haven't had a chance to try the other three sampling methods from the paper; I encourage you to try them and share your experiences with us. Big thanks to my colleague Murali Pagolu for sharing this innovative technique at the SAS Global Forum 2017 conference and for kindly providing me with some good suggestions.

### Appendix: Complete code for text sampling

```
/*-------------------------------------------*/
/* Sampling textual data based on clustering */
/*-------------------------------------------*/

/*-------------------------------------*/
/* Get tfidf                           */
/*-------------------------------------*/
%macro getTfidf(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   stemming=true,
   stopList=,
   dsOut=
);
proc cas;
   textparse.tpParse /
      docId="&docVar"
      documents={name="&dsIn"}
      text="&textVar"
      language="&language"
      cellWeight="NONE"
      stemming=false
      tagging=false
      noungroups=false
      entities="none"
      offset={name="tpparse_out",replace=TRUE}
   ;
run;

   textparse.tpAccumulate /
      offset={name="tpparse_out"}
      stopList={name="&stopList"}
      termWeight="NONE"
      cellWeight="NONE"
      reduce=1
      parent={name="tpAccu_parent",replace=TRUE}
      terms={name="tpAccu_term",replace=TRUE}
      showdroppedterms=false
   ;
run;
quit;

proc cas;
   loadactionset "fedsql";
   execdirect casout={name="doc_term_stat", replace=true}
   query="
      select tpAccu_parent.&docVar,
             tpAccu_term._term_,
             tpAccu_parent._count_ as _tf_,
             tpAccu_term._NumDocs_
      from tpAccu_parent
      left join tpAccu_term
      on tpAccu_parent._Termnum_=tpAccu_term._Termnum_;
   ";
run;

   simple.groupBy /
      table={name="tpAccu_parent"}
      inputs={"&docVar"}
      casout={name="doc_nodup", replace=true};
run;

   numRows result=r / table={name="doc_nodup"};
   totalDocs = r.numrows;
run;

   datastep.runcode /
      code = "
         data &dsOut;
            set doc_term_stat;"
      ||"_tfidf_ = _tf_*log("||totalDocs||"/_NumDocs_);"
      ||"run;
      ";
run;
quit;

proc cas;
   table.dropTable name="tpparse_out" quiet=true; run;
   table.dropTable name="tpAccu_parent" quiet=true; run;
   table.dropTable name="tpAccu_term" quiet=true; run;
   table.dropTable name="doc_nodup" quiet=true; run;
   table.dropTable name="doc_term_stat" quiet=true; run;
quit;
%mend getTfidf;

/*-------------------------------------*/
/* Term-by-document matrix             */
/*-------------------------------------*/
%macro DocToVectors(
   dsIn=,
   docVar=,
   termVar=,
   tfVar=,
   dimSize=500,
   dsOut=
);
proc cas;
   simple.summary /
      table={name="&dsIn", groupBy={"&termVar"}}
      inputs={"&tfVar"}
      summarySubset={"sum"}
      casout={name="term_tf_sum", replace=true};
run;

   simple.topk /
      table={name="term_tf_sum"}
      inputs={"&termVar"}
      topk=&dimSize
      bottomk=0
      raw=True
      weight="_Sum_"
      casout={name='termnum_top', replace=true};
run;

   loadactionset "fedsql";
   execdirect casout={name="doc_top_terms", replace=true}
   query="
      select termnum.*, _rank_
      from &dsIn termnum, termnum_top
      where termnum.&termVar=termnum_top._Charvar_
        and &tfVar!=0;
   ";
run;

   transpose.transpose /
      table={name="doc_top_terms",
             groupby={"&docVar"},
             computedVars={{name="_name_"}},
             computedVarsProgram="_name_='_dim'||strip(_rank_)||'_';"}
      transpose={"&tfVar"}
      casOut={name="&dsOut", replace=true};
run;
quit;

proc cas;
   table.dropTable name="term_tf_sum" quiet=true; run;
   table.dropTable name="termnum_top" quiet=true; run;
   table.dropTable name="termnum_top_misc" quiet=true; run;
   table.dropTable name="doc_top_terms" quiet=true; run;
quit;
%mend DocToVectors;

/*-------------------------------------*/
/* Cluster documents                   */
/*-------------------------------------*/
%macro clusterDocs(
   dsIn=,
   nClusters=10,
   seed=12345,
   dsOut=
);
proc cas;
   /* get the vector variables list */
   columninfo result=collist / table={name="&dsIn"};
   ndimen=dim(collist['columninfo']);
   vector_columns={};
   j=1;
   do i=1 to ndimen;
      thisColumn = collist['columninfo'][i][1];
      if lowcase(substr(thisColumn, 1, 4))='_dim' then do;
         vector_columns[j]= thisColumn;
         j=j+1;
      end;
   end;
run;

   clustering.kClus /
      table={name="&dsIn"},
      nClusters=&nClusters,
      init="RAND",
      seed=&seed,
      inputs=vector_columns,
      distance="EUCLIDEAN",
      printIter=false,
      impute="MEAN",
      standardize='STD',
      output={casOut={name="&dsOut", replace=true}, copyvars="ALL"}
   ;
run;
quit;
%mend clusterDocs;

/*-------------------------------------*/
/* Get top-k words of each cluster     */
/*-------------------------------------*/
%macro clusterProfile(
   termDS=,
   clusterDS=,
   docVar=,
   termVar=,
   tfVar=,
   clusterVar=_CLUSTER_ID_,
   topk=10,
   dsOut=
);
proc cas;
   loadactionset "fedsql";
   execdirect casout={name="cluster_terms",replace=true}
   query="
      select &termDS..*, &clusterVar
      from &termDS, &clusterDS
      where &termDS..&docVar = &clusterDS..&docVar;
   ";
run;

   simple.summary /
      table={name="cluster_terms", groupBy={"&clusterVar", "&termVar"}}
      inputs={"&tfVar"}
      summarySubset={"sum"}
      casout={name="cluster_terms_sum", replace=true};
run;

   simple.topk /
      table={name="cluster_terms_sum", groupBy={"&clusterVar"}}
      inputs={"&termVar"}
      topk=&topk
      bottomk=0
      raw=True
      weight="_Sum_"
      casout={name="&dsOut", replace=true};
run;
quit;

proc cas;
   table.dropTable name="cluster_terms" quiet=true; run;
   table.dropTable name="cluster_terms_sum" quiet=true; run;
quit;
%mend clusterProfile;

/*-------------------------------------*/
/* Stratified sampling by cluster      */
/*-------------------------------------*/
%macro strSampleByCluster(
   docDS=,
   docClusterDS=,
   docVar=,
   clusterVar=_CLUSTER_ID_,
   seed=12345,
   sampPct=,
   dsOut=
);
proc cas;
   loadactionset "sampling";
   stratified result=r /
      table={name="&docClusterDS", groupby={"&clusterVar"}}
      sampPct=&sampPct
      partind="TRUE"
      seed=&seed
      output={casout={name="sampling_out",replace="TRUE"},
              copyvars={"&docVar", "&clusterVar"}};
run;
   print r.STRAFreq;
run;

   loadactionset "fedsql";
   execdirect casout={name="&dsOut", replace=true}
   query="
      select docDS.*, &clusterVar
      from &docDS docDS, sampling_out
      where docDS.&docVar=sampling_out.&docVar
        and _PartInd_=1;
   ";
run;

proc cas;
   table.dropTable name="sampling_out" quiet=true; run;
quit;
%mend strSampleByCluster;

/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
caslib _all_ assign;

/*-------------------------------------*/
/* Prepare and load data.              */
/*-------------------------------------*/
%let myData=movie_reviews;

proc cas;
   loadtable result=r /
      importOptions={fileType="csv", delimiter='TAB', getnames="true"}
      path="data/movie_reviews.txt"
      casLib="CASUSER"
      casout={name="&myData", replace="true"}
   ;
run;
quit;

/* Browse the data */
proc cas;
   columninfo / table={name="&myData"};
   fetch / table = {name="&myData"};
run;
quit;

/* generate one unique index using data step */
proc cas;
   datastep.runcode /
      code = "
         data &myData;
            set &myData;
            rename id = _document_;
            keep id text score;
         run;
      ";
run;
quit;

/* create stop list */
data sascas1.stopList;
   set sashelp.engstop;
run;

/* Get tfidf by term by document */
%getTfidf(
   dsIn=&myData,
   docVar=_document_,
   textVar=text,
   language=english,
   stemming=true,
   stopList=stopList,
   dsOut=doc_term_tfidf
);

/* document-term matrix */
%DocToVectors(
   dsIn=doc_term_tfidf,
   docVar=_document_,
   termVar=_term_,
   tfVar=_tfidf_,
   dimSize=3000,
   dsOut=doc_vectors
);

/* Cluster documents */
%clusterDocs(
   dsIn=doc_vectors,
   nClusters=50,
   seed=12345,
   dsOut=doc_clusters
);

/* Get top-k words of each cluster */
%clusterProfile(
   termDS=doc_term_tfidf,
   clusterDS=doc_clusters,
   docVar=_document_,
   termVar=_term_,
   tfVar=_tfidf_,
   clusterVar=_cluster_id_,
   topk=10,
   dsOut=cluster_topk_terms
);

/* Browse topk terms of the first cluster */
proc cas;
   fetch / table={name="cluster_topk_terms", where="_cluster_id_=1"};
run;
quit;

/* Stratified sampling by cluster */
%strSampleByCluster(
   docDS=&myData,
   docClusterDS=doc_clusters,
   docVar=_document_,
   clusterVar=_cluster_id_,
   seed=12345,
   sampPct=5,
   dsOut=doc_sample_by_cls
);

/* Browse sample documents of the first cluster */
proc cas;
   fetch / table={name="doc_sample_by_cls", where="_cluster_id_=1"};
run;
quit;
```

How to sample textual data with SAS was published on SAS Users.

Word Mover's Distance (WMD) is a distance metric used to measure the dissimilarity between two documents; its application in text analytics was introduced by a research group from Washington University in 2015. The group's paper, From Word Embeddings To Document Distances, was published at the 32nd International Conference on Machine Learning (ICML). In this paper, they demonstrated that the WMD metric leads to unprecedentedly low k-nearest-neighbor document classification error rates on eight real-world document classification data sets.

They leveraged word embeddings and WMD to classify documents, and the biggest advantage of this method over traditional methods is its ability to incorporate the semantic similarity between individual word pairs (e.g., President and Obama) into the document distance metric. Traditionally, one way to handle semantically similar words is to provide a synonym table so that the algorithm can merge words with the same meaning into a representative word before measuring document distance; otherwise you cannot get an accurate dissimilarity result. However, maintaining synonym tables requires continuous effort from human experts and is therefore time-consuming and very expensive. Additionally, the semantic meaning of words depends on the domain, and a general synonym table does not work well across varied domains.

### Definition of Word Mover's Distance

WMD defines the distance between two documents as the minimum (weighted) cumulative cost required to move all words from one document to the other. The distance is calculated by solving the following linear program:

`minimize sum(i,j) Tij * c(i, j)`
`subject to: sum(j) Tij = di for every word i in d`
`            sum(i) Tij = d'j for every word j in d'`

Where

• Tij denotes how much of word i in document d travels to word j in document d';
• c(i, j) denotes the cost of "traveling" from word i in document d to word j in document d'; here the cost is the two words' Euclidean distance in the word2vec embedding space;
• if word i appears ci times in document d, we denote its normalized weight as di = ci / sum(k) ck.

WMD is a special case of the earth mover's distance (EMD), a metric that arises from the well-known transportation problem.

### How to Calculate Earth Mover's Distance with SAS?

SAS/OR is the tool to solve transportation problems. Figure-1 shows a transportation example with four nodes and the distances between nodes, which I copied from this Earth Mover's Distance document. The objective is to find out the minimum flow from {x1, x2} to {y1, y2}. Now let's see how to solve this transportation problem using SAS/OR.

The weights of nodes and distances between nodes are given below.

Figure-1 A Transportation Problem

```
data x_set;
   input _node_ $ _sd_;
   datalines;
x1 0.74
x2 0.26
;

data y_set;
   input _node_ $ _sd_;
   datalines;
y1 0.23
y2 0.51
;

data arcdata;
   input _tail_ $ _head_ $ _cost_;
   datalines;
x1 y1 155.7
x1 y2 252.3
x2 y1 292.9
x2 y2 198.2
;

proc optmodel;
   set xNODES;
   num w {xNODES};

   set yNODES;
   num u {yNODES};

   set <str,str> ARCS;
   num arcCost {ARCS};

   read data x_set into xNODES=[_node_] w=_sd_;
   read data y_set into yNODES=[_node_] u=_sd_;
   read data arcdata into ARCS=[_tail_ _head_] arcCost=_cost_;

   var flow {<i,j> in ARCS} >= 0;
   impvar sumY = sum{j in yNODES} u[j];
   min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j]) / sumY;

   con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
   con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i];

   solve with lp / algorithm=ns scale=none logfreq=1;
   print flow;
quit;
```

Table-1 shows the SAS/OR solution; the EMD is the objective value: 203.26756757.

Table-1 EMD Calculated with SAS/OR

Table-2 shows the flow data I got with SAS/OR, which is the same as the diagram posted in the aforementioned Earth Mover's Distance document.

Table-2 Flow data of SAS/OR

Figure-2 Flow Diagram of Transportation Problem
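For readers without SAS/OR, the same small problem can be checked with an off-the-shelf LP solver. The following Python sketch is my own illustration using SciPy's `linprog` (not the PROC OPTMODEL code above); it encodes the same supplies, demands, and costs and reproduces the objective value.

```python
import numpy as np
from scipy.optimize import linprog

w = np.array([0.74, 0.26])            # supplies at x1, x2
u = np.array([0.23, 0.51])            # demands at y1, y2
cost = np.array([[155.7, 252.3],      # distances x_i -> y_j
                 [292.9, 198.2]])

c = cost.ravel()                      # flow variables f = [f11, f12, f21, f22]
A_eq = np.array([[1, 0, 1, 0],        # flow into y1 equals u[0]
                 [0, 1, 0, 1]])       # flow into y2 equals u[1]
A_ub = np.array([[1, 1, 0, 0],        # flow out of x1 at most w[0]
                 [0, 0, 1, 1]])       # flow out of x2 at most w[1]

res = linprog(c, A_ub=A_ub, b_ub=w, A_eq=A_eq, b_eq=u, bounds=(0, None))
# Dividing the minimized cost by the total demand matches the OPTMODEL
# objective, which normalizes by sumY.
emd = res.fun / u.sum()
print(round(emd, 4))                  # → 203.2676
```

Minimizing the raw cost and dividing afterwards gives the same optimum as dividing inside the objective, since the total demand is a constant.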

### How to Calculate Word Mover's Distance with SAS

The paper From Word Embeddings To Document Distances proposed a new metric called Relaxed Word Mover's Distance (RWMD) by removing the second constraint of WMD in order to reduce the computation. Since we need to read word embedding data, I will show you how to calculate the RWMD of two documents with SAS Viya.

```
/* Start CAS server */
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;

/* Load documents into CAS */
data sascas1.documents;
   infile datalines delimiter='|' missover;
   length text varchar(300);
   input text$ did;
   datalines;
Obama speaks to the media in Illinois.|1
The President greets the press in Chicago.|2
;
run;

/* Create stop list */
data sascas1.stopList;
   infile datalines missover;
   length term $20;
   input term$;
   datalines;
the
to
in
;
run;

/* Load word embedding model */
proc cas;
   loadtable path='datasources/glove_100d_tab_clean.txt'
      caslib="CASTestTmp"
      importOptions={
         fileType="delimited",
         delimiter='\t',
         getNames=True,
         guessRows=2.0,
         varChars=True
      }
      casOut={name='glove' replace=True};
run;
quit;

%macro calculateRWMD (
   textDS=documents,
   documentID=did,
   text=text,
   language=English,
   stopList=stopList,
   word2VectDS=glove,
   doc1_id=1,
   doc2_id=2
);
/* Text parsing and aggregation */
proc cas;
   textParse.tpParse /
      table={name="&textDS", where="&documentID=&doc1_id or &documentID=&doc2_id"}
      docId="&documentID",
      language="&language",
      stemming=False,
      nounGroups=False,
      tagging=False,
      offset={name="outpos", replace=1},
      text="&text";
run;

   textparse.tpaccumulate /
      parent={name="outparent1", replace=1}
      language="&language",
      offset='outpos',
      stopList={name="&stoplist"},
      terms={name="outterms1", replace=1},
      child={name="outchild1", replace=1},
      reduce=1,
      cellweight='none',
      termWeight='none';
run;
quit;

/* Terms of the two test documents */
proc cas;
   loadactionset "fedsql";
   execdirect casout={name="doc_terms", replace=true}
      query="
         select outparent1.*, _term_
         from outparent1
         left join outterms1
            on outparent1._termnum_ = outterms1._termnum_
         where _Document_=&doc1_id or _Document_=&doc2_id;
      ";
run;
quit;

/* Term vectors and counts of the two test documents */
proc cas;
   loadactionset "fedsql";
   execdirect casout={name="doc1_termvects", replace=true}
      query="
         select word2vect.*
         from &word2VectDS word2vect, doc_terms
         where _Document_=&doc2_id and lowcase(term) = _term_;
      ";
run;

   execdirect casout={name="doc1_terms", replace=true}
      query="
         select doc_terms.*
         from &word2VectDS, doc_terms
         where _Document_=&doc2_id and lowcase(term) = _term_;
      ";
run;

   simple.groupBy /
      table={name="doc1_terms"}
      inputs={"_Term_", "_Count_"}
      aggregator="n"
      casout={name="doc1_termcount", replace=true};
run;
quit;

proc cas;
   loadactionset "fedsql";
   execdirect casout={name="doc2_termvects", replace=true}
      query="
         select word2vect.*
         from &word2VectDS word2vect, doc_terms
         where _Document_=&doc1_id and lowcase(term) = _term_;
      ";
run;

   execdirect casout={name="doc2_terms", replace=true}
      query="
         select doc_terms.*
         from &word2VectDS, doc_terms
         where _Document_=&doc1_id and lowcase(term) = _term_;
      ";
run;

   simple.groupBy /
      table={name="doc2_terms"}
      inputs={"_Term_", "_Count_"}
      aggregator="n"
      casout={name="doc2_termcount", replace=true};
run;
quit;

/* Calculate Euclidean distance between words */
data doc1_termvects;
   set sascas1.doc1_termvects;
run;

data doc2_termvects;
   set sascas1.doc2_termvects;
run;

proc iml;
   use doc1_termvects;
   read all var _char_ into lterm;
   read all var _num_ into x;
   close doc1_termvects;

   use doc2_termvects;
   read all var _char_ into rterm;
   read all var _num_ into y;
   close doc2_termvects;

   d = distance(x, y);

   lobs = nrow(lterm);
   robs = nrow(rterm);
   d_out = j(lobs*robs, 3, ' ');
   do i = 1 to lobs;
      do j = 1 to robs;
         d_out[(i-1)*robs+j, 1] = lterm[i];
         d_out[(i-1)*robs+j, 2] = rterm[j];
         d_out[(i-1)*robs+j, 3] = cats(d[i,j]);
      end;
   end;

   create distance from d_out;
   append from d_out;
   close distance;
run; quit;

/* Calculate RWMD between documents */
data x_set;
   set sascas1.doc1_termcount;
   rename _term_=_node_;
   _weight_=_count_;
run;

data y_set;
   set sascas1.doc2_termcount;
   rename _term_=_node_;
   _weight_=_count_;
run;

data arcdata;
   set distance;
   rename col1=_tail_;
   rename col2=_head_;
   length _cost_ 8;
   _cost_ = col3;
run;

proc optmodel;
   set xNODES;
   num w {xNODES};

   set yNODES;
   num u {yNODES};

   set <str,str> ARCS;
   num arcCost {ARCS};

   read data x_set into xNODES=[_node_] w=_weight_;
   read data y_set into yNODES=[_node_] u=_weight_;
   read data arcdata into ARCS=[_tail_ _head_] arcCost=_cost_;

   var flow {<i,j> in ARCS} >= 0;
   impvar sumY = sum{j in yNODES} u[j];
   min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j]) / sumY;

   con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
   /* con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i]; */

   solve with lp / algorithm=ns scale=none logfreq=1;

   call symput('obj', strip(put(obj, best.)));
   create data flowData
      from [i j] = {<i, j> in ARCS: flow[i,j].sol > 0}
      col("cost") = arcCost[i,j]
      col("flowweight") = flow[i,j].sol;
run;
quit;

%put RWMD=&obj;
%mend calculateRWMD;

%calculateRWMD (
   textDS=documents,
   documentID=did,
   text=text,
   language=English,
   stopList=stopList,
   word2VectDS=glove,
   doc1_id=1,
   doc2_id=2
);
```

The RWMD value is 5.0041121662.
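Conceptually, dropping the supply constraint makes the relaxed problem solvable without an LP solver: the optimal flow simply sends each target word's full weight to its nearest source word. Here is a minimal Python sketch of that idea, using toy 2-D vectors as stand-ins for the 100-dimensional GloVe embeddings (this is my own illustration, not the SAS computation above).

```python
import numpy as np

def rwmd(src_vecs, tgt_vecs, tgt_weights):
    """Relaxed WMD: each target word ships its whole weight to its
    nearest source word; the total is normalized by the target weight."""
    total = 0.0
    for vec, wgt in zip(tgt_vecs, tgt_weights):
        dists = np.linalg.norm(src_vecs - vec, axis=1)  # Euclidean distances
        total += wgt * dists.min()                      # nearest source word
    return total / sum(tgt_weights)

# Toy "embeddings": two source words and two target words in 2-D
src = np.array([[0.0, 0.0], [1.0, 0.0]])
tgt = np.array([[0.0, 1.0], [3.0, 0.0]])
print(rwmd(src, tgt, [1, 1]))   # → 1.5
```

Because each target word is handled independently, RWMD needs only a nearest-neighbor search per word instead of solving a transportation problem, which is what makes it cheaper than full WMD.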

Now let's have a look at the flow values.

```proc print data=flowdata; run;quit;```

The WMD method not only measures document similarity, but also explains why the two documents are similar by visualizing the flow data.

Besides document similarity, WMD is also useful for measuring sentence similarity. Check out this article about sentence similarity and try it with SAS.

How to calculate Word Mover's Distance with SAS was published on SAS Users.

SAS Visual Text Analytics provides dictionary-based, non-domain-specific tokenization functionality for Chinese documents; however, sometimes you still want N-gram tokens. This is especially helpful when the documents are domain-specific and most of the tokens are not included in the SAS-provided Chinese dictionary.

### What is an N-gram?

An N-gram is a sequence of N items from a given text with n representing any positive integer starting from 1. When n is 1, it refers to a unigram; when n is 2, it refers to a bigram; when n is 3, it refers to a trigram. For example, suppose we have a text in Chinese "我爱中国。", which means "I love China." Its N-gram sequence looks like the following:

| n Size | N-gram Sequence |
| --- | --- |
| 1 | [我], [爱], [中], [国], [。] |
| 2 | [我爱], [爱中], [中国], [国。] |
| 3 | [我爱中], [爱中国], [中国。] |

### How many N-gram tokens are in a given sentence?

If Token_Count_of_Sentence is the number of tokens in a given sentence, then the number of N-grams is:

`Count of N-grams = Token_Count_of_Sentence – ( n - 1 )`

The following table shows the N-gram token count of "我爱中国。" with different n sizes.

| n Size | N-gram Sequence | Token Count |
| --- | --- | --- |
| 1 | [我], [爱], [中], [国], [。] | 5 = 5 - (1-1) |
| 2 | [我爱], [爱中], [中国], [国。] | 4 = 5 - (2-1) |
| 3 | [我爱中], [爱中国], [中国。] | 3 = 5 - (3-1) |
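The counts above are easy to verify with a few lines of code. This short Python sketch (an illustration only; the SAS DATA step version follows later) generates character n-grams and checks the count formula:

```python
def char_ngrams(text, n):
    """All character n-grams of a string, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

s = "我爱中国。"
for n in (1, 2, 3):
    grams = char_ngrams(s, n)
    # Count of N-grams = Token_Count_of_Sentence - (n - 1)
    assert len(grams) == len(s) - (n - 1)

print(char_ngrams(s, 2))   # → ['我爱', '爱中', '中国', '国。']
```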

In real natural language processing (NLP) tasks, we often want to get unigrams, bigrams, and trigrams together when we set N to 3. Similarly, when we set N to 4, we want unigrams, bigrams, trigrams, and four-grams together.

N-gram theory is very simple, and under some conditions it has a big advantage over dictionary-based tokenization methods, especially when the corpus you are working on has many out-of-dictionary words, or when you don't have a dictionary at all.

### How to get N-grams with SAS?

SAS is a powerful programming language when you manipulate data. Below you'll find a program I wrote, using the DATA step to get N-grams.

```
data data_test;
   infile cards dlm='|' missover;
   input _document_ text :$100.;
   cards;
1|我爱中国。
;
run;

data NGRAMS;
   set data_test;
   _tmpStr_ = text;
   do while (klength(_tmpStr_) > 0);
      _maxN_ = min(klength(_tmpStr_), 3);
      do _i_ = 1 to _maxN_;
         _term_ = ksubstr(_tmpStr_, 1, _i_);
         output;
      end;
      if klength(_tmpStr_) > 1 then _tmpStr_ = ksubstr(_tmpStr_, 2);
      else _tmpStr_ = '';
   end;
   keep _document_ _term_ _i_;
run;
```

Let's see the SAS results.

```
proc sort data=NGRAMS;
   by _document_ _i_;
run;

proc print;
run;
```

N-gram tokenization is the first step of many NLP tasks, and the second step is typically to calculate the term frequency–inverse document frequency (TF-IDF). Here's the approach:

`tfidf(t,d,D) = tf(t,d) * idf(t,D)`
`idf(t,D) = log_e(total number of documents / number of documents that contain term t)`

Where t denotes the terms; d denotes each document; D denotes the collection of documents.
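To make the two formulas concrete, here is a minimal Python sketch with a toy two-document corpus (my own illustration, using raw term counts and the natural log as in the IDF formula above):

```python
import math
from collections import Counter

# Toy tokenized corpus: document id -> list of terms
docs = {1: ["我", "爱", "中"], 2: ["我", "是", "中"]}

def tfidf(term, doc_id, docs):
    tf = Counter(docs[doc_id])[term]                  # raw term frequency in d
    df = sum(term in toks for toks in docs.values())  # documents containing t
    idf = math.log(len(docs) / df)                    # log_e, per the formula
    return tf * idf

print(tfidf("爱", 1, docs))   # → log_e(2) ≈ 0.693
```

Note that a term appearing in every document (such as "我" here) gets an IDF of log(1) = 0, so TF-IDF down-weights terms that carry no discriminating power.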

Suppose that you need to process lots of documents; let me show you how to do it using SAS Viya. I used these four steps.

#### Step 1: Start CAS Server and create a CAS library.

```
cas casauto host="host.example.com" port=5570;
libname mycas cas;
```

#### Step 2: Load your data into CAS.

Here, to simplify the code, I only used 3 sentences for demo purposes.

```
data mycas.data_test;
   infile cards dlm='|' missover;
   input _document_ text :$100.;
   cards;
1|我爱中国。
2|我是中国人。
3|我是山西人。
;
run;
```

Once the data is loaded into CAS, you may run the following code to check the column information and record count of your corpus.

```
proc cas;
   table.columnInfo / table="data_test";
run;

   table.recordCount / table="data_test";
run;
quit;
```

#### Step 3: Tokenize texts into N-grams

```
%macro TextToNgram(dsin=, docvar=, textvar=, N=, dsout=);
proc cas;
   loadactionset "dataStep";
   dscode =
      "data &dsout;
          set &dsin;
          length _term_ varchar(&N);
          _tmpStr_ = &textvar;
          do while (klength(_tmpStr_) > 0);
             _maxN_ = min(klength(_tmpStr_), &N);
             do _i_ = 1 to _maxN_;
                _term_ = ksubstr(_tmpStr_, 1, _i_);
                output;
             end;
             if klength(_tmpStr_) > 1 then _tmpStr_ = ksubstr(_tmpStr_, 2);
             else _tmpStr_ = '';
          end;
          keep &docvar _term_;
       run;";
   runCode code = dscode;
run;
quit;
%mend TextToNgram;

%TextToNgram(dsin=data_test, docvar=_document_, textvar=text, N=3, dsout=NGRAMS);
```

#### Step 4: Calculate TF-IDF.

```
%macro NgramTfidfCount(dsin=, docvar=, termvar=, dsout=);
proc cas;
   simple.groupBy /
      table={name="&dsin"}
      inputs={"&docvar", "&termvar"}
      aggregator="n"
      casout={name="NGRAMS_Count", replace=true};
run;
quit;

proc cas;
   simple.groupBy /
      table={name="&dsin"}
      inputs={"&docvar", "&termvar"}
      casout={name="term_doc_nodup", replace=true};
run;

   simple.groupBy /
      table={name="term_doc_nodup"}
      inputs={"&docvar"}
      casout={name="doc_nodup", replace=true};
run;

   numRows result=r / table={name="doc_nodup"};
   totalDocs = r.numrows;
run;

   simple.groupBy /
      table={name="term_doc_nodup"}
      inputs={"&termvar"}
      aggregator="n"
      casout={name="term_numdocs", replace=true};
run;

   mergePgm =
      "data &dsout;" ||
      "merge NGRAMS_Count(keep=&docvar &termvar _score_ rename=(_score_=tf))" ||
      "      term_numdocs(keep=&termvar _score_ rename=(_score_=numDocs));" ||
      "by &termvar;" ||
      "idf=log(" || totalDocs || "/numDocs);" ||
      "tfidf=tf*idf;" ||
      "run;";
   print mergePgm;
   dataStep.runCode / code=mergePgm;
run;
quit;
%mend NgramTfidfCount;

%NgramTfidfCount(dsin=NGRAMS, docvar=_document_, termvar=_term_, dsout=NGRAMS_TFIDF);
```

Now let's see the TFIDF result of the first sentence.

```
proc print data=mycas.NGRAMS_TFIDF;
   where _document_=1;
run;
```

This N-gram method is not designed only for Chinese documents; documents in any language can be tokenized the same way. However, the tokenization granularity for English documents differs from Chinese documents: it is word-based rather than character-based. To handle English documents, you only need to make small changes to my code.
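To illustrate the word-based variant, here is a small Python sketch (my own illustration, not the SAS code): split the text into word tokens first, then slide the same n-sized window over the tokens instead of the characters.

```python
def word_ngrams(text, n):
    """Word-based n-grams: tokenize on whitespace first, then
    slide an n-sized window over the tokens."""
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

print(word_ngrams("I love China", 2))   # → ['I love', 'love China']
```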

How to get N-grams and TF-IDF count from Chinese documents was published on SAS Users.

Community detection has been used in many fields, such as social networks, biological networks, telecommunication networks, and fraud/terrorist detection. Traditionally, it is performed on an entity link graph in which the vertices represent entities and the edges indicate links between pairs of entities; its goal is to partition the graph into communities such that the links within the community subgraphs are more densely connected than the links between communities. Finding communities within an arbitrary network can be a computationally difficult task. The number of communities, if any, within the network is typically unknown, and the communities are often of unequal size and/or density. Despite these difficulties, however, several methods for community finding have been developed and employed with varying levels of success.[1] SAS implements the most popular graph and network analysis algorithms, as well as SAS-proprietary ones, and integrates them with other powerful analytics in SAS Social Network Analysis. SAS graph algorithms are packaged into PROC OPTGRAPH, with which you can detect communities in a network graph.

In text analytics, researchers have explored applying community detection to textual interaction data and have shown its effectiveness on, for example, co-authorship networks, textual-interaction networks, and social-tag networks.[2] In this post, I would like to show you how to cluster papers based on the keyword link graph using community detection.

Following the steps that I introduced in my previous blog, you can get paper keywords with SAS Text Analytics. Once you have the paper keywords, you need to go through the following three steps.

### Step 1: Build the network.

The network structure depends on your analysis purpose and data. Take the paper-keyword relationships, for example: there are two ways to construct the network. The first method uses papers as vertices, and a link occurs only when two papers share a keyword or keyword token. The second method treats papers, keywords, and keyword tokens as vertices, and links exist only in paper-keyword pairs or paper-keyword_token pairs; there is no direct link between papers.

I compared the community detection results of the two methods, and finally I chose the second method because its result is more reasonable. So in this article, I will focus on the second method only.

In addition, SAS supports weighted graph analysis, and in my experiment I used term frequency as the weight. For example, the keywords of paper 114 are "boosting; neural network; word embedding". After parsing the paper keywords with SAS Text Analytics, we get 6 terms: "boost", "neural network", "neural", "network", "word", and "embedding". Here I turned on the stemming and noun-group options and honored the SAS stop list for English. Table-1 shows the network data of this paper.

In the text parsing step, I set term weight and cell weight to none, because community detection depends on link density, and raw term frequencies are more effective than weighted values in this tiny dataset. As Table-1 shows, the term frequencies are small too, so there is no need to apply a log transform for cell weight.

### Step 2: Run community detection to cluster papers and output the detection result.

There are two types of network graphs: directed and undirected. In paper-keyword relationships, the direction from paper to keyword or vice versa does not make a difference, so I chose an undirected graph. PROC OPTGRAPH implements three heuristic algorithms for finding communities: the Louvain algorithm proposed in Blondel et al. (2008), the label propagation algorithm proposed in Raghavan, Albert, and Kumara (2007), and the parallel label propagation algorithm developed by SAS (patent pending). The Louvain algorithm aims to optimize modularity, which is one of the most popular merit functions for community detection. Modularity is a measure of the quality of a division of a graph into communities: the fraction of the links that fall within the communities minus the expected fraction if the links were distributed at random, assuming that you do not change the degree of each node.[3] In my experiment, I used Louvain.
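To make the modularity definition concrete, here is a small pure-Python sketch (an illustration of the formula, not the PROC OPTGRAPH implementation) evaluated on a toy graph of two triangles joined by a single bridge edge:

```python
def modularity(edges, community):
    """Modularity Q: fraction of links inside communities minus the
    expected fraction under degree-preserving random rewiring."""
    m = len(edges)                       # number of links
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    # Fraction of links that fall within communities
    inside = sum(community[a] == community[b] for a, b in edges) / m
    # Expected fraction: sum over same-community node pairs of
    # deg(u)*deg(v) / (2m)^2
    expected = sum(
        deg[u] * deg[v]
        for u in deg for v in deg
        if community[u] == community[v]
    ) / (2 * m) ** 2
    return inside - expected

# Two triangles joined by one bridge edge: a natural two-community split
edges = [("a","b"), ("b","c"), ("a","c"),
         ("d","e"), ("e","f"), ("d","f"), ("c","d")]
comm = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(edges, comm), 3))   # → 0.357
```

Six of the seven links fall inside the two communities, while random rewiring would place only half of them there, so this split scores well; Louvain greedily searches for the partition that maximizes exactly this quantity.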

Besides the algorithm, you also need to set the resolution value. A larger resolution value produces more communities, each containing a smaller number of nodes. I tried three resolution values (0.5, 1, and 1.5) and finally settled on 1 because the topic of each community was most reasonable. With these settings, I got 18 communities.

### Step 3: Explore communities visually and get insights.

Once you have the community detection result, you may use the Network Diagram of SAS Visual Analytics to visually explore the communities and understand their topics or characteristics.

Take the largest community as an example: there are 14 papers in this community. Nodes annotated with numbers are papers; the others are keyword tokens. Node size is determined by the sum of link weights (term frequency), and node color is decided by community value. From Figure-1, you can easily find its topic: sentiment, which is the largest of all the keyword nodes. After going through the conference program, I found that they are all papers of the IALP 2016 shared task, which targeted predicting valence-arousal ratings of Chinese affective words.

Figure-1 Network Diagram of Papers in Community 0

Another example is community 8, and its topic terms are annotation and information.

Figure-2 Network Diagram of Papers in Community 8

At the same time, the keywords were also clustered, and the keyword communities may be used in your search engine to improve keyword-based recommendations, or to improve search performance by retrieving more relevant documents. I extracted the keywords (noun groups only) of the top 5 communities and displayed them with SAS Visual Analytics. The top 3 keywords of community 0 are sentiment analysis, affective computing, and affective lexicon, which are semantically very close. If you have more data, you may get better results than mine.

Figure-3 Keyword frequency chart of the top 5 communities

If you are interested in this analysis, why not try it with your own data? The SAS scripts for clustering papers are below.

```
* Step 1: Build the paper-keyword network;
proc hptmine data=outlib.paper_keywords;
   doc_id documentName;
   var keywords;
   parse notagging termwgt=none cellwgt=none
         stop=sashelp.engstop
         outterms=terms
         outparent=parent
         reducef=1;
run; quit;

proc sql;
   create table outlib.paper_keyword_network as
   select _document_ as from, term as to, _count_ as weight
   from parent
   left join terms
      on parent._termnum_ = terms.key
   where parent eq .;
quit;

* Step 2: Run community detection to cluster papers;
* NOTE: huge network = low resolution level;
proc optgraph
   loglevel = 1
   direction = undirected
   graph_internal_format = thin
   data_links = outlib.paper_keyword_network
   out_nodes = nodes;

   community
      loglevel = 1
      maxiter = 20
      link_removal_ratio = 0
      resolution_list = 1;
run;
quit;

proc sql;
   create table paper_community as
   select distinct Paper_keywords.documentName, keywords, community_1 as community
   from outlib.Paper_keywords
   left join nodes
      on nodes.node = Paper_keywords.documentName
   order by community_1, documentName;
quit;

* Step 3: Merge network and community data for VA exploration;
proc sql;
   create table outlib.paper_community_va as
   select paper_keyword_network.*, community
   from outlib.paper_keyword_network
   left join paper_community
      on paper_keyword_network.from = paper_community.documentName;
quit;

* Step 4: Keyword communities;
proc sql;
   create table keyword_community as
   select *
   from nodes
   where node not in (select documentName from outlib.paper_keywords)
   order by community_1, node;
quit;

proc sql;
   create table outlib.keyword_community_va as
   select keyword_community.*, freq
   from keyword_community
   left join terms
      on keyword_community.node = terms.term
   where parent eq . and role eq 'NOUN_GROUP'
   order by community_1, freq desc;
quit;
```

### References

Clustering of papers using Community Detection was published on SAS Users.


In 2011, Loughran and McDonald applied a general sentiment word list to accounting and finance topics, and this led to a high rate of misclassification. They found that about three-fourths of the negative words in the Harvard IV TagNeg dictionary of negative words are typically not negative in a financial context. For example, words like “mine”, “cancer”, “tire” or “capital” are often used to refer to a specific industry segment. These words are not predictive of the tone of documents or of financial news and simply add noise to the measurement of sentiment and attenuate its predictive value. So, it is not recommended to use any general sentiment dictionary as is.

Extracting domain-specific sentiment lexicons in the traditional way is time-consuming and often requires domain expertise. Today, I will show you how to extract a domain-specific sentiment lexicon from movie reviews through a machine learning method and construct SAS sentiment rules with the extracted lexicon to improve sentiment classification performance. I did the experiment with the help from my colleagues Meilan Ji and Teresa Jade, and our experiment with the Stanford Large Movie Review Dataset showed around 8% increase in the overall accuracy with the extracted lexicon. Our experiment also showed that the lexicon coverage and accuracy could be improved a lot with more training data.

SAS Sentiment Analysis Studio released domain-independent Taxonomy Rules for 12 languages and domain-specific Taxonomy Rules for a few languages. For English, SAS has covered 12 domains, including Automotive, Banking, Health and Life Sciences, Hospitalities, Insurance, Telecommunications, Public Policy Countries and others. If the domain of your corpus is not covered by these industry rules, your first choice is to use general rules, which sometimes lead to poor classification performance, as Loughran and McDonald found. Automatically extracting domain-specific sentiment lexicons has been studied by researchers and three methods were proposed. The first method is to create a domain-specific word list by linguistic experts or domain experts, which may be expensive or time-consuming. The second method is to derive non-English lexicons based on English lexicons and other linguistic resources such as WordNet. The last method is to leverage machine learning to learn lexicons from a domain-specific corpus. This article will show you the third method.

Because of the emergence of social media, researchers can relatively easily obtain sentiment data from the internet for experiments. Dr. Saif Mohammad, a researcher in Computational Linguistics at the National Research Council Canada, proposed a method to automatically extract sentiment lexicons from tweets. His method achieved the best results in SemEval-2013 by leveraging emoticons in a large collection of tweets, using the PMI (pointwise mutual information) between words and tweet sentiment to define the sentiment attributes of words. It is a simple method, but quite powerful. At the ACL 2016 conference, one paper introduced how to use neural networks to learn sentiment scores, and in that paper I found the following simplified formula for calculating a sentiment score.

Given a set of tweets with their labels, the sentiment score (SS) for a word w was computed as:
SS(w) = PMI(w, pos) − PMI(w, neg), (1)

where pos represents the positive label and neg represents the negative label. PMI stands for pointwise mutual information, which is
PMI(w, pos) = log2((freq(w, pos) * N) / (freq(w) * freq(pos))), (2)

Here freq(w, pos) is the number of times the word w occurs in positive tweets, freq(w) is the total frequency of word w in the corpus, freq(pos) is the total number of words in positive tweets, and N is the total number of words in the corpus. PMI(w, neg) is calculated in a similar way. Thus, Equation 1 is equal to:
SS(w) = log2((freq(w, pos) * freq(neg)) / (freq(w, neg) * freq(pos))), (3)
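To make Equation 3 concrete, here is a minimal Python sketch (the original computation was done in a SAS program not shown here; the counts below are hypothetical):

```python
import math

def sentiment_score(freq_w_pos, freq_w_neg, freq_pos, freq_neg):
    """Equation 3: SS(w) = log2((freq(w,pos) * freq(neg)) / (freq(w,neg) * freq(pos)))."""
    return math.log2((freq_w_pos * freq_neg) / (freq_w_neg * freq_pos))

# Hypothetical counts: "great" occurs 40 times in positive reviews and 10 times
# in negative reviews; both sub-corpora contain 100,000 words in total.
score = sentiment_score(40, 10, 100_000, 100_000)
# log2(40/10) = 2, so "great" leans positive
```

Note that freq(w) and N cancel out between the two PMI terms, which is why Equation 3 needs only four counts; in practice a small additive smoothing constant is usually applied so that a word absent from one class does not cause a division by zero.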

The movie review data I used was downloaded from Stanford; it is a collection of 50,000 reviews from IMDB. I used 25,000 reviews each for the train and test datasets. The constructed dataset contains equal numbers of positive and negative reviews. I used SAS Text Miner to parse the reviews into tokens and wrote a SAS program to calculate sentiment scores.

In my experiment, I used the train dataset to extract the sentiment lexicon and the test dataset to evaluate sentiment classification performance at each sentiment score cutoff value from 0 to 2, in increments of 0.25. Data-driven learning methods frequently suffer from overfitting, so I used the test data to filter out all weakly predictive words whose absolute sentiment score is less than 0.75. In Figure-1, there is an obvious drop in the accuracy line plot of the test data when the cutoff value is less than 0.75.
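The cutoff filtering step itself is straightforward; a sketch in Python, with hypothetical words and scores (0.75 is the threshold chosen above):

```python
CUTOFF = 0.75

# Hypothetical extracted lexicon: word -> sentiment score
lexicon = {"great": 1.8, "awful": -2.1, "coherent": -0.3, "movie": 0.1}

# Keep only strongly predictive words, i.e. |score| >= cutoff
strong_lexicon = {w: s for w, s in lexicon.items() if abs(s) >= CUTOFF}
# -> {"great": 1.8, "awful": -2.1}
```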

Figure-1 Sentiment Classification Accuracy by Sentiment Score Cutoff

Finally, I got a large list of 14,397 affective words from the movie reviews: 7,850 positive words and 6,547 negative words. Figure-2 shows the top 50 lexical items from each sentiment category.

Figure-2 Sentiment Score of Top 50 Lexical Items

Now I have automatically derived the sentiment lexicon, but how accurate is it, and how can its accuracy be evaluated? I googled movie vocabulary and got two lists, Useful Adjectives for Describing Movies and Words for Movies & TV, with 329 adjectives categorized as positive or negative. 279 of these adjectives have vector data in the GloVe word embedding model downloaded from http://nlp.stanford.edu/projects/glove/, and Figure-3 shows their T-SNE plot. GloVe is an unsupervised learning algorithm for obtaining vector representations of words; training is performed on aggregated global word-word co-occurrence statistics from a corpus. T-SNE is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten.[1] It is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. Two words are located close together in the scatter plot if their semantic meanings are close or they frequently co-occur in the same contexts. Besides semantic closeness, I also show sentiment polarity via color: red stands for negative and blue stands for positive.

Figure-3 T-SNE Plot of Movie Vocabularies

From Figure-3, I find that the positive and negative vocabularies are clearly separated into two big clusters, with very little overlap in the plot.

Now, let me check the sentiment scores of these terms. 175 of the 279 adjectives were included in my result, and Figure-4 displays the top 50 terms of each category. I compared the sentiment polarity of my result with the list: 168 of the 175 terms are correctly labelled as negative or positive, for an overall accuracy of 96%.

Figure-4 Sentiment Score of Top 50 Movie Terms

There are 7 polarity differences between my predictions and the list, as Table-1 shows.

Table-1 Sentiment Polarity Difference between Predictions and Actual Labels

One obvious prediction mistake is “coherent”. I checked the raw movie reviews that contain “coherent”, and only 25 of 103 reviews are positive, which is why its sentiment score is negative rather than positive. I went through these reviews and found that most of them contain a sentiment polarity reversal, such as “The Plot - Even for a Seagal film, the plot is just stupid. I mean it’s not just bad, it’s barely coherent. …” A possible solution to make the sentiment scores more accurate is to use more data or to add special handling for polarity reversals. I tried the first method, and it did improve the accuracy significantly.

So far, I have evaluated the sentiment scores’ accuracy with public linguistic resources; next, I will test the prediction effect with SAS Sentiment Analysis Studio. I ran sentiment analysis against the test data with the domain-independent sentiment rules developed by SAS and with the domain-specific sentiment rules constructed by machine learning, and compared the performance of the two methods. The results showed an 8% increase in overall accuracy. Table-2 and Table-3 show the detailed information.

Test data (25,000 docs)

Table-2 Performance Comparison with Test Data

Table-3 Overall Performance Comparison with Test Data

After you get domain-specific sentiment lexicons from your corpora, only a few steps are required in SAS Sentiment Analysis Studio to construct the domain-specific sentiment rules. So, next time you are processing domain-specific text for sentiment, you may want to try this method to get a list of terms with positive or negative polarity to augment your SAS domain-independent model.

The detailed steps to construct domain-specific sentiment rules are as follows.

Step 1. Create a new Sentiment Analysis project.
Step 2. Create intermediate entities named “Positive” and “Negative”, then add the learned lexicons to the two entities respectively.

Step 3. Besides the learned lexicons, you may add an entity named “Negation” to handle negated expressions. You can list negations you are familiar with, such as “not”, “don’t”, “can’t”, etc.

Step 4. Create positive and negative rules under Tonal Keyword. Add the rule “CONCEPT: _def{Positive}” to the Positive tab, and the rules “CONCEPT: _def{Negative}” and “CONCEPT: _def{Negation} _def{Positive}” to the Negative tab.

Step 5. Build the rule-based model; now you can use this model to predict the sentiment of documents.
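As a rough illustration (this is a Python toy, not SAS Sentiment Analysis Studio itself), the following sketch mirrors the rule logic above: a lexicon-matched positive term counts as positive unless it is preceded by a negation, in which case the “Negation Positive” pattern counts as negative. The three-word lexicons are made up.

```python
POSITIVE = {"brilliant", "gripping", "coherent"}      # hypothetical learned lexicon
NEGATIVE = {"dull", "stupid", "incoherent"}
NEGATIONS = {"not", "don't", "can't", "barely"}

def classify(text):
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in NEGATIONS
        if tok in POSITIVE:
            # mirrors "CONCEPT: _def{Negation} _def{Positive}" -> negative
            score += -1 if negated else 1
        elif tok in NEGATIVE:
            score -= 1
    return "positive" if score > 0 else "negative"

print(classify("a gripping and brilliant film"))  # positive
print(classify("the plot is barely coherent"))    # negative
```

The second example shows why negation handling matters: without the “Negation Positive” rule, “barely coherent” would be scored as positive.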

How to extract domain-specific sentiment lexicons was published on SAS Users.

Recently a colleague told me Google had published interesting new data sets on BigQuery. I found a lot of Reddit data there as well, so I quickly tried running BigQuery against these text data to see what I could produce. After getting some pretty interesting results, I wanted to see if I could implement the same analysis with SAS, and whether SAS Text Mining would yield deeper insights than simple queries. So, I tried SAS on the Reddit comments data, and I’d like to share my analyses and findings with you.

### Analysis 1: Significant Words

To get started with BigQuery, I googled what others were sharing about BigQuery and Reddit, and I found USING BIGQUERY WITH REDDIT DATA. In this article the author posted a query statement for extracting significant words from the Politics subreddit. I then wrote a SAS program to mimic this query, and I got the following data from the July Reddit comments. The result is not exactly the same as the one from BigQuery, since I downloaded the Reddit data from another website and used the SAS Text Parsing action to parse the comments into tokens rather than just splitting tokens on white space.
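The exact scoring used by the original query is not reproduced here, but a common way to surface “significant” words for one subreddit is to compare each word’s relative frequency there against the whole corpus. A hedged Python sketch with made-up counts (the lift-style score and the word list are illustrative assumptions, not the author’s SAS program):

```python
def significant_words(sub_counts, all_counts, top_n=3):
    """Rank words by the ratio of their relative frequency in the subreddit
    to their relative frequency in the full corpus (a simple lift score)."""
    sub_total = sum(sub_counts.values())
    all_total = sum(all_counts.values())
    scores = {
        w: (c / sub_total) / (all_counts[w] / all_total)
        for w, c in sub_counts.items() if w in all_counts
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical token counts for the Politics subreddit vs. all of Reddit
politics = {"trump": 900, "hillary": 700, "the": 5000, "game": 10}
corpus   = {"trump": 1000, "hillary": 800, "the": 100000, "game": 5000}
print(significant_words(politics, corpus))
```

Common function words like “the” score near 1 under this metric, while words concentrated in the subreddit float to the top.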

### Analysis 2: Daily Submissions

The words Trump and Hillary in the list piqued my interest and begged for further analysis. So, I did a daily analysis to understand how hot Trump and Hillary were during this month. I filtered all comments mentioning Trump or Hillary in the Politics subreddit and counted total submissions per day. The resulting time series plot is shown below.

I found several spikes in the plot, which happened on 2016/7/5, 2016/7/12, 2016/7/21, and 2016/7/26.

### Analysis 3: Topics Time Line

I wondered what Reddit users were concerned about on these specific days, so I extracted the top 10 topics from all comments submitted in July 2016 in the Politics subreddit and got the following data. These topics obviously focused on several aspects, such as voting, presidential candidates, parties, and hot news such as Hillary’s email probe.

The topics showed what people were concerned about over the whole month, but further investigation was needed to explain which topic contributed most to each of the four spikes. The topics’ time series plots helped me find the answer.

Some topics’ time series trends are very close, and it is hard to determine which topic contributed most, so I identified the top contributing topic based on daily percentage growth. The top growth topic on July 05 is “emails, dnc, +server, hillary, +classify”, with a growth factor of 256.23.
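The day-over-day growth selection can be sketched like this in Python (the topic names and counts below are made up for illustration, not the real Reddit figures):

```python
def top_growth_topic(daily_counts, day, prev_day):
    """daily_counts: {topic: {date: submission count}}.
    Returns the topic with the highest day-over-day growth factor."""
    best_topic, best_growth = None, 0.0
    for topic, series in daily_counts.items():
        prev = series.get(prev_day, 0)
        cur = series.get(day, 0)
        if prev > 0 and cur / prev > best_growth:
            best_topic, best_growth = topic, cur / prev
    return best_topic, best_growth

# Hypothetical counts for two topics around July 05
daily = {
    "emails, dnc, +server": {"2016-07-04": 12,  "2016-07-05": 3000},
    "convention, nominee":  {"2016-07-04": 200, "2016-07-05": 260},
}
topic, growth = top_growth_topic(daily, "2016-07-05", "2016-07-04")
# topic -> "emails, dnc, +server", growth -> 250.0
```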

Its time series plot also shows a high spike on July 05. Then, I googled “July 5, 2016 emails dnc server hillary classify” and got the following news.

There is no doubt the spike on July 05 is related to the FBI’s decision about Clinton’s email probe. To confirm this, I extracted the top 20 Reddit comments submitted on July 05 according to their Reddit scores. I quote part of the top comment below; the link in the comment was among Google’s search results.

"...under normal circumstances, security clearances would be revoked. " This is your FBI. EDIT: I took paraphrased quote, this is the actual quote as per https://www.fbi.gov/news/pressrel/press-releases/statement-by-fbi-director-james-b.-comey-on-the-investigation-of-secretary-hillary-clintons-use-of-a-personal-e-mail-system - "

Similar analysis was done on the other three days, and the hot topics are as follows.

Interestingly, one person did a sentiment analysis with Twitter data, and the tweet submission trend for July looks much the same as Reddit’s.

And in this blog, he listed several important events that happened in July.

• July 5th: the FBI says it’s ending Clinton’s email probe and will not recommend prosecution.
• July 12th: Bernie Sanders endorses Hillary Clinton for president.
• July 21st: Donald Trump accepts the Republican nomination.
• July 25-28: Clinton accepts the nomination at the DNC.

This showcases that different social media platforms have similar response trends to the same events.

Now I know why these spikes happened. However, more questions came to my mind.

• Who started posting these news?
• Were there cyber armies?
• Who were opinion leaders in the politics community?

I believe all these questions can be answered by analyzing the data with SAS.

Analyzing Trump v. Clinton text data at Reddit was published on SAS Users.

In my last blog, I showed you how to generate a word cloud from PDF collections. Word clouds show which terms your documents mention and how frequently they occur. However, word clouds cannot lay out words from a semantic or linguistic perspective. In this blog, I’d like to show you how we can overcome this constraint with newer methods.

Word embedding has been widely used in Natural Language Processing, where words or phrases from the vocabulary are mapped to vectors of real numbers. There are several open source tools that can be used to build word embedding models. Two of the most popular tools are word2vec and GloVe, and in my experiment I used GloVe. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Suppose you have obtained the term frequencies from documents with SAS Text Miner and downloaded the word embedding model from http://nlp.stanford.edu/projects/glove/. Next you can extract vectors of terms using PROC SQL.

```
libname outlib 'D:\temp';

* Rank terms according to frequencies;
proc sort data=outlib.abstract_stem_freq;
   by descending freq;
run;

data ranking;
   set outlib.abstract_stem_freq;
   ranking=_n_;
run;

* Read the GloVe vectors (tab-delimited, 100 dimensions, header row skipped);
data glove;
   infile "d:\temp\glove_100d_tab.txt" dlm="09"x firstobs=2;
   input term :$100. vector1-vector100;
run;

* Join the vectors with the term frequency ranking;
proc sql;
   create table outlib.abstract_stem_vector as
   select glove.*, ranking, freq
   from glove, ranking
   where glove.term = ranking.word;
quit;
```

Now you have the term vectors, and there are two ways to project the 100-dimensional vector data into two dimensions. One is SAS PROC MDS (multidimensional scaling) and the other is T-SNE (t-distributed stochastic neighbor embedding). T-SNE is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. It is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.

Let’s try SAS PROC MDS first. Before running PROC MDS, you need to run PROC DISTANCE to calculate the distance or similarity of each pair of words. According to the GloVe website, the Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. Sometimes, the nearest neighbors according to this metric reveal rare but relevant words that lie outside an average human’s vocabulary. I used Euclidean distance in my experiment and show the top 10 words on the scatter plot in Figure-1.

```
* Pairwise Euclidean distances between term vectors;
proc distance data=outlib.abstract_stem_vector method=EUCLID out=distances;
   var interval(vector1-vector100);
   id term;
run;

ods graphics off;
proc mds data=distances level=absolute out=outdim;
   id term;
run;

* Label only the top 10 most frequent terms;
data top10;
   set outlib.abstract_stem_vector;
   if ranking le 10 then label=term;
   else label='';
   drop ranking freq;
run;

proc sql;
   create table mds_plot as
   select outdim.*, label
   from outdim
   left join top10
   on outdim.term = top10.term;
quit;

ods graphics on;
proc sgplot data=mds_plot;
   scatter x=dim1 y=dim2
      / datalabel=label
        markerattrs=(symbol=plus size=5)
        datalabelattrs=(family="Arial" size=10pt color=red);
run;
```
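Outside SAS, the nearest-neighbor idea behind PROC DISTANCE can be sketched in a few lines of Python. The words and toy 3-dimensional vectors below are made up, standing in for the 100-dimensional GloVe vectors:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vectors = {  # hypothetical 3-d stand-ins for 100-d GloVe vectors
    "movie": [0.9, 0.1, 0.0],
    "film":  [0.8, 0.2, 0.1],
    "plot":  [0.1, 0.9, 0.3],
}

# Nearest neighbor of "movie" under Euclidean distance
nearest = min((w for w in vectors if w != "movie"),
              key=lambda w: euclidean(vectors["movie"], vectors[w]))
# nearest -> "film"
```

PROC MDS then takes the full matrix of such pairwise distances and searches for 2-d coordinates that preserve them as closely as possible.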

SAS does not have a T-SNE implementation, so I used PROC IML to call the Rtsne library in R. There are two R packages that can be used for T-SNE plots, tsne and Rtsne, and Rtsne is reportedly faster. Just as with the SAS MDS plot, I show only the top 10 words, but their font sizes vary according to their frequencies in the documents.

```
data top10;
   length label $ 20;
   length color 3;
   length font_size 8;
   set outlib.abstract_stem_vector;
   if ranking le 10 then label=term;
   else label='+';
   if ranking le 10 then color=1;
   else color=2;
   font_size = max(freq/25, 0.5);
   drop ranking freq;
run;

proc iml;
call ExportDataSetToR("top10", "vectors");

submit /R;
library(Rtsne)

set.seed(1) # for reproducibility
vector_matirx
```

If you compare the two figures, you may feel that the MDS plot is more symmetric than the T-SNE plot, but the T-SNE plot seems more reasonable from a linguistic/semantic perspective. In his blog, Colah explored and compared various dimensionality reduction techniques on the well-known computer vision dataset MNIST. I totally agree with his opinion below.

It’s easy to slip into a mindset of thinking one of these techniques is better than others, but I think they’re all complementary. There’s no way to map high-dimensional data into low dimensions and preserve all the structure. So, an approach must make trade-offs, sacrificing one property to preserve another. PCA tries to preserve linear structure, MDS tries to preserve global geometry, and t-SNE tries to preserve topology (neighborhood structure).