July 18, 2018
 

Last year, when I went through the SAS Global Forum 2017 paper list, the paper Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis impressed me a lot. In it, the author pointed out common problems caused by traditional sampling methods and proposed four sampling methods for textual data. Recently my team has been working on a project that involves a huge volume of documents from a specific field, and we need linguists and domain experts to analyze the textual data and annotate ground truth. Our first question was which documents to start with so that we could get a panoramic view of the data with minimum effort. Frankly, I don't have a state-of-the-art method for extracting representative documents and measuring its effect, so why not try this innovative technique?

The paper proposed four sampling methods; I have tried only the first one, which uses cluster memberships as strata. Before we step into the details of the SAS program, let me introduce the steps of this method.

  • Step 1: Parse textual data into tokens and calculate each term's TF-IDF value
  • Step 2: Generate term-by-document matrix
  • Step 3: Cluster documents through k-means algorithm
  • Step 4: Get top k terms of each cluster
  • Step 5: Do stratified sampling by cluster

I wrote a SAS macro for each step so that you can check the results step by step. If you are not satisfied with the final clustering result, you can tune the parameters of any step and re-run that step and the steps after it. Now let's see how to use SAS Viya to extract samples from a movie review data set.
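At a glance, the five macro calls chain together as follows; these are the same invocations, with the same parameter values, as in the complete program in the appendix:

/* Step 1: parse the text and compute TF-IDF weights */
%getTfidf(dsIn=movie_reviews, docVar=_document_, textVar=text, language=english,
   stemming=true, stopList=stopList, dsOut=doc_term_tfidf);
 
/* Step 2: build the term-by-document matrix */
%DocToVectors(dsIn=doc_term_tfidf, docVar=_document_, termVar=_term_,
   tfVar=_tfidf_, dimSize=3000, dsOut=doc_vectors);
 
/* Step 3: cluster the documents with k-means */
%clusterDocs(dsIn=doc_vectors, nClusters=50, seed=12345, dsOut=doc_clusters);
 
/* Step 4: get the top-k terms of each cluster */
%clusterProfile(termDS=doc_term_tfidf, clusterDS=doc_clusters, docVar=_document_,
   termVar=_term_, tfVar=_tfidf_, clusterVar=_cluster_id_, topk=10,
   dsOut=cluster_topk_terms);
 
/* Step 5: stratified sampling by cluster */
%strSampleByCluster(docDS=movie_reviews, docClusterDS=doc_clusters, docVar=_document_,
   clusterVar=_cluster_id_, seed=12345, sampPct=5, dsOut=doc_sample_by_cls);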

The movie review data has 11,855 observations and 200,963 tokens. After removing stop words, 18,976 terms remain. In this example, I set the dimension size of the term-by-document matrix to 3000, which means that I use the 3000 terms with the highest TF-IDF values across the document collection as its dimensions. Then I use k-means clustering to group the documents into K clusters, setting the maximum K to 50 with the kClus action in CAS. The dataSegment action can also cluster documents directly, but it cannot choose the best K; you would need to run it with several K values and pick the best one yourself. The kClus action, by contrast, chooses the best K automatically from the range defined by the minimum and maximum K, so I use kClus in my implementation.
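For reference, the getTfidf macro in the appendix computes each term's weight in the standard raw-count form of TF-IDF; this is exactly what its DATA step line _tfidf_ = _tf_*log(totalDocs/_NumDocs_) does:

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)}$$

where tf(t,d) is the number of occurrences of term t in document d, N is the total number of documents, df(t) is the number of documents that contain t, and the logarithm is natural (SAS's LOG function).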

After running the program (the complete code is at the end of this post), I got 39 clusters; Table-1 shows the top 10 terms of the first cluster.

Table-1 Top 10 terms of Cluster 1

Let's see what samples we get from the first cluster. I got 7 documents, each of which contains either the term "predictable" or the term "emotional."

Samples from cluster

I set sampPct to 5, which means that 5% of the documents are randomly selected from each cluster. In total, I got 582 sample documents. Let's check the sample distribution across the clusters.

Donut chart of cluster samples

This clustering-based method helped us intelligently select a small subset of documents from piles of document collections, and most importantly it saved us a lot of time and helped us hit the mark.

I haven't had a chance to try the other three sampling methods from the paper; I encourage you to try them and share your experiences with us. Big thanks to my colleague Murali Pagolu for sharing this innovative technique during the SAS Global Forum 2017 conference and for kindly providing me with some good suggestions.

Appendix: Complete code for text sampling

 
/*-------------------------------------------*/
/* Sampling textual data based on clustering */
/*-------------------------------------------*/
 
 
/*-------------------------------------*/
/* Get tfidf                           */
/*-------------------------------------*/
%macro getTfidf(
   dsIn=, 
   docVar=, 
   textVar=, 
   language=, 
   stemming=true, 
   stopList=, 
   dsOut=
);
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=false
   tagging=false
   noungroups=false
   entities="none"
   offset={name="tpparse_out",replace=TRUE}
;
run;
 
textparse.tpAccumulate /
   offset={name="tpparse_out"}
   stopList={name="&stopList"}
   termWeight="NONE"
   cellWeight="NONE"
   reduce=1
   parent={name="tpAccu_parent",replace=TRUE}
   terms={name="tpAccu_term",replace=TRUE}
   showdroppedterms=false
;
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_term_stat", replace=true} 
query="
      select tpAccu_parent.&docVar, 
             tpAccu_term._term_,
             tpAccu_parent._count_ as _tf_,
             tpAccu_term._NumDocs_
      from tpAccu_parent
      left join tpAccu_term
      on tpAccu_parent._Termnum_=tpAccu_term._Termnum_;
"
;
run;
 
simple.groupBy / 
   table={name="tpAccu_parent"}
   inputs={"&docVar"}
   casout={name="doc_nodup", replace=true};
run;
 
numRows result=r / 
   table={name="doc_nodup"};
totalDocs = r.numrows;
run;
 
datastep.runcode /
code = "
   data &dsOut;
      set doc_term_stat;"
   ||"_tfidf_ = _tf_*log("||totalDocs||"/_NumDocs_);"
   ||"run;
";
run;
quit;
 
proc cas;
   table.dropTable name="tpparse_out" quiet=true; run;
   table.dropTable name="tpAccu_parent" quiet=true; run;
   table.dropTable name="tpAccu_term" quiet=true; run;
   table.dropTable name="doc_nodup" quiet=true; run;
   table.dropTable name="doc_term_stat" quiet=true; run;
quit;
%mend getTfidf;
 
 
/*-------------------------------------*/
/* Term-by-document matrix             */
/*-------------------------------------*/
%macro DocToVectors(
   dsIn=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   dimSize=500, 
   dsOut=
);
proc cas;
simple.summary /
   table={name="&dsIn", groupBy={"&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="term_tf_sum", replace=true};
run;
 
simple.topk / 
   table={name="term_tf_sum"}  
   inputs={"&termVar"} 
   topk=&dimSize
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name='termnum_top', replace=true};
run;
 
loadactionset "fedsql";
execdirect casout={name="doc_top_terms", replace=true} 
query="
      select termnum.*, _rank_
      from &dsIn termnum, termnum_top
      where termnum.&termVar=termnum_top._Charvar_
        and &tfVar!=0;
"
;
run;
 
transpose.transpose /
   table={name="doc_top_terms", 
          groupby={"&docVar"}, 
          computedVars={{name="_name_"}},
          computedVarsProgram="_name_='_dim'||strip(_rank_)||'_';"}  
   transpose={"&tfVar"}
   casOut={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="term_tf_sum" quiet=true; run;
   table.dropTable name="termnum_top" quiet=true; run;
   table.dropTable name="termnum_top_misc" quiet=true; run;
   table.dropTable name="doc_top_terms" quiet=true; run;
quit;
%mend DocToVectors;
 
 
/*-------------------------------------*/
/* Cluster documents                   */
/*-------------------------------------*/
%macro clusterDocs(
   dsIn=, 
   nClusters=10,
   seed=12345,   
   dsOut=
);
proc cas;
/*get the vector variables list*/
columninfo result=collist /
   table={name="&dsIn"};
ndimen=dim(collist['columninfo']);
vector_columns={};
j=1;
do i=1 to ndimen;
   thisColumn = collist['columninfo'][i][1];
   if lowcase(substr(thisColumn, 1, 4))='_dim' then do;
      vector_columns[j]= thisColumn;
      j=j+1;
   end;
end;
run;
 
clustering.kClus / 
   table={name="&dsIn"},
   nClusters=&nClusters,
   init="RAND",
   seed=&seed,
   inputs=vector_columns,
   distance="EUCLIDEAN",
   printIter=false,
   impute="MEAN",
   standardize='STD',
   output={casOut={name="&dsOut", replace=true}, copyvars="ALL"}
;
run;
quit;
%mend clusterDocs;
 
 
/*-------------------------------------*/
/* Get top-k words of each cluster     */
/*-------------------------------------*/
%macro clusterProfile(
   termDS=, 
   clusterDS=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   clusterVar=_CLUSTER_ID_, 
   topk=10, 
   dsOut=
);
proc cas;
loadactionset "fedsql";
execdirect casout={name="cluster_terms",replace=true} 
query="
      select &termDS..*, &clusterVar
      from &termDS, &clusterDS
      where &termDS..&docVar = &clusterDS..&docVar;
"
;
run;
 
simple.summary /
   table={name="cluster_terms", groupBy={"&clusterVar", "&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="cluster_terms_sum", replace=true};
run;
 
simple.topk / 
   table={name="cluster_terms_sum", groupBy={"&clusterVar"}}  
   inputs={"&termVar"} 
   topk=&topk
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="cluster_terms" quiet=true; run;
   table.dropTable name="cluster_terms_sum" quiet=true; run;
quit;
%mend clusterProfile;
 
 
/*-------------------------------------*/
/* Stratified sampling by cluster      */
/*-------------------------------------*/
%macro strSampleByCluster(
   docDS=, 
   docClusterDS=, 
   docVar=, 
   clusterVar=_CLUSTER_ID_, 
   seed=12345,   
   sampPct=, 
   dsOut=
);
proc cas;
loadactionset "sampling";
stratified result=r /
   table={name="&docClusterDS", groupby={"&clusterVar"}}
   sampPct=&sampPct 
   partind="TRUE" 
   seed=&seed
   output={casout={name="sampling_out",replace="TRUE"},
                   copyvars={"&docVar", "&clusterVar"}};
run;
print r.STRAFreq; run;
 
loadactionset "fedsql";
execdirect casout={name="&dsOut", replace=true} 
query="
   select docDS.*, &clusterVar
   from &docDS docDS, sampling_out
   where docDS.&docVar=sampling_out.&docVar
     and _PartInd_=1;
"
;
run;
 
proc cas;
   table.dropTable name="sampling_out" quiet=true; run;
quit; 
%mend strSampleByCluster;
 
/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
caslib _all_ assign;
 
/*-------------------------------------*/
/* Prepare and load data.              */
/*-------------------------------------*/
%let myData=movie_reviews;
 
proc cas;
loadtable result=r / 
   importOptions={fileType="csv", delimiter='TAB',getnames="true"}
   path="data/movie_reviews.txt"
   casLib="CASUSER"
   casout={name="&myData", replace="true"} ;
run;
quit;
 
/* Browse the data */
proc cas;
   columninfo / table={name="&myData"};
   fetch / table = {name="&myData"};
run;
quit;
 
/* generate one unique index using data step */
proc cas;
datastep.runcode /
code = "
   data &myData;
      set &myData;
      rename id = _document_;
      keep id text score;  
   run;
";
run;
quit;
 
/* create stop list*/
data sascas1.stopList;
   set sashelp.engstop;
run;
 
/* Get tfidf by term by document */
%getTfidf(
   dsIn=&myData, 
   docVar=_document_, 
   textVar=text, 
   language=english, 
   stemming=true, 
   stopList=stopList, 
   dsOut=doc_term_tfidf
);
 
/* document-term matrix */
%DocToVectors(
   dsIn=doc_term_tfidf, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   dimSize=3000, 
   dsOut=doc_vectors
);
 
/* Cluster documents */
%clusterDocs(
   dsIn=doc_vectors, 
   nClusters=50, 
   seed=12345,   
   dsOut=doc_clusters
);
 
/* Get top-k words of each cluster */
%clusterProfile(
   termDS=doc_term_tfidf, 
   clusterDS=doc_clusters, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   clusterVar=_cluster_id_, 
   topk=10, 
   dsOut=cluster_topk_terms
);
 
/* Browse topk terms of the first cluster */
proc cas;
fetch / 
   table={name="cluster_topk_terms",
          where="_cluster_id_=1"};
run;
quit;
 
/* Stratified sampling by cluster      */
%strSampleByCluster(
   docDS=&myData, 
   docClusterDS=doc_clusters, 
   docVar=_document_, 
   clusterVar=_cluster_id_, 
   seed=12345,   
   sampPct=5,
   dsOut=doc_sample_by_cls
);
 
/* Browse sample documents of the first cluster */
proc cas;
fetch / 
   table={name="doc_sample_by_cls",
          where="_cluster_id_=1"};
run;
quit;

How to sample textual data with SAS was published on SAS Users.

July 18, 2018
 

This article shows how to implement balanced bootstrap sampling in SAS. The basic bootstrap samples with replacement from the original data (N observations) to obtain B new samples. This is called "uniform" resampling because each observation has a uniform probability of 1/N of being selected at each step of the resampling process. Within the union of the B bootstrap samples, each observation appears B times in expectation: each of the B samples makes N draws, and each draw selects a given observation with probability 1/N, so the expected count is B*N*(1/N) = B.

Balanced bootstrap resampling (Davison, Hinkley, and Schechtman, 1986) is an alternative process in which each observation appears exactly B times in the union of the B bootstrap samples of size N. This has some practical benefits for estimating certain inferential statistics such as the bias and quantiles of the sampling distribution (Hall, 1990).

It is easy to implement a balanced bootstrap resampling scheme: concatenate B copies of the data, randomly permute the B*N observations, and then use the first N observations for the first bootstrap sample, the next N for the second sample, and so forth. (Other algorithms are also possible, as discussed by Gleason, 1988.)

Balanced bootstrap samples in SAS

To illustrate the idea, consider the following data set that has N=6 observations. Five observations are clustered near x=0 and the sixth is a large outlier (x=10). The sample skewness for these data is skew=2.316 because of the influence of the outlier.

data Sample(keep=x);
input x @@;
datalines;
-1 -0.2 0 0.2 1 10
;
 
proc means data=Sample skewness;
run;
%let ObsStat = 2.3163714;

You can use the bootstrap to approximate the sampling distribution for the skewness statistic for these data. I have previously shown how to use SAS to bootstrap the skewness statistic: Use PROC SURVEYSELECT to form bootstrap samples, use PROC MEANS with a BY statement to analyze the samples, and use PROC UNIVARIATE to analyze the bootstrap distribution of skewness values. In that previous article, PROC SURVEYSELECT is used to perform uniform sampling (sampling with replacement).

It is straightforward to modify the previous program to perform balanced bootstrap sampling. The following program is based on a SAS paper by Nils Penard at PhUSE 2012. It does the following:

  1. Use PROC SURVEYSELECT to concatenate B copies of the input data.
  2. Use the DATA step to generate a uniform random number for each observation.
  3. Use PROC SORT to sort the data by the random values. After this step, the N*B observations are in random order.
  4. Generate a variable that indicates the bootstrap sample for each observation. Alternatively, reuse the REPLICATE variable from PROC SURVEYSELECT, as shown below.
/* balanced bootstrap computation */
proc surveyselect data=Sample out=DupData noprint
                  reps=5000              /* duplicate data B times */
                  method=SRS samprate=1; /* sample w/o replacement */
run;
 
data Permute;  
   set DupData;
   call streaminit(12345);
   u = rand("uniform");    /* generate a uniform random number for each obs */
run;
 
proc sort data=Permute; by u; run;  /* sort in random order */
 
data BalancedBoot;
   merge DupData(drop=x) Permute(keep=x);  /* reuse REPLICATE variable */
run;

You can use the BalancedBoot data set to perform subsequent bootstrap analyses. If you perform a bootstrap analysis, you obtain the following approximate bootstrap distribution for the skewness statistic. The observed statistic is indicated by a red vertical line. For reference, the mean of the bootstrap distribution is indicated by a gray vertical line. You can see that the sampling distribution for this tiny data set is highly nonnormal. Many bootstrap samples contain the outlier (which makes up exactly one-sixth of the pooled observations in a balanced bootstrap), and those samples have a large skewness value.

Bootstrap distribution for balanced resampling method
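If you want to reproduce this analysis, here is a minimal sketch: PROC MEANS computes the skewness of each bootstrap sample (the Replicate variable comes from PROC SURVEYSELECT), and PROC SGPLOT draws the histogram with a reference line at the observed statistic.

proc means data=BalancedBoot noprint;
   by Replicate;                     /* one skewness value per bootstrap sample */
   var x;
   output out=BootStats skew=Skew;
run;
 
proc sgplot data=BootStats;
   histogram Skew;                   /* approximate bootstrap distribution */
   refline &ObsStat / axis=x;        /* observed skewness of the data */
run;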

To assure yourself that each of the original six observations appears exactly B times in the union of the bootstrap samples, you can run PROC FREQ, as follows:

proc freq data=BalancedBoot;   /* OPTIONAL: Show that each obs appears B times */
   tables x / nocum;
run;

Balanced bootstrap samples in SAS/IML

As shown in the article "Bootstrap estimates in SAS/IML," you can perform bootstrap computations in the SAS/IML language. For uniform sampling, the SAMPLE function samples with replacement from the original data. However, you can modify the sampling scheme to support balanced bootstrap resampling:

  1. Use the REPEAT function to duplicate the data B times.
  2. Use the SAMPLE function with the "WOR" option to sample without replacement. The resulting vector is a permutation of the B*N observations.
  3. Use the SHAPE function to reshape the permuted data into an N x B matrix for which each column is a bootstrap sample. This form is useful for implementing vectorized computations on the columns.

The following SAS/IML program modifies the program in the previous post to perform balanced bootstrap sampling:

/* balanced bootstrap computation in SAS/IML */
proc iml;
use Sample; read all var "x"; close;
call randseed(12345);
 
/* Return a row vector of statistics, one for each column. */
start EvalStat(M);
   return skewness(M);               /* <== put your computation here */
finish;
Est = EvalStat(x);                   /* 1. observed statistic for data */
 
/* balanced bootstrap resampling */
B = 5000;                            /* B = number of bootstrap samples */
allX = repeat(x, B);                 /* replicate the data B times */
s = sample(allX, nrow(allX), "WOR"); /* 2. sample without replacement (=permute) */
s = shape(s, nrow(x), B);            /*    reshape to (N x B) */
 
/* use the balanced bootstrap samples in subsequent computations */
bStat = T( EvalStat(s) );            /* 3. compute the statistic for each bootstrap sample */
bootEst = mean(bStat);               /* 4. summarize bootstrap distrib such as mean */
bias = Est - bootEst;                /*    Estimate of bias */
RBal = Est || BootEst || Bias;       /* combine results for printing */
print RBal[format=8.4 c={"Obs" "BootEst" "Bias"}];

As shown in the previous histogram, the bias estimate (the difference between the observed statistic and the mean of the bootstrap distribution) is sizeable.

It is worth mentioning that the SAS-supplied %BOOT macro performs balanced bootstrap sampling by default. To generate balanced bootstrap samples with the %BOOT macro, set the BALANCED=1 option, as follows:
%boot(data=Sample, samples=5000, balanced=1) /* or omit BALANCED= option */
If you want uniform (unbalanced) samples, call the macro as follows:
%boot(data=Sample, samples=5000, balanced=0)

In conclusion, it is easy to generate balanced bootstrap samples. Balanced sampling can improve the efficiency of certain bootstrap estimates and inferences. For details, see the references above or Appendix II of Hall (1992).

The post Balanced bootstrap resampling in SAS appeared first on The DO Loop.

July 17, 2018
 

Attention SAS administrators! When SAS batch jobs run on a schedule (or manually), they usually produce date-stamped SAS logs, which are essential for automated system maintenance and troubleshooting. Similar log files are created by various SAS infrastructure services (the Metadata Server, mid-tier servers, etc.). However, as time goes on, the relevance of such logs diminishes while the clutter stockpiles. In some cases, this may even lead to disk-space problems.

There are multiple ways to solve this problem, either by deleting older log files or by stashing them away for auditing purposes (zipping and archiving). One solution would be using Unix/Linux or Windows scripts run on schedule. The other is much "SAS-sier."

Let SAS clean up its "mess"

We are going to write SAS code that you can run manually or on a schedule and that, for a specified directory (folder), deletes all .log files older than 30 days.
First, we need to capture the contents of that directory, then select the file names with the extension .log, and finally subset that selection to the files whose Date Modified is earlier than today's date minus 30 days.

Perhaps the easiest way to get the contents of a directory is the X statement, submitting the DOS DIR command from within SAS and redirecting its output (>) to a file, e.g.

x 'dir > dirlist.txt';

or using the PIPE option in the FILENAME statement:

filename DIRLIST pipe 'dir "C:\Documents and Settings"';

However, SAS administrators know that in many organizations, due to cyber-security concerns, IT policies do not allow shell commands to be run from SAS: the XCMD system option is turned off (NOXCMD), usually system-wide for the whole SAS Enterprise client-server installation via the SAS configuration. In this case, no operating-system command can be executed from within SAS. Try running any X statement in your environment; if it is disabled, you will get the following ERROR in the SAS log:

ERROR: Shell escape is not valid in this SAS session.

To avoid that potential roadblock, we’ll use a different technique of capturing the contents of a directory along with file date stamps.

Macro to delete old log files in a directory/folder

The following SAS macro cleans up a Unix directory or a Windows folder by removing old .log files. I must admit that this statement is a little misleading: the macro is much more powerful. Not only can it delete old .log files, it can remove files of ANY type, specified by their extension.

%macro mr_clean(dirpath=,dayskeep=30,ext=.log);
   data _null_;
      length memname $256;
      deldate = today() - &dayskeep;            /* delete files modified on or before this date */
      rc = filename('indir',"&dirpath");        /* assign fileref to the directory */
      did = dopen('indir');                     /* open the directory */
      if did then
      do i=1 to dnum(did);                      /* loop over directory members */
         memname = dread(did,i);                /* get the member name */
         if reverse(trim(memname)) ^=: reverse("&ext") then continue;  /* skip names not ending in &ext */
         rc = filename('inmem',"&dirpath/"!!memname);
         fid = fopen('inmem');                  /* fails (fid=0) for subdirectories */
         if fid then 
         do;
            moddate = input(finfo(fid,'Last Modified'),date9.);
            rc = fclose(fid);
            if . < moddate <= deldate then rc = fdelete('inmem');   /* delete the old file */
         end;
      end; 
      rc = dclose(did);                         /* close directory and clear filerefs */
      rc = filename('inmem');
      rc = filename('indir');
   run;
%mend mr_clean;

This macro has 3 parameters:

  • dirpath - directory path (required);
  • dayskeep - days to keep (optional, default 30);
  • ext - file extension (optional, default .log).

This macro works in both Windows and Linux/Unix environments. Please note that dirpath and ext parameter values are case-sensitive.

Here are examples of the macro invocation:

1. Using defaults

%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean)

With this macro call, all files with extension .log (default) which are older than 30 days (default) will be deleted from the specified directory.

2. Using default extension

%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean,dayskeep=20)

With this macro call, all files with extension .log (default) which are older than 20 days will be deleted from the specified directory.

3. Using explicit parameters

%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean,dayskeep=10,ext=.xls)

With this macro call, all files with extension .xls (Excel files) which are older than 10 days will be deleted from the specified directory.

Old file deletion SAS macro code explanation

The above macro's logic and actions are contained in a single DATA _NULL_ step. First, we calculate the cutoff date for file deletion, deldate = today() - &dayskeep. Then we assign the fileref indir to the specified directory &dirpath:

rc = filename('indir',"&dirpath");

Then we open that directory:

did = dopen('indir');

and if it opened successfully (did>0), we loop through its members, which can be either files or directories:

do i=1 to dnum(did);

In that loop, first we grab the directory member name:

memname = dread(did,i);

and look for our candidates for deletion, i.e., determine whether that name (memname) ends with "&ext". To do that, we reverse both character strings and compare their beginnings. If they don't match (the ^=: operator), we are not going to touch that member; the continue statement skips to the end of the loop. If they do match, the member name ends with "&ext" and it is a candidate for deletion. We assign the fileref inmem to that member:

rc = filename('inmem',"&dirpath/"!!memname);

Note that the forward slash (/), the Unix/Linux path separator used in the above statement, is also a valid path separator in Windows. Windows converts it to a backslash (\) for display purposes, but it interprets the forward slash as a valid path separator along with the backslash.
Then we open that file using the FOPEN function:

fid = fopen('inmem');

If inmem is a directory, the opening will fail (fid=0), and we will skip the following do-group, which is responsible for the file deletion. If it is a file and it opens successfully (fid>0), then we enter the deletion do-group, where we grab the file's Last Modified date as moddate, close the file, and, if moddate <= deldate, delete that file:

rc = fdelete('inmem');

Then we close the directory and un-assign the filerefs for the member and the directory itself.

Deleting old files across multiple directories/folders

Macro %mr_clean is flexible enough to address various SAS administrators' needs. You can use this macro to delete old files of various types across multiple directories/folders. First, let's create a driver table as follows:

data delete_instructions;
   length days 8 extn $9 path $256;
   infile datalines truncover;
   input days 1-2 extn $ 4-12 path $ 14-269;
   datalines;
30 .log      C:\PROJECTS\Automatically deleting old files\Logs1
20 .log      C:\PROJECTS\Automatically deleting old files\Logs2
25 .txt      C:\PROJECTS\Automatically deleting old files\Texts
35 .xls      C:\PROJECTS\Automatically deleting old files\Excel
30 .sas7bdat C:\PROJECTS\Automatically deleting old files\SAS_Backups
;

This driver table specifies how many days to keep files of certain extensions in each directory. In this example, perhaps the most beneficial deletion applies to the SAS_Backups folder, since it contains SAS data tables (extension .sas7bdat). Data files are typically much larger than SAS log files, so deleting them frees up much more of the valuable disk space.

Then we can use this driver table to loop through its observations and dynamically build macro invocations using CALL EXECUTE:

data _null_;
   set delete_instructions;
   s = cats('%nrstr(%mr_clean(dirpath=',path,',dayskeep=',days,',ext=',extn,'))');
   call execute(s);
run;

Alternatively, we can use the DOSUBL() function to dynamically execute our macro at each iteration of the driver table:

data _null_;
   set delete_instructions;
   s = cats('%mr_clean(dirpath=',path,',dayskeep=',days,',ext=',extn,')');
   rc = dosubl(s);
run;

Put it on autopilot

When it comes to cleaning up your old files (logs, backups, etc.), the best practice for SAS administrators is to schedule the cleaning job to run automatically on a regular basis. Then you can forget about this chore around your "SAS house," as the %mr_clean macro will do it quietly for you, without the noise and fuss of a Roomba.

Your turn, SAS administrators

Would you use this approach in your SAS environment? Any suggestions for improvement? How do you deal with old log files? Other old files? Please share below.

SAS administrators tip: Automatically deleting old SAS logs was published on SAS Users.

July 17, 2018
 

The Base SAS DATA step has been a powerful tool for many years for SAS programmers. But as data sets grow and programmers work with massively parallel processing (MPP) computing environments such as Teradata, Hadoop or the SAS High-Performance Analytics grid, the data step remains stubbornly single-threaded. Welcome DS2 – [...]

The post What DS2 can do for the DATA step appeared first on SAS Learning Post.

July 16, 2018
 

My colleague Robert Allison recently blogged about using the diameter of Texas as a unit of measurement. The largest distance across Texas is about 801 miles, so Robert wanted to find the set of all points such that the distance from the point to Texas is less than or equal to 801 miles.

Robert's implementation was complicated by the fact that he was interested in points on the round earth that are within 801 miles from Texas as measured along a geodesic. However, the idea of "thickening" or "inflating" a polygonal shape is related to a concept in computational geometry called the offset polygon or the inflated polygon. A general algorithm to inflate a polygon is complicated, but this article demonstrates the basic ideas that are involved. This article discusses offset regions for convex and nonconvex polygons in the plane. The article concludes by drawing a planar region for a Texas-shaped polygon that has been inflated by the diameter of the polygon. And, of course, I supply the SAS programs for all computations and images.

Offset regions as a union of circles and rectangles

Assume that a simple polygon is defined by listing its vertices in counterclockwise order. (Recall that a simple polygon is a closed, nonintersecting, shape that has no holes.) You can define the offset region of radius r as the union of the following shapes:

  • The polygon itself
  • Circles of radius r centered at each vertex
  • Rectangles of width r positioned outside the polygon along each edge

The following graphic shows the offset region (r = 0.5) for a convex "house-shaped" polygon. The left side of the image shows the polygon with an overlay of circles centered at each vertex and outward-pointing rectangles along each edge. The right side of the graphic shows the union of the offset regions (blue) and the original polygon (red):

The image on the right shows why the process is sometimes called "inflating" a polygon. For a convex polygon, the edges are pushed out by a distance r and the vertices become rounded. For large values of r, the offset region becomes a nearly circular blob, although the boundary is always a union of line segments and arcs of circles.

You can draw a similar image for a nonconvex polygon. The inflated region near a convex (left-turning) vertex looks the same as before. However, for the nonconvex (right-turning) vertices, the circles do not contribute to the offset region. Computing the offset region for a nonconvex polygon is tricky because if the distance r is greater than the minimum distance between vertices, nonlocal effects can occur. For example, the following graphic shows a nonconvex polygon that has two "prongs." Let r0 be the distance between the prongs. When you inflate the polygon by an amount r > r0/2, the offset region can contain a hole, as shown. Furthermore, the boundary of the offset region is not a simple closed curve. For larger values of r, the hole can disappear. This demonstrates why it is difficult to construct the boundary of an offset region for nonconvex polygons.

Inflating a Texas-shaped polygon

The shape of the Texas mainland is nonconvex. I used PROC GREDUCE on the MAPS.US data set in SAS to approximate the shape of Texas by a 36-sided polygon. The polygon is in a standardized coordinate system and has a diameter (maximum distance between vertices) of r = 0.2036. I then constructed the inflated region by using the same technique as shown above. The polygon and its inflated region are shown below.

The image on the left, which shows 36 circles and 36 rectangles, is almost indecipherable. However, the image on the right is almost an exact replica of the region that appears in Robert Allison's post. Remember, though, that the distances in Robert's article are geodesic distances on a sphere whereas these distances are Euclidean distances in the plane. For the planar problem, you can classify a point as within the offset region by testing whether it is inside the polygon itself, inside any of the 36 rectangles, or within a distance r of a vertex. That computation is relatively fast because it is linear in the number of vertices in the polygon.

I don't want to dwell on the computation, but I do want to mention that it requires fewer than 20 SAS/IML statements! The key part of the computation uses vector operations to construct the outward-facing normal vector of length r to each edge of the polygon. If v is the vector that connects the i_th and (i+1)_th vertex of the polygon, then the outward-facing normal vector is given by the concise vector expression r * (v / ||v||) * M, where M is a rotation matrix that rotates by 90 degrees. You can download the SAS program that computes all the images in this article.
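As a small illustration of that key step (a sketch, not the downloadable program itself), the following statements compute the outward edge normals for a toy counterclockwise polygon. With counterclockwise vertex order, rotating each edge vector 90 degrees clockwise points the normal outward:

proc iml;
P = {0 0, 1 0, 1 1, 0 1};         /* toy polygon: vertices in counterclockwise order */
r = 0.5;                          /* offset radius */
M = {0 -1, 1 0};                  /* rotates a row vector 90 degrees clockwise */
n = nrow(P);
normals = j(n, 2, .);
do i = 1 to n;
   v = P[mod(i,n)+1, ] - P[i, ];  /* edge from vertex i to vertex i+1 */
   normals[i, ] = r * (v/norm(v)) * M;   /* outward normal of length r */
end;
print normals;                    /* offset each edge by its normal to form the rectangles */
quit;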

In conclusion, you can use a SAS program to construct the offset region for an arbitrary simple polygon. The offset region is the union of circles, rectangles, and the original polygon, which means that it is easy to test whether an arbitrary point in the plane is in the offset region. That is, you can test whether any point is within a distance r to an arbitrary polygon.

The post Offset regions: Find all points within a specified distance from a polygon appeared first on The DO Loop.

July 14, 2018
 

One of the biggest challenges we’ve been hearing from customers lately is that they need help operationalizing analytics to extend the value of their modeling efforts. In many cases, they’ve hired smart data analysts who can transform data and create sophisticated models, but their work is used for a single [...]

7 tips for operationalizing analytics was published on SAS Voices by Randy Guard

July 13, 2018
 

In part one of this blog posting series, we took an introductory tour of recommendation systems, digital marketing, and SAS Customer Intelligence 360. Helping users of your website or mobile properties find items of interest is useful in almost any situation. This is why the concept of personalized marketing is [...]

SAS Customer Intelligence 360: The Digital Shapeshifter of Recommendation Systems [Part 2] was published on Customer Intelligence Blog.

July 12, 2018
 

Word Mover's Distance (WMD) is a distance metric used to measure the dissimilarity between two documents; its application in text analytics was introduced by a research group from Washington University in 2015. The group's paper, From Word Embeddings To Document Distances, was published at the 32nd International Conference on Machine Learning (ICML). In this paper, they demonstrated that the WMD metric leads to unprecedentedly low k-nearest-neighbor document classification error rates on eight real-world document classification data sets.

They leveraged word embeddings and WMD to classify documents, and the biggest advantage of this method over traditional methods is its ability to incorporate the semantic similarity between individual word pairs (e.g., President and Obama) into the document distance metric. Traditionally, one way to handle semantically similar words is to provide a synonym table so that the algorithm can merge words with the same meaning into a representative word before measuring document distance; otherwise you cannot get an accurate dissimilarity result. However, maintaining synonym tables requires continuous effort from human experts, which is time-consuming and very expensive. Additionally, the semantic meaning of words depends on the domain, and a general synonym table does not work well across varied domains.

Definition of Word Mover's Distance

WMD defines the distance between two documents as the minimum (weighted) cumulative cost required to move all words from one document to the other. The distance is calculated by solving the following linear program:

$$\min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij}\, c(i,j) \quad \text{subject to} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\;\forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\;\forall j$$

where

  • Tij denotes how much of word i in document d travels to word j in document d';
  • c(i, j) denotes the cost of "traveling" from word i in document d to word j in document d'; here the cost is the two words' Euclidean distance in the word2vec embedding space;
  • if word i appears ci times in document d, we denote its weight di = ci / Σj cj, the normalized bag-of-words count, so that a document's word weights sum to 1.

As a concrete illustration with the two sentences used later in this post: after removing the stop words the, to, and in, document 1 consists of the four words Obama, speaks, media, and Illinois, so each has weight di = 1/4; likewise, each of the four remaining words of document 2 has weight 1/4.

WMD is a special case of the earth mover's distance (EMD) metric, which is rooted in the well-known transportation problem.

How to Calculate Earth Mover's Distance with SAS?

SAS/OR is the tool for solving transportation problems. Figure-1 shows a transportation example with four nodes and the distances between them, which I copied from this Earth Mover's Distance document. The objective is to find the minimum-cost flow from {x1, x2} to {y1, y2}. Now let's see how to solve this transportation problem using SAS/OR.

The weights of nodes and distances between nodes are given below.

Figure-1 A Transportation Problem

data x_set;
   input _node_ $ _sd_;
   datalines;
x1 0.74
x2 0.26
;
 
data y_set;
   input _node_ $ _sd_;
   datalines;
y1 0.23
y2 0.51
;
 
data arcdata;
   input _tail_ $ _head_ $ _cost_;
   datalines;
x1 y1 155.7
x1 y2 252.3
x2 y1 292.9
x2 y2 198.2
;
 
proc optmodel;
   set xNODES;
   num w {xNODES};
 
   set yNODES;
   num u {yNODES};
 
   set <str,str> ARCS;
   num arcCost {ARCS};
 
   read data x_set into xNODES=[_node_] w=_sd_;
   read data y_set into yNODES=[_node_] u=_sd_;
   read data arcdata into ARCS=[_tail_ _head_] arcCost=_cost_;
 
   var flow {<i,j> in ARCS} >= 0;
   impvar sumY = sum{j in yNODES} u[j];
   min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j])/sumY;
 
   con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
   con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i];
 
   solve with lp / algorithm=ns scale=none logfreq=1;
   print flow;
quit;

Table-1 shows the solution from SAS/OR; the EMD is the objective value: 203.26756757.

Table-1 EMD Calculated with SAS/OR

Table-2 shows the flow data I got with SAS/OR, which is the same as the diagram posted in the aforementioned Earth Mover's Distance document.

Table-2 Flow data of SAS/OR

Figure-2 Flow Diagram of Transportation Problem
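As a quick sanity check of the objective value: the optimal flow sends 0.23 from x1 to y1 (155.7 < 292.9), all 0.26 of x2 to y2 (198.2 < 252.3), and the remaining 0.25 from x1 to y2, so

$$\mathrm{EMD} = \frac{0.23 \times 155.7 + 0.25 \times 252.3 + 0.26 \times 198.2}{0.23 + 0.51} = \frac{150.418}{0.74} \approx 203.2676,$$

which agrees with the objective value reported above.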

 

How to Calculate Word Mover's Distance with SAS

The paper From Word Embeddings To Document Distances also proposed a metric called the Relaxed Word Mover's Distance (RWMD), obtained by removing the second constraint of the WMD linear program in order to reduce the computation. Because we need to read word embedding data, I will show you how to calculate the RWMD of two documents with SAS Viya.

/* start CAS server */
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
 
/* load documents into CAS */
data sascas1.documents;
infile datalines delimiter='|' missover;
length text varchar(300);
input text$ did;
datalines;
Obama speaks to the media in Illinois.|1
The President greets the press in Chicago.|2
;
run;
 
/* create stop list*/
data sascas1.stopList;
infile datalines missover;
length term $20;
input term$;
datalines;
the
to
in
;
run;
 
/* load word embedding model */
proc cas;
loadtable path='datasources/glove_100d_tab_clean.txt'
caslib="CASTestTmp"
importOptions={
fileType="delimited",
delimiter='\t',
getNames=True,
guessRows=2.0,
varChars=True
}
casOut={name='glove' replace=True};
run;
quit;
 
%macro calculateRWMD
(
textDS=documents,
documentID=did,
text=text,
language=English,
stopList=stopList,
word2VectDS=glove,
doc1_id=1,
doc2_id=2
);
/* text parsing and aggregation */
proc cas;
textParse.tpParse/
table = {name="&textDS",where="&documentID=&doc1_id or &documentID=&doc2_id"}
docId="&documentID",
language="&language",
stemming=False,
nounGroups=False,
tagging=False,
offset={name="outpos", replace=1},
text="&text";
run;
 
textparse.tpaccumulate/
parent={name="outparent1", replace=1}
language="&language",
offset='outpos',
stopList={name="&stoplist"},
terms={name="outterms1", replace=1},
child={name="outchild1", replace=1},
reduce=1,
cellweight='none',
termWeight='none';
run;
quit;
 
/* terms of the two test documents */
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_terms",replace=true}
query="
select outparent1.*,_term_
from outparent1
left join outterms1
on outparent1._termnum_ = outterms1._termnum_
where _Document_=&doc1_id or _Document_=&doc2_id;
"
;
run;
quit;
 
/* term vectors and counts of the two test documents */
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc1_termvects",replace=true}
query="
select word2vect.*
from &word2VectDS word2vect, doc_terms
where _Document_=&doc2_id and lowcase(term) = _term_;
"
;
run;
 
execdirect casout={name="doc1_terms",replace=true}
query="
select doc_terms.*
from &word2VectDS, doc_terms
where _Document_=&doc2_id and lowcase(term) = _term_;
"
;
run;
 
simple.groupBy / table={name="doc1_terms"}
inputs={"_Term_", "_Count_"}
aggregator="n"
casout={name="doc1_termcount", replace=true};
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc2_termvects",replace=true}
query="
select word2vect.*
from &word2VectDS word2vect, doc_terms
where _Document_=&doc1_id and lowcase(term) = _term_;
"
;
run;
 
execdirect casout={name="doc2_terms",replace=true}
query="
select doc_terms.*
from &word2VectDS, doc_terms
where _Document_=&doc1_id and lowcase(term) = _term_;
"
;
run;
 
simple.groupBy / table={name="doc2_terms"}
inputs={"_Term_", "_Count_"}
aggregator="n"
casout={name="doc2_termcount", replace=true};
run;
quit;
 
/* calculate Euclidean distance between words */
data doc1_termvects;
set sascas1.doc1_termvects;
run;
 
data doc2_termvects;
set sascas1.doc2_termvects;
run;
 
proc iml;
use doc1_termvects;
   read all var _char_ into lterm;
   read all var _num_ into x;
close doc1_termvects;
 
use doc2_termvects;
   read all var _char_ into rterm;
   read all var _num_ into y;
close doc2_termvects;
 
d = distance(x,y);
 
lobs = nrow(lterm);
robs = nrow(rterm);
d_out = j(lobs*robs, 3, ' ');
do i = 1 to lobs;
   do j = 1 to robs;
      d_out[(i-1)*robs+j, 1] = lterm[i];
      d_out[(i-1)*robs+j, 2] = rterm[j];
      d_out[(i-1)*robs+j, 3] = cats(d[i,j]);
   end;
end;
 
create distance from d_out;
append from d_out;
close distance;
run;quit;
 
/* calculate RWMD between documents */
data x_set;
set sascas1.doc1_termcount;
rename _term_=_node_;
_weight_=_count_;
run;
 
data y_set;
set sascas1.doc2_termcount;
rename _term_=_node_;
_weight_=_count_;
run;
 
data arcdata;
set distance;
rename col1=_tail_;
rename col2=_head_;
length _cost_ 8;
_cost_= col3;
run;
 
proc optmodel;
   set xNODES;
   num w {xNODES};
 
   set yNODES;
   num u {yNODES};
 
   set <str,str> ARCS;
   num arcCost {ARCS};
 
   read data x_set into xNODES=[_node_] w=_weight_;
   read data y_set into yNODES=[_node_] u=_weight_;
   read data arcdata into ARCS=[_tail_ _head_] arcCost=_cost_;
 
   var flow {<i,j> in ARCS} >= 0;
   impvar sumY = sum{j in yNODES} u[j];
   min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j])/sumY;
 
   con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
   /* the second WMD constraint is relaxed for RWMD: */
   /* con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i];*/
 
   solve with lp / algorithm=ns scale=none logfreq=1;
 
   call symput('obj', strip(put(obj,best.)));
   create data flowData
      from [i j]={<i, j> in ARCS: flow[i,j].sol > 0}
      col("cost")=arcCost[i,j]
      col("flowweight")=flow[i,j].sol;
run;
quit;
 
%put RWMD=&obj;
%mend calculateRWMD;
 
%calculateRWMD
(
textDS=documents,
documentID=did,
text=text,
language=English,
stopList=stopList,
word2VectDS=glove,
doc1_id=1,
doc2_id=2
);

The RWMD value is 5.0041121662.

Now let's have a look at the flow values.

proc print data=flowdata;
run;quit;

The WMD method does not only measure document similarity; by visualizing the flow data, it also explains why the two documents are similar.

Besides document similarity, WMD is useful for measuring sentence similarity. Check out this article about sentence similarity, and try it with SAS.

How to calculate Word Mover's Distance with SAS was published on SAS Users.