July 24, 2018
 

When you were a kid, were you intrigued by the 99-cent X-ray glasses advertised in the back of the comic books? If you ordered those glasses, you were probably disappointed. Now that you're an adult, how would you like to use computer techniques to "X-ray" your data and look for [...]

The post Using X-ray glasses to see patterns in your data! appeared first on SAS Learning Post.

July 23, 2018
 

Since the late 1990s, SAS has supplied macros for basic bootstrap and jackknife analyses. This article provides an example that shows how to use the %BOOT and %BOOTCI macros. The %BOOT macro generates a bootstrap distribution and computes basic statistics about the bootstrap distribution, including estimates of bias, standard error, and a confidence interval that is suitable when the sampling distribution is normally distributed. Because bootstrap methods are often used when you do not want to assume a statistic is normally distributed, the %BOOTCI macro supports several additional confidence intervals, such as percentile-based and bias-adjusted intervals.

You can download the macros for free from the SAS Support website. The website includes additional examples, documentation, and a discussion of the capabilities of the macros.

The %BOOT macro uses simple uniform random sampling (with replacement) or balanced bootstrap sampling to generate the bootstrap samples. It then calls a user-supplied %ANALYZE macro to compute the bootstrap distribution of your statistic.

How to install and use the %BOOT and %BOOTCI macros

To use the macros, do the following:

  1. Download the source file for the macros and save it in a directory that is accessible to SAS. For this example, I saved the source file to C:\Temp\jackboot.sas.
  2. Define a macro named %ANALYZE that computes the bootstrap statistic from a bootstrap sample. The next section provides an example.
  3. Call the %BOOT macro. The %BOOT macro creates three primary data sets:
    • BootData is a data set view that contains B bootstrap samples of the data. For this example, I use B=5000.
    • BootDist is a data set that contains the bootstrap distribution. It is created when the %BOOT macro internally calls the %ANALYZE macro on the BootData data set.
    • BootStat is a data set that contains statistics about the bootstrap distribution. For example, the BootStat data set contains the mean and standard deviation of the bootstrap distribution, among other statistics.
  4. If you want confidence intervals, use the %BOOTCI macro to compute up to six different interval estimates. The %BOOTCI macro creates a data set named BootCI that contains the statistics that are used to construct the confidence interval. (You can also generate multiple interval estimates by using the %ALLCI macro.)

An example of calling the %BOOT macro

This section shows how to call the %BOOT macro. The example was previously analyzed in an article that shows how to compute a bootstrap percentile confidence interval in SAS. The statistic of interest is the skewness of the SepalWidth variable for 50 iris flowers of the species Iris virginica. The following SAS statements define the sample data and compute the skewness statistic on the original data.

%include "C:\Temp\jackboot.sas";         /* define the %BOOT and %BOOTCI macros */
 
data sample(keep=x);                     /* data are sepal widths for 50 Iris virginica flowers */
   set Sashelp.Iris(where=(Species="Virginica") rename=(SepalWidth=x));
run;
 
/* compute value of the statistic on original data: Skewness = 0.366 */
title 'Skewness for Sepal Widths (Iris virginica)';
proc means data=sample nolabels skewness;
   var x;
   output out=SkewOut skew=Skewness;   /* three output variables: _type_ _freq_ and Skewness */
run;

The skewness statistic (not shown) is 0.366. The call to PROC MEANS is not necessary, but it shows how to create an output data set (SkewOut) that contains the Skewness statistic. By default, the %BOOT macro will analyze all numeric variables in the output data set, so the next step defines the %ANALYZE macro and uses the DROP= data set option to omit some unimportant variables that PROC MEANS automatically generates.

When you define the %ANALYZE macro, be sure to use the NOPRINT option or otherwise suppress ODS output during the bootstrap process. Include the %BYSTMT macro, which will tell the %BOOT macro to use a BY statement to efficiently implement the bootstrap analysis. The %ANALYZE macro is basically the same as the previous call to PROC MEANS, except for the addition of the NOPRINT, %BYSTMT, and DROP= options:

%macro analyze(data=,out=);
   proc means noprint data=&data;   
      %bystmt;
      var x;
      output out=&out(drop=_type_ _freq_) skew=Skewness;
   run;
%mend;

Although the DROP= option is not essential, it reduces the size of the data that are read and written during the bootstrap analysis. Do NOT use a KEEP= option in the %ANALYZE macro because the %BOOT macro generates several other variables (called _SAMPLE_ and _OBS_) as part of the resampling process.

You can now use the %BOOT macro to generate bootstrap samples and compute basic descriptive statistics about the bootstrap distribution:

/* creates BootData, BootDist, and BootStat data sets */
title2 'Bootstrap Analysis of Skewness';
%boot(data=sample,       /* data set that contains the original data */
      samples=5000,      /* number of bootstrap samples */
      random=12345,      /* random number seed for resampling */
      chart=0,           /* do not display the old PROC CHART histograms */
      stat=Skewness,     /* list of output variables to analyze (default=_NUMERIC_) */
      alpha=0.05,        /* significance level for CI (default=0.05) */
      print=1);          /* print descriptive stats (default=1)*/
 
proc print data=bootstat noobs;  /* or use LABEL option to get labels as column headers */
   id method n;
   var value bootmean bias stderr biasco alcl aucl;
run;

I recommend that you specify the first four options. The last three options are shown in case you want to override their default values. Although the %BOOT macro prints a table of descriptive statistics, the table contains 14 columns and is very wide. To shorten the output, I used PROC PRINT to display the most important results. The table shows the estimate of the skewness statistic on the original data (VALUE), the mean of the bootstrap distribution (BOOTMEAN), the estimate for the standard error of the statistic (STDERR), and lower and upper confidence limits (ALCL and AUCL) for an approximate confidence interval under the assumption that the statistic is normally distributed. (The limits are b ± z_(1-α/2) * stderr, where z_(1-α/2) is the (1-α/2) quantile of the standard normal distribution and b = value - bias is a bias-corrected estimate.)
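To make the interval formula concrete, the following DATA step is a minimal sketch (not part of the %BOOT macros; the data set name NormalCI is my own) that recomputes the normal-approximation limits from the BootStat columns and compares them with ALCL and AUCL:

/* Sketch: recompute the normal-approximation CI from BootStat,
   which was created by the %BOOT call above */
data NormalCI;
   set BootStat;
   z = quantile("Normal", 1 - 0.05/2);  /* z_(1-alpha/2) for alpha=0.05 */
   b = value - bias;                    /* bias-corrected estimate */
   Lower = b - z*stderr;
   Upper = b + z*stderr;
run;
 
proc print data=NormalCI noobs;
   var value bias stderr Lower Upper alcl aucl;   /* Lower/Upper should match ALCL/AUCL */
run;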

The data for the bootstrap distribution is in the BootDist data set, so you can use PROC SGPLOT to display a histogram of the bootstrap statistics. I like to assign some of the descriptive statistics into macro variables so that I can display them on the histogram, as follows:

/* OPTIONAL: Store bootstrap statistic in a macro variable */
proc sql noprint;
select value, alcl,     aucl 
 into :Stat, :LowerCL, :UpperCL
 from BootStat;
quit;
 
proc sgplot data=BootDist;      /* <== this data set contains the bootstrap distribution */
   histogram Skewness;
   refline &Stat / axis=x lineattrs=(color=red);
   refline &LowerCL &UpperCL / axis=x;
run;
Bootstrap distribution for skewness of Iris virginica sepal widths

An example of calling the %BOOTCI macro

The %BOOTCI macro enables you to compute several confidence intervals (CIs) for the statistic that you are bootstrapping. The following statements display a percentile-based CI and a bias-adjusted and corrected CI.

title2 'Percentile-Based Confidence Interval';
%bootci(PCTL);    /* creates BootCI data set for Pctl CI */

The percentile-based CI is about the same width as the normal-based CI, but it is shifted to the left. The default output from the %BOOTCI macro is very wide, so sometimes I prefer to use the PRINT=0 option to suppress the output. The estimates are written to a data set named BootCI, so it is easy to use PROC PRINT to display only the statistics that you want to see, as shown in the following call that computes a bias-corrected and adjusted interval estimate:

title2 'Bias-Adjusted and Corrected Bootstrap Confidence Interval';
%bootci(BCa, print=0);       /* creates BootCI data set for BCa CI */
proc print data=BootCI noobs label;
   id method n;
   var value alcl aucl;
run;

Notice that each call to the %BOOTCI macro creates a data set named BootCI. In particular, the second call overwrites the data set that was created by the first call. If you want to compare the estimates, be sure to make a copy of the first BootCI data set before you overwrite it.
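For example, a short DATA step (the data set name PctlCI is my own choice) saves the percentile results before the BCa call replaces them:

data PctlCI;   /* copy the percentile CI before the next %BOOTCI call overwrites BootCI */
   set BootCI;
run;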

The %ALLCI macro

If you want to compare multiple CIs, you can use the %ALLCI macro, which computes multiple definitions of the CIs and concatenates them into a data set named AllCI, as shown by the following:

title2 'Comparison of Bootstrap Confidence Intervals';
%allci(print=0); 
proc print data=AllCI(drop=_LABEL_) noobs label;
   id method n;
   var value alcl aucl;
run;

The output (not shown) contains interval estimates for five bootstrap CIs and a jackknife CI.

Be aware that when you run the %ALLCI macro, you will see several warnings in the SAS log, such as the following:

WARNING: Variable _lo was not found on DATA file.
WARNING: Variable bootmean was not found on BASE file. The variable will
         not be added to the BASE file.

These warnings are coming from PROC APPEND and can be ignored. To suppress these warnings, you can edit the jackboot.sas file, search for the word 'force' on the PROC APPEND statements, and add the NOWARN option to those PROC APPEND statements. For example:
proc append data=bootci&keep base=ALLCI force nowarn; run;

Pros and cons of using the %BOOT macro

The %BOOT, %BOOTCI, and %ALLCI macros can be a time-saver when you want to perform a basic bootstrap in SAS. However, in my opinion, they are not a substitute for understanding how to implement a bootstrap computation manually in SAS. Here are a few advantages and disadvantages of the macros:

  • Advantage: The macros encapsulate the tedious steps of the bootstrap analysis.
  • Advantage: The macros generate SAS data sets that you can use for additional analyses or for graphing the results.
  • Advantage: The macros handle the most common sampling schemes such as simple uniform sampling (with replacement), balanced bootstrap sampling, and residual sampling in regression models.
  • Advantage: The %BOOTCI macro supports many popular confidence intervals for parameters.
  • Disadvantage: The macros do not provide the same flexibility as writing your own analysis. For example, the macros do not support the stratified resampling scheme that is used for a bootstrap analysis of the difference of means in a t test.
  • Disadvantage: There are only a few examples of using the macros. When I first used them, I made several mistakes and had to look at the underlying source code to understand what the macros were doing.

Summary

The %BOOT and %BOOTCI macros provide a convenient way to perform simple bootstrap analyses in SAS. The macros support several common resampling schemes and estimates for confidence intervals. Although the macros are not a replacement for understanding how to program a general, efficient bootstrap analysis, they can be a useful tool for data analysts who want compact code for a bootstrap analysis in SAS.

The post How to use the %BOOT and %BOOTCI macros in SAS appeared first on The DO Loop.

July 21, 2018
 

Of course you know how to create graphs ... But do you find that preparing the data to plot is often the hardest part? Well then, this blog post is for you! I'll be demonstrating how to import Excel data into SAS, transpose the data, use what were formerly column [...]

The post Import Excel data, transpose, and plot it! appeared first on SAS Learning Post.

July 20, 2018
 

"Code golf" is a fun programming pastime that challenges you to solve a problem with the least amount of code possible. Like regular golf, the goal is to use fewest code "strokes" to hit the mark. Here's a recent challenge that was posted to me via Twitter.

While I feel that I can solve nearly any problem (that I can understand) using SAS, my knowledge of the SAS language is quite limited when compared to that of many experts.  And so, I reached out to the SAS Support Communities for help on this one.

The answers were quick, creative, and diverse.  I'll share a few of them here.

The winner, in terms of concision, came from FreelanceReinhard.  He supplied a macro-function one-liner:

%sysfunc(findc(123456789,00112010302,b));

With this entry, FreelanceReinhard defied the natural algorithmic instinct to treat this as a numerical digit-comparison problem and instead approached it as a simple pattern-matching problem. The highest digit comes from a finite set (0..9). The COMPRESS function is often used to eliminate matching characters from a string, but the k directive inverts the action to keep only the matching characters instead.

If you wanted to use the more traditional approach of looping through values, comparing, and keeping just the maximum value, then you can hardly do better than the code offered by hashman.

do j = 1 to length (str) ; 
  d = d <> input (char (str, j), 1.) ;
end ;

Experienced SAS programmers will remember that the <> operator is shorthand for MAX in the DATA step, which is what keeps this loop so compact.

AhmedAl_Attar offered the most dangerous approach, using memory manipulation techniques to populate members of an array:

array ct [20] $1 _temporary_;
call pokelong (str,addrlong(ct[1]),length(str));
c=max(of ct{*});

Visit the communities topic and chime in. Here is a full program that gathers the different approaches:

data max;
  str = '00112010302';
 
  /* novinosrin's approach */
  a=first(compress('9876543210',str,'k'));
 
  /* FreelanceReinhard's approach */
  b=findc('123456789',str,-9);
 
  /* AhmedAl_Attar's approach using POKELONG */
  array ct [20] $1 _temporary_;
  call pokelong (str,addrlong(ct[1]),length(str));
  c=max(of ct{*});
 
  /* loop approach from hashman */
  /* remember that <> is MAX    */
  do j = 1 to length (str) ;            
    d = d <> input (char (str, j), 1.) ;
  end ;
 
drop j;
run;
 
/* FreelanceReinhard's approach in a one-liner macro function */
%let str=00112010302;
%put max=%sysfunc(findc(123456789,&str.,b));
 
/* IML approach from ChanceTGardener */
/* Requires SAS/IML to run           */
proc iml;
  str='000112010302';
  maximum=max((substr(str,1:length(str),1)));
  print maximum;
quit;

The post SAS code golf: find the max digit in a string of digits appeared first on The SAS Dummy.

July 19, 2018
 

Suppose that you want to know the value of a character variable that has the highest frequency count, or even the top three values. To determine that value, you need to create an output data set and sort the data by the descending Count or _FREQ_ variable. Then you need to print the top n observations by using the OBS= option, based on the number of values that you want to see. You can do this easily by using any of a variety of procedures that calculate a frequency count (for example, the FREQ procedure or the MEANS procedure).

This blog provides two detailed examples: one calculates the top n values for a single variable and one calculates the top n values for all character variables in a data set.

Print the top n observations of a single variable

The following example prints the three values of the Make variable in the Sashelp.Cars data set that have the highest frequency count. By default, PROC FREQ prints a variable called Count in the output data set. The output data set is sorted by this variable in descending order, and the number of observations that you want to keep is printed by using the OBS= data set option.

proc freq data=sashelp.cars noprint;
tables make / out=counts(drop=percent);
run;
 
proc sort data=counts;
by descending count;
run;
 
proc print data=counts(obs=3);
run;

Print the top n observations of all character variables in a data set

Suppose that you want to know the top three values for all the character variables in a data set. The process shown in the previous section is not efficient when you have many variables. Suppose you also want to store this information in a data set. You can use macro logic to handle both tasks. The following code uses PROC FREQ to create an output data set for each variable. Further manipulation is done in a DATA step so that all the data sets can be combined. A detailed explanation follows the example code:

%macro top_frequency(lib=,dsn=);
 
/* count character variables in the data set */
proc sql noprint;
select name into :charlist separated by ' '
from dictionary.columns
where libname=%upcase("&lib") and memname=%upcase("&dsn") and type='char';
quit;
 
%put &charlist;
%let cnt=%sysfunc(countw(&charlist,%str( )));
%put &cnt;
 
%do i=1 %to &cnt;
 
/* Loop through each character variable in PROC FREQ */
/* and create a separate output data set.            */
proc freq data=&lib..&dsn noprint;
tables %scan(&charlist,&i) / missing out=out&i(drop=percent 
 rename=(%scan(&charlist,&i)=value));
run;
 
data out&i;
length varname value $100;
set out&i;
varname="%scan(&charlist,&i)";
run;
 
proc sort data=out&i;
by varname descending count;
run;
 
%end;
 
data combine;
set %do i=1 %to &cnt;
out&i(obs=3) /* Keeps top 3 for each variable. */
%end;;
run;
 
proc print data=combine;
run;
 
%mend top_frequency;
 
options mprint mlogic symbolgen;
%top_frequency(lib=SASHELP,dsn=CARS);

I begin my macro definition with two keyword parameters that enable me to substitute the desired library and data set name in my macro invocation:

%macro top_frequency(lib=,dsn=);

The SQL procedure step selects all the character variables in the data set and stores them in a space-delimited macro variable called &CHARLIST. Another macro variable called &CNT counts how many words (that is, variable names) are in this list.

proc sql noprint;
select name into :charlist separated by ' '
from dictionary.columns
where libname=%upcase("&lib") and memname=%upcase("&dsn") and type='char';
quit;
 
%put &charlist;
%let cnt=%sysfunc(countw(&charlist,%str( )));
%put &cnt;

The %DO loop iterates through each variable in the list and generates output data from PROC FREQ by using the OUT= option. The output data set contains two variables: the variable from the TABLES request, which holds the unique values of that variable, and the Count variable, which holds the frequency counts. The variable from the TABLES request is renamed to Value so that all the data sets can be combined in a later step. In a subsequent DATA step, a new variable, called Varname, is created that contains the variable name as a character string. Finally, the data set is sorted by the descending frequency count.

%do i=1 %to &cnt;
 
/* Loop through each character variable in PROC FREQ */ 
/* and create a separate output data set.            */
proc freq data=&lib..&dsn noprint;
tables %scan(&charlist,&i) / missing
out=out&i(drop=percent 
 rename=(%scan(&charlist,&i)=value));
run;
 
data out&i;
length varname value $100;
set out&i;
varname="%scan(&charlist,&i)";
run;
 
proc sort data=out&i;
by varname descending count;
run;
 
%end;

The final DATA step combines all the data sets into one using another macro %DO loop in the SET statement. The %END statement requires two semicolons: one ends the SET statement and one ends the %END statement. Three observations of each data set are printed by using the OBS= option.

data combine;
set %do i=1 %to &cnt;
 out&i(obs=3) /* Keeps top 3 for each variable. */
%end;;
run;

Knowing your data is essential in any programming application. The ability to quickly view the top values of any or all variables in a data set can be useful for identifying top sales, targeting specific demographic segments, trying to understand the prevalence of certain illnesses or diseases, and so on. As explained in this blog, a variety of Base SAS procedures along with the SAS macro facility make it easy to accomplish such tasks.


Keeping the top frequency count (n) for each character variable in a SAS data set was published on SAS Users.

July 19, 2018
 

In a previous posting, SAS Customer Intelligence 360 was highlighted in the context of delivering relevant product, service, and content recommendations using automated machine learning within digital experiences. Shifting gears, SAS recognizes there are different user segments for our platform. This post will focus on building custom analytical recommendation models [...]

SAS Customer Intelligence 360: Factorization machines, visual analytics, and personalized marketing was published on Customer Intelligence Blog.

July 18, 2018
 

Last year when I went through the SAS Global Forum 2017 paper list, the paper Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis impressed me a lot. In this paper, the author pointed out the common problems caused by traditional sampling methods and proposed four sampling methods for textual data. Recently my team has been working on a project in which we face a huge volume of documents from a specific field, and we need the efforts of linguists and domain experts to analyze the textual data and annotate ground truth. Our first question was which documents we should start working on to get a panoramic view of the data with minimal effort. Frankly, I don't have a state-of-the-art method for extracting representative documents and measuring its effectiveness, so why not try this innovative technique?

The paper proposed four sampling methods, and I tried only the first one, which uses cluster membership as strata. Before we step into the details of the SAS program, let me introduce the steps of this method:

  • Step 1: Parse textual data into tokens and calculate each term's TF-IDF value
  • Step 2: Generate term-by-document matrix
  • Step 3: Cluster documents through k-means algorithm
  • Step 4: Get top k terms of each cluster
  • Step 5: Do stratified sampling by cluster

I wrote a SAS macro for each step so that you can check the results step by step. If you are not satisfied with the final clustering result, you can tune the parameters of any step and re-run that step and the steps that follow. Now let's see how to use SAS Viya to extract samples from movie review data.

The movie review data has 11,855 rows of observations and 200,963 tokens. After removing stop words, 18,976 terms remain. In this example, I set the dimension size of the term-by-document matrix to 3000, which means that I use the 3000 terms with the highest TF-IDF values in the document collection as its dimensions. Then I use k-means clustering to group documents into K clusters, and I set the maximum K to 50 with the kClus action in CAS. The dataSegment action can also cluster documents directly, but it cannot choose the best K; you must run it with different K values and pick the best one yourself. By contrast, the kClus action automatically chooses the best K between the minimum and maximum K values that you define, so I use it in my implementation.

After running the program (full code at the end of this post), I got 39 clusters, and Table-1 shows the top 10 terms of the first cluster.

Table-1 Top 10 terms of Cluster 1

Let's see what samples we get for the first cluster. I got 7 documents, and each document contains either the term "predictable" or the term "emotional."

Samples from cluster

I set sampPct to 5, which means that 5% of the data will be randomly selected from each cluster. In the end, I got 582 sample documents. Let's check the sample distribution across clusters.

Donut chart of cluster samples

This clustering method helped us intelligently select a small subset of documents from the piles of document collections, and most importantly it saved us a lot of time and helped us hit the mark.

I haven't had a chance to try the other three sampling methods from the paper; I encourage you to try them and share your experiences with us. Big thanks to my colleague Murali Pagolu for sharing this innovative technique during the SAS Global Forum 2017 conference and for kindly providing me with some good suggestions.

Appendix: Complete code for text sampling

 
/*-------------------------------------------*/
/* Sampling textual data based on clustering */
/*-------------------------------------------*/
 
 
/*-------------------------------------*/
/* Get tfidf                           */
/*-------------------------------------*/
%macro getTfidf(
   dsIn=, 
   docVar=, 
   textVar=, 
   language=, 
   stemming=true, 
   stopList=, 
   dsOut=
);
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=false
   tagging=false
   noungroups=false
   entities="none"
   offset={name="tpparse_out",replace=TRUE}
;
run;
 
textparse.tpAccumulate /
   offset={name="tpparse_out"}
   stopList={name="&stopList"}
   termWeight="NONE"
   cellWeight="NONE"
   reduce=1
   parent={name="tpAccu_parent",replace=TRUE}
   terms={name="tpAccu_term",replace=TRUE}
   showdroppedterms=false
;
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_term_stat", replace=true} 
query="
      select tpAccu_parent.&docVar, 
             tpAccu_term._term_,
             tpAccu_parent._count_ as _tf_,
             tpAccu_term._NumDocs_
      from tpAccu_parent
      left join tpAccu_term
      on tpAccu_parent._Termnum_=tpAccu_term._Termnum_;
"
;
run;
 
simple.groupBy / 
   table={name="tpAccu_parent"}
   inputs={"&docVar"}
   casout={name="doc_nodup", replace=true};
run;
 
numRows result=r / 
   table={name="doc_nodup"};
totalDocs = r.numrows;
run;
 
datastep.runcode /
code = "
   data &dsOut;
      set doc_term_stat;"
   ||"_tfidf_ = _tf_*log("||totalDocs||"/_NumDocs_);"
   ||"run;
";
run;
quit;
 
proc cas;
   table.dropTable name="tpparse_out" quiet=true; run;
   table.dropTable name="tpAccu_parent" quiet=true; run;
   table.dropTable name="tpAccu_term" quiet=true; run;
   table.dropTable name="doc_nodup" quiet=true; run;
   table.dropTable name="doc_term_stat" quiet=true; run;
quit;
%mend getTfidf;
 
 
/*-------------------------------------*/
/* Term-by-document matrix             */
/*-------------------------------------*/
%macro DocToVectors(
   dsIn=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   dimSize=500, 
   dsOut=
);
proc cas;
simple.summary /
   table={name="&dsIn", groupBy={"&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="term_tf_sum", replace=true};
run;
 
simple.topk / 
   table={name="term_tf_sum"}  
   inputs={"&termVar"} 
   topk=&dimSize
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name='termnum_top', replace=true};
run;
 
loadactionset "fedsql";
execdirect casout={name="doc_top_terms", replace=true} 
query="
      select termnum.*, _rank_
      from &dsIn termnum, termnum_top
      where termnum.&termVar=termnum_top._Charvar_
        and &tfVar!=0;
"
;
run;
 
transpose.transpose /
   table={name="doc_top_terms", 
          groupby={"&docVar"}, 
          computedVars={{name="_name_"}},
          computedVarsProgram="_name_='_dim'||strip(_rank_)||'_';"}  
   transpose={"&tfVar"}
   casOut={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="term_tf_sum" quiet=true; run;
   table.dropTable name="termnum_top" quiet=true; run;
   table.dropTable name="termnum_top_misc" quiet=true; run;
   table.dropTable name="doc_top_terms" quiet=true; run;
quit;
%mend DocToVectors;
 
 
/*-------------------------------------*/
/* Cluster documents                   */
/*-------------------------------------*/
%macro clusterDocs(
   dsIn=, 
   nClusters=10,
   seed=12345,   
   dsOut=
);
proc cas;
/*get the vector variables list*/
columninfo result=collist /
   table={name="&dsIn"};
ndimen=dim(collist['columninfo']);
vector_columns={};
j=1;
do i=1 to ndimen;
   thisColumn = collist['columninfo'][i][1];
   if lowcase(substr(thisColumn, 1, 4))='_dim' then do;
      vector_columns[j]= thisColumn;
      j=j+1;
   end;
end;
run;
 
clustering.kClus / 
   table={name="&dsIn"},
   nClusters=&nClusters,
   init="RAND",
   seed=&seed,
   inputs=vector_columns,
   distance="EUCLIDEAN",
   printIter=false,
   impute="MEAN",
   standardize='STD',
   output={casOut={name="&dsOut", replace=true}, copyvars="ALL"}
;
run;
quit;
%mend clusterDocs;
 
 
/*-------------------------------------*/
/* Get top-k words of each cluster     */
/*-------------------------------------*/
%macro clusterProfile(
   termDS=, 
   clusterDS=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   clusterVar=_CLUSTER_ID_, 
   topk=10, 
   dsOut=
);
proc cas;
loadactionset "fedsql";
execdirect casout={name="cluster_terms",replace=true} 
query="
      select &termDS..*, &clusterVar
      from &termDS, &clusterDS
      where &termDS..&docVar = &clusterDS..&docVar;
"
;
run;
 
simple.summary /
   table={name="cluster_terms", groupBy={"&clusterVar", "&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="cluster_terms_sum", replace=true};
run;
 
simple.topk / 
   table={name="cluster_terms_sum", groupBy={"&clusterVar"}}  
   inputs={"&termVar"} 
   topk=&topk
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="cluster_terms" quiet=true; run;
   table.dropTable name="cluster_terms_sum" quiet=true; run;
quit;
%mend clusterProfile;
 
 
/*-------------------------------------*/
/* Stratified sampling by cluster      */
/*-------------------------------------*/
%macro strSampleByCluster(
   docDS=, 
   docClusterDS=, 
   docVar=, 
   clusterVar=_CLUSTER_ID_, 
   seed=12345,   
   sampPct=, 
   dsOut=
);
proc cas;
loadactionset "sampling";
stratified result=r /
   table={name="&docClusterDS", groupby={"&clusterVar"}}
   sampPct=&sampPct 
   partind="TRUE" 
   seed=&seed
   output={casout={name="sampling_out",replace="TRUE"},
                   copyvars={"&docVar", "&clusterVar"}};
run;
print r.STRAFreq; run;
 
loadactionset "fedsql";
execdirect casout={name="&dsOut", replace=true} 
query="
   select docDS.*, &clusterVar
   from &docDS docDS, sampling_out
   where docDS.&docVar=sampling_out.&docVar
     and _PartInd_=1;
"
;
run;
 
proc cas;
   table.dropTable name="sampling_out" quiet=true; run;
quit; 
%mend strSampleByCluster;
 
/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
caslib _all_ assign;
 
/*-------------------------------------*/
/* Prepare and load data.              */
/*-------------------------------------*/
%let myData=movie_reviews;
 
proc cas;
loadtable result=r / 
   importOptions={fileType="csv", delimiter='TAB',getnames="true"}
   path="data/movie_reviews.txt"
   casLib="CASUSER"
   casout={name="&myData", replace="true"} ;
run;
quit;
 
/* Browse the data */
proc cas;
   columninfo / table={name="&myData"};
   fetch / table = {name="&myData"};
run;
quit;
 
/* generate one unique index using data step */
proc cas;
datastep.runcode /
code = "
   data &myData;
      set &myData;
      rename id = _document_;
      keep id text score;  
   run;
";
run;
quit;
 
/* create stop list*/
data sascas1.stopList;
   set sashelp.engstop;
run;
 
/* Get tfidf by term by document */
%getTfidf(
   dsIn=&myData, 
   docVar=_document_, 
   textVar=text, 
   language=english, 
   stemming=true, 
   stopList=stopList, 
   dsOut=doc_term_tfidf
);
 
/* document-term matrix */
%DocToVectors(
   dsIn=doc_term_tfidf, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   dimSize=3000, 
   dsOut=doc_vectors
);
 
/* Cluster documents */
%clusterDocs(
   dsIn=doc_vectors, 
   nClusters=50, 
   seed=12345,   
   dsOut=doc_clusters
);
 
/* Get top-k words of each cluster */
%clusterProfile(
   termDS=doc_term_tfidf, 
   clusterDS=doc_clusters, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   clusterVar=_cluster_id_, 
   topk=10, 
   dsOut=cluster_topk_terms
);
 
/* Browse topk terms of the first cluster */
proc cas;
fetch / 
   table={name="cluster_topk_terms",
          where="_cluster_id_=1"};
run;
quit;
 
/* Stratified sampling by cluster      */
%strSampleByCluster(
   docDS=&myData, 
   docClusterDS=doc_clusters, 
   docVar=_document_, 
   clusterVar=_cluster_id_, 
   seed=12345,   
   sampPct=5,
   dsOut=doc_sample_by_cls
);
 
/* Browse sample documents of the first cluster */
proc cas;
fetch / 
   table={name="doc_sample_by_cls",
          where="_cluster_id_=1"};
run;
quit;

How to sample textual data with SAS was published on SAS Users.

July 18, 2018
 

This article shows how to implement balanced bootstrap sampling in SAS. The basic bootstrap samples with replacement from the original data (N observations) to obtain B new samples. This is called "uniform" resampling because each observation has a uniform probability of 1/N of being selected at each step of the resampling process. Within the union of the B bootstrap samples, each observation appears B times in expectation.

Balanced bootstrap resampling (Davison, Hinkley, and Schechtman, 1986) is an alternative process in which each observation appears exactly B times in the union of the B bootstrap samples of size N. This has some practical benefits for estimating certain inferential statistics such as the bias and quantiles of the sampling distribution (Hall, 1990).

It is easy to implement a balanced bootstrap resampling scheme: Concatenate B copies of the data, randomly permute the B*N observations, and then use the first N observations for the first bootstrap sample, the next N for the second sample, and so forth. (Other algorithms are also possible, as discussed by Gleason, 1988.)

Balanced bootstrap samples in SAS

To illustrate the idea, consider the following data set that has N=6 observations. Five observations are clustered near x=0 and the sixth is a large outlier (x=10). The sample skewness for these data is skew=2.316 because of the influence of the outlier.

data Sample(keep=x);
input x @@;
datalines;
-1 -0.2 0 0.2 1 10
;
 
proc means data=Sample skewness;
run;
%let ObsStat = 2.3163714;

You can use the bootstrap to approximate the sampling distribution for the skewness statistic for these data. I have previously shown how to use SAS to bootstrap the skewness statistic: Use PROC SURVEYSELECT to form bootstrap samples, use PROC MEANS with a BY statement to analyze the samples, and use PROC UNIVARIATE to analyze the bootstrap distribution of skewness values. In that previous article, PROC SURVEYSELECT is used to perform uniform sampling (sampling with replacement).

It is straightforward to modify the previous program to perform balanced bootstrap sampling. The following program is based on a SAS paper by Nils Penard at PhUSE 2012. It does the following:

  1. Use PROC SURVEYSELECT to concatenate B copies of the input data.
  2. Use the DATA step to generate a uniform random number for each observation.
  3. Use PROC SORT to sort the data by the random values. After this step, the N*B observations are in random order.
  4. Generate a variable that indicates the bootstrap sample for each observation. Alternatively, reuse the REPLICATE variable from PROC SURVEYSELECT, as shown below.
/* balanced bootstrap computation */
proc surveyselect data=Sample out=DupData noprint
                  reps=5000              /* duplicate data B times */
                  method=SRS samprate=1; /* sample w/o replacement */
run;
 
data Permute;  
   set DupData;
   call streaminit(12345);
   u = rand("uniform");    /* generate a uniform random number for each obs */
run;
 
proc sort data=Permute; by u; run;  /* sort in random order */
 
data BalancedBoot;
   merge DupData(drop=x) Permute(keep=x);  /* reuse REPLICATE variable */
run;

You can use the BalancedBoot data set to perform subsequent bootstrap analyses. If you perform a bootstrap analysis, you obtain the following approximate bootstrap distribution for the skewness statistic. The observed statistic is indicated by a red vertical line. For reference, the mean of the bootstrap distribution is indicated by a gray vertical line. You can see that the sampling distribution for this tiny data set is highly nonnormal. Many bootstrap samples contain the outlier (which makes up exactly one-sixth of the observations in the union of the balanced bootstrap samples) and therefore have a large skewness value.
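The follow-up analysis is not shown in this post, but the following sketch mirrors the uniform-sampling analysis from the previous article (it assumes the BalancedBoot data set above, in which the Replicate variable from PROC SURVEYSELECT identifies the bootstrap samples, and the data are already in Replicate order):

/* Sketch: compute the skewness of each bootstrap sample
   and plot the bootstrap distribution */
proc means data=BalancedBoot noprint;
   by Replicate;
   var x;
   output out=BootSkew skew=Skewness;
run;
 
proc sgplot data=BootSkew;
   histogram Skewness;
   refline &ObsStat / axis=x lineattrs=(color=red);   /* observed statistic */
run;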

Bootstrap distribution for balanced resampling method

To assure yourself that each of the original six observations appears exactly B times in the union of the bootstrap samples, you can run PROC FREQ, as follows:

proc freq data=BalancedBoot;   /* OPTIONAL: Show that each obs appears B times */
   tables x / nocum;
run;

Balanced bootstrap samples in SAS/IML

As shown in the article "Bootstrap estimates in SAS/IML," you can perform bootstrap computations in the SAS/IML language. For uniform sampling, the SAMPLE function samples with replacement from the original data. However, you can modify the sampling scheme to support balanced bootstrap resampling:

  1. Use the REPEAT function to duplicate the data B times.
  2. Use the SAMPLE function with the "WOR" option to sample without replacement. The resulting vector is a permutation of the B*N observations.
  3. Use the SHAPE function to reshape the permuted data into an N x B matrix for which each column is a bootstrap sample. This form is useful for implementing vectorized computations on the columns.

The following SAS/IML program modifies the program in the previous post to perform balanced bootstrap sampling:

/* balanced bootstrap computation in SAS/IML */
proc iml;
use Sample; read all var "x"; close;
call randseed(12345);
 
/* Return a row vector of statistics, one for each column. */
start EvalStat(M);
   return skewness(M);               /* <== put your computation here */
finish;
Est = EvalStat(x);                   /* 1. observed statistic for data */
 
/* balanced bootstrap resampling */
B = 5000;                            /* B = number of bootstrap samples */
allX = repeat(x, B);                 /* replicate the data B times */
s = sample(allX, nrow(allX), "WOR"); /* 2. sample without replacement (=permute) */
s = shape(s, nrow(x), B);            /*    reshape to (N x B) */
 
/* use the balanced bootstrap samples in subsequent computations */
bStat = T( EvalStat(s) );            /* 3. compute the statistic for each bootstrap sample */
bootEst = mean(bStat);               /* 4. summarize bootstrap distrib such as mean */
bias = Est - bootEst;                /*    Estimate of bias */
RBal = Est || BootEst || Bias;       /* combine results for printing */
print RBal[format=8.4 c={"Obs" "BootEst" "Bias"}];

As shown in the previous histogram, the bias estimate (the difference between the observed statistic and the mean of the bootstrap distribution) is sizeable.

It is worth mentioning that the SAS-supplied %BOOT macro performs balanced bootstrap sampling by default. To generate balanced bootstrap samples with the %BOOT macro, set the BALANCED=1 option, as follows:
%boot(data=Sample, samples=5000, balanced=1) /* or omit BALANCED= option */
If you want uniform (unbalanced) samples, call the macro as follows:
%boot(data=Sample, samples=5000, balanced=0)

In conclusion, it is easy to generate balanced bootstrap samples. Balanced sampling can improve the efficiency of certain bootstrap estimates and inferences. For details, see the previous references or Appendix II of Hall (1992).

The post Balanced bootstrap resampling in SAS appeared first on The DO Loop.