Suppose that you want to know which value of a character variable has the highest frequency count, or even the top three values. To determine this, you need to create an output data set, sort it by the descending Count or _FREQ_ variable, and then print the top n observations by using the OBS= data set option, based on the number of values that you want to see. You can do this easily with any of a variety of procedures that calculate a frequency count (for example, the FREQ procedure or the MEANS procedure).

This blog provides two detailed examples: one calculates the top n values for a single variable and one calculates the top n values for all character variables in a data set.

Print the top n observations of a single variable

The following example prints the three values of the Make variable in the Sashelp.Cars data set that have the highest frequency count. By default, PROC FREQ prints a variable called Count in the output data set. The output data set is sorted by this variable in descending order, and the number of observations that you want to keep is printed by using the OBS= data set option.

```
proc freq data=sashelp.cars noprint;
   tables make / out=counts(drop=percent);
run;

proc sort data=counts;
   by descending count;
run;

proc print data=counts(obs=3);
run;
```

Print the top n observations of all character variables in a data set

Suppose that you want to know the top three values for all the character variables in a data set and to store this information in a data set. The process shown in the previous section is not efficient when you have many variables. You can use macro logic to handle both tasks. The following code uses PROC FREQ to create an output data set for each variable. Further manipulation is done in a DATA step so that all the data sets can be combined. A detailed explanation follows the example code:

```
%macro top_frequency(lib=,dsn=);

/* Select all character variables in the data set */
proc sql noprint;
   select name into :charlist separated by ' '
   from dictionary.columns
   where libname=%upcase("&lib") and memname=%upcase("&dsn")
     and type='char';
quit;

%put &charlist;
%let cnt=%sysfunc(countw(&charlist,%str( )));
%put &cnt;

%do i=1 %to &cnt;

   /* Loop through each character variable in PROC FREQ */
   /* and create a separate output data set.            */
   proc freq data=&lib..&dsn noprint;
      tables %scan(&charlist,&i) / missing
         out=out&i(drop=percent rename=(%scan(&charlist,&i)=value));
   run;

   data out&i;
      length varname value $100;
      set out&i;
      varname="%scan(&charlist,&i)";
   run;

   proc sort data=out&i;
      by varname descending count;
   run;

%end;

data combine;
   set %do i=1 %to &cnt;
          out&i(obs=3)   /* Keeps top 3 for each variable. */
       %end;;
run;

proc print data=combine;
run;

%mend top_frequency;

options mprint mlogic symbolgen;
%top_frequency(lib=SASHELP,dsn=CARS);
```

I begin my macro definition with two keyword parameters that enable me to substitute the desired library and data set name in my macro invocation:

`%macro top_frequency(lib=,dsn=);`

The SQL procedure step selects all the character variables in the data set and stores them in a space-delimited macro variable called &CHARLIST. Another macro variable called &CNT counts how many words (that is, variable names) are in this list.

```
proc sql noprint;
   select name into :charlist separated by ' '
   from dictionary.columns
   where libname=%upcase("&lib") and memname=%upcase("&dsn")
     and type='char';
quit;

%put &charlist;
%let cnt=%sysfunc(countw(&charlist,%str( )));
%put &cnt;
```

The %DO loop iterates through each variable in the list and generates an output data set from PROC FREQ by using the OUT= option. The output data set contains two variables: the variable from the TABLES request with the unique values of that variable and the Count variable with the frequency counts. The variable is renamed to Value so that all the data sets can be combined in a later step. In a subsequent DATA step, a new variable, called Varname, is created that contains the variable name as a character string. Finally, the data set is sorted by the descending frequency count.

```
%do i=1 %to &cnt;

   /* Loop through each character variable in PROC FREQ */
   /* and create a separate output data set.            */
   proc freq data=&lib..&dsn noprint;
      tables %scan(&charlist,&i) / missing
         out=out&i(drop=percent rename=(%scan(&charlist,&i)=value));
   run;

   data out&i;
      length varname value $100;
      set out&i;
      varname="%scan(&charlist,&i)";
   run;

   proc sort data=out&i;
      by varname descending count;
   run;

%end;
```

The final DATA step combines all the data sets into one using another macro %DO loop in the SET statement. The %END statement requires two semicolons: one ends the SET statement and one ends the %END statement. Three observations of each data set are printed by using the OBS= option.

```
data combine;
   set %do i=1 %to &cnt;
          out&i(obs=3)   /* Keeps top 3 for each variable. */
       %end;;
run;
```

Knowing your data is essential in any programming application. The ability to quickly view the top values of any or all variables in a data set can be useful for identifying top sales, targeting specific demographic segments, trying to understand the prevalence of certain illnesses or diseases, and so on. As explained in this blog, a variety of Base SAS procedures along with the SAS macro facility make it easy to accomplish such tasks.


Last year, when I went through the SAS Global Forum 2017 paper list, the paper Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis impressed me a lot. In this paper, the author pointed out the common problems caused by traditional sampling methods and proposed four sampling methods for textual data. Recently, my team has been working on a project with a huge volume of documents from a specific field, and we need the efforts of linguists and domain experts to analyze the textual data and annotate ground truth. So our first question is which documents we should start working on to get a panoramic image of the data with minimum effort. Frankly, I don't have a state-of-the-art method to extract representative documents and measure its effect, so why not try this innovative technique?

The paper proposed four sampling methods, and I tried only the first one, which uses cluster memberships as strata. Before we step into the details of the SAS program, let me introduce the steps of this method.

• Step 1: Parse textual data into tokens and calculate each term's TF-IDF value
• Step 2: Generate term-by-document matrix
• Step 3: Cluster documents through k-means algorithm
• Step 4: Get top k terms of each cluster
• Step 5: Do stratified sampling by cluster
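The TF-IDF weight in Step 1 is the classic tf × log(N/df) form. Conceptually (sketched here with the variable names that appear later in the appendix code, and assuming the total document count has been captured into a macro variable beforehand), it is just a DATA step:

```
/* Sketch of the Step 1 weighting:
   _tf_       = frequency of the term in the document
   _NumDocs_  = number of documents that contain the term
   &totalDocs = total number of documents in the collection
                (assumed to be captured beforehand)           */
data doc_term_tfidf;
   set doc_term_stat;        /* one row per (document, term) pair */
   _tfidf_ = _tf_ * log(&totalDocs / _NumDocs_);
run;
```

In the appendix code this same computation runs in CAS via the datastep.runcode action, so the large table never leaves the server.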

I wrote a SAS macro for each step so that you can check the results step by step. If you are not satisfied with the final clustering result, you can tune the parameters of any step and rerun that step and the subsequent ones. Now let's see how to use SAS Viya to extract samples from movie review data.

The movie review data has 11,855 rows of observations containing 200,963 tokens. After removing stop words, 18,976 terms remain. In this example, I set the dimension size of the term-by-document matrix to 3,000. This means that I use the 3,000 terms with the highest TF-IDF values in the document collection as its dimensions. Then I use k-means clustering to group documents into K clusters, setting the maximum K to 50 with the kClus action in CAS. The dataSegment action can also cluster documents directly, but it cannot choose the best K: you would need to try different K values and choose the best one yourself. In contrast, the kClus action chooses the best K automatically between the minimum and maximum K values that you define, so I use the kClus action in my implementation.

After running the program (full code at the end of this post), I got 39 clusters and top 10 terms of the first cluster as Table-1 shows.

Table-1 Top 10 terms of Cluster 1

Let's see what samples we get for the first cluster. I got 7 documents, and each document contains either the term "predictable" or the term "emotional."

I set sampPct to 5, which means that 5% of the data is randomly selected from each cluster. Finally, I got 582 sample documents. Let's check the sample distribution of each cluster.

This clustering method helped us intelligently select a small subset of documents from a large collection. Most importantly, it saved us a lot of time and helped us hit the mark.

I haven't had a chance to try the other three sampling methods from the paper; I encourage you to try them and share your experiences with us. Big thanks to my colleague Murali Pagolu for sharing this innovative technique during the SAS Global Forum 2017 conference and for kindly providing me with some good suggestions.

Appendix: Complete code for text sampling

```
/*-------------------------------------------*/
/* Sampling textual data based on clustering */
/*-------------------------------------------*/

/*-------------------------------------*/
/* Get tfidf                           */
/*-------------------------------------*/
%macro getTfidf(
   dsIn=,
   docVar=,
   textVar=,
   language=,
   stemming=true,
   stopList=,
   dsOut=
);
proc cas;
   textparse.tpParse /
      docId="&docVar"
      documents={name="&dsIn"}
      text="&textVar"
      language="&language"
      cellWeight="NONE"
      stemming=false
      tagging=false
      noungroups=false
      entities="none"
      offset={name="tpparse_out",replace=TRUE}
   ;
   run;

   textparse.tpAccumulate /
      offset={name="tpparse_out"}
      stopList={name="&stopList"}
      termWeight="NONE"
      cellWeight="NONE"
      reduce=1
      parent={name="tpAccu_parent",replace=TRUE}
      terms={name="tpAccu_term",replace=TRUE}
      showdroppedterms=false
   ;
   run;
quit;

proc cas;
   loadactionset "fedsql";
   execdirect casout={name="doc_term_stat", replace=true}
   query="
      select tpAccu_parent.&docVar,
             tpAccu_term._term_,
             tpAccu_parent._count_ as _tf_,
             tpAccu_term._NumDocs_
      from tpAccu_parent
      left join tpAccu_term
      on tpAccu_parent._Termnum_=tpAccu_term._Termnum_;
   ";
   run;

   simple.groupBy /
      table={name="tpAccu_parent"}
      inputs={"&docVar"}
      casout={name="doc_nodup", replace=true};
   run;

   numRows result=r / table={name="doc_nodup"};
   totalDocs = r.numrows;
   run;

   datastep.runcode /
      code = "
         data &dsOut;
            set doc_term_stat;"
         ||"_tfidf_ = _tf_*log("||totalDocs||"/_NumDocs_);"
         ||"run;
      ";
   run;
quit;

proc cas;
   table.dropTable name="tpparse_out" quiet=true; run;
   table.dropTable name="tpAccu_parent" quiet=true; run;
   table.dropTable name="tpAccu_term" quiet=true; run;
   table.dropTable name="doc_nodup" quiet=true; run;
   table.dropTable name="doc_term_stat" quiet=true; run;
quit;
%mend getTfidf;

/*-------------------------------------*/
/* Term-by-document matrix             */
/*-------------------------------------*/
%macro DocToVectors(
   dsIn=,
   docVar=,
   termVar=,
   tfVar=,
   dimSize=500,
   dsOut=
);
proc cas;
   simple.summary /
      table={name="&dsIn", groupBy={"&termVar"}}
      inputs={"&tfVar"}
      summarySubset={"sum"}
      casout={name="term_tf_sum", replace=true};
   run;

   simple.topk /
      table={name="term_tf_sum"}
      inputs={"&termVar"}
      topk=&dimSize
      bottomk=0
      raw=True
      weight="_Sum_"
      casout={name='termnum_top', replace=true};
   run;

   loadactionset "fedsql";
   execdirect casout={name="doc_top_terms", replace=true}
   query="
      select termnum.*, _rank_
      from &dsIn termnum, termnum_top
      where termnum.&termVar=termnum_top._Charvar_
        and &tfVar!=0;
   ";
   run;

   transpose.transpose /
      table={name="doc_top_terms",
             groupby={"&docVar"},
             computedVars={{name="_name_"}},
             computedVarsProgram="_name_='_dim'||strip(_rank_)||'_';"}
      transpose={"&tfVar"}
      casOut={name="&dsOut", replace=true};
   run;
quit;

proc cas;
   table.dropTable name="term_tf_sum" quiet=true; run;
   table.dropTable name="termnum_top" quiet=true; run;
   table.dropTable name="termnum_top_misc" quiet=true; run;
   table.dropTable name="doc_top_terms" quiet=true; run;
quit;
%mend DocToVectors;

/*-------------------------------------*/
/* Cluster documents                   */
/*-------------------------------------*/
%macro clusterDocs(
   dsIn=,
   nClusters=10,
   seed=12345,
   dsOut=
);
proc cas;
   /* get the vector variables list */
   columninfo result=collist / table={name="&dsIn"};
   ndimen=dim(collist['columninfo']);
   vector_columns={};
   j=1;
   do i=1 to ndimen;
      thisColumn = collist['columninfo'][i][1];
      if lowcase(substr(thisColumn, 1, 4))='_dim' then do;
         vector_columns[j]= thisColumn;
         j=j+1;
      end;
   end;
   run;

   clustering.kClus /
      table={name="&dsIn"},
      nClusters=&nClusters,
      init="RAND",
      seed=&seed,
      inputs=vector_columns,
      distance="EUCLIDEAN",
      printIter=false,
      impute="MEAN",
      standardize='STD',
      output={casOut={name="&dsOut", replace=true}, copyvars="ALL"}
   ;
   run;
quit;
%mend clusterDocs;

/*-------------------------------------*/
/* Get top-k words of each cluster     */
/*-------------------------------------*/
%macro clusterProfile(
   termDS=,
   clusterDS=,
   docVar=,
   termVar=,
   tfVar=,
   clusterVar=_CLUSTER_ID_,
   topk=10,
   dsOut=
);
proc cas;
   loadactionset "fedsql";
   execdirect casout={name="cluster_terms",replace=true}
   query="
      select &termDS..*, &clusterVar
      from &termDS, &clusterDS
      where &termDS..&docVar = &clusterDS..&docVar;
   ";
   run;

   simple.summary /
      table={name="cluster_terms", groupBy={"&clusterVar", "&termVar"}}
      inputs={"&tfVar"}
      summarySubset={"sum"}
      casout={name="cluster_terms_sum", replace=true};
   run;

   simple.topk /
      table={name="cluster_terms_sum", groupBy={"&clusterVar"}}
      inputs={"&termVar"}
      topk=&topk
      bottomk=0
      raw=True
      weight="_Sum_"
      casout={name="&dsOut", replace=true};
   run;
quit;

proc cas;
   table.dropTable name="cluster_terms" quiet=true; run;
   table.dropTable name="cluster_terms_sum" quiet=true; run;
quit;
%mend clusterProfile;

/*-------------------------------------*/
/* Stratified sampling by cluster      */
/*-------------------------------------*/
%macro strSampleByCluster(
   docDS=,
   docClusterDS=,
   docVar=,
   clusterVar=_CLUSTER_ID_,
   seed=12345,
   sampPct=,
   dsOut=
);
proc cas;
   loadactionset "sampling";
   stratified result=r /
      table={name="&docClusterDS", groupby={"&clusterVar"}}
      sampPct=&sampPct
      partind="TRUE"
      seed=&seed
      output={casout={name="sampling_out",replace="TRUE"},
              copyvars={"&docVar", "&clusterVar"}};
   run;
   print r.STRAFreq;
   run;

   loadactionset "fedsql";
   execdirect casout={name="&dsOut", replace=true}
   query="
      select docDS.*, &clusterVar
      from &docDS docDS, sampling_out
      where docDS.&docVar=sampling_out.&docVar
        and _PartInd_=1;
   ";
   run;
quit;

proc cas;
   table.dropTable name="sampling_out" quiet=true; run;
quit;
%mend strSampleByCluster;

/*-------------------------------------*/
/* Start CAS server                    */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
caslib _all_ assign;

/*-------------------------------------*/
/* Prepare and load data               */
/*-------------------------------------*/
%let myData=movie_reviews;

proc cas;
   loadtable result=r /
      importOptions={fileType="csv", delimiter='TAB', getnames="true"}
      path="data/movie_reviews.txt"
      casLib="CASUSER"
      casout={name="&myData", replace="true"};
   run;
quit;

/* Browse the data */
proc cas;
   columninfo / table={name="&myData"};
   fetch / table={name="&myData"};
   run;
quit;

/* Generate one unique index using a DATA step */
proc cas;
   datastep.runcode /
      code = "
         data &myData;
            set &myData;
            rename id = _document_;
            keep id text score;
         run;
      ";
   run;
quit;

/* Create stop list */
data sascas1.stopList;
   set sashelp.engstop;
run;

/* Get tfidf by term by document */
%getTfidf(
   dsIn=&myData,
   docVar=_document_,
   textVar=text,
   language=english,
   stemming=true,
   stopList=stopList,
   dsOut=doc_term_tfidf
);

/* Document-term matrix */
%DocToVectors(
   dsIn=doc_term_tfidf,
   docVar=_document_,
   termVar=_term_,
   tfVar=_tfidf_,
   dimSize=3000,
   dsOut=doc_vectors
);

/* Cluster documents */
%clusterDocs(
   dsIn=doc_vectors,
   nClusters=50,
   seed=12345,
   dsOut=doc_clusters
);

/* Get top-k words of each cluster */
%clusterProfile(
   termDS=doc_term_tfidf,
   clusterDS=doc_clusters,
   docVar=_document_,
   termVar=_term_,
   tfVar=_tfidf_,
   clusterVar=_cluster_id_,
   topk=10,
   dsOut=cluster_topk_terms
);

/* Browse topk terms of the first cluster */
proc cas;
   fetch / table={name="cluster_topk_terms", where="_cluster_id_=1"};
   run;
quit;

/* Stratified sampling by cluster */
%strSampleByCluster(
   docDS=&myData,
   docClusterDS=doc_clusters,
   docVar=_document_,
   clusterVar=_cluster_id_,
   seed=12345,
   sampPct=5,
   dsOut=doc_sample_by_cls
);

/* Browse sample documents of the first cluster */
proc cas;
   fetch / table={name="doc_sample_by_cls", where="_cluster_id_=1"};
   run;
quit;
```

How to sample textual data with SAS was published on SAS Users.

This article shows how to implement balanced bootstrap sampling in SAS. The basic bootstrap samples with replacement from the original data (N observations) to obtain B new samples. This is called "uniform" resampling because each observation has a uniform probability of 1/N of being selected at each step of the resampling process. Within the union of the B bootstrap samples, each observation has an expected value of appearing B times.

Balanced bootstrap resampling (Davison, Hinkley, and Schechtman, 1986) is an alternative process in which each observation appears exactly B times in the union of the B bootstrap samples of size N. This has some practical benefits for estimating certain inferential statistics such as the bias and quantiles of the sampling distribution (Hall, 1990).

It is easy to implement a balanced bootstrap resampling scheme: concatenate B copies of the data, randomly permute the B*N observations, and then use the first N observations for the first bootstrap sample, the next N observations for the second sample, and so forth. (Other algorithms are also possible, as discussed by Gleason, 1988.)

Balanced bootstrap samples in SAS

To illustrate the idea, consider the following data set that has N=6 observations. Five observations are clustered near x=0 and the sixth is a large outlier (x=10). The sample skewness for these data is skew=2.316 because of the influence of the outlier.

```
data Sample(keep=x);
input x @@;
datalines;
-1 -0.2 0 0.2 1 10
;

proc means data=Sample skewness;
run;

%let ObsStat = 2.3163714;
```

You can use the bootstrap to approximate the sampling distribution for the skewness statistic for these data. I have previously shown how to use SAS to bootstrap the skewness statistic: Use PROC SURVEYSELECT to form bootstrap samples, use PROC MEANS with a BY statement to analyze the samples, and use PROC UNIVARIATE to analyze the bootstrap distribution of skewness values. In that previous article, PROC SURVEYSELECT is used to perform uniform sampling (sampling with replacement).
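For comparison, the uniform (with-replacement) sampling from that previous approach can be sketched as follows. METHOD=URS requests unrestricted random sampling with replacement, and the OUTHITS option writes one observation for each time a record is selected:

```
/* Uniform bootstrap resampling: B=5000 samples, each of size N */
proc surveyselect data=Sample out=UnifBoot noprint
                  reps=5000            /* B bootstrap samples     */
                  method=urs           /* sample with replacement */
                  samprate=1 outhits;  /* each sample has N rows  */
run;
```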

It is straightforward to modify the previous program to perform balanced bootstrap sampling. The following program is based on a SAS paper by Nils Penard at PhUSE 2012. It does the following:

1. Use PROC SURVEYSELECT to concatenate B copies of the input data.
2. Use the DATA step to generate a uniform random number for each observation.
3. Use PROC SORT to sort the data by the random values. After this step, the N*B observations are in random order.
4. Generate a variable that indicates the bootstrap sample for each observation. Alternatively, reuse the REPLICATE variable from PROC SURVEYSELECT, as shown below.
```
/* balanced bootstrap computation */
proc surveyselect data=Sample out=DupData noprint
                  reps=5000              /* duplicate data B times  */
                  method=SRS samprate=1; /* sample w/o replacement  */
run;

data Permute;
   set DupData;
   call streaminit(12345);
   u = rand("uniform");   /* generate a uniform random number for each obs */
run;

proc sort data=Permute;   /* sort in random order */
   by u;
run;

data BalancedBoot;
   merge DupData(drop=x) Permute(keep=x);   /* reuse REPLICATE variable */
run;
```

You can use the BalancedBoot data set to perform subsequent bootstrap analyses. If you perform a bootstrap analysis, you obtain the following approximate bootstrap distribution for the skewness statistic. The observed statistic is indicated by a red vertical line. For reference, the mean of the bootstrap distribution is indicated by a gray vertical line. You can see that the sampling distribution for this tiny data set is highly nonnormal. Many bootstrap samples that contain the outlier (exactly one-sixth of the samples in a balanced bootstrap) will have a large skewness value.
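The subsequent analysis itself is unchanged from the uniform case. A minimal sketch (assuming that the Replicate variable carried over from PROC SURVEYSELECT identifies the bootstrap samples and that the merged data remain in Replicate order):

```
/* Compute the skewness of each bootstrap sample ... */
proc means data=BalancedBoot noprint;
   by Replicate;              /* one group per bootstrap sample */
   var x;
   output out=BootStats skewness=Skew;
run;

/* ... and analyze the bootstrap distribution of the statistic */
proc univariate data=BootStats;
   var Skew;
   histogram Skew;
run;
```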

To assure yourself that each of the original six observations appears exactly B times in the union of the bootstrap samples, you can run PROC FREQ, as follows:

```
proc freq data=BalancedBoot;   /* OPTIONAL: Show that each obs appears B times */
   tables x / nocum;
run;
```

Balanced bootstrap samples in SAS/IML

As shown in the article "Bootstrap estimates in SAS/IML," you can perform bootstrap computations in the SAS/IML language. For uniform sampling, the SAMPLE function samples with replacement from the original data. However, you can modify the sampling scheme to support balanced bootstrap resampling:

1. Use the REPEAT function to duplicate the data B times.
2. Use the SAMPLE function with the "WOR" option to sample without replacement. The resulting vector is a permutation of the B*N observations.
3. Use the SHAPE function to reshape the permuted data into an N x B matrix for which each column is a bootstrap sample. This form is useful for implementing vectorized computations on the columns.

The following SAS/IML program modifies the program in the previous post to perform balanced bootstrap sampling:

```
/* balanced bootstrap computation in SAS/IML */
proc iml;
use Sample;  read all var "x";  close;
call randseed(12345);

/* Return a row vector of statistics, one for each column. */
start EvalStat(M);
   return skewness(M);   /* <== put your computation here */
finish;

Est = EvalStat(x);                    /* 1. observed statistic for data */

/* balanced bootstrap resampling */
B = 5000;                             /* B = number of bootstrap samples */
allX = repeat(x, B);                  /* replicate the data B times */
s = sample(allX, nrow(allX), "WOR");  /* 2. sample without replacement (=permute) */
s = shape(s, nrow(x), B);             /* reshape to (N x B) */

/* use the balanced bootstrap samples in subsequent computations */
bStat = T( EvalStat(s) );   /* 3. compute the statistic for each bootstrap sample */
bootEst = mean(bStat);      /* 4. summarize bootstrap distrib, such as the mean */
bias = Est - bootEst;       /* estimate of bias */
RBal = Est || bootEst || bias;        /* combine results for printing */
print RBal[format=8.4 c={"Obs" "BootEst" "Bias"}];
```

As shown in the previous histogram, the bias estimate (the difference between the observed statistic and the mean of the bootstrap distribution) is sizeable.

It is worth mentioning that the SAS-supplied %BOOT macro performs balanced bootstrap sampling by default. To generate balanced bootstrap samples with the %BOOT macro, set the BALANCED=1 option, as follows:
`%boot(data=Sample, samples=5000, balanced=1)  /* or omit the BALANCED= option */`
If you want uniform (unbalanced) samples, call the macro as follows:
`%boot(data=Sample, samples=5000, balanced=0)`

In conclusion, it is easy to generate balanced bootstrap samples. Balanced sampling can improve the efficiency of certain bootstrap estimates and inferences. For details, see the previously cited references or Appendix II of Hall (1992).

The post Balanced bootstrap resampling in SAS appeared first on The DO Loop.

Attention SAS administrators! SAS batch jobs run on schedule (or manually) usually produce date-stamped SAS logs, which are essential for automated system maintenance and troubleshooting. Similar log files are created by various SAS infrastructure services (the Metadata Server, mid-tier servers, and so on). However, as time goes on, the relevance of such logs diminishes while the clutter stockpiles. In some cases, this can even lead to disk-space problems.

There are multiple ways to solve this problem, either by deleting older log files or by stashing them away for auditing purposes (zipping and archiving). One solution would be using Unix/Linux or Windows scripts run on schedule. The other is much "SAS-sier."

Let SAS clean up its "mess"

We are going to write SAS code that you can run manually or on schedule and that, for a specified directory (folder), deletes all .log files older than 30 days.
First, we need to capture the contents of that directory, then select the file names with the .log extension, and finally subset that selection to the files whose modification date is earlier than today's date minus 30 days.

Perhaps the easiest way to get the contents of a directory is to submit an operating system command (such as the DOS DIR command) from within SAS, either with the X statement and output redirection or with a FILENAME statement that uses the PIPE device type, e.g.

`x 'dir > dirlist.txt';`
`filename DIRLIST pipe 'dir "C:\Documents and Settings"';`

However, SAS administrators know that in many organizations, due to cyber-security concerns, IT department policies do not allow the X statement: it is disabled by setting the SAS system option to NOXCMD (the negative form of the XCMD system option, which applies on both Windows and Unix). This is usually done system-wide for the whole SAS Enterprise client-server installation via the SAS configuration. In that case, no operating system command can be executed from within SAS. Try running any X statement in your environment; if it is disabled, you will get the following ERROR in the SAS log:

`ERROR: Shell escape is not valid in this SAS session.`

To avoid that potential roadblock, we’ll use a different technique of capturing the contents of a directory along with file date stamps.

Macro to delete old log files in a directory/folder

The following SAS macro cleans up a Unix directory or a Windows folder by removing old .log files. I must admit that this statement is a little misleading: the macro is much more powerful. Not only can it delete old .log files, it can remove ANY file type specified by its extension.

```
%macro mr_clean(dirpath=,dayskeep=30,ext=.log);
data _null_;
   length memname $256;
   deldate = today() - &dayskeep;
   rc = filename('indir',"&dirpath");
   did = dopen('indir');
   if did then
   do i=1 to dnum(did);
      memname = dread(did,i);
      if reverse(trim(memname)) ^=: reverse("&ext") then continue;
      rc = filename('inmem',"&dirpath/"!!memname);
      fid = fopen('inmem');
      if fid then
      do;
         moddate = input(finfo(fid,'Last Modified'),date9.);
         rc = fclose(fid);
         if . < moddate <= deldate then rc = fdelete('inmem');
      end;
   end;
   rc = dclose(did);
   rc = filename('inmem');
   rc = filename('indir');
run;
%mend mr_clean;
```

This macro has 3 parameters:

• dirpath - directory path (required);
• dayskeep - days to keep (optional, default 30);
• ext - file extension (optional, default .log).

This macro works in both Windows and Linux/Unix environments. Please note that dirpath and ext parameter values are case-sensitive.

Here are examples of the macro invocation:

1. Using defaults

```
%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean)
```

With this macro call, all files with extension .log (default) which are older than 30 days (default) will be deleted from the specified directory.

2. Using default extension

```
%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean,dayskeep=20)
```

With this macro call, all files with extension .log (default) which are older than 20 days will be deleted from the specified directory.

3. Using explicit parameters

```
%let dir_to_clean = C:\PROJECTS\Automatically deleting old SAS logs\Logs;
%mr_clean(dirpath=&dir_to_clean,dayskeep=10,ext=.xls)
```

With this macro call, all files with extension .xls (Excel files) which are older than 10 days will be deleted from the specified directory.

Old file deletion SAS macro code explanation

The above SAS macro's logic and actions are contained within a single DATA _NULL_ step. First, we calculate the cutoff date before which files will be deleted: deldate = today() - &dayskeep. Then we assign the fileref indir to the specified directory &dirpath:

`rc = filename('indir',"&dirpath");`

Then we open that directory:

`did = dopen('indir');`

and if it opened successfully (did>0), we loop through its members, which can be either files or subdirectories:

`do i=1 to dnum(did);`

In that loop, first we grab the directory member name:

`memname = dread(did,i);`

and look for our candidates for deletion, i.e., determine whether that name (memname) ends with "&ext". To do that, we reverse both character strings and compare their leading characters. If they don’t match (the ^=: operator), then we are not going to touch that member: the continue statement skips to the end of the loop. If they do match, the member name does end with "&ext" and it is a candidate for deletion. We assign the fileref inmem to that member:

`rc = filename('inmem',"&dirpath/"!!memname);`
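The reverse-and-compare idiom is worth a second look. Here is a minimal Python sketch (hypothetical, purely for illustration) of the same suffix test; reversing both strings turns a suffix comparison into a prefix comparison, which is the role the SAS =: truncated-comparison operator plays:

```python
def ends_with_ext(memname, ext=".log"):
    """Reverse both strings and compare prefixes, mirroring the SAS
    test: reverse(trim(memname)) =: reverse("&ext")."""
    r_name = memname.strip()[::-1]
    r_ext = ext[::-1]
    # startswith plays the role of SAS's =: (truncated) comparison
    return r_name.startswith(r_ext)

print(ends_with_ext("batch_20170101.log"))   # True
print(ends_with_ext("batch_20170101.lst"))   # False
```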

Note that the forward slash (/), the Unix/Linux path separator in that FILENAME function call, is also a valid path separator in Windows. Windows converts it to a backslash (\) for display purposes, but it accepts the forward slash as a valid path separator alongside the backslash.
Then we open that file using the FOPEN function:

`fid = fopen('inmem');`

If inmem is a directory, the open fails (fid=0) and we skip the following DO group, which is responsible for the file deletion. If it is a file and it opens successfully (fid>0), then we enter the deletion DO group, where we first grab the file's Last Modified date as moddate, close the file, and, if moddate is not missing and moddate <= deldate, delete that file:

`rc = fdelete('inmem');`

Then we close the directory and un-assign filerefs for the members and directory itself.
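For readers who prefer to see the control flow outside of SAS conventions, here is a rough Python equivalent of the whole step (an illustrative sketch, not part of the original post; the function name merely mirrors the macro):

```python
import os
import time

def mr_clean(dirpath, dayskeep=30, ext=".log"):
    """Delete files in dirpath whose names end with ext and whose
    last-modified time is at least dayskeep days in the past."""
    cutoff = time.time() - dayskeep * 86400   # seconds per day
    deleted = []
    for entry in os.scandir(dirpath):
        if not entry.is_file():               # skip subdirectories
            continue
        if not entry.name.endswith(ext):      # wrong extension
            continue
        if entry.stat().st_mtime <= cutoff:   # old enough to delete
            os.remove(entry.path)
            deleted.append(entry.name)
    return deleted
```

As in the macro, directory members that are not regular files are skipped, and only the extension and the age of the file determine deletion.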

Deleting old files across multiple directories/folders

The %mr_clean macro is flexible enough to address various SAS administrators' needs. You can use it to delete old files of various types across multiple directories/folders. First, let’s create a driver table as follows:

```
data delete_instructions;
   length days 8 extn $9 path $256;
   infile datalines truncover;
   input days 1-2 extn $ 4-12 path $ 14-270;
   datalines;
30 .log C:\PROJECTS\Automatically deleting old files\Logs1
20 .log C:\PROJECTS\Automatically deleting old files\Logs2
25 .txt C:\PROJECTS\Automatically deleting old files\Texts
35 .xls C:\PROJECTS\Automatically deleting old files\Excel
30 .sas7bdat C:\PROJECTS\Automatically deleting old files\SAS_Backups
;
```

This driver table specifies how many days to keep files of certain extensions in each directory. In this example, perhaps the most beneficial deletion applies to the SAS_Backups folder, since it contains SAS data tables (extension .sas7bdat). Data files are typically much larger than SAS log files, so deleting them frees up much more of the valuable disk space.

Then we can use this driver table to loop through its observations and dynamically build macro invocations using CALL EXECUTE:

```
data _null_;
   set delete_instructions;
   s = cats('%nrstr(%mr_clean(dirpath=',path,',dayskeep=',days,',ext=',extn,'))');
   call execute(s);
run;
```

Alternatively, we can use the DOSUBL() function to execute our macro dynamically at every iteration of the driver table:

```
data _null_;
   set delete_instructions;
   s = cats('%mr_clean(dirpath=',path,',dayskeep=',days,',ext=',extn,')');
   rc = dosubl(s);
run;
```

Put it on autopilot

When it comes to cleaning up your old files (logs, backups, etc.), the best practice for SAS administrators is to schedule the cleaning job to run automatically on a regular basis. Then you can forget about this chore around your "SAS house," as the %mr_clean macro will do it quietly for you, without the noise and fuss of a Roomba.

Would you use this approach in your SAS environment? Any suggestions for improvement? How do you deal with old log files? Other old files? Please share below.

SAS administrators tip: Automatically deleting old SAS logs was published on SAS Users.

The Base SAS DATA step has been a powerful tool for many years for SAS programmers. But as data sets grow and programmers work with massively parallel processing (MPP) computing environments such as Teradata, Hadoop or the SAS High-Performance Analytics grid, the data step remains stubbornly single-threaded. Welcome DS2 – [...]

The post What DS2 can do for the DATA step appeared first on SAS Learning Post.

My colleague Robert Allison recently blogged about using the diameter of Texas as a unit of measurement. The largest distance across Texas is about 801 miles, so Robert wanted to find the set of all points such that the distance from the point to Texas is less than or equal to 801 miles.

Robert's implementation was complicated by the fact that he was interested in points on the round earth that are within 801 miles from Texas as measured along a geodesic. However, the idea of "thickening" or "inflating" a polygonal shape is related to a concept in computational geometry called the offset polygon or the inflated polygon. A general algorithm to inflate a polygon is complicated, but this article demonstrates the basic ideas that are involved. This article discusses offset regions for convex and nonconvex polygons in the plane. The article concludes by drawing a planar region for a Texas-shaped polygon that has been inflated by the diameter of the polygon. And, of course, I supply the SAS programs for all computations and images.

Offset regions as a union of circles and rectangles

Assume that a simple polygon is defined by listing its vertices in counterclockwise order. (Recall that a simple polygon is a closed, nonintersecting shape that has no holes.) You can define the offset region of radius r as the union of the following shapes:

• The polygon itself
• Circles of radius r centered at each vertex
• Rectangles of width r positioned outside the polygon along each edge

The following graphic shows the offset region (r = 0.5) for a convex "house-shaped" polygon. The left side of the image shows the polygon with an overlay of circles centered at each vertex and outward-pointing rectangles along each edge. The right side of the graphic shows the union of the offset regions (blue) and the original polygon (red):

The image on the right shows why the process is sometimes called "inflating" a polygon. For a convex polygon, the edges are pushed out by a distance r and the vertices become rounded. For large values of r, the offset region becomes a nearly circular blob, although the boundary is always the union of line segments and arcs of circles.

You can draw a similar image for a nonconvex polygon. The inflated region near a convex (left turning) vertex looks the same as before. However, for the nonconvex (right turning) vertices, the circles do not contribute to the offset region. Computing the offset region for a nonconvex polygon is tricky because if the distance r is greater than the minimum distance between vertices, nonlocal effects can occur. For example, the following graphic shows a nonconvex polygon that has two "prongs." Let r0 be the distance between the prongs. When you inflate the polygon by an amount r > r0/2, the offset region can contain a hole, as shown. Furthermore, the boundary of the offset regions is not a simple polygon. For larger values of r, the hole can disappear. This demonstrates why it is difficult to construct the boundary of an offset region for nonconvex polygons.

Inflating a Texas-shaped polygon

The shape of the Texas mainland is nonconvex. I used PROC GREDUCE on the MAPS.US data set in SAS to approximate the shape of Texas by a 36-sided polygon. The polygon is in a standardized coordinate system and has a diameter (maximum distance between vertices) of r = 0.2036. I then constructed the inflated region by using the same technique as shown above. The polygon and its inflated region are shown below.

The image on the left, which shows 36 circles and 36 rectangles, is almost indecipherable. However, the image on the right is almost an exact replica of the region that appears in Robert Allison's post. Remember, though, that the distances in Robert's article are geodesic distances on a sphere whereas these distances are Euclidean distances in the plane. For the planar problem, you can classify a point as within the offset region by testing whether it is inside the polygon itself, inside any of the 36 rectangles, or within a distance r of a vertex. That computation is relatively fast because it is linear in the number of vertices in the polygon.
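That classification test can be sketched in a few lines of Python (an illustrative sketch, not the SAS/IML code from the post). A convenient simplification: the set of points within distance r of an edge segment is exactly that edge's outward rectangle plus the two end circles, so a point is in the offset region exactly when it lies inside the polygon or within distance r of some edge segment:

```python
import math

def point_in_polygon(p, poly):
    """Ray-casting test for a simple polygon given as a vertex list."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the horizontal ray
            xcross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < xcross:
                inside = not inside
    return inside

def dist_to_segment(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    px, py = p; ax, ay = a; bx, by = b
    vx, vy = bx - ax, by - ay
    wx, wy = px - ax, py - ay
    t = (wx * vx + wy * vy) / (vx * vx + vy * vy)
    t = max(0.0, min(1.0, t))        # clamp projection to the segment
    return math.hypot(px - (ax + t * vx), py - (ay + t * vy))

def in_offset_region(p, poly, r):
    """True if p is inside poly or within distance r of its boundary."""
    if point_in_polygon(p, poly):
        return True
    n = len(poly)
    return any(dist_to_segment(p, poly[i], poly[(i + 1) % n]) <= r
               for i in range(n))
```

As the post notes, the test is linear in the number of vertices of the polygon.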

I don't want to dwell on the computation, but I do want to mention that it requires fewer than 20 SAS/IML statements! The key part of the computation uses vector operations to construct the outward-facing normal vector of length r for each edge of the polygon. If v is the vector that connects the ith and (i+1)st vertices of the polygon, then the outward-facing normal vector is given by the concise vector expression r * (v / ||v||) * M, where M is a rotation matrix that rotates by 90 degrees. You can download the SAS program that computes all the images in this article.
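To make the rotation concrete, here is a hypothetical Python version of that normal-vector computation (the post's actual code is SAS/IML). For a counterclockwise-oriented polygon, rotating each edge vector by -90 degrees, which maps (x, y) to (y, -x), points away from the interior:

```python
import math

def outward_normal(a, b, r):
    """Outward-facing normal of length r for the directed edge a->b of a
    counterclockwise polygon: rotate v by -90 degrees, (x, y) -> (y, -x),
    then scale to length r.  This matches r * (v/||v||) * M."""
    vx, vy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(vx, vy)
    return (r * vy / length, -r * vx / length)

# The bottom edge of a CCW unit square runs from (0,0) to (1,0);
# the outward direction is straight down.
print(outward_normal((0, 0), (1, 0), 0.5))  # (0.0, -0.5)
```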

In conclusion, you can use a SAS program to construct the offset region for an arbitrary simple polygon. The offset region is the union of circles, rectangles, and the original polygon, which means that it is easy to test whether an arbitrary point in the plane is in the offset region. That is, you can test whether any point is within a distance r to an arbitrary polygon.

The post Offset regions: Find all points within a specified distance from a polygon appeared first on The DO Loop.

One of the biggest challenges we’ve been hearing from customers lately is that they need help operationalizing analytics to extend the value of their modeling efforts. In many cases, they’ve hired smart data analysts who can transform data and create sophisticated models, but their work is used for a single [...]

7 tips for operationalizing analytics was published on SAS Voices by Randy Guard