CAS

August 13, 2018
 

Keeping data in the cloud makes it easily accessible and can help businesses run more smoothly. SAS Viya runs its calculations on Cloud Analytic Services (CAS). David Shannon of Amadeus Software spoke at SAS Global Forum 2018 and presented his paper, Come On, Baby, Light my SAS Viya: Programming for CAS. (In addition to being an avid SAS user and partner, David must be an avid Doors fan.) This article summarizes David's overview of how to run SAS programs in SAS Viya and how to use CAS sessions and libraries.

If you're using SAS Viya, you need to know the basics of CAS to perform calculations and use SAS Viya to its full potential. SAS 9 programs are compatible with SAS Viya, and will run as-is through the CAS engine.

Using CAS sessions and libraries

Use a CAS statement to kick off a session, then use CAS libraries (caslibs) to store data and resources. To start the session, simply code "cas;". Each CAS session is given its own universally unique identifier (UUID) that you can use to reconnect to the session.


There are a few key statements that can help you master CAS operations. Consider these examples, based on a CAS session that David labeled "speedyanalytics":

  • What CAS sessions do I have running?
    cas _all_ list;
  • Get the version and license specifics from the CAS server hosting my session:
    cas speedyanalytics listabout;
  • I want to sign out of SAS Studio for now, so I will disconnect from my CAS session, but return to it later…
    cas speedyanalytics disconnect;
  • ...later, in the same or a different SAS Studio session, I want to reconnect to the CAS session I started earlier, using the UUID I previously grabbed from the macro variable or SAS log:
    cas uuid="&speedyanalytics_uuid";
  • At the end of my program(s), shut down all my CAS sessions to release resources on the server:
    cas _all_ terminate;
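
Put together, the lifecycle of a single session might look like the sketch below. It assumes the UUIDMAC= option is used to capture the UUID; the macro variable name and the 30-minute TIMEOUT= value are illustrative choices, not taken from David's paper. The macro variable only survives within the same SAS session, so copy the UUID from the log if you plan to reconnect from a different one.

```sas
/* Start a named session and keep its UUID in a macro variable       */
/* (the macro variable name is an illustrative assumption).          */
/* TIMEOUT= (in seconds) controls how long a disconnected session    */
/* is kept alive before the server terminates it.                    */
cas speedyanalytics uuidmac=speedyanalytics_uuid sessopts=(timeout=1800);
%put NOTE: Session UUID is &speedyanalytics_uuid;

cas speedyanalytics list;         /* show the session's properties           */
cas speedyanalytics disconnect;   /* leave the session running on the server */

/* Later: reconnect by UUID, then release the resources when finished. */
cas speedyanalytics uuid="&speedyanalytics_uuid";
cas speedyanalytics terminate;
```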

Using CAS libraries

CAS libraries (caslibs) are the way to access data that is stored in memory, along with its related metadata.

From the library, you can load data into CAS tables in a few different ways:

  1. A DATA step can take a sample data set, calculate a new measure, and store the output in memory
  2. Proc COPY can bring existing SAS data into a caslib
  3. Proc CASUTIL loads tables into caslibs
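
The three approaches above can be sketched as follows. This is a minimal illustration rather than the code from David's paper: the libref name mycas, the choice of the casuser caslib, the SASHELP source tables, and the body-mass-index measure are all assumptions made for the example.

```sas
cas;                                  /* start a default CAS session            */
libname mycas cas caslib=casuser;     /* libref name and caslib are assumptions */

/* 1. DATA step: read a sample data set, add a measure, store the result in memory */
data mycas.classsi;
   set sashelp.class;
   /* illustrative measure: convert pounds and inches to SI units */
   bmi = (weight * 0.4536) / ((height * 0.0254) ** 2);
run;

/* 2. PROC COPY: bring an existing SAS data set into the caslib */
proc copy in=sashelp out=mycas memtype=data;
   select cars;
run;

/* 3. PROC CASUTIL: load a table into the caslib */
proc casutil;
   load data=sashelp.heart casout="heart" outcaslib="casuser" replace;
quit;
```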

Proc CASUTIL also lets you save your tables (named "classsi" in David's examples) for future use with the SAVE statement:

proc casutil;
 save casdata="classsi" casout="classsi";
run;

And reload like this in a future session, using the LOAD statement:

proc casutil;
 load casdata="classsi" casout="classsi";
run;

When accessing your CAS libraries, remember that there are multiple levels of scope that can apply. "Session" refers to data from just the current session, whereas "Global" allows you to reach data from all CAS sessions.
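
As a hedged illustration of the two scopes, the sketch below loads a table with session scope and then promotes it to global scope with the PROMOTE statement of Proc CASUTIL. The table and caslib names are assumptions, and the promotion fails if a global table of the same name already exists in the target caslib.

```sas
cas;   /* start a default CAS session */

proc casutil;
   /* Loaded tables have session scope by default: only this session sees them */
   load data=sashelp.class casout="classsi" outcaslib="casuser" replace;

   /* Promote the table to global scope so other sessions can reach it */
   promote casdata="classsi" incaslib="casuser" outcaslib="casuser";

   /* List the tables in the caslib to confirm the result */
   list tables incaslib="casuser";
quit;
```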

Programming in CAS

Showing how to put CAS into action, David shared this diagram of a typical load/save/share flow:

Existing SAS 9 programs and CAS code can both be run in SAS Viya. The calculations and in-memory data storage happen in CAS, Cloud Analytic Services. Before you begin, it's important to have a general understanding of CAS so that you can access CAS libraries and your data. For more about CAS architecture, read this paper from CAS developer Jerry Pendergrass.

The performance case for SAS Viya

To close out his paper, David outlined a small experiment he ran to demonstrate the performance advantages of SAS Viya v3.3 over a standard, stand-alone SAS v9.4 environment. The test was basic, but it performed reads, writes, and analytics on a 5GB table. The tests showed CAS performing about 50 percent better than SAS 9 (see the paper for a detailed table of comparison metrics). SAS Viya is engineered for distributed computing (which works especially well in cloud deployments), so more extensive tests could well reveal even larger performance gains in many use cases.

Additional resources

A quick introduction to CAS in SAS Viya was published on SAS Users.

July 18, 2018
 

Last year, when I went through the SAS Global Forum 2017 paper list, the paper Breaking through the Barriers: Innovative Sampling Techniques for Unstructured Data Analysis impressed me a lot. In this paper, the author raised the common problems caused by traditional sampling methods and proposed four sampling methods for textual data. Recently my team has been working on a project in which we face a huge volume of documents from a specific field, and we need linguists and domain experts to analyze the textual data and annotate ground truth. So our first question is which documents we should start with to get a panoramic picture of the data with minimum effort. Frankly, I don’t have a state-of-the-art method to extract representative documents and measure its effect, so why not try this innovative technique?

The paper proposed four sampling methods, and I tried only the first, which uses cluster memberships as strata. Before we step into the details of the SAS program, let me introduce the steps of this method.

  • Step 1: Parse textual data into tokens and calculate each term's TF-IDF value
  • Step 2: Generate term-by-document matrix
  • Step 3: Cluster documents through k-means algorithm
  • Step 4: Get top k terms of each cluster
  • Step 5: Do stratified sampling by cluster
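
For reference, the weight that the getTfidf macro in the appendix computes in Step 1 can be written as the formula below. Here tf(t, d) is the count of term t in document d (the _tf_ column), N is the total number of documents, and df(t) is the number of documents that contain the term (the _NumDocs_ column). The logarithm is natural because the SAS LOG function is used; the symbol names are notation chosen for this note.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\!\left(\frac{N}{\mathrm{df}(t)}\right)
```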

I wrote a SAS macro for each step so that you can check the results step by step. If you are not satisfied with the final cluster result, you can tune the parameters of any step and re-run that step and the steps after it. Now let's see how to do this using SAS Viya to extract samples from a movie review data set.

The movie review data has 11,855 observations and 200,963 tokens. After removing stop words, there are 18,976 terms. In this example, I set the dimension size of the term-by-document matrix to 3000. This means that I use the 3000 terms with the highest TF-IDF values in the document collection as its dimensions. Then I use k-means clustering to group documents into K clusters, and I set the maximum K to 50 with the kClus action in CAS. The dataSegment action can cluster documents directly, but it cannot choose the best K; you need to try it with different K values and choose the best K yourself. In contrast, the kClus action chooses the best K automatically among the K values defined by the minimum K and maximum K, so I use the kClus action in my implementation.

After running the program (full code at the end of this post), I got 39 clusters. Table-1 shows the top 10 terms of the first cluster.

Table-1 Top 10 terms of Cluster 1

Let's see what samples we get for the first cluster. I got 7 documents, and each document contains either the term "predictable" or the term "emotional."

Samples from cluster

I set sampPct to 5, which means 5% of the data is randomly selected from each cluster. In the end I got 582 sample documents. Let's check the sample distribution of each cluster.

Donut chart of cluster samples

This clustering method helped us intelligently select a small set of documents from a large document collection. Most importantly, it saved us a lot of time and helped us hit the mark.

I haven't had a chance to try the other three sampling methods from the paper; I encourage you to give them a try and share your experiences with us. Big thanks to my colleague Murali Pagolu for sharing this innovative technique during the SAS Global Forum 2017 conference and for kindly providing me with some good suggestions.

Appendix: Complete code for text sampling

 
/*-------------------------------------------*/
/* Sampling textual data based on clustering */
/*-------------------------------------------*/
 
 
/*-------------------------------------*/
/* Get tfidf                           */
/*-------------------------------------*/
%macro getTfidf(
   dsIn=, 
   docVar=, 
   textVar=, 
   language=, 
   stemming=true, 
   stopList=, 
   dsOut=
);
proc cas;
textparse.tpParse /
   docId="&docVar"
   documents={name="&dsIn"}
   text="&textVar"
   language="&language"
   cellWeight="NONE"
   stemming=false   /* hard-coded: the macro's stemming= parameter is not applied here */
   tagging=false
   noungroups=false
   entities="none"
   offset={name="tpparse_out",replace=TRUE}
;
run;
 
textparse.tpAccumulate /
   offset={name="tpparse_out"}
   stopList={name="&stopList"}
   termWeight="NONE"
   cellWeight="NONE"
   reduce=1
   parent={name="tpAccu_parent",replace=TRUE}
   terms={name="tpAccu_term",replace=TRUE}
   showdroppedterms=false
;
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_term_stat", replace=true} 
query="
      select tpAccu_parent.&docVar, 
             tpAccu_term._term_,
             tpAccu_parent._count_ as _tf_,
             tpAccu_term._NumDocs_
      from tpAccu_parent
      left join tpAccu_term
      on tpAccu_parent._Termnum_=tpAccu_term._Termnum_;
"
;
run;
 
simple.groupBy / 
   table={name="tpAccu_parent"}
   inputs={"&docVar"}
   casout={name="doc_nodup", replace=true};
run;
 
numRows result=r / 
   table={name="doc_nodup"};
totalDocs = r.numrows;
run;
 
datastep.runcode /
code = "
   data &dsOut;
      set doc_term_stat;"
   ||"_tfidf_ = _tf_*log("||totalDocs||"/_NumDocs_);"
   ||"run;
";
run;
quit;
 
proc cas;
   table.dropTable name="tpparse_out" quiet=true; run;
   table.dropTable name="tpAccu_parent" quiet=true; run;
   table.dropTable name="tpAccu_term" quiet=true; run;
   table.dropTable name="doc_nodup" quiet=true; run;
   table.dropTable name="doc_term_stat" quiet=true; run;
quit;
%mend getTfidf;
 
 
/*-------------------------------------*/
/* Term-by-document matrix             */
/*-------------------------------------*/
%macro DocToVectors(
   dsIn=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   dimSize=500, 
   dsOut=
);
proc cas;
simple.summary /
   table={name="&dsIn", groupBy={"&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="term_tf_sum", replace=true};
run;
 
simple.topk / 
   table={name="term_tf_sum"}  
   inputs={"&termVar"} 
   topk=&dimSize
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name='termnum_top', replace=true};
run;
 
loadactionset "fedsql";
execdirect casout={name="doc_top_terms", replace=true} 
query="
      select termnum.*, _rank_
      from &dsIn termnum, termnum_top
      where termnum.&termVar=termnum_top._Charvar_
        and &tfVar!=0;
"
;
run;
 
transpose.transpose /
   table={name="doc_top_terms", 
          groupby={"&docVar"}, 
          computedVars={{name="_name_"}},
          computedVarsProgram="_name_='_dim'||strip(_rank_)||'_';"}  
   transpose={"&tfVar"}
   casOut={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="term_tf_sum" quiet=true; run;
   table.dropTable name="termnum_top" quiet=true; run;
   table.dropTable name="termnum_top_misc" quiet=true; run;
   table.dropTable name="doc_top_terms" quiet=true; run;
quit;
%mend DocToVectors;
 
 
/*-------------------------------------*/
/* Cluster documents                   */
/*-------------------------------------*/
%macro clusterDocs(
   dsIn=, 
   nClusters=10,
   seed=12345,   
   dsOut=
);
proc cas;
/*get the vector variables list*/
columninfo result=collist /
   table={name="&dsIn"};
ndimen=dim(collist['columninfo']);
vector_columns={};
j=1;
do i=1 to ndimen;
   thisColumn = collist['columninfo'][i][1];
   if lowcase(substr(thisColumn, 1, 4))='_dim' then do;
      vector_columns[j]= thisColumn;
      j=j+1;
   end;
end;
run;
 
clustering.kClus / 
   table={name="&dsIn"},
   nClusters=&nClusters,
   init="RAND",
   seed=&seed,
   inputs=vector_columns,
   distance="EUCLIDEAN",
   printIter=false,
   impute="MEAN",
   standardize='STD',
   output={casOut={name="&dsOut", replace=true}, copyvars="ALL"}
;
run;
quit;
%mend clusterDocs;
 
 
/*-------------------------------------*/
/* Get top-k words of each cluster     */
/*-------------------------------------*/
%macro clusterProfile(
   termDS=, 
   clusterDS=, 
   docVar=, 
   termVar=, 
   tfVar=, 
   clusterVar=_CLUSTER_ID_, 
   topk=10, 
   dsOut=
);
proc cas;
loadactionset "fedsql";
execdirect casout={name="cluster_terms",replace=true} 
query="
      select &termDS..*, &clusterVar
      from &termDS, &clusterDS
      where &termDS..&docVar = &clusterDS..&docVar;
"
;
run;
 
simple.summary /
   table={name="cluster_terms", groupBy={"&clusterVar", "&termVar"}}
   inputs={"&tfVar"}
   summarySubset={"sum"}
   casout={name="cluster_terms_sum", replace=true};
run;
 
simple.topk / 
   table={name="cluster_terms_sum", groupBy={"&clusterVar"}}  
   inputs={"&termVar"} 
   topk=&topk
   bottomk=0 
   raw=True 
   weight="_Sum_"
   casout={name="&dsOut", replace=true};
run;
quit;
 
proc cas;
   table.dropTable name="cluster_terms" quiet=true; run;
   table.dropTable name="cluster_terms_sum" quiet=true; run;
quit;
%mend clusterProfile;
 
 
/*-------------------------------------*/
/* Stratified sampling by cluster      */
/*-------------------------------------*/
%macro strSampleByCluster(
   docDS=, 
   docClusterDS=, 
   docVar=, 
   clusterVar=_CLUSTER_ID_, 
   seed=12345,   
   sampPct=, 
   dsOut=
);
proc cas;
loadactionset "sampling";
stratified result=r /
   table={name="&docClusterDS", groupby={"&clusterVar"}}
   sampPct=&sampPct 
   partind="TRUE" 
   seed=&seed
   output={casout={name="sampling_out",replace="TRUE"},
                   copyvars={"&docVar", "&clusterVar"}};
run;
print r.STRAFreq; run;
 
loadactionset "fedsql";
execdirect casout={name="&dsOut", replace=true} 
query="
   select docDS.*, &clusterVar
   from &docDS docDS, sampling_out
   where docDS.&docVar=sampling_out.&docVar
     and _PartInd_=1;
"
;
run;
quit;
 
proc cas;
   table.dropTable name="sampling_out" quiet=true; run;
quit; 
%mend strSampleByCluster;
 
/*-------------------------------------*/
/* Start CAS Server.                   */
/*-------------------------------------*/
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
caslib _all_ assign;
 
/*-------------------------------------*/
/* Prepare and load data.              */
/*-------------------------------------*/
%let myData=movie_reviews;
 
proc cas;
loadtable result=r / 
   importOptions={fileType="csv", delimiter='TAB',getnames="true"}
   path="data/movie_reviews.txt"
   casLib="CASUSER"
   casout={name="&myData", replace="true"} ;
run;
quit;
 
/* Browse the data */
proc cas;
   columninfo / table={name="&myData"};
   fetch / table = {name="&myData"};
run;
quit;
 
/* generate one unique index using data step */
proc cas;
datastep.runcode /
code = "
   data &myData;
      set &myData;
      rename id = _document_;
      keep id text score;  
   run;
";
run;
quit;
 
/* create stop list*/
data sascas1.stopList;
   set sashelp.engstop;
run;
 
/* Get tfidf by term by document */
%getTfidf(
   dsIn=&myData, 
   docVar=_document_, 
   textVar=text, 
   language=english, 
   stemming=true, 
   stopList=stopList, 
   dsOut=doc_term_tfidf
);
 
/* document-term matrix */
%DocToVectors(
   dsIn=doc_term_tfidf, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   dimSize=3000, 
   dsOut=doc_vectors
);
 
/* Cluster documents */
%clusterDocs(
   dsIn=doc_vectors, 
   nClusters=50, 
   seed=12345,   
   dsOut=doc_clusters
);
 
/* Get top-k words of each cluster */
%clusterProfile(
   termDS=doc_term_tfidf, 
   clusterDS=doc_clusters, 
   docVar=_document_, 
   termVar=_term_, 
   tfVar=_tfidf_, 
   clusterVar=_cluster_id_, 
   topk=10, 
   dsOut=cluster_topk_terms
);
 
/* Browse topk terms of the first cluster */
proc cas;
fetch / 
   table={name="cluster_topk_terms",
          where="_cluster_id_=1"};
run;
quit;
 
/* Stratified sampling by cluster      */
%strSampleByCluster(
   docDS=&myData, 
   docClusterDS=doc_clusters, 
   docVar=_document_, 
   clusterVar=_cluster_id_, 
   seed=12345,   
   sampPct=5,
   dsOut=doc_sample_by_cls
);
 
/* Browse sample documents of the first cluster */
proc cas;
fetch / 
   table={name="doc_sample_by_cls",
          where="_cluster_id_=1"};
run;
quit;

How to sample textual data with SAS was published on SAS Users.

June 22, 2018
 

Referencing tables

In the SAS language, one references a table in code by using either a two-level name or a one-level name. With a two-level name, one supplies the libref as well as the table name, e.g. PROC MEANS DATA = WORK.TABLE;. By default, all one-level names refer to SASWORK, e.g. PROC MEANS DATA = TABLE;

CAS

To reference CAS tables using a one-level name, we will issue two statements that alter which libref houses the tables referenced by one-level names. First, we create a CAS libref (line 77 in Figure 1), followed by the “options user” statement on line 80. It is line 80 that changes the default location from SASWORK to the CAS libref, i.e. CASWORK.

Figure 1. Statements to alter the default location for one-level names
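
As a rough sketch of the two statements described above, assuming a session named mySess and the casuser caslib (neither name is taken from the figure):

```sas
cas mySess;                           /* start a CAS session (name is illustrative)      */
libname caswork cas caslib=casuser;   /* CAS libref; the caslib choice is an assumption  */
options user=caswork;                 /* one-level names now resolve to CASWORK          */
```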

How to reference one-level CAS tables

From this point on, all one-level names referenced in code will be tables managed by CAS. In Figure 2 we create a one-level CAS table called baseball by reading the two-level named table SASHELP.BASEBALL. This step executes in the SAS Programming Runtime Environment (a SAS Viya based workspace server) and creates the table CASWORK.BASEBALL. Because of the “options user” statement, we can now also reference that table using a one-level name, i.e. BASEBALL.

Figure 2. Loading a two-level named table into a one-level named table that is managed by CAS

In Figure 3 we will use a DATA Step to read a CAS table and write a CAS table using one-level names. We can also see by reviewing the notes in the SAS log that this DATA Step ran in CAS using multiple threads.

Figure 3. DATA Step referencing one-level named tables

In Figure 4 we observe this PROC MEANS is processing the one-level named table BASEBALL. By reviewing the notes in the SAS log we can see this PROC MEANS ran distributed in CAS.

Figure 4. PROC MEANS referencing one-level named CAS table
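
The three steps described for Figures 2 through 4 could look roughly like the sketch below. The session name, the caslib, the output table name baseball_hits, and the derived column are assumptions for illustration; only SASHELP.BASEBALL, the one-level name BASEBALL, and the nRuns column come from the posts on this page.

```sas
cas mySess;
libname caswork cas caslib=casuser;
options user=caswork;

/* Load: read a two-level SAS table, write a one-level CAS table (CASWORK.BASEBALL) */
data baseball;
   set sashelp.baseball;
run;

/* DATA Step: both input and output are one-level CAS tables */
data baseball_hits;
   set baseball;
   if nAtBat > 0 then hitRate = nHits / nAtBat;   /* illustrative derived column */
run;

/* PROC MEANS on the one-level CAS table */
proc means data=baseball;
   var nRuns;
run;
```

Check the notes in the SAS log after each step to confirm where it actually ran.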

By default, one-level names refer to SASWORK, and all tables in SASWORK are automatically deleted when the SAS session ends. When one changes the default location for one-level names, as we just did, it is a best practice to use a PROC DELETE as the last statement in your code to delete all one-level tables managed by CAS (Figure 5).

Figure 5. PROC DELETE deleting all one-level CAS tables
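
A minimal cleanup sketch is shown below. It lists the tables explicitly instead of deleting everything at once, because the exact statement in Figure 5 is not shown here. The table names are assumptions; adjust the list to whatever one-level tables your program created.

```sas
/* Assumes OPTIONS USER=CASWORK is in effect, so these one-level names */
/* refer to CAS tables. The table names are illustrative.              */
proc delete data=baseball baseball_hits;
run;
```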

Conclusion

It is a very common SAS coding technique to read a source table from a non-SAS data store and write it to SASWORK. By using the technique described in this blog, one now has options for where to store one-level named tables. As for me, I prefer storing them in CAS so I benefit from the distributed processing (faster runtimes) that CAS offers.

How to reference CAS tables using a one-level name was published on SAS Users.

June 19, 2018
 

Cloud Analytic Services (CAS) is really exciting. It’s open. It’s multi-threaded. It’s distributed. And, best of all for SAS programmers, it’s SAS. It looks like SAS. It feels like SAS. In fact, you can even run DATA Step in CAS. But, how does DATA Step work in a multi-threaded, distributed context? What’s new? What’s different? If I’m a SAS programming wizard, am I automatically a CAS programming wizard?

In single-threaded SAS DATA Step, you can create a unique ID with the _n_ automatic variable as shown below:

DATA tableWithUniqueID;
SET tableWithOutUniqueID; 
 
        uniqueID = _n_;
 
run;

CAS DATA Step

Creating a unique ID in CAS DATA Step is a bit more complicated. Each thread maintains its own _n_, so if we just use _n_, we’ll get duplicate IDs. Each thread will produce uniqueID values of 1, 2, and so on. When the thread output is combined, we’ll have a bunch of records with a uniqueID of 1, a bunch with a uniqueID of 2, and so on. This is not useful.

To produce a truly unique ID, you need to augment _n_ with something else. The _threadID_ automatic variable can help us get our unique ID as shown below:

DATA tableWithUniqueID;
SET tableWithOutUniqueID;
 
        uniqueID = put(_threadid_,8.) || '_' || put(_n_,8.);
 
run;

While there are surely other ways of doing it, concatenating _threadID_ with _n_ ensures uniqueness because the _threadID_ uniquely identifies a single thread and _n_ uniquely identifies a single row output by that thread.

Aggregation with DATA Step

Now, let’s look at “whole table” aggregation (no BY Groups).

SAS DATA Step

Aggregating an entire table in SAS DATA Step usually looks something like below. We create an aggregator field (totSalesAmt) and then add the detail records’ amount field (SaleAmt) to it as we process each record. Finally, when there are no more records (eof), we output the single aggregate row.

DATA aggregatedTable ;
SET detailedTable end=eof;
 
      retain totSalesAmt 0;
      totSalesAmt = totSalesAmt + SaleAmt;
      keep totSalesAmt;
      if eof then output;
 
run;

CAS DATA Step

While the above code returns one row in single-engine SAS, the same code returns multiple rows in CAS — one per thread. When I ran this code against a table in my environment, I got 28 rows (because CAS used 28 threads in this example).

As with the unique ID logic, producing a total aggregate is just a little more complicated in CAS. To make it work in CAS, we need a post-process step to bring the results together. So, our code would look like this:

DATA aggregatedTable ;
SET detailedTable end=eof;
 
      retain threadSalesAmt 0;
      threadSalesAmt = threadSalesAmt + SaleAmt;
      keep threadSalesAmt;
      if eof then output;
 
run;
 
DATA aggregatedTable / single=yes;
SET aggregatedTable end=eof;
 
      retain totSalesAmt 0;
      totSalesAmt = totSalesAmt + threadSalesAmt;
      if eof then output;
 
run;

In the first data step in the above example, we ran basically the same code as in the SAS DATA Step example. In that step, we let CAS do its distributed, multi-threaded processing because our table is large. Spreading the work over multiple threads makes the aggregation much quicker. After this, we execute a second DATA Step but here we force CAS to use only one thread with the single=yes option. This ensures we only get one output row because CAS only uses one thread. Using a single thread in this case is optimal because we’ll only have a few input records (one per thread from the previous step).

BY-GROUP Aggregation

When a DATA Step uses a BY statement, CAS first groups the data by the BY variables. Individual threads are then assigned to individual BY-Groups. Since each BY-Group is processed by one and only one thread, when we aggregate, we won’t see multiple output rows for a BY-Group. So, there shouldn’t be a need to consolidate the thread results like there was with “whole table” aggregation above.

Consequently, BY-Group aggregation DATA Step code should look exactly the same in CAS and SAS (at least for the basic stuff).
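
As a sketch of what such BY-Group aggregation could look like, the code below reuses the table and amount names from the earlier examples. The grouping column region and the output names are hypothetical. In CAS the input does not need to be sorted first; in single-threaded SAS it would need to be sorted by region beforehand.

```sas
/* Assumes detailedTable has a grouping column named region (hypothetical). */
data aggregatedByRegion;
   set detailedTable;
   by region;

      if first.region then regionSalesAmt = 0;   /* reset at the start of each group */
      regionSalesAmt + SaleAmt;                  /* sum statement retains the total  */
      if last.region then output;                /* one row per BY-Group             */
      keep region regionSalesAmt;

run;
```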

Concluding Thoughts

Coding DATA Step in CAS is very similar to coding DATA Step in SAS. If you’re a wizard in one, you’re likely a wizard in the other. The major difference is accounting for CAS’ massively parallel processing capabilities (which manifest as threads). For more insight into data processing with CAS, check out the SAS Global Forum paper.

Threads and CAS DATA Step was published on SAS Users.

April 19, 2018
 

A very common coding technique SAS programmers use is identifying the largest value for a given column by using the DATA Step BY statement with the DESCENDING option. In this example I wanted to find the largest value of the number of runs (nRuns) for each team in the SASHELP.BASEBALL dataset. Using a SAS workspace server, one would write:


Figure 1. Single Threaded DATA Step in SAS Workspace Server
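
A minimal sketch of this single-threaded pattern follows. It is not necessarily the exact code in Figure 1; the output table names are assumptions.

```sas
/* Sort so that the largest nRuns comes first within each team */
proc sort data=sashelp.baseball out=work.baseball;
   by team descending nRuns;
run;

/* Keep the first row of each team, which holds the largest nRuns */
data work.maxRuns;
   set work.baseball;
   by team descending nRuns;
   if first.team;
   keep team nRuns;
run;
```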

Figure 2 shows the results of the code we ran in Figure 1:

Figure 2. Result from SAS Code Displayed in Figure 1

To run this DATA Step distributed, we will leverage SAS® Cloud Analytic Services in SAS® Viya™. Notice in Figure 3 that there is no need for the PROC SORT step, which is required when running DATA Step single threaded in a SAS workspace server. This is because the DATA Step BY statement in SAS® Cloud Analytic Services groups the data itself. It does not yet support the DESCENDING option, however, so Figure 3 uses a different technique to find the largest value.

Figure 3. Distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™
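
One way to get the same result without DESCENDING is sketched below. This is an assumption about how the technique could work, not a reproduction of Figure 3: it relies on the BY statement in CAS ordering rows by nRuns within each team, so that the last row of each team holds the largest value. Verify that behavior in your own environment. The session, libref, and table names are illustrative.

```sas
cas mySess;
libname caswork cas caslib=casuser;

/* Load the source table into CAS */
data caswork.baseball;
   set sashelp.baseball;
run;

/* Runs in CAS: no PROC SORT, ascending BY, keep the last row per team. */
/* Assumption: rows within each team arrive ordered by nRuns.           */
data caswork.maxRuns;
   set caswork.baseball;
   by team nRuns;
   if last.team;
   keep team nRuns;
run;
```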

Figure 4 shows the results when running distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™.

Figure 4. Results of Distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™

Conclusion

Until the BY statement running in SAS® Cloud Analytic Services in SAS® Viya™ supports DESCENDING, use this technique to ensure your DATA Step runs distributed.

Read more SAS Viya posts.

Read our SAS 9 to SAS Viya whitepaper.

How to Simulate DESCENDING BY Variables in DATA Step Code that Runs Distributed in SAS® Viya™ was published on SAS Users.

April 13, 2018

The release of SAS Viya 3.3 has brought some nice data quality features. In addition to the visual applications like Data Studio or Data Explorer that are part of the Data Preparation offering, one can leverage data quality capabilities from a programming perspective.

For the time being, SAS Viya provides two ways to programmatically perform data quality processing on CAS data:

  • The Data Step Data Quality functions.
  • The profile CAS action.

To use Data Quality programming capabilities in CAS, a Data Quality license is required (or a Data Preparation license which includes Data Quality).

Data Step Data Quality functions

The Data Quality functions currently supported in CAS are listed here and below:

SAS Data Quality 3.3 programming capabilities

They cover casing, parsing, field extraction, gender analysis, identification analysis, match codes and standardize capabilities.

For now, they are only available in the CAS Data Step. You can’t use them in DS2 or in FedSQL.

For a Data Step to run in CAS, certain conditions must be met. These include:

  • Both the input and output data must be CAS tables.
  • All language elements must be supported in the CAS Data Step.
  • Others.

Let’s look at an example:

cas mysession sessopts=(caslib="casuser") ;
 
libname casuser cas caslib="casuser" ;
 
data casuser.baseball2 ;
   length gender $1 mcName parsedValue tokenNames lastName firstName varchar(100) ;
   set casuser.baseball ;
   gender=dqGender(name,'NAME','ENUSA') ;
   mcName=dqMatch(name,'NAME',95,'ENUSA') ;   
   parsedValue=dqParse(name,'NAME','ENUSA') ;
   tokenNames=dqParseInfoGet('NAME','ENUSA') ;
   if _n_=1 then put tokenNames= ;
   lastName=dqParseTokenGet(parsedValue,'Family Name','NAME','ENUSA') ;
   firstName=dqParseTokenGet(parsedValue,'Given Name','NAME','ENUSA') ;
run ;

Here, my input and output tables are CAS tables, and I’m using CAS-enabled statements and functions. So, this will run in CAS, in multiple threads, in massively parallel mode across all my CAS workers on in-memory data. You can confirm this by looking for the following message in the log:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

I’m doing simple data quality processing here:

  • Determine the gender of an individual based on his or her name, with the dqGender function.
  • Create a match code for the name for a later deduplication, with the dqMatch function.
  • Parse the name using the dqParse function.
  • Identify the name of the tokens produced by the parsing function, with the dqParseInfoGet function.
  • Write the token names in the log, the tokens for this definition are:
    Prefix,Given Name,Middle Name,Family Name,Suffix,Title/Additional Info
  • Extract the “Family Name” token from the parsed value, using dqParseTokenGet.
  • Extract the “Given Name” token from the parsed value, again using dqParseTokenGet.

I get the following table as a result:

Performing this kind of data quality processing on huge tables in memory and in parallel is simply awesome!

The dataDiscovery.profile CAS action

This CAS action enables you to profile a CAS table:

  • It offers 2 algorithms, one is faster but uses more memory.
  • It offers multiple options to control your profiling job:
    • Columns to be profiled.
    • Number of distinct values to be profiled (high-cardinality columns).
    • Number of distinct values/outliers to report.
  • It provides identity analysis using regular expressions.
  • It outputs the results to another CAS table.

The resulting table is a transposed table of all the metrics for all the columns. This table requires some post-processing to be analyzed properly.

Example:

proc cas; 
   dataDiscovery.profile /
      algorithm="PRIMARY"
      table={caslib="casuser" name="product_dim"}
      columns={"ProductBrand","ProductLine","Product","ProductDescription","ProductQuality"}
      cutoff=20
      frequencies=10
      outliers=5
      casOut={caslib="casuser" name="product_dim_profiled" replace=true}
   ;
quit ;

In this example, you can see:

  • How to specify the profiling algorithm (quite simple: PRIMARY=best performance, SECONDARY=less memory).
  • How to specify the input table and the columns you want to profile.
  • How to reduce the number of distinct values to process using the cutoff option (it prevents excessive memory use for high-cardinality columns, but might show incomplete results).
  • How to reduce the number of distinct values reported using the frequencies option.
  • How to specify where to store the results (casout).

So, the result is not a report but a table.

The RowId column needs to be matched with

A few comments/cautions on this results table:

  • DoubleValue, DecSextValue, or IntegerValue fields can appear on the output table if numeric fields have been profiled.
  • DecSextValue can contain the mean (metric #1008), median (#1009), standard deviation (#1022) and standard error (#1023) if a numeric column was profiled.
  • It can also contain frequency distributions, maximum, minimum, and mode if the source column is of DecSext data type which is not possible yet.
  • DecSext is a 192-bit fixed-decimal data type that is not supported yet in CAS, and consequently is converted into a double most of the time. Also, SAS Studio cannot correctly render the new CAS data types. As of today, those metrics might not be very reliable.
  • Also, some percentage calculations might be rounded due to the use of integers in the Count field.
  • The legend for metric 1001 is not documented. Here it is:

1: CHAR
2: VARCHAR
3: DATE
4: DATETIME
5: DECQUAD
6: DECSEXT
7: DOUBLE
8: INT32
9: INT64
10: TIME

A last word on the profile CAS action: it can help you perform some identity analysis using patterns defined as regular expressions (this does not use the QKB).

Here is an example:

proc cas; 
   dataDiscovery.profile /
      table={caslib="casuser" name="customers"}
      identities={
         {pattern="PAT=(?999[)-]? ?999[- ]9999",type="USPHONE"}, 
         {pattern= "PAT=^99999[- ]9999$",type="ZIP4"}, 
         {pattern= "PAT=^99999$",type="ZIP"}, 
         {pattern= "[^ @]+@[^ @]+\.[A-Z]{2,4}",type="EMAIL"}, 
         {pattern= "^(?i:A[LKZR]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])$",type="STATE"}
      }
      casOut={caslib="casuser" name="customers_profiled" replace="true"}
   ;
quit ;

In this example that comes from
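The two regular-expression patterns in the identities list above (EMAIL and STATE) can be tried out on sample values before running a profiling job. The sketch below uses Python's `re` module, which is an assumption on my part: the regular-expression engine used by the profile action may differ in details such as case handling, and the `classify` helper and sample values are hypothetical. The `PAT=` patterns use a different pattern syntax and are not covered here.

```python
import re

# Patterns copied from the identities list in the example above.
EMAIL = re.compile(r"[^ @]+@[^ @]+\.[A-Z]{2,4}")
STATE = re.compile(
    r"^(?i:A[LKZR]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]"
    r"|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])$"
)

def classify(value):
    # Return the first identity type whose pattern matches, else None.
    if EMAIL.search(value):
        return "EMAIL"
    if STATE.search(value):
        return "STATE"
    return None

for sample in ["user@example.COM", "nc", "ZZ", "no-at-sign.COM"]:
    print(sample, "->", classify(sample))
```

Note that the EMAIL pattern as written only accepts an upper-case top-level domain in a case-sensitive engine, while the STATE pattern is explicitly case-insensitive through its `(?i:...)` group.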

I hope this post has been helpful.

Thanks for reading.

An overview of SAS Data Quality 3.3 programming capabilities was published on SAS Users.

April 13, 2018

As a follow on from my previous blog post, where we looked at the different use cases for using Kerberos in SAS Viya 3.3, in this post I will delve into more details on the requirements for use case 4, where we use Kerberos authentication throughout both the SAS 9.4 and SAS Viya 3.3 environments. We won’t cover the configuration of this setup as that is a topic too broad for a single blog post.

As a reminder the use case we are considering is shown here:

SAS Viya 3.3 Kerberos Delegation

Here the SAS 9.4 Workspace Server is launched with Kerberos credentials, so the Service Principal for the SAS 9.4 Object Spawner will need to be trusted for delegation. This means that a Kerberos credential for the end-user is available to the SAS 9.4 Workspace Server. The SAS 9.4 Workspace Server can use this end-user Kerberos credential to request a Service Ticket for the connection to SAS Cloud Analytic Services. SAS Cloud Analytic Services, in turn, is provided with a Kerberos keytab and principal it can use to validate this Service Ticket. Validating the Service Ticket authenticates the SAS 9.4 end-user to SAS Cloud Analytic Services. The principal for SAS Cloud Analytic Services must also be trusted for delegation, because we need the SAS Cloud Analytic Services session to have access to the Kerberos credentials of the SAS 9.4 end-user.

These Kerberos credentials made available to SAS Cloud Analytic Services are used for two purposes. First, they are used to make a Kerberized connection to the SAS Viya Logon Manager in order to obtain the SAS Viya internal OAuth token. As a result, the SAS Viya Logon Manager must be configured to accept Kerberos connections. Secondly, the Kerberos credentials of the SAS 9.4 end-user are used to connect to the Secure Hadoop environment.

In this case, since all the various principals are trusted for delegation, our SAS 9.4 end-user can perform multiple authentication hops using Kerberos with each component. This means that through the use of Kerberos authentication the SAS 9.4 end-user is authenticated into SAS Cloud Analytic Services and out to the Secure Hadoop environment.

Reasons for doing it…

To start with, why would we choose this use case? Of all the use cases we considered in the previous blog post, this one provides the strongest authentication between SAS 9.4 Maintenance 5 and SAS Viya 3.3. At no point do we have a username/password combination passing between the SAS 9.4 environment and the SAS Viya 3.3 environment. In fact, the only credential set (username/password) sent over the network in the whole environment is the credential set used by the Identities microservice to fetch user and group information for SAS Viya 3.3, and we could eliminate even that if the LDAP provider supported anonymous binds for fetching user details.

Also, this use case provides true single sign-on from SAS 9.4 Maintenance 5 to SAS Viya 3.3 and all the way out to the Secured Hadoop environment. Each operating system run-time process will be launched as the end-user and no cached or stored username/password combination is required.

High-Level Requirements

At a high-level, we need to have both sides configured for Kerberos delegated authentication. This means both the SAS 9.4 Maintenance 5 and the SAS Viya 3.3 environments must be configured for Kerberos authentication.

The following SAS components and tiers need to be configured:

  • SAS 9.4 Middle-Tier
  • SAS 9.4 Metadata Tier
  • SAS 9.4 Compute Tier
  • SAS Viya 3.3 SAS Logon Manager
  • SAS Viya 3.3 SAS Cloud Analytic Services

Detailed Requirements

First let’s talk about Service Principal Names. We need to have a Service Principal Name (SPN) registered for each of the components/tiers in our list above. Specifically, we need a SPN registered for:

  • HTTP/<HOSTNAME> for the SAS 9.4 Middle-Tier
  • SAS/<HOSTNAME> for the SAS 9.4 Metadata Tier
  • SAS/<HOSTNAME> for the SAS 9.4 Compute Tier
  • HTTP/<HOSTNAME> for the SAS Viya 3.3 SAS Logon Manager
  • sascas/<HOSTNAME> for the SAS Viya 3.3 SAS Cloud Analytic Services

Where the <HOSTNAME> part should be the fully qualified hostname of the machines where the component is running. This means that some of these might be combined, for example if the SAS 9.4 Metadata Tier and Compute Tier are running on the same host we will only have one SPN for both. Conversely, we might require more SPNs, if for example, we are running a SAS 9.4 Metadata Cluster.
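As a planning aid, the list of SPNs to register can be derived from a simple tier-to-host inventory. The Python sketch below is illustrative only: the hostnames are hypothetical, and the `required_spns` helper is not part of any SAS tooling. It shows how tiers that share a service class and host collapse into a single SPN.

```python
def required_spns(tiers):
    # tiers: (service_class, fully_qualified_hostname) pairs, one per tier.
    spns = []
    for service_class, host in tiers:
        spn = f"{service_class}/{host}"
        if spn not in spns:  # tiers sharing class and host need only one SPN
            spns.append(spn)
    return spns

# Hypothetical hosts: Metadata and Compute tiers share one machine.
tiers = [
    ("HTTP", "sas94mid.example.com"),    # SAS 9.4 Middle-Tier
    ("SAS", "sas94srv.example.com"),     # SAS 9.4 Metadata Tier
    ("SAS", "sas94srv.example.com"),     # SAS 9.4 Compute Tier
    ("HTTP", "viya.example.com"),        # SAS Viya 3.3 SAS Logon Manager
    ("sascas", "viya.example.com"),      # SAS Viya 3.3 SAS Cloud Analytic Services
]
for spn in required_spns(tiers):
    print(spn)
```

A metadata cluster would simply add one `("SAS", host)` entry per cluster node, increasing the number of SPNs accordingly.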

The SPN needs to be registered against something. Since our aim is to support single sign-on from the end-user’s desktop we’ll probably be registering the SPNs in Active Directory. In Active Directory we can register against either a user or computer object. For both the SAS 9.4 Metadata and Compute Tier the registration can be performed automatically if the processes run as the local system account on a Microsoft Windows host, in which case it will be against the computer object. Otherwise, and for the other tiers and components, the SPN must be registered manually. Although you can register multiple SPNs against a single object, we recommend that you register each SPN against a separate object.

Since the entire aim of this configuration is to delegate the Kerberos authentication from one tier/component onto the next we need to ensure the objects, namely users or computer objects, are trusted for delegation. The SAS 9.4 Middle-Tier will only support un-constrained delegation, whereas the other tiers and components support Microsoft’s constrained delegation. If you choose to go down the path of constrained delegation you need to specify each and every Kerberos service the object is trusted to delegate authentication to.

Finally, we need to provide a Kerberos keytab for the majority of the tiers/components. The Kerberos keytab will contain the long-term keys for the object the SPN is registered against. The only exceptions are the SAS 9.4 Metadata and Compute Tiers if these are running on Windows hosts.

Conclusion

You can now enable Kerberos delegation across the SAS Platform, using a single strong authentication mechanism across that single platform. As always with configuring Kerberos authentication the prerequisites, in terms of Service Principal Names, service accounts, delegation settings, and keytabs are important for success.

SAS Viya 3.3 Kerberos Delegation from SAS 9.4M5 was published on SAS Users.

March 30, 2018

As a follow on from my previous blog post, where we looked at the different use cases for using Kerberos in SAS Viya 3.3, in this post I want to delve into more details on configuring Kerberos delegation with SAS Viya 3.3. SAS Viya 3.3 supports the use of Kerberos delegation to authenticate to SAS Logon Manager and then use the delegated credentials to access SAS Cloud Analytic Services. This was the first use case we illustrated in the previous blog post.

As a reminder this is the scenario we are discussing in this blog post:

Kerberos Delegation

In this post we’ll examine:

  • The implications of using Kerberos delegation.
  • The prerequisites.
  • How authentication is processed.
  • How to configure Kerberos delegation.

Why would we want to configure Kerberos delegation for SAS Viya 3.3? Kerberos will provide us with a strong authentication mechanism for the Visual interfaces, SAS Cloud Analytic Services, and Hadoop in SAS Viya 3.3. With Kerberos enabled, no end-user credentials will be sent from the browser to the SAS Viya 3.3 environment. Instead Kerberos relies on a number of encrypted tickets and a trusted third party to provide authentication. Equally, leveraging Kerberos Delegation means that both the SAS Cloud Analytic Services session and the connection to Hadoop will all be running as the end-user. This better allows you to trace operations to a specific end-user and to more thoroughly apply access controls to the end-user.

Implications

Configuring Kerberos delegation will involve configuring Kerberos authentication for both the Visual interfaces and SAS Cloud Analytic Services. First, we’ll look at the implications for the Visual interfaces.

Once we configure Kerberos for authentication of SAS Logon Manager it replaces the default LDAP provider for end-users. This means that the only way for end-users to authenticate to SAS Logon Manager will be with Kerberos. In SAS Viya 3.3 there is no concept of fallback authentication.

Kerberos will be our only option for end-user authentication and we will be unable to use the sasboot account to access the environment. Configuring Kerberos authentication for SAS Logon Manager will be an all-or-nothing approach.

While the web clients will be using Kerberos for authentication, any client using the OAuth API directly will still use the LDAP provider. This means when we connect to SAS Cloud Analytic Services from SAS Studio (which does not integrate with SAS Logon) we will still be obtaining an OAuth token using the username and password of the user accessing SAS Studio.

If we make any mistakes when we configure Kerberos, or if we have not managed to complete the prerequisites correctly, the SAS Logon Manager bootstrap process will error and SAS Logon Manager will fail to start. If SAS Logon Manager fails to start then there is no way to gain access to the SAS Viya 3.3 visual interfaces. In such a case the SAS Boot Strap configuration tool must be used to repair or change the configuration settings. Finally, remember that using Kerberos for SAS Logon Manager does not change the requirement for the identities microservice to connect to an LDAP provider. Since the identities microservice is retrieving information from LDAP about users and groups, we need to ensure the username part of the Kerberos principal for each end-user matches the username returned from LDAP. SAS Logon Manager will strip the realm from the user principal name and use this value in the comparison.
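The realm-stripping comparison is easy to check against your own directory data before you switch authentication over. The following Python sketch only models the rule as described above; the function names and sample principals are hypothetical, and it assumes an exact, case-sensitive comparison, which may not match the product's actual behavior.

```python
def username_from_principal(principal):
    # "jdoe@EXAMPLE.COM" -> "jdoe"; a value without a realm is returned unchanged.
    return principal.split("@", 1)[0]

def matches_ldap(principal, ldap_usernames):
    # Assumption: exact, case-sensitive match against the names LDAP returns.
    return username_from_principal(principal) in ldap_usernames

ldap_usernames = {"jdoe", "asmith"}
print(matches_ldap("jdoe@EXAMPLE.COM", ldap_usernames))
print(matches_ldap("j.doe@EXAMPLE.COM", ldap_usernames))
```

Running a check like this over the full user list highlights accounts whose Kerberos principal and LDAP username have drifted apart.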

Then considering SAS Cloud Analytic Services, we will be adding Kerberos to the other supported mechanisms for authentication. We will not replace the other mechanisms the way we do for SAS Logon Manager. This means we will not prevent users from connecting with a username and password from the Programming interfaces. As with the configuration of SAS Logon Manager, issues with the configuration can cause SAS Cloud Analytic Services to fail to start. Therefore, it is recommended to complete the configuration of SAS Cloud Analytic Services after the deployment has completed and you are certain things are working correctly.

Prerequisites

To be able to use Kerberos delegation with SAS Viya 3.3 a number of prerequisites need to be completed.

Service Principal Name

First a Kerberos Service Principal Name (SPN) needs to be registered for both the HTTP service class and the sascas service class. This will take the form <service class>/<HOSTNAME>, where the <HOSTNAME> is the value that will be used by clients to request a Kerberos Service Ticket. In most cases for HTTP the <HOSTNAME> will just be the fully qualified hostname of the machine where the Apache HTTP Server is running. If you are using aliases or alternative DNS registrations then finding the correct name to use might not be so straightforward. For SAS Cloud Analytic Services, the <HOSTNAME> will be the CAS Controller hostname. Next, by registering we mean that this Service Principal Name must be provided to the Kerberos Key Distribution Center (KDC). If we are using Microsoft Active Directory, each SPN must be registered against an object in the Active Directory database. Objects that can have a SPN registered against them are users or computers. We recommend using a user object in Active Directory to register each SPN against. We also recommend that different users are used for HTTP and CAS.

So, we have two service accounts in Active Directory and we register the SPN against each service account. There are different ways the SPN can be registered in Active Directory. The administrator could perform these tasks manually using the GUI, using an LDAP script, PowerShell script, using the setspn command, or using the ktpass command. Using these tools multiple SPNs can be registered against the service account, which is useful if there are different hostnames the end-users might use to access the service. In most cases using these tools will only register the SPN; however, using the ktpass command will also change the User Principal Name for the service account. More on this shortly.

As an alternative to Microsoft Active Directory, customers could be using a different Kerberos KDC, such as MIT Kerberos or Heimdal Kerberos. For these implementations of Kerberos there is no difference between a user and a service. The database used by these KDCs just stores information on principals and does not provide a distinction between a User Principal Name and a Service Principal Name.

Trusted for Delegation

For the Kerberos authentication to be delegated from SAS Logon Manager to SAS Cloud Analytic Services and then from SAS Cloud Analytic Services to Secured Hadoop, the two service accounts that have the SPNs registered against them must be trusted for delegation. Without this, the scenario will not work. You can only specify that an account is trusted for delegation after the Service Principal Name has been registered. The option is not available until you have completed that step. The picture below shows an example of the delegation settings in Active Directory.

If the Secured Hadoop environment is configured using a different Kerberos Key Distribution Center (KDC) to the rest of the environment it will not prevent the end-to-end scenario working. However, it will add further complexity. You will need to ensure there is a cross-realm trust configured to the Hadoop KDC for the end-to-end scenario to work.

Kerberos Keytab

Once you have registered each of the SPNs you’ll need to create a Kerberos keytab for each service account. Again, there are multiple tools available to create the Kerberos keytab. We recommend using the ktutil command on Linux, since this is independent of the KDC and makes no changes to the Kerberos database when creating the keytab. Some tools like ktpass will make changes when generating the keytab.

In the Kerberos keytab we need to have the User Principal Name (UPN) and associated Kerberos keys for that principal. The Kerberos keys are essentially encrypted versions of the password for the principal. As we have discussed above, about the SPN, depending on the tools used to register it the UPN for the Kerberos keytab could take different forms.

When using ktpass to register the SPN and create the keytab in a single step, the UPN of the account in Active Directory will be set to the same value as the SPN, whereas using the setspn command or performing the task manually will leave the UPN unchanged. Equally, for MIT Kerberos or Heimdal Kerberos, since there is no differentiation between principals, the UPN for the keytab will be the SPN registered with the KDC.

Once the Kerberos keytabs have been created they will need to be made available to any hosts with the corresponding service deployed.

Kerberos Configuration File

Finally, as far as prerequisites are concerned we might need to provide a Kerberos configuration file for the host where SAS Logon Manager is deployed. This configuration should identify the default realm and other standard Kerberos settings. The Kerberos implementation in Java should be able to use network queries to find the default realm and Kerberos Key Distribution Center. However, if there are issues with the network discovery, then providing a Kerberos configuration file will allow us to specify these options.

The Kerberos configuration file should be placed in the standard location for the operating system. So on Linux this would be /etc/krb5.conf. If we want to specify a different location we can also specify a JVM option to point to a different location. This would be the java.security.krb5.conf option. Equally, if we cannot create a Kerberos configuration file we could set the java.security.krb5.realm and java.security.krb5.kdc options to identify the Kerberos Realm and Kerberos Key Distribution Center. We’ll show how to set JVM options below.

Authentication Process

The process of authenticating an end-user is shown in the figure below:

Where the steps are:

A.  Kerberos is used to authenticate to SAS Logon Manager. SAS Logon Manager uses the Kerberos Keytab for HTTP/<HOSTNAME> to validate the Service Ticket. Delegated credentials are stored in the Credentials microservice.
B.  Standard internal OAuth connection to SAS Cloud Analytic Services. The origin field in the OAuth token includes Kerberos and the claims include the custom group ID “CASHostAccountRequired”.
C.  The presence of the additional Kerberos origin causes SAS Cloud Analytic Services to get the CAS client to make a second connection attempt using Kerberos. The Kerberos credentials for the end-user are obtained from the Credentials microservice. SAS Cloud Analytic Services Controller uses the Kerberos Keytab for sascas/<HOSTNAME> to validate the Service Ticket and authenticate the end-user. Delegated credentials are placed in the end-user ticket cache.
D.  SAS Cloud Analytic Services uses the credentials in the end-user ticket cache to authenticate as the end-user to the Secured Hadoop environment.

Configuration

Kerberos authentication must be configured for both SAS Logon Manager and SAS Cloud Analytic Services. Also, any end-user must be added to a new custom group.

SAS Logon Manager Configuration

SAS Logon Manager is configured in SAS Environment Manager.

Note: Before attempting any configuration, ensure at least one valid LDAP user is a member of the SAS Administrators custom group.

The configuration settings are within the Definitions section of SAS Environment Manager. For the sas.logon.kerberos definition you need to set the following properties:

For more information see the

SAS Logon Manager will need to be restarted for these new JVM options to be picked up. The same method can be used to set the JVM options for identifying the Kerberos Realm and KDC where we would add the following:

  • Name = java_option_krb5realm
  • Value = -Djava.security.krb5.realm=<REALM>
  • Name = java_option_krb5kdc
  • Value = -Djava.security.krb5.kdc=<KDC HOSTNAME>

Or for setting the location of the Kerberos configuration file where we would add:

  • Name = java_option_krb5conf
  • Value = -Djava.security.krb5.conf=/etc/krb5.conf

SAS Cloud Analytic Services Configuration

The configuration for SAS Cloud Analytic Services is not performed in SAS Environment Manager and is completed by changing files on the file system. The danger of changing files on the file system is that re-running the deployment Ansible playbook might overwrite any changes you make. Your choices are to remake any changes to the file system, make the changes to both the file system and the playbook files, or make the changes in the playbook files and re-run the playbook to change the file system. Here I will list the changes in both the configuration files and the playbook files.

There is only one required change and then two optional changes. The required change is to define the authentication methods that SAS Cloud Analytic Services will use. In the file casconfig_usermods.lua located in:

/opt/sas/viya/config/etc/cas/default

Add the following line:

cas.provlist = 'oauth.ext.kerb'

Note: Unlike the SAS Logon Manager option above, this is separated with full-stops!

In the same file we can make two optional changes. These optional changes enable you to override default values. The first is the default Service Principal Name that SAS Cloud Analytic Services will use. If you cannot use sascas/<HOSTNAME> you can add the following to the casconfig_usermods.lua:

-- Add Env Variable for SPN
env.CAS_SERVER_PRINCIPAL = 'CAS/HOSTNAME.COMPANY.COM'

This sets an environment variable with the new value of the Service Principal Name. The second optional change is to set another environment variable. This will allow you to put the Kerberos Keytab in any location and call it anything. The default name and location is:

/etc/sascas.keytab

If you want to put the keytab somewhere else or call it something else add the following to the casconfig_usermods.lua

-- Add Env Variable for keytab location
env.KRB5_KTNAME = '/opt/sas/cas.keytab'

These changes can then be reflected in the vars.yml within the playbook by adding the following to the CAS_CONFIGURATION section:

CAS_CONFIGURATION:
   env:
     CAS_SERVER_PRINCIPAL: 'CAS/HOSTNAME.COMPANY.COM'
     KRB5_KTNAME: '/opt/sas/cas.keytab'
   cfg:
     provlist: 'oauth.ext.kerb'

With this in place we can restart the SAS Cloud Analytic Services Controller to pick up the changes.

Custom Group

If you attempted to test accessing SAS Cloud Analytic Services at this point from the Visual interfaces as an end-user, you would see that they were not delegating credentials and the CAS session was not running as the end-user. The final step is to create a custom group in SAS Environment Manager. This custom group can be called anything, perhaps “Delegated Users”, but the ID for the group must be “CASHostAccountRequired”. Without this, the CAS session will not be run as the end-user and delegated Kerberos credentials will not be used to launch the session.

Summary

What we have outlined in this article is the new feature of SAS Viya 3.3 that enables Kerberos delegation throughout the environment. It allows you to have end-user sessions in SAS Cloud Analytic Services that are able to use Kerberos to connect to Secured Hadoop. I hope you found this helpful.

SAS Viya 3.3 Kerberos Delegation was published on SAS Users.

March 30, 2018

With SAS Viya 3.3, a new data transfer mechanism, “MultiNode Data Transfer”, has been introduced to transfer data between the data source and SAS Cloud Analytic Services (CAS), in addition to the Serial and Parallel data transfer modes. The new mechanism is an extension of the Serial Data Transfer mode. In MultiNode Data Transfer mode, each CAS Worker makes a concurrent connection to read and write data from the source DBMS or distributed data system.

In CAS, SAS Data connectors are used for Serial mode and SAS Data Connect Accelerators are used for Parallel mode data transfer between CAS and DBMS. The SAS Data connector can also be used for the MultiNode data transfer mechanism. In a multi-node CAS environment when the Data Connector is installed on all Nodes, the Data connector can take advantage of a multi-node CAS environment and make concurrent data access connections from each CAS worker to read and write data from the data source environment.

The CAS Controller controls the MultiNode Data transfer. It directs each CAS worker node on how to query the source data and obtain the needed data. The CAS Controller checks the source data table for the first numeric column and uses the values to divide the table into slices using a MOD function of the number of CAS nodes specified. The higher the cardinality of the selected numeric column, the more easily the data can be divided into slices. If CAS chooses a low-cardinality column, you could end up with poor data distribution on the CAS worker nodes. The CAS Controller directs each CAS worker to submit a query to obtain the needed slice of data. During this process, each CAS worker makes an independent, concurrent request to the data source environment.

Data is transferred from the source environment to the CAS worker nodes directly using a single thread connection, bypassing the CAS Controller.

The following diagrams describe the data access from CAS to the data source environment using MultiNode Data Transfer mode. CAS is hosted on a multi-node environment with the SAS Data Connector installed on each node (CAS Controller and Workers). A CASLIB is defined with a NUMREADNODES= and NUMWRITENODES= value other than 1. With each data table access request, the CAS Controller scans the source data table for the first numeric column and uses its values to prepare a query for each CAS worker to run. Each CAS Worker node submits an individual query to get its slice of the data. Something like:

Select * from SourceTable where mod(NumericField, NUMREADNODES) = WorkerNodeNumber

The data moves from the DBMS gateway server to each CAS Worker node directly using a single thread connection, bypassing the CAS Controller. It’s a kind of parallel load using the serial mechanism, but it’s not a massively parallel data load. Notice the bottleneck at the DBMS gateway server: the data transfer always passes through the DBMS gateway server to the CAS Worker nodes.
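The effect of column cardinality on the MOD-based slicing can be seen with a few lines of Python. This is only a model of the query predicate shown above, with a hypothetical `slice_counts` helper and made-up column values; it does not connect to any data source.

```python
from collections import Counter

def slice_counts(values, n_nodes):
    # Mimic "where mod(NumericField, NUMREADNODES) = WorkerNodeNumber":
    # count how many rows each worker node would receive.
    counts = Counter(v % n_nodes for v in values)
    return [counts.get(node, 0) for node in range(n_nodes)]

high_cardinality = list(range(1000))  # e.g. a unique numeric key
low_cardinality = [0, 1] * 500        # e.g. a 0/1 flag column

print(slice_counts(high_cardinality, 4))  # rows spread evenly over 4 nodes
print(slice_counts(low_cardinality, 4))   # two nodes get everything, two get nothing
```

With the flag column, half of the workers receive no rows at all, which is the poor data distribution described earlier.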

Multi Node Data Transfer

Prerequisites to enable MultiNode Data Transfer include:

  • The CAS environment is a multi-node environment (multiple CAS Worker Nodes).
  • The SAS Data Connector for the data source is installed on each CAS Worker and Controller node.
  • The data source client connection components are installed on each CAS Worker and Controller node.

By default, the SAS Data Connector uses serial data transfer mode. To enable MultiNode data transfer mode, you must use the NUMREADNODES= and NUMWRITENODES= parameters in the CASLIB statement and specify a value other than 1. If the value is 0, CAS uses all available CAS worker nodes. MultiNode data transfer mode can use only as many nodes as are available; if you specify more than the available number, the log prints a warning message.
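The rules above for the node count can be sketched in a few lines of Python. This is a model of the described behavior rather than CAS code: the function name resolve_read_nodes and the wording of the warning text are hypothetical.

```python
def resolve_read_nodes(requested, available_workers):
    """Return (nodes_used, warning_or_None) following the rules described above.

    - 0 means "use all available worker nodes" (no warning).
    - A value above the available count is reduced, with a warning.
    - 1 means ordinary serial transfer.
    """
    if requested < 0:
        raise ValueError("requested must be 0 or greater")
    if requested == 0:
        return available_workers, None
    if requested > available_workers:
        warning = (
            f"numReadNodes({requested}) exceeds the number of "
            f"available worker nodes({available_workers})"
        )
        return available_workers, warning
    return requested, None


print(resolve_read_nodes(10, 7))  # reduced to the 7 available workers
print(resolve_read_nodes(0, 7))   # all workers, no warning
print(resolve_read_nodes(1, 7))   # serial transfer
```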

The following code example loads data using MultiNode data transfer mode. It assigns a CASLIB in serial mode with NUMREADNODES=10 and NUMWRITENODES=10 and loads a Hive table into CAS. Because the NUMREADNODES= value is other than 1, the load follows the MultiNode mechanism. The log shows a warning that the number of read nodes exceeds the available worker nodes. Specifying a number higher than the available CAS worker nodes is therefore one way to verify that CAS is using MultiNode data transfer mode. If you specify NUMREADNODES=0, CAS uses all available nodes but writes no note or warning about multi-node usage to the SAS log.

/* Start a CAS session */
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* Serial-mode caslib; NUMREADNODES/NUMWRITENODES values other than 1 request MultiNode transfer */
caslib HiveSrl datasource=(srctype="hadoop",
   server="xxxxxxx.xxx",
   username="hadoop",
   dataTransferMode="SERIAL",
   NUMREADNODES=10,
   NUMWRITENODES=10,
   hadoopconfigdir="/opt/MyHadoop/CDH/Config",
   hadoopjarpath="/opt/MyHadoop/CDH/Jars",
   schema="default");

/* Load the Hive table into CAS through the caslib defined above */
proc casutil;
   load casdata="prdsal2_1G" casout="prdsal2_1G"
      outcaslib="HiveSrl" incaslib="HiveSrl" ;
quit;

SAS Log extract:

….
77 proc casutil;
78 ! load casdata="prdsal2_1G" casout="prdsal2_1G"
79 outcaslib="HiveSrl" incaslib="HiveSrl" ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to Hadoop.
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(7). The load will proceed with numReadNodes=7. 
…
..

On the database side, in this case Hive, note the queries submitted by the CAS worker nodes. Each includes a WHERE clause with the MOD function, as described above.

In the Hadoop Resource Manager user interface, you can see the corresponding job execution for each query submitted by the CAS worker nodes.

When using MultiNode mode to load data into CAS, data distribution depends on the cardinality of the numeric column that CAS selects for the MOD function. The data distribution for the table loaded above is not ideal, because CAS selected a column (‘year’) that is poorly suited (in this case) to distributing data across CAS worker nodes. The MultiNode mechanism has no option to specify the column used for query preparation and, ultimately, for data distribution.
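The effect of cardinality on distribution can be seen in a small Python simulation. The data here is made up for illustration (2,800 rows, four distinct year values, and a hypothetical unique row id), and rows_per_worker is a hypothetical helper, not part of any SAS API.

```python
from collections import Counter


def rows_per_worker(values, num_nodes):
    """Count how many rows land on each worker when slicing with MOD."""
    counts = Counter(value % num_nodes for value in values)
    return [counts.get(worker, 0) for worker in range(num_nodes)]


NUM_WORKERS = 7
row_ids = range(2800)                       # high cardinality: every value unique
years = [1995 + (i % 4) for i in row_ids]   # low cardinality: only 4 distinct values

by_year = rows_per_worker(years, NUM_WORKERS)
by_id = rows_per_worker(row_ids, NUM_WORKERS)

print("sliced on year:", by_year)
print("sliced on id:  ", by_id)
```

With only four distinct values, a MOD by 7 can produce at most four different results, so at least three of the seven workers receive no rows at all. The unique id spreads the same rows evenly.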

If CAS cannot find suitable columns for MultiNode data transfer mode, it will use standard Serial mode to transfer data as shown in the following log:

……..
74
74 ! load casdata="prdsal2_char" casout="prdsal2_char"
75 outcaslib="HiveSrl" incaslib="HiveSrl" ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to Hadoop.
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(7). The load will proceed with numReadNodes=7.
WARNING: Unable to find an acceptable column for multi-node reads. Load will proceed with numReadNodes = 1. 
NOTE: Cloud Analytic Services made the external data from prdsal2_char available as table PRDSAL2_CHAR in caslib HiveSrl.
…….

Data platforms supported for MultiNode data transfer using the Data Connector:

  • Hadoop
  • Impala
  • Oracle
  • PostgreSQL
  • Teradata
  • Amazon Redshift
  • DB2
  • MS SQL Server
  • SAP HANA

SAS considers column data types in the following order when choosing the column used to divide data into slices for a MultiNode data read:

  • INT (includes BIGINT, INTEGER, SMALLINT, TINYINT)
  • DECIMAL
  • NUMERIC
  • DOUBLE
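One way to read this ordering is sketched below in Python. It is an interpretation, not documented CAS behavior: the helper pick_slice_column is hypothetical, and choosing the first column in table order within each type group is an assumption. Returning None stands in for the fallback to serial mode when no acceptable column exists.

```python
# Type groups in the order listed above (highest priority first).
TYPE_PRIORITY = [
    {"INT", "BIGINT", "INTEGER", "SMALLINT", "TINYINT"},
    {"DECIMAL"},
    {"NUMERIC"},
    {"DOUBLE"},
]


def pick_slice_column(columns):
    """Pick a slicing column from (name, type) pairs given in table order.

    Returns the column name, or None when no numeric column qualifies.
    """
    for type_group in TYPE_PRIORITY:
        for name, column_type in columns:
            if column_type.upper() in type_group:
                return name
    return None


# Hypothetical table layouts for illustration.
mixed = [("country", "VARCHAR"), ("actual", "DOUBLE"), ("year", "INT")]
chars_only = [("country", "VARCHAR"), ("state", "VARCHAR")]

print(pick_slice_column(mixed))       # integer column preferred over DOUBLE
print(pick_slice_column(chars_only))  # None: no acceptable column
```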

Multi-Node Write:

While this post focused on loading data from a data source into CAS, multi-node data transfer also works when saving from CAS back to the data source. The important parameter when saving is NUMWRITENODES instead of NUMREADNODES. The behavior of multi-node saving is similar to that of multi-node loading.

Summary:

The SAS Data Connector can be used for MultiNode data transfer, without additional license fees, by installing the Data Connector and DBMS client components on all CAS worker nodes. The source data is transferred directly from the DBMS gateway server to the CAS worker nodes, divided up by a simple MOD function. With this mechanism, optimal data distribution across the CAS nodes is not guaranteed. It is suggested that you use all CAS worker nodes by specifying NUMREADNODES=0 when loading data to CAS in MultiNode mode.


Multi Node Data Transfer to CAS was published on SAS Users.

March 07, 2018
 

The R SWAT package (SAS Wrapper for Analytics Transfer) enables you to upload big data into an in-memory distributed environment to manage data and create predictive models using familiar R syntax. In the SAS Viya Integration with Open Source Languages: R course, you learn the syntax and methodology required to [...]

The post Use R to interface with SAS Cloud Analytics Services appeared first on SAS Learning Post.