Daria Rostovtseva

8月 162021

In Part I of this blog post, I provided an overview of the approach my team and I took tackling the problem of classifying diverse, messy documents at scale. I shared the details of how we chose to preprocess the data and how we created features from documents of interest using LITI rules in SAS Visual Text Analytics (VTA) on SAS Viya.

In this second part of the post, I'll discuss:

  • scoring VTA concepts and post-processing the output to prepare data for predictive modeling
  • building predictive models using SAS Visual Data Mining and Machine Learning
  • assessing the results

Scoring VTA Features to Work

Recall from Part I, I used the image in Figure 1 to illustrate the type of documents our team was tasked to automatically classify.

Figure 1. Sample document

Figure 1. Sample document

I then demonstrated a method for extracting features from such documents, using the Language Interpretation Text Interpretation (LITI) rules. I used the “Color of Hair” feature as an example. Once the concepts for a decent number of features were developed, we were ready to apply them to the corpus of training documents to build an analytic table for use in building a document classification model.

The process of applying a VTA concept model to data is called scoring. To score your documents, you first need to download the model score code. This is a straightforward task: run the Concepts node to make sure the most up-to-date model is compiled, and then right-click on the node and select “Download score code.”

Figure 2. Exporting concept model score code

Figure 2. Exporting concept model score code

You will see four files inside the downloaded ZIP file. Why four files? There are two ways you can score your data: using the the with the concepts analytics store (astore) or the concepts model. The first two files you see in Figure 2 score the data using the astore method, and the third and fourth file correspond to scoring using the concept .li binary model file.

If you are scoring in the same environment where you built the model, you don’t need the .li file, as it is available in the system and is referenced inside the ScoreCode.sas program. This is the reason I prefer to use the second method to score my data – it allows me to skip the step of uploading the model file. For production purposes though, the model file will need be copied to the appropriate location so the score code can find it. For both development and production, you will need to modify the macro variables to match the location and structure of your input table, as well as the destination for your result files.

Figure 3. Partial VTA score code – map the macro variables in curly brackets, except those in green boxes

Figure 3. Partial VTA score code – map the macro variables in curly brackets, except those in green boxes

Once all the macro variables have been mapped, run the applyConcept CAS action inside PROC CAS to obtain the results.

Figure 4. Partial VTA score code

Figure 4. Partial VTA score code

Depending on the types of rules you used, the model scoring results will be populated in the OUT_CONCEPTS and/or OUT_FACTS tables. Facts are extracted with sequence and predicate rule types that are explained in detail in the book I mentioned in Part I. Since my example didn’t use these rule types, I am going to ignore the OUT_FACTS table. In fact, I could modify the score code to prevent it from outputting this table altogether.

Looking at the OUT_CONCEPTS table, you can see that scoring returns one row per match. If you have multiple concepts matching the same string or parts of the same string, you will see one row for each match. For troubleshooting purposes, I tend to output both the target and supporting concepts, but for production purposes, it is more efficient to set the behavior of supporting concepts to “Supporting” inside the VTA Concepts node, which will prevent them from showing up in the scored output (see Figure 5). The sample output in Table 1 shows matches for target concepts only.

Table 1. OUT_CONCEPTS output example

Table 1. OUT_CONCEPTS output example

Figure 5. Setting concept behavior to supporting

Figure 5. Setting concept behavior to supporting

Completing Feature Engineering

To complete feature creation for use in a classification model, we need to do two things:

  1. Convert extracted concepts into binary flags that indicate if a feature is present on the page.
  2. Create a feature that will sum up the number of total text-based features observed on the page.

To create binary flags, you need to do three things: deduplicate results by concept name within each record, transpose the table to convert rows into columns, and replace missing values with zeros. The last step is important because you should expect your pages to contain a varying number of features. In fact, you should expect non-target pages to have few, if any, features extracted. The code snippet below shows all three tasks accomplished with a PROC CAS step.

/*Deduplicate OUT_CONCEPTS. Add a dummy variable to use in transpose later*/
data CASUSER.OUT_CONCEPTS_DEDUP(keep=ID _concept_ dummy);
	by ID _concept_;
	if first.concept_ then output;
/*Transpose data*/
	by ID;
	id _concept_;
/*Replace missing with zeros and sum features*/
data CASUSER.FOR_VDMML(drop=i _NAME_);
	array feature (*) _numeric_;
	do i=1 to dim(feature);
		if feature(i)=. then feature(i)=0;
	Num_Features=sum(of T_:);
/*Merge features with labels and partition indicator*/
	  CASUSER.LABELED_PARTITIONED_DATA(keep= ID target partition_ind);
by ID;

To create an aggregate feature, we simply added up all features to get a tally of extracted features per page. The assumption is that if we defined the features well, a high total count of hits should be a good indicator that a page belongs to a particular class.

The very last task is to merge your document type label and the partitioning column with the feature dataset.

Figure 6 shows the layout of the table ready for training the model.

Figure 6. Training table layout

Figure 6. Training table layout

Training Classification Model

After all the work to prepare the data, training a classification model with SAS Visual Data Mining and Machine Learning (VDMML) was a breeze. VDMML comes prepackaged with numerous best practices pipeline templates designed to speed things up for the modeler. We used the advanced template for classification target (see Figure 7) and found that it performed the best even after multiple attempts to improve performance. VDMML is a wonderful tool which makes model training easy and straightforward. For details on how to get started, see this free e-book or check out the tutorials on the SAS Support site.

Figure 7. Example of a VDMML Model Studio pipeline

Figure 7. Example of a VDMML Model Studio pipeline

Model Assessment

Since we were dealing with rare targets, we chose the F-score as our assessment statistic. F-score is a harmonized mean of precision and recall and provides a more realistic model assessment score than, for example, a misclassification rate. You can specify F-score as your model selection criterion in the properties of a VDMML Model Studio project.

Depending on the document type, our team was able to achieve an F-score of 85%-95%, which was phenomenal, considering the quality of the data. At the end of the day, incorrectly classified documents were mostly those whose quality was dismal – pages too dark, too faint, or too dirty to be adequately processed with OCR.


So there you have it: a simple, but effective and transparent approach to classifying difficult unstructured data. We preprocessed data with Microsoft OCR technologies, built features using SAS VTA text analytics tools, and created a document classification model with SAS VDMML. Hope you enjoyed learning about our approach – please leave a comment to let me know your thoughts!


I would like to thank Dr. Greg Massey for reviewing this blog series.

Classifying messy documents: A common-sense approach (Part II) was published on SAS Users.

8月 052021

Unstructured text data is ubiquitous in both business and government and extracting value from it at scale is a common challenge. Organizations that have been around for a while often have vast paper archives. Digitizing these archives does not necessarily make them usable for search and analysis, since documents are commonly converted into images that are not inherently machine-readable. I recently participated in a project which aimed, as one of its goals, to automate the search and tagging of specific pages within such digitized archives. We encountered several challenges:

  • These PDF-based files were very long, up to 3,000 pages and sometimes longer, so looking for relevant pages was like looking for a needle in a haystack.
  • Files arrived in the tens of thousands per month and needed to be processed promptly to accommodate the customer’s business process.
  • Relevant pages spanned decades, came in dozens of formats, were only partially machine-readable and exhibited various data quality problems related to how they had been handled prior to being digitized.

With some trial and error, our team developed a common-sense approach to classifying pages in these documents through a combination of advanced optical character recognition (OCR) technologies coupled with feature engineering using SAS Visual Text Analytics (VTA) and supervised machine learning using SAS Visual Data Mining and Machine Learning (VDMML).

In this two-part blog post, I describe our approach and share some lessons learned. I spend the most time on data preparation and feature engineering, because it was a crucial component of this work and is so critical for building accurate models. To maintain customer confidentiality, I am keeping the project details to a minimum and using generic data examples.

In this part of the post, I describe our high-level approach and dive into preprocessing and feature extraction. In Part II, I will discuss data preparation for supervised modeling, model development and assessment. To replicate our approach, you would require licenses for SAS Visual Text Analytics and SAS Visual Data Mining and Machine Learning, and, if your data is not machine-readable, a high-quality OCR tool.

High-Level Approach

At a high level, the model development pipeline for this document classification problem looked like this:

  • Preprocess and label data
  • Extract features
  • Build and evaluate a supervised model
Figure 1. Model development pipeline

Figure 1. Model development pipeline

Data preparation and preprocessing

When my colleagues and I got our first glimpse of the data, our knees nearly buckled. Although the data was in a PDF format, only a small portion of it was machine readable. Some other common issues included:

  • Old, faded, stained, crumpled and torn pages
  • Holes punched through crucial sections, such as form titles
  • Salt-and-pepper effects and smudges from many rounds of faxing and photocopying
  • Documents populated by hand, with a typewriter, or faintly printed on a matrix printer
  • Handwritten notes and marks both on the margins and over the form body

We needed clean-ish text to work with, but it quickly became apparent we also needed a heavy-duty optical character recognition tool to extract it. After some trial and error, we chose to preprocess the data with the Microsoft Cognitive Services Read API. The API handles data in many formats and recognizes both typed and handwritten text. The API returns data in a JSON format which is easy to transform into delimited files or SAS tables. To give you a flavor of its ability to handle messy documents, consider this sample document in the image below.

Figure 2. Sample document image

Figure 2. Sample document image

And now check out the text extracted by the API. I highlighted some areas of interest, including text that came from handwriting and the regions adjacent to the holes punched at the top. Considering the image quality, overall, the output is remarkably accurate: both hand-written and typed text is recognized, and the text is returned consistent with a left to right reading pattern.

Figure 3. Text extracted from Figure 2. Handwriting highlighted.

Figure 3. Text extracted from Figure 2. Handwriting highlighted.

We chose to parse the text data into one page per row, except for some special cases where it made more sense to break down continuous printouts into logical chunks.

A parallel task was to label data for training a supervised model. We were tasked with finding and tagging several classes of forms and documents, with many different versions and styles within each document class. The vast majority of pages in these files were irrelevant, so we were dealing with rare targets. As often happens, the customer did not have reliably labeled data, so there were no shortcuts. We were forced to read through a sample of files and label pages manually. As tedious as it sounds, this gave our team an opportunity to become intimately familiar with the data and ask the customer numerous clarifying questions. I am confident getting to know the data before diving into modeling helped us understand and define the problem better, and ultimately enabled us to create a better product for the customer.

As we labeled the data, we took screenshots of unique versions of documents we encountered and populated a form catalog. This helped us create a visual cheat sheet for use in feature development. The catalog proved so useful that even the customer asked if they could have a copy of it to use for their internal trainings!

Feature engineering

The field of Natural Language Processing has exploded in recent years with significant advancements due to the introduction of novel deep learning architectures capable of solving many text-related problems. However, deep learning is not transparent, requires substantial computational resources and takes a lot of tuning to get the best results. Most importantly, it also requires significant amounts of labeled data, which is expensive to create. Therefore, our team chose to start with a simpler but more transparent approach.

The cornerstone of this approach was to use SAS Visual Text Analytics (VTA), which runs in Cloud Analytics Services (CAS) inside the SAS Viya platform, to generate features for training a machine learning model. We used the Concepts node in VTA to create these features on a sample of records reserved for model training. The VTA Concepts node provides an interface for developing Language Interpretation Text Interpretation (LITI) rules to capture specific entities within a textual context. Because the LITI language is so flexible, you can write rules which overcome many data quality problems.

Figure 4. Example of a Concepts node inside a SAS VTA pipeline

Figure 4. Example of a Concepts node inside a SAS VTA pipeline

Consider again our sample document. In Figure 5, I highlighted some strings that are good candidates for extraction as features because they are part of the form structure and are easily readable with a naked eye, meaning that they should lend themselves well to OCR and remain relatively stable within each document class of interest.

Figure 5. Candidate feature examples

Figure 5. Candidate feature examples

The trick to extracting these strings as robust features using VTA is to anticipate some common ways in which the OCR tool may fail, as well as to account for changes in the wording of these common elements over time. This is where the power of the LITI rules comes in, and the work of building the form catalog pays off.

For example, consider a string “7. Color of Hair” as in the image to the right. In this example, it is perfectly readable and would make a good feature, but here are some of the pattern deviations you would want to account for when building rules to capture it:

  • OCR may incorrectly recognize the period as a comma or skip it altogether.
  • OCR may incorrectly recognize the word “of” as something else, such as “ot”, “or”, etc. depending on the quality of the page.
  • The digit 7 may change to 6, 7a, 8 or some other alphanumeric value depending on the version of the form.
  • Different versions of the form may change the wording from Color of Hair to Hair Color.

Using the flexible LITI rules, you can easily account for all for these patterns. The table below shows some examples of LITI rules. The strings matched by each rule are shown in blue. Our objective is to match the “Color of Hair”/ “Hair Color” strings if they occur in close proximity to a corresponding digit, to ensure that we match the form feature rather than some other mention of an individual’s hair color. The rules in the table are gradually extended to provide more flexibility and match more sample strings.

Table 1. Feature extraction rule examples

Table 1. Feature extraction rule example

You can see how we are able to match more and more patterns as we introduce placeholders for digits, add separators and specify ordered distances between tokens to make the rules more flexible. In the last example, the matched strings do not include the item number, but that’s not a problem, because we know that “Color of Hair” will only be matched if it’s preceded by an item number. We don’t need to match the complete string to build a feature, as long as the matched string occurs in a relevant context.

We could continue to add rules like this to eventually match all desired string patterns. However, this could lead to increasingly complex and/or redundant rules. Instead, a preferred approach is to break down this task into pieces: match the digit, match the “Color of Hair” or “Hair Color” string, and then combine the results to get the final match. Table 2 shows the rules above refactored to follow this pattern. Instead of one concept, we now have three: one target concept, denoted with T_, and two supporting concepts, denoted with L_ (for lexicon or lookup).

Table 2. Refactored rule example

Table 2. Refactored rule example

After refactoring the rules, the target concept T_HairColor catches every pattern we want to match and skips the one string that lacks an item number - as intended. What’s even better, we can now reuse some of this work to extract other document features. For example, the L_ItemNumber concept can be reused to match “6. Date of Birth”, “9. Sex”, “11. Occupation” and so on.

I didn’t cover every available rule type, but hopefully this gives you a taste of how they work. There are many great resources that can help you get started with the LITI rules, such as the SAS tutorial Creating Concept Rules Using Textual Elements in SAS Visual Text Analytics, but my favorite resource is SAS Text Analytics for Business Applications by SAS’ Teresa Jade et al. Keep in mind that LITI is proprietary to SAS and can be leveraged within SAS text analytics products, such as SAS Visual Text Analytics.


So far, I shared our approach to preprocessing messy documents and using SAS Visual Text Analytics concepts to extract features from documents of interest. If you are wondering about the next steps, stay tuned for Part II of this series, in which I will discuss:

  • Scoring VTA concepts and post-processing the output to prepare data for predictive modeling
  • Building predictive models using the drag-and-drop interface of SAS Visual Data Mining and Machine Learning
  • Assessing results


I would like to thank Dr. Greg Massey for reviewing this blog series.


Classifying messy documents: A common-sense approach (Part I) was published on SAS Users.

12月 072020

In my previous blog post, I talked about using PROC CAS to accomplish various data preparation tasks. Since then, my colleague Todd Braswell and I worked through some interesting challenges implementing an Extract, Transform, Load (ETL) process that continuously updates data in CAS. (Todd is really the brains behind getting this to work – I am just along for the ride.)

In a nutshell, the process goes like this:

  1. PDF documents drop into a "receiving" directory on the server. The documents have a unique SubmissionID. Some documents are very large – thousands of pages long.
  2. A Python job runs and converts PDFs into plain text. Python calls an API that performs Optical Character Recognition (OCR) and saves off the output as a CSV file, one row per page, in the PDF document.
  3. A SAS program, running in batch, loads the CSV file with new records into a CAS table. SubmissionID is passed to the batch program as a macro variable, which is used as part of the CAS table name.
  4. Records loaded from the CSV file are appended to the Main table. If records with the current SubmissionID already exist in the Main table, they are deleted and replaced with new records.
    The Main table is queried by downstream processes, which extract subsets of data, apply model score code, and generate results for the customer.

Continuously update data process flow

Due to the volume of data and the amount of time it takes to OCR large PDFs, the ETL process runs in multiple sessions simultaneously. And here is a key requirement: the Main table is always available, in a promoted state, with all up-to-date records, in order for the model score code to pick up the needed records.

What does "promoted state" mean?

The concept of table scope, which was introduced with the first release of SAS Viya, presents a challenge. CAS tables are in-memory tables that can have one of two "scopes":

  • Session scope – the table exists within the scope of your current CAS session and drops from memory as soon as the session ends. Functionally, this is somewhat similar to the data you write to the WORK library in SAS 9 – once you disconnect, the data drops from the WORK library.
  • Global scope – the table is available to all sessions. If your session ends, you will still have access to it when you start a new session. Your colleagues also maintain access, assuming they have the necessary permissions. For the table to assume global scope, it needs to be promoted.

Common promotion techniques for a table are the DATA STEP, PROC CAS, or PROC CASUTIL. For example:

/*promote a table using DATA STEP*/
*this snippet copies a SAS 9 table into CAS and promotes in one step;
data mycaslib.my_table (promote=yes);
     set mylib.my_table;
/*promote using PROC CASUTIL*/
*this snippet promotes a session-scope CAS table to global scope;
proc casutil incaslib='mycaslib' outcaslib='mycaslib';
     promote casdata='my_table';
/*Promote using PROC CAS*/
*same as above, this snippet promotes a session-scope table to global scope;
proc cas;
table.promote / 

Fun Facts About Table Promotion

You cannot promote a table that has already been promoted. If you need to promote a new version of the same table, you need to first drop the existing table and promote the new version.

To discover the current table state, use the

proc cas;
     table.tableinfo / 

If you append rows to a promoted table using the DATA STEP append option, the new rows are automatically promoted. For example, in this snippet the mycaslib.main table, which is promoted, remains promoted when the rows from mycaslib.new_rows are appended to it:

data mycaslib.main(append=yes);
     set mycaslib.new_rows;

When you manipulate a promoted table using the DATA STEP apart from appending rows, it creates a new, session-scope version of the same table. You will have two versions of the table: the global-scope table, which remains unchanged, and the session-scope version which has the changes you implemented. Even if you don't change anything in the table and simply run:

data mycaslib.my_table;
     set mycaslib.my_table;

in which mycaslib.my_table is promoted, you end up with a promoted and an unpromoted version of this table in the mycaslib library – a somewhat unexpected and hardly desired result. Appendix 1 walks through a quick exercise you can try to verify this.

As you probably guessed, this is where we ran into trouble with our ETL process: the key requirement was for the Main table to remain promoted, yet we needed to continuously update it. The task was simple if we just needed to append the rows; however, we also needed to replace the rows if they already existed. If we tried to delete the existing rows using the DATA STEP, we would have to deal with the changes applied to a session-scope copy of the global-scope table.

Initially, we designed the flow to save off the session-scope table with changes, then drop the (original) global-scope version, and finally reload the up-to-date version. This was an acceptable workaround, but errors started to pop up when a session looked for the Main table to score data, while a different concurrent session reloaded the most up-to-date data. We were uncertain how this would scale as our data grew.

PROC CAS to the rescue!

After much research, we learned the deleteRows, which allows you to delete rows directly from a global-scope table. The data is never dropped to session-scope – exactly what we needed. Here's an example:

proc cas;
     table.deleteRows /
     table={caslib="mycaslib", name="Main", where="SubmissionID = 12345"};

In case you are wondering, the Tables action set also has an

/*1. Load new rows. SubmissionID macro variable is a parameter passed to the batch program*/
/*New rows are written to the casuser library, but it does not really matter which caslib you choose – 
   we are not persisting them across sessions*/
proc casutil;
       load file="/path_to_new_rows/New_rows_&SubmissionID..csv" outcaslib="casuser" casout="new_rows";
/*2. Delete rows with the current SubmissionID */
proc cas;
       table.deleteRows /
       table={caslib="prod", name="Main", where="SubmissionID = &SubmissionID."};
/*3. Append new rows*/
data mycaslib.main(append=yes);
	set mycaslib.new_rows;
/*4. Save the main table to ensure we have a disk backup of in-memory data*/
proc casutil incaslib="prod" outcaslib="prod";
	save casdata="main" replace;


We learned how to continuously update data in CAS while ensuring the data remains available to all sessions accessing it asynchronously. We learned the append option in DATA STEP automatically promotes new rows but manipulating the data in a global-scope table through DATA STEP in other ways leads to data being copied to session-scope. Finally, we learned that to ensure the table remains promoted while it is updated, we can fall back on PROC CAS.

Together, these techniques enabled implementation of a robust data flow that overcomes concurrency problems due to multiple processes updating and querying the data.


We thank Brian Kinnebrew for his generous help in investigating this topic and the technical review.

Appendix 1

Try the following exercise to verify that manipulating a promoted table in DATA STEP leads to two copies of the table – session- AND global-scope.

/*copy and promote a sample SAS 9 table*/
data casuser.cars(promote=yes);
     set sashelp.cars;
/*check the number of rows and confirm that the table is promoted*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit; /*The table is promoted and has 428 rows*/
/*delete some rows in the promoted table*/
data casuser.cars;
     set casuser.cars;
     if make='Acura' then delete;
/*check again – how may rows does the table have? Is it promoted?*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit;  /*The table is has 421 rows but it is no longer promoted*/
/*reset your CAS session*/
/*kill your current CAS session */
cas _all_ terminate;
/*start a new CAS session and assign caslibs*/
caslib _all_ assign;
/*check again – how may rows does the table have? Is it promoted?*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit;  /*The table is promoted and has 428 rows*/

What we see here is, manipulating a global-scope table in DATA STEP leads to duplication of data. CAS copies the table to session-scope and applies the changes there. The changes go away if you terminate the session. One way to get around this issue is instead of trying to overwrite the promoted table, create a new table, then drop the old table and promote the new table under the old table's name. Otherwise, use table.Update actions to update/delete rows in place, as described in this post.

Append and Replace Records in a CAS Table was published on SAS Users.

3月 162020

As a long-time SAS 9 programmer, I typically accomplish my data preparation tasks through some combination of the DATA Step, Proc SQL, Proc Transpose and some housekeeping procs like Proc Contents and Proc Datasets. With the introduction of SAS Viya, SAS released a new scripting language called CASL – a language that interacts with SAS Cloud Analytics Services (CAS).

CASL statements include actions, logically organized into action sets based on common functionality. For example, the Table action set allows you to load a table in CAS, view table metadata, change table metadata such as drop or rename a column, fetch (print) sample rows, save or drop a table from CAS, among other things. Steven Sober provides a great overview of CASL in his 2019 SAS Global Forum paper.

Learning CASL is a good idea assuming you want to leverage the power of CAS, because CASL is the language of CAS. While you can continue to use Viya-enabled procs for many of your data processing needs, certain new functionality is only available through CASL. CAS actions also provide a more granular access to options which otherwise may not be available as procedure options. But old habits die hard, and for a while I found myself bouncing between SAS 9.4 and CASL. I'd pull the data down from CAS just to get it to process in the SAS Programming Runtime Environment (SPRE) because it took less effort than figuring out how to get it done properly in CAS.

Then I started a project with a seriously large data set and quickly hit the limit on how much data I could pull down to process in SPRE. And although I could adjust the DATALIMIT option to retrieve more data than the default limit, I was wasting time and server resources unnecessarily moving the data between CAS and SPRE. All this, just so I could process the data “old school.”

I decided to challenge myself to do ALL my data preparation in CASL. I love a good challenge! I started collecting various useful CASL code snippets. In this post, I am sharing the tidbits I’ve accumulated, along with some commentary. Note, you can execute CAS actions from multiple clients, including SAS, Python, R, Lua and Java. Since my objective was to transition from traditional SAS code to CASL, I’ll focus solely on CAS actions from the SAS client perspective. While I used SAS Viya 3.5 for this work, most of the code snippets should work on prior versions as well.

The sections below cover: how to submit CASL code; loading, saving, dropping and deleting data; exploring data; table metadata management; and data transformation. Feel free to jump ahead to any section of interest.

How do you submit CASL code?

You use PROC CAS to submit CASL code from a SAS client. For example:

proc cas;
   <cas action 1>;
   <cas action 2>;

Similarly to other interactive procs that use run-group processing, separate CAS actions by run; statements. For example:

proc cas;
   <cas action 1>;
   <cas action 2>;

In fact, you can have the entire data preparation and analysis pipeline wrapped inside a single PROC CAS, passing data and results in the form of CASL variables from one action to the next. It can really be quite elegant.

Moving Data Using PROC CAS

Loading SASHDAT data in CAS

Your data must be in the SASHDAT format for CAS to process it. To load a SASHDAT table into CAS, use the table.loadtable CAS action. The code below assumes your SASHDAT table is saved to a directory on disk associated with your current active caslib, and you are loading it into the same caslib. (This usually occurs when you already performed the conversion to SASHDAT format, but the data has been unloaded. If you are just starting out and are wondering how to get your data into the SASHDAT format in the first place, the next session covers it, so keep reading.)

proc cas; 
     table.loadtable / path="TABLE_NAME.sashdat" casOut="TABLE_NAME"; 
     table.promote /name="TABLE_NAME" drop=true; 

The table.promote action elevates your newly loaded CAS table to global scope, making it available to other CAS sessions, including any additional sessions you start, or to other users assuming they have the right privileges. I can’t tell you how many times I forgot to promote my data, only to find that my hard-earned output table disappeared because I took a longer coffee break than expected! Don’t forget to promote or save off your data (or both, to be safe).

If you are loading from a directory other than the one associated with your active caslib, modify the path= statement to include the relative path to the source directory – relative to your active caslib. If you are looking to load to a different caslib, modify the casOut= statement by placing the output table name and library in curly brackets. For example:

proc cas;
    table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME" caslib="CASLIB2"};
    table.promote /name="TABLE_NAME" drop=true;

You can also place a promote=true option inside the casOut= curly brackets instead of calling the table.promote action, like so:

proc cas;
    table.loadtable / path="TABLE_NAME.sashdat" 
                      casOut={name="TABLE_NAME" caslib="CASLIB2" promote=true};

Curly brackets are ubiquitous in CASL (and quite unusual for SAS 9.4). If you take away one thing from this post, make it “watch your curly brackets.”

Loading SAS7BDAT, delimited data, and other file formats in CAS

If you have a SAS7BDAT file already on disk, load it in CAS with this code:

proc cas;
    table.loadtable /path="TABLE_NAME.sas7bdat" casout="TABLE_NAME" 

Other file formats load similarly – just use the corresponding filetype= option to indicate the type of data you are loading, such as CSV, Excel, Document (.docx, .pdf, etc.), Image, Video, etc. The impressive list of supported file types is available here.

proc cas;
    table.loadtable / path="TABLE_NAME.csv" casout="TABLE_NAME" 

You can include additional parameters inside the importOptions= curly brackets, which differ by the file type. If you don’t need any additional parameters, use the filetype=”auto” and let CAS determine the best way to load the file.

When loading a table in SAS7BDAT, delimited or some other format, the table.loadtable action automatically converts your data to SASHDAT format.

Loading data in CAS conditionally

Imagine you are building a script to load data conditionally – only if it’s not already loaded. This is handy if you have a reason to believe the data might already be in CAS. To check if the data exists in CAS and load conditionally, you can leverage the table.tableExists action in combination with if-then-else logic. For example:

proc cas;
    table.tableExists result =r / name="TABLE_NAME";
    if r=0  then do;
        table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME"};
        table.promote /name="YOUR_TABLE_NAME" drop=true;
    else print("Table already loaded");

Notice that the result=r syntax captures the result code from the tableExists action, which is evaluated before the loadtable and promote actions are executed. If the table is already loaded in CAS, “Table already loaded” is printed to the log. Otherwise, the loadtable and promote actions are executed.

The ability to output CAS action results to a CASL variable (such as result=r in this example) is an extremely powerful feature of CASL. I include another example of this further down, but you can learn more about this functionality from documentation or this handy blog post.

Saving your CAS data

Let’s pretend you’ve loaded your data, transformed it, and promoted it to global scope. You or your colleagues can access it from other CAS sessions. You finished your data preparation, right? Wrong. As the header of this section suggests, you also need to save your prepared CAS data. Why? Because up to this point, your processed and promoted data exists only in memory. You will lose your work if your SAS administrator reboots the server or restarts the CAS controller. If you need to quickly reload prepared data, you must back it up to a caslib’s data source. See the CAS data lifecycle for more details.

To save off CAS data, naturally, you use the table.save action. For example:

proc cas;
    table.save / table="TABLE_NAME" name="TABLE_NAME.sashdat" replace=true;

In this example, you save off the CAS table to disk as a SASHDAT file, defaulting to the location associated with your active caslib. You can modify the table.save parameters to save or export the data to an alternative data storage solution with full control over the file format (including but not limited to such popular options as HDFS, Oracle, SQL Server, Salesforce, Snowflake and Teradata), compression, partitioning and other options.

Dropping and deleting data

To drop a table from CAS, execute a table.droptable action. For example:

proc cas;
    table.droptable / name="TABLE_NAME" quiet=true;

The quiet=true option prevents CAS from generating an error if the table does not exist in CAS. Dropping a table deletes it from memory. It’s a good practice to drop tables you no longer need, particularly the one you have promoted. Local-scope tables disappear on their own when the session expires, whereas global tables will stay in memory until they are unloaded.

Dropping a table does not delete the underlying source data. To delete the source of a CAS table, use the table.deleteSource action. For example:

proc cas;
    table.deletesource / source="TABLE_NAME.sashdat" quiet=true;

Exploring Data Using PROC CAS

After taking a close look at moving the data using PROC CAS, let’s look at some useful ways to start exploring and manipulating CAS data.

Fetching sample data

When preparing data, I find it useful to look at sample data. The table.fetch action is conceptually similar to PROC PRINT and, by default, outputs the first 20 rows of a CAS table:

proc cas;
table.fetch / table="Table_Name";

You can modify the table.fetch options to control which observations and variables to display and how to display them. For example:

proc cas;
table.fetch / table={name="TABLE_NAME" where="VAR1 in ('value1','value2')"},              /*1*/
	      orderby={{name="VAR1"},                                                     /*2*/
                             {name="VAR2", order="descending"}
	     fetchvars={{name="VAR1", label="Variable 1"},                                /*3*/
	                     {name="VAR2", label="Variable 2"}, 
 		             {name="VAR3", label="Variable 3", format=comma12.1}
	     to=50,                                                                       /*4*/
	     index=false;							          /*5*/

In the code snippet above:

  • #1 – where= statement limits the records to those meeting the where criteria.
  • #2 – orderby= option defines the sort order. Ascending is the default and is not required. If sorting by more than one variable, put them in a list inside curly brackets, as shown in this example. If a list item has a subparameter (such as order= here), encase each item in curly brackets.
  • #3 – fetchvars= option defines the variables to print as well as their display labels and formats. If you select more than one variable, put them in a list inside curly brackets, as shown here. And again, if a list item includes a subparmeter, then enclose each list item in curly brackets.
  • #4 – to= option defines the number of rows to print.
  • #5 – index= false option deactivates the index column in the output (the default is index=true). This is similar to the noobs option in PROC PRINT.

As mentioned earlier, make sure to watch your curly brackets!

Descriptive statistics and variable distributions

The next step in data exploration is looking at descriptive statistics and variable distributions. I would need a separate blog post to cover this in detail, so I only touch upon a few of the many useful CAS actions.

To look at statistics for numeric variables, use the simple.summary action, which computes standard descriptive statistics, such as minimum, maximum, mean, standard deviation, number missing, and so on. For example:

proc cas;
    simple.summary / table="TABLE_NAME";

Among its other features, the simple.summary action allows analysis by one or more group-by variables, as well as define the list of desired descriptive statistics. For example:

proc cas;
simple.summary / table={name="TABLE_NAME", groupBy="VAR1", vars={"NUMVAR1","NUMVAR2”}},
                 subSet={"MAX", "MIN", "MEAN", "NMISS"};

Another useful action is simple.topK, which selects the top K and bottom K values for variables in a data set, based on a user-specified ranking order. The example below returns the top 5 and bottom 5 values for two variables based on their frequency:

proc cas;
simple.topk / table="TABLE_NAME" 

Simple is a rich action set with heaps of useful options covered in the documentation.

You may be wondering – what about crosstabs and frequency tables? The simple action set includes freq and crosstab actions. In addition, the action closely imitating the functionality of the beloved PROC FREQ is freqTab.freqTab. For example, the code snippet below creates frequency tables for VAR1, VAR2 and a crosstab of the two.

proc cas;
freqtab.freqtab / table="TABLE_NAME"

Managing CAS Table Variables

Changing table metadata

One of the basic tasks after exploring your data is changing table metadata, such as dropping unnecessary variables, renaming tables and columns, and changing variable formats and labels. The table.altertable action helps you with these housekeeping tasks. For example, the code snippet below renames the table, drops two variables and renames and changes labels for two variables:

proc cas;
    table.altertable / table="TABLE_NAME" rename="ANALYTIC_TABLE"
                       columns={{name="VAR3" rename="ROW_ID" label="Row ID"},
                                {name="VAR4" rename="TARGET" label="Outcome Variable"}

Outputting variable list to a data set

Another useful trick I frequently use is extracting table columns as a SAS data set. Having a list of variables as values in a data set makes it easy to build data-driven scripts leveraging macro programming. The code snippet below provides an example. Here we encounter another example of capturing action result as a CASL variable and using it in further processing – I can’t stress enough how helpful this is!

proc cas;
    table.columninfo r=collinfo / table={name="TABLE_NAME"};       /*1*/
    collist=collinfo["ColumnInfo"];                                /*2*/
    saveresult collist casout="collist";                           /*3*/

In the snippet above:

  • #1 - the columninfo action collects column information. The action result is passed to a CASL variable collinfo. Notice, instead of writing out result=, I am using an alias r =.
  • #2 - the portion of the a CASL variable collinfo containing column data is extracted into another CASL variable collist.
  • #3 - the saveresult statement sends the data to a CAS table collist. If you want to send the results to a SAS7BDAT data set, replace casout= with dataout=, and provide the library.table_name information.

Transforming the Data

Lastly, let’s look at some ways to use CAS actions to transform your data. Proc SQL and DATA step are the two swiss-army knives in SAS 9 developers’ toolkit that take care of 90% of the data prep. The good news is you can execute both DATA Step and SQL directly from PROC CAS. In addition, call the transpose action to transpose your data.

Executing DATA Step code

The dataStep.runCode action enables you to run DATA step code directly inside PROC CAS. You must enclose your DATA step code in quotation marks after the code= statement. For example:

proc cas;
    dataStep.runCode /
        data table_name;
        set table_name;

Running DATA step code in CAS allows access to sophisticated group-by processing and the use of such popular programming techniques as first- and last-dot. Refer to the documentation for important nuances related to processing in a distributed, multi-threaded environment of CAS.

Executing FedSQL

To run SQL in CASL, use the fedSQL.execDirect action. Enclose the SQL query in quotation marks following the query= statement. Optionally, you can use the casout= statement to save the results to a CAS table. For example:

proc cas;
          select  *
          from TABLE1 a  inner join TABLE2 b
          on a.VAR1 = b.VAR1
    casout={name="TABLE3", replace=True};

Similarly to DATA step, be aware of the many nuances when executing SQL in CAS via FedSQL. Brian Kinnebrew provides an excellent overview of FedSQL in his SAS Communities article, and the documentation has up-to-date details on the supported functionality.

Transposing data

Transposing data in PROC CAS is a breeze. The example below uses transpose.transpose action to restructure rows into columns.

proc cas;
    transpose.transpose /
          table={name="TABLE_NAME", groupby={"VAR1"}}
    casout={name="TRANSPOSED" replace=true};

You can transpose multiple variables in the same transpose action. Simply place additional variables inside the curly brackets following transpose=, in quotes, separated by a comma.


PROC CAS is a wrapper procedure enabling you to leverage SAS’ new programming language - CASL. CASL enables you to submit CAS actions directly to SAS Cloud Analytic Services engine from a SAS client. This post provided examples of loading, managing, exploring and transforming your data through CAS actions. Certain new functionality in CAS is only available through CAS actions, so getting comfortable with CASL makes sense. Fear not, and let the curly brackets guide the way 😊.


I would like to thank Brian Kinnebrew for his thoughtful review and generous help on my journey learning CASL.

Challenge accepted: Learning data prep in CASL was published on SAS Users.

10月 052018

In my earlier blog, I described how to create maps in SAS Visual Analytics 8.2 if you have an ESRI shapefile with  granular geographies, such as counties, that you wish to combine into regions. Since posting this blog in January 2018, I received a lot of questions from users on a range of mapping topics, so I thought a more general post on using – and troubleshooting - custom polygons in SAS Visual Analytics on Viya was in order. Since version 8.3 is now generally available, this post is tailored to the 8.3 version of SAS Visual Analytics, but the custom polygon functionality hasn’t really changed between the 8.2 and 8.3 releases.

What are custom polygons?

Custom polygons are geographic boundaries that enable you to visualize data as shaded areas on the map. They are also sometimes referred to as a choropleth maps. For example, you work for a non-profit organization which is trying to decide where to put a new senior center. So you create a map that shows the population of people over 65 years of age by US census tract. The darker polygons suggest a larger number of seniors, and thus a potentially better location to build a senior center:

SAS Visual Analytics 8.3 includes a few predefined polygonal shapes, including countries and states/provinces. But if you need something more granular, you can upload your own polygonal shapes.

How do I create my own polygonal shapes?

To create a polygonal map, you need two components:

  1. A dataset with a measure variable and a region ID variable. For example, you may have population as a measure, and census tract ID as a region ID. A simple frequency can be used as a measure, too.
  2. A “polygon provider” dataset, which contains the same region ID as above, plus geographic coordinates of each vertex in each polygon, a segment ID and a sequence number.

So where do I get this mysterious polygon provider? Typically, you will need to search for a shapefile that contains the polygons you need, and do a little bit of data preparation. Shapefile is a geographic data format supported by ESRI. When you download a shapefile and look at it on the file system, you will see that it contains several files. For example, my 2010 Census Tract shapefile includes all these components:

Sometimes you may see other components present as well. Make sure to keep all components together.

To prepare this data for SAS Visual Analytics, you have two options.

Preparing shapefile for SAS Visual Analytics: The long way

One method to prepare the polygon provider is to run PROC MAPIMPORT to convert the shapefile into a SAS dataset, add a sequence ID field and then load into the Cloud Analytic Services (CAS) server in SAS Viya. The sequence ID is mandatory, as it helps SAS Visual Analytics to draw the lines connecting vertices in the correct order.

A colleague recently reached out for help with a map of Census block groups for Chatham County in North Carolina. Let’s look at his example:

The shapefile was downloaded from here. We then ran the following code on my desktop:

libname geo 'C:\...\Data;
proc mapimport datafile="C:\...\Data\Chatham_County__2010_Census_Block_Groups.shp"
data geo.chatham_cbg;
set  chatham_cbg;

We then manually loaded the geo.chatham_cbg dataset in CAS using self-service import in SAS Visual Analytics. If you are not sure how to upload a dataset to CAS, please check the %SHIMPR. The macro will automatically run PROC MAPIMPORT, create a sequence ID variable and load the table into CAS. Here’s an example:

%shpimprt(shapefilepath=/path/Chatham_County__2010_Census_Block_Groups.shp, id=GEOID, outtable=Chatham_CBG, cashost=my_viya_host.com,   casport=5570, caslib='Public');

For this macro to work, the shapefile must be copied to a location that your SAS Viya server can access, and the code needs to be executed in an environment that has SAS Viya installed. So, it wouldn’t work if I tried to run it on my desktop, which only has SAS 9.4 installed. But it works beautifully if I run it in SAS Studio on my SAS Viya machine.

Configuring the polygon provider

The next step is to configure the polygon provider inside your report. I provided a detailed description of this in my earlier blog, so here I’ll just summarize the steps:

  • Add your data to the SAS Visual Analytics report, locate the region ID variable, right-click and select New Geography
  • Give it a name and select Custom Polygonal Shapes as geography type
  • Click on the Custom Polygon Provider box and select Define New Polygon Provider
  • Configure your polygon provider by selecting the library, table and ID column. The values in your ID column must match the values of the region ID variable in the dataset you are visualizing. The ID column, however, does not need to have the same name as in the visualization dataset.
  • If necessary, configure advanced options of the polygon provider (more on that in the troubleshooting section of this blog).

If all goes well, you should see a preview of your polygons and a percentage of regions mapped. Click OK to save your geographic item, and feel free to use it in the Geo Map object.

I followed your instructions, but the map is not working. What am I missing?

I observed a few common troubleshooting issues with custom maps, and all are fairly easy to fix. The table below summarizes symptoms and solutions.

Symptom Solution
In the Geographic Item preview, 0% of the regions are mapped. For example:
Check that the values in the region ID variable match between the main dataset and the polygon provider dataset.
I successfully created the map, but the colors of the polygons all look the same. I know I have a range of values, but the map doesn’t convey the differences. In your main dataset, you probably have missing region ID values or region IDs that don’t exist in the polygon provider dataset. Add a filter to your Geo Map object to exclude region IDs that can’t be mapped.

Only a subset of regions is rendered. You may have too many points (vertices) in your polygon provider dataset. SAS Visual Analytics can render up to 250,000 points. If you have a large number of polygons represented in a detailed way, you can easily exceed this limit. You have two options, which you can mix and match:

(1)    Filter the map to show fewer polygons

(2)    Reduce the level of detail in the polygon provider dataset using PROC GREDUCE. See example here. Also, if you imported data using the %shpimprt macro, it has an option to reduce the dataset. Here’s a handy link to In the Geographic Item preview, the note shows that 100% of the regions are mapped, but the regions don’t render, or the regions are rendered in the wrong location (e.g., in the middle of the ocean) and/or at an incorrect scale.

This is probably the trickiest issue, and the most likely culprit is an incorrectly specified coordinate space code (EPSG code). The EPSG code corresponds to the type of projection applied to the latitude and longitude in the polygon provider dataset (and the originating shapefile). Projection is a method of displaying points from a sphere (the Earth) on a two-dimensional plane (flat surface). See this tutorial if you want to know more about projections.

There are several projection types and numerous flavors of each type. The default EPSG code used in SAS Visual Analytics is EPSG:4326, which corresponds to the unprojected coordinate system.  If you open advanced properties of your polygon provider, you can see the current EPSG code:

Finding the correct EPSG code can be tricky, as not all shapefiles have consistent and reliable metadata built in. Here are a couple of things you can try:

(1)    Open your shapefile as a layer in a mapping application such as ArcMap (licensed by ESRI) or QGIS (open source) and view the properties of the layer. In many cases the EPSG code will appear in the properties.

(2)    Go to the location of your shapefile and open the .prj file in Notepad. It will show the projection information for your shapefile, although it may look a bit cryptic. Take note of the unit of measure (e.g., feet), datum (e.g., NAD 83) and projection type (e.g., Lambert Conformal Conic). Then, go to https://epsg.io/ and search for your geography.  Going back to the example for Chatham county, I searched for North Carolina. If more than one code is listed, select a few codes that seem to match your .prj information the best, then go back to SAS Visual Analytics and change the polygon provider Coordinate Space property. You may have to try a few codes before you find the one that works best.

I ruled out a projection issue, the note in Geographic Item preview shows that 100% of the regions are mapped, but the regions still don’t render. Take a look at your polygon provider preparation code and double-check that the order of observations didn’t accidentally get changed. The order of records may change, for example, if you use a PROC SQL join when you prepare the dataset. If you accidentally changed the order of the records prior to assigning the sequence ID, it can result in an illogical order of points which SAS Visual Analytics will have trouble rendering. Remember, sequence ID is needed so that SAS Visual Analytics can render the outlines of each polygon correctly.

You can validate the order of records by mapping the polygon provider using PROC GMAP, for example:

proc gmap map=geo.chatham_cbg data=geo.chatham_cbg;
   id geoid;
   choro geoid / nolegend levels=1;

For example, in image #1 below, the records are ordered correctly. In image #2, the order or records is clearly wrong, hence the lines going crisscross.


As you can see, custom regional maps in SAS Visual Analytics 8.3 are pretty straightforward to implement. The few "gotchas" I described will help you troubleshoot some of the common issues you may encounter.

P.S. I would like to thank Falko Schulz for his help in reviewing this blog.

Troubleshooting custom polygon maps in SAS Visual Analytics 8.3 was published on SAS Users.

1月 272018

Are you interested in using SAS Visual Analytics 8.2 to visualize a state by regions, but all you have is a county shapefile?  As long as you can cross-walk counties to regions, this is easier to do than you might think.

Here are the steps involved:

Step 1

Obtain a county shapefile and extract all components to a folder. For example, I used the US Counties shapefile found in this SAS Visual Analytics community post.

Note: Shapefile is a geospatial data format developed by ESRI. Shapefiles are comprised of multiple files. When you unzip the shapefile found on the community site, make sure to extract all of its components and not just the .shp. You can get more information about shapefiles from this Wikipedia article:  https://en.wikipedia.org/wiki/Shapefile.

Step 2

Run PROC MAPIMPORT to convert the shapefile into a SAS map dataset.

libname geo 'C:\Geography'; /*location of the extracted shapefile*/
proc mapimport datafile="C:\Geography\UScounties.shp"

Step 3

Add a Region variable to your SAS map dataset. If all you need is one state, you can subset the map dataset to keep just the state you need. For example, I only needed Texas, so I used the State_FIPS variable to subset the map dataset:

proc sql;
create table temp as select
/*cross-walk counties to regions*/
when name='Anderson' then '4'
when name='Andrews' then '9'
when name='Angelina' then '5'
when name='Aransas' then '11',
when name='Zapata' then '11'
when name='Zavala' then '8'
end as region 
from geo.shapefile_counties
/*subset to Texas*/
where state_fips='48'; 

Step 4

Use PROC GREMOVE to dissolve the boundaries between counties that belong to the same region. It is important to sort the county dataset by region before you run PROC GREMOVE.

proc sort data=temp;
by region;
proc gremove
by region;
id name; /*name is county name*/

Step 5

To validate that your boundaries resolved correctly, run PROC GMAP to view the regions. If the regions do not look right when you run this step, it may signal an issue with the underlying data. For example, when I ran this with a county shapefile obtained from Census, I found that some of the counties were mislabeled, which of course, caused the regions to not dissolve correctly.

proc gmap map=geo.regions_shapefile data=geo.regions_shapefile all;
   id region;
   choro region / nolegend levels=1;

Here’s the result I got, which is exactly what I expected:

Custom Regional Maps in SAS Visual Analytics

Step 6

Add a sequence number variable to the regions dataset. SAS Visual Analytics 8.2 needs it properly define a custom polygon inside a report:

data geo.regions_shapefile;
set geo.regions_shapefile;

Step 7

Load the new region shapefile in SAS Visual Analytics.

Step 8

In the dataset with the region variable that you want to visualize, create a new geography variable and define a new custom polygon provider.

Geography Variable:

Polygon Provider:

Step 9

Now, you can create a map of your custom regions:

How to create custom regional maps in SAS Visual Analytics 8.2 was published on SAS Users.