
September 15, 2021

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. I've broken the series into logical, consumable parts. If you'd like to start by learning a little more about what CAS Actions are, please see CAS Actions and Action Sets - a brief intro. Or if you'd like to see other topics in the series, see the overview page. Otherwise, let's dive into exploring your data by viewing the number of distinct and missing values that exist in each column using the simple.distinct CAS action.

In this example, I will use the CAS procedure to execute the distinct action. Be aware, instead of using the CAS procedure, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language. Refer to the documentation for syntax from other languages.

Determine the Number of Distinct and Missing Values in a CAS Table

To begin, let's use the simple.distinct CAS action on the CARS in-memory table to view the action's default behavior.

proc cas;
    simple.distinct /
        table={name="cars", caslib="casuser"};
quit;

In the preceding code, I specify the CAS procedure, the action, and then reference the in-memory table. The results of the call are displayed below.

The results allow us to quickly explore the CAS table and see the number of distinct and missing values. That's great, but what if you only want to see specific columns?
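As a side note, here is roughly what the same call looks like from Python with the SWAT package. This is a minimal sketch, assuming you already have a CAS connection object named conn:

# Assumes an existing SWAT connection, e.g. conn = swat.CAS(host, port)
conn.simple.distinct(table={"name": "cars", "caslib": "casuser"})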

Specify the Columns in the Distinct Action

Sometimes your CAS tables contain hundreds of columns, but you are only interested in a select few. With the distinct action, you can specify a subset of columns using the inputs parameter. Here I'll specify the Make, Origin and Type columns.

proc cas;
    simple.distinct /
        table={name="cars", caslib="casuser"},
        inputs={"Make","Origin","Type"};
quit;

After executing the code, the results return the information for only the Make, Origin and Type columns.

Next, let's explore what we can do with the results.

Create a CAS Table with the Results

Some actions allow you to create a CAS table with the results. You might want to do this for a variety of reasons like use the new CAS table in a SAS Visual Analytics dashboard or in a data visualization procedure like SGPLOT.

To create a CAS table with the distinct action result, add the casOut parameter and specify new CAS table information, like name and caslib.

proc cas;
    simple.distinct /
        table={name="cars", caslib="casuser"},
        casOut={name="distinctCars", caslib="casuser"};
quit;

After executing the code, the action returns information about the name and caslib of the new CAS table, and the number of rows and columns.
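The casOut parameter translates directly to other languages as well. Here is a rough Python (SWAT) sketch of the same call, again assuming an existing connection named conn:

# Write the results to a new CAS table instead of returning them to the client
conn.simple.distinct(table={"name": "cars", "caslib": "casuser"},
                     casout={"name": "distinctCars", "caslib": "casuser"})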

Visualize the Number of Distinct Values in Every Column

Lastly, what if you want to create a data visualization to better explore the table? Maybe you want to visualize the number of distinct values for each column? This task can be accomplished with a variety of methods. However, since I know my newly created distinctCars CAS table has only 15 rows, I'll reference the CAS table directly using the SGPLOT procedure.

This method works as long as the LIBNAME statement references your caslib correctly. I recommend this method when you know the CAS table is a manageable size. This is important because the CAS server does not execute the SGPLOT procedure on a distributed CAS table. The CAS server instead transfers the entire CAS table back to the client for processing.

To begin, the following LIBNAME statement will reference the casuser caslib.

libname casuser cas caslib="casuser";

Once the LIBNAME statement is correct, all you need to do is specify the CAS table in the DATA option of the SGPLOT procedure.

title justify=left height=14pt "Number of Distinct Values for Each Column in the CARS Table";
proc sgplot data=casuser.distinctCars
            noborder nowall;
    vbar Column / 
        response=NDistinct
        categoryorder=respdesc
        nooutline
        fillattrs=(color=cx0379cd);
    yaxis display=(NOLABEL);
    xaxis display=(NOLABEL);
run;

The results show a bar chart with the number of distinct values for each column.

Summary

The simple.distinct CAS action is an easy way to explore a distributed CAS table. With one simple action, you can easily see how many distinct values are in each column, and the number of missing rows!

In Part 2 of this post, I'll further explore the simple.distinct CAS action and offer more ideas on how to interpret and use the results.

Additional Resources

distinct CAS action
SAS® Cloud Analytic Services: Fundamentals
Plotting a Cloud Analytic Services (CAS) In-Memory Table
Getting started with SGPLOT - Index
Code

CAS-Action! Simply Distinct - Part 1 was published on SAS Users.

August 28, 2021

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. I've broken the series into logical, consumable parts. If you'd like to start by learning a little more about what CAS Actions are, please see CAS Actions and Action Sets - a brief intro. Or if you'd like to see other topics in the series, see the overview page. Otherwise, let's dig in with the table.columnInfo action.

In this post, we'll look at exploring the column attributes of a CAS table. Knowing the column names, data types, and any additional formats or labels associated with the columns makes it easier to work with the data. One way to see this type of information on a CAS table is to use the table.columnInfo CAS action!

In this example, I will use the CAS procedure to execute the columnInfo action. Be aware, instead of using the CAS procedure, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language.

Return CAS table Column Information

To view the column information of your CAS table, use the columnInfo CAS action with the table parameter. That's it! Refer to the code below.

proc cas;
    table.columnInfo / table={name="cars", caslib="casuser"};
quit;

The results would appear similar to the following:

table.columnInfo CAS action results

The action returns a variety of information about each column in the CAS table:

  • the column name
  • a label if one has been applied
  • the Id, which indicates the position of the column in the table
  • the data type of the column
  • the column length
  • the format, formatted length, width and decimal
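As with the other actions in this series, the same call works from Python with the SWAT package. A minimal sketch, assuming an existing connection named conn:

# Returns the same column metadata as the PROC CAS call above
conn.table.columninfo(table={"name": "cars", "caslib": "casuser"})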

Create a CSV Data Dictionary Using the ColumnInfo Action

What if instead, you want to create a data dictionary documenting the CAS table? With the columnInfo action you can export the results to a CSV file!

I'll use the columnInfo action again, but this time I'll store the results in a variable. The variable is a dictionary, so I need to reference the dictionary ci, then the columnInfo key to access the table. Next, I'll create two computed columns using the compute operator. The first column contains the name of the CAS table, and the second, the caslib of the table. I'll print the new table to confirm the results.

proc cas;
    table.columnInfo result=ci / table={name="cars", caslib="casuser"};
 
    ciTbl = ci.ColumnInfo.compute({"TableName","Table Name"}, "cars")
                         .compute({"Caslib"}, "casuser");
 
    print ciTbl;
    ...

The code produces the following result:

The new result table documents the CAS table columns, table name and caslib.

Lastly, I'll export the result table to a CSV file. First, I'll specify the folder location using the outpath variable. Then use the SAVERESULT statement to save the result table as a CSV file named carsDataDictionary.csv.

    ...
    outpath="specify a folder location";
    saveresult ciTbl csv=outpath || "carsDataDictionary.csv";
quit;

After I execute the CAS procedure, I can find and open the CSV file to view the documented CAS table!
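If you're working from Python instead, the action's result is a dictionary whose ColumnInfo entry behaves like a pandas DataFrame, so the export takes just a few lines. A rough sketch, assuming a SWAT connection named conn (the output path is hypothetical):

r = conn.table.columninfo(table={"name": "cars", "caslib": "casuser"})

df = r["ColumnInfo"]          # a pandas-style DataFrame
df["TableName"] = "cars"      # document the source table
df["Caslib"] = "casuser"      # and its caslib

df.to_csv("/path/to/carsDataDictionary.csv", index=False)  # hypothetical path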

Summary

The table.columnInfo CAS action is a simple and easy way to show column information about your distributed CAS table. Using the results of the action allows you to create a data dictionary in a variety of formats.

Additional resources

table.columnInfo CAS action
CAS Action! - a series on fundamentals
SAS® Cloud Analytic Services: Fundamentals
CASL Result Tables
SAVERESULT Statement
Code

CAS-Action! Show me the ColumnInfo! was published on SAS Users.

August 19, 2021

In Part 1 of my series fetch CAS, fetch!, I executed the fetch CAS action to return rows from a CAS table. That was great, but what can you do with the results? Maybe you want to create a visualization that includes the top five cars by MSRP for all Toyota vehicles? How can we accomplish this task? We'll cover this question and provide several other examples in this post.

Save the results of a CAS action as a SAS data set

First, execute the table.fetch CAS action on the CARS in-memory table to filter for Toyota cars, return the Make, Model and MSRP columns, and sort the results by MSRP. Then save the results of the action in a variable using the result option. The results of an action return a dictionary to the client. The fetch action returns a dictionary with a single key, and the result table as the value. In this example, I'll name the variable toyota.

proc cas;
    table.fetch result=toyota / 
          table={name="cars", caslib="casuser",
                 where="Make='Toyota'",
                 vars={"Make","Model","MSRP"}
          },
          sortBy={
                 {name="MSRP", order="DESCENDING"}
          },
          index=FALSE,
          to=5;
...

After executing the code, the results of the action are stored in the variable toyota and not shown in the output.

Next, use the SAVERESULT statement to save the result table stored in the toyota variable. Since the variable is a dictionary, specify the variable name toyota, a dot, then the fetch key. This will access the result table from the dictionary. Finally, specify the DATAOUT= option with the name of the SAS data set to create.

proc cas;
    table.fetch result=toyota / 
          table={name="cars", caslib="casuser",
                 where="Make='Toyota'",
                 vars={"Make","Model","MSRP"}
          },
          sortBy={
                 {name="MSRP", order="DESCENDING"}
          },
          index=FALSE,
          to=5;
 
     saveresult toyota.fetch dataout=work.top5;
quit;

After executing the code, the result table is saved as a SAS data set. The SAS data set is named top5 and saved to the WORK library.

Wondering what else can we do? Let's take a look.

Visualize the SAS data set

Now that the result table is saved as a SAS data set, you can use the SGPLOT procedure to create a bar chart! Consider the code below.

title justify=left height=14pt "Top 5 Toyota Cars by MSRP";
proc sgplot data=work.top5
         noborder nowall;
    vbar Model / 
          response=MSRP 
          categoryorder=respdesc
          nooutline
          fillattrs=(color=cx0379cd);
    label MSRP="MSRP";
run;
title;

There it is! We processed our data in CAS using the fetch action, returned a smaller subset of results back to the client, then used traditional SAS programming techniques on the smaller table. This method works similarly in other languages like Python and R. You can then use the native visualization packages of the language!
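For instance, here is a rough Python sketch of the same pattern (server-side fetch, client-side plot), assuming a SWAT connection named conn and that matplotlib is installed:

r = conn.table.fetch(
    table={"name": "cars", "caslib": "casuser",
           "where": "Make='Toyota'",
           "vars": ["Make", "Model", "MSRP"]},
    sortby=[{"name": "MSRP", "order": "DESCENDING"}],
    index=False, to=5)

df = r["Fetch"]   # the result table behaves like a pandas DataFrame
df.plot.bar(x="Model", y="MSRP", title="Top 5 Toyota Cars by MSRP")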

You can now use your imagination on what else to do with the raw data from the CARS table or from the top5 results table we produced with the table.fetch action. Feel free to get creative.

Summary

CAS actions are optimized to run in a distributed environment on extremely large tables. Your CAS table can contain millions or even billions of rows. Since the data in CAS can be extremely large, the goal is to process and subset the table on the CAS server, then return a smaller amount of data to the client for additional processing, visualization or modeling.

Additional resources

fetch Action
SAVERESULT Statement
SAS® Cloud Analytic Services: Fundamentals
Plotting a Cloud Analytic Services (CAS) In-Memory Table
Getting started with SGPLOT - Index
Code used in this post

CAS-Action! fetch CAS, fetch! - Part 2 was published on SAS Users.

August 16, 2021

In Part I of this blog post, I provided an overview of the approach my team and I took tackling the problem of classifying diverse, messy documents at scale. I shared the details of how we chose to preprocess the data and how we created features from documents of interest using LITI rules in SAS Visual Text Analytics (VTA) on SAS Viya.

In this second part of the post, I'll discuss:

  • scoring VTA concepts and post-processing the output to prepare data for predictive modeling
  • building predictive models using SAS Visual Data Mining and Machine Learning
  • assessing the results

Scoring VTA Features

Recall from Part I, I used the image in Figure 1 to illustrate the type of documents our team was tasked to automatically classify.

Figure 1. Sample document

I then demonstrated a method for extracting features from such documents, using the Language Interpretation Text Interpretation (LITI) rules. I used the “Color of Hair” feature as an example. Once the concepts for a decent number of features were developed, we were ready to apply them to the corpus of training documents to build an analytic table for use in building a document classification model.

The process of applying a VTA concept model to data is called scoring. To score your documents, you first need to download the model score code. This is a straightforward task: run the Concepts node to make sure the most up-to-date model is compiled, and then right-click on the node and select “Download score code.”

Figure 2. Exporting concept model score code

You will see four files inside the downloaded ZIP file. Why four files? There are two ways you can score your data: with the concepts analytic store (astore) or with the concepts model. The first two files you see in Figure 2 score the data using the astore method, and the third and fourth files correspond to scoring using the concept .li binary model file.

If you are scoring in the same environment where you built the model, you don’t need the .li file, as it is available in the system and is referenced inside the ScoreCode.sas program. This is the reason I prefer to use the second method to score my data – it allows me to skip the step of uploading the model file. For production purposes though, the model file will need to be copied to the appropriate location so the score code can find it. For both development and production, you will need to modify the macro variables to match the location and structure of your input table, as well as the destination for your result files.

Figure 3. Partial VTA score code – map the macro variables in curly brackets, except those in green boxes

Once all the macro variables have been mapped, run the applyConcept CAS action inside PROC CAS to obtain the results.

Figure 4. Partial VTA score code

Depending on the types of rules you used, the model scoring results will be populated in the OUT_CONCEPTS and/or OUT_FACTS tables. Facts are extracted with sequence and predicate rule types that are explained in detail in the book I mentioned in Part I. Since my example didn’t use these rule types, I am going to ignore the OUT_FACTS table. In fact, I could modify the score code to prevent it from outputting this table altogether.

Looking at the OUT_CONCEPTS table, you can see that scoring returns one row per match. If you have multiple concepts matching the same string or parts of the same string, you will see one row for each match. For troubleshooting purposes, I tend to output both the target and supporting concepts, but for production purposes, it is more efficient to set the behavior of supporting concepts to “Supporting” inside the VTA Concepts node, which will prevent them from showing up in the scored output (see Figure 5). The sample output in Table 1 shows matches for target concepts only.

Table 1. OUT_CONCEPTS output example

Figure 5. Setting concept behavior to supporting

Completing Feature Engineering

To complete feature creation for use in a classification model, we need to do two things:

  1. Convert extracted concepts into binary flags that indicate if a feature is present on the page.
  2. Create a feature that will sum up the number of total text-based features observed on the page.

To create binary flags, you need to do three things: deduplicate results by concept name within each record, transpose the table to convert rows into columns, and replace missing values with zeros. The last step is important because you should expect your pages to contain a varying number of features. In fact, you should expect non-target pages to have few, if any, features extracted. The code snippet below shows all three tasks accomplished with two DATA steps and a PROC TRANSPOSE step.

/*Deduplicate OUT_CONCEPTS. Add a dummy variable to use in transpose later*/
data CASUSER.OUT_CONCEPTS_DEDUP(keep=ID _concept_ dummy);
	set CASUSER.OUT_CONCEPTS;
	dummy=1;
	by ID _concept_;
	if first._concept_ then output;
	run;
 
/*Transpose data*/
proc transpose data=CASUSER.OUT_CONCEPTS_DEDUP out=CASUSER.OUT_CONCEPTS_TRANSPOSE;
	by ID;
	id _concept_;
	run;
 
/*Replace missing with zeros and sum features*/
data CASUSER.FOR_VDMML(drop=i _NAME_);
	set CASUSER.OUT_CONCEPTS_TRANSPOSE;
	array feature (*) _numeric_;
	do i=1 to dim(feature);
		if feature(i)=. then feature(i)=0;
	end;
	Num_Features=sum(of T_:);
	run;
 
/*Merge features with labels and partition indicator*/
data CASUSER.FOR_VDMML;
merge CASUSER.FOR_VDMML
	  CASUSER.LABELED_PARTITIONED_DATA(keep= ID target partition_ind);
by ID;
run;

To create an aggregate feature, we simply added up all features to get a tally of extracted features per page. The assumption is that if we defined the features well, a high total count of hits should be a good indicator that a page belongs to a particular class.

The very last task is to merge your document type label and the partitioning column with the feature dataset.

Figure 6 shows the layout of the table ready for training the model.

Figure 6. Training table layout

Training Classification Model

After all the work to prepare the data, training a classification model with SAS Visual Data Mining and Machine Learning (VDMML) was a breeze. VDMML comes prepackaged with numerous best-practices pipeline templates designed to speed things up for the modeler. We used the advanced template for a classification target (see Figure 7) and found that it performed the best even after multiple attempts to improve performance. VDMML is a wonderful tool which makes model training easy and straightforward. For details on how to get started, see this free e-book or check out the tutorials on the SAS Support site.

Figure 7. Example of a VDMML Model Studio pipeline

Model Assessment

Since we were dealing with rare targets, we chose the F-score as our assessment statistic. The F-score is the harmonic mean of precision and recall, F1 = 2 * (precision * recall) / (precision + recall), and provides a more realistic model assessment than, for example, a misclassification rate. You can specify F-score as your model selection criterion in the properties of a VDMML Model Studio project.

Depending on the document type, our team was able to achieve an F-score of 85%-95%, which was phenomenal, considering the quality of the data. At the end of the day, incorrectly classified documents were mostly those whose quality was dismal – pages too dark, too faint, or too dirty to be adequately processed with OCR.

Conclusion

So there you have it: a simple, but effective and transparent approach to classifying difficult unstructured data. We preprocessed data with Microsoft OCR technologies, built features using SAS VTA text analytics tools, and created a document classification model with SAS VDMML. Hope you enjoyed learning about our approach – please leave a comment to let me know your thoughts!

Acknowledgement

I would like to thank Dr. Greg Massey for reviewing this blog series.

Classifying messy documents: A common-sense approach (Part II) was published on SAS Users.

August 12, 2021

Welcome to my SAS Users blog series CAS Action! - a series on fundamentals. I've broken the series into logical, consumable parts. If you'd like to start by learning a little more about what CAS Actions are, please see CAS Actions and Action Sets - a brief intro. Or if you'd like to see other topics in the series, see the overview page. Otherwise, let's dig in with our first topic - the fetch action.

The table.fetch CAS action retrieves the first 20 rows of a distributed CAS in-memory table. It's similar to using the head method in Python or the head function in R. However, the fetch action can do more than fetch rows from a table. In this example, I will use the CAS procedure to execute the fetch action. Be aware, instead of using the CAS procedure, I could execute the same action with Python, R and other languages with some slight changes to the syntax. Pretty cool!

Retrieve the first 20 rows of a CAS table using the fetch action

To begin, let's use the table.fetch CAS action on the cars in-memory table to view the action's default behavior. Consider the following code:

proc cas; 
    table.fetch / table={name="cars", caslib="casuser"}; 
quit;

To use the fetch action with the CAS language, you specify the CAS procedure, the action, a forward slash, then the table parameter to specify the CAS table. In the table parameter I'll use the name sub-parameter to specify the cars in-memory table, and the caslib sub-parameter to specify the casuser caslib. The result of the call is listed below.

We retrieved 20 rows from our distributed cars table. What else can we fetch?

Retrieve the first n rows of a CAS table

What if you don't want the first 20 rows? Maybe you only want the first 5. To modify the number of rows returned, use the to parameter, then specify the number of rows. In the results, I also saw an _Index_ column. That column appears by default. To remove the _Index_ column, add the index=FALSE parameter.

proc cas;
    table.fetch /
         table={name="cars", caslib="casuser"},
         to=5,
         index=FALSE;
quit;

The results return 5 rows from the cars table, and the _Index_ column has been removed.

Now, what if I want to sort the returned rows?

Sort the table

The order of rows returned from a CAS table is not guaranteed, since the table is distributed among the CAS workers. To see the results in a precise order, use the sortBy parameter. The sortBy parameter requires an array of key-value pairs (dictionaries), so it's a bit tricky the first time you use it.

In this example, let's sort the table by Make and MSRP in descending order.

proc cas;
    table.fetch /
          table={name="cars", caslib="casuser"},
          sortBy={
                  {name="Make", order="DESCENDING"},
                  {name="MSRP", order="DESCENDING"}
          },
          index=FALSE;
quit;

The results show 20 rows of the cars table sorted by Make and MSRP. Great!
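As a quick aside, the same sort carries over to Python with SWAT almost verbatim. A minimal sketch, assuming an existing connection named conn:

# sortby takes a list of dictionaries, one per sort column
conn.table.fetch(
    table={"name": "cars", "caslib": "casuser"},
    sortby=[{"name": "Make", "order": "DESCENDING"},
            {"name": "MSRP", "order": "DESCENDING"}],
    index=False)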

Subset the table

What if I only want to see the cars where Make is Toyota, and return the columns Make, Model, MSRP and Invoice? You can add the where and vars sub-parameters in the table parameter to subset the table.

proc cas;
    table.fetch /
          table={name="cars", caslib="casuser",
                 where="Make='Toyota'",
                 vars={"Make","Model","MSRP","Invoice"}
           },
           to=5,
           index=FALSE;
quit;

The results return 5 rows from the cars table where Make is Toyota and only the four specified columns.

Quick detour

Instead of using the vars sub-parameter as we did in the previous example, you can use the fetchVars parameter. The code would change to:

proc cas;
    table.fetch /
          table={name="cars", caslib="casuser",
                 where="Make='Toyota'"},
          fetchVars={"Make","Model","MSRP","Invoice"},
          to=5,
          index=FALSE;
quit;

Either method works and it's totally up to you.

Create a calculated column

Lastly, let's create a calculated column named MPG_Avg that calculates the average city and highway miles per gallon for each car, then subset the results for all cars with an MPG_Avg greater than 40. To create a calculated column, use the computedVarsProgram sub-parameter in the table parameter. Then you can subset on the calculated column with the where sub-parameter.

proc cas;
    table.fetch /
          table={name="cars", caslib="casuser",
                 vars={"Make","Model","MPG_Avg"},
                 where="MPG_Avg > 40",
                 computedVarsProgram="MPG_Avg=mean(MPG_City,MPG_Highway)"
          },
    sortBy={
             {name="MPG_Avg", order="DESCENDING"}
    },
    index=FALSE;
quit;
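The results return the cars whose average MPG exceeds 40, sorted in descending order. For completeness, the computed-column example also translates to Python with SWAT by passing the same table sub-parameters. A sketch, assuming a connection named conn:

conn.table.fetch(
    table={"name": "cars", "caslib": "casuser",
           "computedVarsProgram": "MPG_Avg=mean(MPG_City,MPG_Highway)",
           "vars": ["Make", "Model", "MPG_Avg"],
           "where": "MPG_Avg > 40"},
    sortby=[{"name": "MPG_Avg", "order": "DESCENDING"}],
    index=False)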

Summary

CAS actions are optimized to run on the CAS server on large data. They are flexible and offer many parameters to control the output and can be executed in a variety of languages! In the next post, I'll cover more on the fetch action.

Additional Resources

fetch Action documentation
SAS® Cloud Analytic Services: Fundamentals documentation
Code used in this post

CAS-Action! fetch CAS, fetch! - Part 1 was published on SAS Users.

August 7, 2021

I'm excited to curate a series of posts focused on CAS Actions. However, before I dive into details and code, I thought I'd take a moment to lay a foundation for a better understanding of CAS and CAS actions. I'll cover things here at a high level as there's quite a bit of information out there already. Please see the Additional Resources section at the end of this article for more details.

CAS actions are highly optimized units of work for SAS Viya's Cloud Analytic Services (CAS) distributed computing engine. Think of CAS actions as powerful functions or methods created specifically to process data in the CAS server. Actions can load data, manage tables, perform general analytics, statistics, and machine learning, as well as execute DATA step, FedSQL, DS2 and more. Did I mention they can do a lot?

Before we dive into details about actions, let's quickly review the CAS server in SAS Viya.

CAS Server

First, the CAS server. It provides a cloud-based, run-time environment for data management and analytics in SAS Viya. You process data in all stages of the analytic life cycle using the power of distributed computing. To process data in the CAS server, you must first load data into memory from a data source.

Data Sources

Data sources connect to the CAS server through caslibs. Caslibs can access a wide variety of data sources including relational databases, streaming data, SAS data sets, unstructured data, and familiar file formats, such as XML, JSON, CSV, and XLSX.

CAS Actions

Once you have data available in the CAS server you can begin to process it using CAS actions. Actions are the native language of the CAS server. Actions are grouped together into action sets. You can think of an action set as a package, and the actions inside an action set as methods. There are dozens of CAS action sets and hundreds of CAS actions available.

Executing Actions

There are a variety of interfaces available to execute CAS actions. One method is using the native CAS Language (CASL). CASL is a statement-based scripting language. With CASL you can use general programming logic to execute actions, work with the results, and develop analytic pipelines. You can also execute actions through the CAS API using languages like SAS, FedSQL, Python, R, Java, Lua and REST.

Using Familiar Language Methods

One major benefit of CAS actions is that the CAS API allows you to execute familiar syntax in your language of choice. When you use familiar methods, the CAS API converts them into actions behind the scenes.

For example, to retrieve the first ten rows of a CAS table you can use the SAS PRINT procedure, the Python head method, or the R head function. All of these methods will convert to the fetch action behind the scenes through the CAS API. Many familiar procedures, methods and functions are available. See the SAS Viya Programming for Developers documentation for more details about specific languages.

NOTE: The SWAT package is required to use Python and R with the CAS server.

Executing Actions in a Variety of Languages

Another benefit of CAS actions is that you can directly execute actions from a variety of languages. This allows all types of programmers to work together using common actions. For example, you can execute the same fetch action in SAS, R and Python.

NOTE: The SWAT package is required to use Python and R with the CAS server.
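To make the distinction concrete, here is a minimal Python sketch of both styles, assuming a SWAT connection named conn and a loaded cars table:

cars = conn.CASTable("cars", caslib="casuser")

# Familiar pandas-style method; the CAS API converts this
# to the table.fetch action behind the scenes
cars.head(10)

# The same work expressed as a direct action call
conn.table.fetch(table={"name": "cars", "caslib": "casuser"}, to=10)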

SAS Viya Applications

Lastly, SAS Viya has many applications that can work with data in the CAS server. These applications provide a variety of functionality, from point-and-click to programming, executing actions behind the scenes.

Summary

Hopefully, I've provided enough of a base for CAS, CAS actions, and CAS action sets for you to get started. Here's what you need to know to move forward:

  • SAS Viya's CAS server is a powerful distributed computing engine that processes big data fast.
  • CAS actions are the native language of the CAS server.
  • CAS actions can be executed in a variety of languages like CASL, SAS, Python, R, Lua, Java and REST.
  • SAS Viya has additional applications to work with CAS data.

Additional Resources

SAS® Cloud Analytic Services: Fundamentals
The Architecture of the SAS® Cloud Analytic Services in SAS® Viya™
What is CASL?
SAS Viya Programming examples on GitHub
Using SAS Cloud Analytics Service REST APIs to run CAS Actions
SAS Developers Portal
SAS Developers Community

CAS Actions and Action Sets - a brief intro was published on SAS Users.

August 7, 2021

I'm excited to bring a series of posts centered on CAS Actions. The topics in the list below will be covered in posts set to publish in the next few weeks. Please check back often for new releases.

CAS Actions and Action Sets - a brief intro
CAS-Action! fetch CAS, fetch! - Part 1
CAS-Action! fetch CAS, fetch! - Part 2
CAS-Action! Show me the ColumnInfo!
CAS-Action! Simply Distinct - Part 1
CAS-Action! Simply Distinct - Part 2
CAS-Action! executing SQL in CAS

CAS Action! - a series on fundamentals was published on SAS Users.

August 5, 2021

Unstructured text data is ubiquitous in both business and government and extracting value from it at scale is a common challenge. Organizations that have been around for a while often have vast paper archives. Digitizing these archives does not necessarily make them usable for search and analysis, since documents are commonly converted into images that are not inherently machine-readable. I recently participated in a project which aimed, as one of its goals, to automate the search and tagging of specific pages within such digitized archives. We encountered several challenges:

  • These PDF-based files were very long, up to 3,000 pages and sometimes longer, so looking for relevant pages was like looking for a needle in a haystack.
  • Files arrived in the tens of thousands per month and needed to be processed promptly to accommodate the customer’s business process.
  • Relevant pages spanned decades, came in dozens of formats, were only partially machine-readable and exhibited various data quality problems related to how they had been handled prior to being digitized.

With some trial and error, our team developed a common-sense approach to classifying pages in these documents through a combination of advanced optical character recognition (OCR) technologies coupled with feature engineering using SAS Visual Text Analytics (VTA) and supervised machine learning using SAS Visual Data Mining and Machine Learning (VDMML).

In this two-part blog post, I describe our approach and share some lessons learned. I spend the most time on data preparation and feature engineering, because it was a crucial component of this work and is so critical for building accurate models. To maintain customer confidentiality, I am keeping the project details to a minimum and using generic data examples.

In this part of the post, I describe our high-level approach and dive into preprocessing and feature extraction. In Part II, I will discuss data preparation for supervised modeling, model development and assessment. To replicate our approach, you would require licenses for SAS Visual Text Analytics and SAS Visual Data Mining and Machine Learning, and, if your data is not machine-readable, a high-quality OCR tool.

High-Level Approach

At a high level, the model development pipeline for this document classification problem looked like this:

  • Preprocess and label data
  • Extract features
  • Build and evaluate a supervised model
Figure 1. Model development pipeline

Data preparation and preprocessing

When my colleagues and I got our first glimpse of the data, our knees nearly buckled. Although the data was in a PDF format, only a small portion of it was machine readable. Some other common issues included:

  • Old, faded, stained, crumpled and torn pages
  • Holes punched through crucial sections, such as form titles
  • Salt-and-pepper effects and smudges from many rounds of faxing and photocopying
  • Documents populated by hand, with a typewriter, or faintly printed on a matrix printer
  • Handwritten notes and marks both on the margins and over the form body

We needed clean-ish text to work with, but it quickly became apparent we also needed a heavy-duty optical character recognition tool to extract it. After some trial and error, we chose to preprocess the data with the Microsoft Cognitive Services Read API. The API handles data in many formats and recognizes both typed and handwritten text. The API returns data in a JSON format which is easy to transform into delimited files or SAS tables. To give you a flavor of its ability to handle messy documents, consider this sample document in the image below.

Figure 2. Sample document image

And now check out the text extracted by the API. I highlighted some areas of interest, including text that came from handwriting and the regions adjacent to the holes punched at the top. Considering the image quality, overall, the output is remarkably accurate: both hand-written and typed text is recognized, and the text is returned consistent with a left to right reading pattern.

Figure 3. Text extracted from Figure 2. Handwriting highlighted.

We chose to parse the text data into one page per row, except for some special cases where it made more sense to break down continuous printouts into logical chunks.
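To give a sense of how little glue code that takes, here is a hypothetical Python sketch that flattens a Read API response into one row of text per page. The exact JSON layout depends on the API version you call, so treat the key names below as assumptions:

import json
import pandas as pd

# Hypothetical key names, modeled on a v3-style Read response
with open("read_api_response.json") as f:
    result = json.load(f)

rows = []
for page in result["analyzeResult"]["readResults"]:
    # Join the recognized lines into one text blob per page
    text = " ".join(line["text"] for line in page["lines"])
    rows.append({"page": page["page"], "text": text})

pd.DataFrame(rows).to_csv("pages.csv", index=False)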

A parallel task was to label data for training a supervised model. We were tasked with finding and tagging several classes of forms and documents, with many different versions and styles within each document class. The vast majority of pages in these files were irrelevant, so we were dealing with rare targets. As often happens, the customer did not have reliably labeled data, so there were no shortcuts. We were forced to read through a sample of files and label pages manually. As tedious as it sounds, this gave our team an opportunity to become intimately familiar with the data and ask the customer numerous clarifying questions. I am confident getting to know the data before diving into modeling helped us understand and define the problem better, and ultimately enabled us to create a better product for the customer.

As we labeled the data, we took screenshots of unique versions of documents we encountered and populated a form catalog. This helped us create a visual cheat sheet for use in feature development. The catalog proved so useful that even the customer asked if they could have a copy of it to use for their internal trainings!

Feature engineering

The field of Natural Language Processing has exploded in recent years with significant advancements due to the introduction of novel deep learning architectures capable of solving many text-related problems. However, deep learning is not transparent, requires substantial computational resources and takes a lot of tuning to get the best results. Most importantly, it also requires significant amounts of labeled data, which is expensive to create. Therefore, our team chose to start with a simpler but more transparent approach.

The cornerstone of this approach was to use SAS Visual Text Analytics (VTA), which runs in Cloud Analytic Services (CAS) inside the SAS Viya platform, to generate features for training a machine learning model. We used the Concepts node in VTA to create these features on a sample of records reserved for model training. The VTA Concepts node provides an interface for developing Language Interpretation Text Interpretation (LITI) rules to capture specific entities within a textual context. Because the LITI language is so flexible, you can write rules which overcome many data quality problems.

Figure 4. Example of a Concepts node inside a SAS VTA pipeline

Consider again our sample document. In Figure 5, I highlighted some strings that are good candidates for extraction as features because they are part of the form structure and are easily readable with the naked eye, meaning that they should lend themselves well to OCR and remain relatively stable within each document class of interest.

Figure 5. Candidate feature examples

The trick to extracting these strings as robust features using VTA is to anticipate some common ways in which the OCR tool may fail, as well as to account for changes in the wording of these common elements over time. This is where the power of the LITI rules comes in, and the work of building the form catalog pays off.

For example, consider a string “7. Color of Hair” as in the image to the right. In this example, it is perfectly readable and would make a good feature, but here are some of the pattern deviations you would want to account for when building rules to capture it:

  • OCR may incorrectly recognize the period as a comma or skip it altogether.
  • OCR may incorrectly recognize the word “of” as something else, such as “ot”, “or”, etc. depending on the quality of the page.
  • The digit 7 may change to 6, 7a, 8 or some other alphanumeric value depending on the version of the form.
  • Different versions of the form may change the wording from Color of Hair to Hair Color.

Using the flexible LITI rules, you can easily account for all of these patterns. The table below shows some examples of LITI rules. The strings matched by each rule are shown in blue. Our objective is to match the “Color of Hair”/“Hair Color” strings if they occur in close proximity to a corresponding digit, to ensure that we match the form feature rather than some other mention of an individual’s hair color. The rules in the table are gradually extended to provide more flexibility and match more sample strings.

Table 1. Feature extraction rule examples

You can see how we are able to match more and more patterns as we introduce placeholders for digits, add separators and specify ordered distances between tokens to make the rules more flexible. In the last example, the matched strings do not include the item number, but that’s not a problem, because we know that “Color of Hair” will only be matched if it’s preceded by an item number. We don’t need to match the complete string to build a feature, as long as the matched string occurs in a relevant context.

We could continue to add rules like this to eventually match all desired string patterns. However, this could lead to increasingly complex and/or redundant rules. Instead, a preferred approach is to break down this task into pieces: match the digit, match the “Color of Hair” or “Hair Color” string, and then combine the results to get the final match. Table 2 shows the rules above refactored to follow this pattern. Instead of one concept, we now have three: one target concept, denoted with T_, and two supporting concepts, denoted with L_ (for lexicon or lookup).

Table 2. Refactored rule example

After refactoring the rules, the target concept T_HairColor catches every pattern we want to match and skips the one string that lacks an item number - as intended. What’s even better, we can now reuse some of this work to extract other document features. For example, the L_ItemNumber concept can be reused to match “6. Date of Birth”, “9. Sex”, “11. Occupation” and so on.

I didn’t cover every available rule type, but hopefully this gives you a taste of how they work. There are many great resources that can help you get started with the LITI rules, such as the SAS tutorial Creating Concept Rules Using Textual Elements in SAS Visual Text Analytics, but my favorite resource is SAS Text Analytics for Business Applications by SAS’ Teresa Jade et al. Keep in mind that LITI is proprietary to SAS and can be leveraged within SAS text analytics products, such as SAS Visual Text Analytics.

Conclusion

So far, I shared our approach to preprocessing messy documents and using SAS Visual Text Analytics concepts to extract features from documents of interest. If you are wondering about the next steps, stay tuned for Part II of this series, in which I will discuss:

  • Scoring VTA concepts and post-processing the output to prepare data for predictive modeling
  • Building predictive models using the drag-and-drop interface of SAS Visual Data Mining and Machine Learning
  • Assessing results

Acknowledgement

I would like to thank Dr. Greg Massey for reviewing this blog series.

Classifying messy documents: A common-sense approach (Part I) was published on SAS Users.

June 23, 2021

Welcome to the fifth installment in my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, how to work with the results, and how your data is organized on the CAS server. In this post I'll discuss loading files into the CAS server. Before proceeding, I need to clear up some terminology. There are two sources for loading files into memory, server-side and client-side. It's important to understand the difference.

Server-Side vs Client-Side

Let's start off with an image depicting server- and client-side file locations.

Server-side files are mapped to a caslib. That is, files in the data source portion of a caslib are server-side. For more information on data sources, see Part 4 of this series. Client-side files are files that are not mapped to a caslib. These could be files stored in a SAS library or other local files. That's about it!

In this post, I'll focus on loading server-side files into memory (Part 6 of the series will cover client-side files). In Python, we have two data loading options: the loadTable CAS action or the load_path method. The techniques are almost identical. The difference is the loadTable action returns a dictionary of information and the load_path method returns a CASTable. If you use the loadTable action, you must reference the new table in a separate step. In general, I prefer the actions as they are available in any language such as CASL, R, and Python. Conversely, methods like load_path are specific to a language.
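Here is a quick sketch of that difference, assuming an existing connection named conn and a server-side cars.sas7bdat file in the casuser caslib (in practice you would load the file only once):

# loadTable action: returns a dictionary describing the new table;
# the in-memory table is referenced in a separate step
r = conn.loadtable(path="cars.sas7bdat", caslib="casuser")
tbl = conn.CASTable(r["tableName"], caslib="casuser")

# load_path method: returns a CASTable reference directly
tbl2 = conn.load_path("cars.sas7bdat", caslib="casuser")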

Let's look at a few examples of loading different types of datafiles into CAS.

Loading a SAS Data Set into Memory

In the first example, I want to load the cars.sas7bdat SAS data set from the casuser caslib into memory. Let's confirm the file exists in the caslib's data source by executing the fileInfo action. I made my connection to CAS, naming it conn. For more information about making a connection to CAS, visit Part 1 of the series.

conn.fileInfo(caslib="casuser")

In the results I can see that the cars.sas7bdat file is available in the casuser caslib.

Next, I'll use the loadTable action. First I'll specify the path parameter with the full file name, then the caslib parameter specifying the input caslib, and lastly the casOut parameter to specify the new CAS table. In the casOut parameter you specify a dictionary of key value pairs. Here, I want to name the new table cars_sas and place the table in the casuser caslib.

conn.loadTable(path="cars.sas7bdat", caslib="casuser",
               casOut={
                       "name":"cars_sas",
                       "caslib":"casuser"
               })

After executing the loadTable action, the result returns a dictionary of information about the new table.

Now that the file is loaded into memory, I can process the data by making a reference to the in-memory table using the CASTable method. Then I'll use the head method to view the first 5 rows.

cars = conn.CASTable("cars_sas",caslib="casuser")
cars.head()

Loading a CSV file into Memory

In the next example, I'll load a CSV file into memory. Even though it's a different file type, we still use the loadTable action. The only change needed is the file name in the path parameter. That's it!

conn.loadTable(path="cars.csv", caslib="casuser",
           casOut={
                "name":"cars_csv",
                "caslib":"casuser"
           })
 
csv = conn.CASTable("cars_csv", caslib="casuser")
csv.head()

Loading a Text file into Memory

Lastly, data is not always stored in the correct format. Sometimes when loading files you need to modify the default parameters. In this final scenario, let's explore an example using the importOptions parameter. importOptions takes a variety of key-value pairs that dictate how to load the file. The options vary depending on the file type.

The file in this example contains no column names, and the delimiter is a vertical bar (or pipe).

I'll begin with specifying the column names in a variable named colNames. Then, in the loadTable action I'll add the importOptions parameter and specify the following:

  • the file type I want to import, using the fileType subparameter,
  • the delimiter, using the delimiter subparameter,
  • the getNames subparameter with the value False, since no column names exist,
  • and lastly, the vars subparameter with the names of the columns.

And as in the previous examples, I'll print out the first five lines of the resulting table from memory.

colNames=['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
          'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
          'Weight', 'Wheelbase', 'Length']
 
conn.loadTable(path="cars_delim_bar.txt", caslib="casuser",
           casOut={
                "name":"cars_text",
                "caslib":"casuser"
           },
           importOptions={"fileType":"DELIMITED",
                          "delimiter":"|",
                          "getNames":False,
                          "vars":colNames}
          )
 
csv = conn.CASTable("cars_text", caslib="casuser")
csv.head()

Summary

The loadTable action is flexible and an easy way to load tables into memory. Benefits of the loadTable action include:

  • it's filetype agnostic
  • has many options available based on the file types being imported into memory
  • available in any CAS-compatible language like Python, R and CASL with slight changes to the syntax based on the language used

Getting Started with Python Integration to SAS® Viya® - Part 5 - Loading Server-Side Files into Memory was published on SAS Users.
