SAS Viya

September 15, 2021
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. I've broken the series into logical, consumable parts. If you'd like to start by learning a little more about what CAS Actions are, please see CAS Actions and Action Sets - a brief intro. Or if you'd like to see other topics in the series, see the overview page. Otherwise, let's dive into exploring your data by viewing the number of distinct and missing values that exist in each column using the simple.distinct CAS action.

In this example, I will use the CAS procedure to execute the distinct action. Be aware, instead of using the CAS procedure, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language. Refer to the documentation for syntax from other languages.

Determine the Number of Distinct and Missing Values in a CAS Table

To begin, let's use the simple.distinct CAS action on the CARS in-memory table to view the action's default behavior.

proc cas;
    simple.distinct /
        table={name="cars", caslib="casuser"};
quit;

In the preceding code, I specify the CAS procedure, the action, then reference the in-memory table. The results of the call are displayed below.

The results allow us to quickly explore the CAS table and see the number of distinct and missing values. That's great, but what if you only want to see specific columns?

Specify the Columns in the Distinct Action

Sometimes your CAS tables contain hundreds of columns, but you are only interested in a select few. With the distinct action, you can specify a subset of columns using the inputs parameter. Here I'll specify the Make, Origin and Type columns.

proc cas;
    simple.distinct /
        table={name="cars", caslib="casuser"},
        inputs={"Make","Origin","Type"};
quit;

After executing the code the results return the information for only the Make, Origin and Type columns.

Next, let's explore what we can do with the results.

Create a CAS Table with the Results

Some actions allow you to create a CAS table with the results. You might want to do this for a variety of reasons, such as using the new CAS table in a SAS Visual Analytics dashboard or in a data visualization procedure like SGPLOT.

To create a CAS table with the distinct action result, add the casOut parameter and specify new CAS table information, like name and caslib.

proc cas;
    simple.distinct /
        table={name="cars", caslib="casuser"},
        casOut={name="distinctCars", caslib="casuser"};
quit;

After executing the code, the action returns information about the name and caslib of the new CAS table, and the number of rows and columns.
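To verify the contents of the new table, a quick call to the table.fetch action works well. This is a minimal sketch; it assumes the distinctCars table was created in the casuser caslib as shown above.

proc cas;
    /* Preview the first five rows of the table created by casOut */
    table.fetch / table={name="distinctCars", caslib="casuser"}, to=5;
quit;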

Visualize the Number of Distinct Values in Every Column

Lastly, what if you want to create a data visualization to better explore the table? Maybe you want to visualize the number of distinct values for each column? This task can be accomplished with a variety of methods. However, since I know my newly created distinctCars CAS table has only 15 rows, I'll reference the CAS table directly using the SGPLOT procedure.

This method works as long as the LIBNAME statement references your caslib correctly. I recommend this method when you know the CAS table is a manageable size. This is important because the CAS server does not execute the SGPLOT procedure on a distributed CAS table. The CAS server instead transfers the entire CAS table back to the client for processing.
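If you are not sure how large the table is, you can check before plotting. The sketch below uses the simple.numRows action to return the row count of the in-memory table; it assumes the distinctCars table from the previous step.

proc cas;
    /* Confirm the CAS table is a manageable size before
       transferring it to the client for plotting */
    simple.numRows result=r / table={name="distinctCars", caslib="casuser"};
    print r.numrows;
quit;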

To begin, the following LIBNAME statement will reference the casuser caslib.

libname casuser cas caslib="casuser";

Once the LIBNAME statement is correct, all you need to do is specify the CAS table in the DATA option of the SGPLOT procedure.

title justify=left height=14pt "Number of Distinct Values for Each Column in the CARS Table";
proc sgplot data=casuser.distinctCars
            noborder nowall;
    vbar Column / 
        response=NDistinct
        categoryorder=respdesc
        nooutline
        fillattrs=(color=cx0379cd);
    yaxis display=(NOLABEL);
    xaxis display=(NOLABEL);
run;

The results show a bar chart with the number of distinct values for each column.

Summary

The simple.distinct CAS action is an easy way to explore a distributed CAS table. With one simple action, you can easily see how many distinct values are in each column, and how many values are missing!

In Part 2 of this post, I'll further explore the simple.distinct CAS action and offer more ideas on how to interpret and use the results.

Additional Resources

distinct CAS action
SAS® Cloud Analytic Services: Fundamentals
Plotting a Cloud Analytic Services (CAS) In-Memory Table
Getting started with SGPLOT - Index
Code

CAS-Action! Simply Distinct - Part 1 was published on SAS Users.

September 10, 2021
 

As its thousands of users know, SAS Analytics Pro consists of three core elements of the SAS system: Base SAS®, SAS/GRAPH® and SAS/STAT®. It provides the fundamental capabilities of data handling, data visualization, and statistical analysis either through coding or through the SAS Studio interface. For many years, SAS Analytics Pro has been deployed on-site as the entry-level workhorse to the SAS system.

Now, SAS Analytics Pro includes a new option for containerized cloud-native deployment. In addition, the containerized option comes with the full selection of SAS/ACCESS engines making it even easier to work with data from virtually any source. For organizations considering the move to the cloud, or those already there, SAS Analytics Pro provides an exciting new option for cloud deployment.

What is SAS Analytics Pro?

SAS Analytics Pro is an easy-to-use, yet powerful package for accessing, manipulating, analyzing and presenting information. It lets organizations improve productivity with all the tools and methods needed for desktop data analysis – in one package.

  • Organizations can get analysis, reporting and easy-to-understand visualizations from one vendor. Rather than having to piece together niche software packages from different vendors, this consolidated portfolio reduces the cost of licensing, maintenance, training and support – while ensuring that consistent information is available across your enterprise.
  • Innovative statistical techniques are provided with procedures constantly being updated to reflect the latest advances in methodology. Organizations around the world rely on SAS to provide accurate answers to data questions, along with unsurpassed technical support.
  • SAS software integrates into virtually any computing environment, unifying your computing efforts to get a single view of your data, and freeing analysts to focus on analysis rather than data issues.
  • Easily build analytical-style graphs, maps and charts with virtually any style of output that is needed, so you can deliver analytic results where they’re needed most.

Why should data scientists care about cloud and containers?

With the new containerized cloud-native deployment option for SAS Analytics Pro, a natural question arises: why should data scientists care about cloud and containers? This question was addressed in a SAS Users blog post by SAS R&D director Brent Laster.

This post characterizes a container as a “self-contained environment with all the programs, configuration, initial data, and other supporting pieces to run applications.” The nice thing for data scientists is that this environment can be treated as a stand-alone unit, to turn on and run at any time – sort of a “portable machine.” It provides a complete virtual system configured to run your targeted application.

Using popular container runtime environments (e.g., Docker), containers can be an efficient way for individual users to deploy and manage software applications. This is especially useful for applications like SAS Analytics Pro, which participates in SAS Viya’s continuous delivery approach, releasing updates on a regular basis.

For large, IT-managed environments, containers call for an orchestration layer to simplify deployment and management of dynamic workloads. The most prominent orchestrator today is Kubernetes, which automates key needs around containers – including deployment, scaling, scheduling, healing, and monitoring – so the data scientist doesn’t have to.

The combination of containers and cloud environments provides an evolutionary jump in the infrastructure and runtime environments where data scientists run their applications. And this gives them a similar jump in being able to provide the business value their customers demand. Containerized and cloud-native deployment of SAS Analytics Pro provides the automatic optimization of resources and the automatic management of workloads that your organization needs to be competitive.

Note that existing customers can continue programming in SAS in a small footprint environment while availing themselves of the SAS container-based continuous delivery process. And if you aren’t already a SAS customer, cloud deployment gives you one more good reason to start letting SAS Analytics Pro deliver value to your organization.

Learn more

Website: SAS® Analytics Pro
Training: SAS® Global Certification Program

On-Premises Documentation:

Cloud-Native Documentation:

SAS Analytics Pro now available for on-site or containerized cloud-native deployment was published on SAS Users.

September 1, 2021
 

Analytics and Artificial Intelligence (AI) are changing the way we interact with the world around us – increasing productivity and improving the way we make decisions. SAS and Microsoft are partnering to inspire greater trust and confidence in every decision by driving innovation and delivering proven AI in the cloud.

In this demo, see how intelligent decisioning and machine learning from SAS and Microsoft help Contoso Bank – a fictitious banking customer – simplify and reduce risk in its home loan portfolio.

Let’s get started.

Part 1: Data and Discovery

Organizations can run faster and smarter by enabling employees to uncover insights. See how SAS and Microsoft help Contoso Bank gain new insight into its portfolio by bringing together data management, analytics and AI capabilities with seamless integration into the Azure data estate.

Key Product Features:
• Use built-in Power BI tools like smart narratives and sentiment analysis to quickly analyze structured and unstructured data.
• Connect your SAS Viya and Microsoft Azure environments with single sign-on via Azure Active Directory.
• Catalog your datasets across SAS and Microsoft in SAS Information Catalog for a holistic view of your data environment.
• Integrate data from Azure Synapse Analytics and other Azure data sources into a combined dataset in SAS Data Studio.
• No-code intelligence features in SAS Visual Analytics explain analytic outputs in natural language.

Part 2: Model and Deploy

AI has the potential to transform organizations. See how SAS and Microsoft enable Contoso Bank to quickly build and operationalize predictive models by bringing together SAS Viya advanced analytics and AI capabilities with Azure Machine Learning.

Key Product Features:
• Bring models from SAS Visual Analytics into SAS Model Studio as a candidate for production use.
• Create automatically generated pipelines in SAS Model Studio to select the best features for modeling.
• Register models built in open-source Jupyter notebooks within Azure Machine Learning into SAS Model Manager.
• Publish models from SAS Model Manager in Azure Machine Learning to be deployed in the Microsoft ecosystem.
• Schedule SAS Model Manager to monitor model drift in the SAS or Microsoft ecosystem to identify the right time to retrain models.

Part 3: Automate and Monitor

Building a data-driven organization means increasing productivity with the necessary insights and tools. See how SAS and Microsoft can help Contoso Bank rapidly operationalize the analytics and AI capabilities of SAS Viya through Power Apps and Power Automate to help employees make better decisions.

Key Product Features:
• Build decision flows in SAS Intelligent Decisioning to make calculated decisions at speed.
• Use AI Builder in Power Platform to extract and process information in Power Platform.
• Access SAS Intelligent Decisioning’s decision access engine in low-code applications by using Power Apps to ingest data and receive decisioning outputs.
• Connect to SAS Intelligent Decisioning from Power Apps and Power Automate with the SAS Decisioning connector.
• Embed Power Apps in Microsoft Teams or access via a mobile friendly web app.

To learn more about how SAS Viya integrates with Microsoft, check out our white paper SAS and Microsoft: Shaping the future of AI and analytics in the cloud.

Transforming Your Business With SAS® Viya® on Microsoft Azure was published on SAS Users.

August 28, 2021
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. I've broken the series into logical, consumable parts. If you'd like to start by learning a little more about what CAS Actions are, please see CAS Actions and Action Sets - a brief intro. Or if you'd like to see other topics in the series, see the overview page. Otherwise, let's dig in with the table.columnInfo action.

In this post we'll look at exploring the column attributes of a CAS table. Knowing the column names, data types, and any additional formats or labels associated with the columns makes it easier to work with the data. One way to see this type of information on a CAS table is to use the table.columnInfo CAS action!

In this example, I will use the CAS procedure to execute the columnInfo action. Be aware, instead of using the CAS procedure, I could execute the same action with Python, R and more with some slight changes to the syntax for the specific language.

Return CAS table Column Information

To view the column information of your CAS table, use the columnInfo CAS action with the table parameter. That's it! Refer to the code below.

proc cas;
    table.columnInfo / table={name="cars", caslib="casuser"};
quit;

The results would appear similar to the following:

table.columnInfo CAS action results

and return a variety of information about each column in the CAS table:

  • the column name
  • a label if one has been applied
  • the Id, which indicates the position of the column in the table
  • the data type of the column
  • the column length
  • the format, formatted length, width and decimal
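As a quick illustration of how this information can be used programmatically, the result table can be traversed in the CAS language. This is a hedged sketch; it assumes the result table contains Column and Type fields as shown in the output above, and that character columns carry a type value of "varchar" or "char".

proc cas;
    table.columnInfo result=ci / table={name="cars", caslib="casuser"};
    /* Loop over each row of the result table and
       print the column name with its data type */
    do row over ci.ColumnInfo;
        print row.Column || " (" || row.Type || ")";
    end;
quit;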

Create a CSV Data Dictionary Using the ColumnInfo Action

What if instead, you want to create a data dictionary documenting the CAS table? With the columnInfo action you can export the results to a CSV file!

I'll use the columnInfo action again, but this time I'll store the results in a variable. The variable is a dictionary, so I need to reference the dictionary ci, then the ColumnInfo key to access the table. Next, I'll create two computed columns using the compute operator. The first column contains the name of the CAS table, and the second, the caslib of the table. I'll print the new table to confirm the results.

proc cas;
    table.columnInfo result=ci / table={name="cars", caslib="casuser"};
 
    ciTbl = ci.ColumnInfo.compute({"TableName","Table Name"}, "cars")
                         .compute({"Caslib"}, "casuser");
 
    print ciTbl;
    ...

The code produces the following result:

The new result table documents the CAS table columns, table name and caslib.

Lastly, I'll export the result table to a CSV file. First, I'll specify the folder location using the outpath variable. Then use the SAVERESULT statement to save the result table as a CSV file named carsDataDictionary.csv.

    ...
    outpath="specify a folder location";
    saveresult ciTbl csv=outpath || "carsDataDictionary.csv";
quit;

 

After I execute the CAS procedure I can find and open the CSV file to view the documented CAS table!

Summary

The table.columnInfo CAS action is a simple and easy way to show column information about your distributed CAS table. Using the results of the action allows you to create a data dictionary in a variety of formats.

Additional resources

table.columnInfo CAS action
CAS Action! - a series on fundamentals
SAS® Cloud Analytic Services: Fundamentals
CASL Result Tables
SAVERESULT Statement
Code

CAS-Action! Show me the ColumnInfo! was published on SAS Users.

August 24, 2021
 

Interested in how the COVID-19 vaccine has impacted the world around you? This SAS Viya powered data visualization shows information related to the COVID-19 vaccination efforts in the United States. Here's what you can learn from this dashboard. Click on any photo below to explore the full dashboard. Percentage of vaccinated [...]

Visualizing COVID-19 vaccine data using SAS® Viya® was published on SAS Voices by Caslee Sims

August 20, 2021
 

This article was co-written by Marinela Profi, Product Marketing Manager for AI, Data Science and Open-Source. Check out her blog profile for more information.

Artificial Intelligence (AI) is changing the way people and organizations improve decision-making and move about their lives – from text translation, to chatbots and predictive analytics. However, many organizations are struggling to realize its potential as model deployment processes remain disconnected, creating unforeseen headaches and manual work. Additionally, other requirements like performance monitoring, retraining, and integration into core business processes must be streamlined for optimal teamwork and resource usage.

SAS and Microsoft are partnering to inspire greater trust and confidence in every decision, by driving innovation and delivering proven AI in the cloud. With a combined product roadmap, SAS and Microsoft are working tirelessly to improve offerings and connectivity between SAS Viya and Microsoft Azure environments across industries. That’s why we are especially excited to announce SAS Viya users can now publish SAS and open-source models in Azure Machine Learning.

The SAS and Microsoft team built a tightly integrated connection between SAS Model Manager and Azure Machine Learning to register, validate, and deploy SAS and open-source models to Azure Machine Learning with just a few clicks. From there, data scientists can enrich their applications with SAS or open-source models within their Azure environment.

This integration will enable users to:

1) Extend SAS models stored in SAS Model Manager into the Azure Machine Learning registry, offering more opportunities for collaboration across the enterprise.

2) Deploy SAS and open-source models from SAS Model Manager to Azure Machine Learning on the same Azure Kubernetes cluster you have already set up in Azure Machine Learning. Before deploying the model, you can validate the model and ensure it meets your criteria.

3) Seamlessly connect your SAS Viya and Microsoft environments without the hassle of verifying multiple licenses with single sign-on authentication via Azure Active Directory (Azure AD).

Get started

Step 1: To get started, use Azure AD for simplified SAS Viya access.

Step 2: SAS Model Manager governs, deploys, and monitors all types of SAS and open-source models (e.g., Python, R). On the home page, you can see the projects you and your team are working on in addition to “What’s new” and “How to” videos with the latest updates.

Step 3: Compare different models to identify the most accurate “champion model.” Deploy the model throughout the Microsoft ecosystem from cloud to edge with customizable runtimes, centralized monitoring, and management capabilities.

Step 4: Using the provided artifacts, Azure Machine Learning creates executable containers supporting SAS and open-source models. You can use the endpoints created through model deployment for the scoring of the data.

Step 5: Schedule SAS Model Manager to detect model drift and automatically retrain models in case of poor performance or bias detection.

Discover more

If you want to know more about SAS Model Manager and our partnership with Microsoft, check out the resources below:

“What’s New with SAS Model Manager” article series to find out the latest and greatest updates.
SAS Viya on Azure to solve 100 different use cases on Data for Good, industries and startups.

Let us know what you think!

We would love to hear what you think about this new experience and how we can improve it. If you have any feedback for the team, please share your thoughts and ideas in the comments section below.

Deploying SAS and open-source models to Azure Machine Learning has never been easier was published on SAS Users.

August 19, 2021
 

In Part 1 of my series fetch CAS, fetch!, I executed the fetch CAS action to return rows from a CAS table. That was great, but what can you do with the results? Maybe you want to create a visualization that includes the top five cars by MSRP for all Toyota vehicles? How can we accomplish this task? We'll cover this question and provide several other examples in this post.

Save the results of a CAS action as a SAS data set

First, execute the table.fetch CAS action on the CARS in-memory table to filter for Toyota cars, return the Make, Model and MSRP columns, and sort the results by MSRP. Then save the results of the action in a variable using the result option. The results of an action return a dictionary to the client. The fetch action returns a dictionary with a single key, and the result table as the value. In this example, I'll name the variable toyota.

proc cas;
    table.fetch result=toyota / 
          table={name="cars", caslib="casuser",
                 where="Make='Toyota'",
                 vars={"Make","Model","MSRP"}
          },
          sortBy={
                 {name="MSRP", order="DESCENDING"}
          },
          index=FALSE,
          to=5;
...

After executing the code, the results of the action are stored in the variable toyota and not shown in the output.

Next, use the SAVERESULT statement to save the result table stored in the toyota variable. Since the variable is a dictionary, specify the variable name toyota, a dot, then the fetch key. This will access the result table from the dictionary. Finally, specify the DATAOUT= option with the name of the SAS data set to create.

proc cas;
    table.fetch result=toyota / 
          table={name="cars", caslib="casuser",
                 where="Make='Toyota'",
                 vars={"Make","Model","MSRP"}
          },
          sortBy={
                 {name="MSRP", order="DESCENDING"}
          },
          index=FALSE,
          to=5;
 
     saveresult toyota.fetch dataout=work.top5;
quit;

After executing the code, the result table is saved as a SAS data set. The SAS data set is named top5 and saved to the WORK library.
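To confirm what was saved, a quick PROC PRINT on the new data set does the job:

/* Print the five rows saved from the fetch action */
proc print data=work.top5 noobs;
run;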

 

 

Wondering what else can we do? Let's take a look.

Visualize the SAS data set

Now that the result table is saved as a SAS data set, you can use the SGPLOT procedure to create a bar chart! Consider the code below.

title justify=left height=14pt "Top 5 Toyota Cars by MSRP";
proc sgplot data=work.top5
         noborder nowall;
    vbar Model / 
          response=MSRP 
          categoryorder=respdesc
          nooutline
          fillattrs=(color=cx0379cd);
    label MSRP="MSRP";
run;
title;

There it is! We processed our data in CAS using the fetch action, returned a smaller subset of results back to the client, then used traditional SAS programming techniques on the smaller table. This method works similarly in other languages like Python and R, where you can then use the language's native visualization packages!

You can now use your imagination on what else to do with the raw data from the CARS table or from the top5 results table we produced with the table.fetch action. Feel free to get creative.

Summary

CAS actions are optimized to run in a distributed environment on extremely large tables. Your CAS table can contain millions or even billions of rows. Since the data in CAS can be extremely large, the goal is to process and subset the table on the CAS server, then return a smaller amount of data to the client for additional processing, visualization or modeling.

Additional resources

fetch Action
SAVERESULT Statement
SAS® Cloud Analytic Services: Fundamentals
Plotting a Cloud Analytic Services (CAS) In-Memory Table
Getting started with SGPLOT - Index
Code used in this post

CAS-Action! fetch CAS, fetch! - Part 2 was published on SAS Users.

August 16, 2021
 

In Part I of this blog post, I provided an overview of the approach my team and I took tackling the problem of classifying diverse, messy documents at scale. I shared the details of how we chose to preprocess the data and how we created features from documents of interest using LITI rules in SAS Visual Text Analytics (VTA) on SAS Viya.

In this second part of the post, I'll discuss:

  • scoring VTA concepts and post-processing the output to prepare data for predictive modeling
  • building predictive models using SAS Visual Data Mining and Machine Learning
  • assessing the results

Putting VTA Features to Work

Recall from Part I, I used the image in Figure 1 to illustrate the type of documents our team was tasked to automatically classify.

Figure 1. Sample document

Figure 1. Sample document

I then demonstrated a method for extracting features from such documents, using the Language Interpretation Text Interpretation (LITI) rules. I used the “Color of Hair” feature as an example. Once the concepts for a decent number of features were developed, we were ready to apply them to the corpus of training documents to build an analytic table for use in building a document classification model.

The process of applying a VTA concept model to data is called scoring. To score your documents, you first need to download the model score code. This is a straightforward task: run the Concepts node to make sure the most up-to-date model is compiled, and then right-click on the node and select “Download score code.”

Figure 2. Exporting concept model score code

Figure 2. Exporting concept model score code

You will see four files inside the downloaded ZIP file. Why four files? There are two ways you can score your data: with the concepts analytics store (astore) or with the concepts model. The first two files you see in Figure 2 score the data using the astore method, and the third and fourth files correspond to scoring using the concept .li binary model file.

If you are scoring in the same environment where you built the model, you don’t need the .li file, as it is available in the system and is referenced inside the ScoreCode.sas program. This is the reason I prefer to use the second method to score my data – it allows me to skip the step of uploading the model file. For production purposes though, the model file will need to be copied to the appropriate location so the score code can find it. For both development and production, you will need to modify the macro variables to match the location and structure of your input table, as well as the destination for your result files.

Figure 3. Partial VTA score code – map the macro variables in curly brackets, except those in green boxes

Figure 3. Partial VTA score code – map the macro variables in curly brackets, except those in green boxes

Once all the macro variables have been mapped, run the applyConcept CAS action inside PROC CAS to obtain the results.

Figure 4. Partial VTA score code

Figure 4. Partial VTA score code

Depending on the types of rules you used, the model scoring results will be populated in the OUT_CONCEPTS and/or OUT_FACTS tables. Facts are extracted with sequence and predicate rule types that are explained in detail in the book I mentioned in Part I. Since my example didn’t use these rule types, I am going to ignore the OUT_FACTS table. In fact, I could modify the score code to prevent it from outputting this table altogether.

Looking at the OUT_CONCEPTS table, you can see that scoring returns one row per match. If you have multiple concepts matching the same string or parts of the same string, you will see one row for each match. For troubleshooting purposes, I tend to output both the target and supporting concepts, but for production purposes, it is more efficient to set the behavior of supporting concepts to “Supporting” inside the VTA Concepts node, which will prevent them from showing up in the scored output (see Figure 5). The sample output in Table 1 shows matches for target concepts only.

Table 1. OUT_CONCEPTS output example

Table 1. OUT_CONCEPTS output example

Figure 5. Setting concept behavior to supporting

Figure 5. Setting concept behavior to supporting

Completing Feature Engineering

To complete feature creation for use in a classification model, we need to do two things:

  1. Convert extracted concepts into binary flags that indicate if a feature is present on the page.
  2. Create a feature that will sum up the number of total text-based features observed on the page.

To create binary flags, you need to do three things: deduplicate results by concept name within each record, transpose the table to convert rows into columns, and replace missing values with zeros. The last step is important because you should expect your pages to contain a varying number of features. In fact, you should expect non-target pages to have few, if any, features extracted. The code snippet below accomplishes all three tasks with DATA steps and the TRANSPOSE procedure.

/*Deduplicate OUT_CONCEPTS. Add a dummy variable to use in transpose later*/
data CASUSER.OUT_CONCEPTS_DEDUP(keep=ID _concept_ dummy);
	set CASUSER.OUT_CONCEPTS;
	dummy=1;
	by ID _concept_;
	if first._concept_ then output;
	run;
 
/*Transpose data*/
proc transpose data=CASUSER.OUT_CONCEPTS_DEDUP out=CASUSER.OUT_CONCEPTS_TRANSPOSE;
	by ID;
	id _concept_;
	run;
 
/*Replace missing with zeros and sum features*/
data CASUSER.FOR_VDMML(drop=i _NAME_);
	set CASUSER.OUT_CONCEPTS_TRANSPOSE;
	array feature (*) _numeric_;
	do i=1 to dim(feature);
		if feature(i)=. then feature(i)=0;
	end;
	Num_Features=sum(of T_:);
	run;
 
/*Merge features with labels and partition indicator*/
data CASUSER.FOR_VDMML;
merge CASUSER.FOR_VDMML 
	  CASUSER.LABELED_PARTITIONED_DATA(keep= ID target partition_ind);
by ID;
run;

To create an aggregate feature, we simply added up all features to get a tally of extracted features per page. The assumption is that if we defined the features well, a high total count of hits should be a good indicator that a page belongs to a particular class.

The very last task is to merge your document type label and the partitioning column with the feature dataset.

Figure 6 shows the layout of the table ready for training the model.

Figure 6. Training table layout

Figure 6. Training table layout

Training Classification Model

After all the work to prepare the data, training a classification model with SAS Visual Data Mining and Machine Learning (VDMML) was a breeze. VDMML comes prepackaged with numerous best practices pipeline templates designed to speed things up for the modeler. We used the advanced template for classification target (see Figure 7) and found that it performed the best even after multiple attempts to improve performance. VDMML is a wonderful tool which makes model training easy and straightforward. For details on how to get started, see this free e-book or check out the tutorials on the SAS Support site.

Figure 7. Example of a VDMML Model Studio pipeline

Figure 7. Example of a VDMML Model Studio pipeline

Model Assessment

Since we were dealing with rare targets, we chose the F-score as our assessment statistic. F-score is the harmonic mean of precision and recall and provides a more realistic model assessment score than, for example, a misclassification rate. You can specify F-score as your model selection criterion in the properties of a VDMML Model Studio project.
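For reference, the F-score (F1) combines precision and recall as 2*(precision*recall)/(precision+recall). A small DATA step illustrates the calculation; the precision and recall values below are made up for illustration, not taken from the project.

/* Sketch: F1 as the harmonic mean of precision and recall */
data _null_;
	precision = 0.90;   /* hypothetical precision */
	recall    = 0.85;   /* hypothetical recall    */
	f1 = 2 * (precision * recall) / (precision + recall);
	put "F1 score: " f1 5.3;
run;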

Depending on the document type, our team was able to achieve an F-score of 85%-95%, which was phenomenal, considering the quality of the data. At the end of the day, incorrectly classified documents were mostly those whose quality was dismal – pages too dark, too faint, or too dirty to be adequately processed with OCR.

Conclusion

So there you have it: a simple, but effective and transparent approach to classifying difficult unstructured data. We preprocessed data with Microsoft OCR technologies, built features using SAS VTA text analytics tools, and created a document classification model with SAS VDMML. Hope you enjoyed learning about our approach – please leave a comment to let me know your thoughts!

Acknowledgement

I would like to thank Dr. Greg Massey for reviewing this blog series.

Classifying messy documents: A common-sense approach (Part II) was published on SAS Users.

August 12, 2021
 

Welcome to my SAS Users blog series CAS Action! - a series on fundamentals. I've broken the series into logical, consumable parts. If you'd like to start by learning a little more about what CAS Actions are, please see CAS Actions and Action Sets - a brief intro. Or if you'd like to see other topics in the series, see the overview page. Otherwise, let's dig in with our first topic - the fetch action.

The table.fetch CAS action retrieves the first 20 rows of a CAS distributed in-memory table. It's similar to using the head method in R or Python. However, the fetch action can do more than fetch rows from a table. In this example, I will use the CAS procedure to execute the fetch action. Be aware, instead of using the CAS procedure, I could execute the same action with Python, R and other languages with some slight changes to the syntax. Pretty cool!

Retrieve the first 20 rows of a CAS table using the fetch action

To begin, let's use the table.fetch CAS action on the cars in-memory table to view the action's default behavior. Consider the following code:

proc cas; 
    table.fetch / table={name="cars", caslib="casuser"}; 
quit;

To use the fetch action with the CAS language, you specify the CAS procedure, the action, a forward slash, then the table parameter to specify the CAS table. In the table parameter I'll use the name sub-parameter to specify the cars in-memory table, and the caslib sub-parameter to specify the casuser caslib. The result of the call is listed below.

We retrieved 20 rows from our distributed cars table. What else can we fetch?
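For comparison, the head method mentioned above behaves much like fetch on a local pandas DataFrame (a sketch with made-up data, not the full cars table):

```python
import pandas as pd

# A tiny stand-in for the cars table
cars = pd.DataFrame({"Make": ["Acura", "Audi", "BMW"],
                     "MSRP": [36945, 25940, 37800]})

# head(n) returns the first n rows, much like fetch's to parameter
print(cars.head(2)["Make"].tolist())  # ['Acura', 'Audi']
```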

Retrieve the first n rows of a CAS table

What if you don't want the first 20 rows? Maybe you only want the first 5. To modify the number of rows returned, use the to parameter and specify the number of rows. You may also have noticed an _Index_ column in the previous results; it appears by default. To remove it, add the index=FALSE parameter.

proc cas;
    table.fetch /
         table={name="cars", caslib="casuser"},
         to=5,
         index=FALSE;
quit;

The results return 5 rows from the cars table, and the _Index_ column has been removed.

Now, what if I want to sort the returned rows?

Sort the table

Since a CAS table is distributed among the CAS workers, the row order of the results is not guaranteed. To see the results in a precise order, use the sortBy parameter. The sortBy parameter requires an array of key-value pairs (dictionaries), so it's a bit tricky the first time you use it.

In this example, let's sort the table by Make and MSRP in descending order.

proc cas;
    table.fetch /
          table={name="cars", caslib="casuser"},
          sortBy={
                  {name="Make", order="DESCENDING"},
                  {name="MSRP", order="DESCENDING"}
          },
          index=FALSE;
quit;

The results show 20 rows of the cars table sorted by Make and MSRP. Great!
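The pandas equivalent of that sortBy call, for comparison (made-up data):

```python
import pandas as pd

cars = pd.DataFrame({"Make": ["Acura", "BMW", "Acura"],
                     "MSRP": [36945, 37800, 25940]})

# sort_values with ascending=False mirrors order="DESCENDING" on both keys
out = cars.sort_values(["Make", "MSRP"], ascending=[False, False])
print(out["MSRP"].tolist())  # [37800, 36945, 25940]
```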

Subset the table

What if I only want to see the cars where Make is Toyota, and return the columns Make, Model, MSRP and Invoice? You can add the where and vars sub-parameters in the table parameter to subset the table.

proc cas;
    table.fetch /
          table={name="cars", caslib="casuser",
                 where="Make='Toyota'",
                 vars={"Make","Model","MSRP","Invoice"}
           },
           to=5,
           index=FALSE;
quit;

The results return 5 rows from the cars table where Make is Toyota and only the four specified columns.
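The same subset expressed against a local pandas DataFrame might look like this (made-up rows):

```python
import pandas as pd

cars = pd.DataFrame({"Make":    ["Toyota", "Acura", "Toyota"],
                     "Model":   ["Camry", "TL", "Corolla"],
                     "MSRP":    [19560, 33195, 14085],
                     "Invoice": [17352, 30299, 13065]})

# where -> a boolean row filter; vars -> a column list
subset = cars.loc[cars["Make"] == "Toyota", ["Make", "Model", "MSRP", "Invoice"]]
print(subset["Model"].tolist())  # ['Camry', 'Corolla']
```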

Quick detour

Instead of using the vars sub-parameter as we did in the previous example, you can use the fetchVars parameter. The code would change to:

proc cas;
    table.fetch /
          table={name="cars", caslib="casuser",
                 where="Make='Toyota'"},
          fetchVars={"Make","Model","MSRP","Invoice"},
          to=5,
          index=FALSE;
quit;

Either method works and it's totally up to you.

Create a calculated column

Lastly, let's create a calculated column named MPG_Avg that holds the average of the city and highway miles per gallon for each car, then subset the results to cars with an MPG_Avg greater than 40. To create a calculated column, use the computedVarsProgram sub-parameter in the table parameter. You can then subset on the calculated column with the where sub-parameter.

proc cas;
    table.fetch /
          table={name="cars", caslib="casuser",
                 vars={"Make","Model","MPG_Avg"},
                 where="MPG_Avg > 40",
                 computedVarsProgram="MPG_Avg=mean(MPG_City,MPG_Highway)"
          },
    sortBy={
             {name="MPG_Avg", order="DESCENDING"}
    },
    index=FALSE;
quit;
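A pandas sketch of the same computed-column logic, for comparison (made-up data; the mean of the two MPG columns, filtered and sorted):

```python
import pandas as pd

cars = pd.DataFrame({"Make":        ["Toyota", "Honda"],
                     "Model":       ["Prius", "Civic"],
                     "MPG_City":    [59, 35],
                     "MPG_Highway": [51, 40]})

# computedVarsProgram equivalent: average of city and highway MPG
cars["MPG_Avg"] = cars[["MPG_City", "MPG_Highway"]].mean(axis=1)
high = (cars.loc[cars["MPG_Avg"] > 40, ["Make", "Model", "MPG_Avg"]]
            .sort_values("MPG_Avg", ascending=False))
print(high["Model"].tolist())  # ['Prius']
```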

Summary

CAS actions are optimized to run on the CAS server on large data. They are flexible, offer many parameters to control the output, and can be executed in a variety of languages! In the next post, I'll cover more on the fetch action.

Additional Resources

fetch Action documentation
SAS® Cloud Analytic Services: Fundamentals documentation
Code used in this post

CAS-Action! fetch CAS, fetch! - Part 1 was published on SAS Users.

August 11, 2021
 

Just over a year ago, SAS and Microsoft announced their strategic partnership. Since then, we have been working together to provide the best experience and value to our customers as they migrate to the cloud. Here are six milestones you should know about; each highlights the early success of the two industry giants. The start of [...]

One year in: Six reasons you should pay attention to the SAS and Microsoft partnership was published on SAS Voices by Nick Johnson