Nicolas Robert

April 13, 2018

The release of SAS Viya 3.3 has brought some nice data quality features. In addition to the visual applications like Data Studio or Data Explorer that are part of the Data Preparation offering, one can leverage data quality capabilities from a programming perspective.

For the time being, SAS Viya provides two ways to programmatically perform data quality processing on CAS data:

  • The Data Step Data Quality functions.
  • The profile CAS action.

To use Data Quality programming capabilities in CAS, a Data Quality license is required (or a Data Preparation license which includes Data Quality).

Data Step Data Quality functions

The Data Quality functions currently supported in CAS are listed here and below:

SAS Data Quality 3.3 programming capabilities

They cover casing, parsing, field extraction, gender analysis, identification analysis, match codes, and standardization capabilities.

As of now, they are only available in the CAS Data Step. You can’t use them in DS2 or in FedSQL.

To run in CAS, certain conditions must be met. These include:

  • Both the input and output data must be CAS tables.
  • All language elements must be supported in the CAS Data Step.
  • Other requirements for running the Data Step in CAS, as described in the documentation.

Let’s look at an example:

cas mysession sessopts=(caslib="casuser") ;
libname casuser cas caslib="casuser" ;
data casuser.baseball2 ;
   length gender $1 mcName parsedValue tokenNames lastName firstName varchar(100) ;
   set casuser.baseball ; /* assumes the baseball table was loaded into the casuser caslib */
   gender=dqGender(name,'NAME','ENUSA') ;
   mcName=dqMatch(name,'NAME',95,'ENUSA') ;   
   parsedValue=dqParse(name,'NAME','ENUSA') ;
   tokenNames=dqParseInfoGet('NAME','ENUSA') ;
   if _n_=1 then put tokenNames= ;
   lastName=dqParseTokenGet(parsedValue,'Family Name','NAME','ENUSA') ;
   firstName=dqParseTokenGet(parsedValue,'Given Name','NAME','ENUSA') ;
run ;

Here, my input and output tables are CAS tables, and I’m using CAS-enabled statements and functions. So, this will run in CAS, in multiple threads, in massively parallel mode across all my CAS workers on in-memory data. You can confirm this by looking for the following message in the log:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

I’m doing simple data quality processing here:

  • Determine the gender of an individual based on his or her name, with the dqGender function.
  • Create a match code for the name for a later deduplication, with the dqMatch function.
  • Parse the name using the dqParse function.
  • Identify the name of the tokens produced by the parsing function, with the dqParseInfoGet function.
  • Write the token names to the log; the tokens for this definition are:
    Prefix, Given Name, Middle Name, Family Name, Suffix, Title/Additional Info
  • Extract the “Family Name” token from the parsed value, using dqParseTokenGet.
  • Extract the “Given Name” token from the parsed value, again using dqParseTokenGet.
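The match codes created above are meant for a later deduplication; a minimal sketch of that step, reusing the session and table from the example (the FedSQL step itself is an assumption, not part of the original post):

```sas
/* Group rows by match code to find potential duplicate names.
   Assumes the casuser.baseball2 table created by the Data Step above. */
proc fedsql sessref=mysession;
   create table casuser.name_clusters as
   select mcName, count(*) as cluster_size
   from casuser.baseball2
   group by mcName
   having count(*) > 1;
quit;
```

Rows sharing the same match code are candidates for merging or for survivorship rules.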

I get the following table as a result:

Performing this kind of data quality processing on huge tables in memory and in parallel is simply awesome!

The dataDiscovery.profile CAS action

This CAS action enables you to profile a CAS table:

  • It offers two algorithms: one is faster but uses more memory.
  • It offers multiple options to control your profiling job:
    • Columns to be profiled.
    • Number of distinct values to be profiled (high-cardinality columns).
    • Number of distinct values/outliers to report.
  • It provides identity analysis using regular expressions.
  • It outputs the results to another CAS table.

The resulting table is a transposed table of all the metrics for all the columns. This table requires some post-processing to be analyzed properly.


proc cas; 
   dataDiscovery.profile /
      algorithm="PRIMARY"                    /* PRIMARY=best performance, SECONDARY=less memory */
      table={caslib="casuser" name="product_dim"}
      columns={"Product","ProductBrand","ProductLine"}  /* column names are illustrative */
      cutoff=25                              /* illustrative value */
      frequencies=10                         /* illustrative value */
      casOut={caslib="casuser" name="product_dim_profiled" replace=true} ;
quit ;

In this example, you can see:

  • How to specify the profiling algorithm (quite simple: PRIMARY=best performance, SECONDARY=less memory).
  • How to specify the input table and the columns you want to profile.
  • How to reduce the number of distinct values to process using the cutoff option (it prevents excessive memory use for high-cardinality columns, but might show incomplete results).
  • How to reduce the number of distinct values reported using the frequencies option.
  • How to specify where to store the results (casout).

So, the result is not a report but a table.
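Since the result is a CAS table, you can inspect or post-process it with ordinary SAS steps; a minimal sketch (the table and caslib names follow the earlier example):

```sas
/* Assign a CAS libref and peek at the first rows of the profiling output. */
libname casuser cas caslib="casuser";
proc print data=casuser.product_dim_profiled(obs=20);
run;
```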

The RowId column needs to be matched with

A few comments/cautions on this results table:

  • DoubleValue, DecSextValue, or IntegerValue fields can appear in the output table if numeric fields have been profiled.
  • DecSextValue can contain the mean (metric #1008), median (#1009), standard deviation (#1022), and standard error (#1023) when a numeric column is profiled.
  • It could also contain frequency distributions, maximum, minimum, and mode if the source column were of the DecSext data type, which is not possible yet.
  • DecSext is a 192-bit fixed-decimal data type that is not yet supported in CAS and is consequently converted to a double most of the time. Also, SAS Studio cannot correctly render the new CAS data types. As of today, those metrics might not be very reliable.
  • Also, some percentage calculations might be rounded because the Count field is an integer.
  • The legend for metric 1001 is not documented. Here it is:

8: INT32
9: INT64
10: TIME

A last word on the profile CAS action: it can help you perform identity analysis using patterns defined as regular expressions (this does not use the QKB).

Here is an example:

proc cas; 
   dataDiscovery.profile /
      table={caslib="casuser" name="customers"}
      identities={
         /* the USPHONE pattern was garbled by HTML markup in the original post;
            this is a plausible reconstruction */
         {pattern="PAT=^\(?999[\)- ]? ?999[- ]9999$",type="USPHONE"}, 
         {pattern="PAT=^99999[- ]9999$",type="ZIP4"}, 
         {pattern="PAT=^99999$",type="ZIP"}, 
         {pattern="[^ @]+@[^ @]+\.[A-Z]{2,4}",type="EMAIL"}, 
         {pattern="^(?i:A[LKZR]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])$",type="STATE"}
      }
      casOut={caslib="casuser" name="customers_profiled" replace="true"} ;
quit ;

In this example that comes from

I hope this post has been helpful.

Thanks for reading.

An overview of SAS Data Quality 3.3 programming capabilities was published on SAS Users.

October 19, 2017

The goal of this article is to describe the steps needed to configure the bridge that allows SAS Data Integration 4.902, based on SAS 9.4M4, to load data directly into CAS on SAS Viya 3.2.

Of course, SAS 9.4M5 simplifies this process, as a SAS/CONNECT session is no longer required: SAS Data Integration Studio can then load data into CAS directly. But for those of you who may not move immediately from SAS 9.4M4 to SAS 9.4M5, this could be helpful.

It is assumed here that SAS/CONNECT has been installed and configured on both environments, SAS 9.4M4 and SAS Viya 3.2.

Validate the connection from SAS 9.4M4 to SAS Viya 3.2

⇒     Check the status of the SAS/CONNECT Spawner on SAS Viya, on the machine where this service is installed.

SAS Viya 3.2 Bridge for SAS Data Integration Studio

⇒     Note the machine and the listening port of the SAS/CONNECT Spawner on SAS Viya.
⇒     Open SAS Studio on SAS 9.4M4 and sign-in.
⇒     Run the following SAS code with your machine details and a valid SAS Viya user account and check the results.

SAS Viya 3.2 Bridge for SAS Data Integration Studio
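The code in the screenshot is not reproduced here; a minimal validation sketch might look like the following (host, port, user name, and password are all placeholders):

```sas
/* Sign on through the SAS/CONNECT Spawner running on the SAS Viya machine.
   Host name, port, and credentials are placeholders. */
options comamid=tcp;
%let viyaconn=viya.example.com 17551;
signon viyaconn username="viyauser" password="myViyaPassword";
rsubmit viyaconn;
   /* Prove that CAS is reachable from the SAS Viya side */
   cas mysession;
   caslib _all_ assign;
   cas mysession terminate;
endrsubmit;
```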

⇒     If successful, sign off the SAS/CONNECT session and sign out from SAS Studio on SAS 9.4M4.

Set up SAS 9.4M4 metadata

⇒     Open SAS Management Console 9.4M4 as sasadm@saspw.
⇒     On the “Server Manager” plugin, add a “New Server…”

  • Server type: “SAS Application Server”
  • Name: “SASViya”
  • Select “Connect Server” as the sub-type of server you want to add to this SAS Application Server
  • Configure the “Connect Server” as shown below (you might have to create a new authentication domain for SAS Viya) and set the values accordingly (server where the SAS/CONNECT Spawner on SAS Viya is listening)

⇒     On the “Server Manager” plugin, add a “New Server…”

  • Server type: “SAS Cloud Analytic Services Server”
  • Name: “CAS Server”
  • Configure the “CAS Server” as shown below and set the values accordingly (CAS controller)

⇒     On the “User Manager” plugin, set a login for the SASViya application server, on a user or group that you will use in SAS Data Integration Studio

⇒     On the “Data Library Manager” plugin, add a “New Library…”

  • Library type: “SAS Cloud Analytic Services Library”
  • Name: CAS_DATA
  • Assign the library to the SASViya server

⇒     Configure the CAS library as shown below and set the values accordingly (the CASLIB must exist in the SAS Viya environment; here CASPATH is the name of an existing CASLIB).

⇒     Specify the server and connection information as shown below:

Build a SAS Data Integration Studio job to load data into CAS

⇒     Open SAS Data Integration Studio 4.902 as a user who holds, directly or through a group, a login for the ViyaAuth authentication domain.
⇒     Test the CAS_DATA library by “Register(ing) tables…”

  • In SAS Environment Manager on SAS Viya, some tables must have been loaded beforehand into the CASLIB (the one pointed to by the CAS library, here CASPATH), so that the “Register Tables…” wizard can display them.
  • If you see some tables then it looks like you are all set.

⇒     If you want to go further and test the “Cloud Analytic Services Transfer” DI transformation, create the metadata for a target table in CAS with appropriate columns.

⇒     Build a job that loads a source table to this target table using the “Cloud Analytic Services Transfer” (“Access” folder) as shown below:

⇒     The “Cloud Analytic Services Transfer” is basically a “Data Transfer” transformation that fits with CAS; it enables you to provide some CAS options such as COPIES; by default the table is PROMOTED.

⇒     Run it and check that it ran successfully and that the table is available from the SAS Viya world.

SAS/CONNECT on SAS Viya configuration directory

⇒     The SAS/CONNECT on SAS Viya configuration is located by default here /opt/sas/viya/config/etc/connectserver/default

⇒     You might want to customize/clean some of the files here.

  • Check the CASHOST option in the autoexec files; sometimes the value is not appropriate.

Normally, options here are sourced from the CONNECTSERVER_CONFIGURATION in vars.yml during the deployment of SAS Viya.
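For reference, the CAS connection options in those autoexec files are ordinary SAS system options; a sketch (host and port values are placeholders):

```sas
/* CAS connection options as they might appear in the SAS/CONNECT
   server autoexec. Host name and port are placeholders. */
options cashost="cas-controller.example.com" casport=5570;
```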

SAS Viya 3.2 Bridge for SAS Data Integration Studio 4.902 (on SAS 9.4M4) configuration steps was published on SAS Users.

October 5, 2016

SAS Event Stream Processing (ESP) can not only process structured streaming events (a collection of fields) in real time, but also has very advanced features for the collection and analysis of unstructured events. Twitter is one of the most well-known social network applications and probably the first that comes to mind when thinking about streaming data sources. On the other hand, SAS has powerful solutions to analyze unstructured data with SAS Text Analytics. This post is about merging these two needs: collecting unstructured data coming from Twitter and doing some text analytics processing on tweets (contextual extraction, content categorization, and sentiment analysis).

Before moving forward: SAS ESP is based on a publish-and-subscribe model. Events are injected into an ESP model using an “adapter” or a “connector,” or using Python and the publisher API. Target applications consume enriched events output by ESP using the same technologies, “adapters” and “connectors.” SAS ESP provides lots of them, in order to integrate with both static and dynamic applications.

Then, an ESP model flow is composed of “windows,” which correspond to the types of transformation we want to perform on streaming events. These can be basic data management (join, compute, filter, aggregate, etc.) as well as advanced processing (data quality, pattern detection, streaming analytics, etc.).

SAS ESP Twitter Adapters background

SAS ESP 4.2 provides two adapters to connect to Twitter as a data source and to publish events from Twitter (one event per tweet) to a running ESP model. There are no equivalent connectors for Twitter.

Both adapters are publisher-only:

  • Twitter Publisher Adapter
  • Twitter Gnip Publisher Adapter

The second one is more advanced, using a different API (Gnip, acquired by Twitter) and providing additional capabilities (access to the history of tweets) and better performance. The adapter builds event blocks from a Twitter Gnip firehose stream and publishes them to a source window. Access to this Twitter stream is restricted to Twitter-approved parties and requires a signed agreement.

In this article, we will focus on the first adapter, which consumes Twitter streams and injects event blocks into source windows of an ESP engine. This adapter can be used free of charge: the default access level of a Twitter account allows us to use the following methods:

  • Sample: Starts listening on random sample of all public statuses.
  • Filter: Starts consuming public statuses that match one or more filter predicates.

SAS ESP Text Analytics background

SAS ESP 4.1/4.2 provides three window types (event transformation nodes) to perform Text Analytics in real time on incoming events.

The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.


Here are the SAS ESP Text Analytics features:

  • “Text Category” window:
    • Content categorization or document classification into topics
    • Automatically identify or extract content that matches predefined criteria to more easily search by, report on, and model/segment by important themes
    • Relies on “.mco” binary files coming from SAS Contextual Analysis solution
  • “Text Context” window:
    • Contextual extraction of named entities (people, titles, locations, dates, companies, etc.) or facts of interest
    • Relies on “.li” binary files coming from SAS Contextual Analysis solution
  • “Text Sentiment” window:
    • Sentiment analysis of text coming from documents, social networks, emails, etc.
    • Classify documents and specific attributes/features as having positive, negative, or neutral/mixed tone
    • Relies on “.sam” binary files coming from SAS Sentiment Analysis solution

Binary files (“.mco”, “.li”, “.sam”) cannot be reverse engineered. To modify them, you must go back to the original projects in the corresponding solutions (SAS Contextual Analysis or SAS Sentiment Analysis).

The ESP project

The following ESP project aims to:

  • Wait for events coming from Twitter in the source Twitter window (this is a source window, the only entry point for streaming events)
  • Perform basic event processing and counting
  • Perform text analytics on tweets (in the input stream, the tweet text is injected as a single field)


Let’s have a look at potential text analytics results.

Here is a sample of the Twitter stream that SAS ESP is able to catch (the tweet text is collected in a field called tw_Text):


The “Text Category” window, with an associated “.mco” file, is able to classify tweets into topics/categories with a related score:


The “Text Context” window, with an associated “.li” file, is able to extract terms and their corresponding entity (person, location, currency, etc.) from a tweet:


The “Text Sentiment” window, with an associated “.sam” file, is able to determine a sentiment with a probability from a tweet:


Run the Twitter adapter

In order to inject events into the running ESP model, the Twitter adapter must be started; it will then publish live tweets into the sourceTwitter window of our model.


Here we search for tweets containing “iphone,” but you can change this to any keyword you want to track (assuming people are tweeting about that keyword…).

There are many additional options: -f allows you to follow specific user IDs, -p allows you to specify locations of interest, etc.

Consume enriched events with SAS ESP Streamviewer

SAS ESP provides a way to render events in real-time graphically. Here is an example of how to consume real-time events in a powerful dashboard.



With SAS ESP, you can bring the power of SAS Analytics into the real-time world. Performing text analytics (content categorization, sentiment analysis, reputation management, etc.) on the fly on text coming from tweets, documents, emails, and so on, and triggering relevant actions as a result, has never been so simple and so fast.

tags: SAS Event Stream Processing, SAS Professional Services, SAS Text Analytics, twitter

How to perform real time Text Analytics on Twitter streaming data in SAS ESP was published on SAS Users.

July 26, 2016

SAS Business Data Network is SAS’ solution for managing glossaries of common business terms. This is part of the SAS Data Governance offering as well as bundled with Advanced versions of all SAS Data Management bundles. One thing that is important regarding Data Governance in general, and this solution in particular, is the ability to search for terms. SAS Business Data Network provides a web search that is sometimes overlooked, but can be used in different ways to help you derive meaning and context from your data. Let’s take a look.

1 – Use a URL to get to the main web page

The main URL that brings you to the SAS Business Data Network Search page is http://<host>/SASBusinessDataNetwork/search

Conduct a web search for business terms with SAS Business Data Network

The search box allows you to search business terms using a specific string. The string is processed as a whole; there are no AND or OR operators. Specifying two words generates a search for those two words consecutively. The following fields are searched: Name, Description, Requirements, Attributes, Links, Associated Items, and Notes. The search is case insensitive, and you have to sign in if you are not already logged on.

Conduct a google-like web search for business terms with SAS Business Data Network2

After a search, you can choose to open one particular term, or all of the matching terms, in SAS Business Data Network.

2 – Customize the URL and call it from external applications

You can also parameterize the search URL and then integrate it with, or call it from, any other application.

For example, the URL http://<host>/SASBusinessDataNetwork/search?q=gender will search for the string “gender” in all business terms.

The URL http://<host>/SASBusinessDataNetwork/search?q=currency+code will search for “currency code”.

The same search results page will open.

Conduct a google-like web search for business terms with SAS Business Data Network3

This is very useful when you want to seamlessly integrate SAS Business Data Network with other SAS applications or external applications in order to provide more insights on the sense of an indicator or the terminology of a business noun. Within SAS Visual Analytics, we can easily define such parameterized URLs to help the business user understand the meaning of a measure for example. See the end for an example.
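When building these URLs programmatically, the search string should be URL-encoded; here is a small SAS sketch (the host name is a placeholder):

```sas
/* Build a parameterized BDN search URL from a free-text term.
   The host name is a placeholder. */
%let term=currency code;
%let url=http://myhost/SASBusinessDataNetwork/search?q=%sysfunc(urlencode(&term));
%put NOTE: &=url;
```

The URLENCODE function encodes the space as %20, which is equivalent to the + shown in the example above.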

3 – Use the bookmarklet to search for highlighted text

You can use the bookmarklet to interact with the BDN search capability in a smart way: it enables you to search BDN for any text highlighted in your browser.

Conduct a google-like web search for business terms with SAS Business Data Network01

That means you can call the SAS BDN search from any HTML page in the company. A user browsing an intranet article can call BDN for more information on a term, or open a business report and call BDN for more information on a specific value.

To install the bookmarklet, you need to follow instructions described in the SAS Business Data Network Search extras.

Conduct a google-like web search for business terms with SAS Business Data Network4

4 – Use the browser search provider

You can register SAS Business Data Network Search as a search provider with your browser. This allows the user to search for strings from a dedicated zone in the browser toolbar.

Conduct a google-like web search for business terms with SAS Business Data Network5

To install the search provider, you need to follow instructions described in the SAS Business Data Network Search extras.

A final example

I talked earlier about integrating a Visual Analytics report with BDN in order to provide explanations of a specific data item: What is the meaning of that data item? How is it calculated? Who owns it? Has there been a recent change in the formula? What is its lineage? These are questions that BDN can help to answer. See the example treemap from SAS Visual Analytics below.

Conduct a google-like web search for business terms with SAS Business Data Network02

An external link to BDN search is easily configurable in SAS Visual Analytics.

Conduct a google-like web search for business terms with SAS Business Data Network6

BDN also supports deep linking to go directly to a term, without going through search results. This capability requires additional data from the calling SAS Visual Analytics solution.

These are just a few examples of some of the incredible power that SAS Business Data Network gives you to relate and search for business and technical metadata to improve discovery and increase the sharing and reuse of your data assets.

Thanks for your attention and I hope you found this post helpful.

tags: data governance, data management, SAS Professional Services, SAS Visual Analytics

Four ways to conduct a web search for business terms with SAS Business Data Network was published on SAS Users.

March 8, 2016

DataFlux Data Quality profiling metrics in SAS Visual Analytics

In this post, I continue the journey of getting data profiling results into SAS Visual Analytics. In my first blog post, I described the process of collecting DataFlux Data Quality profiling metrics to load a datamart. Now we load this datamart into memory (LASR) and then plug a VA report on top, to share data quality results widely and wisely.

Overall Process

The process is a 2-step process.

1.     Extract data from DataFlux repository and build a DQ profiling datamart (discussed in blog 1)

2.     Load the DQ profiling datamart into memory and plug VA reports on it

Here is the master DataFlux Process Job (called _loadProfilingInVA) that orchestrates the 2 steps.

Data Quality profiling11

The DQ profiling datamart is made of five tables, which are SAS data sets. The goal is to load them into LASR. Since we did the first step using a DataFlux process job, we will use another process job for this step, so that we have a consistent process that we can orchestrate in the master DataFlux process job seen above.

Loading data into LASR is very simple, especially if we assume that these tables were loaded manually a first time in SAS Visual Analytics Administrator, in order to create all the necessary metadata (tables, jobs, etc.). If this is not the case, it’s not a big deal to do it automatically using PROC METALIB. Once this initial step is done, we just want to refresh these tables in memory each time we have new data profiling results to share. That means we have to unload the old data from memory and then load the new data into memory.
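For the automatic registration mentioned above, a PROC METALIB sketch could look like this (the metadata library name and repository name are assumptions; the table names are the five datamart tables described in part 1):

```sas
/* Register (or update) the five datamart tables in SAS metadata.
   The library name and repository name are placeholders. */
proc metalib;
   omr (library="DQ Profiling Datamart" repname="Foundation");
   select (PFO_FLD PFO_FRQDST PFO_OUTLR PFO_PATDST PFO_PTILE);
run;
```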

Even though we want to do all of that from DataFlux Data Management Studio, we need a SAS engine somewhere to execute SAS statements. DataFlux DMS does not provide nodes to load data directly into LASR, but it does provide nodes (available in process jobs, not in data jobs) to execute SAS code against an existing SAS Workspace Server.

Data Quality profiling12

My technical environment is quite simple, with everything on the same box: DataFlux Data Management Studio, LASR, VA, the SAS Workspace Server, the DQ profiling datamart SAS data sets, etc. No risk! But many SAS environments are not that simple, and the following questions might arise:

  • Where are my DQ profiling datamart SAS data sets? They are on the same box as DataFlux Data Management Studio/Server, which is probably not the same box as the SAS Workspace Server.
  • Is my SAS Workspace Server able to read those remote tables? A data transfer is probably needed.
  • Is my SAS Workspace Server able to load data into LASR? I know there are limitations when the SAS Workspace Server is not located on the same box as the LASR master node, but I won’t go into details; some resources are already available to deal with this.

Back to my basic mono-box configuration. Here is an example of how to setup the SAS Workspace Server to be called from DataFlux Data Management Studio/Server:
Data Quality profiling13

Here is the SAS code to unload data from memory:

Data Quality profiling14
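The screenshot aside, the unload step boils down to dropping the five tables from the LASR library; a sketch (host, port, and tag are placeholders):

```sas
/* Drop the five old tables from the LASR server before reloading. */
libname valasr sasiola host="myserver.example.com" port=10011 tag="dqmart";
proc datasets lib=valasr nolist nowarn;
   delete pfo_fld pfo_frqdst pfo_outlr pfo_patdst pfo_ptile;
quit;
```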

And here is the code to load data into memory:

Data Quality profiling15
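And the load step is a set of plain DATA steps copying the datamart tables into the LASR library; a sketch (the path, host, port, and tag are placeholders):

```sas
/* Reload the five new tables into LASR from the on-disk datamart. */
libname dqmart "/opt/dataflux/dqmart";
libname valasr sasiola host="myserver.example.com" port=10011 tag="dqmart";
data valasr.pfo_fld;    set dqmart.pfo_fld;    run;
data valasr.pfo_frqdst; set dqmart.pfo_frqdst; run;
data valasr.pfo_outlr;  set dqmart.pfo_outlr;  run;
data valasr.pfo_patdst; set dqmart.pfo_patdst; run;
data valasr.pfo_ptile;  set dqmart.pfo_ptile;  run;
```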

Now I’m done with the data preparation phase. I’ve executed my master DataFlux process job (called _loadProfilingInVA), which:

  • Extracts data profiling results from the DataFlux repository
  • Creates a set of 5 output tables
  • Unloads the 5 old tables from LASR
  • Reloads the 5 new tables into LASR

I’m ready for the reporting phase, and I have prepared a VA report with multiple sections that leverage this DQ profiling datamart. Let’s have a look at what it looks like:

An amazing welcome screen

Data Quality profiling16

An overview of all the profiling metrics from all the profile jobs runs

Data Quality profiling17

Details of profiling metrics by table for a specific run

Data Quality profiling18

An historical view of profiling metrics

Data Quality profiling19

A view of value and pattern frequency distribution

Data Quality profiling20

A view of outliers and percentiles

Data Quality profiling21

tags: data management, SAS Professional Services, SAS Visual Analytics

Data Governance Series: Share DataFlux Data Quality profiling metrics in SAS Visual Analytics (Part 2) was published on SAS Users.

February 12, 2016

Share DataFlux Data Quality profiling metrics in SAS Visual Analytics

This is my first blog post, and the first of a long series around Data Governance. The first thing I want to discuss is the ability to share DataFlux Data Quality profiling metrics in SAS Visual Analytics.

This post will illustrate how to extract profiling data from the DataFlux repository and how to use SAS Visual Analytics to distribute profiling results centrally and widely through an organization.

To enable data profiling visualization in SAS Visual Analytics, you need SAS Visual Analytics: either the full version, in addition to any solution containing DataFlux Data Management Studio or Server, or the VAAR (Visual Analytics Administration & Reporting) component that is included, for example, in SAS Data Governance, SAS Data Quality Advanced, SAS Data Management Advanced, and SAS MDM Advanced.

Overall process

The process is a 2-step process:

  • Extract data from DataFlux repository and build a data quality profiling datamart
  • Load the data quality profiling datamart into memory and plug VA reports on it

Here is the master DataFlux Process Job that orchestrates the 2 steps.

profiling processes results in SAS Visual Analytics

So, let’s start. The first thing to do is to export the data profiling results from the DataFlux repository to a set of tables that will be the base of our reports. SAS Data Management Studio does not provide special nodes to easily query the repository for profiling results, as it does for monitoring results. Instead, we will use SQL queries to extract the data required to build the necessary reports.

Repository Data Model

To go further, we need to know a bit more about the DataFlux Repository physical data model. There are (at least) two ways to describe it:

First way (preferred): in the Data Management Studio Options window (Tools -> Data Management Studio Options -> General -> Save Repository Schema…)

Data Quality profiling2

The resulting file is a very useful text file that describes all repository profile and monitor tables, and their relations.

Data Quality profiling3

Second way (more technical): in the Data Management Studio New Repository Definition window (Administration -> Repository Definitions -> New Repository Definition… -> Database Connection -> Save Repository DDL…)

Data Quality profiling4

NB: you can cancel the creation of the repository once you get your DDL file.

The resulting file is an SQL script that enables an administrator to create the repository directly in the target DBMS. It describes all the repository tables but might be hard to read.

Data Quality profiling5

Now that you know more about how to go further with the repository data model, we can move to the data quality profiling datamart build.

1 – Extract data from DataFlux repository and build a data quality profiling datamart

This phase is done using a DataFlux Process Job that combines multiples DataFlux Data Jobs. Here are the steps:

  • Get common information and field metrics from a join between the following repository tables:
    • DSN (Records each ODBC DSN in the job),
    • FLD (Records metrics for each field in a table),
    • TBL (Records each table that is processed),
    • JOB (Records each run of a profile job),
    • REPORT (Records Profile reports),
    • JSETUP (Job setup root table)
  • Get frequency distributions from a dynamic select of FRQDST<xxx> (Frequency distribution tables; 1 table per field per execution)
  • Get pattern distributions from PATDST (Pattern frequency distribution)
  • Get outliers from OUTLR (Min/Max Outliers)
  • Get percentiles from PTILE (Percentiles)

Data Quality profiling6

NB: The other steps (Table/Field notes, Alerts, Business Rules, Custom Metrics) are not implemented right now in my data quality profiling jobs and VA reports but might be used in the future.

As a result, we’ve got 5 tables that shape our data quality Profiling datamart:

  • PFO_FLD: master table containing summary results for all fields for all executions
  • PFO_FRQDST: frequency distributions for all fields for all executions
  • PFO_OUTLR: outliers for all fields for all executions
  • PFO_PATDST: pattern frequency distributions for all fields for all executions
  • PFO_PTILE: percentiles for all fields for all executions

NB: It was not possible to model only 1 table for all needs because of multiple 1-N relations. This is not a problem because VA is able to build single reports based on multiple data sources.

The first 2 nodes of this process job prepare the environment:

  • Define Variables Here: Initialization of macro variables (where to store the datamart; number of values to keep for frequency distributions and pattern frequency distributions)
  • FILE Storage Management: adapts the DSN properties, even if the repository is a database file repository (a file with an .rps extension)

This process job is “self-repository” based; that is to say, it extracts data quality profiling results from the repository in which it is running. You just have to put the job in whatever repository you want to extract results from. Database file repositories are supported.

This process job works only in REPLACE mode: the data quality profiling datamart is overwritten each time it runs; there is no append or delta mode. All of the data quality profiling results need to be kept in the original DataFlux repository.

Next time I’ll talk about the second part: load the data quality profiling datamart into memory and use SAS Visual Analytics to share data quality profiling results wisely.

Concluding note:

Querying the profile repository, while useful, must be done with the understanding that the profile repository schema can change (though this is rare).

tags: data management, SAS Professional Services

Data Governance Series: Share DataFlux Data Quality profiling metrics in SAS Visual Analytics was published on SAS Users.