Clemens Knobloch

October 12, 2017
 

With SAS Data Management, you can set up SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

When a data issue is discovered it can be sent automatically or manually to a remediation queue where it can be corrected by designated users.

Let’s look at how to set up a remediation service and how to send issue records to Data Remediation.

Registering the remediation service.

To register a remediation service in SAS Data Remediation, we go to the Data Remediation Administrator and choose “Add New Client Application.”

Under Properties we supply an ID, which can be the name of the remediation service as long as it is unique, and a Display name, which is the name shown in the Remediation UI.

Under the Subject Area tab, we can register different subject categories for this remediation service. When calling the remediation service, we can categorize remediation issues by setting different subject areas. For example, we can use the subject area to point to different data quality dimensions like Completeness, Uniqueness, Validity, Accuracy, and Consistency.

Under the Issue Types tab, we can register issue categories. This enables us to categorize the different remediation issues. For example, we can point to the affected part of the record, such as Name, Address, or Phone Number.

At Task Templates/Select Templates we can set a workflow to be used for each issue type. You can design your own workflow using SAS Workflow Studio, or you can use a prepared workflow that comes with Data Remediation. You need to make sure that the desired workflow is loaded onto the Workflow Server in order to link it to the Data Remediation service. Workflows are not mandatory in SAS Data Remediation, but they improve the efficiency of the remediation process.

Saving the remediation service will make it available to be called.

Sending issues to Data Remediation.

When you process data and have identified issues that you want to send to Data Remediation, you can either call Data Remediation directly from the job where you process the data, or you can store the issue records in a table first and then, in a second step, create remediation records via a Data Management job.

To send records to Data Remediation, you can call the Remediation REST API from the HTTP Request node in a Data Management job.

Remediation REST API

The REST API expects a JSON structure supplying all required information:

{
	"application": "mandatory",
	"subjectArea": "mandatory",
	"name": "mandatory",
	"description": "",
	"userDefinedFieldLabels": {
		"1": "",
		"2": "",
		"3": ""
	},
	"topics": [{
		"url": "",
		"name": "",
		"userDefinedFields": {
			"1": "",
			"2": "",
			"3": ""
		},
		"key": "",
		"issues": [{
			"name": "mandatory",
			"importance": "",
			"note": "",
			"assignee": {
				"name": ""
			},
			"workflowName": "",
			"dueDate": "",
			"status": ""
		}]
	}]
}

 


In a Data Management job, you can create the JSON structure in an Expression node and use field substitution to pass in the necessary values from the issue records. The expression code could look like this:

REM_APPLICATION= "Customer Record"
REM_SUBJECT_AREA= "Completeness"
REM_PACKAGE_NAME= "Data Correction"
REM_PACKAGE_DESCRIPTION= "Mon-Result: " &formatdate(today(),"DD MM YY") 
REM_URL= "http://myserver/Sourcesys/#ID=" &record_id
REM_ITEM_NAME= "Mobile phone number missing"
REM_FIELDLABEL_1= "Source System"
REM_FIELD_1= "CRM"
REM_FIELDLABEL_2= "Record ID"
REM_FIELD_2= record_id
REM_FIELDLABEL_3= "-"
REM_FIELD_3= ""
REM_KEY= record_id
REM_ISSUE_NAME= "Phone Number"
REM_IMPORTANCE= "high"
REM_ISSUE_NOTE= "Violated data quality rule phone: 4711"
REM_ASSIGNEE= "Ben"
REM_WORKFLOW= "Customer Tag"
REM_DUE_DATE= "2018-11-01"
REM_STATUS= "open"
 
JSON_REQUEST= '
{
  "application":"' &REM_APPLICATION &'",
  "subjectArea":"' &REM_SUBJECT_AREA &'",
  "name":"' &REM_PACKAGE_NAME &'",
  "description":"' &REM_PACKAGE_DESCRIPTION &'",
  "userDefinedFieldLabels": {
    "1":"' &REM_FIELDLABEL_1 &'",
    "2":"' &REM_FIELDLABEL_2 &'",
    "3":"' &REM_FIELDLABEL_3 &'"
  },
  "topics": [{
    "url":"' &REM_URL &'",
    "name":"' &REM_ITEM_NAME &'",
    "userDefinedFields": {
      "1":"' &REM_FIELD_1 &'",
      "2":"' &REM_FIELD_2 &'",
      "3":"' &REM_FIELD_3 &'"
    },
    "key":"' &REM_KEY &'",
    "issues": [{
      "name":"' &REM_ISSUE_NAME &'",
      "importance":"' &REM_IMPORTANCE &'",
      "note":"' &REM_ISSUE_NOTE &'",
      "assignee": {
        "name":"' &REM_ASSIGNEE &'"
      },
      "workflowName":"' &REM_WORKFLOW &'",
      "dueDate":"' &REM_DUE_DATE &'",
      "status":"' &REM_STATUS &'"
    }]
  }]
}'

 

Tip: You could also write a global function to generate the JSON structure.

After creating the JSON structure, you can invoke the web service to create remediation records. In the HTTP Request node, you call the web service as follows:

  • Address: http://[server]:[port]/SASDataRemediation/rest/groups
  • Method: post
  • Input Field: The variable containing the JSON structure, i.e. JSON_REQUEST.
  • Output Field: A field to take the output from the web service. You can use the New button to create a field and set its size to 1000.
  • Under Security… you can set a defined user and password to access Data Remediation.
  • In the HTTP Request node’s advanced settings, set the WSCP_HTTP_CONTENT_TYPE option to application/json.
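Outside of a Data Management job, you can make the same call from any REST client. Below is a minimal sketch using Python’s requests library; the server name, port, credentials, and payload values are placeholders that mirror the example above.

import requests

# Minimal sketch: post a remediation package to the Data Remediation REST API.
# Server, port and credentials are placeholders; the payload mirrors the
# JSON_REQUEST structure built in the Expression node above.
payload = {
    "application": "Customer Record",
    "subjectArea": "Completeness",
    "name": "Data Correction",
    "topics": [{
        "name": "Mobile phone number missing",
        "key": "4711",
        "issues": [{"name": "Phone Number", "importance": "high", "status": "open"}]
    }]
}

response = requests.post(
    "http://myserver:8080/SASDataRemediation/rest/groups",
    json=payload,                          # sends Content-Type: application/json
    auth=("remediation_user", "password")  # the user defined under Security
)
print(response.status_code, response.text)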

 

 

 

You can now execute the Data Management job to create the remediation records in SAS Data Remediation.

Improving data quality through SAS Data Remediation was published on SAS Users.

June 21, 2017
 

IT organizations today are constantly challenged to do more with less. Reusing data processing jobs and employing best practices in monitoring the health of your data are proven ways to improve the productivity of data professionals. DataFlux Data Management Studio is a component of both the SAS Data Quality and SAS Data Management offerings that allows you to create data processing jobs to integrate, cleanse and monitor your data.

You can write global functions for SAS Data Management jobs that can be reused in any expression in the system, in either data or process flow jobs. Global functions can be called from expression nodes, monitor rules, profile filters, surviving record indicators, process flow if-nodes and more.

Global functions are defined in a text file and saved in the Data Management installation directory under “etc/udf” in Data Management Studio or Data Management Server respectively.

Each global function must have a unique name, is wrapped in a function / end function block, can process any number of input parameters, and returns a single value of type integer, real, date, string or boolean.

Hello World

For a start, let’s create a “hello world” function.

  • If it does not already exist, create the “udf” folder in the installation directory under “etc” (for both DM Studio and DM Server).
  • In “etc/udf” create a file named hello_world.txt.
  • In the hello_world file create the function as follows:
function hello_world return string
return "hello world!"
end function
  • Save the file and restart DM Studio, if necessary, in order to use hello_world().

The new function is fully integrated in Data Management Studio. You can see the new function in an expression node under Function->Other or as expression language help in the expression node editor.

Handling Parameters

Global functions can handle any number of parameters. Parameter helper functions are available to access input parameters inside a function:

  • parametercount() returns the number of parameters that have been passed into the function call. This is helpful if the number of incoming parameters is unknown.
integer i
for i = 1 to parametercount() 
begin
   // process each parameter
end

 

  • parametertype(integer) returns the type of the parameter for the given parameter position. The first parameter is 1. The return value will either be integer, real, date, string or Boolean.
  • parameterstring(integer), parameterinteger(integer), parameterboolean(integer), parameterdate(integer), parameterreal(integer) return the value of the parameter at the specified position, or null if the parameter doesn’t exist. You can use these functions if you know the incoming parameter type at a given position.
  • parameter(integer) returns the value of the parameter as specified by position, or null if the parameter doesn’t exist. If you don’t know the incoming parameter type you can use this function. Note: Using the parameter() function may require additional overhead to coerce the values to the field type. Using the specific data type parameter functions above will eliminate the cost of coercion.

Global Function Example

This global function will check if password rules are followed.

//////////////////////////////////////////////////////////////////////////
// Function:     check_psw_rule
// Inputs:       string
// Output:       boolean -> true == passed check; false == failed check
// Description:  Check the rules for password. The rules are:
//               Need to be at least 8 characters long
//               Need to have at least one lower case character
//               Need to have at least one upper case character
//               Need to have at least one number
//////////////////////////////////////////////////////////////////////////
function check_psw_rule return boolean
	string check_str
	boolean rc
	regex r
 
	check_str= parameterstring(1)   //copy input parameter to variable
 
	rc= false                       //set default return value to failed (false)
	if(len(check_str) < 8)          //check if at least 8 characters
		return rc
	r.compile("[a-z]")              
	if (!r.findfirst(check_str))    //check if at least one lower case character
		return rc
	r.compile("[A-Z]")
	if (!r.findfirst(check_str))    //check if at least one upper case character
		return rc
	r.compile("[0-9]")
	if (!r.findfirst(check_str))    //check if at least one number
		return rc
	rc= true                        //return true if all checks passed
	return rc
end function

 

This function can be called from any expression in a Data Management job:

boolean  check_result
check_result= check_psw_rule(password)

Global functions can also call other global functions

Just a few things to be aware of: there is a late-binding process, which means that if function B() wants to call function A(), then function A() needs to be loaded first. The files that global functions are stored in are loaded alphabetically by file name, so the file containing function A() has to sort alphabetically before the file containing function B(). For example, a file named a_helpers.txt defining A() is loaded before a file named b_checks.txt defining B().

Best Practices

Here are some best practice tips which will help you to be most successful writing global functions:

  1. Create one file per expression function.
    This allows for global functions to easily be deployed and shared.
  2. Use lots of comments.
    Describe the function’s purpose, expected parameters, and outputs to improve the readability and reusability of your code.
  3. Test the expressions in data jobs first.
    Write a global function body as an expression first and test it via preview. This makes it easier to find typos and syntax errors and to ensure that the code does what you want it to do.
  4. Debugging - If the global function is not loading, check the platform_date.log.  For Studio, this could for example be found under: C:\Users\<your_id>\AppData\Roaming\DataFlux\DMStudio\studio1

You now have a taste of how to create reusable functions in Data Management Studio to help you both improve the quality of your data and improve the productivity of your data professionals. Good luck, and please let us know what kind of jobs you are using to help your organization succeed.

Writing your own functions in SAS Data Quality using Dataflux Data Management Studio was published on SAS Users.

November 2, 2016
 

With DataFlux Data Management 2.7, the major component of SAS Data Quality and other SAS Data Management solutions, every job has a REST API automatically created once it is moved to the Data Management Server. This is a great feature and enables us to easily call Data Management jobs from programming languages like Python. We can then involve the Quality Knowledge Base (QKB), a pre-built set of data quality rules, and do other data quality work that is impossible or challenging to do when using only Python.

In order to make a RESTful call from Python, we first need to get the REST API information for our Data Management job. The best way to get this information is to go to the Data Management Server in your browser, where you’ll find respective links for:

  • Batch Jobs
  • Real-Time Data Jobs
  • Real-Time Process Jobs.

From here you can drill through to your job’s REST API.

Alternatively, you can use a “shortcut” to get the information by calling the job’s REST API metadata URL directly. The URL looks like this:

http://<DM Server>:<port>/<job type>/rest/jobFlowDefns/<job id>/metadata


The <job id> is simply the job name (with subdirectory and extension) Base64 encoded. This is a common method to avoid issues with illegal URL characters like: # % & * { } : < > ? / + or space. You can go to this website to Base64 encode your job name.
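Alternatively, a couple of lines of Python will do the encoding and build the metadata URL (a sketch, using the server, port and job type from the example below):

import base64

# Base64 encode the job name, including subdirectory and extension
job_name = "Demo/ParseAddress.ddf"
job_id = base64.b64encode(job_name.encode("utf-8")).decode("ascii")
print(job_id)  # RGVtby9QYXJzZUFkZHJlc3MuZGRm

# Build the metadata URL for a real-time data job
metadata_url = "http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/" + job_id + "/metadata"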

If you have many jobs on the Data Management Server it might be quicker to use the “shortcut” instead of drilling through from the top.

Here is an example of getting the REST API information for the Data Management job “ParseAddress.ddf,” which is in the Demo subdirectory of Real-Time Data Services on the DM Server:


We Base64 encode the job name “Demo/ParseAddress.ddf” using the website mentioned above…


…and call the URL for the job’s REST API metadata:

http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/RGVtby9QYXJzZUFkZHJlc3MuZGRm/metadata


From here we collect the following information:

The REST API URL and Content-Type information…

…the JSON structure for input data


…which we need in this format when calling the Data Management job from Python:

{"inputs" : {"dataTable" : {"data" : [[ "sample string" ],[ "another string" ]], "metadata" : [{"maxChars" : 255, "name" : "Address", "type" : "string"}]}}}

…and the JSON structure for the data returned by the Data Management job.


When you have this information, the Python code to call the Data Management job would look like this:

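A minimal sketch, assuming the job’s execution URL is the jobFlowDefns URL from the metadata page (without the /metadata suffix) and that no additional authentication is required:

import requests

# Minimal sketch: call the "Demo/ParseAddress.ddf" real-time data job via its REST API.
# Assumption: the execution URL is the jobFlowDefns URL without the /metadata suffix.
url = ("http://DMServer:21036/SASDataMgmtRTDataJob/rest/jobFlowDefns/"
       "RGVtby9QYXJzZUFkZHJlc3MuZGRm")
headers = {"Content-Type": "application/json", "Accept": "application/json"}

# Input data in the structure required by the job's REST API metadata
payload = {
    "inputs": {
        "dataTable": {
            "data": [["100 SAS Campus Drive, Cary, NC 27513"]],
            "metadata": [{"maxChars": 255, "name": "Address", "type": "string"}]
        }
    }
}

data_raw = requests.post(url, json=payload, headers=headers)

# Decode the JSON response into a dictionary and access the output table
data_out = data_raw.json()
for row in data_out["outputs"]["dataTable"]["data"]:
    print(row)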

The output data from the Data Management job will be in data_raw. We call the built-in JSON decoder from the “requests” module to move the output into a dictionary (data_out) from which we can access the data. The structure of the dictionary follows the REST metadata. We can access the relevant output data via data_out['outputs']['dataTable']['data'].


The Python program then prints the output rows returned by the Data Management job.

You can find more information about the DataFlux Data Management REST API here.

Calling Data Management jobs from Python is straightforward and is a convenient way to augment your Python code with the more robust set of data quality rules and capabilities found in the SAS Data Quality solution.

Learn more about SAS Data Quality.

tags: data management, DataFlux Data Management Studio, open source, REST API

Calling SAS Data Quality jobs from Python was published on SAS Users.

December 18, 2014
 
For a long time, master data management (MDM) practitioners boasted about their ability to build a 360° view of customers by aggregating and proactively managing information coming from various business applications such as CRM systems, ERP applications, and other operational systems. But was it really a 360° view? What about […]
December 12, 2014
 
Master Data and Big Data live in different places and are there for different reasons. Organizations try to build “360 Views of a Customer” using MDM. But who is feeding the MDM system? Usually these are the operational systems in an organization. Does this mean we really have a 360 […]