Data Management

April 13, 2018
 

The release of SAS Viya 3.3 has brought some nice data quality features. In addition to the visual applications like Data Studio or Data Explorer that are part of the Data Preparation offering, one can leverage data quality capabilities from a programming perspective.

For the time being, SAS Viya provides two ways to programmatically perform data quality processing on CAS data:

  • The Data Step Data Quality functions.
  • The profile CAS action.

To use Data Quality programming capabilities in CAS, a Data Quality license is required (or a Data Preparation license which includes Data Quality).

Data Step Data Quality functions

The Data Quality functions currently supported in CAS are listed below:

SAS Data Quality 3.3 programming capabilities

They cover casing, parsing, field extraction, gender analysis, identification analysis, match codes, and standardization capabilities.

For now, they are only available in the CAS DATA step. You can’t use them in DS2 or in FedSQL.

To run in CAS, certain conditions must be met. These include:

  • Both the input and output data must be CAS tables.
  • All language elements must be supported in the CAS Data Step.
  • Others.

Let’s look at an example:

cas mysession sessopts=(caslib="casuser") ;
 
libname casuser cas caslib="casuser" ;
 
data casuser.baseball2 ;
   length gender $1 mcName parsedValue tokenNames lastName firstName varchar(100) ;
   set casuser.baseball ;
   gender=dqGender(name,'NAME','ENUSA') ;
   mcName=dqMatch(name,'NAME',95,'ENUSA') ;   
   parsedValue=dqParse(name,'NAME','ENUSA') ;
   tokenNames=dqParseInfoGet('NAME','ENUSA') ;
   if _n_=1 then put tokenNames= ;
   lastName=dqParseTokenGet(parsedValue,'Family Name','NAME','ENUSA') ;
   firstName=dqParseTokenGet(parsedValue,'Given Name','NAME','ENUSA') ;
run ;

Here, my input and output tables are CAS tables, and I’m using CAS-enabled statements and functions. So, this will run in CAS, in multiple threads, in massively parallel mode across all my CAS workers, on in-memory data. You can confirm this by looking for the following messages in the log:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

I’m doing simple data quality processing here:

  • Determine the gender of an individual based on his or her name, with the dqGender function.
  • Create a match code for the name for a later deduplication, with the dqMatch function.
  • Parse the name using the dqParse function.
  • Retrieve the names of the tokens produced by the parse definition, with the dqParseInfoGet function.
  • Write the token names to the log; the tokens for this definition are:
    Prefix, Given Name, Middle Name, Family Name, Suffix, Title/Additional Info
  • Extract the “Family Name” token from the parsed value, using dqParseTokenGet.
  • Extract the “Given Name” token from the parsed value, again using dqParseTokenGet.

I get the following table as a result:

Performing this kind of data quality processing on huge tables in memory and in parallel is simply awesome!

The dataDiscovery.profile CAS action

This CAS action enables you to profile a CAS table:

  • It offers two algorithms; one is faster but uses more memory.
  • It offers multiple options to control your profiling job:
    • Columns to be profiled.
    • Number of distinct values to be profiled (high-cardinality columns).
    • Number of distinct values/outliers to report.
  • It provides identity analysis using regular-expression patterns.
  • It outputs the results to another CAS table.

The resulting table is a transposed table of all the metrics for all the columns. This table requires some post-processing to be analyzed properly.

Example:

proc cas; 
   dataDiscovery.profile /
      algorithm="PRIMARY"
      table={caslib="casuser" name="product_dim"}
      columns={"ProductBrand","ProductLine","Product","ProductDescription","ProductQuality"}
      cutoff=20
      frequencies=10
      outliers=5
      casOut={caslib="casuser" name="product_dim_profiled" replace=true}
   ;
quit ;

In this example, you can see:

  • How to specify the profiling algorithm (quite simple: PRIMARY=best performance, SECONDARY=less memory).
  • How to specify the input table and the columns you want to profile.
  • How to reduce the number of distinct values to process using the cutoff option (it prevents excessive memory use for high-cardinality columns, but might show incomplete results).
  • How to reduce the number of distinct values reported using the frequencies option.
  • How to specify where to store the results (casout).

So, the result is not a report but a table.

The RowId column needs to be matched against the metric identifiers documented for the profile action (for example, 1001, 1008, 1009) in order to interpret each row of the output.

A few comments/cautions on this results table:

  • DoubleValue, DecSextValue, or IntegerValue fields can appear in the output table if numeric fields have been profiled.
  • DecSextValue can contain the mean (metric #1008), median (#1009), standard deviation (#1022), and standard error (#1023) if a numeric column was profiled.
  • It can also contain frequency distributions, maximum, minimum, and mode if the source column is of the DecSext data type, which is not yet possible.
  • DecSext is a 192-bit fixed-decimal data type that is not yet supported in CAS and is consequently converted into a double most of the time. Also, SAS Studio cannot correctly render the new CAS data types. As of today, those metrics might not be very reliable.
  • Also, some percentage calculations might be rounded due to the use of integers in the Count field.
  • The legend for metric 1001 is not documented. Here it is:

1: CHAR
2: VARCHAR
3: DATE
4: DATETIME
5: DECQUAD
6: DECSEXT
7: DOUBLE
8: INT32
9: INT64
10: TIME
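
As a rough sketch of the post-processing mentioned earlier, you could filter the profile output by metric identifier. This assumes the metric identifier is carried in the RowId column and that results land in the Count, DoubleValue, DecSextValue, or IntegerValue columns; adjust the names to whatever your output table actually contains:

/* Minimal post-processing sketch (assumption: the metric identifier is in RowId). */
/* Keep only the metrics discussed above and print them.                           */
data work.profile_metrics;
   set casuser.product_dim_profiled;
   where RowId in (1001,        /* data type          */
                   1008, 1009,  /* mean, median       */
                   1022, 1023); /* std dev, std error */
run;

proc print data=work.profile_metrics;
run;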

A last word on the profile CAS action: it can help you perform identity analysis using patterns defined as regular expressions (this does not use the QKB).

Here is an example:

proc cas; 
   dataDiscovery.profile /
      table={caslib="casuser" name="customers"}
      identities={
         {pattern="PAT=\(?999[)-]? ?999[- ]9999",type="USPHONE"}, 
         {pattern= "PAT=^99999[- ]9999$",type="ZIP4"}, 
         {pattern= "PAT=^99999$",type="ZIP"}, 
         {pattern= "[^ @]+@[^ @]+\.[A-Z]{2,4}",type="EMAIL"}, 
         {pattern= "^(?i:A[LKZR]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])$",type="STATE"}
      }
      casOut={caslib="casuser" name="customers_profiled" replace="true"}
   ;
quit ;

This example shows how patterns for US phone numbers, ZIP codes, email addresses, and US state codes can be used to flag columns that contain those kinds of personal data.

I hope this post has been helpful.

Thanks for reading.

An overview of SAS Data Quality 3.3 programming capabilities was published on SAS Users.

April 6, 2018
 

In May of 2016, I expressed my personal excitement for the release of SAS 360 Discover to the SAS platform, and the subject of predictive marketing analytics. Since then, I have worked with brands across industries like sports, non-profits, and financial services to learn about valuable use cases. Before diving [...]

SAS 360 Discover: Enhance your marketing automation platform was published on Customer Intelligence Blog.

March 30, 2018
 

Multi Node Data Transfer

With SAS Viya 3.3, a new data transfer mechanism, MultiNode Data Transfer, has been introduced to transfer data between the data source and SAS Cloud Analytic Services (CAS), in addition to the Serial and Parallel data transfer modes. The new mechanism is an extension of the Serial Data Transfer mode. In MultiNode Data Transfer mode, each CAS Worker makes a concurrent connection to read and write data from the source DBMS or distributed data system.

In CAS, SAS Data Connectors are used for Serial mode and SAS Data Connect Accelerators are used for Parallel mode data transfer between CAS and a DBMS. The SAS Data Connector can also be used for the MultiNode data transfer mechanism. In a multi-node CAS environment where the Data Connector is installed on all nodes, it can take advantage of the environment and make concurrent data access connections from each CAS Worker to read and write data from the data source environment.

The CAS Controller controls the MultiNode data transfer. It directs each CAS Worker node on how to query the source data and obtain the needed data. The CAS Controller checks the source table for the first numeric column and uses its values to divide the table into slices with a MOD function of the number of CAS nodes specified. The higher the cardinality of the selected numeric column, the more evenly the data can be divided into slices. If CAS chooses a low-cardinality column, you could end up with poor data distribution on the CAS Worker nodes. The CAS Controller directs each CAS Worker to submit a query to obtain its slice of data. During this process, each CAS Worker makes an independent, concurrent request to the data source environment.

Data is transferred from the source environment to the CAS worker nodes directly using a single thread connection, bypassing the CAS Controller.

The following diagrams describe the data access from CAS to the data source environment using MultiNode Data Transfer mode. CAS is hosted on a multi-node environment with the SAS Data Connector installed on each node (CAS Controller and Workers). A CASLIB is defined with NUMREADNODES= and NUMWRITENODES= values other than 1. With each table access request, the CAS Controller scans the source table for the first numeric column and uses its values to prepare a query for each CAS Worker to run. Each CAS Worker node then submits an individual query to get its slice of the data, something like:

Select * from SourceTable where mod(NumericField, NUMREADNODES) = WorkerNodeNumber

The data moves from the DBMS gateway server to each CAS Worker node directly over a single-threaded connection, bypassing the CAS Controller. It’s a kind of parallel load using the serial mechanism, but it’s not a massively parallel data load. You can notice the bottleneck at the DBMS gateway server: the data transfer always passes through the gateway server to the CAS Worker nodes.

Multi Node Data Transfer

Prerequisites to enable MultiNode Data Transfer include:

  • The CAS environment is a multi-node environment (multiple CAS Worker Nodes).
  • The SAS Data Connector for the data source is installed on each CAS Worker and Controller node.
  • The data source client connection components are installed on each CAS Worker and Controller node.

By default, the SAS Data Connector uses Serial data transfer mode. To enable MultiNode Data Transfer mode, you must use the NUMREADNODES= and NUMWRITENODES= parameters in the CASLIB statement and specify a value other than 1. If the value is 0, CAS uses all available CAS Worker nodes. MultiNode Data Transfer mode can use only the number of available nodes; if you specify more than the available nodes, the log prints a warning message.

The following code example describes a data load using MultiNode data transfer mode. It assigns a CASLIB using serial mode with NUMREADNODES=10 and NUMWRITENODES=10 and loads data from a Hive table into CAS. Because the NUMREADNODES= value is other than 1, it follows the MultiNode mechanism. Notice the warning message in the log stating that the number of read nodes exceeds the available Worker nodes; specifying a number higher than the available CAS Worker nodes is one way to verify that CAS is using MultiNode data transfer mode. If you specify NUMREADNODES=0, CAS uses all available nodes, but the SAS log contains no message about multi-node usage.

CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib HiveSrl datasource=(srctype="hadoop",
server="xxxxxxx.xxx",
username="hadoop",
dataTransferMode="SERIAL",
NUMREADNODES=10, 
NUMWRITENODES=10,
hadoopconfigdir="/opt/MyHadoop/CDH/Config",
hadoopjarpath="/opt/MyHadoop/CDH/Jars",
schema="default");
proc casutil;
load casdata="prdsal2_1G" casout="prdsal2_1G"
outcaslib="HiveSrl" incaslib="HiveSrl" ;
quit;

SAS Log extract:

….
77 proc casutil;
78 ! load casdata="prdsal2_1G" casout="prdsal2_1G"
79 outcaslib="HiveSrl" incaslib="HiveSrl" ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to Hadoop.
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(7). The load will proceed with numReadNodes=7. 
…
..

On the database side, in this case Hive, note the queries submitted by the CAS Worker nodes. Each includes the MOD-function WHERE clause described above.

On the Hadoop Resource Manager user interface, you can see the corresponding job execution for each query submitted by the CAS Worker nodes.

When using MultiNode mode to load data into CAS, data distribution depends on the cardinality of the numeric column selected by CAS for the MOD function. Notice that the CAS data distribution for the table loaded above is not ideal, because CAS selected a column (‘year’) that is poorly suited (in this case) for distributing data across the CAS Worker nodes. The MultiNode mechanism provides no option to specify which column to use for query preparation and, consequently, for data distribution.

If CAS cannot find a suitable column for MultiNode data transfer mode, it will use standard Serial mode to transfer data, as shown in the following log:

……..
74
74 ! load casdata="prdsal2_char" casout="prdsal2_char"
75 outcaslib="HiveSrl" incaslib="HiveSrl" ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to Hadoop.
WARNING: The value of numReadNodes(10) exceeds the number of available worker nodes(7). The load will proceed with numReadNodes=7.
WARNING: Unable to find an acceptable column for multi-node reads. Load will proceed with numReadNodes = 1. 
NOTE: Cloud Analytic Services made the external data from prdsal2_char available as table PRDSAL2_CHAR in caslib HiveSrl.
…….

Data platforms supported for MultiNode Data Transfer using the Data Connector:

  • Hadoop
  • Impala
  • Oracle
  • PostgreSQL
  • Teradata
  • Amazon Redshift
  • DB2
  • MS SQL Server
  • SAP HANA

The order of data types that SAS uses to select a column for dividing the data into slices for a MultiNode data read is:

  • INT (includes BIGINT, INTEGER, SMALLINT, TINYINT)
  • DECIMAL
  • NUMERIC
  • DOUBLE

Multi-Node Write:

While this post focused on loading data from a data source into CAS, multi-node data transfer also works when saving from CAS back to the data source. The important parameter when saving is NUMWRITENODES instead of NUMREADNODES. The behavior of multi-node saving is similar to that of multi-node loading.
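
For illustration, a save back to Hive through the same CASLIB might look like the minimal sketch below. The output table name is an assumption; because the caslib was assigned with NUMWRITENODES=10, the save can use the MultiNode mechanism on the write side as well.

/* Sketch: save the in-memory table back to Hive through the HiveSrl caslib. */
proc casutil;
   save casdata="prdsal2_1G" incaslib="HiveSrl"
        outcaslib="HiveSrl" casout="prdsal2_1G_saved" replace;
quit;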

Summary:

The SAS Data Connector can be used for MultiNode data transfer by installing the Data Connector and DBMS client components on all CAS Worker nodes, without additional license fees. The source data is transferred directly from the DBMS gateway server to the CAS Worker nodes, divided up by a simple MOD function. With this mechanism, optimal data distribution across the CAS nodes is not guaranteed. It is suggested to use all CAS Worker nodes by specifying NUMREADNODES=0 when loading data to CAS using MultiNode mode.


Multi Node Data Transfer to CAS was published on SAS Users.

March 21, 2018
 

Having been involved in some capacity in more than 100 Customer Intelligence (CI) sales and implementation cycles, one of the most common questions prospective clients ask me is: “What do we need to do to make this project successful?” Over the years, I have distilled six themes that ultimately determine [...]

Six factors that determine customer intelligence project success was published on Customer Intelligence Blog.

March 17, 2018
 

This is a continuation of my previous blog post on SAS Data Studio and the Code transform. In this post, I will review some additional examples of using the Code transform in a SAS Data Studio data plan to help you prepare your data for analytic reports and/or models.

Create a Unique Identifier Example

The DATA step code below combines the _THREADID_ and the _N_ variables to create a UniqueID for each record.

SAS Data Studio Code Transform

The variable _THREADID_ returns the number that is associated with the thread that the DATA step is running in a server session. The variable _N_ is an internal system variable that counts the iterations of the DATA step as it automatically loops through the rows of an input data set. The _N_ variable is initially set to 1 and increases by 1 each time the DATA step loops past the DATA statement. The DATA step loops past the DATA statement for every row that it encounters in the input data. Because the DATA step is a built-in loop that iterates through each row in a table, the _N_ variable can be used as a counter variable in this case.

_THREADID_ and _N_ are variables that are created automatically by the DATA step and saved in memory. For more information, refer to the SAS documentation on automatic DATA step variables.
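
A minimal sketch of DATA step code along these lines, using the plan variables described in the first post of this series; the output column name and length are illustrative assumptions:

/* Sketch: combine the thread number and the per-thread row counter to build an */
/* identifier that is unique across all DATA step threads.                      */
data {{_dp_outputTable}} (caslib={{_dp_outputCaslib}});
   set {{_dp_inputTable}} (caslib={{_dp_inputCaslib}});
   length UniqueID $ 32;
   UniqueID = catx('_', _threadid_, _n_);
run;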

Cluster Records Example

The DATA step code below combines the _THREADID_ and the counter variables to create a unique ClusterNum for each BY group.

This code uses the FIRST.variable concept to increase the counter at the beginning of each new grouping. FIRST.variable and LAST.variable are variables that CAS creates for each BY variable. CAS sets FIRST.variable when it is processing the first observation in a BY group, and sets LAST.variable when it is processing the last observation in a BY group. These assignments enable you to take different actions based on whether processing is starting or ending for a BY group. For more information, refer to the documentation on BY-group processing in the DATA step.
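
A minimal sketch of such code; the BY variable (matchCode) and the output column name are illustrative assumptions:

/* Sketch: increase a counter at the start of each BY group and combine it with */
/* the thread number so that cluster numbers are unique across threads.         */
data {{_dp_outputTable}} (caslib={{_dp_outputCaslib}});
   set {{_dp_inputTable}} (caslib={{_dp_inputCaslib}});
   by matchCode;                        /* illustrative BY variable */
   length ClusterNum $ 32;
   if first.matchCode then counter + 1; /* counter starts at 0 in each thread and */
                                        /* increments at the start of each BY group */
   ClusterNum = catx('_', _threadid_, counter);
run;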

De-duplication Example

The DATA step code below outputs the last record of each BY group, thereby de-duplicating the data set by writing out only one record per grouping.
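
A minimal sketch, again with an illustrative BY variable:

/* Sketch: write out only the last record of each BY group. */
data {{_dp_outputTable}} (caslib={{_dp_outputCaslib}});
   set {{_dp_inputTable}} (caslib={{_dp_inputCaslib}});
   by matchCode;                        /* illustrative BY variable */
   if last.matchCode then output;
run;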

Below are the de-duplication results on the data set used in the previous Cluster Records Example section.

For more information about the DATA step, refer to the DATA step documentation.

Below is the resulting customers2.xlsx file in the Public CAS library.

For more information on the available action sets, refer to the SAS® Cloud Analytic Services 3.3: CASL Reference guide.

For more information on SAS Data Studio and the Code transform, please refer to the SAS Data Studio documentation.

SAS Data Studio Code Transform (Part 2) was published on SAS Users.

March 17, 2018
 

SAS Data Studio is a new application in SAS Viya 3.3 that provides a mechanism for performing simple, self-service data preparation tasks to prepare data for use in SAS Visual Analytics or other applications. It is accessed via the Prepare Data menu item or tile on SAS Home. Note: A user must belong to the Data Builders group in order to have access to this menu item.

In SAS Data Studio, you can either select to create a new data plan or open an existing one. A data plan starts with a source table and consists of transforms (steps) that are performed against that table. A plan can be saved and a target table can be created based on the transformations applied in the plan.

SAS Data Studio Code Transform

SAS Data Studio

In a previous blog post, I discussed the Data Quality transforms in SAS Data Studio. This post is about the Code transform, which enables you to create custom code to perform actions or transformations on a table. To add custom code using the Code transform, select the code language from the drop-down menu, and then enter the code in the text box. The following code languages are available: CASL or DATA step.

Code Transform in SAS Data Studio

Each time you run a plan, the session table and library names might change, so code that uses literal table or caslib names can fail. To avoid errors, you must use variables in place of table and caslib names in your code within SAS Data Studio. Use the following variables:

  • _dp_inputCaslib – variable for the input CAS library name.
  • _dp_inputTable – variable for the input table name.
  • _dp_outputCaslib – variable for the output CAS library name.
  • _dp_outputTable –  variable for the output table name.

Note: For DATA step only, variables must be enclosed in braces, for example, data {{_dp_outputTable}} (caslib={{_dp_outputCaslib}});.

The “varname”n syntax is needed for variable names that contain spaces and/or special characters. Refer to the Avoiding Errors When Using Name Literals help topic for more information.

CASL Code Example

The CASL code example above uses the fedSQL action set to create a summary table of counts by the standardized State value. The results of this code are pictured below.
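
A minimal CASL sketch along those lines; the standardized column name (State_STND) and the exact query are assumptions, and the plan variables are concatenated into the FedSQL statement:

/* Sketch: count rows by the standardized State value with the fedSQL action set. */
loadactionset "fedSQL";
fedSQL.execDirect /
   query = "create table " || _dp_outputCaslib || "." || _dp_outputTable ||
           " as select State_STND, count(*) as frequency from " ||
           _dp_inputCaslib || "." || _dp_inputTable ||
           " group by State_STND";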

Results from CASL Code Example

For more information on the available action sets, refer to the SAS® Cloud Analytic Services 3.3: CASL Reference guide.

DATA Step Code Example

In the DATA step code example above, the BY statement is used to group all records with the same BY value. If you use more than one variable in a BY statement, a BY group is a group of records with the same combination of values for these variables. Each BY group has a unique combination of values for the variables. On the CAS server, there is no guarantee of global ordering between BY groups. Each DATA step thread can group and order only the rows that are in the same data partition (thread). Refer to the documentation on BY-group processing in the DATA step for details.
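
A minimal sketch of BY-group DATA step code in the Code transform; the BY variable is an illustrative assumption:

/* Sketch: count the records in each BY group and output one summary row per group */
/* (per thread, since rows are grouped only within each data partition).           */
data {{_dp_outputTable}} (caslib={{_dp_outputCaslib}});
   set {{_dp_inputTable}} (caslib={{_dp_inputCaslib}});
   by State_STND;                       /* illustrative BY variable */
   if first.State_STND then groupCount = 0;
   groupCount + 1;
   if last.State_STND then output;
run;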

Results from DATA Step Code Example

For more information about the DATA step, refer to the DATA step documentation. In my next blog post, I will review some more code examples that you can use in the Code transform in SAS Data Studio. For more information on SAS Data Studio and the Code transform, please refer to the SAS Data Studio documentation.

SAS Data Studio Code Transform (Part 1) was published on SAS Users.

February 21, 2018
 

Data protection with GDPR

Here are some new tips for masking. The new EU General Data Protection Regulation (GDPR) requires your company to implement “all necessary technical and organizational measures” and to take into consideration “the available technology at the time of the processing and technological developments.” So, how can you comply with this requirement in the real world? In Part 1, we anonymized field content or replaced it with aliases. That can be sufficient, but it doesn’t have to be. That’s why we’ll cover beta functions in this article (the ideal solution for pseudonymization), personal data that has slipped through the cracks, and the exciting question of ...

Read part 1 of this series: Pseudonymagical: masking data to get up to speed with GDPR

How random can your birth be?

The exact date of your birth is important to you, naturally. The analytics experts working with your data, on the other hand, aren’t looking to send you birthday wishes anyway (missing opt-in?!). What they’re interested in is your approximate age, maybe even just the decade. The SQL code from Part 1 moves the date of birth randomly plus or minus five days. Someone who knows your birth date would therefore be unable to locate your records within a stolen database. Privacy risk abated!

But even that should be verified… with respect to providing proof of “appropriate measures,” in other words, cluster size. In our example of around 5,000 VIP customers, there is only one who is in their 20s and has a postal code beginning with the numeral 1. The time required to indirectly identify the individual (Recital 21, GDPR) could be rather low here. In the worst-case scenario, legally too low.

Enter the beta function: the ideal solution for pseudonymization

Luckily, Recital 29 of the General Data Protection Regulation tells us how to handle this problem: the information required to pinpoint an individual is simply stored separately. That can be accomplished using a key or a mathematical function - in other words, a macro with a secret key that I merely use, without knowing the math hidden behind it. The law doesn’t tell us how tricky this logic has to be, though. From an analytical standpoint, this so-called beta function should satisfy two additional conditions:

  • It must be invertible (a hash is not, for instance).
  • The result of the masking should be monotonic, which means: high original value = high new value (encryption doesn’t do this).

Why? Well, we don’t want to affect the analytic modelling too much - ideally, the function would output something linear or slightly exponential… Here is a √2 example I’ve kept simple:

CODE1
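
As a minimal sketch of what such a beta function could look like (the square-root transform and the constants are purely illustrative assumptions, not the author's code):

/* Sketch: an invertible, monotonic masking function and its inverse.            */
/* The constants act as the secret key and would live in a protected repository. */
proc fcmp outlib=work.funcs.masking;
   function beta_mask(x);
      return (17 + 4.2 * sqrt(x));          /* monotonic: higher x gives higher result */
   endsub;
   function beta_unmask(y);
      return (((y - 17) / 4.2)**2);         /* inverse, for authorized users only */
   endsub;
run;

options cmplib=work.funcs;

data masked;
   age = 42;
   masked_age   = beta_mask(age);
   original_age = beta_unmask(masked_age);  /* returns 42 again */
run;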

Mathematically, this is a coordinate transformation - or you can also think of it in terms of Star Trek: people are being beamed to an unfamiliar planet. There is a different gravity field than the earth there (a different coordinate system), but it applies to everyone equally — which means that lightweight visitors on the planet can still jump higher there than their heavyweight colleagues. The same applies accordingly to age etc.

CODE2

When using the birth date or the age, I, as an analytics expert, have no knowledge of how this beaming works technically, but I trust that when I’m developing models (and later scoring them), nothing about the behavior has changed. By the way, the computer and the correlation don’t care anyway - neither has any concept of age. (It just feels a bit strange for humans.)

We don’t lose the “true” age. It can be recalculated using another beta function, known as the inverse, which is available only to authorized employees - for instance, to fraud or legal staff during data protection lawsuits. In these cases, your customer can safely be beamed back to earth, so to speak.

A complaint from my office mate

“But how do I explain to the boss my model behavior for these 300-year-olds?!” ... Well, in this era of machine learning, neural networks are gaining in popularity and are as selective as they are indescribable. On our side, the math behind it is at least deterministic and explainable. It’s also good to know that this key code is no longer stored on your PC or glued to its data source and target, but kept remote and safe, thanks to modern data protection that protects both you and the data. And that’s a good thing.

Final aspect: the data for relevant columns has now been subjected to smart masking, the logic is in a central repository, and it’s working in secret. But what about those seemingly harmless fields way in the back, mostly empty and irrelevant, which then in the form of a sales memo or notice suddenly reveal the name of the wife, the second email address, or the former employer? The author who created them thought it was extremely practical, since they didn’t find anywhere else in the contract template where they could enter and save the information.

CODE

SAS Data Quality has pre-configured, transparent sets of rules that you can tweak as necessary to detect many of these types of cases using heuristics. That’s indispensable because if I don’t know about it, I can’t protect against it. (If I forget about the tiny basement window when installing the security system, I can be sure that the robbers won’t cooperate by breaking down the front door).

That is a prerequisite for an inventory of the data warehouse, the estimate of the GDPR implementation expense — and here an additional safeguard. Because in the code above, a firewall filter is applied to the data: if the name of a human being slips through the cracks, then only asterisks are displayed when it is output. The field “Note” is always replaced by the description of the category, such as “This is where a telephone number is hidden. After approval by the data protection officer, you may read it – but not for now.”

Are you ready for the GDPR? Learn how your peers are preparing in this global survey report.

Disclaimer: The author of this blog is not an attorney. None of the statements in this article can be construed as legal advice nor can they serve as a substitute for professional legal consultation. All code samples are for illustrative purposes only.

Beam your customers into invisibility: a data protection masked ball to get you up to speed with the GDPR was published on SAS Users.

February 21, 2018
 

Masking data for GDPR

You work with data. Data about your customers. It's likely that your customers' identity could be determined from the data you’ve collected. Starting in May 2018, a new data protection law will be in effect. This means you’re going to have to document which technical measures you’ve implemented to prevent your customers from being identified. That’s tricky, but isn’t everything already pseudonymized or anonymous or something? Won’t the IT department just take care of it? No. They won’t. This blog post gives you a detailed breakdown of what’s possible and where standard software can help with masking data for GDPR.

Read part 2 of this series: Beam your customers into invisibility: a data protection masked ball to get you up to speed with the GDPR

In the past, compliance has been relatively easy. The data protection officer was sure to take care of everything, documenting the steps taken together with IT and the compliance team. It felt like the powerful data protection law was there to protect people in charge - like you – from the data’s owners out there. Too abstract for you? Here’s an analogy:

Let’s say my car’s inspection is up in May. I have placed my trust in a copy of the traffic regulations that rests in my glove box. I tell the inspector, this is how I’ve always done things! And my perfectly maintained car has been running great for years. Wait a minute, the new mechanic seems to have missed something last time... and now you want my documentation?! Huh? What documentation?

Data protection by design and by default (Art. 25 GDPR)

But let’s go back to (your) data, which you obtained from the data warehouse (DWH), received from generous colleagues’ collections, or harvested fresh from the data lake ... you know, the stuff that self-service big data analytics is made of. Mix it all together, and the computer is able to guess the customer’s behavior. Lawyers are less inclined to wax poetic here, referring instead to “profiling” (GDPR Article 4), which can quickly extend beyond the initial purpose of the processing. The bad part? If the customer doesn’t reasonably expect this kind of processing of their information, they can submit a complaint and even demand their data be deleted at once (Article 22 and Recital 71 GDPR).

What to do? Well, get rid of personally identifiable information right from the start! So, just select fewer data points? Yes. But if you need them for your analytics, write down in advance exactly how the data will be processed ... that’ll give your data scientist a headache. Even fragments with very minimal data could still be traced back to single out the “wife of a dentist, aged 30 to 40, living in Smallsville.” So it’s better to properly “pseudonymize” or anonymize from the start.

In other words, you have to replace all the data with asterisks or a random number, or encrypt or hash it - then, in the best-case scenario, the record is anonymous. But this procedure, unfortunately, also makes it useless for analysis in most cases. Or even worse, completely wrong with respect to segmentation, model development, or scoring. More about that later. Here’s an example of “common” customer data:

Let’s say we have a file that contains a great deal of personally identifiable information. It could even be lurking in comment fields or notes - without ever being listed in a data dictionary or the records of processing activities. But it’s hiding in there, in unassuming properties, selected out of curiosity or by mistake, and thus stored. And simply storing it brings the GDPR into play. Oops!

Totally anonymous: ensure compliance by making data useless?

So how exactly do you make data unidentifiable without completely destroying all of its value? When masking, you should deploy standard functions that use quality-controlled logic in such a way that the users who process the data are unable to view the algorithm. In SQL, it may look something like this:

CREATE VIEW pdp_de_demo.Team_Alpha.CRM_CUSTOMERBASE_GDPR_VIEW AS SELECT
     SYSCAT.DM.MASK ( 'ENCRYPT', PUT(A.CustomerNo , 8.), 'alg', 'AES', 'key', '12345' )
          AS CustomerNr_encrypt,
     SYSCAT.DM.MASK ( 'HASH', A.IBAN , 'alg', 'SHA256', 'key', '12345' )
          AS IBAN_hash,
     SYSCAT.DM.MASK ( 'TRANC', A."IBAN" , 'FROM', '1234567890', 'TO', 'XXXXXXXXXX',
               'START', 10 , 'LENGTH', 9 )
          AS IBAN_tranc,
     PUT(SYSCAT.DM.MASK ( 'RANDATE', A.Bithdate, 'VARY', 5, 'UNITS', 'DAY' ), DDMMYYP10.)
          AS Birthdate,
     SYSCAT.DM.MASK ( 'RANDOM', A.Balance, 'VARY', 100 )
          AS Balance,
     (CASE WHEN ( SYSPROC.DQ.DQUALITY.DQEXTRACT (
                               A.COMMENTFIELD, 'PDP - Personal Data (Core)', 'Individual', 'DEDEU' ) ne '' )
               THEN '* * *' ELSE A.COMMENTFIELD END)
          AS Commentfield_without_name,
     SYSPROC.DQ.DQUALITY.DQIDENTIFY ( A.ANNOTATION, 'PDP - Personal Data (Core)', 'DEDEU' )
          AS ANNOTATION_IDENTIFY
FROM pdp_de_demo.data.CRM_CUSTOMERBASE AS A

The results look appealing.

But the following methodological problem quickly arises. Who in the corporate data flow carries out this step, and where should they store the results? Depending on the type of processing, the technician or analytics expert may not be permitted to even view the data in the first place, making it impossible for them to apply masking on top! How about IT (or the application developers) doing a generic anonymization of the entire DWH in one go? Well, that’s a sure way to secure the data, and a sure way to make it useless for a lot of other things, too!

Who’s masking who?

The solution: no one touches those official tables (see note at bottom). The user just no longer (physically) accesses them directly, but rather via dynamic views, which only read specified data. These views handle masking “on-the-fly,” in real time or from the cache – without destroying the original data. In the simplest cases (as in our example code above), the content of the respective fields is modified “for all selectors.” Or depending on a control table, they’re flipped off like a switch the instant a user communicates their opt-out.

The SAS Federation Server hosts these types of views, gently inserting itself between business users, analytics experts, and internal consumers of reports. Once logged in, the server knows the user’s groups, roles, and permissions. This can be elegantly leveraged in the design of the views. The very same view can allow that team with the VIP flag to also see the balance, whereas the fraud team sees everything, while the rest get harmless asterisks in place of sensitive information. All from the same physical data, controlled through a central platform.

New on the inside, proven on the outside: controlled protection for switching on and off

Access to these types of centrally administered views is accomplished via ODBC / JDBC / API or, for SAS users, via the trusty LIBNAME statement. There, instead of “ORACLE,” “FEDSVR” is your keyword. Everything else looks just like it always has: the tables (some of which are not actual tables) appear, and the code is ready to run. Such a libref can, of course, also be pre-assigned in the metadata context.

LIBNAME mydwh FEDSVR DSN=dwh1 SERVER="demo.sas.com" PORT=24141 SCHEMA=Team_Alpha;
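
Once the libref is assigned, the masked view defined earlier can be read like any other table; a minimal illustration:

/* Read the first rows of the GDPR view through the Federation Server libref. */
proc print data=mydwh.CRM_CUSTOMERBASE_GDPR_VIEW (obs=10);
run;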

A double benefit: the IT team can take its time rebuilding the physical layer (DBMS portings, modifications to the DDL, switching to a different DB, etc.) without the phones ringing off the hook with important departments calling to complain. Plus, every access can be logged. The SAS Federation Server can optionally record queries as well, regardless of whatever sneaky macro data step was used - everything is processed and logged. And users cannot get around it by deleting their SAS logs.

This activity should not be misconstrued as surveillance. The GDPR is what demands proof of who, how, where, and when ... and for data breaches, within 72 hours. Like when someone at the company burns a CD with sensitive tax information or commits other internal fraud. And as with the deletion approach, it’s already a big step in the right direction to be able to say, “yes, that’s how it would work — should we turn it on?”

Summary of the first part

The General Data Protection Regulation requires your company to implement the “appropriate technical and organisational measures” and “taking into consideration the available technology at the time of the processing and technological developments.” This article has given you some suggestions for things you can discuss with colleagues. And what’s more, the software is already out there - there’s no need to patch together a solution yourself. Instead, you can use that valuable time to generate the documentation for compliance instead.

P.S. There’s no time to lose to comply with the GDPR. Learn how to get ready in this global survey report.

Attachment: Legal texts

Profiling GDPR Art. 4, Par. 4:
‘profiling’ means any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyze or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behavior, location or movements;

The GDPR is more specific with anonymous pseudonymization (Recital 26 and 29):
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. The principles of data protection should therefore not apply to anonymous information, namely (...) to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.

In order to create incentives to apply pseudonymization when processing personal data, measures of pseudonymization should, whilst allowing general analysis, be possible within the same controller when that controller has taken technical and organisational measures necessary to ensure, for the processing concerned, that this Regulation is implemented, and that additional information for attributing the personal data to a specific data subject is kept separately. The controller processing the personal data should indicate the authorised persons within the same controller.

A final note: To ensure the separate “right to be forgotten” in the data warehouse (and countless source systems) as per Art. 17 GDPR, special measures must be taken. But this is not what we’re talking about above — deleted records are “gone,” and can no longer be processed or pseudonymized by means of profiling. Note: In the event that they are not intended (or permitted) to disappear from the relevant database immediately, approaches such as the one described could offer a temporary solution.

Disclaimer: The author of this blog is not an attorney. None of the statements in this article can be construed as legal advice nor can they serve as a substitute for professional legal consultation. All code samples are for illustrative purposes only.

Pseudonymagical: masking data to get up to speed with GDPR was published on SAS Users.

February 20, 2018
 

When using conventional methods to access and analyze data sets from Teradata tables, SAS brings all the rows from a Teradata table to SAS Workspace Server. As the number of rows in the table grows over time, it adds to the network latency to fetch the data from a database management system (DBMS) and move it to SAS Workspace Server. Considering big data, SAS Workspace Server may not have enough capacity to hold all the rows from a Teradata table.

SAS In-Database processing can help solve the problem of returning too much data from the database. SAS In-Database processing allows you to perform data operations inside the DBMS and use the distributed processing over multiple Access Module Processors (AMPs). Select SAS procedures take advantage of Teradata SQL functionality, and in some cases leverage SAS functions deployed inside the DBMS. The goal of in-database processing is to reduce the I/O required to transfer the data from Teradata to SAS.

SAS® In-Database Processing in Teradata

Using SAS In-Database processing, you can run scoring models, some SAS procedures, DS2 threaded programs, and formatted SQL queries inside the Teradata database.

The list of SAS In-Database features supported for Teradata includes:

  • Format publishing and SAS_PUT()function
  • Scoring Models
  • Select Base SAS® procedures (FREQ, RANK, REPORT, SORT, SUMMARY/MEANS, TABULATE)
  • Select SAS/STAT® procedures (CORR, CANCORR, DMDB, DMINE, DMREG, FACTOR, PRINCOMP, REG, SCORE, TIMESERIES, VARCLUS)
  • DS2 Threaded programs
  • Data quality operations
  • Extract and transform data

SAS In-Database Deployment Package for Teradata

The in-database deployment package for Teradata includes the following:

  • The SAS formats library, accelterafmt-######.rpm, installs a SAS formats library on the Teradata server. By having a SAS formats library on your Teradata system, you can publish SAS formats in Teradata, which enables you to process SAS statements with SAS formats in the Teradata database. This also enables you to publish SAS PUT functions to Teradata as a SAS_PUT() function. This software can be found in your SAS Install folder under /SAS-install-directory/SASFormatsLibraryforTeradata/3.1/TeradataonLinux/.
  • The SAS® Embedded Process package, sepcoretera-######.rpm, installs SAS Embedded Process in the Teradata database. This is the core package of in-database components. This software can be found in your software depot under folder /depot/standalone_installs/SAS_Core_Embedded_Process_Package_for_Teradata/13_0/Teradata_on_Linux.
  • The SASEPFUNC package, sasepfunc-#####.x86_64.tar.gz, installs SAS Embedded Process support functions on Teradata. SAS Embedded Process support functions are Teradata stored procedures that generate SQL to interface with SAS Embedded Process. The script from the package creates a Teradata database named SAS_SYSFNLIB with a list of tables, views, functions, and procedures to support SAS Embedded Process. The same script also adds a list of functions in the TD_SYSFNLIB database. The package can be obtained from the Teradata support group.

The following figure shows the list of objects from the SAS_SYSFNLIB database to support SAS Embedded Process:

The following shows the list of objects from the TD_SYSFNLIB database to support SAS Embedded Process:

  • The SAS® Quality Knowledge Base package, sasqkb_ci-27.#####.noarch.rpm, installs SAS Quality Knowledge Base on the Teradata server. This is an optional package to SAS Embedded Process. This package is needed along with SAS® Quality Accelerator, if you are planning to run data cleansing operations in the Teradata database. The package can be downloaded from the SAS support site.
  • The SAS Quality Accelerator package. There are two scripts (dq_install.sh and dq_grant.sh) located under SAS-Install-directory to install the data quality accelerator at Teradata. This is an optional package to SAS Embedded Process, and needed only if you are planning to run data cleansing operations in Teradata. The software install files can be found in the folder /SAS-install-directory/SASDataQualityAcceleratorforTeradata/9.4/dqacctera/sasmisc/. As a part of script execution, it adds a list of objects (procedures, functions) to the SAS_SYSFNLIB database.

Sample list of data quality related objects from the SAS_SYSFNLIB database.

Examples of running DS2 Code to perform data quality, data extract, and transform operations in Teradata:

The following example describes the execution of DS2 code that uses the SAS Data Quality Accelerator and the SAS Quality Knowledge Base to match and extract a data set from the Teradata database. The log shows that both the thread program and the data program ran in the Teradata database as in-database program execution.

CODE
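
Here is a generic sketch of a DS2 thread and data program pair that runs in-database through the SAS Embedded Process; the libref, table names, and filter are assumptions, and it omits the data quality calls described above:

/* Assumption: tdlib points at the Teradata database that holds the source table. */
libname tdlib teradata server="td.example.com" user=myuser password=XXXX database=sales;

proc ds2 ds2accel=any;                      /* request in-database execution        */
   thread extract_th / overwrite=yes;
      method run();
         set tdlib.customer;                /* runs inside Teradata via the EP      */
         if upcase(country) = 'US' then output;
      end;
   endthread;

   data tdlib.customer_us (overwrite=yes);  /* write the result back to Teradata    */
      dcl thread extract_th t;
      method run();
         set from t;
      end;
   enddata;
run;
quit;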

Stay tuned for the next part of the SAS In-Database Processing in Teradata blog series. Coming up is one about publishing SAS formats in Teradata.

SAS In-Database Processing in Teradata DBMS was published on SAS Users.

January 4, 2018
 

What will 2018 unveil for the data management market? I searched expert opinions on technology trends for 2018 and matched them against my own to uncover the five major trends that I think we’ll see in data management this year: 1. Data movement becomes more important. Cloud providers have proven [...]

Data management predictions for 2018 was published on SAS Voices by Helmut Plinke.