Here are some new tips for masking. The new EU General Data Protection Regulation (GDPR) requires your company to implement (quote) all necessary technical and organizational measures and to take into consideration the available technology at the time of the processing and technological developments. So, how can you comply with this requirement in the real world? In Part 1, we anonymized field content or replaced it with aliases. That can be sufficient, but it isn’t always. That’s why this article covers beta functions (the ideal solution for pseudonymization), personal data that has slipped through the cracks, and the exciting question of ...
Read part 1 of this series: Pseudonymagical: masking data to get up to speed with GDPR
How random can your birth be?
The exact date of your birth is important to you, naturally. The analytics experts working with your data, on the other hand, aren’t looking to send you birthday wishes anyway (missing opt-in?!). What they’re interested in is your approximate age, maybe even just the decade. The SQL code from Part 1 moves the date of birth randomly plus or minus five days. Someone who knows your birth date would therefore be unable to locate your records within a stolen database. Privacy risk abated!
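To make the idea concrete, here is a minimal Python sketch of such a random shift. It is not the SQL from Part 1: the function name `shift_birth_date` and the `max_days` parameter are made up for illustration, and the sample date is invented.

```python
import random
from datetime import date, timedelta


def shift_birth_date(birth_date, rng, max_days=5):
    """Move a date by a random whole number of days in [-max_days, +max_days]."""
    offset = rng.randint(-max_days, max_days)  # both bounds are inclusive
    return birth_date + timedelta(days=offset)


# Illustration only: a seeded generator makes the shift reproducible,
# which is handy for a demo but not what you want in production.
rng = random.Random(42)
original = date(1980, 3, 1)
masked = shift_birth_date(original, rng)
print(original, "->", masked)
```

Keep in mind that the offset can also be zero, so a few dates stay unchanged. Whether that is acceptable is a design decision for your own masking rules.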
But even that should be verified when it comes to providing proof of “appropriate measures.” The thing to check is cluster size. In our example of around 5,000 VIP customers, there is only one who is in their 20s and has a postal code beginning with the numeral 1. The time required to indirectly identify that individual (Recital 26, GDPR) could be rather low here. In the worst-case scenario, legally too low.
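A cluster-size check of this kind can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (`age`, `postal_code`), the function names and the threshold `min_size` are hypothetical, and the four sample customers are invented.

```python
from collections import Counter


def cluster_sizes(records):
    """Count records per (age decade, first postal-code digit)."""
    return Counter(
        (r["age"] // 10 * 10, r["postal_code"][0]) for r in records
    )


def risky_clusters(records, min_size=5):
    """Return the clusters that contain fewer than min_size records."""
    return {
        key: n for key, n in cluster_sizes(records).items() if n < min_size
    }


# Invented sample data: one customer in their 20s with postal code 1...
customers = [
    {"age": 27, "postal_code": "10115"},
    {"age": 45, "postal_code": "80331"},
    {"age": 48, "postal_code": "80469"},
    {"age": 41, "postal_code": "81675"},
]
print(risky_clusters(customers, min_size=2))  # {(20, '1'): 1}
```

Any cluster that comes back from `risky_clusters` is a candidate for coarser masking, for example wider age bands or a shorter postal code prefix.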
Enter the beta function: the ideal solution for pseudonymization
Luckily, Recital 29 of the General Data Protection Regulation tells us how to handle this problem. The information required to pinpoint an individual is simply stored separately. That can be accomplished using a key or a mathematical function, in other words a macro with a secret key: I only use it, without knowing the math hidden behind it. The law doesn’t tell us how tricky this logic has to be, though. From an analytical standpoint, this so-called beta function should satisfy two additional conditions:
- It must be invertible (a hash is not, for instance).
- The result of the masking should be monotonic, which means: high original value = high new value (encryption doesn’t do this).
Why? Well, we don’t want to affect the analytic modelling too much - ideally, the function would output something linear or slightly exponential… Here is a √2 example I’ve kept simple:
Mathematically, this is a coordinate transformation. You can also think of it in terms of Star Trek: people are being beamed to an unfamiliar planet. The gravity field there differs from Earth’s (a different coordinate system), but it applies to everyone equally, which means that lightweight visitors can still jump higher than their heavyweight colleagues. The same applies to age and similar values.
When using the birth date or the age, I, as an analytics expert, have no knowledge of how this beaming works technically, but I trust that nothing about the behavior has changed when I’m developing models (and later when scoring them). By the way, the computer and the correlation don’t care anyway: neither has any concept of age. (It just feels a bit strange for humans.)
We don’t lose the “true” age. It can be recalculated using another beta function, known as the inverse, which is available only to authorized employees, for instance fraud or legal staff during data protection lawsuits. In these cases, your customer can safely be beamed back to earth, so to speak.
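As a rough sketch of what such a pair of functions could look like, assume the simplest case: a linear function with a positive scale and an offset as the secret key. This is not necessarily the function used in this article. The names `BetaKey`, `beta` and `beta_inverse` and the key values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaKey:
    """Secret key, kept apart from the data (hypothetical layout)."""
    scale: float
    offset: float

    def __post_init__(self):
        # A positive scale keeps the function monotonic and invertible.
        if self.scale <= 0:
            raise ValueError("scale must be positive")


def beta(value, key):
    """Masking: a higher original value always gives a higher masked value."""
    return key.scale * value + key.offset


def beta_inverse(masked, key):
    """Unmasking, reserved for authorized users who hold the key."""
    return (masked - key.offset) / key.scale


# Assumed example key, built around the square root of 2.
key = BetaKey(scale=5 * math.sqrt(2), offset=17.0)
print(round(beta(40, key)))                     # 300
print(round(beta_inverse(beta(40, key), key)))  # 40
```

With this assumed key, a 40-year-old shows up in the masked data as roughly 300 years old, yet the ordering of all customers by age is unchanged, and only someone holding the key can beam the value back.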
A complaint from my office mate
“But how do I explain my model’s behavior for these 300-year-olds to the boss?!” Well, in this era of machine learning, neural networks are gaining in popularity and are as selective as they are indescribable. On our side, the math behind the masking is at least deterministic and explainable. It is also good to know that the key code is no longer stored on your PC or glued to its data source and target, but kept remote and safe, as modern data protection requires, to protect both you and the data. And that’s a good thing.
Final aspect: the data for relevant columns has now been subjected to smart masking, the logic is in a central repository, and it’s working in secret. But what about those seemingly harmless fields way in the back, mostly empty and irrelevant, which then, in the form of a sales memo or notice, suddenly reveal the name of the wife, the second email address, or the former employer? The author who filled them in thought it was extremely practical, since they couldn’t find anywhere else in the contract template to enter and save the information.
SAS Data Quality has pre-configured, transparent sets of rules that you can tweak as necessary to detect many of these types of cases using heuristics. That’s indispensable because if I don’t know about it, I can’t protect against it. (If I forget about the tiny basement window when installing the security system, I can be sure that the robbers won’t cooperate by breaking down the front door.)
That is a prerequisite for an inventory of the data warehouse and for estimating the GDPR implementation expense, and here it also serves as an additional safeguard. In the accompanying code, a firewall filter is applied to the data: if the name of a human being slips through the cracks, only asterisks are displayed on output. The field “Note” is always replaced by a description of the category, such as “This is where a telephone number is hidden. After approval by the data protection officer, you may read it, but not for now.”
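The filtering idea can be illustrated with a small Python sketch. It is emphatically not the SAS Data Quality rule set: the two regular expressions are deliberately crude, hypothetical heuristics, and the function names `detect_category`, `filter_note` and `mask_name` are invented for this example.

```python
import re

# Hypothetical heuristics; a real rule set would be far richer.
RULES = [
    ("an email address", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("a telephone number", re.compile(r"\+?\d[\d\s()/-]{6,}\d")),
]


def detect_category(note):
    """Return a label for the first rule that matches, or None."""
    for label, pattern in RULES:
        if pattern.search(note):
            return label
    return None


def filter_note(note):
    """Replace a suspicious note with a description of what it hides."""
    category = detect_category(note)
    if category is None:
        return note  # nothing detected, pass the text through
    return (
        f"This is where {category} is hidden. "
        "Approval by the data protection officer is required to read it."
    )


def mask_name(name):
    """Show only asterisks in place of a detected personal name."""
    return "*" * len(name)


print(filter_note("Call the husband at +49 170 1234567"))
print(filter_note("Prefers invoices by post"))
print(mask_name("Smith"))  # *****
```

In this sketch, the whole note is replaced rather than just the matching fragment, so nothing next to the detected value leaks either. Detecting personal names reliably takes more than a regular expression, which is why `mask_name` only covers the display step here.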
Disclaimer: The author of this blog is not an attorney. None of the statements in this article can be construed as legal advice nor can they serve as a substitute for professional legal consultation. All code samples are for illustrative purposes only.