Analyzing ticket sales and customer data for large sports and entertainment events is a complex endeavor. But SAS Visual Analytics makes it easy, with location analytics, customer segmentation, predictive artificial intelligence (AI) capabilities – and more. This blog post covers a brief overview of these features by using a fictitious event company [...]
I bet many of you didn’t even know the term machine learning five years ago. But Gartner did. The Gartner Magic Quadrant for Data Science and Machine Learning Platforms, 2018 was just released, and SAS has been in the leader’s quadrant for five years straight. According to Gartner, “This Magic [...]
Gartner names data science and machine learning leaders was published on SAS Voices by David Tareen
Eidtor's note: Andrew Fowkes, Head of Retail Centre of Excellence at SAS UK & Ireland, explores the factors driving some retailers to excel and others to fail in a complex environment A stark polarisation is now emerging between winners and losers on the high street. Companies that have focused on [...]
You work with data. Data about your customers. It's likely that your customers' identity could be determined from the data you’ve collected. Starting in May 2018, a new data protection law will be in effect. This means you’re going to have to document which technical measures you’ve implemented to prevent your customers from being identified. That’s tricky, but isn’t everything already pseudonymized or anonymous or something? Won’t the IT department just take care of it? No. They won’t. This blog post gives you a detailed breakdown of what’s possible and where standard software can help with masking data for GDPR.Read part 2 of this series: Beam your customers into invisibility: a data protection masked ball to get you up to speed with the GDPR
In the past, compliance has been relatively easy. The data protection officer was sure to take care of everything, documenting the steps taken together with IT and the compliance team. It felt like the powerful data protection law was there to protect people in charge - like you – from the data’s owners out there. Too abstract for you? Here’s an analogy:
Let’s say my car’s inspection is up in May. I have placed my trust in a copy of the traffic regulations that rests in my glove box. I tell the inspector, this is how I’ve always done things! And my perfectly maintained car has been running great for years. Wait a minute, the new mechanic seems to have missed something last time... and now you want my documentation?! Huh? What documentation?
Data protection by design and by default (Art. 25 GDPR)
But let’s go back to (your) data, which you obtained from the data warehouse (DWH), received from generous colleagues’ collections, or harvested fresh from the data lake ... you know, the stuff that self-service big data analytics is made of. Mix it all together, and the computer is able to guess the customer’s behavior. Lawyers are less inclined to wax poetic here, referring instead to “profiling” (GDPR Article 4), which can quickly extend beyond the initial purpose of the processing. The bad part? If the customer doesn’t reasonably expect this kind of processing of their information, they can submit a complaint and even demand their data be deleted at once (Article 22 and Recital 71 GDPR).
What to do? Well, get rid of personally identifiable information right from the start! So, just select fewer data points? Yes. But if you need them for your analytics, write down in advance exactly how the data will be processed ... that’ll give your data scientist a headache. Even fragments with very minimal data could still be traced back to single out the “wife of a dentist, aged 30 to 40, living in Smallsville.” So it’s better to properly “pseudonymize” or anonymize from the start.
In other words, you have to replace all dates with asterisks or a random number or encrypt them or hash them - then, in the best-case scenario, the record is anonymous. But this procedure, unfortunately, also makes it useless for analysis in most cases. Or even worse, completely wrong with respect to segmentation, model development, or scoring. More about that later. Here’s an example of “common” customer data:
Let’s say we have a file that contains a great deal of personally identifiable information. It could even be lurking in comment fields or notes - without ever being listed in a data dictionary or the records of processing activities. But it’s hiding in there, in unassuming properties, selected out of curiosity or by mistake, and thus stored. And simply storing it brings the GDPR into play. Oops!
Totally anonymous: ensure compliance by making data useless?
So how exactly do you make data unidentifiable without completely destroying all of its value? When masking, you should deploy standard functions that use quality-controlled logic in such a way that users who processes the data are unable to view the algorithmn. In SQL, it may look something like this:
CREATE VIEW pdp_de_demo.Team_Alpha.CRM_CUSTOMERBASE_GDPR_VIEW AS SELECT SYSCAT.DM.MASK ( ‘ENCRYPT’, PUT(A.CustomerNo , 8.), ‘alg’, ‘AES’, ‘key’, ‘12345’ ) AS CustomerNr_encrypt, SYSCAT.DM.MASK ( ‘HASH’, A.IBAN , ‘alg’, ‘SHA256’, ‘key’, ‘12345’ ) AS IBAN_hash, SYSCAT.DM.MASK ( ‘TRANC’,A.”IBAN” , ‘FROM’, ‘1234567890’, ‘TO’, ‘XXXXXXXXXX’, ‘START’, 10 , ‘LENGTH’, 9 ) AS IBAN_tranc, PUT(SYSCAT.DM.MASK ( ‘RANDATE’, A.Bithdate, ‘VARY’, 5, ‘UNITS’, ‘DAY’ ), DDMMYYP10.) AS Birthdate, SYSCAT.DM.MASK ( ‘RANDOM’, A.Balance, ‘VARY’, 100 ) AS Balance, (CASE WHEN ( SYSPROC.DQ.DQUALITY.DQEXTRACT ( A.COMMENTFIELD, ‘PDP - Personal Data (Core)’, ‘Individual’,’DEDEU’ ) ne “) THEN ‘* * *’ ELSE A.COMMENTFIELD END) AS Commentfield_without_name, SYSPROC.DQ.DQUALITY.DQIDENTIFY ( A.ANNOTATION, ‘PDP - Personal Data (Core)’, ‘DEDEU’ ) AS ANNOTATION_IDENTIFY FROM pdp_de_demo.data.CRM_CUSTOMERBASE AS A
The results look appealing.
But the following methodological problem quickly arises. Who in the corporate data flow carries out this step, and where should they store the results? Depending on the type of processing, the technician or analytics expert may not be permitted to even view the data in the first place, making it impossible for them to apply masking on top! How about IT (or the application developers) doing a generic anonymization of the entire DWH in one go? Well, that’s a sure way to secure the data, and a sure way to make it useless for a lot of other things, too!
Who’s masking who?
The solution: no one touches those official tables (see note at bottom). The user just no longer (physically) accesses them directly, but rather via dynamic views, which only read specified data. These views handle masking “on-the-fly,” in real time or from the cache – without destroying the original data. In the simplest cases (as in our example code above), the content of the respective fields is modified “for all selectors.” Or depending on a control table, they’re flipped off like a switch the instant a user communicates their opt-out.
The SAS Federation Server hosts these types of views, gently inserting itself between business users, analytics experts, and internal consumers of reports. Once logged in, the server knows the user’s groups, roles, and permissions. This can be elegantly leveraged in the design of the views. The very same view can allow that team with the VIP flag to also see the balance, whereas the fraud team sees everything, while the rest get harmless asterisks in place of sensitive information. All from the same physical data, controlled through a central platform.
New on the inside, proven on the outside: controlled protection for switching on and off
Access to these types of centrally administered views is accomplished via ODBC / JDBC / API. Or, for SAS users, via the trusty libname statement. There, instead of “ORACLE,” “FEDSVR” is your keyword. Everything else looks just like it always has. The tables (some of which are not actual tables) appear, the code is ready to run. Such libref can, of course, also be pre-assigned in the metadata context.
LIBNAME mydwh FEDSVR DSN=dwh1 SERVER=”demo.sas.com” PORT=24141 SCHEMA=Team_Alpha;
A double benefit: the IT team can take its time rebuilding the physical layer (DBMS portings, modifications to the DDL, switching to a different DB, etc.) without the phones ringing off the hook with important departments calling to complain. Plus, every access can be logged. The SAS Federation Server can optionally record queries as well, regardless of whatever sneaky macro data step was used - everything is processed and logged. And users cannot get around it by deleting their SAS logs.
This activity should not be misconstrued as surveillance. The GDPR is what demands proof of who, how, where, and when ... and for data breaches, within 72 hours. Like when someone at the company burns a CD with sensitive tax information or commits other internal fraud. And as with the deletion approach, it’s already a big step in the right direction to be able to say, “yes, that’s how it would work — should we turn it on?”
Summary of the first part
The General Data Protection Regulation requires your company to implement the “appropriate technical and organisational measures” and “taking into consideration the available technology at the time of the processing and technological developments.” This article has given you some suggestions for things you can discuss with colleagues. And what’s more, the software is already out there - there’s no need to patch together a solution yourself. Instead, you can use that valuable time to generate the documentation for compliance instead.
P.S. There’s no time to lose to comply with the GDPR. Learn how to get ready in this global survey report.
Attachment: Legal texts
Profiling GDPR Art. 4, Par. 4:
‘profiling’ means any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyze or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behavior, location or movements;
The GDPR is more specific with anonymous pseudonymization (Recital 26 and 29):
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. The principles of data protection should therefore not apply to anonymous information, namely (...) to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
In order to create incentives to apply pseudonymization when processing personal data, measures of pseudonymization should, whilst allowing general analysis, be possible within the same controller when that controller has taken technical and organisational measures necessary to ensure, for the processing concerned, that this Regulation is implemented, and that additional information for attributing the personal data to a specific data subject is kept separately. The controller processing the personal data should indicate the authorised persons within the same controller.
A final note: To ensure the separate “right to be forgotten” in the data warehouse (and countless source systems) as per Art. 17 GDPR, special measures must be taken. But this is not what we’re talking about above — deleted records are “gone,” and can no longer be processed or pseudonymized by means of profiling. Note: In the event that they are not intended (or permitted) to disappear from the relevant database immediately, approaches such as the one described could offer a temporary solution.
Disclaimer: The author of this blog is not an attorney. None of the statements in this article can be construed as legal advice nor can they serve as a substitute for professional legal consultation. All code samples are for illustrative purposes only.
Pseudonymagical: masking data to get up to speed with GDPR was published on SAS Users.
Let’s lay down some fundamentals. In business you want to achieve the highest revenues with the best margins and the lowest costs. More specifically, in manufacturing, you want your products to be the highest quality (relative to specification) when you make the item. And you want it shipped to the [...]
How do you take your manufacturing business to the next level? was published on SAS Voices by Tim Clark
Deep learning is not synonymous with artificial intelligence (AI) or even machine learning. Artificial Intelligence is a broad field which aims to "automate cognitive processes." Machine learning is a subfield of AI that aims to automatically develop programs (called models) purely from exposure to training data.
Deep learning is one of many branches of machine learning, where the models are long chains of geometric functions, applied one after the other to form stacks of layers. It is one among many approaches to machine learning but not on equal footing with the others.
What makes deep learning exceptional
Why is deep learning unequaled among machine learning techniques? Well, deep learning has achieved tremendous success in a wide range of tasks that have historically been extremely difficult for computers, especially in the areas of machine perception. This includes extracting useful information from images, videos, sound, and others.
Given sufficient training data (in particular, training data appropriately labelled by humans), it’s possible to extract from perceptual data almost anything that a human could extract. Large corporations and businesses are deriving value from deep learning by enabling human-level speech recognition, smart assistants, human-level image classification, vastly improved machine translation, and more. Google Now, Amazon Alexa, ad targeting used by Google, Baidu and Bing are all powered by deep learning. Think of superhuman Go playing and near-human-level autonomous driving.
How to build deep learning models
Given all this success recorded using deep learning, it's important to stress that building deep learning models is more of an art than science. To build a deep learning or any machine learning model for that matter one need to consider the following steps:
- Define the problem: What data does the organisation have? What are we trying to predict? Do we need to collect more data? How can we manually label the data? Make sure to work with domain expert because you can’t interpret what you don’t know!
- What metrics can we use to reliably measure the success of our goals.
- Prepare validation process that will be used to evaluate the model.
- Data exploration and pre-processing: This is where most time will be spent such as normalization, manipulation, joining of multiple data sources and so on.
- Develop an initial model that does better than a baseline model. This gives some indication of whether machine learning is ideal for the problem.
- Refine model architecture by tuning hyperparameters and adding regularization. Make changes based on validation data.
- Avoid overfitting.
- Once happy with the model, deploy it into production environment. This may be difficult to achieve for many organisations giving that a deep learning score code is large. This is where SAS can help. SAS has developed a scoring mechanism called "astore" which allows deep learning method to be pushed into production with just a click.
Is the deep learning hype justified?
We're still in the middle of deep learning revolution trying to understand the limitations of this algorithm. Due to its unprecedented successes, there has been a lot of hype in the field of deep learning and AI. It’s important for managers, professionals, researchers and industrial decision makers to be able to distill this hype from reality created by the media.
Despite the progress on machine perception, we are still far from human level AI. Our models can only perform local generalization, adapting to new situations that must be similar to past data, whereas human cognition is capable of extreme generalization, quickly adapting to radically novel situations and planning for long-term future situations. To make this concrete, imagine you’ve developed a deep network controlling a human body, and you wanted it to learn to safely navigate a city without getting hit by cars, the net would have to die many thousands of times in various situations until it could infer that cars are dangerous, and develop appropriate avoidance behaviors. Dropped into a new city, the net would have to relearn most of what it knows. On the other hand, humans are able to learn safe behaviors without having to die even once—again, thanks to our power of abstract modeling of hypothetical situations.
Lastly, remember deep learning is a long chain of geometrical functions. To learn its parameters via gradient descent one key technical requirements is that it must be differentiable and continuous which is a significant constraint.
Stories have been flooding the market lately about artificial intelligence (AI) causing the next world war. Based originally on comments made by Elon Musk (see below), many others are jumping in to share similar fears. China, Russia, soon all countries w strong computer science. Competition for AI superiority at national level [...]
Depending on who you speak with you will get varying definitions and opinions regarding demand sensing and shaping from sensing short-range replenishment based on sales orders to manual blending of point-of-sales (POS) data and shipments. Most companies think that they are sensing demand when in fact they are [...]
In part one of this series, Clark Twiddy, Chief Administrative Officer of Twiddy & Company, shared some best practices from the first of three phases of Twiddy’s journey to becoming a data-driven SMB. This post focuses on phases two and three of their journey. Phase two is about action. Now [...]
How to be a data-driven SMB: Part 2 of Twiddy’s Tale was published on SAS Voices by Analise Polsky
For colleges and universities, awarding financial aid today requires sophisticated analysis. When higher education leaders ask, “How can we use financial aid to help meet our institutional goals?” they need to consider many scenarios to balance strategic enrollment goals, student need, and institutional finances in order to optimize yield and [...]