David Stern

November 2, 2017
 

SAS Visual Investigator has an attractive user interface. One of its most engaging features is the network diagram, which represents related ‘entities’ and the connections between them, allowing an investigator to see and explore relationships in their source data.

For maximum impact, each entity in the network diagram should have an icon – a symbol – which clearly represents the entity. Network diagrams are significantly more readable if the icons used relate to the entities they represent, particularly when a larger number of entity types are in the diagram.

SAS Visual Investigator 10.2.1 comes with 56 icons in various colors and 33 white map pins. But if none of these adequately represent the entities you are working with, you can add your own. The catch is that the icons have to be vector images (defined in terms of straight and curved lines) in a file format called SVG, rather than raster images defined in terms of pixels, such as GIF, JPEG or PNG files. Vector images stay sharp at any scale, so the icons look good at different sizes. But requiring a vector icon file means you can’t just use any image as a network diagram symbol in Visual Investigator.

So where can you find new icons to better represent your entities? In this post, we’ll see how you can make (or more accurately, find and set the colors of) icons to suit your needs.

For this example, I’m going to suppose that we have an entity called ‘Family,’ representing a family group who are associated with one or more ‘Addresses’ and with one or more individual ‘Persons.’

There are good icons for an ‘Address’ in the default set of 56 provided: I’d choose one of the building icons, probably the house. There are also good icons for a ‘Person’: any of several icons representing people would be fine. The ‘house’ and ‘people’ icons come in a nice variety of colors, and there are white-on-some-background-color versions of each icon for use in Map Pins.

But, there is no icon in the default set which obviously represents the idea of a ‘family.’ So, let’s make one.

We will use the free* library of icons at www.flaticon.com as an example of how to find a source icon image.

*NOTE: While these are free, there are some license terms and conditions you must adhere to if you use them. Please read them in full; the main points to note are that you must give attribution to the designer and site in a specific and short format for the icons you use, and you’re not allowed to redistribute them.

Begin by browsing to https://www.flaticon.com/, and either just browse the icons, or search for a key word which has something to do with the entity for which you need an icon. I’m going to search for ‘Family’:

Notice that many icons have a smaller symbol beneath them, indicating whether they are a ‘premium’ or ‘selection’ icon:

Premium:

Selection:

Premium icons can only be downloaded if you subscribe to the site. Selection icons can be downloaded and used without any subscription so long as you give proper attribution. If you do subscribe, however, you are allowed to use them without attribution. Icons with no symbol beneath them are also free, but always have to be properly attributed, regardless of whether you subscribe or not.

If you are planning on using icons from this site in an image which may be shared with customers, or on a customer project, please be sure to comply with company policies and guidelines on using third party content offered under a Creative Commons license. Please do your own research and adhere to their rules.

The icons here are shown at a larger size than they will be in SAS Visual Investigator. For best results, pick an icon with few small details, so that it will still be clear at less than half the size. For consistency with the other icons in your network diagrams and maps, pick a simple black-and-white icon. Several icons in this screenshot are suitable, but I will choose this one named ‘Family silhouette,’ which I think will be clear and recognizable when shrunk:

Icons made by Freepik from www.flaticon.com is licensed by CC 3.0 BY

When you hover over the icon, you’ll see a graphic overlaid on top of it, like this:

If you are working with a number of icons at once, it can be convenient to click the top half of that graphic, to add the icon to a collection. When you have assembled the icons you require, you can then edit their colors together, and download them together. Since I am just working with one icon, I clicked the bottom half of the overlay graphic, to simply view the icon’s details. This is what you see next:

Click the green SVG button to choose a color for your icon – a box with preset colors swings down below the buttons. This offers 7 default colors, which are all okay (white is perfect for map pin icons). But you may prefer to match the colors of existing icons in SAS Visual Investigator – to set your own color, click the multicolor button (highlighted in a red box below):

These are the hex color codes for icons and map pin backgrounds in SAS Visual Investigator 10.2.1. Enter one of these codes in the website’s color picker to match the new icon’s color to one already used in VI:

I chose the green in the center of this table, with hex color code #4d9954:

Then, click Download, and in the popup, copy the link below the button which will allow you to credit the author, then click ‘Free download’ (you must credit the author):

Your SVG icon file is downloaded in your browser. Paste the attribution text/HTML which credits the icon’s author in a text editor.

You must ensure that the files are named so that they contain only the characters A-Z, a-z, 0-9 and ‘_’ (underscore). SAS Visual Investigator doesn’t like SVG filenames containing other characters. In this example, we MUST rename the files to replace the ‘-‘ (dash or hyphen) with something else, e.g. an underscore.

You may also want to rename the SVG file to reflect the color you chose. So, making both these changes, I renamed my file from ‘family-silhouette.svg’ to ‘family_silhouette_green.svg’.
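If you have many icon files to prepare, the naming rule can be checked before you upload anything. The following is a minimal sketch in SAS, assuming the rule applies to the part of the file name before the .svg extension. The icon_files table and the file names in it are illustrative only.

```sas
/* Hypothetical list of downloaded icon files to check (illustrative names). */
data icon_files;
   length filename suggested $64;
   input filename $;

   /* 1 if the stem uses only A-Z, a-z, 0-9 and underscore, else 0. */
   name_ok = (prxmatch('/^[A-Za-z0-9_]+\.svg$/', strip(filename)) > 0);

   /* Suggested name: replace any other character in the stem with '_'.
      Assumes the name contains a single dot, before the extension. */
   suggested = cats(prxchange('s/[^A-Za-z0-9_]/_/', -1,
                              strip(scan(filename, 1, '.'))), '.svg');
   datalines;
family-silhouette.svg
family_silhouette_green.svg
;
run;

proc print data=icon_files noobs;
run;
```

Here family-silhouette.svg is flagged with name_ok=0 and a suggested name of family_silhouette.svg, while the second name passes unchanged.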

Adding the color to the file name is a good idea when you create multiple copies of the same icon, in different colors. You should also consider creating a white version of each icon for use as a map pin. Doing that, I saved another copy of the same icon in white, and renamed the downloaded file to ‘family_silhouette_white.svg’.

Then, if necessary, copy your SVG file(s) from your local PC to a machine from which you can access your copy of SAS Visual Investigator. To import the icons, open the SAS Visual Investigator: Administration app, and log in as a VI administrator.

When creating or editing an Entity, switch to the Views tab, and click on the Manage icons and map pins button:

In the Manage Icons and Map Pins popup, click Upload…:

Select all of the SVG files you want to upload:

The new icons are uploaded, and if you scroll through the list of icons and map pins you should be able to find them. Change the Type for any white icons you uploaded to Map Pin, and click OK to save your changes to the set of icons and map pins:

You can now use your new icons and map pins for an Entity:

Don’t forget to include a reference to the source for the icon in any materials you produce which accompany the image, if you share it with anyone else. Here’s mine:

Icons made by Freepik from http://www.freepik.com/ is licensed by CC 3.0 BY

See you next time!

Creating and uploading custom icons and map pin icons in SAS® Visual Investigator was published on SAS Users.

July 8, 2015
 

This is my second blog on the topic of anonymization, which I’ve spent some time over the past several months researching. My first blog, Anonymization for data managers, focused on the technical process. Now let’s dive into the role for analysts, report designers and information owners.

To analysts and reporting experts, anonymization means something quite different. For them, it means the process of rendering personal information non-personal, in much more general terms.

Anonymization in this context includes a wider set of practices for ensuring that personal data is well-governed, and that when summary statistics about a group of individuals are published, the number of individuals making up any one aggregate group is large enough that those individuals cannot be personally identified from the characteristics of the group.

For example, suppose a high school publishes details of its exam results, and breaks them down in several ways as part of a national initiative to be open about its performance, allowing parents and local government to see how the school is performing. The school is at risk of breaching the anonymity of some of their pupils, by inadvertently revealing sensitive personal information about them, if they are not cautious about choosing which figures to publish and which to keep confidential. (There is nothing wrong with calculating those figures, just with publishing them).

Now suppose there are 30 students in a particular class in the school, of whom 15 are male and 15 are female. In the whole class, 6 (1 boy and 5 girls) are in an ethnic group which is in a minority in the school’s local area. The remaining 24 students in the class are from the local majority ethnic group.

The school may publish figures on the students’ representation in the class by gender and ethnic origin, in order to demonstrate fairness in admission procedures, or its success or otherwise in supporting all students equally:

  • Publishing examination scores in a subject by class, and by gender of the students is probably okay – there are 15 students in each gender group, so an unusually low or high score is not attributable to any individual student.
  • It would require a judgement call to decide whether publishing the mean examination scores by the student’s ethnic origin was a good idea or not: one would wish to exercise some care and sensitivity.
  • But if the school breaks down the examination scores by both gender and the student’s ethnic origin, they have a problem. The average figure for boys in the minority ethnic group is in fact made from just one student’s exam score. If they publish a summary table including this figure, they are publishing the exam result of one easily-identified student, with the potential to cause that student considerable distress, and expose the school to risk of litigation if the student has not consented to that information being made public.

The solution, which is supported by features for suppressing small-contributing-population fields in reports in several of SAS’ products, is to publish the table but suppress values in cells with small contributing counts, e.g. fewer than 5 or fewer than 10.

One way to do this is to maintain a ‘cell count’ value alongside each measure value, containing the number of individuals whose values (whether or not those values are missing) were aggregated to produce the measure value. Where the cell count for a measure value is less than the threshold for suppression of small cells, the published version of the report should display some other value or symbol instead of the actual measure value for the cell (e.g. ‘suppressed’, or ‘*’).
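The cell-count approach described above might be sketched in SAS as follows. This is an illustration under stated assumptions, not the mechanism used by any particular SAS product: the exam_scores data is invented, the threshold of 5 is arbitrary, and the names min_cell, cell_stats and report are hypothetical.

```sas
/* Hypothetical pupil-level data: one row per student (values invented). */
data exam_scores;
   input gender $ ethnic_group $ score @@;
   datalines;
M majority 58  M majority 61  M majority 52  M majority 49  M majority 64
M minority 30
F majority 66  F majority 71  F majority 59  F majority 63  F majority 68
F minority 62  F minority 70  F minority 65  F minority 57  F minority 66
;
run;

%let min_cell = 5;   /* assumed suppression threshold */

/* One row per gender x ethnic group cell. _FREQ_ counts every student
   in the cell, including any whose score is missing. */
proc summary data=exam_scores nway;
   class gender ethnic_group;
   var score;
   output out=cell_stats (drop=_type_ rename=(_freq_=cell_count))
          mean=mean_score;
run;

/* Publish the mean only where the cell count reaches the threshold. */
data report (keep=gender ethnic_group published);
   set cell_stats;
   length published $12;
   if cell_count < &min_cell then published = '*';
   else published = strip(put(mean_score, 8.2));
run;
```

In this invented data the only cell below the threshold is the single boy in the minority group, so that is the only row where the published value is replaced by ‘*’.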

However, you must consider a vulnerability: by combining information in several different summary tables which you publish, or by combining summary tables which you publish with other related data published elsewhere or previously, it may be possible to work out the suppressed value. For example:

In the school we considered earlier, suppose you publish summary statistics as follows, based on data which I will not show:

  • The mean score for Chemistry for the class of 30 students in 2015 was 60%.
  • For the 15 boys, the mean score is 55%, and for the 15 girls, 65%. These can be published without revealing anyone’s identity or revealing overly-personal information about any individual.
  • For both ethnic groups in this class (having 24 students in the local majority group, and 6 in the local minority group), the mean score is 60%. Although 6 is a fairly small cell count for the local minority ethnic group, I would feel encouraged to go ahead and publish this figure because it is not hugely different from the local majority’s mean score, and is therefore unlikely to be controversial or cause anyone much distress. If there was a big difference, I would wish to consider sharing this with the relevant education bodies and acting to redress that difference, but may decide not to publish this figure.
  • For boys in the local majority ethnic group, the mean score for Chemistry is 56.79% (rounded to 2 decimal places). This is not problematic in itself, but...
  • For boys in the local minority ethnic group, the mean score for Chemistry is suppressed because (as those who know the makeup of the class will know) there is only one student in this group.

Is it safe to publish the data? I would argue that it is not. The reason is that someone familiar with the school and the composition of the class may know that there is only one male student in the local minority ethnic group. That person can calculate his chemistry exam score as follows:

Suppressed score = 55*15 - 56.79*14 = 29.94% (≈30%)

The information above shares enough data about the class that a strikingly low exam score for one identifiable student can easily be calculated. You should not publish all of this data, as it may easily cause that student distress. Moreover, to publish it as above may be a breach of that student’s privacy which could be legally actionable.
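The differencing calculation above can be reproduced in a few lines of SAS. This sketch uses only the published figures from the example; the variable names are my own.

```sas
data _null_;
   boys_n = 15;              boys_mean = 55;
   majority_boys_n = 14;     majority_boys_mean = 56.79;

   /* Total marks for all boys minus total marks for majority-group boys. */
   suppressed_score = boys_n * boys_mean
                      - majority_boys_n * majority_boys_mean;
   put suppressed_score= 8.2;
run;
```

The log shows the same 29.94 as the hand calculation. The suppressed cell can be rebuilt from two means and two counts, which is why each published figure needs to be assessed alongside the others.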

So, we must consider the overall set of available data which relates to the individual subjects of summary data, and think of how it can be combined to ‘attack’ the anonymisation we have applied when choosing what information to publish. But more generally, how do you address both this and your wider governance responsibilities, and publish data responsibly?

The UK Anonymisation Network’s anonymization framework

Just over a month ago, I attended a workshop titled “Save the Titanic: Hands-on anonymisation and risk control of publishing open data”. Several things presented in the workshop were so interesting that I thought they were worth sharing here.


The workshop was held at the Open Data Institute (ODI), on behalf of the UK Anonymisation Network (http://ukanon.net/, which of course uses the British spelling of anonymization), and was led by Ulrich Atz (Twitter: @statshero) of the ODI. He presented anonymization as a process by which, if you follow certain steps in your organisation, you can help maintain anonymity for individuals even when you handle and publish potentially sensitive data about them.

The British Information Commissioner’s Office (ICO) offers this definition of anonymization:

Anonymisation* is the process of turning data into a form which does not identify individuals and where identification is not likely to take place. This allows for a much wider use of the information.

*British spelling

The UKAN offers this definition:

When a dataset is anonymised, the identifiers are removed, obscured, aggregated or altered to prevent identification. The term “identifiers” is often misunderstood to simply mean formal identifiers such as the name, address or, for example, the [National Health Service patient] number. But, identifiers could in principle include any piece of information; what is identifying will depend on context. For instance, if a group of individuals is known to contain only one woman, then the gender will be identifying for the woman (and gender is not a typical identifier). Identifiers can also be constructed out of combinations of attributes (for example, consider a “sixteen year old widow” or a “15 year old male University Student” or a “female Bangladeshi bank manager living in Thurso”). [adapted]


Creative Commons BY-SA license

The information taken from the workshop is presented here with permission. The workshop content is licensed under a Creative Commons Attribution-ShareAlike license (CC BY-SA, see https://creativecommons.org/licenses/), which “…lets others remix, tweak, and build upon your work even for commercial purposes, as long as they credit you and license their new creations under the identical terms.” Under the terms of that license, the full content of this blog post (but only this post) may also be shared and used under the same terms.

(If you’re curious about the title of the workshop, it used a sample dataset containing personal information about victims of the Titanic disaster in 1912 to illustrate various anonymization principles – this data is publicly available).

When thinking about how closed or open data is, Ulrich Atz talked about having a spectrum where, from most closed to most open, example data could be:

  1. Your thoughts [Most closed]
  2. National security data
  3. Commercially sensitive data
  4. Personal finances
  5. Combined/aggregate health data
  6. Bus timetables
  7. The value of π [Most open]

In the workshop, Ulrich also presented 10 key parts of an anonymization decision-making framework, which I felt were worth sharing. This list was developed by the UKAN to assist organisations in minimising the risk of identification in data they publish. The first sentence of each item is taken directly from this framework; the commentary following it is based on my notes from the workshop:

  1. Know your data and its origins. You must analyse and explore your data, and be familiar with the specification defining what each field represents, the distribution of values (especially unique and low-cardinality values, or unusual value combinations between fields – such as the gender-ethnicity combination in the example above). Also understand its provenance, what you are allowed to do with it, the methodology of collection, and the culture of control for the data.
  2. Understand the use cases (& abuse cases). How might the data be used, and how might it be misused? Knowing this lets you consider how to prevent misuse.
  3. Understand the legal issues and pre-share/release governance. Have an anonymization policy, and a governance structure for management of your sensitive data. Do you currently have sufficient need to collect, and hold the data? Do you share the data with third parties, and have you defined their responsibilities?
  4. Understand the issue of consent and your ethical obligations. What would constitute fair processing – using the data for its intended purposes only, and not for other purposes? Do you have informed consent from the owners of the data – the individuals it concerns – to collect, process and use it? Can your data model reflect the different types of consent individuals may have given to the ways you may or may not use their data? Have you a process in place to support removal of an individual’s data if they withdraw their consent for you to hold it? This is a serious consideration for application or data architects and technical consultants!
  5. Know the processes you will need to go through to address the risk of re-identification. The factors that influence this risk are the sensitivity of the data (your banking or salary details, medical details, voting and the content of private communications are probably more sensitive than what was in your shopping basket or which brand of clothing you prefer), how accessible the data will be (published vs internal), how disclosive the information is (e.g. your location now, or how you voted, is more disclosive than your postal code or the fact that you voted), and who might be interested in the data and what their motivation is. Several of these factors can change in sensitivity over time.
  6. Know the processes you will need to go through to anonymize your data. The processes are usually iterative – produce your first anonymized data, then consider how you would attack it to exploit it, and revise the anonymization process.
  7. Understand the data environment. What other data is out there which may be combined with your data to facilitate a new exploit? How should you restrict access to your data, to allow access only by appropriate need? If you publish school exam statistics, does your local board of education have a legitimate need for statistics that you should not publish to the general population?
  8. Know your audience and how you will communicate. How do you license your audience to use your data? Be as open as possible about your methodology to your audience, tell your subjects about what you do with the data you hold, and about how it is useful to them. Be open and talk about the benefits of publishing the data. Take care with use of technical terms – e.g. some audiences understand what pseudonymisation is, others do not – be ready to explain your terms.
  9. Know what to do if things go wrong. Have a breach notification process – know who you need to tell in case of a breach. In some situations, informing the subject of the breach may cause more distress, and it may not be appropriate. Know how to take the data out of circulation, or restrict sharing. Face the media, maintain good, open communication about the breach. Talk with the person or group who caused the breach; senior execs need to know different things than line staff. Identify the cause and how to ensure it doesn’t happen in future. Inform your insurer if you have data breach insurance. Know your appetite for risk – how will you choose to react (and to not overreact) to a breach, depending on its severity.
  10. Know what happens next once you have shared and/or released data. Understand how your responsibility does not end once you have published it – the data environment may change, e.g. when someone else publishes correlated data, so that data which was previously anonymous is no longer safe, or when a new version of your source data becomes available. Regularly review your anonymization, and the usefulness of the data. Establish who will be the point of contact in your organisation for the published data, if the current point of contact leaves – this would ideally be someone with a job title like Data Protection Officer, who has a holistic view of issues, incidents and queries across all of your published sources of data.
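As a small illustration of point 1 (knowing which value combinations are rare in your data), a query like the one below lists combinations of attributes shared by very few individuals. It is only a sketch: it uses the sashelp.class sample table, treats sex and age as the attributes of interest, and uses an arbitrary threshold of 3, all of which are my assumptions rather than part of the UKAN framework.

```sas
/* List attribute combinations shared by fewer than 3 individuals.
   The choice of columns and the threshold are illustrative assumptions. */
proc sql;
   create table rare_combinations as
   select sex, age, count(*) as n_individuals
   from sashelp.class
   group by sex, age
   having count(*) < 3
   order by n_individuals, sex, age;
quit;

proc print data=rare_combinations noobs;
run;
```

Any row in rare_combinations is a candidate for closer review, since a person who knows the group may be able to identify the individuals behind it.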

I’d be interested to hear if you want to know more about either type of anonymization. Is there an organisation in your country which provides similar support to businesses and government, free of charge? Would you like more information about techniques that can be used to create and manage a system of hashed surrogate IDs, so that you can store data anonymously within a database, and still use it effectively for analytics, but where the anonymized values can be ‘de-anonymized’ to retrieve the original clear values e.g. so that a subject may be contacted?

Or do you have experience of creating, working with or trying to attack anonymized data ethically, that would be worth sharing?


Anonymization for analysts, report designers and information owners was published on SAS Users.

July 2, 2015
 

I’ve spent some time over the past couple of months learning more about anonymization.

This began with an interest in the technical methods used to protect sensitive personally-identifiable information in a SAS data warehouse and analytics platform we delivered for a customer. But I learned that anonymization has two rather different meanings; one in the context of data management and another in the context of data governance for reporting, sharing or publishing information.

I think both sides of the topic are interesting enough that it’s worth writing about them – I hope you will find both of them interesting too, as overall, this is a subject about which anyone who handles sensitive or personal data should know something.

To data managers, anonymization often means the technical process of obscuring the values in sensitive fields in the data, by replacing them with equivalent, but non-sensitive values which are still useful, e.g. for joining tables or for representing individuals in time-series or transactional data. In some SAS products, e.g. SAS Federation Server, this is called ‘masking’.

See the sample code below, in which the name field in the sashelp.class dataset is hashed using the SHA256 hashing algorithm (new in SAS 9.4M1; see the blog post, A fresh helping of hash: the SHA256 in SAS 9.4m1). It returns the same hashed name value whenever the same name value is passed into it, for example on another row of the data in this table, or in another table.

data class_anon (drop=name salt_key);
 length name_hashed $64 salt_key $128;
 
 set sashelp.class;
 
 * In real use, you should store the salt key somewhere other than a
   literal string in your code. The hex value must be one long string
   with no whitespace or line breaks;
 salt_key = '9D91D63B23DB252EB9AD7EFE9A6E71208C139AEC1E721635E9EF8D39DF7D5AF55E707AD54E03631F130D3F87EA62AC8B04209ED276317E12D17DE3CF9578D1B9';
 
 * SHA256 returns 32 bytes, which $HEX64. renders as 64 hex characters;
 name_hashed = put(sha256(cat(salt_key,strip(name))),$hex64.);
run;

Hashing (or masking) is different from encryption, in a couple of ways:

  1. Hashing is a one-way function. You cannot take a hashed value and ‘unhash’ it with an algorithm in the same way to produce the clear value.
  2. The length of the hashed value is fixed, and does not vary with the length of the input value. The input value can be longer, as well as shorter than the output hash value.
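The second point can be seen by hashing a very short and a much longer input and comparing the lengths of the results. This is a minimal sketch using the SHA256 function and $HEX64. format; the input values are arbitrary.

```sas
data _null_;
   length digest_short digest_long $64;

   /* A 2-character input and a 200-character input. */
   digest_short = put(sha256('Al'), $hex64.);
   digest_long  = put(sha256(repeat('x', 199)), $hex64.);

   len_short = length(digest_short);
   len_long  = length(digest_long);
   put len_short= len_long=;
   put digest_short=;
run;
```

Both lengths are reported as 64, because SHA256 always produces 32 bytes, which print as 64 hexadecimal characters.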

SAS Federation Server 4.1 now includes hash functions which can be used to anonymize data. The SAS Federation Server fact sheet illustrates this:

[Figure: data masking illustration from the SAS Federation Server fact sheet]

Where the data masking uses reversible encryption, you should make the masked data more difficult to attack (should someone externally obtain it), by storing the encryption key used to mask the data in a table with access controls that are tighter than the access control to see the SQL code behind the view. Far more people have a justifiable reason to see the code behind the view, than those who have a reason to see the encryption key which it uses when it runs. Better still, you should consider using a one-way hashing algorithm such as sha256, to create the hashed value, and maintain a lookup table somewhere secure inside your data warehouse, with the hashed values and their corresponding cleartext values. If you do this, then you should use a salt key.

About salt keys

A salt key value, which should be known only to you, should preferably be concatenated onto the clear value before it is hashed, to change the hashed values. This means that someone can only generate the same hashed value that you generated if they know the salt key too – which is why you must keep the salt key’s value secret. You should also change it regularly if you can, just as you would change a password regularly.

If you don’t add a salt key, a hacker who had gained access to the data could attack the hashed values by brute force, by providing a list of known first names, hashing each of those using the SHA256 function, and then comparing his list of hashed names against the hashed names in our data. There are a lot of people with very common first names, so without a salt key, the hacker would likely get a good hit rate in such an attack. He or she would eventually discover the ‘clear’ names of at least some rows in the anonymized dataset. But with the salt key, it doesn’t matter if the hacker knows the name of someone who is likely to be in our hashed dataset – they don’t know the salt key, so they can’t create hashed values which match ours, and can’t discover which rows of data are for people with names they already know.
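The attack described above, and the way a salt key defeats it, can be sketched in a few steps. Everything here is illustrative: the demo_salt value is a made-up placeholder, the guesses table stands in for an attacker’s list of common first names, and sashelp.class plays the part of the sensitive data.

```sas
%let demo_salt = NOT_A_REAL_SALT_KEY;  /* illustrative only */

/* The 'leaked' table holds hashed names only, with and without a salt. */
data leaked;
   set sashelp.class (keep=name age);
   length unsalted salted $64;
   unsalted = put(sha256(strip(name)), $hex64.);
   salted   = put(sha256(cats("&demo_salt", name)), $hex64.);
   drop name;
run;

/* The attacker hashes a list of guessed names, without knowing the salt. */
data guesses;
   length guess $8 guess_hashed $64;
   input guess $;
   guess_hashed = put(sha256(strip(guess)), $hex64.);
   datalines;
Alice
John
Mary
Zed
;
run;

proc sql;
   /* Unsalted hashes: guesses that occur in the data are matched. */
   create table hits_unsalted as
   select g.guess, l.age
   from leaked as l inner join guesses as g
     on l.unsalted = g.guess_hashed;

   /* Salted hashes: the same guesses find nothing. */
   create table hits_salted as
   select g.guess, l.age
   from leaked as l inner join guesses as g
     on l.salted = g.guess_hashed;
quit;
```

Alice, John and Mary all appear in sashelp.class, so hits_unsalted links those names back to their rows, while hits_salted comes back empty.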

Salt keys are important to defend lots of other types of hashed data against brute-force attacks, for example:

  • phone numbers (it’s relatively easy to generate a list of all possible phone numbers and hash them),
  • bank account numbers,
  • credit card numbers (these may be 16 digits long, but an attacker can significantly reduce the number of values he needs to test if he knows who you bank with, or understands the ISO/IEC 7812 Bank Card Number format),
  • ZIP code,
  • social security number
  • etc.

Using this technique, the fields which contain a subject’s real name, address, phone number and credit card number can each be substituted, in your main fact or dimension tables, with alternative fields, containing surrogate ID numbers or hash values, replacing the values in each of the ‘clear’ fields. So FIRST_NAME becomes FIRST_NAME_HASHED, PHONE_NUMBER becomes PHONE_NUMBER_HASHED etc. You still need to keep the clear values of each field in lookup tables in a special, more secure data store, accessible only to a very select group of logins, so that you can use them to contact the individual, or to interact with external systems, but your analysts and call center operators don’t need access to all the clear text values for sensitive data!
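A minimal sketch of that split is shown below, assuming a single sensitive column (name in sashelp.class). The table names class_analytics, name_lookup and contact_list, and the demo_salt value, are hypothetical. In real use the lookup table would be written to a separate, tightly controlled library rather than WORK.

```sas
%let demo_salt = NOT_A_REAL_SALT_KEY;  /* illustrative only; keep real salts secret */

/* Write the analytics table without the clear name, and a lookup table
   holding each hashed value alongside its clear value. */
data work.class_analytics (drop=name)
     work.name_lookup    (keep=name name_hashed);
   set sashelp.class;
   length name_hashed $64;
   name_hashed = put(sha256(cats("&demo_salt", name)), $hex64.);
run;

/* Authorised de-anonymization: recover the clear name for selected rows. */
proc sql;
   create table contact_list as
   select k.name, a.age
   from work.class_analytics as a
        inner join work.name_lookup as k
        on a.name_hashed = k.name_hashed
   where a.age >= 15;
quit;
```

Analysts work only with class_analytics. The join back to name_lookup is the one step that needs access to the secure store.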

To round off this half of today’s article, I’ve included some more sample code below. It is based closely on code used at a real client to manage hashing of values in data read in to the ETL job flows. It is part of the job which reads data in from a source flat file. I’ve added a little to the beginning of the code, to provide a salt key and set up an _INPUT macro variable which is populated by DI Studio in the real code. I’ve also printed the salt key value to the log so you can see it – that would be a very silly thing to do in real use, where you need to keep the salt key secret. Here’s the code:

* Create a dummy salt key table. Replace this in real use. The hex value
   must be one long string with no whitespace or line breaks;
data salt_keys;
 format salt_value $128.;
 
 salt_value = '9d91d63b23db252eb9ad7efe9a6e71208c139aec1e721635e9ef8d39df7d5af55e707ad54e03631f130d3f87ea62ac8b04209ed276317e12d17de3cf9578d1b9';
 is_current = 1;
 output;
run;
%let _INPUT=salt_keys;
 
* Get the current salt key from an input table.
 This is part of a feature of the application which
 allows the salt key to be changed every few months.;
proc sql noprint;
 select SALT_VALUE into :salt_key
 from &_INPUT
 where IS_CURRENT = 1;
quit;
 
%let salt_key = %trim(&salt_key);
%put &salt_key; * DO NOT DO THIS IN REAL CODE!
                  The salt key should be kept secret!;
 
/* The %hash macro below is used in data step code later to hash
 several columns in each incoming data file, where those columns
 contain sensitive data. It creates and formats the hashed column,
 and also calculates the value for that column for
 each row of the input dataset.
 
 Usage: to create a column called PHONE_NUMBER_HASHED,
 you would just insert the line:
 %hash(column=PHONE_NUMBER);
 into a datastep which had a column called PHONE_NUMBER. */
 
%macro hash(column=);
 format &column._HASHED $64.;
 retain &column._HASHED;
 
 /* SHA256 returns 32 bytes; $HEX64. renders them as 64 hex characters. */
 &column._HASHED=put(sha256(cat("&salt_key.",strip(&column.))),$hex64.);
%mend;
 
/* Usage example for the hash macro, hashing the 'name' column
   in the sashelp.class dataset. */
data class_anon;
 set sashelp.class;
 %hash(column=name);
run;

Look for my next blog on this topic - Anonymization for Analysts


Anonymization for data managers was published on SAS Users.