Availability of Data
[Figure: Example of anonymizing a column rather than blocking it]
Highly granular controls coupled with anonymization techniques can put more data than ever at the fingertips of your analysts and data scientists (in some cases, up to 50% more).
Why is that?
Let’s start with a simple example and build up from there. Obviously, if you can’t do row- and column-level controls and are limited to only GRANTing access to tables, you are either over-sharing or under-sharing. In most cases it’s under-sharing: there are rows and columns in that table the users are allowed to see, just not all of them, so they are blocked from the table completely.
That example was obvious, but it can get a little more complex. If you have column-level controls, you can now give users access to the table while completely hiding a column from them, for example by making all of its values null. They’ve lost all utility from that column, but at least they can get to the other columns.
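To make that concrete, here is a minimal sketch of hiding a column by nulling it out, assuming a hypothetical hr_data table (the table and column names are purely illustrative):

-- Hypothetical: expose the table, but null out the salary column entirely.
create view hr_data_masked as
select
  first_name,
  last_name,
  cast(null as integer) as salary -- the column exists, but carries no data
from hr_data;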
That masked column can be made more useful, though. If you hash the values in that column instead, utility is gained because the hash is consistent: you can track and group by the values, but can’t know exactly what they are.
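For instance, a column masked by hashing might behave like this sketch over a hypothetical orders table; sha2 is a common SQL hash function, though the exact function name varies by engine:

-- Because the hash is deterministic, equal emails produce equal hashes,
-- so you can still group (and join) on the masked column.
select sha2(email, 256) as masked_email, count(*) as order_count
from orders
group by sha2(email, 256);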
But you can make that masked column even more useful! If you use something like k-anonymization instead of hashing, users can see many of the values, just not all of them, gaining almost complete utility from that column. As your anonymization techniques become more advanced, you retain more utility from the data while preserving privacy. These techniques are termed privacy enhancing technologies (PETs), and Immuta places them at your fingertips.
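To illustrate the core idea only, here is a deliberately simplified sketch over a hypothetical hr_data table with k = 5. Immuta applies this as a policy for you, and a real implementation decides which specific values to suppress so that as few as possible are lost; the query just shows the shape of the technique:

-- Simplified k-anonymization-style suppression: null out a quasi-identifier
-- when its combination with the others appears in fewer than k rows.
select
  case when count(*) over (partition by race, gender) >= 5
       then race
  end as race, -- CASE with no ELSE yields NULL for rare combinations
  gender,
  salary
from hr_data;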
This is why advanced anonymization techniques can get significantly more data into your analysts' hands.
Using k-anonymization to mask columns
While columns like first_name, last_name, email, and social security number can certainly be directly identifying, columns like gender and race seem, on the surface, as if they may not be identifying at all, but they can be. Imagine there are very few Tongan men in a data set; in fact, for the sake of this example, let’s say there’s only one. So if I know of a Tongan man in that company, I can easily run a query like this and figure out that person’s salary without using their name, email, or social security number:
select salary from [table] where race = 'Tongan' and gender = 'Male';
This is the challenge with indirect identifiers. It comes down to how much your adversary, the person trying to break privacy, knows externally, and that is unknowable to you. In this case, all they had to know was that the person was Tongan and a man (and that there happens to be only one of them in the data) to figure out their salary, which is sensitive information. Let’s also pretend the result of that query was a salary of 106072. This is called a linkage attack, and it is specifically called out in privacy regulations as something you must contend with; for example, from the GDPR:
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
Almost any useful column with many unique values is a candidate for indirectly identifying an individual, while also being an important column for your analysis. So if you completely hide every possible indirectly identifying column, your data is left useless.
You can solve this problem with PETs. Before any policy is applied, take note of two things by querying the data:
- If you only search for “Tongan” alone (no Male), there are several Tongan women, so this linkage attack no longer works:
select salary, gender from [table] where race = 'Tongan';
- There are no null values in the gender or race columns.
Now let's say you apply the k-anonymization masking policy using Immuta.
Then you run this query again to find the Tongan man's salary:
select salary from immuta_fake_hr_data where race = 'Tongan' and gender = 'Male';
You get no results.
Now you run this query ignoring the gender:
select salary, gender from immuta_fake_hr_data where race = 'Tongan';
Only the women are returned.
The linkage attack was successfully averted.
Remember, from our queries prior to the policy, the salary was 106072, so let’s run a query with that:
select race, gender from immuta_fake_hr_data where salary = 106072;
There he is! But race is suppressed (NULL), so this linkage attack will not work either. The policy was also smart enough not to suppress gender, because gender did not contribute to the attack; suppressing race alone averts it. This is the magic of k-anonymization: it provides as much utility as possible while preserving privacy, suppressing only values that appear so infrequently (in combination with other values in their row) that they could lead to a linkage attack.
Cell-level security
Cell-level security is not exactly an advanced privacy enhancing technology (PET) like the k-anonymization example above, but it does provide impressively granular control within a column for common use cases.
What is cell-level security?
If you have values in a column that should sometimes be masked, but not always, that is masking at the cell level, the intersection of a row with a column. Whether a given cell should be masked is driven by some other value (or set of values) in the same row as that cell (or in a joined row from another table).
For example, suppose you want to mask credit card numbers, but only when the transaction amount is greater than $500. This allows you to drive masking in a highly granular manner based on other data in your tables.
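A rough sketch of that logic in plain SQL might look like the following, assuming a hypothetical transactions table with credit_card_number and transaction_amount columns (Immuta would express this as a policy rather than a handwritten query):

-- Mask the credit card number only when the amount in the same row
-- exceeds $500; otherwise show it in the clear.
select
  case when transaction_amount > 500
       then 'XXXX-XXXX-XXXX-' || right(credit_card_number, 4)
       else credit_card_number
  end as credit_card_number,
  transaction_amount
from transactions;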
This technique is also possible with Immuta, and you can leverage tags on columns to determine which other column in the row should drive whether the cell in question is masked, providing further scalability.