Is Anonymous Data Really Anonymous?
Many services anonymize data: Google, online banking services, home DNA testing companies … the list goes on. For years, users have been repeatedly assured that once anonymized, our data can’t be linked back to us ever again.
Whether anonymization is done by removing identifying fields, through encryption, or through data masking, the theory was that this data could never be re-identified. Sadly, it seems this isn’t quite as true as we’d like it to be, and researchers have been shedding disconcerting light on the subject ever since.
The results of a joint research project between researchers at Imperial College London and the Université Catholique de Louvain suggest that none of the current methods of data anonymization can protect large sets of data once they’ve been released into the public domain. This isn’t exactly news: as far back as 2015, articles were popping up questioning the nature of anonymized data and its potential to be re-identified.
This may all sound like a bunch of techie nonsense, but one of those four-year-old articles hits on exactly why you should be worried and how the lack of anonymity in anonymous data affects you. According to an article published in Science in January 2015, individuals in a phone data set with basic anonymization could be re-identified using just four pieces of outside information, which could be anything from a tweet to a movie review.
Because most people use several connected devices and services during a day, each of us is constantly sending information out into the ether, and this data has the potential to re-identify anonymized records. If a hacker gains access to your bank statements, for example, they can gather additional information that could lead them to re-identify anonymized credit card details, taking them one step closer to identity theft.
Not So Anonymous Case Studies
A couple of years ago in Germany, a data scientist and journalist teamed up to prove just how easy it could be to reidentify anonymized data. Using the information offered to them by a data broker, the pair gained access to a long list of URLs and timestamps which, for most of us, is the equivalent of white noise. Pay closer attention, however, and you could gain access to an individual’s entire online life.
Data scientist Andreas Dewes points out that if a Twitter user visits their analytics page, a URL containing their Twitter username will appear in their browsing record. Once you’ve got that, it won’t take long to finalize the connection between the anonymized data and a specific individual.
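Dewes’s observation is easy to demonstrate. The short Python sketch below uses an entirely made-up browsing history and a hypothetical username; the exact shape of the analytics URL is an assumption for illustration, but the principle holds for any URL that embeds an account name:

```python
import re

# Hypothetical browsing-history entries of the kind a data broker might sell:
# bare URLs plus timestamps, with no names attached.
history = [
    ("2017-06-01T09:14:02", "https://news.example.com/article/123"),
    ("2017-06-01T09:20:45", "https://analytics.twitter.com/user/jane_doe/home"),
    ("2017-06-01T09:31:10", "https://shop.example.com/cart"),
]

# A single URL that embeds the account name deanonymizes the whole history.
# (The URL pattern here is an illustrative assumption, not Twitter's real format.)
pattern = re.compile(r"analytics\.twitter\.com/user/([^/]+)")

for timestamp, url in history:
    match = pattern.search(url)
    if match:
        print(f"This browsing history belongs to Twitter user: {match.group(1)}")
```

One self-revealing URL is enough: every other “anonymous” entry in the list now belongs to a named person.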
Another incident in Australia in 2017 revealed that reidentifying medical data from a public yet anonymized data group was surprisingly easy. According to one researcher, Dr. Culnane of the University of Melbourne, “patients can be re-identified, without decryption… [using] known information… such as medical procedures and year of birth”.
The results of the combined Université Catholique de Louvain and Imperial College research revealed that even the most sophisticated anonymization techniques were insufficient. Re-identification may seem like the digital equivalent of finding a needle in a haystack, but our online data-sharing habits have handed anyone who cares a powerful magnet that makes the needle leap out, regardless of the haystack’s dimensions.
While some of the earlier research focused on data sets with only the most basic of anonymization, these more recent efforts targeted the most advanced anonymized datasets. Nevertheless, researchers proved that 99.98% of Americans could be correctly re-identified using just 15 identifying characteristics, including freely available information, like our age, marital status, and gender.
If you’re curious about just how much personal data is out there that could be re-identified, head over to the Computational Privacy Group’s page on the Imperial College London website and test it out for yourself. From the results in the image below, you’ll see that I’m not easily re-identifiable, but only because none of the information I entered was in any way factual.
In most instances, the results will show an 80-90% re-identification probability, indicating just how easy the process is. In fact, using just four attributes, namely date of birth, gender, ZIP code, and marital status, the probability of pinpointing an individual sloshing around in a murky data soup leaps to 95%.
Anonymization, Privacy and the Law
Data collection and sharing has become increasingly problematic with the explosion of data resulting from internet usage and connected devices. Whether it’s your smartphone location or your credit card information, there’s an abundance of sensitive data out there about you, and it could only be a matter of time before someone uses that to their advantage.
New legislation and data protection laws seem to emerge daily, but just how effective are they? The European Union’s decision to embrace the General Data Protection Regulation appears to be a move in the right direction, but, according to experts, it doesn’t go far enough. The GDPR describes anonymized data as any “data rendered anonymous in such a way that the data subject is not or no longer identifiable”. Encouraging, but we’ve already seen just how flawed that theory is. More worrying, the GDPR considers anonymized data as no longer being personal, meaning it can be shared, used, and sold with no consideration of or permission from the subjects.
According to the researchers behind the latest findings, the measures being introduced on a national level, as well as those brought in with the GDPR legislation, aren’t going far enough when it comes to appreciating the danger data reidentification poses. Even the GDPR doesn’t account for the level of risk reidentification poses or the possibility of new threats emerging in the future.
As far as experts are concerned, new legislation regarding private data needs to “take into account the individual risk of re-identification and the lack of plausible deniability—even if the dataset is incomplete—as well as legally recognize the broad range of provable privacy-enhancing systems and security measures that would allow data to be used while effectively preserving people’s privacy”.
Another potential flaw in the GDPR is that it allows pseudonymization to be used in conjunction with anonymization. This all sounds extremely complicated and high-tech, but what it means is that “the data can no longer be attributed to a specific data subject without the use of additional information” and that such additional information should be held separately from any anonymized data. As the recent research from Europe suggests, however, this is far from foolproof: there are so many ways of gathering and accessing data that obtaining a couple of extra pieces of information and using them to re-identify data isn’t that difficult.
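A minimal sketch of such a linkage attack, using entirely fabricated records, shows why keeping the “additional information” separate offers limited protection once an attacker obtains a comparable data set from somewhere else:

```python
# Pseudonymized transactions: names replaced by opaque IDs, but quasi-identifiers
# (ZIP code, birth year) retained. All records are fabricated for illustration.
transactions = [
    {"pseudonym": "u1", "zip": "10001", "birth_year": 1985, "spent": 420},
    {"pseudonym": "u2", "zip": "10002", "birth_year": 1990, "spent": 75},
]

# "Additional information" from a completely separate source, e.g. a public
# profile or a voter roll, tying real names to the same quasi-identifiers.
auxiliary = [
    {"name": "Jane Doe", "zip": "10001", "birth_year": 1985},
]

# The linkage attack: join the two data sets on the shared quasi-identifiers.
matches = [
    (person["name"], t["pseudonym"])
    for t in transactions
    for person in auxiliary
    if (t["zip"], t["birth_year"]) == (person["zip"], person["birth_year"])
]
print(matches)  # [('Jane Doe', 'u1')]
```

The pseudonym itself was never decrypted or reversed; the join on shared attributes did all the work, which is exactly what the Australian medical-data researchers demonstrated.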
The Benefits of Anonymized Data
After reading so much about the pitfalls of anonymized data, it may come as a surprise to discover that it does have some benefits, just not for the subject. Anonymization means that, as we mentioned earlier, data can be shared, stored, bought, and sold freely without the subjects’ consent. This makes it far more accessible and therefore beneficial for research, be it medical or marketing.
Home DNA testing, for example, has given birth to some enormous DNA databases, all of which contain anonymized data but give researchers access to millions of DNA samples and the character traits that relate to them. Matching a strand of DNA to an individual in a database of millions is supposedly virtually impossible, meaning your genetic fingerprint is safe and can never be linked to you. At the same time, your anonymized data can be used to make important discoveries and even develop treatments for currently incurable diseases.
The downside of using anonymized data for research is that if it’s too anonymized, it’s notably less useful. The more characteristics attached to the data, the more useful it is but the more characteristics there are, the easier it is for that data to be reidentified, making it a classic catch-22 situation.
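This trade-off can be made concrete with a tiny sketch in the spirit of k-anonymity (the birth dates below are invented): generalizing a full birth date to a birth year makes each record blend into a larger group, at the cost of the precision researchers want.

```python
from collections import Counter

# Invented birth dates. Full dates make every record unique and research-rich;
# truncating them to the year groups records together but throws away detail.
birth_dates = ["1985-03-14", "1985-07-02", "1990-01-30", "1990-11-05"]

def smallest_group(values):
    """Size of the smallest group sharing a value (the 'k' in k-anonymity)."""
    return min(Counter(values).values())

print(smallest_group(birth_dates))                 # 1: everyone stands alone
print(smallest_group(d[:4] for d in birth_dates))  # 2: by year, groups of two
```

A smallest group of 1 means someone can be singled out; pushing that number up means coarsening the data, which is the catch-22 in a nutshell.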
Another more serious downside is that, because so many people, in America at least, have had their DNA tested and their samples uploaded to an anonymized data set, experts say it doesn’t actually matter whether or not you’ve been tested; either way, “you can be identified because the databases already cover such large fractions of the US [population]”.
So, anonymized or otherwise, tested or not, consent given or refused, your data is out there and there’s every possibility it could be used against you.
The Dangers of Reidentification
The major problem with reidentification is that it can be misused as it was in the Facebook scandal. Early last year, it transpired that Cambridge Analytica had harvested personal data from the social media site without the consent of its users.
The information gathered was said to include sensitive data such as a user’s public Facebook profile, location, birthday, and page likes. Combined with other legally harvested information regarding loyalty cards, favorite magazines, and other lifestyle choices, it was possible to build a psychographic profile of each person. This provided a detailed enough profile of the person to indicate “what kind of advertisement would be most effective to persuade a particular person in a particular location for some political event”.
It may sound relatively benign, but some believe private data is even more precious than financial data. The Facebook data breach was not only invasive but also exposed how easily such information could be gathered and then used to manipulate or exploit individuals.
This is just one of the many dangers of reidentification. A hacker getting access to all your financial data and stealing your identity is another one. One of the researchers involved in the latest study, Yves-Alexandre de Montjoye, has performed previous studies that indicated that anonymized credit card metadata could reidentify 90% of the subjects with the use of just four additional pieces of information. It’s like a gift to cybercriminals the world over!
But more important than the real-time dangers of reidentification is how it violates our fundamental right to privacy. While this may sound relatively harmless, according to various legal findings, people deprived of privacy experience “mental suffering”. Furthermore, some private information can harm us, either now or in the future. Imagine your reproductive choices or sexual preferences were broadcast to the world against your will!
Over and above all else, however, the real danger behind anonymization’s failures is that it gives everyone, from neighbors to your worst enemies, the potential to access your most private information without your consent.
Finding the Balance
One of the problems with anonymization is that too much renders the data virtually useless, while too little renders it easily re-identifiable. Going forward, one of the key issues for any privacy law or data protection legislation is how to balance these two demands. None of us wants to see the flow of medical information curtailed if it means reducing human suffering, and yet, at the same time, for many of us, our medical secrets are among our most valuable and sensitive.
As Australian researcher Dr. Teague asserts, “Legislating against re-identification will hide, not solve, mathematical problems, and have a chilling effect on both scientific research and wider public discourse”.
We’ve established that anonymization doesn’t work and that your private data isn’t necessarily safe simply because it’s been anonymized. The best way of dealing with the problem, then, is to throw away your smartphone and never go on the internet again. Not exactly viable, is it? However, there are some cybersecurity tools available that can at least pixelate your digital footprint and make it harder to follow, even if it can’t make it disappear entirely.
A VPN, for example, can hide your location and your IP address, so that’s two potentially identifying factors dealt with right there. Many antivirus programs offer advanced identity theft protection, while some background check services give you access to dark web information so you can see which of your secrets are in the public domain. These tools aren’t going to solve the problem, but they can at least disrupt the bigger picture and make you harder to track down, with or without your anonymized data. At the end of the day, however, the less information you divulge, the fewer identifying factors are hanging around in public places making a nuisance of themselves.
Anonymous Isn’t Private
The only instance in which anonymized data can really protect your privacy is if the following three criteria are met:
- An individual cannot be singled out
- No data points can be linked to create a more complete individual profile
- It is not possible to determine one attribute from another
Unfortunately, these criteria are rarely met, and large data brokers are doing all they can to link those points and create a more comprehensive and valuable profile.
As data has been called the new oil, it remains highly valuable and trying to keep yours safe and secure is proving ever more difficult. Nevertheless, employing a few cybersecurity measures won’t hurt and is certainly going to be easier than dumping the phone and abandoning online life altogether.
Next time you see a pop-up on a website informing you that the administrator may share anonymized data with third parties, think again before giving your consent. After all, is that website’s content worth compromising your privacy for? Anonymized or not, Big Brother is watching you, but just how much you reveal to him is still up to you.