Your data is anonymous: how we can fix dangerous assumptions at the heart of big data exchange

With greater transparency and accountability we can bust the myths to unlock the humanitarian benefits of big data sharing

Sorcha Lorimer
May 1, 2021
Image from Stefan Kellar, Pixabay

Data brokers, public bodies, and insights companies assure us that their data can be shared or traded because it’s ‘anonymous’, that it’s safe, that time-consuming GDPR compliance doesn’t apply. The open data movement advocates greater disclosure and use of public data. Sourcing, and sometimes scraping, publicly available data is a cornerstone of many big tech and AI models, powering innovation.

But there are major security and legal flaws at the heart of many assumptions which underpin data access — whether in the open data context or in commercial data exchange. By understanding these blind spots and addressing the gaps, we can unlock societal benefits.

In this article I’ll explore the reality behind those assumptions, look at some of the bear traps associated with data anonymisation, touch on the relevance of the mosaic effect, and highlight the phenomenal potential of data sharing when good governance is in place.

Two big data myths

Let’s first turn our attention to two myths being perpetuated in big data sharing and the wider data economy.

  1. Myth: ‘Data described as anonymous always is.’ Reality: while anonymisation is used correctly by many data leaders and governance experts, it is sometimes misused in the industry. At times this is intentional, to avoid compliance obligations; at other times it is unintentional, because privacy is a sophisticated domain.
  2. Myth: ‘Open or anonymised data can be used and combined freely.’ Reality: datasets which might be safe when used or kept in isolation can pose risks (and offer rewards) when combined or leaked.

To understand why these big data myths matter for all of us, let’s first look at two critical terms: anonymisation and the ‘mosaic effect’.

1. Anonymous, pseudonymous — what’s the difference?

Anonymisation is a term which some in the data industry use when the data is in fact pseudonymised or de-identified. The difference between the terms is crucial, as personal data that has been anonymised is not subject to the UK GDPR.

Organisations frequently refer to personal data sets as having been ‘anonymised’ when, in fact, this is not the case. (Source: The Information Commissioner’s Office)

This video helps explain the important difference between anonymising and pseudonymising data and the critical point about reversibility:

Video courtesy of Comforte
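To make the difference concrete, here is a minimal, illustrative sketch in Python. The records, field names and techniques are hypothetical, chosen only to show that pseudonymisation keeps a route back to the individual while anonymisation (here, simple aggregation) does not:

```python
# Hypothetical records for illustration only.
import hashlib

records = [
    {"name": "Alice Smith", "postcode": "EH1 2AB", "condition": "asthma"},
    {"name": "Bob Jones", "postcode": "EH1 3CD", "condition": "asthma"},
]

# Pseudonymisation: the identifier is replaced with a token, but a
# mapping (the 'additional information') allows re-identification,
# so this is still personal data under the GDPR.
key_map = {}
pseudonymised = []
for r in records:
    token = hashlib.sha256(r["name"].encode()).hexdigest()[:10]
    key_map[token] = r["name"]
    pseudonymised.append({"id": token, "condition": r["condition"]})

# Anonymisation (one approach among several): aggregate to counts so
# that no individual-level record, and no route back to a person, remains.
counts = {}
for r in records:
    counts[r["condition"]] = counts.get(r["condition"], 0) + 1

print(pseudonymised)  # reversible via key_map
print(counts)         # {'asthma': 2}, irreversible
```

The key point is reversibility: so long as anyone holds the means to map tokens back to people, the data has not been anonymised.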

What’s the risk of confusing the terms?

If data is presented as, or assumed to be, anonymised when the technique used in fact meets a lower standard, compliance with the regulations may be bypassed and residual privacy or confidentiality risk left unmitigated.

When using big data sets, data wrongly assumed to be properly anonymised exposes the data controller to non-compliance risk and society at large to privacy intrusions, which is why we should be diligent about terminology. Privacy professionals may appear to get hung up on semantics, but there are often important legal, ethical and security implications behind what might seem like pedantry.

Now let’s look at another key, but lesser-known, term.

2. What is ‘the mosaic effect’?

Remember those hidden 3D images or ‘stereograms’ popular in the 90s? With patience, an image reveals itself from the patterns.

Stereogram

While the technique bears no real relation, the idea helps frame the mosaic effect in data, a concept derived from the mosaic theory of intelligence gathering: disparate pieces of data combine to reveal patterns, providing greater utility or risk depending on the perspective or situation.

Thus information deemed safe in isolation (and subject to disclosure control and so on) may result in a privacy breach when combined, for example via a data linkage attack. In other words, combined information can reveal data which could lead to harms, or uncover useful insights.

The myth, or over-simplification, is that metadata (for example transaction or location data) can simply be re-used and combined without concern for compliance or privacy risk, and this is frequent practice. The reality is that ‘anonymised’ big data sets are subject to de-anonymisation attacks and can expose privacy and confidentiality risks when combined with similar datasets, due to this mosaic effect.

87% of the US population can be uniquely identified by the combination of their gender, date of birth and zip code alone. (Source: Latanya Sweeney).
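To see how little it takes, here is a toy linkage attack in Python, in the spirit of Sweeney’s finding. All names and values are fabricated; the point is only that joining a ‘de-identified’ dataset with a public one on shared quasi-identifiers can re-identify individuals:

```python
# Fabricated data for illustration only.
import pandas as pd

# A "de-identified" medical dataset: names removed, but the
# quasi-identifiers gender, date of birth and zip code remain.
medical = pd.DataFrame({
    "gender": ["F", "M"],
    "dob": ["1965-03-02", "1971-11-30"],
    "zip": ["02138", "02139"],
    "diagnosis": ["diabetes", "hypertension"],
})

# A public record (e.g. a voter roll) carrying the same fields
# alongside names.
voter_roll = pd.DataFrame({
    "name": ["Jane Doe", "John Roe"],
    "gender": ["F", "M"],
    "dob": ["1965-03-02", "1971-11-30"],
    "zip": ["02138", "02139"],
})

# Where the quasi-identifier combination is unique, a simple join
# re-attaches names to diagnoses.
reidentified = medical.merge(voter_roll, on=["gender", "dob", "zip"])
print(reidentified[["name", "diagnosis"]])
```

This is essentially how Sweeney famously re-identified a state governor’s medical record by combining ‘anonymous’ hospital data with a purchased voter list.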

Consider this effect in the context of data sharing at scale during the pandemic: DeepMind (owned by Google), for example, was given access to one million ‘anonymised’ eye scans as part of a research partnership with the NHS, which gives rise to wider questions on privacy, commercial access to public information, and ethics. Another case, in which doctors’ identities were decrypted in an open dataset of Australian medical billing records, illustrates the re-identification risk in practice.

Better together?

As well as risks, there are huge potential benefits of combining data.

The best things in life happen when people come together, but in the last year we’ve been safer apart. Isolation, shielding and lockdowns have saved countless lives in the face of a global pandemic, but the cost of the loss of human connection is one we’re still counting.

This is also true of data. ‘Good’ data sharing and access can enable great insight which can save lives. Research derived from combined public and even private data sets could uncover early detection signals in public health, climate change, economic indicators and beyond, arming our leaders to design interventions and build greater resilience for the future in the face of global humanitarian crises.

“Open data advocates celebrate the potential for widely shared datasets to be combined — or mosaicked — with other datasets to reveal new information for the good of society. This includes everything from exposing gender pay gaps to uncovering government corruption.” (Source)

Conclusion

As I have explored in this article, combining data is a double-edged sword with societal benefits and risks; those handling big data sets must be trusted and must demonstrate accountability.

This also goes for organisations we, as citizens, trust as custodians of our data — whether that’s public bodies, big tech or corporations. How do they protect it? And when it’s anonymised, is that done robustly?

And when we look at how tech companies are training AI models we need to ask: what is the legal source of that data? Is it really open? Is it being scraped? What’s the legal basis and how are risks being mitigated? How are ethical threats, such as bias in the datasets, being managed?

So what can we as individuals, or leaders taking decisions to share data, or regulators do to drive greater accountability and ensure better data governance?

Here are three golden rules to guide data sharing projects, cutting through the problematic assumptions:

  1. Show, not tell. This premise sits at the heart of compliance. When working with data partners or sub-processors, or as a data subject, ask for evidence of how safeguards like anonymisation are undertaken and clarify risk mitigation strategies. This should be apparent through important processes like Data Protection Impact Assessments (DPIAs), or can be exposed by Freedom of Information requests. It should be documented, not secretive.
  2. Ask why. Data minimisation sits at the heart of good governance. Initiatives should be purpose-led, not data-led: understand why data needs to be shared, why all the data points need to be included, what the data will be used for, and how the impacts of the work will be measured.
  3. Trusted, transparent data partnerships. Who’s processing, sharing, providing or undertaking research? Are they trusted to do what they say they will when it comes to security and risk management? Do they have the capability to govern data and produce credible insights?

The data supply which powers our global economy, feeds AI applications and enables analytics is a complex and opaque web which has significant legal, technical and ethical pitfalls and barriers.

While it’s essential that data keeps flowing, we have a long road towards good, transparent global data usage underpinned by trusted custodians of data and information. Building capability in this strategically significant space requires more than investment in infrastructure; it also requires that we invest in:

  • data literacy to help people, leaders and those managing data understand sophisticated concepts and break through myths or assumptions
  • privacy-enhancing technologies like differential privacy, which can help us better manage phenomena like the mosaic effect (a minimal sketch follows this list)
  • data governance across people, technology and process.
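As a flavour of what privacy-enhancing technologies offer, here is a minimal sketch of the Laplace mechanism, a basic building block of differential privacy. The count and epsilon values are hypothetical; the idea is that noise calibrated to the query’s sensitivity masks any one individual’s contribution to a released statistic:

```python
# Minimal Laplace-mechanism sketch; values are illustrative only.
import numpy as np

rng = np.random.default_rng()

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise scaled to sensitivity/epsilon."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g. publishing how many people in a shared dataset have a condition:
print(dp_count(true_count=412, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the released figure stays useful in aggregate while blunting linkage attacks like the one sketched earlier.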

I am the Founder of Trace, where we specialise in governance of sensitive data, good data sharing, and privacy risk mitigation strategies and solutions.
