This is personal. I’m a know-it-all nerd. When NHS Director for Patients and Information Tim Kelsey said of pseudonymous care.data, “No one who uses this data will know who you are”, there were two possibilities: either he had just massively misled the BBC Radio 4 audience, or I had misunderstood something. And for me at my worst, either of those is equally bad.
So I set out to find out which it was. I wrote a piece, “Your NHS data is completely anonymous – until it isn’t”, explaining how I understood things: that even pseudonymous care.data runs a high risk of re-identification. The piece got published, tweeted hundreds of times, and read thousands of times, but its central argument was never challenged. Daniel Barth-Jones educated me on low-probability re-identification scenarios (see here for an introduction), but no-one told me that all my re-identification scenarios fell into that category. Some people said I’d failed to highlight the benefits of care.data – but most of the potential benefits are beyond dispute.
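To make the re-identification risk concrete, here is a minimal sketch of a linkage attack on pseudonymised records. All names, pseudonyms, fields, and values below are invented for illustration – the actual care.data and HES field layouts differed – but the mechanism is the standard one: a pseudonym replaces the direct identifier, yet the quasi-identifiers left in the record (postcode district, birth year, sex) can be matched against an external dataset that carries names.

```python
# Hypothetical "pseudonymised" health records: the NHS number has been
# replaced by a pseudonym, but quasi-identifiers remain in the clear.
health_records = [
    {"pseudonym": "a91f3", "postcode": "BS8", "birth_year": 1972,
     "sex": "F", "diagnosis": "diabetes"},
    {"pseudonym": "c07bd", "postcode": "BS8", "birth_year": 1985,
     "sex": "M", "diagnosis": "asthma"},
]

# A hypothetical external dataset (electoral roll, marketing list, ...)
# that carries names alongside the same quasi-identifiers.
public_records = [
    {"name": "Jane Doe", "postcode": "BS8", "birth_year": 1972, "sex": "F"},
    {"name": "John Roe", "postcode": "BS8", "birth_year": 1985, "sex": "M"},
]

def reidentify(health, public):
    """Link records on shared quasi-identifiers.

    Whenever a health record matches exactly one external record,
    the pseudonym is effectively broken: we know whose record it is.
    """
    keys = ("postcode", "birth_year", "sex")
    linked = {}
    for h in health:
        matches = [p for p in public
                   if all(p[k] == h[k] for k in keys)]
        if len(matches) == 1:  # unique match -> re-identified
            linked[h["pseudonym"]] = matches[0]["name"]
    return linked

print(reidentify(health_records, public_records))
# {'a91f3': 'Jane Doe', 'c07bd': 'John Roe'}
```

The point of the sketch is that no pseudonym ever needs to be reversed: uniqueness of the quasi-identifier combination is enough, and real records carry far richer combinations (admission dates, full postcodes) than this toy example.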
This happened during a period of lively Twitter discussion about what exactly was going to happen in the care.data programme. There was so much incomplete and contradictory information around – some of the things I heard then, from people who should know, are still being officially contradicted now, two months later. Yes, I included “@tkelsey1” in a few tweets advertising my piece, and repeated this a few times when I realised his claims of anonymity were spreading into some newspaper (and all BBC) coverage of care.data. Many inquisitive tweets on care.data, by me and others, already had “@tkelsey1 @geraintlewis” added to them, and Geraint Lewis (NHS England’s Chief Data Officer) often did engage. At some point I found that Tim Kelsey had blocked me on Twitter.
What? Blocked on Twitter, when I’d been using a “corporate”, “professional” account? I thought I had got flamewars out of my system on rec.music.misc in the early 1990s. I’d only been on Twitter for a few months, and I was disappointed with myself for failing to avoid a beginner’s mistake in this new internet context. That is, until I found out there were a few others, some no more aggressive than me, who had also been blocked by Kelsey. Then I also realised: it’s not personal.
The next realisation hit me sometime during the week when all the HES data-sharing scandals came out. Someone mis-briefed junior minister Jane Ellison, leading her to mislead parliament by saying that the SIAS insurance HES data had been “green” rather than “amber” pseudonymised. Whoever briefed her: was it ignorance, incompetence, or a calculated risk? Twitter found out instantaneously, and Jane Ellison issued a correction within a few days.
The other HES data leak stories were all about data companies: making HES data easier to manipulate, available in the cloud, on maps, and for social marketing. One of the best-known data analysis companies in the country was of course Dr Foster, which the same Tim Kelsey co-founded. Given that experience, would he have known about the re-identifiability of pseudonymised data? It seems almost insulting to suggest that he wouldn’t.
With the dubious HES sales, it appeared that it was pseudonymised data in particular that had escaped all scrutiny and detailed approval from HSCIC and its predecessor, the NHSIC – even though both have solid independent advisory groups in place for advice. There must have been a line to justify this. Well, here is my guess at that justification:
“It’s not personal”.
If the argument that pseudonymised information is anonymous can be sustained, then pseudonymised information cannot possibly be “personal information”. That removes it from the protection of the Data Protection Act, which means no-one who holds or receives the data incurs any obligation towards the data subjects. It allows them to process the data in any way they like, and to take automated decisions on the basis of that processing that affect the data subjects – to mention just a few things which the DPA severely constrains. Avoiding the DPA thus makes the data much more powerful, and cheaper to deal with in financial and administrative terms.
This is an argument that Tim Kelsey cannot afford to lose on behalf of HSCIC. “Customers” of HSCIC, commercial or not, will want cheap, no-strings data – even when they have no intention of abusing it by re-identifying first. The corporate risk register for HSCIC lists “failure to secure the full amount of budgeted income” as one of the highest-impact risks, with medium exposure. If this is considered at all variable, it can hardly relate to the income HSCIC gets from central government – more likely it relates to selling data. HSCIC needs its care.data customers.
I don’t think the argument is sustainable. Ben Goldacre appears to agree:
mm. given what we know about reidentification risk on pseudonym data, not sure the amber/red distinction still valid http://t.co/7Xaz1QQ0dR
— ben goldacre (@bengoldacre) February 25, 2014
So, no matter how loudly Tim Kelsey shouts that it isn’t personal, or even when HSCIC redefines anonymity “by virtue of the right controls” (page 19), we should keep insisting that all non-“green” care.data is personal, and ensure it gets the protection, through the DPA and its enforcers, that it warrants. I have heard no indication that the new European Data Protection Regulation will make things any worse in this respect.
Sure, blocking me on Twitter wasn’t personal. Tim Kelsey may have felt a bit nagged by my tweets, but I was mainly nagging about an uncomfortable truth.
UPDATE (25 March 2014): FoI requests by Phil Booth and Neil Bhatia led to HSCIC responses that confirmed as fact much of the speculation on care.data risks in this and my previous posts. The essence of this blog post was published at The Conversation. Possibly also of interest: slides of a talk I gave on care.data at the University of Bristol crypto group.