The Intended Interpretation of RIPA, and new powers

I am not a lawyer. I have had to cover RIPA in lectures to Computer Science students on laws that matter to them, so I had experienced its impenetrability and have tried to make some sense of it. Earlier this year, George Danezis of UCL (at a Law Society debate on Surveillance) provided what seemed to me like a plausible interpretation for RIPA: that rather than clumsily dealing with interception of electronic communications from a telephone age perspective, it was a document looking very much forward to an internet-driven world, based on Home Office views of future surveillance opportunities.

Yesterday I found myself in the enviable position of being able to discuss (on Twitter) the highly topical “DRIP” update to RIPA with a real politician – moreover, a knowledgeable one who had previously expressed sensible views on the world post Snowden, namely the LibDem MP Dr Julian Huppert. He had just been asking the Home Secretary questions on the topic in the Home Affairs Committee, which Mrs May had mostly managed to avoid answering. There had also been an ongoing debate between Huppert and others on Twitter, centring on whether clauses 4 and 5 of DRIP were adding new powers to RIPA. Huppert thought not, and he engaged with my failure to understand why he thought so. The exchange ended with him saying “existing law is unclear; was interpreted one way, now differently. This preserves intended interpretation”; I queried whose interpretation and when, and he said “HO & was accepted by CSPs I’ve spoken to. We referred to this as understanding in report on draft Comms data bill”, explaining later that the last reference was to paragraph 231 onwards of the 2012/13 Joint Committee Report on the Draft Communications Data Bill (a.k.a. the Snoopers’ Charter) (“HO” is Home Office, “CSP” is Communications Service Provider).

I needed a few hours to mull this over. I think I am clear now. Looking at the phrase “intended interpretation”, this is subjective. Both the intention and the interpretation are owned by someone — and I would argue it’s not the Home Office in either case. They may have had the original intention for RIPA by proposing it, but [in my naive understanding of politics] the ultimate responsibility for turning it into law rests with the parliament that agreed to it – no matter how much we may argue now that they didn’t understand what they were agreeing to. As for the interpretation, it is my [again: naive] understanding that it is the courts’ job to interpret laws, not the Home Office’s. So my disagreement with Huppert on whether DRIP (clauses 4 and 5) introduces new powers into RIPA rests on this: he accepts the Home Office’s claim that they own both the intention and the interpretation of RIPA, and I don’t.

As for the “emergency”… It sounds like overseas CSPs had so far been happy to accede to targeted interception requests from UK authorities. It’s amazing that Theresa May refused to admit that yesterday in the Home Affairs Committee when it is clearly stated in the Joint Committee report anyway. Presumably the CSPs had wanted to be seen as “good citizens” (or even pretending they had “nothing to hide”!), and they would have relied on the lack of transparency around this to do it at no expense to their customer relationships. Snowden has changed all that: many CSPs are now publishing the numbers of interception requests received and accepted, and even the RIPA oversight has become a little more transparent. In this way, it becomes visible when overseas CSPs accept UK interception requests even when the legal basis for it appears dubious, and that makes the “gentlemen’s agreement” much less acceptable for the CSPs. That this was a risk was already noted in the Snoopers’ Charter report mentioned above, from 2013, so it is not by itself an emergency now.

I reckon the emergency was an overseas CSP threatening to withhold collaboration. From the combination of clauses 4 and 5, it is likely one whose communication methods are not easily covered by the existing RIPA definition of “telecommunications service”. Maybe it is one that is being used in a particularly sensitive and urgent context. From the fact that there are no amendments on the table today to drop clauses 4 and 5 from DRIP, it seems clear there is a very broad consensus behind closed doors that together this really forms an emergency. Is it adding new powers to RIPA? I think it is. Does it matter? The non-politician in me says they could at least be more honest about it, even if it turns out to be the right thing to do.

I am still not a lawyer, so take this all with a lot of salt. I am also not sharing my inexpert views on whether DRIP does or does not address the consequences of the ECJ judgement striking down the Data Retention Directive. If you want legal opinions on DRIP, look at these ones:

“You retweet us a lot”

I recently attended an event with at least five people there whom I “knew” from Twitter only. Of course I made an effort to have a chat or at least shake hands with all of them. One in the latter category said “Oh yeah, you retweet us a lot”.

So what does that mean? Now of course I could just ask him, even now, what he’d meant by that, but that’s really not the point. It’s just such an interesting ambiguous statement on the boundary between the web and the world. Does it mean “I’m so glad you share and distribute the views of my worthy organisation?” Or is it “I’ve noticed you on Twitter but don’t follow you as you don’t generate interesting original content?” Or worse, “I’m getting a bit tired of the endless notifications ‘@CyberSecKent has retweeted you’, it’s almost like stalking?”

Yes I think it was probably the first one, too. Still …

Google isn’t just a database, and neither is care.data: a software engineering perspective

Extended version of “NHS must think like Google to make data project work” on The Conversation.


The Google search engine has become a way for people to access sensitive and personal information, and as a consequence it has become more than just a resource: it has had to evolve to address the legal (and maybe even ethical) consequences of its potential. It would make sense to do so from the start for the proposed overarching medical database in the UK, “care.data”.

Google

Google took over as the dominant search engine around 2001, providing a simple service. It didn’t know who you were (you couldn’t even log in), and so it also had no record of what you had searched for previously. It just seemed to find more search results than the previous market leader Altavista, and it also appeared to present them in a useful order – with the rapid growth of the web, that ordering has been essential ever since.

Interfaces, stateful and stateless

In software engineering, an interface is a simplified view of a piece of software that concentrates only on what kinds of values go in and what outputs come out – without looking at the machinery that relates them. The essence of the Google interface, then and now, consists of only three operations: entering a search term, navigating through the search outcomes, and following a link. Behind the scenes is a large database of information about web pages, which changes over time as Google notices changes in the web.

In considering interfaces, it is important to look beyond a single interaction to interactions over time. The old interface was what would be called stateless, in that the results of a search query would not really depend on your previous Google interactions. To be more precise, let’s call it locally stateless and globally stateful, as the outcomes would still depend on others’ searches and followed links, which form an important factor in determining “relevance” in the Google ranking algorithm. The interface was also monitored by Google for other reasons: unlike now, in those days there was little “internet news” in the media at all, and Google’s hit parade of most popular search terms (often led by Britney Spears) was the monthly highlight.
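To make the distinction concrete, here is a minimal sketch in Python (the names and the crude click-count ranking are my own invention, purely for illustration, not Google’s design): three operations, no per-user state, but a shared click count that feeds the ranking.

```python
from collections import defaultdict

class SearchEngine:
    """Toy illustration only: three operations, no per-user state."""

    def __init__(self, index):
        # index: dict mapping a search term to a list of candidate URLs
        self.index = index
        # global state: click counts shared by all users, standing in
        # for whatever "relevance" signal the real ranking uses
        self.clicks = defaultdict(int)

    def search(self, term):
        """Operation 1: enter a search term. The answer depends only on
        the term and on global state, never on the individual user."""
        results = self.index.get(term, [])
        return sorted(results, key=lambda url: -self.clicks[url])

    def page(self, results, n, per_page=10):
        """Operation 2: navigate through the search outcomes."""
        return results[n * per_page:(n + 1) * per_page]

    def follow(self, url):
        """Operation 3: follow a link. This updates global state, so
        everybody's future rankings shift slightly."""
        self.clicks[url] += 1
        return url
```

Two users issuing the same query get the same results back (locally stateless), but every followed link nudges the shared click counts, and with them everyone’s future rankings (globally stateful).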

Google’s “database”

Behind the interface is a database, which Google does not make available to customers of its search facility. There are several reasons for that: it is too large and too impractical to be copied; there are likely clever ideas in the organisation and structuring of this database that Google does not wish to share; and any copy would just be a snapshot that would go quickly out of date, as Google’s “web crawlers” are constantly looking at the WWW to discover new and disappeared webpages. It is also much easier for Google to control users’ use of the interface than their use of the database itself – more on that later. In fact, Google does have multiple copies (“mirrors”) of the database, to ensure high availability of its services across the globe. It is in full control of these copies, and unlike with many other distributed databases, it does not affect Google’s service if the copies get a little out of sync.

Just a database

The most elementary notion of a database as used within an organisation has a rich interface, typically a query language like SQL, used to generate reports or answer questions, but that interface is stateless. Its use is not likely to be controlled or monitored, and the results of queries are not dependent on past queries or influenced by other external information. In this situation, the database and its interface are almost inseparable. They form just a resource, holding relevant data. Such a database is perfectly sensible within an organisation, as long as it contains no sensitive information, and it is not expected to support a more abstract, higher-level functionality beyond answering queries.
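For contrast, a sketch of the kind of stateless, uncontrolled interaction meant here, using Python’s built-in sqlite3 and a made-up table (everything below is invented for illustration):

```python
import sqlite3

# A throwaway in-memory database with an invented table, for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pages (url TEXT, words INTEGER)")
con.executemany("INSERT INTO pages VALUES (?, ?)",
                [("example.org/a", 120), ("example.org/b", 3400)])

# Stateless: the same query gives the same answer regardless of who asks
# or what was asked before; nothing is logged, filtered or ranked.
rows = con.execute("SELECT url FROM pages WHERE words > 1000").fetchall()
print(rows)   # [('example.org/b',)]
```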

Google as a higher level functionality

Those two factors pinpoint how Google as a service these days differs from the Google of old which might have been viewed as “just” a database.
For a long time now, Google hasn’t been just a service to find out which web pages contain a certain bit of text. Google quickly broadened that out to corrected spellings, similar words, and related words – acknowledging that the service is used as a starting point for finding WWW information about a topic, a person, a location, etc. In schools and universities, this means that Google is the gateway both to committing academic plagiarism and to detecting it. Even more seriously, from the legal perspective Google is now perceived as an entry point to two particular types of sensitive information: “forbidden” information and “personal” information. They have had to modify their service, and in particular their control of the interface, in response to that.

Google Search isn’t just a database

Google has been eliminating “forbidden” information from its search results from a fairly early stage, to start with in China to implement political censorship, and elsewhere to comply with copyright legislation such as the US Digital Millennium Copyright Act. For censorship in China, which ended in this form in 2010, control was exerted on both the input side (certain search terms refused) and the output side (results suppressed) of the interface. This substantially took Google search away from being “just a database”: it could not address this by simply removing webpages from its database, as that would lead to them being re-added by the updating mechanism eventually; also, links to material forbidden in one country might still be returned in another one. Thus, Google must have introduced a layer of control around its interface to make this work. In the UK, Google has implemented the Internet Watch Foundation blacklist for several years, removing sites on that list from search results. As of late 2013, it also implements a warning and blocking system for search terms that may be used to search for child porn. Measures for reducing access to terrorism related materials are also being considered but it is currently unclear whether their implementation will involve search engines.
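In outline, such a control layer might look something like the sketch below. The term and URL lists are invented and this is emphatically not Google’s code; it only illustrates checking the input side and filtering the output side of the interface while leaving the database itself untouched.

```python
# Illustrative only: a wrapper that controls the interface rather than
# editing the database behind it. All names and lists are made up.
BLOCKED_TERMS = {"some forbidden query"}          # input-side control
SUPPRESSED_URLS = {"example.org/forbidden-page"}  # output-side control


def raw_search(term):
    """Stand-in for the unmodified search over the full database."""
    return ["example.org/forbidden-page", "example.org/fine-page"]


def controlled_search(term):
    # Input side of the interface: refuse (or warn about) certain terms.
    if term.lower() in BLOCKED_TERMS:
        return {"warning": "This search term is blocked.", "results": []}

    # The database itself is queried unchanged...
    results = raw_search(term)

    # ...and the output side filters what the user gets to see, possibly
    # differently per country, without touching the underlying database.
    visible = [url for url in results if url not in SUPPRESSED_URLS]
    return {"warning": None, "results": visible}
```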

The most recent developments have concentrated on Google search acting as a gateway to personal information. The use of search engines in recruitment procedures, “doxing”, and other “internet detective” activity has become a real issue. Judges have suggested closing down news websites to stop jury members from looking up suspects, for example. Web searches can give prospective employers information that they would not even be allowed to ask for in interview procedures, such as on disability, pregnancy, or ethnicity. Because of this, lawyers have generally urged caution about the use of Google as part of job application procedures, but the data protection implications of searching for personal information on the web have generally remained underexposed.

The judgement of the European Court of Justice in May 2014 on Google Spain changed all that, forcing Google to remove links to personal information from search results in specific limited circumstances. This led to a lot of speculation as to whether and how Google could implement this judgement; however, given the variety of existing mechanisms they’d already used, and the large scale of the existing copyright related filtering operations, this may not be so hard for them after all. With new European data protection legislation on the way, there is likely to be only more recognition that searches may return personal information. In fact, by combining different search results not all of which return “personal” information, new personal information may even be generated, and if this is done systematically it certainly has legal implications. One of those is that whenever an organisation records personal information, according to data protection laws it needs to have a registered and agreed purpose for that.

Aside: Google’s other higher level functionality

Of course all this only addresses the web users’ view of Google search as a higher level functionality. From an economic perspective, Google is actually a medium for targeted advertising. In fact, the main driver for making the Google interface stateful has come from this direction: finding out the identity and interests of Google users by making them log in to a Google account, giving them free gmail and then reading it, preserving their search history, and linking all this to their other web browsing activity through tracking cookies etc. This also has legal and privacy implications, but that isn’t the point here.

Health databases

The history of Google search above highlights why it has become ever less meaningful to view it as a simple database operation: this view leaves no room to take into account how it provides a higher level functionality, or to consider the impact of potentially sensitive information contained in it. Unfortunately, the public debate in the UK about a unified database of health (and eventually: social care) data, “care.data”, has remained stuck at exactly that point.

When David Cameron announced this plan in 2011, the perspective was that the National Health Service generates a mountain of medical data, and it would be a waste not to use it for “groundbreaking research” and as a “huge magnet to pull new innovations through”. In other words, an unexplored resource. From then until well into 2014, that main storyline remained largely unchanged. The sensitivity of medical data was waved away with the reassurance that it would all be anonymous. Questions as to who would get access to the data, and what the overall purpose would be, were deflected with incidental stories about death and illness that could be avoided with “more research”.

Anonymity

The narrative on anonymity may now, finally, have been fatally undermined. Researchers had long ago established that the usefulness of such large databases lies in their rich and longitudinal character. Long and detailed stories about people’s health and treatments give deeper insight and a better chance of explanation for their medical histories. It is then unavoidable that they also, in much the same way, give a deeper insight into which person they refer to – even when the more directly identifying details are elided. Researchers into statistical disclosure control and related topics have known this for a long time: increased usefulness of data comes at the expense of weaker privacy guarantees. Rather than ignoring the privacy risks associated with the use of such rich medical data, they should be managed. In other words, the interface needs to be controlled. After the second delay of the care.data rollout, to beyond September 2014, this idea has finally come into focus.
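A toy illustration of the point, with entirely fictional records and fields: linking on a handful of quasi-identifiers is often enough to single someone out, even with names and NHS numbers removed.

```python
# Entirely fictional data, for illustration only.
pseudonymised_record = {            # what a "safe" shared record might hold
    "pseudonym": "a91f3c",
    "year_of_birth": 1972,
    "postcode_district": "CT2",
    "admissions": [("2013-02-11", "fracture"), ("2013-07-03", "asthma")],
}

public_knowledge = [                # what an attacker might already know
    {"name": "Alice", "year_of_birth": 1972, "postcode_district": "CT2",
     "known_event": "broke her arm in February 2013"},
    {"name": "Bob", "year_of_birth": 1985, "postcode_district": "ME4",
     "known_event": None},
]

# Link on quasi-identifiers: no names or NHS numbers needed.
matches = [p for p in public_knowledge
           if p["year_of_birth"] == pseudonymised_record["year_of_birth"]
           and p["postcode_district"] == pseudonymised_record["postcode_district"]]

if len(matches) == 1:
    print("Pseudonym", pseudonymised_record["pseudonym"],
          "is almost certainly", matches[0]["name"])
```

The richer and more longitudinal the shared record, the more such linkage points it offers, which is exactly why usefulness and anonymity pull in opposite directions.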

HSCIC sharing databases

However, the arms-length government organisation in charge, HSCIC, has a long history of treating medical information as a commodity to be shared freely – particularly where some anonymity excuse could be applied. Commercial companies, for example in insurance and pharmacy, have had extensive access to the hospital data in HES. There is an industry of data analytics companies, with revolving doors between them and HSCIC and NHS England, whose business is essentially to sell NHS data back to the NHS in a digested or more accessible form. Their job adverts and websites show a clear sense of entitlement towards databases such as HES. Kingsley Manning, Chair of HSCIC, had to admit to the Westminster Health Select Committee last month that he could not even say who the end users of the shared HES data were. It is clear that with the addition of GP data to this database, such an attitude is no longer tolerable.

Open-washing, and research ethics

Arguments in favour of care.data as a simple data resource that can be shared are still coming from two directions. Tim Kelsey, the NHS England director in charge, came into this role after being the government’s “transparency czar”. He and others conflate care.data sharing with the Open Data movement, where government bodies provide data for transparency, allowing the public to draw their own conclusions from it in whichever way they like. However, Open Data is typically not personally identifiable, which is what allows it to be published and shared without Data Protection Act restrictions. According to the ICO’s advice on anonymisation, the data shared by HSCIC would not be unconditionally excluded from data protection – in contradiction with HSCIC’s practice.

Medical research councils also appear to have little problem with the view of care.data as simply a data resource. From their perspective, they have established codes of practice and research ethics that are consistently applied to ensure responsible treatment of sensitive medical data. Some of the language coming from this corner, particularly the reference to “consent fetishism”, shows that for some the mechanisms of research ethics have come to substitute for their purpose. The public’s anxiety about care.data is not primarily about its use in academic research, but more about where it crosses into commercial (e.g. pharmaceutical) research, especially given the recently highlighted need to improve practical research ethics in that area.

Purpose

The purpose of the data is where data protection legislation (which requires exactly that) and the public’s unease about sharing their sensitive medical data come together. In order to develop a more detailed purpose for care.data, HSCIC drew up the “care.data addendum”, which states that no users can be excluded a priori from the system, and which lists a wide range of categories of users. These include all of HSCIC’s usual customers: research institutes and data analytics companies, as well as pharmaceutical companies, insurance companies, and thinktanks. From the public debate since it came out, it has become clear that many of those are socially unacceptable as recipients of sensitive medical data in any other than aggregated, fully anonymised form. In recognition of that, HSCIC have been disowning the care.data addendum in individual communications, while failing to come up with the promised replacement or to change the presentation in their official communications. Of the amendments to the Care Bill proposed in parliament recently, all those which aimed to firm up the care.data purpose beyond “for health purposes” (informally known as the Ronald McDonald amendment) were voted down.

Software engineering

From a software engineering perspective, purpose really should have come first. The traditional sequence of activities in software engineering is to start off establishing the requirements, then design the system, then implement it, and throughout and especially at the end check that it is indeed the system intended. If requirements are likely to change over the development time, these phases can be made to overlap to some extent. In this case, it would have been ever so helpful to start the requirements by describing some illustrative scenarios (“use cases”, or “stories”, in technical terms) and to see how those would sit with stakeholders – patients, medics, and the users of care.data. These scenarios could then have been placed in an overall context – showing how better information indeed, eventually, leads to better care. This would have allowed purpose and constraints to be established with consultation and public buy-in. As things stand, a year after the original intended roll-out, there is an ongoing argument about the requirements. The design is still being modified, for example in order to ensure that patients’ opt-out, added as an afterthought, can have the intended effect. There are persistent rumours that the postponements have been convenient due to delays in the software implementation – not surprisingly given the changes made.

Then there’s the final software engineering phase of checking that it actually is the system as expected. The most common method for this is testing – the final aspect of this is in the phased trial roll out now expected to start in the autumn. There are better methods for this, too, but that is a topic for another time.

Controlling the interface

Security engineering experience, in addition, says that security should be considered right from the start, rather than added on afterwards. For what would likely be the single most security-sensitive database in the country (not counting any databases GCHQ might deny the existence of), this holds a fortiori. HSCIC actually produced a price list for sharing care.data “at cost” before they had established the security measures necessary to protect the data and monitor others’ use of it, let alone the cost of these.
In the care.data advisory group established recently, there has been some discussion of a “fume cupboard” model for access to care.data. Some of the lessons from the Google history are put into practice there. HSCIC would not share the database, but give others controlled and monitored access to the interface. Established security mechanisms such as Role Based Access Control could play a part in ensuring that queries match a defined purpose or policy for each type of user. Existing mechanisms, including partly or fully automated ones, for detecting insider attacks could be applied for monitoring and dynamically changing access policies. This would put a wealth of modern security engineering technology at the disposal of the protection of one of the most valuable data sets ever established. Better late than never.
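A minimal sketch of the idea follows; the role names, purposes and checks are my own invention, not HSCIC’s design. Every query comes in through the interface, is checked against a role-based policy and a stated purpose, and is logged so that unusual patterns can be picked up later.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Invented policy: which kinds of query each role may run, and for what purpose.
POLICY = {
    "commissioner": {"allowed": {"aggregate"}, "purposes": {"service planning"}},
    "researcher":   {"allowed": {"aggregate", "cohort"}, "purposes": {"approved study"}},
}


def run_query(role, purpose, query_type, query):
    """Fume-cupboard style access: the data never leaves, only answers do."""
    rule = POLICY.get(role)
    if rule is None or query_type not in rule["allowed"] \
            or purpose not in rule["purposes"]:
        logging.warning("REFUSED %s/%s/%s: %s", role, purpose, query_type, query)
        raise PermissionError("Query does not match the policy for this role")

    # Every accepted query is logged, which is where automated insider-attack
    # detection and dynamic policy changes would hook in.
    logging.info("ACCEPTED %s/%s/%s: %s", role, purpose, query_type, query)
    return execute_against_database(query)


def execute_against_database(query):
    return "aggregated result only"   # stand-in for the real back end
```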

Smart gas meters, and privacy

A former colleague reported today that she had a Hive smart gas meter installed by British Gas. Having read and heard a bit about the privacy implications of smart meters for electricity (look at some of George Danezis’ work on this), I wondered how this would be for gas. I reckon the privacy implications are much less obvious and less extensive than for electricity, but there must still be interesting information being collected. Off the cuff now … (so possibly naive, but not finding anything quickly with Google …) Particularly if hot water for the taps is generated on demand or on a thermostat only, the gas usage should allow patterns of showering and bathing to be detected. Certainly the absence of that, indicating nobody’s home. That information alone is worth protecting.
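A back-of-the-envelope sketch of the point, with invented readings and an arbitrary threshold: even coarse daily gas totals make an empty house stand out.

```python
# Invented daily gas totals in kWh, purely illustrative.
daily_gas_kwh = {
    "Mon": 14.2, "Tue": 13.8, "Wed": 0.9,   # boiler barely used
    "Thu": 1.1, "Fri": 13.5, "Sat": 15.0, "Sun": 14.7,
}

BASELINE_KWH = 3.0   # arbitrary threshold: below this, nobody heated water or rooms

away_days = [day for day, kwh in daily_gas_kwh.items() if kwh < BASELINE_KWH]
print("Probably nobody home on:", away_days)   # ['Wed', 'Thu']
```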

So what does the British Gas Hive privacy policy say? Very very little indeed. It is 99% a standard privacy policy talking about how they deal with the standard personal information – not a single reference to information they obtain through your smart gas meter. There is one reference to the gas meter: notification that they may use it as an alternative way of sending you messages!

So what is the industry regulator’s view of this? Not a lot, judge for yourself. The Ofgem smart metering installation code of practice says:

  • “‘Privacy charter’ means to provide a Customer with information about what data is collected from smart meters and what the information will be used for, and sets out the rights and choices that apply to the Customer in relation to smart metering information”
  • “Data privacy and security are not in scope of the Code as these are covered under existing data protection legislation”, nevertheless:
  • “Installers have a basic knowledge and understanding (appropriate to their role) of data protection and privacy”
  • “All reasonable endeavours should be used to provide the Customer with a copy of the Privacy Charter or make the Customer aware of the Privacy Charter commitments prior to the Installation Visit” but with a rather broad disclaimer footnote: “Subject to the Privacy Charter being approved and made available”

Nothing deep here, but still an interesting privacy gap. Posted here because this is a bit long for a tweet – feel free to give reactions to @cyberseckent.

It’s not personal

This is personal. I’m a know-it-all nerd. When NHS Director for Patients and Information Tim Kelsey said about pseudonymous care.data “No one who uses this data will know who you are” there were two possibilities. Either he’d just massively misled the BBC Radio 4 audience, or I’d misunderstood something. And for me at my worst, either of those is equally bad.

So I sought to find out which it was. I wrote a piece “Your NHS data is completely anonymous – until it isn’t” explaining how I understood things: that even pseudonymous care.data runs a high risk of re-identification. The piece got published, it got tweeted 100s and read 1,000s of times, but it never got challenged on the central argument. Daniel Barth-Jones educated me on low probability re-identification scenarios (see here for intro), but no-one told me all my re-identification scenarios were in that category. Some people said I’d failed to highlight the benefits of care.data – but most of the potential benefits are beyond dispute.

This happened in a period of lively Twitter discussion on what was exactly going to happen in the care.data program. There was so much incomplete and contradictory information about – some of the things I heard then, from people who should know, are still being officially contradicted now, 2 months later. Yes I included “@tkelsey1” in a few tweets advertising my piece, and repeated this a few times when I realised his claims of anonymity were spreading into some newspaper (and all BBC) coverage of care.data. Many inquisitive tweets on care.data by me and others already got “@tkelsey1 @geraintlewis” added to them, and Geraint Lewis (NHS England Chief Data Officer) often did engage. At some point I found Tim Kelsey had blocked me on Twitter.

What? Blocked on Twitter, when I’d been using a “corporate”, “professional” account? I thought I had got flamewars out of my system on rec.music.misc in the early 1990s. I’d only been on Twitter for a few months, and I felt disappointed with myself at having failed to avoid a beginners’ mistake in this new internet context. That is, until I found out there were a few others, some no more aggressive than me, who had also been blocked by Kelsey. I then also realised: it’s not personal.

The next realisation hit me sometime during the week when all the HES data sharing scandals came out. Someone mis-briefed junior minister Jane Ellison, leading to her misleading parliament by saying that the SIAS insurance HES data had been “green” rather than “amber” pseudonymised. Whoever briefed her: ignorance, incompetence, or a calculated risk? Twitter found it out instantaneously, and Jane Ellison corrected it within a few days.

The other HES data leak stories were all about data companies: making HES data easier to manipulate, available on the cloud, on maps, and for promoting social marketing. One of the best known data analysis companies in the country was of course Dr Foster, of which the same Tim Kelsey was one of the founders. From that experience, would he have known about re-identifiability of pseudonymised data? It seems almost insulting to suggest that he wouldn’t.

The dubious HES sales made it appear that it was particularly pseudonymised data which had escaped all scrutiny and detailed approval from HSCIC and its precursor NHSIC – even though they have solid independent advisory groups in place for advice. There must have been a line to justify this. Well, here is my guess for that justification:

“It’s not personal”.

If the argument that pseudonymised information is anonymous can be sustained, then pseudonymous information cannot possibly be “personal information”. This removes it from protection through the Data Protection Act. This means no-one who holds or receives the data incurs any obligation towards the data subjects. It will allow them to process the data in any way they like, and take automated decisions on the basis of that processing that affect the data subjects – to mention a few things which are severely constrained by the DPA. Thus, avoiding the DPA makes the data much more powerful, and cheaper to deal with in financial and administrative terms.

This is an argument that Tim Kelsey cannot afford to lose on behalf of HSCIC. “Customers” of HSCIC, commercial or not, will want cheap no-strings data – even when they are not intending to abuse it by re-identifying first. The corporate risk register for HSCIC lists “failure to secure the full amount of budgeted income” as one of the highest impact risks, with a medium exposure. If this is at all considered variable, it can hardly relate to the income HSCIC gets from central government – more likely it relates to selling data. HSCIC needs its care.data customers.

I don’t think the argument is sustainable. Ben Goldacre appears to agree:

So, no matter how loudly Tim Kelsey shouts it isn’t personal, or even when HSCIC redefines anonymity “by virtue of the right controls” (page 19), we should keep insisting that all non-“green” care.data is personal, and ensure it gets the protection through the DPA and its enforcers that it warrants. I have heard no indication that the new European Data Protection Directive will make things any worse in this respect.

Sure, blocking me on Twitter wasn’t personal. Tim Kelsey may have felt a bit nagged by my tweets, but I was mainly nagging about an uncomfortable truth.

UPDATE (25 March 2014): FoI requests by Phil Booth and Neil Bhatia led to HSCIC responses that confirmed as fact much of this and my previous speculation on care.data risks. The essence of this blog post was published at The Conversation. Maybe also interesting: slides of a talk I gave on care.data at the University of Bristol crypto group.

So where are we now on care.data?

Things have moved on rather dramatically since my last blog post on the topic. care.data has been delayed by another six months, there have been some scandals about shared HES data, and several Westminster meetings, the last of which produced amendments (and for once no ministerial errors of fact?). I have now also started giving lectures on care.data (to Kent students this week, to Bristol colleagues in a few weeks’ time). So it makes sense to provide an update of my views on the issues here. This is in addition to an article I wrote last week on third-party use of the data.

In my first article, “Outdated laws put your health data in jeopardy” I described the system and listed a few worries, which I revisited a few weeks ago in a blog post. Following on from there …

The legal set-up, and weakness of the DPA: the parliamentary session on Tuesday 11 March considered and voted down an amendment to increase the penalties for abuse of the data. The contrast between 20 well-informed, critical MPs from both sides of the house discussing the issue with a few health ministers, and a mob vote of 500 MPs, is a bit shocking. See here for a sensible amendment and reasons why the government’s accepted amendment isn’t good enough. Of course, the new European data protection directive, once agreed by the Council of Ministers and effected in the UK, will allow more serious penalties than the current DPA – up to £100M or 5% of a company’s turnover.

Intelligence services: still a risk, no progress. Had a nice time talking at a Law Society debate on Mass Surveillance last week, where I did manage to drop in care.data as maybe the story that wakes up England on privacy.

Honeypot Value and security: HSCIC have declined to answer a Freedom of Information request from Julia Hippisley-Cox asking them to report on the number of past data breaches and audits. Given that even the NSA and the big tech companies have been shown unable to protect their own secrets, this worry will never go away completely.

On potential abuse by commercial companies: see my article “Time for some truth about who is feeding off our NHS data” for an overview and analysis.

Anonymity: let me write a separate post on that. I may have been naive so far.

NHS data sharing: taking stock

I have written in the last few weeks, on this blog and twice in The Conversation, on the NHS care.data sharing scheme. In terms of the “authoritative” information, the picture has become a bit clearer to me, although the information “out there” is hardly getting any clearer. Mindless accusations such as “NHS selling data to the highest bidder” are still floating about, and on the “other side” even yesterday the BBC was still reporting data was non-identifiable when it is. Whatever clarifications I have obtained have come through Twitter chats, especially with Geraint Lewis, the Chief Data Officer for NHS England, who is the best and most engaged advocate for the system. (If you follow @CyberSecKent, which in theory could be used by colleagues but in practice so far only by me, you may have seen some of the @GeraintLewis stuff on your Twitter feed.)

Time to take stock. In my first article, “Outdated laws put your health data in jeopardy” I described the system and listed a few worries.

  • The legal set-up: opt-out rather than opt-in, with subsequent reports on rather than requests for use, seems ethically dubious but legally correct. The main chink that is appearing is that the presumption that GPs’ DPA duty to inform patients about sharing has been adequately covered by a junkmail leaflet is being called into doubt. This applies a fortiori to patients with a visual or mental disability. An amazingly readable long paper on the legal background, which encompasses all the shards of legal thought that have appeared in the discussions I have seen (and much more), was pointed out to me by Prof. Julia Hippisley-Cox.
  • Intelligence services obtaining and re-identifying the data, with possible subsequent use for reputation attacks: the risk of the former in particular still stands, especially after Sir David Omand affirmed this week the official GCHQ doublethink that bulk collection of data does not equate to mass surveillance. Ironically, this was on The Day We Fight Back.
  • The Honeypot Value of such a single huge database of sensitive data is still a worry. HSCIC point at their recent cyber security inspection. They have claimed there have been no breaches of the HES precursor database (collected hospital data) in its long existence, but gloss over the large number of past NHS data breaches, and have so far not responded to requests for disclosure of recent monitoring information. Worryingly, some of the NHS response does not take into account that the new database will be so much more valuable to potential abusers that it warrants higher levels of security than anything that went before. My worry remains here: data of extremely high value, protected by people who are supposed to be top experts, has been stolen regularly in the last few years. Full medical histories in particular, once out, will never go back into the toothpaste tube.
  • On potential abuse by commercial companies the situation has become a bit clearer to me. As things stand, Bupa and others have access to the precursor HES (“the data analytics arm” of Bupa rather than their insurance business), and they will not have access to care.data. HSCIC currently state that data will not go to anyone beyond the “commissioners” (NHS England, CCGs, Public Health). I have asked about the status of the care.data addendum, which in section 5.3 clearly lists commercial companies as “additional customers”, contradicting the HSCIC position. Geraint Lewis has asked me to request clarification of this by email, but I have not received a reply yet. Overall it looks like there will still be an opportunity to reconsider the plans for sharing of “amber” data with others besides NHS and research organisations after the care.data database is established, which is good. That does not reduce the risk of the data going to parts of the NHS now which will become private later, though. Again we can think of Bupa there. Or worse: G4S or Atos.
  • Weakness of the DPA is still an issue. Even if the sharing of orange data is limited, the potential gains from abuse dwarf both the maximum fines under the Data Protection Act (£500K, no prison), and the money HSCIC assign themselves (“selling at cost”) to monitor against potential abuse. I have been probing to find out how HSCIC could be sure of discovering that the data has been abused once it has been passed on to a third party, but have had no convincing answers. Channelling all third party access through HSCIC, on a query by query basis, would be much more secure in that respect. The value of the data to us as its subjects means that it should also be worth the extra expense involved.
  • In my second The Conversation article, “Your NHS data is completely anonymous – until it isn’t”, I did not really raise any new worries. I merely articulated what I thought was common knowledge: pseudonymised (“orange”) data is re-identifiable, particularly if you have a lot of it. The remaining worry from this is that Tim Kelsey’s categorically wrong comment on this, “No one who uses this data will know who you are”, made on BBC Radio 4, is still being repeated, e.g. by the BBC. Some of the discussion around the care.data issue concerns trust. The “pro” line goes: you have been trusting the NHS all your life, do not let your distrust of politics get in the way of this great opportunity for the NHS to improve care. However, Tim Kelsey has just shown that despite the HSCIC’s best efforts to do this important job in a responsible way, they are answerable to bosses who are happy to misinform the public. That is a worry that will not go away easily, especially not after they have appointed yet another person with a huge conflict of interest.
  • In the meantime, another worry appeared: that the data of people who had opted out would still be uploaded onto the system. This has been authoritatively debunked this week by Geraint Lewis; there is a sense that this is a (helpful!) change of policy rather than a clarification. (Ross Anderson’s comments suggest care.data would be used for the NHS to pay bonuses to GPs, which would lead to an inconsistency: without the data of those who opted out, the information would not be there. GP Neil Bhatia of care-data.info says care.data is not intended for this anyway, and GPs get paid via QoF/CQRS and other submissions which are nearly all anonymised/aggregated.)
  • Randeep Ramesh in the Guardian raised the above worry (plus MP David Davis’ wonderful 5 broken noses re-identification illustration!) alongside another one: that police would use the HSCIC database to get at medical data without a warrant, in the same way and under the same conditions as they already can now through GP practices. This got the medics worried about whether HSCIC would stick to the same strict procedures that GP practices have to. The response to this was reassurance from Tim Kelsey that the police would not do so, but it would have been ever so much more reassuring if he’d said they could not do so.
  • Finally, only on this blog I sketched a nightmare scenario where people would stand out negatively (as having something to hide) through the mere fact of having opted out. That one, at least, has become a lot less likely now.
  • (edit 14 Mar: Neil Bhatia is care-data.info not medconfidential, sorry. All good people though!)

The NHS data sharing story rumbles on

I’ve found out a lot more about the NHS data sharing scheme.

First, the people in charge do actually have very solid awareness of (de-)anonymisation, but the info is hidden a bit deep in the HSCIC Privacy Impact Assessment and its supporting documents.

Second, there was a different scheme for sharing medical data, the “Summary Care Record“. People were sent opt-out forms for that, some as late as last year. It’s a different system, and it does involve medical data being shared with (e.g.) A&Es. Many GP practices turn out to be confused about this vs. care.data.

My assessment of the situation, based on what I understand now, has been written up in a piece at The Conversation. For any further links to interesting source documents please do look there. I’ve left the scenario from my previous blog post out of it.

Since I published that yesterday, some serious discussion has broken out on Twitter (especially around @GeraintLewis, NHS Chief Data Officer) concerning whether data would be sold to insurance companies or not, with apparently contradictory statements on this having come out from the NHS side. See the comments on my piece for a summary of the positions.

NHS care.data: even if you opt out …

Following on from my earlier post

If you ring an insurance company, there is every chance that at some point you will be reminded that data is liberally shared between insurance companies and other authorities in order to prevent fraud. The following scenario now suggests itself …

You opted out from having your data shared in the care.data program. In the end, despite newspaper front pages and assorted expressions of worry about privacy and accountability, you are in a tiny minority of people to have done so. Now you apply for life insurance, or maybe health insurance (in the post-NHS era we may all need to do this!). A week later you receive a letter from the insurance company: “We don’t have access to your medical data from the NHS. Unfortunately in our experience this indicates a high likelihood that you have medical circumstances that you would wish to hide from us. Because of this, we will not be able to provide you with insurance.”

You decide whether this is a likely scenario or not. In today’s Guardian piece, insurance companies were mentioned as potential buyers of the data. (Aside: doesn’t the financial dimension erode the “it’s all for our benefit” story somewhat?) The piece also reminded us that de-pseudonymisation is not only a risk in general, but very likely no problem at all for organisations who already have lots of our data – such as the insurance industry.

I’ll leave it to the game theorists to decide whether this post is arguing for or against opting out. Only at the end of writing this did it come back to me that I actually watched “The Rainmaker” last night 😉

Timing of cyber attacks: a model

Last week Axelrod and Iliev from the University of Michigan (Ann Arbor) published a paper “Timing of cyber conflict”. Akshat Rathi, science editor at The Conversation, reviewed the paper for the site, and asked me for comments. A quote from me is included in his piece.

There’s a bit more to say than I did there. The Daily Mail also covered this, and lazily presented it as a model that you just enter your data into and presto! it tells you when to perform your cyber attack. That’s not at all the case. The model asks you to guess at some probabilities, and then measure some unmeasurables including a quantification of the attack’s effect. (Makes me think of risk assessment!) There isn’t much of a mathematical model in it, really – there are variables, and a formula, and case studies in the paper, but in the case studies the variables never get values, and the formula isn’t used for anything.
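For flavour, here is a toy expected-value comparison of the kind of trade-off the paper formalises. The notation is my own illustrative shorthand, not the authors’ actual formula: with current gain G from using the cyber resource, future value V of still holding a usable resource, stealth s (the chance it survives being used) and persistence p (the chance it survives being held back), using it now only pays off roughly when

```latex
% Illustrative shorthand only, not the formula from the paper.
G + sV \;\ge\; pV
\qquad\Longleftrightarrow\qquad
G \;\ge\; (p - s)\,V
```

which at least matches the paper’s qualitative message: high persistence and low stealth both push towards waiting for higher stakes.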

That’s not at all to say it’s a useless paper. Although you can’t actually establish values for the variables, the concepts embodied in them are very useful, and it makes perfect sense to talk about them going up or down in scenarios. The case studies, such as that of Stuxnet, make for very interesting reading indeed.