Extended version of “NHS must think like Google to make data project work” on The Conversation.
The Google search engine has become a way for people to access sensitive and personal information, and as a consequence it is now more than just a resource: it has had to evolve to address the legal (and maybe even ethical) consequences of what it makes possible. The proposed overarching medical database in the UK, “care.data”, would do well to address those consequences from the start.
Google achieved market dominance as a search engine around 2001, providing a simple service. It didn’t know who you were (you couldn’t even log in), and so it had no record of what you had searched for previously. It simply seemed to find more search results than the previous market leader, AltaVista, and it appeared to present them in a more useful order – and as the web grew rapidly, that ordering only became more essential.
Interfaces, stateful and stateless
In software engineering, an interface is a simplified view of a piece of software that describes only what kinds of values it takes as input and what outputs it produces – without looking at the machinery that relates them. The essence of the Google interface, then and now, consists of only three operations: entering a search term, navigating through the search outcomes, and following a link. Behind the scenes is a large database of information about web pages, which changes over time as Google notices changes in the web.
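To make this concrete, here is a minimal sketch of such an interface in Python (the names are invented for illustration; this shows the shape of the interface, not Google’s actual code):

    from abc import ABC, abstractmethod

    class SearchInterface(ABC):
        """A simplified view of a search engine: only what goes in and what
        comes out, with no commitment to the machinery behind it."""

        @abstractmethod
        def search(self, term: str) -> list[str]:
            """Enter a search term; receive an ordered list of result links."""

        @abstractmethod
        def results_page(self, page: int) -> list[str]:
            """Navigate through the outcomes of the current search."""

        @abstractmethod
        def follow(self, link: str) -> None:
            """Follow one of the returned links."""

The database behind this interface can change arbitrarily without the interface itself changing – which is exactly the point of the abstraction.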
In considering interfaces, it is important to look beyond a single interaction to interactions over time. The old interface was what would be called stateless, in that the results of a search query did not really depend on your previous Google interactions. To be more precise, let’s call it locally stateless and globally stateful: the outcomes would still depend on others’ searches and followed links, as these form an important factor in determining “relevance” in the Google ranking algorithm. The interface was also monitored by Google for other reasons: unlike now, in those days there was little “internet news” in the media at all, and Google’s hit parade of most popular search terms (often led by Britney Spears) was a monthly highlight.
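The distinction can be illustrated with a toy ranking sketch (hypothetical, and nothing like Google’s actual algorithm): the function ignores who is asking and what they asked before, yet its output drifts as everyone’s followed links accumulate.

    from collections import Counter

    # Global state, shared across *all* users: how often each link was followed.
    global_click_counts = Counter()

    def record_followed_link(link):
        # Any user following a link feeds back into everyone's future rankings.
        global_click_counts[link] += 1

    def search(term, candidate_links):
        # Locally stateless: the result depends only on the term and the shared
        # click counts -- never on this particular user's previous queries.
        matching = [link for link in candidate_links if term in link]
        return sorted(matching, key=lambda link: -global_click_counts[link])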
Google’s “database”
Behind the interface is a database, which Google does not make available to customers of its search facility. There are several reasons for that: it is too large to be copied practically; there are likely to be clever ideas in the organisation and structuring of this database that Google does not wish to share; and any copy would just be a snapshot that would quickly go out of date, as Google’s “web crawlers” constantly scan the web to discover new and disappeared webpages. It is also much easier for Google to control users’ use of the interface than their use of the database itself – more on that later. In fact, Google does keep multiple copies (“mirrors”) of the database, to ensure high availability of its services across the globe. It is in full control of these copies, and unlike with many other distributed databases, it does not harm Google’s service if the copies get a little out of sync.
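A crude sketch of why slightly out-of-sync mirrors are harmless for this kind of service (toy code, with an invented update scheme): reads can go to any copy, and an update simply propagates when it propagates.

    import random

    class Mirror:
        # One copy of the search index; it may lag behind the others.
        def __init__(self):
            self.index = {}   # search term -> list of links

    mirrors = [Mirror() for _ in range(3)]

    def crawler_update(term, links):
        # In reality each mirror would be updated asynchronously; between
        # updates the copies disagree, making some answers slightly stale.
        for mirror in mirrors:
            mirror.index[term] = links

    def search(term):
        # Any mirror will do: a slightly out-of-date answer to a web search
        # is still a perfectly good answer.
        return random.choice(mirrors).index.get(term, [])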
Just a database
The most elementary notion of a database, as used within an organisation, comes with a rich query interface – typically a query language like SQL, used to generate reports or answer questions – but that interface is stateless. Its use is not normally controlled or monitored, and the results of queries do not depend on past queries, nor are they influenced by other external information. In this situation, the database and its interface are almost inseparable. They form just a resource, holding relevant data.
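That stateless character is visible in how such a database is queried: each query stands alone, and nothing about the caller or their history changes the answer. A minimal sketch using SQLite, with an invented table:

    import sqlite3

    # An illustrative in-memory database with a hypothetical orders table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)])

    def report(min_total):
        # Stateless: the result depends only on the data and the parameter,
        # not on who is asking or on any previous queries.
        return conn.execute(
            "SELECT customer, SUM(amount) FROM orders "
            "GROUP BY customer HAVING SUM(amount) >= ?",
            (min_total,)).fetchall()

    print(report(10.0))   # same inputs, same outputs, every time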
Such a database is perfectly sensible within an organisation, as long as it contains no sensitive information and is not expected to support a more abstract, higher-level functionality beyond answering queries.
Google as a higher-level functionality
Those two factors pinpoint how Google as a service today differs from the Google of old, which might still have been viewed as “just” a database.
Google has long ceased to be just a service for finding out which web pages contain a certain bit of text. It quickly broadened out to corrected spellings, similar words, and related words – acknowledging that the service is used as a starting point for finding web information about a topic, a person, a location, and so on. In schools and universities, this means that Google is the gateway both to committing academic plagiarism and to detecting it. Even more seriously, from the legal perspective Google is now perceived as an entry point to two particular types of sensitive information: “forbidden” information and “personal” information. It has had to modify its service, and in particular its control of the interface, in response to that.
Google Search isn’t just a database
Google has been eliminating “forbidden” information from its search results from a fairly early stage: initially in China, to implement political censorship, and elsewhere to comply with copyright legislation such as the US Digital Millennium Copyright Act. For censorship in China, which ended in this form in 2010, control was exerted on both the input side (certain search terms refused) and the output side (results suppressed) of the interface. This took Google search a substantial step away from being “just a database”: it could not address this by simply removing webpages from its database, as the updating mechanism would eventually re-add them; moreover, links to material forbidden in one country might still have to be returned in another. Thus, Google must have introduced a layer of control around its interface to make this work. In the UK, Google has implemented the Internet Watch Foundation blacklist for several years, removing sites on that list from search results. Since late 2013, it has also implemented a warning and blocking system for search terms that may be used to look for child abuse images. Measures for reducing access to terrorism-related materials are also being considered, but it is currently unclear whether their implementation will involve search engines.
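One way to picture such a layer of control is as a wrapper that filters on both sides of the interface, without touching the database itself. In this sketch the term lists, country codes and wrapped search function are all invented:

    # Hypothetical per-country control lists; the database is untouched --
    # only the interface around it is wrapped.
    BLOCKED_TERMS = {"cn": {"forbidden topic"}}
    SUPPRESSED_LINKS = {"uk": {"http://example.com/blacklisted"}}

    def controlled_search(term, country, raw_search):
        # Input-side control: refuse certain search terms outright
        # (or return a warning page instead of results).
        if term in BLOCKED_TERMS.get(country, set()):
            return []
        results = raw_search(term)
        # Output-side control: suppress individual links per jurisdiction.
        banned = SUPPRESSED_LINKS.get(country, set())
        return [link for link in results if link not in banned]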
The most recent developments have concentrated on Google search acting as a gateway to personal information. The use of search engines in recruitment procedures, “doxing”, and other “internet detective” activity has become a real issue. Judges have suggested closing down news websites to stop jury members from looking up suspects, for example. Web searches can give prospective employers information that they would not even be allowed to ask for in interviews, such as on disability, pregnancy, or ethnicity. Because of this, lawyers have generally urged caution in the use of Google as part of job application procedures, but in general the data protection implications of searching for personal information on the web have remained underexposed.
The judgement of the European Court of Justice in May 2014 on Google Spain changed all that, forcing Google to remove links to personal information from search results in specific, limited circumstances. This led to much speculation as to whether and how Google could implement the judgement; however, given the variety of control mechanisms it had already deployed, and the large scale of its existing copyright-related filtering operations, this may not be so hard for Google after all. With new European data protection legislation on the way, there is likely to be ever more recognition that searches may return personal information. In fact, by combining different search results, not all of which return “personal” information individually, new personal information may even be generated – and if this is done systematically, it certainly has legal implications. One of those is that whenever an organisation records personal information, data protection laws require it to have a registered and agreed purpose for doing so.
Aside: Google’s other higher-level functionality
Of course, all this only addresses the web users’ view of Google search as a higher-level functionality. From an economic perspective, Google is really a medium for targeted advertising. In fact, the main driver for making the Google interface stateful has come from this direction: finding out the identity and interests of Google users by making them log in to a Google account, giving them free Gmail and then reading it, preserving their search history, and linking to their other web browsing activities through tracking cookies and the like. This also has legal and privacy implications, but that isn’t the point here.
Health databases
The history of Google search above highlights why it has become ever less meaningful to view it as a simple database operation: this view leaves no room to take into account how it provides a higher-level functionality, or to consider the impact of potentially sensitive information contained in it. Unfortunately, the public debate in the UK about a unified database of health (and eventually social care) data, “care.data”, has remained stuck at exactly that point.
When David Cameron announced this plan in 2011, the perspective was that the National Health Service generates a mountain of medical data, and that it would be a waste not to use it for “groundbreaking research” and as a “huge magnet to pull new innovations through”. In other words: an unexplored resource. From then until well into 2014, that storyline remained largely unchanged. The sensitivity of medical data was waved away with the reassurance that it would all be anonymous. Questions as to who would get access to the data, and what the overall purpose would be, were deflected with incidental stories about death and illness that could be avoided with “more research”.
Anonymity
The narrative on anonymity may finally have been fatally undermined. Researchers established long ago that the usefulness of such large databases lies in their rich and longitudinal character. Long and detailed stories about people’s health and treatments give deeper insight and a better chance of explaining their medical histories. But it is then unavoidable that, in much the same way, they also give a deeper insight into which person they refer to – even when the more directly identifying details are removed. Researchers in statistical disclosure control and related fields have known this for a long time: increased usefulness of data comes at the expense of weaker privacy guarantees. Rather than ignoring the privacy risks associated with the use of such rich medical data, we should manage them. In other words, the interface needs to be controlled. After the second delay of the care.data rollout, to beyond September 2014, this idea has finally come into focus.
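A toy illustration of the trade-off, with invented records and a deliberately crude measure: once records are rich enough, removing names achieves little, because the combination of remaining attributes already singles people out.

    from collections import Counter

    # "Anonymised" records: names removed, rich detail kept.
    # (Invented data: year of birth, partial postcode, condition history.)
    records = [
        ("1975", "SW1", ("asthma",)),
        ("1975", "SW1", ("asthma",)),
        ("1962", "LE2", ("diabetes", "hip replacement", "asthma")),
    ]

    def anonymity_set_sizes(records):
        # How many records share each combination of attributes?
        # A count of 1 means the combination singles out one individual.
        return Counter(records)

    for combination, count in anonymity_set_sizes(records).items():
        print(count, combination)
    # The richest, most useful record is also the unique, identifying one.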
HSCIC sharing databases
However, the arm’s-length government organisation in charge, HSCIC, has a long history of treating medical information as a commodity to be shared freely – particularly where some anonymity excuse could be applied. Commercial companies, for example in insurance and pharmacy, have had extensive access to the hospital data in HES (Hospital Episode Statistics). There is an industry of data analytics companies, with revolving doors between them, HSCIC and NHS England, whose business is essentially to sell NHS data back to the NHS in a digested or more accessible form. Judging by their job adverts and websites, these companies have a clear sense of entitlement towards databases such as HES. Kingsley Manning, Chair of HSCIC, had to admit to the Westminster Health Select Committee last month that he could not even say who the end users of the shared HES data were. It is clear that with the addition of GP data to this database, such an attitude is no longer tolerable.
Open-washing, and research ethics
Arguments in favour of care.data as a simple data resource that can be shared freely still come from two directions. Tim Kelsey, the NHS England director in charge, came into this role after being the government’s “transparency czar”. He and others conflate care.data sharing with the Open Data movement, in which government bodies provide data for transparency, allowing the public to draw their own conclusions in whichever way they like. However, Open Data is typically not personally identifiable, which is what allows it to be published and shared without Data Protection Act restrictions. According to the ICO’s advice on anonymisation, the data shared by HSCIC would not be unconditionally excluded from data protection – in contradiction to HSCIC’s practice.
Medical research councils also appear to have little problem with the view of care.data as simply a data resource. From their perspective, they have established codes of practice and research ethics that are consistently applied to ensure the responsible treatment of sensitive medical data. Some of the language coming from this corner, particularly the reference to “consent fetishism”, suggests that for some the mechanisms of research ethics have become a substitute for its purpose. The public’s anxiety about care.data is not primarily about its use in academic research, but rather about where it crosses into commercial (e.g. pharmaceutical) research, especially given the recently highlighted need to improve practical research ethics in that area.
Purpose
Purpose is where data protection legislation (which requires exactly that) and the public’s unease about sharing their sensitive medical data come together. In order to develop a more detailed purpose for care.data, HSCIC drew up the “care.data addendum”, which states that no users can be excluded a priori from the system, and which lists a wide range of categories of users. These include all of HSCIC’s usual customers: research institutes and data analytics companies, as well as pharmaceutical companies, insurance companies, and thinktanks. From the public debate since it came out, it has become clear that many of those are socially unacceptable as recipients of sensitive medical data in anything other than aggregated, fully anonymised form. In recognition of that, HSCIC have been disowning the care.data addendum in individual communications, while failing to come up with the promised replacement or to change the presentation in their official communications. Of the amendments to the Care Bill proposed in parliament recently, all those which aimed to firm up the care.data purpose beyond “for health purposes” (informally known as the Ronald McDonald amendment) were voted down.
Software engineering
From a software engineering perspective, purpose really should have come first. The traditional sequence of activities in software engineering is to start by establishing the requirements, then design the system, then implement it, and throughout – and especially at the end – check that it is indeed the system that was intended. If requirements are likely to change over the development time, these phases can be made to overlap to some extent. In this case, it would have been ever so helpful to start the requirements phase by describing some illustrative scenarios (“use cases”, or “stories”, in technical terms) and to see how those would sit with the stakeholders – patients, medics, and the users of care.data. These scenarios could then have been placed in an overall context, showing how better information does indeed, eventually, lead to better care. This would have allowed the purpose and constraints to be established with consultation and public buy-in. As things stand, a year after the original intended roll-out, there is an ongoing argument about the requirements. The design is still being modified, for example to ensure that the patients’ opt-out, added as an afterthought, can have the intended effect. There are persistent rumours that the postponements have been convenient due to delays in the software implementation – not surprising given the changes made.
Then there is the final software engineering phase: checking that the system actually is as expected. The most common method for this is testing – the final aspect of which is the phased trial roll-out now expected to start in the autumn. There are better methods, too, but that is a topic for another time.
Controlling the interface
Security engineering experience says, in addition, that security should be considered right from the start, rather than added on afterwards. For what would likely be the single most security-sensitive database in the country (not counting any databases GCHQ might deny the existence of), that holds a fortiori. Yet HSCIC produced a price list for sharing care.data “at cost” before it had established the security measures needed to protect the data and to monitor others’ use of it – let alone the cost of those measures.
In the care.data advisory group established recently, there has been some discussion of a “fume cupboard” model for access to care.data. Some of the lessons from the Google history would be put into practice there: HSCIC would not share the database, but give others controlled and monitored access through its interface. Established security mechanisms such as Role Based Access Control could play a part in ensuring that queries match a defined purpose or policy for each type of user. Existing mechanisms for detecting insider attacks, partly or fully automated, could be applied to monitor use and to change access policies dynamically. This would put a wealth of modern security engineering technology at the service of protecting one of the most valuable data sets ever established.
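As a skeleton of how such a controlled, monitored interface might look (the role names, query categories and logging scheme are all illustrative, not a real HSCIC design):

    import logging

    logging.basicConfig(level=logging.INFO)
    audit_log = logging.getLogger("care_data_access")

    # Hypothetical policy: which categories of query each role may run.
    ROLE_POLICY = {
        "academic_researcher": {"aggregate_statistics"},
        "commissioner": {"aggregate_statistics", "activity_reports"},
    }

    def run_query(user, role, category, query):
        # Role Based Access Control around the interface: the database is
        # never handed over; every query is checked against the role's
        # policy, executed inside the controlled environment, and logged.
        allowed = category in ROLE_POLICY.get(role, set())
        audit_log.info("user=%s role=%s category=%s allowed=%s",
                       user, role, category, allowed)
        if not allowed:
            raise PermissionError(f"{role} may not run {category} queries")
        return query()

The audit log is what would feed the insider attack detection, allowing policies to be tightened dynamically when use looks anomalous. Better late than never.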