An artificial intelligence model named Foresight has been trained on the medical data of 57 million people who have used England’s National Health Service, according to New Scientist. The initiative, led by researchers including Chris Tomlinson at University College London’s (UCL) Institute of Health Informatics and King’s College London, represents what its creators call the world’s first “national-scale generative AI model of health data,” utilizing approximately 10 billion health events recorded between November 2018 and December 2023.
The model, built on Meta’s open-source Llama 2 model, integrates diverse datasets from outpatient appointments, hospital visits, and vaccination records. Developers suggest Foresight could eventually help doctors predict disease complications before they occur, offering a crucial window for early intervention, and could also forecast broader health trends like hospitalization rates.
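Foresight’s full pipeline has not been published, but the general recipe for a generative model over health records is to flatten each patient’s timeline into a sequence of discrete event tokens and train a causal language model, such as Llama 2, to predict the next event. The sketch below illustrates only that framing; the event codes, time-gap buckets, and token format are invented for illustration and are not Foresight’s actual scheme.

```python
# Minimal sketch (not the Foresight pipeline): representing a patient's
# health record as a sequence of discrete event tokens that a causal
# language model such as Llama 2 could be trained to continue.
# All event codes and the token format below are illustrative inventions.

from datetime import date

# Hypothetical event stream for one de-identified patient: (date, source,
# code) tuples drawn from outpatient, hospital, and vaccination records.
patient_events = [
    (date(2019, 3, 4), "gp_visit", "ICD10:E11"),        # type 2 diabetes
    (date(2020, 1, 15), "hospital", "OPCS4:K40"),       # hospital procedure
    (date(2021, 6, 2), "vaccination", "SNOMED:840534001"),
]

def events_to_tokens(events):
    """Flatten timestamped events into an ordered token sequence.

    Relative time gaps are bucketed into coarse tokens so the model sees
    ordering and spacing without exact dates (one common way to reduce
    identifiability; not necessarily what Foresight does).
    """
    tokens = []
    prev = None
    for when, source, code in sorted(events):
        if prev is not None:
            gap_days = (when - prev).days
            bucket = ("<gap_1y+>" if gap_days > 365 else
                      "<gap_3m+>" if gap_days > 90 else "<gap_short>")
            tokens.append(bucket)
        tokens.extend([f"<{source}>", code])
        prev = when
    return tokens

print(events_to_tokens(patient_events))
# A model trained on billions of such sequences is then asked to continue
# them, i.e. to predict the patient's likely future health events.
```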
Tomlinson stated that “the real potential of Foresight is to predict disease complications before they happen, giving us a valuable window to intervene early, and enabling a shift towards more preventative healthcare at scale.”
An earlier version of Foresight developed in 2023 used OpenAI’s GPT-3 and a smaller dataset from two London hospitals.
Training AI on data from virtually the entire population of England has ignited significant privacy and data protection concerns among other researchers and privacy advocates. While the developers insist records were “de-identified” before use, experts warn that the very richness that makes large datasets valuable also carries well-documented re-identification risks, according to New Scientist. The developers themselves acknowledge there is no guarantee the system will not inadvertently expose sensitive patient information.
Navigating Privacy and Regulatory Hurdles
Building powerful generative AI models while protecting patient privacy remains an unresolved scientific problem, stated Luc Rocher at the University of Oxford.
He argued that the very richness of data valuable for AI also makes it incredibly difficult to anonymize, suggesting such models should remain under strict NHS control for safe use. Michael Chapman at NHS Digital, overseeing the data for Foresight, acknowledged that while direct identifiers are removed, providing absolute certainty against re-identification with rich health data is challenging.
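Rocher’s point can be made concrete with a toy example. Even after names and NHS numbers are stripped, a handful of quasi-identifiers, such as an age band, a partial postcode, and a rare diagnosis, can be enough to single out one record. The records below are entirely fabricated to illustrate the mechanism.

```python
# Toy illustration of the re-identification risk: all data is invented.
# Direct identifiers are gone, yet one record is unique on quasi-identifiers.

records = [
    {"age_band": "40-44", "partial_postcode": "OX1", "rare_dx": "G71.0"},
    {"age_band": "40-44", "partial_postcode": "OX1", "rare_dx": None},
    {"age_band": "70-74", "partial_postcode": "SW9", "rare_dx": "G71.0"},
]

def matches(record, **known_facts):
    """Return True if a record is consistent with externally known facts."""
    return all(record.get(k) == v for k, v in known_facts.items())

# Someone who knows only a neighbour's age band, rough postcode, and rare
# diagnosis finds exactly one candidate record in the "de-identified" data.
candidates = [r for r in records
              if matches(r, age_band="40-44", partial_postcode="OX1",
                         rare_dx="G71.0")]
print(len(candidates))  # -> 1: the record is unique, hence re-identifiable
```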
The AI operates within a dedicated NHS England Secure Data Environment (SDE), with computational infrastructure from Amazon Web Services and Databricks, though these companies reportedly cannot access the data.
Transparency and user control are also central points of contention. Caroline Green at the University of Oxford noted that using such a vast dataset without informing individuals weakens public trust. She stated that “even if it is being anonymised, it’s something that people feel very strongly about from an ethical point of view, because people usually want to keep control over their data and they want to know where it’s going.”
Green contended that ethics and human considerations should be the starting point for AI development, not an afterthought. She said, “there is a bit of a problem when it comes to AI development, where the ethics and people are a second thought, rather than the starting point.”
Currently, existing opt-out mechanisms for nationally collected NHS datasets do not apply to the data used by Foresight because it has been “de-identified.” An NHS England spokesperson stated that because the data is anonymized, it is not considered personal data and the GDPR therefore does not apply. However, guidance from the UK Information Commissioner’s Office (ICO) indicates that “de-identified” should not be used as a synonym for anonymous, noting that UK data protection law does not define the term, which can lead to confusion.
Furthermore, the legal position is complicated by Foresight’s current use being limited to covid-19 related research, allowing it to operate under exceptions to data protection laws enacted during the pandemic. Sam Smith at medConfidential, a UK data privacy organization, argued that this “covid-only AI” likely contains embedded patient data that should not leave the lab.
Smith said, “this covid-only AI almost certainly has patient data embedded in it, which cannot be let out of the lab,” adding that “patients should have control over how their data is used.”
Yves-Alexandre de Montjoye at Imperial College London highlighted that testing whether models can memorize training data is crucial for assessing their ability to reveal sensitive information, a test the Foresight team reportedly had not conducted yet but was considering.
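One common form of such a memorisation probe, sketched below under the assumption of a token-sequence model like the one outlined earlier, is to prompt the model with the prefix of a known training record and check whether it reproduces the withheld remainder verbatim. The stand-in model and codes here are invented; this is the general technique, not the Foresight team’s protocol.

```python
# Hedged sketch of a verbatim-memorisation test. "model_continue" stands in
# for querying the trained model; a real test would call the model itself.

def looks_memorised(model_continue, training_sequence, prefix_len=4):
    """Prompt with a prefix of a training record and check whether the
    model regurgitates the held-back suffix exactly."""
    prefix = training_sequence[:prefix_len]
    suffix = training_sequence[prefix_len:]
    generated = model_continue(prefix, max_tokens=len(suffix))
    return generated == suffix

# Illustrative training sequence of event tokens (invented codes).
train_seq = ["<gp_visit>", "ICD10:E11", "<hospital>", "OPCS4:K40",
             "<vaccination>", "SNOMED:840534001"]

# A model that has memorised its training data fails the privacy test.
def leaky_model(prefix, max_tokens):
    return train_seq[len(prefix):len(prefix) + max_tokens]

print(looks_memorised(leaky_model, train_seq))  # -> True: sensitive leak
```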
Potential Benefits and Government Ambition
Despite the privacy concerns, proponents emphasize the potential of AI like Foresight to transform healthcare. Health Secretary Wes Streeting supports leveraging this technology to reduce unnecessary hospital visits, accelerate diagnosis, and free up staff resources, according to UCL News.
Science and technology secretary Peter Kyle also supports the project, viewing it as a step towards a “healthcare revolution” and part of the UK government’s Plan for Change. Kyle said the ambitious research shows how AI, paired with the NHS’s secure and anonymised data, is set to unlock a healthcare revolution.
NHS England national director of transformation Vin Diwakar also backs training AI on large datasets for preventative care. Diwakar said, “AI has the potential to change how we prevent and treat diseases, especially when trained on large datasets.”
Chris Tomlinson noted that using national-scale data allows the model to represent the diverse population of England, particularly for minority groups and rare diseases.
Objectives of the Foresight model include identifying health inequalities and supporting population-level risk analysis. The model, which analyzes data spanning from 1997 to 2018, aims to predict over 700 health conditions up to five years in advance. A study published in The Lancet Digital Health in 2024 indicated Foresight could predict future health conditions, showing particular promise for heart failure, chronic kidney disease, and type 2 diabetes. Researchers hope to include richer data sources such as clinicians’ notes, blood tests, and scans in future iterations, according to Richard Dobson.
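How a next-event generator yields multi-year, multi-condition risk estimates is not detailed in the coverage, but one generic readout is Monte Carlo sampling: draw many plausible future trajectories for a patient and count how often each condition appears within the horizon. The sketch below uses a uniform random stand-in sampler and three stand-in conditions; it is not Foresight’s method.

```python
# Generic Monte Carlo readout: turn sampled future trajectories into
# per-condition risk estimates. The sampler below is a uniform stand-in;
# a real system would sample from the trained model given the history.

import random
random.seed(0)

CONDITIONS = {"heart_failure", "ckd", "t2_diabetes"}  # stand-ins for ~700 codes

def sample_future(history, horizon_events=10):
    """Stand-in sampler: draws future events uniformly at random."""
    vocab = list(CONDITIONS) + ["other_event"]
    return [random.choice(vocab) for _ in range(horizon_events)]

def condition_risks(history, n_samples=1000):
    """Estimate P(condition occurs within the horizon) by sampling."""
    counts = {c: 0 for c in CONDITIONS}
    for _ in range(n_samples):
        seen = set(sample_future(history))
        for c in CONDITIONS & seen:
            counts[c] += 1
    return {c: counts[c] / n_samples for c in CONDITIONS}

print(condition_risks(["<gp_visit>", "ICD10:E11"]))
# Each value approximates 1 - (3/4)**10 ~= 0.94 under this toy sampler;
# real risk estimates would come from the model's learned distribution.
```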
Broader Context and Public Trust
The debate surrounding Foresight reflects similar data privacy discussions across the tech industry. Meta, for instance, has faced scrutiny from regulators like the ICO and the Irish Data Protection Commission (DPC) over its use of publicly accessible UK social media data for AI training, defending its practices under GDPR’s “legitimate interest” provision despite criticism following a European court decision.
Similarly, privacy watchdog NOYB filed GDPR complaints against xAI over the alleged use of personal data from millions of EU users to train its Grok AI model without proper consent, leading to a temporary halt in processing by Ireland’s DPC.
Developments in medical AI continue alongside these privacy debates. Microsoft Research, in collaboration with the University of Washington and Providence, recently unveiled BiomedParse, an AI model designed to enhance medical image analysis.
This model, which utilized OpenAI’s GPT-4 to synthesize a large dataset from Hugging Face, aims to simplify complex procedures but also faces deployment hurdles related to data privacy and regulatory adherence.
Earlier in 2024, Microsoft also launched GigaPath for analyzing gigapixel pathology images. Meanwhile, Paris-based startup Bioptimus released its H-optimus-0 pathology model, trained on a vast collection of histopathology slides, with the goal of enhancing disease diagnosis and medical research.
A recent meta-analysis from Osaka Metropolitan University published in Nature found that generative AI’s diagnostic accuracy is comparable to that of non-specialist doctors, though it still falls short of specialist physicians.
A mid-2024 public opinion survey, cited by Solutions 4 IT and originally from the Health Foundation blog, showed 54% of the public and 76% of NHS staff support the use of AI in patient care, with higher enthusiasm for non-clinical tasks. However, trust remains an issue, particularly among older adults, who want to be notified about how their data is used.
A critical perspective noted by Pharmaphorum suggests NHS data quality can be poor for prevention purposes, since records are typically created only once health problems already exist.