Researchers have built an AI system that predicts your risk of developing more than 1,000 diseases up to 20 years before symptoms appear, according to a study published in Nature this week.
The model, called Delphi-2M, achieved an average AUC of about 0.76 for near-term health predictions (AUC measures how well a model ranks people who will develop a disease above those who will not) and held at about 0.70 when forecasting a decade into the future.
It performed on par with existing single-disease risk calculators while simultaneously assessing risks across the entire spectrum of human illness.
“The progression of human disease across age is characterized by periods of health, episodes of acute illness and also chronic debilitation, often manifesting as clusters of co-morbidity,” the researchers wrote. “Few algorithms are capable of predicting the full spectrum of human disease, which recognizes more than 1,000 diagnoses at the top level of the International Classification of Diseases, Tenth Revision (ICD-10) coding system.”
The system learned these patterns from 402,799 UK Biobank participants, then proved its mettle on 1.9 million Danish health records without any additional training.
Before you get excited about having your own medical predictor: can you try Delphi-2M yourself? Not exactly.
The trained model and its weights are locked behind UK Biobank’s controlled access procedures—meaning researchers only. The codebase for training your own version is on GitHub under an MIT license, so you could technically build your own model, but you’d need access to massive medical datasets to make it work.
For now, this remains a research tool, not a consumer app.
The technology works by treating medical histories as sequences—much like ChatGPT processes text.
Each diagnosis, recorded with the age it first occurred, becomes a token. The model reads this medical “language” and predicts what comes next.
Given enough training data, the model can predict the next token (in this case, the next diagnosis) and the estimated time until that event occurs (how long until you get sick if the most likely sequence of events plays out).
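To make the "diagnoses as tokens" idea concrete, here is a minimal sketch of how a medical history might be turned into an age-stamped token sequence. The `Event` class, the `tokenize` function, and the three-code vocabulary are hypothetical names chosen for illustration; they are not taken from the Delphi-2M codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    age_years: float  # age at which the diagnosis was first recorded
    code: str         # ICD-10 top-level code, e.g. "E11" for type 2 diabetes


def tokenize(history, vocab):
    """Sort events by age and map each diagnosis code to an integer token id."""
    ordered = sorted(history, key=lambda e: e.age_years)
    return [(e.age_years, vocab[e.code]) for e in ordered]


# Hypothetical toy vocabulary; the real model covers more than 1,000 diagnoses.
vocab = {"I10": 0, "E11": 1, "C25": 2}

# Events may arrive out of order; the sequence is ordered by age.
history = [Event(58.2, "E11"), Event(51.0, "I10")]
tokens = tokenize(history, vocab)
print(tokens)
```

Each entry pairs an age with a token id, which is the kind of sequence a transformer could then be trained on to guess what comes next and when.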
For a 60-year-old with diabetes and high blood pressure, Delphi-2M might forecast a 19-fold increased risk of pancreatic cancer. Add a pancreatic cancer diagnosis to that history, and the model calculates mortality risk jumping nearly ten thousandfold.
The transformer architecture behind Delphi-2M represents each person’s health journey as a timeline of diagnostic codes, lifestyle factors like smoking and BMI, and demographic data. “No event” padding tokens fill the gaps between medical visits, teaching the model that the simple passage of time changes baseline risk.
This is loosely similar to how general-purpose LLMs can still make sense of text when some words or even sentences are missing.
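The padding idea can be sketched in a few lines. This example assumes a fixed maximum gap of five years between tokens, which is an illustrative choice rather than the study's actual scheme; `NO_EVENT` and `pad_gaps` are hypothetical names.

```python
NO_EVENT = -1  # hypothetical id for the "no event" padding token


def pad_gaps(tokens, max_gap=5.0):
    """Insert NO_EVENT tokens so consecutive entries are at most max_gap years apart."""
    padded = []
    prev_age = None
    for age, tok in tokens:
        if prev_age is not None:
            t = prev_age + max_gap
            while t < age:
                padded.append((t, NO_EVENT))
                t += max_gap
        padded.append((age, tok))
        prev_age = age
    return padded


# A 12-year gap between two diagnoses gets two padding tokens.
padded = pad_gaps([(40.0, 0), (52.0, 1)])
print(padded)
```

With the gaps filled in, the model sees explicit examples of years in which nothing was recorded, so it can learn how risk shifts with age alone.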
When tested against established clinical tools, Delphi-2M performed comparably. For cardiovascular disease prediction, it achieved an AUC of 0.70, compared to 0.69 for AutoPrognosis and 0.71 for QRisk3. For dementia, it matched UKBDRS at 0.81. The key difference: those tools predict single conditions. Delphi-2M evaluates everything at once.
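For readers unfamiliar with AUC, it can be read as the probability that the model assigns a higher risk score to a randomly chosen person who developed the disease than to one who did not. The sketch below computes it directly from that definition; the scores are made-up numbers, not data from the study.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (case, non-case) pairs in which the case has the higher score.

    Ties count as half a win. 0.5 means no better than chance, 1.0 is perfect.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical risk scores for people who did and did not develop a condition.
cases = [0.9, 0.6, 0.4]
non_cases = [0.5, 0.3, 0.2, 0.1]
score = auc(cases, non_cases)
print(round(score, 3))
```

On that scale, 0.70 means the model ranks the true case higher in about 70% of such pairs.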
Beyond individual predictions, the system generates entire synthetic health trajectories.
Starting from age 60 data, it can simulate thousands of possible health futures, producing population-level disease burden estimates accurate to within statistical margins. One synthetic dataset trained a secondary Delphi model that achieved 74% accuracy—just three percentage points below the original.
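A toy version of this kind of simulation is sketched below. It assumes a single disease with a constant yearly rate, which is a drastic simplification of what Delphi-2M does; the function names and the 2% rate are hypothetical. The point is only to show how many sampled futures add up to a population-level estimate.

```python
import math
import random


def simulate_onset_age(start_age, end_age, rate_per_year, rng):
    """Sample an onset age from an exponential waiting time; None if past end_age."""
    onset = start_age + rng.expovariate(rate_per_year)
    return onset if onset < end_age else None


def simulated_incidence(n_people, start_age, end_age, rate_per_year, seed=0):
    """Share of simulated people who develop the disease before end_age."""
    rng = random.Random(seed)
    cases = sum(
        simulate_onset_age(start_age, end_age, rate_per_year, rng) is not None
        for _ in range(n_people)
    )
    return cases / n_people


# 5,000 simulated futures from age 60 to 80 at a hypothetical 2% yearly rate.
estimate = simulated_incidence(5000, 60.0, 80.0, 0.02)
# Analytic answer for a constant rate, used as a sanity check.
expected = 1.0 - math.exp(-0.02 * 20.0)
print(estimate, expected)
```

The simulated share should land close to the analytic value, which is the same kind of check the researchers ran against observed disease burden.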
The model revealed how diseases influence each other over time. Cancers increased mortality risk with a “half-life” of several years, while septicemia’s effect dropped sharply, returning to near-baseline within months. Mental health conditions showed persistent clustering effects, with one diagnosis strongly predicting others in that category years later.
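The "half-life" framing can be illustrated with simple exponential decay. The starting risk multiplier and the half-lives below are invented for illustration and are not figures from the study.

```python
def excess_risk(initial, years, half_life_years):
    """Excess risk remaining after `years`, halving once every half-life."""
    return initial * 0.5 ** (years / half_life_years)


# Hypothetical: the same initial excess risk, decaying slowly versus quickly.
slow = excess_risk(8.0, 3.0, half_life_years=3.0)   # e.g. a cancer-like effect
fast = excess_risk(8.0, 3.0, half_life_years=0.25)  # e.g. a septicemia-like effect
print(slow, fast)
```

After three years the slow-decaying effect has only halved, while the fast-decaying one has all but vanished.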
The system does have limits. Its 20-year predictions drop to roughly 60-70% accuracy overall, and performance varies with the type of disease or condition being forecast.
“For 97% of diagnoses, the AUC was greater than 0.5, indicating that the vast majority followed patterns with at least partial predictability,” the study says, adding later on that “Delphi-2M’s average AUC values decrease from an average of 0.76 to 0.70 after 10 years,” and that “In the first year of sampling, there are on average 17% disease tokens that are correctly predicted, and this drops to less than 14% 20 years later.”
In other words, the model predicts well over the near term, but a lot can change in 20 years, so it’s no Nostradamus.
Rare diseases and highly environmental conditions prove harder to forecast. The UK Biobank’s demographic skew—mostly white, educated, relatively healthy volunteers—introduces bias that the researchers acknowledge needs addressing.
Danish validation revealed another limitation: Delphi-2M learned some UK-specific data collection quirks. Diseases recorded primarily in hospital settings appeared artificially inflated, which did not match the patterns in the Danish records.
The model predicted septicemia at eight times the normal rate for anyone with prior hospital data, partly because 93% of UK Biobank septicemia diagnoses came from hospital records.
The researchers trained Delphi-2M using a modified GPT-2 architecture with 2.2 million parameters—tiny compared to modern language models but sufficient for medical prediction. Key modifications included continuous age encoding instead of discrete position markers and an exponential waiting time model to predict when events would occur, not just what would happen.
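One common way to set up an exponential waiting time model is as a set of competing clocks: each possible diagnosis gets a yearly rate, the total rate sets how soon anything happens, and each diagnosis's share of the total sets how likely it is to come first. The sketch below follows that textbook formulation, which may differ in detail from the study's implementation; the rates and function names are hypothetical.

```python
import math
import random


def next_event_probs(rates):
    """Probability that each event is the next one to occur."""
    total = sum(rates.values())
    return {code: rate / total for code, rate in rates.items()}


def prob_any_event_within(rates, years):
    """Probability that at least one event occurs within `years`."""
    total = sum(rates.values())
    return 1.0 - math.exp(-total * years)


def sample_next_event(rates, rng):
    """Draw which event comes next and how many years until it happens."""
    total = sum(rates.values())
    wait_years = rng.expovariate(total)
    code = rng.choices(list(rates), weights=list(rates.values()), k=1)[0]
    return code, wait_years


# Hypothetical yearly rates for three diagnosis codes.
rates = {"E11": 0.03, "I10": 0.05, "C25": 0.02}
probs = next_event_probs(rates)
p_decade = prob_any_event_within(rates, 10.0)
code, wait_years = sample_next_event(rates, random.Random(0))
print(probs, p_decade, code, wait_years)
```

This is what lets a single model answer both "what next?" and "how soon?" from the same set of numbers.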
Each health trajectory in the training data contained an average of 18 disease tokens spanning birth to age 80. Sex, BMI categories, smoking status, and alcohol consumption added context.
The model learned to weigh these factors automatically, discovering that obesity increased diabetes risk while smoking elevated cancer probabilities—relationships that medicine has long established but that emerged without explicit programming. It’s truly an LLM for health conditions.
For clinical deployment, several hurdles remain.
The model needs validation across more diverse populations. Lifestyles and habits in Nigeria, China, and the United States, for example, can differ widely, which could make the model less accurate there.
Also, privacy concerns around using detailed health histories require careful handling. Integration with existing healthcare systems poses technical and regulatory challenges.
But the potential applications range from identifying screening candidates who don’t meet age-based criteria to modeling population health interventions. Insurance companies, pharmaceutical firms, and public health agencies all have an obvious interest.
Delphi-2M joins a growing family of transformer-based medical models. Examples include Harvard’s PDGrapher tool for predicting gene-drug combinations that could reverse diseases such as Parkinson’s or Alzheimer’s, an LLM trained specifically on protein connections, and Google’s AlphaGenome model trained on DNA pairs.
What sets Delphi-2M apart is the sheer breadth of diseases it covers, its long prediction horizon, and its ability to generate realistic synthetic data that preserves statistical relationships while protecting individual privacy.
In other words: “How long do I have?” may soon be less a rhetorical question and more a predictable data point.