LLM Trained on Somatic Mutations Shows Prognostic and Predictive Utility

Large language models (LLMs) can be trained to understand how each patient’s somatic mutations impact their cancer prognosis and possible response to therapy, according to a presentation at the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning.

John-William Sidhom, MD, PhD, clinical fellow in the Department of Hematology and Medical Oncology at Weill Cornell Medicine in New York, developed a model that was trained on the language of cancer to be able to diagnose tumors, predict prognosis, and recommend optimal treatment regimens for each patient based on precision medicine fundamentals.

“LLMs, or transformer models really, can learn biologically meaningful patterns from somatic mutations,” Dr. Sidhom said.

Background and Model Development

Currently, next-generation sequencing (NGS) reports often provide limited guidance, as only a small number of actionable targets in the genome are known and many results are classified as variants of unknown significance. Additionally, although existing targeted therapies can treat one genomic driver of cancer, patients with that driver may also carry other possible drivers of their disease, which may lead to different responses to the same targeted therapy. Dr. Sidhom explained that this was the gap in precision medicine that he was seeking to close.

Dr. Sidhom explored the possibility of an LLM that could reason through the mutations and other information in an NGS report to predict prognosis and ultimately guide systemic therapy decisions. The model was trained to learn the meaning of different cancer-related mutations and how mutations co-occur in individual patients.

The LLMs currently used in cancer research are trained on the reference genome but have limited clinical relevance. Dr. Sidhom suggested training models on the mutanome with somatic mutation catalogs, cancer-specific alterations, and clinical outcome data, as he believed it would be more relevant to precision oncology and have more patient-specific insights.

The model was built with a dual-attention architecture. In the first stage, every mutation was embedded in terms of its reference and altered sequences so that the model could capture the order of those sequences. A second, permutation-independent transformer then captured the patient-specific implications of each mutation. The researchers also masked the altered sequences, with a 100% masking rate, to enable the model to learn the rules of mutagenesis as if they were the vocabulary and syntax of cancer mutations.
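To illustrate what "permutation-independent" means in practice, the following is a minimal sketch, not the researchers' implementation. It assumes each mutation has already been turned into a small numeric vector, applies a single self-attention step with no positional encoding, and averages the result into one patient-level vector. The function names and the toy vectors are hypothetical. Because nothing in the computation depends on the order of the mutations, shuffling them leaves the patient-level vector unchanged (up to floating-point rounding).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Single-head self-attention with identity projections.

    No positional encoding is added, so the output for each mutation
    depends only on the set of mutations, not on their order.
    """
    scale = math.sqrt(len(vectors[0]))
    attended = []
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, vectors))
             for i in range(len(query))]
        )
    return attended


def patient_embedding(mutation_vectors):
    """Mean-pool the attended mutation vectors into one patient vector."""
    attended = self_attention(mutation_vectors)
    n = len(attended)
    return [sum(v[i] for v in attended) / n for i in range(len(attended[0]))]


# Hypothetical toy data: 5 mutations, each a 4-dimensional vector.
random.seed(0)
mutations = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
shuffled = mutations[:]
random.shuffle(shuffled)

original_vec = patient_embedding(mutations)
shuffled_vec = patient_embedding(shuffled)
```

Comparing `original_vec` and `shuffled_vec` element by element with a small tolerance shows that the order of the mutations does not affect the patient-level representation.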

“This dual-attention mechanism gives you a very nice interpretable framework to understand the complex interactions between mutations that drive cancer,” Dr. Sidhom said.

Dr. Sidhom and his team first trained the model on data from The Cancer Genome Atlas, which included more than 3 million somatic mutations across 10,224 patients and 33 cancer types. They then turned to the BeatAML2 dataset, comprising 942 specimens from 805 patients with acute myeloid leukemia who had undergone matched multiomic profiling, to understand correlations between genomic representations and responses to immunotherapies driven by polygenic genomic signals, such as mismatch repair deficiency or microsatellite instability. They applied multiple-instance learning to allow the model to learn to predict responses to each treatment.
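Multiple-instance learning can be sketched as follows, under the assumption (not stated in the presentation) that each specimen is treated as a "bag" of mutation vectors with a single response label for the whole bag. This hypothetical example uses attention-based pooling: each mutation receives a weight, the weighted average forms a specimen-level vector, and a logistic output turns that vector into a response probability. The parameter vectors here are hand-picked for illustration; in a real model they would be learned.

```python
import math


def attention_mil_predict(bag, attn_vector, out_vector, bias=0.0):
    """Predict a bag-level probability from a bag of instance vectors.

    bag         : list of instance vectors (one per mutation, hypothetical)
    attn_vector : scoring vector that decides how much each instance counts
    out_vector  : weights of the final logistic layer
    Returns (probability, attention_weights).
    """
    # Score each instance, then normalise the scores with a stable softmax.
    scores = [sum(a * x for a, x in zip(attn_vector, inst)) for inst in bag]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted average of the instances gives the bag vector.
    dim = len(bag[0])
    pooled = [sum(w * inst[i] for w, inst in zip(weights, bag))
              for i in range(dim)]

    # Logistic output on the pooled bag vector.
    logit = sum(o * p for o, p in zip(out_vector, pooled)) + bias
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, weights


# Hypothetical specimen with three mutations in a 2-dimensional space.
specimen = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
prob, attn = attention_mil_predict(
    specimen, attn_vector=[1.0, 1.0], out_vector=[0.5, -0.5]
)
```

The returned attention weights indicate which mutations contributed most to the specimen-level prediction, which is one reason this style of pooling is often chosen when interpretability matters.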

Model Findings

In cancer-specific cohorts of patients from The Cancer Genome Atlas, the model was able to stratify patients by prognosis through unsupervised k-means clustering; the resulting clusters could then be translated into Kaplan-Meier survival estimates.
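The step from clusters to survival estimates can be sketched as follows. This is a minimal illustration that assumes cluster labels are already available (for example, from k-means) and that each patient has a follow-up time and an event flag (1 for an observed death, 0 for censoring). The function names and the toy data are hypothetical; the estimator itself is the standard Kaplan-Meier product-limit formula.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations : follow-up time per patient
    events    : 1 if the event was observed, 0 if the patient was censored
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def survival_by_cluster(labels, durations, events):
    """Compute one Kaplan-Meier curve per cluster label."""
    grouped = {}
    for label, d, e in zip(labels, durations, events):
        times, flags = grouped.setdefault(label, ([], []))
        times.append(d)
        flags.append(e)
    return {label: kaplan_meier(t, f) for label, (t, f) in grouped.items()}


# Hypothetical toy cohort: two clusters of two patients each.
curves = survival_by_cluster(
    labels=[0, 0, 1, 1],
    durations=[5, 8, 2, 3],
    events=[1, 0, 1, 1],
)
```

Each entry in `curves` is a step function for one cluster, so clusters with different prognoses show up as curves that separate over time.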

Based on patterns that the model learned, the researchers were able to extract biological insights about global attention patterns and causal chains for each cancer.

For example, focusing on colorectal cancer, the researchers found that many colorectal cancers had dependencies on APC mutations. Through its global attention weights, the model was able to learn and represent the Vogelstein model, which holds that colorectal cancer arises from three gene mutations occurring in a certain order: APC, then KRAS, then TP53.

Using the BeatAML2 dataset, the researchers tested whether the model could predict how patients with acute myeloid leukemia would respond to a drug not yet used in leukemia, the multitargeted tyrosine kinase inhibitor cabozantinib. The analysis yielded a canonical correlation of 0.3052 (P < .001), and the model showed exome-based predictive signatures for which patients were more likely to be resistant or sensitive to this treatment, with areas under the curve of 0.70 for both resistance and sensitivity.
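For readers less familiar with the metric, an area under the curve can be read as the probability that a randomly chosen positive case (for example, a sensitive specimen) receives a higher score than a randomly chosen negative case. The following is a minimal sketch of that pairwise definition; the function name and the toy labels and scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels : 1 for positive cases, 0 for negative cases
    scores : model scores, higher meaning "more likely positive"
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example: two positive and two negative cases.
auc = roc_auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
```

Under this reading, a value of 0.50 corresponds to chance-level ranking and 1.0 to perfect ranking.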

“The hope is with more data and more powerful models, that this performance will improve,” Dr. Sidhom said.

Disclosure: For full disclosures of the study authors, visit aacr.org.  
