DNA barcodes can make drug discovery screens miss potential medicines

Drug discovery efforts based on DNA-encoded chemical libraries are inadvertently overlooking numerous potential drug candidates, new research shows. 

Each molecule in a DNA-encoded chemical library is tagged with a unique DNA sequence that acts like a barcode. Such libraries have revolutionised early drug discovery by allowing researchers to screen millions, if not billions, of compounds simultaneously. And the resulting datasets are often used to train machine learning models that seek out promising drug candidates.

Keen to understand how reliable data linked to DNA-encoded chemical libraries actually is, Raphael Franzini, from the University of Utah in the US, and colleagues investigated a library with over 58,000 compounds designed to target enzymes involved in DNA repair and cancer. When they synthesised and tested 33 molecules that screens had dismissed, they discovered that these compounds were often just as effective as those flagged as promising. In particular, various screens nearly missed compounds that were structurally similar to olaparib, an approved cancer drug.

‘We found that DNA-encoded library data often labels good molecules as bad molecules,’ explains Franzini. 

The problem appears to lie with the DNA barcodes themselves. When the team compared molecules with and without these tags, they found that the DNA reduced molecules’ activity. The effect was even more pronounced when molecules were tested against targets they were not originally designed for. 

Laura Guasch, a computational chemist at pharmaceutical company Roche, Switzerland, describes the findings as ‘a highly relevant contribution’. She says the study ‘raises crucial awareness regarding how these numerous false negatives can impair the increasingly popular machine learning algorithms used in this domain.’

‘False negatives introduce substantial noise and bias into training datasets, causing machine learning models to learn misleading patterns or ignore valid chemotypes,’ comments Srinivas Chamakuri, an assistant professor at Baylor College of Medicine’s Center for Drug Discovery in the US.

Franzini and colleagues demonstrated that even when machine learning models appeared to perform well, they were actually just recognising recurring structural fragments rather than developing genuine predictive capabilities. 

‘A primary implication of this study is the significant risk that current drug discovery programs might be overlooking potential drug candidates due to high rates of false negatives,’ notes Guasch. 

The researchers found that removing unreliable data from the training sets and focusing only on confirmed active compounds dramatically improved models’ ability to identify promising drugs. This suggests that current machine learning approaches in drug discovery may need fundamental changes to account for the inherent biases in screening data.
