Deep learning reveals endogenous sterols as allosteric modulators of the GPCR–Gα interface

Designing novel target molecules by integrating the topological, chemical, and physical attributes of protein cavities necessitates advanced neural networks. While existing approaches like Bicyclic Generative Adversarial Networks (BicycleGANs) (Skalic et al., 2019) and Recurrent Neural Networks (RNNs) (Xu et al., 2021) have demonstrated potential, end-to-end standalone tools for GPCR-specific ligand design remain scarce. To address this, we developed the Gcoupler and provided it to the community as a Python Package and a Docker image. Gcoupler adopts an integrative approach utilizing structure-based, cavity-dependent de novo ligand design, robust statistical methods, and highly powerful Graph Neural Networks. Gcoupler consists of four interconnected modules, that is, Synthesizer, Authenticator, Generator, and BioRanker, that collectively impart a smoother, user-friendly, and minimalistic experience for the end-to-end de novo ligand design.

Synthesizer, the first module of Gcoupler, takes a protein structure as input in Protein Data Bank (PDB) format and identifies putative cavities across the protein surface, providing users with the flexibility to select cavities based on druggability scores or user-supplied critical residues. Since cavity-dependent molecule generation mainly depends on the chemical composition and geometric constraints of the cavity, it is, therefore, indispensable to select the cavity for the downstream steps considering its chemical nature (hydrophobicity/hydrophilicity) and functional relevance (proximity to the active site or residue composition), among others. Accounting for these, Gcoupler offers flexibility to the users to select either of its predicted cavities based on the user-supplied critical residue or by user-supplied cavity information (amino acids) using third-party software (e.g., Pocketome) (Hedderich et al., 2022). To enhance user experience, Gcoupler computes and outputs all identified cavities along with their druggability scores using LigBuilder’s V3 (Yuan et al., 2020) cavity module. Briefly, these druggability scores consider solvent accessibility, cavity exposure or burial, and detected pharmacophores and cavities, which are further prioritized based on this score. Post-cavity selection, the Synthesizer module generates cavity-specific ligands influenced by topology and pharmacophores, outputting SMILES, cavity coordinates, and other requisite files to downstream modules for further steps (Figure 1a). The chemical composition of the in silico synthesized ligands by the Synthesizer module is influenced by the cavity topology (3D) and its composition (pharmacophores). Noteworthy, the Synthesizer module of Gcoupler employs LigBuilder V3 (Yuan et al., 2020), which utilizes the genetic algorithm for the in silico ligand synthesis. Notably, the fragment library of LigBuilder, comprising 177 distinct molecular fragments in Mol2 format, allows the selection of multiple seed structures and extensions that best complement the cavity pharmacophores throughout multiple iterative runs. For each run, once a seed structure is confirmed, Gcoupler employs a hybrid approach of the Growing and Linking modes of the LigBuilder build module, enabling the stepwise addition of small fragments to the seed structure within the binding pocket of the target GPCR to build synthetic ligands. Gcoupler generates 500 unique molecules by default, though it can also be user-defined. The Synthesizer module of Gcoupler enhances LigBuilder V3 practical applicability through automation, dynamic adaptability, and abstraction. This allows for more efficient and targeted ligand generation, even in challenging design scenarios for GPCR ligand design. However, it lacks user-defined library screening, proposes synthetically challenging molecules, and often requires post-processing to isolate High-Affinity Binders () from a broad affinity range of synthetically designed compounds.

Development, benchmarking, and validation of Gcoupler computational framework.

(a) Schematic workflow depicting different modules of the Gcoupler package. Of note, Gcoupler possesses four major modules, that is, Synthesizer, Authenticator, Generator, and BioRanker. (b) AUC–ROC curves of the finally selected model for each of the indicated GPCRs. Note: Experimentally validated active ligands and decoys were used in the testing dataset. (c) Bar graphs depicting the sensitivities and specificities of the indicated GPCRs with experimentally validated active ligands and reported decoys. (d) The AUC–ROC curve indicates the model’s performance in the indicated conditions. (e) Bar graphs indicating the prediction probabilities for each experimentally validated ligand. (f) Schematic workflow illustrates the steps in measuring and comparing the structural conservation of the GPCR–Gα-protein interfaces across human GPCRs. (g) Snake plot depicting the standard human GPCR two-dimensional sequence level information. Conserved motifs of the GPCR–Gα-protein interfaces are depicted as WebLogo. Asterisks represent residues of conserved motifs present in the GPCRs–Gα-protein interfaces. Of note, the location of the motifs indicated in the exemplary GPCR snake plot is approximated. (h) Schematic workflow illustrates the steps in measuring and comparing the structural conservation of the GPCR–Gα-protein interfaces across human GPCRs. (i) Representative structures of the proteins depicting highly conserved (low root mean square deviation [RMSD]) and highly divergent (high RMSD) GPCR–Gα-protein interfaces. PDB accession numbers are indicated at the bottom. (j) Heatmap depicting the RMSD values obtained by comparing all the GPCR–Gα-protein interfaces of the available human GPCRs from the protein databank. Of note, the RMSD of the Gα–protein cavity was normalized with the RMSDs of the respective whole proteins across all pairwise comparisons. (k) Heatmap depicting the pairwise cosine similarities between the in silico synthesized ligands of the GPCR–Gα-protein interfaces of the available human GPCRs using Gcoupler. (l) Schematic diagram depicting the hypothesis that the intracellular metabolites could allosterically modulate the GPCR–Gα interaction.

To address this limitation, a second module was added to Gcoupler, termed Authenticator. This module processes output files from the Synthesizer module, conducting downstream validation steps and preparing results for constructing deep learning-based classification models (third module). The Authenticator requires input protein 3D structure in PDB format, cavity coordinates, and all silico-generated molecules from the Synthesizer module. Authenticator utilizes this information to further segregate the synthesized compounds into HABs and Low-Affinity Binders (LABs) by leveraging a structure-based virtual screening approach (AutoDock Vina) (Trott and Olson, 2010) and statistically backed hypothesis testing for distribution comparisons (Figure 1a). The Authenticator module outputs the free binding energies of all the generated compounds, which further segregates the compounds into HABs and LABs by the statistical submodule while ensuring the optimal binding energy threshold and class balance. Of note, the Authenticator is also capable of leveraging the Empirical Cumulative Distribution Function (ECDF) for binding energy distribution comparisons of HABs and LABs and performs the Kolmogorov–Smirnov test (Berger and Zhou, 2014), Epps–Singleton test (Goerg and Kaiser, 2009), and Anderson–Darling test (Engmann and Cousineau, 2011) for hypothesis testing. This expanded array of statistical tests allows users to employ methodologies that best suit their data distribution characteristics, ensuring robust and comprehensive analyses. Moreover, the Authenticator module incorporates a unique feature for decoy synthesis using HABs. This functionality enables the generation of a negative dataset in scenarios where the Synthesizer module fails to produce an optimal number of LABs. By synthesizing decoys from HABs, users can effectively balance their datasets, enhancing the reliability of downstream analyses. Lastly, the Authenticator module also accommodates user-supplied negative datasets as an alternative to LABs (Mysinger et al., 2012). This feature provides users with the flexibility to incorporate external data sources, enabling robust prediction model building by the subsequent Generator module.

The Generator, the third module, employs state-of-the-art GNN models such as Graph Convolution Model (GCM), Graph Convolution Network (GCN), Attentive FP (AFP), and Graph Attention Network (GAT) to construct predictive classifiers using Authenticator-informed classes. These GNN algorithms are tailored to extract features from the graph structure of the compounds generated by the Synthesizer and apply them to the classification task by leveraging Authenticator-informed class information. For instance, the GCM assimilates features by analyzing neighboring nodes, while the GCN extracts features through a convolutional process. The AFP model focuses attention on specific graph segments, and the GAT employs attention mechanisms to learn node representations. By default, Generator tests all four models using standard hyperparameters provided by the DeepChem framework (https://deepchem.io/), offering a baseline performance comparison across architectures. This includes pre-defined choices for node features, edge attributes, message-passing layers, pooling strategies, activation functions, and dropout values, ensuring reproducibility and consistency. All models are trained with binary cross-entropy loss and support default settings for early stopping, learning rate, and batch standardization where applicable. Gcoupler provides off-the-shelf hyperparameter tuning to ensure adequate training, which is essential for optimizing model performance. After selecting the best parameters and classification algorithm, Gcoupler further ensures the mitigation of overfitting and provides a more precise estimate of model performance through k-fold cross-validation. Notably, by default, Gcoupler employs threefold cross-validation, but users can adjust this parameter.

Finally, BioRanker, the last module, prioritizes ligands through statistical and bioactivity-based tools. The first level ranking offered by BioRanker is composed of a statistical tool that encompasses two distinct algorithms, namely G-means and Youden’s J statistics, to assist users in identifying the optimal probability threshold, thereby refining the selection of high-confidence hit compounds (Figure 1—figure supplement 1a). Additionally, bioactivity embeddings computed via Signaturizer (Bertoni et al., 2021) enable multi-activity-based ranking using a modified PageRank algorithm. Briefly, the bioactivity descriptors of the predicted compounds are projected onto various biological activity spaces, including Chemistry, Targets, Networks, Cells, and Clinics, by performing pairwise cosine similarity comparisons with HABs. The PageRank algorithm is then applied for activity-specific ranking and supports multi-activity-based ranking for sequential screening based on user-defined biological properties. BioRanker also offers flexibility through customizable probability thresholds, enabling stringent or relaxed selection of compounds. Users can also input SMILES representations for direct screening, bypassing prediction probabilities. Taken together, Gcoupler is a versatile platform supporting user-defined inputs, third-party tools for cavity selection, and customizable statistical analyses, enhancing its adaptability for diverse ligand design and screening tasks. This integrated framework streamlines cavity-specific ligand design, screening, and ranking, providing a comprehensive solution for GPCR-targeted drug discovery.

To evaluate Gcoupler’s performance, we tested its modules across five GPCRs (AA2AR, ADRB1, ADRB2, CXCR4, and DRD3) using experimentally validated ligands and matched decoys from the DUD-E dataset (Mysinger et al., 2012). The DUD-E datasets contain five GPCRs alongside information about their cavity coordinates, positive ligands, and decoys (https://dude.docking.org/subsets/gpcr). We used these five GPCRs as independent samples to evaluate different modules and sub-modules of Gcoupler. We first checked whether the cavity search algorithm of Synthesizer could accurately detect a given orthosteric ligand-binding site for a GPCR. Gcoupler accurately identified orthosteric ligand-binding sites and additional allosteric cavities across all targets, validating its de novo cavity detection algorithm (Figure 1—figure supplement 1b). We next asked whether Gcoupler could also synthesize molecules similar to the reported ligands for respective orthosteric sites based on the cavity’s physical, chemical, and geometric properties. For orthosteric sites, the Synthesizer module generated ~500 compounds per GPCR. Subsequently, as per the Gcoupler default workflow, the Authenticator module conducted a virtual screening of these newly synthesized compounds, segregating them into HABs and LABs. Although the Authenticator module provides flexibility in selecting an optimal threshold to distinguish HAB and LAB, we chose the default cutoff of –7 kcal/mol for AA2AR, CXCR4, and DRD3. For ADRB1 and ADRB2, we selected a threshold of –8 kcal/mol to minimize overlap in distributions and thus avoid class imbalance, a critical parameter that could influence the downstream model generation using the Generator module (Figure 1—figure supplement 1c). Statistical validation confirmed significant separation between these groups (p < 0.0001), enabling the Generator module to construct graph-based classification models with high values of AUC–ROC (>0.95), sensitivity, and specificity (Figure 1b, c, Figure 1—figure supplement 1d, e). These models reliably distinguished ligands from decoys, demonstrating Gcoupler’s accuracy in identifying high-affinity ligands.

In addition to evaluating Gcoupler’s performance for the orthosteric sites of GPCRs, we also validated its capability to identify allosteric sites and their corresponding ligands. In this case, we first gathered information about the experimentally validated GPCR–ligand complexes sourced from the PDB database. We chose three GPCR–ligand complexes (β2AR-Cmpd-15PA, CCR2-CCR2-RA-[R], and CCR9-Vercirnon) from the PDB (Shen et al., 2023). We removed the ligands from the PDB files and executed the standard Gcoupler workflow with default parameters. Gcoupler successfully identified allosteric binding sites and generated classification models for synthetic compounds with consistently high AUC–ROC values (>0.95) (Figure 1d, Figure 1—figure supplement 2a–c). This high level of accuracy indicates the robustness of Gcoupler’s algorithms in distinguishing between true positives (allosteric ligands) and true negatives (non-binders). Projection of experimentally validated ligands onto these models further confirmed their predictive accuracy (Figure 1e), underscoring Gcoupler’s robustness and versatility for orthosteric and allosteric ligand discovery.

Next, to evaluate the efficiency of Gcoupler, we compared its run time with the biophysics-based gold standard molecular docking (AutoDock) (Morris et al., 2009). To address the runtime efficiency, we first utilized the ChEMBL31 database (Gaulton et al., 2012) to identify GPCRs with the highest number of reported experimentally validated agonists. We selected the alpha-1A adrenergic receptor (ADRA1A) since it qualifies for this criterion and contains 993 agonists (Figure 1—figure supplement 3a, b). Methodologically, we followed the conventional steps of AutoDock Tools for molecular docking while keeping track of execution time for each step throughout the entire process until completion (Figure 1—figure supplement 3a, c). In parallel, we applied the same timestamp procedure for Gcoupler, including its individual module sub-functions (Figure 1—figure supplement 3a–d). Gcoupler was 13.5 times faster, leveraging its deep learning-based Generator module and AutoDock Vina’s efficiency. Both methods provided comparable predictions for active compounds, demonstrating Gcoupler’s speed and accuracy, making it ideal for large-scale ligand design and drug discovery (Figure 1—figure supplement 3e–h).

Finally, we used Gcoupler to evaluate the ligand space conservation (functional conservation) of the GPCR–Gα interface. Specifically, we aimed to explore the possibility of direct small molecule binding to the GPCR–Gα interface to modulate downstream signaling pathways. We analyzed multiple human GPCR–Gα complexes from the PDB (Figure 1f, Supplementary file 1), identified conserved motifs (DRY, CWxL, and NPxxY) and binding pockets through sequence and structural analyses (Figure 1g). To determine the topological similarity of the GPCR–Gα-protein interface, we undertook a detailed structural analysis across a wide array of GPCR–Gα-protein complexes. This analysis involved identifying and extracting the cavities present within each complex. By focusing on these critical regions, we aimed to assess the degree of structural conservation and quantify it through normalized root mean square deviation (RMSD) values. Specifically, the normalized RMSD values, which provide a measure of the average distance between atoms of superimposed proteins, indicated a high degree of similarity. The mean RMSD value was found to be 1.47 Å, while the median RMSD value was even lower at 0.86 Å. These values suggest that the overall topology of the GPCR–Gα interface is well conserved across different complexes, highlighting the robustness of this interaction site (Figure 1h–j, Figure 1—figure supplement 3i–k, Supplementary file 2). Finally, to test whether this topological and sequence conservation also impacts the ligand profiles that could potentially bind to this interface, we performed the Gcoupler workflow on all 66 GPCRs and synthesized ~50 unique ligands per GPCR (Figure 1h). We next computed and compared the physicochemical properties (calculated using Mordred; Moriwaki et al., 2018) of these synthesized ligands and observed high cosine similarity, which further supports the functional conservation of the GPCR–Gα interface (Figure 1h, k, Figure 1—figure supplement 3l, Supplementary file 3). In summary, we used Gcoupler to systematically evaluate and analyze the ligand profiles of the GPCR–Gα-protein interface and observed a higher degree of sequence, topological, and functional conservation.

Continue Reading