Fig. 5 | Journal of Cheminformatics

From: InertDB as a generative AI-expanded resource of biologically inactive small molecules from PubChem

Validation study of InertDB. a Schematic diagram describing the strategies for preparing training datasets used to compare the efficacy of random sampling and decoy generation methods. b, c Mean predictive performance for the LIT-PCBA (b) and MUV (c) datasets. Each model was constructed by training a random forest classifier on ECFP4 fingerprints, with different datasets serving as sources of positive and negative labels. Performance was evaluated on a hold-out test set consisting of the original verified active and inactive compounds from each benchmark dataset and is compared using area under the receiver operating characteristic curve (AUROC) values. A higher AUROC reflects superior classification performance, indicating that the predictive model more effectively distinguishes between active and inactive compounds. Each data point represents the mean AUROC over 100 random splits for an individual assay endpoint in the benchmark dataset. Gray squares indicate median values. Statistical significance between paired assay endpoints (connected by lines) was determined using a paired Wilcoxon test: *P < 0.05, **P < 0.01, and ***P < 0.001. d Spearman correlation between model performance and the chemical similarity (nearest-neighbor Tc) of negative-label compounds in the training set to verified active (left) or inactive (right) compounds from the original benchmark datasets. e Mean chemical similarity (nearest-neighbor Tc) between verified inactive compounds (Inac.) and compounds from InertDB (CIC and GIC subsets), PubChem (Pc), ZINC (Zn), and DeepCoy-generated decoys (Dc) for each assay endpoint
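The caption describes the per-endpoint evaluation protocol in enough detail to sketch it in code. The following is a minimal illustrative sketch, not the authors' implementation: it assumes RDKit and scikit-learn, uses placeholder inputs (SMILES lists and binary labels), and covers one random split of the ECFP4 + random forest + AUROC pipeline (panels b, c) plus the nearest-neighbor Tanimoto coefficient (Tc) used for the similarity analyses (panels d, e).

    # Sketch of the per-endpoint evaluation described in the caption.
    # Assumes RDKit and scikit-learn; all inputs (SMILES lists, labels) are placeholders.
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem, DataStructs
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    def ecfp4(smiles_list, n_bits=2048):
        """ECFP4 (Morgan, radius 2) bit vectors as a NumPy feature matrix."""
        fps = []
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
            arr = np.zeros((n_bits,), dtype=np.int32)
            DataStructs.ConvertToNumpyArray(fp, arr)
            fps.append(arr)
        return np.array(fps)

    def endpoint_auroc(train_smiles, train_labels, test_smiles, test_labels, seed=0):
        """Train a random forest on ECFP4 features and score AUROC on the
        hold-out test set of verified active/inactive compounds (one split)."""
        clf = RandomForestClassifier(n_estimators=500, random_state=seed, n_jobs=-1)
        clf.fit(ecfp4(train_smiles), train_labels)
        scores = clf.predict_proba(ecfp4(test_smiles))[:, 1]
        return roc_auc_score(test_labels, scores)

    def nearest_neighbor_tc(query_smiles, reference_smiles):
        """Nearest-neighbor Tanimoto coefficient of each query compound against
        a reference set (the similarity measure used in panels d and e)."""
        ref_fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
                   for s in reference_smiles]
        nn_tc = []
        for s in query_smiles:
            fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
            nn_tc.append(max(DataStructs.BulkTanimotoSimilarity(fp, ref_fps)))
        return nn_tc

In the figure, each plotted point is the mean of 100 such random splits for one assay endpoint, and the paired comparisons between training-set conditions are assessed with a paired Wilcoxon test (e.g., scipy.stats.wilcoxon).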
