Virtual Screening for R-groups, Including Predicted pIC50 Contributions, within Large Structural Databases, using Topomer CoMFA

Multiple R-groups (monovalent fragments) are implicitly accessible within most of the molecular structures that populate large structural databases. R-group searching would desirably consider pIC50 contribution forecasts as well as ligand similarities or docking scores. However, R-group searching, with or without pIC50 forecasts, is currently not practical. Existing 3D-QSAR approaches, the most prevalent and reliable source of pIC50 predictions, are also difficult to apply and somewhat subjective. Yet in 25 of 25 trials on data sets on which a field-based 3D-QSAR treatment had already succeeded, substitution of objective (canonically generated) topomer poses for the original structure-guided manual alignments produced acceptable 3D-QSAR models with negligible effort, on average having almost equivalent statistical quality to the published models. Their overall pIC50 prediction error is 0.805, calculated as the average, over these 25 topomer CoMFA models, of the standard deviations of pIC50 predictions derived from the 1109 possible “leave-out-one-R-group” (LOORG) pIC50 contributions. (This novel LOORG protocol provides a more realistic and stringent test of prediction accuracy than the customary “leave-out-one-compound” (LOO) approach.) The associated average predictive r² of 0.495 indicates a pIC50 prediction accuracy roughly halfway between perfect and useless. To assess the ability of topomer CoMFA-based virtual screening to identify “highly active” R-groups, a receiver operating characteristic (ROC) curve approach was adopted. Using, as the binary criterion for a “highly active” R-group, a predicted pIC50 within the top 25% of the observed pIC50 range, the ROC area averaged across the 25 topomer CoMFA models is 0.729. Conventionally interpreted, the odds that an R-group predicted to be “highly active” will indeed confer such a high pIC50 are 0.729/(1-0.729), or almost 3 to 1.
To confirm that virtual screening within large collections of realized structures would provide a useful quantity and variety of R-group suggestions, combining shape similarity with the “highly active” pIC50, the 50 searches provided by these 25 models were applied to 2.2 million structurally distinct R-group candidates among 2.0 million structures within a ZINC database. The searches identified an average of 5705 R-groups each, and the highest predicted pIC50 combination averaged 1.6 log units greater than the highest reported pIC50.
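
The ROC arithmetic quoted in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the threshold follows one reading of "top 25% of the observed pIC50 range" (maximum minus a quarter of the range), and the ROC area is computed with the standard rank-based (Mann-Whitney) formulation, which may differ from the procedure used in the paper.

```python
def highly_active_threshold(observed_pic50, top_fraction=0.25):
    """pIC50 cutoff marking the top `top_fraction` of the observed range.

    Assumption: "top 25% of the observed pIC50 range" means
    max - 0.25 * (max - min).
    """
    lo, hi = min(observed_pic50), max(observed_pic50)
    return hi - top_fraction * (hi - lo)


def roc_area(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def odds_from_area(area):
    """Convert an ROC area into odds, as done in the abstract."""
    return area / (1.0 - area)


if __name__ == "__main__":
    # The abstract's own figure: 0.729 / (1 - 0.729).
    print(round(odds_from_area(0.729), 2))
```

With the reported average ROC area of 0.729, `odds_from_area` gives about 2.69, which the abstract rounds to "almost 3 to 1".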

Author(s): Richard Cramer, Phillip Cruz, Gunther Stahl, William Curtiss, Brian Campbell, Brian Masek, Farhad Soltanshahi

Year: November 1, 2008