
Streamlining lead discovery by aligning in silico and high-throughput screening

John W Davies, Meir Glick and Jeremy L Jenkins

Lead discovery in the pharmaceutical environment is largely an industrial-scale process in which it is typical to screen 1–5 million compounds in a matter of weeks using high-throughput screening (HTS). This is a very costly endeavor: an HTS campaign of 1 million compounds will typically cost anywhere from $500 000 to $1 000 000. There is consequently a great deal of pressure to maximize the return on investment by finding faster and more effective ways to screen. A panacea that has emerged over the past few years to help address this issue is in silico screening. In silico screening is now incorporated in all areas of lead discovery, from target identification and library design to hit analysis and compound profiling. However, as lead discovery has evolved over the past few years, so has the role of in silico screening.

Addresses: Lead Discovery Center, Novartis Institutes for Biomedical Research Inc, 250 Massachusetts Avenue, Cambridge, MA 02139, USA
Corresponding author: Davies, John W ([email protected])

Current Opinion in Chemical Biology 2006, 10:343–351
This review comes from a themed issue on Next-generation therapeutics
Edited by Clifton E Barry III and Alex Matter
Available online 5th July 2006
1367-5931/$ – see front matter © 2006 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.cbpa.2006.06.022

Introduction

The published history of in silico screening (broadly defined here as any computational technique that supports the lead discovery process) has largely centered on two areas: focused and sequential screening approaches [1,2–5]. Both methods aim to improve the efficiency of HTS by minimizing the number of compounds to be screened, and are often driven by constraints such as assay throughput or protein or cell availability. Focused screening begins by screening a subset of the compound collection based on an in silico hypothesis (e.g., pharmacophore or docking models) of compounds that are active against a specific target. Sequential screening is carried out iteratively and begins by screening a small representative set of 'diverse' compounds. Analysis of the data enables the generation of an initial hypothesis that is used to create a subsequent and more focused subset of compounds for a second round of screening. Several cycles of testing and analysis can be performed until sufficient active molecules have been discovered.

In light of this, one could ask why large-scale HTS is carried out at all. In reality, within the pharmaceutical environment there is still little in silico-directed lead discovery. Many screening organizations have built up impressive teams of specialists, and equally impressive infrastructure and processes, to carry out mass screening of compounds for lead identification. Many groups have facilities to manage, handle and distribute large numbers of compounds to HTS with ease. Sophisticated logistics and dedicated manpower allow 1 million compounds to be screened in less than two months. Tool production (enzymatic and cell-based) and assay development, although still among the most time-consuming steps, have also become largely automated and process-optimized. It is now far easier, and for many more desirable, to screen the full deck of compounds than it is to 'cherry-pick' (individually select) 50 000 or 100 000 compounds for focused or sequential screening. Consequently, employing in silico screening to direct HTS has perhaps not materialized to anywhere near its early promise. Despite this reality, in silico screening does have a role to play in the discovery of leads. It is our opinion that the role of in silico screening in lead discovery is to improve the overall quality and number of leads through the development of an ongoing hypothesis throughout the screening life-cycle. We have found that applying state-of-the-art in silico techniques before, after and, importantly, in parallel with HTS results in a significant improvement in the quality and information content of the lead finding process (Figure 1). This paper reviews some of the in silico techniques that we believe are critical to successful lead discovery.

Figure 1. In silico techniques in parallel with the HTS process. Output is enhanced and enriched in terms of the numbers of hits and the quality of additional data such as SAR, binding mode and pharmacophore models. Images along the top and bottom illustrate representative data at various stages.

Mining HTS data and hypothesis-driven hit list triaging

A critical role of in silico screening is in the analysis of HTS data from the primary screen. Early involvement of a cheminformatics expert, a medicinal chemist and a biologist (the triage team) is vital to select the desired compounds from the hitlist for IC50 follow-up. The hitlist is triaged to exclude artifacts [6]. These are inherent to the chemical structure being screened (reactive, fluorescent, quencher, aggregator, chelator, cytotoxic, unstable in solution and reducing compounds), the assay (reagent and temperature errors), the screening technology (liquid handling, reader and timing errors), and the storage conditions (compound precipitation from DMSO stock solutions). Such compounds will fail in secondary assays, mislead the medicinal chemistry effort and consume time and resources. True hits that are uninteresting from the chemistry perspective are also excluded. These may have high molecular weight (>700), high hydrophobicity (logP > 7.5), poor ability to cross biological membranes (polar surface area >200 Å2) or low fingerprint density [7]. 'Harder' filters such as the Lipinski 'rule of 5' [8] are avoided because these properties can often be corrected during the lead optimization stage. It is important to note that such chemical filters are not applied to natural products, despite these often being non-'drug-like'. Natural products, although complex, can be highly specific and can act as excellent molecular probes [9–11]. The application of such in silico post-processing still allows for medicinal chemistry intervention: all the excluded compounds are checked by chemists, who have the opportunity to 'rescue' interesting compounds. Such 'meta decisions' are recorded in a hitlist triaging database, from which statistical models of chemical unattractiveness and of frequent hitters in various assay formats can be derived.
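As a concrete illustration of this kind of property-based triage, the sketch below applies the molecular weight, logP and polar surface area cut-offs quoted above to a hitlist using RDKit; the compound list, the function name and the exact way the cut-offs are coded are illustrative assumptions, not the authors' production filters.

```python
# Hedged sketch: property-based hitlist triage with RDKit (cut-off values taken
# from the text; the toy hitlist and helper name are illustrative assumptions).
from rdkit import Chem
from rdkit.Chem import Descriptors

def passes_triage(smiles, max_mw=700.0, max_logp=7.5, max_tpsa=200.0):
    """Return True if the compound survives the simple property filters."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                          # unparsable structure: exclude
        return False
    if Descriptors.MolWt(mol) > max_mw:      # high molecular weight
        return False
    if Descriptors.MolLogP(mol) > max_logp:  # excessive hydrophobicity
        return False
    if Descriptors.TPSA(mol) > max_tpsa:     # unlikely to cross membranes
        return False
    return True

hitlist = ["CCOC(=O)c1ccccc1N", "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"]  # toy examples
survivors = [smi for smi in hitlist if passes_triage(smi)]
print(survivors)
```

In practice such a filter would be one step in a larger triage pipeline, with chemist review of everything it rejects, as described above.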

Chemical diversity and structure–activity relationships (SAR) increase the probability of success for the HTS campaign in terms of follow-up chemistry efforts [1]. Chemical classes that are under-represented in the hitlist, or that do not appear at all, are expanded by 2-D and 3-D similarity searches or by more sophisticated data mining approaches (see below). If a three-dimensional model of the protein target is available, then docking is used to enrich the cherry-picking list. The precision of docking and similarity searches can be enhanced by restricting the searches to the subset of compounds below the cherry-picking threshold; that is, detecting interesting compounds (new chemotypes) with an activity close to the cherry-picking cut-off that were not part of the original hitlist. Input from clustering is useful in the case of a prohibitively large hitlist, where only the 'best in class' compounds from overpopulated classes are followed up. The 'best in class' can be, for example, the compounds with the most desirable percentage inhibition values or the most 'lead-like' properties in the cluster. One should be careful with such cluster-based selections, because minor chemical modifications, such as the addition of a methyl group in a congeneric series, can have a remarkable impact upon potency or selectivity; in the case of ATP-competitive kinase inhibitors, too, small modifications can have a dramatic impact upon the selectivity profile. In addition, clustering in a single descriptor space can bias the final selection. Karnachi and Brown have shown that using selection criteria from three types of descriptor space can avoid being trapped by false negatives [12].
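A minimal sketch of the 2-D similarity expansion described above is given below, assuming Morgan (circular) fingerprints as the 2-D descriptor and a Tanimoto cut-off of 0.6; the compound lists, the threshold and the function name are illustrative assumptions rather than the authors' protocol.

```python
# Hedged sketch: expand a hitlist by 2-D similarity to confirmed hits, searching
# only the compounds that fell below the cherry-picking threshold (assumed inputs).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles, radius=2, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

confirmed_hits = ["c1ccc2[nH]ccc2c1", "O=C(O)c1ccccc1O"]               # toy queries
below_threshold = ["c1ccc2[nH]c(C)cc2c1", "CCN(CC)CCOC(=O)c1ccccc1"]   # toy deck subset

query_fps = [morgan_fp(s) for s in confirmed_hits]
expansion = []
for smi in below_threshold:
    fp = morgan_fp(smi)
    # keep the compound if it is similar enough to any confirmed hit
    best = max(DataStructs.TanimotoSimilarity(fp, q) for q in query_fps)
    if best >= 0.6:
        expansion.append((smi, round(best, 2)))

print(expansion)   # candidates to add to the cherry-picking list
```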

One of the most important outcomes of the in silico analysis of HTS data is the creation of a hypothesis. By applying a variety of data mining approaches it is possible to identify false positives and false negatives in the primary data and, from that hypothesis, to expand the number of active compounds and chemotypes [13]. Statistical models based on Laplacian-modified naive Bayesian [14,15,16], recursive partitioning [17] and, more recently, support vector machine [18] classifiers, in conjunction with 2-D descriptors, have all been employed successfully in the context of primary HTS data analysis. All three classifiers, and particularly the Laplacian-modified naive Bayesian classifier, are tolerant of stochastic noise in the form of false positives and negatives [19]. Such methods are also useful in iterative or sequential screening approaches [4,12,20,21] and in the prioritization of hitlists when screening in mixtures [22]. Apart from HTS data, data mining methods have also been employed in-house to capture chemists' knowledge by building models of chemical unattractiveness. One such model was trained on compounds that had been excluded in the past by various medicinal chemists. Such models are useful for excluding compounds that simple substructure search filters fail to capture, and analyzing the weights (or probabilities) of the substructures captured by the models can help to define additional unattractive chemical structures for future analysis. Statistical models of aggregate formation [2] or of frequent hitters in various assay formats are also available; the latter are particularly useful in the case of FLIPR assays, where a large number of hits are non-specific.
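To make the Laplacian-modified naive Bayesian approach more concrete, the sketch below scores compounds from binary fingerprint features using the Laplacian-corrected feature weights commonly described for this classifier (e.g., in [14,15]); the feature sets, variable names and tiny training data are illustrative assumptions, not any specific in-house model.

```python
# Hedged sketch of a Laplacian-corrected naive Bayesian scorer over binary
# fingerprint features (toy data; a production model would use real 2-D
# descriptors and thousands of training compounds).
import math
from collections import defaultdict

def train(features_per_cpd, is_active):
    n_total = len(features_per_cpd)
    n_active = sum(is_active)
    p_base = n_active / n_total                      # baseline hit rate
    total_count = defaultdict(int)
    active_count = defaultdict(int)
    for feats, active in zip(features_per_cpd, is_active):
        for f in feats:
            total_count[f] += 1
            if active:
                active_count[f] += 1
    weights = {}
    for f, t in total_count.items():
        a = active_count[f]
        # Laplacian-corrected estimate of P(active | feature), relative to baseline
        weights[f] = math.log((a + 1.0) / ((t + 1.0 / p_base) * p_base))
    return weights

def score(features, weights):
    return sum(weights.get(f, 0.0) for f in features)

# toy fingerprints: each compound is the set of "on" bits
train_fps = [{1, 4, 7}, {1, 4, 9}, {2, 3, 9}, {2, 5, 8}]
train_labels = [1, 1, 0, 0]
w = train(train_fps, train_labels)
print(score({1, 4, 8}, w), score({2, 5, 9}, w))  # higher score = more hit-like
```

Because unseen features contribute a weight of zero, the scorer degrades gracefully on noisy primary data, which is one reason this classifier tolerates false positives and negatives well.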

HTD post-processing, consensus Bayes, SIFt

High-throughput docking (HTD) is perceived as the closest in silico equivalent to HTS. A plethora of docking software packages exist, such as DOCK [23], FlexX [24], GOLD [25], Glide [26] and ICM [27]. These packages dock compounds into a 3-D model of a target protein using different global optimization algorithms that minimize the ligand–protein intermolecular and ligand intramolecular interaction energies. A significant difference between these HTD methods lies in the nature of the scoring functions used to estimate the protein–ligand binding free energy and ultimately prioritize the compounds to be tested in vitro [28]. Although the ligand is normally treated as flexible, a fundamental problem in HTD is the rigidity of the binding site. Many putative binders simply do not fit into the binding pocket because, in silico, it is too small and does not adjust its shape when the binding event takes place. Using 'softer' scoring functions, or more accurate but time-consuming calculations with a flexible protein for the top-ranked compounds, has been suggested [26]. Docking programs also tend to fail to reproduce the poses of compounds containing a ring system that adopts an unusual conformation in the protein [28].

A further limitation of HTD is the accuracy of the scoring functions used to estimate the binding free energy. Little progress has been made in the past few years in improving the accuracy of the scoring functions per se, particularly when dealing with shallow and hydrophobic binding sites. In many cases HTD is able to find true binding modes, but the top-scoring compounds (1–5% of the screened database) are littered with false positives, and scoring functions are generally unable to differentiate between true binders and these artifacts. Effectively, HTD poses as much of a noise problem as HTS. Consensus scoring approaches [29] attempt to overcome the biases of individual scoring functions by taking a 'vote' from multiple scoring functions, even though these may not have been used as the objective function to dock the compound. Some progress has been made in the past five years by post-processing the ranked compounds and their interactions with the protein. Klon et al. [30–32] re-ranked compounds using a Laplacian-modified naive Bayesian model trained on the top-scoring compounds in conjunction with 2-D descriptors. Deng and coworkers [33,34] used the binding poses to calculate structural interaction fingerprints (SIFt), which are effectively contact fingerprints of the top-ranked compounds with the active-site residues. SIFt has been employed usefully to evict false positives and to derive selectivity profiles across proteins from the kinase family.
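As an illustration of the 'vote' taken by consensus scoring, the sketch below combines the ranks from several scoring functions with a simple rank-by-vote scheme; the score dictionaries, the vote threshold and the function name are illustrative assumptions rather than any specific published protocol.

```python
# Hedged sketch: consensus scoring by rank voting. Each scoring function ranks
# the docked compounds; a compound receives a vote whenever it appears in the
# top fraction of a ranking, and compounds are prioritized by their vote count.
def rank_votes(scores_by_function, top_fraction=0.10):
    """scores_by_function: {function_name: {compound_id: score}}, where a lower
    score is assumed to mean a better (more negative) predicted energy."""
    votes = {}
    for scores in scores_by_function.values():
        ranked = sorted(scores, key=scores.get)          # best score first
        n_top = max(1, int(len(ranked) * top_fraction))
        for cpd in ranked[:n_top]:
            votes[cpd] = votes.get(cpd, 0) + 1
    return sorted(votes.items(), key=lambda kv: -kv[1])  # most votes first

toy_scores = {
    "scorer_A": {"cpd1": -9.1, "cpd2": -7.4, "cpd3": -8.8, "cpd4": -6.0},
    "scorer_B": {"cpd1": -8.2, "cpd2": -9.0, "cpd3": -8.9, "cpd4": -5.5},
    "scorer_C": {"cpd1": -7.9, "cpd2": -6.1, "cpd3": -9.3, "cpd4": -6.4},
}
print(rank_votes(toy_scores, top_fraction=0.5))  # [('cpd3', 3), ('cpd1', 2), ('cpd2', 1)]
```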

Our in-house experience shows that, in the vast majority of cases and over a broad range of target proteins, using an HTD software package 'out of the box' is not a consistent way to discover leads. There have been numerous reports of success stories in which HTD was employed to find leads in place of running an HTS. An example often cited [35] to exemplify the potential hit rate is based on docking to protein tyrosine phosphatase 1B (PTP-1B), which has an active site dominated by polar interactions. In this case, the example plays to the strengths of docking and nullifies its weaknesses (scoring functions perform well when the binding interactions are polar and the binding site is not shallow). An obvious prerequisite for HTD is an accurate and reliable structure of the target protein. This is not necessarily available at the early stages of lead discovery, and sometimes not even at an advanced stage (e.g., for G-protein-coupled receptors). New trends in lead discovery that put emphasis on phenotypic assays and pathway screens [36] render HTD approaches impractical, because the compounds are not screened against one protein target but against multiple targets, many of them unknown. HTD can nevertheless be successfully employed in certain situations. In-house success stories include docking a virtual library of congeneric compounds from a lead series, with constraints imposed on the part of the molecule that the medicinal chemists did not want to modify; in that sense, the weaknesses of scoring and protein flexibility were somewhat nullified. Indeed, HTD can be effective when a 'hit' from HTD is confirmed with an IC50 in an assay: there is more confidence that both the biological readout and the binding mode are 'real' and not due to an assay artifact, because the compound was selected not at random but on the basis of a binding hypothesis. In this sense, using HTD in parallel with, rather than instead of, or to direct, HTS has real value. Unlike QSAR approaches, which are difficult to extrapolate (i.e., to find a nanomolar compound from a dataset of micromolar compounds), docking can extrapolate by identifying novel interactions with the target protein that were not described in the dataset.

Scaffold hopping by 2-D and 3-D similarity searching

Hits from similarity searches are a quick way to build focused libraries for low-throughput screening and early hit finding prior to HTS. Once lead compounds are identified by screening against a target, it is routine to use similarity searches for 'SAR-by-inventory', even if it is the same chemical inventory that was screened by HTS. This practice is not redundant, because active compounds are regularly missed by HTS for a number of reasons, especially in very large collections. SAR-by-inventory, also called 'hit-directed nearest-neighbor searching' [21], provides critical information to chemists before lead optimization is pursued on a scaffold. However, there are many chemical descriptors available for similarity searching, and the question of which one is optimal for recalling chemical neighbors remains a heavily pursued area of research [12]. In practice, different similarity methods yield different hit lists; as a result, it is our general practice to perform multiple searches in parallel and keep pooled lists (i.e., aggregation [21]). For example, aggregating hit lists from 2-D and 3-D similarity searches gives a healthy mix of 'conservative' and 'liberal' suggestions that optimizes the trade-off between additional actives and potentially new scaffolds. Multiple compound queries can be combined to increase enrichment for actives, and further gains in enrichment have been achieved by 'turbo' similarity searching, which includes the nearest neighbors of a single compound as presumed bioactives in the search query [37].

Although identifying additional members of a congeneric series provides useful SAR, the recent trend in cheminformatics is to encourage 'scaffold hopping' from lead compounds to new chemotypes. The objective is to retrieve bioactive compounds whose molecular frameworks are sufficiently different from the starting molecule to be considered a new series for follow-up chemistry. Methods designed for scaffold hopping should therefore be judged by chemotype enrichment rather than by traditional active enrichment [38] (a minimal sketch of one such chemotype-based measure is given below). Table 1 gives a non-exhaustive sampling of ligand-based similarity searching methods and descriptors newly developed for scaffold hopping (2004–2006). Despite the recent resurgence of novel 3-D descriptors, others have reported that 2-D descriptors such as 'circular fingerprints' (also called extended-connectivity fingerprints) and MACCS keys are also quite capable of recalling bioactives with scaffolds differing from the query compounds [49,50].
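To illustrate what judging a method by chemotype enrichment (rather than raw active recall) can look like, the sketch below counts the distinct Bemis-Murcko frameworks recovered among the retrieved actives and compares that with the frameworks present in the full active set; the use of RDKit Murcko scaffolds as the chemotype definition, and all names and toy data, are assumptions for illustration only.

```python
# Hedged sketch: chemotype (scaffold) recall for a retrieved hit list, using
# Bemis-Murcko frameworks as a simple chemotype definition.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def murcko(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))

def chemotype_recall(retrieved_actives, all_actives):
    """Fraction of the active chemotypes that the search recovered."""
    found = {murcko(s) for s in retrieved_actives}
    total = {murcko(s) for s in all_actives}
    return len(found) / len(total)

all_actives = ["c1ccc2[nH]ccc2c1CCN", "c1ccc2[nH]ccc2c1CCO",
               "O=C1NC(=O)c2ccccc21", "c1ccc(-c2ncccn2)cc1"]
retrieved = ["c1ccc2[nH]ccc2c1CCN", "c1ccc2[nH]ccc2c1CCO"]  # two actives, one chemotype
print(chemotype_recall(retrieved, all_actives))  # ~0.33: good active recall, poor hopping
```

A method that retrieves many close analogs of the query scores well on active recall but poorly on this kind of measure, which is exactly the distinction drawn above.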

HTD or 2-D/3-D similarity searching?

Recently, Chen et al. [28] compared the enrichment ratios from various HTD packages with those from 3-D shape matching (ROCS: Rapid Overlay of Chemical Structures, version 2.0, 2006, OpenEye Scientific Software) and 2-D similarity searching (ISIS/Base, version 2.3, 2006, MDL Information Systems, Inc.). The fundamental difference between docking and similarity searching is that the former requires a model of the target protein, whereas cheminformatics methods require a 'probe compound', normally a known binder. Unlike 2-D similarity, which has high recall rates but tends to find 'more of the same', docking has the potential to identify novel chemotypes; in that sense it is closer to 3-D similarity searching. Chen et al. found that ROCS shape matching outperformed all four benchmarked docking methods for 2 of the 12 targets in their study, and we also interpret from their results that ROCS performance was sound for most of the remaining 10 targets. Zhang and Muegge have also compared similarity searching with docking for multiple targets and found that, in most cases, enrichment factors were better for similarity searching [46]. We have had similar findings in support of lead discovery projects. Intriguingly, there are examples from 3-D similarity searching in which flexible conformations of reference compounds recalled as many actives as crystallographic ('bioactive') conformations [42,46]. This is because the optimal pharmacophore-based flexible alignment of two bioactive molecules may not necessarily reflect their crystallographic alignment, which may itself be an artifact of the alignment methodology. However, researchers in the field of target-based virtual screening have increasingly acknowledged that protein ensembles reflect the dynamic nature of protein binding sites better than rigid X-ray structures; in contrast, ligand-based searching methods rarely acknowledge the induced-fit phenomenon in the way search queries are constructed. It may be that flexible 3-D queries (i.e., ligand ensembles) are more capable than crystallographic ligand conformations of capturing the dynamic nature of active sites.

Table 1. Recent similarity methods designed for scaffold hopping. Each entry lists the scaffold hopping descriptor, its methodology, and whether the descriptor is conformation dependent.

Correlation vectors from fuzzy pharmacophore models [39] (conformation dependent: yes): A 3-D distribution of pharmacophore points resulting from an alignment of reference ligands is transformed into a correlation vector representing probabilities of interaction pair distances.
Topomers [40] (conformation dependent: no): 3-D fragment dissimilarities are compared by a deterministic set of rules that produce an absolute configuration from 2-D topology.
Reduced graphs [41] (conformation dependent: no): Molecular similarity is based on a weighted edit distance between pairwise paths of pharmacophore features in reduced graphs.
FEPOPS (feature point pharmacophores) [42] (conformation dependent: yes): Pharmacophore features are summed and encoded into the centroids of k-means clustered atoms for fuzzy, flexible 3-D similarity searching.
MOLPRINT 3D [43] (conformation dependent: partially): Local surface interaction patterns derived from six probes employing the GRID force field are calculated, providing a binary representation of the potential interaction capabilities of the molecule with the receptor.
Shape fingerprints (OpenEye ROCS) [44] (conformation dependent: yes): A binary shape fingerprint is created by measuring the Shape-Tanimoto of a compound to reference shapes by optimizing Gaussian overlap volume.
Molecular field points (Cresset FieldPrints) [45] (conformation dependent: yes): Van der Waals and electrostatic maxima and minima, computed with extended electron distribution (XED) charges, form a field point surface for each conformer; by inverting one molecule's field point charges, similarity can be measured by the overlay energy between molecules.
3-D PFP (pharmacophore fingerprints) [46] (conformation dependent: yes): Three-point pharmacophores defined by seven features and six distance ranges are generated for single conformers or for multi-conformer unions.
Reduced graphs with clique detection [47] (conformation dependent: no): Bond distances are used to describe fully connected reduced graphs, and different graph matching techniques are used to measure similarity.
Extended reduced graph [48] (conformation dependent: no): Property–property distance triplets within reduced graphs are recorded, with improved handling of feature separation and size and shape during graph reduction.
SURFCATS [49] (conformation dependent: yes): Like CATS and CATS3D, potential pharmacophore points (PPPs) are used; in SURFCATS the spatial distances between PPPs on the molecular surface are transformed into a correlation vector.

Given the roughly three orders of magnitude difference in computation time between docking one million compounds and running a similarity search on the same collection, the structures of small-molecule inhibitors can be a more useful starting point, with more immediate project impact, than the structure of the protein target. This leads us to believe that 3-D similarity searches will be used as an orthogonal approach to HTD for lead discovery and will be part of the 'virtual screening' package in the near future.
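Because both the docking and similarity comparisons above hinge on enrichment, a minimal sketch of the standard enrichment-factor calculation is given below; the ranked list, the active set and the 1% cut-off are illustrative assumptions.

```python
# Hedged sketch: enrichment factor (EF) at a given fraction of a ranked list.
# EF = (actives found in the top x% / compounds in the top x%) divided by
#      (total actives / total compounds screened).
def enrichment_factor(ranked_ids, active_ids, fraction=0.01):
    n_top = max(1, int(len(ranked_ids) * fraction))
    top = ranked_ids[:n_top]
    hits_in_top = sum(1 for cpd in top if cpd in active_ids)
    hit_rate_top = hits_in_top / n_top
    hit_rate_all = len(active_ids) / len(ranked_ids)
    return hit_rate_top / hit_rate_all

# toy example: 1000 ranked compounds, 20 known actives, 5 of them in the top 1%
ranked = [f"cpd{i}" for i in range(1000)]
actives = {f"cpd{i}" for i in range(5)} | {f"cpd{i}" for i in range(500, 515)}
print(enrichment_factor(ranked, actives, fraction=0.01))  # 25.0
```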

In silico chemogenomics for lead finding and target finding

Target family classes tend to bind similar chemical structures. This binding infidelity, which results from the structural homology of active sites, is now being exploited to accelerate library design and lead discovery for multiple target families, such as kinases [51,52], GPCRs [53] and nuclear receptors [54]. Biologically annotated chemical databases are enabling large-scale automated mapping of chemical structure motifs to targets and target families. For instance, the response of a compound against a panel of proteins forms a biological spectrum that can be used to cluster compounds [55]. In a similar vein, Vieth et al. created a new dendrogram of kinase relationships from small-molecule selectivity data rather than from protein sequences [52]. There are other ways of organizing protein families without sequence alignment: by superimposing the 'ligand-sensing' cores of multiple protein domains, it is possible to pre-determine structural families that may be amenable to focused libraries [56]. One can even relate proteins in evolutionary terms by comparing the chemical similarity of their natural ligands [57]. Further still, Sheridan and Shpungin computed the similarities of all targets or biological activities in the MDL Drug Data Report [58] using only their associated chemical structures. In the reverse orientation, the biological response of compounds can be a powerful predictor of chemical structure [59]. At Novartis, we are using biologically annotated chemical databases to predict primary targets [60] or off-targets [61] for lead compounds. For example, generic phenotypes such as 'antineoplastic' can be deconvoluted to specific protein targets involved in the disruption of cell growth, simply on the basis of multi-target Bayesian models trained on the chemical classes in a chemogenomics database. The implication for the future is that cheminformatics will expand beyond its traditional lead discovery function of finding new compounds for targets to the role of finding new targets for compounds (i.e., 'in silico chemogenomics').
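As a toy illustration of ligand-based target prediction from an annotated database, the sketch below assigns a query compound to the annotated target whose known ligands are most similar to it by Morgan-fingerprint Tanimoto similarity; this nearest-neighbor scheme, the annotation dictionary and the SMILES are illustrative assumptions and are far simpler than the multi-category Bayesian models described in the text.

```python
# Hedged sketch: nearest-neighbor target prediction from a biologically
# annotated compound set (toy data; a real system would use thousands of
# annotated ligands per target and a trained multi-category model).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

annotated = {                       # hypothetical target annotations
    "kinase_X": ["c1ccc2ncncc2c1", "Nc1ncnc2[nH]cnc12"],
    "GPCR_Y":   ["CCN(CC)CCOc1ccccc1", "CN1CCC(CC1)Oc1ccccc1"],
}

def predict_target(query_smiles):
    q = fp(query_smiles)
    ranking = []
    for target, ligands in annotated.items():
        sim = max(DataStructs.TanimotoSimilarity(q, fp(s)) for s in ligands)
        ranking.append((round(sim, 2), target))
    return sorted(ranking, reverse=True)   # most similar annotated target first

print(predict_target("Nc1ncnc2[nH]ccc12"))   # expected to rank kinase_X first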

Library design and the chemical universe

Existing compound collections cover only a minor fraction of chemical space: estimates of the total number of possible organic molecules range from 10^18 to 10^200 compounds [62]. Screening collections are often criticized for their lack of novelty because they contain chemotypes that were synthesized for other projects or purchased from external vendors. Schuffenhauer and coworkers [63] have pointed out that many compound libraries are a collection of tight islands of chemical diversity that leave a large portion of chemical space unexplored. It is estimated that the screening collections of the world's pharmaceutical companies collectively target only around 1000 proteins [64]. Despite the massive increase in the number of compounds in a typical screening collection, there has not been an equivalent increase in the success rate of screening [65]. Hit rates for HTS typically range from 0.01 to 1%, but, for this reason, the hits cover a relatively narrow band of chemotypes. Indeed, it is not uncommon for an HTS campaign to yield a higher hit rate for a kinase target than for a protein–protein interaction (PPI) target because, historically, synthetic efforts have focused more on ATP-competitive kinase inhibitors than on PPIs. The accepted prerequisite for successful lead discovery is a high-quality compound screening collection [65], and simply increasing the size of the screening collection without any attention to its quality or design is certainly a thing of the past. In particular, scaffold diversity, rather than global diversity of the library, is seen as important, because the latter can result in under-represented scaffolds and the danger of missing entire active series [66]. There is now less emphasis on simple combinatorial libraries [67] and more on target-focused libraries [34,68], libraries created around privileged scaffolds [69] and libraries that emphasize specific traits, such as fragment-based libraries [70]. In recent years there has been an emphasis on differentiating drug-like from lead-like compounds and on applying this distinction to the design of screening libraries [71–73]. However, although these properties may intuitively influence the success of lead development, and later drug candidacy, it is the lack of diversity and novelty of many corporate collections that is of concern. Rather than focusing solely on higher hit rates (by front-loading the lead-like selection criteria), there should be a drive to maximize the information content of the hits. To address the lack of chemotype variance in many corporate collections, Novartis has used in silico analysis to design libraries with greater ring diversity; we have found the number of aromatic scaffolds represented in bioactive molecules to be very small in comparison with what is synthetically feasible. Issues of molecular complexity and ligand efficiency have also been a topic of debate. A number of studies [74,75] have suggested that increasing bioactivity is correlated with reduced molecular complexity. However, molecular complexity is often associated with specificity and selectivity, and can provide higher information content. Schreiber [76,77] and Schuffenhauer [78] have both emphasized synthetically more complex, biologically relevant molecules as a strategy for library design. All of the above approaches now rely heavily on in silico methods to create the best possible screening library (indeed, Novartis incorporates a natural product component in its screening collection). Arguably the most influential applications of prospective in silico methods are those applied to the design and creation of screening libraries, including virtual compounds not yet synthesized. Here, in silico screening can have a dramatic influence upon the success rate of finding good, tractable hits and leads. It is certain that library design philosophies, and the corporate collections that they feed, will continue to evolve. What is also certain is that successful strategies in targeted libraries, compound purchasing and overall library design will continue to be influenced by in silico methods.
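To make the distinction between scaffold diversity and global diversity concrete, the sketch below computes two simple library metrics with RDKit: the fraction of unique Murcko frameworks and the mean pairwise fingerprint distance. The metrics, names and toy library are illustrative assumptions, not the diversity measures actually used in the Novartis analysis.

```python
# Hedged sketch: two crude library diversity measures. A library can look
# globally diverse (high mean pairwise distance) while still being built on
# very few scaffolds (low unique-framework fraction).
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold

library = ["c1ccc2[nH]ccc2c1CCN", "c1ccc2[nH]ccc2c1CC(=O)O",
           "c1ccc2[nH]ccc2c1CCCCCl", "c1ccc2[nH]ccc2c1C(C)N"]  # one scaffold, varied side chains

mols = [Chem.MolFromSmiles(s) for s in library]
frameworks = {Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(m)) for m in mols}
scaffold_diversity = len(frameworks) / len(mols)          # 0.25 for this toy set

fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
dists = [1.0 - DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
global_diversity = sum(dists) / len(dists)                # moderate despite one scaffold

print(round(scaffold_diversity, 2), round(global_diversity, 2))
```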

Conclusions

As it is practiced now at Novartis, the role of in silico lead discovery is to work in parallel with the HTS process and to focus on improving the quality of the output of hit finding, rather than simply serving as a technique to reduce the experimental costs associated with screening. In certain cases, where it is not feasible to screen the entire deck, it is a complementary technique. In some situations it can be employed to replace HTS altogether; however, the ability of in silico methods alone to reliably and reproducibly find leads remains challenging. The timelines of HTS also make the effective application of in silico methods difficult (Figure 2). Practical limitations, such as cherry-picking capacity for subset creation, along with institutional mistrust of relying on purely in silico methods, combine to make in silico-driven lead discovery uncertain. Many of these limitations can be alleviated by the organizational structure. At Novartis, the in silico groups are integrated into the global lead finding department and are coupled tightly to compound purchasing and the HTS process, optimizing the ability of in silico methods to affect the screening process at the target identification, library design, data analysis, hit triaging and hit list expansion steps (Figure 2).

Figure 2. A realistic time model of full random screening with in-parallel in silico analysis, versus sequential screening.

Over the past few years, the challenges that in silico screening has faced have also changed. There is now less emphasis on traditional structure-based design approaches, which continue to suffer from limitations that have existed for five years (e.g., the accuracy of scoring functions and the flexibility of the target protein), and more emphasis on pragmatic cheminformatics tools that have advanced significantly. Concepts such as data pipelining (e.g., SciTegic's Pipeline Pilot, version 5.1, 2004) and data visualization (e.g., Spotfire's DecisionSite, version 8.2, 2006) allow the analysis and manipulation of large-scale data in minutes; this permits in silico hit discovery to keep pace with the HTS process. Technically, cheminformatics has also moved on from a traditional computational chemistry view to encompass data mining methods such as Laplacian-modified naive Bayes, recursive partitioning and support vector machines. There is also greater appreciation of the power of 3-D ligand-based methods (e.g., FEPOPS, ROCS) as the limitations of 2-D approaches have been realized. Finally, the increase in biologically annotated chemical databases has made possible knowledge-based cheminformatics for the purposes of target- and family-focused in silico screening and target prediction for orphan compounds. Over the next few years, the process of lead discovery will change and optimize (e.g., more cellular screening, greater miniaturization, higher-content readouts); the speed and scalability of current in silico technologies, particularly cheminformatics and data pipelining, will allow in silico tools to keep pace with the evolution of experimental screening. Further, we believe that in silico screening will become increasingly influential if it is used to supplement and streamline the process of lead discovery, rather than as a replacement technology for experimental screening as once envisioned.

Acknowledgements

This manuscript is dedicated to the memory of Dr Pierre Acklin, who initiated the in silico screening department at Novartis. The authors would like to thank Dr Rene Amstutz and Dr Dejan Bojanic for their support and feedback.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as being of special interest or of outstanding interest.

1. Schnecke V, Bostrom J: Computational chemistry-driven decision making in lead generation. Drug Discov Today 2006, 11:43-50. An excellent review article on modern in silico screening.
2. Shoichet BK: Virtual screening of chemical libraries. Nature 2004, 432:862-865.
3. Chin DN, Chuaqui CE, Singh J: Integration of virtual screening into the drug discovery process. Mini Rev Med Chem 2004, 4:1053-1065.
4. Young SS, Lam RL, Welch WJ: Initial compound selection for sequential screening. Curr Opin Drug Discov Devel 2002, 5:422-427.
5. Baringhaus KH, Hessler G: Fast similarity searching and screening hit analysis. Drug Discov Today Technol 2004, 1:197-202.
6. Pearce BC, Sofia MJ, Good AC, Drexler DM, Stock DA: An empirical process for the design of high-throughput screening deck filters. J Chem Inf Model 2006, 46:1060-1068.
7. Walters WP, Namchuk M: Designing screens: how to make your hits a hit. Nat Rev Drug Discov 2003, 2:259-266.
8. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 2001, 46:3-26.
9. Tan DS: Current progress in natural product-like libraries for discovery screening. Comb Chem High Throughput Screen 2004, 7:631-643.
10. Clardy J, Walsh C: Lessons from natural molecules. Nature 2004, 432:829-837.
11. Reayi A, Arya P: Natural product-like chemical space: search for chemical dissectors of macromolecular interactions. Curr Opin Chem Biol 2005, 9:240-247.
12. Karnachi PS, Brown FK: Practical approaches to efficient screening: information-rich screening protocol. J Biomol Screen 2004, 9:678-686.
13. Weaver DC: Applying data mining techniques to library design, lead generation and lead optimization. Curr Opin Chem Biol 2004, 8:264-270.
14. Rogers D, Brown RD, Hahn M: Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening follow-up. J Biomol Screen 2005, 10:682-686.
15. Xia X, Maliski EG, Gallant P, Rogers D: Classification of kinase inhibitors using a Bayesian model. J Med Chem 2004, 47:4463-4470. The first noted paper on the use of Bayesian modeling of HTS data.
16. Bender A, Mussa HY, Glen RC, Reiling S: Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance. J Chem Inf Comput Sci 2004, 44:1708-1718.
17. Rusinko A, Farmen MW, Lambert CG, Brown PL, Young SS: Analysis of a large structure/biological activity data set using recursive partitioning. J Chem Inf Comput Sci 1999, 39:1017-1026.
18. Vapnik V: Statistical Learning Theory. New York: Wiley; 1998.
19. Glick M, Jenkins JL, Nettles JH, Hitchings H, Davies JW: Enrichment of high-throughput screening data with increasing levels of noise using support vector machines, recursive partitioning, and Laplacian-modified naive Bayesian classifiers. J Chem Inf Model 2006, 46:193-200.
20. Engels MF, Venkatarangan P: Smart screening: approaches to efficient HTS. Curr Opin Drug Discov Devel 2001, 4:275-283.
21. Shanmugasundaram V, Maggiora GM, Lajiness MS: Hit-directed nearest-neighbor searching. J Med Chem 2005, 48:240-248.
22. Glick M, Klon AE, Acklin P, Davies JW: Enrichment of extremely noisy high-throughput screening data using a naive Bayes classifier. J Biomol Screen 2004, 9:32-36.
23. Makino S, Kuntz ID: Automated flexible ligand docking method and its application for database search. J Comput Chem 1997, 18:1812-1825.
24. Rarey M, Kramer B, Lengauer T, Klebe G: A fast flexible docking method using an incremental construction algorithm. J Mol Biol 1996, 261:470-489.
25. Jones G, Willett P, Glen RC, Leach AR, Taylor R: Development and validation of a genetic algorithm for flexible docking. J Mol Biol 1997, 267:727-748.
26. Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, Repasky MP, Knoll EH, Shelley M, Perry JK et al.: Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 2004, 47:1739-1749.
27. Abagyan R, Totrov M, Kuznetsov D: ICM - a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation. J Comput Chem 1994, 15:488-506.
28. Chen H, Lyne PD, Giordanetto F, Lovell T, Li J: On evaluating molecular-docking methods for pose prediction and enrichment factors. J Chem Inf Model 2006, 46:401-415.
29. Wang R, Wang S: How does consensus scoring work for virtual library screening? An idealized computer experiment. J Chem Inf Comput Sci 2001, 41:1422-1426.
30. Klon AE, Glick M, Thoma M, Acklin P, Davies JW: Finding more needles in the haystack: a simple and efficient method for improving high-throughput docking results. J Med Chem 2004, 47:2743-2749.

31. Klon AE, Glick M, Davies JW: Application of machine learning to improve the results of high-throughput docking against the HIV-1 protease. J Chem Inf Comput Sci 2004, 44:2216-2224.
32. Klon AE, Glick M, Davies JW: Combination of a naive Bayes classifier with consensus scoring improves enrichment of high-throughput docking results. J Med Chem 2004, 47:4356-4359.
33. Deng Z, Chuaqui C, Singh J: Structural interaction fingerprint (SIFt): a novel method for analyzing three-dimensional protein-ligand binding interactions. J Med Chem 2004, 47:337-344.
34. Deng Z, Chuaqui C, Singh J: Knowledge-based design of target-focused libraries using protein-ligand interaction constraints. J Med Chem 2006, 49:490-500.
35. Doman TN, McGovern SL, Witherbee BJ, Kasten TP, Kurumbail R, Stallings WC, Connolly DT, Shoichet BK: Molecular docking and high-throughput screening for novel inhibitors of protein tyrosine phosphatase-1B. J Med Chem 2002, 45:2213-2221.
36. Fishman MC, Porter JA: Pharmaceuticals: a new grammar for drug discovery. Nature 2005, 437:491-493.
37. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A: New methods for ligand-based virtual screening: use of data fusion and machine learning to enhance the effectiveness of similarity searching. J Chem Inf Model 2006, 46:462-470.
38. Good AC, Hermsmeier MA, Hindle SA: Measuring CAMD technique performance: a virtual screening case study in the design of validation experiments. J Comput Aided Mol Des 2004, 18:529-536.
39. Renner S, Schneider G: Fuzzy pharmacophore models from molecular alignments for correlation-vector-based virtual screening. J Med Chem 2004, 47:4653-4664.
40. Jilek RJ, Cramer RD: Topomers: a validated protocol for their self-consistent generation. J Chem Inf Comput Sci 2004, 44:1221-1227.
41. Harper G, Bravi GS, Pickett SD, Hussain J, Green DV: The reduced graph descriptor in virtual screening and data-driven clustering of high-throughput screening data. J Chem Inf Comput Sci 2004, 44:2145-2156.
42. Jenkins JL, Glick M, Davies JW: A 3D similarity method for scaffold hopping from known drugs or natural ligands to new chemotypes. J Med Chem 2004, 47:6144-6159.
43. Bender A, Mussa HY, Gill GS, Glen RC: Molecular surface point environments for virtual screening and the elucidation of binding patterns (MOLPRINT 3D). J Med Chem 2004, 47:6569-6583.
44. Haigh JA, Pickup BT, Grant JA, Nicholls A: Small molecule shape-fingerprints. J Chem Inf Model 2005, 45:673-684.
45. Low CM, Buck IM, Cooke T, Cushnir JR, Kalindjian SB, Kotecha A, Pether MJ, Shankley NP, Vinter JG, Wright L: Scaffold hopping with molecular field points: identification of a cholecystokinin-2 (CCK2) receptor pharmacophore and its use in the design of a prototypical series of pyrrole- and imidazole-based CCK2 antagonists. J Med Chem 2005, 48:6790-6802.
46. Zhang Q, Muegge I: Scaffold hopping through virtual screening using 2D and 3D similarity descriptors: ranking, voting, and consensus scoring. J Med Chem 2006, 49:1536-1548.
47. Barker EJ, Buttar D, Cosgrove DA, Gardiner EJ, Kitts P, Willett P, Gillet VJ: Scaffold hopping using clique detection applied to reduced graphs. J Chem Inf Model 2006, 46:503-511.
48. Stiefl N, Watson IA, Baumann K, Zaliani A: ErG: 2D pharmacophore descriptions for scaffold hopping. J Chem Inf Model 2006, 46:208-220.
49. Renner S, Schneider G: Scaffold-hopping potential of ligand-based similarity concepts. ChemMedChem 2006, 1:181-185.
50. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A: Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures. J Chem Inf Comput Sci 2004, 44:1177-1185.
51. ter Haar E, Walters WP, Pazhanisamy S, Taslimi P, Pierce AC, Bemis GW, Salituro FG, Harbeson SL: Kinase chemogenomics: targeting the human kinome for target validation and drug discovery. Mini Rev Med Chem 2004, 4:235-253.
52. Vieth M, Higgs RE, Robertson DH, Shapiro M, Gragg EA, Hemmerle H: Kinomics - structural biology and chemogenomics of kinase inhibitors and targets. Biochim Biophys Acta 2004, 1697:243-257. This work underscores the idea that chemical similarity of small-molecule inhibitors may be more useful than protein sequences to determine target family phylogeny, at least from a lead discovery perspective.
53. Jimonet P, Jager R: Strategies for designing GPCR-focused libraries and screening sets. Curr Opin Drug Discov Devel 2004, 7:325-333.
54. Cases M, Garcia-Serna R, Hettne K, Weeber M, van der Lei J, Boyer S, Mestres J: Chemical and biological profiling of an annotated compound library directed to the nuclear receptor family. Curr Top Med Chem 2005, 5:763-772.
55. Fliri AF, Loging WT, Thadeio PF, Volkmann RA: Biological spectra analysis: linking biological activity profiles to molecular structure. Proc Natl Acad Sci USA 2005, 102:261-266. In the tradition of affinity fingerprints, biological activity spectra from single point assays are utilized to group compounds in a predictive manner.
56. Koch MA, Schuffenhauer A, Scheck M, Wetzel S, Casaulta M, Odermatt A, Ertl P, Waldmann H: Charting biologically relevant chemical space: a structural classification of natural products (SCONP). Proc Natl Acad Sci USA 2005, 102:17272-17277.
57. Nobeli I, Spriggs RV, George RA, Thornton JM: A ligand-centric analysis of the diversity and evolution of protein-ligand relationships in E. coli. J Mol Biol 2005, 347:415-436. Further advancing the idea of using cheminformatics to judge protein relationships, the authors explore the chemical similarity of endogenous ligands for protein families.
58. Sheridan RP, Shpungin J: Calculating similarities between biological activities in the MDL Drug Data Report database. J Chem Inf Comput Sci 2004, 44:727-740. By measuring activity–activity similarities via chemical similarity, the authors demonstrate how chemogenomics databases can be better organized by automated methods.
59. Wallqvist A, Huang R, Covell DG, Roschke AV, Gelhaus KS, Kirsch IR: Drugs aimed at targeting characteristic karyotypic phenotypes of cancer cells. Mol Cancer Ther 2005, 4:1559-1568.
60. Nidhi N, Glick M, Davies JW, Jenkins JL: Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases. J Chem Inf Model 2006, 46:1124-1133. One of the first papers on chemogenomic-based target fishing.
61. Hamon J, Azzaoui K, Whitebread S, Urban L, Jacoby E, Faller B: In vitro safety pharmacology profiling. Eur Pharm Rev 2006, 1:60-63.
62. Fink T, Bruggesser H, Reymond JL: Virtual exploration of the small-molecule chemical universe below 160 Daltons. Angew Chem Int Ed Engl 2005, 44:1504-1508.
63. Schuffenhauer A, Popov M, Schopfer U, Acklin P, Stanek J, Jacoby E: Molecular diversity management strategies for building and enhancement of diverse and focused lead discovery compound screening collections. Comb Chem High Throughput Screen 2004, 7:771-781. This work thoroughly covers all aspects of diverse and focused screening collections.
64. Lipinski C, Hopkins A: Navigating chemical space for biology and medicine. Nature 2004, 432:855-861.
65. Bleicher KH, Bohm HJ, Muller K, Alanine AI: Hit and lead generation: beyond high-throughput screening. Nat Rev Drug Discov 2003, 2:369-378.
66. Nilakantan R, Immermann F, Haraki K: A novel approach to combinatorial library design. Comb Chem High Throughput Screen 2002, 5:105-110.
67. Weber L: Current Status of Virtual Combinatorial Library Design. Wiley-VCH Verlag GmbH & Co.; 2005.
68. Orry AJ, Abagyan RA, Cavasotto CN: Structure-based development of target-specific compound libraries. Drug Discov Today 2006, 11:261-266.
69. DeSimone RW, Currie KS, Mitchell SA, Darrow JW, Pippin DA: Privileged structures: applications in drug discovery. Comb Chem High Throughput Screen 2004, 7:473-494.
70. Schuffenhauer A, Ruedisser S, Marzinzik AL, Jahnke W, Blommers M, Selzer P, Jacoby E: Library design for fragment based screening. Curr Top Med Chem 2005, 5:751-762.
71. Oprea TI: Cheminformatics and the quest for leads in drug discovery. In Handbook of Cheminformatics, Edn 4. VCH Wiley; 2003:1508-1531.
72. Oprea TI, Davis AM, Teague SJ, Leeson PD: Is there a difference between leads and drugs? A historical perspective. J Chem Inf Comput Sci 2001, 41:1308-1315.
73. Hann MM, Oprea TI: Pursuing the leadlikeness concept in pharmaceutical research. Curr Opin Chem Biol 2004, 8:255-263.
74. Hann MM, Leach AR, Harper G: Molecular complexity and its impact on the probability of finding leads for drug discovery. J Chem Inf Comput Sci 2001, 41:856-864.
75. Hann M, Leach AR, Green DVS: Computational chemistry, molecular complexity and screening set design. In Methods and Principles in Medicinal Chemistry, Edn 23. Edited by Oprea T, Mannhold R, Kubinyi H, Folkers G. Wiley-VCH; 2005:43-57.
76. Burke MD, Berger EM, Schreiber SL: Generating diverse skeletons of small molecules combinatorially. Science 2003, 302:613-618.
77. Burke MD, Schreiber SL: A planning strategy for diversity-oriented synthesis. Angew Chem Int Ed Engl 2004, 43:46-58.
78. Schuffenhauer A, Brown N, Selzer P, Ertl P, Jacoby E: Relationships between molecular complexity, biological activity, and structural diversity. J Chem Inf Model 2006, 46:525-535.