Drug Discovery Today: Technologies
Vol. 1, No. 3 2004
Editors-in-Chief: Kelvin Lam – Pfizer, Inc., USA; Henk Timmerman – Vrije Universiteit, The Netherlands
Lead optimization
Fast similarity searching and screening hit analysis
Karl-Heinz Baringhaus*, Gerhard Hessler
Aventis Pharma Deutschland GmbH (A company of the sanofi-aventis group), Chemistry/Computational Chemistry, Industriepark Hoechst, Building G 878, 65926 Frankfurt am Main, Germany
Similarity searching allows fast identification of analogues of biologically active molecules. Depending on the similarity metric applied, either structurally close analogues or more diverse compounds can be identified. This is of particular interest for the analysis of high-throughput screening (HTS) hits. Combining similarity searching with data mining of HTS data yields early structure–activity relationships that guide the subsequent optimization of hits.

Section Editor: Hugo Kubinyi – University of Heidelberg, Germany
High-throughput screening (HTS) generates large numbers of hits that have to be confirmed and further validated to eliminate false positives. Often, it turns out that active analogues that are chemically closely related to the validated hits were not discovered in the first HTS run (false negatives). Fast similarity searches, such as the pharmacophore property-based Feature Tree searches, can use large numbers of different hits to search databases for similar compounds, irrespective of their scaffolds and/or potential lack of identical partial structures. In this manner, active analogues are discovered, often with new (patentable) scaffolds and substitution patterns.

Introduction
High-throughput screening (HTS) of large compound sets has become an important strategy for hit identification in the early drug discovery phase. Nowadays, several hundred thousand compounds can be routinely screened within weeks or days. The resulting screening hits are further characterized by biophysical and biochemical techniques to exclude false positives early on. However, HTS data are usually very noisy, and efficient data analysis techniques are necessary to extract the most valuable compounds for a specific biological target. Computational techniques such as virtual screening in large databases are used to complement the screening of large historic compound collections. Access to the 3D structure of the target protein allows docking-based virtual screening, which yields high hit rates [1]. Often, only ligands of a target protein are known, which excludes any structure-based approach. However, ligand-based similarity searching offers interesting opportunities for lead finding and is also valuable in the analysis of HTS results.
*Corresponding author: K.-H. Baringhaus. URL: http://www.sanofi-aventis.com/
1740-6749/$ © 2004 Elsevier Ltd. All rights reserved. DOI: 10.1016/j.ddtec.2004.11.001
www.drugdiscoverytoday.com

Similarity searching
Database searching for compounds is frequently and successfully applied to find new drug candidates. Either 2D or 3D techniques, or combinations thereof, are suitable for such virtual-screening-based hit identification [2]. However, the lack of 3D structural information for certain biological targets requires ligand-based similarity searching instead of structure-based approaches. Such similarity searches are based on the "similarity principle" according to Johnson and Maggiora [3]: structurally similar molecules are assumed to have similar physico-chemical and biological properties. This is one of the major assumptions generally followed in screening hit analysis and in lead optimization. Similarity searching requires a proper representation of molecules. Such representations can be based on:
- One-dimensional descriptors derived from compound properties (e.g. volume and log P) [4].
- Two-dimensional descriptors encoding the molecular fragments (e.g. MACCS structural keys) [5].
- Three-dimensional descriptors or spatial pharmacophores (e.g. CoMFA fields, three-point pharmacophores) [6,7].

The final descriptor could be scalar, linear or non-linear. Because of the broad application of structure–activity relationships, an enormous number of different descriptors for similarity searching is available. In this short article, we focus on a few frequently applied 2D descriptors.

The oldest and simplest way of expressing similarity between molecules is on the basis of molecular properties, such as molecular weight or log P. One-dimensional descriptors are primarily applied as rapid filters to reduce virtual compound library sizes (e.g. Lipinski's "Rule of 5") [8].

A more detailed description of molecular structures is captured by linear descriptors that encode structural features of molecules in a binary or real-valued vector. A 2D structural diagram encodes physico-chemical properties as well as reactivity. In addition, some two-dimensional descriptors implicitly contain 3D information as well as information relevant to protein–ligand interactions, such as molecular shape, hydrophobicity, size, flexibility and hydrogen bonding potential [9].

2D fingerprints encode the absence or presence of specific small chemical fragments in a binary or real-valued vector. A list of all possible fragments within a structure is generated and converted into a bitstring, as shown in Fig. 1a. Because of the large number of possible fragments in a molecule, different techniques are used to project fragments onto a finite bitstring. Nevertheless, each position in the bitstring encodes a specific property, such as the occurrence of two specific atoms separated by a specific number of bonds.
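The projection of fragments onto a finite bitstring can be illustrated with a small sketch. It is not the algorithm of any of the commercial fingerprints named in this article: it assumes that fragments are linear atom paths, ignores bond orders, and uses a CRC32 hash to fold fragments into `n_bits` positions. The function names and the 64-bit length are chosen purely for illustration.

```python
import zlib


def linear_paths(atoms, bonds, max_bonds=3):
    """Enumerate linear atom paths with up to max_bonds bonds as strings.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Bond orders are ignored (a simplifying assumption).
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    paths = set()

    def extend(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # A path and its reverse are the same fragment; keep one canonical form.
        paths.add(min(forward, backward))
        if len(path) - 1 < max_bonds:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths


def hashed_fingerprint(atoms, bonds, n_bits=64, max_bonds=3):
    """Fold every fragment onto one of n_bits positions; returns the set bits."""
    return {
        zlib.crc32(frag.encode("ascii")) % n_bits
        for frag in linear_paths(atoms, bonds, max_bonds)
    }


# Heavy-atom graph of ethanol: C-C-O
ETHANOL_ATOMS = ["C", "C", "O"]
ETHANOL_BONDS = [(0, 1), (1, 2)]
print(sorted(linear_paths(ETHANOL_ATOMS, ETHANOL_BONDS)))
print(sorted(hashed_fingerprint(ETHANOL_ATOMS, ETHANOL_BONDS)))
```

Because several fragments can be folded onto the same position, a hashed bit of this kind no longer identifies a single fragment unambiguously; structural keys such as MACCS avoid this by assigning each position to a predefined fragment.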
Frequently used descriptors are MACCS structural keys (MACCS II, MDL Information Systems; http://www.mdli.com/), Daylight (DAYLIGHT Software Manual, Daylight; http://www.daylight.com/) and Unity Fingerprints (UNITY, Chemical Information Software 4.0, Tripos; http://www.tripos.com/). Encoding molecular structures in bitstrings allows quantification of molecular similarity by functions such as the Tanimoto coefficient, which relates the number of bits set in both fingerprints to the number of bits set in at least one of them [10]. A Tanimoto coefficient of 0 means that both structures have no bits in common (no intersection between the sets of fragments), whereas a value of 1 indicates that the fingerprints are identical. Owing to the simplicity of these functions, 2D-based similarity searching is extremely fast, even when applied to huge databases of millions of compounds. However, 2D-based similarity searching yields only molecules that are structurally very similar to the reference compounds. This approach is very useful for following up on screening hits or competitor molecules.
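As a minimal sketch of this similarity function: for two fingerprints with a and b bits set and c bits in common, the Tanimoto coefficient is c / (a + b − c). The code below represents a fingerprint as the set of its "on" bit positions. The example fingerprints and compound names are invented, and returning 0 for two empty fingerprints is a convention assumed here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    # Two empty fingerprints: return 0.0 by convention (an assumption of this sketch).
    return common / union if union else 0.0


def rank_by_similarity(query, library):
    """Return (similarity, name) pairs for a library dict, most similar first."""
    return sorted(
        ((tanimoto(query, fp), name) for name, fp in library.items()),
        reverse=True,
    )


# Invented toy fingerprints, for illustration only.
QUERY = {1, 4, 7, 9}
LIBRARY = {
    "identical": {1, 4, 7, 9},
    "analogue": {1, 4, 7, 12},
    "unrelated": {2, 3},
}
for score, name in rank_by_similarity(QUERY, LIBRARY):
    print(f"{name}: {score:.2f}")
```

Since only set intersections and bit counts are involved, a ranking of this kind scales to very large databases, which is the speed advantage noted above.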
Nevertheless, these two-dimensional descriptors cannot adequately detect the molecular similarity of pairs of compounds that are structurally different but show similar binding to a protein. Yet exactly such a similarity measure is needed for scaffold hopping and rescaffolding attempts. Such a move from a given compound in one structural class to a compound with similar binding and activity in another structural class is becoming more and more important, both to circumvent patent claims and to focus on the most promising chemical matter for a particular target.

To this end, a significant improvement of similarity searching was achieved by Rarey and Dixon [11]. Their "Feature Tree" (FTree) descriptor has a tree structure (Fig. 1b) in which nodes represent the fragments or functional groups of the molecule and the tree captures the overall topology. Because nodes are labeled according to the pharmacophoric properties of the underlying fragments, FTrees produce a more detailed description of physico-chemical properties and thus bridge classical properties of the 2D formula with their relative arrangement in the molecule. The similarity of two molecules is determined by matching the two corresponding trees, and a numerical similarity score (0–1) quantifies the quality of the matching. The matching relates the nodes in one tree to the nodes in the second tree while keeping the topology of the trees and maximizing the similarity score. Hence, a comparison of two FTrees detects similarity not only in terms of molecular interaction points (through the nodes) but also with respect to molecular shape (via the tree). As expected, this descriptor outperforms classical 2D fingerprint approaches [12]. Similarity searching with the FTree descriptor retrieves not only structurally very similar molecules (as 2D fingerprints do) but also structurally more diverse compounds with different scaffolds.
Hence, FTrees offer additional opportunities for hit finding, although FTree searching is two to three orders of magnitude slower than similarity searching with conventional linear descriptors.

The CATS descriptor belongs to the category of atom-pair descriptors and encodes topological pharmacophore information (Fig. 1c) [13]. This descriptor turned out to be well suited for 2D-based rescaffolding of biologically active molecules. Recent successful examples are available in the literature, and CATS-based similarity searching is often used in parallel with FTree searches.

Fig. 1d shows the similarity of three CDK2 inhibitors calculated with Unity fingerprints and FTrees. FTrees detect some similarity between these compounds, whereas Unity fingerprints classify them as much more diverse.

Because biological activity and selectivity are closely linked to the ligand's 3D structure (the spatial arrangement of pharmacophores), three-dimensional descriptors have been derived for similarity searching as well. The term "pharmacophoric triplets" refers to sets of three pharmacophoric points within a molecule (e.g. donor–acceptor–hydrophobic). Each possible triplet, together with its point-to-point distances, can be encoded in a fingerprint-like manner [7]. Individual bits refer to different triangles formed between potential pharmacophoric points. The pharmacophoric definitions reflect biologically relevant interaction features. Usually, 27 distance bins (from 2.5 to 15 Å) and five pharmacophoric types are used (cf. Fig. 1c), resulting in 307,020 bits encoding triangle geometries. To deal with the flexibility of compounds, 3D descriptors are often generated from an ensemble of conformers. However, this assumes that the 3D features in the descriptor exist simultaneously, which is certainly wrong because they emerge from different conformers. This additional noise in the descriptor results in a lower performance of 3D descriptors in comparison to 2D descriptors [9].

Figure 1. Two-dimensional descriptors for similarity searching. (a) A molecule is split into all possible fragments to project the structure into a bitstring. Similarity can be described by the Tanimoto coefficient, which compares the bits that a pair of molecules has in common. (b) The 2D structure is converted into its "Feature Tree". The tree is shown in black, whereas the colored nodes represent the different functional groups in the molecule. (c) A 2D structure is converted into a molecular graph. Pharmacophoric points are assigned to each node of the graph according to their definition. The calculation of the distance matrix then results in a 150-dimensional correlation vector for a molecule (CATS descriptor). (d) "Feature Tree" (red) and Unity 2D fingerprint (black) pair-wise similarity of Flavopiridol, Roscovitine and NU6102.
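The topological pharmacophore idea behind Fig. 1c (typed nodes, a bond-distance matrix and a 150-dimensional correlation vector) can be sketched as follows. This is an illustration and not the published CATS implementation. It assumes five type labels (D, A, P, N, L), at most one type per atom, topological distances of 0 to 9 bonds, and raw counts without scaling; with 15 unordered type pairs and 10 distances this happens to give 150 dimensions.

```python
from collections import deque
from itertools import combinations_with_replacement

# Assumed labels: donor, acceptor, positive, negative, lipophilic.
TYPES = ("D", "A", "P", "N", "L")
PAIRS = list(combinations_with_replacement(TYPES, 2))  # 15 unordered type pairs
MAX_DIST = 10  # distances of 0..9 bonds (assumed), giving 15 * 10 = 150 entries


def bond_distances(n_atoms, bonds):
    """Shortest path lengths (in bonds) from every atom, by breadth-first search."""
    nbrs = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    all_dist = []
    for start in range(n_atoms):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in nbrs[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        all_dist.append(dist)
    return all_dist


def cats_like_vector(atom_types, bonds):
    """Count typed atom pairs per topological distance.

    atom_types: one label from TYPES, or None, per atom.
    """
    dist = bond_distances(len(atom_types), bonds)
    vec = [0] * (len(PAIRS) * MAX_DIST)
    for i, ti in enumerate(atom_types):
        for j in range(i, len(atom_types)):
            tj = atom_types[j]
            d = dist[i].get(j)  # None if the atoms are not connected
            if ti is None or tj is None or d is None or d >= MAX_DIST:
                continue
            pair = tuple(sorted((ti, tj), key=TYPES.index))
            vec[PAIRS.index(pair) * MAX_DIST + d] += 1
    return vec


# Invented four-atom chain: donor, untyped atom, lipophilic atom, acceptor.
TOY_TYPES = ["D", None, "L", "A"]
TOY_BONDS = [(0, 1), (1, 2), (2, 3)]
toy_vector = cats_like_vector(TOY_TYPES, TOY_BONDS)
print(len(toy_vector), sum(toy_vector))
```

Two such vectors can then be compared with any vector distance or similarity function. Because only pharmacophore types and bond distances enter the descriptor, molecules with different scaffolds can receive similar vectors, which is the property exploited for rescaffolding.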
Unity2D and MACCS keys primarily retrieve structurally very similar molecules with respect to reference molecules, whereas FTrees and CATS are also able to extract structurally diverse compounds (Table 1). Numerous other 2D and 3D descriptors for similarity searching are available and we certainly suggest exploiting their potential to select the most appropriate descriptor for a particular problem.
Table 1. Comparison of similarity searching tools

| | Technology 1 | Technology 2 | Technology 3 |
|---|---|---|---|
| Name of specific type of technology | 2D substructure fingerprints | Feature Tree | Topological pharmacophores |
| Names of specific technologies and associated companies | Daylight (Daylight); MACCS (MDL Information Systems); Unity (UNITY, Tripos) | FTree (BioSolveIT) | CATS |
| Pros | Very fast; well established; broad experience | Fast; scaffold hopping; broad experience | Fast; scaffold hopping |
| Cons | No scaffold hopping | Not applicable to macrocycles and fused ring systems | Limited experience yet |
| References | http://www.daylight.com/; http://www.mdli.com/; http://www.tripos.com/ | [11] | [13] |

Screening hit analysis
HTS is one of the most important methods for lead identification. Usually, hundreds of thousands of compounds are tested for their biological activity on the target. The result of HTS is a list of compounds ranked by their biological activity. The easiest follow-up strategy considers only the most active compounds when selecting candidates for optimization. Nevertheless, the large amount of data offers the chance to extract more information. Thus, the goal of screening hit analysis, besides identifying the most promising compounds, is to extract structure–activity relationships (SAR) and to identify a pharmacophore, that is, the arrangement of molecular fragments necessary for biological activity. A typical workflow for screening hit analysis is shown in Fig. 2.
Figure 2. Workflow of screening hit analysis. Confirmed HTS hits serve as queries for similarity searching to expand the initial SAR. Machine learning then derives improved SAR models. Suitable lead series have to meet additional criteria, such as selectivity against closely related targets or reasonable ADMET properties.
However, the analysis of large screening campaigns is hampered by drawbacks such as poor data quality and false-positive or false-negative results. Some false positives are artifacts of the assay system, for example, because of interference with the read-out. Therefore, it is important to validate active compounds by retesting in an orthogonal assay system as well as by confirming compound identity with mass spectrometry. In many companies, historic screening collections still contain undesired compounds, for example, compounds carrying reactive fragments, very lipophilic compounds and frequent hitters. Such compounds are usually removed from the candidate list before any further analysis is carried out.

Each screening campaign is usually performed on a fixed set of compounds. Thus, there are numerous close analogues of active compounds that have not been screened, either in the in-house collection or from external vendors. Such compounds can easily be identified with substructure searches and 2D similarity searches. A similarity-based selection of compounds is expected to fail if the query compound is a false positive or if the database searched lacks similar compounds. As pointed out by Martin et al. [4], similarity searches show significantly better enrichment than random picking, but do not guarantee high hit rates. On large data sets, it was found that about 30% of the analogues were active.

In the initial phase of HTS analysis, descriptors that select close analogues are preferred. This strategy ensures that the SAR picture is completed with all compounds available from the in-house collection, so that the synthesis of new compounds can be avoided at this stage.

Similarity descriptors are also important for clustering the screening data set [14]. Usually, pure clusters containing only actives or only inactives are obtained, as well as mixed clusters that contain both active and inactive compounds.
Such mixed clusters are helpful in recognizing modifications that increase or decrease biological activity, but one should not forget that in large screening sets negative data are usually not confirmed. Thus, mixed clusters can also be used to suggest compounds for retesting, for example to identify false negatives. Although clustering techniques are helpful in visualizing and structuring the data set, they do not enable one to derive models for the biological activity. Nevertheless, an understanding of the SAR could help in prioritizing the next steps.

Different machine learning techniques are available for data mining of large data sets, which is a prerequisite for analyzing HTS data. Recursive partitioning (RP) derives rules to classify compounds into groups with similar activity. Each rule is formed by a hierarchically arranged set of descriptor values. RP has been shown to be efficient in extracting 2D [15] or 3D SAR [16] from large data sets. Recently, support vector machines (SVMs) have been used for the analysis of screening data sets [17]. SVMs identify the hyperplane in the multi-dimensional descriptor space (called the maximum margin hyperplane) that best separates actives from inactives. Although SVMs can successfully be used to select active compounds in a virtual screening scenario, it is difficult to extract the relevant features and thereby to gain an understanding of the SAR.

The extraction of structure–activity relationships from HTS relies not only on 2D descriptors, but also on substructure recognition. LeadScope (http://www.leadscope.com/) uses a set of more than 27,000 molecular fragments. For each fragment, the enrichment among actives is calculated and can be used to recognize important molecular features. Recently, LeadScope was enhanced with an RP approach. In this implementation, up to three descriptors can be used for partitioning [18]. The respective combination of descriptors is selected by a simulated annealing procedure from the vast number of potential combinations.
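The idea of a per-fragment enrichment among actives can be sketched as follows. This is not LeadScope's statistic, which is not specified here: as an assumption, enrichment is taken as the hit rate among compounds containing a fragment divided by the overall hit rate of the data set. The fragment names and activity labels are invented.

```python
def fragment_enrichment(compounds):
    """Per-fragment enrichment among actives.

    compounds: list of (set_of_fragment_names, is_active) tuples.
    Returns {fragment: (n_compounds, n_actives, enrichment)}, where
    enrichment = (hit rate with fragment) / (overall hit rate).
    """
    total = len(compounds)
    n_active = sum(1 for _, active in compounds if active)
    if total == 0 or n_active == 0:
        return {}
    base_rate = n_active / total
    counts = {}
    for fragments, active in compounds:
        for frag in fragments:
            n, hits = counts.get(frag, (0, 0))
            counts[frag] = (n + 1, hits + (1 if active else 0))
    return {
        frag: (n, hits, (hits / n) / base_rate)
        for frag, (n, hits) in counts.items()
    }


# Invented screening data: fragment sets with an active/inactive label.
SCREEN = [
    ({"amide", "pyridine"}, True),
    ({"amide", "phenol"}, True),
    ({"phenol"}, False),
    ({"pyridine", "ester"}, False),
    ({"ester"}, False),
    ({"amide"}, False),
]
result = fragment_enrichment(SCREEN)
for frag, (n, hits, enrichment) in sorted(result.items(), key=lambda kv: -kv[1][2]):
    print(f"{frag}: {hits}/{n} active, enrichment {enrichment:.2f}")
```

With noisy HTS data, fragments that occur in only a few compounds give unreliable ratios, so in practice a minimum count or a significance test would be applied before a fragment is interpreted as an SAR feature.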
Sequential screening
The standard approach is to run HTS with a large compound set and then to analyze the results to identify the most promising compounds and to extract SAR information. However, this approach is quite costly for some targets (e.g. for ion channels). Furthermore, it is only possible if an HTS-capable assay can be developed. In other cases, sequential screening can be useful. In sequential screening, the compounds are tested iteratively (Fig. 3). Initially, a screening set is tested and SAR is extracted. In the following iterations, compounds are selected based on the SAR model (Fig. 2). The SAR model is then iteratively refined with the new test results [19]. Various machine-learning approaches have been described, such as RP or SVM. The fraction of actives found in the iterative screening cycles varies between 10% and approximately 50%, depending on the size of the samples and the number of iterations.
Figure 3. Workflow of sequential screening. An initial screening set (e.g. a diverse set of compounds) is screened against a target. The biological results are used to derive a model by various machine learning techniques that are applied to select new compounds for testing.
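The iterative workflow can be sketched as a simple loop. The sketch swaps in a deliberately simple model, the maximum Tanimoto similarity to the actives found so far, in place of the RP or SVM models discussed in the text. The `assay` callable stands in for the experimental test, and the library, the activity labels and all parameter names are invented for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sequential_screen(library, assay, initial, batch_size, n_rounds):
    """Iteratively test compounds chosen by a similarity-based model.

    library: {name: fingerprint set}; assay: callable name -> bool
    (placeholder for the wet-lab test); initial: names of the first set.
    Returns {name: measured activity} for all tested compounds.
    """
    tested = {name: assay(name) for name in initial}
    for _ in range(n_rounds):
        actives = [library[n] for n, hit in tested.items() if hit]
        untested = [n for n in library if n not in tested]
        if not actives or not untested:
            break
        # Model (assumed): score = best similarity to any known active.
        ranked = sorted(
            untested,
            key=lambda n: (-max(tanimoto(library[n], a) for a in actives), n),
        )
        for name in ranked[:batch_size]:
            tested[name] = assay(name)
    return tested


# Invented toy library and hidden "true" activities.
TOY_LIBRARY = {
    "c1": {1, 2, 3}, "c2": {1, 2, 4}, "c3": {1, 2, 3, 4}, "c4": {7, 8, 9},
    "c5": {7, 8}, "c6": {10, 11}, "c7": {1, 3, 5},
}
TRUE_ACTIVES = {"c1", "c3", "c7"}


def toy_assay(name):
    return name in TRUE_ACTIVES


outcome = sequential_screen(TOY_LIBRARY, toy_assay, ["c1", "c4", "c6"], batch_size=2, n_rounds=1)
print(outcome)
```

In this toy run the second-round compounds are chosen from the neighbourhood of the single active found in the first set. A real application would refit a proper SAR model after every round and would also have to respect the compound logistics constraints on batch size and number of iterations.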
A successful example of sequential screening was recently performed on 14 different G-protein coupled receptors (GPCRs) [20]. Initially, a diverse set of compounds was screened against these targets. Atom pairs and topological torsions were used to derive an early SAR by recursive partitioning. This model was applied to derive a second screening set that showed a significantly higher hit rate than the diversity-based first set of compounds. RP also highlighted features relevant for the biological activity.

Self-organizing neural networks are used to classify compounds by their similarity to a training set. They provide a low-dimensional representation of a high-dimensional space. Compounds are sorted into neurons based on their similarity to the compounds already in those neurons. Any new compound that is mapped to a hit neuron is defined as active. A retrospective analysis showed that a self-organizing map finds up to 96% of the hits in the test data set [21].

Sequential screening can save costs, because the number of tested compounds can be significantly reduced, but the iterative testing of compounds also requires some effort in compound logistics. Therefore, the number of iterations and the number of compounds that can be selected per iteration are limited. Nevertheless, the use of SAR information in second-round testing can increase hit rates and can give initial guidance for compound optimization.
Conclusion
Similarity searching has been established as an indispensable tool for identifying analogues of known active compounds. Therefore, it is a technique widely used in rational approaches as well as for the analysis of screening hits. The selection of compounds similar to identified biologically active compounds is a routine process in the follow-up of an HTS campaign to expand the SAR around a hit structure. Although no single descriptor has been proven to work consistently in every application or on every data set, 2D fingerprint descriptors, such as MACCS keys or Unity fingerprints, are state-of-the-art descriptors in this approach. Nevertheless, the choice of the appropriate descriptor for virtual screening, especially for scaffold hopping, is problem-dependent. No general rule can be given, and therefore the simultaneous application of different descriptors is advisable.

Regression or classification tools enable the extraction of features or molecular fragments that are relevant for the biological activity from large data sets. Nevertheless, the extraction of SARs from screening data is challenging because HTS data are noisy. The SAR information generated is more reliable if HTS positives are experimentally confirmed by secondary assays and compound analytics before they are used for model generation. In the next step, such SAR information can be used to guide compound optimization.

Sequential screening capitalizes on the capabilities of descriptors and of regression and classification tools for SAR extraction to reduce the screening effort. One difficulty in sequential screening approaches is the design of the initial screening set. To cover a broad chemical space, diverse collections are often used initially. The SAR is then built up stepwise in the subsequent iterations. Different tools have been used for the analysis, such as recursive partitioning, neural networks or SVMs. Currently, no method appears to be superior to the others. Nevertheless, SVMs and neural networks can hardly be used to extract the features relevant for the biological activity. Overall, sequential screening shows some promise for the reduction of screening efforts.

Links
BiosolveIT: http://www.biosolveit.de/
Bioreason: http://www.bioreason.com/
SVM: http://www.kernel-machines.org/software.html; http://www.models.kvl.dk/research/svm_starter/index.asp

Related articles
Engels, M.F.M. et al. (2002) Outlier mining in high throughput screening experiments. J. Biomol. Screen. 7, 341–351
Stahura, F.L. and Bajorath, J. (2004) Virtual screening methods that complement HTS. Comb. Chem. High Throughput Screen. 7, 259–269
Engels, M.F.M. and Venkatarangan, P. (2001) Smart screening: approaches to efficient HTS. Curr. Opin. Drug Discov. Dev. 4, 275–283
Cruciani, G. et al. (2002) Suitability of molecular descriptors for database mining. A comparative analysis. J. Med. Chem. 45, 2685–2694

References
1 Schneider, G. et al. (2002) Virtual screening and fast automated docking methods. Drug Discov. Today 7, 64–70
2 Lengauer, T. et al. (2004) Novel technologies for virtual screening. Drug Discov. Today 9, 27–34
3 Johnson, M.A. and Maggiora, G.M. (1990) Concepts and Applications of Molecular Similarity. Wiley
4 Martin, Y.C. et al. (2002) Do structurally similar molecules have similar biological activity? J. Med. Chem. 45, 4350–4358
5 Grethe, G. and Moock, T.E. (1990) Similarity searching in REACCS. A new tool for the synthetic chemist. J. Chem. Inf. Comput. Sci. 30, 511–520
6 Cramer, R.D. et al. (1999) Prospective identification of biologically active structures by topomer shape similarity searching. J. Med. Chem. 42, 3919–3933
7 Mason, J.S. et al. (2001) 3D-Pharmacophores in drug discovery. Curr. Pharm. Des. 7, 567–597
8 Lipinski, C.A. et al. (2001) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 46, 3–26
9 Brown, R.D. and Martin, Y.C. (1997) The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. J. Chem. Inf. Comput. Sci. 37, 1–9
10 Willett, P. et al. (1998) Chemical similarity searching. J. Chem. Inf. Comput. Sci. 38, 983–996
11 Rarey, M. and Dixon, J.S. (1998) Feature trees: a new molecular similarity measure based on tree matching. J. Comput. Aided Mol. Des. 12, 471–490
12 Matter, H. and Rarey, M. (1999) Design and diversity analysis of compound libraries for lead discovery. In Combinatorial Organic Chemistry (Jung, G. ed.), pp. 409–439, Wiley-VCH
13 Schneider, G. et al. (1999) Scaffold-hopping by topological pharmacophore search: a contribution to virtual screening. Angew. Chem. Int. Ed. 38, 2894–2896
14 Böcker, A. et al. (2004) Status of HTS data mining approaches. QSAR Comb. Sci. 23, 207–213
15 Chen, X. et al. (1999) Automated pharmacophore identification for large chemical data sets. J. Chem. Inf. Comput. Sci. 39, 887–896
16 Rusinko, A. III et al. (1999) Analysis of a large structure/biological activity data set using recursive partitioning. J. Chem. Inf. Comput. Sci. 39, 1017–1026
17 Warmuth, M.K. et al. (2003) Active learning with support vector machines in the drug discovery process. J. Chem. Inf. Comput. Sci. 43, 667–673
18 Blower, P. et al. (2002) On combining recursive partitioning and simulated annealing to detect groups of biologically active compounds. J. Chem. Inf. Comput. Sci. 42, 393–404
19 Engels, M.F.M. et al. (2001) Smart screening: approaches to efficient HTS. Curr. Opin. Drug Discov. Dev. 4, 275–283
20 Jones-Hertzog, D.K. et al. (1999) Use of recursive partitioning in the sequential screening of G-protein-coupled receptors. J. Pharmacol. Toxicol. Methods 42, 207–215
21 Teckentrup, A. et al. (2004) Mining high-throughput screening data of combinatorial libraries: development of a filter to distinguish hits from nonhits. J. Chem. Inf. Comput. Sci. 44, 626–634