Chemometrics and Intelligent Laboratory Systems 82 (2006) 200 – 209 www.elsevier.com/locate/chemolab
The central role of chemoinformatics Johann Gasteiger Computer-Chemie-Centrum, Universitaet Erlangen-Nuernberg, Naegelsbachstr. 25 D-91052 Erlangen, Germany Received 8 February 2005; received in revised form 4 June 2005; accepted 10 June 2005 Available online 20 October 2005
Abstract
Chemoinformatics has evolved over the last 30 years into a scientific discipline that is now in full bloom. It covers many areas such as chemical structure representation, chemical reaction manipulation, data processing and data analysis, property prediction, chemometrics, data mining, structure elucidation, and synthesis design. Chemoinformatics methods have successfully been applied in all fields of chemistry. The future will bring a rapid expansion of the use of chemoinformatics to further our understanding of chemistry and to process the flood of chemical information. © 2005 Elsevier B.V. All rights reserved.
E-mail address: [email protected]. URL: http://www2.chemie.uni-erlangen.de. 0169-7439/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2005.06.022
1. Introduction
A major task of chemists is to make compounds with desired properties. Society at large is not interested in beautiful chemical structures but in the properties that these structures carry with them. The chemical industry can only sell properties, but it does so by conveying these properties through chemical structures. Thus, the first fundamental task in chemistry is to infer which structure might have the desired property. This is the domain of establishing structure–property or structure–activity relationships (SPR or SAR), or even of finding such relationships in a quantitative manner (QSPR or QSAR).
Once we have an idea of which structure we should make to obtain the desired property, we have to plan how to synthesize this compound: which reaction or sequence of reactions to perform to make this structure from available starting materials. This is the domain of synthesis design and the planning of chemical reactions.
Once a reaction has been performed, we have to establish whether the reaction took the desired course and whether we obtained the desired structure. Our knowledge of chemical reactions is still too cursory, and the factors influencing the course of a chemical reaction are so many, that we are not always able to predict which products will be obtained, whether side reactions will be observed, or whether the reaction might take a completely different course than expected. Thus, we have to establish the structure of the reaction product. A similar problem arises when the degradation of a xenobiotic in the environment or in a living organism has to be established. This is the domain of structure elucidation, which for the most part utilizes information from a battery of spectra (infrared, NMR, and mass spectra). These fundamental tasks of a chemist are summarized in Fig. 1.
Fig. 1. The fundamental tasks of a chemist: property prediction, synthesis design, reaction prediction, and structure elucidation. [Diagram labels: properties (physical, chemical, biological) and chemical structure, linked by QSAR/QSPR; chemical structure and available starting materials, linked by synthesis design; structure elucidation; reaction prediction.]
All these tasks are, in general, too complicated to be solved from first principles. They require a lot of knowledge, knowledge that has to be derived by learning from data and from observations made in experiments. There are two ways of learning: deductive and inductive. In deductive learning a theory is used to make inferences, deductions. In chemistry this is usually achieved by calculations such as quantum mechanical or molecular mechanics calculations. Such calculations provide data that can assist in solving a problem. Inductive learning, on the other hand, learns from observations, from data. These data are put into context to obtain information. Information can then be generalized to obtain knowledge (Fig. 2). To give an example: the measurement of a certain biological activity is, by itself, not very useful. Only when we can associate such a biological activity with a chemical structure do we obtain information. Many such pieces of information on chemical structures and their associated biological activities can then be used to build a model for the relationships between chemical structure and biological activity. Such a model comprises knowledge that can be used to make predictions on the biological activity of new chemical structures.
Inductive learning has a long history in chemistry. In fact, it has been the most important method for furthering our understanding of chemistry for more than 100 years. In recent decades, methods have been developed that put inductive learning on a more formal and rigorous mathematical basis. Different names have been given to this area, such as machine learning, data mining, pattern recognition, chemometrics, or neural networks. All these methods are considered to be part of chemoinformatics.
There are other reasons that make chemoinformatics indispensable: the amount of information available in chemistry is enormous. Presently, more than 40 million different compounds are known; all have a series of physical, chemical, or biological properties, all can be made in many different ways by a wide range of reactions, and all can be characterized by a host of spectra. Each year more than a million new compounds are discovered or synthesized, and about 800,000 new articles are published that deal in some way with aspects of chemistry. All this just aggravates the flood of information. This immense amount of information can only be processed by electronic means, by the power of the computer. This is again where chemoinformatics comes in! Thus, quite early on, in the sixties, databases for storing information on chemical compounds were built to ensure that the information accumulated by chemists remains accessible to the scientific community in the future.
Large as this flood of information is, there are also many areas where not enough information is available. Although 40 million compounds are known, we have experimental data on their 3D structure for only 250,000 compounds. And the largest database of infrared spectra comprises only 220,000 spectra.
Thus, we have experimental 3D structures and infrared spectra only for 0.5% of all known compounds. The question is then, can we develop methods to predict the 3D structure or the infrared spectra for the other 99.5% of compounds? Can we learn from the known 0.5% of the 3D structures enough about the construction principles of chemical structures to predict the 3D structures for the other 99.5% of compounds? Can we learn from the 0.5% infrared spectra stored in databases enough about the relationships between structure and infrared spectra to predict IR spectra for the other 99.5% of compounds? This is again where chemoinformatics has to come in!
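As a quick check of these coverage figures, and assuming the rounded counts quoted in the text (40 million known compounds, 250,000 experimental 3D structures, 220,000 infrared spectra), the fractions work out as follows. The 0.5% used in the text is best read as a rounded value.

```latex
\[
\frac{250\,000}{40\,000\,000} = 0.625\%, \qquad
\frac{220\,000}{40\,000\,000} = 0.55\%
\]
```

Either way, well over 99% of the known compounds have no experimental 3D structure or infrared spectrum on record.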
Thus, we see that chemistry provides a host of problems to be solved by novel methods: storage and retrieval of chemical compounds and reactions, structure–property relationships, synthesis design, reaction prediction, spectra simulation, and structure elucidation. This wide variety of applications has matured into a new field: chemoinformatics, the application of informatics methods to the solution of chemical problems.
2. History of chemoinformatics
The need to employ informatics methods to assist chemists in solving their scientific problems was felt quite some time ago in many areas of chemistry. Thus, chemoinformatics has many roots that often go back nearly 40 years.
2.1. Chemical structure representation
In the early sixties various forms of machine-readable chemical structure representation, such as linear notations, matrices, and connection tables, were explored [1]. Eventually, connection tables became the norm and were selected for the first version of the Chemical Abstracts Service Registry System [2]. This, however, meant that the graph isomorphism problem had to be solved for structure searching. This was achieved by the Morgan algorithm for uniquely and unambiguously numbering the atoms of a molecule [3]. Since then, many additional problems in the representation and manipulation of chemical structure information, such as substructure and similarity searching, diversity analysis, and ring and aromaticity perception, have been solved. This has led to efficient methods for processing chemical structure information and to the building of a diversity of databases on chemical structures and reactions.
2.2. Computer-assisted structure elucidation
In the process of elucidating the structure of an unknown compound the chemist mainly uses information from various spectroscopic methods and puts these pieces of information together by logical inference.
Thus, the process of chemical structure elucidation was recognized quite early on as a field of exercise for artificial intelligence techniques. The DENDRAL project, initiated in 1964 at Stanford University, gained widespread interest [4]. Chemical structure generators were developed, and information from mass spectra was used to prune the chemical graphs in order to derive the chemical structure associated with a certain mass spectrum. Structure elucidation systems that utilized information from different spectroscopic techniques were initiated in the late sixties by Sasaki at Toyohashi University of Technology [5] and by Munk at the University of Arizona [6], and work on these systems still continues after 40 years!
Fig. 2. Deductive and inductive learning: from data through information to knowledge. [Diagram labels: measurement, calculation; data; context; information; generalization; knowledge; inductive learning; deductive learning.]
2.3. Computer-assisted synthesis design
In 1969 Corey and Wipke presented their seminal work on developing a synthesis design system [7]. Shortly afterwards other groups, such as Ugi et al. [8], Hendrickson [9], and Gelernter [10], reported on their concurrent efforts to develop systems for designing organic syntheses. Several groups have entered, and left again, this exciting field, but work is still continuing and the more advanced systems have become quite mature [11].
2.4. Molecular modeling
Also in the late sixties the potential of cathode ray tubes for visualizing 3D molecular models was explored [12]. Stimulated by progress in hardware and software technology, particularly in computer screens and graphics cards, this work has led to highly sophisticated systems for visualizing complex molecular structures in great detail.
2.5. Chemometrics
It was recognized in the late sixties that the diversity and complexity of chemical data call for powerful and diversified data analysis methods. Thus, the field of chemometrics was soon established and has flourished ever since, being represented by journals of its own such as Journal of Chemometrics, Chemometrics and Intelligent Laboratory Systems, and Quantitative Structure-Activity Relationships.
Multifaceted as these various problem areas are, from structure representation to chemometrics studies, they have nevertheless drawn success from similar methods and have benefited from so many connections that they have merged into a scientific discipline of its own: chemoinformatics.
3. Quantitative structure activity/property relationships
Returning to the fundamental questions of a chemist mentioned in the Introduction, we want to delve further into the relationships between chemical structure and a desired property. This field, quantitative structure–property relationships (QSPR), or quantitative structure–activity relationships (QSAR) if the property of interest is a biological activity, is the prototypical area of application of chemoinformatics methods, as it highlights certain problems that are also important in other domains of chemoinformatics.
It has already been emphasized that many properties of a chemical compound, such as its biological activity, cannot be calculated from first principles. This is where inductive learning methods have to come in. First, the chemical structure of a compound has to be represented by a set of structure descriptors. Then, a series of compounds and their associated properties have to be compiled and submitted as a training set to an inductive learning method to build a model for the relationship between chemical structure and property (Fig. 3). This process will be analyzed in some detail as it involves methods that are of importance in other areas of chemoinformatics.
3.1. Chemical structure representation
Many different approaches have been taken to the representation of chemical structures by descriptors. These can be experimentally determined data such as partition coefficients (log P) or even the spectrum of a compound. Most structure descriptors, however, are calculated by computational methods. A descriptor can be a single value (MW, log P, etc.) or a vector consisting of several entries. Recently, an overview of chemical structure descriptors has been published [13]. It contained more than 1500 different types of descriptors. This cornucopia emphasizes that there is no lack of structure descriptors; rather, we need methods to choose the appropriate ones. Our approach to the problem has been to develop structure descriptors that have a clear physicochemical basis, both in the geometric resolution of a structure and in the physicochemical effects considered in their calculation. Fig. 4 illustrates that there is a distinct hierarchy in structure resolution in going from the constitution to the 3D structure to molecular surfaces. We have already mentioned that experimental 3D structures have been determined for only 0.5% of all known compounds. Thus, methods are needed to predict 3D structures for all other structures.
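The indirect route described at the start of this section (structure, then descriptors, then a model, then the property) can be sketched in a few lines. This is only a minimal illustration, not the method used in the studies cited here: the carbon-count descriptor, the heavy-atom-list representation, the function names, and all property values are hypothetical, and a one-descriptor least-squares line stands in for the inductive learning method.

```python
def descriptor(atoms):
    """Toy structure descriptor (assumption): number of carbon atoms
    in a list of heavy-atom element symbols."""
    return float(sum(1 for a in atoms if a == "C"))


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training set: (heavy-atom list, made-up property value).
training_set = [
    (["C", "O"], 1.0),
    (["C", "C", "O"], 3.0),
    (["C", "C", "C", "O"], 5.0),
    (["C", "C", "C", "C", "O"], 7.0),
]

# Step 1: represent each structure by its descriptor.
xs = [descriptor(atoms) for atoms, _ in training_set]
ys = [prop for _, prop in training_set]

# Step 2: build the model from the training set.
slope, intercept = fit_line(xs, ys)


def predict(atoms):
    """Step 3: predict the property of a new structure from its descriptor."""
    return slope * descriptor(atoms) + intercept


print(predict(["C", "C", "C", "C", "C", "O"]))
```

In practice the descriptor would be a vector with many entries and the learning method could be any statistical or neural network technique; the three steps stay the same.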
We have developed the 3D structure generator CORINA, which needs as input only the constitution of a compound and generates a 3D structure by a data- and rule-driven approach (Fig. 5) [14,15]. Of central importance is that CORINA has a broad scope, being able to generate a 3D structure for basically any organic structure. This has been shown by submitting the freely accessible database of the National Cancer Institute, comprising 250,251 chemical structures, to CORINA. In a single run taking 1.1 h on a Linux PC (1.6 GHz) a 3D model was obtained for 248,795 (99.4%) of the structures. CORINA can be accessed on the internet [15]. The 3D structures of the NCI database have also been made freely available [16]. With a 3D structure in hand it is also easy to construct molecular surfaces.
Fig. 3. The indirect way for predicting properties of chemical compounds. [Diagram labels: molecular structure; representation; structure descriptors; model building; property.]
Fig. 4. A hierarchy in structure representation: from the constitution through 3D structures to molecular surfaces. [Diagram levels: constitution (topological graph, "2D"); 3D structure model; molecular surface.]
At each level of structure resolution, different physicochemical effects such as charges, polarizabilities, or inductive and resonance effects can be considered; these can be calculated by rapid empirical methods [17] that allow the treatment of datasets with millions of structures. These methods have been collected in the package PETRA, which is also available on the web [18].
Structure representation poses an additional challenge when diverse datasets comprising molecules of different size, with different numbers of atoms, are investigated by inductive learning methods. These methods require that each object, in our case each molecule, is represented by the same number of descriptors as input to the learning method (Fig. 6). The chemical structure, be it the constitution, the 3D structure, or a molecular surface, therefore has to be subjected to a mathematical transformation that results in the same number of descriptors irrespective of the number of atoms. Autocorrelation [19] and radial distribution functions [20] are such mathematical transformations.
Fig. 6. General scheme for inductive learning: each object, i.e. chemical structure, of a dataset has to be represented by the same number of descriptors. [Diagram labels: input (object; for each object the same number of descriptors); learning method (statistical or pattern recognition method, neural network); output (property: spectrum; biological activity).]
3.2. Inductive learning methods
Data analysis methods have been used in chemistry for a long time. The importance of this field is underlined by the assortment of names that have been given to it. Early use of pattern recognition methods in chemistry matured to the point that a new name was given to this field: chemometrics. Two decades later, neural networks entered the arena and found an important place in the analysis of chemical data [21]. Then, in recent years, the term "data mining" has come into widespread use. Whatever the terminology (chemometrics, machine learning, data mining, or neural networks), this field is an important subfield of chemoinformatics. These methods help to manage the flood of information and to derive insight and knowledge from data collected in all fields of chemistry.
4. Overview of chemoinformatics
Recently, we have provided a comprehensive overview of chemoinformatics:
– representation of chemical compounds
– representation of chemical reactions
– data in chemistry
– data sources and databases
– structure search methods
– methods for calculating physical and chemical data
– calculation of structure descriptors
– data analysis methods.
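To make the "calculation of structure descriptors" entry more concrete, and to illustrate the fixed-length requirement of Section 3.1 (Fig. 6), the following sketch computes a topological autocorrelation vector. It sums the products of an atomic property over all atom pairs at each bond-count distance. This is one common form of autocorrelation; whether it matches the exact definition used in ref. [19] is an assumption, and the adjacency-list input, function names, and property values are hypothetical.

```python
from collections import deque


def topological_distances(adjacency):
    """All-pairs bond-count distances via breadth-first search.
    adjacency[i] lists the neighbours of atom i; -1 marks unreachable pairs."""
    n = len(adjacency)
    dist = [[-1] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[start][v] < 0:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


def autocorrelation(adjacency, prop, max_dist=7):
    """Return a vector of length max_dist + 1 whose entry d is the sum of
    prop[i] * prop[j] over all atom pairs (i <= j) separated by d bonds."""
    dist = topological_distances(adjacency)
    vec = [0.0] * (max_dist + 1)
    n = len(adjacency)
    for i in range(n):
        for j in range(i, n):
            d = dist[i][j]
            if 0 <= d <= max_dist:
                vec[d] += prop[i] * prop[j]
    return vec


# Two hypothetical molecules of different size (heavy atoms only);
# the atomic property values are made up for illustration.
three_atom_chain = [[1], [0, 2], [1]]
four_atom_chain = [[1], [0, 2], [1, 3], [2]]

vec_a = autocorrelation(three_atom_chain, [1.0, 1.0, 2.0], max_dist=3)
vec_b = autocorrelation(four_atom_chain, [1.0, 1.0, 1.0, 1.0], max_dist=3)
print(len(vec_a), len(vec_b))
```

Both molecules yield a vector of the same length although they differ in atom count, which is what allows them to be fed to the same learning method.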
We believe that these fields have matured to such a point, and have developed so many interconnections, that it is fair to say that chemoinformatics has come of age. To further support this process and the development of this discipline we have written a Textbook on Chemoinformatics [22] and have edited a four-volume Handbook of Chemoinformatics [23] written by more than 60 scientists active in various branches of chemoinformatics.
5. Applications of chemoinformatics
The range of applications of chemoinformatics is wide indeed; any field of chemistry can profit from its methods:
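Fig. 5. Outline of the 3D structure generator CORINA. [Diagram labels: connection table and stereo descriptors; initialization of internal coordinates; rings: ring perception, ring template search, ring assembly; acyclic systems; removal of steric crowding; 3D coordinates.]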
– the storage and retrieval of chemical information to manage the flood of data
– the derivation of knowledge from these data to further our understanding of chemistry
– the analysis of data from analytical chemistry by chemometric methods to make predictions on the quality, origin, and age of the investigated objects
– the prediction of physical, chemical, or biological properties of compounds
– the elucidation of the structure of a compound from spectroscopic data
– the design of organic syntheses
– the prediction of the course and products of chemical reactions.
The area of drug design in particular has seen the development of many applications of chemoinformatics methods:
– the identification of new lead structures
– the optimization of lead structures
– the establishment of quantitative structure–activity relationships
– the planning of chemical libraries
– the analysis of high-throughput screening data
– the docking of a ligand into a receptor
– the modeling of ADME-Tox data
– the prediction of the metabolism of xenobiotics.
Varied as these areas are and diversified as these applications are, the field of chemoinformatics is far from fully developed. There is much room for innovation, both in seeking new applications and in developing new methods. Nevertheless, it is already nearly impossible to give a comprehensive overview of the large number of publications in chemoinformatics. To illustrate the broad scope of applications and the potential of chemoinformatics we have decided to present some applications from our group.
5.1. The prediction of properties
It has been realized in recent years that during the development of a new drug increasing attention has to be given not only to the optimization of its biological activity but also to ensuring that it has favorable physical, chemical, and biological properties such as absorption, distribution, metabolism, excretion, and toxicity (ADME-Tox).
Methods are being developed for predicting these properties prior to the synthesis of the respective compounds so that they can be used in the virtual screening of large sets of compounds. One of the properties that deserves special attention is aqueous solubility, because this property has to be in a certain range for a drug to be orally administered and also to be absorbed into the body. Huuskonen [24] has compiled a dataset on the aqueous solubility of 1294 organic compounds that has been studied by a variety of research groups [25]. Our group has also studied this dataset and has developed a quantitative model for the prediction of solubility both by multilinear regression analysis and by back-propagation neural networks. Eighteen descriptors were
used that can automatically be derived from the constitution of a compound [26]. This model has a prediction error for log S of about 0.5 log units, as good as the best other methods developed for this set of compounds. However, it has been recognized that this dataset is not structurally diverse enough to sufficiently cover the kind of compounds used as drugs. To extend this study we obtained a dataset from Merck KGaA that was specifically selected for the interests of a pharmaceutical company. An analysis of the chemical descriptor space showed that this dataset of 2743 compounds was indeed structurally more diverse than the Huuskonen dataset. A model for the prediction of aqueous solubility could be developed for this dataset that had a prediction error of 0.6 log S units, and about the same error for those 799 molecules of the Huuskonen dataset not contained in the Merck dataset [27].
The important messages to be learnt from these studies are:
– aqueous solubility of organic compounds can be predicted with satisfactory accuracy from easy-to-calculate structural descriptors
– the quality of the data has a large influence on the quality of the prediction model
– academic groups can be successful in developing prediction models of interest to industry only when industry also releases in-house data.
5.2. Analysis of analytical chemistry data
The analysis of samples to assess their quality, their place of origin, or their age is of high commercial interest. As the relationships between the composition of a sample and its quality, origin, or age are highly complex, chemometric methods and other inductive learning methods have been employed for a long time. The study we present here should emphasize the importance of unsupervised learning methods, which may lead to insights that were not directly sought but are contained in the data.
The classification of Italian olive oils according to the region in which they were produced is a prototypical problem. M.
Forina has analyzed 572 samples of olive oils from nine different regions in Italy and has characterized each sample by determining the contents of eight fatty acids [28]. This dataset has already been studied by a variety of data analysis methods. We used a Kohonen neural network to map the samples from an eight-dimensional space (with the content of each of the eight fatty acids as a coordinate) into a two-dimensional plane. Of the 572 samples, 250 olive oils were used for training a 15 × 15 network; the other 322 samples were used as the test set. Of this test set, 312 samples (97%) could be correctly classified into their respective region of origin [29]. This is certainly an excellent result, but we want to show another interesting feature. Comparison of the Kohonen map obtained in this study with the geographic map of Italy shows that they correspond nicely to each other. The regions of Northern Italy are grouped together, well separated from the regions of Southern Italy. And the two regions of Sardinia are, for their part, well separated from the regions of both Northern and Southern Italy (Fig. 7). Surprising as this might seem at first sight, it emphasizes the importance of unsupervised learning methods that analyze data without imposing a rigid model. Apparently, the samples of Northern Italy have similarities, caused by similar soils and climate, that allow them to be distinguished from those of Southern Italy and those of Sardinia. These facts are hidden in the data and have been discovered by the unsupervised learning strategy of a Kohonen network.
Fig. 7. Comparison of the results of a clustering of a dataset of Italian olive oils by a self-organizing map with the regions of origin in Italy. [Kohonen map of region labels 1–9 shown beside a map of Italy; 572 samples, 9 regions, 8 fatty acids.]
5.3. Computer-assisted structure elucidation (CASE)
The elucidation of the structure of a compound presently relies almost exclusively on spectral data of various sorts (NMR, IR, MS). Deriving the structure of a compound from spectroscopic data involves processing a large amount of information, and many decisions have to be made between a host of alternatives. Therefore, as was already mentioned in Section 2, methods of artificial intelligence (they would now be called chemoinformatics!) were introduced into the arena quite early on. These efforts are continuing, and much remains to be done. Among the methods needed for CASE, the simulation of spectra plays an important role, as it is needed for the validation of structure proposals. We have developed a method for the simulation of infrared (IR) spectra, as IR spectroscopy is a non-destructive method that requires only small amounts of sample and can even be used for compounds attached to the beads of a combinatorial chemistry experiment.
IR spectroscopy monitors the vibrations of a structure in 3D space. We therefore thought it indispensable to use descriptors for the 3D structure of a compound. In order to fulfill the requirement for a structure representation with a fixed number of descriptors (see Fig. 6), the 3D structure was transformed into a radial distribution function (RDF), which was then discretized, resulting in the 128 values of an RDF code [20,30]. The relationship between the 3D structures and their corresponding IR spectra for a training set of compounds was stored in a counterpropagation (CPG) neural network [30]. Such a CPG network can then be used for predicting the IR spectrum of new structures from a test set (Fig. 8). Fig. 9 compares the predicted IR spectrum with the experimental one for a structure from a test set by a CPG network that has been trained with a dataset of 487 mono-, di-, and tri-substituted benzene derivatives [30]. A study with a variety of structures to establish the scope and limitations of the approach has also been published [31]. The method has been made accessible on the internet through the TeleSpec project [32]. Along similar lines a method for the prediction of chemical shifts in 1H NMR spectra has been developed [33]. It can also be used on the internet [34].
Fig. 8. Architecture of a counterpropagation neural network for infrared spectrum simulation. [Diagram labels: 3D structure; transformation; radial code (128 values); IR spectrum (128 absorbance values).]
Fig. 9. Comparison of simulation with experimental infrared spectrum. [Plot: absorbance (0 to 0.8) versus wavenumber (3500 to 500 cm-1) for simulation and experiment; r = 0.932. Residual figure annotations: 487 training, network selection, 384 test; 3D-MoRSE code, n = 32, smax = 31 Å-1; Ai = qtot; 25 × 25.]
5.4. Computer-assisted synthesis design (CASD)
The design of a synthesis for an organic compound involves the consideration of many alternatives, has to draw on a broad knowledge of organic reactions, has to take into account a large selection of available starting materials, and has to consider a variety of economic factors. Thus, it is one of the most challenging problems in organic chemistry and attracted interest early on as a field of exercise for artificial intelligence techniques. However, because of the complexity of the problem, only a few groups internationally have taken up this challenge and have invested decades of person-years into the development of systems that should assist the chemist in the design of organic syntheses.
Fig. 10. Tools of the synthesis design system WODCA for planning the synthesis of organic compounds and combinatorial libraries. [Diagram labels: Molecule Editor; Molecule Viewer; Similarity Searches; Substructure Searches; Strategic Bonds; Reaction Database; Synthesis Tree.]
Fig. 12. The drug design process. [Diagram stages: target identification; lead selection; lead optimization; preclinical testing; clinical development; with bioinformatics and chemoinformatics as supporting disciplines.]
Our group has worked on this challenging project for 30 years, and we have arrived at a version of the WODCA system (Workbench for the Organization of Data for Chemical Applications) that we consider mature enough to be of practical use for synthesis chemists [35–37]. WODCA comprises a series of tools that the user/chemist can employ for planning the synthesis of individual compounds or of combinatorial libraries (Fig. 10).
– a molecule editor and a molecule viewer allow the chemist to communicate with the system in the language he/she is used to, in the form of structure diagrams.
– similarity searches can be used to drive the design of a synthesis as quickly as possible to available starting materials. Different similarity criteria, specifically designed for focusing on synthesis reactions, are provided for the user to choose from. Catalogs of available compounds are attached to the system, but methods are also provided to append one's own proprietary catalogs.
Fig. 11. User interface of WODCA.
Fig. 13. Chemoinformatic methods for drug lead discovery. [Diagram labels: target-based; ligand-based; target protein; de novo design; docking; ligand(s); pharmacophore; ligand database; similarity searching; lead.]
– Strategic bonds can be perceived, i.e., bonds where a molecule should be dissected in order to simplify a synthesis problem and work with smaller precursor molecules.
– Such a dissection of a molecule corresponds to a retroreaction, and searches in a reaction database can be invoked to verify whether this retroreaction corresponds to a known reaction or reaction type. The important feature is that the query into the reaction database is derived automatically by the system. Here, too, the reaction database integrated into the system can be exchanged for one's own reaction database.

This interplay of similarity searches, strategic bond definition, and reaction searches allows one to generate a complete synthesis plan, or even several alternative synthesis plans, which are stored and presented graphically in a synthesis tree. When planning the synthesis of a combinatorial library, one can invoke substructure searches each time an available starting material has been found and thus expand the synthesis plan to all available building blocks.

This is not the place to present individual syntheses that have been developed through the use of WODCA. By showing Fig. 11 we just want to emphasize that the interaction with the system is largely graphical in nature, allowing its easy use by the bench chemist. We consider the chemist and the synthesis design system WODCA as a team that should work together to solve complex problems of organic synthesis. Both team players bring in their own strengths: the computer tirelessly explores and evaluates a large number of different alternatives and thus generates new ideas, while the chemist is stimulated by these ideas in his or her lateral thinking and brings in his or her knowledge of chemistry.

5.5. Drug design

The area of drug design is presently undoubtedly the most important field for using chemoinformatics methods. There are several reasons. First, there is enormous economic pressure to reduce the high costs of developing a new drug and to reduce the time needed for this process. Second, experimental methods recently introduced into the drug design process, such as combinatorial chemistry and high-throughput
Fig. 14. Chemoinformatic methods for drug lead optimization. (Diagram labels: lead, similarity searching, lead hopping, combinatorial library, virtual library, HTS, virtual screening, set of ligands, QSAR, docking, optimized lead.)
screening produce enormous amounts of data that have to be analyzed. Finally, the biological activity of a chemical compound cannot yet be predicted from first principles; predicting it therefore still requires methods that learn from available data.

Fig. 12 shows a simplified outline of the drug design process and highlights the interplay of bioinformatics and chemoinformatics in this process. Bioinformatics methods should assist in identifying, from genetic information, the target protein that is the focus of a certain disease. The next steps are the identification of a new lead structure, the optimization of this lead structure to increase its biological activity, and then, or better simultaneously, the optimization of the ADME-Tox properties to convert the highly active compound into a drug with advantageous physical, chemical, and biological properties. For all three of these tasks, chemoinformatics methods have been developed to increase the efficiency of achieving these goals.

Fig. 13 gives an overview of the methods developed for lead discovery. They fall into two categories: target-based methods, which need the 3D structure of the target protein, and ligand-based methods, which do not need this structure but instead analyze a series of ligands that bind to the target protein.

Fig. 14 shows the major methods for lead optimization. Their first task is to expand a lead into a set of potential ligands. This can be achieved through similarity searching in compound databases, by lead hopping, by analyzing the outcome of high-throughput experiments, or by screening virtual libraries. Once a set of ligands has been obtained, they can either be docked into the target protein, if its 3D structure is known, or one can try to establish quantitative structure–activity relationships for this set of ligands in order to pinpoint the highly active compounds.
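The similarity searching step mentioned above can be illustrated with a minimal sketch. It assumes that each compound is encoded as a set of structural feature indices (a binary fingerprint) and that similarity is measured with the Tanimoto coefficient, a widely used choice; the text does not specify which descriptors or similarity criteria are actually applied. The compound names, the feature indices, and the function names `tanimoto` and `similarity_search` are hypothetical and serve only as illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a fingerprint is the set of indices of structural features
# present in a molecule. Real systems derive these from the structure.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared features / features present in either molecule."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(
    query: Fingerprint,
    database: Dict[str, Fingerprint],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return database compounds at least `threshold` similar to the query,
    ordered from most to least similar."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy data: a lead compound and a three-compound database.
lead = frozenset({1, 4, 7, 9})
database = {
    "cmpd_A": frozenset({1, 4, 7, 9, 12}),  # shares 4 of 5 features with the lead
    "cmpd_B": frozenset({1, 4, 20}),        # shares 2 of 5 features with the lead
    "cmpd_C": frozenset({30, 31}),          # shares no features with the lead
}

for name, score in similarity_search(lead, database):
    print(f"{name}: {score:.2f}")
```

With the default threshold of 0.5, only `cmpd_A` is retained. Lowering the threshold widens the set of candidate ligands passed on to docking or QSAR analysis, at the cost of including less closely related structures.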
The optimization of the ADME-Tox properties is performed by methods that have briefly been touched on in Section 5.1. To summarize, chemoinformatics has developed a large arsenal of methods that can be utilized to make the drug design process more efficient. Much has been done, but it should be emphasized that we are just at the beginning and that much more has to be developed to achieve the goals that can be envisaged.
6. Conclusions

By now it should be clear that chemoinformatics has matured to a point where it has made important inroads into many fields of chemistry. And we are just at the beginning! Certainly, chemistry is a science that is driven by experiments; it relies heavily on data and observations. However, it is imperative to plan the experiments that generate data better, to analyze their results better, and then to come back and perform more focused experiments. Chemoinformatics can step in to assist in this endeavor. It can do so in all fields of chemistry: inorganic, analytical, organic, physical, medicinal, and biochemistry. And it can reach beyond chemistry, providing methods and information that can be used in biology, medicine, and physics.

Acknowledgements

It has been exciting to help bring a new scientific discipline, chemoinformatics, to life over the last 30 years. This was only possible because other scientists and a series of able coworkers shared our conviction and enthusiasm. They have contributed to advancing the field through their research. And there were people in governments and industry who believed in this vision and helped fund the work.

References
[1] F.A. Tate, Ann. Rev. Inf. Sci. Technol. 2 (1967) 285–309.
[2] G.M. Dyson, M.F. Lynch, H.L. Morgan, Inf. Storage Retr. (1968) 27–83.
[3] H.L. Morgan, J. Chem. Doc. 5 (1965) 107–113.
[4] R.K. Lindsay, B.G. Buchanan, E.A. Feigenbaum, J. Lederberg, Applications of Artificial Intelligence for Organic Chemistry: The Dendral Project, McGraw-Hill, New York, 1980.
[5] S.I. Sasaki, H. Abe, T. Ouki, M. Sakamoto, S. Ochiai, Anal. Chem. 40 (1968) 2220–2223.
[6] M.E. Munk, J. Chem. Inf. Comput. Sci. 38 (1998) 997–1009.
[7] E.J. Corey, W.T. Wipke, Science 166 (1969) 178–193.
[8] J. Blair, J. Gasteiger, C. Gillespie, P.D. Gillespie, I. Ugi, Tetrahedron 30 (1974) 1845–1859.
[9] J.B. Hendrickson, J. Am. Chem. Soc. 93 (1971) 6847–6854.
[10] H.L. Gelernter, N.S. Sridharan, A.J. Hart, S.-C. Yen, Top Curr. Chem. 41 (1973) 113–150.
[11] W.-D. Ihlenfeldt, J. Gasteiger, Angew. Chem., Int. Ed. Engl. 34 (1995) 2613–2633.
[12] A.M. Lesk, Comput. Biol. Med. 7 (1977) 113–129.
[13] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, in: R. Mannhold, H. Kubinyi, H. Timmerman (Eds.), Methods and Principles in Medicinal Chemistry, vol. 11, Wiley-VCH, Weinheim, 2000.
[14] J. Sadowski, J. Gasteiger, Chem. Rev. 93 (1993) 2567–2581.
[15] http://www2.chemie.uni-erlangen.de/software/corina/free-struct.html and http://www.mol-net.de/software/category/gen3dcoord.html.
[16] http://www2.chemie.uni-erlangen.de/services/ncidb2/index.html.
[17] J. Gasteiger, C. Jochum, M.G. Hicks, J. Sunkel, Phys. Prop. Predict. Org. Chem. (1988) 119–138.
[18] http://www2.chemie.uni-erlangen.de/software/petra/index.html.
[19] M. Wagener, J. Sadowski, J. Gasteiger, J. Am. Chem. Soc. 117 (1995) 7769–7775.
[20] M.C. Hemmer, V. Steinhauer, J. Gasteiger, Vibr. Spectrosc. 19 (1999) 151–164.
[21] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, Wiley-VCH, Weinheim, 1999.
[22] J. Gasteiger, T. Engel (Eds.), Chemoinformatics: A Textbook, Wiley-VCH, Weinheim, 2003.
[23] J. Gasteiger (Ed.), Handbook of Chemoinformatics: From Data to Knowledge, Wiley-VCH, Weinheim, 2003.
[24] J. Huuskonen, J. Chem. Inf. Comput. Sci. 40 (2000) 773–777.
[25] I.V. Tetko, V.Y. Tanchuk, T.N. Kasheva, A.E.P. Villa, J. Chem. Inf. Comput. Sci. 41 (2001) 1488–1493; R.F. Liu, S.-S. Do, J. Chem. Inf. Comput. Sci. 41 (2001) 1633–1639.
[26] A. Yan, J. Gasteiger, QSAR Comb. Sci. 22 (2003) 821–829.
[27] A. Yan, J. Gasteiger, M. Krug, S. Anzali, J. Comput.-Aided Mol. Des. 18 (2004) 75–87.
[28] M. Forina, C. Armanino, Ann. Chim. (Rome) 72 (1982) 127.
[29] J. Zupan, M. Novic, X. Li, J. Gasteiger, Anal. Chim. Acta 292 (1994) 219–234.
[30] J. Schuur, J. Gasteiger, Anal. Chem. 69 (1997) 2398–2405.
[31] P. Selzer, J. Gasteiger, H. Thomas, R. Salzer, Chem. Eur. J. 6 (2000) 920–927.
[32] http://www2.chemie.uni-erlangen.de/services/telespec/index.html.
[33] J. Aires de Sousa, M. Hemmer, J. Gasteiger, Anal. Chem. 74 (2002) 80–90.
[34] http://www2.chemie.uni-erlangen.de/services/spinus.
[35] J. Gasteiger, M. Pförtner, M. Sitzmann, R. Höllering, O. Sacher, T. Kostka, N. Karg, Perspect. Drug Discov. Des. 20 (2000) 245–264.
[36] http://www2.chemie.uni-erlangen.de/software/wodca/index5.html and http://www.mol-net.de/software/category/synthesis.html.
[37] M. Pförtner, M. Sitzmann, in: J. Gasteiger (Ed.), Handbook of Chemoinformatics, Wiley-VCH, Weinheim, 2003, pp. 1457–1507.