4.14 Chemoinformatics
J. Polanski, University of Silesia, Katowice, Poland
© 2009 Elsevier B.V. All rights reserved.
4.14.1 Introduction
4.14.2 The Origins and Scope of Chemoinformatics
4.14.3 Teaching Computers Chemistry: Data Input Problems
4.14.3.1 Computer-Processable Molecular Codes
4.14.3.2 Molecular Editors
4.14.3.3 Computer-Oriented Chemical Compounds Nomenclature
4.14.3.4 Coding Chemical Reactions
4.14.3.5 Organizing Chemical Facts into Databases
4.14.4 In Silico Chemistry: Data Processing and Data Output Problems
4.14.4.1 Computer-Generated Chemical Names
4.14.4.2 Molecular Modeling
4.14.4.2.1 Structure generators
4.14.4.2.2 Modeling 3D structures
4.14.4.3 Structure and Substructure Searches
4.14.4.4 Molecular Graphics
4.14.4.5 Chemical Syntheses and Retrosyntheses (Disconnections)
4.14.4.5.1 The development of product-to-reagents strategy in synthesis design
4.14.4.5.2 Synthon nomenclature
4.14.4.5.3 Operations on synthons
4.14.4.5.4 Computer-assisted synthesis design
4.14.4.6 Reaction Prediction
4.14.4.7 Computer-Assisted Structure Elucidation
4.14.4.8 Database Mining for Computer-Assisted Knowledge Discovery
4.14.4.9 Chemometrics: Translating Mathematics to Chemistry and Chemistry to Mathematics
4.14.4.10 Computer-Assisted Molecular Design
4.14.4.11 Property-Oriented Synthesis
4.14.4.11.1 Intuition and serendipity in drug discovery and development
4.14.4.11.2 Brute force screening by combinatorial approaches
4.14.4.11.3 From data to drugs
4.14.4.11.4 Structure-based design
4.14.4.11.5 Ligand-based design
4.14.4.11.6 Mapping structure to property in QSAR approach
4.14.4.11.7 Drug likeness and druggability concept
4.14.4.11.8 Molecular diversity in property-oriented synthesis
4.14.4.11.9 Bioinformatics in drug design
4.14.5 Internet Resources for Chemistry and Chemoinformatics
4.14.6 Conclusions and Further Trends
4.14.7 Sources of Further Information and Advice
References
Symbols
a_i   acceptor synthon located at the ith atom relative to the functional group heteroatom (i = 0)
d_i   donor synthon located at the ith atom relative to the functional group heteroatom (i = 0)
P     physical, chemical, or other properties, where m and p in subscript refer to measured and predicted values, respectively
S     structural properties
E     end reaction state matrix
B     initial reaction state matrix
R     reaction matrix
4.14.1 Introduction

Chemoinformatics (cheminformatics) is a recently coined term for the discipline that organizes and coordinates the application of computers in chemistry. Computers have assisted chemists for years, yet the term itself is new, so it is not surprising that not all chemists are impressed by it. There is in fact some controversy over whether this relatively novel branch of chemistry needed to be founded at all. Wendy Warr, who surveyed chemists on the issue, summarized the responses as follows: “some people felt that it was a neologism invented by information professionals who felt that chemical information was not a sexy enough name to safeguard their jobs. Opinion has now shifted towards acceptance of chemoinformatics as a discipline although not everyone agrees about the definition, or even about the syntax: 50% of respondents like chemoinformatics.” [1]
4.14.2 The Origins and Scope of Chemoinformatics

Chemoinformatics, which joins chemistry and informatics, is evidently related to computer applications in chemistry. However, not every branch of chemistry that depends on computers should necessarily be included in the field. Clark asks: “Does quantum chemistry have a place in cheminformatics?” [2] Even though the author considered a very narrow research area for chemoinformatics, the question expresses hesitation about a “possible role of quantum mechanical techniques in chemoinformatics” [2] and suggests the autonomy of quantum chemistry [2,3]. The essence of quantum chemistry is the assumption that no specifically chemical interaction is needed to explain chemical bonding. In principle, the physics of atoms and mathematics suffice for the correct modeling of molecular objects, and pure mathematics can, hypothetically, be done without computers. Even today, however, such an approach has important limitations: only rather small molecules can be investigated. The larger the molecule, the more remote and inaccessible a precise mathematical explanation becomes. In fact, chemistry often appears too elusive for a precise description of molecular bodies. Because such bodies are the most substantial object of chemical investigation, this even provokes the question “is chemistry a science?” [4]

Philosophers have developed several theories to explain the origins of science. According to conventionalism, logical structures called laws of nature are created or invented and then verified by experiment. Inductivism finds the origins in “collecting and classifying sensory input data into a form called observable facts” [5]; inductive logic is then applied to draw general conclusions or laws of nature. Finally, for deductivists, theories are at the origins of science.
Scientists can never prove a theory, but science develops through theory falsification [5]. Whichever philosophy we accept, we need data and theories for the development of science. We cite here Brock [6], who discussed the history of fundamental concepts and theories in chemistry, to illustrate the complexity of chemical research. Consider chemical bonding. The molecular orbital and valence bond theories describe atomic-scale aggregation into molecules. The two models have been competing with each other, and chemists still discuss which is correct and which is better. In physics, it is possible to develop a relatively simple model to explain certain facts of nature. In contrast, in
chemistry, a theory often interprets the data only partially and so offers only partial solutions. Thus, we need several high-quality models for the correct theory. Brock concluded that theoretical chemistry is still an empirical science based on the Schrödinger equation; it appears, however, that a general solution of the equation will never be found. Mathematics is an instrument for modeling and developing theories, which means that it can be interpreted as a compression tool unifying the facts of nature: we no longer need the individual facts, ‘which disappear,’ but a single equation that explains reality. Reductionism is the approach which insists that the complexity of a system can be explained on another level by such a compressed model. For an illustrative discussion of these problems, the reader is referred to Cohen and Stewart [7]. However, reality often appears too complex, or even unavailable, for an accurate mathematical description. Alternatively, the model developed can be too complex for a precise solution. Since we still need answers in such situations, we have to rely on simplifications, even if they are less reliable. Ultimately, speculation or an educated guess is better than a blind guess or no answer, and the “educated guess is being supported by the computer” [8]. This describes the first application of computers in chemistry: assisting the chemist with computations that are beyond manual calculation. Why do computations remain possible, flexible, and efficient in data processing where human calculation fails? The efficiency of in silico mathematics [9] is achieved, first of all, not by computer intuition or flexibility but by brute force that preserves mathematical rigor and formalism. This makes in silico mathematical philosophy evidently different from the human one.
The enormous speed and competence in low-level manipulations, coupled with human intelligence, allowed computers to solve “formerly intractable problems, and explore areas beyond the reach of human calculation” [9,10]. In this context, we can also restrict the domain of chemoinformatics preferentially to data processing that cannot do without in silico mathematics, that is, to those branches of chemistry that depend on massive data that cannot be compressed into standard mathematical models. Oprea suggested that chemoinformatics should also exclude some traditional branches that are usually associated with computational chemistry but “generate more numbers than information (...), e.g., physical and chemical property calculation.” [11] More generally, however, this seems to refer to operations that, even though they can be performed efficiently in silico, could hypothetically be done without a computer on the basis of relatively simple mathematical equations.

Data storage is the second important field for the application of computers in chemistry. Chemistry starts from data, that is, facts and numbers, which, when processed and delivered properly at the proper time and place, make up information. Processing information in turn develops chemical knowledge. Chemistry focuses on atoms and molecules and their properties and transformations. The matter available in the universe can be arranged into an unbelievably large number of molecules, which form the chemical data space.
To illustrate the numbers, Chemical Abstracts Service (CAS) has currently registered almost 37 million chemical compounds, 60 million sequences, and 15 million single- and multistep reaction data entries [12]. The population of chemical space (CS), that is, the number of potential compounds, is estimated at between 10^18 and 10^200 (the number 10^60 being cited most often), which can be compared with the factual CS (FCS), of the order of 10^7, and with the estimated number of stars in the universe, 10^22 [13,14]. The expansion of CS is illustrated even better by a single molecule of n-hexane substituted with 150 different substituents: bringing together all mono- to 14-substituted molecules gives a molecular population of 10^29 [15].

The term CS itself is an example of the impact of mathematics on chemistry. In chemistry, the term appeared recently to illustrate the need to control the structural constitution of such a space or, in other words, the diversity of the molecular population investigated in combinatorial chemistry. In mathematics, a space is a set with a certain structure; in particular, a vector space is a set of multidimensional vectors in a generalized coordinate system. Mathematics imposes some further conditions on such a space: an origin and a basis (unit vectors in each dimension) must be defined. The term CS as used in the chemical literature is a synonym for the chemical set that includes all possible chemical compounds. It often refers to virtual compounds, that is, those that have not yet been synthesized. Mapping CS to biological space or to property space is a further borrowing from mathematics. Figure 1 attempts to organize chemistry further in the form of CS. CS is formed by molecules. A molecule is a vector whose elements describe its structure (structural properties, S) and its chemical or physical properties, P.
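The vector picture of a molecule can be made concrete with a small data structure. The following Python sketch is only an illustration under stated assumptions: the class name `Molecule`, the `synthesized` flag, the choice of descriptors (heavy-atom count and ring count as S, boiling point as P), and the entry `virtual-1` are hypothetical and are not taken from the chapter.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Molecule:
    """A molecule as a vector m(S1, ..., Si, P1, ..., Pi)."""
    name: str
    S: Tuple[float, ...]            # structural properties (here: heavy atoms, rings)
    P: Optional[Tuple[float, ...]]  # measured properties (here: boiling point, deg C)
    synthesized: bool               # True -> factual (FCS), False -> virtual (VCS)


def split_chemical_space(cs):
    """Partition a chemical space into its factual and virtual parts."""
    fcs = {m for m in cs if m.synthesized}
    vcs = {m for m in cs if not m.synthesized}
    return fcs, vcs


# Toy chemical space; 'virtual-1' is a made-up placeholder with no measured data.
CS = {
    Molecule("n-hexane", S=(6, 0), P=(68.7,), synthesized=True),
    Molecule("benzene", S=(6, 1), P=(80.1,), synthesized=True),
    Molecule("virtual-1", S=(7, 2), P=None, synthesized=False),
}

FCS, VCS = split_chemical_space(CS)
print(sorted(m.name for m in FCS))  # factual molecules
print(sorted(m.name for m in VCS))  # virtual molecules
```

In this toy setting the two subsets are disjoint and together make up the whole space, which is the set relation between FCS, VCS, and CS used in the text.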
[Figure 1: panels (a) to (c); graphic not reproduced. Panel (a) shows CS as the union of FCS and VCS, with molecules written as vectors, e.g. m1(S11, S12, ..., S1i, P11, P12, ..., P1i) and m2(S21, S22, ..., S2i, P21, P22, ..., P2i). Panel (b): m ∈ (FCS)r, k ∈ (CS)r. Panel (c): m ∈ (CS)r, k ∈ (CS)r.]
Figure 1 Chemical space (CS) consisting of factual (FCS) and virtual (VCS) spaces, FCS ∪ VCS = CS, is formed of the molecules, where each molecule can be given by a vector (a). This provides a base for the definitions of molecular transformation operators capable of mapping molecular objects in CS, for example, organic synthesis in vitro operator (b) or reaction prediction operator in silico (c) (cf. Section 4.14.4.6). Two symbols m and k were used for the better illustration that mapping by these operators needs two different molecule types (m-reagents, k-products); r denotes the r-reaction domain in CS. The operator in vitro starts from FCS, while that in silico can work entirely in CS.

As shown in Figure 1, CS is constructed from two basic moieties: FCS, that is, real molecules forming chemical compounds that have already been obtained and described, and virtual CS (VCS), that is, hypothetical molecular structures. Accordingly, two common chemical problems (chemical synthesis and reaction prediction) are defined using a vector space formalism. New chemical objects cannot be investigated or constructed without an efficient data mining system that allows physical and chemical characteristics to be verified and screened across the variety of described compounds.

Access to information is a fundamental problem in chemistry: the proper data should reach the chemist's desk where and when they are needed. Therefore, from the beginning, chemists developed documentation systems for chemical compounds. Chemisches Zentralblatt appeared as early as 1830; the first edition of Beilstein's Handbuch der Organischen Chemie was published in 1881 and comprised two volumes registering 1500 compounds on more than 2000 pages. This comprehensive encyclopedia of organic structures covers the chemical literature from 1771 to date [16–18]. Chemical Abstracts has been published since 1907. Unlike in other sciences, data storage systems are discussed in basic chemical handbooks, for example, March's Advanced Chemistry [17]. Improved data storage could significantly stimulate the development of the chemical sciences, and the chemical information branches have accordingly been keen to profit from computers. It is much more efficient to keep information on the computer desktop than just on the desk. Therefore, besides computations, chemical information forms an important component of chemoinformatics. Gasteiger illustrates this by the fact that, in 1975, the Journal of Chemical Documentation changed its name to Journal of Chemical Information and Computer Sciences [19]. In a similar vein, the same title can be used to show recent developments in the field, since the journal name has just changed to Journal of Chemical Information and Modeling.
Computer science is now too far from the chemical core, and chemists believe that they are generating their own tools for computer chemistry investigations. Willett suggests that this journal might today reasonably be entitled Journal of Chemoinformatics [20]. Thus, the journal's history briefly illustrates the scope of the discipline.

In fact, modeling is the next important problem in which chemists need computer assistance. The most obvious dictionary meaning of a model is ‘a physical representation that shows what an object looks like.’ For years, molecules were too small for direct observation, and even today we usually observe them indirectly by analyzing measurable data. Therefore, from the very early days, chemists had to assemble physical objects resembling molecular-scale shapes. A molecular model can be any physical representation of the configuration assigned to a molecular object, constructed to understand and explain the measurable characteristics manifested by molecules [21]. Molecular models (Dreiding, CPK, and so on) are so popular among chemists that our conception of molecules is predominantly shaped by such real-world
reproductions. However, macroscopic analogies provide only an imitation, and simple hard-sphere-like molecular representations cannot furnish an exact illustration of microscopic bodies, which can be described only by quantum mechanics. Although modeling is a broad term that describes a variety of methods, its substantial meaning in chemistry involves the construction and visualization of molecules. The development of computer technology provides a virtual reality platform for chemistry that is known as molecular modeling.

One way or another, increasing dependence on computers is a fact of modern chemistry, and it has brought a need for better organization of the field. As indicated so far, computation, data storage, and modeling are the potential computer applications in chemistry. These problems are also of fundamental importance for computer science in general. Computer science, the term used in the United States, or informatics, coined as its synonym in Europe (for a discussion of the differences, see Roberts [22]), can be defined as “the science of algorithmic processing, representation, storage and transmission of information.” [23] In general, such a definition also describes the potential application areas for computers in chemistry. Consequently, a recent definition of chemoinformatics, given by Gasteiger in the Handbook of Chemoinformatics, points to “the application of informatics methods to solve chemical problems.” [24] Other authors offer more specific descriptions of the field.
Brown describes the discipline as “the combination of all the information resources that a scientist needs to optimize the properties of a ligand to become a drug.” [25,26] According to Paris, chemoinformatics “encompasses the design, creation, organization, storage, management, retrieval, analysis, dissemination, visualisation and use of chemical information, not only in its own right, but as a surrogate or index for other data, information and knowledge.” [1] Chemoinformatics should be interpreted as an element of knowledge management, which includes problems such as “compound registration into databases, library enumeration; access to primary and secondary scientific literature (...).” [27,28]

Chemical informatics is another term related to the application of computers in chemistry. Notably, it is the oldest designation for computer chemistry, appearing in the literature as early as the 1980s. The formal definition of the branch is “computer-assisted storage, retrieval, and analysis of chemical information, from data to chemical knowledge.” [29,30] Chemical Informatics Letters, an open-access web journal published since 2000, brings the latest news in this field. The website, edited by Goodman, is designed in hypertext format, which sets it apart from the standard form of a conventional journal. Cheminformatics and chemiinformatics are synonyms that sometimes replace the term chemoinformatics [29]. Finally, computer chemistry also seems to describe a similar branch of chemistry. Notably, research centers explicitly called computer chemistry laboratories, for example, the Labor für Computer Chemie at the Technische Universität München, were established in the 1980s and 1990s. The history and operation of the European computer chemistry institutes can be found in Noordik [31].

Chemistry is not the only science that has developed its own informatics; a variety of multidisciplinary informatics have appeared.
Accordingly, bioinformatics relates to genetic information encoding living organisms’ structures and processes. Medical informatics focuses on diseases, patients, and drugs. Crystalloinformatics and protein informatics are other examples of interdisciplinary informatics.
4.14.3 Teaching Computers Chemistry: Data Input Problems

Computer-understandable chemistry is required for machines to process the data; it is equally required for interaction between chemist and computer. Translating the structural data of molecular objects into a machine-readable and machine-processable system that is clear and unambiguous is not a trivial problem. Molecules are the main object of chemical investigation. They can represent either real chemical compounds that have already been obtained and described, or virtual structures representing hypothetical compounds under design or speculation. Organic chemistry and inorganic chemistry are the disciplines that construct such objects in reality, in a proportion of approximately 1:200 in favor of organic chemistry. To control CS, that is, all possible real or virtual molecules, we need efficient machine-searchable databases registering all compounds that have been synthesized by chemists from the very
early days to today. This problem, which appeared in the 1960s, can be defined as structure representation and searching [20]. Below we discuss several problems of structure representation, which is of substantial importance for the organization of chemistry in silico. Structure searching as a chemical operator in silico is discussed in Section 4.14.4.3.

What we usually mean by structure, in the broadest sense, is a chemical entity described by its constitution and stereochemistry, where constitution means “the description of the identity and connectivity (and corresponding bond multiplicities) of the atoms in a molecular entity (omitting any distinction arising from their spatial arrangement, i.e., molecular stereochemistry).” [32] The atomic composition given by a molecular formula is not sufficient to identify a molecule unambiguously. Chemical entities of the same atomic composition but different constitution and/or stereochemistry are called isomers. The International Union of Pure and Applied Chemistry (IUPAC) defines an isomer as “one of several species (or molecular entities) that have the same atomic composition (molecular formulae) but different line formulae or different stereochemical formulae and hence different physical and/or chemical properties.” A line formula is constructed by indicating atoms that are “joined by lines representing single or multiple bonds, without any indication or implication concerning the spatial direction of bonds.” These rules allow the chemist to define unambiguously chemical entities that are characterized by a certain structure or structural properties, as suggested in Figure 1. If we refer to a molecule defined according to Figure 1, m(S1, S2, ..., Si, P1, P2, ..., Pi), then we can further divide the properties into properties of molecules and properties of chemical compounds.
For example, a structural property can refer both to a molecule (molecular surface, molecular volume, 3D structure) and to a chemical compound (3D crystal structure). Similarly, a chemical or physical property can describe a molecule, for example, polarizability, or a chemical compound, for example, melting point. Figure 2 illustrates the basic terms that refer to molecular objects in FCS and VCS. It is worth mentioning that, in the majority of chemical applications, the stereochemical description is not a precise description of the real 3D molecular structure (which is known relatively rarely) but rather a rough scheme. This is shown, for example, in Figure 3, which illustrates two hypothetically possible 3D structures of trans-1,2-dibromocyclohexene.
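The point made in this section that a molecular formula alone does not identify a molecule can be checked on a pair of constitutional isomers. The sketch below is a minimal illustration, not a general isomer test: the ethanol and dimethyl ether atom and bond lists were written out by hand for this example, the function names are hypothetical, and the bond profile used here is only a crude fingerprint of constitution (two different constitutions can share the same profile).

```python
from collections import Counter

# Atoms are element symbols; bonds are (atom_index, atom_index, bond_order), 0-based.
ETHANOL = (
    ["C", "C", "O", "H", "H", "H", "H", "H", "H"],
    [(0, 1, 1), (1, 2, 1), (2, 8, 1),
     (0, 3, 1), (0, 4, 1), (0, 5, 1), (1, 6, 1), (1, 7, 1)],
)
DIMETHYL_ETHER = (
    ["C", "O", "C", "H", "H", "H", "H", "H", "H"],
    [(0, 1, 1), (1, 2, 1),
     (0, 3, 1), (0, 4, 1), (0, 5, 1), (2, 6, 1), (2, 7, 1), (2, 8, 1)],
)


def molecular_formula(atoms):
    """Atomic composition with C and H first, then other elements alphabetically.

    This matches Hill order for carbon-containing molecules only.
    """
    counts = Counter(atoms)
    order = [e for e in ("C", "H") if e in counts]
    order += sorted(e for e in counts if e not in ("C", "H"))
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def bond_profile(atoms, bonds):
    """Multiset of bonded element pairs with bond orders."""
    return Counter(
        (tuple(sorted((atoms[i], atoms[j]))), bond_order)
        for i, j, bond_order in bonds
    )


same_formula = molecular_formula(ETHANOL[0]) == molecular_formula(DIMETHYL_ETHER[0])
same_bonds = bond_profile(*ETHANOL) == bond_profile(*DIMETHYL_ETHER)
print(molecular_formula(ETHANOL[0]), same_formula, same_bonds)
```

Both molecules share the composition C2H6O, yet their connectivity differs: ethanol contains a C–C and an O–H bond, while dimethyl ether contains two C–O bonds and no O–H bond.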
[Figure 2: scheme, graphic not reproduced. Two parallel columns of terms for molecular objects. FCS: chemical entity; chemical compound, molecule; isomer; constitution, stereochemistry; 3D structure (shape); molecular properties as measured. VCS: chemical entity; molecule; isomer; constitution, stereochemistry; 3D structure (shape); molecular properties as predicted.]
Figure 2 Molecules are substantial objects of chemical investigations both in FCS and VCS. It is not easy to differentiate the terms that are used to describe molecules in these spaces. However, some differences can be definitely indicated, for example, in experimental FCS chemistry we are only very rarely investigating a single molecule. Chemical compound, that is, a population of molecules interacting with each other or agglomerated into a solid or liquid phase, predominantly focuses our attention. In contrast, theoretical methods often focus on a single molecule.

[Figure 3: structure drawings, not reproduced.]
Figure 3 Two hypothetically possible 3D structures for trans-1,2-dibromocyclohexene.
4.14.3.1 Computer-Processable Molecular Codes
Kekulé was the first to recognize the formation of carbon chains and rings. However, with the exception of the so-called sausage formulas, he did not use graphical representations of molecules. Couper, independently of Kekulé, developed the concept of the tetravalent carbon atom and presented carbon chains as atoms connected by dotted lines; finally, Crum Brown developed and popularized a molecular notation similar to that used today. The evolution of the molecular graphs illustrating molecular objects is briefly outlined in Figure 4.

Molecular graphs illustrating the 2D atomic arrangement of chemical entities are easy for chemists to read. Although current computer systems can handle molecular graphs, such a form is generally not computer-friendly and must be transformed before it can be presented to the computer. Linear notations and connection tables are two systems that enable efficient coding of molecular graphs. A linear notation represents a molecule as a string, similar to a line formula. The systems in use include the Dyson notation, Wiswesser Line Notation (WLN), the Sybyl line notation, Representation of Structure Diagram Arranged Linearly (ROSDAL), which was developed by the Beilstein Institute, and the Simplified Molecular Input Line Entry System (SMILES) [34]. For a detailed discussion of chemical structure notation, the reader is referred to the monographs available [20]. SMILES is probably the most popular line notation and is currently used in a number of environments; for example, Figure 5 illustrates drawing the structure of benzamide in the ACD ChemSketch freeware by explicitly writing its SMILES code into the appropriate window.
[Figure 4: historical structural formulae of ethyl alcohol and acetic acid; graphic not reproduced.]
Figure 4 Ethanol and acetic-acid formulae as shown by Kekule, Loschmidt, Couper and Crum Brown, from top to bottom, respectively. Adopted from Ihde, A. J. The Development of Modern Chemistry; General Publishing Company, Ltd.: Don Mills; 1984.

[Figure 5: screenshot, not reproduced.]
Figure 5 Generating a molecular graph from its SMILES notation in ACD ChemSketch.
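To a program, a line notation such as SMILES is first of all a string that must be split into atoms, bonds, branches, and ring-closure labels. The following Python sketch tokenizes a small subset of SMILES and counts heavy atoms and ring closures. It is an illustration only: the token pattern covers just the symbols needed here and is not the full SMILES grammar, the function names are hypothetical, and the benzamide string `NC(=O)c1ccccc1` was written for this example rather than taken from the chapter.

```python
import re

# Token pattern for a small SMILES subset (an illustrative assumption,
# not the complete grammar).
TOKEN = re.compile(r"""
    (?P<bracket>\[[^\]]+\])      # bracket atom, e.g. [C@]
  | (?P<atom>Cl|Br|[BCNOPSFI])   # aliphatic organic-subset atoms
  | (?P<aromatic>[bcnops])       # aromatic atoms
  | (?P<bond>[-=\#:/\\])         # bond symbols
  | (?P<ring>%\d{2}|\d)          # ring-closure labels
  | (?P<branch>[()])             # branch delimiters
""", re.VERBOSE)


def tokenize(smiles):
    """Split a SMILES string into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported character {smiles[pos]!r} at {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def summarize(smiles):
    """Count atom tokens and ring closures (each closure uses two labels).

    Bracket atoms are counted as heavy atoms, so an explicit [H] would be
    miscounted by this simplified sketch.
    """
    tokens = tokenize(smiles)
    atoms = sum(1 for kind, _ in tokens if kind in ("bracket", "atom", "aromatic"))
    ring_labels = sum(1 for kind, _ in tokens if kind == "ring")
    return {"heavy_atoms": atoms, "ring_closures": ring_labels // 2}


print(summarize("NC(=O)c1ccccc1"))       # benzamide
print(summarize("C12C3C4C1C5C2C3C45"))   # polycyclic example with five closures
```

A real application would rely on an established cheminformatics toolkit for parsing; the sketch only shows why a string notation is convenient input for a machine.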
See the SMILES manuals for the detailed coding rules [35]. An excellent tutorial is also available online from Daylight Chemical Information Systems [36]. Several illustrative examples of molecules coded in SMILES are shown in Figure 6.

Chemical graphs can also be coded by matrices. The adjacency matrix, atom connectivity matrix, incidence matrix, and bond electron matrix are only a few examples of the possible notations [24]. Connection tables are another way of coding molecular structures. A connection table records, in tabular form, only the atoms and bonds within a molecule. In contrast to matrix notation (an adjacency matrix grows with the square of the number of atoms), this reduces the amount of data needed as molecular size increases. Figure 7 illustrates an example of a connection table; connection tables can be written in explicit, redundant, and nonredundant forms. An in-depth description of the matrix and connection table codes can be found in Gasteiger and Engel [24].

A connection table or linear notation can be formed arbitrarily: numbers can be assigned to the atoms in different ways, so there is no standard molecular representation. Canonical labeling solves this problem by providing a unique representation for a given molecular graph. Unique SMILES is an example of such a canonical labeling system [36]. Chirality is an important property of chemical structure, and isomeric SMILES is a system that allows various chiral and isotopic specifications.

[Figure 6: structure drawings, not reproduced. SMILES strings shown with the structures:]
C(C(C)C)(C)C(=O)N
c1cc2CCCCc2cc1
N[C@](F)(C(=O)O)C
C/C(F)=C(F)/C
C12C3C4C1C5C2C3C45
Figure 6 An example of SMILES coding several different molecules.

[Figure 7: numbered structure drawing, not reproduced; the tables are reconstructed below.]
Atoms
No.  Element
1    C
2    C
3    N
4    O
5    H
6    H
7    H
8    H
9    H

Bonds
1st atom  2nd atom  Bond order
1         2         1
2         3         1
2         4         2
5         1         1
6         1         1
7         1         1
8         3         1
9         3         1
Figure 7 Connection table coding acetamide molecule.
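The relation between a connection table and a matrix code can be sketched in a few lines of Python, using the acetamide atom and bond lists of Figure 7. The function names `to_bond_matrix` and `renumber` are hypothetical, and the matrix built here is a simple bond-order matrix chosen for illustration, not one of the specific matrix codes cited above.

```python
# Connection table for acetamide, transcribed from Figure 7 (atoms numbered 1..9).
ATOMS = ["C", "C", "N", "O", "H", "H", "H", "H", "H"]
BONDS = [  # (first atom, second atom, bond order)
    (1, 2, 1), (2, 3, 1), (2, 4, 2), (5, 1, 1),
    (6, 1, 1), (7, 1, 1), (8, 3, 1), (9, 3, 1),
]


def to_bond_matrix(n_atoms, bonds):
    """Expand a bond list into a symmetric n x n bond-order matrix."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for first, second, order in bonds:
        matrix[first - 1][second - 1] = order
        matrix[second - 1][first - 1] = order
    return matrix


def renumber(bonds, new_label):
    """Apply an arbitrary atom renumbering; the molecule itself is unchanged."""
    return [(new_label[a], new_label[b], order) for a, b, order in bonds]


matrix = to_bond_matrix(len(ATOMS), BONDS)
valences = [sum(row) for row in matrix]  # sum of bond orders per atom
print(valences)
print(len(ATOMS) ** 2, "matrix cells versus", len(BONDS), "bond rows")

# Swapping the labels of the two carbon atoms leaves the atom list as it is
# but changes the bond list: the same molecule now has a different table.
swap = {i: i for i in range(1, len(ATOMS) + 1)}
swap[1], swap[2] = 2, 1
print(sorted(renumber(BONDS, swap)) == sorted(BONDS))
```

The matrix needs 81 cells for nine atoms, most of them zero, whereas the bond list has eight rows. The final comparison shows why canonical labeling is needed: without an agreed atom numbering, identical molecules yield different tables.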
4.14.3.2 Molecular Editors
Molecular graphs are an unambiguous, chemist-friendly, and illustrative way of presenting the constitution and stereochemistry of molecules. A molecular editor is an interface that not only allows the user to draw professionally presented molecular structures but also translates such a structure into computer-processable molecular codes. A number of systems have been developed that translate molecular formulas entered by the user, with input methods ranging from direct drawing with a mouse to machine-readable code. ISIS [37], ChemSketch (ACD/Labs) [38], JME [39], and RasMol [40] are molecular editors available free of charge at their respective websites. We cannot discuss all of this software here and concentrate instead on the JME editor, which was programmed by Peter Ertl from Novartis: “Since molecular construction and editing are indispensable for chemical information systems, and in 1994 no such tool was available for the WWW, we decided to develop our own WWW-based molecular editor. This editor was based on a clickable map.” Atoms, rings, and functional groups, connected by bonds, are added by choosing “the desired action from the menu, and then picking the appropriate place on the drawing area.” [39] Currently, JME is a Java applet that allows a molecular structure to be entered by drawing its graph directly within the hypertext of a website. A number of organizations using JME are listed at the Molinspiration website [39]. Figure 8 illustrates the applet embedded in the online catalog of the fine chemical supplier Sigma-Aldrich.

Sometimes it is helpful for chemical documentation to transform a 2D molecular illustration on a sheet of paper into a connection table.
The CLiDE program is an optical character recognition (OCR) system that performs such a transformation [41]. Although drawing a molecule in a molecular editor by writing its code instead of using a mouse may seem questionable, this method is much more convenient in a number of situations, for example, when many structures are to be generated automatically.
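Automatic generation is where written codes are most convenient, because structures can be produced by simple string operations. The Python sketch below enumerates para-disubstituted benzenes as SMILES strings. The substituent set, the labels, and the function name are hypothetical choices made for this illustration; the chapter does not prescribe this procedure.

```python
from itertools import combinations_with_replacement

# Hypothetical substituent set, written as SMILES fragments.
SUBSTITUENTS = {
    "fluoro": "F",
    "chloro": "Cl",
    "bromo": "Br",
    "hydroxy": "O",
    "amino": "N",
}


def para_disubstituted_benzenes(substituents):
    """Yield (label_a, label_b, SMILES) for each unordered substituent pair.

    The two para positions are equivalent, so A/B and B/A describe the same
    molecule and only one of them is generated.
    """
    pairs = combinations_with_replacement(substituents.items(), 2)
    for (label_a, frag_a), (label_b, frag_b) in pairs:
        yield label_a, label_b, f"{frag_a}c1ccc({frag_b})cc1"


library = list(para_disubstituted_benzenes(SUBSTITUENTS))
print(len(library))
for entry in library[:3]:
    print(entry)
```

Each generated string could then be pasted or piped into an editor or database that accepts SMILES input, which would be tedious to do by drawing every structure with a mouse.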
4.14.3.3 Computer-Oriented Chemical Compounds Nomenclature
[Figure 8: screenshot, not reproduced.]
Figure 8 The JME molecular editor mounted at the website of Sigma-Aldrich fine chemical supplier. The results of substructure searches using this applet are shown in Figure 16.

Chemical nomenclature illustrates well the differences between chemist and computer in acquiring and processing chemical data. Molecules can be complex, and it is often impractical to use their explicit structures, in the form of molecular graphs, connection tables, or similar notation systems, just to designate a particular chemical entity. This applies in particular to verbal communication among chemists. What chemistry needs, then, is a human-friendly naming system that allows individual molecules to be identified. In fact, compounds were identified by names before any other designation existed. For example, water is identified as a substance that is absolutely necessary for human life and is found in the environment as a relatively pure chemical compound. Although the name arose before we had any idea of the compound's chemical constitution and structure, it has been adopted into chemical nomenclature to label the water molecule. The formal designator oxygen dihydride is only very rarely used to name this compound. Faraday was the first to isolate the substance that he named bicarburet of hydrogen. It was renamed with the simpler and still used name benzin, or benzene in English, by Eilhard Mitscherlich, who obtained the compound by thermal treatment of lime and an acid isolated from gum benzoin, a balsamic resin of tropical Asian trees of the genus Styrax. The formal name that explains the compound's structure is cyclohexatriene [42]. When chemists obtain a compound, they often use simple names that describe its origin, commemorate an event (for example, Olympiadane) [43] or a person (for example, buckminsterfullerene) [44], point to other associations, or are simply acronyms enabling clear and rapid identification of structures. Some illustrative examples are given in Figure 9. Although trivial names are unique and clear to people working in the field, their etymology and sense can be completely obscure to the general chemical audience, especially some time after the synthesis and/or isolation of the molecule. Therefore, chemists have developed a systematic nomenclature scheme, starting from the recommendations of the 1892 Geneva Conference.
This system has been continuously improved by IUPAC.45 Some other nomenclature systems have also been developed but have never come into extensive application.46 Theoretically, IUPAC rules should relate to all factual and virtual structures, that is, a name can be generated for each molecule, regardless of its structural complexity. However, in reality the complexity of possible molecular structures requires many rule extensions and restrictions. In fact, IUPAC regularly provides recommendations on the nomenclature for the novel classes of compounds that appear. Fullerenes, or phanes, are examples of such relatively novel classes with special IUPAC nomenclature rules published. A perfect nomenclature should not only be unambiguous but also
Figure 9 Trivial chemical names: quinine named after the quinine tree bark, benzene (benzin) named after gum benzoin, chemical acronym L-870810, and Olympiadane to commemorate the Olympic Games.
unique. This means that a name label should not only clearly identify a single structure, but each chemical entity should also be described by only a single name label. IUPAC rules are human-oriented, which means that easy chemical nameability and name readability by human chemists have been given the highest priority. A human-friendly nomenclature does not necessarily meet the requirements of a computer-oriented system. For example, the IUPAC system does not restrict the names generated for a single structure to a unique value. This problem is still a challenge and is to be solved by the preferred name program (PNP) currently being carried out by IUPAC.47 In practice, Beilstein and CAS, the two main chemical information suppliers, adopted their own rules to provide different nomenclature systems that obey IUPAC rules but restrict them.46 Although it may sound surprising, until recent years name to structure conversion had not been solved in a general way. The first system that allowed input of chemical structures in the form of their chemical names was developed by Beilstein in 1986. This was operated internally at the Beilstein Institute, and the structure input was restricted to the Beilstein notation subrules.46 Several other converters are available now, for example within the Advanced Chemistry Development (ACD) LAB program.38 However, this system is limited to a name length of up to 255 characters, including spaces, punctuation marks, and others, and to up to 255 heavy atoms in the generated structure. Several further limitations concerning the nomenclature require the user to input a special name representation for correct recognition. Finally, ACD/Name to Structure software is not present in the freeware package of ACD ChemSketch but must be purchased as an additional module.
The Beilstein Institute allows a user to search its database by the chemical name (CN) field using an IUPAC-based name; however, the name used in the Beilstein Handbook (Beilsteins Handbuch der organischen Chemie) (BH), the name used in the original publication, or the name generated by AutoNom, the structure to name converting tool (cf. Section 4.14.4.1) available within the database, is preferred. Otherwise, according to the Beilstein database help, ‘name searches are not recommended to identify compounds, because names are ambiguous or not systematic in many cases.’
4.14.3.4
Coding Chemical Reactions
Atom bonding systems in molecules can change during a process described as a chemical reaction. A chemical reaction involves the breaking and formation of chemical bonds. The chemical compounds to be converted, or reactants, are transformed during chemical reactions into reaction products. The problems of chemical reaction nomenclature resemble those of the description of chemical compounds. Many reactions are named after their discoverers, honoring distinguished chemists. This naming corresponds to the trivial nomenclature of chemical compounds. In fact, a trivial name carries no information on the reaction itself. The Merck Index is a popular compendium that provides a guided tour through name reaction chemistry.48 Similarly, the Organic Chemistry Portal offers an excellent web-based name reaction database.49 However, the accumulated chemical reaction resources needed a more systematic classification and nomenclature that would give more detailed information on the particular molecular transformation. The most fundamental classification of organic reactions groups them into four classes: substitutions (exchanges), additions, eliminations, and rearrangements. The IUPAC Commission on Physical Organic Chemistry developed a systematic nomenclature for reactions grouped into several classes.17 More precisely, this system describes the rules for the nomenclature of eight reaction types, that is, substitutions, additions, eliminations, attachments and detachments, rearrangements, coupling and uncoupling, insertions and extrusions, and ring openings and closings. This is briefly illustrated in Figure 10. Although the IUPAC system seems to be attractive and universal, it has not been adopted in any organic chemistry handbook, with the exception of the recent edition of March’s Advanced Organic Chemistry. Such a reaction class description is also too coarse for the precise coding of the molecular transformation of a certain reactant to an individual product.
[Figure 10 panels: Substitution (HNO3, H2SO4): nitro-de-hydrogenation; Addition (Cl2): dichloro-addition; Elimination: dihydro-dibromo-bielimination]
Figure 10 Reactions named according to the rules of IUPAC Commission on Physical Organic Chemistry. Adapted from Smith, M. B.; March, J. March’s Advanced Organic Chemistry: Reactions, Mechanisms, and Structure; Wiley: New York, 2001.
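Rows and columns follow the atom order O1, C2, H3, H4, H5, C6, N7:

\[
\underbrace{\begin{pmatrix}
4 & 2 & 0 & 0 & 0 & 0 & 0\\
2 & 0 & 1 & 1 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 3\\
0 & 0 & 0 & 0 & 0 & 3 & 2
\end{pmatrix}}_{B}
+
\underbrace{\begin{pmatrix}
0 & -1 & 0 & 0 & 1 & 0 & 0\\
-1 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 0 & -1 & 0\\
0 & 1 & 0 & 0 & -1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}}_{R}
=
\underbrace{\begin{pmatrix}
4 & 1 & 0 & 0 & 1 & 0 & 0\\
1 & 0 & 1 & 1 & 0 & 1 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 3\\
0 & 0 & 0 & 0 & 0 & 3 & 2
\end{pmatrix}}_{E}
\]

Figure 11 Reaction coded by the B + R = E matrix equation. Adapted from Gasteiger, J.; Engel, T. Chemoinformatics: a Textbook; Wiley-VCH: Weinheim, 2003, p 186.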
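The matrix equation B + R = E of Figure 11 can be checked with a few lines of code. The following is a minimal sketch in plain Python, assuming the B and E entries as read off the figure (atom order O1, C2, H3, H4, H5, C6, N7); the helper names `matrix_add`, `matrix_sub`, `l1_norm`, and `bond_changes` are illustrative and not part of any chemoinformatics package. The sum of absolute R entries is used here only as one simple way to quantify how much the two states differ.

```python
# Bond-electron matrices read off Figure 11; atom order O1, C2, H3, H4, H5, C6, N7.
# Off-diagonal entries are formal bond orders, diagonal entries are free valence electrons.
ATOMS = ["O1", "C2", "H3", "H4", "H5", "C6", "N7"]

B = [
    [4, 2, 0, 0, 0, 0, 0],
    [2, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 3],
    [0, 0, 0, 0, 0, 3, 2],
]
E = [
    [4, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 0, 3, 2],
]


def matrix_add(a, b):
    """Element-wise sum of two square matrices given as nested lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matrix_sub(a, b):
    """Element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def l1_norm(m):
    """Sum of absolute values of all entries."""
    return sum(abs(x) for row in m for x in row)


def bond_changes(r, atoms):
    """List (atom_i, atom_j, change in bond order) for every changed atom pair."""
    return [
        (atoms[i], atoms[j], r[i][j])
        for i in range(len(atoms))
        for j in range(i + 1, len(atoms))
        if r[i][j] != 0
    ]


# The reaction matrix is the difference between the end and beginning states.
R = matrix_sub(E, B)

for atom_i, atom_j, delta in bond_changes(R, ATOMS):
    print(f"{atom_i}-{atom_j}: bond order change {delta:+d}")
print("sum of |R| entries:", l1_norm(R))
```

Read this way, R says that one C2-O1 bond and the H5-C6 bond are broken while O1-H5 and C2-C6 bonds are formed, which matches the addition of H-C≡N across a C=O double bond encoded in B and E.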
An illustrative algebraic model for the description of molecular transformations has been developed by Ugi and coworkers. This is based on logical connectivity and matrix addition. In this notation, the reaction is represented by the matrix equation B + R = E, where B (beginning) represents the initial reaction state, E (end) codes the final state, and R is the reaction matrix. Figure 11 illustrates an example of a reaction written in this notation.24 The Dugundji–Ugi (DU) notation not only provides an elegant and clear coding system for molecular transformations, but also reveals some further interesting features. For example, the R matrix indicates the distance between B and E and, as in a real reaction, provides an important measure of the extent of valence electron shifts needed for the B to E conversion; therefore, it directly reflects the real chemistry of the transformation. Other representations and classifications of chemical reactions have been developed but will not be discussed here further; the reader is referred to Gasteiger and Engel24 and Chen.50
4.14.3.5
Organizing Chemical Facts into Databases
Finally, what we need to enable chemists using computers to perform efficient chemistry is access to chemical information, that is, chemical data represented by chemical facts and numbers. Thus, for example, coding chemical transformation as described in Section 4.14.3.4 does not provide factual information gathered in chemistry on this specific reaction; for example, no information on reaction conditions, solvents, temperatures, catalysts, by-products, can be found in the B, R, and E matrices. These data are, however, fundamental for chemical research. Therefore, a number of chemical databases have been converted into a form compatible with the computer platform. Chemical data organized in searchable databases form a focal point of chemoinformatics. Chemical compounds and reaction databases such as Beilstein and Chemical Abstracts, patent databases such as esp@cenet, chemical substance catalogs, for example, Aldrich, and a variety of chemical journals are the sources that are available online with user-friendly interfaces. The impact of searchable chemical databases on chemical research is discussed in Section 4.14.4.8. Table 1 specifies several databases available online. An extensive list of a number of other databases is available on the web.51
4.14.4
In Silico Chemistry: Data Processing and Data Output Problems
Computers equipped with chemical information and software capable of understanding chemical data provide a chemoinformatic platform advising and assisting chemists in their research. Some of the problems appearing during an interaction between a chemist and computer in the course of data processing and data output are discussed below.
Table 1 Some chemical databases available online
Provider | Data available | Web address
Beilstein Institut | Chemical compounds and reactions | www.beilstein-institut.de
CAS | Chemical information (literature bibliography) | www.cas.org
Sigma-Aldrich, Fluka Supelco | Commercially available chemicals | www.sigmaaldrich.com
NIH | HIV therapeutics database | http://chemdb2.niaid.nih.gov
NIH, National Center for Biotechnology Information (NCBI) | PubMed literature database | www.ncbi.nlm.nih.gov [a]
National Institute of Advanced Industrial Science and Technology (AIST) | Spectral database | www.aist.go.jp [a,b]
NIH | An extensive list of chemistry databases on small molecules | http://cactus.nci.nih.gov/
European Patent Office | esp@cenet patent database | www.espacenet.com
NCBI | Various molecular databases including PubChem Compound, a chemical compounds database | www.ncbi.nlm.nih.gov
eMolecules, Inc. | Chemical molecules, spectra, suppliers, etc. | www.emolecules.com
Elsevier, MDL | DiscoveryGate, a small molecule database environment that enables simultaneous searches of several different databases (including Beilstein and patent databases) | www.discoverygate.com
[a] Several other protein, structure, etc., databases are available at this address.
[b] www.aist.go.jp/RIODB/SDBS/cgi-bin/direct_frame_top.cgi?lang=eng.
4.14.4.1
Computer-Generated Chemical Names
Chemical name generators realizing a structure to name transformation are generally supplied with a molecular editor that enables the introduction of a molecular structure in the form of a 2D graph. The AutoNom program developed at the Beilstein Institute was the pioneer in this field. Wisniewski discussed and designed algorithms that included the following components:46
structure initialization, functional group identification, ring perception and recognition, parent structure selection, binary name tree processing, chemical name assembly.
Functional group identification is a table-driven approach that enables recognition of favored atom groups known as functional groups, which are then ranked according to the rules predefined by IUPAC. The approach adopted involves a ‘‘rapid atom by atom connectivity search mechanism’’ similar to that used in substructure searches.46 Ring systems formed by atoms or their assemblies are important components determining chemical names. Thus, all ring closures within the smallest sequence of atoms are to be identified. The so-called smallest set of smallest rings (SSSR) algorithm is used for the correct identification of the ring structures consistent with nomenclature rules. The ambiguity of ring identification within chemical graphs can be illustrated by the topology of a simple tetrahedron having four faces, three rings, but six valid SSSRs.36,52 The ring perception step described above is a preliminary step that allows a program to identify certain ring classes, for example monocyclic alkanes, bicyclic alkanes, monospirocyclic alkanes, or trivial name ring systems whose names are obtained by using a lookup dictionary procedure. A collection of detailed rules and routines describes naming for each individual class. During the parent structure selection step, the candidate structural fragments, mainly rings and chains, are screened. The global rules that govern the structure of a generated name are a sequence of principles that follow IUPAC nomenclature. Nonparent structure fragments are then
introduced as substituents and subsequent substituents on substituents. The so-called binary name tree processing is then performed. During this step, the parent molecular fragment becomes the root of the tree, and other tree nodes represent other identified units that are to be named. Processing of the name tree starting from the root gives the preliminary name assembly that includes, for example, punctuation and locants. A special control is then applied for the identification of the trivial name blocks that are preferred by IUPAC; for example, AutoNom generates the name benzoic acid and not benzene carboxylic acid. The success rate of this program amounted to 86.3% when tested on more than 63 000 sample structures. The current version of the AutoNom program allows a user to generate name forms consistent with either the Beilstein or ACS nomenclature.46 ChemSketch, developed by Advanced Chemistry Development Inc., is a freeware part of an extensive software system that can be downloaded directly from the ACD/Labs Internet site.38 As a freeware version it allows users to generate a name for ‘molecules containing no more than 50 atoms, and no more than 3 rings, with atoms from among only H, C, N, P, O, S, F, Cl, Br, I, Li, Na, and K.’ The ILAB is an interface for the fee-based ACD/Labs Online service, which extends these limits in a pay-per-use fashion.38 Similarly to the Beilstein tool, the ACD generator also provides the names in their Beilstein or ACS version.
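The ring perception step mentioned above can be made concrete with a small graph calculation. This is a minimal sketch, not the AutoNom algorithm: it assumes a molecule is given as a list of atom index pairs (bonds), computes the number of rings an SSSR must contain (the cycle rank: bonds minus atoms plus connected components), and lists all three-membered rings. The function names are illustrative.

```python
from itertools import combinations


def cycle_rank(n_atoms, bonds):
    """Number of rings in an SSSR: bonds - atoms + connected components."""
    parent = list(range(n_atoms))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = n_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - n_atoms + components


def three_membered_rings(n_atoms, bonds):
    """All atom triples in which every pair is bonded."""
    bond_set = {frozenset(b) for b in bonds}
    return [
        triple
        for triple in combinations(range(n_atoms), 3)
        if all(frozenset(pair) in bond_set for pair in combinations(triple, 2))
    ]


# Tetrahedral cage: 4 atoms, every pair bonded (6 bonds).
TETRAHEDRON = list(combinations(range(4), 2))
# Simple six-membered ring for comparison.
CYCLOHEXANE = [(i, (i + 1) % 6) for i in range(6)]

print("tetrahedron SSSR size:", cycle_rank(4, TETRAHEDRON))
print("tetrahedron 3-rings:  ", three_membered_rings(4, TETRAHEDRON))
print("cyclohexane SSSR size:", cycle_rank(6, CYCLOHEXANE))
```

For the tetrahedron there are more equally small candidate rings than the SSSR can hold, so the choice of rings is not unique; for a single ring such as cyclohexane the SSSR is unambiguous.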
4.14.4.2
Molecular Modeling
Molecular modeling is a method that includes a variety of computational schemes that are aimed at simulating molecular structures, their properties and behavior in silico. In particular, this should also include molecular manipulations, that is, visualizing molecules on the screen using different modes, merging molecules, superimposing, and rotating molecules in space and bonds within individual molecules, and so on, as well as molecular predictions, that is, predicting molecular shape by 3D structure generation and modeling or forecasting chemical properties or eventual biological activity or effects. In particular, modeling virtual molecular structures themselves is not a trivial problem and can be achieved at different levels of approximation. For a brief introduction into general problems and applications of molecular modeling, the reader is referred to Höltje et al.53
4.14.4.2.1
Structure generators
4.14.4.2.1(i)
2D structure generators
In novel approaches we often sample VCS by systematically changing various molecular moieties in a user-directed mode. This can demand the generation of thousands or even millions of structures, an operation that can be achieved only in an automated way. Such an operation can be easily programmed in a variety of environments, for example MATLAB, based on SMILES codes, whose syntax is simple enough. The 2DCOOR program is an example of a 2D structure generator available from Molecular Networks.54
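The systematic variation of moieties described above amounts to filling substituent slots in a SMILES template. The sketch below uses Python rather than MATLAB; the template (a para-disubstituted benzene), the substituent lists, and the function name `enumerate_smiles` are assumptions chosen for illustration only.

```python
from itertools import product


def enumerate_smiles(template, substituents):
    """Yield one SMILES string per combination of substituents.

    template     -- SMILES with named placeholders, e.g. "{r1}c1ccc({r2})cc1"
    substituents -- dict mapping each placeholder name to a list of fragments
    """
    names = sorted(substituents)
    for combo in product(*(substituents[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))


# Hypothetical scaffold and substituent sets.
TEMPLATE = "{r1}c1ccc({r2})cc1"
SUBSTITUENTS = {
    "r1": ["C", "CC", "O", "N"],     # methyl, ethyl, hydroxy, amino
    "r2": ["F", "Cl", "Br", "O"],    # fluoro, chloro, bromo, hydroxy
}

library = list(enumerate_smiles(TEMPLATE, SUBSTITUENTS))
print(len(library), "structures, first:", library[0])
```

String enumeration of this kind does not guarantee distinct molecules: two different strings may encode the same compound, so a canonicalization step would normally follow before the library is stored or searched.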
4.14.4.2.1(ii)
3D molecular structure
In a variety of chemical research, we simplify the real structure of a chemical molecule to its molecular configuration (cf. Section 4.14.2). What we usually mean by molecular configuration is a simplified 3D molecular structure; for example, we classify E and Z isomers as two different configuration series, although some other effects such as steric hindrance can further affect individual structures. Actually, in organic chemistry, we often rely on such simplification. However, molecules are 3D objects, which means each atom can be described by its exact location in space. We can observe this by applying X-ray diffraction to crystals, which allows us to reveal the 3D structure of the atomic lattice and thus to describe the 3D structure of the molecule. This method is limited to condensed matter (crystals). Although there are many further approaches that allow chemists to disclose some structural data concerning the 3D atomic pattern, for example, by the application of NMR, current physics and chemistry do not have a general technology for the observation of the 3D molecular structure. X-ray crystallography poses problems related to the production of crystals, which is not always an easy task, and there is also the question of the relationship between the atom configuration in condensed matter and the configuration in other environments. Even though nowadays we have data for quite a number of structures (Figure 12), including peptides or drug–ligand complexes, it is only a small percentage of the compounds described.55
[Figure 12 plot: number of structures (0 to 400 000) versus year (1972 to 2004)]
Figure 12 An increasing number of compound structures available in the Cambridge X-ray database. Reprinted with permission from Cambridge Structural Database. www.ccdc.cam.ac.uk/products/csd/statistics. © Cambridge Structural Database.
A 3D molecular structure describes molecular shape, which is fundamental for a variety of chemical effects, for example, molecular recognition phenomena (supramolecular chemistry) or drug–receptor interactions. However, molecules are dynamic objects that generally can adopt different shapes depending on their interactions with the environment. Therefore, molecular shape is also a dynamic property. It depends not only on the molecule and its environment but also on its energy. The higher the energy, the larger the range of nuclear configurations (conformations) available to the molecule. This makes molecular shape a fuzzy category of an extremely complex character.22 Moreover, chemists often make use of virtual molecules that have never been synthesized, or hypothetical structures that would be extremely unstable or could never be isolated, for example, activated complexes. Therefore, computer modeling is often not only a matter of low expense but the only possibility of obtaining a hypothetical 3D molecular structure.
4.14.4.2.1(iii)
3D structure generators
3D structure generators are programs or program blocks designed to convert 2D molecular graphs into their 3D representations. Since we do not have a single measure of molecular shape, X-ray structures are a kind of standard for experimental atomic 3D coordinates. The strategy used by 3D structure generators is to apply standard bond lengths and bond angles for the different atom types defined by hybridization. Additional rules on the preferred conformations of cyclic systems are usually incorporated into such programs. If a nontypical atom type appears at the program input, a program can calculate a reasonable value rather than crash. Sadowski56 lists several requirements that we expect from 3D structure-generating programs:
robustness: a program should not crash answering different structures that can appear at the point of input; large file handling capability; variety of chemical types processed; correct stereochemistry interpretation; rapidity and automatic mode of action; high-quality models at the output; high conversion rate.
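The rule-based strategy of applying standard bond lengths and angles can be sketched for the simplest possible case, the carbon backbone of an unbranched alkane in its extended (zigzag) conformation. This is an illustrative toy, not the algorithm of any of the programs named in this section; the constants are common textbook values for sp3 carbon and the function names are hypothetical.

```python
import math

STANDARD_CC_BOND = 1.54      # angstrom, typical sp3 C-C bond length (textbook value)
STANDARD_CCC_ANGLE = 109.5   # degrees, ideal tetrahedral angle


def zigzag_chain(n_atoms, bond=STANDARD_CC_BOND, angle_deg=STANDARD_CCC_ANGLE):
    """Return (x, y, z) coordinates of a planar zigzag carbon chain."""
    half = math.radians(angle_deg) / 2.0
    dx = bond * math.sin(half)   # advance along the chain axis per bond
    dy = bond * math.cos(half)   # alternating offset perpendicular to the axis
    return [(i * dx, (i % 2) * dy, 0.0) for i in range(n_atoms)]


def bond_angle(p, q, r):
    """Angle p-q-r in degrees, with q as the central atom."""
    v1 = [a - b for a, b in zip(p, q)]
    v2 = [a - b for a, b in zip(r, q)]
    cos_t = sum(a * b for a, b in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(cos_t))


coords = zigzag_chain(6)  # hexane backbone, hydrogens omitted
lengths = [math.dist(coords[i], coords[i + 1]) for i in range(5)]
angles = [bond_angle(coords[i], coords[i + 1], coords[i + 2]) for i in range(4)]
print("bond lengths:", [round(x, 3) for x in lengths])
print("bond angles: ", [round(x, 1) for x in angles])
```

A real generator must additionally handle branching, rings, stereochemistry, and close contacts, which is where the requirements listed above (robustness, correct stereochemistry, conversion rate) become demanding.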
An early example of a 3D generator is a program block built in logic and heuristics applied to synthetic analysis (LHASA) to analyze the influence of steric hindrance on reactivity for the cyclohexane conformations. The Model Builder mounted in Hyperchem57 is another example of the program block capable of generating 3D atomic coordinates in an interactive mode. This means that a molecular graph drawn by a user in molecular
editor at the program interface is converted into a 3D molecular structure. The program adds hydrogen atoms to a nonhydrogen molecular graph and generates 3D atomic coordinates. It operates using built-in rules to assign standard bond lengths, bond angles, torsion angles, and stereochemistry. The approximate structures that are formed might require refinement by geometry optimization, as discussed in Section 4.14.4.2. However, Hyperchem Model Builder is not a typical example of a 3D structure generator. We usually use such programs for a high-speed conversion of an extremely large number of 2D molecular structures into their 3D representations. CORINA, Cobra, Alcogen, Chem-X, Molgeo, and Converter (cf. Sadowski56 for the respective references) are programs capable of automated structure data conversion. To give a sense of the speed, CORINA, developed by Gasteiger’s group,58 processes a data set of 100 100 small- to medium-sized molecules in 2301 s on a 1.0 GHz workstation, which corresponds to about 23 ms per compound, with a 99.99% conversion rate. It is also capable of generating multiple ring conformers, and ROTATE and STERGEN can be included into this program to analyze rotamers and tautomers. Concord, which was developed by Pearlman, is a built-in 3D generator within the SYBYL computational informatics software for molecular modelers. ‘‘This handles input/output (I/O) in all of the common industry-standard formats and offers a variety of built-in geometry optimization options.’’59 Figure 13 compares the efficiency of several programs in generating 3D structures.
4.14.4.2.2
Modeling 3D structures
The application of virtual molecular models constructed in silico is a routine procedure in today’s chemistry. Computational chemistry offers many types of molecular or quantum mechanics modeling methods based on different approaches to bonding and structures. See Gasteiger and Engel24 for the importance of these methods in chemoinformatics; in particular, quantum methods in medicinal chemistry can be found in Carloni and Alber.60 A brief online introduction can be found on the web.61 These methods allow refinement of molecular 3D structures, or geometry optimization. Both molecular and quantum mechanics approaches owe their origins to the Born–Oppenheimer (BO) approximation. This means that the Schrödinger equation $H\psi = E\psi$, relating the wave function $\psi$, the Hamiltonian operator $H$, and the energy $E$, can be given in a simplified form. Formally, H includes components responsible for nuclear kinetic energy, electron kinetic energy, nuclear
[Figure 13 plot: RMSXYZ (Å) versus number of structures for Chem-X, Molgeo, Corina, Converter, Concord, Alcogen, and Cobra]
Figure 13 The performance of 3D structure generators, shown as a plot of RMS value of nonhydrogen atoms versus a conversion rate. Reprinted with permission from Sadowski, J. Representation of 3D Structures. In Gasteiger, J. Ed.; Handbook of Chemoinformatics from Data to Knowledge; Wiley-VCH: Weinheim, 2003; pp 231–260.
repulsion energy, electron repulsion energy, and electron–nuclear attraction. In the BO approximation, we assume that the electron distribution depends only on the fixed nuclear positions, which means that the nuclear kinetic energy term can be neglected in the H operator. Informative discussion and references to molecular modeling can be found in Höltje et al.53
4.14.4.2.2(i)
Molecular mechanics
The term molecular mechanics (MM) appeared in the 1970s to describe the so-called force field method. A brief introduction can be found in Hinchliffe.62,63 This method works on the assumption that a molecular set of atoms can be defined by its potential energy, which depends on the molecular geometry given by the atoms’ locations in space. Potential energy calculations are based on the mechanics developed by classical physics. This approach can be represented by an illustrative model in which atoms are balls and bonds are springs connecting the balls. Atomic interactions are given by analytical functions, which define the force field. Different force fields were developed for application to certain classes of compounds and functions, for example, MM+, AMBER, BIO+, and OPLS. The force-field concept also allows us to include electronic energy implicitly in MM calculations, but only via parameterization. The search for the minimum on the potential energy surface, which maps the molecular geometry space, provides the final molecular model. MM calculations give an insight into molecular geometries and energies. In practice, a variety of programs offer MM calculations, for example, HYPERCHEM or Sybyl (Tripos). A variety of examples for the application of molecular modeling and MM calculations in chemistry are discussed in Goodman64 and Keseru and Kolossvary.65 Online access to an excellent discussion and examples is offered by Rzepa.66 Figure 14 shows an example of MM application for the simulation of the X-ray structure of D-glucose.
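The balls-and-springs picture and the search for an energy minimum can be sketched for a single bond. The example below is a minimal illustration under stated assumptions: a harmonic stretching term E(r) = 0.5 k (r - r0)^2, a force constant and reference length that are placeholder values rather than parameters of any published force field, and plain steepest descent as the minimizer.

```python
def bond_energy(r, k, r0):
    """Harmonic bond-stretch energy 0.5 * k * (r - r0)**2."""
    return 0.5 * k * (r - r0) ** 2


def bond_gradient(r, k, r0):
    """Derivative of the stretch energy with respect to the bond length."""
    return k * (r - r0)


def minimize_bond(r, k, r0, step=1e-3, tol=1e-8, max_iter=10_000):
    """Steepest descent: move against the gradient until it is nearly zero."""
    for _ in range(max_iter):
        g = bond_gradient(r, k, r0)
        if abs(g) < tol:
            break
        r -= step * g
    return r


# Hypothetical parameters for illustration only.
K_STRETCH = 300.0   # energy units per angstrom squared (assumed value)
R_EQUIL = 1.54      # angstrom, reference bond length (assumed value)

r_start = 1.80      # a deliberately stretched starting geometry
r_min = minimize_bond(r_start, K_STRETCH, R_EQUIL)
print(f"start: r = {r_start:.3f}, E = {bond_energy(r_start, K_STRETCH, R_EQUIL):.3f}")
print(f"final: r = {r_min:.6f}, E = {bond_energy(r_min, K_STRETCH, R_EQUIL):.3e}")
```

A full force field sums many such terms (stretching, bending, torsion, nonbonded interactions) over all atoms, and the minimization then runs over all Cartesian coordinates at once.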
4.14.4.2.2(ii)
Semiempirical quantum chemistry methods
Semiempirical quantum mechanical (SM) methods offer a further approximation of molecular models, supplementing the MM calculation with a deeper insight into the electron distribution described by molecular orbitals. Molecular orbitals are calculated in SM methods at different levels of approximation while including only valence electrons. The core maneuver at the heart of SM methods is parameterization. This means that a set of parameters is derived from the experimental data for a certain set of compounds (database). This significantly simplifies quantum chemical
Figure 14 The MM-derived model (red) compared with X-ray structure (blue) of D-glucose.
calculations. AM1, MNDO, PM3, CNDO, and INDO are examples of the individual SM methods. See Cramer67 for a brief review of the individual methods.
4.14.4.2.2(iii)
Molecular dynamics
Molecular dynamics (MD) is another approach for the investigation of atom locations in space. In this approach, a single-point model is replaced by a dynamic model in which the nuclear system is forced into motion. The motion is simulated by the numerical solution of the classical Newtonian equations of motion. The set of possible atom locations gives, for example, a conformational ensemble profile for a given molecule. MD can also provide information on thermodynamic and dynamic properties of molecules. MD can be used for simulations of protein shapes and refinement of X-ray structures. For further discussion the reader is referred to Rapaport.68 This reference also includes a number of additional references for the application of MD in chemistry. Figure 15 illustrates the application of an MD-generated atomic pattern for 2,4,5-trinitrobenzoic acid. AMBER, CHARMM, CHARMm, DL_POLY, GROMACS, GROMOS, NAMD, LAMMPS, and QUANTUM 3.1 are examples of software capable of MD simulation.69
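The numerical solution of Newton's equations mentioned above can be illustrated with the velocity Verlet scheme, a commonly used integrator, applied to the simplest possible system: one particle on a harmonic spring in reduced units. This is a sketch under those assumptions and does not reproduce the integrator or units of any of the packages listed; all names are illustrative.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one-dimensional Newtonian motion; return a list of (x, v)."""
    trajectory = [(x, v)]
    a = force(x) / mass
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update with averaged acceleration
        a = a_new
        trajectory.append((x, v))
    return trajectory


# Reduced units: unit mass on a unit-stiffness spring (assumed test system).
K_SPRING = 1.0
MASS = 1.0


def spring_force(x):
    return -K_SPRING * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * K_SPRING * x * x


trajectory = velocity_verlet(x=1.0, v=0.0, force=spring_force,
                             mass=MASS, dt=0.01, n_steps=1000)
x_end, v_end = trajectory[-1]
print(f"x(t=10) numerical = {x_end:.5f}, analytical = {math.cos(10.0):.5f}")
print(f"energy drift = {abs(total_energy(x_end, v_end) - total_energy(1.0, 0.0)):.2e}")
```

The list of (x, v) pairs plays the role of the trajectory from which, in a real simulation, conformational ensembles and thermodynamic averages are collected.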
4.14.4.3
Structure and Substructure Searches
Comparing molecular structures is a fundamental operation in a variety of chemical procedures. Consequently, structure and substructure searches are of fundamental importance for a variety of issues discussed in this chapter. Current chemical investigations cannot be performed without searches of chemical compound and reaction databases. Accordingly, a structure search, which is the simplest database query, is a routine operation. The translation of the molecular graph to a canonical molecular representation is the basic operation that enables such a query. A hash function is used for data structure organization. In this context, the translation of the canonical SMILES strings or the Augmented Connectivity Molecular Formula in the CAS Registry System into hash codes is an important operation. For further information and representative references see Leach and Gillet.70 Substructure search is another approach for molecular queries. This enables finding a defined molecular fragment included in all molecules under analysis. For further discussion and representative references to algorithms see Noordik.31 Figure 16 provides an example of the substructure search within the online Aldrich
Figure 15 The MD-generated atomic pattern for 2,4,5-trinitrobenzoic acid.
Figure 16 The results of the substructure searches of the Aldrich commercial chemicals database using the molecular formulae shown in Figure 8.
chemical database. Algorithms for substructure searches can be based on graph theory (the so-called subgraph isomorphism problem). From the technical point of view, binary string molecular representations are used for rapid structure screens, and two main approaches are used by the binary search methods.70 In the first approach, the structural key, which is a Boolean array usually represented by a bitmap, each element codes true or false, that is, the presence or the absence of a certain structural feature or pattern. The fingerprint is the second approach enabling high-speed structure queries. Similar to the structural key, a fingerprint is also a Boolean array; however, the structural pattern set is not predefined. For further information on this issue, the reader is referred to the brief but informative review given by the Daylight Chemical Information System.36 Similarity-based searches are another approach for identifying chemical structures, based on the idea of molecular similarity measures. For the essentials see Kochev et al.71 Mining 3D structure databases is a similar problem of substantial importance for pharmacophore mapping and searches for possible ligands using the ligand-based drug discovery paradigm.72 The reader is also referred to Kochev et al.71 for the list of software available for structure, substructure, and similarity searches.
4.14.4.4
Molecular Graphics
Molecular graphics is sometimes used as a synonym of molecular modeling. In the narrower meaning, this term refers to the visualization of molecular objects in virtual reality, and is really a component of the broader problem of scientific visualization.75 The first project aimed at the visualization of physical models on a computer screen was initiated in the 1960s at MIT within the Mathematics and Computation (MAC) program. Molecular visualization is an interdisciplinary problem between chemistry and computer science.73 A variety of interactive systems have been developed to display virtual chemistry on screen. This enables the use of atomic, molecular surface, or a variety of other symbolic molecular representations.74 For a comprehensive review with representative references, the reader is referred to Keil et al.,75 whereas a brief discussion on the differences between physical and virtual models can be found in Morris.76
4.14.4.5
Chemical Syntheses and Retrosyntheses (Disconnections)
Chemical synthesis is a core problem in chemistry. It constructs fundamental objects for chemical research. However, even today, organic synthesis is a bottleneck in a variety of chemical applications. In other words,
chemists believe that it is still an art. To better understand this problem, consider the example of tropinone, the core atropine moiety, which is cited in the majority of textbooks on organic synthesis. Atropine is a natural product that was isolated from nightshade or belladonna in ancient Rome and India. However, it was not until 1901 that Willstätter proved the structure and synthesized tropinone in a synthetic approach of more than 20 steps. The total yield of this synthesis amounted to no more than 1%. Appearances can be deceptive; despite the low yield, this was a masterpiece of synthetic work performed by a Nobel Prize winner. Even today, this synthesis would be a relatively complex organizational challenge, even though we have precise and rapid NMR or MS spectrometers that allow us to establish efficiently and rapidly the structure of synthesized compounds. However, it is the approach of Robinson (Figure 17) that takes one’s breath away. A one-step process involving condensation and decarboxylation can give the same product in more than 90% yield (42% was reported in the original publication). This example indicates that the efficiency of a synthesis depends dramatically on the individual chemist’s approach. The question is: can we use a computer to provide a hint for a chemist, or at least make the synthesis design process slightly less dependent on the individual human’s skills?
4.14.4.5.1
The development of product-to-reagents strategy in synthesis design
In the majority of practical applications, chemists need to design and produce novel substances that can then be used by industry, pharmacy, agriculture, and so on. Thus, the product is what we basically concentrate on. However, chemistry focuses on the chemical reaction, that is, a conversion that starts from the reagents and ends with the products. This clearly originates from the historical development of chemical knowledge. Chemists first had to explore the possible conversions of known compounds, that is, their reactivity, before they could obtain any required entity in a rational way. The chemical reactivity of a molecule is the core issue of organic chemistry, as inspection of any organic chemistry textbook shows. In other words, we are trained to analyze the problem in a reagents-to-product strategy. In such a strategy, the design of a product is realized indirectly by screening possible reactions. After all, we cannot build a database for the direct search of products because CS is too highly populated; that is, we cannot find every virtual product in any database. In practice, chemists have been accustomed for years to designing products in a reagents-to-product strategy using indirect approaches that involve77
a search for a substructure that is present in the target molecule (TM) and whose synthesis is known; if so, a major synthetic step is available;
a search for a chemical reagent with a structure similar to that of the TM; if so, the problem now is to find a reaction converting this reagent to the TM;
a search for a fragment of the TM that is available as a natural product, for example, chiral pool synthesis, in which available enantiopure substances are used as TM building blocks;
a search for a known reaction scheme converting reagents to product.
Figure 17 Willstätter (upper) and Robinson (lower) approaches to the tropinone synthesis.
Finally, the conversion from reagents to product can also be a serendipitous discovery. A completely different approach to synthesis design was used by Corey, who developed a direct product-to-reagent strategy.78 In such a strategy, we start from a product (the TM or synthetic target), which gives reagents in what is called a retrosynthesis or transform. A retron is a structural pattern, indicated as a TM subunit, that makes possible a certain transform representing a retroreaction. A synthon is a virtual reagent resulting from retrosynthesis. More precisely, synthons are any atom groups obtained by disconnecting chemical bonds in a product (called the TM in this approach) in such a way that we can associate them with possible real reagents and chemical reactions yielding that product. Synthons are virtual chemical entities with indicated chemical reactivity. Thus, the plus (+) and minus (−) signs resulting from the bond breaking indicate the acceptor or donor synthon types, which correspond to electrophilic or nucleophilic reactivity, respectively. Figure 18 gives a simple example of a disconnection. It is worth mentioning that a precise definition of a synthon can become a problem, and Corey did not use this term in his recent monograph.78,79 Moreover, synthons resulting from disconnections are often written directly in a form that already converts them into standard reagents (cf. Section 4.14.4.5.3.1).
4.14.4.5.2
Synthon nomenclature
Although useful reactions on carbon chains deprived of any functionality are sometimes possible, and such chemistry draws chemists' attention, functional groups (FGs) are of key importance in contemporary synthetic chemistry. FGs, that is, any groups of atoms other than carbons and/or hydrogens joined by single bonds, differentiate carbon atoms and impart chemoselectivity to them. Synthon atoms are numbered according to the relative position of the FG and the disconnected carbon atom. Figure 18 illustrates an example of a disconnection giving the common carbonyl acceptor synthon a1 and carbonyl donor synthon d2.
4.14.4.5.3
Operations on synthons
Synthons are formed in disconnections, so they are really virtual molecular entities that need at least the addition of an 'ending' at the terminal atom to be transformed into a stable molecule.
4.14.4.5.3(i)
Synthon-to-reagent conversion
Often, mounting a virtual 'ending' in the form of an ionized atom or group of atoms converts a synthon directly into a reagent. Figure 19(a) shows an example of the acceptor a1 synthon converted to the possible reagents CH3COCl or CH3CHO by adding a Cl− or H− 'ending'. Figure 19(b, c) also illustrates the fact that, although both generated reagents react with a nucleophile Nu−, reaction b proceeds via cleavage of the added ending, that is, by a substitution mechanism with Cl− as the leaving group, while reaction c1 proceeds via an addition mechanism in which the 'ending' added to the synthon is preserved. The high basicity of H− (H− is not a suitable leaving group) makes route c2 highly improbable. Synthon-to-reagent conversion may not always be as simple as discussed above. For example, the carbonyl acceptor a3 synthon formally needs the abstraction of the neighboring proton at C2 to be transformed into the respective alkene (Figure 20), the acceptor a1 synthon +COOH can be converted to CO2, and the donor d1 synthon −COOH to a cyanide (CN−) salt. For further discussion, the reader is referred to Smit et al.79
Figure 18 A simple example of the disconnection to an acceptor a1 and donor d2 synthon.
Figure 19 A synthon-to-reagent transformation (a) and reaction schemes determined by this conversion (b, c).
Figure 20 An example of synthon a3 to reagent conversion proceeding via β-H+ elimination.
4.14.4.5.3(ii)
Synthon modification
Modifying a synthon controls its reactivity. Figure 21 gives an example of a reactivity-controlling group. Thus, the acetone-derived donor synthon d2 can be modified by replacing its 'ending' to give the ethyl acetylacetate synthon d2′. Both are equivalent acetone synthons. The synthesis path using the ethyl acetylacetate salt as a reagent representing synthon d2′ is shown in Figure 21(b). This route corresponds to the routine ketone synthesis from ethyl acetylacetate. Synthons can also be modified to control reaction stereochemistry or to block reactivity where it is not needed (protecting groups). This is discussed in detail in many available references, for example, in Fuhrhop and Penzlin.80 Generally, however, such transformations require more complex chemistry than that illustrated in Figure 21 and discussed in Section 4.14.4.5.3.
4.14.4.5.3(iii)
Umpolung
The acceptor a1 synthon is generated by the carbonyl group, as shown in Figure 18. This makes available targets of the Figure 22(a) type, which are described by acyl cation chemistry, but not those of the Figure 22(b) type, which need the donor d1 synthon and acyl anion chemistry. Although acyl anions are not directly available, such chemistry can be realized indirectly by the reaction sequence shown in Figure 22(c). The concept of such operations, the so-called umpolung, was developed by Seebach.81 Currently, a variety of individual umpolung operations that change synthon reactivity have been adopted into synthetic routine.
Figure 21 The disconnection of pentan-2-one gives the d2 and alkyl a synthons. Synthon d2 is modified to d2′ and converted to ethyl acetylacetate, which is then reacted in a routine ketone synthesis.
Figure 22 The umpolung of acetyl cation (a) provides acetyl anion (b) capable of the reaction with electrophile E using the dithiane chemistry (c).
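The a/d numbering and the polarity inversion described in this section can be illustrated with a small sketch. It is only a toy model under stated assumptions: the class `Synthon`, the helper `position_from_fg`, and the atom labels are hypothetical names introduced here, the heteroatom of the FG is taken as position 0 (as in Figure 18), and umpolung is reduced to flipping the a/d label at the same position.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Synthon:
    """Toy synthon label: polarity plus position relative to the FG."""
    polarity: str  # "a" = acceptor (electrophilic), "d" = donor (nucleophilic)
    position: int  # number of bonds from the FG heteroatom (position 0)

    @property
    def label(self) -> str:
        return f"{self.polarity}{self.position}"

    def umpolung(self) -> "Synthon":
        """Invert the polarity at the same position, e.g. a1 -> d1."""
        return Synthon("d" if self.polarity == "a" else "a", self.position)


def position_from_fg(bonds, heteroatom, reactive_atom):
    """Breadth-first search for the bond count between the two atoms."""
    neighbours = {}
    for u, v in bonds:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    seen, queue = {heteroatom}, deque([(heteroatom, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == reactive_atom:
            return dist
        for nxt in neighbours.get(atom, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("reactive atom is not connected to the functional group")


# Hypothetical atom labels for an acetone-like skeleton: O=C1(-C2)(-C3)
bonds = [("O", "C1"), ("C1", "C2"), ("C1", "C3")]
acceptor = Synthon("a", position_from_fg(bonds, "O", "C1"))  # carbonyl carbon
donor = Synthon("d", position_from_fg(bonds, "O", "C2"))     # alpha carbon
print(acceptor.label, donor.label, acceptor.umpolung().label)
```

The labels printed for this skeleton are the acceptor a1, the donor d2, and the umpolung product d1 of the a1 synthon. Real umpolung requires actual reagent chemistry, such as the dithiane sequence of Figure 22(c); the label flip only records its net effect.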
4.14.4.5.4
Computer-assisted synthesis design
The total synthesis of longifolene, a natural product occurring in pine resins, was by no means a trivial matter.78 Designing the synthesis of such compounds demands great experience, talent, and art from chemists. Figure 23 illustrates a disconnection scheme for this molecule designed by Corey in 1957.82 Strictly speaking, the scheme directly indicated reagents and not synthons. Corey was also the first to aim at applying computers to synthesis design.
Figure 23 First disconnection performed by Corey: retrosynthetic analysis for longifolene (1957). For further discussion of the disconnection and synthesis, see Corey et al.83 Reprinted with permission from Corey, E. J. The Logic of Chemical Synthesis: Multistep Synthesis of Complex Carbogenic Molecules. In Nobel Lectures in Chemistry 1981–1990; Malmstrom, B. G., Ed.; World Scientific: Singapore, 1993; pp 686–708. © The Nobel Foundation 1990.
The general solution for synthesis design by computers proved to be an extremely complex computational problem. Usually, we cannot stop after a single TM disconnection to first-level reactants; these should be further disconnected until available reactants are obtained, forming the so-called synthesis tree with first-level precursors, second-level precursors, and so on. This reveals the first problem, namely the explosion in the number of possible routes to be analyzed. Since a synthesis of medium complexity usually involves 10 levels, the number of precursors increases to 10^10. Thus, the synthesis tree must be pruned.77 In an excellent review on computer-assisted synthesis design (CASD), Todd compares machine synthesis design to programming a game of chess.84 Both problems are ruled by relatively simple laws. The best move in chess and in synthesis design needs to be evaluated by a computer using a complex scoring function, whereas a trained human, by intuition and experience, can sometimes find optimal solutions at first glance. A tree of possible moves generated for both problems provides various successful paths; at the same time, the early choices strongly determine further paths. In both cases, the explosion in the number of possible routes makes it necessary to prune the analyzed answers, usually by rule of thumb. Moreover, in both games, no correlation exists between the early scoring and the final result. A gambit is a chess opening strategy in which we sacrifice a chess piece in the hope of further advantages. Similarly, what may look like a poor branching in a synthesis tree at its early stages may provide excellent final results. Accordingly, serendipitous synthesis resembles unconventional winning chess strategies. Moreover, we are still not sure how far we are from the solution of both problems. Although the match between the IBM Deep Blue supercomputer and Kasparov was officially won by the computer, it was the last game that decided the whole competition, and IBM has so far refused to repeat the event with an external, non-IBM organizer. Similarly, we are not quite sure what we have achieved in synthesis design.
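The estimate of about ten billion precursors quoted above can be reproduced with a short calculation, and the effect of pruning can be sketched in the same way. The branching factor of 10 precursors per node is an assumption made here to match that estimate, and the fixed-width pruning rule (`beam_width`) is a hypothetical stand-in for the rules of thumb mentioned in the text, not the strategy of any particular CASD program.

```python
def precursors_at_level(branching: int, level: int) -> int:
    """Number of precursors at a given level of an unpruned synthesis tree."""
    return branching ** level


def evaluated_with_pruning(branching: int, depth: int, beam_width: int) -> int:
    """Count the precursors that must be scored if, after each level,
    only the best `beam_width` candidates are kept for further disconnection."""
    frontier = 1          # start from the single target molecule
    evaluated = 0
    for _ in range(depth):
        generated = frontier * branching
        evaluated += generated
        frontier = min(generated, beam_width)
    return evaluated


# Assumed values: 10 precursors per node, 10 levels, keep the 100 best per level.
print(precursors_at_level(10, 10))          # 10000000000
print(evaluated_with_pruning(10, 10, 100))  # 8110
```

The price of such pruning is exactly the one raised by the chess analogy: a branch that scores poorly at an early level, like a gambit, is discarded before its later advantages can be seen.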
Sir Derek Barton, an outstanding organic chemist, commented on CASD: "I don't think that it has value for people in the academic world, because it just limits them to use known facts, known reagents, known reactions, and known properties."77 On the other hand, we can cite the Nobel Lecture by Corey: "The field of computer assisted synthetic analysis is fascinating in its own right, and surely one of the most interesting problems in the area of machine intelligence. Because of the enormous memory and speed of modern machines and the probability of continuing advances, it seems clear that computers can play an important role in synthetic design."82
Generally, a CASD program should include a search for synthons, retrons, and transforms, the generation of precursors, and an evaluation of the validity of the route found. LHASA, which appeared in the late 1960s, was the first attempt at CASD programming. This program, developed by the Corey group, is still in use. It is also still a scientific project under research and development.85 In LHASA, a special chemical language, CHeMistry TRaNslator (CHMTRN), was constructed to record and search for disconnections. For a more detailed description of CHMTRN, see Ott.91 The language syntax resembles that of English so that it can be easily learned and controlled by chemists. This was meant to make communication between the chemist and the computer user-friendly; we ought to remember that graphical input by mouse did not exist at that time. LHASA was capable of performing simple functional group interconversions (FGIs), functional group additions (FGAs), and functional group removals (synthon modifications; cf. Section 4.14.4.5.3.2), which are important for enabling routes that would otherwise be invalid. Two-group disconnections identify, for example, an aldol condensation.84 Each transform is performed by an independent portion of code, which allows a chemist to modify existing transforms or define new ones.
Corey's scheme, developed in the late 1960s, was received without enthusiasm or conviction. The results were published in Science86 and three papers appeared in the Journal of the American Chemical Society. "But shortly after the papers were published, Corey was told that the journal would not accept any more papers of that ilk. The work was so radical that it bothered people, Corey believes. He recalls being chided (...), that such a program would render synthetic chemists useless."87 The current version of LHASA is supplied with more than 2000 transforms for diverse synthesis routes. Figure 24 illustrates a sample disconnection sequence reported on the LHASA website. LHASA can search to a depth of 15 precursor levels. See Barone and Chanon77 for a discussion of the coding of some important transforms, for example, those implying Diels–Alder or sigmatropic reactions. The availability of reagents is an important factor in deciding on a synthesis. Thus, the so-called starting material strategy has been integrated into the LHASA program. In this approach, transforms are biased to prefer those giving the best match between the TM and the reagents, as tested by estimating the similarity between these molecules by comparing the numbers of atoms and bonds. Protecting group strategies are another element of the LHASA knowledge base.
Finally, LHASA uses an interactive, chemist-oriented strategy in which synthesis tree pruning is based on human expertise. This allows for wide and deep transform searches. LHASA is an example of a heuristic approach to the transform problem. This means that the core problem of synthesis tree pruning is addressed by empirical rules. Several other synthesis design projects, such as SECS, CASP, and PASCOP, resemble the approach adopted by LHASA.77 SYNCHEM is an example of a heuristic program operated in noninteractive mode. Theoretical approaches are another, nonheuristic, alternative for transform analysis. In such approaches, the CASD process is controlled not by an empirical synthetic knowledge base formed of preferred transforms and their counterpart reactions but by theoretically simulated or predicted reaction paths.
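The starting material strategy mentioned above compares the TM with candidate reagents by their numbers of atoms and bonds. The following is a minimal sketch of one plausible way to do this, assuming a Tanimoto-style ratio over atom and bond counts; the function `count_similarity`, the feature keys, and the heavy-atom counts are illustrative assumptions and are not taken from LHASA itself.

```python
from collections import Counter


def count_similarity(a: Counter, b: Counter) -> float:
    """Tanimoto-style similarity on multisets of atom and bond counts:
    shared counts divided by the counts present in either molecule."""
    shared = sum((a & b).values())   # element-wise minimum
    total = sum((a | b).values())    # element-wise maximum
    return shared / total if total else 1.0


# Heavy atoms and bonds only; hydrogens are ignored in this toy description.
target = Counter({"C": 5, "O": 1, "C-C": 4, "C=O": 1})       # pentan-2-one
candidates = {
    "acetone": Counter({"C": 3, "O": 1, "C-C": 2, "C=O": 1}),
    "ethanol": Counter({"C": 2, "O": 1, "C-C": 1, "C-O": 1}),
}

# Rank candidate starting materials by similarity to the target molecule.
ranked = sorted(
    ((name, count_similarity(target, feats)) for name, feats in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With these assumed counts, acetone ranks above ethanol as a starting material for pentan-2-one, which agrees with the disconnection shown in Figure 21. Counting atoms and bonds ignores connectivity, so such a score can only serve as a coarse first filter.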
Figure 24 Sample LHASA sequence display. Adapted from LHASA. http://lhasa.harvard.edu. © President and Fellows of Harvard College.
The mathematical formalism of the DU bond–electron matrix reaction notation (discussed in Section 4.14.3.4) allows a direct quantitative description of the extent of chemical change brought about by any chemical reaction. The R matrices denote the chemical distance between reagents and products and also give a measure of the likelihood that the reaction will proceed. The lower the distance denoted by R, the more probable the reaction. In the language of chemistry, this means that the preferred reactions are those proceeding with the smallest interchange of valence electrons. This makes it possible to predict a reaction path and, hence, also a transform path. A practical application of the DU method is Interactive Generation of Organic Reactions (IGOR), developed in Ugi's group. Thus, IGOR is not only chemical synthesis software but also a chemical reaction generator (cf. Section 4.14.4.6).
Elaboration of Reactions for Organic Synthesis (EROS), a program developed by Gasteiger's group,88 supplemented the above-mentioned DU method with synthesis tree pruning heuristics based on calculated physical and chemical parameters that further describe chemical reactivity, which is treated as a multidimensional space where each coordinate represents a certain effect. Individual parameters were derived from the heat of reaction, bond dissociation energy, partial atomic charges, resonance, polarizability, hyperconjugation, and the frontier molecular orbital approach. In the 1987 publication, in an attempt to classify their program, the authors ranked it among expert systems based on "distilled knowledge on organic chemistry in the form of our quantitative models (...) EROS calculates its results based on these models, in contrast to the organic chemist who depends on a sort of mental qualitative pattern recognition (...). Despite that, EROS can provide answers to questions that are not always straightforward even for an expert in the field."88 Similarly, a theoretical approach to transform/reaction analysis is implemented in the SYNGEN program by Hendrickson, who developed his own system based on a symbolic representation of bond formation.50,84
Workbench for the Organization of Data for Chemical Applications (WODCA) was designed as a working environment for synthetic organic chemistry. This environment includes the EROS reaction generator supplemented by a starting materials identifier, which enables us to find available reagents. WODCA can be installed with a suitable database of commercially available chemicals, for example, the Janssen catalog. An in-depth description of the program can be found in Gasteiger and Engel24 and Pförtner and Sitzmann.89 An excellent WODCA tutorial is available as online material from WODCA Computer-Assisted Organic Synthesis.90 At the same site, a test program license can be obtained from the authors. An extensive list of available programs and web references for them is given in Gasteiger and Engel.24 Online access to retrosynthetic programs is available at the Organic Chemistry Resources Worldwide portal.
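The R-matrix measure of chemical distance described earlier in this section can be made concrete with a minimal sketch. It assumes the usual bond–electron (BE) matrix convention, in which off-diagonal entries hold formal bond orders and diagonal entries hold free valence electrons, with R = E − B for the BE matrices B (reagents) and E (products). Taking the sum of the absolute R entries as the chemical distance is an assumption of this sketch, and the reaction H2 + Cl2 → 2 HCl is chosen here only as a small illustration.

```python
def reaction_matrix(b, e):
    """R = E - B: the redistribution of valence electrons in the reaction."""
    return [[e_ij - b_ij for b_ij, e_ij in zip(b_row, e_row)]
            for b_row, e_row in zip(b, e)]


def chemical_distance(r):
    """Sum of absolute R entries (assumed distance measure for this sketch)."""
    return sum(abs(x) for row in r for x in row)


# Atom order: H1, H2, Cl3, Cl4. Each Cl carries 6 free valence electrons.
B = [[0, 1, 0, 0],   # H1-H2 bond
     [1, 0, 0, 0],
     [0, 0, 6, 1],   # Cl3-Cl4 bond
     [0, 0, 1, 6]]
E = [[0, 0, 1, 0],   # H1-Cl3 bond
     [0, 0, 0, 1],   # H2-Cl4 bond
     [1, 0, 6, 0],
     [0, 1, 0, 6]]

R = reaction_matrix(B, E)
print(R)
print(chemical_distance(R))   # 8
print(sum(map(sum, R)))       # 0: valence electrons are conserved
```

Two bonds are broken and two are made, each counted twice in the symmetric matrix, which gives a distance of 8. Competing product ensembles could be ranked by this number, with lower values preferred, in line with the principle of the smallest interchange of valence electrons.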
4.14.4.6
Reaction Prediction
Predicting reaction paths is a problem related to synthesis design. RAIN (Reactions and Intermediates Networks) was an interesting program, capable of predicting reaction schemes, that uses the DU formal model (see Section 4.14.3.4). See Ott91 for further discussion. CAMEO (Computer-Assisted Mechanistic Evaluation of Organic Reactions) is an expert system aimed at predicting reactions for given reagents and reaction conditions. This program, developed by the Jorgensen group at Yale, is based on the calculation of molecular parameters controlling compound reactivity, in particular, the pKa value regulating acid–base behavior, the identification of nucleophiles and electrophiles, and the measurement of leaving group capability. The program calculates bond angles and lengths and, in particular, performs 3D structure minimizations, which allow stereoselectivity predictions. Extensive experimental knowledge is included, which helps in making reasonable predictions. The program can also suggest side reactions, and the discovery of novel reactions is a potential application of the program. For further discussion, see Gasteiger and Engel,24 Todd,84 Ott,91 and CAMEO.92 EROS, also implemented in WODCA, is another program capable of reaction prediction developed in the Gasteiger research group.88 In EROS, chemical reactivity is a multiparameter event described by the reactivity space. A variety of physicochemical parameters are calculated to determine the events in such a space. Moreover, for better predictability, EROS also includes heuristic rules, among which those relating to reaction conditions are the most important. It is worth mentioning that computer-assisted reaction prediction has actually inspired the discovery of novel reactions.84,93
4.14.4.7
Computer-Assisted Structure Elucidation
Chemical reactions used in synthetic chemistry provide chemical compounds. Although these compounds usually have the expected structures, the structures should be proved by analytical procedures, which ties together the domains of analytical and organic chemistry. Spectroscopic methods (MS, IR, and NMR) for the identification of compounds are routine in modern chemistry. As artificial intelligence methods have been applied to these problems from relatively early on, they are usually regarded as the domain of chemometrics (cf. Section 4.14.6 for the discussion of chemoinformatics vs chemometrics). In fact, work in this area cannot be done without computers at any stage of the procedure, from the registration of spectra to data processing and interpretation. The main problems within this area can be grouped into two basic but related classes. The first is the simulation of spectra for virtual molecules, and the second is the identification of compounds based on their spectra. Besides formal methods based on theoretical calculations, a variety of approaches based on spectroscopic data have been developed by chemometrics. An elegant method for the conversion of IR spectra into 3D molecular structure has been developed by Gasteiger.94,95 In this approach, a radial distribution function (RDF) or 3D MoRSE code represents the molecular structure. A counterpropagation neural network can process such a code to efficiently simulate (in other words, predict) the compound's IR spectrum. The reader can see, for example, Zupan and Gasteiger95 for a number of neural network applications within the field of spectroscopy. Further discussion of these problems is beyond the scope of this review, and the reader is referred to Adams,96 who discusses chemometrical problems in spectroscopy. A broader discussion involving the interrelation between organic and analytical chemistry, as well as the programs available, can be found in Steinbeck.97
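The idea of coding a 3D structure as a radial distribution function can be sketched as follows. The Gaussian-smoothed form used here, g(r) = f Σ A_i A_j exp(−B (r − r_ij)^2) summed over all atom pairs, is the form commonly given for RDF codes; the parameter values, the unit atomic weights A_i, and the three-atom toy geometry are assumptions made purely for illustration and do not represent a real molecule or the published parameterization.

```python
import math
from itertools import combinations


def rdf_code(coords, weights, r_max=4.0, step=0.1, smoothing=100.0, scale=1.0):
    """Fixed-length RDF code: for each grid distance r, sum Gaussian
    contributions of all atom pairs, weighted by the atomic properties."""
    pairs = []
    for i, j in combinations(range(len(coords)), 2):
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(coords[i], coords[j])))
        pairs.append((weights[i] * weights[j], d))
    n_points = int(round(r_max / step)) + 1
    return [
        scale * sum(w * math.exp(-smoothing * (k * step - d) ** 2) for w, d in pairs)
        for k in range(n_points)
    ]


# Toy geometry (arbitrary units): one central atom and two atoms at distance 1.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
weights = [1.0, 1.0, 1.0]   # unit atomic property for every atom

g = rdf_code(coords, weights)
peak = max(range(len(g)), key=g.__getitem__)
print(len(g), peak, round(g[peak], 3))
```

The code has the same length for any molecule, which is what makes it suitable as input to a neural network. In this toy case the highest value falls on the grid point at distance 1, where the two equal interatomic distances add up.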
4.14.4.8
Database Mining for Computer-Assisted Knowledge Discovery
First of all, a database is a routine tool that gives access to chemical documentation and enables the efficient identification of chemical molecules or their properties. However, a database can also work like a teacher who helps us find individual pieces of information, making it possible to extend our chemical knowledge by falsifying, modifying, or improving existing models. Eventually, an extensive investigation of the database by nontrivial data extraction methods can reveal some general rules that give us an insight into the architecture of chemistry. In this application, the database itself is an object for the discovery of knowledge.98,99 The contemporary chemical database is a user-friendly system, equipped with a molecular editor if needed, that allows chemists to perform efficient searches. It was not always that way. Consider the example of organic chemistry, which constructs novel molecules and relies on the availability of data to decide whether a compound has already been described. Compound identification can be restricted to a simple comparison of properties if the compound has been obtained and registered previously. BH is a comprehensive reference to organic compounds that has been absolutely essential for any organic chemistry laboratory. It arranges registered compounds according to their chemical structure types. The so-called functional derivatives (defined according to internal rules) are placed nearby in a systematic manner. To find any compound in BH, we should first locate the compound class in the proper volume, the correct series, and the respective pages. The BH content guide is shown in Figure 25. In practice, each chemist had to be able to locate any compound within the book just from its structure type. Of course, there are some tricks that allow us to take shortcuts; for example, some series include indexes: Sachregister, a chemical compound index, and Formelregister, a formula index. These allow us to find the location of a compound independently of its chemical structure taxonomy. Similarly, indexes have always been the most important part of Chemical Abstracts. Searches can be performed by screening chemical names or molecular formulae. As mentioned in Section 4.14.4.1, even today Beilstein does not recommend searches by chemical names. On the other hand, a single molecular formula entry can register many compounds. In any case, the searches demand training and expertise in the field, which take much time to acquire. This multivolume handbook is divided into so-called series. BH contains the Basic Work and five Supplementary Series. The fourth edition comprises 503 volumes, which makes 440 814 pages.16 Compared to today's searching tools, the BH search itself was a tedious and rather time-consuming procedure. The Beilstein Institute closed production of its printed version in 1998.
Figure 25 Beilstein chemical compounds taxonomic system in the Vogel Practical Organic Chemistry (the Polish edition is shown). Adapted from Vogel, A. I.; Furniss, B. S.; Hannaford, A. J.; Smith, P. W. G.; Tatchell, A. R. Vogel's Textbook of Practical Organic Chemistry, 5th ed.; Longman Scientific and Technical: Essex, 1989; p 1406.
Like the BH, which revolutionized organic research by providing an effective reference system to organic compounds for more than 100 years, a computer-searchable online database improves our capacity both to perform research and to generate new ideas. For the first time, chemists have access to chemical data that can be processed all together in real time. This fact not only changes research performance but can also significantly influence our understanding of chemistry. Can any important discovery be made on the basis of database screening? The results reported by Fialkowski et al.101 suggest a positive answer. The authors extracted the molecular masses of all the molecules described in Beilstein and analyzed the statistical distribution of this property in order to study what they designate as the architecture of organic chemistry. Among several interesting conclusions, the desirable mass distribution of the so-called drug-likeness concept (cf. Section 4.14.4.11.7) has been falsified. The study revealed that the range in which drugs are believed to appear preferentially coincides with that observed for all chemical molecules synthesized. In a similar approach, Rucker and Meringer ask: how many organic compounds are graph-theoretically nonplanar? In this work, they investigated the whole compounds space registered in the Beilstein database by analyzing the difference between the theoretical properties of graphs and those represented by real chemical compounds.102 Applications of data mining methods for knowledge discovery in reaction databases have also been reported in other references,103 and the application of the self-organizing neural network has been suggested for knowledge discovery in reaction databases.104 For further reading, see also Ester and Sander105 and Bremer et al.106 Only recently have we attempted to study the structure of the architecture of chemistry itself. Previously, chemists mined databases for standard data.
There is, however, only a small difference between a standard search and sophisticated knowledge discovery. This is illustrated by the example shown in Figure 26. A simple query can answer a complex problem in just a few seconds.
Figure 26 panels: (a) query table for Beilstein abstracts (2006/01), with hit sets Q04 (0 reactions), Q05 (0 reactions), and Q06 (2339 reactions); (b) hit 1 (RX.ID = 133880), hit 2 (RX.ID = 154806), and hit 16 (RX.ID = 261796); (c) reaction conditions for the second hit.
Figure 26 Searching the Beilstein reaction database for the reduction conditions that allow us to convert function NO2 into NH2 while preserving co-occurring CN gives 2339 hits (a). It can be easily found that a lot of these hits (b) explain the query question, for example, hits 1 and 2. It is not always the case because other reactions are also included in the query formulated. For example, hit 16 records the substitution reaction that replaces Cl with NH2, while both NO2 and CN occurring in the reagent can also be found in the product 16. Database searching, if compared to a standard literature search, significantly improves chemist performance. Even if 2339 records are additionally analyzed by an expert, in traditional literature the problem needs extensive research review. Further database inspection reveals the reaction condition for the second hit (c).
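The difference between a broad substructure-style query and the chemist's actual question in Figure 26 can be sketched with a toy filter. Everything in this example is hypothetical: the three records, their identifiers, and the representation of a reaction as functional-group counts are invented for illustration and are not Beilstein data or the Beilstein query language.

```python
from collections import Counter

# Hypothetical reaction records: functional-group counts in reagent and product.
reactions = [
    # selective reduction: NO2 -> NH2, CN untouched
    {"id": "rx-A", "reagent": Counter(NO2=1, CN=1), "product": Counter(NH2=1, CN=1)},
    # substitution: Cl -> NH2, NO2 and CN both preserved (like hit 16)
    {"id": "rx-B", "reagent": Counter(NO2=1, CN=1, Cl=1),
     "product": Counter(NO2=1, CN=1, NH2=1)},
    # unselective reduction: NO2 and CN both converted to amines
    {"id": "rx-C", "reagent": Counter(NO2=1, CN=1), "product": Counter(NH2=2)},
]


def broad_query(rx):
    """Reagent contains NO2 and CN; product contains NH2 and CN."""
    r, p = rx["reagent"], rx["product"]
    return r["NO2"] > 0 and r["CN"] > 0 and p["NH2"] > 0 and p["CN"] > 0


def selective_reduction(rx):
    """Every NH2 gained corresponds to an NO2 lost, and CN is unchanged."""
    r, p = rx["reagent"], rx["product"]
    lost_no2 = r["NO2"] - p["NO2"]
    gained_nh2 = p["NH2"] - r["NH2"]
    return lost_no2 > 0 and gained_nh2 == lost_no2 and p["CN"] == r["CN"] > 0


print([rx["id"] for rx in reactions if broad_query(rx)])          # ['rx-A', 'rx-B']
print([rx["id"] for rx in reactions if selective_reduction(rx)])  # ['rx-A']
```

The broad query returns the substitution record as well, just as hit 16 appears among the 2339 hits, whereas the stricter condition keeps only the selective reduction. In a real database the second step is what the expert does when inspecting the hit list.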
4.14.4.9 Chemometrics: Translating Mathematics to Chemistry and Chemistry to Mathematics
In this handbook, chemometrics is discussed thoroughly; therefore, here we will make only a few remarks, shifting the point of view to those aspects that concern chemometrics and chemoinformatics at the same time. Chemometrics originated from analytical chemistry as one of the first organized fields of computer application in chemistry. With the greater potential of informatics, in silico chemistry has significantly increased its scope of interest and the available fields of investigation. This gave rise to chemoinformatics. Chemometrics is deeply embedded in mathematics, including statistics, programming, and so on.107 After all, the term chemometrical analysis is often used as a synonym for the application of advanced statistical methods to the extraction of chemical knowledge from chemical data (cf. Pierce et al.,108 Myshkin and Wang,109 and Pytela et al.110); and in the narrow sense, chemometrical analysis (classification, identification) sometimes simply stands for principal component analysis (PCA) or partial least squares (PLS) analysis.111 More generally, mathematics is a 'language of science' and chemometrics translates this language into chemistry. In contrast, chemoinformatics is a branch of chemistry that also depends on in silico mathematics. Clearly, chemoinformatics is a broader term than chemometrics. However, chemometrics is not only a part of chemoinformatics but can also be interpreted as an autonomous science that provides a 'formal language' for chemoinformatics.
4.14.4.10
Computer-Assisted Molecular Design
Which molecular objects are interesting for chemists? To answer this question, Fialkowski et al. analyzed selected properties of all molecules that have been synthesized, that is, the real chemical compounds space (FCS) registered in Beilstein.101 Such an FCS can be 'wired' further by the chemical reactions connecting compounds. This approach allowed the authors to discover several statistical laws describing the rules for molecular production and interconversion, revealing a wiring pattern that connects molecules through different reactions. The topology of such a reaction network can indicate the structure or 'architecture of chemistry' as well as its evolution. "The average connectivity between molecules (...) initially increased, reached a maximum by about [year] 1885 and then steadily decreased to the value of approximately 2 in 2004." Early organic chemistry attempted to optimize the synthetic procedure itself by wiring existing molecules; subsequently, chemical compounds space has been explored by the synthesis of novel molecular objects. On the one hand, chemistry constructs a variety of materials of practical importance, such as drugs, preservatives, and flavors. On the other hand, chemists can form these materials by arranging their atoms in a variety of combinations, that is, by synthesizing different compounds that possess the individual properties desired. However, molecular configurations are limited by the synthetic capability of current chemical technologies and individual laboratories. Even today, chemical synthesis cannot easily construct everything we would like to have, and compound availability is an important factor in determining potential investigations and applications. Are there any other regularities ruling the formation of molecular structures?
Quite surprisingly, it has been suggested that the structure of drugs is based on preferential patterns, as discussed in detail in Section 4.14.4.11.7.112,113 Chemists are fascinated by the beauty of molecular objects, which contemporary molecular graphics especially emphasizes. Figure 27 illustrates, for example, a nanotechnology-inspired concept of the molecular car.114 This molecule took 8 years to develop, after which “the world’s first single-molecule car was launched.”115 Although a molecular car can have a charm similar to that of a real-world vehicle, it is not its appearance but its molecular properties that give the molecule its meaning. “The nanocar is a key step toward molecular manufacturing. It consists of an oligo(phenylene ethynylene) chassis and axle covalently mounted to four fullerene wheels. The researchers used the tip of a scanning tunneling microscope to propel the car on a gold surface and showed that it actually rolls forward on its wheels instead of sliding around randomly.”115 What has been discussed above can be summed up by the truth already realized by Hammond that “the most fundamental and lasting objective of synthesis is not a production of new compounds but the production of properties.”116 The exploration of chemical compound space by the Fialkowski network clearly indicates that we are more interested in compounds of commercial applicability, that is, those having useful properties: important drugs, industrial chemicals, and so on.101 Network connectivity increases significantly for these compounds. It is the rather exotic property of ‘forward rolling’, which determines the nanocar’s mobility, that focuses our attention on the story of the molecular car. It is a paradox, however, that we still cannot do without a good measure of serendipity in property design and production. For an illustrative review of the role of serendipity in drug research, see Kubinyi.117 Property-oriented synthesis is a clear target of today’s chemistry.
Figure 27 The nanocar. Reprinted with permission from Shirai, Y.; Osgood, A. J.; Zhao, Y.; Kelly, K. F.; Tour, J. M. Directional Control in Thermally Driven Single-Molecule Nanocars. Nano Lett. 2005, 5, 2330–2334. Copyright (2005) American Chemical Society.
4.14.4.11
Property-Oriented Synthesis
There are several approaches to property-oriented synthesis. The first is molecular design, which works on the assumption that any molecular feature is a function of molecular construction that we can reveal and then use rationally to design molecules with improved properties. This is illustrated schematically in Figure 28 as a translation of the molecular structure into the compound property space or, in other words, as mapping structure to property. An ideal molecular design operation would be a fully rational process. However, the complexity of biological systems interacting with molecular effectors generally puts them beyond the reach of high-level theoretical methods, for example, quantum methods. In its most general form, molecular design demands the processing of extraordinarily large data sets and correspondingly large computational resources. Thus, if there is an answer, it would be offered by chemometrics. Indeed, it is no coincidence that the scope defined at the early stage of chemoinformatics focused on molecular design (cf. Section 4.14.2), and these issues still remain a core problem today. The efficiency of molecular design is, however, still controversial, and there are good reasons for that.118,119 Let us focus our attention for a while on this problem.
Figure 28 Molecular design interpreted as structure to property mapping. Hypothetically the other direction is also possible.
4.14.4.11.1
Intuition and serendipity in drug discovery and development
Pharmaceuticals probably have the most important share among the bioactive molecules whose design is attempted. This affects even the term drug design, which is often used as a synonym of molecular design even when we are targeting molecules with applications other than pharmaceutical ones. For a discussion of the formal interrelations between molecular and drug design, see Van de Waterbeemd.120 Designing drugs is, however, a complex issue that still lacks a general solution. Let us start from a real example to illustrate this fact. Sildenafil (Viagra) is a well-known pharmaceutical developed by Pfizer. The compound had been thoroughly designed as a cardiac drug. However, the drug’s current indication is erectile dysfunction therapy, an effect that was discovered serendipitously during clinical testing. This effect could be noticed relatively easily because the health of cardiac patients is generally poor. Sweeteners provide an even better illustration of the problem. Developing an artificial sweetener is even more complicated than developing a pharmaceutical, because a variety of properties must be optimized at the same time. To pass registration and be accepted as a food additive, an alternative sweetener should be perfect. And it must be nothing short of ideal to persuade consumers to change their dietary habits. Moreover, the sweetener not only needs a perfect taste and flavor profile that imitates that of sucrose, but it should also be completely harmless, so that it can be consumed without restrictions in much larger doses than medicines, the intake of which is in some sense risky but is often preferable to death from the disease. In fact, all artificial sweeteners currently in use were discovered serendipitously, and the few new ones are their close analogs.121 For an extensive discussion and a list of serendipitous discoveries in drug design, see Kubinyi.133 In the same context, it is also interesting to analyze the contemporary practice for the development of bioactive molecules.
For a brief review and classification of possible molecular discovery methods, see Lipinski and Hopkins.15 In the classical approach, a series of compounds is synthesized at an early stage to observe the qualitative rules describing how activity changes with modification of the molecular structure: structure–activity relationships (SARs). Usually, SAR studies consist of screening, as potential receptor ligands, compounds that have been designed using more or less intuitive similarity schemes. Moreover, a so-called fragment approach has recently appeared. This method works on the assumption that linking two molecular moieties of lower activity will result in a compound of higher activity.122 An illustrative example is given in Figure 29.123 Linking fragments can be supported by computational techniques similar to those used in docking but can also come down to intuitive synthetic operations. Of course, this does not always work, and linking does not always amplify the activity. From the experimental point of view, the discovery of potential molecular scaffolds (fragments) encompasses a variety of instrumental methods, for example, NMR investigations of the drug–ligand complexes.124 From the computational point of view, the method is a sophisticated iterative molecular design technique if it is supported by appropriate docking simulations. If not, however, it is much more intuitive than a number of other approaches, and its popularity and efficiency are worth mentioning here. Finally, analog-based drug discovery is another example of pragmatism in pharmaceutical R&D.125 In this method, simple variations within the structures of known drugs are made in the search for novel drug candidates. Structural analogy often also implies functional similarity, which “makes the success almost warranted.”126
Figure 29 Construction of a novel herbicide by fragment-based design. Two fragments (IC50 = 0.675 µmol l–1 and IC50 = 3.5 µmol l–1) are linked into a hybrid inhibitor (IC50 = 0.043 µmol l–1). Reproduced from Hanessian, S.; Lu, P. P.; Sanceau, J. Y.; Chemla, P.; Gohda, K.; Fonne-Pfister, R.; Prade, L.; Cowan-Jacob, S. W. An Enzyme-Bound Bisubstrate Hybrid Inhibitor of Adenylosuccinate Synthetase. Angew. Chem. Int. Ed. 1999, 38, 3159–3162.
4.14.4.11.2
Brute force screening by combinatorial approaches
Combinatorial chemistry is another method that, in the search for useful properties, ignores the computational capabilities of molecular design. The method rests on probability rather than design. The statistic 1/10 000 describes the chance of finding a drug in a random pool of molecules. However, Kubinyi indicates that this number refers to the so-called lead structures explored in traditional drug design, in which a small set of virtual molecules is targeted intentionally in the search for useful properties.127 In other words, the compounds were not selected on a fully random basis; some drug preference is predefined in the chemist’s selections. Nature, however, has constructed the molecular systems that control life processes with extremely high efficiency using a completely different strategy. In the course of evolution, a large pool of random compounds has been processed over millions of years to provide efficient bioeffectors. Incidentally, the extent of randomness in natural selection is itself an extremely complex problem, for example, the L-amino acid preference in natural systems. Combinatorial chemistry imitates this strategy by increasing the population of molecular objects investigated. A variety of technologies have been developed to synthesize large compound sets, the so-called combinatorial libraries of potential drugs.128–131 In a traditional synthetic approach, a chemist can produce on average four compounds per month at a cost of about US$30 000, that is, US$7500 per compound. Using a combinatorial approach, a chemist can produce 3300 compounds at a cost of US$40 000, that is, about US$12 per compound.132 In traditional combinatorial methods, the efficiency of synthesis is decisive for the structures of the compounds available. Accordingly, we are constructing a pool of compounds of unknown relation to the property sought. In fact, it is not even clear whether the library can be treated as a random pool uniformly distributed in property space. Kubinyi indicates that the 1/10 000 statistic no longer governs the success probability in such a pool, because we have changed the space of the compounds tested.
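The cost comparison quoted above is simple arithmetic and can be checked with a short sketch. The function and variable names are illustrative assumptions; only the four input figures come from the text.

```python
# Cost figures quoted in the text (US$); all names here are illustrative.

def cost_per_compound(total_cost_usd: float, n_compounds: int) -> float:
    """Average synthesis cost of a single compound."""
    return total_cost_usd / n_compounds

# Traditional synthesis: four compounds for about US$30 000.
traditional = cost_per_compound(30_000, 4)
# Combinatorial synthesis: 3300 compounds for US$40 000.
combinatorial = cost_per_compound(40_000, 3300)

print(f"traditional:   US${traditional:,.0f} per compound")
print(f"combinatorial: US${combinatorial:,.2f} per compound")
print(f"cost ratio:    {traditional / combinatorial:.0f}x")
```

The sketch says nothing about the quality of the compounds obtained; it only reproduces the per-compound cost gap, which is what makes the brute force strategy attractive.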
We now need to investigate a population of about 100 000 molecular objects to find hits that can be developed into lead structures.133 In fact, the significant expansion of the compound pool generated recently by combinatorial methods did not result in an observable increase in new drug approvals.131–136 Lipinski, discussing the efficiency of combinatorial chemistry, insists that we need at least a few more years to judge this method fairly, because the drug development process entails a 10–15-year lag.15 The population of a combinatorial library can be limited by shifting some compounds to virtual reality. This means that a certain pool of compounds can be generated, but not all of them necessarily appear in the real system. This idea is behind dynamic combinatorial chemistry, a novel variant of combinatorial chemistry. In such target-driven approaches, an enzyme added to the synthetic medium should amplify the yield of the products preferred by this particular target.137 From the molecular design point of view, the design step is not central to this approach either; the enzyme-rich environment increases the probability of drug formation by biasing the otherwise random library pool.
4.14.4.11.3
From data to drugs
In contrast, there are good examples of successful drug design,138 and drugs are getting better and better at imitating natural biological effectors. Drug discovery can be defined as screening CS in search of novel drug candidates. This can be divided into searching for novel molecular objects (VCS) or novel properties (FCS). Molecular (or drug) design is a method that attempts to find drug candidates using computational techniques that depend strongly on data handling. Basically, a drug molecule is a man-made moiety, designed or discovered in other ways, that fits its biological counterpart and produces the required action, allowing biological effects to be manipulated. Biological targets are usually macromolecular proteins designated as receptors (for a precise definition of the receptor, see Cohen139). Thus, the drug–receptor interactions define a potential activity. Current molecular design attempts to simulate these interactions in silico. Structure-based design is the term for design methods that are based on a known receptor structure.
4.14.4.11.4
Structure-based design
The two basic methods used to investigate how a ligand structure fills a receptor are docking and de novo design. In docking, the interaction energy is sampled for different ligand–receptor arrangements to find the preferred orientation of these moieties. The reader is referred to Schneider and Fechner140 for an informative discussion of de novo design and to Cohen139 for docking and scoring in virtual screening. Kitchen et al.141 review methods, applications, and problems in drug discovery by means of docking algorithms, while the newest programs available are listed in Kitchen et al.141 and Gasteiger and Engel.142 De novo design differs from docking in that a potential receptor ligand is built up from atoms or molecular fragments. Programs available for such procedures are listed in Gasteiger and Engel.24 Recent successes are discussed in Kitchen et al.141 and Gasteiger and Engel,142 and critical evaluations of the programs are given in Warren et al.143 and Kirkpatrick.144 In silico high-throughput screening (HTS) using docking protocols is an extended version of docking performed for a large number of molecules.141,145 The Intel-United Devices Cancer Research Project is an interesting virtual HTS project based on docking by Internet-distributed computing. In this project, initiated by Graham Richards from Oxford University, a total of 3.5 billion molecules are to be tested as potential anticancer drugs by means of a downloadable screensaver. The large collection of chemical data has been extracted from the ‘catalogs’ of molecules from various organizations.146
4.14.4.11.5
Ligand-based design
When not enough data are available for the target receptor, the receptor is replaced by the set of ligand structures that stimulate it. Thus, ligand-based design is an indirect method for investigating drug–receptor interactions. Receptor mapping, or pharmacophore mapping, is a term that often designates such methods. A pharmacophore in its most general sense is just a model of a receptor, or of a receptor sector, deduced from a series of its ligands. In fact, no information is available on the relation between the real receptor and this subset. For an extensive review of this approach, see Cohen139 and Horvath.147
4.14.4.11.6
Mapping structure to property in QSAR approach
Quantitative structure–activity relationship (QSAR, cf. Chapter 4.05 for a detailed discussion) is a method for mapping CS to property space by modeling functions that relate chemical structure to property. In principle, this should enable us to generate novel compounds with the desired properties efficiently, with the function itself working like a dictionary between the two spaces. In the first step of QSAR, a variety of procedures can be used to generate a mathematical model describing a biological response in a given property space. In the next step, the compound space is screened in the hope of finding virtual molecules with the required property. These molecules can then be targeted by organic synthesis. The whole process should enable optimization of the compounds’ structure. Therefore, QSAR can be defined as “an indirect molecular design by the iterative sampling of the chemical compound space to optimize a certain property, and thus indirectly design the molecular structure having this property.”148 In the majority of described applications, QSAR realizes a strategy from molecules to property; however, an inverse strategy from property to molecules can also be hypothesized.149
4.14.4.11.6(i)
Modeling Hansch QSAR
Traditional Hansch QSAR is based on modeling the relationship between hydrophobic parameters (nowadays a variety of other parameters can also be used150) and biological activity (cf. Chapter 4.05) using regression methods. Nominally, this can be achieved with a simple mathematical approach that does not need in silico computation. Indeed, Topliss described schemes that allow us to choose from several synthetic targets on the basis of simplified QSAR, even without modeling precise mathematical equations.151 Quantitative structure–property relationship (QSPR) is a variant of QSAR in which we use parameters representing other activity types or properties. Formally, this also refers to some Hansch models in which hydrophobicity is given by a property, for example, one measured by chromatographic methods or solubility. Although traditional QSAR is a relatively simple approach, it has allowed for the optimization of several important generic series.152 However, the method can also be extended to compounds of more diverse structures.151,153 Traditional QSAR can be broadened to multidimensional problems by combining several parameters as independent variables. In this case, a multidimensional method, for example, principal component regression (PCR) (multiple regression with forward or backward variable elimination), is used. Dragon is a program enabling the efficient calculation of a number of molecular descriptors.154
4.14.4.11.6(ii)
Modeling multidimensional QSARs
Theoretically, molecular structure, if supplemented by a given molecular environment, codes all molecular properties. In multidimensional QSAR (m-QSAR), we attempt to model a function relating 3D structures generated by molecular modeling methods to various properties, in particular, biological activities. Algorithms and theoretical problems of 3D to 6D QSAR modeling are discussed in Chapter 4.05 of this handbook and will not be discussed further in this section. Rather, we focus on a critical discussion of the current state of the art and possible improvements in the robustness of the models generated.148 m-QSAR is still a challenge in molecular design, if not an illusion.155 In fact, we will show below that uncertainty is inherent in the nature of this method. Generally, in the described applications, QSAR is limited to a series of molecules for which the activity has been measured a priori. Therefore, formally we can insist that QSAR is a device basically working in the FCS. This makes current m-QSAR more a posterior data analysis than a strict method for activity prediction in the sense of novel compound design. Is it possible to enhance the molecular design ability of QSAR? Can we improve it by novel protocols, coding systems, and data handling methods? The origins of uncertainty in QSAR can be divided into several groups, as follows.
4.14.4.11.6(iii)
Data
Unlike in the traditional Hansch method, in m-QSAR we describe a single molecular object by an enormously large data vector. The number of independent variables can reach 10 000–20 000 molecular field points in comparative molecular field analysis (CoMFA). This creates a need for efficient processing of massive data, which limits the performance of the analysis. PLS is the method routinely used in CoMFA,151 and technical improvements in data elimination such as GRS-CoMFA,156 genetic algorithms,157 and neural networks158 address the uncertainty issue in m-QSAR.
4.14.4.11.6(iv)
Molecular similarity
Molecules are full of similarities, which results in intercorrelation between different molecular moieties. This can occur in different locations of molecular areas or molecular fields. It is not quite clear whether we can extract from the m-QSAR data those variables that are at the origin of the activity rather than those that intercorrelate but do not map a real pharmacophore. In fact, Doweyko155 observed that in 3D-QSAR, particularly in CoMFA, different models provide similar statistical performance. A similar effect can also be seen in classical QSAR, where different parameters can be used for model construction. This, of course, raises a problem of model interpretation. The reader is referred to Wermuth159 for further discussion of this problem.
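The intercorrelation problem described for classical QSAR can be illustrated with a minimal sketch. The data below are hypothetical (invented for illustration, not taken from any real series), and the descriptor names are assumptions. Two intercorrelated descriptors each give a one-variable linear model of nearly the same quality, so the statistics alone cannot tell which descriptor is at the origin of the activity.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical values for six analogs (illustrative only).
descriptor_a = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]        # e.g. a hydrophobic parameter
descriptor_b = [10.2, 14.9, 20.3, 24.8, 30.1, 35.2]  # e.g. a bulk parameter
activity = [4.1, 4.6, 5.2, 5.4, 6.1, 6.4]

# For a one-variable least-squares line, r^2 of the fit equals the
# squared Pearson correlation between descriptor and activity.
r2_a = pearson_r(descriptor_a, activity) ** 2
r2_b = pearson_r(descriptor_b, activity) ** 2
r_ab = pearson_r(descriptor_a, descriptor_b)

print(f"r2 (activity vs descriptor_a): {r2_a:.3f}")
print(f"r2 (activity vs descriptor_b): {r2_b:.3f}")
print(f"r  (descriptor_a vs descriptor_b): {r_ab:.4f}")
```

Both models fit about equally well because the two descriptors carry almost the same information, which is the interpretation problem the text refers to.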
4.14.4.11.6(v)
Molecular superimposition
In a simplified interpretation, molecules are geometrical objects formed of atoms. Comparing two molecules requires their superimposition, that is, indicating the atom pairs to be matched within the two molecules. This operation is, however, not trivial. Generally, there are numerous overlay modes for two multipoint systems, and the way we superimpose molecules critically influences the final m-QSAR results. A number of examples proving this fact can be found in the literature, and illustrative molecular plots can be found in Doweyko.155 There are three basic solutions to the molecular superimposition uncertainty. In the first approach, we iteratively modify the relative orientation of the molecules to improve the final statistical performance of the model. In the second approach, this is further modified by making the molecular objects flexible. The latter refers not only to the natural flexibility of molecular systems, for example, that determined by conformational effects, but also to an abstract ability for a better relative fit. See Korhonen160 and Lemmen and Lengauer161 for novel improvements in flexible superimposition. The CoMPASS162 and comparative molecular surface analysis (CoMSA)163 methods use, for example, neural networks to enhance molecular flexibility within this step. In the third approach, we ignore molecular superimposition completely by calculating molecular descriptors that are invariant to superimposition, such as distance geometry, autocorrelation vectors, 3D MoRSE, or RDF codes. Default overlay modes can also be used in some methods, for example, alignment along the molecular inertial axes in CoMSIA, Receptor-like Neuron Network, GRIND, or CoMMA. For further discussion and representative references, see Polanski.148 On the one hand, superimposition is a technical problem that allows for the similarity evaluation between two nonidentical molecules. On the other hand, superimposition outlines a molecular orientation that maps a hypothetical internal pharmacophore or, hopefully, an external receptor (a developed pharmacophore negative) surrounding the investigated molecules. This means that passing over superimposition can deprive us of control over significant effects. Thus, in the majority of applications, the user defines the rules for molecular superimposition. An additional problem that appears in this context, namely, how molecular superimposition relates to real molecular recognition phenomena, cannot be answered by the traditional receptor-independent QSAR methods. However, some novel approaches address such issues, as discussed in Section 4.14.4.11.6(vii).
4.14.4.11.6(vi)
Conformational profile
3D-QSAR focuses on nondynamic molecular representations even though in reality molecular configurations are dynamic and their shape can usually fluctuate significantly depending on the molecular environment. The drug–receptor interactions can further cause conformational changes in both interacting moieties that allow them to adapt their shapes, for example, the induced fit effect.139 Although conformational effects were taken into account by some 3D-QSAR methods, for example, by testing several possible conformations, in particular in CoMPASS,162 it is 4D-QSAR that systematically addressed this issue157 and 5D-QSAR that specifically addressed the problem of induced fit.164 Molecular dynamics is applied in 4D-QSAR to map the space available for molecular shapes. In this method, an enormous number of conformers explore different spatial regions, and the likelihood of the formation of common 3D patterns within a series of molecules is sought in order to increase the chances of mapping a proper pharmacophore. The detailed procedure of Hopfinger 4D-QSAR is discussed in Chapter 4.05. For the alternative SOM-4D-QSAR method, see Polanski and Bak.165
4.14.4.11.6(vii)
Molecular recognition
It is usually believed that QSAR works in a receptor-independent mode. This is not fully true even for the traditional method, because at least one variable, that is, the biological activity data, clearly depends on the receptor. There is, however, a marked imbalance between the receptor-independent data describing a series of ligands and the single number describing the ligand–receptor interactions. In traditional 3D- or 4D-QSAR, a tacit assumption is that the relative orientation of the molecules defined in the superimposition step complies with that observed during the real receptor–ligand interactions. This often does not appear to be true. Therefore, taking these interactions into consideration can improve m-QSAR performance. Receptor-dependent 4D-QSAR,166 5D-QSAR,164 6D-QSAR,167 and COMBINE168 are examples of methods that can be classified as receptor-dependent m-QSAR. In these methods, molecular modeling or experimental data describing a complex receptor–ligand system are used for structure to property mapping.
4.14.4.11.6(viii)
QSAR predictions into virtual chemical space
Prediction into VCS, which should be a major QSAR target, is, however, an operation as risky as roulette. Even a minute modification of chemical structure can result in substantial activity changes, which means that a virtual molecule may turn out to be not only an outlier of the QSAR equation but also completely inactive. This effect, known as the similarity paradox, has been addressed only to a minor extent and requires further attention in QSAR investigations. Accordingly, using the formal notation presented in Figure 1, the majority of QSARs reported in the chemical literature can be defined as S (structural property) to P (property) mapping:
\[
\bigwedge_{S_m \in (\mathrm{FCS})_{\mathrm{QSAR}}} \; \bigvee_{P_m \in (\mathrm{FCS})_{\mathrm{QSAR}}} \; S_m \rightarrow P_m
\]
where the index m denotes the mth molecule and the index QSAR denotes the FCS domain subset in which molecules obey the QSAR operator. Usually, in QSAR, we follow an indirect scheme that maps structure to property, that is, we model a function relating structure to property, P = f(S). This allows us to calculate the apparent property values for the structures introduced into the equation. Direct molecular design by QSAR would, however, follow the reverse mapping, from property to structure. Current QSAR schemes only very rarely extend outside FCS:
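\[
\bigwedge_{S_m \in \mathrm{FCS}} \; \bigvee_{P_m \in \mathrm{FCS}} \; \bigvee_{P(S)_m \in \mathrm{CS}} \; S_m^{\mathrm{FCS}} \rightarrow P_m^{\mathrm{FCS}} \rightarrow P(S)_m^{\mathrm{CS}}
\]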
where P(S) denotes structural, chemical, or biological properties; the QSAR domain index is omitted but should also be obeyed; and the FCS and CS indexes emphasize where in chemical space the molecules are located. Thus, what is usually meant by prediction in QSAR is an operation in which, during model generation, we exclude some compounds selected within the FCS, that is, we tacitly assume that we do not know their properties, and then use these compounds to test the predictability of the generated model. Leave-one-out or leave-several-out (LSO) cross-validation (CV) and bootstrapping (cf. Chapter 4.08 for the detailed procedures) are the methods developed for validating predictability in the test set. For a detailed discussion of the technical problems of applying these methods in m-QSAR, the reader is referred to earlier works.148,155,169–172 Stochastic model validation (SMV) is a generalized approach to LSO CV in which all possible distributions of the molecules between the training and test sets are sampled.144,148,167 Figure 30, summarizing the SMV approach, illustrates a large spread of the activity values modeled by CoMFA for the steroid benchmark series.148,170 In current practice, testing for prior predictability, that is, sampling VCS (which can be done by real synthesis of molecules with hopefully advantageous properties in accordance with the QSAR functions), is of marginal interest in this method. The reason is obvious: it is rather naive to expect a reliable property prediction for a single virtual molecule (or even several) from current QSAR technology. Instead, molecular design by m-QSAR is an indirect method in which so-called interaction contour plots are used to find those zones of space in which certain steric or electrostatic interactions are advantageous or disadvantageous.
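The validation schemes described above can be sketched for the simplest possible case, a one-descriptor linear model. This is a minimal illustration under stated assumptions: the data are hypothetical, the helper names are invented here, and q2 = 1 - PRESS/SS and SDEP = sqrt(PRESS/n) are the commonly used definitions, which may differ in detail from the equations given in Chapter 4.05. The last function enumerates every possible test set of a given size, in the spirit of the SMV idea of sampling all training/test distributions.

```python
from itertools import combinations
from math import sqrt

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def press(xs, ys, test_idx):
    """Squared prediction error on test_idx for a model fitted to the rest."""
    train = [i for i in range(len(xs)) if i not in test_idx]
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx)

def loo_q2_sdep(xs, ys):
    """Leave-one-out q2 and SDEP (assumed, commonly used definitions)."""
    n = len(xs)
    total_press = sum(press(xs, ys, (i,)) for i in range(n))
    my = sum(ys) / n
    ss = sum((y - my) ** 2 for y in ys)
    return 1 - total_press / ss, sqrt(total_press / n)

def all_splits_sdep(xs, ys, k):
    """SDEP for every possible test set of size k (SMV-like enumeration)."""
    return [sqrt(press(xs, ys, idx) / k)
            for idx in combinations(range(len(xs)), k)]

# Hypothetical descriptor and activity values (illustrative only).
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [4.1, 4.6, 5.2, 5.4, 6.1, 6.4]

q2, sdep = loo_q2_sdep(xs, ys)
sdeps = all_splits_sdep(xs, ys, k=2)
print(f"LOO q2 = {q2:.3f}, SDEP = {sdep:.3f}")
print(f"{len(sdeps)} leave-two-out splits, SDEP from {min(sdeps):.3f} to {max(sdeps):.3f}")
```

For perfectly linear data the same functions return q2 = 1 and SDEP = 0, the hypothetical limiting case; with noisy data the spread of SDEP over the splits shows how strongly the apparent predictability depends on which molecules happen to be left out.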
Figure 30 The evaluation of QSAR modeling by the stochastic model validation (SMV) approach. The q2 and SDEP parameters measure model quality in the training (q2) and test sets (SDEP). See Chapter 4.05 for the detailed equations defining these parameters. Higher q2 and lower SDEP values indicate better models. Hypothetical model with 100% correlation, for which q2 = 1 and SDEP = 0 (a), modeling with simulated noisy data (b), and CoMFA modeling of the steroid benchmark series (c). Adapted from Polanski, J.; Gieleciak, R.; Bak, A.; Magdziarz, T. J. Chem. Inf. Model. 2006, 46, 2310–2318.
In the context of prior predictability evaluation, it could be very helpful if the pharmaceutical industry, which performs large-scale synthesis of novel compounds, shared its experimental data on drugs. A concept for making such data available in a coded form, that is, in a safe data exchange form that does not disclose lead structures and other sensitive information, was discussed recently.173 This could significantly increase the extent of FCS and consequently improve the methods for model validation. In conclusion, current QSAR maps FCS structure to FCS property. We would like to extend this to VCS for efficient property, and thus structure, predictions in VCS. However, to achieve this we should define a structure class for which the developed QSAR would probably work in VCS. From the mathematical point of view, this means precisely defining the structure domain within CS. This problem has been addressed recently by the concept of the QSAR applicability domain; see Tropsha174 for the discussion and respective references. An interesting approach to reducing the prediction risk is to extend the number of properties included in the analysis. This means that the structural data (S) are supported by a relatively large number of variables extracted from the data available for additional biological activity types (P). Thus, we also gain an insight into the real biological similarity of the molecules. PASS is software realizing such a hybrid QSAR–PAR modeling strategy.175 This approach also indicates the most general QSAR definition:
\[
\bigwedge_{P_m \in (\mathrm{FCS})_{\mathrm{QSAR}}} \; \bigvee_{P_p \in (\mathrm{FCS})_{\mathrm{QSAR}}} \; P_m \rightarrow P_p
\]
where P means structural, physical, chemical, or biological property and the index m refers to the measured X block data and p to the predicted Y block data. In fact, we would also like to extend the Y block data into the whole CS instead of the FCS, as noted.
4.14.4.11.7 Drug likeness and druggability concept
QSAR modeling is highly data dependent and uncertain, as discussed in the previous sections. Current technology still does not provide a method efficient enough to design, in a single step, a single molecule with the predicted properties. Therefore, molecular design should at least improve the success ratio during screening. The search for so-called drug likeness rests on the idea that, statistically, some common features or privileged structures exist in CS that generate drugs. The Lipinski rule of five specifies the molecular weight (below 500), log P (below 5), the number of hydrogen bond acceptors (10 or fewer), and the number of hydrogen bond donors (5 or fewer) that characterize drug likeness for orally available drugs.176 The ‘five’ in this rule does not mean five rules; it refers to the cutoff values, each of which is five or a multiple of five. This clearly places the Lipinski rule among the rules of thumb; Fialkowski’s recent investigations of the whole space of real chemical compounds revealed that the molecular weight criterion does not distinguish drugs from nondrug molecules, so this criterion no longer holds (cf. Section 4.14.4.8). ADMET is another concept that focuses, in a similar context, on privileged absorption, distribution, metabolism, elimination, and toxicological drug properties.177–179 A closer inspection of the large drug molecule population also reveals some further drug-like properties. It has been found that a “group of 32 common shapes or frameworks accounted for 50% of the 5120 molecules considered. Whether these fragments had intrinsic characteristics that gave them drug-like properties or their presence was a result of chemists’ habits, familiarities or synthetic versatility was an issue that was recognized but not addressed.”113 In a similar approach, Helma et al. used Molecular Feature Miner (MOLFEA) to investigate the NCI DTP AIDS Antiviral Screen program database, which included more than 40 000 compounds tested for anti-HIV activity. This revealed the pattern of the molecular fragment distribution in active and inactive compounds. Kubinyi112 indicated some other privileged drug-like moieties, and Oprea,180 investigating more than 12 000 compounds with different biological activity spectra, attempted to cluster them into active, moderately active, and inactive molecules. Attempts have been made to extend standard QSAR into comparative QSAR or combined QSAR databases that would reveal the drug likeness of molecules.181,182 An interesting approach to data mining has been suggested by Tropsha, who developed the concept of applying predictive QSAR as a virtual screening tool.183 Similar approaches have been suggested for protein and gene expression data.184
Druggability is a concept related to drug likeness. It assumes that drug-like molecules provide an efficient fit only to some selected receptors within biological space. Consequently, we should search for druggable genomes that would express druggable proteomes.15
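The rule-of-five cutoffs quoted above can be written down as a simple filter. The following is only a minimal sketch: the `Descriptors` container and the function names are hypothetical, the descriptor values are illustrative numbers rather than measured data, and the choice to let values exactly at a cutoff pass, as well as the `max_violations` tolerance, are assumptions that should be adapted to the screening protocol in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float   # molecular weight
    log_p: float        # log P (octanol-water partition coefficient)
    h_acceptors: int    # number of hydrogen bond acceptors
    h_donors: int       # number of hydrogen bond donors


def rule_of_five_violations(d: Descriptors) -> list[str]:
    """Return the names of the rule-of-five cutoffs that d exceeds."""
    checks = {
        "molecular weight > 500": d.mol_weight > 500,
        "log P > 5": d.log_p > 5,
        "H-bond acceptors > 10": d.h_acceptors > 10,
        "H-bond donors > 5": d.h_donors > 5,
    }
    return [name for name, violated in checks.items() if violated]


def is_drug_like(d: Descriptors, max_violations: int = 0) -> bool:
    """Assumed pass criterion: at most max_violations cutoffs exceeded."""
    return len(rule_of_five_violations(d)) <= max_violations


# Illustrative descriptor sets, not data for real compounds.
small = Descriptors(mol_weight=180.2, log_p=1.2, h_acceptors=4, h_donors=1)
large = Descriptors(mol_weight=720.0, log_p=6.3, h_acceptors=12, h_donors=3)

print(rule_of_five_violations(small))
print(rule_of_five_violations(large))
print(is_drug_like(small), is_drug_like(large))
```

Such a filter is a rule of thumb in exactly the sense discussed in the text: it ranks or prunes candidates before screening but does not predict activity.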
4.14.4.11.8 Molecular diversity in property-oriented synthesis
Molecular diversity evaluation is a relatively novel idea in molecular design.185–187 Basically, molecular design aims at imitating bioeffectors operating in natural systems in order to manipulate biological effects. Imitation implies a search for similarity. Thus, computational operations aimed at finding novel drugs are based on the assessment of similarity within a series of active compounds, for example, modeling relationships and correlating parameters in QSAR or m-QSAR, discovering a general resemblance for drug likeness, or finding the correspondence between selected areas of the ligand and the receptor. However, discovering novel potential drugs requires us to perturb an initial structure or, in other words, to introduce a structural change into it. This means that to make novel molecules one needs to make them dissimilar to the lead structure. In practice, molecular diversity is a concept related to combinatorial chemistry, which constructs a number of molecules in CS. The major mission of this method is to compare virtual objects in CS in order to optimize the design of combinatorial products with respect to improved properties. In the context of chemoinformatics, measuring molecular diversity is a complex problem. A variety of parameters have been developed, including those given directly by PCA latent variables obtained by processing molecular structural data.188 Dean and Richard187 provide an extensive introduction to molecular diversity problems. Diversity-oriented synthesis (DOS) is a recent experimental approach based on molecular diversity. In such an approach, sometimes called chemical genetics, the biological space described by proteins is explored by directly perturbing the proteins with small molecules occupying CS. In this way, we investigate which proteins regulate certain biological processes.189 ChemBank is a web-based system for data sharing in chemical genetics.190
4.14.4.11.9 Bioinformatics in drug design
Bioinformatics shifts our scope of interest in molecular design from CS to biological space. Formally, this approach focuses on mapping biological space to chemical space. Pharmacogenomics is a direction based on the assumption that the interpretation of genomic data will provide us with novel drugs. Relating drug activity to gene expression patterns in structure–activity–target (SAT) is one of the novel concepts.191 A number of problems, including the analysis of DNA microarray gene expression data, pose a novel computational and data processing challenge to bioinformatics, a new interdisciplinary research direction.192–194 A detailed discussion is, however, beyond the scope of this review.
4.14.5 Internet Resources for Chemistry and Chemoinformatics
The first concept of social interaction by networking was suggested in the 1960s and the first cable connection was constructed in 1969 in the United States. At that time, no one could have predicted the importance of this idea and its impact on the economy and science. Currently, computers depend more and more on web technologies. Accordingly, the number of chemical resources available on the web has increased steadily. This covers sites offering access to chemical data, educational sites, free or commercial software, e-commerce, online chemistry journals, and many others. It is clearly impossible to discuss individual sites separately in such a brief review. Instead, we list below several websites that provide further links to chemistry resources and chemistry portals.
ChemWeb is a chemistry portal providing a variety of further links, in particular to chemical databases (www.chemweb.com). Chemdex is run by the Department of Chemistry, University of Sheffield (www.chemdex.org), Chemcenter is provided by the ACS (www.chemistry.org/portal/a/c/s/1/home.html), and ChemSoc by the Royal Society of Chemistry (www.chemsoc.org/links/links.htm). Further portals include Chemfinder (chemfinder.cambridgesoft.com), ChemNet (www.chemnet.com/), Chemie.de (www.chemie.de/), Organic Chemistry Resources Worldwide (organicworldwide.net), Rolf Claessen’s Chemistry Index (www.claessen.net/chemistry), and ChemIndustry.com (www.neis.com).
A variety of useful links to chemistry web pages can be found at the following addresses: Links for Chemists (www.liv.ac.uk/Chemistry/Links/links.html), Internet Resources for Science and Mathematics Education (www.towson.edu/csme/mctp/Technology/Chemistry.html), and Huber’s Chemistry Resource on the Internet (www.library.ucsb.edu/istl/98-winter/interne1.html).
Several chemistry portals explicitly refer to chemoinformatics. The websites providing further links are Chemoinf.com – The Chemoinformatics Hub (www.chemoinf.com) and PharmTao – The Chemoinformatics Portal (www.pharmtao.com/Cheminfo/index.htm). A variety of further links can also be found at the Wendy Warr and Associates home page (www.warr.com/links.html). Interesting applications of chemoinformatics, including online computations, can be found at the TORVS (www2.chemie.uni-erlangen.de) or Molinspiration (www.molinspiration.com/jme) sites. Available e-commerce includes a number of chemical suppliers. Several links are listed under the ChemicalInformationNetwork (http://chemport.ipe.ac.cn/ListPageE/Catalogs.shtml). Sigma-Aldrich (www.sigmaaldrich.com) and BASF (www.ecommerce.basf.com) are examples of e-commerce sites that offer a variety of chemicals and list chemical data for them.
4.14.6 Conclusions and Further Trends
While discussing computer applications in this chapter we focused on organic chemistry. This is by no means a coincidence, because the majority of chemical applications referring to other disciplines are discussed in previous chapters, and the literature devoted to chemometrics has already taken care of a variety of chemical informatics branches. The origins of chemometrics can be traced to analytical chemistry. This allowed for a significant improvement of data processing by the use of computer-based methods, even though we must sometimes accept an ‘educated guess strategy’. Chemometrics inspired and forced further applications of such strategies in chemistry, in particular in organic chemistry, which poses challenging problems even at the level of translating data structures from chemistry into algorithmic data processing. This might have increased the difficulties relative to those encountered in analytical chemistry. Thus, we observe a delayed effect, and several more years are needed to make computer applications in this area available and to enhance them significantly. Moreover, these techniques need to be generally accepted by chemists working in organic chemistry. From this perspective, chemoinformatics is an extension of chemometrics, particularly, but not only, into the organic chemistry domain. Hence, it can be expected that chemoinformatics will head in the same direction that chemometrics did several years ago in analytical chemistry. There are still many unsolved problems in this area, particularly regarding molecular design, that is, the discipline that originated chemoinformatics. Basically, what we need for more efficient molecular design is to include complex information on drug–receptor interaction. What we will probably observe is a large expansion of receptor-dependent modeling, for example, as currently observed for the 4D-QSAR method.
The complex nature of biological interactions, as well as the need to optimize many properties at the same time, means that, for years to come, investigations in this field will be a cooperative and iterative procedure in which experiments will predominate and calculations will enhance the probability of finding high-quality synthetic targets. With increasing computational power, not only in the sense of computer hardware but also of novel computational ideas and approaches, a larger population of molecules will be available for virtual screening. Knowledge discovery by database mining is a promising direction with large and still unexplored potential. In principle, efficient technology together with precise knowledge of the real or possible ligand–receptor interactions should make drug discovery a dead certainty. However, this will probably still not be the case in the near future.
In synthesis design and reaction prediction, computers are able to provide valuable hints. This will probably resemble chess-playing software. Some artists among chemists, especially supported by serendipity, will still find better solutions, but computer hints will remain useful for the majority. Technically, already today the Internet is a philosophy of informatics and chemoinformatics. Chemistry depends on data and web resources to make data widely available. Further expansion of network technologies will definitely be of major importance for further development in this field.
4.14.7 Sources of Further Information and Advice
A four-volume Handbook of Chemoinformatics edited by Gasteiger and supplemented by the Textbook provides an extensive description of this research area. Chemometrics and Chemoinformatics by Lavine is an informative introduction to these related disciplines. Several further handbooks, in particular those focusing on molecular design, are also available on the market.195–198 The glossary part of Chemical Informatics Letters29 is an excellent source of basic definitions of the terms used in chemoinformatics. This is supported by further web links. The IUPAC glossary of terms used in computational chemistry can also provide useful information,199 and the Pharmaceutical Chemistry (and Biology) glossary is available online.200
References
1. Warr, W. http://www.warr.com/warrzone2000.html.
2. Clark, T. In Molecular Informatics: Confronting Complexity, Proceedings of the Beilstein-Institut Workshop, Bozen, Italy, May 13–16, 2002; Hicks, M. G., Kettner, C., Eds.; http://www.beilstein-institut.de/bozen2002/proceedings/Clark/Clark.pdf.
3. Parthasarathi, R.; Padmanabhan, J.; Elango, M.; Subramanian, V.; Roy, D. R.; Sarkar, U.; Chattaraj, P. K. Application of Quantum Chemical Descriptors in Computational Medicinal Chemistry and Chemoinformatics. Indian J. Chem. A 2005, 45, 111–125.
4. Fensham, P. J. Implications, Large and Small, from Chemical Education Research for the Teaching of Chemistry. Quím. Nova 2002, 25, 335–339.
5. Hughes, J. What Is Science? Summaries and Reviews. http://www.mdx.ac.uk/www/study/science.htm.
6. Brock, W. H. The Fontana History of Chemistry; Fontana Press: London, 1993.
7. Cohen, J.; Stewart, I. The Collapse of Chaos: Discovering Simplicity in a Complex World; Viking: New York, 1994. Polish edition Proszynski and s-ka, Warsaw, 2006.
8. Kowalski, B. R.; Bender, C. F. Solving Chemical Problems with Pattern Recognition. Naturwissenschaften 1975, 62, 10–14.
9. Lutus, P. Computer Math, Exploring a New Frontier beyond the Realm of Human Calculation. http://www.arachnoid.com/lutusp/computermath.html.
10. Jager, W.; Krebs, H.-J., Eds. Mathematics – Key Technology for the Future; Springer: Berlin, 2003.
11. Oprea, T. I. Chemoinformatics and the Quest for Leads in Drug Discovery. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 1509–1531.
12. Chemical Abstract Service, CAS. http://www.cas.org/expertise/cascontent.
13. Baldi, P. Chemoinformatics, Drug Design, and Systems Biology. Genome Inform. 2005, 16, 281–285.
14. Bohacek, R. S.; McMartin, C.; Guida, W. C. The Art and Practice of Structure-Based Drug Design: A Molecular Modelling Perspective. Med. Res. Rev. 1996, 16, 3–50.
15. Lipinski, C.; Hopkins, A. Navigating Chemical Space for Biology and Medicine. Nature 2004, 432, 855–861.
16. Beilstein-Institut. http://www.beilstein-institut.de.
17. Smith, M. B.; March, J. March’s Advanced Organic Chemistry Reactions, Mechanisms, and Structure; Wiley: New York, 2001.
18. Heller, S. R., Ed. The Beilstein Online Database – Implementation, Content, and Retrieval; ACS Symposium Series 436; American Chemical Society: Washington, DC, 1990.
19. Gasteiger, J.; Engel, T. Chemoinformatics a Textbook; Wiley-VCH: Weinheim, 2003; p 4.
20. Willett, P. A History of Chemoinformatics. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 6–20.
21. Polanski, J. Molecular Shape Analysis. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 302–319.
22. Roberts, T. S. Computer-Supported Collaborative Learning in Higher Education; Idea Group Inc.: Hershey, 2004.
23. Hromkovic, J. Theoretical Computer Science: Introduction to Automata, Computability, Complexity, Algorithmics, Randomization, Communication and Cryptography; Springer: Berlin, 2003; p 1.
24. Gasteiger, J.; Engel, T. Chemoinformatics a Textbook; Wiley-VCH: Weinheim, 2003.
25. Brown, F. Chemoinformatics: What Is It and How Does It Impact Drug Discovery. Annu. Rep. Med. Chem. 1998, 33, 375–384.
26. Bajorath, J. Chemoinformatics Concepts, Methods, and Tools for Drug Discovery. In Methods for Molecular Biology; Walker, J. M., Ed.; Humana Press: Totowa, 2004; Vol. 275, p V.
27. Hrib, N. J.; Peet, N. P. Chemoinformatics: Are We Exploiting These New Science? Drug Discov. Today 2000, 5, 483–485.
28. Cambridge Healthtech Institute, Pharmaceutical Cheminformatics & Chemoinformatics glossary. http://www.genomicglossaries.com.
29. Goodman, J. M. Glossary. Chem. Inf. Lett., http://www-jmg.ch.cam.ac.uk.
30. Goodman, J. M. Chemical Informatics. Chem. Inf. Lett. 2003, 6, 14.
31. Noordik, J. H. Cheminformatics Developments; IOS Press: Amsterdam, 2004.
32. IUPAC Compendium of Chemical Terminology, 2nd ed., International Union of Pure and Applied Chemistry (IUPAC), 1997.
33. Ihde, A. J. The Development of Modern Chemistry; General Publishing Company, Ltd.: Don Mills, 1984.
34. Barnard, J. M. Representation of Molecular Structures – Overview. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 27–50.
35. Weininger, D. SMILES – A Language for Molecules and Reactions. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 80–102.
36. Daylight Chemical Information System, SMILES Tutorial. http://www.daylight.com.
37. MDL Information Systems, Inc., ISIS/Draw. http://www.mdli.com.
38. Advanced Chemistry Development, ACDLAB, ACD/ChemSketch 8.0 Freeware. http://www.acdlabs.com.
39. Ertl, P. JME Molecular Editor. http://www.molinspiration.com.
40. Sayle, R. Rasmol. http://www.umass.edu.
41. Simulated Biomolecular Systems Inc., CLiDE. http://www.simbiosys.ca.
42. Schütt, H. W. Eilhard Mitscherlich: Prince of Prussian Chemistry, History of Modern Chemical Sciences Series; American Chemical Society and the Chemical Heritage Foundation: Washington, DC, 1997.
43. Quadbeck-Seeger, H.-J., Ed. World Records in Chemistry; Wiley-VCH: Weinheim, 1999.
44. Hirsch, A.; Brettreich, M.; Wudl, F. Fullerenes; Wiley-VCH: Weinheim, 2005.
45. Moss, G. P. IUPAC Recommendations on Organic & Biochemical Nomenclature, Symbols & Terminology. http://www.chem.qmul.ac.uk/iupac/.
46. Wisniewski, J. Chemical Nomenclature and Structure Representation: Algorithmic Generation and Conversion. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 51–79.
47. IUPAC Strategy Round Table. Representations of Molecular Structure: Nomenclature and Its Alternatives. http://www.iupac.org/newsarchives/2000/NRT_Report.html.
48. Smith, A.; Heckelman, P. E.; O’Neil, M. J.; Budavari, S., Eds. The Merck Index an Encyclopedia of Chemicals, Drugs, & Biologicals, 13th ed.; Merck & Co., Inc.: Whitehouse Station, 2001.
49. The Organic Chemistry Portal. http://www.organic-chemistry.org.
50. Chen, L. Reaction Classification and Knowledge Acquisition. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 348–388.
51. Chemical Databases. http://www.google.com/Top/Science/Chemistry/Chemical_Databases.
52. Downs, G. M.; Gillet, V. J.; Holliday, J. D.; Lynch, M. F. Review of Ring Perception Algorithms for Chemical Graphs. J. Chem. Inf. Comput. Sci. 1989, 29, 172–187.
53. Höltje, H.-D.; Sippl, W.; Rognan, D.; Folkers, G. Molecular Modeling; Wiley-VCH: Weinheim, 2003.
54. Molecular Networks. http://www.mol-net.com.
55. Motherwell, S. Chemoinformatics and Crystallography. The Cambridge Structural Database. In Cheminformatics Developments, History, Reviews and Current Research; Noordik, J. H., Ed.; IOS Press: Amsterdam, 2004; pp 37–68.
56. Sadowski, J. Representation of 3D Structures. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 231–260.
57. Hypercube Inc. http://www.hyper.com.
58. Sadowski, J.; Gasteiger, J. From Atoms and Bonds to Three-Dimensional Atomic Coordinates: Automatic Model Builders. Chem. Rev. 1993, 93, 2567–2581.
59. Concord. http://www.tripos.com.
60. Carloni, P.; Alber, F. Quantum Medicinal Chemistry. In Methods and Principles in Medicinal Chemistry; Mannhold, R., Kubinyi, H., Folkers, G., Eds.; Wiley-VCH: Weinheim, 2003; Vol. 17.
61. San Diego Supercomputer Center. http://www.sdsc.edu/~kimb.
62. Hinchliffe, A. Molecular Modelling for Beginners; Wiley: Chichester, 2003.
63. The NIH Guide to Molecular Modeling. http://cmm.cit.nih.gov/modeling/guide_documents/molecular_mechanics_document.html.
64. Goodman, J. Chemical Applications of Molecular Modelling; Royal Society of Chemistry: London, 1999.
65. Keseru, G.; Kolossvary, I. Molecular Mechanics and Conformational Analysis in Drug Design; Blackwell Publishing: Oxford, 1999.
66. Rzepa, H. Molecular Modelling for Organic Chemistry. http://www.ch.ic.ac.uk/local/organic/mod/.
67. Cramer, C. J. Essentials of Computational Chemistry; Wiley: Chichester, 2004.
68. Rapaport, D. C. The Art of Molecular Dynamics Simulation; Cambridge University Press: Cambridge, 2004.
69. Molecular dynamics. http://en.wikipedia.org.
70. Leach, A. R.; Gillet, V. J. An Introduction to Chemoinformatics; Kluwer: Dordrecht, 2003.
71. Kochev, N.; Monev, V.; Bangov, I. Searching Chemical Structures. In Chemoinformatics a Textbook; Gasteiger, J., Engel, T., Eds.; Wiley-VCH: Weinheim, 2003; pp 291–318.
72. Nicklaus, M. C. Pharmacophore and Drug Discovery. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 1687–1711.
73. Hubbard, R. E. Molecular Graphics: From Pen Plotter to Virtual Reality. In People and Computers VII; Monk, A. F., Diaper, D., Harrison, M. D., Eds.; Cambridge University Press: Cambridge, 1992; p 21.
74. Center for Structural Biology, Yale University. http://www.csb.yale.edu.
75. Keil, M.; Borosch, T.; Exner, T. E.; Brinkman, J. Computer Visualization of Molecular Models Tools for Man-Machine Communication in Molecular Science. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 320–344.
76. Morris, P. J. T. From Classical to Modern Chemistry; Royal Society of Chemistry: London, 2002.
77. Barone, R.; Chanon, M. Computer-Assisted Synthesis Design. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 1428–1456.
78. Corey, E. J.; Cheng, X.-M. The Logic of Chemical Synthesis; Wiley: New York, 1989.
79. Smit, W. A.; Caple, R.; Bochkov, A. F. Organic Synthesis; Royal Society of Chemistry: London, 1998.
80. Fuhrhop, J.; Penzlin, G. Organic Synthesis Concepts, Methods, Starting Materials; VCH: Weinheim, 1983.
81. Seebach, D. Methods of Reactivity Umpolung. Angew. Chem. Int. Ed. Engl. 1979, 18, 239–336.
82. Corey, E. J. The Logic of Chemical Synthesis: Multistep Synthesis of Complex Carbogenic Molecules. In Nobel Lectures in Chemistry 1981–1990; Malmstrom, B. G., Ed.; World Scientific: Singapore, 1993; pp 686–708.
83. Corey, E. J.; Ohno, M.; Vatakencherry, P. A.; Mitra, R. B. Total Synthesis of D,L-Longifolene. J. Am. Chem. Soc. 1961, 83, 1251–1253.
84. Todd, M. H. Computer-Aided Organic Synthesis. Chem. Soc. Rev. 2005, 34, 247–266.
85. LHASA. http://lhasa.harvard.edu.
86. Corey, E. J.; Wipke, W. T. Computer-Assisted Design of Complex Organic Syntheses. Science 1969, 166, 178–192.
87. Rouhi, A. M. Above and Beyond Organic Synthesis. Chem. Eng. News 2004, 82, 37–41.
88. Gasteiger, J.; Hutchings, M. C.; Christoph, B.; Gann, L.; Hiller, C.; Löw, P.; Marsili, M.; Saller, H.; Yuki, K. A New Treatment of Chemical Reactivity: Development of EROS, an Expert System for Reaction Prediction and Synthesis Design. Top. Curr. Chem. 1987, 137, 19–73.
89. Pförtner, M.; Sitzmann, M. Computer-Assisted Synthesis Design by WODCA (CASD). In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 1457–1507.
90. WODCA Computer-Assisted Organic Synthesis. http://www2.chemie.uni-erlangen.de.
91. Ott, M. A. Chemoinformatics and Organic Chemistry. Computer Assisted Synthetic Analysis. In Cheminformatics Developments, History, Reviews and Current Research; Noordik, J. H., Ed.; IOS Press: Amsterdam, 2004; pp 83–110.
92. CAMEO. http://zarbi.chem.yale.edu/software.html.
93. Herges, R.; Hook, C. Science 1992, 255, 711–713.
94. Gasteiger, J. The Central Role of Chemoinformatics. Chemometr. Intell. Lab. Syst. 2006, 82, 200–209.
95. Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design, 2nd ed.; Wiley-VCH: Weinheim, 1999.
96. Adams, M. J. Chemometrics in Analytical Spectroscopy; Royal Society of Chemistry: Cambridge, 2004.
97. Steinbeck, C. H. Computer-Assisted Structure Elucidation. In Handbook of Chemoinformatics from Data to Knowledge; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; pp 1378–1406.
98. Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining; Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Eds.; AAAI Press/The MIT Press: Menlo Park, CA, 1996; pp 1–34.
99. Frawley, W. J.; Piatetsky-Shapiro, G.; Matheus, C. Knowledge Discovery in Databases: An Overview. In Knowledge Discovery in Databases; Piatetsky-Shapiro, G., Frawley, W. J., Eds.; AAAI Press/MIT Press: Cambridge, MA, 1991; pp 1–30.
100. Vogel, A. I.; Furniss, B. S.; Hannaford, A. J.; Smith, P. W. G.; Tatchell, A. R. Vogel’s Textbook of Practical Organic Chemistry, 5th ed.; Longman Scientific and Technical: Essex, 1989; p 1406.
101. Fialkowski, M.; Bishop, K. J.; Chubukov, V. A.; Campbell, C. J.; Grzybowski, B. A. Architecture and Evolution of Organic Chemistry. Angew. Chem. Int. Ed. Engl. 2005, 44, 7263–7269.
102. Rucker, C.; Meringer, M. How Many Organic Compounds Are Graph-Theoretically Nonplanar? MATCH Commun. Math. Comput. Chem. 2002, 45, 153–172.
103. Berasaluce, S.; Laurenço, C.; Amedeo, N.; Gilles, N. An Experiment on Knowledge Discovery in Chemical Databases. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases – PKDD 2004, Pisa, Italy; Lecture Notes in Artificial Intelligence, Vol. 3202; Springer Verlag: Berlin, 2004; pp 39–51.
104. Chen, L. R.; Gasteiger, J. Knowledge Discovery in Reaction Databases: Landscaping Organic Reactions by a Self-Organizing Neural Network. J. Am. Chem. Soc. 1997, 119, 4033–4042.
105. Ester, M.; Sander, J. Knowledge Discovery in Databases Techniken und Anwendungen; Springer: Berlin, 2000.
106. Bremer, E. G.; Hakenberg, J.; Han, E.-H. S.; Berrar, D.; Dubitzky, W., Eds. Knowledge Discovery in Life Science Literature, PAKDD 2006 International Workshop Proceedings, KDLL, Singapore, April 9, 2006; Lecture Notes in Computer Science, Vol. 3886; Springer: Berlin, 2006.
107. Wold, S. Chemometrics; What Do We Mean with It, and What Do We Want from It? Chemometr. Intell. Lab. Syst. 1995, 30, 109–115.
108. Pierce, K. M.; Wood, L. F.; Wright, B. W.; Synovec, R. E. A Comprehensive Two-Dimensional Retention Time Alignment Algorithm to Enhance Chemometric Analysis of Comprehensive Two-Dimensional Separation Data. Anal. Chem. 2005, 77, 7735–7743.
109. Myshkin, E.; Wang, B. J. Chemometrical Classification of Ephrin Ligands and Eph Kinases Using GRID/CPCA Approach. J. Chem. Inf. Comput. Sci. 2003, 43, 1004–1010.
110. Pytela, O.; Kulhanek, J.; Ludwig, M. Chemometrical Analysis of Substituent Effects. IV. Additivity of Substituent Effects in Dissociation of 3,5-Disubstituted Benzoic Acids in Organic Solvents. Collect. Czech. Chem. Commun. 1994, 59, 1637–1644.
111. Rodriguez-Barrios, F.; Gago, F. Chemometrical Identification of Mutations in HIV-1 Reverse Transcriptase Conferring Resistance or Enhanced Sensitivity to Arylsulfonylbenzonitriles. J. Am. Chem. Soc. 2004, 126, 2718–2719.
112. Kubinyi, H. Privileged Structures and Analogue-Based Drug Discovery. In Analogue-Based Drug Discovery; Fischer, J., Ganellin, C. R., Eds.; Wiley-VCH: Weinheim, 2006; pp 53–65.
113. Fattori, D. D. Molecular Recognition: The Fragment Approach in Lead Generation. Drug Discov. Today 2004, 9, 229–238.
114. Shirai, Y.; Osgood, A. J.; Zhao, Y.; Kelly, K. F.; Tour, J. M. Directional Control in Thermally Driven Single-Molecule Nanocars. Nano Lett. 2005, 5, 2330–2334.
115. Halford, B. Nanocar Rolls into Action. World’s First Molecular Car Zips about on Fullerene Wheels. Chem. Eng. News 2005, 83 (43), 13.
116. Kolb, H. C.; Finn, M. G.; Sharpless, K. B. Click Chemistry: Diverse Chemical Function from a Few Good Reactions. Angew. Chem. Int. Ed. 2001, 40, 2004–2021.
117. Kubinyi, H. Chance Favors the Prepared Mind – from Serendipity to Rational Drug Design. J. Recept. Signal Transduct. Res. 1999, 19, 15–39.
118. Booth, B.; Zemmel, R. Prospects for Productivity. Nat. Rev. Drug Discov. 2004, 3, 451–456.
119. Kubinyi, H. Drug Research: Myths, Hype and Reality. Nat. Rev. Drug Discov. 2003, 2, 665–668.
120. Van de Waterbeemd, H. Introduction. In Chemometric Methods in Molecular Design; Van de Waterbeemd, H., Ed.; Wiley-VCH: Weinheim, 1995.
121. Polanski, J. Developing New Sweeteners. In Optimising Sweet Taste in Foods; Spillane, W. J., Ed.; Woodhead Publishing Limited: Cambridge, 2006; pp 307–326.
122. Erlanson, D. A.; McDowell, R. S.; O’Brien, T. Fragment-Based Drug Discovery. J. Med. Chem. 2004, 47, 3463–3482.
123. Hanessian, S.; Lu, P. P.; Sanceau, J. Y.; Chemla, P.; Gohda, K.; Fonne-Pfister, R.; Prade, L.; Cowan-Jacob, S. W. An Enzyme-Bound Bisubstrate Hybrid Inhibitor of Adenylosuccinate Synthetase. Angew. Chem. Int. Ed. 1999, 38, 3159–3162.
124. Huth, J. R.; Sun, C.; Sauer, D. R.; Hajduk, P. J. Utilization of NMR-Derived Fragment Leads in Drug Design. Methods Enzymol. 2005, 394, 549–569.
125. Fischer, J.; Ganellin, C. R., Eds. Analogue-Based Drug Discovery; Wiley-VCH: Weinheim, 2006.
126. Wermuth, C. G. Analogues as a Means of Discovering New Drugs. In Analogue-Based Drug Discovery; Fischer, J., Ganellin, C. R., Eds.; Wiley-VCH: Weinheim, 2006; pp 3–20.
127. Kubinyi, H. Drug Research – from Serendipity to Rational Design. http://www.kubinyi.de/lectures.html.
128. Bannwarth, W.; Felder, E., Eds. Combinatorial Chemistry: A Practical Approach. In Methods and Principles in Medicinal Chemistry; Mannhold, R., Kubinyi, H., Timmerman, H., Eds.; Wiley-VCH: Weinheim, 2000; Vol. 9.
129. Terrett, N. K. Combinatorial Chemistry; Oxford University Press: New York, 1998.
130. Borman, S. The Many Faces of Combinatorial Chemistry. Chem. Eng. News 2003, 81 (43), 45–56.
131. Geysen, H. M.; Schoenen, F.; Wagner, D.; Wagner, R. Combinatorial Compound Libraries for Drug Discovery: An Ongoing Challenge. Nat. Rev. Drug Discov. 2003, 2, 222–230.
132. Persidis, A. High-Throughput Screening. Nat. Biotechnol. 1998, 16, 488–489.
133. Kubinyi, H. Changing Paradigms in Drug Discovery. In The Chemical Theatre of Biological Systems, Proceedings of the International Beilstein Workshop, Bozen, May 24–28, 2004; Hicks, M., Kettner, C., Eds.; Logos-Verlag: Berlin, 2005; pp 51–72.
134. Frantz, S. 2003 Approvals: A Year of Innovation and Upward Trends. Nat. Rev. Drug Discov. 2004, 3, 103–105.
135. Schmid, E. F.; Smith, D. Is Pharmaceutical R&D Just a Game of Chance or Can Strategy Make a Difference? Drug Discov. Today 2004, 9, 18–26.
136. Schmid, E. F.; Smith, D. Is Declining Innovation in the Pharmaceutical Industry a Myth? Drug Discov. Today 2005, 15, 1031–1039.
137. Otto, S.; Furlan, R. L. E.; Sanders, J. K. M. Recent Developments in Dynamic Combinatorial Chemistry. Curr. Opin. Chem. Biol. 2002, 6, 321–327.
138. Borman, S. Drugs by Design. Chem. Eng. News 2005, 83 (48), 28–30.
139. Cohen, N. C. Guidebook on Molecular Modelling in Drug Design; Academic Press: San Diego, 1996.
140. Schneider, G.; Fechner, U. Computer-Based De Novo Design of Drug-Like Molecules. Nat. Rev. Drug Discov. 2005, 4, 649–663.
141. Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Docking and Scoring in Virtual Screening for Drug Discovery: Methods and Applications. Nat. Rev. Drug Discov. 2004, 3, 935–949.
142. Gasteiger, J.; Engel, T. Chemoinformatics a Textbook; Wiley-VCH: Weinheim, 2003; p 610.
143. Warren, G. L.; Andrews, C. W.; Capelli, A.-M.; Clarke, B.; LaLonde, J.; Lambert, M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I. D.; Woolven, J. M.; Peishoff, C. E.; Head, M. S. A Critical Assessment of Docking Programs and Scoring Functions. J. Med. Chem. 2006, 49, 5912–5931.
144. Kirkpatrick, P. Computational Chemistry: Docking on Trial. Nat. Rev. Drug Discov. 2005, 4, 813.
145. Hinchliffe, A. Chemical Modelling: Applications and Theory; Royal Society of Chemistry: Cambridge, 2000–2004; Vols. 1–3.
146. Richards, W. G. Virtual Screening Using Grid Computing: The Screensaver Project. Nat. Rev. Drug Discov. 2002, 1, 551–555.
147. Horvath, D.; Mao, B.; Gozalbes, R.; Barbosa, F.; Rogalski, S. L. Strengths and Limitations of Pharmacophore-Based Virtual Screening. In Chemoinformatics in Drug Discovery; Oprea, T. I., Ed.; In Methods and Principles in Medicinal Chemistry; Mannhold, R., Kubinyi, H., Timmerman, H., Eds.; Wiley-VCH: Weinheim, 2005; Vol. 23, pp 117–140.
148. Polanski, J.; Gieleciak, R.; Bak, A.; Magdziarz, T. Robust QSAR Modeling. J. Chem. Inf. Model. 2006, 46, 2310–2318.
149. De Julian-Ortiz, J. Virtual Darwinian Drug Design: QSAR Inverse Problem. Comb. Chem. High Throughput Screen. 2000, 4, 295–310.
150. Esposito, E. X.; Hopfinger, A.; Madura, J. D. Methods for Applying the Quantitative-Structure Relationship Paradigm. In Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery; Bajorath, J., Ed.; Humana Press: Totowa, 2004; pp 131–214.
151. Kubinyi, H. QSAR: Hansch Analysis and Related Approaches. In Methods and Principles in Medicinal Chemistry; Mannhold, R., Krogsgaard-Larsen, P., Timmerman, H., Eds.; Wiley-VCH: Weinheim, 1993; Vol. 1, pp 1–240.
152. Boyd, D. B. Successes of Computer-Assisted Molecular Design. In Reviews in Computational Chemistry; Lipkowitz, K. B., Boyd, D. B., Eds.; VCH Publishers: New York, 1990; pp 355–371.
153. Maran, U.; Sulev, S. QSAR Modeling of Mutagenicity on Non-congeneric Sets of Organic Compounds. In Artificial Intelligence Methods and Tools for Systems Biology; Dubitzky, W., Azuaje, F., Eds.; Springer: Dordrecht, 2004; pp 19–36.
154. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors. In Methods and Principles in Medicinal Chemistry; Mannhold, R., Kubinyi, H., Timmerman, H., Eds.; Wiley-VCH: Weinheim, 2000; Vol. 11, pp 1–607.
155. Doweyko, A. 3D-QSAR Illusions. J. Comput. Aided Mol. Des. 2004, 18, 587–596.
156. Cho, S.; Tropsha, A. Cross-Validated r2-Guided Region Selection for Comparative Molecular Field Analysis: A Simple Method to Achieve Consistent Results. J. Med. Chem. 1995, 38, 1060–1066.
157. Hopfinger, A.; Wang, S.; Tokarski, J.; Jin, B.; Albuquerque, M.; Madhav, P.; Duraiswami, C. Construction of 3D-QSAR Models Using the 4D-QSAR Analysis Formalism. J. Am. Chem. Soc. 1997, 119, 10509–10524.
158. Tetko, I. V.; Kovalishyn, V. V.; Livingstone, D. J. Volume Learning Algorithm Artificial Neural Networks for 3D QSAR Studies. J. Med. Chem. 2001, 44, 2411–2420.
504 Chemoinformatics 159. Wermuth, C. The Impact of QSAR and CADD Methods in Drug Discovery. In Rational Approach to Drug Design; Ho¨ltji, H., Sippl, W., Eds.; Prous Science: Barcelona, 2001; pp 3–20. 160. Korhonen, S. P.; Tuppurainen, K.; Laatikainen, R.; Peva¨kyla, M. FLUFF-BALL A Template-Based Grid-Independent Superposition and QSAR Technique: Validation Using a Benchmark Steroid Data Set. J. Chem. Inf. Comput. Sci. 2003, 43, 1780–1793. 161. Lemmen, C.; Lengauer, T. Computational Methods for the Structural Alignment of Molecules. J. Comput. Aided Mol. Des. 2000, 14, 215–232. 162. Jain, A.; Koile, K.; Chapman, D. Compass: Predicting Biological Activities from Molecular Surface Properties. Performance Comparison on a Steroid Benchmark. J. Med. Chem. 1994, 37, 2315–2327. 163. Polanski, J. Self-Organizing Neural Networks for Pharmacophore Mapping. Adv. Drug Deliv. Rev. 2003, 55, 1149–1162. 164. Vedani, A.; Dobler, M. 5D-QSAR: The Key for Simulating Induced Fit? J. Med. Chem. 2002, 45, 2139–2149. 165. Polanski, J.; Bak, A. Modeling Steric and Electronic Effects in 3D- and 4D-QSAR Schemes: Predicting Benzoic pKa Values and Steroid CBG Binding Affinities. J. Chem. Inf. Comput. Sci. 2003, 43, 2081–2092. 166. Santos-Filho, O. A.; Hopfinger, A. J. Structure-Based QSAR Analysis of a Set of 4-Hydroxy-5,6-Dihydropyrones as Inhibitors of HIV-1 Protease: An Application of the Receptor-Dependent (RD) 4D-QSAR Formalism. J. Chem. Inf. Model. 2006, 46, 345–354. 167. Vedani, A.; Dobler, M.; Lill, M. A. Combining Protein Modeling and 6D-QSAR – Simulating the Binding of Structurally Diverse Ligands to the Estrogen Receptor. J. Med. Chem. 2005, 48, 3700–3703. 168. Barrios, F.; Gago, F. Chemometrical Identification of Mutations in HIV-1 Reverse Transcriptase Conferring Resistance or Enhanced Sensitivity to Arylsulfonylbenzonitriles. J. Am. Chem. Soc. 2004, 126, 2718–2719. 169. Tropsha, A.; Gramatica, P.; Bombar, K. 
The Importance on Being Earnest: Validation Is the Absolute Essential for Successful Application and Interpretation of QSAR. Quant. Struct. Act. Relat. 2003, 22, 69–76. 170. Polanski, J.; Gieleciak, R.; Bak, A. Probability Issues in Molecular Design: Predictive and Modeling Ability in 3D-QSAR Schemes. Comb. Chem. High Throughput Screen. 2004, 7, 793–807. 171. Clark, R. Boosted Leave-Many-Out Cross-Validation: The Effect of Training and Test Set Diversity on PLS Statistics. J. Comput. Aided Mol. Des. 2003, 17, 265–275. 172. Sheridan, R.; Feuston, B.; Maiorov, V.; Kearsley, S. Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR. J. Chem. Inf. Comput. Sci. 2004, 44, 1912–1928. 173. Mulklin, R. Sharing Drug Data. Chem. Eng. News 2005, 83 (50), 20–21. 174. Tropsha, A. Application of Predictive QSAR Models to Database Mining. In Chemoinformatics in Drug Discovery; Oprea, T. I., Ed.; In Methods and Principles in Medicinal Chemistry; Mannhold, R., Kubinyi, H., Timmerman, H., Eds.; Wiley-VCH: Weinheim, 2005; Vol. 23, pp 437–456. 175. Anzali, S.; Barnickel, G.; Cezanne, B.; Krug, M.; Filimonov, D.; Poroikov, V. Discriminating between Drugs and Nondrugs by Prediction of Activity Spectra for Substances (PASS). J. Med. Chem. 2001, 44, 2432–2437. 176. Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Deliv. Rev. 1997, 23, 3–25. 177. Van de Waterbeemd, H.; Gifford, E. Admet In Silico Modelling: Towards Prediction Paradise? Nat. Rev. Drug Discov. 2003, 2, 192–204. 178. Davis, A. M.; Riley, R. J. Predictive ADMET Studies, the Challenges and the Opportunities. Curr. Opin. Chem. Biol. 2004, 8, 378–386. 179. Hodgson, J. ADMET – Turning Chemicals into Drugs. Nat. Biotechnol. 2001, 19, 722–726. 180. Oprea, T. 3D-QSAR Modeling in Drug Design. 
In Computational Medicinal Chemistry for Drug Discovery; Tolleneare, J., De Winter, H., Langenaeker, W., Bultinck, P., Eds.; Marcel Dekker: New York, 2004; pp 571–616. 181. Oprea, T. Current Trends in Lead Discovery. Are We Looking for the Appropriate Properties? J. Comput. Aided Mol. Des. 2002, 16, 325–334. 182. Hansch, C.; Hoekman, D.; Leo, A.; Weininger, D.; Selassie, C. Chembioinformatics: Comparative QSAR at the Interface between Chemistry and Biology. Chem. Rev. 2002, 102, 783–812. 183. Shen, M.; Beguin, C.; Golbraikh, A.; Stables, J. P.; Kohn, H.; Tropsha, A. Application of Predictive QSAR Models to Database Mining: Identification and Experimental Validation of Novel Anticonvulsant Compounds. J. Med. Chem. 2004, 47, 2356–2364. 184. Helma, C. H.; Kramer, S.; De Raedt, L. The Molecular Feature Miner MOLFEA. In Molecular Informatics: Confronting Complexity, Proceedings of the Beilstein-Institut Workshop, Bozen, May 13–16, 2002; M. G. Hicks, C. Kettner, Eds.; Hicks, M. G., Kettner, C., Eds.; pp 1–15. 185. Maggiora, G. M.; Shanmugasundaram, V.; Lajiness, M. S.; Doman, T. N.; Schultz, M. W. A Practical Strategy for Directed Compound Acquisition. In Chemoinformatics in Drug Discovery; Oprea, T. I., Ed.; In Methods and Principles in Medicinal Chemistry; Mannhold, R., Kubinyi, H., Timmerman, H., Eds.; Wiley-VCH: Weinheim, 2005; Vol. 23, pp 317–332. 186. Cavallaro, C. L.; Schnur, D. M.; Tebben, A. J. Molecular Diversity in Lead Discovery: From Quantity to Quality. In Drug Discovery; Oprea, T. I., Ed.; In Methods and Principles in Medicinal Chemistry; Mannhold, R., Kubinyi, H., Timmerman, H., Eds.; WileyVCH: Weinheim, 2005; Vol. 23, pp 175–198. 187. Dean, P. M.; Richard, A. L.; Eds. Molecular Diversity in Drug Design; Kluwer: Dordrecht, 1999. 188. Andersson, P. M.; Linusson, A.; Wold, S.; Sjostro¨m, M.; Lunstedt, T.; Norden, B. Design of Small Libraries for Lead Exploration. In Molecular Diversity in Drug Design; Dean, P. M., Richard, A. 
L., Eds.; Kluwer: Dordrecht, 1999; pp 197–220. 189. Schreiber, S. L. The Small-Molecule Approach to Biology. Chem. Eng. News 2003, 81 (9), 51–61. 190. ChemBank. http://chembank.broad.harvard.edu. 191. Blower, P., Jr.; Yang, C.; Fligner, M. A.; Verducci, J. S.; Yu, L.; Richman, S.; Weinstein, J. N. Pharmacogenomic Analysis: Correlating Molecular Substructure Classes with Microarray Gene Expression Data. Pharmacogenomics J. 2002, 2, 259–271. 192. Cavalieri, D.; De Filippo, C. Bioinformatic Methods for Integrating Whole-Genome Expression Results into Cellular Networks. Drug Discov. Today 2005, 10, 727–734. 193. Habeck, M. New Approach to Gene Expression Analysis. Drug Discov. Today 2003, 8, 427–428.
Chemoinformatics
505
194. Spang, R. Diagnostic Signatures from Microarrays: A Bioinformatics Concept for Personalized Medicine. Drug Discov. Today 2004, 9 (Suppl.), 32–36. 195. Bajorath, J.; Ed. Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery; Humana Press: Totowa, 2004. 196. Lavine, B. K.; Ed. Chemometrics and Chemoinformatics; ACS Symposium Series 894; American Chemical Society by Oxford University Press: Washington, DC, 2005. 197. Leach, A. R.; Gillet, V. J. An Introduction to Chemoinformatics; Kluwer: Dordrecht, 2003. 198. Noordik, J. H.; Ed. Cheminformatics Developments; IOS Press: Amsterdam, 2004. 199. Van de Waterbeemd, H.; Carter, R. E.; Grassy, G.; Kubinyi, H.; Martin, C.; Tute, M. S.; Willett, P. Glossary of Terms Used in Computational Drug Design. Pure Appl. Chem. 1997, 69, 1137–1152. 200. Cambridge Healthtech Institute. http://www.genomicglossaries.com/content/chemistry.asp.
506 Chemoinformatics
Biographical Sketch
Jaroslaw Polanski is a Professor of Chemistry and Head of the Department of Organic Chemistry at the Institute of Chemistry, University of Silesia, Katowice, Poland. He graduated from the Silesian University of Technology, Gliwice, obtained his Ph.D. from the University of Silesia (1993), and his D.Sc. from the Technical University of Lodz, Poland (1998). His scientific interests involve organic chemistry and chemoinformatics. He has published more than 80 peer-reviewed articles, book chapters, and patents in the area of drug design and discovery. Examples of the investigations reported include the modeling of multidimensional quantitative structure–activity relationships and the practical design and synthesis of novel chemical compounds in the search for potential sweeteners, HIV-1 integrase inhibitors, and antiproliferative agents. His professional experience includes invited short-term fellowships (DAAD and Konferenz der deutschen Akademien der Wissenschaften) in Labor für Computer Chemie, Organisch-Chemisches Institut, Technische Universität München (1993) and Computer-Chemie-Centrum, Institut für Organische Chemie, Universität Erlangen-Nürnberg (1996, 1997–1998), Germany, and a visiting position in Laboratoire de Biotechnologies et Pharmacogénétique Appliquée, École Normale Supérieure de Cachan, France (2001, 2002). Professor Polanski was the Vice Dean of the Faculty of Mathematics, Physics and Chemistry (1999–2005) and has been the Deputy Head of the Institute of Chemistry of the University of Silesia since 2005.