Logic-based approach to expert systems in chemistry

Tatsuya Akutsu, Einoshin Suzuki* and Setsuo Ohsuga*

A logic-based approach to expert systems for chemistry is discussed. In chemistry, most knowledge is expressed with chemical structural formulas. However, it is difficult to handle chemical structural formulas with currently available logic programming languages. A logic programming language has therefore been extended to include chemical structural formulas. This extended language/system, named Chaus, is presented in this paper. With the Chaus language, chemical knowledge is expressed in a natural way. The implementation, which ensures practical efficiency, and an application to drug design are also discussed.

Keywords: logic-based approach, chemical compound design, database systems

Expert systems for chemistry have been widely developed since Dendral was first tried out 1. Particularly important among them are the expert systems for chemical compound design. However, it is difficult to represent and utilize the knowledge of chemical synthesis and heuristics on currently available systems. Therefore, a system is required with which chemists can represent and utilize their knowledge in the most natural and easy way. Since most chemical knowledge, especially in organic chemistry, is expressed with chemical structural formulas 2, the system must be able to represent chemical structural formulas. Moreover, chemical knowledge is mostly heuristic and fragmentary, and it is difficult to gather fragmentary knowledge and arrange it into a procedural form. Thus knowledge should be expressed in a declarative form. For this purpose, a logic-based approach to expert systems in chemistry is presented in this paper. Generally, logic-based systems make it possible to
Mechanical Engineering Laboratory, Namiki 1-2, Tsukuba, Ibaraki 305, Japan *Research Centre for Advanced Science and Technology, University of Tokyo, Komaba 4-6-1, Meguro-ku, Tokyo 153, Japan Paper received 6 August 1990. Accepted 3 October 1990
Vol 4 No 2 June 1991
represent knowledge in a declarative form. However, a given chemical structural formula cannot be represented directly by a logic formula. It is therefore necessary to extend logic in order to represent chemical structural formulas in the most natural form. For that purpose, we have designed and developed a knowledge-based system named Chaus (chemical knowledge acquisition and utilization system). Chaus is designed as a general-purpose tool for building expert systems in the domain of organic chemistry. Chaus is an extension of Kaus (knowledge acquisition and utilization system), which has been developed by our group for more than ten years 3,4. Kaus is a system that allows knowledge to be expressed with logic formulas. It allows the expression of meta-level knowledge as well.

The organization of the paper is as follows. Firstly, previous applications in the domain of database systems and expert systems in chemistry are briefly surveyed. Next, knowledge representation in the domain of chemistry is discussed and the features of Kaus are briefly presented. The features of Chaus and its implementation are then described. Lastly, examples of ongoing applications of Chaus are given and conclusions are discussed.
PREVIOUS RESEARCH

In this section, we briefly survey previous research on expert systems and database systems for chemistry. Although only a few chemists use expert systems, database systems such as the CAS (Chemical Abstracts Service) database are widely used 5. This is because more than 10 million chemical compounds have been discovered. The use of a database is therefore a necessity for chemists who want to synthesize new chemical compounds. Information such as the registration number, chemical structural formula, stereochemical information, references, chemical activity and date of registration is registered in the CAS database. One of the most important features of the CAS database is that a unique name is given to each chemical structure. When a chemical compound is to be registered, the system has to check whether the same compound has already been registered. Thus, giving each chemical structure a unique and invariant name (number) is quite convenient. This is the purpose for which the Morgan
0950-7051/91/020103-14 © 1991 Butterworth-Heinemann Ltd
algorithm has been developed and used in the CAS database 6. Substructure search is another important feature of the CAS database. The system searches for chemical compounds that include substructures specified by the users. Many algorithms have been developed for this purpose. Most other database systems, such as Darc and Maccs, have a substructure search function as well.

Dendral is famous because it demonstrated the power of expert systems for the first time 1. It is a system for structure elucidation from mass spectra. Several systems have been made for the same purpose, such as Congen 7, Chemics 8 and Case 9. They use similar methods, which can be outlined as follows.
• Step 1. Using heuristic rules, the system decides from the given spectra what fragments (substructures such as a benzene ring) must be (or may be, or must not be) included in the target compounds.
• Step 2. The system combines fragments into a complete structure and checks whether it is consistent with the given spectra.
Step 2 is the core part of structure elucidation systems. It is an exhaustive algorithm that generates structures from fragmentary substructures. The algorithms vary with the systems. However, these systems are special-purpose systems and it seems difficult to apply their methods to other applications such as chemical synthesis.

Secs 10 is a system for chemical synthesis, as are Lhasa and Synchem 1. The purpose of these systems is to find a synthesis path to obtain a given chemical structure. The methods they use are similar and are outlined as follows:
• Step 1. Inputting the chemical structures to be synthesized.
• Step 2. Applying chemical reaction formulas to the target structure inversely and obtaining the precursor structures.
• Step 3. Repeating step 2 until the precursor structures are compounds which can be obtained.
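The three synthesis-planning steps above can be illustrated by a small backward search. This is only a sketch under simplifying assumptions: compounds are plain strings rather than chemical structures, and the reaction table `REACTIONS`, the set `AVAILABLE` and the function `find_path` are hypothetical names that do not come from Secs, Lhasa or Synchem.

```python
from typing import Dict, List, Optional, Set, Tuple

Step = Tuple[str, Tuple[str, ...]]

# Hypothetical reaction table: product -> alternative sets of precursors.
REACTIONS: Dict[str, List[Tuple[str, ...]]] = {
    "target": [("intermediate_a", "reagent_x")],
    "intermediate_a": [("stock_1", "stock_2"), ("intermediate_b",)],
}
# Hypothetical set of compounds that can be obtained directly.
AVAILABLE: Set[str] = {"reagent_x", "stock_1", "stock_2"}


def find_path(compound: str,
              reactions: Dict[str, List[Tuple[str, ...]]],
              available: Set[str],
              depth: int = 5) -> Optional[List[Step]]:
    """Return the list of (product, precursors) steps, or None if none is found."""
    if compound in available:            # Step 3: stop at obtainable compounds
        return []
    if depth == 0:                       # crude limit on the search space
        return None
    for precursors in reactions.get(compound, []):   # Step 2: apply a rule inversely
        steps: List[Step] = [(compound, precursors)]
        for precursor in precursors:
            sub = find_path(precursor, reactions, available, depth - 1)
            if sub is None:
                break                    # this rule fails, try the next one
            steps.extend(sub)
        else:
            return steps
    return None


if __name__ == "__main__":
    print(find_path("target", REACTIONS, AVAILABLE))   # Step 1: give the target
```

The depth limit is the simplest possible way to bound the search; the strategies used by the actual systems are more elaborate.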
The following problems must be faced:
• How to describe the scope or conditions under which a chemical reaction formula may be applied?
• How to limit the search space for a synthesis path?
Special-purpose languages such as Alchem for Secs and Chemtrn for Lhasa were developed in order to solve practical problems 1. In Alchem, the knowledge is described in an if-then form. In order to limit the search space, Lhasa provides several predefined strategies from which the users make their selection. In Synchem, a priority coefficient is computed for each precursor structure and the structure with the highest coefficient is searched first. Although these systems and their methods are close to our purpose, they are still inadequate because of their limited ability to describe complex conditions on chemical structures. Search strategies are limited too. This is why more powerful methods of describing knowledge related to chemical structures and search strategies are required.
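The idea of searching the precursor with the highest priority coefficient first corresponds to a best-first search. The following is a generic sketch of that idea, not Synchem's actual procedure; the function `best_first` and its parameters `expand`, `priority` and `is_goal` are hypothetical.

```python
import heapq
from typing import Callable, Hashable, Iterable, Optional


def best_first(start: Hashable,
               expand: Callable[[Hashable], Iterable[Hashable]],
               priority: Callable[[Hashable], float],
               is_goal: Callable[[Hashable], bool],
               limit: int = 1000) -> Optional[Hashable]:
    """Visit candidates in decreasing priority; return the first goal reached."""
    counter = 0                       # tie-breaker, so nodes are never compared
    heap = [(-priority(start), counter, start)]
    seen = {start}
    while heap and limit > 0:
        _, _, node = heapq.heappop(heap)   # highest priority first
        limit -= 1
        if is_goal(node):
            return node
        for nxt in expand(node):
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(heap, (-priority(nxt), counter, nxt))
    return None


if __name__ == "__main__":
    graph = {"t": ["a", "b"], "a": ["g1"], "b": ["g2"]}
    score = {"t": 0, "a": 1, "b": 5, "g1": 9, "g2": 2}
    found = best_first("t", lambda n: graph.get(n, []), score.__getitem__,
                       lambda n: n.startswith("g"))
    print(found)
```

Because only already generated candidates can be compared, the search follows "b" before "a" here even though a goal with a higher score lies behind "a".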
KNOWLEDGE REPRESENTATION IN CHEMISTRY

To develop expert systems, it is important to discuss the knowledge representation of the application domain. A production rule (if ... then ...) or a definite clause (A1, A2, ..., An → B) is used as a form of knowledge representation in most systems. They are very simple and it seems reasonable to use one of them at present. But production rules and definite clauses are knowledge representations for making inferences. The way of representing object models in application domains is another problem. The focus now will be on representing object models in chemical applications and describing what is required for processing chemical knowledge.

Chemical structure

The choice of an appropriate representation of the object models depends on the application domain. The model can be, for instance, block diagrams in control theory, mathematical formulas in mathematics and solid models in mechanics. In chemistry, the most important object is the chemical compound. A chemical compound is usually represented by a chemical structural formula. Many chemical properties are associated with a substructure of a chemical structure. Chemical reaction formulas, which describe the transformation of chemical structures through chemical reactions, are used as well. Of course, a chemical reaction formula is not adequate to describe a chemical reaction completely. Conditions such as temperature, concentration, pressure and the kind of catalyst must be known as well. However, most of these conditions can be described by numerical values and symbols, so that they can be easily handled by conventional languages. Thus it is not difficult to combine a chemical structure and such conditions once the chemical structure has been described.
A chemical structure is usually described as a graph in which the following correspondence holds:
atom ⇔ node (vertex)
chemical bond ⇔ edge
However, this is still inadequate because it lacks geometric information. Some chemical structures have the same topological (graph) structure but are considered to be different from one another. Stereoisomers are good examples. They have the same topological structure, but different geometric structures. Some examples of stereoisomers are shown in Figure 1. Sometimes it is critical to distinguish stereoisomers from one another. It might seem easy to handle stereoisomers by giving three-dimensional coordinates to atoms, but this does not work since atoms are not static but rotating. Hence, another method is required. Fortunately, chemists have studied stereochemistry extensively and several methods have been developed 10, the results of which can be utilized. We will return to this question in a later section.
Required functions

It is important to make clear the limits of the
Knowledge-Based Systems
Figure 1. Examples of stereoisomers (a wedge-shaped bond from X to Y denotes that Y is in front of X)
application domain when designing knowledge-based systems. The aim of our research is to make a knowledge-based system with which chemists can express and utilize their knowledge in a natural way. Users are assumed to be chemists, not computer scientists or knowledge engineers. The most important task is to develop a knowledge-based system for chemical synthesis or chemical compound design. However, it is difficult to develop a system that applies to every domain of chemistry. Therefore the application domain is restricted as indicated below:
• Organic chemistry, not including inorganic chemistry. Chemical structural formulas play an important role in organic chemistry. This is much less the case in inorganic chemistry.
• The number of atoms of a chemical structure must be limited to less than 300. This limitation is necessary because it is difficult to handle a large chemical structure efficiently. For example, no efficient algorithm is known for substructure matching. Despite this serious restriction, there is a wide enough range of application domains.
In order to achieve a proper knowledge-based system, the system should meet the following requirements.
• Chemical structural formulas can be represented in a natural way. As mentioned above, chemical structural formulas play an important role. Chemical structures should be represented as easily as chemical structural formulas, that is to say, they should be represented in a graphical form. Moreover, the transformation rules between chemical structures should be represented as easily as chemical reaction formulas. The structure-substructure relation should also be represented. It is preferable that stereochemical information can be represented as well.
• Knowledge can be represented in a declarative form. Because most chemical knowledge is heuristic and fragmentary, it is difficult to express it in a procedural form. Moreover, since users are assumed to be chemists, it is difficult and wasteful to transform heuristic knowledge into a procedural form. Thus the system is required to combine heuristic knowledge automatically and solve the given problem. However, it must also be possible to describe knowledge in a procedural form, since systematic methods are also used in chemistry.
• Chemical structures can be processed efficiently. The processing of chemical structures, such as substructure matching and the transformation of structures, should not be the bottleneck of the inference process. Moreover, the system should be able to handle a large number of chemical structures efficiently, since many chemical structures must be handled in chemical applications.

To develop a system meeting the above requirements, a logic-based system and a production system are considered. Since logic-based systems have theoretical foundations, they seem to be better than production systems. Most logic-based systems are based on first-order logic 11. Since soundness and completeness theorems hold in first-order logic, it is possible to check the consistency of knowledge. Moreover, the solution inferred by the system is guaranteed to be correct if the knowledge is consistent and correct. On the other hand, it seems difficult for a production system to check consistency since it does not have theoretical foundations. Of course, it is difficult to check consistency completely in first-order logic owing to computational complexity. However, simple checks can be performed. Logic-based systems have also shown a good capacity to handle meta-level knowledge. Several research works have been carried out in the domain of meta-level inference based on logic. Chemists often use meta-level knowledge such as 'if the compound X shows the property P, use the transformation rule which decreases the value of Q'. For these reasons we have used Kaus, a logic-based system which already has many of the desirable features mentioned above. The next section gives an overview of Kaus.
OVERVIEW OF KAUS

As mentioned before, Chaus is designed and implemented on the basis of Kaus. Kaus is a knowledge-based system based on a multilayer logic (MLL for short). Kaus is designed for solving practical problems, especially for making CAD (computer-aided design) systems in various fields such as mechanical CAD, CAD for chemical compound design and CAD for software design. Kaus and MLL have the following features:
• MLL is an extension of first-order predicate logic in which various data structures can be expressed.
• Declarative knowledge and procedure are integrated in Kaus. • The interface with conventional relational database and non-normal form database (it can be seen as an object oriented database) is provided with Kaus. • Kaus has a meta-level control mechanism. In this section, the important features of Kaus and MLL are briefly presented. For more information concerning Kaus and MLL, see references 3 and 4.
Multilayer logic

MLL is an extension of first-order logic and has a general data structure. MLL can be considered as:

Predicate logic + data structure.

A data structure is defined as a set in the sense of ZF (Zermelo-Fraenkel) set theory. The ZF set theory has been slightly modified and a set of primitive relations including 'element-of', 'component-of', 'power-set-of', 'product-set-of', 'union-of', 'intersection-of' and 'pair-of' is provided for MLL. The syntax of first-order logic is expanded so that the domain of every variable appears explicitly in the prefix. For example, 'Man is mortal' is represented as

[∀X](man(X) → mortal(X))

in first-order logic, where symbols beginning with a capital letter denote variables in this paper. On the other hand, it is represented in MLL as

[∀X/man](mortal(X))

where 'man' denotes the set of men. In this example, MLL is similar to many-sorted logic (MSL), which is an important branch of first-order logic 12. However, domains in MLL are allowed to be any set defined as above, instead of the fixed sets of MSL. Moreover, domains may be specified by variables which are already quantified as elements of some set. The following formula is an example of an MLL expression. Let 'nat' be the set of non-negative integers. Then,

[∀X/*nat][∀M/X](max(X, M) ← [∀Y/X](M >= Y))

gives a definition of the 'max' predicate, where max(X, M) means that M is the maximum integer of a set X. Here, '*nat' denotes the power set of 'nat' except the empty set. So, X must be a set of non-negative integers.

The inference algorithm is an SLD-resolution-like algorithm adapted to this extension. The core part of the modification concerns unification. In MLL, checking mechanisms, for example for set equivalence and set inclusion, have been added to the unification algorithm. Kaus has a set of built-in predicates called procedural type atoms (PTAs for short). An ordinary atom is called a non-procedural type atom (NTA for short). By means of PTAs, declarative knowledge and procedures are integrated. The interface with databases is also provided in the form of PTAs.

Meta-level control

Representing a data structure and the set of rules that goes along with it is not enough to solve practical problems automatically and efficiently. The representation of the design process and its control strategies is also required. We call the former object-level knowledge and the latter meta-level knowledge. In chemical compound design, the following is an example of object-level knowledge.

Primary alcohol is made by reducing aldehyde (R-CHO → R-CH2-OH)

On the other hand, here is an example of meta-level knowledge.

To get a more strongly antihypertensive compound, apply the reaction formula which has the effect of decreasing the value of log P.
Kaus is provided with a mechanism to handle meta-level knowledge. In Kaus, rules (MLL formulas) are indexed and partitioned into several sets. A set of rules is called a 'world' (it is a local world). Kaus has several PTAs which specify and change the worlds used during the inference process. Moreover, a meta-level inference process can get information from the object level. The following is a typical example of a meta-control PTA.

resolve(WFF,KB,Bool,Ans)
WFF: logic formula to be resolved
KB: world (a set of rules)
Bool: True or False (indicates whether WFF is inferred from KB or not)
Ans: answer (bindings for variables in WFF)
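The roles of the four arguments of 'resolve' can be illustrated by a much simplified sketch. It assumes that a world contains only ground facts (no rules, no typed quantifiers), that a fact is a tuple of strings, and that a string starting with a capital letter is a variable; the names `WORLDS`, `match` and the Python return convention are hypothetical and are not the Kaus implementation.

```python
def is_var(term):
    """Variables begin with a capital letter, as in the paper's notation."""
    return isinstance(term, str) and term[:1].isupper()


def match(pattern, fact):
    """Return variable bindings if `pattern` matches the ground `fact`, else None."""
    if len(pattern) != len(fact):
        return None
    bindings = {}
    for p, f in zip(pattern, fact):
        if is_var(p):
            if bindings.setdefault(p, f) != f:   # same variable, different values
                return None
        elif p != f:
            return None
    return bindings


def resolve(wff, kb, worlds):
    """Return (Bool, Ans): whether `wff` holds in world `kb`, and the bindings found."""
    answers = []
    for fact in sorted(worlds[kb]):
        bindings = match(wff, fact)
        if bindings is not None:
            answers.append(bindings)
    return bool(answers), answers


# Hypothetical worlds: one for object-level facts, one for meta-level facts.
WORLDS = {
    "object": {("aldehyde", "ethanal"), ("alcohol", "ethanol")},
    "meta": {("prefer", "decrease_logp")},
}

if __name__ == "__main__":
    print(resolve(("aldehyde", "X"), "object", WORLDS))
    print(resolve(("aldehyde", "X"), "meta", WORLDS))
```

The same query succeeds in one world and fails in another, which is the property a meta-level rule can exploit when it selects the world to be used.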
EXTENSION OF THE LANGUAGE

Kaus is a general-purpose knowledge-based system. Since its components make it possible to express any data structure, chemical structures can be expressed in Kaus. However, chemists would find it difficult to express chemical structures by using the basic components of Kaus. Moreover, chemical structures expressed as a data structure in Kaus cannot be processed efficiently in this form. Therefore, we have extended Kaus so as to include chemical structures as basic objects. Many PTAs were added to handle chemical structures. In this section, we outline the features of Chaus, which is an extension of Kaus for chemical applications.
Chemical structure as a basic object

To extend Kaus so as to include chemical structures, we used the method of object-oriented programming. The most important feature of object-oriented programming is data abstraction. Users or application programmers do not need to know how objects (chemical structures in our application) are implemented or what the data structure of the objects is. They only have to know the message protocols between objects. Thus chemical structures were added to Kaus as follows:
• A chemical structure is considered as a basic object. In Chaus, a chemical structure is treated in the same way as numbers and symbols are. It cannot be decomposed into basic objects. Though it is implemented by some data structure, this data structure cannot be seen by the users or the application programmers. Indeed, a chemical structure is implemented as a pointer at the MLL level. On the other hand, at the primitive level (C language level) it is implemented by an adjacency list which represents the graph structure.
• The node (atom) of a chemical structure is also considered as a basic object. When a chemical structure is changed by adding or cutting a chemical bond, the nodes or the edges must be specified. For this reason, the nodes of a chemical structure are also added as basic objects. However, the edges are not added as basic objects since an edge can be specified by two nodes. For example, the information about an edge is obtained by means of the following PTA:
get-edge(CS,N1,N2,E)
where 'CS' denotes a chemical structure, 'N1' and 'N2' denote nodes of the structure and 'E' denotes the kind of bond between 'N1' and 'N2' (e.g. single bond or double bond).
• Every operation concerning chemical structures is carried out through a PTA. In object-oriented programming, operations between objects are carried out through message passing. Users only have to know the message protocols. In logic programming, basic operations are carried out through PTAs (i.e. built-in predicates). Users only have to know the specifications of the PTAs, that is to say, the name of the PTA and the meaning of its arguments associated with the operation. Several tens of PTAs are implemented for basic operations concerning chemical structures.

In usual logic programming languages, basic objects such as symbols and integers are identified if they are represented in the same way (for example, the same string for symbols or the same value for integers). For the same reason, stereochemically isomorphic chemical structures are identified in Chaus. This is the key point when a formal semantics or a logical foundation has to be given to the language. Furthermore, each node appearing in a Chaus program must belong to exactly one chemical structure. It cannot belong to more than one structure and it must belong to at least one structure.

In chemical structures, nodes must meet the condition on the degree (valence) associated with the kind of atom. For example, the carbon atom has valence 4, so that four single bonds, or a double bond and two single bonds, or ... a triple bond and a single bond, must be adjacent to it. However, the description of partial structures or substructures is also required. Therefore, every structure in which the number of bonds adjacent to each node does not exceed the valence of its atom is allowed in Chaus, even if it does not exist chemically. Moreover, a special kind of atom (resp. bond) is added. It is the 'A' atom (resp. the 'anybond' bond). The 'A' atom (resp. 'anybond' bond) is treated as an atom (resp. a bond) which matches any atom (resp. bond). It is used especially to describe a transformation rule between chemical structures. It may be seen as a logical variable. However, basically there is no difference between an 'A' atom and usual atoms. It is used and interpreted as a wildcard only in special PTAs.

How, then, can users specify chemical structures or their nodes when they are basic objects? They are usually specified by means of a graphic editor. Chemical structures are input in the form of a graph on a graphic display. Nodes are specified by clicking the mouse. Once a chemical structure is input, it is registered in the chemical structures database (described in a later section) and an ID number is given to it. Then, it can be specified by the ID number. Chemical structures registered in the database can also be specified by their name, such as 'methane', 'propane' and 'benzene'.
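The valence condition on partial structures and the wildcard role of the 'A' atom can be sketched as follows. The valence table, the encoding of bonds as pairs of node ids, and the decision to leave 'A' atoms unconstrained are assumptions made for this illustration; the paper does not say how Chaus treats the valence of an 'A' atom.

```python
# Assumed valence table for a few common atoms.
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}
WILDCARD = "A"   # the 'A' atom of Chaus, matching any atom


def valence_ok(atoms, bonds):
    """True if no node has more bond order attached than its valence allows.

    atoms: node id -> atom name; bonds: (node, node) -> bond order.
    Unsaturated (partial) structures are accepted, as in the text.
    """
    used = {node: 0 for node in atoms}
    for (n1, n2), order in bonds.items():
        used[n1] += order
        used[n2] += order
    return all(atoms[n] == WILDCARD or used[n] <= VALENCE[atoms[n]]
               for n in atoms)


def atom_matches(pattern_atom, target_atom):
    """Wildcard-aware comparison, as it might be used inside a matching PTA."""
    return pattern_atom == WILDCARD or pattern_atom == target_atom


if __name__ == "__main__":
    methyl = ({1: "C", 2: "H", 3: "H", 4: "H"},
              {(1, 2): 1, (1, 3): 1, (1, 4): 1})
    print(valence_ok(*methyl))          # a partial structure with one free valence
    print(atom_matches("A", "N"), atom_matches("C", "N"))
```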
Pattern matching and structure conversion

Pattern matching, especially substructure matching, is an essential function for chemical applications. Much knowledge is related to substructures, for example, 'if a compound contains a benzene ring, it is an aromatic compound'. For substructure matching, the following two PTAs are implemented:
subpat(PS,CS)
subpatmatch(PS,CS,MAP)
where 'PS' is a pattern structure, 'CS' is a complete chemical structure and 'MAP' is a list of pairs of nodes which shows the correspondence between nodes in 'PS' and nodes in 'CS'. If 'PS' is a subgraph (including stereochemical information) of 'CS', then the logical value of 'subpat(PS,CS)' becomes true. Otherwise, the logical value becomes false. The logical value of 'subpatmatch(PS,CS,MAP)' is evaluated in the same way as for 'subpat(PS,CS)'. However, the correspondence of the nodes is returned as a side effect when it succeeds. Furthermore, the PTA searches for another correspondence when backtracking occurs. Examples are shown in Figure 2.

Figure 2. Function of the 'subpatmatch' predicate. For subpatmatch(A-O1, H-C(-H1)(-H2)-O2-H4, Map), the first answer is Map = [(A,C),(O1,O2)]; after backtracking, Map = [(A,H4),(O1,O2)]. The superscript of a node is an index helping to distinguish the different nodes of the same atom name

The transformation of chemical structures is also an essential function. For this purpose, the following PTA is added:
convert(In,InPat,Map,OutPat,Out)
where
In: is a list of graphs
InPat: is a list of input graph subpatterns
Map: is a list of pairs of corresponding nodes
OutPat: is a list of output graph subpatterns
Out: is a list of output graphs
Since it is difficult to describe the detailed specification of the PTA, we will use some examples to make things plain:
Example 1. Primary alcohol is made by reducing aldehyde: (R-CHO → R-CH2-OH)
This transformation can be written as follows:
convert([X],[A1-CH=O],[(A1,A2)],[A2-CH2-OH],[Y])
where the superscript of a node is an index helping to distinguish the different nodes of atoms of the same name. For example, if X is bound to CH3-CH2-CH=O then Y will be bound to CH3-CH2-CH2-OH. But, if the rule is as follows:
convert([X],[A1-CH=O],[],[A2-CH2-OH],[Y])
then Y will be bound to A-CH2-OH.

Example 2. Aromatic aldehyde is made from an aromatic compound by using the Gattermann-Koch method.
This transformation rule can be written as follows:
convert([X], [six-membered ring C1 ... C6], [(C1,C7),(C2,C8),(C3,C9),(C4,C10),(C5,C11),(C6,C12)], [six-membered ring C7 ... C12 bearing a CHO group], [Y])
For example, if X is bound to a benzene ring, then Y will be bound to a benzene ring bearing a CHO group.
Note that substructure matching is implicitly used in the 'convert' predicate.
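The matching step on which 'convert' relies, and the way 'subpatmatch' returns a further correspondence on backtracking, can be sketched with a generator. This is a plain backtracking search written for illustration, not the algorithm used in Chaus: it ignores stereochemical information and the 'anybond' bond, and the tuple encoding of structures is an assumption.

```python
def subpatmatch(pattern, structure):
    """Yield every node mapping that embeds `pattern` in `structure`.

    Both arguments are (atoms, bonds) pairs: atoms maps a node id to an atom
    name ('A' matches any atom), bonds maps frozenset({n1, n2}) to a bond kind.
    """
    p_atoms, p_bonds = pattern
    s_atoms, s_bonds = structure
    order = list(p_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        p = order[len(mapping)]
        for s in s_atoms:
            if s in mapping.values():
                continue                          # each target node used once
            if p_atoms[p] != "A" and p_atoms[p] != s_atoms[s]:
                continue                          # atom names must agree
            # every pattern bond to an already mapped node must exist in the target
            if all(s_bonds.get(frozenset({s, mapping[q]})) == p_bonds[frozenset({p, q})]
                   for q in mapping if frozenset({p, q}) in p_bonds):
                mapping[p] = s
                yield from extend(mapping)
                del mapping[p]                    # backtrack

    yield from extend({})


def subpat(pattern, structure):
    """True if at least one mapping exists."""
    return next(subpatmatch(pattern, structure), None) is not None


# The example of Figure 2: pattern A-O1 against methanol.
PATTERN = ({"A": "A", "O1": "O"}, {frozenset({"A", "O1"}): 1})
METHANOL = (
    {"C": "C", "H1": "H", "H2": "H", "H3": "H", "O2": "O", "H4": "H"},
    {frozenset({"C", "H1"}): 1, frozenset({"C", "H2"}): 1,
     frozenset({"C", "H3"}): 1, frozenset({"C", "O2"}): 1,
     frozenset({"O2", "H4"}): 1},
)

if __name__ == "__main__":
    for mapping in subpatmatch(PATTERN, METHANOL):
        print(mapping)
```

Requesting the next item from the generator plays the role of backtracking into the PTA: the two mappings produced correspond to the two values of Map in Figure 2.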
Other functions

Chaus has many PTAs for handling chemical structural formulas. Some of them are given below:
• getallnodes(Structure,ListOfNode) gets a list of all the nodes in 'Structure'.
• adjacentnodes(Node,ListOfNode) gets a list of the nodes adjacent to 'Node'.
• connectnodes(Structure,Node1,Node2,Bond,NewStructure) gets a new graph by connecting 'Node1' and 'Node2' of 'Structure' with 'Bond'.
• getatomname(Node1,AtomName) gets the atom name of 'Node1'.
In addition to the PTAs which are used to manipulate chemical structures, there are PTAs for the tightly-coupled database and for the graphic editor. Some of them are given here:
• selectmol(Structure) specifies a chemical compound by using the graphic interface. The variable 'Structure' will be bound to the specified chemical compound.
• selnode(Structure,Node) displays the chemical structure 'Structure'. The node is specified by using the mouse.
• molbyname(DBID,Name,Structure) searches the chemical structures database specified by 'DBID' for a chemical structure whose molecular name is 'Name'. Details of the chemical structures database are explained in a later section.
• findsubpat(DBID,SubStructure,Structures) gets a list of the structures in the database specified by 'DBID' which contain the specified substructure.
Examples In this section, we will show simple example programs (MLL formulas), by which it can be seen that chemical knowledge is represented in a natural form.
Example 1.
[∀X/mol](is__nitro(X) ← subpat(NO2,X))
In this MLL formula, 'is__nitro(X)' means that X is a nitro compound. Note that 'NO2' is not a variable in this case, but a substructure.
Example 2.
[∀X/mol][∀Y/mol] (primary__alcohol(X) ← convert([X],[A1-CH2-OH],[(A1,A2)],[A2],[Y]), alkyl__group(Y))
[∀X/mol][∀Y/mol][∀Z/mol] (secondary__alcohol(X) ← convert([X],[A1-CH(OH)-A2],[(A1,A3),(A2,A4)],[A3,A4],[Y,Z]), alkyl__group(Y), alkyl__group(Z))
In these MLL formulas, 'primary__alcohol(X)' (resp. 'secondary__alcohol(X)') means that the molecule 'X' is a primary alcohol (resp. a secondary alcohol). 'alkyl__group(Y)' means that 'Y' belongs to the alkyl group, whose molecular formula is CnH2n+1. It can be defined as follows:
[∀X/mol][∀Cn/int][∀Hn/int][∀Size/int] (alkyl__group(X) ← getatomamt(X,"C",Cn), getatomamt(X,"H",Hn), getsize(X,Size), Hn = 2*Cn + 1, Size = Cn + Hn)
where 'getatomamt' is a PTA for getting the number of atoms of the specified atom name and 'getsize' is a PTA for getting the number of atoms in the specified structure.

IMPLEMENTATION

Chaus has been implemented on Kaus, and improvements are currently being made according to the requirements of the applications. Kaus and the C language were used for the implementation of Chaus. Kaus itself is implemented in C, and the PTAs are usually implemented in C. A Sun-3 workstation is used as the development machine. The X Window System is employed for the graphic editor. The configuration of the system is illustrated in Figure 3. Chaus consists of three parts 13,14: an inference engine, a chemical structures database and a graphic editor. The core part of the inference engine is not modified in Chaus. The greatest modification was the addition of PTAs. About 75 kinds of PTAs were added. The chemical structures database is a tightly-coupled database, in which all chemical structures are registered. The graphic editor is used for the input and the output of chemical structures.

Figure 3. Configuration of the Chaus system (components shown: inference engine with Kaus and PTAs; chemical structures database with global DB, heap DB and stack DB; graphic editor)

Data structure for chemical structure

In Chaus, the adjacency list is employed as the data structure for a chemical structure. The adjacency list is a well-known data structure for representing a graph 15. It is an economical data structure from the point of view of memory space. Furthermore, most of the efficient algorithms for graph operations are described in terms of the adjacency list. Together with the adjacency list, the data structure contains information about the chemical compound such as the molecular name, the kind of each atom, the kind of each bond, the two-dimensional coordinates of each atom and information about chemical activity. Moreover, the adjacency list was modified so as to include stereochemical information. The method is simple and similar to the one described in Reference 10. The key idea is to restrict the order of the adjacency list so that stereochemical information is encoded in it. Details are found in Reference 16.
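An adjacency-list representation of a chemical structure, with operations comparable in spirit to 'get-edge', 'adjacentnodes' and 'connectnodes', might look as follows. This is a minimal sketch in Python rather than the C data structure of Chaus; the class and method names are hypothetical, the sketch modifies the structure in place instead of returning a new one, and it stores no stereochemical information.

```python
class Structure:
    """A chemical structure as an adjacency list with atom and bond labels."""

    def __init__(self, name=""):
        self.name = name     # molecular name
        self.atom = {}       # node id -> atom name
        self.adj = {}        # node id -> {neighbour id: kind of bond}

    def add_node(self, node, atom_name):
        self.atom[node] = atom_name
        self.adj[node] = {}

    def connect(self, n1, n2, bond):
        """Add a bond between two nodes (stored in both adjacency lists)."""
        self.adj[n1][n2] = bond
        self.adj[n2][n1] = bond

    def cut(self, n1, n2):
        """Remove the bond between two nodes."""
        del self.adj[n1][n2]
        del self.adj[n2][n1]

    def get_edge(self, n1, n2):
        """Kind of bond between n1 and n2, or None if they are not bonded."""
        return self.adj[n1].get(n2)

    def adjacent_nodes(self, node):
        return list(self.adj[node])


if __name__ == "__main__":
    s = Structure("formaldehyde")
    for node, atom_name in [(1, "C"), (2, "O"), (3, "H"), (4, "H")]:
        s.add_node(node, atom_name)
    s.connect(1, 2, "double")
    s.connect(1, 3, "single")
    s.connect(1, 4, "single")
    print(s.adjacent_nodes(1), s.get_edge(1, 2))
```

Adding or cutting a bond touches only the two adjacency lists involved, which is why such basic operations are cheap in this representation.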
Tightly-coupled database

Chaus is equipped with a tightly-coupled database, called the 'chemical structures database', that contains a large number of graph representations of chemical structures. It is essential for Chaus because all the chemical structures which appear in the knowledge representation (MLL formulas) and which are generated during the inference process are registered in the database. Instead of using external memory such as disk files, all data are held in main memory during execution. The database has the following features:
• The database is partitioned into several databases.
• There are index files for fast structure search.
• The database is a tightly-coupled database.
In Prolog systems, for the purpose of efficiency, the data space is divided into several parts according to the data class. The chemical structures database is also divided into several databases for the same purpose. Usually, it is divided into three databases:
• a global database
• a heap database
• a stack database
The global database contains a large amount of permanent data. It is a database in the usual sense, containing the chemical compounds known in the application domain. The heap database contains the chemical structures which are used as substructures or patterns. It contains data in program clauses and data generated by predicates such as 'assert'. The stack database is used as a stack for the inference engine. It contains the chemical structures generated during the inference process until backtracking occurs. In addition to these databases, the user can open and use up to 20 other databases in the current implementation. For example, five databases are used in the drug design application.

For the purpose of fast structure search, the chemical structures database has the following index files:
• a Morgan name index file
• a molecular name index file

[Table 1. Time for calculating the Morgan name; first column: chemical structure]
The Morgan name is the unique and invariant number given to each chemical structure 6. When a new chemical structure is about to be input, the Morgan name is computed. Then the system checks whether an isomorphic structure (i.e. a structure which has the same Morgan name) was already registered or not. If there is no such structure, the structure and its Morgan name are registered. The Morgan name index file is a sorted file for making search for isomorphic structures faster by using the binary search. Furthermore. an appropriate data structure enabling efficient insertion and deletion was implemented 13. A molecular name is given by a user when it is input• If it is generated by the system, a simple name such as 'mol.._xxxx' is given. Since chemical structures are often referred to by their name, the molecular name index file is also sorted. The chemical structures database and more particularly, the stack database part, is tightly coupled with the inference engine. The chemical structures generated during the inference process are put in the stack database• The stack database grows when inference proceeds and shrinks when the backtracking occurs. Thus, chemical structure data are linearly added and deleted and this operation can be done efficiently by using a simple method. However, some technique had to be used to maintain index files, since index files do not grow or shrink linearly 13. Basic algorithms
Basic and simple operations for chemical structures such as adding or cutting a bond between specified nodes can be carried out efficiently with the adjacency list. In most cases, they can be done in linear time by simple algorithms I5. However, it is difficult to do the following operations efficiently: • checking whether or not two given structures are isomorphic. • substructure matching• Checking the isomorphism of two given structures is important for a chemical structure search especially when inputting new structures. This is related to the graph isomorphism problem, for which no efficient algorithm is known yet. However, efficient algorithms were developed by chemists. Although the computational complexity of these algorithms is not polynomial time, they run efficiently in most cases. Most of these algorithms are modified versions of the Morgan algorithm 6. The most important feature of the algorithm is that it computes the unique name (number) of the given chemical structure. The names (numbers) of isomorphic structures are given the same value. This is what is called the Morgan name. Thus we implemented the modified version of the Morgan algorithm. Some CPU times for the computation of the Morgan name of several chemical structures are shown in Table 1. The substructure matching too is important. When applying chemical reaction formulas, substructure 110
time (msec)
Ethanol (CH3-CH2-OH)
0.53
Benzene (C6H6)
1.01
Aniline (C6Hs-NH2)
2.61
Ethyl phenyl keton (CH 3CH 2-CO-C6H 5)
4.22
Anthracene (C14HI0)
4.32
Tryptophane (ClIO2N2H12)
7.98
C4205H38
37.22
CssOsNHn3
166.91
matching is necessary. It is related to the subgraph isomorphism problem, which is a well-known NP-complete problem. Although they do not run in polynomial time, efficient algorithms have also been developed by chemists. The set reduction algorithm and the atom-by-atom algorithm are well-known examples 17. Our algorithm is based on the atom-by-atom algorithm, combined with some features of the set reduction algorithm. A substructure screen table, which is often used in database systems for chemistry, is also employed in Chaus. It is associated with each chemical structure and shows which special substructures the chemical structure contains (see Figure 4). It is used as a pre-test for substructure matching. The effect of the substructure screen table is shown in Table 2. For substructure (a), for example, the screen reduces the search time from 30.26 sec to 0.75 sec, whereas for substructure (g) it makes almost no difference (1.81 sec against 1.82 sec). Substructure matching can be done in a reasonable time, so that it does not turn out to be the bottleneck of the inference process.

Figure 4. Substructure screen table

Table 2. Amount of time for substructure matching with 968 chemical structures

Substructure    With screen (sec)    Without screen (sec)
(a)             0.75                 30.26
(b)             0.91                 2.87
(c)             1.64                 7.91
(d)             0.14                 1.78
(e)             1.13                 4.65
(f)             0.28                 1.83
(g)             1.81                 1.82

Figure 5. Substructures used as keys and substructures used as a screen in Table 2

Examples of the graphic editor are shown in Figure 6 and Figure 7.
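The role of the substructure screen table as a pre-test can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Chaus implementation: the screen is represented as a bit mask, the list `SCREEN_KEYS` and the functions `make_screen`, `passes_screen` and `search` are hypothetical, and the expensive atom-by-atom matcher is passed in as the callback `full_match`.

```python
# Hypothetical screen substructures; each one owns one bit of the screen.
SCREEN_KEYS = ["-OH", "-CHO", "C=O", "C-N"]


def make_screen(present):
    """Build a bit mask with one bit set per screen substructure present."""
    mask = 0
    for i, key in enumerate(SCREEN_KEYS):
        if key in present:
            mask |= 1 << i
    return mask


def passes_screen(query_mask, target_mask):
    """A target can contain the query only if it has every screen
    substructure that the query has."""
    return (query_mask & ~target_mask) == 0


def search(query_mask, database, full_match):
    """Run the expensive matcher only on structures that pass the screen.

    `database` maps a molecular name to its screen mask.
    """
    hits = []
    for name, target_mask in database.items():
        if passes_screen(query_mask, target_mask) and full_match(name):
            hits.append(name)
    return hits
```

The pre-test can only reject candidates; a structure that passes the screen must still be confirmed by full substructure matching. When the query contains none of the screen substructures, every structure passes and the screen brings no saving.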
User interface

In Chaus, input and output of chemical structures are carried out with a graphic editor implemented on the X Window System. It is used for the input and output of rules and of the results of the inference process. It can also be used directly as a user interface for the chemical structures database. That is to say, the database can be accessed without the inference engine. The user interface has been designed so as to limit the use of the keyboard as much as possible. Most of the operations, except the input of molecular names and the input of file names, can be done using only the mouse. It has the following functions:

(1) Inputting and modifying chemical structures.
(2) Specifying a node of the chemical structure which is displayed.
(3) Making a complete chemical structure by automatically adding hydrogen atoms.
(4) Specifying a chemical structure by name (menu selection method).
(5) Searching for chemical structures which contain the specified substructure.
(6) Performing substructure matching and displaying the correspondences between the nodes.
(7) Specifying the correspondences between the nodes in chemical structures.
(8) Appending two chemical structures.
(9) Inputting frequently used substructures from a menu.
(10) Attractive printing of the chemical structure (by giving two-dimensional coordinates to a chemical structure which is generated during the inference process).

Note that (7) is used for the 'convert' PTA, in which the correspondences of the nodes must be specified.

ONGOING DEVELOPMENT OF APPLICATION

We are now developing a knowledge-based system for chemical compound design based on Chaus, in cooperation with chemists. The target domain is drug design for agriculture and medicine. Since it is too difficult to make a completely automatic system, it is designed to work with chemists.

Lead evolution

The basic strategy of the system is lead evolution, in which lead compounds play the most important role. A lead compound is a skeleton structure from which an effective new drug can be obtained by modifying its substructures. The process of drug design is divided into the following two phases:

Lead generation. This is the process of finding lead compounds. Since no method has been established yet, the experience and the heuristics of chemists play the most important role. Four methods are known: lead selection, lead discovery, lead generation in a narrow sense and lead evolution.

Lead optimization. This is the process of optimizing such things as activity and cost by modifying detailed substructures of lead compounds. Methods for lead optimization are much studied and used for practical problems. Current methods are based on statistics. Correlations between substructures and activities are analyzed by statistical methods and the result is applied to optimization.

In our system, we focused on lead generation, and more particularly on the lead evolution method. In lead evolution, some compound is initially selected. Next, a series of drastic modifications of structures is made in order to get skeleton structures that have the target activity. However, since no quantitative method is known, the modifications rely on the experience and the heuristics of chemists. Therefore, the application of knowledge-based systems to lead evolution seems to be effective. The flow of the drug design process with the knowledge-based system is as follows:

(1) Inputting an initial lead compound.
(2) Modifying the initial compound by using meta-level control.
(3) Analysing and evaluating the modified compounds.
(4) If some modified compounds display the required properties, outputting them as lead compounds.
(5) Checking the output lead compounds (by chemists). If compounds meeting the requirements are found, then the process has succeeded. Otherwise, return to (1).

Among the above steps, (2), (3) and (4) are processed automatically by the system. This process is illustrated in Figure 8.
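The five-step flow above can be sketched as a small control loop. This is only a minimal Python illustration under stated assumptions, not the system itself: the function names `propose_leads` and `design_session`, the callbacks `select_rules`, `apply_rule`, `evaluate` and `chemist_accepts`, and the numeric `threshold` are all hypothetical stand-ins for the meta-level control, the transformation rules, the analysis and evaluation step and the check by chemists.

```python
def propose_leads(initial, select_rules, apply_rule, evaluate, threshold):
    """Steps (2)-(4): the part processed automatically by the system."""
    candidates = []
    # Step (2): meta-level control restricts the rules that are tried.
    for rule in select_rules(initial):
        modified = apply_rule(rule, initial)
        if modified is None:          # the rule did not apply
            continue
        # Step (3): analysis and evaluation of the modified compound.
        if evaluate(modified) >= threshold:
            candidates.append(modified)   # step (4): output as a lead
    return candidates


def design_session(initial_compounds, chemist_accepts,
                   select_rules, apply_rule, evaluate, threshold):
    """Steps (1) and (5): chemists supply compounds and check the output."""
    for initial in initial_compounds:                       # step (1)
        leads = propose_leads(initial, select_rules, apply_rule,
                              evaluate, threshold)
        accepted = [c for c in leads if chemist_accepts(c)]  # step (5)
        if accepted:
            return accepted
    return []   # no requirement met; the chemist starts again from (1)
```

Keeping the rule selection as a separate callback reflects the point that applying every transformation rule is infeasible; only the rules chosen for the current compound are applied.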
Vol 4 No 2 June 1991
Figure 6. Snapshot of the graphic editor (1). Specifying the correspondence between the nodes in chemical structures
Modification of chemical compounds

If all the transformation rules (chemical reaction formulae) are applied, a combinatorial explosion results. It is necessary to restrict the number of rules according to the current lead compounds and target properties. For this purpose, the meta-level control function of Kaus is utilized. For example, the following knowledge:

In the domain of anti-hyperpiesia drugs, the activity would increase if a transformation rule which decreases the value of log P is applied to the structures containing the substructure shown in Figure 9.

can be expressed in Chaus as:

[∀ Mol/anti_hyper_mol][∀ Mols/*mol]
(modify_mol(Mol,Mols) <- subpat(mol(1,0),Mol),
resolve([∃ NewMol?/mol] apply_rules(Mol,NewMol), log P-, true, Mols))

where 'Mol' denotes an initial lead compound, 'Mols' denotes a set of transformed chemical compounds, 'log P-' denotes a set of transformation rules which decrease the value of log P, and 'mol(1,0)' denotes the substructure shown in Figure 9. The predicate 'apply_rules' denotes that 'NewMol' is obtained by applying some transformation rule to 'Mol'.

In addition to the meta-level knowledge, it is useful to take previous examples into account. This process is often used by chemists whether they are
Figure 7. Snapshot of the graphic editor (2). Searching chemical structures which contain the specified substructure
conscious of it or not. For this purpose, a special PTA for finding analogies among chemical structures has been implemented. The system searches for analogous examples of drug design which succeeded in the past and applies the same transformation rule to the current structure. The system is now being developed 18. At present, 142 transformation rules and 40 meta rules have been input. A part of the system has been successfully tested and facilities for practical use are available. However, several extensions, such as analysing and evaluating the electrochemical properties of chemical structures, are required for practical use. The integration with analysis and evaluation will be achieved in the near future. Since many topics remain to be discussed about drug design, further details will be presented in other papers.

Figure 8. Flow of drug design process (boxes: Lead Compound, Modification, Analysis and Evaluation, Selection of Compounds; supported by KB + DB)

Figure 9. Substructure used in some meta rule. 'R' denotes an alkyl group

CONCLUSIONS

In this paper, a logic-based approach to building expert systems for chemistry was presented. The Chaus system was designed and developed with this type of approach in view. Chaus is designed to help chemists represent and utilize their knowledge. For this purpose, the following requirements were considered:

(1) Chemical structural formulas can be represented in a natural way.
(2) Knowledge can be represented in a declarative form.
(3) Chemical structures can be processed efficiently.

Most of these requirements are met by Chaus. For requirement (1), the chemical structure is added as a basic object to the data set of Kaus. Several PTAs are implemented for manipulating chemical structures. Among them, the 'convert' predicate is the most important one, since it allows a transformation rule corresponding to a chemical reaction formula to be described in a natural way and processed efficiently. The graphic editor was also developed in order to input and output chemical structures in a visual form.

For requirement (2), Chaus was designed and developed based on Kaus. Kaus is a logic-based system, in which knowledge is expressed in the form of logical formulas. Furthermore, meta-level knowledge can be expressed in Kaus.

For requirement (3), the chemical structures database, which is tightly coupled with the inference engine, was developed. The adjacency list representing the graph structure was modified and implemented so as to represent stereochemical information. With this data structure, most basic operations concerning chemical structures can be done efficiently. For structure and substructure search, the Morgan algorithm and the atom-by-atom algorithm were implemented.

By means of the above functions, knowledge about chemical structures can be described in a natural form. Inference about chemical structures can also be done efficiently. To apply Chaus to practical problems, a drug design system has been developed on Chaus in cooperation with chemists. The functions for chemical structures and meta inference are effectively used in it. Though some extensions are needed, Chaus allows practical applications in chemistry.

ACKNOWLEDGEMENT

This work was carried out as part of the project 'knowledge-based system for chemical compound design' sponsored by the Ministry of Science and Technology in Japan. We wish to thank all the members of the project for the fruitful discussions we had and for their constructive suggestions. In particular, we wish to thank Professor Sasaki and Dr Funatsu at Toyohashi University of Technology and Science for their suggestions about chemical information processing and for permitting us to use their program for attractive printing of chemical structures. We are also grateful to Mr Shiono of Fujitsu Keiyo Systems Engineering, who implemented part of the graphic editor, and to Ms Roy for her help.
Figure 10. Example of the system for chemical compound design

REFERENCES

1 Barr, A and Feigenbaum, E A The Handbook of Artificial Intelligence William Kaufmann Inc., USA (1980)
2 Morrison, R T and Boyd, R N Organic Chemistry Allyn and Bacon, USA (1966)
3 Ohsuga, S 'Framework of Knowledge Based Systems - Multiple Meta-Level Architecture for Representing Problems and Problem Solving Process' Knowledge-Based Systems Vol 3 No 4 (December 1990)
4 Ohsuga, S and Yamauchi, H 'Multi-Layer Logic - A Predicate Logic Including Data Structure as Knowledge Representation Language' New Generation Computing Vol 4 (1985) pp 403-439
5 Stobaugh, R E 'Chemical Substructure Searching' J. Chemical Information and Computer Science Vol 25 (1985) pp 271-275
6 Morgan, H L 'The Generation of a Unique Machine Description for Chemical Structures - A Technique Developed at Chemical Abstracts Service' J. Chemical Documentation Vol 5 (1965) pp 107-113
7 Masinter, L M, Sridharan, N S, Lederberg, J, and Smith, D H 'Applications of Artificial Intelligence for Chemical Inference XII - Exhaustive Generation of Cyclic and Acyclic Isomers' J. American Chemical Society Vol 96 (1974) pp 7702-7714
8 Sasaki, S et al. 'CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds' J. Chemical Information and Computer Science Vol 18 No 4 (1978) pp 211-222
9 Shelley, C A and Munk, M E 'Computer Perception of Topological Symmetry' J. Chemical Information and Computer Science Vol 17 No 2 (1977)
10 Wipke, W T and Dyott, T M 'Simulation and Evaluation of Chemical Synthesis - Computer Representation and Manipulation of Stereochemistry' J. American Chemical Society Vol 96 No 15 (1974) pp 4825-4834
11 Lloyd, J W Foundations of Logic Programming Springer-Verlag, FRG (1984)
12 Enderton, H B Mathematical Introduction to Logic Academic Press, USA (1972)
13 Akutsu, T Study on the logic programming language and the database system for chemical information processing (in Japanese) Doctor Thesis in Department of Information Engineering in University of Tokyo, Japan (1989)
14 Akutsu, T and Ohsuga, S 'CHEMILOG - A Logic Programming Language/System for Chemical Information Processing' Proc. Fifth Generation Computer Systems (1988) pp 1176-1183
15 Aho, A V, Hopcroft, J E, and Ullman, J D The Design and Analysis of Computer Algorithms Addison-Wesley, USA (1974)
16 Akutsu, T 'Methods for handling of stereochemical information in expert system for chemistry (in Japanese)' Proc. 4th Annual Meeting of Japan Artificial Intelligence Society Japan (1990)
17 Randić, M and Wilkins, C L 'Graph-Based Fragment Searches in Polycyclic Structures' J. Chemical Information and Computer Science Vol 19 No 2 (1979) pp 22-31
18 Suzuki, E Drug Design Aiding System based on Meta Level Control (in Japanese) Master Thesis in Department of Aeronautics in University of Tokyo, Japan (1990)
Knowledge-Based Systems