Computers Chem. Vol. 18, No. 2, pp. 189-193, 1994 Pergamon
Copyright © 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved 0097-8485/94 $7.00 + 0.00
0097-8485(93)EMM-s
APPLICATION NOTE
CODING OF CHEMICAL STRUCTURES BASED ON A LINE NOTATION
STOYAN KARABUNARLIEV, JULIAN IVANOV and OVANES MEKENYAN*
Bourgas Technological University, 8010 Bourgas, Bulgaria
(Received 10 February 1993; in revised form 14 December 1993)
Abstract-An algorithm for coding of chemical structures is proposed based on a chemistry oriented line notation language. The latter is based on simple rules providing an almost convention free specification of molecular connectivity. A very useful feature of the proposed molecular code is that it has a line notation form, i.e. it can be interpreted according to the line notation language rules. Both the line notation language and molecular code are based on the principle of decomposition of the molecular graph into biconnected components (cyclic fragments or single atoms). The decomposition graph is a tree, each vertex of which stands for a biconnected component. Within the coding algorithm first the codes for each biconnected component are formed and then they are used as vertex labels of the decomposition tree. Since large chemical graphs usually consist of several biconnected components this method improves, to a great extent, the average time complexity of the algorithm. Terminal cyclic radicals and chain fragments of the molecular graph appear as unique substrings in the line notation code which enhances their computer perception.
1. INTRODUCTION
In the design of a chemical information system two general problems always have to be solved. First, the system should have a chemistry-oriented interface for molecular structure input. It can be either a graphic editor or a line notation form; the latter is discussed here. Second, the information system must be able to identify molecules on the basis of their structure. Given a molecule, the system must derive a unique code for the molecule, so that the code can be looked up in a library and the data about the molecule located. A number of line notations (Wiswesser, 1968; Klopman & McGonigal, 1981) and codes (Morgan, 1965; Corneil & Gotlieb, 1970; Wipke & Dyott, 1974; Read, 1983) for molecules have been proposed and used, which suggests the importance and the complexity of both problems. A molecular code having a line notation form is proposed in the present paper. In fact, this code can be regarded as a canonical line notation for a given molecule. Thus, the molecular codes stored in the library are not only used for cataloguing purposes, but can be directly looked up by the chemist and easily interpreted according to the line notation rules.
*Author for correspondence.
2. THE LINE NOTATION
The present line notation is a variation on that proposed by Read (1983). It is obtained on the basis of an oriented spanning tree of the chemical graph, constructed by the depth first search (DFS) procedure (Tarjan, 1972). The line notation of the chemical graph is formed by the sequence of atom labels according to the ordering they have in the DFS tree. Between atom labels stand the symbols denoting the type of the chemical bond. The ring-closure edges are included in the line notation as the number of the preceding vertex with the bond type symbol in front of it. If a vertex has several descending DFS subtrees, the line notations corresponding to those subtrees are enclosed in pointed brackets, thus denoting that the vertex is adjacent to several atoms in direct ordering (Fig. 1).
The main difference from the Read line notation is that the present one permits DFS trees to be constructed, and vertex numbers to be attributed, only for certain fragments. Let us call the edge-cut set of a graph the set of all edges which do not belong to a cycle. When we decompose the graph with respect to its edge-cut set, we obtain several disjoint components which are either single vertices or cyclic fragments where each vertex belongs to at least one cycle ("non-trivial blocks" according to graph-theoretical terminology). We call them cyclic components of the graph.
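The DFS-based construction described in this section can be illustrated with a minimal Python sketch. It is a simplification and not the authors' interpreter: the function name `line_notation`, the use of "." as the only bond symbol, and the choice to write ring closures directly after the atom label and to put the bond symbol inside the pointed brackets are all assumptions made for illustration.

```python
from collections import defaultdict

def line_notation(labels, bonds, start):
    """Write a simplified DFS-based line notation (illustrative only).

    labels: dict vertex -> atom label
    bonds:  list of (u, v, symbol) triples; the symbols are assumed
    """
    adj = defaultdict(list)
    for u, v, sym in bonds:
        adj[u].append((v, sym))
        adj[v].append((u, sym))

    number = {}  # DFS visiting number of each vertex, starting at 1

    def visit(v, parent):
        number[v] = len(number) + 1
        text = labels[v]
        subtrees = []
        for w, sym in adj[v]:
            if w == parent:
                continue
            if w not in number:
                # tree edge: descend and collect the subtree notation
                subtrees.append(sym + visit(w, v))
            elif number[w] < number[v]:
                # ring-closure edge back to an earlier vertex
                text += sym + str(number[w])
            # otherwise the closure was already written from the other end
        if len(subtrees) == 1:
            return text + subtrees[0]
        # several descending subtrees: enclose each in pointed brackets
        return text + "".join("<" + s + ">" for s in subtrees)

    return visit(start, None)

# A six-membered ring and a branched chain (hypothetical inputs).
ring = [(i, (i + 1) % 6, ".") for i in range(6)]
ring_labels = {i: "CH" for i in range(6)}
branched = [(0, 1, "."), (0, 2, "."), (0, 3, ".")]
branched_labels = {0: "CH", 1: "CH3", 2: "CH3", 3: "CH3"}
print(line_notation(ring_labels, ring, 0))
print(line_notation(branched_labels, branched, 1))
```

In the ring, the last visited atom closes back to vertex number 1; in the branched chain, the two subtrees below the central atom each receive their own pair of pointed brackets.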
Fig. 1. An oriented spanning tree and the corresponding line notation constructed by using pointed brackets.
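The decomposition with respect to the edge-cut set defined above amounts to deleting every edge that lies on no cycle (a bridge) and collecting what remains. The following Python sketch uses the standard DFS low-point technique to find bridges; the function name `cyclic_components` and the integer vertex numbering are assumptions for illustration, and the paper does not state which procedure the authors used.

```python
from collections import defaultdict

def cyclic_components(n, edges):
    """Split vertices 0..n-1 into components after deleting all bridges.

    Returns (components, bridges). Each component is either a single
    vertex or a fragment in which every vertex lies on a cycle.
    """
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))

    disc, low, bridges = {}, {}, set()

    def dfs(v, parent_edge):
        disc[v] = low[v] = len(disc)
        for w, i in adj[v]:
            if i == parent_edge:
                continue
            if w in disc:
                low[v] = min(low[v], disc[w])
            else:
                dfs(w, i)
                low[v] = min(low[v], low[w])
                if low[w] > disc[v]:
                    bridges.add(i)  # edge i lies on no cycle

    for v in range(n):
        if v not in disc:
            dfs(v, None)

    # Flood fill along the non-bridge edges only.
    comp = {}
    for s in range(n):
        if s in comp:
            continue
        comp[s] = s
        stack = [s]
        while stack:
            v = stack.pop()
            for w, i in adj[v]:
                if i not in bridges and w not in comp:
                    comp[w] = s
                    stack.append(w)

    groups = defaultdict(list)
    for v in range(n):
        groups[comp[v]].append(v)
    return sorted(groups.values()), sorted(edges[i] for i in bridges)

# Six-membered ring (0..5) carrying a two-atom chain 6-7 (hypothetical input).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 7)]
components, cut = cyclic_components(8, edges)
print(components, cut)
```

Here the ring survives as one cyclic component, while the two chain atoms become single-vertex components connected by the two edges of the edge-cut set.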
Ring-closure edges can appear only between vertices of the same cyclic component, so the vertex numbers denoting ring-closure edges always refer to vertices of that component. Thus DFS subtrees starting at another cyclic component may have local vertex numbers, to be used within the corresponding line notation. Such local line notations of terminal fragments appear enclosed in round brackets. If several identical terminal fragments are attached to an atom, a repeat factor after the closing bracket may be used for the sake of brevity. Finally, there are some abbreviations for atomic groups which are often encountered (Fig. 2).
3. THE LINE NOTATION CODE
The obvious way to construct a canonical representation of a molecule is to generate all possible line notations and then to take the first one with respect to some type of ordering. A backtrack approach may save some time over such a brute force method by abandoning the generation of a line notation as soon as it is known to be worse than the current best one, but the running time of such a method is still of order $(d - 1)^n$, where n is the number of vertices and d the maximum vertex degree. Unfortunately, graph coding belongs to the combinatorial problems for which no algorithm of time complexity better than exponential in the size of the input, n, is known.
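To get a feel for the size of the search space that a brute force method has to cover, the Python sketch below counts the distinct DFS visiting orders of a small graph, taken over all start vertices and all choices of the next unvisited neighbour. The function name `count_dfs_orders` and the adjacency-dictionary input are assumptions for illustration; the sketch only counts orderings and does not generate line notations.

```python
def count_dfs_orders(adj):
    """Count the DFS visiting orders of a connected graph (illustrative)."""
    n = len(adj)

    def extend(stack, visited):
        if len(visited) == n:
            return 1
        # Backtrack until the top of the stack has an unvisited neighbour.
        stack = list(stack)
        while stack and all(w in visited for w in adj[stack[-1]]):
            stack.pop()
        if not stack:
            return 0  # graph is disconnected
        top = stack[-1]
        total = 0
        for w in adj[top]:
            if w not in visited:
                total += extend(stack + [w], visited | {w})
        return total

    return sum(extend([s], {s}) for s in adj)

# Complete graph on 4 vertices, a 6-cycle and a 3-vertex path.
k4 = {v: [w for w in range(4) if w != v] for v in range(4)}
c6 = {v: [(v - 1) % 6, (v + 1) % 6] for v in range(6)}
p3 = {0: [1], 1: [0, 2], 2: [1]}
print(count_dfs_orders(k4), count_dfs_orders(c6), count_dfs_orders(p3))
```

The complete graph on four vertices already admits 4! = 24 orderings, whereas the sparse 6-cycle admits only 12, which is consistent with the remark that sparse chemical graphs are far cheaper to search than dense ones.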
Still, some improvements can be made in two aspects (Karabunarliev et al., 1984). (i) One can try to partition initially the vertices with respect to the distance metrics properties of the graph and the atom labels and then to improve this partitioning proceeding from the vertex adjacency. (ii) One can observe that backtrack search to obtain the canonical line code is necessary only within each cyclic component. The running time is then a sum of exponential terms, one per cyclic component, each with an exponent equal to the number of vertices of that component. Thus, a significant improvement could be gained on average because large molecules are usually quite sparse and contain several cyclic components.
The partitioning algorithm starts by ordering the vertices according to their distance code, i.e. the sequence of numbers of the first, second, third, etc., neighbours. Then vertices are ranked starting with the ones of smallest distance code. The initial partitioning is then improved iteratively (Balaban et al., 1985; Mekenyan et al., 1985). First the sorted lists of ranks of all neighbouring vertices are formed, and then vertices are partitioned according to the lexicographic ordering of those lists. The latter procedure is repeated until it does not produce a better partitioning.
Fig. 2. Hierarchical partitioning of a chemical graph into terminal fragments and the corresponding line notation constructed by using round brackets.
Next the chemical graph is divided into cyclic components. Representing each cyclic component as a single vertex results in a graph having no cycles, which we call the decomposition tree. A tree has a uniquely determined central vertex or edge. Each tree with more than one vertex has at least two vertices of degree 1, called leaves. Let the height h(x) of a vertex x be the length of the longest path to a leaf. A tree has either a unique vertex of smallest height or a pair of adjacent vertices of smallest height. We construct the oriented decomposition tree starting from one of the vertices of smallest height, which we call the root. The adjacent vertices are its descendants, etc., up to the leaves of the tree. Thus each cyclic component except the leaves has descendants and each cyclic component except the root has an ancestor (Fig. 3).
Fig. 3. Decomposition of a chemical graph into cyclic components and the decomposition tree.
The line code of the chemical graph starting at an arbitrary atom of the root cyclic component is obtained in two steps. (i) Suppose the current cyclic component has several descendant components in the decomposition tree. Each descendant starts at a unique atom adjacent to the current component. Construct the line codes for all descendant components by starting at the vertex which is adjacent to the current one. This results in recursive generation of line codes up to the leaves of the decomposition tree. Then extend the atom labels of the current component with the line codes of the adjacent descendants, enclosed in brackets. If an atom has several adjacent components, the corresponding label extensions are ordered by simple comparison. When equal line codes for identical terminal fragments appear, the repeat factor form is used. (ii) Once all descendant components have been coded and taken into account by extending the atom labels, we can concentrate upon the construction of the line code within the current component. For this purpose a special class of DFS trees, and the line notations corresponding to them, are generated. Each time the visiting of an atom is found to produce a line notation which is worse than the best one found, the DFS is cancelled in advance (Fig. 4).
Fig. 4. The steps of the line code construction algorithm.
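The choice of the root of the decomposition tree described above can be sketched in Python. Instead of computing the height h(x) of every vertex explicitly, this sketch uses the standard technique of repeatedly stripping leaves, which for a tree ends at the same one or two central vertices. The function name `tree_roots` and the component names in the example are hypothetical.

```python
def tree_roots(adj):
    """Return the one or two central vertices of a tree (illustrative).

    adj: dict vertex -> list of neighbouring vertices of the tree
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = len(adj)
    layer = [v for v, d in degree.items() if d <= 1]  # current leaves
    while remaining > 2:
        remaining -= len(layer)
        next_layer = []
        for leaf in layer:
            for w in adj[leaf]:
                degree[w] -= 1
                if degree[w] == 1:
                    next_layer.append(w)  # w becomes a leaf in the next round
        layer = next_layer
    return sorted(layer)

# Hypothetical decomposition tree with five components A..E.
tree = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B", "E"],
    "E": ["D"],
}
print(tree_roots(tree))
```

In this example B and D both have height 2, so the tree has a pair of adjacent vertices of smallest height and either may serve as the root.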
In order to diminish the number of trials two further methods are used. (i) The order of visiting adjacent vertices is partially predefined according to the preliminary heuristic partitioning. When several atoms could be visited next in the course of the DFS, only the ones of greatest rank are considered. Only line notations starting at vertices of greatest rank of the central component or pair of components are generated. (ii) When a complete line code identical to the current best one is generated, pairs of equivalent atoms are found by taking into account the different orderings of atoms in the two line notations. The vertices found to be equivalent are united into sets which at the end of the search represent the graph orbits. This partial perception of vertex equivalence is used in the course of the backtrack search to eliminate redundant trials.
Although the worst-case time complexity of the algorithm is no better than exponential, it could be argued that the approach works quite well on chemical graphs. Backtrack search is limited to separate cyclic components, which are usually not very large. For chemical graphs of low symmetry the heuristic partitioning is quite efficient and diminishes the number of generated line notations. For symmetric cyclic graphs the combinatorial complexity is, to a great extent, resolved by the graph orbit perception improvement, which provides much better results than exhaustive generation. Thus for a complete graph of order n only n DFS trees are constructed (the total number being n!) and the running time is proportional to n *.
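The preliminary heuristic partitioning referred to above (an initial ranking by atom label and distance code, followed by iterative refinement over the sorted ranks of neighbours) can be sketched in Python as follows. The function names, the way atom labels are combined with distance codes, and the direction of the ordering are assumptions for illustration; the cited procedures may differ in these details.

```python
from collections import deque

def distance_code(adj, v):
    """Numbers of first, second, third, ... neighbours of v."""
    dist = {v: 0}
    queue = deque([v])
    shells = []
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                if dist[w] > len(shells):
                    shells.append(0)
                shells[dist[w] - 1] += 1
                queue.append(w)
    return tuple(shells)

def rank(keys):
    """Map each vertex to the index of its key among the sorted distinct keys."""
    order = {k: i for i, k in enumerate(sorted(set(keys.values())))}
    return {v: order[k] for v, k in keys.items()}

def refine(adj, labels):
    """Initial ranking by (label, distance code), then iterative refinement."""
    ranks = rank({v: (labels[v], distance_code(adj, v)) for v in adj})
    while True:
        # Key: own rank, then the sorted list of the neighbours' ranks.
        keys = {v: (ranks[v], tuple(sorted(ranks[w] for w in adj[v])))
                for v in adj}
        new = rank(keys)
        if len(set(new.values())) == len(set(ranks.values())):
            return ranks  # no further splitting of classes
        ranks = new

# Hypothetical inputs: a C-C-C-O chain and an unlabelled six-membered ring.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
chain_labels = {0: "C", 1: "C", 2: "C", 3: "O"}
c6 = {v: [(v - 1) % 6, (v + 1) % 6] for v in range(6)}
c6_labels = {v: "C" for v in range(6)}
print(refine(chain, chain_labels))
print(refine(c6, c6_labels))
```

On the chain the refinement separates all four atoms, so no backtracking would be needed there, while on the fully symmetric ring every atom stays in a single class and the orbit-based pruning has to do the work.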
Both the line notation interpreter and the coding algorithm are implemented as PASCAL procedures using the Borland International Inc. compiler. They have been used in the OASIS system for computer-aided structure-property relationship investigation (Mekenyan et al., 1990). With slight modifications the line notation interpreting procedure is used for input of series of congeneric molecules. The coding procedure is used for cataloguing purposes in a library of substituent constants, and in a subfragment search system based on a three-dimensional database (Mekenyan et al., 1994).
REFERENCES
Balaban A. T., Mekenyan O. & Bonchev D. (1985) J. Comput. Chem. 6, 538; 562.
Corneil D. G. & Gotlieb C. C. (1970) J. Assoc. Comput. Mach. 17, 51.
Karabunarliev S., Mekenyan O. & Dobrinin A. (1984) Viichislit. sist. (Novosibirsk) 103, 141 (in Russian).
Klopman G. & McGonigal M. (1981) J. Chem. Inf. Comput. Sci. 21, 48.
Mekenyan O., Bonchev D. & Balaban A. (1985) J. Comput. Chem. 6, 552.
Mekenyan O., Karabunarliev S. & Bonchev D. (1990) Comput. Chem. 14, 193.
Mekenyan O., Karabunarliev S., Ivanov J. & Dimitrov D. (1994) Comput. Chem. 18, 173.
Morgan H. L. (1965) J. Chem. Doc. 5, 107.
Read R. C. (1983) J. Chem. Inf. Comput. Sci. 23, 135.
Read R. C. (1985) J. Chem. Inf. Comput. Sci. 25, 116.
Tarjan R. E. (1972) SIAM J. Comput. 1, 146.
Wipke W. T. & Dyott T. M. (1974) J. Am. Chem. Soc. 96, 4825; 4834.
Wiswesser W. J. (1968) J. Chem. Doc. 8, 146.