Tetrahedron Computer Methodology, Vol. 2, No. 2, pp. 75 to 83, 1989. Printed in Great Britain
0898-5529/89 $3.00+.00 Pergamon Press plc
HBA: New Algorithm for Structural Match and Applications
Xu Jun and Zhang Maosen
The Center of Structure and Element Analysis, University of Science and Technology of China, Hefei, Anhui 230026, China
Received 3 February 1989, Revised 26 April 1989, Accepted 6 June 1989
Key words: Substructure search; Spectra simulation; Generic structures; Synthesis planning; Backtracking

Abstract: The concept of WALKING on structures is proposed, and a partial ordering between a structure and a query structure (substructure) is established by means of WALKING. Based on these concepts, the authors create the Heuristic-Backtracking Algorithm (HBA), a high-performance algorithm for structural match. Applications of HBA in molecular graphics, synthetic planning, spectrum simulation, and the representation and recognition of general structures are discussed. The source code (HBA.PAS) and executable code (HBA.COM) plus input files (DEMO*.HBA) are included on disk.
BACKGROUND
Structural match is one of the key algorithms in chemical structural research, but it is a typical NP-complete problem because of the "combinatorial explosion".¹ In order to reduce the time complexity, some authors have proposed the algorithm PPA, based upon a multiprocessor computer system.² Structural match can be described as follows: given a query substructure QS and a structure SS, if QS ⊆ SS, then the match algorithm should find a mapping between QS and SS. In the authors' opinion, when an algorithm based on exhaustive search³,⁴ also includes a heuristic-backtracking technique, the exhaustive search can in many cases be avoided. Following this idea, we propose the algorithm HBA.
PARTIAL ORDERING AND WALKING
We can describe structural match in terms of a binary relation. Let (QS, >) and (SS, >) be algebraic systems of the same type, and let g: QS → SS be a mapping. If, for every pair of atoms m, n ∈ QS, g(m > n) = g(m) > g(n) with g(m), g(n) ∈ SS, then g is called a homomorphic mapping from QS to SS. Here the symbol ">" is a definable relation operator, defined in one of the following ways:
1. Topological homomorphism:
> ::= m before n, and deg(m) ≤ deg(g(m))
2. Colored node structure homomorphism:
> ::= m before n, and deg(m) ≤ deg(g(m)), and node_color(m) = node_color(g(m))
3. Colored edge structure homomorphism:
> ::= m before n, and deg(m) ≤ deg(g(m)), and edge_color(m - n) = edge_color(g(m) - g(n))
4. Colored node and edge structure homomorphism:
> ::= m before n, and deg(m) ≤ deg(g(m)), and edge_color(m - n) = edge_color(g(m) - g(n)), and node_color(m) = node_color(g(m))
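The four definitions can be read as a checklist that a candidate mapping g must pass. The following minimal sketch tests definition 4 (colored nodes and edges) for a given mapping. The dictionary representation and the name `preserves_structure` are illustrative assumptions, not part of HBA.PAS. The degree test is taken as non-strict, and distinct query atoms are required to map to distinct atoms, which is also an assumption.

```python
from typing import Dict

# Hypothetical representation (not from the paper):
# adjacency maps node -> {neighbor: edge_color}; colors map node -> node_color.
Adjacency = Dict[int, Dict[int, str]]


def preserves_structure(qs: Adjacency, qs_color: Dict[int, str],
                        ss: Adjacency, ss_color: Dict[int, str],
                        g: Dict[int, int]) -> bool:
    """Check definition 4 (colored nodes and edges) for a candidate mapping g."""
    if len(set(g.values())) != len(g):       # assumption: distinct atoms map to distinct atoms
        return False
    for m, neighbors in qs.items():
        gm = g[m]
        if len(neighbors) > len(ss[gm]):     # deg(m) must not exceed deg(g(m))
            return False
        if qs_color[m] != ss_color[gm]:      # node_color(m) = node_color(g(m))
            return False
        for n, edge_color in neighbors.items():
            # edge_color(m - n) = edge_color(g(m) - g(n)); the image edge must exist
            if ss[gm].get(g[n]) != edge_color:
                return False
    return True


# QS: a C=O fragment; SS: a three-atom chain C-C=O.
qs = {1: {2: "double"}, 2: {1: "double"}}
qs_color = {1: "C", 2: "O"}
ss = {1: {2: "single"}, 2: {1: "single", 3: "double"}, 3: {2: "double"}}
ss_color = {1: "C", 2: "C", 3: "O"}

print(preserves_structure(qs, qs_color, ss, ss_color, {1: 2, 2: 3}))
print(preserves_structure(qs, qs_color, ss, ss_color, {1: 1, 2: 2}))
```

Dropping the node-color test or the edge-color test gives definitions 3 and 2 respectively; dropping both gives the purely topological case.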
The operator ">" defines a partial ordering. This ordering is obtained from the Cartesian set representing the structure; Fig. 1 shows an example. E' is one of the subsets of E x E. It represents the graph-theoretic features of the structure S: if E' can be found in another structure, SS, then S and SS must be homomorphic or isomorphic. Hence, structural match is divided into two phases: (1) get a partial ordering from QS, and (2) look for that partial ordering in SS. To complete these phases, WALKING on QS and heuristic-backtracking WALKING on SS, controlled by the partial ordering from QS, are needed. Algorithm WALKING can be described as follows (see Fig. 2 for an example of the WALKING algorithm and the corresponding content of ROUTE):⁵

PROCEDURE WALKING;
BEGIN
  choose a node from QS arbitrarily as entrance node;
  keep the information of the entrance node in list ROUTE;
  push entrance node into Branch_Stack;
  WHILE Branch_Stack <> NIL DO
  BEGIN
    pop Branch_Stack, get an edge and its color;
    pop_on := TRUE;
    WHILE there is a branch_to_go DO
    BEGIN
      IF pop_on THEN
      BEGIN
        pop_on := FALSE;
        IF current node is not entrance THEN
          number the current node according to the walking sequence and keep the information in list ROUTE;
      END;
      keep the current node information, including atom_type, bond_type, free_electron_number, charge_number, adjacent_degree, walking_order, in list ROUTE;
      IF adjacent_degree = 0 THEN
        there is a branch_to_go := FALSE
      ELSE
        choose any branch to go on walking, others are kept in Branch_Stack;
    END; {of WHILE}
  END; {of WHILE}
  output the list ROUTE, which contains the partial ordering of QS.
END; {of PROCEDURE}
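The procedure above can be sketched in Python as a depth-first walk with an explicit branch stack. This is a simplified illustration, not a transcription of HBA.PAS: the record fields, the `closures` list used to remember ring-closing bonds, and the rule of always walking the lowest-numbered branch first are assumptions made here, and free electrons and charges are omitted.

```python
def walk(atoms, adj, entrance):
    """Depth-first walk over a structure, returning ROUTE as a list of records.

    atoms: {node: atom_type}; adj: {node: {neighbor: bond_type}} (symmetric).
    """
    order = {}                            # node -> walking order (0-based)
    route = []
    branch_stack = [(None, entrance)]     # pending branches: (from_node, to_node)
    while branch_stack:
        parent, node = branch_stack.pop()
        if node in order:                 # already reached along another branch
            continue
        # Bonds back to already-walked nodes other than the predecessor close a ring.
        closures = sorted((order[n], bond) for n, bond in adj[node].items()
                          if n in order and n != parent)
        order[node] = len(route)
        route.append({
            "node": node,
            "atom": atoms[node],
            "degree": len(adj[node]),
            "parent": None if parent is None else order[parent],
            "bond": None if parent is None else adj[parent][node],
            "closures": closures,
        })
        # Walk the lowest-numbered branch next; the others wait on the stack.
        for nbr in sorted(adj[node], reverse=True):
            if nbr not in order:
                branch_stack.append((node, nbr))
    return route


# A three-membered ring 1-2-3 with a substituent 4 on atom 3 (all single bonds).
atoms = {1: "C", 2: "C", 3: "N", 4: "O"}
adj = {1: {2: 1, 3: 1}, 2: {1: 1, 3: 1}, 3: {1: 1, 2: 1, 4: 1}, 4: {3: 1}}
route = walk(atoms, adj, entrance=1)
for record in route:
    print(record)
```

Each record stores the walking order of its predecessor rather than the predecessor's original number, so ROUTE can be replayed on another structure without referring back to QS.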
[Fig. 1 diagram: structure S with nodes numbered 1 to 12.]

S = (N, E, E')
N = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
E = [ e1 = (1-2), e2 = (2-3), e3 = (3-4), e4 = (4-5), e5 = (5-6), e6 = (6-7), e7 = (7-8), e8 = (8-9), e9 = (9-10), e10 = (10-11), e11 = (11-12), e12 = (5-10), e13 = (3-12) ]
("-" represents a single bond)
E' = { (e1, e2), (e2, e3), (e3, e4), (e4, e5), (e5, e6), (e6, e7), (e7, e8), (e8, e9), (e9, e10), (e10, e11), (e11, e12), (e12, e13) }

Fig. 1. The Partial Ordering on a Structure.

[Fig. 2 diagram: structure QS labelled with Original_Order (OO) and Walking_Order (WO), and the list ROUTE with columns Atom/POP, OO, WO/CP, AD/BR and BT. Walking orders 1 to 8 correspond to original orders 5, 2, 1, 3, 7, 6, 8, 4; the remaining ROUTE entries are not legible.]
CP: Cut Point; AD: Adjacent Degree; BR: Branch; BT: Bond Type

Fig. 2. Walking on Structure QS.
ALGORITHM HBA AND ANALYSIS
Algorithm HBA walks on the structure SS, controlled by the information from ROUTE, to seek the mapping of QS. It also chooses the entry node arbitrarily, checks other branches, and backtracks in case of a mismatch. An example of HBA is shown in Fig. 3.⁶

FUNCTION HBA: BOOLEAN;
BEGIN
  success := FALSE;
  WHILE (NOT success) AND (there are untried nodes) DO
  BEGIN
    choose untried_node from SS as entry node;
    WHILE NOT (arrive at the end of ROUTE) AND (there are branches of SS to walk) DO
    BEGIN
      seek the branch which matches the current node of ROUTE (compare the atom_type, bond_type, free_electrons, electric_charge and adjacent_degree);
      IF match THEN
      BEGIN
        IF not arrive at the end of ROUTE THEN
        BEGIN
          prepare to check the next node of ROUTE;
          walk on the next node of SS
        END
      END
      ELSE back track
    END;
    IF arrive at the end of ROUTE THEN
    BEGIN
      success := TRUE;
      output the match mapping list
    END;
  END;
  HBA := success;
END; {of HBA}
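To make the control flow concrete, here is a minimal Python sketch of a ROUTE-driven backtracking search. It is an illustration under stated assumptions rather than the authors' implementation: the ROUTE tuple layout `(atom, degree, parent, bond, closures)`, the function name `hba`, and the `find_all` flag are hypothetical, recursion replaces the explicit loops of the pseudocode, and free electrons and charges are not compared.

```python
def hba(route, ss_atoms, ss_adj, find_all=False):
    """Replay ROUTE on SS with backtracking; return the list of mappings found.

    route: list of (atom, degree, parent, bond, closures), where parent is the
    ROUTE index of the predecessor (None for the entry node) and closures is a
    list of (route_index, bond) ring-closing bonds. This layout is an assumption.
    """
    mappings = []
    assigned = []                       # assigned[k] = SS node matched to ROUTE step k

    def fits(step, cand):
        atom, degree, _parent, _bond, closures = route[step]
        if cand in assigned:            # each SS node is used at most once
            return False
        if ss_atoms[cand] != atom or len(ss_adj[cand]) < degree:
            return False
        # Every ring-closing bond of QS must also exist in SS.
        return all(ss_adj[cand].get(assigned[i]) == b for i, b in closures)

    def extend(step):
        if step == len(route):          # arrived at the end of ROUTE
            mappings.append(list(assigned))
            return not find_all         # stop at the first match unless all are wanted
        _atom, _degree, parent, bond, _closures = route[step]
        if parent is None:
            candidates = list(ss_atoms) # try every SS node as entry node
        else:
            candidates = [n for n, b in ss_adj[assigned[parent]].items() if b == bond]
        for cand in candidates:
            if fits(step, cand):
                assigned.append(cand)
                if extend(step + 1):
                    return True
                assigned.pop()          # back track
        return False

    extend(0)
    return mappings


# ROUTE for the query O=C-C, walked from the oxygen (written by hand here).
route = [("O", 1, None, None, []), ("C", 2, 0, 2, []), ("C", 1, 1, 1, [])]
# SS: a four-atom structure with C2 double-bonded to O3 and single-bonded to C1 and C4.
ss_atoms = {1: "C", 2: "C", 3: "O", 4: "C"}
ss_adj = {1: {2: 1}, 2: {1: 1, 3: 2, 4: 1}, 3: {2: 2}, 4: {2: 1}}

print(hba(route, ss_atoms, ss_adj))
print(hba(route, ss_atoms, ss_adj, find_all=True))
```

With `find_all=False` the search stops at the first mapping, which is all a yes/no structural query needs; with `find_all=True` it keeps backtracking and reports every mapping.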
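In HBA, most of the time is spent in the heuristic-backtracking search. According to combinatorial mathematics, the number of routes, RN, that HBA should search on SS in the worst case is counted by Eq. 1, which combines over the entry nodes i = 1, ..., m and the tried nodes j = 1, ..., n the term

$$ \left[\, AD(j)_{SS} - AD(j)_{QS} \,\right] \qquad (1) $$

(the two operator symbols over i and j are not legible in this copy), where AD is the adjacent degree of the atom, m is the number of nodes of SS, n is the number of nodes of QS (m ≥ n), i is the entry node, and j is the tried node.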
[Fig. 3 diagram: structure SS with the direction of walking marked.]

The Mapping of QS (in Fig. 2) and SS
WO    1   2   3   4   5   6   7   8
QS    5   2   1   3   7   6   8   4
SS   16  11   9   4  10   1   2   3

Fig. 3. HBA seeks Match Mapping.
In contrast, the mapping number MN used in other algorithms should be counted by Eq. 2.¹

$$ MN < \frac{N!}{(M-N)!} \qquad (2) $$
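As a quick arithmetic check, the bound of Eq. 2 can be evaluated for the (n, m) pairs listed in Table 1. The sketch assumes that N and M in Eq. 2 stand for the node counts n of QS and m of SS; the helper name `mn_bound` is illustrative.

```python
from math import factorial


def mn_bound(n: int, m: int) -> float:
    """Right-hand side of Eq. 2, N! / (M - N)!, with N = n and M = m (assumed)."""
    return factorial(n) / factorial(m - n)


# (n, m) pairs of the five examples in Table 1.
for n, m in [(3, 3), (4, 4), (6, 6), (7, 10), (6, 10)]:
    print(f"n={n}, m={m}: MN < {mn_bound(n, m):g}")
```

The values obtained, 6, 24, 720, 840 and 30, agree with the MN bounds printed in Table 1.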
Compare Eqs. 1 and 2: MN depends only on the number of nodes, whereas RN depends on both the number of nodes and the adjacent degrees of the nodes. Table 1 illustrates the difference. This sharp decrease of the search space is the main reason for the high performance of algorithm HBA.

VARIOUS MATCHES AND APPLICATIONS
Structural Query
Here the algorithm only needs to determine whether QS ∈ SS; that is, it is enough to find one mapping between QS and SS. When the algorithm is used to search a large structural database, only the existence of a match is needed, and the mapping details are ignored in order to speed up the match procedure. One advantage of HBA in searching a large structural database is that it need not keep referring to QS: it obtains the partial ordering of QS once, at the start. This results in a sharp decrease of time and space complexity.
Multiple-Structure Match
In this case, the algorithm should find all mappings between QS and SS. This match is divided into two cases:
Unabundant Multiple Match (UMM). In UMM, the algorithm finds all mappings in which different mappings have a different node set. Fig. 4 gives an example.
Table 1. Some Counting Results of Examples
(The structure drawings of QS and SS are not reproduced; in example 5, SS is adamantane. In example 2 the relation symbol before the RN value is not legible.)

Example   n    m    RN      MN
1         3    3    < 4     < 6
2         4    4    8       < 24
3         6    6    < 18    < 720
4         7    10   < 40    < 840
5         6    10   < 24    < 30
[Fig. 4 diagram: structures QS and SS.]

UMM Mapping:
QS    1   2   3   4   5   6   7   8
SS    1   2   3   4   5  12  13  14
SS    9   8   7   6   5  12  11  10

Fig. 4. An Example of UMM

In molecular graphics, UMM is important. Based upon UMM, a computer can identify the basic substructural components that make up a complete structure and then build a 3-D model of the structure by connecting substructures. For example, "diamond"-type structures can be built by connecting "chair" hexacyclic rings.⁷ Generally speaking, HBA must discard some abundant (redundant) mappings to obtain UMM, so it takes more time.
Abundant Multiple Match (AMM). Here, the algorithm finds all mappings, even though some of them may have the same node set. Table 2 lists the AMM mappings for the example of Fig. 4.

Table 2. AMM Mapping from Fig. 4.
QS   SS (one mapping per column)
1     1   1   9   9   3   3   7   7  10  10  14  14
2     2   2   8   8   4   4   6   6  11  11  13  13
3     3   3   7   7   5   5   5   5  12  12  12  12
4     4  14   6  10   6  12   4  12  13   5  11   5
5     5  13   5  11   7  11   3  13  14   4  10   6
6    12  12  12  12  10  10  14  14   3   3   7   7
7    13   5  11   5  11   7  13   3   4  14   6  10
8    14   4  10   6  12   6  12   4   5  13   5  11
Sum: 12 Mappings
AMM is important for computer-assisted synthetic planning, because the AMM mappings identify all equivalent chemical environments in SS. For example, if the reactive position is atom 3 in QS, then AMM tells us that atoms 3, 5, 7 and 12 in SS are the equivalent reactive positions. If the result of the reaction were to break bond 3-8 in QS, then the bonds 3-14, 3-4, 7-10, 7-6, 5-12, 5-6, 5-4, 12-13 and 12-11 in SS would each be broken in turn by applying this reaction to SS, and the products would be as in Fig. 5.⁸
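The equivalent positions quoted above can be read directly off the mappings of Table 2. The short sketch below stores the table row by row, one list per QS atom with one entry per mapping, and collects the images of atom 3 and of bond 3-8. The variable and function names are illustrative.

```python
# Table 2, keyed by QS atom; position k in each list belongs to mapping k.
AMM = {
    1: [1, 1, 9, 9, 3, 3, 7, 7, 10, 10, 14, 14],
    2: [2, 2, 8, 8, 4, 4, 6, 6, 11, 11, 13, 13],
    3: [3, 3, 7, 7, 5, 5, 5, 5, 12, 12, 12, 12],
    4: [4, 14, 6, 10, 6, 12, 4, 12, 13, 5, 11, 5],
    5: [5, 13, 5, 11, 7, 11, 3, 13, 14, 4, 10, 6],
    6: [12, 12, 12, 12, 10, 10, 14, 14, 3, 3, 7, 7],
    7: [13, 5, 11, 5, 11, 7, 13, 3, 4, 14, 6, 10],
    8: [14, 4, 10, 6, 12, 6, 12, 4, 5, 13, 5, 11],
}


def equivalent_atoms(qs_atom):
    """All SS atoms that some mapping assigns to the given QS atom."""
    return sorted(set(AMM[qs_atom]))


def equivalent_bonds(a, b):
    """All SS bonds that some mapping assigns to the QS bond a-b."""
    return sorted({tuple(sorted(pair)) for pair in zip(AMM[a], AMM[b])})


print(equivalent_atoms(3))
print(equivalent_bonds(3, 8))
```

The script yields four equivalent atoms and nine equivalent bonds, the same sets as listed in the text (bond ends are printed in ascending order, so 7-6 appears as (6, 7)).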
Fig. 5. Products of Chemical Transform
Simulation of Spectra
In UV and NMR spectral analysis, simulation of the spectrum is often desired. Usually, the simulation is based on the functional-group chemical shift sum method. The method depends on exploring the chemical environment of every substituent position, since different substituents have different experimental parameters. Finding a functional group at a specific position in a structure can be implemented with the structural match algorithm.
Fig. 6. Query β-Substituent Functional Group

In Fig. 6, if we want to determine the substituents at the β-positions of the carbon marked with the asterisk, we can match the three substructures a, b and c against a basic substituent set, using the structural query. With this method, the simulation of a spectrum can be achieved.
General Structure Recognition
A general structure represents a set of structures that share common constraints. It can be considered a kind of structural domain, and it is recognized by checking whether a QS belongs to the domain. The main problem is checking the constraints.¹⁰,¹¹ In Fig. 7, GS is shown together with three constraints. Checking each constraint involves structural recognition, to which HBA can be applied. By matching the substructural sets X, Y and Z and the if-then statement in GS, the computer determines that QS1 ∈ GS, while QS2 does not belong to GS because it is not compatible with the if-then statement in GS.

ACKNOWLEDGEMENT
The algorithm HBA is implemented on a VAX 11/785 computer. The Computer Center of Beijing Pharmacy Chemistry Institute provided the computing resources. This work is supported by the science foundation of Academia Sinica.
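[Fig. 7 diagram: General Structure (GS) with substituent sets X, Y and Z, where Z = { F, Cl, Br, I, OH }, an if-then constraint relating X and Z through OH, and query structures QS1 and QS2. The set Y and the relation symbols of the constraint are not legible.]

Fig. 7. General Structure Recognition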
EXPERIMENTAL
On disk in this issue are the Turbo Pascal ((c) Borland) source code of the algorithm, HBA.PAS, the executable code, HBA.COM, and documented sample input files, DEMO*.HBA. The latter files may be typed out and read as well as serving as input to the program. DEMO.HBA should be read first.

REFERENCES AND NOTES
1. Tarjan, R. E., "Graphic Algorithm in Chemical Computation", in Algorithms for Chemical Computation; Christoffersen, R. E., Ed.; ACS: Washington, DC, 1977, Vol. 46, pp. 1-20.
2. Wipke, W. T.; Rogers, D., J. Chem. Inf. Comput. Sci., 1984, 24, 255-262.
3. Ash, J. E., et al., Communication, Storage and Retrieval of Chemical Information, John Wiley & Sons, 1985, pp. 129-131.
4. Chaunjie, G., et al., Ke Xue Tong Bao, 1987, 18, 1393-1396.
5. Jun, X., "The Doctorate Dissertation of University of Science and Technology of China (USTC)" (in Chinese), 1988, pp. 49-53.
6. Ibid., pp. 60-66. The algorithms are implemented in Pascal; readers who need the source program are welcome to contact the authors.
7. Ibid., pp. 76-121. This is the second part of Xu's dissertation, on automatic 3-D molecular model building and the structural generator.
8. Ibid., pp. 131-173. This is the third part of the dissertation, in which computer-assisted synthetic planning for isotopically labelled compounds is discussed.
9. Ibid., pp. 174-187.
10. Ibid., pp. 140-146. A detailed discussion of generic structure representation and recognition is given there.
11. Jun, X.; Maosen, Z., "Representation of Nondeterministic Graphic Knowledge & A Linear Notation for Graphics", in Mini & Microcomputers and Their Applications, Universidad Autonoma de Barcelona, 1988, pp. 261-264.