An efficient algorithm for the computation of the canonical numbering of reaction matrices


JOSEF BRANDT and ANNETTE VON SCHOLLEY
Organisch-chemisches Institut, Technische Universität München, D-8046 Garching, Germany

(Received 14 April 1982)

Abstract-An efficient algorithm is given for computing a canonical numbering of reaction matrices as defined by the DUGUNDJI-UGI model. The canonization rules rely on the heuristic assumption that in ordinary reactions the valence electrons are shifted along a single path or cycle. The algorithm consists in a stepwise construction of strings with prompt elimination of all unsuccessful branches. An exemplary implementation (in PASCAL) is given that can be easily converted into commonly used languages, since more advanced constructs are not employed.

INTRODUCTION

Practical application of matrix representations of chemical structures and of their transformations in information handling systems hinges critically on the possibility of defining and computing (within acceptable times) a canonical representation of such a matrix. Reactions can be described as a transformation of one matrix $^{1}B$ into another $^{2}B$ by a reaction matrix $^{12}R$ according to the fundamental equation:

$$^{1}B + {}^{12}R = {}^{2}B. \qquad (1)$$

The matrices $^{1}B$ and $^{2}B$ are called the BE-matrices of the educt and product ensembles of molecules, $^{1}EM$ and $^{2}EM$. Together with an atom vector $a$ (note that $^{1}a = {}^{2}a$), they describe the chemical constitution of the ensembles $^{1}EM$ and $^{2}EM$ by giving the bond orders of covalent bonds linking atoms $a_i$ and $a_j$ as the off-diagonal elements $b_{ij}$ of $B$; free electrons on $a_i$ are represented by diagonal entries $b_{ii}$ in $B$. The change of the pattern of valence electrons that occurs with the reaction $^{1}EM \to {}^{2}EM$ is represented by the matrix $^{12}R$, the R-matrix. In $^{12}R$, the off-diagonal entries $r_{ij}$ represent bonds broken or made as negative and positive values, resp. A diagonal entry $r_{ii}$ gives the gain or loss of free valence electrons of the atom $a_i$. Details of this mathematical model of constitutional chemistry, which has become known as the DUGUNDJI-UGI (DU) model, have been published elsewhere (Dugundji & Ugi, 1973; Ugi et al., 1979a, 1979b, 1979c; Brandt et al., 1981 and literature cited therein).

An upper bound for the computational effort in any definition of a canonical representation is the complete enumeration of all n! possible numberings of the indices of rows and columns (i.e. row/column permutations) of such matrices. This is usually unacceptable in practical implementations. Canonization rules, apart from fulfilling the necessary condition of uniqueness, therefore employ some heuristic assumptions that aim at facilitating the computation of the canonical representation in question. For R-matrices, such a canonization rule has recently been published (Brandt et al., 1981). In the present paper we shall describe an efficient algorithm for the manual and machine computation of this canonical representation of chemical reactions. On the basis of this algorithm, chemical information handling systems (storage and retrieval algorithms, synthesis planning) have been implemented (v. Scholley, 1981) and a compact string notation for chemical reactions has been formulated (Brandt et al., 1981).

PRINCIPLES

(a) Formulation as an assignment problem

The canonization rule given in (Brandt et al., 1981) belongs to one class of canonization rules that can be characterized as follows: Each position i, j in the matrix is weighted by the element $c_{ij}$ of a matrix C. Then a row/column permutation of Z, the (arbitrary) matrix to be canonized, is postulated which minimizes (or maximizes) the sum of the weighted entries of Z, thus:

$$C \cdot (P Z P^{T}) = \min! \qquad (2)$$

where all matrices are square matrices of rank n; P is a permutation matrix, such that $p_{ij} \in \{0, 1\}$ and $\sum_i p_{ij} = \sum_j p_{ij} = 1$ (i.e. P contains exactly one $p_{ij} = 1$ in any row and column); $P^{T}$ is the transpose of P; $C \cdot Z$ is the "scalar product" of matrices C and Z defined by:

$$C \cdot Z := \sum_i \sum_j (c_{ij} z_{ij}). \qquad (3)$$
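Eqs. (2) and (3) can be tried out on a very small case. The following Python sketch evaluates the weighted sum for one permutation and then finds the minimizing permutation by plain enumeration of all n! numberings. The function names, the three-atom chain used as Z, and the choice of weights (descending powers of 2, row by row) are assumptions made for this illustration only.

```python
from itertools import permutations

def weighted_sum(C, Z, perm):
    """Scalar product C.(P Z P^T): position (i, j) receives entry Z[perm[i]][perm[j]]."""
    n = len(Z)
    return sum(C[i][j] * Z[perm[i]][perm[j]] for i in range(n) for j in range(n))

def canonical_permutation(C, Z):
    """Exhaustive search over all n! permutations for the minimal weighted sum."""
    return min(permutations(range(len(Z))), key=lambda p: weighted_sum(C, Z, p))

# Hypothetical example: adjacency matrix of a three-atom chain 0-1-2.
Z = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
n = len(Z)
# Assumed weights: descending powers of s = 2, row by row (256, 128, ..., 1).
C = [[2 ** (n * n - (i * n + j + 1)) for j in range(n)] for i in range(n)]

best = canonical_permutation(C, Z)
print(best, weighted_sum(C, Z, best))
```

The minimum places the central atom of the chain in the last position, so that both bonds fall on the most lightly weighted off-diagonal positions. The cost of the search grows as n!, which is the motivation for the string-based procedure developed in the rest of the paper.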

The entries of the weight matrix C are usually set up in such a way that any permutation in Z leads to a change of the value of $C \cdot Z$. One may simply take powers of a number s:

$$c_{ij} = s^{k} \qquad (4)$$

where k is an expression in i and j, and s is the range of values in Z:

$$s := \max_{i,j}(z_{ij}) - \min_{i,j}(z_{ij}) + 1 \qquad (5)$$

so, when Z is an adjacency matrix ($z_{ij} \in \{0, 1\}$), then s = 2. A rather straightforward way of setting up C would use:

$$k = n^{2} - ((i-1)n + j). \qquad (6)$$

With this C, the elements are weighted along the rows. This C is the weight matrix that underlies RANDIĆ's (Randić, 1975, 1977; Randić et al., 1981) canonization rule for adjacency matrices. For symmetric matrices, one could also use:

$$k = t - [(j-i-1)(2n-j+i)/2 + i]; \quad \text{for } j > i \qquad (7)$$

with $t = n(n-1)/2$, which ranks the elements along the side diagonals (Jochum, 1978).

We have shown (Brandt et al., 1981) that for R-matrices a weighting along the side diagonals can yield an extremely concise and, above all, a chemically meaningful representation of the electron shift pattern of a reaction if the following two additional features are included:

(a) The side diagonals of the upper triangle of R must be read in a circular fashion, i.e. they have to be made n elements long by jumping back to the left margin of the R-matrix when the right margin is encountered.

(b) The signs of the entries have to alternate.

This method, then, would be defined as follows: Instead of eqn. (4) we use:

$$c_{ij} = (-1)^{i} s^{k}; \quad \text{for } 1 \le j-i \le n/2 \qquad (8a)$$

$$c_{ij} = (-1)^{j} s^{k}; \quad \text{for } j-i > n/2 \qquad (8b)$$

$$c_{ij} = s^{n-i}; \quad \text{for } i = j \qquad (8c)$$

with

$$k = t - [n(j-1) - i(n-1)]; \quad \text{for } j-i \le n/2 \qquad (9a)$$

$$k = t - [ni + (n-j)(n-1)]; \quad \text{for } j-i > n/2 \qquad (9b)$$

With these equations, the weight matrix with rank n = 7 would read:

[matrix C for n = 7]

and for n = 6 we get:

[matrix C for n = 6]

Computation of Eq. (2) leads to a quadratic assignment problem. Algorithms for the solution of such problems exist (see, e.g., Gilmore, 1962; Lawler, 1963; Pierce & Crowston, 1971; Burkard, 1975, 1976; Burkard & Derigs, 1980), but are rather complex and time consuming. More seriously, with Eqs. (4) or (8), the elements of C can reach extremely high numerical values which can easily exceed the usual range of integer values allowed in present day computers (with s = 2, and 31 bit unsigned integers, the rank of R would be limited to n ≤ 6). Replacing (6) or (7) by a less steeply progressing series may lead to a sacrifice in uniqueness which might however be tolerable in special cases (Jochum, 1978).

(b) Formulation as a string operation

Equation (2) may be rephrased as a concatenation of elements of R and as a maximisation of the string thus generated:

$$\Big\Vert_{i=1}^{n} \Big\{ \Big\Vert_{j=1}^{n} r_{q(i),q(j)} \Big\} = \max! \qquad (10)$$

where $\Vert$ is the concatenation operator; it concatenates two strings of data elements, e.g. characters or digits, into one: let x = 123 and y = 456, then $x \Vert y = 123456$; $\Vert_{i}$ is the generalized concatenation operator; it is defined (analogous to $\Sigma$ (summation), $\Pi$ (product) etc.) as:

$$\Big\Vert_{i=1}^{n} x_i := x_1 \Vert x_2 \Vert \dots \Vert x_n; \qquad (11)$$

q is a permutation vector. This formulation lends itself more easily to implementations by means of higher level programming languages that have language constructs for handling of strings or lists (pointers), and by computers that support such operations by hardware commands. The weight matrix formalism [Eq. (4) ff.], then, can be replaced by sequence rules for i and j in the concatenation rule (10). The sequence rule

$$i = 1, \dots, n \ \text{ and } \ j = 1, \dots, n \qquad (12)$$

corresponds to (6), and it was, in fact, in this form that a canonization rule of this behaviour was first formulated (Randić, 1975). For the canonization of reaction matrices, we have given the rule (8) and (9) in the following form:

Rule C1. Form a string by concatenating the substrings formed by rules [C2], [C3], [C4], [C5] (in that order). Choose the lexicographically greatest string as the canonical one and the corresponding numbering of the R-matrix as the canonical one: $R^{c}$. Since substrings need to be computed and concatenated only long enough to reach a decision, usually rule [C2] is sufficient. We define substringC2 as follows:

Rule C2. Form a string by concatenating the off-diagonal elements $r_{ij}$ of R, with alternating signs in the following order:

$$-r_{12} \Vert r_{23} \Vert -r_{34} \Vert \dots \Vert (-1)^{n-1} r_{n-1,n} \Vert (-1)^{n} r_{n,1} \Vert -r_{13} \Vert r_{24} \Vert \dots \Vert (-1)^{n} r_{n,2} \Vert -r_{14} \Vert \dots \qquad (13)$$

thus:

$$\text{substringC2} := \Big\Vert_{m=1}^{n-1} \Big\{ \Big\Vert_{k=1}^{n} \big[(-1)^{k} r_{kj}\big] \Big\}$$

where $r_{kj}$ are the elements of $R^{c}$, and $j = 1 + \mathrm{mod}_n(m + k - 1)$.

Rule C3. SubstringC3 is formed by concatenating the

diagonal elements of R:

$$r_{11} \Vert r_{22} \Vert \dots \Vert r_{nn} \qquad (14)$$

thus:

$$\text{substringC3} := \Big\Vert_{i=1}^{n} (r_{ii})$$

The further rules (Brandt et al., 1981) do not enter in the part of the algorithm to be described here, and are given only for the sake of completeness:

Rule C4. SubstringC4 is formed by concatenating the elements of the intact BE-matrix, $B^{c}$, of the reaction core in the same order (but retaining the sign) as in rule [C2], including the main diagonal, thus:

$$\text{substringC4} := \Big\Vert_{m=1}^{n} \Big\{ \Big\Vert_{k=1}^{n} [b_{kj}] \Big\}$$

where $b_{kj}$ are the elements of $B^{c}$.

Rule C5. SubstringC5 is formed by concatenating descriptors of the elements of the atom vector $a^{c}$, thus:

$$\text{substringC5} := \Big\Vert_{i=1}^{n} (d[a_i]) \qquad (15)$$

where $d[a_i]$ is a description of atom i.

The above set of rules was found to be particularly suitable for R-matrices, since these have negative and positive entries (in contrast to adjacency or BE-matrices), which furthermore tend to be arranged alternatingly along a path (open or circular) in the corresponding molecular structure.

ALGORITHMS

The definition (13) lends itself easily to an efficient algorithm based on a stepwise build-up of strings. The alternating-signs rule permits a procedure whereby on each step of the building process the elements to be appended to existing strings are taken from different equivalence classes of entries, i.e. from the highest class on each even step and the lowest class on each odd step. Equivalence class partition of the entries of R is based on the values of $r_{ij}$. The range of values of the $r_{ij}$ is -1 ... +1 for i ≠ j and for elementary reactions. Even in composite reactions, this range is usually restricted to -2 ... +2. The equivalence class with $r_{ij} = 0$ has by far the highest population since R-matrices are sparse.

Another advantageous feature for the formulation of an efficient algorithm is the circular-reading rule. It requires that for the nth string element to be valid there must exist an entry $r_{1n} = r_{n1}$ in the appropriate equivalence class. After the first n entries of the R-string are placed, the corresponding numbering of the R-matrix is determined. Thus, if there are still any non-zero elements to be processed, their position (and correspondingly the total R-string) is already determined. The final step, therefore, consists merely in comparing the remaining strings and retaining the maximal ones.

(a) Detailed description

For the algorithm to be generally applicable, it should be amenable also to manual computation, at least in the most frequently encountered cases. This is indeed possible here, and we shall give an example of such a manual computation since we can thereby most easily illustrate the procedure of the computer program. Let us take a reaction of the type:

    A=B + D-E + C:  →  D-A-C-E + B:        (16)

The n(n - 1) off-diagonal entries $r_{ij}$ of the corresponding R-matrix are partitioned into equivalence classes according to the values (-2, -1, 0, 1) of $r_{ij}$, thus:

    Equivalence class:   I       II      III      IV
    Value of r_ij:       -2      -1      0        1
    Index pairs:         AB      DE      AE EA    AC CA
                         BA      ED      BC CB    AD DA
                                         BD DB    CE EC
                                         BE EB
                                         CD DC            (17)
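The partition into equivalence classes is easy to reproduce mechanically. The sketch below is a minimal Python illustration and is not part of the authors' program: the dictionary `BONDS` encodes the non-zero off-diagonal entries of the example R-matrix as read from the text (A=B broken, D-E broken, A-D, A-C and C-E made), and the helper names `r` and `equivalence_classes` are chosen here for the example.

```python
from collections import defaultdict

# Non-zero off-diagonal entries r_ij of the example (assumed reading of the source):
# A=B is broken (-2), D-E is broken (-1), A-D, A-C and C-E are made (+1).
BONDS = {("A", "B"): -2, ("D", "E"): -1, ("A", "D"): 1, ("A", "C"): 1, ("C", "E"): 1}

def r(a, b, bonds=BONDS):
    """Symmetric lookup of the off-diagonal entry r_ab (0 if the pair is not listed)."""
    return bonds.get((a, b), bonds.get((b, a), 0))

def equivalence_classes(atoms, bonds=BONDS):
    """Group all n(n-1) ordered index pairs by their value of r_ij, lowest value first."""
    classes = defaultdict(list)
    for a in atoms:
        for b in atoms:
            if a != b:
                classes[r(a, b, bonds)].append(a + b)
    return dict(sorted(classes.items()))

for value, pairs in equivalence_classes("ABCDE").items():
    print(value, " ".join(pairs))
```

For five atoms this yields four classes with 2, 2, 10 and 6 ordered pairs; the class with value 0 is by far the largest, as expected for a sparse R-matrix.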

We start by writing down the indices (e.g. atom descriptors, element names, arbitrary numerical indices) of the "most favourable" equivalence class, which, following the definition (13), is the one with the lowest $r_{ij}$-value, thus -2. The buildup process, therefore, starts with two strings of length two: "AB" and "BA". In the following steps strings are extended by one element per step subject to the conditions that:
-the element to be appended is not yet a member of the string;
-an index-pair can be found in the "most favourable" equivalence class, whose first member is equal to the last element of the string to be extended; the second element of the pair is then appended.
The "most favourable" equivalence class is alternatingly the highest (for the even-numbered step) or lowest (for an odd-numbered step) equivalence class, in which at least one suitable index pair has been found. The alternate scanning from top or bottom, resp., is a direct consequence of the signs-rule: $(-1)^{k}$ in [rule C2]. Any string which cannot be extended by members from the "most favourable" class is henceforward excluded from the buildup process, because it could be extended only from less favourable classes, thus yielding a substringC2 of lower lexicographic value than its competitors. Such a string is therefore deleted from the list of growing strings. Some strings will engender more than one descendant, thus leading to branching, when more than one suitable index pair is present in the "most favourable" equivalence class. This will occur mainly with a highly populated equivalence class, such as class III where $r_{ij} = 0$.
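The build-up rules just described can be condensed into a few lines. The following Python sketch is an illustration written for this text, not a transcription of the PASCAL program: it keeps the growing strings in a plain list, represents atoms by single characters, and takes the non-zero off-diagonal entries of the example reaction (as read from the text) as a dictionary. The names `build_strings`, `FORWARD` and `REVERSE` are assumptions of the example. The final comparison of any strings that survive ring closure is left out.

```python
def build_strings(atoms, bonds):
    """Stepwise build-up of index strings with alternating class scanning and ring closure."""
    def r(a, b):
        return bonds.get((a, b), bonds.get((b, a), 0))

    n = len(atoms)
    values = sorted({r(a, b) for a in atoms for b in atoms if a != b})
    # Step one: every ordered index pair of the lowest class starts a string.
    strings = [a + b for a in atoms for b in atoms if a != b and r(a, b) == values[0]]
    highest_first = True  # step two scans the classes from the highest value down
    for _ in range(n - 2):
        for v in (reversed(values) if highest_first else values):
            grown = [s + c for s in strings for c in atoms
                     if c not in s and r(s[-1], c) == v]
            if grown:  # most favourable class found; all other strings are dropped
                strings = grown
                break
        highest_first = not highest_first
    # Ring closure: keep strings whose last and first element pair lies in the best class.
    for v in (reversed(values) if highest_first else values):
        closed = [s for s in strings if r(s[-1], s[0]) == v]
        if closed:
            return closed
    return strings

FORWARD = {("A", "B"): -2, ("D", "E"): -1, ("A", "D"): 1, ("A", "C"): 1, ("C", "E"): 1}
REVERSE = {pair: -value for pair, value in FORWARD.items()}

print(build_strings("ABCDE", FORWARD))
print(build_strings("ABCDE", REVERSE))
```

Replacing each list of strings wholesale at every step plays the role of the housekeeping that an array-based implementation has to do explicitly: strings that cannot be extended from the most favourable class simply do not reappear in the new list.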

The buildup process, then, proceeds as follows:

    Start (step one): take elements from class I
        AB ⊗    BA
    Step two: work from bottom up; the most favourable class is class IV.
    String AB cannot be extended and is therefore deleted
        BAC ⊗   BAD
    Step three: work from top down. No extension from class I, most favourable class is class II
        BADE
    Step four: work from bottom up
        BADEC

This atom vector defines the following canonical matrix:

             B    A    D    E    C
        B   +2   -2    0    0    0
        A   -2    0    1    0    1
        D    0    1    0   -1    0
        E    0    0   -1    0    1
        C    0    1    0    1   -2        (18)

The corresponding string notation is: (21110; 0000.1/2000.2).

The buildup process for the reverse reaction proceeds in an analogous fashion. Since the R-matrix for the reverse reaction $\bar{R}$ is defined by $\bar{r}_{ij} = -r_{ij}$, the list of equivalence classes is the reverse of the one given above (17). The starting step takes all entries from class IV. Step two works with class I, step three uses class III, and step four uses class II (of the starting strings, only CA and DA can be extended in step two):

    Start (class IV):         CA                        DA
    Step two (class I):       CAB                       DAB
    Step three (class III):   CABC ⊕   CABD   CABE      DABC ⊗   DABD ⊕   DABE
    Step four (class II):     CABDE    CABED            DABED ⊕

Strings marked ⊗ are deleted, because they cannot be extended, and the strings marked ⊕ are invalid, because the element to be appended, C or D, is already a member of the string. A final step checks for "ring closure", i.e. only those strings whose first and last element correspond to a pair in the most favourable class are retained; all others lead to a lexicographically smaller linear notation. In our example, only CABDE is retained, since CE is in class IV, while CABED fails, DC being in class III; its string notation is smaller than the one for CABDE. Even if the "ring closure" step were omitted, the two remaining strings would yield two different linear notations:

    CABED → (12010; 0001.1/00.22)
    and
    CABDE → (12011; 0.1/00.22)

After n-1 steps (and the "ring closure" check), the sequence of indices and therewith the full string notation is determined; the first n elements of substringC2 are computed. Should there remain more than one candidate, the remaining elements of substringC2 are computed and a final check is made which retains only those strings that lead to the highest linear notation. In the general case of a binary matrix (adjacency matrix with entries 0 and 1 only) this could lead to an exhaustive enumeration.
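The leading digits of the string notations quoted for the example can be checked directly against the definition of substringC2 in rule C2. The Python sketch below is an illustration only: the function name `substring_c2`, the dictionary encoding of the R-matrix entries (as read from the example), and the optional `length` argument are assumptions of this example.

```python
def substring_c2(numbering, bonds, length=None):
    """Elements (-1)^k * r_kj of substringC2, read along the circular side diagonals."""
    def r(a, b):
        return bonds.get((a, b), bonds.get((b, a), 0))

    n = len(numbering)
    elements = []
    for m in range(1, n):                # side diagonal m = 1 .. n-1
        for k in range(1, n + 1):        # row index k = 1 .. n
            j = 1 + (m + k - 1) % n      # circular column index, as in rule C2
            elements.append((-1) ** k * r(numbering[k - 1], numbering[j - 1]))
    return elements[:length]

FORWARD = {("A", "B"): -2, ("D", "E"): -1, ("A", "D"): 1, ("A", "C"): 1, ("C", "E"): 1}
REVERSE = {pair: -value for pair, value in FORWARD.items()}

print(substring_c2("BADEC", FORWARD, 5))
print(substring_c2("CABDE", REVERSE, 5))
print(substring_c2("CABED", REVERSE, 5))
```

Because Python compares lists element by element, two candidate numberings can be ranked with an ordinary `>` on these lists, which mirrors the lexicographic comparison required by rule C1.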

In an R-matrix, there must be at least three equivalence classes, with $r_{ij} \in \{-1, 0, +1\}$ for i ≠ j. Furthermore, chemically meaningful R-matrices are sparse, with approximately n/2 positive and n/2 negative entries in the upper triangle, the rest being zero. The nonzero matrix elements of common reactions do form chains of alternating bond formations and breakages (Arens, 1979a, 1979b, 1979c). As a consequence, the algorithm is particularly well suited for R-matrices, although it can be employed to find alternating paths in any edge-marked graph.

(b) Implementation (PASCAL-Program)

We choose a structured language (PASCAL) to demonstrate some further details of our implementation. One aspect requires a more careful consideration: the deletion of unsuccessful strings during the buildup process and the branching of others requires some housekeeping of computer storage and can be handled in two ways. Pointer structures allow a rather elegant implementation, if languages support them (PASCAL, PL/1 etc.). Otherwise, one would prefer arrays of strings and duplication of strings in the case of branching; some garbage collection has then to be done in order to recover the space occupied by deleted strings. In checking for non-occurrence of the elements to be appended in the string to be extended, bit vectors (in FORTRAN, PL/1 etc.) or set operations (in PASCAL) are used.

The following version illustrates the principles of an implementation that could be applied to the more commonly used languages, like FORTRAN, BASIC, etc., which support only scalars and arrays. Note that records would have to be resolved into individual vectors of integers, and sets into vectors of booleans (with some loss of clarity, of course). The program stores each R-string individually, even if some strings have some leading elements in common. The following data structures are specified:

    PROGRAM SHAS(INPUT, OUTPUT, REIN);

    CONST
      RRANGR = 7;     (* MAX. RANK OF THE R-MATRIX *)
      RRANGB = 8;
      RRANGQ = 21;    (* ONE TRIANGLE OF THE R-MATRIX *)
      RRANGF = 42;    (* FULL R-MATRIX WITHOUT DIAGONAL *)

    TYPE
      INDRANGA  = 0 .. RRANGR;
      INDRANGB  = 1 .. RRANGB;
      INDRANGR  = 1 .. RRANGR;
      INDRANGQ  = 1 .. RRANGQ;
      INDRANGF  = 1 .. RRANGF;
      RWERTBER  = -2 .. +2;          (* RANGE OF VALUES OF THE R-MATRIX ELEMENTS *)
      VERGLEICH = SET OF INDRANGR;   (* HOLDS THE ELEMENTS ALREADY PRESENT IN A STRING *)
      SORTRECORD = RECORD            (* ONE ELEMENT OF THE R-MATRIX *)
        IR:  INDRANGR;               (* ROW INDEX *)
        JR:  INDRANGR;               (* COLUMN INDEX *)
        RIJ: RWERTBER;               (* VALUE *)
      END;
      RLINNOT = RECORD               (* ONE R-STRING *)
        RSTRING: ARRAY [INDRANGB] OF INDRANGA;
        SETEL:   VERGLEICH;          (* CONTAINS THE INDICES PRESENT IN RSTRING *)
      END;

    VAR
      RM:           ARRAY [INDRANGQ] OF SORTRECORD;  (* R-MATRIX IN TRIPLE NOTATION *)
      LINNOT:       ARRAY [INDRANGF] OF RLINNOT;     (* R-STRINGS UNDER CONSTRUCTION *)
      NEU, FERTIG:  SET OF INDRANGF;                 (* PROCESSING STATE *)
      ARANGR:       INDRANGR;                        (* ACTUAL RANK OF THE R-MATRIX *)
      ARANGQ:       INDRANGQ;                        (* ACTUAL NUMBER OF TRIPLES *)
      ARANGB:       INDRANGB;
      ARANGF:       INDRANGF;                        (* FULL R-MATRIX WITHOUT DIAGONAL *)
      STRINGLENGTH: INDRANGB;                        (* LENGTH OF RSTRING *)
      EQZ:          ARRAY [RWERTBER] OF INTEGER;
      WO:           BOOLEAN;                         (* FLIP-FLOP: WORKING DIRECTION *)
      PLUS:         RWERTBER;
      ABSCHLUSS:    BOOLEAN;
      KLASSE:       INTEGER;
      REIN:         TEXT;

The R-strings are kept as strings of numerical atom indices in RLINNOT.RSTRING, and the set RLINNOT.SETEL contains those elements that are part of RLINNOT.RSTRING. The sets NEU and FERTIG support the housekeeping and garbage collection; in addition, NEU is empty when no string could be extended from the present equivalence class, so that the next class has to be worked up.

The blocks of the main program reflect the steps illustrated in the manual example above.

    BEGIN
      (* -- BLOCK 1 : INPUT -- *)
      LIESREIN;           (* READ THE DATA FROM FILE "REIN" *)

      (* -- BLOCK 2 : SORTING -- *)
      GENSORT(RM, ...);   (* SORT THE TRIPLES INTO EQUIVALENCE CLASSES *)
      MAECHTIG;           (* DETERMINE THE CARDINALITY OF EACH EQUIVALENCE CLASS: EQZ *)

      (* -- BLOCK 3 : INITIAL SETTINGS -- *)
      ANFBESETZ;          (* ENTER THE VALUES OF THE HIGHEST EQUIVALENCE CLASS IN LINNOT *)

      (* -- BLOCK 4 : BUILD-UP OF THE RSTRINGS -- *)
      IF (ARANGR > 2) THEN PREPARE;
                          (* CONTROLS THE PROCEDURES THAT BUILD UP THE RSTRING:     *)
                          (* APPEND     : CHECKS FOR SUCCESSORS                     *)
                          (* EXTEND     : EXTENDS RSTRING BY A SUCCESSOR            *)
                          (* FILLIN     : INSERTS THE SUCCESSOR INTO RSTRING        *)
                          (* TESTOUTPUT : TEST OUTPUT, PRINTS STEP BY STEP THE      *)
                          (*              RSTRINGS COMPUTED SO FAR                  *)

      (* -- BLOCK 5 : RING CLOSURE -- *)
      IF (ARANGR >= 2) THEN
      BEGIN
        ABSCHLUSS := TRUE;
        APPEND;
      END;
    END.

The R-matrix data are read in (block 1), stored as triples {i, j, $r_{ij}$} (with i < j) in RM.IR, RM.JR, RM.RIJ and sorted (block 2) by ascending value of $r_{ij}$. Only one triangle of the (symmetric) R-matrix is stored; this leads to some duplicate coding in the procedures ANFBESETZ, EXTEND, and RINGSCHLUSS. The procedure MAECHTIG counts the number of members of each equivalence class ($r_{ij} \in \{-2, -1, 0, +1, +2\}$) in EQZ. Block 3 performs the starting step:

    PROCEDURE ANFBESETZ;
    (* ENTER THE VALUES OF THE HIGHEST EQUIVALENCE CLASS IN LINNOT *)
    VAR
      N: RWERTBER;
      I2, I: INDRANGF;
    BEGIN
      FERTIG := [];
      N := -2;
      WHILE (EQZ[N] = 0) DO N := SUCC(N);   (* FIRST NON-EMPTY CLASS *)
      FOR I := 1 TO EQZ[N] DO
      BEGIN
        WITH LINNOT[I] DO                   (* STRING "IJ" *)
        BEGIN
          FERTIG := FERTIG + [I];
          RSTRING[1] := RM[I].IR;
          RSTRING[2] := RM[I].JR;
          SETEL := [RM[I].IR] + [RM[I].JR];
        END (*WITH*);
        I2 := I + EQZ[N];
        WITH LINNOT[I2] DO                  (* MIRRORED STRING "JI" *)
        BEGIN
          FERTIG := FERTIG + [I2];
          RSTRING[1] := RM[I].JR;
          RSTRING[2] := RM[I].IR;
          SETEL := [RM[I].IR] + [RM[I].JR];
        END (*WITH*);
      END (*FOR*);
    END (*ANFBESETZ*);

Extension of the string is governed by procedure PREPARE, which flips variable WO to do the alternating between the top-down and bottom-up processing of equivalence classes:

    PROCEDURE PREPARE;
    (* POSITION CONTROL FOR THE APPEND ROUTINE *)
    BEGIN
      ABSCHLUSS := FALSE;
      WO := FALSE;
      FOR STRINGLENGTH := 2 TO ARANGR - 1 DO
      BEGIN
        IF WO THEN
        BEGIN
          KLASSE := -2;
          PLUS := 1;
        END (*IF*)
        ELSE
        BEGIN
          KLASSE := 2;
          PLUS := -1;
        END (*ELSE*);
        APPEND;
        WO := NOT WO;
      END (*FOR*);
    END (*PREPARE*);

    PROCEDURE APPEND;
    VAR
      START, STOP, GO: INTEGER;
      RUECK: BOOLEAN;
      I: INDRANGF;
    BEGIN
      NEU := [];
      RUECK := (KLASSE > 0);
      REPEAT
        WHILE (EQZ[KLASSE] = 0) DO KLASSE := KLASSE + PLUS;
        (* REPEATED AS LONG AS NO SUCCESSOR HAS BEEN FOUND *)
        (* IN THE EQUIVALENCE CLASSES PROCESSED SO FAR     *)
        IF RUECK THEN
        BEGIN
          GO := 2;
          START := ARANGQ + 1;
          WHILE (GO >= KLASSE) DO
          BEGIN
            START := START - EQZ[GO];
            GO := PRED(GO);
          END (*WHILE*);
          STOP := START - 1 + EQZ[KLASSE];
        END (*THEN*)
        ELSE
        BEGIN
          GO := -2;
          STOP := 0;
          WHILE (GO <= KLASSE) DO
          BEGIN
            STOP := STOP + EQZ[GO];
            GO := SUCC(GO);
          END (*WHILE*);
          START := STOP + 1 - EQZ[KLASSE];
        END (*ELSE*);
        FOR I := 1 TO ARANGF DO
          IF (I IN FERTIG) THEN
            IF ABSCHLUSS THEN RINGSCHLUSS(I, START, STOP)
            ELSE EXTEND(I, START, STOP);
        KLASSE := KLASSE + PLUS;
      UNTIL NEU <> [];
      FERTIG := NEU;
    END (*APPEND*);

The actual string processing is done by FILLIN under the supervision of the above procedure and of EXTEND:

    PROCEDURE EXTEND(INARBEIT: INDRANGF; START, STOP: INTEGER);
    VAR
      LAUF: INTEGER;
    BEGIN
      FOR LAUF := START TO STOP DO
      BEGIN
        IF (RM[LAUF].IR = LINNOT[INARBEIT].RSTRING[STRINGLENGTH]) AND
           NOT (RM[LAUF].JR IN LINNOT[INARBEIT].SETEL)
          THEN FILLIN(RM[LAUF].JR, INARBEIT);
        IF (RM[LAUF].JR = LINNOT[INARBEIT].RSTRING[STRINGLENGTH]) AND
           NOT (RM[LAUF].IR IN LINNOT[INARBEIT].SETEL)
          THEN FILLIN(RM[LAUF].IR, INARBEIT);
      END (*FOR*);
    END (*EXTEND*);

    PROCEDURE FILLIN(ZUSATZ: INDRANGR; INARB: INDRANGF);
    VAR
      PLATZ: INTEGER;
      K: INTEGER;
    BEGIN
      PLATZ := 1;                           (* SEARCH FOR A FREE RECORD *)
      WHILE (PLATZ IN (NEU + FERTIG)) DO PLATZ := SUCC(PLATZ);
      FOR K := 1 TO STRINGLENGTH DO
        LINNOT[PLATZ].RSTRING[K] := LINNOT[INARB].RSTRING[K];
      LINNOT[PLATZ].RSTRING[SUCC(STRINGLENGTH)] := ZUSATZ;
      LINNOT[PLATZ].SETEL := LINNOT[INARB].SETEL + [ZUSATZ];
      NEU := NEU + [PLATZ];                 (* MARK THE RECORD AS OCCUPIED *)
    END (*FILLIN*);

Table 1. [Reaction schemes with their R-matrices, string notations and canonical atom sequences; the table body is not legible in this copy.]

The main program then sets the parameters for ring closure in block 5 which cause APPEND to call RINGSCHLUSS (instead of EXTEND):

    PROCEDURE RINGSCHLUSS(J: INDRANGF; START, STOP: INTEGER);
    VAR
      LAUF: INTEGER;
    BEGIN
      FOR LAUF := START TO STOP DO
      BEGIN
        IF (RM[LAUF].IR = LINNOT[J].RSTRING[STRINGLENGTH]) AND
           (RM[LAUF].JR = LINNOT[J].RSTRING[1])
          THEN FILLIN(RM[LAUF].JR, J);
        IF (RM[LAUF].JR = LINNOT[J].RSTRING[STRINGLENGTH]) AND
           (RM[LAUF].IR = LINNOT[J].RSTRING[1])
          THEN FILLIN(RM[LAUF].IR, J);
      END;
    END;

The subsequent compilation of the higher off-diagonals and the main diagonal into substrings yielding the linear notation is trivial, since the sequence of indices is determined after block 5.

RESULTS

This algorithm can be used for finding the longest alternating path or cycle in any graph with weighted edges. The particular properties of the R-matrices of 'ordinary' organic reactions, however, endow it with a particularly outstanding efficiency. Note that the R-matrix is a representation of the valence electron shifting pattern of a reaction. Now, in ordinary organic reactions, the valence electrons are shifted along one connected path, at least formally. Approximately n/2 bonds are broken and n/2 bonds are made in a reaction core consisting of n atoms. There exist about n possible numberings of the atom vector which result in the canonical R-string.

Matrices with all off-diagonal entries having an equal value $r_{ij}$ would represent the worst case: there only exists one equivalence class. Therefore, the algorithm would have to execute a complete enumeration of all n! possible numberings of the atom vector. The execution of the algorithm would start with n(n - 1) strings consisting of the indices i and j of all n(n - 1) off-diagonal entries of the matrix. In the subsequent step, there exist n - 2 equivalent possibilities for appending an element on each of these strings. In n steps, therefore, we arrive at the upper bound of n! possibilities. Chemical reactions in which only free electrons are shifted would be an example for those matrices. No bonds are made or broken, all off-diagonal entries $r_{ij}$ (i ≠ j) are equal to zero. But matrices of that type (with a reaction core greater than 2) do not represent common organic reactions.

In a sample run of 840 different reaction matrices evenly distributed over ranks n = 4, 5, 6 and 7, the total processing time was 8 CPU-seconds on a CDC-CYBER 175 system under the NOS-2 operating system. Since the processing time includes some I/O-operations, a typical computation time for an R-matrix is estimated to be well below 0.01 CPU-sec. On a DEC PDP-11/45 under RT-11, the FORTRAN-version of the algorithm required approx. 1 CPU-second to canonize an R-matrix of rank 7. By comparison, the same problem treated by exhaustive enumeration consumed 100 CPU-seconds.
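For comparison with the build-up procedure, the exhaustive enumeration mentioned above can be sketched in a few lines of Python. This is an illustrative baseline written for this text, not the authors' reference program and not the code behind the quoted timings: it generates all n! numberings, computes substringC2 for each according to rule C2, and keeps the lexicographically greatest. The function name and the dictionary encoding of the example R-matrix (as read from the manual example) are assumptions.

```python
from itertools import permutations

def canonical_by_enumeration(atoms, bonds):
    """Return the numbering with the lexicographically greatest substringC2 (all n! tried)."""
    def r(a, b):
        return bonds.get((a, b), bonds.get((b, a), 0))

    n = len(atoms)

    def c2(order):
        # Elements (-1)^k * r_kj along the circular side diagonals m = 1 .. n-1.
        return [(-1) ** k * r(order[k - 1], order[(m + k - 1) % n])
                for m in range(1, n) for k in range(1, n + 1)]

    return "".join(max(permutations(atoms), key=c2))

FORWARD = {("A", "B"): -2, ("D", "E"): -1, ("A", "D"): 1, ("A", "C"): 1, ("C", "E"): 1}
REVERSE = {pair: -value for pair, value in FORWARD.items()}

print(canonical_by_enumeration("ABCDE", FORWARD))
print(canonical_by_enumeration("ABCDE", REVERSE))
```

For the five-atom example this inspects 120 numberings, whereas the stepwise build-up never holds more than a handful of strings at a time; the gap widens factorially with the rank.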

The algorithm is currently in use with various investigations concerning the generation of complete sets of R-matrices (Table 1) under given constraints, composition and dissection of complex reaction schemes into elementary ones etc. Its most outstanding practical application is a documentation system for chemical reactions (v. Scholley, 1981), where it is used to canonize hundreds of input data as well as the users’ queries for retrieval in a reaction data bank.

Acknowledgement-We thank Barbara Straupe for help in writing and testing the PASCAL code. This work was supported by Bundesminister für Forschung und Technologie (grant PT 631.01 and 631.02, Gesellschaft für Information und Dokumentation, Frankfurt).

REFERENCES

Arens, J. F. (1979a), Rec. Trav. Chim. Pays-Bas 98, 155.
Arens, J. F. (1979b), Rec. Trav. Chim. Pays-Bas 98, 395.
Arens, J. F. (1979c), Rec. Trav. Chim. Pays-Bas 98, 471.
Brandt, J., Bauer, J., Frank, R. M. & von Scholley, A. (1981), Chemica Scripta 18, 53.
Burkard, R. E. (1975), Z. f. Operat. Res. 19, 183.
Burkard, R. E. (1976), Mathematisches Institut der Universität Köln, Report 76-2.
Burkard, R. E. & Derigs, U. (1980), Mathematisches Institut der Universität Köln, Lecture Notes in Econom. and Math. Syst. Vol. 184, 148 pp. Berlin, Springer.
Gilmore, P. C. (1962), J. Soc. Ind. Appl. Math. 10, 305.
Jochum, C. (1978), Dissertation, T.U. München.
Lawler, E. L. (1963), Management Sci. 9, 586.
Pierce, J. F. & Crowston, W. B. (1971), Nav. Res. Log. Quart. 18, 1.
Randić, M. (1975), J. Chem. Inform. Comput. Sci. 15, 105.
Randić, M. (1977), J. Chem. Inform. Comput. Sci. 17, 171.
Randić, M., Brissey, G. M. & Wilkins, C. L. (1981), J. Chem. Inform. Comput. Sci. 21, 52.
von Scholley, A. (1981), Dissertation, T.U. München.
Dugundji, J. & Ugi, I. (1973), Topics Curr. Chem. 39, 19.
Ugi, I., Bauer, J., Brandt, J., Friedrich, J., Gasteiger, J., Jochum, C. & Schubert, W. (1979a), Informal Commun. Math. Chem. 6, 159.
Ugi, I., Bauer, J., Brandt, J., Friedrich, J., Gasteiger, J., Jochum, C. & Schubert, W. (1979b), Angew. Chem. 91, 99.
Ugi, I., Bauer, J., Brandt, J., Friedrich, J., Gasteiger, J., Jochum, C. & Schubert, W. (1979c), Angew. Chem. Int. Ed. 18, 111.