0097-8485/84 $3.00 + .00 Pergamon Press Ltd.
A LINEAR CHEMICAL NOTATION
G. M. CÔME and C. MULLER
Département de Chimie Physique des Réactions, Laboratoire Associé au CNRS No. 328, Université de Nancy I et Institut National Polytechnique de Lorraine, 1, rue Grandville, 54042 Nancy Cedex, France

and

P. Y. CUNIN† and M. GRIFFITHS‡
Centre de Recherches en Informatique de Nancy, Laboratoire Associé au CNRS No. 262, Université de Nancy I et Institut National Polytechnique de Lorraine, BP 239, 54506 Vandoeuvre les Nancy Cedex, France

(Received 28 November 1983)

Abstract: A linear notation for chemical compounds is proposed which allows input of chemical formulae into the computer from standard devices. The notation is unambiguous, but not canonical. This leads to a simple grammar which is easy to learn to read or to write.

INTRODUCTION
To use a computer successfully, a physical chemist must have available an external representation of chemical components and reactions which allows easy input-output on alphanumeric or graphic terminals. In general, such representations may be in one, two or three dimensions.

Of course, three-dimensional representations are two-dimensional projections with facilities like shadows or hidden line properties, which allow the user to visualise the relative positions of different atoms in space. They are really photographs taken from different angles of classical molecular models expressed as spheres linked by rods or pressed together. Obtaining different perspectives of a molecule requires a graphic terminal with specialised software.

Two-dimensional representations of formulae are often used. These may be developed, in which case all atoms and bonds are shown, or semi-developed, in which bonds corresponding to normal atomic valence do not figure explicitly. Such representations need graphics terminals, but the software is less complicated than in the preceding case. Note that two-dimensional representations allow stereochemical information to be transmitted by the use of forward (wedge) or backward (hashed) bond symbols.

In contrast with the above cases, linear, or one-dimensional, representations require only alphanumeric terminals and no special software. Input of formulae is direct, since they are considered to be character strings. The best-known linear notation is that of Wiswesser (Smith & Baker, 1975). Its main aim is to obtain a canonical representation of molecules, that is, each molecule has one and only one formula. This requires a large number of coding rules and a complicated grammar, which means a long period of apprenticeship to learn the code.

In order to use a simple linear notation, it is convenient that, at least in input, it need not be canonical, while remaining unambiguous. This means that a given representation corresponds to a unique chemical object, although each object may have several different representations. It remains possible to obtain automatically a canonical representation of a molecule from an unambiguous formula, for example with the algorithm of Morgan (1965), used by Chemical Abstracts. This paper proposes a linear, unambiguous chemical notation, which is not canonical, but which requires only a standard alphanumeric terminal, and which is easy to learn to read or write.
1. NOTATION FOR ATOMS AND BONDS
In classical chemical notation, atomic symbols consist either of one capital letter, or of a capital followed by a small letter. On terminals which do not accept lower case, upper case pairs are surrounded by apostrophes. Thus “CO” represents the cobalt atom, whereas CO is carbon monoxide. In addition, this allows the definition of atom-like objects, which are not changed during a reaction. This notion is used as a shorthand to describe formulae such as “PHI” for benzene, for formal reaction mechanisms, for isotopes such as “C14”, or for Markush formulae.

Electric charges are represented by + or -, and free electrons on atoms or free radicals by ., preceded if need be by an integer giving their number, and enclosed in brackets after the corresponding atom. For example:

For:        Cl.         Fe3+
We write:   “CL”(.)     “FE”(3+)

Bonds between atoms can be single, double, triple, or benzenoid. They are written respectively /, //, /// and &.

Bonds towards the front or the back of the plane can also be indicated by dedicated bond symbols.

†Present address: Laboratoire d'Informatique, Université de Dijon, 214, rue de Mirande, 21000 Dijon, France.
‡Present address: Laboratoire d'Informatique pour les Sciences de l'Homme-CNRS, 31, chemin Joseph Aiguier, 13402 Marseille Cedex 9, France.
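The atom conventions of this section can be illustrated with a short Python sketch that splits one atom token into its symbol and its bracketed charge or free-electron annotation. The function name `parse_atom`, the returned dictionary, and the assumption that a quoted name is a capital letter followed by capitals or digits between apostrophes are ours, not the paper's.

```python
import re

# Hypothetical helper (not from the paper): one atom token such as 'FE'(3+).
ATOM_RE = re.compile(
    r"""
    (?P<symbol>[A-Z]|'[A-Z][A-Z0-9]*')        # bare capital, or apostrophe-quoted name
    (?:\((?P<count>\d*)(?P<kind>[+\-.])\))?   # optional (n+), (n-) or (n.)
    """,
    re.VERBOSE,
)


def parse_atom(token):
    """Return symbol, net charge and number of free electrons for one atom token."""
    m = ATOM_RE.fullmatch(token)
    if m is None:
        raise ValueError(f"not a single atom token: {token!r}")
    symbol = m.group("symbol").strip("'")
    kind = m.group("kind")
    if kind is None:
        return {"symbol": symbol, "charge": 0, "free_electrons": 0}
    n = int(m.group("count") or 1)  # a missing integer means one
    charge = {"+": n, "-": -n, ".": 0}[kind]
    free = n if kind == "." else 0
    return {"symbol": symbol, "charge": charge, "free_electrons": free}


print(parse_atom("'CL'(.)"))   # chlorine atom with one free electron
print(parse_atom("'FE'(3+)"))  # iron with charge +3
```

Under these assumptions the unquoted string CO is rejected as a single atom, which matches the reading of CO as two atoms (carbon monoxide) rather than cobalt.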
2. NON-CYCLIC COMPONENTS

This paragraph presents a method to obtain a linear notation for a non-cyclic chemical component. The rules will be illustrated on the following example:

Fig. 1.

(a) First select any atom A1, then A2 bonded to A1 by a1, then A3 bonded to A2 by a2, ... The process stops at An. Such an atomic chain, written

A1 a1 A2 a2 A3 ... A(n-1) a(n-1) An

will be called the primary chain of origin A1. In the example, consider as primary chain the atoms numbered from 1 to 5, which is represented:

C//C/C/C/O

(b) Secondary chains are described, starting from atoms on the primary chain, in the order A1, A2, ..., An. Let b1 B1 b2 B2 ..., c1 C1 c2 C2 ... be the symbolic representations of secondary chains. They will figure in brackets after their primary atom. This corresponds to the familiar idea of atomic substituents. In the given example, the secondary chains contain the atoms numbered 9, 10, ..., 15, which leads to the following representation:

C(/H)(/H)//C(/H)/C(/C)(/C)/C(//O)/O(/H)

(c) The process is repeated as often as necessary, creating tertiary, quaternary, ... chains from secondary, tertiary, ... origins. Each subsidiary chain is put between brackets behind its origin. In the example, tertiary chains (atoms 16, 17, ..., 21) are sufficient, leading to the representation:

C(/H)(/H)//C(/H)/C(/C(/H)(/H)(/H))(/C(/H)(/H)(/H))/C(//O)/O(/H)

(d) This form is obviously complete, but not very easy to read. The following rules of simplification bring us closer to classical, linear, semi-developed, chemical notation, while remaining unambiguous:

-Repeated substituents can be enclosed in brackets and followed by an integral multiplier. In the example, we obtain:

C(/H)2//C(/H)/C(/C(/H)3)2/C(//O)/O(/H)

-A substituent consisting of a single atom can be written without the bond or brackets, which are then implicit. A valency table indicates the type of bond corresponding to these implicit bonds. In the example, this rule reduces the formula to:

CH2//CH/C(/CH3)2/CO/OH

-Simple bonds after brackets can be left out. This leads to the following form for the example, simplified as far as possible:

CH2//CH/C(CH3)2/CO/OH

Note that the fully simplified formula is close to the standard semi-developed form:

CH2=CH-C(CH3)2-COOH

3. CYCLIC COMPONENTS

The rules for cyclic components will be illustrated by the treatment of the following example:
Fig. 2.

(a) Atoms outside cycles or which are not on chains bonding cycles are ignored during the first part of the derivation. This is achieved by eliminating the atoms of connectivity equal to 1, then repeating the elimination on the new molecule, and so on. In the example, the molecule is reduced to:

Fig. 3.

(b) For describing a primary cycle, first select any atom A1 of the cycle, then A2 bonded to A1 by a1, then A3 bonded to A2 by a2, ..., finally A1 bonded to An by an. Closing the cycle calls for a labelling of atom A1. This is achieved by putting a label (i.e. a whole number), preceded by #, into parentheses at the right of atom A1. At the end of the cycle, where A1 is found again, A1 is replaced by its label. In the example, consider the 6-membered cycle at the left of Fig. 3 beginning at carbon atom number 1, which leads to:

C(#1)&C(#2)&C&C&C&C&1

The reason why a second carbon atom has been labelled will soon appear.

(c) Secondary cycles are written in the same way, by labelling the atoms beginning and closing a new atom chain. Consider the secondary cycle of the example, beginning and ending on atoms number 1 and 2 respectively. It is written:

1&C&C(#3)&C&C&2

(d) Higher-order cycles are similarly treated until the atoms are all described. The last chain of the example is written:

3/C/C(#4)/C//C/S/C//4

(e) The atom chains are written one behind the other, separated by commas. The example is written:

C(#1)&C(#2)&C&C&C&C&1,
1&C&C(#3)&C&C&2,
3/C/C(#4)/C//C/S/C//4

(f) The complete formula (Fig. 2) is then reconsidered in order to add the non-cyclic elements as in the previous paragraph. Non-cyclic chains are placed after the point of attachment. For the example, the final form is:

C(#1)&C(#2)&CH&CH&CH&CH&1,
1&CH&C(#3)&CH&CH&2,
3/CH(CH3)/C(#4)/C(CHO)//CH/S/CH//4

4. DISCUSSION AND CONCLUSIONS

The grammar which defines the allowed description of chemical components as linear formulae is described in the appendix. It is deliberately limited to the more common cases in organic chemistry. Extension towards structures such as complex ions, chelates, polymers, catenanes, ... leads to no particular difficulties.

Some particular points about the chemical language should be noted. A formula such as CH5, with five hydrogen atoms simply bonded to one carbon atom, is accepted. The formula H2 is refused, the hydrogen molecule being written H/H or HH. The methyl free radical must be written CH3(.), and not .CH3, and so on. Markush formulae can easily be written, as can sub-structures, which is useful for documentary research. The use of labelling is kept at a minimum and essentially reduced to polycyclic atoms. This makes it very easy to correct formulae or to edit families of analogous compounds.

Similar notations have been proposed by Edelson (1976) and Kirby and Morgan (1978). These appear, however, to be more primitive than the notation described here.

In conclusion, the linear chemical notation described here is unambiguous, since it describes all inter-atom bonds, but it is not canonical, since a given component may be described in several different ways. This latter property simplifies learning of the notation both for reading and writing chemical formulae.

The notation has been used in the laboratory for different projects: a compiler for kinetics descriptions (Alran et al., 1979), modelling and mechanistic simulation of complex radical reactions (Azay, 1981), automation of writing reactions (Haux-Vogin, 1982), thermodynamic data bases (Klai, 1982).

APPENDIX

The grammar

The grammar which follows is written in Backus normal form; elements which are themselves defined in terms of others are enclosed in pairs of < >. The empty string is designated by Λ.

<Formula> ::= <Group> <Rest of formula>
<Group> ::= <Atom> <Substituent list>
<Rest of formula> ::= , <Label> <Bond> <Formula> | Λ
<Atom> ::= <Letter> | '<Letter> <Atom rest>'
<Label> ::= <Integer>
<Substituent list> ::= <Substituent start> <Possible multiplier> <Substituent list> | <Bond> <Bond rest> | Λ
<Bond> ::= / | // | /// | & | ...
<Letter> ::= A | B | ... | Z
<Atom rest> ::= <Letter> <Atom rest> | <Figure> <Atom rest> | Λ
<Substituent start> ::= <Atom> | (<Substituent>)
<Possible multiplier> ::= <Integer> | Λ
<Bond rest> ::= <Label> | <Group>
<Substituent> ::= <Possible multiplier> <Substituent rest> | <Possible bond> <Group> | # <Label>
<Integer> ::= <Figure> <Integer rest>
<Integer rest> ::= <Figure> <Integer rest> | Λ
<Figure> ::= 0 | 1 | 2 | ... | 9
<Substituent rest> ::= + | - | .
<Possible bond> ::= <Bond> | Λ
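One way to experiment with the grammar is a small recursive-descent recogniser. The Python sketch below is only a reading of the productions above, under several assumptions of ours: only the bond symbols /, //, /// and & are handled, blanks are ignored, chains after the first are introduced by a comma followed by a label and a bond, and the names `Parser` and `is_valid` are hypothetical. It checks syntax only and says nothing about valency.

```python
import re

# Only four bond symbols are handled; longest first so "//" is not read as "/".
BONDS = ("///", "//", "/", "&")


class Parser:
    def __init__(self, text):
        self.s = text.replace(" ", "")  # assumption: blanks carry no meaning
        self.i = 0

    def peek(self):
        return self.s[self.i] if self.i < len(self.s) else ""

    def fail(self, what):
        raise SyntaxError(f"expected {what} at position {self.i}")

    def integer(self):  # <Integer> ::= <Figure> <Integer rest>
        m = re.compile(r"\d+").match(self.s, self.i)
        if m is None:
            self.fail("an integer")
        self.i = m.end()
        return int(m.group())

    def bond(self):  # returns None when no bond symbol is present
        for b in BONDS:
            if self.s.startswith(b, self.i):
                self.i += len(b)
                return b
        return None

    def atom(self):  # <Atom> ::= <Letter> | '<Letter> <Atom rest>'
        m = re.compile(r"[A-Z]|'[A-Z][A-Z0-9]*'").match(self.s, self.i)
        if m is None:
            self.fail("an atom")
        self.i = m.end()
        return m.group().strip("'")

    def group(self):  # <Group> ::= <Atom> <Substituent list>
        node = {"atom": self.atom(), "subs": [], "next": None}
        while self.peek() in ("(", "'") or "A" <= self.peek() <= "Z":
            if self.peek() == "(":
                self.i += 1
                sub = self.substituent()
                if self.peek() != ")":
                    self.fail("')'")
                self.i += 1
            else:
                sub = ("atom", self.atom())  # single atom, implicit bond
            mult = self.integer() if self.peek().isdigit() else 1
            node["subs"].append((sub, mult))
        b = self.bond()
        if b is not None:  # <Bond rest> ::= <Label> | <Group>
            target = self.integer() if self.peek().isdigit() else self.group()
            node["next"] = (b, target)
        return node

    def substituent(self):
        if self.peek() == "#":  # label definition, e.g. (#1)
            self.i += 1
            return ("label", self.integer())
        m = re.compile(r"(\d*)([+\-.])").match(self.s, self.i)
        if m:  # charge or free electrons, e.g. (3+) or (.)
            self.i = m.end()
            return ("mark", int(m.group(1) or 1), m.group(2))
        return ("chain", self.bond(), self.group())  # e.g. (/CH3) or (CH3)

    def formula(self):  # first chain, then ", label bond chain" repeated
        chains = [(None, None, self.group())]
        while self.peek() == ",":
            self.i += 1
            label = self.integer()
            b = self.bond()
            if b is None:
                self.fail("a bond")
            chains.append((label, b, self.group()))
        if self.i != len(self.s):
            self.fail("end of formula")
        return chains


def is_valid(text):
    """True when the string is accepted by this reading of the grammar."""
    try:
        Parser(text).formula()
        return True
    except SyntaxError:
        return False


print(is_valid("CH3(.)"))  # True
print(is_valid("H2"))      # False
```

The recogniser accepts strings such as CH9, in line with the remark that valency is not part of the syntax, and it accepts the three-chain cyclic example of Section 3 when the chains are given on one line separated by commas.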
Use of the grammar is illustrated by the following examples of syntax analysis:

(a) CH3(.)

Formula
  Group
    Atom
      Letter: C
    Substituent list
      Substituent start
        Atom
          Letter: H
      Possible multiplier
        Integer
          Figure: 3
          Integer rest: Λ
      Substituent list
        Substituent start
          ( Substituent )
            Possible multiplier: Λ
            Substituent rest: .
        Possible multiplier: Λ
        Substituent list: Λ
  Rest of formula: Λ

Since the tree is complete, CH3(.) is a valid formula.

(b) H2

Formula
  Group
    Atom
      Letter: H
    Substituent list: ?
  Rest of formula
“Substituent list” starts either with “Substituent start” or with “Bond”. The integer “2” corresponds to neither of the allowed possibilities. Hence H2 is not a legal formula.

Note that formulae such as CH9 or CH2/CH2 are valid syntactic strings, and must be eliminated by semantic considerations. In particular, questions of valency are purely semantic, and do not figure in the syntax.

The opposite problem is that of formulae which have a meaning for a chemist but are rejected by the grammar, such as trivial or systematic names. In some cases, the use of “super-atoms” is a practical solution, as in, for example, “C6H5”/CH3 for toluene.
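The split between syntax and semantics can be made concrete with a toy valency check applied after parsing. The Python sketch below is an assumption-laden illustration: the valency table holds the usual textbook valences (the paper does not reproduce its own table), benzenoid bonds are left out, and `valency_ok` is a hypothetical name.

```python
# Assumed valency table with usual valences; not taken from the paper.
VALENCY = {"H": 1, "C": 4, "N": 3, "O": 2}
BOND_ORDER = {"/": 1, "//": 2, "///": 3}  # benzenoid "&" deliberately not handled


def valency_ok(atom, bonds, free_electrons=0):
    """True when the bond orders around `atom`, plus its free electrons, equal its valence."""
    used = sum(BOND_ORDER[b] for b in bonds) + free_electrons
    return used == VALENCY[atom]


# Carbon of CH9: nine single bonds to hydrogen.
print(valency_ok("C", ["/"] * 9))                       # False
# Either carbon of CH2/CH2: two hydrogens and one single bond.
print(valency_ok("C", ["/", "/", "/"]))                 # False
# Either carbon of CH2//CH2: two hydrogens and one double bond.
print(valency_ok("C", ["/", "/", "//"]))                # True
# Carbon of CH3(.): three hydrogens and one free electron.
print(valency_ok("C", ["/"] * 3, free_electrons=1))     # True
```

A check of this kind would reject CH9 and CH2/CH2 even though both are accepted syntactically.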
REFERENCES

Alran, D., Côme, G. M., Cunin, P. Y. & Griffiths, M. (1979), Comput. Chem. Engng 3, 87.
Azay, P. (1981), Thesis, Nancy.
Edelson, D. (1976), Comput. Chem. 1, 29.
Haux-Vogin, L. (1982), Thesis, Nancy.
Kirby, G. H. & Morgan, C. H. (1978), Comput. Chem. 2, 95.
Klai, S. E. (1982), Post-Graduate diploma, Nancy.
Morgan, H. L. (1965), J. Chem. Doc. 5, 107.
Smith, E. G. & Baker, P. A. (1975), The Wiswesser Line-Formula Chemical Notation (WLN), 3rd Edn, New Jersey, CIMI.