In-house chemical databases at Imperial Chemical Industries

G W Adamson, J M Bird, G Palmer and W A Warr*

ICI Pharmaceuticals Division, Alderley Park, Macclesfield, Cheshire SK10 4TG, UK

ICI has mounted several online databases for end-user access under its new graphics system, Sapphire. Sapphire uses the Maccs® software for chemical structure registration and searching. This paper concentrates on those databases which use Maccs software for both structure and data handling, rather than on the larger Company Compound Centre database, which will use Maccs in conjunction with a relational database management system.

Keywords: chemical database, relational database management system, chemical structure registration

Received 3 February 1986, accepted 10 February 1986

*This paper was presented by Dr W A Warr before the Division of Chemical Information, 190th National Meeting of the American Chemical Society, Chicago, September 8-13, 1985

ICI is presently developing a system called Sapphire (Structures and Properties Produced by Helpful Interactive Rapid Enquiry), a user-friendly, interactive system for the storage and rapid retrieval of chemical structures and related property data¹,². The system will have interfaces to other ICI systems such as molecular modelling, reaction design and biological data handling and even, it is hoped eventually, to systems external to ICI for handling information from the scientific literature.

Within Sapphire, chemical structure registration and searching are handled by Maccs software licensed from Molecular Design Limited (MDL)³. The authors have also licensed the MDL product Maccslib, a library of routines that allows a user to write FORTRAN programs which can access a Maccs database. ICI has also licensed the MDL programs Layout and Dataccs-I. Layout converts Maccs connection tables to structure coordinates. Dataccs incorporates, amongst other things, the function previously provided in a separate program, Molrst (‘Molrestore’), to update a Maccs database non-interactively.

Before the advent of Sapphire, chemical information at ICI was handled by the Crossbow system⁴⁻¹⁰, with Wiswesser Line Notation (WLN) as the tool for structure representation. In Sapphire, structures are stored as Maccs connection tables and Sema names (Stereochemically Extended Morgan Algorithm). Existing WLN records had to be converted to connection tables, and this was carried out using the Daring routines from Fraser-Williams Scientific Systems.

HARDWARE

Sapphire is run on a Vax 11/785. The commonest user terminal is the DEC VT100-plus-Retrographics known as a DQ650M or VT640 (or an IBM-PC emulating this terminal), but for registration and conversion procedures ICI Pharmaceuticals Division uses two IMLAC terminals. In order to input large numbers of structures rapidly, a high resolution terminal with a large screen and almost instantaneous responses to bond deletions, reorientations and so on is advantageous. A good ‘drawer’ can input in excess of 30 average structures per hour.

USAGE

Over 40 terminals have been installed specifically for Sapphire in the Pharmaceuticals Division, and in June 1985, 75 different users actually accessed the system. Over the company as a whole, several hundred users have been trained and the number of sessions per month runs into thousands. The Sapphire databases at Alderley Park can be accessed from at least seven locations in the UK as well as from ICI Pharma in France. ICI Americas have established a link to the UK database, but also have their own copy of the Maccs software. The main users of Sapphire are Pharmaceuticals, Plant Protection and Organics Divisions.

DATABASES AVAILABLE UNDER SAPPHIRE
The following databases have been converted to Maccs:

a) 340 000 compounds from the Company Compound Centre (CCC) database with no related data at present other than a numerical key.
b) The Cambridge Crystallographic Database and the related data.
c) The Hansch (Pomona College) physical chemistry database of structures and related log P data.
d) An internal ICI Hansch-type database of compounds with measured log P values.
e) A database of compounds available commercially with supplier data and chemical names (CAOCI, the Commercially Available Organic Compounds Index, a misnomer since it also contains inorganics).
f) The structures and locations of chemicals on shelves in laboratories and stores (‘Labstock’).
g) A small number of more specialized databases.

The CCC database has always been the main object of the Sapphire project. However, this paper discusses some of the ‘lesser’ databases. These have one thing in common which differentiates them from the CCC database: for databases b) to g), both chemical structures and related data are handled by Maccs, whereas for the CCC database a relational database management system will be used for sample and biological data.

Volume 4 Number 3 September 1986 0263-7855/86/030165-05 $03.00 © 1986 Butterworth & Co (Publishers) Ltd 165

CONVERSION FROM WLN

Apart from the Cambridge Crystallographic Database, and one or two minor specialized databases, all the chemical structural data was already available as WLNs at the start of the Sapphire project. The main programs used in the conversion exercise were Ctgen (written by ICI) and Layout and Dataccs (supplied by MDL). Ctgen calls on the Daring routines to convert a WLN to a connection table. Ctgen also makes the connection table suitable for input to Layout, which generates structure coordinates. Dataccs is used to update a Maccs database.

The functions of Ctgen are to:

a) Read WLNs, suffixes and reference numbers from the database to be converted.
b) Use the Daring subroutines to convert the WLNs to connection tables (and check molecular formulae).
c) Adjust the Daring connection table bonds for compounds with 5-bonded nitrogen (e.g. nitro groups).
d) Remove dative (→) bonds from chelates.
e) Remove the Daring special ferrocene bond.
f) Flag as errors compounds with more than 256 non-hydrogen atoms or bonds.
g) Convert the Daring connection table format to Layout connection table format.
h) Produce an SD file suitable for input to Layout.
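Two of the Ctgen functions listed above, removing dative bonds (step d) and flagging oversized compounds (step f), can be illustrated with a minimal Python sketch. It is not the ICI program: the class names, the bond labels and the `screen` helper are hypothetical, and the limit of 256 is read here as applying both to non-hydrogen atoms and to bonds between them, which is an assumption about the original wording. The Daring conversion itself is not reproduced.

```python
from dataclasses import dataclass, field

MAX_HEAVY = 256  # limit quoted in step f)


@dataclass
class Bond:
    a: int      # index of the first atom
    b: int      # index of the second atom
    kind: str   # hypothetical labels: "single", "double", "dative", ...


@dataclass
class ConnectionTable:
    ref_no: str
    atoms: list                              # element symbols, e.g. ["C", "O", "H"]
    bonds: list = field(default_factory=list)


def remove_dative_bonds(ct):
    """Step d): return a copy of the table without its dative bonds."""
    kept = [b for b in ct.bonds if b.kind != "dative"]
    return ConnectionTable(ct.ref_no, list(ct.atoms), kept)


def size_error(ct):
    """Step f): True if the table has more than 256 non-hydrogen atoms,
    or more than 256 bonds between non-hydrogen atoms (assumed reading)."""
    heavy = {i for i, symbol in enumerate(ct.atoms) if symbol != "H"}
    heavy_bonds = [b for b in ct.bonds if b.a in heavy and b.b in heavy]
    return len(heavy) > MAX_HEAVY or len(heavy_bonds) > MAX_HEAVY


def screen(tables):
    """Split tables into those passed on for layout and those flagged as errors."""
    passed, flagged = [], []
    for ct in tables:
        ct = remove_dative_bonds(ct)
        (flagged if size_error(ct) else passed).append(ct)
    return passed, flagged
```

Keeping each step as a separate function mirrors the list above, so that a step such as the ferrocene adjustment could be added to `screen` without touching the others.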
After conversion was complete, all Maccs structures were scanned by information scientists. Very approximately, 5% of structures needed parts flipped or rotated. A very much smaller percentage had to be either largely or completely redrawn. Structures in the latter category were checked by a second information scientist. On some databases, compounds which have been drawn manually have a flag set in a special datatype, so that they need not be drawn again when the database is recreated/updated. Because all structures are scanned and careful checks carried out, and also because the initial WLN databases were highly accurate, it is possible to be confident about the quality of the Sapphire databases.

CAMBRIDGE CRYSTALLOGRAPHIC DATABASE

This database was established in 1965 and is a collection of results of X-ray and neutron diffraction studies of compounds published since 1935¹¹. The producer of the database is the Crystallographic Data Centre of Cambridge University in the UK. The machine-readable version consists of three files containing structures, bibliographies and crystal data. The three files were converted into a single Maccs-type database. The structural data was in the form of connection tables, not WLNs, and a modified form of Ctgen had to be written to convert the Cambridge connection table to Maccs form. Programs were also written to format the data into the required Maccs SD files/datatypes.

All the information relating to a single X-ray determination was stored together under the one structure. Some compounds have been reported in the literature more than once and their structures are duplicated in the database in order to keep the data separate. The Cambridge database contains approximately 32 000 compounds. The Maccs version used by the authors contains approximately 21 000. At the time of conversion Maccs could not handle atoms with more than 5 connections or compounds with metal π-bonds, or various inorganics and metal complexes. Some of these restrictions have now been removed.

A Maccs display of a typical structure is shown in Figure 1 and the related data in Figures 2-6. Note how the full name of the compound is given in datatype 4. Chemists prefer to see structures and data together on the same screen, so some software has been written within ICI to make this possible. Figure 7 shows such a display. The data on the left of the screen can be scrolled alongside the structure to which it relates.

Figure 1. Structure from the Cambridge Crystallographic Database in the Maccs Exec mode
Figure 2. Maccs data display from the Cambridge Crystallographic Database, first screen
Figure 3. Maccs data display from the Cambridge Crystallographic Database, second screen
Figure 4. Maccs data display from the Cambridge Crystallographic Database, third screen
Figure 5. Maccs data display from the Cambridge Crystallographic Database, fourth screen
Figure 6. Maccs data display from the Cambridge Crystallographic Database, fifth screen
Figure 7. Sapphire display of structures and data from the Cambridge Crystallographic Database

Journal of Molecular Graphics

LOG P DATABASE

The experimental log P and pKa values of a large number of chemical compounds have been collected together by Hansch and Leo¹² (at Pomona College, USA). The collection is supplied as a set of printable files on magnetic tape. It contains details of about 25 000 experimental values for about 11 000 compounds. The volume of data is currently growing at the rate of about 900 determinations every 6 months. The structures are supplied as WLNs, which ICI converts to Maccs connection tables and structure coordinates. Stereochemistry is added manually to the Maccs database where Pomona College has supplied absolute stereochemistry which can be handled by the Maccs software. Again, ICI has written software to assign the data to Maccs datatypes.

The log P and pKa values are held exactly as the ‘print’ record on the original tape, together with the solvent used in the determination. Reference and footnote keys for each entry are held alongside, preceded by a letter R or F, respectively. Entries for one compound are sorted into solvent order. Literature references and footnotes are stored in separate datatypes, together with the appropriate R or F number which relates them to the log P and pKa values. A typical structure is shown in Figure 8 and the Maccs data layout for the same compound in Figure 9. If the user calls up not Maccs itself, but ICI’s ‘Sapphire Phase II’ module (which accesses Maccs databases) he may obtain a display such as that shown in Figure 10, where data can be scrolled alongside a structure.
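The record layout described for the log P database, with each entry carrying its solvent, the value text as supplied, and R or F keys that point into separately stored references and footnotes, can be sketched in a few lines of Python. This is an illustration only: the `Entry` class, its field names and both helper functions are hypothetical, the sample values are made-up placeholders, and ‘solvent order’ is assumed here to mean plain alphabetical order.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    solvent: str        # solvent used in the determination
    print_record: str   # value text, kept exactly as supplied
    keys: tuple         # reference and footnote keys, e.g. ("R2", "F1")


def sort_by_solvent(entries):
    """Return one compound's entries in solvent order (alphabetical here)."""
    return sorted(entries, key=lambda e: e.solvent)


def resolve(entry, references, footnotes):
    """Pair each R or F key of an entry with the text it points to."""
    resolved = []
    for key in entry.keys:
        prefix, number = key[0], int(key[1:])
        if prefix == "R":
            resolved.append((key, references[number]))
        elif prefix == "F":
            resolved.append((key, footnotes[number]))
        else:
            raise ValueError(f"unknown key prefix in {key!r}")
    return resolved


# Made-up placeholder data, not taken from the Pomona College tape.
references = {2: "placeholder literature reference"}
footnotes = {1: "placeholder footnote"}
entries = [
    Entry("octanol", "LOGP 1.00", ("R2", "F1")),
    Entry("chloroform", "LOGP 0.50", ("R2",)),
]

ordered = sort_by_solvent(entries)
for e in ordered:
    print(e.solvent, e.print_record, resolve(e, references, footnotes))
```

Holding the references and footnotes in their own tables, keyed by number, follows the separation into datatypes described above: the same reference text is stored once and shared by every entry that cites it.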
COMMERCIALLY AVAILABLE ORGANIC CHEMICALS INDEX (CAOCI)

Before a chemist starts looking in catalogues for the intermediate he requires, he will check whether a sample is already on site. This he can do using the Labstock database under Maccs. If he must buy in the compound, he need not look through multiple catalogues manually, exercising his chemical nomenclature skills. He can access CAOCI under Maccs. The CAOCI and Labstock databases are similar in structure and usage (although the systems differ considerably) so we will consider only CAOCI here.

ICI still uses the term CAOCI instead of Fine Chemicals Directory (FCD), under which name Fraser-Williams and MDL market similar products. This is because there are significant differences between the products, and also because CAOCI is geared to ICI’s needs. The development of CAOCI has been described by Walker¹³. Converting the WLN-based CAOCI database to one of a Maccs type and formatting the catalogue entries into a display pleasing to the end-user involved considerable planning and programming effort. A typical catalogue-entry datatype display in Maccs is shown in Figure 11. For compounds which have more than one supplier, the catalogue records are given in alphabetical order of supplier name and reference number. A spacer record, consisting of a single full stop, indicates the end of an entry, which may be split across two display screens.

Figure 8. Structure from the Hansch log P Database in Maccs Exec mode
Figure 9. Maccs data display from the Hansch log P Database
Figure 10. Sapphire display of structures and data from the Hansch log P Database
Figure 11. Maccs data display from the CAOCI Database

CONCLUSIONS

This paper aims to show the variety of data which end-user chemists can access under Sapphire at ICI. Details of the system design and programming that were needed to produce all these facilities for users may be the subject of another publication.

REFERENCES
1 Warr, W A ‘Maccs - an ICI view’ Proc. 7th Int. Online Inf. Meeting London, UK (1983)
2 Adamson, G W et al. ‘Use of Maccs within ICI’ J. Chem. Inf. Comput. Sci. Vol 25 (1985) pp 90-92
3 Anderson, S ‘Graphical representation of molecules and substructure queries in Maccs’ J. Mol. Graph. Vol 2 (1984) pp 83-90
4 Hyde, E et al. ‘Conversion of Wiswesser notation to a connectivity matrix for organic compounds’ J. Chem. Doc. Vol 7 (1967) pp 200-204
5 Thomson, L H et al. ‘Organic search and display using a connectivity matrix derived from the Wiswesser notation’ J. Chem. Doc. Vol 7 (1967) pp 204-207
6 Hyde, E and Thomson, L H ‘Structure display’ J. Chem. Doc. Vol 8 (1968) pp 138-146
7 Eakin, D R ‘The ICI CROSSBOW system’ in Ash, J E and Hyde, E (eds) Chemical information systems Horwood, UK (1975)
8 Ash, J E ‘Connection tables and their role in a system’ in Ash, J E and Hyde, E (eds) Chemical information systems Horwood, UK (1975)
9 Eakin, D R et al. ‘The use of computers with chemical structural information: ICI Crossbow system’ Pestic. Sci. (1974) pp 319-326
10 Townsley, E E and Warr, W A ‘Chemical and biological data - an integrated online approach’ ACS Symp. Ser. (1978) No 84
11 Allen, F H et al. ‘The Cambridge Crystallographic Data Centre: computer-based search retrieval, analysis and display of information’ Acta Crystallogr. Vol B35 (1979) pp 2331-2339
12 Hansch, C ‘A quantitative approach to biological structure-activity relationships’ Acc. Chem. Res. Vol 2 (1969) pp 232-239
13 Walker, S B ‘Development of CAOCI and its use in ICI Plant Protection Division’ J. Chem. Inf. Comput. Sci. Vol 23 (1983) pp 3-5