GREDIA: A new access to GREMAS databases


Tetrahedron Computer Methodology, Vol. 2, No. 3, pp. 167 to 175, 1989

0898-5529/90 $3.00 + .00 Pergamon Press plc

Printed in Great Britain


Christa Fricke, Ingeborg Nickelsen, Robert Fugmann and Jürgen Sander*

Hoechst Aktiengesellschaft, D-6230 Frankfurt 80, West Germany

Dedicated to Prof. Wolfgang Hilger on his 60th birthday

Received 16 January 1990; Revised 16 February 1990; Accepted 16 February 1990

Key words: Topological front end; Query transfer; Fragment code; IDC data bases; End user

Abstract: GREMAS is a fragment code system for storing and retrieving specific and generic structures and reactions. Over a period of 30 years, uniform high-quantity and high-quality databases have been developed. Searching in these GREMAS databases has hitherto been restricted to experts who are familiar with the GREMAS code. Using GREDIA, a structure entered topologically can now be converted automatically to a GREMAS query formulation. This means that in future non-GREMAS experts will also be able to use the high-value GREMAS databases.

INTRODUCTION

There have been GREMAS databases at HOECHST for nearly 30 years. They were set up with the GREMAS system¹ and are to date one of the most important sources of information for HOECHST chemists. The GREMAS system is based on a strictly hierarchically structured fragment code which can encode, store and retrieve both specific and generic, i.e. generalized, structures. GREMAS can therefore be used to record both the journal and the patent literature. All these facilities were used first at HOECHST, which began to construct GREMAS databases in 1960. A few years later BAYER and BASF were also using the GREMAS system.

In 1967 the IDC INTERNATIONAL DOCUMENTATION IN CHEMISTRY was founded with the aim of constructing a powerful chemistry documentation system for the needs of the industry.² This was the start of a broadly based and systematic process, tailored to the requirements of the IDC members, of recording structures and reactions. Figure 1 shows the sources and numbers of documents that have been and are being entered in the IDC Organika database. According to the figure, this database covers preparative, low-molecular weight organic chemistry. It shows that the journal and patent literature is covered in equal proportions and back to about 1960.


The main focus of current work is shown by the ratio of 40,000 patents entered annually to 12,000 journal articles. Since 1970, the DERWENT® CPI abstracts have served as the main source in the patent sector. The generic structures and reactions reported in the patent claims are converted manually to GREMAS code. This means that the patent claims can be traced directly, without depending on the information in the patent examples.

The journal literature has been covered from 1960 onward. Since 1970, documents have been entered from ChemInform and since 1974 from the CA file. For this purpose the CA registry entries from the CAS documents in sections 21-30 relevant to the IDC are converted automatically to GREMAS, and the intellectual content is completed by comparison with ChemInform. This additional information relates, for example, to the specific structures (educts having been recorded as well as products, and identified as such, from the start) and in particular to a detailed description of the reaction, with precise identification of the reaction centers.³

In addition to the IDC Organika database, the IDC GREMAS structure registry is derived from the CAS registry file. With a few exceptions, this can be set up automatically and covers the low-molecular weight organic range including the patent examples documented by CAS.⁴ The two databases, the IDC Organika database with about 10 million generic and specific structures and about 4 million reactions and the GREMAS structure registry with about 9 million structures, are combined in the IDC data pool, a database which can be searched on-line. When this work is complete, the IDC members BASF, BAYER, CHEMIE LINZ, DEGUSSA, DYNAMIT NOBEL, HENKEL, HOECHST and HÜLS will have a uniform, high-quality and high-quantity on-line search database. This database is available for GREMAS searches using the Siemens GOLEM® system. It covers a period of 30 years and permits searches in the patent and journal literature in the low-molecular weight organic field.

Until now, these GREMAS and GOLEM databases could only be used by GREMAS experts because it takes at least six months to learn and fully master the relevant search rules. Research chemists were therefore not able to carry out GREMAS searches themselves. GREDIA now enables on-line use of the GREMAS databases to be extended to non-GREMAS experts.

GREMAS/GOLEM QUERY GENERATION

GREDIA takes on the role of the GREMAS expert and derives the GREMAS query formulation interactively from a query entered in the form of chemical structures. GREDIA, short for GREMAS dialog, uses the structural formula that chemists conventionally use as their technical language. The desired structure is drawn on the screen. For fully defined structures, the GREMAS or GOLEM query terms are generated automatically. For substructures, a query dialog is run in addition.

Figure 2 shows that GREDIA systematically uses questions that only have to be answered "yes" or "no". The questions refer successively to all substituents Rn of an atom An. A substituent Rn can be any atom, a carbon atom or a heteroatom. For each of these three possibilities, the program must consider four different environments: the defined atom An of the substructure, adjacent to Rn, can be either a carbon atom or a heteroatom, and can lie either in a ring or in a chain. The question sequences shown in Fig. 2 lead to these 12 possible cases.

Subsequent detail questions relate to rings and chains and to the type of functional groups and type of heteroatoms. They also permit, for example, chain lengths to be defined within specific limits. Statements can also be made about the degree of saturation of a chain, although the chain is not explicitly drawn. These query options do not exist in current topological systems. For each of the 12 cases there are different sequences of questions, which in turn depend on the answer to the previous question. For this reason the complex hierarchical system of questions can only be indicated here; the end user only has to answer the questions about each substituent Rn. The questions about the respective adjacent atoms An of the substructure are answered automatically.

In addition, negations relating to areas outside the substructure and additional requirements relating to specific molecule areas within the substructure are included in the query formulation. For example, information can be given as to whether a ring is to be unsubstituted or substituted, and in the latter case with or without specifying the positions of the substitution. Information about undefined substitutions in particular cannot be represented in topological systems.

To generate GREMAS/GOLEM query terms, the relevant GREMAS rules were first implemented as program code. The "hetero-orientation" of carbon atoms, known to chemists for over 150 years from the Beilstein handbook, is the first classification criterion and occupies first position in the fragment code. This code consists of 3 letters for each carbon atom and is structured hierarchically. The second position carries information about the type and environment of the heteroatoms. In the third position, bonds to other carbon atoms are described. In addition, rings and ring systems are classified on the basis of a ring perception and coded by strictly defined rules. Thus the GREMAS fragment code records the essential structural features systematically and reproducibly.

The IDC Organika database currently contains more than 6,000 different structural features and their associated codes. The theoretically possible number is much higher than this, which means that any structure can be represented in GREMAS without changing and extending the GREMAS code. A syntax is used to define which fragments occur in common in a carbon chain or in a ring, which rings are linked together and which fragments occur more than once in a chain or ring. This results in any number of combinations of these 6,000 terms. It is only this syntax that enables adequate precision and sufficiently great differentiation of structures.
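The 12 cases of the query dialog follow from simple counting: three possibilities for the substituent Rn, multiplied by two possible elements and two possible positions for the adjacent atom An. The short sketch below only illustrates this enumeration. The label strings and function names are assumptions chosen for illustration; the paper does not give GREDIA's internal names, and the ordering of the cases is arbitrary.

```python
from itertools import product

# Illustrative labels only; the paper does not give GREDIA's internal names.
SUBSTITUENT_KINDS = ("any", "carbon", "heteroatom")  # what the substituent Rn may be
ANCHOR_ELEMENTS = ("carbon", "heteroatom")           # element of the adjacent atom An
ANCHOR_POSITIONS = ("ring", "chain")                 # where the atom An lies


def dialog_cases():
    """Enumerate every (Rn kind, An element, An position) combination."""
    return list(product(SUBSTITUENT_KINDS, ANCHOR_ELEMENTS, ANCHOR_POSITIONS))


def case_number(rn_kind, an_element, an_position):
    """Return a 1-based index for a case; the ordering itself is arbitrary."""
    return dialog_cases().index((rn_kind, an_element, an_position)) + 1


if __name__ == "__main__":
    cases = dialog_cases()
    print(len(cases))  # 3 kinds x 2 elements x 2 positions
    # A carbon substituent on a carbon atom that lies in a chain:
    print(case_number("carbon", "carbon", "chain"))
```

Each of these cases would then select its own sequence of detail questions; those sequences are not reproduced here because the paper only indicates them.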
For query term generation it must also be noted that GREMAS terms can be influenced simultaneously by different substituents Rn. The environment of the substituent Rn and that of the defined adjacent atom An play a crucial role here. In some circumstances, therefore, the GREMAS terms for an atom An may change as a result of the effect of a more distant atom in the substituent Rn. In accordance with pragmatic considerations, the generation of terms that would simulate details not present in the document is also consistently prevented. For example, "phenyl" is not simply assumed for "aryl"; the more general term is generated instead of the specific one. This transition from specific to more general structural features is enabled by the hierarchical system on which the GREMAS code is based. This method of generating GREMAS query terms clearly shows the effort that must be made to learn the GREMAS terms if the system is to be used to its full potential, and what a great help a system like GREDIA is in this work for experts too.

EXAMPLES AND RESULTS

The herbicide phosphinothricin will now be used with three different substituents R1 as an example to illustrate the different dialog sequences in GREDIA. Following the path marked in bold in Fig. 2, one first arrives in Fig. 3 at substituent R1, which in this case is a carbon atom and has as its adjacent atom A1 a carbon atom lying in a chain. Questions about the ring details, chain details and functional group details are then asked for the three different substituents R11, R12 and R13. For R11, questions must be answered about the type of ring and the type of ring fusion. For the hydrocarbyl group R12, the dialog produces statements about the number of carbon atoms and the number of double and triple bonds. The example of the acyl group R13 shows how the heterofunctionality, the type of heteroatom and the number of hydrogen atoms together identify a ketone.

If the questions are not answered in accordance with Fig. 3, the program goes to another part of the query dialog. The examples given here can therefore only indicate the variety and multiplicity of the query dialog. The examples show why interactive user guidance was the method chosen: the questioner is directed to options that he had not considered, but which are important for the query formulation and hence for low noise.

[Fig. 3: structural formula of phosphinothricin bearing the substituent R1, with three alternative definitions of R1]

R11: isolated cycloaliphatic group
R12: saturated alkyl group, 1-5 C atoms
R13: acyl group

Ring details R11: heterocyclic? no; aromatic? no; cycloaliphatic? yes; additional details: fused ring? no

Chain details R12: min. number C-atoms? 1; max. number C-atoms? 5; number double bonds? 0; number triple bonds? 0

Group details R13: heterofunctionality? 2; type of heteroatom? O; number H-atoms? 0

Fig. 3. Determining the GREMAS/GOLEM query formulation for phosphinothricin with different substituents R1
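The answers listed in Fig. 3 can be thought of as small records, one per kind of detail dialog. The following sketch simply stores those answers and checks a concrete chain against the R12 limits. All class and field names are hypothetical, and reading "number double bonds? 0" as "exactly zero" is an assumption; the sketch does not generate GREMAS or GOLEM terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RingDetails:
    heterocyclic: bool
    aromatic: bool
    cycloaliphatic: bool
    fused: bool


@dataclass(frozen=True)
class ChainDetails:
    min_carbons: int
    max_carbons: int
    double_bonds: int
    triple_bonds: int

    def accepts(self, carbons, double_bonds=0, triple_bonds=0):
        """True if a concrete chain satisfies the recorded answers."""
        return (
            self.min_carbons <= carbons <= self.max_carbons
            and double_bonds == self.double_bonds   # assumed to mean "exactly"
            and triple_bonds == self.triple_bonds   # assumed to mean "exactly"
        )


@dataclass(frozen=True)
class GroupDetails:
    heterofunctionality: int
    heteroatom: str
    hydrogens: int


# Answers as listed in Fig. 3 for the three alternative substituents.
R11 = RingDetails(heterocyclic=False, aromatic=False, cycloaliphatic=True, fused=False)
R12 = ChainDetails(min_carbons=1, max_carbons=5, double_bonds=0, triple_bonds=0)
R13 = GroupDetails(heterofunctionality=2, heteroatom="O", hydrogens=0)

if __name__ == "__main__":
    print(R12.accepts(carbons=3))                  # three carbons, saturated
    print(R12.accepts(carbons=6))                  # chain longer than the limit
    print(R12.accepts(carbons=3, double_bonds=1))  # chain with a double bond
```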


In principle there is no limit to the number of different substituents. But the present version of GREDIA is not yet able to request or negate different groups attached to one and the same substituent. This means that end users must presently resolve queries with several alternative residues into individual queries. A search for phosphinothricin with the alternative substituents R11, R12 and R13 in a single query is therefore not possible at the moment. This limitation can be circumvented in a few cases by formulating one query with a very generalized substituent and attempting to compensate for the alternative requests with complementary negations.

Test results with the existing version so far show that in searches for defined structures there is no difference between the results obtained by GREMAS beginners and GREMAS experts. Because of the precision of the fragment code, very little noise occurs; what noise does occur is attributable to the limitations of the GREMAS fragment code, and not to GREDIA. In substructure queries it is apparent that as the size of the substructure increases, the noise becomes less, and the results obtained by GREMAS beginners can approach those of GREMAS experts.⁵

The remaining differences between GREMAS beginners and GREMAS experts are largely attributable to two causes. One is the limitation mentioned above of not being able to formulate alternative groups attached to a substituent. The other is the limited range of generalizations. It is true that GREDIA allows generalizations for undefined substituents, and these represent a useful expansion over topological systems. But in comparison with the GREMAS system as a whole, the options implemented in GREDIA for entering general structural parts are for the time being limited. For example, general queries about multiple substitution of halogens in rings or chains are not possible at the moment. This means that in cases of doubt, GREMAS experts will have the advantage, and non-GREMAS experts must accept a certain amount of noise.
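Resolving a query with alternative residues into individual queries, which the end user presently has to do by hand, amounts to taking one query per combination of alternatives. A minimal sketch of that expansion is given below. The function name, the dictionary layout and the descriptive strings are illustrative assumptions and are not part of GREDIA.

```python
from itertools import product


def expand_alternatives(core, alternatives):
    """Resolve one query with alternative residues into individual queries.

    core         -- description of the fixed part of the substructure
    alternatives -- dict mapping a substituent label to its list of alternatives
    """
    labels = sorted(alternatives)
    queries = []
    # One individual query per combination of alternatives.
    for combination in product(*(alternatives[label] for label in labels)):
        query = {"core": core}
        query.update(zip(labels, combination))
        queries.append(query)
    return queries


if __name__ == "__main__":
    individual = expand_alternatives(
        "phosphinothricin skeleton",
        {"R1": ["isolated cycloaliphatic group",
                "saturated alkyl group, 1-5 C atoms",
                "acyl group"]},
    )
    for query in individual:
        print(query)
```

For the phosphinothricin example this gives three individual queries, one for each of R11, R12 and R13. With alternatives on several substituents, the number of individual queries is the product of the numbers of alternatives.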

CONCLUSIONS

GREDIA is shown in its function as a transfer system from the topological system to query formulation for GOLEM, thus allowing a new access to the IDC databases. Combining GREDIA and GOLEM in the IDC data pool enables the material stored so far in GREMAS to be additionally used by a new group of users.

Use of the IDC structure registry can serve as an equivalent to CAS on-line searches, not only for full structure searches but also, and most importantly, for substructure searches including differently generalized searches. The IDC Organika database is particularly important for searches for reactions. Here, the reaction centers of educts and/or products are identified and can, in one of the versions of descriptions, very precisely be linked together as a search parameter.

The IDC data pool is the most comprehensive and only database in the world which has provided uniform access to specific and generic structures and reactions from the journal and patent literature since 1960. No equivalent exists worldwide for the reactions recorded since 1960, either as regards the extent or the precision of the coding.³ Though CAS made the CASREACT reaction database available on-line in 1988, this only covers the important preparative reactions since 1985. CASREACT is also at present restricted to searching for the co-occurrence of educt(s) and product(s) in one document, without recognition of the reaction centers.

Experience gained so far shows that the IDC databases will initially be used by GREMAS beginners mainly in the literature sector. Patent searches will mainly remain the preserve of GREMAS experts. In this connection the experience to be gained worldwide with the PC program Generic-TOPFRAG will be useful and helpful in ascertaining whether and to what extent patent searches can be carried out without knowledge of fragment codes. Like GREDIA, TOPFRAG converts topological structures into a fragment code, in this case the CPI fragment code, and thus facilitates access to the DERWENT databases with sections B, C and E.⁶

As GREMAS/GOLEM searches have proved themselves over many years, the conditions exist for GREDIA to be used successfully and cost effectively in connection with GOLEM searches and the IDC databases. GREDIA is a prototype system that enables the high-value IDC databases to be accessed for the first time without knowledge of GREMAS.

DETAILS OF THE PROGRAM SYSTEM

GREDIA was developed on a UNIVAC® 90-80 using the VS 9 operating system. All the programs were converted to FORTRAN 77 from FORTRAN 4 and ASSEMBLER-360 and transferred to a VAX® with the VMS 4.6 operating system. This version consists of 55 individual programs which occupy 531,968 bytes of memory. HOECHST is initially using the DRAW module from the CASP synthesis planning system for graphic input. The SMD format is used as the interface. The user interface is being improved in collaboration with BAYER.

REFERENCES

1. Fugmann, R.; Braun, W.; Vaupel, W. "GREMAS* - ein Weg zur Klassifikation und Dokumentation in der organischen Chemie" (GREMAS, an approach to classification and documentation in organic chemistry). Nachr. f. Dok. 1963, 4, 179-190. *Generisches Recherchieren durch Magnetträger-Speicherung (generic searching on magnetic storage media)

2. Meyer, E. "The IDC system for chemical documentation." J. Chem. Doc. 1969, 9, 109-113; Roessler, S.; Kolb, A. "The GREMAS system, an integral part of the IDC system for chemical documentation." J. Chem. Doc. 1970, 10, 128-134; Lynch, M. F.; Harrison, J. M.; Town, W. G.; Ash, J. E. "Computer Handling of Chemical Structure Information." Macdonald/American Elsevier Computer Monographs, 1971; Fugmann, R. "The IDC-System" in Ash, J. E.; Hyde, E., Eds.: Chemical Information Systems; Ellis Horwood Ltd.: Chichester, 1975

3. Fugmann, R.; Bitterlich, W. "Reaktionendokumentation mit dem GREMAS-System" (Reaction documentation with the GREMAS system). Chemikerzeitung 1972, 96, 323-330; Fugmann, R.; Kusemann, G.; Winter, J. H. "The Supply of Information on Chemical Reactions in the IDC System." Inform. Proc. Management 1979, 15, 303-323; Fricke, C.; Fugmann, R.; Kusemann, G.; Nickelsen, I.; Ploss, G.; Winter, J. H. "Experience with reaction indexing and searching in the IDC system" in Modern Approaches to Chemical Reaction Searching: Willett, P., Ed., Gower: Aldershot, UK, 1985; Fugmann, R.; Ploss, G.; Winter, J. H. "Supply of Information on Chemical Reactions. An Advanced Topology-Based Method." J. Chem. Inf. Comput. Sci. 1988, 28, 47-53; Fujita, S. "'Structure-Reaction Type' Paradigm in the Conventional Methods of Describing Organic Reactions and the Concept of Imaginary Transition Structures Overcoming this Paradigm." J. Chem. Inf. Comput. Sci. 1987, 27, 120-126; Deroulede, A. "An update on computer-based systems providing information on chemical reactions and syntheses." Inf. Chim. 1987, 289, 143-146

4. Ehrhardt, F.; Roschkowski, H. "IDC Inorganic Chemicals Data Base. 2. Utilization of Chemical Abstracts Service Data Bases for the IDC Inorganic Chemistry Documentation System." J. Chem. Inf. Comput. Sci. 1986, 26, 63-71

5. Franzreb, K.-H. "Vergleich der Ergebnisse zwischen GREMAS-Experten und GREMAS-Unkundigen" (Comparison of results between GREMAS experts and GREMAS beginners); unpublished results

6. Harsdorf, E. v.; Dethlefsen, W.; Suhr, C. "Derwent's CPI and IDC's GREMAS: Remarks on their Relative Retrieval Powers with Regard to Markush Structures" in Computer Handling of Generic Chemical Structures: Barnard, J. M., Ed., Gower: Aldershot, UK, 1984; Silk, J. A. "Present and Future Prospects for Structural Searching of the Journal and Patent Literature." J. Chem. Inf. Comput. Sci. 1979, 19, 195-198; Kaback, S. M. "What's in a Patent? Information! But Can I Find It?" J. Chem. Inf. Comput. Sci. 1984, 24, 159-163; Jordis, U.; Oberhauser, O. "Status of computer search of the chemical literature: (partial) structural research with GREMAS, DARC, and CAS ONLINE." Österr. Chem. Z. 1982, 83 (12), 311-14; Kaback, S. M. "Online Patent Searching: The Realities." Online 1983, 22; Kaback, S. M. "Chemical Structure Searching in Derwent's World Patent Index." J. Chem. Inf. Comput. Sci. 1980, 20, 1-6; Simmons, E. S. "Central Patents Index Chemical Code: A User's Viewpoint." J. Chem. Inf. Comput. Sci. 1984, 24, 10-15; Kaback, S. M. "A User's Experience with the Derwent Patent Files." J. Chem. Inf. Comput. Sci. 1977, 17, 143-148; Smith, R. G.; Anderson, L. P.; Jackson, S. K. "On-Line Retrieval of Chemical Patent Information. An Overview and a Brief Comparison of Three Major Files." J. Chem. Inf. Comput. Sci. 1977, 17, 148-157; Watermann, J. R. "Using CAS ONLINE to search for patents" in Computer Handling of Generic Chemical Structures: Barnard, J. M., Ed., Gower: Aldershot, UK, 1984