trends in analytical chemistry, vol. 13, no. 10, 1994
MSLIB - a versatile tool for handling and interpreting mass spectral data

H. Lohninger
Vienna, Austria

An MS-DOS-based system for the handling and processing of mass spectral data is introduced. MSLIB provides a convenient graphical user interface and allows the administration of mass spectral data as well as related substance-specific information, including chemical structures. MSLIB provides tools both for importing and editing the data and for searching in the databases, and includes a spectral search and a structure similarity search.
1. Introduction

The handling of mass spectral data would now be unthinkable without computerized support. Mass spectrometry was one of the first analytical fields whose usability was transformed by the advent of computers. In addition to the proprietary software systems provided by the vendors of mass spectrometers, users may need to administer their own personal files of mass spectra together with other physical and structural data. MSLIB runs on IBM PCs (80386 or higher) under MS-DOS and features a convenient graphical user interface. The user may edit, search, browse, extend and analyze the data in a database. MSLIB has been tested by porting both the NIST [1] and the Wiley [2] mass spectral databases to it. Thus a large, commercially available database can be used together with proprietary mass spectra.
© 1994 Elsevier Science B.V. All rights reserved. 0165-9936/94/$07.00

2. Data structure

The main data structure of MSLIB, which is permanently visible to the user, is a ‘card file’ with one index card per compound. A single card contains all the information available on that compound. Users can arrange the cards according to their needs, and any number of cards can be compiled to form subsets of the database.

MSLIB can manage several mass spectral databases (up to 10), each containing a practically unrestricted number of substance entries (up to 16 million cards per library). Physically, each library consists of several files (the database and several inverted index files), although the user sees only one compact library which is addressable by either a name or an identification number.

In practice, the user rarely works with the physical libraries directly, but rather with subsets of them. These subset files can be treated as sublibraries compiled according to the specific needs of the user. Subsets can be built manually, but are usually created by a search or some other process within MSLIB. They can be edited, browsed, or logically combined with other subsets, and form the basis of everyday work with MSLIB.

As already pointed out, each card of the library contains all the information available on a single chemical compound. This information comprises not only the mass spectrum but also other important data, such as
• the name of the substance,
• the molecular formula,
• the structural formula (connection table + graphical information),
• some physical parameters (melting point, boiling point, refractive index, density, and a user-definable parameter),
• the CAS registry number,
• a spectral quality grading.

All the data can be edited manually (the system includes a versatile structure editor) or imported from other sources using some simple data formats (including the proposed JCAMP-DX standard [3]).

MSLIB provides several tools for searching in the databases. Each data item in an index card can be searched for. The most powerful search procedures are the substructure search, the structure similarity search, and the search for mass spectra. In addition, MSLIB supports an automatic check of the quality of mass spectra and provides a means to flag the quality of each spectrum.

3. Substructure search

Table 1. List of attributes to be defined during substructure search

Attribute  Meaning
3          Ring of 3, 4, or more than 6 atoms
5          Ring of 5 atoms
6          Ring of 6 atoms
B          Branching atom (atom with a minimum of 3 neighbour atoms)
C          Condensed ring atom
L          Atom in a linear chain
R          Any ring atom
T          Terminating atom (last atom of a chain)
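The attribute codes of Table 1 can be read as per-atom topological labels that a candidate atom must carry in order to match a template atom. The following Python sketch derives such labels from a plain connectivity list. It is an illustration under stated assumptions, not MSLIB's implementation: the function names (`build_adjacency`, `atom_attributes`, `satisfies`) are hypothetical, ring membership is detected by looking for a detour around each bond, a 'condensed' atom is approximated as one with three or more ring bonds, and every non-ring atom is labelled L.

```python
from collections import deque

def build_adjacency(bonds):
    """Turn a list of (atom, atom) bonds into a symmetric adjacency dict."""
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def _detour(adj, a, b):
    """Bonds in the shortest path a -> b that avoids the direct bond a-b, or None."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if u == a and v == b:
                continue  # do not use the direct bond itself
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == b:
                    return dist[v]
                queue.append(v)
    return None

def atom_attributes(adj, atom):
    """Return the set of Table 1 codes that apply to one atom (heuristic)."""
    attrs = set()
    # One entry per ring bond: size of the smallest ring closed through that bond.
    ring_sizes = []
    for nb in adj[atom]:
        d = _detour(adj, atom, nb)
        if d is not None:
            ring_sizes.append(d + 1)
    if ring_sizes:
        attrs.add("R")
        # Code "3" covers rings of 3, 4, or more than 6 atoms.
        attrs.add({5: "5", 6: "6"}.get(min(ring_sizes), "3"))
        if len(ring_sizes) >= 3:
            attrs.add("C")  # assumption: three or more ring bonds = condensed atom
    else:
        attrs.add("L")      # assumption: every non-ring atom counts as chain atom
    if len(adj[atom]) >= 3:
        attrs.add("B")
    if len(adj[atom]) == 1:
        attrs.add("T")
    return attrs

def satisfies(required, actual):
    """A candidate atom matches if it carries every attribute the template asks for."""
    return set(required) <= actual

# Two fused six-membered rings (atoms 0-9) with a two-atom chain (10, 11) on atom 0.
# The numbering is arbitrary and unrelated to the numbering used in Fig. 1.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (4, 6), (6, 7), (7, 8), (8, 9), (9, 5),
         (0, 10), (10, 11)]
adj = build_adjacency(bonds)
print(sorted(atom_attributes(adj, 4)))   # ring-fusion atom
print(sorted(atom_attributes(adj, 11)))  # end of the chain
```

A template atom that requires, say, the attributes 6, C and R would then be compared with a candidate atom via `satisfies(["6", "C", "R"], atom_attributes(adj, 4))`.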
A built-in chemical structure editor allows the user to draw on the screen a structure template which is used for the substructure search. The search is based on the back-tracking algorithm [4], which is sped up by several pre-processing steps. Although the back-tracking algorithm is slow compared to other algorithms [5], it has the advantage that any additional information (in the form of atomic numbers, bond properties, and attributes of atoms) can easily be included in the search. Thus, the search can be conducted over a large range of conditions. MSLIB allows a substructure search to be performed on three different levels:
• skeleton only,
• skeleton + atom type,
• skeleton + atom type + bond type.

In addition, each atom of the template can be assigned up to eight attributes which allow one to define specific conditions for the search. These attributes define some topological constraints on the candidate structures (Table 1). The procedure for the searching of structures is shown in Fig. 1. The substructure search results in a subset which contains all index cards found during the search. In addition, MSLIB allows several subset files to be combined logically, which can further enhance the versatility and selectivity of a substructure search.

Fig. 1. Substructure search (panels: Structure Editor, Attributes, Search Algorithm). The atom attributes of the query structure can be set to any meaningful value. In this example the attributes of atoms 3 and 4 have been extended by the C-flag (condensed ring system) and atom 8 has the T-flag set (chain terminating atom). This ensures that only condensed ring systems with a methoxy group at the indicated position are considered during the search. Attributes shown for the query atoms: 1: 6,B,R; 2: 6,R; 3: 6,C,R; 4: 6,C,R; 5: 6,R; 6: 6,R; 7: L; 8: L,T.

4. Structure similarity search

In contrast to the substructure search, which provides a search for specific structural fragments,
there is sometimes also the need to search for the structures most similar to a given molecule. The results of this search can then be used to support the interpretation of a mass spectrum of the target compound. The similarity search is based on superatoms, each consisting of a central atom and its neighbouring atoms. The resulting hitlist of the similarity search is stored in a subset and can be processed further, as with any other subset. As an example, Fig. 2 shows the results of a structure similarity search for 4-acetyl-3,5-diphenyl-4,5-dihydroisoxazole in the approx. 60 000 structures of the NIST mass spectral database. As can be seen, the resulting subset comprises several isomers of the query structure and its derivatives. These isomers could be found by a substructure search only by combining the results of three different runs.

Fig. 2. Structure similarity search. The first nine hits of the similarity search for 4-acetyl-3,5-diphenyl-4,5-dihydroisoxazole. The similarity is indicated by the distance measure, d, which equals zero for a perfect match and increases with increasing dissimilarity.
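The idea of comparing structures through superatoms can be illustrated with a short Python sketch. The article does not state how the distance d is computed, so the sketch substitutes an assumed measure: each molecule is reduced to a multiset of (central element, sorted neighbour elements) tuples, and the distance is one minus the Tanimoto coefficient of the two multisets. Like d in Fig. 2, it is zero for a perfect match and grows with dissimilarity. All names (`superatoms`, `distance`, `rank`) and the toy molecules are hypothetical.

```python
from collections import Counter

def superatoms(elements, bonds):
    """Multiset of superatoms: central element plus its sorted neighbour elements."""
    neighbours = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbours[a].append(elements[b])
        neighbours[b].append(elements[a])
    return Counter((elements[i], tuple(sorted(nbs)))
                   for i, nbs in neighbours.items())

def distance(x, y):
    """Assumed distance: 1 - Tanimoto coefficient of two superatom multisets."""
    shared = sum((x & y).values())   # multiset intersection
    total = sum((x | y).values())    # multiset union
    return 1.0 - shared / total if total else 0.0

def rank(query, library):
    """Return (distance, name) pairs, most similar structure first."""
    return sorted((distance(query, sa), name) for name, sa in library.items())

# Heavy-atom skeletons only; hydrogens are ignored in this toy example.
ethanol = superatoms(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = superatoms(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
ether = superatoms(["C", "O", "C"], [(0, 1), (1, 2)])

library = {"ethanol": ethanol, "1-propanol": propanol, "dimethyl ether": ether}
for d, name in rank(ethanol, library):
    print(f"{name}: d = {d:.3f}")
```

Because only the immediate environment of each atom enters the comparison, isomers that share most local environments end up close to the query, which is consistent with the isomer-rich hitlist described above.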
5. Search for mass spectra

The search for mass spectra can be performed in two different ways. First, MSLIB offers a simple peak search, in which the user may specify up to eight peaks by mass and intensity. MSLIB then searches for mass spectra which satisfy the given restrictions. This search is quite fast and can be used for screening possible candidate spectra. Secondly, the user may input an unknown spectrum and execute a full library search on it. This search is based on a simplified SISCOM approach [6] and results in a hitlist of 150 spectra ordered by decreasing similarity to the query spectrum.
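The peak search described above amounts to filtering the library with a small set of per-peak conditions. The sketch below shows one possible reading in Python. It assumes that each restriction is given as a mass together with a lower and an upper intensity bound, which the article does not specify; the function name `peak_search`, the card names and all intensity values are invented for illustration. The SISCOM-based full library search is not sketched here.

```python
MAX_PEAKS = 8  # the article states that up to eight peaks may be specified

def peak_search(library, constraints):
    """Return the names of all spectra that satisfy every peak constraint.

    library     -- dict: name -> {mass: relative intensity in percent}
    constraints -- list of (mass, min_intensity, max_intensity) tuples
    """
    if len(constraints) > MAX_PEAKS:
        raise ValueError(f"at most {MAX_PEAKS} peaks may be specified")
    hits = []
    for name, spectrum in library.items():
        # A missing peak is treated as intensity 0.
        if all(lo <= spectrum.get(mass, 0) <= hi for mass, lo, hi in constraints):
            hits.append(name)
    return hits

# Made-up spectra, not measured data.
library = {
    "card_a": {91: 100, 92: 60, 65: 12},
    "card_b": {78: 100, 77: 20, 51: 18},
    "card_c": {91: 35, 106: 100, 77: 10},
}

# Strong peak at m/z 91 and a medium peak at m/z 92.
print(peak_search(library, [(91, 80, 100), (92, 40, 80)]))
```

Treating an absent peak as zero intensity also allows exclusion conditions, for example `(106, 0, 5)` to reject spectra with a noticeable peak at m/z 106.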
6. Investigations on structure-spectra relationships

The relationship between a chemical structure and its mass spectrum is still largely unknown. Although some recent attempts have been made to predict the mass spectrum solely from the chemical structure [7], the problem remains unsolved. Until an analytical solution has been found (if ever), one has to resort to methods of multivariate statistics for the interpretation of mass spectra. Although MSLIB has primarily been designed as a flexible MS database, it provides a link to INSPECT [8], a PC-based chemometrics software package. This link enables the user to apply the more important methods of multivariate data interpretation to the spectra. An advantage of the combination of MSLIB and INSPECT lies in the free availability of both programs for MS-DOS-based PCs. However, ‘power users’ who need a fully featured chemometrics software package tailored to the interpretation of mass spectra should also look at EDAS [9] in combination with MassLib [10] or SpecInfo [11].
Fig. 3. Spectra of the 150 structures most similar to α-terpinene, projected onto the plane of the first two principal components of the mean-centred data. The classes have been assigned by a separate cluster analysis and clearly show the occurrence of four different types of spectra (classes 1-4). Some spectra (marked by ‘x’) cannot be assigned to any of the four classes.
A central problem when dealing with mass spectra is the high dimensionality of the data vectors. This difficulty can be met either by selecting the appropriate masses or by transforming the mass spectra to a data space of lower dimensionality through the introduction of spectral features [12]. MSLIB supports the creation of spectral features and provides a simple data interface to INSPECT. A simple example should clarify this approach. Suppose one has performed a structure similarity search for α-terpinene. This results in a set of similar structures (typically 150), which comprises mostly monoterpenes and related compounds. In order to investigate whether the mass spectra of these similar structures are heterogeneous, in the sense that there are different fragmentation pathways, one can apply a cluster analysis to the corresponding spectra. To this end, MSLIB is used to calculate modulo-14 spectra, which are then submitted to a cluster analysis using INSPECT. The results show that the spectra can be divided roughly into four classes (Fig. 3). These classes are made up of three groups having one to three double bond equivalents, and a fourth group which comprises cyclic monoterpene alcohols with the hydroxyl group attached to the ring. In order to provide a suitable view of the data, the modulo-14 spectra have been subjected to a principal component analysis, which shows the clusters in the plane of the first two principal components.
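A modulo-14 spectrum reduces a mass spectrum to 14 numbers by summing the intensities of all peaks whose mass leaves the same remainder on division by 14. Since a CH2 group has a nominal mass of 14, the peaks of a homologous ion series fall into the same bin. The Python sketch below illustrates the transformation. The function name `modulo14_spectrum` and the normalisation to unit total intensity are assumptions made for the example and are not taken from MSLIB; the peak list is invented.

```python
def modulo14_spectrum(spectrum):
    """Sum peak intensities by (mass mod 14) and scale the result to sum to 1.

    spectrum -- dict: integer mass -> intensity
    Returns a list of 14 values; index k holds the share of masses with m % 14 == k.
    """
    sums = [0.0] * 14
    for mass, intensity in spectrum.items():
        sums[mass % 14] += intensity
    total = sum(sums)
    if total == 0:
        return sums  # empty spectrum: leave the all-zero vector unchanged
    return [s / total for s in sums]

# Invented peak list: m/z 43, 57 and 71 differ by 14 and share bin 1,
# whereas m/z 41 falls into bin 13.
features = modulo14_spectrum({41: 20, 43: 100, 57: 50, 71: 30})
print([round(f, 3) for f in features])
```

The resulting 14-component vectors have the same length for every spectrum, so they can be passed directly to a cluster analysis or a principal component analysis as described above.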
7. Implementation, hardware requirements, and availability

MSLIB has been implemented in Pascal (approx. 16 000 lines of code), additionally using a Pascal library [13] which provided the basis for the graphical user interface. The following list summarizes the basic hardware and software that is necessary or strongly recommended for convenient use of MSLIB:
• IBM-compatible PC (at least 80386),
• 640 kByte memory (2 MB recommended),
• XMS manager recommended,
• EGA or VGA graphics card (VGA recommended),
• Microsoft-compatible mouse,
• hard disk, at least 5 MB free,
• math coprocessor recommended,
• MS-DOS 5.0 or higher.

University scientists and members of non-profit organizations may obtain a free copy on request if they have access to the Internet (e-mail to: [email protected]).

References

[1] S.R. Heller, J. Chem. Inf. Comput. Sci., 31 (1991) 352.
[2] Wiley Mass Spectral Database, Electronic Publishing Division, Wiley, New York, 4th ed., 1988.
[3] H. Hillig and P. Lampen, in C. Jochum (Editor), Software-Development in Chemistry, Vol. 8, Gesellschaft Deutscher Chemiker, Frankfurt, 1994, in press.
[4] L.C. Ray and R.A. Kirsch, Science, 126 (1957) 814.
[5] J.M. Barnard, J. Chem. Inf. Comput. Sci., 33 (1993) 532.
[6] H. Damen, D. Henneberg and B. Weimann, Anal. Chim. Acta, 103 (1978) 289.
[7] J. Gasteiger, W. Hanebeck and K.-P. Schulz, J. Chem. Inf. Comput. Sci., 32 (1992) 264.
[8] H. Lohninger, Chemom. Intell. Lab. Syst., 22 (1994) 147.
[9] W. Werther and K. Varmuza, in C. Jochum (Editor), Software-Development in Chemistry, Vol. 8, Gesellschaft Deutscher Chemiker, Frankfurt, 1994, in press.
[10] MassLib Version 7.1, Chemical Concepts, Weinheim.
[11] SpecInfo, Chemical Concepts, Weinheim.
[12] H. Lohninger and K. Varmuza, Anal. Chem., 59 (1987) 236.
[13] H. Lohninger, Borland Pascal Graphik Toolbox, IWT-Verlag, Vaterstetten/München, 1993.

H. Lohninger is at the Institute of General Chemistry, Technical University Vienna, Lehargasse 4/152, A-1060 Vienna, Austria. His main research interests are the utilization of computers in chemistry, with emphasis on the application of neural networks to the interpretation of spectroscopic data and to quantitative structure-property relationships.