Analytica Chimica Acta, 112 (1979) 143-150
Computer Techniques and Optimization
© Elsevier Scientific Publishing Company, Amsterdam - Printed in The Netherlands

SEARCH STRATEGY AND DATA COMPRESSION FOR A RETRIEVAL SYSTEM WITH BINARY-CODED MASS SPECTRA

GEERT VAN MARLEN* and JAN H. VAN DEN HENDE
Department of Analytical Chemistry, Delft University of Technology, Jaffalaan 9, Delft (The Netherlands)
(Received 14th December 1978)
SUMMARY

A retrieval system for binary-coded mass spectra is described. The data base used contains 9628 low-resolution mass spectra from the Aldermaston Mass Spectra Data Collection. These spectra are reduced to 106 preselected binary-coded m/z values each. Storage of the compound names and formulae is minimized by using a special set of characters and file organization. The search strategy permits fast generation of the N-nearest neighbours. Depending on the number of best matches generated, an average search requires access to only 24-33% of the spectra contained in the data base. Because of its limited storage requirements, this search system can be used even on microcomputers.
The minicomputer plays an increasingly important role in the functioning of modern laboratories. As a result, the available mass spectral data bases have grown to such an extent that their sheer size is becoming a handicap to their in-house applicability for routine mass-spectral retrieval systems. The storage of compound names and empirical formulae for a data base of 10 000 spectra would require at least 1 million bytes and the spectral information a commensurate amount or even more. The use of data compression techniques has therefore become inevitable. This paper describes the organization of a mass-spectral retrieval system based on the optimized use of storage combined with a feature selection technique. In addition, the problem of reducing search time, also affected by the size of the data base, is addressed.

EXPERIMENTAL
Data base
A library of 9628 low-resolution mass spectra, originating from the Aldermaston Mass Spectra Data Collection [1], was used as a reference file for the retrieval system. These spectra were reduced by binary coding of the intensity values with an intensity threshold of 1% of the base peak. Further reduction was obtained by selecting 106 binary-coded peak positions. The selection, based on the information content of a peak position corrected for the correlation between peak positions, has been described previously [2]. With this method only those peak positions significant for the entire reference file were coded, requiring a storage capacity of 106 bits per spectrum.

File organization
The data base consists of three files with random access organization. The first file contains, besides the binary-coded spectrum, a unique identification number ID and a presearch parameter, the distance d_R. This distance parameter is defined as the number of peak positions coded “present” in the spectrum, i.e. the number of “bit mismatches” between the spectrum and the “empty” spectrum with no peak positions coded present. The file is pre-arranged in order of increasing values of the distance d_R. All reference spectra with the same d_R are combined into a “cluster” of contiguous records in the file, each record containing up to 24 spectra. Figure 1 shows a frequency plot of the number of spectra for all clusters in the file versus the d_R value.

The second file is used to store pointers for each spectrum to the empirical formula and name of the compound, stored in the third file. This method of indirect reference was chosen to eliminate any duplication of empirical formulae and names of compounds in the data base, thereby reducing the total storage requirements. As a result, only 3300 different empirical formulae and 8100 different compound names are stored. The relationship between these files is illustrated in Fig. 2. The storage requirements are summarized in Table 1.

Keywords
To avoid lengthy and variable storage of compound names, a special 8-bit “character” set was generated. This character set contains, in addition to the normal numerical and alphanumerical ASCII characters (a total of 64), a set of 160 “keywords” representing character combinations which occur frequently, such as ACETYL, METHYL, PHENYL, etc. The keywords generated and their frequencies of occurrence in the data base are given in Table 2.
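The keyword substitution can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that plain characters keep their ASCII codes (octal 40-137, which gives exactly 64 characters), that keywords are matched greedily with the longest keyword tried first, and it uses only four keywords with octal codes as read from Table 2. The function name `encode_name` is hypothetical.

```python
# Small subset of the keyword table; octal codes as read from Table 2.
KEYWORDS = {
    0o140: "ACETYL",
    0o155: "METHYL",
    0o160: "PHENYL",
    0o210: "ETHYL",
}


def encode_name(name):
    """Encode a compound name into the 8-bit character set.

    Assumptions (not stated in the paper): plain characters keep their
    ASCII code (octal 40-137) and keywords are matched greedily,
    longest keyword first.
    """
    longest_first = sorted(KEYWORDS.items(), key=lambda item: -len(item[1]))
    encoded = bytearray()
    pos = 0
    while pos < len(name):
        for code, word in longest_first:
            if name.startswith(word, pos):
                encoded.append(code)      # one byte replaces the whole keyword
                pos += len(word)
                break
        else:
            char = ord(name[pos])
            if not 0o40 <= char <= 0o137:
                raise ValueError(f"character {name[pos]!r} is outside the assumed set")
            encoded.append(char)          # plain character, stored as is
            pos += 1
    return bytes(encoded)


# 15 characters shrink to 5 bytes with only two keyword hits.
print(len("PHENYLACETYLENE"), len(encode_name("PHENYLACETYLENE")))  # prints: 15 5
```

With the full set of 160 keywords the remaining letters (here ENE) would usually be absorbed by further keywords as well.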
With this compression, most of the compound names occupy less than 12 bytes of storage. The names requiring more than 12 characters are split up in chemically significant segments, separately stored in the file, and shared by all compound names. These segments are referred to within the 12-byte name-space by means of a 1-byte indicator followed by a 2-byte record number. The segments can also contain one or more references to other segments. The process of reconstructing a compound name is therefore recursive.

Fig. 1. Distribution of reference spectra clusters.

Fig. 2. File organization.

TABLE 1

Storage requirements for 9628 binary-coded mass spectra

File  Contents                      No. of entries                  Record length (byte)  Storage (kbyte)
1     Binary data                   462                             480                   223
2     Pointers                      9628                            4                     39
3     Compound names and formulae   8106 (names), 3279 (formulae)   16                    216

Configuration
The retrieval system is based on a PDP11/45 minicomputer, which is used for various laboratory applications, under RSX-11D in a multi-user environment. The data base is stored on an RK05 disk with a capacity of 2.4 Mbytes and an average transfer time to memory of 2 ms per spectrum. According to Table 1, the storage of the data base requires about 20% of the capacity of a disk cartridge. The acquisition of the g.c.-m.s. or m.s. data is carried out on a PDP11/45 preprocessor.

Search program
The retrieval program is written in PDP11 FORTRAN IV-PLUS and requires a storage capacity of 5.5K words exclusive of system functions. The general structure is given in the flow sheet presented in Fig. 3. The most important aspects and modules of the program are described in the following sections.
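Returning to the storage of long names: the recursive reconstruction from shared segments can be sketched in Python as follows. The paper specifies only a 1-byte indicator followed by a 2-byte record number; the indicator value (`SEGMENT_REF`), the big-endian byte order, the nesting guard and the example records are all assumptions made for illustration, and the example name is hypothetical.

```python
KEYWORDS = {0o140: "ACETYL", 0o160: "PHENYL"}   # subset; codes as read from Table 2
SEGMENT_REF = 0o001                             # hypothetical value of the 1-byte indicator


def expand_name(record, segments, depth=0):
    """Rebuild a name from a stored record, following segment references.

    `segments` maps a record number to the stored bytes of a shared segment.
    Assumptions: plain characters are ASCII, the 2-byte record number is
    big-endian, and more than 8 nesting levels means a corrupt file.
    """
    if depth > 8:
        raise ValueError("segment references nested too deeply")
    parts = []
    pos = 0
    while pos < len(record):
        byte = record[pos]
        if byte == SEGMENT_REF:
            number = int.from_bytes(record[pos + 1:pos + 3], "big")
            # A segment may itself contain references, hence the recursion.
            parts.append(expand_name(segments[number], segments, depth + 1))
            pos += 3
        else:
            parts.append(KEYWORDS.get(byte, chr(byte)))
            pos += 1
    return "".join(parts)


def ref(number):
    """Build a 3-byte segment reference (indicator + record number)."""
    return bytes([SEGMENT_REF]) + number.to_bytes(2, "big")


# Hypothetical file contents: segment 8 refers to segment 7.
SEGMENTS = {
    7: bytes([0o160]) + b"-",       # "PHENYL-"
    8: ref(7) + bytes([0o140]),     # segment 7 followed by "ACETYL"
}
NAME_RECORD = b"1-" + ref(8) + b"ENE"           # 8 bytes, within the 12-byte name-space

print(expand_name(NAME_RECORD, SEGMENTS))       # prints: 1-PHENYL-ACETYLENE
```

Because a segment is stored once and referenced by record number, every name that shares it costs only three bytes for that part.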
TABLE 2

Keywords and their frequencies of occurrence in the data base

Char.a  Keyword   Freq.    Char.  Keyword   Freq.    Char.  Freq.    Char.  Freq.
140     ACETYL      117    210    ETHYL       863    260      978    330       45
141     ANTHRA       60    211    ETHER       159    261       40    331      186
142     BENZYL      175    212    ENONE        95    262       14    332      248
143     BORANE       43    213    FLUOR        77    263      250    333       17
144     CARBON      279    214    HEXYL       300    264       48    334      688
145     CHLORO      964    215    DESOXY       29    265       42    335       90
146     CROTON       31    216    HEPTA       184    266       56    336      209
147     DEHYDE       97    217    IDINE       149    267      557    337      229
150     DEUTER       90    220    NOATE       160    270       55    340       22
151     EPOXY        23    221    NITRO        86    271       75    341       74
152     FLUORO      430    222    ORTHO        17    272       81    342       52
153     HEPTYL       43    223    OXIDE        52    273      130    343      103
154     HEXANE      244    224    OCTYL        73    274      527    344      305
155     METHYL     3053    225    PENTA       469    275      142    345      129
156     NAPHTH      352    226    PHENE       154    276       26    346      139
157     PENTYL      132    227    SULPH        88    277       91    347       47
160     PHENYL      934    230    TETRA       634    300      392    350      563
161     PROPIO      129    231    UNDEC        24    301       33    351      386
162     PROPYL      451    232    VINYL        45    302       24    352      140
163     THIOL       119    233    AMYL         36    303       40    353      354
164     TRANS       119    234    ANOL        259    304      212    354     1742
165     AMINE        97    235    OXYL         41    305      138    355      301
166     AMIDE        67    236    ANTH         20    306      153    356      241
167     AMINO       172    237    ACET        368    307       42    357      613
170     ALPHA       244    240    ACID        148    310       36    360       88
171     ANTHR       105    241    BENZ        415    311      275    361      116
172     ANOIC        32    242    BETA        182    312      120    362      858
173     BENZO       286    243    1,2,3,      160    313       26    363       29
174     BUTYL       584    244    IDENE        67    314      737    364      186
175     BUTYR       102    245    CHOL         43    315      281    365      204
176     BROMO       278    246    CARB        107    316      204    366      328
177     ALLYL        68    247    CYAN         64    317      126    367      997
200     CYCLO       966    250    CIS         104    320      134    370      814
201     CHRYS        16    251    DECA        278    321      270    371       62
202     CHLOR        70    252    DIOL         63    322      321    372      343
203     COHOL        52    253    ERIN         14    323      158    373      552
204     ANONE       122    254    ENYL        184    324      929    374      569
205     CHROM        39    255    FORM         33    325      123    375       64
206     DECYL       172    256    GLYC         49    326      137    376      514
207     EICOS        60    257    HEXA        186    327      125    377      332

a 8-bit representation in octal notation.

Keywords for characters 260-377 (assignment to the individual characters not recoverable; listed in the order found):
OLE INE BUT OCT OXY HEPT STYR PHEN IMINO IMID MERC 3, NATE NITR NON 6, HYDRO OCTA OATE PENT PROP SPIRO QUIN UREA THIO BTHIA AZINE SEC ISO PHO YNE ANE ATE 5, IND ALENE HYDR 4, PH ENE AZO BIS HEX TRI FUR PYR i&D PER SIL IDE ONE OXO BI ACR ETH OL PI6DI AN NE YL IC T1. ENO % o2& N42, Pl5-

Input. The preprocessed m.s. or g.c.-m.s. data file or a manually entered mass spectrum is converted to a binary code with the same format as the reference spectra in the data base. For the “unknown” spectrum, the distance parameter d_U and a distance limit value are also calculated.
Fig. 3. General structure of the search program.
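The input step (binary coding with a 1% threshold, followed by calculation of the distance parameter and selection of clusters within a distance limit) can be sketched in Python as below. This is an illustration under assumptions, not the FORTRAN program: the function names are hypothetical, four toy m/z positions stand in for the 106 selected ones, a peak exactly at the threshold is taken as “present”, and the limit value 17 is inferred from the range 4 < d_R < 38 quoted in Table 3 for d_U = 21.

```python
def code_spectrum(peaks, selected_mz, threshold=0.01):
    """Return an integer bit mask for a spectrum.

    Bit i is set when selected_mz[i] carries a peak of at least
    `threshold` times the base-peak intensity (">=" is an assumption).
    `peaks` maps m/z to intensity.
    """
    base_peak = max(peaks.values(), default=0.0)
    mask = 0
    if base_peak > 0:
        for i, mz in enumerate(selected_mz):
            if peaks.get(mz, 0.0) >= threshold * base_peak:
                mask |= 1 << i
    return mask


def presearch_distance(mask):
    """Number of positions coded 'present', i.e. the distance to the empty spectrum."""
    return bin(mask).count("1")


def cluster_window(d_u, distance_limit, n_positions=106):
    """d_R values of the clusters that satisfy |d_R - d_U| < distance_limit."""
    low = max(0, d_u - distance_limit + 1)
    high = min(n_positions, d_u + distance_limit - 1)
    return range(low, high + 1)


# Hypothetical toy data: 4 selected positions instead of 106.
SELECTED_MZ = [41, 43, 77, 91]
UNKNOWN_PEAKS = {51: 30.0, 77: 100.0, 91: 0.5, 102: 80.0}   # m/z 91 is below 1%

UNKNOWN_MASK = code_spectrum(UNKNOWN_PEAKS, SELECTED_MZ)
D_U = presearch_distance(UNKNOWN_MASK)
print(bin(UNKNOWN_MASK), D_U)           # prints: 0b100 1

window = cluster_window(21, 17)
print(window[0], window[-1])            # prints: 5 37
```

The same `code_spectrum` routine would be applied once to every reference spectrum when the first file is built, so that unknown and reference spectra share one format.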
Matching algorithm. This most frequently required module of the program was written in the PDP11 MACRO assembler. The number of “bit mismatches” between the unknown and reference spectrum is taken as a matching criterion and is calculated with an “exclusive or” (XOR) logic operator. The XOR matching was selected to allow fast comparison of the two spectra. A previous study [4] has indicated that other matching criteria did not give better retrieval results when adopted in the system. To decrease execution time, the comparison is terminated when the calculated distance limit is exceeded. The CPU time needed for one comparison varies from 150 to 800 µs, the execution time of 50-250 instructions. This variation is caused by the location and number of bit mismatches between the two spectra.

Generation of the N-nearest neighbours. In order to generate the N-nearest neighbours, a fixed list of results is constantly sorted by a “ripple sort” method. A high sorting speed is obtained by using pointers (tags) as indirect references so as to avoid unnecessary rearrangements of the list.

Search strategy. Only clusters of reference spectra for which the preset distance condition is valid are selected for comparison with the unknown spectrum. The sequence in which these reference spectra are compared is as follows. When the distance parameters d_R and d_U are used for the reference and the unknown spectrum, respectively, the a priori minimum distance D between the two spectra is calculated from D = |d_R - d_U| before the comparison is actually started. The reference spectra are compared with the unknown spectrum in sequence of increasing D, starting with 0. The search is terminated when D exceeds the number of bit mismatches found for the Nth nearest neighbour generated. The a priori distance between unknown and reference spectrum then exceeds the generated distance for the N-nearest neighbours found so far, and continuation of the search becomes fruitless. This search strategy generates the same N-nearest neighbours compared with a normal sequential search, but matches only those reference spectra which most probably belong to the desired set.

The number of spectra searched
In terms of information theory concepts, the introduction of the presearch parameter d_R implies use of additional data with its own information content,
I_H, in addition to the binary-coded peak positions. I_H may be calculated from the formula of Shannon and Weaver [3]:

$$I_H = -\sum_d p_d \log_2 p_d - \log_2 \Delta d \qquad (1)$$
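Before the terms of eqn. (1) are discussed, the matching criterion and search strategy described above can be made concrete with a short Python sketch that also counts the spectra actually compared, the quantity estimated in this section. It is a sketch under assumptions rather than the MACRO/FORTRAN code: Python integers stand in for the 106-bit records, `bisect.insort` replaces the pointer-based ripple sort, the current Nth-best distance serves as the distance limit, and the toy clusters and all function names are hypothetical.

```python
from bisect import insort


def mismatches(a, b, limit):
    """Count differing bits of a and b (XOR), giving up once `limit` is exceeded."""
    diff = a ^ b
    count = 0
    while diff and count <= limit:
        diff &= diff - 1        # clear the lowest set bit
        count += 1
    return count


def nearest_neighbours(clusters, unknown, n):
    """Return (best, searched).

    `clusters` maps d_R to a list of (ID, bit mask) pairs; `best` holds the
    n best (distance, ID) pairs and `searched` the number of comparisons made.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    d_u = bin(unknown).count("1")
    best = []                   # sorted list, never longer than n
    searched = 0
    # Visit clusters in order of increasing a priori distance D = |d_R - d_U|.
    for d_r in sorted(clusters, key=lambda d: abs(d - d_u)):
        a_priori = abs(d_r - d_u)
        if len(best) == n and a_priori > best[-1][0]:
            break               # no remaining cluster can improve the list
        for spec_id, mask in clusters[d_r]:
            limit = best[-1][0] if len(best) == n else float("inf")
            distance = mismatches(unknown, mask, limit)
            searched += 1
            if distance < limit:
                insort(best, (distance, spec_id))
                del best[n:]    # keep only the n best matches
    return best, searched


# Hypothetical toy library of 7 spectra with 7-bit codes.
CLUSTERS = {
    1: [(10, 0b0000001)],
    2: [(11, 0b0000011), (12, 0b0001100)],
    3: [(13, 0b0000111), (14, 0b1010001)],
    5: [(15, 0b0011111)],
    7: [(16, 0b1111111)],
}
UNKNOWN = 0b0000111             # d_U = 3

best, searched = nearest_neighbours(CLUSTERS, UNKNOWN, 2)
print(best, searched)           # prints: [(0, 13), (1, 11)] 4
```

In this toy case only 4 of the 7 spectra are compared, and the two neighbours found are the same as those of an exhaustive search. The ordering is safe because the number of bit mismatches between two codes can never be smaller than |d_R - d_U|.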
The probabilities p_d of measuring a spectrum at a distance d_R are derived from the data presented in Fig. 1. If a criterion of “perfect matching” is used, the distance D between the unknown and the reference spectrum equals 0 in the event of a match and the “search window” Δd equals 1. For the data set under consideration, I_H then amounts to 5.9 bit. Use of this parameter therefore reduces the average number of spectra to be searched to 161 (9628 divided by 2^5.9), assuming that the unknown spectra exhibit the same distribution as the reference spectra.

In a previous paper [4], the distance D between an unknown and a reference spectrum of the same compound was investigated. For a large set of compounds, a skew distance distribution was found with an average distance of 6.9 and a median distance of 3.2. When eqn. (1) was applied with search windows of 13.8 and 6.4 (twice the average and the median distance), the average and median number of spectra searched before the correct reference spectrum was found became approximately 2200 and 1050, respectively.

In practice, however, the identity of the unknown spectrum is not known beforehand. Generally, the search is terminated after a certain number of best matches, the N-nearest neighbours, have been produced. To estimate the number of spectra to be searched for 3 or 10 nearest neighbours, a set of 459 “unknown” spectra was extracted from the file and used as input to the search system. The average number of spectra that had to be searched was about 2300 and 3200, respectively. The influence of this approach to the search strategy on the number of spectra to be searched is illustrated in Table 3 for a typical compound extracted from the file. It is obvious that the saving in search time is worthwhile.

DISCUSSION
It would be expected that for other general data bases, a similar picture would emerge with regard to the storage optimization described. Smaller specific data sets, e.g. a set of alkane spectra, will result in the use of fewer keywords.

TABLE 3

Number of spectra searched and search time for the spectra of phenylacetylene under different conditions (d_U = 21), with a simulation of the report generated for the spectrum of phenylacetylene

No. of spectra searched   N-nearest neighbours   Search time (s)
7501a                     10                     13.4
3896b                     10                     8.2
2381b                     3                      5.0

MS-RETRIEVAL BINARY V08     DATE: 19-APR-78   TIME: 14:06:09   PAGE: 1
MODE: MANUAL
TITLE: UNKNOWN
PEAKS IN SPECTRUM: 21
SPECTRA SEARCHED: 2381

ID#     MATCH   BRUTOFORMULA   NAME
180     100     C8.H6          PHENYLACETYLENE
2168    97      C8.H6          PHENYLACETYLENE
4380            C7.H5.N        BENZONITRILE

PRESEARCH TIME: 0.4 s     SEARCH TIME: 5.0 s
PRINT TIME: 3.3 s         MATCHING AVERAGE: 2.1 MS

a Search with a preset distance limit between unknown and reference spectrum (4 < d_R < 38).
b Search strategy with a priori distance.

The size of the program and the data base described does not require a large computer system. A microcomputer, e.g. a PDP11/03 combined with a dual floppy-disk drive, would suffice for this type of retrieval. However, because the search time is primarily dependent on the time needed for the transfer of data from disk to memory, the total search time for such a system would become approximately 1 min. For the described system with a transfer time of 2 ms per spectrum, the generation of the 10 nearest neighbours takes about 6 s on the average. The same retrieval system run on the IBM370/158 at the University computer centre requires an average of about 3 s to generate similar results.

The file organization and search strategy outlined are generally applicable for other search systems, provided that a parameter can be found to pre-order the data base. This is in agreement with the observations made by Grotch [5]. The effects of the search strategy on other retrieval systems, such as the Biemann search method [6], will be explored further.

The authors are indebted to A. Dijkstra and H. A. van ‘t Klooster for helpful discussions.
REFERENCES

1 Mass Spectral Data Centre, AWRE, Aldermaston, U.K., library purchased in 1971.
2 G. van Marlen and A. Dijkstra, Anal. Chem., 48 (1976) 595.
3 C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, The University of Illinois Press, Urbana, Ill., 1949.
4 G. van Marlen, A. Dijkstra and H. A. van ‘t Klooster, Anal. Chem., 51 (1979) 420.
5 S. L. Grotch, 25th Annual Conference on Mass Spectrometry and Allied Topics, Washington, D.C., 1977.
6 H. S. Hertz, R. A. Hites and K. Biemann, Anal. Chem., 43 (1971) 681.