A NEW RETRIEVAL SYSTEM FOR INFRARED SPECTRA
J. ZUPAN, D. HADŽI and M. PENCA
Chemical Institute Boris Kidrič, Ljubljana and Department of Chemistry, University of Ljubljana, Ljubljana, Yugoslavia
(Received January 1975; in revised form February 1976)
Abstract-A new retrieval system for infrared spectra is described. Besides searching for a match to a single spectrum, procedures have been developed for searches for two component mixtures. The procedures have been implemented and have been tested by programs written in FORTRAN IV.
SINGLE
INTRODUCTION
spectra have been constructed (Sparks, 1964; Erley, 1968; Clerc & Erni, 1973; Sebasta & Johnson, 1972). In spite of some shortcomings, the WYANDOTTE collection, still the largest one, is used as a data bank in almost all systems. Although these systems deal with the same problem, i.e. how to find the compound in the data bank whose spectrum is most similar to (in rare cases identical to) that of the unknown, the procedures differ from case to case. In the present paper we describe an improved system incorporating new ideas for retrieval based on spectrum records of 660 bits. The implementation in the program ZAPAHZ has three main features: single spectrum search, two component mixture search, and serial or catalog number search. ZAPAHZ has been in use on a CDC Cyber 72 computer operating under SCOPE 3.4 at the Computing Center of SR Slovenia.
DATA
in searching for a single spectrum to be able to define what percentage of input peaks is expected to be in the data base, and 80% often works well. We have found it useful to be able to control what chemical and structural information provided by the original data bank (Codes and Instructions, 1964) is to be used as a criterion of match. It is desirable to have a choice of either an obligatory match on all data or of an ‘either-or’ form, which means that the spectrum from the ‘goodlist’ matches one set (of up to 24 items) or it matches a second set, etc., up to a maximum of 12 sets. This option is more useful for data selection or reorganization than for identification. A rating number $R_n$ is computed for each spectrum in the ‘goodlist’ according to the following algorithm:
BANK
$R_n = 2n_{0.1} + 5n_{0.2} + 10n_m + n_4.$
Table 1. Transformation of WYANDOTTE-ASTM i.r. data cards (Codes and Instructions, 1964) into eleven 60 bit words*

Word  Column  Contents
1     1-5     Absorption bands 1.0-5.9 µm
2     6-10    Absorption bands 6.0-10.9 µm
3     11-15   Absorption bands 11.0-15.9 µm
4     32-35   Chemical classification Part A
5     36-40   Chemical classification Part A, B
6     41-45   Chemical classification Part B
7     46-50   Chemical classification Part B
8     51-55   Chemical classification Part B
9     56-60   Chemical classification, No. of atoms Part B
10    61-65   No. of atoms, boiling and melting point
11    71-80   Catalog number
SEARCH
given peak. It is useful
Data from an 80 column card (960 bits of information) are reformatted to occupy eleven 60 bit words (Table 1) in a way that retains all information. Blank columns of the card, carrying no information, are omitted. If only band positions and no band regions are required, the search runs only over the first three words containing the absorption data. It is extended to include the appropriate number of words if additional structural or chemical data are requested. Peaks and interval boundaries can be entered either in microns or in wavenumbers. After the search, the spectra from the ‘goodlist’ can be printed out as catalog numbers or as expanded summaries.
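The storage and search of the absorption data described above can be sketched as follows. This is a minimal illustration and not the ZAPAH code (which is written in FORTRAN IV). It assumes one bit per 0.1 µm from 1.0 to 15.9 µm, with bit 0 of word 1 standing for 1.0 µm. The function names `encode_bands`, `has_band` and `words_needed` are hypothetical.

```python
WORD_BITS = 60       # CDC word size used in the paper
N_BAND_BITS = 180    # words 1 to 3 cover 1.0-15.9 micrometres


def to_index(wavelength_um):
    """Bit index for a wavelength, assuming one bit per 0.1 micrometre."""
    return round(wavelength_um * 10) - 10


def encode_bands(peaks_um):
    """Pack band positions into three 60-bit integers (words 1 to 3)."""
    words = [0, 0, 0]
    for peak in peaks_um:
        index = to_index(peak)
        if not 0 <= index < N_BAND_BITS:
            raise ValueError("peak outside 1.0-15.9 micrometres")
        word, bit = divmod(index, WORD_BITS)
        words[word] |= 1 << bit
    return words


def has_band(words, center_um, tol_um):
    """True if any coded band lies within center_um +/- tol_um."""
    lo = max(0, to_index(center_um - tol_um))
    hi = min(N_BAND_BITS - 1, to_index(center_um + tol_um))
    for index in range(lo, hi + 1):
        word, bit = divmod(index, WORD_BITS)
        if (words[word] >> bit) & 1:
            return True
    return False


def words_needed(n_bits, word_bits):
    """Number of machine words needed to hold n_bits (ceiling division)."""
    return -(-n_bits // word_bits)


spectrum = encode_bands([3.4, 6.1, 13.7])
print(has_band(spectrum, 6.0, 0.1))   # input peak at 6.0 with +/-0.1 tolerance
print(not has_band(spectrum, 6.4, 0.2))  # no-band test around 6.4
print(words_needed(180, 60), words_needed(660, 60))
```

A no-band region is simply the negation of `has_band` over the same window. Restricting the loop to the first three words corresponds to a search on band positions only.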
SPECTRA
As pointed out by Erley (1971) a good search algorithm must balance selectivity against the tolerance necessary to produce useful output, especially when an exact match to the input data does not exist. In practice tolerances greater than 0.3 µm are seldom useful in searching for a match to a single spectrum. Normally one of three options is suitable: (a) all input peaks are assigned a tolerance of ±0.1 µm, (b) all input peaks are assigned a tolerance of ±0.2 µm, or (c) peaks below 10 µm are assigned a tolerance of ±0.1 µm and all others ±0.2 µm. It is necessary to be able to define no-band regions, and by allowing the no-band option to override peak tolerance specifications, an unsymmetrical tolerance region can be defined about a
In the past 10 yr a number of retrieval systems for i.r.
(1)
The terms $n_{0.1}$ and $n_{0.2}$ are the numbers of peaks which differ from the input specifications by ±0.1 and by ±0.2 µm, respectively; $n_p$ is the number of extra peaks present in the data base spectrum but not specified on input. The term $n_4 = 0$ if $n_p = 0$, and $n_4 = 5n_p + 5$ if $n_p \neq 0$; $n_m$ is the number of missing peaks. The spectrum with the lowest rating number is defined to be the most probable hit. We tested the efficiency of the above algorithm on the results of 60 searches, following Erley’s (1971) suggestions for accuracy, precision, performance and false lookups. The results are depicted in Fig. 1. The 60 test spectra were randomly chosen from compounds whose spectra are known to be in the bank. Table 2 summarizes average results. It is clear why missing peaks and excess peaks are not equally weighted in the rating algorithm (Eq. 1). As shown in Fig. 1, two spectra were not found in the
Column (words 1 to 11 of Table 1): 1-5, 6-10, 11-15, 32-35, 36-40, 41-45, 46-50, 51-55, 56-60, 61-65, 71-80
*See below for use of other word sizes.
[Fig. 1 plot. Axis labels: "No. of better hits" (0 to 10) and "Not found".]
Fig. 1. The results of 60 test searches. The system failed to find only two examples within the first 10 hits. The search characteristics, according to Erley (1971), are as follows: the accuracy of the system ZAPAH is A = 0.97, the precision P = 0.83, and the performance Q = A · P ≈ 0.8. NF, the number of false lookups per search, is 0.67.
suitable. In the case of very dense peak structure only the central peak is coded, using a tolerance that covers the whole group of peaks. (b) The spectra of the components of the mixture are assumed to be in the data file. (c) The region between the upper and the lower tolerance limit of two successive bands is considered as a no band region. (d) Spectra having a small number of bands (optionally less than or equal to three) are not considered as a possible component of a mixture spectrum. During the first run over the data file all possible components within the above limitations are stored on the ‘goodlist’ as 2-word arrays, the first word containing the catalog number and the second the right justified mask of the bands present in the given spectrum (Fig. 2). In the 60 bit word
[Fig. 2 diagram: four right justified bit masks, labelled (a) to (d).]
Table 2. Results of 60 test searches

Item                              Per search  Lowest  Highest
No. of peaks                      12          5       20
No. of no-band regions            4           ?       7
No. of peaks at exact position    1.2         0       15
No. of peaks at 0.1 difference    4.0         0       8
No. of peaks at 0.2 difference    0.2         0       2
No. of missing peaks              0.8         0       3
No. of excess peaks               1.9         0       12
Rating                            41          ?       150
bank. An analysis of the disagreement between the input data and the actual card images of the particular spectra from the file, shown in Table 3, makes clear the practical difficulties facing any search procedure.
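The rating computation used to order the ‘goodlist’ can be sketched as below. This is a sketch under stated assumptions: it reads Eq. (1) as R_n = 2n_0.1 + 5n_0.2 + 10n_m + n_4, with n_4 = 0 when there are no extra peaks and 5n_p + 5 otherwise. The function names and the catalog labels in the example are hypothetical.

```python
def rating(n_01, n_02, n_missing, n_extra):
    """Rating number under the assumed reading of Eq. (1); lower is better.

    n_01, n_02 : peaks off by 0.1 and by 0.2 micrometres
    n_missing  : input peaks absent from the data base spectrum
    n_extra    : data base peaks not specified on input
    """
    n_4 = 0 if n_extra == 0 else 5 * n_extra + 5
    return 2 * n_01 + 5 * n_02 + 10 * n_missing + n_4


def rank_goodlist(candidates):
    """Sort (catalog_number, counts) pairs by ascending rating."""
    rated = [(rating(*counts), catalog) for catalog, counts in candidates]
    return sorted(rated)


# Hypothetical candidates: (n_01, n_02, n_missing, n_extra)
candidates = [
    ("X001", (4, 0, 1, 2)),
    ("X002", (0, 0, 0, 0)),
    ("X003", (1, 1, 0, 1)),
]
for score, catalog in rank_goodlist(candidates):
    print(catalog, score)
```

With these weights a missing peak costs 10, while a first extra peak also costs 10 and each further one only 5, so the two kinds of disagreement are penalized differently.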
Table 3. Actual ASTM file data in comparison with the input data for two unsuccessful searches

                      Catalog numbers
Items                 1897CA   4448FA
No. of peaks          23       8
Exact position        6        8
±0.1 tolerance        2        0
Missing peaks         0        6
Excess peaks          15       0
No. of input peaks    8        14

TWO COMPONENT MIXTURE SEARCH
The search algorithm for the MIXTURE option of the retrieval system ZAPAH is based on the assumptions proposed by Sebasta & Johnson (1972) but incorporates new methods to make the search faster. (a) The spectrum of an unknown mixture is assumed to be a linear sum of the spectra of its components; however, we have arranged to be able to use tolerances up to 0.9 µm, which makes it possible (at least theoretically) to find a component of a given mixture that has peak shifts up to 0.9 µm. Tolerances of ±0.2 µm are normally
Fig. 2. Four right justified masks carrying the information about occurrences of peaks. The particular search requires 12 peaks. Mask (a) represents a single spectrum with the 4th, 7th, 8th and 11th peak missing.
example the logical sum of mask b with c (b.OR.c) as well as the logical sum a.OR.b form a perfect solution of the problem and are associated with the rating number 0. By contrast, the logical sum a.OR.c has two mismatches and is therefore assigned a rating number of 2. The logical sum instruction as well as the count instruction (the function that counts the number of zeros or ones in a given word) are very fast, so that the search for all possible two component combinations is relatively rapid. The search of the mixture spectrum required a total central processor time of 80 s. We have elected to list on output the components which match the input conditions completely, plus those combinations of two spectra with the lowest (best) rating. In one test, two components not known to the authors were mixed in a 50:50 ratio. The recommended procedure led to recovery of 15 different combinations of two spectra having zero mismatches. The final ‘goodlist’ contained 119 possible components. Ten combinations included para-dinitrobenzene (3657CA) and five included dicyclo-pentadienyl (5198FA). Two components among the first 10 were thiourea (3%2CA and 3788FA). The actual mixture was para-dinitrobenzene and thiourea. The best general strategy is to use as large a tolerance region as reasonable, especially when peaks are not well defined, are very close together, or are very broad.
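The pairwise test with the logical sum and the count instruction can be sketched as follows. This is an illustration only, not the ZAPAH code. Mask a follows the caption of Fig. 2 (peaks 4, 7, 8 and 11 missing out of 12). Masks b and c are hypothetical, chosen only to reproduce the ratings quoted above, and the function names are assumptions.

```python
from itertools import combinations


def mask_from_peaks(present, n_peaks):
    """Right justified mask; peak 1 is the most significant of n_peaks bits."""
    mask = 0
    for peak in present:
        mask |= 1 << (n_peaks - peak)
    return mask


def mismatches(mask_1, mask_2, n_peaks):
    """Number of required peaks covered by neither spectrum."""
    full = (1 << n_peaks) - 1
    combined = (mask_1 | mask_2) & full          # logical sum (.OR.)
    return n_peaks - bin(combined).count("1")    # count instruction


def rate_pairs(goodlist, n_peaks):
    """Rate every two component combination; best (lowest) rating first."""
    rated = [
        (mismatches(m1, m2, n_peaks), c1, c2)
        for (c1, m1), (c2, m2) in combinations(goodlist, 2)
    ]
    return sorted(rated)


N_PEAKS = 12
goodlist = [
    ("a", mask_from_peaks([1, 2, 3, 5, 6, 9, 10, 12], N_PEAKS)),
    ("b", mask_from_peaks([1, 2, 3, 4, 7, 8, 11], N_PEAKS)),     # hypothetical
    ("c", mask_from_peaks([4, 5, 6, 9, 10, 11, 12], N_PEAKS)),   # hypothetical
]
for score, first, second in rate_pairs(goodlist, N_PEAKS):
    print(first, second, score)
```

Because each candidate is reduced to a single word, rating a pair takes one OR and one bit count, which is why testing all pairs of a ‘goodlist’ of about a hundred spectra remains cheap.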
ABOUT IMPLEMENTATION
The program ZAPAH consists of about 800 FORTRAN IV statements (not counting comments) and, through use of an overlay structure, requires only 33,000 words of memory. The search speed on the CDC CYBER 72 computer is 1000-1200 spectra/s. In our practice both
the data file and the absolute program are stored on a private pack which can be called directly by the terminal connection from the laboratory. The real time required for one search of the more than 42,000 spectra is about 1 min. The following limitations for the input data have proved adequate:
(a) Maximum number of peaks = 30
(b) Maximum number of intervals = 30
(c) Maximum number of chemical structure requests = 12 in one column and 20 columns in one set
(d) Maximum number of ‘either-or’ sets of chemical structure data = 24
(e) Maximum tolerance at each peak = ±0.9 µm
(f) Maximum number of catalog or sequence number requests in one search = 200
(g) Maximum number of searches in one run: no limitation.
MODIFICATION FOR OTHER TYPES OF COMPUTERS
Although the data file and the source program as described are based on 60-bit words, adaptation to other word lengths requires two minor changes. First, the 660 bits of information are reformatted into the appropriate number of words. This might be 42 words for a 16-bit word size. Second, the elementary search loop is set to include the appropriate number of words. If, for example, the search runs only over the whole wavenumber region (180 bits of information), it would be necessary to use a search loop over three 60-bit words or over 12 16-bit words.
Acknowledgements-The authors are indebted to Dr. S. Detoni for the sample preparation and spectra recording. Financial support of the Boris Kidrič Fund is gratefully acknowledged.
REFERENCES
Clerc, J. T. & Erni, F. (1973), Topics in Current Chemistry 39, 91.
Codes and Instructions for WYANDOTTE-ASTM, ASTM (1964), 1916 Race St., Philadelphia, Pa.
Erley, D. S. (1968), Anal. Chem. 40, 894.
Erley, D. S. (1971), Appl. Spectroscopy 25, 200.
Sebasta, R. W. & Johnson, G. G. (1972), Anal. Chem. 44, 260.
Sparks, R. A. (1964), Storage and Retrieval of WYANDOTTE-ASTM Infrared Spectral Data Using an IBM 1401 Computer, ASTM, Philadelphia, Pa.