MANIPULATION OF CHEMICAL DATA BASES BY PROGRAMMING
Jure ZUPAN, 'Boris Kidrič' Institute of Chemistry, Hajdrihova 19, YU-61115 Ljubljana, Yugoslavia
4.1 INTRODUCTION
The dilemma of whether or not one should know (or learn) any of the most commonly used high-level languages (Basic, Fortran, Pascal, C, etc.) still haunts many practicing chemists, researchers, students and educators alike. There are arguments on both sides, spanning from absolute rejection (almost everything chemists need can be bought on the software market or at least developed by their computer departments) to fierce advocacy of the 'do-it-yourself' approach (arguing that chemists who use application packages as 'black-box' tools do not know how their data are actually processed). As always, the response of chemists will depend on the needs, abilities, goals, and general policy prevailing in the working environment, staff, and laboratory, and not least on the opinion of the head of the group.
Probably the most important factor in such a decision is the goal towards which the particular project, or even the laboratory, is oriented. For routine work with well established procedures, the 'black-box' approach is doubtless very convenient, useful, and above all, safe against errors. On the other hand, research on the frontiers of any field necessarily requires at least a minimal knowledge of programming in one high-level language. Fundamental research requiring a lot of data handling, yet relying only on purchased software, can seldom lead to completely new discoveries. If new facts are sought, the relevant data should be treated in a
completely new way at least at one point of the data handling process, which requires programming of one's own routines. There are, of course, many other occasions when knowledge of programming is very useful, especially if 'interfacing' programs for reformatting data transferred between two standard packages, minor changes in existing programs, or new special-purpose procedures are needed. It is worthwhile to keep in mind that programming in a high-level language is not much more complicated than programming in dBase or LOTUS, to mention only two of the most commonly used packages offering their own programming languages. In short, knowing how to program and when to use this skill is very beneficial. Many problems can be solved in a simpler, faster, and more economic way than when the user must explain the problem to a programmer, check the obtained results, and iterate the procedure until the calculated results are satisfactory. Frequent small changes in a program, in particular, can be very annoying if made through an intermediate person. Additionally, it should be remembered that, in order to attract buyers, most stand-alone computer packages are designed as general-purpose products. This means that a package is programmed to handle as many different problems (within its scope, of course) as possible and as such cannot be equally well suited for all of them. This is not to say that general packages on the market are deficient or not worth buying; on the contrary, they are mostly very useful and much more user friendly than the majority of 'home made' programs. What we want to say is that for special applications the general-purpose software may not offer the very best data handling procedure available. At this point the programming 'know-how' becomes very valuable.
4.2 PROGRAMMING PROCEDURES
Years ago, the choice of a programming language was a very simple one. Chemistry was dominated almost entirely by the Fortran language. Today, the choice is mainly made between Fortran, Pascal, and Basic. Because each of them has its own
advantages and drawbacks, it is hard to single out any of them as the best one. Basic is easy to learn (syntax and rules) and is good for graphics if a CGA (Color Graphic Adapter) or EGA (Enhanced Graphic Adapter) card is installed in the PC, but long programs become rather difficult to maintain and update, handling I/O with files is inconvenient, and almost every computer brand has its own Basic dialect. Pascal is mainly an algorithmic language with medium I/O capability, which makes it not the best choice if a lot of file manipulation and communication is planned. There are several Pascal graphic packages (Borland Turbo Graphic Box, for example) offering diverse graphic procedures, making Pascal very attractive for many young chemists. As the situation stands now, the majority of 'home made' programs in chemistry are still written in Fortran. Some of the reasons are historical (it is hard to switch to another language) and some are based on a high level of standardization, which makes Fortran the most portable language. It is nice to transfer a source code from a mainframe to a PC, compile it there, and then run it without most of the problems usually encountered when transferring programs written in other languages. Due to its standardization, Fortran is a very conservative language as far as graphics and screen manipulation procedures are concerned. Similarly to Pascal, many software houses offer special subroutines for graphics (Microsoft Windows, MetaWindows, etc.).
Development of any program is carried out in the following steps:
1. designing a solution for the problem in an algorithmic (procedural) way,
2. writing the selected algorithm in a high-level language (source file) using a text editor,
3. compiling the source file with the corresponding compiler (some compilers require compilation in 2 or 3 passes),
4. correcting the source file for typographical and syntax errors if the compiler finds some (repeating steps 2 to 4 until no errors are detected by the compiler),
5. linking the compiled program (file.obj) with other object files and libraries using the system linker to obtain the executable file (file.exe),
6. running the 'exe' version on test data,
7. comparing the results with the expected ones and repeating the entire procedure from step 2 until the obtained results are consistent with the expected ones,
8. running the real application.
As can be seen, writing one's own application is not an easy or fast task. It can be learned only by practice. Doubtless, the most difficult and tiresome part of programming is tracing down logical (procedural or algorithmic) errors in the source code. This part is called 'debugging'. The use of a debugging option, if it is offered by the compiler, is of great help. It enables the programmer to trace and monitor changes of variables, arrays, and program flow. Additionally, using a debug option, the programmer can change values of variables during the execution of the program, etc. If the compiler does not have a debug option, a number of write statements communicating the values of variables must be included in the source, which makes finding errors a difficult and time consuming process.
It has to be mentioned that a big step towards unification and standardization of programming has been offered by Microsoft compilers of version 4.0 or higher (MS Pascal 4.0, MS Basic 6.0, MS Fortran 4.1, MS C 5.1, MS Macro Assembler 5.1), which ensure that object files produced from sources written in different languages can be linked together into a single executable file. For example, compiled Pascal procedures can be linked with object files obtained from Fortran source code, or vice versa. Additionally, some software and hardware producers (Hercules, Microsoft, Borland, for example) offer graphic packages containing Basic, Pascal, Fortran, and assembler routines for the application of graphics in different graphic environments (CGA, EGA, VGA, Hercules, etc.) which can easily be implemented in the programming code. In the following paragraphs, some of the procedures specific to chemistry that must usually be programmed will be described from two aspects: the first is a condensed description of the problem, while the second is the basic procedure necessary to program the task.
4.3 HANDLING CHEMICAL STRUCTURES WITH PC
4.3.1 General
Most chemists will agree that the chemical structure is a common denominator in the majority of chemical work and that it therefore seems natural to discuss the ways in which chemical structures can be handled (input, output, display, comparison, search, ranking, etc.) by computers (ref. 1) in general and by personal computers in particular.
Usually, someone in a chemical laboratory comes up with the idea of organizing a collection of chemical structures and linking it with a specific application. Due to the lack of general-purpose packages enabling chemists to create a data base of structures according to their specific needs, chemists are forced to 'reinvent the wheel', building such a system from scratch. To help avoid this situation, we shall discuss the procedures (editing and representation of chemical structures, sub- and superstructure search, etc.) and ways (linking structure generation with files containing structures, making access to structure related features easier, etc.) needed to prepare a custom tailored data file of chemical structures.
4.3.2 Editing a structure
Editing a chemical structure using a computer means building, changing, storing, copying, downloading, or otherwise interactively manipulating chemical structures with commands familiar to chemists. Figure 4.1 shows the process of editing the structure of 3-amino cyclohexanone using different commands from the menu displayed on the screen. Each selection of the chemist is immediately displayed on the screen so he or she can closely follow the assembling of the structure. Once the desired structure is generated, the user should be able to use its representation (the connection table) in many different ways: to store it, to combine it with other structures, to supplement it with textual information, to decompose it into fragments, to add it to a collection, to use it as a target or query compound in different searches or procedures, to use it in different applications such as simulation of spectra or determination of properties, to calculate the molecular formula, to draw it on a plotter, etc.
4.3.3 Representation of chemical structures
Connectivity matrix and connection table. The most frequently used forms for representing chemical structures in the computer are the connectivity matrix (CM) and the connection table (CT). In the CM, the diagonal element Cii is the chemical symbol of the i-th atom, while the off-diagonal elements Cij represent the bond orders
between the i-th and the j-th atom. Figure 4.2 (left) shows the CM of 3-amino cyclohexanone.
Fig. 4.1 Building a chemical structure (3-amino cyclohexanone in this example) with commands (CHAIN, RING, ATOM, BOND, BRIDGE, DELETE, INSERT, CT, MENU) partially selected from the menu and partially typed in by the user. The numbering of atoms is important for two reasons: first, for fast addressing of atoms in the editing process and, second, for comparing the retrieved structures with the on-screen structure. The particular software was developed in the author's lab.
It can be seen that some of the information in the CM is redundant (each bond is listed twice) and that a large portion of the matrix is empty (elements equal to zero). This indicates that the structure can be represented more economically by a table of constant width w. Such a representation requires only wN instead of N^2 variables. In the i-th row of the new representation, w data associated with the i-th atom (chemical symbol of the element, sequential numbers of its neighbors and the bond types to them) are stored. Such a representation is called the connection table of a chemical structure, or CT (Fig. 4.2).
Fig. 4.2 The connectivity matrix (CM) (left) and the connection table (CT) (right) of 3-amino cyclohexanone

CM (atoms 1-8):
      1  2  3  4  5  6  7  8
  1   C  1  0  0  0  1  0  0
  2   1  C  1  0  0  0  0  0
  3   0  1  C  1  0  0  0  0
  4   0  0  1  C  1  0  0  1
  5   0  0  0  1  C  1  0  0
  6   1  0  0  0  1  C  2  0
  7   0  0  0  0  0  2  O  0
  8   0  0  0  1  0  0  0  N

CT (atom, symbol, neighbors with bond orders in parentheses):
  1  C   2(1)  6(1)
  2  C   1(1)  3(1)
  3  C   2(1)  4(1)
  4  C   3(1)  5(1)  8(1)
  5  C   4(1)  6(1)
  6  C   1(1)  5(1)  7(2)
  7  O   6(2)
  8  N   4(1)
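As a minimal sketch of how such a table can be kept in a program (the array layout and names below are illustrative assumptions, not the format used by any particular package), the CT of Fig. 4.2 may be stored in parallel arrays: a character array for the atom symbols and two integer arrays holding, for each atom, the sequential numbers of its neighbors and the corresponding bond orders. A small Fortran example:

    program ct_demo
      ! Connection table of 3-amino cyclohexanone held in parallel arrays:
      ! symbol(i) - chemical symbol, neighbor(i,k)/bond(i,k) - k-th neighbor
      ! of atom i and the bond order to it (0 marks an unused entry).
      implicit none
      integer, parameter :: natoms = 8, maxnb = 4
      character(len=2) :: symbol(natoms)
      integer :: neighbor(natoms,maxnb), bond(natoms,maxnb)
      integer :: i, k

      symbol   = (/ 'C ', 'C ', 'C ', 'C ', 'C ', 'C ', 'O ', 'N ' /)
      neighbor = 0
      bond     = 0
      ! ring bonds 1-2-3-4-5-6-1, the C6=O7 double bond and the C4-N8 bond
      call addbond(1,2,1); call addbond(2,3,1); call addbond(3,4,1)
      call addbond(4,5,1); call addbond(5,6,1); call addbond(6,1,1)
      call addbond(6,7,2); call addbond(4,8,1)

      do i = 1, natoms                       ! print the CT row by row
         write (*,'(i2,1x,a2,4(i3,i2))') i, symbol(i), &
               (neighbor(i,k), bond(i,k), k = 1, maxnb)
      end do

    contains
      subroutine addbond(i, j, order)        ! store each bond in both rows
        integer, intent(in) :: i, j, order
        call append(i, j, order)
        call append(j, i, order)
      end subroutine addbond

      subroutine append(i, j, order)         ! first free slot in row i
        integer, intent(in) :: i, j, order
        integer :: k
        do k = 1, maxnb
           if (neighbor(i,k) == 0) then
              neighbor(i,k) = j
              bond(i,k) = order
              return
           end if
        end do
      end subroutine append
    end program ct_demo

Printing the rows reproduces the right-hand part of Fig. 4.2; the same arrays are a convenient starting point for storing the structure in a file or for decomposing it into fragments.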
Fragment code. In computerized chemical information systems, especially in connection with other structure handling algorithms (substructure search, structure generation, etc.), some sort of fragment code must be used. Structural fragments can be defined on the basis of chemical properties, statistical frequency of occurrence in chemical compounds, or from a purely formal mathematical (ref. 2,3), i.e. graph-theoretical, aspect with little relevance to physical or chemical properties. Sometimes, the graph-theoretical definition of fragments is supplemented with basic chemical properties like the type of atoms, bond orders, etc. Another approach is to label selected chemical fragments with easily recognizable features (for example: the -OH or -C(=O)- group, an aromatic ring, structural skeletons, etc.) with consecutive numbers, letters, or special characters. The full structure representation is then either a list of all present or a list of all different structural features. If the different fragments are labeled fi, then a structure S can be written as a set of m fragments:

    S = (f1, f2, f3, ..., fm)                                        (4.1)

For a given structure the number and type of different (or of all) fragments depends entirely on the definition of the fragments. For any kind of structure handling procedure the definition of fragments should be unique for all structures in the collection.
In order to be as simple as possible, fragments are usually defined as structures with one atom in the center and a number of layers of neighbors up to which the fragment is defined. The first layer of neighbors consists of the bonds and atoms directly bonded to the central atom, the second layer consists of the bonds and atoms bonded to the atoms in the first layer and not yet taken into account, and so on. Fragments should be stored in tables and described precisely enough that any structure can be decomposed into a unique set of them. The representation of all possible fragments in the form of a CM or CT (with all details: atoms, bonds, connections up to a certain layer) is very voluminous. To be precise, the limiting factor is not the computer space for storing fragments, but the time for scanning the entire collection of fragments for each atom of the query structure in order to find the match. However, fragment coding can be improved considerably by an adequate definition of fragments and a wise restriction of the coding elements. In this paragraph a procedure for the unique encoding of atom centered fragments into 32-bit strings is described. The length of 32 bits imposes a limitation on the size of the fragments (the number of neighbors to be considered) (ref. 4). If longer bit strings are used, more general atom centered fragments can be encoded. The reverse procedure enables the decoding of any 32-bit string into the corresponding atom centered fragment. The main limitation (which is not too restricting for organic chemistry) is the maximum number of 4 neighbors at each atom. The atoms, of course, can be more than 4-valent, provided that the excessive valences are used up by multiple bonds (in fragments like -SO2- and -NO2). An additional limitation excludes hydrogen atoms from the encoding: only non-hydrogen atoms are coded and treated as atoms. Hydrogens are (or can be) added at the end of the procedure to each atom as required by its unsaturated valence. The encoding starts at the right side of the 32-bit word, at its least significant part. The first 3 bits are used for encoding the type of the central atom, while the following 4 times 7 bits are used consecutively, one 7-bit group for each neighbor of the central atom. If the central atom has only two or three neighbors, only two or three 7-bit strings are encoded, respectively. Each 7-bit string is coded using the same scheme: the consecutive 2, 3, and 2 bits (starting at the most significant part of the 7-bit string) represent the bond central-atom-neighbor, the atom type of the neighbor, and the number of second layer neighbors bonded to this particular neighbor. The
32nd bit is always empty (equal to 0), which ensures that the ID number of the fragment is always positive. In order to obtain a unique coding scheme, the order in which the neighbors are coded has to be established. First, using the formula

    NSi = 32 * BOND + 4 * ATOM + NEIGHBORS                           (4.2)

a 7-bit number NSi is calculated for each neighbor i. The parameter 'BOND' specifies the bond type between the central atom and the neighbor (BOND = 1, 2, or 3 for a single, double and triple bond, respectively); the value 'ATOM' (in the range 1 - 7, with carbon = 1, oxygen = 2, etc.) represents the type of the atom; while 'NEIGHBORS' is the number of non-hydrogen second layer neighbors (0-3). The value of NSi can be between 0 (no neighbor) and 127. In order to obtain a unique (smallest possible) identification number ID for the fragment, all NSi are sorted in descending order (NS1 ≥ NS2 ≥ NS3 ≥ NS4) and added to the code of the central atom type CATOM:

    ID = CATOM + 8 NS1 + 1024 NS2 + 131072 NS3 + 16777216 NS4       (4.3)

The numerical factors (2^3, 2^10, 2^17 and 2^24, respectively) in equation (4.3) place the corresponding NSi at the proper positions in the 32-bit string. Besides the ease of encoding, the advantage of this fragment code is that the topology of a fragment can be directly reproduced from its ID number. The described scheme with all coding possibilities is shown in Figure 4.3. The described fragment code can be modified according to the user's needs: with longer strings, larger fragments can be encoded or the atoms can be specified more precisely. A disadvantage of all fragment codes is the non-unique description of structures; hence, two identical lists of fragments do not mean that the corresponding compounds have identical structures. To confirm the identity, a time consuming atom-by-atom comparison must be invoked. Fortunately, in a substructure search over a large file of chemical structures such a tedious comparison has to be performed only on the list of compounds having the query's set of fragment codes as a subset of their constituents. The number of compounds on this 'good list' is in most searches orders of magnitude smaller than the number in the entire collection.
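The calculation of equations (4.2) and (4.3) is easily programmed. The Fortran function below is only a sketch, written under the assumption that the calling program has already extracted, for the central atom, the bond types, atom-type codes and second-layer neighbor counts of its (at most four) neighbors; all names are illustrative.

    ! Atom-centred fragment code of eqs. (4.2) and (4.3).
    !   catom       - 3-bit code of the central atom (carbon = 1, oxygen = 2, ...)
    !   bondtype(k) - bond order central atom -> k-th neighbor (1, 2 or 3)
    !   atomtype(k) - 3-bit atom code of the k-th neighbor
    !   nsecond(k)  - number of non-hydrogen second-layer neighbors (0-3)
    !   nnb         - number of neighbors (at most 4)
    integer function fragment_id(catom, bondtype, atomtype, nsecond, nnb)
      implicit none
      integer, intent(in) :: catom, nnb
      integer, intent(in) :: bondtype(4), atomtype(4), nsecond(4)
      integer :: ns(4), k, j, tmp

      ns = 0
      do k = 1, nnb                              ! eq. (4.2) for each neighbor
         ns(k) = 32*bondtype(k) + 4*atomtype(k) + nsecond(k)
      end do

      do k = 1, 3                                ! sort NS1..NS4 in descending order
         do j = k + 1, 4
            if (ns(j) > ns(k)) then
               tmp = ns(k); ns(k) = ns(j); ns(j) = tmp
            end if
         end do
      end do

      ! eq. (4.3): NS1..NS4 are shifted to bit offsets 3, 10, 17 and 24
      fragment_id = catom + 8*ns(1) + 1024*ns(2) + 131072*ns(3) + 16777216*ns(4)
    end function fragment_id

Because NSi never exceeds 127 and the 32nd bit stays empty, the result always fits into a positive 32-bit integer.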
Fig. 4.3 Encoding structural fragments into 32-bit long strings. The encoding starts at the least significant part (right) and proceeds towards the left, yielding the smallest numbers for the simplest fragments.

32-bit word layout (from the least significant end): 3 bits for the central atom, followed by four 7-bit fields (Bond, Atom, Nc) for the 1st, 2nd, 3rd and 4th neighbor; bit 31 remains empty.

Central atom and neighbor atom, 3 bits: 000 no atom, 001 carbon, 010 oxygen, 011 nitrogen, 100 sulphur, 101 phosphorus, 110 halogen, 111 any other atom.
Bond, 2 bits: 00 no bond, 01 single bond, 10 double bond, 11 triple bond.
Number of continuations Nc, 2 bits: 00 end atom, 01 one atom, 10 two atoms, 11 three atoms.
4.3.4 Sub- and superstructure search
At the very bottom level of most structure handling algorithms, two structures are compared atom by atom and bond by bond (ref. 1). However, the preprocessing steps, the I/O conditions, the constraints in the query or in the reference structures, and the requirements for a match or failure differ considerably from application to application. The most frequently used structure manipulating procedures are substructure and superstructure searches. In the substructure search the query (input) is small compared to the reference structures. The goal is to find all structures in the reference file that contain the query as a substructure. In the superstructure search the investigated (input) structure is large compared to the structures in the reference file, which is usually much shorter than in the case of a substructure search. The purpose of the superstructure search is to identify all structures (skeletons, substituents, parts) from the reference file that fit into the query structure (Fig. 4.4). A superstructure search consists of a number of substructure searches, in each of which the 'reference' file contains only one structure, namely the query. The number of substructure searches is equal to the number of structures in the reference file of the superstructure search. As already mentioned, the comparison of structures is a tedious and time consuming procedure. Therefore, the part where atom-by-atom and bond-by-bond comparisons are made should be executed on a file containing as small a number of structures as possible. To achieve this, a fast scan of the long reference file should be made first to select only the structures that possibly contain the query.
Usually this is done in two steps: first, the query structure Sx is decomposed into fragments (see equation 4.1):

    Sx = (f1, f2, f3, f4)                                            (4.4)

and second, using the inverted file of fragments, a group of structures containing, besides others, all fragments of Sx is selected.
Fig. 4.4 A substructure (left) and a superstructure (right) search. In the substructure search the query is usually small compared with the structures in the reference file, while in the case of a superstructure search the reference file is much shorter and contains only small fragments and/or skeletons.
    S1 = (f1, f2, fa, fb, f3, fa, fc)
    S2 = (f1, f2, f3, f4)
    ...
    Sk = (fm, f1, f2, f3, f4)
The obtained group of k possible candidates is much shorter than the entire file. On the k candidates the atom-by-atom and bond-by-bond comparison must be made (Fig. 4.5).
Fig. 4.5 The inverted file of fragments contains the identification numbers of all reference structures having the same fragment in the same record. Decomposition of the query structure into fragments and scanning of the inverted file yields a short file of possible candidates.
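The essential check of the pre-screening step is a simple set inclusion: a reference structure stays on the 'good list' only if its fragment list contains every fragment ID of the query. A minimal Fortran sketch (the names and the in-memory layout are illustrative; in practice the ID lists come from the inverted file):

    ! .true. if every query fragment ID also appears in the reference list.
    logical function contains_all(nq, query_ids, nr, ref_ids)
      implicit none
      integer, intent(in) :: nq, nr
      integer, intent(in) :: query_ids(nq), ref_ids(nr)
      integer :: i, j
      logical :: found

      contains_all = .true.
      do i = 1, nq
         found = .false.
         do j = 1, nr                          ! linear scan of the reference list
            if (ref_ids(j) == query_ids(i)) then
               found = .true.
               exit
            end if
         end do
         if (.not. found) then                 ! one missing fragment is enough
            contains_all = .false.
            return
         end if
      end do
    end function contains_all

Only the structures for which this test succeeds are passed on to the expensive atom-by-atom comparison.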
The IBM PC compatible program GEN (Fig. 4.1), designed and written in the author's laboratory, has an option for downloading (to a sequential permanent file) the set of atom centered fragments, coded into 32-bit strings as described in paragraph 4.3.3, of each currently edited structure. This option enables the generation of a file containing the lists of fragment codes for a whole collection of structures.
4.3.5 Update and retrieval in direct access files using a hash algorithm
The formation of an inverted file of structural fragments for a large collection of structures can be made via a hashing algorithm applied to the fragment ID numbers. After a chemical structure is decomposed into fragments and the fragments are encoded into 32-bit ID numbers, the question arises how to find (how to access) the record where the information about a particular fragment is stored. Because they are too large, the 32-bit long numbers (of the order of 10^9) are not usable as addresses for direct access. The same problem is encountered if the 'key' information for access is the chemical name of the compound or fragment ('ADAMANTANE' or 'CARBONYL', for example). Before any large number or alphanumeric 'key' is used for addressing a record in the direct access file, it must be transformed into a number between 1 and N, N being the length of this file. The procedure employed for such a transformation is called a 'hash' algorithm (ref. 5). The problem of how to transform an arbitrary alphanumeric string into a large number is easily solved by chopping the string into small parts of equal length (usually 4 bytes long) and then XOR-ing the parts into a single large number. For example, the key 'ADAMANTANE' yields a number LARGE by the following procedure:

    LARGE = 'ADAM' XOR 'ANTA' XOR 'NE  '                             (4.5)
The XOR bit operation (0011... XOR 0101... = 0110...) is a preprogrammed function available in almost all high-level languages such as Fortran, Pascal, etc. In the described way, a character string of any length can be transformed into a 4-byte string which in effect can be regarded as a large integer number. It is interesting to note that the order in which the individual parts are XOR-ed together does not change the final result. Hash algorithms are widely used in many applications and there are a number of different approaches to transforming a long (large) number or multi-byte string into a short address in a unique way.
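A possible Fortran sketch of the key-folding step of equation (4.5) is shown below; the function name is an illustrative assumption, and the byte-to-integer conversion relies on the standard TRANSFER and IEOR intrinsics:

    ! Chop a character key into 4-byte pieces and XOR the pieces into one
    ! default (4-byte) integer, as in eq. (4.5).
    integer function fold_key(key)
      implicit none
      character(len=*), intent(in) :: key
      character(len=4) :: piece
      integer :: i, n

      fold_key = 0
      n = len_trim(key)
      do i = 1, n, 4
         piece = key(i:min(i+3, n))          ! a shorter last piece is blank-padded
         fold_key = ieor(fold_key, transfer(piece, 0))
      end do
    end function fold_key

For the key 'ADAMANTANE' the pieces 'ADAM', 'ANTA' and 'NE  ' are XOR-ed exactly as in equation (4.5).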
Any such algorithm must inevitably cause some different input keys to produce an identical address. This effect, which is inherent to all hash algorithms, is known as 'address collision', and programmers must provide a way to calculate consecutive addresses (an address increment) until an adequate address is reached.
Figure 4.6 shows how a hash algorithm works in the case of collisions. If hashing of a given key produces an address where the information about another key is stored, a new address must be calculated and its content checked again. The procedure is repeated until an empty record (for the update of new items) is reached or until the record containing the information on the identical key is found. In order to check the identity of keys, the complete reference key must be stored in each record.
Fig. 4.6 The hash algorithm produces another address (via the address increment) whenever a collision of two different keys (e.g. PROPENE, ADAMANTANE, BENZENE) occurs on the same address.
One of the most commonly used hash algorithms employs twin prime numbers (two consecutive odd numbers that are both primes) and the modulo function. If KEY is a large number (a fragment ID or the XOR-ed parts of a chopped long string) and NP the length of the direct access file, then the calculated address ADDR and the increment INCR are obtained by the following equations:

    ADDR = MOD(KEY-1, NP) + 1
    INCR = MOD(KEY-1, NP-2) + 1

The only requirement is that the length of the direct access file NP is set to the larger of the two twin primes (both NP and NP-2 are prime). In any case, the length of a direct access file for which hashing is employed should be chosen about 10-20 % larger than the expected number of records to be actually stored. Such a surplus of empty space guarantees a reasonable access time to empty and/or correctly addressed records. The programmer must be aware that the number of collisions increases sharply once the file is more than 85 % full. The full algorithm for the direct access of information described by the large number or character string KEY can be written as follows:
    A1:  ADDR = MOD(KEY-1, NP) + 1
         INCR = MOD(KEY-1, NP-2) + 1
    A2:  read(file, rec=ADDR) KEYREF, list
         if KEYREF = 0   then return     (no information for KEY found)
         if KEYREF = KEY then return     (search returns 'list')
         if KEYREF ≠ KEY then ADDR = ADDR + INCR
         if ADDR > NP    then ADDR = ADDR - NP
         continue at A2
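As a hedged sketch, the retrieval part of this algorithm might look as follows in Fortran; the record layout (reference key, list length, up to 20 list entries), the subroutine name and the fixed list size are assumptions made for illustration, and KEY is taken to be positive (as the fragment IDs of section 4.3.3 are):

    ! Look up KEY in a direct access file of length NP opened on 'unit'.
    subroutine hash_lookup(unit, np, key, nlist, list, found)
      implicit none
      integer, intent(in)  :: unit, np, key
      integer, intent(out) :: nlist, list(20)
      logical, intent(out) :: found
      integer :: addr, incr, keyref

      addr = mod(key - 1, np) + 1              ! A1: primary address
      incr = mod(key - 1, np - 2) + 1          !     address increment
      do
         read (unit, rec=addr) keyref, nlist, list   ! A2: inspect the record
         if (keyref == 0) then                 ! empty record: KEY is not stored
            found = .false.
            return
         else if (keyref == key) then          ! match: 'list' holds the data
            found = .true.
            return
         end if
         addr = addr + incr                    ! collision: try the next address
         if (addr > np) addr = addr - np
      end do
    end subroutine hash_lookup

The update branch differs only in that an empty record is written to rather than reported as a failure; in production code one would also limit the number of probes to NP.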
This algorithm can be applied either for the update of new items in the file or for retrievals. It is evident that all records of the entire file must be repositioned (rehashed) if the old file becomes too small and must be extended, i.e. the new length NP must be used in the retrieval and update. Therefore, a careful study and a realistic estimation of the needs must be made in advance.
4.4 SPECTRA REPRESENTATION IN THE COMPUTER
4.4.1 General
Another very broad field in chemistry not adequately covered on the software market is the handling of spectral collections. There are, of course, a number of instrument producers that provide spectra handling software for their own instruments and 'data stations'. Unfortunately, their software is mostly neither open for the users to modify nor documented adequately enough to take full advantage of it. The worst examples of such software do not even allow the user access to the 'raw' data produced by the instrument and offer no possibility to transfer the data to other computers where processing according to the user's needs can be done. Potential buyers must beware of such products, especially if they intend to work intensively on their own measured data, which is mainly the case in R&D laboratories.
The choice of a proper spectra representation is very critical when designing an information or expert system based on a particular spectroscopy. It influences the speed, efficiency and, of course, the reliability of the system (ref. 6). In spite of the increasing computational power (space and speed) installed in today's laboratories, the problem of spectra representation is more serious when considering the implementation of the information system on a PC than on a mainframe computer. Besides the number of spectra one wants to handle with the system, the type of spectroscopy (infrared, NMR, mass, etc.), the goal for which the spectra are collected (identification of compounds, prediction of properties, structure elucidation, etc.), and the way the spectra are collected (link with the instrument, manual digitization, transfer from the mainframe, etc.) are the deciding factors according to which the representation of spectra should be determined. A good spectra representation should:
- contain as much relevant information about the structure of the recorded compound as possible,
- be short enough to ensure economical handling of large amounts of data,
- allow good reproduction of the original spectrum from its representation,
- enable retrieval and identification of spectra based on a query represented in the same way,
- enable prediction of structural features and of different types of properties,
- allow the coding of representations of groups of spectra in the same way as individual spectra, etc.
There are still other requirements that a representation should fulfill, but they are mainly of a more specific nature. There is no representation that would satisfy all requirements; hence, the representation must be selected in a kind of trial-and-error procedure guided by good spectroscopic knowledge.
4.4.2 Peak tables
Probably the most common representation for all kinds of spectra used in computerized information and expert systems is the peak table. This very simple representation consists of a table containing all (or a certain number of the most significant) peaks appearing in the spectrum. Each peak is usually described by its position and intensity, but more information (half width, multiplicity, shape type, etc.) can be added if needed. Such tables are very convenient for a peak-by-peak search if inverted files containing the ID numbers of the reference spectra are at hand. These files must be generated in advance (Fig. 4.7). The problem with the peak-table representation is that the retrieved match is a rather inconvenient starting point for the evaluation of the experiment. A comparison between the full-curve query spectrum and the retrieved one(s), represented as peak table(s), is almost impossible. In order to assure a better comparison, a link from the table representation to the original (full-curve) reference spectrum must be maintained. However, even if such a link is implemented, we must be aware that the retrieved results obtained by ranking peak tables are worse compared to the results obtained by comparing full-curve spectra. The second problem, inherently associated with the peak search in the inverted file of 'peak vs. ID numbers', is the tolerance limit within which such a retrieval should be carried out. If the intervals in which the peaks are 'inverted' are broad, the search will probably yield the correct answer, but the list of produced matches will be rather long (Fig. 4.7).
Fig. 4.7 Inverted file for the retrieval of infrared spectra, generated from peak tables (positions in cm-1) for fast searching by peak positions (in the tolerance region ±1 cm-1). Each record of the inverted file lists the ID numbers of all reference spectra having a peak at the corresponding position.
On the other hand, if the tolerance interval is narrow, the correct spectrum can be lost if even one peak is not matched due to an experimental error in the query or reference peak table. To overcome this problem, a number of reduction methods (see Chapter 5) can be applied to obtain reduced representations of spectra (ref. 7).
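To make the role of the tolerance explicit, a minimal Fortran sketch of a peak-by-peak comparison is given below; the function name and the data layout (peak positions held in simple real arrays) are illustrative assumptions:

    ! Count how many query peaks find a counterpart in a reference peak
    ! table within a tolerance of +/- tol (e.g. in cm-1).
    integer function matched_peaks(nq, qpos, nr, rpos, tol)
      implicit none
      integer, intent(in) :: nq, nr
      real,    intent(in) :: qpos(nq), rpos(nr), tol
      integer :: i, j

      matched_peaks = 0
      do i = 1, nq
         do j = 1, nr
            if (abs(qpos(i) - rpos(j)) <= tol) then
               matched_peaks = matched_peaks + 1  ! peak i is matched
               exit
            end if
         end do
      end do
    end function matched_peaks

Widening tol lengthens the list of reference spectra that match all query peaks, while narrowing it risks losing the correct spectrum, exactly as discussed above.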
4.4.3 Organization of full-curve or reduced representations of spectra
Representations. From the formal point of view, the full-curve and the reduced representations of spectra, as well as all handling of them, are identical. The only difference is the length (dimensionality) of the representations. The most important aspect of the 'reduced spectral representation' (for the reduction of spectral curves see Chapter 5) is the possibility to work with a significantly smaller number of variables compared to the number of intensity values in the full-curve representation. It is assumed, of course, that the reduced representation carries only slightly less information than the original full-curve spectrum.
Two quantities are most commonly evaluated (calculated) during spectra comparison: the first one is the similarity Sij between two spectra and the second one the representation of a group of spectra. The similarity between two spectra is used for retrieval, ranking, clustering, structure prediction, simulation, etc., while the representation of a group is mainly used for linking structurally similar compounds, or compounds with similar properties, together, or for the extraction of significant features. The underlying assumption in the evaluation of both quantities, the Sij and the representation of a group, is that compounds with similar properties have similar structural features and thus produce similar spectra. In view of the fact that no strict rule for a quantitative definition of the similarity between structures exists, it is hard to justify the above assumption. However, many valuable results can be obtained using the correlation between the similarity of properties (structures) and the similarity of spectra. If the reduced representation of a spectrum i is written in a 'vector' form Ri as
    Ri = (r1, r2, r3, ..., rm)                                       (4.7)
then the similarity Sij between two 'spectra' Ri and Rj can be expressed as the inverse of the distance between the corresponding representations.
The distance between two points Ri and Rj in the representation space can be any nonnegative, real, commutative function that satisfies the triangle inequality (ref. 8). Usually, when comparing spectra, Euclidean or Manhattan distances are employed. The generalized form of both, the Minkowski distance, can be written as follows:

    dij = [ Σ(k=1,m) |xki - xkj|^p ]^(1/p)                           (4.9)

where m is the dimensionality of the measurement space (representation). For p = 1 and p = 2 the Manhattan and the Euclidean distances are obtained, respectively (ref. 8). Once the distance (similarity) between individual spectra is defined, a ranked list of the matches most similar to the query, or any other related quantity, can be obtained by scanning the entire reference collection.
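For p = 2, the distance of equation (4.9) and the scan over the reference collection can be sketched in a few lines of Fortran; the subroutine name and the array layout refs(m, nref) are assumptions made only for this illustration:

    ! Return the index ibest of the reference spectrum closest (in the
    ! Euclidean sense, eq. (4.9) with p = 2) to the query representation.
    subroutine best_match(m, nref, refs, query, ibest, dbest)
      implicit none
      integer, intent(in)  :: m, nref
      real,    intent(in)  :: refs(m, nref), query(m)
      integer, intent(out) :: ibest
      real,    intent(out) :: dbest
      integer :: i
      real :: d

      ibest = 0
      dbest = huge(1.0)
      do i = 1, nref
         d = sqrt(sum((query - refs(:, i))**2))   ! Euclidean distance
         if (d < dbest) then
            dbest = d
            ibest = i
         end if
      end do
    end subroutine best_match

Keeping the k smallest distances instead of only the smallest one turns the same loop into the ranked hit list mentioned above.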
The representation of a group of spectra must emphasize the common spectral properties of the linked compounds very clearly, otherwise the extraction of relevant structural features becomes very difficult if not impossible. One of the most commonly used (although not the best) representations for a group of n objects is their average:

    G = (g1, g2, ..., gm),   with gk = (1/n) Σ(i=1,n) xki            (4.10)
Usually much better, even though harder to obtain, is a weighted average:

    G' = W . G = (w1.g1, w2.g2, ..., wm.gm) = (g'1, g'2, ..., g'm)   (4.11)
with the weights wi (values between 0 and 1) expressing the importance of each specific component in the representation scheme. Adequate weights for a reduced representation are harder to obtain than for a full-curve one, because for the latter a number of spectra-structure correlations are available. The weights for reduced representations can be obtained by a trial and error procedure on a number of known cases, using some standard clustering method (ref. 9,10) for checking the results.
Handling large collections. Once the representation G (or G') of a group is established in the same m-dimensional space as the objects (spectra or their reduced representations), the distance between groups and/or objects can be evaluated using equation (4.9). Although full-curve or reduced spectra are mainly exploited in a sequential way (i.e. one after another through the entire collection), the most efficient way to handle them is a hierarchical organization. Figure 4.8 shows hierarchically organized spectra. Although the space used for a hierarchical organization is twice that required for the sequential one, the loss is more than compensated by a significant gain in the efficiency and quality of retrieval.
Fig. 4.8 Hierarchically organized spectral data base. Full and empty circles represent single spectra Ri and groups of spectra Gj, respectively. The update or retrieval always starts at the root and proceeds towards the leaves (the Ri's). Some of the clusters (A, B, C) contain spectra of compounds having easily recognizable structural features in common. For an object X travelling through such a cluster, the common structural feature can be predicted.
The most outstanding property of a two-descendant (binary) hierarchy is that each individual object (a spectrum in this case) can be reached from the root in approximately log2 N comparisons. The actual value depends on how well the hierarchy (tree) is balanced, but even for trees that are far from perfectly balanced, the average number of comparisons is very small compared to the number of spectra in the entire collection. The scope of this book is too limited to explain the details of how such an organization can actually be achieved; the interested reader is referred to the relevant references (ref. 11,12). It has to be said, however, that a hierarchical organization shows its full potential when large numbers of items are to be handled. By 'large numbers' we mean collections containing several thousand and more spectra.
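The retrieval pass through such a hierarchy can be sketched as follows; the storage scheme (node arrays left, right and rep holding the two descendants and the group representation of each node, with the root stored first) is an assumption made for illustration and not the organization used in refs. 11 and 12:

    ! Descend from the root to a leaf, at each node following the branch
    ! whose group representation is closer to the query (squared Euclidean
    ! distance); the returned index is the leaf, i.e. a single spectrum.
    integer function descend(m, nnode, query, rep, left, right)
      implicit none
      integer, intent(in) :: m, nnode
      real,    intent(in) :: query(m), rep(m, nnode)
      integer, intent(in) :: left(nnode), right(nnode)   ! 0 marks a missing child
      integer :: node
      real :: dl, dr

      node = 1                                 ! the root is stored first
      do while (left(node) /= 0 .and. right(node) /= 0)
         dl = sum((query - rep(:, left(node)))**2)
         dr = sum((query - rep(:, right(node)))**2)
         if (dl <= dr) then
            node = left(node)
         else
            node = right(node)
         end if
      end do
      descend = node
    end function descend

With a reasonably balanced tree the loop is executed only about log2 N times, which is the source of the efficiency gain discussed above.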
4.5 CONCLUSION
In any field, be it chemistry, medicine, archeology, economy, or any other, there is always a need to program a specific problem on one's own. It is true that such a piece of code cannot be a substitute for a professional software package, but it can in many cases shorten tedious work and/or accelerate the solution of a troublesome situation, or even solve the entire problem. In spite of the fact that such programs are very seldom passed on to other persons or groups and are not measured by the same yardstick as the packages on the market, they should nevertheless use sound algorithms and proper methods to attack the specific problems, which in turn requires knowledge of basic algorithms and of the fundamentals of programming. The above is true particularly in science, where many programs are written daily with very specific needs in mind. Trying to be at the top of their field, scientists attempt to treat their data in a unique way at least at one point of the data handling process - a requirement that by definition cannot be met by bought software.
4.6 REFERENCES
1. N.A.B. Gray, 'Computer-Assisted Structure Elucidation', John Wiley, New York, 1986, chapters 7 and 9.
2. K.A. Ross, C.R.B. Wright, 'Discrete Mathematics', Second Edition, Prentice-Hall International, London, 1988.
3. F. Harary, 'Graph Theory', Addison-Wesley, Reading, 1972, chapters 2 and 13.
4. J. Zupan, 'Algorithms for Chemists', John Wiley, Chichester, 1989.
5. D.E. Knuth, 'The Art of Computer Programming', Vol. 3: Sorting and Searching, Addison-Wesley, Reading, Second printing, 1975, p. 506.
6. J. Zupan (Ed.), 'Computer-supported Spectroscopic Data Bases', Ellis Horwood, Chichester, 1986.
7. J. Zupan, S. Bohanec, M. Razinger, M. Novic, Reduction of the Information Space for Data Collections, Anal. Chim. Acta, 210 (1988) 63-72.
8. K. Varmuza, 'Pattern Recognition in Chemistry', Springer Verlag, Berlin, 1980, p. 25.
9. B. Everitt, 'Cluster Analysis', Heinemann Educational Books, London, 1977.
10. D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, L. Kaufman, 'Chemometrics: A Textbook', Elsevier, Amsterdam, 1988, p. 371.
11. J. Zupan, 'Clustering of Large Data Sets', Research Studies Press (Wiley), Chichester, 1982.
12. J. Zupan, M.E. Munk, Hierarchical Tree Based Storage, Retrieval, and Interpretation of Infrared Spectra, Anal. Chem., 57 (1985) 1609.