Orthogonality considerations for library searching Nth-order data


Chemometrics and Intelligent Laboratory Systems 41 (1998) 115–125

Chad E. Anderson, Reinaldo G. Nieves, John H. Kalivas *

Department of Chemistry, Idaho State University, Pocatello, ID 83209, USA

Abstract

Identification of chemicals from multi-order (Nth-order) instruments is typically carried out with only first-order library searches. Simultaneous library searches across all orders of Nth-order data enhance identification of chemicals. Three methods are presented for simultaneous Nth-order library searches based on measuring the angle between the unknown and the reference chemicals present in a library. The first method, called first-order limited (FOL), consists of unfolding Nth-order data to first-order and using the standard dot product function to compute the angle. The second approach, second-order limited (SOL), unfolds Nth-order data to second-order and uses singular value decomposition to obtain the angle. The third method, Nth-order (NO), computes the angle directly without unfolding. Performance evaluations show the advantage of SOL in search time for large libraries, while FOL and NO appear to have a higher degree of selectivity. FOL and NO are shown to be the same measure, but NO has a computational advantage. Computer memory usage for all three methods is also addressed. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Multi-spectral library search; Selectivity; Orthogonality

* Corresponding author.

0169-7439/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII S0169-7439(98)00028-8

1. Introduction

A common problem in analytical chemistry is the analysis of a sample to identify the chemical constituents present. This is often accomplished by using an instrument that produces responses reflecting characteristic information about the chemicals present in a sample. For example, a chromatogram and a spectrum provide chemical-specific information as signal intensity corresponding to time and radiation frequency, respectively. An instrument response such as a spectrum can then be analyzed to identify chemical constituents using a library search method. In this case, the instrument response for a sample is compared to a reference library of corresponding instrument responses for known chemicals [1]. Successful identification is accomplished when a match occurs between the instrument response for the sample and that of a reference chemical.

Computer algorithms have become an important tool in the identification of a chemical constituent present in a sample. Various methods of library searching have recently been presented with different degrees of success [2–6]. A comparison of five established library search algorithms showed that the best performance was obtained by the mathematical dot product function [7]. In the dot product approach, the degree of similarity is determined by calculating the angle, $\theta_i$, between the unknown sample instrument response and the ith chemical library instrument response. The angle is computed by

$$\cos\theta_i = \frac{x^{t} l_i}{\|x\|\,\|l_i\|} \qquad (1)$$

where $x$ symbolizes the first-order $m \times 1$ vector for a sample measured at $m$ responses, $l_i$ denotes the ith $m \times 1$ library vector of responses, and $\|\cdot\|$ represents the Euclidean norm of a vector. In this method, an angle of 0° ($\cos\theta_i = 1$) indicates identical responses (positive identification), and 90° ($\cos\theta_i = 0$) indicates totally dissimilar unknown and library responses. Angles close to 0° denote reference chemicals similar to the sample with respect to the measured instrument responses. This angle approach has only been used with first-order responses, i.e., an instrument response consisting of a single vector of data (chromatogram, spectrum, etc.).

A basic dilemma in chemical identification arises when two or more chemicals produce similar responses from one type of instrument, resulting in two or more possible identifications for a chemical constituent. To minimize this potential ambiguity, chemicals can be analyzed by two or more instrumental techniques, where the likelihood of simultaneous similar behavior between different chemicals is substantially reduced [8–11]. In part for this reason, second-order, or Nth-order, techniques are becoming increasingly popular. Instruments providing simultaneous information in two directions are called second-order instruments, and those with N directions are Nth-order instruments [12]. Second-order instruments produce a matrix of information, and Nth-order instruments produce an N-dimensional array of instrument responses. Examples of common second-order instruments are combinations of gas chromatography (GC) with infrared spectroscopy (IR), GC with mass spectroscopy (MS), and liquid chromatography (LC) with photodiode array detection (DAD).

Despite the popularity of second-order instruments, most library search methods evaluate information only in the first order, i.e., a single order from an Nth-order data set is used.
For example, library searches based on GC-MS data generally use information only from the MS order. Since Nth-order instruments enhance chemical information, library searches of Nth-order data should improve chemical identification as well. That is, chemicals with similar spectra will be listed as matches in first-order spectral library searches, but when other information orders, such as a chromatographic order, are included in a simultaneous Nth-order library search, the same chemicals may appear totally dissimilar. As an example, GC retention indices in combination with MS have been used to further distinguish chemicals [9,10]. Unfortunately, only a few retention indices are used for each compound, thereby reducing the multivariate advantage of using the complete chromatogram [13]. The lack of useful match indicators and of search algorithms capable of using all the measured data has limited the use of Nth-order library searches [14–16]. Methods to alleviate this problem are proposed and evaluated in this paper. The focus is on multivariate extensions of determining the $\cos\theta_i$ values in Eq. (1) to Nth-order data.

A simple approach to extending the common first-order library search method of Eq. (1) to a simultaneous Nth-order method involves unfolding Nth-order data to pseudo first-order and using Eq. (1). However, unfolding takes time, and simultaneous matching of all orders without unfolding could prove more efficient.

A method published in the statistical literature for comparing second-order group similarities uses singular value decomposition (SVD) to measure the extent to which two or more groups differ, where a group can be considered as a matrix with rows for samples and columns representing measured variables [17]. The method was originally applied to studies of educational achievement in different student groups. The approach was recently used to compare different sampling seasons in soils [18]. Its use can be extended to group comparisons in other areas such as second-order library searches. In this case, the second-order data array for an unknown chemical is compared to a corresponding second-order library array.
The extension of this approach to Nth-order data would consist of unfolding to a final second-order form, but the unfolding of Nth-order to second-order may also delay the library search.

Another method for comparing similarities, recently published in the analytical chemistry literature, uses orthogonal projections. The approach was developed for assessment of selectivity and other figures of merit for Nth-order data [19] and was recently adapted for assessment of chromatographic peak purity [20]. As described in this paper, the method can also be extended to library searching.

The three methods are compared in this paper by testing their respective abilities to identify chemical constituents under challenging conditions, e.g., when different chemicals produce similar instrument responses. Search times and memory are also evaluated to determine the fastest and most efficient method. The focus is on library searching various second-order data sets in order to gain an understanding of how the three methods work. The three methods described above are named, respectively, first-order limited (FOL) for unfolding Nth-order to first-order and using Eq. (1), second-order limited (SOL) for unfolding Nth-order to second-order and using the SVD to obtain $\cos\theta_i$ values, and Nth-order (NO) for obtaining $\cos\theta_i$ values directly without any unfolding. Section 2 explicitly describes how $\cos\theta_i$ values are obtained for SOL and NO.

2. Mathematical details

2.1. First-order limited (FOL)

Let $x$ and $l_1, l_2, \ldots, l_i$ be nonzero first-order instrument responses, where $x$ symbolizes the unknown chemical vector of responses and $l_i$ represents the ith library chemical vector of corresponding responses. All first-order arrays are of size $m \times 1$, where $m$ is the number of response elements from a particular instrument. For example, $m$ could be the number of time increments for a chromatogram or the number of wavelengths for a spectrum. Note that the sizes of $x$ and $l_i$ are required to be the same. Most importantly, the information in $x$ and $l_i$ must be recorded in the same manner, e.g., the same spectral range for spectroscopy and the same time increments and solvent for LC. The $x$ and $l_i$ vectors can also be a combination of instrumental responses, e.g., ultraviolet-visible and IR spectra can be placed one after another to form a single vector.

Finally, the $x$ and $l_i$ vectors can be the result of unfolding Nth-order arrays to respective first-order arrays. For example, an $m \times s$ instrument response matrix $X$ can be unfolded to a vector. One approach is to combine the $m$ row vectors containing $s$ elements into a single vector. Alternatively, the $s$ column vectors containing $m$ elements can be combined to produce a single vector. The Nth-order unknown and library data arrays must be unfolded with respect to the same orders. The angle, $\theta_i$, between $x$ and $l_i$ is computed by Eq. (1). The $\cos\theta_i$ values range from 1 to 0, with $\cos\theta_i = 1$ denoting a perfect match and $\cos\theta_i = 0$ implying that the unknown and library vectors are orthogonal.

2.2. Second-order limited (SOL)

Let $X$ and $L_i$ represent second-order instrument responses for the unknown chemical and the ith library chemical, respectively. The matrices $X$ and $L_i$ have dimensions $m \times s$, where $m$ and $s$ represent the number of response elements in each order. For example, data recorded from an LC-DAD instrument produce a spectrochromatogram forming an $m \times s$ matrix with $m$ wavelengths and $s$ time increments. A row of this matrix is a chromatogram measured at a given wavelength, and a column is a spectrum measured at a given time increment. The $X$ and $L_i$ arrays can also be the result of unfolding Nth-order arrays to corresponding second-order matrices. As with unfolding to first-order, the order of unfolding does not matter, but the sample and library must be unfolded in the same manner.

Let $k$ and $r_i$ be the ranks of $X$ and $L_i$, respectively. An SVD of $X$ results in $X = U_X \Sigma_X V_X^{t}$, where $U_X$ symbolizes an $m \times k$ matrix containing eigenvectors of $XX^{t}$, $\Sigma_X$ represents a diagonal $k \times k$ matrix with the singular values $\sigma_i$ on the diagonal, and $V_X$ denotes an $s \times k$ matrix containing the eigenvectors of $X^{t}X$. Thus, an $m \times s$ spectrochromatogram with $m$ wavelengths and $s$ time increments would have spectral variance described by the eigenvectors in $U_X$ and chromatographic variance represented by the eigenvectors in $V_X$.
Performing an SVD on the library matrices produces $L_i = U_{L_i} \Sigma_{L_i} V_{L_i}^{t}$, where $U_{L_i}$, $\Sigma_{L_i}$, and $V_{L_i}$ are $m \times r_i$, $r_i \times r_i$, and $s \times r_i$, respectively.


New $r_i \times k$ matrices $M_i$ and $S_i$ are formed by

$$M_i = U_{L_i}^{t} U_X \qquad (2)$$

$$S_i = V_{L_i}^{t} V_X \qquad (3)$$

where the rank of $M_i$ and $S_i$ is $\min(r_i, k)$. Performing an SVD on $M_i$ results in $M_i = U_{M_i} \Sigma_{M_i} V_{M_i}^{t}$. The degree of similarity between the left eigenvectors for the unknown $X$ and a library $L_i$, i.e., $U_X$ and $U_{L_i}$, is defined as $\cos\theta_{U_i} = \sigma_{1M_i}$, where $\sigma_{1M_i}$ is the largest singular value, located at row 1, column 1 of $\Sigma_{M_i}$. This value is the cosine of the minimum angle between an arbitrary vector in the space of the $r_i$ eigenvectors of $L_i$ for the information order with $m$ measurements and the vector most nearly parallel to it in the space of the $k$ eigenvectors of $X$ for the same information order. Performing an SVD on $S_i$ results in $S_i = U_{S_i} \Sigma_{S_i} V_{S_i}^{t}$ and $\cos\theta_{V_i} = \sigma_{1S_i}$. The value $\sigma_{1S_i}$ is the cosine of the minimum angle between an arbitrary vector in the space of the $r_i$ eigenvectors of $L_i$ for the information order with $s$ measurements and the vector most nearly parallel to it in the space of the $k$ eigenvectors of $X$ for the same information order.

As with FOL, the $\cos\theta_{U_i}$ and $\cos\theta_{V_i}$ values range from 1 to 0, with a value of 1 implying a perfect match for the corresponding information order and a value of 0 indicating orthogonality for that order. A composite similarity for both information orders can be computed by $\cos\theta_i = \cos\theta_{U_i} \cos\theta_{V_i}$. Again, the composite $\cos\theta_i$ value will range from 1 to 0. Note that if one information order has a cosine value of 1 while the other has a cosine value of 0, the composite will be 0, denoting no match.

For single component systems, the focus of this paper, the chemical rank should be 1. In this case, Eqs. (2) and (3) become $m_i = u_{L_i}^{t} u_X$ and $s_i = z_{L_i}^{t} z_X$, respectively, where $u_{L_i}$ and $z_{L_i}$ are the $m \times 1$ and $s \times 1$ vectors taken from the first columns of $U_{L_i}$ and $V_{L_i}$, respectively, and $u_X$ and $z_X$ are the $m \times 1$ and $s \times 1$ vectors from the first columns of $U_X$ and $V_X$, respectively.
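For the rank-one case just described, the SOL similarity reduces to two dot products of leading singular vectors. The following is a minimal Python sketch under stated assumptions, not the authors' Matlab implementation: it uses plain power iteration instead of a full SVD, it assumes nonnegative single-component data so that the uniform start vector is not orthogonal to the solution, and the helper names (`outer`, `leading_singular_vectors`, `sol_cosine`) and the toy data are hypothetical. Absolute values are taken because singular vectors are defined only up to sign.

```python
import math


def outer(spectrum, chromatogram):
    """Build an m x s matrix (rows: wavelengths, columns: time increments)."""
    return [[a * b for b in chromatogram] for a in spectrum]


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def leading_singular_vectors(matrix, iterations=50):
    """Power iteration for the first left (u) and right (z) singular vectors.

    Assumes the matrix is nonzero and that the uniform start vector is not
    orthogonal to the leading right singular vector.
    """
    m, s = len(matrix), len(matrix[0])
    z = [1.0 / math.sqrt(s)] * s
    u = [0.0] * m
    for _ in range(iterations):
        u = [dot(row, z) for row in matrix]                               # u = A z
        norm_u = math.sqrt(dot(u, u))
        u = [a / norm_u for a in u]
        z = [sum(matrix[r][c] * u[r] for r in range(m)) for c in range(s)]  # z = A^t u
        norm_z = math.sqrt(dot(z, z))
        z = [a / norm_z for a in z]
    return u, z


def sol_cosine(unknown, library_entry):
    """Return (cos theta_u, cos theta_v, composite) for rank-one matrices."""
    u_x, z_x = leading_singular_vectors(unknown)
    u_l, z_l = leading_singular_vectors(library_entry)
    cos_u = abs(dot(u_l, u_x))   # m_i = u_L^t u_X
    cos_v = abs(dot(z_l, z_x))   # s_i = z_L^t z_X
    return cos_u, cos_v, cos_u * cos_v


# Toy data (illustrative only): same spectrum, different elution times.
spectrum = [0.0, 1.0, 2.0, 1.0, 0.0]
unknown = outer(spectrum, [0.0, 1.0, 0.0, 0.0])
same = sol_cosine(unknown, outer(spectrum, [0.0, 2.0, 0.0, 0.0]))
shifted = sol_cosine(unknown, outer(spectrum, [0.0, 0.0, 1.0, 0.0]))
print(same, shifted)
```

In the toy example the second library entry matches in the spectral order but is orthogonal in the chromatographic order, so its composite value is zero, which is the behavior described for the composite similarity.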
Because $m_i$ and $s_i$ are scalars, the degrees of similarity for the two orders simplify to $\cos\theta_{u_i} = u_{L_i}^{t} u_X$ and $\cos\theta_{v_i} = z_{L_i}^{t} z_X$. The composite similarity between $X$ and $L_i$ is computed by $\cos\theta_i = \cos\theta_{u_i} \cos\theta_{v_i} = u_{L_i}^{t} u_X \, z_{L_i}^{t} z_X$. Because only the first eigenvectors are needed, algorithms other than the SVD can be used to generate the respective eigenvectors.

Two distinct advantages to the SVD method exist. One is its noise filtering capability. For instance, using fewer eigenvectors than the mathematical rank discards those eigenvectors attributed to noise. The other advantage is that the two information orders for $X$ and $L_i$ do not have to be of the same respective sizes if only one similarity angle is desired. For example, if only $\cos\theta_{v_i}$ is needed, the values of $s$ for $X$ and $L_i$ must be the same, but the values of $m$ need not be. It should be noted that if first-order data are used, the SVD method simplifies to Eq. (1), provided vector $x$ is of the same size as vector $l$.

2.3. Nth-order (NO)

No unfolding is required in this approach, and data sets of any dimension can be compared. Let $X$ and $L_i$ denote Nth-order arrays. The degree of similarity between $X$ and $L_i$ is determined by

$$\cos\theta_i = \frac{\|P(X)\|}{\|X\|} \qquad (4)$$

where $P(X)$ indicates the orthogonal projection of $X$ onto the span of $L_i$ and $\|X\|$ is the norm of $X$ defined by $\langle X, X\rangle^{1/2}$, where $\langle X, X\rangle$ signifies the sum of the element-by-element products of $X$ with itself, forming a scalar. The projection of $X$ is given by $P(X) = (\langle L_i, X\rangle / \langle L_i, L_i\rangle) L_i$, where $\langle L_i, X\rangle$ and $\langle L_i, L_i\rangle$ are defined as the sums of the element-by-element products of $L_i$ with $X$ and of $L_i$ with $L_i$, respectively. The $\cos\theta_i$ values from Eq. (4) will range from 1 to 0. For this approach, the respective information orders of $X$ and $L_i$ must be of the same size. Further details on this Nth-order projection are available in Ref. [19], where the Nth-order approach was developed as a selectivity measure. In Ref. [19], it was incorrectly stated that unfolding Nth-order data would cause improper assessment of the orders. Elementary linear algebra considerations show that Eq. (4) becomes Eq. (1) when first-order data or an unfolded Nth-order data set is used. This identity between Nth-order data and Nth-order data unfolded to pseudo first-order also holds when Eq. (4) is used to assess selectivity.
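The equivalence of Eq. (4) and Eq. (1) can be checked numerically. Since $\|P(X)\| = |\langle L_i, X\rangle| / \|L_i\|$, Eq. (4) equals $|\langle L_i, X\rangle| / (\|L_i\|\,\|X\|)$, which is Eq. (1) applied to the unfolded arrays whenever the inner product is nonnegative (as it is for nonnegative instrument responses). The sketch below is an illustration in plain Python with nested lists standing in for Nth-order arrays; the function names and the small third-order example are assumptions made here, not part of the original work.

```python
import math


def inner(a, b):
    """<A, B>: sum of element-by-element products of two same-shaped arrays."""
    if isinstance(a, (int, float)):
        return a * b
    return sum(inner(p, q) for p, q in zip(a, b))


def no_cosine(x, l):
    """Eq. (4): ||P(X)|| / ||X|| with P(X) = (<L, X> / <L, L>) L, no unfolding."""
    scale = inner(l, x) / inner(l, l)
    norm_projection = abs(scale) * math.sqrt(inner(l, l))
    return norm_projection / math.sqrt(inner(x, x))


def unfold(a):
    """Unfold a nested array into a first-order vector."""
    if isinstance(a, (int, float)):
        return [a]
    return [value for part in a for value in unfold(part)]


def fol_cosine(x, l):
    """Eq. (1) applied to the unfolded arrays."""
    xv, lv = unfold(x), unfold(l)
    dot = sum(p * q for p, q in zip(xv, lv))
    return dot / (math.sqrt(sum(p * p for p in xv)) * math.sqrt(sum(q * q for q in lv)))


# Hypothetical 2 x 2 x 2 third-order arrays.
unknown = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
reference = [[[2, 1], [4, 3]], [[6, 5], [8, 7]]]
cos_no = no_cosine(unknown, reference)
cos_fol = fol_cosine(unknown, reference)
print(cos_no, cos_fol)
```

The NO route never builds the unfolded vectors, which is the source of the computational advantage noted for NO; the two returned values agree to within floating-point rounding.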


3. Experimental

3.1. Simulated data

Simulated second-order spectrochromatograms were generated using Gaussian curves based on a Matlab toolbox. Peak width is defined as the distance from peak center to 2% of full height, and each peak has unit height. The spectral order contains 100 wavelength units and each spectral peak has a width of 20 units. The chromatographic order contains 90 time units with peak widths of 10 units. The spectral order contains two peaks while the chromatographic order has only one peak. Spectrochromatograms were for single component samples, not mixtures. To simulate an unknown, a spectrochromatogram is copied from the library and given new noise at the same level as the library. Peak positions are described in Section 4.

Another second-order situation was also simulated. In this case, spectral absorbances over 150 wavelength units were simulated at 10 different pH levels. Further description is provided in Section 4.

3.2. IR data

Real IR spectra (Table 1) were coupled with simulated chromatograms generated using Gaussian curves to form second-order spectrochromatograms. The IR spectra contain 307 wavelengths while the chromatographic order has 110 time units. Specific details of the IR spectra are given in Ref. [19].

3.3. Noise

Normally distributed random noise was added to each data point in a spectrochromatogram. A level of 0.5% homoscedastic and 0.5% heteroscedastic noise was used as the standard. In specific experiments, noise ranged from 0.5 to 3.0% for both homoscedastic and heteroscedastic noise.
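The simulated spectrochromatograms and noise described in this section can be approximated as follows. This is a hedged sketch rather than the Matlab toolbox used in the paper, and several details are assumptions made here: the matrix is formed as the outer product of the spectrum and the chromatogram, positions are numbered from 1, the homoscedastic standard deviation is taken as a percentage of the largest signal, and the heteroscedastic standard deviation as a percentage of each element's own value. The paper does not state these definitions.

```python
import math
import random


def gaussian_peak(n_points, center, width):
    """Unit-height Gaussian; width is the distance from center to 2% height."""
    sigma = width / math.sqrt(2.0 * math.log(50.0))  # exp(-w^2 / (2 sigma^2)) = 0.02
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
            for i in range(1, n_points + 1)]


def spectrochromatogram(spectral_centers, chrom_center,
                        n_wavelengths=100, n_times=90,
                        spectral_width=20, chrom_width=10):
    """Noise-free single-component matrix (rows: wavelengths, columns: time)."""
    spectrum = [0.0] * n_wavelengths
    for center in spectral_centers:
        peak = gaussian_peak(n_wavelengths, center, spectral_width)
        spectrum = [a + b for a, b in zip(spectrum, peak)]
    chromatogram = gaussian_peak(n_times, chrom_center, chrom_width)
    return [[a * b for b in chromatogram] for a in spectrum]


def add_noise(matrix, homo_pct=0.5, hetero_pct=0.5, rng=None):
    """Add normally distributed noise (assumed definitions, see lead-in)."""
    rng = rng or random.Random(0)
    largest = max(max(row) for row in matrix)
    sd_homo = homo_pct / 100.0 * largest
    return [[value
             + rng.gauss(0.0, sd_homo)
             + rng.gauss(0.0, hetero_pct / 100.0 * abs(value))
             for value in row] for row in matrix]


# Chemical B of Table 2: chromatographic peak at 20, spectral peaks at 20 and 80.
clean_b = spectrochromatogram([20, 80], 20)
library_b = add_noise(clean_b, rng=random.Random(1))
unknown_b = add_noise(clean_b, rng=random.Random(2))  # same signal, new noise
```

Using separate seeded generators for the library entry and the unknown mirrors the statement that an unknown is a library copy given new noise at the same level.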

Table 1
IR spectra coupled with simulated chromatographic peaks to form second-order arrays

Chemical                     Chromatographic peak
3,5,5-Trimethyl-1-hexanol    30
Hexane                       31
2,2,5-Trimethylhexane        32
Methylcyclopentane           33
2,4-Dimethylpentane          34
Pentane                      35
2,3-Dimethylbutane           36
2,4,4-Trimethyl-2-pentane    37
3-Methylpentane              38
2-Methylbutane               39

Chromatographic peak values represent peak centers.

3.4. Memory and search time

Memory is reported in number of bytes. Search time is reported as the number of flops, i.e., the count of floating point operations. The algorithms were tested on a 166 MHz personal computer equipped with a Pentium processor and Matlab version 4.2c (The Math Works, Natick, MA).

4. Results and discussion

4.1. Simulated data

In order to investigate the use of Nth-order data for library searching, second-order spectrochromatograms were simulated with varying chromatographic resolutions and spectral similarities. In this way, worst-case scenarios could be studied. The three library search methods were then evaluated to determine their respective selectivities. Additionally, time efficiency and memory requirements were determined.

Conditions for the simulated spectrochromatograms of ten library chemicals are listed in Table 2. Each spectrochromatogram contains two peaks in the spectral order and one peak in the chromatographic order.

Table 2
Simulated spectrochromatograms used for first and second-order library searches

Chemical   Chromatographic peak   Spectral peak 1   Spectral peak 2
A          10                     10                90
B          20                     20                80
C          30                     25                75
D          40                     30                70
E          45                     35                65
F          50                     40                60
G          55                     35                65
H          60                     30                70
I          70                     20                80
J          80                     10                90

Values represent peak centers.

To test the effectiveness of second-order library searching over conventional first-order searches, the spectrochromatogram of chemical B in Table 2 was selected to act as the unknown. The spectrum of chemical B is identical to that of chemical I, except for noise. In conventional library search methods, only the spectral order is normally used. For this study, the spectrum corresponding to the peak maximum in the chromatogram (highest signal-to-noise ratio) was extracted from the spectrochromatogram and used in the first-order spectral library search. Alternatively, the average spectrum across the chromatographic peak could be used. Table 3 lists results from this first-order search. Chemicals B and I are possible matches for the unknown from the spectral-order library search, with $\cos\theta_i$ values equal to 0.9999 and 0.9998, respectively. The next highest value for the spectral-order search is 0.8835, for chemical C. The correct match for the unknown is difficult to determine based only on the spectral-order search, since B and I have such high values and chemical C represents a potential match. As noted in Section 2, using SOL or NO for only one order simplifies to FOL, and hence the $\cos\theta_i$ values are the same for all three methods.
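The contrast between a spectral-order search and a search over both orders can be reproduced in outline with noise-free versions of chemicals B, C and I from Table 2. The sketch below is illustrative only: it assumes each spectrochromatogram is the outer product of its spectrum and chromatogram, omits noise, and uses the pure spectrum in place of the spectrum extracted at the chromatographic peak maximum (the two are proportional in this noise-free rank-one setting). The values it prints will therefore differ somewhat from those in Table 3.

```python
import math


def peak(n_points, center, width):
    """Unit-height Gaussian; width is the distance from center to 2% height."""
    sigma = width / math.sqrt(2.0 * math.log(50.0))
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
            for i in range(1, n_points + 1)]


def cosine(a, b):
    """Eq. (1) for two first-order vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def chemical(chrom_center, spectral_center_1, spectral_center_2):
    """Return (spectrum, unfolded spectrochromatogram) for one chemical."""
    spectrum = [a + b for a, b in zip(peak(100, spectral_center_1, 20),
                                      peak(100, spectral_center_2, 20))]
    chromatogram = peak(90, chrom_center, 10)
    unfolded = [s * c for s in spectrum for c in chromatogram]
    return spectrum, unfolded


# Peak centers taken from Table 2.
library = {"B": chemical(20, 20, 80), "C": chemical(30, 25, 75), "I": chemical(70, 20, 80)}
unknown_spectrum, unknown_unfolded = library["B"]

results = {}
for name, (spectrum, unfolded) in library.items():
    spectral_only = cosine(unknown_spectrum, spectrum)
    both_orders = cosine(unknown_unfolded, unfolded)
    results[name] = (spectral_only, both_orders)
    print(name, round(spectral_only, 4), round(both_orders, 4))
```

Chemical I is indistinguishable from B in the spectral-only column, while the search over both orders drives its value toward zero because the chromatographic peaks do not overlap.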

Simultaneous searches of two orders produce $\cos\theta_i$ values with greater degrees of selectivity. Table 3 shows that only chemical B has $\cos\theta_i$ values great enough to represent a match, with $\cos\theta_i$ equal to 0.99998 for SOL and 0.99928 for FOL and NO. The next largest $\cos\theta_i$ value is for chemical C, with 0.11917 for SOL and 0.11908 for FOL and NO. These values are not large enough to be considered possible matches for the unknown. As noted previously, chemicals B and I contain identical spectral information and are consequently both chosen as matches in the spectral-order library search. The difference in results between the spectral-order and second-order library searches is evident, and the second-order search provides a greater degree of distinguishing ability. The noise filtering capability of SOL is seen in the larger $\cos\theta_i$ value for chemical B. As observed from Table 3, FOL and NO produce the same $\cos\theta_i$ values for the second-order search. As stated in Section 2, this is due to the fact that Eq. (4) for NO is a generalization of Eq. (1) for FOL. Therefore, FOL and NO values will be reported in the same columns in subsequent tables.

Table 3
$\cos\theta_i$ results from a library search using only the spectral-order and then the second-order information for chemicals in Table 2

           Spectral-order search   Spectral and chromatographic search
Chemical   FOL                     FOL           SOL           NO
A          0.61137                 0.082171      0.083223      0.082171
B          0.99993                 0.99928       0.99998       0.99928
C          0.88346                 0.11908       0.11917       0.11908
D          0.60520                 0.00080796    0.000036785   0.00080796
E          0.32080                 0.00053968    0.00013148    0.00053968
F          0.12544                 0.00020200    0.000058422   0.000020200
G          0.32055                 0.00057717    0.00024713    0.00057717
H          0.60564                 0.00035421    0.00017435    0.00035421
I          0.99983                 0.00063731    0.00066152    0.00063731
J          0.61175                 0.00054965    0.00023490    0.00054965

Chemical B acted as the unknown.

In the SOL method, a $\cos\theta_i$ value is computed for each order. In this case, $\cos\theta_{u_i}$ for the spectral order and $\cos\theta_{z_i}$ for the chromatographic order (written $\cos\theta_{v_i}$ in Section 2.2) are computed, both of which are tabulated in Table 4. The composite $\cos\theta_i$ values computed by multiplication of the individual values are listed in Table 3. In the spectral order, chemicals B and I are similar to the unknown, and in the chromatographic order, only chemical B is similar to the unknown. Thus, the composite $\cos\theta_i$ value only lists chemical B as similar to the unknown.

Table 4
$\cos\theta_{u_i}$ and $\cos\theta_{z_i}$ values for Table 3 SOL spectral and chromatographic orders

Chemical   cos θ_u_i   cos θ_z_i
A          0.61239     0.13590
B          0.99999     0.99999
C          0.88258     0.13502
D          0.60785     0.000060517
E          0.32412     0.00040565
F          0.12797     0.00045654
G          0.32407     0.00076259
H          0.60768     0.00028691
I          0.99997     0.00066154
J          0.61271     0.00038338

An advantage of computing $\cos\theta_{u_i}$ and $\cos\theta_{z_i}$ separately with the SOL method is that a selective order search becomes possible. For example, if the spectral order for the library is searched first and only those chemicals with a $\cos\theta_{u_i}$ greater than 0.9 are used to search the chromatographic order, then, as Table 4 shows, only library chemicals B and I are searched in the second step. Search time can decrease significantly with this approach when a large library is present. This is similar to current methods of searching multi-spectral libraries [8,21,22]. For instance, with GC/IR/MS instrumentation, the IR spectral library is searched to form a hit list using the IR spectrum at the GC peak maximum. Next, the MS library is searched to form another hit list. The unknown is identified by looking at these two lists [8].

Table 5
Simulated spectrochromatograms used for a second-order library search

Chemical   Chromatographic peak   Spectral peak 1   Spectral peak 2
K          50                     20                60
L          50                     21                60
M          50                     22                60
N          50                     23                60
O          50                     24                60
P          50                     25                60
Q          50                     26                60
R          50                     27                60
S          50                     28                60
T          50                     29                60

Values represent peak centers.

It is unlikely that two chemicals will produce nearly identical instrument responses in all orders of an Nth-order array. However, the three methods were tested with spectrochromatograms containing similar information in both orders to investigate more challenging situations. Table 5 lists parameters used to simulate ten spectrochromatograms with identical chromatographic peaks and similar spectra. Two spectral peaks are present in the spectral order, with the second peak being constant for all chemicals and the first peak shifting one wavelength unit for each chemical. Chemical O acted as the unknown, and Table 6 reports the results from the library search. As can be seen in Table 6, all methods report chemical O as being the most similar to the unknown. Even though $\cos\theta_i$ values are close to 1, the

methods were able to distinguish between library spectrochromatograms with a small degree of selectivity. The results are not as selective as in the previous example because of the similarities between the unknown and all the chemicals in the library, but the values clearly show the methods' abilities to choose the correct chemical under worst-case conditions. Again, all the SOL values are slightly greater than the FOL and NO values due to the noise filtering abilities of the SVD.

Table 6
$\cos\theta_i$ results from a library search using second-order information for chemicals in Table 5

Chemical   FOL, NO    SOL
K          0.95979    0.96184
L          0.97617    0.97822
M          0.98801    0.99010
N          0.99535    0.99749
O          0.99929    0.99998
P          0.99525    0.99737
Q          0.98791    0.99003
R          0.97574    0.97782
S          0.95942    0.96149
T          0.93939    0.94137

Chemical O acted as the unknown.

To test the methods even further, the level of noise was changed. Table 7 lists values for a library search with noise levels of 3.0% homoscedastic and 3.0% heteroscedastic noise. Even though the $\cos\theta_i$ values are close to 1, chemical O still has the largest values. Notice that while the $\cos\theta_i$ values decreased markedly for FOL and NO, the SOL values decreased much less with the increase in noise.

Table 7
$\cos\theta_i$ results for chemicals in Table 5 with noise increased to 3.0% homoscedastic and 3.0% heteroscedastic

Chemical   FOL, NO    SOL
K          0.89207    0.96151
L          0.90856    0.97783
M          0.91791    0.98863
N          0.92366    0.99586
O          0.97525    0.99935
P          0.92383    0.99503
Q          0.91658    0.98812
R          0.90488    0.97516
S          0.88904    0.95872
T          0.87207    0.93865

Chemical O acted as the unknown.

Depending on the data variation in each order, the SOL method can present difficulties. Consider a second-order array where one order is pH and the other is spectral. Such a second-order array represents multivariate data collected during a titration using a DAD and contains useful information for identifying chemicals. Fig. 1 shows second-order data sets for two chemicals labelled W and Z. Both orders have arbitrary wavelength and pH units.

Fig. 1. Simulated second-order plots for chemicals W (a) and Z (b) showing pH changes in one order and spectral changes in the other order.

Table 8
$\cos\theta_i$ results for chemicals W and Z using second-order arrays in Fig. 1

Chemical   FOL, NO    SOL        cos θ_u_i   cos θ_v_i
W          0.99981    0.99998    0.999997    0.999979
Z          0.65502    0.99976    0.999991    0.999773

Chemical W acted as the unknown.

Results presented in Table 8 show that when chemical W acts as the unknown, the FOL and NO methods distinguish between chemicals W and Z and correctly identify

chemical W as the unknown. However, the SOL approach implies that chemicals W and Z are both similar to the unknown. The SOL method does choose the correct library chemical, but the $\cos\theta_i$ value for chemical Z is extremely close to the $\cos\theta_i$ value for chemical W. This is due to the structure of the eigenvectors used to describe the two orders. In Fig. 2, the respective $u_{L_W}$, $u_{L_Z}$, and $u_X$ eigenvectors corresponding to information in the spectral order are plotted. Plotted in Fig. 3 are the respective $z_{L_W}$, $z_{L_Z}$, and $z_X$ eigenvectors associated with information in the pH order. The $u_{L_i}$ and $u_X$ eigenvectors in Fig. 2 are identical, except for small noise differences. This is because the information on variation in the spectral order shown in Fig. 1 is the same. The $z_{L_i}$ and $z_X$ eigenvectors in Fig. 3 appear to be different, but because the values are so small and do not change significantly, the vectors are seen as nearly identical. Table 8 lists the $\cos\theta_{u_i}$ and $\cos\theta_{z_i}$ values (the latter in the column headed $\cos\theta_{v_i}$) and concurs with these observations. This type of behavior in certain situations suggests that the SOL approach does not always retain acceptable distinguishing abilities. However, preprocessing of the measured data by appropriate methods may alleviate this problem. An evaluation of preprocessing methods such as mean-centering or autoscaling was not attempted in this study.

Fig. 2. Eigenvectors for chemicals W and Z in Fig. 1. (a), (b), and (c) are the first columns from $U_{L_W}$, $U_{L_Z}$, and $U_X$, respectively, and contain information on data variation in the spectral order.

Fig. 3. Eigenvectors for chemicals W and Z in Fig. 1. (a), (b), and (c) are the first columns from $V_{L_W}$, $V_{L_Z}$, and $V_X$, respectively, and contain information on data variation in the pH order.

4.2. IR spectrochromatograms

Real IR spectra were coupled with simulated chromatograms to test the three methods using second-order data. Table 1 lists the IR spectra and the chromatographic peak centers. These IR spectra were chosen because of their similarities, and the chromatograms were simulated to present similarities in the chromatographic order. Thus, the spectrochromatograms are very similar and present a challenge for library searching. 2,4-Dimethylpentane was chosen to act as the unknown because of its similarity in the spectral order to the other chemicals. Table 9 lists the results of the library search. 2,4-Dimethylpentane was correctly identified as the unknown by all three methods, even with the similarities between all the spectrochromatograms.

Table 9
$\cos\theta_i$ results for second-order IR spectrochromatograms of chemicals in Table 1

Chemical                     FOL, NO    SOL
3,5,5-Trimethyl-1-hexanol    0.66147    0.69772
Hexane                       0.75214    0.78889
2,2,5-Trimethylhexane        0.85962    0.91128
Methylcyclopentane           0.89325    0.94498
2,4-Dimethylpentane          0.98140    0.99975
Pentane                      0.89048    0.93483
2,3-Dimethylbutane           0.84527    0.89364
2,4,4-Trimethyl-2-pentane    0.76666    0.80668
3-Methylpentane              0.66700    0.70246
2-Methylbutane               0.56214    0.59193

2,4-Dimethylpentane acted as the unknown.

4.3. Memory

When possible, libraries were stored in a form that reduces computer memory. The FOL and NO libraries store entire instrument response arrays. The SOL method only needs to store the $m \times 1$ $u_{L_i}$ and $s \times 1$ $z_{L_i}$ vectors. Since each element in an array requires 8 bytes of memory, the amount of memory required for storing 1000 second-order arrays of size 150 × 150 is 1.8 × 10^8 bytes for FOL and NO and 2.4 × 10^6 bytes for SOL. The SOL method is therefore superior in memory allocation, since it requires substantially less. As expected, the FOL and NO libraries require the same amount of memory, since the FOL library is the NO library unfolded.

4.4. Search time

An important consideration in comparing library search methods is the time efficiency of the algorithm. To reduce the search time of FOL, Nth-order library arrays are stored in computer memory as unfolded arrays. For SOL, SVDs are computed on second-order arrays, unfolded from Nth-order, and the pertinent eigenvectors are stored in memory for the SOL library. When the Nth-order array for an unknown is to be identified, it has to be unfolded into a vector for a FOL library search, adding extra time. The time delay will vary depending on how large the Nth-order array is. Similarly, the Nth-order array for an unknown must be unfolded to a second-order array for the SOL approach. Additionally, the SOL approach necessitates an SVD of the unfolded array for the unknown, requiring additional time.
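The memory figures quoted in Section 4.3 follow from simple element counts. As a quick check (variable names are chosen here for illustration), a full 150 × 150 array holds $m \times s$ elements per library entry, whereas the rank-one SOL library holds only $m + s$:

```python
BYTES_PER_ELEMENT = 8          # stated in the text
n_library, m, s = 1000, 150, 150

# FOL and NO store every element of every library array.
fol_no_bytes = n_library * m * s * BYTES_PER_ELEMENT

# SOL stores one m x 1 vector and one s x 1 vector per library entry.
sol_bytes = n_library * (m + s) * BYTES_PER_ELEMENT

print(fol_no_bytes, sol_bytes, fol_no_bytes / sol_bytes)
# prints: 180000000 2400000 75.0
```

For these dimensions the SOL library is 75 times smaller, and the ratio $ms/(m+s)$ grows as the arrays get larger.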


To generate libraries of various sizes, chemical E from Table 2 was selected arbitrarily and duplicated until the desired number of chemicals was present. Table 10 lists search times in flops for the different methods when using chemical E as the unknown. In a library search of 10 chemicals with second-order arrays of size $90 \times 100$, FOL requires $2.52 \times 10^{5}$ more flops than the NO method. This difference increases to $2.70 \times 10^{7}$ additional flops compared to NO in a search of 1000 chemicals. Even though FOL and NO generate the same $\cos\theta_i$ values, as the size of a library increases, NO becomes faster relative to FOL. Note that unfolding requires no flops and therefore does not appear in these counts; the time it takes makes NO faster still relative to FOL. It should also be noted that the number of flops required for an algorithm to perform a computation depends on how that algorithm is structured. In this study, all algorithms were written to minimize flops.

For the SOL method, the largest contributor to search time is the SVD of the unknown. When 100 or fewer chemicals are present, the SOL method requires more search time than the other two methods. This is due to the SVD of the unknown, which needs $1.4021 \times 10^{7}$ flops. For ten chemicals in a library, the actual chemical search uses only 3810 flops for SOL, and including the SVD of the unknown brings the total search time to $1.4025 \times 10^{7}$ flops. However, when 1000 chemicals are present in a library, the additional chemical searches increase the search time only marginally, and even with the SVD, SOL requires substantially less time than the other two methods. Another contributor to search time for the SOL approach is the unfolding of the unknown if the original order is greater than two. Despite this time delay for unfolding, SOL uses smaller arrays to compute $\cos\theta_i$ and is ultimately quicker than FOL and NO when a large library is present.

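Table 10. Computational search times for the three methods with different second-order library sizes, where each array is of size $90 \times 100$

| Method | 10 Chemicals | 100 Chemicals | 1000 Chemicals |
|---|---|---|---|
| FOL | $9.00 \times 10^{5}$ | $9.00 \times 10^{6}$ | $9.00 \times 10^{7}$ |
| SOL | $1.40 \times 10^{7}$ | $1.41 \times 10^{7}$ | $1.44 \times 10^{7}$ |
| NO | $6.48 \times 10^{5}$ | $6.32 \times 10^{6}$ | $6.30 \times 10^{7}$ |

Time is reported in flops.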
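The flop counts reported for SOL suggest a simple cost model: a fixed cost for the SVD of the unknown plus a small cost per library entry. The Python sketch below uses only numbers quoted in the text and in Table 10 to reproduce the SOL row and to estimate the library size at which SOL overtakes the other methods. The per-chemical costs for FOL and NO are taken from the 1000-chemical column, and strict linear scaling is an assumption of this sketch, so the break-even sizes are rough estimates rather than reported results.

```python
import math

SVD_UNKNOWN = 1.4021e7         # flops for the SVD of the 90 x 100 unknown
SOL_PER_CHEMICAL = 3810 / 10   # 3810 flops reported for a 10-chemical search
FOL_PER_CHEMICAL = 9.00e7 / 1000   # from the 1000-chemical column of Table 10
NO_PER_CHEMICAL = 6.30e7 / 1000    # from the 1000-chemical column of Table 10


def sol_total(n_chemicals):
    """Total SOL flops: one SVD of the unknown plus the per-entry searches."""
    return SVD_UNKNOWN + SOL_PER_CHEMICAL * n_chemicals


def break_even(per_chemical):
    """Smallest library size at which SOL needs fewer flops than a method
    whose cost is assumed to grow linearly at `per_chemical` flops per entry."""
    return math.floor(SVD_UNKNOWN / (per_chemical - SOL_PER_CHEMICAL)) + 1


for n in (10, 100, 1000):
    print(n, sol_total(n))

print("SOL beats NO from about", break_even(NO_PER_CHEMICAL), "chemicals")
print("SOL beats FOL from about", break_even(FOL_PER_CHEMICAL), "chemicals")
```

Under this linear model, SOL becomes cheaper than FOL at roughly 157 chemicals and cheaper than NO at roughly 224 chemicals. This is consistent with SOL being slower at 100 chemicals and faster at 1000 chemicals in Table 10.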

To further decrease the SOL search time, a selective library search can be implemented. As noted earlier in this paper, search time could be reduced by searching one order first, forming a hit list, and then searching the second order for only those chemicals in the hit list. However, search times depend on the size of the hit list obtained from the order searched first, and a detailed investigation of this approach was not performed.
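A selective search of this kind might be organized as in the following Python sketch. It assumes each library entry is stored as a pair of unit vectors, one per order, and that the combined score is the product of the two per-order cosines. The function name `selective_search`, the `threshold` parameter, and the toy two-element vectors are all hypothetical and serve only to show the two-stage structure.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def selective_search(unknown, library, threshold=0.9):
    """Two-stage search over a dict mapping name -> (u, z) unit vectors.

    Stage 1 screens every entry on the first order only and keeps those
    whose cosine reaches `threshold` (the hit list). Stage 2 evaluates
    the second order for the hit list alone.
    """
    ux, zx = unknown
    hit_list = {}
    for name, (u, _) in library.items():
        c1 = dot(ux, u)
        if c1 >= threshold:
            hit_list[name] = c1
    scores = {name: c1 * dot(zx, library[name][1])
              for name, c1 in hit_list.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: unit vectors of length two, for illustration only.
library = {
    "A": ((1.0, 0.0), (1.0, 0.0)),
    "B": ((0.6, 0.8), (0.0, 1.0)),
    "C": ((0.0, 1.0), (1.0, 0.0)),
}
unknown = ((1.0, 0.0), (1.0, 0.0))
print(selective_search(unknown, library, threshold=0.5))
```

In this toy case entry C is discarded after the first stage, so the second-order comparison runs for only two of the three entries. The saving in a real search would depend on how many entries survive the first stage.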

5. Conclusions

While this paper only investigated FOL, SOL, and NO methodologies for second-order data, the conclusions can be extended to Nth-order data. The FOL method with unfolded data and the NO method yield identical $\cos\theta_i$ values. Because the NO method requires fewer flops than FOL, and FOL needs additional time to unfold higher-order data, the NO approach is more appealing. For large library sizes, the SOL method was superior in search time and memory. The SOL method also provides the advantage of using separate order searches. However, the SOL method is not always as selective as FOL and NO, as demonstrated in one case. Appropriate preprocessing of the data sets may alleviate this, i.e., improve the distinguishing ability of SOL. Regardless of the approach, to produce correct results it is important that the conditions under which the library arrays $L_i$ were measured also hold when measuring $X$.

Besides the second-order data sets investigated here, other second-order arrays could be library searched. For example, two-dimensional (2D) fluorescence data and correlation methods such as 2D IR or 2D Raman correlation spectroscopy could prove useful in distinguishing chemicals that appear similar in certain information orders, but not others. Another approach to generating multi-order data is through sample perturbation. For example, to enhance the information content of an IR spectrum, the spectrum can be perturbed by changes in temperature [16]. With a collection of such enhanced spectra for a chemical, criteria as described here can be used in library searches. Of course, some Nth-order situations would probably only be useful for identification from a special, confined library.


Acknowledgements

The work described was supported by the Camille and Henry Dreyfus Scholar/Fellow Program for Undergraduate Institutions, NSF-Idaho EPSCoR Program, and National Science Foundation grant number OSR-935039.

References

[1] G.W. Small, Anal. Chem. 59 (1988) 535A.
[2] A.R. Gross, M.J. Adams, Anal. Proc. 3 (1994) 22.
[3] S.E. Stein, J. Am. Soc. Mass. Spectrom. 5 (1994) 316.
[4] B. Dathe, M. Otto, Chromatographia 37 (1993) 31.
[5] S. Lin, X. Lu, W. Zheng, J. Zhang, Chemom. Intell. Lab. Sys. 20 (1993) 85.
[6] S. Lo, C.W. Brown, Appl. Spec. 45 (1991) 1621.
[7] S.E. Stein, D.R. Scott, J. Am. Soc. Mass. Spectrom. 5 (1994) 859.
[8] H. Laber, J. Schultz, W.R. Sponholz, W. Bremser, Fresenius J. Anal. Chem. 351 (1995) 530.
[9] A. Rozenblum, P. Brunerie, Dev. Food. Sci. 35 (1994) 133.
[10] J. Schubert, J. Chrom., A 674 (1994) 63.
[11] M. Fuller, R. Rosenthal, Proc. SPIE-Int. Soc. Opt. Eng. 2089 (1993) 440.
[12] E. Sánchez, B.R. Kowalski, J. Chemom. 2 (1988) 247.
[13] M. Otto, W. Wegscheider, E.P. Lankmayr, Anal. Chim. Acta 171 (1985) 13.
[14] C. Pav, I.M. Warner, TRACS 7 (1988) 68.
[15] A.L. Allanic, J.Y. Jézéquel, J.C. André, Anal. Chem. 64 (1992) 2618.
[16] C. Marcott, I. Noda, A.E. Dowrey, Anal. Chim. Acta 250 (1991) 131.
[17] W.J. Krzanowski, J. Am. Stat. Assoc. 74 (1979) 713.
[18] A. Carlosena, J.M. Andrade, M. Kubista, D. Prada, Anal. Chem. 67 (1995) 2373.
[19] N.J. Messick, J.H. Kalivas, P.M. Lang, Anal. Chem. 68 (1996) 1572.
[20] Y. Xie, J.H. Kalivas, Anal. Lett. 30 (1997) 395.
[21] C.L. Wilkins, Anal. Chem. 66 (1994) 295A.
[22] J. Zupan, M. Penca, D. Hadži, J. Marsel, Anal. Chem. 49 (1972) 2141.
