Analytica Chimica Acta, 282 (1993) 407-415
Elsevier Science Publishers B.V., Amsterdam

Neural networks for interpretation of infrared spectra using extremely reduced spectral data

M. Meyer, K. Meyer and H. Hobert
Institut für Physikalische Chemie, Friedrich-Schiller-Universität Jena, Lessingstrasse 10, O-6900 Jena (Germany)
(Received 21st December 1992)

Abstract

Artificial neural networks are a powerful tool for classifying infrared (IR) spectra into structural groups. The network architecture used for this work was a back-propagation model with one hidden layer. To minimize the size of the network, the number of input elements was reduced drastically: instead of the whole infrared spectrum (512 measuring points), the score values calculated by principal component analysis (PCA) of the spectra of the training set (700) and the prediction set (348) were used as input elements for the network. The number of principal components, or scores per spectrum used as network input, was varied (4, 6, 8 and 10) to find the minimal spectral information needed for a sufficient prediction ability of the network. Thirteen IR-sensitive structural patterns were defined as output elements of the network. Results obtained for the training and prediction steps are discussed. The prediction performance of the neural network operating with such extremely reduced spectral data is surprisingly good.

Keywords: Infrared spectrometry; Principal component analysis; Neural networks

The application of artificial neural networks for the interpretation of spectral data has increased in the last few years and is an established alternative to other approaches such as library search, pattern recognition and knowledge-based expert systems. Different types of spectra have been used for spectral interpretation by neural networks (infrared, mass, NMR, UV spectra). The review article by Zupan and Gasteiger [1] gives a good summary of these investigations, which can be completed with some more recent papers [2-4]. In most cases the spectral information used as network input was the directly digitized spectrum, so that the number of input elements can vary in the range between 100

and 4000, approximately. We have also used 128 or 512 measuring points of infrared spectra in our recent studies [5]. This kind of representation of spectral data implies some disadvantages with regard to neural network application. With a high number of elements in the input layer, and the correspondingly high number of hidden layer elements desirable for a better recognition ability of the network, the total number of network nodes increases rapidly. The result is a significant increase in the computing time required for the training sessions. Furthermore, in most infrared spectra (but not only in these) the spectral information is not distributed evenly over the whole spectral region but is concentrated in distinct ranges of the spectrum corresponding to the structural patterns. It is therefore favorable to reduce the spectral data by PCA in order to extract the useful information from the original spectrum.



The spectral interpretation is thus performed in two steps: reduction of the data by PCA and classification into structural classes by the neural network. The performance of the network in the prediction of functional groups has been investigated using different numbers of principal components.

EXPERIMENTAL

For the spectral interpretation, two self-made computer programs written in Turbo Pascal 6.0, one for PCA and one for simulation of the neural network, were used. Spectra with 512 measuring points from the IR database of our library search system SUSY [6] were used for the training and prediction sets. The training set consists of 700 spectra, whereas 348 spectra were chosen for the prediction set. All spectra were normalized to a vector length of one.

PCA
The dimensions of the data matrix D used for the PCA calculations are defined by the number of spectral measuring points (512) and the size of the training set (700). PCA was performed using the NIPALS algorithm, which gives, with some modifications, the same results as an eigenanalysis of the 512 by 512 covariance matrix Z computed from the 512 points by 700 spectra data matrix D:

Z = D · D^T

The eigenanalysis of the covariance matrix Z results in n eigenvalues in an n by n diagonal matrix Λ, n eigenspectra in the n by 512 matrix R, and 700 score vectors in an n by 700 score matrix Q:

R · Z · R^T = Λ
R · D = Q

It is favorable, however, to compute the eigenvalues, eigenspectra and scores directly from the spectra of the training set using the NIPALS procedure, because the manipulation of such a large 512 by 512 matrix on a PC with 640 kB memory is very difficult and requires much computing time. The NIPALS procedure makes it possible to calculate the scores and eigenspectra more quickly and to abort the computations once the desired number n of principal components is reached. The details of the NIPALS algorithm are described, e.g., by Martens and Naes [7].

After PCA, the information at the 512 measuring points of the training set spectra in the data matrix D is reduced to a maximum of 10 score values per spectrum in the matrix Q. These at most 10 principal components should reflect the most significant features of the whole training set. The data matrix D can be considered as the sum of the linear combination of the eigenspectra for the n principal components weighted by the score values, and the residual matrix D_s, which contains, besides noise and errors, also the special features of the individual spectra:

D = R^T · Q + D_s

With an increasing number of principal components the score vectors describe more of the special features of the individual spectra, and the elements of the residual matrix D_s decrease. All spectra of the training set can be visualized as points in an n-dimensional hyperspace; although the spectra were normalized to unit vector length before PCA, the spectral points do not lie on the surface of a unit hypersphere as long as the residual matrix D_s contains more than noise. Nevertheless, the positions of the spectral points in the n-dimensional hyperspace, i.e., the score vectors, can be used for a separation of the spectra into groups.

For the classification of a set of 348 unknown spectra D_u, first the pseudo-score matrix Q_u must be computed using the eigenspectra of the training set and the 348 unknown spectra of the prediction set:

R · D_u = Q_u

Once the positions of the unknown spectra in the n-dimensional hyperspace of the training set are determined, it is possible to classify them into the defined groups.
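To make the two-step data reduction concrete, the following is a minimal NumPy sketch of the NIPALS extraction and of the projection of unknown spectra onto the training-set eigenspectra. It is our own illustration, not the original Turbo Pascal code: the row-per-spectrum layout (the transpose of the D written above), the starting vector, the tolerance and the random stand-in data are all assumptions.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """NIPALS PCA on X (rows = spectra, columns = measuring points; the
    transpose of the D written above).  Returns the score matrix Q
    (n_spectra x n_components) and the eigenspectra R (n_components x
    n_points) without ever forming the 512 x 512 covariance matrix.
    No mean-centring is applied, so the first eigenspectrum comes out
    close to a mean spectrum, as noted in the discussion of Fig. 1."""
    X = X.astype(float).copy()
    n_spectra, n_points = X.shape
    Q = np.zeros((n_spectra, n_components))
    R = np.zeros((n_components, n_points))
    for k in range(n_components):
        t = X[:, np.argmax(X.var(axis=0))].copy()   # starting score vector
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)                   # provisional eigenspectrum
            p /= np.linalg.norm(p)
            t_new = X @ p                           # updated score vector
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        Q[:, k], R[k] = t, p
        X -= np.outer(t, p)                         # deflate before next component
    return Q, R

# Spectra are normalized to unit vector length (random stand-ins for the
# 700 training and 348 prediction spectra with 512 measuring points).
rng = np.random.default_rng(0)
D = rng.random((700, 512))
D /= np.linalg.norm(D, axis=1, keepdims=True)
Q, R = nipals_pca(D, n_components=10)

# Pseudo-scores of the prediction set from the training-set eigenspectra,
# the analogue of R . D_u = Q_u above.
D_u = rng.random((348, 512))
D_u /= np.linalg.norm(D_u, axis=1, keepdims=True)
Q_u = D_u @ R.T
```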

Classification
Common procedures for separation and classification [8] need large differences between the features of objects of different classes, deviations within the classes that are as small as possible, and a well-defined training set.


In the general case of spectral interpretation of organic compounds, the varying C-H content and the various functional groups of the compounds prevent the separation into distinct classes. Even if the training set consists only of spectra of monofunctional compounds and allows a separation of the spectra into classes with regard to their functional groups, it is impossible to determine all functionalities of compounds with more than one functional group in one step using the common classification procedures. In contrast to these approaches, a neural network makes it possible to detect all functionalities of a compound simultaneously. The present investigations should demonstrate that the advantages of data reduction by PCA can also be exploited for spectral interpretation by a neural network. The network has a three-layered architecture with an input layer, one hidden layer and an output layer. The input layer consists of a varying number of NIPALS scores (4-10) for each spectrum of the training set and the prediction set. The size of the hidden layer was fixed at 30 network elements. The 13 output units of the network correspond to 13 predefined IR-sensitive structural patterns (Table 1).
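A minimal sketch of such a three-layer back-propagation network follows, written in NumPy for illustration. The sigmoid activations, learning rate, weight initialization, stopping tolerance and random stand-in data are assumptions; the paper specifies only the layer sizes and the stop-when-convergence-slows criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ThreeLayerNet:
    """n_in score inputs -> n_hidden hidden units -> n_out substructure outputs."""

    def __init__(self, n_in, n_hidden=30, n_out=13):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        self.h = sigmoid(X @ self.W1 + self.b1)      # hidden layer
        return sigmoid(self.h @ self.W2 + self.b2)   # output layer

    def train_epoch(self, X, T, lr=0.1):
        """One batch gradient step on the squared output error; returns the
        summed absolute deviation from the targets (cf. Table 2)."""
        Y = self.forward(X)
        d_out = (Y - T) * Y * (1.0 - Y)              # output-layer deltas
        d_hid = (d_out @ self.W2.T) * self.h * (1.0 - self.h)
        self.W2 -= lr * self.h.T @ d_out
        self.b2 -= lr * d_out.sum(axis=0)
        self.W1 -= lr * X.T @ d_hid
        self.b1 -= lr * d_hid.sum(axis=0)
        return np.abs(Y - T).sum()

# Stop training when the error decrease becomes slow, as described later.
net = ThreeLayerNet(n_in=10)
X = rng.normal(size=(700, 10))                    # stand-in for the scores Q
T = rng.integers(0, 2, (700, 13)).astype(float)   # stand-in target patterns
prev_err = np.inf
for epoch in range(5000):
    err = net.train_epoch(X, T)
    if prev_err - err < 1e-4:                     # convergence became slow
        break
    prev_err = err
```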

TABLE 1

IR-sensitive structural patterns as output elements of the network

No.  Structural pattern      Frequency in    Frequency in
                             training set    prediction set
 1   Alcohol                     153              75
 2   Amine                        85              32
 3   Ketone                       60              44
 4   Ester                        61              50
 5   Aldehyde                     25              14
 6   Carboxylic acid              96              18
 7   Amide                        39              39
 8   C-Halogen group             132              83
 9   Nitro group                  69              13
10   Sulfo group                  25               9
11   Ether                        74              58
12   Aromatic C-H group          410             161
13   Aliphatic C-H group         482             305
     Total                      1711             901

RESULTS AND DISCUSSION

For the present work the NIPALS calculations were restricted to 10 eigenspectra and the 10 corresponding scores. The set of eigenspectra obtained is shown in Fig. 1. The first eigenspectrum is a positive linear combination of the training set spectra and can be interpreted as a mean spectrum of all of them. In all further eigenspectra both positive and negative combinations are possible, as can be seen from their appearance. The function of each consecutive eigenspectrum is to separate as many as possible of the spectral patterns present in the data set from each other. Spectral features of functional groups with high intensities that do not overlap strongly with other characteristic features will separate much better than patterns of functionalities with weak absorption bands or with spectral patterns that overlap with others. For instance, it is obvious that alcohols and compounds containing a carboxylic group can be discriminated very well (mainly by the second eigenspectrum), whereas the separation of amines or compounds with a sulfo group from all others is relatively poor.

Training step
The training of the neural network was performed using 4, 6, 8 or 10 scores as the representation of each spectrum of the training set. When the convergence of the minimization of the output error became slow, the training session was stopped. The output error was defined as the deviation of the current network output from the given target values of the output layer. The values achieved after the training step are given in Table 2. To judge the success of network learning it is useful to look at the recognition level for the defined structural groups. In Table 3 the results of network training for all numbers of scores used as input are summarized. All functionalities found correctly are called "true"; functional groups which are present in a substance but not found by the network are called "false -"; finally, substructures found by the network which are not present in a substance are named "false +".
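The bookkeeping behind Tables 3 and 4 can be sketched as follows, under the assumption that "true" and "false -" are expressed as percentages of the spectra actually containing a group, and "false +" as a percentage of the spectra lacking it; the paper does not spell out the normalization, but with this choice "true" and "false -" sum to 100% for each substructure, as in the tables.

```python
import numpy as np

def classification_rates(outputs, targets, threshold=0.5):
    """Per-substructure "true", "false -" and "false +" percentages for
    network outputs and binary targets of shape (n_spectra, 13).
    Assumes every substructure occurs in, and is absent from, at least
    one spectrum of the set."""
    found = outputs > threshold        # functionality claimed by the network
    present = targets > 0.5            # functionality actually present
    n_present = present.sum(axis=0)    # group frequencies as in Table 1
    n_absent = (~present).sum(axis=0)
    true_pct = 100.0 * (found & present).sum(axis=0) / n_present
    false_minus_pct = 100.0 * (~found & present).sum(axis=0) / n_present
    false_plus_pct = 100.0 * (found & ~present).sum(axis=0) / n_absent
    return true_pct, false_minus_pct, false_plus_pct
```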


Fig. 1. Eigenspectra 1-10 calculated by NIPALS (abscissa: 3900.0-400.0 cm⁻¹).

TABLE 2

Output errors of the trained network

Number of scores            4        6        8       10
Output error (absolute)   218.75   139.15   94.50    69.68
Output error (%)            2.4      1.5      1.03     0.76

It can be seen that for some functional groups the recognition ability increases more or less continuously with the number of input elements of the network (e.g., alcohols, amines and carboxylic acids), whereas for some other functionalities the performance decreases or stagnates (e.g., esters, aldehydes and amides). The reason for this different behavior of the functional groups is the distribution of spectral information over the eigenspectra obtained by PCA. If the total information of a spectral feature (corresponding to a certain functional group) is distributed fairly evenly over all eigenspectra, the performance of network learning will improve with an increasing number of input elements. In the opposite case, if the spectral information related to a substructure is focussed by PCA into one or only a few eigenspectra, the recognition ability of the network may alternately become better or worse with an increasing number of input elements. The particular behavior of some of these functional groups is discussed below.

In the case of the ketone functionality, using 4 input elements only 28% were recognized by the network. This indicates that only a small part of the characteristic spectral information is contained in the first four eigenspectra. Increasing the number of input elements to 6 causes a strong improvement of "true" recognized ketones, up to 70%, which means that the main part of the spectral information characteristic for ketones is concentrated in the fifth and sixth eigenspectra. Switching to 8 input elements, the performance increases slightly to 85% and remains constant for 10 input elements.

Another interesting example is the ester functionality. The performance is already high with 4 input elements (93% "true") and becomes better with 6 (95% "true") and 8 (100% "true") input elements.

But the expansion of the input to 10 elements leads to a decrease in recognition performance, down to 89%. How can this behavior be interpreted? The main part of the specific spectral information for the ester functionality is located in the first four eigenspectra, but the next eigenspectra (5-8) also contain some useful information. The use of additional scores (corresponding to eigenspectra 9 and 10) as network input does not improve but rather reduces the recognition ability, because this information is irrelevant for the ester functionality or in conflict with the assignment to other substructures containing a carbonyl group (e.g., ketones, aldehydes, carboxylic acids). In other words, the specific information attributed to the ester functionality becomes fuzzier as the number of scores is expanded to ten.

A third type of dependence of functional group recognition on the number of scores can be observed for the sulfo group. The performance improves with an increasing number of scores, but even with 10 score values the recognition ability is relatively poor (68%). This indicates that the number of principal components used for the network training is too low (some specific information is contained in further principal components) or that the spectral features of this functional group are not clearly correlated with a distinct substructure.

Prediction step

After the training step, an attempt was made to classify a set of unknown spectra (represented by their NIPALS scores as described above) with the neural network. The results of the prediction of functional groups, again using 4, 6, 8 and 10 score values as network input, are shown in Table 4. It can be seen that, of course, the prediction ability is closely correlated with the training performance for the different functional groups. Functionalities which were recognized very well in the training step can also be predicted very well, whereas the prediction performance for some other functional groups with poor training results is relatively poor. On average (over all substructures used and all numbers of scores) the level of "true" prediction is lower than the level of "true" recognition in the training step.

TABLE 3

Network performance using the training set (recognition ability); for each number of scores used as network input, "true", "false -" (F-) and "false +" (F+) are given in %

No.  Structural pattern     4 scores          6 scores          8 scores          10 scores
                           True  F-   F+     True  F-   F+     True  F-   F+     True  F-   F+
 1   Alcohol                86   14    2      90   10    2      92    8    1      96    4    0
 2   Amine                  62   38    1      84   16    1      93    7    0      95    5    0
 3   Ketone                 28   72    0      70   30    0      85   15    0      85   15    0
 4   Ester                  93    7    0      95    5    0     100    0    0      89   11    0
 5   Aldehyde               52   48    0      80   20    0      80   20    0      56   44    0
 6   Carboxylic acid        74   26    1      97    3    0      99    1    0     100    0    0
 7   Amide                  69   31    1      82   18    0      74   26    0      90   10    0
 8   C-Halogen group        57   43    5      68   32    8      66   34    1      80   20    1
 9   Nitro group            48   52    1      77   23    0      93    7    0      97    3    0
10   Sulfo group            44   56    0      60   40    0      60   40    0      68   32    0
11   Ether                  75   25    1      77   23    0      82   18    0      85   15    0
12   Aromatic C-H group     94    6   19      98    2    5      97    3    4     100    0    4
13   Aliphatic C-H group    92    8   26      96    4    9      97    3    5      99    1    3
     Total                  79.6 20.4  2.5    89.6 10.4  1.5    91.8  8.2  0.6    94.4  5.6  0.4

TABLE 4

Network performance using the prediction set (prediction ability); for each number of scores used as network input, "true", "false -" (F-) and "false +" (F+) are given in %

No.  Structural pattern     4 scores          6 scores          8 scores          10 scores
                           True  F-   F+     True  F-   F+     True  F-   F+     True  F-   F+
 1   Alcohol                80   20    7      79   21    3      81   19    6      83   17    3
 2   Amine                  25   75    2      56   44    4      44   56    7      69   31    4
 3   Ketone                 21   73    1      48   52    4      64   36    3      59   41    3
 4   Ester                  92    8    2      90   10    2      86   14    2      74   26    1
 5   Aldehyde               36   64    1      21   79    3      36   64    1      21   79    1
 6   Carboxylic acid        72   28    3     100    0    3     100    0    3      94    6    3
 7   Amide                  31   69    1      41   59    1      21   79    0      64   36    1
 8   C-Halogen group        45   55    9      53   47   16      46   54    5      57   43    9
 9   Nitro group            31   69    1      69   31    1      69   31    1      85   15    1
10   Sulfo group             0  100    0      11   89    1      56   44    2       0  100    3
11   Ether                  45   55    4      57   43    1      55   45    2      60   40    2
12   Aromatic C-H group     88   12   25      81   19   23      80   20   21      91    9   20
13   Aliphatic C-H group    93    7   47      89   11   41      94    6   35      92    8   30
     Total                  71.6 28.4  4.2    72.0 28.0  4.9    75.0 25.0  4.2    78.9 21.1  4.0


In some cases, however, the prediction behavior as a function of the number of scores is unexpected or obscure. For example, for the aldehydes the "true" value for training with 4 scores is 52% and the corresponding prediction value is 36%. Switching to 6 scores, the values are 80% and 21%, respectively: the recognition ability increases but the prediction performance decreases. There are several possible reasons for such behavior. One reason can be a different distribution of the types of aldehydes (aromatic, aliphatic or heterocyclic) in the training set and in the prediction set. Another possible explanation is the varying certainty of the recognition or prediction of a functionality for different numbers of scores. Because the threshold of a "true" assignment is simply set to 0.5, without further differentiation of the quality of assignment for all values exceeding this threshold, the distribution of these "true" values can affect the performance of the network differently for different numbers of scores. This is illustrated by the histograms of the output values of the network for the training and prediction steps (Fig. 2). It can be seen that in the case of 4 scores the general shape of the histograms for the training step and the prediction step is more or less similar, apart from a slight shift to lower output values for the prediction. If 6 scores are used as network input, the percentage of output values > 0.5 becomes relatively high (80%) for the training set, but the distribution of these values is smeared very broadly between 0.5 and 1.0. This feature of the training set histogram may be the reason for the poor prediction performance (21%), which is in accordance with the histogram shown for the prediction set. Whereas the percentage of "false +" assignments in the training step is very low (with the exception of the aromatic and aliphatic C-H functionalities with 4 scores as network input), in the prediction step the proportion of additional functional groups found which are not present in the substance is higher; it is apparent that this effect is more marked for functionalities which are more general and less specialized, e.g., the aromatic or aliphatic C-H or the C-halogen groups.
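For illustration, a minimal sketch of how such output-value histograms can be generated for one output unit; the bin count, the text rendering and the random stand-in data are arbitrary choices not taken from the paper.

```python
import numpy as np

def output_histogram(outputs, n_bins=10):
    """Text histogram of the network output values of one output unit
    (e.g., the aldehyde unit) for a chosen set of spectra.  The fraction
    above the 0.5 threshold is what enters the "true" count."""
    counts, edges = np.histogram(outputs, bins=n_bins, range=(0.0, 1.0))
    for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
        print(f"{lo:4.2f}-{hi:4.2f} | {'#' * int(c)}")
    print(f"{100.0 * (outputs > 0.5).mean():.0f}% of values above 0.5")

# Example: stand-in outputs of the aldehyde unit for spectra that
# actually contain an aldehyde group.
rng = np.random.default_rng(0)
output_histogram(rng.beta(2.0, 1.0, size=25))
```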

Fig. 2. Histograms of network output values for the aldehyde functionality: training set (4 scores), prediction set (4 scores), training set (6 scores), prediction set (6 scores).


Another typical effect which is often found in neural network studies is the phenomenon of overtraining. Such behavior can be observed in our investigation especially for the ester and the carboxylic acid functionalities. In both cases the network can be trained so that it recognizes all of the substructures present in the training set ("true" = 100%) if the available spectroscopic information (the number of eigenspectra or scores used) is sufficient. For esters 8 scores, and for carboxylic acids 10 scores, are needed for this behavior; but using the same number of scores in the prediction step, the performance for these learned functional groups was worse than when the training was performed with a lower number of scores, where the recognition rate lies in a region between 95 and 100%. An explanation for this is that if too much specialized information for a distinct functional group is included in the data of the training set, the network is not able to tolerate possible variations in the spectral data of the prediction set. A small loss of information in the training set therefore makes the network more flexible and can enhance the performance in the prediction step.

Conclusions

The goal of the studies presented in this paper was to find the minimal amount of spectral information needed to classify infrared spectra with an artificial neural network. The use of PCA for the reduction of spectral data is very favorable for a selective extraction of the specific information correlated with the functional groups of the compounds.


It has been shown that in most cases a number between 4 and 10 scores obtained by the NIPALS procedure is sufficient for a neural network to classify structurally unknown substances into the 13 defined substructures. Considering that most of the compounds of the training and prediction sets are multifunctional substances (on average 2.5 functional groups per compound), the performance of the network operating with such extremely reduced spectral data is surprisingly good. It has been shown that the combined application of PCA and neural networks is a powerful tool for quick structural classification.

REFERENCES

1 J. Zupan and J. Gasteiger, Anal. Chim. Acta, 248 (1991) 1.
2 K. Tanabe, T. Tamura and H. Uesaka, Appl. Spectrosc., 46 (1992) 807.
3 L.S. Anker and P.C. Jurs, Anal. Chem., 64 (1992) 1157.
4 C. Borggaard and H.H. Thodberg, Anal. Chem., 64 (1992) 545.
5 M. Meyer and T. Weigelt, Anal. Chim. Acta, 265 (1992) 183.
6 M. Meyer, I. Weber, R. Sieler and H. Hobert, Jena Rev., 35 (1990) 16.
7 H. Martens and T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.
8 K. Varmuza, Pattern Recognition in Chemistry, Springer, New York, 1980.