Multivariate strategies for classification based on NIR-spectra—with application to mayonnaise

Multivariate strategies for classification based on NIR-spectra—with application to mayonnaise

Chemometrics and Intelligent Laboratory Systems 49 Ž1999. 19–31 www.elsevier.comrlocaterchemometrics Multivariate strategies for classification based...

246KB Sizes 0 Downloads 53 Views

Chemometrics and Intelligent Laboratory Systems 49 Ž1999. 19–31 www.elsevier.comrlocaterchemometrics

Multivariate strategies for classification based on NIR-spectra—with application to mayonnaise Ulf G. Indahl ) , Narinder S. Sahni 1, Bente Kirkhus 2 , Tormod Næs ˚ Norway MATFORSK, OsloÕeien 1, N-1430 As, Received 26 January 1998; accepted 28 March 1999

Abstract The goal of the presented study is two-fold. First, we want to emphasize the power of Near Infrared Reflectance ŽNIR. spectroscopy for discrimination between mayonnaise samples containing different vegetable oils. Secondly, we want to use our data to compare the performances of different classification procedures. The NIR spectra with 351 variables correspond to equally spaced wavelengths in the 1100–2500 nm area. Feature extraction both by automatic wavelength-selection and by projection onto principal components ŽPCs. is discussed. The discriminant methods considered are linear discriminant analysis ŽLDA., quadratic discriminant analysis ŽQDA. and regression with categorical 0,14-responses. A dataset containing 162 spectra of mayonnaise samples based on six different vegetable oils is analyzed. By LDA with authentic cross-validation ŽPC-models re-estimated for each cross-validation segment., only one sample was misclassified. Classification by allocating a sample according to the largest fitted value of a linear regression ŽDiscriminant-Partial least squares ŽDPLS. or Discriminant-Principal components regression ŽDPCR.. is demonstrated sub-optimal compared to LDA of the corresponding PLS- or PCR-scores. QDA significantly outperforms LDA for projections of the data onto subspaces of moderate size Žscores of 7–9 PCs.. Two automatic variable-selection procedures choose 16 and 26 wavelengths Žvariables., respectively from the spectra. Based on the selected wavelengths, LDA gives considerably better classification than the regression approach. By reporting the performances of several feature extraction techniques in tandem with three of the most common classification methods, we hope that the reader will notice two relevant aspects: Ž1. By using the DPLS and DPCR Žclassification by ‘dummy’ regressions. one is exposed to a significant risk of obtaining sub-optimal classification results; Ž2. The automatic wavelength selections may give valuable information about what is actually causing a successful discrimination. Such knowledge can, for instance, be used to select the most suited filters for online applications of NIR. Besides, from demonstrating different classification strategies, our study clearly shows that classification methods with NIR spectra can be used to discriminate between mayonnaise samples of different oil types and fatty acid composition. q 1999 Elsevier Science B.V. All rights reserved. Keywords: Discriminant analysis; Principal components; Automatic variable selection; NIR; Vegetable oils

) 1 2

Corresponding author. Tel.: q47-22-47-67-45; Fax: q47-22-47-67-95; E-mail: [email protected] Mills DA, P.O. Box 4644, Sofienberg, N-0506 Oslo, Norway. Mills DA, P.O. Box 4644, Sofienberg, N-0506 Oslo, Norway.

0169-7439r99r$ - see front matter q 1999 Elsevier Science B.V. All rights reserved. PII: S 0 1 6 9 - 7 4 3 9 Ž 9 9 . 0 0 0 2 3 - 4

20

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

1. Introduction Near Infrared Reflectance ŽNIR. spectroscopy and similar techniques are primarily used for Žrapid. detection of chemical compounds. During the last decade, a number of applications as well as theoretical approaches have been reported where the main goal is discrimination between samples belonging to one of several distinct groups based on spectral properties. A number of theoretical approaches deals with regularization-strategies. Regularization is necessary to overcome the problems with unstable and singular covariance-estimates due to a large number of variables Žpossibly highly correlated. together with a moderate number of samples. One such strategy is to project the data onto a smaller subspace spanned by a manageable number of principal components ŽPCs.. A standard application of linear discriminant analysis ŽLDA. or quadratic discriminant analysis ŽQDA. can then be executed for the projected data. There are also other methods suggested in the literature. In Ref. w1x, Friedman introduces a strategy called Regularized Discriminant Analysis ŽRDA. for computing classdependent but biased covariance-estimates resulting in a compromise between LDA and QDA. In Refs. w2,3x, the method DASCO Ždiscriminant analysis with shrunken covariances. is introduced and suggested as a substitute for the well-known method SIMCA w4x. Both methods allow individual class-specific PCmodels and a shrinking of the residual-spaces, but the allocation rules of the two methods differ. In Ref. w5x, a strategy for regularizing the covariance-estimate of LDA in high-dimensional situations is considered. A stepwise procedure for selection of relevant PC scores in the case of initial dimension reduction is demonstrated in Ref. w6x. Other approaches of feature-extraction including estimation of Fourier- and Wavelet-coefficients, as well as variable selection strategies, can be found in Refs. w7,8x. Variable selection in tandem with Neural Network ŽNN. classifiers is described in Ref. w9x. Practical examples presented in several of the references above demonstrate successful applications of NIR spectroscopy to a number of scientific areas. NIR spectroscopy has many applications in the food industry and food analysis w10x and it is used for quality control of raw materials and final products. A large number of NIR- and IR-applications are related

to quality and authenticity studies of food products. Evans et al. w11x give an application with orange juice, Ellekjær et al. w12x focus on determination of sodium chloride content in sausages, and tenderness of meat is classified in Ref. w13x. In Refs. w14,15x, NIR spectroscopy is successfully used to determine the correct origin of basmati rice. Problems closely related to the application presented in this paper is the work in NIRand IR-based discrimination between different vegetable oils w16–26x. In the present paper, we address the problem of discriminating between mayonnaise samples based on six different vegetable oils, where the measured NIR spectra yield 351 equally spaced and highly correlated variables corresponding to wavelengths in the 1100–2500 nm range. Various classification techniques are applied to NIR spectra of full fat mayonnaise containing 70% to 80% oil. Mayonnaise is an emulsion with oil droplets dispersed in water. Since the NIR spectra reflect both absorption characteristics as well as scattering from droplets and other particles, we had to consider interference from other components than the oil itself. Therefore, the six vegetable oils, in pure form, were also analyzed by NIR and Gas Liquid Chromatography ŽGLC.. The results of the two discriminantanalyses will be compared. Studies have shown that NIR spectra of vegetable oils contain information about their fatty acid composition. Absorption bands around 1600–1800 nm and 2100–2200 nm are assigned to straight carbon chains and cis double-bonds. Attempts have previously been made to classify various vegetable oils based on full-range NIR spectral data w21x. The goal of our study is two-fold. First, we want to emphasize the discriminative power of NIR spectroscopy for the chosen problem. Secondly, we want to discuss and compare the performance of several different discrimination strategies applied to our data and to point out some useful relationships between them. We will focus on the straightforward approach of projecting the data onto suitable subspaces found by a principal component analysis ŽPCA. and multi-response partial least squares ŽPLS2. modelling. The latter approach involves a regression with categorical  0,1 4 -responses. We also consider variants of a method for automatic variable selection based on es-

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

timation of between-groups variance over withingroups variance ratio ŽBrW-ratio. for each spectral wavelength of the dataset. 2. Materials and methods 2.1. MultiÕariate classification Similar to multivariate regression problems, the goal of a classification-procedure is to extract reliable information from potentially high-dimensional and collinear X-data. This often requires careful shrinking, transformation, regularization or subsetselection from the original variables Žwavelengths. before the standard statistical techniques can be successfully applied. A classification problem can be described as follows: n measurements x 1 , . . . , x n , each of p variables are arranged in a data matrix X. Group membership is given by a vector Y g  1,2, . . . , K 4 n, where K is the number of groups. From the dataset ŽX,Y. a classification-rule CR: R p ™  1,2, . . . , K 4 is designed in such a way that the correct class of a future anonymous sample x g R p is predicted with high probability. Good references on the topic are Ripley w27x, Duda and Hart w29x, Mardia et al. w30x, McLachlan w31x, and Bishop w32x. By assuming the different groups generated by K distinct probability-densities f k for k g  1, . . . , K 4 with prior probabilities p k , a straightforward argument minimizing the risk of misclassification leads to ˆ where the rule: CRŽ x . s k, f kˆ Ž x . p kˆ s

max ks1, . . . , K

 fk Ž x . p k 4 ,

Ž 1.

i.e., allocation of x to the class of maximal probability score. If the densities involved are assumed multi-normal, i.e., 1 fk Ž x . s 1r2 pr2 Sk Ž 2p . =exp y1r2 Ž x y m k .

t

Sy1 k

Ž mk . ,

Ž 2.

with prior probabilities p k for k g  1, . . . , K 4 , the expression in Eq. Ž1. can be replaced by: d kˆ Ž x . s

min ks1, . . . , K

dk Ž x . ,

Ž 3.

simply by taking the logarithm of the densities and deleting common terms. When different covariance

21

matrices of each class are assumed, Eq. Ž3. is applied with: t

d k Ž x . s Ž x y m k . Sy1 k Ž x y m k . q ln S k y 2ln Ž p k . . Ž 4. Ž . Ž . Eqs. 3 and 4 together are leading to quadratic boundaries between the classes, and the resulting classification rule is known as QDA. We will only assume equal priors of the classes and consequently, cancel the last term of Eq. Ž4.. If also equal covariance matrices of the classes are assumed, i.e., S k s S , ;k g  1, . . . , K 4 , we get a global metric on the variable space, d Ž x . s Ž x y y . tS y 1 Ž x y y . with d k Ž x . s Ž x y m k . tSy1 Ž x y m k .. By simple algebraic manipulations, one can deduce that the quadratic terms of x in these expressions cancel. This leads to linear decision boundaries between the classes. The resulting classification rule is known as LDA. In most applications of LDA and QDA the classcenters m k and the covariance matrices S k are unknown and usually replaced by: 1 m Ž 5. ˆ ks Ý x n k classŽ i .sk i where n k s‘the number of observations in group number k’ and: 1 t Sˆ k s Ž 6. Ý Ž x y mˆ k . Ž x i y mˆ k . , n k classŽ i .sk i corresponding to the maximum likelihood ŽML. estimates of m k and S k . 2.1.1. Collinearity and singularity When dealing with collinear data or more variables than observations, the resulting ML-estimates of covariance matrices will suffer from serious instability or even singularity. This is easily seen by expressing the spectral decomposition of the covariance matrices as: p

Sk s

Ý e i k zi k zitk ,

Ž 7.

is1

where e i k is the ith eigenvalue and zi k the corresponding eigenvector of S k . The inverse in this representation equals p

Sy1 k s

Ý is1

zi k zitk ei k

,

Ž 8.

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

22

and Eq. Ž4. takes the form p

dk Ž x . s

Ý

zitk Ž x y m k .

is1

ei k

y 2ln Ž p k . .

2

p

q

Ý ln Ž e i k . is1

Ž 9.

Eq. Ž9. is heavily influenced by the smaller eigenvalues and the directions corresponding to their eigenvectors. According to Friedman w1x, the eigenvalues of the empirical estimates Sˆ k are biased such that small eigenvalues are underestimated and large eigenvalues are overestimated. When the n k values are greater than but close to p, the estimated Sˆ k values will consequently be unstable and when n k - p the smallest eigenvalues will be equal to 0 and imply singular estimates. In the situation of small groups Žcompared to the number of variables. LDA based on the pooled covariance-estimate

Sˆ s

1 n

K

Ý Ž n k . Sˆ k ,

Ž 10 .

ks1

Ž n s S n k . will often outperform QDA because one is better off with a more stable estimate of the common covariance-structure than unstable individual estimates. Sometimes, a better alternative is to look for intermediate solutions where the class-dependent covariance matrices are modified according to certain rules. Such strategies are discussed in Refs. w1,33,34x. 2.1.2. Feature extraction by orthogonal decompositions In the situation with highly multivariate and collinear data such as NIR spectra, and only a moderate number of samples, even the pooled covariance-estimate of LDA may become unstable or singular. A reasonable solution to the problem is to reduce the dimension of the data by decomposing and projecting the samples onto a smaller number of orthogonal components found by PCA, PLS or some other suitable strategy. LDA or QDA can then be applied to the reduced data without stability problems. With some fortune, such procedures can give a drastic reduction of dimension without causing significant loss of discriminative information. A useful and interesting part of discriminant analysis is the computation of canonical variates also known as Fisher’s linear discriminants. The canonical variates are the solution-vectors z maximizing the

BrW-ratio criterion z t Bzrz t Wz under orthogonality constraints. These vectors are found by solving the generalized eigenvalue problem Bz s l Wz, where l is a generalized eigenvalue, W s n Sˆ within the sum K Ž of squares matrix and B s S ks ˆ k y mˆ .Ž mˆ k y mˆ . t 1 m the between sum of squares matrix Ž mˆ s Ž 1r n. S n k m ˆ k .. If W is non-singular, z will be an ordinary eigenvector of the expression Wy1 B Žsee Mardia et al. w30x.. It is proved in Ref. w31x that a direct application of LDA to the original data yield the same results as LDA applied to a projection of the data onto the space spanned by the canonical variates. Plots of the data projected onto two or three of the first canonical variates often give a useful graphical representation of the relationship between the different groups Žsee Fig. 3.. In the case of multivariate and highly correlated variables typical for spectral data, it is, however, important to do some kind of regularization of the dataset before estimation of canonical variates. Otherwise, both the estimated classification rule and the corresponding graphical representation can be strongly misleading.

2.1.3. Feature extraction by Õariable selection An alternative to projecting the data onto PCs is to select a subset of the original variables Žwavelengths.. The ideal situation is, of course, to identify the variables having the significant discriminative power. One possible strategy to obtain focus on variables contributing to correct classification is to compute the one-dimensional BrW-ratios separately for each wavelength w8x of the spectra to obtain the scattercurÕe Žif the full multivariate versions of W and B are already computed, the scatter-curve can be obtained by element-wise division of the diagonal of the matrix B with the diagonal of the matrix W.. Instead of exclusively selecting the wavelengths corresponding to the largest values, we suggest selection of variables corresponding to the local maxima of this spectrum ŽFig. 4.. This idea is sensible considering the high correlation present between adjacent wavelengths of a NIR spectrum. Even if two adjacent variables yield a high ratio, they both supply essentially the same discriminative information. This approach is equivalent to computation of the F-statistic

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

for each wavelength in a One-way ANOVA where the grouping corresponds to the different oils. We also consider an extended hybrid of this procedure based on simultaneous selection of local maxima of the scatter-curve and the related ÕariancecurÕe resulting from estimation of variable-wise variances ŽFig. 4.. The variance-curve is identical to the diagonal of the total covariance matrix Sˆ T s ŽB q W.rn of the entire dataset. 2.1.4. Classification by regression Regression with  0,14 -responses is a strategy for classifying multivariate data apparently different from the probabilistic approaches leading to LDA and QDA. A response matrix Y of n rows Žcorresponding to the observations of X. and K columns Žcorresponding to the number of classes. is designed as follows: for the ith row of Y Ž i s 1, . . . ,n., put a 1 in the k th column and a 0 in all the other columns if the corresponding ith object of X belongs to class k. By regressing Y onto X, classification of a new sample x is done by selecting the group corresponding to the largest component of the fitted yˆ Ž x . s Ž yˆ 1Ž x ., . . . , yˆK Ž x ... Ripley w27x gives an algebraic argument explaining the regression approach as a variant of LDA with equal prior probabilities, where the total covariance matrix is used in place of the within-groups covariance matrix. When the model assumptions of LDA are appropriate, the regression approach is seldom superior. For K s 2 Ža two-class problem., it can be shown that LDA and the regression approach are equivalent Žsee Ref. w27x.. However, as explained both by Ripley and Hastie et al. w28x, for arbitrary ˆ many classes K, LDA applied with the fitted Y-values of a linear regression always yield a classification equivalent to LDA applied to the original x-variables. Furthermore, this is equivalent to an application of LDA when the data are projected onto the space spanned by the canonical variates. The equivalence is a consequence of the fact that the fitted values span the same space as the canonical variates, and can be looked upon as a linear transformation of the original data into the canonical space Žnote that this relationship is no longer exact when LDA is substituted by QDA.. In the cases of PCR and PLS2 regression with  0,14 -responses w36x, i.e., DPCR and DPLS, this relationship can be explained as follows: application of

23

LDA to fitted values corresponds to first projecting the data onto canonical variates derived from the space spanned by the selected PCs, followed by an application of LDA in this space. 2.1.5. Validation and model selection Estimates of success rates for the classifiers are based on cross-validation. Because estimation of PCs is included in the model-specification, extra care must be taken. Some observations of the dataset can potentially have a significant influence on the choice of components, hence, evaluation of classifiers based on an initial decomposition may be misleading. This is an analog to validation by leverage correction of PCR and PLS regression models. An ‘authentic crossvalidation’ is obtained by decomposing the data for each cross validation segment excluded. The actual dataset was divided into segments by leaving out the three replicates of each sample per validation-step for all the methods presented. 2.2. Mayonnaise and oil samples A two level factorial design Ž2 4IV– 1 without replicates. was used for the investigation of oil, stabilizer, egg, and sugar in full fat mayonnaise. All four variables were varied at two levels in each of six mayonnaise samples based on different vegetable oils. These were soybean oil, sunflower oil, canola oil, olive oil, corn oil and grapeseed oil. The oil content was varied from 70 to 80%. In addition to eight designed experiments with mayonnaise of a certain oil type, one sample containing 75% soybean oil was always produced immediately after the others to serve as a control. Hence, a total of 54 Ž6)9. samples of mayonnaise were produced. Of the considered six groups, the soybean oil group contained 14 samples Žincluding the six 75% control samples. whereas the other groups contained eight samples each. All the samples were analyzed by NIR. Six different commercially available vegetable oils, of the same kind as mentioned above, were subjected to a similar analysis as the mayonnaise samples. The reason for doing so was to compare the results using oil type as the discriminating criteria both in pure form, as well as in a finished product Žmayonnaise.. In order to assure sufficient data-variation, different brands of all oils were bought from

24

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

local food stores. Because of restricted local availability of the different oil types, we ended up with six different brands of soybean oil, seven of sunflower oil, three of canola oil, four of olive oil, four of corn oil and three different brands of grapeseed oil, giving a total of 24 different samples. The oils were stored at 48C and analyzed by NIR and Gas Chromatography. 2.3. NIR measurements The mayonnaise were analyzed in triplicates Ž54)3 replicatess 162 samples. using logŽ1rreflectance. spectra ŽFig. 1a. at 4 nm intervals from 1100

to 2500 nm with an InfraAlyzer 500 ŽBran q Luebbe, Germany.. A sample cup with a quartz coverglass was used for sample presentation. The pure vegetable oils were also analyzed ŽFig. 1b. in triplicates Ž24)3 replicatess 72 samples. using the same instrument and specifications as for the mayonnaise samples. A sample cup coated with 40 mm of gold covered by a quartz cover glass was used for sample presentation of the oils. The spectra were not exposed to any pre-transformationsrscatter corrections. 2.4. Gas Liquid Chromatography (GLC) All the 24 samples of pure vegetable oil were analyzed in triplicates for their fatty acid concentration using a gas chromatograph ŽHP-5890II. with a split–split less injector and a flame ionization detector ŽJ & W DB-23 column, 25 m = 0.25 mm i.d., 0.25 mm film thickness.. The samples were trans-methylated with methanolic-HCl.

3. Results

Fig. 1. Ža. NIR spectra of mayonnaise. Žb. NIR spectra of vegetable oils.

Essentially, we considered three different classification methods, i.e., LDA, QDA and regression with  0,14 -responses. The methods were applied to different decompositions of the dataset containing NIR spectra for all the mayonnaise samples. The decompositions considered were PCA, PLS2 and wavelength selection based on the scatter- and variance-curves ŽQDA was not used with the selected wavelengths.. To reveal an overview of the performances of these three methods in the case of PCAand PLS2-decomposed data, cross-validated success rates were recorded as a function of the number of PCsincluded in model-estimation ŽFig. 2.. For projections of the dataset onto a moderate number Ž7–9. of PCs, QDA based on PCA- or PLS2-decomposed data gave significantly better results than LDA. However, using 15 Ž14 for PLS. PCs or more, cross-validated LDA correctly classified 161 out of the 162 samples, outperforming QDA which then started to suffer from unstable covarianceestimates. Fig. 3 shows a plot of the data projected

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

Fig. 2. Success rates of the different methods as a function of dimension. PLS-methods Ždashed., PCA-methods Žsolid..

onto the first two canonical variates estimated from an initial projection onto 15 PCs. By studying the clusters, one member of class 2 Žsunflower. can be observed to be well inside class 1 Žsoybean.. The other two replicates of sunflower from the same batch were correctly classified, and indicate that a faulty measurement, probably due to a switch of samples, may be the reason why a perfect classification is not obtained. Direct classification according to the largest estimated response of the PLS2 regression with  0,14 -indicators ŽPLS discriminant analysis, Ref. w36x. was

Fig. 3. Canonical plot of mayonnaise-data after projection onto 15 PCA-components.

25

consistently better than PCR ŽPCR discriminant analysis. for nine or more PCs, but both regressions were significantly poorer than the corresponding versions of LDA. However, the regressions outperform QDA when the number of PCs included is large. Automatic variable selection based on the scatter-curve gave 16 wavelengths corresponding to its local maxima ŽFig. 4.. LDA applied to the selected wavelengths resulted in a cross-validated success rate of 153r162. Compared to the cross-validated success rate of 126r162 obtained from a classification according to the largest estimated response of multiple linear regression ŽMLR., the latter approach is clearly unfavourable. The hybrid selection based on maxima of both the scatter- and the variance-curves resulted in selection of 26 wavelengths. LDA applied to these wavelengths gave a cross-validated success rate of 160r162, proving the selected subset to contain the essential information related to our problem. The corresponding linear regression had a crossvalidated success rate of 148r162, still significantly poorer than LDA. It should be noted that these results were obtained by direct application of the methods to the set of selected wavelengths. No further reductions of dimension were carried out. ŽQDA was not considered for the selected wavelengths because of the instability-problems related to individual covariance-estimates when the number of variables are large compared to class size.. A summary of the best

Fig. 4. Automatically selected wavelengths for the mayonnaisedata based on local maxima of the scatter-curves Žsolid. and variance-curves Ždashed..

26

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

Table 1 Selected success rates of the different methods Methods

LDA Ž‘optimal’.

Linear regression Žsame comp. as LDA. Linear regression Ž‘optimal’. QDA Ž‘optimal’.

Projections PCA-components

PLS2-components

Selected variables ŽBrW.

Selected variables Žhybrid.

ŽLDArPCA. 15 components 161r162 ŽPCR. 15 components 131r162 ŽPCR. 20 components 154r162 ŽQDArPCA. 9 components 155r162

ŽLDArPLS2. 14 components 161r162 ŽPLS2. 14 components 152r162 ŽPLS2. 17 components 158r162 ŽQDArPLS. 15 components 155r162

ŽLDA. 16 variables 152r162 ŽMLR. 16 variables 126r162

ŽLDA. 26 variables 160r162 ŽMLR. 26 variables 148r162

cross-validated success rates for the different methods are given in Table 1.

4. Discussion There are two important aspects related to the data analyzed in this paper. The first one is methodological and related to successful discriminant analysis based on NIR spectra and similar multivariate data. The second one is about the chemical interpretations of the information extracted from the NIR spectra to give good discrimination. The following discussion will hence be divided into two parts.

canonical variates. To compute the canonical variates, class-membership of the samples is required. Some regularization of the data is also necessary. In the case of Fig. 3, the canonical variates were obtained by first projecting the data onto 15 PCs Žoptimal classification. of a PCA on the entire dataset. For the canonical variates to represent a realistic grouping of the data, the number of PCA-components chosen was decided according to the cross-validated success rates of LDApca in Fig. 2. The different clusterings solve different classification problems related to our dataset. To extract the best possible features for the problem under consid-

4.1. Aspects concerning successful discriminant analysis Classification based on spectral data is no trivial task. One single dataset can reveal several distinct group structures and geometrical exploration based on score-plots of PCA does not necessarily show the clusterings we seek. For the mayonnaise-data, the PCA score-plot of Fig. 5 shows a three-group clustering, according to the amount of pure oil in the samples, where the middle cluster represents all the samples of 75% soybean oil. The score-plot of Fig. 6 shows a two-group clustering where the olive-based samples are distinctly separated from the rest of the data. The desired grouping according to oil type, is successfully obtained by projecting the data onto

Fig. 5. Mayonnaise-data projected onto PC1 and PC2 found by PCA.

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

Fig. 6. Mayonnaise-data projected onto PC2 and PC5 found by PCA.

eration is both a non-trivial and important task. But the multi-dimensional nature of our data also increase the possibility of doing successful complex classification. Even if the amount of variance accounted for by the difference in ‘oil type’ is small compared to the total variance of the data, it can be filtered out to obtain successful classification. By simultaneous consideration of the two distinct properties ‘oil type’ and ‘oil-content’, i.e., Figs. 5 and 3, it is in fact possible to obtain reliable classification of the data into 2)5 q 3)1 s 13 distinct groups based on the 54 samples Ž)3 replicates. only. As mentioned in Section 2.1.4, there is a close relationship between the regression approach and the canonical variates. In fact, the fitted values of a linear regression with  0,14 -responses span the canonical space derived from the covariates. The reader should note the insignificant difference between Figs. 3 and 7. Fig. 3 is showing a set of data projected onto the first two canonical variates. Fig. 7 shows a score-plot of the fitted values from a PCR Ž15 PCs. with  0,14 -responses. The minor difference in the structure of the two plots is caused by the extraction of PC1 and PC2 from the canonical space rather than the first two canonical variates. In the case of latent covariates ŽPCR or PLS2-regression. there are some remarks worth making. For PLS2 with seven components Žsee Fig. 2., direct classification according to the largest fitted response gave a success rate of 64r162 s 0.395. With the same seven PCs, LDA and QDA gave success rates

27

of 132r162s 0.815 and 153r162s 0.945, respectively. Hence, the seven components extracted by PLS2 obviously contains much more discriminative information than what is utilized by the traditional PLS-discriminant procedure. The difference in success rates between PLS2 and QDA is here, in fact, 55%! The situation is similar for PCR. Discrimination based on the regression approaches seems to be clearly sub-optimal compared to application of LDA Žand QDA when the number of selected components is small. with the same data. Wavelength selection based on local maxima of the scatter- and variance-curves is attractive both from the computational as well as the interpretational point of view. The technique is a univariate featureextraction, aiming directly for relevant structure of the data. A natural extension of these procedures will typically involve a standard stepwise variable selection based on statistical criteria as described in Ref. w31x. This can be quite helpful in a further reduction and interpretation of the data without reducing classification-performance. Multivariate feature-extraction is possibly a more powerful alternative, but not without a considerable computational cost. If sensitivity to disturbances in the spectra such as shifts, etc., more robust versions of the wavelength selection can be required. For such problems, introduction of extra variables corresponding to small intervals around of the chosen wavelengths is a possible extension. Further decompositions by PLS or PCA of the selected wavelengths is another natural option.

Fig. 7. PC1 against PC2 of fitted values after PCR-regression of 0,14-responses onto mayonnaise-data.

28

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

It should be noted that, in general, PCs and canonical variates are not invariant under pre-transformations Žlike multiplicative scatter correction. of the data. This is also true for the variance- and scatter-curves. In particular, the variance-curve is likely to be altered in the location of its peaks. Also, the scatter-curve may be affected, but probably less seriously. As long as a reduction in total variance will tend to be quite evenly distributed between the ‘within-samples’ and the ‘between-samples’—variance peaks are not likely to be lost. However, if the reduction of variance is mostly due to the ‘withinsamples’ variance, new peaks may appear. We do not pursue the subject any further here. Another aspect not considered in this paper is non-chronological Žchronological with respect to the amount of corresponding variancercovariance. selection of PCA- and PLS2-components. Such selection could be based on component-wise BrW-ratios estimated for each component, or by a stepwise canonical procedure as described in Ref. w6x. We decided not to consider such approaches in our study. According to the conclusions in Ref. w35x and the fact that very good classification results were obtained without any selection strategy, we do not expect component-selections to give significant improvement in classification performance over the methods actually applied in our analysis. All together, the dataset presented seemed very well-suited for discrimination. Except for QDA which starts to suffer from overparametrization due to the required estimation of individual covariance matrices, the cross-validated success rates of Fig. 2 increase quite consistently for the number of components considered. For LDA with PCA-scores, we confirmed the cross-validated success rate of 161r162 for as many as 130Ž!. PCs Žinclusion of that many scores is of course not recommended.. For completeness, we also tried classification of the mayonnaise-data by the SIMCA-method implemented in the software-package ‘SCAN’. The routine in ‘SCAN’ follows the original description of SIMCA given by Wold w4x, and does not consider leverage inside the PC-models. The best cross-validated success rate of 75% by SIMCA was obtained with seven PCs representing each class. Thus, at least this version of SIMCA seems to be limping due to the small number of samples in each class, and it is sig-

nificantly outperformed by the other methods of our study. Generally, one should expect data of a more complicated group-structure than reflected by our example. In such cases, LDA may be too crude, and a dimension-reduction followed by an application of RDA w1x might be preferable. Non-linear multivariate regression onto  0,14 -responses followed by LDA applied to the fitted values in the spirit of Hastie et al. w37x and Ripley w27x are other useful alternatives. Also, the variable-selection approaches described by Wu and Massart w9x and the wavelength-selections suggested in this paper yield good starting-points for discriminant analysis via other flexible methods like NNs. 4.2. Chemical interpretations of the NIR spectra Õia analysis of pure Õegetable oils NIR is suited for measuring the composition of food constituents w38x such as the amount of fat, water, and protein. The spectrum is a result of light absorption from various functional groups, including –CH, –OH, –NH and the wavelength of an absorption band often reveals the nature of the chemical bonds responsible for the absorption. Almost all absorption bands observed in NIR arise from overtones of hydrogenic stretching vibrations of functional groups or combinations involving stretching and bending modes of vibration of the groups. After obtaining a successful classification, it is both natural and often necessary to understand and interpret the discriminative information of the NIR spectra. Vegetable oils contain different amounts of fatty acids which differ in chain length Žaddition of –CH 2 . and number and position of double bonds ŽTable 2.. These differences are likely to be expressed in a number of different wavelength bands of the NIR spectra making an exact interpretation difficult. However, information about the fatty acid composition in oils and fats seems to be concentrated in the range of 1600–2200 nm, and attempts have been made to determine the fatty acid composition by NIR methodology Žsee Ref. w22x.. The NIR spectra obtained from the 24)3 s 72 samples of the same six pure vegetable oils being used in the mayonnaise samples were similar to NIR spectra described in the literature w10,22,38x. The

Oil types

Soybean oil Sunflower oil Rapeseed oil Olive oil Corn oil Grapeseed oil

Fatty acids C14

C16

C16:1

C18:0

C18:1

C18:2

C18:3

C20–22

0.0764"0.1046 0.0658"0.0022 0.0573"0.0068 0.0023"0.0079 0.0409"0.0076 0.0430"0.0016

12.7733"0.3285 6.2441"0.0741 4.7344"0.4788 11.1625"1.1979 10.7617"0.2062 6.9100"0.1138

0.0918"0.0058 0.0799"0.0064 0.2229"0.0119 0.7531"0.1636 0.1029"0.0042 0.0957"0.0058

3.9567"0.2606 4.1972"0.1325 1.7589"0.0523 2.9558"0.5414 1.8500"0.0477 3.7756"0.0682

22.3984"0.8813 23.9858"1.1621 59.5633"0.2727 73.5224"1.2924 28.3832"1.2512 17.6778"0.4992

54.2878"0.3346 64.8389"1.2101 21.1522"0.3419 9.7883"0.10129 56.8917"1.4124 70.5522"0.2414

7.3189"0.7410 0.1428"0.0618 10.0722"0.5128 0.6623"0.0638 1.0005"0.3436 0.5359"0.3439

0.3404"0.1324 0.2598"0.1222 1.9246"0.0608 0.7872"0.0722 0.2641"0.1542 0.2859"0.0701

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

Table 2 The average fatty acid composition of the oil types with standard deviations calculated on the basis of replicates

29

30

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

Fig. 8. Scatter-curve for oil-data showing the discriminative ability of the wavelengths.

spectra are showing peaks in the wavelength regions around 1700 nm and 2300–2400 nm ŽFig. 1b.. The relevance of these wavelength regions is confirmed by corresponding peaks in the scatter-curve for the pure oils ŽFig. 8.. One should also note that the spectra of pure oils show a shape quite similar to the variance-curve of Fig. 4. This supports the information extracted from the data to be related to the oilcomponent of the mayonnaise. A successful discrimination of all six oils based on the NIR spectra is shown in Fig. 9. The canonical variates are computed after projecting the data onto 15 PCA-components. The canonical plots in Figs. 9 and 3 show very similar internal groupings for the

Fig. 10. Plot of monounsaturated acids ŽC18:1. and alpha-linolenic acid ŽC18:3. of oil-samples.

two corresponding datasets. The same structure is shown in Fig. 10 where the contents of the monounsaturated acid C18:1 and the polyunsaturated alphalinolenic acid C18:3 are plotted against each other. Table 2 shows the complete fatty acid composition for the six different oils analyzed by gas chromatography. Based on these observations, we conclude that the first two canonical variates derived from NIR measurements of both datasets are dominated by the amount of C18:1 and C18:3, respectively. Hence, the classification seems mainly to be based on the fatty acid composition with the number of double bonds being one of the most important factors. By comparing the discrimination of mayonnaise with the discrimination of pure oils, it is also apparent that the classification is not significantly influenced by the variation of other ingredients Žegg, stabilizer, and sugar. or other chemical and physical properties of the mayonnaise. The potential of NIR spectroscopy for the analysis of other ingredients, as well as their effects on the consistency in full fat mayonnaise will be discussed in another paper.

5. Conclusions

Fig. 9. Canonical plot of oil-data after projection onto 15 PCAcomponents.

In conclusion, the present study demonstrates that NIR can be used to discriminate between mayonnaise samples with different oil content and oil type, in spite of the fact that a large amount of variance in the data is caused by other phenomena. Classification methods based on NIR data seem to be success-

U.G. Indahl et al.r Chemometrics and Intelligent Laboratory Systems 49 (1999) 19–31

ful in separating mayonnaise samples according to oil type and fatty acid composition. The results are in agreement with recent studies showing that PCA and discriminant analysis can be used to identify oils Žsee Refs. w18,21x.. It may therefore be possible to utilize NIR for the determination of oil type in full fat mayonnaise as well as detection of minor variations in oil content and variations in the content of individual fatty acids. Furthermore, we have seen that for our data, LDA is consistently superior to the regression approaches based on PLS2 and PCR with  0,14 -responses. However, our experience is that PLS2 can be very useful for extraction of subspaces where both QDA and LDA applied to the projected data yield good discrimination. Wavelength-selection corresponding to local maxima of the scatter- and variance-curves is demonstrated both as a useful data-reduction technique, and as a tool for identification of wavelengths containing discriminative information.

Acknowledgements This study was completed as a part of the Nordic R & D Programme for the Food Industry, Nordfood project no. P93131—‘Emulsion quality’, with financial support from the Nordic Industrial Fund and the Norwegian Food-company Mills DA. We are grateful to Grethe Enersen of MATFORSK for measuring the NIR spectra and Palsgaard Industri of Denmark for production of the mayonnaise samples.

References w1x J.H. Friedman, JASA 84 Ž1989. 165–175. w2x I. Frank, Chemometrics and Intelligent Laboratory Systems 4 Ž1988. 215–222. w3x I. Frank, J.H. Friedman, Journal of Chemometrics 3 Ž1989. 463–475. w4x S. Wold, Pattern Recognition 8 Ž1976. 127–139. w5x W.J. Krzanowski, P. Jonathan, W.V. McCarty, M.R. Thomas, Applied Statistics 44 Ž1995. 101–115. w6x D. Bertrand, P. Courcoux, J.-C. Autran, P. Robert, Journal of Chemometrics 4 Ž1990. 411–427. w7x Y. Mallet, D. Coomans, O. de Vel, Chemometrics and Intelligent Laboratory Systems 35 Ž1996. 157–173. w8x W. Wu, B. Walczak, D.L. Massart, K.A. Prebble, I.R. Last, Analytica Chimica Acta 315 Ž1995. 243–255. w9x W. Wu, D.L. Massart, Chemometrics and Intelligent Laboratory Systems 35 Ž1996. 127–135.

31

w10x P.G. Osborne, T. Fearn, Near Infrared Spectroscopy in Food Analysis, Longman, Essex, 1986.. w11x D.G. Evans, C.N.G. Scotter, L.Z. Day, M.N. Hall, Journal of Near Infrared Spectroscopy 1 Ž1993. 33–44. w12x M.R. Ellekjær, K.I. Hildrum, T. Næs, T. Isaksson, Journal of Near Infrared Spectroscopy 1 Ž1993. 65–75. w13x T. Næs, K.I. Hildrum, Applied Spectroscopy 51 Ž3. Ž1997. 350–357. w14x B.G. Osborne, B. Mertens, M. Thompson, T. Fearn, Journal of Near Infrared Spectroscopy 1 Ž1993. 77–84. w15x W.J. Krzanowski, Journal of Near Infrared Spectroscopy 3 Ž1995. 111–117. w16x S. Husain, K. Sita Devi, D. Krishna, P.J. Reddy, Chemometrics and Intelligent Laboratory Systems 35 Ž1996. 117–126. w17x Y.W. Lai, E.K. Kemsley, R.H. Wilson, Food Chemistry 53 Ž1995. 95–98. w18x K.M. Bewig, A.D. Clarke, C. Roberts, N. Unklesbay, JAOCS 71 Ž2. Ž1994. 195–200. w19x M.F. Devaux, P. Robert, A. Qannari, M. Safar, E. Vigneau, Applied Spectroscopy 47 Ž7. Ž1993. 1024–1029. w20x D.B. Dahlberg, S.M. Lee, S.J. Wenger, J.A. Vargo, Applied Spectroscopy 51 Ž8. Ž1997. 1118–1124. w21x T. Sato, JAOCS 71 Ž3. Ž1994. 293–299. w22x T. Sato, S. Kawano, M. Iwamoto, JAOCS 68 Ž11. Ž1991. 827–833. w23x H. Kamishikiryo, K. Hasegawa, H. Takamura, T. Matoba, Journal of Food Science 57 Ž5. Ž1992. 1239–1240. w24x A.J. Boot, A.J. Speek, Journal of AOAC International 77 Ž5. Ž1994. 1184–1189. w25x I.J. Wesley, R.J. Barnes, A.E.J. McGill, JAOCS 72 Ž3. Ž1995. 289–292. w26x M.A. Czarnecki, Y. Liu, Y. Ozaki, M. Suzuki, M. Iwahashi, Applied Spectroscopy 47 Ž12. Ž1993. 2162–2168. w27x B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge, 1996. w28x T. Hastie, R. Tibshirani, A. Buja, Journal of the American Statistical Association 89 Ž1994. 1255–1270. w29x R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973. w30x K.V. Mardia, J.T. Kent, J.M. Bibby, Multivariate Analysis, Academic Press, 1979. w31x G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley, 1992. w32x C.M. Bishop, Neural Networks for Pattern Recognition, Oxford, 1995. w33x B. Flury, M.J. Schmidt, A. Natayanan, Journal of Classification 11 Ž1994. 101–120. w34x J.D. Banfield, A.E. Raftery, Biometrics 49 Ž1993. 803–821. w35x T. Næs, H. Martens, Journal of Chemometrics 2 Ž1988. 155– 167. w36x L. Stahle, S. Wold, Journal of Chemometrics 1 Ž1987. 185– ˚ 196. w37x T. Hastie, A. Buja, R. Tibshirani, Annals of Statistics 23 Ž1995. 73–102. w38x P. Williams, K. Norris ŽEds.., Near-Infrared Technology in the Agricultural and Food Industries, American Association of Cereal Chemists, St. Paul, 1990.