Analytica Chimica Acta 329 (1996) 257-265

Comparison of regularized discriminant analysis, linear discriminant analysis and quadratic discriminant analysis, applied to NIR data

W. Wu a, Y. Mallet a, B. Walczak a,1, W. Penninckx a, D.L. Massart a,*, S. Heuerding b, F. Erni b

a ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium
b Sandoz Pharma AG, Analytical Research and Development, CH-4002 Basle, Switzerland

* Corresponding author. Tel.: +32 2 4774737; fax: +32 2 4774735.
1 On leave from Silesian University, Katowice, Poland.

Received 6 February 1996; accepted 6 February 1996

Abstract

Three classifiers, namely linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and regularized discriminant analysis (RDA), are considered in this study for classification based on NIR data. Because NIR data sets are severely ill-conditioned, the three methods cannot be applied directly. A feature selection method was used to reduce the data dimensionality, and the selected features were used as the input of the classifiers. RDA can be considered as an intermediate method between LDA and QDA, and in several cases RDA reduces to either LDA or QDA, depending on which is better. In some other cases RDA is somewhat better; however, its optimization is time consuming. It is therefore concluded that in many cases LDA or QDA should be recommended for practical use, depending on the characteristics of the data. However, in those cases where even small gains in classification quality are important, the application of RDA might be useful.

Keywords: Regularized discriminant analysis; Linear discriminant analysis; Quadratic discriminant analysis; Near infrared spectroscopy; Feature selection; Pattern recognition

1. Introduction

Three related discriminant methods, namely linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and regularized discriminant analysis (RDA), are considered in this study for classification based on NIR data. The choice of a classification method is data dependent. It is known that the performance of LDA and QDA depends on the size of the training set, on how closely the class populations follow a multivariate normal distribution and on whether the covariances of the classes considered are equal or not. In the case of small sample sizes and small differences between class covariance matrices, LDA often outperforms QDA, since fewer estimates need to be calculated. When the data set is ill-conditioned (i.e. the number of variables is larger than the number of parameters to be estimated [1]) and contains deviations from normality, RDA [1-3] has the potential to produce enhanced results compared to those of LDA and QDA. Classification of NIR data is usually an ill-posed problem, so that RDA can theoretically be considered as the best choice. From theoretical considerations one would expect RDA to be by definition better than, or at least equal to, LDA or QDA. However, RDA requires an additional time consuming optimization step compared to LDA or QDA. The question is whether it is worthwhile to always use RDA instead of LDA and QDA in the case of classification of NIR data. In this paper, the performance of RDA, LDA and QDA for NIR data (for which the ratio of the number of variables to the total number of objects is very high) is studied. It should be noted here that there is another possibility to avoid the use of ill-conditioned data, namely reducing the number of dimensions. How to do this in the best manner is another important subject, which, however, is not studied here.

2. Theory

2.1. RDA, LDA and QDA [1-3]

Consider the classification problem of assigning a test object $x_i$, based on $d$ measurements $x_i = (x_{i1}, \ldots, x_{id})^T$, to one of $K$ a priori defined classes. The classification score is defined as

$$d_k(x_i) = (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) + \ln|\Sigma_k| - 2 \ln \pi_k, \qquad (1)$$

where $\Sigma_k$ is the class covariance matrix of class $k$, $\mu_k$ is the mean vector of class $k$, and $\pi_k$ is the prior probability of class $k$. They are estimated by

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i \in k} x_i, \qquad (2)$$

$$\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i \in k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T, \qquad (3)$$

$$\hat{\pi}_k = n_k / n, \qquad (4)$$

where $n_k$ is the number of objects in class $k$, and $n$ is the total number of objects in the training set.

Object $x_i$ is assigned to the class for which it has the lowest classification score. Based on this quadratic classification rule, the boundary between classes is of quadratic shape, so this kind of classifier is called QDA. Of the three terms in Eq. (1), the first corresponds to the Mahalanobis distance.

In the case of LDA, the class covariance matrices are assumed to be equal, so that a pooled covariance matrix

$$\hat{\Sigma}_{\mathrm{pooled}} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i \in k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T \qquad (5)$$

is substituted for the class covariance matrix in Eq. (1). Ignoring constants, the following classification rule for LDA is derived:

$$d_k(x_i) = (x_i - \mu_k)^T \hat{\Sigma}_{\mathrm{pooled}}^{-1} (x_i - \mu_k) - 2 \ln \pi_k. \qquad (6)$$

When the prior probability $\pi_k$ is constant, Eq. (6) corresponds to the Mahalanobis distance.

In poorly-posed settings, for which the number of variables is comparable to the number of parameters to be estimated, the estimates of the class covariance matrix become highly uncertain: the smallest eigenvalues are biased too low and the largest eigenvalues are biased too high [1]. In ill-conditioned settings, if the number of variables is higher than the number of objects in every group ($n_k$) and lower than the total number of objects ($n$), QDA cannot be applied, because the class covariance matrix ($\hat{\Sigma}_k$) is singular. If the number of variables is higher than the total number of objects ($n$), neither QDA nor LDA can be used, because both $\hat{\Sigma}_k$ and $\hat{\Sigma}_{\mathrm{pooled}}$ are singular. Friedman [1] showed that these problems can be overcome by applying a regularization

$$\hat{\Sigma}_k(\lambda, \gamma) = (1 - \gamma)\, \hat{\Sigma}_k(\lambda) + \frac{\gamma}{d}\, \mathrm{tr}[\hat{\Sigma}_k(\lambda)]\, I, \qquad (7)$$

where $I$ is the identity matrix, the parameter $\lambda \in [0, 1]$ controls the degree to which the pooled class covariance matrix is used, the parameter $\gamma \in [0, 1]$ controls the degree of shrinking towards the average eigenvalue, and

$$\hat{\Sigma}_k(\lambda) = (1 - \lambda)\, \hat{\Sigma}_k + \lambda\, \hat{\Sigma}_{\mathrm{pooled}}. \qquad (8)$$


When $\lambda = 1$ and $\gamma = 0$, RDA is identical to LDA. When $\lambda = 0$ and $\gamma = 0$, the result is QDA. When the number of variables is higher than the number of objects in every group ($n_k$) and lower than the total number of objects ($n$), the class covariance matrix ($\hat{\Sigma}_k$) is singular, while the pooled covariance matrix ($\hat{\Sigma}_{\mathrm{pooled}}$) is not, nor is the covariance matrix ($\hat{\Sigma}_k(\lambda)$) after regularization by Eq. (8). When the number of variables is higher than the total number of objects ($n$), both $\hat{\Sigma}_k$ and $\hat{\Sigma}_{\mathrm{pooled}}$ are singular, and so is $\hat{\Sigma}_k(\lambda)$. The covariance matrix is then further regularized to $\hat{\Sigma}_k(\lambda, \gamma)$ by Eq. (7). Friedman [1] concludes that QDA is only viable when the ratio of objects to variables is large; LDA has then usually been the method of choice. When the sample size is too small for LDA, or when the population class covariance matrices are too different, so that pooling makes no sense, RDA can improve on both.
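To make the regularization concrete, here is a minimal sketch (in Python rather than the authors' Matlab, with function names of our own choosing) of Eqs. (7) and (8) and of the classification score of Eq. (1); it assumes the regularized matrix is invertible and that the class means, class covariances, pooled covariance and priors have already been estimated.

```python
import numpy as np

def regularized_covariance(S_k, S_pooled, lam, gamma):
    """Eqs. (7)-(8): blend the class covariance with the pooled covariance
    (weight lam), then shrink towards a multiple of the identity (weight gamma)."""
    d = S_k.shape[0]
    S_lam = (1.0 - lam) * S_k + lam * S_pooled                                 # Eq. (8)
    return (1.0 - gamma) * S_lam + (gamma / d) * np.trace(S_lam) * np.eye(d)   # Eq. (7)

def classification_score(x, mean_k, S_reg, prior_k):
    """Eq. (1): Mahalanobis term + log-determinant term - 2 ln(prior)."""
    diff = x - mean_k
    _, logdet = np.linalg.slogdet(S_reg)
    return float(diff @ np.linalg.solve(S_reg, diff) + logdet - 2.0 * np.log(prior_k))

# lam = 1, gamma = 0 reproduces LDA (pooled covariance only);
# lam = 0, gamma = 0 reproduces QDA (class covariance only);
# an object is assigned to the class with the lowest score.
```

As the comments indicate, the two extreme settings of the grid recover the LDA and QDA rules discussed above.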

2.2. Selecting the regularization parameters for RDA

Suitable estimates for the regularization parameters were determined by trying different combinations of $\lambda$ and $\gamma$. The strategy consists of choosing a grid of points on the $\lambda$-$\gamma$ plane ($\lambda \in [0, 1]$, $\gamma \in [0, 1]$). Usually, the number of points in the optimization grid varies from 25 to 50. In this study, a 6x6 grid was constructed, with $\lambda$ and $\gamma$ taking the values 0, 0.2, ..., 1.0. For each point a classification criterion, based on leave-one-out cross-validation, is calculated. Each time one object is left out as a test object, the classification criterion for each point of the grid is calculated. After all objects have been left out, the average classification criterion for each point is computed, and the pair of estimates maximizing this criterion is used in the RDA model.

2.3. The classification criterion

Usually the classification criterion is calculated by error counting, i.e. the ratio of correctly classified objects to the total number of objects is calculated for each grid point. The correct classification rate yields discrete numbers which are often the same for several grid points. To overcome this drawback, a probability criterion (Pr) [4,5] can be used instead of the correct classification rate:

$$\mathrm{Pr} = \frac{1}{n} \sum_{i=1}^{n} \left\{ m_i - 0.5\left[ m_i + (1 - m_i)^2 \right] \right\} + 0.5, \qquad (9)$$

where $m_i = P(k|x_i)$ if $x_i$ is correctly classified to class $k$, and $m_i = 1 - P(k|x_i)$ if $x_i$ is misclassified to class $k$; $P(k|x_i)$ is the probability density of a sample $x_i$ belonging to class $k$. Pr is continuous in the range [0, 1]. A large Pr obtained for a given decision rule means that the classes can be differentiated with a high degree of certainty and small overlap between the classes. Pr gives more information than the correct classification rate. Pr was calculated based on leave-one-out cross-validation. The cross-validation was applied including not only the discrimination stage (partial validation), but also the data pre-processing stage (full validation).

2.4. Data pre-processing

Data pre-processing, involving data transformation and/or data reduction, can dramatically influence the final results of recognition. This is particularly true for spectral data, which contain hundreds of highly correlated variables, noise and irrelevant information. Removing non-relevant variability and extracting relevant features from the initial data set is of vital importance. All data used in this study were transformed by the standard normal variate technique (SNV) [6-8], because, in our experience, it is the most suitable (or at least one of the most suitable) methods to reduce within-class variability.

To reduce data dimensionality, one can perform feature selection [9] or transform the full feature space into a lower dimensional space. These approaches are not always equivalent, especially for spectroscopic data containing relevant and irrelevant variables. For instance, application of Principal Components Analysis (PCA) [10-12] preserves the information caused by the main sources of data variability, whereas feature selection methods aim to find subsets of variables relevant for classification. In this article we wanted to compare RDA to LDA and QDA. Therefore, we chose the simplest univariate feature selection strategy possible, based on Fisher's weights [9]. Fisher's weights are defined as the ratio of the between- to the within-class variances:

$$w_i = \frac{\sum_{j=1}^{K} n_j (\bar{x}_{ji} - \bar{x}_i)^2}{\sum_{j=1}^{K} n_j s_{ji}^2}, \qquad (10)$$

where $j = 1, 2, \ldots, K$ indexes the classes, $n_j$ is the number of objects in class $j$, $\bar{x}_{ji}$ denotes the mean absorbance of the objects belonging to class $j$ at the $i$th variable, $\bar{x}_i$ denotes the mean absorbance of the objects belonging to all classes at the $i$th variable, and $s_{ji}$ is the standard deviation of the absorbance of the objects belonging to class $j$ at the $i$th variable. For each variable the Fisher's weight is calculated, and the variables with the highest Fisher's weights are selected. This selection is not necessarily, and even probably not, the best possible choice for feature selection, but this aspect was considered to be unimportant for the aims of the study.
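As an illustration of the pre-processing described in Section 2.4, the following is a rough sketch (our own code, not the authors') of the SNV transformation and the Fisher-weight ranking of Eq. (10); whether the within-class term is weighted by $n_j$ or $n_j - 1$ is an assumption on our part.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def fisher_weights(X, y):
    """Eq. (10): ratio of between-class to within-class variance for each variable."""
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        nk = Xk.shape[0]
        between += nk * (Xk.mean(axis=0) - grand_mean) ** 2
        within += nk * Xk.var(axis=0, ddof=1)
    return between / within

# Keep, for example, the 25 variables with the highest Fisher weights:
# selected = np.argsort(fisher_weights(snv(raw_spectra), labels))[::-1][:25]
```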

3. Experimental

3.1. Data

Seven NIR data sets were studied. The first two and the fourth data sets were especially designed with the intent of making classification challenging. The other data come from industry and are therefore representative of what one might expect in practice.

Data set 1 consists of 40 NIR spectra whose wavelengths are 1318, 1320, ..., 2340 nm. The data consist of 2 classes, each containing 20 spectra. The first class contains NIR spectra for 20 different samples of para-xylene. The second class contains 20 different samples of para-xylene with 0.30% ortho-xylene added as a simulated impurity.

Data set 2 consists of 60 NIR spectra (1376, 1378, ..., 2398 nm) of three batches of excipients which are made by mixing cellulose, mannitol, sucrose, saccharin sodium salt and citric acid in different proportions. Each class contains the spectra measured of 20 samples of the same batch of excipients.

Data set 3 contains 83 NIR spectra (1330, 1332, ..., 2352 nm) of 4 quality classes of polymer products. Class 1 contains 22 spectra, classes 2 and 3, 20 spectra each, and class 4, 21 spectra.

Data set 4 contains 95 NIR spectra (1130-2152 nm, 512 points of wavelength) of pure butanol and butanol containing different concentrations of water (from 0.02% to 0.32%). The pure class contains 42 spectra and the impure class contains 53 spectra.

Data set 5 contains 140 NIR spectra (9885-4115 cm-1, 749 wavelengths) of tablets containing drugs of different dosages (0.025, 0.05, 0.075 and 0.15 mg) and three kinds of placebo. 20 different tablets of each dosage form (active and placebo) were measured four times through a glass plate. The average spectra are used.

Data set 6 contains 169 NIR spectra (1330-2353 nm, 512 points of wavelength) of 13 kinds of polymer products. The 13 classes contain respectively 22, 20, 20, 22, 5, 5, 5, 5, 5, 10, 20, 20 and 10 spectra.

Data set 7 contains 252 NIR spectra of 6 kinds of solvent containing 2 subgroups (pure and impure) per solvent. All spectra were trimmed to the same wavelength range (8555-4613 cm-1, 512 wavelengths). The 12 classes contain 15, 31, 15, 10, 15, 30, 15, 31, 15, 30, 14 and 31 spectra each.

3.2. Software

All methods were programmed by ourselves in Matlab code (version 4.0).

4. Results and discussion

4.1. Dimensionality reduction and validation

Since RDA was originally developed to deal with ill-conditioned problems, we first tried to apply it directly to the full SNV-transformed NIR spectra. However, we failed because of numerical problems. Indeed, by using the property of determinants [13], the determinant of the last term in Eq. (7) can be expressed as follows:

$$\left| \frac{\gamma}{d}\, \mathrm{tr}[\hat{\Sigma}_k(\lambda)]\, I \right| = \left\{ \frac{\gamma}{d}\, \mathrm{tr}[\hat{\Sigma}_k(\lambda)] \right\}^d. \qquad (11)$$

In the case of NIR data, $\frac{\gamma}{d}\, \mathrm{tr}[\hat{\Sigma}_k(\lambda)] \ll 1$, because the number of variables ($d$) is very high. The term in Eq. (11) then approaches zero and the covariance matrix remains singular after regularization. Numerical problems start with this type of data when $n/d < 1/2$. Therefore, it is necessary to reduce the data dimensionality, using the Fisher weight method described under Theory. The preselected features were used as the new input variables. As an example, the preselected features for data set 3 are given in Fig. 1.

The results of classification were validated based on leave-one-out cross-validation. The number of dimensions (selected features) was systematically varied from 1 to a certain maximum (25 to 39, according to the number of objects). The feature selection method is data set dependent, and sometimes the selected features are not the same when different objects are left out. For instance, Fig. 2 displays the 39 features selected by the training sets from which one object was left out, for data set 1. It shows that the selected features change depending on which object exactly was left out. Therefore, it is necessary to validate the methods including not only the classification step, but also the feature selection step. To demonstrate this, the classification procedures for data sets 1-4 were partially (i.e. only the classification step) and fully (i.e. including also the feature selection step) validated. For each discriminant method, the Pr value was estimated as a function of the number of features included, and the optimal Pr, i.e. the one giving the best classification result, was recorded as the result of the method. The optimal results of RDA, LDA and QDA are listed in Table 1. It can be seen that for data sets 1 and 4, the results of partial validation are better than those of full validation. For data sets 2 and 3, the differences between full and partial validation are very small, because the selected features are almost stable during the leave-one-out validation (Figs. 3 and 4). One can conclude that validating only the classification step may lead to an overoptimistic conclusion in certain cases and that therefore the longer procedure of full validation is needed.

Fig. 1. Location of the 28 selected features along the mean spectra of every group for the SNV spectra of data set 3.

Fig. 2. The selected features (x-axis) over the procedure of leave-one-out cross-validation (y-axis) for data set 1.

Table 1
The optimal classification results (probability criteria) of RDA, LDA and QDA with full and partial validations; (λ, γ) is expressed between brackets

Data set | RDA Full        | RDA Partial | LDA Full | LDA Partial | QDA Full | QDA Partial
1        | 0.8081 (1, 0)   | 0.8575      | 0.8081   | 0.8575      | 0.7193   | 0.8410
2        | 0.9275 (1, 0)   | 0.9268      | 0.9215   | 0.9268      | 0.8773   | 0.8633
3        | 1.0000 (0, 0)   | 0.9911      | 0.9653   | 0.9693      | 1.0000   | 0.9911
4        | 0.8173 (0, 1)   | 0.8353      | 0.7848   | 0.7898      | 0.7959   | 0.8022
5        | 0.9989 (0.2, 0) | -           | 0.9987   | -           | 0.7523   | -
6        | 0.9793 (0.2, 0) | -           | 0.9687   | -           | 0.8149   | -
7        | 0.9487 (0.2, 0) | -           | 0.9243   | -           | 0.9026   | -

Partial validation was carried out for data sets 1-4 only.

Fig. 3. The selected features (x-axis) over the procedure of leave-one-out cross-validation (y-axis) for data set 2.

Fig. 4. The selected features (x-axis) over the procedure of leave-one-out cross-validation (y-axis) for data set 3.
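To make the difference between partial and full validation concrete, here is a hedged sketch of the full leave-one-out procedure, in which SNV and Fisher-weight selection are repeated for every left-out object, together with the (λ, γ) grid of Section 2.2. It reuses the helper functions sketched in Section 2 (regularized_covariance, classification_score, snv, fisher_weights, all names of our own invention, not the authors' Matlab code) and, for brevity, counts correct classifications instead of computing the Pr criterion of Eq. (9).

```python
import numpy as np
from itertools import product  # used for the (lambda, gamma) grid in the example below

def rda_predict(x, X_train, y_train, lam, gamma):
    """Assign x to the class with the lowest RDA score (Eqs. (1), (7) and (8))."""
    classes = np.unique(y_train)
    n, K = len(y_train), len(classes)
    means = {k: X_train[y_train == k].mean(axis=0) for k in classes}
    covs = {k: np.cov(X_train[y_train == k], rowvar=False) for k in classes}
    pooled = sum((np.sum(y_train == k) - 1) * covs[k] for k in classes) / (n - K)
    scores = [classification_score(x, means[k],
                                   regularized_covariance(covs[k], pooled, lam, gamma),
                                   np.sum(y_train == k) / n)
              for k in classes]
    return classes[int(np.argmin(scores))]

def full_loo_validation(raw_spectra, y, n_features, lam, gamma):
    """Full validation: SNV and Fisher-weight selection are repeated for every
    left-out object, so the held-out spectrum never influences feature selection."""
    X, y = snv(raw_spectra), np.asarray(y)
    correct = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        w = fisher_weights(X[train], y[train])
        sel = np.argsort(w)[::-1][:n_features]        # top Fisher-weighted variables
        pred = rda_predict(X[i, sel], X[train][:, sel], y[train], lam, gamma)
        correct += int(pred == y[i])
    return correct / len(y)

# Searching the 6x6 grid of Section 2.2 (lambda, gamma in {0, 0.2, ..., 1.0}):
# best = max(product(np.arange(0, 1.2, 0.2), repeat=2),
#            key=lambda lg: full_loo_validation(raw_spectra, labels, 25, *lg))
```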

For the other data sets (data sets 5 to 7), therefore, full validation was always applied.

4.2. LDA, QDA and RDA

The data sets were chosen such that discrimination is not easy, in the sense that simple univariate and bivariate discriminant methods, such as those described in [9], do not achieve good discrimination for these data sets. There is therefore a need to use more complicated methods such as LDA, QDA and RDA. The optimal results for all 7 data sets with full cross-validation are listed in Table 1, together with the values of $\lambda$ and $\gamma$ in the RDA model. The term "optimal" is used in the sense that the result is given for the number of features that yields the best classification result. The results show that LDA (or RDA with $\lambda = 1$, $\gamma = 0$) performs best for two data sets (1 and 2) and QDA (or RDA with $\lambda = 0$, $\gamma = 0$) for one (data set 3). RDA performs better than the other methods for 4 data sets (4-7). The difference in performance between RDA and QDA is clear in all cases. It is smaller with LDA and, in fact, never very large. It is noteworthy that, in the same way that RDA reduces to other methods (LDA or QDA) for data sets 1-3, it also reduces to a special case for data set 4. As explained by Friedman [1], RDA with $\lambda = 0$ and $\gamma = 1$ reduces to a weighted nearest mean classifier (which we will call the WNMN method), the distance measure being the Euclidean distance and the weights being inversely proportional to the average variance of the variables within the class. WNMN is not a method one would probably think of in the first place. Using RDA may therefore have the advantage that it points the user in the right direction, also when that direction is unexpected, as is the case here.

To achieve a better understanding of the results and of the respective properties of LDA, QDA and RDA, data sets 1-4 were studied more systematically. Their PC1-PC2 plots are shown in Fig. 5, and Figs. 6-9 show the results obtained with the three methods as a function of the number of variables. The PC plots make it possible to investigate the class covariance structure of the data. It is immediately clear that the class covariances of data set 3 are very different, which explains why QDA is preferred to LDA, while they are similar for data sets 1 and 2, so that LDA performs better than QDA. For data set 4, one would expect QDA to perform better than LDA, because the covariances of the two classes are far from equal, but, although this is the case, the difference is small. In the first three cases the conditions to apply LDA and QDA are apparently met, so that RDA does not show any improvement. For data set 4, it is not completely clear to us why WNMN yields the best results. Perhaps the reason is that data set 4 is characterized not only by dissimilar covariances, but also by a clearly non-normal distribution (bimodal in the direction of PC2), which may explain why the more sophisticated method performs somewhat better. The use of a Euclidean measure in preference to a Mahalanobis measure seems to indicate that the within-class covariance is no longer important. This is perhaps related to the somewhat rounded structure of class 4.

The results of Figs. 6-9 are expressed by plotting Pr versus the number of variables selected. The classification results were obtained using leave-one-out full cross-validation, and were set to 0 when singularity problems prevented good results. When the number of features is comparable to the total number of objects, RDA outperforms LDA and QDA. This can be seen for instance in Fig. 6. In the beginning the performance of the three methods is similar. However, QDA rapidly becomes less good. LDA and RDA perform exactly equally well until about 20 features are included (in fact RDA reduces to LDA). From then on the performance is similar: RDA is nearly always somewhat better, and when the number of features reaches 38, LDA cannot be carried out anymore, while RDA still yields results. Values of $\lambda$ and $\gamma$ between 0 and 1 are then obtained. In this case, RDA profits from the regularization of the covariance matrix, while QDA and LDA suffer from the singularity problems.

Fig. 5. The score plots: (a) data set 1; (b) data set 2; (c) data set 3; (d) data set 4.

Fig. 6. The classification results of RDA, LDA and QDA as a function of the selected features for data set 1; -, -o- and -*- refer to RDA, LDA and QDA respectively.

Fig. 7. The classification results of RDA, LDA and QDA as a function of the selected features for data set 2; -, -o- and -*- refer to RDA, LDA and QDA respectively.

Fig. 8. The classification results of RDA, LDA and QDA as a function of the selected features for data set 3; -, -o- and -*- refer to RDA, LDA and QDA respectively.

Fig. 9. The classification results of RDA, LDA and QDA as a function of the selected features for data set 4; -, -o- and -*- refer to RDA, LDA and QDA respectively.

5. Conclusions

The results presented in Section 4.2 are in agreement with the known characteristics of the discriminant methods studied [1]. QDA is unlikely to produce satisfactory results unless the ratio of the class sample sizes to the number of variables posed in the problem is large, and it has no advantages compared to LDA except when the class covariance matrices are quite different. In situations where the class covariance matrices are similar, LDA is likely to give improved results compared to QDA, because a single covariance matrix can be employed in the classification rule, thus reducing the number of estimates which need to be computed. LDA should therefore be able to produce better results when the sample sizes are smaller. RDA always gives results equivalent to or better than LDA and QDA. In many cases it automatically reduces to LDA and sometimes to QDA, or even to other special cases such as WNMN. When this is not the case, better results are obtained. An advantage of RDA is certainly that it is not necessary to study the data structure and other characteristics of the data in order to decide whether LDA, etc. can and should be used: RDA automatically gives the most suitable $\lambda$ and $\gamma$, and thereby indicates whether reduction to one of the special cases is to be preferred. Its main disadvantage is the time consuming optimization of the $\gamma$ and $\lambda$ factors. It is therefore not recommended to apply RDA when one knows that the conditions for application of LDA or QDA are fulfilled. When the latter two methods cannot be applied, for instance because the sample size is too small, it seems worthwhile to try RDA. RDA cannot be applied directly to full spectrum NIR data, but because the ratio of objects to variables can be smaller than for QDA and LDA, more variables can be incorporated, which can also be an advantage.

The gains that can be obtained by applying RDA in preference to LDA and QDA are relatively small, and in practical work it is often not enough to go from e.g. 90% to 93% classification success; what one often needs is 99+% classification success. In such cases, it would seem that substituting RDA for LDA (or QDA) will rarely lead to sufficiently better results: one probably either needs to apply completely different classifiers or to concentrate on judicious data pretreatment (transformation/selection).

Acknowledgements

The authors thank C. Menardo and C. Sterna-Grataloup, both from Rhone Poulenc Industrialization, Crit, Decines, France, who kindly provided some of the data sets. The authors also thank D. Coomans for his fruitful discussions about the mathematical theory and F. Cuesta Sanchez for her helpful suggestions in programming. D. Jouan-Rimbaud is acknowledged for her assistance in preparing this manuscript.

References

[1] J. Friedman, Regularized discriminant analysis, J. Am. Stat. Assoc., 84 (1989) 165-175.
[2] S. Aeberhard, D. Coomans and O. De Vel, Improvements to the classification performance of RDA, J. Chemometrics, 7 (1993) 99-115.
[3] I. Frank and J. Friedman, Classification: oldtimers and newcomers, J. Chemometrics, 3 (1989) 463-475.
[4] D. Coomans, PhD dissertation, Vrije Universiteit Brussel, 1982.
[5] D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making, Wiley, New York, 1986.
[6] R.J. Barnes, M.S. Dhanoa and S.J. Lister, Standard normal variate transformation and detrending of near-infrared diffuse reflectance spectra, Appl. Spectroscopy, 43 (1989) 772-777.
[7] R.J. Barnes, M.S. Dhanoa and S.J. Lister, Correction to the description of standard normal variate (SNV) and de-trend (DT) transformations in practical spectroscopy with applications in food and beverage analysis, 2nd edn, J. Near Infrared Spectroscopy, 1 (1993) 185-186.
[8] M.S. Dhanoa, S.J. Lister, R. Sanderson and R.J. Barnes, The link between multiplicative scatter correction (MSC) and standard normal variate (SNV) transformations of NIR spectra, J. Near Infrared Spectroscopy, 2 (1994) 43-47.
[9] W. Wu, B. Walczak, D.L. Massart, K.A. Prebble and I.R. Last, Spectral transformation and wavelength selection in near-infrared spectra classification, Anal. Chim. Acta, 315 (1995) 243-255.
[10] D. Evans, C. Scotter, L. Day and M. Hall, Determination of the authenticity of orange juice by discriminant analysis of near infrared spectra, J. Near Infrared Spectroscopy, 1 (1993) 33-44.
[11] A. Sirieix and G. Downey, Commercial wheatflour authentication by discriminant analysis of near infrared reflectance spectra, J. Near Infrared Spectroscopy, 1 (1993) 187-197.
[12] D.G. Evans, A. Legrand, K. Jewell and C.N.G. Scotter, Use of high order principal components in NIR spectroscopy, J. Near Infrared Spectroscopy, 1 (1993) 209-219.
[13] W.W. Cooley and P.R. Lohnes, Multivariate Data Analysis, Wiley, New York, 1971.