
The use and misuse of chemometrics for treating classification problems

Marianne Defernez*, E. Katherine Kemsley
Norwich, UK

* Corresponding author.

In this article, we examine the increasing use by analytical chemists of chemometric methods for treating classification problems. The methods considered are principal component analysis (PCA), canonical variates analysis (CVA), discriminant analysis (DA), and discriminant partial least squares (PLS). Overfitting, a potential hazard of multivariate modelling, is illustrated using examples of real and simulated data, and the importance of model validation is discussed.

1. 'There are three kinds of lies: lies, damned lies, and statistics'

This well-known quotation is attributed to the 19th-century British politician Disraeli, and its enduring fame is testimony to the entrenched mistrust of matters statistical amongst the general population. Within the scientific community too, certain statistical methods are subject to suspicion, perhaps due to their complexity, or to the over-optimistic conclusions that are often drawn from them. Chemometric techniques can be defined as the subset of statistical approaches that are especially suitable for handling the large amounts of data produced by modern analytical methods, such as nuclear magnetic resonance, infrared, Raman and mass spectroscopies, and gas and liquid chromatographies. Such data are usually characterised by more data values per observation (measurement on a specimen) than there are observations in a complete experiment, and by considerable redundancy or inter-correlation within the data set. In this paper, we consider the use and potential misuse of a selection of chemometric methods that are commonly employed to tackle classification and identification problems in large data sets.

2. A survey of the use of chemometric methods

Principal component analysis (PCA) [1,2], canonical variates analysis (CVA) [3], discriminant analysis (DA) [4], and partial least squares (PLS) [5] have all undergone a rapid increase in popularity amongst chemical scientists over the last decade. A survey of reported uses of these techniques in chemical research journals over the period 1985-1995 (using the BIDS on-line database [6]) shows that they are now established as routine tools for the analytical chemist (Fig. 1). This has clearly been facilitated by the availability of increasingly powerful and affordable computers and of commercial chemometrics software. However, most of the research activity lies in new applications of established chemometric techniques, rather than in the development of fundamentally new ones. In addition, the majority of the work has been carried out by analytical chemists rather than by specialists in statistics. Perhaps as a consequence, many of the reported experiments involve only relatively small numbers of specimens, and should really be considered as feasibility studies or exploratory analyses.




Fig. 1. Reported uses of chemometric methods in the period 1985-1995 (BIDS on-line database, research journals including in their title words containing any or all of 'chem', 'analyt', and 'spectroscop').

For instance, a survey of articles in our own field of interest, the classification of food materials, for the period 1993-1996, shows that 27% of researchers used fewer than 25 specimens, and another 29% between 25 and 50 specimens. Moreover, more than 50% used the techniques without any validation on independent test specimens. Later in this paper, we will highlight the importance of validation in assessing the generalisation ability of the results obtained, and demonstrate a potentially serious pitfall of certain chemometric techniques: overfitting. In an overfit model, the classification ability may superficially appear satisfactory, but is in fact not statistically significant. But first, we will briefly outline some of the methods considered.

3. Chemometric methods for classification problems

A multivariate data set is characterised by measurements on multiple properties or variates. If a data set comprises observations of d variates on n specimens, then the data set is said to be d-variate or d-dimensional, and can be arranged in an [n × d] matrix X. The relative sizes of n and d are important, with different repercussions depending on the multivariate method to be used. If d > n, as is usually the case for spectroscopic or chromatographic data sets, then many multivariate methods cannot be applied directly (because the product matrix XᵀX, required in the calculations, cannot be inverted). It is for data sets of this kind that the data reduction methods of PCA and PLS are most useful. PCA reduces the significant dimensionality of a data matrix X, allowing it to be examined more easily, whilst retaining most of the original information content. PCs are calculable for any relative sizes of n and d. However, mathematically, PCs represent sources of successively maximised variance in the data, and it should therefore be borne in mind that better (more stable) estimates of the PCs are obtained as n increases.
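To make the d > n point concrete, the following is a minimal sketch, in Python with NumPy rather than the MatLab macros mentioned later in this article, of how PC scores can be obtained via the singular value decomposition, which avoids forming or inverting XᵀX altogether. The function name and the simulated data are purely illustrative assumptions.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return PC scores, loadings and explained-variance fractions of X.

    Rows of X are observations (n), columns are variates (d); the SVD
    route works even when d greatly exceeds n, because it never forms
    or inverts the d x d matrix X'X.
    """
    Xc = X - X.mean(axis=0)                            # mean-centre each variate
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # (n, k) PC scores
    loadings = Vt[:n_components].T                     # (d, k) PC loadings
    explained = (s ** 2) / np.sum(s ** 2)              # variance fractions
    return scores, loadings, explained[:n_components]

# Illustrative use: 25 'spectra' of 700 points reduced to 5 PC scores
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 700))
scores, loadings, explained = pca_scores(X, 5)
print(scores.shape, loadings.shape)                    # (25, 5) (700, 5)
```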

[Fig. 2: a single CV score plotted against observation number.]

Fig. 2. Example of a published, but worthless, result.



Fig. 3. (a) Series of infrared spectra of raspberry purees and (b) the CVA scores plot calculated from their PC scores with each group representing a different variety, showing clustering of the different raspberry varieties. (c) Series of noise spectra of the same dimensionality as those of raspberry purees and (d) the CVA scores plot calculated from their PC scores with assignment to the same number of arbitrary groups as there are raspberry varieties in (a), showing similar clustering can be obtained with similarly dimensioned but meaningless data assigned to random groups.

Very often the analyst wishes to go one step further and carry out some form of prediction or assignment, that is, to fit the data to a model. For example, in DA, a classification rule is sought for assigning observations to one of several pre-defined groups. A typical DA procedure might be: calculate mean observations for each group, along with the distance of each observation from each group mean; re-assign each observation to the nearest group mean; and calculate an overall classification success rate. However, if the distance metric involves the inversion of a matrix of the form XᵀX (as do Mahalanobis distances), then data sets with d > n cannot be used directly. To circumvent this problem, it is common practice first to reduce the significant dimensionality of the raw data using PCA, and then to apply DA to a subset of PC scores (we will call this a PCA/DA procedure; a minimal sketch of it is given below).

CVA is also a powerful approach for classification problems. Like PCA, it is a data reduction method, but in contrast it cannot be applied to data sets for which d > n. Instead, CVA is usually applied to a subset of selected PC scores. CVs represent sources of successively maximised between-groups/within-groups variance in the data; this means that CV scores simultaneously minimise the spread of, and maximise the distance between, the groups as pre-defined by the analyst.

PLS regression is a popular technique for calibration-type applications. It can also be used for classification problems, in which case the regression is onto dummy variate(s) representing group membership: this is called discriminant PLS [5]. This method maximises the covariance between successive PLS scores and the dummy variate(s). Like PCA, discriminant PLS can be applied to data sets with any n and d.
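As an illustration of the PCA/DA procedure just described, here is a minimal Python/NumPy sketch of nearest-group-mean assignment using Mahalanobis distances computed on a small number of PC scores; the function and variable names are our own illustrative assumptions, not taken from the article. Note that the success rate it reports is an apparent, training-set figure of exactly the kind that section 5 warns must be validated.

```python
import numpy as np

def pca_da(scores, labels):
    """Nearest-group-mean DA using Mahalanobis distances on PC scores.

    scores : (n, k) array of PC scores, with k small relative to n
    labels : (n,) array of group labels
    Returns the predicted labels and the apparent (training-set)
    classification success rate.
    """
    groups = np.unique(labels)
    means = np.array([scores[labels == g].mean(axis=0) for g in groups])

    # pooled within-group covariance of the PC scores, inverted once;
    # with k much smaller than n this inversion is well conditioned
    resid = np.vstack([scores[labels == g] - m for g, m in zip(groups, means)])
    cov_inv = np.linalg.inv(np.cov(resid, rowvar=False))

    # squared Mahalanobis distance of every observation to every group mean
    diff = scores[:, None, :] - means[None, :, :]           # (n, g, k)
    d2 = np.einsum('ngk,kl,ngl->ng', diff, cov_inv, diff)   # (n, g)

    predicted = groups[np.argmin(d2, axis=1)]
    return predicted, float(np.mean(predicted == labels))
```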



Fig. 4. (a) 30 simulated spectra (each being obtained by the addition of noise to a single spectrum) assigned to three arbitrary groups, and (b) their discriminant PLS scores, showing clustering can be obtained according to these groups.

In PLS and CVA, the pre-defined information on group membership is used to determine the parameters of the data reduction, hence both are modelling techniques. In all multivariate modelling methods there is potential for overfitting, and care needs to be taken to avoid it. In the next section, we illustrate the circumstances in which overfitting can arise, and offer some suggestions for recognising and avoiding such situations.

4. Some examples of overfitting

Over-optimistic chemometric analyses can be found quite readily in refereed journals, as well as in commercially oriented publications. Fig. 2 reproduces a diagram found in a sales brochure. It is a CV score plot (one CV only) which appears to indicate that good discrimination between two classes of specimen can be achieved, but what is the validity of such a result? In this particular case, very little detail was given about the experimental parameters, but it was disclosed that the study was based on only three specimens, of which replicate observations were made. However, replicates are by no means independent observations, and three is simply too low a number of specimens to enable statistical analysis; we must therefore conclude that a plot such as this is essentially worthless.

Fig. 3b is another plot of CV scores, calculated using a subset of PCs obtained from the series of infrared spectra of raspberry purees shown in Fig. 3a. There appears to be substantial clustering of the data according to raspberry variety. However, this CV plot is also worthless: in this case, there is simply not enough information given for the reader to judge whether or not the model has been overfit. The crucial piece of missing information is the dimensionality (that is, the number of PC scores) used to compute the CVs. In fact, this was 12: far too many for a data set with n = 25. As a rule of thumb, the onset of overfitting should be strongly suspected when the dimensionality exceeds (n - g)/3, where g is the number of groups. So, for this data set, we should be suspicious of analyses using 7 or more PCs. It is easy to demonstrate overfitting by passing similarly dimensioned sets of random numbers (Fig. 3c) through the same procedure. Such a result is presented in Fig. 3d: qualitatively it is highly similar to Fig. 3b, showing clear clustering, even though the group assignments as well as the data were in this case meaningless.

We will now consider an analysis using discriminant PLS. For data sets with inherent class structure, discriminant PLS often performs very well, and clear grouping is seen in the first few PLS scores. However, similar clustering can be obtained from data sets in which no class structure is present at all. This effect is illustrated with simulated data in Fig. 4: the 'spectra' comprise a constant 'signal' (a single spectrum) plus random 'noise', and are assigned arbitrarily to three groups. A sketch of this kind of demonstration is given below.
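The following is a minimal sketch of such a demonstration, assuming Python with NumPy and the scikit-learn implementation of PLS regression; the original work used MatLab macros, so the library, the signal shape and all parameter values here are illustrative assumptions. Structureless 'spectra' are given arbitrary group labels, dummy variates are built, and the first two discriminant PLS scores are examined for apparent clustering.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression  # assumed library choice

rng = np.random.default_rng(1)

# 30 'spectra' = a common signal plus noise: no genuine class structure
n, d = 30, 128
signal = np.sin(np.linspace(0.0, 6.0, d))
X = signal + 0.1 * rng.normal(size=(n, d))

# arbitrary assignment to three groups, encoded as dummy (0/1) variates
labels = np.repeat([0, 1, 2], n // 3)
Y = np.eye(3)[labels]

# discriminant PLS: regress the dummy group variates on the spectra
pls = PLSRegression(n_components=2, scale=False)
pls.fit(X, Y)
scores = pls.transform(X)            # the first two PLS scores

# with many variates relative to observations, the group means of the
# scores tend to separate, mimicking 'real' clusters: a symptom of
# overfitting rather than of genuine class structure
for g in range(3):
    print(g, scores[labels == g].mean(axis=0).round(2))
```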


[Fig. 5: classification results (%) plotted against the number of PCs used for the DA.]

Fig. 5. Results of different validation procedures for a set of mid-infrared wine samples, classified by PCA/ DA into ‘Cabernet’ and ‘Shiraz’: using training/test set procedure (+ = training set, and * = test set; note that the procedure was repeated 5000 times with random partitioning into training and test sets, and the average results are plotted here); and using internal cross-validation (0).

Despite this, the clustering obtained by PLS strongly resembles that which might be obtained with real data belonging to three genuine groups. Although these spectra are simulated, data sets that behave similarly can and do occur in real experiments. For data sets intermediate between the two extremes, in which some structure is present but does not correspond to the division of interest, PLS gives suitably unclustered scores.

5. Methods for recognising and avoiding overfitting

As we have seen, one way of ascertaining whether the overfitting regime has been entered is to apply an equivalent analysis to a set of simulated data with the same dimensionality, but with no class structure and arbitrary group assignments. If apparent grouping is obtained, then there are grounds for suspecting that comparable models obtained from real data are also overfit. Another, preferred, way to guard against overfitting is to validate the model by applying it to a sufficient number of completely independent observations. In doing this, we are seeking an explicit answer to the question: can the model generalise and correctly classify observations other than those with which it was defined?


There are a number of approaches to validation. One method is to assign each observation at random to either a training set or a test set. Only the training set is used to obtain the model, which is then applied in a second step to the test-set observations. Typically, the training and test sets may contain 2/3 and 1/3 of the available observations respectively. Note that the results obtained from this procedure can vary somewhat depending on the partitioning of the two sets; it is therefore advisable to repeat it a number of times with several random divisions of the data, and to calculate an average result. An alternative validation procedure is leave-one-out or internal cross-validation. This consists of omitting one observation at a time from the data set and using the remaining data to obtain a model, which is then applied to the omitted, test observation. This is repeated n times, excluding each observation in turn. Only the results for the excluded observations are then assessed. Both procedures are sketched below.

Fig. 5 compares the two validation approaches in a PCA/DA applied to infrared spectra of two types of single-variety wine (Cabernet and Shiraz). It is interesting to note that if only the classification results for the training set were known, the model would appear much more powerful than it actually is: it does not perform nearly as well when applied to the test set, showing a lack of generalisation ability and evidence of some overfitting. In contrast, the internally cross-validated results are highly similar to those obtained on average for the test set in the training/test set procedure. We have compared these two validation approaches using a number of data sets and several analysis methods (including DA combined with PCA, and discriminant PLS), and found that internal cross-validation consistently gives a representative indication of the ability of a model to generalise to new, independent observations. (This work was carried out using the MatLab (The MathWorks, Inc., Natick, MA, USA) matrix programming language, and macros for performing these analyses are available from the authors.)
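As a rough illustration of these two validation procedures, and not the authors' MatLab macros, the Python/NumPy sketch below wraps any classifier of the form classify(train_X, train_y, test_X) in a repeated training/test-set evaluation and in leave-one-out cross-validation. The stand-in classifier, the 2/3-1/3 split and the repeat count are illustrative assumptions; for Fig. 5 the model was the PCA/DA procedure and 5000 random partitions were used.

```python
import numpy as np

def nearest_mean_classifier(train_X, train_y, test_X):
    """Stand-in classifier: assign each test row to the nearest group mean."""
    groups = np.unique(train_y)
    means = np.array([train_X[train_y == g].mean(axis=0) for g in groups])
    d2 = ((test_X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return groups[np.argmin(d2, axis=1)]

def train_test_success(X, y, classify, frac_test=1/3, n_repeats=100, seed=0):
    """Average test-set success rate over repeated random partitions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(frac_test * n))
    rates = []
    for _ in range(n_repeats):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        pred = classify(X[train], y[train], X[test])
        rates.append(np.mean(pred == y[test]))
    return float(np.mean(rates))

def loo_success(X, y, classify):
    """Leave-one-out (internal cross-validation) success rate."""
    n = len(y)
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i
        pred = classify(X[keep], y[keep], X[~keep])
        correct += int(pred[0] == y[i])
    return correct / n
```

Comparing the rate returned by loo_success (or the averaged test-set rate) with the apparent training-set rate gives the kind of contrast shown in Fig. 5: a large gap is evidence of overfitting.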

6. Conclusions

It is clear that chemometric methods have made an impact on analytical chemistry, and that their use is on the increase. However, they are not always employed to their full potential, partly because of their complexity and the relative ease with which they may be misused.


In this paper, we have examined overfitting, and have seen that it can occur quite readily. Although overfitting is not the only way in which chemometric methods can be misused, the following suggestions should help to avoid this particular hazard of multivariate modelling:

- It is extremely important, in order to avoid overfitting a model, always to use some form of validation. Apply the model to a set of completely independent observations, or use an internal cross-validation approach. Never rely solely on the results from the observations used to obtain the model.
- Internal cross-validation gives a reliable indication of the generalisation ability of classification models.
- Large differences between the results obtained for the training and test sets indicate that the model is overfit and that its classification ability does not generalise well to new samples.
- The higher the dimensionality of the model (for example, the number of PCA or PLS scores used in the analysis), the greater the likelihood of overfitting.


Acknowledgements

The authors thank the UK Ministry of Agriculture, Fisheries and Food (MAFF) for funding this work.

References

[1] W.J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective, Oxford University Press, New York, 1988, p. 53.
[2] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.
[3] W.J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective, Oxford University Press, New York, 1988, p. 291.
[4] H.L. Mark, D. Tunnell, Anal. Chem. 57 (1985) 1449.
[5] E.K. Kemsley, Chemom. Intell. Lab. Syst. 33 (1996) 47.
[6] BIDS ISI Data Service, University of Bath, UK.

Marianne Defernez and E. Katherine Kemsley are at the Institute of Food Research, Norwich Research Park, Colney, Norwich, NR4 7UA, UK. E-mail: Marianne.Defernez@BBSRC.AC.UK
