Detection of inhomogeneities in sets of NIR spectra

Analytica Chimica Acta 330 (1996) 1-17

V. Centner (a), D.L. Massart (a,*), O.E. de Noord (b)

(a) ChemoAC, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussel, Belgium
(b) Koninklijke/Shell-Laboratorium Amsterdam (Shell Research B.V.), P.O. Box 38000, 1030 BN Amsterdam, Netherlands

Received 9 October 1995; accepted 22 February 1996

Abstract

Methods for the detection of outliers and clusters in data sets for multivariate calibration are discussed in this study. Replicate as well as sample outliers have been investigated. The Cochran test applied to the sum of absorbances is shown to identify outlying replicates. Sample outliers were detected on latent variables (PCs) or by using Rao's statistic combined with Grubbs' and Dixon's tests. The conclusions were then verified by the modelling. The tendency to cluster was evaluated by the Hopkins statistic.

Keywords: Multivariate calibration; Chemometrics; Near-infrared spectrometry; Inhomogeneity

1. Introduction

Before applying any modelling method, one should ascertain as far as possible the quality of the input data, among other things their homogeneity. This can be affected by outliers and by the presence of subgroups or clustering. In fact, two types of outliers can be distinguished. When replicate measurements are obtained, certain replicates can be outliers, and a sample as such can also be outlying compared to the population of samples, for instance because it has different chemical characteristics. Outlying replicates are usually removed. The decision about outlying samples is less evident. However, since outliers can have a profound influence on the calibration model, it is useful to detect them and, when there is a good reason for removing them, to do so.

* Corresponding author. Fax: +32 2 477 47 35.
0003-2670/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved
PII S0003-2670(96)00150-X

Subgroups or clusters can also occur. When this is the case, one should decide whether to model all objects together or rather to develop models for each subgroup separately. The object of the present work is to propose a methodology to detect the existence of outliers and/or subgroups before any modelling is carried out. The methodology is illustrated on a data set of NIR spectra of polyether polyols. The object of the calibration was to determine the hydroxyl number of the samples.

2. Theory

2.1. Detection of outlying replicates

The Cochran test [1] was recommended by ISO [2] to detect outlying replicate measurements in interlaboratory precision experiments [3].


It is based on comparing the variance of the replicate measurements for the sample that shows the highest variance of all samples with the sum of the variances of all samples,

$C = s_{\max}^2 / \sum_{i=1}^{n} s_i^2$   (1)

where n is the number of samples, s_max is the standard deviation of the sample with the highest variance between replicates and s_i is the standard deviation for sample i. C is compared to a table of critical values at the appropriate level of confidence. When the number of replicates is not constant, but only small differences occur, one uses the most frequent number of replicates.
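As an illustration only, a minimal Python sketch of the Cochran statistic of Eq. (1) might look as follows (the original procedures were programmed in Matlab). The replicate matrix and the critical value used here are assumptions made for the example, not values from this study.

```python
import numpy as np

def cochran_statistic(replicates):
    """Cochran's C: largest replicate variance divided by the sum of all
    replicate variances (Eq. (1)); replicates is an (n_samples, n_replicates) array."""
    variances = np.var(replicates, axis=1, ddof=1)   # s_i^2 per sample
    return variances.max() / variances.sum(), int(np.argmax(variances))

# hypothetical example: 10 samples measured in triplicate
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3)) * 0.01 + 1.0
data[4] += rng.normal(scale=0.1, size=3)             # sample 4 has an inflated replicate spread

C, suspect = cochran_statistic(data)
C_crit = 0.445                                       # assumed tabulated value for n=10, r=3, alpha=0.05
print(f"C = {C:.3f}, suspect sample index = {suspect}, outlying = {C > C_crit}")
```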

2.2. Outlying samples

In the past, Dixon's test [4] was applied most often when the data are univariate. One computes

$Q = |y_{\mathrm{test}} - y_{\mathrm{nearest}}| / (y_{\max} - y_{\min})$   (2)

The ratio Q compares the difference between the extreme, and therefore suspect, measurement (y_test) and the measurement nearest to it in size (y_nearest) with the difference between the highest (y_max) and lowest (y_min) measurements. If Q is higher than the critical value at the desired probability level, the measurement is rejected. Different variants of this equation are applied according to the number of data; in Eq. (2) one then changes y_nearest to y_second nearest or y_third nearest.
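A corresponding sketch of Dixon's Q of Eq. (2), again with illustrative data and an assumed critical value:

```python
import numpy as np

def dixon_q(values):
    """Dixon's Q for the most extreme value in a small univariate sample (Eq. (2))."""
    y = np.sort(np.asarray(values, dtype=float))
    spread = y[-1] - y[0]
    # gap between the suspect extreme and its nearest neighbour, at either end
    q_low, q_high = (y[1] - y[0]) / spread, (y[-1] - y[-2]) / spread
    return max(q_low, q_high)

scores = [5.2, 5.3, 5.1, 5.4, 5.2, 7.9]   # hypothetical univariate values with one suspect point
Q = dixon_q(scores)
print(f"Q = {Q:.2f}, outlier at 95% confidence: {Q > 0.56}")  # 0.56: assumed critical value for n = 6
```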

More recently, AOAC and ISO recommend the test described by Grubbs and Beck [5]. The single outlier Grubbs' test [6] is based on comparing the largest absolute value of the normalized deviation z with the tabulated T value. If |z| exceeds T, the measurement is considered to be an outlier. We will call this procedure G-ISO.

$z = (y_{\mathrm{test}} - \bar{y}) / s$   (3)

$s = \left( \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1) \right)^{1/2}$   (4)

where y_test is the value of the tested measurement and

$\bar{y} = (1/n) \sum_{i=1}^{n} y_i$

is the average value of all samples y_i.

Grubbs [6] also published another criterion for testing outlying observations. One compares the ratio

$G = SS_n / SS$   (5)

with a table of critical values F_crit, a modified F-test for a sample of size n, where

$SS = \sum_{i=1}^{n} (y_i - \bar{y})^2$   (6)

is the sum of squares with the suspected outlier included and SS_n is the sum of squares without the suspected outlier (size n - 1). We will call this test G-SS.

The paired outlier Grubbs' test was proposed [6] to detect two outliers which may mask each other. It involves the possibility to check a vicinal pair, defined as the two lowest or the two highest measurements in the set, and a polar pair [7], defined as the lowest and the highest measurements. The test for vicinal pairs is based on the ratio of the sums of squared deviations excluding and including the pair; the polar pair is detected according to the ratio of the range to the standard deviation. The single Grubbs outlier test should not be affected by masking for large populations (more than 25 objects), so the paired test was not performed in our study.

Both the Dixon and the Grubbs tests are univariate tests, which would not be useful here as such. Therefore the detection of outliers was performed on latent variables (PCs) in place of the raw original measurements, i.e. on the spectral matrix X with elements x_ij, for objects i = 1, ..., n and variables j = 1, ..., p. Principal components were obtained by singular value decomposition of the centered X matrix,

$X = S V Z'$   (7)

where S is the matrix containing the normalized scores of the objects and V is a matrix describing the variation explained by the successive latent variables. The product of these matrices (S·V) is a matrix containing the principal components as columns, also called the matrix of (un-normalized) scores (u). The matrix Z' describes the weights of the manifest variables, called loadings.
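The following sketch shows one possible way to combine Eq. (7) with the G-ISO screening of Eqs. (3) and (4): PC scores are obtained from the SVD of the column-centered matrix and the largest normalized deviation is computed per PC. The function names, the simulated spectra and the number of components are assumptions made for the example; in practice |z|max would be compared with the tabulated T value.

```python
import numpy as np

def pc_scores(X, n_components=4):
    """Un-normalized PC scores S*V from the SVD of the centered matrix (Eq. (7))."""
    Xc = X - X.mean(axis=0)                     # column centering
    S, v, Zt = np.linalg.svd(Xc, full_matrices=False)
    return S[:, :n_components] * v[:n_components]

def grubbs_z(scores_1d):
    """Largest normalized deviation |z| on one PC (Eqs. (3) and (4)) and the object index."""
    y = np.asarray(scores_1d, dtype=float)
    z = np.abs(y - y.mean()) / y.std(ddof=1)
    return z.max(), int(np.argmax(z))

# hypothetical spectra: 30 objects x 200 wavelengths, one deviating object
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200)) * 0.01 + np.linspace(0, 1, 200)
X[12] += 0.05                                   # object 12 is shifted

U = pc_scores(X, n_components=4)
for j in range(U.shape[1]):
    z_max, obj = grubbs_z(U[:, j])
    print(f"PC{j + 1}: |z|max = {z_max:.2f} for object {obj}")
```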


A more truly multivariate approach to outlier detection was discussed by Mertens et al. [8]. They describe, among others, Rao's statistic. The idea of Rao's statistic is to accumulate, in one value per object, all the variation (p PCs) except that of the k meaningful PCs which explain the variation in the modelled variable y:

$D_{(k),i} = \sum_{j=k+1}^{p} u_{ij}^2$   (8)

One sums the squared scores u from the (k+1)th PC to the pth PC for each object i. This is the variation that is not representative for the group, plus noise. The results were presented [8] as a histogram of D_(k); an example is given in Fig. 1. We applied Rao's statistic as input for outlier detection by the modified G-ISO test. The reason for the modification of the G-ISO test is that the Rao statistic is skewed, since the majority of the population has small positive values distributed close to zero. Therefore, instead of comparing y_i to the mean as in Eq. (3), one compares it to 0,

$z = y_{\mathrm{test}} / s$   (9)

Fig. 1. Histogram of D_(1)-values for the A data set, showing that object 63 is an outlier.

It should be noted that objects detected as outliers might be outlying only because they show a very high or low y-value. These high leverage objects might, however, fit the calibration model well and are then not outliers in that sense. To detect this possibility, plots of the D-value (or of the PC scores) against the y-value should be studied.
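A sketch of Rao's statistic (Eq. (8)) and of the modified G-ISO comparison, assuming, as the text states, that the mean is simply replaced by 0 in Eq. (3). The score matrix and the aberrant object are synthetic.

```python
import numpy as np

def rao_statistic(U, k):
    """D_(k),i: sum of squared scores on PCs k+1, ..., p for every object (Eq. (8))."""
    return (U[:, k:] ** 2).sum(axis=1)

def modified_grubbs(d):
    """Modified G-ISO statistic for the skewed Rao values: deviation from 0, not from the mean."""
    d = np.asarray(d, dtype=float)
    z = d / d.std(ddof=1)          # compare with 0 instead of the mean (Eq. (9))
    return z.max(), int(np.argmax(z))

# U would be the matrix of un-normalized PC scores from the previous sketch
rng = np.random.default_rng(2)
U = rng.normal(size=(30, 10)) * np.array([5, 3, 2, 1, .3, .3, .2, .2, .1, .1])
U[7, 4] = 2.5                       # object 7 is aberrant on PC5 only

D = rao_statistic(U, k=4)           # accumulate PC5..PC10
z_max, obj = modified_grubbs(D)
print(f"max modified z = {z_max:.2f} for object {obj}")
```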

2.3. Preprocessing

The operation preceding the extraction of latent variables is preprocessing [9]. The effect of centering, log-column centering and log-double centering on outlier detection was studied. The data were log-transformed to reduce the effect of the large values and to enhance that of the smaller ones.

2.3.1. Column centering

Column centering is the subtraction of the corresponding column mean m_j from each element of the data matrix X,

$z_{ij} = x_{ij} - m_j$ for i = 1, ..., n and j = 1, ..., p,   (10)

where

$m_j = (1/n) \sum_{i=1}^{n} x_{ij}$   (11)

2.3.2. Log-column centering

Log-column centering is the logarithmic transformation of each element x_ij, followed by column centering, for i = 1, ..., n and j = 1, ..., p:

$y_{ij} = \log(x_{ij})$   (12)

$z_{ij} = y_{ij} - m_j$   (13)

where

$m_j = (1/n) \sum_{i=1}^{n} y_{ij}$   (14)

2.3.3. Log-double centering

Log-double centering is the logarithmic transformation of each element x_ij, followed by both column and row centering:

$z_{ij} = y_{ij} - m_i - m_j + m$ for i = 1, ..., n and j = 1, ..., p,   (15)

where m_j is defined by Eq. (14),

$m = (1/(n\,p)) \sum_{i=1}^{n} \sum_{j=1}^{p} y_{ij}$   (16)

and

$m_i = (1/p) \sum_{j=1}^{p} y_{ij}$   (17)
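The three preprocessing variants of Eqs. (10)-(17) can be sketched as follows (hypothetical positive data; the log transforms require strictly positive absorbances):

```python
import numpy as np

def column_center(X):
    """Eqs. (10)-(11): subtract the column means m_j."""
    return X - X.mean(axis=0)

def log_column_center(X):
    """Eqs. (12)-(14): log-transform, then subtract the column means of the logs."""
    Y = np.log(X)
    return Y - Y.mean(axis=0)

def log_double_center(X):
    """Eqs. (15)-(17): log-transform, then remove row and column means and add the grand mean."""
    Y = np.log(X)
    return Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0) + Y.mean()

# hypothetical positive absorbance matrix
rng = np.random.default_rng(3)
X = rng.uniform(0.1, 1.0, size=(5, 8))
for f in (column_center, log_column_center, log_double_center):
    Z = f(X)
    print(f.__name__, np.allclose(Z.mean(axis=0), 0.0))   # column means are zero after each variant
```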

2.4. Modelling

To verify the influence of the objects flagged as possible outliers, PLS [10] models were built. The model with optimal complexity was considered to be the model with the first local minimum of the leave-one-out cross-validated (CV) PRESS,

$\mathrm{PRESS} = \sum_{i=1}^{n} (y_{i,\mathrm{pred}} - y_{i,\mathrm{real}})^2$   (18)

The predictive ability of the models with and without a suspected outlier is finally presented as the root mean squared error of prediction (RMSEP), based on leave-one-out cross-validation [11],

$\mathrm{RMSEP} = \left( \sum_{i=1}^{n} (y_{i,\mathrm{pred}} - y_{i,\mathrm{real}})^2 / n \right)^{1/2}$   (19)

When the outlier was included in the model, two values are shown: the RMSEP with prediction of the outlier and the RMSEP without it.
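A sketch of the leave-one-out cross-validated PRESS and RMSEP of Eqs. (18) and (19), using scikit-learn's PLS regression as an assumed stand-in for the original Matlab PLS routine; the calibration data are synthetic.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

def loo_press_rmsep(X, y, n_components):
    """Leave-one-out cross-validated PRESS (Eq. (18)) and RMSEP (Eq. (19))."""
    residuals = []
    for train, test in LeaveOneOut().split(X):
        model = PLSRegression(n_components=n_components)
        model.fit(X[train], y[train])
        residuals.append(float(model.predict(X[test])[0, 0]) - y[test][0])
    press = float(np.sum(np.square(residuals)))
    return press, np.sqrt(press / len(y))

# hypothetical calibration set: 25 spectra, 50 wavelengths, y linear in a few channels
rng = np.random.default_rng(4)
X = rng.normal(size=(25, 50))
y = X[:, :3] @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.05, size=25)

for a in range(1, 6):
    press, rmsep = loo_press_rmsep(X, y, a)
    print(f"{a} components: PRESS = {press:.3f}, RMSEP = {rmsep:.3f}")
# the first local minimum of PRESS indicates the optimal complexity
```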

2.5. Clustering tendency

2.5.1. Hopkins statistic

The index for clustering tendency, called the Hopkins statistic, examines whether the objects in a data set differ significantly from the assumption that they are uniformly distributed in multidimensional space (null hypothesis, H0). All n PCs are used, since the number of variables p is higher than the number of objects n, and the computation in the PC space is simpler than that in the original space. The Hopkins statistic [12,13] is based on the comparison of the Euclidean distance from a real object to its nearest neighbour (W) with the distance from an artificial object, spread uniformly over the data space, to the nearest real object (U). The choice of real and artificial objects is repeated several times (e) for some part x (%) of the total population.

The objects are randomized as follows:
- the number of objects N (N real objects = N artificial objects) corresponding to the part x of the population is computed first; it is obtained as (n * x)/100, rounded towards the nearest integer, where n is the total number of objects;
- for the N objects a uniform vector of RANDOM numbers is generated in the interval (0,1); the real objects are then taken as the objects with serial number (RANDOM * n + 0.5), rounded towards the nearest integer;
- for each artificial object its scores on the individual PC_j are generated uniformly over the range of the real PCs, range_j = max(PC_j) - min(PC_j).

The computed distances U and W are summed over all N objects and used for the determination of the Hopkins statistic,

$H = \sum_{e=1}^{N} U_e \Big/ \left( \sum_{e=1}^{N} U_e + \sum_{e=1}^{N} W_e \right)$   (20)

This procedure is iterated several times. The clustering tendency and the stability of the Hopkins statistic are described by the average H and by the range = H_max - H_min, respectively. The Hopkins statistic can range from 0.5 for a homogeneous set of objects (the distances from a real and an artificial object to their nearest neighbour are equal) to 1 for the clustered extreme (U for an artificial object is much larger than W for a real object), see Fig. 2. If H is greater than 0.75, there is more than 90% confidence in rejecting the null hypothesis.

Fig. 2. Hopkins statistic. Real (.) and artificial (*) objects displayed in the PC1-PC2 space. W: Euclidean distance from a random real object to its nearest neighbour; U: Euclidean distance from a random artificial object (spread uniformly over the data space) to the nearest real point.
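A sketch of the Hopkins statistic of Eq. (20) computed on a matrix of PC scores. The random selection of real objects and the fraction x are implemented in a simplified way and the data are synthetic; it only illustrates that H stays near 0.5 for a homogeneous cloud and grows for clustered data.

```python
import numpy as np

def hopkins(scores, x_percent=20, rng=None):
    """Hopkins statistic (Eq. (20)) on a matrix of PC scores (objects x PCs)."""
    rng = np.random.default_rng(rng)
    n = scores.shape[0]
    N = max(1, round(n * x_percent / 100))
    real_idx = rng.integers(0, n, size=N)                      # randomly chosen real objects
    artificial = rng.uniform(scores.min(axis=0), scores.max(axis=0),
                             size=(N, scores.shape[1]))        # uniform over the range of the real PCs
    W = np.empty(N)
    U = np.empty(N)
    for e in range(N):
        d_real = np.linalg.norm(scores - scores[real_idx[e]], axis=1)
        d_real[real_idx[e]] = np.inf                           # exclude the object itself
        W[e] = d_real.min()                                    # distance to nearest real neighbour
        U[e] = np.linalg.norm(scores - artificial[e], axis=1).min()
    return U.sum() / (U.sum() + W.sum())

# hypothetical scores: one homogeneous cloud versus two separated clusters
rng = np.random.default_rng(5)
homogeneous = rng.normal(size=(60, 4))
clustered = np.vstack([rng.normal(size=(30, 4)), rng.normal(size=(30, 4)) + 6])
for name, S in [("homogeneous", homogeneous), ("clustered", clustered)]:
    H = np.mean([hopkins(S, 20, rng=i) for i in range(10)])   # average over repeated draws
    print(f"{name}: average H = {H:.2f}")
```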

3. Experimental

The data set consists of polyether polyol near-infrared spectra (X matrix) and their hydroxyl numbers (y). The spectra were measured between 1100 and 2158 nm (every 2 nm) and recorded on a NIRSystems Inc. model 6250 instrument (Silver Spring, MD, USA).

3.1. Summary of the data

Samples: 87 polyether polyols, 2 or 3 spectra of each sample; total dimension 248 x 530.

3.2. Data preparation

Offset correction: the absorbance at 1100 nm was subtracted from each variable. Boundary elimination: 15 boundary wavelengths from both sides, i.e. 1102-1130 and 2130-2158 nm, were removed from X, so that the final range of wavelengths is 1132-2128 nm (499 variables).

4. Computer programs

All procedures were programmed in Matlab for Windows, version 4.0 (The MathWorks, Inc.).
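A sketch of the data preparation of Section 3.2 and the summed absorbance of Eq. (21), in Python rather than the Matlab actually used; the spectra are random placeholders with the stated dimensions.

```python
import numpy as np

wavelengths = np.arange(1100, 2160, 2)               # 530 channels, 1100-2158 nm every 2 nm
rng = np.random.default_rng(6)
spectra = rng.uniform(0.0, 1.0, size=(248, wavelengths.size))   # hypothetical raw spectra

# offset correction: subtract the absorbance at 1100 nm from every variable
offset_corrected = spectra - spectra[:, [0]]

# boundary elimination: drop 15 wavelengths on each side -> 1132-2128 nm (499 variables)
keep = (wavelengths >= 1132) & (wavelengths <= 2128)
X = offset_corrected[:, keep]
print(X.shape)                                       # (248, 499)

# summed absorbance per spectrum, Eq. (21)
A_sum = X.sum(axis=1)
print(A_sum.shape)                                   # (248,)
```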

5. Results and discussion

It was known from the chemistry of the samples that two subsets could be distinguished. These subsets are related to the chemical structure of the products, which results in the presence or absence of a peak at 1690 nm. For this reason it was decided to develop separate models for each subset (27 and 60 objects). We will call these sets A and B.

5.1. Detection of replicate outliers with the Cochran test

The Cochran test was applied to the following characteristics of the data: the summed absorbance

$A_{\mathrm{sum}} = \sum_{\lambda=1132}^{2128} A_{\lambda}$   (21)

the scores on PC1 and PC2, and the absorbance A(PC1) or A(PC2) at the wavelengths with the highest loadings on PC1 and PC2. The results are presented in Table 1 and the critical values in Table 2.

Table 1
Replicate outliers detected by the Cochran test

Data set (n) | A_sum: object (ratio) | PC1: object (ratio) | PC2: object (ratio) | A(PC1): object (ratio), λ (nm) | A(PC2): object (ratio), λ (nm)
A (27) | 77 (2.24) | 77 (1.85) | 77 (1.30) | 77 (1.67), 2080 | -, 1690
B (60) | 70, 54 (1.27, 1.06) | - | 70 (1.34) | -, 2080 | 70, 41 (1.56, 1.55), 1920

Table 2
Critical values of the Cochran test [2], number of replicates r = 3, critical level α = 0.05. The critical values for n ≥ 40 were computed by linear approximation (ISO [2])

Number of objects, n | Critical value C_crit
26 | 0.221
27 | 0.215
58 | 0.131
59 | 0.130
60 | 0.129


A sequential procedure is applied. When a sample with an outlying replicate is detected, the outlying replicate is removed and the other replicates are put back in the data set (when three replicates were measured), or the sample is removed (when two replicates were measured); the procedure is then repeated. In the A data set one of the three replicates of object 77 is an outlier. The replicate 77c was detected for A_sum, PC1, PC2 and A(PC1). On verification, this replicate was indeed considered to be a deviating measurement, so it was removed from the data set. In data set B an outlier is detected for object 70 on the sum, and also on one of the PCs (one does not expect to find it on all PCs). For the rest the situation was less clear. The decision was made on A_sum, which detects successively, as samples with too high a variance, samples 70 and 54. The information from the other characteristics is less conclusive, and it is proposed for future work to detect outlying replicates on A_sum only. As a conclusion, replicates 77c, 70b and 54b were removed. Fig. 3 presents a comparison of the spectra of the three replicates for these objects. For reasons of industrial confidentiality the original spectra cannot be shown; the figure shows the difference between replicates b and c and replicate a. There are clearly large differences for 77c, 70b and 54b; such large differences are not seen for the other samples.

5.2. Detection of outlying objects

The detection of outliers was performed with four methods:
- on PC scores: the single outlier Grubbs' test as recommended by ISO [2] (G-ISO);
- on PC scores: the single outlier Grubbs' test using the sum of squares [6] (G-SS);
- on PC scores: Dixon's test [4] (Dixon);
- on Rao's statistic: the modified (see Theory) single outlier Grubbs' test.

When an outlier was detected, it was not removed from the data set unless a reason was found to explain its outlying characteristics. Other detected outliers are flagged but left in the data set. When, during the modelling, they are also detected as outliers, one can reconsider eliminating them. The tests were applied sequentially to averaged data (i.e. the average of the three replicates) after elimination of the replicate outliers 77c, 70b and 54b.


The influence of three different preprocessing methods, namely centering, log-column centering and log-double centering, was investigated. The procedure was carried out on the first four PCs, which together explain more than 98% of the total variance. In general one should verify all significant PCs [14]. The results are summarized in Tables 3-7 for the A data set and in Tables 8-13 for the B data set. The following conclusions can be made.

5.2.1. Data set A

- Grubbs' and Dixon tests

The PC plots for the centered data and the log-column centered data are given in Fig. 4 (see Table 3). Each preprocessing method proposes object 63 as an outlier on PC2. Object 63 is an extreme and the manufacturer confirmed that it is a chemical outlier (deviating composition). Therefore, it was removed from further consideration. Other suspicious objects were found on PC3 and PC4, namely objects 77 and 53. Both samples are detected in the presence of sample 63 as well as after eliminating it. Additionally, when object 63 is excluded, object 83 is also found. One observes that 83 is among the highest concentrations and 53 among the lowest ones, but that none of these concentrations is extreme. High leverage is therefore not the only reason for the outlying behaviour (see Table 4). Objects 77, 83 and 53 are flagged but not eliminated, since the reasons to remove them are not strong enough at this stage; they will be screened carefully during the modelling. For an interpretation of the outlier detection one should use the PC plots (Fig. 4) and the graphs of the y-value against a PC_j (Fig. 5). In that way one can, for instance, observe that the outlying behaviour of 63 and 53 is probably not concentration related.

- Modified G-ISO test on Rao's statistic

Object 63 was also detected after each type of preprocessing on D_(1) (see Table 5 and Fig. 1). One can again support the results by an interpretation of plots of the D-value against the y-value (Fig. 6). Object 53 is a probable outlier, since it is found to have a large D_(3)-value which cannot be explained by an extreme y-value (see Fig. 6c). Object 23 was also detected as outlying on D_(3) (Table 6), but not on any of the PCs 1 to 4.


Fig. 3. The difference between replicates b and c and replicate a for the objects 77, 70 and 54 (absorbance difference against wavelength). Aberrant spectra: 77c, 70b, 54b.

This shows that Rao's statistic allows the identification of objects that are outliers on higher PCs (PC5 and PC9 in this case) rather than on the first few. After log preprocessing Rao's statistic also detects objects 77 and 83, owing to their extreme scores on PC3. During the modelling stage one can verify whether the flagged objects are indeed outliers. This is done in Fig. 7, where the absolute deviation between y_pred and y is compared, for each of the flagged objects, for models of successively 1, 2, ..., 7 PLS components. The model is considered to be optimal for six components, since it then shows the lowest PRESS value. In the figure these absolute deviations are compared to the median and the third quartile of all such deviations.

Objects 23, 53 and 77 are well predicted with the full model, but show extreme behaviour for some of the first components: object 77 on the first four, object 23 on components 1 and 2, and object 53 on component 4. The figure shows that the y-value of object 83 is predicted worse than that of all other objects. Table 7 compares the RMSEP for the optimal six-component PLS model including all flagged objects (1.86) with the RMSEP for all objects except 83 (or 23, 53, 77). The RMSEP computed without 83 is always better than the original one, and for the optimal six-component model it is 1.77, i.e. lower than 1.86. The elimination of object 83 from the model, however, does not improve the RMSEP: it then increases from 1.86 to 2.01.


The elimination of the objects flagged during the data inspection is therefore not recommended for this data set.

Fig. 4. Data set A: (a) plot of the first two PCs and (b) plot of PC3 and PC4 after centering; (c) plot of PC1 and PC2 and (d) plot of PC3 and PC4 after log-column centering.

5.2.2. Data set B

- Grubbs' and Dixon tests

The results obtained after centering (Fig. 8a, b) and after log-column centering (Fig. 8c, d) or log-double centering will be discussed separately (see also Table 8).

(a) Centering: objects 67, 37, 56 and 72 were identified as outlying on PC2 and objects 72, 70 and 11 on PC4. The high leverage objects 72, 70 and 11 are not detected only because of their high y-values, since y is correlated with neither the scores on PC2 nor those on PC4.


(b) Log-column and log-double centering: the scores of objects 72 and 70 are very different from the scores of the other objects on PC1 and PC3. This is due to a difference over the spectral region 1280-1340 nm, which is emphasized after the log transformation. The influence of these deviating scores is so strong that the scores of the other objects look almost constant on these PCs (Fig. 8c, d). Object 67, detected above for centering, is now strongly outlying on PC4. The smaller deviations detected for objects 48, 71 and 81 on PC1 are less apparent. On verification, object 67 was found to be a chemical outlier (a very different amount of water). This was considered to be sufficient reason to remove it.


Table 3
The detection of outlying objects by the Grubbs' and Dixon's tests (G-ISO, G-SS, Dixon) on latent variables, data set A. For each preprocessing method (centering, log-column centering, log-double centering), no objects are flagged on PC1, object 63 is flagged on PC2, object 77 on PC3 and object 53 on PC4 (not by every individual test).

Table 4
The detection of outlying objects by the Grubbs' and Dixon's tests on latent variables, data set A after elimination of sample 63. No objects are flagged on PC1 or PC2; objects 77 and 83 are flagged on PC3 and object 53 on PC4 (not by every individual test).

Fig. 5. Data set A after centering: plots of PC1, PC2, PC3 and PC4 against the y-value.

Table 5
The detection of outlying objects by the modified G-ISO test on Rao's statistic, data set A. Object 63 is detected on D_(1) after each type of preprocessing (together with object 83 after log-column and log-double centering); objects 53 and 23 are detected on D_(3) after centering, and objects 77 and 83 after log preprocessing.

Table 6
The detection of outlying objects by the modified G-ISO test on Rao's statistic, data set A after elimination of sample 63. Objects 53 and 23 remain detected on D_(3) after centering, and objects 77 and 83 after log-column and log-double centering.

After the elimination, the detection procedure was repeated for each transformation (Table 9). The results confirmed the earlier ones. The exclusion of sample 67, however, significantly reduces the importance of PC2; the information that was in PC2 is, after elimination of object 67, found in PC3. In the modelling we should especially focus on objects 37 and 56, on the high leverage samples 72, 70 and 11, and on objects 48, 71 and 81.

- Modified G-ISO test on Rao's statistic (see Tables 10 and 11)

In addition to the objects detected on single PCs, one object, 49, was found in the presence of outlier 67 on D_(3). There is no known (chemical) explanation of its outlying D_(3) characteristic, resp. its scores on PC4 and PC5.

Fig. 6. Data set A after centering: plots of Rao's statistic D_(1), D_(2) and D_(3) against the y-value.

Fig. 7. Plot of the absolute PLS residuals for the flagged objects (83, 77, 23, 53) against the number of components in the model, compared to the median and third quartile of all absolute deviations. Data set A.

The other objects found by Rao's statistic after log preprocessing on D_(1) are the high leverage objects 41 and 9, detected owing to their high y-values. After elimination of object 67 and log preprocessing the test is also sensitive to the objects with the lowest y-values: 61, 62 and 86. All these extreme y-value objects (9, 41, 61, 62 and 86) are not suspected to be real outliers. The last object, never found before, is 68, outlying after centering on D_(2). These are the reasons why objects 68 and 49 are flagged. Because of the high number of flagged objects in data set B, the verification based on modelling is very important. From the y predicted by PLS models of 1, 2, ..., 7 components the corresponding absolute deviations were again computed. The highest of them, related to objects 37, 68, 11, 56, 72 and 48, are shown in Fig. 9. One registers mainly extreme residuals for object 37, high even for the optimal six-component model. Such bad prediction seems to justify the elimination of this flagged object. Indeed, after removal of 37 the RMSEP is better (1.44 instead of 1.78, see Table 12). If, however, one computes the RMSEP for the optimal model including object 37, but taking into account for the calculation of the RMSEP all objects except 37, one obtains a value of 1.26 (see Table 12). In other words, the RMSEP value of the optimal model including 37 is heavily influenced by object 37: when object 37 is eliminated from the model, the RMSEP goes from 1.26 to 1.44. Eliminating 37 leads, however, to worse prediction of the other objects.

Table 7
Study of the flagged objects for data set A. RMSEP of models with increasing numbers of components

Number of components | 1 | 2 | 3 | 4 | 5 | 6 | 7
Flagged objects kept in the model and in the RMSEP | 5.01 | 4.04 | 3.72 | 3.43 | 2.22 | 1.86 | 1.95
Object eliminated in the calculation of the RMSEP:
  83 | 4.56 | 3.80 | 3.25 | 2.99 | 2.08 | 1.77 | 1.93
  77 | 4.87 | 3.85 | 3.57 | 3.34 | 2.23 | 1.87 | 1.97
  53 | 5.03 | 4.04 | 3.77 | 3.41 | 2.23 | 1.89 | 1.99
  23 | 4.50 | 4.02 | 3.77 | 3.49 | 2.24 | 1.84 | 1.94
Object eliminated from the model:
  83 | 4.79 | 3.90 | 2.98 | 3.11 | 2.40 | 2.01 | 2.18
  77 | 4.83 | 3.83 | 3.68 | 3.39 | 2.53 | 2.00 | 2.05
  53 | 5.08 | 4.23 | 4.03 | 3.37 | 2.44 | 2.04 | 2.21
  23 | 4.93 | 4.66 | 3.71 | 3.80 | 2.48 | 2.16 | 2.16

It shows that this object has "good" leverage. During the leave-one-out CV, object 37 is modelled badly (high residual), since its singularity cannot be modelled by the other objects; however, the other objects are predicted better when object 37 is used for modelling. Object 37 adds predictive power to the model and should therefore not be eliminated. For the reasons explained above, object 37 was considered in the following verification only in the modelling; its predicted y was not used for the calculation of the RMSEP in Table 13. One can deduce that the elimination of the flagged objects 11 or 72 from the model slightly improves the RMSEP: it decreases for the six-component models from 1.26 to 1.10 and 1.20, respectively. When both objects are removed, the RMSEP has the lowest value, 1.08.


Fig. 8. Data set B: (a) plot of the first two PCs and (b) plot of PC3 and PC4 after centering; (c) plot of PC1 and PC2 and (d) plot of PC3 and PC4 after log-column centering.

Table 8
The detection of outlying objects by the Grubbs' and Dixon's tests (G-ISO, G-SS, Dixon) on latent variables, data set B. After centering, objects 67, 37, 56 and 72 are flagged on PC2, objects 72 and 70 on PC3 and objects 72, 70 and 11 on PC4. After log-column and log-double centering, objects 72, 70, 48, 71 and 81 are flagged on PC1, objects 72 and 70 on PC2 and PC3, and object 67 on PC4.

Table 9
The detection of outlying objects by the Grubbs' and Dixon's tests on latent variables, data set B after elimination of sample 67. After centering, objects 37 and 56 are flagged on PC3 and object 72 on PC4. After log-column and log-double centering, objects 72, 70, 48, 71 and 81 are flagged on PC1 and objects 72 and 70 on PC2 and PC3.

Table 10
Detection of outlying objects by the modified G-ISO test on Rao's statistic, data set B. After centering, object 67 is detected on D_(1) and objects 72, 11, 70 and 49 on D_(3). After log-column centering, objects 72, 70, 41, 11 and 9 are detected on D_(1); after log-double centering, objects 72, 70, 11, 41, 9 and 71 are detected on D_(1); objects 67 and 48 appear on the higher D-values.

Table 11
Detection of outlying objects by the modified G-ISO test on Rao's statistic, data set B after elimination of sample 67. After centering, objects 37, 72 and 56 are detected on D_(1) and objects 37, 72, 56 and 68 on D_(2). After log-column centering, the extreme y-value objects 70, 72, 41, 11, 9, 71, 62, 61 and 86 are detected on D_(1); after log-double centering, objects 70, 11, 41, 72, 9 and 71 are detected on D_(1); objects 48 and 49 appear on the higher D-values.

We therefore consider that 11 should probably be treated as an outlier and removed, while object 72 is doubtful. The elimination of the objects 68, 56, 48, 49, 70, 71 or 81 does not improve the RMSEP (Table 13); therefore they are kept in the data set. One can conclude that these flagged objects are not outlying from the optimal model; they are outlying only on single features.

Fig. 9. Plot of the absolute PLS residuals for the flagged objects (37, 68, 11, 56, 72, 48) against the number of components in the model, compared to the median and third quartile of all absolute deviations. Data set B.

The log preprocessing did not help in the modelling, so that only centering was finally applied. The outliers detected in the transformed data space were flagged and verified by modelling on the centered data. On comparison of the different procedures, it is useful to note that, while Rao's statistic after centering correctly points out 67 as an outlier on D_(1), Rao's statistic after log-double centering finds six objects which, as will be shown later, form a cluster different from the other objects (the objects with the highest y-values). Clearly, both types of preprocessing indicate aberrant behaviour, but in a different way. Because of the orthogonality of the PCs, we considered the data for outlier detection as independent and performed the detection at the α = 5% critical level. If more than one PC is tested, the overall α increases because of the multiple comparison effect. One can prefer, in such cases, a 1% critical level or perform a Bonferroni correction.

Table 12
The verification of the flagged object 37 in the data set B. RMSEP of models with increasing numbers of components

Number of components | 1 | 2 | 3 | 4 | 5 | 6 | 7
Flagged objects kept in the model and in the RMSEP | 4.40 | 3.39 | 3.38 | 3.46 | 2.65 | 1.78 | 1.67
Object 37 eliminated from the model | 4.49 | 2.49 | 2.18 | 1.89 | 1.75 | 1.44 | 1.48
Object 37 eliminated in the calculation of the RMSEP | 4.43 | 2.70 | 2.47 | 2.43 | 1.64 | 1.26 | 1.34

Table 13
Study of the flagged objects for data set B. RMSEP of models with increasing numbers of components

Number of components | 1 | 2 | 3 | 4 | 5 | 6 | 7
Object eliminated from the model:
  68 | 4.04 | 2.57 | 2.29 | 2.36 | 1.63 | 1.36 | 1.35
  11 | 4.29 | 2.51 | 2.34 | 2.43 | 1.35 | 1.10 | 1.07
  56 | 4.20 | 2.82 | 2.54 | 2.51 | 1.83 | 1.32 | 1.36
  72 | 4.28 | 2.70 | 2.46 | 2.33 | 1.59 | 1.20 | 1.29
  48 | 4.36 | 2.67 | 2.49 | 2.33 | 1.64 | 1.29 | 1.37
  49 | 4.46 | 2.72 | 2.49 | 2.56 | 1.65 | 1.29 | 1.43
  70 | 4.60 | 2.79 | 2.51 | 2.54 | 1.64 | 1.29 | 1.37
  71 | 4.51 | 2.73 | 2.48 | 2.53 | 1.65 | 1.27 | 1.35
  81 | 4.40 | 2.74 | 2.49 | 2.45 | 1.63 | 1.26 | 1.35
  11, 72 | 4.19 | 2.56 | 2.35 | 2.30 | 1.32 | 1.08 | 1.04
Object eliminated in the calculation of the RMSEP:
  72 | 4.04 | 2.65 | 2.44 | 2.37 | 1.65 | 1.22 | 1.30
  11 | 4.42 | 2.53 | 2.32 | 2.19 | 1.62 | 1.17 | 1.11
  11, 72 | 4.02 | 2.47 | 2.29 | 2.11 | 1.64 | 1.14 | 1.06

5.3. Clustering tendency

The clustering tendency was determined after the elimination of outlying replicates (77c, 70b, 54b) and after the elimination of the clear chemical outliers, objects 63 and 67. One should take into account the results for outlier detection when the clustering tendency is examined. An outlier is in fact also a (small) cluster which can deteriorate the computation of indexes for clustering tendency such as the Hopkins statistic.

5.3.1. Hopkins statistic

All combinations of the population size used (x) and the number of iterations were chosen such that each object could be chosen once.

We also included x = 100% of the objects and iterated this calculation 5 times. The results lead us to recommend using more than 5% of the total population; otherwise the results vary too much, and the range = H_max - H_min can include the critical value H = 0.75. The results are presented in Tables 14 and 15. Data set B includes, with more than 90% probability, more than one cluster, while A does not. The reason for recommending more than 5% of the population can be understood from Table 16. A literature proposal [13] to use a low percentage of the total population was verified and repeated twice on the combined A and B data sets. One should remember that A and B are known to be different, so a clustering tendency should be found. Each average result H is higher than the critical value 0.75. However, for less than 5% of the population the range is more than 0.2 and H_min is less than 0.75.

Table 14
Hopkins' statistic, data set A after elimination of outlier 63

Iterations | x (%) | H (average) | H_min | H_max | Range
5 | 20 | 0.6135 | 0.4636 | 0.7522 | 0.2886
10 | 10 | 0.6454 | 0.4987 | 0.8804 | 0.3817
20 | 5 | 0.6899 | 0.4066 | 0.8979 | 0.4913
5 | 100 | 0.6463 | 0.5960 | 0.6912 | 0.0952

Table 15
Hopkins' statistic, data set B after elimination of outlier 67

Iterations | x (%) | H (average) | H_min | H_max | Range
5 | 20 | 0.8227 | 0.7727 | 0.8892 | 0.1165
10 | 10 | 0.8197 | 0.7639 | 0.8861 | 0.1223
20 | 5 | 0.8037 | 0.6060 | 0.9047 | 0.2987
5 | 100 | 0.8223 | 0.8027 | 0.8471 | 0.0444

On the other hand, with 100% of the objects included, H varies by only about 1%. As this is an example with a relatively large population (85 objects), the use of a lower fraction x of the population seems even less advisable for smaller data sets. Fig. 8a visually confirms the clustering tendency of data set B (one large and one small cluster) and Fig. 4 the homogeneity of data set A. It is concluded that, when used with more than 5% of the population, the Hopkins statistic does give a correct conclusion about clustering tendency. When the non-significant PCs were not considered for the computation of H, the results obtained were similar to the original ones. It is therefore not necessary to know the complexity of the data before computing H.

5.3.2. Hierarchical clustering

To verify the existence of clusters, single linkage hierarchical clustering can be performed with the Euclidean distance as the similarity measurement. The dendrograms are shown in Fig. 10a for A and in Fig. 10b for B. There are no clear clusters in A. In the B data set one observes a rather large difference between the six high hydroxyl number objects (9, 11, 41, 70, 71, 72) and the rest. This structure confirms the results obtained with the clustering index and explains some of the results with the outlier tests.
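A sketch of this single linkage step with SciPy; the scores are synthetic and only mimic the structure described for data set B (a main group plus a small group of six objects).

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# hypothetical centered PC scores: a main group plus a small separated group
rng = np.random.default_rng(7)
scores = np.vstack([rng.normal(size=(54, 4)), rng.normal(size=(6, 4)) + 5])

Z = linkage(pdist(scores, metric="euclidean"), method="single")   # single linkage, Euclidean distance
dendrogram(Z, no_plot=True)   # set no_plot=False (with matplotlib) to draw the tree
# the height of the last merge indicates how far the small group lies from the main cluster
print(f"largest merge distance: {Z[-1, 2]:.2f}")
```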

Table 16
Hopkins' statistic, combined A and B data sets after elimination of outliers 63 and 67

Iterations | x (%) | H (average) | H_min | H_max | Range
1 | 100 | 0.8135 | 0.8135 | 0.8135 | 0
1 | 100 | 0.8184 | 0.8184 | 0.8184 | 0
2 | 50 | 0.7944 | 0.7936 | 0.7953 | 0.0018
2 | 50 | 0.8180 | 0.7806 | 0.8554 | 0.0748
3 | 33 | 0.8104 | 0.7994 | 0.8189 | 0.0196
3 | 33 | 0.8390 | 0.8345 | 0.8468 | 0.0123
4 | 25 | 0.8322 | 0.7985 | 0.8877 | 0.0892
4 | 25 | 0.8196 | 0.7891 | 0.8428 | 0.0537
5 | 20 | 0.8165 | 0.7278 | 0.8681 | 0.1402
5 | 20 | 0.8446 | 0.7876 | 0.8950 | 0.1073
10 | 10 | 0.8237 | 0.7662 | 0.8657 | 0.0995
10 | 10 | 0.7842 | 0.7018 | 0.8557 | 0.1539
20 | 5 | 0.8184 | 0.6892 | 0.8903 | 0.2011
20 | 5 | 0.7954 | 0.6350 | 0.9057 | 0.2707
33 | 3 | 0.8198 | 0.6653 | 0.9332 | 0.2679
33 | 3 | 0.8071 | 0.5336 | 0.9360 | 0.4023
50 | 2 | 0.8023 | 0.5205 | 0.9407 | 0.4202
50 | 2 | 0.8112 | 0.4704 | 0.9657 | 0.4953
100 | 1 | 0.8317 | 0.5562 | 0.9714 | 0.4152
100 | 1 | 0.8218 | 0.3070 | 0.9681 | 0.6611
5 | 100 | 0.8160 | 0.8126 | 0.8226 | 0.010

One should consider whether or not it is useful to keep these heterogeneous spectra together in one data set for multivariate calibration. The answer depends on the goal and the context of the work and should be given not on statistical, but on chemical grounds.

6. Conclusion

The application of the methods described here seems useful for detecting inhomogeneities in data sets before multivariate calibration is attempted.

The Cochran test applied to the sum of absorbances A_sum and to the scores on the PCs identifies outlying replicates. We recommend using A_sum.

Individual latent variables (PCs) and the Rao statistic were used to detect sample outliers. Dixon's and Grubbs' tests (modified for use with the Rao statistic) were applied to these data. There is little difference between the results obtained with the Dixon test and the two Grubbs' tests; therefore it is probably best to apply the Grubbs test already proposed by ISO in another context. The selection of the data is more important. We are not able to decide, with the data sets studied, whether there is a best method; from our results, a combination of tests on the Rao statistic and on PC1 seems to be preferred.

The preprocessing before PCA has an important effect. The offset correction followed by centering allowed the outliers to be identified. More sophisticated preprocessing can, however, also be useful for emphasizing an inhomogeneity in the data set and should therefore be considered as a possibility.

Immediate elimination of detected outliers is not indicated; the removal has to be justified by a chemical or another scientific reason. Other outliers should merely be flagged. The outlying character of the flagged objects should be verified by modelling. This allows real outliers to be distinguished from objects that are extreme only on individual features but do not influence, or even improve, the final model. Such objects should be kept in the data set.

The tendency to form clusters can be detected by the Hopkins statistic (H). The results obtained agree with the visual observations on dendrograms or PC plots. We recommend using a large part of the total population x for the computation of the Hopkins statistic and eliminating outliers before carrying out these calculations.


Fig. 10. Dendrograms for single linkage hierarchical clustering. The Euclidean distance is used as the similarity measurement; the data were centered. The outliers 63 and 67, respectively, were eliminated: (a) data set A, (b) data set B.

Acknowledgements We would like to thank Mr. Theo Meier for careful collection of the data and FGWO for financial assistance.

References

[1] W.J. Youden and E.H. Steiner, Statistical Manual of the Association of Official Analytical Chemists, AOAC, Washington, 1975.
[2] International Organization for Standardization, ISO Standards Handbook 3: Statistical Methods, 3rd edn., Geneva, 1989.
[3] International Union of Pure and Applied Chemistry, Pure Appl. Chem., 60(6) (1988) 855-864.
[4] J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood, Chichester, 1988.
[5] F.E. Grubbs and G. Beck, Technometrics, 14 (1972) 847-854.
[6] P.C. Kelly, J. Assoc. Off. Anal. Chem., 73 (1990) 58-64.
[7] H.A. David, H.O. Hartley and E.S. Pearson, Biometrika, 41 (1954) 482-493.
[8] B. Mertens, M. Thompson and T. Fearn, Analyst, 119 (1994) 2777-2784.
[9] P.J. Lewi, Interpretation of Biplots, Comett Course: Introduction to Multivariate Techniques for Chemical Application, 1992.
[10] H. Martens and T. Næs, Multivariate Calibration, Wiley, New York, 1989.
[11] V. Thomas, Anal. Chem., 66(15) (1994) 795-804.
[12] B. Hopkins, Ann. Bot., 18 (1954) 213.
[13] R.G. Lawson and P.C. Jurs, J. Chem. Inf. Comput. Sci., 30 (1990) 36-41.
[14] F. Malinowski and D. Howery, Factor Analysis in Chemistry, Wiley, New York, 1980.