Outlier detection in multivariate calibration

B. Walczak ¹

ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium

Chemometrics and Intelligent Laboratory Systems 28 (1995) 259-272

Received 15 April 1994; accepted 2 August 1994

Abstract

A new procedure for the identification of multiple outliers in linear models is proposed. It is based on an evolution program for subset selection. The procedure is illustrated and compared to existing robust methods using several data sets known to contain multiple outliers, and its performance and robustness are additionally tested by a Monte Carlo study on simulated data sets.

1. Introduction

One of the main problems in multivariate calibration is the detection of multiple outliers, which cause the so-called masking effect (several outliers can interact in complicated ways to strengthen or to cancel each other's influence). A possible way of outlier identification is to estimate the extent to which the presence of subsets of data affects the coefficients of the regression model. This is the basic idea of the so-called multiple-cases diagnostics [1-4]. Unfortunately, this method requires consideration of all possible subsets of 1, 2, 3, ..., m/2 elements from the m-element initial data set, which leads to long computation times. In analogy with the multiple-cases diagnostics we can formulate our problem as searching for the data subset which contains no outliers. A regression model

¹ On leave from Silesian University, Katowice, Poland.


built for this subset can be considered as a robust model describing the 'good part' of the data. Based on the residuals from this robust model the outliers can be identified and rejected, and the final model can be constructed with the remaining 'good' data. As the space of possible solutions of this problem is very large, an evolution program based on the idea of a genetic algorithm (GA) [5-8] seems a reasonable tool to guide the search of this space. The main desirable characteristic of GAs and of evolution programs is that they run on a population of potential solutions and balance between exploration and exploitation of the search space. The evolution program presented in this paper makes no use of the binary mutation and crossover operators that are typical for GAs, but it incorporates other 'genetic' operators improving the convergence of evolution. Although GAs have already been successfully applied to subset selection [9] and to curve fitting [10], the proposed approach differs with respect to its ability to deal with contaminated data and can be treated as a 'robust method'.


2. Theory

A GA is a relatively simple procedure guided by the so-called evaluation function (fitness function or objective function), describing the quality of solutions. In our case this function ought to allow finding the subset of data which contains no outliers. The model built for the subset of the data without outliers can be used to detect the outliers in the test set. It is then necessary to define an evaluation function which allows one to distinguish the good model from the bad ones.

2.1. Genetic algorithm

A GA is an iterative procedure which imitates the evolutionary process in nature. It starts with a randomly selected population of potential solutions, encoded as finite strings of bits. The initial population undergoes a simulated evolution, which copies the natural selective reproduction scheme, 'survival of the fittest'. The fitness of the individual strings is estimated by the objective function, which plays the role of an environment. To every string a fitness value is assigned, which simply is a non-negative measure of the quality of the string's solution. The definition of the fitness function is problem dependent; it is defined by the user to represent the quality of a string. This fitness is then used to direct the application of the operations which produce a new population of strings (a new generation). The new population is formed by selecting strings with a probability relative to their fitness. As this reproduction rule can lead to a decrease of string diversity in the next few generations, two genetic operators, mutation and crossover, are introduced to counterbalance this phenomenon. Mutation works on the string bit level: it randomly inverts single bits of individuals, with the probability of this event usually being very small. Crossover, the recombination operator of GAs, is the main variation operator, which hopefully recombines useful segments from different individuals. The crossover rate indicates the probability per individual to undergo recombination. With two parent individuals selected (at random) from the population, crossover forms two offspring individuals according to the following scheme:

[1101 | 10001011]          [1101 | 10111000]
[0111 | 10111000]   -->    [0111 | 10001011]
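A minimal sketch of this operator in Python follows (the function name and the fixed example strings are illustrative, and note that the proposed EP itself does not use crossover):

```python
import random

def one_point_crossover(parent1, parent2):
    """Recombine two equal-length bit strings at a random breakpoint."""
    point = random.randint(1, len(parent1) - 1)  # breakpoint strictly inside the string
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# The worked scheme above, with the breakpoint fixed after bit 4:
p1, p2 = "110110001011", "011110111000"
print(p1[:4] + p2[4:], p2[:4] + p1[4:])  # -> 110110111000 011110001011
```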

This one-point crossover (with a randomly selected breakpoint) can naturally be extended to a generalized m-point crossover by sampling more than one breakpoint and alternately exchanging each second resulting segment. The new population of strings is again individually evaluated and transformed into a subsequent generation. This process continues until convergence within a population is achieved.

2.2. Evolution programs

Evolution programs (EPs) are also based on the principle of evolution (survival of the fittest). The main difference between GAs and EPs is that GAs use fixed-length binary strings for their individuals and two operators, binary mutation and binary crossover, whereas EPs allow any data structure suitable for a problem, together with any set of 'genetic' operators. In the proposed approach to subset selection we use fixed-length binary strings, but instead of mutation and crossover an alternative operator is applied. The details of the proposed algorithm of subset selection are given below.

2.3. Subset selection

Consider as an example a data set containing m = 30 objects, 1 dependent (y) and 5 independent (x) variables. The assumed fraction of contaminated data, p, equals 0.25 (about 7 outlying observations).

Initial population
An initial population of k strings, each of length 30, is generated at random. For k = 10, the initial population can be presented in string notation as

string1   [100010100110110110000110001011]
string2   [010000000010010010000101000111]
string3   [101011000110100010001111101101]
string4   [000000001010001011000100000001]
string5   [100010000101010001000000011000]
string6   [000100100000001110001000010001]
string7   [100010001101011000100110110000]
string8   [011101100100110000010101101101]
string9   [001101000011000100001010001011]
string10  [110001001101010000100101010001]
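A population of this kind could be generated, for instance, as follows (a sketch in Python/NumPy; the paper does not prescribe an implementation, so the helper name random_string and the uniform draw of the subset size between m_min and m_max are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_string(m, m_min, m_max):
    """A binary string of length m whose number of ones (the model-set size)
    lies between m_min and m_max."""
    size = rng.integers(m_min, m_max + 1)          # number of objects in the MS
    s = np.zeros(m, dtype=int)
    s[rng.choice(m, size=size, replace=False)] = 1  # '1' marks membership in the MS
    return s

population = [random_string(30, m_min=6, m_max=16) for _ in range(10)]
```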

Each string represents one possible subset of the initial data set. If the ith gene of the string is '1', the ith object is present in the subset. For instance, subset 1 contains objects Nos. 1, 5, 7, 10, 11, 13, 14, 16, 17, 22, 23, 27, 29, and 30. Subsets represented by the particular strings will be called model sets (MS). The following constraint ought to be taken into account: the number of objects represented by a particular string can vary from m_min to m_max, where m_min is a reasonable minimal number of objects in the model set (depending on the number of independent variables), and m_max can be estimated as

m_max < m(1 - p)

It means that m_max ought to be smaller than the assumed number of non-contaminated data. For the discussed example (30 objects, 5 independent variables, assumed fraction of contamination p = 0.25, i.e., 7 outliers), m_min can be, for instance, equal to 6, and m_max can be equal to 16. It means that if a randomly chosen model set contains 16 objects, the remaining 14 objects will be in the test set (TS). The reason why m_max cannot be equal to 23 (30 - 7 outliers) is explained in the next section.

Evaluation
As noted earlier, the evaluation function f plays the role of the environment, rating potential solutions in terms of their fitness. The evaluation function for string s_i is defined as follows:

f(s_i) = 1/RMS

Table 1. Evaluation of string3

Object   |r| MS    |r| TS     |r| MS+TS   Sorted object   Sorted |r|
1          9.364                9.364     14                0.339
2                    28.301    28.301     10                0.437
3          1.199                1.199     19                0.484
4                     5.972     5.972     27                0.548
5          2.274                2.274      9                1.044
6         24.795               24.795      3                1.199
7                     7.539     7.539     24                1.303
8                   129.091   129.091     17                1.694
9                     1.044     1.044     30                1.721
10         0.437                0.437      5                2.274
11        12.666               12.666     13                2.855
12                    3.024     3.024     12                3.024
13         2.855                2.855     29                3.121
14                    0.339     0.339     22                3.326
15                  156.488   156.488     25                3.407
16                   84.636    84.636     23                3.809
17         1.694                1.694     18                5.231
18                    5.231     5.231      4                5.972
19                    0.484     0.484     28                6.004
20                   13.883    13.883      7                7.539
21         8.312                8.312     21                8.312
22         3.326                3.326      1                9.364
23         3.809                3.809     11               12.666
24         1.303                1.303     20               13.883
25         3.407                3.407      6               24.795
26                  142.086   142.086      2               28.301
27         0.548                0.548     16               84.636
28         6.004                6.004      8              129.091
29                    3.121     3.121     26              142.086
30         1.721                1.721     15              156.488


where RMS is the root mean square error estimated for the m' objects sorted according to the absolute value of their residuals from the model based on string s_i, where m' = (1 - p)m.

Let us consider our initial population. The subset represented, for instance, by string3 contains 16 objects (the MS); the remaining 14 objects belong to the test set. The MS is used to construct a least squares (LS) model; then, based on this model, the prediction is made for all 30 objects (see Table 1). Residuals, defined as

r_i = y_i,obs - y_i,pred

are estimated for all 30 objects (objects from both the MS and the TS). Then all objects are sorted according to the absolute value of their residuals, and the RMS error

RMS = ( (1/m') Σ_{i=1}^{m'} r_(i)² )^{1/2}

where r_(i) denotes the ith sorted residual, is estimated for the first m' = 23 objects (i.e., 7 objects are set aside as possible outliers).

The proposed definition of the fitness function can be motivated as follows. The fitness function cannot be based on the LS residuals estimated only for objects from the MS, because the LS residuals are not a good source of information about outlying observations [4]: influential outlying observations (so-called bad leverage points [4]) may possess very small LS residuals, as the LS model is pulled in their direction. The fitness function cannot be based only on the LS residuals of objects from the TS either, since we are not sure that the LS model is good. The LS residuals can be used to identify outliers only if objects from both MS and TS are taken into account (i.e., the model and its prediction ability are evaluated simultaneously). In this case the RMS error has to be estimated not for the whole data set (with outliers) but for the assumed fraction of non-contaminated data, sorted according to the absolute value of their residuals from the LS model. To evaluate the model and its prediction ability, the constraint limiting the number of objects in the MS is necessary.

All strings from the initial population are evaluated in an analogous manner:

String     RMS      f
string1    21.762   0.046
string2    17.951   0.056
string3     4.905   0.204
string4    14.722   0.068
string5    42.160   0.024
string6    18.854   0.053
string7    19.483   0.051
string8     6.528   0.153
string9    21.532   0.046
string10   23.038   0.043
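A sketch of how a single string is evaluated is given below (Python/NumPy; assumed: np.linalg.lstsq stands in for the LS fit, X holds the independent variables as columns, and the function name is illustrative):

```python
import numpy as np

def fitness(s, X, y, m_prime):
    """f(s) = 1/RMS, where RMS is computed over the m' objects with the
    smallest absolute residuals from the LS model fitted on the model set."""
    ms = s.astype(bool)                        # '1' marks membership in the MS
    b, *_ = np.linalg.lstsq(X[ms], y[ms], rcond=None)
    r = y - X @ b                              # residuals for all objects (MS and TS)
    r_sorted = np.sort(np.abs(r))              # sort by absolute residual
    rms = np.sqrt(np.mean(r_sorted[:m_prime] ** 2))
    return 1.0 / rms
```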

Reproduction
Reproduction is simply a process by which strings with large fitness values are selected to create the next population. The average fitness for the initial population equals 0.074. In this population there are only two strings with fitness higher than the average. These two strings (string3 and string8) are selected, and the number of copies of each in the new population, num_i, is estimated as

num_i = round( (f_i / Σ_{j=1}^{n} f_j) k )

where k is the size of the population and n is the number of strings selected for the new population. In the discussed example (k = 10, n = 2), num_string3 = round((0.204/0.357) · 10) = round(5.7) = 6 and num_string8 = round((0.153/0.357) · 10) = round(4.3) = 4. It means that string3 receives 6 copies and string8 only 4 copies in the next generation. The sum of copies must equal the population size (6 + 4 = 10).
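This copy rule is a one-liner (Python sketch; the function name is illustrative):

```python
def copy_counts(f_selected, k):
    """num_i = round((f_i / sum_j f_j) * k) copies for each selected string."""
    total = sum(f_selected)
    return [round(f / total * k) for f in f_selected]

print(copy_counts([0.204, 0.153], k=10))  # -> [6, 4], as in the worked example
```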

Now, instead of applying the standard genetic operators, the following strategy is chosen. Instead of string3, the following string3' is constructed:

string3'  [101110101111110011101111101111]

The subset represented by this string contains 23 objects (the assumed number of non-contaminated objects) for which the RMS error was estimated, i.e., objects Nos. 1, 3, 4, 5, 7, 9, 10, 11, 12, 13, 14, 17, 18, 19, 21, 22, 23, 24, 25, 27, 28, 29, and 30 (see Table 1; the 23 objects with the lowest residuals are listed in the fifth column). string8, whose fitness is higher than the average fitness, is replaced by string8', constructed in an analogous way. The next generation of strings is created by selecting subsets from the sets represented by these strings. The resulting new generation of strings looks like


string1   [101100000011010010000010101111]
string2   [100100000110100000100110001100]
string3   [100100100101010011100010101011]
string4   [001100001000100011000110000010]
string5   [000010001100010001000001100001]
string6   [101010000100000000101000000011]
string7   [110101000100011000000011000000]
string8   [000110110001010101010001110001]
string9   [110011101100101001000000011000]
string10  [000100100001110100010100100000]

The first six strings are subsets of string3'; the remaining four strings are subsets of string8'. These subsets are selected randomly with equal probability and with the constraint that the number of objects they represent is in the range from m_min to m_max (a minimal sketch of this operator is given below). This population is again evaluated, and the procedure of reproduction and subset selection is repeated until convergence is achieved.
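The sketch announced above (Python/NumPy; clipping the subset size to the number of objects carried by the parent is an assumption, needed only for parents with fewer than m_max members):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subset(parent, m_min, m_max):
    """Offspring string: a random subset of the objects present in the
    expanded parent string, with size drawn uniformly from [m_min, m_max]."""
    members = np.flatnonzero(parent)                  # objects carried by the parent
    size = rng.integers(m_min, min(m_max, len(members)) + 1)
    child = np.zeros_like(parent)
    child[rng.choice(members, size=size, replace=False)] = 1
    return child
```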

2.4. Outlier detection

As input parameter to the EP the expected fraction of data contamination, p, is required. In some cases its estimation is easy, in others more difficult. In the extreme situation the user can set this parameter equal to 0.5 (for larger amounts of contamination it becomes impossible to distinguish between the 'good' and the 'bad' parts of the data [4]). The EP solution yields the model built for the assumed fraction of non-contaminated data, (1 - p)m. The real fraction of outliers can, however, be much smaller than that assumed. The EP model ought to be used to identify the real outliers in the data, as those objects with large residuals from the fit followed by the data majority. After outlier identification and rejection (or correction), the final LS model can be built for the remaining 'good' part of the data.

To diagnose outliers based on residuals from a robust procedure, the standardized residuals proposed by Rousseeuw and Leroy [4] are applied. For that purpose a preliminary scale estimate s_1 is defined in a robust way, namely, based on the median of the squared residuals multiplied by a finite-sample correction factor for the case of normal errors:

s_1 = 1.4826 (1 + 5/(m - l)) (median_i r_i²)^{1/2}

where m is the number of objects, l is the number of estimated parameters (for a line l = 2, for a plane l = 3, etc.), 1.4826 is the factor assuming a normal distribution of the errors, and 1 + 5/(m - l) is an empirically found factor correcting for small sample sizes (it approaches 1 for large m). With this scale estimate, the standardized residuals r_i/s_1 are computed, and the ith object is rejected if

|r_i|/s_1 > 2.5

The procedure is repeated for the m' objects remaining after the outliers have been eliminated, but the second scale estimate is defined as

s* = ( Σ_{i=1}^{m'} r_i² / (m' - l) )^{1/2}

Each of the m' objects is evaluated again according to

|r_i|/s* > 2.5

to perform the robust diagnosis of outliers. The advantage of the formula for s* is that outliers no longer influence the scale estimate.
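A sketch of this two-stage diagnosis follows (Python/NumPy). The exact form of s* is garbled in the source; the formula used below, over the retained objects only, follows the corresponding estimator of Rousseeuw and Leroy [4] and should be read as an assumption:

```python
import numpy as np

def robust_outlier_flags(r, l, cutoff=2.5):
    """Two-stage standardized-residual diagnosis applied to residuals r
    from the robust (EP) model; l = number of estimated coefficients."""
    m = len(r)
    s1 = 1.4826 * (1.0 + 5.0 / (m - l)) * np.sqrt(np.median(r ** 2))
    keep = np.abs(r) / s1 <= cutoff                        # first screening
    s_star = np.sqrt(np.sum(r[keep] ** 2) / (keep.sum() - l))  # scale from 'good' points only
    return np.abs(r) / s_star > cutoff                     # final outlier flags
```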

2.5. Robust methods

The robustness of the proposed subset selection algorithm is compared with that of two robust methods, i.e., the least median of squares and the repeated median.


Least median of squares (LMS)
The least median of squares (LMS) method and its breakdown properties are described in Ref. [4]. The method was introduced into chemometrics by Massart et al. [10]. The LMS procedure can be summarized as follows. All lines through all possible pairs of data points (x_i, y_i) are calculated, i.e., their slopes b_1 and intercepts b_0. Then the squared residuals r_i² between each line and all the data points are estimated:

r_i² = (y_i - b_1 x_i - b_0)²

The LMS estimator is then defined as

minimize_{b_0, b_1} med_i r_i²

The slope b_1 for which the median of squared residuals attains the minimum is then chosen as the regression coefficient in the model.
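For a straight line, this exhaustive pairwise search can be sketched as follows (Python/NumPy; O(n²) pairs, illustrative function name):

```python
import numpy as np
from itertools import combinations

def lms_line(x, y):
    """LMS for a straight line: among the lines through all pairs of points,
    choose the one minimising the median of squared residuals."""
    best, best_med = None, np.inf
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                          # vertical line through the pair, skip
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        med = np.median((y - b1 * x - b0) ** 2)
        if med < best_med:
            best, best_med = (b1, b0), med
    return best                               # (slope, intercept)
```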

Repeated median (RM)
For each data point (x_i, y_i), where i = 1, ..., n, a set of (n - 1) slopes between that point and all other points is estimated. The individual slopes are calculated as

b_ij = (y_j - y_i)/(x_j - x_i),   j ≠ i

For n points, the n sets b_i of (n - 1) slopes are estimated. The final robust slope of the regression model is defined as the median of the medians of all n sets of slopes:

b_final = median_i (median_j (b_ij))
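A sketch of the repeated median slope (Python/NumPy; assumes distinct x values):

```python
import numpy as np

def repeated_median_slope(x, y):
    """b_final = median_i( median_{j != i} (y_j - y_i) / (x_j - x_i) )."""
    n = len(x)
    inner = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i              # all points except the ith
        inner[i] = np.median((y[mask] - y[i]) / (x[mask] - x[i]))
    return np.median(inner)
```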

2.6. Simulated data

The performance of the proposed algorithm was tested on the following simulated data sets.

Data set 1
For illustrative purposes, the following set is considered:

y_i = x_i + e_i

where the x_i are generated as normally distributed numbers with mean 0 and standard deviation 100 (x_i ~ N(0, 100)), and the noise terms are generated as e_i ~ N(0, 1) for i = 1, ..., 40. Seven randomly selected objects (Nos. 1, 2, 3, 4, 10, 20, and 40) are replaced by random numbers generated as outliers, ~ N(10, 2·SD), where SD denotes the standard deviation of the data.

Data set 2
The performance of the proposed algorithm was also tested for the following data set:

y_i = x_i1 + x_i2 + x_i3 + x_i4 + x_i5 + e_i

where the explanatory variables are generated as uniformly distributed numbers (x_ij ~ U(0, 100) for i = 1, ..., 100 and j = 1, ..., 5) and the noise terms are generated as normally distributed variables with mean 0 and standard deviation 1 (e_i ~ N(0, 1)). The least squares method yields for this data set the following regression model:

y_i = 0.9945 x_i1 + 1.0024 x_i2 + 1.0008 x_i3 + 0.9998 x_i4 + 1.0006 x_i5

The root mean square error (RMS) equals 1.0017.

Table 2. Parameters of the contaminating distribution N(mean, s); s = constant · SD(data)

No.   p      Mean   Constant
1     0.25    0     1
2     0.25    0     2
3     0.25    0     5
4     0.25   10     1
5     0.25   10     2
6     0.25   10     5
7     0.25   50     1
8     0.25   50     2
9     0.25   50     5
10    0.49    0     1
11    0.49    0     2
12    0.49    0     5
13    0.49   10     1
14    0.49   10     2
15    0.49   10     5
16    0.49   50     1
17    0.49   50     2
18    0.49   50     5


Contamination of a fraction p of the initial data set was performed in the following way:
- Outliers in the X data were created by substituting the initial data by random numbers with the distribution N(mean, s), where s = constant · standard deviation of the X data.
- Outliers in the y data were created by adding random numbers with the distribution N(mean, s), where s = constant · standard deviation of the y data.
All considered types of contamination are listed in Table 2.
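One reading of this contamination scheme as code (Python/NumPy sketch; the handling of multi-column X, where every entry of a contaminated row is replaced, is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def contaminate(X, y, p, mean, constant):
    """Contaminate a fraction p of the objects as described above:
    X-outliers replace the rows, y-outliers are added to the responses."""
    Xc, yc = X.copy(), y.copy()
    idx = rng.choice(len(y), size=round(p * len(y)), replace=False)
    Xc[idx] = rng.normal(mean, constant * X.std(), size=Xc[idx].shape)
    yc[idx] += rng.normal(mean, constant * y.std(), size=len(idx))
    return Xc, yc
```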

3. Results and discussion

3.1. Robustness of the EP

There are two alternative approaches to dealing with outliers in regression analysis: outlier diagnostics and robust regression methods (see Fig. 1). Outlier diagnostics aims to detect influential observations and correct or delete them; the LS analysis is then performed on the remaining data. Robust regression methods first fit the 'good' majority of the data; the outliers are then identified as those objects which have large residuals from the robust fit. The analysis can be stopped at this step, or, after outlier rejection, the final LS model can be constructed. This can be motivated as follows: the least squares estimates have the smallest possible variance of any linear unbiased estimator, and their confidence limits are easy to estimate.

The proposed EP approach can be considered as a robust method. Although it makes use of LS analysis and of the residuals from the LS model, it does not estimate the influence of particular observations (or groups of observations) on the LS model, but aims to find the 'good' majority of the data, which is characteristic for robust methods. It ought to be stressed that although the evaluation of the string fitness is based on the residuals from the LS model, the performance of the evaluation function is determined by the fact that the LS residuals for both the model and test sets are taken into account.

A clear distinction should be made between the robust EP model and the final EP model. The robust EP model is the LS model built for the subset of objects selected by the EP. The number of objects in this set equals the assumed number of non-contaminated data, m(1 - p). For instance, if the data set contains 40 objects and the assumed fraction of contamination, p, equals 0.05, the robust model is built for 38 objects; if p equals 0.20, the robust model is built for 32 objects, etc. The final EP model is the model built for the data after identification and rejection of outliers, based on their residuals from the robust model. It means that the robust models can differ to some extent depending on the assumed fraction of data contamination (they are built for (1 - p)m objects), whereas the final models ought to be the same for the studied data set (the real number of outliers does not depend on the assumed fraction of data contamination).

For illustrative purposes let us first consider data set 1 with 40 observations. Thirty-three of them satisfy the simple regression model

y_i = 1.0033 x_i

and the 7 remaining observations (Nos. 1, 2, 3, 4, 10, 20, 40) are the model outliers. Applying LS regression without intercept to this data set leads to the following model:

y_i = 0.6876 x_i

The regression coefficient is strongly influenced by the outliers. The LS model and the residual plot, together with the horizontal bands enclosing the standardized residuals between -2.5 and 2.5, are presented in Fig. 2a.

Fig. 1. Outlier diagnostics (a) and robust methods (b).


From this residual plot one can conclude that only object No. 40 is outlying; for the remaining objects the standardized residuals fall nicely within the band.

Fig. 2. Data set 1: the robust models and the associated residual plots for (a) LS; (b) EP (p = 0.05); (c) EP (p = 0.25); (d) EP (p = 0.49); (e) LMS; (f) RM.

Now let us have a look at the results of the EP, presented in Table 3 and in Figs. 2b-2d. The robust EP models and the residual plots associated with these


models are demonstrated for the same data set 1, to explain the influence of the assumed fraction of data contamination on the method performance. If the assumed fraction of contamination is smaller than the real fraction of outliers present in the data, the robust model is still influenced by some outlying objects (see Fig. 2b). The EP yields robust models only when the assumed fraction of contamination is higher than the real one; in this case p ought to be higher than 0.18 (18%). As can be seen in Table 3 and Figs. 2c and 2d, the robust EP models for p = 0.25 and p = 0.49 are very similar and properly describe the majority of the data. Based on the standardized residuals from these robust models, the seven outliers can be identified and rejected.

Table 3. Robust and final models for the data set with seven outliers

Method        Robust model       Final model        Number of identified outliers
EP p = 0.05   y_i = 0.8010 x_i   y_i = 0.9017 x_i   4
EP p = 0.25   y_i = 1.0039 x_i   y_i = 1.0033 x_i   7
EP p = 0.49   y_i = 1.0083 x_i   y_i = 1.0033 x_i   7
LMS           y_i = 1.0026 x_i   y_i = 1.0033 x_i   7
RM            y_i = 0.9947 x_i   y_i = 1.0033 x_i   7

It ought to be stressed that a value of p higher than 0.49 is unreasonable, as we can then no longer distinguish between the good and bad parts of a data set [4]. We recommend p = 0.49 as the standard assumption about data contamination if we have no idea about the real contamination. This extreme assumption does not lead to a worse final model: a robust model describing 51% of the data (its good part) is good enough to identify the real outliers, and the final EP model can be built for all good objects contained in the data.

In Table 3 and in Figs. 2e and 2f the LMS and RM models with the respective residual plots are presented for comparison. Both robust methods yield robust models similar to the EP model and properly describe the majority of the data. Based on the standardized residuals from these models, all seven outlying observations can be easily identified and rejected, and the final models are exactly the same for LMS, RM and EP.

The robustness of the EP was tested for some well-known real data sets, which have been studied

by a great number of statisticians and for which the real outliers are known [4,11-16].

Stackloss data [11]
The well-known stackloss data set describes the operation of a plant for the oxidation of ammonia to nitric acid and consists of 21 four-dimensional observations (one response and three explanatory variables). The stackloss has to be explained by the rate of operation, the cooling water inlet temperature, and the acid concentration.

Salinity data [15]
This is a set of measurements of water salinity and river discharge taken in North Carolina's Pamlico Sound. The data set consists of 28 four-dimensional observations. The salinity has to be regressed against the salinity lagged by two weeks, the trend (that is, the number of biweekly periods elapsed since the beginning of the spring season), and the volume of river discharge into the sound.

Telephone calls data [4]
Rousseeuw and Leroy [4] found in the Belgian Statistical Survey (published by the Ministry of Economy) a data set containing the total number (in tens of millions) of international calls made. This time series contains heavy contamination for a few years when another recording system was used, giving the total number of minutes of these calls. This caused a large fraction of outliers in the y direction.

Hawkins, Bradu and Kass data [16]
This constructed data set consists of 75 observations in four dimensions. It contains 10 bad and 4 good leverage points and provides a good example of the masking effect.

The residual plots for the data sets described above are presented in Fig. 3. As can be seen immediately from Fig. 3, all the real outliers are identified properly.

Fig. 3. The residual plots for the stackloss [4,11-14], salinity [4,15], telephone calls [4] and Hawkins data [4,16] (real outliers are denoted as *).

Fig. 4. Data set 2: evaluation of the lowest, highest and mean errors versus generation number (size of population, k = 100).

3.2. Method performance: simulation studies

Two aspects of method performance were considered: the number of generations necessary for algorithm convergence, and the stability of the final EP model. As the stopping criterion of the EP algorithm one can choose the convergence within a population (convergence of the lowest, mean and highest RMS errors), the difference between the lowest values of the RMS for consecutive generations, the maximum number of generations, or the expected value of the RMS. In Fig. 4 the evaluation of the lowest, mean and highest values of the RMS versus generation number is presented for data set 2 containing 49% of outlying observations (population size equals 100). After four generations all the RMS errors (highest, mean and lowest) converge to the value determined by the data noise. This means that for the fourth generation the within-population convergence is achieved. Convergence of the EP can also be defined as the difference between the lowest RMS errors for consecutive generations (between-generation convergence). Application of this criterion decreases the number of necessary generations: in the presented example the algorithm can be stopped after three generations, as there is no significant difference between the lowest RMS for generation 2 and generation 3. (What constitutes a significant difference has to be defined by the user, depending on the data set at hand and the goal of the analysis.) Tracing the lowest RMS value for consecutive generations and comparing it with the required value of the RMS can also decrease the number of necessary generations, but estimation of the RMS for a contaminated data set is not always possible and reliable. The maximum-number-of-generations criterion is the least efficient of all the criteria. For instance, the maximum number of generations can be chosen as 10, whereas the within-population convergence can be achieved after 3 generations, and the next 7 generations will not improve the final results. (A minimal sketch of the between-generation criterion is given after the tables below.)

The convergence of the EP depends, of course, on the real, not on the assumed, number of outliers in the data set. The fewer the outliers, the fewer generations are needed to achieve convergence. In Fig. 5 the summary results over 100 runs are presented for data set 2 containing 10, 25 and 49% outlying observations, respectively (population sizes 25 and 50). The larger the population size, the better the convergence of the algorithm, but the total number of necessary string evaluations ought to be taken into account.

Fig. 5. Data set 2: influence of the number of outliers on the EP convergence: (a) size of population, k = 25; (b) size of population, k = 50.

Table 4. Simulation results for data set 2 containing 25% outliers: the RMS errors of the LS and EP model residuals after rejection of outliers, estimated over the 100 runs

No.   Mean   Constant   RMS-LS    RMS-EP
1      0      1         36.6430   0.9588
2      0      2         42.9309   0.9477
3      0      5         68.7485   0.9509
4     10      1         33.9727   0.9582
5     10      2         39.2129   0.9593
6     10      5         70.3681   0.9548
7     50      1         28.7957   0.9569
8     50      2         34.2265   0.9504
9     50      5         64.4118   0.9551

Table 5. Simulation results for data set 2 containing 49% of outliers: the RMS errors of the LS and EP model residuals after rejection of outliers, estimated over the 100 runs

No.   Mean   Constant   RMS-LS    RMS-EP
1      0      1         43.9682   1.0132
2      0      2         53.9228   0.9922
3      0      5         92.5608   0.9825
4     10      1         40.9435   1.0349
5     10      2         46.5844   1.0433
6     10      5         93.5457   0.9949
7     50      1         32.7685   1.0541
8     50      2         43.3587   1.0740
9     50      5         86.4142   0.9671
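The between-generation criterion announced above amounts to a one-line test (Python sketch; tol plays the role of the user-defined 'significant difference'):

```python
def stop_between_generation(best_rms, tol):
    """Stop when the lowest RMS no longer differs significantly between
    two consecutive generations; best_rms is the history of the lowest
    RMS per generation."""
    return len(best_rms) >= 2 and abs(best_rms[-2] - best_rms[-1]) < tol
```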

The stability of the final EP model (after the rejection of outliers) was studied for the simulated data set 2. This set was randomly contaminated by replacing a fraction p of observations with outliers. For each distribution of outliers (see Table 2) the data contamination was repeated 100 times. The EP was run in these studies with the following input parameters: population size k = 50, m_min = 15, m_max = 30, p = 0.49; stopping criterion: between-generation convergence of the lowest RMS. To measure to what extent the estimates b̂_k of the final model differ from the true values b_k, the following summary values over the 100 runs were estimated: the mean value of the estimate

mean_k = (1/100) Σ_{j=1}^{100} b̂_kj

the mean squared error (MSE)

MSE = (1/100) Σ_{j=1}^{100} (b̂_kj - b_k)²

the squared bias

bias² = (mean_k - b_k)²

the variance

variance = (1/100) Σ_{j=1}^{100} (b̂_kj - mean_k)²

and the root mean square error (RMS) of residuals from the final model for the calibration and test data sets

RMS = ( (1/m'') Σ_{i=1}^{m''} r_i² )^{1/2}

where m'' is the number of objects remaining after rejection of the identified outliers. The squared bias and the variance are the two components of the MSE:

MSE = bias² + variance
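These summary statistics are straightforward to compute per coefficient (Python/NumPy sketch; names are illustrative):

```python
import numpy as np

def summarize_runs(b_hat, b_true):
    """Summary over the Monte Carlo runs for one coefficient: mean estimate,
    MSE, squared bias and variance (so that MSE = bias^2 + variance)."""
    b_hat = np.asarray(b_hat, dtype=float)
    mean = b_hat.mean()
    mse = np.mean((b_hat - b_true) ** 2)
    bias2 = (mean - b_true) ** 2
    var = np.mean((b_hat - mean) ** 2)
    return mean, mse, bias2, var
```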

Table 6. Simulation results for data set 2 containing 25% outliers: the mean estimated value, the mean squared error, the bias and the variance of the model estimates over the 100 runs

                      LS model                                   EP model
Outliers   Estimate   Mean      MSE       Bias      Variance     Mean      MSE       Bias      Variance
N(10, 1)   b1         1.062520  0.014252  0.004627  0.009625     0.994473  0.000004  0.000000  0.000004
           b2         0.907179  0.015976  0.009068  0.006908     1.001146  0.000006  0.000002  0.000004
           b3         1.022718  0.009947  0.000572  0.009375     1.000174  0.000005  0.000001  0.000004
           b4         0.942480  0.014523  0.003402  0.011121     1.002080  0.000007  0.000002  0.000005
           b5         1.136842  0.031120  0.018562  0.012558     0.999463  0.000004  0.000001  0.000003
N(10, 2)   b1         1.034592  0.029530  0.001608  0.027922     0.999743  0.000004  0.000000  0.000004
           b2         0.927766  0.032172  0.005570  0.026602     1.000950  0.000005  0.000002  0.000003
           b3         0.975525  0.028613  0.000541  0.028072     1.000527  0.000006  0.000002  0.000004
           b4         0.954157  0.033494  0.002176  0.031318     1.002372  0.000007  0.000002  0.000005
           b5         1.129903  0.048552  0.016719  0.031833     0.999743  0.000005  0.000001  0.000004
N(10, 5)   b1         0.915096  0.193724  0.106768  0.086956     0.993888  0.000004  0.000000  0.000004
           b2         0.907852  0.187483  0.071942  0.115541     1.001214  0.000006  0.000001  0.000005
           b3         0.939664  0.125371  0.003497  0.121874     1.000210  0.000004  0.000002  0.000002
           b4         0.981852  0.187483  0.000359  0.187124     1.002565  0.000010  0.000003  0.000007
           b5         1.044596  0.193724  0.001936  0.191788     0.999512  0.000005  0.000001  0.000004

Table 7. Simulation results for data set 2 containing 49% outliers: the mean estimated value, the mean squared error, the bias and the variance of the model estimates over the 100 runs

                      LS model                                   EP model
Outliers   Estimate   Mean      MSE       Bias      Variance     Mean      MSE       Bias      Variance
N(10, 1)   b1         1.087518  0.023221  0.008652  0.014569     0.994163  0.000007  0.000000  0.000007
           b2         0.928700  0.022853  0.005431  0.017422     1.002635  0.000010  0.000000  0.000010
           b3         1.020178  0.011038  0.000457  0.010581     0.999786  0.000009  0.000001  0.000008
           b4         0.999062  0.023379  0.000003  0.023376     1.001646  0.000012  0.000001  0.000011
           b5         1.173172  0.044043  0.029782  0.014261     1.000217  0.000008  0.000000  0.000008
N(10, 2)   b1         1.003176  0.036273  0.000075  0.036198     0.994661  0.000009  0.000000  0.000009
           b2         0.909959  0.029004  0.008546  0.020458     1.002835  0.000011  0.000000  0.000011
           b3         0.951769  0.028847  0.002212  0.026635     0.999557  0.000008  0.000001  0.000007
           b4         0.935020  0.030670  0.004327  0.026343     1.001109  0.000009  0.000000  0.000009
           b5         1.039995  0.040141  0.001552  0.038589     1.000294  0.000009  0.000000  0.000009
N(10, 5)   b1         1.005535  0.165337  0.000122  0.165219     0.994104  0.000008  0.000000  0.000008
           b2         0.860366  0.190526  0.020173  0.170353     1.001688  0.000014  0.000000  0.000014
           b3         0.917884  0.084850  0.006548  0.078302     0.999780  0.000008  0.000001  0.000007
           b4         0.820254  0.164772  0.032597  0.132175     1.001971  0.000017  0.000002  0.000015
           b5         1.024708  0.186396  0.000581  0.185815     1.000287  0.000014  0.000000  0.000014

The root mean square errors of the residuals from the LS and the final EP models for all the studied cases of contamination are listed in Tables 4 and 5. The statistical parameters reflecting model stability are presented for the two fractions of contamination, p = 0.25 and p = 0.49, and the outlier distributions N(10, constant · SD) (see Tables 6 and 7). As follows from Tables 4-7, in all the investigated cases the observed model stability and accuracy are very good. The final LS models do not depend on the distribution of outliers. In all cases the outliers are identified properly and rejected.

4. Conclusions

The observed performance of the algorithm is determined by the applied fitness function and the way the population is reproduced. It makes the EP a valuable alternative to the existing robust methods, especially when dealing with data sets with many objects and variables, because the computation time of the classical robust methods increases with increasing numbers of objects and variables in the data set. The proposed subset selection algorithm can be applied to any LS regression method, among them PCR and PLS. We intend to show this in a forthcoming publication.


Acknowledgements

The author wishes to thank Professor D.L. Massart for the inspiring academic atmosphere in his research group, which significantly contributed to the completion of this research paper, and the Dienst voor Programmering van het Wetenschapsbeleid for financial assistance.

References

[1] D.A. Belsley, E. Kuh and R.E. Welsch, Regression Diagnostics, Wiley, New York, 1980.
[2] S. Weisberg, Applied Linear Regression, Wiley, New York, 1985.
[3] P.J. Huber, Robust Statistics, Wiley, New York, 1981.
[4] P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.
[5] J.H. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI, 1975.
[6] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, New York, 1992.
[7] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[8] C.B. Lucasius and G. Kateman, Understanding and using genetic algorithms. Part I. Concepts, properties and context, Chemometrics and Intelligent Laboratory Systems, 19 (1993) 1-33.
[9] T.H. Li, C.B. Lucasius and G. Kateman, Optimization of calibration data with the dynamic genetic algorithm, Analytica Chimica Acta, 268 (1992) 123-134.
[10] C.L. Karr, D.A. Stanley and B.J. Scheiner, Genetic Algorithm Applied to Least Squares Curve Fitting, US Bureau of Mines Report, Internal Bureau of Mines, Pittsburgh, PA.
[11] K.A. Brownlee, Statistical Theory and Methodology in Science and Engineering, 2nd edn., Wiley, New York, 1965.
[12] N.R. Draper and H. Smith, Applied Regression Analysis, Wiley, New York, 1966.
[13] R.D. Cook, Influential observations in regression, Journal of the American Statistical Association, 74 (1979) 169-174.
[14] G. Li, Robust regression, in D. Hoaglin, F. Mosteller and J. Tukey (Editors), Exploring Data Tables, Trends, and Shapes, Wiley, New York, 1985.
[15] D. Ruppert and R.J. Carroll, Trimmed least squares estimation in the linear model, Journal of the American Statistical Association, 75 (1980) 828-838.
[16] D.M. Hawkins, D. Bradu and G.V. Kass, Location of several outliers in multiple regression data using elemental sets, Technometrics, 26 (1984) 197-208.