Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas © 1995 Elsevier Science B.V. All rights reserved.
Chapter 7

Classification of materials

Ruqin Yu, Lixian Sun and Yizeng Liang

Department of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, People's Republic of China

1. INTRODUCTION

Cluster analysis, as an unsupervised pattern recognition method, is an important tool in the exploratory analysis of chemical data. It has found wide application in many fields such as disease diagnosis, food analysis, drug analysis, classification of materials, etc. Hierarchical and optimization-partition algorithms are the most widely used methods of cluster analysis [1]. One of the major difficulties for these conventional clustering algorithms is to guarantee a globally optimal solution to the corresponding problem. Simulated annealing, as a stochastic optimization algorithm [2], could provide a promising way to circumvent such difficulties. Recently, generalized simulated annealing has been introduced into chemometrics for wavelength selection [3] and calibration sample selection [4]. The use of cluster analysis methods based on simulated annealing for chemometric research is therefore of considerable interest. Three modified clustering algorithms based on simulated annealing, the K-means algorithm and principal component analysis (PCA) are proposed and applied in chemometric research. A modified stopping criterion and perturbation method are also proposed. These algorithms are all tested by using simulated data generated on a computer and then applied to the classification of materials such as Chinese tea, bezoar (the traditional Chinese medicine calculus bovis), beer samples and biological samples. The results compare favourably with those obtained by conventional clustering methods.

2. CLUSTER ANALYSIS BY SIMULATED ANNEALING

2.1. Principle of cluster analysis by simulated annealing

Simulated annealing (SA), which derives its name from the statistical mechanics of simulating atomic equilibrium at a fixed temperature, belongs to the category of stochastic optimization algorithms. According to statistical mechanics, at a given temperature Ti and under thermal equilibrium, the probability fi of a given configuration i obeys the Boltzmann-Gibbs distribution:

\[ f_i = k \exp(-E_i / T_i) \tag{1} \]

where k is a normalization constant and Ei is the energy of the configuration i [5, 6]. SA was proposed by Kirkpatrick et al. [7] as a method for solving combinatorial optimization problems which minimize or maximize a function of many variables. The idea was derived from an algorithm proposed by Metropolis et al. [8], who simulated the process by which atoms reach thermal equilibrium at a given temperature T. The current configuration of the atoms is perturbed randomly and a trial configuration is obtained according to the method of Metropolis et al. [8]. Let Ec and Et denote the energies of the current and trial configurations, respectively. If Et < Ec, which means that a lower energy has been reached, the trial configuration is accepted as the current configuration. If Et > Ec, the trial configuration is accepted with a probability proportional to exp(-(Et - Ec)/T). The perturbation process is repeated until the atoms reach thermal equilibrium, i.e. the configuration distribution determined by the Boltzmann distribution at the given temperature. New lower-energy states of the atoms are obtained as T is decreased and the Metropolis simulation process is repeated. When T approaches zero, the atomic lowest-energy state, or ground state, is obtained.

Cluster analysis can be treated as a combinatorial optimization problem. Selim and Alsultan [5] as well as Brown and Huntley [9] described the analogy between SA and cluster analysis. The atomic configuration of SA corresponds to the assignment of patterns or samples to clusters in cluster analysis. The energy E of the atomic configuration and the temperature T in the SA process correspond to the objective function Φ and the control parameter T in cluster analysis, respectively. Suppose n samples or patterns in d-dimensional space are to be partitioned into k clusters or groups. Different clustering criteria could be adopted; here the sum of the squared Euclidean distances from all samples to their corresponding cluster centers is used as the criterion.

Let A = [a_{ij}] (i = 1, 2, ..., n; j = 1, 2, ..., d) be an n × d sample data matrix, and W = [w_{ig}] (i = 1, 2, ..., n; g = 1, 2, ..., k) be an n × k cluster membership matrix, where w_{ig} = 1 if sample i is assigned to cluster g and w_{ig} = 0 otherwise, so that \(\sum_{g=1}^{k} w_{ig} = 1\). Let Z = [z_{gj}] (g = 1, 2, ..., k; j = 1, 2, ..., d) be a k × d matrix of cluster centers, where

\[ z_{gj} = \sum_{i=1}^{n} w_{ig} a_{ij} \Big/ \sum_{i=1}^{n} w_{ig} \tag{2} \]

The sum of the squared Euclidean distances is used as the objective function to be minimized:

\[ \Phi(w_{11}, \ldots, w_{1k}; \ldots; w_{n1}, \ldots, w_{nk}) = \sum_{i=1}^{n} \sum_{g=1}^{k} w_{ig} \sum_{j=1}^{d} (a_{ij} - z_{gj})^2 \tag{3} \]
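For concreteness, equations (2) and (3) translate directly into code. The following minimal Python sketch (an illustration written for this text, not the authors' original program; the function names are ours) computes the cluster centers and the objective function Φ from a vector of class labels:

    import numpy as np

    def cluster_centers(A, labels, k):
        """Equation (2): z_g is the mean of the samples assigned to
        cluster g. Assumes every cluster is non-empty."""
        return np.array([A[labels == g].mean(axis=0) for g in range(k)])

    def objective(A, labels, k):
        """Equation (3): sum of squared Euclidean distances from each
        sample to the center of its own cluster."""
        Z = cluster_centers(A, labels, k)
        return sum(float(((A[labels == g] - Z[g]) ** 2).sum())
                   for g in range(k))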

The clustering is carried out in the following steps:

Step 1. Set initial values of the parameters. Let T1 be the initial temperature, T2 the final temperature, μ the temperature multiplier, N the desired number of Metropolis iterations, IGM the counter of Metropolis iterations and i the counter of samples in the sample set. Take T1 = 10, T2 = 10^-99, μ = 0.7-0.9, N = 4n, IGM = 0 and i = 0.

Step 2. Assign an initial class label among the k classes to each of the n samples randomly and calculate the corresponding value of the objective function Φ. Let both the optimal objective function value Φb and the current objective function value Φc be Φ, and let the corresponding cluster membership matrix of all samples be Wb. Tb is the temperature corresponding to the optimal objective function Φb; Tc and Wc are the temperature and cluster membership matrix, respectively, corresponding to the current objective function Φc. Let Tc = T1, Tb = T1 and Wc = Wb.

Step 3. While the counter of Metropolis sampling steps is less than N, i.e. IGM < N, go to step 4; otherwise, go to step 7.

Step 4. Let flag = 1 and let p be the threshold probability: if IGM ≤ N/2, p = 0.80; otherwise, p = 0.95. A trial assignment matrix Wt is obtained from the current assignment Wc in the following way. If i > n, let i = i - n; otherwise, let i = i + 1. Take sample i from the sample set; the current class assignment (wig) of this sample is denoted by f (f being one of the k classes), i.e. f = wig. Draw a random number u (u = rand, where rand is a random number uniformly distributed in the interval [0, 1]). If u > p, generate a random class label r in the range [1, k] with r ≠ f, move sample i from class f to class r, let wig = r and flag = 2; otherwise, take another sample and repeat the above process until flag = 2.

Step 5. Let the trial assignment obtained after the above perturbation be Wt and calculate the objective function value Φt of this assignment. If Φt ≤ Φc, let Wc = Wt and Φc = Φt; if Φt < Φb, then Φb = Φt, Wb = Wt and IGM = 0.

Step 6. Produce a random number y (y = rand); if y < exp(-(Φt - Φc)/Tc), then Wc = Wt and Φc = Φt; otherwise, IGM = IGM + 1. Go to step 3.

Step 7. Let Tc = μTc, IGM = 0, Φc = Φb and Wc = Wb. If Tc < T2 or Tc/Tb < 10^-10, stop; otherwise, go back to step 3.

A flow chart of the program is shown in Figure 1.
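Steps 1-7 condense into a short program. The sketch below is a minimal Python rendering of the procedure, using the objective function sketched after equation (3) and μ = 0.8 as a representative value from the 0.7-0.9 range of step 1; it is an illustration of the scheme described above, not the authors' original implementation, and it assumes k ≥ 2. An init argument is included so that a non-random starting partition can also be supplied (this is used in section 3).

    import numpy as np

    def sa_cluster(A, k, init=None, T1=10.0, T2=1e-99, mu=0.8, seed=0):
        rng = np.random.default_rng(seed)
        n = len(A)
        N = 4 * n                                   # Metropolis iterations per temperature
        labels = (rng.integers(0, k, size=n) if init is None
                  else np.asarray(init).copy())     # step 2: initial assignment
        cur = best = objective(A, labels, k)
        cur_labels, best_labels = labels.copy(), labels.copy()
        Tc = Tb = T1
        i = 0
        while Tc >= T2 and Tc / Tb >= 1e-10:        # step 7: stopping criterion
            IGM = 0
            while IGM < N:                          # step 3
                p = 0.80 if IGM <= N // 2 else 0.95 # step 4: threshold probability
                trial = cur_labels.copy()
                while True:                         # cycle through samples until one moves
                    i = i + 1 if i < n else 1
                    if rng.random() > p:
                        f = trial[i - 1]
                        r = int(rng.integers(0, k))
                        while r == f:               # new class r must differ from f
                            r = int(rng.integers(0, k))
                        trial[i - 1] = r
                        break
                phi_t = objective(A, trial, k)
                if phi_t <= cur:                    # step 5: downhill move accepted
                    cur, cur_labels = phi_t, trial
                    if phi_t < best:                # new overall best found
                        best, best_labels = phi_t, trial.copy()
                        Tb, IGM = Tc, 0
                elif rng.random() < np.exp(-(phi_t - cur) / Tc):
                    cur, cur_labels = phi_t, trial  # step 6: uphill move accepted
                else:
                    IGM += 1                        # uphill move rejected
            Tc *= mu                                # step 7: cool, restart from the best
            cur, cur_labels = best, best_labels.copy()
        return best_labels, best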

Figure 1. Flow chart of a cluster analysis by simulated annealing.

2.2. Treatment of simulated data

The algorithm of cluster analysis based on SA was tested by using simulated data generated on the computer and composed of 30 samples with 2 variables (x, y) each. These samples were supposed to be divided into 3 classes. The data were processed by using cluster analysis based on SA, hierarchical cluster analysis [10] and the K-means algorithm [11-12]. The optimal objective function (Φb) values obtained are shown in Table 1. Comparing the results in column 4 with those of column 7 in Table 1, one notices that for SA there is only one disagreement of the class assignment, for sample No. 6, which is actually misclassified by all three methods. This shows that the SA method is preferable to hierarchical cluster analysis and the K-means algorithm. The objective function values of the K-means algorithm for 50 iterative cycles are listed in Table 2; the lowest value is 168.7852. One notices that the behavior of the K-means algorithm is influenced by the choice of initial cluster centers, the order in which the samples are taken and, of course, the geometrical properties of the data. The tendency of sinking into local optima is obvious. Clustering by SA can provide more stable computational results.

2.3. Classification of tea samples

Liu et al. [13] studied the classification of Chinese tea samples by using hierarchical cluster analysis and principal component analysis. In their study, three categories of tea of Chinese origin were used: green, black and oolong. Each category contains two varieties: Chunmee (C) and Hyson (H) for green tea, Keemun (K) and Feng Quing (F) for black tea, Tikuanyin (T) and Se Zhong (S) for oolong tea. Each sample in these groups was assigned a number according to its quality as judged by tea experts on the basis of taste, the best quality being numbered 1. One notices that the assignment of quality by the tea experts is valid only for samples belonging to the same category and variety. The names of the samples are composed of the first letter of the variety name, followed by the number indicating the quality. The data, which involve the concentrations (% w/w, dry weight) of cellulose, hemicellulose, lignin, polyphenols, caffeine and amino acids for the various tea samples, were processed by using cluster analysis based on SA and the K-means algorithm. The results obtained by using the two methods were compared with those given by Liu et al. [13] using hierarchical cluster analysis. The results are summarized in Table 3, where the numbers 1, 2, 3 refer to the groups the tea samples are classified into (tea samples denoted by the same number are classified into the same group).

Table 1
Comparison of results obtained by using different methods in classification of simulated data

                Simulated data              Classification results
No.    x         y         Actual class   Hierarchical   K-means*   Simulated
                           of simulated   clustering                annealing
                           data
1      1.0000    0         1              1              1          1
2      3.0000    3.0000    1              2              1          1
3      5.0000    2.0000    1              2              1          1
4      0.0535    4.0000    1              1              1          1
5      0.3834    0.4175    1              1              1          1
6      5.8462    3.0920    1              2              2          2
7      4.9103    4.2625    1              2              2          1
8      0         2.6326    1              1              1          1
9      4.0000    5.0000    1              2              1          1
10     2.0000    1.0000    1              1              1          1
11     11.000    5.6515    2              2              2          2
12     6.0000    5.2727    2              2              2          2
13     8.0000    3.2378    2              2              2          2
14     7.0000    2.4865    2              2              2          2
15     10.000    6.0000    2              2              2          2
16     9.5163    3.9866    2              2              2          2
17     6.9478    1.5007    2              2              2          2
18     11.5297   3.9410    2              2              2          2
19     10.8278   2.0159    2              2              2          2
20     9.0000    4.7362    2              2              2          2
21     7.2332    12.000    3              3              3          3
22     6.5911    9.0000    3              3              3          3
23     7.2693    9.5373    3              3              3          3
24     4.1537    8.0000    3              3              3          3
25     3.5344    11.000    3              3              3          3
26     7.5546    11.6248   3              3              3          3
27     4.7147    8.0910    3              3              3          3
28     5.0269    7.0000    3              3              3          3
29     4.1809    9.8870    3              3              3          3
30     6.3858    9.4997    3              3              3          3

Objective function
value, Φb                  169.3275       189.3531       168.7852   166.9288

* The result with the lowest value of Φb among 50 independent iteration cycles.

Table 2
Φb values obtained by using the K-means algorithm with 50 different random initial clusterings

307.3102  177.2062  168.7852  169.3275  181.4713
181.4713  347.3008  169.3275  306.3492  173.0778
173.0778  178.9398  347.3008  169.3275  173.0778
177.2062  169.3275  173.0778  171.6456  177.2062
181.4713  178.9398  176.4388  168.7852  373.6821
173.0778  202.4826  168.7852  173.0778  171.6456
181.4713  173.0778  306.3492  169.3275  177.2062
202.4826  373.6821  173.0778  202.4826  178.9398
347.3008  169.3275  181.4713  176.4388  202.4826
173.0778  178.9398  177.2062  347.3008  177.2062

The objective function Φb for the hierarchical clustering obtained by Liu et al. [13] was calculated according to equation 3. As shown in Table 3, different results could be obtained by using different methods with the same criterion. The objective function Φb obtained by using cluster analysis based on SA was the lowest among all the methods listed in Table 3. It seems that cluster analysis based on SA can really provide a globally optimal result. Column 4 of Table 3 shows the classification results of the K-means algorithm with the lowest Φb among 50 independent iterative cycles. Comparing the results in column 5 with those of column 2 in Table 3, one notices that there is only one case of disagreement of the class assignment, for the tea sample K2: K2 was assigned to class 1 and class 2 by hierarchical clustering and SA, respectively. As mentioned above, the assignment of quality by the tea experts is valid only for samples belonging to the same category and variety.

Table 3
Comparison of results obtained by using different methods in classification of Chinese tea samples

                        Classification results
Sample   Hierarchical   K-means^a   K-means^b   Simulated
         clustering                             annealing
C1       1              1           1           1
C2       1              1           1           1
C3       1              1           1           1
C4       1              2           1           1
C5       2              2           1           2
C6       2              2           1           2
C7       2              2           1           2
H1       1              1           1           1
H2       1              1           1           1
H3       1              2           1           1
H4       2              2           1           2
H5       2              2           1           2
K1       1              2           1           1
K2       1              2           1           2
K3       2              2           1           2
K4       2              2           1           2
F1       1              1           1           1
F2       1              1           1           1
F3       1              2           1           1
F4       1              2           1           1
F5       2              2           1           2
F6       2              2           1           2
F7       2              2           1           2
T1       3              3           2           3
T2       3              3           2           3
T3       3              3           3           3
T4       3              3           3           3
S1       3              3           2           3
S2       3              3           2           3
S3       3              3           3           3
S4       3              3           3           3

Objective function
value, Φb    50.8600       140.6786    119.2311    50.6999

a. The result obtained in one iteration cycle with arbitrary initial clustering.
b. The result with the lowest value of Φb among 50 independent iteration cycles.

According to hierarchical clustering, sample K2 is classified as 1, meaning K2 is at least as good as C4, H3, etc. SA gave a classification of class 2 for K2, qualifying it as of the same quality as C5, H4, etc. As the value of Φb for SA is slightly lower, the results of the proposed algorithm seem more appropriate in describing the real situation. The clustering results obtained by the K-means algorithm seem not sufficiently reliable.

2.4. Some computational aspects of the simulated annealing algorithm

Selim and Alsultan [5] pointed out that no stopping point was computationally available for cluster analysis based on SA, so the search for suitable stopping criteria for SA cluster analysis deserves further investigation. It is rather time-consuming to continue the calculation until Tc < T2, as theoretically T2 itself should approach zero (T2 = 10^-99 is taken here); in general, the exact value of T2 is unknown in practical situations. The present authors propose a stopping criterion based on the ratio of the temperature Tb, which corresponds to the optimal objective function Φb, to the current temperature Tc: when Tc/Tb < 10^-10, one stops the computation (step 7, vide supra). For example, during the data treatment one stops the computation when Tc = 3.8340 × 10^-54 and Tb = 9.6598 × 10^-44. This is a convenient criterion; it saves computing time substantially compared to the traditional approach using an extremely small T2.
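The quoted example can be checked directly (an illustrative two-line Python verification):

    Tc, Tb = 3.8340e-54, 9.6598e-44
    print(Tc / Tb, Tc / Tb < 1e-10)   # ~3.97e-11, True: the run stops here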

The methods used to perturb the trial states of the sample class assignment, as well as the way the perturbation is introduced into the SA process, also deserve consideration. The present authors propose a perturbation method based on changing the class assignment of only one sample at a time (Figure 1). Such a method seems to be more effective, as usually only the class assignment of one sample is wrong, and only the class assignment of this sample has to be changed. Brown and Huntley [9] took the sample to be perturbed on a random basis; the corresponding perturbation operation is described as follows:

    do perturbation operation to obtain Wt
        flag = 1;
        if IGM <= N/2; p = 0.8; else; p = 0.95; end
        while flag ~= 2
            i = rand(1, n);    % i is a random integer in the interval [1, n]
            f = wig; u = rand;
            if u > p; wig = r; % r is a random class label, r ~= f
                flag = 2;
            end
        end

It seems that in such a method an equal opportunity of perturbation for each sample might not really be guaranteed. Every sample has an equal opportunity of perturbation in step 4 (vide supra), which takes less computation time to obtain comparable results (Table 4). On the other hand, Selim and Alsultan [5] proposed the following perturbation method, in which the class assignments of several samples might be changed simultaneously in each perturbation:

    do perturbation operation to obtain Wt
        flag = 1;
        if IGM <= N/2; p = 0.8; else; p = 0.95; end
        i = 0;
        while flag ~= 2 | i <= n   % may change several samples per perturbation
            if i > n; i = i - n; else; i = i + 1; end
            f = wig; u = rand;
            if u > p; wig = r;     % r is a random class label, r ~= f
                flag = 2;
            end
        end

The comparison of the different perturbation operations is shown in Table 4. One notices that the method proposed by Selim and Alsultan [5] takes the longest time, i.e. it converges to the global optimal solution rather slowly, while the present method converges quite quickly. Cluster analysis based on SA is a very useful clustering algorithm, although it has some shortcomings. As mentioned above, the modified algorithm is more effective than the K-means algorithm and is also preferable to hierarchical cluster analysis; a globally optimal solution may be obtained by using it. A feasible stopping criterion and perturbation method are important aspects of the computational algorithm. The present authors use minimization of the sum of the squared Euclidean distances as the clustering criterion. As Euclidean distances are suitable only for spherically distributed data sets, the search for other clustering criteria suitable for different kinds of data sets for use in SA cluster analysis deserves further investigation.

Table 4
Comparison of results obtained by using different perturbation methods in classification of simulated data

                      Classification results
Actual class of    Selim's     Brown's     Present
simulated data     method      method      method
1                  1           1           1
1                  1           1           1
1                  2           2           1
1                  1           1           1
1                  1           1           1
1                  2           2           2
1                  2           1           1
1                  1           1           1
1                  1           1           1
1                  1           1           1
2                  2           2           2
2                  1           2           2
2                  2           2           2
2                  1           2           2
2                  2           2           2
2                  2           2           2
2                  2           2           2
2                  2           2           2
2                  2           2           2
2                  2           2           2
3                  3           3           3
3                  3           3           3
3                  3           3           3
3                  2           3           3
3                  3           3           3
3                  3           3           3
3                  3           3           3
3                  2           3           3
3                  3           3           3
3                  3           3           3

Objective function (Φb)    209.4124    173.7650    166.9288
Computation time (hrs)     23.5069     11.7061     1.1627


3. CLUSTER ANALYSIS BY K-MEANS ALGORITHM AND SIMULATED ANNEALING

3.1. Introduction

From the above results, we notice that cluster analysis based on SA is a very useful algorithm, but it is still time-consuming, especially for large data sets. Although we have made some improvements on the conventional algorithm and proposed a modified cluster analysis based on simulated annealing (SAC), further modification of this sort of algorithm is of considerable interest. On the basis of the above work, a modified clustering algorithm based on a combination of simulated annealing with the K-means algorithm (SAKMC) can be used; in this procedure the initial class labels among the k classes for all n samples are obtained by using the K-means algorithm instead of random assignment, as in the sketch below. A flow chart of the SAKMC program is shown in Figure 2. The algorithm is first tested on two simulated data sets and then used for the classification of calculus bovis samples and Chinese tea samples. The results show that the algorithm, which obtains a global optimum in a shorter computation time, compares favorably with the original SAC and the K-means algorithm.
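In code, the only change relative to the SAC sketch of section 2 is the source of the initial labels. A minimal illustration, assuming the hypothetical sa_cluster function sketched earlier and using scikit-learn's KMeans purely as a convenient K-means implementation (any implementation, such as the one sketched in the Appendix, would do):

    from sklearn.cluster import KMeans

    def sakmc(A, k, **kwargs):
        """SAKMC sketch: seed the SA clustering with a K-means
        partition instead of a random assignment."""
        init = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A)
        return sa_cluster(A, k, init=init, **kwargs)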

3.2. Treatment of simulated data

The simulated data sets were composed of 30 samples (data set I) and 60 samples (data set II) with 2 variables (x, y) each (see Figure 3 and Figure 4, respectively). These samples were supposed to be divided into 3 classes. The data were processed by using cluster analysis based on simulated annealing (SAC) and cluster analysis by the K-means algorithm and simulated annealing (SAKMC), respectively. As shown in Table 5, the computation times needed to obtain the corresponding optimal objective function values (Φb) for clustering based on the K-means algorithm, simulated annealing (SAC) and SAKMC are about 3 min, 70 min and 55 min for data set I, and 5 min, 660 min and 183 min for data set II, respectively. From these results, one notices that the larger the data set, the longer the computation time.



Figure 2. Flow chart of a cluster analysis by K-means algorithm and simulated annealing.

Table 5
The comparison of results of different approaches

                               Simulated data set I           Simulated data set II
                               K-means   SAC       SAKMC      K-means    SAC        SAKMC
Objective function value, Φb   63.7707   60.3346   60.3346    114.6960   114.5048   114.5048
Computation time (min)         3         70        55         5          660        183

Figure 3. A plot of simulated data set I.

Figure 4. A plot of simulated data set II.

3.3. Classification of calculus bovis samples

Calculus bovis or bezoar is a widely used traditional Chinese medicine suitable for the treatment of fever and sore throat. The microelement contents in natural and cultivated calculus bovis samples were determined by using a Jarrell-Ash 96-750 ICP instrument [14]. The data after normalization and the results obtained by using different methods are listed in Table 6. The K-means algorithm takes the shortest time to obtain its final result (Φb = 96.7396), which is really a local optimal solution: cultivated calculus bovis samples No. 4 and No. 7 were misclassified as natural ones by the K-means algorithm. Both SAC and SAKMC reach a global optimal solution (Φb = 94.3589); only sample No. 4, belonging to the cultivated calculus bovis, was classified as a natural one at Φb = 94.3589. If sample No. 4 is classified as a cultivated one, the corresponding objective function Φb would be 95.2626; this indicates that sample No. 4 is closer to natural calculus bovis. From the above results, one notices that calculus bovis samples can be correctly classified into natural and cultivated ones on the basis of their microelement contents by means of SAC and SAKMC, except for sample No. 4. The computation times for SAC and SAKMC were 21 and 12 minutes, respectively.

Table 6
The normalized data of microelement contents in natural and cultivated calculus bovis samples

Sample No.*    Cr        Cu        Mn        Ti        Zn        Pb
 1             0.2601    1.2663   -0.3583   -0.8806    2.1462    0.7075
 2            -0.5501   -0.4793    0.4264   -0.7349    1.6575    0.9897
 3            -0.2094    1.4416    1.3222   -0.9609    1.3797    0.2796
 4            -0.1412   -0.7887   -0.3329   -0.9448   -0.4879    2.3681
 5            -0.0352    0.3886    0.9366   -0.8966   -0.3549    0.5646
 6             0.4039   -0.1633    0.3890   -0.2495   -0.4589   -0.6124
 7            -0.8455    1.6040   -0.8126   -0.2655   -0.5768   -0.0425
 8            -0.5539   -0.9086   20.7482    0.2371   -0.4448   -0.0360
 9            -0.5880   -0.6811   -0.4788    1.6784   -0.5340   -1.0765
10            -1.5648   -0.7790   -1.0007   -0.4273   -0.5804   -0.3776
11             0.0178    0.9968   -0.9148    0.6422   -0.5779   -0.4834
12             2.3159   -0.8352   -0.6767    1.9193   -0.5841   -1.1406
13             1.4905   -1.0622    2.2487    0.8831   -0.5835   -1.1406

Sample No.*    Mo                  Ca        K         Na
 1             0.3092    2.4408   -0.8782   -0.7953    2.7953
 2             0.3092    1.1206   -0.9742   -0.8503   -0.7975
 3             1.8715    0.7906   -0.7823   -0.7128   -0.9002
 4            -0.1121    0.8089   -0.3985    0.4146    0.3316
 5            -0.6738   -0.4288    0.7528    1.2395   -0.0790
 6            -0.9546   -0.0437    1.3284    0.6620   -0.3356
 7             1.1693   -0.5755   -0.2066    0.3871   -0.6435
 8            -0.5334   -0.8322   -1.7122    0.8487   -0.4382
 9            -0.9546   -0.4471    1.0406   -0.3026    0.7422
10            -0.1121   -0.8505    1.9544    0.6620   -0.2329
11             1.5906   -0.4838   -0.7953    0.1671    0.7422
12            -0.9546   -0.6580   -1.1660   -1.4827   -0.7462
13            -0.9546   -0.8413   -0.9742   -0.8503   -0.4382

* No. 1-3 are cultivated calculus bovis and No. 4-13 are natural calculus bovis samples.

3.4. Classification of tea samples

The data involving the concentrations (% w/w, dry weight) of cellulose, hemicellulose, lignin, polyphenols, caffeine and amino acids for the various tea samples were processed by using SAC, SAKMC and the K-means algorithm. The results obtained by using the three methods were compared with those given by Liu et al. [13] using hierarchical cluster analysis. The results obtained are summarized as follows:

Hierarchical clustering: Class 1: C1-4, H1-3, K1-2, F1-4; Class 2: C5-7, H4-5, K3-4, F5-7; Class 3: T1-4, S1-4. Objective function value Φb: 50.8600 (calculated according to equation 3). Computation time: 10 min.

K-means: Class 1: C1-7, H1-5, K1-4, F1-7; Class 2: T1-2, S1-2; Class 3: T3-4, S3-4. Objective function value Φb: 119.2311. Computation time: 6 min.

SAC and SAKMC: Class 1: C1-4, H1-3, K1, F1-4; Class 2: C5-7, H4-5, K2-4, F5-7; Class 3: T1-4, S1-4. Objective function value Φb: 50.6999. Computation time: 107 min (SAC); 68 min (SAKMC).

One notices that there is only one case of disagreement of the class assignment, for the tea sample K2: K2 was classified into class 1 by hierarchical clustering and K-means, and into class 2 by SAC and SAKMC. Both SAC and SAKMC give a relatively low objective function value. The K-means algorithm is inferior, as shown by its objective function: it puts all the green and black teas into the same group and separates the oolong teas into a high and a low quality group.

3.5. Comparison of methods with and without simulated annealing

Comparing the results obtained for the simulated data, the calculus bovis samples and the tea samples, one notices that different results can be obtained by using different methods with the same criterion. The K-means algorithm in these cases converges to local optimal solutions in the shortest time; its behavior is influenced by the choice of initial cluster centers, the order in which the samples are taken and the geometrical properties of the data, and its tendency of sinking into local optima is obvious. Both SAC and SAKMC obtain global optimal solutions, but SAKMC converges faster than SAC. Both adopt the same stopping criterion; the main reason why SAKMC converges faster is that it uses the result of the K-means algorithm as the initial guess. As the K-means algorithm is very quick, one gets faster convergence with the SAKMC version.

4. CLASSIFICATION OF MATERIALS BY PROJECTION PURSUIT BASED ON GENERALIZED SIMULATED ANNEALING

4.1. Introduction

Classical principal component analysis (PCA) is the basis of many important methods for the classification of materials, since eigenvector plots are extremely useful for displaying n-dimensional data while preserving the maximal amount of information in a space of reduced dimension. The classical PCA method is, unfortunately, non-robust, as the variance is adopted as the optimality criterion. Sometimes a principal component might be created just by the presence of one or two outliers [15]. So if outliers exist, the coordinate axes of the principal component space might be misdetermined by classical PCA, and a reliable classification of materials could not be obtained. The robust approach to PCA using simulated annealing has been proposed and discussed in detail in the chapter "Robust principal component analysis and constrained background bilinearization for quantitative analysis". Projection pursuit (PP) is used to carry out PCA with a criterion which is more robust than the variance [16], the generalized simulated annealing (GSA) algorithm being introduced as the optimization procedure in the PP calculation to guarantee the global optimum. The results for simulated data sets show that PCA via PP is resistant to deviations of the error distribution from the normal one, and the method is especially recommended for use in cases with possible outlier(s) in the data. The theory and algorithm of PP PCA together with GSA are described in ref. [16]. Three practical examples are given below to demonstrate its advantages.
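The full PP PCA algorithm with GSA is given in ref. [16]. Purely to illustrate the idea of replacing the variance by a robust spread criterion and searching the projection direction stochastically, here is a toy Python sketch that uses the median absolute deviation as the projection index and a simple annealing-style random search (this is not the GSA of ref. [16], and all names are ours):

    import numpy as np

    def mad(x):
        """Median absolute deviation: a robust alternative to the variance."""
        return np.median(np.abs(x - np.median(x)))

    def pp_direction(X, n_steps=2000, T0=1.0, mu=0.999, seed=0):
        """Search for a unit vector maximizing the robust spread of the
        projected data X @ v by annealed random perturbation of v."""
        rng = np.random.default_rng(seed)
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)
        cur_v, cur = v, mad(X @ v)
        best_v, best = cur_v, cur
        T = T0
        for _ in range(n_steps):
            w = cur_v + 0.1 * rng.normal(size=X.shape[1])  # perturb the direction
            w /= np.linalg.norm(w)
            s = mad(X @ w)
            if s > cur or rng.random() < np.exp((s - cur) / T):
                cur_v, cur = w, s
                if s > best:
                    best_v, best = w, s
            T *= mu                                        # cool the search
        return best_v

    # The first 'principal component' is X @ pp_direction(X); deflating X
    # and repeating gives the next component.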

4.2. The IRIS data

A set of IRIS data [17], which consists of three classes (setosa, versicolor and virginica), was used to determine the applicability of the PP PCA algorithm for analyzing multivariate chemical data. Figure 5 shows the classification results of PP PCA and SVD. It can be seen that the PP PCA solutions provide a more distinct separation between the different varieties.

Figure 5. The plot of PP and SVD classification for the IRIS data (PC1 vs PC2). * Setosa samples; o Versicolor samples; - Virginica samples.

4.3. Classification of tea samples

The tea samples mentioned in section 2.3 are classified into three classes according to their origin by using PP PCA and SVD. As shown in Figure 6, the tea samples are clearly classified into three classes by the PP PCA algorithm, which uses SA. The method is more feasible than the classical SVD approach.

Figure 6. The comparison of results for tea samples by using the PP and SVD classification (PC1 vs PC2). * Green tea; o Black tea; Oolong tea.

4.4. Classification of beer samples

Xu and Miu determined the contents of alcohol and many other chemical parameters in beer samples and classified them by using PCA and the nonlinear mapping technique [18]. This data set was processed by PP PCA and SVD; the results are shown in Figure 7. One sample (No. 17), which was first classified as a "plain" beer by the manufacturer, is classified as an "imperial" one. The beer experts' further examination confirmed this classification. PP PCA compared favourably with the traditional SVD algorithm and was at least as good as the nonlinear mapping technique.

Figure 7. The comparison of results of beer samples by using the PP and SVD classification (PC1 vs PC2). * "Imperial" beer; o "Plain" beer.

4.5. Classification of biological samples

The contents of Sr, Cu, Mg and Zn in the serum of patients with coronary heart disease and of normal persons were determined by using ICP-AES [19]. The data were evaluated by using ordinary principal component analysis, cluster analysis and stepwise discriminant analysis. It was found that ordinary principal component analysis and cluster analysis could not give satisfactory results, with four samples misclassified; there were two samples misclassified by stepwise discriminant analysis. These data were treated by PP PCA and SVD. The PC1-PC2 plot of the PP classification shown in Figure 8 has only two samples misclassified. The results further demonstrate that PP PCA is preferable to the traditional SVD algorithm.

Figure 8. The comparison of results of biological samples by using PP and SVD classification (PC1 vs PC2). * Serum samples of patients with coronary heart disease; o Serum samples of normal persons.

ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of P.R.C. and partly by the Electroanalytical Laboratory of Changchun Institute of Applied Chemistry, Chinese Academy of Sciences.


REFERENCES

1. N. Bratchell, Chemometrics and Intelligent Laboratory Systems, 6 (1989) 105.
2. N. E. Collins, R. W. Eglese and B. L. Golden, American Journal of Mathematical and Management Sciences, 8 (1988) 209.
3. J. H. Kalivas, N. Roberts and J. M. Sutter, Analytical Chemistry, 61 (1989) 2024.
4. J. H. Kalivas, Journal of Chemometrics, 5 (1991) 37.
5. S. Z. Selim and K. Alsultan, Pattern Recognition, 24 (1991) 1003.
6. V. Cerny, Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm, Journal of Optimization Theory and Applications, 45 (1985) 41.
7. S. Kirkpatrick, C. Gelatt and M. Vecchi, Science, 220 (1983) 671.
8. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller and E. Teller, J. Chem. Phys., 21 (1953) 1087.
9. D. E. Brown and C. L. Huntley, Pattern Recognition, 25 (1992) 401.
10. J. H. Ward, J. Am. Stat. Ass., 58 (1963) 236.
11. G. Li and G. Cai, Computer Pattern Recognition Technique, Chap. 3, Shanghai Jiaotong University Press (1986).
12. Q. Zhang, Q. R. Wang and R. Boyle, Pattern Recognition, 24 (1991) 331.
13. X. D. Liu, P. Van Espen, F. Adams and S. H. Yan, Anal. Chim. Acta, 200 (1987) 424.
14. Q. Zhang, K. Yan, S. Tian and L. Li, Chinese Medical Herbs (in Chinese), 14 (1991) 15.
15. H. Chen, R. Gnanadesikan and J. R. Kettenring, Sankhya B, 36 (1974) 1.
16. Y. Xie, J. Wang, Y. Liang, L. Sun, X. Song and R. Yu, Journal of Chemometrics, 7 (1993) 527.
17. R. A. Fisher, Annals of Eugenics, 7 (1936) 179.
18. C. Xu and Q. Miu, Computers and Applied Chemistry (in Chinese), 3 (1986) 21.
19. L. Xu, Y. Sun, C. Lu, Y. Yao and X. Zeng, Analytical Chemistry (in Chinese), 19 (1991) 277.

APPENDIX

Principle of the K-means algorithm

Consider n samples in d dimensions, a1, a2, ..., an, where the sample vector ai = [ai1, ai2, ..., aid], i = 1, 2, ..., n, and assume that these samples are to be classified into k groups. The algorithm [11-12] is stated as follows:

Step 1. Select k arbitrary samples from all n samples as the k initial cluster centers z1(0), z2(0), ..., zk(0), where zg(0) = [zg1(0), zg2(0), ..., zgd(0)], g = 1, 2, ..., k; the superscript (0) refers to the initial center assignment.

Step 2. At the h-th iterative step, distribute the n samples among the k clusters using the following criterion: am belongs to sg(h) if ||am - zg(h)|| < ||am - zi(h)|| for all i = 1, 2, ..., k, i ≠ g, where sg(h) denotes the set of samples whose cluster center is zg(h), m = 1, 2, ..., n, and || · || represents the norm; otherwise the clustering of am remains unchanged.

Step 3. According to the results of step 2, the new cluster centers zg(h+1), g = 1, 2, ..., k, are calculated such that the sum of the squared Euclidean distances from all samples in sg(h) to the new cluster center, i.e. the objective function

\[ \theta = \sum_{g=1}^{k} \sum_{a_m \in s_g^{(h)}} \left\| a_m - z_g^{(h+1)} \right\|^2, \qquad m = 1, 2, \ldots, n \]

is minimized. zg(h+1) is the sample mean of sg(h); therefore

\[ z_g^{(h+1)} = \frac{1}{n_g} \sum_{a_m \in s_g^{(h)}} a_m, \qquad g = 1, 2, \ldots, k \]

where ng is the number of samples in sg(h). The name "K-means" is derived from the manner in which the cluster centers are sequentially updated.

Step 4. If zg(h+1) = zg(h) for g = 1, 2, ..., k, stop; the algorithm has converged. Otherwise, go to step 2.
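A compact Python sketch of steps 1-4 (illustrative only; the function name is ours):

    import numpy as np

    def kmeans(A, k, max_iter=100, seed=0):
        """Plain K-means as in the Appendix: random initial centers (step 1),
        nearest-center assignment (step 2), centers recomputed as cluster
        means (step 3), convergence check (step 4)."""
        rng = np.random.default_rng(seed)
        A = np.asarray(A, dtype=float)
        Z = A[rng.choice(len(A), size=k, replace=False)]      # step 1
        for _ in range(max_iter):
            d2 = ((A[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)                        # step 2
            newZ = np.array([A[labels == g].mean(axis=0)      # step 3
                             if np.any(labels == g) else Z[g]
                             for g in range(k)])
            if np.allclose(newZ, Z):                          # step 4: converged
                break
            Z = newZ
        return labels, Z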