Pattern Recognition Letters 22 (2001) 603–610
www.elsevier.nl/locate/patrec

Stochastic K-means algorithm for vector quantization

Balázs Kövesi a, Jean-Marc Boucher b,*, Samir Saoudi b

a Centre National d'Études des Télécommunications, Avenue Marzin, Lannion, France
b École Nationale Supérieure des Télécommunications de Bretagne, BP 832, 29285 Brest Cedex, France

Received 16 December 1998; received in revised form 18 December 2000
Abstract

The proposed stochastic K-means algorithm (SKA) associates a vector with a cluster according to a probability distribution, which depends on the distance between the vector and the cluster gravity centre. It is less dependent than the K-means algorithm (KMA) on the initial centre choice. It can reach local minima closer to the global minimum of a distortion measure than the KMA. It has been applied to vector quantization of speech signals. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: K-means algorithm; Stochastic methods; Vector quantization
1. Introduction

The K-means algorithm (KMA) (Forgy, 1965) is a well-known automatic clustering method, which suffers from a major drawback: the classification result depends on the initial cluster choice, since in most cases the algorithm does not reach the global optimum but only a local one. This is due to the deterministic behaviour of the algorithm, which searches steadily for the minimum of an optimization criterion. One way to obtain the global minimum of a non-convex multi-dimensional function is to apply a simulated annealing (SA) method (van Laarhoven and Aarts, 1987). The aim of this method is to decrease the energy of a particle system in a non-monotonic way in order to look for a global energy minimum.
* Corresponding author. Tel.: +33-2-98-001537; fax: +33-2-98-001012.
E-mail addresses: [email protected] (B. Kövesi), [email protected] (J.-M. Boucher).
The system state is modified iteratively. At each step, a new state is accepted if the energy gap is negative. If the energy gap is positive, a stochastic draw is made between the two hypotheses: keep the old state or jump to the new state. The probability of a state change depends on the energy gap and on the inverse of the temperature. After many iterations, equilibrium is reached and, at that point, the temperature is diminished, which reduces the probability of accepting positive energy gaps. The temperature must be decreased very slowly to reach the global optimum, so the method is very expensive in terms of computation time. It has been applied to vector quantization, where the energy gap is the Euclidean distance between a vector and a codeword.
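As an illustration, a minimal sketch of this acceptance rule (a Metropolis-style criterion) is given below; the function and parameter names are illustrative only and not taken from the cited references.

```python
import math
import random

def sa_accept(delta_e: float, temperature: float) -> bool:
    """Accept a negative energy gap unconditionally; accept a positive
    gap with probability exp(-delta_e / temperature)."""
    if delta_e <= 0.0:
        return True
    return random.random() < math.exp(-delta_e / temperature)
```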
The stochastic relaxation scheme algorithm (SRS) (Yair et al., 1992; Zeger et al., 1992) is another stochastic algorithm which resembles SA, but introduces some differences: the energy states are modified by a draw depending on an exponential probability density and are always accepted, and the temperature is decreased at each iteration without waiting for an equilibrium. These two algorithms are complex to implement and require much computation time, around 10–50 times more than the KMA in the case of the SRS.

The objective of this paper is to present a new stochastic K-means algorithm (SKA), which avoids the drawbacks of the KMA but keeps its easy implementation and fast computation time. The paper is organized as follows:
· The second part recalls the KMA method and the SRS algorithm.
· The third part describes the SKA.
· In the fourth part, an experimental study is shown.
· In the fifth part, the SKA is applied to vector quantization of speech signals.

2. The K-means algorithm and the stochastic relaxation scheme algorithm

Let R be the set of real numbers and R^N the set of N-tuples of real numbers. R^N will be called the vector space and its elements x ∈ R^N vectors. Given a finite set X ⊂ R^N, X = {x_1, x_2, x_3, …, x_M}, the aim of the K-means algorithm is to obtain a partition S of X, whose subsets are called clusters, S = {S_1, S_2, …, S_K}. Each cluster is represented by a vector c called its centre. C = {c_1, c_2, c_3, …, c_K} is the centre set, or the codebook in vector quantization. A distance measure d(x, c) between vectors and centres is introduced; it can be the Euclidean distance, for instance.

The K-means algorithm proceeds iteratively from an initial choice of centre set C_0. At each iteration (i), each vector x_j is associated with the centre c_k^{(i-1)} whose distance is the shortest, in order to create a new partition defined by S^{(i)} = {S_1^{(i)}, S_2^{(i)}, …, S_K^{(i)}}. The total distance

D^{(i)} = \frac{1}{M} \sum_{j=1}^{M} \min_{k} d\big(x_j, c_k^{(i-1)}\big), \quad k = 1, \ldots, K,

between each vector x_j and its nearest neighbour c_k^{(i-1)} is computed over the whole vector set X. Then, for each cluster S_k^{(i)}, a new centre c_k^{(i)} is found by minimizing the distance

D\big(S_k^{(i)}\big) = \frac{1}{\operatorname{card}\big(S_k^{(i)}\big)} \sum_{x_j \in S_k^{(i)}} d\big(x_j, c_k^{(i)}\big), \quad k = 1, \ldots, K,

which gives a new codebook C^{(i)} = {c_1^{(i)}, c_2^{(i)}, c_3^{(i)}, …, c_K^{(i)}}. If d is the Euclidean distance, c_k^{(i)} is the centre of gravity of S_k^{(i)}. The algorithm stops if

\frac{D^{(i-1)} - D^{(i)}}{D^{(i)}} \le \varepsilon,

where ε is a given threshold; otherwise a new iteration is performed.
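For reference, a minimal NumPy sketch of the KMA loop just described is given below; the initial codebook C0, the threshold eps and the iteration cap are illustrative inputs of the example, not values prescribed by the paper.

```python
import numpy as np

def kmeans(X, C0, eps=1e-4, max_iter=100):
    """Plain K-means: assign each vector to its nearest centre, then
    move each centre to the gravity centre of its cluster."""
    C = C0.copy()                       # (K, N) initial centre set
    D_prev = np.inf
    for _ in range(max_iter):
        # squared Euclidean distances between all vectors and all centres
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)    # nearest-neighbour partition S^(i)
        D = dist.min(axis=1).mean()     # total distance D^(i)
        if (D_prev - D) / D <= eps:     # relative-improvement stop rule
            break
        D_prev = D
        for k in range(C.shape[0]):     # new centres = cluster gravity centres
            members = X[labels == k]
            if len(members) > 0:
                C[k] = members.mean(axis=0)
    return C, labels
```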
The major drawback of this algorithm is that it often stops in a local minimum. In order to leave such a minimum, a temporary increase of the distance D(S_k^{(i)}) between two iterations must be permitted. As the K-means is a deterministic algorithm, there is no chance of this happening. Yair et al. (1992) proposed the introduction of an SRS for this purpose.

Step 1: Initial choice of the centre set C_0.

Step 2: A vector x_j is chosen stochastically, which belonged to a cluster S_k^{(i-1)} at the previous iteration (i−1). It is now associated with a new centre c_{k'}^{(i-1)} by a draw in the codebook C^{(i-1)} = {c_1^{(i-1)}, c_2^{(i-1)}, c_3^{(i-1)}, …, c_K^{(i-1)}} according to the probabilities

P^{(i)}(k, j) = \frac{e^{-\beta(i)\, d(x_j, c_k^{(i-1)})}}{\sum_{k=1}^{K} e^{-\beta(i)\, d(x_j, c_k^{(i-1)})}}, \quad k = 1, \ldots, K, \qquad (1)

where \lim_{i \to \infty} \beta(i) \to \infty and β(i) acts as the inverse of the temperature in the SA method. This leads to a new partition S^{(i)} = {S_1^{(i)}, S_2^{(i)}, …, S_k^{(i)}, …, S_{k'}^{(i)}, …, S_K^{(i)}}. S_k^{(i)} and S_{k'}^{(i)} are obtained by removing x_j from S_k^{(i-1)} and inserting it into S_{k'}^{(i-1)}, respectively.

Step 3: The centres of both clusters S_k^{(i)} and S_{k'}^{(i)} are then computed from the old cluster centres:

c_k^{(i)} = c_k^{(i-1)} - \frac{1}{\operatorname{card}\big(S_k^{(i-1)}\big) - 1}\,\big(x_j - c_k^{(i-1)}\big), \qquad (2)
c_{k'}^{(i)} = c_{k'}^{(i-1)} + \frac{1}{\operatorname{card}\big(S_{k'}^{(i-1)}\big) + 1}\,\big(x_j - c_{k'}^{(i-1)}\big), \qquad (3)

with \operatorname{card}(S_k^{(i)}) = \operatorname{card}(S_k^{(i-1)}) - 1 and \operatorname{card}(S_{k'}^{(i)}) = \operatorname{card}(S_{k'}^{(i-1)}) + 1. The distortion change between S^{(i-1)} and S^{(i)} is evaluated by

\Delta E = d\big(x_j, c_{k'}^{(i)}\big) - d\big(x_j, c_k^{(i-1)}\big) + \sum_{y \in S_k^{(i)}} \big[d\big(y, c_k^{(i)}\big) - d\big(y, c_k^{(i-1)}\big)\big] + \sum_{y \in S_{k'}^{(i-1)}} \big[d\big(y, c_{k'}^{(i)}\big) - d\big(y, c_{k'}^{(i-1)}\big)\big]. \qquad (4)

Step 4: \beta(i+1) = C \log(i+1). Return to Step 2 until d(c_k^{(i)}, c_k^{(i-1)}) \le \varepsilon for k = 1, \ldots, K, or stop.

In the SA algorithm, the probability of choosing a new centre c_{k'}^{(i-1)} for x_j is the same for all centres, but if ΔE > 0 this centre is rejected with probability 1 − p, where p = e^{-\Delta E / T}; a centre c_{k'}^{(i-1)} lying very far from the vector x_j is therefore almost surely rejected. The new centre is always accepted if ΔE ≤ 0. T is the temperature parameter, which slowly decreases during the iterations. In the SRS, no centre is rejected, but the probability of obtaining a centre far from the vector x_j is low.
3. The stochastic K-means algorithm

The SKA belongs to the same family as the SRS, which is derived from the SA algorithm, but it has an easier implementation and a much faster convergence time.

Step 1: Each vector x_j is associated with a cluster S_k^{(0)} according to the stochastic draw of an independent and identically distributed random variable e_k^{(1)}, which gives the initial segmentation.

Step 2: The distances d(x_j, c_k^{(i-1)}) between each x_j, j = 1, …, M, and all the c_k^{(i-1)}, k = 1, …, K, are computed. The fact that x_j belongs to cluster S_k^{(i-1)} is considered as a multinomial random variable e_k^{(i)}, whose probability is defined as

P^{(i)}(k, j) = \frac{e^{-\beta(i)\,[d(x_j, c_k^{(i-1)}) - \hat{d}(x_j)]}}{\sum_{k=1}^{K} e^{-\beta(i)\,[d(x_j, c_k^{(i-1)}) - \hat{d}(x_j)]}}, \qquad (5)

where \hat{d}(x_j) is the mean distance between x_j and the c_k^{(i-1)}, k = 1, …, K:

\hat{d}(x_j) = \frac{1}{K} \sum_{k=1}^{K} d\big(x_j, c_k^{(i-1)}\big).

Then a stochastic draw of this multinomial random variable is made according to these probabilities, which leads to a new partition S^{(i)} = {S_1^{(i)}, S_2^{(i)}, …, S_k^{(i)}, …, S_{k'}^{(i)}, …, S_K^{(i)}}. When the vector x_j is close to the centre c_k^{(i-1)}, the probability of choosing this centre increases, but, as the choice is stochastic, another centre may be obtained. β(i) is a parameter which increases between iterations in order to make the behaviour of the algorithm evolve from stochastic to deterministic. β(i) must be chosen with \lim_{i \to \infty} \beta(i) \to \infty. If β(i) is small, the draw assigns vectors to clusters which are not always the closest. When β(i) is high, the behaviour of the algorithm resembles that of the KMA. Many different functions were tested for β(i), including the log function used in the SA and SRS algorithms. We found experimentally that a linear function was a good way to obtain this result:

\beta(i) = c\,(i - 1), \quad \text{with } c \text{ constant}.

In many vector quantization experiments, we took c = 2.

Step 3: New centres C^{(i)} are computed in the same way as for the KMA. For each cluster S_k^{(i)}, a new centre c_k^{(i)} is found by minimizing the distance

D\big(S_k^{(i)}\big) = \frac{1}{\operatorname{card}\big(S_k^{(i)}\big)} \sum_{x_j \in S_k^{(i)}} d\big(x_j, c_k^{(i)}\big), \quad k = 1, \ldots, K.

Step 4: Between two iterations, the total distance can increase. The algorithm therefore stops if, during n iterations, the total distance remains higher than the minimum distance previously found; otherwise it returns to Step 2. The value of n can be between 10 and 20.
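A minimal sketch of the complete SKA loop (Steps 1 to 4) under the squared Euclidean distance is given below; the defaults c = 2 and n_stop = 10 follow the values reported in the text, while the iteration cap, the empty-cluster handling and all function names are assumptions of the example.

```python
import numpy as np

def ska(X, K, c=2.0, n_stop=10, max_iter=200, rng=None):
    """Stochastic K-means: vectors are assigned to clusters by a draw
    from the soft probabilities of Eq. (5); beta(i) = c*(i-1) makes the
    behaviour evolve from stochastic to deterministic."""
    if rng is None:
        rng = np.random.default_rng()
    M, N = X.shape
    labels = rng.integers(K, size=M)      # Step 1: i.i.d. initial segmentation
    C = np.zeros((K, N))
    best_D, since_best = np.inf, 0
    for i in range(1, max_iter + 1):
        # Step 3: centres = gravity centres of the current clusters
        for k in range(K):
            members = X[labels == k]
            C[k] = members.mean(axis=0) if len(members) else X[rng.integers(M)]
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # d(x_j, c_k)
        D = dist.min(axis=1).mean()                                 # total distance
        # Step 4: stop after n_stop iterations without improving the best distance
        if D < best_D:
            best_D, since_best = D, 0
        else:
            since_best += 1
            if since_best >= n_stop:
                break
        # Step 2: soft assignment of Eq. (5), centred on the mean distance d_hat(x_j)
        beta = c * (i - 1)
        logits = -beta * (dist - dist.mean(axis=1, keepdims=True))
        logits -= logits.max(axis=1, keepdims=True)                 # numerical safety
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        labels = np.array([rng.choice(K, p=P[j]) for j in range(M)])
    return C, labels
```

Note that with β(1) = 0 the first draw is uniform, which is the property used in the comments below.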
Many comments can be made on the different steps. The first step is also justified by considering formula (5) with β(1) = 0. In this case, the multinomial random variable e_k^{(1)} is identically distributed. As a consequence, if the global gravity centre is chosen as the initial centre, the vectors x_j are put into clusters independently of their distance to the global gravity centre. When the new gravity centres c_k^{(1)} of each cluster are computed, they are found close to each other, and also close to the global gravity centre. This means that the algorithm does not depend on an initial segmentation choice. During the iterations, the SKA leads the centres towards the optimal position. When β(i) is very high, the behaviour of the algorithm looks deterministic and is similar to that of the KMA. It cannot reach the global optimum given by SA, but in most cases it gives a better solution than the KMA, closer to the global one.

4. Experimental study related to vector quantization

Two sources of four-dimensional Gaussian vectors with independent components generate two clusters with variances (1, 2, 3, 4) and respective means (1, 1, 1, 1) and (−1, −1, −1, −1). Each cluster is composed of 30 points, plotted with a different marker symbol. This training sequence is depicted in Fig. 1, with the axes corresponding to the first two components. A circle indicates the gravity centre of each cluster, but the figure only displays the central area of the whole cluster. The trajectories of the gravity centres computed by the KMA are plotted after a stochastic initialization. The trajectories for the SKA are drawn in dotted and solid lines and show the stochastic behaviour of the algorithm. For each cluster, the initial position is represented by a star and the final position, at iteration 50, by a triangle. As can be seen for this particular configuration, the final solution of the SKA is closer to the true gravity centre than in the KMA case.
Fig. 1. Example of gravity centre trajectories for the KMA (top) and the SKA (bottom) (star: initial values, triangle: final values).
In Fig. 2, the evolution of the first vector component of one gravity centre is plotted for 25 runs with the same cluster configuration. In this case, the convergence point does not depend on the stochastic behaviour of the algorithm. Fig. 3 gives the mean evolution of the same component, while Fig. 4 shows its variance. A statistical study of the behaviour of the three algorithms (KMA, SRS, SKA) has been undertaken, using the Euclidean distance between the true and the estimated gravity centres to compare the performances.
Fig. 2. Evolution of the first vector component of one of the clusters for 25 different runs of the SKA algorithm.
Fig. 3. Evolution of the first vector component mean of one of the clusters for 25 different runs of the SKA algorithm.
Fig. 4. Evolution of the first vector component variance of one of the clusters for 25 different runs of the SKA algorithm.
The distance mean of the KMA, SKA and SRS algorithms, over a 100-run experiment corresponding to different cluster configurations, is plotted in Fig. 5, while Fig. 6 shows the distance variance. In this experiment, the improvement given by the SKA over the KMA is not very large, because KMA failures are rare and depend mostly on bad initial values for the gravity centres. In 45% of the trials both solutions were identical, 32% had a lower distance for the SKA than for the KMA, and 23% gave the opposite result.
Fig. 5. Distance mean as a function of the number of iterations, evaluated over 100 configurations: KMA (dashed line), SKA (solid line), SRS (dotted line).
Fig. 6. Distance variance as a function of the number of iterations, evaluated over 100 configurations: KMA (dashed line), SKA (solid line), SRS (dotted line).
It is nevertheless seen from this simple study that the SKA gives a lower distance mean and a lower distance variance than the KMA, and may hence lead to a better balanced codebook for vector quantization. Compared to the SRS, the distance mean and variance are quite similar, which means that the differences between these algorithms lie not in their performances, but mainly in their complexity and
convergence time. In this experiment, 44% of the trials gave the same results for the SKA and the SRS, 29% were in favour of the SRS and 27% gave the opposite result.

The SKA offers several advantages over other stochastic relaxation algorithms (SRAs) such as the SRS:
· The convergence time appears to be closer to the KMA case than with conventional SRAs. Typical values are twice those of the KMA, while the convergence time of the SRS is ten times longer.
· The control of standard SRAs is more complicated than that of the SKA. In the SKA, only two parameters, n and c, must be chosen, and it has been experimentally observed that the values n = 10 and c = 2 lead to good behaviour in most cases.
· The implementation requirements are easier for the SKA than for the SRS, for instance. The SRS needs to compute the new cluster centres after each iteration and must decide which is the best, which means registering the cluster membership of each vector. This is avoided with the SKA, because the memberships are automatically updated and no comparison occurs.

5. Application to vector quantization of speech signals

Vector quantization (Gersho and Gray, 1992) is a particular case in which classification methods can be used. The problem is to find the codewords C = {c_1, c_2, c_3, …, c_K} and the quantization areas S = {S_1, S_2, …, S_K} for a vector set X = {x_1, x_2, x_3, …, x_M} by minimizing the global distortion

D(Q) = E\big[d(x, Q(x))\big] = \sum_{k=1}^{K} \int_{S_k} d(x, c_k)\, p(x)\, \mathrm{d}x, \qquad (6)

where p(x) is the probability density of x and Q(x) the quantization operator. As p(x) is generally unknown, a learning base of vectors x_j is constructed from a large number of samples. The distortion is approximated by

D(Q) = \frac{1}{M} \sum_{j=1}^{M} d\big(x_j, Q(x_j)\big).

Many distortion measures (Gürgen et al., 1990) have been proposed, including the squared Euclidean distortion d(x, c_k) = \|x - c_k\|^2, the weighted squared Euclidean distortion d(x, c_k) = (x - c_k)^{\mathrm{T}} W (x - c_k), where W is a positive definite square matrix, and spectral distances which take perceptual aspects into account. The squared Euclidean distortion is used here.

Line spectrum pair (LSP) parameters (Saoudi et al., 1992) provide an efficient representation of the short-term variation of a speech signal frame. These parameters are computed from the autoregressive model of speech and are generally used in the design of low-rate speech coders because of their robustness against noise and perturbations compared to other parameters such as log area ratios (LARs). Ten parameters are computed during each frame of 28 ms in order to build a vector. They respect a given order, and this property can be used to find fast search algorithms for vector quantization (Naja et al., 1994). The learning base is composed of 12 800 vectors, and a codebook of 64 codewords must be designed.
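As an illustration, a short sketch of this empirical distortion for both measures follows; the interface (a codebook array C and an optional weighting matrix W) is an assumption of the example, not an interface defined by the paper.

```python
import numpy as np

def empirical_distortion(X, C, W=None):
    """Approximate D(Q) by (1/M) * sum_j d(x_j, Q(x_j)), where Q maps each
    vector to its nearest codeword under the chosen distortion measure."""
    diff = X[:, None, :] - C[None, :, :]                   # shape (M, K, N)
    if W is None:                                          # squared Euclidean
        dist = (diff ** 2).sum(axis=2)
    else:                                                  # weighted squared Euclidean
        dist = np.einsum('mkn,np,mkp->mk', diff, W, diff)
    return dist.min(axis=1).mean()
```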
Fig. 7. Distortion evolution for KMA and SKA applied to an LSP learning base in a vector quantization context.
Fig. 7 shows the distortion evolution for the KMA and the SKA as a function of the number of iterations. As the KMA is a deterministic algorithm, the three dotted lines correspond to three different initializations, which converge towards different local solutions. In the case of the SKA, the first class centres are the same and are always close to the gravity centre of the whole vector set. Three trials (solid lines) show that a lower distortion is obtained, which implies that a better solution than with the KMA is reached: a local minimum of the distortion measure was found that is closer to the global minimum than in the KMA case. The stochastic behaviour of the algorithm explains the three different curves of Fig. 7. The convergence rate of the SKA seems to be roughly twice that of the KMA in most cases.

Fig. 8 describes the evolution of the distortion from a statistical point of view. It shows the maximal, mean and minimal distortion over 100 trials for both algorithms. In the case of the KMA, this means that 100 different initial values lead to various solutions whose mean and extreme values are displayed. In the case of the SKA, the solutions found can sometimes be worse than with the KMA, but the mean solution is better.

Figs. 9 and 10 show the codeword trajectories during two trials for the first two LSPs, with 800 learning vectors and a codebook of dimension 8. Only the first two LSPs are plotted, in order to simplify the visualization. The first trajectory is shown as a solid line and the second one as a dotted line. Fig. 9 shows the KMA case with two different initializations, leading to two solutions.
Fig. 9. Gravity centre trajectories in the KMA case for LSP vector quantization.
Fig. 10. Gravity centre trajectories in the SKA case for LSP vector quantization.
Fig. 10 shows the SKA case, with different trajectories that nevertheless reach final centre positions which are very close.
6. Conclusion
Fig. 8. Maximal, mean and minimal distortions of the SKA algorithm for the LSP vector quantization case.
A new robust algorithm, called the stochastic K-means algorithm, has been proposed. Unlike the KMA, it does not depend on the choice of the initial cluster gravity centres. On average, it converges closer to the global minimum than the KMA.
Applied to statistical vector quantization, it gives a balanced codebook.

References

Forgy, E.W., 1965. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics 21, 768.
Gersho, A., Gray, R.M., 1992. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Dordrecht.
Gürgen, F.S., Sagayama, S., Furui, S., 1990. Line spectrum frequency based distance measures for speech recognition. In: Proc. ICSLP-90, International Conference on Spoken Language Processing, Vol. 1, Kobe, Japan, pp. 512–524.
Naja, N., Boucher, J.M., Saoudi, S., 1994. Fast LSP vector quantization algorithms comparison. In: Proc. MELECON 94, Antalya, Turkey, April 12–14, pp. 1127–1130.
Saoudi, S., Boucher, J.M., Le Guyader, A., 1992. A new efficient algorithm to compute the LSP parameters for speech coding. Signal Process. J. 28, 201–212.
van Laarhoven, P.J.M., Aarts, E.H.L., 1987. Simulated Annealing: Theory and Applications. Mathematics and its Applications. D. Reidel Publishing, Boston, USA.
Yair, E., Zeger, K., Gersho, A., 1992. Competitive learning and soft competition for vector quantizer design. IEEE Trans. Signal Process. 40, 294–309.
Zeger, K., Vaisey, J., Gersho, A., 1992. Globally optimal vector quantizer design by stochastic relaxation. IEEE Trans. Signal Process. 40, 310–322.