September 1994
Pattern Recognition Letters 15 (1994) 885-891
An effective clustering technique for feature extraction

V. Ramdas a, V. Sridhar b,*, G. Krishna a

a Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India
b Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560 012, India

* Corresponding author.

Received 7 January 1993; revised 11 February 1994
Abstract
The performance of a classifier depends on the accuracy with which feature values are measured. In this paper, we describe an approach for estimating feature values using a clustering technique. To begin with, we discuss the role of multiple versions of a pattern in pattern analysis and how clustering can be employed there. We then discuss the usefulness of clustering in class characterization, and describe an approach to account for feature value measurement errors. The proposed approach has been applied in the context of speech recognition: we investigate the estimation of formant frequencies (feature values) that form an essential ingredient in the design of a speaker-independent digit recognition system (classifier).
1. Introduction

Clustering has been used in grouping a collection of data items such that the elements within a group are more similar to each other than the elements in different groups. The notion of similarity may be based on distances between data items in a feature space. A variety of techniques have been discussed in the literature (Anderberg, 1973; Jain and Dubes, 1988; Michalski and Stepp, 1983; Sridhar and Narasimha Murty, 1991a) which may be used for clustering a collection of data items.

Clustering can be viewed as a process of data reduction. Specifically, the clustering process helps in the identification of prototypes so that further activities may make use of these prototypes instead of a collection of patterns. Alternatively, we can also view clustering as an act of data abstraction.

The process of clustering finds applications (Lee, 1981) in a variety of areas: storage space reduction
in the context of databases (Kang et al., 1977); to avoid exhaustive searching (Fukunaga and Narendra, 1975); to answer partial match queries (Lee and Tseng, 1979); to fill in the missing details in a record in a database (Lee et al., 1978); in the area of document retrieval (El-Hamdouchi and Willet, 1989); in the area of computer networks to dynamically reorganize a computer network (Ramamoorthy et al., 1986); to reorganize a knowledge base that is part of an expert system (Cheng and Fu, 1985); to perform database analysis (Michalski, 1983); to carry out data analysis (Falcitelli, 1989); to compare databases (Sridhar and Narasimha Murty, 1991b); and in semantic modeling of databases (Sridhar and Narasimha Murty, 1993). In the context of speech processing, we find that clustering has been employed in the design of speaker-independent recognition systems (Levinson et al., 1979a, b; Rabiner and Wilpon, 1979). For example, Levinson et al. (1979a) use clustering to select reference templates for speaker-independent word recognition, while Rabiner and Wilpon (1979) identify
multiple templates for each word in the vocabulary by employing clustering.

Let us suppose that we are interested in characterizing a class based on a collection of samples. Instead of using these samples directly, we can generate clusters of these samples. The nature of the generated clusters provides useful information about the class under consideration. Typically, we may ignore, while training, some samples based on the distribution of these samples within the clusters.

Classification (Duda and Hart, 1973) is the process of identifying the class of a pattern. Typically, classes and patterns are characterized using a set of features. This step not only achieves data reduction but also eases the process of generalization. In other words, we want to describe a class in such a way that unseen samples are accounted for appropriately. A well-known fact is that there is a trade-off between the complexity of the feature extraction phase and the complexity of the class characterization phase. In the literature, we find applications of clustering in reducing the complexity of the class characterization phase. In this paper, we investigate the role of clustering in the feature extraction phase. Specifically, clustering can be used to fine-tune the extracted features not only from the multiple versions of the input pattern but also from multiple samples of a class.
2. Clustering for feature extraction

2.1. Motivation

The success of any classifier depends on the features used in the process of building a classifier. The accuracy with which these features are identified forms an important issue in the design of the classifier. Let us consider a pattern described as an n-vector in feature space. The classical method of feature extraction views the pattern as a single entity and imposes severe constraints on the feature selection procedure. The feature extractor can be called a "rugged" one if it identifies a similar set of feature values from an input pattern that has been distorted to some extent and from an undistorted version of the same pattern. A possible way of achieving this ruggedness is to extract features from more than one version of the input pattern and to ascertain that the extracted features (from the multiple versions) are sufficiently close to each other.

One possible method of obtaining multiple versions of the input pattern is either to use various filtering and sampling procedures on the input pattern or to have a number of input sensors. Yet another method involves the application of different transformation techniques on the input pattern and extracting similar features from each one of them. A third alternative is to have the feature extractor extract a set of "raw" (or primary) features and a set of "support" (or secondary) features from the input pattern, and then obtain a set of "optimum" features using these raw and support features. For example, in the case of speech data, spectral information can be obtained either from DFT spectra (Rabiner and Schafer, 1978) or from LPC spectra (O'Shaughnessy, 1987). Also, in the case of character recognition, the scanner may be moved in various directions (such as left-to-right, top-to-bottom, etc.) across a pattern, thus generating a number of versions of the same pattern. In the rest of our paper, we describe algorithms to achieve feature extraction from multiple versions of the input pattern. We also describe briefly an application of the proposed approach in the extraction of formant frequencies of a speech signal.
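To make the idea of multiple versions concrete, the following sketch (not taken from the paper; the 440 Hz test tone, the noise level and the window choices are illustrative assumptions) measures the same feature, the dominant spectral peak, from three differently windowed and independently perturbed versions of one signal. The three measured values stay within a few hertz of one another and would therefore fall into a single cluster.

```python
# Minimal sketch: one underlying pattern, several distorted versions, one feature per version.
import numpy as np

fs = 8000                                   # sampling rate (Hz), assumed
t = np.arange(0, 0.256, 1.0 / fs)           # 256 ms of signal
ideal = np.sin(2 * np.pi * 440.0 * t)       # the "ideal" pattern P

rng = np.random.default_rng(0)
versions = []
for window in (np.hamming(len(t)), np.hanning(len(t)), np.blackman(len(t))):
    distorted = ideal + 0.05 * rng.standard_normal(len(t))   # a distorted version P^i
    versions.append(distorted * window)

# Feature f^i extracted from version i: frequency of the dominant spectral peak.
peak_freqs = []
for v in versions:
    spectrum = np.abs(np.fft.rfft(v))
    freqs = np.fft.rfftfreq(len(v), 1.0 / fs)
    peak_freqs.append(float(freqs[np.argmax(spectrum)]))

print(peak_freqs)   # all three values lie close to 440 Hz -> a single tight cluster
```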
2.2. Feature extraction from multiple versions of a pattern

Let $P$ be an "ideal" pattern and $P^1, P^2, \ldots, P^i, \ldots, P^n$ be the patterns corresponding to distorted versions of $P$. Physically, this may mean that pattern $P$ is sensed by $n$ different sensors yielding $P^1, P^2, \ldots, P^i, \ldots, P^n$ as outputs. Let $F^1, F^2, \ldots, F^n$ be the $d$-dimensional feature vectors of the respective patterns $P^1, P^2, \ldots, P^n$, where $F^i = (f^i_1, f^i_2, f^i_3, \ldots, f^i_d)$. We expect $F^1, F^2, \ldots, F^n$ to be close to each other as $P^1, P^2, \ldots, P^n$ are distorted versions of the same pattern. Specifically, if we cluster the values $\{f^1_j, f^2_j, \ldots, f^n_j\}$, we expect them to fall into a single cluster. In other words, we expect a singleton cluster along each dimension of the feature vector.

For example, consider a pattern $P$ described using its texture, color and weight as features. Let there be $n$ versions of the pattern and suppose that the features of the $i$th version, $P^i$, are $\langle t_i, c_i, w_i \rangle$. In order to determine the ultimate set of features that will characterize the pattern, we proceed as follows. We begin with the values $\{t_1, t_2, \ldots, t_n\}$ corresponding to the texture feature and generate clusters. Similarly, we generate clusters corresponding to the color and weight features. More specifically, let the generated clusters be
$\mathcal{C}_t = \{C^1_t, C^2_t, \ldots\}$, $\mathcal{C}_c = \{C^1_c, C^2_c, \ldots\}$, $\mathcal{C}_w = \{C^1_w, C^2_w, \ldots\}$.

Let $C^i_t$ be the maximally populated cluster amongst $\mathcal{C}_t$. Similarly, let $C^j_c$ and $C^k_w$ be the maximally populated clusters amongst $\mathcal{C}_c$ and $\mathcal{C}_w$ respectively. (We discuss the issue of ties occurring in the process of obtaining the maximal clusters later in our paper. For the time being, we break ties arbitrarily.) One possibility at this stage would be to employ the respective maximally populated cluster representatives to characterize $P$. Typically, when cluster elements are described numerically, the cluster representative is the mean of the numerical representations of the elements. However, we can try to make use of the "contextual information" in the process of identification of feature values. Specifically, we can use the contextual information by looking at the ways in which the clusters $C^i_t$, $C^j_c$ and $C^k_w$ "overlap". We characterize "overlap" as follows. Let Pat be an operator mapping from a set of clusters into a power set of patterns. That is,

$\mathrm{Pat}_j(\{f^{i_1}_j, f^{i_2}_j, \ldots, f^{i_k}_j\}) = \{P^{i_1}, P^{i_2}, \ldots, P^{i_k}\}$,

where $f^{i_1}_j, \ldots, f^{i_k}_j$ are members of a cluster of $j$th-dimension feature values. We say the clusters $C^i_t$, $C^j_c$ and $C^k_w$ overlap iff

$\mathrm{Pat}_t(C^i_t) \cap \mathrm{Pat}_c(C^j_c) \cap \mathrm{Pat}_w(C^k_w) \neq \emptyset$.

We illustrate cluster overlap with the help of the following example. Let patterns P1, P2 and P3 be as shown in Table 1.

Table 1
Pattern   Texture   Color   Weight
P1        t1        c1      w1
P2        t2        c2      w2
P3        t3        c3      w3

with
$C^1_t = \{t_1, t_3\}$, $C^2_t = \{t_2\}$,
$C^1_c = \{c_1, c_2\}$, $C^2_c = \{c_3\}$,
$C^1_w = \{w_1\}$, $C^2_w = \{w_2, w_3\}$.

In this example, the maximally populated clusters are $C^1_t$, $C^1_c$ and $C^2_w$, and

$\mathrm{Pat}_t(C^1_t) = \{P^1, P^3\}$, $\mathrm{Pat}_c(C^1_c) = \{P^1, P^2\}$, $\mathrm{Pat}_w(C^2_w) = \{P^2, P^3\}$.

It is to be observed that the maximally populated clusters do not overlap as the intersection is null. We would like to label this sample as a noisy sample. On the other hand, for the values given in Table 2,

Table 2
Pattern   Texture   Color    Weight
P1        t'1       c'1      w'1
P2        t'2       c'2      w'2
P3        t'3       c'3      w'3

with maximally populated clusters
$C'_t = \{t'_1, t'_2, t'_3\}$, $C'_c = \{c'_1, c'_2\}$, $C'_w = \{w'_1, w'_2\}$,

it can be observed that

$\mathrm{Pat}_t(C'_t) \cap \mathrm{Pat}_c(C'_c) \cap \mathrm{Pat}_w(C'_w) = \{P^1, P^2\}$.

In this case, we proceed to identify the feature vector as follows. Let $\mathrm{Pat}^{-1}_j$ be an operator from a power set of patterns onto a power set of features. That is,

$\mathrm{Pat}^{-1}_j(\{P^{i_1}, P^{i_2}, \ldots, P^{i_k}\}) = \{f^{i_1}_j, f^{i_2}_j, \ldots, f^{i_k}_j\}$,

where $f^{i_1}_j, \ldots, f^{i_k}_j$ are the $j$th-dimension feature values of $P^{i_1}, \ldots, P^{i_k}$. We use the above operator to restructure the maximally populated clusters $C'_t$, $C'_c$ and $C'_w$ by generating $C''_t$, $C''_c$ and $C''_w$:

$C''_t = \mathrm{Pat}^{-1}_t(\{P^1, P^2\}) = \{t'_1, t'_2\}$,
$C''_c = \mathrm{Pat}^{-1}_c(\{P^1, P^2\}) = \{c'_1, c'_2\}$,
$C''_w = \mathrm{Pat}^{-1}_w(\{P^1, P^2\}) = \{w'_1, w'_2\}$.

We use the representatives of $C''_t$, $C''_c$ and $C''_w$ to characterize the pattern $P$.

In the above process of generating $C''_t$, $C''_c$ and $C''_w$, we may have to resolve ties. A tie occurs if the maximally populated cluster with respect to a feature is not unique. One way of breaking a tie would be to consider separately each of the clusters involved in the tie and to decide favorably that cluster which yields the maximum number of elements on taking
intersection. With respect to the example at hand, we would like to maximize $\mathrm{Pat}_t(C^i_t) \cap \mathrm{Pat}_c(C^j_c) \cap \mathrm{Pat}_w(C^k_w)$. In the following, we describe an algorithm to implement the above approach.

Algorithm 1 (Correct & Characterize Sample)
Input. Let $P$ be the "ideal" pattern and let the $n$ variations of $P$ be $\{P^1, P^2, \ldots, P^n\}$, where $P^i$ is described using $(f^i_1, f^i_2, \ldots, f^i_d)$.
Output. $(f_1, f_2, \ldots, f_d)$.
Step 1. Generate clusters along each dimension: $\mathcal{C}_j = \{C^1_j, C^2_j, \ldots, C^{k_j}_j\}$ for $1 \le j \le d$.
Step 2. Let $C^{\max}_j$ be the maximally populated cluster of $\mathcal{C}_j$, i.e. $|C^{\max}_j| \ge |C^x_j|$ for $1 \le x \le k_j$, $1 \le j \le d$.
Step 3. Let $\mathcal{P} = \bigcap_{j=1}^{d} \mathrm{Pat}_j(C^{\max}_j)$.
Step 4. Let $C'_j = \mathrm{Pat}^{-1}_j(\mathcal{P})$ for $1 \le j \le d$.
Step 5. For $1 \le j \le d$, output $f_j$, the representative (mean) of $C'_j$.
End of Algorithm
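The following is a minimal sketch of Algorithm 1. The paper does not prescribe a particular clustering routine, so the one-dimensional gap-based clustering in `cluster_1d` and the per-dimension thresholds are illustrative assumptions; the Pat and Pat^{-1} operators are realized simply as sets of pattern indices.

```python
# Sketch of Algorithm 1 (Correct & Characterize Sample); clustering routine and
# thresholds are assumptions, not taken from the paper.
import numpy as np

def cluster_1d(values, threshold):
    """Split the indices of `values`, taken in sorted order, into groups wherever
    the gap between consecutive values exceeds `threshold`."""
    order = np.argsort(values)
    groups, current = [], [order[0]]
    for prev, idx in zip(order[:-1], order[1:]):
        if values[idx] - values[prev] <= threshold:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)
    return groups                      # each group is a list of pattern indices (Pat)

def characterize_sample(F, thresholds):
    """F: (n, d) array; row i holds the feature vector of version P^i."""
    F = np.asarray(F, dtype=float)
    n, d = F.shape
    # Steps 1-2: clusters along each dimension and their maximally populated members.
    maximal = [max(cluster_1d(F[:, j], thresholds[j]), key=len) for j in range(d)]
    # Step 3: intersect the Pat sets (pattern indices) of the maximal clusters.
    P = set(range(n))
    for members in maximal:
        P &= set(members)
    if not P:                          # no overlap: label the sample as noisy
        return None
    # Steps 4-5: restructure via Pat^{-1} and use the means as the feature values.
    idx = sorted(P)
    return F[idx, :].mean(axis=0)

# Four versions of one pattern described by three feature values each.
F = [[1.00, 5.1,  9.8],
     [1.02, 5.0,  9.9],
     [1.01, 5.2, 30.0],   # third feature of this version is badly distorted
     [3.50, 5.1,  9.7]]   # first feature of this version is badly distorted
print(characterize_sample(F, thresholds=[0.1, 0.3, 0.5]))   # -> [1.01 5.05 9.85]
```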
2.3. Class characterization

The discussion so far has been aimed at characterizing a sample. In supervised classification, we begin with a collection of patterns that belong to a single class. With the help of the features extracted for each of these patterns, we are interested in determining the characteristic features of the class.

There is a class of problems where the features used in characterizing a pattern are a sequence of measurements with the same unit of measurement. For example, let us suppose that a solid material is required to be characterized by its thermal conductivities measured at various temperatures. That is, if the measurements are made at temperatures $t_1, t_2, \ldots, t_k$, then the pattern features are $(\alpha_1, \alpha_2, \ldots, \alpha_k)$, where $\alpha_i$ is the measured thermal conductivity at temperature $t_i$. Such measurements are made on $n$ samples of the solid material, yielding

$(\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1k}), \ldots, (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ik}), \ldots, (\alpha_{n1}, \alpha_{n2}, \ldots, \alpha_{nk})$.

The task at hand is to characterize the solid material, $(\alpha'_1, \alpha'_2, \ldots, \alpha'_k)$, using the above measurements. In such a sequence of measurements, it may so happen that some of the measurements made on the samples are out of place. Specifically, there may be a mix-up of some of the measurements or there may be some spurious measurements. Our idea is to use a clustering technique to identify and, if possible, correct these out-of-place measurements. Note that there is an important difference between the sample characterization problem we discussed in the previous section and the present problem of class characterization. In the former case, the variability amongst the feature value measurements is expected to be insignificant, while this may not be so in the latter case.

Before we discuss the clustering algorithm, we elaborate the idea with the following example. Consider the readings of six samples ($S_1, \ldots, S_6$) at various temperatures ($t_1, \ldots, t_5$) as shown in Table 3.

Table 3
Sample   t1            t2           t3           t4           t5
S1       0.100 (S11)   0.120 (S12)  0.150 (S13)  0.141 (S14)  0.110 (S15)
S2       0.101 (S21)   0.119 (S22)  0.148 (S23)  0.141 (S24)  0.109 (S25)
S3       0.098 (S31)   0.121 (S32)  0.151 (S33)  0.140 (S34)  0.113 (S35)
S4       0.050 (S41)   0.102 (S42)  0.118 (S43)  0.150 (S44)  0.138 (S45)
S5       0.1025 (S51)  0.118 (S52)  0.139 (S53)  0.109 (S54)  0.075 (S55)
S6       0.075 (S61)   0.122 (S62)  0.139 (S63)  0.075 (S64)  0.04 (S65)

In the first step, we identify the maximally populated clusters for each of the columns. The resulting clusters are shown in Table 4.
Table 4
Column t1:  C1 = {S11, S21, S31, S51}, mean = 0.1004;  outside C1: S41, S61
Column t2:  C2 = {S12, S22, S32, S52, S62}, mean = 0.120;  outside C2: S42
Column t3:  C3 = {S13, S23, S33}, mean = 0.1497;  outside C3: S43, S53, S63
Column t4:  C4 = {S14, S24, S34}, mean = 0.1407;  outside C4: S44, S54, S64
Column t5:  C5 = {S15, S25, S35}, mean = 0.1107;  outside C5: S45, S55, S65

Table 5
Column t1:  C1' = {S11, S21, S31, S51, S42}, mean = 0.1007;  outside C1': S41, S61
Column t2:  C2' = {S12, S22, S32, S52, S62, S43}, mean = 0.1196
Column t3:  C3' = {S13, S23, S33, S44}, mean = 0.1497
Column t4:  C4' = {S14, S24, S34, S53, S63, S45}, mean = 0.1396;  outside C4': S64
Column t5:  C5' = {S15, S25, S35, S54}, mean = 0.1102;  outside C5': S55, S65
Now we examine the elements (such as S42 or S44) that do not belong to the maximally populated clusters, so as to improve the density of the maximally populated clusters. Observe the reading S41, which is not a part of the maximally populated cluster C1. In order to account for this reading, we examine the neighbors of S41. Now observe the reading S42. This value is closer to the mean of C1 than to the mean of C2. Hence, we hypothesize that S42 is more likely a reading at t1 than a reading at t2. Based on this assumption, we make S42 part of C1. The resulting final clusters are shown in Table 5. These clusters are used to describe the characteristics of the samples.

Algorithm 2 (Correct & Characterize Class) (Variation 1). In this version, we do not construct all the feature clusters a priori. As an alternative, we first construct the maximally populated cluster corresponding to a feature and then attempt to extend this cluster by inspecting its neighbors. In the boundary case, when we are dealing with the first feature, it obviously has only one neighbor. When we are characterizing the $i$th feature, we try to extend its maximally populated cluster by borrowing (if possible) elements which are not part of the maximally populated cluster of the $(i-1)$th measurement, together with all the $(i+1)$th measurements. The mean of such an extended cluster is used to characterize the $i$th feature of the class.
Input. N samples of a class, where each sample is characterized by k features.
Output. d ($\le k$) feature values that characterize the class.
Step 1. For each feature s ($1 \le s \le k$), taken in left-to-right order, do Steps 2-4.
Step 2. Form clusters of the s-th measurements of the N samples and let $C_{ms}$, with mean $\mu_{ms}$, be the maximally populated cluster, i.e. $|C_{ms}| \ge |C_{is}|$ for every cluster $C_{is}$ formed.
Step 3. Collect the measurements of the $(s-1)$th feature that lie outside $C_{m(s-1)}$, together with all the measurements of the $(s+1)$th feature (the first and last features have only one neighbor).
Step 4. For each collected measurement, if it is close to $\mu_{ms}$, move it into $C_{ms}$ and recompute $\mu_{ms}$.
Step 5. The resulting means $\mu_{ms}$ characterize the class.
End of Algorithm
Algorithm 2 (Correct & Characterize Class) (Variation 2). As an alternative to the sequential clustering approach, we discuss an approach that clusters each set of feature measurements as an initial step. This is particularly useful if most measurements made of a feature are correct and only a few measurements are erroneous.
Input. N samples of a class, where each sample is characterized by k features.
Output. d ($\le k$) feature values that characterize the class.
Step 1. For each feature s ($1 \le s \le k$) do Steps 2-4.
Step 2. Form clusters of the s-th measurements of the N samples. Let $C_{1s}, C_{2s}, \ldots, C_{xs}$ be the x clusters formed, with $|C_{ms}| \ge |C_{is}|$ for $1 \le i \le x$, and let $\mu_{ms}$ be the mean of the maximally populated cluster $C_{ms}$.
Step 3. Let $\bar{C}_{ms} = \{s' \mid s' \notin C_{m(s-1)} \text{ and } s' \notin C_{m(s+1)}\}$, the neighboring measurements that lie outside their own maximally populated clusters.
Step 4. For each $s' \in \bar{C}_{ms}$ do: if $s'$ is close to $\mu_{ms}$, then move $s'$ to $C_{ms}$ and recompute $\mu_{ms}$.
Step 5. Now the means $\mu_{mi_1}, \mu_{mi_2}, \ldots, \mu_{mi_d}$ of the retained clusters characterize the class.
End of Algorithm
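Below is a minimal sketch of Algorithm 2 in the spirit of Variation 2 (all columns are clustered up front, then each maximal cluster is extended with close readings from the neighboring columns), run on the thermal-conductivity readings of Table 3. The gap-based column clustering and the 0.005 closeness threshold are illustrative assumptions; the paper leaves the clustering routine and the notion of "close" open. With these assumptions the sketch reproduces the reassignments of Table 5 (S42 joins the t1 cluster; S43 joins t2; S44 joins t3; S53, S63 and S45 join t4; S54 joins t5).

```python
# Sketch of Algorithm 2 (Variation 2) on the Table 3 data.  The gap-based column
# clustering and the closeness threshold (0.005) are assumptions.
import numpy as np

# Table 3: rows = samples S1..S6, columns = temperatures t1..t5.
readings = np.array([
    [0.100,  0.120, 0.150, 0.141, 0.110],
    [0.101,  0.119, 0.148, 0.141, 0.109],
    [0.098,  0.121, 0.151, 0.140, 0.113],
    [0.050,  0.102, 0.118, 0.150, 0.138],
    [0.1025, 0.118, 0.139, 0.109, 0.075],
    [0.075,  0.122, 0.139, 0.075, 0.040],
])

def max_cluster(values, gap):
    """Indices of the largest group obtained by splitting the sorted values
    wherever the gap between consecutive values exceeds `gap`."""
    order = np.argsort(values)
    groups, current = [], [order[0]]
    for prev, idx in zip(order[:-1], order[1:]):
        if values[idx] - values[prev] <= gap:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)
    return set(max(groups, key=len))

def characterize_class(X, gap=0.005, close=0.005):
    n, k = X.shape
    # Steps 1-2: cluster every column up front and keep the maximally populated clusters.
    maximal = [max_cluster(X[:, s], gap) for s in range(k)]
    members = [[(i, s) for i in maximal[s]] for s in range(k)]   # (sample, column) pairs
    moved = set()
    for s in range(k):                                           # refine each feature in turn
        mean = np.mean([X[i, c] for i, c in members[s]])
        # Steps 3-4: borrow neighboring readings that lie outside their own maximal
        # clusters and are close to this cluster's mean.
        for t in (c for c in (s - 1, s + 1) if 0 <= c < k):
            for i in range(n):
                if i in maximal[t] or (i, t) in moved:
                    continue
                if abs(X[i, t] - mean) <= close:
                    members[s].append((i, t))
                    moved.add((i, t))
                    mean = np.mean([X[i, c] for i, c in members[s]])
    # Step 5: the means of the extended clusters characterize the class.
    return [float(np.mean([X[i, c] for i, c in members[s]])) for s in range(k)]

print(characterize_class(readings))
# Agrees with the Table 5 cluster means (0.1007, 0.1196, 0.1497, 0.1396, 0.1102) up to rounding.
```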
In the following section, we describe an application of the proposed technique in the context of speech recognition.

3. Application of the proposed approach in formant estimation

In the previous section, we have discussed how clustering can be used to identify and, if possible, correct out-of-place feature value measurements. Our work has been motivated by the peculiar behavior of the "formant" frequencies (O'Shaughnessy, 1987) associated with speech data. Specifically, our aim was to achieve speaker-independent recognition of digits based on the recognition of vowels alone, and as a first step we concentrated on the digits 1, 2, 3, 4, 6 and 7. We restricted our analysis to n segments (each segment being 12 msec in duration) in the vowel region. (We do not discuss here the identification of these n segments; n is 5 in our implementation.) We perform the short-term Fourier transform (STFT) (Rabiner and Schafer, 1978) on 32 msec Hamming-windowed speech data sampled at 8 kHz. The window is progressively moved by 12 msec. In each of the segments, we pick the first four peaks of the spectral envelope.

In the following example, we give a part of our experimental results. The threshold values ($\tau$'s) employed as maximum permissible cluster radii in the generation of the maximally populated clusters are 75 Hz for F1, 100 Hz for F2, and 150 Hz for F3 and F4 respectively. The measurements for the digit "one" uttered by speaker 1 are given in Table 6. After clustering, we get Table 7. In the above example we have made use of an algorithm based on Algorithm 2 (Variation 1) (described in Section 2). Specifically, we have chosen four peaks from the STFT envelope, with the objective of identifying three formant frequencies from the obtained data. Initially, we cluster the elements of the first column as best we can using the nearest-neighbor clustering approach, employing the threshold values given above. We then determine the mean of the maximally populated cluster. Then we try to expand this cluster as described in the algorithm, and proceed to generate the other clusters by moving sequentially from left to right, column-wise. In the above example, the effective formant values are F1 = 519.1 Hz, F2 = 1005.9 Hz, and F3 = 2114.3 Hz.
Table 6
Segment   F1 (Hz)   F2 (Hz)   F3 (Hz)   F4 (Hz)
S1        500.0     1000.0    2093.8    3093.5
S2        531.2     1031.0    2125.0    3875.0
S3        500.0     968.8     2156.2    3062.0
S4        502.5     781.2     1062.0    2196.8
S5        312.5     562.1     968.8     2000.0
Table 7
Segment   F1 (Hz)   F2 (Hz)   F3 (Hz)
S1        500.0     1000.0    2093.8
S2        531.2     1031.0    2125.0
S3        500.0     968.8     2156.2
S4        502.5     1062.0    2196.8
S5        562.1     968.8     2000.0
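As a worked illustration of the column-wise procedure just described, the sketch below applies a left-to-right gap clustering to the Table 6 peak measurements, using the stated radii (75, 100 and 150 Hz). The gap-based clustering and the reuse of each column's radius as the closeness test for borrowing a peak from a neighboring column are assumptions of the sketch, not details given in the paper; with them, it reproduces the peak assignments of Table 7 and yields formant means close to the reported values.

```python
# Sketch: left-to-right clustering of the Table 6 spectral-peak measurements into
# three formant tracks.  Gap clustering and the borrow test are assumptions.
import numpy as np

# Table 6: rows = segments S1..S5, columns = first four spectral peaks (Hz).
peaks = np.array([
    [500.0, 1000.0, 2093.8, 3093.5],
    [531.2, 1031.0, 2125.0, 3875.0],
    [500.0,  968.8, 2156.2, 3062.0],
    [502.5,  781.2, 1062.0, 2196.8],
    [312.5,  562.1,  968.8, 2000.0],
])
radii = [75.0, 100.0, 150.0, 150.0]     # thresholds for F1..F4 as given in the text
n_formants = 3

def max_cluster(entries, gap):
    """entries: list of (segment, column, value); return the largest group after
    splitting the sorted values wherever consecutive values differ by more than `gap`."""
    entries = sorted(entries, key=lambda e: e[2])
    groups, current = [], [entries[0]]
    for prev, cur in zip(entries[:-1], entries[1:]):
        if cur[2] - prev[2] <= gap:
            current.append(cur)
        else:
            groups.append(current)
            current = [cur]
    groups.append(current)
    return max(groups, key=len)

n_seg, n_cols = peaks.shape
claimed = set()                          # (segment, column) pairs already assigned
formants = []
for f in range(n_formants):
    column = [(i, f, peaks[i, f]) for i in range(n_seg) if (i, f) not in claimed]
    cluster = max_cluster(column, radii[f])
    claimed.update((i, c) for i, c, _ in cluster)
    mean = np.mean([v for _, _, v in cluster])
    # Borrow close, unclaimed peaks from the neighboring columns (previous and next).
    for c in (col for col in (f - 1, f + 1) if 0 <= col < n_cols):
        for i in range(n_seg):
            if (i, c) in claimed:
                continue
            if abs(peaks[i, c] - mean) <= radii[f]:
                cluster.append((i, c, peaks[i, c]))
                claimed.add((i, c))
                mean = np.mean([v for _, _, v in cluster])
    formants.append(round(float(mean), 1))

print(formants)   # ~[519.2, 1006.1, 2114.4]; the paper reports 519.1, 1005.9 and 2114.3 Hz
```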
4. Conclusion

In this paper, we have examined the need for the application of clustering techniques in the process of feature extraction. In particular, we have discussed algorithms that employ clustering techniques that can be used in the design of a classifier. We have also examined the proposed algorithms in the design of a speaker-independent digit recognition system. Our results indicate that clustering can play a very vital role in the design of a classifier. As has been mentioned in the introduction, this work has been motivated by the need to build a rugged classifier. We are presently working on incorporating the above algorithms in the development of a speaker-independent digit recognition system.

In this paper, we have discussed an application of the proposed clustering approach in the area of speech processing. There must be many other areas where the proposed approach is applicable. For example, it is worthwhile to investigate the applicability in the area of manufacturing quality control, as suggested by the thermal conductivity measurements of a solid material that we have discussed in the paper.
Acknowledgements

The authors would like to thank the anonymous reviewer for his/her valuable comments that imparted more clarity to the paper. The authors would also like to thank Messrs Indian Telephone Industries, Bangalore, for the project grant that partially supported the work reported in this paper. The first two authors would like to thank Dr. M. Narasimha Murty for useful discussions.
References

Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New York.
Cheng, Y. and K.S. Fu (1985). Conceptual clustering in knowledge organization. IEEE Trans. Pattern Anal. Machine Intell. 7 (5), 592-598.
Duda, R.O. and P.E. Hart (1973). Pattern Classification and Scene Analysis. Wiley-Interscience, New York.
El-Hamdouchi, A. and P. Willet (1989). Comparison of hierarchic agglomerative clustering methods for document retrieval. Comp. J. 32 (3), 220-227.
Falcitelli, G. (1989). Knowledge based systems supporting epidemiological data analysis. In: E. Diday, Ed., Data Analysis, Learning Symbolic and Numeric Knowledge. Nova Sci., New York.
Fukunaga, K. and P.M. Narendra (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans. Comp. 24 (7), 750-753.
Jain, A.K. and R.C. Dubes (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
Kang, A.N.C. et al. (1977). Storage reduction through minimal spanning trees and spanning forests. IEEE Trans. Comp. 26 (5), 107-131.
Lee, R.C.T. (1981). Clustering analysis and its applications. In: J.L. Tou, Ed., Advances in Information Systems Science. Academic Press, New York, 169-292.
Lee, R.C.T. and S.H. Tseng (1979). Multi-key sorting. Internat. J. Policy Analysis and Information Systems 3 (2), 1-20.
Lee, R.C.T., J.R. Slagle and C.T. Mong (1978). Towards automatic auditing of records. IEEE Trans. Software Engrg. 2 (3).
Levinson, S.E., L.R. Rabiner, A.E. Rosenberg and J.G. Wilpon (1979a). Interactive clustering techniques for selecting speaker-independent reference templates for isolated word recognition. IEEE Trans. Acoust. Speech Signal Process. 27, 134-141.
Levinson, S.E., L.R. Rabiner, A.E. Rosenberg and J.G. Wilpon (1979b). Speaker-independent recognition of isolated words using clustering techniques. IEEE Trans. Acoust. Speech Signal Process. 27, 336-349.
Michalski, R.S. (1983). A theory and methodology of inductive learning. In: R.S. Michalski, J.G. Carbonell and T.M. Mitchell, Eds., Machine Learning: An AI Based Approach, Vol. I. Morgan Kaufmann, Los Altos, CA.
Michalski, R.S. and R.E. Stepp (1983). Automated construction of classifications: conceptual clustering versus numerical taxonomy. IEEE Trans. Pattern Anal. Machine Intell. 5 (4), 396-410.
O'Shaughnessy, D. (1987). Speech Communication: Human and Machine. Addison-Wesley, Reading, MA.
Rabiner, L.R. and R.W. Schafer (1978). Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, NJ.
Rabiner, L.R. and J.G. Wilpon (1979). Considerations in applying clustering techniques to speaker-independent word recognition. J. Acoust. Soc. Amer.
Ramamoorthy, C.V., J. Srivastava and W. Tsai (1986). A distributed clustering algorithm for large computer networks. Proc. 6th Internat. Conf. on Distributed Comp. Systems, 613-620.
Sridhar, V. and M. Narasimha Murty (1991a). A knowledge-based clustering algorithm. Pattern Recognition Lett. 12, 511-517.
Sridhar, V. and M. Narasimha Murty (1991b). Clustering algorithms for library comparison. Pattern Recognition 24 (9), 815-823.
Sridhar, V. and M. Narasimha Murty (1993). A knowledge-based clustering algorithm for data abstraction. Knowledge-Based Systems, accepted for publication.