Pattern Recognition Letters 13 (1992) 879-891 North-Holland
December 1992
Feature representations and classification procedures for Slovene phoneme recognition

France Mihelič, Ivo Ipšič, Simon Dobrišek and Nikola Pavešič

Laboratory for Artificial Perception, Faculty of Electrical and Computer Engineering, Tržaška 25, Ljubljana, Slovenia
Received 24 January 1992 Revised 15 April 1992
Abstract

Mihelič, F., I. Ipšič, S. Dobrišek and N. Pavešič, Feature representations and classification procedures for Slovene phoneme recognition, Pattern Recognition Letters 13 (1992) 879-891.

In this paper a comparison of the performance of different feature representations of the speech signal and a comparison of classification procedures for Slovene phoneme recognition are presented. Recognition results are obtained on a database of continuous Slovene speech consisting of short Slovene sentences spoken by female speakers. MEL-cepstrum and LPC-cepstrum features combined with the normalized frame loudness were found to be the most suitable feature representations for Slovene speech. It was found that determination of the MEL-cepstrum using linear spacing of the bandpass filters gave significantly better results for speaker dependent recognition. The comparison of classification procedures favours the Bayes classification assuming a normal distribution of the feature vectors (BNF) over the classification based on quadratic discriminant functions (DF) for minimum mean-square error and over the subspace method (SM), which does not confirm the results obtained in some previous studies for German and Finnish speech. Additionally, classification procedures based on hidden Markov models (HMM) and the Kohonen Self-Organizing Map (KSOM) were tested on a smaller amount of speech data (1 speaker only). Their classification results are comparable with classification using BNF.
Keywords. Speech recognition, phone components recognition, feature extraction and selection, classification, quadratic discriminant functions, subspace method, hidden Markov models, Kohonen Self-Organizing Map.
1. Introduction

Correspondence to: F. Mihelič, Laboratory for Artificial Perception, Faculty of Electrical and Computer Engineering, Tržaška 25, Ljubljana, Slovenia.

The acoustic-phonetic (AP) transcription of speech is a necessary and hierarchically the lowest part of most continuous speech recognition systems. Its purpose is to transform the speech signal into a string of phonetic units. The AP transcription is performed in several steps. The acoustic speech signal is recorded using a
close-talking microphone. The signal is passed through a bandpass filter (300 Hz-3.4 kHz) and sampled at a 10 kHz sampling rate with a 12-bit linear A/D converter. The frequency range of the signal is thus the same as that used in the telephone network. During the feature extraction step the sampled signal is divided into overlapping frames 6.4 ms apart, each with a duration of 12.8 ms, and multiplied by a Hamming window function. A feature vector describing the signal is computed for each frame. Using one of the classification algorithms, each
frame is classified into one of the phone component classes. We used 30 phone components as basic phonetic units to describe the Slovene phonemes. Frames corresponding to the same phone component are grouped together into segments. In the HMM approach, phone component classification and segmentation are obtained simultaneously, while the other classification procedures require an additional step for the segmentation task. The results presented in this paper concern the feature extraction and feature selection steps, including a comparison of different feature vectors, and a comparison of the effectiveness of different classification algorithms for frame classification. The final segmentation [10] of speech into phonetic units was achieved using a syntactic method similar to the one described in [18, pp. 156-166] and a collection of correction rules [9] using the same basic knowledge about the phonetic properties of Slovene phonemes (average duration and possible order of phone components). Finally, a new AP module, which uses different classification procedures for phone component recognition, is proposed.
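As an illustration of the preprocessing described above, the following minimal Python sketch splits a sampled signal into overlapping Hamming-windowed frames (6.4 ms step, 12.8 ms length at 10 kHz). The function name and interface are ours, not from the original system.

```python
import numpy as np

def frame_signal(x, fs=10_000, step_s=0.0064, length_s=0.0128):
    """Split signal x into overlapping frames and apply a Hamming window.

    Assumes len(x) >= frame length (128 samples at 10 kHz)."""
    step = int(round(step_s * fs))      # 64 samples between frame starts
    length = int(round(length_s * fs))  # 128 samples per frame
    n_frames = 1 + (len(x) - length) // step
    window = np.hamming(length)
    frames = np.empty((n_frames, length))
    for i in range(n_frames):
        frames[i] = x[i * step : i * step + length] * window
    return frames
```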
Table 1
Categories and phone components

vowels: i, é, ê, a, ə, ó, ô, u
nasals: n, m
oral sonorants: v, r, l, j
nonsonorants: p, t, k, b, d, g, f, h, s, š, z, ž, c, č
silence: -
noise: +

2. Input data

The speech input for training and testing the AP module performance consists of 152 simple Slovene sentences (approximate duration: 7 minutes) spoken by 3 female speakers at normal speed and loudness, without dialectal characteristics. Each speaker spoke about 50 sentences. For each speech frame a corresponding phone component was determined manually. For this purpose special software was designed, which enables:
• visual inspection of the time signal,
• spectral representation of the signal,
• listening to an arbitrarily selected part of speech using D/A conversion.

Because of the time-varying spectral characteristics of some phonemes, such as stops and affricates, it was necessary to decompose those phonemes into several components. The stop phonemes p, b, t, d, g are composed of a closure and a burst, described by a pair: p → (-, p), b → (-, b), t → (-, t), d → (-, d), g → (-, g). Closures preceding the burst of voiced phonemes and closures preceding the burst of unvoiced phonemes were labelled with separate closure symbols. A third component, aspiration, is constantly present in the Slovene stop phoneme k. The affricates c and č are also composed of three components. The representation of these phonemes thereby consists of triples: k → (-, k, h), c → (-, c, s), č → (-, č, š).

The 30 phone components of Slovene speech are grouped in 5 categories (Table 1), according to [19]. Noise in the recordings resulting from breathing and smacking of lips represents a special category, which was labelled +. The relative frequency of the input data phone components is illustrated in Figure 1.
Figure 1. Relative frequency distribution of phone components (segments and frames).
The training and testing process is cyclically repeated for speaker dependent and speaker independent recognition tests. For the speaker independent tests, two speakers were used for training and the third for recognition. For the speaker dependent tests, 5 sentences were used for recognition and the remaining (approx. 45) sentences for training.
3. Comparison of feature representations
In previous comparative studies [1,17], many different feature representations of speech have been tested. Because the optimal feature representation might depend on the language used, we made our own comparative study for Slovene speech using the most popular feature representations. We chose:
- the MEL representation (MEL) [17],
- the MEL-cepstrum representation (MEC) [13,1,17],
- the cepstrum (CEP) [1],
- the Karhunen-Loève transformation of the MEL parameters,
- the LPC representation (LPC) [1],
- the LPC-cepstrum representation (LCE) [1].
For the comparison of the different feature representations we always used the Bayes classifier assuming a normal distribution of the feature vectors (BNF).
MEL parameters
MEL (melody) parameters are defined as the log energies of one frame of the speech signal at the outputs of a set of overlapping bandpass filters (we chose 32 filters), usually spaced nearly logarithmically (Figure 2(a)).
MEL-cepstrum
The MEL-cepstrum is defined as a cosine transformation of the MEL parameters:

$$\mathrm{MEC}_i = \sum_{k=1}^{N_f} \mathrm{MEL}_k \cdot \cos\!\left(\frac{i\,(k - \tfrac{1}{2})\,\pi}{N_f}\right), \qquad i = 1, 2, \ldots, N_c,$$

where $N_c$ is the number of MEL-cepstrum coefficients and $\mathrm{MEL}_k$, $k = 1, 2, \ldots, N_f$, represents the log energy output of the k-th filter. As in [17], we have chosen $N_f$ to be 32; $N_c$ equals 16.

Figure 2. Different spacing of bandpass filters: (a) logarithmic, (b) linear.

Because some special spacing of the bandpass filters could influence the recognition results for the smaller groups of phones, the MEL-cepstrum parameters were computed for different spacings of the bandpass filters (Figure 2). Surprisingly, the linear spacing of the bandpass filters (Figure 2(b)) gives significantly better results for some categories of phones (nasals, oral sonorants). The differences in recognition are illustrated in Figure 3. For other spacings the recognition results were below the level achieved by the standard MEL filter spacing.
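A minimal sketch of this cosine transformation, assuming the 32 filterbank log-energies have already been computed (the bandpass-filter details are omitted; the function name is ours):

```python
import numpy as np

def mel_cepstrum(mel_log_energies, n_coeffs=16):
    """Cosine transform of filterbank log-energies (MEC_i as defined above)."""
    n_filters = len(mel_log_energies)          # N_f, e.g., 32
    k = np.arange(1, n_filters + 1)            # filter index k = 1..N_f
    return np.array([
        np.sum(mel_log_energies * np.cos(i * (k - 0.5) * np.pi / n_filters))
        for i in range(1, n_coeffs + 1)        # i = 1..N_c
    ])
```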
Cepstrum
The cepstrum is defined as

$$\mathrm{CEP}_i = \sum_{k=0}^{N_a - 1} \log|X_k| \cdot \cos\!\left(\frac{\pi\, i\, k}{N_a}\right), \qquad i = 1, 2, \ldots, N_c,$$

where $N_c$ is the number of cepstrum coefficients and $|X_k|$, $k = 1, 2, \ldots, N_a$, represents the amplitude spectrum of the signal at the k-th frequency of one speech frame. In our case $N_a$ is 128, the same as the number of samples of speech data in one frame, and $N_c$ equals 11.
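A corresponding sketch for the FFT-based cepstrum of one windowed frame (the small constant guarding the logarithm is our addition, not part of the paper's definition):

```python
import numpy as np

def cepstrum(frame, n_coeffs=11):
    """Cosine transform of the log amplitude spectrum of one frame."""
    n = len(frame)                              # N_a = 128 samples
    spectrum = np.abs(np.fft.fft(frame))        # amplitude spectrum |X_k|
    k = np.arange(n)
    return np.array([
        np.sum(np.log(spectrum + 1e-12) * np.cos(np.pi * i * k / n))
        for i in range(1, n_coeffs + 1)
    ])
```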
The Karhunen-Loève transformation
The MEL-cepstrum representation was
Figure 3. Recognition results for frame classification using MEL-cepstrum parameters. LIN = linear spacing, LOG = logarithmic spacing of bandpass filters.
introduced by Pols [13], who derived the MEL-cepstrum representation formula from experimental results. He performed the Karhunen-Loève transformation for the MEL features of Dutch vowels; the first eigenfunctions closely resembled cosine functions, and he showed that there is only a slight difference in recognition results when cosine functions are used instead of the eigenfunctions [13, pp. 53-54]. Because his testing data contained only vowels, it is possible that, for some other group of phones, other (non-cosine) functions could determine a more efficient feature representation. We also explored this possibility. Figure 4 illustrates the first 6 eigenfunctions for two groups of phones. In spite of the obvious differences in the shape of the eigenfunctions for the different phone groups, a significant change in the recognition results was noticed only when a small number of eigenfunctions (up to 6) was used (Figure 5). When 12 functions were used--at this number of MEC features the best recognition was achieved--the differences between the recognition results obtained with the MEL-cepstrum features and with the features derived from the eigenfunctions were not significant.
Figure 4. Eigenfunctions and eigenvalues of the KL transformation: (a) vowels and sonorants, (b) nonsonorants.
The linear prediction
The linear prediction coefficients (LPC) were obtained from a 12th-order all-pole approximation of the spectrum of the windowed speech frame, as in [2] ($N_c = 12$).
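A sketch of one standard way to obtain such coefficients (the autocorrelation method with the Levinson-Durbin recursion). The paper follows [2], which we have not reproduced exactly, so this is an illustrative variant; the residual energy returned here corresponds to the prediction approximation error mentioned below as an additional parameter.

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """All-pole (LPC) coefficients via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the prediction error of the previous order
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error
        a[1:i + 1] += k * a[i - 1::-1][:i]      # a_new[j] = a[j] + k * a[i-j]
        error *= (1.0 - k * k)
    return a[1:], error  # prediction coefficients and residual energy
```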
Figure 5. Comparison of recognition results with the first three MEC features, and with features determined by the first three eigenfunctions of the KL transformation, for vowels and sonorants.
The LPC-cepstrum
The LPC-cepstrum coefficients (LCE) were obtained from the LPC directly as

$$\mathrm{LCE}_i = \mathrm{LPC}_i + \sum_{k=1}^{i-1} \frac{i-k}{i}\, \mathrm{LCE}_{i-k} \cdot \mathrm{LPC}_k, \qquad i = 1, 2, \ldots, N_c,$$

where $N_c$ is the number of LPC-cepstrum coefficients. In our case $N_c$ was the same as for the LPC, i.e., 12.
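A direct transcription of this recursion into Python (a sketch; LPC sign conventions differ between texts, so the input coefficients are assumed to follow the paper's definition):

```python
import numpy as np

def lpc_to_cepstrum(lpc, n_coeffs=12):
    """LCE_i = LPC_i + sum_{k=1}^{i-1} ((i-k)/i) * LCE_{i-k} * LPC_k."""
    lce = np.zeros(n_coeffs)
    for i in range(1, n_coeffs + 1):
        acc = lpc[i - 1]                       # LPC_i (0-based storage)
        for k in range(1, i):
            acc += ((i - k) / i) * lce[i - k - 1] * lpc[k - 1]
        lce[i - 1] = acc
    return lce
```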
Additional parameters
Because some feature representations are independent of the frame energy, the normalized frame loudness (log energy) is taken as an additional parameter for all parametric representations. For its determination, the energy is normalized over the whole utterance duration, excluding the silence intervals. For the LPC and the LCE the linear prediction approximation error is also taken as an additional parameter.
Figure 6. Comparison of recognition results for non-silent phone components for the LPC, CEP, MEC and LCE feature representations. Speaker dependent recognition.

Comparison of recognition results with BNF
The recognition results are depicted in Figures 6 and 7. Because of the lack of testing data, only the vowel recognition was performed for the comparison of the MEL and MEL-cepstrum representations. An examination of the results leads to the conclusion that the MEL-cepstrum and LPC-cepstrum representations combined with the normalized frame loudness are the most suitable for Slovene phone recognition, although the differences in the recognition results for the other feature representations are not large.

MEL-cepstrum feature selection
To find out whether some special selection of the components of the MEL-cepstrum representation would improve the recognition results, we performed the following tests. On a smaller database (one speaker only) we performed a best-feature-subset selection procedure (branch and bound algorithm) based on three different criterion functions--the probabilistic distance measures of Bhattacharyya, Mahalanobis and divergence [2, p. 160]--assuming a normal distribution of the features.
Figure 7. Comparison of recognition results for Slovene vowels for the MEL and MEC feature representations. Speaker dependent recognition.
Figure 8. Comparison of recognition results with different feature selections for the phone groups (i, é, j) and (ó, ô, u). First 10 MEL-cepstrum components (SEQ), 10 components selected by the Bhattacharyya criterion (BHA) and by the Mahalanobis criterion (MAH).
The first 16 MEL-cepstrum components and the frame loudness formed the initial set of features. When looking for the best feature subset of 11 features over the group of all phone components, the first 10 MEL-cepstrum components and the frame loudness were selected by all three criterion functions. Differences in the selected feature components appeared only when smaller groups of phone components were used. Figure 8 illustrates the comparison of recognition results when the first 10 MEL-cepstrum components, or the 10 MEL-cepstrum components selected by the Bhattacharyya and Mahalanobis criteria, were used for recognition. Learning and testing procedures were performed on the same patterns of one speaker.

The influence of the number of features on the frame classification was studied for the frame loudness and the MEL-cepstrum representation. The results are depicted in Figure 9. Apparently, the recognition results do not increase significantly when more than 11 components are used. The same result was obtained by Regel in his study [18], selecting the frame loudness and the first 10 components of the MEL-cepstrum as an optimal feature representation for German phone recognition.
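For reference, a sketch of the Bhattacharyya distance used as one of the criterion functions above, under the normality assumption (per-class mean vectors and covariance matrices are assumed to be given; the function name is ours):

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian class models."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2
```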
4. Comparison of classification rules

A comparison of the classification rules for the classification of speech frames into Slovene phone components was performed using the first 11 MEL-
cepstrum features and the loudness of the speech frames. The Bayes classification for normally distributed feature vectors (BNF) [11, pp. 175-180] is compared to the classification using quadratic discriminant functions (DF) derived from the minimum mean-square error criterion [11, pp. 183-190] and to the subspace method (SM) [12], which were used in some AP modules for other languages. With the same features, classification using hidden Markov modeling and the Kohonen Self-Organizing Map was tested on a smaller part of the database (50 sentences, one speaker only).
The Bayes classification rule for normally distributed feature vectors (BNF)
This type of classification implies the following metric. The distance $d_i(x, \bar{x}_i)$ between the feature vector $x$ and the mean vector $\bar{x}_i$ of class $X_i$ is

$$d_i(x, \bar{x}_i) = (y, A_i y) + \ln |A_i|^{-1} - 2 \ln P(X_i),$$
Figure 9. Recognition results depending on the number of features for all non-silent phone components. Speaker dependent recognition.
where $y = x - \bar{x}_i$, $(y, A_i y)$ is the inner product of the vectors $y$ and $A_i y$, $A_i$ is the inverse covariance matrix of class $X_i$, $|A_i|$ is the determinant of the matrix $A_i$, and $P(X_i)$ is the a priori probability of class $X_i$. The feature vector $x$ is classified into the class $X_i$ for which the distance of the vector $x$ to the mean vector $\bar{x}_i$ is minimal:

$$d_i(x, \bar{x}_i) = \min_{1 \le j \le M} d_j(x, \bar{x}_j).$$
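A compact sketch of this rule (the class means, inverse covariance matrices and priors are assumed to have been estimated from the labelled frames; names are ours):

```python
import numpy as np

def bnf_classify(x, means, inv_covs, priors):
    """Return the class index minimizing the Gaussian distance metric above."""
    distances = []
    for mean, inv_cov, prior in zip(means, inv_covs, priors):
        y = x - mean
        # ln|A_i|^{-1} = -ln|A_i|, with A_i the inverse covariance matrix
        d = y @ inv_cov @ y - np.log(np.linalg.det(inv_cov)) - 2.0 * np.log(prior)
        distances.append(d)
    return int(np.argmin(distances))
```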
Discriminant functions for minimum mean-square error (DF)
Discriminant functions $d_i(x)$ for feature vectors $x$ are functions partitioning the feature space into disjoint regions $X_i$, which represent the different classes of patterns. To classify the patterns into $M$ classes we need the same number $M$ of discriminant functions. The feature vector $x$ is then classified into the class $X_i$ giving the maximal value of the discriminant function:

$$d_i(x) = \max_{1 \le j \le M} d_j(x).$$
The BNF classifier is a special type of such a classification, based on quadratic discriminant functions derived under the assumption that the feature vectors are normally distributed. If the feature vectors are not normally distributed, it can be more appropriate to derive the discriminant functions based on the minimum mean-square error [18]. Let us assume that in the ideal case we have discriminant functions $D_i(x)$ with the following characteristic:

$$D_i(x) = \begin{cases} 1, & x \in X_i, \\ 0, & x \notin X_i. \end{cases}$$
Thus the minimum mean-square error criterion function S 2, which we want to minimize, is:
S2=E{D-d)2}, where D=(DI,D2, ...,DM), and d=(dl,d2, ...,riM) is the vector of the estimated discriminant functions. The solution of this problem, according to [11, pp. 183-184], gives the vector:
d(x) = (POt"I Ix), P ( X 2Ix), ..., P(XM Ix)).
This means that such a classifier--if no constraints on the estimated discriminant functions $d$ are assumed--is equal to the Bayes classifier. Let us take the linear discriminant functions written as inner products

$$d_i(x) = (w_i, x^*),$$

where $w_i = (w_{i0}, w_{i1}, \ldots, w_{in})^T$ is the coefficient vector, $x^* = (1, x_1, x_2, \ldots, x_n)^T$ is the extended feature vector, and $n$ is the dimension of the feature vector. The vectors of coefficients $w_i$ can be combined into a matrix $W$,

$$W = (w_1, w_2, \ldots, w_M).$$

The vector $d(x)$ of the discriminant functions can therefore be written as

$$d(x) = W^T x^*.$$

The optimal solution for the matrix $W$ with respect to the minimum mean-square error criterion function is, according to [11, p. 185],

$$W = [E\{x^* x^{*T}\}]^{-1} E\{x^* D^T(x)\}.$$
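A sketch of how $W$ can be estimated from a labelled training sample, with the expectations replaced by sample averages and the ideal values $D(x)$ encoded as one-hot targets (function and variable names are ours):

```python
import numpy as np

def train_mse_discriminants(X, labels, n_classes):
    """Estimate W = [E{x* x*^T}]^(-1) E{x* D^T} from training samples."""
    n_samples = X.shape[0]
    X_ext = np.hstack([np.ones((n_samples, 1)), X])      # extended vectors x*
    D = np.zeros((n_samples, n_classes))
    D[np.arange(n_samples), labels] = 1.0                # ideal targets D(x)
    # Least-squares solution of X_ext @ W = D (same normal equations as above)
    W, *_ = np.linalg.lstsq(X_ext, D, rcond=None)
    return W  # classify a new x via argmax of W.T @ x*
```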
The problem of the determination of quadratic discriminant functions can be transferred to the determination of linear discriminant functions by extending the feature vector $x$ to a vector $x^*$ of the form

$$x^* = (1, x_1, x_2, \ldots, x_n, x_1^2, x_1 x_2, \ldots, x_n^2)^T.$$

Subspace method (SM)
In this type of classification, a linear $p_i$-dimensional subspace $L_i$ is determined for each class $X_i$ of patterns. The feature vector $x$ is then classified into the class $X_i$ with the maximal amount $d_i(x)$ of the projection of the vector $x$ onto the subspace $L_i$. The subspaces $L_i$ are determined on the basis of the Karhunen-Loève transformation. From the covariance matrix $R_i$ of each class $X_i$, a new orthogonal basis $Y_i$ consisting of its eigenvectors is determined:

$$R_i = Y_i^T \Lambda_i Y_i,$$

where $Y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$ is the matrix of eigenvectors and $\Lambda_i$ is the diagonal matrix of eigenvalues.
Figure 10. Comparison of the recognition results using BNF and DF classifiers.

Eigenvectors are ordered by decreasing values of the eigenvalues,

$$\lambda_{i1} \ge \lambda_{i2} \ge \cdots \ge \lambda_{in},$$

and normalized: $\|y_{ik}\| = 1$.
The orthogonal bases of the subspaces $L_i$ are therefore determined from some of the first eigenvectors of the covariance matrix $R_i$. The projection $\xi_i$ of the feature vector $x$ onto the subspace $L_i$ is

$$\xi_i = \sum_{k=1}^{p_i} (x, y_{ik})\, y_{ik}.$$

The amount of the projection $d_i(x)$, i.e., the norm of the vector $\xi_i$, is then determined from

$$d_i(x) = \|\xi_i\|^2 = \sum_{k=1}^{p_i} (x, y_{ik})^2.$$

Classification results do not depend on the magnitude of the feature vector $x$ and are sensitive to the position of the coordinate system. The measure of the variability of the feature vectors in the direction of an eigenvector is the corresponding eigenvalue. Therefore the dimensions $p_i$ of the subspaces $L_i$ are determined on the basis of the normalized sum of eigenvalues not exceeding some predeclared positive constant $K$ ($0 < K < 1$):

$$\frac{\sum_{k=1}^{p_i} \lambda_{ik}}{\sum_{k=1}^{n} \lambda_{ik}} \le K.$$

The dimensions $p_i$ of the subspaces $L_i$ are usually essentially lower than the dimension $n$ of the feature vector ($p_i \ll n$). In the Finnish speech recognition example [12, pp. 126-132] $n$ was 30 and $p_i \ge 2$.

Figure 11. Comparison of speaker dependent recognition results for Slovene vowels using BNF and SM classifiers.
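A sketch of the subspace classifier described above (per-class eigenanalysis; the cumulative-variance threshold plays the role of the constant K, and the second-moment matrix about the origin is used since the method projects the raw vector x, an assumption on our part):

```python
import numpy as np

def fit_subspace(X_class, var_fraction=0.9):
    """Return the first p_i eigenvectors for one class."""
    R = X_class.T @ X_class / len(X_class)      # second-moment matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]           # decreasing eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    p = int(np.searchsorted(ratio, var_fraction) + 1)
    return eigvecs[:, :p]                        # orthonormal basis of L_i

def sm_classify(x, bases):
    """Classify x into the class with the largest projection onto its subspace."""
    return int(np.argmax([np.sum((basis.T @ x) ** 2) for basis in bases]))
```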
Comparison of recognition results using BNF, DF and SM classifiers
The recognition results are illustrated in Figures 10 and 11. The testing results show the BNF classifier to be the most suitable. Previous comparative studies for German speech [18, pp. 153-155] and Finnish speech [12, p. 149] show that better results were obtained using the DF and SM classifiers. Because the BNF classification assumes a normal distribution of the feature vectors, the experimental
results can lead to the implicit conclusion that the distribution of MEC feature vectors of Slovene phone components is nearly normal.
Correction of classification results
Because of the modest recognition results (40%-70%) it is advisable to determine the most probable alternative classifications of each frame (in our case up to 6). After the classification an additional correction of the classification results is performed. This correction is at present based on two criteria:
• Short sequences of frames (1 or 2) classified as non-silence and surrounded by silence are assumed to be noise and are reclassified as silence.
• Frames classified as a burst (p, b, t, d, k, g, č, c), in cases when the preceding frame was not classified as a burst or silence (closure), are reclassified as the next most probable phones.
Figure 12 shows the recognition results after this type of correction (AC) in comparison with the previous recognition results (BC). Better recognition results for the nonsonorant, silence and noise categories can be noticed, while the recognition results for the other phone categories did not decrease.
Hidden Markov modeling
The theory of HMMs is described in several publications [14,7,15]. A HMM is a finite automaton, having a finite number of states, described
by two interrelated processes: a Markov chain of states connected by transition probabilities, and output probability density functions, each associated with one state. At every time instant the automaton is in one of the states, and an output symbol is generated according to the output probability density function corresponding to the current state. The Markov chain then changes the state according to the transition probabilities, produces a new output symbol, and continues until the whole sequence is generated.

The discrete output symbols are obtained from the speech signal using a vector quantization technique. To obtain the prototype vectors of the vector quantizer we chose the Linde-Buzo-Gray algorithm [8]. The algorithm divides all feature vectors $x$ into 2, 4, ..., 64 partitions, with a centroid for each partition. The vectors $x$ are classified into the partition whose centroid $x_c$ gives the smallest distortion measure $d(x, x_c)$. We chose the following distortion measure:

$$d(x, x_c) = \sum_{i=1}^{n} (x_i - x_{ci})^2,$$

where $n$ is the number of feature vector components, and $x_i$ and $x_{ci}$ are the i-th vector components of $x$ and $x_c$. The centroids of the partitions represent the prototype vectors of the vector quantizer.
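A sketch of the quantization step with this distortion measure (the codebook itself would come from the LBG training, which is not shown here):

```python
import numpy as np

def quantize(x, codebook):
    """Map a feature vector to the index of the nearest prototype vector."""
    distortions = np.sum((codebook - x) ** 2, axis=1)  # d(x, x_c) for all centroids
    return int(np.argmin(distortions))
```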
Figure 12. Correction of classification results: (a) speaker dependent classification, (b) speaker independent classification.

Figure 13. The phone model.
If we describe each prototype vector by a symbol, the input vectors can be represented by these symbols, which are the elements of the output probability functions of the HMMs. Before training the phone models, the model topology has to be defined. We chose a simple model structure, depicted in Figure 13. The model consists of 4 states and 9 possible transitions. With these four states we modeled the left, middle and right portions of a phone. In order to train the phone models we have to estimate the transition and output probabilities of the models. The goal of the training procedure is to maximize the probability $P(O \mid \lambda)$ of the observation sequence $O$ given the phone model $\lambda$. To evaluate the parameters of the model we used the iterative Baum-Welch algorithm [15] and the labeled database.
The phone recognizer is a parallel concatenation of the phone models connected by a common initial and a common final state. Figure 14 illustrates the phone recognizer containing 30 phone models. The recognition of an unknown observation sequence is performed by a Viterbi search [3,6] for the state sequence that the model visits to produce the observation with maximum probability. Knowing the state sequence of the phone recognizer, we can decode the input sequence and transform it into a phone sequence.
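A minimal sketch of the Viterbi search for a discrete-observation HMM (log domain; the parameter names are ours, not the paper's):

```python
import numpy as np

def viterbi(obs, log_trans, log_emit, log_init):
    """Most probable state sequence for a discrete-observation HMM."""
    n_states = log_trans.shape[0]
    T = len(obs)
    delta = np.full((T, n_states), -np.inf)   # best log-prob ending in each state
    psi = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # predecessor scores
        psi[t] = np.argmax(scores, axis=0)           # best predecessor per state
        delta[t] = scores[psi[t], np.arange(n_states)] + log_emit[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```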
The Kohonen Self-Organizing Map
The Kohonen Self-Organizing Map (KSOM) [5], like the vector quantization method [4], is intended to approximate the continuous probability density function $p(x)$ of the feature vector $x$ by a finite number of codebook vectors $m_i$, which are labeled and arranged in a two-dimensional network of cells. The positioning of the codebook vectors is the task of an unsupervised learning procedure.
Figure 14. The phone recognizer.
Figure 15. A label of a cell C: $p(A_{c1}) \ge p(A_{c2}) \ge p(A_{c3}) \ge p(A_{c4}) \ge p(A_{c5})$, where the $A_{ci}$ are phones.

If the KSOM is to be used as a pattern classifier in which the cells are grouped into subsets, each of which corresponds to a pattern class, then an additional step using a supervised learning procedure is needed to minimize misclassification errors. The classification accuracy can be improved by updating the codebook vectors with an iterative optimization method known as Learning Vector Quantization (LVQ) [5]. This supervised method more accurately demarcates the borders between groups of cells corresponding to the same class. When the Map is created using a set of labeled feature vectors, it is reasonable to include all statistical features, i.e., the cell labels are extended to a main label and alternative labels with corresponding probabilities (Figure 15), in order to improve the accuracy of the subsequent segmentation.

Furthermore, as far as is known, the basic KSOM has a low accuracy for the classification of the plosives. The plosives are detectable only as transient states of the speech waveform and, because their probability density functions are widespread, they often 'fall out' of the Map. The problem is mainly solved by using special auxiliary maps in which only the plosives are presented. For this purpose, the presence of such phones (as a group) is first detected from the waveform. Some improvement can also be achieved by adjusting the Map to the statistical properties of the phones before using the LVQ algorithm. The Map is statistically analyzed and corrected in order to assign an appropriate number of network cells to each phone class, according to its probability. If there is no main label for a class on the Map, it is searched out from the alternatives. After 'fine tuning' of the Map by the LVQ algorithm the misclassification error is more evenly distributed over all pattern classes.

The learning algorithm depends on several parameters, mainly experimentally derived. For up
to a maximum of 260 cells the following parameters were chosen:
• the network is a rectangle of 20 × 13 cells,
• the neighborhood set of cells $N_c(t)$ is of hexagonal shape, with a radius decreasing linearly from 7 to 1,
• the function $\alpha(t)$ is $\alpha(t) = 0.9(1 - t/1000)$ for the first 1000 steps and $\alpha(t) = 0.02(1 - t/50000)$ for the next 50000 steps,
• before the LVQ-1 optimization of the Map the described correction is made,
• the function $\alpha(t)$ is $\alpha(t) = 0.02(1 - t/100000)$ for the 100000 steps of the LVQ-1 algorithm.
In addition, it should be mentioned that we made experiments with different Map shapes, different functions $\alpha(t)$ and $N_c(t)$, and different LVQ optimization methods. The LVQ-1 [5] algorithm probably gives the best results because of the previously described correction of the Map; a sketch of its update rule is given below. Further improvement of the recognition rate could be achieved by increasing the number of network cells, by including the statistical features of the phones in the learning algorithm and by including additional auxiliary maps for the plosives.
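A sketch of one LVQ-1 update step with the final learning-rate schedule quoted above (the codebook and its cell labels are assumed to come from the trained Map; names are ours):

```python
import numpy as np

def lvq1_step(codebook, code_labels, x, label, t, total_steps=100_000):
    """One LVQ-1 update: pull the winning codebook vector toward x if its
    label matches the sample's label, push it away otherwise."""
    alpha = 0.02 * (1.0 - t / total_steps)            # learning-rate schedule
    winner = np.argmin(np.sum((codebook - x) ** 2, axis=1))
    sign = 1.0 if code_labels[winner] == label else -1.0
    codebook[winner] += sign * alpha * (x - codebook[winner])
```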
Comparison of classification results
An examination of the recognition results shows on average only small differences in the classification results for all phones and for the vowels (Figure 16). Significant differences in the classification results appear for the nasal, oral sonorant and nonsonorant categories. While the BNF classifier gives the best recognition results for nasals and oral sonorants, the HMM classifier is superior in the recognition of nonsonorant phones, which comprise the majority of the phones (Table 1). This leads to the conclusion that the feature vector distribution for phones such as vowels, nasals and oral sonorants can be very well approximated by a normal distribution function, whereas the nonsonorant phones, often having discontinuities in
Figure 16. Comparison of recognition results of different classifiers. Speaker dependent recognition.
their spectra, are better described by a stochastic speech model like the HMM. The analysis of the Slovene phone recognition results leads to the conclusion that better recognition results could be achieved if the AP module used different classification procedures for different phone categories in phoneme recognition.
5. Conclusion
It has been found that for the recognition of Slovene phone components the best results were obtained using the MEL-cepstrum features computed from linearly spaced bandpass filters. In the feature selection step it has been shown that the frame loudness and the first 10 MEL-cepstrum coefficients of the speech frame form an optimal feature vector. Using such feature vectors we tested different classification procedures. The phone component recognition results presented in this paper display some contradictions with certain previous comparative studies made for other languages. The DF and SM classifiers did not yield better results than the BNF classifier. Better recognition results for nonsonorant phones were achieved using the HMM classifier. Hence the conclusion that the AP module should use different classifiers for different phone categories. We propose a hybrid classifier, which would perform the phone group classification using the BNF classifier, while for phone component classification within a group the BNF, HMM or KSOM classification procedures should be used.
Further work will be based on forming a larger labelled database in order to provide more reliable results, and thus to confirm the usefulness of the proposed AP module.

References

[1] Davis, S.B. and P. Mermelstein (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28, 357-366.
[2] Devijver, P.A. and J. Kittler (1982). Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ.
[3] Forney, Jr., G.D. (1973). The Viterbi algorithm. Proc. IEEE 61 (3), 268-278.
[4] Gray, R.M. (1984). Vector quantization. IEEE ASSP Magazine 1, 4-29.
[5] Kohonen, T. (1990). The self-organizing map. Proc. IEEE 78 (9), 1464-1480.
[6] Lee, K.-F. and H.-W. Hon (1989). Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Acoust. Speech Signal Process. 37 (11), 1641-1648.
[7] Levinson, S.E., L.R. Rabiner and M.M. Sondhi (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J. 62 (4), 1035-1073.
[8] Linde, Y., A. Buzo and R.M. Gray (1980). Vector quantization in speech coding. IEEE Trans. Commun. 28 (1), 84-95.
[9] Mihelič, F., L. Gyergyek and N. Pavešič (1989). Acoustic-phonetic module for continuous Slovene speech recognition. MELECON'89 Proceedings, Lisbon, 249-252.
[10] Mihelič, F. (1991). Acoustic-Phonetic Transcription of Slovene Speech. Dissertation, University of Ljubljana, Faculty of Electrical and Computer Engineering (in Slovene).
[11] Niemann, H. (1983). Klassifikation von Mustern. Springer, Berlin.
[12] Oja, E. (1983). Subspace Methods of Pattern Recognition. Research Studies Press Ltd., Letchworth, Hertfordshire.
[13] Pols, L.C.W. (1977). Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words. Dissertation, Academische Pers B.V., Amsterdam.
[14] Rabiner, L.R. and B.H. Juang (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 4-16.
[15] Rabiner, L.R. (1988). Mathematical foundations of hidden Markov models. In: H. Niemann, M. Lang and G. Sagerer, Eds., Recent Advances in Speech Understanding and Dialog Systems. NATO ASI Series Vol. F46. Springer, Berlin, 183-206.
[16] Regel, P. (1982). A module for acoustic-phonetic transcription of fluently spoken German speech. IEEE Trans. Acoust. Speech Signal Process. 30, 440-450.
[17] Regel, P. (1986). Deriving an efficient set of features for classifying phones. In: I.T. Young et al., Eds., Signal Processing III: Theories and Applications. North-Holland, Amsterdam, 501-504.
[18] Regel, P. (1988). Akustisch-Phonetische Transkription für die Automatische Spracherkennung. VDI Verlag, Düsseldorf.
[19] Toporišič, J. (1976). Slovenska slovnica. Založba Obzorja, Maribor (in Slovene).