Computer Methods and Programs in Biomedicine, 35 (1991) 125-139
© 1991 Elsevier Science Publishers B.V. 0169-2607/91/$03.50
Section II: Systems and programs
On the use of hidden Markov modelling for recognition of dysarthric speech

J.R. Deller, Jr. 1, D. Hsu 2 and L.J. Ferrier 3

1 Michigan State University, Department of Electrical Engineering, Control, Systems, and Signal Processing Research Group: Speech Processing Laboratory, East Lansing, MI 48824, U.S.A., 2 Northeastern University, Department of Electrical and Computer Engineering, Center for CDSP Research, Boston, MA 02115, U.S.A. and 3 Northeastern University, Department of Speech and Language Pathology and Audiology, Boston, MA 02115, U.S.A.
Recognition of the speech of severely dysarthric individuals requires a technique which is robust to extraordinary conditions of high variability and very little training data. A hidden Markov model approach to isolated word recognition is used in an attempt to automatically model the enormous variability of the speech, while signal preprocessing measures and model modifications are employed to make better use of the existing data. Two findings are contrary to general experience with normal speech recognition. The first is that an ergodic model is found to outperform a standard left-to-right (Bakis) model structure. The second is that automated clipping of transitional acoustics in the speech is found to significantly enhance recognition. Experimental results using utterances of cerebral palsied persons with an array of articulatory abilities are presented.

Keywords: Cerebral palsy speech recognition; Dysarthric speech recognition; Hidden Markov model; Speech recognition

Correspondence: J.R. Deller, Jr., Department of Electrical Engineering/258 EB, Michigan State University, East Lansing, MI 48824-1226, U.S.A.
1. Introduction
Speech articulatory disability can arise from numerous conditions including cerebral palsy (CP), aphasia, amyotrophic lateral sclerosis, multiple sclerosis, Parkinson's disease, laryngectomy, and others. The persons participating in the experiments in this work all have speech disabilities due to CP. While a large segment of the population which might ultimately benefit from this research has CP, there is nothing about the work which makes it specific to dysarthria as a result of CP. To emphasize this generality, therefore, we shall use the term 'dysarthric speech' in this paper to describe speech which is difficult to understand and recognize as a result of the speaker's disability. Many individuals who are severely dysarthric, while using alternative means of communication, have normal or exceptional intellects, and normal or superb reading and language skills, and would strongly prefer to use their speech capabilities, however limited. The central message of this paper is that there is apparently recognizable acoustic information in the speech of even severely dysarthric persons upon which automated recognition on a wide scale might be based. The applications of such recognition are, of course, manifold. Equally important, however, is the message that recognition of dysarthric speech is a distinctly different problem from that of normal speech, and new strategies and approaches will
be needed. It will apparently not be possible to apply 'normal' speech recognition approaches directly. In particular, we report on two findings in the use of the hidden Markov model (HMM) for this purpose which run counter to generally accepted practice in the normal case. The first is that an ergodic model is found to outperform a standard left-to-right (Bakis) model structure. The second is that automated clipping of transitional acoustics in the speech is found to enhance recognition significantly. An HMM approach to isolated word recognition (IWR) was chosen in an attempt to automatically model the enormous variability of dysarthric speech and also as a natural extension of a simple state model used in earlier work * [3]. Much of what we have learned has been based on experimental application of HMMs with various structures and enhancements to real data, followed by a post-analysis of results. This experimental process has shed light not only on how to best use HMMs to recognize dysarthric speech, but why. In this paper, we will describe these findings, along with recognition results for an extensive case study of one person's speech, and digit recognition results for two other individuals. The three subjects comprise a wide spectrum of articulatory skills. This work is part of a more general effort to develop artificially intelligent communication aids for severely speech and motor disabled persons [20], but is also being explored with an interest in the feasibility of a 'stand-alone' recognizer of dysarthric speech.
* The study cited in Ref. 3 suggests the possibility of a simple constrained-phoneme, speaker-dependent language for certain speech-disabled individuals. Given that such individuals would find this communication medium acceptable and useful, this possibility opens another avenue of research involving many important engineering and clinical problems. Our work here is based on the premise that users would prefer to use conventional language where possible, and upon the desire to investigate relatively speaker-independent techniques. The HMM, with its trainable and self-organizing capabilities, is ideally suited to such a pursuit.
2. Technical methods

2.1. Introduction
As in any recognition task, our technical methods employ a priori structures which attempt to reduce the entropy of the message to be decoded. We set up these constructs in a way in which a trained listener might form internal representations of the speaker's language in conjunction with 'external' information. An extensive literature search has not uncovered any reports of clinical studies which have attempted to establish how listeners understand severely dysarthric speech. It is widely observed by clinicians, however, that persons familiar to a speech disabled individual (e.g., the mother of a cerebral palsied child) can frequently understand and communicate with him or her in spite of the fact that the person might be utterly unintelligible to an 'untrained' listener. It is likely that 'trained' listeners are employing two perceptual tools: (i) selective recognition of consistently well-pronounced (or at least consistently pronounced) phonemes, and (ii) effective use of 'context'. The human process of understanding dysarthric speech probably uses the ability to discount inconsistent information in the speech, and temporal relationships among consistent pieces of information. It is important to point out that 'consistency' of speech does not imply speech which is 'normally' pronounced; rather that the speaker's pronunciation of certain phones is reasonably consistent. Context refers to the grammatical content, the general content of the conversation, knowledge of the environment in which the conversation is taking place, and knowledge of the background of the speaker. Context helps the human listener infer the speaker's message by making some messages 'more likely' than others.

In this work, it is the ultimate objective to 'simulate' these two perceptual tools, i.e., (i) to seek out consistent information and discount inconsistent information in dysarthric speech at lower levels of processing, and (ii) to use the grammatical context to constrain message element search at higher levels of processing, to recognize dysarthric speech. This paper is concerned primarily with achieving the
first of these tasks using an approach based on the HMM.
2.2. Signal level processing

All speech used in this work was digitized at 10 kHz using 12-bit A/D conversion after 4.5 kHz lowpass filtering. The basic task at the signal level is to convert the speech sequence into a time-varying parametric representation. In this work, a novel algorithm is used to compute an adaptive, temporally recursive, estimate of the linear prediction (LP) parameters [15] of the speech. The method is briefly described here with details found in Ref. 2.

The basis for the recursion is the following classical LS problem: Find the least-square error solution, say â, for the vector a, in the overdetermined system of equations

         [ s^T(1) ]          [ s(1) ]
    Q(N) [ s^T(2) ] a = Q(N) [ s(2) ]                                (1)
         [   ...  ]          [  ... ]
         [ s^T(N) ]          [ s(N) ]

denoted

    Q(N)S(N)a = Q(N)σ(N).                                            (2)

In the present problem, a is the M-vector of LP parameters associated with the speech; s(n) represents the speech sequence, and s(n) a vector of M past values at time n, [s(n-1) ... s(n-M)]^T; Q(N) is a diagonal matrix,

    Q(N) = diag[β^(N-1), β^(N-2), β^(N-3), ..., β, 1]                (3)

in which β is a positive number less than unity. The well-known solution is (see, for example, Ref. 9)

    â(N) = [S^T(N)Q^T(N)Q(N)S(N)]^(-1) S^T(N)Q^T(N)Q(N)σ(N).         (4)

For a fixed N, this solution is readily shown to be the conventional covariance LP solution [15] commonly employed in speech processing, except that here the error minimization is subject to weighting β^(2(N-n)) on the nth squared prediction error. If employed repeatedly for increasing N, the weights serve as a 'forgetting factor' which causes the solution to most heavily reflect the most recent dynamics of the signal. A simple recursive algorithm requiring about (M+1)^2 storage locations for computing such an adaptive estimate of the LP parameter vector is as follows [2,14].

STEP 0: Initialize an (M+1) × (M+1) matrix, W, to a null matrix.
STEP 1: Multiply the upper M rows of W by β (see Eqn. 7 below).
STEP 2: For n = 1, 2, ..., N enter the next equation into the bottom row of W,

    [ s^T(n) | s(n) ]                                                (5)

STEP 3: 'Rotate *' the new equation into the system using

    W'_{m,k} = c_m W_{m,k} + s_m W_{M+1,k}
    W'_{M+1,k} = -s_m W_{m,k} + c_m W_{M+1,k}                        (6)

for k = m, m+1, ..., M+1 and for m = 1, 2, ..., M; where c_m = W_{m,m}/ρ, s_m = W_{M+1,m}/ρ, ρ = sqrt(W²_{m,m} + W²_{M+1,m}), and W_{m,k} (W'_{m,k}) is the m,k element of W pre- (post-) rotation. No other elements of W are affected.
STEP 4: Solve for â(n) if necessary or desirable.

It is useful to view W as four partitions. Following the rotation of the nth equation in STEP 2, for example,

    W = [ T(n)   d1(n) ]
        [  0     d2(n) ]                                             (7)

T(n) is an upper triangular M × M matrix, and the system T(n)a = d1(n) can be solved using back substitution [9] to obtain the LS estimate, â(n). Useful features of the method for speech processing (when used with a general set of weights) are discussed in Ref. 4. There are two such features of particular interest for this study.
* This name comes from the fact that Givens rotations [9] are being used to include the new row into the system of equations.
First is the fact that the sum of squared prediction errors, say δ(n), is conveniently computed from the sequence d2(n) using

    δ(n) = Σ_{i=1}^{n} d2²(i) = δ(n-1) + d2²(n).                     (8)
Secondly, an Itakura-like distance [12] between a resulting â(n) and any reference vector, say a_ref, is conveniently computed in terms of the available quantities at time n,

    I{a_ref, â(n)} = log[(γ(n) + δ(n)) / δ(n)]                       (9)

where γ(n) = ||T(n)a_ref - d1(n)||² and δ(n) is defined in Eqn. 8. We will use this second feature to compute a 'parameter derivative' by letting a_ref = â(n-1) at time n. The distance computation is also used in the HMM vector quantization procedure (see below) to find the nearest member of the 'codebook'. In this case a_ref represents the incoming estimate and â(n) is associated with the codebook.
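Pulling the pieces of this subsection together, the following is a minimal NumPy sketch of STEPs 0-4 and of the distance of Eqn. 9. It is our own illustrative rendering under stated assumptions, not the authors' implementation: the function names are hypothetical, the LP estimate is extracted at every sample (rather than only 'if necessary or desirable'), and start-up transients are ignored.

```python
import numpy as np

def recursive_lp(s, M=14, beta=0.992):
    # STEP 0: initialize the (M+1) x (M+1) working array W to a null matrix.
    W = np.zeros((M + 1, M + 1))
    delta = 0.0                      # delta(n), the error sum of Eqn. 8
    a_hat = np.zeros(M)
    for n in range(M, len(s)):
        W[:M, :] *= beta             # STEP 1: exponential 'forgetting factor'
        W[M, :M] = s[n - M:n][::-1]  # STEP 2: bottom row [s^T(n) | s(n)], with
        W[M, M] = s[n]               #         s(n) = [s(n-1), ..., s(n-M)]^T
        for m in range(M):           # STEP 3: Givens rotations of Eqn. 6
            rho = np.hypot(W[m, m], W[M, m])
            if rho > 0.0:
                c, g = W[m, m] / rho, W[M, m] / rho
                top = c * W[m, m:] + g * W[M, m:]
                W[M, m:] = -g * W[m, m:] + c * W[M, m:]
                W[m, m:] = top
        delta += W[M, M] ** 2        # Eqn. 8: delta(n) = delta(n-1) + d2^2(n)
        # STEP 4: solve T(n) a = d1(n); T(n) = W[:M, :M] is upper triangular,
        # so back substitution suffices (a general solve is used for brevity).
        a_hat = np.linalg.solve(np.triu(W[:M, :M]), W[:M, M])
    return a_hat, W, delta

def itakura_like(a_ref, W, delta):
    # Eqn. 9: log[(gamma(n) + delta(n)) / delta(n)], with
    # gamma(n) = ||T(n) a_ref - d1(n)||^2; assumes delta > 0.
    M = W.shape[0] - 1
    gamma = np.sum((np.triu(W[:M, :M]) @ a_ref - W[:M, M]) ** 2)
    return np.log((gamma + delta) / delta)
```

In practice an estimate would be back-substituted only when a frame is selected (every 100th sample in the experiments below), which is what keeps the per-sample cost of the recursion at the rotation step alone.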
2.3. Word level processing

At the 'word level', we are concerned with the training of the HMM's and their use in IWR. We assume here that the reader is familiar with the general operating principles of the HMM, and discuss specific implementation issues relevant to this research. Comprehensive tutorial material on the HMM can be found in many sources (e.g., Refs. 13, 17). The training and evaluation of HMM's in this study generally follow conventional procedures with some modifications to be described below. Each word in this study was modeled by a discrete symbol HMM. The observation sequence (corresponding to any utterance for training or recognition) was derived from samples of the sequence of 14th order LP parameter vectors [15] resulting from the recursive estimation procedure discussed above. The exponential weighting factor, β, was experimentally set at 0.992, and every 100th vector was selected so that a representative
feature of the data was taken every 10 ms. These parameters were vector quantized using the Itakura distance and a binary search codebook [16]. Whereas the data sample times have been indexed by the discrete variable n above, we will henceforth index the times at which vectors are selected (every 100th 'n') by the discrete variable, t.
A 256-level binary search codebook was developed for each vocabulary using the K-means algorithm [16] with the Itakura distance as a measure of distortion. During training, cells were split by seeding the two new cells with 'perturbed' versions of the existing centroid vector, i.e., if a_c was the centroid of an existing cell to be split, the new clusters were seeded with 'means' 0.99a_c and 1.01a_c (a sketch of this splitting step follows Fig. 1). The effects of the binary vs. full search, and of the number of symbols in the codebook, are illustrated in Fig. 1 for our primary experiment. This figure is based on 120322 LP vectors obtained from 10 repetitions of a 196-word vocabulary spoken by a cerebral palsied individual ('LE') who is described below. Many details of the codebook construction, including a study of the appropriate codebook search strategy, are found in Ref. 10. One point should be highlighted here. Most implementations of LP parameter vector quantization using the Itakura distance as a distortion measure use 'correlation' LP parameters [15]. The conventional Itakura distance [12] is a dot-product involving the autocorrelation matrix of the data. However, our LP vectors are based upon a 'covariance'-like [15] procedure (see above). There is no such simple equivalent dot-product form corresponding to the covariance-based method. In order to obtain the Itakura distance between a vector-quantized LP estimate and any codebook centroid, we use the normalized centroid vector covariance matrix * in conjunction with Eqn. 9 rather than the data covariance for distortion comparisons. To be consistent, since the Itakura distance is not symmetric, we use the centroid covariance matrix for both training and quantizing comparisons.
* Actually the square root of the covariance matrix (see subsection 2.2).
[Fig. 1. Average distortion vs. number of clusters (codebook levels, log2 scale from 3 to 10) for full search and uniform binary search strategies after convergence of the clustering algorithm. Distortion is based on the Itakura distance. This experiment is based on 120322 LP vectors derived from 10 repetitions of the 196-word vocabulary (W-196) by speaker LE.]
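The split-and-seed codebook growth described above can be sketched as follows. This is our own illustration with hypothetical names, and squared Euclidean distortion stands in for the Itakura measure for brevity; the 0.99/1.01 perturbation factors are those given in the text.

```python
import numpy as np

def grow_binary_codebook(vectors, levels=256, iters=10):
    # 'levels' must be a power of two; each pass doubles the codebook by
    # seeding two new cells at 0.99*a_c and 1.01*a_c, then refines by K-means.
    codebook = [vectors.mean(axis=0)]
    while len(codebook) < levels:
        codebook = [f * c for c in codebook for f in (0.99, 1.01)]
        for _ in range(iters):
            cb = np.asarray(codebook)
            # squared Euclidean distortion (stand-in for the Itakura measure)
            d2 = ((vectors ** 2).sum(1, keepdims=True)
                  - 2.0 * vectors @ cb.T + (cb ** 2).sum(1))
            labels = d2.argmin(axis=1)
            # empty cells keep their old centroid
            codebook = [vectors[labels == k].mean(axis=0)
                        if np.any(labels == k) else cb[k]
                        for k in range(len(cb))]
    return np.asarray(codebook)
```

The successive doublings also yield the binary search tree used at quantization time: an incoming vector descends the split history, comparing against two centroids per level rather than all 256.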
Training of each HMM was based on the Baum-Welch reestimation procedure for multiple observation sequences [13,18,19]. Five training utterances were used in each case. The problem of lack of training data with which to accurately characterize the statistical distributions in the HMM, which is common to all HMM training problems, is extraordinary in the dysarthric speech problem. Attempts to achieve better statistical characterization through more data collection are impractical and unfruitful. Collection of even small amounts of data is a stressful and time-consuming experience for many speech-disabled individuals *. Lengthy recording sessions are self-defeating in this regard, since mental and physical fatigue and frustration introduce more variability, thus changing the very
* It took 20 contact hours, for example, to complete the sessions in which our main subject phonated 10 repetitions of 196 words.
statistical nature of the process one is attempting to estimate. Therefore, HMM recognition of dysarthric speech must be accomplished under extraordinary conditions of lack of training data and high variability. As a first measure, therefore, we employ the conventional procedure of post-estimation constraints [17] on the symbol probabilities of the form

    b_jk ≥ ε > 0                                                     (10)
for any kth symbol in any jth state, where ε is a small probability. A critical performance factor related to this paucity of data was found to be the model structure. In this study, two types of model structure, 'ergodic' and 'Bakis' [17], were considered. The ergodic form allows transitions among all states. However, a designated initial and terminal state was included to impose a weak 'left-to-right' character on the model. The Bakis model is a constrained serial ('left-to-right') HMM which allows 'next state' transitions and one 'look-ahead'
(skip) transition from each state. A designated start and end state are inherent in this structure; the sketch below illustrates the two structural constraints. The effects of these HMM structures are best discussed in the context of the experimental evidence to which we now turn.
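Schematically, the two transition structures and the floor of Eqn. 10 look like the following. This is our own sketch with hypothetical names and uniform initial values; the paper's designated start and end states are not modelled here, and ε = 0.00005 is the value reported in the caption of Fig. 2.

```python
import numpy as np

def bakis_transitions(n_states):
    # Serial 'left-to-right' structure: self-loop, next-state, and one
    # 'look-ahead' (skip) transition from each state.
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        allowed = [j for j in (i, i + 1, i + 2) if j < n_states]
        A[i, allowed] = 1.0 / len(allowed)
    return A

def ergodic_transitions(n_states):
    # Transitions permitted among all states.
    return np.full((n_states, n_states), 1.0 / n_states)

def floor_symbol_probs(B, eps=5e-5):
    # Post-estimation constraint of Eqn. 10: b_jk >= eps > 0, renormalized
    # so each state's symbol distribution still sums to one.
    B = np.maximum(B, eps)
    return B / B.sum(axis=1, keepdims=True)
```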
3. Experiments with conventional HMM procedures
In an attempt to validate the conventional HMM recognition procedures described above, we conducted a case study which focuses upon one cerebral palsied individual. The subject (LE) in this study is a 44-year-old male, with mixed spastic athetoid CP. He can read and has a normal understanding of rules of language and grammar. (He has successfully completed a number of undergraduate courses in Sociology.) LE habitually attempts to use speech to communicate with both strangers and familiar people, in spite of the very slow rate of communication, and extreme difficulty in producing intelligible speech. His dysarthria is largely attributable to difficulty with lip-rounding and tongue elevation. He therefore has particular difficulty with the bilabial stops /b/ and /p/, and with vowels requiring lip-rounding and tongue elevation, e.g., /u/ and /i/. LE attempts to produce most phonemes, but distortions and substitutions are frequent. He is occasionally dysfluent, repeating initial syllables.

The clinical measure used to quantify LE's speech intelligibility is the Computerized Assessment of Intelligibility of Dysarthric Speech (CAIDS) [21], which is widely used in clinics. This computer-based test generates a list of 50 randomly ordered, phonetically balanced single words. The subject reads the words and the reading is audio-taped under good recording conditions. Subsequently, three judges listen to each word and choose from a set of 12 words, entering their responses through a computer keyboard. The program provides the percentage of words which are intelligible (correctly recognized) to the listeners. Interjudge reliability coefficients range from 88% to 99%. The intelligibility study for all subjects was conducted by L. Ferrier and the judges were three graduate students in the Department of Speech and Language Pathology and Audiology at Northeastern University.

According to the CAIDS index, LE's intelligibility score is 59.3%. Extreme caution is urged in the interpretation of this quantity. We include this number because it is well-understood by clinical experts and because it provides at least a subjective ranking of the speech skills of our subjects. It should not be interpreted to mean that LE is intelligible '59.3% of the time', nor can this measure be very meaningfully quantitatively compared with recognition rates reported below. The CAIDS task is a relatively simple one which would favor 'high' scores. The computer recognition task involves a larger vocabulary and no preset search list in each case.

Two different sets of vocabulary were used to test HMM recognition of LE's speech. The first is denoted 'W-10', and consists of ten digits, zero through nine. The second, denoted 'W-196', is composed of 196 different commonly used English words listed in the Appendix. These words are a phonetically balanced sample from a text which has been used in previous research [8]. LE was asked to phonate the digits from W-10 fifteen times, and the words from W-196 ten times. For W-10, five repetitions of each digit were used as training data, and the other ten repetitions as testing data. For W-196, the training data consisted of five repetitions of each word, and the second set of five was used as testing data.

Before testing the procedures with LE's data, the codebook, model training, and recognition algorithms were empirically verified with normal speech data. Fifteen utterances of the W-10 vocabulary were collected from each of two adult males and partitioned into training and test sets as above. Both Bakis and ergodic models were tested in speaker-independent trials for each speaker, and in all four experiments, 100% recognition was obtained.

In the first test with LE's speech, we used the Bakis HMM to model each word in W-10 and W-196. Such a structure can be argued to be realistic in terms of the inherent left-to-right nature (temporal sequencing) of the observations. It has also been reported to have better performance than any other structure in normal speech recognition [19].
TABLE 1

Digit recognition experiments (vocabulary W-10) with three HMM strategies for speaker LE

Utterance          Words recognized (frequency)
(10 repetitions)   Bakis HMM                 Ergodic HMM               Ergodic HMM & transition clipping

One                one(7) four(3)            one(9) four(1)            one(10)
Two                two(8) zero(2)            two(10)                   two(10)
Three              three(8) six(2)           three(8) six(2)           three(8) six(2)
Four               four(7) two(2) zero(1)    four(8) one(1) zero(1)    four(9) zero(1)
Five               five(7) nine(2) four(1)   five(9) nine(1)           five(10)
Six                six(8) seven(2)           six(8) three(2)           six(9) three(1)
Seven              seven(7) six(2) zero(1)   seven(10)                 seven(10)
Eight              eight(10)                 eight(9) zero(1)          eight(10)
Nine               nine(8) five(2)           nine(8) five(2)           nine(8) five(2)
Zero               zero(7) two(2) four(1)    zero(8) two(1) four(1)    zero(8) two(1) four(1)

Correct
recognition:       77%                       87%                       92%
However, the Bakis model is not very successful in the application of dysarthric speech recognition. Results of using the left-to-right structure to model words in vocabularies W-10 and W-196 are summarized in Tables 1 and 2, respectively. The overall correct recognition rates for W-10 and W-196 are 77 out of 100 (77.0%), and 493 of 980 (50.3%), respectively. Some initial reasons for this less than satisfactory performance are related to the model structure in conjunction with the limited training data. In the process of recognition using the Bakis model, when an 'untrained' symbol (i.e., a symbol which does not appear in the training data at that state) occurs, the consequences are particularly significant. The following example from our early data analysis illustrates the problem in a heuristic way. It is to be noted that the left-to-right structure employed in this example is slightly more constrained than the Bakis model in that it does not permit a 'skip' transition from states.

Example 1. Let us denote the set of codebook symbols by {v_k}. Given the utterance 'four' with an associated sequence of VQ symbols, v53 v53 v53 v64 v64 v63 v63 v64 v64 v64 v64 v64 v63 v61 v61 v64 v64 v64 v64 v59 v59 ..., let us look at the process of computing the likelihood of this observation sequence given the six-state serial HMM, trained on five other utterances of the word 'four', which is shown in Fig. 2(a). In the Viterbi decoding [7] of the initial part of the symbol string, the best path will reside in state two in order to generate the apparently 'steady-state' sequence comprised mainly of symbol v64. It is apparent from studying the data and the model that the maximum likelihood state sequence will be one which remains in state two to generate the 'aberrant' symbol v61, with probability ε, when it appears, even though that symbol did not appear in the training data at that state. An alternative would be to jump to state three and generate v61 with probability 0.173. However, in the latter case, when the many recurrences of the symbol v64 appear, the process must generate symbols v64 in one of the final four states with probability ε each time because the model structure does not allow a return to state two. This alternative would therefore lead to a lower likelihood decoding of the symbol string. The flexibility to return to state two afforded by the ergodic model would be potentially helpful in this regard, but at some cost which is discussed below.

Other insights into the sparse data problem result from the training phase of the HMM. When a few prominent symbols occur frequently in a word, the Baum-Welch reestimation with a serial constraint often favors the distribution of one prominent symbol per state for the leftmost states,
TABLE 2

Representative word recognition experiments (from vocabulary W-196) with three HMM strategies for speaker LE

Utterance        Words recognized (frequency)
(5 repetitions)  Bakis HMM                          Ergodic HMM                Ergodic HMM & transition clipping

a                a(1) age(2) american(1) away(1)    a(3) age(2)                a(4) age(1)
alcoholism       alcoholism(5)                      alcoholism(5)              alcoholism(5)
asked            asked(2) always(2) was(1)          asked(3) always(2)         asked(5)
be               be(1) he(3) been(1)                be(2) he(3)                be(2) he(3)
but              but(1) hot(3) part(1)              but(3) part(2)             but(5)
childhood        childhood(4) child(1)              childhood(4) child(1)      childhood(5)
divorce          divorce(3) obvious(1) grows(1)     divorce(3) obvious(2)      divorce(5)
each             each(1) he(2) be(1) its(1)         each(1) he(2) be(2)        each(5)
first            first(2) existed(2) just(1)        first(3) just(2)           first(5)
get              get(1) had(2) head(2)              get(1) had(2) head(2)      get(5)
go               go(1) over(3) grows(1)             go(2) grows(1) how(2)      go(4) how(1)
have             have(2) has(2) had(1)              have(3) shave(2)           have(4) shave(1)
he               be(3) his(2)                       he(1) be(4)                he(4) the(1)
it               if(2) be(2) he(1)                  if(4) be(1)                it(5)
knows            knows(1) know(2) now(2)            knows(3) know(2)           knows(5)
lot              lot(1) not(3) hot(1)               lot(1) not(3) hot(1)       lot(3) not(2)
muscle           muscle(3) must(2)                  muscle(5)                  muscle(5)
now              now(1) know(3) knows(1)            now(3) our(2)              now(4) our(1)
on               on(2) an(2) from(1)                on(2) an(3)                on(5)
part             part(2) but(2) hot(1)              part(2) but(2) hot(1)      part(5)
ran              ran(2) an(2) and(1)                ran(3) an(2)               ran(3) an(2)
said             said(3) at(1) say(1)               said(3) say(2)             said(4) say(1)
still            still(2) bills(2) tell(1)          still(3) bills(2)          still(5)
the              the(1) he(2) be(2)                 the(1) he(2) be(2)         the(1) he(2) be(2)
they             they(2) were(3)                    they(2) were(3)            they(4) were(1)
television       television(4) notion(1)            television(5)              television(5)
went             went(3) and(1) when(1)             went(4) and(1)             went(5)
year             year(2) yesterday(2) you(1)        year(3) yesterday(2)       year(5)

Correct recognition
(over entire W-196):  50.3%                         72.1%                      88.3%
with the rightmost states accounting for all remaining frequently-occurring as well as 'spurious' symbols. For the purpose of recognition, the later states have more responsibility for pattern match than the earlier ones. Intuitively, one way to alleviate this problem is to increase the number of states so that it better matches the number of 'physical states' (corresponding to frequent symbols) in the data for a given word. We conducted a second test to study the effects on recognition rate of using the Bakis HMMs with different numbers of states. In particular, we computed the HMM word model for each digit in vocabulary W-10 where the number of states varied from 3 to 20. The results of this experiment are given in Fig. 3, which shows the number of correctly recognized digits (of 100) versus the number of states in the HMM. There is a slow increase in the curve up to model order 10, due to apparently advantageous temporal relationships between the data and the states. Due to the complex interaction between the number of states and the statistical
[Fig. 2. HMM's for the word 'four' trained on five utterances by speaker LE. (a) Left-to-right model. (Note that this model is the most constrained serial model structure and has one less transition from each state than the 'Bakis' serial form used in the main experiments.) (b) Ergodic model. In each case b_jk indicates the probability of generating symbol v_k in state j; b_jk = ε = 0.00005 for all unstated probabilities.

(a) State 1: b1,53 = 0.994. State 2: b2,56 = 0.221, b2,63 = 0.110, b2,64 = 0.663. State 3: b3,2 = 0.043, b3,3 = 0.086, b3,4 = 0.130, b3,5 = 0.086, b3,49 = 0.173, b3,51 = 0.043, b3,52 = 0.130, b3,59 = 0.130, b3,61 = 0.173. State 4: b4,11 = 0.029, b4,17 = 0.029, b4,18 = 0.470, b4,20 = 0.146. State 5: b5,5 = 0.994. State 6: b6,53 = 0.071, b6,12 = 0.639, b6,20 = 0.213, b6,25 = 0.071.

(b) State 1: b1,2 = 0.124, b1,3 = 0.371, b1,53 = 0.124, b1,59 = 0.373. State 2: b2,11 = 0.059, b2,17 = 0.059, b2,18 = 0.588, b2,19 = 0.235, b2,20 = 0.053. State 3: b3,18 = 0.607, b3,19 = 0.386. State 4: b4,49 = 0.235, b4,51 = 0.059, b4,52 = 0.176, b4,56 = 0.117, b4,63 = 0.054, b4,64 = 0.353. State 5: b5,5 = 0.151, b5,20 = 0.536, b5,61 = 0.302. State 6: b6,4 = 0.229, b6,12 = 0.688, b6,25 = 0.076.]
structure of the model, it is difficult to draw definitive conclusions from this study, even for a small vocabulary like W-10. However, this experiment, along with more detailed study of the actual HMMs estimated, showed a clear advantage to using a higher number of states than might ordinarily be used in the normal case. Therefore, we chose the number to be 10 for all the models in the main study.
Using a 10-state Bakis model still does not solve the 'untrained' symbol problem discussed above. As a second measure, therefore, we extended our model into the ergodic structure described above. As a result, for the same vocabularies W-10 and W-196, the overall performances increased to 87 of 100 (87.0%) and 707 of 980 (72.1%) correctly recognized words, respectively. The results are summarized in Tables 1 and 2.
[Fig. 3. Number of correctly recognized utterances of 100 digits vs. number of states (3 to 20) in the HMM. HMM's are Bakis structured and trained on five utterances of the digits by speaker LE.]
It is interesting to revisit Example 1 to examine the effect of the ergodic model.

Example 2. The ergodic HMM for 'four' resulting from LE's data is shown in Fig. 2(b). The same observation sequence as in Example 1 is used in the recognition task. Just prior to the occurrence of symbol v61, the best path search resides in state four to generate symbol v64. When symbol v61 appears, the search jumps to state five to find that symbol. Then when symbol v64 reappears, the state transition matrix permits the return to state four to continue matching the recurrences of the symbol v64.

The unconstrained structure apparently alleviates some of the sparse data problem associated with left-to-right model structures. A tradeoff incurred in its use is that the 'right' observation sequence (corresponding to the data) could be generated from a wrong (word) model with a high probability. Heuristically, this is because the ergodic model potentially produces more observation sequences than the serial models, with the same probability or above. Technically, this is due to the fact that the ergodic model is more likely to converge to an HMM parameterization corresponding to a poor local maximum during the training phase. Of course, the most significant remaining weakness of the ergodic model approach is that it cannot account for data which do not appear in the training set. This, of course, is not the fault of the model structure, but the sparsity of the data.

Having gotten as much performance as possible out of the conventional HMM approach, therefore, we have returned to the problem of needing more data, or needing more reliable data in terms of statistical richness. We have discussed above that neither is feasible. We turn our attention, therefore, to making better use of the available data by introducing preprocessing measures at the signal level.
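The decoding arguments of Examples 1 and 2 can be replayed with a standard log-domain Viterbi decoder such as the sketch below (our own illustration; the interface is an assumption). With a Bakis transition matrix, once the best path leaves state two it cannot return, so every later recurrence of v64 is charged log ε; an ergodic transition matrix lets the path return and avoid that penalty, which is exactly the behavior described in Example 2.

```python
import numpy as np

def viterbi(obs, logA, logB, logpi):
    # Standard log-domain Viterbi decoding of a discrete-symbol HMM.
    # obs: sequence of symbol indices; logA: (N, N) log transition matrix;
    # logB: (N, K) log symbol probabilities; logpi: (N,) log initial probs.
    N, T = logA.shape[0], len(obs)
    d = logpi + logB[:, obs[0]]          # best log-score ending in each state
    psi = np.zeros((T, N), dtype=int)    # back-pointers
    for t in range(1, T):
        trans = d[:, None] + logA        # score of every (i -> j) move
        psi[t] = trans.argmax(axis=0)
        d = trans[psi[t], np.arange(N)] + logB[:, obs[t]]
    state = int(d.argmax())
    path = [state]
    for t in range(T - 1, 0, -1):        # trace the best path backwards
        state = int(psi[t, state])
        path.append(state)
    return path[::-1], float(d.max())
```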
4. Signal preprocessing to improve recognition
Simple 'steady-state' phonemes like vowels are physically the easiest sounds to produce, since they do not require dynamic movement of the vocal system. Conversely, phonetic transitions in speech are more difficult to produce for individuals
135 "I'ABI.[I 3 ('otlebook symbol strings for the initial segments of two utterances of "seven' by speaker l.I:. The top t~o result trtun conventional IIMM procedures and the lower two from the tran,~ition clipping appn~ach described in Scclionr, ,1 and 5. Rov,s 1 and 3 result from the first utterance and rows 2 and 4 lrom the sccontt.
No prt.'proccssing
Transition ("lipping
()m, ct
/~ /
t idt ,~21 ¢,4
I i-l'l-I
I ¢1 I,sl 4q
I'l-I 171"1-1 I~l'l~
I',~1" 4 ¢l'¢~ zl ~4
t t,lt,I
I'/ll )IU211"_~ I
I ttl ttl l,I itl l,
1"2ill 2ltl 201'2ul',tl
I ttl'ttl ttl ltl It
,,
I trl't~l ttl tl
Transition i.I i ~
with articulatory disabilities (and in general) because they require fine muscle control to precisely move the articulators. Many individuals with speech disabilities are not able to consistently and reliably make transitions between two phones due to lack of muscle control. Consequently, it can reasonably be assumed that acoustic transitions in dysarthric speech signals are of much larger variance than stationary regions. For a small amount of training data, it is very difficult to reliably estimate parameters of the HMM with the large variability of transition regions included. It is the purpose of this facet of the paper to discuss a specific approach which reduces variability of dysarthric speech by employing only stationary regions and nominally 'clipping out' transitions in the training and recognition processes.
Decades of research have shown that phonetic transitions in speech generally contain important acoustic cues for human perception and recognition (see, for example, Ref. 11). Although, to the best of our knowledge, there is no similar research for dysarthric speech, it is speculated that transitions of dysarthric speech do not transmit useful information due to large variability. On the contrary, they probably cause the listener to misinterpret messages. To strengthen this hypothesis, we informally studied several utterances of the same word from our primary subject by simply looking at patterns of the VQ index sequences. For example, initial parts of the sequences of VQ indices from two repetitions of the utterance 'seven' are shown in the first two rows of Table 3. These results are typical in that stationary regions of the speech
remain fairly consistent not only within one utterance but also across the utterances of the word. The transitions, however, appear almost 'random'. (Proximity of indices is related to similarity of the symbols.) It is likely that these transition VQ indices are the cause of frequent recognition errors. Therefore, it is reasonable to pursue the idea of 'downweighting' or clipping out of these inconsistent and random regions from the speech in order to decrease the variability. We will refer to this process as transition clipping and demonstrate how to automate this procedure.

In the modified procedure, the process classifies all the LP parameter vectors representing transition regions into one cluster called 'the transition class', corresponding to the codebook symbol v_tr. The LP parameter vectors in stationary regions are clustered normally as a (binary search) 'stationary' codebook. As a result, an HMM word model is trained using a sequence of VQ indices, each of which belongs to either the transition class v_tr or one of the 'stationary' clusters. In order to seek stationary regions of speech and clip out transition regions, the adaptive identification scheme described in subsection 2.2 is used to determine changes of speech dynamics.
TABLE 4

Summary of recognition results for vocabulary W-10

Speaker   CAIDS index   Recognition rate,   Recognition rate, ergodic HMM
                        ergodic HMM         & transition clipping

CJ        65%           90%                 95%
LE        59.3%         87%                 92%
RP        22%           47%                 78%
Here we take advantage of the simple computation of the Itakura distance measure (IDM) given in Eqn. 9 to identify regions of stationarity or transition. The procedure involves the computation of the IDM between the current LP parameter vector and the previous one. A large IDM means that the two consecutive LP filters do not have a similar spectrum. In principle, this happens only in a transition. Consecutive LP models in a stationary region should remain unchanged. Formally, let â(t), t = 1, 2, ..., T, be a sequence of LP parameter vectors which are extracted from an utterance using the adaptive identification procedure. Let d(t) be the IDM between two consecutive LP parameter vectors â(t-1) and â(t), i.e., d(t) = I{â(t-1), â(t)} (see Eqn. 9). A threshold d_TH is preset to determine a distance boundary between stationary and transition regions. If d(t) ≥ d_TH, this means that there is dynamic change between signal samples t-1 and t. If d(t) < d_TH, this means that â(t-1) and â(t) retain the same characteristics and the signal is in a stationary region. During the vector codebook generation, if d(t) ≥ d_TH, then â(t) is not collected into the training vector sequence. During training and recognition, if d(t) ≥ d_TH, then â(t) is considered to be in a transition region and assigned the symbol v_tr. If d(t) < d_TH, then â(t) is classified into one of the clusters in the codebook and assigned one of the stationary symbols. A sequence of VQ symbols results from the process of vector quantization which consists of the transition symbol v_tr and other 'stationary' symbols. Then the normal HMM training procedure is employed to estimate parameters of each word model.
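A sketch of the automated clipping rule follows. This is our own illustration: `dist` and `quantize` are assumed callables implementing the Eqn. 9 measure and the nearest-centroid search respectively, and d_TH = 0.3 is the value reported in Section 5.

```python
def clip_and_quantize(lp_vectors, dist, quantize, d_th=0.3):
    # dist(a_prev, a_cur): Itakura-like measure of Eqn. 9 between consecutive
    # LP vectors; quantize(a): index of the nearest 'stationary' codebook cell.
    V_TR = -1                        # reserved index for the transition symbol
    symbols = []
    for t in range(1, len(lp_vectors)):
        d_t = dist(lp_vectors[t - 1], lp_vectors[t])
        # d(t) >= d_TH marks a dynamic (transition) region; otherwise the
        # frame is stationary and quantized against the codebook.
        symbols.append(V_TR if d_t >= d_th else quantize(lp_vectors[t]))
    return symbols
```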
5. Experiments with the modified procedure

To assess the improvement of the transition clipping HMM procedure and compare results with the normal HMM, we continued our case study on the previous individual, LE, with the same speech data described in Section 3. We divided speech samples into two groups as before. Five repetitions of each word were used to train each model for vocabularies W-10 and W-196. As before, each model had 10 states in a full-structured configuration with one starting state and one end state. β was again set at 0.992, and an effective value of d_TH was experimentally determined to be 0.3. The remainder of the data set for each vocabulary was used to test each model.

Transition clipping was found to improve the overall performance to 92 out of 100 (92%) correctly recognized digits and 865 out of 980 (88.3%) words. This is to be compared with 77% and 50.3% obtained in the experiments without signal preprocessing. Results are summarized in Tables 1 and 2. The performance of recognition with the transition clipping is apparently improved compared with the normal HMM approach. This preprocessing relatively increases statistical information about the regions of the speech which are subject to meaningful characterization and reduces the amount of nonstationary variability. It is of interest to note the effect of signal preprocessing on the VQ strings in the upper two rows of Table 3 for the utterances of the word 'seven', for example. The VQ symbol strings which result from the transition clipping procedure are shown in the lower two rows. Note that the VQ indices have changed because a new codebook is created, but the correspondence between the two sets of indices is easy to see.

The apparent drawback of this approach (at least if it were to be used in the normal case) is the potential ambiguity which could result between phonetically similar words. For example, with the clipping process, the sequences of VQ indices for two examples of utterances of the words 'be' and 'he' are nearly identical. However, a study of the results in Table 2 will show no occurrences of incorrect recognition in the clipping case due to this type of ambiguity which did not already exist in the unmodified HMM experiments. For example, the 'be' incorrectly recognized as 'he' problem existed to a similar degree
in both experiments. Interestingly, the 'he' misrecognized as 'be' problem improved with the clipping approach. Apparently, counter to expectation, the clipping procedure does not increase this anomalous effect *. This, of course, indicates that very little, or even detrimental, 'information' exists in the parts of the symbol string corresponding to dynamic articulation. Not only does decrement due to this ambiguity not occur with the transition clipping, many anomalous matchups of words are also corrected by this approach. For example, the word 'get' is correctly recognized only once in five times with the unmodified HMM, while correctly recognized each time with transition clipping. This finding, along with the 'ambiguity' result described above, provides strong support for the hypothesis that transition acoustics are detrimental to proper recognition of dysarthric speech.
6. Experiments with other speakers

In this section, we extend our study to two other male cerebral palsied individuals **. These persons will be referred to as CJ and RP. They are both adult males and their articulatory skills are clinically considered moderately intelligible, and truly nonverbal, respectively [6]. CJ's and RP's intelligibility scores are 65% and 22%, respectively, as measured by the CAIDS [21] (see Section 3 of this paper). CJ's speech, while slow and difficult to understand, is sufficiently good that he is usually able to use it (albeit inefficiently) for communicating in both professional and personal interactions. RP's speech, however, is completely unintelligible to an untrained listener. We asked each subject to utter the 10 digits (W-10) fifteen times. As above, five repetitions of each digit were used as training data, and the remaining ten repetitions as testing data. Two
* Even if it were to, however, we take a position of willingness to accept this kind of ambiguity due to acoustic similarity, rather than the 'random' matchup between words due to irregularity.
** CJ is Subject No. 1 in Ref. 3.
tests were performed on each subject's data. In one test we applied the normal HMM to each word using an ergodic structure with one starting state and one end state. The other test employed the transition clipping approach with the same constraints on the HMMs. For both tests, the number of states is 10. Table 4 summarizes the number of correctly recognized digits for all three speakers, including LE from the work above. It is clear that the rate of correct recognition is related to intelligibility of the speech, as would be expected. These results also suggest that even with completely unintelligible speech, this recognition procedure will be able to extract useful information from speech for automated recognition where the human listener fails.
7. Further discussion and conclusions

Recognition of the speech of severely dysarthric individuals requires a technique which is robust to extreme variability and very little training data. The HMM, used unconventionally with respect to the 'normal' case, provides a promising framework for this difficult recognition task. The main differences with respect to its conventional use are an ergodic model with an increased number of states, and the use of an automated 'transition clipping' procedure in the codebook construction, training and recognition phases. This latter signal preprocessing measure is facilitated by a novel adaptive sequential LP identification algorithm. The finding of significant recognition improvement when dynamic regions of dysarthric speech are removed provides support for the reasonable hypothesis that such acoustics are not contributory to human recognition of these individuals' speech. This point provides interesting grounds for perceptual studies with human listeners.

A main motivation for this work is the development of augmentative communication aids for the severely speech and motor disabled. Accordingly, it is important to point out possible problem areas in these methods to guide other researchers working on this problem. We began this paper by discussing the extraordinary variability
of severely dysarthric speech. Indeed, we have found this variability to be a perplexing problem throughout this work. While the recognition percentages reported here are encouraging, it is true that the results are highly sensitive to many factors including, for example, the setting of the threshold d_TH, and the settings used in the endpoint detection algorithm. The threshold d_TH is likely to be speaker-dependent because of differing articulatory behaviors. The latter issue (endpoint detection) is part of a general problem of setting energy thresholds with this highly variable speech. We used a method similar to the one described in Ref. 18 and found it difficult to avoid the occasional omission of an initial or final weak fricative, for example. Similarly, it is apparent that at least some of what we are calling 'acoustic transitions' are really abnormal pauses in the speech which might be omitted if reliable energy detection can be implemented. The ability to detect these unnatural 'transition'/pause regions using an alternative to the 'parameter derivative' method suggested here would make it much less compelling to use the recursive LP parameter solution described in subsection 2.2, thereby solving a practical problem: that of expensive computational requirements of the present system. The recursive LP analysis requires O(M²) floating point operations per incoming datum, where M represents the order of the predictor. This prohibits implementation in real time on a typical personal computer (without an expensive supplementary digital signal processing board) which would be the desirable target machine for such a speech recognition system. (Some work on making the HMM search more computationally efficient is found in Ref. 5.) Indeed, there is nothing to prevent the computation of the LP parameters using a 'batch' technique on appropriately spaced frames (see, for example, Ref. 15), but then one is faced with the expensive Itakura distance computation at each frame which is circumvented in the present scheme. Of course, other parameter sets could and should be tried in the recognition strategy. The mel-cepstral parameters, for example, have been found to produce slightly superior recognition results to those obtained with LP parameters in HMM strategies [1],
and could easily be substituted here. The use of the mel-cepstrum would have the advantage that its computation and the computation of an associated parameter derivative would likely be amenable to a more efficient (potentially real-time) computational algorithm without the use of special-purpose processors. Finally, whereas this study has shown promise that a 'stand-alone' speech recognition system is possible for certain individuals, it is just as important that dysarthric speech has been found to be a source of useful information which can be used in the guidance of a message search procedure (requiring other inputs) in an assistive device. Where speech is not reliably recognizable to a sufficient degree, it still has the potential to be a useful input, offering the device user the satisfaction and freedom to use this natural medium of communication.
Appendix

Vocabulary W-196

a about addressed adult adulthood after age alcoholism all almost always american an and another answer asked away bad be becomes been between bicycle bills blurred books both boy boys breath burning but calculus call came can cannot child childhood children cold comes confused could courts did different divorce does doesn't done don't drink each emerging enough every existed explore father find first five following for from full gauge get giving go got gradually grows had happens hardly has have he head help her his home hot how if in is it its just knew know knows landmark law learn legal letter line little low lower may might minors more most movies muscle must nearly need never new nice not notion now nylons obvious of often old older on open opinion or our over pair part paycheck pays perhaps president previous problem question ran rattle right said same say says series shaky shave six sometimes sounds still strange take taking tell tends that the their they things think thinking this those time to today topic trying television usually vote wants warm was ways went were what when where who why with year yesterday you your
Acknowledgements

This work was supported in part by the Whitaker Foundation and by the National Institutes of Health of the United States under Grant No. R03-R02030-01A1. The authors wish to thank the persons who participated in the experiments for their dedication, enthusiasm and encouragement.
References

[1] S.B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust., Speech, Signal Process., 28 (1980) 357-366.
[2] J.R. Deller and D. Hsu, An alternative sequential regression algorithm and its application to the recognition of cerebral palsy speech, IEEE Trans. Circuits Syst. (Special issue on adaptive systems), CAS-34 (1987) 782-787.
[3] J.R. Deller, D. Hsu and L.J. Ferrier, Encouraging results in the automated recognition of cerebral palsy speech, IEEE Trans. Biomed. Eng., 35 (1988) 218-220.
[4] J.R. Deller and G.P. Picaché, Advantages of a Givens rotation approach to temporally recursive linear prediction analysis of speech, IEEE Trans. Acoust., Speech, Signal Process., 37 (1989) 429-431.
[5] J.R. Deller and R.K. Snider, 'Quantized' hidden Markov models for efficient recognition of cerebral palsy speech, Proc. 1990 Int. Symp. Circuits Syst., New Orleans, Vol. 3, pp. 2041-2044, 1990.
[6] L.J. Ferrier, Informal clinical evaluation by L. Ferrier, 1987.
[7] G.D. Forney, The Viterbi algorithm, Proc. IEEE, 61 (1973) 268-278.
[8] C.D. Gibler, Linguistic and human performance considerations in the design of an anticipatory communications aid (Ph.D. dissertation), Northwestern University, Evanston, IL, 1981.
[9] G.H. Golub and C.F. Van Loan, Matrix Computations (Johns Hopkins University Press, Baltimore, MD, 1983) Sects. 4.1, 6.1 & 6.3.
[10] D. Hsu, Computer recognition of nonverbal speech using hidden Markov modelling (Ph.D. dissertation), Northeastern University, Boston, 1988.
[11] S.R. Hyde, Automatic speech recognition: A critical survey and discussion of the literature, in E.E. David, P.B. Denes (eds.), Human Communication: A Unified View (McGraw-Hill, New York, 1972) pp. 399-438. Also reprinted in: N.R. Dixon, T.B. Martin (eds.), Automatic Speech and Speaker Recognition (IEEE Press, New York, 1979) pp. 10-55.
[12] F. Itakura, Minimum prediction residual principle applied to speech recognition, IEEE Trans. Acoust., Speech, Signal Process., 23 (1975) 67-72.
[13] S.E. Levinson, Structural methods in automatic speech recognition, Proc. IEEE, 73 (1985) 1625-1650.
[14] T.C. Luk and J.R. Deller, A nonclassical WRLS algorithm, Proc. 23rd Annu. Allerton Conf., Champaign, IL, pp. 732-741, 1985.
[15] J. Makhoul, Linear prediction: A tutorial review, Proc. IEEE, 63 (1975) 561-580.
[16] J. Makhoul, S. Roucos and H. Gish, Vector quantization in speech coding, Proc. IEEE, 73 (1985) 1551-1588.
[17] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77 (1989) 257-286.
[18] L.R. Rabiner and M.R. Sambur, A speaker-independent digit recognition system, Bell Syst. Tech. J., 54 (1975) 297-315.
[19] L.R. Rabiner, S.E. Levinson and M.M. Sondhi, On the application of vector quantization and hidden Markov models to speaker-independent word recognition, Bell Syst. Tech. J., 62 (1983) 1075-1105.
[20] B.K. Sy and J.R. Deller, An AI-based communication system for motor and speech disabled persons: Design methodology and prototype testing, IEEE Trans. Biomed. Eng. (Special issue on applications of artificial intelligence in medicine), 36 (1989).
[21] K.M. Yorkston, D.R. Beukelman and C. Traynor, Computerized Assessment of Intelligibility of Dysarthric Speech (C.C. Publications, Tigard, OR, 1984).