Speech Communication 54 (2012) 430–444 www.elsevier.com/locate/specom
Using articulatory likelihoods in the recognition of dysarthric speech
Frank Rudzicz
University of Toronto, Department of Computer Science, Toronto, Canada
Received 30 August 2011; received in revised form 18 October 2011; accepted 19 October 2011; available online 25 October 2011
Abstract
Millions of individuals have congenital or acquired neuro-motor conditions that limit control of their muscles, including those that manipulate the vocal tract. These conditions, collectively called dysarthria, result in speech that is very difficult to understand both by human listeners and by traditional automatic speech recognition (ASR), which in some cases can be rendered completely unusable. In this work we first introduce a new method for acoustic-to-articulatory inversion which estimates positions of the vocal tract given acoustics using a nonlinear Hammerstein system. This is accomplished based on the theory of task-dynamics using the TORGO database of dysarthric articulation. Our approach uses adaptive kernel canonical correlation analysis and is found to be significantly more accurate than mixture density networks, at or above the 95% level of confidence for most vocal tract variables. Next, we introduce a new method for ASR in which acoustic-based hypotheses are re-evaluated according to the likelihoods of their articulatory realizations in task-dynamics. This approach incorporates high-level, long-term aspects of speech production and is found to be significantly more accurate than hidden Markov models, dynamic Bayesian networks, and switching Kalman filters.
© 2011 Elsevier B.V. All rights reserved.
Keywords: Dysarthria; Speech recognition; Acoustic-articulatory inversion; Task-dynamics
1. Introduction Dysarthria is a set of motor speech disorders that disrupt the normal control of the vocal tract musculature. This disruption is caused by neurological trauma or degeneration of the motor control system (e.g., damage to the recurrent laryngeal nerve typically reduces control over vocal fold vibration, often resulting in guttural speech). Dysarthria is characterized by a combination of poor control over respiration, phonation, prosody, and articulatory movement, all of which can severely negatively impact the intelligibility of speech. Cerebral palsy is the most common congenital cause of dysarthria, affecting millions of children in North America who carry it into adulthood (Rudzicz, 2011). Although there have been several attempts to improve speech recognition for dysarthric speakers, and other attempts to integrate articulatory knowledge into speech recognition, these efforts have not until recently converged.
E-mail address: [email protected]
doi:10.1016/j.specom.2011.10.006
In each case, successes have been tempered by the relatively unconstrained nature of the underlying statistical methods. Several fundamental phenomena of dysarthria such as increased disfluency, longer sonorants, and reduced pitch control (Rudzicz et al., 2008) cannot be readily represented in methods that rely upon short-time observations. Such models cannot inherently account for more complex aspects of articulatory organization, for which parallel and self-organizing theories may be more appropriate (Smith and Goffman, 2004). For example, co-articulatory effects are to at least some degree caused by concurrent demands placed on the vocal tract (e.g., tongue retraction and lip rounding) and a multidimensional hidden state may therefore be more representative of the biological reality, to the extent that these demands can be isolated. Moreover, if these demands can be cast within optimal control theory, then perturbation effects can be explicitly represented, depending on the specific model used. In order to study the long-term dynamics of dysarthria in particular, and speech generally, we require an ASR framework based on dynamical systems.
This paper introduces the application of task-dynamics to speech recognition. The theory is introduced in Section 2 and the empirical data are described in Section 3. Section 4 describes a new method of acoustic-to-articulatory inversion and Section 5 describes the use of that method in the correction of errors made by a traditional speech recognition system.

1.1. Recognition of dysarthric speech

Early work in ASR for individuals with dysarthria almost exclusively involved small-vocabulary hidden Markov models (HMMs) whose parameters were trained to the general population. Typically, word-recognition rates were between 26.2% and 81.8% lower for dysarthric speakers than for the general population, depending on the severity of disablement (Rudzicz, 2011). For example, Rudzicz (2007) described a scenario in which a traditional HMM baseline could recognize no more than 4% of the words uttered by a severely dysarthric speaker with cerebral palsy while a non-dysarthric speaker obtained up to 89% word-recognition accuracy. Despite this disparity, speech recognition can actually improve communication accuracy and speed for physically disabled individuals relative to other modes of input (e.g., scan-and-switch, specialized keyboards) (Havstam et al., 2003; Hawley et al., 2007), especially for those with more moderate disablements.

The unintelligibility of dysarthria is not due to any single phenomenon, but to the combination of many articulatory behaviours that can have unique consequences for automatically recognizing dysarthric speech. For example, muscle fatigue (particularly of the tongue) coupled with slower or spastic speech may non-linearly alter the dynamics of speech within a single utterance. Disfluency can often lead to phonemic insertion errors in or around words containing voiceless plosives or voiceless fricatives (Raghavendra et al., 2001) and atypical pauses may lead to erroneous end-of-speech estimation (Rosen and Yampolsky, 2000). To deal with some of these problems, Polur and Miller (2006) showed that ergodic HMMs allowing for 'backwards' transitions could capture aspects of dysarthric speech such as involuntary repetition and disruptions during sonorants and revealed small improvements over a traditional baseline. Raghavendra et al. (2001) compared a speaker-adaptive phoneme recognizer and a speaker-dependent word recognizer on dysarthric speech and concluded that adaptation is more appropriate for mild or moderate dysarthria, with empirical relative error reduction (RER) of 22%, but that severely dysarthric speakers are better served by speaker-dependent models, with 47% RER over the baseline. Morales and Cox (2009) improved word-error rates by approximately 5% for severely dysarthric speech and approximately 3% for moderately dysarthric speech by building weighted transducers into an ASR system according to observed phonetic confusion matrices. These metamodels were similar to those used by Matsumasa et al. (2009), except that they also involved a
language model, albeit one based on the highly restricted Nemours database of syntactically invariant utterances (Menendez-Pidal et al., 1996). Although adaptation at the acoustic level has led to some increase in accuracy for speakers with dysarthria, there remains room for improvement. Despite the origins of dysarthria in the mechanisms of speech production, relatively little work with this population has focused on physiological models that could directly inform otherwise hidden parameters of speech.

1.2. Articulatory knowledge in speech recognition

Explicit use of articulatory knowledge is rare in ASR despite evidence that it is far more speaker-invariant and less ambiguous than the resulting acoustics (King et al., 2007). For example, the nasal sonorants /m/, /n/, and /ng/ are acoustically similar but uniquely and consistently involve either bilabial closure, tongue-tip elevation, or tongue-blade elevation, respectively. The identification of linguistic intention would, in some cases, become almost trivial given access to the articulatory goals of the speaker.

There have been several attempts to build theoretical production knowledge directly into models for speech recognition. Sun and Deng (2002), for example, annotated words with parallel asynchronous variables representing the lips (closed or rounded), tongue dorsum, velum, and larynx (which could represent vocalization or aspiration; Sun and Deng, 2002). The manners with which these words could be constructed given these annotations were encoded within HMM transition networks with high-level linguistic constraints such as phrase boundaries, morphemes, syllables, and stress. This augmentation could explicitly model co-articulation and phonetic reduction while using fewer parameters than other HMM approaches (Lee et al., 2001). Results were somewhat modest, however, improving over the baseline triphone accuracy of 70.86% by just 2.09% (absolute) on the TIMIT database. However, this feature-rich approach has the advantage of requiring as little as 10% of the training data required by the baseline. Richardson et al. (2000) similarly reduced the size of a state-transition network by placing constraints on articulator velocities and continuity. Their approach reduced word-error rates relative to the state-of-the-art at the time by between 28% and 35%. Appending empirical articulatory measurements to acoustic observations has been shown to reduce phone error by up to 17% (relative) in a standard HMM system; however, if those articulatory measurements were inferred from acoustics, this improvement disappeared (Wrench and Richmond, 2000). Systems that learn discrete articulatory features with neural networks from acoustics and incorporate these into HMMs have shown some improvement over acoustic-only baselines (Fukuda et al., 2003; Kirchhoff, 1999). However, these results were not always statistically significant, especially in the presence of extreme environmental noise (King et al., 2007). Similarly, Metze (2007) showed that
incorporating discrete articulatory features learned with maximum mutual information into HMMs could reduce word-error rates from 25% to 19.8% on spontaneous scheduling tasks. No common representation has yet emerged as standard, although the theoretical benefits of using articulatory knowledge are not contested. A commonality among previous work is its use of non-dysarthric data (Livescu et al., 2007). 2. Background – task-dynamics The neural interaction between linguistic and motor capabilities is complex. In psycholinguistic theory, the linguistic hierarchy is often decomposed into conceptual, syntactic, morphological and phonological representations independent from the motor system through which these aspects are realized (Levelt et al., 1999). Articulatory phonology bridges the gap between phonetics and phonology by encapsulating them as the physical (constraining) and abstract (planning) stages of a single system (Goldstein and Fowler, 2003). Articulatory phonology has also been directly applied to the study of speech disorders such as apraxia (Bahr, 2005), which is believed to affect the exact part of the neurological interface that articulatory phonology describes (Dogil and Mayer, 1998). Task-dynamics is a combined model of skilled articulator motion and the planning of abstract vocal tract configurations (Saltzman, 1986). Here, the dynamic patterns of speech are the result of overlapping gestures, which are high-level abstractions of reconfigurations of the vocal tract. Similarly, the quantal theory of speech is based on the empirical observation that acoustics depend on a relatively discrete set of distinctive underlying articulatory configurations (Stevens and Keyser, 2010). An instance of a gesture is any combination of articulatory movements towards the completion of some speech-relevant goal, such as bilabial closure, or velar opening. The progenitors of this theory claim that all the implicit spatiotemporal behaviour underlying speech is the result of the interaction between the abstract intergestural dimension (between tasks) and the geometric interarticulator dimension (between physical actuators) (Saltzman and Munhall, 1989). Each gesture in task-dynamic theory occurs within one of nine tract variables (TVs): lip aperture (LA), lip protrusion (LP), tongue tip constriction location (TTCL) and degree (TTCD),1 tongue blade constriction location (TBCL) and degree (TBCD),2 velum (VEL), glottis (GLO), and lower tooth height (LTH). For instance, a gesture to close the lips would occur within the LA variable and would set that variable close to zero, as shown in
Footnote 1: Constriction locations generally refer to the front–back (anterior–posterior) dimension of the vocal tract and constriction degrees generally refer to the top–down (superior–inferior) dimension.
Footnote 2: Variables TBCL and TBCD are alternatively called tongue dorsum constriction location (TDCL) and degree (TDCD) in the literature.
Fig. 1. Lip aperture (LA) over time for all instances of phoneme /m/ in the MOCHA database (see Section 3).
repetitions of /m/ in Fig. 1, where the relevant articulatory goal of lip closure is evident. The dynamic influence of each gesture in time on the relevant tract variable is modeled by the non-homogeneous second-order linear differential equation (Saltzman and Munhall, 1989)

$$M z'' + B z' + K(z - z^*) = 0, \qquad (1)$$
where z is a 9-dimensional vector of the instantaneous positions of each tract variable, and z′ and z″ are its first and second time derivatives. Here, M, B, and K are diagonal matrices representing mass, damping, and stiffness coefficients, respectively, and z* is the 9-dimensional vector of target (equilibrium) positions. This model is built on the assumption that tract variables are independent and do not interact dynamically, although it could easily be adjusted to reflect dependencies, if desired (Nam and Saltzman, 2003). If the targets z* of this equation are known, the identification of linguistic intent becomes possible. For example, given that a bilabial closure occurs simultaneously with a velar opening and glottal vibration, we can identify the intended phone as /m/. This represents a substantial reduction in dimensionality.
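As a concrete illustration of Eq. (1), the following sketch integrates a single critically damped gesture for one tract variable; the mass, stiffness, target, and time step are illustrative placeholders rather than values taken from TADA or from the TORGO data.

```python
import numpy as np

def simulate_gesture(z0, z_target, K=200.0, M=1.0, dt=0.001, steps=300):
    """Integrate M z'' + B z' + K (z - z*) = 0 for one tract variable.

    B is chosen for critical damping (B = 2 sqrt(M K)), as assumed by
    task-dynamics; z0 is the initial position and z_target is z*.
    """
    B = 2.0 * np.sqrt(M * K)          # critical damping
    z, v = z0, 0.0                    # position and velocity
    trajectory = []
    for _ in range(steps):
        a = -(B * v + K * (z - z_target)) / M   # z'' from Eq. (1)
        v += a * dt                             # simple Euler integration
        z += v * dt
        trajectory.append(z)
    return np.array(trajectory)

# A lip-closure gesture: normalized LA driven from 0.8 toward a target near 0,
# qualitatively similar to the /m/ trajectories in Fig. 1.
la = simulate_gesture(z0=0.8, z_target=0.05)
```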
3. Materials

All data used in this work come from two sources: the University of Edinburgh's MOCHA database and the University of Toronto's TORGO database. Each of these databases involves electromagnetic articulography (EMA) to track the positions and velocities of point-sensors affixed to the articulators to within 1 mm of error (Yunusova et al., 2009).

Edinburgh's MOCHA database consists of 460 sentences uttered by both a male and a female British speaker without dysarthria (Wrench, 1999) where acoustics are temporally aligned with EMA (recorded at 500 Hz), laryngography (at 16 kHz), and electropalatography (at 200 Hz). The EMA data consist of eight bivariate articulatory measurements, namely the upper lip (UL), lower lip (LL), upper incisor (UI), lower incisor (LI), tongue tip (TT), tongue blade (TB, 1 cm from the tongue tip), tongue dorsum (TD, 1 cm from the tongue blade; see footnote 3), and velum (V). Each parameter is measured in the midsagittal plane.

The TORGO database includes data from eight dysarthric participants (5 male, 3 female), covering a range of intelligibility levels. Seven dysarthric participants with cerebral palsy (spastic, athetoid, and ataxic) were recruited from the Holland–Bloorview Kids Rehab hospital and the Ontario Federation for Cerebral Palsy, and the remaining participant with dysarthria was recruited from the Amyotrophic Lateral Sclerosis Society of Canada in Toronto (Rudzicz et al., 2008). These individuals were matched according to age and gender with non-dysarthric subjects from the general population. Each participant recorded 3 hours of data. The EMA data in TORGO are collected using the three-dimensional AG500 system (Carstens Medizinelektronik GmbH, Lenglern, Germany) with fully automated calibration (van Lieshout et al., 2008). Sensors are attached to three points on the surface of the tongue, as in MOCHA. A sensor for tracking jaw movements (JA) is attached to a custom mold that fits the surface of the lower incisors as described by van Lieshout and Moussa (2000) and replicates the LI sensor in MOCHA. Four additional coils are placed on the upper and lower lips (UL and LL) and the left and right corners of the mouth (LM and RM). Further coils are placed on the subject's forehead, nose bridge, and behind each ear above the mastoid bone for reference. Except for the left and right mouth corners, all sensors that measure the vocal tract lie generally on the midsagittal plane on which much of the relevant motion of speech takes place.

The motor functions and intelligibility level of each dysarthric participant in TORGO are assessed according to the standardized Frenchay Dysarthria Assessment (Enderby, 1983) by a speech-language pathologist. Prompts include short words (e.g., the international radio alphabet, 360 words from the word intelligibility section of the Yorkston–Beukelman Assessment of Intelligibility of Dysarthric Speech; Yorkston and Beukelman, 1981), and sentences (e.g., 460 sentences from the MOCHA database).

There are a number of features which differentiate dysarthric and non-dysarthric speech in TORGO. For example, plosives are mispronounced most often, with substitution errors chiefly caused by errant voicing (e.g., /d/ for /t/). Dysarthric speakers are also far more likely to delete affricates and plosives, almost all of which are alveolar, in word-final positions. Furthermore, all vowels produced by dysarthric speakers are significantly slower than their non-dysarthric counterparts at the 95% level of confidence, often up to twice as long.

In order to convert the acoustic space to a space of task-dynamics, we transform the midsagittal articulatory data using sigmoid-normalized principal component analysis.
Footnote 3: The literature of task-dynamics often refers to the 'tongue dorsum' as a point closer to the tongue tip.
For example, we describe VEL by calculating the first principal component of velum motion in the midsagittal plane, finding the minimum and maximum deviations from the mean in this transformed space, and applying a sigmoid to that unidimensional space to retrieve a real-valued function on [0, 1]. Measurements of the velum occur only in the MOCHA database here. Similarly, the first and second principal components of the distance between UL and LL are used for the determination of lip aperture and protrusion, respectively, the first and second principal components of TT are used for the determination of TTCL and TTCD, respectively, and the first and second principal components of TB are used for the determination of TBCL and TBCD, respectively. Voicing detection on energy below 150 Hz is used to estimate the GLO tract variable.
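The conversion from midsagittal positions to normalized tract variables described above can be sketched as follows. This is a schematic reading of the procedure (centre the data, project onto the first principal component, and squash onto [0, 1] with a sigmoid scaled by the extreme deviations from the mean); the sigmoid gain is an assumption, not a value reported here.

```python
import numpy as np

def tract_variable_from_ema(points_2d):
    """Map midsagittal EMA measurements (n_frames x 2) for one sensor, or for a
    sensor difference such as (LL - UL) for lip aperture, onto a [0, 1] tract
    variable via sigmoid-normalized PCA."""
    X = points_2d - points_2d.mean(axis=0)
    # First principal component via SVD.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ vt[0]                                   # unidimensional projection
    # Scale so the observed extreme deviations map near 0 and 1.
    span = max(abs(scores.min()), abs(scores.max())) + 1e-8
    return 1.0 / (1.0 + np.exp(-4.0 * scores / span))    # 4.0 is an assumed gain

# Toy usage with synthetic stand-ins for frame-wise (LL - UL) offsets.
rng = np.random.default_rng(0)
lip_gap = rng.normal(size=(500, 2))
la = tract_variable_from_ema(lip_gap)                    # values in (0, 1)
```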
4. Estimation of task-dynamics from acoustics

Despite the one-to-many relationship in acoustic-to-articulatory inversion (Ananthakrishnan et al., 2009), this apparent obstacle has not limited research in this area. For example, Richmond et al. (2003) estimated the 2-dimensional midsagittal positions of 7 articulators given kinematic data using both a multi-layer perceptron and discriminatively trained Gaussian mixture models to within 0.41 mm and 2.73 mm. Toda et al. (2008) achieved almost identical results on the same data by applying expectation-maximization using both minimum mean-squared error and maximum likelihood estimation to a Gaussian mixture mapping function with low-pass filtering. Simpler approaches achieved similar results (errors less than 2 mm, typically around 1 mm) using simple vector quantization with an appropriate number of vectors (Hogden et al., 2007).

One commonality in existing work is that the target dimensions consist of the absolute physical positions of points in the vocal tract. Despite the popularity of this approach, neither its generalizability among speakers nor its representation of linguistic intent has been fully justified. Why would the physical position of the upper lip be as explicative of intent or of acoustic consequence as a measure of the distance between the lips, for example? We are particularly interested in the long-term dynamics of the vocal articulators, rather than in near-instantaneous inversion; there are various physical constraints (such as the maximum oscillating frequency of the jaw) that cannot be exploited if the windows of observation are on the order of 10 ms, for example. For this reason, we choose to explore an approach in which the windows of analysis are longer in time than those typically used by methods based on neural networks or Gaussian mixture modeling (Toda et al., 2008; Özbek et al., 2011), for example. Therefore, in this section we use adaptive kernel canonical correlation analysis (KCCA) to estimate task-dynamics features of the vocal tract from acoustics. The KCCA method is encapsulated within a Hammerstein system which allows us to flexibly and directly learn a nonlinear transform and dynamical system in a combined step, rather than in a staggered process. The Hammerstein system is, in a way, the inverse of the Wiener system, in which a linear transform of the input is followed by a nonlinear one. We use the former rather than the latter since the training process inverts the second transformation in a manner that could not then be used in a feed-forward system for the type of multivariate data used here. This is based on our previous work (Rudzicz, 2010).

4.1. Adaptive KCCA

Canonical correlation analysis (CCA) is a popular technique in communication and statistical signal processing that measures linear relationships between sets of variables. Given vectors x ∈ R^{m_x} and y ∈ R^{m_y}, CCA finds a pair of directions ω_x ∈ R^{m_x} and ω_y ∈ R^{m_y} such that the correlation ρ(x, y) is maximized between the two projections ω_x^T x and ω_y^T y. Given joint observations X = [x_1 x_2 . . . x_N]^T and Y = [y_1 y_2 . . . y_N]^T, where x_i co-occurs with y_i, CCA is equivalent to finding projection vectors ω_x and ω_y that maximize

$$\rho(X, Y; \omega_x, \omega_y) = \frac{\omega_x^T X Y^T \omega_y}{\sqrt{\omega_x^T X X^T \omega_x}\,\sqrt{\omega_y^T Y Y^T \omega_y}}. \qquad (2)$$

Although this method can find good linear relationships, it is incapable of capturing nonlinear relationships, which limits its application in many aspects of speech. We therefore employ the "kernel trick" in which a nonlinear transformation Φ of the data obtains a higher-dimensional feature space (e.g., X̂ = Φ(X)). It is in this higher dimension that the categories are expected to be more linearly separable, and the solution of CCA within this space is equivalent to a nonlinear solution in the original space (Lai and Fyfe, 2000). We can avoid the need to explicitly define Φ, however, since positive definite kernel functions κ(x, y) satisfying Mercer's condition can implicitly map their input to higher-dimensional spaces. We specify a set of such kernels in Section 4.2.

Reformulating Eq. (2) within a framework of least-squares regression allows us to minimize ½‖Xω_x − Yω_y‖² such that ½(‖Xω_x‖ + ‖Yω_y‖) = 1. This allows us to solve the following generalized eigenvalue problem on the transformed data X̂ ∈ R^{N×m′_x} and Ŷ ∈ R^{N×m′_y} by the method of Lagrange multipliers:

$$\frac{1}{2}\begin{bmatrix} \hat{X}^T\hat{X} & \hat{X}^T\hat{Y} \\ \hat{Y}^T\hat{X} & \hat{Y}^T\hat{Y} \end{bmatrix}\hat{\omega} = \beta \begin{bmatrix} \hat{X}^T\hat{X} & 0 \\ 0 & \hat{Y}^T\hat{Y} \end{bmatrix}\hat{\omega}, \qquad (3)$$

where ω̂ = [ω̂_x ω̂_y]^T is the concatenation of the transformed direction vectors and β is the Lagrange multiplier. We can now avoid explicit data transformation by applying a kernel function. Since the kernel matrix describing our transformed data, K_x = X̂X̂^T ∈ R^{N×N}, has elements K_x[i, j] = κ(x_i, x_j) defined by vectors in our original data space (K_y is defined similarly for Ŷ), we left-multiply Eq. (3) by diag(X̂, Ŷ), giving

$$\frac{1}{2}\begin{bmatrix} K_x^2 & K_x K_y \\ K_y K_x & K_y^2 \end{bmatrix}\alpha = \beta \begin{bmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{bmatrix}\alpha. \qquad (4)$$

Here, α = [α_x α_y] ∈ R^{2N} such that ω̂_x = X̂^T α_x and ω̂_y = Ŷ^T α_y (Vaerenbergh et al., 2006). This gives a generalized eigenvalue problem in the higher-dimensional space where we can minimize (K_x α_x + K_y α_y)/2 by adjusting α_x and α_y according to our original data space (Vaerenbergh et al., 2008).

4.1.1. KCCA and Hammerstein systems

A nonlinear Hammerstein system is a memoryless nonlinear function g(·) followed by a linear dynamic system H(·) in series, as shown in Fig. 2(a). Our goal is to input each sequence of L acoustic observations, X = [x_1 x_2 . . . x_L]^T, of Mel-frequency cepstral coefficients (MFCCs) to this system and to infer the associated sequence of articulation vectors, Λ = [λ_1 λ_2 . . . λ_L]^T, in task-dynamics. In order to accomplish this accurately, we must learn the parameters of the two components of the Hammerstein system. A mechanism for identifying these parameters has recently been proposed that takes advantage of the cascade structure by inverting the linear component, as in Fig. 2(b), and minimizing the difference, e[n], between g(X[n]) and H^{−1}(Λ[n]) using KCCA (Aschbacher and Rupp, 2005). Since H(·) is linear, we can reformulate Eq. (4) to

$$\frac{1}{2}\begin{bmatrix} K_x^2 & K_x \Lambda \\ \Lambda^T K_x & \Lambda^T \Lambda \end{bmatrix}\begin{bmatrix} \alpha_x \\ \omega_\Lambda \end{bmatrix} = \beta \begin{bmatrix} K_x(K_x + cI) & 0 \\ 0 & \Lambda^T \Lambda \end{bmatrix}\begin{bmatrix} \alpha_x \\ \omega_\Lambda \end{bmatrix}, \qquad (5)$$

where we add a regularizing constant c to prevent over-fitting (Aschbacher and Rupp, 2005). Here, ω_Λ provides the parameters of the linear part of the system, H(·)^{−1}, and α_x provides the parameters of the nonlinear part, g(·). Given a combined average of the output of these two systems, r = (r_x + r_Λ)/2 = (K_x α_x + Λω_Λ)/2, the eigenvalue problem decomposes to two coupled least squares problems:

$$\beta\alpha_x = (K_x + cI)^{-1} r, \qquad \beta\omega_\Lambda = (\Lambda^T \Lambda)^{-1}\Lambda^T r. \qquad (6)$$
Fig. 2. The feedforward Hammerstein system and its associated identification system.
This representation allows us to minimize a Euclidean error measurement ‖r_x − r_Λ‖ by analytically solving for α_x and ω_Λ. In order to estimate articulation at run time, we compute r_x = K_x α_x, since we can construct the kernel matrix from observed acoustics, and then solve for Λ = K_x α_x ω_Λ^{−1}, since Λω_Λ = r_Λ ≈ r_x = K_x α_x.
4.1.2. Adaptive algorithm

Unfortunately, for problems involving large amounts of data, as is typical in speech, the sizes of the kernel matrices described above become prohibitively large. An online algorithm that iteratively adjusts the estimates of α_x and ω_Λ based on subsequent segments of data is therefore desirable. We assume that we have a sliding context window covering L aligned frames from each data source, namely, x^(n) = [x_n, x_{n−1}, . . . , x_{n−L+1}] and Λ^(n) = [λ_n, λ_{n−1}, . . . , λ_{n−L+1}]. For example, if EMA data are recorded at 500 Hz and acoustics at 16 kHz, then a 10 ms analysis window aligns the resulting MFCC frame with the middle vector of the 5 articulatory observations during that time. A common denominator of the acoustic and articulatory sampling rates simplifies computation of the analysis window. Assuming that we have matrix K_reg^(n−1) for the (n−1)th window of speech, and K̂_reg^(n−1) is the matrix formed by its last n − 1 rows and columns, then the regularized matrix for the current window is

$$K_{reg}^{(n)} = \begin{bmatrix} \hat{K}_{reg}^{(n-1)} & k_{n-1}(x^{(n)}) \\ k_{n-1}(x^{(n)})^T & k_{nn} + c \end{bmatrix}, \qquad (7)$$

where κ is the selected positive definite kernel function,

$$k_{n-1}(x^{(n)}) = [\kappa(x^{(n-L+1)}, x^{(n)}), \ldots, \kappa(x^{(n-1)}, x^{(n)})]^T,$$

and k_nn = κ(x^(n), x^(n)). The inverse of K_reg^(n) can also be computed quickly, given the inverse of K_reg^(n−1) (Vaerenbergh et al., 2006). We then iteratively update our parameter estimates for ω_Λ and α_x as new data arrive using Eq. (6). This entire process is summarized in Algorithm 1 and is based on work on Wiener systems by Vaerenbergh et al. (2006).
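A simplified, unwindowed sketch of this adaptive procedure is given below: the kernel matrix is grown one observation at a time as in Eq. (7), and the coupled least-squares updates of Eq. (6) are iterated to a fixed point. The RBF kernel, the regularizer value, and the toy dimensions are assumptions for illustration; the efficient inverse update and the MFCC/tract-variable front end are omitted.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=0.5):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2))

def grow_kernel_matrix(K_reg, X_win, x_new, c=1e-3, kernel=rbf_kernel):
    """Append one acoustic observation to the regularized kernel matrix, Eq. (7)."""
    k_vec = np.array([kernel(x_old, x_new) for x_old in X_win])
    k_nn = kernel(x_new, x_new) + c
    top = np.hstack([K_reg, k_vec[:, None]])
    bottom = np.hstack([k_vec, [k_nn]])
    return np.vstack([top, bottom])

def kcca_hammerstein_step(K_reg, Lam, alpha_x, omega):
    """One pass over the coupled least-squares problems of Eq. (6).

    r is the combined average of the two branch outputs; the eigenvalue beta is
    absorbed by renormalizing r, which fixes the arbitrary scale of the solution.
    """
    r = 0.5 * (K_reg @ alpha_x + Lam @ omega)
    r /= np.linalg.norm(r) + 1e-12
    alpha_x = np.linalg.solve(K_reg, r)                  # (K_x + cI)^{-1} r
    omega, *_ = np.linalg.lstsq(Lam, r, rcond=None)      # (Lam^T Lam)^{-1} Lam^T r
    return alpha_x, omega

# Toy usage: 20 MFCC-like frames (5-dim) paired with 3 tract variables.
rng = np.random.default_rng(0)
X_win = list(rng.normal(size=(20, 5)))
Lam = rng.normal(size=(20, 3))
K_reg = np.array([[rbf_kernel(a, b) for b in X_win] for a in X_win]) + 1e-3 * np.eye(20)
alpha_x, omega = rng.normal(size=20), rng.normal(size=3)
for _ in range(10):
    alpha_x, omega = kcca_hammerstein_step(K_reg, Lam, alpha_x, omega)
```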
4.2. Experiments

Our experiments evaluate the stability of the error-correction method and the estimation of tract variables from acoustics. We apply four of the most popular kernel functions, namely the homogeneous polynomial (K_hpoly^(i)), the non-homogeneous polynomial (K_nhpoly^(i)), the radial-basis function (K_rbf^(σ)), and the sigmoid (K_sigmoid^(κ,c)) kernels:

$$\begin{aligned} K^{(i)}_{hpoly}(x_1, x_2) &= (x_1 \cdot x_2)^i \\ K^{(i)}_{nhpoly}(x_1, x_2) &= (x_1 \cdot x_2 + 1)^i \\ K^{(\sigma)}_{rbf}(x_1, x_2) &= \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right) \\ K^{(\kappa, c)}_{sigmoid}(x_1, x_2) &= \tanh(\kappa\, x_1 \cdot x_2 + c). \end{aligned}$$

Training and testing data are split according to source. In each of the following experiments we use tract variables and aligned acoustics (42-dimensional MFCCs) selected from the sentences uttered by the male speaker (non-dysarthric) from Edinburgh's MOCHA database (Wrench, 1999). In Sections 4.2.2 and 4.2.3 we additionally use 100 utterances spoken in common between data from the two most severely dysarthric males and their two non-dysarthric male counterparts from the TORGO database. MOCHA data uniquely includes velum data and non-dysarthric data includes the tongue body. Where comparisons are made between sources, those tract variables not shared in common are not included. Results reported below are averages of 10-fold cross validation. Until otherwise indicated, the window length L = 150 is used, as it was determined empirically to provide the highest accuracy on a restricted subset of the data.

4.2.1. Stability and convergence during training

The goal of auto-correction is for the Euclidean error (K_x α_x − Λω_Λ) (i.e., e[n] in Fig. 2(b)) to approach zero during training. Fig. 3 shows the best, average, and worst mean squared errors in decibels during training given the
homogeneous polynomial kernel and 10 random initial parameterizations. This example is indicative of all other kernels, whereby a period of fluctuation tends to follow a rapid decrease in error. Table 1 shows the total decrease in mean squared error (dB) between the first 20 and last 20 windows of the adaptive KCCA training process. As one increases the order of both the homogeneous and non-homogeneous kernels, the MSE reduction decreases. In both the tan-sigmoid and radial-basis function kernels, however, our choice of parameters seems to have little discernible effect. Indeed, although the polynomial kernels with low order offer the most reduction in MSE, the radial basis function appears more consistent, so there is no clear optimum. Vaerenbergh et al. applied a nearly identical approach to learning Wiener systems on the comparatively simple problem of estimating a hyperbolic tangent function given univariate input (Vaerenbergh et al., 2006; Vaerenbergh et al., 2008), reaching MSE between −30 dB and −40 dB within 1000 to 1500 iterations. Surprisingly, most of the error in our experiments is dispelled much earlier, within 200 iterations, with MSE fluctuating between −76.9 dB and −39.5 dB thereafter across all kernels and parameterizations.

Fig. 3. Normalization error, e[n], for the first-order homogeneous polynomial kernel at window size L = 150 with MOCHA data. [Plot of error (dB) against iteration, showing the upper bound, lower bound, and average across trials.]

Table 1. Total reduction in MSE (dB) between Hammerstein components during training across kernels and parameterizations with MOCHA data.

Homogeneous polynomial          Non-homogeneous polynomial
i    MSE reduction              i    MSE reduction
1    421.6                      1    441.9
2    403.6                      2    413.1
3    394.5                      3    382.9

Sigmoid                         Radial-basis function
(κ, c)       MSE reduction      σ    MSE reduction
(0.2, 0.1)   313.2              0.1  406.5
(0.2, 0.5)   321.5              0.5  410.4
(0.5, 0.1)   309.7              1.0  406.7
(0.5, 0.5)   314.3

4.2.2. KCCA versus mixture density networks

In order to judge the accuracy of the articulatory estimates produced by adaptive KCCA against the state-of-the-art, we consider mixture density neural networks (MDNs) that output parameters of Gaussian mixture probability distributions, as described by Richmond et al. (2003). We train MDNs to estimate the likelihood of tract variable positions given MFCC input and 2 frames of surrounding acoustic context. Fig. 4 shows an example of the estimated likelihood of tract variable positions over time produced by a trained MDN as an intensity map superimposed with the true trajectory for the male speaker from MOCHA. We train one set of classifiers for each of the two non-dysarthric speakers selected from TORGO and the speaker from MOCHA. MDNs are trained on the same data as KCCA. Articulatory estimates for KCCA are smoothed with third-order median filters.
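Given the mixture parameters emitted by an MDN for one frame, the log likelihood of the true tract-variable position (the quantity summarized in Table 2 below) can be computed as sketched here; the parameter layout is an assumed convention rather than the one used in the original implementation, and the network itself is not shown.

```python
import numpy as np

def mdn_log_likelihood(position, weights, means, variances):
    """Log p(position) under the Gaussian mixture emitted by an MDN.

    weights, means, variances are 1-D arrays with one entry per mixture
    component (four components are used in the experiments above).
    """
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi * variances)
                - 0.5 * (position - means) ** 2 / variances)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())   # log-sum-exp over components

# e.g., four components for one frame of TTCD (illustrative numbers only):
ll = mdn_log_likelihood(0.42,
                        weights=np.array([0.1, 0.2, 0.3, 0.4]),
                        means=np.array([0.35, 0.40, 0.45, 0.55]),
                        variances=np.array([0.01, 0.02, 0.02, 0.05]))
```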
Fig. 4. Example intensity map of Gaussian mixtures produced by a mixture density network trained to estimate the tongue tip constriction degree. Darker sections represent higher probability. The true trajectory is superimposed as a black curve.
We assess the accuracy of the MDN and KCCA methods by comparing their estimates of the log likelihood of the true articulatory trajectories. A more accurate method will assign a higher probability to the actual trajectory. The likelihood of a frame of articulation is easily computed by MDNs whose output defines a probability distribution over tract variable positions. In these experiments, we empirically choose the number of Gaussian components to be four. We approximate the likelihood of a frame of articulation in the KCCA approach with the radial-basis kernel by fitting a Gaussian to the estimates of 10 trials having different initial parameterizations. Test data in each trial consists of approximately 60 utterances from the male speaker in MOCHA and 15 utterances from each of the two speakers in TORGO. The mean and variance of the log likelihoods of true articulatory positions across all test frames are summarized in Table 2 for both methods. According to the t test with 9.6E4 < n1 = n2 < 9.9E4 frames and one degree of freedom, among the MOCHA data KCCA is significantly more accurate than the MDN method at the 95% confidence level for VEL, LA, LP, TTCL, and TBCL and at the 99% confidence level for GLO, and statistically indistinguishable at these levels for the remaining tract variables. By comparison, the average performance with both non-dysarthric speakers selected from TORGO shows similar distributions despite being trained with approximately 1/4 as much data as the MOCHA models. For these speakers, the KCCA method more accurately predicts articulatory motion for all tract variables except LA. An ad hoc experiment in which all non-dysarthric data from TORGO were pooled together, rather than segregated by speaker, showed no significant difference at the 95% level of confidence from the averages reported in Table 2. In this case, the doubling of the available training
Table 2. Average log likelihoods of true tract variable positions in test data, under distributions produced by mixture density networks (MDNs) and the KCCA method, with variances, μ(σ²).

        MOCHA                          TORGO (avg.)
TV      MDN μ(σ²)      KCCA μ(σ²)      MDN μ(σ²)      KCCA μ(σ²)
VEL     −0.28 (0.08)   −0.23 (0.07)    N/A            N/A
LTH     −0.18 (0.12)   −0.18 (0.14)    −0.20 (0.11)   −0.19 (0.09)
LA      −0.32 (0.11)   −0.28 (0.10)    −0.33 (0.18)   −0.34 (0.12)
LP      −0.44 (0.12)   −0.41 (0.13)    −0.46 (0.13)   −0.43 (0.11)
GLO     −1.30 (0.16)   −1.14 (0.15)    −1.36 (0.21)   −1.28 (0.19)
TTCD    −1.60 (0.17)   −1.60 (0.17)    −1.67 (0.25)   −1.65 (0.23)
TTCL    −1.62 (0.17)   −1.57 (0.16)    −1.68 (0.23)   −1.67 (0.22)
TBCD    −0.79 (0.14)   −0.80 (0.15)    −0.83 (0.16)   −0.81 (0.15)
TBCL    −0.20 (0.11)   −0.18 (0.09)    −0.21 (0.12)   −0.19 (0.10)
data seems to have been offset by the increase in entropy caused by the differences in articulation between speakers.
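The KCCA likelihoods reported above are approximated by fitting a single Gaussian to the estimates from several differently initialized trials; a minimal sketch of that per-frame evaluation follows, with the variance floor and the example values being assumptions.

```python
import numpy as np

def kcca_frame_log_likelihood(true_position, trial_estimates, var_floor=1e-4):
    """Fit a Gaussian to the KCCA estimates of one tract variable at one frame
    (one estimate per random initialization) and score the true position."""
    mu = np.mean(trial_estimates)
    var = max(np.var(trial_estimates), var_floor)   # guard against collapse
    return -0.5 * (np.log(2.0 * np.pi * var) + (true_position - mu) ** 2 / var)

# e.g., ten trials for one frame of LA (illustrative numbers only):
ll = kcca_frame_log_likelihood(0.31, np.array([0.30, 0.29, 0.33, 0.31, 0.32,
                                               0.28, 0.30, 0.34, 0.31, 0.29]))
```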
4.2.3. KCCA approach with dysarthric speech

We estimate the position of the vocal tract for severely dysarthric speakers using the methods in Section 4.2.2. Figs. 5(a) and (b) show the average log likelihoods of control speakers and dysarthric speakers for each of the MDN and adaptive KCCA methods, respectively. In this experiment we exclude velum and tongue body tract variables, as associated observations do not exist in the EMA data for the dysarthric speakers. This experiment also normalizes training data, as all speakers utter a common set of phrases. As before, the adaptive KCCA method better matches our articulatory observations, when compared with the MDN method, although the differences are less significant. In both cases, articulatory data from dysarthric speakers are more difficult to estimate, and a paired t-test reveals that the distributions on dysarthric speech are significantly different from those on non-dysarthric speech at the 99% level of confidence.

Fig. 5. Average log likelihoods of true tract variable positions in test data, under distributions produced by mixture density networks (MDNs) and the KCCA method, with variances, for both control and dysarthric speakers in TORGO.

5. Correcting errors in ASR with articulatory dynamics

This section describes an integration of acoustic-articulatory inversion into an ASR system using task-dynamics for word recognition. Experiments involve dysarthric and non-dysarthric speakers based on our previous work (Rudzicz, 2010).

5.1. Task-dynamic automatic speech recognition

Our goal is to integrate task-dynamics within an ASR system for continuous sentences. Our approach, called TD-ASR, is to re-rank N-best lists of sentence hypotheses according to weighted likelihoods of their continuous-valued articulatory realizations. For example, if a word sequence W_i : w_{i,1} w_{i,2} . . . w_{i,m} has acoustic likelihood L_X(W_i) and articulatory likelihood L_K(W_i), then its overall score is

$$L(W_i) = \alpha L_X(W_i) + (1 - \alpha) L_K(W_i), \qquad (8)$$

given a weighting parameter α set manually, as in Section 5.2.2. Acoustic likelihoods L_X(W_i) are obtained from Viterbi paths through relevant HMMs in the standard fashion. The mechanism for producing the articulatory likelihood is shown in Fig. 6. In that diagram, the top path indicates the derivation of the tract variable trajectories of the best hypotheses for some acoustic input and the bottom path indicates the estimation of the probability field in which those trajectories are evaluated. During recognition, a standard acoustic HMM (or other model taking only acoustic input) produces word sequence hypotheses, W_i, and associated likelihoods, L_X(W_i), for i = 1, . . . , N. The expected canonical motion of the tract variables, TV_i, is then produced by TADA for each of these word sequences and transformed by a switching Kalman filter to better match speaker data, giving TV′_i. These components are described in greater detail in subsequent sections.
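The re-ranking in Eq. (8) amounts to a weighted combination of two scores over an N-best list. A minimal sketch follows, assuming both scores are log likelihoods on comparable scales (as Viterbi scores typically are); the hypothesis strings and score values are illustrative only.

```python
from typing import List, Tuple

def rerank_nbest(hypotheses: List[Tuple[str, float, float]],
                 alpha: float = 0.7) -> List[Tuple[str, float]]:
    """Re-rank an N-best list by L(W) = alpha * L_X(W) + (1 - alpha) * L_K(W).

    Each hypothesis is (word_sequence, acoustic_score, articulatory_score);
    scores are assumed to be log likelihoods on comparable scales.
    """
    scored = [(words, alpha * l_x + (1.0 - alpha) * l_k)
              for words, l_x, l_k in hypotheses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Example with N = 4 hypotheses (illustrative scores only):
nbest = [("the cat sat", -120.3, -45.1), ("the cap sat", -119.8, -52.7),
         ("a cat sat", -121.0, -44.9), ("the cat sap", -122.4, -60.2)]
best_words, best_score = rerank_nbest(nbest, alpha=0.7)[0]
```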
Fig. 6. The TD-ASR system using articulatory likelihoods, L_K(W_i), to re-rank each word sequence W_i produced by standard acoustic techniques.
In order to estimate the articulatory likelihood of an utterance, we need to evaluate each transformed articulatory sequence, TV′_i, within probability distributions ranging over all tract variables. These distributions can be inferred using acoustic-articulatory inversion with mixture density networks as described in Section 4.2.2. The MDN, rather than the KCCA approach, is the default mechanism except in Section 5.4, merely for the MDN's relatively parsimonious use of computational resources, which is relevant in real-time ASR. Our networks are trained with acoustic and EMA-derived data from TORGO and MOCHA. The likelihoods of these paths are then evaluated within probability distributions produced by a mixture density network. The overall likelihood (Eq. (8)) is then used to produce a final hypothesis list for the given acoustic input.

5.1.1. Baseline speech recognition

We consider two baseline systems that each alternatively form the first component of the TD-ASR system. The first is an acoustic hidden Markov model augmented with a bigram word language model, as shown in Fig. 7(a). Here, word transition probabilities are learned by maximum likelihood estimation and phoneme transition probabilities are
explicitly ordered according to the Carnegie Mellon pronunciation dictionary. Each phoneme conditions the sub-phoneme state whose transition probabilities describe the dynamics within phonemes. The second baseline model is the articulatory dynamic Bayes network (DBN). This augments the standard acoustic HMM by replacing hidden indices with discrete observations of the vocal tract, K_t, as shown in Fig. 7(b). The patterns of acoustics within each phoneme are dependent on a relatively restricted set of possible articulatory configurations. To find these discrete positions, we obtain k vectors that best describe the articulatory data according to k-means clustering with the sum-of-squares error function. During training, the DBN variable K_t is set to the index of the mean vector nearest to the current frame of EMA data at time t. In this way, the relationship K_t → O_t allows us to learn how quantized articulatory configurations affect acoustics. The training of DBNs involves a specialized version of expectation-maximization, as described by Murphy (2002). During inference, variables representing words, phonemes, and articulation become hidden and we marginalize over their possible values when computing their likelihoods. Bigrams are computed by maximum likelihood on lexical annotations in the training data.
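During training, K_t is therefore the index of the nearest of k mean vectors. The sketch below computes such a codebook with a plain Lloyd's-algorithm k-means under the sum-of-squares criterion; |K| = 16 is one of the settings compared in Table 3, and the implementation details are illustrative rather than those used for the DBN baseline.

```python
import numpy as np

def kmeans_quantize(frames, k=16, iters=50, seed=0):
    """Cluster EMA-derived frames (n_frames x n_dims, float) with Lloyd's
    algorithm; the returned labels play the role of the DBN variable K_t."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size=k, replace=False)].copy()
    labels = np.zeros(len(frames), dtype=int)
    for _ in range(iters):
        # Assign each frame to its nearest mean (sum-of-squares criterion).
        d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster empties.
        for j in range(k):
            members = frames[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook, labels
```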
Fig. 7. Baseline (a) acoustic hidden Markov model and (b) articulatory dynamic Bayes network. Node Wt represents the current word, Pht is the current phoneme, Qt is the hidden state, Ot is the acoustic observation, Mt is the Gaussian mixture component, and Kt is the discretized articulatory configuration. Filled nodes represent variables that are observed during training, although only Ot is observed during recognition. All variables are discrete except for Ot.
5.1.2. The TADA component

In order to obtain articulatory likelihoods, L_K(W_i), for each word sequence, we first generate articulatory realizations of those sequences according to task-dynamics. To this end, we use components from the open-source TADA system (Nam and Goldstein, 2006), which is a complete implementation of task-dynamics. From this toolbox, we use the following components:

- A syllabic dictionary supplemented with the International Speech Lexicon Dictionary (Hasegawa-Johnson and Fleck, 2007). This breaks word sequences W_i into syllable sequences S_i consisting of onsets, nuclei, and codas, and covers all of MOCHA and TORGO.
- A syllable-to-gesture lookup table. Given a syllabic sequence, S_i, this table provides the gestural goals necessary to produce those syllables. For example, given the syllable pub, this table provides the targets for the GLO, VEL, TBCL, and TBCD tract variables, and the parameters for the second-order differential equation, Eq. (1), that achieves those goals. These parameters have been empirically tuned by the authors of TADA according to a generic, speaker-independent representation of the vocal tract (Saltzman and Munhall, 1989).
- A component that produces the continuous tract variable paths that realize an utterance. This component takes into account various physiological aspects of human speech production, including intergestural and interarticulator co-ordination and timing (Nam and Saltzman, 2003; Goldstein and Fowler, 2003), and the neutral ("schwa") forces of the vocal tract (Saltzman and Munhall, 1989). This component takes a sequence of gestural goals predicted by the syllable-to-gesture lookup table, and produces appropriate paths for each tract variable.

The result of the TADA component is a set of N articulatory paths, TV_i, necessary to produce the associated word sequences, W_i for i = 1, . . . , N. Since task-dynamics is a prescriptive model and fully deterministic, TV_i sequences are the canonical or default articulatory realizations of the associated sentences. These canonical realizations are independent of our training data, so we transform them in order to more closely resemble the observed articulatory
behaviour in our EMA data. Towards this end, we train a switching Kalman filter as described in Section 5.1.3, where the hidden state variable x_t is the observed instantaneous canonical TVs predicted by TADA. In this way we are explicitly learning a relationship between TADA's task-dynamics and human data. Since the lengths of these sequences are generally unequal, we align the articulatory behaviour predicted by TADA with training data from MOCHA and TORGO using standard dynamic time warping (Sakoe and Chiba, 1978). During run-time, the articulatory sequence y_t most likely to have been produced by the human data given the canonical sequence TV_i is inferred by the Viterbi algorithm through the SKF model with all other variables hidden. The result is a set of articulatory sequences, TV′_i, for i = 1, . . . , N, that represent the predictions of task-dynamics that better resemble our data.

5.1.3. Transformation with the articulatory switching Kalman filter

In the following experiments we use a switching Kalman filter that estimates the relationship between canonical abstract tract variables and those observed in data. The abstract ('true') state of the tract variables at time t − 1 constitutes a vector of continuous values, x_{t−1}. Under task-dynamics, the motions of these tract variables obey critically damped second-order oscillatory relationships. We start with the simplifying assumption of linear dynamics here, with allowances for random Gaussian process noise, v_t, with variance σ_{v_t}, since articulatory behaviour is non-deterministic. Moreover, we know that EMA recordings are subject to some error (usually less than 1 mm; Yunusova et al., 2009), so the actual observation at time t, y_t, will not in general be the true position of the articulators. Assuming that the relationship between y_t and x_t is also linear, and that the measurement noise, w_t, is also Gaussian with variance σ_{w_t}, then the dynamical articulatory system can be described by

$$x_t = D_t x_{t-1} + v_t, \qquad y_t = C_t x_t + w_t. \qquad (9)$$
Eq. (9) forms the basis of the Kalman filter, in which we use EMA-derived tract variable measurements directly, rather than quantized abstractions as in the DBN model.
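A minimal sketch of one predict/update step for the linear-Gaussian model in Eq. (9) is given below, for a single (non-switching) filter; the switching model replicates one such filter per phoneme and selects D_t, C_t, and the noise covariances according to the active discrete state. All matrices here are placeholders to be estimated from data.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y_t, D, C, Q, R):
    """One predict/update step for x_t = D x_{t-1} + v_t, y_t = C x_t + w_t.

    Q and R are the covariances of the process noise v_t and measurement
    noise w_t; P is the posterior covariance of the hidden tract variables.
    """
    # Predict.
    x_pred = D @ x_prev
    P_pred = D @ P_prev @ D.T + Q
    # Update with the EMA-derived observation y_t.
    S = C @ P_pred @ C.T + R                       # innovation covariance
    K_gain = P_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
    x_post = x_pred + K_gain @ (y_t - C @ x_pred)
    P_post = (np.eye(len(x_prev)) - K_gain @ C) @ P_pred
    return x_post, P_post
```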
Table 3. Phoneme- and word-error rates (PER and WER) for different parameterizations of the baseline HMM and DBN systems.

System   Parameters   PER (%)   WER (%)
HMM      |M| = 4      29.3      14.5
         |M| = 8      27.0      13.9
         |M| = 16     26.1      10.2
         |M| = 32     25.6      9.7
DBN      |K| = 4      26.1      13.0
         |K| = 8      25.2      11.3
         |K| = 16     24.9      9.8
         |K| = 32     24.8      9.4

Since articulatory dynamics vary significantly for different goals,
we replicate Eq. (9) for each phoneme and connect these continuous Kalman filters together with discrete conditioning variables for phoneme (of which there are 45) and word (of which there are 1856), resulting in the switching Kalman filter (SKF) model. Here, parameters D_t and v_t are implicit in the relationship x_t → x_{t+1}, and parameters C_t and w_t are implicit in x_t → y_t. Each of these parameters depends on the hidden discrete state and switches with the state. In this model, observation y_t is the instantaneous tract variables derived from EMA, and x_t is their canonical hidden states. We train the SKF model with a specialized expectation-maximization over its parameters, assuming that the conditioning state is S_t at time t and that it has Markovian dynamics with state transition matrix Z(S_{t−1}, S_t), initial state distribution π_1 (sequences are 1-indexed here), mean vectors μ_t, and covariances Σ_t. The complete log likelihood of all training data (of length T) in the SKF model is

$$\begin{aligned} \mathcal{L}_{SKF} &= \log P(x_{1:T}, S_{1:T}, y_{1:T}) \\ &= -\frac{1}{2}\sum_{t=1}^{T} [y_t - C_t x_t]^T \sigma_{w_t}^{-1} [y_t - C_t x_t] - \frac{1}{2}\sum_{t=1}^{T} \log\|\sigma_{w_t}\| \\ &\quad - \frac{1}{2}\sum_{t=2}^{T} [x_t - D_t x_{t-1}]^T \sigma_{v_t}^{-1} [x_t - D_t x_{t-1}] - \frac{1}{2}\sum_{t=2}^{T} \log\|\sigma_{v_t}\| \\ &\quad - \frac{1}{2}[x_1 - \mu_1]^T \Sigma_1^{-1}[x_1 - \mu_1] - \frac{1}{2}\log\|\Sigma_1\| - \frac{T(n+m)}{2}\log 2\pi \\ &\quad + \log \pi_1 + \sum_{t=2}^{T} \log Z(S_{t-1}, S_t). \end{aligned}$$
Further details are described by Murphy (1998) and Deng et al. (2005).

5.2. Experiments – non-dysarthric speakers

Experimental data are obtained from two sources, as described in Section 3. We procure 1200 sentences from among the non-dysarthric speakers in Toronto's TORGO database, and 896 from Edinburgh's MOCHA. In total, there are 460 unique sentence forms, 1092 unique word forms, and 11065 words uttered. Except where noted, all experiments randomly split the data into 90% training and 10% testing sets for 5-fold cross-validation. MOCHA and TORGO data are never combined due to differing EMA sampling rates. In all cases, models are database-dependent (i.e., all TORGO data is conflated, as is all of MOCHA). For each of our baseline systems, we calculate the phoneme-error-rate (PER) and word-error-rate (WER) after training. The phoneme-error-rate is calculated as the proportion of frames of speech assigned to an incorrect phoneme. The word-error-rate is calculated as the sum of insertion, deletion, and substitution errors in the highest-ranked hypothesis divided by the total number of words in the correct orthography.
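The word-error-rate defined here is the usual Levenshtein alignment between hypothesis and reference; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """(insertions + deletions + substitutions) / len(reference), with both
    inputs given as lists of words."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("the cat sat".split(), "the cap sat".split()) == 1/3
```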
The traditional HMM is compared by varying the number of Gaussians used in the modelling of acoustic observations. Similarly, the DBN model is compared by varying the number of discrete quantizations of articulatory configurations. Results are obtained by direct decoding. The average results across both databases, between which there are no significant differences, are shown in Table 3. In all cases the DBN model outperforms the HMM, which highlights the benefit of explicitly conditioning acoustic observations on articulatory causes.

5.2.1. Efficacy of TD-ASR components

In order to evaluate the whole system, we start by evaluating its parts. First, we test how accurately the mixture-density network (MDN) estimates the position of the articulators given only information from the acoustics available during recognition. Table 4 shows the average log likelihood over each tract variable across non-dysarthric speakers in both databases. These results are consistent with the state-of-the-art (Toda et al., 2008). In the following experiments, we use MDNs that produce 4 Gaussians.

Table 4. Average log likelihood of true tract variable positions in test data, under distributions produced by mixture density networks with varying numbers of Gaussians.

         No. of Gaussians
TV       1        2        3        4
LTH      −0.28    −0.18    −0.15    −0.11
LA       −0.36    −0.32    −0.30    −0.29
LP       −0.46    −0.44    −0.43    −0.43
GLO      −1.48    −1.30    −1.29    −1.25
TTCD     −1.79    −1.60    −1.51    −1.47
TTCL     −1.81    −1.62    −1.53    −1.49
TBCD     −0.88    −0.79    −0.75    −0.72
TBCL     −0.22    −0.20    −0.18    −0.17

We evaluate how closely transformations to the canonical tract variables predicted by TADA match the data. Namely, we input the known orthography for each test utterance into TADA, obtain the predicted canonical tract variables TV, and transform these according to our trained SKF. The resulting predicted and transformed sequences are aligned with our measurements derived from EMA with dynamic time warping. Finally, we measure the average difference between the observed data and the predicted (canonical and transformed) tract variables. Table 5 shows these differences according to the phonological manner of articulation. In all cases the transformed tract variable motion is more accurate, and significantly so at the 95% confidence level for nasal and retroflex phonemes, and at 99% for fricatives. The practical utility of the transformation component is evaluated in its effect on recognition rates, below.

Table 5. Average difference between predicted tract variables and observed data, on a [0, 1] scale.

Manner         Canonical   Transformed
Approximant    0.19        0.16
Fricative      0.37        0.29
Nasal*         0.24        0.18
Retroflex      0.23        0.19
Plosive        0.10        0.08
Vowel          0.27        0.25

* Nasals are evaluated only with MOCHA data, since TORGO data lacks velum measurements.

5.2.2. Recognition with TD-ASR

We combine the components of TD-ASR and study the resulting composite system. Fig. 8(a) shows the WER as a
function of α with TD-ASR and N = 4 hypotheses per utterance on non-dysarthric data. Recall that the overall likelihood of a word sequence hypothesis W is L(W) = αL_X(W) + (1 − α)L_K(W) (higher α signifies higher weight to the acoustic likelihood L_X relative to the articulatory likelihood L_K). The effect of α is clearly non-monotonic, with articulatory information clearly proving useful. Although systems whose rankings are weighted solely by the articulatory component perform better than the exclusively acoustic systems, the lists available to the former are procured from standard acoustic ASR. Interestingly, the gap between systems trained to the two databases increases as α approaches 1.0. Although this gap is not significant, it may be the result of increased inter-speaker articulatory variation in the TORGO database.
Fig. 8(b) shows the WER obtained with TD-ASR given varying-length N-best lists and α = 0.7. TD-ASR accuracy at N = 4 is significantly better than both TD-ASR at N = 2 and the baseline approaches of Table 3 at the 95% confidence level. However, for N > 4 there is a noticeable and systematic worsening of performance. The optimal parameterization of the TD-ASR model results in an average word-error-rate of 8.43%, which represents a 10.3% relative error reduction over the best parameterization of our baseline models. Finally, the experiments of Figs. 8(a) and (b) are repeated with the canonical tract variables passed untransformed to the probability maps generated by the MDNs. Predictably, the resulting articulatory likelihoods L_K are less representative, and increasing their contribution to the hypothesis re-ranking does not improve TD-ASR performance significantly, and in some instances worsens it. Although TADA is a useful prescriptive model of generic articulation, its use must be tempered with knowledge of inter-speaker variability.

5.3. Experiments – dysarthric speakers

We repeat the experiments of Section 5.2.2 with data from the two severely dysarthric male speakers studied in Section 4. HMM and DBN models are trained uniquely with each speaker. Baseline word error rates with these speakers are 51.6% and 32.3% with the HMM and 43.2% and 27.8% with
Fig. 8. Word-error-rates given non-dysarthric data from TORGO and MOCHA by (a) varying α, and (b) varying lengths of N-best lists. Word-error-rates including dysarthric data from TORGO by (c) varying α, and (d) varying lengths of N-best lists.
Fig. 9. Word-error-rates (WER, in %) according to the number of samples (random initial parameterizations) used in the KCCA replacement of the MDN component of TD-ASR.
the DBN model (average of 35.5% WER). Fig. 8(c) shows average WER with TD-ASR and N = 3 as a function of α for both dysarthric and non-dysarthric data from TORGO. Fig. 8(d) shows WER as a function of N with α = 0.4. While TD-ASR with α = 0.4 and N = 3 makes fewer errors (34.1%) than the DBN baseline (35.1%), various other parameterizations perform relatively poorly and do not significantly reduce the gap between dysarthric and non-dysarthric (control) speakers on this data. A number of speculations as to this result are given in Section 6.

5.4. Experiments – inversion with KCCA

Finally, we replace the MDN component in TD-ASR with a component based on KCCA. This experiment follows naturally from the apparent advantages of KCCA over MDN in Sections 4.2.2 and 4.2.3. As in those experiments, the KCCA approach approximates the distribution over possible tract variable positions with a single Gaussian fitted over the real-valued outputs of Hammerstein systems trained adaptively with different initial random parameterizations. Fig. 9 shows the word-error-rates of this KCCA-based TD-ASR system according to the number of samples (random initial parameterizations) of the KCCA component for both non-dysarthric and dysarthric subjects in the TORGO database. The best WER with control data is 8.6%, slightly higher than 8.42% with the MDN. Similarly, the best WER with dysarthric data is 34.1%, which is identical to the best WER obtained with the MDN. In each case, more samples used in the KCCA production resulted in lower WER. Although the MDN and KCCA methods result in systems that perform similarly, use of the former appears to be somewhat favourable. As described in Section 5.2.1, the MDNs produce mixtures of 4 Gaussians, which may account for more complex vocal patterns than single Gaussians. For each of a random selection of 1000 frames in both non-dysarthric and dysarthric data, each pair from among the 4 Gaussians was compared with the Kullback–Leibler divergence and with the paired t-test, and in all cases at least
one pair of Gaussians was not significantly different from one another. We additionally use MDNs that produce single Gaussians as output (see Section 5.2.1) in TD-ASR. Here, we parameterize TD-ASR with N = 4 and α = 0.7 for non-dysarthric speakers and N = 3 and α = 0.4 for dysarthric speakers. With these settings, non-dysarthric speakers have an average WER of 8.8% and dysarthric speakers have an average WER of 34.7%. In each case, these results are higher than with the KCCA component, but not appreciably so. This implies that, despite the clear advantage in accuracy of the adaptive KCCA method over the MDN in Section 4.2, that advantage does not necessarily translate to more accurate ASR. Moreover, the KCCA method is not designed to produce probability distributions as the MDN is, and generating such distributions is particularly time-consuming.

6. Discussion

We have demonstrated that the use of direct articulatory knowledge can substantially reduce word errors in speech recognition, especially if that knowledge is motivated by high-level abstractions of vocal tract behaviour. Task-dynamics provides a coherent and biologically plausible model of speech production with consequences for phonology (Browman and Goldstein, 1986), neurolinguistics, and the evolution of speech and language (Goldstein et al., 2006). We have shown that it is also applicable within speech recognition. We have overcome a conceptual impediment in integrating task-dynamics and ASR, which is the former's deterministic nature. This integration is accomplished by stochastically transforming predicted articulatory dynamics and by calculating the likelihoods of these dynamics according to speaker data.

There are several new avenues for exploration. For example, task-dynamics lends itself to more general applications of control theory, including automated self-correction, rhythm, co-ordination, and segmentation (Friedland, 2005). Other high-level questions also remain, such as whether discrete gestures are the correct biological and practical paradigm, whether a purely continuous representation would be more appropriate, and whether this approach generalizes to other languages.

In general, our experiments have revealed very little difference between the use of MOCHA and TORGO EMA data. An ad hoc analysis of some of the errors produced by the TD-ASR system found no particular difference between how systems trained to each of these databases recognized nasal phonemes, although only those trained with MOCHA considered velum motion. Other errors common to both sources of data include phoneme insertion errors which appear to co-occur with some spurious motion of the tongue between segments, especially for longer N-best lists. Despite the slow motion of the articulators relative to the acoustics, there remains some intermittent noise.
Our experiments with dysarthric speech in Section 5.3 demonstrate that TD-ASR can be made more accurate than a dynamic Bayes network trained with articulatory data. However, several parameterizations of TD-ASR do not improve over the baseline for speakers with dysarthria. Figs. 8(c) and (d) demonstrate that, while improvements with TD-ASR are possible for dysarthric speakers as they are for non-dysarthric speakers, a wide gap remains between the expected performance of these two groups. Part of this gap might be due to assumptions of non-dysarthric speech production in the parameterization of TADA. Current work involves adjusting the spring-mass parameters of the TADA component according to dysarthric data in the TORGO database. This is based on previous work with non-dysarthric articulatory data (Reimer and Rudzicz, 2010) in which we used principal differential analysis (Ramsay and Silverman, 2005) to optimize the parameters of Eq. (1) for which multiple noisy samples were available.

Some high-level questions remain. For example, it is possible that a quantized representation of task-dynamics may be more amenable to the training of speech recognition systems than continuous estimated articulatory trajectories. A non-adaptive kernel-based system for acoustic-articulatory inversion that is constrained by discrete categories has been proposed (Zheng et al., 2006). Likewise, a k-means clustering of the tract variable motion estimated by the adaptive KCCA process might be applicable as conditioning variables in dynamic Bayes networks for speech classification (see the sketch below).

Although demonstrably more accurate than the mixture-density network on the task of acoustic-articulatory inversion, replacing that model with adaptive KCCA in the TD-ASR system did not yield similarly promising results. Since KCCA is inherently semi-analytical (non-statistical), it should not be surprising that it is not as felicitous within a statistical framework as the neural network model.
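As a hedged sketch of the quantization idea mentioned above: estimated tract-variable trajectories could be clustered frame-by-frame with k-means, and the resulting cluster indices used as discrete conditioning variables. The array shapes, codebook size, and names below are hypothetical; this illustrates the general approach rather than an implemented component of TD-ASR.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical estimated tract-variable trajectories: one matrix per utterance,
# with one row per frame and one column per tract variable (e.g., 8 TVs).
rng = np.random.default_rng(1)
utterances = [rng.normal(size=(n_frames, 8)) for n_frames in (120, 95, 150)]

# Pool all frames and learn a small codebook of articulatory configurations.
all_frames = np.vstack(utterances)
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(all_frames)

# Each utterance becomes a sequence of discrete symbols (cluster indices),
# which could then serve as observed conditioning variables in a DBN.
discrete_sequences = [codebook.predict(u) for u in utterances]
print(discrete_sequences[0][:20])
```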
References

Ananthakrishnan, G., Neiberg, D., Engwall, O., 2009. In search of non-uniqueness in the acoustic-to-articulatory mapping. In: Proceedings of Interspeech 2009, Brighton, UK.
Aschbacher, E., Rupp, M., 2005. Robustness analysis of a gradient identification method for a nonlinear Wiener system. In: Proceedings of the 13th Statistical Signal Processing Workshop (SSP), Bordeaux, France.
Bahr, R.H., 2005. Differential diagnosis of severe speech disorders using speech gestures. Topics in Language Disorders: Clinical Perspectives on Speech Sound Disorders 25 (3), 254–265.
Browman, C.P., Goldstein, L.M., 1986. Towards an articulatory phonology. Phonology Yearbook 3, 219–252.
Deng, J., Bouchard, M., Yeap, T., 2005. Speech enhancement using a switching Kalman filter with a perceptual post-filter. In: Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), Vol. 1, pp. 1121–1124.
Dogil, G., Mayer, J., 1998. Selective phonological impairment: a case of apraxia of speech. Phonology 15 (2).
Enderby, P.M., 1983. Frenchay Dysarthria Assessment. College Hill Press.
Friedland, B., 2005. Control System Design: An Introduction to State-Space Methods. Dover.
Fukuda, T., Yamamoto, W., Nitta, T., 2003. Distinctive phonetic feature extraction for robust speech recognition. In: Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), Vol. 2, Hong Kong, pp. 25–28.
Goldstein, L.M., Fowler, C., 2003. Articulatory phonology: a phonology for public language use. In: Phonetics and Phonology in Language Comprehension and Production: Differences and Similarities.
Goldstein, L., Byrd, D., Saltzman, E., 2006. The role of vocal tract gestural action units in understanding the evolution of phonology. In: Arbib, M. (Ed.), Action to Language via the Mirror Neuron System. Cambridge University Press, Cambridge, UK, pp. 215–249.
Hasegawa-Johnson, M., Fleck, M., 2007. International Speech Lexicon Project. http://www.isle.illinois.edu/dict/.
Havstam, C., Buchholz, M., Hartelius, L., 2003. Speech recognition and dysarthria: a single subject study of two individuals with profound impairment of speech and motor control. Logopedics Phoniatrics Vocology 28, 81–90.
Hawley, M.S., Enderby, P., Green, P., Cunningham, S., Brownsell, S., Carmichael, J., Parker, M., Hatzis, A., O'Neill, P., Palmer, R., 2007. A speech-controlled environmental control system for people with severe dysarthria. Medical Engineering & Physics 29 (5), 586–593.
Hogden, J., Rubin, P., McDermott, E., Katagiri, S., Goldstein, L., 2007. Inverting mappings from smooth paths through R^n to paths through R^m: a technique applied to recovering articulation from acoustics. Speech Communication 49 (5), 361–383.
King, S., Frankel, J., Livescu, K., McDermott, E., Richmond, K., Wester, M., 2007. Speech production knowledge in automatic speech recognition. The Journal of the Acoustical Society of America 121 (2), 723–742.
Kirchhoff, K., 1999. Robust speech recognition using articulatory information. Ph.D. thesis, University of Bielefeld, Germany.
Lai, P.L., Fyfe, C., 2000. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10 (5), 365–377.
Lee, L.J., Fieguth, P., Deng, L., 2001. A functional articulatory dynamic model for speech production. In: Proceedings of ICASSP 2001, Salt Lake City, USA, pp. 797–800.
Levelt, W.J.M., Roelofs, A., Meyer, A.S., 1999. A theory of lexical access in speech production. Behavioral and Brain Sciences 22, 1–75.
Livescu, K., Cetin, O., Hasegawa-Johnson, M., King, S., Bartels, C., Borges, N., Kantor, A., Lal, P., Yung, L., Bezman, A., Dawson-Haggerty, S., Woods, B., 2007. Articulatory feature-based methods for acoustic and audio-visual speech recognition: summary from the 2006 JHU Summer Workshop. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), Honolulu.
Matsumasa, H., Takiguchi, T., Ariki, Y., Li, I.-C., Nakabayashi, T., 2009. Integration of metamodel and acoustic model for dysarthric speech recognition. Journal of Multimedia 4 (4), 254–261.
Menendez-Pidal, X., Polikoff, J.B., Peters, S.M., Leonzio, J.E., Bunnell, H., 1996. The Nemours database of dysarthric speech. In: Proceedings of the Fourth International Conference on Spoken Language Processing, Philadelphia, PA, USA.
Metze, F., 2007. Discriminative speaker adaptation using articulatory features. Speech Communication 49 (5), 348–360.
Morales, S.O.C., Cox, S.J., 2009. Modelling errors in automatic speech recognition for dysarthric speakers. EURASIP Journal on Advances in Signal Processing.
Murphy, K.P., 1998. Switching Kalman filters. Tech. rep.
Murphy, K.P., 2002. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. thesis, University of California at Berkeley.
Nam, H., Goldstein, L., 2006. TADA (TAsk Dynamics Application) manual. http://www.sail.usc.edu/lgoldste/ArtPhon/Documents/TADA_manual_v09.pdf.
Nam, H., Saltzman, E., 2003. A competitive, coupled oscillator model of syllable structure. In: Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS 2003), Barcelona, Spain, pp. 2253–2256.
Özbek, I.Y., Hasegawa-Johnson, M., Demirekler, M., 2011. Estimation of articulatory trajectories based on Gaussian mixture model (GMM) with audio-visual information fusion and dynamic Kalman smoothing. IEEE Transactions on Audio, Speech, and Language Processing 19 (5), 1180–1195.
Polur, P.D., Miller, G.E., 2006. Investigation of an HMM/ANN hybrid structure in pattern recognition application using cepstral analysis of dysarthric (distorted) speech signals. Medical Engineering and Physics 28 (8), 741–748.
Raghavendra, P., Rosengren, E., Hunnicutt, S., 2001. An investigation of different degrees of dysarthric speech as input to speaker-adaptive and speaker-dependent recognition systems. Augmentative and Alternative Communication (AAC) 17 (4), 265–275.
Ramsay, J., Silverman, B., 2005. Fitting differential equations to functional data: principal differential analysis. Springer, pp. 327–348.
Reimer, M., Rudzicz, F., 2010. Identifying articulatory goals from kinematic data using principal differential analysis. In: Proceedings of Interspeech 2010, Makuhari, Japan, pp. 1608–1611.
Richardson, M., Bilmes, J., Diorio, C., 2000. Hidden-articulator Markov models: performance improvements and robustness to noise. In: Proceedings of ICSLP 2000, pp. 131–134.
Richmond, K., King, S., Taylor, P., 2003. Modelling the uncertainty in recovering articulation from acoustics. Computer Speech and Language 17, 153–172.
Rosen, K., Yampolsky, S., 2000. Automatic speech recognition and a review of its functioning with dysarthric speech. Augmentative & Alternative Communication 16 (1), 48–60. doi:10.1080/07434610012331278904.
Rudzicz, F., 2007. Comparing speaker-dependent and speaker-adaptive acoustic models for recognizing dysarthric speech. In: Proceedings of the Ninth International ACM SIGACCESS Conference on Computers and Accessibility, Tempe, AZ, pp. 255–256.
Rudzicz, F., 2010. Adaptive kernel canonical correlation analysis for estimation of task dynamics from acoustics. In: Proceedings of the 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '10), Dallas, Texas.
Rudzicz, F., 2010. Correcting errors in speech recognition with articulatory dynamics. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden.
Rudzicz, F., 2011. Production knowledge in the recognition of dysarthric speech. Ph.D. thesis, University of Toronto, Toronto, Canada.
Rudzicz, F., van Lieshout, P., Hirst, G., Penn, G., Shein, F., Wolff, T., 2008. Towards a comparative database of dysarthric articulation. In: Proceedings of the Eighth International Seminar on Speech Production (ISSP '08), Strasbourg, France.
Sakoe, H., Chiba, S., 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-26 (1), 43–49.
Saltzman, E.M., 1986. Task dynamic co-ordination of the speech articulators: a preliminary model. Springer-Verlag, pp. 129–144.
Saltzman, E.L., Munhall, K.G., 1989. A dynamical approach to gestural patterning in speech production. Ecological Psychology 1 (4), 333–382. doi:10.1207/s15326969eco0104_2.
Smith, A., Goffman, L., 2004. Interaction of motor and language factors in the development of speech production. In: Speech Motor Control in Normal and Disordered Speech. Oxford University Press, Oxford, Ch. 10, pp. 227–252.
Stevens, K.N., Keyser, S.J., 2010. Quantal theory, enhancement and overlap. Journal of Phonetics 38 (1), 10–19.
Sun, J., Deng, L., 2002. An overlapping-feature-based phonological model incorporating linguistic constraints: applications to speech recognition. The Journal of the Acoustical Society of America 111 (2), 1086–1101. doi:10.1121/1.1420380.
Toda, T., Black, A.W., Tokuda, K., 2008. Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication 50 (3), 215–227.
Vaerenbergh, S.V., Via, J., Santamaria, I., 2006. Online kernel canonical correlation analysis for supervised equalization of Wiener systems. In: Proceedings of the 2006 International Joint Conference on Neural Networks, Vancouver, Canada, pp. 1198–1204.
Vaerenbergh, S.V., Via, J., Santamaria, I., 2006. A sliding-window kernel RLS algorithm and its application to nonlinear channel identification. In: Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France.
Vaerenbergh, S.V., Via, J., Santamaria, I., 2008. Adaptive kernel canonical correlation analysis algorithms for nonparametric identification of Wiener and Hammerstein systems. EURASIP Journal on Advances in Signal Processing 8 (2), 1–13.
van Lieshout, P.H., Moussa, W., 2000. The assessment of speech motor behavior using electromagnetic articulography 81, 9–22.
van Lieshout, P., Merrick, G., Goldstein, L., 2008. An articulatory phonology perspective on rhotic articulation problems: a descriptive case study. Asia Pacific Journal of Speech, Language, and Hearing 11 (4), 283–303.
Wrench, A., 1999. The MOCHA-TIMIT articulatory database. http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html.
Wrench, A., Richmond, K., 2000. Continuous speech recognition using articulatory data. In: Proceedings of the International Conference on Spoken Language Processing, Beijing, China.
Yorkston, K.M., Beukelman, D.R., 1981. Assessment of Intelligibility of Dysarthric Speech. C.C. Publications Inc., Tigard, Oregon.
Yunusova, Y., Green, J.R., Mefferd, A., 2009. Accuracy assessment for AG500, electromagnetic articulograph. Journal of Speech, Language, and Hearing Research 52, 547–555.
Zheng, W., Zhou, X., Zou, C., Zhao, L., 2006. Facial expression recognition using kernel canonical correlation analysis (KCCA). IEEE Transactions on Neural Networks 17 (1), 233–238.