Journal of Communication Disorders 51 (2014) 13–18
The influence of speaker and listener variables on intelligibility of dysarthric speech

Rupal Patel a,b,*, Nicole Usher a, Heather Kember a, Scott Russell c, Jacqueline Laures-Gore d

a Northeastern University, Bouve College of Health Sciences, United States
b Northeastern University, College of Computer and Information Sciences, United States
c Grady Memorial Hospital, United States
d Georgia State University, College of Education, United States
ARTICLE INFO

Article history: Received 9 April 2014; Received in revised form 27 May 2014; Accepted 30 June 2014; Available online 12 July 2014

Keywords: Dysarthria; Intelligibility; Speaking task; Listener perception

ABSTRACT

This study compared changes in speech clarity as a function of speaking context. It is well documented that words produced in sentence contexts yield higher intelligibility than words in isolation for speakers with mild to moderate dysarthria. To tease apart the effects of speaker and listener variables, the current study aimed to quantify differences in word intelligibility by speaking task. Eighteen speakers with dysarthria produced a set of 25 words in isolation and within the context of a sentence. Eighteen listeners heard a randomized sample of the isolated productions, single words extracted from the sentences, and the full unaltered sentences. Listeners transcribed what they heard and rated their confidence. Words produced in isolation were just as intelligible as words produced in sentence context, and both were more intelligible than extracted words. In other words, speakers reduced articulatory clarity in sentence production compared to isolated productions; listeners were able to cope with this reduction in clarity when they had access to contextual information but not when these cues were removed in the extracted condition. These findings are consistent with Lindblom's hypo-hyperarticulation theory in that adults with dysarthria appear to modulate articulatory precision based on listener/task variables. This work has implications for clinical practice in that isolated word and sentence production tasks yielded equivalent intelligibility findings.

Learning outcomes: Readers will recognize that speech intelligibility is influenced by speaker and listener variables and thus that the choice of speaking and listening task may yield different results. A commonly held clinical belief is that sentence production tasks yield inflated intelligibility scores, but we did not find that in this sample. Findings also indicate that speakers with dysarthria may modulate articulatory clarity in response to listener needs, which should be considered in treatment planning.

© 2014 Published by Elsevier Inc.
* Corresponding author at: Northeastern University, Department of Speech Language Pathology and Audiology, College of Computer and Information Sciences, 360 Huntington Avenue, Room 204 FR, Boston, MA 02115, United States. Tel.: +1 617 373 5842; fax: +1 617 373 2239. E-mail address: [email protected] (R. Patel).

http://dx.doi.org/10.1016/j.jcomdis.2014.06.006
0021-9924/© 2014 Published by Elsevier Inc.
1. Introduction

Dysarthria is a motor speech disorder characterized by weak, slow, and/or imprecise movements (Yorkston, Beukelman, Strand, & Bell, 1999). Thus, assessment of speech intelligibility and intervention aimed at improving it are central to the clinical management of this disorder (Ansel & Kent, 1992). Intelligibility may be measured through word and sentence production tests such as the Sentence Intelligibility Test (SIT; Yorkston, Beukelman, & Tice, 1996) or by determining the number of intelligible words produced during conversation or while reading a passage such as the Caterpillar (Patel et al., 2013) or My Grandfather (Van Riper, 1963). It has been suggested that sentence intelligibility ratings are inflated compared to word intelligibility, presumably because context aids listener understanding of sentences more than of single words (Hustad, 2007; Yorkston & Beukelman, 1978, 1981).

Previous work has shown that sentence intelligibility is higher than word intelligibility for speakers with mild to moderate dysarthria (Hustad, 2007; Yorkston & Beukelman, 1978, 1981). However, for speakers with severe dysarthria, word intelligibility did not differ by presentation task (i.e., words embedded in sentence context vs. spoken in isolation; Hustad, 2007). Therefore, there seems to be an interaction between dysarthria severity and the intelligibility task.

An alternative approach to assessing intelligibility is to compare the perception of target words embedded in context with the same target words extracted from the context and heard in isolation (Lieberman, 1963; Miller, Heise, & Lichten, 1951; O'Neill, 1957; Pollack & Pickett, 1963). In a study of healthy male speakers, Lieberman (1963) found that listeners were less likely to understand an extracted target word than the same word heard in context. He also found that if a word was heard a second time, listeners' identification of the word improved. Lieberman (1963) suggested that even with the degraded acoustic cues in the extracted word, word identification is still possible in the absence of any other information (i.e., the context). The same has been shown for deaf speakers: target words embedded within the context of a sentence tended to be better understood by a listener than the same target word extracted from the sentence (McGarr, 1981).

These findings are consistent with Lindblom's hyper- and hypo-articulation (H&H) theory (Lindblom, 1990), in which a speaker's articulatory clarity is related to the perceived informational needs of her listener. In other words, speakers will hyper-articulate when listeners require maximum acoustic information and hypo-articulate when listeners have access to supplementary channels of information (Lindblom, 1990). Thus, if a speaker knows a listener has access to context, speech clarity may be reduced. Such reductions in speech clarity have been well documented in healthy talkers (Lieberman, 1963; Miller et al., 1951; O'Neill, 1957; Pollack & Pickett, 1963).

Co-articulation during connected speech may be another plausible explanation for differences in clarity between isolated and connected speech. While co-articulation helps listeners perceive word boundaries (Mattys, 2004), it alters the within-word speech signal. Words produced in sentences are also typically shorter, which may impact intelligibility. For some speakers with dysarthria, co-articulation is generally preserved (Tjaden, 2003), but it may decrease perceptual intelligibility (Ziegler & von Cramon, 1985).
Whether speakers with dysarthria modulate articulatory clarity to accommodate listener needs or speaking task remains unresolved. The present study sought to examine the interaction between context, articulatory clarity, and listener perception of dysarthric speech. We compared the intelligibility of words produced in isolation, words embedded in the context of a sentence, and those same words extracted from the sentence, using transcription accuracy and error scores. While previous work has compared the intelligibility of words spoken in context versus in isolation (Hustad, 2007; Yorkston & Beukelman, 1978, 1981), we are not aware of previous work assessing the intelligibility of words extracted from sentence productions (i.e., when the speaker's assumptions about the listener's needs differ from the listener's task). If speakers with dysarthria modulate their articulatory clarity in the same way that healthy speakers do, as proposed by the H&H theory, then words spoken in context should have less articulatory clarity than words spoken in isolation. When these words are extracted from the sentence and presented to listeners as isolated words, intelligibility should decrease because of both the reduced clarity and the lack of top-down context.

2. Method

2.1. Participants

Eighteen monolingual English-speaking adult listeners (M age = 34.94 yrs, SD = 13.2 yrs, range 18–62 yrs) were recruited from the Greater Boston area through online advertisements. Listeners had no self-reported history of speech, language, cognitive, or hearing difficulties and had no experience listening to dysarthric speech. Prior to commencing the perceptual experiment, all participants passed a hearing screening with thresholds below 20 dB at 500, 1000, 2000, and 4000 Hz in at least one ear.

2.2. Stimuli

The spoken corpus for the present study is a subset of a larger dataset of 99 speakers with motor speech disorders collected by Russell, Laures-Gore, Patel, and Frankel (submitted for publication) at Grady Memorial Health System in Atlanta, Georgia. Stimuli from eighteen speakers (M age = 55.06 yrs, SD = 12.1 yrs, range 34–79) with various dysarthria diagnoses were used in the current study (speaker demographics are summarized in Table 1). Single word intelligibility ratings averaged across three listeners are included in Table 1 as an index of severity.
Table 1
Demographic information for the speakers with dysarthria.

Speaker   Age   Sex   Dysarthria type            Single-word intelligibility (%)
DYS-1     61    F     Mixed                      33.3
DYS-2     41    M     Ataxic dysarthria          27.8
DYS-3     50    M     Mixed                      19.1
DYS-4     34    M     Dysarthria                 11.8
DYS-5     62    M     Flaccid                    31.1
DYS-6     69    M     Hypokinetic dysarthria     29.6
DYS-7     n/a   M     Dysarthria                 26.7
DYS-8     75    M     Mixed                      47.9
DYS-9     52    F     Mixed                      37.5
DYS-10    50    M     Mixed                      8.2
DYS-11    60    M     Mild flaccid dysarthria    6.4
DYS-12    48    F     Mild flaccid dysarthria    38.5
DYS-13    60    M     Dysarthria                 37.8
DYS-14    54    F     Flaccid                    32.5
DYS-15    79    M     Ataxic dysarthria          45.2
DYS-16    47    M     Spastic dysarthria         53.6
DYS-17    54    M     Dysarthria                 34.4
DYS-18    40    M     Flaccid                    29.1
Specifically, the subset of stimuli used here included audio-recordings of twenty-five single word productions taken from the Assessment of Intelligibility of Dysarthric Speech (AIDS) and those same words produced in sentence contexts within readings of the Caterpillar passage (Patel et al., 2013), the Grandfather passage (Van Riper, 1963), or productions of sentences with prosodic or emotional content. For the extracted word condition, target words were excised from the sentence recordings using visual spectrographic analysis with auditory confirmation in Praat (Boersma & Weenink, 2014).

For each speaker, only target words that had tokens for all three listening tasks (isolated, context, extracted) were included in the stimulus list. Given that data collection occurred in a hospital setting, some audio files were too noisy to include in the analysis. Additionally, for some speakers, we did not have usable samples of each target word in all speaking conditions. If a target word was not available in all three speaking conditions, we excluded that target word from the analysis of that speaker. As a result, across all speakers, listeners heard between 108 and 148 audio samples (M = 138.2, SD = 15.2). Speakers were randomly separated into six groups of three speakers each. Three listeners were assigned to each speaker group, and the order of presentation of items within speaker groups was randomized.
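To make the excision step concrete, the following is a minimal sketch of extracting a target word from a sentence recording once its onset and offset have been identified spectrographically (e.g., in Praat). The file names, target word, and boundary times are hypothetical illustrations, not values from the study's corpus.

```python
# Minimal sketch: slice a target word out of a sentence recording given
# word boundaries identified spectrographically (e.g., in Praat).
# File names and boundary times below are hypothetical examples.
from scipy.io import wavfile

rate, samples = wavfile.read("DYS-01_sentence.wav")   # hypothetical recording
t_onset, t_offset = 2.41, 2.89                         # hypothetical word boundaries (s)

# Convert boundary times to sample indices and excise the target word.
word = samples[int(t_onset * rate):int(t_offset * rate)]
wavfile.write("DYS-01_park_extracted.wav", rate, word)
```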
3. Procedure

The listener perception experiment took place in a sound-isolated room within the Communication Analysis and Design Laboratory at Northeastern University. The hearing screening and experimental protocol were completed within a single session lasting approximately one hour. A custom interface running on an Apple desktop computer was used to present stimuli in a randomized order and record listener responses. For each stimulus, listeners were instructed to listen to an audio sample and transcribe exactly what they heard (i.e., transcribe the entire sentence or the single word they heard). After each trial, they were also asked to rate how confident they were in their transcription on a 5-point Likert scale (1 = not at all confident, 2 = not very, 3 = somewhat, 4 = very, 5 = completely confident). Listeners were informed that some speakers might be more difficult to understand than others. Listeners were able to repeat audio clips for clarification and could move through the experiment at their own pace.

3.1. Data measurement

Transcriptions were analyzed using a phoneme-level edit distance measure (adapted from Bunnell & Lilley, 2007). Edit distance was defined as the number of operations (insertions, deletions, substitutions) needed to map the response onto the stimulus (i.e., the target phoneme sequence). Insertions were defined as additional phonemes within the target word, deletions as phonemes missing from the target word, and substitutions as incorrect phonemes that replaced phonemes within the target word (see Table 2). Punctuation and capitalization were ignored, and misspellings that were obvious homophones (e.g., bear for bare) were counted as non-errors. The raw edit distance was normalized by the number of phonemes in the target word; higher edit distance scores indicate more errors. Normalized edit distances (NED) were calculated for each listener and speaking condition. Only phoneme errors in the target words were analyzed; we did not analyze errors in surrounding words or phonemes.
Table 2
Examples of participant responses and corresponding normalized edit distance scores.

Target   Response   # of errors                       # of phonemes   NED
Park     Park       0                                 4               0
Park     Part       1 (1 substitution)                4               0.25
Park     Art        2 (1 deletion, 1 substitution)    4               0.5
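As a concrete illustration of this scoring, below is a minimal sketch of a phoneme-level edit distance and its normalization. It assumes the standard dynamic-programming (Levenshtein) formulation; the ARPAbet-style phoneme strings are illustrative and not the study's actual transcription scheme.

```python
def edit_distance(target, response):
    """Minimum number of insertions, deletions, and substitutions needed to
    map the response phoneme sequence onto the target phoneme sequence."""
    m, n = len(target), len(response)
    # dp[i][j] = edit distance between target[:i] and response[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if target[i - 1] == response[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[m][n]

def normalized_edit_distance(target, response):
    """Raw edit distance divided by the number of phonemes in the target."""
    return edit_distance(target, response) / len(target)

# Reproducing the Table 2 examples (illustrative phoneme strings).
park = ["P", "AA", "R", "K"]
print(normalized_edit_distance(park, ["P", "AA", "R", "K"]))  # 0.0   ("Park")
print(normalized_edit_distance(park, ["P", "AA", "R", "T"]))  # 0.25  ("Part")
print(normalized_edit_distance(park, ["AA", "R", "T"]))       # 0.5   ("Art")
```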
3.2. Statistical analyses

We fit a linear mixed-effects model to assess the effect of word type on transcription accuracy; this allowed for multiple scores per speaker (one from each of the three listeners). Word type (three levels: contextual, extracted, isolated) was entered as the fixed factor and normalized edit distance (NED) as the dependent variable. Intra-class correlations were calculated to assess agreement between listeners' transcription accuracy for each speaker. We also calculated correlations between listeners' NED scores and their confidence ratings.

4. Results

Overall, there was a significant effect of speaking condition on edit distance scores, F(2, 34) = 29.96, p < 0.001. There were significantly more transcription errors on extracted words than on contextual words, t(17) = 14.34, p < 0.001, and significantly more errors on extracted words than on isolated words, t(17) = 7.33, p < 0.001. There was no significant difference in transcription accuracy between contextual words and isolated words, t(17) = 0.52, p = 0.61 (see Fig. 1). Intra-class correlations showed significant agreement between listeners in each group for each listening condition: contextual r = 0.772, p < 0.001; extracted r = 0.716, p < 0.001; and isolated r = 0.734, p < 0.001. There were significant negative correlations between edit distance scores and confidence ratings for each word type: r = 0.659 for contextual words, r = 0.386 for extracted words, and r = 0.409 for isolated words. All correlations are shown in Table 3.
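For readers wishing to see how such an analysis could be set up, below is a minimal sketch in Python under the assumption of a long-format table with one NED score per listener-by-speaker-by-word-type combination. The file name and column names are hypothetical, and this is not the software the authors report using.

```python
# Sketch of the analysis described in Section 3.2, assuming a long-format
# table with columns "ned", "word_type" (contextual/extracted/isolated),
# "speaker", and "confidence". File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import pearsonr

df = pd.read_csv("transcription_scores.csv")  # hypothetical data file

# Linear mixed-effects model: word type as a fixed factor,
# speaker as the grouping (random intercept) factor.
model = smf.mixedlm("ned ~ C(word_type)", data=df, groups=df["speaker"])
print(model.fit().summary())

# Pearson correlations between NED scores and confidence ratings, by word type.
for word_type, sub in df.groupby("word_type"):
    r, p = pearsonr(sub["ned"], sub["confidence"])
    print(f"{word_type}: r = {r:.3f}, p = {p:.4f}")
```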
Fig. 1. The main effect of listening task on normalized edit distance scores. Error bars represent standard error of the mean.
Table 3
Correlations between mean edit distance scores and confidence ratings for each word type.

                                  Edit distance scores
Confidence ratings   Contextual   Extracted   Isolated
Contextual           0.659*       0.607*      0.472*
Extracted            0.211        0.386*      0.210
Isolated             0.260        0.400*      0.409*

* Correlation is significant at the 0.001 level.
5. Discussion

The present study examined the effect of speaking task on speech intelligibility in dysarthria. We compared the intelligibility of target words spoken in isolation (lacking context), words produced in a sentence (with surrounding context), and words extracted from a sentence (lacking context and potentially lacking articulatory clarity in comparison to the words produced in isolation). We found that words produced in isolation were transcribed just as accurately as words produced within the context of a sentence. Words extracted from the sentence, however, were transcribed with the least accuracy. When words were produced in sentences, speakers may have hypo-articulated their productions either because they 'expected' listeners to have access to supplementary information, as suggested by Lindblom's H&H theory (Lindblom, 1990), and/or because of the natural effect of co-articulation in connected speech. Along these lines, extracted words may have been least understood due to the lack of both articulatory clarity and the supplementary top-down information that aids listener comprehension.

This finding suggests that context and clarity had a comparable influence on the intelligibility of dysarthric speech in this sample. When listeners do not have context to aid understanding, they rely on clear articulation. Conversely, when listeners do not have access to clear articulation, they rely on context. Extracted words were therefore the least intelligible because listeners had neither context nor articulatory clarity. Moreover, the finding of greater intelligibility for isolated words than for extracted words, together with the finding of equivalent intelligibility for isolated and embedded words, suggests that in connected speech speakers may hypo-articulate words (i.e., when they believe that listeners will have access to surrounding context) or hyper-articulate them (i.e., when the communication channel is compromised, such as in noise).

We found that listeners' perception of their own performance (i.e., confidence ratings) matched the overall normalized edit distance (NED) scores for each speaking condition: isolated, embedded, and extracted. As listeners made more errors (indicated by higher NED scores), confidence ratings went down. Prior research has found that listeners may underestimate how well they can understand dysarthric speech (Hustad, 2007). The present study suggests that naïve listeners were quite insightful about their transcription accuracy, perhaps due to the severity of the speakers in the present sample.

5.1. Limitations

One limitation of the current study is that we did not have an independent measure of dysarthria severity for each speaker. Previous work has shown that differences between intelligibility ratings for sentence versus isolated word production tasks are moderated by dysarthria severity (Hustad, 2007). One of the primary ways that severity is operationalized is through perceptual judgments of the AIDS words, which is problematic given that our stimuli were the same AIDS words. A second limitation of the present work is that the spoken dataset was a subset of a corpus collected in Atlanta, Georgia, whereas the listener participants were from the Greater Boston area. Thus, dialectal differences may have confounded listener results. Although listeners were informed that speakers might have a Southern accent, a lack of familiarity with Southern-accented American English may still have influenced intelligibility scores.
5.2. Future directions

As listener familiarity tends to affect the intelligibility of dysarthric speech (Hustad, 2007), future work examining the impact of repeated exposure to the same stimuli and speakers on transcription accuracy would be warranted. Additionally, varying speaker expectations (speaking with a familiar listener versus a naïve listener) may further influence how speakers modulate articulatory clarity to maximize intelligibility. Follow-up investigations to assess whether phoneme errors on surrounding words within the sentences impact intelligibility would also be informative.

5.3. Conclusion

The present study suggests that there is an interaction between speaking task, articulatory clarity, and listening context when assessing intelligibility in individuals with dysarthria. For this subset of speakers, intelligibility did not differ when listeners heard target words produced in isolation versus embedded in a sentence, but it was significantly reduced for extracted words. In other words, speakers reduced articulatory clarity in sentence production compared to isolated productions; listeners were able to cope with this reduction in clarity when they had access to contextual information but not when these cues were removed in the extracted condition.

The clinical implications of the present findings pertain to the choice of speaking and listening task for assessing speech intelligibility in dysarthria. We found that isolated word and sentence production tasks yielded equivalent intelligibility scores and that speakers with dysarthria appeared to modulate articulatory precision based on listener/task variables. Measures of intelligibility often include both word-level and sentence-level production tasks; however, using both may be duplicative if intelligibility does not differ between them. Our results therefore suggest that it may be necessary to complete only a single word intelligibility measure or a sentence intelligibility test, which would reduce assessment time. It may also be worthwhile to report intelligibility scores along with the speaking and listening tasks in which they were obtained.
Financial and Non-financial Disclosures

The authors have no financial or nonfinancial relationships to disclose.

Appendix A. Continuing education questions

1. Which of the following variables impact the intelligibility of dysarthric speech?
   a. Speaking task
   b. Speech clarity
   c. Listening task
   d. All of the above
2. True or false: Isolated word productions were more intelligible than sentence productions in this sample of speakers with dysarthria.
3. True or false: Speakers with dysarthria in this study reduced articulatory clarity for sentence productions.
4. The correlation between confidence ratings and edit distance scores was found to be:
   a. Negative
   b. Positive
   c. No relationship
   d. Not tested
5. Larger edit distance scores mean:
   a. Listeners had more difficulty understanding that word
   b. Listeners found the word easier to understand
   c. None of the above
   d. Not tested
References

Ansel, B. M., & Kent, R. D. (1992). Acoustic–phonetic contrasts and intelligibility in the dysarthria associated with mixed cerebral palsy. Journal of Speech and Hearing Research, 35, 296–308.
Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer. Version 5.3.62. Retrieved from http://www.praat.org (02.01.14).
Bunnell, H. T., & Lilley, J. (2007). Analysis methods for assessing TTS intelligibility. Proceedings of the sixth international workshop on speech synthesis (pp. 374–379).
Hustad, K. C. (2007). Effects of speech stimuli and dysarthria severity on intelligibility scores and listener confidence ratings for speakers with cerebral palsy. Folia Phoniatrica et Logopaedica, 59, 306–317.
Lieberman, P. (1963). Some effects of semantic and grammatical context on the production and perception of speech. Language and Speech, 6, 172–187.
Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and speech modeling (pp. 403–439). Dordrecht: Kluwer Academic Publishers.
Mattys, S. (2004). Stress versus coarticulation: Toward an integrated approach to explicit speech segmentation. Journal of Experimental Psychology: Human Perception and Performance, 30, 397–408.
McGarr, N. S. (1981). The effect of context on the intelligibility of hearing and deaf children's speech. Language and Speech, 24(3), 255–264.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology, 41, 329–335.
O'Neill, J. (1957). Recognition of the intelligibility of test materials in context and isolation. Journal of Speech and Hearing Disorders, 22, 87–90.
Patel, R., Connaghan, K., Franco, D., Edsall, E., Forgit, D., Olsen, L., et al. (2013). The Caterpillar: A novel reading passage for assessment of motor speech disorders. American Journal of Speech-Language Pathology, 22, 1–9.
Pollack, I., & Pickett, J. M. (1963). The intelligibility of excerpts from conversation. Language and Speech, 6, 165–171.
Russell, S., Laures-Gore, J., Patel, R., & Frankel, M. (2014). Atlanta motor speech disorders corpus (submitted for publication).
Tjaden, K. (2003). Anticipatory coarticulation in multiple sclerosis and Parkinson's disease. Journal of Speech, Language, and Hearing Research, 46, 990–1008.
Van Riper, C. (1963). Speech correction: Principles and methods (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Yorkston, K., & Beukelman, D. (1978). A comparison of techniques for measuring intelligibility of dysarthric speech. Journal of Communication Disorders, 11, 499–512.
Yorkston, K., & Beukelman, D. (1981). Communication efficiency of dysarthric speakers as measured by sentence intelligibility and speaking rate. Journal of Speech and Hearing Disorders, 46, 296–301.
Yorkston, K. M., Beukelman, D. R., Strand, E. A., & Bell, K. R. (1999). Management of motor speech disorders in children and adults. Austin, TX: Pro-Ed.
Yorkston, K., Beukelman, D., & Tice, R. (1996). Sentence Intelligibility Test for Macintosh. Communication disorders software. Lincoln, NE: Tice Technology Services.
Ziegler, W., & von Cramon, D. R. (1985). Anticipatory coarticulation in a patient with apraxia of speech. Brain and Language, 26, 117–130.