Computer Speech and Language 36 (2016) 110–121
Preprocessing for elderly speech recognition of smart devices

Soonil Kwon*, Sung-Jae Kim, Joon Yeon Choeh

Interaction Technology Laboratory, Department of Digital Contents, Sejong University, 98 Gunja-Dong, Gwangjin-Gu, Seoul 143-747, Republic of Korea

Received 13 October 2014; received in revised form 8 July 2015; accepted 4 September 2015; available online 12 September 2015

This paper has been recommended for acceptance by R.K. Moore.
* Corresponding author. Tel.: +82 2 3408 3847. E-mail addresses: [email protected] (S. Kwon), [email protected] (S.-J. Kim), [email protected] (J.Y. Choeh).
http://dx.doi.org/10.1016/j.csl.2015.09.002
Abstract

Due to the aging population in modern society and the proliferation of smart devices, there is a need to enhance speech recognition on smart devices so that information is as easily accessible to the elderly as it is to the younger population. In general, speech recognition systems are optimized for an average adult's voice and tend to exhibit a lower accuracy rate when recognizing an elderly person's voice, due to the effects of speech articulation and speaking style. Additional costs are bound to be incurred when modifying current speech recognition systems for better recognition of elderly users. Thus, using a preprocessing application on a smart device can not only deliver better speech recognition but also substantially reduce any added costs. Audio samples of 50 words uttered by 80 elderly and young adults were collected and comparatively analyzed. The speech patterns of the elderly show a slower speech rate, longer inter-syllabic silences, and slightly lower speech intelligibility. The speech recognition rate for elderly adults could be improved by increasing the speech rate (a 1.5% increase in accuracy), eliminating silence periods (another 4.2% increase), and boosting the energy of the formant frequency bands (a 6% boost in accuracy). After all the preprocessing, a 12% increase in the accuracy of elderly speech recognition was achieved. Through this study, we show that speech recognition of elderly voices can be improved by modifying specific aspects of speech articulation and speaking style. In the future, we will study methods that can precisely measure and adjust speech rate and identify additional factors that impact intelligibility.

© 2015 Elsevier Ltd. All rights reserved.

Keywords: Elderly voice interface; Speech recognition; Aging society
1. Introduction

Demographics among developed nations around the world are shifting towards an aging population; for instance, the proportion of the elderly population in South Korea has reached unprecedented levels (Korea National Statistical Office, 2006). As the market penetration of information technology has spread widely, smart devices are heavily used not only by young people but also by the elderly, indoors and outdoors, throughout the entire day. The interface methods supported by smart devices are largely conventional interfaces that depend on touch, voice, and motion commands.
Fig. 1. Illustration of the proposed preprocessing method for elderly speech recognition.
However, the touch-sensitive technology and finger-recognition gestures used on smart devices are difficult for the elderly to use due to deterioration of eyesight and sense of touch. We therefore deemed that a speech-recognition interface on a smart device would increase its ease-of-use for the elderly. Such an interface would be especially useful for senior citizens in emergency situations, including traumatic events that leave them physically limited.

Voice interfaces embedded in existing smart devices use speech recognition methods that are optimized for the speech and speaking patterns of average young and middle-aged adults, and they therefore show a decline in performance when there is a deviation in the speech rate. Furthermore, tongue thickness, range of movement, and duration of movement decrease as humans reach old age (Bennett et al., 2007); as a result, feedback becomes slower when articulating speech sounds, which can partially affect articulation functions (Sonies, 1987). In other words, among the elderly, the speech rate becomes slower, silences become longer, and speech becomes less precise (Kahane, 1981).

This study aims to investigate the factors that impair speaking ability among the elderly by comparing speech patterns between the elderly and young adults, while also identifying which aspects of speech can be normalized or adapted to improve speech recognition. After the analysis of elderly speech, a preprocessing method was applied to modify voice signals from elderly adults, which were then tested to determine whether speech recognition accuracy improved without any modification of the automatic speech recognition system installed in an existing smart phone (Fig. 1). Test results showed that voice signals from the elderly increased the error rates of existing speech recognition systems, but accuracy showed marked improvement when the speech rate, silence lengths, and formants of the voice signals were preprocessed. The study results show promise in offering better accessibility to smart devices for the elderly and the speech-impaired, who are left behind in the information age due to the lack of modifications in the automatic speech recognition systems installed in existing smart phones.

In Section 2, speech patterns of the elderly and related studies on speech recognition are summarized, and in Section 3, the methods used to analyze and measure characteristics of speech patterns among the elderly are described. Section 4 presents the comparative analysis of speech patterns between young adults and the elderly to identify differences; the speech signals are then preprocessed for these differences and tested with an existing automatic speech recognition system installed on a smart device, and the results are analyzed. The conclusion of this paper is given in Section 5.

2. Previous research

In this section, we provide a brief review of studies on elderly speech, drawing on prior technological research in the field of automatic speech recognition as well as research in fields such as applied linguistics and biomechanics. When humans reach old age, their eyesight, cognitive processes, sensory abilities, and physical functions that allow them to communicate verbally decline in comparison to younger people. The occipital lobe, which controls muscle movement and the lungs, shows a marked decline, while the mucous membrane in the vocal cords becomes thinner and keratinized.
In addition, the thyroid cartilage in men over 65 becomes nearly completely ossified, while in women, ossification is limited to the lower part of the cartilage (Lee, 2011). Due to these aging factors, people over the age of 65 exhibit a significantly slower articulation speed and speech rate compared to the younger generation, and also have a
change in the resonance properties of the larynx, which creates a fainter pitch (Kim, 2003). In addition, there is an increase in fluency breaks, resulting in longer and more frequent silences (Manning and Monte, 1981). Ryan empirically showed that 70–80 year old adults exhibited a significant increase in mean vocal intensity for reading and speaking conditions and a significant decrease in the overall reading rate and mean sentence reading rate compared to younger adults (Ryan, 1972). In addition, 60–70 year old adults exhibited a significant decrease in mean sentence reading rate compared to 40–50 year old adults.

Biological changes have an impact on the voice signals formed during vocalization. Fundamental frequency (F0) varies with gender and age. The average fundamental frequency of elderly males is 121.8 Hz, while the average for young and middle-aged males is 118.7 Hz. Elderly females have an average fundamental frequency of 174.8 Hz, compared to about 224.82 Hz for young and middle-aged females. The formant frequencies that appear when making vowel sounds not only express the characteristics of the vowel but also carry most of the vocal energy, which plays an important role in measuring the quality of phonetic information. The average first formant (F1) for elderly males and middle-aged males is 688.2 Hz and 703.17 Hz, respectively, with a standard deviation of 50.02 for the former and 30.07 for the latter. For elderly females and middle-aged females, the average F1 is 765.01 Hz and 827.79 Hz, respectively, with a standard deviation of 40.05 for the former and 33.08 for the latter (Jin et al., 1997; Wilpon and Jacobsen, 1996; Kim and Ko, 2008). These results show that people over 60 undergo changes in the key frequencies of their speech compared to middle-aged adults. Moreover, the spectral energy distribution across frequencies is a key factor in speech recognition and is thus a topic of interest in this study.

Speech intelligibility is the degree to which a spoken utterance can be understood by the listener (Han, 2011). Intelligibility is determined by various factors such as articulation, resonance, cadence, breathing, vocalization, environmental noise, circumstance, and the listener. Ptacek and Sander tested the ability of 10 listeners to differentiate the voices of younger adults (under age 35) from older (over age 65) subjects on the basis of a prolonged vowel, a reading sample played backward, and a reading sample played forward. Listeners were able to differentiate the voices of younger adults from aged speakers with impressive accuracy under each of the three successive listening conditions of decreasing difficulty (Ptacek and Sander, 1966). This test proved the claim that people can tell the difference between the voice of an elderly person and that of a young adult by ear. Harnsberger's experiment showed that acoustic patterns can be identified by age using two key factors, F0 and speech rate; therefore, a rough estimate of a person's age can be made from his or her voice using these two features (Harnsberger et al., 2008). Acoustically, however, the variables related to intelligibility are the duration of the vowel sound, the locations of the F1 and second formants (F2), the fricative sound, and the noise duration of an affricate (Pyo and Shim, 2005).
To reduce the degradation in speech recognition performance caused by variation in vocal tract shape among speakers, vocal tract length normalization (VTLN) is used in almost all state-of-the-art Gaussian Mixture Model/Hidden Markov Model based automatic speech recognition systems. For speaker normalization, Lee and Rose estimated a linear frequency warping factor and warped the frequency scale by modifying the filterbank in mel-frequency cepstrum feature analysis (Lee and Rose, 1998). This method stretched or compressed the frequency scale of the filters for frequency warping. Their results showed that frequency warping was consistently able to reduce the word error rate by 20%, even for very short utterances. Potamianos and Narayanan proposed a speaker normalization algorithm combining frequency warping and model transformation, which was shown to reduce acoustic variability and significantly improve automatic speech recognition (ASR) performance for child speakers (Potamianos and Narayanan, 2003). They computed the average scaling factors between children's and adults' speech for all phonemes and then combined frequency warping and spectral shaping. The experimental results showed about a 28.7% improvement for connected-digit and sub-word recognition tasks. Elenius and Blomberg presented a dynamic VTLN method for connected-digit recognition of children's speech using models trained on male adult speech (Elenius and Blomberg, 2010), incorporating time-varying speaker characteristics into the acoustic model. The word error rate was reduced by 10% relative to the conventional utterance-specific warping factor.

By utilizing the findings of the aforementioned prior research, it is possible to predict the relative differences between the speech of an elderly person and that of a young adult in terms of speech rate, silence length, fluency breaks, F0, F1, and F2. Variability in some of these factors is known to impact the accuracy of speech recognition systems. It is possible to counteract errors in speech recognition that arise when people speak rapidly by using cepstrum normalization (Richardson et al., 1999; Kwon, 2011; Kwon et al., 2013). Automatic speech recognition algorithms, which inherently have trouble with voice samples from a fast speaker, have seen a modest performance increase of 1.9% in terms of accuracy
after modeling the changes of a person's rapid speaking rate (Zheng et al., 2000; Siegler and Stem, 1995). However, no study taking into account all of the aforementioned factors has been carried out yet, nor has any correlation been established between speech recognition and the changing speech patterns caused by aging. In this study, factors attributed to aging in speech patterning are treated as areas in which the accuracy rate of speech recognition can be enhanced through preprocessing.

3. Elderly adult voice features and conventional extraction method

The speech rate of the elderly is slower than that of young adults due to the aging of the mucous membrane, larynx, and pulmonary functions. The intelligibility of elderly speech tends to suffer as well. Thus, when analyzing the characteristics of vocalization among elderly adults, measurable features such as speech rate, silence length, and formants are used for the analysis.

Generally, the average phoneme rate and syllable rate can be used to measure the speaking rate. Eq. (1) below is used to find the speech rate in this study (Chu and Povey, 2010). Given an utterance i composed of a sequence of N_i words [w_{i,1}, w_{i,2}, \ldots, w_{i,N_i}], where j indexes the words, the speech rate F(i) is

F(i) = \frac{\sum_{j=1}^{N_i} t(w_{i,j})}{\sum_{j=1}^{N_i} n(w_{i,j})}    (1)

where t(w_{i,j}) is the duration of a word and n(w_{i,j}) is the number of phones in the word. After analyzing the difference in the average speech rate between the elderly and young adults, the resulting data can be used to modify the voice signal by artificially adjusting the speech rate.

The Synchronized Overlap-Add (SOLA) algorithm is one basic method used to adjust the speech rate on a time-scale according to a measured ratio. As shown in Eq. (2), an overlapped window of fixed length W is used to extract segments of the signal x at regular intervals S_A (Hejna and Musicus, 1991; Kwon, 2012):

x_m[n] = \begin{cases} x[mS_A + n] & \text{for } n = 0, \ldots, W-1 \\ 0 & \text{otherwise} \end{cases}    (2)

The resulting windows can be compressed, extended, or converted according to any given ratio and then combined again. A segment cut from the input signal at the analysis interval S_A is overlapped at the synthesis interval S_S, and the conversion is expressed by the ratio \alpha = S_S / S_A. If \alpha > 1, the conversion is an extension; if \alpha < 1, it is a compression. The converted segments overlap during the overlap-add stage, and fine-tuning is needed to make the overlapped areas as similar as possible.

End point detection (EPD) is an algorithm that can separate the silence components in human speech. EPD generally uses energy to determine the existence of spoken sounds in each analysis frame (typically 25 ms). In order to enhance the accuracy in separating voice components, the zero crossing rate (ZCR) is additionally used: after the voiced sounds in speech are separated, the silence regions and voiceless sounds are separated using the ZCR.
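To make the measurement and adjustment steps concrete, the following is a minimal Python sketch of Eq. (1) and a simplified SOLA-style time-scale modification. It assumes a mono 16 kHz NumPy signal; the window and hop sizes, and the omission of full SOLA's cross-correlation alignment search, are our own illustrative simplifications rather than the paper's implementation.

```python
import numpy as np

def speech_rate(word_durations, word_phone_counts):
    # Eq. (1): total word duration divided by total phone count (s/phone)
    return sum(word_durations) / sum(word_phone_counts)

def sola_time_scale(x, alpha, win=1024, hop_a=256):
    """Simplified SOLA time-scale modification.

    alpha = S_S / S_A: alpha > 1 stretches (slower), alpha < 1 compresses
    (faster). Full SOLA also searches a small offset around each synthesis
    position for the best waveform alignment; that step is omitted here.
    """
    hop_s = int(round(alpha * hop_a))              # synthesis hop S_S
    n = (len(x) - win) // hop_a + 1                # number of full analysis frames
    out = np.zeros((n - 1) * hop_s + win)
    for m in range(n):
        frame = x[m * hop_a : m * hop_a + win]     # x_m[n] of Eq. (2)
        pos = m * hop_s
        ov = win - hop_s                           # overlap with previous output
        if m == 0 or ov <= 0:
            out[pos:pos + win] = frame
        else:
            ramp = np.linspace(0.0, 1.0, ov)       # linear cross-fade over overlap
            out[pos:pos + ov] = out[pos:pos + ov] * (1 - ramp) + frame[:ov] * ramp
            out[pos + ov:pos + win] = frame[ov:]
    return out

# Example: compress slower elderly speech toward a young-adult reference rate,
# i.e. alpha = speech_rate(young_durs, young_phones) / speech_rate(old_durs, old_phones)
```

With F(i) measured for an elderly utterance and for a young-adult reference, setting alpha to the ratio of the two rates (alpha < 1 here) compresses the slower speech toward the reference.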
Formant frequencies can be detected from speech spectra by a peak picking method (Wood, 2003). The peak picking algorithm applies parabolic interpolation and peak picking to the discrete samples of an N-point FFT in order to locate the formant. The parabola has the form

y(\lambda) = a\lambda^2 + b\lambda + c    (3)

If y(0) is the discrete peak value, y(-1) the sample to its left, and y(1) the sample to its right, then the coefficients of the parabola that passes through all three points are

c = y(0), \quad b = \frac{y(1) - y(-1)}{2}, \quad a = \frac{y(1) + y(-1)}{2} - y(0)    (4)
Setting the derivative dy(\lambda)/d\lambda = 0 gives the peak point, so the maximum is located at \lambda_p = -b/(2a). If the discrete peak is located at bin n_p, the formant position F and the interpolated formant bandwidth B can be expressed as

F = \frac{(n_p + \lambda_p) f_s}{2N}, \quad B = \frac{\left[b^2 - 4a\left(c - 0.5\,y(\lambda_p)\right)\right]^{1/2} f_s}{-aN}    (5)
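A minimal Python sketch of Eqs. (3)–(5) follows. It picks only the single largest peak of an FFT magnitude spectrum and uses a bin resolution of f_s/N_FFT (the paper writes the resolution as f_s/(2N) in its own FFT convention); a practical formant tracker would smooth the spectrum, e.g. with LPC, and extract several peaks. Function and parameter names are illustrative.

```python
import numpy as np

def formant_peak(x, fs, nfft=1024):
    """Estimate one spectral peak and its bandwidth via parabolic interpolation.

    Fits y(lam) = a*lam^2 + b*lam + c through the discrete peak bin and its
    two neighbours (Eq. (4)), then converts the interpolated vertex to Hz.
    """
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
    n_p = int(np.argmax(spec[1:-1])) + 1            # discrete peak bin n_p
    ym1, y0, yp1 = spec[n_p - 1], spec[n_p], spec[n_p + 1]
    c = y0                                          # Eq. (4)
    b = (yp1 - ym1) / 2.0
    a = (yp1 + ym1) / 2.0 - y0                      # a < 0 at a true peak
    lam_p = -b / (2.0 * a)                          # vertex location in bins
    y_p = a * lam_p ** 2 + b * lam_p + c            # interpolated peak value
    F = (n_p + lam_p) * fs / nfft                   # peak frequency (Hz)
    disc = b * b - 4.0 * a * (c - 0.5 * y_p)        # Eq. (5) discriminant
    B = np.sqrt(max(disc, 0.0)) / (-a) * fs / nfft  # half-height width (Hz)
    return F, B
```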
4. Experiments and results

A total of 80 test participants were chosen, with 40 people in the 20–30 years age group and 40 in the over-65 age group. Each age group comprised an equal number of males and females in a 50/50 gender split. Their voice data were collected for the purpose of data analysis in this study, and none of the test participants had any prior conditions such as impaired speech, impaired sight, nervous system problems, or vocal cord ailments. Recordings were carried out in a quiet lecture room at the Gwangjin-gu Senior Citizen's Welfare Hall in Seoul, Korea. Participants stood about 10 cm from the microphone during data collection. The sampling rate and total length of the recorded data were 16,000 Hz and 4152 s, respectively. A set of 50 isolated Korean words was given to the test participants to speak for the data recording. These 50 words were chosen from the smart device function control commands of "S" company. They consisted of 12 two-syllable words, 15 three-syllable words, 11 four-syllable words, 5 five-syllable words, 4 six-syllable words, and 3 words of seven or more syllables.

4.1. Comparative analysis of speech cues among the elderly and young adults

Drawing upon prior research in the fields of biological aging, vocalization behavior, and speech recognition, we conducted a comparative analysis of elderly and young adult speech cues in terms of speech rate, silence length, and vocal frequency. The data on speech rate, silence length, and formants were collected by the standard methods explained in Section 3, using a C program and the Praat voice analysis tool. The speech recognition performance test was run on an Android smart phone. The baseline speech recognition system used in the tests was a sub-word system with GMM-based triphone HMMs and 39-dimensional PLP-cepstral coefficients (Schalkwyk et al., 2010). Prerecorded and normalized spoken word data were fed into the smart phone by a loudspeaker at a distance of 10 cm. The accuracy rate of the speech recognition experiment was calculated by counting the number of correct answers at the word level, where any word containing one or more incorrectly recognized syllables was deemed a wrong answer.

The results of this accuracy test are shown in Fig. 2: the accuracy rate was 94% for young adults and only 76% for the elderly participants, so elderly speech recognition lagged approximately 20% behind that of young people.

Fig. 2. Speech recognition accuracy rate (%).

In Fig. 2, we examined why elderly male subjects had a higher speech recognition accuracy rate than elderly female subjects and, conversely, why young female subjects had a higher speech recognition accuracy rate than their male counterparts.
Fig. 3. Average speech rate (in seconds).
Fig. 4. Average silence length (in seconds).
To compare these subjects, p-values were calculated; the p-value for the elderly test subjects was 0.8, and for the young test subjects it was 0.47. As a result, gender was not an influential factor in speech recognition accuracy among either the elderly or the young test subjects.

Fig. 3 shows that the elderly participants had a slower speaking rate for each syllable in comparison to young adults; we deduced that this was caused by the aging of the larynx in addition to mispronunciations. A few elderly participants showed a tendency to stretch the first syllables of words in their pronunciation. When asked to pronounce an unfamiliar word, most of these participants would extend the pronunciation of a certain syllable until the next syllable. Elderly females had noticeably longer inter-syllabic silences, as shown in Fig. 4, compared to the other elderly and young adult groups. However, elderly males did not exhibit any difference from young adults in terms of inter-syllabic silence length. Elderly males displayed a better aptitude for reading technological terms than elderly females, who had less technical knowledge; thus, some of the sample words were new to them, causing longer inter-syllabic silences.

Prior studies have asserted that aging can have a significant impact on formant frequencies because the larynx shifts to a lower position; elderly females have significantly lower formant frequencies compared to their younger counterparts (Fisher and Linville, 1985). For the frequency trend analysis of the formants (F1, F2, and F3) of young adults and the elderly, we studied all the vowels located in the first syllable of the words, since an overall averaged formant value does not convey any meaningful information. The frequency distribution of the first vowels is given in Table 1. Based on the results of formant frequency (F1, F2, F3) tracking on each vowel using Praat, a t-test was conducted to determine whether the age of the young and elderly test subjects was an influential factor. Table 2 shows the proportion of the vowels that had a significant p-value. Among the formant frequency categories, F2 showed significance for the largest group of vowels, but even this group comprised less than 50% of all vowels. After conducting a t-test on the vowels "a," "i," and "u," which are the most frequently recurring vowels in the word groups, F2 and F3 displayed significant values (Table 3). In addition, changes in the F1–F2 vowel space as a function of age and gender are shown in Fig. 5. The vowel space boundaries are marked by the average formant frequency values of the three point vowels /a, i, u/ for the four age and gender groups: young males, young females, elderly males, and elderly females. Even though there was a non-negligible difference between the formant frequencies of the elderly and young adults for some vowels, the difference lacked coherence across all vowels, and we therefore decided that it did not have a significant influence on speech recognition.
Table 1
Frequency distribution of the first vowel of each word.

Vowel    Frequency (%)
a        20
i        16
u        12
ay       12
ey       10
o        6
wi       6
wa       4
yu       4
e        2
we       2
wu       2
ye       2
yey      2
Total    100
Table 2
Proportion of significant p-values (p < 0.05) in formant frequencies by gender.

          Female                  Male
          F1     F2     F3        F1     F2     F3
p < 0.05  22%    40%    16%       38%    40%    18%
Table 3
Significance of major vowels for each formant by gender (p-values).

         Male                          Female
Vowel    F1       F2       F3          F1       F2       F3
a        0.816    0.004    0.155       0.174    0.000    0.000
i        0.095    0.000    0.000       0.439    0.000    0.000
u        0.045    0.000    0.000       0.480    0.000    0.000
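The per-vowel comparisons behind Tables 2 and 3 amount to two-sample t-tests on formant measurements; a small illustrative sketch with SciPy is shown below. The numeric arrays are placeholders, not the study's measurements, which are not published per speaker.

```python
import numpy as np
from scipy import stats

# Hypothetical per-speaker F2 values (Hz) for the vowel /a/; placeholder data only.
f2_young_male = np.array([1210.0, 1185.0, 1232.0, 1198.0, 1224.0])
f2_old_male = np.array([1122.0, 1140.0, 1098.0, 1155.0, 1110.0])

# Independent two-sample t-test, the kind of comparison summarized in
# Tables 2 and 3; p < 0.05 is read as a significant age effect.
t_stat, p_value = stats.ttest_ind(f2_young_male, f2_old_male)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```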
The energy ratios of the spectral bands containing F1, F2, and F3 were calculated and analyzed for the young adult and elderly test sound sources. This was done to compare articulation strength on the formants instead of the formant frequency position, under the assumption that the speech of elderly people is less clear than that of young people because of the energy strength in the formant frequency bands. Fig. 6 shows the energy ratios of the formant-adjacent frequency bands exhibited by young adults and the elderly. RF1E is the sum of the energy in the 250–1250 Hz range divided by the total energy, RF2E is the sum of the energy in the 1000–2000 Hz range divided by the total energy, and RF3E is the sum of the energy in the 2200–3200 Hz range divided by the total energy. The analysis of the relative formant frequency band energy ratios demonstrates that a young adult's voice differs from that of an elderly male or female. When the ratios of young adults and the elderly were tested for significance, the RF2E and RF3E comparisons for males fell below the 0.05 significance level, indicating a significant difference (Table 4).

4.2. Elderly speech preprocessing and recognition results

All elderly speech data were preprocessed as necessary according to the preceding analysis. Based on the analysis results shown in Fig. 3, 87.5% of the elderly participants (15 elderly males and 20 elderly females) showed a slower average speech rate than young adults, and their speech rates were therefore increased, with the results shown in Fig. 7. When the average speech rate of the elderly is increased in the course of preprocessing, there is a change in the average speech recognition rate.
Fig. 5. Changes in F1–F2 vowel space as a function of age and gender.
Fig. 6. Relative proportion of formant frequency band energy (%).

Table 4
Significance level of relative formant frequency band energy ratio.

Band   Group          Average   Variance   p (T ≤ t) two-sided
RF1E   Young male     4.989     0.3592     0.1845801
       Old male       5.233     0.2955
       Young female   5.128     1.0692     0.2147417
       Old female     4.810     0.2047
RF2E   Young male     1.918     0.1268     0.0006657
       Old male       1.557     0.0633
       Young female   1.835     0.1480     0.5165394
       Old female     1.756     0.1379
RF3E   Young male     0.817     0.0220     0.0032277
       Old male       0.669     0.0225
       Young female   0.742     0.0351     0.4592736
       Old female     0.701     0.0224
Previously, the average speech recognition rate for elderly males was 76%, and this increased to 77.9%, a 1.9% gain. The average speech recognition rate for elderly females, 75.4%, improved slightly by 1% to 76.4%. The elderly had weaker muscle strength than young adults, which affected their articulation ability. They also had difficulty with the perception and articulation of the target words they were given, which caused unintended inter-syllabic pauses and prolonged pronunciation in uncommon places, further influencing the speech recognition rate as well as the speech rate.
Fig. 7. Average speech recognition accuracy rate after increasing the speech rate.
Fig. 8. Speech recognition accuracy rate (%) of elderly females after removing silence.
The result in Fig. 7 was analyzed in detail because of the smaller than expected improvements. When the speech rate was increased in preprocessing, 169 of the total 1750 words showed gains in speech recognition while 140 words showed declines. Some words showed a reduction in recognition accuracy when they were sped up, and there are two potential causes. The first is that all syllable rates (the speech rate at the syllable level) were sped up by a single overall ratio. At an average syllable rate, syllables are assumed to be evenly distributed across the entire word length, but the actual syllable rates varied substantially from the average. Thus, when the syllable rate was increased by an overall ratio, syllables that had been pronounced more slowly than average showed higher recognition rates, while syllables spoken at an average or fast rate were recognized incorrectly or as the wrong syllable, leading to a failure to recognize the entire word. The second cause is that each syllable of an elderly speaker's slow speech was sped up at a uniform rate. Compared to young adults, the elderly generally spoke more slowly, but when a few words of elderly speech were sped up, they became equal in speed to young adult speech and sometimes even faster. Increasing the speed uniformly across the entire utterance accelerated the speech rate too much, causing some previously recognized words to be missed or recognized as other words. In short, certain syllables in a word might have been spoken abnormally long, causing the speech recognition system to recognize two or three syllables where there was only one, which ultimately led to incorrect recognition of the word. To boost speech recognition accuracy further, it is necessary to selectively adjust the length of abnormally long syllables by finely preprocessing the syllable rate after analyzing the duration of each syllable.

In Fig. 4, elderly females had an inter-syllabic silence length 0.2 s longer than that of the other groups. To check whether the inter-syllabic silence length affected speech recognition, we removed the inter-syllabic silence in preprocessing and ran the speech through the speech recognition system again. Fig. 8 shows the result: the recognition rate for elderly females increased by 4.2% (from 75.4% to 79.6%) when inter-syllabic silence periods were removed.
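As a rough illustration of the EPD-style silence handling described in Section 3, the sketch below drops frames with low short-time energy and low zero-crossing rate. The thresholds are assumptions for illustration; the paper removes inter-syllabic silence specifically, whereas this coarse version removes every frame it classifies as silence.

```python
import numpy as np

def remove_silence(x, fs, frame_ms=25, energy_db=-40.0, zcr_max=0.25):
    """Drop frames classified as silence by short-time energy and ZCR.

    Section 3's EPD uses the same cues (25 ms frame energy plus zero-crossing
    rate); the threshold values here are illustrative assumptions.
    """
    n = int(fs * frame_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    peak = np.max(np.abs(x)) + 1e-12
    kept = []
    for f in frames:
        rms = np.sqrt(np.mean(f ** 2))
        db = 20 * np.log10(rms / peak + 1e-12)      # level relative to signal peak
        zcr = np.mean(np.abs(np.diff(np.sign(f))) > 0)
        # keep voiced frames (enough energy) and low-energy frames with high
        # ZCR (likely fricatives); drop low-energy, low-ZCR frames (silence)
        if db > energy_db or zcr > zcr_max:
            kept.append(f)
    return np.concatenate(kept) if kept else x[:0]
```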
Fig. 9. Speech recognition accuracy rate (%) of elderly males after RF2E and RF3E adjustment.
Comparing the samples with inter-syllabic silence removed against the unmodified samples, the recognition rate of words with more than five syllables increased the most. This demonstrates that current speech recognition technology is likely to have a lower accuracy rate when recognizing long words spoken by the elderly.

Based on Table 4, the significance of the RF2E and RF3E differences between elderly and young males was confirmed. Hence, we attempted to determine the causal connection between the relative formant frequency band energy ratios, especially RF2E and RF3E, and speech recognition accuracy. The elderly males' RF2E and RF3E were adjusted in preprocessing to closely match the young males' ratios. For this adjustment, the elderly males' voice signals were spectrally equalized based on the relative formant-adjacent band energy ratios of young males shown in Fig. 6, so that the elderly males' RF2E and RF3E became similar to those of young males. This entails changing the formant band energy ratios of an elderly male's voice to make them match those of a young male's voice. The adjusted speech then underwent the speech recognition test again. Fig. 9 shows the results of this retest with adjusted F2 and F3 band energy ratios. Before the adjustment, the accuracy rate for elderly males averaged 76%; after the adjustment, it increased to 82%. Analysis of the results showed that the vowel sounds that were unclear compared to a young adult's speech were smoothed out. While the overall accuracy rate for elderly males' voices increased after the formant adjustment, some words showed a decline in speech recognition after the correction. These previously recognized words, which failed to be recognized after the correction, were analyzed, and it was determined that the vowel sounds and other vocal sounds in these words had been corrected too aggressively, causing errors in the speech recognition.

Finally, we tested elderly speech recognition after preprocessing all three factors: speech rate, inter-syllabic silence, and articulation strength on the formants. The result was an 88.4% speech recognition rate for elderly females and 87.6% for elderly males, for a total average accuracy rate of 88%. Hence, the accuracy of elderly speech recognition was improved by 12%, from 76% to 88%, without any modification of the automatic speech recognition system installed on an existing smart phone.

To ensure the reproducibility of the proposed preprocessing, additional experiments were executed with the adaptation values used earlier in this section. First, ten Korean words (2 four-syllable words, 6 five-syllable words, and 2 six-syllable words) that were not used in the earlier experiment were selected and recorded. The recording process used the same experimental conditions and settings as the prior experiment, but the composition of the test subjects was different: five young male and five young female test subjects, along with four elderly male and four elderly female test subjects who had not participated in the prior experiment, were selected. The test results are shown in Fig. 10. The young male and female test subjects had an average speech recognition accuracy rate of 84%, while the elderly male and female test subjects had an average accuracy rate of 59%.
After applying the speech rate increase with the parameters from the prior experiment, the speech recognition accuracy among the elderly test subjects increased to 75%. Eliminating inter-syllabic silence improved their accuracy to 73%, and adjusting the formant band energy improved it to 68%. Combining all three processes and applying them to the elderly test subjects' speech resulted in an 82% speech recognition accuracy rate. After preprocessing, the average speech recognition accuracy among the elderly test subjects was still lower than that of the young test subjects; however, there were significant improvements in accuracy among the elderly subjects after eliminating inter-syllabic silence and boosting the speech rate.
Fig. 10. Speech recognition accuracy rate (%) with additional speech data.
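The band energy ratios (RF1E, RF2E, RF3E) and the spectral equalization used for the F2/F3 boost can be sketched in Python as below. The band limits come from Section 4.1; the whole-utterance FFT equalizer and the gain rule are simplified assumptions, not the paper's exact equalization procedure.

```python
import numpy as np

# Formant-adjacent bands used for RF1E, RF2E, RF3E in Section 4.1 (Hz)
BANDS = {"RF1E": (250, 1250), "RF2E": (1000, 2000), "RF3E": (2200, 3200)}

def band_energy_ratios(x, fs, nfft=4096):
    """Energy in each formant band divided by total spectral energy."""
    spec = np.abs(np.fft.rfft(x, nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    total = spec.sum()
    return {name: spec[(freqs >= lo) & (freqs < hi)].sum() / total
            for name, (lo, hi) in BANDS.items()}

def boost_bands(x, fs, gains):
    """Scale selected bands in the frequency domain and resynthesize.

    `gains` maps a band name to a linear amplitude gain. A global FFT
    equalizer like this is a crude stand-in for the paper's spectral
    equalization step, which matched elderly RF2E/RF3E to young males'.
    """
    nfft = len(x)
    X = np.fft.rfft(x, nfft)
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    for name, g in gains.items():
        lo, hi = BANDS[name]
        X[(freqs >= lo) & (freqs < hi)] *= g
    return np.fft.irfft(X, nfft)
```

For example, raising an elderly male RF2E of about 1.56 toward the young males' 1.92 (Table 4) suggests a linear gain of roughly sqrt(1.92/1.56) ≈ 1.11 over the 1000–2000 Hz band as a first approximation; because rescaling a band also changes the total energy, an exact ratio match would require renormalizing or iterating.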
5. Conclusion

This paper analyzed the speech rate, inter-syllabic silence length, and formants of elderly speech in order to identify the causal factors that lower the speech recognition accuracy rate for the elderly when using a voice interface on a smart device. We concluded that the functional decline of elderly adults' vocal organs causes the elderly to have a slower average speech rate than young adults; elderly females also had notably longer inter-syllabic silences, while elderly males showed no difference in inter-syllabic silence length from young adults. A comparison of the formant frequencies, which correlate with the intelligibility of human speech, showed only a slight difference between the elderly and young adults, which did not sufficiently explain the 20% difference in automatic speech recognition accuracy between the older and younger groups. Instead of looking at the formant frequencies that express vowel sounds, we analyzed the energy of the adjacent bands around the formant frequencies to compare the level of articulation strength in the formants. The results revealed that elderly males have a significant reduction of energy in F2 and F3; after boosting this energy, the recognition rate improved from 76% to 82%. After preprocessing all three factors, a 12% increase in the accuracy of elderly speech recognition was achieved.

We conclude that the results of this experiment can lead to a more robust preprocessing method for elderly speech interface design in smart phones, as well as increased usability of smart phones for the elderly, without any modification of the automatic speech recognition systems installed on existing smart phones. There is a need to further analyze, from multiple perspectives, the correlations of the identified factors that affect speech recognition for the elderly. It is also necessary to explore additional features affecting the intelligibility and articulation of elderly speech that are relevant for robust automatic speech recognition.

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2013R1A1A2008554).

References

Bennett, J.W., Van Lieshout, P.H.H.M., Steele, C.M., 2007. Tongue control for speech and swallowing in healthy younger and older subjects. Int. J. Orofac. Myol. 33, 5–18.
Chu, S.M., Povey, D., 2010. Speaking rate adaptation using continuous frame rate normalization. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4306–4309.
Elenius, D., Blomberg, M., 2010. Dynamic vocal tract length normalization in speech recognition. In: Proceedings from Fonetik, Citeseer, pp. 29–34.
Fisher, H.B., Linville, S.E., 1985. Acoustic characteristics of women's voices with advancing age. J. Gerontol. 40 (3), 324–330.
Han, J.H., 2011. Computer Programming Through the End of the Attack End Speed Control According to the Severity of Disability Words Clarity and Impact of the Acoustic Parameters (Master's thesis). Ewha Womans University.
Harnsberger, J.D., Shrivastav, R., Brown, W.S., Rothman, H., Hollien, H., 2008. Speaking rate and fundamental frequency as speech cues to perceived age. J. Voice 22 (1), 58–69.
Hejna, D., Musicus, B., 1991. The SOLAFS Time-Scale Modification Algorithm. Technical Report, BBN.
Jin, S.M., Kwon, G.H., Kang, H.G., 1997. The steady increase in the age of sound analytical characteristics of the elderly. Korea J. Speech Lang. 8 (1), 44–48.
Kahane, J.C., 1981. Anatomic and physiologic changes in the aging peripheral speech mechanism. In: Aging: Communication Processes and Disorders, pp. 21–45.
Kim, S.H., Ko, D.H., 2008. Fundamental frequencies in Korean elderly speakers. Korean J. Speech Sci. 15 (3), 95–102.
Kim, Y.H., 2003. Geriatric Speech, Plenary Session IV. Yonsei University College of Medicine, Otolaryngology Clinic, pp. 205–207.
Korea National Statistical Office, 2006. The prospective population results. In: Proceedings of the Prospective Population Statistics, pp. 2–5.
Kwon, S., 2011. Focused word spotting in spoken Korean based on fundamental frequency. IEICE Electron. Express 8 (14), 1149–1154.
Kwon, S., 2012. Voice-driven sound effect manipulation. Int. J. Hum.–Comput. Interact. 28 (6), 373–382.
Kwon, S., Choeh, J.-Y., Lee, J.-W., 2013. User-personality classification based on the non-verbal cues from spoken conversations. Int. J. Comput. Intell. Syst. 6 (4), 739–749.
Lee, L., Rose, R., 1998. A frequency warping approach to speaker normalization. IEEE Trans. Speech Audio Process. 6, 49–60.
Lee, S.Y., 2011. The Overall Speaking Rate and Articulation Rate of Normal Elderly People (Master's thesis). Graduate Program in Speech and Language Pathology, Yonsei University.
Manning, W.H., Monte, K.L., 1981. Fluency breaks in older speakers: implications for a model of stuttering throughout the life cycle. J. Fluen. Disord. 6, 35–48.
Potamianos, A., Narayanan, S., 2003. Robust recognition of children's speech. IEEE Trans. Speech Audio Process. 11 (6), 603–616.
Ptacek, P.H., Sander, E.K., 1966. Age recognition from voice. J. Speech Lang. Hear. Res. 9, 273–277.
Pyo, H.Y., Shim, H.S., 2005. Paralytic disorder words (dysarthria) for improving the clarity of research trends: a literature review. Spec. Educ. 4 (1), 35–50.
Richardson, M., Hwang, M., Acero, A., Huang, X., 1999. Improvements on speech recognition for fast talkers. In: EUROSPEECH, pp. 411–414.
Ryan, W.J., 1972. Acoustic aspects of the aging voice. J. Gerontol. 27 (2), 265–268.
Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Kamvar, M., Strope, B., 2010. Google Search by Voice: A Case Study. In: Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics. Springer, pp. 61–90.
Siegler, M.A., Stem, R.M., 1995. On the effects of speech rate in large vocabulary speech recognition systems. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 612–615.
Sonies, B.C., 1987. Oral-motor problems. In: Communication Disorders in Aging: Assessment and Management. Gallaudet University Press, Washington, pp. 185–213.
Wilpon, J.G., Jacobsen, C.N., 1996. A study of speech recognition for children and the elderly. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 349–352.
Wood, S., 2003. Beginners Guide to Praat. Lund University, Dept. of Linguistics and Phonetics, Centre for Language and Literature.
Zheng, J., Franco, H., Stolcke, A., 2000. Rate-of-speech modeling for large vocabulary conversational speech recognition. In: ISCA Tutorial and Research Workshop (ITRW) on Automatic Speech Recognition: Challenges for the New Millennium, pp. 145–159.