Native vs. non-native accent identification using Japanese spoken telephone numbers

Kanae Amino, Takashi Osanai

National Research Institute of Police Science, 6-3-1 Kashiwanoha, Kashiwa, Chiba 277-0882, Japan

Speech Communication 56 (2014) 70–81

Received 25 December 2012; received in revised form 26 July 2013; accepted 29 July 2013; available online 11 August 2013
Abstract

In forensic investigations, it would be helpful to be able to identify a speaker's native language from the sound of his or her speech. Previous research on foreign accent identification has suggested that identification accuracy can be improved by using linguistic forms in which non-native characteristics are reflected. This study investigates how native and non-native speakers of Japanese differ in reading Japanese telephone numbers, which have a specific prosodic structure called a bipodic template. Spoken Japanese telephone numbers were recorded from native speakers and from Chinese and Korean learners of Japanese. Twelve utterances were obtained from each speaker, and their F0 contours were compared between native and non-native speakers. All native speakers realised the prosodic pattern of the bipodic template while reading the telephone numbers, whereas the non-native speakers did not. The metric rhythm and segmental properties of the speech samples were also analysed, and a foreign accent identification experiment was carried out using six acoustic features. By applying a logistic regression analysis, this method yielded an 81.8% correct identification rate, which is slightly better than that achieved in other studies. Discrimination accuracy between native and non-native accents was better than 90%, although discrimination between the two non-native accents was not as successful. A perceptual accent identification experiment was also conducted in order to compare automatic and human identification. The results revealed that human listeners discriminated between native and non-native speakers better, while they were inferior at identifying the foreign accents.

© 2013 Elsevier B.V. All rights reserved.

Keywords: Foreign accent identification; Non-native speech; Spoken telephone numbers; Prosody; Forensic speech science
1. Introduction

Globalisation has provided us with more opportunities to communicate with people from all around the world, and thus more chances to hear foreign accents. The investigation of foreign accents is important not only for second language (L2) acquisition research and language teaching, but also for technologies such as speech recognition, speaker recognition, and accent identification. The term “accent” can be defined as the speech properties that indicate which country, or which part of a country, the speaker originates from. Accent identification is commonly used to
identify a speaker’s mother dialect (D1) by using speech samples spoken in D1 or other dialects (D2). For foreign accent identification, a speaker’s first language (L1) is identified using speech in L2 or a later language. Applications of accent identification include preprocessing for automatic speech recognition and language support for L2 speakers. The performance of a speech recognition system can be improved by applying accent identification in advance and then using a dialect or language model in which the accent colour is taken into consideration (e.g., Brousseau and Fox, 1992; Blackburn et al., 1993; Arslan and Hansen, 1996; Fung and Kat, 1999). This is also useful for assisting L2 speakers when call routing is needed for emergency operators or in multi-lingual voice-controlled information retrieval systems (Muthusamy et al., 1994; Zissman, 1996). Furthermore, in forensic situations, when there is
a possibility that the obtained speech samples were spoken by a D2/L2 speaker, identifying the speaker's accent, and consequently his or her nationality and/or hometown, can often lead to important clues with regard to the suspect.

A speech technology similar to accent identification is language identification. However, compared to language identification, in which the language spoken by a native speaker is identified, accent identification is considered to be a more challenging task. One reason is that the traits of a speaker's D1/L1 are carried into D2/L2 speech in various ways. These traits, often called language transfers, may appear on the segmental level, for instance, as the substitution of unfamiliar phonemes with similar sounds from the D1/L1, or on the supra-segmental (prosodic) level, e.g., as erroneous word accents, clumsy rhythm, and inappropriate intonation. What makes accent identification more difficult is the fact that language transfer is not unique to one target dialect/language or speaker, but depends on the speaker's D1/L1, the language-typological distance between D1/L1 and D2/L2, and various individual factors. For example, different phonemic inventories and phonotactics will bring about different articulatory errors, and different accentuation systems will cause different prosodic problems. Also, the degree of language transfer is reported to depend on each speaker's age of learning (or age of arrival), amount of exposure and interactive contact with native speakers (e.g., Flege, 1988; Flege and Fletcher, 1992), experience of learning other foreign dialects or languages (Mehlhorn, 2007; Wrembel, 2009), and the individual's language talent (Markham, 1999); there are also several reports that disclaim the effects of the former two factors (Mackay and Fullana, 2009; Fullana and Mora, 2009).

Previous research on accent identification can be classified into three groups: that based on segmental and articulatory features (Arslan and Hansen, 1996; Kumpf and King, 1996; Teixeira et al., 1996; Berkling et al., 1998; Yanguas et al., 1998), that based on prosodic features (Itahashi and Yamashita, 1992; Itahashi and Tanaka, 1993; Hansen and Arslan, 1995; Mixdorff, 1996; Piat et al., 2008), and that based on both (Piat et al., 2008; Arslan and Hansen, 1997; Vieru-Dimulescu et al., 2007).

Kumpf and King (1996) identified three accents of Australian English: Lebanese, Vietnamese, and native. They used a system based on a hidden Markov model (HMM) trained on 2000 sentences recorded from 16 speakers, and identified more than 50 utterances produced by 63 speakers using 12th-order mel-frequency cepstral coefficients (MFCC), log energy, and the deltas of both as the acoustic features. Their system achieved 85.3% correct pair-wise identification on average and 76.6% correct identification of the three accents. Similarly, Teixeira et al. (1996) identified six accents of English (Portuguese, Danish, German, British, Spanish, and Italian) using an HMM-based system. They used a speech corpus that contained 200 English isolated words, and calculated linear predictive coding (LPC) cepstra and their deltas as the acoustic features. Their system obtained
a 65.5% correct identification rate. An example of using prosodic cues was described by Itahashi and Tanaka (1993). They analysed a Japanese passage read by speakers of 14 regional dialects, and extracted 19 acoustic parameters related to F0. A principal component analysis was performed on these 19 parameters, and the results showed that the 14 dialects could be classified into six groups that approximately corresponded to the regions the dialects belonged to. Finally, Piat et al. (2008) carried out a study on the identification of four accents (French, Italian, Greek, and Spanish) of English. They compared the identification performance of their HMM-based system using 1-dimensional duration, 3-dimensional energy, 36-dimensional MFCC, and other prosodic features. The results showed that the MFCC yielded the highest identification rate of 82.9%, whereas duration and energy yielded rates of 67.1% and 68.6%, respectively. They thus concluded that MFCC provided a superior identification rate, although the computational cost was higher.

It is not easy to compare the identification results of the above studies, as they used different speech corpora and different comparison methods; however, these previous studies indicate that accent identification performance improves when linguistic knowledge of the target languages is used effectively. This can be, for example, knowledge of linguistic forms in which non-native speakers saliently differ from native speakers, or knowledge of how to detect these linguistic forms in running speech.

Blackburn et al. (1993) suggested a method for classifying non-native English accents using features related to phonological differences between the accents. They exploited knowledge of the segmental differences among Arabic-accented, Mandarin-accented, and Australian (native) English, and extracted features such as the phoneme durations of the sibilants, the voice onset times of the plosives, and the formant frequencies of the vowels. With their system, which was based on a neural network, 96% of Australian English, 35% of Arabic, and 62% of Mandarin male speech were correctly identified using voiced segments. Cleirigh and Vonwiller (1994) developed a phonological model of Australian English that included information on English syllable structure and the distribution of phonemes within a syllable. Berkling et al. (1998) applied this model to the identification of Vietnamese-accented and Lebanese-accented Australian English. They conducted two accent identification experiments, one using the linguistic model and the other not. When they incorporated the linguistic model into their system, the performance improved by 6–7% (84% for the English–Lebanese pair and 93% for the English–Vietnamese pair) compared to the system without the model (78% for the English–Lebanese pair and 86% for the English–Vietnamese pair). Zissman (1996) built a speech corpus for testing accent identification (using the term “dialect identification”) systems for conversational Latin-American Spanish. He also built an accent identification system using HMM-based phoneme recognition. By applying N-gram language modelling, his system
achieved an 84% correct discrimination rate between Cuban and Peruvian Spanish. Subsequently, Yanguas et al. (1998) used the same speech corpus and explored accent identification systems that exploit linguistic knowledge of variations in phoneme realisations, concentrating on reducing the length of the speech samples needed for identification. By using the duration and energy of the fricative /s/ taken from read digits, their system achieved 72% accuracy in discriminating between Cuban and Peruvian Spanish.

In the present article, another example of using linguistic knowledge is introduced, for identifying foreign accents of Japanese. Our method successfully discriminated between Japanese native and non-native accents, although it still needs improvement in discriminating among non-native accents. The study itself is motivated by the fact that little research has been conducted on foreign accent identification in a forensic context, although it has been noted that foreign accents have a detrimental effect on speaker identification tasks (Tate, 1979; reviewed in Hollien, 2002). It is important for forensic speech investigators to determine whether the target speech samples contain any accents, and if they do, what type of accents they are. What is important here is that the existence of two different accents in the speech samples strongly suggests that the samples come from two different people (Hollien, 2002; Rogers, 1998); forensic practitioners must be able to show objective, scientific evidence that the accent spoken by the perpetrator is the same as that spoken by the suspect before they can state that the two speech samples, the perpetrator's and the suspect's, were produced by the same speaker. Also, accents, whether regional or foreign, can provide speaker profile information, and forensic practitioners often use accents heard in speech samples as part of their assessment of the speaker's identity (Kulshreshtha et al., 2012). Given these circumstances, research on accent identification is critical for forensic speech investigations, and the construction of a reliable accent identification system is of great value to those who work in this field.

Regarding the selection of acoustic features for accent identification in forensic casework, we often encounter situations where we are constrained to use prosodic features because of the rather poor quality of the speech data; generally speaking, prosodic features are more robust against background noise and transmission characteristics. Sometimes we are also able to extract segment-related frequency features. In the present article, we mainly analysed prosodic characteristics, although we also took advantage of some segmental characteristics.

The experiments focus on spoken telephone numbers (hereafter STN). Some studies have pointed out that language-dependent structures are highlighted in STN (Baumann and Trouvain, 2001; Katagiri, 2008). In Japanese, too, there is a specific prosodic pattern for STN. In the Japanese language education field, teachers do not spend enough time on pronunciation. According to Toki's (2010) survey of Japanese textbooks, all 14 textbooks in his study covered the pronunciation of the vowels and consonants, the vowel duration contrast, and the syllabic consonants, and most of them also covered word accent and vowel devoicing; however, only a few of them contained information on intonation or rhythm. This implies that non-native Japanese speakers are not familiar with the prosody of Japanese STN; thus, STN may give rise to large differences between native and non-native speech.

Most Japanese regional dialects employ two-level (high and low) pitch accents with mora-timed rhythm (Vance, 2008). Each mora in a word is inherently associated with a specific pitch. In addition, Japanese has bimoraic metrical feet, and the bimoraic foot plays an important role in accounting for various prosodic phenomena. Linguistic forms such as reduplicated mimetics (/kiɾakiɾa/ “glitter”), clipped words (/ɾimokoN/ “remote control”), and STN all have a prosodic pattern based on bimoraic feet. In these linguistic forms, two bimoraic feet, i.e., four morae, are grouped together to produce a prosodic structure called a bipodic template (BT) (Poser, 1990; Tsujimura, 1996; Nasu, 2001).

There are certain rules for reading Japanese STN. Numerical digits in Japanese are either one- or two-moraic, as shown in Table 1; however, once subsumed into STN, the one-moraic digits (/ni/ “2” and /go/ “5”) are read with an elongated vowel (/ni+/ and /go+/), which means that all Japanese digits are read as two-moraic in STN. Additionally, digits in Japanese STN are read in isolation, unlike in English and other European languages (e.g., “11” is read as “one one” and not as “double one,” and “62” is read as “six two” and not as “sixty-two”). Every two digits from the beginning of the STN are phonologically grouped together to compose one BT. Accentuation occurs for every BT; accordingly, one accentual peak appears every two digits, i.e., every four morae. Three-digit numbers are read as two–one digit combinations: the first two digits are read in one BT, and the remaining digit is read in another BT (Nasu, 2001). A sketch of these reading rules is given after Table 1.

Table 1
Inherent pitch-accent patterns for digits spoken in the Tokyo dialect of Japanese (Kindaichi and Akinaga, 2001). The diacritic “.” in the IPA transcriptions represents moraic boundaries; H and L represent high and low pitch for each mora, respectively. In STN, the forms marked with an asterisk are the more commonly used ones for 0, 4, and 7.

Digit   Segmentals                 Accent pattern
0       /ze.ɾo/* or /ɾe.+/         HL, HL
1       /i.tɕi/                    LH
2       /ni/                       H
3       /sa.N/                     Non-accented
4       /jo.N/* or /ɕi/            HL, H
5       /go/                       H
6       /ɾo.kɯ/                    LH
7       /na.na/* or /ɕi.tɕi/       HL, LH
8       /ha.tɕi/                   LH
9       /kjɯ.+/                    HL
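To make the reading rules concrete, the following short Python sketch (our own illustration, not part of the original study) elongates the one-moraic digits and groups the digits of each number segment into BTs; the readings are romanised versions of the common STN forms in Table 1.

```python
# Illustrative only: romanised common STN readings of the digits (cf. Table 1).
# "+" marks the elongated vowel added to the one-moraic digits /ni/ and /go/.
DIGIT_READINGS = {
    "0": "ze.ro", "1": "i.chi", "2": "ni.+", "3": "sa.N", "4": "yo.N",
    "5": "go.+", "6": "ro.ku", "7": "na.na", "8": "ha.chi", "9": "kyu.+",
}


def bipodic_templates(stn: str) -> list:
    """Group the digits of an STN such as "053-574-0182" into BTs.

    Within each hyphen-separated segment, every two digits compose one BT
    (two bimoraic feet, carrying one accentual peak); a three-digit segment
    is read as a 2 + 1 combination, so its leftover digit forms a BT alone.
    """
    templates = []
    for segment in stn.split("-"):
        for i in range(0, len(segment), 2):
            templates.append([DIGIT_READINGS[d] for d in segment[i:i + 2]])
    return templates


print(bipodic_templates("053-574-0182"))
# [['ze.ro', 'go.+'], ['sa.N'], ['go.+', 'na.na'], ['yo.N'],
#  ['ze.ro', 'i.chi'], ['ha.chi', 'ni.+']]
```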
The purpose of the present study was, first of all, to show the differences in STN prosody realisation between
native and non-native Japanese speakers, both quantitatively and qualitatively. Pitch contours were analysed for Japanese STN produced by speakers with three accents: Chinese, Korean, and native Japanese. The results revealed that all of the native Japanese speakers realised the prosodic pattern of the BT structure, whereas the non-native speakers showed different prosodic patterns. The second objective of this study was to examine the extent to which these differences can be used for identifying foreign accents. An accent identification experiment was conducted in order to assess the usefulness of extracted acoustic features related to the prosodic pattern of Japanese STN and the frequency properties of certain segments. A small experiment on the performance of human listeners in identifying non-native accents of Japanese was also conducted.

2. Pitch patterns of Japanese spoken telephone numbers

2.1. Speech materials

Native Japanese, Chinese, and Korean speakers participated in the recording sessions, which lasted between 60 and 90 min. The six telephone numbers shown in Table 2 were recorded twice by each participant in an anechoic room at the National Research Institute of Police Science, Chiba, Japan.

Twenty-six (14 female and 12 male) native Japanese speakers with a mean age of 24.3 years participated in the recording sessions. These speakers came from various regions of Japan, including the Tohoku, Kanto, Kansai, and Kyushu districts. As non-native Japanese speakers, eighteen (10 female and 8 male) Chinese and nine (7 female and 2 male) Korean speakers, with mean ages of 26.1 and 24.0 years, respectively, were recorded in the same anechoic room as the native speakers. They were mainly students living in the Kanto area of Japan and studying at universities or colleges in Tokyo, Chiba, and Ibaraki Prefectures. They had learned Japanese for 3.9 years on average, most of them at a language school or at university. Their Japanese proficiency ranged from upper-intermediate to advanced; almost all of them had passed Level 2 (or Level N2, after the latest revision in 2010) of the Japanese-Language Proficiency Test, and some had passed Level 1 (N1, after 2010). Most of the Chinese speakers of Japanese came from Beijing or Shanghai and spoke Mandarin or Wu Chinese as their native dialects. Among the Korean speakers of Japanese, five were from Busan, three were from Seoul, one was from Jinju, and one was from Daejeon. The mean age of arrival was 24.2 years for the Chinese speakers and 21.1 years for the Korean speakers. Apart from the above speakers, some balanced and Japanese-dominant bilinguals were recorded, but their data were excluded from the present analysis.

Speech materials were recorded simultaneously using a PCM recorder (Marantz, PMD671) through a condenser microphone (SONY, ECM-23F5) and a telephone (Nitsuko, 2002 K).
The data were digitised at a sampling frequency of 44.1 kHz with 16-bit quantisation. The speech data used in the acoustic analysis in this study were those recorded through the telephone, in order to simulate a more realistic forensic case; they were downsampled to 8 kHz before the analysis. The total number of recorded speech samples was 312 for the natives (26 speakers and 6 STN, repeated twice), and 216 (18 speakers and 6 STN, repeated twice) and 120 (10 speakers and 6 STN, repeated twice) for the Chinese and Korean speakers, respectively.

Table 2
List of recorded telephone numbers.

Numbering   ID   Numbers         # Syllables (# Morae)
3-3-4       N1   053 574 0182    15 (20)
3-3-4       N2   097 993 0312    14 (20)
3-4-4       N3   090 0978 8135   18 (22)
3-4-4       N4   080 2912 6830   18 (22)
2-4-4       N5   03 3736 2319    14 (20)
2-4-4       N6   06 6715 1362    17 (20)

2.2. Procedure

Pitch patterns for the STN were analysed using the Praat software (Boersma, 2001). First, F0 was calculated every 10 ms using an auto-correlation method. In order to allow a good comparison across different speakers and genders, F0 in hertz (f [Hz]) was converted into semitones (F [st]) using the following formula:

    F = 12 log2(f / f_ave),    (1)

where f_ave [Hz] is the average F0 of the target utterance. After this conversion, and by using the Manipulation–Stylise commands in Praat, the temporal properties of the F0 contours were normalised by decreasing the number of pitch analysis points and taking two representative points per syllable. Thus, we obtained F0 vectors for the STN utterances whose lengths were twice the number of syllables.

In order to quantitatively evaluate the differences in F0 contours between native and non-native speakers, a comparison using the cosine similarity was conducted. The cosine similarity S(x, y) of two feature vectors x = {x1, x2, ..., xn} and y = {y1, y2, ..., yn} is derived by calculating the cosine of the angle between x and y (Manning and Schuetze, 1999); the resulting values range from -1, meaning exactly opposite, to 1, meaning exactly the same:

    S(x, y) = cos θ = (x · y) / (||x|| ||y||).    (2)

The average F0 vector for the native Japanese speakers was first calculated as the vector x, and the cosine similarity between x and each utterance was then calculated. When the vowel in a syllable was devoiced or creaky and F0 could not be measured, those particular analysis points were
omitted for all utterances and the number of feature dimensions was decreased.
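As a concrete illustration of Eqs. (1) and (2), the following Python/NumPy sketch (our own illustration; the paper's analysis was done in Praat) converts an F0 contour to semitones relative to the utterance mean and computes the cosine similarity between two stylised F0 vectors. The contour values are dummy placeholders.

```python
import numpy as np


def to_semitones(f0_hz: np.ndarray) -> np.ndarray:
    """Convert an F0 contour from Hz to semitones re the utterance mean (Eq. (1))."""
    return 12.0 * np.log2(f0_hz / np.mean(f0_hz))


def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two stylised F0 vectors (Eq. (2)); in [-1, 1]."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Dummy contours standing in for the two-points-per-syllable stylised vectors.
native_average = to_semitones(np.array([220.0, 260.0, 240.0, 200.0]))
utterance = to_semitones(np.array([210.0, 250.0, 245.0, 190.0]))
print(cosine_similarity(native_average, utterance))
```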
2.3. Results and discussion

Fig. 1 shows the average F0 contours for the native Japanese speakers, Chinese speakers of Japanese, and Korean speakers of Japanese reading three Japanese STN with different digits.

Fig. 1. Average F0 patterns for Japanese STN produced by 18 native Japanese, 15 Chinese, and 9 Korean speakers: (a) 3-3-4 scheme (053 574 0182), (b) 3-4-4 scheme (090 0978 8135), and (c) 2-4-4 scheme (03 3736 2319).

The average contours for the native Japanese speakers exhibit one F0 peak per two digits, reflecting the prosodic structure of the BT. The native speakers also showed a BT structure for the 3-digit area codes, reading them as 2–1 digit combinations. On the other hand, the F0 contours for the Chinese and Korean speakers of Japanese deviated from the prosodic pattern of the BT. As can be seen in Fig. 1, some speakers appear to have read the numbers with the digits' inherent pitch-accent patterns, shown in Table 1. Examining the contours for the non-native speakers, those of the Chinese speakers exhibited more peaks than those of the native speakers, and those of the Korean speakers showed fewer fluctuations. Both the Chinese and the Korean speakers showed a smaller F0 range than the native speakers.

Hirano et al. (2006a, 2006b) analysed and modelled the pitch patterns of Japanese sentences uttered by native speakers and Chinese speakers of Japanese. They pointed out that the Chinese speakers' utterances had more accentuation commands, shorter prosodic phrases, and a smaller range of maximum F0 than the native speakers' utterances. Min (1996) and Utsugi (2004) investigated Korean speakers' phonetic realisation of Japanese intonation, and both studies indicated that Korean speech showed few downsteps. These reports are also corroborated by the results of the present study.

Another difference between native and non-native speakers is that the native speakers tended to start their utterances at about 0 [st], whereas most non-native speakers started their utterances at a higher F0. The average utterance-initial F0 [st] was 1.16 (S.D. = 1.29) for the native Japanese speakers, and 2.44 (S.D. = 2.74) and 1.78 (S.D. = 1.57) for the Chinese and Korean speakers of Japanese, respectively. In our analysis, 0 [st] represents the average F0 [Hz] of the entire utterance; it is currently unclear why native Japanese speakers start their STN utterances at this average F0.

Fig. 2 shows the distributions of the cosine similarity values for the three speaker groups. It can be seen that the native speakers' similarity values converged at around 0.9 and the within-group variation was small, whereas the non-native speakers' similarity values were lower and varied within each group. The average similarity was 0.86 for the native speakers (S.D. = 0.07), 0.60 for the Chinese speakers (S.D. = 0.18), and 0.66 for the Korean speakers (S.D. = 0.15). The within-group variation of the non-native speakers seems to have derived from linguistic and learning factors rather than from individual differences. Statistically significant effects were observed for the telephone number
being spoken and for the length of time that the non-native speakers had spent learning Japanese. The average cosine similarity for each telephone number is shown in Fig. 3. The results of a two-way ANOVA (split-plot design with varying sample sizes for the speaker factor) showed that the main effects of the telephone number (F(5, 525) = 44.39) and the speaker group (F(2, 105) = 69.10), as well as their interaction (F(10, 525) = 5.93), were all significant (p < .01). Post-hoc tests via the Bonferroni method revealed that N1 and N2 gave significantly lower similarity values than the other four telephone numbers (p < .05). Also, the similarity values for the three speaker groups differed significantly (p < .01) for N1 and N2. For N2, N3, N5, and N6, the difference between the Chinese and Korean speakers of Japanese was not significant. A slight positive correlation between the similarity values and the
length of time learning Japanese was significant for both the Chinese (r = .21, p < .01) and the Korean (r = .46, p < .01) speakers.

When the speakers were recruited, no attempt was made to control their dialectal background. The analysis results showed no consistent trend with regard to regional dialects for either Chinese or Korean speech, although there was a slight difference due to dialects among the native Japanese speakers. Speakers of the Kinki dialect, which is spoken in the western part of Japan, including Kyoto and Osaka, showed different accentuation patterns than speakers of other dialects. The Kinki dialect has high and low register tones as well as pitch accents (Saito, 2009; Joh et al., 2011). All of the Kinki speakers started each of the number segments (area code, city area code, or number) with a low tone, which is in accordance with BT prosody and at the same time unmarked in Japanese phonology. When the first digit in a number segment started with a heavy syllable (i.e., 2, 3, 4, 5, or 9), the speakers with strong accents read it with its inherent tone. Statistically, there was no significant difference in cosine similarity values between speakers of the Kinki dialect and those of other Japanese dialects.

Fig. 2. Percentage histogram of cosine similarity values for the three speaker groups (Japanese, Chinese, and Korean).

Fig. 3. Average cosine similarity for each telephone number and each speaker group (Japanese, Chinese, and Korean).

An automatic accent identification experiment was next carried out using the cosine similarity of the F0 pattern for STN combined with other acoustic features. An experiment was also performed to assess the ability of human listeners to identify accents.

3. Experiment 1: automatic accent identification

3.1. Speech materials
The same speech materials were used as in the above analysis. Each speaker uttered the six telephone numbers two times; the total number of speech samples was, again, 312 for the native Japanese speakers, 216 for the Chinese speakers of Japanese, and 108 for the Korean speakers of Japanese.

3.2. Acoustic features

Six acoustic features were extracted and used in the experiments: one feature related to F0, two features related to speech rhythm, and three features related to the frequency properties of certain segments. As the F0-related feature, the cosine similarity between each utterance's F0 pattern and the native speakers' average pattern was used. This feature, S, is defined and calculated by the method described in the preceding section.

The rhythm structure is also known to reflect cross-language differences. Ramus et al. (1999) proposed using the within-utterance standard deviations of vowel and consonant durations (ΔV and ΔC, respectively) and the proportion of vowel duration to the total utterance duration (%V) for classifying the three types of language rhythm: syllable-timed, stress-timed, and mora-timed. They suggested that the best measure of rhythmic diversity is provided by ΔC and %V. Among the three languages considered in this study, Japanese has a mora-timed rhythm (Bloch, 1950), while Chinese and Korean both have a syllable-timed rhythm (Lee and Jang, 2004; Lee and Kim, 2005; Lin and Wang, 2007; Mok and Dellwo, 2008). A pre-test was carried out to evaluate the language-discriminating power of ΔV, ΔC, and %V, and since ΔV and ΔC were found to exhibit better performance with our L2 Japanese data, these two were used in the identification experiments. The durations of vowels and consonants were measured manually using Praat annotations.
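As a sketch of these rhythm measures (our own illustration, assuming the vocalic and consonantal interval durations have already been read out of the Praat annotations), ΔV, ΔC, and %V can be computed as follows; the interval list at the bottom holds dummy values.

```python
import numpy as np


def rhythm_metrics(intervals):
    """Ramus et al. (1999) measures from labelled ("V"/"C", duration_s) pairs.

    deltaV and deltaC are the within-utterance standard deviations of the
    vocalic and consonantal durations; %V is the proportion of vocalic
    duration in the total utterance duration.
    """
    v = np.array([d for label, d in intervals if label == "V"])
    c = np.array([d for label, d in intervals if label == "C"])
    return {
        "deltaV": float(np.std(v)),
        "deltaC": float(np.std(c)),
        "%V": float(100.0 * v.sum() / (v.sum() + c.sum())),
    }


print(rhythm_metrics([("C", 0.08), ("V", 0.12), ("C", 0.06), ("V", 0.10)]))
```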
The frequency properties of the two segments /z/ and /o/ were considered. All of the telephone numbers started with “0,” which is read as /zeɾo/ in Japanese. The word-initial fricative /z/ is typically, though not contrastively, produced as an affricate [dz] in Japanese (Tsujimura, 1996; Nasu, 2001). Non-native speakers may not notice this tendency and may pronounce /z/ as [z], which is a minor alternative for Japanese natives. In particular, it is common for Korean speakers to pronounce Japanese (and other foreign languages') /z/ as [(d)ʑ] or its devoiced counterpart (Matsuzaki, 1999; Ito et al., 2006).

The Japanese mid back vowel /o/ is lower than the seventh cardinal vowel and higher than the eighth, and is produced with weak lip-rounding (Vance, 2008). Toki (2010) cautions learners of Japanese not to make the lip-rounding too strong. Chinese has two higher-mid back vowels, rounded /o/ and unrounded /ɤ/, while Korean has two rounded mid back vowels, higher-mid /o/ and lower-mid /ɔ/ (Kim, 1968; Koshimizu, 1998; Noma, 1998). Non-native speakers may substitute these vowels when they speak Japanese; therefore, non-native speakers' utterances may show spectral properties different from those of native speakers. For this reason, /z/ and /o/ were manually extracted from the utterances and their 12th-order FFT cepstral coefficients were calculated. The Euclidean distance from the native speakers' average coefficients was then calculated for each speech sample. The cepstral distances of /z/ (Dz), /o/ (Do), and /z/ and /o/ together (Dzo) were used as acoustic features in the experiment.
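A minimal sketch of these segmental features, assuming a segment has already been excised from the utterance; the exact analysis settings (frame length, windowing) are not specified in the paper, so the choices below are illustrative only.

```python
import numpy as np


def fft_cepstrum(segment: np.ndarray, order: int = 12) -> np.ndarray:
    """First `order` real (FFT-based) cepstral coefficients of a segment."""
    spectrum = np.abs(np.fft.rfft(segment * np.hamming(len(segment))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
    return cepstrum[1:order + 1]


def cepstral_distance(segment: np.ndarray, native_mean: np.ndarray) -> float:
    """Euclidean distance from the native speakers' average coefficients,
    i.e. the quantity behind Dz, Do, and Dzo."""
    return float(np.linalg.norm(fft_cepstrum(segment) - native_mean))


rng = np.random.default_rng(0)
seg = rng.standard_normal(400)        # dummy 50-ms segment at 8 kHz
print(cepstral_distance(seg, np.zeros(12)))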
3.3. Procedure

Using the six acoustic features described above, L1 identification experiments were conducted. Each speaker uttered each telephone number twice. In forensic casework, reference samples are recorded several times, when possible, in order to take within-speaker variation into consideration; therefore, acoustic features averaged over the two speech samples were used as the input. Multinomial logistic regression analysis with tenfold cross-validation was applied to all possible combinations of the acoustic features using the Weka data mining toolkit (Hall et al., 2009), and the results were evaluated using the percentage of correct accent identifications.
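The paper ran this analysis in Weka; the following scikit-learn sketch (our own stand-in, with dummy data in place of the measured features) shows the same kind of multinomial logistic regression with tenfold cross-validation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# One row per speaker and telephone number (features averaged over the two
# repetitions); columns = [S, deltaC, deltaV, Dz, Do, Dzo]. Dummy values here.
X = rng.random((318, 6))
y = rng.choice(["Japanese", "Chinese", "Korean"], size=318)

clf = LogisticRegression(max_iter=1000)   # multinomial for the 3-class problem
predictions = cross_val_predict(clf, X, y, cv=10)  # tenfold cross-validation
print(accuracy_score(y, predictions))
print(confusion_matrix(y, predictions, labels=["Japanese", "Chinese", "Korean"]))
```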
3.4. Results and discussion

The results of the acoustic analysis and the identification experiment are summarised in Tables 3–6. In Table 3, it can be seen that the differences between native and non-native speakers were greater for S and Dzo than for the other features. Further inspection of the results revealed that the odds ratios of S and Dzo were significantly large for the dummy variables for Japanese speakers and for Chinese speakers of Japanese, respectively, which means that S was the main factor for discriminating native Japanese speakers from non-native speakers, and Dzo for discriminating between Chinese and Korean speakers.

Table 3
Mean and standard deviation of each acoustic feature for the three speaker groups: Japanese (N = 312), Chinese (N = 216), and Korean (N = 120).

Feature   Japanese mean (S.D.)   Chinese mean (S.D.)   Korean mean (S.D.)
S         0.86 (0.07)            0.60 (0.18)           0.68 (0.15)
ΔC        0.05 (0.01)            0.06 (0.01)           0.06 (0.01)
ΔV        0.05 (0.01)            0.06 (0.03)           0.07 (0.02)
Dz        0.99 (0.76)            1.96 (1.61)           0.75 (0.75)
Do        1.47 (0.97)            2.16 (2.01)           1.92 (1.44)
Dzo       1.56 (0.37)            2.59 (0.73)           2.09 (0.48)

In Table 4, the percentage of correct identifications for each of the six acoustic features is shown as a baseline, and the feature combinations that yielded the best performance are given as the main results. All six acoustic features, when used alone, yielded identification performance better than chance level. The two best features were, again, S and Dzo for the accent combinations involving Japanese: the three accents, Japanese–Chinese, and Japanese–Korean. For the Chinese–Korean combination, Dz and S were the two best acoustic features. The best identification rate for the three accents was 81.8%, obtained when all of the acoustic features were used.

Table 4
Results of the foreign accent identification experiments using multinomial logistic regression analysis with tenfold cross-validation. The percentage of correct identifications is shown for each acoustic feature (baseline), together with the best identification rate and the feature combination that produced it.

                      Correct responses [%]
                      Three accents   Japanese–Chinese   Japanese–Korean   Chinese–Korean
S (baseline)          72.0            86.4               85.7              69.8
ΔC (baseline)         49.7            62.5               75.7              64.8
ΔV (baseline)         49.4            61.0               75.7              65.4
Dz (baseline)         55.4            68.6               74.3              70.4
Do (baseline)         54.1            63.6               75.2              66.7
Dzo (baseline)        71.4            86.4               80.0              69.1
Best performance      81.8            95.8               91.9              79.6
Feature combination   All             S, ΔV, and Dzo     All               S, Dz, Do, and Dzo

Table 5 shows the confusion matrix for the identification of the three accents averaged across the STN. It can be seen that the Korean accent had the worst identification rate. The Japanese–Korean confusion rate was considerably high (20.4%) compared with that for Japanese–Chinese (6.5%). Referring back to Table 3, the Korean speakers showed feature values more similar to those of the native Japanese speakers than the Chinese speakers did, which may have caused the higher confusion rate between the Japanese and Korean accents.

Table 5
Confusion matrix for the identification of the three accents using the six acoustic features, averaged across STN.

           Responses [%]
Stimuli    Japanese   Chinese   Korean
Japanese   94.9       1.9       3.2
Chinese    6.5        80.6      12.9
Korean     20.4       33.3      46.3

The Japanese–Chinese pair was identified best (95.8%) using the acoustic features S, Dzo, and ΔV. For the discrimination between Chinese and Korean speakers, the best performance (79.6%) was obtained when S and the three cepstral distances, Dz, Do, and Dzo, were used together as the acoustic features. Table 5 shows that the probability of Korean speakers being identified as Chinese speakers (33.3%) was higher than the other way around (12.9%). Discrimination between Chinese and Korean was difficult, as mentioned above, owing to a distribution overlap for the two strongest features, S and Dz.

Looking at the identification results for each STN in Table 6, the identification performance was best for N4 (92.5% correct) and worst for N5 (64.2% correct). As the confusion matrices show, the difference in the total identification rates of these two STN comes from the discrimination rates between Chinese and Korean. Here, too, further inspection of the analysis results indicated that the distribution overlaps for S, Dz, and Dzo between the Chinese and Korean speakers were greater for N5 than for N4. In Fig. 1, certain differences were observed between the Chinese and Korean speakers' F0 patterns; in order to capture such differences, it is necessary to use evaluation measures other than the cosine similarity alone, since the cosine similarity only captures differences in the direction of two feature vectors.

Table 6
Confusion matrices for the identification of the three accents using the six acoustic features, for each of the six STN. J, C, and K stand for Japanese, Chinese, and Korean, respectively; rows are stimuli and columns are responses [%].

N1 (total correct: 83.0)         N2 (total correct: 71.7)
        J      C      K                  J      C      K
J       92.4   3.8    3.8        J       88.5   7.7    3.8
C       5.6    83.3   11.1       C       5.6    72.2   22.2
K       11.1   33.3   55.6       K       22.2   55.6   22.2

N3 (total correct: 83.0)         N4 (total correct: 92.5)
        J      C      K                  J      C      K
J       88.5   3.8    7.7        J       92.4   3.8    3.8
C       5.6    83.3   11.1       C       5.6    94.4   0.0
K       11.1   22.2   66.7       K       11.1   0.0    88.9

N5 (total correct: 64.2)         N6 (total correct: 77.4)
        J      C      K                  J      C      K
J       88.5   0.0    11.5       J       92.4   3.8    3.8
C       11.1   61.1   27.8       C       5.6    72.2   22.2
K       22.2   77.8   0.0        K       22.2   33.3   44.5

4. Experiment 2: accent identification by human listeners

4.1. Speech materials

In order to investigate the accuracy of human accent identification, a small perceptual experiment was carried
out in which listeners attempted to identify the foreign accents of Japanese STN. A subset of the speech materials used in Experiment 1 was selected for this experiment. Taking into account the experimental burden on the listeners, three of the previous six telephone numbers were selected (N1, N4, and N5), one for each type of numbering. It was confirmed that there were no significant differences in the acoustic features between the selected and unselected numbers. The total number of tested tokens was 216, the sum of 72 Japanese, 90 Chinese, and 54 Korean samples. These were further divided into two subsets: set A contained speech samples from six Japanese, eight Chinese, and four Korean speakers, while set B consisted of speech samples from six Japanese, seven Chinese, and five Korean speakers.

4.2. Listeners

Ten (nine male and one female) listeners volunteered to participate in the experiment. All were native Japanese speakers aged between the mid-20s and the mid-30s, and none of them had any experience of learning either Korean or Chinese. The two stimulus sets were counterbalanced among the listeners; sets A and B were each assigned to five listeners. The listeners took breaks after every 54 stimuli, and the whole session lasted approximately 20 min.
4.3. Procedure

Multiple forced-choice tests were conducted using the Praat experiment program. Instructions were given before the experiment: the participants were told to listen to each stimulus carefully and to choose which of the three languages they thought was the speaker's native language: Japanese, Chinese, or Korean. They were also asked to give a degree of confidence for each reply, between 5 (confident) and 1 (not confident).

4.4. Results and discussion

First of all, the difference in identification accuracy averaged across accents was not significant between the two stimulus subsets in a t-test (t(28) = 0.12, p = .91); therefore, the results of the two subsets were pooled. The results of the experiment are summarised in Table 7. The overall average identification rate was 65.6%. Although the listeners could sensitively discriminate between native and non-native Japanese speakers, they were not good at telling the Chinese and Korean accents apart. The average score for identifying native vs. non-native speakers was 90.6%, while that for discriminating between Chinese and Korean speakers was 53.2%. Most of the errors were concentrated on the discrimination between Chinese and Korean speakers of Japanese; more than 10% of the Korean speech was erroneously identified as that of a native Japanese speaker.

Table 7
Results of the human accent identification experiment, showing the average percentage of correct responses for the ten participants.

           Responses [%]
Stimuli    Japanese   Chinese   Korean
Japanese   90.6       4.4       5.0
Chinese    3.1        55.8      41.1
Korean     11.5       39.6      48.9

The differences in the identification rates among the three accents and among the listeners were tested in a two-way ANOVA. The results showed that the listeners identified native Japanese speakers significantly better than non-native speakers (F(2, 18) = 61.44, p < .01). The performance difference among the listeners was not significant (F(9, 18) = 1.88, p = .12).

Fig. 4 shows the relationship between identification correctness and degree of confidence. The confidence level averaged over the ten listeners was 3.9 when native speakers were correctly identified and 2.1 when they were erroneously identified; the corresponding values for identifying non-native speakers were 2.9 and 2.8, respectively. The confidence level averaged across listeners and accents differed significantly between correct and erroneous responses (t(29) = 2.93, p < .01). The effects of accent and identification correctness were then examined in a two-way ANOVA. The interaction between the two factors was significant (p < .001), and a signed-rank test showed that the confidence score was significantly higher for correct than for incorrect identifications only for native Japanese speech (T(9) = 1.00, p < .01). This suggests that the listeners were more conscious of their level of confidence when identifying native Japanese speech, and that they dithered over the decision when identifying non-native speech.

Fig. 4. Relationship between identification correctness and degree of confidence of the listeners.
5. General discussion

In this study, the prosody of Japanese STN was analysed in order to investigate the differences between native and non-native Japanese speakers. As predicted, the native speakers realised the prosodic structure of the BT, whereas the non-native speakers did not follow the prosodic rule for Japanese STN. The Chinese speakers' patterns had more peaks, whereas the Korean speakers' patterns were much flatter than the native speakers' patterns, and both the Chinese and the Korean speakers' utterances had a smaller F0 range than the native speakers'.

The rate of BT prosody realisation was 100% among the native speakers, regardless of their regional dialects, although some speakers from the Kinki district showed slightly different behaviour. The Kinki dialect is unique among Japanese regional dialects in that it has register tones as well as pitch accents. In this study, those who had strong accents read the first digit in the number segments with its inherent register tone, especially when the first syllable of the digit was a heavy syllable, and otherwise with a low tone. This phenomenon has not been previously reported in Japanese phonetics research and needs to be investigated further in future work.

An automatic foreign accent identification experiment was also conducted using the STN. Acoustic features related to F0 patterns, speech rhythm, and segmental properties were extracted and subjected to a multinomial logistic regression analysis with tenfold cross-validation. The best identification rate of 81.8% was achieved when all six acoustic features were used, with 94.9% of the native Japanese, 80.6% of the Chinese, and 46.3% of the Korean speech correctly identified. The best identification rates for the Japanese–Chinese, Japanese–Korean, and Chinese–Korean pairs were 95.8%, 91.9%, and 79.6%, respectively. This is equivalent to or better than other studies in which three foreign accents were identified utilising linguistic knowledge (Blackburn et al., 1993; Berkling et al., 1998).

The usefulness of the BT prosody was again shown in the identification experiment. Through further inspection of the results, we found that the cosine similarity of the prosodic patterns to the native speakers' average pattern was the most effective factor for discriminating native from non-native Japanese speakers, and that the cepstral distance of the segments /z/ and /o/ from the native speakers' average was the most effective factor for discriminating between Chinese and Korean speakers of Japanese. The cosine similarity was much higher for the Japanese speakers than for the Chinese and Korean speakers; however, the cosine similarity only expresses the degree of similarity in the directions of two vectors and does not show qualitative differences between them. Other similarity or distance measures may be attempted in future work. It may also be useful to extract further segmental characteristics for discriminating among non-native accents; the segments where non-nativeness appears can likewise be predicted using linguistic knowledge.

In analysing the prosodic patterns, it was noticed that the native speakers' F0 patterns started at approximately 0
[st], which was the average F0 [Hz] for the whole utterance, while the non-native speakers' patterns did not. This phenomenon, too, has not been previously reported; whether it occurs only for STN or for other utterances as well is a subject that needs further investigation.

This study focused on segmental properties that would distinguish native Japanese speakers from Chinese and Korean speakers of Japanese. When other L1 speakers are included in future studies, it will be necessary to reconsider which segments to use in order to cover the cross-linguistic differences of the languages in question.

A small human accent identification experiment was also carried out, in which listeners with Japanese as their native language attempted to identify the accents of the speech samples they heard. The identification rate averaged over the ten listeners was 65.6% for the entire data set, and the identification rates for native vs. non-native speakers and for discriminating between Chinese and Korean speakers were 90.6% and 53.2%, respectively. In the experiment, the listeners were also asked to rate their confidence level for each selection on a discrete 5-point scale. The average confidence scores for correct and erroneous identifications of native speakers were 3.9 and 2.1, respectively, whereas those for the non-native speakers were 2.9 and 2.8. The interaction between accent and identification correctness was significant. These results imply that the listeners were able to rate their confidence level more appropriately when identifying native speech, while they all reacted in the same way when identifying the speech of non-native Japanese speakers. This can be considered a type of in-group effect, as reported for the recognition of individual faces and emotions (e.g., Brewer, 1979; Elfenbein and Ambady, 2002). Beaupre and Hess (2006) reported that an in-group effect can be observed in the confidence level of each judgment: in an experiment where participants judged the facial emotional expressions of three cultural groups, they found a cultural in-group advantage for confidence in emotion judgments. The results obtained in the present study may likewise be associated with in-group advantages for language groups.

In comparing automatic and human accent identification, it was found that the former was better at discriminating the Chinese speakers from the other two speaker groups. On the other hand, the human listeners made fewer errors in identifying native speakers, i.e., in signal detection terms, they had a lower “miss” rate. In order to combine the benefits of both methods, a human-in-the-loop identification system may be considered in a future study, in which humans first identify whether the target speaker is a native Japanese speaker or not, and an automatic identification system then judges which non-native accent the target speech belongs to.
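The division of labour suggested above can be sketched as a simple two-stage pipeline (a hypothetical illustration; `human_says_native` and `classify_nonnative_accent` are placeholder names for the human judgement and the acoustic classifier, respectively):

```python
def human_says_native(sample) -> bool:
    """Stage 1 (placeholder): a listener's native/non-native judgement,
    where the human listeners showed the lower miss rate."""
    raise NotImplementedError


def classify_nonnative_accent(sample) -> str:
    """Stage 2 (placeholder): the automatic classifier, which was better
    at separating the Chinese accent from the other two groups."""
    raise NotImplementedError


def two_stage_accent_identification(sample) -> str:
    """Hypothetical human-in-the-loop pipeline from the discussion above."""
    if human_says_native(sample):
        return "Japanese (native)"
    return classify_nonnative_accent(sample)   # e.g. "Chinese" or "Korean"
```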
Using linguistic forms that non-native speakers are not familiar with would be useful for foreign accent identification, especially in forensic situations. One reason is that such forms are rarely acquired unless they are taught in specific classroom lessons. Even advanced-level language learners, such as the participants in this study, may not be aware of the differences between a native speaker's utterances and their own. It is therefore necessary to investigate the effect of teaching these linguistic forms, in order to determine how well non-native speakers can reproduce native-like speech, before we apply our method to actual forensic casework.

Acknowledgments

Portions of this work were presented at ICPhS 2011 (K. Amino and T. Osanai, Realisation of the prosodic structure of spoken telephone numbers by native and non-native speakers of Japanese, in: Proc. International Congress of Phonetic Sciences, pp. 236–239, Hong Kong, August 2011) and the ASJ meeting in 2011 (K. Amino and T. Osanai, Identification of native and non-native speech by using Japanese spoken telephone numbers, in: Proc. Autumn Meeting Acoust. Soc. Jpn., pp. 407–410, Matsue, September 2011). This work was supported by Grants-in-Aid for Scientific Research from MEXT (25350488, 24810034, 21300060).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.specom.2013.07.010.

References

Arslan, L.M., Hansen, J., 1996. Language accent classification in American English. Speech Commun. 18, 353–367.
Arslan, L.M., Hansen, J., 1997. Frequency characteristics of foreign accented speech. In: Proc. International Conference on Acoustics, Speech and Signal Processing, pp. 1123–1126.
Baumann, S., Trouvain, J., 2001. On the prosody of German telephone numbers. In: Proc. Eurospeech, pp. 557–560.
Beaupre, M.G., Hess, U., 2006. An ingroup advantage for confidence in emotion recognition judgments: the moderating effect of familiarity with the expressions of outgroup members. Personal. Social Psychol. Bull. 32, 16–26.
Berkling, K., Zissman, M., Vonwiller, J., Cleirigh, C., 1998. Improving accent identification through knowledge of English syllable structure. In: Proc. International Conference on Spoken Language Processing. Paper #0394.
Blackburn, C.S., Vonwiller, J.P., King, R.W., 1993. Automatic accent classification using artificial neural networks. In: Proc. Eurospeech, pp. 1241–1244.
Bloch, B., 1950. Studies in colloquial Japanese 4 – phonemics. Language 26, 86–125.
Boersma, P., 2001. Praat, a system for doing phonetics by computer. Glot Int. 5, 341–345.
Brewer, M.B., 1979. In-group bias in the minimal intergroup situation: a cognitive-motivational analysis. Psychol. Bull. 86, 307–324.
Brousseau, J., Fox, S.A., 1992. Dialect-dependent speech recognisers for Canadian and European French. In: Proc. International Conference on Spoken Language Processing, pp. 1003–1006.
Cleirigh, C., Vonwiller, J., 1994. Accent identification with a view to assisting recognition. In: Proc. International Conference on Spoken Language Processing, pp. 375–378.
Elfenbein, H.A., Ambady, N., 2002. Is there an in-group advantage in emotion recognition? Psychol. Bull. 128, 243–249.
Flege, J.M., 1988. Factors affecting degree of perceived foreign accent in English sentences. J. Acoust. Soc. Am. 84, 70–79.
Flege, J.M., Fletcher, K.L., 1992. Talker and listener effects on degree of perceived foreign accent. J. Acoust. Soc. Am. 91, 370–389.
Fullana, N., Mora, J.C., 2009. Production and perception of voicing contrasts in English word-final obstruents: assessing the effects of experience and starting age. In: Watkins, M.A., Rauber, A.S., Baptista, B.O. (Eds.), Recent Research in Second Language Phonetics and Phonology – Perception and Production. Cambridge Scholars Publishing, Newcastle upon Tyne, pp. 97–117.
Fung, P., Kat, L.W., 1999. Fast accent identification and accented speech recognition. In: Proc. International Conference on Acoustics, Speech and Signal Processing, pp. 221–224.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. SIGKDD Explor. 11 (1), 10–18.
Hansen, J., Arslan, L.M., 1995. Foreign accent classification using source generator based prosodic features. In: Proc. International Conference on Acoustics, Speech and Signal Processing, pp. 836–839.
Hirano, H., Hirose, K., Minematsu, N., Kawai, G., 2006a. Prosodic features and evaluation of Japanese sentences spoken by Chinese learners. IEICE Technical Report SP 105 (686), 23–28.
Hirano, H., Gu, W., Hirose, K., Minematsu, N., Kawai, G., 2006b. Analysis of prosodic features in native and non-native Japanese using generation process model of fundamental frequency contours. IEICE Technical Report SP 106 (333), 19–24.
Hollien, H., 2002. Forensic Voice Identification. Academic Press, London.
Itahashi, S., Tanaka, K., 1993. A method of classification among Japanese dialects. In: Proc. Eurospeech, pp. 639–642.
Itahashi, S., Yamashita, T., 1992. A discrimination method between Japanese dialects. In: Proc. International Conference on Spoken Language Processing, pp. 1015–1018.
Ito, C., Kang, Y.J., Kenstowicz, M., 2006. The adaptation of Japanese loanwords into Korean. MIT Working Pap. Ling. 52, 65–104.
Japanese-Language Proficiency Test: http://www.jlpt.jp/e/index.html.
Joh, H., Fukumori, T., Saito, Y., 2011. Dictionary of Basic Phonetic Terms. Bensei Shuppan Publishing Company, Tokyo.
Katagiri, K.L., 2008. Pitch accent realisation of the Japanese digits by Filipino learners of Japanese. In: Proc. Symposium on Education of Japanese in Asia, Chulalongkorn University, pp. 103–127.
Kim, C.W., 1968. The vowel system of Korean. Language 44, 516–527.
Kindaichi, H., Akinaga, K., 2001. Dictionary of Japanese Accents, second ed. Sanseido, Tokyo.
Koshimizu, M., 1998. Chinese. In: Institute of Language Research (Ed.), A Guide to the World's Languages Part 2: Asia and Africa, Tokyo University of Foreign Languages. Sanseido, Tokyo.
Kulshreshtha, M., Singh, C.P., Sharma, R.M., 2012. Speaker profiling: the study of acoustic characteristics based on phonetic features of Hindi dialects for forensic speaker identification. In: Neustein, A., Patil, H.A. (Eds.), Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism. Springer Verlag, Berlin, pp. 71–100.
Kumpf, K., King, R.W., 1996. Automatic accent classification of foreign accented Australian English speech. In: Proc. International Conference on Spoken Language Processing, pp. 1740–1743.
Lee, J.P., Jang, T.Y., 2004. A comparative study on the production of inter-stress intervals of English speech by English native speakers and Korean speakers. In: Proc. Interspeech, pp. 1245–1248.
Lee, O.H., Kim, J.M., 2005. Syllable-timing interferes with Korean learners' speech of stress-timed English. Speech Sci. 12, 95–112.
Lin, H., Wang, Q., 2007. Mandarin rhythm: an acoustic study. J. Chin. Ling. Comput. 17, 127–140.
Mackay, I.R.A., Fullana, N., 2009. Starting age and exposure effects on EFL learners' sound production in a formal learning context. In: Watkins, M.A., Rauber, A.S., Baptista, B.O. (Eds.), Recent Research in Second Language Phonetics and Phonology – Perception and Production. Cambridge Scholars Publishing, Newcastle upon Tyne, pp. 43–61.
Manning, C., Schuetze, H., 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge.
Markham, D., 1999. Prosodic imitation: productional results. In: Proc. International Conference on Spoken Language Processing, pp. 1187–1190.
Matsuzaki, H., 1999. Phonetic education of Japanese for Korean speakers. J. Phonet. Soc. Jpn. 3, 26–35.
Mehlhorn, G., 2007. From Russian to Polish: positive transfer in third language acquisition. In: Proc. International Congress of Phonetic Sciences, pp. 1745–1748.
Min, K.J., 1996. Acoustic-phonetic comparative study on prosodic properties of Japanese and Korean – in the aim of Japanese education for Korean learners. Doctoral dissertation, Tohoku University, Sendai, Japan.
Mixdorff, H., 1996. Foreign accent in intonation patterns – a contrastive study applying a quantitative model of the F0 contour. In: Proc. International Conference on Spoken Language Processing, pp. 1469–1472.
Mok, P., Dellwo, V., 2008. Comparing native and non-native speech rhythm using acoustic rhythmic measures: Cantonese, Beijing Mandarin and English. In: Proc. Speech Prosody, pp. 423–426.
Muthusamy, Y.K., Barnard, E., Cole, R.A., 1994. Reviewing automatic language identification. IEEE Signal Process. Mag. 11, 33–41.
Nasu, A., 2001. Prosodic structure of reduplicated mimetics in Japanese. J. Osaka Univ. Foreign Stud. 25, 115–125.
Noma, H., 1998. Korean. In: Institute of Language Research (Ed.), A Guide to the World's Languages Part 2: Asia and Africa, Tokyo University of Foreign Languages. Sanseido, Tokyo.
Piat, M., Fohr, D., Illina, I., 2008. Foreign accent identification based on prosodic parameters. In: Proc. Interspeech, pp. 759–762.
Poser, W., 1990. Evidence for foot structure in Japanese. Language 66, 78–105.
Ramus, F., Nespor, M., Mehler, J., 1999. Correlates of linguistic rhythm in the speech signal. Cognition 73, 265–292.
Rogers, H., 1998. Foreign accent in voice discrimination: a case study. Int. J. Speech Lang. Law 5, 203–208.
Saito, Y., 2009. An Introduction to Japanese Phonetics (Nihongo Onseigaku Nyuumon). Sanseido, Tokyo.
Tate, D.A., 1979. Preliminary data on dialect in speech disguise. In: Hollien, H., Hollien, P. (Eds.), Current Issues in the Phonetic Sciences. John Benjamins Publishing, Amsterdam, pp. 847–850.
Teixeira, C., Trancoso, I., Serralheiro, A., 1996. Accent identification. In: Proc. International Conference on Spoken Language Processing, pp. 1784–1787.
Toki, S., 2010. Phonetics Research Based on Japanese Education (Nihongo kyouiku kara no Onsei Kenkyuu). Hitsuji-Shobo Publishing, Tokyo.
Tsujimura, N., 1996. An Introduction to Japanese Linguistics. Blackwell, Oxford.
Utsugi, A., 2004. The phonetic and phonological characteristics of focused and neutral utterances in Japanese spoken by Korean learners of Japanese. J. Phonet. Soc. Jpn. 8, 96–108.
Vance, T., 2008. The Sounds of Japanese. Cambridge University Press, Cambridge.
Vieru-Dimulescu, B., de Mareuil, P.B., Adda-Decker, M., 2007. Identification of foreign-accented French using data mining techniques. In: Proc. International Workshop on Paralinguistic Speech – Between Models and Data, pp. 47–52.
Wrembel, M., 2009. The impact of voice quality resetting on the perception of a foreign accent in third language acquisition. In: Watkins, M.A., Rauber, A.S., Baptista, B.O. (Eds.), Recent Research in Second Language Phonetics and Phonology – Perception and Production. Cambridge Scholars Publishing, Newcastle upon Tyne, pp. 291–307.
Yanguas, L.R., O'Leary, G.C., Zissman, M.A., 1998. Incorporating linguistic knowledge into automatic dialect identification of Spanish. In: Proc. International Conference on Spoken Language Processing. Paper #1136.
Zissman, M.A., 1996. Automatic dialect identification of extemporaneous, conversational, Latin American Spanish speech. In: Proc. International Conference on Acoustics, Speech and Signal Processing, pp. 777–780.
Zissman, M.A., 1996. Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4, 31–44.