The Building and Evaluation of a Mobile Parallel Multi-Dialect Speech Corpus for Arabic


Procedia Computer Science 142 (2018) 166–173


The 4th International Conference on Arabic Computational Linguistics (ACLing 2018), November 17-19, 2018, Dubai, United Arab Emirates

Khalid Almeman*

Department of Applied Natural Sciences, Community College of Unaizah, Qassim University, Qassim, Saudi Arabia

Abstract

This paper discusses the process of building and evaluating a mobile parallel multi-dialect speech corpus for Arabic. The methodology for implementing the experiment is as follows: two SIM cards were installed in two mobile phones. One party is the sender and the other the receiver. Four different environments were chosen for the receiver, i.e. inside the home, in a moving car, in a public place and in a quiet place. By the end of the experiment, a new mobile parallel speech corpus for Arabic dialects had been built. The newly obtained corpus provides the benefits of a large, fully parallel and labelled speech corpus without requiring a major collection and building effort. The resultant corpus will be made freely available to researchers. To evaluate the resultant corpus, the CMU Sphinx recogniser extracted word error rates (WERs) of 24.3, 17.9, 31.2, 18.7 and 32.0 for multi-dialect, Levantine, Gulf, MSA and Egyptian, respectively.

© 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/)
Peer-review under responsibility of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.

Keywords: mobile corpus; cellular corpus; Arabic dialects corpus; Arabic parallel corpus; speech corpus for Arabic dialects

1. Introduction

The limited availability of resources providing data about the Arabic language affects the accuracy of diverse natural language processing and speech recognition applications. The huge differences between Arabic dialects further heighten the need for additional resources for use in different areas.

In terms of speech recognition applications, using the same channel for training and testing data is highly recommended to guarantee high accuracy. For example, using data derived from a microphone source to recognise mobile calls is likely to result in low accuracy.

* Corresponding author. Tel.: +966-16-3800050. E-mail address: [email protected]

DOI: 10.1016/j.procs.2018.10.472


Table 1. The 13 combinations of short vowels for ج /j/ 'the Jeem letter'

FatHah | Kasrah | Dhammah | Sukun
(The diacritised Arabic glyphs in the table body did not survive text extraction; the columns list the short-vowel marks applied to the letter.)

The purpose of this work is to build and evaluate a new version of the multi-dialect Arabic speech parallel corpus, which includes four Arabic varieties: Gulf, Levantine and Egyptian, as well as MSA. The Arabic multi-dialect speech corpus [4] now has three different versions built from different sources: a microphone [ibid.], a VOIP source [3] and this mobile version, which is explained in the experiment conducted for this research. All of these versions will be freely available to researchers.

This paper is organised as follows: Section 2 highlights the main features of Modern Standard Arabic (MSA) and its dialects; Section 3 outlines the related work; Section 4 details the data used to produce the new corpus; Section 5 explains the methodology applied; Section 6 presents the resultant mobile corpus; Section 7 describes the tool used to extract the WER results and introduces the WER results for all experiments; Section 8 describes the evaluation of the work; and Section 9 presents the conclusion and planned future work.

2. Modern Standard Arabic and Arabic Dialects

The Arabic language is one of the most widely spoken languages globally, ranked fourth in use after Chinese, Spanish and English [9], with an estimated 422 million speakers [6]. Furthermore, Arabic is the official language in 24 countries [ibid.]. Each Arabic letter has up to four written forms¹ based on its position in the word: initial, medial, final or isolated. In Arabic, diacritisation (Tashkeel) is used to denote short vowels. There are up to 13 different possible combination forms for letters in Arabic [2]. Table 1 shows an example of the possible combinations for the letter ج /j/ 'the Jeem letter' (a short Unicode illustration of this mechanism is given at the end of this section). Overall, the different possible diacritised letters of the 28 Arabic letters reach in excess of 350 different forms [ibid.].

2.1. Modern Standard Arabic versus dialects in usage

MSA is the formal version of the language used for communication; it is understood by the majority of the Arabic population and employed in both education and the media [10]. MSA enables individuals who speak different dialects to communicate, although local dialects are used during the majority of telephone calls and in routine conversations. Dialects have also more recently begun to appear in television programmes.

2.2. The multiplicity of Arabic dialects

The majority of contemporary Arabic dialects originated from a combination of diverse Arabic dialects and the languages of neighbouring countries [15]; for example, the interaction between the Arabic, Berber and French languages led to the North African dialect [ibid.].
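The following is the Unicode illustration mentioned above. Arabic short vowels are combining characters, so each cell of Table 1 corresponds to the base letter plus one or more combining marks. This sketch is our own illustration (Python and the particular marks shown are our choices, not the paper's); Shaddah is included because it can stack with the short vowels, which is one reason the number of combinations exceeds the four basic marks:

```python
# Arabic short-vowel diacritics are Unicode combining marks: appending one
# to a base letter yields a diacritised form like those shown in Table 1.
JEEM = "\u062C"  # the Jeem letter
MARKS = {
    "FatHah":  "\u064E",
    "Kasrah":  "\u0650",
    "Dhammah": "\u064F",
    "Sukun":   "\u0652",
    "Shaddah": "\u0651",  # may combine with FatHah, Kasrah or Dhammah
}
for name, mark in MARKS.items():
    print(f"{name:8s} {JEEM + mark}")
```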

¹ The Hamzah letter has five written forms.
² In this paper, we represent Arabic words in some or all of the following variants: the word in Arabic letters / HSB transliteration scheme [16] / (the dialect).



In Arabic dialects, the majority of words originate either from MSA (for example, /salAsah/ 'three' (Egyptian)², which has the origin /TalATah/ 'three' in MSA) or are loan-words (e.g. /tilifzywn/ 'television'). In both of the cases cited, there are significant differences between the original and current expressions. Three different levels of change are detectable in expressions known to originate from MSA: firstly, changes expressed by altering consonants or long vowels³; secondly, changes arising from the use of short vowels, i.e. diacritisation; and thirdly, changes that take place when the Al Tajweed⁴ rules are ignored [2]. Two important linguistic aspects distinguish MSA from the dialects: (1) the differences between the dialects are large [2], and (2) MSA is effectively a second language, and therefore non-native, for Arabic speakers [15]. There are over 30 Arabic dialects [13]⁵, with each country having its own specific main dialect, and some also having a number of subdialects.

3. Related Work

There is a lack of available speech databases for Arabic dialects that can be used for speech recognition applications and NLP tasks [11]. This lack affects the accuracy of speech recognition tasks [2]. Few parallel speech databases have been collected previously [2], principally because compiling a speech corpus is a time-consuming process [ibid.]. Anumanchipalli et al. [5] created an example of a parallel corpus, collecting approximately two hours of parallel speech for English and Portuguese. For the same work, they also collected approximately 25 minutes of German and English [ibid.]. A further example detailing the production of a parallel corpus is the work of Erjavec [12], who collected a parallel multilingual corpus for English before translating it into four languages [ibid.]. A final example of a parallel speech corpus for two different languages is the work of Pérez et al. [21], who collated a parallel speech and text corpus for the Basque and Spanish languages.

The TIMIT Acoustic-Phonetic Continuous Speech Corpus [14] is the most popular speech database. Many speech corpus resources have been produced using TIMIT, such as CTIMIT (Cellular TIMIT) [7], NTIMIT (Network TIMIT) [17], etc. In CTIMIT [7], a DAT player was placed in a van, with its output transmitted to a mobile phone by placing a speaker close to it, producing a new speech database. CTIMIT has the same content as TIMIT but different channel characteristics.

As mentioned above, there is currently a lack of freely available speech databases for Arabic dialects for use with speech recognition applications and NLP tasks [11], which impacts on the accuracy of speech recognition tasks [2]. Contemporary Arabic speech corpora are derived from a number of different sources: (1) microphones, for example the West Point corpus [18]; (2) broadcast receivers, for example the NEMLAR Broadcast News Speech Corpus [20]; and (3) telephone conversations, for example the CALLHOME Egyptian Arabic Speech corpus [8] and the Saudi Accented Arabic Voice Bank [1]. However, aside from the Arabic multi-dialect parallel speech corpus [4], which is a microphone-sourced corpus, there is currently no available parallel speech corpus for alternative sources such as cellular networks.

4. Data

In this research, the Arabic dialects speech corpus [4] was used to conduct the experiments. In Almeman et al. [4], the researchers collected more than 67,000 PCM audio files. This collection contains three Arabic dialects, Egyptian, Gulf and Levantine, as well as MSA. The corpus produced by Almeman et al. [4] was recorded at 16-bit resolution and 48,000 Hz, in mono, i.e. one channel. The main subject domain of the Almeman et al. [4] corpus is travel and tourism, and it also includes a corpus of MSA numbers. Almeman et al. [4] used a microphone source to build the entire Arabic multi-dialect speech corpus.
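Because every file in the corpus is expected to share this format (mono, 16-bit PCM, 48,000 Hz), a local copy can be sanity-checked before training. This is a minimal sketch using Python's standard library; the corpus root directory and the .wav file extension are our assumptions, not details from the paper:

```python
import wave
from pathlib import Path

# Hypothetical location of an unpacked copy of the corpus.
CORPUS_ROOT = Path("arabic_multi_dialect_corpus")

def matches_spec(wav_path: Path) -> bool:
    """True if the file is mono, 16-bit PCM, sampled at 48,000 Hz."""
    with wave.open(str(wav_path), "rb") as w:
        return (w.getnchannels() == 1           # mono (one channel)
                and w.getsampwidth() == 2       # 16-bit = 2 bytes per sample
                and w.getframerate() == 48000)  # 48,000 Hz

deviating = [p for p in CORPUS_ROOT.rglob("*.wav") if not matches_spec(p)]
print(f"{len(deviating)} file(s) deviate from the mono/16-bit/48 kHz spec")
```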

³ There are three long vowels in Arabic: ا /A/, و /W/ and ي /Y/.
⁴ Tajweed means 'to recite the Quran in a correct way'.
⁵ Ethnologue is a web-based publication containing statistics for more than seven thousand languages.


5. Methodology

The methodology of this research was formulated after first determining the recording environments. Four environments were chosen: inside the home, in a moving car, in a public place, and in a quiet place. Table 2 shows the distribution of the speakers across these four environments; they are divided equally.

There were two parties in this experiment: a sender and a receiver. A mobile network was used, and the conversations were recorded on both the sender's and the receiver's mobile phones (one SIM card installed in each). The sender was in a fixed environment, i.e. inside the home, while the environment of the receiver varied as detailed above. All the recordings made for the Arabic multi-dialect speech corpus were prepared manually.

The city where the data was collected is Unayzah, located about 300 km north of Riyadh (the capital of the Kingdom of Saudi Arabia). It is a medium-sized city, so noise is expected in public areas, and road traffic is average. The chosen public areas and streets used in the experiment varied between high and medium noise.

Table 2. Recordings distributed between the four chosen environments (number of speakers per dialect)

Environment       | Egyptian | Gulf | Levantine | MSA | Total
Inside the home   | 5        | 3    | 2         | 3   | 13
In a moving car   | 5        | 3    | 2         | 3   | 13
In a public place | 5        | 3    | 2         | 3   | 13
In a quiet place  | 5        | 3    | 2         | 3   | 13
Total             | 20       | 12   | 8         | 12  | 52

The final part of the methodology involves testing the resultant corpora: a speech recognition engine is used to extract word error rates, so that the results can be compared with those of the other versions.

6. Mobile Resultant Corpora

By the end of the experiment, we had obtained a new parallel mobile corpus, which includes the same content as the Arabic dialects speech corpus [4]. The resultant corpus includes four varieties: Egyptian, Gulf, MSA and Levantine. It also contains more than 67,000 segmented audio files. The total number of participants in the resultant corpora is 52 speakers, with 12, 8, 12 and 20 speakers for the MSA, Levantine, Gulf and Egyptian varieties respectively. Table 3 presents the distribution of wave files for the chosen dialects. For additional details about the original speech corpus, its evaluation and the overlap between the dialects, see [4].

7. Recognition system and WER results

To extract the results for the new corpus, we used CMU Sphinx [19]. To obtain the training results, we used CMU Sphinxtrain v1.0.7 [23]; to decode and extract the results, we used Sphinx v3-0.8 [22]. Tables 4, 5, 6, 7 and 8 present the word error rate results acquired with the CMU decoder. We set three different values (4, 8 and 16) for the Gaussian densities to try to obtain the best WER results. In addition, we set six different values for the tied states. As Tables 4, 5, 6, 7 and 8 show, the best WER results were 24.3, 17.9, 31.2, 18.7 and 32.0 for multi-dialect, Levantine, Gulf, MSA and Egyptian, respectively. (A minimal sketch of the WER computation itself is given after Table 3.)

Table 3. The distribution of audio files

Corpus    | Total files
MSA       | 15492
Gulf      | 15492
Egyptian  | 25820
Levantine | 10328
Total     | 67132
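For reference, the WER figures reported in this section measure the word-level edit distance between the reference transcription and the recogniser's hypothesis, normalised by the reference length. The following is a minimal, self-contained sketch of that computation; it is our illustration, not code from the paper or from the Sphinx toolkit:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level Levenshtein distance (substitutions +
    deletions + insertions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion against a four-word reference -> 50.0
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3"))
```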


Table 4. Multi-dialect WER results

Tied states | 4 Gaussians | 8 Gaussians | 16 Gaussians
1000        | 32.3        | 30.0        | 27.8
2000        | 29.5        | 26.3        | 25.4
3000        | 27.4        | 24.8        | 24.3
4000        | 26.8        | 25.1        | 24.9
4500        | 26.5        | 25.0        | 25.0
5000        | 26.4        | 25.1        | 25.9

Table 5. Levantine WER results

Tied states | 4 Gaussians | 8 Gaussians | 16 Gaussians
1000        | 23.4        | 19.4        | 17.9
2000        | 19.6        | 18.6        | 22.6
3000        | 19.6        | 23.2        | 32.5
4000        | 21.0        | 29.7        | 47.0
4500        | 23.2        | 33.8        | 52.3
5000        | 25.2        | 38.0        | 56.8

Table 6. Gulf WER results

Tied states | 4 Gaussians | 8 Gaussians | 16 Gaussians
1000        | 36.6        | 33.5        | 33.5
2000        | 32.6        | 31.2        | 34.6
3000        | 31.7        | 33.3        | 41.4
4000        | 32.3        | 36.4        | 51.2
4500        | 32.9        | 39.9        | 54.8
5000        | 33.2        | 42.8        | 58.3

Table 7. MSA WER results

Tied states | 4 Gaussians | 8 Gaussians | 16 Gaussians
1000        | 21.7        | 20.1        | 20.2
2000        | 19.4        | 19.0        | 19.8
3000        | 18.7        | 20.0        | 25.2
4000        | 19.3        | 22.3        | 31.2
4500        | 20.3        | 24.8        | 35.3
5000        | 21.0        | 27.0        | 38.8
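Each of Tables 4 to 8 is a grid of WER values over the six tied-state counts and the three Gaussian-density settings, and the reported best figure is simply the grid minimum. As a concrete illustration (our own sketch, with the multi-dialect numbers transcribed from Table 4):

```python
# WER grid transcribed from Table 4: {tied_states: {gaussian_densities: WER}}.
MULTI_DIALECT_WER = {
    1000: {4: 32.3, 8: 30.0, 16: 27.8},
    2000: {4: 29.5, 8: 26.3, 16: 25.4},
    3000: {4: 27.4, 8: 24.8, 16: 24.3},
    4000: {4: 26.8, 8: 25.1, 16: 24.9},
    4500: {4: 26.5, 8: 25.0, 16: 25.0},
    5000: {4: 26.4, 8: 25.1, 16: 25.9},
}

# Scan the grid for the configuration with the lowest WER.
wer, tied, dens = min((w, ts, g)
                      for ts, row in MULTI_DIALECT_WER.items()
                      for g, w in row.items())
print(f"best WER {wer} at {tied} tied states and {dens} Gaussian densities")
# -> best WER 24.3 at 3000 tied states and 16 Gaussian densities
```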

8. Evaluation

The main aim of this research was to re-record, train and evaluate one of the speech corpora using a mobile network. The chosen speech corpus was the Arabic parallel multi-dialect speech corpus [4]. The reason for choosing this resource is that it is uniquely (1) an Arabic parallel database, (2) a multi-dialect database and (3) freely available.


Table 8. Egyptian WER results

Tied states | 4 Gaussians | 8 Gaussians | 16 Gaussians
1000        | 38.9        | 34.5        | 32.5
2000        | 34.2        | 32.8        | 32.8
3000        | 33.1        | 32.7        | 35.8
4000        | 32.0        | 33.4        | 39.5
4500        | 32.6        | 35.1        | 42.6
5000        | 33.5        | 35.9        | 45.5

Table 9. Best WER results (%) compared between the microphone and mobile sources

Corpus        | Microphone source | Cellular source | Difference
Multi-dialect | 13.7              | 24.3            | 10.6
MSA           | 8.2               | 18.7            | 10.5
Gulf          | 12.7              | 31.2            | 18.5
Levantine     | 8.8               | 17.9            | 9.1
Egyptian      | 11.2              | 32.0            | 20.8

By the end of the experiment, we had developed a new version of the corpus, which is identical in content to the original but has its own recording specifications.

As described in the methodology, four different environments were chosen for the experiment: inside the home, in a moving car, in a public place and in a quiet place. For the recordings made inside the home, the noise level varied: some rooms are very quiet and some have noise in the background. For the recordings made in a moving car and in public places, there was noise in the background. For the recordings obtained in quiet places, the noise was mostly at a low level compared with the other environments, as expected.

Noise occurring in the background can be divided into non-human noise and human noise. Examples of non-human background noise are doors closing, cutlery sounds, car horns and road traffic, while examples of human noise are crying, shouting, speaking, etc. Mobile call quality can also be affected by many additional factors, such as network signal quality, recording quality, the distance between the mobile and the mouth, etc.

The comparison between the best results for the microphone source corpus and the mobile call corpus is given in Table 9, which shows that recognition accuracy for the microphone source (the original) is higher than for the mobile source (the newly obtained corpus). The differences in WER vary between 9.1 and 20.8 percentage points.

Checking the contrast between the speech (in the foreground) and the noise (in the background) gives an indication of the sound quality of the resultant corpus. WCAG 2.0 [24]⁶ states that background noise should be at least 20 dB (RMS) lower than the speech in the foreground. Thirty speech files were randomly chosen and the contrast was measured; a sketch of the measurement is given below, and the results are shown in Table 10. The difference between the foreground and the background was over 20 dB (RMS) for all selected files, except for one file which obtained 15.5 dB, i.e. less than 20 dB. On checking this file, the problem appeared to be the weakness of the mobile network signal, which affects the level of background noise. The average difference across all files was 44.74 dB (RMS), which satisfied the WCAG 2.0 conditions. The results indicate that the background noise level is acceptable.

⁶ Web Content Accessibility Guidelines (WCAG) 2.0 is a guideline for accessible audio files on the internet, recommended by the W3C.
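The contrast measurement described above amounts to computing the average RMS level, in dB relative to full scale, for a hand-marked speech segment and a hand-marked background segment of the same file, and taking the difference. The following is a minimal sketch assuming NumPy and the corpus's mono 16-bit format; the file path and segment boundaries are hypothetical, styled after one row of Table 10:

```python
import math
import wave

import numpy as np

def rms_db(path: str, start_s: float, end_s: float) -> float:
    """Average RMS level of a segment in dB relative to 16-bit full scale.
    Assumes mono 16-bit PCM, as used in this corpus, and a non-silent segment."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        w.setpos(int(start_s * rate))
        frames = w.readframes(int((end_s - start_s) * rate))
    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float64)
    return 20.0 * math.log10(math.sqrt(np.mean(samples ** 2)) / 32768.0)

# Hypothetical path and hand-marked segment boundaries in the style of
# the EGY\01\A\01\03 row of Table 10.
fg = rms_db("EGY/01/A/01/03.wav", 0.66, 1.31)  # foreground (speech)
bg = rms_db("EGY/01/A/01/03.wav", 0.00, 0.48)  # background (noise)
print(f"contrast: {fg - bg:.1f} dB RMS (WCAG 2.0 requires at least 20 dB)")
```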


Table 10. Speech contrast evaluation (times in seconds; levels in average RMS dB; FG = foreground speech, BG = background noise)

Audio file #       | FG start | FG end | FG RMS dB | BG start | BG end | BG RMS dB | Difference
MSA\01\A\01\53     | 0.76     | 1.60   | -33.9     | 0.35     | 0.75   | -75.7     | 41.9
MSA\02\E\01\111    | 0.01     | 1.41   | -32.4     | 1.50     | 2.03   | -76.4     | 44.0
MSA\04\H\02\103    | 0.19     | 0.86   | -35.2     | 0.92     | 1.27   | -81.3     | 46.1
MSA\05\A\04\27     | 0.43     | 1.37   | -32.3     | 0.00     | 0.31   | -80.8     | 48.5
MSA\06\C\01\44     | 0.35     | 0.98   | -35.2     | 0.00     | 0.20   | -76.5     | 41.3
MSA\11\G\01\127    | 0.24     | 1.11   | -31.9     | 0.10     | 0.25   | -60.0     | 28.2
MSA\12\D\02\04     | 0.44     | 1.84   | -26.5     | 0.00     | 0.37   | -78.5     | 52.0
MSA\07\A\01\36     | 0.26     | 2.57   | -32.3     | 2.63     | 3.08   | -79.7     | 47.4
GULF\01\A\01\02    | 1.01     | 3.39   | -33.7     | 0.00     | 0.75   | -77.5     | 43.8
GULF\09\D\01\42    | 0.34     | 1.06   | -25.7     | 1.08     | 1.29   | -70.4     | 44.7
GULF\03\H\02\25    | 0.43     | 1.35   | -34.8     | 1.61     | 2.22   | -75.9     | 41.1
GULF\03\A\01\38    | 0.73     | 2.21   | -33.4     | 0.00     | 0.50   | -76.8     | 43.5
GULF\03\G\01\84    | 0.72     | 1.59   | -29.9     | 1.99     | 2.40   | -64.0     | 34.1
GULF\10\H\02\61    | 0.32     | 1.27   | -31.6     | 0.00     | 0.27   | -78.0     | 46.4
GULF\03\B\01\24    | 0.44     | 2.16   | -34.2     | 0.00     | 0.39   | -77.1     | 43.0
GULF\06\C\02\02    | 0.27     | 1.56   | -25.0     | 1.66     | 1.87   | -75.9     | 50.9
LEV\01\A\01\10     | 0.44     | 2.48   | -33.4     | 0.00     | 0.40   | -48.9     | 15.5
LEV\01\H\02\39     | 0.20     | 0.93   | -28.0     | 1.21     | 1.96   | -79.9     | 51.9
LEV\02\D\02\49     | 0.54     | 1.58   | -21.8     | 0.00     | 0.35   | -63.3     | 41.5
LEV\06\D\01\25     | 0.51     | 1.39   | -27.6     | 0.00     | 0.43   | -78.6     | 51.0
LEV\05\A\02\46     | 0.62     | 1.70   | -32.1     | 0.00     | 0.43   | -77.6     | 45.5
LEV\06\D\01\23     | 0.52     | 1.55   | -29.3     | 1.75     | 2.28   | -78.1     | 48.8
LEV\06\C\01\19     | 0.52     | 1.91   | -30.5     | 2.02     | 2.75   | -76.5     | 46.0
EGY\01\A\01\03     | 0.66     | 1.31   | -35.5     | 0.00     | 0.48   | -77.3     | 41.8
EGY\03\E\01\105    | 0.64     | 1.67   | -32.9     | 2.01     | 2.50   | -81.7     | 48.8
EGY\07\F\01\29     | 0.58     | 0.99   | -30.7     | 1.32     | 1.62   | -81.4     | 50.6
EGY\15\A\04\21     | 1.10     | 1.97   | -27.8     | 0.00     | 0.50   | -77.7     | 49.9
EGY\16\E\01\143    | 0.76     | 1.56   | -31.7     | 0.02     | 0.59   | -77.8     | 46.1
EGY\20\H\02\216    | 0.56     | 1.50   | -17.9     | 1.77     | 2.44   | -67.7     | 49.8
EGY\14\E\01\204    | 0.76     | 1.36   | -32.6     | 0.00     | 0.48   | -80.3     | 47.7
Average            | —        | —      | -30.66    | —        | —      | -75.04    | 44.74

9. Conclusions

The result of this paper is a mobile multi-dialect Arabic speech corpus. The new corpus will be freely available to researchers. The Arabic multi-dialect speech corpus [4] now has three different versions built from different sources: a microphone, a VOIP source [3] and this mobile version, as described in the experiment conducted for this research.

The methodology employed four different environments for recording: inside the home, in a moving car, in a public place and in a quiet place, with the speakers divided equally between the four environments. There was also diversity in the noise level, and the noise could be divided into non-human noise and human noise.


The comparison between the best results for the microphone source corpus and the mobile call corpus shows that recognition accuracy for the microphone source is higher than for the mobile source. The average foreground-background difference across all files tested was 44.74 dB (RMS), which satisfied the WCAG 2.0 conditions; this result indicates that the background noise level of the new resultant corpus is acceptable.

Various speech databases have been produced from the TIMIT speech database, e.g. FFMTIMIT (Free-Field Microphone TIMIT), NTIMIT (Network TIMIT), CTIMIT (Cellular TIMIT), HTIMIT (Handset TIMIT) and STCTIMIT (Single-Channel Telephone TIMIT), so an interesting direction for future work is to obtain new parallel speech corpora for Arabic dialects from further sources.

References

[1] Alghamdi, M., Alhargan, F., Alkanhal, M., Alkhairy, A., Eldesouki, M., Alenazi, A., 2008. Saudi Accented Arabic Voice Bank. Journal of King Saud University - Computer and Information Sciences 20, 43–58.
[2] Almeman, K., 2015. Reducing Out-of-Vocabulary in Morphology to Improve the Accuracy in Arabic Dialects Speech Recognition. PhD thesis.
[3] Almeman, K., 2017. Automatically building VOIP speech parallel corpora for Arabic dialects 17, 4:1–4:12. doi:10.1145/3132708.
[4] Almeman, K., Lee, M., Almiman, A.A., 2013. Multi Dialect Arabic Speech Parallel Corpora, in: Proceedings of the First International Conference on Communications, Signal Processing, and their Applications (ICCSPA'13), Sharjah, UAE. pp. 1–6.
[5] Anumanchipalli, G.K., Oliveira, L.C., Black, A.W., 2012. Intent transfer in speech-to-speech machine translation, in: Proceedings of the Spoken Language Technology Workshop (SLT), IEEE. pp. 153–158.
[6] Bokova, I., 2012. World Arabic Language Day. http://www.unesco.org/new/en/unesco/events/prizes-and-celebrations/celebrations/international-days/world-arabic-language-day/ [accessed 23 October 2017].
[7] Brown, K.L., George, E.B., 1995. CTIMIT: A speech corpus for the cellular environment with applications to automatic speech recognition, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1995), IEEE. pp. 105–108.
[8] Canavan, A., Zipperlen, G., Graff, D., 1997. CALLHOME Egyptian Arabic Speech. Technical Report. Linguistic Data Consortium (LDC), University of Pennsylvania, Philadelphia, USA. LDC Catalog No: LDC97S45, http://catalog.ldc.upenn.edu/LDC97S45 [accessed 23 October 2017].
[9] CIA, 2013. The World Factbook. https://www.cia.gov/library/publications/the-world-factbook/ [accessed 23 October 2017].
[10] Clive, H., 2004. Modern Arabic: Structures, Functions and Varieties. Georgetown Classics in Arabic Languages and Linguistics series, revised ed., Georgetown University Press, Washington, DC, USA.
[11] Elmahdy, M., Gruhn, R., Minker, W., 2012. Novel Techniques for Dialectal Arabic Speech Recognition. Springer, Berlin, Germany.
[12] Erjavec, T., 2004. MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora, in: Proceedings of LREC, pp. 2544–2547.
[13] Ethnologue, 17th ed., 2013. Arabic, Standard. http://www.ethnologue.com/language/arb [accessed 23 October 2017].
[14] Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., Zue, V., 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Technical Report. Linguistic Data Consortium (LDC), University of Pennsylvania, Philadelphia, PA, USA. LDC Catalog No: LDC93S1, http://catalog.ldc.upenn.edu/LDC93S1 [accessed 23 October 2017].
[15] Habash, N., 2010. Introduction to Arabic Natural Language Processing. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers, Williston, VT, USA. doi:10.2200/S00277ED1V01Y201008HLT010.
[16] Habash, N., Soudi, A., Buckwalter, T., 2007. On Arabic Transliteration. Springer, Berlin, Germany. pp. 15–22.
[17] Jankowski, C., Kalyanswamy, A., Basson, S., Spitz, J., 1990. NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1990), IEEE. pp. 109–112.
[18] LaRocca, C.S.A., Chouairi, R., 2002. West Point Arabic Speech Corpus. Technical Report. Linguistic Data Consortium (LDC), University of Pennsylvania, Philadelphia, USA. LDC Catalog No: LDC2002S02, http://catalog.ldc.upenn.edu/LDC2002S02 [accessed 23 October 2017].
[19] Lee, K.F., Hon, H.W., Reddy, R., 1990. An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech and Signal Processing 38, 35–45.
[20] Maamouri, M., Graff, D., Cieri, C., 2006. Arabic Broadcast News Speech. Technical Report. Linguistic Data Consortium (LDC), University of Pennsylvania, Philadelphia, USA. LDC Catalog No: LDC2006S46, http://catalog.ldc.upenn.edu/LDC2006S46 [accessed 23 October 2017].
[21] Pérez, A., Alcaide, J.M., Torres, M.I., 2012. EuskoParl: a speech and text Spanish-Basque parallel corpus, in: Proceedings of the 13th International Conference on Spoken Language Processing (INTERSPEECH 2012), Portland, Oregon, USA. pp. 2362–2365.
[22] Sphinx, 2009. Sphinx 3.0.8 [software]. http://sourceforge.net/projects/cmusphinx/files/sphinx3/0.8/ [accessed 23 October 2017].
[23] Sphinxtrain, 2011. Sphinxtrain 1.0.7 [software]. http://sourceforge.net/projects/cmusphinx/files/sphinxtrain/1.0.7/ [accessed 23 October 2017].
[24] WCAG 2.0, 2008. Web Content Accessibility Guidelines (WCAG) 2.0. https://www.w3.org/TR/WCAG20/ [accessed 23 October 2017].