Improved Arabic speech recognition system through the automatic generation of fine-grained phonetic transcriptions




ARTICLE IN PRESS


Information Processing and Management 000 (2017) 1–11

Contents lists available at ScienceDirect

Information Processing and Management journal homepage: www.elsevier.com/locate/infoproman

Eiman Alsharhan a,∗, Allan Ramsay b

a Kuwait University, Kuwait
b University of Manchester, England, United Kingdom

Article info

Article history: Received 1 November 2016 Revised 29 May 2017 Accepted 14 July 2017 Available online xxx

Abstract

This paper aims at determining the best way to exploit the phonological properties of the Arabic language in order to improve the performance of the speech recognition system. One of the main challenges facing the processing of Arabic is the effect of the local context, which induces changes in the phonetic representation of a given text, thereby causing the recognition engine to misclassify it. The proposed solution is to develop a set of language-dependent grapheme-to-allophone rules that can predict such allophonic variations and hence provide a phonetic transcription that is sensitive to the local context for the automatic speech recognition system. The novel aspect of this method is that the pronunciation of each word is extracted directly from a context-sensitive phonetic transcription rather than a predefined dictionary that typically does not reflect the actual pronunciation of the word. The paper also aims at employing the stress feature as one of the supra-segmental characteristics of speech to enhance the acoustic modelling. The effectiveness of applying the proposed rules has been tested by comparing the performance of a dictionary-based system against one using the automatically generated phonetic transcription. The research reported an average of 9.3% improvement in the system's performance by eliminating the fixed dictionary and using the generated phonetic transcription to learn the phone probabilities. Marking the stressed vowels with separate stress markers leads to a further improvement of 1.7%.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction and objectives

Delivering a robust ASR system that can be applied in a wide range of applications has motivated researchers to find solutions for the different kinds of problems that arise in the field. Generally speaking, the performance of a speech recogniser can be affected by every change made to the speech signals, no matter how small, such as changes resulting from using different microphones. Many factors contribute to making the delivery of robust ASR systems a challenging task, such as inter- and intra-speaker variability, dealing with noisy backgrounds, and the complexity of natural language, including the fact that co-articulation effects mean that phonemes have different acoustic realisations in different situations. Besides the aforementioned difficulties, each language has its own peculiar characteristics. Arabic, for instance, poses a number of challenges somewhat different from other languages for which ASR systems have been developed. These arise from a variety of sources: the gap between Modern Standard Arabic (MSA) and dialectal Arabic, the complex morphology

Corresponding author. E-mail address: [email protected] (E. Alsharhan).

http://dx.doi.org/10.1016/j.ipm.2017.07.002 0306-4573/© 2017 Elsevier Ltd. All rights reserved.

Please cite this article as: E. Alsharhan, A. Ramsay, Improved Arabic speech recognition system through the automatic generation of fine-grained phonetic transcriptions, Information Processing and Management (2017), http://dx.doi.org/10.1016/j.ipm.2017.07.002


Table 1
Multiple pronunciations for the word (min).

word    pronunciation   context
(min)   /min/           (min AiDoTirAbAt)
        /miɲ/           (min jihap)
        /mim/           (min buTuwlapi)
        /mir/           (min rijAli)
        /miŋ/           (min qibali)
        /miɱ/           (min fariyqi)
        /mil/           (min lawmi)
        /mir/           (min ra}iysi)

Table 2
Experimental results (the last three columns give absolute accuracy).

data set          training size   testing size   fixed dict.   multi dict.   phonetic transcription
Pilot study       1.45 h          28 min         82.1%         86.6%         93.4%
Main experiment   19 h            4.45 h         63.7%         69.1%         71.0%

Table 3
Results of adding the stress value in the phonetic transcription. The table provides a comparison between the baseline system (no stress) and different levels of adding the stress marker.

Phonetic transcription level                             Accuracy
Baseline system (no stress)                              71%
Stress marker attached to consonants and vowels          64.9%
Stress marker attached to vowels                         68.3%
Stress marker as a separate phoneme following vowels     72.7%

which increases the perplexity and out-of-vocabulary rates, and the significant differences between spoken and written forms.

The main objective of the work described here is to deal with the problem of within-word and cross-word pronunciation variation in MSA speech resulting from coarticulation effects. This is done by designing a tool that automatically generates a context-sensitive phonetic transcription for a given text in order to limit the sources of variation. This work also involves investigating the effectiveness of incorporating the stress value during acoustic modelling. This is done by developing a set of precise word-stress rules that can be used in combination with the developed grapheme-to-allophone rules. The effectiveness of using the alternative phonetic transcription was tested by running different HMM-based experiments, using version 3.4.1 of the Hidden Markov Model ToolKit (HTK).1 The results in Tables 2 and 3 show modest but useful improvements over the use of dictionaries containing multiple pronunciations for individual words.

1.1. Relation to previous work

It is widely believed in the literature that pronunciation variation is one of the major factors that lead to a deterioration in the performance of ASR systems, especially in continuous speech systems (AbuZeina, Al-Khatib, Elshafei, & Al-Muhtaseb, 2012; Lehr, Gorman, & Shafran, 2014; Ng & Hirose, 2012; Sumner, Kurumada, Gafter, & Casillas, 2013). Therefore, the effort put into modelling pronunciation variation for ASR has increased lately. Roughly speaking, two main techniques are used in the literature to account for pronunciation variation: data-driven approaches (Chen, Chen, Lim, & Ma, 2015; Lu, Ghoshal, & Renals, 2013; McGraw, Badr, & Glass, 2013; Razavi, Rasipuram, & Doss, 2016; Schlippe, Ochs, & Schultz, 2014; Tsujioka, Sakti, Yoshino, Neubig, & Nakamura, 2016) and knowledge-based approaches (Deri & Knight, 2016; Smirnov et al., 2016).
The two approaches differ in how the information on pronunciation variation is obtained. In data-driven approaches, the pronunciation variants are derived solely from the acoustic data. On

1 http://htk.eng.cam.ac.uk/.



the other hand, knowledge-based approaches use information derived from linguistic sources, such as pronunciation dictionaries and linguistic findings on pronunciation variation. The problem of pronunciation variation is more significant in Arabic than in other languages due to the great influence of context on the pronunciation of letters. A review of the literature found only one work that used a data-driven approach to model within-word Arabic pronunciation variation, reported in AbuZeina et al. (2012) and Nahar, Al-Muhtaseb, Al-Khatib, Elshafei, and Alghamdi (2015). The variants are distilled from the training corpus and then added to the system's dictionary as well as the language model. Results of applying this method show no improvement when expanding the pronunciation dictionary alone; however, the Word Error Rate (WER) was reduced by 2.22% when those pronunciation variants were represented in the language model. In contrast, the knowledge-based approach has received great interest among researchers modelling Arabic pronunciation variation. The standard approach involves generating an Arabic multi-pronunciation dictionary that includes many possible pronunciations for each word. For instance, Biadsy, Habash, and Hirschberg (2009) generated a multi-pronunciation dictionary for MSA using pronunciation rules, and then used MADA (Habash, Rambow, & Roth, 2009), a morphological disambiguation tool, to determine the most likely pronunciation of a given word in context. The proposed method reported a significant improvement of 4.1% in accuracy compared to the baseline system. Ali, Elshafei, Al-Ghamdi, Al-Muhtaseb, and Al-Najjar (2008) provided a limited set of phonetic rules for the automatic generation of an Arabic phonetic dictionary. The rules were mainly direct grapheme-to-phoneme rules, with a few rules for the assimilation of "lAm" with solar letters, the conversion of (n) to (m) when followed by (b), and the interaction of emphatic consonants with pharyngealised vowels.
The effectiveness of using the generated dictionary was tested using a large-vocabulary speaker-independent Arabic ASR system, which achieved accuracy comparable to an English ASR system with a similar vocabulary size. The work of Ali et al. (2008) was subsequently used in several other studies, such as Alghamdi, Elshafei, and Al-Muhtaseb (2009), Abuzeina, Al-Khatib, Elshafei, and Al-Muhtaseb (2011), and AbuZeina et al. (2012). In addition, the Qatar Computing Research Institute (Ali et al., 2014) provided a complete recipe and language resources for building ASR systems for MSA using the KALDI toolkit. The text was first processed using MADA. To construct the lexicon, words that occurred more than once in the news archive were selected. The lexicon has 526K unique grapheme words, with 2M pronunciation entries. Using the reported recipe with the multi-pronunciation dictionary led to an average WER of 26.95%. A more recent study was carried out by Masmoudi, Khmekhem, Estève, Belguith, and Habash (2014). The study presented a tool for automatically generating a pronunciation dictionary for Tunisian Arabic. This rule-based tool includes grapheme-to-phoneme mapping, a lexicon of exceptions, and a set of phonetic rules. The resulting software was tested on a word list in Tunisian Arabic using two independent test sets and reached an error rate of 9%.

1.2. An alternative approach: generating a fine-grained phonetic transcription

It can be gathered from the studies reviewed above that the predominant methodology for tackling the problem of cross-word and within-word pronunciation variation is to generate an expanded version of the system's dictionary that includes many possible pronunciations for each word. The choice is then left to the speech recognition engine to pick the closest pronunciation for the word given the acoustic evidence.
Doing this ignores the fact that the choice of which pronunciation is to be used is conditioned by the surrounding phonetic (phonological) context. The proposed research has investigated using a set of pronunciation rules to provide a contextually determined phonetic transcription. The HTK requires a number of source files in order to prepare the data and transform it into a format that can be processed during training and testing. In addition to the training sentences, a pronunciation dictionary is normally provided prior to invoking the data preparation tools. Because we want to generate a context-sensitive phonetic transcription for the training text, we changed the workflow of the recogniser by incorporating a set of pronunciation rules in place of the fixed dictionary, as shown in Fig. 1. The proposed work provides a comprehensive set of conversion rules that can handle cross-word and within-word pronunciation variation in continuous speech. Instead of expanding the pronunciation dictionary with multiple pronunciations of each word, the generated phonetic transcription provides only one possible pronunciation for each word in each context. The recognition tool is then forced to use the generated pronunciation of the word rather than performing complex alignment tasks to choose the most applicable pronunciation. The experimental evidence confirms that this method outperforms using a fixed pronunciation dictionary as well as a multi-pronunciation dictionary, as will be discussed in Section 3.2.

1.3. Contents of the article

The rest of the paper is divided into four parts. Section 2 describes the algorithm for generating the phonetic transcription for Arabic texts, starting with a brief overview of the phonetic and phonological properties of Arabic. This is followed by stating the advantages of the developed system. The section ends by outlining the main parts of the developed grapheme-to-allophone system.
Section 3 lays out the experimental dimensions of the research, starting by describing the corpora used in training and testing the ASR system. It then gives the results of testing the three developed systems: a fixed-dictionary based system, a multi-pronunciation dictionary based system, and a phonetic transcription based system. Section 4 reports the results of adding the stress value to the generated phonetic transcription. Section 5 discusses the problem addressed in this research and draws conclusions from the outcomes.


Fig. 1. HTK workflow when using the generated phonetic transcription instead of the predefined dictionary.

2. Deriving an Arabic phonetic transcription algorithm

This section outlines the design of a comprehensive system for automatically generating a phonetic transcription of a given Arabic text which closely matches the pronunciation of the speakers. The designed system is based on a set of language-dependent pronunciation rules that convert a fully diacriticised Arabic text into the actual sounds, besides a lexicon for exceptional words. This is a two-phase process: one-to-one grapheme2-to-phoneme3 conversion and phoneme-to-allophone4 conversion using a set of "phonological rules". Phonological rules operate on the phonemes and convert them to the actual sounds, considering the neighbouring phones or the containing syllable or word. A summarised description of the developed algorithm is given in this section.

2.1. Phonetics and phonological properties of Arabic

With a total of 34 phonemes (6 vowels and 28 consonants), the allowed syllables in Arabic are: CV, CVC, and CVCC, where V indicates a vowel (long or short) and C indicates a consonant. The correspondence between the spellings and the pronunciation of the lexical items in Arabic is deterministic. The process of mapping from graphemes to phonemes consists of a large number of one-to-one rules and a set of deterministic rules that assign the appropriate pronunciation by inspecting the syllabic structure. For instance, there are three graphemes in Arabic each of which represents two different phonemes from different phonetic classes. Those graphemes are: (A), which represents hamza /ʔ/ and long Alif /a:/, and the semi-vowels (w and y), which represent the consonants /w/ and /y/ besides the long vowels /u:/ and /i:/ respectively. Consider the pronunciation of the following examples:

• (AimtiHAn) –> /ʔimtiħa:n/ "test"

2 The minimal unit of the writing system of a language.
3 The minimal unit of the sound system of a language.
4 One of a set of multiple possible spoken sounds used to pronounce a single phoneme.
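Since every syllable in the inventory described above begins with exactly one consonant followed by a vowel (long vowels counting as a single nucleus), syllabification is largely deterministic. The sketch below illustrates that idea under simplifying assumptions; the (symbol, kind) representation and the function name are ours, not the authors' implementation:

```python
def syllabify(phones):
    """Split a list of (symbol, kind) pairs, kind in {'C', 'V'}, into
    Arabic syllables. Every syllable starts with one consonant onset
    followed by a vowel nucleus; consonants join the coda only when
    no vowel follows them, yielding CVC/CVCC shapes."""
    syllables, i = [], 0
    while i < len(phones):
        syll = [phones[i][0]]          # onset: exactly one consonant
        i += 1
        if i < len(phones) and phones[i][1] == 'V':
            syll.append(phones[i][0])  # nucleus (a long vowel is one symbol)
            i += 1
        # coda: consonants that do not begin a following syllable
        while (i < len(phones) and phones[i][1] == 'C'
               and not (i + 1 < len(phones) and phones[i + 1][1] == 'V')):
            syll.append(phones[i][0])
            i += 1
        syllables.append(''.join(syll))
    return syllables

# (maktab) "office" -> [mak-tab]
word = [('m', 'C'), ('a', 'V'), ('k', 'C'), ('t', 'C'), ('a', 'V'), ('b', 'C')]
print(syllabify(word))  # ['mak', 'tab']
```

The syllable boundaries produced this way are what the later phonological and stress rules inspect.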



• (wurwd) –> /wuru:d/ "flowers"
• (yarmy) –> /jarmi:/ "he throws"

The pronunciation of those graphemes is determined according to their position in the syllable. In general this is a fairly straightforward task and can be settled by looking at the adjacent graphemes. The situation does become more challenging when two of these graphemes appear adjacent to one another: 'y', for instance, is typically a vowel when it is followed by a consonant, and 'w' is typically a vowel when it is preceded by a consonant. But if 'y' is immediately followed by 'w', we cannot use facts about 'y' to work out whether 'w' is a vowel or a consonant, and we cannot use facts about 'w' to decide whether 'y' is a vowel or a consonant. It is, nonetheless, always possible to settle this issue by looking at the remainder of the word. The main challenge faced in converting phonemes to allophones is the coarticulation effect, where speech sounds are influenced by a preceding or following sound. An example of this can be seen in Table 1, which provides the result of generating the phonetic transcription of the preposition (min) "from" when it occurs in different phonetic contexts. This process is called nasal assimilation.

2.2. Advantages of the system

There are two advantages of the proposed system:

• Phonetic transcription is an essential source for training the speech recogniser. Typical ASR systems need to have the sounds associated with their textual and phonetic transcription. Comprehensive pronunciation dictionaries cannot be found for Arabic due to the huge number of possible word forms. As an alternative, the system described here automatically generates an accurate replacement for the lexicon needed by ASR systems, regardless of the number and kind of vocabulary items.
• Unlike ordinary recognition systems, which use a dictionary only during the recognition phase, the proposed system uses a context-sensitive phonetic transcription during both the training and recognition phases, which has a better effect on the performance.

2.3. The architecture of the system

This section reviews the outlines of our transcription algorithm for MSA. This system utilises a comprehensive set of pronunciation rules and a lexicon which introduces exceptional words, numbers, abbreviations, symbols, and acronyms. Due to space limitations, only the main steps are presented, with a few examples. A detailed description of the system can be found in Ramsay, Alsharhan, and Ahmed (2014). The accuracy of the transcription system was evaluated by getting an experienced annotator to transcribe recordings of the same materials. This evaluation shows that the generated transcription is a reasonable approximation to reality, as it matches the manual transcription in 98.3% of cases. The 1.7% of cases where there were discrepancies were largely explained by the presence of loan words (which would not be expected to follow the standard Arabic rules), errors in vowel production by the native speakers, and minor accentual effects. Please refer to Ramsay et al. (2014) for more details. Fig. 2 describes the architecture of the developed grapheme-to-allophone system.

2.3.1. Text pre-processing

This is an essential front-end for any system that deals with transcribing text. Basically, it manipulates the information from the textual input and prepares the text to undergo further processing by the system. This includes: restoring the diacritics of the text; segmenting the text (into words, syllables, and phoneme boundaries); deleting unnecessary symbols such as (sukun); removing letters that do not correspond to any sound in the word's pronunciation; and labelling the characters with their consonant and vowel values.

2.3.2. Grapheme-to-phoneme mapping

This includes a set of one-to-one rules that inspect the graphemes and convert them into phonemes. This is a straightforward conversion; the task of these rules is to map the Arabic graphemes to their matching phonemes. Some Arabic graphemes cannot be included in this conversion phase, namely: Alif, the semi-vowels, and the feminine marker (ta' marbuTa). In the provided rules, we use the Buckwalter transliteration scheme (Habash, Soudi, & Buckwalter, 2007) and the Speech Assessment Methods Phonetic Alphabet (SAMPA) (UCL, 2002) phonetisation scheme to represent the letters and the sounds of the language, respectively. The reason we use the SAMPA notation rather than the IPA in writing the program is that the IPA is not an ASCII-friendly scheme.

2.3.3. Phonological rules

This part describes the development of a set of phonological rules that convert phonemes into phones. These rules are intended to control the allophonic variation. In other words, these are context-sensitive rules that operate on phonemes to convert them to phones in specific contexts. This section gives a short description and only one example of each variation type. In writing these rules, common Arabic pronunciation rules, along with some of the Tajweed rules (the tradition of the Holy Qur'an's recitation), have been accommodated. Only rules that were found to comply with MSA speech were chosen, in addition to a number of novel rules obtained by analysing Arabic spoken by speakers from different regions.


Fig. 2. The architecture of the developed grapheme-to-allophone system.
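The one-to-one grapheme-to-phoneme phase is essentially a table lookup from Buckwalter symbols to SAMPA symbols, with the ambiguous graphemes deferred to the later context-sensitive rules. The fragment below is a small illustrative subset of such a mapping, not the authors' full table, and the individual SAMPA choices shown are assumptions for the sketch:

```python
# Illustrative subset of a Buckwalter -> SAMPA mapping; a real system
# would cover the whole alphabet. Ambiguous graphemes (Alif, the
# semi-vowels, ta' marbuTa) are passed through unchanged for the later
# context-sensitive phase to resolve.
BW_TO_SAMPA = {
    'b': 'b', 't': 't', 'j': 'dZ', 'd': 'd', 'r': 'r', 'z': 'z',
    's': 's', '$': 'S', 'f': 'f', 'q': 'q', 'k': 'k', 'l': 'l',
    'm': 'm', 'n': 'n', 'h': 'h',
    'a': 'a', 'i': 'i', 'u': 'u',  # short-vowel diacritics
}
AMBIGUOUS = {'A', 'w', 'y', 'p'}   # resolved by syllabic context later

def g2p(graphemes):
    """One-to-one mapping; ambiguous graphemes are left untouched."""
    return [g if g in AMBIGUOUS else BW_TO_SAMPA.get(g, g)
            for g in graphemes]

print(g2p(list('min')))   # ['m', 'i', 'n']
print(g2p(list('$ams')))  # ['S', 'a', 'm', 's']
```

The output of this phase is the phoneme string on which the phonological rules of Section 2.3.3 then operate.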

The rules, which are implemented as a set of two-level finite state automata, are formalised in the following way:

[A]==> [B]: [X]##[Y].

The rule has two main parts separated by a colon. The first part describes the conversion process by indicating what sound(s) is changed and what the sound(s) changes to. The second part of the rule indicates the environment in which the phonological rule is applied: X is the backwards context and Y is the forwards context. The double hashes represent the location of the sound(s) to be changed. Square brackets enclose individual sounds, features the sounds have in common (written in curly brackets), or null.

• Assimilation (total and partial): Assimilation refers to the influence of one sound upon the articulation of another, so that the two sounds become alike (partial) or even identical (total). Many kinds of assimilation can be found in Arabic, as it is the main coarticulation phenomenon. To save space, only one example of each assimilation type is given, but it should be recognised that each of these examples stands for a large number of cases. Total assimilation can be found in Arabic in the pronunciation of the definite article (Al). The lateral of the Arabic definite prefix has different phonetic realisations depending on the type of the initial consonant of the noun to which it is prefixed. The "lAm" is totally assimilated when followed by any one of a group of fourteen consonants called "the solar letters". For example, consider the word (Al$ams) "the sun", where the "lAm" is assimilated with /ʃ/, so it is pronounced /ʔaʃʃams/. The following pair of rules describes this process.

[l]==>[c0]: [‘A’]##[{c0,+solar},???].
[l]==>[c0]: [l,i]##[{c0,+solar},???].

The first of these says that if the sound /l/ occurs in a context where it is preceded by /A/ and followed by a solar consonant c0, then it will be replaced by a second copy of c0. The second says that the same thing will happen if /l/ is preceded by /li/ and followed by a solar consonant. Note that these rules apply to phonemes, not to graphemes. However, because each grapheme realises one phoneme, and each phoneme is realised by one grapheme, we do from time to time mention the corresponding graphemes when specifying rules. An example of partial assimilation is the sound /n/, which adopts the labiality of the consonant /f/ and is assimilated to /ɱ/ every time it is followed by /f/, either within a word or across a word boundary. For example, the word (yanfad) "to run out" is pronounced /jaɱfad/. This kind of assimilation is controlled by the following rule:


[n]==>[ɱ]: [???] ## [f, ???].
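Read procedurally, a rule of this shape rewrites the focus sound whenever its left and right neighbours match the stated contexts. The following is a minimal illustrative interpreter for single-phone contexts; it is a simplification of the paper's two-level finite-state formalism, and 'F' is an ASCII stand-in for the labiodental nasal [ɱ]:

```python
# Each rule: (focus phone, replacement, context test on the two
# neighbouring phones). None marks an utterance edge.
RULES = [
    ('n', 'F', lambda left, right: right == 'f'),  # n -> [ɱ] / __ f
    ('n', 'm', lambda left, right: right == 'b'),  # n -> m   / __ b
]

def apply_rules(phones):
    """Apply the first matching context-sensitive rule at each position."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        for focus, repl, ctx in RULES:
            if p == focus and ctx(left, right):
                p = repl
                break
        out.append(p)
    return out

# (yanfad) "to run out": /janfad/ -> [jaɱfad]
print(apply_rules(list('janfad')))  # ['j', 'a', 'F', 'f', 'a', 'd']
```

Cross-word assimilation follows from running the same pass over the concatenated phone string of a whole phrase rather than over isolated words.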

• Neutralisation: In MSA, there is a contrast in sound length in some contexts. Neutralisation can be observed with short and long vowels as well as double consonants. For instance, double consonants are shortened and pronounced as a single consonant when they occur utterance-finally, as in (Hajj) "pilgrimage", which is pronounced /ħaʤ/. In order for this conversion to apply, the word must not be followed by a vowel, unlike the case where it is pronounced /ħaʤʤa/. The rule responsible for this kind of variation is:

[{c0, grapheme=A, +final@word}] ==> []: [{c1, grapheme=A}] ## [].

• Insertion: This process is about inserting a phonetic element into a string without it having an orthographic representation, sometimes referred to as "epenthesis". An example of this is the insertion of the short vowel /i/ between two words (where the first ends with a consonant and the second starts with "hamzatu Alwasl"). This can be seen in phrases like (man AlqAdim) "who is coming?", which is pronounced /mani lqa:dim/. The phonological rule below expresses the short vowel insertion process before "hamzatu Alwasl":

[{‘A’, +initial@word}] ==> [i, ‘A’]: [???, c0] ## [c1, c2, v1, ???].
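This insertion rule can be paraphrased at the word-sequence level: when a consonant-final word is followed by a word beginning with hamzatu Alwasl, insert a short /i/ at the junction. A hedged sketch follows, where the `starts_with_wasl` predicate is a stand-in for the real orthographic check and the phoneme spelling is simplified:

```python
def ends_in_consonant(word):
    # Short vowels are single characters; long vowels end in ':'.
    return word[-1] not in 'aiu' and word[-1] != ':'

def insert_epenthetic_i(words, starts_with_wasl):
    """Insert the epenthetic short vowel /i/ between a consonant-final
    word and a following word beginning with hamzatu Alwasl."""
    out = []
    for k, w in enumerate(words):
        out.append(w)
        nxt = words[k + 1] if k + 1 < len(words) else None
        if nxt is not None and ends_in_consonant(w) and starts_with_wasl(nxt):
            out.append('i')
    return out

# (man AlqAdim) "who is coming?" -> /mani lqa:dim/. Here the second
# word is already reduced to /lqa:dim/ and a stand-in predicate flags
# it as wasl-initial.
print(insert_epenthetic_i(['man', 'lqa:dim'], lambda w: True))
# ['man', 'i', 'lqa:dim']
```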

• Sound pharyngealisation: The /r/ sound may or may not be pharyngealised depending on the context. It is not pharyngealised whenever it is preceded or followed by "kasra" /i/ within the same syllable, e.g. "a trip" and (sir) /sir/ "a secret". In contrast, it is pharyngealised in all other cases, such as (rajul) "a man" and (wurwd) /wuru:d/ "flowers". This is described in the following rules:

[{r,+coda@syllable}] ==> [{r,+pharyngealised,+coda@syllable}]: [???,{v0,open=O,rounded=R}] ## [???] if not (O = - and R = -).
[{r,+onset@syllable}] ==> [{r,+pharyngealised,+onset@syllable}]: [???] ## [{v0,open=O,rounded=R},???] if (O = - and R = -).

2.3.4. Aligning stress into syllables

The process of incorporating stress markers comes at the end because it depends mainly on the internal structure of the syllables that make up the word, which may change greatly after applying the phonological rules. The stress rules presented here are based on the standard pronunciation of MSA. However, a speaker's accent may shift the stress position; for example, Egyptian speakers pronounce the word (maktabap) "library" as [mak-ta-ba] with stress on [ta] rather than in the standard position. The stress rules are:

1. If the word contains a word-final superheavy syllable (CVVC or CVCC), this syllable must be stressed. Consider the words (sijil~) "register" [si-jill] and (rah~Al) "traveller" [rah-ha:l]. This kind of syllable can only occur once in a word.
2. In words with open heavy syllables (CVV), the stress is placed on the last such syllable. Consider, for instance, (dAris) "student" [da:-ris] and "students" [da:-ri-su:-na].
3. If the word has no superheavy syllable or open heavy syllable, the position of the stressed syllable depends on the number of syllables in the word:
   • In disyllabic words, the stress falls on the first syllable, e.g. (huwa) "he" [hu-wa] and (maktab) "office" [mak-tab].
   • In polysyllabic words, stress is placed on the antepenultimate syllable, e.g. (rasamat) "she drew" [ra-sa-mat] and (maktabatK) "a library" [mak-ta-ba-tun].

3. ASR experiments

This section lays out the experimental dimensions of the research by introducing the corpora used for training and testing the system. In addition, a comparison is made between the performance of three different systems. The first system is developed using a fixed dictionary with a single pronunciation for each word. The second system uses a multi-pronunciation dictionary, which gives different possible ways of pronouncing the words and allows the HMM to choose how to map each version of a word to a set of acoustic parameters, as used by Ali et al.
(2008), Hiyassat, Yaseen, and Arabiat (2007), Biadsy et al. (2009), Abuzeina et al. (2011), and Ali et al. (2014). The third system is built using the developed context-sensitive phonetic transcription, taking into account within-word and cross-word pronunciation variation. In designing the dictionary-based systems, the standard steps for running the HTK were applied. These steps involve a number of library modules and tools that allow for sophisticated analysis of the speech. However, in designing the context-sensitive phonetic transcription system, several modifications were made in the data preparation and training steps in order to call the grapheme-to-allophone system rather than the predefined dictionary. The main changes are:

• Canceling the dictionary management tool (HDMan), whose main task is to construct and edit the dictionary.


Fig. 3. An example of two phonetic transcriptions for a given sentence. The first row gives the surface form of the sentence, the second row gives the phonetic transcription that is based on a fixed dictionary, and the third row gives the transcription obtained by using the developed conversion rules.

• Canceling the label editor tool (HLEd), which is designed to create the phone-level master label file (MLF) from the information given in the dictionary.
• Canceling the realignment of the training data by skipping the recognition tool (HVite). This tool performs a forced alignment of the training data by considering all pronunciations of each word and choosing the best-matching pronunciation.

Canceling these essential steps was compensated for by introducing different commands at the appropriate points of training, so that the system could benefit from the grapheme-to-allophone system's output. Fig. 3 gives a parallel example of two phonetic transcriptions for a given sentence, one obtained by looking the words up in the fixed dictionary and one generated by applying the developed grapheme-to-allophone rules. This section also investigates the effectiveness of providing word-stress markers during the acoustic modelling, which could be of great value in enhancing the performance of speech recognition systems.

3.1. Training and testing corpora

Two datasets are used to test the effectiveness of the various strategies. The first, pilot, set was used to verify that the ideas being explored here were worth pursuing. The second, larger set was used to carry out a more substantial set of experiments, in order to confirm the effects observed on the basis of the pilot study.

3.1.1. Pilot data set

The pilot data set uses 20 manually created sentences varying in length between 3 and 7 words. Although this set contains just 20 sentences, those sentences were created carefully to ensure that they cover all the possible phonological variations in Arabic. The recordings were collected with the aid of 23 native Arabic speakers from different Arab regions, and the total recording size is around 2.13 h. The perplexity of this set was fairly low, which led to fairly high recognition rates in all experiments.

3.1.2. Main data set

The main data set is based on 300 automatically generated sentences varying in length between 2 and 6 words. The sentences are phonetically rich and balanced, in order to produce a robust Arabic speech recogniser. 54 native Arabic speakers covering the main dialectal regions were asked to record a number of sentences; the total recording size is about 23.45 h. The perplexity of this set was higher than for the pilot experiments, with consequently lower recognition rates.

3.2. Testing the effectiveness of using the generated phonetic transcription

To evaluate the effectiveness of using a context-sensitive phonetic transcription rather than a standard dictionary, three recognition systems were developed for each data set: a fixed-dictionary-based system, a multi-pronunciation-dictionary-based system, and a phonetic-transcription-based system. The recordings were split into training and testing sets. The pilot data set contained about 1.45 h of audio used for training the system and about 28 min used for testing. The recordings in the main data set were divided into 19 h of audio files for training and about 4.45 h for testing.

For the best use of the data and to avoid unfair testing, the research uses a 5-fold cross-validation approach for assessing the proposed systems. This involves randomly partitioning the data into 5 equal-sized subsets, performing the training on 4 subsets and validating the performance on the remaining subset. This process is repeated 5 times, with each of the 5 subsets used exactly once as the validation data. The results from the 5 folds can then be averaged to compute a single estimate. The advantage of this method over the standard evaluation approach is that all observations are used for both training and testing, providing more robust testing for experiments with limited data sources.

Please cite this article as: E. Alsharhan, A. Ramsay, Improved Arabic speech recognition system through the automatic generation of fine-grained phonetic transcriptions, Information Processing and Management (2017), http://dx.doi.org/10.1016/j.ipm.2017.07.002
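The 5-fold procedure described above can be sketched in a few lines. Here `train_and_score` is a hypothetical stand-in for the full HTK train-and-recognise pipeline, which takes training and test partitions and returns an accuracy figure:

```python
import random

def five_fold_cv(recordings, train_and_score, k=5, seed=0):
    """Randomly partition `recordings` into k equal-sized folds; train on k-1
    folds, test on the held-out fold, and average the k accuracy scores.
    Each recording is used exactly once for validation."""
    items = list(recordings)
    random.Random(seed).shuffle(items)          # random partitioning
    folds = [items[i::k] for i in range(k)]     # k interleaved subsets
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k                      # single averaged estimate

# With a dummy scorer, each fold holds 2 of 10 recordings (score ~0.2 per fold):
print(five_fold_cv(list(range(10)),
                   lambda train, test: len(test) / (len(train) + len(test))))
```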


3.2.1. Results

For the pilot data set, using a fixed pronunciation dictionary led to 82.1% absolute accuracy⁵, whilst using a multi-pronunciation dictionary produced 86.6% absolute accuracy. Using the generated phonetic transcription for training achieved a further improvement, to 93.4% absolute accuracy. In other words, using the phonetic rules to tell the system explicitly which phonetic transcription applies in a given instance is more effective than supplying it with a dictionary containing multiple possible pronunciations and using forced alignment to choose between them in each case. (Doing this also shortens the training time, since it cuts out a stage in which an initial model is used to carry out forced alignment, followed by a number of further rounds of training on the newly transcribed data.) These results were confirmed when the same set of experiments was carried out on the larger data set, with accuracy rising from 63.68% for the fixed dictionary to 69.1% with a multi-entry dictionary and 71% with the generated transcription. These results are shown in Table 2.

3.3. Testing the effectiveness of incorporating stress value during the acoustic modelling

In this section, the stress phenomenon is addressed as one of the supra-segmental characteristics of speech. Although word stress in MSA is non-phonemic, meaning it cannot be used to distinguish meaning (see Al-Ani, 1970; Odisho, 1976), researchers confirm that well-defined stress rules can help enhance Arabic speech processing applications (Halpern, 2009; Ramsay & Mansour, 2008). A number of experiments were conducted, using only the main dataset, to test the effectiveness of adding stress values to syllables in the generated phonetic transcription. Three ways of carrying out this testing were considered: introducing a stressed version of every phoneme, a stressed version of just the vowels, or a separate stress marker following each stressed vowel.

3.3.1. Including a stress marker for every stressed phoneme

In building a speech recogniser that uses a stressed version of every phoneme, the transcription system produces a context-sensitive phonetic transcription that applies all the grapheme-to-allophone conversion rules, in addition to the ones that mark the consonants and vowels of stressed syllables. Stressed syllables are marked in the transcription by placing a distinctive symbol just after the stressed consonants and vowels. The following is an example of a generated transcription that includes stress markers:

Sentence: Alwaladu yatanAwalu AlTaEAm
Transcription: /Q a l w' a' l a d u - y a t a n' aa' w a l u - tˆ tˆ aˆ E' aa' m'/

The symbol " ' " indicates that the sound is stressed, so /w'/ is a stressed phoneme while /w/ is unstressed. As far as the HTK is concerned, these are two unrelated phonemes. Unexpectedly, activating the stress rules to mark all stressed phonemes made the overall performance of the system deteriorate by 6.1%: the system reported 61.60% word recognition accuracy when the stress marker was attached to the contents of stressed syllables, whereas the baseline system reported 67.7%. We have always maintained that providing the most sensitive phonetic transcription would have a positive impact on recognition performance; the result of activating the full stress rules does not support that assertion.

3.3.2. Including a stress marker only for stressed vowels

The second way of investigating the influence of incorporating stress information uses a phonetic transcription that marks only stressed vowels. The fact that stress is more distinguishable on vowels than on consonants further motivated us to carry out this type of testing.
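The three marking schemes can be sketched as simple transformations over a syllabified phone sequence. The vowel set, the syllabification, and the example word below are invented for illustration; the paper's transcription system derives stress placement from its own syllabification rules:

```python
VOWELS = {"a", "i", "u", "aa", "ii", "uu"}  # assumed vowel inventory (illustrative)

def mark_all(syllables, stressed):
    """Scheme 1: attach the marker ' to every phone of the stressed syllable."""
    return [[p + "'" for p in syl] if i == stressed else list(syl)
            for i, syl in enumerate(syllables)]

def mark_vowels(syllables, stressed):
    """Scheme 2: attach the marker only to vowels of the stressed syllable."""
    return [[p + "'" if p in VOWELS else p for p in syl] if i == stressed else list(syl)
            for i, syl in enumerate(syllables)]

def separate_marker(syllables, stressed):
    """Scheme 3: insert ' as a phone of its own after each stressed vowel."""
    out = []
    for i, syl in enumerate(syllables):
        marked = []
        for p in syl:
            marked.append(p)
            if i == stressed and p in VOWELS:
                marked.append("'")
        out.append(marked)
    return out

# Toy syllabification with stress on the first syllable:
word = [["w", "a"], ["l", "a"], ["d", "u"]]
print(mark_all(word, 0))         # scheme 1 marks both phones of the stressed syllable
print(mark_vowels(word, 0))      # scheme 2 marks only its vowel
print(separate_marker(word, 0))  # scheme 3 keeps the vowel and adds a stress phone
```

Only scheme 3 leaves the original phone symbols intact, which is what keeps the phoneme inventory small in the experiments below.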
When introducing a stressed version of the vowels, the transcription system produces a context-sensitive phonetic transcription that marks only the vowels of the stressed syllables with " ' ", as in the following:

/Q a l w a' l a d u - y a t a n aa' w a l u - tˆ tˆ aˆ E aa' m/

The testing results show that the word recognition accuracy increased to 65.0%, which, although still below the accuracy obtained when stress markers are excluded from the transcription altogether, is an improvement over marking the full syllable.

3.3.3. Including a separate stress marker following every stressed vowel

Introducing the stress marker as a separate phoneme makes the transcription of the same sentence appear as follows:

/Q a l w a ' l a d u - y a t a n aa ' w a l u - tˆ tˆ aˆ E aa ' m/

In this transcription the vowels themselves are unchanged, and the stress marker is introduced as an additional phoneme. This reduces the number of phonemes that the HTK has to deal with: instead of having two different representations of each vowel (a stressed version and an unstressed version), only one representation of each vowel is used, together with a single additional stress phoneme that follows the stressed vowels. Evaluating this system shows that the word recognition accuracy jumped to 69.4%, outperforming all the systems previously mentioned.

This pattern of results might be explained by the reduction in the number of phonemes, and hence of HMMs. When stress is ignored in the transcription, the total number of phonemes in the recognition system is 44. By attaching the stress marker symbol to the stressed consonants and vowels, this number increases to 81. Such a large number of phonemes can lead to considerable confusion during the acoustic analysis, which in turn weakens the overall performance of the system, as observed above. The word recognition accuracy improved when the stress marker was attached only to vowels, which brings the total number of phonemes to 47. Adding the stress marker as a separate phoneme following the stressed vowels leads to the best performance, with only 45 phonemes. The main advantage of this method lies in its ability to indicate the stress positions in a word while keeping the number of phonemes to a minimum. Table 3 summarises the results reported in this section.

⁵ The absolute accuracy is calculated by excluding the result of recognising silences from the word recognition result.

4. Discussion and conclusion

This paper has addressed the problem of the pronunciation variation present in MSA speech. A grapheme-to-allophone algorithm has been developed to generate a context-sensitive phonetic transcription of the speech. The proposed algorithm has been tested using two different databases consisting of spoken Arabic sentences. The results obtained show an improvement in word recognition accuracy of between 9% and 11.3%. The research confirms that using the generated phonetic transcription outperforms the use of multi-pronunciation dictionaries, which are widely believed to be the best method for capturing pronunciation variation. The paper has also tested the effectiveness of incorporating a further level of phonological constraints for speech recognition tasks, namely stress. It has been found that introducing stress as a separate phoneme following stressed vowels outperforms the baseline system by 1.7% in accuracy.
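The trade-off described above can be tabulated directly. The sketch below simply records the reported inventory sizes and computes how many additional HMMs each scheme forces the recogniser to train relative to the stress-free baseline:

```python
# Phone-inventory sizes reported for the four configurations in Section 3.3.
BASE = 44  # phones when stress is ignored

def extra_hmms(total_phones: int) -> int:
    """Extra HMMs to train, relative to the stress-free inventory of 44."""
    return total_phones - BASE

reported = {
    "stress ignored": 44,
    "marker on stressed consonants and vowels": 81,
    "marker on stressed vowels only": 47,
    "separate stress phone after stressed vowels": 45,
}
for scheme, total in reported.items():
    print(f"{scheme}: {total} phones (+{extra_hmms(total)} HMMs)")
```

With a fixed amount of training audio, every extra HMM dilutes the data available per model, which is consistent with the best accuracy coming from the smallest stressed inventory (45 phones).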
The absolute accuracy of a speech recognition system depends on a huge number of factors: the size of the training data, the perplexity of the language (which in turn depends on the size of the vocabulary and the complexity of the grammar), the quality of the recordings, and the homogeneity of the subject population, to mention just a few. It is therefore difficult to make direct comparisons between the performance of different systems, especially in the absence of a widely and easily available shared dataset. What matters here is the improvement produced by using the generated phonetic transcription. The two datasets have different vocabularies, different grammars and different populations of speakers, and in both cases using context-sensitive phonetic transcriptions leads to substantial improvements in performance. It is believed that similar results are likely to be obtained with other data sets and other grammars and vocabularies.

References

Abuzeina, D., Al-Khatib, W., Elshafei, M., & Al-Muhtaseb, H. (2011). Cross-word Arabic pronunciation variation modeling for speech recognition. International Journal of Speech Technology, 14(3), 227–236.
AbuZeina, D., Al-Khatib, W., Elshafei, M., & Al-Muhtaseb, H. (2012). Within-word pronunciation variation modeling for Arabic ASRs: A direct data-driven approach. International Journal of Speech Technology, 1–11.
Al-Ani, S. (1970). Arabic phonology. The Hague, Paris: Mouton.
Alghamdi, M., Elshafei, M., & Al-Muhtaseb, H. (2009). Arabic broadcast news transcription system. International Journal of Speech Technology, 10(4), 183–195.
Ali, A., Zhang, Y., Cardinal, P., Dahak, N., Vogel, S., & Glass, J. (2014). A complete Kaldi recipe for building Arabic speech recognition systems. In Spoken language technology workshop (SLT), 2014 IEEE (pp. 525–529). IEEE.
Ali, M., Elshafei, M., Al-Ghamdi, M., Al-Muhtaseb, H., & Al-Najjar, A. (2008). Generation of Arabic phonetic dictionaries for speech recognition. In 2008 international conference on innovations in information technology (pp. 59–63). IEEE.
Biadsy, F., Habash, N., & Hirschberg, J. (2009). Improving the Arabic pronunciation dictionary for phone and word recognition with linguistically-based pronunciation rules. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 397–405). Association for Computational Linguistics.
Chen, W., Chen, N. F., Lim, B. P., & Ma, B. (2015). Corpus-based pronunciation variation rule analysis for Singapore English. In SLaTE (pp. 35–40).
Department of Phonetics and Linguistics, University College London (UCL) (2002). Speech assessment methods phonetic alphabet (SAMPA) for Arabic.
Deri, A., & Knight, K. (2016). Grapheme-to-phoneme models for (almost) any language. In Proceedings of the 54th annual meeting of the Association for Computational Linguistics: 1 (pp. 399–408).
Habash, N., Rambow, O., & Roth, R. (2009). MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt (pp. 102–109).
Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. Arabic Computational Morphology, 15–22.
Halpern, J. (2009). Word stress and vowel neutralization in modern standard Arabic. In Proceedings of the second international conference on Arabic language resources and tools.
Hiyassat, H., Yaseen, M., & Arabiat, N. (2007). Automatic pronunciation dictionary toolkit for Arabic. Ph.D. thesis. Arab Academy for Banking and Financial Sciences, Amman, Jordan.
Lehr, M., Gorman, K., & Shafran, I. (2014). Discriminative pronunciation modeling for dialectal speech recognition.
Lu, L., Ghoshal, A., & Renals, S. (2013). Acoustic data-driven pronunciation lexicon for large vocabulary speech recognition. In IEEE workshop on automatic speech recognition and understanding (ASRU) (pp. 374–379). IEEE.
Masmoudi, A., Khmekhem, M. E., Estève, Y., Belguith, L. H., & Habash, N. (2014). A corpus and phonetic dictionary for Tunisian Arabic speech recognition. In LREC (pp. 306–310).
McGraw, I., Badr, I., & Glass, J. (2013). Learning lexicons from speech using a pronunciation mixture model. IEEE Transactions on Audio, Speech, and Language Processing, 21(2), 357–366.
Nahar, K., Al-Muhtaseb, H., Al-Khatib, W., Elshafei, M., & Alghamdi, M. (2015). Arabic phonemes transcription using data driven approach. International Arab Journal of Information Technology (IAJIT), 12(3), 237–245.
Ng, R., & Hirose, K. (2012). Syllable: A self-contained unit to model pronunciation variation. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4457–4460). IEEE.
Odisho, E. (1976). A case in comparative study of word stress in RP English and Arabic. IDELTI, 7, 43–55.
Ramsay, A., Alsharhan, I., & Ahmed, H. (2014). Generation of a phonetic transcription for modern standard Arabic: A knowledge-based model. Computer Speech & Language, 28(4), 959–978.
Ramsay, A., & Mansour, H. (2008). Towards including prosody in a text-to-speech system for modern standard Arabic. Computer Speech & Language, 22(1), 84–103.


Razavi, M., Rasipuram, R., & Doss, M. M. (2016). Acoustic data-driven grapheme-to-phoneme conversion in the probabilistic lexical modeling framework. Speech Communication, 80, 1–21.
Schlippe, T., Ochs, S., & Schultz, T. (2014). Web-based tools and methods for rapid pronunciation dictionary creation. Speech Communication, 56, 101–118.
Smirnov, V., Ignatov, D., Gusev, M., Farkhadov, M., Rumyantseva, N., & Farkhadova, M. (2016). A Russian keyword spotting system based on large vocabulary continuous speech recognition and linguistic knowledge. Journal of Electrical and Computer Engineering, 2016.
Sumner, M., Kurumada, C., Gafter, R., & Casillas, M. (2013). Phonetic variation and the recognition of words with pronunciation variants. In Proceedings of the 35th annual conference of the Cognitive Science Society (pp. 3486–3491). Austin, TX: Cognitive Science Society.
Tsujioka, S., Sakti, S., Yoshino, K., Neubig, G., & Nakamura, S. (2016). Unsupervised joint estimation of grapheme-to-phoneme conversion systems and acoustic model adaptation for non-native speech recognition. Interspeech, 3091–3095.
