Procedia Computer Science 142 (2018) 14–25

The 4th International Conference on Arabic Computational Linguistics (ACLing 2018), November 17-19 2018, Dubai, United Arab Emirates

The Constitution of an Arabic Touristic Corpus

Chahira Lhioui a, Anis Zouaghi b, Mounir Zrigui c

a [email protected], LaTICE Laboratory, Tunis, Tunisia
b [email protected], ISSAT, Sousse, Tunisia
c [email protected], FSM, Monastir, Tunisia
Abstract

This paper presents the ArabicMEDIA reference dialogue corpus, focusing on the recording protocol. The project aims to define and test an evaluation methodology that assesses and diagnoses the context-sensitive understanding capability of spoken language dialogue systems. As an original contribution of this work, we took the initiative to build a corpus concerning tourist information and hotel reservations, drawing inspiration from the MEDIA and LUNA corpora. The evaluation will also cover Arabic systems from academic organizations as well as industrial sites.

© 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/)
Peer-review under responsibility of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.

Keywords: Corpus; ArabicMEDIA; Arabic Language; Touristic information
ISSN 1877-0509. DOI: 10.1016/j.procs.2018.10.457

1. Introduction

The field of study of our work is Tourist Information and Hotel Reservations (TIHR) [1]. A great deal of interest has been given to Arabic electronic data sources because of the many uses of the Internet. This is why some researchers have tried to construct different forms of Arabic corpora. These corpora include the Penn Arabic Treebank corpus [2] and the TuDiCoI corpus [3]. However, to our knowledge, there is no Arabic corpus dedicated strictly to tourist information and hotel reservation.

As an original contribution of this work, we took the initiative to build a corpus concerning tourist information and hotel reservations. We have drawn inspiration from the MEDIA and LUNA corpora [4]. User requests aim to obtain information on cities of residence, hotels, restaurants, routes, timetables of public transport, tourist events and any other information relevant to tourists (internal or external), or to book one or more rooms in one or more hotels. Reservations are made in the context of organizing a weekend, a holiday or a professional stay.

The construction of such a corpus is embodied in the creation of the ArabicMEDIA project. The objective of this project is to define and test a methodology for appraising the in-context and out-of-context understanding of dialogue systems. We propose to set up an evaluation paradigm based on the definition and use of test kits derived from a real corpus, and on a common semantic representation and common metrics. This paradigm would make it possible to diagnose the comprehension capacities of dialogue systems both out of context and in context. It will also be used as part of an evaluation campaign that will bring together the systems of the different sites on the same information-request task.

State of the Art

Currently, there is neither a standard methodology nor a commonly accepted practice in the scientific community for evaluating and comparing dialogue systems. The dynamic and interactive nature of dialogue makes it difficult to build a test data set that would provide a common evaluation repository for several tasks. Yet large projects have attempted to lay the foundations of an evaluation methodology for spoken dialogue systems, starting with the Francophone project AUF-Arc B2 [5], the DEFI evaluation [6], the European projects EAGLES [7], MEDIA [4], LUNA [15], DISC [8] and SUNDIAL [9], and the ATIS projects [10] and now COMMUNICATOR [11] in the USA.
The PEACE paradigm (Paradigm of Automatic Evaluation of Understanding Outside and in Dialogic Context) [12, 13], on which the ArabicMEDIA project is based, allows an automatic, comparative and diagnostic evaluation of understanding both in and out of dialogue context. It relies on building batteries of reproducible tests derived from real dialogues. This paradigm follows the same line as the DQR [14] and DEFI [6] evaluations, which are also based on test batteries. The evaluation environment rests on the idea that, for a database-oriented information task, there is a common semantic representation into which each system can convert its own representation. Moreover, the paradigm permits an evaluation in the context of the dialogue. The context is artificially simulated by a paraphrase; the aim is to test the interpretation of an utterance U in the context D^n (to use the notation of the DQR approach). Finally, while the major evaluation programs focused on performance evaluation (global measures), this campaign should allow not only an assessment of performance but also a diagnosis of the modeling used.

The aim of the ArabicMEDIA project is therefore to give the Arabic scientific community the means of comparatively evaluating approaches to understanding, by offering the possibility of sharing corpora and defining common generic representations and metrics. The first stage of the ArabicMEDIA project was devoted to the definition and constitution of a common corpus of dialogues in Arabic, TIHR_ARC, dedicated to the task chosen in ArabicMEDIA (tourist information server). After a presentation of the ArabicMEDIA project, the article describes the methodology used to collect the TIHR_ARC corpus (definition of the task, description of the recording platform, protocol), as well as the first observations on this corpus.

2. The project ArabicMEDIA

2.1. Campaign Organization

The main aim of organizing a campaign to evaluate the out-of-context and in-context understanding of Arabic dialogue systems is to promote a dynamic of evaluation within the community. The objective of this project is to put in place a generic evaluation paradigm for comprehension outside and within dialogue context, allowing an automatic, comparative and diagnostic evaluation of systems. An evaluation campaign must guarantee the sustainability of the resources set up for the campaign and of the products derived from it. To ensure the impartiality of the campaign, the evaluation must be conducted by a partner who is not participating in the campaign. We have taken charge of this aspect and recorded the corpus necessary for this project. This role also covers producing the tools needed for the evaluation, supporting the organization of the campaign and evaluating the results. We have also set up the corpus recording platform (WOZ tool and tool). LaTICE, as project promoter, has the role of scientific coordinator. The participants in the evaluations are both academic partners (LaTICE) and industrial partners (Tunisie TELECOM, Ooredoo).

2.2. Evaluation paradigm

In order to allow for a diagnostic evaluation, the evaluation paradigm is based on a common generic representation.

2.2.1. Generic semantic representation

The goal is to put in place a representation of the meaning of user statements that makes it possible to establish an equivalence relation over the set of possible queries. This common representation is being defined independently of any particular domain, but in the context of information systems linked to a database. The chosen representation formalism must be consensual and allow the annotation of large corpora. It is based on an attribute-value structure that allows the representation of complex structures.
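As an illustration of such an attribute-value structure, the sketch below encodes one user request as nested attribute-value pairs. The attribute names (`dialogue_act`, `content`, `hotel`, `nb_rooms`) and the `Frame` class are hypothetical and are not the ArabicMEDIA annotation scheme; they only show how nesting can represent complex structures and how two representations might be compared.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# A value is either an atomic string or a nested attribute-value structure.
Value = Union[str, "Frame"]


@dataclass
class Frame:
    """Hypothetical attribute-value structure (names are illustrative only)."""
    attrs: Dict[str, Value] = field(default_factory=dict)

    def flatten(self, prefix: str = "") -> Dict[str, str]:
        """Return a flat {path: atomic value} view, convenient for comparison."""
        flat: Dict[str, str] = {}
        for name, value in self.attrs.items():
            path = f"{prefix}.{name}" if prefix else name
            if isinstance(value, Frame):
                flat.update(value.flatten(path))
            else:
                flat[path] = value
        return flat


# Illustrative request: "two rooms at a Mercure hotel in Lille".
request = Frame({
    "dialogue_act": "request",
    "content": Frame({
        "hotel": Frame({"name": "Mercure", "city": "Lille"}),
        "nb_rooms": "2",
    }),
})

flat_request = request.flatten()
```

Under this assumption, two user statements whose frames flatten to the same set of paths and values could be treated as equivalent queries, which is one simple way to realize the equivalence relation mentioned in the text.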
This formalism makes it possible to encode both the dialogue act and the propositional content of an utterance. It is agreed that each participant takes care of the conversion from its own internal representation to the common representation.

2.2.2. Reference units

A reference unit for the evaluation of out-of-context understanding includes the exact transcription of the user statement and the reference semantic representation. A reference unit for the evaluation of comprehension in dialogue context comprises the context in the form of a paraphrase [12], the exact transcription of the user utterance and the semantic representation resulting from the interpretation of the utterance given the context. The paraphrase can be obtained either from the in-context annotations of the corpus, or from the concatenation of the user utterances and the system responses. The set of reference units will be divided into three parts: a task-matching corpus (2/3), a development corpus (1/3) distributed to the partners and a hidden-test corpus (1/3) for the evaluation. Each user statement, transcribed according to the transcription norms for spoken statements, will then be annotated according to the common semantic representation, both out of context and in context.

2.2.3. Common Evaluation Measures

The objective is to define common measures for carrying out diagnostic evaluations of the systems. It should also be possible to weight the importance of errors according to the types established below.

2.2.4. Definition and typology of dialogic phenomena and functions

The paradigm must offer a qualitative and automatic diagnostic analysis of the performance of the understanding module, out of context and in context. One can, for example, be interested in particular difficulties of spoken language that exist independently of context: hesitation, repetition, etc. A list of the in-context understanding functionalities to be tested, based on the documentation produced in our study framework [16], will be defined by the consortium (ellipses, anaphora, relaxation of constraints, etc.).

3. Constitution of the corpus

3.1. Definition of task and domain

As part of the evaluation of human-machine dialogues, we decided to restrict the task to information requests that access databases, such as tourist information servers, train schedule servers, flight schedule servers, etc. The definition of the semantic representation is generic. It is then adapted to the task and to the database. The best option was to work on an application linked to a real database, for example through web access to a travel agency or tourism office site.
The common task chosen for the evaluation is that of tourist information concerning the reservation of hotels from websites.

3.2. Corpus collection

We need a common corpus of dialogues to adapt the different comprehension systems to a dialogic context and to create the test batteries used for evaluation. For an unbiased evaluation, we decided to record a new corpus by simulating a spoken tourist information service with a Wizard of Oz setup. Each speaker thus believes that he is talking to a machine, whereas the dialogue is actually handled by a human operator who simulates the responses of a tourist information server. This allows us to obtain a corpus of varied dialogues, thanks to the behaviors of the operator. In this campaign, it is planned to work only on the exact transcripts of the dialogues, covering the turns of both speakers (users and system). However, it also seems important to us to have a good-quality digitized audio signal corresponding to the dialogues, in order to be able to widen the campaign to the processing of inputs coming from a speech recognition system.

3.2.1. Recording platform

The chosen method of corpus collection is the Wizard of Oz (WoZ) method. It consists in simulating a human-machine dialogue in natural language. The simulation comes from the fact that the machine is replaced by a person who responds to a request from the user by mimicking the automatic operation of a voice server. To do this, the operator uses a graphical tool developed by VECSYS, which helps him to generate the answers that he must communicate to the caller. Generated sentences are obtained by completing a sentence model with information from a tourist information website and with caller data. Figure 1 illustrates the operation of the recording tool. The signal is recorded directly in digital format. The dialogues are then transcribed orthographically and segmented into dialogue acts, to be semantically annotated at a later stage.
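The generation step described above, in which a sentence model is completed with website information and caller data, can be sketched as simple template filling. The template texts, the slot names and the `complete_pattern` helper are assumptions made for illustration; they do not describe the actual VECSYS tool.

```python
from string import Template

# Hypothetical sentence patterns, keyed by the action chosen by the operator.
PATTERNS = {
    "confirm_booking": Template(
        "I am booking $nb_rooms room(s) at the $hotel hotel in $city."
    ),
    "ask_missing": Template("Could you tell me the $slot, please?"),
}


def complete_pattern(action: str, site_data: dict, caller_data: dict) -> str:
    """Fill the pattern for `action` with website data and caller data.

    Caller data takes precedence when both sources provide the same slot.
    A missing slot raises KeyError, which signals missing information.
    """
    slots = {**site_data, **caller_data}
    return PATTERNS[action].substitute(slots)


reply = complete_pattern(
    "confirm_booking",
    site_data={"hotel": "Mercure", "city": "Lille"},
    caller_data={"nb_rooms": 2},
)
```

In this sketch a `KeyError` from `substitute` plays the role of the "missing information" case, in which the operator would instead select a pattern such as `ask_missing`.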
3.2.2. Recording protocol

Users participating in the recordings follow hotel reservation scenarios that have been generated from basic scenarios so as to obtain diversity in the dialogues. To obtain user queries that are as natural as possible, these scenarios are communicated to the users by telephone, which reduces the tendency to paraphrase the written text of the scenario.
1. The caller contacts the ArabicHybridSLU phone server for tourist information and to book a hotel room.
2. The operator enters the information on the Elkantaoui site.
3. The operator completes the sentence pattern with the information obtained on the website or given by the caller.
4. Depending on the response on the site or on missing information, the operator presses the appropriate key on the WoZ tool.
Figure 1: Operation of the WoZ tool

Several points of entry into the dialogue are possible: choice of a city, choice of a route, choice of an event, price, period. Eight categories of scenarios were defined to cover different levels of complexity. Figure 2 shows an example of a complex scenario, which involves reserving several hotels in several locations along a particular route.

DATE: from 20/02 to 25/02
LOCATION: from 20/02 to 21/02 (1 night) in Lille, from 21/02 to 23/02 (2 nights) in Paris and from 23/02 to 25/02 (2 nights) in Hammamet
NB-ROOMS: 2 couples, one with 1 child
NB-ADULTS: 4
NB-CHILDREN: 2
PRICE: good standing (maximum 200 €)
MISCELLANEOUS: Mercure, animals, parking

Figure 2: Example of scenario

In addition to the variety of scenarios provided to speakers, we have defined instructions that the operator follows when responding to user requests. The first type of instruction concerns speech recognition or comprehension errors: the operator produces a response based on a simulated misunderstanding of the user query. The second type concerns implicit or explicit confirmations made by the operator. Finally, the last type of instruction concerns the operator's level of cooperation. If the instruction is to be cooperative, the operator answers and gives all the information to the user. Conversely, the operator may be uncooperative and either fail to respond or respond only partially to user requests. Beyond the scenario itself, the speaker also receives instructions on the number and type of criteria that can be negotiated with the server.

3.2.3. Transcription of the reference corpus

We used the Transcriber tool [6] for the transcription of our reference corpus. Transcriber is software that assists the transcription of speech signals. It offers a simple graphical interface allowing a non-specialist user to segment long recordings, transcribe them and mark speech turns (speaker changes), thematic segmentation and acoustic conditions. It is especially designed for the transcription of radio and television broadcasts, in order to create the corpora needed to develop automatic transcription systems for such programs, but its functionalities can be useful in other areas of speech processing. However, we did not use query segmentation via this software: we limited ourselves to segmentation into speech turns, and then into queries from the user and queries from the simulated system. Figure 3 illustrates the transcription of a speech turn between a tourist and the simulated interactive voice response (SVI) system in Transcriber.
Figure 3: Transcription of a request in Transcriber

3.3. State of the corpus

The corpus is still being recorded. Currently, more than three quarters of the corpus has been recorded, that is, about 3000 dialogues. The variability of the statements produced depends on the complexity of the scenario but also on the behavior of the operator. The most interesting phenomena (references, negotiations, negations, etc.) are observed especially for complex scenarios with a non-cooperative operator. Figure 4 shows an excerpt from a dialogue. It exhibits frequent dialogue phenomena: hesitations, repetitions, and very numerous references such as "that night" and "the same thing".

U: آه حتى باه سأقوم بالحجز لتلك الليلة اذا غرفتين في فندق فنادق ميركيور في ليل
“uh so bah I would book for that night so two rooms at the hotel Mercure hotels in Lille”

S: أنا حجز غرفتين في فندق فندق ميرور لو جراند في ليل تريد مزيد من التحفظات
“I reserve two rooms at the hotel Mercure le Grand Hotel in Lille want further reservations”

U: لذلك سيكون نفسه اذا غرفتين أيضا في باريس آه لليلة الواحدة والعشرين إلى اثنين وعشرين ثم من اثنين وعشرين إلى ثلاثة وعشرين فبراير مع نفس آه نفس المعايير آه لذلك دائما زوجين اثنين مع طفل
“so it would be the same so two rooms also in Paris ah for the night of twenty-one to twenty-two then from twenty-two to twenty-three February with the same ah the same criteria ah so always two couples with a child”

Figure 4: Example of user-operator exchanges
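The hesitations and repetitions visible in Figure 4 can be counted with a very simple token-level pass. The sketch below is only illustrative: the filler list, the whitespace tokenization and the `disfluency_counts` helper are assumptions, not the annotation procedure used for the corpus.

```python
from typing import Dict, List

# Assumed hesitation token; "آه" is glossed as "ah" in Figure 4.
FILLERS = {"آه"}


def disfluency_counts(utterance: str) -> Dict[str, int]:
    """Count hesitation fillers and immediate word repetitions.

    A repetition is counted when the same token occurs twice in a row
    once fillers have been removed (for example "same ah same").
    """
    tokens: List[str] = utterance.split()
    content = [t for t in tokens if t not in FILLERS]
    hesitations = len(tokens) - len(content)
    repetitions = sum(1 for a, b in zip(content, content[1:]) if a == b)
    return {"hesitations": hesitations, "repetitions": repetitions}


# Fragment of the last user utterance in Figure 4
# (gloss: "with the same ah the same criteria ah").
sample = "مع نفس آه نفس المعايير آه"
counts = disfluency_counts(sample)
```

A real analysis would need a richer filler inventory and proper Arabic tokenization; the point here is only that such phenomena can be located automatically once the turns are transcribed.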
4.3.1 Characteristics of the working corpus

Our reference corpus TIHR_ARC, which results from simulating a vocal server for tourist information and hotel reservation, is composed of 4000 dialogues collected from 1000 tourist speakers. This gives an average of 4 dialogues per tourist and a total of 24981 speech turns, divided as follows: 12501 tourist turns and 12480 turns of the simulated system. Each tourist carried out 40 different scenarios, with about ten turns per scenario on average. Each scenario contains several types of utterances, such as requests, refusals, acceptances, hesitations, false starts and questions. Moreover, the respondents' answers include both simple and complex structures expressing, for example, concessions, conditions and causes that describe the users' intentions. The following tables summarize some characteristics of our corpus. Table 1 gives statistics on the number of dialogues, number of speakers, etc. Table 2 compares the size and domain of our corpus with other international corpora.

Table 1. Characteristics of our corpus TIHR_ARC

Characteristic                              Value
Number of dialogues                         4000
Number of interviewed tourists              1000
Average number of dialogues per tourist     4
Total number of turns                       24981
Number of tourist turns                     12501
Number of simulated-system turns            12480
Number of tourist words                     41956
Tourist vocabulary size                     11832
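The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, and the derived ratios (turns per dialogue, words per tourist turn, vocabulary-to-word ratio) are not reported in the original table.

```python
# Figures copied from Table 1; every name below is illustrative only.
DIALOGUES = 4000
TOURISTS = 1000
TOTAL_TURNS = 24981
TOURIST_TURNS = 12501
SYSTEM_TURNS = 12480
TOURIST_WORDS = 41956
TOURIST_VOCABULARY = 11832


def corpus_ratios():
    """Return derived ratios; none of them is stated in the original table."""
    return {
        "dialogues_per_tourist": DIALOGUES / TOURISTS,
        "turns_per_dialogue": TOTAL_TURNS / DIALOGUES,
        "words_per_tourist_turn": TOURIST_WORDS / TOURIST_TURNS,
        "vocabulary_to_word_ratio": TOURIST_VOCABULARY / TOURIST_WORDS,
    }


if __name__ == "__main__":
    # The two turn counts must add up to the reported total.
    assert TOURIST_TURNS + SYSTEM_TURNS == TOTAL_TURNS
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.2f}")
```

Under these figures the two turn counts do add up to 24981, and a dialogue contains a little more than 6 turns on average, with a vocabulary-to-word ratio of roughly 0.28 on the tourist side.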
The table below compares our corpus with some international corpora.

Table 2. Comparison of the size of our corpus TIHR_ARC with some reference corpora

Corpus      Language   Domain                      Size (kilo-words)
MEDIA       French     Touristic reservation       18k
LUNA        Polish     Transport information       12k
TELDIR      German     Train timetables            22k
ATIS        English    Plane ticket reservation    6k
PlanRest    French     Restaurant reservation      12k
TIHR_ARC    Arabic     TIHR                        35k
The TIHR domain is considered an open domain because it is very rich in concepts. In addition, user requests can belong to a very large number of subdomains that are themselves relatively open and cannot be enumerated in advance. Table 3 summarizes the nearly uniform distribution of tourist dialogues across domains in the TIHR_ARC corpus.
Table 3. Description of the coverage of the corpus TIHR_ARC

Field / Domain                  Share of tourist dialogues
Railway                         12%
Air                             10%
Private and public transport    10%
Hotel reservation               15%
Services                        5%
Distraction and trip            15%
Touristic events                5%
Information                     18%
Other information               10%
Total                           100%
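As a rough illustration, the percentages of Table 3 can be turned into approximate dialogue counts. The sketch assumes that the shares apply to the 4000 dialogues reported in Table 1, which the text does not state explicitly; the names are hypothetical.

```python
# Shares copied from Table 3 (in percent).
COVERAGE_PERCENT = {
    "Railway": 12,
    "Air": 10,
    "Private and public transport": 10,
    "Hotel reservation": 15,
    "Services": 5,
    "Distraction and trip": 15,
    "Touristic events": 5,
    "Information": 18,
    "Other information": 10,
}

# Assumption: the shares refer to the 4000 dialogues of Table 1.
TOTAL_DIALOGUES = 4000


def approximate_counts(coverage, total):
    """Convert percentage shares into dialogue counts using integer arithmetic."""
    return {domain: total * pct // 100 for domain, pct in coverage.items()}


if __name__ == "__main__":
    # The shares should cover the whole corpus.
    assert sum(COVERAGE_PERCENT.values()) == 100
    for domain, count in approximate_counts(COVERAGE_PERCENT, TOTAL_DIALOGUES).items():
        print(f"{domain}: {count}")
```

Integer arithmetic is used so that the counts add up exactly to the assumed total.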
Table 4 sums up the characteristics of different dialogue corpora used in other projects and in different languages. #D is the number of dialogues, #T is the number of speech turns, #V is the size of the vocabulary and #M is the number of words. A denotes the type of corpus (H/H for human-human and H/M for human-machine). Finally, L indicates the language used (Ang for English, Fr for French, DT for the Tunisian dialect, Ar for Arabic, ASM for spontaneous Modern Standard Arabic and Esp for Spanish). These corpora vary in size from a few tens to thousands of dialogues.

Table 4. Characteristics of some dialogue corpora in limited domains

Corpus      #D     #T      #V      #M       A     L     Task
Trains 93   98     5900    860     55000    H/H   Ang   Goods manufacturing and expedition over railways
DIHANA      900    15413   823     48243    H/M   Esp   Railways information
SARF        350    9763    827     117156   H/M   Ar    Railways information
TuDiCoI     1825   12182   1437    21551    H/H   DT    Railways information
TARIC       4662   18657   --      71684    H/H   DT    Railways information
TIHR_ARC    4000   24981   11832   41956    H/M   ASM   Touristic information and hotel reservation
MEDIA       1257   38434   2715    156048   H/M   Fr    Hotel reservation
4. Diffusion of the corpus

The corpus, including transcripts and semantic annotations, will be disseminated as widely as possible by the ELRA / ELDA consortium, in the form of a distribution that will also include the anonymous results of the evaluation and the tools developed in our laboratory. The consortium will pay attention to the reusability of this type of resource in order to contribute to the standardization of test methods. The purpose of this distribution is to allow actors outside our laboratory to evaluate their results and compare them with those produced in our laboratory.

5.1 Lexical analysis of the corpus

5.1.1 Foreign words and rejects

In a lexical study conducted on the TIHR_ARC corpus [1], we found 4651 words (11%) of non-Arabic origin, borrowed from other languages. Similarly, we noted 1801 words (5%) considered as scraps, which are due to the spontaneous nature of the speech. Table 5 summarizes the foreign words unconsciously used by non-native speakers of Arabic, as well as the scraps uttered by the speakers, that appear in the lexicon of the TIHR_ARC corpus.

Table 5. Some examples of foreign words and scraps

Foreign words:       وي (oui), بنه (bon), الوغ (alors), دونك (donc), بيانه (bien), نونه (non)
Scraps:
  Hesitation:        ام (hum), اه (ah), اوه (euh), ان (ahn)
  Throat clearing:   اح (auh), احم (ahm), اع (aah)
Rejects:             unknown words, mispronounced words, unfinished words, silence
5.1.2 Disfluencies

Repetitions, hesitations, self-corrections and primers (amorces, that is, truncated words) are forms of disfluency that must be taken into account when processing a corpus.
- Repetition: the speaker repeats one or more words, for example to confirm his or her request. The repeated words are usually task-dependent words. Repetition also occurs when oral production is disturbed. The following figure illustrates these two cases of repetition.

Figure 5: Example of repetition from the TIHR_ARC corpus

We performed an analysis of the repetitions and self-corrections found in the TIHR_ARC corpus.
- Primer: a phenomenon of spontaneous speech that results from the truncation of a word while it is being uttered. The statement "قط قطار الى المنستير" ("train to Monastir") is an example: its first word, قط, was left incomplete before the speaker pronounced the next word.
- Hesitation / self-correction: this phenomenon adds new lexical classes specific to spontaneous speech. Some classes resemble those of spoken French, such as "euh", "ah" and "eu", while others are specific to the Tunisian dialect (DT), such as "آ آ" and "ي ي". A hesitation in an oral statement is sometimes used to correct the speaker's request. The figure below gives two examples: the first is a self-correction, while the second is a hesitation.
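The phenomena described above can be approximated with simple surface heuristics. The sketch below is not the procedure used to build the corpus statistics; it only illustrates one possible way to flag candidate repetitions, truncated words and hesitations in a tokenized utterance. The function names are hypothetical, the hesitation markers are those listed in Table 5, and the repetition example is constructed for the illustration.

```python
# Hesitation markers taken from Table 5. Some of them (e.g. "ان") are also
# ordinary Arabic particles, so this check can produce false positives.
HESITATIONS = {"ام", "اه", "اوه", "ان"}


def find_repetitions(tokens):
    """Indices i where tokens[i] is an immediate repeat of tokens[i - 1]."""
    return [i for i in range(1, len(tokens)) if tokens[i] == tokens[i - 1]]


def find_truncations(tokens, min_len=2):
    """Indices i where tokens[i] is a strict prefix of the following token."""
    hits = []
    for i in range(len(tokens) - 1):
        current, following = tokens[i], tokens[i + 1]
        if min_len <= len(current) < len(following) and following.startswith(current):
            hits.append(i)
    return hits


def find_hesitations(tokens):
    """Indices of tokens that belong to the hesitation marker set."""
    return [i for i, token in enumerate(tokens) if token in HESITATIONS]


if __name__ == "__main__":
    primer_example = "قط قطار الى المنستير".split()   # example from the text
    repeat_example = "غرفتين غرفتين في ليل".split()    # constructed example
    print(find_truncations(primer_example))
    print(find_repetitions(repeat_example))
```

The prefix test is deliberately naive: a short function word followed by a longer word that happens to start with the same letters would also be flagged, so the output should be read as a list of candidates for manual checking.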
Figure 6: Examples of hesitation extracted from the TIHR_ARC corpus

In order to carry out a statistical study of repetitions, self-corrections and primers, we extracted a sample of the corpus consisting of 400 dialogues, representing 1,250 tourist utterances. We noted 300 repetitions, self-corrections and primers in 500 statements, which is equivalent to 40% of client statements. The table below shows the distribution of each phenomenon in the selected sample.

Table 6. Distribution of repetitions, self-corrections and other disfluencies

Disfluency         %
Repetition         45%
Hesitation         27%
Self-correction    21%
Primer (amorce)    7%

We found that repetitions are the most frequent phenomenon (45%, or 562 repetitions). The other types are distributed more evenly: we detected 338 hesitations (27%) and 262 self-corrections (21%). Primers are the least frequent type, at about 7%, corresponding to 77 statements.

5.1.3 Treatment of the corpus

Most work in speech processing applies preprocessing to the corpus before the annotation step in order to reduce the complexity of oral statements [18]. For this reason, we applied an automatic preprocessing step to the raw data of our TIHR_ARC corpus, based on manually created dictionaries. It is worth noting that these dictionaries are of considerable size, since the target domain is practically open. The preprocessing comprises the following steps:
• Morphological analysis: first, we analyze verbs to determine their canonical form. For example, we label the word "اسافر" (travel) with its canonical form سافر. Second, we treat nouns, including plural nouns, by labeling each noun with its indefinite singular form. For example, the word انزلة is labeled as نزل.
• The treatment of agglutination: this treatment requires a dictionary that lists every vocabulary item together with all its agglutinated forms (see sub-section below).
• Standardization: standardization consists in replacing the numbers and measurements present in the corpus by their digit form, which facilitates processing. This presupposes the construction of synonym dictionaries and context-recognition grammars.
• The treatment of named entities: this treatment distinguishes the designations of objects in the text of the statements. Recognition grammars and dictionaries are then crucial to recognize these entities and to separate them from particles when they appear in an agglutinated form.
• The treatment of compound words: this treatment groups words separated by a space when they form a meaningful unit. For example, the two words "بئر" (bir) and "الطفلة" (ettofla) are grouped to obtain "بئر_الطفلة" (bir ettofla), which is the name of a small village in Tunisia.
• The processing of frozen expressions, etc.
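The dictionary-driven steps listed above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the three toy dictionaries stand in for the large manually built resources, the function names are hypothetical, and the order of the steps (compounds, then numbers, then canonical forms) is a choice made for the example. The lemma and compound entries come from the examples in the text; the second number entry is invented for the illustration.

```python
# Toy stand-ins for the manually created dictionaries described in the text.
LEMMAS = {"اسافر": "سافر", "انزلة": "نزل"}    # surface form -> canonical form
COMPOUNDS = {("بئر", "الطفلة")}                # word pairs to join into one unit
NUMBER_WORDS = {"اثنين": "2", "ثلاثة": "3"}    # number word -> digit form (second entry hypothetical)


def join_compounds(tokens):
    """Join adjacent tokens that form a known compound, using '_' as glue."""
    joined, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in COMPOUNDS:
            joined.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined


def preprocess(utterance):
    """Whitespace-tokenize, then apply compound joining, number
    normalization and canonical-form lookup, in that order."""
    tokens = join_compounds(utterance.split())
    tokens = [NUMBER_WORDS.get(token, token) for token in tokens]
    return [LEMMAS.get(token, token) for token in tokens]


if __name__ == "__main__":
    print(preprocess("اسافر الى بئر الطفلة"))
```

Joining compounds first keeps the parts of a place name from being altered by the later lookups. Agglutination and named-entity handling are left out because they rely on grammars rather than on plain lookups.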
5.2 Annotation of the corpus TIHR_ARC

To ensure proper HMM-based learning of the model parameters, we divided our corpus into two unequal parts: as is usual, 2/3 of the corpus is reserved for training (ENT) and 1/3 for testing (TES). Once the corpus had been preprocessed, we carried out a phase of segmentation and manual annotation by two native Arabic-speaking annotators, who annotated the training corpus with:
• a linguistic annotation, following the guidelines defined in the NOOJ annotation manual [19];
• a semantic annotation based on frames.

The frame-based semantic annotation of the corpus required defining a semantic representation adapted to the domain of the TIHR task. To obtain a complete hierarchical representation of the semantic composition of a query, richer and more complex structures are essential. The properties of semantic frames led us to use them in this work. This representation is generic, provides good coverage of the domain, and yet is simple enough to allow the annotation of a corpus the size of TIHR_ARC.

5.2.1 The linguistic annotation

To produce a reliable linguistic annotation, we used external Arabic lexical resources and also created our own. The resulting Arabic lexicon is constituted as follows:
• a complete dictionary called EL-DICAR (ELectronic DICtionary for ARabic) [20], containing more than 52,000 lexical units distributed as follows:
  - 19,504 nouns (N)
  - 10,162 verbs (V)
  - 5,816 adjectives (ADJ)
  - 1,230 particles (PREP, ADV, REL, DEM)
  - 3,686 locations (N + LOC)
  - 11,860 proper names (N + Firstname)
• a dictionary dedicated to TIHR: an extension of the EL-DICAR dictionary that enriches it and contributes to the resources available for the Arabic language. This extension takes the form of a NOOJ dictionary (extension .dic) built from the words of the field of study. These words are organized as follows:
  - 10,400 nouns (N): simple nouns and compound nouns
  - 507 verbs (V): modal verbs, simple verbs
  - 328 adjectives (ADJ): simple adjectives and compound adjectives
  - other categories: (DEM-PRES), adverbs (ADV), attached personal pronouns (PRON-ATT), detached personal pronouns (PRON-DET), prepositions (PREP)
  - expressions tagged LOC: (LOC + REQ), negation (LOC + NEG), affirmation (LOC + AFFIR), concession (LOC + CONCESS), conjunctive (LOC + CONJC), coordination (LOC + COOR), cause / consequence (LOC + CC), goal (LOC + BUT), explanation (LOC + EXPL), circumstantial (LOC + PLACE)
• a base of numbers, dates and schedules: it contains 3,098 entries (EXPR_DIGIT, EXPR_DATE, EXPR_TMP)
• a base of named entities (EN): it contains 405,647 ENs such as locations, cities, places, streets, routes, agencies, museums, establishments, fairs and festivals, cinemas, beaches, seas, outings, tourist events, names and types of buses, trains and subways, train and bus stations and airports, tourist areas, hotels. The list of ENs is still exhaustive.
• a base of fixed expressions: it is composed of 30,000 fixed expressions.
• grammars of inflected and derived forms, whose extension in NOOJ is (.nof) [19], associated with nouns and verbs. These grammars enable a tokenization that identifies and annotates morphemes in their agglutinated forms.
• syntactic-semantic grammars of the context-free type: CFG.nog [21], which consists of local grammars for recognizing frequently used complex linguistic structures.

Table 7. Local grammar statistics

Grammar type                           Rules    Examples of grammar forms
Inflectional / derivational grammar    40876    Agglutination treatment grammars; grammar of inflections of verbs and nouns
Local syntactic grammar                47       Recognition of conditional, interrogative, emphatic, affirmative, negative,
                                                adjectival, annexed, relative and circumstantial forms
Local semantic grammar                 83       Segmentation grammar; date, place, period, time, price and cardinality recognition
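To make the role of the layered lexicon more concrete, the sketch below tags tokens by looking them up first in a domain dictionary and then in a general one. It is a simplified illustration in plain Python, not NOOJ code: the two small dictionaries are hypothetical stand-ins for the TIHR extension and EL-DICAR, the category assigned to each sample word is our own assumption, the domain-first priority is also an assumption, and "UNK" is an invented label for out-of-lexicon tokens. The tags N, V and N+LOC follow the categories listed in Section 5.2.1.

```python
# Hypothetical miniature lexicons; the real resources are NOOJ .dic files.
GENERAL_LEXICON = {"سافر": "V", "قطار": "N"}           # stand-in for EL-DICAR
DOMAIN_LEXICON = {"نزل": "N", "المنستير": "N+LOC"}      # stand-in for the TIHR extension

# Assumption: domain entries take priority over general entries.
LEXICONS = [DOMAIN_LEXICON, GENERAL_LEXICON]


def tag(tokens, lexicons=LEXICONS, unknown="UNK"):
    """Return (token, category) pairs using the first lexicon that knows the token."""
    tagged = []
    for token in tokens:
        label = next((lex[token] for lex in lexicons if token in lex), unknown)
        tagged.append((token, label))
    return tagged


if __name__ == "__main__":
    for token, label in tag("قطار الى المنستير".split()):
        print(token, label)
```

Tokens that are agglutinated or inflected would not be found by such a direct lookup, which is why the inflectional and agglutination grammars of Table 7 are needed in the actual annotation chain.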
5. Conclusion

The collection and transcription of the TIHR_ARC corpus have been completed. Each dialogue (signal and transcripts) is accompanied, on the one hand, by the scenario instructions given to the user and, on the other hand, by the instructions setting the behavior of the participant. Following the recordings, the semantic annotation of the dialogues has also been carried out. The work focused on analyzing the dialogues already recorded in order to work out the structure of the representation and the set of concepts related to the task.

6. References

[1] Lhioui, C., Zouaghi, A., Zrigui, M. (2013) A combined method based on stochastic and linguistic paradigm for the understanding of Arabic spontaneous utterances. In Computational Linguistics and Intelligent Text Processing (pp. 549-558). Springer Berlin Heidelberg.
[2] Habash, N. and Rambow, O. (2004) Extracting a Tree Adjoining Grammar from the Penn Arabic Treebank. JEP TALN 2004, Session Traitement Automatique de l'Arabe, Fès, 20 avril.
[3] Graja Marwa Boudabbous (2015) Compréhension automatique de la parole en dialecte tunisien dans le cadre des systèmes de dialogue. Dissertation en informatique, Sfax, Tunisie.
[4] Meurs, M. J. (2009) Approche stochastique bayésienne de la composition sémantique pour les modules de compréhension automatique de la parole dans les systèmes de dialogue homme-machine (Doctoral dissertation, RWTH, Aachen).
[5] J. Mariani. The Aupelf-Uref Evaluation-Based Language Engineering Action and Related Projects. In Proceedings of the First International Conference on Language Resources and Evaluation, volume 1, Granada, 1998.
[6] J. Antoine et al. Predictive and objective evaluation of speech understanding: the challenge evaluation campaign of the I3 speech workgroup of the French CNRS. In Proceedings of the Third International Conference on Language Resources and Evaluation, volume 1, Las Palmas, 2002.
[7] D. Gibbon, R. Moore and R. Winsky. Handbook of Standards and Resources for Spoken Language Resources. Mouton de Gruyter, New York, 1997.
[8] L. Dybkjaer et al. The Disc Approach to Spoken Language System Development and Evaluation. In Proceedings of the First International Conference on Language Resources and Evaluation, volume 1, Granada, 1998.
[9] E. Giachin and S. McGlashan. Spoken Language Dialogue Systems. In S. Young and G. Bloothooft (Eds.), Corpus-based methods in language and speech processing. Dordrecht, Kluwer Academic Publishers, 69-117, 1997.
[10] MADCOW. Multi-Site Data Collection for a Spoken Language Corpus. DARPA Speech and Natural Language Workshop, 1992.
[11] M. Walker, R. Passonneau and J. Boland. Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialog Systems. ACL/EACL, Toulouse, 2001.
[12] L. Devillers, H. Maynard and P. Paroubek. Méthodologies d'évaluation des systèmes de dialogue parlé: réflexions et expériences autour de la compréhension. TALN 2002.
[13] H. Maynard and L. Devillers. A framework for evaluating contextual understanding. In Proceedings of the International Conference of Speech and Language Processing, 2000.
[14] J. Antoine et al. Obtaining predictive results with an objective evaluation of spoken dialogue systems: experiments with the DCR assessment paradigm. In Proceedings of the Second International Conference on Language Resources and Evaluation, volume 1, Athens, 2002.
[15] Pinault, F. (novembre 2011) Apprentissage par renforcement pour la généralisation des approches automatiques dans la conception des systèmes de dialogue oral. Avignon.
[16] Chahira Lhioui, Anis Zouaghi, Mounir Zrigui. A Rule-based Semantic Frame Annotation of Arabic Speech Turns for Automatic Dialogue Analysis. ACLING 2017: 46-54.
[18] Martínez-Hinarejos, C. D., Benedí, J. M., Granell, R. (2008) Statistical Framework for Spanish Spoken Dialogue Corpus. Speech Communication, vol. 50, pp. 992-1008.
[19] Max Silberztein (2004) La formalisation des langues : l'approche de NooJ. ISTE: London (426 p.).
[20] Mesfar, S. (2008) Analyse morphosyntaxique automatique et reconnaissance des entités nommées arabe standard. Thèse de doctorat en Informatique, Université de Franche-Comté.
[21] Chahira Lhioui, Anis Zouaghi, Mounir Zrigui. Towards a Hybrid Approach to Semantic Analysis of Spontaneous Arabic Speech. Int. J. Comput. Linguistics Appl. 5(2): 165-193 (2014).