Available online at www.sciencedirect.com
ScienceDirect IERI Procedia 10 (2014) 190 – 195
2014 International Conference on Future Information Engineering
Improve Bayesian Network to Generating Vietnamese Sentence Reduction Ha Nguyen Thi Thua*, Dung Vu Thi Ngocb a
Information Techonology Faculty, Vietnam Electric Power University, Hanoi, Vietnam b HaiDuong Center for Continuing Education, HaiDuong, Vietnam
Abstract Sentence reduction is one of approaches for text summarization that has been attracted many researchers and scholars of natural language processing field. In this paper, we present a method that generates sentence reduction and applying in Vietnamese text summarization using Bayesian Network model. Bayesian network model is used to find the best likelihood short sentence through compare difference of probability. Experimental results with 980 sentences, show that our method really effectively in generating sentence reduction that understandable, readable and exactly grammar. © 2014 ThePublished Authors. Published by Elsevier © 2014. by Elsevier B.V.B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection and peer review under responsibility of Information Engineering Research Institute Selection and peer review under responsibility of Information Engineering Research Institute
Keywords: Sentence reduction, natural language processing, text summarization, Bayesian network, probability;
1. Introduction Today, most text summarization systems based on extracted sentences to generate a summary and we called extracted approach [9], [12], [13], [14]. With this approach, the weight of sentence is calculated based on some features that we think it is important like: term frequency, sentence position, sentence length... And
*
* Corresponding author. Tel.: +84906113373
E-mail address:
[email protected].
2212-6678 © 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection and peer review under responsibility of Information Engineering Research Institute doi:10.1016/j.ieri.2014.09.076
Ha Nguyen Thi Thu and Dung Vu Thi Ngoc / IERI Procedia 10 (2014) 190 – 195
then, sentences will be sorted by its weight and extracted based on the rate (extraction rate). A text summary is including sentences that has maximum weight from the original text. With this approach, text summary will synthesis of the discrete sentences from original text, it can be: x Text summarization is seamlessly because of sentences are not linked by content in the text, specially, when the extraction rate is smaller, it will be greater discrete. x Text summarization is confusing sometimes it can be loosen important information in original text by some sentences that have been not extracted. Therefore, we have chosen the sentence reduction approach for processing in sentence level, remove not important words in the sentence and generate new sentence and creating summary. Target text will overcome some disadvantages that have been analyzed above [5], [6], [7]. This paper present a Vietnamese sentence reduction method based on Bayesian network, each word in original text is considered as a node of the network. Reduced sentence is generated by find a path that is the shortest and greatest weight, we called it: the best likelihood path. The next structure of this paper: In Section 2, is overview of related work, the methodology of sentence reduction based on Bayesian network method in section 3, Section 4 is the experimental results and finally is conclusion. 2. Related works Almost related works focus in building lexical rules model or syntax parser tree. First Aho and Ullman using synchronous context free grammars (SCFGs)[21]. Wu in 1997 has proposed a method that included inversion transduction grammar [22] and some other relate research with CFG like head transducers by Alshawi, Bangalore and Douglas in 2000. Knight and Marcu proposed a noisy channel of sentence compression. They use two components: P(y) is language model and P(x|y) is a channel model. P(x|y) capturing the probability of original sentence x and target compression y. After that, using decoding algorithm searches for maximizes of P(x)P(x|y). This channel model is a SCFG, parallel corpus is used to extracted rules, and weights estimated using maximum likelihood [9]. With Vietnamese sentence reduction approach, most of the methods are applied from English method. However, the performance of this method is not high when applied to Vietnamese language. Because of single syllable language, In Vietnamese, words can not be determined based on space..... So that, they often use extraction approach for building Vietnamese text summarization systems, there are some method that use reduction approach but it is not high effectively. 3. Methodology of sentence reduction based on bayesian network 3.1. Bayesian Network Bayesian networks is the one of probabilistic graphical models. When represent the uncertain knowledge can used graphical models. In Bayesian model, each node is a random variable, and the edges between the nodes represent prbalilistic dependencies among the corresponding random variables. This probabilistic can be computed from history data [2]. If B is a Bayesian network B, so that B is an annotated acyclic graph and B represents a joint probability distribution over a set of random variables V. This network is defined: B=
In which:
191
Ha Nguyen Thi Thu and Dung Vu Thi Ngoc / IERI Procedia 10 (2014) 190 – 195
x G is directed acyclic graph whose nodes X1,X2,…,Xn represents random variables, each variable Xi is independent of its non=descendants given its parents in G, denoted generically as 3 i . x Ԧ, denotes the set of parameters of the network. This set contains the parameter T xi |3 i PB ( xi | 3 i ) for each realization of xi of Xi conditioned on 3 i , the set of parents of Xi in G. So, B defines a unique joint probability distribution over V. (1) 3.2. Methodology of sentence reduction based on bayesian network Supposed that have a sentence S, each word wk in S was generated by wk-1 or can be generated by wk-2 wk-n. Then, we can build a improving Bayesian network that can find a reduced sentence based on probability of ngrams between word wk and word wk-n with n 1, ( k 1) . Need to find an initial state. Supposed that, the set of initial states with probability: Start -> w1= 0.6, Start > w2=0.32, Start -> w3= 0.47, Start -> w4=0.56, Start -> w5=0.2, Start -> w6= 0.11. So that, we choose w1 is the initial state, so reduced sentence will be started by w1. Figure 1 illustrates the structure of Bayesian network with six words in sentence S. Word w1 is considered the first likely node to the next words in sentence S. And word w2 can be created a path to w3, w4, w5, w6.
Fig.1 Bayesian network model with 6 words
To overcome the computational complexity, in this proposed we use dynamic programming and don’t need to compute n-grams probability over all node. For example, have probability as: w1 -> w2=0.3, w1 -> w3 = 0.6, w1-> w4 = 0.042, w1 -> w5= 0.002, w1 -> w6 =0. Choose one state that have maximum probabily. So that, w3 will be choosen and use path that contain w3. For ease of visualization, Bayesian networks is described as a tree as Figure 2
0.3
Fig.2. Bayesian network with probability.
4 00 0.
0.24
192
Ha Nguyen Thi Thu and Dung Vu Thi Ngoc / IERI Procedia 10 (2014) 190 – 195
In this Bayesian network. Probability ability on the branch is calculated by n-grams. In this example, path from word w1 to word w2 has probability is 0.3. The first point of the sentence S is w1. From word w1 sentence reduced by: x From w1 we have some likely paths to w2, w3, w4, w5, w6. x Selecting the path is weighted the highest probability. For example in the figure 4 is w1-> w3 = 0.6 x Save point in the highest path. For example w3 will be stored. x From this point of highest path find some paths to the other words, choosing the most likelihood path. For example w3-> w4. x Continue to repeat the last word of the sentence S. Finish we have a sentence that has been reduced, and reduced sentence include four words w1w3w5w6. In figure 3, we simulated an algorithm which called SRBBN (Sentence Reduction Based on Bayesian Network) as Sentence Reduction Based on Bayesian Network Algorithm Input: S: original sentence; Output: S’: reduced sentence; 1. Initialization
T
I; T ' I; N
I ; i 1; j
0;
2. Separate words from S For i=1 to Length (S) T(i) m Separate (S); 3. Selecting word from original sentence While (i< Length(S) do begin for j m i+1 to Length(S) do begin N(j) m Ngrams(wj,wi); point=argmax(N(j)); T’=T’ wj end; i=j; N I ; end; 4. Generating sentence reduction; S’=Order(T’); Fig. 3 Sentence Reduction Based on Bayesian Network Algorithm
In this algorithm, we use some functions: Separate() is used separate words in sentence. Length() returns length of sentence N-grams use for calculating n-grams of word wi and word wj that learns from training data. Argmax() takes the largest value in the set N. Order() use for sorting words in the original sentences to generate reduced sentence. 4. Experimentals There is not standard corpus for Vietnamese text summarization now. So that, in our experimental, we built corpus by manual. Documents in this corpus had been downloaded from websites’s news as: http://thongtincongnghe.com, http://echip.com, http://vnexpress.net, http://vietnamnet.vn, http://tin247.com...
193
194
Ha Nguyen Thi Thu and Dung Vu Thi Ngoc / IERI Procedia 10 (2014) 190 – 195
Corpus’s entitle is "information" and "technology". There are over 300 documents in it. We segmented from 300 documents into 16,117 sentences. After that, We use a Vietnamese text segmentation tool that use for word segmentation . We selected 814 sentences in the corpus for signed label. There isn’t a standard evaluation for Vietnamese sentence reduction methods. So that, we use the evaluation way by Knight and Marcu to compare our proposed (called SRBBN) with some other methods likes: Human, Syn.con (proposed by M.L Nguyen using syntax control) [9]. Below is the results of compare between our proposed method with two other methods (in Table 1) Table 1 Experimental Results for Vietnamese Sentence Reduction Method
Compression
Grammaticality
Word significance weight
Baseline
X
X
X
SRBBN
65.82
84.2
78.4
Human
61.2209
83.33333
63.5
Syn.con
67
65. 7
6.11
5. Conclusion This paper has proposed a novel method (we called SRBBN) to reducing Vietnamese sentence. The method based on Bayesian Network to find the best sentence reduction. We used a corpus of 16,117 sentences of Vietnamese text for training and test with 980 sentences shown that this method achieved acceptable results when compared with human. Sentences after reduced is satisfied user requirement, readable, understandable and well grammatically.
References [1] Ann Arbor, et. al. Constraint based sentence compression: An integer programming 306 approach. Proceedings of the COLING/ACL 2006, pp 144 – 151, 2006 [2] Blanco, Roi; Lioma; Christina, Graph based term weighting for information retrieval, Information Retrieval, pp. 54-92, 2012 [3] Courtney Napoles; Chris Callison-Burch; Juri Ganitkevitch, Benjamin Van Durme; Paraphrastic Sentence Compression with a Character based Metric: Tightening without Deletion, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 84–90, Portland, 2011 [4] David Vickrey; Daphne Koller, Sentence Simplification for Semantic Role Labeling, Proceedings of ACL-08: HLT, pp. 344–352, USA, 2008. [5] Hongyan Jing; Kathleen R. McKeown. Cut and paste based text summarization. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics, pp 178–185, 2000. [6] H. Jing; K. McKeown, The Decomposition of Human-Written Summary Sentences, Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 129–136, 1999. [7] H. Jing, Sentence reduction for automatic text summarization, Proceedings of the Conference on Applied Natural Language Processing, pp. 310–315, 2000. [8] H. Jing; K. R. McKeown, Cut and paste based text summarization, Proceedings of the North American chapter of the Association for Computa-tional Linguistics Conference, pp. 178–185, 2000.
Ha Nguyen Thi Thu and Dung Vu Thi Ngoc / IERI Procedia 10 (2014) 190 – 195
[9] Jenie Turner; Eugen Charniak. Supervised and unsupervised learning for sentence compression. Proceedings of the 43rd Annual Meeting of the ACL, pp. 290–297, 2005. [10] Knight, K.;Marcu, D. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artif. Intell. 139, 1 , 91-107, 2002. [11] Lloret E; et.al, A. Towards building a competitive opinion summarization system: challenges and keys. In: Proceedings of the NAACL. Student Research Workshop and Doctoral Consortium. pp 72–77, 2009. [12] Lloret E; et.al, Experiments on summary-based opinion classification. In: Proceed-ings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text. pp 107–115, 2010. [13] Lloret; et.al, Text summarisation in progress: a literature review, Springer Science & Business Media, pp.1-41, 2012 [14] Mani I, Automatic summarization. John Benjamins Publishing Co. Amsterdam, Philadelphia, USA, 2001. [15] Mei Jian-ping; Chen, Lihui, SumCR: A new subtopic-based extractive approach for text summarization, Knowledge and Information Systems 31.3 pp. 527-545, 2012. [16] Michel Galley; Kathleen McKeown. Lexicalized Markov grammars for sentence compression. Proceedings of the HLT-NAACL 2007, pp.180–187, 2007. [17] M.L. Nguyen; S. Horiguchi, A Sentence Reduction Using Syntax Control, Proc. Of 6th Information Retrieval with Asian Language, pp. 139-146, 2003. [18] Nguyen, M.L.; et.al, M.. Probabilistic Sentence Reduction Using Support Vector Machines. Proceedings of the 20th international conference on Computational Linguistics, 2004 [19] M. Johnson; E. Charniak, A TAG-based noisy-channel model of speech repairs, Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 33–39, 2004. [20] Stefan Riezler; Tracy H. King; Richard Crouch; An-nie Zaenen, Statistical sentence condensation using ambiguity packing and stochastic disambigua-tion methods for lexical functional grammar. Proceedings of HLT-NAACL 2003, pp. 118–125, Ed-monton, 2003 [21] V. Aho and J.D. Ullman. Properties of syntax directed translations. Journal of Computer and System Sciences, 3:319–334, 1969. [22] Wu. Stochastic, Inversion transduction grammars and bilingual parsing of parallel corpora, Computational Linguistics, 23(3):pp.377–404, 1997.
195