Accepted Manuscript
Machine learning-based Multi-Documents Sentiment-oriented Summarization using Linguistic treatment
Asad Abdi, Siti Mariyam Shamsuddin, Shafaatunnur Hasan

PII: S0957-4174(18)30293-8
DOI: 10.1016/j.eswa.2018.05.010
Reference: ESWA 11963

To appear in: Expert Systems With Applications

Received date: 12 October 2017
Revised date: 27 March 2018
Accepted date: 10 May 2018
Please cite this article as: Asad Abdi, Siti Mariyam Shamsuddin, Shafaatunnur Hasan, Machine learning-based Multi-Documents Sentiment-oriented Summarization using Linguistic treatment, Expert Systems With Applications (2018), doi: 10.1016/j.eswa.2018.05.010
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights:
- Machine learning-based sentiment-oriented summarization using linguistic treatment.
- Rich feature set using word embedding, sentiment, statistical and linguistic knowledge.
- The proposed method is restricted from using many domain-dependent external resources.
- It uses a deep-learning-inspired method (word embedding) for vector representation.
- Results show that the method is preferable to other methods.
Machine learning-based Multi-Documents Sentiment-oriented Summarization using Linguistic treatment

Asad Abdi (Corresponding author), Siti Mariyam Shamsuddin, Shafaatunnur Hasan
UTM Big Data Centre (BDC), Universiti Teknologi Malaysia, Skudai 81310 Johor, MALAYSIA.
E-mail: [email protected], [email protected], [email protected]
Abstract
Sentiment summarization is the process of automatically creating a compressed version of the opinionated information expressed in a text. This paper presents a machine learning-based approach to summarize users' opinions expressed in reviews using: 1) sentiment knowledge to calculate a sentence sentiment score as one of the features for sentence-level classification; it integrates multiple strategies to tackle the following problems: sentiment shifters, the types of sentences and the word coverage limit; 2) a word embedding model, a deep-learning-inspired method, to capture meaning and semantic relationships among words and to extract a vector representation for each word; 3) statistical and linguistic knowledge to determine salient sentences. The proposed method combines several types of features into a unified feature set to design a more accurate classification system ("True": the sentence belongs to the extractive reference summary; "False": otherwise). To achieve better performance scores, we carried out a performance study of four well-known feature selection techniques and seven of the most popular classifiers, to select the most relevant set of features and to find an efficient machine learning classifier, respectively. The proposed method is applied to three different datasets, and the results show that integrating a support vector machine-based classification method with Information Gain (IG) as a feature selection technique can significantly improve performance and make the method comparable to other existing methods. Furthermore, our method, which learns from this unified feature set, obtains better performance than one that learns from a feature subset.
Keywords: Sentiment analysis, Sentiment summarization, Machine learning, Sentiment knowledge, Word embedding.

1. Introduction
Sentiment analysis has become an important research field since the early 2000s (Habernal, Ptáček, & Steinberger, 2014). It has attracted considerable attention in recent years due to its applicability to various purposes (Khan, Qamar, & Bashir, 2016b). A document may include facts, opinions, or objective/subjective information. Sentiment analysis aims to distinguish factual from opinion information and to identify positive, negative, or neutral sentiment in a text document (Hung & Chen, 2016). In other words, the main task of sentiment analysis is to investigate people's opinions (Kolchyna, Souza, Treleaven, & Aste, 2015). Due to the huge amount of information available, users require a technology that can process this information. Document summarization can be an essential technology for tackling this problem. Document summarization aims to produce a short version of a source text that provides informative content for users (A. Abdi, Idris, Alguliev, & Aliguliyev, 2015; Fang, Mu, Deng, & Wu, 2017). There are several types of summary, such as single-document, multi-document, generic, query-based, and
opinion-based, etc. (A. Abdi, N. Idris, R. M. Alguliyev, & R. M. Aliguliyev, 2015c; S. A. Abdi & Idris, 2014; Chatterjee & Sahoo, 2015). Opinion/sentiment summarization is one of the new types. It aims to obtain significant information and the overall opinion/sentiment orientation of an opinionated document (Lloret, et al., 2015). Sentiment analysis and text summarization analyze and identify important information in opinionated documents using Natural Language Processing (NLP), machine learning, text analysis, and statistical and linguistic knowledge (P. Gupta, Tiwari, & Robert, 2016). Sentiment analysis aims to analyze user reviews to detect whether they are positive, negative, or neutral, while text summarization aims to elicit essential information from these user reviews, which can be used to create a summary of users' opinions expressed in the text. Sentiment summarization is one of the strong NLP methods. This type of method can be considered an expert system that assists humans in decision making (Lloret, et al., 2015), and it is therefore vital in today's society (Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2014). Systems based on opinion summarization give readers significant information about different topics (Y.-H. Hu, Chen, & Chou, 2017). Furthermore, sentiment summarization includes a set of processes to determine opinion-oriented information from people's reviews on various subjects (Ofek, et al., 2016). A sentiment summarization approach collects the sentences related to the major subject, detects the polarity of the sentences, and eventually summarizes the positive and negative sentences. Users require an automatic system that can detect subjective information, classify users' opinions and summarize reviews. Hence, in recent years, researchers have also focused on sentiment analysis and proposed various techniques to tackle this problem (Gambhir & Gupta, 2017; Y.-H. Hu, et al., 2017). This paper presents a machine learning-based sentiment-oriented summarization of multi-documents using the combination of sentiment knowledge, a deep-learning-inspired method, and statistical and linguistic knowledge (called SOSML). Although some previous systems considered sentiment knowledge, a word embedding model, or statistical and linguistic knowledge for a machine learning algorithm to produce a classification model, to the best of our knowledge, a method that combines a word embedding model, sentiment knowledge, and statistical and linguistic knowledge for sentiment-oriented summarization has not been thoroughly studied. The SOSML employs sentiment knowledge to calculate the sentence sentiment score as one of the features for sentence-level classification. It also integrates multiple strategies to tackle the following problems. 1) Sentiment shifter: the polarity of a word given by a lexicon can be reversed, since the polarity (positivity or negativity) of a word depends on the context in which the word is used. Thus, to tackle this problem, our proposed method applies negation handling and but-clause handling to adjust a word's prior sentiment orientation. 2) Types of sentences: the types of sentences also affect the performance of a sentiment analysis method (Chen, Xu, He, & Wang, 2017), yet most existing systems ignore different sentence types.
Therefore, we also apply subjective/objective sentence handling and interrogative/conditional sentence handling in order to improve the performance of the SOSML in sentence-level sentiment classification (refer to section "3) Contextual polarity for sentiment analysis"). 3) Word-sentiment score: we combined several sentiment dictionaries (ten dictionaries) to create a
High-Coverage Lexical resource (HCLr) to analyze opinionated texts. We combined those dictionaries since 1) an individual sentiment lexicon usually has a word coverage limit and may fail to determine the sentiment score of a word if the word is not included in the lexicon; in other words, our aim was to expand sentiment dictionary coverage to overcome the word coverage limit of an individual lexicon; and 2) different sentiment lexicons complement each other. Moreover, the SOSML uses the Semantic Sentiment Approach (SSA) to determine the sentiment score of a word if it is not included in HCLr. The SOSML also uses a deep-learning-inspired method, the word embedding model, to derive the vector representation of a sentence as one of the features for sentence-level classification. The word embedding model extracts a vector representation for each word encountered in a sentence. It then follows a simple method, according to which it derives the vector representation of a given sentence as the average over the vectors of all its constituent words. A text summarization approach aims to select the most significant sentences in a document. Therefore, it is very important to determine those features that help to identify these sentences and improve the quality of the summary. To do this, we also consider some statistical and linguistic features, such as key words, title words, sentence location, cue phrases and sentence similarity. Furthermore, the SOSML uses a minimal set of features to accurately classify a review text. Most of the features we use are domain-independent and generic in nature, and therefore may be used for applications of a similar nature. We restrict the SOSML from using many domain-dependent external resources. It can be expected that the accuracy of the SOSML could be improved if a suitable feature selection technique and machine learning approach are used. However, there are two important questions that need to be answered: 1) which of the popular feature selection techniques and 2) which of the popular machine learning techniques perform best in multi-document sentiment-oriented summarization? The answers would be valuable for improving existing sentiment classification methods. To achieve this aim, we compare the performance of different feature selection techniques (e.g., information gain (IG), gain ratio (GR)) and machine learning approaches (e.g., decision tree (DT), naïve Bayes (NB), support vector machine (SVM) and K-nearest neighbor (KNN)) to find the most relevant set of features and identify the best classifier. In this paper, four feature selection techniques and seven machine learning classification approaches are investigated on three different datasets. Furthermore, the SOSML method requires some degree of linguistic pre-processing, including part-of-speech (POS) tagging, word stemming and stop-word removal. Moreover, since in multi-document summarization the sentences are selected from various documents, they may convey the same information. Therefore, the SOSML also considers content coverage and redundancy to improve the quality of the summary.
In summary, the contributions of the present work are as follows: 1) to the best of our knowledge, a hybrid vector in which sentiment knowledge-based, word embedding-based, and statistical and linguistic knowledge-based features are combined for sentiment-oriented summarization has not been thoroughly studied; 2) we conducted a comparative study of seven classifiers and four well-known feature selection techniques to select the best learning algorithm for classification and to find the most relevant set of features to improve the classifiers' performance, respectively; 3) we combine several sentiment dictionaries to overcome the word coverage limit of individual lexicons; 4) we consider the contextual polarity and the type of a sentence for sentiment analysis; 5) the method also checks for redundant information in order to increase the quality of the summary; 6) finally, we perform experiments on three different publicly available datasets.

The rest of this paper is structured as follows. Section 2 reviews related work on sentiment summarization. Section 3 explains our proposed system. Section 4 summarizes the experimental results. Finally, Section 5 concludes the paper.

2. Related work
Sentiment analysis — sentiment analysis has a great impact on company and government decision making. It is one of the most exciting fields in NLP and can provide a solution for analyzing people's reviews (Birjali, Beni-Hssane, & Erritali, 2017). There are three main levels of sentiment analysis: 1) sentence level, 2) document level and 3) aspect level (Rana & Cheah, 2016). At the sentence or document level, a sentence or a document is classified as positive/negative or subjective/objective according to its polarity and subjectivity degree. Unlike the document or sentence level, the feature/aspect level can determine exactly what people like and do not like. It can detect whether the tendency regarding an aspect/feature is positive, negative or neutral. Given the sentence "The color quality of this camera is amazing", the feature is the 'color quality' of 'this camera', and the opinion on the 'color quality' feature is positive. Sentiment analysis approaches are split into three groups: 1) the machine learning approach, 2) the lexicon-based approach and 3) the hybrid approach. A machine learning (ML)-based approach uses a set of well-known machine learning methods to classify sentiment orientation (e.g., ANN (Lee & Choeh, 2014), SVM (Shahana & Omman, 2015)); it can handle large collections of people's reviews. The lexicon-based approach relies on a sentiment lexicon (e.g., AFINN (Nielsen, 2011), Sentiment140 Lexicon (Mohammad, Kiritchenko, & Zhu, 2013)). A sentiment dictionary includes a set of sentiment words that are used to express a positive/negative sentiment; for example, words like 'good' and 'nice' express a positive sentiment, while 'weak' and 'bad' express a negative one. Finally, a hybrid approach integrates the above two approaches (Deshwal & Sharma, 2016).
Summarization — a summarization task aims to produce concise and informative text representing the main idea of the source documents. Text summarization is divided into two groups (Tayal, Raghuwanshi, & Malik, 2017): generic summarization and query-focused summarization. In generic summarization, the system considers the whole document and extracts the general idea of the source document (e.g., AUGGS (Alguliyev, Aliguliyev, & Isazade, 2015a), ASSL (Ferreira, et al., 2014)), while in query-based summarization, the system must consider the user's query and the summary text should answer the user's question (e.g., QSLK (A. Abdi, Idris, et al., 2015c), QSDM (Zhong, Liu, Li, & Long, 2015)). Text summarization can also be categorized as either extractive or abstractive. Abstractive summarization produces a text summary that includes new sentences or phrases which are not necessarily the words or sentences used in the original text; the produced summary must keep the same meaning as the source text. Extractive summarization aims to produce a text summary by choosing sentences from a document; the summary presents exact sentences, or parts of sentences, from the original text (Mendoza, Bonilla, Noguera, Cobos, & León, 2014). Besides these facts, a text summarization system can also produce either an indicative or an informative summary. An indicative summary only
introduces the basic idea of a text to the user; it aims to help the user decide whether to read the source text or not. An informative summary provides brief information from the main text that can be considered a replacement for the original text. Text summarization can also be single-document, where only one document is considered to make a summary, or multi-document, where a set of documents is considered to produce a summary.
Phases of text summarization systems — the text summarization process includes the following steps (Lloret, 2012): 1) Interpretation: the input text is processed and a few salient features are selected; 2) Transformation: the results of the interpretation step are transformed into a summary representation; 3) Generation: an appropriate summary is generated using the summary representation.

Sentiment-based summarization — recently, due to the huge amount of opinionated information and people's reviews, combining the sentiment analysis and summarization tasks to generate opinion-oriented summaries can bring great benefits for decision making (Amplayo & Song, 2017). While text summarization aims to generate a concise version of factual information, sentiment summarization summarizes sentiments from a large number of reviewers or multiple reviews (Lloret, et al., 2015). Consequently, text summarization and sentiment analysis must be combined to produce a sentiment summary: text summarization determines the most relevant sentences from a review text, and the sentiment analysis component identifies and categorizes objective or subjective sentences and their polarity, respectively (Pang & Lee, 2008). The multi-text summarization method has been used for review summarization in various proposed methods, as follows.

N. Yadav and Chatterjee (2016) proposed a technique based on the sentiments of key words in the text to select the key sentences of a document. The method includes the following steps: a) the pre-processing step involves a set of basic functions: stop word removal, sentence splitting, the Porter stemming algorithm and POS tagging; b) the method assigns sentiments according to the POS tags of the words; c) finally, for each word in a sentence the sentiment is computed corresponding to its POS tags. The sentiment of each word is retrieved from the SentiWordNet database. The sentence sentiment score is computed as the sum of the sentiment values of all the words in the sentence.

Raut and Londhe (2014) proposed a text summarization approach to produce a concise opinion summary of reviews based on a machine learning approach and a SentiWordNet (Esuli & Sebastiani, 2007)-based approach. The system includes the following stages. The pre-processing stage performs basic NLP analysis: tokenization, POS tagging and sentence segmentation. In the opinion mining stage, the review text is classified as a positive or negative review using machine learning classifiers and a SentiWordNet-based algorithm. SentiWordNet is one of the popular lexical resources for sentiment analysis: a database that contains words with their polarity scores for positive or negative sentiment based on their part of speech. The system also handles negation words such as 'not', 'never', etc., which affect the polarity of a sentence. Finally, the opinion summarization stage uses a sentence extraction method to produce a summary text: sentences are filtered, and the most informative sentences are selected from the document to represent the summary. The opinion summarization employs term frequency and a relevance scoring method to select the most informative sentences for the summary. The relevance score is calculated by the following formula (Lloret, Balahur, Gómez, Montoyo, & Palomar, 2012):

$\text{Relevance score}(S_i) = \frac{1}{NP_i}\sum_{w} tf_w$

where $NP_i$ is the number of noun phrases contained in sentence $i$ and $tf_w$ is the frequency of word $w$ belonging to those noun phrases.

Balahur, Kabadjov, Steinberger, Steinberger, and Montoyo (2012) introduced an approach (OM + Summarizer) to produce opinion summaries. The proposed method performs two main tasks: 1) the first step determines the opinionated sentences: sentences containing positive sentiment, sentences containing negative sentiment, and neutral or objective sentences; 2) subsequently, the positive and negative sentences are passed on to the summarizer (LSA-based (Landauer, 2002) text summarization) to produce a summary.

Khosla and Venkataraman (2015) also proposed an approach to summarize users' reviews. The approach includes two different steps: encoding review information in vectors and using these vectors to extract key sentences that capture the essence of the reviews. The first step uses the word2vec method for sentence vector representations. The approach then extracts the key sentences using k-means clustering: it takes the review sentence vectors and clusters them into k clusters, after which the most central sentences are extracted from each cluster as a "characteristic" representation of the cluster.
3. Proposed method
The overall four-step pipeline for sentence-level classification is shown in Figure 1. The pipeline contains four main steps: 1) pre-processing; 2) feature extraction; 3) classification; 4) summary generation.
1. Pre-processing: in this stage, basic NLP techniques are performed on the review documents.
2. Feature extraction: the task of this stage is to extract a set of features to improve the overall quality of text classification.
3. Classification: this stage aims to classify the review texts.
4. Summary generation: this step produces the final summary; it also checks for redundant information to increase the quality of the summary.
We describe each stage in the following sections.

3.1. Pre-processing
The pre-processing step applies a set of basic linguistic functions to the dataset to make it more suitable for text mining techniques. This step includes the following functions: 1) sentence splitting; 2) stemming; 3) stop-word deletion; 4) part-of-speech (POS) tagging.

Sentence splitting — since the sentiment analysis is performed at the sentence level, this function splits the review text into several sentences. A sentence ends with a sentence delimiter (".", "?", "!").

Part-of-speech (POS) tagging — POS tagging automatically tags each word with its morphological category (e.g., "Students/NNS help/VBP the/DT teacher/NN ./."). We used an English part-of-speech tagger developed by Tsuruoka and Tsujii (2005) at the University of Tokyo.

Stemming — stemming aims to get the stem or root of a word; it is useful for identifying words that belong to the same stem. The root of each word is obtained using the lexical database WordNet (Miller & Charles, 1991). WordNet includes 121,962 unique words, 99,642
synsets (each synset is a lexical concept represented by a set of synonymous words) and 173,941 senses of words.

Stop word removal — stop words include a set of words (see http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html) that: a) occur very commonly within a text; b) are considered noisy terms, such as prepositions, articles, etc. (A. Abdi, Idris, Alguliev, et al., 2015; A. Abdi, N. Idris, R. Alguliyev, & R. Aliguliyev, 2015a); c) do not affect the sentiment of a sentence (Kolchyna, et al., 2015); and d) do not provide worthwhile information in a sentence (A. Abdi, N. Idris, R. M. Alguliyev, & R. M. Aliguliyev, 2015b; Wang, Zhang, Sun, Yang, & Larson, 2015). It is worth noting that we excluded a set of words as explained in section "3.4. Sentiment score computation".
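To make the pre-processing step concrete, the following is a minimal sketch of the four functions above. It uses NLTK purely as an illustrative stand-in: the paper itself uses the Tsuruoka and Tsujii (2005) tagger and WordNet-based stemming, and the helper name `preprocess` is ours.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords'); nltk.download('wordnet')

def preprocess(review):
    """Split into sentences, POS-tag, drop stop words, reduce words to roots."""
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    processed = []
    for sentence in nltk.sent_tokenize(review):              # 1) sentence splitting
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # 4) POS tagging
        kept = [(lemmatizer.lemmatize(w.lower()), tag)       # 2) WordNet-based roots
                for w, tag in tagged
                if w.lower() not in stop]                    # 3) stop-word removal
        processed.append(kept)
    return processed

print(preprocess("Students help the teacher. The screens look amazing!"))
```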
Figure 1. The Architecture of the SOSML
3.2. Feature extraction

The features are text attributes that are useful for capturing patterns in data. Feature generation aims to extract a set of features to improve the overall quality of text classification. The following features are considered for classification. The procedure of feature extraction applied to the pre-processed data is explained in detail as follows:

i. Sentiment knowledge

Sentiment lexicon feature — a sentiment lexicon refers to a list of words that are used to
express a positive/negative sentiment. We employ the HCLr to extract both positive and negative words as separate features, with their frequencies as values.
- Number of positive words in the review text.
- Number of negative words in the review text.
Negation features — negations are special words that can affect the polarity of a sentence; in other words, a negation changes the sentiment of a sentence from negative to positive and vice versa (refer to section "3.4. Sentiment score computation"). For instance, given the sentence "I do not like this chair", the negation word "do not" changes the sentiment orientation of the word "like".
- Number of negation words.
Sentence types — we also considered the different types of sentences as features: subjective/objective sentences and conditional/interrogative sentences.
- Is the sentence subjective?
- Is the sentence objective?
- Is the sentence a question/interrogative sentence?
- Is the sentence a conditional sentence?
Punctuation features — the question mark ("?") and the exclamation mark ("!") are also considered as features. We counted their occurrences in each sentence and extracted two features as follows.
- Number of exclamation ("!") marks;
- Number of question ("?") marks;
POS feature — the number of occurrences of various parts of speech in a text (e.g., number of nouns, adjectives, verbs and adverbs). Example: "The/DT book/NN has/VBZ no/DT information/NN ./.". We only consider adjectives, adverbs, nouns and verbs, since they are more likely to convey sentiment orientation than words of other parts of speech.
- Number of nouns, adjectives, verbs, adverbs
Sentiment score feature — the sentiment score obtained using a lexicon-based approach (refer to section "3.4. Sentiment score computation") is also considered as a feature.
- Sentence sentiment score
ii. Statistical and linguistic knowledge

Preparing significant features using statistical and linguistic knowledge — the following significant features are determined for all sentences (refer to section "3.5. Statistical and Linguistic Knowledge").
- Sentence position;
- Title word;
- Key word;
- Cue words;
- Sentence-to-sentences similarity.
iii. Word Embedding model

Vector representation for word and sentence — in order to feed any NLP task into machine learning algorithms, text must first be transformed into a corresponding vector representation. Word embedding, also known as distributed representation, is one of the approaches that can be used to represent words with dense, low-dimensional, real-valued vectors. In this work, we employ a word embedding technique (Word2vec) for feature extraction to augment the other features extracted using sentiment, statistical and linguistic knowledge. Word2vec (https://code.google.com/archive/p/word2vec/) is a deep-learning-inspired approach to model words as vectors (Mikolov, Chen, Corrado, & Dean, 2013). It efficiently computes word vector representations in high-dimensional vector space and tries to capture the meaning of, and semantic relationships among, words. In other words, by mapping word vectors into a vector space, semantically similar words obtain similar vector representations, so these word vectors carry semantic information. Word2vec uses the Skip-gram and continuous bag-of-words (CBOW) models to compute the word vector representations. Skip-gram predicts the context given a word, while CBOW predicts a word given its context. The skip-gram model, a state-of-the-art word-embedding method, is used to obtain the vector representation of words. Given a sequence of training words $w_1, w_2, \dots, w_T$ in the domain-specific corpus, the model aims to maximize the average log probability (Q. Li, Jin, Wang, & Zeng, 2016):

$\frac{1}{T}\sum_{t=1}^{T}\Big[\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)\Big]$   (1)

where $c$ is the size of the training window and $p(w_{t+j} \mid w_t)$ denotes the probability of correctly predicting the word $w_{t+j}$, in which $w_t$ represents the middle word in the training window. In our method, for a given sentence $S$, the desired vector representation is extracted as follows: 1) the word2vec method from the gensim Python library (Rehurek & Sojka, 2010) extracts a vector representation for each word encountered in the sentence. Each word is represented by an N-dimensional vector (N = 100); the word $w_i$ thus corresponds to an N-dimensional word vector $v_{w_i}$. The vector representation of a sentence consisting of $n$ words is then $v_{w_1} \oplus v_{w_2} \oplus \dots \oplus v_{w_n}$, where $\oplus$ is the concatenation operator. It is worth noting that the method loops over the words of sentence S and extracts a vector representation only for words belonging to the word classes "noun", "verb", "adjective" and "adverb"; 2) in this step, the SOSML derives the vector representation of the given sentence as the average over the vectors of all its constituent words. This average vector represents the sentence vector. The derived sentence vector is then concatenated with the sentiment knowledge-based and statistical and linguistic knowledge-based feature vectors. The resulting hybrid vector is fed to the selected classification algorithm for accomplishing the desired task.

- The vector representation for the sentence, S.
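The following is a minimal sketch of the sentence-vector step described above, using the gensim library mentioned in the text. The toy corpus, the content-tag set and the helper name `sentence_vector` are illustrative assumptions, not the paper's code.

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["the", "battery", "lasts", "long"],
          ["the", "screen", "looks", "amazing"]]

# Train a skip-gram model (sg=1) with 100-dimensional vectors, as in the paper.
model = Word2Vec(corpus, vector_size=100, sg=1, window=5, min_count=1)

CONTENT_TAGS = {"NN", "VB", "JJ", "RB"}  # noun, verb, adjective, adverb prefixes

def sentence_vector(tagged_sentence):
    """Average the vectors of content words (noun/verb/adjective/adverb)."""
    vecs = [model.wv[w] for w, tag in tagged_sentence
            if tag[:2] in CONTENT_TAGS and w in model.wv]
    if not vecs:  # no content word found: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

print(sentence_vector([("the", "DT"), ("screen", "NN"), ("looks", "VBZ"),
                       ("amazing", "JJ")]).shape)  # (100,)
```

The resulting 100-dimensional sentence vector would then be concatenated with the other feature groups to form the hybrid vector described above.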
3.2.1. Feature selection
Feature selection aims to determine a subset of features (attributes) that are important for a classifier. In other words, feature selection is a process to reduce the original feature set and remove the features irrelevant for classification. This step is vital for the classification process, since the elimination of irrelevant, noisy, redundant and non-valuable features: 1) increases classification accuracy and improves the runtime of classification (Parlar & Özel, 2016); and 2) reduces the size of the feature space and improves the quality of the classification approach (Kolchyna, et al., 2015). We employ the following filter methods based on statistical measures, namely the information gain, gain ratio, Relief-F and symmetrical uncertainty techniques, to select key features. These techniques assign a score to each feature, and the top-scoring features are then selected. We describe the four feature selection methods in the following sections.
Relief-F (Kira & Rendell, 1992) — Relief-F is one of the successful filter approaches for feature selection. The main task of the Relief algorithm is to estimate the quality of features according to how well their values distinguish between instances that are near to each other. The corresponding process is shown in Algorithm 1 (Robnik-Šikonja & Kononenko, 2003). The Relief-F algorithm randomly selects a sample Xi (line 3), then searches for its two closest neighbors: one from the same class, called the closest hit H, and the other from a different class, called the closest miss M (line 4). Finally, the algorithm updates the quality estimation for all features depending on their values for Xi, H and M (lines 5 and 6). The process is repeated n times.

Algorithm 1. The Relief-F algorithm

Input: for each training sample, a vector of feature values and the class value;
Output: the vector W of estimations of the qualities of features;
1. Set all weights W[f] := 0;
2. For i := 1 to n do
3.   Randomly select a sample Xi;
4.   Find the nearest hit H and the nearest miss M;
5.   For f := 1 to #features do
6.     W[f] := W[f] - diff(f, Xi, H)/n + diff(f, Xi, M)/n;
       /* diff(f, s1, s2) calculates the difference between the values of feature f for two
          samples s1 and s2. For nominal attributes it was originally defined as
          diff(f, s1, s2) = 0 if value(f, s1) = value(f, s2), and 1 otherwise;
          for numerical attributes as
          diff(f, s1, s2) = |value(f, s1) - value(f, s2)| / (max(f) - min(f)). */
7. End.
Information gain (IG) (Hall & Smith, 1998) — the information gain measure is used to select the test attribute at each node of a decision tree. The IG measure aims: 1) to select features having a large number of values; 2) to decide the ordering of features; 3) to identify which feature in a given set of training features is most useful for the classification approach. Let $a$ be an attribute (feature), $S$ the set of all training examples and $|S|$ its cardinality. $V(a)$ denotes the set of all possible values of attribute $a$, and $val(x, a)$ defines the value of a specific example $x$ for attribute $a$. $C$ is the set of classes, and $S_c$ is the subset of $S$ including the training examples belonging to class $c$. $H$ denotes the entropy. The information gain for an attribute $a$ is defined as follows (Mitchell, 1997; Quinlan, 2014):

$IG(S, a) = H(S) - \sum_{v \in V(a)} \frac{|S_v|}{|S|} H(S_v)$   (2)

$H(S) = -\sum_{c \in C} \frac{|S_c|}{|S|} \log_2 \frac{|S_c|}{|S|}$   (3)
Gain ratio (GR) — GR is a feature selection algorithm based on the principle of IG (Salzberg, 1994; Sharma & Dey, 2012). The GR value of a feature is calculated by normalizing the IG value of the feature. A high GR value indicates that the text feature is useful for classification. The GR is computed as follows:

$GR(S, a) = \frac{IG(S, a)}{SplitInfo(S, a)}$   (4)

$SplitInfo(S, a) = -\sum_{j=1}^{v} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}$   (5)

where $SplitInfo(S, a)$ is computed by splitting the training examples into $v$ partitions, $v$ being the number of outcomes of a test on feature $a$, and $|S_j|$ represents the number of texts belonging to partition $S_j$.

Symmetrical uncertainty (SU) (Hall & Smith, 1998) — the SU algorithm uses an information-theoretic measure called symmetric uncertainty; since $SU(x, y)$ is the same as $SU(y, x)$, it reduces the number of comparisons required, where $x$ and $y$ are two features. Let $IG(x \mid y)$ be the information gain of feature $x$ given $y$, and let $H(x)$ and $H(y)$ be the entropies of features $x$ and $y$. The SU is computed using the following equation:

$SU(x, y) = \frac{2 \cdot IG(x \mid y)}{H(x) + H(y)}$   (6)
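As an illustration of the filter-based selection described in this section, the sketch below scores every feature and keeps the top-scoring ones. It assumes a scikit-learn setup and uses mutual_info_classif as an information-gain analogue on a synthetic dataset; it is not the paper's own implementation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# Score every feature, then keep the 10 top-scoring ones, as described above.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 10)
```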
3.2.2. Document vectors
After the feature extraction and feature selection tasks, the text is transformed into a feature vector that feeds the machine learning classifier. Each document (or sentence) is represented as a vector where each element is the weight of the corresponding feature. Each element (feature/attribute) is usually weighted using different approaches (i.e., word-based: word frequency, or word frequency and inverse document frequency; Boolean-based: present/absent, True/False, 0/1).

3.3. Classification algorithm
Given a training dataset (labelled data) $\{(x_i, y_i)\}$, where each sample $x_i$ belongs to the dataset and each label $y_i$ belongs to the set of pre-defined classes, a machine learning algorithm takes the training dataset as input and learns how to classify unlabelled data. We employed the following machine learning approaches and selected the best of them to determine the polarities of a review text.

Support Vector Machines (SVM) (Cortes & Vapnik, 1995) — the SVM is a powerful supervised algorithm for classification. The SVM aims to find the "maximum-margin hyperplane" that divides the group of samples of one class from the other class with maximum margin. Given the training sample set $\{(x_i, y_i)\}$, where $y_i \in \{-1, +1\}$ labels the training sample $x_i$, the SVM method needs to solve the following optimization problem:
$\min_{w, b, \xi} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \right\}$   (7)

subject to $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$,

where $w$ is the normal vector to the hyperplane (the weight parameter assigned to the variables) and $b$ is the bias term of the hyperplane; $\xi_i$ is the slack variable, and $C$ is the penalty factor that balances the maximization of the margin width against the minimization of the training error. The objective of problem (7) is to minimize $\frac{1}{2}\|w\|^2 + C\sum_{i} \xi_i$, where the value of $y_i(w \cdot x_i + b)$ needs to be greater than $1 - \xi_i$ and the value of $\xi_i$ is considered to be very small, i.e., nearly equal to 0.
Decision Trees (DT) (Mitchell, 1996) — the DT learning approach uses a tree structure for classification purposes. A decision tree starts with a single node, which branches into possible outcomes. The leaf nodes refer to class labels, the branches indicate conjunctions of features, and the non-leaf nodes represent conditional tests on features.
Naïve Bayes (NB) (Manning & Schütze, 1999) — the NB is a probabilistic classifier based on Bayes' theorem to calculate the probability that a data sample belongs to a specific class. The naïve assumption is that all features are completely independent of each other. Given a sample represented by a vector $X = (x_1, \dots, x_n)$, where $n$ indicates the number of features, the probability of the sample belonging to a class $c$ can be computed using the following formulas:

$P(c \mid X) = \frac{P(c)\,P(X \mid c)}{P(X)}$   (8)

$P(X \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$   (9)

where $P(c \mid X)$ indicates the probability of class $c$ given the sample $X$, $P(c)$ is the prior probability of the class, $P(X)$ is the probability of the feature vector, and $P(x_i \mid c)$ indicates the probability of feature $x_i$ given class $c$.
Logistic Regression (LR) (Hosmer Jr, Lemeshow, & Sturdivant, 2013) — logistic regression is a statistical approach that predicts the target class based on a set of features. It aims to find a model that describes the relationship between a dependent variable (outcome variable) and a set of independent variables (features/attributes). The LR uses the following equation to predict the probability of the dependent variable:

$p(y) = \frac{1}{1 + e^{-(\beta_0 + \sum_{i=1}^{n} \beta_i x_i)}}$   (10)

where $x_1, \dots, x_n$ are the independent variables, $\beta_0, \dots, \beta_n$ are the parameters, and $p(y)$ is the expected value of the dependent variable $y$.
Random Forest (RF) (Friedman, Hastie, & Tibshirani, 2001) — the RF is an algorithm based on an ensemble of tree-structured decision trees to perform the classification task. The random forest algorithm performs classification by creating several decision trees and outputting the class that receives the most votes. Let N be the number of training data samples and M be the number of features/attributes. To create each decision tree, the RF algorithm works as follows. First, it randomly selects a set of m samples and a set of n features; then, using the m samples and n features, the decision tree is grown. Each tree is fully grown and not pruned. A new sample can then be classified using the predictions of the k decision trees (k indicates the number of decision trees), i.e., by majority vote.
K-Nearest Neighbor (KNN) (Yang & Liu, 1999) — the KNN is a sample-based lazy learning algorithm. It does not need the training samples to perform any generalization; in other words, there is no training stage in this algorithm. Instead, the KNN needs all training samples during the testing stage. The KNN thus has no training step but has a costly testing step in terms of memory and time: it needs more time because, in the worst case, all samples may be considered in the decision, and it needs more memory to store all training samples during the testing step. Let $y$ be a test sample and $C = \{c_1, \dots, c_q\}$ a set of classes. Given the test sample $y$, the algorithm first selects the $k$ nearest neighbors from the training samples. Let $x_1, x_2, \dots, x_k$ be these nearest neighbors. Then, the similarity between each nearest neighbor and the test sample $y$ is used as the weight of the classes of the neighbor samples. The KNN score of $y$ for class $c_j$ is calculated as follows:

$score(y, c_j) = \sum_{x_i \in kNN(y)} sim(y, x_i)\, \delta(x_i, c_j)$   (11)

$\delta(x_i, c_j) = \begin{cases} 1, & x_i \in c_j \\ 0, & \text{otherwise} \end{cases}$   (12)

where $sim(y, x_i)$ is the similarity measure between the test sample $y$ and the training sample $x_i$, and $\delta(x_i, c_j)$ equals 1 if the training sample $x_i$ belongs to class $c_j$. The test sample $y$ is assigned to the class that has the highest KNN score.
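A minimal sketch of the similarity-weighted voting in Eqs. (11)-(12) is given below. The choice of cosine similarity and the toy data are our assumptions; the paper does not fix a particular similarity measure here.

```python
import numpy as np

def knn_score(y, X_train, labels, classes, k=3):
    """Return the class whose k nearest neighbors have the highest
    similarity-weighted vote for test sample y (Eqs. 11-12)."""
    sims = X_train @ y / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(y))
    nearest = np.argsort(sims)[-k:]                     # indices of k most similar
    scores = {c: sum(sims[i] for i in nearest if labels[i] == c)  # Eq. (11)
              for c in classes}
    return max(scores, key=scores.get)

X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = ["pos", "pos", "neg"]
print(knn_score(np.array([0.8, 0.2]), X_train, labels, ["pos", "neg"]))  # pos
```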
Artificial Neural Networks (ANN) — the ANN is an algorithm inspired by biological brains that can be employed to solve classification problems in various application domains such as information retrieval, pattern recognition, signal processing, etc. (Gao & Selmic, 2006; Ng, Yeung, Firth, Tsang, & Wang, 2008). As shown in Figure 2, an ANN is composed of multiple nodes (neurons). Each arrow displays a connection between two nodes and indicates a pathway for the flow of information; each link between two nodes also carries a (randomly initialized) weight. An ANN usually has three main layers of nodes: an input layer, a hidden layer with a non-linear activation function and an output layer. The input layer sends the input data to the hidden layer; the number of nodes in the input layer is equal to the number of features (attributes). The hidden layer takes the input data and performs simple operations on it, and the result is sent to the other nodes. The output at each node is called its activation. The output layer is responsible for outputting the value received from the last hidden layer. Summing up, given a set of features, the ANN aims to learn a function $f: R^m \rightarrow R^o$ by training on a dataset, where $m$ is the number of input features and $o$ is the number of target classes for output.
Figure 2. Structure of an ANN
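To illustrate how the seven classifiers described above can be compared, the following sketch cross-validates each of them on a synthetic dataset using scikit-learn. It is an illustrative setup, not the paper's actual experimental protocol.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

classifiers = {
    "SVM": SVC(), "DT": DecisionTreeClassifier(), "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000), "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(), "ANN": MLPClassifier(max_iter=1000),
}

# Rank the seven classifiers by 5-fold cross-validated accuracy.
for name, clf in classifiers.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```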
3.4. Sentiment score computation

The process of sentiment score calculation is described in the following steps.

1) Sentiment lexicons combination
In this study, we aim to build a high-coverage lexical resource (HCLr). As the number of words in a small sentiment lexicon is limited, sentiment words will be neglected if they are not in the lexicon. Therefore, to expand the sentiment dictionary coverage and to overcome the limited word coverage of individual dictionaries, we 1) merged several existing sentiment lexicons of different sizes and formats and 2) employed the Semantic Sentiment Approach (SSA). Many sentiment dictionaries (e.g., SentiWordNet (Baccianella, Esuli, & Sebastiani, 2010), Micro-WNOp (Cerini, Compagnoni, Demontis, Formentelli, & Gandini, 2007), WordNet-Affect (Strapparava & Valitutti, 2004)) have been manually or automatically constructed to classify positive and negative opinions in a text. We employed several such sentiment dictionaries; an overview of the most commonly used sentiment lexicons is presented in Table 1. As shown in Table 1, some of the lexicons include sentiment scores with various numerical ranges. Furthermore, some of them categorize sentiment words into positive, negative and neutral, while others classify them into types of emotions (e.g., "bad", "joy", "happy", "sadness"). Since these lexicons have different formats, we standardize them so that each word has one of the sentiment values 1, 0, -1. The standardization processes are explained below. It is worth noting that the sentiment score of each word in the combined dictionary, HCLr, is calculated by averaging the sentiment values of the overlapping words.
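The sketch below illustrates the standardization-and-averaging idea: each lexicon's scores are rescaled to [-1, +1] and overlapping words are averaged. The miniature lexicons shown are illustrative stand-ins for the ten resources in Table 1.

```python
from collections import defaultdict

# Each source lexicon: word -> score, plus the numeric range it uses.
afinn = ({"good": 3, "awful": -4}, (-5, 5))      # AFINN uses [-5, +5]
nrc = ({"good": 2.1, "broken": -3.5}, (-7, 7))   # NRC Hashtag uses [-7, +7]
opinion = ({"good": 1, "awful": -1}, (-1, 1))    # categorical: +1 / -1

def merge_lexicons(*lexicons):
    """Normalize every score to [-1, +1], then average overlapping words."""
    pooled = defaultdict(list)
    for words, (lo, hi) in lexicons:
        for word, score in words.items():
            pooled[word].append(2 * (score - lo) / (hi - lo) - 1)
    return {w: sum(s) / len(s) for w, s in pooled.items()}

hclr = merge_lexicons(afinn, nrc, opinion)
print(round(hclr["good"], 3))  # averaged, normalized score for 'good'
```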
The process of merging the sentiment dictionaries comprises the following steps:

General Inquirer (GI) (Stone & Hunt, 1963) — the sentiment words of GI have been classified into more than 180 groups. Therefore, we considered 'positiv', 'affil', 'strong', 'active', 'doctrin', 'pstv', 'virtue', 'PosAff' and 'yes' in this classification as positive words and assigned them a sentiment score of (+1). We also considered 'negativ', 'ngtv', 'weak', 'fail', 'passive', 'decreas', 'finish', 'no', 'negaff' as negative words and assigned them a sentiment score of (-1).

AFINN (Nielsen, 2011) — we normalize the sentiment score from [-5, +5] to [-1, +1].
Opinion Lexicon (M. Hu & Liu, 2004) — the sentiment words in the Opinion Lexicon have been categorized into positive and negative words. Therefore, we assigned a sentiment value of +1 to positive words and a sentiment value of -1 to negative words. Finally, we assigned a sentiment score of 0 to words which appear in both the positive and negative categories.

SenticNet4 (Cambria, Poria, Bajpai, & Schuller, 2016) — in this dictionary, a sentiment value (within a range of [-1, +1]) has been assigned to each sentiment word; we therefore used the sentiment score of each word directly.

SentiWordNet (Baccianella, et al., 2010) — we used the following equation to calculate the sentiment value for each word within the range of [-1, +1] (Khan, Qamar, & Bashir, 2016a):

$senti\_score = posscore - negscore$   (13)

where posscore and negscore are the positive and negative sentiment scores of each word, respectively. If senti_score > 0, the sentiment orientation of the word is positive; if senti_score < 0, it is negative; finally, if senti_score = 0, the word is objective.

SO-CAL (Taboada, Brooke, Tofiloski, Voll, & Stede, 2011) — the sentiment value of each sentiment word is normalized from [-5, +5] to [-1, +1].

Subjectivity Lexicon (Riloff & Wiebe, 2003) — words have been categorized into positive, negative, both and neutral. These categories are assigned +1, -1, 0 and 0, respectively.

WordNet-Affect (Strapparava & Valitutti, 2004) — various categories (e.g., 'positive-emotion', 'negative-emotion', 'ambiguous-emotion' and 'neutral-emotion') are used to classify each word. These categories are assigned +1, -1, 0 and 0, respectively.

NRC Hashtag Sentiment Lexicon and Sentiment140 Lexicon (Mohammad, et al., 2013) — word scores are normalized from [-7, +7] to [-1, +1]. A positive value indicates a positive orientation; a negative value indicates a negative orientation.

Table 1. An overview of ten lexical resources
| Sentiment lexicon | No. of words | Classification | Score | POS |
|---|---|---|---|---|
| General Inquirer (Stone & Hunt, 1963) | 11,789 | 'Positiv', 'Negativ', 'Pstv', 'Ngtv', 'EMOT', etc. | Nil | √ |
| AFINN (Nielsen, 2011) | 2,477 | Nil | [-5, +5] | √ |
| Opinion Lexicon (M. Hu & Liu, 2004) | 6,786 | 'Positive', 'Negative' | Nil | √ |
| SenticNet4 (Cambria, et al., 2016) | 50,000 | 'Positive', 'Negative' | [-1, +1] | √ |
| SO-CAL (Taboada, et al., 2011) | 6,306 | Nil | [-5, +5] | √ |
| NRC Hashtag Sentiment Lexicon (Mohammad, et al., 2013) | 54,129 | 'Positive', 'Negative' | [-7, +7] | √ |
| Sentiment140 Lexicon (Mohammad, et al., 2013) | 62,468 | 'Positive', 'Negative' | [-7, +7] | √ |
| Subjectivity Lexicon (Riloff & Wiebe, 2003) | 8,221 | 'Positive', 'negative', 'neutral' | Nil | √ |
| WordNet-Affect (Strapparava & Valitutti, 2004) | words: 4,787; synsets: 2,874 | Synsets are first grouped into 'behaviour', 'situation', 'trait', etc., and these groups are classified into 'positive', 'negative', 'ambiguous', 'neutral' | Nil | √ |
| SentiWordNet (Baccianella, et al., 2010) | words: 155,287; synsets: 117,659 | 'Positive', 'negative', 'objective' | [0, 1] | √ |
2) Semantic Sentiment Approach (SSA)

As mentioned above, limited word coverage is one of the major limitations of sentiment lexicons: a sentiment word will be discarded if it is not included in a sentiment lexicon. Thus, we employ the SSA to 1) tackle the aforementioned problem; 2) cope with lexical gaps; 3) determine the sentiment score of a word if it is not included in HCLr. In this method, we considered some specific POS (i.e., noun, adjective, adverb and verb). Given a word W, let WS denote the synonymous words of W that are collected using WordNet (Miller & Charles, 1991). For each word SW of WS, Algorithm 2 performs the following tasks: 1) if the word appears in the HCLr, then its sentiment score (SC) is obtained; 2) if the sentiment score value is positive, the SC is added to the positive sentiment score (Possw); 3) if the sentiment score value is negative, its magnitude is added to the negative sentiment score (Negsw). Finally, the total sentiment value of the word W is calculated using Eq. (14):

$Senti(W) = \frac{Possw - Negsw}{n + m}$   (14)

where n and m are the numbers of positive and negative synonyms, respectively, and Possw and Negsw accumulate the positive sentiment scores and the magnitudes of the negative sentiment scores. The SSA method is detailed in Algorithm 2.

Algorithm 2. The Semantic Sentiment Approach
Input: Word (W);
Output: Sentiment score of W;
1: Let W be an input word;
2: Let SC be the sentiment score of W;
3: Let Possw be the positive sentiment score;
4: Let Negsw be the negative sentiment score;
5: Let WS = {SW1, SW2, ..., SWn} denote an array that includes all synonyms of W;
6: Let Synset(W) be a function to collect the synonyms of W using WordNet;
7: Let n indicate the total number of positive sentiment scores;
8: Let m indicate the total number of negative sentiment scores;
9: Set m = n = SC = Possw = Negsw = 0;
10: For each SW of WS:
    i.   Look for SW in HCLr;
    ii.  IF SW is in HCLr, then:
         1) Get the SC of the corresponding word;
         2) If the SC is positive, then: n += 1; Possw = Possw + SC;
         3) If the SC is negative, then: m += 1; Negsw = Negsw - SC;
         4) Jump to step 10;
    iii. Otherwise, jump to step 10;
11: Finally, the total sentiment value of the given word is computed using Eq. (14);
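A minimal sketch of Algorithm 2 follows, using NLTK's WordNet interface to collect synonyms and an in-memory dictionary as a stand-in for HCLr; the word scores shown are illustrative, and the final averaging follows Eq. (14) as reconstructed above.

```python
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

hclr = {"glad": 0.7, "felicitous": 0.5, "terrible": -0.9}  # illustrative stand-in

def ssa(word):
    """Estimate a word's sentiment from the HCLr scores of its WordNet synonyms."""
    pos_sum = neg_sum = n = m = 0.0
    synonyms = {lemma.name() for syn in wordnet.synsets(word)
                for lemma in syn.lemmas()}
    for sw in synonyms:
        sc = hclr.get(sw)
        if sc is None:
            continue                 # synonym not covered by HCLr
        if sc > 0:
            n += 1; pos_sum += sc
        elif sc < 0:
            m += 1; neg_sum -= sc    # accumulate the magnitude
    if n + m == 0:
        return None                  # no scored synonym found
    return (pos_sum - neg_sum) / (n + m)  # Eq. (14), as reconstructed above

print(ssa("happy"))  # 0.6: averaged over the synonyms 'glad' and 'felicitous'
```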
3) Contextual polarity for sentiment analysis

Usually, a dictionary-based approach uses pre-defined sentiment scores to determine the overall sentiment orientation of a text. However, the pre-defined sentiment score may hurt the performance of a dictionary-based approach, because the polarity of a word given by a sentiment lexicon can be reversed: the polarity of a word (positivity or negativity) depends on the context in which it appears. As an example, in the sentence "The bed is not well", the polarity of the word 'well' is positive, while the polarity of the whole sentence is negative because of the negation word 'not' (Chen, et al., 2017; Kolchyna, et al., 2015; Liu, 2012; Wu & Wen, 2010; Xia, Xu, Yu, Qi, & Cambria, 2016).
On the other hand, we also considered the various types of sentences in sentiment analysis, since the type of a sentence also affects the performance of a sentiment analysis approach. There are different types of sentences (e.g., subjective sentences, comparative sentences, conditional/interrogative sentences, sarcastic sentences, objective sentences) that are relevant to sentiment analysis (Chen, et al., 2017; Narayanan, Liu, & Choudhary, 2009). In our work, we considered subjective/objective sentence handling, interrogative/conditional sentence handling, and sentiment shifters (e.g., negation handling, but-clause handling) for sentiment analysis.

Subjective and objective sentences — a subjective sentence includes a sentiment word (e.g., 'good', 'bad', 'excellent', 'poor') or expresses an opinion, while an objective sentence does not express an opinion but states factual information (Chen, et al., 2017).

Interrogative and conditional sentences — a review sentence including a sentiment word may not present any opinion; interrogative and conditional sentences can be of this type (Narayanan, et al., 2009). Consider "May you tell me which phone is good?", "If I can find a good phone in the shop, I will buy it" and "Is your car in a good condition?". All these sentences include the sentiment word 'good', but they do not express a positive or negative opinion about the product. Note, however, that not all conditional and interrogative sentences are free of opinions or sentiments (Liu, 2012).

Sentiment shifters — a sentiment shifter is a word that changes the sentiment orientation of a sentence (Liu, 2012; Polanyi & Zaenen, 2006; Xia, et al., 2016). Sentiment shifters include negations, but-clauses (contrasts), etc. (Taboada, et al., 2011). According to the findings of S. Li, Lee, Chen, Huang, and Zhou (2010), negations and but-clauses (contrasts) cover more than 60 percent of sentiment-shifting structures.
1. Negation handling. A negation involves special words, shown in Table 2, that can change the sentiment of a sentence from negative to positive and vice versa. We can detect a negated sentence using a set of pre-defined negation words: if a negation word appears in a sentence, the polarity of the following sentiment word is changed. Usually, the sentiment word appears between the negation word and a punctuation mark ('.', ',', '!', '?', ':', ';'). Given the sentence "He does not like red cars", the negation word "does not" changes the sentiment orientation of the word "like". It is worth noting that we do not consider a negation word that is part of a phrase such as "not only", "not wholly", "not all", "not just", "not quite", "not least", "no question", "not to mention" and "no wonder".
2. But-clause handling. A but-clause involves words like "but", "with the exception of", "except that", "except for", "however", "yet", "unfortunately", "though", "although" and "nevertheless". These words usually change the sentiment orientation of the sentence portion following them; in other words, the sentiment orientations before and after the contrary word (e.g., "but") are opposite to each other (Liu, 2012). As an example, given the sentence "I don't like this laptop, but the CPU speed is high", the but-clause changes the sentiment orientation relative to the previous phrase "I don't like this laptop". The polarity of the sentence can thus be set as follows: "I don't like [-1] this laptop, but the CPU speed is high [+1]".
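The sketch below illustrates the negation heuristic just described: word polarities are reversed between a negation word and the next punctuation mark, with phrase exemptions, so that the clauses around a contrast word end up with opposite orientations. The abbreviated word lists are stand-ins for Table 2 and the but-clause list.

```python
NEGATIONS = {"not", "no", "never", "don't", "doesn't", "cannot"}
NEGATION_PHRASES = {"not only", "not just", "no wonder"}  # exempt phrases
PUNCT = {".", ",", "!", "?", ":", ";"}

def negated_scores(tokens, scores):
    """Reverse each word's polarity between a negation word and punctuation."""
    out, negate = [], False
    for i, (tok, sc) in enumerate(zip(tokens, scores)):
        bigram = " ".join(tokens[i:i + 2])
        if tok in NEGATIONS and bigram not in NEGATION_PHRASES:
            negate = True            # reverse polarity until punctuation
        elif tok in PUNCT:
            negate = False
        out.append(-sc if negate else sc)
    return out

tokens = "i do not like this laptop , but the cpu speed is high".split()
scores = [0, 0, 0, +1, 0, 0, 0, 0, 0, 0, 0, 0, +1]  # 'like'=+1, 'high'=+1
print(negated_scores(tokens, scores))
# 'like' is reversed to -1 by "not"; 'high' stays +1, so the two clauses
# around "but" carry opposite orientations, as in the example above.
```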
Table 2. Sample of negation words (Kolchyna, et al., 2015; S. Li, et al., 2010)

no, not, never, neither, nor, none, nobody, nothing, nowhere, seldom, hardly, lack, lacks, lacking, without, don't, doesn't, isn't, won't, can't, couldn't, are not, was not, were not, did not, does not, have not, had not, can not, would not, should not, ...
4) Sentence sentiment score

Let QRS = {S1, S2, ..., Sn} indicate all sentences from the review text, where n is the number of sentences. For each sentence Si, the following tasks are performed:

I. If the sentence Si is an interrogative (Q) or conditional (C) sentence, the process continues with the next sentence of QRS. If Si is not a Q/C sentence, the method loops over each word of the sentence Si that belongs to the word classes (i.e., "noun", "verb", "adjective", "adverb") and performs the following tasks: each word (W) is looked up in the sentiment lexicon HCLr; if the word appears in HCLr, its sentiment score is obtained, and the pair of word and sentiment score is added to an array. Let WSC indicate the array of sentiment words and their sentiment scores.

II. If the word is not included in the HCLr, the method uses the SSA to determine the sentiment score of the current word. If the SSA returns a value, the pair of word and sentiment score is added to the WSC. If the SSA does not return a value, the method continues the process with the next word.

III. The method checks whether the current sentence includes a sentiment word. If so, it considers the sentence a subjective sentence, adds it to the array of subjective sentences (a subset of QRS), and calculates the sentence sentiment score, step (IV). If the current sentence does not include a sentiment word, the sentence is an objective sentence, and the method continues the process with the next sentence of QRS.

IV. The method checks negation handling and but-clause handling, and calculates the sentiment score of the current sentence using Eq. (15). Finally, the current sentence and its corresponding sentiment score are added to Sss. Let Sss indicate the sentence sentiment scores, where each entry pairs a sentence Si with its corresponding sentiment score.
$Senti(S_i) = \sum_{j=1}^{K} SC(W_j)$   (15)

where K is the number of sentiment words in the sentence Si and SC(Wj) indicates the sentiment score of word Wj. The aforementioned tasks are performed for each sentence of QRS. The corresponding process is shown in Algorithm 3.
Algorithm 3. The sentence sentiment score calculation
Input: QRS = {S1, …, Sn}, all sentences from the review text.
Output 1: Sentiment score of sentences; Output 2: Subjective sentences;
1: Let WC = {"noun", "verb", "adjective", "adverb"} indicate word classes;
2: Let POS(word) indicate the part-of-speech tag of a word;
3: Let WSC = {(W1, SS1), …} indicate the sentiment words and their sentiment scores;
4: Let SubS indicate all subjective sentences, where SubS ⊆ QRS;
5: Let Sss = {(Si, SSi)} indicate the sentence sentiment scores;
6: For each sentence Si ∈ QRS /* to calculate the sentiment score for each sentence */
   i. If Si is an interrogative or a conditional sentence, then jump to step 6 (next sentence); otherwise, continue with the following steps; /* to eliminate interrogative/conditional sentences */
   ii. WSC = NULL;
   iii. For each word (w) in Si
   iv. If POS(w) ∈ WC then, /* to obtain the sentiment score for each word */
      1. If w appears in HCLr, then assign the sentiment score to the word; add the pair (w, sentiment score) to WSC; otherwise, go to the following step, 2;
      2. Call SSA(w) to determine the sentiment score; if the SSA returns a value, then add the pair (w, sentiment score) to WSC and jump to step (iv); otherwise, jump to step (iv);
   v. If Si includes a sentiment word, then add Si to SubS and jump to step (vi); otherwise, jump to step 6; /* to identify objective/subjective sentences */
   vi. This step includes the following tasks:
      1. Check negation handling; /* if a negation operator appears before a sentiment word, then reverse the sentiment tendency */
      2. Check but-clause handling;
      3. Calculate the sentiment score of Si using equation (15);
      4. Add (Si, SSi) to Sss; jump to step 6;
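For concreteness, the following Python sketch mirrors Algorithm 3 under simplifying assumptions: hclr is a hypothetical dictionary standing in for the merged sentiment lexicon HCLr, and ssa_score, pos_tag, is_question_or_conditional and apply_shifters are hypothetical callables standing in for the SSA module, the POS tagger, the sentence-type check of step (i) and the negation/but-clause handling of step (vi).

WORD_CLASSES = {"noun", "verb", "adjective", "adverb"}

def sentence_sentiment_scores(sentences, hclr, ssa_score, pos_tag,
                              is_question_or_conditional, apply_shifters):
    # Returns the subjective sentences and their sentiment scores (Eq. 15).
    subjective, scores = [], {}
    for s in sentences:
        if is_question_or_conditional(s):       # step (i): drop Q/C sentences
            continue
        wsc = []                                # step (ii): (word, score) pairs
        for w in s.split():                     # step (iii): naive tokenization
            if pos_tag(w) not in WORD_CLASSES:  # step (iv): word-class filter
                continue
            if w in hclr:                       # step (iv.1): lexicon lookup
                wsc.append((w, hclr[w]))
            else:                               # step (iv.2): SSA fallback
                value = ssa_score(w)
                if value is not None:
                    wsc.append((w, value))
        if not wsc:                             # step (v): objective sentence
            continue
        subjective.append(s)                    # step (v): subjective sentence
        wsc = apply_shifters(s, wsc)            # steps (vi.1-2): negation / but
        scores[s] = sum(score for _, score in wsc)  # step (vi.3): Eq. (15)
    return subjective, scores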
3.5. Statistical and Linguistic Knowledge
3.5.1. Preparing significant features

A text summarization approach aims to select the most significant sentences in a document. Therefore, it is very important to determine the features that help to identify such sentences and improve the quality of the summary. In our method, we consider features such as the key-word method, title method, position method, cue-phrase method and sentence-to-sentences similarity. These features are described in the following sections.

1) Key-word method — The most frequent terms in a source text carry relevant information and can be indicative of the document's topic (A. Abdi, Idris, Alguliyev, & Aliguliyev, 2016; Nenkova, Vanderwende, & McKeown, 2006; Neto, Freitas, & Kaestner, 2002). An important sentence can be identified by counting the number of significant words it contains (Neto, et al., 2002; C. S. Yadav & Sharan, 2015). Moreover, a sentence with a frequent word has a greater chance of being included in the summary (V. Gupta & Lehal, 2010). Obviously, not all words in a text are of equal importance. Therefore, we used the traditional TF-IDF (Term Frequency-Inverse Document Frequency) measure to find significant words, Eq. (16) (Alguliyev, Aliguliyev, & Isazade, 2015b; Ouyang, Li, Li, & Lu, 2011):

$w_{ij} = tf_{ij} \times \log(ND / n_j)$   (16)

Where $tf_{ij}$ is the term frequency of term $t_j$ in the document, ND is the total number of documents, and $n_j$ indicates the number of documents in which term $t_j$ appears.
In this function, if a sentence contains a key word, then K(Si) = 1; otherwise K(Si) = 0.
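A minimal sketch of the Eq. (16) weighting, assuming documents are given as token lists:

import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    # Eq. (16): term frequency in the document times log(ND / n_term)
    tf = Counter(doc_tokens)[term]
    n_term = sum(1 for d in all_docs if term in d)  # documents containing term
    return tf * math.log(len(all_docs) / n_term) if n_term else 0.0

docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
print(tf_idf("good", docs[0], docs))  # 1 * log(3/2) ≈ 0.405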
2) Sentence position method — The position of a sentence in the document, section or paragraph can also be used to assess the sentence's relevance (Neto, et al., 2002; Saggion, 2014). According to previous studies, authors usually present the main idea in certain sections, such as the beginning of the paragraphs or the opening paragraphs (A. Abdi, et al., 2016; Neto, et al., 2002; C. S. Yadav & Sharan, 2015). Paragraphs at the beginning and end of a document are more likely to contain information that is useful for a summary, especially the first and last sentences of the paragraphs (V. Gupta & Lehal, 2010; Teufel & Moens, 1997; Xie, Liu, & Lin, 2008). We used Eq. (17) (C. S. Yadav & Sharan, 2015) to calculate the sentence position score:

$P(S_i) = \frac{N - n + 1}{N}$   (17)
Where N is the total number of sentences and n is the position of sentence Si. In this function, the position score of each sentence is determined by its position n.

3) Title method — A significant sentence normally includes a term that appears in the title or major headings of a document (Kupiec, Pedersen, & Chen, 1995; Qazvinian, Hassanabadi, & Halavati, 2008; Shareghi & Hassanabadi, 2008). Thus, terms occurring in the title are good candidates for document-specific concepts (Teufel & Moens, 1997). In this function, if a sentence includes a title word, T(S) = 1; otherwise T(S) = 0.
4) Cue method — Cue phrases, such as "as a conclusion" or "in particular", are often followed by important information. Thus, sentences that contain one or more of these cue phrases are considered more important than sentences without cue phrases (Zhang, Sun, & Zhou, 2005). In our work, in order to apply the cue method, we collected a set of cue words from previous studies, Table 3 (A. Abdi, et al., 2016; Alonso, 2005; Fraser, 1999; Knott, 1996). In this function, if a sentence contains a cue word, C(S) = 1; otherwise C(S) = 0. A short sketch of the position, title and cue features is given after Table 3.

Table 3. A sample of cue words
Cue words
"therefore; thus; consequently; hence; as a result; to conclude; in conclusion"
"as a result; in short; to sum up; to summarize; to recapitulate; in consequence"
"last of all; finally; to end; to complete; to bring to an end; to close"
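A compact sketch of these three features; the position formula follows the Eq. (17) reconstruction above, and the cue list below is a small subset of Table 3:

CUE_WORDS = {"therefore", "thus", "in conclusion", "as a result", "finally"}

def position_score(n, total):
    # Eq. (17): position score for the sentence at 1-based position n
    return (total - n + 1) / total

def title_score(sentence, title):
    # 1 if the sentence shares a word with the title, else 0
    return int(bool(set(sentence.lower().split()) & set(title.lower().split())))

def cue_score(sentence):
    # 1 if the sentence contains a cue phrase from Table 3, else 0
    s = sentence.lower()
    return int(any(cue in s for cue in CUE_WORDS))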
5) Sentence-to-sentences similarity based on the Vector Space Model (VSM): SSVSM

The overall task of the SSVSM is to compute the similarity between two sentences using the following steps:
1. Take two sentences, S1 and S2;
2. Create a word set (WS) using S1 and S2;
3. Create a semantic vector for S1 and S2, separately;
4. Expand the sentence words using the CWE approach (this approach is used by step 3);
5. Compute the semantic similarity measure (SSM) between S1 and S2 using Eq. (19);
6. Assign the similarity score to the edge between S1 and S2.
a) The Word-Set (WS)

Let S1 = {w11, w12, …, w1m} and S2 = {w21, w22, …, w2k} be two sentences. Let WS = {w1, w2, …, wN} be a 'word-set', where N is the number of distinct words from sentences S1 and S2. The WS is created using the following steps (a sketch follows this list):
1. Take two sentences, S1 and S2;
2. For each word (W) from sentence S1:
2.1. If W ∈ WS, then continue step 2 with the next word; otherwise, go to step 2.2;
2.2. If W ∉ WS, then add W to WS; jump to step 2;
2.3. Perform steps (2-2.2) for sentence S2.
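A minimal sketch of the word-set construction, keeping the distinct words of both sentences:

def build_word_set(s1_tokens, s2_tokens):
    # Steps 1-2.3: collect each unseen word of S1, then of S2
    ws = []
    for w in s1_tokens + s2_tokens:
        if w not in ws:
            ws.append(w)
    return ws

print(build_word_set(["a", "good", "film"], ["a", "fine", "film"]))
# ['a', 'good', 'film', 'fine']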
b) Content Word Expansion (CWE) method

The SSVSM employs the Content Word Expansion (CWE) method to 1) improve the sentence ranking result; 2) tackle the limited information expressed in each sentence; 3) overcome the vocabulary mismatch problem in sentence comparison. It also bridges the lexical gaps between semantically similar contexts that are expressed in different wordings. The CWE method is based on semantic word similarity. The semantic similarity between two words (SSbW) is determined as follows.
Dice measure (Vani & Gupta, 2014) — the similarity between two words, based on their synonym sets, is defined as:

$SSbW(w_1, w_2) = \begin{cases} 1 & \text{if } w_1 = w_2 \\ \dfrac{2\,|S(w_1) \cap S(w_2)|}{|S(w_1)| + |S(w_2)|} & \text{otherwise} \end{cases}$   (18)

Where S(w) is the set of words (synonyms) of w based on WordNet, and |S(w)| represents the cardinality of the set S(w).
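A sketch of Eq. (18) using NLTK's WordNet interface for the synonym sets (the WordNet corpus must be downloaded beforehand):

from nltk.corpus import wordnet

def synonyms(word):
    # S(w): all WordNet lemma names across the synsets of `word`
    return {lemma.name() for syn in wordnet.synsets(word) for lemma in syn.lemmas()}

def word_similarity(w1, w2):
    # Eq. (18): 1 for identical words, otherwise the Dice coefficient
    if w1 == w2:
        return 1.0
    s1, s2 = synonyms(w1), synonyms(w2)
    if not s1 or not s2:
        return 0.0
    return 2 * len(s1 & s2) / (len(s1) + len(s2))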
c) Semantic Similarity Measurement (SSM)

The SSM of two sentences is computed based on their two semantic vectors. Each semantic vector is created using the word-set and the corresponding sentence. Each cell in the semantic vector corresponds to a word in the word-set; therefore, the dimension of the semantic vector equals the number of words in the word-set. The weight of each cell is determined as follows: if a word from the word-set appears in the sentence, the corresponding cell of the vector is set to 1. If the word does not appear in the sentence, the semantic similarity between the current word from the word-set and all words from the corresponding sentence is computed using the CWE method, to check whether a similar word exists. If it does, the weight of the corresponding cell is set to the highest similarity value; if it does not, the weight is set to 0. Finally, the semantic similarity is computed using equation (19). The corresponding steps are presented next.

The Jaccard measure (Jaccard, 1912) — the Jaccard measure calculates the semantic similarity between S1 and S2 using the following steps (a sketch follows these steps):

1. Make the semantic vector (SV):
1.1. Take the WS and a sentence, S1 or S2, as input;
1.2. SV dimension = number of words in the WS;
1.3. Let S1 or S2 = (sw1, sw2, …, swn), where n is the length of the sentence;
1.4. Let sim_value list = {}, which holds similarity values between two words.
2. Weight each cell of the SV using the following steps, 2.1-2.5:
2.1. For each W ∈ WS do
2.2. If W ∈ S1, then weight(W) = 1; otherwise, go to the next step;
2.3. If W ∉ S1, then
2.3.1. For each sw ∈ S1
2.3.2. Compute SSbW(W, sw), then add it to the sim_value list;
2.4. If sim_value list ≠ ∅, then weight(W) = the highest value in the sim_value list; otherwise, go to the next step;
2.5. If sim_value list = ∅, then weight(W) = 0.
3. The SV is produced for both sentences S1 and S2. Eq. (19) is employed to calculate the semantic similarity between the two sentences:

$SSM(S_1, S_2) = \dfrac{\sum_{i=1}^{N} w_{1i} w_{2i}}{\sum_{i=1}^{N} w_{1i}^2 + \sum_{i=1}^{N} w_{2i}^2 - \sum_{i=1}^{N} w_{1i} w_{2i}}$   (19)

Where V1 = (w11, …, w1N) and V2 = (w21, …, w2N) are the semantic vectors of sentences S1 and S2, respectively; $w_{ji}$ is the weight of the i-th word in vector Vj; and N is the number of words in the word-set.
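A sketch of the semantic-vector construction and the Eq. (19) measure, reusing build_word_set and word_similarity from the sketches above:

def semantic_vector(word_set, sentence_tokens):
    # Weight each word-set entry: 1 if present in the sentence, otherwise
    # the highest CWE similarity to any sentence word (0 if none)
    vec = []
    for w in word_set:
        if w in sentence_tokens:
            vec.append(1.0)
        else:
            sims = [word_similarity(w, sw) for sw in sentence_tokens]
            vec.append(max(sims) if sims else 0.0)
    return vec

def ssm(v1, v2):
    # Eq. (19): Jaccard measure over the two semantic vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    denom = sum(a * a for a in v1) + sum(b * b for b in v2) - dot
    return dot / denom if denom else 0.0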
We use the graph-based model to calculate the total similarity score of each sentence as follows:
In the graph-based model, each node represents a sentence. The edge between two nodes (sentences) indicates the similarity between the pair of sentences, where each sentence is represented as a vector of word-specific weights. Let T = {S1, S2, …, Sm} include all sentences from the review text, where m is the number of sentences, and let Si denote a review sentence. The model calculates the similarity between each pair of sentences of T (e.g., Sim(Si, Sj), where Si, Sj ∈ T and i, j ≤ m) using the sentence-to-sentences similarity based on the Vector Space Model (VSM). Finally, the total similarity score of each sentence is computed using Eq. (20):

$Score(S_i) = \sum_{S_j \in T,\, j \neq i} Sim(S_i, S_j)$   (20)

Where T includes all sentences presented in the graph-based model; the sum calculates the similarity score between sentence Si and the other sentences of T, and Sim(Si, Sj) is calculated using Eq. (19).
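A sketch of the Eq. (20) scoring over the sentence graph, with sim standing in for the Eq. (19) measure:

def total_similarity_scores(sentences, sim):
    # Eq. (20): sum of each sentence's similarity to every other sentence
    return {si: sum(sim(si, sj) for sj in sentences if sj is not si)
            for si in sentences}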
3.6. Summary generation: Sentence selection and redundancy removal
The summary includes the opinionated sentences. One of the main issues in multi-document summarization is redundancy: since the sentences are selected from various documents, they may convey the same information. Therefore, a high-quality text summary must not only be informative, it must also be concise (non-redundant), and the redundancy must be removed to avoid repeated information. To do this, the method performs the following tasks. Let FS indicate an initially empty set and Score(Si) indicate the total similarity score of each sentence. The sentences of T are ranked according to their scores in descending order. In the first step, the top sentence is moved from T to FS. In the second step, the next top sentence is selected from T; before adding the current sentence to FS, it is checked to make sure that its similarity to every sentence of FS does not exceed the similarity threshold (ST). If it does not, the sentence is added to FS; otherwise, the sentence is discarded. These steps are repeated until the required number of sentences in the final summary is reached. In the end, the set FS is considered the final summary. To estimate the value of ST, we used a gradient search strategy: we ran a set of experiments with ST ranging from 0.1 to 0.9 to observe the variation in performance. A sketch of this selection procedure is given below.
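A sketch of the selection loop, assuming scores maps sentences to their Eq. (20) totals and sim is the Eq. (19) measure:

def select_summary(scores, sim, summary_size, st=0.5):
    # Rank by total similarity score, then greedily keep a sentence only if
    # its similarity to every already-selected sentence stays within ST
    ranked = sorted(scores, key=scores.get, reverse=True)
    final_summary = []
    for s in ranked:
        if len(final_summary) == summary_size:
            break
        if all(sim(s, chosen) <= st for chosen in final_summary):
            final_summary.append(s)
    return final_summary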
4. Experiments

This section presents the evaluation of the proposed method. We conduct the experiments on publicly available datasets from online repositories: 1) the DUC 2001 and DUC 2002 datasets provided by the Document Understanding Conference (DUC, http://duc.nist.gov); and 2) the Movie Review Dataset (http://www.cs.cornell.edu/people/pabo/movie-review-data/). The DUC 2002 dataset contains news reports in English, taken from newspapers and news agencies such as the Financial Times, Associated Press and the Wall Street Journal; it includes 567 documents in 59 sets. The DUC 2001 dataset includes sixty sets of approximately 10 documents. It includes two main tasks: 1) single-document summarization: given a single newswire/newspaper document, a generic summary of the document with a length of approximately 100 words is created; 2) multi-document summarization: given a set of newswire/newspaper documents, four generic summaries of the documents with lengths of approximately 400, 200, 100 and 50 words are created. The DUC 2001 and DUC 2002 datasets are each divided into three data subsets: DUC 2001_1, DUC 2001_2 and DUC 2001_3, and DUC 2002_1, DUC 2002_2 and DUC 2002_3, respectively. Furthermore, the Movie dataset is divided into two data subsets, denoted Mov_1 and Mov_2. We describe the details of the experiments in the following sections. It is worth noting that in our experiments we used the WEKA tool (Waikato Environment for Knowledge Analysis, http://www.cs.waikato.ac.nz/ml/weka/), version 3.8.1, which is a collection of different feature selection approaches and machine learning methods.

4.1. Preparing the gold standard data

To evaluate the proposed method, we need gold standard data, which is a set of all correct results. Based on this dataset, also known as judgment data, we can decide whether
the output of the method is correct or not. For this purpose, we employed three annotators: 1) two English teachers with good reading skills and understanding of the English language; and 2) a lecturer with experience in English language teaching. The annotators aimed to create a gold standard (including training data and testing data) using the opinionated sentences. The gold standard was created using the following process: 1) the data was annotated at the sentence level; 2) subsequently, the annotators extracted all features according to section "3.2. Feature extraction"; 3) they discriminated between objective and subjective sentences, and then considered only the subjective sentences to produce a summary text. It is worth noting that, if there was any disagreement between the first two annotators, we followed the third annotator to obtain an optimal gold standard. We used Cohen's kappa (Cohen, 1968; Fleiss, 1971) as a measure of agreement between the two annotators. The kappa coefficient was 0.62, which indicates that our annotators had good agreement (Landis & Koch, 1977) in creating each summary.
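The agreement figure can be reproduced with scikit-learn's kappa implementation; the labels below are hypothetical sentence-level annotations, not the study's data:

from sklearn.metrics import cohen_kappa_score

annotator1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # 1 = subjective, 0 = objective
annotator2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(cohen_kappa_score(annotator1, annotator2))  # agreement beyond chance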
4.2. Evaluation metrics

We evaluated the proposed method using two key performance indicators (KPIs):
1) We used the standard ROUGE-N metric, Eq. (21) (Lin, 2004), to evaluate the performance of the proposed method. ROUGE has been adopted by DUC as the official evaluation metric for text summarization. ROUGE-N is calculated as follows:

$ROUGE\text{-}N = \dfrac{\sum_{S \in \{Reference\ summaries\}} \sum_{gram_N \in S} Count_{match}(gram_N)}{\sum_{S \in \{Reference\ summaries\}} \sum_{gram_N \in S} Count(gram_N)}$   (21)

Where N indicates the length of the n-gram, Count_match(gram_N) indicates the total number of n-grams occurring in both the reference and candidate summaries, and Count(gram_N) indicates the number of n-grams in the reference summaries. In our experiments, we employed the two metrics ROUGE-1 and ROUGE-2. We also measured the Average ROUGE Score (ARS) using equation (22):

$ARS = \dfrac{ROUGE\text{-}1 + ROUGE\text{-}2}{2}$   (22)
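A minimal sketch of Eq. (21) and Eq. (22) over tokenized summaries:

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, references, n):
    # Eq. (21): matched reference n-grams over all reference n-grams
    cand = Counter(ngrams(candidate, n))
    matched = total = 0
    for ref in references:
        ref_counts = Counter(ngrams(ref, n))
        total += sum(ref_counts.values())
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0

def ars(candidate, references):
    # Eq. (22): average of ROUGE-1 and ROUGE-2
    return (rouge_n(candidate, references, 1) +
            rouge_n(candidate, references, 2)) / 2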
2) Statistical significance test: the Wilcoxon test (Statistics, 2015) is used to verify the significance of the results obtained by the proposed method. "The Wilcoxon signed-rank test is the nonparametric test equivalent to the dependent t-test. It is used to compare two sets of scores that come from the same participants. This can occur when we wish to investigate any change in scores from one time point to another, or when individuals are subjected to more than one condition."
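The test can be reproduced with SciPy; the two paired samples below are the per-dataset ARS values of SOSML and TSAD taken from Table 8 in section 4.8:

from scipy.stats import wilcoxon

sosml = [0.3723, 0.4332, 0.3937, 0.3962, 0.3588, 0.3860, 0.3760, 0.3710]
tsad  = [0.2512, 0.2460, 0.2523, 0.2490, 0.2425, 0.2534, 0.2299, 0.2220]
statistic, p_value = wilcoxon(sosml, tsad)  # paired, nonparametric
print(p_value < 0.05)  # True: reject H0 at the 5% significance level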
4.3. Feature selection approach and machine learning classification algorithm

A feature selection approach (FSA) is a process that selects a minimal set of features from the original set of features. An FSA aims to improve classification accuracy by removing noisy and irrelevant features. In this paper, the Relief-F (Relf), Information Gain (IG), Gain Ratio (GR) and Symmetrical Uncertainty (SU) algorithms, together with five pre-defined feature set sizes (5, 8, 11, 14 and 17, the last being all features), are employed to select significant features. Furthermore, in order to evaluate the performance of each feature selection method, we also calculate the accuracy of seven machine learning classification methods (Support Vector Machines (SVM), Decision Trees (DT), Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), K-Nearest Neighbor (KNN) and Artificial Neural Networks (ANN)) trained on the features selected by the aforementioned approaches. In our experiments we used 10-fold cross validation: each data subset (e.g., the DUC 2001_1 dataset) is divided into 10 equal subsets; nine subsets are used as the training set and the remaining subset as the test set; the process is repeated ten times; and the average result over the ten runs is taken as the result of the 10-fold cross validation. A sketch of this protocol is given below.
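The paper's experiments use WEKA; an equivalent sketch of the protocol in scikit-learn (mutual information as an Information Gain analogue, k = 5 selected features, an SVM classifier, 10-fold cross validation) on synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data: 17 features per sentence, binary "in summary" labels
X, y = make_classification(n_samples=200, n_features=17, random_state=0)

clf = make_pipeline(SelectKBest(mutual_info_classif, k=5), SVC())
print(cross_val_score(clf, X, y, cv=10).mean())  # 10-fold CV accuracy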
Summing up, the following questions are also addressed: 1) which of the popular feature selection techniques (see section "4.5. Comparisons between feature selection approaches") and 2) which of the popular machine learning techniques (see section "4.6. Comparisons between machine learning methods") perform best in multi-document sentiment-oriented summarization.

4.4. Experimental results and analysis
Figures 3-6 present the experimental results. For instance, Figure 3 displays the ARS values obtained by combining the SU method with the seven machine learning classification methods (NB, SVM, KNN, ANN, RF, DT and LR) on the eight data subsets. The DUC 2001_1 panel of Figure 3 shows the ARS values obtained over the DUC 2001_1 dataset. The horizontal axis of each panel shows the five pre-defined feature set sizes (i.e., 5, 8, 11, 14, 17); the vertical axis shows the ARS values. Table 4 presents the best ARS values for the eight data subsets. For instance, the value 0.2635 in Table 4 indicates the best ARS value obtained over the DUC 2001_1 dataset using SU and SVM, when 5 features are used; in other words, the 0.2635 in Table 4 corresponds to the peak of the SVM curve in the DUC 2001_1 panel of Figure 3.
Figure 3. The performance curves of the seven learning approaches using SU vs. the number of features (panels: DUC 2001_1, DUC 2001_2, DUC 2001_3, DUC 2002_1, DUC 2002_2, DUC 2002_3, Mov_1 and Mov_2 datasets).
Figure 4. The performance curves of the seven learning approaches using Relf vs. the number of features, for the same eight datasets.
Figure 5. The performance curves of the seven learning approaches using GR vs. the number of features, for the same eight datasets.
Figure 6. The performance curves of the seven learning approaches using IG vs. the number of features, for the same eight datasets.

Table 4. The best ARS values using the combination of different feature selection techniques and machine learning approaches
DUC 2001-1 Dataset
       SU      Relf    GR      IG      Ave.
KNN    0.2030  0.2245  0.2288  0.2305  0.2217
DT     0.2213  0.2640  0.2693  0.2630  0.2544
SVM    0.2635  0.3265  0.3230  0.3385  0.3129
NB     0.2080  0.3180  0.3243  0.3257  0.2940
LR     0.2064  0.2230  0.2324  0.2305  0.2231
RF     0.2140  0.2960  0.3018  0.2869  0.2746
ANN    0.2190  0.3010  0.3068  0.2969  0.2809
Ave.   0.2193  0.2790  0.2837  0.2817

DUC 2001-2 Dataset
       SU      Relf    GR      IG      Ave.
KNN    0.2125  0.2195  0.2293  0.2440  0.2263
DT     0.2830  0.2990  0.3003  0.3090  0.2978
SVM    0.3245  0.3705  0.3858  0.3895  0.3676
NB     0.2216  0.3942  0.3936  0.3940  0.3508
LR     0.2215  0.2197  0.2430  0.2952  0.2448
RF     0.2523  0.3300  0.3546  0.3515  0.3221
ANN    0.2573  0.3550  0.3596  0.3565  0.3321
Ave.   0.2532  0.3125  0.3237  0.3342

DUC 2001-3 Dataset
       SU      Relf    GR      IG      Ave.
KNN    0.1863  0.2192  0.2286  0.2258  0.2150
DT     0.2490  0.2900  0.2936  0.2958  0.2821
SVM    0.2658  0.3435  0.3550  0.3613  0.3314
NB     0.1806  0.3290  0.3242  0.3230  0.2892
LR     0.1735  0.2005  0.2080  0.2108  0.1982
RF     0.2119  0.3263  0.2939  0.3094  0.2853
ANN    0.2169  0.3313  0.3189  0.3194  0.2966
Ave.   0.2120  0.2914  0.2889  0.2922

DUC 2002-1 Dataset
       SU      Relf    GR      IG      Ave.
KNN    0.2430  0.2474  0.2496  0.2680  0.2520
DT     0.2651  0.3004  0.3060  0.3030  0.2936
SVM    0.3191  0.3508  0.3675  0.3629  0.3501
NB     0.2166  0.3357  0.3448  0.3330  0.3075
LR     0.2140  0.2190  0.2263  0.2530  0.2281
RF     0.2406  0.3432  0.3228  0.3130  0.3049
ANN    0.2456  0.3482  0.3278  0.3230  0.3111
Ave.   0.2491  0.3064  0.3064  0.3080

DUC 2002-2 Dataset
       SU      Relf    GR      IG      Ave.
KNN    0.2490  0.2590  0.2555  0.2605  0.2560
DT     0.2674  0.2435  0.2430  0.2480  0.2505
SVM    0.2942  0.2728  0.2835  0.2924  0.2857
NB     0.2586  0.2340  0.2480  0.2458  0.2466
LR     0.2374  0.2230  0.2230  0.2280  0.2279
RF     0.2611  0.2388  0.2455  0.2469  0.2481
ANN    0.2661  0.2438  0.2505  0.2619  0.2556
Ave.   0.2620  0.2450  0.2499  0.2548

DUC 2002-3 Dataset
       SU      Relf    GR      IG      Ave.
KNN    0.1774  0.2575  0.2480  0.2580  0.2352
DT     0.2043  0.2725  0.2680  0.2707  0.2539
SVM    0.2391  0.3215  0.3249  0.3335  0.3047
NB     0.1644  0.2890  0.2880  0.3030  0.2611
LR     0.1966  0.2555  0.2467  0.2530  0.2379
RF     0.1843  0.2953  0.2780  0.2855  0.2608
ANN    0.1893  0.3103  0.2830  0.2955  0.2695
Ave.   0.1936  0.2859  0.2767  0.2856

Mov_1 Dataset
       SU      Relf    GR      IG      Ave.
KNN    0.2396  0.2480  0.2545  0.2680  0.2525
DT     0.2835  0.3180  0.3222  0.3208  0.3111
SVM    0.3296  0.3635  0.3761  0.3835  0.3632
NB     0.2030  0.3485  0.3658  0.3660  0.3208
LR     0.2058  0.2130  0.2237  0.3180  0.2401
RF     0.2408  0.3333  0.3408  0.3420  0.3142
ANN    0.2458  0.3383  0.3458  0.3470  0.3192
Ave.   0.2497  0.3089  0.3184  0.3350

Mov_2 Dataset
       SU      Relf    GR      IG      Ave.
KNN    0.2835  0.2940  0.2957  0.3080  0.2953
DT     0.3555  0.3748  0.3787  0.3730  0.3705
SVM    0.3781  0.4297  0.4290  0.4335  0.4176
NB     0.2296  0.4243  0.4319  0.4308  0.3791
LR     0.2480  0.2841  0.2786  0.2958  0.2766
RF     0.2868  0.3991  0.4000  0.4019  0.3719
ANN    0.2918  0.3969  0.4000  0.4069  0.3739
Ave.   0.2962  0.3718  0.3734  0.3785
4.5. Comparisons between feature selection approaches
As displayed in Table 5, the best average ARS values of the various feature selection approaches (IG, GR, SU and Relf) on the different datasets are computed using the data presented in Table 4. Moreover, the overall average ARS value of each feature selection approach is computed, as presented in the last row of Table 5. According to the results displayed in Table 5, in terms of ARS, IG (0.3088) performs best among the four feature selection approaches and produced the most accurate subsets of features for the various machine learning methods on the eight datasets.

Table 5. The average best ARS values of various feature selection techniques
Dataset (ARS values)   SU      Relf    GR      IG
DUC 2001_1 Dataset     0.2193  0.2790  0.2837  0.2817
DUC 2001_2 Dataset     0.2532  0.3125  0.3237  0.3342
DUC 2001_3 Dataset     0.2120  0.2914  0.2889  0.2922
DUC 2002_1 Dataset     0.2491  0.3064  0.3064  0.3080
DUC 2002_2 Dataset     0.2620  0.2450  0.2499  0.2548
DUC 2002_3 Dataset     0.1936  0.2859  0.2767  0.2856
Mov_1 Dataset          0.2497  0.3089  0.3184  0.3350
Mov_2 Dataset          0.2962  0.3718  0.3734  0.3785
Ave.                   0.2419  0.3001  0.3026  0.3088
4.6. Comparisons between machine learning methods
As presented in Table 6, the best average ARS values of the various machine learning classification methods (NB, SVM, KNN, ANN, RF, DT and LR) on the different datasets are calculated using the data presented in Table 4. Moreover, the overall average ARS value of each machine learning method is computed, as displayed in the last row of Table 6. According to the results shown in Table 6, we can conclude that, in terms of ARS, SVM performs best among the seven machine learning methods (0.3416 ARS value).

Table 6. The average best ARS values of various machine learning approaches
Dataset (ARS values)   SVM     ANN     NB      KNN     LR      DT      RF
DUC 2001_1 Dataset     0.3129  0.2809  0.2940  0.2217  0.2231  0.2544  0.2746
DUC 2001_2 Dataset     0.3676  0.3321  0.3508  0.2263  0.2448  0.2978  0.3221
DUC 2001_3 Dataset     0.3314  0.2966  0.2892  0.2150  0.1982  0.2821  0.2853
DUC 2002_1 Dataset     0.3501  0.3111  0.3075  0.2520  0.2281  0.2936  0.3049
DUC 2002_2 Dataset     0.2857  0.2556  0.2466  0.2560  0.2279  0.2505  0.2481
DUC 2002_3 Dataset     0.3047  0.2695  0.2611  0.2352  0.2379  0.2539  0.2608
Mov_1 Dataset          0.3632  0.3192  0.3208  0.2525  0.2401  0.3111  0.3142
Mov_2 Dataset          0.4176  0.3739  0.3791  0.2953  0.2766  0.3705  0.3719
Ave.                   0.3416  0.3049  0.3061  0.2442  0.2346  0.2892  0.2977
Summing up, according to Tables 5 and 6, we can observe that: 1) in terms of the ARS measure, the ranking of the seven machine learning classification methods is SVM > NB > ANN > RF > DT > KNN > LR, where '>' denotes "better than"; 2) the ranking of the four feature selection methods is IG > GR > Relf > SU.
4.7. Comparison with SLK+SK, SLK+WEM and SK+WEM

In this section, we compare SLK+SK+WEM with SLK+SK, SLK+WEM and SK+WEM, where SLK+SK denotes the statistical and linguistic knowledge features combined with the sentiment knowledge-based features; SLK+WEM denotes the statistical and linguistic knowledge features combined with the word embedding-based feature; SK+WEM denotes the sentiment knowledge-based features combined with the word embedding-based feature; and SLK+SK+WEM denotes all three types of features. We use the SVM classifier for this comparison. In this experiment, our aim is to examine the effect of combining SK, SLK and WEM on the SOSML method. The ARS results of SLK+SK+WEM, SLK+SK, SLK+WEM and SK+WEM on the eight data subsets of the three datasets are reported in Table 7. Moreover, the average ARS value of each feature combination is computed, as presented in the last row of Table 7. From this table, it can be seen that the performance of SLK+SK+WEM (0.3859 ARS value) is better than that of SLK+SK, SLK+WEM and SK+WEM in terms of the ARS value. In other words, a SOSML method that learns from this unified feature set achieves better performance scores than one that learns from a feature subset. Based on these results, we used the combined features (SLK+SK+WEM) for the proposed method.
Table 7. Performance of the SOSML with various feature combinations: SLK+SK+WEM, SLK+SK, SLK+WEM and SK+WEM (ARS measures)

Dataset       SLK+SK+WEM  SLK+SK  SLK+WEM  SK+WEM
DUC 2001_1    0.3723      0.2929  0.1936   0.1989
DUC 2001_2    0.4332      0.3186  0.2008   0.2498
DUC 2001_3    0.3937      0.2872  0.2143   0.2146
DUC 2002_1    0.3962      0.2755  0.1838   0.2291
DUC 2002_2    0.3588      0.3015  0.2170   0.1736
DUC 2002_3    0.3860      0.2950  0.2000   0.1875
Mov_1         0.3760      0.2888  0.1984   0.2372
Mov_2         0.3710      0.3271  0.1789   0.2919
Ave.          0.3859      0.2983  0.1984   0.2228
4.8. Comparison with related methods

We evaluated the performance of our method by comparing it with a series of existing well-known methods on the aforementioned datasets. The methods compared are LSVRS (Khosla & Venkataraman, 2015), OMSHR (Raut & Londhe, 2014), TSAD (N. Yadav & Chatterjee, 2016) and CHOS (Balahur, et al., 2012). Table 8 shows the experimental comparison of the different approaches. We observe from the comparison results that the SOSML method outperforms the other methods. The SOSML obtained the best result (0.3859 ARS value) in comparison with OMSHR, the best existing approach, which has an ARS measure of 0.3312.
Table 8. The performance of the SOSML against other methods (ARS values)

Dataset       SOSML   OMSHR   CHOS    TSAD    LSVRS
DUC 2001_1    0.3723  0.3330  0.3149  0.2512  0.1900
DUC 2001_2    0.4332  0.3414  0.3240  0.2460  0.2002
DUC 2001_3    0.3937  0.3114  0.3030  0.2523  0.2021
DUC 2002_1    0.3962  0.3359  0.3269  0.2490  0.2180
DUC 2002_2    0.3588  0.3234  0.3165  0.2425  0.1843
DUC 2002_3    0.3860  0.3409  0.3219  0.2534  0.2009
Mov_1         0.3760  0.3324  0.2669  0.2299  0.1890
Mov_2         0.3710  0.3314  0.2990  0.2220  0.1990
Ave.          0.3859  0.3312  0.3091  0.2433  0.1979
Statistical significance test — We conducted a statistical test to compare the classification results of our method with the other methods. For this purpose, we created five groups corresponding to the five methods: 1) OMSHR; 2) LSVRS; 3) CHOS; 4) TSAD; 5) SOSML. Two groups (e.g., SOSML and TSAD) are compared at a time; each group includes the ARS values. Table 9 presents the P-values produced by Wilcoxon's signed-rank test for each pairwise comparison. We consider two hypotheses: H0 (null): there is no difference between the ARS values of the two groups; HA (alternative): there is a significant difference. As shown in Table 9, the P-values are much less than 0.05 (5% significance level). For instance, the test between SOSML and TSAD yields a P-value of 0.012, which is very small. The same result is also achieved for all the other methods. This is strong evidence to accept the alternative hypothesis and reject the null hypothesis.

Table 9. P-values produced by Wilcoxon's test comparing SOSML with the other methods
Metric: ARS measure (5% significance level)

Comparison         Hypothesis    P-value
SOSML vs. OMSHR    H0 rejected   0.012
SOSML vs. LSVRS    H0 rejected   0.011
SOSML vs. CHOS     H0 rejected   0.012
SOSML vs. TSAD     H0 rejected   0.012
Detailed comparison — We employ the relative improvement, Eq. (23), for the comparison between the SOSML and the other methods:

$Relative\ improvement = \dfrac{Result_{SOSML} - Result_{other}}{Result_{other}} \times 100$   (23)

Table 10 presents the results. In Table 10, "+" indicates that the SOSML improves on the corresponding method. Table 10 shows that, among the four methods (OMSHR, LSVRS, TSAD and CHOS), OMSHR obtains the best result. In comparison with OMSHR, the SOSML improves its performance by 16.52% in terms of the ARS measure.
Table 10. Detailed comparison between the SOSML and other approaches (SOSML improvement, %)

Metric         OMSHR    LSVRS    CHOS     TSAD
ARS measure    +16.52   +95.00   +24.85   +58.61
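The Table 10 figures follow directly from the Table 8 averages via Eq. (23):

# Eq. (23) applied to the average ARS values of Table 8
sosml = 0.3859
others = {"OMSHR": 0.3312, "LSVRS": 0.1979, "CHOS": 0.3091, "TSAD": 0.2433}
for name, result in others.items():
    print(name, round(100 * (sosml - result) / result, 2))
# OMSHR 16.52, LSVRS 95.0, CHOS 24.85, TSAD 58.61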
Discussion — From the experiments above, we obtained the following observations. The SOSML outperforms the other methods and obtained good performance. This is due to the following facts:

1) We combined several sentiment dictionaries (ten dictionaries) to overcome the word-coverage limit of individual lexicons, while the other methods (e.g., TSAD, OMSHR and LSVRS) do not use a combination of several sentiment lexicons, and CHOS combines only three sentiment dictionaries. Furthermore, in the case where a word does not appear in the lexicon (HCLr), SOSML uses the SSA method to compute the sentiment score of that word.

2) The SOSML also considers contextual polarity (e.g., negation, but-clause) and the types of sentences (e.g., subjective/objective and interrogative/conditional) in sentiment analysis, while the other methods do not consider contextual polarity and the types of sentences completely: CHOS, TSAD and LSVRS exclude negation and but-clause handling; OMSHR excludes but-clause handling; LSVRS excludes sentence types; and CHOS, TSAD and OMSHR exclude interrogative/conditional sentence handling. According to previous studies (A. Abdi, Shamsuddin, & Aliguliyev, 2018; Chen, et al., 2017), contextual polarity and sentence type identification can improve the performance of sentence-level sentiment analysis.

3) Unlike the other methods, SOSML can identify synonymous words among all sentences using the CWE method.

4) The vector representations of words using word2vec can extract the deep semantic relationships between words; hence the word embedding model augments the other features extracted using sentiment, statistical and linguistic knowledge (Table 7). The results of our experiments agree with the previous studies of Giatsoglou, et al. (2017) and Araque, Corcuera-Platas, Sánchez-Rada, and Iglesias (2017).

5) A hybrid vectorization process takes advantage of the sentiment knowledge-based, word embedding-based, and statistical and linguistic knowledge-based features. Based on this process, a hybrid vector is constructed to represent each sentence through the concatenation of the sentiment-based, word embedding-based, and statistical and linguistic knowledge-based feature vectors.

6) The SOSML deals with the redundancy and information-diversity issues (as does CHOS), whereas the other systems do not check for redundant information.

7) As a result, based on word2vec, sentiment, statistical and linguistic knowledge and the SVM algorithm, the SOSML method for multi-document sentiment-oriented summarization achieves encouraging performance.

5. Conclusion and future work

Opinion summarization aims to automatically generate a concise version of opinionated documents in a few words. In this paper, we proposed a machine learning-based method to produce sentiment-oriented extractive summaries of multiple documents, SOSML. The SOSML integrates several types of features: 1) sentiment knowledge, to calculate a sentence sentiment score as one of the features for sentence-level classification. SOSML merged
several sentiment lexicons to overcome the word-coverage limit. It also employed the SSA to determine the sentiment score of a word if it is not defined in the sentiment lexicon. Moreover, the SOSML combined multiple strategies to tackle the contextual polarity problem: the sentiment orientation of a word defined by a sentiment dictionary can be reversed, since the sentiment orientation of a word depends on the context in which it appears (e.g., negation handling and but-clause handling). Furthermore, SOSML also takes into account the type of sentence (e.g., subjective and objective sentences, interrogative and conditional sentences), since these can affect the performance of the method. 2) A word embedding model, to capture the meaning and semantic relationships among words and to extract a vector representation for each word. 3) Statistical and linguistic knowledge, to determine salient sentences.

We evaluated the SOSML over the DUC 2001 and DUC 2002 datasets and the Movie dataset. In order to determine the best feature selection method and supervised classification approach, we compared the performances of different feature selection techniques and machine learning approaches. Therefore, in this paper, four feature selection techniques (IG, GR, SU and Relf) and seven machine learning classification approaches (NB, SVM, KNN, ANN, RF, DT and LR) were investigated on the Movie Review Dataset and the DUC 2001 and DUC 2002 datasets. The experimental results show that the integration of an SVM-based sentiment classification method with Information Gain (IG) as the feature selection technique provides superior performance in terms of the ARS measure. The results also show that the proposed method improves the performance compared with the other existing methods. We proceeded with a comparison of SLK+SK+WEM, SLK+SK, SLK+WEM and SK+WEM; based on the results, the combination of SLK, SK and WEM provides good performance in terms of the ARS measure. Finally, we also compared our method with some recently developed systems; the results showed that the proposed method improved on their performance.

In future work, we plan to study in more depth the problems of comparative sentences, sarcastic sentences and negation handling. Furthermore, we also aim to evaluate the performance of SOSML on other datasets. In addition, we would like to compare our proposed method with DNN-based methods such as LSTM, RNN, RBM and CNN.

Acknowledgement
This work is supported by The Ministry of Higher Education (MOHE) under Q.J130000.21A2.03E53 - STATISTICAL MACHINE LEARNING METHODS TO TEXT SUMMARIZATIONS and 13H82 (INTELLIGENT PREDICTIVE ANALYTICS FOR RETAIL INDUSTRY). The authors would like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia (UTM) for its support in R&D, and the UTM Big Data Centre (BDC) for the inspiration in making this study a success. The authors would also like to thank the anonymous reviewers who have contributed enormously to this work.

References

Abdi, A., Idris, N., Alguliev, R. M., & Aliguliyev, R. M. (2015). Automatic summarization assessment through a combination of semantic and syntactic information for intelligent educational systems. Information Processing & Management, 51, 340-358.
Abdi, A., Idris, N., Alguliyev, R., & Aliguliyev, R. (2015a). Query-based multi-documents summarization using linguistic knowledge and content word expansion. Soft Computing, 1-17.
Abdi, A., Idris, N., Alguliyev, R. M., & Aliguliyev, R. M. (2015b). PDLK: Plagiarism detection using linguistic knowledge. Expert Systems with Applications, 42, 8936-8946.
Abdi, A., Idris, N., Alguliyev, R. M., & Aliguliyev, R. M. (2015c). Query-based multi-documents summarization using linguistic knowledge and content word expansion. Soft Computing, 1-17.
Abdi, A., Idris, N., Alguliyev, R. M., & Aliguliyev, R. M. (2016). An automated summarization assessment algorithm for identifying summarizing strategies. PLoS ONE, 11.
Abdi, A., Shamsuddin, S. M., & Aliguliyev, R. M. (2018). QMOS: Query-based multi-documents opinion-oriented summarization. Information Processing & Management, 54, 318-338.
Abdi, S. A., & Idris, N. (2014). An analysis on student-written summaries: Automatic assessment of summary writing. International Journal of Enhanced Research in Science Technology & Engineering, 3, 466-472.
Alguliyev, R. M., Aliguliyev, R. M., & Isazade, N. R. (2015a). An unsupervised approach to generating generic summaries of documents. Applied Soft Computing.
Alguliyev, R. M., Aliguliyev, R. M., & Isazade, N. R. (2015b). An unsupervised approach to generating generic summaries of documents. Applied Soft Computing, 34, 236-250.
Alonso, L. (2005). Representing discourse for automatic text summarization via shallow NLP techniques. Doctoral thesis. Barcelona: Universitat de Barcelona.
Amplayo, R. K., & Song, M. (2017). An adaptable fine-grained sentiment analysis for summarization of multiple short online reviews. Data & Knowledge Engineering.
Araque, O., Corcuera-Platas, I., Sánchez-Rada, J. F., & Iglesias, C. A. (2017). Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications, 77, 236-246.
Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC (Vol. 10, pp. 2200-2204).
Balahur, A., Kabadjov, M., Steinberger, J., Steinberger, R., & Montoyo, A. (2012). Challenges and solutions in the opinion summarization of user-generated content. Journal of Intelligent Information Systems, 39, 375-398.
Birjali, M., Beni-Hssane, A., & Erritali, M. (2017). Machine learning and semantic sentiment analysis based algorithms for suicide sentiment prediction in social networks. Procedia Computer Science, 113, 65-72.
Cambria, E., Poria, S., Bajpai, R., & Schuller, B. W. (2016). SenticNet 4: A semantic resource for sentiment analysis based on conceptual primitives. In COLING (pp. 2666-2677).
Cerini, S., Compagnoni, V., Demontis, A., Formentelli, M., & Gandini, G. (2007). Micro-WNOp: A gold standard for the evaluation of automatically compiled lexical resources for opinion mining. Language resources and linguistic theory: Typology, second language acquisition, English linguistics, 200-210.
Chatterjee, N., & Sahoo, P. K. (2015). Random indexing and modified random indexing based approach for extractive text summarization. Computer Speech & Language, 29, 32-44.
Chen, T., Xu, R., He, Y., & Wang, X. (2017). Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 72, 221-230.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
Deshwal, A., & Sharma, S. K. (2016). Twitter sentiment analysis using various classification algorithms. In Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2016 5th International Conference on (pp. 251-257): IEEE.
Esuli, A., & Sebastiani, F. (2007). SentiWordNet: A high-coverage lexical resource for opinion mining. Evaluation, 1-26.
Fang, C., Mu, D., Deng, Z., & Wu, Z. (2017). Word-sentence co-ranking for automatic extractive text summarization. Expert Systems with Applications, 72, 189-195.
Ferreira, R., de Souza Cabral, L., Freitas, F., Lins, R. D., de França Silva, G., Simske, S. J., & Favaro, L. (2014). A multi-document summarization system based on statistics and linguistic treatment. Expert Systems with Applications, 41, 5780-5787.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378.
Fraser, B. (1999). What are discourse markers? Journal of Pragmatics, 31, 931-952.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1): Springer Series in Statistics, New York.
Gambhir, M., & Gupta, V. (2017). Recent automatic text summarization techniques: A survey. Artificial Intelligence Review, 47, 1-66.
Gao, W., & Selmic, R. R. (2006). Neural network control of a class of nonlinear systems with actuator saturation. IEEE Transactions on Neural Networks, 17, 147-156.
Giatsoglou, M., Vozalis, M. G., Diamantaras, K., Vakali, A., Sarigiannidis, G., & Chatzisavvas, K. C. (2017). Sentiment analysis leveraging emotions and word embeddings. Expert Systems with Applications, 69, 214-224.
Gupta, P., Tiwari, R., & Robert, N. (2016). Sentiment analysis and text summarization of online reviews: A survey. In Communication and Signal Processing (ICCSP), 2016 International Conference on (pp. 0241-0245): IEEE.
Gupta, V., & Lehal, G. S. (2010). A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, 2.
Habernal, I., Ptáček, T., & Steinberger, J. (2014). Supervised sentiment analysis in Czech social media. Information Processing & Management, 50, 693-707.
Hall, M. A., & Smith, L. A. (1998). Practical feature subset selection for machine learning.
Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398): John Wiley & Sons.
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 168-177): ACM.
Hu, Y.-H., Chen, Y.-L., & Chou, H.-L. (2017). Opinion mining from online hotel reviews - A text summarization approach. Information Processing & Management, 53, 436-449.
Hung, C., & Chen, S.-J. (2016). Word sense disambiguation based sentiment lexicons for sentiment classification. Knowledge-Based Systems, 110, 224-232.
Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11, 37-50.
Khan, F. H., Qamar, U., & Bashir, S. (2016a). A semi-supervised approach to sentiment analysis using revised sentiment strength based on SentiWordNet. Knowledge and Information Systems, 1-22.
Khan, F. H., Qamar, U., & Bashir, S. (2016b). SWIMS: Semi-supervised subjective feature weighting and intelligent model selection for sentiment analysis. Knowledge-Based Systems, 100, 97-111.
Khosla, N., & Venkataraman, V. (2015). Learning sentence vector representations to summarize Yelp reviews.
Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm. In AAAI (Vol. 2, pp. 129-134).
Knott, A. (1996). A data-driven methodology for motivating a set of coherence relations.
Kolchyna, O., Souza, T. T., Treleaven, P., & Aste, T. (2015). Twitter sentiment analysis: Lexicon method, machine learning method and their combination. arXiv preprint arXiv:1507.00955.
Kupiec, J., Pedersen, J., & Chen, F. (1995). A trainable document summarizer. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 68-73): ACM.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 159-174.
Lee, S., & Choeh, J. Y. (2014). Predicting the helpfulness of online reviews using multilayer perceptron neural networks. Expert Systems with Applications, 41, 3041-3046.
Li, Q., Jin, Z., Wang, C., & Zeng, D. D. (2016). Mining opinion summarizations using convolutional neural networks in Chinese microblogging systems. Knowledge-Based Systems, 107, 289-300.
Li, S., Lee, S. Y. M., Chen, Y., Huang, C.-R., & Zhou, G. (2010). Sentiment classification and polarity shifting. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 635-643): Association for Computational Linguistics.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop (pp. 74-81).
Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5, 1-167.
Lloret, E. (2012). Text summarisation based on human language technologies and its applications. Procesamiento del Lenguaje Natural, 48, 119-122.
Lloret, E., Balahur, A., Gómez, J. M., Montoyo, A., & Palomar, M. (2012). Towards a unified framework for opinion retrieval, mining and summarization. Journal of Intelligent Information Systems, 39, 711-747.
Lloret, E., Boldrini, E., Vodolazova, T., Martínez-Barco, P., Muñoz, R., & Palomar, M. (2015). A novel concept-level approach for ultra-concise opinion summarization. Expert Systems with Applications, 42, 7148-7156.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing (Vol. 999): MIT Press.
Mendoza, M., Bonilla, S., Noguera, C., Cobos, C., & León, E. (2014). Extractive single-document summarization based on genetic operators and guided local search. Expert Systems with Applications, 41, 4158-4169.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Miller, G. A., & Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6, 1-28.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill, New York, NY, USA.
Mohammad, S. M., Kiritchenko, S., & Zhu, X. (2013). NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242.
Narayanan, R., Liu, B., & Choudhary, A. (2009). Sentiment analysis of conditional sentences. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 (pp. 180-189): Association for Computational Linguistics.
Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining for market prediction: A systematic review. Expert Systems with Applications, 41, 7653-7670.
Nenkova, A., Vanderwende, L., & McKeown, K. (2006). A compositional context sensitive multi-document summarizer: Exploring the factors that influence summarization. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 573-580): ACM.
Neto, J. L., Freitas, A. A., & Kaestner, C. A. (2002). Automatic text summarization using a machine learning approach. In Brazilian Symposium on Artificial Intelligence (pp. 205-215): Springer.
Ng, W. W., Yeung, D. S., Firth, M., Tsang, E. C., & Wang, X.-Z. (2008). Feature selection using localized generalization error for supervised classification problems using RBFNN. Pattern Recognition, 41, 3706-3719.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.
Ofek, N., Poria, S., Rokach, L., Cambria, E., Hussain, A., & Shabtai, A. (2016). Unsupervised commonsense knowledge enrichment for domain-specific sentiment analysis. Cognitive Computation, 8, 467-477.
Ouyang, Y., Li, W., Li, S., & Lu, Q. (2011). Applying regression models to query-focused multi-document summarization. Information Processing & Management, 47, 227-237.
Pang, B., & Lee, L. (2008). Using very simple statistics for review search: An exploration. In COLING (Posters) (pp. 75-78).
Parlar, T., & Özel, S. A. (2016). A new feature selection method for sentiment analysis of Turkish reviews. In Innovations in Intelligent Systems and Applications (INISTA), 2016 International Symposium on (pp. 1-6): IEEE.
Polanyi, L., & Zaenen, A. (2006). Contextual valence shifters. In Computing attitude and affect in text: Theory and applications (pp. 1-10): Springer.
Qazvinian, V., Hassanabadi, L. S., & Halavati, R. (2008). Summarising text with a genetic algorithm-based sentence extraction. International Journal of Knowledge Management Studies, 2, 426-444.
Quinlan, J. R. (2014). C4.5: Programs for machine learning: Elsevier.
Rana, T. A., & Cheah, Y.-N. (2016). Aspect extraction in sentiment analysis: Comparative analysis and survey. Artificial Intelligence Review, 46, 459-483.
Raut, V. B., & Londhe, D. (2014). Opinion mining and summarization of hotel reviews. In Computational Intelligence and Communication Networks (CICN), 2014 International Conference on (pp. 556-559): IEEE.
Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks: Citeseer.
Riloff, E., & Wiebe, J. (2003). Learning extraction patterns for subjective expressions. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (pp. 105-112): Association for Computational Linguistics.
Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53, 23-69.
Saggion, H. (2014). Creating summarization systems with SUMMA. In LREC (pp. 4157-4163).
Salzberg, S. L. (1994). C4.5: Programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Machine Learning, 16, 235-240.
Shahana, P., & Omman, B. (2015). Evaluation of features on sentimental analysis. Procedia Computer Science, 46, 1585-1592.
Shareghi, E., & Hassanabadi, L. S. (2008). Text summarization with harmony search algorithm-based sentence extraction. In Proceedings of the 5th international conference on Soft computing as transdisciplinary science and technology (pp. 226-231): ACM.
Sharma, A., & Dey, S. (2012). Performance investigation of feature selection methods and sentiment lexicons for sentiment analysis. IJCA Special Issue on Advanced Computing and Communication Technologies for HPC Applications, 3, 15-20.
Statistics, L. (2015). Wilcoxon signed-rank test using SPSS Statistics. In Statistical tutorials and software guides.
Stone, P. J., & Hunt, E. B. (1963). A computer approach to content analysis: Studies using the General Inquirer system. In Proceedings of the May 21-23, 1963, Spring Joint Computer Conference (pp. 241-256): ACM.
Strapparava, C., & Valitutti, A. (2004). WordNet-Affect: An affective extension of WordNet. In LREC (Vol. 4, pp. 1083-1086): Citeseer.
Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-based methods for sentiment analysis. Computational Linguistics, 37, 267-307.
Tayal, M. A., Raghuwanshi, M. M., & Malik, L. G. (2017). ATSSC: Development of an approach based on soft computing for text summarization. Computer Speech & Language, 41, 214-235.
Teufel, S., & Moens, M. (1997). Sentence extraction as a classification task. In Proceedings of the ACL (Vol. 97, pp. 58-65).
Tsuruoka, Y., & Tsujii, J. (2005). Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 467-474): Association for Computational Linguistics.
Vani, K., & Gupta, D. (2014). Using K-means cluster based techniques in external plagiarism detection. In Contemporary Computing and Informatics (IC3I), 2014 International Conference on (pp. 1268-1273): IEEE.
Wang, G., Zhang, Z., Sun, J., Yang, S., & Larson, C. A. (2015). POS-RS: A random subspace method for sentiment classification based on part-of-speech analysis. Information Processing & Management, 51, 458-479.
Wu, Y., & Wen, M. (2010). Disambiguating dynamic sentiment ambiguous adjectives. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1191-1199): Association for Computational Linguistics.
Xia, R., Xu, F., Yu, J., Qi, Y., & Cambria, E. (2016). Polarity shift detection, elimination and ensemble: A three-stage model for document-level sentiment analysis. Information Processing & Management, 52, 36-45.
Xie, S., Liu, Y., & Lin, H. (2008). Evaluating the effectiveness of features and sampling in extractive meeting summarization. In Spoken Language Technology Workshop, 2008. SLT 2008. IEEE (pp. 157-160): IEEE.
Yadav, C. S., & Sharan, A. (2015). Hybrid approach for single text document summarization using statistical and sentiment features. International Journal of Information Retrieval Research (IJIRR), 5, 46-70.
Yadav, N., & Chatterjee, N. (2016). Text summarization using sentiment analysis for DUC data. In Information Technology (ICIT), 2016 International Conference on (pp. 229-234): IEEE.
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 42-49): ACM.
Zhang, J., Sun, L., & Zhou, Q. (2005). A cue-based hub-authority approach for multi-document text summarization. In Natural Language Processing and Knowledge Engineering, 2005. IEEE NLP-KE'05. Proceedings of 2005 IEEE International Conference on (pp. 642-645): IEEE.
Zhong, S.-h., Liu, Y., Li, B., & Long, J. (2015). Query-oriented unsupervised multi-document summarization via deep learning model. Expert Systems with Applications, 42, 8146-8155.