Using Chinese radical parts for sentiment analysis and domain-dependent seed set extraction




JID: YCSLA

ARTICLE IN PRESS

[m3+;August 8, 2017;14:02]

Available online at www.sciencedirect.com

Computer Speech & Language xxx (2017) xxx-xxx www.elsevier.com/locate/csl

August F.Y. Chao, Heng-Li Yang*


Department of Management Information Systems, National Cheng-Chi University, 64, Sec. 2, Chihnan Road, Wenshan District, Taipei, Taiwan

Received 3 November 2015; received in revised form 19 June 2017; accepted 24 July 2017
Available online xxx

Abstract

Although English sentiment analysis and its resources have progressed well, methods developed for English cannot be applied directly to Chinese owing to the nature of the Chinese language. Previous studies suggested adopting linguistic information, such as grammar and morpheme information, to assist in sentiment analysis of Chinese text. However, morpheme-based approaches have difficulty identifying seeds. In addition, these methods do not take advantage of the radicals in the characters, which carry a great deal of semantic information. A Chinese word is composed of one or more characters, each of which has its radical part, and part of a character's meaning can be interpreted from the meaning of its radical. Therefore, we not only consider the radical as the semantic root of a character, but also consider the radical parts of the characters in a word as an appropriate linguistic unit for conducting sentiment analysis. In this study, we conducted a series of experiments using radicals as the feature unit in sentiment analysis. Using the segmented results from part-of-speech tools as the meaningful linguistic unit (word) in Chinese, we analyzed single-feature words (unigrams) and two frequently co-occurring words (bigrams collocated by pointwise mutual information) with various sentiment analysis measures. We conclude that radical features can work better than word features while consuming less computing memory and time. An extended study of seed extraction was also conducted, and the results indicated that 50 seed radical features performed well. A cross-corpus comparison further demonstrated that using the 50 extracted radical features as domain-dependent keywords worked better than other sentiment analysis strategies. This study confirmed that radical information can be adopted as a feature unit in sentiment analysis and that domain-dependent radicals can be reused across corpora.
© 2017 Elsevier Ltd. All rights reserved.

Keywords: Sentiment analysis; Chinese radical; Restaurant review analysis; Domain-dependent seed

This paper has been recommended for acceptance by S. Narayanan.
* Corresponding author. E-mail address: [email protected] (A.F.Y. Chao), [email protected] (H.-L. Yang).

http://dx.doi.org/10.1016/j.csl.2017.07.007
0885-2308/© 2017 Elsevier Ltd. All rights reserved.

Please cite this article as: A. Chao, H. Yang, Using Chinese radical parts for sentiment analysis and domain-dependent seed set extraction, Computer Speech & Language (2017), http://dx.doi.org/10.1016/j.csl.2017.07.007

1. Introduction

Reviews are central to almost all human activities and are key influencers of our behavior (Liu, 2012). From the perspective of business owners, monitoring online comments has become an important marketing strategy for understanding customers. From the perspective of consumers, it is crucial to be able to gain an integrated and comprehensive understanding of available services and products. Thus, we need sentiment analysis of reviews when inundated with a huge array of comments on the Internet. However, sentiment analysis faces many challenges. First, user-generated content is presented in an unstructured format and contains more details than Likert-style survey responses (Pan et al., 2007); therefore, it is difficult to establish a fixed model for these different types of content. Second, before formulating review analysis patterns, we require knowledge of the different aspects of a product that might concern customers. For example, in restaurant reviews, the amount of time spent waiting and queuing is relevant to consumers. Language resources used for sentiment analysis are domain dependent (Pang and Lee, 2008); therefore, it is difficult to create a universal sentiment lexicon for general purposes. People use their native language to describe their experiences and express their sentiments; thus, comprehending the semantic meaning of written reviews requires extensive natural language processing (NLP) and additional dependent language resources. These challenges have driven the rapid development of sentiment analysis and NLP studies in recent years, and many language resources have been introduced to facilitate comprehension of sentiments in written texts.

Unfortunately, not all language resources and techniques can be directly adapted to the Chinese language environment for text mining (Yang and Chao, 2014; Zheng et al., 2015). Chinese is a non-space-delimited, polysyllabic, and sequentially interpreted language (Chao, 1965) with its own characters, lexicons, and unique sentence grammar, and these characteristics present language-dependent problems in the development of sentiment analysis.
To seek a native approach for analyzing Chinese reviews, researchers have used linguistic information, such as morphological words (Ku et al., 2009; Liu, 2010; Lu et al., 2010; Fu and Wang, 2010; Zhang et al., 2012; Yang and Chao, 2014), as leverage to uncover the semantic meaning of reviews.

Chinese morphological words, which can consist of more than one character, have easily observable semantic meanings and can be used as a guideline for classifying words or doing other in-depth analysis (Ku et al., 2009). Several studies have shown that morphological information can assist in the identification of sentiments at the lexicon level (Ku et al., 2006; Liu et al., 2010), aspect-sentiment level (Zhang et al., 2012), sentence level (Ku et al., 2009; Fu and Wang, 2010), and document level (Yang and Chao, 2014). However, in morpheme-based approaches, coming up with the initial morpheme words (i.e., seeds) is a challenging task that requires domain experts.

In addition to morphological information, it can be observed that Chinese morpheme characters are logograms and that almost 90% of Chinese characters can be disassembled into semantic radicals (“ ”) and phonetic radicals (“ ”) (Li and Kang, 1993). These semantic symbols can generally be used as a radical index for organizing characters, and they can reveal the basic concepts mixed within a single character (Huang et al., 2008). This observation motivates us to use radical information to relieve the burden of locating morpheme words in the morpheme-based approach. We consider the radical parts (i.e., a radical form of two or more feature words) to be the basic concept compositions, which can be used to identify sentiment features at the linguistic conceptual level.
For example, the words “ ” (breakfast) and “ ” (dinner) can be considered as a combination of the radical “ ” (day, time) and the radical “ ” (food), and can be comprehended as “food for a specific time.” As another example, words with a negative meaning, e.g., “ ” (angry) and “ ” (lazy), can be considered as a combination of the radical “ ” (heart) and the radical “ ” (heart), and can be comprehended as “feeling from the heart.”

To our knowledge, prior research has not formally investigated the use of radical information to facilitate sentiment analysis. The purpose of this study was to demonstrate the superiority of applying radical information. Since previous studies have shown that morpheme-based sentiment analysis approaches can be more effective than text-retrieval and keyword-based approaches (Jang and Shin, 2010; Yang and Chao, 2014), this study specifically tries to demonstrate the advantages of the radical-based approach over the morpheme word-based approach. We tried two cases of comparison: single-feature words (unigrams) and two frequently seen words (i.e., pointwise mutual information collocated bigrams). In the case of bigrams, the morpheme roots were applied. In both cases, we compared a word-based approach with its corresponding radical-based approach. Finally, we propose that the extracted radical features be used as domain-dependent keywords to analyze similar domain reviews from different sources. In all of the above cases, we also made a subsidiary comparison with a keyword-based approach. Since our collected review corpora are in traditional Chinese, language resources in simplified Chinese (e.g., HowNet) are not suitable for this study because of cultural differences and because the same character can have different radicals. For example, the word “ ” (phoenix) with the radical “ ” (bird) in traditional Chinese becomes the word “ ” with the radical “ ” (small table) in simplified Chinese.
There are some resources for traditional Chinese, e.g., Chinese WordNet and e-HowNet (Extended-HowNet), that provide more semantics based on ontology, e.g., entity-relation and upper-lower hierarchy. However, they do not give the polarity classification of words. Since the NTUSD (National Taiwan University Sentiment Dictionary) (Ku et al., 2006) is a predefined positive/negative wordlist that has been most commonly applied for text mining in the traditional Chinese environment, we chose it for this study. The datasets used in this study were gathered from two different restaurant review websites: one provided four rating dimensions (i.e., overall, taste, service, and environment rankings), and the other provided only the overall ranking.

Before continuing to the following sections, to avoid potential misunderstanding owing to Chinese language characteristics, we summarize some terms used in this study in Table 1.

Table 1
Terms used in this study.

Term | Explanations
English word | An English word is composed of one or more English letters from A to Z.
Chinese word | The Chinese language is polysyllabic. A Chinese word is composed of one or more non-spaced Chinese characters, each of which could also be a Chinese word itself if appearing alone. For example, the Chinese word “ ” (money) contains two Chinese characters: “ ” (gold) and “ ” (money). In this example, each single character is also a Chinese word itself if appearing alone. Other examples might not be. Longer words are also possible. For example, both “ ” (Hong Kong-style barbecue) and “ ” (patty with turnip shreds) contain four Chinese characters. As a limitation, this study considered only words containing fewer than five characters.
Chinese radical | In a Chinese dictionary, Chinese radicals are used as indices. Each Chinese character is listed under one Chinese radical. A Chinese radical is a graphical component of a Chinese character, which is often a semantic indicator or a phonetic component of the character.
Collocation | In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. In text mining, we tried to identify the corresponding compounds that are used to modify features. To find the corresponding collocations of features, we limited the window size to ±5 to select feature compounds, if there were no stop words or end punctuation found within this range.
Unigram | In the text-mining part of this study, the “unigram” approach considers only a single feature, which could be a single Chinese word (word-based) or its corresponding Chinese radical(s) (radical-based). Since a Chinese word may contain more than one character, a unigram may consider one character (e.g., “ ” (vegetables) or “ ” in radical) or several characters (e.g., “ ” (barbecue in Hong Kong style) or “ ” in radical).
Bigram | In the text-mining part of this study, the “bigram” approach looks for two frequently seen words (word-based, e.g., “ ” and “ ”) or their corresponding Chinese radicals (radical-based, e.g., “ ” and “ ”).

The rest of this paper is structured as follows. In Section 2, we discuss the theoretical foundation of this research and review the relevant literature. In Section 3, we propose our approach and give a general overview of the experiment design. In Section 4, we present the experiment results. Finally, in Section 5, we give the conclusions and suggestions for further research.


2. Literature review

In this section, we describe the relevant sentiment analysis studies, including Chinese morpheme-based sentiment analysis and related language resources. We also explain the Chinese radicals, as well as how they can be used in information retrieval.

2.1. Sentiment analysis

Sentiment analysis differs from text mining in that it explores semantically positive or negative sentiment in texts. These sentiments are generally represented by words that describe feelings, e.g., happy, enjoyable, sad, and sorry. When applying sentiment analysis to reviews of products or services, the sets of sentiment words differ depending on the contexts of the review authors and products. Liu (2012) defined five elements of sentiment analysis: entity, aspect, sentiment, opinion holder, and time. In most cases, it is plausible to assume that time stays fixed while processing a selected dataset. Sentiment analysis then summarizes the reviews of the review holders, which contain a sentiment toward an aspect of an entity. Because sentiment analysis involves uncovering the semantic meaning of textual data, extensive NLP must take place before conducting the actual analysis.

For understanding the sentiment in a review, predefined language resources are adopted to assist in explaining the semantic meaning. These resources include ANEW (Affective Norms for English Words) (Bradley and Lang, 1999), General Inquirer (Stone and Hunt, 1963), and WordNet-Affect (Strapparava and Valitutti, 2004). These linguistic resources are categorized into three groups (positive, negative, and neutral) and are used to explain the words that appear in a review through pattern-matching or statistical methods. One of the statistical methods is PMI (Pointwise Mutual Information) (Church and Hanks, 1990; Turney, 2002), which compares the probability that a pair of words co-occurs with the probability expected if the two words occurred independently. This method is defined as follows:

\[
\mathrm{PMI}(\mathit{word}_1, \mathit{word}_2) = \log_2 \left[ \frac{p(\mathit{word}_1 \,\&\, \mathit{word}_2)}{p(\mathit{word}_1)\, p(\mathit{word}_2)} \right]
\]
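As a minimal sketch of the PMI definition above, the following Python snippet estimates the probabilities from document frequencies in a toy corpus. The function name `pmi_table`, the toy documents, and the choice to treat "co-occurrence" as "appearing in the same document" are assumptions made here for illustration; they are not taken from the cited studies.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return {(w1, w2): PMI} for unordered word pairs that co-occur in a document.

    Assumption: p(w) and p(w1 & w2) are estimated as the fraction of
    documents containing the word or the pair, respectively.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for doc in documents:
        vocab = sorted(set(doc))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return {
        (w1, w2): math.log2(
            (n / n_docs) / ((word_df[w1] / n_docs) * (word_df[w2] / n_docs))
        )
        for (w1, w2), n in pair_df.items()
    }


# Hypothetical, already-segmented toy reviews.
docs = [["price", "high"], ["price", "high"], ["service", "slow"], ["price", "low"]]
scores = pmi_table(docs)
print(scores[("service", "slow")])  # pair that always appears together
print(scores[("high", "price")])    # pair whose members also appear apart
```

Pairs whose words appear only together receive a higher score than pairs whose words also occur separately, which is the behaviour the definition is meant to capture.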

PMI can also be used to search for idioms and common phrases when observing through a smaller window size; when used within a larger window size, it can highlight semantic concepts and other significant relationships (Church and Hanks, 1990). For example, it can explore the specific features of products (Zhang et al., 2011) and the sentiment toward a specific feature (Su et al., 2008; Zhang et al., 2012). Pattern-matching methods, in turn, require predefined lexicon and grammar details to establish patterns, and advanced natural language tools must be used to parse sentences into dependency trees, which makes it possible to identify sentiment features. Turney (2002) used a combination of adjectives (JJ), adverbs (RB), verbs (VB), and nouns (NN) to conduct a sentiment analysis of movie reviews, and Nakagawa et al. (2010) used dependency trees to extract the sentiment at the sub-sentence level.

Various machine learning techniques have also been applied to sentiment analysis. Zhang et al. (2009) applied three popular supervised learning algorithms: support vector machines (SVM), naïve Bayes, and decision trees. These three algorithms have performed well in many classification applications. Other algorithms, such as conditional random field (CRF) (Nakagawa et al., 2010), artificial neural network (ANN) (Ghiassi et al., 2013), and the co-training algorithm (Wan, 2009; Balahur and Turchi, 2013), have also been used in recent studies. When the review corpora are unlabeled, unsupervised learning algorithms (e.g., Hu et al., 2013; Zhu et al., 2014) are suitable. However, these unsupervised learning algorithms rely on complex equations, and users cannot easily obtain the extracted product/service features.

2.2. Chinese sentiment analysis

Chinese is a non-space-delimited and sequentially interpreted language (Chao, 1965), and modern Chinese words can generally have one to six ideographic meanings (Wu and Tseng, 1993). Therefore, before a sentiment analysis can be conducted, sentences have to be segmented to a proper and meaningful level of granularity for analysis. Language resources for Chinese (especially traditional Chinese) sentiment analysis are not plentiful (Wan, 2009). The existing annotated sentiment wordlists, e.g., NTUSD (Ku et al., 2006) and HowNet (Dong and Dong, 2006), are both manually tagged and contain 13,160 and 11,000 positive/negative word senses, respectively. Although it is not necessary to use the sentiment meanings of all Chinese words to conduct an analysis, both annotated sentiment wordlists are relatively insufficient compared to the words in a dictionary.¹ Their sizes are inadequate, and owing to the corpus used for annotation, the words in the wordlists might have different sentiment polarities in different contexts. Furthermore, Chinese language environments can be separated into traditional environments (Taiwan, Singapore, and Hong Kong) and the simplified environment. Some words are semantic false friends (they differ significantly in meaning) owing to cultural differences (Hong and Huang, 2013). It is important to use suitable language processing tools while conducting NLP.

In Chinese, a word is composed of one or more characters, and its meaning can be interpreted in terms of the composite characters (Ku et al., 2009). Chinese has multisyllabic morphemes adopted from foreign languages, and the meaning of a Chinese word consists of a mixture of the concepts embedded in the morphemes that are present.
For example, we can find the character “ ” (excellent) in positive sentiment words like “ ” (excellent) and “ ” (outstanding) (Lu et al., 2010), as well as the character “ ” (price) in aspect words like “ ” (price) and “ ” (cost) (Zhang et al., 2012).

Ku et al. (2007) suggested that the meaning of Chinese sentiment words is a function of the composite of two or more Chinese characters. This is exactly how people read an ideogram when they encounter a new word. The bag-of-character (BOC) method, which counts the existence of characters that were not accumulated, can also be used to estimate the sentiment degree of words by averaging the observation probabilities of the characters occurring in positive and negative seed words. Ku et al. (2009) used eight morphological structures and the BOC method to classify word sentiment polarity and showed that morphological information improves detection performance when compared to the BOC method. Liu et al. (2010) continued to investigate the BOC method while also proposing a novel model that integrates the BOC method and label propagation into a constructed word graph to classify Chinese words. Fu and Wang (2010) extended the BOC method by creating out-of-vocabulary (OOV) polarity and used a fuzzy-set algorithm to estimate sentiment polarity at the sentence level.

Some studies have also applied morpheme-based feature selection approaches to sentiment analysis. Zhang et al. (2012) argued that aspect words, e.g., product features, containing certain morphemes can be considered to be similar product features; for example, the morpheme “ ” (price) can be used to compose price-related aspect words like “ ” (price), “ ” (cost), “ ” (special offer), and “ ” (high price). Zhang et al. (2012) considered aspect-sentiment as a sentiment-determining tuple that can be extracted by determining the co-occurrences that appear statistically in texts. Yang and Chao (2014) also conducted a study showing that eight morpheme-rooted aspects and collocations can be directly leveraged to construct a sentiment classification. Their results showed that the morpheme-based sentiment analysis approach outperformed text-retrieval and keyword-based approaches.

¹ According to the MOE Revised Mandarin Chinese Dictionary ( ), Taiwan, this dictionary contains 166,176 words.
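The bag-of-character idea discussed in this section can be illustrated with a short Python sketch. It is a simplification written for this text, not the exact formula of Ku et al.: each character receives a score from its relative frequency in positive versus negative seed words, and a word's score is the average over its known characters. The seed words, the function names `char_scores` and `word_score`, and the scoring formula are all assumptions chosen for illustration.

```python
from collections import Counter


def char_scores(pos_seeds, neg_seeds):
    """Score each character in [-1, 1] from its share of positive vs. negative seeds."""
    pos = Counter(ch for word in pos_seeds for ch in word)
    neg = Counter(ch for word in neg_seeds for ch in word)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both seed lists must contain at least one character")
    scores = {}
    for ch in set(pos) | set(neg):
        p = pos[ch] / n_pos  # relative frequency among positive-seed characters
        n = neg[ch] / n_neg  # relative frequency among negative-seed characters
        scores[ch] = (p - n) / (p + n)
    return scores


def word_score(word, scores):
    """Average the scores of the characters we have evidence for; 0.0 if none."""
    known = [scores[ch] for ch in word if ch in scores]
    return sum(known) / len(known) if known else 0.0


# Hypothetical seed words (not the seeds used in any cited study).
scores = char_scores(["優秀", "優良"], ["惡劣", "劣質"])
print(word_score("優", scores))    # character seen only in positive seeds
print(word_score("優質", scores))  # mixes a positive and a negative character
```

The sketch also shows why seed choice matters: a word built from one positive-seed and one negative-seed character averages out to a neutral score, so the quality of the estimate depends directly on the seeds.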


2.3. Chinese radicals

Chinese characters are ideographic, and almost 90% of Chinese characters can be disassembled into several semantic radicals (“ ”) and phonetic radicals (“ ”) (Li and Kang, 1993). Because semantic radicals serve as a processing unit in character recognition (Feldman and Siok, 1999), a character not only possesses its own semantic meaning, but also has a semantic relationship with referencing radicals. For example, the character “ ” is a semantic radical symbol for grass, and characters categorized under the “ ” radical, like “ ” (stem), “ ” (sprout), and “ ” (seedling), can be identified by the semantic radical symbol “ ,” which is located at the top of the characters in the classification hierarchy. The semantic meaning of grass can be derived from this radical. However, since mainland China has promoted the use of simplified Chinese in printing, the radical parts of many Chinese characters have been changed. For example, the abovementioned “ ” (radish, an edible vegetable root) in traditional Chinese becomes “ ” in simplified Chinese, and its radical parts “ ” change into “ ”; the grass semantic radical of the second character is missing. Therefore, the language resources in simplified Chinese are not suitable for this study.

Radicals can also be used to fashion an organizing index for Chinese characters. In 100 A.D., Shu Shen “ ” used 540 radicals to index characters with explanations of the semantic relationships between a character and a radical. Although modern indices have reduced the number of radicals to 214, the semantic relationship between a character and a radical can still be explained at the conceptual level. Chou (2005) used expert interpretations to create a cutting-edge ontology based on Chinese radicals and connected this radical ontology with the upper ontology and WordNet.

The semantic relationships between characters and radicals can also be processed using a statistical approach, rather than relying on the opinions of experts. Chao and Chung (2011) conducted a study on the semantic relationships between words and their radicals as determined by a dictionary and showed that words categorized under the same radical also overlap conceptually when subjected to a deeper level of evaluation.

3. The proposed approach and experiment preparation

In this section, we give an overview of the proposed approach and describe the selected dataset for preparing the sentiment experiments.

3.1. Radical form in Chinese

As stated earlier, the meaningful linguistic unit, i.e., the word, in Chinese is different from that in English. A Chinese word, e.g., “ ” (money), could contain several Chinese characters, e.g., “ ” (gold) and “ ” (money), each of which could also be a Chinese word itself if appearing alone. Studies (e.g., Yang and Chao, 2014) have shown that applying linguistic information such as morphemes to sentiment analysis can lead to improved performance and overcome the scarcity of linguistic resources that specifically plagues Chinese sentiment analysis. Further, the semantic meaning of Chinese words can be deduced from the composed characters, and a character has a semantic relationship with the radicals within the character. Therefore, a radical combination can be considered to be a reliable linguistic unit for the identification of aspect words (product feature words) or sentiment words. In this case, we consider the radical parts (the radical form of one or several Chinese words) to be the basic concept compositions that can be adopted to identify sentiment features at the linguistic conceptual level. This study claims that the radical-based method can capture the root concept underlying reviews depending on the given review corpus and that it differs from the word-based approach because it requires no seed morphemes. For example, considering that many aspects (feature words) related to dishes are made with the Chinese character “ ” (vegetables), we searched restaurant reviews for possible candidates. The results are shown in Table 2.

Table 2
Example of morpheme-based and radical-based feature selection.

In Table 2, the word-based method searches for words composed with the morpheme “ ,” so that we can observe the use of “ ” in every possible candidate. The radical-based method, however, returns more candidates and concepts than the word-based method. Each radical-based candidate can also be interpreted according to the radical meaning. For example, in Table 2, words matching the radical parts “ ” are herbaceous plants, whereas words matching the radical parts “ ” are woody plants. The radical parts “ ” can be interpreted as something related to “ ” (water), “ ” (wet), or “ ” (overseas); the radical parts “ ” can be interpreted as a postfix pattern of list (“ ”) or a postfix pattern of taste (“ ”) because the radical form of both “ ” and “ ” is “ ”. However, the possible combinations of radical parts are limited by the corpus; for example, in restaurant reviews, we seldom find radical parts like “ ” (pain) or “ ” (germ), where the radical form is “ .”

It is clear that the radical-based method can incorporate a wider variety of words and meanings than the word-based approach, although the combinations found by radical-based methods depend on the corpus. Identifying the proper root morphemes within reviews is a challenge for the word-based approach because it is difficult to guess the mindset of general consumers. Adopting radical information in sentiment analysis could capture a higher conceptual level of Chinese words and, thus, may relieve the problem of root seeds.
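To make the notion of a radical form concrete, the sketch below maps each character of a word to its dictionary radical and groups words that share the same radical sequence. The tiny lookup table `RADICAL_OF`, the example words, and the rule that unknown characters are left unchanged are assumptions for illustration only; a real system would load a complete radical index, and the words here are not necessarily those used in the paper's tables.

```python
from collections import defaultdict

# Illustrative character -> radical lookup (hypothetical and deliberately tiny).
RADICAL_OF = {
    "早": "日", "晚": "日",  # day/time radical
    "餐": "食", "飯": "食",  # food radical
    "菜": "艸", "花": "艸",  # grass radical
}


def radical_form(word):
    """Replace every character by its radical; unknown characters stay as they are."""
    return "".join(RADICAL_OF.get(ch, ch) for ch in word)


def group_by_radical_form(words):
    """Aggregate words that collapse to the same radical form."""
    groups = defaultdict(list)
    for word in words:
        groups[radical_form(word)].append(word)
    return dict(groups)


groups = group_by_radical_form(["早餐", "晚餐", "早飯", "菜"])
print(groups)
```

Several distinct words collapse into one radical form, which is consistent with the claim that radical parts act as a higher-level, more compact feature unit than words.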


3.2. Dataset preparation

3.2.1. Collecting corpora

To confirm our hypothesis that adopting radical parts can improve sentiment analysis, we needed a dataset that is large and contains words in the related dimensions. We collected restaurant reviews from IPEEN.com.tw and TRIPADVISOR.com and compiled them into restaurant review corpora. The IPEEN review corpus contains a four-dimensional ranking system, labeled as “overall,” “taste,” “service,” and “environment.” This corpus can be used to generate radical parts in different sub-dimensions. The TRIPADVISOR review corpus served as a comparison corpus for examining the performance of the radical-based extraction method that we applied to the IPEEN corpus.

The evaluation ranking systems employed in IPEEN and TRIPADVISOR were different. In IPEEN, there was a 12-level ranking system of up to 60 points for the “overall” dimension, and 5-star ranking systems were used for “taste,” “service,” and “environment.” Adopting stricter criteria for positive reviews, we considered the top 2 levels (rated from 55 to 60 points) to be positive reviews, whereas reviews awarding less than 45 points were considered negative reviews. We compiled 60,000 reviews into a corpus (30,000 positive reviews and 30,000 negative reviews). As for the TRIPADVISOR corpus, we treated reviews that ranked a restaurant with 4 or 5 stars as positive and reviews with less than 3.5 stars as negative. We compiled 13,462 reviews into an unbalanced review corpus, which contained an unequal number of positive and negative reviews. After the preliminary analysis, there were 30,554,282 tokens and 294,860 words from the IPEEN corpus, as well as 887,895 tokens and 26,098 words from the TRIPADVISOR corpus.
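The labeling thresholds described above can be written down as two small functions. This is only a sketch of the stated cut-offs: the function names are hypothetical, reviews falling between the thresholds are assumed to be discarded (returned as `None`), and a rating of 4.5 stars is assumed to count as positive alongside 4 and 5.

```python
def label_ipeen(overall_points):
    """IPEEN 'overall' score (up to 60 points) -> 'positive', 'negative', or None."""
    if overall_points >= 55:
        return "positive"
    if overall_points < 45:
        return "negative"
    return None  # assumed to be left out of the corpus


def label_tripadvisor(stars):
    """TRIPADVISOR star rating -> 'positive', 'negative', or None."""
    if stars >= 4:
        return "positive"
    if stars < 3.5:
        return "negative"
    return None  # exactly 3.5 stars matches neither stated criterion


print(label_ipeen(60), label_ipeen(50), label_ipeen(40))
print(label_tripadvisor(5), label_tripadvisor(3.5), label_tripadvisor(3))
```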

3.2.2. Applying natural language processing

After compiling the two restaurant corpora, we applied the same NLP steps to both. First, we applied part-of-speech tagging by using the SINICA CKIP system² to segment the collected reviews, and then we filtered the sample through part-of-speech annotations to exclude verbs and nouns, including “^P,” “^C,” “^D,” “^N[c|d|e|f|g|h],” “V_2,” “^T,” and “^SHI,” to cover most of the meaningful words in the reviews. Next, we conducted negation processing over the corpora. Das and Chen (2001) suggested that negation words, e.g., “not,” “no,” and “never,” affect the sentiment of the subsequent words until a punctuation mark appears. Here, we used 11 Chinese negation words, including “ ,” “ ,” “ ,” “ ,” “ ,” “ ,” “ ,” “ ,” “ ,” and “ ,” to mark the subsequent words if an odd number of negation words was found in the preceding positions.
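The negation step can be sketched as a single pass over the segmented tokens: count negation words since the last punctuation mark and tag a word when that count is odd. The three negation words, the punctuation set, the `NOT_` prefix, and the decision to leave the negation words themselves untagged are assumptions for illustration; the paper's own list of 11 negation words is not reproduced here.

```python
# Illustrative subset of negation words; not the paper's list.
NEGATIONS = {"不", "沒", "沒有"}
PUNCTUATION = set("，。！？；,.!?;")


def mark_negation(tokens, prefix="NOT_"):
    """Prefix tokens that follow an odd number of negation words within a clause."""
    marked = []
    negation_count = 0
    for token in tokens:
        if token in PUNCTUATION:
            negation_count = 0  # the scope of negation ends at punctuation
            marked.append(token)
        elif token in NEGATIONS:
            negation_count += 1
            marked.append(token)
        elif negation_count % 2 == 1:
            marked.append(prefix + token)
        else:
            marked.append(token)
    return marked


print(mark_negation(["服務", "不", "好", "，", "菜", "好吃"]))
print(mark_negation(["不", "是", "不", "好"]))  # double negation cancels out
```

Using the parity of the count means a double negation within the same clause leaves the final word unmarked, which matches the "odd number" rule in the text.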

Footnote 2: A part-of-speech tagger (POS tagger), available at http://ckipsvr.iis.sinica.edu.tw/, is a piece of software that reads text in some language, segments it into a collection of meaningful words, and assigns a part-of-speech tag, such as noun, verb, or adjective, to each word (and other tokens).

3.2.3. Radical parts and collocations

We extended the morpheme-based sentiment analysis suggested by Yang and Chao (2014) to use radical parts as a feature selection unit for revealing sentiments at the conceptual level. Several studies (Su et al., 2008; Zhang et al., 2011; Zhang et al., 2012) also explored the relationship between morpheme-based features and collocations through PMI calculation. Church and Hanks (1990) suggested limiting the window for a feature and its collocation to five positions and considering only pairs whose co-occurrence frequency is higher than three.

Because Chinese grammar tolerates stuff words, i.e., words that fill out the semantic context but are unimportant to sentiment analysis, skip-grams (Xu et al., 2013) can be adopted to skip over them. For example, given an English sentence such as “This price is high,” 4-skip-2-gram (i.e., two words co-occurring within four positions) would generate the lowercase bigram set {(this, price), (price, is), (is, high), (this, is), (price, high), (this, high)}.

This study adopted Church and Hanks’ (1990) suggestion of a window size of five and used skip-gram constructions, called 5-skip-2-gram, as the PMI sentiment investigation unit to explore the proper PMI collocations for both radical-based and word-based approaches in the IPEEN corpus. A similar approach was also adopted by Yang and Chao (2014), who suggested using a PMI value within 0–3 for investigating sentiment features. To compare the PMI collocations of feature words and radicals within a common range, we calculated a PMI value for each 5-skip-2-gram and geometrically normalized the PMI value to 0–1. The distributions of the PMI relationships in the IPEEN corpus are shown in Fig. 1.

In Fig. 1, there are fewer radical-based PMI collections than word-based ones (548,658 vs. 655,534) because radical parts are a higher conceptual representation of Chinese words, and words having similar semantic radical concepts are aggregated into a single radical form. The apex of the radical distribution curve was also higher than that of the word PMI collection curve; therefore, the relationships that co-occurred with a higher frequency were concentrated within the given 0–3 limits.
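The skip-gram construction and the PMI calculation described in this section can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: "within w positions" is read as "both words fit inside w consecutive token positions," probabilities are estimated from raw counts, and the function names and the `min_count` parameter are hypothetical. The geometric normalization to 0–1 is not reproduced because its exact form is not given here.

```python
import math
from collections import Counter


def skip_bigrams(tokens, window):
    """Return ordered word pairs whose two members fit inside `window` positions."""
    pairs = []
    for i, left in enumerate(tokens):
        # tokens[i + 1 : i + window] are the words at distance 1 .. window - 1
        for right in tokens[i + 1 : i + window]:
            pairs.append((left, right))
    return pairs


def pmi_table(sentences, window=5, min_count=1):
    """Estimate PMI(x, y) = log2(p(x, y) / (p(x) * p(y))) for every skip bigram.

    `min_count` lets the caller drop rare pairs (the text mentions keeping only
    co-occurrence frequencies higher than three).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(skip_bigrams(tokens, window))
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    table = {}
    for (x, y), count in pair_counts.items():
        if count < min_count:
            continue
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# The English example from the text: 4-skip-2-gram over "This price is high".
print(skip_bigrams("this price is high".split(), window=4))
```

For the four-word example, every pair of words lies within four positions, so the six pairs listed in the text are produced.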


Fig. 1. Word and radical collections distributions.

4. Sentiment experiment results

This study conducted a series of experiments to demonstrate the superiority of the proposed radical-based approach. In the first two experiments, we used the IPEEN corpus to examine the proposed radical-based approach in four dimensions of the review data. We first compared unigram features (single-feature word vs. radical form) for sentiment analysis. Then, we compared three different strategies using 5-skip-2-gram: the word-based features, the radical parts with the corresponding radical set, and the NTUSD.

In all cases, the same parameters were used for the SVM classifiers, and the features were weighted by TFIDF (term frequency, inverse document frequency). The toolkit used in this study was scikit-learn (Pedregosa et al., 2011). For every step of the experiment, we randomly selected 80% of the 60,000 reviews in the balanced IPEEN corpus as a training set (the remaining 12,000 reviews acted as the test set), and we also conducted 10-fold cross-validation (footnote 3) to examine the consistency. In addition, although there are plenty of reviews with neutral polarity on the Internet, business owners are more concerned with positive and negative product reviews. Thus, as stated above, we adopted a strict definition of positive reviews and then built two classifiers in all cases: a positive classifier to separate positive from non-positive reviews, and a negative classifier to separate negative from non-negative reviews.

Footnote 3: 10-fold cross-validation is a process that splits the training dataset into 10 subsets of equal size and then, in turn, uses one subset for testing and the others for training. Therefore, the validation process involves 10 iterations of training and testing.

As an example of feature selection in Chinese, let us consider the sentence “ ” (I like this pot of vegetables very much). With unigrams and the NLP steps mentioned above, we would take , , and in word representation and , , , and in the corresponding radical representation as features for the SVM. With 5-skip-2-gram and the seed “ ,” we would take , collocations in word representation and , in the corresponding radical representation as features for the SVM.

We used the F1 value as a measurement of the effectiveness of the classifier. This value combines recall and precision (van Rijsbergen, 1979) and is commonly recommended as a performance measurement for SVM. In the case of 10-fold cross-validation, we reported the F1 scores by averaging the results of the 10 iterations on the testing data.

$$\text{Precision (Accuracy)} = \frac{tp}{tp + fp}$$

$$\text{Recall (Sensitivity)} = \frac{tp}{tp + fn}$$

$$F1 = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

where tp represents the number of true-positive (correct) results, fp represents the number of false-positive (unexpected) results, and fn represents the number of false-negative (missing) results.
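As a small worked illustration of the measures defined above, the following sketch computes precision, recall, and F1 from raw counts and averages F1 over cross-validation folds. The function names and the zero-division convention (returning 0.0) are assumptions made for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1


def mean_f1(fold_counts):
    """Average the F1 score over folds; each fold is a (tp, fp, fn) tuple."""
    scores = [precision_recall_f1(tp, fp, fn)[2] for tp, fp, fn in fold_counts]
    return sum(scores) / len(scores)


# Toy counts: 8 correct positives, 2 unexpected positives, 4 missed positives.
print(precision_recall_f1(tp=8, fp=2, fn=4))
```

For these toy counts, precision is 8/10 and recall is 8/12, so F1 equals 16/22, which is the same as 2tp / (2tp + fp + fn).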

4.1. Sentiment analysis in unigram

In the unigram case, we considered every single Chinese word as a possible feature, and no seed was used. A radical form is the corresponding semantic representation of a word. Its use as a filter reduces the number of words in the corpus, and it can only be semantically interpreted at a higher conceptual level. We used only unigrams as features when employing both word-based and radical-based strategies to construct SVM classifiers, and we tested these classifiers in all four restaurant dimensions to understand the differences between the two strategies. TFIDF was the default configuration for weighting SVM classifiers in this experiment. We tried two cases: all input features and the features with the top 40% weight (ranked by TFIDF). The training and prediction results for unigrams are shown in Tables 3 and 4, respectively.

Table 3 The comparison of sentiment analysis in unigram (10-fold validation training).

| Dimension | Method | Input features | AVG positive F1 | AVG negative F1 |
|---|---|---|---|---|
| Overall | Word | TOP 40% | 0.748 | 0.728 |
| Overall | Word | 100% | 0.759 | 0.739 |
| Overall | Radical | TOP 40% | 0.754 | 0.735 |
| Overall | Radical | 100% | 0.765 | 0.746 |
| Taste | Word | TOP 40% | 0.774 | 0.762 |
| Taste | Word | 100% | 0.786 | 0.774 |
| Taste | Radical | TOP 40% | 0.788 | 0.776 |
| Taste | Radical | 100% | 0.799 | 0.787 |
| Service | Word | TOP 40% | 0.736 | 0.790 |
| Service | Word | 100% | 0.743 | 0.798 |
| Service | Radical | TOP 40% | 0.739 | 0.789 |
| Service | Radical | 100% | 0.748 | 0.800 |
| Environment | Word | TOP 40% | 0.751 | 0.774 |
| Environment | Word | 100% | 0.760 | 0.784 |
| Environment | Radical | TOP 40% | 0.758 | 0.775 |
| Environment | Radical | 100% | 0.769 | 0.787 |

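Table 4 The comparison of sentiment analysis in unigram (prediction).

| Dimension | Input features | Method | Positive precision | Positive recall | Positive F1 | Negative precision | Negative recall | Negative F1 | Time (s) | Number of features |
|---|---|---|---|---|---|---|---|---|---|---|
| Overall | TOP 40% | Word | 0.758 | 0.739 | 0.749 | 0.724 | 0.744 | 0.734 | 9.0 | 15,539 |
| Overall | TOP 40% | Radical | 0.760 | 0.742 | 0.751 | 0.726 | 0.745 | 0.735 | 8.4 | 8,438 |
| Overall | 100% | Word | 0.769 | 0.753 | 0.761 | 0.737 | 0.754 | 0.745 | 10.0 | 38,849 |
| Overall | 100% | Radical | 0.772 | 0.760 | 0.766 | 0.743 | 0.756 | 0.750 | 9.8 | 21,096 |
| Taste | TOP 40% | Word | 0.792 | 0.770 | 0.781 | 0.760 | 0.782 | 0.771 | 8.4 | 15,134 |
| Taste | TOP 40% | Radical | 0.793 | 0.779 | 0.786 | 0.767 | 0.781 | 0.774 | 8.1 | 10,533 |
| Taste | 100% | Word | 0.799 | 0.773 | 0.786 | 0.765 | 0.792 | 0.778 | 10.0 | 37,836 |
| Taste | 100% | Radical | 0.802 | 0.789 | 0.796 | 0.778 | 0.791 | 0.784 | 9.5 | 26,333 |
| Service | TOP 40% | Word | 0.790 | 0.692 | 0.738 | 0.751 | 0.834 | 0.790 | 8.6 | 15,279 |
| Service | TOP 40% | Radical | 0.799 | 0.688 | 0.740 | 0.751 | 0.844 | 0.795 | 8.0 | 10,586 |
| Service | 100% | Word | 0.813 | 0.686 | 0.744 | 0.753 | 0.858 | 0.802 | 9.2 | 38,199 |
| Service | 100% | Radical | 0.807 | 0.694 | 0.746 | 0.756 | 0.850 | 0.800 | 9.1 | 26,467 |
| Environment | TOP 40% | Word | 0.793 | 0.716 | 0.752 | 0.743 | 0.814 | 0.777 | 8.6 | 15,424 |
| Environment | TOP 40% | Radical | 0.791 | 0.732 | 0.761 | 0.752 | 0.808 | 0.779 | 8.4 | 10,676 |
| Environment | 100% | Word | 0.807 | 0.716 | 0.759 | 0.746 | 0.830 | 0.786 | 10.0 | 38,561 |
| Environment | 100% | Radical | 0.808 | 0.738 | 0.771 | 0.760 | 0.826 | 0.792 | 9.1 | 26,691 |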


As shown in Tables 3 and 4, in all four review dimensions, the F1 values of the radical-based strategy were, in most cases, slightly higher than those of the word-based strategy. More importantly, the use of radicals significantly reduces the feature size (for example, 21,096 radical features vs. 38,849 word features in the “overall” dimension) owing to conceptualization of the word’s meaning. A smaller feature size for the classifier decreases the time and memory consumption of sentiment analysis over a large corpus.
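The "top 40% weight (ranked by TFIDF)" selection used in this experiment can be illustrated with a short, dependency-free sketch. How the study aggregated TFIDF weights per feature is not stated, so the example assumes that each feature is scored by its TFIDF weight summed over all documents and that the cut-off count is truncated to an integer. The formula below is a plain tf × log(N/df) variant rather than scikit-learn's smoothed one, and all names are hypothetical.

```python
import math
from collections import Counter


def tfidf_feature_weights(docs):
    """Sum a plain TF-IDF weight (tf * log(N / df)) per feature over all documents."""
    n_docs = len(docs)
    document_frequency = Counter()
    for doc in docs:
        document_frequency.update(set(doc))
    weights = Counter()
    for doc in docs:
        if not doc:
            continue  # skip empty documents to avoid dividing by zero
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / document_frequency[term])
            weights[term] += tf * idf
    return weights


def top_fraction(weights, fraction):
    """Keep the highest-weighted features; ties are broken alphabetically."""
    keep = max(1, int(len(weights) * fraction))
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    return ranked[:keep]


docs = [["good", "food", "good"], ["bad", "food"], ["good", "service", "food"]]
print(top_fraction(tfidf_feature_weights(docs), 0.5))
```

Under the truncation assumption, 40% of the 38,849 word features in the “overall” dimension gives 15,539, which is consistent with the feature counts reported in Table 4.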

4.2. Sentiment analysis in bigram

Although the F1 scores of sentiment analysis with unigrams were acceptable, text mining still needs bigram analysis because the collocations in bigrams carry more semantics. For example, a positive opinion like (shumai tastes good) can be segmented into two terms: (shumai, barbecue in Hong Kong style) and (tastes good); the first term “ ” matches the morpheme-based aspect consisting of the radical . After the collocation feature for the classifier is explored, the aspect-sentiment feature can be marked as a tuple of and , and its radical form will be a tuple of and .

Therefore, in the second experiment, we used bigrams as the features in the classifiers. Every bigram is a 5-skip-2-gram, which means that two words or radical parts co-occur within five positions and possess a semantic relationship within a sentence. Before the analysis, a domain-dependent root morpheme set had to be selected to explore the collocations. The root morpheme set in this study was discussed by several enlisted experts, who identified the features in the restaurant dimensions that would be of concern to customers. Yang and Chao (2014) used eight selected morpheme root words to search for possible feature words and collocations to assist in the sentiment analysis of movie reviews. In their study, a pre-computed PMI was used as a criterion to identify the relationship between feature words and collocations, and they showed that morpheme-based features performed better than the keyword-based approach in various movie genres. In this study, we adopted the same idea, but used a pre-computed PMI separately in word and radical representation.

As shown in Table 5, we used fewer seeds, i.e., three morphemes as seeds of the root set in each sub-dimension: “taste,” “service,” and “environment.” For the “overall” dimension, we included all morphemes of the other three sub-dimensions and added morphemes for price feature extraction: (money), (price), and (cost). The possible words for each morpheme are also listed in the table, and the semantic relationship between the existing morphemes and the possible words can be observed. In the fourth column, the root radical parts are listed corresponding to the selected root morpheme words. The radical form of a root morpheme is its conceptual representation.

Table 5 Root morpheme words and corresponding radical parts.

After a domain-dependent root morpheme set was chosen, we designed a morpheme-based feature extraction approach based on 5-skip-2-gram. The 5-skip-2-gram construction algorithm proposed in this study is shown in Fig. 2. In the algorithm, the PMI_word and PMI_radical values have to be calculated separately because several words might have the same radical part representation, like the radicals listed in Table 2. We used the PMI screening mechanism twice (step 06 and step 08) to explore the features that share the same collocations as modifiers in the reviews. For example, “ ” (vegetable) contains the morpheme character “ ” and one of its frequently occurring collocations is “ ” (tastes good). However, other words, like “ ” (stew), “ ” (handmade noodles), and “ ” (soy sauce chicken), can also appear together with “ ” as a frequently occurring collocation. Therefore, we can find “ ” (vegetable) from the PMI_word key set in step 06 and store its collocation “ ” in col_word. In step 08, we used the words in the col_word list to search for frequently occurring collocations, and then “ ,” “ ,” “ ,” as well as “ ” emerge from the PMI_word key set. This PMI relationship between morpheme-based feature words and collocations also works with radical representations. For example, “ ” is the radical form of “ ,” and its frequently occurring collocations are “ ” (the radical form of “ ”), “ ” (the radical form of “ ”), and “ ” (the radical form of “ ”). Moreover, we had to repeat all steps (step 09) to generate f_radical by using PMI_radical and m_radical. As a result, the pairwise features f_word and f_radical were generated according to the given review corpus C and the morpheme root seeds m_word.

Fig. 2. The 5-skip-2-gram construction algorithm for word and radical.
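The two PMI screening passes described above (steps 06 and 08) might be sketched as follows. This is one reading of the prose, not a transcription of Fig. 2: it assumes that the pre-computed PMI is available as a dictionary keyed by word pairs, that a feature word "contains" a seed morpheme in the substring sense, and that only pairs with a PMI value inside the 0–3 range are kept. The English tokens and the seed "veg" are hypothetical placeholders for the Chinese words and morphemes.

```python
def extract_collocation_features(pmi, seeds, low=0.0, high=3.0):
    """Collect (word, collocation) pairs in two PMI screening passes.

    pmi   -- dict mapping (left, right) pairs to a pre-computed PMI value
    seeds -- iterable of root morphemes (substrings of feature words)
    """
    def in_range(value):
        return low <= value <= high

    features = set()
    collocates = set()

    # Pass 1: pairs in which one member contains a seed morpheme;
    # remember the other member as a collocation.
    for (left, right), value in pmi.items():
        if not in_range(value):
            continue
        for word, other in ((left, right), (right, left)):
            if any(seed in word for seed in seeds):
                features.add((left, right))
                collocates.add(other)

    # Pass 2: any further pair that shares one of those collocations.
    for (left, right), value in pmi.items():
        if in_range(value) and (left in collocates or right in collocates):
            features.add((left, right))

    return features


# Hypothetical PMI table with English placeholders.
pmi_word = {
    ("vegetable", "tasty"): 1.2,
    ("stew", "tasty"): 0.9,
    ("noodles", "tasty"): 2.0,
    ("waiter", "rude"): 1.5,
    ("vegetable", "fresh"): 5.0,   # outside the 0-3 range, ignored
}
print(sorted(extract_collocation_features(pmi_word, seeds={"veg"})))
```

Running the same function on a radical-level PMI table with radical seeds would correspond to the repeated pass (step 09) that produces the radical features.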
For comparison, we also used the NTUSD (Ku et al., 2006), which is the most commonly applied traditional Chinese wordlist, to construct the classifiers. The NTUSD consists of manually annotated news articles in traditional Chinese. It includes 1122 positive words and 4525 negative words after CKIP part-of-speech processing. Because the NTUSD is a sentiment wordlist, we can use it to find frequently co-occurring aspect collocations in our restaurant reviews.

Thus, three different feature selection strategies, namely, word-based, radical-based, and NTUSD-based strategies, are compared in Tables 6 and 7. All strategies used the same pre-computed PMI value set. We used 80% of the 60,000 reviews in the balanced IPEEN corpus as a training set, and the remaining reviews acted as the test set. In the bigram case, we also tried two cases: all input collocations and the collocations with the top 40% weight (ranked by TFIDF).

Table 6 The comparison of sentiment analysis in bigram (10-fold validation training).

| Dimension | Method | Input collocations | AVG positive F1 | AVG negative F1 |
|---|---|---|---|---|
| Overall | Word | TOP 40% | 0.801 | 0.788 |
| Overall | Word | 100% | 0.800 | 0.787 |
| Overall | Radical | TOP 40% | 0.805 | 0.788 |
| Overall | Radical | 100% | 0.806 | 0.787 |
| Overall | NTUSD | TOP 40% | 0.735 | 0.730 |
| Overall | NTUSD | 100% | 0.739 | 0.731 |
| Taste | Word | TOP 40% | 0.838 | 0.829 |
| Taste | Word | 100% | 0.836 | 0.828 |
| Taste | Radical | TOP 40% | 0.838 | 0.829 |
| Taste | Radical | 100% | 0.839 | 0.829 |
| Taste | NTUSD | TOP 40% | 0.758 | 0.756 |
| Taste | NTUSD | 100% | 0.762 | 0.761 |
| Service | Word | TOP 40% | 0.792 | 0.830 |
| Service | Word | 100% | 0.786 | 0.828 |
| Service | Radical | TOP 40% | 0.795 | 0.832 |
| Service | Radical | 100% | 0.793 | 0.831 |
| Service | NTUSD | TOP 40% | 0.714 | 0.787 |
| Service | NTUSD | 100% | 0.712 | 0.789 |
| Environment | Word | TOP 40% | 0.814 | 0.826 |
| Environment | Word | 100% | 0.811 | 0.826 |
| Environment | Radical | TOP 40% | 0.814 | 0.826 |
| Environment | Radical | 100% | 0.813 | 0.826 |
| Environment | NTUSD | TOP 40% | 0.736 | 0.770 |
| Environment | NTUSD | 100% | 0.734 | 0.772 |

Table 7 The comparison of sentiment analysis in bigram (prediction).

| Dimension | Input collocations | Method | Positive precision | Positive recall | Positive F1 | Negative precision | Negative recall | Negative F1 | Time (s) | Number of collocations |
|---|---|---|---|---|---|---|---|---|---|---|
| Overall | TOP 40% | Word | 0.779 | 0.772 | 0.776 | 0.756 | 0.763 | 0.760 | 59.0 | 130,653 |
| Overall | TOP 40% | Radical | 0.808 | 0.790 | 0.799 | 0.777 | 0.796 | 0.786 | 44.0 | 128,563 |
| Overall | TOP 40% | NTUSD | 0.760 | 0.708 | 0.733 | 0.704 | 0.757 | 0.730 | | |
| Overall | 100% | Word | 0.801 | 0.806 | 0.803 | 0.787 | 0.783 | 0.785 | 61.0 | 326,633 |
| Overall | 100% | Radical | 0.806 | 0.790 | 0.798 | 0.777 | 0.793 | 0.785 | 44.0 | 321,408 |
| Overall | 100% | NTUSD | 0.763 | 0.711 | 0.736 | 0.708 | 0.760 | 0.733 | | |
| Taste | TOP 40% | Word | 0.820 | 0.797 | 0.809 | 0.790 | 0.813 | 0.801 | 58.0 | 123,213 |
| Taste | TOP 40% | Radical | 0.850 | 0.824 | 0.836 | 0.817 | 0.844 | 0.830 | 44.0 | 121,981 |
| Taste | TOP 40% | NTUSD | 0.795 | 0.727 | 0.759 | 0.732 | 0.799 | 0.764 | | |
| Taste | 100% | Word | 0.848 | 0.825 | 0.836 | 0.817 | 0.841 | 0.829 | 62.0 | 308,034 |
| Taste | 100% | Radical | 0.851 | 0.817 | 0.834 | 0.812 | 0.846 | 0.829 | 50.0 | 304,954 |
| Taste | 100% | NTUSD | 0.798 | 0.731 | 0.763 | 0.735 | 0.801 | 0.766 | | |
| Service | TOP 40% | Word | 0.811 | 0.741 | 0.775 | 0.761 | 0.826 | 0.792 | 64.0 | 125,039 |
| Service | TOP 40% | Radical | 0.844 | 0.748 | 0.793 | 0.794 | 0.876 | 0.833 | 48.0 | 123,823 |
| Service | TOP 40% | NTUSD | 0.810 | 0.646 | 0.719 | 0.731 | 0.864 | 0.792 | | |
| Service | 100% | Word | 0.843 | 0.741 | 0.789 | 0.790 | 0.876 | 0.831 | 66.0 | 312,599 |
| Service | 100% | Radical | 0.849 | 0.734 | 0.787 | 0.787 | 0.883 | 0.832 | 49.0 | 309,559 |
| Service | 100% | NTUSD | 0.816 | 0.632 | 0.712 | 0.725 | 0.872 | 0.792 | | |
| Environment | TOP 40% | Word | 0.805 | 0.720 | 0.760 | 0.765 | 0.838 | 0.800 | 61.0 | 128,550 |
| Environment | TOP 40% | Radical | 0.848 | 0.787 | 0.816 | 0.802 | 0.860 | 0.830 | 46.0 | 126,385 |
| Environment | TOP 40% | NTUSD | 0.799 | 0.679 | 0.734 | 0.722 | 0.830 | 0.773 | | |
| Environment | 100% | Word | 0.852 | 0.774 | 0.811 | 0.795 | 0.866 | 0.829 | 67.0 | 321,376 |
| Environment | 100% | Radical | 0.849 | 0.782 | 0.814 | 0.799 | 0.862 | 0.830 | 50.0 | 315,963 |
| Environment | 100% | NTUSD | 0.810 | 0.673 | 0.735 | 0.722 | 0.843 | 0.778 | | |

As shown in Tables 6 and 7, among the three strategies, in all cases, the commonly applied NTUSD-based strategy had the lowest F1 scores. Furthermore, among the 16 F1 scores of the 10-fold cross-validation training in Table 6, there were 9 cases in which the radical-based approach was slightly higher (by 0.001–0.006) than the corresponding word-based approach; both were equal in the other 7 cases. In the case of prediction, when inputting all (100%) collocations, among the eight F1 scores of prediction in Table 7, the radical-based approach was slightly superior in 3 cases, the word-based approach was slightly superior in 3 cases, and both were equal in 2 cases. The differences were of the order of 0.001–0.005. However, when inputting the top 40% of collocations (ranked by TFIDF), the radical-based approach was clearly superior in all eight F1 scores. The differences in the 40% situation (0.018–0.056) were roughly 10 times larger than the differences in the 100% situation. As the ratio of input collocations decreases, the superiority of the radical-based approach becomes more distinct, as shown in Figs. 3 and 4. From these figures, we can see that, if we only inputted 10% (i.e., 32,140 in the “overall” dimension) or even 1% of the collocations in the radical-based approach, the F1 scores would still be satisfactory, but not in the word-based approach. To achieve almost the same F1 scores as the radical-based approach, the word-based approach needed to input at least 50% of the collocations (i.e., 163,316 in the “overall” dimension). Thus, the advantage in time spent became apparent in the bigram case. In the above example of the “overall” dimension, to achieve the same F1 score, the word-based approach would have to spend 40 s for 50% of the collocations, whereas the radical-based approach would have to spend 31 s for 10% of the collocations. Therefore, the F1 performance of the radical-form features was better than that of the word-form features, regardless of whether unigrams or 5-skip-2-grams were used, and the word-form features would generate far more features for analysis, thus consuming more computing memory (footnote 4) and time.

Fig. 3. F1 scores change with the input collocation ratio (positive classifiers).

Fig. 4. F1 scores change with the input collocation ratio (negative classifiers).

A further comparison was also conducted. We deleted one morpheme seed in each dimension, one by one, and reran the tests. Thus, with 100% of the collocations, we used 2 rather than 3 root seeds in each sub-dimension, namely, “taste,” “service,” and “environment,” and we used 8 rather than 12 root seeds in the “overall” dimension. As shown in Table 8, the F1 scores of the radical-based approach did not change much and were still satisfactory; the same was not true for the word-based approach. Almost all F1 scores of the radical-based approach outperformed the scores of the word-based approach, which indicates that the radical-based approach is robust, whereas the word-based approach is fragile and depends on the chosen morpheme root seeds.

4.3. Generating the seed list

Next, we tried to propose a seed list in radical form to improve the procedure of sentiment analysis. If such a list existed, we could use it as a keyword list and apply the same analysis approach as the NTUSD method by searching for keywords in the extracted list and using their collocations as machine learning features. With this list, we could skip the extension of morpheme feature words (steps 07 and 08 in Fig. 2) and would need no expert consultation to identify the root seeds in Table 5. In addition, since the seed list contains domain-dependent knowledge, we hoped that it could be reused in a new corpus of the same domain.

We adopted the approach of Liu et al. (2010), in which hierarchical synonym groups are used to build semantic word graphs to filter significant words or radicals. We considered the pre-calculated PMI as a word graph generated by a dependent corpus. In this study, the pre-calculated PMI relationships represented the frequently co-occurring collocations in the reviews, and the PMI value of the relationships was between 0 and 3. Our approach differs from that of Liu et al. in that our PMI relationship graph contained no semantic or linguistic information. However, it is possible to prune this graph to extract the most significant nodes (word- or radical-based) to assist in sentiment analysis. We assumed the following. (1) Users intend to use various words to express their sentiment toward an aspect of a restaurant. For example, customers express sentiments toward a dish by using words such as “tasty,” “yum yum,” and “lovely,” and, generally, these words are already listed in the sentiment dictionary. In a graph representation, this means that a node will have a higher possibility of pairing up with others. (2) Words used for describing restaurant services are similar to those used for dishes and can be seen frequently. This is the same assumption driving TFIDF, and it can be recognized as a high PMI value in the relationship graph. Combining these two assumptions and adopting the Rich-club algorithm (Zhou and Mondragón, 2004), we ranked the existing nodes according to frequency of occurrence and richness (higher co-occurring PMI value), called FR-Rank, as can be seen in the following formula:

$$\text{FRRank}(node) = \log\left(\frac{edges_{connected}}{edges_{all}}\right) \times \operatorname{average}\left(PMI^{normalized}_{connected}\right)$$

where $edges_{connected}$ is the number of co-occurring nodes in a graph, $edges_{all}$ is the total number of PMI collocation pairs, and $PMI^{normalized}_{connected}$ is the PMI value, normalized to a value between 0 and 1, of the pairs connected to the node. FR-Rank can be used in both word-based and radical-based PMI graphs. It extracts the most significant nodes (word- or radical-form) according to PMI statistics and collocation connections.

Footnote 4: With 60,000 reviews and without first compressing the matrix, the 163,316 features in the word-based approach would consume about 19.6 GB of memory, but the 32,140 features in the radical-based approach would only consume 3.8 GB. With the compression used in our experiment, the memory used would be lower, but would still be 1886 MB vs. 477 MB.
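The FR-Rank score defined in this section can be computed with a few lines of code. The sketch below is an illustration under assumptions: the PMI graph is given as a dictionary of node pairs with already normalized PMI values, the logarithm is natural (the base is not stated), and the product form follows the formula as reconstructed above. The function name and the toy graph are hypothetical, and the direction in which scores are ordered when seeds are picked is left to the caller because it is not specified here.

```python
import math
from collections import defaultdict


def fr_rank(pmi_normalized):
    """Score every node of a PMI graph.

    pmi_normalized -- dict mapping (node_a, node_b) pairs to a PMI value in [0, 1]
    Returns a dict: node -> log(edges_connected / edges_all) * average connected PMI.
    """
    edges_all = len(pmi_normalized)
    connected = defaultdict(list)
    for (node_a, node_b), value in pmi_normalized.items():
        connected[node_a].append(value)
        if node_b != node_a:
            connected[node_b].append(value)

    scores = {}
    for node, values in connected.items():
        edge_ratio = len(values) / edges_all          # frequency component
        average_pmi = sum(values) / len(values)       # richness component
        scores[node] = math.log(edge_ratio) * average_pmi
    return scores


# Toy graph with four collocation pairs.
graph = {("a", "b"): 0.5, ("a", "c"): 1.0, ("b", "c"): 0.2, ("c", "d"): 0.3}
for node, score in sorted(fr_rank(graph).items()):
    print(node, round(score, 4))
```

Because the edge ratio is at most 1, the logarithm is never positive, so under this reading all scores are zero or negative.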


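Table 8 The comparison of word-based and radical-based with fewer root seeds.

| Dimension | Comparison | Root seeds | Method | Positive precision | Positive recall | Positive F1 | Negative precision | Negative recall | Negative F1 |
|---|---|---|---|---|---|---|---|---|---|
| Overall | (1) | | Word | 0.818 | 0.704 | 0.757 | 0.465 | 0.577 | 0.515 |
| Overall | (1) | | Radical | 0.782 | 0.812 | 0.797 | 0.833 | 0.784 | 0.808 |
| Overall | (2) | | Word | 0.829 | 0.679 | 0.747 | 0.428 | 0.522 | 0.470 |
| Overall | (2) | | Radical | 0.835 | 0.813 | 0.824 | 0.841 | 0.802 | 0.821 |
| Overall | (3) | | Word | 0.876 | 0.758 | 0.813 | 0.376 | 0.490 | 0.425 |
| Overall | (3) | | Radical | 0.789 | 0.802 | 0.795 | 0.838 | 0.820 | 0.829 |
| Taste | (1) | | Word | 0.728 | 0.693 | 0.710 | 0.742 | 0.683 | 0.711 |
| Taste | (1) | | Radical | 0.832 | 0.864 | 0.848 | 0.884 | 0.836 | 0.859 |
| Taste | (2) | | Word | 0.637 | 0.658 | 0.647 | 0.745 | 0.731 | 0.738 |
| Taste | (2) | | Radical | 0.819 | 0.858 | 0.838 | 0.858 | 0.860 | 0.859 |
| Taste | (3) | | Word | 0.636 | 0.686 | 0.660 | 0.741 | 0.705 | 0.723 |
| Taste | (3) | | Radical | 0.786 | 0.799 | 0.792 | 0.824 | 0.829 | 0.826 |
| Service | (1) | | Word | 0.541 | 0.616 | 0.576 | 0.825 | 0.746 | 0.784 |
| Service | (1) | | Radical | 0.756 | 0.782 | 0.769 | 0.846 | 0.818 | 0.832 |
| Service | (2) | | Word | 0.495 | 0.609 | 0.546 | 0.885 | 0.730 | 0.800 |
| Service | (2) | | Radical | 0.771 | 0.800 | 0.785 | 0.845 | 0.813 | 0.829 |
| Service | (3) | | Word | 0.485 | 0.599 | 0.536 | 0.823 | 0.734 | 0.776 |
| Service | (3) | | Radical | 0.766 | 0.791 | 0.778 | 0.878 | 0.815 | 0.845 |
| Environment | (1) | | Word | 0.387 | 0.508 | 0.439 | 0.873 | 0.709 | 0.782 |
| Environment | (1) | | Radical | 0.768 | 0.819 | 0.793 | 0.861 | 0.808 | 0.834 |
| Environment | (2) | | Word | 0.451 | 0.562 | 0.500 | 0.951 | 0.766 | 0.849 |
| Environment | (2) | | Radical | 0.791 | 0.828 | 0.809 | 0.844 | 0.823 | 0.833 |
| Environment | (3) | | Word | 0.467 | 0.597 | 0.524 | 0.916 | 0.741 | 0.819 |
| Environment | (3) | | Radical | 0.735 | 0.786 | 0.760 | 0.868 | 0.821 | 0.844 |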

Using the morpheme-based approach in Section 4.2, we can get more than 30,000 unique feature words and more than 18,000 unique feature radical parts, including both morpheme words and collocations (footnote 5). To understand how much the size of the extracted feature set affects the F1 score when these extracted features are used as keywords, we used FR-Rank to rank all of the nodes in the word-based PMI and radical-based PMI graphs and conducted a sentiment analysis for each of the four dimensions using the same methods described in Section 4.2. When the extracted size was set, step by step, to 8000, 4000, 2000, 1000, 500, 100, and 50, we could observe a degradation of performance for the four dimensions in both the positive classifiers (Fig. 5) and the negative classifiers (Fig. 6).

Footnote 5: Unique implies non-overlapping, e.g., if there were 10 collocations, namely, AB, AC, AD, AE, BC, BD, BE, CD, CE, and DE, there would be only 5 unique features: A, B, C, D, and E.

Fig. 5. Size of FR-ranked seed set and F1 scores of positive classifiers.

As Figs. 5 and 6 show, the size of the extracted feature set was critical for sentiment analysis. It seems that, for the word representation, a reasonable seed set size should be larger than 2000 to achieve at least the same F1 score as the unigram strategy in Tables 3 and 4. However, for the representation of radical parts, the seed set can be reduced to 50 radical parts and still yield satisfactory F1 scores compared to the unigram strategy.

The extracted radical seed set requires an in-depth experiment in a domain with similar reviews to ensure its applicability; therefore, we used our second corpus, the TRIPADVISOR restaurant review corpus, to verify this argument.

Fig. 6. Size of FR-ranked seed set and F1 scores of negative classifiers.

4.4. Verification of the radical seed set in the same domain

The 50 radicals extracted above are shown in Table 9. In the table, there are syllabic and disyllabic radicals, and it is difficult to describe the semantic relationship between these radicals in the restaurant reviews. However, we claim that the 50 extracted radical parts can be reused in a similar domain. To test the applicability, we used our second corpus, the TRIPADVISOR restaurant review corpus, to verify this claim. The TRIPADVISOR experiment used the set of 50 radical seeds extracted from the “overall” dimension of the IPEEN corpus because the TRIPADVISOR corpus had only one equivalent rating on its website. For comparison, we used word-based unigram, morpheme word-based bigram with the same seeds as in Table 5, and NTUSD 5-skip-2-gram, with the same parameter settings as described previously.

Table 9 Extracted radical seed set from IPEEN corpus.

As shown in Table 10, in the case of positive classifiers, the NTUSD-based strategy was the least accurate among the four strategies, and the extracted radical-based strategy held the highest F1 score. The performance of the 50 extracted radical-based seeds was better than that of the morpheme word-based approach with the same seeds as in Table 5. Because TRIPADVISOR was an unbalanced corpus with 10,258 positive reviews and 3204 negative reviews, the performance of the negative classifiers was not satisfactory. However, in the case of the negative classifiers, the performance of the extracted radical-based strategy degraded far less than that of the other strategies. If the performance measurement was adjusted by the supported instances, i.e., the numbers of positive and negative reviews, the superiority of the extracted radical-based strategy would become more apparent.

Table 10 Sentiment analysis comparison results on TRIPADVISOR corpus.

| Method | Positive classifier: Precision | Positive classifier: Recall | Positive classifier: F1 | Negative classifier: Precision | Negative classifier: Recall | Negative classifier: F1 | Adjusted average: Precision | Adjusted average: Recall | Adjusted average: F1 |
|---|---|---|---|---|---|---|---|---|---|
| NTUSD-based | 0.765 | 0.996 | 0.866 | 0.385 | 0.008 | 0.015 | 0.675 | 0.761 | 0.715 |
| Word-based unigram | 0.768 | 0.998 | 0.868 | 0.750 | 0.024 | 0.046 | 0.764 | 0.766 | 0.765 |
| Morpheme word-based bigram | 0.765 | 0.999 | 0.867 | 0.750 | 0.005 | 0.009 | 0.761 | 0.763 | 0.762 |
| Extracted 50 radical-parts | 0.814 | 0.960 | 0.881 | 0.686 | 0.285 | 0.403 | 0.790 | 0.799 | 0.795 |
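The adjusted averages in Table 10 can be illustrated with a short sketch. The text only states that the measures are adjusted by the number of positive and negative reviews, so the code below assumes a support-weighted mean of the per-class precision and recall, and further assumes that the adjusted F1 is the harmonic mean of those two weighted values. The function names (`support_weighted`, `f1`, `adjusted_average`) are hypothetical and are not taken from the study.

```python
# Class sizes of the TRIPADVISOR corpus as stated in the text.
N_POS, N_NEG = 10258, 3204


def support_weighted(pos, neg, n_pos=N_POS, n_neg=N_NEG):
    """Average a per-class score, weighting each class by its review count."""
    return (pos * n_pos + neg * n_neg) / (n_pos + n_neg)


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def adjusted_average(pos_p, pos_r, neg_p, neg_r):
    """Return (precision, recall, F1) for the adjusted average.

    Assumption: F1 is computed from the weighted precision and recall,
    not as a weighted mean of the per-class F1 scores.
    """
    p = support_weighted(pos_p, neg_p)
    r = support_weighted(pos_r, neg_r)
    return p, r, f1(p, r)


# NTUSD-based row of Table 10: positive P/R, then negative P/R.
p, r, f = adjusted_average(0.765, 0.996, 0.385, 0.008)
print(f"{p:.3f} {r:.3f} {f:.3f}")
```

Under these assumptions the NTUSD-based row comes out at 0.675, 0.761, and 0.715, which agrees with the adjusted-average columns of Table 10. Because the inputs are the rounded per-class values from the table, other rows may differ from the published figures in the third decimal place.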


5. Conclusions

Over a billion people worldwide speak some form of Chinese as their first language. Chinese words and their pronunciations have also been borrowed extensively into Korean, Japanese, and Vietnamese, and today these words comprise over half of the vocabularies of those languages. The Chinese language has developed over a long period of time, and linguists have devoted a great deal of effort to understanding and organizing Chinese scripts. However, Chinese is a non-space-delimited, polysyllabic language with its own unique characteristics. Thus, not all English language resources and techniques can be directly adapted to the Chinese language environment for text mining.

This study recognizes that the semantics of Chinese words are represented in the relationship between the meaning of the word itself and its radical parts. We considered radical parts as conceptual representations of words and explained the semantic relationship between a word and a radical. Although this semantic relationship cannot be interpreted directly if the context of the sentence is not known, radical parts can be considered as a semantic unit for sentiment analysis.

Previous studies have shown that a morpheme-based sentiment analysis approach can be more effective than text-retrieval and keyword-based approaches. However, in the morpheme-based approach, identifying the initial morpheme words (i.e., seeds) is a challenging task that requires domain experts.⁶ The purpose of this study was to demonstrate the superiority of radical-based information over word-based information. We conducted a series of text-mining experiments. First, we used the exact radicals corresponding to words in either unigram or bigram sentiment analysis for four different dimensions of the IPEEN restaurant corpus, namely “overall,” “taste,” “service,” and “environment.” The results demonstrated that the radical approach performed better than the word-based approach and the commonly applied NTUSD-based approach, and that it was robust and consumed less computing memory and time.

Next, to extract the sentiment seed set, we proposed using the FR-Rank approach to extract frequently used nodes as keywords, which can improve the procedure of sentiment analysis, minimize expert intervention, and be reused in other corpora of the same domain. Our experiment indicated that the number of radical features can be further reduced to 50 while still maintaining satisfactory performance. We also applied the 50 extracted radical seeds to a new corpus of the same domain, i.e., the TRIPADVISOR restaurant reviews, and the results showed that the F1 score of the radical-based strategy was better than that of the word-based unigram, morpheme word-based bigram, and NTUSD-based approaches. Thus, we have demonstrated that the radical-based approach is a “better alternative” for text mining in traditional Chinese.

This study has some limitations, which suggest directions for future research. First, the aim of this study was to compare radical-based with traditional morpheme-based approaches. The comparison with keyword-based approaches was supplementary; therefore, we chose only the commonly applied NTUSD-based approach. However, Chinese WordNet and E-HowNet provide richer ontological semantics, and the augmented version of the NTUSD, called ANTUSD, was integrated into E-HowNet (Wang and Ku, 2016). It would be interesting for future research to compare ontology-based approaches with our radical-based approach. Second, although the radical parts have been changed in simplified Chinese, we would also like to apply this approach to the simplified Chinese environment and compare it with an approach using the HowNet sentiment word list. Third, we look forward to examining the semantic meanings of radicals and studying how they affect sentiment analysis. Fourth, future research may also try unsupervised sentiment methods, such as deep learning neural networks, and combine them with our radical-based approach to obtain meaningful collocations in sentiment analysis. It would also be interesting to extend the current work to model general semantic composition, not just sentiment, at the radical level.

Uncited references

Feldman and Siok (1997), Huang (2005), Balahur and Turchi (2014), Montejo-Ráez et al. (2014), Salton (1991).

⁶ Although E-HowNet could mitigate the seed identification problem, it could not solve it, and expert consultation was still needed. For example, from the structure of E-HowNet, “ ” (be full) can be automatically identified as a morpheme, but not “ ” (disgust).


Acknowledgments

The authors would like to thank the National Science Council, Taiwan, for financially supporting this research under contract no. NSC 101-2410-H-004-015-MY3.

References


Bradley, M.M., Lang, P.J., 1999. Affective norms for English words (ANEW): instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida.
Balahur, A., Turchi, M., 2014. Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis. Comput. Speech Lang. 28 (1), 56–75.
Chao, Y.R., 1965. A Grammar of Spoken Chinese. University of California Press, Berkeley & Los Angeles.
Chou, Y.M., 2005. Hantology: the knowledge structure of Chinese writing system and its applications. Unpublished Ph.D. thesis. National Taiwan University.
Chao, A.F.Y., Chung, S.F., 2011. A measurement of multi-level semantic relations among Mandarin lexemes with radical mu4: a study based on dictionary explanations. Comput. Ling. Chin. Lang. Process., 21–40.
Church, K.W., Hanks, P., 1990. Word association norms, mutual information, and lexicography. Comput. Ling. 16 (1), 22–29.
Dong, Z., Dong, Q., 2006. HowNet and the Computation of Meaning. World Scientific.
Das, S., Chen, M., 2001. Yahoo! for Amazon: extracting market sentiment from stock message boards. Manage. Sci. 53 (9), 1375–1388.
Fu, G., Wang, X., 2010. Chinese sentence-level sentiment classification based on fuzzy sets. In: Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, pp. 312–319.
Feldman, L.B., Siok, W.W., 1997. The role of component function in visual recognition of Chinese characters. J. Exp. Psychol.: Learn. Memory Cognit. 23 (1), pp. 776771.
Ghiassi, M., Skinner, J., Zimbra, D., 2013. Twitter brand sentiment analysis: a hybrid system using n-gram analysis and dynamic artificial neural network. Expert Syst. Appl. 40 (16), 6266–6282.
Hong, J.F., Huang, C.R., 2013. Cross-strait lexical differences: a comparative study based on Chinese gigaword corpus. Comput. Ling. Chin. Lang. Process. 18 (2), 19–34.
Hu, X., Tang, J., Gao, H., Liu, H., 2013. Unsupervised sentiment analysis with emotional signals. In: Proceedings of the 22nd International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, Rio de Janeiro, Brazil, pp. 607–618.
Huang, C.R., 2005. Knowledge representation with Hanzi: the relationship among characters, words, and senses. Presented at the International Conference on Chinese Characters and Globalization, Taipei. [in Chinese].
Huang, C., Yang, Y., Chen, S., 2008. An ontology of Chinese radicals: concept derivation and knowledge representation based on the semantic symbols of the four hoofed-mammals. In: 22nd Pacific Asia Conference on Language, Information and Computation, pp. 189–196.
Jang, Shin, 2010. Language-specific sentiment analysis in morphologically rich languages. In: The 23rd International Conference on Computational Linguistics (Coling), August 23–27, Beijing, China, pp. 498–506.
Ku, L., Huang, T., Chen, H., 2009. Using morphological and syntactic structures for Chinese opinion analysis. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 3. Association for Computational Linguistics, pp. 1260–1269.
Ku, L.W., Liang, Y.T., Chen, H.H., 2006. Opinion extraction, summarization and tracking in news and blog corpora. In: Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, pp. 100–107. AAAI Technical Report.
Liu, B., 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Lu, B., Song, Y., Zhang, X., Tsou, B.K., 2010. Learning Chinese polarity lexicons by integration of graph models and morphological features. Inf. Retrieval Technol., 466–477.
Lu, B., 2010. Identifying opinion holders and targets with dependency parser in Chinese news texts. In: Proceedings of the NAACL HLT 2010 Student Research Workshop. Association for Computational Linguistics, pp. 46–51.
Li, Y., Kang, J., 1993. Analysis of phonetics of the ideophonetic characters in modern Chinese. In: Information Analysis of Usage of Characters in Modern Chinese, pp. 84–98.
Montejo-Ráez, A., Martínez-Cámara, E., Martín-Valdivia, M.T., Ureña-López, L.A., 2014. Ranked WordNet graph for sentiment polarity classification in Twitter. Comput. Speech Lang. 28 (1), 93–107.
Nakagawa, T., Inui, K., Kurohashi, S., 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 786–794.
Pan, B., MacLaurin, T., Crotts, J.C., 2007. Travel blogs and the implications for destination marketing. J. Travel Res. 46 (1), 35–45.
Pang, B., Lee, L., 2008. Opinion mining and sentiment analysis. Found. Trends Inf. Retrieval 2 (1–2), 1–135.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Su, Q., Xu, X., Guo, H., Guo, Z., Wu, X., Zhang, X., Swen, B., 2008. Hidden sentiment association in Chinese web opinion mining. In: Proceedings of the 17th International Conference on World Wide Web, pp. 959–968.
Salton, G., 1991. Developments in automatic text retrieval. Science 253 (5023), 974–980.
Stone, P.J., Hunt, E.B., 1963. A computer approach to content analysis: studies using the general inquirer system. In: Proceedings of Spring Joint Computer Conference, pp. 241–256.


Strapparava, C., Valitutti, A., 2004. WordNet affect: an affective extension of WordNet. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, 4, pp. 1083–1086.
Turney, P.D., 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pp. 417–424.
Van Rijsbergen, C.J., 1979. Information Retrieval, second ed. Butterworth, London.
Wan, X., 2009. Co-training for cross-lingual sentiment classification. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 235–243.
Wang, S.M., Ku, L.W., 2016. ANTUSD: a large Chinese sentiment dictionary. In: 10th Edition of the Language Resources and Evaluation Conference, 23–28 May, Portoroz, Slovenia, pp. 2697–2702.
Wu, Z., Tseng, G., 1993. Chinese text segmentation for text retrieval: achievements and problems. J. Am. Soc. Inf. Sci. 44 (9), 532–542.
Xu, G., Huang, C.R., Wang, H., 2013. Extracting Chinese product features: representing a sequence by a set of skip-bigrams. Chinese Lexical Semantics, pp. 72–83.
Yang, H.L., Chao, A.F., 2014. Sentiment analysis for Chinese reviews of movies in multi-genre based on morpheme-based features and collocations. Inf. Syst. Front., 1–18.
Zhang, C., Zeng, D., Li, J., Wang, F.Y., Zuo, W., 2009. Sentiment analysis of Chinese documents: from sentence to document level. J. Am. Soc. Inf. Sci. Technol. 60 (12), 2474–2487.
Zheng, L., Wang, H., Gao, S., 2015. Sentimental feature selection for sentiment analysis of Chinese online reviews. Int. J. Mach. Learn. Cybern., 1–10.
Zhang, W., Xu, H., Wan, W., 2012. Weakness finder: find product weakness from Chinese reviews by using aspects based sentiment analysis. Expert Syst. Appl. 39 (11), 10283–10291.
Zhang, H., Yu, Z., Xu, M., Shi, Y., 2011. Feature-level sentiment analysis for Chinese product reviews. In: International Conference on Computer Research and Development (ICCRD), 2, pp. 135–140.
Zhou, S., Mondragón, R.J., 2004. The rich-club phenomenon in the internet topology. IEEE Commun. Lett. 8 (3), 180–182.
Zhu, L., Galstyan, A., Cheng, J., Lerman, K., 2014. Tripartite graph clustering for dynamic sentiment analysis on social media. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1531–1542.

Please cite this article as: A. Chao, H. Yang, Using Chinese radical parts for sentiment analysis and domaindependent seed set extraction, Computer Speech & Language (2017), http://dx.doi.org/10.1016/j.csl.2017.07.007