Exploiting effective features for Chinese sentiment classification


Expert Systems with Applications 38 (2011) 9139–9146


Zhongwu Zhai a,*, Hua Xu a, Bada Kang b, Peifa Jia a

a State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
b Viterbi School of Engineering, University of Southern California, United States

* Corresponding author. E-mail address: [email protected] (Z. Zhai).


Keywords: Sentiment classification; Substring features; Substring-group; Suffix tree

Abstract

Features play a fundamental role in sentiment classification. How to effectively select different types of features to improve sentiment classification performance is the primary topic of this paper. Ngram features are commonly employed in text classification tasks; in this paper, sentiment-word, substring, substring-group, and key-substring-group features, which have never been considered in the sentiment classification area before, are also extracted. The extracted features are then compared and analyzed. To demonstrate generality, we use two authoritative Chinese data sets in different domains to conduct our experiments. Our statistical analysis of the experimental results indicates the following: (1) different types of features possess different discriminative capabilities in Chinese sentiment classification; (2) character bigram features perform the best among the Ngram features; (3) substring-group features have greater potential to improve the performance of sentiment classification by combining substrings of different lengths; (4) sentiment words or phrases extracted from existing sentiment lexicons are not effective for sentiment classification; (5) effective features usually have varying lengths rather than fixed lengths.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Textual information can be broadly categorized into two main types: facts and opinions. Facts are objective expressions about entities, events and their properties. Opinions are usually subjective expressions that describe people's sentiments, appraisals or feelings toward entities, events and their properties (Liu, 2010). With the rapid development of the internet, huge amounts of opinionated texts (texts with opinions or sentiments) are constantly being generated for everyone to see, such as reviews, forum discussions, and blog posts. The opinions or sentiments contained in these texts are a valuable source for customers, companies, and governments alike. However, the volume of opinionated text published on the internet every day is too large for a human to process. Therefore, in many cases, valuable opinions stay unnoticed for a long time in reviews, forum threads and blog posts before they are discovered. Sentiment analysis, also known as opinion mining, grows out of this need, and has become one of the key technologies for handling and analyzing the opinionated texts on the internet (Pang & Lee, 2008).

The main task in the field of sentiment analysis is sentiment classification, which classifies evaluative documents, sentences or words as positive or negative (in some cases, the neutral class is used as well) (Liu, 2006) to help people automatically identify the viewpoints underlying opinionated texts (Pang & Lee, 2004). Since sentiment classification concerns the expressed opinion rather than the topic of the document, it challenges data-driven methods and resists conventional text classification techniques (Pang & Lee, 2002). To date, machine learning-based methods have been commonly adopted for sentiment classification due to their outstanding performance (Pang & Lee, 2002, 2004). One open problem in machine learning-based sentiment classification is how to extract complex features that outperform simple features; figuring out which types of features are more valuable is another (Riloff, Patwardhan, & Wiebe, 2006). Most existing research focuses on simple features, including single words (Tan & Zhang, 2008), character Ngrams (Li & Sun, 2007; Raaijmakers & Kraaij, 2008), word Ngrams (Li & Sun, 2007; Pang & Lee, 2002, 2004; Riloff et al., 2006), or combinations of the aforementioned features.

Though some of the features mentioned above have been studied individually, few people have systematically compared and statistically analyzed the impact of these different types of features. Furthermore, substring-group features have never been considered in sentiment classification; in fact, classification based on substring-group features has at least the following potential advantages (Zhang & Lee, 2006). It allows sub-word features and super-word features to be used automatically. Also, defining word boundaries in some Asian languages is a daunting task due to special linguistic


properties such as having no spaces between words. With such features, artificial word-grouping rules can be avoided, and non-alphabetical features can be taken into account.

All in all, some important questions remain unanswered in sentiment classification: (1) Ngram features are commonly used in text classification tasks; which Ngram feature is the most suitable for Chinese sentiment classification? (2) Are substring-group features suitable for sentiment classification? Why? (3) Sentiment classification focuses on the sentiments expressed in documents rather than their topics; are sentiment word or phrase features effective for classification? (4) What are the characteristics of effective features?

In this paper, an in-depth study is conducted on all these types of features in order to identify effective features for sentiment classification, and the questions above are answered experimentally.

The rest of this paper is organized as follows. Section 2 reviews related work. The methodology and key techniques are described in detail in Section 3. The experimental setup is illustrated in Section 4, and the results are given and analyzed in Section 5. Section 6 discusses some key properties of the feature extraction algorithms. Finally, the paper is summarized in Section 7.

2. Related work

Sentiment classification can be performed at the word level, sentence level and document level. In this paper, we focus on document-level sentiment classification. Most existing techniques for document-level sentiment classification are based on supervised learning, although there are also some unsupervised methods. We give an introduction to both kinds of methods in this section.

2.1. Classification based on supervised learning

The supervised learning based approaches focus on training a sentiment classifier using a labeled corpus. Since the work of Pang and Lee (2002), various classification models and linguistic features have been proposed. Dave, Lawrence, and Pennock (2003) use machine learning based methods to classify reviews on several kinds of products. Pang and Lee (2004) report an 86.4% accuracy rate for sentiment classification of movie reviews by using word unigram features with SVMs. Mullen and Collier (2004) also employ SVMs to bring together diverse sources of potentially pertinent information, including several favorability measures for phrases and adjectives and knowledge of the topic of the text. More recently, Li and Sun (2007) compare the performance of four machine learning methods for sentiment classification of Chinese reviews using Ngram features. Blitzer, Dredze, and Pereira (2007) investigate domain adaptation for sentiment classifiers. McDonald et al. (2007) investigate a structured model for jointly classifying the sentiment of text at varying levels of granularity. Tan and Zhang (2008) discover that IG (Information Gain) performs best for selecting sentimental terms and that SVMs exhibit the best performance for sentiment classification. Tan, Wang, and Cheng (2008) combine learn-based and lexicon-based techniques for sentiment detection without using labeled examples.

2.2. Classification based on unsupervised learning

The unsupervised approaches focus on identifying the sentiment orientation of individual words or phrases, and then classifying each document in terms of the number of such words or phrases it contains. Hatzivassiloglou and McKeown (1997) predict the semantic orientation of adjectives by using constraints on conjunctions. Although their method achieves high precision, it relies on a large corpus and needs a large amount of manually tagged training data. Turney determines the sentiment orientation of phrases by Pointwise Mutual Information (PMI) based on pre-defined seed words, and rates reviews as thumbs up or down (Turney, 2002; Turney & Littman, 2002). Kim and Hovy (2004) build three models to assign a sentiment category to a given sentence by combining the individual sentiments of sentiment-bearing words. Hu and Liu (2004) classify customer reviews using a holistic lexicon. Kennedy and Inkpen (2006) determine the sentiment of customer reviews by counting positive and negative terms and taking into account contextual valence shifters, such as negations and intensifiers. Devitt and Ahmad (2007) explore a computable metric of positive or negative polarity in financial news text. Wan (2008) uses bilingual knowledge and ensemble techniques for unsupervised Chinese sentiment analysis.

3. Methodology

3.1. Overview

The framework of sentiment classification based on supervised learning is shown in Fig. 1. The framework consists of a training phase and a classifying phase. In the training phase, a sentiment classifier is trained on the labeled documents. In the classifying phase, unlabeled documents are classified as positive or negative by the trained classifier. In the following subsections, we describe the key steps of the framework, namely feature extraction, term weighting, training and classifying.

3.2. Feature extraction

This step aims to extract features from the labeled documents. As shown in Fig. 1, the extracted unique features are used to transform the original documents into the vectors on which the training and classifying steps are based. Thus, this step largely determines the overall performance of sentiment classification. In addition to the commonly-used Ngram features, sentiment term, substring, substring-group and key-substring-group feature extraction algorithms are studied in this subsection.


Fig. 1. The framework of sentiment classification based on supervised learning.


Fig. 2. Ngram features extraction algorithm.

3.2.1. Sentiment term features

Sentiment words and phrases are the dominant indicators for sentiment classification (Liu, 2010). Thus, it is quite natural to use existing sentiment lexicons as the source of features for sentiment classification. More specifically, the extraction steps are as follows. Firstly, all terms in the training corpus are extracted as candidate features. Secondly, the extracted terms are filtered: a term is selected as a feature only if it appears in the sentiment lexicon. An authoritative sentiment lexicon released by Tsinghua University (http://nlp.csai.tsinghua.edu.cn/lj/downloads/sentiment.dict.v1.0.zip) is used in this study, comprising 5,563 positive words and 4,464 negative words. This method is called baseline in the following experiments.

3.2.2. Ngram features

Ngram features are commonly used in text categorization tasks (Li & Sun, 2007; Pang & Lee, 2002, 2004; Raaijmakers & Kraaij, 2008; Tan & Zhang, 2008). The Ngram extraction algorithm is very simple; its pseudo-code is shown in Fig. 2, where S[j] represents the jth unit in the string S and n is the size of the Ngram. Depending on the granularity, Ngram features can be categorized into character Ngram features and word Ngram features. Take the sentence "I like this camera" for example: its character bigram features are Il, li, ik, ke, et, th, hi, is, sc, ca, am, me, er, ra, and its word bigram features are I like, like this, this camera. In the experiments in Section 5, character-unigram (CU), character-bigram (CB), character-trigram (CT), word-unigram (WU), word-bigram (WB) and word-trigram (WT) features are all studied.

3.2.3. Substring features

Although substring features are seldom adopted in text categorization tasks, they have many potential advantages (Zhang & Lee, 2006), as mentioned in Section 1. The pseudo-code of the extraction algorithm is shown in Fig. 3, where S[j] again represents the jth unit in the string S. Observe that the substring features include all the n-gram features for n from 1 to l, where l is the length of the string. For the sentence mentioned above, the word substring features are I, I like, I like this, I like this camera, like, like this, like this camera, this, this camera, camera. The strings in a document can be tokenized into characters or words; the resulting features are called character based substring features (CS) and word based substring features (WS), respectively.

Fig. 3. Substring features extraction algorithm.
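The pseudo-code of Figs. 2 and 3 is not reproduced in this text. As a rough stand-in, the following Python sketch (our own naming, not the authors' code) implements both procedures as described above: Ngrams over a tokenized string, and substrings as the union of all Ngrams for n = 1 to l.

def ngram_features(units, n):
    # Slide a window of size n over S; units[j] is the jth unit of S (cf. Fig. 2).
    return [tuple(units[j:j + n]) for j in range(len(units) - n + 1)]

def substring_features(units):
    # Union of the n-gram features for every n from 1 to l = len(S) (cf. Fig. 3).
    return [gram for n in range(1, len(units) + 1)
            for gram in ngram_features(units, n)]

# Character bigrams (CB) of "I like this camera": Il, li, ik, ke, et, th, ...
print(ngram_features(list("I like this camera".replace(" ", "")), 2))
# Word substrings (WS): I, I like, ..., this camera, camera (10 features in all)
print(substring_features("I like this camera".split()))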

3.2.4. Substring-group features

A corpus D of length l has l(l+1)/2 substrings. When l is large, the number of substring features is too huge to use in classification. However, the substring features can be clustered into a relatively small number of equivalence groups by using suffix tree techniques. The main steps are given below.

Firstly, a generalized suffix tree is constructed from all the documents in the training corpus using Ukkonen's algorithm, which runs in O(l) time (Ukkonen, 1995). In this paper, the basic unit of sequence added to the suffix tree is a short sentence, i.e., a segment delimited by a period, comma, question mark, exclamation mark, ellipsis, semicolon or colon.

Secondly, all the substrings of the training and testing documents are matched against the constructed suffix tree, and the IDs of all matched nodes are taken as the features of the corresponding document. This process can also be finished in O(l) time by making full use of the suffix links to move from one suffix to the next. According to the conclusions in Gusfield (1997), for a text corpus of length l, the constructed suffix tree has l leaf-nodes and at most l − 1 internal nodes. So the number of features can be reduced from l(l+1)/2 to at most 2l − 1 by the suffix tree technique.

Take the sentence "I like this camera, because this camera looks like chocolate" for example. The suffix tree constructed from its two short sentences is shown in Fig. 4, and we get 14 substring groups. When we match another sentence, "I like chocolate very much", nodes 1, 12, 13 and 14 are matched. As a result, the substring-group features of this sentence are "1, 12, 13, 14".

The short sentences can be further tokenized into characters or words. If the suffix tree is constructed from the character sequences of short sentences, the extracted features are called character based substring-group features (CSG); otherwise, the extracted features are called word based substring-group features (WSG).
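To make the node-ID feature idea concrete, here is a small Python sketch of our own. It builds a generalized suffix tree by naive suffix insertion (O(l^2) rather than Ukkonen's O(l) construction, so it is only an illustration), then matches a new token sequence against the tree and returns the IDs of all matched nodes as features. Node numbering depends on insertion order and will not reproduce the IDs in Fig. 4.

import itertools

class Node:
    _counter = itertools.count(1)
    def __init__(self, label=()):
        self.id = next(Node._counter)  # node ID, used as a feature
        self.label = label             # edge label (tuple of units) into this node
        self.children = {}             # first unit of edge label -> child Node

def insert_suffix(root, suffix):
    # Naively insert one suffix (a tuple of units) into the tree.
    node, i = root, 0
    while i < len(suffix):
        head = suffix[i]
        if head not in node.children:
            node.children[head] = Node(suffix[i:])   # new leaf edge
            return
        child = node.children[head]
        lab, j = child.label, 0
        while j < len(lab) and i + j < len(suffix) and lab[j] == suffix[i + j]:
            j += 1
        if j == len(lab):                # whole edge matched: descend
            node, i = child, i + j
        else:                            # mismatch inside the edge: split it
            mid = Node(lab[:j])
            node.children[head] = mid
            child.label = lab[j:]
            mid.children[child.label[0]] = child
            if i + j < len(suffix):
                mid.children[suffix[i + j]] = Node(suffix[i + j:])
            return

def build_tree(short_sentences):
    root = Node()
    for k, units in enumerate(short_sentences):
        seq = tuple(units) + (("$", k),)  # unique terminator per short sentence
        for start in range(len(seq)):
            insert_suffix(root, seq[start:])
    return root

def substring_group_features(root, units):
    # IDs of all tree nodes matched by any substring of `units`.
    feats = set()
    for start in range(len(units)):
        node, i = root, start
        while i < len(units):
            child = node.children.get(units[i])
            if child is None:
                break
            lab, j = child.label, 0
            while j < len(lab) and i + j < len(units) and lab[j] == units[i + j]:
                j += 1
            feats.add(child.id)          # at least lab[0] matched on this edge
            if j < len(lab):
                break                    # the match stopped inside this edge
            node, i = child, i + j
    return feats

tree = build_tree(["I like this camera".split(),
                   "because this camera looks like chocolate".split()])
print(substring_group_features(tree, "I like chocolate very much".split()))

As in the paper's example, the query sentence matches four nodes (those reached via "I like", "like", "like chocolate" and "chocolate"), though with different ID values.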

3.2.5. Key-substring-group features

Although the l(l+1)/2 substring features can be reduced to 2l − 1 equivalence groups, 2l − 1 is still too large for classification when the training corpus is very large. It is not hard to see that the nodes in the suffix tree remain redundant. The criteria proposed in Zhang and Lee (2006) are therefore used to further reduce the number of substring groups.

Compared with the substring-group extraction algorithm, the key-substring-group extraction algorithm has an additional step at the end, which filters the matched node IDs according to the criteria shown in Table 1. The IDs that survive this filtering are taken as the key-substring-group features of the corresponding document. In the experiments in Section 5, the recommended values of the criteria parameters are adopted: L = 20, H = 8000, B = 8, P = 0.8 and Q = 0.8. Moreover, for the same reason as in Section 3.2.4, the extracted features can be categorized into character based key-substring-group features (CKSG) and word based key-substring-group features (WKSG).

3.3. Term weighting

The representation of a problem has a strong impact on the general accuracy of learning systems (Joachims, 1997a).


Fig. 4. Generalized suffix tree constructed by the two short sentences: ’’I like this camera’’ and ‘‘because this camera looks like chocolate’’. The dotted arrows represent suffix links.

Table 1. The criteria for key-node selection.

L: The minimum frequency. A node is not selected if it has fewer than L leaf-nodes in the suffix tree.
H: The maximum frequency. A node is not selected if it has more than H leaf-nodes in the suffix tree.
B: The minimum number of children. A node is not selected if it has fewer than B children.
P: The maximum parent–child conditional probability. A node v is not selected if Pr(v|u) = freq(v)/freq(u) ≥ P, where u is the parent node of v.
Q: The maximum suffix-link conditional probability. A node v is not selected if Pr(v|s(v)) = freq(v)/freq(s(v)) ≥ Q, where the suffix link of v points to s(v).
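Expressed as code, the Table 1 criteria amount to a simple predicate over suffix-tree nodes. The sketch below assumes each node carries precomputed fields freq (the number of leaf-nodes in its subtree), n_children, parent and suffix (its suffix-link target); these attribute names are ours, not the paper's.

def is_key_node(v, L=20, H=8000, B=8, P=0.8, Q=0.8):
    # Returns True iff node v survives the key-node selection criteria of Table 1.
    # Default parameter values are the recommended ones from Section 3.2.5.
    if v.freq < L:                 # minimum frequency
        return False
    if v.freq > H:                 # maximum frequency
        return False
    if v.n_children < B:           # minimum number of children
        return False
    # parent-child conditional probability Pr(v | u) = freq(v) / freq(u)
    if v.parent is not None and v.freq / v.parent.freq >= P:
        return False
    # suffix-link conditional probability Pr(v | s(v)) = freq(v) / freq(s(v))
    if v.suffix is not None and v.freq / v.suffix.freq >= Q:
        return False
    return True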

Documents, which are typically strings of characters, have to be transformed into a suitable representation for learning algorithms and classification tasks. tfidf-c is a variant of standard tfidf, and it is widely used in text classification (Sebastiani, 2002). It is defined as Formula (1):

tfidf\text{-}c(t_k, d_j) = \frac{tf(t_k, d_j) \cdot \log\frac{N}{df(t_k)}}{\sqrt{\sum_{t \in d_j} \left( tf(t, d_j) \cdot \log\frac{N}{df(t)} \right)^2}}    (1)

Here, t_k denotes a distinct term corresponding to a single feature; tf(t_k, d_j) represents the number of times term t_k occurs in document d_j; df(t_k) is the number of documents in which term t_k occurs; and N is the total number of training documents.
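As a sanity check on Formula (1), here is a direct transcription into Python (a sketch of ours, dictionary-based, with invented names):

import math

def tfidf_c(doc_tf, df, N):
    # Weight one document by Formula (1).
    #   doc_tf: {term: tf(t, d_j)}, raw counts for document d_j
    #   df:     {term: df(t)}, document frequencies over the training corpus
    #   N:      total number of training documents
    raw = {t: tf * math.log(N / df[t]) for t, tf in doc_tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # cosine normalization
    return raw if norm == 0.0 else {t: w / norm for t, w in raw.items()}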

3.4. Training and classifying

In this step, the sentiment classifier is trained by machine learning algorithms to predict the classes of the unlabeled test documents. Compared with the other state-of-the-art methods (Li & Sun, 2007; Pang & Lee, 2002, 2004; Tan & Zhang, 2008), including Naive Bayes, Maximum Entropy, K-Nearest Neighbor and Neural Networks, SVMs show substantial performance gains. SVMs prove to be very robust, eliminating the need for expensive parameter tuning (Joachims, 1997b). Moreover, since SVMs use an overfitting protection mechanism that does not necessarily depend on the number of features, they have the potential to handle large feature spaces. Due to this outstanding performance, SVMs are adopted in this paper. The SVMlight package (http://svmlight.joachims.org/) is used for training and testing, with default parameters.
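The paper uses SVMlight; an equivalent sketch using scikit-learn's linear SVM (a substitution on our part, with toy data and invented feature names) shows the full train/classify step on tfidf-c weighted feature dictionaries:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the tfidf-c weighted feature dictionaries, one per document.
train_docs = [{"service": 0.9, "good": 0.4}, {"room": 0.8, "dirty": 0.6}]
train_labels = [1, -1]                    # positive / negative
test_docs = [{"good": 0.7, "room": 0.7}]

vec = DictVectorizer()
X_train = vec.fit_transform(train_docs)  # unique features define the vector space
X_test = vec.transform(test_docs)        # features unseen in training are dropped

clf = LinearSVC()                        # linear SVM with default parameters
clf.fit(X_train, train_labels)
print(clf.predict(X_test))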

4. Experimental setup

4.1. Data sets

All the experiments are conducted on two authoritative data sets in two domains, Hotel (http://www.searchforum.org.cn/tansongbo/corpus/ChnSentiCorp_htl_ba_4000.rar) and Product (http://wanxiaojun1979.googlepages.com/PKU-ICST-ProductReviewData.rar). The first data set is crawled from Ctrip (http://www.ctrip.com/), one of the most well-known websites in China for hotel and flight reservations, and the second is collected from IT168 (http://www.it168.com), a popular Chinese IT product website. In order to categorize these reviews as positive or negative, both data sets are further filtered manually. A brief summary of them is shown in Table 2.

Table 2. The summary of the data sets.

Dataset Name   #Positive   #Negative   Publishing Authority
Hotel          2,000       2,000       Chinese Academy of Sciences
Product        451         435         Peking University

To conduct the experiments, each data set is randomly divided into three equal-sized folds, maintaining a balanced class distribution in each fold. Two folds are used for training, and the remaining fold is used for testing. Note that all the results reported below are averages over 30 experiments on these data sets.
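The text does not spell out how the 30 runs are generated; one plausible reading (our assumption) is 10 random stratified 3-fold partitions, with each fold serving once as the test set:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# X: the vectorized corpus (e.g., a scipy.sparse matrix) from the previous
# sketch; y: a numpy array of +1/-1 labels.
accuracies = []
for seed in range(10):                             # 10 shuffles x 3 folds = 30 runs
    folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):  # two folds train, one fold tests
        clf = LinearSVC().fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
print("mean accuracy over 30 runs:", np.mean(accuracies))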

Table 3. Performance of the Ngram based features on the two data sets.

Dataset  Method    Positive              Negative              Total     #Features
                   P      R      F1      P      R      F1      Accuracy
Hotel    baseline  0.857  0.810  0.833   0.820  0.865  0.842   0.838     1,345
         CU        0.882  0.866  0.874   0.868  0.884  0.876   0.875     3,216
         CB        0.927  0.910  0.918   0.912  0.928  0.920   0.919     68,279
         CT        0.926  0.899  0.912   0.902  0.928  0.915   0.913     172,495
         WU        0.905  0.893  0.899   0.894  0.906  0.900   0.899     10,293
         WB        0.928  0.894  0.911   0.898  0.930  0.914   0.912     86,744
         WT        0.886  0.867  0.876   0.870  0.888  0.879   0.878     146,213
Product  baseline  0.849  0.763  0.804   0.777  0.858  0.815   0.810     475
         CU        0.920  0.903  0.911   0.902  0.917  0.909   0.910     1,627
         CB        0.932  0.914  0.923   0.913  0.930  0.921   0.922     17,713
         CT        0.864  0.876  0.869   0.870  0.854  0.861   0.865     30,739
         WU        0.938  0.920  0.929   0.919  0.936  0.927   0.928     3,392
         WB        0.863  0.882  0.872   0.876  0.853  0.864   0.868     15,730
         WT        0.675  0.883  0.765   0.821  0.556  0.662   0.723     19,884

4.2. Evaluation metrics

The standard precision, recall and F1 are used to measure the performance on the negative and positive classes, respectively. Accuracy is used to evaluate the overall performance of the sentiment classification. All these metrics are the same as those in general text categorization.

5. Experimental results

In order to identify the effective features for sentiment classification, the baseline, CU, CB, CT, WU, WB, WT, CS, WS, CSG, WSG, CKSG and WKSG features are all closely examined in the following experiments. These features can be broadly classified into two main categories, Ngram based features and substring based features. In this section, the performance of these two categories of features is analyzed first, and the representative features of each category are then compared at the end of the section.

5.1. Performance of Ngram based features

This subsection addresses the first question posed in the introduction. The performance of the Ngram based features is listed in Table 3. Among the Ngram based features, CB usually outperforms the others in terms of Accuracy, Precision, Recall and F1, even when the number of features (#Features) is taken into account; WU can achieve acceptable accuracy at the cost of a relatively small number of features, while WB is always inferior to CB; trigram features (CT and WT) perform worst, both in performance (here and below, performance refers to Accuracy, Precision, Recall and F1) and in the number of features. Considering both performance and the number of features, CB and WU are selected for further comparison.

These results also indicate the following: the feature counts of CB (or CT) are of the same order of magnitude as those of WB (or WT), but the performance of WB (or WT) is inferior to that of CB (or CT). Considering also the trickiness of word segmentation in Chinese, character based Ngram features are recommended for sentiment classification instead of word based Ngram features.

5.2. Performance of substring based features

The performance of the substring based features is listed in Table 4. As illustrated by these data, it is apparent that

the substring-group features (CSG and WSG) outperform all the other substring based features. Compared to CS (or WS), CSG (or WSG) has an absolute advantage in the number of features used in classification, which illustrates the effectiveness of the suffix tree technique for reducing substring features. We also observe that the word based substring features (WS, WSG) usually outperform the character based substring features (CS, CSG) in Accuracy, Precision, Recall, F1 and the number of features (#Features); the influence of the Chinese word segmentation technique is fully manifested in the classifiers based on these features. Moreover, the key-substring-group features (CKSG, WKSG) achieve relatively good performance with very few features (the reasons for this phenomenon are discussed further in Section 6). Based on the analysis above, WSG, CKSG and WKSG are chosen for comparison.

5.3. Comparison

In order to further determine the efficacy of the different types of features, a comprehensive comparison between the representatives of each category and the baseline is made in this subsection. The comparison results are shown in Figs. 5 and 6. According to the data in Tables 3 and 4, the trends in Precision, Recall and F1 are consistent with the trend in Accuracy, so Accuracy is representative of overall performance. Thus, only Accuracy and #Features are shown in these two figures. According to their performance and numbers of features, these six feature types are further categorized into two groups, a high-performance group (CB, WU, WSG) and a low-cost group (baseline, WKSG, CKSG).

For the high-performance group, WSG is inferior to CB and WU in terms of the number of features. However, WSG achieves much better Accuracy (by 2%) than CB on the Product data set, while they achieve almost the same performance on the Hotel data set; WSG is superior to WU on both the Product and Hotel data sets. These results demonstrate the potential of substring features in sentiment classification. For the low-cost group, the CKSG features outperform the baseline and WKSG on both data sets, which demonstrates the effectiveness of character key-substring-group features.

Across both groups, the sentiment term based features (baseline) achieve the worst results: on one hand, the baseline gets the lowest Accuracy; on the other hand, the baseline is inferior to WKSG in terms of the number of features. This result is not consistent with our expectations, and it answers the third question posed in the introduction.


Table 4. Performance of the substring based features on the two data sets.

Dataset  Method    Positive              Negative              Total     #Features
                   P      R      F1      P      R      F1      Accuracy
Hotel    baseline  0.857  0.810  0.833   0.820  0.865  0.842   0.838     1,345
         CS        0.934  0.894  0.913   0.898  0.937  0.917   0.915     1,048,084
         WS        0.932  0.894  0.912   0.898  0.935  0.916   0.914     500,632
         CSG       0.933  0.902  0.917   0.905  0.935  0.920   0.919     232,244
         WSG       0.931  0.902  0.916   0.905  0.933  0.919   0.917     153,830
         CKSG      0.896  0.891  0.893   0.891  0.896  0.894   0.894     2,069
         WKSG      0.887  0.883  0.885   0.884  0.888  0.886   0.886     1,353
Product  baseline  0.849  0.763  0.804   0.777  0.858  0.815   0.810     475
         CS        0.911  0.957  0.933   0.953  0.902  0.926   0.930     153,326
         WS        0.931  0.956  0.943   0.953  0.925  0.939   0.941     69,410
         CSG       0.924  0.937  0.930   0.933  0.919  0.926   0.928     33,821
         WSG       0.944  0.939  0.942   0.937  0.942  0.939   0.941     21,826
         CKSG      0.868  0.881  0.874   0.874  0.860  0.867   0.871     386
         WKSG      0.815  0.821  0.816   0.817  0.805  0.809   0.813     213

Fig. 5. The comparison results on hotel data set.

Fig. 6. The comparison results on product data set.

6. Discussion

Fig. 7. The length distributions of the key-substring-group features.

In Section 5.2, we observed that the key-substring-group features achieve good performance with very few features. This section discusses the reasons for the effectiveness of these features. According to the framework of the sentiment classification algorithm in Fig. 1, the unique features directly determine the overall performance of sentiment classification. In order to gain insight into the nature of the problem, we manually investigate the CKSG and WKSG features and make the following observations.

Firstly, a detailed statistical analysis of the lengths of the key-substring-group features is performed, and the results are shown in Fig. 7.


Fig. 8. All the 4-character and 5-character features extracted from the hotel data set.

As illustrated by Fig. 7, both CKSG and WKSG consist of features of different lengths, ranging from one character to five characters. For both the Hotel and Product data sets, 1-character and 2-character features account for the main part. However, the 3-character, 4-character and 5-character features also play an important role in sentiment classification, especially for the Hotel data set.

Secondly, we manually analyze the 4-character and 5-character features to determine the real effect they have on sentiment classification. As demonstrated in Fig. 8, all these features are well-formed combinations of characters, even though no word segmentation technique is used. Moreover, most of the features directly name different aspects of hotels, such as "Hotel Room", "Waiter's Manner", "Room Facilities", "Food Variety" and "Breakfast". Clearly, these combination features are more representative of the contents of the reviews than simple Ngram based features. As a result, they contribute more to sentiment classification.

From the analysis above, it is evident that effective features should consist of features of different lengths rather than a fixed length, and that key-substring-group features are effective for sentiment classification.

7. Conclusion

This paper presents an evaluation of different types of features for sentiment classification. In addition to the commonly-used Ngram features, sentiment-word, substring, substring-group, and key-substring-group features, which have not been studied extensively in sentiment classification, are extracted and examined. In order to evaluate the efficacy of these features, experiments have been conducted on two authoritative data sets in different domains. The experimental results show that different types of features possess different discriminative capabilities in Chinese sentiment classification. More specifically, the following conclusions are drawn from the experimental results: (1) character bigram features (CB) consistently outperform the other Ngram features; (2) though character bigram features achieve encouraging performance, substring-group features have greater potential to improve the performance of sentiment classification by combining substrings of different lengths; (3) sentiment words and phrases extracted from existing sentiment lexicons are not effective for sentiment classification; (4) features are usually more effective at varying lengths than at fixed lengths.

Besides these conclusions, this paper also makes the following contributions.

(a) This paper proposes a series of substring-group based features for sentiment classification, which had not been studied before.
(b) The mechanism behind the effectiveness of substring-group features is studied in depth.
(c) This paper gives a clear introduction to the machine learning based techniques applicable to sentiment classification.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grants No. 60405011, 60575057 and 60875073.

References

Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL (pp. 440–447).
Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of WWW (pp. 519–528).
Devitt, A., & Ahmad, K. (2007). Sentiment analysis in financial news: A cohesion-based approach. In Proceedings of ACL (pp. 984–991).
Gusfield, D. (1997). Algorithms on strings, trees, and sequences. New York: Cambridge University Press.
Hatzivassiloglou, V., & McKeown, K. (1997). Predicting the semantic orientation of adjectives. In Proceedings of the joint ACL/EACL conference (pp. 174–181).
Hu, M., & Liu, B. (2004). Mining opinion features in customer reviews. In Proceedings of AAAI (pp. 755–760).
Joachims, T. (1997a). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML (pp. 143–151).
Joachims, T. (1997b). Text categorization with support vector machines: Learning with many relevant features. Springer.
Kennedy, A., & Inkpen, D. (2006). Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence, 22, 110–125.
Kim, S.-M., & Hovy, E. (2004). Determining the sentiment of opinions. In Proceedings of COLING (pp. 1367–1373).
Li, J., & Sun, M. (2007). Experimental study on sentiment classification of Chinese reviews using machine learning techniques. In Proceedings of IEEE NLPKE (pp. 393–400).
Liu, B. (2006). Web data mining: Exploring hyperlinks, contents, and usage data. Springer. Chapter 11: Opinion mining.
Liu, B. (2010). Sentiment analysis and subjectivity. In Handbook of natural language processing (2nd ed.).
McDonald, R., Hannan, K., Neylon, T., Wells, M., & Reynar, J. (2007). Structured models for fine-to-coarse sentiment analysis. In Proceedings of ACL (pp. 432–439).
Mullen, T., & Collier, N. (2004). Sentiment analysis using support vector machines with diverse information sources. In Proceedings of EMNLP (pp. 412–418), poster paper.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL (pp. 271–278).
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP (pp. 79–86).


Raaijmakers, S., & Kraaij, W. (2008). A shallow approach to subjectivity classification. In Proceedings of ICWSM (pp. 216–217).
Riloff, E., Patwardhan, S., & Wiebe, J. (2006). Feature subsumption for opinion analysis. In Proceedings of EMNLP (pp. 440–448).
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Tan, S., Wang, Y., & Cheng, X. (2008). Combining learn-based and lexicon-based techniques for sentiment detection without using labeled examples. In Proceedings of SIGIR (pp. 743–744).
Tan, S., & Zhang, J. (2008). An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications, 34(4), 2622–2629.
Turney, P. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of ACL (pp. 417–424).
Turney, P., & Littman, M. (2002). Unsupervised learning of semantic orientation from a hundred-billion-word corpus. arXiv preprint cs/0212012.
Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14(3), 249–260.
Wan, X. J. (2008). Using bilingual knowledge and ensemble techniques for unsupervised Chinese sentiment analysis. In Proceedings of EMNLP (pp. 553–561).
Zhang, D., & Lee, W. (2006). Extracting key-substring-group features for text classification. In Proceedings of ACM SIGKDD (pp. 474–483). New York, NY, USA: ACM.