
Using unsupervised information to improve semi-supervised tweet sentiment classification

Nádia Felix Felipe da Silva a,∗, Luiz F.S. Coletta a, Eduardo R. Hruschka a, Estevam R. Hruschka Jr. b

a Department of Computer Sciences, University of Sao Paulo (USP), Av. Trabalhador Sao Carlense, 400, CEP 13.560-970, Sao Carlos, SP, Brazil
b Department of Computer Science, Federal University of Sao Carlos (UFSCar), Rodovia Washington Luis, km 235, SP-310, CEP 13565-905, Sao Carlos, SP, Brazil

Article history: Received 8 July 2015; Revised 28 January 2016; Accepted 2 February 2016; Available online xxx.

Keywords: Tweet sentiment analysis; Semi-supervised learning

http://dx.doi.org/10.1016/j.ins.2016.02.002

Abstract

Supervised algorithms require a set of representative labeled data for building classification models. However, labeled data are usually difficult and expensive to obtain, which motivates the interest in semi-supervised learning. This type of learning uses both labeled and unlabeled data in the training process and is particularly useful in applications such as tweet sentiment analysis, where a large amount of unlabeled data is available. Semi-supervised learning for tweet sentiment analysis, although quite appealing, is relatively new. We propose a semi-supervised learning framework that combines unsupervised information, captured from a similarity matrix constructed from unlabeled data, with a classifier. Our motivation is that such a similarity matrix is a powerful knowledge-discovery tool that can help classify unlabeled tweet sets. Our framework makes use of the well-known Self-training algorithm to induce a better tweet sentiment classifier. Experimental results on real-world datasets demonstrate that the proposed framework can improve the accuracy of tweet sentiment analysis. © 2016 Elsevier Inc. All rights reserved.

1. Introduction

Social networking platforms, such as blogs, forums, and microblogs, are rich sources of content for varied applications [1–7]. Twitter is one of the most popular microblogging services, enabling users to post status messages called "tweets" of no more than 140 characters. Tweets represent one of the largest and most dynamic datasets of user-generated content: approximately 288 million active users post 500 million tweets per day.¹ These short texts can express opinions on different topics, which can help to direct marketing campaigns via the sharing of consumers' opinions concerning brands and products [8]. Outside the realm of business applications, tweets can make it possible to identify outbreaks of bullying [9], events that generate insecurity [10], and acceptance or rejection of politicians [11], all in an electronic word-of-mouth manner. Given the huge amount of data typically available in the outlined scenarios, actionable insight can be derived from human-machine systems, in which both human expertise and data-driven approaches are intelligently combined. In order to do so, particularly considering our application scenario, four relevant issues must be addressed:

• Although a tweet can contain up to 140 characters, people tend to use much less than this limit; indeed, the average length of a tweet is 28 characters.² This characteristic makes analyses of tweets based on the so-called bag-of-words harder to perform because the data matrix is very sparse;
• The frequency of misspellings and slang in tweet messages is much higher than in other domains because users typically post messages from a wide variety of electronic devices, including cell phones and tablets. Furthermore, in this type of environment, users develop their own culture and specific vocabulary, which, while being limited in length (e.g., the number of characters), may convey rich meaning;
• Unlike blogs, news providers, and other sites, which are tailored to specific topics, Twitter users post messages on a variety of topics;
• Most tweet sentiment analysis techniques fall into two categories: lexicon-based and corpus-based approaches. As with all supervised tasks, these categories require labeled sentiment data to build a machine learning model and/or labeled sentiment data for evaluation. The more labeled sentiment data are available, the more robust a machine learning model will be, and the more accurate the evaluation scores will be.

∗ Corresponding author. Tel.: +55 16 3373 9700. E-mail addresses: [email protected] (N.F.F. da Silva), [email protected] (L.F.S. Coletta), [email protected] (E.R. Hruschka), [email protected] (E.R. Hruschka Jr.).
¹ https://about.twitter.com/company.
² http://thenextweb.com/twitter/2012/01/07/interesting-fact-most-tweets-posted-are-approximately-30-characters-long/.

Our focus is on the development of a semi-supervised tweet sentiment analysis approach, with particular attention to the role of the users (experts), who feed the system with annotated (labeled) data. Essentially, manual annotation is necessary, but it is tedious, expensive, and error-prone [12,13]. Go et al. [14] suggest obtaining labels from emoticons and hashtags, but note that these are not part of every tweet. Therefore, this and other related approaches [15–19] are of limited use in practice.

Semi-Supervised Learning (SSL) techniques utilize unlabeled data in their training process. Unlabeled data can improve classification in applications where labeled data are scarce [20–22]. Therefore, SSL-based methods appear to be a promising alternative for tweet sentiment analysis because an overwhelming number of unannotated tweets are accessible, in contrast to the limited number of annotated ones [23–26]. The acquisition of labeled tweets often requires a costly process that involves skilled experts, whereas the acquisition of unlabeled ones is relatively inexpensive. From this perspective, simple semi-supervised classification models, such as Self-training [22], can be of great practical value.

In our study, a particular version of the Consensus between Classification and Clustering Ensembles (C3E) algorithm [27,28] is integrated into an SSL framework to perform tweet sentiment classification. In a previous work, C3E was applied to refine the classification of tweets in a supervised context [29]. In brief, C3E is based on the assumption that similar instances are more likely to share the same class label. Our current work, as an extension of [29], shows that C3E is also helpful for tweet classification in semi-supervised settings. Simply put, we combine the classification power of Support Vector Machines (SVMs), constructed from labeled data, with the information provided by the pair-wise similarities between unlabeled data points. The proposed framework is based on an iterative Self-training approach guided by the predictions made by C3E. Because only a relatively small amount of annotated data is required, tweet labeling costs are reduced. In addition, our empirical results show that the use of unlabeled tweets leverages classification when few labeled tweets are available.

The remainder of the paper is organized as follows. Section 2 addresses related work on semi-supervised approaches for sentiment analysis. Section 3 briefly reviews the C3E algorithm. Section 4 describes our framework for semi-supervised tweet sentiment classification. Section 5 reports experimental results based on comparisons of our approach to the traditional Self-training and Co-training algorithms, as well as to a stand-alone SVM classifier and an unsupervised lexicon-based approach. Finally, Section 6 concludes the paper and discusses directions for future work.

2. Related works

Most studies regarding tweet sentiment analysis have utilized supervised learning algorithms to produce sentiment classification models. Such algorithms require a training set formed by labeled data, where the labels are the classes (e.g., positive, neutral, and negative) of each tweet. Some studies propose the use of emoticons and hashtags to build the training set. For instance, Go et al. [14] and Davidov et al. [19] identified tweet polarity by using emoticons as class labels. Other algorithms use the characteristics of the social network, e.g., follower/followee links, as in Hu et al. [30]. Approaches that integrate opinion mining lexicon-based techniques [31] and learning-based techniques have also been investigated. For example, Agarwal et al. [32], Read [33], Zhang et al. [34], and Saif et al. [35] used lexicons, part-of-speech tags, and writing style as linguistic resources. In a similar context, Saif et al. [36] introduced an approach to add semantics to the training set as an additional feature. More recently, classifier ensembles have been used with success [37–42].

The first work on sentiment classification that does not require labeled data was proposed by Turney [43], in which a document is classified as either positive or negative by taking into account the average semantic orientation of its phrases that contain adjectives or adverbs. His approach was assessed on automobile and movie reviews, which are data sources very different from the type of short texts found in tweets. In the same vein, Read and Carroll [44] proposed three different methods (based on lexical association, semantic spaces, and distributional similarity) to measure the similarity between words and polarity words. Because labeled data are not used by unsupervised learning approaches, they are expected to be less accurate than those based on supervised learning. From this perspective, prominent systems for sentiment analysis should address the scarcity of labeled tweets (taking advantage of unlabeled ones, as unsupervised models do) and provide better classification (as usually achieved by supervised models).

In contrast to supervised and unsupervised learning, Semi-Supervised Learning (SSL) takes into account both labeled and unlabeled data during the training phase. For tweet sentiment analysis, SSL algorithms are based on graphs [45–47], Self-training [24,25,48,49], Co-training [26,50–53], and topics [23]. In general, graph-based methods propagate labels to unlabeled data. The label propagation process requires the computation of similarities among the data instances. However, finding appropriate similarity measures, especially for tweet sentiment analysis, is a non-trivial task. Typically, cosine similarity based on a bag-of-words representation is employed, but it favors topic similarity rather than sentiment similarity. Indeed, a high similarity value often suggests that two documents share numerous content words rather than similar sentiments [54]. The use of graph-based methods has also been motivated by the available social information, which can help to capture sentiments of particular users [45,46].

Self-training has been applied in several contexts. For example, the algorithm AROW [55] makes use of Self-training for large-scale review polarity prediction. Haimovitch et al. [56] show that AROW can reduce test errors by more than half compared to the supervised classifier trained on the initial labeled data. Another sentiment classification approach based on Self-training was proposed by Zagibalov and Carroll [57], in which Self-training is used to add sentiment lexical items into the vocabulary for Chinese text. Liu et al. [58] also used a Self-training approach for sentiment analysis in a Chinese microblog. Similarly to Zagibalov and Carroll [57] and Qiu et al. [59], they utilized an iterative lexicon-based process to grow a sentiment dictionary. Their approach considers a massive Chinese sentiment dictionary instead of employing a one-word seed dictionary as in [57]. Documents predicted in the initial phase are used as the training data to build Support Vector Machines (SVMs), which are subsequently employed in the refinement of the primary results. Finally, approaches that employ Self-training for increasing the size of the feature space can be found in [24,25,60], in which the training process leads to the inclusion of additional polarity lexicons. The main underlying motivation of these approaches is to find the best possible adjustment of a static polarity lexicon to the unlabeled tweet set used for expansion.

Closely related to Self-training, Co-training [61] is also applied to sentiment analysis by exploring two disjoint feature sets. For cross-lingual document polarity classification [50], each view used by Co-training is a set of language-based features. For problems where the class distribution is imbalanced, Li et al. [51] proposed an under-sampling method to generate balanced datasets of different views. Ning [52] revisited Co-training, discussing several strategies for sentiment analysis in three domains: news articles, online reviews, and blog posts. Li et al. [53] adopted two views, personal and impersonal, and employed them in a Co-training framework. Yu and Kubler [62] investigated the use of SSL in opinion detection both in sparse data situations and for domain adaptation, comparing Co-training and Self-training approaches. Liu et al. [26] also designed a two-view (textual and non-textual) approach for tweet classification based on the Co-training framework. In their approach, two classifiers are trained on a common set of labeled tweets. Textual and non-textual features are also extracted and split into two views for an adaptive Co-training algorithm in [63], which updates topic-adaptive features and selects the more reliable tweets to boost the performance.

A more recent notion is explored by Xiang and Zhou [23], who implemented a semi-supervised tweet sentiment analysis method that makes use of topic-based modeling. The authors perform clustering analysis and several classification phases on the same dataset (a training set). As a result, a sentiment mixture model is obtained and then used to predict the class of unlabeled tweets, from which a subset is chosen to augment the training set. To achieve satisfactory results when using this approach, it is necessary to have a large labeled tweet set: the authors used 9684 labeled tweets and 2 million unlabeled ones. As the topic structure formed by clustering comes exclusively from the training set, an important gap in this approach is that no topic analysis takes place on the unlabeled tweets, where this supplementary information could be useful.

We propose an SSL framework that uses unsupervised information to improve SVM-based classification for tweet sentiment analysis. Such a combination is promoted by a particular instantiation of the C3E-SL algorithm [27,28], revisited in the next section. We shall anticipate that we make use of unsupervised information encoded as a similarity matrix built from the distances between unlabeled tweets in a space formed by lexicon scores. Such a matrix allows for exploring the "intrinsic structures" present in the data and helps to augment the initial labeled set with more confident predictions during the Self-training process.

3. Brief review of C3E

The C3E algorithm [27,28] is a framework that combines classification and clustering algorithms to obtain a better classifier. Its core component is an optimization algorithm that assumes that an ensemble of classifiers (consisting of one or more classifiers) has been previously induced from a training set. This ensemble estimates initial class probability distributions for every instance x_i of a target/test set X = {x_i}_{i=1}^n. These probability distributions are stored as c-dimensional vectors {π_i}_{i=1}^n, where c is the number of classes. Assuming that a cluster ensemble can provide supplementary constraints for this previous classification, the probability distributions in {π_i}_{i=1}^n are refined taking into account a similarity matrix S = {s_ij}_{i,j=1}^n, with the rationale that similar instances are more likely to share the same class label. Matrix S captures similarities between the instances of X. In particular, each entry corresponds to the relative co-occurrence of two instances in the same cluster [64,65], considering all of the data partitions built from X.

Our work is inspired by C3E, but there is a remarkable difference between the original approach and the one used here, namely that we do not make use of ensembles. Instead, for the sake of simplicity, we use a single classifier, and the similarity matrix is induced directly from the raw data. By doing so, we hope to provide a proof of concept that our approach works, ruling out the additional complexities that arise when designing the (experimental) space of ensembles.

The classification refinement is obtained from a posterior class probability distribution for every instance in X, defined as a set of vectors {y_i}_{i=1}^n, that is provided by C3E. In [27], C3E exploits the general properties of a large class of loss functions described by Bregman divergences [66]. If we opt to use a Squared Loss (SL) function as the Bregman divergence, the optimization problem can be simplified [28,67]. More precisely, we have to minimize J in Eq. (1) with respect to {y_i}_{i=1}^n:

J = \frac{1}{2} \sum_{i \in X} \| y_i - \pi_i \|^2 + \frac{\alpha}{2} \sum_{(i,j) \in X} s_{ij} \| y_i - y_j \|^2    (1)

The coefficient α ∈ ℝ⁺ controls the relative importance of the supervised and unsupervised components. By keeping {y_j}_{j=1}^n \ {y_i} fixed, we can minimize J in Eq. (1) with respect to each y_i by setting:

\frac{\partial J}{\partial y_i} = 0.    (2)

Considering that the similarity matrix S is symmetric and observing that \frac{\partial \|x\|^2}{\partial x} = 2x, we obtain:



y_i = \frac{\pi_i + \alpha' \sum_{j \neq i} s_{ij} y_j}{1 + \alpha' \sum_{j \neq i} s_{ij}},    (3)

where α' = 2α has been set for mathematical convenience. Eq. (3) can be computed iteratively, for all i ∈ {1, 2, ..., n}, until a maximum number of iterations (I) is reached, in order to obtain posterior class probability distributions for the instances in X.³ This simplified version of the C3E algorithm, named C3E-SL, is adapted here to perform semi-supervised tweet sentiment classification. In order to optimize the two user-defined parameters of C3E-SL, namely the coefficient α and the number of iterations I, we use a differential evolution algorithm (addressed in Section 4).

3.1. Generating a discriminative classification component

C3E-SL receives class probability distributions that will be refined by its optimization process. The literature indicates that Naive Bayes, SVM with a linear kernel, and Logistic Regression are the most commonly used classifiers for tweet sentiment analysis [68]. Therefore, we have chosen one of them, the SVM classifier, as the C3E-SL supervised component that provides one of the two necessary inputs of the algorithm. Notice that SVM exhibits good out-of-sample generalization, usually achieving better classification accuracies than Naive Bayes and Logistic Regression in tweet classification applications [68]. As in [35], we have used a linear kernel with the parameter C = 0.005.

3.2. Inferring a similarity matrix

In addition to class probability distributions for the instances in the target set, C3E-SL also requires a similarity matrix that captures pair-wise similarities between those instances. In [27,28,67], this matrix comes from clustering ensembles. Here, we compute the matrix, which allows for refining the classification model, in a different way, namely directly from the raw data. More precisely, it is computed according to the following steps (a code sketch is given after the steps):

(i) Computing pair-wise distances between instances of the target set: a distance matrix D is a two-dimensional array containing the Euclidean distances between every pair of tweets.
(ii) Normalizing the distance matrix: the distances are normalized to contain only values between 0 and 1. Eq. (4) is used to compute the normalized distance d'_ij between tweets i and j, where d_ij is the raw distance and d_min and d_max are the minimum and maximum observed distances, respectively:

d'_{ij} = \frac{d_{ij} - d_{\min}}{d_{\max} - d_{\min}}    (4)

(iii) Computing the similarity matrix: from the normalized distances d'_ij, the elements s_ij of the similarity matrix S are straightforwardly computed as:

s_{ij} = 1 - d'_{ij}    (5)

³ The set of vectors {y_i} is initialized by setting y_{iℓ} = 1/c for all i ∈ {1, 2, ..., n} and all ℓ ∈ {1, 2, ..., c}.
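The three steps translate directly into a short NumPy/SciPy sketch (function name ours), applied to the array of lexicon scores from which the matrix is induced (see Fig. 6):

```python
import numpy as np
from scipy.spatial.distance import cdist

def similarity_matrix(features):
    """Steps (i)-(iii): Euclidean distances, min-max normalization (Eq. (4)),
    and conversion to similarities (Eq. (5))."""
    D = cdist(features, features)            # (i) pair-wise Euclidean distances
    D = (D - D.min()) / (D.max() - D.min())  # (ii) Eq. (4)
    return 1.0 - D                           # (iii) Eq. (5)
```

Because of Eq. (5), tweets with identical lexicon scores receive similarity 1, which is exactly the behavior exploited in the pedagogical example of Section 4.1.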

Fig. 1. First step: estimating the C3E-SL parameters by means of D2E [67].

Fig. 2. Second step: improving tweet classification by using the C3E-SL algorithm [67]. Class probability distributions computed by SVM are refined with the information of a similarity matrix constructed from the unlabeled tweets.

3.3. Estimating the C3E-SL parameters

There are well-studied alternatives for optimizing the C3E-SL user-defined parameters, α and I. In this work, they are estimated directly from the data by means of the Dynamic Differential Evolution (D2E) algorithm introduced in [67]. We refer the reader interested in the details of this algorithm to that paper and the references therein. The specific manner in which the parameters are optimized for our application is described next.

4. Semi-supervised learning framework

Our main contribution is to adapt the C3E-SL algorithm for semi-supervised tweet sentiment analysis. Figs. 1–3 show an overview of our approach. Fig. 1 illustrates how the C3E-SL parameters are estimated: first, we split the labeled tweet set into training and validation sets. The classifier is built with the training set. Then, we use the D2E algorithm [67] to estimate the values of α and I on the validation set, for which a similarity matrix has been induced. The resulting values of the parameters (α∗ and I∗) are then used for the particular dataset at hand (see Fig. 2). As a result of this second step, we obtain a set of vectors, {y_j}_{j=1}^n, the posterior class probability distributions, which are used in the selection of tweets to be promoted to the labeled set. The promoted tweets are those whose class labels are most likely to be correct, following the traditional Self-training approach. We implement this idea by looking at each class separately, i.e., the number of selected tweets reflects the proportions of each class in the training set. In other words, only the most confident predictions are promoted to the set of labeled tweets, proportionally to the positive, negative, and neutral classes. Fig. 3 summarizes our semi-supervised learning framework. Note that this procedure can be repeated several times (or until all unlabeled tweets have been labeled).
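The sketch below renders one round of the framework of Fig. 3 in Python, reusing the c3e_sl_refine and similarity_matrix sketches from Section 3 (all names are ours; this is an illustration of the procedure, not the authors' implementation):

```python
import numpy as np
from sklearn.svm import SVC

def ssl_round(X_lab, y_lab, X_unl, lex_unl, alpha, iters, per_round=100):
    """One iteration of the semi-supervised framework (illustrative sketch).

    lex_unl holds the lexicon scores of the unlabeled tweets, from which
    the similarity matrix is induced (Section 3.2)."""
    svm = SVC(kernel="linear", C=0.005, probability=True).fit(X_lab, y_lab)
    pi = svm.predict_proba(X_unl)           # supervised component
    S = similarity_matrix(lex_unl)          # unsupervised component
    y = c3e_sl_refine(pi, S, alpha, iters)  # refined distributions {y_i}

    labels, conf = y.argmax(axis=1), y.max(axis=1)
    promoted = []
    # Promote the most confident predictions per class, proportionally to the
    # class frequencies observed in the current training set.
    for cls, count in enumerate(np.bincount(y_lab)):
        k = int(round(per_round * count / len(y_lab)))
        idx = np.where(labels == cls)[0]
        promoted.extend(idx[np.argsort(conf[idx])[::-1][:k]])
    return np.asarray(promoted), labels
```

The promoted indices, together with their predicted labels, are appended to the labeled set, and the round is repeated until the pool of unlabeled tweets is exhausted.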

4.1. Pedagogical example

Table 1 presents four tweets from the Twitter2014 test set [69] that will be used to illustrate our approach. Fig. 4 (bottom right-hand side) depicts these tweets, representing them by their numbers of positive and negative opinion words.


Fig. 3. Summary of the first and second steps in our semi-supervised framework for tweet sentiment analysis. The iterative process described can be repeated several times.

Table 1. Tweets used in the example.

t1: @LUFC_SOCCER Gators soccer wins 12th SEC title: The three points earned put Florida on top with 33 …
t2: I hope my better half is feeling better today! I will be so sad if we can't climb Mount Snowdon tomorrow! Was so looking forward to it!
t3: Getting ready for Knollwood P.S. bazaar, tonight and tomorrow!
t4: @SeanJohnGerard @gmurph25 @tompcotter It's decent but not +3 standard yet, teeing it up at MJ on Monday. How'd you play at Lough Erne?

Fig. 4. Plots of tweet sentiment classification using the traditional Self-training (with SVM), our approach (with C3E-SL), a stand-alone SVM, and the true classes.

These features are used to build the similarity matrix. For tweet t1 there are two positive opinion words (wins and earned) and no negative opinion words; a lexicon provides this information. For tweet t2 there are two positive opinion words (better and better) and two negative opinion words (sad and can't). For tweet t3 there is only one positive opinion word (ready) and no negative opinion words. For tweet t4 there is one positive opinion word (decent) and one negative opinion word (not). As a training set, 5% of the SemEval 2013 [68] labeled tweets was used. We run our approach

Table 2. Part of the similarity matrix for Twitter2014 [69].

        t1    t2    t3    t4    ...   t20   ...   t50   ...   t125  ...   t1853
t1      1.0   0.5   0.5   0.0   ...   1.0   ...   0.3   ...   0.0   ...   1.0
t2      0.5   1.0   0.2   0.3   ...   0.5   ...   1.0   ...   0.2   ...   0.0
t3      0.5   0.2   1.0   0.5   ...   0.5   ...   0.5   ...   0.8   ...   0.3
t4      0.0   0.3   0.5   1.0   ...   0.2   ...   0.3   ...   0.7   ...   0.1
...     ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
t20     1.0   0.5   0.5   0.2   ...   1.0   ...   0.2   ...   0.3   ...   1.0
...     ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
t50     0.3   1.0   0.5   0.6   ...   0.2   ...   1.0   ...   0.3   ...   0.2
...     ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
t125    0.0   0.2   0.8   0.7   ...   0.3   ...   0.3   ...   1.0   ...   0.2
...     ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
t1853   1.0   0.0   0.4   0.2   ...   ...   ...   0.2   ...   0.2   ...   1.0

Table 3. Class distributions of the training and test sets.

Name                        Positive      Negative      Neutral        Total
Training set
SemEval 2013 [68]           4,215 (37%)   1,807 (15%)   5,325 (48%)    11,338
Test sets
LiveJournal [69]            427 (37%)     304 (27%)     411 (36%)      1,142
SMS2013 [68]                492 (23%)     394 (19%)     1,207 (58%)    2,093
Twitter2013 [68]            1,572 (41%)   601 (16%)     1,640 (43%)    3,813
Twitter2014 [69]            982 (53%)     202 (11%)     669 (36%)      1,853
Twitter Sarcasm 2014 [69]   33 (38%)      40 (47%)      13 (15%)       86

and a traditional Self-training with SVM for a single iteration, and then plot and compare the results for illustrative purposes. In Fig. 4, one can see that C3E-SL found the true class of all four tweets, while the stand-alone SVM classifier did not. Clearly, C3E-SL is taking advantage of the unsupervised information inferred from the similarity matrix. Indeed, C3E-SL establishes a balance between the supervised and the unsupervised information, and the latter refines the classification done by the SVM, thereby generating improvements. From the similarity matrix (see Table 2), tweet t1 is similar to tweets that have the same numbers of positive and negative lexicon words (tweets t20 and t1853), which are both from the positive class and contributed to correctly classifying t1.
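As a small check of these numbers, the lexicon feature vectors of Table 1 can be fed to the similarity_matrix sketch of Section 3.2 (the counts below are the ones listed above; the resulting values differ from Table 2, which is computed over the full target set):

```python
import numpy as np

feats = np.array([[2, 0],   # t1: "wins", "earned" vs. nothing
                  [2, 2],   # t2: "better", "better" vs. "sad", "can't"
                  [1, 0],   # t3: "ready" vs. nothing
                  [1, 1]])  # t4: "decent" vs. "not"
S = similarity_matrix(feats)
# Tweets with close (positive, negative) counts, such as t1 and t3, receive
# high similarity, while t1 and t2 are pushed apart by the negative counts.
```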

5. Experimental results

We assess our proposed framework for semi-supervised tweet classification on representative datasets. Comparisons with related approaches are also provided. In particular, we use traditional Self-training and Co-training approaches, a stand-alone SVM, and an unsupervised lexicon-based method. Additionally, the best classification results reported in the literature are presented. Details on the datasets used and the preprocessing steps performed are described next.

5.1. Datasets

Experiments were carried out on six representative datasets, which are summarized in Table 3. We make use of datasets released by the organizers of the International Workshop on Semantic Evaluation (SemEval)⁴, a leading scientific event in this field. As suggested by the organizers of the SemEval 2013 (task 2) and SemEval 2014 (task 9) competitions, the dataset known as SemEval 2013 was used to induce classification models. This dataset is currently the most used for tweet sentiment analysis as, besides being representative, it is publicly available and of a considerable size [68,69]. The induced models were then assessed on five test sets, namely: LiveJournal, SMS2013, Twitter2013, Twitter2014, and Twitter Sarcasm 2014. The datasets LiveJournal and SMS2013 were included in order to determine how systems trained on Twitter perform on other sources (particularly, weblogs and cell phone messages). They were labeled by Amazon Mechanical Turk⁵ annotators. Twitter2013 was obtained in a process composed of three phases. First, named entities were extracted from millions of tweets, collected over a one-year period spanning January 2012 to January 2013 using the public streaming Twitter API. Then, popular topics, defined as the named entities frequently mentioned in

⁴ http://alt.qcri.org/semeval2014/task9/.
⁵ https://www.mturk.com.

Fig. 5. Features used by the SVM classifier.

association with a specific date, were identified. Finally, given this set of automatically identified topics, tweets from the same time period related to the named entities were gathered. Twitter2013 covers different topics from those of the training set and was collected in later periods. Twitter2014 and Twitter Sarcasm 2014 were obtained more recently. The latter was collected via the #sarcasm hashtag with the goal of determining how sarcasm affects tweet polarity. The studied approaches were trained with labeled tweets from SemEval 2013 [68]. The reported results were obtained on five different test sets (see Table 3).

5.2. Feature engineering

Different studies have used different features to represent tweet messages. In fact, the chosen feature set is expected to properly fit the adopted classification model. For example, related approaches have employed Ngrams and emoticons [26,70], or only Ngrams [25]. Others adopted a more complex feature space also containing part-of-speech tags, lexicons, and hashtags [23,24,60,71]. The feature set used in our experiments was inspired by [35], whose authors ranked first in SemEval 2013 [68]. That feature set also achieved the highest scores on LiveJournal, Twitter Sarcasm 2014, and SMS2013 in SemEval 2014 [69]. It is composed of:

(i) Ngrams: unigrams, bigrams, and trigrams;
(ii) Negation: the number of negated contexts [72,73]. A negated context, according to [74], is a segment of a tweet that starts with a negation word (e.g., "no", "shouldn't") and ends with a punctuation mark, such as a comma, period, colon, semicolon, exclamation mark, or question mark. A negated context affects the ngram and lexicon features: the suffix "_NEG" is appended to each word following the negation word (e.g., "good" becomes "good_NEG"). A list of negation words was adopted from Christopher Potts' sentiment tutorial⁶ (a code sketch of this marking step is given below);
(iii) Part of speech: part-of-speech tagging was carried out using Ark-twitter NLP [75], and the number of occurrences of each part-of-speech tag was computed;
(iv) Writing style: we considered the presence of three or more repeated characters in words, sequences of three or more punctuation marks, and the number of words with all letters in uppercase;
(v) Lexicons: the number of positive and negative words computed by the lexicon-based method described in [35,76];
(vi) Microblogging features: the total number of sentiment hashtags provided by sentiment lexicons, and emoticons [35,77].

240

All features were used to build the sentiment classifier model, as illustrated in Fig. 5. For building the similarity matrix, only lexicon were considered, as represented in Fig. 6.
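The negated-context marking of feature (ii) can be sketched as follows (our illustration; the negation set shown is only a small fragment of Potts' full list):

```python
import re

NEGATION_FRAGMENT = {"no", "not", "never", "cannot", "shouldn't", "don't"}

def mark_negated_contexts(tokens):
    """Append _NEG to every token between a negation word and the next
    punctuation mark, following the definition in [74]."""
    out, negated = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,:;!?]+", tok):
            negated = False   # punctuation closes the negated context
            out.append(tok)
        elif tok.lower() in NEGATION_FRAGMENT or tok.lower().endswith("n't"):
            negated = True    # open a negated context
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negated else tok)
    return out

# mark_negated_contexts("I do not like his shoes , but his hat is good".split())
# -> [..., "like_NEG", "his_NEG", "shoes_NEG", ",", "but", ..., "good"]
```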

5.3. Experimental setup

To evaluate the algorithms, we simulate an inductive semi-supervised learning configuration, considering different amounts of initially labeled tweets. We randomly sampled a proportion, p, of labeled tweets from the SemEval 2013 training set [68], maintaining the balance of the three classes. The remaining (1 − p) tweets were used in the learning phase. In each iteration, a certain number of instances were incorporated and used for adapting the classification model.

239

243 244 245

6

http://sentiment.christopherpotts.net/lingstruc.html.

Fig. 6. Features for computing the similarity matrix.

Table 4. F rate (%) on LiveJournal [69].

         Supervised SVM          Self-training           Co-training             Our approach
Trial    FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F
#1       61.10  52.31  56.71     68.40  53.20  60.80     69.24  43.92  56.58     70.07  54.81  62.44
#2       67.19  50.31  58.75     67.71  50.75  59.23     69.44  53.98  61.71     68.59  54.90  61.75
#3       50.30  46.05  48.18     62.97  50.66  56.82     69.46  54.33  61.90     63.47  46.88  55.18
#4       56.66  40.47  48.56     67.42  46.54  56.98     69.24  43.55  56.40     68.05  47.59  57.82
#5       66.85  60.91  63.88     67.51  57.89  62.70     69.53  53.98  61.75     68.38  62.36  65.37
#6       62.19  41.47  51.83     66.12  43.27  54.69     68.72  51.31  60.01     67.24  46.02  56.63
#7       67.56  50.42  58.99     66.46  54.41  60.43     69.62  53.98  61.80     68.23  55.60  61.92
#8       68.38  35.53  51.96     66.03  47.64  56.83     68.53  44.08  56.31     67.72  50.53  59.13
#9       62.64  38.41  50.53     63.76  37.96  50.86     68.84  42.99  55.91     66.88  44.94  55.91
#10      63.56  37.41  50.48     63.76  51.21  57.48     68.78  44.80  56.79     66.32  44.34  55.33
Mean     62.64  45.33  53.99     66.01  49.35  57.68     69.14  48.69  58.92     67.50  50.80  59.15
Std.     5.65   8.08   5.26      1.90   5.75   3.37      0.39   5.17   2.75      1.74   5.90   3.54

The F-score (F) was adopted for evaluating the accuracy of the algorithms on the five test sets (LiveJournal, SMS2013, Twitter2013, Twitter2014, and Twitter Sarcasm 2014; see Table 3); we computed the F-score for the positive and negative classes, as well as the overall F-score F = (F_positive + F_negative)/2 [23–25,60]. The classifiers were run 10 times, and averages and standard deviations are reported. As addressed in Sections 3 and 4, the C3E-SL user-defined parameters, α and I, are estimated by means of the D2E algorithm using the available set of labeled tweets. D2E was set up as in [67], reproduced here for completeness: the algorithm is based on the trial vector generation strategy "DE/best/2/bin" and starts with population size Np = 20 and control parameters F = Cr = 0.25 [67]. Its optimization process was guided to minimize the C3E-SL misclassification rate on a validation set by evaluating different combinations of α ∈ {0, 0.001, 0.002, ..., 0.15} and I ∈ {1, 2, ..., 10}. The labeled tweet set was randomly divided into two distinct sets of the same size, one being the training set and the other the validation set. After that, C3E-SL with the optimal pair of values ⟨α∗, I∗⟩ was used to classify unlabeled tweets.
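For illustration, the snippet below computes the overall F-score and tunes ⟨α, I⟩ by exhaustive search over the same grid. The exhaustive loop is a simple stand-in for D2E [67], which explores the search space more efficiently; names and the scikit-learn dependency are ours, and c3e_sl_refine is the sketch from Section 3.

```python
import numpy as np
from sklearn.metrics import f1_score

def overall_f(y_true, y_pred, pos=2, neg=0):
    """F = (F_positive + F_negative) / 2, the evaluation score used here."""
    f_pos = f1_score(y_true, y_pred, labels=[pos], average="macro")
    f_neg = f1_score(y_true, y_pred, labels=[neg], average="macro")
    return (f_pos + f_neg) / 2.0

def tune_parameters(pi_val, S_val, y_val):
    """Stand-in for D2E: minimize the misclassification rate on the
    validation set (pi_val: class distributions, S_val: similarity matrix,
    y_val: true labels) over alpha in {0, ..., 0.15} and I in {1, ..., 10}."""
    best = (0.0, 1, np.inf)
    for alpha in np.arange(0.0, 0.1501, 0.001):
        for iters in range(1, 11):
            pred = c3e_sl_refine(pi_val, S_val, alpha, iters).argmax(axis=1)
            error = np.mean(pred != y_val)
            if error < best[2]:
                best = (alpha, iters, error)
    return best  # (alpha*, I*, validation error)
```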

5.4. Comparing our SSL framework with Self-training and Co-training

To evaluate the proposed SSL framework, we initially compared it against Self-training and Co-training, which are traditional semi-supervised approaches [78]. To do so, we randomly selected 5% of the data from SemEval 2013 to be the initial training set (with labeled tweets). From this initial training set, Self-training takes place until all instances of the rest of SemEval 2013 have been promoted to the augmented training set. This procedure was repeated 10 times (independent trials). For comparison purposes, a supervised stand-alone SVM was also run on the same initial labeled sets in each trial (the same initial labeled sets were provided to the first iteration of Self-training and Co-training). For the Self-training approach, we used all features, whereas for Co-training we considered two views: one provided by Ngrams and the other containing the remaining features described in Section 5.2. Tables 4–8 summarize the results.

From Tables 4–8, we can derive useful observations. First, all semi-supervised classification approaches offer fairly good results; specifically, unlabeled tweet data can help train a better classifier and improve its generalization ability. Second, the proposed SSL framework obtained the best overall results in comparison to Self-training; against Co-training, it performed best in 60% of the cases. We believe that the ability to exploit "intrinsic structures" in unlabeled data provides supplementary information that may compensate for the lack of prior knowledge. Interestingly, we found that sometimes (e.g., trials #1 and #7 from Table 8) the performance of the semi-supervised algorithms is worse than that of the supervised SVM trained only on labeled data. This may happen because, in those cases, the labeled data

Table 5. F rate (%) on SMS2013 [68].

         Supervised SVM          Self-training           Co-training             Our approach
Trial    FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F
#1       49.91  49.31  49.61     53.98  47.58  50.78     60.18  47.01  53.60     56.52  51.72  54.12
#2       54.92  47.09  51.01     60.00  40.68  50.34     56.18  49.28  52.73     59.94  49.16  54.55
#3       49.64  41.89  45.77     52.05  39.66  45.85     55.94  49.34  52.64     56.26  43.43  49.84
#4       51.98  39.56  45.77     53.93  42.87  48.40     60.56  47.34  53.95     57.96  46.11  52.04
#5       53.38  45.58  49.48     55.08  41.80  48.44     56.38  49.58  52.98     56.78  48.36  52.58
#6       48.67  39.59  44.13     49.31  38.56  43.94     59.48  49.67  54.57     54.05  42.38  48.21
#7       55.27  47.19  51.23     53.58  43.07  48.32     56.15  49.40  52.77     58.66  51.16  54.90
#8       54.53  37.04  45.78     53.47  41.33  47.40     60.00  46.75  53.37     58.26  47.18  52.72
#9       48.67  38.27  43.47     46.89  34.94  40.92     60.26  46.81  53.50     52.99  41.69  47.34
#10      47.92  39.12  43.52     49.01  44.66  46.83     60.28  46.73  53.50     49.61  46.64  48.12
Mean     51.49  42.46  46.98     52.73  41.52  47.12     58.54  48.19  53.37     56.10  46.78  51.44
Std.     2.86   4.42   3.06      3.69   3.45   2.96      2.06   1.34   0.60      3.09   3.47   2.84

Table 6. F rate (%) on Twitter2013 [68].

         Supervised SVM          Self-training           Co-training             Our approach
Trial    FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F
#1       60.50  45.66  53.08     64.98  45.83  55.41     66.64  38.28  52.46     63.99  48.44  56.21
#2       61.18  42.41  51.79     65.13  45.66  55.39     64.87  38.34  51.60     65.30  41.29  53.30
#3       57.55  42.62  50.08     62.21  41.91  52.06     64.38  38.44  51.41     67.95  41.30  54.63
#4       60.42  44.89  52.65     64.80  47.43  56.11     66.79  38.58  52.68     65.66  44.32  54.99
#5       61.43  44.97  53.20     64.74  42.73  53.73     64.89  38.49  51.69     65.32  39.02  52.17
#6       62.30  43.91  53.11     64.53  43.81  54.17     66.22  37.64  51.93     67.49  44.91  56.20
#7       60.80  43.13  51.97     63.73  44.70  54.21     64.71  38.50  51.61     68.62  45.64  57.13
#8       62.59  35.39  48.99     64.74  45.61  55.17     66.72  38.37  52.55     64.32  45.42  54.86
#9       60.00  41.91  50.96     62.16  41.27  51.72     66.66  37.81  52.24     68.62  42.27  55.44
#10      63.85  42.42  53.14     64.29  46.00  55.14     66.72  38.46  52.59     68.98  43.58  56.28
Mean     61.06  42.73  51.90     64.13  44.49  54.31     65.86  38.29  52.08     66.62  43.63  55.12
Std.     1.70   2.88   1.47      1.10   2.00   1.46      1.01   0.31   0.48      1.90   2.71   1.49

Table 7. F rate (%) on Twitter2014 [69].

         Supervised SVM          Self-training           Co-training             Our approach
Trial    FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F
#1       57.33  43.72  50.52     64.63  47.03  55.83     66.55  42.68  54.62     63.69  45.00  54.35
#2       60.84  42.13  51.49     64.92  44.33  54.62     64.01  33.49  48.75     64.76  44.80  54.79
#3       57.52  44.92  51.22     63.21  41.87  52.54     63.88  33.65  48.76     63.77  44.02  53.89
#4       58.73  44.32  51.53     62.43  46.31  54.37     66.55  42.33  54.44     64.40  42.66  53.54
#5       61.58  45.96  53.77     64.40  45.23  54.81     64.15  33.17  48.66     64.12  42.32  53.22
#6       60.64  40.87  50.76     64.12  43.26  53.69     66.63  35.16  50.89     65.30  45.01  55.15
#7       60.42  43.14  51.78     63.30  43.41  53.35     63.85  33.73  48.79     64.29  43.70  53.99
#8       61.59  40.12  50.85     64.25  45.15  54.70     66.63  42.07  54.35     65.39  47.96  56.68
#9       58.88  40.58  49.73     61.66  40.72  51.19     66.48  42.68  54.84     62.80  43.74  53.27
#10      62.73  45.58  54.16     63.03  45.23  54.13     66.52  42.81  54.66     64.23  47.15  55.69
Mean     60.02  43.14  51.58     63.60  44.25  53.92     65.52  38.18  51.85     64.28  44.64  54.46
Std.     1.82   2.12   1.39      1.05   1.96   1.31      1.33   4.60   2.89      0.77   1.79   1.12

can represent the entire data space fairly well, while the unlabeled data actually increase the probability of overfitting. As expected, unlabeled data are not always useful for improving the performance of a classifier.

The Co-training paradigm applies when PAC-learnable classifiers can be learned from two sets of independent features (each called a view). The intuition behind the Co-training algorithm [61] is that two views of the data can be used to train two classifiers that can help each other. More precisely, it is based on two assumptions: first, that each view of the data should itself be sufficient for learning the classification task, which cannot be taken for granted [50–53,62,63]; and second, that the views are independent of each other, or at least produce independent errors. As our experiments show, Co-training and C3E-SL are competitive, but C3E-SL does not rely on the assumption of having two such views.

Table 8. F rate (%) on Twitter Sarcasm 2014 [69].

         Supervised SVM          Self-training           Co-training             Our approach
Trial    FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F         FPos   FNeg   F
#1       50.00  25.53  37.77     50.63  13.64  32.13     55.00  13.95  34.47     50.75  21.74  36.24
#2       49.32  13.64  31.48     52.05  24.49  38.27     55.26  45.28  50.27     54.55  25.53  40.03
#3       46.15  17.39  31.77     50.63  13.64  32.13     56.75  44.44  50.60     50.00  21.28  35.64
#4       44.44  09.30  26.87     48.72  09.30  29.01     55.69  13.94  34.81     51.85  17.78  34.81
#5       55.70  33.96  44.83     52.63  24.49  38.56     56.00  44.44  50.22     56.47  34.62  45.54
#6       49.28  25.53  37.40     48.10  17.39  32.75     55.69  36.73  46.21     51.28  09.30  30.29
#7       58.54  13.64  36.09     55.42  25.53  40.48     55.26  45.28  50.27     56.41  21.74  39.07
#8       54.76  09.30  32.03     52.50  25.00  38.75     55.69  13.95  34.82     62.37  09.52  35.94
#9       49.35  17.78  33.56     51.85  21.28  36.56     56.41  13.95  35.18     53.16  17.78  35.47
#10      43.84  09.30  26.57     52.38  17.78  35.08     56.41  13.95  35.18     49.35  17.78  33.56
Mean     50.14  17.54  33.84     51.49  19.25  35.37     55.81  28.59  42.20     53.62  19.70  36.66
Std.     4.86   8.38   5.46      2.10   5.77   3.74      0.57   15.62  7.80      3.96   0.73   0.41

Fig. 7. F-score from labeling newly added tweets – LiveJournal.

Fig. 8. F-score from labeling newly added tweets – SMS.

5.5. F-score rate of labeling newly added data

Here, we investigate the details of the training behavior of the different methods. We consider the F rate of labeling newly added data in the training process as a performance indicator of the different methods on the various datasets. Again, we randomly selected 5% of the training set as labeled data, and the remaining tweets were taken as unlabeled data. We repeated this process 10 times. Figs. 7–11 show the performances of the studied algorithms (these plots correspond to Tables 4–8). For comparison purposes, an unsupervised lexicon-based approach was also run, considering only the lexicons proposed by Mohammad et al. [35] and Hu et al. [76]. The sentiment was computed by counting positive and negative opinion words (a sketch is given below): if the number of positive words is greater than the number of negative words, the tweet is considered positive; if the opposite happens, the tweet is considered negative; if there is no opinion word or the numbers of positive and negative words are equal, the tweet is considered neutral. Even though the unsupervised lexicon-based approach has the advantage of avoiding the laborious step of labeling training data, the semi-supervised methods showed superior results in most of the cases (Figs. 9–11).


Fig. 9. F-score from labeling newly added tweets – Twitter2013.

Fig. 10. F-score from labeling newly added tweets – Twitter2014.

Fig. 11. F-score from labeling newly added tweets – Twitter Sarcasm 2014.

Note also that, for the Twitter Sarcasm 2014 set, the lexicon was insufficient to classify the sentiments. The lexicon-based approach performed well on LiveJournal and SMS2013, which are datasets with more "formal" English content than the others.

5.6. The impact of the proportion of initially available labeled data

In this section, we discuss the impact of the ratio of labeled to unlabeled data on the performance of the different methods. The experiments compare the performance of our approach to the other methods for different quantities of initially labeled data. To do so, we randomly selected labeled tweets from the complete set of available ones, testing different proportions of the whole set of labeled tweets: from 5% to 40% of the labeled tweets from SemEval 2013 (a sampling sketch is given below). Figs. 12–16 show the results. Naturally, when the proportion of labeled data increases, the F-scores get better and the standard deviations decrease. However, the F-score rates obtained by our approach are in most cases higher than those of the other methods, while the standard deviation rates obtained by our approach are lower. Note also that the performance of our algorithm is relatively less dependent on the ratio of labeled data. On LiveJournal, SMS2013, and Twitter2014, all F-score rates show improvements. On Twitter2013, the F-score improvement was lower than on the other datasets but still competitive; we believe that the lexicon set used was insufficient to capture additional knowledge in Twitter2013.
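The class-balanced sampling of the initial labeled set can be sketched as follows (our illustration; y_semeval2013 in the usage comment is a hypothetical array with the training-set labels):

```python
import numpy as np

def stratified_indices(y, proportion, seed=0):
    """Pick `proportion` of the indices of each class, preserving the
    class balance of the full labeled set."""
    rng = np.random.default_rng(seed)
    picked = []
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])
        picked.extend(idx[: int(round(proportion * len(idx)))])
    return np.asarray(picked)

# initial = stratified_indices(y_semeval2013, 0.05)  # the 5% setting above
```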


Fig. 12. Results for different proportions of initially labeled tweets – LiveJournal.

Fig. 13. Results for different proportions of initially labeled tweets – SMS2013.

Fig. 14. Results for different proportions of initially labeled tweets – Twitter2013.

Fig. 15. Results for different proportions of initially labeled tweets – Twitter2014.


5.90 5.75 5.17 8.08

64.28 63.59 65.52 60.02 61.75

± ± ± ±

0.77 1.04 1.33 1.82

19.71 19.25 28.59 17.53 7.58

± ± ± ±

7.38 5.76 15.62 8.37

44.64 ± 1.80 44.25 ± 1.95 38.18 ± 4.60 43.13 ± 2.12 37.11

2.71 1.99 0.31 2.87

36.66 35.37 42.20 33.83 20.96

54.46 53.92 51.85 51.58 49.43

± ± ± ±

± ± ± ±

4.12 3.74 7.80 5.46

1.12 1.31 2.89 1.38

55.12 ± 1.49 54.31 ± 1.46 52.08 ± 0.48 51.89 ± 1.46 51.87

54.85 52.85 56.73 50.40 34.86

± ± ± ±

± ± ± ± 1.23 1.01 2.13 1.29

46.26 46.26 39.97 45.29 37.11

± ± ± ± 1.57 2.05 3.23 2.28

45.58 ± 1.60 46.90 ± 1.54 37.80 ± 1.63 46.13 ± 1.69 42.00

52.89 49.20 47.05 50.85 50.29

2.89 23.13 ± 4.65 1.55 23.99 ± 5.24 0.49 19.24 ± 12.46 3.26 20.28 ± 6.46 7.58

65.88 ± 0.91 66.01 ± 0.79 66.16 ± 0.69 65.17 ± 1.25 61.75

66.10 ± 0.65 66.38 ± 0.39 66.45 ± 0.49 64.80 ± 0.79 61.75

± 1.10 ± 2.22 ± 0.99 ± 2.12

58.27 ± 2.86 57.77 ± 2.77 46.14 ± 4.17 55.35 ± 5.58 55.59

± ± ± ± 1.82 1.61 2.32 3.39

38.98 38.42 37.99 35.34 20.96

± ± ± ±

62.25 61.24 49.38 59.69 55.59

± ± ± ± 4.30 4.08 4.49 5.64

66.06 ± 2.23 65.19 ± 2.10 58.59 ± 2.66 64.01 ± 2.90 61.09

30%

70.21 69.54 67.08 69.01 66.58

F-pos

± ± ± ±

62.06 60.50 57.38 60.35 54.43

± ± ± ± 1.10 0.81 1.03 1.56

55.51 ± 1.00 52.16 ± 0.99 46.99 ± 2.28 53.77 ± 1.06 50.29

58.79 ± 0.70 63.34 56.33 ± 0.81 60.60 52.18 ± 0.81 57.69 57.06 ± 1.07 61.56 52.36 54.43

± ± ± ±

Supervised approach proposed by Zhu et al. [79]

69.87 ± 0.67 69.13 ± 0.69 67.79 ± 0.91 68.33 ± 1.57 66.58

57.35 57.23 51.94 57.34 51.87

± ± ± ±

± ± ± ± 0.79 48.45 0.73 47.33 0.91 37.82 0.79 47.61 37.11

± ± ± ± 2.12 1.78 3.62 1.97

57.59 ± 1.08 57.02 ± 1.08 51.72 ± 2.20 57.12 ± 0.98 49.43

67.80 67.31 66.23 67.57 61.75

38.99 38.55 42.45 36.86 20.96

± ± ± ±

2.42 2.29 7.67 2.35

55.75 ± 2.43 54.21 ± 1.41 57.18 ± 1.24 54.66 ± 2.23 34.86

Supervised approach proposed by Zhu et al. [79]

23.46 ± 4.97 23.62 ± 4.33 27.54 ± 15.34 21.12 ± 4.23 7.58

Table 9
F-scores for different proportions of the initial labeled data.
[The table reports F-pos, F-neg, and overall F (mean ± standard deviation) obtained with 5%, 10%, 20%, 40%, and 100% of initially labeled data on the LiveJournal, SMS2013, Twitter2013, Twitter2014, and Twitter Sarcasm 2014 datasets, comparing our approach, Self-training, Co-Training, SVM, the lexicon-based method, and the supervised approaches proposed by Mohammad et al. [35] and Miura et al. [80]; better results are in boldface.]


Fig. 16. Results for different proportions of initially labeled tweets – Twitter Sarcasm 2014.

Table 10
Percentages of highest values for F-scores (Fs) and lowest values for standard deviations (SD) for the proposed SSL framework, Self-training, Co-training, SVM, and the Lexicon approach, according to the results from Table 9 (better results are highlighted in boldface).

Dataset                 Our approach      Self-training     Co-training       SVM               Lexicon^a
                        Fs       SD       Fs       SD       Fs       SD       Fs       SD       Fs
LiveJournal             80.00    40.00     0.00    26.67     6.66    33.33     0.00     0.00    13.33
SMS2013                 80.00    40.00     0.00    26.66    13.33    33.33     0.00     0.00     6.66
Twitter2013             46.66     6.66    26.66    40.00     6.66    46.00    20.00     6.66     0.00
Twitter2014             80.00    60.00     6.66    20.00    13.33    13.33     0.00     6.66     0.00
Twitter Sarcasm 2014    80.00    33.33     6.66    33.33    13.33    26.66     0.00     6.66     0.00

^a Standard deviations for the lexicon-based approach are equal to zero.


initial labeled data for Self-training, Co-training, supervised approaches from the literature, a lexicon-based method, and our approach (better results are in boldface). Table 10 summarizes the results from Table 9 by providing, for each approach, the percentage of cases in which it achieves the highest F-score and the lowest standard deviation; since each dataset yields fifteen comparisons (five labeling proportions times three F-measures), these percentages come in multiples of one fifteenth (approximately 6.7%). Note that our approach typically provides better F-scores and is also more stable in most cases. In [14] the training set is built by taking emoticons and hashtags as class labels, and tweets expressing both positive and negative sentiments are discarded in a preliminary pre-processing stage. In our experiments, messages such as “I like his hat but his shoes are horrible” were not removed. The induced classifier therefore has to deal with this scenario through features composed of unigrams, bigrams, trigrams, and the number of negated contexts, along with features derived from part of speech, writing style, lexicons, and microblogging content. Bigrams and trigrams capture patterns from negation contexts, and negated-context features are also used to flip the direction of a sentiment. In addition, features from the writing style, part of speech, and microblogging content can capture patterns that are good indicators of the tweet sentiment. However, even for well-trained multivariate classifiers, it may sometimes be difficult to classify the tweet sentiment accurately. For instance, there are special cases of ambiguity, e.g., (i) when several equally strong sentiments are expressed in the same tweet, with no particular sentiment standing out, or (ii) when a personal opinion is expressed without the appropriate context. Illustrative examples include sentences such as “I kind of like heroes and don’t like it at the same time...” and “That’s exactly how I feel about avengers” (no context at all). Note that these cases are difficult even for humans.
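To make the negated-context features concrete, consider the following minimal sketch in the spirit of Pang et al. [74]: tokens between a negation word and the next clause-level punctuation mark receive a _NEG suffix, and the number of opened scopes yields the negated-context count. The negation word list and the punctuation-delimited scope rule are illustrative simplifications, not the exact rules used in our experiments.

    import re

    # Illustrative negation handling: tokens between a negation word and
    # the next clause-level punctuation mark receive a "_NEG" suffix,
    # following Pang et al. [74]; the count of opened scopes gives the
    # "number of negated contexts" feature. The word list is not exhaustive.
    NEGATION_WORDS = {"not", "no", "never", "cannot",
                      "don't", "doesn't", "isn't", "won't"}
    CLAUSE_PUNCT = re.compile(r"^[.,;:!?]+$")

    def mark_negated_contexts(tokens):
        marked, in_scope, n_contexts = [], False, 0
        for tok in tokens:
            if CLAUSE_PUNCT.match(tok):
                in_scope = False               # punctuation closes the scope
                marked.append(tok)
            elif tok.lower() in NEGATION_WORDS:
                if not in_scope:
                    n_contexts += 1            # one more negated context
                in_scope = True                # negation word opens a scope
                marked.append(tok)
            else:
                marked.append(tok + "_NEG" if in_scope else tok)
        return marked, n_contexts

    tokens = "I like his hat but I don't like his shoes !".split()
    print(mark_negated_contexts(tokens))
    # (['I', 'like', 'his', 'hat', 'but', 'I', "don't",
    #   'like_NEG', 'his_NEG', 'shoes_NEG', '!'], 1)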


6. Conclusion


We introduced a framework in which unsupervised knowledge derived from unlabeled tweets is integrated into Self-training as a helping strategy for tweet sentiment analysis. This work is motivated by the observation that unlabeled tweet sets carry information about the inner structure of the data space, and that this underlying information can be revealed by a similarity matrix and used to train a better classifier. In particular, we considered an algorithm named C3E-SL, which was integrated into the Self-training approach. We carried out a series of experiments to analyze our approach and compared it to Self-training, Co-training, a lexicon-based method, and a stand-alone SVM. The results show that our approach performed better than the others on most of the assessed datasets. It is therefore a promising addition to the pool of state-of-the-art tools used in practice. As an emerging research topic, the use of SSL for tweet sentiment analysis faces many challenges that motivate relevant future work, such as:


1. In our approach, the sentiment lexicon plays a key role. By expanding the polarity lexicon, we can bias it toward the domain and context of the tweet set, thereby improving the contribution that the similarity matrix makes to the sentiment classification task (a simplified sketch of how such similarity information can steer the learning loop follows this list).
2. Section 5 shows that the performance of semi-supervised tweet sentiment classification methods is not always better than that of supervised methods [81]. Exploring the conditions under which semi-supervised sentiment analysis is most appropriate is worth consideration.
3. When selecting a portion of the available data for the initial training phase, some features may not make sense for the purpose of building classifiers. It is therefore important to evaluate the impact of the chosen features in a semi-supervised setting. In our experiments, the selected features were inspired by [35], whose authors ranked first in SemEval 2013 [68]. However, this feature set was defined based on the entire training set; it is more realistic to select features on the fly as an intrinsic part of semi-supervised learning.
4. To increase the performance of SSL methods in applications where sarcasm and irony are present, it is worth studying specific features, such as those in [82–85]; the feature set used in our experiments did not consider this particular scenario. It is also interesting to explore feature selection [86], dimensionality reduction [87], structure-based features [88], and language resources [89].
5. All of the processes of the semi-supervised phase depend on the initial sampling of tweets. Another interesting research topic therefore involves studying sampling methods. For instance, Hajmohammadi et al. [90] and Smailović et al. [91] combine active and semi-supervised learning to incorporate unlabeled sentiment documents from the target language and improve the performance of their methods.
6. There may be messages with more than one sentence (for example, “I support Obama. Romney sucks.”). In our experiments, tweets were not tokenized by sentences, i.e., each tweet was treated as a single sentence. In this sense, a research opportunity is to evaluate how C3E-SL can be applied to carry out aspect-based tweet sentiment classification [76,92–94].
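For concreteness, the following minimal sketch illustrates the general mechanism underlying the framework: class-probability estimates produced by a classifier on unlabeled tweets are smoothed toward the estimates of their most similar tweets before Self-training promotes the most confident predictions to the training set. This is only an approximation of the idea under simple assumptions (TF-IDF cosine similarities, an SVM base classifier, and hypothetical values for the blending weight alpha, neighborhood size k, and confidence threshold); it does not reproduce the actual C3E-SL objective, whose trade-off parameters are optimized by differential evolution [67].

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import SVC

    def refine_with_similarity(probs, U, alpha=0.5, k=10):
        # Blend each unlabeled tweet's class probabilities with the average
        # probabilities of its k most similar unlabeled tweets. TF-IDF rows
        # are L2-normalized, so U @ U.T holds cosine similarities.
        S = (U @ U.T).toarray()
        np.fill_diagonal(S, 0.0)
        refined = np.empty_like(probs)
        for i in range(S.shape[0]):
            nbrs = np.argsort(S[i])[-k:]
            refined[i] = (1 - alpha) * probs[i] + alpha * probs[nbrs].mean(axis=0)
        return refined

    def similarity_refined_self_training(texts_lab, y, texts_unlab,
                                         max_iters=5, conf=0.9):
        vec = TfidfVectorizer()
        vec.fit(texts_lab + texts_unlab)
        X, U = vec.transform(texts_lab), vec.transform(texts_unlab)
        y = np.asarray(y)
        clf = SVC(probability=True).fit(X, y)
        for _ in range(max_iters):
            if U.shape[0] == 0:
                break
            probs = refine_with_similarity(clf.predict_proba(U), U)
            confident = np.where(probs.max(axis=1) >= conf)[0]
            if confident.size == 0:
                break
            # Move confidently (pseudo-)labeled tweets into the training set.
            X = vstack([X, U[confident]])
            y = np.concatenate([y, clf.classes_[probs[confident].argmax(axis=1)]])
            U = U[np.setdiff1d(np.arange(U.shape[0]), confident)]
            clf = SVC(probability=True).fit(X, y)
        return clf, vec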


Acknowledgments


The authors would like to acknowledge the Brazilian research agencies CAPES (Proc. DS-7253238/D), CNPq (Proc. 303348/2013-5), and FAPESP (Proc. 2013/07375-0 and 2010/20830-0) for their financial support.


References


[1] C. Wang, Z. Xiao, Y. Liu, Y. Xu, A. Zhou, K. Zhang, Sentiview: Sentiment analysis and visualization for internet popular topics, IEEE Trans. Hum. Mach. Syst. 43 (6) (2013) 620–630.
[2] J. Serrano-Guerrero, J.A. Olivas, F.P. Romero, E. Herrera-Viedma, Sentiment analysis: A review and comparative analysis of web services, Inf. Sci. 311 (2015) 18–38.
[3] J.M. Chenlo, D.E. Losada, An empirical study of sentence features for subjectivity and polarity classification, Inf. Sci. 280 (2014) 275–288.
[4] R. Xia, C. Zong, S. Li, Ensemble of feature sets and classification algorithms for sentiment classification, Inf. Sci. 181 (6) (2011) 1138–1152.
[5] A. Hogenboom, F. Frasincar, F. de Jong, U. Kaymak, Using rhetorical structure in sentiment analysis, Commun. ACM 58 (7) (2015) 69–77.
[6] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, A. Kappas, Sentiment strength detection in short informal text, J. Am. Soc. Inf. Sci. Technol. 61 (12) (2010) 2544–2558.
[7] A. Hogenboom, D. Bal, F. Frasincar, M. Bal, F. de Jong, U. Kaymak, Exploiting emoticons in polarity classification of text, J. Web Eng. 14 (1–2) (2015) 22–40.
[8] B.J. Jansen, M. Zhang, K. Sobel, A. Chowdury, Twitter power: Tweets as electronic word of mouth, J. Am. Soc. Inf. Sci. Technol. 60 (11) (2009) 2169–2188.
[9] J.-M. Xu, K.-S. Jun, X. Zhu, A. Bellmore, Learning from bullying traces in social media, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, HLT-NAACL, 2012, pp. 656–666.
[10] M. Cheong, V.C. Lee, A microblogging-based approach to terrorism informatics: Exploration and chronicling civilian sentiment and response to terrorism events via twitter, Inf. Syst. Front. 13 (1) (2011) 45–59.
[11] N.A. Diakopoulos, D.A. Shamma, Characterizing debate performance via aggregated twitter sentiment, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, ACM, New York, NY, USA, 2010, pp. 1195–1198.
[12] X. Zhu, Semi-Supervised Learning Literature Survey, Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
[13] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[14] A. Go, R. Bhayani, L. Huang, Twitter Sentiment Classification Using Distant Supervision, Stanford University, 2009, pp. 1–6. Unpublished manuscript.
[15] S. Mohammad, #Emotional tweets, in: Proceedings of *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 2012, pp. 246–255.
[16] W. Wang, L. Chen, K. Thirunarayan, A.P. Sheth, Harnessing twitter “big data” for automatic emotion identification, in: Proceedings of the International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2012 International Conference on Social Computing (SocialCom), IEEE, 2012, pp. 587–592.
[17] A. Qadir, E. Riloff, Bootstrapped learning of emotion hashtags #hashtags4you, in: Proceedings of the Fourth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Association for Computational Linguistics, Atlanta, Georgia, 2013, pp. 2–11.
[18] J. Suttles, N. Ide, Distant supervision for emotion classification with discrete binary values, in: A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, vol. 7817, Springer Berlin Heidelberg, 2013, pp. 121–136.
[19] D. Davidov, O. Tsur, A. Rappoport, Enhanced sentiment learning using twitter hashtags and smileys, in: Proceedings of the Twenty Third International Conference on Computational Linguistics: Posters, COLING ’10, Association for Computational Linguistics, Stroudsburg, PA, USA, 2010, pp. 241–249.
[20] M.-F. Balcan, A. Blum, A discriminative model for semi-supervised learning, J. ACM 57 (3) (2010) 19:1–19:46.
[21] A.B. Goldberg, New Directions in Semi-Supervised Learning, Ph.D. thesis, University of Wisconsin-Madison, 2010.
[22] X. Zhu, A.B. Goldberg, Introduction to Semi-Supervised Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2009.
[23] B. Xiang, L. Zhou, Improving twitter sentiment analysis with topic-based mixture modeling and semi-supervised training, in: Proceedings of the Fifty Second Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Baltimore, Maryland, 2014, pp. 434–439.


[24] L. Becker, G. Erhart, D. Skiba, V. Matula, Avaya: Sentiment analysis on twitter with self-training and polarity lexicon expansion, in: Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 333–340.
[25] W. Baugh, bwbaugh: Hierarchical sentiment analysis with partial self-training, in: Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 539–542.
[26] S. Liu, W. Zhu, N. Xu, F. Li, X.-q. Cheng, Y. Liu, Y. Wang, Co-training and visualizing sentiment evolvement for tweet events, in: Proceedings of the Twenty Second International Conference on World Wide Web Companion, WWW ’13 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2013, pp. 105–106.
[27] A. Acharya, E.R. Hruschka, J. Ghosh, S. Acharyya, C3E: A framework for combining ensembles of classifiers and clusterers, in: Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 6713, Springer, 2011, pp. 269–278.
[28] A. Acharya, E.R. Hruschka, J. Ghosh, S. Acharyya, An optimization framework for combining ensembles of classifiers and clusterers with applications to nontransductive semisupervised learning and transfer learning, ACM Trans. Knowl. Discov. Data 9 (1) (2014) 1:1–1:35.
[29] L. Coletta, N. da Silva, E. Hruschka, E. Hruschka, Combining classification and clustering for tweet sentiment analysis, in: Proceedings of the 2014 Brazilian Conference on Intelligent Systems (BRACIS), 2014, pp. 210–215.
[30] X. Hu, L. Tang, J. Tang, H. Liu, Exploiting social relations for sentiment analysis in microblogging, in: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 2013.
[31] M. Taboada, M. Tofiloski, J. Brooke, K. Voll, M. Stede, Lexicon-based methods for sentiment analysis, Comput. Linguist. 37 (2) (2011) 267–307.
[32] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau, Sentiment analysis of twitter data, in: Proceedings of the Workshop on Languages in Social Media, LSM ’11, Association for Computational Linguistics, Stroudsburg, PA, USA, 2011, pp. 30–38.
[33] J. Read, Using emoticons to reduce dependency in machine learning techniques for sentiment classification, in: Proceedings of the ACL Student Research Workshop, ACLstudent ’05, Association for Computational Linguistics, Stroudsburg, PA, USA, 2005, pp. 43–48.
[34] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, B. Liu, Combining Lexicon-Based and Learning-Based Methods for Twitter Sentiment Analysis, Technical Report, HP Laboratories, 2011.
[35] S. Mohammad, S. Kiritchenko, X. Zhu, Nrc-canada: Building the state-of-the-art in sentiment analysis of tweets, in: Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.
[36] H. Saif, Y. He, H. Alani, Semantic sentiment analysis of twitter, in: Proceedings of the Eleventh International Conference on The Semantic Web - Volume Part I, ISWC’12, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 508–524.
[37] N.F. da Silva, E.R. Hruschka, E.R. Hruschka Jr., Tweet sentiment analysis with classifier ensembles, Decis. Support Syst. 66 (2014) 170–179.
[38] N. Silva, E. Hruschka, E. Hruschka, Biocom usp: Tweet sentiment analysis with adaptive boosting ensemble, in: Proceedings of the Eighth International Workshop on Semantic Evaluation (SemEval 2014), Association for Computational Linguistics and Dublin City University, Dublin, Ireland, 2014, pp. 123–128.
[39] J. Lin, A. Kolcz, Large-scale machine learning at twitter, in: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, ACM, New York, NY, USA, 2012, pp. 793–804.
[40] S. Clark, R. Wicentwoski, Swatcs: Combining simple classifiers with estimated accuracy, in: Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 425–429.
[41] C. Rodríguez-Penagos, J. Atserias, J. Codina-Filba, D. García-Narbona, J. Grivolla, P. Lambert, R. Saurí, Fbm: Combining lexicon-based ml and heuristics for social media polarities, in: Proceedings of SemEval-2013 – International Workshop on Semantic Evaluation Co-located with *SEM and NAACL, Atlanta, Georgia, 2013.
[42] A. Hassan, A. Abbasi, D. Zeng, Twitter sentiment analysis: A bootstrap ensemble framework, in: Proceedings of the Social Computing, SocialCom, IEEE, 2013, pp. 357–364.
[43] P.D. Turney, Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews, in: Proceedings of the Fortieth Annual Meeting of the Association for Computational Linguistics, ACL ’02, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 417–424.
[44] J. Read, J. Carroll, Weakly supervised techniques for domain-independent sentiment classification, in: Proceedings of the First International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, TSA ’09, ACM, New York, NY, USA, 2009, pp. 45–52.
[45] C. Tan, L. Lee, J. Tang, L. Jiang, M. Zhou, P. Li, User-level sentiment analysis incorporating social networks, in: Proceedings of the Seventeenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, ACM, New York, NY, USA, 2011, pp. 1397–1405.
[46] F. Pozzi, D. Maccagnola, E. Fersini, E. Messina, Enhance user-level sentiment analysis on microblogs with approval relations, in: M. Baldoni, C. Baroglio, G. Boella, R. Micalizio (Eds.), AI*IA 2013: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 8249, Springer International Publishing, 2013, pp. 133–144.
[47] C. Johnson, P. Shukla, S. Shukla, On classifying the political sentiment of tweets, http://www.cs.utexas.edu/~cjohnson/TwitterSentimentAnalysis.pdf, 2011.
[48] S. Hong, J. Lee, J.-H. Lee, Competitive self-training technique for sentiment analysis in mass social media, in: Proceedings of the 2014 Joint Seventh International Conference on Soft Computing and Intelligent Systems (SCIS) and Fifteenth International Symposium on Advanced Intelligent Systems (ISIS), 2014, pp. 9–12.
[49] B. Drury, L. Torgo, J.J. Almeida, Guided self training for sentiment classification, in: Robust Unsupervised and Semi-Supervised Methods in Natural Language Processing, 2011, pp. 9–16.
[50] X. Wan, Co-training for cross-lingual sentiment classification, in: Proceedings of the Joint Conference of the Forty Seventh Annual Meeting of the ACL and the Fourth International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, ACL ’09, Association for Computational Linguistics, Stroudsburg, PA, USA, 2009, pp. 235–243.
[51] S. Li, Z. Wang, G. Zhou, S.Y.M. Lee, Semi-supervised learning for imbalanced sentiment classification, in: Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011, pp. 1826–1831.
[52] N. Yu, Exploring co-training strategies for opinion detection, J. Assoc. Inf. Sci. Technol. 65 (10) (2014) 2098–2110.
[53] S. Li, C.-R. Huang, G. Zhou, S.Y.M. Lee, Employing personal/impersonal views in supervised and semi-supervised sentiment classification, in: Proceedings of the Forty Eighth Annual Meeting of the Association for Computational Linguistics, ACL ’10, Association for Computational Linguistics, Stroudsburg, PA, USA, 2010, pp. 414–423.
[54] A.B. Goldberg, X. Zhu, Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization, in: Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, TextGraphs-1, Association for Computational Linguistics, Stroudsburg, PA, USA, 2006, pp. 45–52.
[55] K. Crammer, A. Kulesza, M. Dredze, Adaptive regularization of weight vectors, in: Proceedings of the Conference on Neural Information Processing Systems, NIPS, 2009, pp. 414–422.
[56] Y. Haimovitch, K. Crammer, S. Mannor, More is better: Large scale partially-supervised sentiment classification - appendix, CoRR abs/1209.6329 (2012).
[57] T. Zagibalov, J. Carroll, Unsupervised classification of sentiment and objectivity in Chinese text, in: Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP), 2008.
[58] Z. Liu, X. Dong, Y. Guan, J. Yang, Reserved self-training: A semi-supervised sentiment classification method for Chinese microblogs, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing, Asian Federation of Natural Language Processing, Nagoya, Japan, 2013, pp. 455–462.
[59] L. Qiu, W. Zhang, C. Hu, K. Zhao, Selc: A self-supervised model for sentiment classification, in: Proceedings of the Eighteenth ACM Conference on Information and Knowledge Management, CIKM ’09, ACM, New York, NY, USA, 2009, pp. 929–936.


[60] J. Zhao, M. Lan, T. Zhu, Ecnu: Expression- and message-level sentiment orientation classification in twitter using multiple effective features, in: Proceedings of the Eighth International Workshop on Semantic Evaluation (SemEval 2014), Association for Computational Linguistics and Dublin City University, Dublin, Ireland, 2014, pp. 259–264.
[61] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, ACM, New York, NY, USA, 1998, pp. 92–100.
[62] N. Yu, S. Kübler, Filling the gap: Semi-supervised learning for opinion detection across domains, in: Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL ’11, Association for Computational Linguistics, Stroudsburg, PA, USA, 2011, pp. 200–209.
[63] S. Liu, X. Cheng, F. Li, F. Li, Tasc: Topic-adaptive sentiment classification on dynamic tweets, IEEE Trans. Knowl. Data Eng. 27 (6) (2015) 1696–1709.
[64] J. Ghosh, A. Acharya, Cluster ensembles, Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 1 (4) (2011) 305–315.
[65] A. Strehl, J. Ghosh, Cluster ensembles - A knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res. 3 (2002) 583–617.
[66] A. Banerjee, S. Merugu, I.S. Dhillon, J. Ghosh, Clustering with Bregman divergences, J. Mach. Learn. Res. 6 (2005) 1705–1749.
[67] L.F.S. Coletta, E.R. Hruschka, A. Acharya, J. Ghosh, A differential evolution algorithm to optimise the combination of classifier and cluster ensembles, Int. J. Bio-Inspir. Comput. 7 (2) (2015) 111–124.
[68] P. Nakov, S. Rosenthal, Z. Kozareva, V. Stoyanov, A. Ritter, T. Wilson, Semeval-2013 task 2: Sentiment analysis in twitter, in: Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 312–320.
[69] S. Rosenthal, P. Nakov, A. Ritter, V. Stoyanov, Semeval-2014 task 9: Sentiment analysis in twitter, in: P. Nakov, T. Zesch (Eds.), Proceedings of the Eighth International Workshop on Semantic Evaluation, SemEval ’14, Dublin, Ireland, 2014.
[70] S. Liu, F. Li, F. Li, X. Cheng, H. Shen, Adaptive co-training svm for sentiment classification on tweets, in: Proceedings of the Twenty Second ACM International Conference on Information & Knowledge Management, CIKM ’13, ACM, New York, NY, USA, 2013, pp. 2079–2088.
[71] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, B. Liu, Combining Lexicon-Based and Learning-Based Methods for Twitter Sentiment Analysis, Technical Report HPL-2011-89, Hewlett-Packard Labs, 2011.
[72] A. Hogenboom, P. van Iterson, B. Heerschop, F. Frasincar, U. Kaymak, Determining negation scope and strength in sentiment analysis, in: Proceedings of the 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2011, pp. 2589–2594.
[73] L. Jia, C. Yu, W. Meng, The effect of negation on sentiment analysis and retrieval effectiveness, in: Proceedings of the Eighteenth ACM Conference on Information and Knowledge Management, CIKM ’09, ACM, New York, NY, USA, 2009, pp. 1827–1830.
[74] B. Pang, L. Lee, S. Vaithyanathan, Thumbs up? Sentiment classification using machine learning techniques, in: Proceedings of EMNLP, 2002, pp. 79–86.
[75] O. Owoputi, C. Dyer, K. Gimpel, N. Schneider, N.A. Smith, Improved part-of-speech tagging for online conversational text with word clusters, in: Proceedings of NAACL, 2013.
[76] M. Hu, B. Liu, Mining and summarizing customer reviews, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, ACM, New York, NY, USA, 2004, pp. 168–177.
[77] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, A. Kappas, Sentiment strength detection in short informal text, J. Am. Soc. Inf. Sci. Technol. 61 (12) (2010) 2544–2558.
[78] X. Zhu, Semi-Supervised Learning Literature Survey, Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
[79] X. Zhu, S. Kiritchenko, S. Mohammad, Nrc-canada-2014: Recent improvements in the sentiment analysis of tweets, in: Proceedings of the Eighth International Workshop on Semantic Evaluation (SemEval 2014), Association for Computational Linguistics and Dublin City University, Dublin, Ireland, 2014, pp. 443–447.
[80] Y. Miura, S. Sakaki, K. Hattori, T. Ohkuma, Teamx: A sentiment analyzer with enhanced lexicon mapping and weighting scheme for unbalanced data, in: Proceedings of the Eighth International Workshop on Semantic Evaluation (SemEval 2014), Association for Computational Linguistics and Dublin City University, Dublin, Ireland, 2014, pp. 628–632.
[81] Y.-F. Li, Z.-H. Zhou, Towards making unlabeled data never hurt, IEEE Trans. Pattern Anal. Mach. Intell. 37 (1) (2015) 175–188.
[82] P. Carvalho, L. Sarmento, M.J. Silva, E. de Oliveira, Clues for detecting irony in user-generated contents: Oh...!! it’s “so easy” ;-), in: Proceedings of the First International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, TSA ’09, ACM, New York, NY, USA, 2009, pp. 53–56.
[83] R. Gonzalez-Ibanez, S. Muresan, N. Wacholder, Identifying sarcasm in twitter: A closer look, in: Proceedings of the Forty Ninth Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11, Association for Computational Linguistics, Stroudsburg, PA, USA, 2011, pp. 581–586.
[84] A.A. Vanin, L.A. Freitas, R. Vieira, M. Bochernitsan, Some clues on irony detection in tweets, in: Proceedings of the Twenty Second International Conference on World Wide Web Companion, WWW ’13 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2013, pp. 635–636.
[85] L.A. de Freitas, A.A. Vanin, D.N. Hogetop, M.N. Bochernitsan, R. Vieira, Pathways for irony detection in tweets, in: Proceedings of the Twenty Ninth Annual ACM Symposium on Applied Computing, SAC ’14, ACM, New York, NY, USA, 2014, pp. 628–633.
[86] T.F. Covões, E.R. Hruschka, Towards improving cluster-based feature selection with a simplified silhouette filter, Inf. Sci. 181 (18) (2011) 3766–3782.
[87] M. Shafiei, S. Wang, R. Zhang, E. Milios, B. Tang, J. Tougas, R. Spiteri, Document representation and dimension reduction for text clustering, in: Proceedings of the 2007 IEEE Twenty Third International Conference on Data Engineering Workshop, 2007, pp. 770–779.
[88] Polarity classification using structure-based vector representations of text, Decis. Support Syst. 74 (2015) 46–56.
[89] M. Iruskieta, I. da Cunha, M. Taboada, A qualitative comparison method for rhetorical structures: Identifying different discourse structures in multilingual corpora, Lang. Resour. Eval. 49 (2) (2015) 263–309.
[90] M.S. Hajmohammadi, R. Ibrahim, A. Selamat, H. Fujita, Combination of active learning and self-training for cross-lingual sentiment classification with density analysis of unlabelled samples, Inf. Sci. 317 (2015) 67–77.
[91] J. Smailović, M. Grčar, N. Lavrač, M. Žnidaršič, Stream-based active learning for sentiment analysis in the financial domain, Inf. Sci. 285 (2014) 181–203 (Processing and Mining Complex Data Streams).
[92] L. Zhang, B. Liu, Identifying noun product features that imply opinions, in: Proceedings of the Forty Ninth Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11, Association for Computational Linguistics, Stroudsburg, PA, USA, 2011, pp. 575–580.
[93] H.H. Lek, D. Poo, Sentix: An aspect and domain sensitive sentiment lexicon, in: Proceedings of the 2012 IEEE Twenty Fourth International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, 2012, pp. 261–268, doi:10.1109/ICTAI.2012.43.
[94] H.H. Lek, D. Poo, Aspect-based twitter sentiment classification, in: Proceedings of the 2013 IEEE Twenty Fifth International Conference on Tools with Artificial Intelligence (ICTAI), 2013, pp. 366–373, doi:10.1109/ICTAI.2013.62.
