Discovering public sentiment in social media for predicting stock movement of publicly listed companies




Information Systems 69C (2017) 81–92

Contents lists available at ScienceDirect

Information Systems journal homepage: www.elsevier.com/locate/is

Bing Li a,∗, Keith C.C. Chan a, Carol Ou b, Sun Ruifeng a

a Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
b Department of Management, Tilburg School of Economics and Management, Tilburg University, Tilburg, Netherlands

Article info

Article history: Received 6 September 2014; Revised 26 June 2016; Accepted 13 October 2016; Available online 2 February 2017

Keywords: Social media analysis; Twitter; Stock prediction; Data mining; Sentiment analysis; Big data; SMeDA-SA; Parallel architecture

Abstract

The popularity of social media sites has prompted both academic and practical research on mining social media data to analyze public sentiment. Studies have suggested that public emotions expressed on Twitter are well correlated with the Dow Jones Industrial Average. However, it remains unclear how public sentiment, as reflected on social media, can be used to predict the stock price movement of a particular publicly listed company. In this study, we attempt to fill this gap by proposing a technique, called SMeDA-SA, that mines Twitter data for sentiment analysis and then predicts the stock movement of specific listed companies. For experimentation, we collected 200 million tweets that mentioned one or more of 30 companies listed on NASDAQ or the New York Stock Exchange. SMeDA-SA first extracts ambiguous textual messages from these tweets to create a list of words that reflects public sentiment. It then uses a data mining algorithm to expand the word list by adding emotional phrases so as to better classify the sentiments in the tweets. With SMeDA-SA, we find that the stock movement of many companies can be predicted with an average accuracy of over 70%. This paper describes how SMeDA-SA can be used to mine social media data for sentiments and presents the key implications of our study. © 2016 Published by Elsevier Ltd.

1. Introduction

Traditionally, public opinion in open societies has been studied through face-to-face, telephone or online surveys. Since social media sites such as Twitter became popular, collecting and analyzing public opinion has never been easier. Twitter users, for instance, post over 340 million messages, referred to as tweets, every day [1]. In most cases, these tweets express opinions on social, economic and political issues. Many people have come to regard social media sites like Twitter as repositories of answers to all kinds of opinion poll questions. As a result, researchers have started to analyze the massive amount of social media data for public opinion on issues ranging from product marketing to political preferences [2,3]. Ever since Milgram’s 1967 report of a “small world experiment” that identified “six degrees of separation” [4] between people in a social network, researchers have investigated social connections and the effect that they may have on public opinion and behavior. For example, recent attempts have been made to analyze social data to see whether movie revenues [5,6] or the trend of the Dow Jones Industrial Average (DJIA) [7,8] can be forecast. Based on the results obtained, it is believed that the patterns embedded in social media data may provide the information needed to better understand and predict social events.

The traditional way of gauging public opinion with tools such as questionnaire surveys has been effective but relatively time-consuming and expensive, especially when public opinion has to be monitored continually. The popularity of social media platforms, on which people exchange ideas and express opinions, provides a valuable source from which sentiment can be understood relatively easily, provided that the data can be analyzed effectively. The enormous volume and diversity of the data that can be collected from social media sites present an excellent opportunity to mine the data for nuggets of knowledge that help in understanding public opinions and sentiments, so that predictions about specific events can be made. Discovering opinions from various social communities in this way facilitates the building of models that reveal useful insights into the behavior of various stakeholders, predict future trends, and support the design of marketing and advertising campaigns [2,3].

∗ Corresponding author. E-mail address: [email protected] (B. Li).

http://dx.doi.org/10.1016/j.is.2016.10.001 0306-4379/© 2016 Published by Elsevier Ltd.


B. Li et al. / Information Systems 69C (2017) 81–92

In order to mine social media data for information that could lead to an understanding of opinions and sentiments, and to predict social events, several challenges need to be addressed. Firstly, social media data are collected along timelines and are therefore temporal data. Many data mining methods handle time series data that are numerical in nature and cannot be used directly with temporal data that contain a lot of text [9]. Mining such temporal data thus requires techniques and algorithms different from those used to mine traditional time series data. Secondly, data from social media are usually textual and ambiguous. It is sometimes hard to classify the sentiment clearly as good or bad, or as positive or negative. Sentiment analysis of such ambiguous data therefore requires an effective text mining method. Last but not least, social media data commonly contain billions of messages, which require a proper database for storage and a creative architecture for processing; to analyze them, data and text mining methods have to be implemented efficiently.

In this study, we attempt to address the above challenges in mining social media data. Specifically, we created a corpus of Twitter data to predict the movement of the share prices of certain stocks in the U.S. stock markets. We attempt to demonstrate how ambiguous temporal data collected from social media sites can be mined effectively and efficiently. Stock price prediction has been a popular research problem over the last two decades, but not many approaches have been proposed that tackle it effectively [10]. For instance, prediction based on the assumption of a random walk has so far not been very satisfactory. There has been some effort to base prediction on detailed analysis of financial news about listed companies [11], building on assumptions such as the classical Efficient Market Hypothesis (EMH) [12].
However, predicting news trends has not been shown to be any easier. Consequently, any prediction based on unpredictable factors such as financial news is likely to be arbitrary. Even though relatively high prediction accuracy has been reported in some studies, such as [6,7], these studies cannot be easily generalized, since they require too many specific parameters and conditions for their predictions to be accurate.

Motivated by these challenges and their practical significance, we have developed a novel approach, called Social Media Data Analyzer – Sentiment Analysis (SMeDA-SA), to mine ambiguous temporal social media data collected from Twitter in order to determine the movement of stocks on the US stock markets, namely NYSE and NASDAQ. It is widely accepted by economists that there is a potential connection between a company’s stock price and the information published about it [13]. Given that data about public opinion can be collected relatively easily from social media sites, we collected and mined such data to discover public sentiment about products and services, and used it to predict the stock prices of listed companies directly and indirectly. In this paper, we present the details of SMeDA-SA, which we developed for this purpose.

To perform its task, SMeDA-SA takes several steps. First, we consider each tweet’s structure as a combination of words and phrases. We apply natural language processing (NLP) techniques to classify a tweet’s sentiment into five categories (Positive+, Positive, Neutral, Negative, Negative−). We then use the concept of adjusted residuals [24] to identify interesting patterns between public sentiment and stock market prices. To evaluate the effectiveness of the proposed approach, we performed a number of experiments.
In our experiments, we selected 30 listed companies from different industries in NYSE and NASDAQ to test how accurately stock movements can be predicted by mining social media data with the proposed approach. To mine social media data for public sentiment, we collected approximately 15 million records of Twitter data that mentioned these 30 companies either directly or indirectly, by mentioning their products or services. For instance, for the company “Apple Inc”, which we had selected, we looked for tweets that mention “AAPL”, the stock market code for the company, as well as keywords for its products, such as “iPad”, “iTunes” and “iPhone”, and product characteristics such as “CPU Speed” and “Color”. In order to identify correlation between the sentiment reflected in postings on Twitter and the movement of stock prices, we used the proposed algorithms to compute a degree of sentiment for each of the 30 listed companies and determined how strongly it is correlated with the price movement of the selected stocks.

Following this introduction, in Section 2 we describe the background of the proposed work and review related literature. In Section 3, we present details of our proposed methods for tackling the research problem. Specifically, we describe the process of analyzing and extracting valuable information from the ambiguous temporal textual data collected from the social media platform. Section 4 presents our experiments and the results. We conclude this article and suggest directions for future work in Section 5.

2. Related works

Although social media analytics is becoming increasingly popular as a research topic, not much work has been done on discovering temporal patterns in ambiguous social media content and relating them to other time-dependent events that take place in the real world. The pioneers in this research domain are Jansen and his colleagues [14]. They investigated how word-of-mouth advertisements in social media may change the recipients’ sentiments toward the related brands and products. Their work provided insights into how ambiguous information from social media can be analyzed. Despite their effort, the potential of social media analytics remains very much unexplored.
This is especially the case for the kind of time-varying temporal patterns in social media data that we are directly concerned with here. There has been some previous effort to analyze blog contents to determine whether they are correlated with business performance indicators, such as spikes in the sales volume of books [15]. There have also been attempts to determine whether movie ticket sales can be predicted. These predictions are primarily based on meta-data about the movies, including the Motion Picture Association of America (MPAA) rating, the genre, the number of screens on which the movie debuted, the running time, the release date and the presence of particular actors or actresses in the cast. Based on such information, linear regression was used to predict the earnings of the movies [5]. In [16], instead of linear regression, the prediction problem is treated as a traditional classification problem, and artificial neural networks are used to classify movies into categories ranging from ‘blockbuster’ to ‘flop’. Apart from predicting ranges instead of the actual sales volume of a movie, these approaches do not seem to allow very accurate prediction models to be constructed: when the accuracy of these models was tested, it was found to be relatively low.

Of the work related to social media analytics, it is worth pointing out that Asur and Huberman [6] mined temporal data from social media. In their paper, they show how a model based on the popularity rating of movies, as determined from relevant tweets, can be used to predict the actual box office revenue of a movie. The dataset used in their study was collected from Twitter on an hourly basis and aggregated for analysis.
In order to ensure that the tweets obtained all referred to a specific movie, keywords taken from the movie title were used to search for relevant tweets over a period of three months. These tweets were then used for prediction, and the accuracy of the predicted results was found to be higher than that of the other methods described in [17–19]. Their study provided some evidence that meaningful patterns can be found in temporal social media data for prediction. If a more effective text mining method were used, more meaningful patterns could be discovered in the ambiguous social media data. It should also be noted that, while the way the model was constructed seems reasonable, the model could only be constructed after the movies had been released long enough for opinions to be expressed on and collected from Twitter.

Two tools have been used to track public mood, namely OpinionFinder and Google-Profile of Mood States (GPOMS). Between them, these tools classify the mood of tweets as positive or negative (OpinionFinder) and along six dimensions of sentiment: calm, alert, sure, vital, kind, and happy (GPOMS) [7]. They were used to measure variations in the public mood in tweets posted from February 28 to December 19, 2008. OpinionFinder was used to analyze the textual contents of the tweets in order to classify them into positive and negative public moods, while GPOMS was used to determine the six-dimensional public mood classification [7]. In [7], Granger Causality Analysis [20] and a Self-Organizing Fuzzy Neural Network are used to correlate the changes of the DJIA with public mood as measured by OpinionFinder and GPOMS. The results indicate that an accuracy rate as high as 87.6% and a reduction of the Mean Average Percentage Error of about 6% can be achieved. Following [7], Mittal and Goel [8] performed a similar study with a similar approach to identify trends in the DJIA by relating tweets to financial news, and obtained a prediction accuracy of 87%.

These two studies [7,8] provided a foundation for further research in social media analytics, and there are several good reasons for further work to be done. First, the datasets used in [7,8] were still relatively small, and the periods selected for experimentation and testing were still relatively short.
Second, the geographical location of social media participants was not considered. It is not clear whether people in Asian countries would have the same impact on the DJIA as those living in the U.S. when discussing products, services, markets or stocks. Third, the predictions reported in [7,8] focused on the DJIA as a whole rather than on specific stocks or sectors. Last but not least, the studies in [7,8] did not emphasize the development of an effective architecture for handling big data involving different companies, their products, and product attributes for social media analytics. To address these issues, we propose an architecture and an algorithm with which ambiguous temporal social media data can be mined for patterns that reflect sentiment, so that the stock movements of specific companies can be accurately predicted.

Stock microblogs, such as StockTwits, provide a simple way to collect data from social media and the stock markets [29,30]. For the same purpose, some researchers prefer to collect social media data from Twitter for each publicly traded firm [32] or via stock ticker symbols [31]. Data collected for sentiment analysis may cover a significant proportion of news and advertisements, but little work has been done on relating customer discussions, opinions and sentiment about specific firms to their stock market dynamics, even though there have recently been some attempts (e.g., [33]) to segment data by topic and to find opinions or sentiments. During sentiment analysis, manual marking [29], such as the use of Twitter Investor Sentiment (TIS) [30], allows the sentiment score of a given topic to be represented by a weighted sum of opinion scores computed from all the words in a sentence. Similar to our work, [30] also focused on the stock market, but more specifically on stock returns, volatility and trading volume.
The TIS performs its task by counting the number of tweets that contain the words “bullish” or “bearish”. As this is a relatively simplistic model of sentiment classification, predictions based on it are not very accurate. Among the three focal market variables (returns, volatility and trading volume), only trading volume was found to be related to the posting volume of tweets. In contrast, researchers who use a Naïve Bayesian classifier to categorize each tweet, and who weigh the tweets by confidence level based on the weighted number of positive and negative tweets as well as the volume of tweets [34], seem to provide a more promising approach to sentiment analysis. In addition, there have been some efforts to develop topic-based sentiment analysis. For instance, an approach [33] has been proposed to build a model that identifies topics and captures sentiment, classified as positive, negative or neutral, based on opinion-related keywords of different topics.

Given the huge repository of data that many social media sites make available and the potentially very useful information hidden in them, a number of different approaches have been proposed to mine social media data for useful information (as reviewed above). In most cases, these approaches have been found to be effective. However, for them to tackle more sophisticated problems in more diversified areas, novel approaches need to be developed to handle big temporal social media data that are noisy, uncertain and ambiguous. For this purpose, we developed SMeDA-SA, which we present in the following section.

3. Proposed method – SMeDA-SA

Compared with other approaches to social media analytics, SMeDA-SA is unique in several ways. First, when it performs its tasks, it takes into consideration what we refer to as concept maps. A concept map captures the relationships between concepts. For our application involving social media data, a concept map represents words as circles connected by labeled arrows in a downward-branching hierarchical structure.
For instance, “Apple Inc.”, “iPhone 5s”, “MacBook Pro”, “Battery” and “Color” are words that can be related to each other and organized in a hierarchy in a concept map. If we put “Apple Inc.” at the top level of the hierarchy, then “iPhone 5s” and “MacBook Pro” are placed at the level below it, as they are “Apple Inc.” products. Similarly, as “Battery” and “Processing Speed” are words used to describe the products “iPhone 5s” and “MacBook Pro”, they can be included in the concept map at a level below that of the products. Previous work on social media analytics does not consider concept maps. SMeDA-SA, however, takes these concept maps into consideration when trying to discover patterns in the tweets. Any tweet that is directly or indirectly related to “Apple Inc.” at the top level of the concept map hierarchy, or to words at the lower levels, can be considered when deciding how relevant the tweet is to the strength of positive or negative sentiment about a company’s business, products or services. The whole concept map is needed, rather than only mentions of one particular company, because messages posted on many social media sites are often unstructured and ambiguous; for the analysis to be more complete, the concept map has to be considered when mining social media for useful hidden information.

In order to construct the concept map, SMeDA-SA first discovers association relationships between concepts. If a concept is found to appear together with another concept significantly frequently in the social media data, then the two concepts are connected by SMeDA-SA in a concept map. For example, if “Apple Inc.” is found to appear frequently with “iPad”, then the two concepts, “Apple Inc.” and “iPad”, are put together into the same concept map. Similarly, if “iPad” appears frequently enough with “iPad Pro”, then they are placed in the same concept map, and so on.
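The linking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_concept_links`, the plain co-occurrence threshold `min_cooccurrence` (a stand-in for whatever significance test SMeDA-SA actually applies) and the case-insensitive substring matching are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def build_concept_links(tweets, concepts, min_cooccurrence=2):
    """Link two concepts when they co-occur in enough tweets.

    Assumption: a raw count threshold replaces the paper's
    (unspecified here) test for "significantly frequent" co-occurrence.
    """
    pair_counts = Counter()
    for text in tweets:
        lowered = text.lower()
        # Naive substring matching; "iPad" also matches inside "iPad Pro".
        present = sorted(c for c in concepts if c.lower() in lowered)
        pair_counts.update(combinations(present, 2))
    return {pair for pair, count in pair_counts.items()
            if count >= min_cooccurrence}


# Made-up tweets for illustration only.
tweets = [
    "Apple Inc. just announced a new iPad",
    "The iPad from Apple Inc. looks great",
    "My iPad Pro beats my old iPad",
    "Loving the iPad Pro, best iPad yet",
]
concepts = ["Apple Inc.", "iPad", "iPad Pro"]
links = build_concept_links(tweets, concepts)
```

With these four made-up tweets, “Apple Inc.” is linked to “iPad” and “iPad” to “iPad Pro”, which mirrors the chain in the example above. A real implementation would need proper tokenization and a statistical test rather than a fixed count.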
Once the “concepts” that are associated with each other and that are to be included in a concept map are determined, they need to be arranged in a hierarchy. SMeDA-SA makes use of an algorithm to measure the strength of the association between them and how directly they may be related to the concept at the top level. SMeDA-SA determines how significantly the concepts at a lower level affect those at a higher level in the concept map when the effect of the higher-level concepts on the sentiment is evaluated. When mining opinions about a top-level concept, such as “Apple Inc.”, SMeDA-SA is expected to measure accurately the weights of all the lower-level concepts linked directly or indirectly to the top level.

SMeDA-SA performs its tasks in several steps:
(i) it makes use of natural language processing and analysis techniques to mine Twitter data and extract a sentiment word list, so as to classify each tweet into different sentiment categories;
(ii) it identifies a target concept that the user is interested in (e.g., a particular company, such as “Apple Inc.”);
(iii) it makes use of an association discovery algorithm to discover all concepts that are associated with the target concept (e.g., “iPad” can be discovered to be associated with “Apple Inc.”);
(iv) it makes use of an algorithm developed for semantic analysis to generate a concept map for the target concept based on the other concepts that are found to be associated with it;
(v) it makes use of an algorithm to determine how much association there is between concept maps and sentiment categories; and
(vi) it makes use of an algorithm to see whether there is correlation between stock prices and the sentiment detected in the tweets.
In the following, we provide the details of SMeDA-SA.

3.1. Extract sentiment words from tweets

Given a set of social media data, such as a set of Twitter data, one of the most useful things to discover is what kinds of sentiment are reflected in the data. For this reason, the first step that SMeDA-SA performs is to use an effective text mining algorithm to extract from the tweets the words or phrases that best reflect the sentiment of their authors.
To do so, SMeDA-SA considers each tweet, which is composed of one or more sentences, as a sequence of words representable as “A+B+C+D+E+B…+R”, where “A”, “B”, “C”… “R” refer to different words. To classify tweets according to the sentiment expressed in them, SMeDA-SA first generates a word list containing words that reflect different sentiments to different degrees, so that a sentiment value can be computed for each word. Based on the word list, a tweet can be classified into a sentiment category according to the sentiment values of the words, phrases and expressions that the tweet contains. To compile a word list for analyzing sentiment, we randomly sample from Twitter those tweets that are related to a particular topic. We refer to such a set of sampled data as $Tb_i$, which is defined as follows:





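$$Tb_i = \mathrm{random}\Bigl(Db_i,\; MAX - \mathrm{sum}\Bigl(\bigcup_{j=1}^{i-1} Tb_j\Bigr),\; MIN\Bigr), \qquad MAX = 5000;\ MIN = 200 \tag{1}$$

$$S = \bigcup_{j=1}^{n} Tb_j, \qquad \mathrm{sum}(S) = 5000 \tag{2}$$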

where $Tb_i$ is the set of randomly sampled data from Twitter related, in our application, to a particular industry as defined in NYSE or NASDAQ. Assuming that we are interested in $n$ topics (or industries, in the case of our application here), we use an index $i$ in the above equations to refer to a particular topic (or industry). With this index, we define $Db_i$ as the total number of tweets that refer to the topic indexed by $i$. Given the $Tb_i$ defined for each topic, we define $S$ to be the complete set of randomly sampled data.
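One possible reading of Eqs. (1) and (2) is sketched below in Python. The paper does not spell out how random(·) chooses a sample size, so the details here are assumptions: each topic receives a size drawn uniformly between MIN and whatever remains of the MAX budget (while reserving MIN for every topic still to be sampled), and the last topic takes the remainder so that the total is exactly MAX. The names `draw_samples`, `MAX_TOTAL` and `MIN_PER_TOPIC` are hypothetical.

```python
import random

MAX_TOTAL = 5000      # MAX in Eq. (1): overall sample budget
MIN_PER_TOPIC = 200   # MIN in Eq. (1): smallest sample per topic


def draw_samples(tweets_by_topic, seed=0):
    """Draw one random sample per topic so that all samples total MAX_TOTAL.

    Assumes every topic pool holds at least MAX_TOTAL tweets.
    """
    topics = list(tweets_by_topic)
    if len(topics) * MIN_PER_TOPIC > MAX_TOTAL:
        raise ValueError("too many topics for the sample budget")
    rng = random.Random(seed)
    samples = {}
    used = 0
    for index, topic in enumerate(topics):
        topics_left = len(topics) - index - 1
        # Keep MIN_PER_TOPIC in reserve for each topic not yet sampled.
        upper = MAX_TOTAL - used - MIN_PER_TOPIC * topics_left
        if topics_left == 0:
            size = upper  # last topic uses up the remaining budget
        else:
            size = rng.randint(MIN_PER_TOPIC, upper)
        samples[topic] = rng.sample(tweets_by_topic[topic], size)
        used += size
    return samples


# Synthetic tweet identifiers stand in for real tweets.
pools = {name: [f"{name}-{k}" for k in range(MAX_TOTAL)]
         for name in ("technology", "retail", "energy")}
samples = draw_samples(pools)
```

Under these assumptions the union of the per-topic samples plays the role of $S$ in Eq. (2). A uniform draw tends to give earlier topics larger samples, so a different allocation rule may be closer to what the authors used.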

Table 1. Examples of sentiment words.

Positive   Objective   Negative   Text
0.5        0.25        0.25       Hopeful#1
0          0.65        0.375      Promising#2 hopeful#2 bright#10
0          1           0          Sympathetic#1
0.375      0.375       0.25       Sympathetic#2
0.625      0.375       0          Sympathetic#3 openhearted#1 large-hearted#1 kindly#1 good-hearted#1

In Eq. (1), the function random(D, MAX, MIN) is used to control the number of samples randomly drawn from Twitter. The size of each sample is to be between MIN and MAX. The samples that are selected are used for “training” purposes. Using the training data, SMeDA-SA makes use of Eq. (3) to build a preliminary sentiment library and scores the sentiment level based on data from SentiWordNet 3.0. The result of the grammar analysis is then converted into the input format of the C-SVC algorithm and used to calculate the accuracy level with the testing data. The process is repeated ten times, which results in ten accuracy scores. The library with the highest accuracy score is used as the sentiment word library in the later algorithm.

In order to create a sentiment word list from the collected textual social media data, we applied term frequency-inverse document frequency (TF-IDF) weighting [21], a well-known statistics-based method developed under the vector space model, to select the most significant sentiment words. More specifically, the words are ranked by their calculated TF-IDF values and the top 40% of the words in the ranking are selected for the initial sentiment word list. This process is presented in Eq. (3).





$$W(i,d) = \frac{tf(i,d) \times \log\left(\frac{N}{n_i} + 0.01\right)}{\sqrt{\sum_{i \in d} \left[tf(i,d) \times \log\left(\frac{N}{n_i} + 0.01\right)\right]^2}} \tag{3}$$
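As a rough illustration of Eq. (3) and the top-40% selection, the following Python sketch computes length-normalized TF-IDF weights for tokenized tweets. Several points are assumptions rather than details from the paper: the denominator is taken to be the usual cosine normalization of the vector space model, each word is ranked by its maximum weight over all tweets, and the helper names `tfidf_weights` and `top_words` are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(tokenized_tweets):
    """Return one {word: weight} dict per tweet, following Eq. (3)."""
    n_samples = len(tokenized_tweets)          # N
    doc_freq = Counter()                       # n_i: tweets containing word i
    for tokens in tokenized_tweets:
        doc_freq.update(set(tokens))
    weighted = []
    for tokens in tokenized_tweets:
        tf = Counter(tokens)                   # tf(i, d)
        raw = {w: tf[w] * math.log(n_samples / doc_freq[w] + 0.01)
               for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({w: (v / norm if norm else 0.0)
                         for w, v in raw.items()})
    return weighted


def top_words(weighted, fraction=0.4):
    """Keep the top fraction of words, ranked by their best weight.

    Assumption: a word's ranking value is its maximum weight in any tweet.
    """
    best = {}
    for weights in weighted:
        for word, value in weights.items():
            best[word] = max(best.get(word, 0.0), value)
    ranked = sorted(best, key=lambda w: (-best[w], w))
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]


# Made-up tokenized tweets for illustration only.
tweets = [["great", "iphone", "battery"], ["bad", "battery"], ["great", "screen"]]
weights = tfidf_weights(tweets)
selected = top_words(weights)
```

Because of the normalization, the squared weights of each non-empty tweet sum to one, so long and short tweets are weighted on a comparable scale.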

In Eq. (3), $W(i,d)$ is the weight of the word $i$ in tweet $d$, $tf(i,d)$ is the frequency of the word $i$ in tweet $d$, $N$ is the total number of training samples, and $n_i$ is the number of training samples that contain the word $i$.

We also used SentiWordNet 3.0, a lexical resource for sentiment classification, to calculate the sentiment value of each word in the created word list [22]. In SentiWordNet 3.0, each word is scored on three notions: positivity, negativity and neutrality (objectivity). We match each word in our system against the entries in SentiWordNet 3.0. The matching gives each word three numerical scores, each ranging from 0.0 to 1.0, whose sum is 1. We then save these results on our local server. We list some examples in Table 1.

After obtaining the words’ sentiment values in the above steps, we use Formula (4) to classify each word into one of five sentiment categories. In Formula (4), $(P_j - N_j)$ is the popularity value of the sentiment word in one particular tweet; $i$ is the position of the target word in the set of its synonyms; and $word_{i,score}$ and $word_{i,tag}$ denote the sentiment score and the sentiment category of the word. A higher rank carries a bigger weight, where we use $\frac{1}{j}$ to measure the popularity.

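$$word_{i,score} = \frac{\sum_{j=1}^{n} \frac{(P_j - N_j)}{j}}{\sum_{j=1}^{n} \frac{1}{j}}, \quad i = 1, \ldots, m, \qquad word_{m,tag} = \begin{cases} \text{Positive}^{+}, & word_{m,score} \ge 0.75 \\ \text{Positive}, & 0.25 \le word_{m,score} < 0.75 \\ \text{Neutral}, & -0.25 < word_{m,score} < 0.25 \\ \text{Negative}, & -0.75 < word_{m,score} \le -0.25 \\ \text{Negative}^{-}, & word_{m,score} \le -0.75 \end{cases} \tag{4}$$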
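Formula (4) can be read as a rank-weighted average followed by a five-way thresholding, which the short Python sketch below illustrates. It assumes that $j$ runs over the ranked SentiWordNet senses of a word and that $P_j$ and $N_j$ are the positive and negative scores of sense $j$; the function names `word_score` and `word_tag` and the sample scores are hypothetical.

```python
def word_score(senses):
    """Rank-weighted sentiment score of one word.

    senses: list of (positive, negative) score pairs ordered by rank
    j = 1..n, so lower-ranked senses contribute with weight 1/j.
    """
    if not senses:
        raise ValueError("at least one sense is required")
    numerator = sum((p - n) / j for j, (p, n) in enumerate(senses, start=1))
    denominator = sum(1 / j for j in range(1, len(senses) + 1))
    return numerator / denominator


def word_tag(score):
    """Map a score in [-1, 1] to one of the five categories of Formula (4)."""
    if score >= 0.75:
        return "Positive+"
    if score >= 0.25:
        return "Positive"
    if score > -0.25:
        return "Neutral"
    if score > -0.75:
        return "Negative"
    return "Negative-"


# Illustrative scores for a word with two senses.
example_senses = [(0.5, 0.25), (0.0, 0.375)]
example_score = word_score(example_senses)
example_tag = word_tag(example_score)
```

The harmonic weights mean that the first sense dominates: a strongly positive first sense is only partly offset by a negative second sense.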


Fig. 1. Concept maps to represent the object.

Based on the sentiment classification of each frequent word, we further use C-SVC to allocate each tweet to one of the five pre-defined sentiment categories. More specifically, the “one-against-one” approach [23] is used to solve the M-class classification problem. To train on the input tweets, we represent each tweet by the words it shares with the created word list, together with their sentiment values, as a training vector. We then assign the words sentiment possibilities according to the five pre-defined sentiment categories, as shown below:

$$word_1 : word_{1,score} \;\cdots\; word_i : word_{i,score} \;\cdots\; word_m : word_{m,score} \quad Cat_{tr}$$

where $word_{i,score} \in [0, 1]$ and $Cat_{tr} \in \{\text{Positive}^{+}, \text{Positive}, \text{Neutral}, \text{Negative}, \text{Negative}^{-}\}$.

Based on the former algorithm, we obtain 8892 eligible sentiment words that consist of 803 Positive+ words, 3572 Positive words, 2152 Neutral words, 1476 Negative words and 979 Negative− words.

3.2. Association rule discovering

After putting each word into one of the five sentiment categories, our next step is to discover patterns between public sentiment and stock price movements. We first divide the whole set of tweet data into subsets by company, each of which we call a large database (labeled LD). The data in each LD describe aspects of many attributes (labeled E). As we investigate a data set organized as a time series (labeled S), it is necessary to analyze public opinion based on the integrated data, as well as to distinguish the influences and relations of the attributes to a firm. We first give the following definitions to formalize this classification problem.

Suppose that there is a dataset LD constructed from a set of attributes $E = \{E_1, E_2, \cdots, E_n\}$, where each $E_i$, $i = 1, 2, \ldots, n$, can be quantitative or categorical, and a set of ordered sequence objects $S = \{S_1, S_2, \cdots, S_m\}$, in which $1 \le i \le n$ and $1 \le j \le m$. For any record $d \in LD$, $d[E_j]$ denotes the value of the attribute $E_j$. Furthermore, the elements of E and of S are assumed to be independent, meaning $E_i \cap E_j = \emptyset$ and $S_i \cap S_j = \emptyset$ for $i \ne j$. Hence, each attribute $E_j$ represents a concept in the concept map, and each $S_j$ represents the time slot of a tweet’s publishing date, as shown in Fig. 1. To give an example, defined attributes such as iPhone and color can be considered as the products and characteristics of the listed company “Apple.Inc”.

As hypothesized above, the measured attributes in the attribute set E are independent of each other. We use adjusted residuals to examine this hypothesis and so evaluate the relationships between attributes. The results can be further used to measure the weight of the revealed association rules. If the problem is to determine whether two attributes, such as $E_j$ and $E_k$ described above, are associated, we merge all the tweets using the set of ordered sequence objects S in order to summarize the information related to each attribute. The degrees of the grouped sentiment possibilities for these two attributes are used to construct the chi-square test table, as shown below. Specifically, the two sets of data, $\{j_1, j_2, \cdots, j_r\}$ and $\{k_1, k_2, \cdots, k_r\}$, are defined to represent the possibilities of each sentiment category calculated in the first step for $E_j$ and $E_k$, as shown in Fig. 2. In the chi-square test table, $f(j_{i,1}, k_{i,1})$ is the minimum of $j_{i,1}$ and $k_{i,1}$, and $\sum_{i=1}^{m} f(j_{i,1}, k_{i,1})$ summarizes all the information for the set of ordered sequence objects S. Following the pre-defined attributes and values, the chi-square statistic can be further defined in Formula (5).

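$$
X^2 = \sum_{i=1}^{r}\sum_{j=1}^{r} \frac{\left(t_{i,j} - (t_{i,r+1} \times t_{r+1,j})/t_{r+1,r+1}\right)^2}{(t_{i,r+1} \times t_{r+1,j})/t_{r+1,r+1}} = \sum_{i=1}^{r}\sum_{j=1}^{r} \frac{t_{i,j}^2}{(t_{i,r+1} \times t_{r+1,j})/t_{r+1,r+1}} - t_{r+1,r+1} \tag{5}
$$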
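As a concrete check of Formula (5), the sketch below computes the chi-square statistic in both of its forms from a table of observed values and compares it with a critical value. The function names and the example tables are hypothetical, and the 0.05 critical values are copied from standard chi-square tables rather than from the paper.

```python
def expected_counts(table):
    """Expected value of each cell: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals], n


def chi_square(table):
    """First form of Formula (5): sum of (observed - expected)^2 / expected."""
    expected, _ = expected_counts(table)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))


def chi_square_short(table):
    """Second form of Formula (5): sum of observed^2 / expected, minus the grand total."""
    expected, n = expected_counts(table)
    return sum(o ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row)) - n


# Critical values at alpha = 0.05, keyed by degrees of freedom
# (assumed from standard tables; 16 corresponds to r = 5 categories).
CRITICAL_0_05 = {1: 3.841, 4: 9.488, 16: 26.296}


def associated(table):
    """True when the statistic exceeds the 0.05 critical value for (r-1)(r-1) degrees of freedom."""
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square(table) > CRITICAL_0_05[df]
```

For a 2 x 2 table such as `[[40, 10], [10, 40]]` every expected value is 25, so the statistic is 36 and the two attributes would be judged associated.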

Using the above formula, if $X^2 > X^2_{0.05}((r-1)(r-1))$, then with $d = (r-1) \times (r-1)$ degrees of freedom and at the significance level $\alpha = 0.05$, the two attributes $E_j$ and $E_k$ are associated with each other. Otherwise, under the same degrees of freedom and significance level, the two attributes $E_j$ and $E_k$ are independent of each other. Through this chi-square test, we can determine whether two single attributes are associated. However, the test result carries no information about whether specific sentiment categories of one attribute are associated with those of the other. For instance, if the public opinion about “Iphone5s_battery” is negative, it is not clear whether this implies that the public opinion about “Apple.inc” is also negative. To address this issue, adjusted residuals are introduced to measure the differences between attributes such as $E_j$ and $E_k$ with specific sentiment categories $L_x \in \{L_{j1}, L_{j2}, \dots, L_{jr}\}$, $1 \le a \le r$


B. Li et al. / Information Systems 69C (2017) 81–92

Fig. 2. Constructing the Chi-square Test Table.

and $L_y \in \{L_{k1}, L_{k2}, \dots, L_{kr}\}$, $1 \le b \le r$:

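$$
d_{xy} = \frac{Z_{xy}}{\sqrt{V_{xy}}} \tag{6}
$$

where $V_{xy}$ is the maximum likelihood estimate of the variance of $Z_{xy}$:

$$
V_{xy} = \left(1 - \frac{t_{x,r+1}}{t_{r+1,r+1}}\right)\left(1 - \frac{t_{r+1,y}}{t_{r+1,r+1}}\right) \tag{7}
$$

$$
Z_{xy} = \frac{t_{x,y} - (t_{x,r+1} \times t_{r+1,y})/t_{r+1,r+1}}{\sqrt{(t_{x,r+1} \times t_{r+1,y})/t_{r+1,r+1}}} \tag{8}
$$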
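The adjusted residual of Formulas (6)–(8) can be sketched for a single cell as follows. This is a minimal illustration that assumes the standard form of the adjusted residual, in which the standardized residual and the variance estimate both enter under a square root; the function name and the example table are hypothetical.

```python
from math import sqrt


def adjusted_residual(table, x, y):
    """Adjusted residual d_xy for cell (x, y) of a table of observed values."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = row_totals[x] * col_totals[y] / n
    z = (table[x][y] - expected) / sqrt(expected)            # Formula (8)
    v = (1 - row_totals[x] / n) * (1 - col_totals[y] / n)    # Formula (7)
    return z / sqrt(v)                                       # Formula (6)


# Illustrative table: both diagonal cells are over-represented.
observed = [[40, 10], [10, 40]]
d_00 = adjusted_residual(observed, 0, 0)   # positive association
d_01 = adjusted_residual(observed, 0, 1)   # negative association
```

For a 2 x 2 table the squared adjusted residual of any cell equals the chi-square statistic of the whole table, which gives a quick consistency check between the two tests.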

Based on the results of the previous step, which determines the association between attributes, those attributes that are not associated are ignored. For attributes that are associated, such as $E_j$ and $E_k$, Formulas (6)–(8) are used to further identify whether they are associated with specific linguistic terms. Corresponding to the predefined table for the chi-square test (Fig. 2), each of its elements can be read from the matrix $T$ to calculate the values of $Z_{xy}$, $V_{xy}$ and $d_{xy}$. If $|d(L_x, L_y)| > 1.96$ (the two-sided 95% critical value of the standard normal distribution), the discrepancy between $Pr(L_x \mid L_y)$ and $Pr(L_x)$ is significant, and hence the association between $L_x$ and $L_y$ is considered interesting. More specifically, if $d(L_x, L_y) > 1.96$, the presence of $L_x$ implies the presence of $L_y$; that is, $L_{ja}$ is positively associated with $L_{kb}$. If $d(L_x, L_y) < -1.96$, the absence of $L_y$ implies the presence of $L_x$, and $L_x$ is negatively associated with $L_y$. For instance, in the previous example, if the public’s opinion about “Iphone5s_battery” is Positive+, the public’s opinion about “Apple.inc” would also be Positive. In another case, if the public’s opinions about “Iphone5s_color” and “Iphone5s_battery” are both Positive, the public’s opinion about “Apple.inc” would be Positive+, as shown in the association rules below:

Rule1: IF Iphone5s_battery is Positive+ THEN Apple.inc is Positive
Rule2: IF Iphone5s_color is Positive AND Iphone5s_battery is Positive THEN Apple.inc is Positive+

3.3. Classify higher-layered objects into corresponding sentiment categories

As described earlier, the data from social media may have multileveled meanings.
For example, if the purpose of the research is to identify the overall sentiment about the company “Apple Inc.”, it is very likely that the calculated sentiment about the products of that company, such as the “iPhone” and “MacBook Pro”, plays a significant role and carries a certain weight. Detailed product features, such as “CPU performance” and “appearance”, may also have an influence on the sentiment analysis of the high-level attribute “Apple Inc.”. As discussed before, the proposed method can increase the dimensionality of the social media data for the target object, using both direct and indirect division operations to measure all other concepts that are associated with the target concept. Hence, by identifying the association rules between attributes, the classification problem can be formalized as a multidimensional pattern classification problem. It can be handled by classifying the higher-layer attributes into categories according to the defined linguistic terms, and then measuring the weights of all the relevant rules related to the lower-layer attributes. The association rule can be expressed as $L_q \to C_q$, where $L_q = \{L_{q1}, L_{q2}, \dots, L_{qn}\}$ is the sentiment set within the defined sentiment categories [Negative−, Negative, Neutral, Positive, Positive+], and $C_q$ is the target higher-layer attribute, with $C_q \in C = \{C_1, C_2, \dots, C_r\}$, where $C$ contains all the defined categories. Therefore, the IF-THEN rule ($E_{j+1}$ is $L_{q(j+1)}$ and … and $E_n$ is $L_{qn}$) can be extended as follows:

Rule $R_q$: IF $E_1$ is $L_{q1}$ AND … AND $E_n$ is $L_{qn}$ THEN Attribute $C_q$ AND $CW_q$

where $CW_q$ is defined to measure the weight for the attribute $C_q$ by calculating all the related association rules for the selected sentiment categories in the set $L_q$.

In order to distinguish this rule from the association rule mining problem, a single row of the training data LD is defined as $X_p = \{x_{p1}, x_{p2}, \dots, x_{pn}\}$, $p = 1, 2, \dots, m$, where $n$ is the number of attributes, and $\mu_{L_{q1}}(x_{p1})$ is defined as the sentiment possibility of the datum $x_{p1}$ for one attribute in a sentiment category. Moreover, an operation is designed to combine the factors for all the attributes in the interesting association rules that were found, as measured by the sentiment set $L_q$:

$$
\mu_{L_q}(X_p) = \mu_{L_{q1}}(x_{p1}) \times \mu_{L_{q2}}(x_{p2}) \times \dots \times \mu_{L_{qn}}(x_{pn}) \tag{9}
$$

Hence, the confidence measure for the association rule Lq → Cq can be defined as follows:



$$
CW_q = C(L_q \to C_q) = \frac{\sum_{x_p \in C_q} \mu_{L_q}(X_p)}{\sum_{p=1}^{m} \mu_{L_q}(X_p)} \tag{10}
$$

When the association rules have a single attribute under one particular sentiment category, each association rule casts a vote for its consequent attribute. The strength of the vote is the product of the compatibility grade and the rule weight. Based on the confidence measured for each association rule, the total strength of the votes for each attribute under each sentiment category can be calculated as follows:

$$
W_{Class\,h}(X_p) = \sum_{C_q = h} \mu_{L_q}(X_p) \cdot CW_q; \quad h = 1, 2, \dots, r \tag{11}
$$
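Formulas (9)–(11) can be read as a weighted voting scheme, sketched below under simplifying assumptions. The function names, the training memberships, the labels and the second rule's weight of 0.6 are all hypothetical values chosen for illustration; they are not taken from the experiments.

```python
from math import prod


def compatibility(memberships):
    """Formula (9): product of the per-attribute sentiment possibilities of one record."""
    return prod(memberships)


def rule_confidence(compat, labels, target):
    """Formula (10): share of the total compatibility held by records labeled `target`."""
    total = sum(compat)
    if total == 0:
        return 0.0
    return sum(c for c, lab in zip(compat, labels) if lab == target) / total


def classify(record_compat, rules):
    """Formula (11): each rule votes for its consequent with strength mu * CW.

    rules is a list of (consequent, CW) pairs; record_compat[q] is the
    compatibility of the record with the antecedent of rule q.
    """
    votes = {}
    for mu, (consequent, cw) in zip(record_compat, rules):
        votes[consequent] = votes.get(consequent, 0.0) + mu * cw
    winner = max(votes, key=votes.get)
    return winner, votes


# Illustrative training rows for one rule: two attributes per record.
train_memberships = [[0.8, 0.5], [0.2, 0.5], [0.9, 1.0]]
train_labels = ["Positive", "Negative", "Positive"]
compat = [compatibility(m) for m in train_memberships]
cw_positive = rule_confidence(compat, train_labels, "Positive")

# Two rules vote on a new record; the second weight is made up.
rules = [("Positive", cw_positive), ("Negative", 0.6)]
winner, votes = classify([0.7, 0.2], rules)
```

The final class is the consequent with the largest total vote, which corresponds to taking the maximum over the $W_{Class\,h}$ values.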

By using Formula (11), for each target higher-layer attribute, the result can be calculated as numeric values under each defined sentiment category. The calculation result is presented as $W_{Class}(X_p) = \max(W_{Class\,1}(X_p), W_{Class\,2}(X_p), \dots, W_{Class\,r}(X_p))$. For instance, for the selected top-layer attribute “Apple.Inc.”, this calculation measures the weights of all bottom-layer attributes, such as “CPU performance” and “appearance”, by calculating all the relationships (i.e., potential association rules) from the bottom-layer attributes to the top-layer attribute. Overall, SMeDA-SA handles the classification problem by calculating a weight for the target attribute under one defined sentiment category, summing the weights of all the interesting association rules whose consequent attribute is the target attribute under that sentiment category. By taking into account all the related association rules, this method offers a solution to the identified research problem, namely, how to handle the missing and embedded information in social media data.

3.4. Performance evaluation

In order to evaluate the effectiveness of SMeDA-SA at classifying tweets according to the sentiment expressed, we compared its performance against that of traditional classification algorithms, namely C4.5 and the Naïve Bayesian Classifier, using real tweets. For the purpose of experimentation, we retrieved tweets based on keywords such as ‘Iphone 4s’ according to the 20 different characteristics described above. We then deleted repeated data (caused by retweets) to obtain the final set of experimental data. The data set contains 196,370 tweets classified into 20 classes. For performance testing, 90% of the tweets in each class were randomly selected for training and the remaining 10% for testing; that is, we used 90% of the historical data for training. To evaluate our predictions, we traded forward and compared the results with actual data obtained from the New York Stock Exchange (i.e., the remaining 10% of the data were used for testing and comparison). The test was repeated five times using different randomly selected training sets, and the average accuracy was then computed for each classification algorithm tested. For C4.5, the information gain measure was used to determine which textual features of the predefined sentiment words and phrases are relevant to the target class we would like to predict.



$$
IG(t_i) = H(C) - H(C \mid t_i) = -\sum_{j=1}^{n} P(C_j)\log_2 P(C_j) + P(t_i)\sum_{j=1}^{n} P(C_j \mid t_i)\log_2 P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{n} P(C_j \mid \bar{t}_i)\log_2 P(C_j \mid \bar{t}_i) \tag{12}
$$
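The information gain of Formula (12) can be illustrated with a small sketch that scores one textual feature by how much knowing its presence or absence reduces the entropy of the class labels. The function names and the toy labels are hypothetical; this is not the C4.5 implementation used in the experiments.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(C) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(has_term, labels):
    """IG(t) = H(C) - H(C | t), conditioning on presence and absence of the term."""
    n = len(labels)
    present = [lab for flag, lab in zip(has_term, labels) if flag]
    absent = [lab for flag, lab in zip(has_term, labels) if not flag]
    conditional = sum(len(part) / n * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional


labels = ["up", "up", "down", "down"]
perfect = information_gain([True, True, False, False], labels)   # term separates the classes
useless = information_gain([True, False, True, False], labels)   # term tells us nothing
```

A feature that splits the classes perfectly recovers the full entropy of the labels, while a feature that is spread evenly over the classes has zero gain.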

For the Naïve Bayesian Classifier, we assumed that each sentiment word and phrase is independent of the others. We then trained the classifier by determining the conditional probability of $C_i$, where $C_i \in C = \{C_1, C_2, \dots, C_r\}$, given the training data $d$, based on Bayes’ rule as follows:

$$
P(C_i \mid d) = \frac{P(d \mid C_i)P(C_i)}{P(d)}; \quad i = 1, 2, 3, \dots, r \tag{13}
$$

where $d = \{w_1, w_2, \dots, w_j, \dots, w_n\}$ represents the words or phrases that are used to express sentiment, and the predicted class is $\arg\max_{C_i}\{P(d \mid C_i)P(C_i)\}$. As the value of $P(d)$ is the same for all classes, only $P(d \mid C_i)P(C_i)$ needs to be compared, and $P(d \mid C_i)$ can be computed as:

$$
P(d \mid C_i) = P(w_1, w_2, \dots, w_n \mid C_i) = \prod_{j=1}^{n} P(w_j \mid C_i) = P(w_1 \mid C_i) \times P(w_2 \mid C_i) \times \dots \times P(w_n \mid C_i) \tag{14}
$$
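Formulas (13) and (14) correspond to the following minimal Naïve Bayes sketch. It works in log space to avoid underflow and uses add-one (Laplace) smoothing so that unseen words do not zero out a class; the smoothing is an assumption added here and is not described in the paper. The function names and the two training documents are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train(docs):
    """docs is a list of (words, label) pairs; returns the counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, words):
    """Return the class maximizing log P(C_i) + sum_j log P(w_j | C_i)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        total = sum(word_counts[label].values())
        score = log(count / n_docs)
        for w in words:
            # Add-one smoothing (assumption, not from the paper).
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    (["love", "great"], "Positive"),
    (["hate", "slow"], "Negative"),
])
guess = predict(model, ["great"])
```

Because $P(d)$ is dropped, the scores are not probabilities; only their ranking across classes matters.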


Table 2
Comparisons with other algorithms.

Category       SMeDA-SA   C4.5    Naïve Bayesian Classifier
Color          79%        70%     67%
Speed          84%        60%     63%
Battery Life   77%        65%     70%
Display        82%        46%     53%
Weight         70%        44%     62%
Camera         65%        49%     45%
Connector      90%        52%     71%
Price          82%        62%     73%
Headphone      60%        45%     59%
Siri           57%        50%     58%
Average        74.6%      54.3%   62.1%

Table 3
Continual trials experiment.

Category    T1      T2    T3      T4      T5      T6    T7      T8
Color       28%     38%   47%     56%     62%     68%   73%     79%
Speed       27%     40%   55%     67%     70%     74%   80%     84%
Battery     25%     29%   44%     53%     58%     65%   71%     77%
Display     32%     39%   49%     56%     62%     69%   76%     82%
Weight      20%     27%   33%     39%     48%     56%   62%     70%
Camera      16%     23%   30%     33%     41%     47%   53%     65%
Connector   30%     48%   53%     61%     69%     78%   83%     90%
Price       23%     35%   47%     53%     63%     65%   77%     82%
Headphone   19%     24%   27%     35%     42%     46%   52%     60%
Siri        14%     27%   31%     36%     40%     42%   47%     57%
Average     23.4%   33%   41.6%   48.9%   55.5%   61%   67.4%   74.6%

Table 2 summarizes the results of the tests that we performed to evaluate the performance of SMeDA-SA. The first column lists the concept categories for which we have the most training tweets, and the remaining columns give the accuracy of SMeDA-SA and of the other two classification algorithms for each category. The average accuracies of SMeDA-SA, C4.5 and the Naïve Bayesian Classifier are 74.6%, 54.3% and 62.1%, respectively. The relatively higher classification accuracy of SMeDA-SA is likely due to its ability to take into account the heterogeneity of the patterns hidden in each concept category.

One advantage of SMeDA-SA is that it can discover patterns in the training data incrementally, without having to retrain itself on the old data set as new data are introduced. To find out how effective SMeDA-SA becomes as more data are continually collected, we conducted additional experiments, first using 30% of the attributes of the original training data set to train SMeDA-SA. An additional 10% of the attributes was then added in the second experiment. Each subsequent experiment added a further 10% of the attributes to the previous one, giving eight tests in total with 30% (T1), 40% (T2), 50% (T3), 60% (T4), 70% (T5), 80% (T6), 90% (T7) and 100% (T8) of the attributes, respectively. The experimental results are summarized in Table 3. The results indicate that the prediction accuracy of SMeDA-SA continues to improve, without retraining, as new attributes are added.

4. Twitter public sentiment analysis

4.1. Parallelled architecture

In order to access big data from Twitter more effectively, SMeDA-SA makes use of MongoDB, an open-source “NoSQL” database. In MongoDB, data are stored as JSON-style documents, which offers simplicity and power for data retrieval, and attribute indexing is fully supported. MongoDB has been shown to effectively support applications that require scalability and high computational efficiency. The Twitter data set used in our experiment contains more than 200 million tweets, which were collected from October 2011 to March 2012.


Fig. 3. General architecture of the system and components.

SMeDA-SA makes use of MongoDB in such a way that query and storage efficiency are ensured while it performs its tasks. For the experiments that we carried out, we assigned 15 CPUs to each slave node to create separate processes for the classification tasks that SMeDA-SA performs. Based on the attribute indices in MongoDB, the publication dates of the tweets were recorded and 15 parallel processes were established to query the database according to different time slots. In order to better manage system resources, SMeDA-SA sets a threshold to balance the system load: once the threshold is exceeded, the module “LoadBalanceManagement” is triggered to assign tasks to other free slave nodes and to update all the task lists. In a typical computing environment, slave nodes are quite often isolated from the monitoring of the master node, for a number of different reasons. To perform its tasks, SMeDA-SA resides in an environment with an architecture designed specifically for it, so that system operations can be increased on the one hand and the chance of process deadlock can be minimized on the other. The architecture is therefore a parallel architecture, designed to build more data channels between the task lists and the system resource lists through the process “Task Management”. Fig. 3 shows the architecture of the proposed paralleled system and the details of each component. A main class diagram of the proposed architecture is given in Fig. 4. The architecture was implemented as follows for the experiments reported in the above section. The implementation used Ubuntu 12.04 x86_64 GNU/Linux with an Intel i7 3720QM CPU and 16 GB of RAM; SUN Java JDK 1.7.0_25; and a Hadoop 1.1.2 cluster consisting of four nodes distributed over a WAN, forming a 16-core Hadoop cluster with


Fig. 4. Class diagram of the proposed paralleled architecture.

a measured bandwidth for end-to-end TCP sockets of 100 MB/s. Given the Twitter dataset, SMeDA-SA and the parallel computing environment in which it runs, a sentiment word list was created that consists of 803 Positive+ words, 3572 Positive words, 2152 Neutral words, 1476 Negative words and 979 Negative− words.
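The idea of querying the tweet store by time slot with 15 parallel workers, described in Section 4.1, can be sketched as below. This is only an illustration of the partitioning step: the date boundaries are assumed from the stated collection period, `classify_slot` is a hypothetical stand-in for the real database query and classification, and a thread pool stands in for the separate processes on each slave node.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def time_slots(start, end, n_slots):
    """Split [start, end) into n_slots contiguous, equally long half-open intervals."""
    step = (end - start) / n_slots
    edges = [start + k * step for k in range(n_slots)] + [end]
    return list(zip(edges[:-1], edges[1:]))


def classify_slot(slot):
    """Hypothetical stand-in: a real worker would query tweets whose
    publication date falls in [lo, hi) and classify them."""
    lo, hi = slot
    return (hi - lo).total_seconds()


# Assumed boundaries for "October 2011 to March 2012".
slots = time_slots(datetime(2011, 10, 1), datetime(2012, 4, 1), 15)

# One worker per time slot.
with ThreadPoolExecutor(max_workers=15) as pool:
    results = list(pool.map(classify_slot, slots))
```

Half-open intervals ensure that a tweet published exactly on a slot boundary is handled by one worker only.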

4.2. Predicting stock price movements

One of our research problems is to find out whether the stock price of a particular listed company can be predicted from the sentiments expressed by the public. Based on the guidelines of the SIC (Standard Industrial Classification) [25], we selected 30 highly popular listed companies from the NASDAQ and the NYSE (New York Stock Exchange) and classified them into different industries based on the SIC and the actual classification from Yahoo Finance. Appendix A shows the 30 companies and their corresponding industries. These 30 companies, and their products and services, are mentioned from time to time in the almost 200 million tweets that we collected. The value of the attribute “Geo” in the Twitter data set indicates the location from which a tweet was posted. Based on this attribute, we retained for experimentation only those tweets that were posted in the U.S., as the 30 selected companies are all listed on NASDAQ or the NYSE. In addition, we removed all non-English tweets from the dataset. In order to better analyze tweet sentiment, we used the Porter Stemming Algorithm to stem the words in the collected tweets. As casual and informal language is typically used in tweets, and as the Porter stemmer conveniently provides all the functions needed to remove common morphological and inflexional endings from English words, it is an appropriate choice for the last step of our preprocessing phase.

Table 4
Categories of sentiment.

Positive+   Positive   Neutral   Negative   Negative−
D(T)=2      D(T)=1     D(T)=0    D(T)=−1    D(T)=−2

According to the five predefined categories of sentiments, viz., Positive+ , Positive, Neutral, Negative and Negative− , we set different numeric degrees D(T) for these five categories of sentiments, as shown in Table 4, to represent the degree of each tweet’s sentiment. Typically, within a certain period of time, more than one tweet (=n) that mentioned the keyword can be found. If we set the time period to be “one day” and then compute the daily average sentiment degree of a selected company so as to determine public opinions about this company, some meaningful results can be obtained. The method to compute the daily average sentiment degree is shown as follows.

$$
D(T_D) = \frac{D(T_1) + D(T_2) + \dots + D(T_n)}{n} \tag{15}
$$
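Formula (15), together with the degrees in Table 4, amounts to a per-day mean, as the following minimal sketch shows. The function name and the three sample tweets are hypothetical.

```python
from collections import defaultdict

# Numeric degrees D(T) from Table 4.
DEGREE = {
    "Positive+": 2,
    "Positive": 1,
    "Neutral": 0,
    "Negative": -1,
    "Negative-": -2,
}


def daily_average_degree(tweets):
    """tweets is an iterable of (day, sentiment category) pairs.

    Returns a dict mapping each day to the mean degree of that day's tweets.
    """
    by_day = defaultdict(list)
    for day, category in tweets:
        by_day[day].append(DEGREE[category])
    return {day: sum(values) / len(values) for day, values in by_day.items()}


sample = [
    ("2011-10-17", "Positive+"),
    ("2011-10-17", "Negative"),
    ("2011-10-18", "Neutral"),
]
averages = daily_average_degree(sample)
```

The result always lies between −2 and 2, which matches the range of the sentiment axis used when plotting against the stock price.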

Once the daily average sentiment degree is computed for each selected company, we can investigate whether there is any association between this measure and the movement of the stock price. When the two are plotted on the same chart, with one axis representing the numerical daily average sentiment degree, which ranges from −2 to 2, and the other the time period, which ranges from 1 to n, there appears to be an association between them. In Fig. 5, we show the daily sentiment degrees together with the actual stock price trends of two listed companies, Amazon and Colgate-Palmolive, from 2011.10.17 to 2011.11.04, where the stock prices are daily closing values obtained from Yahoo! Finance. In order to determine whether there indeed exists an association between the 30 selected companies’ sentiment trends and


Fig. 5. Daily trends showing association between sentiment and stock price movement.

their stock price movement, we constructed an attribute table for each selected company, as in Table 5. Using SMeDA-SA with these attribute tables, we attempted to predict the movement of the stock price in terms of the following five conditions: up+, up, flat, down and down-. As we do not expect positive or negative sentiment towards a company and the products and services it provides to be reflected immediately in its stock price, we try to determine the delay, if any, after which the association between sentiment and stock price movement is strongest. For this reason, as shown in Table 6, T+X in the first row means that we test whether the sentiment expressed in tweets on day T has any association with the rise and fall of the stock price X days later. Our experimental results show that, in general, for the 30 selected companies, the association between sentiment and the stock price is strongest at T+3, where the average prediction accuracy of stock price movement is highest. We also noted that, among all the industries, with the time lag set to T+3, the stock price movements of the selected companies from the IT industry give the highest prediction accuracy, 76.12%. In contrast, predictions for manufacturing companies are the least accurate, at 52.94% on average.

Given that there seems to be a delay, or time lag, of several days before the stock market responds to public sentiment, one may be interested in how long this time lag is. This delay may well be a result of people taking actions to either buy or sell a particular stock. It may also be due to delay in the placing of a trade. In a stock market, the trading date refers to the day on which the customer’s trade is executed. Once a trade is placed, the customer’s order goes to the stock exchange and a trade confirmation is posted to the customer’s account immediately. Under the regulations of the American stock market, the settlement date is 3 days after the trading date. This may well be consistent with the time lag of T+3 giving more accurate predictions than the other time lags.
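The T+X comparison can be illustrated with a small sketch that pairs the sentiment on day T with the price movement on day T+X and counts how often their signs agree. This is not SMeDA-SA: it uses a naive sign rule as the predictor, reduces the five movement conditions to up, flat and down, and assumes that the movement on a day is its closing price relative to the previous trading day. All names and numbers are hypothetical.

```python
def sign(value):
    """Return 1, 0 or -1 according to the sign of value."""
    return (value > 0) - (value < 0)


def lag_accuracy(sentiment, closes, lag):
    """Share of days T on which sign(sentiment[T]) matches the price move on day T+lag.

    sentiment[t] is the daily average sentiment degree on trading day t;
    closes[t] is the closing price on trading day t.
    """
    hits = total = 0
    for t, degree in enumerate(sentiment):
        day = t + lag
        if day < 1 or day >= len(closes):
            continue  # no previous close, or beyond the end of the series
        move = sign(closes[day] - closes[day - 1])
        hits += sign(degree) == move
        total += 1
    return hits / total if total else 0.0


# Toy series in which the price follows sentiment with a one-day delay.
sentiment = [1.0, -0.5, 0.8, -1.2, 0.3]
closes = [10, 11, 10, 12, 11]
by_lag = {lag: lag_accuracy(sentiment, closes, lag) for lag in range(3)}
```

Scanning `by_lag` for the largest value mirrors how the best time lag is selected when comparing T+0 to T+4.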

Table 6
Results on prediction accuracy.

Time lag    T+0      T+1      T+2      T+3      T+4
Accuracy    57.42%   59.38%   61.29%   66.48%   61.35%

Industry    IT (T+3)   Finance (T+3)   Manufactory (T+3)   Medicine (T+3)   Media (T+3)   Energy (T+3)
Accuracy    76.12%     70.75%          52.94%              61.89%           73.78%        63.37%

Fig. 6. Actual stock price vs. predicted stock price for Amazon at T+3.

Fig. 6 shows a graph of the normalized stock price movement as predicted by SMeDA-SA versus the actual stock price trend when trained on T+3, using Amazon as an example. The association between public sentiment and the stock price movement of Amazon at T+3 appears to be rather strong. Using attribute tables similar to the one shown in Table 5, we also compared the prediction accuracy of the price movement obtained using SMeDA-SA with that of three other popular classical data mining algorithms: the Naïve Bayesian classifier [26], the Support Vector Machine (SVM) [27] and C4.5 [28]. For the Naïve Bayesian classifier, let each tweet be represented as $x = \{x_1, \dots, x_j, \dots, x_n\}$, where $x_j$ represents an attribute value of the tweet. The probabilities $P(c_1 \mid x), \dots, P(c_j \mid x), \dots, P(c_r \mid x)$ are then calculated. If $P(c_k \mid x) = \max\{P(c_1 \mid x), \dots, P(c_j \mid x), \dots, P(c_r \mid x)\}$, then $x \in c_k$. For SVM, we use the

Table 5
Attribute table.

LD (Apple.Inc)   E1                                  E2                                  …   En                                  Ec
                 iPhone_color                        iPad_weight                         …   iTunes_service                      Stock price categories
S1   Day1        Sentiment possibilities for LD1,1   Sentiment possibilities for LD1,2   …   Sentiment possibilities for LD1,n   V(Ec1)
S2   Day2        Sentiment possibilities for LD2,1   Sentiment possibilities for LD2,2   …   Sentiment possibilities for LD2,n   V(Ec2)
…    …           …                                   …                                   …   …                                   …
Sm   Daym        Sentiment possibilities for LDm,1   Sentiment possibilities for LDm,2   …   Sentiment possibilities for LDm,n   V(Ec3)


Fig. 7. Evaluation of predictive accuracy of different algorithms.

Table 7
Comparisons with other algorithms.

Prediction accuracy/industry   SMeDA-SA   SVM      C4.5     Naïve Bayesian
IT                             76.12%     70.75%   67.63%   65.02%
Manufactory                    52.94%     55.32%   51.48%   53.13%
Media                          73.78%     71.81%   63.84%   62.22%
Average                        66.48%     63.34%   60.15%   59.79%

popular linear kernel. In order to handle classification involving multiple classes, SVM requires a classifier for each class. Here, we use the one-versus-the-rest method to decompose the problem into several binary classification problems before we train the SVM. For experimentation purposes, we selected several conditions from our dataset as the training data and summed the degrees of sentiment of the tweets as the set E. We also defined the set of classes $C = \{c_1, \dots, c_j, \dots, c_r\}$, where $c_i \in$ {Positive+, Positive, Neutral, Negative, Negative−}, which was used for Naïve Bayesian, SVM and C4.5. Table 7 shows that SMeDA-SA achieves the best prediction results on average. For the manufacturing industry, however, SVM shows the best prediction accuracy (Fig. 7). The nature of the experimental data strongly influences the performance of an algorithm. For instance, the lowest average prediction accuracy is obtained with the Naïve Bayesian classifier, which we attribute to the large amount of noisy and ambiguous data presented to this algorithm. Lacking the capacity to handle multi-layered attributes, C4.5 performs badly because a considerable proportion of values are missing for some attributes, owing to the inherent nature of social media data. This may affect the prediction accuracy because some information on the bottom layers is mistakenly ignored when the higher layer of attributes is empty.

5. Conclusion and future works

Based on a review of recent work on analyzing social media data, especially on mining public opinion, this study has identified and handled two important research problems, with satisfactory experimental results, via the proposed methodology SMeDA-SA. More specifically, SMeDA-SA defines the attribute relationships embedded in social media as a concept map graph with several layers, in which the top-layer and intermediate-layer attributes have direct relations (belonging), and the bottom-layer and intermediate-layer attributes have indirect relations (describing). We combine both direct and indirect division operations with chi-square test results and adjusted residuals. As a result, the proposed process for building the concept map increases the dimensionality of the attributes, which makes it possible to measure the missing and embedded information in social media data. Our experimental results demonstrate that SMeDA-SA achieves a higher average prediction accuracy (66.48%) than SVM (63.34%), C4.5 (60.15%) and the Naïve Bayesian classifier (59.79%).

This research has two important practical implications. On the one hand, SMeDA-SA has better prediction performance in certain industries, such as IT and media. This knowledge should help these companies to effectively manage or promote products and brands via sentiment management in social media. On the other hand, our study indicates that SMeDA-SA performs best when using current tweets’ sentiment to predict the stock price three days later. This suggests that a 3-day interval is the best period over which to evaluate the efficacy of event management in social media.

In addition, this study opens up several opportunities for future research. Firstly, our research only considers the daily or weekly closing values of the stock price. For the purpose of prediction and of obtaining actual investment income, the opening, highest and lowest values of a time series are as important as the closing values. We suggest that future research include these values in the analyses. Secondly, according to our experimental findings, using current public sentiment to predict the stock market price 3 days later (i.e., at T+3 days) gives the best prediction accuracy. However, we selected only trading days for our analysis and ignored non-trading days such as holidays and weekends. We believe that public sentiment keeps accumulating on non-trading days and can therefore cause a timeline gap in SMeDA-SA. These problems may potentially be addressed by increasing the complexity of the stock market data used as the prediction target, adding more attributes to the analyses and enriching SMeDA-SA. Moreover, the dataset in our study consists of Twitter data. Most tweets are very short, and some of them are irrelevant to our research context. As we calculated the daily sentiment degree of the selected listed companies from each tweet record, the very short and meaningless messages reduced the accuracy of the sentiment classification algorithm and of the stock market price prediction. To address this problem, we suggest adding other social media data, such as those from Facebook, which may include longer texts, in order to enhance the prediction accuracy.

Appendix A

See Table A1.

Table A1
30 Listed companies analyzed in this study.

No.   Industry      Stock Code
1     Medicine      NASDAQ:MYL
2     Medicine      NASDAQ:PRGO
3     Medicine      NYSE:MCK
4     Medicine      NYSE:PFE
5     Medicine      NYSE:JNJ
6     Manufactory   NYSE:GE
7     Manufactory   NYSE:HON
8     Manufactory   NYSE:BA
9     Manufactory   NASDAQ:GLNG
10    Manufactory   NASDAQ:FELE
11    Media         NASDAQ:LBTYA
12    Media         NYSE:NLSN
13    Media         NYSE:NYT
14    IT            NYSE:IBM
15    IT            NASDAQ:AMZN
16    IT            NASDAQ:QCOM
17    IT            NASDAQ:MSFT
18    Energy        NYSE:SNP
19    Energy        NYSE:PTR
20    Energy        NYSE:TOT
21    Energy        NYSE:XOM
22    Energy        NYSE:BP
23    Media         NASDAQ:NWSA
24    Finance       NYSE:BRK.A
25    Finance       NASDAQ:SIVB
26    Finance       NASDAQ:TROW
27    Finance       NYSE:LFC
28    Finance       NYSE:WFC
29    Media         NASDAQ:CMCSA
30    IT            NASDAQ:AAPL

Note: We selected 30 highly popular listed companies from the NASDAQ and the NYSE (New York Stock Exchange) and classified them into different industries based on SIC (Standard Industrial Classification) and the actual classification from Yahoo Finance.

References

[1] https://en.wikipedia.org/wiki/Twitter.
[2] L. Cheng, C.W. Raymond, Viral marketing for dedicated customers? Inf. Syst. (2014) 1–23.
[3] K. Younghoon, S. Kyuseok, TWILITE: a recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation, Inf. Syst. (2014) 59–77.
[4] S. Milgram, The small world problem, Psychol. Today 1 (1967) 61–67.
[5] M. Joshi, D. Das, K. Gimpel, N.A. Smith, Movie reviews and revenues: an experiment in text regression, in: Proceedings of NAACL-HLT, 2010.
[6] S. Asur, A. Huberman, Predicting the future with social media, CoRR abs/1003.5699 (2010).
[7] J. Bollen, H. Mao, X. Zeng, Twitter mood predicts the stock market, J. Comput. Sci. 2 (2010) 1–8.
[8] A. Mittal, A. Goel, Stock Prediction Using Twitter Sentiment Analysis, Project Report, Stanford, 2011, available at https://pdfs.semanticscholar.org/4ecc/55e1c3ff1cee41f21e5b0a3b22c58d04c9d6.pdf.
[9] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, 2001.
[10] L. Motiwalla, M. Wahab, Predictable variation and profitable trading of US equities: a trading simulation using neural networks, Comput. Oper. Res. 27 (2000) 1111–1129.
[11] E. Fama, The behavior of stock-market prices, J. Bus. 38 (1965) 34–105.
[12] E. Fama, Efficient capital markets: II, J. Financ. 46 (1991) 1575–1617.
[13] T. Leung, H. Daouk, A. Chen, Forecasting stock indices: a comparison of classification and level estimation models, Int. J. Forecast. 16 (2000) 173–190.
[14] B. Jansen, M. Zhang, K. Sobel, A. Chowdury, Twitter power: tweets as electronic word of mouth, J. Am. Soc. Inf. Sci. Technol. 6 (11) (2009) 2169–2188.
[15] D. Gruhl, R. Guha, R. Kumar, J. Novak, A. Tomkins, The predictive power of online chatter, in: SIGKDD Conference on Knowledge Discovery and Data Mining, 2005.
[16] R. Sharda, D. Delen, Predicting box-office success of motion pictures with neural networks, Expert Syst. Appl. 30 (2006) 243–254.
[17] D.M. Pennock, S. Lawrence, C.L. Giles, F.A. Nielsen, The real power of artificial markets, Science 291 (2001) 987–988.
[18] K. Chen, L.R. Fine, B.A. Huberman, Predicting the future, Inf. Syst. Front. 5 (2003) 47–61.
[19] W. Zhang, S. Skiena, Improving movie gross prediction through news analysis, in: Proceedings of the IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, In Web Intelligence, vol. 1, 2009, pp. 301–304.
[20] R.P. Schumaker, H. Chen, Textual analysis of stock market prediction using breaking financial news: the AZFinText System, ACM Trans. Inf. Syst. 27 (2009) 1–19.
[21] G. Salton, A. Wong, C. Yang, A vector space model for automatic indexing, Commun. ACM 18 (1975) 613–620.
[22] A. Esuli, F. Sebastiani, SENTIWORDNET: A high-coverage lexical resource for opinion mining, Technical Report 2007-TR-02, 2007.
[23] G. Salton, H. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc, New York, NY, USA, 1986.
[24] C.C. Chan, K.C. Wong, K.Y. Chiu, Learning sequential patterns for probabilistic inductive prediction, IEEE Trans. Syst. Man Cybern. 24 (10) (1994) 1532–1547.
[25] B. Guibert, J. Laganier, M. Volle, An Essay on Industrial Classifications, Économie et Statistique, 1971.
[26] P. Domingos, M. Pazzani, On the optimality of the simple Bayesian classifier under zero-one loss, Mach. Learn. 29 (1997) 103–130.
[27] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002) 389–422.
[28] R. Quinlan, C4.5: programs for machine learning, Morgan Kaufmann Ser. Mach. Learn. (1992).
[29] C. Oh, O. Sheng, Investigating predictive power of stock micro blog sentiment in forecasting future stock price directional movement, in: Proceedings of the 32nd International Conference on Information Systems, 2011.
[30] N. Oliveira, P. Cortez, N. Areal, On the predictability of stock market behavior using stocktwits sentiment and posting volume, Progress. Artif. Intell. (2013) 355–365.
[31] H.K. Sul, A.R. Dennis, L. Yuan, Trading on twitter: The financial information content of emotion in social media, in: Proceedings of the 47th Annual Hawaii International Conference on System Sciences, 2014, pp. 806–815.
[32] L. Liu, J. Wu, P. Li, Q. Li, A social-media-based approach to predicting stock comovement, Expert Syst. Appl. 42 (8) (2015) 3893–3901.
[33] T. Nguyen, K. Shirai, Topic modeling based sentiment analysis on social media for stock market prediction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.
[34] E. Bartov, L. Faurel, P. Mohanram, Can Twitter help predict firm-level earnings and stock returns?, Rotman School of Management Working Paper, Available at SSRN 2631421, 2015.