Associative topic models with numerical time series

Associative topic models with numerical time series

Information Processing and Management 51 (2015) 737–755 Contents lists available at ScienceDirect Information Processing and Management journal home...

4MB Sizes 6 Downloads 222 Views

Information Processing and Management 51 (2015) 737–755

Contents lists available at ScienceDirect

Information Processing and Management journal homepage: www.elsevier.com/locate/infoproman

Associative topic models with numerical time series Sungrae Park, Wonsung Lee, Il-Chul Moon ⇑ Department of Industrial and Systems Engineering, KAIST, Yusung-gu, Daejon 305-701, Republic of Korea

a r t i c l e

i n f o

Article history: Received 11 December 2014 Received in revised form 22 May 2015 Accepted 15 June 2015 Available online 5 July 2015 Keywords: Time series analysis Topic models Text mining

a b s t r a c t A series of events generates multiple types of time series data, such as numeric and text data over time, and the variations of the data types capture the events from different angles. This paper aims to integrate the analyses on such numerical and text time-series data influenced by common events with a single model to better understand the events. Specifically, we present a topic model, called an associative topic model (ATM), which finds the soft cluster of time-series text data guided by time-series numerical value. The identified clusters are represented as word distributions per clusters, and these word distributions indicate what the corresponding events were. We applied ATM to financial indexes and president approval rates. First, ATM identifies topics associated with the characteristics of time-series data from the multiple types of data. Second, ATM predicts numerical time-series data with a higher level of accuracy than does the iterative model, which is supported by lower mean squared errors. Ó 2015 Elsevier Ltd. All rights reserved.

1. Introduction Probabilistic topic models are probabilistic graphical models used to cluster a corpus with semantics (Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2004; McCallum, Corrada-Emmanuel, & Wang, 2005), and they have been successful in analyzing diverse sources of text as well as image data (Gupta & Manning, 2011; Hörster, Lienhart, & Slaney, 2007; Khurdiya, Dey, Raj, & Haque, 2011; Ramage, Dumais, & Liebling, 2010). The essence of the topic models is the probabilistic modeling of how to generate the words in documents, given the prior knowledge of word distributions per key ideas, called topics, in the corpus. Specifically, a topic is a probability distribution over a vocabulary, and the topic models assume that a document is a mixture of multiple topics. These ideas are implicit because they cannot be directly observed, so some describe these topics as latent topics (Blei, 2012). The latent Dirichlet allocation (LDA) (Blei et al., 2003), one of the popular topic models, is the Bayesian approach to modeling such generation processes. Given the success of LDA, many expansions of LDA are introduced, and they strengthen the probabilistic process of generating topics and words with additional information. One type of such additional information is the meta-data of documents. For instance, the author-topic model (Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004) includes the authorship model to produce additional estimations on the authorship as well as better topic modeling results. Another type of information is the prior information on the corpus characteristics. For instance, the aspect-sentiment unification model (Jo & Oh, 2011) utilizes the sentiment information of words as priors to estimate the sentiment of topics. While the expansion is motivated by the additional information, such expansion is often realized by either adding additional variables in the graphical model or calibrating prior settings in the inference process. ⇑ Corresponding author. Tel.: +82 42 350 3158. E-mail addresses: [email protected] (S. Park), [email protected] (W. Lee), [email protected] (I.-C. Moon). http://dx.doi.org/10.1016/j.ipm.2015.06.007 0306-4573/Ó 2015 Elsevier Ltd. All rights reserved.

738

S. Park et al. / Information Processing and Management 51 (2015) 737–755

While most variations of LDA augment the data, such as sentiment lexicons and authorships, from either text data or the meta-data of the corpus, several models relate text data to other types of data, such as geospatial data (Sizov, 2012) and document-level labels (Blei & McAuliffe, 2007; Chen, Zhu, Kifer, & Lee, 2010; Ramage, Hall, Nallapati, & Manning, 2009). This paper considers integrating texts and numerical data over time. We assume that there are events that become the common cause of dynamics in the text and the numeric data. Then, we are interested in identifying the representations of such events which are not directly observable from the corpus and the numbers. For instance, there are latent events, such as interest rate changes or policy changes, in the stock markets that cause the generation of news articles and numerical data. Our goal is to represent such latent events with word distributions per events. The key to identifying such events is the correlation between texts and numeric variables associated with the events. Such correlations between numerical and textual data are common in product sales (Liu, Huang, An, & Yu, 2007), candidate approval (Livne, Simmons, Adar, & Adamic, 2011), and box-office revenues (Asur & Huberman, 2010). In these domains, understanding why we see such numerical fluctuation with the textual context can provide a key insight, i.e., finding a topic that resulted in the huge drop of a stock index. This paper introduces a new topic model called an associative topic model (ATM), which correlates the time-series data of text and numerical values. We define an associative topic as a soft cluster of unique words whose likelihood of cluster association is influenced by the appearances on documents as well as the fluctuation of numeric values over-time. The association likelihood of words to clusters, or associative topics, is inferred to maximize the joint probability including factors of text generation and numeric value generation by the proportion of topics over-time. The topic proportion is the likelihood of the cluster, or topic, appearance on a certain time, which means that our data consists of timed batch datasets of texts and numbers over the period. In other words, The model assumes that the text and the numerical value at the same timed-batch are generated from a common latent random variable of topic proportions, which indicate the relative ratio of topic appearances at the time (Blei & Lafferty, 2006; Putthividhy, Attias, & Nagarajan, 2010). For instance, a topic on tax could be strongly related to and influencing to the generation of economic articles as well stock prices, for example. Then, we interpret that increasing the proportion of tax topic is the event influencing the numerical-data and text-data generation. This assumption enables the numerical value to adjust the topic extraction from texts through its movements. Fig. 1 describes an example of an associative topic. The associative topic provides summarized information about text-data, and its appearance over time is highly related to the numerical time-series data. Such associative topics are useful in interpretations and predictions. For the interpretation, the associative topics hold more information about the numerical context than do the topics from only texts. Also, the associative topics enable the prediction of the numerical value of the next time step with high accuracy. Fig. 2 explains the inputs and the outputs of ATM. 
A time-series numerical datum, y1:T , and time-series text datum, D1:T are input data for ATM. ATM provides associated topics with correlation indexes and topic proportions over time indicating the dynamics of topic appearances in text data. Additionally, ATM can predict the next time numerical value, yTþ1 , with the next time text data, DTþ1 . Inherently, the proposed model infers associative topics that are adjusted to better explain the associated numerical values. When this paper applies the model to the three datasets, the economic news corpus and the stock return from the Dow Jones Industrial Average (DJIA), the same news corpus and the stock volatilities from the DJIA, and the news corpus related to the president and president approval index, we observe that the model provides adjusted topics that explain each numerical value movement. Associative topics extracted from the proposed model have a higher correlation with time-series data when there is a relationship between the text and the numerical data. Also, the model becomes a forecasting model that predicts the next time variable by considering the past text and numerical data. We experimented with the model to predict the numerical time-series variable, and we found that the proposed model provides topics better adapted to the time-series values and better predicts the next time numerical value than do the previous approaches. The contributions of the paper are summarized as follows: – We introduced a model to analyze a bi-modal dataset including text and numerical time-series data, which have different characteristics. This is the first unified topic model for the numerical and text bimodal dataset without batch-processing two different models for each data type.

Fig. 1. An example of an associative topic.

S. Park et al. / Information Processing and Management 51 (2015) 737–755

739

Fig. 2. Inputs and outputs of associative topic model (ATM).

– In the parameter inference of the presented model, we introduced a method for estimating the expectation of the softmax function with Gaussian inputs, which was inevitable in the process of relating the numerical and the text data. This parameter inference technique is applicable to diverse cases of joining two datasets originating from the continuous and discrete domains. 2. Previous work There are several attempts to integrate side information with topic models. Basically, three kinds of side information have been considered: document-level features (i.e., categories, sentiment label, hits, etc.), word-level features (i.e., term volume), and corpus-level features (i.e. product sales, stock indexes). For document-level features, several topic models (Blei & McAuliffe, 2007; Ramage et al., 2009; Zhu, Ahmed, & Xing, 2012) are suggested to incorporate features of a document with the topic information of the document. In order to incorporate them, either the conditional latent topic assignment or the hyper-parameters of conditional distribution is used. For word-level features, a fully generative model (Hong, Yin, Guo, & Davison, 2011) was developed to incorporate features of a word with the significance of the word in topics. More specially, the model integrates a temporal topic model and temporal term-volumes, which is a sort of time series. This work is similar to our work in modeling time-series variables. However, they focused on the prediction of word volumes in text, while we considered time-series numerical variables from a different source not directly related to text data. Also, they assumed topic representation would be helpful for predicting word volumes, while we assumed that topic proportions in a period of time are related to time-series numerical values of the different source. For corpus-level features, which are our viewpoints of external time series, there were several attempts (Bejan & Harabagiu, 2008; Chen, Li, Xu, Yang, & Kitsuregawa, 2012; Kim et al., 2013) to analyze relationships between features in a corpus at time t and topic proportions of the corpus. Unlike the works for document-level and word-level features, they proposed no fully generative models, and therefore, these feature values cannot be easily inferred. While there is no unified model of corpus-level numerical values and texts over time, there have been efforts to estimate topics related to time series variables by iteratively utilizing multiple statistical models. Iterative Topic Modeling Framework with Time Series Feedback (ITMTF) (Kim et al., 2013) creates a feedback loop between a topic model and a correlation model by utilizing the prior distributions of the topic model as bridging variables. However, the batch style of applying multiple models may have two shortcomings. First, the topic model and the correlation model have two different joint probabilities that are not shared in their inference processes, so the inference to increase one joint probability might not be the right direction to increase the other joint probability. Second, the underlying assumptions might conflict. Each statistical model has underlying assumptions of probability distributions, and the statistical models set up the inference process based upon the assumptions. When chaining these models, the two assumptions regarding the shared variable can be different. 
Because of these potential shortcomings, we created a unified model of numerical values and texts as a single probabilistic graphical model instead of chaining two different statistical models. Some previous studies illustrate a prediction of a numerical time-series data with text topics. For example, some researches (Chen et al., 2012; Mahajan, Dey, & Haque,

740

S. Park et al. / Information Processing and Management 51 (2015) 737–755

2008; Si et al., 2013) used a topic model to analyze themes in corpus, and then they inferred a prediction model, i.e., a linear regression, by applying the identified topics as observed variables. These methods held similar shortcomings to those we described above. 3. Associative topic models with numerical time series This section provides the description of our proposed model, ATM, used to infer the associative topics jointly influenced by the texts and the numeric values. Fundamentally, ATM is a model of Bayesian network, so the first subsection provides its probabilistic graphical model and its independence/causality structure between modeled variables. The first subsection compares and contrasts ATM to its predecessor model, DTM; and the subsection provides a description on the generative process of texts and numeric values from the topic proportion, which is a standard model description in the generative modeling community. The second subsection explains the inference of modeled latent variables with the observed data by following the structure of ATM. We used the variational inference to optimize the parameters of the latent random variables. The third subsection shows the formula to calculate the parameters which were not inferred in the variational inference process. The fourth subsection shows how to turn ATM, which is intrinsically a soft cluster model, into a model predicting the numerical value of the next time. 3.1. Model description Here, we describe an ATM that learns associative topics from time-series text data regarding time-series numerical values. Our model captures the trajectory of topic proportions over time, and topics in the trajectory are correlated with a numerical variable over time. The model can be considered a combination of a Kalman filter and LDA. A similar model, which motivated ATM, is the dynamic topic model (DTM) (Blei & Lafferty, 2006), which explores the dynamics of topics, b1:T , and topic appearances over time, a1:T . Fig. 3 presents DTM and ATM in graphical representations. The difference between DTM and ATM lies in: (1) adding the numerical time-series variable, y1:T , and (2) simplifying the word distribution by topics, b. The first difference is the existence of numerical time-series values influenced by the latent variables of the corpus. The objective of the proposed model requires the inclusion of the numerical time-series variables, so we made a causal link from the topic proportions, a1:T , to the numerical variables, y1:T , by assuming that the latent topic from the corpus would generate the numerical values as well as the words in the corpus. The second difference is the simplified word distributions of topics, b, compared to DTM because we wanted to see a single set of topic words over the period for a simplified insight. If we utilize ATM in the practical field, i.e., stock analyses, the presentation of the dynamic changes in the topic words might not be intuitive to non-experts. If we support the dynamic changes of topic words, it might increase the accuracy on the prediction result of the numerical values, but this would

Fig. 3. (left) Dynamic topic model and (right) associative topic model.

741

S. Park et al. / Information Processing and Management 51 (2015) 737–755

decrease the clarity of the topic interpretation representing the overall period. Additionally, there is a potential risk of overfitting topics to the numeric values. From now, we describe the details of the assumptions of ATM in Fig. 3. ATM has additional y variables from DTM. In DTM and ATM, a is modeled by the Gaussian distribution to express the dynamics of topic appearances over time. Then, a is linked to the topic proportions of documents, h, that have the same distribution type:

at jat1  N ðat1 ; r2 IK Þ

ð1Þ

ht;d jat  N ðat ; a2 IK Þ

ð2Þ

where IK is K-dimensional identity matrix. In the above, each co-variance matrix is modeled as a scalar matrix to reduce the computational cost. A topic of each word, z, in a document is assigned in accordance with the topic proportion of the document:

zt;d;n jht;d  Multiðpðht;d ÞÞ

ð3Þ

In Formula (3), n is the number of words in the d document at time t. To turn the Gaussian distribution into a prior of multinomial distribution, we used a softmax function, p(h), as defined below.

expðhk Þ

pk ðhÞ ¼ PK

i¼1

ð4Þ

expðhi Þ

Another interpretation of using the softmax function is that the scale of h is ignored and relative differences within h are used to decide a topic for a word. Finally, each word is generated from topic distributions, b1:K , over words from a vocabulary of size V. Up to this point, the modeling approach of ATM is very similar to that of DTM. The main idea of ATM is on generating the numerical time-series variables, y, from the prior information of a. In order to use the same information of the topic assignments on words for predicting y, the numerical time-series variables are generated from a through the same softmax function in Formula (4). The equation below specifies the modeled causality between y and a. T

yt jat  N ðb

pðat Þ; d2 Þ

ð5Þ

where b is a vector of K-dimensional linear coefficient parameters. Formula (5) is the key mechanism that propagates the influence from the topic proportion, at , to the numeric values, yt . We model the relation between the topic proportion and the numeric values as a linear combination, fundamentally. T

However, our model is different from a typical Gaussian distribution with conditional information, i.e. N ðb at ; d2 Þ, because we apply a softmax function, pðat Þ, to the random variable of the topic proportion, at . Therefore, this paragraph clarifies the application of the softmax function in the generation process of numeric values. The topic proportion, at , itself is modeled as a multivariate Gaussian distribution as Formula (1). This suggests that the value has an infinitely long tail in a direction of the modeled dimension. Hence, there is an overfitting risk to infer a value, or an element of at , from such long tail while learning the parameters from the text side. Then, this overfitted part of, at , will distort the inference of b, because both will be combined as a linearly weighted sum if there is no other precaution made. To limit such extreme influence from the text to the numeric values, we use a soft-maxed topic proportion, which is pðat Þ, instead of the simple linear combination. The soft-max function has a finite range with continuous activation functionality. Thus, the topic proportion will activate the influence to the numeric value, but the influence will be limited, or squashed in a casual statistics term, to the level of the activation. Additionally, Formula (5) has an interesting variable to control, d2 , in its variance term. The variance of d2 affects the equation as a strength factor for fitting the numerical values to the topic proportions, at . Learning d2 means that ATM will appreciate and admit such amount of variance in yt from the expectation of yt determined by the linear combination of b and at . If we limit the variance level, d2 to the small amount, the inferred coefficients corresponding to each associative topic will be forced to over-fit the numeric value movement, and this will eventually provide too strong bias in extracting the parameters in the topic extraction of the text modeling part. On the other hand, the high variance of d2 will provide no guidance to the extracted topics, so this will turn the resulted topics to the ordinary topics without the guidance from the numeric values. Therefore, we experiment two versions of ATM. The first version is fixed-ATM which fixes the variance d2 as a very small value to maximize the fitness to the numeric values by accepting the possibility of over-fitted associative topics. The second version is an ordinary ATM which infers the variance d2 without any fixed point, so the extracted topics could be more meaningful by avoiding the over-fitness to the numeric values (see Formula (23)). Such adjustment of the influence strength from the topic proportion to the numeric value is experimented by comparing the result of fixed-ATM and ATM in the result section. Fig. 4 is the generative process summarizing the assumptions of ATM. As we mentioned, the differences from DTM are (1) the generative process on numerical time-series value, yt , and (2) the unified topics, b ignoring dynamic changes by time.

742

S. Park et al. / Information Processing and Management 51 (2015) 737–755

Fig. 4. Generative process of the associative topic model (ATM).

3.2. Variational inference Due to the non-conjugacy between distributions in ATM, the posterior inference becomes intractable. We employed the variational method (Jaakkola, 2001; Jordan, Ghahramani, Jaakkola, & Saul, 1998) to approximate the posterior in the ATM. The idea behind the variational method is to optimize the free parameters of a distribution over the latent variables by minimizing the Kullback–Leibler (KL) divergence. In ATM, the latent variables are: (1) corpus-level latent states of topic proportions, at , (2) document-level latent states of topic proportions ht;d , and (3) topic indicators zt;d;n . Below are our factorized assumptions of the variational posterior, qða; h; zÞ.

^ 1:T Þ qða; h; zÞ ¼ qða1:T ja

Dt T Y Y t¼1

Ntd Y qðhtd jctd ; Ktd Þ qðztdn j/tdn Þ

! ð6Þ

n¼1

d¼1

^ 1:T ; ctd ; Ktd and /tdn are variational parameters of the variational posterior over latent variables. Factorized variational where a distributions of htd and ztdn are simply modeled, htd jctd ; Ktd  N ðhtd ; ctd ; Ktd Þ and ztdn j/tdn  Multið/tdn Þ. However, in the variational distribution of a1:T , we retained the sequential structure of the corpus topic appearance by positing a dynamic model ^ 1:T . In DTM, the variational Kalman filter model (Blei & Lafferty, 2006) was used to with Gaussian variational observations, a express the topic dynamics, and we adopted the model for estimating the dynamics of the topic proportions in the corpus over time. The key idea of the variational Kalman filter is that observations in the standard Kalman filter model are regarded ^ and the posterior distribution of the latent state in the standard Kalman filter as the variaas the variational parameters a ^ Þ. Our variational Kalman filter is as follows: tional distribution qðaja

a^ t jat  N ðat ; v^ Ik Þ

ð7Þ

at jat1  N ðat1 ; r2 Ik Þ

ð8Þ

Using the standard Kalman filter calculation, the forward mean and variance are given by

at ja^ 1:t  N ðat ; lt ; V t Þ

lt ¼ Vt ¼



v^ t



v^ t

ð9Þ





V t1 þ r2 þ v^ t

^

 ðV t1 þ r2 Þ

with an initial condition of fixed

l0 and fixed V 0 . The backward mean and variance are given by

at ja^ 1:T  N ðat ; l~ t ; V~ t Þ

l~ t ¼



vt l þ 1 a^ t V t1 þ r2 þ v^ t t1 V t1 þ r2 þ v^ t



r2



V t þ r2

 ~t ¼ Vt þ 1  V

ð10Þ



lt þ 1  r2

Vt þ r

r2 V t þ r2



l~ tþ1

2   ~ tþ1  ðV t þ r2 Þ V 2

~ t are drawn from variational Kalman filter calculations. We approximate the posterior ~ t and V In Formula (10), l ^ 1:T Þ. pða1:T jw1:T ; y1:T Þ using the state space posterior qða1:T ja Using these variational posteriors and Jensen’s inequality, we can find the lower bound of the log likelihood as follows:

S. Park et al. / Information Processing and Management 51 (2015) 737–755

743

ZZ X pðw; y; a; h; zÞ qða; h; zÞ log dhda qða; h; zÞ z

log pðw1:T ; y1:T Þ P

¼ Eq ½log pða1:T Þ þ

Ntd Dt Dt X T X T X X X Eq ½log pðhtd jat Þ þ Eq ½log pðztdn jhtd Þpðwtdn jztdn Þ t¼1 d¼1 n¼1

t¼1 d¼1 T X Eq ½log pðyt jat Þ þ Hðqða; h; zÞÞ þ

ð11Þ

t¼1

This bound contains four expectation terms related to the in-hand data. The first term is related to the latent states from both sources of data. The second and third expectation terms are related to the text data and easily found by the same expectation terms in DTM. The fourth term, Eq ½log pðyt jat Þ, is related to the continuous time-series data that we introduce. The first term on the right-hand side of Formula (11) is:

Eq ½log pða1:T Þ ¼

X

Eq ½log pðat jat1 Þ / 

t

TK 1 X 1 X ~ 1 ~ 0 Þ  TrðV~ T ÞÞ ~t  l ~ t1 k  log r2  2 kl TrðV t Þ  2 ðTrðV 2 2r t 2r r2 t

ð12Þ

using the Gaussian quadratic form identity

Em;V ½ðx  lÞT R1 ðx  lÞ ¼ ðm  lÞT R1 ðm  lÞ þ TrðR1 VÞ

ð13Þ

The second term of the right-hand side of Formula (11) is:

Eq ½log pðhtd jat Þ / 

K 1 1 ~ t k  2 ðTrðV~ t Þ þ TrðKtd ÞÞ log a2  2 kctd  l 2 2a 2a

ð14Þ

The third term is:

Eq ½log pðztdn jhtd Þpðwtdn jztdn Þ P

X X 1 X ctdk þKtdk =2 /tdnk ctdk  /tdnk log ^ftd þ e ^ftd i k k

! þ

X /tdnk log bk;wtdn

ð15Þ

k

introducing additional variational parameters f1:T;1:D due to the softmax function whose input variables are drawn from the Gaussian distribution. The closed form of expectation cannot be calculated, but the lower bound of the expectation can be found. This treatment of the lower bound of the expectation keeps the lower bound of log likelihood. The fourth term is: T X X 1X X 1 X Eq ½log pðyt jat Þ / 2 bk yt Eq ½pk ðat Þ  2 bi bj Eq ½pi ðat Þpj ðat Þ d 2d t t t¼1 i;j k

ð16Þ

It should be noted that the we applied the softmax function p to a because we modeled the regression from the discretized topic proportions to the numerical variables. The rationale behind the discretization is that what we see as topics from text should influence the regression, so we applied the same softmax treatment to h and a. While such discretization for topic extraction occurred in DTM to join the multinomial and the Gaussian distributions, this treatment is joining two Gaussian distributions, which is a difference between ATM and DTM. While this difference might seem trivial, this linkage of two Gaussian distributions with the softmax function invokes a non-trivial calculation of the expectation. In Formula (16), finding the lower bound with the variational parameters of the expectation terms is intractable due to non-concavity caused by opposite signs of the expectations (Bouchard, 2007). Additionally, the closed form of the expectation terms cannot be calculated exactly because of the difficulties in log-normality and the rate of two random variables. Hence, we used an approximation approach to calculate the local expectations using the Taylor expansions for rate expectation (Kendall, Stuart, & Ord, 2010; Rice & Papadopoulos, 2009). To the best of our knowledge, this is the first case of inferencing a probabilistic graphical model with approximate expectations of the softmax function with a Gaussian prior. The expectation of the simple softmax function is shown below.

!  X expðati Þ V~ t 2  mti þ mti ðe  1Þ Eq ½pi ðat Þ ¼ Eq P mtk  mti k expðatk Þ k 

ð17Þ

~ t Þ, is introduced. The joint expectations of the two softmax functions are finally In Formula (17), a new notation, mti ¼ pi ðl approximated as follows:

" ! 2 # X expðati Þ ~t ~t 2 2 V 2 V P  e mti þ mti ðe  1Þ 3 mtk  4mti Eq ½pi ðat Þ  ¼ Eq k expðatk Þ k 2

# ! X expðati þ atj Þ ~t 2 V  mti mtj þ mti mtj ðe  1Þ 3 mtk  2mti  2mtj Eq ½pi ðat Þpj ðat Þ ¼ Eq P ð k expðatk ÞÞ2 k

ð18Þ

"

ð19Þ

744

S. Park et al. / Information Processing and Management 51 (2015) 737–755

In Appendix A, we described the approximate method for finding the expectations of softmax functions. The last term of Formula (11) is the entropy term:

X 1X 1X log jV~ t j þ log jKtd j  /tdnk log /tdnk 2 t 2 t;d t;d;n;k

HðqÞ ¼

ð20Þ

With the above expectations terms, we can find an approximated lower bound of the log likelihood. The learning equations for variational parameters are described in Appendix B. 3.3. Learning model parameters Using variational distributions that are approximate posterior distributions over latent variables, we can find update equations for model parameters. When it comes to topic representations, b can be updated as follows:

bkv /

X

/tdnk Iðwtdn ¼ v Þ

ð21Þ

t;d;n

where Iðwtdn ¼ v Þ is a binary function that is equal to 1 when wtdn ¼ v or 0 otherwise. Variances of Gaussian noises to text data (a2 ) and numerical time-series data (d2 ) can be updated as follows:

1 a ¼ DK 2

( ) X 2 ~ ~ ðkctd  lt k þ TrðKtd Þ þ TrðV t ÞÞ

ð22Þ

t;d

( ) X T 1 X T ~ ~ t bÞ þ d ¼ ðyt  l b Vtb T t t 2

ð23Þ

We assumed that each document-level latent state of topic proportion h are drawn from the time-series latent state of topic proportions, a, with the scalar variance, a2 , while the numerical time-series variables are drawn from a linear combination of a with a variance, d2 . By learning these variations, we can find the degree of correlation between text data and numerical time-series data. A low value of d2 means two sources of data are strongly correlated and text data can be helpful to predict time-series variables. Conversely, a high value of d2 means two sources of data are not closely related to each other. If we set fixed variances at a low value, ATM tends to learn topics that have a high explanatory power about the trajectory of numerical time series in the training set. As noted before, the linear combination of rescaled a and b becomes the estimation of y. To infer b, we maximize the lower bound of its log-likelihood. The function below is the softmax function to infer the coefficient vector.



X Eq ½pðat ÞT pðat Þ

!1 Eq ½pðaÞT Y

ð24Þ

t

In Formula (24), Eq ½pðat ÞT pðat Þ is a K  K matrix whose elements are Eq ½pi ðat ÞT pj ðat Þ 8i; j; Eq ½pðaÞ is a T  K matrix whose elements are Eq ½pk ðat Þ, and Y is a numerical value vector over the period T. This update equation is similar to the Gaussian response of Supervised LDA (Blei & McAuliffe, 2007). 3.4. Prediction After all of the parameters are learned, ATM can be used as a prediction model of a future time-series variable (yTþ1 ) by observing new text data (wTþ1 ). The expectation of yTþ1 is formalized in the function below.

Eq ½yTþ1 jy1:T ; w1:Tþ1  ¼

X bk Eq ½pk ðaTþ1 Þjy1:T ; w1:Tþ1 

ð25Þ

k

In order to calculate the expectations of the softmax function for the new time-step, we infer the posterior distribution of

aTþ1 utilizing the posterior distribution of aT , documents Dtþ1 , and learned model parameters. This particular inference is based upon Formula (11) except for the fourth expectation terms. After enough iterations of variational inferences, the posterior distribution of (aTþ1 ) can be used to predict the numerical value of the next time step. 4. Empirical experiment This section demonstrates the utility of the ATM in both explaining and predicting the time-series values. We applied ATM to a financial news corpus and stock indexes as well as to a news corpus related with the president and president approval index. Section 4.1 shows the detailed description of the datasets. Sections 4.2 and 4.3 describe the overview of

S. Park et al. / Information Processing and Management 51 (2015) 737–755

745

our experiments and the baseline models, such as autoregressive models (AR), LDA, DTM, ITMTF. Through Sections 4.3, 4.4, 4.5 and 4.6, we discuss the results of the experiments. 4.1. Dataset We applied ATM in two domains, finance and presidential popularity. For the financial datasets, we used financial news articles from the Bloomberg and the stock indexes of the DJIA from January 2011 to April 2013 (121 weeks). Due to the different roles of the indexes, we chose the return, which indicates a profit on an investment, and the volatility, which is a measure for variation of price of a financial instrument. The 60,500 news articles were selected randomly; 500 articles were selected for each week. That is, we used two coupled datasets in the finance domain; one includes Bloomberg articles and the DJIA stock return, and the other includes the same Bloomberg articles and the DJIA stock volatility. For the presidential datasets, we used news articles searched with the name of the president, ‘Obama’ from The Washington Post and the weekly president approval index from January 2009 to June 2014 (284 weeks). The 27,582 news articles were selected randomly; fewer than 100 articles were selected for each week. We pruned vocabularies by removing stop-words and terms, which occurred in less than 2% of the documents. We set a time step in the experiment as a week. Table 1 describes the summarized information of the datasets. 4.2. Overview of experimental design Our experiments are categorized into qualitative and quantitative analysis from a large perspective, see Fig. 5. The qualitative analysis examines the outputs of ATM and fixed-ATM from three datasets with texts and numeric values over-time. Section 4.4 enumerates the associative topics extracted by ATM for a qualitative analysis, and the section contrasts the associative topics to the topics without any such guidance, i.e. the topics from LDA and DTM. Section 4.5 shows the results of the fixed-ATM, and the section shows the associative topics which are biased to overfit the numeric values. Since the extracted topics of fixed-ATM are meant to be only dependent variables for the numeric value prediction, we omit the further analysis on the quality of the fixed-ATM topics. If users values the interpretation of extracted associative topics guided by the numeric values, they are recommended to use ATM instead of fixed-ATM. Section 4.6 provides the quantitative evaluations on the numeric value prediction by measuring the mean squared error and the mean absolute error, and the section includes both models for texts and numeric values as well as models for numeric values only, i.e. AR. Section 4.7 investigated the predictive log-likelihoods that is a quantitative measure in explaining unseen data with a model and a historic dataset. 4.3. Baseline To compare performances of the proposed models, we applied four baseline models. The autoregressive model (AR) is only for analyzing numerical time series, so we compared its numerical value prediction performance with ours. Since other topic models are only for analyzing text data, we applied linear regression models to joint numerical time-series values. The short descriptions of baselines are below. 4.3.1. Autoregressive models We could analyze continuous time-series values with the AR, which is a traditional time-series analysis tool. In a univariate autoregressive model AR(p), the variable yt depends on its previous values from yt1 to ytp . 
In this model, selecting lag values, p, is important regarding causalities with past information, and we optimized p empirically. 4.3.2. Latent Dirichlet allocation The basic topic model, LDA, was applied. We firstly analyzed topics in the total corpus regardless of time steps, then ðdÞ

extracted a topic in each document, hk . We could compute time-level topic proportions, tctk , from them as normalizing docP ðdÞ 1 ument topic proportions over time: tctk ¼ NðD d2Dt hk , where NðDt Þ is the number of documents at time t. Then, a simple tÞ linear regression was applied to evaluate topic association with numerical time-series values.

Table 1 Dataset descriptions. –

Bloomberg

Washington P.

Total docs. # of unique words Period Total time batch Numerical time series

60,500 1447 2011.1.1–2013.4.26 121 weeks Stock returns Stock volatilities

27,482 3610 2009.1.1–2014.6.30 284 weeks Obama approval indexes

746

S. Park et al. / Information Processing and Management 51 (2015) 737–755

Fig. 5. Overview of experiments: qualitative analysis of ATM and quantitative analysis of baseline models and ATM.

4.3.3. Dynamic topic models The time-series topic model, DTM, catches the dynamics of the latent state of topic proportions at . By utilizing DTM, we ~ t . Considering the softmax treatment for topic assignments, we extracted could extract the posterior mean of at dynamics, l ~ t . After defining topic proportions, we utilized a simtopic proportions over time by simply applied the softmax function to l ple linear regression. 4.3.4. Iterative topic modeling with time series feedback Iterative Topic Modeling with Time Series Feedback (ITMTF) is a general text mining framework that combines probabilistic topic modeling with time-series causal analysis to uncover topics that are both coherent semantically and correlated with time-series data. ITMTF consists of the following iterative steps: (1) Find topics with topic modeling techniques, (2) identify causal topics using correlation tests, and (3) make prior topic distribution using topic-level and word-level causality analysis. In (2) and (3) processes, there should be a threshold of confidence. In this paper, we applied LDA(ITMTF-LDA) and DTM(ITMTF-DTM) as topic models and a linear regression as a causal analysis tool with a 70% confidence threshold. 4.4. Sample results of ATM ATM has two variations by the setting of d. The first variation is learning d just like the other parameters from the data. This is an ordinary version of ATM, and this model limits the fitting to the past numeric data by allowing the deviation modeled by the variance of d. This model balances the influence of the numeric and the text data in the parameter inference process. We utilize this ordinary version of ATM in (1) Section 4.4 for qualitative topic evaluation, (2) Section 4.6 for numeric value prediction evaluation, and (3) Section 4.7 for quantitative topic evaluation. The second variation, or fixed-ATM, is fixing d as the small value, so the numeric fitting can be maximized, or overfitted from a certain perspective. This would not be a better way to predict the numeric time-series in the future, but this does maximize the influence from the numeric data to the text data in the topic extraction process. Therefore, the model would result in topics that are the best in describing the past numeric value movements, and Section 4.5 shows such retrospective analysis of texts influenced by the numeric values. To investigate the influence of time-series for identifying associative topics, we applied ATM to the same Bloomberg articles over 121 weeks as the text input with different time-series, the stock returns and the volatilities of DJIA. We initialized b1:K to randomly perturbed topics from uniform topics, d2 to the sample variance of the time-series data, r2 to 0.1, a2 to 0.1, and b to zero vector over the number of topics 10. The top and middle of Fig. 6 show the learning results with stock returns and volatilities. The results show that some associative topics from different time series are similar topics, e.g., European crisis, tax cuts, and economic reports, because of the property of the financial domain. However, some associative topics are different: Topics about energy and the economy of Asia are revealed from ATM with stock returns, and the topics about federal lows are revealed from ATM with stock volatilities. When we show two numerical time-series cases, the volatility is better predicted than the return is. 
This result is expected because of the efficient market hypothesis, which states the difficulty in predicting stock prices and returns (Poon & Granger, 2005). In the financial domain, the volatility often becomes the target of prediction with a general sentiment. The numeric values over-time drive ATM to extract different topics per the types of the numeric values, i.e. return and volatility. Subsequently, the different topics induced the different dynamics of topic proportions though two dynamics were captured from the same text data. This result qualitatively demonstrates that ATM identifies different associative topics from different time-series values, although the text data and the initial settings are the same. To confirm this influence from the numeric values in extracting topics, we calculated the similarities between (1) topics without numeric information and

S. Park et al. / Information Processing and Management 51 (2015) 737–755

747

Fig. 6. The training results of ATM with three datasets, Bloomberg news articles & the stock returns (top) and Bloomberg news articles & the stock volatilities (middle), and The Washington post news articles & the president approval rates (bottom). In each dataset, the left side shows the associative topics, which are represented with top eight words. The upper right side indicates the dynamics of topic proportions; the order is the same as the order of the left topic representation. The colors of the topic proportions indicate the effect of topics; for example, blue is positive and red is negative, and the strength of the color means the degree of the effect. The lower right side shows the numerical time-series values and the estimation of them in the training periods. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

temporal information (LDA), (2) topics without numeric information (DTM), (3) topics with over-time numeric values, volatilities (ATM(Volatilities)), and (4) topics with over-time numeric values, returns (ATM(Returns)). Specifically, we investigated the cosine similarities of the topics from different topic extraction models utilizing different information. We examined ten topics from LDA and DTM with the same period text data to compare two sets of topics from ATM with two different numeric values. Fig. 7 shows the block diagrams of the results. The color strength of each block indicates the degree of the topic similarities (a black cell means that the two topics are same, and a white cell means that two are quite different). In the case of LDA and DTM, each topic from LDA has a strongly similar topic in DTM. This indicates that DTM extracts similar topics from LDA although DTM takes the time information of text data into account. On the contrary, in the cases of DTM and ATM with returns as well as DTM and ATM with volatilities, ATM was able to extract different topics by considering numerical time-series values, i.e. the topic 4, 7, and 10 from ATM(Return) are matched to none of the topics from DTM. Such difference were observed between the pair of DTM and ATM(Return); DTM and ATM(Volatilities); and ATM(Return) and ATM(Volatilities). This numeric value guided topic extraction is a major objective in developing ATM, and this correlation analysis demonstrates such capability of ATM.

748

S. Park et al. / Information Processing and Management 51 (2015) 737–755

Fig. 7. Cosine similarities between the topics from different models, such as LDA, DTM, ATM(Returns), and ATM(Volatilities): each row and column means the topic indexes from the models (first four) and the strength of color indicates the similarity (left).

In order to explore The Washington Post corpus and its themes associated with president approval indexes, we inferred the parameters of ATM with 284 weeks of data. For learning ATM, we initialized parameters in the same way of the stock index case. ATM identifies two sets of associative topics corresponding to the president approval indexes. The bottom of Fig. 6 shows the results from ATM. The left side lists the estimated words with the top eight probabilities of ten topics and the associative factors that are standardized coefficients b. The results show that an associative topic about family life is positively related with the approval indexes. However, some associative topics are negatively related, e.g. Romney party, war, and policy. In addition, some topics are not highly related, e.g. education, tax, and agency. These results qualitatively demonstrate that ATM identifies plausible associative topics. Our proposed topic model, ATM, shows how text data and numerical time-series data are related, and ATM identifies associative topics that have explanatory power regarding the relation between two different types of data from two different sources. 4.5. Retrospective analysis with fixed-ATM ATM can be used for retrospective analysis, not for prediction. In the modeling process of ATM, we integrated numerical time-series data and dynamic topic models. At modeling time-series, we assumed numerical hidden states influence generating time-series variables with Gaussian errors (d2 ). If we set the error (d2 ) as the small fixed number, such as 1010 , then the ATM catches associative topics only explaining the exact trajectories of time-series in the training period. As we mentioned, we label ATM with this treatment as fixed-ATM. To explore the results of fixed-ATM, we initialized b1:K to randomly perturbed topics from uniform topics, r2 to 0.1, a2 to 0.1, and b to zero vector over the number of topics. We set the number of the topics as 10 for the return and approval index cases and 5 for the volatility case. The different number of topic setting is decided by empirical experiment; the proportions of some topics among 10 with fixed-ATM become close to 0. Fig. 8 shows total results from fixed-ATM. When investigating dynamics of topic proportions, all dynamics are dramatically changed to contain errors of time-series. Especially, a few topics are highly correlated in the volatility case. This means most of text data are not effectively related with volatilities by ignoring the error of time-series. In figures of estimations, as we said, the fixed low value of d2 made the estimation exact in the training period. This treatment is very similar with ITMTF (Kim et al., 2013) because the trajectories of time-series influence directly topics ignoring errors in time-series. In order to analyze highly associative topics over the training periods, fixed-ATM can be useful. As can be seen, there are some different topic effects from ATM results. For example, in the return case, the last topic appears to have the most negative effects from a fixed-ATM, while it was affected in a small way by ATM. This is caused by including the errors of time-series themselves into topic proportions. These results may be used for a retrospective understanding. 4.6. Evaluation of numerical value prediction performance To quantitatively evaluate the prediction performance of ATM, we considered the task of predicting the next time value by assuming the model was trained by prior data. 
We repeated this prediction from the 101th to 121th week (21 different period) for the stock indexes cases and from the 270th to 284th week (25 different period) for the approval index case, using the trained model with historically cumulated data. For instance, to predict y value at 270th week, the model was trained with 1–269th weeks data. The baseline models and ATM infer five topics, and the models had the same initial parameters and the same criteria to stop the approximations. For AR(p), we select p of the best performers on the test data. For ITMTF models, we iterated three times of prior feedback loops with a 70% confidence threshold. After the estimation, we P ^m Þ2 , and mean absomeasured the model performance of experimental models with mean squared errors (MSE), M1 m ðym  y P ^m j where M is the number of test points; ym is the true value of the prediction; and y ^m is the lute errors (MAE), M1 m jym  y

749

S. Park et al. / Information Processing and Management 51 (2015) 737–755

Fig. 8. The training results of fixed-ATM with three datasets, Bloomberg news articles & the stock returns (top) and Bloomberg news articles & the stock volatilities (middle), and The Washington post news articles & the president approval rates (bottom). In each dataset, the left side shows the associative topics, which are represented with top eight words. The upper right side indicates the dynamics of topic proportions; the order is the same as the order of the left topic representation. The colors of the topic proportions indicate the effect of topics; for example, blue is positive and red is negative, and the strength of the color means the degree of the effect. The lower right side shows the numerical time-series values and the estimation of them in the training periods. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2 Prediction performance comparison on stock returns, stock volatilities, and president approval indexes; underline means the best performer on the prediction, and bold means the best performer among models with text data. –

Mean squared errors

Models

Returns

Volatil.

App.Idx

Returns

Volatil.

App.Idx

AR

1.892

15.00

1.039

LDA-LR IT-LDA DTM-LR IT-DTM ATM

2.405 1.805 2.252 1.897

0.871E5 1.166E5 1.052E5 4.916E5 18.257E5 1.051E5

21.22 20.35 30.39 23.62

1.118 0.947 1.063 1.058

14.47

0.904

2.436E3 2.826E3 2.651E3 5.604E3 8.875E3 2.652E3

3.041 3.690 3.979 4.745 4.096 3.207

1.761

Mean absolute errors

750

S. Park et al. / Information Processing and Management 51 (2015) 737–755

predicted value. Although the same training data are not used for prediction, MSE and MAE can be used to evaluate the model quality. Table 2 displays MSE and MAE from the baseline and the proposed models. The bold font indicates the best model generating both interpretation and prediction in other words, analyzing text and metrics simultaneously while the underlined font indicates the best performer including only numerical model, such as AR. When we focus only on the numerical value prediction, ATM is the best performance model in general. Exceptions are the MSE and the MAE of AR in the volatility prediction; and the MAE of AR in predicting the approval rate. This shows that simple regression model might beat sophisticated metric-text models, but given the result, ATM is the best solution that provides both prediction and interpretation. Additionally, some cases, i.e., predicting stock price, suggests that ATM would be better for the numerical value prediction. Other sophisticated methods, i.e., the generalized linear model, that can replace AR exist, but these models do not provide the interpretation of the context, either. 4.7. Predictive likelihood analysis After measuring the accuracy of the numeric value predictions, we compare the quality of the extracted topics from the texts with the predictive log likelihoods (Blei & Lafferty, 2006; Chang, Gerrish, Wang, Boyd-graber, & Blei, 2009; Wallach, Murray, Salakhutdinov, & Mimno, 2009; Wang, Blei, & Heckerman, 2008). The predictive log-likelihood is frequently utilized and widely accepted in the topic modeling community, and this metric represents how clearly the extracted topics are able to explain the future document distribution from the topic perspective. Since this evaluation only handles the topic modeling, we excluded the evaluation on the numeric only model, i.e. AR. The predictive log likelihoods in the topic models indicate the probabilities of generating unseen text data with past learned parameters. In the proposed models, the predictive likelihood for next-time text data, wTþ1 , is specified in the follow:

pðwTþ1 jw1:T ; y1:T Þ ¼

ZZ

pðwTþ1 jaTþ1 ÞpðaTþ1 jaT ÞpðaT jw1:T ; y1:T ÞdaTþ1 daT

ð26Þ

Due to the same reason of the intractability of the inference, we calculated the lower bound of the likelihood in the same way described in Formula (11). Because the baseline topic models, which are LDA, DTM, ITMTF-LDA, and ITMTF-DTM, are not unified probabilistic models containing numerical variable y1:T , their predictive log-likelihoods do not contain the historical condition of numerical time series. The formula for the baseline topic models is described in the below:

Fig. 9. Log likelihood on the test corpus from LDA, ITMTF-LDA, DTM, ITMTF-DTM, and ATM (right-stock returns, middle-stock volatilities, right-president approval rates): We learned all models with time-series dataset before each test period. For example, the results in the period 110 are calculated using the learned model from the 1 to 109 period dataset. The upper graphs show the results of all test period, and the lower graphs show the results of the last period.

S. Park et al. / Information Processing and Management 51 (2015) 737–755

pðwTþ1 jw1:T Þ ¼

Z

pðwTþ1 jH1:T ÞpðH1:T jw1:T ÞdH1:T

751

ð27Þ

where H1:T indicates latent variables, which are included in the baseline topic models. Due to the intractabilities, the lower bounds of pðwTþ1 jw1:T Þ are used. These lower bounds are not exactly comparable values because each probability is derived from different hyper-parameters as well as different condition, such as the existence of y1:T . However, these comparisons are still be worth to measure model performances (Asuncion, Welling, Smyth, & Teh, 2009) due to the difficulty on calculating the exact likelihood. The upper graphs in Fig. 9 display the log likelihoods on test datasets over time from the ATM and the baseline models. The dramatic differences over times are caused by the difference of the period. All models show similar performance from the predictive log likelihood aspect. This is encouraging for ATM because ATM has the similar likelihood performances in spite of the strong assumption that two different sources are generated from the same hidden states. In order to show details, the lower graphs in Fig. 9 show the log likelihood of the last period on the upper graphs in Fig. 9. 5. Conclusion This paper proposes associative topic model, or ATM, to better capture the relationship between numerical data and text data over time. We tested the proposed model with financial indexes and presidential approval index. The pair of numeric and text data in the financial domain consists of the economic news articles; and the stock return as well as volatility. The pair in the politics domain consists of news articles on the president and the president approval index. Our experiments show meaningful performance gain in predicting the numeric data by utilizing the text data. Additionally, our topic finding capability, i.e. measured by the predictive likelihood, was minimally affected in spite of adding numerical modeling to the text model, which is adding bias that was not there before. The generality of the model makes it useful in broad application domains that have fluctuating numerical data and text data. For instance, this model is highly applicable to product reviews and sales records over time, as well as political SNS messages, approval ratings, etc. To support the understanding of the model, we presented a variational inference procedure of the model to approximate the posterior. Acknowledgements This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2012R1A1A1044575). Appendix A. Approximate expectation of softmax functions In the paper, we introduced the expectation of a softmax function whose input variables are drawn from Gaussian distribution. Since the covariance matrices of Gaussian distribution were modeled as diagonal matrices, the expectation became the proportion of log-normal distributions whose variables are independent. To find an approximate expectation of softmax function, we adapted the second-order Taylor extension. A.1. Approximate expectation of a single softmax function We describe a theorem about log-normal distribution and Taylor expansion to find an approximation for the expectation ~ k Þ 8k. We regard X k ¼ expðak Þ as a log-normal ~ k; V of softmax function whose inputs are independently drawn from ak  N ðl P distribution and Y ¼ k X k as a random variable where Y has support ð0; 1Þ. Then, the expectation of the softmax function becomes the expectation of the ratio between two correlated random variables:

E[\pi_i(a)] = E\left[\frac{\exp(a_i)}{\sum_k \exp(a_k)}\right] = E\left[\frac{X_i}{Y}\right]    (A.1)

In order to calculate the mean of the ratio, we adopt the second-order Taylor expansion approximation (Kendall et al., 2010; Rice & Papadopoulos, 2009):

f(X, Y) \approx f(\theta) + f'_X(\theta)(X - \theta_X) + f'_Y(\theta)(Y - \theta_Y) + \frac{1}{2}\left\{ f''_{XX}(\theta)(X - \theta_X)^2 + f''_{YY}(\theta)(Y - \theta_Y)^2 + 2 f''_{XY}(\theta)(X - \theta_X)(Y - \theta_Y) \right\}    (A.2)

where \theta = (\theta_X, \theta_Y). If \theta = (E[X], E[Y]), then a better approximation of the expectation is:

E[f(X, Y)] \approx f(\theta) + \frac{1}{2}\left\{ f''_{XX}(\theta)\,V(X) + f''_{YY}(\theta)\,V(Y) + 2 f''_{XY}(\theta)\,\mathrm{Cov}(X, Y) \right\}    (A.3)

For f(X_i, Y) = X_i/Y, we have f''_{XX} = 0, f''_{XY} = -Y^{-2}, and f''_{YY} = 2 X_i Y^{-3}; the approximate expectation of the ratio is then as follows:


E\left[\frac{X_i}{Y}\right] \approx \frac{E[X_i]}{E[Y]} - \frac{\mathrm{Cov}(X_i, Y)}{E[Y]^2} + \frac{E[X_i]\,V(Y)}{E[Y]^3}    (A.4)

In the above approximate expectation of the ratio, the mean and variance of the sum of log-normal random variables, E[Y] and V(Y), can be calculated as (Dufresne, 2008; Gao, Xu, & Ye, 2009):

E[Y] = \sum_k e^{\tilde{\mu}_k + \tilde{V}_k/2}    (A.5)

V(Y) = \sum_k e^{2\tilde{\mu}_k + \tilde{V}_k}\left(e^{\tilde{V}_k} - 1\right)    (A.6)

The covariance terms, Cov(X_i, Y), can be calculated easily as follows:

\mathrm{Cov}(X_i, Y) = \mathrm{Cov}\!\left(X_i, \sum_k X_k\right) = V(X_i) = E[X_i]^2\left(e^{\tilde{V}_i} - 1\right)    (A.7)
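For illustration, the closed-form moments (A.5)–(A.7) can be checked against a Monte Carlo estimate; the following is a minimal sketch, assuming NumPy and arbitrary example values for μ̃_k and Ṽ_k (it is not part of the original derivation).

import numpy as np

rng = np.random.default_rng(0)

# Variational Gaussian parameters for a = (a_1, ..., a_K); diagonal covariance.
mu = np.array([0.2, -0.5, 1.0])      # tilde-mu_k (example values)
var = np.array([0.3, 0.1, 0.2])      # tilde-V_k (example values)

# Closed-form moments of Y = sum_k exp(a_k), Eqs. (A.5)-(A.7).
EX = np.exp(mu + var / 2)                                # E[X_k] for X_k = exp(a_k)
EY = EX.sum()                                            # (A.5)
VY = (np.exp(2 * mu + var) * (np.exp(var) - 1)).sum()    # (A.6)
cov_XY = EX ** 2 * (np.exp(var) - 1)                     # (A.7), one entry per i

# Monte Carlo check.
a = rng.normal(mu, np.sqrt(var), size=(200_000, len(mu)))
X = np.exp(a)
Y = X.sum(axis=1)
print("E[Y]  closed form vs MC:", EY, Y.mean())
print("V(Y)  closed form vs MC:", VY, Y.var())
print("Cov(X_0, Y) closed form vs MC:", cov_XY[0], np.cov(X[:, 0], Y)[0, 1])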

Finally, we can get the expectation of a softmax function whose inputs are drawn from Gaussian distributions:

E_q[\pi_i(a_t)] = E_q\left[\frac{\exp(a_{ti})}{\sum_k \exp(a_{tk})}\right] \approx m_{ti} + m_{ti}\left(e^{\tilde{V}_t} - 1\right)\left(\sum_k m_{tk}^2 - m_{ti}\right)    (A.8)

where m_{ti} = \pi_i(\tilde{\mu}_t), which we introduced earlier.
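To make the approximation concrete, the following minimal sketch (assuming NumPy and a shared variance Ṽ_t, as in the variational posterior used here; the numbers are arbitrary) compares (A.8) with a Monte Carlo estimate of E_q[π_i(a_t)].

import numpy as np

rng = np.random.default_rng(1)

mu_t = np.array([0.5, -0.2, 0.1, -1.0])    # tilde-mu_t (example values)
V_t = 0.15                                  # shared variance tilde-V_t

m = np.exp(mu_t) / np.exp(mu_t).sum()       # m_ti = pi_i(tilde-mu_t)

# Second-order Taylor approximation, Eq. (A.8).
approx = m + m * (np.exp(V_t) - 1) * ((m ** 2).sum() - m)

# Monte Carlo estimate of E_q[pi_i(a_t)].
a = rng.normal(mu_t, np.sqrt(V_t), size=(200_000, len(mu_t)))
pi = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)
print("Taylor (A.8):", approx)
print("Monte Carlo :", pi.mean(axis=0))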

"

# " # expð2ai Þ X 2i E½pi ðaÞ  ¼ E P 2 ¼ E 2 Y k expðak Þ 2

ðA:9Þ

and jointed variables with two softmax functions that have different indexes,

"

#   expðai þ aj Þ XiXj E½pi ðaÞpj ðaÞ ¼ E P 2 ¼ E 2 Y k expðak Þ

ðA:10Þ

where i ≠ j. The expectation of squared softmax functions can be calculated by substituting f(X_i, Y) = X_i^2/Y^2 in Formula (A.2). For f''_{XX} = 2Y^{-2}, f''_{XY} = -4X_i Y^{-3}, and f''_{YY} = 6X_i^2 Y^{-4}, the approximate expectation of the squared ratio is as follows:

" # X 2i

E

Y

2

¼

E½X i 2 E½Y

2

þ

VðX i Þ E½Y

2

þ3

E½X i 2 VðYÞ E½Y4

4

E½X i Cov ðX i ; YÞ

ðA:11Þ

E½Y3

Using this equation, we can calculate the approximate expectation of squared softmax functions

E_q[\pi_i(a_t)^2] = E_q\left[\left(\frac{\exp(a_{ti})}{\sum_k \exp(a_{tk})}\right)^2\right] \approx e^{\tilde{V}_t} m_{ti}^2 + m_{ti}^2\left(e^{\tilde{V}_t} - 1\right)\left(3\sum_k m_{tk}^2 - 4m_{ti}\right)    (A.12)

Two jointed softmax functions involve the total sum Y and two of the log-normal variables, X_i and X_j. This requires a three-dimensional Taylor expansion similar to Formula (A.2):

f(X_i, X_j, Y) \approx f(\theta) + f'_{X_i}(\theta)(X_i - \theta_{X_i}) + f'_{X_j}(\theta)(X_j - \theta_{X_j}) + f'_Y(\theta)(Y - \theta_Y) + \frac{1}{2}\left\{ f''_{X_iX_i}(\theta)(X_i - \theta_{X_i})^2 + f''_{X_jX_j}(\theta)(X_j - \theta_{X_j})^2 + f''_{YY}(\theta)(Y - \theta_Y)^2 + 2 f''_{X_iX_j}(\theta)(X_i - \theta_{X_i})(X_j - \theta_{X_j}) + 2 f''_{X_iY}(\theta)(X_i - \theta_{X_i})(Y - \theta_Y) + 2 f''_{X_jY}(\theta)(X_j - \theta_{X_j})(Y - \theta_Y) \right\}    (A.13)

When we set f(X_i, X_j, Y) = X_i X_j / Y^2, we have f''_{X_iX_i} = 0, f''_{X_jX_j} = 0, f''_{X_iX_j} = Y^{-2}, f''_{X_iY} = -2X_j Y^{-3}, f''_{X_jY} = -2X_i Y^{-3}, and f''_{YY} = 6 X_i X_j Y^{-4}; the approximate expectation of the two softmax functions is as follows:

E\left[\frac{X_i X_j}{Y^2}\right] \approx \frac{E[X_i]\,E[X_j]}{E[Y]^2} - 2\,\frac{E[X_i]\,\mathrm{Cov}(X_j, Y)}{E[Y]^3} - 2\,\frac{E[X_j]\,\mathrm{Cov}(X_i, Y)}{E[Y]^3} + 3\,\frac{E[X_i]\,E[X_j]\,V(Y)}{E[Y]^4}    (A.14)

Finally, the expectation of the two jointed softmax functions that have different indexes is as follows:

"

# ! X expðati þ atj Þ 2 V~ t Eq ½pi ðat Þpj ðat Þ ¼ Eq P 2  mti mtj þ mti mtj ðe  1Þ 3 mtk  2mti  2mtj k k expðatk Þ

ðA:15Þ
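The two second-moment approximations can be evaluated together for all index pairs. The following is a minimal sketch (assuming NumPy; the function name and example values are our own) that implements (A.12) for the diagonal and (A.15) for the off-diagonal entries.

import numpy as np

def softmax_second_moments(mu_t, V_t):
    """Approximate E_q[pi_i(a_t) pi_j(a_t)] for a_t ~ N(mu_t, V_t I),
    using Eq. (A.12) when i == j and Eq. (A.15) when i != j."""
    m = np.exp(mu_t) / np.exp(mu_t).sum()          # m_ti
    s = (m ** 2).sum()                             # sum_k m_tk^2
    c = np.exp(V_t) - 1.0
    # Off-diagonal terms, Eq. (A.15).
    M = np.outer(m, m) * (1.0 + c * (3.0 * s - 2.0 * m[:, None] - 2.0 * m[None, :]))
    # Diagonal terms, Eq. (A.12).
    np.fill_diagonal(M, np.exp(V_t) * m ** 2 + m ** 2 * c * (3.0 * s - 4.0 * m))
    return M

mu_t = np.array([0.5, -0.2, 0.1])
print(softmax_second_moments(mu_t, V_t=0.15))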


Appendix B. Learning algorithm

In this appendix, we give details of the learning algorithm, including the variational inference and model learning methods for optimizing the lower bound of the log likelihood. Due to the non-conjugacy of the model parameters, we utilize an expectation–maximization (EM) algorithm to learn the parameters of the proposed model. In the E-step, we infer the posterior distributions using a variational inference method, which converts the problem of finding the posterior distribution into that of finding variational parameters that maximize the lower bound of the log likelihood. In the M-step, we find the model parameters that also maximize the lower bound of the log likelihood. By iterating the E- and M-steps, we obtain approximate model parameters and posterior distributions.

B.1. Learning variational parameters

There are several variational parameters, â, γ, Λ, ζ, and φ, that are used to construct the approximate posterior distributions over the latent variables. When maximizing the lower bound as a function of the variational parameters â_{1:T}, we use a conjugate gradient algorithm because a closed-form update cannot be found:

\frac{d\ell}{d\hat{\alpha}_{sk}} = -\frac{1}{\sigma^2}\sum_t \left(\tilde{\mu}_{t,k} - \tilde{\mu}_{t-1,k}\right)\left(\frac{d\tilde{\mu}_{t,k}}{d\hat{\alpha}_{sk}} - \frac{d\tilde{\mu}_{t-1,k}}{d\hat{\alpha}_{sk}}\right) + \frac{1}{a}\sum_{t,d}\left(\gamma_{tdk} - \tilde{\mu}_{tk}\right)\frac{d\tilde{\mu}_{tk}}{d\hat{\alpha}_{sk}} + \frac{1}{\delta}\sum_t y_t \sum_{i=1}^{K} b_i \frac{dE_q[\pi_i(a_t)]}{d\tilde{\mu}_{tk}}\frac{d\tilde{\mu}_{tk}}{d\hat{\alpha}_{sk}} - \frac{1}{2\delta}\sum_t \sum_{i,j=1}^{K} b_i b_j \frac{dE_q[\pi_i(a_t)\pi_j(a_t)]}{d\tilde{\mu}_{tk}}\frac{d\tilde{\mu}_{tk}}{d\hat{\alpha}_{sk}}    (B.1)

Derivatives of the means from the variational Kalman filter are calculated by forward–backward equations. The forward recurrence is

\frac{d\mu_{t,k}}{d\hat{\alpha}_{sk}} = \frac{\hat{v}_t}{V_{t-1} + \sigma^2 + \hat{v}_t}\,\frac{d\mu_{t-1,k}}{d\hat{\alpha}_{sk}} + \left(1 - \frac{\hat{v}_t}{V_{t-1} + \sigma^2 + \hat{v}_t}\right)\frac{d\hat{\alpha}_{t,k}}{d\hat{\alpha}_{sk}}    (B.2)

with the initial condition d\mu_{0,k}/d\hat{\alpha}_{sk} = 0. The backward recurrence is then

\frac{d\tilde{\mu}_{t-1,k}}{d\hat{\alpha}_{sk}} = \frac{\sigma^2}{V_{t-1} + \sigma^2}\,\frac{d\mu_{t-1,k}}{d\hat{\alpha}_{sk}} + \left(1 - \frac{\sigma^2}{V_{t-1} + \sigma^2}\right)\frac{d\tilde{\mu}_{t,k}}{d\hat{\alpha}_{sk}}    (B.3)

with the initial condition d\tilde{\mu}_{T,k}/d\hat{\alpha}_{sk} = d\mu_{T,k}/d\hat{\alpha}_{sk}.
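These derivative recurrences mirror the structure of the variational Kalman smoother itself. The following is a minimal sketch (assuming NumPy; the array names V, v_hat, and sigma2 stand for V_{t-1}, v̂_t, and σ², and their values here are arbitrary placeholders rather than outputs of the actual smoother).

import numpy as np

def d_mu_tilde_d_alpha_hat(s, V, v_hat, sigma2, T):
    """Derivatives d mu_{t,k}/d alpha_hat_{s,k} (forward, Eq. B.2) and
    d mu~_{t,k}/d alpha_hat_{s,k} (backward, Eq. B.3) for a fixed topic k.
    V[t] is the forward filter variance, v_hat[t] the variational observation
    variance, and sigma2 the transition variance (illustrative sketch only)."""
    d_mu = np.zeros(T + 1)                          # forward derivatives, t = 0..T
    for t in range(1, T + 1):
        gain = v_hat[t] / (V[t - 1] + sigma2 + v_hat[t])
        d_alpha = 1.0 if t == s else 0.0            # d alpha_hat_{t,k} / d alpha_hat_{s,k}
        d_mu[t] = gain * d_mu[t - 1] + (1.0 - gain) * d_alpha          # (B.2)

    d_mu_tilde = np.zeros(T + 1)
    d_mu_tilde[T] = d_mu[T]                         # initial condition of the backward pass
    for t in range(T, 0, -1):
        w = sigma2 / (V[t - 1] + sigma2)
        d_mu_tilde[t - 1] = w * d_mu[t - 1] + (1.0 - w) * d_mu_tilde[t]  # (B.3)
    return d_mu, d_mu_tilde

T = 5
V = np.full(T + 1, 0.1)
v_hat = np.full(T + 1, 0.2)
d_mu, d_mu_tilde = d_mu_tilde_d_alpha_hat(s=3, V=V, v_hat=v_hat, sigma2=0.05, T=T)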

Derivatives of the expectations of single softmax functions in Formula (B.1) are separated into two cases. One is the case in which the softmax function and the variable for the derivative have the same topic index:

\frac{dE_q[\pi_k(a_t)]}{d\tilde{\mu}_{tk}} = m_{tk} + m_{tk}^2\left(1 - 2e^{\tilde{V}_t}\right) + 4 m_{tk}^3\left(e^{\tilde{V}_t} - 1\right) + \left(\sum_{i=1}^{K} m_{ti}^2\right)\left(e^{\tilde{V}_t} - 1\right)\left(m_{tk} - 3m_{tk}^2\right)    (B.4)

where m_{ti} = \pi_i(\tilde{\mu}_t), as introduced earlier. The other is the case of a different index, i ≠ k:

\frac{dE_q[\pi_i(a_t)]}{d\tilde{\mu}_{tk}} = -m_{ti} m_{tk} + m_{ti} m_{tk}\left(e^{\tilde{V}_t} - 1\right)\left(2m_{ti} + 2m_{tk} - 3\sum_{j=1}^{K} m_{tj}^2\right)    (B.5)
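As a sanity check, (B.4) and (B.5) can be compared against finite differences of the approximation (A.8); the sketch below (assuming NumPy, with arbitrary example values) is our own illustration rather than part of the paper.

import numpy as np

def expected_softmax(mu_t, V_t):
    """Eq. (A.8): approximate E_q[pi(a_t)] for a_t ~ N(mu_t, V_t I)."""
    m = np.exp(mu_t) / np.exp(mu_t).sum()
    return m + m * (np.exp(V_t) - 1) * ((m ** 2).sum() - m)

def d_expected_softmax(mu_t, V_t):
    """Analytic derivatives D[i, k] = d E_q[pi_i] / d mu_tk, Eqs. (B.4)-(B.5)."""
    m = np.exp(mu_t) / np.exp(mu_t).sum()
    s, c = (m ** 2).sum(), np.exp(V_t) - 1
    # Off-diagonal case (B.5): i != k.
    D = -np.outer(m, m) * (1 - c * (2 * m[:, None] + 2 * m[None, :] - 3 * s))
    # Diagonal case (B.4): i == k.
    np.fill_diagonal(D, m + m ** 2 * (1 - 2 * np.exp(V_t))
                        + 4 * m ** 3 * c + s * c * (m - 3 * m ** 2))
    return D

mu_t, V_t, eps = np.array([0.5, -0.2, 0.1]), 0.15, 1e-6
D = d_expected_softmax(mu_t, V_t)
for k in range(len(mu_t)):
    e = np.zeros_like(mu_t)
    e[k] = eps
    fd = (expected_softmax(mu_t + e, V_t) - expected_softmax(mu_t - e, V_t)) / (2 * eps)
    print(np.allclose(D[:, k], fd, atol=1e-5))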

Derivatives of the expectations of two jointed softmax functions in Formula (B.1) are separated into four cases. The first case is when the squared softmax and the derivative variable share the same topic index:

\frac{dE_q[\pi_k(a_t)^2]}{d\tilde{\mu}_{tk}} = 2m_{tk}^2 e^{\tilde{V}_t} + m_{tk}^3\left(12 - 14e^{\tilde{V}_t}\right) + 18 m_{tk}^4\left(e^{\tilde{V}_t} - 1\right) + \left(\sum_{j=1}^{K} m_{tj}^2\right)\left(e^{\tilde{V}_t} - 1\right)\left(6m_{tk}^2 - 12m_{tk}^3\right)    (B.6)

while the second is the case of a different topic index between them:

\frac{dE_q[\pi_i(a_t)^2]}{d\tilde{\mu}_{tk}} = -2m_{ti}^2 m_{tk} e^{\tilde{V}_t} + m_{ti}^2 m_{tk}\left(e^{\tilde{V}_t} - 1\right)\left(12m_{ti} + 6m_{tk} - 12\sum_{j=1}^{K} m_{tj}^2\right)    (B.7)

The third is the case in which one of the jointed softmax functions has the same topic index as the derivative variable:

\frac{dE_q[\pi_k(a_t)\pi_i(a_t)]}{d\tilde{\mu}_{tk}} = m_{tk} m_{ti} + m_{tk}^2 m_{ti}\left(2 - 4e^{\tilde{V}_t}\right) + m_{tk} m_{ti}\left(e^{\tilde{V}_t} - 1\right)\left(12m_{tk}^2 - 2m_{ti} + 6m_{tk} m_{ti}\right) + \left(\sum_{j=1}^{K} m_{tj}^2\right)\left(e^{\tilde{V}_t} - 1\right)\left(3m_{tk} m_{ti} - 12m_{tk}^2 m_{ti}\right)    (B.8)


and the fourth is the case in which none of them has the same topic index:

\frac{dE_q[\pi_i(a_t)\pi_j(a_t)]}{d\tilde{\mu}_{tk}} = -2m_{tk} m_{ti} m_{tj} + m_{tk} m_{ti} m_{tj}\left(e^{\tilde{V}_t} - 1\right)\left(6m_{tk} + 6m_{ti} + 6m_{tj} - 12\sum_{u=1}^{K} m_{tu}^2\right)    (B.9)

Using these sub-derivative equations, we can find locally optimal values of â_{1:T}. The maximization rules for the other variational parameters are similar to those of the standard topic model, DTM. The additional variational parameter, ζ_td, can be calculated in closed form:

\zeta_{td} = \sum_k e^{\gamma_{tdk} + \Lambda_{tdk}/2}    (B.10)

The variational parameter φ can be calculated as follows:

\phi_{tdnk} \propto \beta_{k, w_{tdn}}\, e^{\gamma_{tdk}}    (B.11)

However, the normality of the prior distribution prevents some update equations from having a closed form. We adopt a conjugate gradient method for γ_tdk:

\frac{d\ell}{d\gamma_{tdk}} = -\frac{1}{a}\left(\gamma_{tdk} - \tilde{\mu}_{tk}\right) + \sum_{n}^{N_{td}} \phi_{tdnk} - \zeta_{td}^{-1} N_{td}\, e^{\gamma_{tdk} + \Lambda_{tdk}/2}    (B.12)

We also adopt the Newton method for Λ_tdk, with the constraint that it remain non-negative:

\frac{d\ell}{d\Lambda_{tdk}} = -\frac{1}{2a} - \frac{N_{td}}{2\zeta_{td}}\, e^{\gamma_{tdk} + \Lambda_{tdk}/2} + \frac{1}{2\Lambda_{tdk}}    (B.13)
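A compact sketch of the per-document coordinate updates (B.10)–(B.13) follows. It is a minimal illustration assuming NumPy, in which a fixed-step gradient ascent stands in for the conjugate gradient and constrained Newton routines used in the paper; the function name and example values are our own.

import numpy as np

def update_document(gamma, Lam, mu_tilde, beta, words, a, n_iter=20, lr=0.01):
    """Illustrative coordinate updates for one document.
    gamma, Lam, mu_tilde: K-vectors; beta: K x V topic-word probabilities;
    words: length-N array of word ids; a: prior variance of the topic proportions."""
    N = len(words)
    phi = np.full((N, len(gamma)), 1.0 / len(gamma))
    for _ in range(n_iter):
        zeta = np.exp(gamma + Lam / 2).sum()                      # (B.10)
        phi = beta[:, words].T * np.exp(gamma)                    # (B.11), unnormalized
        phi /= phi.sum(axis=1, keepdims=True)
        # Gradient step for gamma, following (B.12).
        d_gamma = (-(gamma - mu_tilde) / a + phi.sum(axis=0)
                   - (N / zeta) * np.exp(gamma + Lam / 2))
        gamma = gamma + lr * d_gamma
        # Gradient step for Lam (the paper uses a constrained Newton step), (B.13).
        d_Lam = (-1.0 / (2 * a) - (N / (2 * zeta)) * np.exp(gamma + Lam / 2)
                 + 1.0 / (2 * Lam))
        Lam = np.maximum(Lam + lr * d_Lam, 1e-8)                  # keep Lam positive
    return gamma, Lam, phi

K, V = 4, 50
rng = np.random.default_rng(2)
beta = rng.dirichlet(np.ones(V), size=K)       # toy topic-word distributions
words = rng.integers(0, V, size=30)            # toy document
gamma, Lam, phi = update_document(np.zeros(K), np.ones(K), np.zeros(K), beta, words, a=1.0)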

By iterating these optimizations over all of the variational parameters, we obtain the optimized variational parameters that describe the approximate posterior distributions.

Algorithm 1. Learning algorithm for ATM

B.2. Learning algorithm

In Appendix B.1 and Section 3.2, we described the learning methods for the variational and model parameters. Using these methods, we can build the overall learning algorithm for ATM. This algorithm includes two sub-optimization routines, the conjugate gradient method (CGM) and the Newton method (NM). The procedure is summarized in Algorithm 1.

References

Asuncion, A., Welling, M., Smyth, P., & Teh, Y. W. (2009). On smoothing and inference for topic models. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence (pp. 27–34). AUAI Press.
Asur, S., & Huberman, B. A. (2010). Predicting the future with social media. In 2010 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology (WI-IAT) (Vol. 1, pp. 492–499). IEEE.


Bejan, C. A., & Harabagiu, S. M. (2008). Using clustering methods for discovering event structures. In Proceedings of the association for the advancement of artificial intelligence (AAAI'08) (pp. 1776–1777). AAAI Press.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd international conference on machine learning (pp. 113–120). ACM.
Blei, D. M., & McAuliffe, J. D. (2007). Supervised topic models. Proceedings of NIPS'07, 21, 121–128.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
Bouchard, G. (2007). Efficient bounds for the softmax function and applications to approximate inference in hybrid models. In NIPS 2007 workshop for approximate Bayesian inference in continuous/hybrid systems.
Chang, J., Gerrish, S., Wang, C., Boyd-graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems (pp. 288–296).
Chen, X., Li, L., Xu, G., Yang, Z., & Kitsuregawa, M. (2012). Recommending related microblogs: A comparison between topic and WordNet based approaches. In AAAI.
Chen, B., Zhu, L., Kifer, D., & Lee, D. (2010). What is an opinion about? Exploring political standpoints using opinion scoring model. In Proceedings of the association for the advancement of artificial intelligence (AAAI'10). AAAI Press.
Dufresne, D. (2008). Sums of lognormals. In Proceedings of the 43rd actuarial research conference.
Gao, X., Xu, H., & Ye, D. (2009). Asymptotic behavior of tail density for sum of correlated lognormal variables. International Journal of Mathematics and Mathematical Sciences.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl. 1), 5228–5235.
Gupta, S., & Manning, C. (2011). Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of IJCNLP'11 (pp. 1–9).
Hong, L., Yin, D., Guo, J., & Davison, B. D. (2011). Tracking trends: Incorporating term volume into temporal topic models. In Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 484–492). ACM.
Hörster, E., Lienhart, R., & Slaney, M. (2007). Image retrieval on large-scale image databases. In Proceedings of the 6th ACM international conference on image and video retrieval (pp. 17–24). ACM.
Jaakkola, T. S. (2001). Tutorial on variational approximation methods. In Advanced mean field methods: Theory and practice (p. 129).
Jo, Y., & Oh, A. (2011). Aspect and sentiment unification model for online review analysis. In Proceedings of the 4th ACM international conference on web search and data mining, WSDM '11 (pp. 815–824). ACM Press. http://dx.doi.org/10.1145/1935826.1935932.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1998). An introduction to variational methods for graphical models. Springer.
Kendall, M., Stuart, A., & Ord, J. K. (2010). Kendall's advanced theory of statistics, distribution theory (Vol. 1, 6th ed.). Arnold.
Khurdiya, A., Dey, L., Raj, N., & Haque, S. M. (2011). Multi-perspective linking of news articles within a repository. In Proceedings of the twenty-second international joint conference on artificial intelligence (Vol. 3, pp. 2281–2286). AAAI Press.
Kim, H. D., Castellanos, M., Hsu, M., Zhai, C., Rietz, T., & Diermeier, D. (2013). Mining causal topics in text data: Iterative topic modeling with time series feedback. In Proceedings of the 22nd ACM international conference on information & knowledge management (pp. 885–890). ACM.
Liu, Y., Huang, X., An, A., & Yu, X. (2007). ARSA: A sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 607–614). ACM.
Livne, A., Simmons, M. P., Adar, E., & Adamic, L. A. (2011). The party is over here: Structure and content in the 2010 election. In ICWSM.
Mahajan, A., Dey, L., & Haque, S. M. (2008). Mining financial news for major events and their impacts on the market. In IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, WI-IAT'08 (Vol. 1, pp. 423–426). IEEE.
McCallum, A., Corrada-Emmanuel, A., & Wang, X. (2005). The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. http://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1024&context=cs_faculty_pubs.
Poon, S.-H., & Granger, C. (2005). Practical issues in forecasting volatility. Financial Analysts Journal, 45–56.
Putthividhy, D., Attias, H. T., & Nagarajan, S. S. (2010). Topic regression multi-modal latent Dirichlet allocation for image annotation. In 2010 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3408–3415). IEEE.
Ramage, D., Dumais, S. T., & Liebling, D. J. (2010). Characterizing microblogs with topic models. In ICWSM.
Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 conference on empirical methods in natural language processing (Vol. 1, pp. 248–256). Association for Computational Linguistics.
Rice, S. H., & Papadopoulos, A. (2009). Evolution with stochastic fitness and stochastic migration. PLoS One, 4(10), e7130.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th conference on uncertainty in artificial intelligence (pp. 487–494). AUAI Press.
Si, J., Mukherjee, A., Liu, B., Li, Q., Li, H., & Deng, X. (2013). Exploiting topic based twitter sentiment for stock prediction. In Proceedings of the 51st annual meeting of the association for computational linguistics (pp. 24–29).
Sizov, S. (2012). Latent geospatial semantics of social media. ACM Transactions on Intelligent Systems and Technology (TIST), 3(4), 64.
Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th annual international conference on machine learning (pp. 1105–1112). ACM.
Wang, C., Blei, D., & Heckerman, D. (2008). Continuous time dynamic topic models. In Uncertainty in artificial intelligence (UAI).
Zhu, J., Ahmed, A., & Xing, E. P. (2012). MedLDA: Maximum margin supervised topic models. The Journal of Machine Learning Research, 13(1), 2237–2278.
D., Castellanos, M., Hsu, M., Zhai, C., Rietz, T., & Diermeier, D. (2013). Mining causal topics in text data: Iterative topic modeling with time series feedback. In Proceedings of the 22nd ACM international conference on information & knowledge management (pp. 885–890). ACM. Liu, Y., Huang, X., An, A., & Yu, X. (2007). ARSA: A sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 607–614). ACM. Livne, A., Simmons, M. P., Adar, E., & Adamic, L. A. (2011). The party is over here: Structure and content in the 2010 election. In ICWSM. Mahajan, A., Dey, L., & Haque, S. M. (2008). Mining financial news for major events and their impacts on the market. IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, 2008. WI-IAT’08 (Vol. 1, pp. 423–426). IEEE. McCallum, A., Corrada-Emmanuel, A., & Wang, X. (2005). The author-recipient-topic model for topic and role discovery in social networks: Experiments with enron and academic email. http://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1024&context=cs_faculty_pubs. Poon, S.-H., & Granger, C. (2005). Practical issues in forecasting volatility. Financial Analysts Journal, 45–56. Putthividhy, D., Attias, H. T., & Nagarajan, S. S. (2010). Topic regression multi-modal latent Dirichlet allocation for image annotation. In 2010 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3408–3415). IEEE. Ramage, D., Dumais, S. T., & Liebling, D. J. (2010). Characterizing microblogs with topic models. In ICWSM. Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. Proceedings of the 2009 conference on empirical methods in natural language processing (Vol. 1, pp. 248–256). Association for Computational Linguistics. Rice, S. H., & Papadopoulos, A. (2009). Evolution with stochastic fitness and stochastic migration. PloS One, 4(10), e7130. Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th conference on uncertainty in artificial intelligence (pp. 487–494). AUAI Press. Si, J., Mukherjee, A., Liu, B., Li, Q., Li, H., & Deng, X. (2013). Exploiting topic based twitter sentiment for stock prediction. In Proceedings of the 51th annual meeting of the association for computational linguistics (pp. 24–29). Sizov, S. (2012). Latent geospatial semantics of social media. ACM Transactions on Intelligent Systems and Technology (TIST), 3(4), 64. Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th annual international conference on machine learning (pp. 1105–1112). ACM. Wang, C., Blei, D., & Heckerman, D. (2008). Continuous time dynamic topic models. In Uncertainty in artificial intelligence [UAI]. Zhu, J., Ahmed, A., & Xing, E. P. (2012). Medlda: Maximum margin supervised topic models. The Journal of Machine Learning Research, 13(1), 2237–2278.