Story co-segmentation of Chinese broadcast news using weakly-supervised semantic similarity


Neurocomputing 355 (2019) 121–133


Wei Feng a,b,∗, Xuecheng Nie c, Yujun Zhang a,b, Zhi-Qiang Liu d, Jianwu Dang a,e

a School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
b Key Research Center for Surface Monitoring and Analysis of Cultural Relics, State Administration of Cultural Heritage, China
c Department of Electrical and Computer Engineering, National University of Singapore, Singapore
d School of Creative Media, City University of Hong Kong, Hong Kong, China
e School of Information Science, Japan Advanced Institute of Science and Technology (JAIST), Japan

Article info

Article history: Received 6 March 2018; Revised 21 January 2019; Accepted 6 May 2019; Available online 10 May 2019. Communicated by Dr. Yu Jiang.

Keywords: Story co-segmentation; Weakly-supervised correlated affinity graph (WSCAG); Parallel affinity propagation; Generalized cosine similarity; Chinese broadcast news; MRF

Abstract

This paper presents lexical story co-segmentation, a new approach to automatically extracting stories on the same topic from multiple Chinese broadcast news documents. Unlike topic tracking and detection, our approach does not need the guidance of well-trained topic models and can consistently segment the common stories from input documents. Following the MRF scheme, we construct a Gibbs energy function that feasibly balances the intra-doc and inter-doc lexical semantic dependencies and solve story co-segmentation as a binary labeling problem at the sentence level. Due to the significance of measuring lexical semantic similarity in story co-segmentation, we propose a weakly-supervised correlated affinity graph (WSCAG) model to effectively derive the latent semantic similarities between Chinese words from the target corpus. Based on this, we extend the classical cosine similarity by mapping the observed word distribution into the latent semantic space, which leads to a generalized lexical cosine similarity measurement. Extensive experiments on a benchmark dataset validate the effectiveness of our story co-segmentation approach. Besides, we specifically demonstrate the superior performance of the proposed WSCAG semantic similarity measure over other state-of-the-art semantic measures in story co-segmentation. © 2019 Published by Elsevier B.V.

1. Introduction

With the rapid growth of multimedia content on the Internet, there is an urgent need for reliable techniques to efficiently and effectively organize and understand such a massive amount of data. To serve this purpose, jointly extracting common stories on the same topic from multiple documents becomes a vital preprocessing step for many real-world semantic content based applications, such as topic retrieval and analysis [1], semantic summarization [2], and user behaviour analysis [3]. The existing Topic Detection and Tracking (TDT) [4] strategy can be exploited to tackle this problem in a two-stage manner: (1) segmenting documents into sequences of stories with automatic story segmentation techniques [5,6]; (2) modeling the semantic representations of stories with topic models [7–9] and extracting the pair of stories with maximum semantic correlations. However, TDT based approaches suffer from two major limitations.

∗ Corresponding author at: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China. E-mail addresses: [email protected], [email protected] (W. Feng), [email protected] (X. Nie), [email protected] (Y. Zhang), [email protected] (J. Dang).

https://doi.org/10.1016/j.neucom.2019.05.016 0925-2312/© 2019 Published by Elsevier B.V.

Firstly, their performance is critically limited by the quality of the story segments: if the automatic story segmentation fails to produce coherent stories, the introduced errors cannot be remedied and severely harm the subsequent topic modeling. Secondly, they heavily rely on well-trained topic models, which are expensive to obtain from the exponentially increasing amount of broadcast news transcripts and are difficult to adapt to the daily changes of news focus. Therefore, it is much more desirable to find a less restrictive way to jointly segment stories on common topics from multiple broadcast news documents, without the prerequisite of story segmentation and the guidance of topic models.

In this paper, we propose a general approach to extract topically coherent stories from unsegmented Chinese broadcast news via lexical cues only. Targeting the limitations of TDT based approaches, we make the following observations. First, stories of interest should be located in multiple documents directly and in one shot, relying on predefined constraints and lexical semantic similarities. Second, the semantics of stories should be explicitly represented with consecutive words in a corpus-driven manner, avoiding the expensive learning of specific topic models. Third, story boundaries and topic labels in existing datasets provide valuable cues for measuring meaningful word-level semantic similarities; however, existing approaches cannot efficiently and effectively leverage this information.


Motivated by these observations, we propose a novel story co-segmentation model using weakly-supervised semantic similarity to extract topically coherent stories from multiple documents in a single-stage manner. The proposed story co-segmentation model removes the need for story segmentation techniques and determines correlated stories from their composed lexical words instead of topic models, making it robust to errors from early commitment and flexible to the changes of daily news. Besides, our weakly-supervised learning scheme sufficiently exploits the story boundary and topic label information, deriving meaningful, corpus-driven semantic correlations at the word level.

In particular, we formulate the proposed story co-segmentation model within the Markov Random Field (MRF) scheme and optimize it as a binary labeling problem at the sentence level. Given a specific Chinese broadcast news corpus, we first construct a sentence-level graph over the documents to encode both the inter-doc and intra-doc semantic dependencies. Then, we initialize the foreground and background story labeling by lexical clustering and common-cluster selection; the current labeling is used to update the foreground and background models. Next, we formalize story co-segmentation as a Gibbs energy minimization problem that takes both the intra-doc coherence and the inter-doc similarity into consideration. Finally, we refine the foreground and background labeling by hybrid optimization. In this way, our story co-segmentation model locates stories on the same topic across multiple documents in a single stage with only lexical information, overcoming the TDT based approaches' requirements of completely segmenting lexical documents and explicitly modeling semantic representations.

In our story co-segmentation model, we measure the correlations among sentences based on their composed lexical words. Therefore, word-level semantic similarity plays a key and atomic role in achieving accurate results. To generate more meaningful word semantics, we propose to utilize story boundaries and topic labels as weak supervision, which are often ignored by prior works. Specifically, we construct intra-story correlations and inter-story correlations to reveal the contextual and topical relationships between words. In addition, based on the composition feature of Chinese words, we find that words with common characters are closely correlated, and we encode this relationship between Chinese words as a common character correlation in our approach. We embed all these correlations into a correlated affinity graph, which can be used to measure the semantic correlated affinity between words. Moreover, we conduct parallel affinity propagation to derive a more reliable semantic affinity measurement. Based on this, we extend the classical cosine similarity by mapping the observed word distribution into the latent semantic space, which leads to a generalized lexical cosine similarity measurement. The proposed weakly-supervised scheme provides an effective way to leverage story boundary and topic label information as guidance to produce meaningful word-level correlations in a corpus-driven manner. In addition, it enables us to consider semantic correlations among different words, overcoming the drawback of the conventional "bag-of-words" assumption, which only counts identical words, and leading to more accurate sentence-level semantic relationships.

Experiments on the benchmark TDT2 dataset [10] demonstrate the superior performance of the proposed story co-segmentation model over TDT based approaches. In addition, ablation studies validate the effectiveness of our weakly-supervised scheme in generating more meaningful semantic similarities, compared with existing word embedding methods, e.g., LSA [11], pLSA [12], and word2vec [13]. In the following, we first briefly review the works closely related to ours in Section 2. Then, we illustrate our energy minimization based approach via the MRF scheme for story co-segmentation in Section 3. The weakly-supervised correlated affinity model for semantic similarity measurement is presented in Section 4. We report the experiments in Section 5. At last, we conclude this paper in Section 6.

2. Related work

2.1. Image co-segmentation

Since the specific algorithms and implementations of image co-segmentation are not the focus of this paper, we only review the general framework of image co-segmentation and illustrate the commonness between image co-segmentation and story co-segmentation in this section. Image co-segmentation has been well studied in recent years. It was first introduced by Rother et al. in [14] to extract the same or similar objects from multiple different images. This problem was motivated by the need for computing the semantic similarity between images with the same subject and different backdrops. Generally, image co-segmentation is formulated as a binary labeling problem of an MRF on the graphs corresponding to the input images. A Gibbs energy function is constructed to encode both the inter- and intra-image dependencies [15]. Then, optimization [16–19] is performed on the Gibbs energy function to generate rational co-segmentation results by labeling all the pixels of the input images.

The idea of story co-segmentation stems from image co-segmentation: we extend the concept of co-segmentation from foreground/background labeling of images to the extraction of lexical common stories from a document set of Chinese broadcast news. Good story co-segmentation results should satisfy two conditions: (1) foreground stories should be meaningful in each document; (2) foreground stories should be sufficiently similar semantically. We also formulate a Gibbs energy function to encode these two aspects, and use a hybrid optimization approach to minimize the energy and label all the sentences of the input documents. We will illustrate our energy minimization based approach via the Markov Random Field (MRF) scheme for story co-segmentation in detail in Section 3. Our preliminary work on story co-segmentation using hard similarity on a synthetic dataset has been published in [20].

2.2. Word-level semantic similarity measurement

Word-level semantic similarity measurement is an essential challenge in Natural Language Processing (NLP) and has attracted the attention of many linguists over the last decades. Existing approaches can mainly be classified into two categories: general knowledge based approaches and corpus specific based approaches. In the following, we review some representative works from these two categories, respectively.

General knowledge based approaches. In general knowledge based approaches, semantic similarity measurements are derived from annotations made by linguistic experts according to their general knowledge and subjective judgment of the semantic relationships between words; these relationships are usually encoded in a well-constructed taxonomy according to their semantic senses. WordNet [21] is a well-known semantic taxonomy that organizes lexicons into hierarchies of is-a relations (synonyms, hypernyms, hyponyms, etc.). Various WordNet based semantic similarity measurements [22–25] have been proposed. These approaches derive semantic similarity measurements from the analysis of the hierarchical structure (path, length, depth, type of links, etc.) and information content. Experimental results have shown that general knowledge based approaches can derive meaningful semantic similarities.


Recently, a Chinese version of lexical taxonomy, known as HowNet, has been proposed by Dong et al. in [26]. HowNet is an on-line common-sense knowledge database describing the inter-conceptual relations and inter-attribute relations of concepts as connoted in Chinese lexicons. It is devoted to demonstrating the general and specific properties of concepts and explicating various relations. One important application of HowNet is to compute word similarity for Chinese [27], relying on the similarity of the included sememes. Despite the success of semantic similarity measurements derived from general knowledge based approaches, they are not the best choice for story co-segmentation: such measurements depend on the general knowledge and subjective judgment of linguistic experts and are therefore less preferred than corpus specific ones.

Corpus specific based approaches. To achieve corpus-driven word-level semantic similarity, many corpus specific based approaches have been proposed, such as Pointwise Mutual Information (PMI) [28], Latent Semantic Analysis (LSA) [29], and Probabilistic Latent Semantic Analysis (pLSA) [30]. These corpus specific based approaches usually rely on statistics and mathematical computation to generate the semantic similarity. Under the assumption that words with similar contexts are semantically correlated, [31] and [32] show the effectiveness of contextual correlation for semantic similarity measurement. Recently, Google released the open source program word2vec [33], which can be used to compute vector representations of words. Given a text corpus as input, word2vec first constructs a vocabulary from the training data, then learns a vector representation for each word in the vocabulary using the continuous bag-of-words or skip-gram neural network architectures. The resulting word vectors can be used to compute similarities of words in the vocabulary via the cosine similarity of two representation vectors. With the development of graph theories, graph models have also drawn much attention from linguists for measuring semantic similarity [34,35]. These approaches encode content information and contextual correlations into weighted graphs to reveal the semantic relations among words in a given corpus. The major drawback of these existing corpus specific based approaches is that they cannot take full advantage of the story boundaries or topic labels provided by the given corpus.

3. Lexical story co-segmentation

In this section, we introduce the proposed story co-segmentation model for extracting topically coherent stories from multiple Chinese broadcast news documents. Given a Chinese broadcast news corpus C = {D_i}_{i=1}^D, our goal is to extract topically coherent stories {R_i}_{i=1}^D from C, where story R_i comes from document D_i. The proposed story co-segmentation model solves this problem with a graph optimization framework based on the Markov Random Field (MRF), and targets an optimal binary labeling of sentences that indicates whether they belong to the stories of interest or not.

Specifically, we first split each document D_i into a sequence of sentences, denoted as D_i = {s_i^j}_{j=1}^{S_i}, where s_i^j represents the jth sentence with fixed length L in the ith document and S_i is the number of sentences of D_i. Then, we construct a sentence-level graph, denoted as G_seg, to encode both inter-doc and intra-doc dependencies, as shown in Fig. 1. In G_seg, the vertex set V_seg consists of all the sentences in C. The edge set, denoted as E_seg, is composed of two kinds of relations: intra-doc edges and inter-doc edges. The intra-doc edges (blue lines in each document in Fig. 1) encourage adjacent sentences in each document to belong to the same topic. To control the complexity, two sentences in a document are linked by an intra-doc edge iff their distance is less than or equal to a cutoff threshold τ_seg.


The inter-doc edges (green lines between any two documents of C in Fig. 1) ensure that sentences of different documents with similar semantic meaning are labeled as the same story. Obviously, G_seg is an instance of an MRF model [36]. With the construction of G_seg = ⟨V_seg, E_seg⟩, we formulate the proposed story co-segmentation model as a binary labeling problem over all sentences in C. In particular, sentences from topically coherent stories are labeled as foreground 1, while the others are labeled as background 0. We use F = {s | label(s) = 1 ∧ s ∈ D_i ∧ i ∈ {1, ..., D}} to represent the set of foreground sentences over all documents in C, that is, F = R_1 ∪ R_2 ∪ ... ∪ R_D, and B_i = {s | label(s) = 0 ∧ s ∈ D_i} the set of background sentences of each document. In the following, we illustrate in detail the optimization process of the proposed model for automatic lexical story co-segmentation.

3.1. Initialization

The first step is to obtain a rational initialization of labels for the foreground F and the background B_i (i ∈ {1, ..., D}) of each document. To achieve this goal, we agglomerate all the sentences in C into K clusters; then we select the sentences in the most common cluster as the initial foreground, and the others as the background. The most common cluster used to initialize the foreground stories should satisfy two assumptions: (1) the number of sentences in this cluster is approximately equal across documents; (2) the word distribution of this cluster is similar across documents. Hence, we formulate these two assumptions as the following discrepancy score to select the most common cluster:

$$\mathrm{Ds}(C_k) = (1-\gamma)\sum_{i=1}^{D}\sum_{j=i+1}^{D}\big\| H(C_k^i) - H(C_k^j) \big\|_2 \;+\; \gamma \sum_{i=1}^{D}\sum_{j=i+1}^{D}\left( \frac{\max(C_k^i, C_k^j)}{\min(C_k^i, C_k^j)} - 1 \right), \qquad (1)$$

where C_k refers to the kth nonempty cluster, C_k^i represents the subset of sentences of C_k belonging to document D_i, min(·) and max(·) return the minimum and maximum number of sentences, H(·) represents the word-over-vocabulary distribution of a set of words, ‖·‖_2 is the Euclidean distance, and γ is a linear modulation parameter controlling the relative importance of the size and semantic discrepancies. Accordingly, we select the most common cluster Ĉ by minimizing the discrepancy score defined above, that is, Ĉ = argmin_k {Ds(C_k)}. In our implementation, we optimize Ds(C_k) by enumeration.

3.2. Foreground/background story modeling

Under the "bag-of-words" assumption, we can represent the semantics of a word stream by its word frequency distribution over a common vocabulary. Therefore, we use the word frequency distributions of F and B_i over the vocabulary V_seg to model the foreground and background, respectively. In this step, we update the foreground and background models as H(F^(t)) and H(B_i^(t)) according to the current labels F^(t) and B_i^(t), where H(·) denotes the un-normalized histogram of word frequencies over V_seg and t is the current iteration number.

3.3. Gibbs energy for story co-segmentation

According to the two conditions for good story co-segmentation results mentioned in Section 2.1, we construct the following Gibbs energy function to take both the inter-doc and intra-doc dependencies into account:


Fig. 1. An example of the sentence-level graph formulation G_seg for a document set of Chinese broadcast news C = {D1, D2, D3} with cutoff value τ_seg = 1.

$$E(X) = \sum_{i=1}^{D} E_{\mathrm{intra}}(X_i) \;+\; \beta \sum_{i=1}^{D}\sum_{j=i+1}^{D} E_{\mathrm{inter}}(X_i, X_j), \qquad (2)$$

where X = {x_n}_{n=1}^N is the set of label variables of all sentences in C and N = Σ_{m=1}^D S_m is the total number of sentences, x_n ∈ {0, 1} is the label of the nth sentence, X_i (with S_i variables) is the labeling of the sentences of document D_i, and the coefficient β balances the roles of the inter-doc and intra-doc energies. The intra-doc energy E_intra(·) measures the goodness of the foreground and background labeling within the same document:

$$E_{\mathrm{intra}}(X_i) = \sum_{j=1}^{|X_i|}\Big[\, x_i^j\, \mathrm{Sim}\big(F^{(t)}, s_i^j\big) + \big(1 - x_i^j\big)\, \mathrm{Sim}\big(B_i^{(t)}, s_i^j\big) \Big] \;+\; \alpha \sum_{n\sim m}\big|x_i^n - x_i^m\big|, \qquad (3)$$

where X_i is the labeling of the ith document (i ∈ {1, ..., D}), x_i^j is the label variable of the jth sentence in document i, n ∼ m indicates that sentences n and m are adjacent, that is, linked in G_seg, and Sim(∗, ∗) denotes the semantic similarity of the word frequency distributions of two word streams over the same vocabulary. The concrete definition of Sim(∗, ∗) will be specified in detail in Section 4. The intra-doc energy E_intra is composed of two parts: the first part is the foreground and background labeling cost, while the second part is the adjacent coherence prior, with α as the parameter modulating their relative influences.

The inter-doc energy E_inter(·) manages the semantic similarity between the foreground stories of any two documents in C. Generally, stories belonging to the same topic should have similar lexical distributions. Fig. 2(a) and (b) give an example of this assumption: for two randomly selected documents, we cluster all of their sentences, then count the number of sentences in each bin for the two whole documents and for their foreground stories, respectively. From the statistics shown in Fig. 2(a) and (b), we can clearly see that although the distance between the two whole documents is large, the distance between the two foreground stories is very small. Hence, we can formulate E_inter(·) as



$$E_{\mathrm{inter}}(X_i, X_j) = \sum_{k=1}^{K}\Bigg( \sum_{p \in C_k^i} x_p \;-\; \sum_{q \in C_k^j} x_q \Bigg)^{2}. \qquad (4)$$

Note that the inter-doc energy E_inter(·) penalizes the difference between the un-normalized histograms of the potential foreground stories of any two documents in C.

3.4. Optimization and model refinement

Since Eq. (4) contains many submodular and supermodular terms at the same time, it is generally NP-hard [16,36] to solve the Gibbs energy function defined in Eq. (2).

For the purpose of both accuracy and efficiency, in this paper we use a hybrid minimization approach to solve Eq. (2). We first minimize the energy function in Eq. (2) by Quadratic Pseudo-Boolean Optimization (QPBO) [16], a classical algorithm for optimizing functions with both submodular and non-submodular terms. Due to the persistency and partial optimality properties of QPBO, we keep all 0–1 labels produced by QPBO and then use Belief Propagation (BP) [17,19] to approximately optimize the simplified energy function of Eq. (2). At last, we use the Extended Submodular-Supermodular Procedure (ESSP) [37] to further improve the labeling results generated by QPBO and BP, thus obtaining a suboptimal full labeling of all variables in X. Based on the current X, we update the foreground and background labelings F^(t) and B_i^(t) (i ∈ {1, ..., D}), respectively. Our energy minimization based approach via the Markov Random Field (MRF) scheme for story co-segmentation is summarized in Algorithm 1.

Algorithm 1 MRF based story co-segmentation
Input: document set C = {D_i}_{i=1}^D, iteration number T, sentence length L, cutoff τ_seg, cluster number K, parameters α, β and γ.
Output: 0–1 labeling results X = {X_i}_{i=1}^D.
Initialization: initialize X using Eq. (1);
for t = 1 to T do
  F^(t) ← H(F^(t));
  for i = 1 to D do
    B_i^(t) ← H(B_i^(t));
  end for
  Construct the Gibbs energy function for story co-segmentation by Eqs. (2)–(4);
  Get X^(t) by QPBO, BP and ESSP;
end for
X ← X^(T);
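To make the energy construction concrete, the following is a minimal Python sketch that evaluates the Gibbs energy of Eqs. (2)–(4) for a given labeling. It is an illustration only: the data layout (documents as lists of sentences, sentences as word lists, a precomputed sentence-to-cluster assignment), the helper names, and the plain cosine Sim(·, ·) used here are our simplifying assumptions, not the paper's exact implementation, which uses the generalized similarity of Section 4 and minimizes this energy with QPBO, BP and ESSP.

```python
import numpy as np

def histogram(words, vocab_index):
    """Un-normalized word-frequency histogram over a fixed vocabulary."""
    h = np.zeros(len(vocab_index))
    for w in words:
        if w in vocab_index:
            h[vocab_index[w]] += 1
    return h

def cosine_sim(h1, h2):
    n = np.linalg.norm(h1) * np.linalg.norm(h2)
    return float(h1 @ h2 / n) if n > 0 else 0.0

def gibbs_energy(docs, labels, clusters, fg_hist, bg_hists, vocab_index,
                 alpha=1.0, beta=1.0, tau_seg=1):
    """Evaluate E(X) of Eq. (2): intra-doc terms (Eq. 3) plus beta times inter-doc terms (Eq. 4)."""
    E = 0.0
    # Intra-doc energy (Eq. 3): labeling cost plus adjacency coherence within distance tau_seg.
    for i, doc in enumerate(docs):
        for j, sent in enumerate(doc):
            h = histogram(sent, vocab_index)
            x = labels[i][j]
            E += x * cosine_sim(fg_hist, h) + (1 - x) * cosine_sim(bg_hists[i], h)
        for n in range(len(doc)):
            for m in range(n + 1, min(n + tau_seg, len(doc) - 1) + 1):
                E += alpha * abs(labels[i][n] - labels[i][m])
    # Inter-doc energy (Eq. 4): squared difference of foreground counts per sentence cluster.
    num_clusters = max(max(c) for c in clusters) + 1
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            for k in range(num_clusters):
                fi = sum(labels[i][p] for p in range(len(docs[i])) if clusters[i][p] == k)
                fj = sum(labels[j][q] for q in range(len(docs[j])) if clusters[j][q] == k)
                E += beta * (fi - fj) ** 2
    return E
```

A full run would wrap this evaluation in the loop of Algorithm 1, re-estimating the foreground/background histograms at each iteration and delegating the actual minimization to the QPBO/BP/ESSP pipeline.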

4. Weakly-supervised measure of Chinese semantic similarity

4.1. Correlated affinity graph

For a given corpus C = {D_i}_{i=1}^D composed of D documents, each document D = {S_j} includes several stories, and each story S_j belongs to a specific topic T_p. The topic label of each story and the story boundaries of each document are known in advance. To generate the semantic similarity measurement between Chinese words, we first construct a weakly-supervised correlated affinity graph, denoted as WSCAG, to encode the intra-story correlation and the inter-story correlation, which takes both content and structure information into account. Moreover, we also add a common character correlation to reveal the composition feature of Chinese words.


Fig. 2. (a) shows the lexical distributions of two randomly selected docs, and the distance between two whole docs is 33.12; (b) shows the lexical distributions of foreground stories in the two selected docs, and the distance between the foreground stories is only 10.24; (c) shows the lexical distributions of the two selected docs after the refinement of affinity matrix, and the distance is 33.06; (d) shows the lexical distributions of foreground stories after the refinement of affinity matrix, and the distance is decreased to 4.79.

In the following, we illustrate the construction of WSCAG in detail.

4.1.1. Intra-story correlation

Words from the same story in a document are closely correlated in their semantics. Modeling the intra-story correlations provides useful cues to extract topically coherent contexts. In this paper, given the word set V_w = {w_i}_{i=1}^V of C, we define the intra-story correlation as follows.

Definition 1 (Intra-story correlation). Two different words w_a and w_b are intra-story correlated iff they simultaneously appear in the same story and their distance (the number of words between w_a and w_b in the same context) is less than or equal to a threshold τ_sim. Herein, τ_sim is an integer and is set to 5 in our experiments.

Fig. 3(a) shows an example of the intra-story correlation for a given corpus C, where τ_sim = 1. Note that two different words w_a and w_b may satisfy the intra-story correlation definition multiple times, and we use F^IntraS_ab to denote the intra-story correlated frequency. Then, the intra-story correlated affinity between two words w_a and w_b can be measured by

$$A_{\mathrm{intra}}(w_a, w_b) = \frac{f_a\, f_b\, F^{\mathrm{IntraS}}_{ab}}{\max_{\mathrm{intra}} + \epsilon}, \qquad (5)$$

where f_a and f_b represent the word frequencies of w_a and w_b, respectively, max_intra = max_{w_i, w_j ∈ V_w} {f_i f_j F^IntraS_ij} is the normalization factor, and ε is a control factor ensuring 0 ≤ A_intra(w_a, w_b) < 1 for a ≠ b, usually set to ε = 10^−6. We define A_intra(w_a, w_a) = 1. We use A_intra = {A_intra(w_a, w_b)}_{w_a, w_b ∈ V_w} to represent the intra-story correlation matrix of all words in V_w. By the definition of intra-story correlation, the more times two words w_a and w_b are intra-story correlated, the higher A_intra(w_a, w_b) is, which is consistent with the "bag-of-words" assumption that topically coherent words are more likely to appear in the same context.

4.1.2. Inter-story correlation

Considering words across different stories is also valuable for analyzing semantic correlations. Words from the same topic have strong correlations that enhance their semantic similarities, even though they come from different stories, while words from different topics provide useful information to distinguish uncorrelated contexts of multiple documents. Hence, given the topic label of each story, we consider the inter-story correlation from two perspectives, intra-topic correlation and inter-topic correlation, defined as follows.


Fig. 3. Given a corpus C: (a) shows an example of the intra-story correlation with the cutoff value τ_sim = 1; (b) shows the construction of the inter-story correlation, where the first row and second row show examples of the intra-topic correlation and inter-topic correlation, respectively; (c) shows some examples of the common character correlations, with all words extracted from the TDT2-ref dataset; (d) shows the construction of the weakly-supervised correlated affinity graph, which combines the intra-story correlation, inter-story correlation and common character correlation together.

Definition 2 (Intra-topic correlation). Two words are intra-topic correlated iff they simultaneously appear in different stories that belong to the same topic.

Then, we define the intra-topic correlation affinity between two words w_a and w_b as

$$A^{\mathrm{intra}}_{\mathrm{inter}}(w_a, w_b) = \frac{f_a\, f_b\, F^{\mathrm{IntraT}}_{ab}}{\max^{\mathrm{intra}}_{\mathrm{inter}} + \epsilon}, \qquad (6)$$

where F^IntraT_ab represents the intra-topic correlated frequency of w_a and w_b, and max^intra_inter = max_{w_i, w_j ∈ V_w} {f_i f_j F^IntraT_ij}.

Definition 3 (Inter-topic correlation). Two words are inter-topic correlated iff they simultaneously appear in different stories that belong to different topics.

Similar to A^intra_inter(w_a, w_b), the inter-topic correlation affinity is

$$A^{\mathrm{inter}}_{\mathrm{inter}}(w_a, w_b) = \frac{f_a\, f_b\, F^{\mathrm{InterT}}_{ab}}{\max^{\mathrm{inter}}_{\mathrm{inter}} + \epsilon}, \qquad (7)$$

where F^InterT_ab is the inter-topic correlated frequency of w_a and w_b, and max^inter_inter = max_{w_i, w_j ∈ V_w} {f_i f_j F^InterT_ij}. Accordingly, we use A^intra_inter = {A^intra_inter(w_a, w_b)}_{w_a, w_b ∈ V_w} and A^inter_inter = {A^inter_inter(w_a, w_b)}_{w_a, w_b ∈ V_w} to represent the intra-topic correlation affinity matrix and the inter-topic correlation affinity matrix over any two words in C, respectively. Then, we define the inter-story correlation affinity as the difference between the intra-topic correlation affinity and the inter-topic correlation affinity:

$$A_{\mathrm{inter}} = A^{\mathrm{intra}}_{\mathrm{inter}} - A^{\mathrm{inter}}_{\mathrm{inter}}. \qquad (8)$$

By Eq. (8), the inter-story correlation affinity between two words w_a and w_b may be negative; to avoid this, we set negative elements in A_inter to 0. For simplicity, we still use A_inter to represent the inter-story correlation affinity matrix after this refinement. By the definition of inter-story correlation, we increase the correlated affinity between two words belonging to the same topic and decrease the correlated affinity between two words belonging to different topics, which enhances the discrimination between words. Fig. 3(b) shows the construction of the inter-story correlation.

4.1.3. Common character correlation

Different from alphabetical languages, e.g., English and French, the character is the atomic morpheme of Chinese. Generally, each Chinese word consists of several characters, each a combination of syllable and sense. Words with the same component characters usually convey similar meanings, which provides powerful clues for measuring semantic similarities in terms of word composition. Therefore, we consider this kind of relationship to further strengthen the semantic similarity measurement, and encode it as the common character correlation. Before giving the definition of the common character correlation, we first describe the correlation between words and their component characters. We use V_c to denote the component character set of the word set V_w. The correlated affinity between any character c_a ∈ V_c and any word w_b ∈ V_w is given by A_P(c_a, w_b) = f_{c_a} / l_b, where f_{c_a} represents the appearance frequency of character c_a in word w_b and l_b represents the number of characters in w_b. Similarly, we use A_P to denote the correlation matrix between words and their component characters, i.e., A_P = {A_P(c_a, w_b)}_{c_a ∈ V_c, w_b ∈ V_w}. Based on the correlations defined between words and characters, the definition of the common character correlation is given as follows.

Definition 4 (Common character correlation). Two words are common character correlated iff they are correlated to the same character.

Fig. 3(c) shows some examples of common character correlations: "军队 (troop)", "军事 (military)", and "军区 (military region)" are all correlated through the common character "军 (army)"; "大学 (university)", "学院 (institute)", "学术 (academic)", and "教学 (teaching)" are all correlated through the common character "学 (study)". Different from the inter-story and intra-story correlations, the common character correlation reveals the semantic relationship between words in terms of their component features. Actually, the common character correlation conforms to the partial matching feature of subwords [38], which is robust to automatic speech recognition errors in story segmentation.

4.1.4. Graph formulation

In order to combine the intra-story correlation, inter-story correlation and common character correlation, we build a weakly-supervised correlated affinity graph to encode these three kinds of correlations, as shown in Fig. 3(d). Words in V_w and characters in V_c are represented as nodes in WSCAG, and we use V_WSCAG to denote the node set of WSCAG, i.e., V_WSCAG = V_w ∪ V_c. Edges in WSCAG can be classified into three categories, representing the intra-story correlation, denoted as E^intra_WSCAG, the inter-story correlation, denoted as E^inter_WSCAG, and the common character correlation, denoted as E^p_WSCAG, respectively. We use E_WSCAG to denote the union of these three subsets, i.e., E_WSCAG = E^intra_WSCAG ∪ E^inter_WSCAG ∪ E^p_WSCAG. Edges in the three subsets are weighted by A_intra, A_inter, and A_P, respectively.


Then, we can denote the WSCAG as G = ⟨V_WSCAG, E_WSCAG, W_WSCAG⟩, which can be represented by the adjacency matrix A_G, given by

$$A_G = \begin{bmatrix} A_{\mathrm{intra}} + A_{\mathrm{inter}} & A_P^{\top} \\ A_P & \mathbf{0} \end{bmatrix}, \qquad (9)$$

where A_P^⊤ represents the transpose of A_P and 0 represents the zero matrix. From Eq. (9), the correlated affinity A_G(w_a, w_b) = A_intra(w_a, w_b) + A_inter(w_a, w_b) between two words w_a and w_b may exceed 1. To ensure 0 ≤ A_G(w_a, w_b) ≤ 1, we set the elements that exceed 1 to 1, and still use A_G to represent the correlated affinity matrix after this refinement. From the above definition, we can see that A_G encodes the intra-story correlation, inter-story correlation and common character correlation together. Generally, topically coherent words are more likely to be correlated in G, and vice versa. Hence, semantic similarity measurement is transferred to measuring the correlated affinity between two words in G, which will be illustrated in detail in the next section.
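For clarity, the following Python sketch assembles the adjacency matrix A_G of Eq. (9) from precomputed correlation matrices. The clipping steps mirror the refinements described above; the random inputs are placeholders rather than correlations estimated from a real corpus (in the paper they come from Eqs. (5)–(8)), and the function name is our own.

```python
import numpy as np

def build_wscag_adjacency(A_intra, A_inter, A_P):
    """Assemble the WSCAG adjacency matrix A_G of Eq. (9).

    A_intra, A_inter : (|Vw| x |Vw|) intra-/inter-story correlation affinity matrices
    A_P              : (|Vc| x |Vw|) character-to-word correlation matrix
    """
    # Negative inter-story affinities from Eq. (8) are suppressed.
    A_inter = np.clip(A_inter, 0.0, None)
    # Word-word affinities that exceed 1 are clipped back to 1.
    word_block = np.clip(A_intra + A_inter, 0.0, 1.0)
    n_w, n_c = A_intra.shape[0], A_P.shape[0]
    A_G = np.zeros((n_w + n_c, n_w + n_c))
    A_G[:n_w, :n_w] = word_block          # word-word block: A_intra + A_inter
    A_G[:n_w, n_w:] = A_P.T               # word-character block: A_P^T
    A_G[n_w:, :n_w] = A_P                 # character-word block: A_P
    return A_G

# Toy usage with random placeholder correlations (4 words, 3 characters).
rng = np.random.default_rng(0)
A_intra = rng.uniform(0, 0.5, (4, 4)); np.fill_diagonal(A_intra, 1.0)
A_inter = rng.uniform(-0.2, 0.5, (4, 4))
A_P = rng.uniform(0, 1, (3, 4))
print(build_wscag_adjacency(A_intra, A_inter, A_P).shape)   # (7, 7)
```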


4.2. Word-level semantic similarity acquisition

With the graph formulation of WSCAG, we measure the semantic correlated affinity A_S(v_a, v_b) between two nodes v_a and v_b according to three principles: (1) the semantic correlated affinity of v_a to itself is 1; (2) A_S(v_a, v_b) is positively correlated with A_G(v_a, v_b); (3) A_S(v_a, v_b) is positively related to the semantic correlated affinities of its neighbors. Accordingly, we formulate A_S(v_a, v_b) in the following iterative propagation form:

$$A_S^{(0)}(v_a, v_b) = A_G(v_a, v_b), \qquad (10)$$

$$A_S^{(t)}(v_a, v_b) = \frac{c}{Z}\sum_{v_i \in N(v_a)}\;\sum_{v_j \in N(v_b)} A_S^{(t-1)}(v_i, v_j), \qquad (11)$$

where A_S^(t)(v_a, v_b) represents the semantic correlated affinity between nodes v_a and v_b at the tth iteration, c is a decay factor, usually set to 0.5, N(v_a) and N(v_b) represent the neighbors of v_a and v_b in WSCAG, respectively, and Z = |N(v_a)||N(v_b)| is the normalization factor. Based on the iterative propagation form defined above, we measure the semantic correlated affinity of nodes v_a and v_b as

$$A_S(w_a, w_b) = \lim_{t \to \infty} A_S^{(t)}(w_a, w_b). \qquad (12)$$

We use A_S = {A_S(w_a, w_b)}_{w_a, w_b ∈ V_w} to denote the semantic correlated affinity matrix over any two words in V_w. Actually, the iterative propagation process defined by Eqs. (10)–(12) conforms to the SimRank [39] procedure. SimRank was first introduced by Jeh and Widom in [39] to measure structural-context similarity in any domain with object-to-object relationships. Effectively, SimRank is based on the assumption that "two objects are similar if they are related to similar objects", which is equivalent to the third principle of our iterative propagation for semantic correlated affinity measurement. Therefore, we use the SimRank algorithm to solve the iterative propagation, taking A_G as input and producing A_S as output. As stated in [39], the complexity of the SimRank algorithm is O(d̄|V_WSCAG|^2), where d̄ and |V_WSCAG| represent the average degree and the number of nodes in G, respectively. Usually, the size of V_WSCAG is very large, and the traditional implementation of SimRank for computing the semantic similarity of all words is time consuming. Recently, He et al. [40] proposed a parallel single-pair SimRank, which has proven to be very efficient for large graphs. Inspired by that, we implement a parallel all-pair SimRank algorithm on GPU, based on the independence of updating different node pairs in the iterative propagation process. Experiments show that our parallel all-pair SimRank implementation is about 10^3 times faster than the CPU implementation, and on average 100 times faster than calling the single-pair parallel SimRank |V|(|V| − 1) times.

Relying on the iterative affinity propagation, we can generate the correlated affinity between any two words in V_w. For a pair of words w_a and w_b, we measure their semantic similarity by the closeness of their correlation contexts, that is, the closeness of the subgraphs centered at w_a and w_b in WSCAG. Hence, we define the semantic similarity between words w_a and w_b as

$$SS(w_a, w_b) = \sum_{w_c \in V_w} A_S(w_a, w_c)\, A_S(w_b, w_c). \qquad (13)$$

The correlation context is composed of the correlations of the given word and its neighbors. To measure the semantic similarity of a pair of words through their correlation contexts, we consider not only their direct correlation affinity generated from the affinity propagation, but also the latent correlation affinities transmitted by their neighbors. We use SS to denote the semantic similarity matrix over all pairs of words in V_w:

$$SS = A_S^{\top} A_S. \qquad (14)$$

4.3. Semantic similarity measurement for word streams

As mentioned in Section 3.3, the semantic similarity measurement for two word streams is central to the MRF based story co-segmentation. To make this measurement more meaningful, we take the latent semantics between different words into consideration. Given a word-level semantic similarity matrix S, we define the semantic similarity of word streams s_i and s_j as

$$\mathrm{Sim}(s_i, s_j) = \frac{f_i^{\top} S f_j}{\sqrt{f_i^{\top} S f_i}\,\sqrt{f_j^{\top} S f_j}}, \qquad (15)$$

where f_i is the word frequency vector of word stream s_i and f_i^⊤ is its transpose. Based on Eq. (15), the more similar two word streams s_i and s_j are, the higher Sim(s_i, s_j) is. In Eq. (15), S can be any rational semantic similarity measurement. Specifically, when S is the identity matrix, that is, when we treat the similarity between identical words as 1 and 0 otherwise (we call this kind of semantic similarity measurement hard similarity in this paper, and denote the corresponding similarity matrix by S_H), Eq. (15) reduces to the classical cosine similarity measurement. In our experiments, we set S = SS to conduct lexical story co-segmentation using the weakly-supervised semantic similarity generated in Section 4.2. According to Eq. (14), we can rewrite Eq. (15) as

$$\mathrm{Sim}(s_i, s_j) = \frac{r_i^{\top} r_j}{\sqrt{r_i^{\top} r_i}\,\sqrt{r_j^{\top} r_j}}, \qquad (16)$$

where r_i = A_S f_i is the representative vector of word stream s_i. Clearly, 0 ≤ Sim(s_i, s_j) ≤ 1. In addition, we can treat r_i as the refined version of f_i obtained by mapping the observed word distribution into the latent semantic space. To illustrate the effectiveness of r_i for measuring the semantic similarity of word streams in story co-segmentation, we use the correlated affinity matrix A_S to refine the frequency vectors of all sentences of the same documents mentioned in Section 3.3, then cluster them to observe the lexical distributions of the two whole documents and of their foreground stories. The results are shown in Fig. 2(c) and (d). Although the lexical distribution between the two whole documents shows no obvious change, the distance between the two foreground stories decreases to 4.79, much lower than when using the original frequency vectors. This indicates that the frequency vectors refined by the correlated affinity matrix represent more meaningful semantics of word streams.
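The sketch below ties Section 4 together in Python: a SimRank-style propagation over A_G (Eqs. (10)–(12), truncated to a fixed number of iterations rather than run to convergence), the word-level similarity SS = A_S^T A_S of Eqs. (13) and (14), and the generalized cosine similarity of Eq. (15). The small dense-matrix implementation and the hand-made toy graph are illustrative assumptions; the paper's parallel all-pair SimRank on GPU is not reproduced here.

```python
import numpy as np

def propagate_affinity(A_G, n_words, c=0.5, iters=5):
    """Eqs. (10)-(12): iterative SimRank-style propagation of correlated affinities over WSCAG."""
    n = A_G.shape[0]
    neighbors = [np.nonzero(A_G[v])[0] for v in range(n)]
    A_S = A_G.copy()
    for _ in range(iters):
        A_next = np.zeros_like(A_S)
        for a in range(n):
            for b in range(n):
                if a == b:
                    A_next[a, b] = 1.0            # principle (1): self-affinity is 1
                elif len(neighbors[a]) and len(neighbors[b]):
                    Na, Nb = neighbors[a], neighbors[b]
                    A_next[a, b] = c / (len(Na) * len(Nb)) * A_S[np.ix_(Na, Nb)].sum()
        A_S = A_next
    return A_S[:n_words, :n_words]                # keep the word-word part

def stream_similarity(f_i, f_j, S):
    """Eq. (15): generalized cosine similarity of two word-frequency vectors under similarity S."""
    num = f_i @ S @ f_j
    den = np.sqrt(f_i @ S @ f_i) * np.sqrt(f_j @ S @ f_j)
    return float(num / den) if den > 0 else 0.0

# Toy usage: 3 words + 2 characters, then compare two short word streams.
A_G = np.array([
    [1.0, 0.6, 0.0, 0.5, 0.0],
    [0.6, 1.0, 0.1, 0.5, 0.0],
    [0.0, 0.1, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
])
A_S = propagate_affinity(A_G, n_words=3)
SS = A_S.T @ A_S                                   # Eq. (14)
f1, f2 = np.array([2.0, 1.0, 0.0]), np.array([0.0, 3.0, 1.0])
print(round(stream_similarity(f1, f2, SS), 3))     # generalized cosine of the two streams
```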


5. Experimental results

5.1. Corpus and experiments setup

Since there is no available benchmark for the story co-segmentation task, we carry out our experiments on a synthetic dataset generated from the Mandarin Chinese broadcast news corpus TDT2. The TDT2 dataset contains 177 audio recordings of VOA Chinese broadcast news with user-annotated story boundaries. All the recordings are accompanied by an ASR error free transcript set, denoted as TDT2-ref, and an LVCSR transcript set, denoted as TDT2-rcg. For TDT2-ref and TDT2-rcg, respectively, there are 497 stories supplied with topic indexes that can be divided into 17 categories, and we use these topic-labeled stories to synthesize our benchmark story co-segmentation dataset. To construct the synthetic documents in our dataset, we randomly select stories from the topic-labeled stories and concatenate them to compose a document. In this paper, we focus on the evaluation of story co-segmentation between two documents; hence, we generate 100 pairs of randomly synthesized documents to make up the benchmark dataset. Moreover, for a pair of documents, we ensure that there is only one story in each document belonging to the same topic. For simplicity, we still use TDT2-ref and TDT2-rcg to denote our synthetic datasets generated from the original TDT2 dataset.

As discussed next, we compare the proposed approach to a number of competing methods on story co-segmentation. All evaluated methods were equally tuned by nested differential evolution (NestDE) [41] to achieve their respective best performance. As a convention of Chinese story segmentation on the TDT2 dataset, a detected boundary is considered correct if it lies within a rational tolerance window (usually set as 15 s) on each side of the ground-truth boundary. Accordingly, we apply the same idea in the story co-segmentation task, that is, a detected foreground sentence is considered correct if it falls in the tolerance window on each side of the ground-truth co-story boundary. In all our experiments, we evaluate the story co-segmentation accuracy by the F1-measure score, i.e., (2 · Precision · Recall) / (Precision + Recall), which has been utilized in previous studies for similar tasks. We will release our code and synthetic dataset for reproducing the results reported in this paper.

5.2. Evaluation on lexical story co-segmentation

To the best of our knowledge, there is no existing approach that can directly extract topically coherent stories from multiple documents in a single stage to compare with the proposed story co-segmentation model. To comprehensively evaluate the efficacy of our approach, we design baselines relying on the two-stage Topic Detection and Tracking (TDT) [4] framework, which sequentially performs story segmentation and topic modeling. Here, we mainly focus on comparing the proposed approach with topic modeling based ones; therefore, we exploit a fixed story segmentation model for all the baselines. Specifically, we utilize the state-of-the-art story segmentation model proposed in [42] to generate story segments for each document. We design baselines with different topic models, including Latent Dirichlet Allocation (LDA) [7], Probabilistic Latent Semantic Analysis (pLSA) [8], and lda2vec [9]. Note that lda2vec is the state-of-the-art deep-learning based approach for modeling topics, incorporating word2vec and LDA to extract document representations by simultaneously considering word-context correlations and inter-word relationships.
In the implementations of the baselines, we first perform story segmentation via [42] to generate story segments for each document. Then, we apply the topic models to each story segment to produce its representation vector. We measure the correlations of story segments via the cosine similarities of their representation vectors.

Table 1
Performance evaluation for the lexical story co-segmentation task solved by our proposed single-stage approach and by baselines relying on the two-stage TDT framework with different topic models on the TDT2 benchmark.

TDT2-rcg            Precision   Recall   F1-measure
Baseline(LDA)       0.4921      0.6577   0.5430
Baseline(pLSA)      0.5041      0.6638   0.5520
Baseline(lda2vec)   0.5703      0.7129   0.6135
Ours                0.5623      0.7062   0.6049

TDT2-ref            Precision   Recall   F1-measure
Baseline(LDA)       0.5013      0.6702   0.5536
Baseline(pLSA)      0.5099      0.6782   0.5620
Baseline(lda2vec)   0.5772      0.7222   0.6214
Ours                0.5669      0.7139   0.6134

Finally, we generate topically coherent stories by conducting a greedy search to find the set of story segments with maximum similarities between two or multiple documents.

We conduct experiments on the TDT2-rcg and TDT2-ref datasets and evaluate the performance with the F1-measure as the metric. We also provide the corresponding Precision and Recall to compare the performance of our approach with the baselines more comprehensively. Results are shown in Table 1, where Baseline(x) denotes the baseline following the TDT framework with x as the topic model, x ∈ {LDA, pLSA, lda2vec}. From Table 1, we can see that our approach achieves 0.6049 and 0.6134 F1-measure on the TDT2-rcg and -ref datasets, respectively, significantly outperforming the widely used topic models LDA (0.5430 and 0.5536 F1-measure) and pLSA (0.5520 and 0.5620 F1-measure). We can also find that our approach achieves performance comparable with the state-of-the-art deep-learning based topic model lda2vec (0.6135 and 0.6214 F1-measure). Similar observations can be made when using Precision and Recall as the metrics. These results demonstrate the efficacy of our proposed single-stage approach for story co-segmentation over the two-stage TDT framework guided by different topic models.

Since our approach solves story co-segmentation within an energy minimization based framework, the iteration number can significantly affect its efficiency and accuracy. We have conducted experiments on the TDT2 dataset regarding the iteration number, and the results are shown in Fig. 4. From Fig. 4, we can observe that our approach achieves a good solution after 4 or 5 iterations, and further iterations do not significantly improve the accuracy. Taking efficiency into consideration, we set the iteration number to 5 in our experiments.

Since there are no accurate sentence boundaries in the documents of the TDT2 dataset, we use sequential word streams with fixed length to define the sentences in story co-segmentation, known as "pseudo-sentences"; a similar idea has been widely applied in the task of story segmentation. Pseudo-sentences are the atomic units conveying semantic meaning in MRF based story co-segmentation, so the sentence length has an important effect on accuracy. Fig. 5 shows the experimental results on the TDT2 dataset. From Fig. 5, we can find that as the sentence length increases, the co-segmentation accuracy improves; this is because a longer sentence can encode more information, which also demonstrates the effectiveness of taking the latent correlations between different words into consideration. From Fig. 5, we can also observe that when the sentence length exceeds a certain value, about 30–35, the co-segmentation accuracy decreases; we think that if a pseudo-sentence is too long, it covers too many words, which has a negative effect on the semantic similarity measurement. The sentence length in our approach is therefore set to 30.
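As a small illustration of the pseudo-sentence construction described above, the Python sketch below splits a word stream into consecutive fixed-length chunks; treating the chunks as non-overlapping is our assumption of the simplest variant.

```python
def to_pseudo_sentences(words, length=30):
    """Split a word stream into consecutive fixed-length pseudo-sentences."""
    return [words[i:i + length] for i in range(0, len(words), length)]

# Example: a 70-word transcript yields pseudo-sentences of 30, 30, and 10 words.
print([len(s) for s in to_pseudo_sentences(list(range(70)))])   # [30, 30, 10]
```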


Fig. 4. Experiments on the iteration number. The left and right figures show the results on TDT2-ref and TDT2-rcg, respectively.

Fig. 5. Experiments on the sentence length. The left and right figures show the results on TDT2-ref and TDT2-rcg, respectively.

Fig. 6 shows the comparison results between the proposed semantic similarity measurement of word streams and the classical cosine similarity of word streams. From Fig. 6, we can find that the proposed semantic similarity measurement of word streams obviously outperforms cosine similarity measurement on both the ASR error free transcripts set TDT2-ref and LVCSR transcripts set TDT2-rcg, and achieves 17.8% and 17.1% F1-measure improvement, respectively. Through the experimental results, we can see that taking the latent correlations between different words into consideration can help to achieve more meaningful semantic similarity measurement.

5.3. Comparison to other semantic similarity measures

To illustrate the effectiveness of our weakly-supervised semantic similarity measurement, we now compare it with five state-of-the-art semantic similarity measurements: (1) Pointwise Mutual Information (PMI); (2) Latent Semantic Analysis (LSA); (3) Probabilistic Latent Semantic Analysis (pLSA); (4) word2vec (W2C); (5) HowNet. We denote the proposed weakly-supervised semantic similarity measurement as WSCAG. Empirically, the goodness of a semantic similarity measurement can be evaluated by the average sentence-level intra- and inter-topic ratio, defined as

$$R(C) = \frac{\exp\big(\mathrm{Avg}_{\mathrm{lab}(s_i) = \mathrm{lab}(s_j)}\, \mathrm{Sim}(s_i, s_j)\big)}{\exp\big(\mathrm{Avg}_{\mathrm{lab}(s_i) \neq \mathrm{lab}(s_j)}\, \mathrm{Sim}(s_i, s_j)\big)}, \qquad (17)$$

where lab(s_i) represents the topic label of sentence s_i. Obviously, a higher R ratio corresponds to a better discriminative ability in story co-segmentation. Fig. 7 shows the intra- and inter-topic ratios computed using different semantic similarity measurements on the TDT2 dataset. From Fig. 7, we can see that the R ratio obtained by our weakly-supervised semantic similarity measurement WSCAG is about twice that of the other similarity measurements on both the TDT2-ref and TDT2-rcg datasets, which demonstrates the discriminative ability of WSCAG in story co-segmentation. The word2vec measurement only ranks second in the experiments; we think this is because the corpus size is not large enough for word2vec to learn meaningful semantic similarities. The HowNet measurement achieves the lowest R ratio, approximately 1, in our experiments, and fails to distinguish the semantics of different topics, which partially illustrates that semantic similarity measurements based on general knowledge are not suitable for story co-segmentation.
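A minimal Python sketch of the ratio in Eq. (17) is given below, assuming a precomputed sentence-similarity function and topic labels; how sentence pairs are sampled in the actual evaluation is not specified here and is our simplification.

```python
import numpy as np
from itertools import combinations

def topic_ratio(sentences, labels, sim):
    """Eq. (17): exp(mean intra-topic similarity) / exp(mean inter-topic similarity)."""
    intra, inter = [], []
    for i, j in combinations(range(len(sentences)), 2):
        s = sim(sentences[i], sentences[j])
        (intra if labels[i] == labels[j] else inter).append(s)
    return np.exp(np.mean(intra)) / np.exp(np.mean(inter))

# Toy usage with a hand-crafted similarity: same-topic pairs score higher, so R > 1.
sents = ["a", "b", "c", "d"]
labels = [0, 0, 1, 1]
sim = lambda x, y: 0.9 if x in "ab" and y in "ab" or x in "cd" and y in "cd" else 0.2
print(round(topic_ratio(sents, labels, sim), 3))   # > 1 for a discriminative measure
```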


Fig. 6. Story co-segmentation results using the proposed semantic similarity measurement of word streams and classical cosine similarity measurement.

Fig. 7. The average intra- and inter-topic similarity ratios calculated by WSCAG, word2vec, PMI, LSA, pLSA, and HowNet on the TDT2-ref and TDT2-rcg datasets. Best viewed in 2x zoom.

Fig. 8. Best story co-segmentation results on TDT2 using WSCAG, PMI, LSA, pLSA, word2vec and HowNet to measure the word-level semantic similarity.

Fig. 8 shows the story co-segmentation results obtained using different semantic similarity measurements on our synthetic dataset. From Fig. 8, we can find that WSCAG achieves the best F1-measure score on both TDT2-ref and TDT2-rcg and outperforms the existing corpus specific semantic similarity measurements; we think this is because the existing corpus specific measurements are based on statistics and mathematical computation and cannot take full advantage of the available information to generate a more reliable semantic similarity measurement. Besides, we can see that HowNet achieves the lowest F1-measure score on both datasets; we think the general knowledge based semantic similarity measurement, which relies on subjective judgment, cannot provide the discriminative ability needed to distinguish the meaning of each topic. The above results confirm the ranking of R ratios in Fig. 7 and illustrate the effectiveness of the weakly-supervised semantic similarity measurement for lexical story co-segmentation. Moreover, the story co-segmentation results presented in Fig. 8 are better than the results generated using the classical cosine similarity as the semantic similarity measure of word streams in Fig. 6, which further demonstrates the effectiveness of considering the latent correlations between different words for generating more reliable semantic similarities of word streams.


Fig. 9. Story co-segmentation results using WSCAG, WSCAG-Affinity, WSCAG-IntraStory, and WSCAG-InterStory. The left and right figures show the results on TDT2-ref and TDT2-rcg, respectively.

Fig. 10. Robustness evaluation for semantic similarity measurements. The left figure and right figure show the story co-segmentation results by different semantic similarity measurements on TDT2-ref and TDT2-rcg datasets using 100 groups of randomly generated parameters, respectively. Best viewed in color.

To demonstrate the robustness of our semantic similarity measurement in story co-segmentation, we conduct a stricter experiment by comparing different semantic similarities in story co-segmentation using 100 groups of randomly generated parameters. The results are shown in Fig. 10. From Fig. 10, we can see that WSCAG always achieves higher co-segmentation accuracy on both TDT2-ref and TDT2-rcg.

To illustrate the effectiveness of the affinity propagation, we compare WSCAG with several simplified versions of our weakly-supervised semantic similarity measurement: (1) the correlated affinity generated from the intra- and inter-story correlations, denoted as WSCAG-Affinity; (2) the intra-story correlations refined by the affinity propagation, denoted as WSCAG-IntraStory; (3) the inter-story correlations refined by the affinity propagation, denoted as WSCAG-InterStory. The story co-segmentation results are shown in Fig. 9. From Fig. 9, we can find that WSCAG achieves the best F1-measure score. Besides, both WSCAG-IntraStory and WSCAG-InterStory outperform WSCAG-Affinity, which illustrates that the affinity propagation helps to generate a more meaningful semantic similarity measurement. We can also observe that WSCAG-IntraStory outperforms WSCAG-InterStory, which demonstrates that the intra-story correlation plays a more important role than the inter-story correlation in the semantic similarity measurement. Moreover, combining the results in Figs. 8 and 9, we can find that WSCAG-Affinity achieves better story co-segmentation results than the existing corpus specific based and general knowledge based semantic similarity measurements, which further illustrates the effectiveness of our weakly-supervised approach to measuring semantic similarity by taking the intra- and inter-story correlations into account.

6. Conclusion

This paper extends the concept of co-segmentation to extracting topically coherent stories from multiple documents. There are three main contributions of this work. First, we propose an energy minimization based approach via the Markov Random Field (MRF) scheme for automatic lexical story co-segmentation. This approach is purely data-driven. Based on the MRF model, it transfers lexical story co-segmentation into a binary labeling problem. To achieve rational results, we encode both intra- and inter-dependencies into a Gibbs energy function, which is minimized by hybrid optimization. Second, we construct a weakly-supervised correlated affinity graph that encodes the intra-story correlation, inter-story correlation and common character correlation together to reveal the semantic relationships. The intra-story correlation acts as a contextual correlation filter to retain words with the closest senses; the inter-story correlation enhances the discriminative ability of words for different topics; and the common character correlation conforms to the partial matching feature of subwords, which is more robust to ASR errors. Third, we present an iterative affinity propagation process to generate word-level semantic similarity from a given correlation graph.



Based on the assumption that similar words are correlated with similar neighbours in the correlated graph, we tackle the iterative affinity propagation with a parallel SimRank algorithm on the GPU. Moreover, we extend the classical cosine similarity measurement for word streams by mapping the observed word distribution into the latent semantic space. We conduct experiments on a synthetic dataset generated from the benchmark TDT2 dataset and find that Chinese broadcast news story co-segmentation using the proposed energy minimization framework (via the MRF model) achieves impressive results. We also compare our weakly-supervised semantic similarity measurement with corpus-specific measurements (PMI, LSA, pLSA, and word2vec) and with the general-knowledge-based measurement HowNet. Experiments demonstrate the superiority of our approach over state-of-the-art corpus-specific and general-knowledge-based measurements for the story co-segmentation problem. In addition, taking the latent semantic relationships between different words into consideration yields more reliable semantic similarities for word streams. Although the proposed approach is mainly applied to Chinese documents and words in this paper, it can be regarded as a general framework and extended to other languages for the story co-segmentation task, which we will explore in future work.

Conflict of interest

None.

Acknowledgment

We thank all reviewers and the associate editor for their valuable comments. This work was supported by the National Natural Science Foundation of China (NSFC Grants No. 61671325 and 61572354).

Wei Feng received the B.S. and M.Phil. degrees in computer science from Northwestern Polytechnical University, China, in 2000 and 2003, respectively, and the Ph.D. degree in computer science from City University of Hong Kong in 2008. From 2008 to 2010, he was a research fellow at the Chinese University of Hong Kong and at City University of Hong Kong, respectively. He is currently a full professor in the School of Computer Science and Technology, Tianjin University. His major research interest is active robotic vision and visual intelligence, specifically including active camera relocalization and lighting recurrence, general Markov Random Field modeling, discrete/continuous energy minimization, image segmentation, active 3D scene perception, SLAM, and generic pattern recognition. Recently, he has also focused on solving preventive conservation problems of cultural heritage via computer vision and machine learning. He received the support of the Program for New Century Excellent Talents in University, China, in 2011. He is an associate editor of the Journal of Ambient Intelligence and Humanized Computing and a member of the IEEE.

Xuecheng Nie received the B.S. and M.Eng. degrees from the School of Computer Software, Tianjin University, Tianjin, China, in 2012 and 2015, respectively. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the National University of Singapore, Singapore. His research interests focus on computer vision, deep learning, human pose estimation, human parsing, and face detection.

Yujun Zhang received the Bachelor's degree in Computer Science and Technology from Shanxi University. She is currently a junior postgraduate student at Tianjin University. Her research interests include active vision, lighting recurrence, and 3D scene perception and reconstruction.

Zhi-Qiang Liu received the M.A.Sc. degree in Aerospace Engineering from the Institute for Aerospace Studies, The University of Toronto, and the Ph.D. degree in Electrical Engineering from The University of Alberta, Canada. He was a Chair Professor with the School of Creative Media, City University of Hong Kong, and is currently with the Department of Management and Innovation Systems, University of Salerno, Italy. He has taught computer architecture, computer networks, photography, artificial intelligence, art & technology, programming languages, fashion, machine learning, pattern recognition, media systems, computer graphics, and animation. His interests are machine learning, mountain/beach trekking, human-media systems, gardening, computer vision, photography, mobile computing, and computer networks. In 2012, he received the most prestigious Teaching Award from the University Grants Committee (UGC) in Hong Kong, and the Teaching Excellence Award from City University of Hong Kong.


Jianwu Dang graduated from Tsinghua University, China, in 1982, and received his M.S. degree from the same university in 1984. He worked at Tianjin University as a lecturer from 1984 to 1988. He was awarded the Ph.D. degree by Shizuoka University, Japan, in 1992. He worked at ATR Human Information Processing Labs., Japan, as a senior researcher from 1992 to 2001, and joined the University of Waterloo, Canada, as a visiting scholar for one year from 1998. Since 2001, he has worked at the Japan Advanced Institute of Science and Technology (JAIST) as a professor. He joined the Institut de la Communication Parlée (ICP), Centre National de la Recherche Scientifique, France, as a first-class research scientist from 2002 to 2003. Since 2009, he has also been with Tianjin University, Tianjin, China. His research interests cover all fields of speech production, speech synthesis, speech recognition, and spoken language understanding. He constructed MRI-based physiological models for speech and swallowing, and endeavors to apply these models in clinical settings. He has published more than 60 journal papers and more than 300 conference papers.