A Topic Drift Model for Authorship Attribution


Communicated by Prof. H. Zhang

Accepted Manuscript

Min Yang, Xiaojun Chen, Wenting Tu, Ziyu Lu, Jia Zhu, Qiang Qu

PII: S0925-2312(17)31375-9
DOI: 10.1016/j.neucom.2017.08.022
Reference: NEUCOM 18765

To appear in: Neurocomputing

Received date: 26 December 2016
Revised date: 25 July 2017
Accepted date: 3 August 2017

Please cite this article as: Min Yang, Xiaojun Chen, Wenting Tu, Ziyu Lu, Jia Zhu, Qiang Qu, A Topic Drift Model for Authorship Attribution, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.08.022

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

A Topic Drift Model for Authorship Attribution

Min Yang (a,d), Xiaojun Chen (a,*), Wenting Tu (b), Ziyu Lu (b), Jia Zhu (c), Qiang Qu (d)

a College of Computer Science and Software, Shenzhen University, Shenzhen, P.R. China
b Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
c School of Computer Science, South China Normal University, Guangzhou, P.R. China
d Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, P.R. China

Abstract

Authorship attribution is an active research direction due to its legal and financial importance. Its goal is to identify the authors of anonymous texts. In this paper, we propose a Topic Drift Model (TDM) that monitors the dynamics of authors' writing styles and learns authors' interests simultaneously. Unlike previous authorship attribution approaches, our model is sensitive to temporal information and to the ordering of words, and can therefore extract more information from texts. The experimental results show that our model achieves higher accuracy than competing models. We also demonstrate the potential of our model to address the authorship verification problem.

Keywords: Authorship attribution, Topic model, Topic Drift Model



Corresponding author. Email addresses: [email protected] (Min Yang), [email protected] (Xiaojun Chen), [email protected] (Wenting Tu), [email protected] (Ziyu Lu), [email protected] (Jia Zhu), [email protected] (Qiang Qu)

Preprint submitted to Neurocomputing, August 17, 2017


1. Introduction

With the rapid development of Internet technologies, new ideas and information can be spread easily via online services such as forums, emails, and micro-blogs. However, since people do not need to provide their real identities on the Internet, automatic techniques that accurately determine the authorship of anonymous texts are increasingly essential in various cybercrime cases. Applications include online fraud detection, terrorist message origination, article counterfeiting, and plagiarism detection.

Authorship attribution is the problem of recognizing the authorship of anonymous texts, where the candidate suspects and their writing samples are provided. It can assist law enforcement officers in discovering criminals who supply false or inaccurate information through their virtual identities. Most previous studies of authorship attribution concentrated on formal texts with a small number of candidate authors. Unfortunately, this assumption is not consistent with reality. Recently, some researchers have turned to informal texts (e.g., emails and social blogs) with tens to thousands of authors (see Section 2). Despite the effectiveness of these approaches, we argue that authorship attribution of informal texts with thousands of authors remains challenging.

The principle behind authorship attribution is that everyone has a unique writing style, so each person's written texts are different. For example, some people use "delicious" to praise food while others may use "good"; capturing such differences in writing style is crucial for authorship attribution. However, most existing authorship attribution approaches neglect the temporal changes in authors' writing styles, which may lead to poor performance when authors change their writing styles frequently. There is strong evidence that when a person matures or when a significant event (e.g., changing jobs) happens, his/her writing style may change dramatically [1, 2, 3]. For example, [3] showed the temporal changes in vocabulary usage of 133 Twitter users over a period of 40 months. From that result, it can be concluded that a text is more similar to texts written in a similar time range than to texts written much earlier. Also, old interests may slowly fade out of an author's preferences while new interests arise. When authors write about different topics, specific words are used; for example, a text about "neural network" is likely to contain the word "deep" in recent years. The changes in authors' interests should therefore be taken into account to further improve the performance of the authorship attribution task.

Moreover, previous authorship attribution approaches are often based on the bag-of-words assumption, which states that the meaning of a document is fully characterized by the unigram statistics of its words. The bag-of-words assumption neglects the ordering of words and the semantics of the context, although these factors are important for identifying an author's writing. For instance, the phrases "To be, or not to be: that is the question", "The question is whether to be or not to be", and "The question is whether to be or not" have similar unigram statistics but different meanings. Therefore, approaches built on the bag-of-words assumption inevitably lose distinguishing information.

To address the above challenges, we propose a novel Topic Drift Model (TDM) to model the dynamic evolution of individual authors' writing styles and interests. Our model is a topic model inspired by the Gaussian Mixture Neural Topic Model (GMNTM) in [4]. The latent topics in the model do not have to stand for actual, human-interpretable topics. Indeed, in our experiments, we note that some latent topics, with words such as "speech" and "recognition", correspond to actual topics, while other latent topics seem to correspond to authorship style, as reflected by net-speak words such as "haha" and "sooo". Moreover, we treat stopwords as a particular latent topic that can be seen as a stylistic marker, even though stopwords have no interpretability as a topic.

TDM learns vector representations of words and of authors' interests. The similarity between these vectors is measured either by their inner product or by the Euclidean distance. Since an author's interests may change over time, the interests of each author are represented by a sequence of vectors, where the speed of the topic drift is controlled by the similarity of consecutive vectors. TDM also captures the fact that co-authors usually share common topics by making their interest vectors positively correlated. Finally, we assign an anonymous text to the author whose interest vectors have the highest similarity with that of the anonymous text.

In summary, the main contributions of our paper are the following:

1. Our model captures the fact that authors' interests and writing styles may change over time. The drift of authors' interests and writing styles can be learned automatically from the raw data.

2. We use an interest vector vec(a, d) to represent the interests of author a when he/she wrote document d, which helps control the speed of interest drift. Our model provides a natural way to measure the similarity between authors' interests as well as the similarity between authors and anonymous documents.

3. We provide a method for learning high-quality distributed vector representations of words and take the ordering of words into account, so that our model captures the semantics of the context and the syntactic regularities of language.

4. Our model not only works on authorship attribution (identifying the author of an anonymous text from a set of candidates), but also performs well on authorship verification [5] (determining whether a specific author did or did not write a text). Instead of assigning the given text to the candidate author whose interest vector has the highest similarity with that of the given text (the "new" author), we assign the given text to that author only if the similarity between them exceeds a predefined threshold ρ (ρ ∈ [0, 1]); otherwise no authorship is determined for the text (the output is Don't Know). By changing the value of the threshold, we can control the trade-off between precision and recall.

5. We conduct extensive experiments on three widely used, publicly available datasets. The experimental results show that our model achieves better performance than competing approaches. We also demonstrate the potential of our model to address the authorship verification problem.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 presents TDM in detail. Section 4 describes the experiments. Section 5 concludes the paper and indicates future work.


2. Related work

In recent years, many statistical and machine learning techniques have been proposed for authorship attribution. Stamatatos [6] gave a comprehensive survey of existing techniques for automatic authorship attribution.

Most previous studies considered a simpler version of the authorship attribution problem, dealing with formal texts and only a small set of candidate authors [7, 8, 9]. Mosteller and Wallace [8] studied the authorship of "The Federalist Papers", one of the most influential works on authorship attribution. However, there were only two candidate authors, Hamilton and Madison, in [8], which is unrealistic in real-life applications. In order to apply authorship attribution to real-life data, large candidate sets with informal texts have recently been taken into consideration [10, 11]. For example, Koppel et al. [12] studied authorship attribution "in the wild", where the set of known candidates is very large (i.e., 10,000 candidate authors); they used similarity-based methods along with multiple randomly generated feature sets. Zhang et al. [13] represented an author with rich features, including a biography describing the author's background and book contents drawn from the author's previous works.

In parallel, topic models [14, 15, 16, 17] have gained popularity as a means of analyzing large text corpora. Latent Dirichlet Allocation (LDA) [14] is one of the most popular topic modeling approaches. In LDA, each document is viewed as a mixture of latent topics, where each topic is characterized by a distribution over words. Several LDA-style models have been applied to authorship attribution. Song et al. [18] applied the author-topic model [15] to perform authorship prediction. Seroussi et al. [19, 20] employed LDA for authorship attribution and achieved state-of-the-art performance for both small and large numbers of candidate authors. Van Dam and Hauff [2] investigated the influence of topic and of time on author verification accuracy. Azarbonyad et al. [3] analyzed the temporal changes in authors' word usage and proposed an approach to estimate the dynamicity of authors' word usage.

Most of the aforementioned approaches employ the bag-of-words assumption, which is rarely true in practice. The bag-of-words assumption loses the ordering of the words and ignores the semantics and syntax of the context. Several previous works have taken the order of words into account. For example, Yang et al. [4] proposed a generative topic model that represents each topic as a cluster of multi-dimensional vectors and embeds the corpus into a collection of vectors generated by a Gaussian mixture model. Nevertheless, none of these models uses timestamp information or captures authors' writing styles and interests.

Our model is inspired by the work of [4], but its focus is different. Our model explicitly models the evolution of author interests, instead of adding authorship as a simple feature. It is the first model that exploits the interaction between authorship information and timestamp information. At the same time, the order and semantics of words are taken into account based on the context.

3. Model

In this section, we describe TDM as a generative model. We then present the inference algorithm for estimating the model parameters.


3.1. Generative model

We assume that there are W different words in the vocabulary and D documents in the corpus. In addition, these documents cover T topics, where T is a hyper-parameter specified by the user. Here, a hyper-parameter is a parameter that cannot be learned directly from the regular model training process; generally, cross-validation or a validation dataset is used to choose hyper-parameters. For a specific document d, suppose that it has m authors a_1, ..., a_m. We use an interest vector vec(a_i, d) ∈ R^p to represent the interests of author a_i when he/she writes document d. Two authors have similar interests if the distance between their interest vectors is small. Our goal is to learn these interest vectors from the content of the corpus. Clearly, the interests of co-authors should be correlated. Also, the interests of the same author when writing different documents should be correlated and consistent. Thus, we define a generative model to characterize the correlation between these dependent interests.

PT

Interest vectors can be generated sequentially from the earliest document to the latest document. Let t be the time that document d was written.

CE

Let d0i be the last document that author ai has written before she writes the current document, and let the timestamps for d0i be t0i . We have t0i ≤ t for all

AC

i = 1, ...., m. We define a joint distribution of the interest vectors vec(a, d) :=

(vec(a1 , d), ..., vec(am , d)). It is a multivariate normal distribution on Rmp taking the form N (µd , Σd ). Firstly, we specify the mean µd . Let µd := (vec(a1 ,d01 ), ...., vec(am ,d0m )) 8

ACCEPTED MANUSCRIPT

where vec(ai ; d0i ) is the interest vector for author ai when she wrote document d0i . This definition implies that the new interests of authors have connections

CR IP T

to the history. Second, we define the covariance matrix Σd . Note that Σd is an mp ×

mp matrix, so that it can be partitioned into m2 sub-matrices, each with dimension p × p. Let Σij ∈ Rp×p be the sub-matrix of Σd on the i-th row

and j-th column. If i = j, then the sub-matrix characterize the covariance

AN US

of the author i’s interest. Let 2 Σii := (σ(t − t0i )) I

where σ(x) is an increasing function of x. It means that as time passing, the covariance matrix entries get bigger. More concretely, we adapt a linear

M

function σ(x) = α + βx with hyper-parameters α > 0 and β > 0. If i 6= j, then the sub-matrix Σij characterizes the correlation between

PT

ED

the interest of the author i and the author j. Let Σij := ρσ(t − t0i )σ(t − t0j )I

CE

where ρ ∈ [0, 1] is another hyper-parameter measuring the correlation of the interest of co-authors. if ρ = 0, then vec(a1 , d),...,vec(am , d) are mutually independent, meaning that there is no correlation between co-authors. If

AC

ρ = 1, then the drift vectors vec(ai , d) − vec(ai , d0i ) are in the same direction

for all co-authors, meaning that the interest drift is perfectly correlated. In summary, the density function of vector vec(a, d) is defined by
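To make the construction of µ_d and Σ_d concrete, the following minimal Python sketch builds the joint Gaussian prior over the co-authors' interest vectors. It is illustrative only; the function name and the default values of α, β and ρ are assumptions rather than the settings used in our experiments.

```python
import numpy as np

def interest_prior(prev_vecs, prev_times, t, alpha=0.1, beta=0.01, rho=0.5):
    """Mean and covariance of the joint Gaussian prior over vec(a_1, d), ..., vec(a_m, d).

    prev_vecs  : list of m arrays of shape (p,), the authors' previous interest vectors
    prev_times : list of m floats, timestamps t'_i of the previous documents
    t          : timestamp of the current document
    alpha,beta : parameters of the linear drift function sigma(x) = alpha + beta * x
    rho        : correlation of co-authors' interest drift, in [0, 1]
    """
    m, p = len(prev_vecs), prev_vecs[0].shape[0]
    mu = np.concatenate(prev_vecs)                       # mean mu_d = previous interests
    sigmas = [alpha + beta * (t - ti) for ti in prev_times]
    cov = np.zeros((m * p, m * p))
    for i in range(m):
        for j in range(m):
            scale = sigmas[i] * sigmas[j] * (1.0 if i == j else rho)
            cov[i*p:(i+1)*p, j*p:(j+1)*p] = scale * np.eye(p)
    return mu, cov

# Example: two co-authors with 3-dimensional interest vectors.
mu_d, Sigma_d = interest_prior([np.zeros(3), np.ones(3)], [1.0, 2.0], t=5.0)
sample = np.random.multivariate_normal(mu_d, Sigma_d)    # one draw of vec(a, d)
```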

In summary, the density function of the vector vec(a, d) is defined by

p(vec(a, d) = v) ∝ exp( −(1/2) (v − µ_d)^T Σ_d^{−1} (v − µ_d) )    (1)

where the vector µ_d and the matrix Σ_d follow the definitions above.

3.1.2. Word representation

The words are also represented by vectors, so that their semantic representations are naturally connected with those of the author interests. For each word w ∈ {1, ..., W}, there is a p-dimensional vector representation vec(w) ∈ R^p.

We assume that all these vectors are generated from a multivariate normal distribution conditioned on the topic. Given a topic k, the conditional distribution of the word vector for word w is defined by vec(w; k) ~ N(µ_k, Σ_k), where µ_k and Σ_k denote the mean and the covariance matrix of topic k. We assume that topic k is chosen with probability π_k, such that Σ_{k=1}^{T} π_k = 1.

The parameters of the Gaussian mixture model are collectively represented by

Ψ = {π_k, µ_k, Σ_k},  k = 1, ..., T.

Given this collection of parameters, we use

p(x | Ψ) = Σ_{i=1}^{T} π_i N(x | µ_i, Σ_i)    (2)

to represent the probability distribution for sampling a vector x from the Gaussian mixture model.
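As an illustration of this generative step, the short sketch below draws a topic from π and then a latent vector from the corresponding Gaussian component. The toy sizes and parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 3, 4                                   # toy number of topics and vector dimension
pi = np.array([0.5, 0.3, 0.2])                # topic probabilities, sum to 1
mu = rng.normal(size=(T, p))                  # per-topic means mu_k
Sigma = np.stack([s * np.eye(p) for s in (0.5, 1.0, 2.0)])   # per-topic covariances Sigma_k

def sample_latent_vector():
    """Draw a topic z ~ Multinomial(pi), then a vector from N(mu_z, Sigma_z), cf. Eq. (2)."""
    z = rng.choice(T, p=pi)
    return z, rng.multivariate_normal(mu[z], Sigma[z])

z, vec_w = sample_latent_vector()             # e.g. a word vector vec(w) and its topic z(w)
```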


3.1.3. Word prediction

Figure 1: Graphical representation of TDM with the corresponding latent variables (z(a, d), z(w_i), vec(a, d), vec(w_i), µ_d, Σ_d, and the coefficients U). Here, t = 1, ..., m.

The graphical representation of the generative process of TDM is shown in Figure 1. Since our model is a generative model, we now describe the procedure by which the corpus is generated. Given the Gaussian mixture model Ψ, the generative process is as follows: for each word w in the vocabulary, we sample its topic z(w) from the multinomial distribution π := (π_1, π_2, ..., π_T) and sample its vector representation vec(w) from the distribution N(µ_{z(w)}, Σ_{z(w)}). Similarly, for each author a and each of his/her documents d, we sample a topic z(a, d) from π and sample the corresponding vector representation vec(a, d), also from the Gaussian mixture model. Let V be the collection of all latent vectors for words and author interests:

V := {vec(w)} ∪ {vec(a, d)}

For the i-th word slot in document d written by author a, the word realization is generated according to the author interest vector vec(a, d) and the word vectors of the n previous words of w_i in the document. Formally, the probability distribution of w_i is defined by:

p(w_i = w | vec(a, d), vec(w_{i−n}), ..., vec(w_{i−1})) = exp(δ_a^w + Σ_{t=1}^{n} γ^{t−1} δ_t^w) / Σ_{w'=1}^{W} exp(δ_a^{w'} + Σ_{t=1}^{n} γ^{t−1} δ_t^{w'})    (3)

where δ_a^w and δ_t^w are linear functions of the author's interest vector and of the previous n word vectors, and γ is a discount factor that determines the importance of the context words. Here, γ is a real-valued number between 0 and 1; for γ = 1, the context words are treated equally. The denominator Σ_{w'=1}^{W} exp(δ_a^{w'} + Σ_{t=1}^{n} γ^{t−1} δ_t^{w'}) is a normalization term over all possible words w' in the vocabulary 1, ..., W. The linear functions δ_a^w and δ_t^w are defined by

δ_a^w = ⟨u_a^w, vec(a, d)⟩    (4)

δ_t^w = ⟨u_t^w, vec(w_{i−t})⟩    (5)

where u_a^w, u_t^w ∈ R^p are transformation coefficients. These coefficients are unknown parameters of the model that need to be learned during the training procedure. They are shared across all word slots in the corpus. We use U to denote the collection of these coefficients.
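The following sketch illustrates the word-prediction step of Eq. (3) for a single word slot. It is our own simplified illustration; the array names and toy sizes are assumptions, not part of the model definition.

```python
import numpy as np

def word_distribution(U_a, U_ctx, interest_vec, context_vecs, gamma=0.8):
    """Probability over the vocabulary for one word slot, following Eqs. (3)-(5).

    U_a          : (W, p) coefficients u_a^w, one row per vocabulary word
    U_ctx        : (n, W, p) coefficients u_t^w for each context position t
    interest_vec : (p,) author interest vector vec(a, d)
    context_vecs : list of n (p,) vectors of the previous words, most recent first
    gamma        : discount factor in (0, 1]
    """
    scores = U_a @ interest_vec                                 # delta_a^w for every word w
    for t, ctx in enumerate(context_vecs, start=1):
        scores += gamma ** (t - 1) * (U_ctx[t - 1] @ ctx)       # gamma^(t-1) * delta_t^w
    scores -= scores.max()                                      # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()                                  # softmax over the vocabulary

# Toy example: vocabulary of 5 words, p = 3, and 2 context words.
rng = np.random.default_rng(1)
p_w = word_distribution(rng.normal(size=(5, 3)), rng.normal(size=(2, 5, 3)),
                        rng.normal(size=3), [rng.normal(size=3), rng.normal(size=3)])
```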

We summarize the set of unknown parameters to be learned: the set V containing the vectors of author interests and words, the set U containing the unknown linear transformation coefficients, and the set Ψ containing the parameters that define the Gaussian mixture model. We will see that, given any two of these sets, the remaining set can be estimated by efficient algorithms.

3.2. Topic Inference

Given the model parameters U, Ψ and the vectors V, we can infer the posterior probability distribution of topics. In particular, for a word w with vector representation vec(w), the posterior distribution of its topic, namely q(z(w)), is easy to derive from the generative model. For any topic z ∈ {1, 2, ..., T}, we have

q(z(w) = z) = π_z N(vec(w) | µ_z, Σ_z) / Σ_{k=1}^{T} π_k N(vec(w) | µ_k, Σ_k).    (6)

Similarly, we can use the same formula to infer the topics associated with the author interest vectors, i.e.,

q(z(a, d) = z) = π_z N(vec(a, d) | µ_z, Σ_z) / Σ_{k=1}^{T} π_k N(vec(a, d) | µ_k, Σ_k).    (7)

The term q(z(a, d) = z) reflects the proportion of author a's interest in topic z when she writes document d.
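A minimal sketch of this posterior computation (Eqs. (6) and (7)), applicable to either a word vector or an author interest vector; the toy sizes are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def topic_posterior(v, pi, mu, Sigma):
    """Posterior q(z = k | v) over topics for a latent vector v.

    pi    : (T,) mixture weights
    mu    : (T, p) component means
    Sigma : (T, p, p) component covariance matrices
    """
    weighted = np.array([pi[k] * multivariate_normal.pdf(v, mu[k], Sigma[k])
                         for k in range(len(pi))])
    return weighted / weighted.sum()

# Toy example with T = 3 topics and p = 2 dimensions.
rng = np.random.default_rng(0)
q = topic_posterior(rng.normal(size=2), np.array([0.2, 0.3, 0.5]),
                    rng.normal(size=(3, 2)), np.stack([np.eye(2)] * 3))
```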

3.3. Estimating model parameters

We need to estimate the model parameters Ψ and U and the latent vectors V. The parameter estimation consists of three stages. In Stage I, we maximize the likelihood with respect to Ψ; this can be implemented with Expectation Maximization (EM). In Stage II, we maximize the likelihood with respect to U; this is a standard logistic regression problem solvable by efficient optimization algorithms. In Stage III, we maximize the posterior [21] with respect to V; we show that this is also a convex optimization problem that can be solved efficiently. Stages I, II, and III are executed alternately until the parameters converge.

3.3.1. Stage I: estimating Gaussian mixture components Ψ

In this stage, we estimate the parameters of the Gaussian mixture model Ψ = {π_k, µ_k, Σ_k} given the latent vectors of words. This is a classical non-convex estimation problem. The EM algorithm converges to a locally optimal solution, which is often good enough in practice. See [22] for the implementation of the EM algorithm.
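As an illustration of Stage I, the sketch below fits a Gaussian mixture to the current word vectors with scikit-learn's EM implementation. The toy data, component count, and covariance type are illustrative assumptions and do not reproduce our exact training setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for the current latent vectors of the vocabulary words (one row per word).
V_words = np.random.default_rng(0).normal(size=(1000, 100))

T = 20                                              # toy number of topics
gmm = GaussianMixture(n_components=T, covariance_type='diag', max_iter=100)
gmm.fit(V_words)                                    # EM estimation of Psi = {pi_k, mu_k, Sigma_k}

pi_k, mu_k, Sigma_k = gmm.weights_, gmm.means_, gmm.covariances_
responsibilities = gmm.predict_proba(V_words)       # q(z(w) = k), cf. Eq. (6)
```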

3.3.2. Stage II: estimating coefficients U

In this stage, the collection of latent vectors V is given. We estimate the linear transformation coefficients U such that the likelihood of the prediction in Eq. (3) is maximized. This is a classical multi-class logistic regression problem that can be solved with stochastic gradient descent, similar to the approach in [23].
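The sketch below shows one stochastic gradient step for such a multi-class softmax regression. For brevity it lumps the author term and the discounted context terms into a single feature vector x, which is a simplification of the parameterization in Eqs. (4) and (5); the names and toy sizes are assumptions.

```python
import numpy as np

def sgd_step(U, x, target_word, lr=0.05):
    """One SGD update for multi-class logistic regression (softmax) over the vocabulary.

    U           : (W, d) coefficient matrix, one row per vocabulary word
    x           : (d,) feature vector built from vec(a, d) and the discounted context vectors
    target_word : index of the observed word w_i
    """
    scores = U @ x
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    grad = np.outer(probs, x)                     # gradient of the negative log-likelihood
    grad[target_word] -= x                        # subtract x for the observed class
    return U - lr * grad

# Toy usage: vocabulary of 5 words, feature dimension 4.
rng = np.random.default_rng(0)
U = sgd_step(rng.normal(size=(5, 4)), rng.normal(size=4), target_word=2)
```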

3.3.3. Stage III: estimating latent vectors V

In this stage, the Gaussian mixture components Ψ and the linear transformation coefficients U are given. We estimate the latent vectors V by maximizing their posterior probability. Recall that the posterior is proportional to the product of the prior and the likelihood; thus, we want to minimize the sum of the negative log-prior and the negative log-likelihood. The log-likelihood, the natural logarithm of the likelihood function, is more convenient to work with. The negative log-likelihood is determined by Eq. (3) and takes the form

ℓ(V) = − Σ_w [ (δ_a^w + Σ_{t=1}^{n} γ^{t−1} δ_t^w) − log Σ_{w'=1}^{W} exp(δ_a^{w'} + Σ_{t=1}^{n} γ^{t−1} δ_t^{w'}) ]    (8)

where w ranges over each word slot in the corpus and w' ranges over each word in the vocabulary. The function ℓ(V) is convex with respect to V.

It remains to consider the negative log-prior of the vectors V. Definition (1) implies that the negative log-prior of the author interests is equal to

− log(p(vec(a, d))) = (1/2) (vec(a, d) − µ_d)^T Σ_d^{−1} (vec(a, d) − µ_d) + C    (9)

where C is a constant independent of vec(a, d). Recall that µ_d = vec(a, d'), the interest vector of the authors when they wrote the previous document. Thus, the term is a quadratic function of vec(a, d) and vec(a, d'). Since Σ_d^{−1} is positive semi-definite, this log-prior function is also convex.

For words, the negative log-priors are defined by the Gaussian mixture model. Taking word w as an example, the negative log-prior of vec(w) is equal to

− log(p(vec(w))) = − log Σ_{k=1}^{T} π_k exp( −(1/2) (vec(w) − µ_k)^T Σ_k^{−1} (vec(w) − µ_k) ) + C    (10)

Unfortunately, equation (10) is not a convex function of vec(w). To approximate it by a convex function, we randomly sample the topic of word w

AN US

from the multinomial distribution (π1 , . . . , πT ). We denote z as the randomly sampled topic. Conditional on the topic z, the negative log-prior is equal to

− log(p(vec(w) | z)) = (1/2) (vec(w) − µ_z)^T Σ_z^{−1} (vec(w) − µ_z) + C    (11)

which is convex with respect to vec(w). Since the topic is randomly sampled, we take the expectation with respect to z and use

E[− log(p(vec(w) | z))] = Σ_{k=1}^{T} (π_k / 2) (vec(w) − µ_k)^T Σ_k^{−1} (vec(w) − µ_k) + C    (12)

as the log-prior for the word. We adopt the same approximation for the authors' interest vectors, so that all log-priors are convex. Thus, the objective function of Stage III is convex. We employ the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method [24] to solve for V. L-BFGS is an efficient optimization algorithm in the family of quasi-Newton methods; it approximates the BFGS algorithm using a limited amount of memory and has been widely used in large-scale optimization problems.
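As an illustration of Stage III, the sketch below minimizes the convex prior surrogate of Eq. (12) for a single latent vector with L-BFGS via SciPy. The full objective also contains the negative log-likelihood term ℓ(V) of Eq. (8), which is omitted here for brevity; the toy parameters and diagonal covariances are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T, p = 5, 10
pi = np.full(T, 1.0 / T)                  # mixture weights pi_k
mu = rng.normal(size=(T, p))              # component means mu_k
prec = np.ones((T, p))                    # diagonal precisions Sigma_k^{-1}

def neg_log_prior(v):
    """Convex surrogate of Eq. (12): sum_k (pi_k / 2) (v - mu_k)^T Sigma_k^{-1} (v - mu_k)."""
    diff = v - mu                                     # shape (T, p)
    return 0.5 * np.sum(pi[:, None] * prec * diff ** 2)

def grad(v):
    diff = v - mu
    return np.sum(pi[:, None] * prec * diff, axis=0)

res = minimize(neg_log_prior, x0=np.zeros(p), jac=grad, method='L-BFGS-B')
vec_w = res.x                             # updated latent vector for one word
```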

3.4. Authorship Prediction

Given the trained model, assume that a new document d_new was written by an author a_new at time t_new. Since the author is unknown, we treat him/her as a "new author". We then use the model to infer the interest vector of the new document, vec(a_new, d_new). Finally, we calculate the similarity between each candidate author's interest vector and the interest vector of the new document, and the most similar candidate author is returned as the writer of the document.

Since each author's interests are represented by a sequence of vectors {vec(a, d_i) | i = 1, ..., D_a}, we use the Nadaraya-Watson kernel regression algorithm [25] to estimate each author's interest vector at time t_new as a locally weighted average:

vec~(a, d_new) = Σ_{i=1}^{D_a} K(t_new, t_i) · vec(a, d_i) / Σ_{i=1}^{D_a} K(t_new, t_i)    (13)

where D_a is the total number of documents written by author a, t_i is the time at which a wrote document d_i, and K(·, ·) can be the RBF kernel:

K(t_new, t_i) = exp( −||t_new − t_i|| / (2σ^2) )    (14)


Finally, we measure the similarity between vec(a_new, d_new) and vec~(a, d_new) using a Euclidean-distance-based similarity. The Euclidean distance between the two vectors is defined as

DIS(vec(a_new, d_new), vec~(a, d_new)) = ||vec(a_new, d_new) − vec~(a, d_new)|| = sqrt( (vec(a_new, d_new) − vec~(a, d_new))^T (vec(a_new, d_new) − vec~(a, d_new)) )    (15)

We then convert the Euclidean distance into a similarity measure:

SIM(vec(a_new, d_new), vec~(a, d_new)) = 1 / (1 + DIS(vec(a_new, d_new), vec~(a, d_new)))    (16)


We assign the given text to the candidate author whose interest vector has the highest similarity with that of the "new" author of the given text.
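The attribution rule of Eqs. (13)-(16) can be sketched as follows; the author names, toy vectors, and the σ value are placeholders rather than values used in our experiments.

```python
import numpy as np

def predict_author(vec_new, t_new, author_history, sigma=1.0):
    """Return the candidate author most similar to the 'new' author, following Eqs. (13)-(16).

    vec_new        : (p,) inferred interest vector of the anonymous document
    t_new          : timestamp of the anonymous document
    author_history : dict mapping author -> list of (t_i, vec(a, d_i)) pairs
    """
    best_author, best_sim = None, -1.0
    for author, history in author_history.items():
        times = np.array([t for t, _ in history])
        vecs = np.array([v for _, v in history])
        weights = np.exp(-np.abs(t_new - times) / (2 * sigma ** 2))   # RBF kernel, Eq. (14)
        vec_a = weights @ vecs / weights.sum()                        # Nadaraya-Watson, Eq. (13)
        sim = 1.0 / (1.0 + np.linalg.norm(vec_new - vec_a))           # Eqs. (15)-(16)
        if sim > best_sim:
            best_author, best_sim = author, sim
    return best_author, best_sim

# Toy usage with two candidate authors and 3-dimensional interest vectors.
rng = np.random.default_rng(0)
history = {"alice": [(1.0, rng.normal(size=3)), (2.0, rng.normal(size=3))],
           "bob":   [(1.5, rng.normal(size=3))]}
author, sim = predict_author(rng.normal(size=3), t_new=3.0, author_history=history)
```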


4. Experiments


In this section, we test the proposed model on three publicly available datasets and compare our model with the state-of-the-art approaches.


4.1. Datasets

We use the PAN'11 emails data, the IMDb62 movie reviews data, and the Blog data in our experiments. Data preprocessing is performed before training the models. We first divide the text into sentences according to the delimiters as in [26]. To remove sensitivity to capitalization, all text is lowercased except for words containing two or more capital letters. The detailed properties of the datasets are described below.

CR IP T

PAN’11 emails (PAN’11): This corpus was developed based on the Enron emails corpus1 , to account for several different common authorship attribution and verification scenarios [27]. There are two training sets provided for

authorship attribution, a “Large” set containing 9337 documents by 72 different authors and a “Small” set containing 3001 documents by 26 different

AN US

authors (the author sets are disjoint). Two test sets are provided for each of the “Large” and “Small” training datasets. One test dataset contains texts

only written by the authors in the training set, and the other test dataset also contains texts written by around 20 authors who are out of the training data. In this paper, for our authorship attribution task, we train all the

M

methods on the “Large” training set, and run the tuned methods on the corresponding testing set that only contains authors in the training set.

ED

IMDb62 movie reviews (IMDb62): This corpus was first introduced in [28], containing 62,000 movie reviews by 62 prolific users of the Internet

PT

Movie database (IMDb)2 . Each reviewer writes 1000 reviews. This dataset allows us to test our model in a setting where all the texts have similar

CE

themes.

Blog: This corpus is the largest dataset that is widely used for authorship attribution, containing 678,161 blog posts by 19,320 authors from blogger.com

AC

in August 2004 [29]. In contrast to movie reviews, blog posts can be about any topic. 1 2

http://www.cs.cmu.edu/˜enron/ www.imdb.com


For the PAN'11 dataset, we use 90% of the training data for training and 10% for validation. For the IMDb62 and Blog datasets, we use 80% of each author's documents for training, 10% for validation, and the remaining documents for testing.

4.2. Baseline Methods

In this paper, we evaluate and compare our approach with several baseline methods, which we describe below:

SVM: Support Vector Machines (SVM) are a widely used baseline for authorship attribution. We implement SVM with LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) and use the same experimental settings as in [30].

LDA + Hellinger (LDA-H): This model, proposed by [19], employs LDA for authorship attribution and yields state-of-the-art performance for both small and large numbers of candidate authors.

DADT: This model combines LDA and the author-topic model by representing authors and documents over two disjoint topic sets [20]. DADT shows better results than other models.

Time-aware Feature Sampling (TFS): This approach is inspired by time-aware language models and takes the dynamicity of authors' word usage into account [3].

NNLM: [31] investigates the performance of a feed-forward neural network language model (NNLM) on an authorship attribution problem with a moderate author set size and relatively limited data.


4.3. Implementation details

In training, the learning rate is initialized to 0.05 and progressively reduced to 0.0001. Specifically, after each epoch we update the learning rate as LearningRate := LearningRate / (1 + decay), with the initial LearningRate set to 0.05 and the decay set to 0.1. For each word, the previous six words in the same sentence are included as context. For easy comparison with other models, the vector size p is set to 100; increasing the vector size may further improve the quality of the topics.
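A small sketch of this learning-rate schedule (the epoch count and the way the 0.0001 floor is enforced are assumptions):

```python
# Per-epoch schedule from above: lr <- lr / (1 + decay), floored at 0.0001.
learning_rate, decay, lr_floor = 0.05, 0.1, 0.0001
for epoch in range(200):                      # epoch count is an arbitrary assumption
    # ... run one pass over the training data using learning_rate ...
    learning_rate = max(learning_rate / (1 + decay), lr_floor)
```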

Documents are split into sentences and words using the NLTK toolkit (http://www.nltk.org/). The Gaussian mixture model is learned using the variational inference algorithm in the scikit-learn toolkit (http://scikit-learn.org/). To perform comparable experiments with a restricted vocabulary, words outside the vocabulary are replaced by a special token and are not counted in the word perplexity calculation.

4.4. Experiment results

4.4.1. Authorship attribution results

We first vary the number of latent topics to see how the performance changes, presenting results with 5, 10, 20, 40, 60, 80, 100, 150, and 200 topics. Figure 2 shows the accuracy as a function of the number of latent topics. The best accuracy of our model is consistently and clearly better than that of SVM, LDA-H, DADT, TFS, and NNLM on all three datasets.

http://www.nltk.org/ http://scikit-learn.org/

21

ACCEPTED MANUSCRIPT

IMDb62 Dataset

PAN'11 Dataset

95

55

90

50

Accuracy (%)

35

SVM LDA-H DADT TFS NNLM TDM

30 25 10

20

70 65 60 55

40 60 80 100 150 200 Number of Topics

(a) Results on the PAN11 data set.

75

AN US

Accuracy (%)

80

40

20 5

CR IP T

85

45

50 5

10

20

SVM LDA-H DADT TFS NNLM TDM

40 60 80 100 150 200 Number of Topics

(b) Results on the IMDb62 data set.

Blog Dataset

35

M

30

20

ED

Accuracy (%)

25

15

5

0 5

10

20

40 60 80 Number of Topics

100

150

(c) Results on the blog data set.

AC

CE

PT

10

Figure 2: Accuracy as a function of the number of topics on the PAN’11, IMDb62 and Blog datasets.

More specifically, on the PAN'11 dataset, the best accuracies of DADT and TFS are 51.6% and 51.2% respectively, slightly higher than those of SVM, LDA-H, and NNLM; our model further improves the accuracy to 53.9%. Similarly, on the IMDb62 dataset, the best accuracy increases from 92.1% (NNLM) to 93.8% (our model). However, on the Blog

dataset, the LDA-H model is clearly worse than the other models by a large margin. This may be because the LDA model neglects the authorship and timestamp information and thus cannot capture the differences between authors when thousands of candidate authors are considered. Our model outperforms the other models because it takes both the dynamics of authors and the ordering of words into account. We also observe that all algorithms produce higher accuracies on the PAN'11 and IMDb62 datasets than on the Blog dataset, since the Blog dataset contains many more authors and is harder to model.

When the number of topics is small, i.e., K ≤ 40 for the PAN'11 dataset, K ≤ 60 for the IMDb62 dataset, and K ≤ 40 for the Blog dataset, TDM produces very low accuracies, indicating that it does not fit the data well with few topics. As the number of topics increases, the performance of TDM improves quickly. However, beyond some point, e.g., K = 150, increasing the number of topics no longer significantly improves the result of our model. In the following experiments, we choose K = 150 for all three datasets, since 150 topics are more than enough to achieve the best result of the model.

4.4.2. Time complexity analysis

We analyze the time complexity of our model step by step. Stage I (Section 3.3.1) uses the EM algorithm to estimate the parameters of the GMM and requires O(D · K^2) time, where D is the number of documents in the corpus and K is the number of topics.

Table 1: Comparison of average training time per epoch (minutes).

Data      SVM    LDA-H   DADT    TFS     NNLM    TDM
PAN'11    1.03   2.79    4.07    2.29    2.45    3.12
IMDb62    2.66   7.33    19.72   11.45   10.42   12.38
Blog      9.22   28.48   44.33   27.22   30.13   35.26

Stage II and Stage III (Sections 3.3.2 and 3.3.3) estimate the word and author interest vectors and the corresponding transformation coefficients, and require O(D · (n·p + p·W)) time, where n is the number of context words used, p is the dimension of the word and author interest vectors, and W is the size of the vocabulary.

We also evaluate the efficiency of our model by measuring the average running time per epoch during training (using a single CPU). The results are shown in Table 1. On the PAN'11 dataset, the average running time of one epoch of our model is 2.45 minutes, which is competitive with or better than LDA-H, DADT, TFS, and NNLM. SVM is faster than our model, but its performance is vastly inferior. Similar results are observed on the IMDb62 and Blog datasets.

4.4.3. Authorship verification results

Our model not only works for authorship attribution but also performs well on authorship verification, i.e., determining whether or not a specific author wrote an anonymous text. Instead of assigning the anonymous text to the author whose interest vector has the highest similarity with that of the anonymous text, we assign the anonymous text to that author only if the similarity between the anonymous text and the predicted author is higher than a predefined threshold ρ (ρ ∈ [0, 1]); otherwise no authorship decision is made for the text (the output is Don't Know). By changing the value of the threshold, we obtain different trade-offs between precision and recall.
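A minimal sketch of this decision rule, taking the attributed author and its similarity score from the attribution step of Section 3.4; the example names and scores are placeholders.

```python
def verify_author(best_author, best_similarity, rho=0.7):
    """Authorship verification: accept the attributed author only if the similarity
    returned by the attribution step (Section 3.4) reaches the threshold rho."""
    return best_author if best_similarity >= rho else "Don't Know"

# Example: an attribution step returned ("alice", 0.81); with rho = 0.7 we accept it.
print(verify_author("alice", 0.81))           # -> alice
print(verify_author("alice", 0.42))           # -> Don't Know
```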

Table 2: Precision, Recall and F1-score with different ρ.

ρ           1.0     0.9     0.8     0.7     0.6     0.5
Precision   0.923   0.825   0.772   0.632   0.416   0.242
Recall      0.116   0.254   0.421   0.480   0.566   0.763
F1          0.206   0.388   0.545   0.546   0.480   0.367

To test the authorship verification method, we select the "Large" PAN'11 dataset as the benchmark. The test set contains emails written by 20 authors who are outside the training set. The results in terms of precision, recall, and F1-score are shown in Table 2. The best F1-score (F1 = 0.546, precision = 0.632, recall = 0.480) is achieved with ρ = 0.7.


5. Conclusion and Future Work

In this paper, we have proposed a Topic Drift Model (TDM) for authorship attribution. In contrast to existing work, TDM is the first model that explicitly characterizes the drift of interests and writing styles for individual authors. In addition, TDM takes the ordering of words into account. We conducted extensive experiments on three widely used, publicly available datasets. The experimental results showed that our model achieves better performance than the baseline methods. We also demonstrated the potential of our model to address the authorship verification problem.

In the future, we plan to incorporate deep learning algorithms to further capture authors' writing styles. For example, we may use a convolutional neural network to learn n-gram features of texts.


Acknowledgement

The work is supported by the Science and Technology Planning Key Project of Guangdong Province, China (2015B010109003, 2016A030303055), and by NSFC under Grant No. 61305059.

References

[1] I. Lancashire, G. Hirst, Vocabulary changes in agatha christies mysteries

AN US

as an indication of dementia: a case study, in: 19th Annual Rotman

Research Institute Conference, Cognitive Aging: Research and Practice, 2009, pp. 8–10.

[2] M. van Dam, C. Hauff, Large-scale author verification: Temporal and

M

topical influences, in: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, ACM,

ED

2014, pp. 1039–1042.

[3] H. Azarbonyad, M. Dehghani, M. Marx, J. Kamps, Time-aware au-

PT

thorship attribution for short text streams, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in

CE

Information Retrieval, ACM, 2015, pp. 727–730.

[4] M. Yang, T. Cui, W. Tu, Ordering-sensitive and semantic-aware topic

AC

modeling, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[5] E. Stamatatos, N. Fakotakis, G. Kokkinakis, Automatic text categoriza-

26


tion in terms of genre and author, Computational linguistics 26 (2000) 471–495.

CR IP T

[6] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for information Science and Technology 60 (2009) 538–556.

[7] J. F. Burrows, An ocean where each kind: Statistical analysis and some

AN US

major determinants of literary style, in: Computers and the Humanities, volume 23, Springer, 1989, pp. 309–321.

[8] F. Mosteller, D. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, 1964.

M

[9] F. Peng, D. Schuurmans, S. Wang, Augmenting naive Bayes classifiers with statistical language models, in: Information Retrieval, volume 7,

ED

Springer, 2004, pp. 317–345.

[10] K. Luyckx, W. Daelemans, Authorship attribution and verification with

PT

many authors and limited data, in: Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, Association

CE

for Computational Linguistics, 2008, pp. 513–520. [11] D. Madigan, A. Genkin, D. D. Lewis, S. Argamon, D. Fradkin, L. Ye,

AC

Author identification on the large scale, in: Proceedings of the Meeting of the Classification Society of North America, 2005, pp. 1–20.

[12] M. Koppel, J. Schler, S. Argamon, Authorship attribution in the wild, Language Resources and Evaluation 45 (2011) 83–94. 27


[13] H. Zhang, T. W. Chow, Q. J. Wu, Organizing books and authors by multilayer som, IEEE transactions on neural networks and learning

CR IP T

systems 27 (2016) 2537–2550. [14] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, the Journal of machine Learning research 3 (2003) 993–1022.

[15] M. Rosen-Zvi, T. Griffiths, M. Steyvers, P. Smyth, The author-topic

model for authors and documents, in: Proceedings of the 20th conference

AN US

on Uncertainty in artificial intelligence, AUAI Press, 2004, pp. 487–494.

[16] H. Zhang, J. K. Ho, Q. J. Wu, Y. Ye, Multidimensional latent semantic analysis using term spatial information, IEEE transactions on cybernetics 43 (2013) 1625–1640.

M

[17] H. Zhang, Y. Ji, J. Li, Y. Ye, A triple wing harmonium model for movie recommendation, IEEE Transactions on Industrial Informatics

ED

12 (2016) 231–239.

PT

[18] J. Song, The author-topic model and the author prediction (2009). [19] Y. Seroussi, I. Zukerman, F. Bohnert, Authorship attribution with la-

CE

tent dirichlet allocation, in: Proceedings of the fifteenth conference on computational natural language learning, Association for Computa-

AC

tional Linguistics, 2011, pp. 181–189.

[20] Y. Seroussi, F. Bohnert, I. Zukerman, Authorship attribution with author-aware topic models, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, Association for Computational Linguistics, 2012, pp. 264–269. 28


[21] J.-L. Gauvain, C.-H. Lee, Maximum a posteriori estimation for multivariate gaussian mixture observations of markov chains, IEEE transac-

CR IP T

tions on speech and audio processing 2 (1994) 291–298. [22] C. M. Bishop, Pattern recognition and machine learning, volume 1, springer New York, 2006.

[23] Q. V. Le, T. Mikolov, Distributed representations of sentences and

AN US

documents, arXiv preprint arXiv:1405.4053 (2014).

[24] D. C. Liu, J. Nocedal, On the limited memory bfgs method for large scale optimization, Mathematical programming 45 (1989) 503–528. [25] E. A. Nadaraya, On estimating regression, Theory of Probability & Its

M

Applications 9 (1964) 141–142.

[26] A. Gruber, Y. Weiss, M. Rosen-Zvi, Hidden topic markov models, in:

163–170.

ED

International conference on artificial intelligence and statistics, 2007, pp.

PT

[27] S. Argamon, P. Juola,

Overview of the international authorship

identification competition at pan-2011.,

in: CLEF (Notebook Pa-

CE

pers/Labs/Workshop), 2011.

AC

[28] Y. Seroussi, I. Zukerman, F. Bohnert, Collaborative inference of sentiments from texts, in: User Modeling, Adaptation, and Personalization, Springer, 2010, pp. 195–206.

[29] J. Schler, M. Koppel, S. Argamon, J. W. Pennebaker, Effects of age

29


and gender on blogging., in: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, volume 6, 2006, pp. 199–205.

CR IP T

[30] M. Koppel, J. Schler, S. Argamon, Computational methods in authorship attribution, Journal of the American Society for information Science and Technology 60 (2009) 9–26.

[31] Z. Ge, Y. Sun, M. J. Smith, Authorship attribution using a neural

AC

CE

PT

ED

M

AN US

network language model, arXiv preprint arXiv:1602.05292 (2016).


Min Yang is currently a PhD student in the Department of Computer Science, The University of Hong Kong, Hong Kong. She received the B.S. degree in software engineering from Sichuan University, China, in 2012. Her current research interests include machine learning, data mining, and natural language processing.

Xiaojun Chen received his Ph.D. degree from Harbin Institute of Technology in 2011. He is now an assistant professor at the College of Computer Science and Software, Shenzhen University. His research interests include subspace clustering, topic models, and massive data mining.

Wenting Tu is currently a PhD candidate at the Department of Computer Science, The University of Hong Kong. Her research interests include data mining, text mining, and machine learning.

Ziyu Lu received the PhD degree from the Department of Computer Science, The University of Hong Kong. Her research interests include recommendation systems, natural language processing, and machine learning.

Jia Zhu is currently an associate professor in the School of Computer Science at South China Normal University, after finishing his postdoctoral fellowship at United Nations University. Prior to that, he received his Ph.D. degree from the University of Queensland in 2013, and his B.S. and M.S. degrees from Bond University, Australia, in 2004 and 2006 respectively. His research interests are machine learning and information retrieval. He has published several papers in top conferences and journals, such as Information Sciences and WWW.

Qiang Qu received the MSc degree in computer science from Peking University and the Ph.D. degree from Aarhus University, supported by the GEOCrowd project under the Marie Skłodowska-Curie Actions. He is now an associate professor with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. His current research interests include large-scale data management and mining.
