Journal of Computer and System Sciences (www.elsevier.com/locate/jcss)
Authorship verification of e-mail and tweet messages applied for continuous authentication ✩

Marcelo Luiz Brocardo a,*, Issa Traore a,**, Isaac Woungang b

a Department of Electrical and Computer Engineering, University of Victoria, Victoria, British Columbia, V8W 3P6, Canada
b Department of Computer Science, Ryerson University, Toronto, Ontario, M5B 2K3, Canada
Article history: Received 1 July 2014; received in revised form 14 December 2014; accepted 14 December 2014.

Keywords: Continuous authentication; Stylometry; Short message verification; n-gram features; Unbalanced dataset; SVM classifier
Abstract

Authorship verification using stylometry consists of identifying a user based on his writing style. In this paper, authorship verification is applied to continuous authentication using unstructured online text-based entry. An online document is decomposed into consecutive blocks of short text over which (continuous) authentication decisions happen, discriminating between legitimate and impostor behaviors. We investigate blocks of 140, 280, and 500 characters. The feature set includes traditional features such as lexical, syntactic, and application-specific features, as well as new features extracted from n-gram analysis. Furthermore, the proposed approach includes a strategy to circumvent issues related to unbalanced datasets, uses Information Gain and Mutual Information as a feature selection strategy, and uses a Support Vector Machine (SVM) for classification. Experimental evaluation of the proposed approach based on the Enron e-mail and Twitter corpora yields very promising results, with an Equal Error Rate (EER) varying from 9.98% to 21.45% for different block sizes.
1. Introduction

Continuous authentication (CA) consists of re-authenticating a user repeatedly and unobtrusively throughout an authenticated session as data becomes available. CA is considered a remedy against session hijacking, in which an intruder seizes control of a legitimate user session [1]. Ideally, CA should be conducted unobtrusively, by enabling transparent user identity data collection and verification. Stylometric analysis, which consists of the identification of a user based on his writing style [2,3], could potentially be used for CA. Stylometry has so far been studied primarily for the purpose of forensic authorship analysis, generating a significant amount of interest over the years and leading to a rich body of research literature [4–7]. Three different kinds of authorship analysis using stylometry have been studied in the literature: authorship attribution, authorship characterization, and authorship verification. The focus of this paper is on authorship verification, which is the most relevant to continuous authentication. Authorship verification consists of checking whether or not a target document was written by a specific (i.e., known) individual (with some claimed identity).
✩ This paper is the journal version of the conference paper by Marcelo Luiz Brocardo, Issa Traore, and Isaac Woungang [23].
* Corresponding author.
** Principal corresponding author.
E-mail addresses: [email protected] (M.L. Brocardo), [email protected] (I. Traore), [email protected] (I. Woungang).
http://dx.doi.org/10.1016/j.jcss.2014.12.019
In this setting, CA is performed by comparing a writing sample from an individual against the model or profile associated with the identity claimed by that individual at login time (i.e., 1-to-1 identity matching). CA involves several challenges, including the need for low authentication delay, high accuracy, and the ability to withstand forgery. This paper focuses on the first two challenges. Low authentication delay is simulated by analyzing short texts. Reducing the text size and the verification error rates at the same time is difficult because these objectives conflict: a smaller verification block tends to increase the verification error rates, and vice versa. In this work, a set of new features is derived from n-gram analysis, and two classification schemes are studied: a Support Vector Machine (SVM) classifier and a hybrid SVM with Logistic Regression (LR) classifier. Authorship verification is investigated as a two-class problem. Furthermore, a weighting strategy for unbalanced datasets, different SVM kernels, and different feature selection strategies are tested. The proposed approach is evaluated experimentally by computing the following performance metrics:
• False Acceptance Rate (FAR): measures the likelihood that the system will falsely recognize an impostor as a genuine person;
• False Rejection Rate (FRR): measures the likelihood that the system will fail to recognize a genuine person;
• Equal Error Rate (EER): corresponds to the operating point where FAR and FRR have the same value.

Different block sizes (140, 280, and 500 characters) are tested on the Enron and Twitter datasets, yielding EERs ranging from 9.98% to 21.45% for the different block sizes. The results are very encouraging considering the existing works on authorship verification using stylometry.

The remainder of the paper is organized as follows. Section 2 summarizes and discusses related work. Section 3 presents an outline of the proposed approach. Section 4 presents the experimental evaluation of the proposed approach and the obtained results. Section 5 discusses the strengths and shortcomings of the proposed approach. Section 6 makes some concluding remarks and discusses future work.

2. Related work

As mentioned above, research on authorship analysis covers three different areas: authorship identification, authorship characterization, and authorship verification. A considerable number of studies have been conducted on authorship identification and characterization. For instance, previous studies on authorship identification investigated ways to identify patterns of terrorist communications [8], the author of a particular e-mail for computer forensic purposes [9–11], as well as how to collect digital evidence for investigations [12] or solve a disputed literary, historical [13], or musical authorship [14–16]. Work on authorship characterization has targeted primarily gender attribution [17–19] and the classification of the author's education level [20]. However, there are few papers on authorship verification outside the framework of plagiarism detection [6], and most of them focus on general text documents. In addition, the performance of authorship verification for online documents is affected by the text size, the number of candidate authors, the size of the training set, and the fact that these documents are in general quite poorly structured or written (as opposed to literary works).

Among the few studies available on authorship verification are works by Koppel et al. [6], Iqbal et al. [10], Canales et al. [5], and Chen and Hao [21]. Koppel et al. used an SVM with a linear kernel and addressed authorship verification as a one-class classification problem, ignoring negative samples. The corpus used in their study was composed of 21 English books written by 10 different authors. They divided the text into approximately equal sections of 500 words, preserving the paragraphs. The feature set was composed of the 250 most frequent words. They introduced a technique named "unmasking" that quantifies the dissimilarity between the sample document produced by the suspect and those of other users (i.e., impostors). Although the overall accuracy was 95.7%, they concluded that the use of negative examples could improve the results.

Iqbal et al. experimented with two different approaches [10]. The first approach conducts verification using classification; three different classifiers are investigated, namely Adaboost.M1, Bayesian Network, and Discriminative Multinomial Naive Bayes (DMNB). The second approach conducts verification by regression; three different classifiers were studied, including linear regression, SVM with Sequential Minimal Optimization (SMO), and SVM with an RBF kernel.
The feature set was composed of 292 attributes, which included lexical (collected in terms of either characters or words), syntactic (punctuation and function words), idiosyncratic (spelling and grammatical mistakes), and content-specific (keywords commonly found in a specific domain) features. Experimental evaluation of the proposed approach, using the Enron e-mail corpus and analyzing 200 e-mails per author, yielded EERs ranging from 17.1% to 22.4%.

Canales et al. trained a K-Nearest Neighbor (KNN) classifier with 82 stylistic features, including 49 character-based, 13 word-based, and 20 syntactic features [5]. In addition, they combined stylometry and keystroke dynamics analysis for the purpose of authenticating online test takers. They experimented with 40 students, with sample document sizes ranging between 1710 and 70 300 characters, and obtained performances of (FRR = 20.25%, FAR = 4.18%) and (FRR = 93.46%, FAR = 4.84%) when using keystroke dynamics and stylometry separately, respectively. The combination of both types of features yielded an EER of 30%. They concluded that the feature set must be extended and that certain types of punctuation may not necessarily represent the style of students when taking online exams.
Chen and Hao investigated authorship similarity for e-mails [21]. The proposed feature set included 40 lexical, 76 syntactic, 25 content-specific, and 9 structural features. Experimental evaluation involving 40 authors from the Enron dataset yielded 84% and 89% classification accuracy rates for 10 and 15 short e-mails, respectively.

3. Proposed approach

This section summarizes previous work and presents the proposed approach by discussing feature selection and describing the classification model.

3.1. Previous work

In previous work, the authors investigated the possibility of using stylometry for authorship verification of short online messages [22]. The technique was based on a combination of supervised learning and n-gram analysis. The evaluation used a real-life dataset from Enron, where the e-mails were combined to produce a single long message per individual, and then divided into smaller blocks used for authorship verification. The experimental evaluation yielded an EER of 14.35% for 87 users with message blocks of 500 characters.

In an earlier version of the current paper, presented at the 28th IEEE International Conference on Advanced Information Networking and Applications (AINA-2014), the feature set was expanded and the Information Gain (IG) metric was used to rank the best features. In addition, SVM was used for classification [23], and a dataset based on the Enron e-mail corpus was used for experimental evaluation. In the current paper, an extra filter named Mutual Information (MI) is added to the feature selection process in order to discard highly correlated features. In addition, the authors investigate as classification technique not only SVM, but also a hybrid method combining SVM with LR (SVM-LR). Finally, the proposed approach is evaluated using shorter messages (i.e., micro messages), consisting of blocks of text of 140, 280, and 500 characters from two different datasets, one based on Twitter feeds and another based on Enron e-mails.

3.2. Approach overview

In the proposed approach, an online document is decomposed into consecutive blocks of short text over which (continuous) authentication decisions happen. For each block of text, a feature vector is extracted based on all features. In this study, an initial set of basic features is selected by combining lexical character, lexical word, syntactic, and application-specific characteristics; as advanced features, a set of new features is extracted through n-gram analysis. The feature values are normalized to the range between 0 and 1 using a maximum normalization scheme, in which a given feature value is replaced by its ratio over the maximum value of the same feature over the training set. In order to reduce the large feature space, information gain and mutual information techniques are combined for feature selection.

The classification model consists of a collection of profiles generated separately for individual users. The proposed system operates in two modes: enrollment and verification. Based on sample training data, the enrollment process computes the behavioral profile of the user using machine learning classification. The proposed system addresses authorship verification as a two-class classification problem. The first class is composed of (positive) samples from the author, whereas the second (negative) class is composed of samples from other authors. Thereby, the negative class has more samples than the positive class, generating an imbalanced class distribution. The approach used to deal with this situation is to assign a weight P to the negative class corresponding to the ratio between the total number of positive samples and the total number of negative samples, as sketched below.
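As an illustration of the normalization and weighting steps, the following minimal sketch uses toy data and scikit-learn's SVC rather than the WEKA SMO learner used in this work; all names and values are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: 50 positive blocks (the author, label 1) and 500 negative blocks
# (impostors, label 0), mimicking the imbalanced two-class setting.
rng = np.random.default_rng(0)
X = rng.random((550, 972))           # 972 stylometric features per block (cf. Table 4)
y = np.array([1] * 50 + [0] * 500)

# Maximum normalization: replace each value by its ratio over the feature's
# maximum in the training set, so all features range between 0 and 1.
col_max = X.max(axis=0)
col_max[col_max == 0] = 1.0          # guard against constant-zero features
X_norm = X / col_max

# Weight P assigned to the negative class: ratio of positive to negative samples.
P = (y == 1).sum() / (y == 0).sum()
clf = SVC(kernel="linear", class_weight={0: P, 1: 1.0})
clf.fit(X_norm, y)
```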
3.3. Initial features

Many linguistic features have been suggested for authorship verification, for instance the choice of particular words and syntactic structures [24]. Unlike topic-based text categorization, whose central point is a "bag of content words", the set of features here was expanded by combining lexical, syntactic, and application-specific features. Such a combination better expresses the author's style. The list of all the features used in this work is shown in Table 4 in Appendix A. The features are briefly discussed in the following.

Lexical features consist of a set of lexical items (words or characters) extracted from a text [25]. In terms of characters, the set of features includes the frequencies of character classes such as upper case, lower case, vowels, white space, digits, and special characters [26]. New features corresponding to 5-grams and 6-grams were derived. An n-gram is a token formed by a contiguous sequence of characters or words, and has been shown to be tolerant to typos [27,28,6,29–31]. The approach used to calculate these new features is discussed in Subsection 3.4.

Another stylistic marker is the writer's mood expressed in the form of icons and symbols [32]. The icons are divided into three groups, and their averages are calculated. The first group contains 126 text-based icons (e.g., ":-)", ":o)") subdivided into 38 different categories (e.g., smiley, laughing, very happy, frown, angry, crying, etc.). The second group contains 80 emoticons based on Unicode characters, with code points ranging from U+1F600 to U+1F64F. The last group contains 256 miscellaneous symbols, with code points ranging from U+2600 to U+26FF.
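The Unicode-based icon features can be gathered with a simple code-point check; the sketch below is illustrative, not the extraction code used in this work:

```python
def icon_counts(text: str) -> dict:
    """Count Unicode emoticons (U+1F600 to U+1F64F) and miscellaneous
    symbols (U+2600 to U+26FF) in a block of text."""
    emoticons = sum(1 for ch in text if 0x1F600 <= ord(ch) <= 0x1F64F)
    misc = sum(1 for ch in text if 0x2600 <= ord(ch) <= 0x26FF)
    return {"emoticons": emoticons, "misc_symbols": misc}

print(icon_counts("Nice day \U0001F600 \u2600"))  # {'emoticons': 1, 'misc_symbols': 1}
```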
In terms of lexical words, the following features are extracted for each sample: the number of words, the average sentence length in terms of words, the frequencies of short words (1–3 characters) and long words (more than 6 characters), the average word length, the average number of syllables per word, the ratio of characters in words, and the replaced words [29,5]. Although earlier studies used between 100 and 1000 frequent words to determine the author of a document [33–35], in this work only the fifty most frequently used words per author are selected, since each sample text has few words. Vocabulary richness is measured by quantifying the numbers of hapax legomena and dis legomena, which refer to words occurring only once or twice in a text, respectively [36,37,26].

Syntactic features capture the structure of a sentence in terms of punctuation and part of speech (POS) [38–40]. Punctuation plays an important role in defining boundaries and identifying meaning (quotation, exclamation, etc.) by splitting a paragraph into sentences and each sentence into various tokens [10]. The set of basic punctuation marks includes single quotes, commas, periods, colons, semi-colons, question marks, and exclamation marks. In addition, a set of 112 uncommon punctuation symbols (e.g., †, ‥, …) was created, based on the Unicode general punctuation block (code points U+2000 to U+206F). Function words are topic-independent and capture the author's style across different subjects. POS tagging consists of categorizing a word according to its function in context; words can be classified as verbs, nouns, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections [20,16,41,19].

Application-specific features include characteristics related to the organization and format of a text [37,26,4,41,21]. Since the proposed approach involves analyzing short messages (from e-mails and Twitter posts), only features related to the paragraph structure are extracted. These features include the number of sentences per block of text; the average numbers of characters, words, and sentences in a block of text; and the average number of sentences beginning with upper and lower case.

3.4. N-gram model

In the proposed approach, feature extraction is performed in two steps. During the first step, the frequencies and averages of the lexical, syntactic, and application-specific features are computed. In the second step, the character n-grams are calculated. The n-gram model includes not only all unique n-grams, but also all n-grams with frequency equal to or higher than some number f. In addition, two different modes of calculation for the n-grams are considered. Let m denote a binary variable representing the mode of calculation of the n-grams: m = 0 if the calculation is based only on unique n-grams, and m = 1 otherwise. The training data of a given user U is divided into two subsets, denoted by T(f)^1_U and T^2_U. Let r_U(b) denote the similarity between a sample data block b and the profile of user U. The similarity r_U(b) is defined as the percentage of unique n-grams shared by block b and the training set T(f)^1_U, calculated as follows:¹
$$r_U(b) = \frac{|N_m(b) \cap N(T(f)^1_U)|}{|N_m(b)|} \qquad (1)$$
where N(T(f)^1_U) denotes the set of all unique n-grams occurring in T(f)^1_U with frequency at least f, and N_m(b) denotes the set of all unique n-grams occurring in b (for m = 0) or the collection of all n-grams occurring in b (for m = 1). Let d_U(b) denote a binary similarity metric, referred to as the decision, which captures the closeness of a block b to the profile of user U, and is defined as follows:
$$d_U(b) = \begin{cases} 1 & \text{if } r_U(b) \geq \theta_U \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
where θ_U is a user-specific threshold derived from the training data. The value of θ_U for user U is derived using the supervised learning technique outlined in Algorithm 1. Given a user U, the training subset T^2_U is divided into p blocks of characters of equal size: b^1_U, ..., b^p_U. The proposed model approximates the actual (but unknown) distribution of the ratios r_U(b^1_U), ..., r_U(b^p_U) (extracted from T^2_U) by computing the sample mean μ_U and the sample variance σ²_U during training. In the algorithm, the threshold is initialized as θ_U = μ_U − σ_U/2 and then varied incrementally by minimizing the difference between the FRR and FAR values for the user, the goal being to obtain an operating point that is as close as possible to the EER.
¹ |X| denotes the cardinality of set X.
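A minimal sketch of the profile construction and of the similarity of Eq. (1) follows; function names and the toy strings are illustrative assumptions:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> list:
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile_ngrams(training_text: str, n: int, f: int = 1) -> set:
    """N(T(f)1_U): unique n-grams occurring at least f times in the
    training subset T1_U."""
    counts = Counter(char_ngrams(training_text, n))
    return {g for g, c in counts.items() if c >= f}

def similarity(block: str, profile: set, n: int, m: int = 0) -> float:
    """r_U(b): fraction of the block's n-grams found in the profile.
    m = 0 counts each distinct n-gram once; m = 1 counts every occurrence."""
    grams = char_ngrams(block, n)
    if m == 0:
        grams = list(set(grams))
    shared = sum(1 for g in grams if g in profile)
    return shared / len(grams) if grams else 0.0

profile = profile_ngrams("the quick brown fox jumps over the lazy dog", n=5)
print(similarity("the quick red fox", profile, n=5))
```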
Input: Training data for U, I_1, ..., I_m
Output: θ_U

/* U: the user for whom the threshold is being calculated */
/* I_1, ..., I_m: a set of other users (I_k ≠ U) */

begin
    up ← false; down ← false;
    δ ← 1;
    θ_U ← μ_U − (σ_U / 2);
    while δ > 0.0001 do
        /* calculate FAR and FRR for user U */
        FRR_U, FAR_U ← calculate(U, I_1, ..., I_m, θ_U, γ);
        /* minimize the difference between FAR and FRR */
        if (FRR_U − FAR_U) > 0 then
            down ← true;
            θ_U ← θ_U − δ;
        else if (FRR_U − FAR_U) < 0 then
            up ← true;
            θ_U ← θ_U + δ;
        else
            return θ_U;
        end
        if (up and down) then
            up ← false; down ← false;
            δ ← δ / 10;
        end
    end
    return θ_U;
end

Algorithm 1: Threshold calculation for a given user.
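A direct Python transcription of Algorithm 1 is sketched below; `calculate` is left as a stub standing for the FRR/FAR evaluation against the impostors I_1, ..., I_m, and the γ parameter is omitted:

```python
def find_threshold(calculate, mu_U, sigma_U, eps=1e-4):
    """Search for the user-specific threshold theta_U balancing FRR and FAR.
    `calculate(theta)` must return (FRR, FAR) for user U at threshold theta."""
    up = down = False
    delta = 1.0
    theta = mu_U - sigma_U / 2           # initialization, as in Algorithm 1
    while delta > eps:
        frr, far = calculate(theta)
        if frr - far > 0:                # too many false rejections: lower theta
            down = True
            theta -= delta
        elif frr - far < 0:              # too many false acceptances: raise theta
            up = True
            theta += delta
        else:
            return theta                 # FRR == FAR: operating point found
        if up and down:                  # overshot in both directions: refine step
            up = down = False
            delta /= 10
    return theta
```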
3.5. Feature selection

Feature selection for a CA system consists of identifying and keeping only the most discriminating features for each individual user. This reduces the data size by removing irrelevant attributes and improves the processing time for training and classification. The worth of an attribute is evaluated by applying a ranking strategy based on Information Gain (IG), and highly correlated features are then discarded based on a Mutual Information (MI) strategy. Features with very little or no predictive information, as well as highly correlated features, are identified and removed.

Let X and Y denote two random variables. The information entropy of X, denoted by H(X), is defined by:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \qquad (3)$$
where p(x_i) denotes the probability mass function of X. Let H(X|Y) denote the conditional entropy of X after the observation of Y. H(X|Y) is defined as:
$$H(X \mid Y) = -\sum_{i=1}^{N} \sum_{j=1}^{M} p(x_i, y_j) \log_2 p(x_i \mid y_j) \qquad (4)$$
Suppose that the dataset is composed of two classes (positive and negative). The IG of a feature Attr with respect to the class is computed as follows:
$$\mathit{IG}(\mathit{Class}, \mathit{Attr}) = H(\mathit{Class}) - H(\mathit{Class} \mid \mathit{Attr}) \qquad (5)$$
Given two features Attr_a and Attr_b, the MI is calculated as follows:
$$\mathit{MI}(\mathit{Attr}_a, \mathit{Attr}_b) = H(\mathit{Attr}_a) - H(\mathit{Attr}_a \mid \mathit{Attr}_b) \qquad (6)$$
Prior to computing IG and MI, it is necessary to discretize the numeric feature values into binary values (0 and 1). The discretization process consists of finding a cut-point or split-point that divides the range into two intervals, one containing the values less than or equal to the cut-point and the other the values greater than it [42]. The entropy-based discretization method proposed by Fayyad and Irani [43] is used in this work. This is a supervised discretization method, which is known to achieve some of the best performances in the literature. For the purpose of feature selection, only features with non-zero information gain are retained.

3.6. Classification method

Two different classifiers are investigated in this work: SVM and a hybrid of SVM and LR. These classifiers are briefly described in the following.
Fig. 1. Decision boundary separating two classes; samples on the margin are called the support vectors.

Table 1
Kernel functions.

Kernel type     Inner product kernel
Linear          K(x, y) = x · y
Polynomial      K(x, y) = (x · y + 1)^p
Gaussian        K(x, y) = e^(−‖x − y‖² / 2γ²)
3.6.1. Support vector machine

SVM is a binary classifier originally proposed by Vapnik [44]. SVM is based on the idea of mapping the original finite-dimensional space X into a much higher-dimensional space F and building a hyperplane separating the points of the two classes (positive and negative). The hyperplane that divides the two classes with the largest minimum distance to the training examples is called the optimal (maximum-margin) hyperplane, and it defines the decision boundary, as illustrated in Fig. 1. The instances that lie closest to the hyperplane are called the support vectors, and training an SVM consists of identifying the support vectors within the training samples. Assume that s_i ∈ X are the support vectors, each associated with a class label y_i ∈ {+1, −1} (for positive and negative examples, respectively). Given an unlabeled sample x ∈ X, classification consists of predicting the corresponding label. This is performed using a decision function as follows:
$$f(x) = \sum_{i} y_i \alpha_i K(x, s_i) + b \qquad (7)$$
where the α_i are Lagrange multipliers, and K is a kernel function that measures the similarity or distance between the unlabeled sample x and the support vector s_i. The kernel function K(x, s_i) maps the sample space X into a high-dimensional feature space F. Examples of kernel functions include linear, polynomial, and Gaussian kernels [44]; Table 1 shows their definitions.
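The decision function of Eq. (7) can be sketched directly; this is an illustration with a linear kernel, and all values are toy assumptions:

```python
import numpy as np

def linear_kernel(x, s):
    return float(np.dot(x, s))

def svm_decision(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(x) = sum_i y_i * alpha_i * K(x, s_i) + b, Eq. (7)."""
    return sum(y_i * a_i * kernel(x, s_i)
               for s_i, y_i, a_i in zip(support_vectors, labels, alphas)) + b

# Two toy support vectors on either side of the boundary.
sv = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
y = [+1, -1]
a = [0.5, 0.5]
print(svm_decision(np.array([0.8, 0.9]), sv, y, a, b=0.0))  # > 0: positive class
```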
3.6.2. Hybrid SVM-LR classifier

Although SVM is a non-probabilistic classifier, probability estimates can be obtained by integrating SVM with logistic regression into a more robust classifier [45,46]. The output f(x) of the SVM is submitted to the logistic function:

$$P(x) = \frac{1}{1 + e^{-f(x)}} \qquad (8)$$

where the output P(x) always lies between zero and one (0 ≤ P(x) ≤ 1).
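A minimal sketch of the hybrid scheme: train an SVM, then pass its margin output f(x) through the logistic function of Eq. (8) to obtain a score in (0, 1). scikit-learn's SVC and toy Gaussian data are used for illustration; the work itself relies on WEKA's SMO:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1, 1, (50, 5)), rng.normal(-1, 1, (50, 5))])
y = np.array([1] * 50 + [0] * 50)

svm = SVC(kernel="linear").fit(X, y)
f_x = svm.decision_function(X[:3])    # SVM output f(x) for three samples
p_x = 1.0 / (1.0 + np.exp(-f_x))      # logistic function, Eq. (8)
print(p_x)                            # probability-like scores in (0, 1)
```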
4. Experimental evaluation

4.1. Dataset

Two different datasets are used in the experiments: one corpus of e-mails and another of micro messages. The first corpus is the Enron e-mail dataset.² Enron was an energy company (located in Houston, Texas) that went bankrupt in 2001 due to white-collar fraud.
² Available at http://www.cs.cmu.edu/~enron/.
The e-mails of Enron's employees were made public by the Federal Energy Regulatory Commission during the fraud investigation. The e-mail dataset contains more than 200 thousand messages from about 150 users. The average number of words per e-mail is 200. The e-mails are plain text and cover various topics, ranging from business communications to technical reports and personal chats.

The second corpus consists of micro messages obtained from Twitter.³ Twitter is a microblogging service that allows authors to post messages of up to 140 characters. Registered users can read and post tweets, reply to a tweet, send private messages, and re-tweet a message, while unregistered users can only read them. Also, a registered user can follow and be followed by other users. A novel Twitter dataset was developed as part of this research, based on the lists of the UK's most influential tweeters compiled by Ian Burrell of The Independent newspaper.⁴ His methodology for choosing the people included help from the social media monitoring group PeerIndex, with additional input from a panel of experts. The Twitter accounts of 100 authors randomly selected from the 2011⁵ and 2012⁶ lists were crawled. The crawled data comprises all tweets posted up to November 6th, 2013 (inclusive). The dataset contains on average 3194 Twitter messages, with 301 100 characters, per author.

4.2. Data preprocessing

In the e-mail corpus, e-mails from the folders "sent" and "sent items" within each user's folder were selected; all duplicate e-mails were removed. The JavaMail API was used to parse each e-mail to extract the body of the message and remove reply texts when present. All e-mails containing tables of numbers, identified by a ratio of digits to total characters higher than 25%, were removed. E-mail and web addresses were replaced by the meta tags "e-mail" and "http", respectively.

In the micro message corpus, only the content of the "text" field of the JSON structure⁷ was used in the experiments, since this content characterizes the authorship of a message. Additional preprocessing consisted of removing all non-English messages, all re-tweet (RT) posts, all duplicated tweets, and all messages containing one or more of the following Unicode blocks: Arabic, Cyrillic, Devanagari, Hangul syllables, Bengali, Hebrew, Malayalam, Greek, Hiragana, Cherokee, CJK unified ideographs. Hashtags such as "#word" were replaced, together with the following word, by the meta tag "#hash", and @user references were replaced by the meta tag "@cite".

A canonicalization filter was applied in order to represent the concept rather than the actual value [47,48]. As a result, in both datasets the following replacements were performed: phone numbers by the meta tag "phone", currency by the meta tag "$XX", percentages by the meta tag "XX%", dates by the meta tag "date", hours by the meta tag "time", numbers by the meta tag "numb", and information between tags by the meta tag "TAG". A sketch of this step is given below. The next preprocessing steps included normalizing the document to printable ASCII, converting all characters to lower case, and normalizing white space. Finally, all messages of each author were grouped, creating a long text or stream of characters that was divided into blocks.

After the preprocessing phase, the Enron dataset was reduced from 150 authors to 76 authors, to ensure that only users with 50 instances of 500 characters each were involved in the analysis. The number of users in the Twitter dataset remained 100.
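The canonicalization replacements can be approximated with regular expressions; the patterns below are simplified illustrations, not the exact filters used:

```python
import re

def canonicalize(text: str) -> str:
    """Replace concrete values by meta tags and lower-case the text."""
    text = re.sub(r"\S+@\S+\.\S+", "e-mail", text)   # e-mail addresses
    text = re.sub(r"https?://\S+", "http", text)     # web addresses
    text = re.sub(r"#\w+", "#hash", text)            # hashtags
    text = re.sub(r"@\w+", "@cite", text)            # user references
    text = re.sub(r"\$\d[\d,.]*", "$XX", text)       # currency
    text = re.sub(r"\d+%", "XX%", text)              # percentages
    text = re.sub(r"\d+", "numb", text)              # remaining numbers
    return text.lower()

print(canonicalize("Ask @bob about http://example.com: 20% off, $5 #deal"))
```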
4.3. Evaluation method

The proposed approach was implemented in Java and uses a popular machine learning toolkit named WEKA (Waikato Environment for Knowledge Analysis)⁸ for data analysis [49]. The SVM learner called Sequential Minimal Optimization (SMO) was used in WEKA [50,51]. For each user U, a reference profile was generated based on a training set consisting of samples from the user (i.e., positive samples) and samples from other users (i.e., negative samples) considered as impostors.

Experimental evaluation was conducted using a 10-fold cross-validation methodology. The dataset was randomly sorted, and in each (validation) round 90% of the dataset was allocated for training and the remaining 10% for testing. To reduce variability, ten rounds of cross-validation were performed using different partitions of the dataset, and the validation results were averaged over the rounds. False rejection (FR) was tested by comparing the test samples of each user U against his own profile. The FRR was obtained as the ratio between the number of false rejections and the total number of trials. False acceptance (FA) was tested by comparing, for each user U, all the negative test samples against his profile. The FAR was obtained as the ratio between the number of false acceptances and the total number of trials. The overall FRR and FAR were obtained by averaging the individual measures over the entire user population. Finally, the EER was determined by identifying the operating point where FRR and FAR have the same value.
³ Available at http://www.uvic.ca/engineering/ece/isot/datasets/.
⁴ http://www.independent.co.uk.
⁵ Available at http://www.independent.co.uk/news/people/news/the-full-list-the-twitter-100-2215529.html.
⁶ Available at http://www.independent.co.uk/news/people/news/the-twitter-100-the-full-ataglance-list-7467920.html.
⁷ JSON (JavaScript Object Notation) is a data-interchange language.
⁸ Available at http://weka.wikispaces.com.
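The error rates of Section 4.3 can be sketched as follows; this is an illustrative helper over toy similarity scores, not the evaluation harness used in this work:

```python
import numpy as np

def error_rates(genuine, impostor, threshold):
    """FRR: genuine trials rejected; FAR: impostor trials accepted."""
    frr = float(np.mean(np.asarray(genuine) < threshold))
    far = float(np.mean(np.asarray(impostor) >= threshold))
    return frr, far

def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds; return the rate where FRR and FAR meet."""
    candidates = np.unique(np.concatenate([genuine, impostor]))
    best = min(candidates,
               key=lambda t: abs(np.subtract(*error_rates(genuine, impostor, t))))
    frr, far = error_rates(genuine, impostor, best)
    return (frr + far) / 2

genuine = np.array([0.9, 0.8, 0.85, 0.6])
impostor = np.array([0.3, 0.5, 0.65, 0.2])
print(equal_error_rate(genuine, impostor))
```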
Fig. 2. Receiver operating characteristic curve for the experiment and sample performance values for different weights.
Table 2
EER obtained by varying the type of SVM kernel. Comparison among different SVM kernels based on the Twitter dataset involving 100 authors, with a block size of 280 characters and 100 blocks per user. In these experiments, only Information Gain was used as the feature selection metric.

SVM kernel               EER (%)
Linear (SVM)             18.86
Linear (SVM-LR)          15.34
Polynomial 3 (SVM-LR)    19.20
Polynomial 5 (SVM-LR)    28.47
Gaussian (SVM-LR)        46.01
4.4. Results

Three different sets of experiments were performed, varying the weight assigned to the negative samples (denoted by weight(P)), the SVM kernel, and the feature selection technique. The obtained results are presented in the following.

4.4.1. Varying the weight

In order to test the effect of the weight weight(P), a set of experiments was conducted using the Enron dataset, involving 76 authors, a block size of 500 characters, and 50 blocks per user. In these experiments, the SVM linear kernel was used and feature selection was performed using information gain. Fig. 2 shows the receiver operating characteristic (ROC) curve for the experiment. The curve shows the relation between the FAR and FRR when varying weight(P) from 0 to 100. The optimal performance was achieved when setting weight(P) to 10, with a FAR of 12.49% and an FRR of 12.34%; the EER was calculated as 12.42%. Subsequent experiments used weight(P) set to 10.

4.4.2. Varying the kernel

In order to test the effect of the SVM kernel, a set of experiments was conducted using the Twitter dataset, involving 100 authors, a block size of 280 characters, and 100 blocks per user. In these experiments, only the information gain metric was used for feature selection. The first experiment used a linear kernel, yielding EERs of 18.86% and 15.34% for pure SVM and hybrid SVM-LR, respectively. Subsequent experiments focused only on the hybrid SVM-LR classifier, since it outperforms pure SVM. Experiments using polynomial kernels of degree 3 and degree 5, as well as a Gaussian kernel, yielded (for the hybrid classifier) EERs varying from 19.20% to 46.01%, as shown in Table 2. From these results, it can be concluded that the boundary separating the positive from the negative data is linear. Therefore, the subsequent experiments were run with the linear kernel only.
Table 3
Authorship verification using the Twitter dataset. EER for hybrid SVM-LR using the Twitter dataset involving 100 authors, varying the block size and the number of blocks per author. Feature selection was performed using the Information Gain and Mutual Information metrics.

Block size (characters)   Blocks per user   SVM-LR EER (%)
140                       100               21.45
140                       200               18.37
280                       50                17.83
280                       100               13.27
4.4.3. Varying the feature selection

This set of experiments extends the feature selection by adding the mutual information selection approach and varying the block size and the number of blocks per user. Table 3 shows the results based on the Twitter dataset involving 100 authors. With a block size of 140 characters and 100 blocks per user, an EER of 21.45% was obtained using the hybrid SVM-LR classifier. Increasing the training set and the block size affects the accuracy. For instance, increasing the number of blocks per user to 200 yielded an EER of 18.37%. Likewise, using a block size of 280 characters with 50 blocks per user yielded an EER of 17.83%, and with 100 blocks per user an EER of 13.27%. The last experiment was based on the Enron dataset and involved 76 authors, a block size of 500 characters, and 50 blocks per author; feature selection was performed using the Information Gain and Mutual Information metrics. In this case, an EER of 9.98% was obtained using the hybrid SVM-LR classifier.

5. Discussions

The results presented above are very encouraging and better than those obtained so far by similar work in the literature. Table 5, in Appendix A, summarizes the performances, block sizes, and population sizes of previous studies on authorship verification using stylometry.

A dataset is imbalanced when the negative class has more samples than the positive class. Balanced classification can be achieved by changing the class distribution through under-sampling the majority class or over-sampling the minority class. The proposed approach of weighting the negative class by the ratio between the total number of positive samples and the total number of negative samples is an alternative to resampling an imbalanced dataset. The EER was calculated as 12.42% when weight(P) was set to 10.

We also investigated the impact of different SVM kernels on accuracy. The outcome of this study is that SVM with a linear kernel achieves better EER performance than the polynomial or Gaussian kernels. The obtained results further validate the fact that the prediction accuracy of the SVM classifier can be improved by extending it with LR in a hybrid classifier. Furthermore, the performance results demonstrate that using the Information Gain and Mutual Information metrics in combination for feature selection achieves a significant improvement over the baseline system.

6. Conclusion

In this paper, some important steps are taken toward developing a robust framework for continuous authentication using authorship verification. A critical component of CA is user identity verification. While many studies have employed stylometric techniques for authorship attribution and characterization, fewer studies have focused on verification. The proposed authorship verification approach introduces new features obtained through n-gram analysis. CA was simulated by decomposing an online text into blocks of short text. Block sizes of 500, 280, and 140 characters were investigated. Comprehensive experiments based on two different datasets, involving 76 and 100 different authors, demonstrate that the proposed approach achieves promising results when compared to existing work in the literature.

Future work will investigate another important aspect of CA based on stylometry, which consists of resilience to forgeries [52].
The goal will be to create a forgery dataset and to work on improvements by investigating new machine learning techniques that will not only strengthen the performance of the verification process, but also address issues related to forgery attacks. Furthermore, with the increasing popularity of messenger services such as WhatsApp and Facebook Messenger, as well as Twitter, the threat of spoofing has become a source of increasing concern for users. Future work will investigate how to extend and apply the proposed model as a spoofing countermeasure.

Acknowledgments

This research has been enabled by the use of computing resources provided by WestGrid and Compute/Calcul Canada. The research is funded by a Vanier scholarship from the Natural Sciences and Engineering Research Council of Canada (NSERC) (No. VCGS3 - 424654 - 2012) and a CNPq (Brazil) scholarship (No. 237628/2012-0).
Appendix A
Table 4
List of stylometric features used in this work.

Feature             Characteristics

Lexical (character)
F1                  Number of characters (C)
F2                  Number of lower-case characters/C
F3                  Number of upper-case characters/C
F4                  Number of white-space characters/C
F5                  Number of vowels (V)/C
F6...F10            Vowels (a, e, i, o, u)/V
F11...F36           Alphabets (A–Z)/C
F37                 Number of special characters (S)/C
F38...F50           Special characters (e.g. '@', '#', '$', '%', '(', ')', '{', '}', etc.)/S
F51...F66           Character 5- and 6-grams (r_U(b) and d_U(b)) with two values of the frequency f (f = 1 and f = 2) and of the n-gram calculation mode (m = 0 and m = 1)
F67...F192          Text-based icons (38 categories)
F193...F272         Unicode emoticons
F273...F528         Unicode miscellaneous symbols

Lexical (word)
F529                Number of words (N)
F530...F539         Average sentence length in terms of words/N
F540                Number of long words (more than 6 characters)/N
F541                Number of short words (1–3 characters)/N
F542                Average word length
F543                Average number of syllables per word
F544                Ratio of characters in words to N
F545...F550         Replaced words/N
F551...F600         The 50 most frequent words per author
F601...F602         Hapax legomena and dis legomena
F603                Vocabulary richness (number of unique words/N)

Syntactic
F604                Number of punctuation marks (P)
F605...F612         Punctuation marks (single quotes, commas, periods, colons, semi-colons, question marks, exclamation marks)/P
F613...F724         Unicode general punctuation
F725...F729         Number of function words (conjunctions, determiners, prepositions, interjections, and pronouns)/N
F730...F965         Relative frequency of function words

Application-specific
F966                Number of sentences
F967                Number of paragraphs
F968...F970         Average number of characters, words, and sentences in a block of text
F971...F972         Average number of sentences beginning with upper case and lower case
Table 5
Comparative performances, block sizes, and population sizes for stylometry studies.

Category      Ref.     Sample size  Block size      Number of features           Technique                                         Accuracy* (%)  EER (%)
Verification  [5]      40           1710–70 300 ch  L(62), Sy(20)                k-NN                                              –              30
              [21]     25–40**      30–50 w         L(40), Sy(76), Se(25), A(9)  SVM                                               83.90–88.31    –
              [31]     8            628–1342 w      L(100K), Sy(900K)            Weighted Probability Distribution Voting (WPDV)   –              3
              [10]     8            628–1342 w      L(292)                       –                                                 –              17.1–22.4
              [6]      10           500 w           L(250)                       SVM                                               95.70          –
              [53]     29           2400 w          L(40)                        Linear Discriminant Analysis (LDA)                –              22
              [22]     87**         250–500 ch      L(n-gram)                    Supervised learning                               –              18.90–14.35
              Current  76**         500 ch          L(91), Sy(251), A(7)         SVM and SVM-LR                                    –              12.42–9.98
              Current  100***       140–280 ch      L(537+), Sy(362+), A(7)      SVM-LR                                            –              21.45–13.27

(L) = Lexical, (Sy) = Syntactic, (Se) = Semantic, (A) = Application, (w) = word, (ch) = character.
* Accuracy is measured by the percentage of correctly matched authors in the testing set.
** Used the Enron dataset for testing.
*** Used the Twitter dataset for testing.
References

[1] I. Traore, A.A.E. Ahmed (Eds.), Continuous Authentication Using Biometrics: Data, Models, and Metrics, IGI Global, 2012.
[2] J. Li, R. Zheng, H. Chen, From fingerprint to writeprint, Commun. ACM 49 (2006) 76–82.
[3] J.L. Hilton, On Verifying Wordprint Studies: Book of Mormon Authorship, Foundation for Ancient Research and Mormon Studies, 1991, Reprint.
[4] A. Abbasi, H. Chen, Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace, ACM Trans. Inf. Syst. 26 (2008) 1–29.
[5] O. Canales, V. Monaco, T. Murphy, E. Zych, J. Stewart, C.T.A. Castro, O. Sotoye, L. Torres, G. Truley, A Stylometry System for Authenticating Students Taking Online Tests, CSIS, Pace University, 2011.
[6] M. Koppel, J. Schler, Authorship verification as a one-class classification problem, in: Proceedings of the 21st International Conference on Machine Learning, ICML'04, Banff, Alberta, Canada, ACM, 2004, pp. 62–69.
[7] C. Sanderson, S. Guenter, Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation, in: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP'06, Stroudsburg, PA, USA, Association for Computational Linguistics, 2006, pp. 482–491.
[8] A. Abbasi, H. Chen, Applying authorship analysis to extremist-group web forum messages, IEEE Intell. Syst. 20 (2005) 67–75.
[9] F. Iqbal, R. Hadjidj, B.C. Fung, M. Debbabi, A novel approach of mining write-prints for authorship attribution in e-mail forensics, Digit. Investig. 5 (2008) S42–S51.
[10] F. Iqbal, L.A. Khan, B.C.M. Fung, M. Debbabi, E-mail authorship verification for forensic investigation, in: Proceedings of the 2010 ACM Symposium on Applied Computing, SAC'10, New York, NY, USA, ACM, 2010, pp. 1591–1598.
[11] F. Iqbal, H. Binsalleeh, B.C. Fung, M. Debbabi, A unified data mining solution for authorship analysis in anonymous textual communications, Inf. Sci. 231 (2013) 98–112.
[12] C.E. Chaski, Who's at the keyboard: authorship attribution in digital evidence investigations, Int. J. Digit. Evid. 4 (1) (2005) 1–13.
[13] F. Mosteller, D.L. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, 1964.
[14] J. Burrows, Delta: a measure of stylistic difference and a guide to likely authorship, Lit. Linguist. Comput. 17 (3) (2002) 267–287.
[15] E. Backer, P. van Kranenburg, On musical stylometry: a pattern recognition approach, Pattern Recognit. Lett. 26 (3) (2005) 299–309.
[16] Y. Zhao, J. Zobel, Searching with style: authorship attribution in classic literature, in: Proceedings of the Thirtieth Australasian Conference on Computer Science, Vol. 62, ACSC'07, Darlinghurst, Australia, Australian Computer Society, Inc., 2007, pp. 59–68.
[17] R. Sarawgi, K. Gajulapalli, Y. Choi, Gender attribution: tracing stylometric evidence beyond topic and genre, in: Proceedings of the 15th Conference on Computational Natural Language Learning, CoNLL'11, Stroudsburg, PA, USA, Association for Computational Linguistics, 2011, pp. 78–86.
[18] N. Cheng, X. Chen, R. Chandramouli, K. Subbalakshmi, Gender identification from e-mails, in: IEEE Symposium on Computational Intelligence and Data Mining, CIDM'09, 2009, pp. 154–158.
[19] N. Cheng, R. Chandramouli, K. Subbalakshmi, Author gender identification from text, Digit. Investig. 8 (1) (2011) 78–88.
[20] P. Juola, R.H. Baayen, A controlled-corpus experiment in authorship identification by cross-entropy, Lit. Linguist. Comput. 20 (Suppl) (2005) 59–67.
[21] X. Chen, P. Hao, R. Chandramouli, K.P. Subbalakshmi, Authorship similarity detection from email messages, in: Proceedings of the 7th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM'11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 375–386.
[22] M.L. Brocardo, I. Traore, S. Saad, I. Woungang, Authorship verification for short messages using stylometry, in: Proceedings of the International Conference on Computer, Information and Telecommunication Systems, CITS, Piraeus-Athens, Greece, 2013, pp. 1–6.
[23] M.L. Brocardo, I. Traore, I. Woungang, Toward a framework for continuous authentication using stylometry, in: Proceedings of the 28th IEEE International Conference on Advanced Information Networking and Applications, AINA-2014, Victoria, Canada, IEEE, 2014, pp. 106–115.
[24] S. Argamon, C. Whitelaw, P. Chase, S.R. Hota, N. Garg, S. Levitan, Stylistic text classification using functional lexical features, J. Am. Soc. Inf. Sci. Technol. 58 (2007) 802–822.
[25] S.M. Alzahrani, N. Salim, A. Abraham, Understanding plagiarism linguistic patterns, textual features, and detection methods, IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 42 (2) (2012) 133–149.
[26] R. Zheng, J. Li, H. Chen, Z. Huang, A framework for authorship identification of online messages: writing-style features and classification techniques, J. Am. Soc. Inf. Sci. Technol. 57 (2006) 378–393.
[27] B. Kjell, W. Woods, O. Frieder, Discrimination of authorship using visualization, Inf. Process. Manag. 30 (1) (1994) 141–150.
[28] F. Peng, D. Schuurmans, S. Wang, V. Keselj, Language independent authorship attribution using character level language models, in: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1, EACL'03, Stroudsburg, PA, USA, 2003, pp. 267–274.
[29] E. Stamatatos, A survey of modern authorship attribution methods, J. Am. Soc. Inf. Sci. Technol. 60 (2009) 538–556.
[30] P. Juola, Authorship attribution for electronic documents, in: Advances in Digital Forensics II, IFIP Advances in Information and Communication, vol. 222, Springer, New York, 2006, pp. 119–130.
[31] H.V. Halteren, Author verification by linguistic profiling: an exploration of the parameter space, ACM Trans. Speech Lang. Process. 4 (2007) 1–17.
[32] A. Orebaugh, J. Allnutt, Classification of instant messaging communications for forensics analysis, Int. J. Forensic Comput. Sci. 1 (2009) 22–28.
[33] J.F. Burrows, Word patterns and story shapes: the statistical analysis of narrative style, Lit. Linguist. Comput. 2 (1) (1987) 61–70.
[34] D.I. Holmes, The evolution of stylometry in humanities scholarship, Lit. Linguist. Comput. 13 (3) (1998) 111–117.
[35] N. Homem, J. Carvalho, Authorship identification and author fuzzy fingerprints, in: Annual Meeting of the North American Fuzzy Information Processing Society, NAFIPS, 2011, pp. 1–6.
[36] F.J. Tweedie, R.H. Baayen, How variable may a constant be? Measures of lexical richness in perspective, Comput. Humanit. 32 (5) (1998) 323–352.
[37] O. de Vel, A. Anderson, M. Corney, G. Mohay, Mining e-mail content for author identification forensics, SIGMOD Rec. 30 (4) (2001) 55–64.
[38] H. Baayen, H. van Halteren, F. Tweedie, Outside the cave of shadows: using syntactic annotation to enhance authorship attribution, Lit. Linguist. Comput. 11 (3) (1996) 121–132.
[39] M. Koppel, J. Schler, Exploiting stylistic idiosyncrasies for authorship attribution, in: Workshop on Computational Approaches to Style Analysis and Synthesis, IJCAI'03, Acapulco, Mexico, 2003, pp. 69–72.
[40] S. Argamon, S. Marin, S.S. Stein, Style mining of electronic messages for multiple authorship discrimination: first results, in: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'03, New York, NY, USA, ACM, 2003, pp. 475–480.
[41] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer, D. Benredjem, Towards an integrated e-mail forensic analysis framework, Digit. Investig. 5 (3–4) (2009) 124–137.
[42] S. Kotsiantis, D. Kanellopoulos, Discretization techniques: a recent survey, GESTS Int. Trans. Comput. Sci. Eng. 32 (1) (2006) 47–58.
[43] U.M. Fayyad, K.B. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: Thirteenth International Joint Conference on Artificial Intelligence, Vol. 2, Morgan Kaufmann Publishers, 1993, pp. 1022–1027.
[44] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer Series in Statistics, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1982.
[45] G. Wahba, X. Lin, F. Gao, D. Xiang, R. Klein, B. Klein, The bias-variance tradeoff and the randomized GACV, in: Advances in Neural Information Processing Systems, MIT Press, 1999, pp. 620–626.
[46] Y.-c.I. Chang, Boosting SVM classifiers with logistic regression, Tech. rep., Institute of Statistical Science, Academia Sinica, Taipei, Taiwan, 2003.
[47] W.-W. Deng, H. Peng, Research on a Naïve Bayesian based short message filtering system, in: International Conference on Machine Learning and Cybernetics, 2006, pp. 1233–1237.
[48] J. Cai, Y. Tang, R. Hu, Spam filter for short messages using winnow, in: International Conference on Advanced Language Processing and Web Information Technology, ALPIT'08, IEEE, 2008, pp. 454–459.
[49] I.H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, S.J. Cunningham, Weka: practical machine learning tools and techniques with Java implementations, in: International Workshop on Emerging Engineering and Connectionist-Based Information Systems, ANNES'99, 1999, pp. 192–196.
[50] N. Landwehr, M. Hall, E. Frank, Logistic model trees, Mach. Learn. 59 (1–2) (2005) 161–205.
[51] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schoelkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods – Support Vector Learning, MIT Press, 1998.
[52] L. Ballard, Biometric authentication revisited: understanding the impact of wolves in sheep's clothing, in: Proceedings of the 15th Annual Usenix Security Symposium, 2006, pp. 29–41.
[53] I. Krsul, E.H. Spafford, Authorship analysis: identifying the author of a program, Comput. Secur. 16 (3) (1997) 233–257.