Journal of Computer and System Sciences (www.elsevier.com/locate/jcss)
Authorship verification of e-mail and tweet messages applied for continuous authentication ✩

Marcelo Luiz Brocardo a,*, Issa Traore a,**, Isaac Woungang b

a Department of Electrical and Computer Engineering, University of Victoria, Victoria, British Columbia, V8W 3P6, Canada
b Department of Computer Science, Ryerson University, Toronto, Ontario, M5B 2K3, Canada
Article history: Received 1 July 2014; received in revised form 14 December 2014; accepted 14 December 2014.

Keywords: Continuous authentication; Stylometry; Short message verification; n-gram features; Unbalanced dataset; SVM classifier
Abstract

Authorship verification using stylometry consists of identifying a user based on his writing style. In this paper, authorship verification is applied to continuous authentication using unstructured online text-based entry. An online document is decomposed into consecutive blocks of short text over which (continuous) authentication decisions happen, discriminating between legitimate and impostor behaviors. We investigate blocks of 140, 280, and 500 characters. The feature set includes traditional features such as lexical, syntactic, and application-specific features, as well as new features extracted from n-gram analysis. Furthermore, the proposed approach includes a strategy to circumvent issues related to unbalanced datasets, uses Information Gain and Mutual Information as a feature selection strategy, and uses a Support Vector Machine (SVM) for classification. Experimental evaluation of the proposed approach based on the Enron e-mail and Twitter corpora yields very promising results, with an Equal Error Rate (EER) varying from 9.98% to 21.45% for different block sizes.
1. Introduction

Continuous authentication (CA) consists of re-authenticating a user repeatedly and unobtrusively throughout an authenticated session as data becomes available. CA is considered a remedy against session hijacking, in which an intruder seizes control of a legitimate user session [1]. Ideally, CA should be conducted unobtrusively, by enabling transparent user identity data collection and verification. Stylometric analysis, which consists of the identification of a user based on his writing style [2,3], could potentially be used for CA. Stylometry has so far been studied primarily for the purpose of forensic authorship analysis, generating a significant amount of interest over the years and leading to a rich body of research literature [4–7]. Three different kinds of authorship analysis using stylometry have been studied in the literature: authorship attribution, authorship characterization, and authorship verification. The focus of this paper is on authorship verification, which is the most relevant to continuous authentication. Authorship verification consists of checking whether or not a target document was written by a specific (i.e., known) individual (with some claimed identity).
✩ This paper is the journal version of the conference paper by Marcelo Luiz Brocardo, Issa Traore, and Isaac Woungang [23].
* Corresponding author.
** Principal corresponding author.
E-mail addresses: [email protected] (M.L. Brocardo), [email protected] (I. Traore), [email protected] (I. Woungang).
http://dx.doi.org/10.1016/j.jcss.2014.12.019
In this setting, CA is performed by comparing a writing sample from an individual against the model or profile associated with the identity claimed by that individual at login time (i.e., 1-to-1 identity matching). CA involves several challenges, including the need for low authentication delay, high accuracy, and the ability to withstand forgery. This paper focuses on the first two challenges. Low authentication delay is simulated by analyzing short texts. Reducing the text size and the verification error rates at the same time is difficult because these objectives conflict: a smaller verification block tends to increase the verification error rates, and vice versa. In this work, a set of new features is derived from n-gram analysis, and two classification schemes are studied: a Support Vector Machine (SVM) classifier and a hybrid SVM with Logistic Regression (LR) classifier. Authorship verification is investigated as a two-class problem. Furthermore, a weighting strategy for unbalanced datasets, different SVM kernels, and different feature selection strategies are tested. The proposed approach is evaluated experimentally by computing the following performance metrics:
• False Acceptance Rate (FAR): measures the likelihood that the system will falsely recognize an impostor as a genuine person;
• False Rejection Rate (FRR): measures the likelihood that the system will fail to recognize a genuine person;
• Equal Error Rate (EER): corresponds to the operating point where FAR and FRR have the same value.

Different block sizes (140, 280, and 500 characters) are tested on the Enron and Twitter datasets, yielding EERs ranging from 9.98% to 21.45% for the different block sizes. The results are very encouraging considering the existing works on authorship verification using stylometry.

The remainder of the paper is organized as follows. Section 2 summarizes and discusses related work. Section 3 presents an outline of the proposed approach. Section 4 presents the experimental evaluation of the proposed approach and the obtained results. Section 5 discusses the strengths and shortcomings of the proposed approach. Section 6 makes some concluding remarks and discusses future work.

2. Related work

As mentioned above, research on authorship analysis covers three different areas: authorship identification, authorship characterization, and authorship verification. A considerable number of studies have been conducted on authorship identification and characterization. For instance, previous studies on authorship identification investigated ways to identify patterns of terrorist communications [8], the author of a particular e-mail for computer forensic purposes [9–11], as well as how to collect digital evidence for investigations [12] or solve a disputed literary, historical [13], or musical authorship [14–16]. Work on authorship characterization has targeted primarily gender attribution [17–19] and the classification of the author's education level [20]. However, there are few papers on authorship verification outside the framework of plagiarism detection [6], and most of them focus on general text documents. In addition, the performance of authorship verification for online documents is affected by the text size, the number of candidate authors, the size of the training set, and the fact that these documents are in general quite poorly structured or written (as opposed to literary works).

Among the few studies available on authorship verification are works by Koppel et al. [6], Iqbal et al. [10], Canales et al. [5], and Chen and Hao [21]. Koppel et al. used an SVM with a linear kernel and addressed authorship verification as a one-class classification problem, ignoring negative samples. The corpus used in their study was composed of 21 English books written by 10 different authors. They divided the text into approximately equal sections of 500 words, preserving the paragraphs. The feature set was composed of the 250 most frequent words. They introduced a technique named "unmasking" that quantifies the dissimilarity between the sample document produced by the suspect and those of other users (i.e., impostors). Although the overall accuracy was 95.7%, they concluded that the use of negative examples could improve the results.

Iqbal et al. experimented with two different approaches [10]. The first approach conducts verification using classification; three different classifiers are investigated, namely Adaboost.M1, Bayesian Network, and Discriminative Multinomial Naive Bayes (DMNB). The second approach conducts verification by regression; three different classifiers were studied, including linear regression, SVM with Sequential Minimal Optimization (SMO), and SVM with an RBF kernel.
The feature set was composed of 292 attributes, which included lexical (collected in terms of either characters or words), syntactic (punctuation and function words), idiosyncratic (spelling and grammatical mistakes), and content-specific (keywords commonly found in a specific domain) features. Experimental evaluation of the proposed approach, using the Enron e-mail corpus and analyzing 200 e-mails per author, yielded EERs ranging from 17.1% to 22.4%.

Canales et al. trained a K-Nearest Neighbor (KNN) classifier with 82 stylistic features, including 49 character-based, 13 word-based, and 20 syntactic features [5]. In addition, they combined stylometry and keystroke dynamics analysis for the purpose of authenticating online test takers. They experimented with 40 students, with sample document sizes ranging between 1710 and 70 300 characters, and obtained performances of (FRR = 20.25%, FAR = 4.18%) and (FRR = 93.46%, FAR = 4.84%) when using keystroke dynamics and stylometry separately, respectively. The combination of both types of features yielded an EER of 30%. They concluded that the feature set must be extended and that certain types of punctuation may not necessarily represent the style of students when taking online exams.
Chen and Hao investigated authorship similarity for e-mails [21]. The proposed feature set included 40 lexical, 76 syntactic, 25 content-specific, and 9 structural features. Experimental evaluation involving 40 authors from the Enron dataset yielded 84% and 89% classification accuracy rates for 10 and 15 short e-mails, respectively.

3. Proposed approach

This section summarizes previous work and presents the proposed approach by discussing feature selection and describing the classification model.

3.1. Previous work

In previous work, the authors investigated the possibility of using stylometry for authorship verification of short online messages [22]. The technique was based on a combination of supervised learning and n-gram analysis. The evaluation used a real-life dataset from Enron, where the e-mails were combined to produce a single long message per individual, and then divided into smaller blocks used for authorship verification. The experimental evaluation yielded an EER of 14.35% for 87 users with message blocks of 500 characters.

In an earlier version of the current paper, presented at the 28th IEEE International Conference on Advanced Information Networking and Applications (AINA-2014), the feature set was expanded and the Information Gain (IG) metric was used to rank the best features. In addition, SVM was used for classification [23], and a dataset based on the Enron e-mail corpus was used for experimental evaluation. In the current paper, an extra filter named Mutual Information (MI) is added to the feature selection process in order to discard highly correlated features. In addition, the authors investigate as classification technique not only SVM, but also a hybrid method combining SVM with LR (SVM-LR). Finally, the proposed approach is evaluated using shorter messages (i.e., micro messages), consisting of blocks of text of 140, 280, and 500 characters from two different datasets, one based on Twitter feeds and another based on Enron e-mails.

3.2. Approach overview

In the proposed approach, an online document is decomposed into consecutive blocks of short text over which (continuous) authentication decisions happen. For each block of text, a feature vector is extracted based on all features. In this study, an initial set of basic features is selected by combining lexical character, lexical word, syntactic, and application-specific characteristics; as advanced features, a set of new features is extracted through n-gram analysis. The feature values are normalized to the range between 0 and 1 using a maximum normalization scheme, in which a given feature value is replaced by its ratio over the maximum value of the same feature over the training set. In order to reduce the large feature space, information gain and mutual information techniques are combined for feature selection.

The classification model consists of a collection of profiles generated separately for individual users. The proposed system operates in two modes: enrollment and verification. Based on sample training data, the enrollment process computes the behavioral profile of the user using machine learning classification. The proposed system addresses authorship verification as a two-class classification problem. The first class is composed of (positive) samples from the author, whereas the second (negative) class is composed of samples from other authors. Thereby, the negative class has more samples than the positive class, generating an imbalanced class distribution. The approach used to deal with this situation is to assign a weight P to the negative class corresponding to the ratio between the total number of positive samples and the total number of negative samples, as sketched below.
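As an illustration of the normalization and weighting steps, the following minimal sketch uses toy data and scikit-learn's SVC rather than the WEKA SMO learner used in this work; all names and values are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: 50 positive blocks (the author, label 1) and 500 negative blocks
# (impostors, label 0), mimicking the imbalanced two-class setting.
rng = np.random.default_rng(0)
X = rng.random((550, 972))           # 972 stylometric features per block (cf. Table 4)
y = np.array([1] * 50 + [0] * 500)

# Maximum normalization: replace each value by its ratio over the feature's
# maximum in the training set, so all features range between 0 and 1.
col_max = X.max(axis=0)
col_max[col_max == 0] = 1.0          # guard against constant-zero features
X_norm = X / col_max

# Weight P assigned to the negative class: ratio of positive to negative samples.
P = (y == 1).sum() / (y == 0).sum()
clf = SVC(kernel="linear", class_weight={0: P, 1: 1.0})
clf.fit(X_norm, y)
```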
3.3. Initial features

Many linguistic features have been suggested for authorship verification, for instance the choice of particular words and syntactic structures [24]. Unlike topic-based text categorization, whose central point is a "bag of content words", the set of features here was expanded by combining lexical, syntactic, and application-specific features. Such a combination better expresses the author's style. The list of all the features used in this work is shown in Table 4 in Appendix A. The features are briefly discussed in the following.

Lexical features consist of a set of lexical items (words or characters) extracted from a text [25]. In terms of characters, the set of features includes the frequencies of character classes such as upper case, lower case, vowels, white space, digits, and special characters [26]. New features corresponding to 5-grams and 6-grams were derived. An n-gram is a token formed by a contiguous sequence of characters or words, and has been shown to be tolerant to typos [27,28,6,29–31]. The approach used to calculate these new features is discussed in Subsection 3.4.

Another stylistic marker is the writer's mood expressed in the form of icons and symbols [32]. The icons are divided into three groups, and their averages are calculated. The first group contains 126 text-based icons (e.g., ":-)", ":o)") subdivided into 38 different categories (e.g., smiley, laughing, very happy, frown, angry, crying, etc.). The second group contains 80 emoticons based on Unicode characters, with code points ranging from U+1F600 to U+1F64F. The last group contains 256 miscellaneous symbols, with code points ranging from U+2600 to U+26FF.
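The Unicode-based icon features can be gathered with a simple code-point check; the sketch below is illustrative, not the extraction code used in this work:

```python
def icon_counts(text: str) -> dict:
    """Count Unicode emoticons (U+1F600 to U+1F64F) and miscellaneous
    symbols (U+2600 to U+26FF) in a block of text."""
    emoticons = sum(1 for ch in text if 0x1F600 <= ord(ch) <= 0x1F64F)
    misc = sum(1 for ch in text if 0x2600 <= ord(ch) <= 0x26FF)
    return {"emoticons": emoticons, "misc_symbols": misc}

print(icon_counts("Nice day \U0001F600 \u2600"))  # {'emoticons': 1, 'misc_symbols': 1}
```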
In terms of lexical words, the following features are extracted for each sample: the number of words, the average sentence length in terms of words, the frequencies of short words (1–3 characters) and long words (more than 6 characters), the average word length, the average number of syllables per word, the ratio of characters in words, and the replaced words [29,5]. Although earlier studies used between 100 and 1000 frequent words to determine the author of a document [33–35], in this work only the fifty most frequently used words per author are selected, since each sample text has few words. Vocabulary richness is measured by quantifying the numbers of hapax legomena and dis legomena, which refer to words occurring only once or twice in a text, respectively [36,37,26].

Syntactic features capture the structure of a sentence in terms of punctuation and part of speech (POS) [38–40]. Punctuation plays an important role in defining boundaries and identifying meaning (quotation, exclamation, etc.) by splitting a paragraph into sentences and each sentence into various tokens [10]. The set of basic punctuation marks includes single quotes, commas, periods, colons, semi-colons, question marks, and exclamation marks. In addition, a set of 112 uncommon punctuation symbols (e.g., †, ‥, …) was created, based on the Unicode general punctuation block (code points U+2000 to U+206F). Function words are topic-independent and capture the author's style across different subjects. POS tagging consists of categorizing a word according to its function in context; words can be classified as verbs, nouns, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections [20,16,41,19].

Application-specific features include characteristics related to the organization and format of a text [37,26,4,41,21]. Since the proposed approach involves analyzing short messages (from e-mails and Twitter posts), only features related to the paragraph structure are extracted. These features include the number of sentences per block of text; the average numbers of characters, words, and sentences in a block of text; and the average number of sentences beginning with upper and lower case.

3.4. N-gram model

In the proposed approach, feature extraction is performed in two steps. During the first step, the frequencies and averages of the lexical, syntactic, and application-specific features are computed. In the second step, the character n-grams are calculated. The n-gram model includes not only all unique n-grams, but also all n-grams with frequency equal to or higher than some number f. In addition, two different modes of calculation for the n-grams are considered. Let m denote a binary variable representing the mode of calculation of the n-grams: m = 0 if the calculation is based only on unique n-grams, and m = 1 otherwise. The training data of a given user U is divided into two subsets, denoted by T(f)^1_U and T^2_U. Let r_U(b) denote the similarity between a sample data block b and the profile of user U. The similarity r_U(b) is defined as the percentage of unique n-grams shared by block b and the training set T(f)^1_U, calculated as follows:¹
$$r_U(b) = \frac{|N_m(b) \cap N(T(f)^1_U)|}{|N_m(b)|} \qquad (1)$$
where N(T(f)^1_U) denotes the set of all unique n-grams occurring in T(f)^1_U with frequency at least f, and N_m(b) denotes the set of all unique n-grams occurring in b (for m = 0) or the collection of all n-grams occurring in b (for m = 1). Let d_U(b) denote a binary similarity metric, referred to as the decision, which captures the closeness of a block b to the profile of user U, and is defined as follows:
$$d_U(b) = \begin{cases} 1 & \text{if } r_U(b) \geq \theta_U \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
where θ_U is a user-specific threshold derived from the training data. The value of θ_U for user U is derived using the supervised learning technique outlined in Algorithm 1. Given a user U, the training subset T^2_U is divided into p blocks of characters of equal size: b^1_U, ..., b^p_U. The proposed model approximates the actual (but unknown) distribution of the ratios r_U(b^1_U), ..., r_U(b^p_U) (extracted from T^2_U) by computing the sample mean μ_U and the sample variance σ²_U during training. In the algorithm, the threshold is initialized as θ_U = μ_U − σ_U/2 and then varied incrementally by minimizing the difference between the FRR and FAR values for the user, the goal being to obtain an operating point that is as close as possible to the EER.
¹ |X| denotes the cardinality of set X.
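A minimal sketch of the profile construction and of the similarity of Eq. (1) follows; function names and the toy strings are illustrative assumptions:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> list:
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile_ngrams(training_text: str, n: int, f: int = 1) -> set:
    """N(T(f)1_U): unique n-grams occurring at least f times in the
    training subset T1_U."""
    counts = Counter(char_ngrams(training_text, n))
    return {g for g, c in counts.items() if c >= f}

def similarity(block: str, profile: set, n: int, m: int = 0) -> float:
    """r_U(b): fraction of the block's n-grams found in the profile.
    m = 0 counts each distinct n-gram once; m = 1 counts every occurrence."""
    grams = char_ngrams(block, n)
    if m == 0:
        grams = list(set(grams))
    shared = sum(1 for g in grams if g in profile)
    return shared / len(grams) if grams else 0.0

profile = profile_ngrams("the quick brown fox jumps over the lazy dog", n=5)
print(similarity("the quick red fox", profile, n=5))
```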
Input: Training data for U, I_1, ..., I_m
Output: θ_U

/* U: the user for whom the threshold is being calculated */
/* I_1, ..., I_m: a set of other users (I_k ≠ U) */

begin
    up ← false; down ← false;
    δ ← 1;
    θ_U ← μ_U − (σ_U / 2);
    while δ > 0.0001 do
        /* calculate FAR and FRR for user U */
        FRR_U, FAR_U ← calculate(U, I_1, ..., I_m, θ_U, γ);
        /* minimize the difference between FAR and FRR */
        if (FRR_U − FAR_U) > 0 then
            down ← true;
            θ_U ← θ_U − δ;
        else if (FRR_U − FAR_U) < 0 then
            up ← true;
            θ_U ← θ_U + δ;
        else
            return θ_U;
        end
        if (up and down) then
            up ← false; down ← false;
            δ ← δ / 10;
        end
    end
    return θ_U;
end

Algorithm 1: Threshold calculation for a given user.
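A direct Python transcription of Algorithm 1 is sketched below; `calculate` is left as a stub standing for the FRR/FAR evaluation against the impostors I_1, ..., I_m, and the γ parameter is omitted:

```python
def find_threshold(calculate, mu_U, sigma_U, eps=1e-4):
    """Search for the user-specific threshold theta_U balancing FRR and FAR.
    `calculate(theta)` must return (FRR, FAR) for user U at threshold theta."""
    up = down = False
    delta = 1.0
    theta = mu_U - sigma_U / 2           # initialization, as in Algorithm 1
    while delta > eps:
        frr, far = calculate(theta)
        if frr - far > 0:                # too many false rejections: lower theta
            down = True
            theta -= delta
        elif frr - far < 0:              # too many false acceptances: raise theta
            up = True
            theta += delta
        else:
            return theta                 # FRR == FAR: operating point found
        if up and down:                  # overshot in both directions: refine step
            up = down = False
            delta /= 10
    return theta
```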
3.5. Feature selection

Feature selection for a CA system consists of identifying and keeping only the most discriminating features for each individual user. This reduces the data size by removing irrelevant attributes and improves the processing time for training and classification. The worth of an attribute is evaluated by applying a ranking strategy based on Information Gain (IG), and highly correlated features are then discarded based on a Mutual Information (MI) strategy. Features with very little or no predictive information, as well as highly correlated features, are identified and removed.

Let X and Y denote two random variables. The information entropy of X, denoted by H(X), is defined by:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \qquad (3)$$
where p(x_i) denotes the probability mass function of X. Let H(X|Y) denote the conditional entropy of X after the observation of Y. H(X|Y) is defined as:
$$H(X \mid Y) = -\sum_{i=1}^{N} \sum_{j=1}^{M} p(x_i, y_j) \log_2 p(x_i \mid y_j) \qquad (4)$$
Suppose that the dataset is composed of two classes (positive and negative). The IG of a feature Attr with respect to the class is computed as follows:
$$\mathit{IG}(\mathit{Class}, \mathit{Attr}) = H(\mathit{Class}) - H(\mathit{Class} \mid \mathit{Attr}) \qquad (5)$$
Given two features Attr_a and Attr_b, the MI is calculated as follows:
$$\mathit{MI}(\mathit{Attr}_a, \mathit{Attr}_b) = H(\mathit{Attr}_a) - H(\mathit{Attr}_a \mid \mathit{Attr}_b) \qquad (6)$$
Prior to computing IG and MI, it is necessary to discretize the numeric feature values into binary values (0 and 1). The discretization process consists of finding a cut-point or split-point that divides the range into two intervals, one containing the values less than or equal to the cut-point and the other the values greater than it [42]. The entropy-based discretization method proposed by Fayyad and Irani [43] is used in this work. This is a supervised discretization method, which is known to achieve some of the best performances in the literature. For the purpose of feature selection, only features with non-zero information gain are retained.

3.6. Classification method

Two different classifiers are investigated in this work: SVM and a hybrid of SVM and LR. These classifiers are briefly described in the following.
Fig. 1. Decision boundary separating two classes; samples on the margin are called the support vectors.

Table 1
Kernel functions.

Kernel type     Inner product kernel
Linear          K(x, y) = x · y
Polynomial      K(x, y) = (x · y + 1)^p
Gaussian        K(x, y) = e^(−‖x − y‖² / 2γ²)
3.6.1. Support vector machine

SVM is a binary classifier originally proposed by Vapnik [44]. SVM is based on the idea of mapping the original finite-dimensional space X into a much higher-dimensional space F and building a hyperplane separating the points of the two classes (positive and negative). The hyperplane that divides the two classes with the largest minimum distance to the training examples is called the optimal (maximum-margin) hyperplane, and it defines the decision boundary, as illustrated in Fig. 1. The instances that lie closest to the hyperplane are called the support vectors, and training an SVM consists of identifying the support vectors within the training samples. Assume that s_i ∈ X are the support vectors, each associated with a class label y_i ∈ {+1, −1} (for positive and negative examples, respectively). Given an unlabeled sample x ∈ X, classification consists of predicting the corresponding label. This is performed using a decision function as follows:
$$f(x) = \sum_{i} y_i \alpha_i K(x, s_i) + b \qquad (7)$$
where the α_i are Lagrange multipliers, and K is a kernel function that measures the similarity or distance between the unlabeled sample x and the support vector s_i. The kernel function K(x, s_i) maps the sample space X into a high-dimensional feature space F. Examples of kernel functions include linear, polynomial, and Gaussian kernels [44]; Table 1 shows their definitions.
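The decision function of Eq. (7) can be sketched directly; this is an illustration with a linear kernel, and all values are toy assumptions:

```python
import numpy as np

def linear_kernel(x, s):
    return float(np.dot(x, s))

def svm_decision(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(x) = sum_i y_i * alpha_i * K(x, s_i) + b, Eq. (7)."""
    return sum(y_i * a_i * kernel(x, s_i)
               for s_i, y_i, a_i in zip(support_vectors, labels, alphas)) + b

# Two toy support vectors on either side of the boundary.
sv = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
y = [+1, -1]
a = [0.5, 0.5]
print(svm_decision(np.array([0.8, 0.9]), sv, y, a, b=0.0))  # > 0: positive class
```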
3.6.2. Hybrid SVM-LR classifier

Although SVM is a non-probabilistic classifier, probability estimates can be obtained by integrating SVM with logistic regression into a more robust classifier [45,46]. The output f(x) of the SVM is submitted to the logistic function:

$$P(x) = \frac{1}{1 + e^{-f(x)}} \qquad (8)$$

where the output P(x) always lies between zero and one (0 ≤ P(x) ≤ 1).
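A minimal sketch of the hybrid scheme: train an SVM, then pass its margin output f(x) through the logistic function of Eq. (8) to obtain a score in (0, 1). scikit-learn's SVC and toy Gaussian data are used for illustration; the work itself relies on WEKA's SMO:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1, 1, (50, 5)), rng.normal(-1, 1, (50, 5))])
y = np.array([1] * 50 + [0] * 50)

svm = SVC(kernel="linear").fit(X, y)
f_x = svm.decision_function(X[:3])    # SVM output f(x) for three samples
p_x = 1.0 / (1.0 + np.exp(-f_x))      # logistic function, Eq. (8)
print(p_x)                            # probability-like scores in (0, 1)
```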
4. Experimental evaluation

4.1. Dataset

Two different datasets are used in the experiments: one corpus of e-mails and another of micro messages. The first corpus is the Enron e-mail dataset.² Enron was an energy company (located in Houston, Texas) that went bankrupt in 2001 due to white-collar fraud.
² Available at http://www.cs.cmu.edu/~enron/.
The e-mails of Enron's employees were made public by the Federal Energy Regulatory Commission during the fraud investigation. The e-mail dataset contains more than 200 thousand messages from about 150 users. The average number of words per e-mail is 200. The e-mails are plain text and cover various topics, ranging from business communications to technical reports and personal chats.

The second corpus consists of micro messages obtained from Twitter.³ Twitter is a microblogging service that allows authors to post messages of up to 140 characters. Registered users can read and post tweets, reply to a tweet, send private messages, and re-tweet a message, while unregistered users can only read them. Also, a registered user can follow and be followed by other users. A novel Twitter dataset was developed as part of this research, based on the lists of the UK's most influential tweeters compiled by Ian Burrell of The Independent newspaper.⁴ His methodology for choosing the people included help from the social media monitoring group PeerIndex, with additional input from a panel of experts. The Twitter accounts of 100 authors randomly selected from the 2011⁵ and 2012⁶ lists were crawled. The crawled data comprises all tweets posted up to November 6th, 2013 (inclusive). The dataset contains on average 3194 Twitter messages, with 301 100 characters, per author.

4.2. Data preprocessing

In the e-mail corpus, e-mails from the folders "sent" and "sent items" within each user's folder were selected; all duplicate e-mails were removed. The JavaMail API was used to parse each e-mail to extract the body of the message and remove reply texts when present. All e-mails containing tables of numbers, identified by a ratio of digits to total characters higher than 25%, were removed. E-mail and web addresses were replaced by the meta tags "e-mail" and "http", respectively.

In the micro message corpus, only the content of the "text" field of the JSON structure⁷ was used in the experiments, since this content characterizes the authorship of a message. Additional preprocessing consisted of removing all non-English messages, all re-tweet (RT) posts, all duplicated tweets, and all messages containing one or more of the following Unicode blocks: Arabic, Cyrillic, Devanagari, Hangul syllables, Bengali, Hebrew, Malayalam, Greek, Hiragana, Cherokee, CJK unified ideographs. Hashtags such as "#word" were replaced, together with the following word, by the meta tag "#hash", and @user references were replaced by the meta tag "@cite".

A canonicalization filter was applied in order to represent the concept rather than the actual value [47,48]. As a result, in both datasets the following replacements were performed: phone numbers by the meta tag "phone", currency by the meta tag "$XX", percentages by the meta tag "XX%", dates by the meta tag "date", hours by the meta tag "time", numbers by the meta tag "numb", and information between tags by the meta tag "TAG". A sketch of this step is given below. The next preprocessing steps included normalizing the document to printable ASCII, converting all characters to lower case, and normalizing white space. Finally, all messages of each author were grouped, creating a long text or stream of characters that was divided into blocks.

After the preprocessing phase, the Enron dataset was reduced from 150 authors to 76 authors, to ensure that only users with 50 instances of 500 characters each were involved in the analysis. The number of users in the Twitter dataset remained 100.
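The canonicalization replacements can be approximated with regular expressions; the patterns below are simplified illustrations, not the exact filters used:

```python
import re

def canonicalize(text: str) -> str:
    """Replace concrete values by meta tags and lower-case the text."""
    text = re.sub(r"\S+@\S+\.\S+", "e-mail", text)   # e-mail addresses
    text = re.sub(r"https?://\S+", "http", text)     # web addresses
    text = re.sub(r"#\w+", "#hash", text)            # hashtags
    text = re.sub(r"@\w+", "@cite", text)            # user references
    text = re.sub(r"\$\d[\d,.]*", "$XX", text)       # currency
    text = re.sub(r"\d+%", "XX%", text)              # percentages
    text = re.sub(r"\d+", "numb", text)              # remaining numbers
    return text.lower()

print(canonicalize("Ask @bob about http://example.com: 20% off, $5 #deal"))
```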
4.3. Evaluation method

The proposed approach was implemented in Java and uses a popular machine learning toolkit named WEKA (Waikato Environment for Knowledge Analysis)⁸ for data analysis [49]. The SVM learner called Sequential Minimal Optimization (SMO) was used in WEKA [50,51]. For each user U, a reference profile was generated based on a training set consisting of samples from the user (i.e., positive samples) and samples from other users (i.e., negative samples) considered as impostors.

Experimental evaluation was conducted using a 10-fold cross-validation methodology. The dataset was randomly sorted, and in each (validation) round 90% of the dataset was allocated for training and the remaining 10% for testing. To reduce variability, ten rounds of cross-validation were performed using different partitions of the dataset, and the validation results were averaged over the rounds. False rejection (FR) was tested by comparing the test samples of each user U against his own profile. The FRR was obtained as the ratio between the number of false rejections and the total number of trials. False acceptance (FA) was tested by comparing, for each user U, all the negative test samples against his profile. The FAR was obtained as the ratio between the number of false acceptances and the total number of trials. The overall FRR and FAR were obtained by averaging the individual measures over the entire user population. Finally, the EER was determined by identifying the operating point where FRR and FAR have the same value.
³ Available at http://www.uvic.ca/engineering/ece/isot/datasets/.
⁴ http://www.independent.co.uk.
⁵ Available at http://www.independent.co.uk/news/people/news/the-full-list-the-twitter-100-2215529.html.
⁶ Available at http://www.independent.co.uk/news/people/news/the-twitter-100-the-full-ataglance-list-7467920.html.
⁷ JSON (JavaScript Object Notation) is a data-interchange language.
⁸ Available at http://weka.wikispaces.com.
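The error rates of Section 4.3 can be sketched as follows; this is an illustrative helper over toy similarity scores, not the evaluation harness used in this work:

```python
import numpy as np

def error_rates(genuine, impostor, threshold):
    """FRR: genuine trials rejected; FAR: impostor trials accepted."""
    frr = float(np.mean(np.asarray(genuine) < threshold))
    far = float(np.mean(np.asarray(impostor) >= threshold))
    return frr, far

def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds; return the rate where FRR and FAR meet."""
    candidates = np.unique(np.concatenate([genuine, impostor]))
    best = min(candidates,
               key=lambda t: abs(np.subtract(*error_rates(genuine, impostor, t))))
    frr, far = error_rates(genuine, impostor, best)
    return (frr + far) / 2

genuine = np.array([0.9, 0.8, 0.85, 0.6])
impostor = np.array([0.3, 0.5, 0.65, 0.2])
print(equal_error_rate(genuine, impostor))
```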
Fig. 2. Receiver operating characteristic curve for the experiment and sample performance values for different weights.
Table 2
EER obtained by varying the type of SVM kernel. Comparison among different SVM kernels based on the Twitter dataset involving 100 authors, with a block size of 280 characters and 100 blocks per user. In these experiments, only Information Gain was used as the feature selection metric.

SVM kernel               EER (%)
Linear (SVM)             18.86
Linear (SVM-LR)          15.34
Polynomial 3 (SVM-LR)    19.20
Polynomial 5 (SVM-LR)    28.47
Gaussian (SVM-LR)        46.01
4.4. Results

Three different sets of experiments were performed, varying the weight assigned to the negative samples (denoted by weight(P)), the SVM kernel, and the feature selection technique. The obtained results are presented in the following.

4.4.1. Varying the weight

In order to test the effect of the weight weight(P), a set of experiments was conducted using the Enron dataset, involving 76 authors, a block size of 500 characters, and 50 blocks per user. In these experiments, the SVM linear kernel was used and feature selection was performed using information gain. Fig. 2 shows the receiver operating characteristic (ROC) curve for the experiment. The curve shows the relation between the FAR and FRR when varying weight(P) from 0 to 100. The optimal performance was achieved when setting weight(P) to 10, with a FAR of 12.49% and an FRR of 12.34%; the EER was calculated as 12.42%. Subsequent experiments used weight(P) set to 10.

4.4.2. Varying the kernel

In order to test the effect of the SVM kernel, a set of experiments was conducted using the Twitter dataset, involving 100 authors, a block size of 280 characters, and 100 blocks per user. In these experiments, only the information gain metric was used for feature selection. The first experiment used a linear kernel, yielding EERs of 18.86% and 15.34% for pure SVM and hybrid SVM-LR, respectively. Subsequent experiments focused only on the hybrid SVM-LR classifier, since it outperforms pure SVM. Experiments using polynomial kernels of degree 3 and degree 5, as well as a Gaussian kernel, yielded (for the hybrid classifier) EERs varying from 19.20% to 46.01%, as shown in Table 2. From these results, it can be concluded that the boundary separating the positive from the negative data is linear. Therefore, the subsequent experiments were run with the linear kernel only.
Table 3
Authorship verification using the Twitter dataset. EER for hybrid SVM-LR using the Twitter dataset involving 100 authors, varying the block size and the number of blocks per author. Feature selection was performed using the Information Gain and Mutual Information metrics.

Block size (characters)   Blocks per user   SVM-LR EER (%)
140                       100               21.45
140                       200               18.37
280                       50                17.83
280                       100               13.27
4.4.3. Varying the feature selection

This set of experiments extends the feature selection by adding the mutual information selection approach and varying the block size and the number of blocks per user. Table 3 shows the results based on the Twitter dataset involving 100 authors. With a block size of 140 characters and 100 blocks per user, an EER of 21.45% was obtained using the hybrid SVM-LR classifier. Increasing the training set and the block size affects the accuracy. For instance, increasing the number of blocks per user to 200 yielded an EER of 18.37%. Likewise, using a block size of 280 characters with 50 blocks per user yielded an EER of 17.83%, and with 100 blocks per user an EER of 13.27%. The last experiment was based on the Enron dataset and involved 76 authors, a block size of 500 characters, and 50 blocks per author; feature selection was performed using the Information Gain and Mutual Information metrics. In this case, an EER of 9.98% was obtained using the hybrid SVM-LR classifier.

5. Discussions

The results presented above are very encouraging and better than those obtained so far by similar work in the literature. Table 5, in Appendix A, summarizes the performances, block sizes, and population sizes of previous studies on authorship verification using stylometry.

A dataset is imbalanced when the negative class has more samples than the positive class. Balanced classification can be achieved by changing the class distribution through under-sampling the majority class or over-sampling the minority class. The proposed approach of weighting the negative class by the ratio between the total number of positive samples and the total number of negative samples is an alternative to resampling an imbalanced dataset. The EER was calculated as 12.42% when weight(P) was set to 10.

We also investigated the impact of different SVM kernels on accuracy. The outcome of this study is that SVM with a linear kernel achieves better EER performance than the polynomial or Gaussian kernels. The obtained results further validate the fact that the prediction accuracy of the SVM classifier can be improved by extending it with LR in a hybrid classifier. Furthermore, the performance results demonstrate that using the Information Gain and Mutual Information metrics in combination for feature selection achieves a significant improvement over the baseline system.

6. Conclusion

In this paper, some important steps are taken toward developing a robust framework for continuous authentication using authorship verification. A critical component of CA is user identity verification. While many studies have employed stylometric techniques for authorship attribution and characterization, fewer studies have focused on verification. The proposed authorship verification approach introduces new features obtained through n-gram analysis. CA was simulated by decomposing an online text into blocks of short text. Block sizes of 500, 280, and 140 characters were investigated. Comprehensive experiments based on two different datasets, involving 76 and 100 different authors, demonstrate that the proposed approach achieves promising results when compared to existing work in the literature.

Future work will investigate another important aspect of CA based on stylometry, which consists of resilience to forgeries [52].
The goal will be to create a forgery dataset and to work on improvements by investigating new machine learning techniques that will not only strengthen the performance of the verification process, but also address issues related to forgery attacks. Furthermore, with the increasing popularity of messenger services such as WhatsApp and Facebook Messenger, as well as Twitter, the threat of spoofing has become a source of increasing concern for users. Future work will investigate how to extend and apply the proposed model as a spoofing countermeasure.

Acknowledgments

This research has been enabled by the use of computing resources provided by WestGrid and Compute/Calcul Canada. The research is funded by a Vanier scholarship from the Natural Sciences and Engineering Research Council of Canada (NSERC) (No. VCGS3 - 424654 - 2012) and a CNPq (Brazil) scholarship (No. 237628/2012-0).
Appendix A
Table 4
List of stylometric features used in this work.

Feature             Characteristics

Lexical (character)
F1                  Number of characters (C)
F2                  Number of lower-case characters/C
F3                  Number of upper-case characters/C
F4                  Number of white-space characters/C
F5                  Number of vowels (V)/C
F6...F10            Vowels (a, e, i, o, u)/V
F11...F36           Alphabets (A–Z)/C
F37                 Number of special characters (S)/C
F38...F50           Special characters (e.g. '@', '#', '$', '%', '(', ')', '{', '}', etc.)/S
F51...F66           Character 5- and 6-grams (r_U(b) and d_U(b)) with two values of the frequency f (f = 1 and f = 2) and of the n-gram calculation mode (m = 0 and m = 1)
F67...F192          Text-based icons (38 categories)
F193...F272         Unicode emoticons
F273...F528         Unicode miscellaneous symbols

Lexical (word)
F529                Number of words (N)
F530...F539         Average sentence length in terms of words/N
F540                Number of long words (more than 6 characters)/N
F541                Number of short words (1–3 characters)/N
F542                Average word length
F543                Average number of syllables per word
F544                Ratio of characters in words to N
F545...F550         Replaced words/N
F551...F600         The 50 most frequent words per author
F601...F602         Hapax legomena and dis legomena
F603                Vocabulary richness (number of unique words/N)

Syntactic
F604                Number of punctuation marks (P)
F605...F612         Punctuation marks (single quotes, commas, periods, colons, semi-colons, question marks, exclamation marks)/P
F613...F724         Unicode general punctuation
F725...F729         Number of function words (conjunctions, determiners, prepositions, interjections, and pronouns)/N
F730...F965         Relative frequency of function words

Application-specific
F966                Number of sentences
F967                Number of paragraphs
F968...F970         Average number of characters, words, and sentences in a block of text
F971...F972         Average number of sentences beginning with upper case and lower case
Table 5
Comparative performances, block sizes, and population sizes for stylometry studies.

Category      Ref.     Sample size  Block size      Number of features           Technique                                         Accuracy* (%)  EER (%)
Verification  [5]      40           1710–70 300 ch  L(62), Sy(20)                k-NN                                              –              30
              [21]     25–40**      30–50 w         L(40), Sy(76), Se(25), A(9)  SVM                                               83.90–88.31    –
              [31]     8            628–1342 w      L(100K), Sy(900K)            Weighted Probability Distribution Voting (WPDV)   –              3
              [10]     8            628–1342 w      L(292)                       –                                                 –              17.1–22.4
              [6]      10           500 w           L(250)                       SVM                                               95.70          –
              [53]     29           2400 w          L(40)                        Linear Discriminant Analysis (LDA)                –              22
              [22]     87**         250–500 ch      L(n-gram)                    Supervised learning                               –              18.90–14.35
              Current  76**         500 ch          L(91), Sy(251), A(7)         SVM and SVM-LR                                    –              12.42–9.98
              Current  100***       140–280 ch      L(537+), Sy(362+), A(7)      SVM-LR                                            –              21.45–13.27

(L) = Lexical, (Sy) = Syntactic, (Se) = Semantic, (A) = Application, (w) = word, (ch) = character.
* Accuracy is measured by the percentage of correctly matched authors in the testing set.
** Used the Enron dataset for testing.
*** Used the Twitter dataset for testing.
References

[1] I. Traore, A.A.E. Ahmed (Eds.), Continuous Authentication Using Biometrics: Data, Models, and Metrics, IGI Global, 2012.
[2] J. Li, R. Zheng, H. Chen, From fingerprint to writeprint, Commun. ACM 49 (2006) 76–82.
[3] J.L. Hilton, On Verifying Wordprint Studies: Book of Mormon Authorship, Foundation for Ancient Research and Mormon Studies, 1991, Reprint.
[4] A. Abbasi, H. Chen, Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace, ACM Trans. Inf. Syst. 26 (2008) 1–29.
[5] O. Canales, V. Monaco, T. Murphy, E. Zych, J. Stewart, C.T.A. Castro, O. Sotoye, L. Torres, G. Truley, A Stylometry System for Authenticating Students Taking Online Tests, CSIS, Pace University, 2011.
[6] M. Koppel, J. Schler, Authorship verification as a one-class classification problem, in: Proceedings of the 21st International Conference on Machine Learning, ICML'04, Banff, Alberta, Canada, ACM, 2004, pp. 62–69.
[7] C. Sanderson, S. Guenter, Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation, in: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP'06, Stroudsburg, PA, USA, Association for Computational Linguistics, 2006, pp. 482–491.
[8] A. Abbasi, H. Chen, Applying authorship analysis to extremist-group web forum messages, IEEE Intell. Syst. 20 (2005) 67–75.
[9] F. Iqbal, R. Hadjidj, B.C. Fung, M. Debbabi, A novel approach of mining write-prints for authorship attribution in e-mail forensics, Digit. Investig. 5 (2008) S42–S51.
[10] F. Iqbal, L.A. Khan, B.C.M. Fung, M. Debbabi, E-mail authorship verification for forensic investigation, in: Proceedings of the 2010 ACM Symposium on Applied Computing, SAC'10, New York, NY, USA, ACM, 2010, pp. 1591–1598.
[11] F. Iqbal, H. Binsalleeh, B.C. Fung, M. Debbabi, A unified data mining solution for authorship analysis in anonymous textual communications, Inf. Sci. 231 (2013) 98–112.
[12] C.E. Chaski, Who's at the keyboard: authorship attribution in digital evidence investigations, Int. J. Digit. Evid. 4 (1) (2005) 1–13.
[13] F. Mosteller, D.L. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, 1964.
[14] J. Burrows, Delta: a measure of stylistic difference and a guide to likely authorship, Lit. Linguist. Comput. 17 (3) (2002) 267–287.
[15] E. Backer, P. van Kranenburg, On musical stylometry: a pattern recognition approach, Pattern Recognit. Lett. 26 (3) (2005) 299–309.
[16] Y. Zhao, J. Zobel, Searching with style: authorship attribution in classic literature, in: Proceedings of the Thirtieth Australasian Conference on Computer Science, Vol. 62, ACSC'07, Darlinghurst, Australia, Australian Computer Society, Inc., 2007, pp. 59–68.
[17] R. Sarawgi, K. Gajulapalli, Y. Choi, Gender attribution: tracing stylometric evidence beyond topic and genre, in: Proceedings of the 15th Conference on Computational Natural Language Learning, CoNLL'11, Stroudsburg, PA, USA, Association for Computational Linguistics, 2011, pp. 78–86.
[18] N. Cheng, X. Chen, R. Chandramouli, K. Subbalakshmi, Gender identification from e-mails, in: IEEE Symposium on Computational Intelligence and Data Mining, CIDM'09, 2009, pp. 154–158.
[19] N. Cheng, R. Chandramouli, K. Subbalakshmi, Author gender identification from text, Digit. Investig. 8 (1) (2011) 78–88.
[20] P. Juola, R.H. Baayen, A controlled-corpus experiment in authorship identification by cross-entropy, Lit. Linguist. Comput. 20 (Suppl) (2005) 59–67.
[21] X. Chen, P. Hao, R. Chandramouli, K.P. Subbalakshmi, Authorship similarity detection from email messages, in: Proceedings of the 7th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM'11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 375–386.
[22] M.L. Brocardo, I. Traore, S. Saad, I. Woungang, Authorship verification for short messages using stylometry, in: Proceedings of the International Conference on Computer, Information and Telecommunication Systems, CITS, Piraeus-Athens, Greece, 2013, pp. 1–6.
[23] M.L. Brocardo, I. Traore, I. Woungang, Toward a framework for continuous authentication using stylometry, in: Proceedings of the 28th IEEE International Conference on Advanced Information Networking and Applications, AINA-2014, Victoria, Canada, IEEE, 2014, pp. 106–115.
[24] S. Argamon, C. Whitelaw, P. Chase, S.R. Hota, N. Garg, S. Levitan, Stylistic text classification using functional lexical features, J. Am. Soc. Inf. Sci. Technol. 58 (2007) 802–822.
[25] S.M. Alzahrani, N. Salim, A. Abraham, Understanding plagiarism linguistic patterns, textual features, and detection methods, IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 42 (2) (2012) 133–149.
[26] R. Zheng, J. Li, H. Chen, Z. Huang, A framework for authorship identification of online messages: writing-style features and classification techniques, J. Am. Soc. Inf. Sci. Technol. 57 (2006) 378–393.
[27] B. Kjell, W. Woods, O. Frieder, Discrimination of authorship using visualization, Inf. Process. Manag. 30 (1) (1994) 141–150.
[28] F. Peng, D. Schuurmans, S. Wang, V. Keselj, Language independent authorship attribution using character level language models, in: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1, EACL'03, Stroudsburg, PA, USA, 2003, pp. 267–274.
[29] E. Stamatatos, A survey of modern authorship attribution methods, J. Am. Soc. Inf. Sci. Technol. 60 (2009) 538–556.
[30] P. Juola, Authorship attribution for electronic documents, in: Advances in Digital Forensics II, IFIP Advances in Information and Communication, vol. 222, Springer, New York, 2006, pp. 119–130.
[31] H.V. Halteren, Author verification by linguistic profiling: an exploration of the parameter space, ACM Trans. Speech Lang. Process. 4 (2007) 1–17.
[32] A. Orebaugh, J. Allnutt, Classification of instant messaging communications for forensics analysis, Int. J. Forensic Comput. Sci. 1 (2009) 22–28.
[33] J.F. Burrows, Word patterns and story shapes: the statistical analysis of narrative style, Lit. Linguist. Comput. 2 (1) (1987) 61–70.
[34] D.I. Holmes, The evolution of stylometry in humanities scholarship, Lit. Linguist. Comput. 13 (3) (1998) 111–117.
[35] N. Homem, J. Carvalho, Authorship identification and author fuzzy fingerprints, in: Annual Meeting of the North American Fuzzy Information Processing Society, NAFIPS, 2011, pp. 1–6.
[36] F.J. Tweedie, R.H. Baayen, How variable may a constant be? Measures of lexical richness in perspective, Comput. Humanit. 32 (5) (1998) 323–352.
[37] O. de Vel, A. Anderson, M. Corney, G. Mohay, Mining e-mail content for author identification forensics, SIGMOD Rec. 30 (4) (2001) 55–64.
[38] H. Baayen, H. van Halteren, F. Tweedie, Outside the cave of shadows: using syntactic annotation to enhance authorship attribution, Lit. Linguist. Comput. 11 (3) (1996) 121–132.
[39] M. Koppel, J. Schler, Exploiting stylistic idiosyncrasies for authorship attribution, in: Workshop on Computational Approaches to Style Analysis and Synthesis, IJCAI'03, Acapulco, Mexico, 2003, pp. 69–72.
[40] S. Argamon, S. Marin, S.S. Stein, Style mining of electronic messages for multiple authorship discrimination: first results, in: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'03, New York, NY, USA, ACM, 2003, pp. 475–480.
[41] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer, D. Benredjem, Towards an integrated e-mail forensic analysis framework, Digit. Investig. 5 (3–4) (2009) 124–137.
[42] S. Kotsiantis, D. Kanellopoulos, Discretization techniques: a recent survey, GESTS Int. Trans. Comput. Sci. Eng. 32 (1) (2006) 47–58.
[43] U.M. Fayyad, K.B. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: Thirteenth International Joint Conference on Artificial Intelligence, Vol. 2, Morgan Kaufmann Publishers, 1993, pp. 1022–1027.
[44] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer Series in Statistics, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1982.
[45] G. Wahba, X. Lin, F. Gao, D. Xiang, R. Klein, B. Klein, The bias-variance tradeoff and the randomized GACV, in: Advances in Neural Information Processing Systems, MIT Press, 1999, pp. 620–626.
[46] Y.-c.I. Chang, Boosting SVM classifiers with logistic regression, Tech. rep., Institute of Statistical Science, Academia Sinica, Taipei, Taiwan, 2003.
[47] W.-W. Deng, H. Peng, Research on a Naïve Bayesian based short message filtering system, in: International Conference on Machine Learning and Cybernetics, 2006, pp. 1233–1237.
[48] J. Cai, Y. Tang, R. Hu, Spam filter for short messages using winnow, in: International Conference on Advanced Language Processing and Web Information Technology, ALPIT'08, IEEE, 2008, pp. 454–459.
[49] I.H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, S.J. Cunningham, Weka: practical machine learning tools and techniques with Java implementations, in: International Workshop on Emerging Engineering and Connectionist-Based Information Systems, ANNES'99, 1999, pp. 192–196.
[50] N. Landwehr, M. Hall, E. Frank, Logistic model trees, Mach. Learn. 59 (1–2) (2005) 161–205.
[51] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schoelkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods – Support Vector Learning, MIT Press, 1998.
[52] L. Ballard, Biometric authentication revisited: understanding the impact of wolves in sheep's clothing, in: Proceedings of the 15th Annual Usenix Security Symposium, 2006, pp. 29–41.
[53] I. Krsul, E.H. Spafford, Authorship analysis: identifying the author of a program, Comput. Secur. 16 (3) (1997) 233–257.