Alecsa: Attentive Learning for Email Categorization using Structural Aspects

Mostafa Dehghani a,b, Azadeh Shakery a,c,∗, Maryam S. Mirian d

a Intelligent Information Systems Lab, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Iran
b Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, The Netherlands
c School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Iran
d Robotics and Machine Intelligence Group, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Iran

∗ Corresponding author at: Intelligent Information Systems Lab, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Iran. Tel.: 98 2182089722. E-mail addresses: [email protected] (M. Dehghani), [email protected] (A. Shakery), [email protected] (M.S. Mirian).
Article info

Article history: Received 28 February 2015; Revised 8 December 2015; Accepted 23 December 2015; Available online xxx.
Keywords: Email categorization; Email structural aspects; Active decision fusion learning; Reinforcement learning.

Abstract

Due to the enormous volume of email data generated each day, email management has become a vital area of research. Among the email management tasks, automatic email categorization is one of the most interesting problems. However, the dynamic nature of email data makes the email categorization problem difficult to address with traditional machine learning approaches. In this paper, we propose Alecsa, an attentive learning approach for automatic email categorization. Alecsa aims to simulate the dynamic behavior of users as they categorize a new email. For this purpose, the email categorization problem in Alecsa is cast as a decision-making problem, and an attention control framework is employed to dynamically choose a sequence of structural aspects of the email as the distinguishing factors for categorization. We have extensively evaluated and analyzed the proposed approach on the Enron–Bekkerman datasets. The evaluation results indicate the power of Alecsa in modeling the dynamic essence of the email categorization problem, in terms of effectiveness as well as efficiency. © 2016 Elsevier B.V. All rights reserved.
1. Introduction

Email is pervasive across all aspects of people's lives as a strong communication tool. Furthermore, the vast growth of its usage makes email data management an unavoidable task for all users of this service. Several lines of research have emerged with the goal of automating email management. Among the automatic email management problems, email categorization, which is also referred to as email filing or email classification, is one of the most popular research topics. Neustaedter [1] defined email categorization as the process of investigating unhandled emails and deciding how to manage them. Although many research studies have been conducted on the email categorization problem, there are still challenges that have not been resolved. Among these challenges, the most problematic ones arise from the fact that the email system is naturally a dynamic environment [2]. Observing the real behavior of users while they categorize their emails can reveal these challenges. One of the subtle points, when
a user attempts to categorize an incoming email, is that he/she will not necessarily categorize the email based on its topic. In other words, what distinguishes the email folders of a user is not necessarily the topic of the textual content of the emails. For example, one folder might be created because the emails it contains are about the same topic, while another folder might contain emails that are received from a particular person, regardless of their content. To get a better intuition, consider a user who categorizes his/her emails into the following 10 folders:

1. Sport
2. Entertainment
3. Call For Papers (Conferences/Journals)
4. PhD Council
5. SIGIR Mailinglist
6. My Supervisor
7. 2012 _ Summer
8. Machine Learning Course _ Spring 2013
9. Machine Learning Course _ Spring 2014
10. Samira's email _ 2011–2013
As can be observed from the names of the folders, the user who categorizes emails into the above folders does not make the decision based on the content only. An initial guess would be that
the emails in folders 1, 2, and 3 are categorized based on the topics of their content, while the emails associated with folders 4, 5, and 6 are grouped based on their senders, which are either a specific person or a particular group of people. Emails in folder 7 are categorized according to the time they were received. Emails in folders 8 and 9 are archived based on their topics as well as the time they were received. Folder 10 contains emails that are grouped together based on their sender, who is a specific person, and also the time interval in which they were received. This example reveals the type of dynamism that exists in the decision-making process when users try to categorize incoming emails. This dynamism can be linked to the fact that each email has structured content, and can thereby be viewed from different perspectives and aspects. In email categorization, users decide to categorize an email considering one or more of its aspects. In other words, there is a type of heterogeneity in the categorization criteria [3,4]. Thus, in order for an automatic email categorization method to be successful, it should be able to make a dynamic decision about which aspect of the incoming email should be investigated for its categorization.

There are also other types of dynamism in the email environment. For example, since email categorization is a user-oriented task, and users' preferences change over time, automatic email categorization methods should possess adaptive behavior [2,4,5]. Another instance of dynamism in email categorization is the varying number of classes over time. Users sometimes create a new folder for a new incoming email instead of categorizing it into the existing folders. Determining the necessity of creating a new folder for an incoming email is a difficult decision for automatic email classification methods, since the systems have no history for the new folder [2].

While a large body of work has been done on the email categorization problem, none of it delves into dynamically considering the proper structural aspects of emails for categorization. The existing methods usually treat automatic email categorization as a classification problem [5–10]: they first extract features from emails and then, making use of traditional classification methods, classify emails into different folders. In this paper, we define and study the type of dynamism that exists in the email categorization problem and in users' behaviors while categorizing emails. This dynamism is rooted in the fact that each email has structured content and can thereby be viewed from different perspectives and aspects. We then propose a new approach, Alecsa, which aims to improve email categorization by taking the proper structural aspects of emails into account.

Generally speaking, our proposed method, Alecsa, acts as a meta-classifier in which, in the first step, the features that should be considered for classification are determined and then classification is done based on the selected features. To do this, for an incoming email, each of the different structural aspects of the email is represented as a feature set, which is a subset of the whole feature space. For example, for the receipt time of the email and the body text of the email, a time-aware feature set and a content-aware feature set are defined, respectively. Afterwards, a sequence of feature sets is selected for investigation.
In other words, a sequence of classifiers/functions on those feature sets is selected for consultation. On the basis of this selection, a decision about categorizing the email is made. This decision-making procedure is similar to the behavior of users when they want to categorize an email. In the aforementioned example, consider that an email belonging to folder 9 is received. The user probably takes the content of the email (the content-aware feature set) into account; up to this point, folders 1, 2, 3, 8, and 9 can potentially be the label of the incoming email. As a result of considering the content of the email, the user removes folders 1, 2, and 3 from his/her decision space due to their irrelevance to the
topic of the content of the incoming email. Afterwards, the user decides to categorize the email into folder 9 based on the time he/she received it, i.e., using the time-aware feature set. The fact that the user chooses a particular sequence of feature sets at the moment of receiving all feature sets as input data, and actively fuses them to make the decision, evokes the Attention Control problem, which is one of the most attractive problems among cognitive science researchers [11]. In Alecsa, the email categorization problem is represented as an attentive decision-making problem in which Active Decision Fusion Learning (ADFL) is used to model the problem. The modeled problem is then solved using reinforcement learning. Alecsa is not only a powerful approach for effectively imitating the dynamics of user behavior in email categorization, but it also tries to improve the efficiency of automatic email categorization. This is because, along with achieving high accuracy, minimizing the cost is a key concern in the ADFL framework, which is employed as the attentive decision-making framework in Alecsa. Compared to the existing categorization methods, Alecsa can be regarded as a meta-classifier, which provides a policy for dynamic feature selection for each new instance. The selected features are then exploited for classifying the email. In fact, ADFL, which is the core of Alecsa, is a pre-processing layer on top of traditional classifiers [12].

In order to study and investigate the dynamism related to the structural aspects of emails in real data, and also to evaluate the performance of Alecsa, we have annotated the Enron–Bekkerman email datasets [9] with the structural aspects taken into account for categorizing emails in each folder. We have demonstrated that the discussed type of dynamism exists in real data. Moreover, the evaluation results, along with detailed analysis, indicate the power of Alecsa in simulating the dynamic behavior of users in email categorization. The results show that Alecsa significantly outperforms the baselines on the datasets in which the users categorize their emails with a dynamic behavior. It is also demonstrated that, besides its effectiveness in terms of accuracy, Alecsa performs very efficiently in terms of time complexity.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. In Section 3, the ADFL framework is briefly introduced and the modeling of Alecsa using the ADFL framework is explained in detail. Then, in Section 4, the experimental results are presented, supported by several comprehensive discussions. Finally, in Section 5, the paper is concluded and some ideas for future work are discussed.

2. Related work

Several research studies have been conducted on the email categorization problem. A great deal of this research focuses on the problem of categorizing emails into spam and non-spam, which is called spam detection [13,14]. Some studies have also tried to group emails regarding their topic of discussion into email conversation threads [15–17]. However, in this section we focus on reviewing the previous studies addressing the general problem of categorizing emails into folders. Email categorization can be done taking various pieces of information from an email into account, including the text content of the email (i.e., body text and subject), the social network of the participants of the email, the receipt time of the email, and information from the email attachments [18,19].
We can broadly categorize the previous methods into two groups: methods in which only the text content of emails is considered, and methods that exploit other information as well. The methods of the first group assume that emails are categorized based on their topics. Thus, since the text content of an email is the best representative of its topic, it is
considered the most informative feature of email data for categorization. In one of the first research studies on this problem, Cohen [6] presented RIPPER, a rule-based method that uses spotted keywords in email text, and studied its performance compared to traditional IR learning methods. There are also several methods based on traditional classification approaches. For instance, Rennie [7], emphasizing the importance of efficiency in the context of email categorization, presented IFILE, which uses Naïve Bayes to do the classification. IFILE is capable of updating the trained model in the presence of a newly categorized email. Clark et al. [8] proposed a neural network based system for automated email categorization and investigated the impact of different feature selection, weighting, and normalization methods on the performance of their approach. Koprinska et al. [20] demonstrated that the random forest classification method achieves promising results in the task of email categorization as well as spam filtering. Crawford and Koprinska [21] showed that ensemble learning can improve the performance of email categorization. In one of the most important research works on email categorization, Bekkerman et al. [9] introduced a clean subset of the Enron dataset and scrutinized the performance of different text classification methods on the email categorization problem. Taghva et al. [22] used concepts extracted from emails as part of the feature vectors for an underlying Bayesian classifier. To this end, they constructed an ontology that applies rules for identifying the features and concepts to be used for email categorization. Balakumar and Vaidehi [23] also attempted to categorize emails using an ontology to understand the content of the emails; they utilized the concepts and relations in the ontology created during the training phase. Chang and Poon [10] reported their experience with classifying emails using phrases as basic features and analyzed the learning curves for email classification on public and private email collections. Yu and Zhu [24], using a modified version of latent semantic indexing, converted the original sparse and noisy feature space into a semantically richer feature space and exploited neural networks to classify emails. In [25], Bermejo et al., aiming to solve the problem of imbalanced folder sizes in email categorization, improved Naïve Bayes multinomial text classification for email data. In a recent work, Carmona-Cejudo et al. [5] conducted a comprehensive study on feature selection in the task of email categorization. Furthermore, considering the concept drift in email data, they proposed a dynamic approach that adapts over time. Although the approaches in this group do not consider non-textual information of email data explicitly, some of them use this information implicitly; e.g., in [5], the problem of concept drift is addressed by considering the time property of emails.

In the second group, the approaches go beyond the text content of emails and try to automate the email categorization task by making use of other email information as well. Diao et al. [26] dealt with both the text content of emails and the structured information in email data, such as participant information, date information, and attachment information, by modeling emails as semi-structured documents.
Using this information, they examined the performance of decision-tree-based and Bayesian classifiers. Kiritchenko et al. [27], using the date information of emails, tried to discover temporal relations between emails and embed this information in content-based learning methods. Chakravarthy et al. [3] represented an email as a graph in which, besides the text content of the email, information about the sender, recipients, and attachments is present; then, using graph mining approaches, they classified emails into folders. Armentano and Amandi [28] proposed an interactive system that not only considers date, recipient, and attachment information along with the content of the email, but also
takes the user's profile information and feedback from his/her actions into account. In [29], Grbovic et al. use content features, email address features, behavioral features (e.g., reply and forward information), and temporal features with the goal of classifying emails into a handful of broadly useful categories. Alecsa belongs to the second group, since it regards all email aspects, including the text content of emails, participants, and time properties. However, the main strength of Alecsa lies in its dynamic way of fusing all of this information.
3. ADFL framework for attentive email categorization

In this section, we first briefly describe the ADFL framework and then explain how we have employed it in Alecsa to formulate the email categorization problem.
3.1. Active Decision Fusion Learning (ADFL)

Decision fusion combines the responses of a selected set of experts toward a given query and makes a final decision based on this combination [12]. Decision fusion is applicable to many real-world problems, including multiple classifier systems and mixtures of experts. In general, there are many reasons for combining multiple sources of information for decision-making, such as improving effectiveness and accuracy, handling a high volume of data using divide-and-conquer methods, data fusion approaches, and ensemble learning. Decision fusion is crucial since each expert's knowledge, which is considered a separate source of information, is imperfect and nonuniform over the problem domain. Thus, the suggestions of individual experts might not only be incomplete but can also be misleading due to their lack of expertise in some cases [12].

One of the most popular and interesting applications of decision fusion arises when there are multiple experts, each of which makes a local decision based on a local area of expertise in the form of a probability distribution over all possible choices. In this case, the goal is to learn a proper sequence of consultations with the local decision experts to improve the quality of the final decision in terms of accuracy and cost. Active Decision Fusion Learning (ADFL) has been presented to address this problem. In ADFL, the problem is modeled as a Markov Decision Process (MDP) in which the states encode the decisions of the local decision experts and the actions are either referring to a new local decision expert for consultation or determining the final decision. Furthermore, the costs of consultation and decision-making are modeled as a reward function. Then, to solve the MDP and find an optimal sequence of consultations, a continuous reinforcement learning method is used. In the employed reward function, a big punishment is given in the case of a wrong final decision and a big reward for a correct one. Moreover, in order to control the cost of consultations, a small punishment is given whenever it is decided to consult an extra expert. Finally, using the policy learned through reinforcement learning, at each state either the next local decision expert is selected for consultation or a final decision is made based on the consultations already carried out.

In ADFL, it is assumed that $e_i$ ($i = 1, \ldots, l$) is a local decision expert that has access to a subset of the feature space, $f_i$. The subsets can overlap and $F = \bigcup_{i=1}^{l} f_i$. It is also assumed that the decision space is a discrete space, $D = \{d_1, d_2, \ldots, d_c\}$, where $c$ is the number of possible decisions. The output of each $e_i$ is a probability distribution on the decision space, denoted by $e_i(f_i)$. ADFL, which is effectively a combiner for the decisions of the $e_i$s, tries to learn a policy based on a reinforcement learning approach
in which the state $s$, the action $a$, and the decision policy $\pi$ are defined as follows:

$$
\begin{cases}
s = [s_1, s_2, \ldots, s_l], & s_i = \begin{cases} e_i(f_i), & i \in \text{selected experts} \\ 0_{1 \times c}, & \text{otherwise} \end{cases}\\
a \in A, & A = T \cup D\\
\pi = \{\Pr(a \mid s_i)\}_{a \in A,\, s_i \in s}
\end{cases}
\tag{1}
$$
where $T = \{t_1, t_2, \ldots, t_l\}$, in which $t_i$ means consulting $e_i$. It is noteworthy that $\pi$ is, in general, a mapping from perceived states of the environment to the actions to be taken when the agent enters those states [30]. After each action, ADFL updates its state until the chosen action is a final decision. In this way, the purpose of ADFL is to find an optimal policy such that
$$
\pi^* = \arg\max_{\pi} Q^{\pi}(s, a) \quad \forall s, a
\tag{2}
$$
where $Q$ is the utility function that trades off the accuracy of the decision against the cost of consultations. The ADFL agent learns to reach a reasonable trade-off between the quality of its final decision and the cost of consulting the local decision experts by maximizing its expected reward. To do so, it considers predefined costs for consulting each local decision expert as well as the benefit of making a correct final decision. More precisely, $Q^{\pi}(s, a)$ is the cumulative reward that the ADFL agent achieves, based on the predefined value of the reward of action $a$ in state $s$ as well as the rewards already gained according to the policy $\pi$. This predefined value, $r$, is defined as a conditional function of the given action $a$ as follows:
$$
r = \begin{cases}
\text{High Positive}, & \text{if } a = \text{Correct Decision} \in D\\
\text{High Negative}, & \text{if } a = \text{Wrong Decision} \in D\\
\text{Small Negative} \times \text{number of already consulted experts}, & \text{if } a \in T
\end{cases}
\tag{3}
$$
So, in each state $s$, if the action is a final decision, the value of $Q(s, a)$ is increased by the reward ($r = \text{High Positive}$) or decreased according to the punishment ($r = \text{High Negative}$) for a correct or a wrong final decision, respectively. On the other hand, if the action is a consultation, the value of $Q(s, a)$ is decreased by the consultation cost. This cost increases linearly with the number of already consulted local decision experts ($r = \text{Small Negative} \times \text{number of already consulted experts}$). According to Eq. (1), actions are discrete but states are continuous (a sequence of distributions on the decision space). Thus, a continuous reinforcement learning method should be used, such as the method proposed in [31]. It is noteworthy that the initial state is considered Null, meaning that in the beginning there has been no consultation. So, to start, there are several possibilities for the first action [11]. For example, the first action can be consulting the most knowledgeable local decision expert with respect to its access to a vast portion of the feature space or to an informative part of the feature space.

3.2. Alecsa: Modeling email categorization employing the ADFL framework

The task of email categorization is in fact a decision-making problem in which the decision "In which folder should we categorize a given email?" is made. A straightforward solution for this decision-making problem is to use classification methods.
However, since in traditional classification methods decisions are made entirely based on a static trained model rather than a dynamic policy, these methods are not sufficiently dynamic to choose a proper sequence of feature sets for each new instance. On the other hand, one of the applications of ADFL is in the classification task, when there are multiple experts and each one has access to a specific portion of the feature space. Based on their knowledge domains, the experts declare their decisions for categorizing a given instance. In this case, ADFL tries to learn a proper policy for choosing a sequence of experts for consultation. Thus, it can ensure effectiveness, i.e., the highest accuracy in the classification result, and efficiency, i.e., the least cost in terms of the number of consultations. These points suggest that ADFL is a strong framework for modeling the different types of dynamics of user behavior in the task of email categorization.

In Alecsa, we use ADFL to model the automatic email categorization problem. To do so, each component of the email categorization problem is explicitly mapped to a component of the ADFL framework. As mentioned before, to simulate the user behavior of investigating a given email with respect to its different structural aspects, the features can be grouped into several sets, so that each set of features represents a specific structural aspect of the email. For example, one set of features may reflect the textual content of the email and another set may capture the characteristics of the sender and the recipients of the email. Thus, from the viewpoint of each structural aspect of the email, only a portion of the feature space is visible. We assume that for every structural aspect, one or more functions are defined. The defined functions can only access, as input data, the part of the feature space that is visible to their corresponding structural aspect. Each of these functions provides a probability distribution over the email folders, which indicates the likelihood that a given email belongs to each of the existing folders considering the information in the portion of the feature space visible to the function. Essentially, these functions are equivalent to the local decision experts in the ADFL framework: they present their output decisions in the form of probability distributions on the decision space. The decision space here is equivalent to all possible choices among the existing folders in which the given email can be categorized. Based on this mapping, an email is assigned to a folder when choosing that folder as the final decision has the maximum probability among all possible actions, including any possible final decision and consultation with more experts. Fig. 1 presents the pseudo-code of a learning cycle of Alecsa.

The goal is then to learn a policy for selecting an appropriate sequence of consultations with the local decision experts. This problem is cast as an episodic Markov decision process defined by the states and actions of ADFL together with transition and reward functions. Afterward, reinforcement learning is exploited to solve the corresponding MDP and learn the optimal policy. It is noteworthy that the test phase is the same as the training phase except that no knowledge updating is involved.
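To make this mapping concrete, the following is a minimal sketch of one ADFL-style learning episode for a single email. It is an illustration under simplifying assumptions rather than the paper's implementation: a linear action-value function trained with a plain Q-learning update stands in for the continuous reinforcement learning method of [31], the reward constants follow the values reported in Section 4.2, and all names (LinearQ, run_episode, the expert callables) are ours.

```python
# Illustrative sketch of one ADFL-style learning episode (not the authors' code).
# Assumptions: a linear Q-function with a plain Q-learning update replaces the
# continuous RL method of [31]; experts are callables returning a probability
# distribution over folders; reward constants follow Section 4.2.
import numpy as np

HIGH_POS, HIGH_NEG, SMALL_NEG = 10.0, -10.0, -1.0

class LinearQ:
    """Linear action-value function: one weight vector per action (plus bias)."""
    def __init__(self, state_dim, n_actions):
        self.w = np.zeros((n_actions, state_dim + 1))

    def _phi(self, state):
        return np.append(state, 1.0)

    def value(self, state, action):
        return float(self.w[action] @ self._phi(state))

    def update(self, state, action, target, lr):
        # Move Q(state, action) toward the TD target.
        self.w[action] += lr * (target - self.value(state, action)) * self._phi(state)

def run_episode(experts, email, true_folder, n_folders, Q, lr=0.01, gamma=0.95, eps=0.1):
    """One training episode: consult experts (a in T) until a folder is chosen (a in D)."""
    l = len(experts)
    n_actions = l + n_folders                     # A = T U D
    state = np.zeros(l * n_folders)               # s = [s_1, ..., s_l], zeros if not consulted
    consulted = set()

    while True:
        if np.random.rand() < eps:                # epsilon-greedy exploration
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax([Q.value(state, i) for i in range(n_actions)]))

        if a < l:                                 # consultation: ask local decision expert e_a
            next_state = state.copy()
            next_state[a * n_folders:(a + 1) * n_folders] = experts[a](email)
            consulted.add(a)
            r = SMALL_NEG * len(consulted)        # cost grows with consultations (Eq. (3))
            target = r + gamma * max(Q.value(next_state, i) for i in range(n_actions))
            Q.update(state, a, target, lr)
            state = next_state
        else:                                     # final decision: folder index a - l
            r = HIGH_POS if (a - l) == true_folder else HIGH_NEG
            Q.update(state, a, r, lr)             # terminal action, no bootstrap
            return a - l, r
```

In Alecsa, the expert callables would correspond to the aspect-specific functions of Section 3.3, with their similarity scores normalized into probability distributions over the folders; at test time the same loop is followed greedily without the Q updates.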
3.3. Defining local decision experts using features of structural aspects

The local decision experts correspond to functions that are defined on specific portions of the feature space. Hence, in order to define these functions, we should first determine the structural aspects of the email data. Each function is defined on the area of features visible to the structural aspect that the function belongs to.
Fig. 1. Pseudo-code of a learning cycle of Alecsa.
The three main structural aspects in email data are as follows:

• Content-aware aspect
• Time-aware aspect
• Participants-aware aspect (senders and recipients of emails)

For each of these aspects, a set of functions is defined, where the functions indicate the likelihood that an email belongs to each of the existing folders from the viewpoint of that structural aspect. It is noteworthy that each of these functions calculates a similarity score between an email and a folder using the emails in that folder. Then, the scores calculated by these functions for a particular email are normalized over all the folders so that the final scores form a probability distribution. In the following, we describe the functions in detail.

3.3.1. Content-aware aspect

If a user wants to categorize an email into one of a number of folders regarding its content, he/she will consider the subject of the email as well as its body text. The following functions are defined from the perspective of the content-aware aspect of emails:

Body-text Similarity. When an email belongs to a folder in terms of the similarity of its content topic to the topic assigned to the folder, the body text of the email, as its main content, should be similar to the body texts of the existing emails in that folder. In Alecsa, to quantify this similarity, two different information retrieval based methods are used, which are among the best-performing methods for calculating text similarity: Okapi BM25 [32] and language-model-based similarity using KL-divergence [33] with Dirichlet prior smoothing. We consider the body text of the incoming email as the query and the concatenation of the body texts of all emails within a folder as a document.

Named Entities Similarity. Named entities are among the parts of a text that are most capable of representing the topic of its content. Usually, two topically related emails have several named entities in common. In order to determine the similarity between an email and a folder considering their named entities as representatives of their topics, the standard Dice
similarity is adopted as follows:

$$
\text{Sim}_{\text{NamedEntity}}(e, F) = \frac{2\,\bigl|NE_e \cap \bigl(\bigcup_{e_i \in F} NE_{e_i}\bigr)\bigr|}{|NE_e| + \bigl|\bigcup_{e_i \in F} NE_{e_i}\bigr|}
\tag{4}
$$
where $NE_e$ is the set of named entities extracted from the body text of email $e$. To extract named entities from the body text of emails, we have used the Stanford Named Entity Recognizer [34].

Email Subject Similarity. Another proposed function is the subject similarity, which determines the similarity of a given email to the existing folders regarding the subject lines of the emails. To measure this similarity, the subject lines of all emails are first standardized by removing the parts that are not generated by a human, such as "fw:" or "re:". Then, the normalized overlap is used to determine the similarity of the standardized subject line of the incoming email to the subject lines of all emails in a folder, as follows:
$$
\text{Sim}_{\text{SubjectLine}}(e, F) = \frac{\bigl|S_e \cap \bigl(\bigcup_{e_i \in F} S_{e_i}\bigr)\bigr|}{\bigl|S_e \cup \bigl(\bigcup_{e_i \in F} S_{e_i}\bigr)\bigr|}
\tag{5}
$$
where $S_e$ is the set of terms in the subject of email $e$ after standardization and $|S_e|$ is the number of terms in this set.

3.3.2. Time-aware aspect

As stated in the problem definition, for a folder in which emails are archived due to their time similarity, e.g., being received in a specific period of time, the user categorizes a new incoming email into this folder by paying attention to its time-related characteristics. We have defined the following functions from this point of view.

Time Closeness. If an email is related to a particular event that happens in a specific time interval, its time similarity with the emails in the existing folders partly indicates in which folders the email can be categorized. The similarity of an email to a folder is defined as follows:
$$
\text{Sim}_{\text{Time}}(e, F) = \frac{|F|}{1 + \sum_{e_i \in F} \text{TimeDiff}(e, e_i)}
\tag{6}
$$

where $|F|$ is the number of emails in the folder $F$, and TimeDiff is a function that calculates the difference between the receipt times of two emails in days. This function also implicitly models the tendency of users to categorize newly received emails into newly created folders: according to the scores generated by this function, newly created folders are more likely to be considered as the target of an incoming email.
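As a concrete illustration, the sketch below implements the set-based and time-based similarities of Eqs. (4)–(6) as plain Python functions. The email representation (a dictionary carrying named_entities, subject_terms, and a received datetime) is an assumption made only for this example; in Alecsa, the scores of each function are subsequently normalized over all folders to form a probability distribution.

```python
# Hedged sketch of the similarity functions of Eqs. (4)-(6); the email/folder
# data structures are illustrative assumptions, not the authors' implementation.
from datetime import datetime

def named_entity_sim(email, folder):
    """Eq. (4): Dice similarity between the email's named entities and the union
    of named entities of the emails in the folder."""
    ne_e = set(email["named_entities"])
    ne_f = set().union(*(set(e["named_entities"]) for e in folder)) if folder else set()
    denom = len(ne_e) + len(ne_f)
    return 2.0 * len(ne_e & ne_f) / denom if denom else 0.0

def subject_line_sim(email, folder):
    """Eq. (5): normalized overlap between standardized subject-line terms."""
    s_e = set(email["subject_terms"])
    s_f = set().union(*(set(e["subject_terms"]) for e in folder)) if folder else set()
    union = s_e | s_f
    return len(s_e & s_f) / len(union) if union else 0.0

def time_closeness_sim(email, folder):
    """Eq. (6): folder size divided by one plus the summed day-level differences
    between the receipt time of the email and those of the folder's emails."""
    total_diff = sum(abs((email["received"] - e["received"]).days) for e in folder)
    return len(folder) / (1.0 + total_diff)

# Example usage with hypothetical data:
inbox_email = {"named_entities": {"Enron", "Houston"},
               "subject_terms": {"meeting", "friday"},
               "received": datetime(2001, 5, 4)}
folder = [{"named_entities": {"Enron"}, "subject_terms": {"meeting"},
           "received": datetime(2001, 5, 1)}]
print(named_entity_sim(inbox_email, folder), time_closeness_sim(inbox_email, folder))
```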
Table 1. Statistics on the Enron–Bekkerman email datasets.

User        | Number of folders | Number of emails | Emails in smallest folder | Emails in largest folder | Average emails per folder
beck-s      | 101 | 1971 | 3 | 166  | 19.6
farmer-d    | 25  | 3672 | 5 | 1192 | 146.9
kaminski-v  | 41  | 4477 | 3 | 547  | 109.2
kitchen-l   | 47  | 4015 | 5 | 715  | 85.4
lokay-m     | 11  | 2489 | 6 | 1159 | 226.3
sanders-r   | 30  | 1188 | 4 | 420  | 39.7
williams-w3 | 18  | 2769 | 3 | 1398 | 153.9
Periodic Time Similarity. Sometimes an incoming email is categorized into a folder based on its receipt time not because the email is received in a specific contiguous time interval, but because it is received in a specific recurring period of time: for example, emails received during fall semesters, during winter vacations, or on weekends. In order to capture this kind of time-aware similarity between an email and a folder, we introduce four different functions, each considering a particular time period:

• The number of emails in the folder received on the same day of the week as the incoming email, normalized by the number of all emails belonging to the folder.
• The number of emails in the folder received in the same week of the month as the incoming email, normalized by the number of all emails belonging to the folder.
• The number of emails in the folder received in the same month of the year as the incoming email, normalized by the number of all emails belonging to the folder.
• The number of emails in the folder received in the same season as the incoming email, normalized by the number of all emails belonging to the folder.

3.3.3. Participants-aware aspect

To calculate the similarity of an incoming email to a folder considering the participants of the email, the following functions are defined:

Common Participants. The participants of an email are the set of people mentioned in the "From", "To", and "CC" parts of the email, i.e., the sender of the email and its recipients. The similarity of an email to a folder regarding the participants of the emails is calculated as follows:
$$
\text{Sim}_{\text{CommonParticipants}}(e, F) = \frac{\sum_{e_i \in F} \text{Sim}_P(e, e_i)}{|F|}
\tag{7}
$$
where $|F|$ is the number of emails in the folder $F$ and $\text{Sim}_P$ is a function that calculates the similarity of the participants of two given emails. For this purpose, we exploit the similarity function presented by Erera and Carmel [35], which is a variant of Dice similarity. This function takes the activity role of participants into account based on their appearance in the "From" and "To" (active participants) or "CC" (passive participants) fields.

Common Participants Groups. It is possible that an email is categorized into a folder because it is sent by a particular "group" of people, rather than by a particular person, whose sent emails exist in that folder. Hence, the previous function is not able to capture the participants-aware similarity of an email and a folder in this situation, especially when the incoming email is from a new person
who did not participate in any previous emails but is a member of a particular group of existing participants. To overcome this drawback, instead of exact matching of participants using their complete email addresses, we consider two participants as the same person (i.e., in the same group) if the domains of their email addresses, i.e., the part after the @ sign, are the same. This relaxation helps us calculate the participants-aware similarity of an email and a folder taking group membership into consideration.

Common Signature Words. Another function exploited in Alecsa calculates the participants-aware similarity of an email and a folder using the information in the emails' signatures. To do so, the signatures of all emails are extracted using a tool named Jangada [36]. To avoid a null value when an email has no signature, the email address of the sender is added to the signature. Afterwards, using a tokenizer, the set of words of each signature is extracted. Having the sets of words for all emails, we use the "subject similarity" function defined above, where the words in the signatures are used instead of the words belonging to the subject lines.

4. Experiments

In this section, the results of the experiments conducted to evaluate the quality of Alecsa are provided. First, a brief description of the datasets as well as the experimental setup is presented. Then, the results of applying Alecsa to the task of automatic email categorization are presented. Furthermore, several discussions regarding the sensitivity of Alecsa to the amount of training data, its quality compared to the previous methods, its time efficiency, and the quality of the local decision experts are presented.

4.1. Datasets

Due to privacy issues, only a few real email datasets are publicly available. Among them, the Enron corpus [37] is the most famous one for the task of email categorization. The corpus includes 517,431 emails that belong to 150 users, who were mostly senior managers of the Enron corporation. While different versions of this corpus have been published, the complete and most famous version is provided by William Cohen 1. Using this complete version, Bekkerman 2 generated a standard subcorpus that consists of seven datasets, each one being the emails of one employee. Some postprocessing has been done on the datasets, including removing small folders, removing automatically-generated folders, and flattening folder hierarchies [9]. Table 1 shows the statistics of the Enron–Bekkerman datasets.
1 https://www.cs.cmu.edu/∼./enron/
2 http://www.cs.umass.edu/∼ronb/enron_dataset.html
Fig. 2. Distribution of folder counts over the different structural aspects in the Enron–Bekkerman datasets.
4.2. Experimental setup

To assess the quality of the results of Alecsa, each dataset is first divided into two parts: training data and test data. Using the model described in Section 3.2 on the training data, an optimal policy for categorizing emails into folders is trained for each dataset. For this purpose, the label information is used to update the reward function. Finally, the test data is used to evaluate the learned policy. Since email data are time dependent, a random split may introduce dependencies of earlier emails on later emails, which does not happen in reality. Therefore, we have used a time-based data splitting approach in which the training data includes earlier emails than those in the test data [37]. To do so, all emails have been sorted chronologically. Then, the first k% of the data has been considered as the training data and the next 10% has been used as the test data for the evaluation. In all of our experiments, we have split the data using three different values of k for each dataset: k = 30, k = 60, and k = 90. Each reported accuracy is the average of the accuracies in these three settings. For the parameters of ADFL, the cost of each consultation has been set to −1, the reward of each correct decision has been set to +10, and the punishment of each wrong decision has been set to −10. The evaluation metric we have used to assess the accuracy of Alecsa is the Correct Classification Rate (CCR), which is the most common metric for measuring the quality of a classification task [38]. In fact, CCR determines the percentage of correct decisions.

4.3. Discussion on the importance of considering structural aspects in automatic email categorization

In order to emphasize the importance of considering the structural aspects of email data in automatic email categorization, we have studied the Enron–Bekkerman datasets from this perspective. In this section, we provide some information to indicate that the idea of considering structural aspects for email categorization makes sense. Furthermore, we have monitored Alecsa's behavior with respect to this issue. Our analysis demonstrates that the way Alecsa incorporates the structural aspects for email categorization is in accordance with reality.
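For concreteness, the chronological split and the CCR metric described in Section 4.2 can be sketched as follows; the received field and the helper names are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of the time-based split and CCR metric of Section 4.2
# (field names and helpers are assumptions, not the authors' code).
def chronological_split(emails, k):
    """Sort emails by receipt time, use the first k% for training and the next 10% for testing."""
    emails = sorted(emails, key=lambda e: e["received"])
    n = len(emails)
    train_end = int(n * k / 100.0)
    test_end = min(n, train_end + int(n * 0.10))
    return emails[:train_end], emails[train_end:test_end]

def correct_classification_rate(predicted, actual):
    """CCR: percentage of test emails assigned to the correct folder."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual) if actual else 0.0

# Reported accuracies average the CCR obtained with k = 30, 60, and 90.
```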
4.3.1. Annotating the Enron–Bekkerman datasets with structural aspects

To explore the role of the structural aspects as the distinguishing characteristics of different folders, each of the datasets in the Enron–Bekkerman corpus has been manually annotated, so that the labels on each folder indicate the structural aspects of emails considered by the user while categorizing emails into that folder. In Fig. 2, a Venn diagram is presented for each dataset. In these diagrams, each folder is considered an object and the structural aspects, i.e., the content-aware, time-aware, and participants-aware aspects, are considered labels. Each folder is labeled with one or more structural aspects according to which structural aspects the emails in that folder are categorized by. For example, in the kaminski-v dataset, among the 41 folders, 31 folders are labeled with only the content-aware aspect, 5 folders are labeled with just the participants-aware aspect, 2 folders are labeled with the content-aware and time-aware aspects, and 3 folders are labeled with the content-aware and participants-aware aspects. We will refer to Fig. 2 in the following analytical sections.

4.3.2. Annotation process

In order to annotate the data, each folder is presented to three different individuals. The annotators, considering the name of the folder as well as taking a glance at the emails in the given folder (content, participants, and receipt time), have determined which structural aspect or aspects of the emails are considered as the distinguishing characteristic of the given folder. In order to make the annotation process clearer, a set of labels has been selected, and each label has been mapped to one or more of the three defined structural aspects. The set of labels and their mapping to the structural aspects are presented in Table 2. Annotators are allowed to choose a set of labels for each folder; afterward, their annotations are mapped to the structural aspect labels. In order to assess the quality of the annotations with respect to the final structural aspect labels, we have used an agreement measure presented by Bhowmick et al. [39], which is an extension of the Kappa coefficient for computing inter-annotator reliability in a multi-label annotation process. The obtained agreement value is κ = 0.873, which is considered almost perfect agreement.

4.3.3. Contribution of structural aspects in the procedure of Alecsa for email categorization

As described in Section 3.2, for each incoming email, a sequence of local decision experts is selected according to the learned policy.
Table 2. Annotation labels and their mapping to the structural aspects.

Annotation label                               | Content-aware | Time-aware | Participants-aware
Specific topic                                 | ✓             | –          | –
Specific sender/group of senders               | –             | –          | ✓
Specific group of recipients                   | –             | –          | ✓
Specific topics from one or more mailing lists | ✓             | –          | ✓
Specific time slot/period                      | –             | ✓          | –
Specific event                                 | ✓             | ✓          | –
Other                                          | ✓             | ✓          | ✓
Fig. 3. The contribution of each structural aspect in different datasets in the Alecsa procedure for categorizing emails.

Table 3. Evaluation results of different methods on the Enron–Bekkerman datasets (a one-tailed t-test is conducted to assess the significance of the improvement of the best result over the second-best result in each dataset; p-value < 0.005 and p-value < 0.05 indicate the significance levels).

Method                 | beck-s | farmer-d | kaminski-v | kitchen-l | lokay-m | sanders-r | williams-w3
Maximum Entropy [40]   | 55.8   | 76.6     | 55.7       | 58.4      | 83.6    | 71.6      | 94.4
Naïve Bayes [41]       | 32.0   | 64.8     | 46.1       | 35.6      | 75.0    | 56.8      | 92.2
SVM [42]               | 56.4   | 77.5     | 57.4       | 59.1      | 82.7    | 73.0      | 94.6
Wide-margin Winnow [9] | 49.9   | 74.6     | 51.6       | 54.6      | 81.8    | 72.1      | 94.5
ABC-DynF [5]           | 49.5   | 61.1     | 65.8       | 59.9      | 87.5    | 76.3      | 88.3
Alecsa                 | 63.8   | 70.8     | 63.1       | 58.3      | 91.6    | 74.3      | 96.8
For each sequence, considering the types of local decision experts within it, the percentage contribution of each structural aspect to the decision made for categorizing the email can be calculated. To do so, the contribution of a structural aspect SA is computed as the number of local decision experts in the selected sequence that have access to the area of the feature space related to SA, divided by the number of all local decision experts in the sequence. For each dataset in the Enron–Bekkerman corpus, the contribution of the structural aspects is computed as the average over all emails belonging to the dataset. Fig. 3 illustrates the percentage contribution of each structural aspect in the different datasets in the Alecsa procedure for categorizing emails.

Fig. 3 confirms the ability of Alecsa to categorize emails dynamically based on the different structural aspects. As can be seen, the most useful aspect as the distinguishing characteristic is the content-aware aspect. Comparing Figs. 2 and 3, it can be observed that Alecsa is very good at modeling the importance of each structural aspect in the different datasets. For example, in the beck-s dataset, the time-aware and participants-aware aspects have fair contributions, which is almost consistent with the annotations in Fig. 2a. In the farmer-d dataset, the content-aware aspect has the dominant contribution, which is in accordance with Fig. 2b. It should be noted that the reported value for each dataset is the micro-average of the contributions of the different aspects, so it cannot show the overlapping contributions of different aspects.

4.4. Results of Alecsa compared to the other methods

To compare the quality of Alecsa with other methods, we present the evaluation results of several previous methods. Some of these methods are purely text-based classification methods used for automatic email categorization [9]. Furthermore, the evaluation results of the ABC-DynF method [5], one of the best-performing methods on the Enron–Bekkerman datasets that tries to handle the dynamic nature of email data, are presented. Table 3 shows the evaluation results of Alecsa and the baseline methods, supplemented by statistical significance tests.

Among the baseline methods, Maximum Entropy [40] models the distribution over folders as the most uniform distribution that satisfies the constraints given by the training data. This method is prone to overfitting when a large amount of training data is not available (e.g., when a new folder is created). Naïve Bayes [41] is one of the primary text classification methods.
Fig. 4. Results of Alecsa on the different Enron–Bekkerman datasets with different amounts of training data.
Applying Bayes' rule, Naïve Bayes computes the probability of assigning an email to a folder. The Support Vector Machine (SVM) [42] is one of the strongest methods for the text classification problem [43]. The goal of SVM in a binary classification problem is to induce a maximum-margin hyperplane that separates the two classes in the training data; for multi-class data, the problem is divided into several binary classification problems. Wide-margin Winnow [9] is an extension of the Winnow [44] algorithm. Winnow is an online binary classification method that attempts to learn a class separator while minimizing the number of incorrect guesses as the training instances are presented to it. This method performs adequately in the case of sparse, high-dimensional data. Wide-margin Winnow is a multi-class version of Winnow presented in [9]. ABC-DynF [5] is one of the most recently proposed methods for automatic email categorization. ABC-DynF considers the dynamic nature of email data as well as the high dimensionality in email text categorization. By presenting ABC-DynF, Carmona-Cejudo et al. tried to address the concept drift problem with an adaptive approach. Their method is an iterative implementation of Naïve Bayes with dynamic feature selection. This dynamism helps the method to be effective by adapting to changes over time, and efficient by reducing the number of features. They examined various configurations of their method; their best result on each dataset is reported in Table 3.

As shown in Table 3, Alecsa outperforms all other methods on three datasets: beck-s, lokay-m, and williams-w3. In the beck-s dataset, according to the distribution of folder counts over the different structural aspects (Fig. 2a), there is a considerable number of folders that are distinguished from other folders based on the participants of their emails rather than their content. The same can be observed in the lokay-m (Fig. 2e) and williams-w3 (Fig. 2g) datasets. The achieved results show that Alecsa is able to discern the proper perspective for investigating incoming emails, which highlights the power of Alecsa in simulating the behavior of these users during email categorization. In the kaminski-v, kitchen-l, and sanders-r datasets, according to the study by Carmona-Cejudo et al. [5], the creation of new folders and, more generally, concept drift is observed. Thus, since the ABC-DynF method's main concern is addressing this problem, it achieves slightly better results on these datasets in terms of CCR. In the farmer-d dataset, the folders are almost topically distinguishable (Fig. 2b), so the textual features perform very well and SVM achieves a favorable result.

We performed a significance test to check whether the best result on each dataset is significantly better than the others. We used a one-tailed t-test between the best and the second-best result on each dataset. In all cases in which Alecsa is the best method, the achieved superiority is statistically significant. In farmer-d, SVM significantly outperforms the other methods. However, the improvements of ABC-DynF over Alecsa are not statistically significant in any case.
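The paper does not spell out how the samples for the significance test are constructed; assuming paired accuracy scores for the two compared methods (e.g., one score per split or per fold), a one-tailed paired t-test can be computed as in the sketch below.

```python
# Hedged sketch of a one-tailed paired t-test between two methods' paired
# accuracy scores; the construction of the score samples is an assumption.
from scipy import stats

def one_tailed_paired_ttest(best_scores, second_scores):
    """One-sided p-value for H1: mean(best_scores) > mean(second_scores)."""
    t, p_two_sided = stats.ttest_rel(best_scores, second_scores)
    # Halve the two-sided p-value when the observed difference is in the
    # hypothesized direction; otherwise the one-sided p-value is 1 - p/2.
    return p_two_sided / 2.0 if t > 0 else 1.0 - p_two_sided / 2.0
```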
4.5. Quality of Alecsa using different amounts of training data

To evaluate the effectiveness of Alecsa using different amounts of training data, we evaluated the method trained with three different values of k for each dataset: k = 30, k = 60, and k = 90. The evaluation results of Alecsa on the Enron–Bekkerman datasets with different amounts of training data are presented in Fig. 4. It is generally observed that the more training data there is, the better the results of Alecsa are. In certain cases (the lokay-m, kaminski-v, and sanders-r datasets), there is a drop in CCR as the amount of training data increases. Investigating the data, we found that in the lokay-m dataset this is due to the fact that, in the 10% of test data after the point at which the training and test data are separated (k = 60%), the user created new folders. Hence, there is no instance of the newly created folders in the training data at all; consequently, there is no chance for any email in the test data to be categorized into those folders. However, as the amount of training data increases (i.e., k = 90%), it covers the missing information about the newly created folders and the trained policy allows emails in the test data that belong to the new folders to be correctly categorized. In the kaminski-v and sanders-r cases, CCR decreases when the training data includes 90% of all the data. As also discussed in [5], this is due to a phenomenon called concept drift. Concept drift refers to changes in the data that can lead to changes in the target of learning over time [45]. In email categorization, the distinguishing characteristics associated with a folder can shift over time. The appearance of new folders in email data can also be considered a special case of concept drift. Carmona-Cejudo et al. [5] focused on this challenge and analyzed these datasets from the concept drift point of view. However, addressing this challenge is not the target of Alecsa, so it does not perform very well with respect to this issue.

The difference between the CCR values for the various datasets is due to the nature of their data. For example, in williams-w3, as can be seen in Table 1, about half of the emails are categorized into a single folder. This fact makes learning a proper policy for categorizing emails an easy task for Alecsa. However, in classification methods that are sensitive to imbalanced amounts of training data in different categories, this might cause the trained model to be biased.

4.6. A brief discussion on time efficiency

In this section, we discuss the time efficiency of our proposed method, Alecsa. To do so, we first simply compare the time consumption of Alecsa with that of the baseline systems. Moreover, we provide an argument about the procedure of Alecsa that reveals its parsimonious nature.
Table 4. Performance of Alecsa consulting different local decision experts as the initial action on the Enron–Bekkerman datasets.

Initial local decision expert | beck-s | farmer-d | kaminski-v | kitchen-l | lokay-m | sanders-r | williams-w3
Body-text Similarity          | 63.8   | 70.8     | 63.1       | 58.3      | 91.6    | 74.3      | 96.8
Named Entities Similarity     | 60.1   | 67.4     | 58.9       | 52.9      | 90.1    | 73.0      | 90.6
Email Subject Similarity      | 58.7   | 67.9     | 60.1       | 58.1      | 88.6    | 69.3      | 87.4
Time Closeness                | 55.1   | 49.9     | 46.0       | 49.7      | 64.3    | 67.0      | 87.7
Periodic Time Similarity      | 53.8   | 47.8     | 47.1       | 45.3      | 67.6    | 70.3      | 80.8
Common Participants           | 65.9   | 47.8     | 54.4       | 50.3      | 87.7    | 71.1      | 97.0
Common Participants Groups    | 67.3   | 45.8     | 49.6       | 46.1      | 82.2    | 72.0      | 95.4
Common Signature Words        | 59.0   | 40.6     | 46.0       | 44.4      | 78.5    | 66.9      | 90.8
Fig. 5. Consultation ratio of Alecsa on Enron–Bekkerman datasets.
Another interesting fact, which we can implicitly conclude, is that in dynamic environments, Alecsa not only tries to choose the best structural aspects as the points of view, but also chooses the smallest set of the most knowledgeable local decision experts from each structural aspect. As an example, in the beck-s dataset in which the user dynamically selects the distinguishing characteristics of emails for categorization (Fig. 2a), Alecsa has a great improvement over the baselines (Table 3). It can be seen that in this dataset, on average, about 4 local decision experts are consulted (Fig. 5). On the other hand, the contribution of the local decision experts in categorizing emails of this dataset is fairly uniform (Fig. 3). This shows that in this dataset, Alecsa is able to choose the best aspects, in each case (diversity of aspects contribution) and the best local decision experts are consulted in the selected aspect (small number of local decision experts). 4.7. Selecting first local decision expert: initial state in ADFL
Another interesting fact, which we can implicitly conclude, is that in dynamic environments Alecsa not only tries to choose the best structural aspects as points of view, but also chooses the smallest set of the most knowledgeable local decision experts within each structural aspect. As an example, in the beck-s dataset, in which the user dynamically selects the distinguishing characteristics of emails for categorization (Fig. 2a), Alecsa achieves a large improvement over the baselines (Table 3). In this dataset, on average, about 4 local decision experts are consulted (Fig. 5), while the contribution of the local decision experts to categorizing the emails of this dataset is fairly uniform (Fig. 3). This shows that in this dataset Alecsa is able to choose the best aspects in each case (reflected in the diversity of aspect contributions) and to consult the best local decision experts within the selected aspects (reflected in the small number of consulted experts).
4.7. Selecting the first local decision expert: initial state in ADFL

As noted in Section 3.1, the initial state is considered to be Null, so at the beginning there is no clue about how to choose the initial action in the constructed MDP. This action should be a consultation with a local decision expert. To study the sensitivity of Alecsa to the chosen initial action, in this experiment we evaluate the results of consulting different local decision experts as the initial action. Table 4 presents the results of Alecsa when it starts by consulting each of the different local decision experts, on all Enron–Bekkerman datasets. The obtained results indicate that the performance of Alecsa is sensitive to the starting point. According to the results in Table 4, in most cases, consulting the local decision expert that decides based on body-text similarity as the starting point leads to better results. This local decision expert represents the content-aware aspect of emails, and deciding how to categorize an email based on its content is the most common practice among the different aspects of email data. Hence, following one of the suggested strategies for selecting the initial action [11], we assume this local decision expert is the most knowledgeable one in terms of access to an informative part of the feature space, and we employ consultation with this expert as the initial action. It is noteworthy that the results reported in Fig. 4 and Table 3 are based on this assumption.
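To give a rough sense of why the body-text similarity expert is a reasonable default starting point, the following Python sketch averages, for each candidate initial expert, the values reported in Table 4 over the seven datasets and ranks the experts accordingly. It is only an illustrative aggregation of the table, not part of the proposed method, and it assumes the tabulated values are comparable across datasets.

    # Average performance of each candidate initial expert across the seven datasets,
    # using the values of Table 4 (columns: beck-s, farmer-d, kaminski-v, kitchen-l,
    # lokay-m, sanders-r, williams-w3).
    TABLE_4 = {
        "Body-text Similarity":       [63.8, 70.8, 63.1, 58.3, 91.6, 74.3, 96.8],
        "Named Entities Similarity":  [60.1, 67.4, 58.9, 52.9, 90.1, 73.0, 90.6],
        "Email Subject Similarity":   [58.7, 67.9, 60.1, 58.1, 88.6, 69.3, 87.4],
        "Time Closeness":             [55.1, 49.9, 46.0, 49.7, 64.3, 67.0, 87.7],
        "Periodic Time Similarity":   [53.8, 47.8, 47.1, 45.3, 67.6, 70.3, 80.8],
        "Common Participants":        [65.9, 47.8, 54.4, 50.3, 87.7, 71.1, 97.0],
        "Common Participants Groups": [67.3, 45.8, 49.6, 46.1, 82.2, 72.0, 95.4],
        "Common Signature Words":     [59.0, 40.6, 46.0, 44.4, 78.5, 66.9, 90.8],
    }

    means = {expert: sum(vals) / len(vals) for expert, vals in TABLE_4.items()}
    for expert, mean in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{expert}: {mean:.1f}")
    # Body-text Similarity ranks first (about 74.1 on average), which is consistent
    # with choosing it as the initial action.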
5. Conclusion and future work

In this research, we presented Alecsa as a dynamic system for the task of email categorization. Alecsa uses an attentive learning approach to learn which structural aspects of an incoming email should be taken into account for categorizing it. To do so, the problem of email classification in Alecsa is mapped to a decision-making problem in which, for each email, it is decided which folder is most likely to be its category. In this decision-making problem, several local decision experts are available for consultation, and each of them provides a suggestion in the form of a probability distribution over the folders. The suggestion provided by each expert is based on that expert's knowledge domain, where different knowledge domains correspond to different perspectives for investigating an email with respect to its structural aspects. Exploiting the ADFL framework, Alecsa dynamically learns a policy to select a sequence of local decision experts for consultation. The experimental results and analysis indicate that Alecsa performs very well in terms of accuracy, and that its parsimonious behavior leads to efficiency in terms of time.

As future work, three different directions can be considered. The first and foremost is to make Alecsa time adaptive; in other words, instead of treating the email data as a static batch, the emails can be viewed as a stream. This way, problems such as handling concept drift and incremental learning can be addressed. Since reinforcement learning, which is the core of Alecsa, is inherently online, modifying Alecsa to make it time adaptive seems feasible. The second suggestion is to improve the features defined for each of the structural aspects described in Section 3.3. Since our primary concern in this paper is to propose a complete framework for dealing with the dynamism of choosing the proper aspect, we have used simple yet efficient and effective features for each aspect; however, each of these features can be improved independently, and other features, such as ontology-based similarity features, can easily be added to the system as extra experts. The third suggestion is to learn how to determine the initial consultant, i.e., the first action in the ADFL framework, so that the initial action, to which Alecsa is sensitive, is chosen automatically.

References

[1] C. Neustaedter, Beyond from and received: Exploring the dynamics of email triage, in: Proceedings of the ACM CHI '05 Extended Abstracts on Human Factors in Computing Systems (CHI EA '05), 2005, pp. 1977–1980.
[2] J.D. Brutlag, C. Meek, Challenges of the email domain for text classification, in: Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 103–110.
[3] S. Chakravarthy, A. Venkatachalam, A. Telang, A graph-based approach for multi-folder email classification, in: Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM '10), IEEE Computer Society, Washington, DC, USA, 2010, pp. 78–87.
[4] T. Tam, Learning Techniques for Automatic Email Message Tagging, Master's thesis, Instituto Superior de Engenharia de Lisboa, 2011.
[5] J.M. Carmona-Cejudo, G. Castillo, M. Baena-García, R. Morales-Bueno, A comparative study on feature selection and adaptive strategies for email foldering using the ABC-DynF framework, Knowl. Based Syst. 46 (2013) 81–94.
[6] W. Cohen, Learning rules that classify e-mail, in: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, AAAI Press, 1996, pp. 18–25.
[7] J.D.M. Rennie, ifile: An application of machine learning to e-mail filtering, in: Proceedings of the KDD Workshop on Text Mining, 2000.
[8] J. Clark, I. Koprinska, J. Poon, A neural network based approach to automated e-mail classification, in: Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence (WI '03), IEEE Computer Society, Washington, DC, USA, 2003, pp. 702–706.
[9] R. Bekkerman, A. McCallum, G. Huang, Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora, Technical Report IR-418, Center for Intelligent Information Retrieval, 2004.
[10] M. Chang, C.K. Poon, Using phrases as features in email classification, J. Syst. Softw. 82 (6) (2009) 1036–1045.
[11] M.S. Mirian, A Framework for Learning Attention Control in Tasks with Multidimensional Perceptual Space, Ph.D. thesis, University of Tehran, 2010.
[12] M.S. Mirian, M.N. Ahmadabadi, B.N. Araabi, R.R. Siegwart, Learning active fusion of multiple experts' decisions: An attention-based approach, Neural Comput. 23 (2) (2011) 558–591.
[13] G. Tang, J. Pei, W.-S. Luk, Email mining: Tasks, common techniques, and tools, Knowl. Inf. Syst. 41 (1) (2013) 1–31.
[14] J. Gomez, E. Boiy, M.-F. Moens, Highly discriminative statistical features for email classification, Knowl. Inf. Syst. 31 (1) (2012) 23–53.
[15] D.D. Lewis, K.A. Knowles, Threading electronic mail: A preliminary study, Inf. Process. Manage. 33 (2) (1997) 209–217.
[16] M. Dehghani, A. Shakery, M. Asadpour, A. Koushkestani, A learning approach for email conversation thread reconstruction, J. Inf. Sci. 39 (6) (2013) 846–863.
[17] M. Dehghani, M. Asadpour, A. Shakery, An evolutionary-based method for reconstructing conversation threads in email corpora, in: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), 2012, pp. 1132–1137.
[18] A. Padhye, Comparing Supervised and Unsupervised Classification of Messages in the Enron Email Corpus, Master's thesis, University of Minnesota, 2006.
[19] U. Pandey, S. Chakraverty, A review of text classification approaches for e-mail management, Int. J. Eng. Technol. (IACSIT) 3 (2) (2011) 137–144.
[20] I. Koprinska, J. Poon, J. Clark, J. Chan, Learning to classify e-mail, J. Inf. Sci. 177 (10) (2007) 2167–2187.
[21] E. Crawford, I. Koprinska, A multi-learner approach to e-mail classification, in: Proceedings of the Seventh Australasian Document Computing Symposium, 2002.
[22] K. Taghva, J. Borsack, J. Coombs, A. Condit, S. Lumos, T. Nartker, Ontology-based classification of email, in: Proceedings of the International Conference on Information Technology: Computers and Communications (ITCC '03), IEEE, 2003, pp. 194–199.
[23] M. Balakumar, V. Vaidehi, Ontology based classification and categorization of email, in: Proceedings of the International Conference on Signal Processing, Communications and Networking (ICSCN '08), 2008, pp. 199–202.
[24] B. Yu, D.-h. Zhu, Combining neural networks and semantic feature space for email classification, Knowl. Based Syst. 22 (5) (2009) 376–381.
[25] P. Bermejo, J.A. Gámez, J.M. Puerta, Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets, Expert Syst. Appl. 38 (3) (2011) 2072–2080.
[26] Y. Diao, H. Lu, D. Wu, A comparative study of classification based personal email filtering, in: Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications (PAKDD '00), Springer-Verlag, 2000.
[27] S. Kiritchenko, S. Matwin, S. Abu-Hakima, Email classification with temporal features, in: M. Kłopotek, S. Wierzchoń, K. Trojanowski (Eds.), Intelligent Information Processing and Web Mining, Advances in Soft Computing, vol. 25, Springer, Berlin, Heidelberg, 2004, pp. 523–533.
[28] M.G. Armentano, A.A. Amandi, Enhancing the experience of users regarding the email classification task using labels, Knowl. Based Syst. 71 (2014) 227–237.
[29] M. Grbovic, G. Halawi, Z. Karnin, Y. Maarek, How many folders do you really need?: Classifying email into a handful of categories, in: Proceedings of the Twenty-third ACM International Conference on Information and Knowledge Management (CIKM '14), ACM, New York, NY, USA, 2014, pp. 869–878.
[30] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, 1998.
[31] H. Firouzi, M.N. Ahmadabadi, B.N. Araabi, S. Amizadeh, M.S. Mirian, R.R. Siegwart, Interactive learning in continuous multimodal space: A Bayesian approach to action-based soft partitioning and learning, IEEE Trans. Auton. Ment. Dev. 4 (2) (2012) 124–138.
[32] S.E. Robertson, S. Walker, M. Beaulieu, Experimentation as a way of life: Okapi at TREC, Inf. Process. Manag. 36 (1) (2000) 95–108.
[33] C. Zhai, J. Lafferty, A study of smoothing methods for language models applied to information retrieval, ACM Trans. Inf. Syst. (TOIS) 22 (2) (2004) 179–214.
[34] J.R. Finkel, T. Grenager, C. Manning, Incorporating non-local information into information extraction systems by Gibbs sampling, in: Proceedings of the Forty-third Annual Meeting of the Association for Computational Linguistics (ACL '05), Association for Computational Linguistics, 2005, pp. 363–370.
[35] S. Erera, D. Carmel, Conversation detection in email systems, in: Proceedings of the 30th European Conference on IR Research, Advances in Information Retrieval (ECIR '08), Springer-Verlag, 2008, pp. 498–505.
[36] V.R. Carvalho, W.W. Cohen, Learning to extract signature and reply lines from email, in: Proceedings of the Conference on Email and Anti-Spam, 2004.
[37] B. Klimt, Y. Yang, The Enron corpus: A new dataset for email classification research, in: J.-F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), Machine Learning: ECML 2004, Lecture Notes in Computer Science, vol. 3201, 2004, pp. 217–226.
[38] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manag. 45 (4) (2009) 427–437.
[39] P.K. Bhowmick, P. Mitra, A. Basu, An agreement measure for determining inter-annotator reliability of human judgements on affective text, in: Proceedings of the Workshop on Human Judgements in Computational Linguistics (HumanJudge '08), Association for Computational Linguistics, Stroudsburg, PA, USA, 2008, pp. 58–65.
[40] S. Della Pietra, V. Della Pietra, J. Lafferty, Inducing features of random fields, IEEE Trans. Pattern Anal. Mach. Intell. 19 (4) (1997) 380–393.
[41] A. McCallum, K. Nigam, A comparison of event models for Naive Bayes text classification, in: Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, AAAI Press, 1998, pp. 41–48.
[42] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992, pp. 144–152.
[43] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv. 34 (1) (2002) 1–47.
[44] N. Littlestone, Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm, Mach. Learn. 2 (4) (1988) 285–318.
[45] G. Widmer, M. Kubat, Learning in the presence of concept drift and hidden contexts, Mach. Learn. 23 (1) (1996) 69–101.