Three-layer concept drifting detection in text data streams


Communicated By Dr. Haiqin Yang

Accepted Manuscript

Three-layer Concept Drifting Detection in Text Data Streams

Yuhong Zhang, Guang Chu, Peipei Li, Xuegang Hu, Xindong Wu

PII: S0925-2312(17)30798-1
DOI: 10.1016/j.neucom.2017.04.047
Reference: NEUCOM 18396

To appear in: Neurocomputing

Received date: 1 August 2016
Revised date: 23 March 2017
Accepted date: 25 April 2017

Please cite this article as: Yuhong Zhang, Guang Chu, Peipei Li, Xuegang Hu, Xindong Wu, Three-layer Concept Drifting Detection in Text Data Streams, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.04.047

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Three-layer Concept Drifting Detection in Text Data Streams

Yuhong Zhang (a), Guang Chu (a), Peipei Li (a), Xuegang Hu (a), Xindong Wu (b)

(a) School of Computer Science and Information Technology, Hefei University of Technology, Hefei City, Anhui Province, China, 230009
(b) School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, Louisiana, USA

Abstract


Text data streams have widely appeared in real-world applications, in which concept drifts pose a significant challenge for classification. Compared with relational data streams, concept drifts hidden in text streams are usually reflected in the relationship between the feature vector and the instance labels. Meanwhile, existing concept drifting detection methods are mainly based on the error rates of classification. When applied to text streams, these methods perform poorly in terms of false alarms and missed detections. Motivated by this, we first give a systematic analysis of the concept drifts in text data streams. Then, we propose a three-layer concept drifting detection approach, where the three layers are the layer of label space, the layer of feature space and the layer of the mapping relationships between labels and features, respectively. In this approach, the latter two layers are based on the values of WoE (Weight of Evidence) and the IV (Information Value) index. Experimental results show that our approach can improve the performance of concept drifting detection and the accuracy of classification, especially when concept drifts in text data streams are frequent.

Keywords: concept drift; classification; text data streams; IV value.

1. Introduction

Text data streams have widely appeared in real-world applications, such as social networks [1], online reviews [2] and online document collections [3]. In these streams, concept drifts, which result from changes of the data distribution over time [4, 5], are common and frequent. For example, the topics that people care about change over time in Weibo [6]. The phenomenon also occurs in shopping reviews, online news and so on.

The state-of-the-art works on classification of data streams can be divided into two categories: gradually updating the classifier using the new data chunk without concept drifting detection [7], and detecting concept drifts and re-training the classifier using the new data chunk [8, 9, 10, 11]. Hence, how to detect concept drifts [12] is a critical problem for the classification of data streams.

Most methods for concept drifting detection are based on the error rates of classification [4, 8, 11, 13, 14]. When concept drifts occur, the classifier trained from the previous data chunks is unsuitable for the new incoming chunks, and thus the error rate of the classification on the new chunk will increase. Therefore, concept drifts can be distinguished by detecting the change of error rates. Some statistical tests such as the Hoeffding bound [15, 16] are used to measure the change.

Email addresses: [email protected] (Yuhong Zhang), [email protected] (Guang Chu), [email protected] (Peipei Li), [email protected] (Xuegang Hu), [email protected] (Xindong Wu)
Preprint submitted to Neurocomputing, May 10, 2017

Contrary to other data streams, a text data stream is characterized by the following aspects: 1) Text is sparse and high-dimensional [17], which makes the training of an accurate and stable classifier more difficult. 2) Concept drifts in text streams occur frequently, such as in the Weibo streams [6].

Therefore, concept drifting detection in text streams is challenging for the following reasons. First, whether a concept drift is detected depends on the performance of the classifiers. When the text streams are sparse and high-dimensional, the deviation of classification error rates tends to be high, which makes detection based on the error rate inaccurate. Second, the causes of concept drifts may be manifold [5], and not all changes are reflected in the error rate promptly and directly. Last, since the change of the error rate is a gradual process, concept drifts can only be detected after enough misclassified instances have accumulated. When concept drifts occur frequently, there are not enough misclassified instances to detect each drift, and thus some drifts will be missed.

To overcome these problems, in this paper we first analyze the categories of concept drifts in view of their causes. Then we propose a novel concept drifting detection approach for text data streams, namely a three-layer model based on feature selection. This approach can detect hidden concept drifts by measuring changes in each layer. The contributions of this paper are as follows.

• We systematically categorize concept drifts in text data streams according to the triggers of concept drifts instead of their velocity.

• We propose a novel three-layer model for concept drifting detection. This model not only detects concept drifts layer by layer, from simple to complex, but is also independent of the classifier.

• While existing detection methods predict labels to get the error rate of classification before detecting concept drifts, our method first detects concept drifts and then classifies. Hence, our method adapts to the new data chunk quickly and has fewer missed detections. Therefore, it works better in data streams where concept drifts occur frequently.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 introduces the IV index, which is the basic theory used in our detection model. Section 4 gives the details of our three-layer model. Section 5 shows the effectiveness of our approach with experimental results. Section 6 summarizes the paper.


2. Related Work

The state-of-the-art works fall into two categories: supervised approaches based on labeled data streams [10, 11, 18, 19], and semi-supervised approaches based on a few labeled instances and a large amount of unlabeled instances [20]. This paper belongs to the former. In this section, we first review the definition of concept drift, and then summarize some classical detection methods for concept drifts.

2.1. The definition of concept drift

Concept drift is a common phenomenon in data streams, indicating that the knowledge implied in the data changes over time. Several works have given definitions of concept drift. For example, Zhang et al. [12] gave a formal definition as follows: concept drift refers to changes in the data distribution, including a change of features in P(f) (only the unconditional distribution function changes), a change of condition in P(c | f) (only the posterior probability changes), and bi-directional changes in both P(f) and P(c | f). In addition, Gama et al. [21] summarized that concept drift primarily refers to an online supervised learning scenario in which the relation between the input data and the target variable changes over time.

The manifestations of concept drifts are various, and they can be divided into different categories by different standards. According to the degree of difference between the old and the new concepts, concept drifts fall into abrupt drift and gradual drift [5]. According to the changing rate, they can be divided into moderately fast drift and very slow drift [4]. In addition, concept drifts can be divided into relaxation drift and strict drift according to the transformations happening in the data sample distribution and the conditional distribution: the former means that the data sample distribution changes while the conditional distribution remains stable, and the latter means that both change randomly and continuously [12]. Yang et al. [22] summarized these opinions and pointed out that changes of data distributions caused by the variety of implicit context can be classified as gradual drift, abrupt drift and sampling variation.

The aforementioned works classify concept drifts in terms of diversity, rate and so on, but all of them lack a systematic analysis of the causes of concept drifts.

2.2. Concept drift detection methods

The detection of concept drifts is essential for the classification of data streams, and there are many existing approaches [8, 9, 10, 11, 13, 14, 15, 18, 23, 24, 25]. These approaches can be divided into two sub-categories: measuring the deviation between two concept clusters based on their radius and distance, which is mainly used in unlabeled streams [13, 24]; and measuring the change of the error rate, which is adopted in labeled streams. In this paper, we focus on the latter.

Wang et al. [9] detected concept drifts based on the error variance of concept similarity between the classifier and the data set. Gama et al. [10, 11] proposed DDM (Drift Detection Method), a concept drift detection method based on the Bernoulli distribution, which uses a threshold to determine whether a concept drift occurs. Baena-Garcia et al. [18] put forward EDDM (Early Drift Detection Method), developed from DDM. It improves the ability to detect gradual drift while retaining the efficiency in detecting abrupt drift. This method can be used as a wrapper of a batch learning algorithm or be implemented inside an incremental and online algorithm with any learning algorithm. Severo and Gama [23] proposed a detection system for regression problems, which is composed of three components: a regression predictor, a Kalman filter and a Cumulative Sum of Recursive Residual (CUSUM) change detector. The system monitors the error of the regression model using the regression predictor, interprets a significant increase as a change in the distribution, and rebuilds the regression model if a change is detected. The system is robust to false alarms and can be applied to problems where the information becomes available over time. Moreover, other detection approaches based on the classification error use other measures, such as Hoeffding bounds [15, 16, 26], the µ-test [25] and so on.

These works aim to distinguish concept drifts from noise, and they are in essence based on the error rates of classification.

3. Basic Theory

In this section, we briefly introduce the theory of Weight of Evidence and Information Value. WoE (Weight of Evidence) [27] is a quantitative analysis method combining instances based on specific labels, and the WoE value is called the "contribution weight". Given a feature space $F = \{f_i\}$ and the label space $Y = \{0, 1\}$, WoE can be considered as the contribution of $v_{ij}$, a value of $f_i$, to $Y$, as shown in Eq. (1).

\[ \mathrm{WoE}(f_i = v_{ij}, Y) = \ln \frac{g(f_i = v_{ij} \mid Y = 0)}{g(f_i = v_{ij} \mid Y = 1)} \tag{1} \]

Here $g(f_i = v_{ij} \mid Y = 0)$ refers to the ratio of the number of samples with $f_i = v_{ij}$ and $Y = 0$ to the number of samples with $Y = 0$. It can be seen that the larger the WoE value, the greater the contribution of the value $v_{ij}$ of feature $f_i$ to $Y$, and vice versa. WoE can only measure the contribution of a single feature value to the label. Hence the IV index is proposed to measure a whole feature. The IV value can be calculated by Eq. (2).

\[ \mathrm{IV}(f_i) = \int \left[ g(f_i = v_{ij} \mid Y = 0) - g(f_i = v_{ij} \mid Y = 1) \right] \cdot \mathrm{WoE}(f_i = v_{ij}, Y) \, df \tag{2} \]
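As an illustration of Eqs. (1) and (2), the following is a minimal Python sketch for one categorical feature and a binary label. It is not the authors' implementation: the function name `woe_iv` is hypothetical, the integral in Eq. (2) is replaced by a sum over the observed feature values, and the additive smoothing term `eps` is an assumption introduced here so that the logarithm stays finite when a value never occurs with one of the labels.

```python
import math
from collections import Counter


def woe_iv(values, labels, eps=0.5):
    """Return ({value: WoE}, IV) for one categorical feature and labels in {0, 1}.

    eps is additive smoothing (an assumption of this sketch, not part of
    Eqs. (1) and (2)); it avoids log(0) and division by zero.
    """
    n0 = sum(1 for y in labels if y == 0)
    n1 = len(labels) - n0
    count0 = Counter(v for v, y in zip(values, labels) if y == 0)
    count1 = Counter(v for v, y in zip(values, labels) if y == 1)
    distinct = sorted(set(values))
    k = len(distinct)

    woe, iv = {}, 0.0
    for v in distinct:
        # g(f = v | Y = 0) and g(f = v | Y = 1), smoothed
        g0 = (count0[v] + eps) / (n0 + eps * k)
        g1 = (count1[v] + eps) / (n1 + eps * k)
        woe[v] = math.log(g0 / g1)      # Eq. (1)
        iv += (g0 - g1) * woe[v]        # discrete form of Eq. (2)
    return woe, iv


# Toy usage: a binary "word present" feature over eight documents.
present = [1, 1, 1, 0, 0, 0, 1, 0]
label = [0, 0, 0, 0, 1, 1, 1, 1]
woe, iv = woe_iv(present, label)
print(woe, iv)
```

Each term of the sum is non-negative, because the difference of the two ratios and the logarithm of their quotient always share the same sign, so the IV of a feature is zero only when the feature is distributed identically under both labels.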

The IV index is an effective index for feature selection. First, its computation mainly consists of counting, which costs little time. Second, it contains information at different granularities, such as the IV matrix, the IV vector and the WoE data cube. Thus the drill-down and roll-up operations can be used to express the importance of features or feature values at different granularities. In [27], the authors draw the following conclusion from their experiments: when the IV value of a feature falls in [0, 0.1), the feature has a low contribution to the classification; when it falls in [0.1, 0.3), the feature's contribution is medium; when it falls in [0.3, ∞), the contribution is high. In this paper, we use the IV index to detect concept drifts.

4. Our Three-Layer Concept Drifting Detection Approach

In this section, we first divide the concept drifts in a text data stream into three categories, and then propose a three-layer concept drifting detection approach based on these categories.

First, the problem is defined; the notations are listed in Table 1. Given a text stream $S = \{S_1, ..., S_m\}$, where $S_i$ refers to the $i$-th data chunk in the text stream, we denote the feature space of $S_i$ as $F_i = \{f_j\}$ and the label space as $Y_i = \{y_k\}$. As described in Section 3, we can obtain the value of $\mathrm{WoE}(f_i, value_{ij}, y_k)$ for each feature value and each label, and all these WoE values form the data cube $V$. In particular, we select the common label-feature pairs in $V$ for two adjacent data chunks and obtain their WoE vector, named $\overrightarrow{WoE}$. Then we can roll up the data cube along the dimension of the feature value to obtain the IV matrix $R$. The elements of $R$ represent the IV of all the features for different labels, denoted as $iv(f_i, y_j)$. We can also roll up the IV matrix $R$ along the dimension of label values to obtain $\overrightarrow{IV}$, the IV value vector of all features. We use $IV(f_i)$ to represent an element of $\overrightarrow{IV}$, namely the IV value of feature $f_i$. In addition, $t$ represents the threshold.

Table 1: Variable Descriptions

| Variable | Description |
|---|---|
| $S$ | A text stream |
| $S_i$ | The $i$-th data chunk in text streams |
| $n$ | The number of samples contained in one chunk |
| $F_i = \{f_j\}$ | feature space of $S_i$ |
| $Y_i = \{y_k\}$ | label space of $S_i$ |
| $IV(f_i)$ | the IV value of feature $f_i$ |
| $iv(f_i, y_j)$ | the IV value of all the features for different labels |
| $\overrightarrow{IV} \mid y_j$ | the IV of all the features for a label $y_j$ |
| $\overrightarrow{F}$ | feature vector containing the features selected by IV value |
| $\overrightarrow{IV}$ | IV values vector of all features |
| $R$ | IV matrix consisting of IV of all the features for different labels |
| $V$ | WoE data cube consisting of WoE of all the features and different values for different labels |
| $\overrightarrow{WoE}$ | vector of WoE values with a label and feature couple |
| $t$ | the threshold |

Figure 1: An illustration to show three drifting cases


4.1. Categories of concept drifts in text streams

In this subsection, we categorize concept drifts by their causes. A data stream consists of a feature space and a label space. According to the definition of concept drift, the causes may include the change of label distribution, the change of feature distribution and the change of the mapping relationship between labels and features, as shown in Fig. 1.

First, we consider concept drift detection from the label space. If $Y_i$, the label space of $S_i$, is different from $Y_{i+1}$, we consider that a concept drift occurs, denoted as A-drift. An A-drift can be detected by a simple comparison of the label spaces of two data chunks. Moreover, for an A-drift we do not need to consider changes in the feature space or elsewhere.

Second, we analyze concept drifts caused by the change of the feature space, denoted as B-drift, which means that $F_i$ is different from $F_{i+1}$. To measure the change of the feature space, we need to select the informative features in each data chunk, and then compare the difference or similarity between $F_i$ and $F_{i+1}$.

Last, we consider concept drifts caused by the change of the mapping relationship between labels and features, namely C-drift. In this case, the change of the mapping relationship includes two aspects. One is that feature $f_i$ refers to $y_0$ in $S_i$, while $f_i \cup f_j$ refers to $y_0$ in $S_{i+1}$. The other is that feature $f_i$ refers to $y_0$ in $S_i$ but refers to $y_1$ in $S_{i+1}$. To detect this drift, we need to measure the change of the mapping relationship between labels and features in two data chunks. For the former aspect, we can detect the drift by computing the contribution of a feature to a label and comparing the similarity of the two informative feature sets. For the latter aspect, we need to further compute the mapping relationship between labels and features in different data chunks. It should be noted that the latter drift does exist but is generally infrequent.

Take online shopping reviews as an example. The products that consumers care about might shift from "books" to "electronics" over time, and this concept drift is denoted as A-drift due to the change of labels. If the focus of users on computers changes from "size" to "performance", it can be considered a change of feature distribution, denoted as B-drift. The "small size" of phones was usually regarded as an advantage by users over the previous 10 years, but now it may be regarded as a disadvantage; this can be categorized as a change of the mapping relationship between labels and features, denoted as C-drift. We will give further analysis on the causes of concept drifts. Generally, C-drift happens less often than A- and B-drifts in real-world applications.

In summary, we divide concept drifts into A-, B- and C-drifts based on their causes. A-drift is the simplest and C-drift is the most complex. Fig. 1 illustrates these three types of drifts. In a real-world application, a concept drift may be triggered by more than one cause. That is, it may be triggered by the change of label space (A-drift) as well as the change of feature space (B-drift). In this paper, when we detect an A-drift, we do not care whether there is also a B-drift.

4.2. Three-layer concept drifting detection model

Based on the above analysis, concept drifts are triggered by three factors: the change of the label space (namely A-drift), the change of the feature space (namely B-drift) and the change of the mapping relationship between features and labels (namely C-drift). Correspondingly, we define a three-layer concept drifting detection model to detect the various types of concept drifts, in which measures such as the Jaccard coefficient, the Euclidean distance and the cosine similarity [28] can be used to compute the similarity of two data chunks. Fig. 2 gives the framework of our three-layer model.

4.2.1. The first layer

In this layer, we aim to detect concept drifts caused by the change of the label space. Comparing two data chunks, if only a few instances share the same labels, we consider that an A-drift happens; otherwise no A-drift is reported. More specifically, for each label $y_k \in (Y_i \cap Y_{i+1})$, we count the numbers of instances with $y_k$ in both chunks and sum up the smaller of the two counts over all $y_k$. The ratio of this sum to the chunk size is taken as the similarity degree of the two label spaces, as shown in Eq. (3).

\[ \Phi_1(Y_i, Y_{i+1}) = \frac{\sum_{k=1}^{l} \min(n_i(y_k), n_{i+1}(y_k))}{n}, \quad y_k \in (Y_i \cap Y_{i+1}) \tag{3} \]

where $n_i(y_k)$ indicates the number of instances with label $y_k$ in $S_i$, $l$ indicates the size of $Y_i \cap Y_{i+1}$ and $n$ is the number of instances in $S_i$. According to Eq. (3), if the value of $\Phi_1$ is less than the threshold $t$, we can determine that a concept drift occurs. All concept drifts caused by changes of the label space will be found in this layer. Therefore, in the next layers we only need to consider the remaining data with the same labels when detecting B-drift and C-drift.
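The label-space similarity of Eq. (3) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code; the function name `phi1` is hypothetical, and the sketch assumes that both chunks contain the same number n of instances, as in the chunked setting described above.

```python
from collections import Counter


def phi1(labels_prev, labels_curr):
    """Label-space similarity of two chunks, following Eq. (3).

    Assumes both chunks hold the same number n of instances.
    """
    n = len(labels_prev)
    count_prev = Counter(labels_prev)
    count_curr = Counter(labels_curr)
    shared = count_prev.keys() & count_curr.keys()   # Y_i ∩ Y_{i+1}
    return sum(min(count_prev[y], count_curr[y]) for y in shared) / n


# Toy usage: "books" is the only label common to both chunks.
prev_chunk = ["books"] * 150 + ["dvd"] * 50
curr_chunk = ["books"] * 100 + ["kitchen"] * 100
similarity = phi1(prev_chunk, curr_chunk)
print(similarity)
```

With a threshold t, an A-drift would be reported whenever the returned similarity is below t.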


Figure 2: A three-layer model for concept drift detection

4.2.2. The second layer

In this layer, we aim to detect concept drifts caused by the change of the feature space, namely B-drift. A concept drift is detected by comparing the feature sets obtained through feature selection with the IV index. The informative features in the two chunks are first selected, and then the similarity between the two feature sets is calculated. Specifically, the IV values of all features are computed, and we exclude those features whose IV values are less than 0.3 [27]. Then the similarity between the two feature sets is calculated according to Eq. (4).

\[ \Phi_2(\overrightarrow{IV}_i, \overrightarrow{IV}_{i+1}) = \mathrm{Jac}(\overrightarrow{F}_i, \overrightarrow{F}_{i+1}) = \frac{|\overrightarrow{F}_i \cap \overrightarrow{F}_{i+1}|}{|\overrightarrow{F}_i \cup \overrightarrow{F}_{i+1}|} \tag{4} \]

In Eq. (4), $\overrightarrow{F}_i = \{f_j \mid IV(f_j) > 0.3\}$ indicates the set of features selected from the feature space by the IV value. When the value of $\Phi_2$ is less than the threshold $t$, it can be considered that a concept drift is occurring, because the features relevant to the classification in the two data chunks are dissimilar.
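A minimal Python sketch of the second-layer test of Eq. (4) follows. The function names and the dictionary representation of the IV vector (feature name mapped to IV value) are assumptions made for illustration; returning 1.0 when both selected sets are empty is also a convention chosen here, since Eq. (4) leaves that case undefined.

```python
def select_features(iv_by_feature, iv_min=0.3):
    """Keep the features whose IV value exceeds iv_min, as in Eq. (4)."""
    return {f for f, iv in iv_by_feature.items() if iv > iv_min}


def jaccard(a, b):
    """Jaccard coefficient of two sets; 1.0 for two empty sets (assumed convention)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def phi2(iv_prev, iv_curr, iv_min=0.3):
    """Feature-space similarity of two chunks, following Eq. (4)."""
    return jaccard(select_features(iv_prev, iv_min),
                   select_features(iv_curr, iv_min))


# Toy usage with made-up IV values for two adjacent chunks.
iv_prev = {"size": 0.5, "price": 0.4, "color": 0.05}
iv_curr = {"performance": 0.6, "price": 0.35, "color": 0.1}
similarity = phi2(iv_prev, iv_curr)
print(similarity)
```

In the toy data only "price" is informative in both chunks, so the two selected sets overlap in one of three features.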

4.2.3. The third layer

Only when two data chunks share similar label and feature spaces do we need to detect the difference of the mapping relationship between labels and features, namely C-drift, in the third layer. As mentioned in Section 4.1, there are two sub-categories of this drift, and the detecting steps are based on the values of $\Phi_3$ and $\Phi_4$ respectively.

The computation method of $\Phi_3$ is given below. In order to detect a change of the mapping relationship, we need to drill down the IV vector $\overrightarrow{IV}$ to the IV matrix $R$ (as shown in Eq. (5)), whose elements represent the contribution of each feature to each label.

\[ R = \begin{pmatrix} \overrightarrow{IV} \mid y_1 \\ \overrightarrow{IV} \mid y_2 \\ \vdots \\ \overrightarrow{IV} \mid y_n \end{pmatrix} \tag{5} \]

We first use the IV matrix to select, for each common label, the feature sets of the two chunks, and then compute the similarity of the two sets. Lastly, the average over all common labels is used to detect drifts, as defined in Eq. (6),

\[ \Phi_3(R_i, R_{i+1}) = \frac{\sum_{k=1}^{l} \Phi_2(\overrightarrow{IV}_i \mid y_k, \overrightarrow{IV}_{i+1} \mid y_k)}{l}, \quad y_k \in (Y_i \cap Y_{i+1}) \tag{6} \]

where $l$ indicates the size of $Y_i \cap Y_{i+1}$. When the value of $\Phi_3$ is less than the threshold $t$, it can be considered that a C-drift occurs; otherwise we need to check the special case below, using $\Phi_4$. In order to detect this special concept drift, we need to drill down the IV matrix $R$ along the dimension of feature values to obtain the data cube $V$. The elements of $V$ represent the WoE values for different labels, as shown in Eq. (7),

\[ V = \mathrm{WoE}_{i,j,k}(f_i, value_{ik}, y_j), \quad y_j \in (Y_i \cap Y_{i+1}) \tag{7} \]

where $i$ and $j$ represent the feature and label dimensions respectively, and $value_{ik}$ refers to the $k$-th value of feature $f_i$.

Following the analysis in the upper layers, we only need to consider instances with the same labels and features in the two data chunks. For each pair of common label and feature, we select the common values of the feature and obtain their WoE values to form two value sequences, denoted as $\overrightarrow{WoE}_i$ and $\overrightarrow{WoE}_{i+1}$ respectively. Then we calculate the similarity of $\overrightarrow{WoE}_i$ and $\overrightarrow{WoE}_{i+1}$, and obtain the average value, as shown in Eqs. (8) and (9).

\[ \gamma(\overrightarrow{WoE}_i, \overrightarrow{WoE}_{i+1}) = \mathrm{Jac}(\overrightarrow{WoE}_i, \overrightarrow{WoE}_{i+1}) = \frac{|\overrightarrow{WoE}_i \cap \overrightarrow{WoE}_{i+1}|}{|\overrightarrow{WoE}_i \cup \overrightarrow{WoE}_{i+1}|} \tag{8} \]

\[ \Phi_4(V_i, V_{i+1}) = \mathrm{avg}(\gamma(\overrightarrow{WoE}_i, \overrightarrow{WoE}_{i+1})) \tag{9} \]
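The per-label averaging of Eq. (6) can be illustrated with the Python sketch below. It covers only $\Phi_3$; $\Phi_4$ is left out because the text does not specify how real-valued WoE sequences are matched inside the Jaccard coefficient. The nested-dictionary layout of the IV matrix (label mapped to a feature-to-IV dictionary), the function names and the handling of empty sets are assumptions of this sketch.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; 1.0 for two empty sets (assumed convention)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def phi3(R_prev, R_curr, iv_min=0.3):
    """Average per-label feature-set similarity, following Eq. (6).

    R_prev and R_curr map each label y to a dict {feature: iv(feature, y)}.
    """
    shared = R_prev.keys() & R_curr.keys()           # Y_i ∩ Y_{i+1}
    if not shared:
        # Disjoint label spaces are an A-drift and belong to the first layer.
        raise ValueError("chunks share no label")
    total = 0.0
    for y in shared:
        selected_prev = {f for f, iv in R_prev[y].items() if iv > iv_min}
        selected_curr = {f for f, iv in R_curr[y].items() if iv > iv_min}
        total += jaccard(selected_prev, selected_curr)
    return total / len(shared)


# Toy usage: "small" moves from the positive to the negative label.
R_prev = {"pos": {"small": 0.5, "fast": 0.4}, "neg": {"slow": 0.6}}
R_curr = {"pos": {"fast": 0.45, "large": 0.5}, "neg": {"slow": 0.7, "small": 0.4}}
similarity = phi3(R_prev, R_curr)
print(similarity)
```

The toy data mirrors the "small size" example of Section 4.1: the overall feature sets of the two chunks are similar, but the per-label sets differ, which lowers $\Phi_3$.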

After finishing the concept drifting detection in the above three layers, we can determine the status of concept drifts, such as A-drift, B-drift, C-drift or no drift.

4.3. Algorithm and Analysis

In this subsection, we analyse the time complexity of our approach. In the classification of a text stream, our approach and the competing ones mainly consist of two steps: concept drift detection and training of a classifier. More specifically, our approach detects concept drifts first, then trains a classifier and gets the error rate, while the others train a classifier, get the error rate, and then detect drifts. The classifier training is equivalent in all approaches, so the detection of drifts is what makes them differ in time complexity.

Algorithm 1: IV-Jac: Three-Layer Concept Drift Detection Based on IV and Jaccard
Input: S: streaming data; t: the threshold; c: the size of the ensemble classifier
Output: drift: the number of concept drifts; errRate: the error rate of classification
    read a chunk S_1 from stream S and train classifier C_1 with S_1;
    while S is not at the end do
        read next chunk to S_i;
        train the base classifier C_i from S_i;
        calculate R of S_{i-1} and S_i;
        if Φ1(Y_{i-1}, Y_i) < t then
            get the errRate in S_i using classifier C_{i-1};
            drift++; C = C_i; continue;
        else if Φ2(IV_{i-1}, IV_i) < t then
            get the errRate in S_i using classifier C_{i-1};
            drift++; C = C_i; continue;
        else if Φ3(R_{i-1}, R_i) < t then
            get the errRate in S_i using classifier C_{i-1};
            drift++; C = C_i; continue;
        else if Φ4(V_{i-1}, V_i) < t then
            get the errRate in S_i using classifier C_{i-1};
            drift++; C = C_i; continue;
        end
        get the error rate in S_i using classifier C; // no concept drifts
        update C with C_i;
    end
    return drift and errRate;

The complexity of our three-layer model is analyzed as follows. In the first layer, the main cost is the counting of the label space, and the cost is O(|S_i| + |S_{i+1}|). The computation of R is involved in the second and third layers, and its cost is O(|F| ∗ |v| ∗ |Y|), where |v| refers to the average number of values of a feature. In the second layer, the comparison of two chunks is based on the IV vector, and the cost is O(|F| ∗ |v| ∗ |Y|) + O(|F|). In the third layer, the comparison of two chunks is based on R, and the cost is O(|F| ∗ |v| ∗ |Y|) + O(|F| ∗ |v| ∗ |Y|).

As the data stream arrives, each chunk is first checked in the first layer. If an A-drift is detected in the first layer, our model does not look for other concept drifts and continues with the next data chunk. If no drift is detected in the first layer, our model proceeds to the second layer. In a similar way, the most complicated C-drift is checked in the third layer only if no B-drift is found in the second layer. From the above analysis, our model detects concept drifts from simple to complex categories. Generally, the time complexity of our detection is less than O(|F| ∗ |v| ∗ |Y|) + O(|F|), as C-drift happens rarely. On the other hand, the concept drift detection in the competing algorithms observes the changes of classification error rates, and the time cost is regarded as O(|S_i| + |S_{i+1}|) [5]. Therefore, the time cost of our approach is relatively low in streams consisting of frequent drifts, especially A-drifts and B-drifts.
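The layer-by-layer control flow described above, in which a cheaper layer short-circuits the more expensive ones, can be sketched in Python as follows. This is an illustrative skeleton only: `detect_drift` and the representation of each layer as a (name, similarity function) pair are hypothetical, and classifier training and error-rate bookkeeping from Algorithm 1 are deliberately left out.

```python
from typing import Callable, Optional, Sequence, Tuple

# A layer is a drift name plus a similarity function over two chunks.
Layer = Tuple[str, Callable[[object, object], float]]


def detect_drift(prev_chunk, curr_chunk,
                 layers: Sequence[Layer], t: float) -> Optional[str]:
    """Run the layers in order and return the name of the first one whose
    similarity falls below the threshold t, or None if no drift is found.

    Later (more expensive) layers are never evaluated once a drift is found.
    """
    for name, similarity in layers:
        if similarity(prev_chunk, curr_chunk) < t:
            return name
    return None


# Toy usage with constant similarities standing in for the real measures.
toy_layers = [
    ("A-drift", lambda prev, curr: 0.9),   # label spaces look alike
    ("B-drift", lambda prev, curr: 0.2),   # informative features differ
    ("C-drift", lambda prev, curr: 0.1),   # not reached in this example
]
result = detect_drift(None, None, toy_layers, t=0.5)
print(result)
```

Because the loop returns at the first layer that signals a drift, the cost of a chunk pair is bounded by the cost of the layers actually visited, which is the property the complexity discussion above relies on.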

As for the accuracy of classification, our approach detects concept drifts based on the label and feature spaces, while the competing baselines are based on the classification error rates. The baselines can detect concept drifts only when enough misclassified instances have accumulated. If concept drifts happen frequently, some drifts will be missed, which leads to a loss of accuracy. Therefore, our method can be expected to perform better in streams with frequent drifts.

5. The Experiments

In this section, we study the effectiveness of our approach from three perspectives: parameter analysis on the threshold t, experimental comparisons of concept drifting detection results, and classification performance.

5.1. Data Sets and Baselines

In our experiments, five benchmark data sets are used: Kddcup, Amazon shopping, 20-Newsgroups, Reuters-21578 and Twitter. Details of the five data sets are summarized in Table 2. In each data set, we select a window-size number of instances from a sub-category as a data chunk, and two adjacent chunks from different categories are regarded as one concept drift. In order to investigate the performance of the concept drifting detection of A-, B- and C-drifts, these data sets contain these categories of drifts. Because of the high dimensionality of text streams, we filter out the features whose frequency is less than 5 in a pre-processing step.

Kddcup¹ is a data set of intrusion detection. It contains 490000 instances with 40 dimensions and 40 labels. The data after pre-processing consists of 2450 chunks and 75 concept drifts, and the dimension of features is 40.

Amazon shopping data² is obtained by crawling web pages from the Internet. It contains 6400 instances of four types of products: books, electronics, DVD and kitchen. Each product contains 1600 instances. The dimension of this data set is 401971, and we filter out words with low frequency to form a 1951-dimension feature space. We randomly select a window of 200 instances from a sub-category (such as books) to form a data chunk, and repeat this many times to form a stream. In the stream, adjacent chunks from different sub-categories are considered as a concept drift, and chunks from the same sub-category as no concept drift. In Amazon, we select data from different categories randomly and get a text stream with 3 concept drifts.

20-Newsgroups³ is a data set of news. It contains 10000 instances with 16 sub-categories and 67933-dimension features. In a similar way, we filter out features with low frequency and get 4989-dimension features. In addition, we select 200 instances from one category as a chunk and randomly select data from different sub-categories to form concept drifts. This text stream consists of 45 chunks and 33 drifts.

Reuters-21578⁴ is a data set of news commentary from Reuters. It contains 9000 samples with 19710-dimension features and 5 labels. We pre-process this data set and form the concept drifts in the same way as above. We get a text stream consisting of 45 chunks with 1378-dimension features and 17 concept drifts.

¹ Http://kdd.ics.uci.edu/databases/kddcup99
² Http://www.cs.jhu.edu/~mdredze/datasets/sentiment
³ Http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html
⁴ Http://www.cse.ust.hk/TL/index.html

Table 2: Details of Data Sets

| Data Set | Category Name | Size | Category Name | Size | Total Size / Dimension | Chunks / Dimension | Times of Concept Drifts |
|---|---|---|---|---|---|---|---|
| Kddcup | intrusion detection | 490000 | / | / | 490000/40 | 2470/40 | 75 |
| Amazon | books | 1600 | electronics | 1600 | 6400/401971 | 32/1951 | 3 |
| | dvd | 1600 | kitchen | 1600 | | | |
| 20-Newsgroups | comp.graphics | 580 | rec.autos | 594 | 10000/67933 | 45/4989 | 33 |
| | comp.ibm.hardware | 587 | rec.motorcycles | 594 | | | |
| | comp.os.ms-windows | 591 | rec.sport.baseball | 597 | | | |
| | comp.sys.mac.hardware | 578 | rec.sport.hockey | 600 | | | |
| | comp.graphics | 580 | talk.religion.misc | 377 | | | |
| | sci.crypt | 595 | talk.politics.guns | 546 | | | |
| | sci.electronics | 591 | talk.politics.mideast | 564 | | | |
| | sci.med | 594 | talk.politics.misc | 465 | | | |
| Reuters | exchanges | 428 | Orgs | 594 | 9000/19710 | 45/1378 | 17 |
| | people | 818 | places | 16568 | | | |
| | topics | 9296 | / | / | | | |
| Twitter | arsenal | 515017 | blackfriday | 137767 | 1814179/212915 | 45/88571 | 4 |
| | chelsea | 647879 | obama | 112155 | | | |
| | smart phone | 401361 | / | / | | | |

AC

CE

PT

ED

M

Twitter is a data set crawled from Twitter with some keywords in [29]. The sub-categories are divided by keywords, including “arsenal”, “blackfriday”, “chelsea”, “obama” and “smart phone”. It contains 1814179 instances and 212915-dimension features in total. We pre-process the data set and get a text stream with 25 chunks, 88571-dimension features and 4 drifts.

Among these data sets, Kddcup is a classical data set for data streams, and the others are text streams containing A-, B- and C-drifts. In addition, the concept drifts in 20-Newsgroups are the most frequent, while there are few concept drifts in the Amazon data set. We verify the effectiveness of our method on these different types of text streams.

We adopt three measures, namely false alarm, missing and delay [27], to evaluate concept drifting detection. A false alarm means that the algorithm detects a concept drift between two chunks where none occurs. Missing means that a concept drift occurs between two chunks but the algorithm does not detect it. Delay refers to the distance between the position where a concept drift occurs and the position where it is detected; we usually use the average distance over the drift chunks to evaluate the delay.

In our experiments, we adopt an ensemble classifier, as in the works [6, 26, 19, 30]. Decision Tree (DT), Naive Bayes (NB) and LibSVM (SVM) are selected as the base classifiers. We select five state-of-the-art concept drifting detectors for data streams as the baselines. DDM [10, 11] and EDDM [18] are widely used in concept drifting detection, and most articles use them as baseline algorithms. PageHinkleyDM [19] detects concept drifts based on the Page Hinkley Test. HDDM-W-Test and HDDM-A-Test [16] perform online drift detection based on Hoeffding bounds, using the weighted average bound and the average bound respectively.
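The three evaluation measures can be computed from the positions of the true drifts and the positions at which a detector fires. The sketch below is one possible reading, not the evaluation code used in the paper: it assumes that each true drift is matched to the first detection at or after it and before the next true drift, that every other detection counts as a false alarm, and that positions are expressed in a common unit such as chunk indices. The function name `evaluate_detector` is hypothetical.

```python
def evaluate_detector(true_drifts, detections, stream_length):
    """Return (false_alarms, missing, mean_delay) for one detector run.

    Assumed matching rule: a true drift at position p is detected if some
    detection d satisfies p <= d < next true drift (or end of stream).
    The first such d gives the delay d - p; extra detections in the same
    segment, and detections before the first drift, are false alarms.
    """
    true_drifts = sorted(true_drifts)
    detections = sorted(detections)

    # Detections before any drift has occurred are false alarms.
    first = true_drifts[0] if true_drifts else stream_length
    false_alarms = sum(1 for d in detections if d < first)

    missing = 0
    delays = []
    for i, start in enumerate(true_drifts):
        end = true_drifts[i + 1] if i + 1 < len(true_drifts) else stream_length
        hits = [d for d in detections if start <= d < end]
        if hits:
            delays.append(hits[0] - start)
            false_alarms += len(hits) - 1
        else:
            missing += 1

    mean_delay = sum(delays) / len(delays) if delays else 0.0
    return false_alarms, missing, mean_delay


# Hypothetical run: drifts at 10, 20 and 30; the detector fires four times.
result = evaluate_detector([10, 20, 30], [5, 10, 13, 22], stream_length=40)
print(result)  # (2, 1, 1.0)
```

In this example the detections at 5 and 13 are false alarms, the drift at 30 is missed, and the delays of the two detected drifts are 0 and 2.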
All competing drift detectors are from the open source experimental platform MOA [31]. All of their parameters are set to the default values according to the original works. Our approach uses the IV index to detect concept drifts, with Jaccard used to measure the similarity, so we denote our algorithm as IV-Jac. Regarding parameter settings, the size of a data chunk is set to 200, namely n=200, and the size of the ensemble classifier in IV-Jac is set to 4. We read and process each data chunk once. Specifically, we treat the previous four data chunks as training data and the current data chunk as test data, so that an ensemble classifier with four base classifiers predicts the current data chunk. We thus get an error rate for each data chunk except the first one in the data stream. In our approach, each chunk is read and used to train a base classifier, and several base classifiers form the ensemble classifier. When a concept drift is detected, we clear the ensemble classifier and put the new base classifier into it. Otherwise, the new base classifier is simply added to the ensemble classifier. In our experiments, we use the average error rate over all data chunks to illustrate the effectiveness of the algorithms.

All experiments are conducted on a P4, 2.5GHz PC with 8G main memory, running Windows XP Professional with the program platform of Eclipse Jdk1.7 in Windows 7. Weka3.6 is imported in our experiment setup in order to use the base classifiers DT, NB and SVM.

5.2. Parameter setting

Figure 3: The missing and false detection with the varying of threshold t


The threshold of concept drift t is critical for our approach, so we conduct experiments to find a suitable value of t. In this paper, t is a threshold on the similarity used in concept drifting detection, and it takes values in [0, 1]. If t is set to 1, the similarity between any two chunks cannot be greater than t, in which case no concept drift is reported, namely drifts are missed. Conversely, if t is set to 0, a concept drift is reported between any two chunks, namely false alarms occur. We want both few false alarms and few missed drifts, and hence need to balance the trade-off between them. Fig. 3 shows the numbers of false alarms and missing points in concept drifting detection when varying the value of t. It can be seen from Fig. 3 that the number of false alarms increases obviously as t grows, and the number of missing drifts increases as t decreases. We select 0.5 as the optimal value of t.

5.3. The detection of concept drifts

Table 3 summarizes the concept drift detection results of the 6 approaches. It can be seen from Table 3 that our IV-Jac model is superior to the others in general. The superiority is more obvious on 20-Newsgroups (where concept drifts occur frequently) than on Amazon (where concept drifts occur rarely). On the 20-Newsgroups data, IV-Jac misses fewer drifts than the others: its missing number is 0 and its number of false alarms is 1. On one hand, this is because the baselines


Table 3: ACCURACY ANALYSIS OF CONCEPT DRIFT DETECTION EDDM NB SVM

DT

4 10 0

20 45 100

21 47 103

20 46 97

24 45 109

20 46 102

0 0 0

0 0 30

0 0 40

0 0 30

0 0 47

0 0 68

1 0 0

0 13 127

0 15 142

0 14 87

0 14 132

0 14 116

3 7 0

1 10 120

0 9 115

1 9 120

1 10 85

1 10 97

8 0 0

0 1 37

0 1 37

0 1 37

0 1 47

0 1 47

DT

PageHinkleyDM DT NB SVM Kddcup 18 5 7 5 46 71 71 70 98 93 103 115 Amazon 0 0 0 0 0 3 3 3 47 0 0 0 20-Newsgroups 0 0 1 1 13 27 28 28 81 74 71 184 Reuters 1 0 0 0 10 17 17 17 100 0 0 0 Twitter 0 1 1 0 1 3 3 1 47 145 145 145

HDDM-A-Test DT NB SVM

HDDM-W-Test DT NB SVM

43 45 94

41 45 96

33 43 99

32 46 101

31 47 101

0 0 3

0 0 0 0 5.33 3

0 0 10

0 0 10

0 0 10

0 13 108

1 14 114

1 14 126

1 14 121

1 14 125

2 10 63

2 10 64

0 8 68

2 10 66

2 10 69

0 1 3

0 1 3

0 1 14

0 1 14

1 3 14

1 14 115 0 8 63 0 1 3

42 48 95


DDM NB SVM

IVJac


Index DataSet False Miss Delay DataSet False Miss Delay DataSet False Miss Delay DataSet False Miss Delay DataSet False Miss Delay


are based on error rates of classification, so concept drifts, especially C-drifts (triggered by the mapping relationship between a feature and a label), cannot be reflected in the error rate timely and accurately. On the other hand, a drift can only be detected once enough misclassified instances have accumulated. If concept drifts occur frequently, the number of misclassified instances will not be enough, which leads to more missing drifts. When concept drifts occur infrequently, such as in the Amazon data, IV-Jac performs similarly to the baselines. The delay of IV-Jac is better than that of the baselines. This is because, as a new data chunk comes, our approach detects concept drifts first and then classifies instances, while the baselines first get the error rate of classification and then detect the concept drifts. Therefore, the delay of the baselines is larger than that of IV-Jac. Finally, it should be mentioned that the false alarms of IV-Jac are a little higher than those of the baselines, especially on the Twitter data. In Twitter, the false alarms occur in the sub-categories “Arsenal” and “Chelsea”. In the “Arsenal” sub-category, the instances focus on many topics, including the super-star, the management of the team and so on. The keyword “Chelsea” is ambiguous, as it can mean a football team as well as a place. Unclear topics in the data set may be the main reason for the high false alarms of IV-Jac.

5.4. The results of classification


Table 4 gives the classification performance of six algorithms based on different detection mechanisms. In Table 4, the average error rate is the average over all data chunks, not over the results of multiple runs. We also give the standard deviation of these error rates. It can be seen that our IV-Jac approach is better than the others on average, especially on 20-Newsgroups and Twitter. On 20-Newsgroups, our IV-Jac approach reduces the error rate to 26.76%, and this result is far lower than the others. The reason is that IV-Jac has fewer missing drifts and less delay, which makes the classification error decrease. More specifically, when IV-Jac detects a concept drift, it trains the classifier and predicts based on the incoming data instead of the last chunk, so that the classification error is low. The results on Twitter show that IV-Jac performs well on high-dimensional data. As for Amazon (in which concept drifts are scarce), IV-Jac has the same performance as the baselines. However, the result of IV-Jac on Kddcup is inferior to the baselines, because this data set is not a text stream and its data distribution is not consistent with the categories of drifts and the three-layer model. In conclusion, our IV-Jac approach is more effective than the competing algorithms.

Table 4: CLASSIFICATION PERFORMANCE ANALYSIS (%), average error rates of classification

Data Set       Classifier  DDM          EDDM         PageHinkleyDM  HDDM-A-Test  HDDM-W-Test  IV-Jac
Kddcup         DT          2.37±11.21   2.10±10.65   3.53±14.07     2.00±9.87    2.07±10.20   3.50±13.96
               NB          2.21±10.86   2.05±10.47   3.40±13.69     1.83±9.56    1.99±10.02   3.37±13.65
               SVM         2.49±12.23   2.48±12.31   3.61±14.75     2.50±11.98   2.44±11.85   3.62±14.81
Amazon         DT          9.68±30.05   9.68±30.05   24.19±35.64    9.68±30.05   9.68±30.05   9.68±30.05
               NB          15.74±26.55  15.74±26.55  28.41±31.89    15.74±26.55  15.74±26.55  15.74±26.55
               SVM         9.68±30.05   9.68±30.05   24.19±35.64    9.68±30.05   9.68±30.05   9.68±30.05
20-Newsgroups  DT          33.42±27.06  33.42±27.06  67.95±18.96    35.88±23.51  35.88±23.51  27.76±27.12
               NB          43.20±21.75  44.67±24.05  69.77±13.99    40.91±18.69  41.34±18.12  36.67±20.06
               SVM         46.00±39.96  46.00±39.96  75.13±22.01    57.23±33.15  57.23±33.15  39.66±41.53
Reuters        DT          36.13±37.85  37.35±38.48  52.40±35.13    35.83±39.03  35.83±39.03  34.69±40.29
               NB          36.94±37.64  38.61±37.66  53.27±34.14    38.36±37.76  38.36±37.76  35.77±39.48
               SVM         36.41±38.79  38.12±38.8   52.86±35.3     36.91±38.25  36.91±38.25  35.17±40.66
Twitter        DT          22.57±39.2   22.57±39.2   28.13±41.91    22.57±39.2   22.57±39.2   16.67±38.07
               NB          22.61±39.2   22.61±39.2   28.16±41.87    22.61±39.2   22.61±39.2   16.72±38.05
               SVM         22.57±39.2   22.57±39.2   28.13±41.91    22.57±39.2   22.57±39.2   16.67±38.07

In addition, the performance of PageHinkleyDM is much worse on 20-Newsgroups (75.13%) and Reuters (53.27%), and we analyze the reasons below. PageHinkleyDM predicts incoming data chunks using the last classifier. Moreover, frequent concept drifts leave few training instances to update the classifier. Both lead to a higher error rate.

In Table 4, we also give the standard deviation of these error rates. It can be seen that the deviations of our algorithm and of the baselines all seem to be large. For an offline data set, the deviation is obtained from several runs on the same data set, while for our data streams, the deviation is obtained over many different data chunks. When a concept drift occurs, the error rate becomes very large, which makes the deviations of all algorithms, including ours and the baselines, large. We give Fig. 4 to show this process more clearly.

Detailed concept drifting detection curves on 20-Newsgroups are given in Fig. 4 as an example to verify the effectiveness of our IV-Jac approach. In this figure, we only give parts of the stream data and some representative drift detecting points (A, B, C, D and E). In addition, the experimental results of DDM and EDDM are similar, so we only show EDDM in Fig. 4 for clarity. Likewise, of HDDM-A-Test and HDDM-W-Test, we only show HDDM-A-Test. As shown in Fig. 4, at point A (the 6th chunk), IV-Jac and HDDM-A-Test detect the concept drift correctly, but EDDM and PageHinkleyDM miss this drift. At point B (the 10th chunk), IV-Jac and EDDM detect the concept drift correctly, while PageHinkleyDM and HDDM-A-Test do not. Further, at points C and E, only IV-Jac detects the concept drift correctly. At point D, all algorithms except PageHinkleyDM detect the concept drift. Meanwhile, at all points, IV-Jac has the lowest error rate. Therefore, we can conclude that our IV-Jac approach detects concept drifts correctly, with the fewest missing drifts, and adapts to the new data chunk faster with a lower error rate.

Figure 4: The details of error rate in NB on part of 20-newsgroups data

Moreover, in order to evaluate the performance of IV-Jac on streams with frequent drifts, we observe the classification results as the number of concept drifts increases in Fig. 5 and Fig. 6. We read the data chunks with different mechanisms and get data streams containing different numbers of concept drifts. It can be seen from Fig. 5 and Fig. 6 that IV-Jac has the best performance when the number of concept drifts falls in a moderate range, such as [7,23] in Fig. 5 and [4,19] in Fig. 6. In real-world applications, both the case of no concept drift occurring and the case of each chunk being a concept drift are rare. Therefore, we can conclude that IV-Jac performs better in most cases.

Figure 5: The error rate based on NB with the increasing of drifts on Amazon

5.5. The time cost


In this subsection, we compare the time cost of all algorithms, as shown in Table 5. Based on the analysis in Section 4.3, the time consists of detecting and classifying. In Table 5, we show the average time (detecting and classifying) and its deviation over the data chunks in each data stream. To keep the table at a comfortable size, the total times of the baselines (DDM, PageDM and HDDAT) are not listed. The total time of each baseline equals its classify time, since its detect time is 0.

Figure 6: The error rate based on NB with the increasing of drifts on Twitter

Table 5: Comparison of Time Cost (×10^3 ms)

Classifier  Data Set     Detect IV    Detect DDM  Detect PageDM  Detect HDDAT  Classify IV   Classify DDM  Classify PageDM  Classify HDDAT  Total IV
SVM         Amazon       0.23±0.04    0           0              0             0.15±0.05     0.26±0.09     0.27±0.04        0.25±0.08       0.38±0.05
SVM         20newsgroup  0.33±0.06    0           0              0             0.19±0.06     0.40±0.18     0.53±0.07        0.40±0.17       0.52±0.11
SVM         Reuters      0.14±0.03    0           0              0             0.12±0.06     0.21±0.09     0.26±0.03        0.2±0.09        0.26±0.07
SVM         Twitter      16.54±3.85   0           0              0             4.45±2.75     9.24±3.63     10.53±3.61       8.75±3.44       20.99±6.13
NB          Amazon       0.23±0.04    0           0              0             2.34±0.81     2.34±0.81     2.83±0.48        2.27±0.78       2.57±0.79
NB          20newsgroup  0.33±0.06    0           0              0             9.49±3.39     13.65±6.25    27.15±3.89       12.52±5.25      9.83±3.42
NB          Reuters      0.14±0.03    0           0              0             1.79±0.98     2.06±0.88     2.88±0.43        1.92±0.86       1.93±0.98
NB          Twitter      16.54±3.85   0           0              0             118.33±73.21  163.2±64.17   183.58±62.87     155.75±61.23    134.87±76
DT          Amazon       0.23±0.04    0           0              0             0.04±0.01     0.11±0.04     0.11±0.02        0.11±0.04       0.26±0.09
DT          20newsgroup  0.33±0.06    0           0              0             0.09±0.03     0.52±0.21     0.51±0.07        0.52±0.22       0.43±0.08
DT          Reuters      0.14±0.03    0           0              0             0.18±0.1      0.19±0.08     0.17±0.02        0.17±0.07       0.33±0.1
DT          Twitter      16.54±3.85   0           0              0             3.77±2.33     8.49±3.34     9.58±3.28        9.12±3.58       20.31±5.75

It can be seen from Table 5 that the detect time of IV-Jac is more than that of the baselines (which is almost 0), which is consistent with the analysis in Section 4.3. As for the classify time, IV-Jac is superior to the baselines. The main reason is that IV-Jac can detect most concept drifts, while the baselines may miss some of them. In these cases, IV-Jac only uses the newest base classifier to get the error rate, while the baselines use the ensemble classifier (including several base classifiers), which consumes more time. In sum, IV-Jac costs more time than the baselines. To narrow the gap in time cost between our method and the baselines, ignoring the C-drifts, which occur relatively rarely, may help to improve the efficiency while maintaining the classification performance, especially in data streams where drifts occur frequently. In addition, the deviation of the detect time of IV-Jac is larger than that of the others, because the time costs for A-, B- and C-drifts are different. The deviations of the classify time of all algorithms are large, as they are obtained over different data chunks.
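The cost argument above (after a detected drift the ensemble is cleared and holds a single base classifier, so predicting a chunk needs fewer base-classifier calls) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the toy `MajorityLabel` learner, the majority-vote rule, the policy of dropping the oldest member when the ensemble is full, and the method name `evaluations_per_chunk` are all hypothetical.

```python
from collections import Counter


class MajorityLabel:
    """Toy base classifier: always predicts its most frequent training label."""

    def __init__(self, labels):
        self.label = Counter(labels).most_common(1)[0][0]

    def predict(self, instance):
        return self.label


class ChunkEnsemble:
    """Keeps up to max_size base classifiers, one trained per chunk."""

    def __init__(self, max_size=4):
        self.max_size = max_size
        self.members = []

    def update(self, new_member, drift_detected):
        if drift_detected:
            # Clear the ensemble and keep only the newest base classifier.
            self.members = [new_member]
        else:
            # Add the new base classifier; dropping the oldest one when the
            # ensemble is full is an assumption of this sketch.
            self.members = (self.members + [new_member])[-self.max_size:]

    def predict(self, instance):
        # Majority vote over members (the voting rule is an assumption).
        votes = Counter(m.predict(instance) for m in self.members)
        return votes.most_common(1)[0][0]

    def evaluations_per_chunk(self, chunk_size):
        # One base-classifier call per member per instance.
        return len(self.members) * chunk_size


ensemble = ChunkEnsemble(max_size=4)
for labels in (["a"] * 3, ["a"] * 3, ["a", "a", "b"], ["a"] * 3, ["a"] * 3):
    ensemble.update(MajorityLabel(labels), drift_detected=False)
full_cost = ensemble.evaluations_per_chunk(200)

# A drift is detected: only the classifier trained on the new chunk remains.
ensemble.update(MajorityLabel(["b", "b", "b"]), drift_detected=True)
reset_cost = ensemble.evaluations_per_chunk(200)
print(full_cost, reset_cost, ensemble.predict(None))
```

With an ensemble size of 4 and chunks of 200 instances, a full ensemble makes four times as many base-classifier calls per chunk as a freshly reset one, which is the effect the classify-time comparison refers to.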


6. CONCLUSIONS


It is a challenge to detect concept drifts in text streams because of their sparseness and high dimensionality. Most concept drifting detection methods are based on the error rates of classification, so their performance closely depends on the classifiers. In this paper, we proposed a three-layer concept drifting detection method which is independent of classifiers. The three-layer model is built to detect different types of concept drifts layer by layer, which can further improve the efficiency. Experimental results demonstrated the effectiveness and efficiency of our approach in concept drifting detection. Meanwhile, in our approach, the threshold value is directly related to the sensitivity of the results. In our future work, we hope to design a self-adaptive parameter based on history data chunks. In addition, we will extend our three-layer model to unlabeled text streams.

7. ACKNOWLEDGEMENTS


This work is supported in part by the Research and Development Program of China under grant 2016YFB1000901, the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China, under grant IRT13059, and the Natural Science Foundation of China under grants 61503112 and 61673152.

References


[1] J. Scott, Social network analysis, Sage, 2012.
[2] Y. Wang, S. C. F. Chan, H. V. Leong, G. Ngai, N. Au, Multi-dimension reviewer credibility quantification across diverse travel communities, Knowledge and Information Systems 49 (3) (2016) 1071–1096.
[3] G. Drzadzewski, F. W. Tompa, Partial materialization for online analytical processing over multi-tagged document collections, Knowledge and Information Systems 47 (3) (2016) 697–732.
[4] G. Widmer, M. Kubat, Learning in the presence of concept drift and hidden contexts, Machine learning 23 (1) (1996) 69–101.
[5] H.-L. Nguyen, Y.-K. Woon, W.-K. Ng, A survey on data stream clustering and classification, Knowledge and information systems 45 (3) (2015) 535–569.
[6] A. Bifet, E. Frank, Sentiment knowledge discovery in twitter streaming data, in: International Conference on Discovery Science, Springer, 2010, pp. 1–15.
[7] H. R. Loo, M. N. Marsono, Online data stream classification with incremental semi-supervised learning, in: Proceedings of the Second ACM IKDD Conference on Data Sciences, ACM, 2015, pp. 132–133.
[8] A. Tsymbal, The problem of concept drift: definitions and related work, Computer Science Department, Trinity College Dublin 106 (2).
[9] H. Wang, W. Fan, P. S. Yu, J. Han, Mining concept-drifting data streams using ensemble classifiers, in: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2003, pp. 226–235.
[10] J. Gama, P. Medas, G. Castillo, P. Rodrigues, Learning with drift detection, in: Brazilian Symposium on Artificial Intelligence, Springer, 2004, pp. 286–295.
[11] J. Gama, G. Castillo, Learning with local drift detection, in: International Conference on Advanced Data Mining and Applications, Springer, 2006, pp. 42–55.
[12] P. Zhang, X. Zhu, Y. Shi, Categorizing and mining concept drifting data streams, in: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2008, pp. 812–820.
[13] C. C. Aggarwal, On change diagnosis in evolving data streams, IEEE Transactions on Knowledge and Data Engineering 17 (5) (2005) 587–600.
[14] L. Rutkowski, M. Jaworski, L. Pietruczuk, P. Duda, A new method for data stream mining based on the misclassification error, IEEE transactions on neural networks and learning systems 26 (5) (2015) 1048–1059.
[15] P. Li, X. Wu, X. Hu, H. Wang, Learning concept-drifting data streams with random ensemble decision trees, Neurocomputing 166 (2015) 68–83.
[16] I. Frías-Blanco, J. del Campo-Ávila, G. Ramos-Jiménez, R. Morales-Bueno, A. Ortiz-Díaz, Y. Caballero-Mota, Online and non-parametric drift detection methods based on Hoeffding's bounds, IEEE Transactions on Knowledge and Data Engineering 27 (3) (2015) 810–823.
[17] T. Joachims, Text categorization with support vector machines: Learning with many relevant features, in: European conference on machine learning, Springer, 1998, pp. 137–142.
[18] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldà, R. Morales-Bueno, Early drift detection method, in: Fourth international workshop on knowledge discovery from data streams, Vol. 6, 2006, pp. 77–86.
[19] J. Gama, R. Sebastião, P. P. Rodrigues, On evaluating stream learning algorithms, Machine learning 90 (3) (2013) 317–346.
[20] M. J. Hosseini, A. Gholipour, H. Beigy, An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams, Knowledge and Information Systems 46 (3) (2016) 567–597.
[21] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Computing Surveys (CSUR) 46 (4) (2014) 44.
[22] Y. Yang, X. Wu, X. Zhu, Combining proactive and reactive predictions for data streams, in: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, ACM, 2005, pp. 710–715.
[23] M. Severo, J. Gama, Change detection with kalman filter and cusum, in: International Conference on Discovery Science, Springer, 2006, pp. 243–254.
[24] A. Bouchachia, C. Vanaret, Gt2fc: An online growing interval type-2 self-learning fuzzy classifier, IEEE Transactions on Fuzzy Systems 22 (4) (2014) 999–1018.
[25] Q. Zhu, X. Hu, Y. Zhang, P. Li, X. Wu, A double-window-based classification algorithm for concept drifting data streams, in: Granular Computing (GrC), 2010 IEEE International Conference on, IEEE, 2010, pp. 639–644.
[26] P.-P. Li, X. Wu, X. Hu, Mining recurring concept drifts with limited labeled streaming data, in: ACML, 2010, pp. 241–252.
[27] M. Hababou, A. Cheng, R. Falk, Variable selection in the credit card industry, NESUG Proceedings: Statistics and Pharmacokinetics.
[28] J. Gama, R. Sebastião, P. P. Rodrigues, Issues in evaluation of stream learning algorithms, in: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2009, pp. 329–338.
[29] Z. Wang, L. Shou, K. Chen, G. Chen, S. Mehrotra, On summarization and timeline generation for evolutionary tweet streams, IEEE Transactions on Knowledge and Data Engineering 27 (5) (2015) 1301–1315.
[30] E. Ikonomovska, J. Gama, S. Džeroski, Online tree-based ensembles and option trees for regression on evolving data streams, Neurocomputing 150 (2015) 458–470.
[31] A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, Moa: Massive online analysis, Journal of Machine Learning Research 11 (May) (2010) 1601–1604.


Yuhong Zhang is an associate professor at Hefei University of Technology, China. She received her B.S., M.S. and Ph.D. degrees from Hefei University of Technology in 2001, 2004 and 2011 respectively. Her research interests are in transfer learning, data stream classification and data mining.

Guang Chu is a postgraduate student at Hefei University of Technology, China. He received his B.S. from Anhui University in 2013. His research interests are text classification and data stream classification.

Peipei Li is currently a post-doctoral fellow at Hefei University of Technology, China. She received her B.S., M.S. and Ph.D. degrees from Hefei University of Technology in 2005, 2008 and 2013 respectively. She was a research fellow at Singapore Management University from 2008 to 2009. She was a student intern at Microsoft Research Asia between Aug. 2011 and Dec. 2012. Her research interests are in data mining and knowledge engineering.

Xuegang Hu is a professor at the School of Computer Science and Information Engineering, Hefei University of Technology, China. He received his B.S. degree from the Department of Mathematics at Shandong University, China, and his M.S. and Ph.D. degrees in Computer Science from the Hefei University of Technology, China. He is engaged in research in data mining and knowledge engineering.

Xindong Wu is a Professor of Computer Science at the University of Louisiana at Lafayette (USA), and a Fellow of the IEEE and the AAAS. He holds a Ph.D. in Artificial Intelligence from the University of Edinburgh, Britain. He is the founder and current Steering Committee Chair of the IEEE International Conference on Data Mining, the founder and current Editor-in-Chief of Knowledge and Information Systems, and an Editor-in-Chief of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He was the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering from 2005 to 2008. His research interests include data mining, Big Data analytics, knowledge-based systems, and Web information exploration. He has published over 300 refereed papers in these areas in various journals and conferences, including IEEE TPAMI, TKDE, ACM TOIS, DMKD, KAIS, IJCAI, AAAI, ICML, KDD, ICDM, and WWW, as well as 36 books and conference proceedings.
