Handling Wireless Sensor Network by applying Dynamic Sampling in Surveillance System

S. Ancy, Assistant Professor, Information Technology, Jeppiaar Institute of Technology, [email protected]
D. Paulraj, Professor, Computer Science and Engineering, R M K College of Engineering and Technology, [email protected]
Abstract

Wireless Sensor Networks (WSNs) are widely employed in numerous real-time applications across almost all domains, such as remote monitoring, healthcare, object tracking, and security and surveillance. With the availability of a broad range of applications for Big Data streaming, both class imbalance and concept drift have become crucial learning issues, and concept drift handling solutions are sensitive to class imbalance. Sampling techniques are widely applied to process continuously arriving data streams with a sufficient number of instances, and the selected instances must support statistical inference over an imbalanced class distribution. A stream data classification model without concept drift adaptation is not suitable for imbalanced class distributions. To solve these issues, this article presents a dynamic sampling and ensemble classification technique named Handling Imbalanced Data with Concept Drift (HIDC). To provide high statistical precision over imbalanced class distributions with concept drift, HIDC decides an optimal reservoir size using metrics on the statistical properties of the stream data and a control parameter. The former refers to the inequality level in the values of instances arriving from a source, and the latter controls the selection of instances from multiple sources. With the allocated optimal reservoir size, HIDC applies random sampling over the imbalanced classes and chooses a set of instances from multiple sources. Because random sampling alone cannot resolve the imbalanced distribution among the existing classes, HIDC additionally applies resampling techniques guided by an imbalance factor. To identify and adapt to new concepts, the proposed HIDC model trains a candidate classifier and replaces the worst ensemble member with it. The experimental results show that HIDC performs better sampling and mining over imbalanced class distributions with concept drifts.

Keywords: Imbalanced Class Distribution, Concept Drifts, Sampling, Resampling, Optimal Reservoir Size, Ensemble Classification.
1. Introduction

The rapid growth of information technology proliferates the generation of vast volumes of high-velocity data, and the velocity of Big Data streaming surpasses the processing capacity of traditional systems. Due to limited memory space, a preprocessing task is essential to scan the instances in only one pass and select a small set of samples from the stream. The main aim of the sampling process is to choose a portion of the data stream that behaves like the whole [1][2]. In stream data processing, the imbalanced data phenomenon is vital, since it appears in various domains such as weather prediction, anomaly detection, and social media mining. Class imbalance arises when the number of instances representing one class is much higher than the others [3]. The classes holding most of the instances are called majority classes, whereas the other classes are called minority classes. In stream data classification, the majority class overwhelms the learner and the minority classes are ignored. An appropriate solution to balance the class distribution is a careful preprocessing step. Improper allocation of reservoir size for stream data from various sources makes the imbalance problem even worse, and data sampling over an imbalanced stream biases the learning model towards the majority classes. Resampling is widely employed to balance the sample set, either by eliminating instances of the majority classes or by adding past instances to the minority classes, which are known as undersampling and
oversampling respectively. However, the sensitivity of learning accuracy to class imbalance mainly depends on the distribution of minority classes over the data stream and the degree of overlap between the classes. Class imbalance is not the only problem in learning the whole data stream. Most data stream classification models do not address concept drift in imbalanced streams, whereas drift detection methods are designed under the assumption that the streams are balanced. Concept drift denotes a change in the distribution of instances over time, which causes a significant issue in stream data analysis [4][5]. To solve these issues simultaneously, an imbalance factor is estimated to add a bias based on the density of instances among the minority and majority classes, and the classifier must decide whether an explicit concept drift has occurred during stream classification. The concept drift detector therefore needs to determine when to change a classification member in response to sudden or gradual concept drifts. Moreover, there is a gap in the field of class imbalance learning with concept drift. Thus, the proposed work presents a hybrid approach called Handling Imbalanced Data with Concept Drift using an ensemble classifier model (HIDC), which integrates optimal reservoir allocation, random sampling, and resampling techniques without disturbing the concept drift handling solutions. The contributions of the proposed work are as follows.
- The proposed HIDC collects samples from Big Data streaming that represent the characteristics of the entire data even when the classes are imbalanced and the concepts drift.
- The statistical parameters of the data stream, namely the mean and standard deviation, together with the power of allocation in optimal reservoir allocation, control the admission of instances from multiple sources and prevent the imbalance of the class distribution from growing after sampling.
- By estimating the imbalance factor of the class distribution, HIDC makes a dynamic decision on the oversampling and undersampling processes and effectively handles the imbalanced class distribution.
- By modeling and evaluating the candidate classifier separately, HIDC replaces the worst member of the ensemble during either sudden or gradual concept drift.
- The performance evaluation of the proposed HIDC confirms its efficiency in handling imbalanced class distribution with concept drift.

1.1 Paper Organization
The rest of the paper is organized as follows: Section 2 reviews previous work related to data stream classification. Section 3 presents the proposed methodology. Section 4 demonstrates the experimental results of the proposed system, and Section 5 concludes the paper.
2. Related Works

Most Big Data streaming applications face issues related to class imbalance and concept drift [6]. Conventional sampling methods exploit two kinds of approaches to deal with these issues: resampling and similarity approaches [7][8]. Resampling is a useful data-level approach; resampling techniques balance the data distribution using either random or deterministic strategies [9]. Standard works select samples from the continuously arriving data stream using sampling with replacement or sampling without replacement [10]. Sampling with replacement is utilized when a fixed number of samples is required, whereas sampling without replacement is employed for applications in which instances need not be replaced to represent the whole stream. The former method does not guarantee sample sufficiency without repetition, and the latter is not preferred for sub-streams that correspond to various patterns. In [11], the Dynamic Feature Group Weighting framework with Importance Sampling (DFGW-IS) attempts to solve the case where concept drift occurs together with class imbalance. A weighted ensemble trained on randomly generated feature groups addresses the problems of imbalanced class distribution. It assumes that the minority classes do not change over time, but the minority classes of a past window may not be the same as the current ones. Moreover, a solution to imbalanced class distribution that reuses previous instances does not adapt effectively to concept drift. A sampling technique
in [12] applies recursive binary partitioning over the input instances and selects the samples that represent the whole stream; its greedy optimality and explicit error bounds handle issues related to concept drift over time. Sampling over heterogeneous sources is another problem. The adaptive sampling scheme in [13] builds more repeatable and reliable predictive models on imbalanced data streams: the predictive model is built when the data is less imbalanced, and when the data is severely imbalanced the scheme scans the data thoroughly and collects sufficient minority instances. However, deciding an appropriate reservoir sample size is difficult; a large reservoir wastes resources, whereas a small reservoir cannot support meaningful conclusions. The main limitation of these algorithms is that they run with a fixed-size reservoir and do not consider worst-case optimality in space, query time, and update time together. To solve these issues, the stream sampling in [14] combines continuous random sampling with the concept of overlap independence. By combining density and distance measures, DENDIS [15] maintains semantic coherence; however, it fails to exploit the semantic center-of-gravity weight in short text analysis and to differentiate the unfair matches caused by differences in information source length. The G-means Update Ensemble (GUE) in [16] attempts to solve both concept drift and imbalanced class distribution: it exploits oversampling to handle the imbalance and a weighting scheme to handle the drift. However, it lacks tuning of the imbalance factor according to the context of the streaming data, and a static threshold value is not always suitable for solving imbalanced class distribution. The Gradual Resampling Ensemble (GRE) technique in [17] handles both concept drift and imbalanced class distribution. It applies resampling only to a part of the previously arrived minority instances in order to amplify the current minority classes; Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies the disjuncts and avoids their influence on the similarity evaluation, which helps GRE adapt to new instances. Likewise, effective learning for nonstationary imbalanced data streams is proposed in [18]. It attempts to reduce misclassified instances with a focus on two-class problems: it creates multiple bags of chunks, and each chunk is trained and tested using a classification algorithm, but it faces issues in multi-class classification. In [19], the well-known spiking neural networks are developed for online learning in data streams; the main aims of the work are to control the size of the neuron repository and to exploit data reduction techniques together with compressed neuron learning ability. The Knowledge-Maximized Ensemble (KME) in [20] combines online and chunk-based ensemble classifiers to deal with various concept drift issues over continuously arriving streams. The use of unsupervised learning models and stored recurrent concepts maximizes the knowledge used in stream data mining and improves classification accuracy.
However, there is a need to develop diversity among the ensemble members to further improve the classification accuracy during sudden and gradual concept drifts.

3. Proposed Methodology

3.1 Problem Formulation
Consider an input data stream arriving from n sources (So)i, represented as So1, So2, So3, ..., Son. Each source i generates k streams (So)ik, i.e., Soi1, Soi2, ..., Soik. The instances from all the sources together form the entire streaming data ∪(So)i = So. The main aim of the data preprocessing technique is to allocate the memory of reservoir SR for the stream data So from the n sources. Two factors are essential in identifying a statistically appropriate reservoir size for the entire stream: the popularity and the degree of disparity of the stream data. The degree of disparity represents the variation in the number of instances distributed among the various sources; a high degree of disparity leads to a low confidence interval, which is the probability of the correct value estimation for a stream data [21]. The required sample size is

|S_R| = \frac{N}{1 + N e^{2}} \qquad (1)

where |SR| represents the total sample size, N represents the total population, and e represents the confidence interval. In the case of a low confidence interval or high disparity, the data sampling technique should select a high number of samples; otherwise, only a few samples are required to represent the entire stream data.
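As a concrete illustration, the following Python sketch evaluates equation (1); the function name and the example population and interval values are ours, not from the paper.

def reservoir_size(population: int, confidence_interval: float) -> int:
    # Equation (1): |SR| = N / (1 + N * e^2)
    return round(population / (1 + population * confidence_interval ** 2))

# A lower confidence interval e demands a larger sample:
print(reservoir_size(100_000, 0.05))  # 398 instances
print(reservoir_size(100_000, 0.01))  # 9091 instances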
After applying the sampling process with an optimized reservoir size, two problems remain in stream data classification: imbalanced class distribution and concept drift. Consider an online ensemble classifier Θ that receives a new instance x_t at time t and predicts the class label y'_t. After making the prediction, the classifier receives the expected label y_t of x_t. Both the predicted and expected labels belong to {1, -1}. The classification results of the ensemble classifier Θ fall into four types:
1. True positive if y_t = y'_t = 1
2. True negative if y_t = y'_t = -1
3. False positive if y_t = -1 and y'_t = 1
4. False negative if y_t = 1 and y'_t = -1
According to these metrics, the ensemble classification accuracy is estimated for both the minority and majority classes. Uneven distribution of instances between the majority and minority classes, together with changes in the concepts, increases the imbalance factor and reduces the classification accuracy. The imbalance factor is measured using the occurrence probability of the minority classes.
3.2 Overview of the Proposed Methodology
The proposed work aims at handling data imbalance and concept drift issues over continuously arriving data streams. There are two main challenges in designing a classification model on an imbalanced data stream with concept drift: the first is to adapt the classification model to new concepts immediately when the concept drift occurs; the second is to identify the imbalanced class distribution and to prevent the instances of the minority class from being ignored in the data stream.
Figure 1: Block Diagram for the Proposed Methodology. The diagram comprises two stages: Dynamic Sampling (deciding the optimal reservoir size using the standard deviation and mean, dynamic control parameter estimation for the power of allocation, and random sampling with the dynamic reservoir size) and Disparity-Based Ensemble Learning (imbalance factor estimation, oversampling and training for handling the imbalanced distribution, building the candidate model for ensemble classification, ensemble member weight estimation using the disparity measurement, and handling concept drift).

The proposed scheme focuses on both the data preprocessing and stream data classification processes. The dynamic sampling method determines the optimal size of the reservoir for each stratum and generates a sample
set using a random sampling technique. The proposed sampling model tightly incorporates the dynamic control parameter to allocate the samples to the different strata optimally. When most of the instances from a source are spread over a broad range of values, i.e., the standard deviation is high, the conventional optimal strata allocation converges to the instances from a single source. To overcome this limit, the proposed methodology decides a dynamic control parameter based on the coefficient of variation, the ratio of the standard deviation to the mean. The disparity-based ensemble learning applies the imbalance factor measurement; the imbalance factor estimation tightly integrates the overall accuracy of the ensemble classifier with the ratio of instances converging under the majority classes to the overall instances. The disparity refers to the relative difference in accuracy between the majority and minority classes. Comparing the disparity levels of the minority and majority classes against the imbalance factor confirms the presence of imbalanced class distribution and concept drift, and the dynamic weight-based ensemble member replacement solves these issues in stream data mining.
3.3 Dynamic Sampling by using Optimal Reservoir Sample Size
Sampling is the process of selecting instances from the continuously arriving data streams and deriving estimations that represent the entire stream data. Most conventional techniques select the sample instances at a fixed size; however, this is inappropriate across an imbalanced class distribution. The proposed work divides the incoming stream data from the various sources into multiple strata. The streaming nature of having multiple classes introduces additional challenges. First, the data stream size is unknown, so it is difficult to determine the sample size a priori. Second, the streaming data cannot be stored, and therefore it is essential to process the streaming data sequentially in a single pass. A technique commonly used to overcome these challenges is reservoir sampling; however, it is not adequate for all cases. The instances from all the sources form the entire streaming data ∪(So)i = So. The total available reservoir size for sampling is denoted as |SR| instances, and the main aim of sampling is to allocate the |SR| instances optimally among the n sources, subject to the data changes among the sources. The reservoir size is allocated using equation (2):

\text{Reservoir Size } |S_R|_i = |S_R| \times \frac{|So_i|^{q} \, (\alpha_i / \bar{Y}_i)}{\sum_{j=1}^{n} |So_j|^{q} \, (\alpha_j / \bar{Y}_j)}, \quad \bar{Y}_i = \frac{1}{|So_i|} \sum_{k} Y_i^{k} \qquad (2)
where Y_i^k denotes the k-th sampling attribute value of the i-th source in So, α_i denotes the standard deviation of the sampling attribute values in (So)i, and q denotes the power of allocation. A randomly selected sample of the estimated dynamic size provides higher statistical precision than previous sampling techniques, owing to the utilization of statistical properties such as the mean and standard deviation. In equation (2), the differentiation of the instances of source (So)i against all the sources is estimated using these statistical properties. Instead of giving equal preference to all sources, the proposed methodology assigns the reservoir size for each source individually according to changes in the statistical properties of that source. Conventional sampling techniques allocate the reservoir size among the sources using equation (2) with the q value fixed to 1. However, this leads to unnecessary convergence towards a single source, especially when the standard deviation among the instances of a single source is high. When a data stream faces multiple sudden concept drifts, the ratio of the standard deviation to the mean value of a source, which appears in the numerator of equation (2), becomes high. In such cases, the drifts are noticed only when a huge number of samples is taken from that source. Even though equation (2) then assigns a very large reservoir to such a source, because the standard deviation is close to the mean value, it tends to overlook the instances from the other sources even if they exhibit concept drift. Thus, the factor q, named the control parameter, should be allocated dynamically across the sources. For analyzing instances from separate sources, the optimal reservoir size allocation is not adequate when the value of q is fixed at one. Thus, the proposed methodology assigns the q value
as per the statistical properties, using equation (3):

q = \begin{cases} 0.5, & \text{if } \max_{i \in n \text{ sources}} \left( \alpha_i / \bar{Y}_i \right) \geq 0.5 \\ 0, & \text{otherwise} \end{cases} \qquad (3)

The proposed scheme decides the size of reservoir |SR| optimally using the dynamic q value. This assists the proposed HIDC in handling the concept drift issue in stream data without the sampling converging to a single source. To fill the reservoir, the instances are taken from the sources using random sampling.
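To make the allocation concrete, the following Python sketch implements equations (2) and (3) under the assumption that each source is summarized by its instance count and by the mean and standard deviation of its sampling attribute; the names SourceStats, power_of_allocation, and allocate_reservoir are illustrative, not from the paper.

from dataclasses import dataclass

@dataclass
class SourceStats:
    count: int   # |So_i|, number of instances seen from source i
    mean: float  # mean of the sampling attribute values Y_i
    std: float   # standard deviation alpha_i of the attribute values

def power_of_allocation(sources: list) -> float:
    # Equation (3): q = 0.5 when some source's coefficient of
    # variation (std / mean) reaches 0.5, and 0 otherwise.
    return 0.5 if max(s.std / s.mean for s in sources) >= 0.5 else 0.0

def allocate_reservoir(total: int, sources: list) -> list:
    # Equation (2): share |SR| among the sources in proportion
    # to |So_i|^q * (std_i / mean_i).
    q = power_of_allocation(sources)
    weights = [s.count ** q * (s.std / s.mean) for s in sources]
    scale = total / sum(weights)
    return [round(w * scale) for w in weights]

# The smaller but highly varying (possibly drifting) source gets a larger share:
stats = [SourceStats(5000, 20.0, 2.0), SourceStats(800, 10.0, 6.0)]
print(allocate_reservoir(1000, stats))  # [294, 706]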
3.4 Random Sampling with Optimal Reservoir Size
HIDC extends reservoir sampling with the dynamically allocated reservoir size: instead of taking samples randomly at a fixed size, HIDC randomly selects samples from the input stream data at the optimal size. Let |SR|i denote the reservoir size of source i. Initially, the HIDC reservoir sampling algorithm places the current instances into the reservoir of size |SR|i. The process of random sampling is explained in Figure 2. Conventional reservoir sampling techniques choose a uniform random sample of fixed size, without replacement, from the continuously arriving streaming data. Initially, the algorithm admits every instance into the reservoir (SR)i until the reservoir holds |SR|i instances. Once the reservoir is full, each k-th instance is accepted for inclusion with a random probability, and an accepted instance replaces a randomly selected instance in the reservoir.
Requirements: |SR|i is the reservoir size for source (So)i, R is the reservoir allocated for current instances, and k counts the instances seen so far.
1: k = 0
2: for each instance x from the input stream do
3:   k = k + 1
4:   if k ≤ |SR|i then
5:     R[k-1] = x  // fill the reservoir of (SR)i with the first |SR|i instances
6:   else
7:     Randomn1 = rand(0, 1)  // random number generation between 0 and 1
8:     if Randomn1 < |SR|i / k then
9:       Randomn2 = rand(0, |SR|i - 1)  // random slot between 0 and |SR|i - 1
10:      R[Randomn2] = x  // replace a randomly selected instance with the accepted one
11:    end if
12:  end if
13: end for
Figure 2: Random Sampling in HIDC
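For reference, the following is a compact executable rendering of Figure 2 in Python; it mirrors the pseudocode above and adds nothing beyond it.

import random
from typing import Iterable, List

def reservoir_sample(stream: Iterable, size: int) -> List:
    # Uniform reservoir sampling with the dynamically allocated size |SR|i.
    reservoir: List = []
    for k, instance in enumerate(stream, start=1):
        if k <= size:
            reservoir.append(instance)        # fill phase
        elif random.random() < size / k:      # accept with probability |SR|i / k
            reservoir[random.randrange(size)] = instance  # replace a random slot
    return reservoir

print(reservoir_sample(range(10_000), 5))  # e.g. [8122, 441, 6634, 67, 2385]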
3.5 Disparity Based Ensemble Learning
The proposed methodology, HIDC, implements the ensemble classification method on the imbalanced data stream with concept drift. There are two significant challenges in stream data analysis: concept drift and imbalanced class distribution. The stream data classification model has to adjust itself to adapt to the new concepts in the data stream and to prevent the minority class from being ignored. To solve these issues, the proposed methodology exploits a weighting mechanism for ensemble classification to react quickly to the various types of concept drifts and to achieve good classification accuracy when facing both majority and minority
classes over the continuously arriving data stream. The symbols used in the proposed ensemble classification algorithm are summarized in Table 1.
Symbols   Description
m         Maximum number of classifiers in the ensemble
ε         Ensemble
C         Candidate model
Cb        Base model in the ensemble
Wb        Weight of the base model in the ensemble
Wc        Weight of the candidate model in the ensemble
Amaj      Accuracy of the majority classes
Amin      Accuracy of the minority classes
Dmaj      Disparity of the majority classes
Dmin      Disparity of the minority classes
MI        Set of minority classes
MJ        Set of majority classes
θ         Imbalance factor

Table 1: List of Symbols
The ensemble learning model consists of m base classifiers Cb. To improve the classification accuracy of the ensemble, the proposed methodology trains a candidate learning model C outside of the ensemble. It replaces the ensemble member with the minimum weight when the weight of the candidate classifier is higher than that of the base classifier, i.e., Wb < Wc [16]. The data stream arrives continuously, so it is impossible to obtain all of the data from the stream directly; therefore, the proposed methodology takes samples from the data stream at the optimal size. This allows the proposed methodology to detect imbalance using the imbalance factor θ before training the instances into the candidate model, and to decide whether the oversampling process is required. All the base classifiers in the ensemble predict the class by majority vote.
3.5.1 Imbalance Factor Estimation and Over Sampling
The ratio of the total instances in the minority class to those in the majority class defines the imbalance degree. Most conventional classification algorithms, from decision trees to neural networks, assume an even distribution of instances among the classes; however, in real-time applications the ratio between the minority and majority classes is low. The samples are separated into fixed-size chunks, which are first tested and then trained, alternately. The testing returns the number of instances in each class, the ratio of minority to majority instances, and the accuracy of the majority and minority classes.
The best ratio represents a balanced data stream. Using this information, the proposed methodology estimates the imbalance factor θ:
\theta = \frac{A_{maj} + A_{min}}{2} \times \frac{\sum \text{minority instances}}{\sum \text{majority instances}} \qquad (4)
Equation (4) returns the imbalance factor of a data stream from Amaj, Amin, and the ratio of minority to majority instances. The imbalance factor is small when both the accuracies and the ratio of the minority to majority instances are small, which represents a scenario of ignoring the minority class and poor classification accuracy. Notably, the imbalance factor is not the only reason behind poor classification accuracy; other factors, such as the training data size, also affect it. The proposed methodology mitigates this using the optimal reservoir size and control parameter. Even so, the instances selected from the minority classes can still make a negative impact on the classification accuracy. Thus, the proposed methodology tests and trains the remaining chunks using the ensemble members after applying the oversampling and undersampling techniques.
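Under the reconstruction of equation (4) above, the imbalance factor can be computed as in the following sketch; the interface and the example figures are illustrative assumptions, not from the paper.

def imbalance_factor(acc_maj: float, acc_min: float,
                     n_minority: int, n_majority: int) -> float:
    # Equation (4): mean class accuracy scaled by the minority/majority ratio.
    return (acc_maj + acc_min) / 2 * (n_minority / n_majority)

# Low minority accuracy and a skewed class ratio both shrink theta:
print(round(imbalance_factor(0.95, 0.40, 150, 1850), 3))  # 0.055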
3.5.2 Oversampling and Undersampling for Solving Imbalance Class Distribution
HIDC exploits the disparity factor to estimate the difference between the classification accuracy of the majority and minority classes. A high disparity factor in the minority or majority class emphasizes the necessity of the oversampling or undersampling process, respectively:

D_{maj} = \frac{A_{maj} - A_{min}}{A_{maj}} \qquad (5)

D_{min} = \frac{A_{maj} - A_{min}}{A_{min}} \qquad (6)

Equations (5) and (6) estimate the disparity of the majority and minority classes, Dmaj and Dmin, respectively. In the case of low disparity only in the minority class relative to the imbalance factor (Dmin < θ), the oversampling process is sufficient for improving the classification accuracy. When the disparity is low in both Dmin and Dmaj, it is essential to perform both the oversampling and undersampling processes. HIDC oversamples the stream data of the minority classes and undersamples the majority classes: to oversample a minority class and reduce the classification inaccuracy, HIDC moves instances from the majority classes while oversampling the minority class. For every minority class, HIDC selects the nearest majority classes as neighbors and also selects some instances from them according to the distance value. In contrast to conventional oversampling methods, the proposed work only oversamples, i.e., strengthens, the borderline minority classes. First, the HIDC sampling model identifies the minority classes and differentiates the borderline minority instances among them [22]. The closest instances are selected from the neighboring classes and added to the subset of the minority class. Let the sampled data stream be T, the minority class MI, and the majority class MJ.
MI = {mi_1, mi_2, ..., mi_minum}, MJ = {mj_1, mj_2, ..., mj_mjnum}

where minum and mjnum are the numbers of minority and majority instances, respectively. The procedure of the hybrid resampling technique is as follows.
Step 1: For every minority instance ∈ MI, its h nearest neighbors are detected from the sampled data stream T. The number of majority instances among the h nearest neighbors is denoted by h' (0 ≤ h' ≤ h).
Step 2: If h = h', meaning all h nearest neighbors are majority instances, the instance is not considered further, since it has no minority neighbors. When the condition h/2 ≤ h' < h is satisfied, the instance is in danger, since the number of its majority nearest neighbors is larger than the number of its minority ones. Moreover, if 0 ≤ h' < h/2, the minority instance need not participate in further steps.
Step 3: The instances in danger are considered the borderline data of the minority class MI, and the set of instances in danger is a subset of MI, denoted as Dn = {mi'_1, mi'_2, ..., mi'_dnum}, 0 ≤ dnum ≤ minum.
For each instance in danger, the h nearest neighbors are estimated from MJ.
Step 4: This step takes dnum positive instances from MJ. For each mi'_i, it randomly selects and moves instances from its h nearest neighbors in MJ. By repeating the above procedure for each mi'_i in Dn, HIDC oversamples the minority classes and undersamples the majority classes. HIDC repeats the procedure until the sum square error of the samples in each class is significantly reduced. Thus, the proposed approach effectively handles the imbalanced class distribution and concept drift using HIDC. Moreover, the hybrid oversampling and undersampling scheme on the borderline minority instances strengthens the HIDC efficiency.
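The sketch below combines the disparity test of equations (5) and (6) with the borderline ("danger") detection of Steps 1-3, using scikit-learn's NearestNeighbors; the binary label encoding (1 = minority, 0 = majority) and the default h are our assumptions, not specified by the paper.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def disparity(acc_maj: float, acc_min: float) -> tuple:
    # Equations (5) and (6).
    return (acc_maj - acc_min) / acc_maj, (acc_maj - acc_min) / acc_min

def danger_set(X: np.ndarray, y: np.ndarray, h: int = 5) -> np.ndarray:
    # Steps 1-3: minority instances whose h-neighborhood is dominated,
    # but not fully occupied, by the majority class.
    nn = NearestNeighbors(n_neighbors=h + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # idx[:, 0] is the point itself
    danger = []
    for i in np.where(y == 1)[0]:             # iterate over minority instances
        h_maj = np.sum(y[idx[i, 1:]] == 0)    # majority neighbors among the h
        if h / 2 <= h_maj < h:                # borderline: the instance is in danger
            danger.append(i)
    return np.array(danger, dtype=int)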
3.6 Ensemble Member Updation using Weighting Mechanism for Handling Concept Drift
Initially, the ensemble classifier ε consists of m members. The ensemble exploits incremental classifiers as weighted components for handling concept drifts. In contrast to a single classification method, the ensemble classification model can solve many types of concept drifts, owing to the combined class predictions of its members. When the stream data arrives at the candidate model, HIDC trains the candidate model C as well as the m ensemble members Cb. HIDC assigns a weight Wc for the i-th chunk as shown in equation (7):

W_c = \left( A_{maj}^{c} \, A_{min}^{c} \right)_{i}^{0.5} \qquad (7)

The weight of the candidate model is estimated from Amaj and Amin by analyzing the streaming data, where Amaj and Amin denote the accuracy of the majority class and that of the minority class, respectively. In the testing process, HIDC estimates the classification accuracy of the candidate model with the support of the ensemble model. A higher weight represents that classifier C can attain better accuracy for both the majority and the minority classes; if C performs poorly for either of them, it obtains a lower weight. Even though the overall accuracy of the ensemble classifier is high in most existing works, the accuracy of the minority class is inferior. The proposed HIDC therefore considers the accuracy of the minority class in order to improve the overall performance of the ensemble classification. Like the Wc measurement, Wb is estimated for every ensemble member using equation (8):

W_b = \left( A_{maj}^{b} \, A_{min}^{b} \right)_{i-1}^{0.5} + \frac{1}{i - f + 1} \sum_{j=f}^{i-1} \left( A_{maj}^{b} \, A_{min}^{b} \right)_{j}^{0.5} \qquad (8)
Equation (8) consists of two parts. The first part evaluates the performance of Cb on the (i-1)-th chunk; it assists HIDC in preserving the classifiers with better performance on the recent chunk, so that the proposed ensemble model can adapt to sudden concept drift as quickly as possible. The second part assists HIDC in estimating the performance of the classifier over a period of time; in equation (8), f represents the time step at which the classifier was built. This facilitates HIDC in handling gradually appearing concept drift over the data stream. The weight Wb is measured for every base classifier after the ensemble classifies the current chunk. When the ensemble is full, HIDC substitutes the candidate classifier for the member whose Wb is poorest and lower than Wc; otherwise, HIDC directly appends the candidate classifier as a member of the ensemble. Thus, the proposed methodology can append more instances of the minority classes in an imbalanced environment and react to various concept drifts using the weighting mechanism.
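A minimal sketch of this weighting and replacement policy, under the reconstruction of equations (7) and (8) above, assuming each ensemble member keeps the list of its per-chunk G-mean scores since the chunk f at which it was built; the bookkeeping and function names are ours.

from math import sqrt

def g_mean(acc_maj: float, acc_min: float) -> float:
    # (A_maj * A_min)^0.5, the per-chunk score used in equations (7) and (8).
    return sqrt(acc_maj * acc_min)

def base_weight(history: list) -> float:
    # Equation (8): G-mean on the latest chunk plus the average
    # G-mean over the chunks seen since the classifier was built.
    return history[-1] + sum(history) / len(history)

def update_ensemble(ensemble: list, histories: list, candidate, w_c: float, m: int):
    # Append the candidate while the ensemble is not full; otherwise
    # replace the member with the poorest weight if the candidate beats it.
    if len(ensemble) < m:
        ensemble.append(candidate)
        histories.append([w_c])
        return
    weights = [base_weight(h) for h in histories]
    worst = min(range(len(weights)), key=weights.__getitem__)
    if weights[worst] < w_c:
        ensemble[worst] = candidate
        histories[worst] = [w_c]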
4. Experimental Evaluation
The experimental evaluation is carried out on the CityPulse weather dataset, collected from weather observations in the city of Aarhus during August-September 2014. The raw weather dataset of 1.7 MB is downloaded in JSON format. The weather data includes the features Timestamp, Location, Temperature (°F), Wind speed, Pressure, Dew point, and Humidity; the Humidity feature is used as the ground truth. Following HIDC, the sampling technique collects the instances for each class with the optimally allocated reservoir size, and the bolt in Storm executes the weighted ensemble classification model. The total ensemble size is fixed at four. Intuitively, three factors affect the performance of the HIDC algorithm over a data stream: the number of minority classes, the concept drift over time, and the number of instances. To measure the impact of the proposed scheme, HIDC is compared with DFGW-IS [11] in terms of precision, recall, G-Mean, and F-Measure.
Precision: the ratio of the number of correct results to the total number of returned results.
Recall: the ratio of the number of correct results to the total number of results that should have been returned.
G-Mean: the geometric mean of the precision and recall of the test.
F-Measure: the weighted harmonic mean of the precision and recall of the test.
4.1 Experimental Results Figure 3 illustrates the recall of HIDC by varying the minority classes from 10% to 30%.
Figure 3: Minority Classes Vs. Recall
Figure 4: Concept Drift over Time Vs. G-Mean
It is observed that HIDC shows improved performance compared to DFGW-IS. When the percentage of minority classes is low, the recall of HIDC equals 85%, whereas the importance sampling in DFGW-IS attains nearly 70% recall. The observed impact of the minority classes on the recall of HIDC is reasonable compared to DFGW-IS. Since HIDC allocates the reservoir size to the various sources according to the statistical properties and the control parameter, it shows a significant improvement in recall, notably as the minority classes increase. An exception occurs when the minority classes exceed 20%: the recall of HIDC degrades from 91.3% to 89% as the minority classes vary from 25% to 30%. When the minority neighbors outnumber the majority ones, the proposed HIDC excludes those instances from the resampling technique, which is the main reason behind the marginal decrement in recall beyond 20% minority classes. Moreover, the difference between the HIDC and DFGW-IS algorithms increases with the minority classes. The main limitations of DFGW-IS are its collection of raw data from the past instances for the minority classes, which assumes that such data can best represent the current and future concepts, and its assumption that the minority classes do not change; both are imperfect in real-time environments. HIDC improves the recall by 29.15% over DFGW-IS with a high proportion of minority classes. Figure 4 shows the influence of concept drift on the G-Mean of both HIDC and DFGW-IS. The concept drift over time represents the number of changed patterns over the total number of patterns. A minimal level of concept drift over time yields a small standard deviation and lets HIDC take only a few samples from the past data, which brings it close to the performance of DFGW-IS; that is, the statistical change over the streams is small and does not cause much difference in the sampling process. For example, HIDC attains 93.3% G-Mean at a 20% change in patterns. Increasing the concept drift over time drops the G-Mean of both HIDC and DFGW-IS. Even though the G-Mean of HIDC degrades, it performs better than DFGW-IS. The ensemble member updating with the weighting scheme leads the HIDC system to adapt to various concept drift issues. High concept drift over time decreases the
G-Mean of HIDC from 93.3% to 84.26%. However, DFGW-IS experiences a G-Mean degradation from 84% to 39.68% as the concept drift over time increases. DFGW-IS tackles concept drift using the metrics of discriminative power and stability level; however, it extracts each component classifier from a specific feature set. With entirely new concepts, i.e., 100% concept drift, the G-Mean of DFGW-IS is degraded by 43.58% compared to HIDC.

Figure 5: Minority Classes Vs. Precision
Figure 6: Number of Instances per Class Vs. F-Measure
Figure 5 reports the comparative precision of HIDC and DFGW-IS while varying the minority classes from 10% to 30%. HIDC shows a better precision value since the proper measurement of the statistical properties and the control parameter allocates the reservoir size efficiently across the sources. DFGW-IS assumes that the minority classes do not change over time, but the minority classes in the past window are not the same as the current minority classes; the problem becomes worse when concept drift occurs together with class imbalance. For instance, HIDC attains 85% precision when the minority classes are 10%, whereas DFGW-IS attains only 72% precision in the same environment. By considering the standard deviation among the sampled instances, the HIDC sampling technique allocates the reservoir size properly, which results in an optimal number of instances and avoids variation in the HIDC performance. Even though HIDC improves on DFGW-IS, it still experiences some degradation in precision: although HIDC allocates an optimal reservoir size over the past window, random sampling of past instances for the minority classes is not adequate in all cases. Figure 6 shows the F-Measure of HIDC and DFGW-IS against the number of instances per hour with 20% concept drift. Notably, increasing the number of instances per hour increases the performance of both works; the number of instances has a significant influence on the F-Measure of HIDC and DFGW-IS. From a low to a medium number of instances (i.e., from 120 to 1000), both HIDC and DFGW-IS improve the F-Measure linearly; for instance, HIDC improves the F-Measure by 15%. For a higher number of instances (e.g., 2000), there are no noticeable changes in the F-Measure of either work, since more than 1000 instances per hour with 20% minority classes are excessive for analyzing such weather data. Even though both works behave in the same way as the instances increase, the performance of HIDC is always better than that of DFGW-IS, because DFGW-IS neither restricts the past instances nor measures the discrepancy level of the classes over time, whereas the proposed HIDC decides the optimal reservoir size and the appropriate ensemble member using the weighting concept.

5. Conclusion
This paper presented HIDC, a model for handling imbalanced data with concept drift using dynamic sampling and ensemble classification over continuously arriving imbalanced data with concept drifts. The proposed dynamic sampling model allocates the reservoir size depending on the statistical and control parameters of the streaming data from the various sources. The reservoir allocation based on the standard deviation and mean adapts the reservoir to the various sources and assists HIDC in handling the imbalanced data distribution successfully. Moreover, the proposed HIDC attains better recall using the classifier weighting method when the streaming data contains new concepts. By measuring the imbalance factor
from the results of the ensemble members, HIDC performs oversampling and undersampling to handle the imbalanced class distribution. The weighting scheme is applied to both the candidate and the ensemble members for comparing and replacing the worst ensemble classifier. This strengthens HIDC by solving both the imbalanced class distribution and the concept drift without increasing the cost. The experimental evaluation compares DFGW-IS and HIDC for stream data analysis; from the evaluation, HIDC improves the F-Measure by 43% over DFGW-IS when varying the number of instances.
References
[1] P. J. Haas, "Data-stream sampling: basic techniques and results," in Data Stream Management, Springer Berlin Heidelberg, pp. 13-44, 2016.
[2] W. Hu and B. Zhang, "Study of sampling techniques and algorithms in data stream environments," 9th International IEEE Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1028-1034, 2012.
[3] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in Artificial Intelligence, Vol. 5, No. 4, pp. 221-232, 2016.
[4] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, Vol. 1, No. 1, 2016.
[5] I. Zliobaite and B. Gabrys, "Adaptive preprocessing for streaming data," IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 2, pp. 309-321, 2014.
[6] I. Khamassi, M. Sayed-Mouchaweh, M. Hammami, and K. Ghédira, "Discussion and review on evolving data streams and concept drift adapting," Evolving Systems, Vol. 9, No. 1, pp. 1-23, 2018.
[7] A. Boicea et al., "Sampling strategies for extracting information from large data sets," Data & Knowledge Engineering, 2018.
[8] S. Ramírez-Gallego et al., "A survey on data preprocessing for data stream mining: current status and future directions," Neurocomputing, Vol. 239, pp. 39-57, 2017.
[9] K. Wu et al., "Statistical data reduction for streaming data," IEEE Scientific Data Summit (NYSDS), 2017.
[10] A. Kancharala et al., "Big streaming data sampling and optimization," IT Convergence and Security 2017, Springer, Singapore, pp. 221-228, 2018.
[11] K. Wu et al., "Classifying imbalanced data streams via dynamic feature group weighting with importance sampling," Proceedings of the SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2014.
[12] C. Cervellera and D. Macciò, "Distribution-preserving stratified sampling for learning problems," IEEE Transactions on Neural Networks and Learning Systems, 2017.
[13] W. Zhang et al., "Adaptive sampling scheme for learning in severely imbalanced large scale data," Asian Conference on Machine Learning, 2017.
[14] Y. Tao, X. Hu, and M. Qiao, "Stream sampling over windows with worst-case optimality and ℓ-overlap independence," The VLDB Journal, Vol. 26, No. 4, pp. 493-510, 2017.
[15] F. Ros and S. Guillaume, "DENDIS: A new density-based sampling for clustering algorithm," Expert Systems with Applications, Vol. 56, pp. 349-359, 2016.
[16] S.-K. Wang and B.-R. Dai, "A G-means update ensemble learning approach for the imbalanced data stream with concept drifts," International Conference on Big Data Analytics and Knowledge Discovery, Springer, Cham, pp. 255-266, 2016.
[17] S. Ren, B. Liao, W. Zhu, Z. Li, W. Liu, and K. Li, "The gradual resampling ensemble for mining imbalanced data streams with concept drift," Neurocomputing, Vol. 286, pp. 150-166, 2018.
[18] M. A. Thalor and S. T. Patil, "Propagation of misclassified instances to handle nonstationary imbalanced data stream," Journal of Engineering Science and Technology, Vol. 13, No. 4, pp. 1134-1142, 2018.
[19] A. S. Iwashita, V. H. C. Albuquerque, and J. P. Papa, "Learning concept drift with ensembles of optimum-path forest-based classifiers," Future Generation Computer Systems, 2019.
[20] S. Ren, B. Liao, W. Zhu, and K. Li, "Knowledge-maximized ensemble algorithm for different types of concept drift," Information Sciences, Vol. 430, pp. 261-281, 2018.
[21] M. Al-Kateb, B. S. Lee, and X. S. Wang, "Adaptive-size reservoir sampling over data streams," 19th International Conference on Scientific and Statistical Database Management (SSDBM), 2007.
[22] K. Napierala and J. Stefanowski, "Types of minority class examples and their influence on learning classifiers from imbalanced data," Journal of Intelligent Information Systems, Vol. 46, No. 3, pp. 563-597, 2016.