Outlier detection approaches for wireless sensor networks: A survey

Computer Networks 129 (2017) 319–333. Contents lists available at ScienceDirect. Computer Networks journal homepage: www.elsevier.com/locate/comnet



Aya Ayadi a,b, Oussama Ghorbel a,c,∗, Abdulfattah M. Obeid a,d, Mohamed Abid a,c

a CES Research Lab, Sfax University, Tunisia
b National Engineering School of Gabes, University of Gabes, Tunisia
c National Engineers School of Sfax, Sfax University, Tunisia; Digital Research Center (CRNS), Technopark Sfax, Tunisia
d King Abdulaziz City for Science and Technology, Saudi Arabia

Article info

Article history: Received 28 November 2016; Revised 31 August 2017; Accepted 24 October 2017

Keywords: Classification; Outlier detection technique; Statistics approaches; Clustering approaches; Wireless sensor networks

Abstract

Over the past few years, wireless sensor networks (WSNs) have gained significant attention. They are deployed in the real world to collect valuable raw sensed data. Because of their high density, WSNs are exposed to faults and malicious attacks. Likewise, sensor readings are often inaccurate and unreliable, which makes WSNs vulnerable to outliers. Outliers, also called abnormal data or anomalies, are usually considered to be those sensor readings that deviate significantly from normal behavior. The challenge is to ensure data quality, secure monitoring and reliable detection of interesting and critical events. In this survey, we give a comprehensive overview of existing outlier detection techniques used specifically for wireless sensor networks. Moreover, we present a comparative table that serves as a guideline for selecting the technique best suited to an application in terms of characteristics such as detection mode, architectural structure and correlation extraction. © 2017 Published by Elsevier B.V.

∗ Corresponding author at: National Engineers School of Sfax, Sfax University, Tunisia; Digital Research Center (CRNS), Technopark Sfax, Tunisia. E-mail address: [email protected] (O. Ghorbel).

https://doi.org/10.1016/j.comnet.2017.10.007 1389-1286/© 2017 Published by Elsevier B.V.

1. Introduction

Wireless sensor networks (WSNs) have attracted growing interest in the scientific and industrial communities thanks to innovations over the last decades in microelectronics, MEMS (Micro-Electro-Mechanical System) design, energy harvesting, and wireless technologies. These networks consist of a large number of sensing nodes densely deployed in the region of interest. The embedded sensing elements are connected to each other through wireless links and work together to collect large amounts of high-fidelity information about different locations, process it, and transmit the data to gateway nodes, also known as sink points. Recent deployments have demonstrated their utility in various domains. WSNs have long been used in military operations [1]. More recently, a new set of applications has become an active subject of research, such as structural health monitoring [2–4], environmental monitoring [5,6], agriculture [7] and industrial applications [8–10]. Usage and performance are currently growing rapidly in many contexts, such as smart cars that aim to reduce the number of crashes, and home automation.

The data collected by WSNs is often unreliable and inaccurate because of the imperfect nature of WSNs. Sensor nodes have stringent resource constraints in memory capacity, computational capability, communication bandwidth and energy. Because of these limited resources, the data generated by sensors is contaminated by noise, obvious errors, missing data, duplicated values and conflicting information. Furthermore, WSNs frequently deploy a large number of sensor nodes in harsh and hostile environments where the nodes are vulnerable to malicious attacks; hence, the data generated and processed may be controlled by adversaries. In the context of WSNs, outliers, also known as anomalies or deviations, are one of the sources that greatly influence the quality of the collected data. Identifying outliers therefore supports data reliability, performance and secure functioning of the network. Outliers in WSNs stem from mechanical faults, fraudulent behavior or natural deviation in populations. Their sources include events, malicious attacks, noise and errors. Outliers are often more interesting than normal readings, which is why they need to be identified: they may contain important information. Accordingly, outlier detection in WSNs has attracted much attention and has been widely used in real-life applications such as fraud detection, intrusion detection, environmental monitoring, health and medical monitoring, localization and target tracking.

In this survey, we present an extensive overview of the research on existing outlier detection techniques specifically developed for WSNs. The contributions of this survey are:
• Introducing the fundamentals of outliers in wireless sensor networks and defining the key terms that are referred to in later sections (Section 2),
• Identifying the key aspects of outlier detection techniques in WSNs (Section 3),
• Giving a brief description of current outlier detection techniques (Section 4),
• Introducing a comparative table to select an adequate technique (Section 5).

2. Classification criterion of outliers

Wireless Sensor Networks (WSNs) have become an interesting research topic in recent years. Their capabilities for monitoring wide areas, accessing remote and hostile places and reacting in real time, together with their relative ease of use, have opened a whole new horizon of possibilities for scientists. So far, WSNs have been employed in military activities such as reconnaissance, surveillance and target acquisition, in environmental activities, and in civil engineering such as structural health measurement. A sensor network is typically composed of hundreds, or even thousands, of small, low-cost nodes distributed over a wide area, as shown in Fig. 1.

Fig. 1. Architecture of Wireless Sensor Networks.

Furthermore, WSNs combine the ability to sense, compute, and coordinate their activities with the ability to communicate results to the end users. They have revolutionized data collection in all kinds of domains, and the design and deployment of these networks create unique research and engineering challenges. Several limitations must be taken into account when developing WSN software, such as the large intended coverage area, obstacles to communication, the often random and hazardous deployment, the high failure rate and the limited power. Ensuring sensor data quality is crucial for better decision making. Key management and cryptographic techniques alone are insufficient to ensure the integrity and reliability of the data, as they cannot protect sensor nodes from insider attacks [11]. Therefore, outlier detection techniques are designed to detect any abnormal behavior in sensor data streams [12].

Various factors make WSNs especially prone to outliers. Firstly, they collect their data from the real world using imperfect sensing devices. Secondly, they are battery powered, and thus their performance tends to deteriorate as power dwindles. Thirdly, since these networks may include a huge number of sensors, the chance of error accumulates. Finally, in security and military applications, sensors are especially prone to manipulation by adversaries [1]. It is therefore clear that outlier detection should be an inseparable part of any data processing routine in WSNs [13]. The following subsections describe the basic concepts, sources and requirements of outlier detection in WSNs.

Fig. 2. A simple example of anomalies in a two-dimensional data set.

Fig. 3. Different types of outlier source in Wireless Sensors Networks.

2.1. Definition of outliers

In the context of WSNs, an outlier, also known as an anomaly or deviation, is a reading that shows unusual behavior when compared with the majority of sensor readings, as shown in Fig. 2. There is no universally acknowledged definition of outliers in the literature. Table 1 presents a set of prevalent definitions proposed by various authors.

2.2. Type of outliers

Outliers can be classified as global or local with respect to the data set. Global outliers are anomalous with respect to all the available data points, whereas local outliers are data points that are anomalous with respect to their nearest local neighbors. LOF (Local Outlier Factor) [14] and its variants LOCI (Local Correlation Integral) [15] and LOOP (Local Outlier Probabilities) [16] are some local outlier detection techniques for data streams.

2.3. Dimension of detected outliers

A data point can be identified as an outlier when its attributes have anomalous values. A univariate outlier is a data point that has a single attribute; it is detected if that attribute is anomalous with respect to other data. In contrast, a multivariate outlier is a data point that has multiple attributes; it is identified as an outlier if some of its attributes together have anomalous values with respect to other data.

2.4. Identity of outliers

One of the most important challenges for anomaly detection in WSNs is to provide reliability and quality and to protect sensor data from being corrupted or damaged. In many applications, outliers are more interesting than normal readings. Outliers therefore need to be identified, as they may contain important information that is of great interest to researchers. There are various sources of outliers, as shown in Fig. 3, arising from environmental changes or from errors coming from a faulty sensor, which can be defined as:


Table 1. Prevalent definitions of outliers.

Date | References | Definitions
1960 | [17] | 'An outlier is an observation which is suspected of being partially or wholly irrelevant because it is not generated by the stochastic model'
1969 | [18] | 'An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.'
1980 | [19] | 'An outlier is an observation, which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.'
1994 | [20] | 'An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.'
2000 | [14] | 'Outliers are points that lie in the lower local density with respect to the density of its local neighborhood.'
2001 | [21] | 'Outliers are points that do not belong to clusters of a data set or as clusters that are significantly smaller than other clusters.'
2002 | [22] | 'Points that are not reproduced well at the output layer with high reconstruction error are considered as outliers.'
2003 | [23] | 'A point can be considered as an outlier if its own density is relatively lower than its nearby high density pattern cluster, or its own density is relatively higher than its nearby low density pattern regularity.'
2004 | [24] | 'If the removal of a point from the time sequence results in a sequence that can be represented more briefly than the original one, then the point is an outlier.'
2005 | [25] | 'A point is considered to be an outlier if in some lower-dimensional projection it is present in a local region of abnormal low density.'
2006 | [26] | 'A spatial-temporal point whose non-spatial attribute values are significantly different from those of other spatially and temporally referenced points in its spatial or/and temporal neighborhoods is considered as a spatial-temporal outlier.'
2011 | [27] | 'An outlier is a data point which is significantly different from other data points, or does not conform to the expected normal behavior, or conforms well to a defined abnormal behavior.'

• Noise or error: refers to a noise-related measurement or to data coming from various sources, such as sensor misbehavior or a sensor fault [28]. Outliers caused by errors may occur frequently. Erroneous data normally appears as a change in the dataset that is extremely different from the rest of the data. Noise and errors result from many variations related to the environment, including the harshness and difficulty of the deployment areas [29]. Noisy data, as well as erroneous data, should be eliminated or corrected if possible.
• Events: an event is defined as a sudden change in the real-world state [30]. Typical examples include fire [31], flood, earthquake, volcanic eruption, weather changes, rainfall, chemical spill, air pollution, etc. In contrast with erroneous data, outliers caused by events tend to have a much smaller probability of occurrence. This kind of outlier normally lasts for a relatively long period of time and changes the historical pattern of the sensor data. Removing event outliers from the data set leads to a loss of important hidden information about the events, as in [32]. Outliers that are very close to random errors in terms of size can only be determined through the application of outlier tests.
• Malicious attacks: are related to network security, and some works such as Baig [33] have tackled this problem. In this case, an attacker gains access to and control of some nodes and then starts launching attacks. Malicious attacks can deplete the limited resources of the network or inject false and corrupted data. These attacks can be classified into two major categories, passive and active. A passive attack obtains the data exchanged in the network without interrupting the communication. An active attack aims to disrupt the normal functionality of the network.

Most of the outlier and event detection techniques presented in the literature focus on differentiating between outliers and events [34], but the characterization of outliers as sensor faults, misbehavior or noise has not been addressed. However, because there are spatio-temporal similarities between neighboring nodes, the measurements make it possible to classify an outlier as either an event or an error. Event observations tend to be spatially correlated, as in [35], while erroneous observations tend to be unrelated. Recently, some research has been dedicated to detecting sensor errors [36].

3. Fundamentals of outlier detection techniques

Fig. 4. Various architectural structures of outlier detection.

3.1. Architectural structure

In the centralized detection approach [37], the collected data is sent to the base station to be processed and analyzed. Anomaly detection is thus performed at the sink, which has more resources available and larger storage. Clustering techniques can use centralized detection to improve the accuracy of anomaly detection. In the distributed approach [38], by contrast, a detection agent runs on every node. This allows real-time anomaly detection: each node listens promiscuously to the behavior of its neighboring nodes and sends an alarm to the base station or to neighboring nodes if an anomaly is detected, as shown in Fig. 4.

3.2. Correlation of exploitation data

Temporal correlation models assume that there is a relationship between the reading observed by a node at time step t and that node's readings at previous time instants, as in [39]. In contrast, spatial correlation implies that there is a geographical relationship between the readings of sensor nodes within a certain physical range, such as a neighborhood [40,41] or a cluster [42]. Spatial correlation can also be defined over a logical range, such as a group of trusted sensors [43], where the correlated sensor readings are assumed to have similar values. Zhang and co-workers [38,44–46] aim to capture the spatio-temporal correlations, which help to predict the trend of sensor readings and to distinguish between errors and events.
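The interplay of temporal and spatial correlation described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the function name `classify_reading`, the threshold factor `k` and the majority-vote rule are assumptions chosen for illustration. A reading that agrees with the node's own history is treated as normal; a deviating reading that most neighbors also report is treated as an event, and one that no neighbor confirms is treated as an error.

```python
from statistics import mean, stdev


def classify_reading(history, current, neighbor_readings, k=3.0):
    """Label `current` as 'normal', 'event' or 'error' (illustrative rule only).

    history           -- recent readings of this node (at least two values)
    current           -- the new reading to check
    neighbor_readings -- current readings reported by neighboring nodes
    k                 -- hypothetical threshold, in standard deviations
    """
    mu = mean(history)
    # Guard against a zero spread when the history is perfectly constant.
    tolerance = k * max(stdev(history), 1e-9)

    # Temporal correlation: the reading is consistent with the node's past.
    if abs(current - mu) <= tolerance:
        return "normal"

    # Spatial correlation: count neighbors that report a similar value.
    agreeing = [r for r in neighbor_readings if abs(r - current) <= tolerance]
    if 2 * len(agreeing) > len(neighbor_readings):
        return "event"   # most neighbors see the same change
    return "error"       # isolated deviation, likely a faulty reading


if __name__ == "__main__":
    past = [20.0, 20.5, 19.8, 20.2, 20.1]
    print(classify_reading(past, 20.3, [20.1, 20.4, 19.9]))
    print(classify_reading(past, 45.0, [44.8, 45.3, 20.1]))
    print(classify_reading(past, 45.0, [20.0, 20.3, 19.9]))
```

A single threshold derived from the node's own history is used for both checks only to keep the sketch short; a real deployment would likely tune the temporal and spatial tolerances separately.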


3.3. Use of pre-labelled data

Supervised learning approaches learn models of normality and abnormality from pre-labelled data. A new data point is then identified as normal or as an outlier depending on which model it fits. These methodologies are applied in various real-life applications such as fraud detection and intrusion detection. In contrast, the unsupervised learning approach is more general in that it can identify outliers without requiring pre-labelled data. Such methods can identify outliers based on a standard statistical distribution model or on the distance between a point and its neighbors. Semi-supervised learning approaches, on the other hand, train on pre-labelled normal data to identify a boundary of normality. A new data point is then classified as normal or as an outlier depending on how well it fits the normality model.
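As a rough illustration of the semi-supervised idea, the following Python sketch learns a boundary of normality from readings that are assumed to be labelled normal, and then tests new readings against it. The helper names `fit_normality_boundary` and `is_outlier`, and the choice of a mean plus or minus `k` standard deviations as the boundary, are assumptions made for this example rather than a method from the surveyed literature.

```python
from statistics import mean, stdev


def fit_normality_boundary(normal_readings, k=3.0):
    """Learn a simple interval of normality from pre-labelled normal data.

    The boundary is mean +/- k standard deviations; `k` is a hypothetical
    tuning parameter, not a value prescribed by the survey.
    """
    mu = mean(normal_readings)
    sigma = stdev(normal_readings)
    return (mu - k * sigma, mu + k * sigma)


def is_outlier(reading, boundary):
    """Return True if `reading` falls outside the learned boundary."""
    low, high = boundary
    return not (low <= reading <= high)


if __name__ == "__main__":
    training = [10.0, 10.2, 9.8, 10.1, 9.9]   # assumed to be normal readings
    boundary = fit_normality_boundary(training)
    for value in (10.3, 12.0, 9.0):
        print(value, is_outlier(value, boundary))
```

Only normal data is needed for training, which is the property that distinguishes this setting from the supervised one, where examples of abnormal data are also required.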


3.4. Evaluation methods

The adequacy of outlier detection techniques is commonly assessed via cross-validation, which estimates the prediction error [47] using two principal metrics: Root Mean Square Error (RMSE) and Mean Prediction Error (MPE). In the ideal case, the prediction is unbiased and accurate, with MPE = 0 and the lowest RMSE. This method is simpler and more popular than alternatives such as the bootstrap [48]. A characteristic property of the bootstrap is that when a data set contains an outlier, the outlier appears in only a subset of the bootstrap resamples. It causes a considerable rise in the sample mean of those resamples, which makes the bootstrap histogram of the sample mean a mixture distribution with more than one mode. In this context, the concept of an outlier depends not only on a point's exact value but also on patterns in the data. In many real-world applications, it is simply not possible to force data into a form where the three classical assumptions of linearity, homoscedasticity and normal errors are simultaneously satisfied. In addition, this method can be computationally expensive for large datasets.

Moreover, the detection rate (DR), which measures how many outliers are correctly identified, and the false positive rate (FPR), which measures how many normal data points are incorrectly flagged as outliers, are used as metrics of detection accuracy. An effective outlier detection approach should reach a low FPR and a high DR [49]. The trade-off between these metrics is usually represented using the receiver operating characteristic (ROC) curve [50] in a 2-D graph. Other criteria for evaluating the efficiency of outlier detection approaches include the computational cost and memory occupation required to execute the technique, and the use of user-defined parameters.

3.5. Issues of outlier detection in WSNs

Many outlier detection solutions for WSNs have been proposed in the literature.
The main challenge of these techniques is to achieve high detection effectiveness with minimum energy cost. To design suitable solutions, the following constraints need to be respected:
• Resource constraints. Outlier detection in WSNs requires computational capacity, energy, communication bandwidth and storage resources for processing data in real time. However, WSNs are made up of cheap, low-quality sensors with severe resource constraints in terms of memory and processing.
• High communication cost. Various communication constraints exist, such as extreme path loss, signal absorption, spreading, rapidly changing time-varying channels, large propagation delay, noise and fading. In WSNs, radio communication consumes the largest portion of energy [51]. Some traditional outlier detection solutions follow the centralized approach, in which the data is collected from the sensors and sent in its entirety to be processed by the cluster head or the base station. However, the cost of data transmission is higher than the cost of data processing. In addition, distributed online outlier detection models also require communication between sensor nodes. As a result, the energy consumption is affected by the communication overhead incurred by the distribution process.
• Distributed streaming data. The dynamic streaming nature of sensing data is another challenge. Generally, there is no prior knowledge available to build the normal reference model of the sensing data distribution. Even if such knowledge is available at a specific point in time, it may not hold in the future, because the distribution of dynamic streaming data may change over time. In addition, direct computation of probabilities is difficult [52].
• Dynamic network topology. A sensor network deployed in a harsh environment over an extended period of time is susceptible to dynamic network topology and frequent communication failures. Some WSN applications evolve over time such that nodes may be added to or removed from the network. As a result, the old normal reference model that was built for the network needs to be updated. In addition, the mobility of nodes in some applications and communication failures increase the rate of topology change.
• Network heterogeneity. Heterogeneity appears when the sensor nodes are equipped with several sensors for measuring different environmental phenomena at the same time. Likewise, some WSN applications use different types of nodes and assign different jobs to different nodes. Additionally, the data collected by the sensors follows different distributions, which makes the anomaly detection model unable to cope with this kind of heterogeneity.
• Identifying outlier sources. A wireless sensor network is expected to collect the raw data sensed from the physical world and also to detect events observed in the network. It is difficult to identify the cause of an outlier in sensor data because of the dynamic nature of WSNs and their resource constraints. Thus, the main question is how to process as much data as possible in a decentralized and online way while keeping the memory, communication overhead and computational cost low.
• High-dimensional data. An increase in network size may also increase the dimensionality of the collected data. As a result, it can incur a higher computational cost that drains the memory and energy of the sensors. The increase in data dimension also becomes a problem for the efficiency of anomaly detection.
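The accuracy metrics introduced in Section 3.4 can be made concrete with a short Python sketch. The function names and the label convention (1 for outlier, 0 for normal) are assumptions for illustration. MPE is computed here as the mean of predicted minus observed values, which is one common sign convention and is also an assumption, since the survey does not spell out the formula.

```python
import math


def detection_metrics(y_true, y_pred):
    """Return (detection rate, false positive rate).

    y_true / y_pred hold 1 for outlier and 0 for normal (assumed convention).
    DR  = correctly detected outliers / all true outliers
    FPR = normal points flagged as outliers / all true normal points
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    dr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return dr, fpr


def rmse(observed, predicted):
    """Root mean square error between observed and predicted readings."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mpe(observed, predicted):
    """Mean prediction error (bias); 0 indicates an unbiased predictor."""
    n = len(observed)
    return sum(p - o for o, p in zip(observed, predicted)) / n


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    flagged = [1, 1, 0, 1, 0, 0, 0, 0]
    print(detection_metrics(truth, flagged))
    print(rmse([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]), mpe([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Sweeping a detector's threshold and plotting the resulting (FPR, DR) pairs yields the ROC curve mentioned in Section 3.4.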

3.6. Use of outlier detection in WSNs

Outlier detection, also called anomaly detection, is one of the fundamental tasks in data mining applications, as shown in Fig. 5. It identifies defective sensor nodes that consistently generate outlier values, helps ensure the security of the network and detects potential network attacks. The value added by outlier detection appears in many real-life applications [53]:
• Habitat monitoring, in which endangered species can be equipped with small non-intrusive sensors to monitor their behavior.
• Environmental monitoring, in which humidity and temperature sensors are deployed in harsh and hostile regions to monitor the natural environment.
• Healthcare and medical monitoring, in which small sensors are placed at multiple positions on the patient's body to monitor vital signs and patient metrics such as heart rate and blood pressure.
• Target tracking, in which sensors are embedded in moving targets to track them at a particular time.
• Industrial monitoring, in which devices are equipped with pressure, temperature or vibration sensors to monitor their states.

Fig. 5. Applications of outlier detection in WSNs.

4. Outlier detection techniques for WSNs

Outlier detection techniques aim to clean and improve the collected data and to provide the best information to end users. These approaches optimize the quality of sensor measurements while maintaining low energy consumption and high computational efficiency. In this section, we classify outlier detection techniques designed for WSNs by discipline into statistical-based approaches, nearest neighbor-based approaches, artificial intelligence-based approaches and classification-based approaches, as shown in Fig. 6.

4.1. Statistical-based approaches

Statistical-based approaches were the first algorithms used for outlier detection, a problem studied by statisticians in the early 19th century [54]. In fact, Grubbs [18],

Barnett and Lewis [20] and Rousseeuw and Leroy [55] describe and analyze a broad range of statistical outlier techniques. In these techniques, the observations are modeled using a stochastic distribution. A point is identified as an outlier if the probability that it was generated by this distribution model is very low. A substantial amount of research has addressed statistics-based outlier detection, such as Hawkins [19] and Beckman and Cook [56]. Statistical models are generally suited to quantitative real-valued data sets or, at the very least, quantitative ordinal data distributions that can be transformed to suitable numerical values for numerical treatment. This constraint limits their applicability and raises the processing time if complex data transformations are needed before processing. Various methods have been proposed to reduce the processing time [57] and to overcome the problem of increasing dimensionality, which raises the processing time and distorts the data distribution. Statistical-based approaches can be further divided into parametric and non-parametric techniques as follows:
• Parametric: these techniques can identify outliers effectively if a correct probability distribution model is available, but this is not the case in many real-life scenarios. Similarly, it is difficult to choose an appropriate threshold for evaluation, and a fixed threshold is not appropriate given the dynamic changes in WSN characteristics. These models are categorized into Gaussian-based and non-Gaussian-based schemes.
• Non-parametric: unlike parametric approaches, these techniques make no assumptions about the underlying data distribution. Commonly used techniques include histograms and kernel functions. Histograms are not suitable for real-time applications because they cannot capture the interactions between different attributes of multivariate data. Kernel functions can solve this problem, as they scale well to multivariate data at a lower computational cost.

Fig. 6. Overview of outlier detection techniques for WSNs.

In 2000, Laurikkala et al. [58] described one of the simplest statistical outlier detection techniques. It is equivalent to a visual inspection in which informal box plots are used to pinpoint outliers in univariate as well as multivariate data sets. The box plots give a graphical representation that allows a human auditor to visually pinpoint the outlying points. This approach makes no assumptions about the data distribution model, but it depends on a human to notice the extreme points plotted on the box plots.

In 2003, Hida et al. [59] proposed a technique based on the spatio-temporal correlations of sensor data that uses two statistical tests to detect outliers locally, in order to make simple aggregation operations more reliable. Each sensor reading is compared against the current and previous values of all its neighboring nodes. However, this technique considers only one-dimensional outlier data and uses a lot of memory to store the historical values of all nodes in the neighborhood.

Moreover, Palpanas et al. [60] described a kernel-based technique for distributed online deviation detection in real-time streaming sensor data in wireless sensor networks. This non-parametric approach requires no prior knowledge of the data distribution; instead, it builds a model of the most recent values in a sliding window using a kernel density estimator. Each node can thus locally identify outliers as values that deviate significantly from the model of the approximated data distribution. The drawbacks of this technique are its high dependency on the defined threshold, the insufficiency of a single threshold for multidimensional data, and the cost of maintaining the data model.

In 2006, Subramaniam et al. [52] extended the work of Palpanas et al. [60] with two global outlier detection techniques for complex applications. In the first, each node uses the same technique as Palpanas et al. and then transmits its outliers to its parent to be checked, until the sink eventually determines all global outliers. In the second, each node employs LOCI [15] to locally detect global outliers using a copy of the global estimator model obtained from the sink. These approaches achieve high accuracy in estimating the data distribution and a significant outlier detection rate with low memory occupation and data transmission. Their main problem is their inability to detect spatial outliers, because they do not consider the spatial correlations among neighboring sensor readings.

In [44], Jun et al. designed a non-Gaussian-based approach, which uses a symmetric α-stable (SαS) distribution to model outliers in the form of impulsive noise. This approach uses the spatio-temporal correlations of sensor data to detect outliers locally. Each node in a cluster compares the predicted data with the sensor reading to detect and correct temporal outliers. Thus, the cluster head collects the rectified data from

all other nodes in the cluster and detects spatial outliers that deviate from normal behavior. Even though this technique reduces the cost of communication and computation, the distribution used may not be compatible for real sensor data and the cluster-based topology. In 2007, Wu et al. [61] proposed and analyzed two novel techniques for identiﬁcation of outlying sensors and the detection of the reach of events in sensor networks. These algorithms used the spatial correlation of the readings existing among neighboring sensor nodes. This spatial correlation aims at differentiating between outlying sensors as well as the event boundary with a high accuracy and a low false alarm rate when as much as 20% sensors report outlying readings .This approach has a low computational overhead and low accuracy of outlier detection techniques where the temporal correlation of sensor readings is ignored. In [29], Bettencourt et al. selected a local outlier detection technique to identify errors and detect events in ecological applications of WSNs. This method used the spatio-temporal correlations of sensor data to differentiate between erroneous measurements and events. However, each node archived the statistical distribution of difference between its own measurements and each of its neighboring nodes, also between its current and previous measurements. If its measurement is less than a user-speciﬁed threshold, the observation is identiﬁed as outlier. The main key of this approach is the adequate choice of threshold value. A histogram-based method to identify global outliers in data collection applications of sensor networks with reduced communication cost is presented in [62] where each sensor maintains a histogram type summary of pertinent data over a sliding window of its data points. The sink uses these summaries to extract data distribution from the full set of accumulated data and ﬁlters out of the normal data. 
Outliers are identified either by a fixed threshold value or by their rank among all outliers. Unfortunately, this technique only detects outliers in one-dimensional data, and it applies only in settings where spatial proximity is important. The approach proposed in 2008 by Shuai et al. [63] detects outliers by exploiting the spatial and temporal correlations of sensor data through a prediction model based on the Kalman filter. This technique uses only the estimate of the system's state from previous and current measurements of sensor data, and it requires little computation and storage. The authors define two phases. The transition phase predicts the upcoming state from the current one. A measurement module then estimates neighboring sensor readings as data produced by a virtual sensor device. In 2010, another statistical modeling technique was described in [64]. It approximates the sensor data distribution using a kernel density function, which transforms the raw sensor readings into meaningful information. However, this technique handles only one-dimensional data. Also, Zhang et al. [38] describe a method based

A. Ayadi et al. / Computer Networks 129 (2017) 319–333

on time-series analysis and geostatistics in 2012. It aims at identifying outliers and differentiating between errors and events in a distributed and online manner with a low false positive rate and high accuracy, taking advantage of the spatial and temporal correlations existing in WSN data to define the normal behavior. In addition, in 2015, Kannan et al. [65] compared the efficacy of statistical-based techniques and nearest neighbor-based approaches for identifying outliers in data mining. This comparison shows that the statistical approach, Histogram Based Outlier Score (HBOS), flagged the largest number of points as outliers, while the nearest neighbor-based techniques identified only the most extreme points as outliers.

4.2. Nearest neighbor-based approaches

Nearest neighbor-based approaches are among the most widely used concepts in machine learning and data mining; they analyze a data instance with respect to its nearest neighbors. They have been used for various purposes such as classification, clustering and outlier detection, where a data instance located far from its nearest neighbors is declared an outlier [66,67]. Although these approaches make no assumption about the data distribution, they can generalize many notions from statistical-based approaches. Nearest neighbor-based outlier detection techniques have an explicit notion of proximity, built on a well-defined distance between two data instances, two sets of instances or two sequences of instances. The most popular choice for univariate and multivariate continuous attributes is the Euclidean distance. In 2006, Branch et al. [68] proposed a technique based on distance similarity to solve the problem of unsupervised global outlier detection in wireless sensor networks. Each node uses distance similarity to identify local outliers and then broadcasts the outliers to its neighbors for verification.
This procedure is repeated until all of the sensor nodes in the network eventually agree on the global outliers. However, every node broadcasts to all other nodes in the network, which increases communication overhead. This algorithm is therefore best suited to applications in which the confidence of an outlier rating can be calculated by adjusting a sliding window; in such settings the algorithm is accurate and imposes a reasonable communication load and level of power consumption. In addition, an in-network outlier cleaning approach for sensor network data collection applications is presented in [45]. It combines wavelet-based outlier correction with outlier removal based on the DTW (Dynamic Time Warping) distance between neighboring nodes, for spatio-temporally correlated environmental data. This technique reduces outlier traffic and can effectively clean the sensing data: most outliers are either corrected or removed from further transmission within 2 hops. However, this method depends on a suitable threshold that is not easy to define. Furthermore, in 2007, Zhang et al. [69] described a new unsupervised distance-based approach to detect global outliers in snapshot and continuous query processing applications of sensor networks. It adopts an aggregation tree structure in which each node sends some useful data to its parent after collecting all the data sent from its children. The sink then extracts the top global outliers and broadcasts them to all nodes in the network for verification. This process is repeated whenever a node disagrees with the global results calculated by the sink. This technique reduces communication overhead, but it handles only one-dimensional data. In 2010, Lijun et al. [70] presented SODRNN, a novel data stream outlier detection algorithm based on reverse nearest neighbors. In order to detect anomalies in the current window, the outlier queries are performed with the sliding window model.
This algorithm needs only one scan of the current window to process an insertion or deletion, which improves efficiency and effectiveness. The next year, Kontaki et al. [71] extended the works of Angiulli and co-workers [72,73], which used sliding windows to identify global distance-based outliers in data streams, and improved their complexity and memory consumption. In 2013, a parameter-free outlier detection algorithm [74] was proposed that computes the Ordered Distance Difference Outlier Factor. In this approach, a new outlier score is calculated for each data point from the differences of its ordered distances. Because the LOF (Local Outlier Factor) technique achieves high detection performance in non-homogeneous densities without any assumption about the underlying distribution of the data set, it has become a prominent technique and numerous variants have been proposed. Some variants improve the detection accuracy of LOF by changing the way the k-NNs are determined [75], while others use approximation [15] to improve its time complexity. In 2015, Kannan et al. [65] compared the efficacy of statistical-based techniques and nearest neighbor-based approaches for identifying outliers in data mining. This comparison shows that the statistical approach, Histogram Based Outlier Score (HBOS), flagged the largest number of points as outliers, while the nearest neighbor-based techniques such as LOF, COF (Class Outlier Factors), LOOP and INFLO (Improving Influenced Outlierness) identified only the most extreme points as outliers. In 2016, Abid et al. [76] discussed an unsupervised outlier detector, Data Nearest for Outlier Detection (DNOD). This approach analyzes learning data gathered by sensors in order to localize outlier measurements.
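As a minimal illustration of the distance-based idea shared by the techniques in this subsection, the following Python sketch scores each point by the Euclidean distance to its k-th nearest neighbor and flags points whose score exceeds a threshold. It is a generic brute-force formulation, not the algorithm of any cited paper; the function names and the values of `k` and `threshold` are assumptions made for the example.

```python
import math


def knn_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Brute force: distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def distance_outliers(points, k, threshold):
    """Indices of points whose k-th nearest neighbor is farther than threshold."""
    return [i for i, s in enumerate(knn_scores(points, k)) if s > threshold]


if __name__ == "__main__":
    readings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    print(distance_outliers(readings, k=2, threshold=2.0))
```

The pairwise distance computation is quadratic in the number of points, which is the scalability concern usually raised against nearest neighbor-based detection on resource-constrained nodes.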

4.3. Artificial intelligence based approaches

Recently, many outlier detection approaches for WSNs have been based on decision-making theories. The Artificial Intelligence techniques used in outlier analysis include Neural Networks and Fuzzy Logic. The latter is a logical model which provides a general idea about the decision process in the analysis of the data set. Fuzzy logic, as described by Sisman et al. [77], is essentially an approach that allows intermediate values between conventional values such as right/wrong, yes/no, high/low. The main purpose of the method is to bring certainty by assigning a membership degree to concepts which are hard to express or have an unclear meaning. A fuzzy logic system consists of three main phases: fuzzification, rule base and defuzzification. Firstly, fuzzification can be defined as a transfer between a definite system and a fuzzy system; it describes a property of an object in a certain fuzzy set. Secondly, the rule base combines the membership functions from the fuzzificator with the rule-handling data, which is based on the database and stored there. Thirdly, in the defuzzification unit, the rule results obtained from the rule-handling unit are evaluated and turned into definite results, as in [77]. An outlier is declared according to the rule results. In WSNs, fuzzy logic has been used to improve decision-making, cluster-head election, security, data aggregation, routing, MAC protocols, QoS and, recently, outlier detection. Liang and Wang [78] proposed the use of fuzzy logic in combination with Double Sliding Window Detection in order to improve the accuracy of event detection. However, they do not study the effect of fuzzy logic alone or the influence of spatial or temporal properties of the data on the classification accuracy. In [79], fuzzy logic is used to combine personal and neighbors' observations and determine whether an event has occurred.
Their results show that fuzzy logic improves the precision of event detection. However, this approach, named D-FLER (A Distributed Fuzzy Logic Engine for Rule) [79], does not incorporate any temporal semantics. In the context of outlier detection, Kamal et al. [80] extend the work of Zhang et al. [38] and use spatial-temporal similarity combined


Fig. 8. Multi-class classiﬁcation based outlier detection.

Fig. 7. An example of clusters of points.

with fuzzy logic. This approach detects outliers with a high detection rate while decreasing the resource consumption of the network.

4.4. Clustering-based approaches

Cluster analysis [81] is a popular technique within the data mining community for grouping similar data instances into clusters with similar behavior. Clustering is the partition of data into groups of similar objects, in which each group, or cluster, is formed of objects that are similar to one another and different from objects in other groups [82]. In the literature, various data-mining algorithms find outliers as a by-product of clustering, such as [83,84], and other clustering-based methods have been proposed, such as [85–87]. These techniques use the Euclidean distance as the similarity measure between two data instances, but computing it for multivariate data is expensive. Clustering-based techniques involve a clustering step, after which data instances are called outliers if they do not belong to any cluster or if their clusters are small compared to the other clusters [21,88,89], as shown in Fig. 7. Clustering-based approaches need no prior knowledge of the data distribution and can be used in an incremental mode. However, they suffer from the difficulty of choosing an appropriate cluster width. Various advantages of clustering-based approaches are mentioned in [90–92]. Used as semi-supervised techniques, they are suitable for novelty detection [93], which utilizes normal data to produce clusters representing the normal modes of behavior of the data [94,95]. An instance is identified as an outlier if it is not assigned to any of the formed clusters. In addition, Smith et al. [96] investigated K-means clustering, Self-Organizing Maps (SOM) and Expectation Maximization (EM) to recognize attacks among the normal activities in a system, and then used the clusters to classify test data.
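The role of the cluster width parameter mentioned above can be seen in a short Python sketch of single-pass fixed-width clustering, in which instances that end up in small clusters are treated as outliers. This is a simplified illustration under assumed parameters (`width`, `min_size`) with hypothetical function names; it does not reproduce any specific cited algorithm.

```python
import math


def fixed_width_clusters(points, width):
    """Single pass: join the nearest centroid within `width`, else open a cluster."""
    centroids, members = [], []
    for idx, p in enumerate(points):
        best, best_d = None, width
        for c, centre in enumerate(centroids):
            d = math.dist(p, centre)
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            # No centroid is close enough: p starts a new cluster.
            centroids.append(list(p))
            members.append([idx])
        else:
            members[best].append(idx)
            n = len(members[best])
            # Incremental mean update of the centroid.
            centroids[best] = [m + (x - m) / n for m, x in zip(centroids[best], p)]
    return centroids, members


def small_cluster_outliers(members, min_size):
    """Indices of points that belong to clusters smaller than min_size."""
    return sorted(i for group in members if len(group) < min_size for i in group)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (0.3, 0.2)]
    _, groups = fixed_width_clusters(data, width=1.0)
    print(small_cluster_outliers(groups, min_size=2))
```

A width that is too small splits normal data into many tiny clusters and inflates false alarms, while a width that is too large absorbs outliers into normal clusters, which is why the choice of this parameter is reported as the main weakness of such methods.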
In the same context, Vinueza and Grudic [97] present a novel technique that identifies global as well as local outliers with respect to the clusters. In this algorithm, a data point can belong to one of several class labels, and the parameters of the clusters representing each class are learned. A test point is identified as an outlier if it is far from all clusters or if it belongs to a class that is far from all other points. Clustering-based approaches can also be applied as unsupervised techniques, by running a known clustering algorithm and then analyzing each data instance with respect to the clusters. Mahoney et al. [98] use the CLAD (Clustering Learning Anomaly Detectors) algorithm, which takes a random sample and calculates the average distance between the closest points to derive the cluster width from the data. A cluster is declared a local outlier if its density is lower than a threshold, and it is

identified as a global outlier if it is far away from the other clusters. Barbara et al. [99] proposed a bootstrapping technique that uses frequent itemset mining to distinguish normal data from outliers and the COOLCAT clustering technique [100] (so named because it reduces the entropy of the clusters, thereby 'cooling' them) to obtain clusters. Moreover, Rajasegarar et al. [30] described a global, distributed, non-parametric approach to identify outliers in measurements offline at sensor nodes. Each sensor node clusters its measurements using a fixed-width clustering algorithm and sends cluster summaries to its parent node. The cluster head then merges the cluster statistics collected from its children before reporting them to the sink, which identifies all outliers. A cluster is declared anomalous if its average inter-cluster distance is larger than a threshold value derived from the set of inter-cluster distances. Although this technique reduces the communication overhead and supports energy efficiency, anomaly detection is carried out only at the base station, so it is unsuitable for local and real-time decision-making. In addition, Birant and Kut [101] present a spatio-temporal outlier detection technique using a clustering concept called ST-DBSCAN (Spatio-Temporal DBSCAN), an extended version of the clustering technique DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [102] that also utilizes temporal aspects.

4.5. Classification-based approaches

Classification approaches [103] are important systematic approaches in data mining and machine learning. The aim of classification is to learn a classification model (classifier) from a set of labeled data instances (training) and then classify an unseen instance into one of the learned (normal/outlier) classes (testing). The classifier may need to update itself to accommodate new data that belongs to the normal class.
In machine learning, classification-based outlier detection techniques [104] operate under the general assumption that a classifier able to distinguish between normal and outlier classes can be learned from a given feature space. This procedure can be divided into two phases: training and testing. The training phase learns a classifier using the available labeled training data, and the testing phase then classifies a test instance as normal or as an outlier using the classifier. The first category of classification is multi-class classification, such as Bayesian networks and neural networks. These techniques assume that the training data contains labeled instances which belong to multiple normal classes [105,106]. A classifier is learned to distinguish each normal class from the rest of the classes. If a test instance is not classified as normal by any of the classifiers, then it is identified as an outlier, as shown in Fig. 8. Some multi-category methods give the classifier a confidence


score. If none of the classifiers is confident in classifying the test instance as normal, or in other words none scores well, the instance is identified as an outlier, as shown in Fig. 6. Bayesian network-based approaches use a probabilistic graphical model to represent a set of variables and their probabilistic independencies. They aggregate data from different instances and provide an estimate of the probability that an event belongs to the learned class. In 2004, Elnahrawy and Nath [107] described a method for modeling and learning statistical contextual information in WSNs. It can also be applied to outlier identification, as a naive Bayesian model-based approach to discover local outliers and detect faulty sensors. This technique maps the problem of learning spatial and temporal correlations to the problem of learning the parameters of the Bayesian classifier and then utilizes the classifier for probabilistic inference. In the employed model, the current reading of each sensor is influenced only by the preceding reading of the same sensor, and the probability of each incoming reading falling in each of the subintervals (classes) into which the whole value interval is divided is computed. The method then predicts the most probable class of the subsequent reading; a reading is deemed an outlier if its probability in its class is smaller than that of being in other classes. The technique does not require a specified threshold to indicate outliers and can also be used to approximate readings missing in the network, but it does not deal with multidimensional data. Naive Bayesian networks can estimate whether an observation belongs to a class or not, but they do not consider the conditional dependencies between the observations of sensor attributes. Accordingly, an outlier detection technique based on Bayesian Belief Networks [108] was proposed in 2006.
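The subinterval-based rule described for the naive Bayesian approach can be sketched in Python as follows. This is a deliberately reduced illustration that keeps only the temporal dependency on the previous reading of the same sensor, so the classifier collapses to a smoothed conditional frequency table. The class name `TemporalBayes`, the Laplace smoothing and all parameter values are assumptions introduced for the example and are not taken from [107].

```python
from collections import defaultdict


class TemporalBayes:
    """Discretized reading model: P(current class | previous class)."""

    def __init__(self, lo, hi, n_classes):
        self.lo, self.hi, self.n = lo, hi, n_classes
        # counts[prev][cur] = times class `cur` followed class `prev` in training
        self.counts = defaultdict(lambda: [0] * n_classes)

    def cls(self, x):
        # Map a reading to its subinterval (class), clamping to the value range.
        width = (self.hi - self.lo) / self.n
        return min(self.n - 1, max(0, int((x - self.lo) / width)))

    def fit(self, readings):
        for prev, cur in zip(readings, readings[1:]):
            self.counts[self.cls(prev)][self.cls(cur)] += 1

    def class_probs(self, prev):
        # Laplace smoothing keeps unseen classes at a small non-zero probability.
        row = self.counts[self.cls(prev)]
        total = sum(row) + self.n
        return [(c + 1) / total for c in row]

    def is_outlier(self, prev, cur):
        # Outlier if the reading's own class is less probable than another class.
        probs = self.class_probs(prev)
        return probs[self.cls(cur)] < max(probs)

    def predict(self, prev):
        # Midpoint of the most probable class; usable to fill a missing reading.
        probs = self.class_probs(prev)
        best = probs.index(max(probs))
        width = (self.hi - self.lo) / self.n
        return self.lo + (best + 0.5) * width


if __name__ == "__main__":
    model = TemporalBayes(0.0, 40.0, n_classes=4)
    model.fit([21.0, 22.0, 23.0, 22.0, 21.0, 22.0, 23.0, 24.0, 22.0])
    print(model.is_outlier(22.0, 23.0), model.is_outlier(22.0, 38.0))
    print(model.predict(22.0))
```

No user-specified threshold appears in the decision rule, which mirrors the property noted above; the number of subintervals, however, still has to be chosen.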
Janakiram et al. present an online and distributed technique based on a Bayesian belief network (BBN) to identify local outliers in streaming sensor data. This method uses the BBN to capture the spatio-temporal correlations among the attributes and to estimate missing values in the data streams emanating from the sensors. Compared with naive Bayesian networks, the BBN increases accuracy because it considers conditional dependencies among the attributes, but it may not cope with dynamic changes of network topology. One year later, [109] presented an approach that uses Dynamic Bayesian networks (DBNs), with a network topology that evolves over time, to identify local outliers in environmental sensor data streams. Two techniques were presented to identify anomalous data, the Bayesian credible interval and maximum a posteriori measurement status (MAP-ms), and both can operate on several data streams at once. In the first method, Kalman filtering is used to sequentially infer the distributions of hidden and observed states as new measurements become available from the sensors; these distributions are used to construct a Bayesian credible interval for the most recent set of measurements. Measurements that fall outside the expected interval are considered outliers. The other method utilizes a more complex DBN including two measured state variables for outlier detection. Furthermore, a hierarchical Bayesian space-time (HBST) approach was developed in [43,46], in which spatial and temporal correlations are not calculated but only assumed. This technique uses a tagging system to mark data that do not fit within given models. Although HBST is complex, it is accurate, has a low false detection rate and copes better with model mismatches and unmodeled dynamics than linear autoregressive modeling. In 2015, Paola et al. [110] proposed an adaptive distributed Bayesian approach for identifying outliers in data collected through a wireless sensor network.
This algorithm improved the accuracy of classiﬁcation, time complexity and communication complexity, and optimized the metrics for latency and energy consumption if compared to non-adaptive approaches. Also, Neural Networks [111] are widely used for building classiﬁers by learning different weights associated with the network. It is a set of interconnected nodes designed to imitate the func-


Fig. 9. One-class classification based outlier detection.

tioning of the human brain. Each node has a weighted connection to several other nodes in neighboring layers. In [22], Harkins et al. use a Replicator Neural Network (RNNs) which is a multi-layer perceptron neural network with three hidden layers, and the same number of output neurons and input neurons, in order to model data. In this model, the input variables are also the output variables so that the RNN forms an implicit, compressed model of data during training. This research’s objective is to provide a measure of outlying data records such as the reconstruction of error of individual data points. The performance of the RNNs is assessed by using a ranked score measure. The effectiveness of the RNNs for outlier detection is demonstrated on two publicly available databases. This compares to SmartSifter [112] which similarly builds models to identify outliers but scores the individuals depending on the degree to which they perturb the model. Another previous neural network method approach to detecting outliers is presented by Sykacek [113] where Sykacek’s neural network approach is to use a multi-layer perceptron (MLP) as a regression model. Then it treats outliers as data with residuals outside the error bars. To detect outliers, another dynamic model of WSNs based on recurrent neural networks (RNNs) was presented in [114], whereas a mechanism commonly used for outliers detection was described by Siripanadorn et al. [115]. In this paper, authors aim to propose an anomaly detection algorithm which determines the wavelet transform, and detects the abnormality of the sensor readings by training the SOM using the wavelet coeﬃcients. This can be achieved by the Discrete Wavelet Transform (DWT) that has been extensively employed for anomaly and fault detection in many applications [116]. DWT has also been integrated within SOM to detect faults [117]. In particular, feature vectors of the faults have been constructed using DWT, sliding window and a statistical analysis. 
The classification of the feature vectors was obtained by using SOM. The second category comprises one-class classification-based approaches. One-class classification-based outlier detection methods assume that all training instances have a single class label, as shown in Fig. 9. A discriminative boundary is learned around the normal instances using a suitable one-class classification algorithm, such as support vector machines (SVM) [118], Kernel Fisher Discriminants [119] and Kernel Principal Component Analysis (KPCA); any test instance that falls outside the learnt boundary is identified as an outlier. SVM, introduced by Vapnik [120,121], is a popular classification-based approach in the data mining and machine learning communities. It is used to detect outliers without requiring an explicit statistical model. It also optimizes the solution for classification by maximizing the margin of the decision boundary and avoiding the curse of dimensionality. SVM separates the data belonging to different classes by fitting between them a hyperplane, defined by a number of support vectors, which maximizes the separation. Each instance in


the training set contains one 'target value'. The instance is mapped into a higher-dimensional feature space where it can be easily separated by a hyperplane. Furthermore, a kernel function is used to compute the dot products among the mapped vectors in the feature space to find the hyperplane. In this context, in 2007, Rajasegarar et al. [39] used a distributed anomaly detection approach that is energy-efficient in terms of communication overhead. This approach is based on a one-class quarter-sphere SVM for wireless sensor networks in a multi-level hierarchical topology, and was evaluated using real gathered data. The technique identifies local outliers at each node: sensor data lying outside the quarter sphere is identified as an outlier. Each node then communicates summary information, the radius of its sphere, to its parent for global outlier classification. However, this approach does not operate in real time, and its accuracy is low because it ignores the spatial correlation of neighboring nodes. In addition, Zhang et al. [123] developed three one-class support vector machine-based outlier detection techniques in which the model representing the normal behavior of the sensed data is sequentially updated. These techniques use the spatial and temporal correlations that exist between sensor data to identify outliers. Although this online outlier detection approach achieves high detection accuracy and a low false alarm rate, it involves complicated computation, though at low memory cost. In 2010, Xiao et al. [124] presented a support vector machine-based method to detect outliers in high-dimensional nonlinear sample data. The model is built from a clean sample set without outliers and is then used to predict the samples. If the error between the predicted value and the actual value exceeds a threshold, the sample is identified as an outlier. The method was applied to analyze practical copper-matte converting production data.
This method is efficient at detecting outliers in high-dimensional nonlinear samples and has considerable practical value. Mohamed and Kavitha [125] described a real-time network outlier detection method for wireless sensor networks. This technique classifies sensor node data as a local outlier, cluster outlier or network outlier using an SVM classification method. It detects outliers in real time with high accuracy and a low false alarm rate, at the price of some computational complexity due to updating the training set in real time. In 2013, Smart et al. [126] compared one-class SVM and two-class SVM for detecting multiple faults in induction motors and found one-class SVM better than two-class SVM for fault detection. Their main observation is that the performance of two-class SVM classification can suffer when one of the classes is under-sampled, while the performance of one-class SVM classification can suffer if accurate labels for that one class are not available. Principal component analysis (PCA) is a linear statistical technique, first introduced by Hotelling [127]. It is a well-known dimension-reduction method and also a way of identifying inherent patterns, relations, regularities or structure in data. However, many real-world problems are nonlinear, and models such as Gaussian mixtures and auto-associative neural networks have been used to address them. These techniques need to solve a nonlinear optimization problem and are therefore prone to local minima and sensitive to initialization. Kernel principal component analysis (KPCA) has been proposed as a nonlinear extension of PCA [128]: it performs PCA in a high-dimensional feature space, with the nonlinearity introduced through the kernel. This permits a refinement in the description of the patterns of interest.
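To make the use of PCA for outlier scoring concrete, the following Python sketch computes the reconstruction error of a point with respect to the first principal component of a training set. It is a linear, single-component simplification written for illustration only: the component is estimated by power iteration, no kernel is involved, and the function names are hypothetical.

```python
import math


def first_component(data, iters=100):
    """Estimate the mean and first principal component by power iteration."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[x - m for x, m in zip(row, mean)] for row in data]
    # Uniform start vector; it must not be orthogonal to the true component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # w = X^T (X v), proportional to the covariance matrix applied to v.
        proj = [sum(a * b for a, b in zip(row, v)) for row in centred]
        w = [sum(p * row[j] for p, row in zip(proj, centred)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def reconstruction_error(x, mean, v):
    """Distance between x and its projection onto the principal axis."""
    c = [a - m for a, m in zip(x, mean)]
    t = sum(a * b for a, b in zip(c, v))
    return math.sqrt(sum((a - t * b) ** 2 for a, b in zip(c, v)))


if __name__ == "__main__":
    normal = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    mean, axis = first_component(normal)
    for point in [(3.0, 6.0), (4.5, 4.0)]:
        print(point, reconstruction_error(point, mean, axis))
```

A point would be flagged when its error exceeds a chosen threshold. Data lying on a curved structure is reconstructed poorly by any single linear axis, which is the limitation that kernel-based extensions are meant to address.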
The KPCA method outperforms linear PCA at processing nonlinear systems [39,129]. Over the last few years, Kernel Principal Component Analysis (KPCA) has found several applications, such as face detection [130], image segmentation [131], feature extraction [132], data de-noising [133], computer science [134] and voice recognition [135]. Recently, KPCA has also found an application in novelty detection [136] and

a new field, outlier detection in wireless sensor networks. Ghorbel et al. [91] presented an improved KPCA for outlier detection in wireless sensor networks, based on the Mahalanobis kernel, to extract relevant features for classification and to detect abnormal events. The distribution of the training data is described through a principal subspace in an infinite-dimensional feature space, and the reconstruction error of a new data point is used as a measure to decide whether the point is normal or an outlier. For use with wireless sensor networks (WSNs), MKPCA (KPCA based on the Mahalanobis kernel) is more robust than a standard KPCA. Moreover, Ghorbel et al. [137] proposed a new outlier detection method for wireless sensor networks based on Kernel Principal Component Analysis (KPCA) using the Mahalanobis distance. This technique implicitly computes the mapping of the data points into the feature space to separate outlier points from the normal pattern of the data distribution; a new data point is considered an outlier if its distance is above a prefixed threshold. Compared to KPCA-RE (KPCA using Reconstruction Error), KPCA-MD (KPCA using Mahalanobis Distance) has a higher classification performance on synthetic and real databases, and therefore performs better at finding outliers in wireless sensor networks.

5. Comparison of outlier detection techniques for WSNs

Although all the above-mentioned outlier detection techniques try to achieve high accuracy, each has its own pros and cons, as follows:

• Statistical based approaches
  – Pros
    ∗ Can effectively identify outliers if a correct probability distribution model is acquired.
    ∗ Use temporal correlations to determine the presence of an outlier. A sudden change in the data distribution reduces the temporal correlations, and this helps in outlier detection in streaming data.
  – Cons
    ∗ Parametric techniques are of limited use because, in most real-life WSN applications, there is no prior knowledge of the data distribution.
    ∗ Non-parametric statistical models are not well suited to real-time applications.
    ∗ Histograms do not rely on the underlying data distribution, but they are only efficient for univariate data and cannot capture the interactions between attributes in multivariate data.
    ∗ The computational cost of handling multivariate data is higher.
• Nearest neighbor-based approaches
  – Pros
    ∗ They are unsupervised in nature and make no assumptions about the underlying distribution of the data.
    ∗ Applying nearest neighbor-based techniques to different data types is simple, and primarily requires defining an appropriate distance measure for the given data.
  – Cons
    ∗ The computation of the distance between data patterns in multivariate datasets is very expensive.
    ∗ The scalability of these models is a major concern.
    ∗ A threshold value is used to differentiate outliers from normal objects, and a lower outlierness threshold value will result in a high false negative rate for outlier detection.
    ∗ A problem arises when a data instance is located between two clusters: the inter-distance between the object and its k


Table 2 Classification and comparison of existing outlier detection techniques for WSNs.

Date    Reference  Technique         Architectural structure
2008    [63]       Statistical       Distributed
2010    [64]       Statistical       Distributed
2012    [38]       Statistical       Distributed
2010    [70]       Nearest neighbor  Distributed
2015    [65]       Nearest neighbor  Centralized
2016    [76]       Nearest neighbor  Distributed
2006    [101]      Clustering        Distributed
2009    [122]      Clustering        Distributed
2011    [34]       Clustering        Distributed
2010    [124]      Classification    Distributed
2011    [125]      Classification    Distributed
2014    [91]       Classification    Distributed
2015    [110]      Classification    Distributed
2012    [138]      Hybrid            Centralized
2013    [37]       Hybrid            Centralized
2013    [139]      Hybrid            Distributed
2015    [140]      Hybrid            Distributed

Remaining columns: Dimension outlier (Univariate, Multivariate), Detection mode (Online, Offline), Correlation (Spatial, Temporal); per-row entries not recoverable.

Note: (✓) Function supported, (x) Function not supported, (–) Not explicitly defined.

nearest neighbors increases as the denominator value increases, which leads to a high false positive rate.
    ∗ The efficiency of density-based outlier detection needs to be improved.
• Artificial intelligence based approaches
  – Pros
    ∗ Ability to generalize from limited, noisy and incomplete data.
    ∗ When new data or rules are added to the system, there is no need to re-train the system; new rules are simply added.
  – Cons
    ∗ It is hard to develop a model from a fuzzy system, which requires considerable fine tuning and simulation before becoming operational.
    ∗ Storing the rule base might require a significant amount of memory, since the number of rules grows exponentially with the number of variables.
    ∗ Adding spatial and temporal semantics to the decision-making process further increases the number of rules.
    ∗ Storing a full rule base on every node might not be reasonable, given that sensor nodes have limited memory. In addition, constantly traversing a large rule base might considerably slow down the detection process.
• Clustering based approaches
  – Pros
    ∗ They are easily adaptable to incremental mode (i.e. after learning the clusters, new points can be inserted into the system and tested for outliers).
    ∗ They do not have to be supervised.
    ∗ They are suitable for anomaly detection from temporal data.
    ∗ The testing phase for clustering-based techniques is fast, since the number of clusters against which every test instance needs to be compared is a small constant.
  – Cons
    ∗ Dependency on the choice of cluster width in some clustering techniques makes them unsuitable for WSN applications.
    ∗ Updating the reference model involves a lot of communication overhead and is also computationally expensive.
    ∗ Clustering is computationally very expensive with multivariate data, because calculating the distance measures among all data patterns has a high computational cost, which makes it unsuitable for resource-limited devices such as sensors.

∗ Clustering techniques cannot cope with continuous changes of data streams over time so the normal reference model will be out of date by the time they are used. Although, some recent clustering-based models have tackled this issue via incremental learning methods, the computational cost for such methods is too high to be affordable by constrained resource devices. • Classiﬁcation based approaches – Pros ∗ Does not require Neither a statistical model nor estimated parameters. ∗ Allows optimal and maximum identiﬁcation of outliers. ∗ Solves the problem of multidimensional data. – Cons ∗ Pose a computational complexity that is greater than that of clustering and statistical based techniques. ∗ Need to train itself according to the newarriving normal data sets.
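
The fast testing phase and the cluster-width dependency listed for clustering based approaches can be illustrated with a minimal sketch. It assumes the clusters were already learned offline and are summarized by their centroids. The names `nearest_centroid_distance`, `is_outlier`, the centroid values and the `cluster_width` value are hypothetical and chosen only for illustration; they are not taken from any surveyed technique.

```python
import math


def nearest_centroid_distance(point, centroids):
    """Return the Euclidean distance from a reading to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def is_outlier(point, centroids, cluster_width):
    """Flag a reading that lies farther than cluster_width from every centroid."""
    return nearest_centroid_distance(point, centroids) > cluster_width


# Hypothetical centroids learned offline: (temperature in C, relative humidity in %).
centroids = [(20.0, 45.0), (30.0, 60.0)]

# Assumed width; as noted in the list above, the result depends on this choice.
cluster_width = 5.0

# Each new reading is compared against a small, constant number of centroids.
readings = [(21.0, 46.0), (29.0, 58.5), (45.0, 10.0)]
flags = [is_outlier(r, centroids, cluster_width) for r in readings]
```

Each test costs one distance computation per centroid, which is why the testing phase stays cheap. A width that is too small flags normal readings, while one that is too large hides real outliers.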

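The exponential growth of the rule base noted for the artificial intelligence based approaches can be made concrete with a small calculation. The sketch below assumes a complete fuzzy rule base with one rule per combination of linguistic levels, three levels per variable and four bytes per stored rule. All three figures are assumptions made for illustration, not values from the surveyed work.

```python
def full_rule_count(levels_per_variable, num_variables):
    """Number of rules in a complete rule base: one per combination of levels."""
    return levels_per_variable ** num_variables


def rule_base_bytes(levels_per_variable, num_variables, bytes_per_rule):
    """Memory needed to store the complete rule base on a single node."""
    return full_rule_count(levels_per_variable, num_variables) * bytes_per_rule


LEVELS = 3          # assumed linguistic levels, e.g. low / medium / high
BYTES_PER_RULE = 4  # assumed storage cost of one rule

# (number of variables, number of rules, bytes) for growing numbers of inputs.
sizes = [
    (n, full_rule_count(LEVELS, n), rule_base_bytes(LEVELS, n, BYTES_PER_RULE))
    for n in (2, 4, 6, 8)
]
```

Under these assumptions, going from two to eight input variables raises the rule count from 9 to 6561. This is the effect described above when spatial and temporal semantics add further inputs to the decision process.
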
In addition, the general characteristics of existing outlier detection techniques designed especially for WSNs are shown in Table 2. This table compares the techniques in terms of outlier dimension (univariate and multivariate), detection mode (online and offline), architectural structure and correlation (spatial and temporal). From this table, we can see that the majority of existing work falls into three classes. Firstly, many techniques use spatial correlation among the sensor data of neighboring nodes, but they suffer from the difficulty of selecting appropriate neighborhood ranges. Secondly, some techniques use temporal correlation among sensor data, but the problem is the choice of the size of the sliding window. Thirdly, other approaches consider only the spatial and temporal correlations among sensor data and ignore the dependencies between the attributes of the sensor node, and hence have low outlier detection accuracy and high computational complexity. The key challenge is to develop an outlier detection technique that suits various domains and accounts for several important aspects: streaming and multivariate data, the dependencies between a sensor node, its reliable neighborhood and its attributes, the selection of an adequate and flexible decision threshold and, finally, the dynamics of sensor data updates as well as of the network topology. Taking these criteria into account, outlier detection techniques should be able to work with high dimensional data and to operate online over multivariate streaming data with low communication overhead and computational complexity.

6. Conclusion

Outlier detection is an enormously important problem with applications in a wide variety of domains. In this paper, the main attention is given to the outlier detection techniques suitable for wireless sensor networks, which help to maintain high data quality and to keep control over data analysis. Wireless sensor networks differ from traditional networks in various aspects, thereby necessitating protocols and tools that address their unique challenges and limitations. Consequently, WSNs require innovative solutions for energy awareness, security, localisation, target detection and outlier detection. We have investigated the different ways in which the problem has been formulated in the existing literature. We have provided a succinct and accessible overview of these techniques, together with a comparative analysis in terms of characteristics such as outlier dimension, detection mode, architectural structure and correlation extraction. These factors help to identify the improvement and feasibility of the various techniques presented. In addition, this survey presents both pros and cons for five categories. As future work, we aim to address a specific application domain: WSN-based outlier detection for water pipeline monitoring. We also aim to carry out a comparative study to choose the suitable technique that provides the highest accuracy while taking into account communication, computational and memory complexity, in order to identify the improvement and feasibility of the various techniques discussed.

Acknowledgments

This work was supported by King Abdulaziz City for Science and Technology (KACST) and the Digital Research Center of Sfax (CRNS) under a research grant (project no. 35/1012).


Aya Ayadi has been a Ph.D. student at the National Engineering School of Gabes since January 2016. Her research activity is conducted within the CES research unit. She received the engineer's degree from the Higher School of Applied Sciences and Technology of Gabes in 2012. Her current research interests are in the field of Wireless Sensor Networks (WSN), in particular WSN-based water pipeline monitoring and the Internet of Things. She has several publications in international conferences such as ATSIP and IWCMC.

Oussama Ghorbel is an assistant professor at the High Institute of Management of Gabes (ISGG) and has been a Ph.D. student at the National Engineering School of Sfax since January 2011. His research activity is conducted within the CES research unit. He received the diploma degree in computer science from the Faculty of Sciences of Sfax, Tunisia, in 2007, the Engineering degree from the National Engineering School of Sfax in 2009 and the Master degree in New Technologies of Dedicated Computer Science Systems from the National Engineering School of Sfax in 2010. He is currently an invited Ph.D. student at the University of Technology of Troyes, France, at the LM2S Laboratory. His current research interests are in the field of Wireless Sensor Networks (WSN) and Image Compression. He has served in the organization of national and international conferences: ICM, TWESD, SensorNets-09, RoboSense 2012, WES 2015 and IDT 2016.

Dr. Abdulfattah Mohammad Obeid received the Ph.D. degree in electrical and information engineering from Darmstadt University of Technology, Germany, in 2006. He is currently an associate professor and the Deputy Director for Scientific Affairs of the Communications and Information Technologies Research Institute (CITRI) at King Abdulaziz City for Science and Technology (KACST), the primary national research agency in the Kingdom of Saudi Arabia. His research interests are in the areas of computer architectures, reconfigurable computing, VLSI, Wireless Sensor Networks and IoT. Dr. Obeid was a key member of several planning committees for science, technology and innovation in Saudi Arabia. He has led a number of technology commercialization studies, some of which have resulted in commercial investments. Dr. Obeid is also the founder and managing director of WaferCatalyst, a low-cost Multi-Project Wafer (MPW) service provider that is the first of its kind in the region. This program provides essential supporting and stimulating services to the Microsystems R&D ecosystem in KSA and regionally. These services include IC fabrication, training and support. WaferCatalyst also serves as a platform for collaboration between different partners, thus enabling an electronics design industry in the region. Dr. Obeid has served on the technical committees of several IEEE conferences and workshops. He has authored and co-authored several papers in refereed international conferences and journals. Dr. Abdulfattah M. Obeid is a senior member of IEEE.

Mohamed Abid is Head of the "Computer Embedded System" laboratory (CES-ENIS), Tunisia. He is currently a professor at the Engineering National School of Sfax (ENIS), University of Sfax, Tunisia (http://www.ceslab.org/eng/perso.php?id=27). He received the Ph.D. degree from the National Institute of Applied Sciences, Toulouse (France) in 1989 and the "thèse d'état" degree from the National School of Engineering of Tunis (Tunisia) in 2000 in the area of Computer Engineering & Microelectronics. His current research interests include hardware-software co-design, System on Chip, Reconfigurable Systems and Embedded Systems. He has also been investigating the design and implementation issues of FPGA embedded systems. He was a founding member and the head of the doctoral degree in computer system engineering at ENIS from 2003 to 2010. Dr. Abid has served in national and international conference organization and program committees at different organizational levels. He was a founding member of several international conferences. He was also joint editor of special issues in two international journals. Dr. Abid is joint coordinator or an active member of several international research and innovation projects: the STIC/INRIA project since 2009 and the CMCU project since 2009, and he has been Head of the Federator Research Project since 2009. Dr. Abid has supervised or co-supervised more than 20 Ph.D. candidates, most of them under joint supervision, and more than 50 master students. He is author or co-author of more than 30 journal publications and of more than 180 papers in international conferences. He is also author or co-author of many invited papers and joint author of many book chapters. Dr. Abid has also served as guest professor at several international universities and as a consultant in research & development at Telnet Incorporation.
