Information Sciences 297 (2015) 1–20
Information Sciences journal homepage: www.elsevier.com/locate/ins
Aggregate query processing in the presence of duplicates in wireless sensor networks

Jun-Ki Min a, Raymond T. Ng b, Kyuseok Shim c,⇑

a School of Computer Science and Engineering, Korea University of Technology and Education, Byeongcheon-myeon, Cheonan, Chungnam, Republic of Korea
b Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
c School of Electrical and Computer Engineering, Seoul National University, Kwanak, P.O. Box 34, Seoul, Republic of Korea
Article info

Article history:
Received 12 February 2014
Received in revised form 19 May 2014
Accepted 5 November 2014
Available online 15 November 2014

Keywords: Query processing; Aggregation; Sliding window; Sensor network

Abstract

Wireless sensor networks (WSNs) have received increasing attention in the past decade. In existing studies for query processing of WSNs, each sensor node measures environmental parameters such as temperature, light and humidity around its location. Instead, in our work, we consider a different type of sensors that detect objects in their sensing regions which may overlap with each other. In WSNs with such sensors, an object may be detected by several sensor nodes and processing of aggregate queries (such as COUNT, SUM and AVERAGE) becomes problematic since an identical object can be considered redundantly. In this paper, we propose efficient algorithms for processing aggregate queries as well as sliding window aggregate queries in the presence of multiply detected events. To perform de-duplication, our proposed algorithms identify potential duplicates among detected events by communicating with other nodes and perform aggregations as early as possible. In addition, we extend our algorithms for aggregate queries to support time-based sliding windows. By extensive performance study with diverse environments, we show that the energy consumptions of our proposed algorithms are much smaller than those of baseline algorithms.

© 2014 Elsevier Inc. All rights reserved.
⇑ Corresponding author. Tel.: +82 2 880 7269; fax: +82 2 871 5974.
http://dx.doi.org/10.1016/j.ins.2014.11.021
0020-0255/© 2014 Elsevier Inc. All rights reserved.

1. Introduction

A wireless sensor network (WSN) consists of spatially distributed autonomous devices with various sensors, and a powered base station which serves as an access point for users to pose ad hoc queries. Wireless sensor networks have a broad range of applications, such as environment and habitat monitoring [30], which collects meteorological data (e.g., temperature, pressure, and humidity), and combat field surveillance [22], which tracks the movement of personnel or detects potentially hazardous chemicals.

Sensor nodes have limited battery power, and in many applications it is hard or costly to replace the batteries in the deployment environment. Thus, minimizing the energy consumption of sensor nodes to prolong the network lifetime has been one of the most important research issues [8,20,21]. Typically, sensing and communication dominate the power consumption by orders of magnitude compared to computation and accessing RAM [21].

In existing studies for query processing of WSNs, each sensor node measures environmental parameters such as temperature, light and humidity around its location. However, due to advancements of Micro Electro Mechanical Systems
technology, the prices of many sensors such as motion detectors, infrared radiation detectors [30], ultrasonic sensors (sonar) and magnetometers [22] have dropped significantly, and their sensing ranges have recently expanded. Thus, for many applications, it becomes cost effective to use a larger number of sensors to increase the robustness of query processing in sensor networks.

Therefore, in our work, we consider a type of sensors which detect objects in their sensing regions, which may overlap each other. In WSNs with such sensors, an object may be detected by several sensor nodes, and processing of aggregate queries (such as COUNT, SUM and AVERAGE) becomes problematic since an identical object can be counted redundantly. The assumption of disjoint sensing areas no longer holds, which is particularly problematic for aggregate queries because of duplicate detections. Moreover, in WSNs, duplicates also arise when objects move in and out of the sensing regions of multiple sensors over a sufficiently long time. In this case, even if the sensing regions of the sensors do not overlap, de-duplication is required for exact aggregation.

In this paper, we study processing of exact aggregate queries and sliding window aggregate queries in the presence of multiply detected events for WSNs. To the best of our knowledge, our work is the first study of exact in-network aggregation and sliding window aggregation with de-duplication of multiply detected events.

Aggregate Queries: We assume that each object carries a unique identifier (e.g., an animal equipped with a "collar" [14]). Specifically, in this paper, we consider an aggregate query Q with the following SQL syntax:

    Q: SELECT AGG(a_k)
       FROM ( SELECT AGG_1(S.attr_1) AS a_1, ..., AGG_m(S.attr_m) AS a_m
              FROM Sensor S
              GROUP BY S.ID )
       WHERE p(a_1, ..., a_m)
In the above query, the key idea is the introduction of the GROUP BY clause in the inner SQL statement. Using the unique object identifiers, de-duplication is achieved by the GROUP BY clause. However, there is no guarantee that all non-ID attributes have the same values across the detections of an object. The SELECT clause in the inner statement specifies how to obtain the "representative" attribute values for each distinct object.¹ Specifically, we ask the query processor to represent the value of every attribute attr_i (with 1 ≤ i ≤ m) for each distinct object by applying an aggregation function AGG_i such as MIN, MAX and AVERAGE in the SELECT clause of the inner statement. To select among the groups generated by the inner statement, we use the WHERE clause in the outer statement. The selection predicate p in the WHERE clause is applied to each group using the representative attribute values of the group. Finally, the aggregation function AGG is applied to the values of the aggregation attribute a_k to obtain the final aggregation result. The following is an example of our aggregate queries.

Example 1. A zoologist is interested in the number of zebras suffering from a disease. If the symptom of the disease is high body temperature, this can be captured by the following query Q_z.

    Q_z: SELECT COUNT(ZAVG)
         FROM ( SELECT AVG(Z.body_temperature) AS ZAVG
                FROM Zebra Z
                GROUP BY Z.id )
         WHERE ZAVG >= 38
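The semantics of Q_z can be sketched in a few lines of Python. This is only an illustrative, centralized evaluation (no in-network processing); the function name `evaluate_qz` and the sample readings are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def evaluate_qz(readings, threshold=38.0):
    """Count distinct ids whose average temperature is at least `threshold`."""
    by_id = defaultdict(list)
    for obj_id, temperature in readings:       # inner GROUP BY Z.id
        by_id[obj_id].append(temperature)
    # inner SELECT AVG(...): one representative value per distinct object
    representatives = {o: sum(v) / len(v) for o, v in by_id.items()}
    # outer WHERE ZAVG >= 38 followed by SELECT COUNT(ZAVG)
    return sum(1 for avg in representatives.values() if avg >= threshold)

# Hypothetical readings (object id, body temperature), duplicates included.
readings = [("z1", 37.0), ("z1", 38.0), ("z1", 37.0),
            ("z2", 37.0), ("z2", 39.0),
            ("z3", 40.0)]

naive = sum(1 for _, t in readings if t >= 38.0)  # counts raw detections
print(naive, evaluate_qz(readings))               # prints "3 2"
```

Counting raw detections reports three hot readings, whereas grouping by identifier first reports two zebras, which is the gap that de-duplication closes.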
As a zebra may be detected with different body temperatures by different sensors, the inner statement of the query Q_z chooses the average value as the representative body temperature of each zebra. The query Q_z returns the number of zebras whose (average) body temperature is at least 38 °C.

¹ We refer to the tuple consisting of the representative value of every attribute attr_i (with 1 ≤ i ≤ m) of an object o as the representative of o.

Towards the goal of minimizing power consumption, we first propose the LCA (Least Common Ancestor) algorithm for exact aggregation in the presence of duplicates. We exploit the hierarchical structure of routing trees, where sensors are organized in groups with coordinator nodes that perform de-duplication and partial aggregation. The partially aggregated results are then passed along the routing paths to the base station. To further reduce communication overhead, we extend LCA into a two-phase algorithm called LCA-EA (LCA with Eager Aggregation). In the first phase, each sensor node identifies potential duplicates by exchanging a variant of bloom filters for its detected events with the other nodes. After potential duplicate events are identified, they are sent to coordinators and the second phase applies LCA for de-duplication. Meanwhile, for uniquely detected events, early aggregations
are performed and the aggregated results are passed along the routing paths to the base station. Our performance study shows that there are situations where LCA-EA outperforms LCA, while there are other situations where LCA is the recommended method.

Sliding Window Aggregate Queries: A sensor network can be considered as a source of stream data. The stream data can be broken into possibly overlapping partitions by specifying a window, and computation can be carried out in each partition. While in-network processing, where each sensor computes a partial result in a distributed manner, is generally used in sensor networks, existing studies on in-network processing methods do not generally pay attention to sliding window queries. Furthermore, efficient processing techniques for window queries have been proposed in the area of data streams, but most of the previous work assumes that query processing is done by a central server.

We extend the LCA scheme to consider time-based sliding window aggregate queries, which repeatedly return summarized information about the mobile objects detected during a given time interval. To specify time-based sliding windows, we adapt the window query syntaxes [1,2,18] derived from SQL-99 and popularly used in the data stream area. The query syntax that we consider in the paper is as follows:

    Q_w: SELECT AGG(a_k)
         FROM ( SELECT AGG_1(S.attr_1) AS a_1, ..., AGG_m(S.attr_m) AS a_m
                FROM Sensor S RANGE w SLIDE ℓ FOR d
                GROUP BY S.ID )
         WHERE p(a_1, ..., a_m)
In the sliding window aggregate query Q_w, the keyword RANGE specifies the length of the time window, SLIDE indicates the step by which the sliding window moves, and FOR gives the duration of the window query. The following is an example of a sliding window aggregate query Q_zw which is an extension of the aggregate query Q_z in Example 1.

Example 2. A zoologist is interested in monitoring the number of zebras suffering from high body temperatures detected in the window of the past 2 timestamps, at every timestamp, for the duration of 3 timestamps.

    Q_zw: SELECT COUNT(ZAVG)
          FROM ( SELECT AVG(Z.body_temperature) AS ZAVG
                 FROM Zebra Z RANGE 2 SLIDE 1 FOR 3
                 GROUP BY Z.id )
          WHERE ZAVG >= 38
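As a rough illustration of how RANGE, SLIDE and FOR interact, the sketch below enumerates the windows a query would report. It assumes closed intervals of integer timestamps and assumes that the first result is produced once a full window has elapsed after the query is invoked; the function name `window_reports` and both assumptions are ours, not the paper's.

```python
def window_reports(start, range_w, slide, duration):
    """List (report_timestamp, (window_start, window_end)) pairs.

    Assumes integer timestamps, closed intervals, and that a result is
    reported only when a full window of length `range_w` is available.
    """
    last = start + duration - 1          # final timestamp covered by FOR
    reports = []
    t = start + range_w - 1              # first timestamp with a full window
    while t <= last:
        reports.append((t, (t - range_w + 1, t)))
        t += slide                       # SLIDE moves the window forward
    return reports

# RANGE 2 SLIDE 1 FOR 3, invoked at timestamp 1
print(window_reports(start=1, range_w=2, slide=1, duration=3))
# [(2, (1, 2)), (3, (2, 3))]
```

Under these assumptions, the query of Example 2 invoked at timestamp 1 reports the window [1,2] at timestamp 2 and the window [2,3] at timestamp 3.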
Let us assume that Q_zw is invoked at timestamp 1. Then, Q_zw returns the number of zebras with high average body temperatures for the time interval [1,2] at timestamp 2 and that for the time interval [2,3] at timestamp 3.

To optimize processing of time-based sliding window aggregate queries, we develop the LCA-SW (LCA with Sliding Window) and LCA-SW H (enhanced LCA with Sliding Window) algorithms. In the LCA-SW algorithm, each coordinator computes the representatives of objects at each timestamp with de-duplication and keeps them for a time window. To generate an aggregation result for each window, every coordinator transmits the representatives for the window to the base station. Along the routing paths to the base station, when a node has collected all representatives with the same id for a window, the node conducts partial aggregation. Since LCA-SW has to send a representative repeatedly to the base station within a time window, in the enhanced LCA-SW H algorithm the least common ancestor of the coordinators computes the representative of an object for a window and updates partial aggregations after it receives all the required representatives from the coordinators. Our comprehensive empirical evaluation shows that LCA-SW H delivers the best performance in all situations.

2. Preliminaries

2.1. Sensor networks

We consider a sensor network consisting of n stationary sensor nodes {s_1, s_2, ..., s_n} deployed in a field of interest and a powered base station serving as an access point for users to pose ad hoc queries. As a basic primitive to collect sensing data in WSNs, we use an ad hoc spanning tree, as in TinyDB [21], SNEE [10], and AnduIn [16], as the basic routing structure from each sensor node to the base station, since trees are known as one of the most effective routing structures for aggregate query processing. Two nodes capable of direct bi-directional wireless communication are referred to as neighbors of each other. Each node can broadcast a message to all of its neighbors (or from a parent to its child nodes) at once. We assume that every node
knows its location as well as the identifiers and locations of its neighbors. In addition, the base station knows the locations of sensor nodes and the tree routing hierarchy among sensors.

A sensor reading consists of several attributes, each of which is associated with a sensor device. A sensor node may be equipped with several sensor devices, and thus we conceptually treat a wireless sensor network (WSN) as a single universal relational table which is the union of the sensor readings from all the sensor nodes.

Sensor nodes generate their readings periodically. The sampling period is known as an epoch [21]. To agree on a global time base that allows sensor nodes to start and end each epoch simultaneously, every sensor node executes the SMACS protocol [28] or a global time synchronization protocol [29]. Based on the globally synchronized time, every node sleeps for a certain period of time in each epoch to minimize energy consumption, and it wakes up to sample and to receive results when its neighbors try to propagate a message.

Generally, it is known that transmitting one bit of data costs as much energy as executing 800–1000 CPU instructions [20,21]. Thus, similarly to the work of [5,20,24,35], we consider transmission costs only when calculating the energy consumption. In addition, we use the free space channel model [13]. Under this model, to send an l-bit packet over distance c, each sensor node spends the amount of energy E_T(l, c) defined as:

$$E_T(l, c) = l \cdot E_{elec} + \varepsilon_{amp} \cdot l \cdot c^2$$

where E_elec denotes the energy consumed per bit by running the transmitter or receiver circuitry, and ε_amp denotes the energy consumed per bit by the amplifier. To receive this message, each sensor node consumes the amount of energy E_R(l) = l · E_elec. In our experimental study, we set the electronic circuit constant (E_elec) to 50 nJ/bit and the amplifier constant (ε_amp) to 100 pJ/bit/m² [13].

We do not consider link failures and node failures here. Link failures can be handled by a retransmission protocol or multi-path routing [24]. For node failures, every live node broadcasts a beacon signal periodically, and each node can detect the failure of another node if that node does not send a beacon signal for a long time.
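The energy model above is simple enough to evaluate directly. The following sketch plugs in the two constants quoted in the text; the function names `tx_energy` and `rx_energy` and the 1000-bit, 10 m example are illustrative choices of ours.

```python
E_ELEC = 50e-9      # electronic circuit constant, J/bit (50 nJ/bit)
EPS_AMP = 100e-12   # amplifier constant, J/bit/m^2 (100 pJ/bit/m^2)

def tx_energy(bits, distance_m):
    """E_T(l, c) = l * E_elec + eps_amp * l * c^2 (free space model)."""
    return bits * E_ELEC + EPS_AMP * bits * distance_m ** 2

def rx_energy(bits):
    """E_R(l) = l * E_elec."""
    return bits * E_ELEC

# Hypothetical example: a 1000-bit packet sent over 10 m.
print(tx_energy(1000, 10))   # about 6e-05 J (50 uJ circuitry + 10 uJ amplifier)
print(rx_energy(1000))       # about 5e-05 J
```

Because the amplifier term grows with the square of the distance while the circuitry term does not, the sender's cost exceeds the receiver's by exactly the amplifier term.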
2.2. Bloom filters

Many in-network aggregation techniques utilize sketches to obtain approximate aggregation results [7,25]. However, such sketches are not applicable to our work since we need to compute exact values of aggregation functions. Instead, since we need to identify the duplicates, we use the spectral bloom filter [3,6].

A bloom filter [3] is a space-efficient randomized data structure for representing a set in order to support membership queries. A bloom filter for representing a set S = {x_1, x_2, ..., x_n} of n elements is described by an array of m bits which are initially set to 0. A bloom filter uses k independent hash functions h_1, ..., h_k which map keys into the range {1, ..., m}. For each element x ∈ S, the bits h_i(x) in the bloom filter are set to 1 for 1 ≤ i ≤ k. For an item y, we check its membership in S by examining the bits at the positions h_1(y), h_2(y), ..., h_k(y) in the bloom filter. If any of the bits is 0, then y is certainly not in S. If all of these bits are 1, we simply assume that y is in S, even though this may turn out to be a false positive.

The probability of a false positive can be calculated as follows. Since a bloom filter has m bits, the probability that a particular bit is set to 1 by one hash function is 1/m. Thus, the probability p that a particular bit is still 0 after inserting the n elements of S is:

$$p = (1 - 1/m)^{kn} \approx e^{-kn/m}$$

The probability f of a false positive is then the probability that all k bits that we test are 1, which becomes

$$f = (1 - p)^k = \left(1 - (1 - 1/m)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k.$$

Generally, it is known that f is minimized when k = (m/n) ln 2, resulting in f ≈ (1 − 1/2)^k = (0.6185)^(m/n) [4].

In this paper, we adapt spectral bloom filters (SBFs) [6]. A spectral bloom filter is a variant of traditional bloom filters designed to represent multisets by encoding the multiplicities of individual items. Unlike a traditional bloom filter, a spectral bloom filter consists of an array of m counters and k independent hash functions h_1, h_2, ..., h_k. Initially, all counters in the spectral bloom filter are 0. For each element x in a multiset M, the k counters pointed to by the k hash functions are increased by 1. The estimator ĉ_x of the occurrence count c_x of an item x is the minimum counter value m_x = min(SBF[h_1(x)], SBF[h_2(x)], ..., SBF[h_k(x)]). In [6], it is shown that c_x ≤ ĉ_x always holds and that the probability of c_x ≠ ĉ_x is the false positive rate of traditional bloom filters.
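To make the two ingredients of this subsection concrete, the sketch below evaluates the false positive approximation and implements a minimal spectral bloom filter with the minimum-counter estimator. Deriving the k hash functions from salted SHA-256 digests is our own illustrative choice (and positions are 0-based here); [6] does not prescribe it, and the class and function names are hypothetical.

```python
import hashlib
import math

def false_positive_rate(m, n, k):
    """Approximation f = (1 - e^(-kn/m))^k from the text."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Number of hash functions minimizing f: k = (m/n) ln 2."""
    return (m / n) * math.log(2)

class SpectralBloomFilter:
    """m counters, k hash functions, minimum-counter estimator."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: salted SHA-256 stands in for k independent hashes.
        return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def estimate(self, item):
        # Never underestimates: every insertion of `item` raised all its counters.
        return min(self.counters[pos] for pos in self._positions(item))

sbf = SpectralBloomFilter(m=64, k=3)
for obj in ["zebra-7", "zebra-7", "zebra-7", "zebra-9"]:
    sbf.add(obj)
print(sbf.estimate("zebra-7"))   # at least 3
print(false_positive_rate(8000, 1000, optimal_k(8000, 1000)))  # close to 0.6185**8
```

The estimate can exceed the true count only when other items collide on all k counters of the queried item, which is the same event as a false positive in a plain bloom filter.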
3. Aggregate query processing

In this section, we consider aggregate queries with the SQL syntax presented in Section 1.
3.1. The LCA algorithm: one-phase aggregation

A naive brute-force (BF) algorithm can be designed as follows. All sensor nodes take their readings periodically and store these readings in their local tables. Each sensor node transmits its measured data, as well as the data received from its descendant nodes, to its parent node. The sensor readings of all the sensor nodes are collected at the base station, and the query is evaluated with de-duplication at the base station. BF is energy-inefficient since each sensor node blindly sends all of its detected events to the base station. To reduce the energy consumption of sensor nodes, we develop the least common ancestor (LCA) algorithm.

Definition 1. When the sensing area of a sensor node s_i overlaps with that of another sensor node s_j, we say s_i overlaps with s_j. We call s_j an oneighbor of s_i (shorthand for "overlapped-neighbor"). The set consisting of the oneighbors of a node s and s itself is denoted by OS(s).

Definition 2. For a sensor node s, the least common ancestor in the routing paths of the nodes in OS(s) is called the coordinator of s and is denoted by coordinator(OS(s)). That is, the coordinator of s is the node with the maximum hop distance from the base station among the common ancestors of OS(s). The node s is called a coordinatee of coordinator(OS(s)).

Example 3. Consider the sensor network shown in Fig. 1. Since s_3 overlaps with s_2 and s_4, we have OS(s_3) = {s_2, s_3, s_4}. The coordinator of OS(s_3) is s_2 (i.e., coordinator(OS(s_3)) = s_2) because the common prefix of the routing paths of the nodes in OS(s_3) is B-s_1-s_2 and s_2 has the maximum hop distance from the base station B. Thus, s_3 is a coordinatee of s_2. For s_4, we have OS(s_4) = {s_1, s_2, s_3, s_4, s_5} and the common prefix is B-s_1. Thus, we have coordinator(OS(s_4)) = s_1. For s_1, s_2 and s_5, we have coordinator(OS(s_1)) = coordinator(OS(s_2)) = coordinator(OS(s_5)) = s_1.
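Definition 2 amounts to taking the longest common prefix of the routing paths from the base station. The sketch below reproduces Example 3 under one assumption: since Fig. 1 is not reproduced here, the parent relation (s_2 and s_5 under s_1, s_3 and s_4 under s_2) is inferred from the routing paths mentioned in the examples. The names `PARENT`, `path_from_base` and `coordinator` are illustrative.

```python
# Assumed routing tree, inferred from the examples (B is the base station).
PARENT = {"s1": "B", "s2": "s1", "s3": "s2", "s4": "s2", "s5": "s1"}

def path_from_base(node, parent):
    """Routing path listed from the base station down to `node`."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def coordinator(overlap_set, parent):
    """Deepest node shared by the routing paths of all nodes in OS(s)."""
    paths = [path_from_base(n, parent) for n in overlap_set]
    common = None
    for hops in zip(*paths):             # walk the paths level by level
        if len(set(hops)) != 1:
            break
        common = hops[0]
    return common

print(coordinator({"s2", "s3", "s4"}, PARENT))                # s2
print(coordinator({"s1", "s2", "s3", "s4", "s5"}, PARENT))    # s1
print(coordinator({"s5"}, PARENT))                            # s5 (no overlap)
```

The last call illustrates the disjoint case: a node whose sensing area overlaps with no other node is its own coordinator.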
In our study, we make no assumption on the sensing radius of each node relative to the communication distance. In other words, it is possible that a pair of nodes overlap in their sensing areas and yet cannot communicate with each other directly. Similarly, we do not assume that the coordinator node of OS(s) is within direct communication distance of a sensor node s. In such scenarios, routing protocols such as GPSR [15] can be used. In addition, when the sensing areas of sensor nodes are disjoint, each sensor becomes its own coordinator and we perform traditional aggregate query processing. Thus, the problem we are considering is a generalization of the traditional in-network aggregation problem.

In LCA, all sensor nodes take sensor readings periodically and record these readings in their local tables. Each sensor node also transmits its measured data, together with the data received from its descendant nodes, to its parent node. Since the events appearing in the area where a sensor node s′ overlaps with other sensors are all collected at the coordinator s of s′, LCA identifies the duplicates of the events of s′ at the coordinator s. The coordinator s collects all the sensor readings of its coordinatees and their oneighbors. Thus, before a coordinator s sends the sensor readings from its descendant nodes and itself to its parent node, it performs de-duplication of the detected events. When the duplicate events from the coordinatee nodes of s are identified, the representatives of the attribute values appearing in the SELECT clause of the inner statement in the query Q are computed by applying the aggregation functions. If the selection predicate is satisfied by the computed representatives, the representative of the aggregation attribute a_k of the duplicates for each distinct event is calculated. Using the representative value of a_k, the (partial) aggregation result is updated and sent toward the base station.

The next example illustrates the detailed steps of how LCA works.
Fig. 1. An example of a sensor network.
Example 4. Let us return to Example 1. In Fig. 1, we show the tuples consisting of zebras' identifiers and their body temperatures measured at each node. Fig. 2 shows the detailed steps of how LCA works. According to LCA, each node sends its detected events to its coordinator. For instance, s_3 sends its tuples to s_2, and s_4 sends its tuples to s_1 via s_2. Thus, s_2 can identify the duplicates among the events detected by its coordinatee node s_3. The node s_2 identifies id_2 and id_3 as duplicates. For the duplicate tuples {(id_2, 37), (id_2, 38), (id_2, 37)}, the representative value 37.3, which is the average body temperature, does not satisfy the selection predicate (i.e., ≥ 38), and the duplicate tuples of id_2 are thus discarded. Since the representative value of id_3's body temperature is 38, the partial aggregation result of COUNT at s_2 becomes 1. Among the events detected at s_3 (i.e., s_2's coordinatee node), id_1 is a uniquely detected event and satisfies the selection predicate. Thus, the partial aggregate value of COUNT is increased by 1, and it will be sent to the parent of s_2. The node s_2 does not discard id_4 coming from s_4, although the aggregation attribute value 37 does not satisfy the selection predicate, since s_2 is not the coordinator of s_4. Thus, s_2 sends {(id_4, 37), (id_6, 40)} and the partial aggregation result 2 to its coordinator s_1. In the node s_1, which is the coordinator of s_1, s_2, s_4, and s_5, id_4 and id_5 are determined to be duplicates. Since the representative value 38.3 of id_4 satisfies the selection predicate, the partial aggregation value of COUNT becomes 3. However, since the representative value 37.5 of id_5 does not satisfy the selection predicate, the duplicates of id_5 are discarded. In addition, since id_6 is a uniquely detected event at the node s_2 and satisfies the selection predicate, the partial aggregation value of COUNT becomes 4. Finally, s_1 sends 4 to the base station as the aggregation value of COUNT.
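The steps of Example 4 can be replayed with a short recursive sketch specialized to COUNT with an AVG representative. It is not the pseudo code of Fig. 3. The tree, coordinators and readings are taken from Examples 3, 4 and 6, with two assumptions: s_5 is taken to be a direct child of s_1, and the reading 39 for id_1 is hypothetical (the text only says that id_1 satisfies the predicate).

```python
from collections import defaultdict

CHILDREN = {"s1": ["s2", "s5"], "s2": ["s3", "s4"], "s3": [], "s4": [], "s5": []}
COORDINATOR = {"s1": "s1", "s2": "s1", "s3": "s2", "s4": "s1", "s5": "s1"}
READINGS = {
    "s1": [("id4", 39), ("id5", 38)],
    "s2": [("id2", 37), ("id6", 40)],
    "s3": [("id1", 39), ("id2", 38), ("id3", 37)],   # id1's value is assumed
    "s4": [("id2", 37), ("id3", 39), ("id4", 37)],
    "s5": [("id4", 39), ("id5", 37)],
}

def lca(node, threshold=38):
    """Return (partial COUNT, undecided events) that `node` sends upward."""
    partial = 0
    events = [(oid, temp, node) for oid, temp in READINGS[node]]
    for child in CHILDREN[node]:
        child_partial, child_events = lca(child, threshold)
        partial += child_partial           # merge partial aggregates
        events += child_events             # merge forwarded events
    by_id = defaultdict(list)
    for oid, temp, origin in events:
        by_id[oid].append((temp, origin))
    forward = []
    for oid, obs in by_id.items():
        if any(COORDINATOR[origin] == node for _, origin in obs):
            # Some detection comes from a coordinatee: de-duplicate here.
            avg = sum(temp for temp, _ in obs) / len(obs)
            if avg >= threshold:
                partial += 1
        else:
            # Not ours to decide: pass the raw events to the parent.
            forward += [(oid, temp, origin) for temp, origin in obs]
    return partial, forward

print(lca("s2"))   # (2, [('id6', 40, 's2'), ('id4', 37, 's4')])
print(lca("s1"))   # (4, [])
```

The output matches the example: s_2 forwards the partial count 2 together with the undecided events of id_4 and id_6, and s_1 reports 4 to the base station. Other aggregates such as AVG would need a richer partial state (for instance a sum and a count) than the single integer used here.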
The pseudo code of LCA is presented in Fig. 3. In each node, we compute the collection of detected events DupS by merging the events detected in the node with those received from the child nodes. Similarly, we calculate the partial aggregation result ParAgg, which is the union of those coming from every child node (lines 3–8). Then, we update the partial aggregation result ParAgg appropriately if multiple partial aggregations exist for a distinct object (line 9). We next examine the event set D_o of each distinct object o in DupS (lines 11–18). If D_o does not have any element from the current node's coordinatees, D_o should be sent to the parent node for later investigation, and thus we add D_o to DupN (line 17). Otherwise (i.e., there exists an event in D_o from one of this node's coordinatees), we compute the representative values of the attributes for the object by performing de-duplication using the detected values in D_o (lines 13–14) and check the selection predicate p in the query Q with those representative values (line 15). If the predicate is satisfied, we compute the partial aggregation values and update the partial aggregation result ParAgg using AGG presented in the query Q (lines 15–16). Regardless of whether the predicate is satisfied, since de-duplication has been performed for the events in D_o, we do not add them to DupN. Finally, the partial aggregation value ParAgg and DupN are sent to the parent (line 19).

3.2. The LCA-EA algorithm: two-phase aggregation

To further reduce the energy consumption of LCA, we extend LCA to a two-phase algorithm called LCA-EA which performs as many partial aggregations as possible in each node. In LCA-EA, (1) we subdivide the set of detected events into the unique event set and the potential duplicate set, (2) send only the potential duplicate set to the coordinator of each node for further de-duplication, and (3) perform partial aggregations on the unique events, the result of which will be sent to the coordinator. In LCA-EA, since each sensor node transmits the partial aggregation result (i.e., a single value) instead of the uniquely detected events, the energy consumption can be reduced.

The LCA-EA algorithm consists of two phases: the duplicate identification phase and the early aggregation phase. The first phase, duplicate identification, identifies duplicates in every sensor node through collaboration with its overlapped nodes. Each node sends a compact representation of its detected events to the overlapped nodes. After potential duplicate events are
Fig. 2. Query processing of LCA.
Fig. 3. The LCA algorithm.
identified using compact representations, each node partitions the detected events into the potential duplicate and unique event sets. In the second phase, early aggregation, potential duplicates are sent to coordinators so that further de-duplication can take place. Meanwhile, for unique events, early aggregation can be performed as long as the aggregation function is not holistic [25]. Holistic aggregation functions such as quantiles require that all representatives be brought together to be aggregated. Thus, we assume that the aggregate functions used in this paper are not holistic. When holistic aggregations are used, the LCA and brute-force algorithms can be utilized.

3.2.1. The duplicate identification phase

To facilitate the identification of duplicate events, we propose a variant of the spectral bloom filter, which we call a modified spectral bloom filter, denoted as MSBF. It is an array of m 2-bit elements. Each element of the array has three possible states represented by '00', '01', and '10', where '00' represents that no item is hashed into the element, '01' denotes that one item is hashed into the element, and '10' means that more than one item is hashed into the element.

In LCA-EA, each node collaborates with its oneighbor nodes. Specifically, every node broadcasts its bloom filter to all its oneighbors. After a node receives the bloom filters of its oneighbors, it consolidates its own bloom filter with the received filters and identifies potential duplicates. To produce MSBF in each node s, we invoke the procedure Consolidate. The input of this procedure is the set of bloom filters of all the nodes in OS(s). Consolidate first initializes MSBF[i] to zero for 1 ≤ i ≤ m. Then, for each bloom filter BF of every node in OS(s), whenever BF[i] is one and MSBF[i] ≠ '10', we increase MSBF[i] by one.

After Consolidate finishes, we check whether each detected event is a potential duplicate or not by examining the minimum value of its hashed elements in MSBF. Specifically, for the object identifier id of a detected event e, if min(MSBF[h_1(e.id)], ..., MSBF[h_k(e.id)]) is '10', e is considered a potential duplicate. Otherwise, e is not a duplicate. We call the procedure that performs this check isPotentialDup. The following example illustrates the use of MSBF.

Example 5. Let us consider the query Q_z of Example 1 again with the sensor network shown in Fig. 1. Assume that our bloom filter uses two independent hash functions h_1 and h_2 which map keys into the range {1, ..., 7}. The bloom filter representation of every zebra identifier is presented in Table 1. Since the sensor node s_2 detects the zebras with id_2 and id_6, the bloom filter in the node s_2 becomes (0111100). Similarly, the bloom filters of s_3 and s_4 are (0111011) and (1101010), respectively, because s_3 detects id_1, id_2 and id_3, and s_4 detects id_2, id_3 and id_4. Since s_2 and s_4 are the oneighbor nodes of s_3, s_3 receives the bloom filters of s_2 and s_4. Assuming that every element of MSBF is initially zero, the MSBF in s_3 obtained from the bloom filters of the nodes s_2, s_3, and s_4 by invoking the function Consolidate becomes ('01' '10' '10' '10' '01' '10' '01'). For the identifier id_1, we have h_1(id_1) = 3 and h_2(id_1) = 7. Since the minimum of the third and seventh positions of MSBF is '01', we conclude that the zebra with id_1 is detected uniquely. However, id_2 is assumed to be a potential duplicate since h_1(id_2) = 4, h_2(id_2) = 2 and min(MSBF[4], MSBF[2]) = min('10', '10') = '10'. Similarly, we can also conclude that id_3 is a potential duplicate.
Table 1
The bloom filter representation of identifiers.

x_i     h_1(x_i)   h_2(x_i)   Representation
id_1    3          7          (0010001)
id_2    4          2          (0101000)
id_3    4          6          (0001010)
id_4    2          1          (1100000)
id_5    6          1          (1000010)
id_6    5          3          (0010100)
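Example 5 can be checked mechanically. The sketch below hard-codes the hash positions of Table 1 instead of real hash functions and represents the MSBF states '00', '01' and '10' by the integers 0, 1 and 2; the Python names are illustrative stand-ins for Consolidate and isPotentialDup.

```python
M = 7
# (h_1, h_2) positions from Table 1, 1-based as in the text.
HASHES = {"id1": (3, 7), "id2": (4, 2), "id3": (4, 6),
          "id4": (2, 1), "id5": (6, 1), "id6": (5, 3)}

def bloom(ids):
    """Plain bloom filter (list of 0/1) for the ids detected by one node."""
    bits = [0] * M
    for obj_id in ids:
        for pos in HASHES[obj_id]:
            bits[pos - 1] = 1
    return bits

def consolidate(filters):
    """Saturating per-position count of set bits: 0, 1, or 2 (meaning '10')."""
    msbf = [0] * M
    for bf in filters:
        for i, bit in enumerate(bf):
            if bit == 1 and msbf[i] < 2:
                msbf[i] += 1
    return msbf

def is_potential_dup(obj_id, msbf):
    """True when every hashed element of obj_id is in state '10'."""
    return min(msbf[pos - 1] for pos in HASHES[obj_id]) == 2

bf_s2 = bloom(["id2", "id6"])           # (0111100)
bf_s3 = bloom(["id1", "id2", "id3"])    # (0111011)
bf_s4 = bloom(["id2", "id3", "id4"])    # (1101010)
msbf_s3 = consolidate([bf_s2, bf_s3, bf_s4])
print(msbf_s3)                          # [1, 2, 2, 2, 1, 2, 1]
print([o for o in ["id1", "id2", "id3"] if is_potential_dup(o, msbf_s3)])
# ['id2', 'id3']
```

Two bits per element suffice because the test only needs to distinguish "hit by at most one filter" from "hit by more than one", so the counters can saturate at 2.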
The LCA-EA1 algorithm, which is the first phase of LCA-EA, is presented in Fig. 4. Each node generates a bloom filter Sig using the ids of its detected events. Then the bloom filter is sent to its oneighbors. After receiving the bloom filters of oneighbors, each node computes ConSig by consolidating its Sig with SigOi s (line 4). Then, we splits its detected events into the potential duplicate and unique event sets by invoking isPotentialDup (lines 5–9). 3.2.2. The early aggregation phase The second phase of LCA-EA is similar to LCA. In other words, in LCA, every detected event is considered as a potential duplicate event. On contrary, in LCA-EA, the detected events are partitioned into unique events and potential duplicate events at the first phase. In the second phase, each node computes a partial aggregation value for unique events satisfying the selection predicates of the query. Partial aggregation results are gradually merged along the routing paths to the base station. By Definition 2, the duplicates of an identical object appearing in OSðsÞ of a node s are collected at coordinatorðOSðsÞÞ. When the coordinator receives the potential duplicates from its coordinatee nodes, it computes the representative value of each aggregate attribute. If the selection predicate is satisfied based on the representative values, the aggregation attribute values are reflected into the partial aggregation result. The duplicates processed currently in the node are removed from the detected events to be transmitted to its parent node. Example 6. We present how the second phase of LCA-EA works in Fig. 5 by using the query Q z in Example 1. We assume that each node identifies unique events exactly among its detected events at the first phase. Since the unique event (i.e., id1 ) satisfies the selection predicate (i.e., P38), s3 sends 1 as a partial aggregation result. As the potential duplicates, s3 sends (id2 , 38) (id3 , 37) to s2 . 
Similarly, s4 sends 0 as a partial aggregation result and (id2, 37), (id3, 39) and (id4, 37) as potential duplicates to s2. Note that s2 has (id2, 37) as a potential duplicate and (id6, 40) as a unique event. Since id6 satisfies the selection predicate, the partial aggregation result becomes 2 (i.e., 1 for id6 and 1 coming from s3). Furthermore, s2 (= coordinator(OS(s3))) knows that id2 and id3 sent from s3 are potential duplicates. Since the representative of body_temperature is computed as AVG(body_temperature) in Q_z, the representative body temperature values of id2 and id3 are 37.3 and 38, respectively. Since the representative value of id3 satisfies the selection predicate, the partial aggregation result becomes 3. The potential duplicates for id2 and id3 are then removed. At the node s2, although the aggregation attribute value of id4 coming from s4 is 37, id4 cannot be discarded since coordinator(OS(s4)) is s1. Finally, the partial aggregation result 3 and (id4, 37) are sent to s1. Let us next consider the nodes s1 and s5. The node s5 sends (id4, 39) and (id5, 37) as potential duplicates to s1. Moreover, s1 has (id4, 39) and (id5, 38) as potential duplicates. The representative body temperature values of id4 and id5 are 38.3 and 37.5, respectively. Then, id5 is discarded but id4 is reflected in the partial aggregation result since the representative value of id4 satisfies the selection predicate. Thus, s1 sends 4 as the final aggregation result to the base station.
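The bookkeeping of Example 6 can be replayed in a few lines of Python. This is only a sketch under stated assumptions, not the LCA-EA2 pseudo code of Fig. 6: the function name run, the temperature 39 assigned to id1, and the coordinators of s2, s5 and s1 are not given in the text and are chosen here for illustration. An object is evaluated at a node as soon as one of its events comes from a coordinatee of that node; otherwise its events are forwarded to the parent.

```python
from collections import defaultdict

THRESHOLD = 38.0  # selection predicate of Q_z: AVG(body_temperature) >= 38

# Routing tree as described in Example 6 (child -> parent).
PARENT = {"s3": "s2", "s4": "s2", "s2": "s1", "s5": "s1", "s1": None}
# coordinator(OS(s)); the entries for s3 and s4 come from the text,
# the entries for s2, s5 and s1 are assumptions.
COORDINATOR = {"s3": "s2", "s4": "s1", "s2": "s2", "s5": "s1", "s1": "s1"}

# Output of the first phase: unique events and potential duplicates per node.
UNIQUE = {"s3": [("id1", 39.0)], "s2": [("id6", 40.0)]}  # 39.0 is assumed
DUPS = {
    "s3": [("id2", 38.0), ("id3", 37.0)],
    "s4": [("id2", 37.0), ("id3", 39.0), ("id4", 37.0)],
    "s2": [("id2", 37.0)],
    "s5": [("id4", 39.0), ("id5", 37.0)],
    "s1": [("id4", 39.0), ("id5", 38.0)],
}

CHILDREN = defaultdict(list)
for child, par in PARENT.items():
    if par is not None:
        CHILDREN[par].append(child)


def run(node):
    """Return (partial COUNT, unresolved events) that `node` sends upward."""
    count = sum(1 for _, temp in UNIQUE.get(node, []) if temp >= THRESHOLD)
    events = [(oid, temp, node) for oid, temp in DUPS.get(node, [])]
    for child in CHILDREN.get(node, []):
        child_count, child_events = run(child)
        count += child_count
        events.extend(child_events)

    by_object = defaultdict(list)
    for oid, temp, src in events:
        by_object[oid].append((temp, src))

    forward = []
    for oid, items in by_object.items():
        if any(COORDINATOR[src] == node for _, src in items):
            # This node coordinates a detector of the object: evaluate it here.
            representative = sum(t for t, _ in items) / len(items)
            if representative >= THRESHOLD:
                count += 1
        else:
            # Not our responsibility: pass the raw events to the parent.
            forward.extend((oid, t, src) for t, src in items)
    return count, forward


print(run("s2"))  # (3, [('id4', 37.0, 's4')])
print(run("s1"))  # (4, [])
```

The two printed tuples match the values that s2 and s1 send in the example: s2 forwards the partial result 3 together with (id4, 37), and s1 reports 4 to the base station.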
Fig. 4. The LCA-EA1 algorithm.
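The first phase can be sketched as follows. This is a minimal illustration rather than the pseudo code of Fig. 4: the filter size M, the number of hash functions K, the SHA-256 based hashing and all function names are assumptions, and the consolidated signature is taken here to be the bitwise OR of the neighbors' filters, so that an event counts as a potential duplicate when its id tests positive in at least the union of what the oneighbors reported.

```python
import hashlib

M = 64  # bloom filter size in bits (hypothetical)
K = 3   # number of hash functions (hypothetical)


def positions(event_id, m=M, k=K):
    """Return k bit positions for an id, using salted SHA-256 digests."""
    out = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{event_id}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)
    return out


def make_sig(event_ids):
    """Build the bloom filter Sig (as an int bitmask) of a node's event ids."""
    sig = 0
    for event_id in event_ids:
        for pos in positions(event_id):
            sig |= 1 << pos
    return sig


def is_potential_dup(event_id, con_sig):
    """True if every bit of the id is set in the consolidated signature."""
    return all((con_sig >> pos) & 1 for pos in positions(event_id))


def split_events(own_ids, neighbor_sigs):
    """Split a node's events into (unique, potential duplicates)."""
    con_sig = 0
    for sig in neighbor_sigs:
        con_sig |= sig
    unique = [e for e in own_ids if not is_potential_dup(e, con_sig)]
    dups = [e for e in own_ids if is_potential_dup(e, con_sig)]
    return unique, dups


neighbor = make_sig(["id2", "id3", "id4"])
unique, dups = split_events(["id1", "id2", "id3"], [neighbor])
print(unique, dups)
```

A bloom filter has no false negatives, so id2 and id3 always land in the potential duplicate set. Whether id1 is reported as unique depends on the hash values: a false positive would move it to the duplicate set, which costs energy but does not affect correctness.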
Fig. 5. An example of how the LCA-EA2 algorithm works.
The LCA-EA2 algorithm performs the second phase of LCA-EA. The pseudo code of the LCA-EA2 algorithm is shown in Fig. 6. Note that the pseudo code of LCA-EA2 is quite similar to that of LCA provided in Fig. 3 since all detected events of a sensor node s are still sent to its coordinator coordinator(OS(s)). In particular, in LCA, all detected events are considered as potential duplicates. In contrast, in LCA-EA2, each node separates the detected events into a unique event set and a potential duplicate set. Then, the partial aggregation results are updated by using the unique events that satisfy the selection predicate p in the query Q (lines 1–4). The potential duplicates are then added to DupS (line 5). The rest of the pseudo code of LCA-EA2 is identical to lines 3–19 in the pseudo code of LCA shown in Fig. 3.

3.3. Comparison of energy consumptions

Let E(l) be the energy consumption to transmit l-bit sized data from a node to its neighbors. Then, E(l) consists of the sending cost $E_T(l)$ and the receiving cost $E_R(l)$. The following lemma compares the energy consumptions of the LCA and BF algorithms.

Lemma 1. The average energy consumption of LCA is at most that of BF.

Proof Sketch: Let e be the average number of l-bit sized events detected by a sensor node, $h_B$ be the average hop distance from a sensor node to the base station and $h_c$ be the average hop distance from a sensor node to its coordinator. For the LCA algorithm, the detected events of a node are sent up to its coordinator and the r-bit sized partial aggregation result is sent from the coordinator to the base station. Note that $r \le l$ generally holds. The total energy to transmit data from a node to the base station by LCA is, at most, $h_c \cdot E(e \cdot l) + (h_B - h_c) \cdot E(r)$. Since we set the least common ancestor of OS(s) as the coordinator of s, we have $h_B - h_c \ge 0$. Furthermore, BF requires the transmission cost of $h_B \cdot E(e \cdot l)$.
Since $E(r) \le E(e \cdot l)$, we can establish that $h_c \cdot E(e \cdot l) + (h_B - h_c) \cdot E(r)$ (i.e., the total energy consumed by LCA) $\le h_c \cdot E(e \cdot l) + (h_B - h_c) \cdot E(e \cdot l) = h_B \cdot E(e \cdot l)$ (i.e., the total energy consumed by BF). In other words, the average energy consumption of LCA is at most that of BF. □

As presented earlier, the second phase of LCA-EA can be decomposed into two steps: (a) data transmission from a node to its coordinator, and (b) aggregation result transmission from the coordinator to the base station. For the latter step, the transmission cost (i.e., the term $(h_B - h_c) \cdot E(r)$) remains the same for both algorithms. The main difference occurs in step (a). Specifically, e l-bit sized events are transmitted from a node to its coordinator in LCA. However, in
Fig. 6. The LCA-EA2 algorithm.
LCA-EA, only the potential duplicates and the partial aggregation result are transmitted. Since bloom filters can have false positives, some unique events can be identified as duplicates. Let u and d be the number of unique events and the number of duplicates among the e events, respectively (i.e., $e = u + d$), and let f be the false positive rate of the bloom filters. Then, $(d + u \cdot f)$ events together with an r-bit sized partial aggregation result are transmitted from a node to its coordinator in step (a) of LCA-EA. In LCA, $(d + u)$ events are sent from a node to its coordinator. Thus, the benefit of LCA-EA's second phase is as follows:
$$h_c \cdot E((u - u \cdot f) \cdot l - r) = h_c \cdot E(u \cdot (1 - f) \cdot l - r)$$

When the number of unique events u is large, the benefit of the second phase increases. In general, when the sensing region of a sensor node overlaps less with the sensing regions of the other sensor nodes, more unique events are generated. When u is a constant, the benefit of LCA-EA's second phase is clearly affected by the false positive rate f. This brings us back to the first phase of LCA-EA, which incurs an additional cost compared to LCA. Let $n_o$ be the average number of oneighbors of each node. With a little loss of generality, we assume all oneighbors are directly reachable by communication without additional routing. Since a node broadcasts its bloom filter to its oneighbors, the total energy consumption of the first phase is $n \cdot (E_T(m) + n_o \cdot E_R(m))$ where n is the number of sensor nodes and m is the size in bits of a bloom filter. As m increases, the extra cost of LCA-EA in the first phase becomes bigger. However, based on the formula in Section 2.2, we have the false positive rate $f \approx (0.6185)^{m/(n_o \cdot e)}$ when the number of hash functions k is $\frac{m}{n_o \cdot e} \ln 2$. Since 0.6185 is less than 1, as m increases, the false positive rate f decreases, making the net savings in the second phase larger. It is hard to derive formally the point at which the extra cost of the first phase is exactly offset by the savings in the second phase. Thus, we resort to empirical evaluation in Section 5.

4. Extension to sliding window aggregation

In this section, we expand the scope of our investigation to a time-based sliding window aggregate query which repeatedly applies an aggregation function over a sliding time window. Note that even when there is no overlap in the sensing regions of sensors, there is still the need for de-duplication for processing a sliding window aggregate query, because an object may be detected several times within the time interval.
The problem is known as the distinct counting (or sum) problem [31] in the spatio-temporal database community. In [31], a variant of the R-tree [12], called the aBR-tree, is utilized to identify the region specified in the query and the FM sketch [9] is adopted to approximately estimate the number of distinct objects in the query region during a time interval. The problem we address in this section is more difficult in (a) considering a sliding time window, and (b) giving exact aggregation results. The SQL format we are considering is presented again below. According to the time-based sliding window scheme [11], a window W consists of tuples whose timestamp values are less than w apart. While the lower-case w denotes the width of a window as well as when it should be closed, ℓ denotes when a new window is opened. In other words, a new window is opened every ℓ time units. Thus, Q_w generates aggregate results at timestamps w, w + ℓ, w + 2ℓ, ..., w + ⌊d/ℓ⌋ · ℓ where d is the duration of Q_w.

Q_w: SELECT AGG(a_k)
     FROM ( SELECT AGG_1(S.attr_1) AS a_1, ..., AGG_m(S.attr_m) AS a_m
            FROM Sensor S
            RANGE w SLIDE ℓ FOR d
            GROUP BY S.ID )
     WHERE p(a_1, ..., a_m)

Since an object may be detected by several sensor nodes either at the same timestamp or at different timestamps within a window W, de-duplication is required at each timestamp as well as over the interval to evaluate the selection predicate p in the query Q_w. When there is a region (hole) in the sensing field which is not covered by any sensor node, some sensor readings of an object may not be produced. In that case, we assume that the representative is computed using the detected values only, ignoring the missing values. Furthermore, when ℓ ≥ w, we have a "tumbling window" which consists of non-overlapping consecutive windows.
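The reporting schedule defined above can be made concrete with a short sketch. The function names are hypothetical, and the second function assumes integer timestamps, so that a window closing at timestamp t covers the w most recent timestamps.

```python
def result_timestamps(w, slide, d):
    """Timestamps w, w + slide, ..., w + floor(d / slide) * slide."""
    return [w + j * slide for j in range(d // slide + 1)]


def window_contents(t, w):
    """Integer timestamps covered by the window that closes at t."""
    return list(range(t - w + 1, t + 1))


def is_tumbling(w, slide):
    """Consecutive windows do not overlap when the slide is at least w."""
    return slide >= w


print(result_timestamps(2, 1, 3))  # [2, 3, 4, 5]
print(window_contents(3, 2))       # [2, 3]
print(is_tumbling(2, 1))           # False
```

With w = 2 and ℓ = 1, the windows closing at timestamps 2 and 3 share timestamp 2, which is why an e-representative produced at timestamp 2 has to be kept for two consecutive results.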
Since the processing of a tumbling window query is the same as multiple invocations of a single window aggregate query, we assume 0 < ℓ < w without loss of generality in the rest of the paper. To make our presentation simpler, we also assume that Q_w is invoked at timestamp 1 in this section.

4.1. The LCA-SW: an extended LCA algorithm for sliding window

The brute-force (BF) algorithm presented in Section 3.1 can handle sliding window aggregate queries since all of the sensor readings from every sensor node are gathered in the base station. However, the presence of duplicates increases transmission cost significantly in BF. Thus, we extend LCA to support sliding window aggregate queries in this section. Each coordinator in LCA computes the representatives of the events detected at each timestamp by its coordinatees. To process sliding window aggregate queries, such computed representatives need to be aggregated for each window W since
the selection predicate p and the final aggregation function AGG in the query Q_w should be applied to the aggregated values of every object detected in W. We extend the notion of the representative of an object to support sliding windows. The representative value of an attribute attr_i of an object o at each timestamp is defined as the e-representative attribute value. The w-representative value of an attribute attr_i of an object o is the aggregation value AGG_i of the e-representative values of the attribute attr_i generated in a window W.² We next present the extended LCA algorithm called LCA-SW. In LCA-SW, each coordinator calculates the e-representatives of the objects detected by its coordinatees using the de-duplication technique used in LCA and keeps their e-representatives for a window W. Each coordinator transmits all e-representatives generated within the window W to the base station. Along the routing path from each coordinator to the base station, when a node n receives w e-representatives of an identical object o, it computes the w-representative of o and applies the selection predicate p in the query Q_w. If the w-representative satisfies p, the (partial) aggregation result is updated and sent to the base station. Since LCA-SW is a straightforward extension of LCA, we omit the pseudo code of the LCA-SW algorithm here. The LCA-SW algorithm is conceptually simple and more efficient than the BF algorithm since LCA-SW transmits only the e-representatives instead of all duplicately detected events. However, LCA-SW consumes a lot of energy for the following key reasons: For a window, each coordinator transmits every e-representative individually along the routing path to the base station in LCA-SW. Although w e-representatives of an identical object o are transformed into a single w-representative, this transmission of the e-representatives results in large energy consumption.
Another drawback of LCA-SW is that the e-representatives of an object o are blindly transmitted to the base station when o has been detected fewer than w times in a window. This happens when an object o moves into or out of the sensing field, or when there is a region (hole) in the sensing field which is not covered by any sensor node. Given the weaknesses of the LCA-SW algorithm, a natural idea is to explore how to extend the LCA-EA algorithm presented earlier to handle sliding window aggregate queries. To do so, unfortunately, the sensor node n detecting an object o would have to collaborate with a large number of sensor nodes, some of which may detect o during the window W. Consequently, the first phase of duplicate identification would become very expensive due to the heavy communication overhead. Thus, we do not explore extending the LCA-EA algorithm in this paper.

4.2. The LCA SW H: an enhanced LCA-SW algorithm

To overcome the drawbacks of LCA-SW, we devise an enhanced algorithm called LCA SW H. We first introduce the notion of super-coordinators by extending the concept of coordinators in Definition 2. Let d_w be the maximum moving distance of every object for a window W of width w. We can draw a virtual circle centered at the location of a sensor s with the radius d_w. Then, an object detected by a node s at timestamp t can be detected by another sensor node whose sensing region overlaps with this virtual circle. Based on the maximum moving distance d_w, we define the super-coordinator as follows:

Definition 3. Given a sensor node s and the maximum moving distance d_w, let the maximum moving region MMR(s, d_w) be the set of sensor nodes whose sensing regions overlap with the circle centered at the location of s with the radius d_w. Then, the set of extended overlapped sensor nodes EOS(s, d_w) is defined as ∪_{s' ∈ MMR(s, d_w)} OS(s').
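Definition 3 can be sketched directly from sensor positions. The sketch below assumes circular sensing regions of a common radius r, treats two disks as overlapping when the distance between their centers is smaller than the sum of their radii, and includes s itself in OS(s); the function names and the one-dimensional test layout are hypothetical.

```python
from math import hypot


def overlaps(p, radius_p, q, radius_q):
    """True if the disk around p overlaps the disk around q."""
    return hypot(p[0] - q[0], p[1] - q[1]) < radius_p + radius_q


def oneighbors(s, pos, r):
    """OS(s): sensors whose sensing disks overlap that of s (s included)."""
    return {t for t in pos if overlaps(pos[s], r, pos[t], r)}


def mmr(s, pos, r, d_w):
    """MMR(s, d_w): sensors whose sensing disks overlap the circle of
    radius d_w centered at s."""
    return {t for t in pos if overlaps(pos[s], d_w, pos[t], r)}


def eos(s, pos, r, d_w):
    """EOS(s, d_w): union of OS(s') over all s' in MMR(s, d_w)."""
    result = set()
    for t in mmr(s, pos, r, d_w):
        result |= oneighbors(t, pos, r)
    return result


# Four sensors on a line, 100 m apart, sensing radius 65 m, d_w = 50 m.
pos = {"a": (0, 0), "b": (100, 0), "c": (200, 0), "d": (300, 0)}
print(sorted(mmr("a", pos, 65, 50)))  # ['a', 'b']
print(sorted(eos("a", pos, 65, 50)))  # ['a', 'b', 'c']
```

In this layout the object seen by a can reach the region of b within the window, and b in turn overlaps with c, so c has to be taken into account when choosing the node that finalizes objects first detected by a.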
By the above definition, a sensor node s also appears in EOS(s, d_w) since the sensing area of s overlaps with the circle centered at the location of s with radius d_w (i.e., s ∈ MMR(s, d_w)). In addition, since EOS(s, d_w) is the union of all oneighbors of every sensor node s' in MMR(s, d_w), we have EOS(s, d_w) ⊇ OS(s) where OS(s) is the set of oneighbors of s as presented in Definition 1. Fig. 7 gives an example. For instance, MMR(s4, d_w) is {s2, s3, s4, s6} and EOS(s4, d_w) is {s2, s3, s4, s5, s6}.

Definition 4. Given a sensor node s and its corresponding EOS(s, d_w), the least common ancestor in the routing paths to the base station from the nodes in EOS(s, d_w) is called the super-coordinator of s and denoted as sp_coord(s, d_w). The node s is called the sub-coordinatee of sp_coord(s, d_w).

Let us continue with the example shown in Fig. 7. Given that MMR(s4, d_w) is {s2, s3, s4, s6} and EOS(s4, d_w) is {s2, s3, s4, s5, s6}, the super-coordinator sp_coord(s4, d_w) becomes s1. In addition, the super-coordinator sp_coord(s4, d_w) (= s1) is the parent node of coordinator(OS(s4)) (= s2) because EOS(s4, d_w) is a superset of OS(s4) (= {s2, s3, s4}). Thus, each super-coordinator is eligible to construct the w-representatives and to apply the selection predicate p in the query Q_w to select the w-representatives. Similar to LCA-SW, each coordinator keeps the e-representatives generated for a window W in LCA SW H. As discussed before, a drawback of LCA-SW is that each coordinator individually transmits every e-representative generated in W, resulting in a lot of energy consumption. Instead, in LCA SW H, the intermediate representative of an object o is used. The intermediate representative of o is initially created by each coordinator having e-representatives of o.
² We simply refer to the tuple consisting of the w-representative (and e-representative) value of every attribute attr_i (with 1 ≤ i ≤ m) of an object o as the w-representative (and e-representative) of o.

Fig. 7. An example of super-coordinators.

Then, along the routing path to sp_coord(s, d_w), the intermediate representative of o is updated with respect to the aggregation functions AGG_i for every i such that 1 ≤ i ≤ m in Q_w whenever another intermediate representative of o shows up. Recall that, in LCA-SW, the node receiving w e-representatives of an object o calculates the w-representative of o to update the partial aggregation result. In LCA SW H, the super-coordinator sp_coord(s, d_w) is computed based on the maximum moving distance d_w of every object. However, objects do not always move at the maximum speed. Thus, in LCA-SW, a node n on the routing path to the super-coordinator sp_coord(s, d_w) can collect w e-representatives of o and update the partial aggregation result earlier. In LCA SW H, to compute the w-representative of the object o at the node n using the intermediate representative, the intermediate representative has a counter c which indicates how many e-representatives are reflected in the intermediate representative. When an intermediate representative irep with the counter c is updated with another intermediate representative irep′ with the counter c′, c is set to c + c′. Furthermore, whenever the counter c reaches w, we know that irep becomes the w-representative. Another drawback of LCA-SW is that the e-representatives of an object o are blindly transmitted to the base station when o has been detected fewer than w times in a window. Similarly, if the intermediate representatives whose counters are less than w were transmitted blindly to the base station in LCA SW H, energy would be wasted. To prevent this problem, LCA SW H relies on super-coordinators.
When an object o is detected by a sensor node s and its corresponding intermediate representative irep arrives at the super-coordinator sp_coord(s, d_w) of s, the super-coordinator sets irep to the w-representative of o and updates the partial aggregation result by applying the selection predicate p even though the counter c of irep is less than w. This is correct since the super-coordinator sp_coord(s, d_w) is the least common ancestor of EOS(s, d_w). Whenever a super-coordinator sp_coord(s, d_w) receives the intermediate representative irep of an object o, the super-coordinator needs to check whether the object o was detected by one of its sub-coordinatees or not. If the ids of all the sensors detecting an identical object o in the window W were annotated to the intermediate representative of o for this purpose, the size of the intermediate representative would become large. Instead, each intermediate representative keeps the smallest sensor id. When the e-representative of o is generated using the events of o detected by several sensors, the smallest id of the sensors detecting o is kept in the e-representative. Similarly, when the intermediate representative of o is generated, the smallest sensor id is kept in the intermediate representative of o. The use of the smallest sensor id sid does not affect the correctness of sliding window aggregation processing because only the super-coordinator of the sensor with sid transforms the intermediate representative irep into the w-representative when irep arrives at it. The following example shows how LCA SW H works.

Example 7. Consider the query Q_zw in Example 2 and the sensor network in Fig. 7. To make this example simpler, let us assume that each zebra is detected by at most a single sensor at each timestamp t. The information of the detected zebras at each timestamp is given in Fig. 7 and the detailed steps of how LCA SW H works are shown in Fig. 8.
Note that we omit the smallest id of the sensors detecting the object in each e-representative and intermediate representative in Fig. 8 since each object is detected by a single sensor at each timestamp. At t = 1, the sensor nodes s3 and s4 send their tuples to their coordinator s2. Since the sensor node s2 is the coordinator of s3 and s4, s2 computes the e-representatives (id2, 38) and (id1, 40) and keeps them in its buffer. Since t is not (w + j · ℓ) for any j with 0 ≤ j ≤ ⌊d/ℓ⌋, the partial aggregation result is not generated.
Fig. 8. An example of a sensor network for sliding windows.
At t = 2, the sensor nodes s4 and s6 send their tuples to their coordinators s2 and s5, respectively. As shown in Fig. 8(b), the sensor nodes s2 and s5 keep the e-representatives (id2, 36) and (id1, 42) in their buffers, respectively. Since w = 2, the partial aggregation result is generated. Each coordinator converts the e-representatives in its buffer into intermediate representatives and transmits them along the routing path to the base station. For instance, at s2, the e-representatives (id2, 38) and (id2, 36) are aggregated into the intermediate representative (id2, 37, 2), and the e-representative (id1, 40) becomes the intermediate representative (id1, 40, 1), where the last attribute in each intermediate representative denotes the counter c. Furthermore, since the counter of (id2, 37, 2) is w (i.e., 2), (id2, 37, 2) becomes the w-representative of id2. Because the average temperature 37 does not satisfy the selection predicate (≥ 38) in the query Q_zw, the partial aggregation result becomes 0. Thus, the sensor node s2 transmits the intermediate representative (id1, 40, 1) with the partial aggregation result 0 to s1. The sensor node s5 also transmits the intermediate representative (id1, 42, 1) to s1. The sensor node s1 aggregates (id1, 40, 1) and (id1, 42, 1), and generates (id1, 41, 2). Since the counter value c of 2 is w and/or s1 is the super-coordinator of both s4 and s6, (id1, 41, 2) becomes the w-representative of id1. Because this w-representative satisfies the selection predicate, the partial aggregation result is updated to 1 (see Fig. 8(b)). At t = 3, each super-coordinator removes the old e-representatives in its buffer. Then, the sensor s6 transmits a tuple (id2, 44) to s5 and s5 keeps (id2, 44) in its buffer. The object with the id of id1 is not detected by any sensor node.
The sensor node s2 transmits the intermediate representative (id2, 36, 1) to s1, since s4 detected the object with the id id2 at t = 2 and s4 is not a sub-coordinatee of s2. Similarly, s5 transmits (id1, 42, 1) and (id2, 44, 1) to s1. Then, at the node s1, (id2, 40, 2) is computed from (id2, 36, 1) and (id2, 44, 1). Since the counter value of 2 is w, (id2, 40, 2) becomes the w-representative of id2. (id1, 42, 1) coming from s5 also becomes the w-representative of id1 although its counter value is less than w, since s1 is the super-coordinator of s6 and (id1, 42, 1) originated from s6. Since the two w-representatives satisfy the selection condition, the partial aggregation result becomes 2 (see Fig. 8(c)).

The pseudo code of LCA SW H is presented in Fig. 9. We use DupS, IRepS and ParAgg to store the detected events, intermediate representatives and partial aggregation results, respectively (lines 3–4). We also use Buffer to maintain the e-representatives for the window of the past w timestamps. We first compute DupS, IRepS and ParAgg from the set of detected events Dup_i, the set of intermediate representatives IRepS_i and the partial aggregation result PAgg_i received from every i-th child node, respectively (lines 5–13). We next compute the e-representatives at the current timestamp. For the event set D_o of each distinct object o in DupS (lines 16–22), if D_o has any element from a coordinatee of this node, we compute the e-representative erep of D_o and insert it into Buffer (lines 18–20). Furthermore, erep is annotated with the smallest sensor id (line 19). However, if D_o does not have any element from a coordinatee of this node, D_o is simply added to DupN (line 21) without computing the e-representative. The reason is that this node is not the coordinator of any descendant sensor node detecting the object o and thus we should not compute the e-representative of o at this node.
After every distinct object in DupS is examined, the e-representatives whose timestamps are (t − w) are deleted from Buffer (line 23). Finally, we compute the w-representatives for the window of the past w timestamps if the current timestamp t is (w + j · ℓ) for some j with 0 ≤ j ≤ ⌊d/ℓ⌋, where d is the duration and ℓ is the slide step by which the sliding window moves (lines 25–40). In this case, the intermediate representatives are first calculated from the e-representatives kept in Buffer. The w-representatives for the window of the past w timestamps are next computed from the intermediate representatives. Then, the aggregation result is produced using the computed w-representatives. Note that the w-representative of an object o can be computed only when the counter of o's intermediate representative is w or the intermediate representative originates from a sub-coordinatee of this node. To compute the w-representatives for the window of the past w timestamps, we proceed as follows: Each e-representative in Buffer is transformed into an intermediate representative and inserted into IRepS (lines 27–30). Then, for each distinct object o in IRepS (lines 31–39), we compute the aggregated intermediate representative irep from o's intermediate representatives in IRepS (lines 32–33). If the counter c of irep is w or irep originates from one of the sub-coordinatees of this node (i.e., this node is the super-coordinator), the computed irep is actually the w-representative wrep of o and we thus take irep as wrep (lines 34–35). Furthermore, if wrep satisfies the selection predicate p of the query Q_w, ParAgg is updated by using wrep (lines
Fig. 9. The LCA SW H algorithm.
36–37). If the counter c of irep is not w and irep does not originate from any of its sub-coordinatees, we insert the intermediate representative irep into IRepN (line 38), which will be sent to the parent node later. Then, the partial aggregation result ParAgg, the set DupN and the set of intermediate representatives IRepN are routed to the base station along the routing path (line 41).
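The window-closing step described above (merging intermediate representatives, then finalizing or forwarding them) can be illustrated on the data of Example 7. This is a sketch, not the pseudo code of Fig. 9. It assumes that the per-object aggregate is AVG and the final aggregate is COUNT with the predicate ≥ 38, that id1 was detected by s4 and id2 by s3 at t = 1 (the text does not say which of the two sensors saw which object), and that only the super-coordinators named in the example (s1 for s4 and s6) need to be known. All identifiers are hypothetical.

```python
W = 2  # window width used in Example 7

# Super-coordinators stated in Example 7; other sensors are left unspecified.
SP_COORD = {"s4": "s1", "s6": "s1"}


def predicate(avg_temperature):
    """Selection predicate assumed for the example query."""
    return avg_temperature >= 38


def merge(a, b):
    """Merge two (value, counter, sid) triples under AVG: the counters add
    up, the value is the counter-weighted mean, the smallest sid is kept."""
    (va, ca, sa), (vb, cb, sb) = a, b
    c = ca + cb
    return ((va * ca + vb * cb) / c, c, min(sa, sb))


def close_window(node, ireps):
    """ireps: list of (object id, value, counter, sid) present at `node`.
    Returns (number of qualifying w-representatives, forwarded ireps)."""
    merged = {}
    for oid, value, counter, sid in ireps:
        cur = (value, counter, sid)
        merged[oid] = merge(merged[oid], cur) if oid in merged else cur

    count, forward = 0, []
    for oid, (value, counter, sid) in merged.items():
        if counter == W or SP_COORD.get(sid) == node:
            # irep is the w-representative: apply the selection predicate.
            if predicate(value):
                count += 1
        else:
            forward.append((oid, value, counter, sid))
    return count, forward


# t = 2: window over timestamps 1 and 2.
c2, f2 = close_window("s2", [("id2", 38, 1, "s3"), ("id2", 36, 1, "s4"),
                             ("id1", 40, 1, "s4")])
c5, f5 = close_window("s5", [("id1", 42, 1, "s6")])
c1, f1 = close_window("s1", f2 + f5)
print(c2 + c5 + c1, f1)  # 1 []

# t = 3: window over timestamps 2 and 3, as received by s1.
print(close_window("s1", [("id2", 36, 1, "s4"), ("id1", 42, 1, "s6"),
                          ("id2", 44, 1, "s6")]))  # (2, [])
```

Sensor ids are compared as strings here, which is adequate for single-digit ids; an implementation would compare numeric ids.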
5. Experiments

5.1. Experimental setup

Data Set: We obtained a real-life data set containing the trajectories of Kruger Buffalos from Movebank [23]. Unfortunately, the data set only records 26,668 positions of 740 Kruger Buffalos. To show the scalability of our work, based on the real-life data set, we first built a probabilistic model of animal movements and next computed the average speed. We initially scattered animals randomly in the sensing field. According to the animal movement model, each animal moves in the sensing field as time passes. Since the real-life data set contains only x/y position attributes, we added artificial attributes such as body temperature, weight, and height, which were generated by following the normal distribution N(36, 1), uniform distribution with the
range of [800.0, 900.0], and normal distribution N(200, 2), respectively. We also introduced measurement errors of sensors in the range of [−1.0, 1.0] with a chance of 1%. The size of each attribute was set to 4 bytes.

Query Set: Table 2 shows the queries that were used in the experimental results reported below. The query Q1 is a simple aggregate query which computes the number of animals having high body temperatures. Q2 is a query to obtain the maximum weight of animals which satisfy the selection condition (i.e., height ≤ 200). The query Q3 requests the average body temperature of small sized animals. The query Q4 is a more complex nested query to obtain the identifiers and body temperatures of animals with the highest body temperature.

Network Configuration: We implemented all the proposed algorithms to evaluate their empirical behaviors. To conduct a comprehensive empirical evaluation, we constructed diverse network environments, as summarized in Table 3. The communication distance c in a network is 100 m (meters). To place the sensor nodes in the sensing field, we divided the sensing field into equi-sized grids and placed a sensor node at every corner of each grid. A sensor network came in three different sizes: a large network, a medium network, and a small network. In the large network, 1681 (= 41²) sensor nodes were distributed in a sensing field sized 16,000,000 m². In the medium network and small network, 441 (= 21²) and 121 (= 11²) sensor nodes were placed in areas of 2000² m² and 1000² m², respectively. We varied the radius r of each sensing area, which is proportional to the communication distance c and ranges from 45 m (= 0.45c) to 75 m (= 0.75c). As shown in Fig. 10, in a grid, when r < 50 (= 100/2) m, the sensing area of every sensor node does not overlap with those of the other sensor nodes. In contrast, when r ≥ 70.7106 (≈ 100√2/2) m, there is no sub-region which is not covered by any sensor node in a grid.
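The two radius thresholds and the node counts follow from the grid geometry and can be checked numerically. The sketch assumes, as the figures above suggest, that the grid spacing equals the communication distance of 100 m; the function names are hypothetical.

```python
from math import sqrt


def grid_thresholds(spacing):
    """For sensors on the corners of a square grid cell: sensing disks of
    neighboring corners start to overlap once r exceeds spacing / 2, and the
    cell is fully covered once r reaches spacing * sqrt(2) / 2, the distance
    from a corner to the cell center."""
    return spacing / 2, spacing * sqrt(2) / 2


def sensor_count(field_side, spacing):
    """Number of grid corners in a square field of the given side length."""
    per_side = int(field_side // spacing) + 1
    return per_side * per_side


overlap_from, covered_from = grid_thresholds(100)
print(overlap_from, round(covered_from, 4))  # 50.0 70.7107
print([sensor_count(side, 100) for side in (1000, 2000, 4000)])
# [121, 441, 1681]
```

The default radius of 65 m lies between the two thresholds, so neighboring sensing areas overlap while part of each grid cell remains uncovered.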
We set the default radius for the sensing areas to 65 m since some objects are duplicately detected due to overlapped sensing regions and there are sub-regions which are not covered by any sensor
Table 2
Query set.

Name  Definition
Q1    SELECT COUNT(avg_t) FROM ( SELECT AVG(body_temperature) AS avg_t FROM ANIMAL GROUP BY ID) WHERE avg_t >= 38
Q2    SELECT MAX(min_w) FROM ( SELECT MIN(weight) AS min_w, AVG(height) AS avg_h FROM ANIMAL GROUP BY ID) WHERE avg_h <= 200
Q3    SELECT AVG(body_temperature) FROM ( SELECT MIN(body_temperature) AS min_t, MAX(weight) AS max_w, MAX(height) AS max_h FROM ANIMAL GROUP BY ID) WHERE max_h <= 200 AND max_w <= 840
Q4    SELECT ID, MIN(body_temperature) FROM ANIMAL GROUP BY ID HAVING MIN(body_temperature) = SELECT MAX(min_b) FROM (SELECT MIN(body_temperature) AS min_b FROM ANIMAL GROUP BY ID)
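For reference, the intended results of Q1 and Q4 can be computed centrally on a list of raw readings, which is what any in-network algorithm has to reproduce. The sketch below is an illustration only: the function names and the toy readings are hypothetical, and every reading of the same ID is treated as a detection of the same animal, regardless of which sensor produced it.

```python
from collections import defaultdict


def q1(readings, threshold=38.0):
    """COUNT of animals whose average body temperature is >= threshold."""
    temps = defaultdict(list)
    for animal_id, temperature in readings:
        temps[animal_id].append(temperature)
    return sum(1 for ts in temps.values() if sum(ts) / len(ts) >= threshold)


def q4(readings):
    """IDs (with their minimum temperature) whose per-animal minimum body
    temperature is the largest among all animals."""
    min_t = {}
    for animal_id, temperature in readings:
        min_t[animal_id] = min(temperature, min_t.get(animal_id, temperature))
    if not min_t:
        return []
    top = max(min_t.values())
    return sorted((a, t) for a, t in min_t.items() if t == top)


readings = [("a", 37.0), ("a", 39.0), ("b", 38.5), ("b", 38.0), ("c", 36.0)]
print(q1(readings))  # 2
print(q4(readings))  # [('b', 38.0)]
```

Counting the five raw readings directly would give a different answer for Q1, which is the duplicate problem the proposed algorithms address inside the network.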
Table 3
Parameters.

Parameter                     Range                               Default
Size of sensing field         (1000 m)², (2000 m)², (4000 m)²     (4000 m)²
Number of sensor nodes (n)    121, 441, 1681                      1681
Communication distance (c)    100 m                               100 m
Radius of sensing area (r)    45–75 m                             65 m
Number of animals (a)         1000–20,000                         10,000
Packet size (p)               24–152 bytes                        56 bytes
Fig. 10. Effect of r in a grid.
node. The base station is placed at the center of the sensing field in each network size. The routing tree is constructed using the FHF (First-Heard-From) algorithm [21].

5.2. Varying the sensing radius

We varied the radius r of the sensing areas from 45 m to 75 m for the large network. Each sensor node detects the animals every minute and we ran the simulator for 10 min. Fig. 11 shows the total energy consumption of the proposed algorithms on all queries. Clearly, the brute-force algorithm is the worst, sometimes by an order of magnitude. When the radius of the sensing areas is small (i.e., r = 45 m), the sensing areas of the sensors are disjoint and the number of each sensor node's oneighbors becomes zero. Thus, the performances of LCA and LCA-EA are identical. As the radius of the sensing areas increases, the number of detected events increases, and energy consumption increases accordingly. When r is between 50 m and 70 m, LCA-EA shows the best performance since each sensor node identifies the unique events in the first phase and sends the partial aggregation result of the unique events as well as the potential duplicates to the base station in the second phase.
Fig. 11. Varying r.
When r becomes large, the performance gap between LCA-EA and LCA decreases because most events are duplicately detected. In our experiment, on average, 10% of events are uniquely detected when r = 75 m. Thus, the gain from identifying unique events offsets the overhead of the first phase of LCA-EA. The tuple sizes of the query results also affect the performance of our proposed algorithms. In the query Q1, only two attributes (i.e., animal ID and body_temperature) are required. The queries Q2 and Q3 require three and four attributes, respectively. As the tuple size becomes larger, more energy is required to send the duplicates. Thus, as shown in Fig. 11(a), (b) and (c), the performance gap between LCA-EA and LCA increases with increasing tuple size. Since the query Q4 requires two attributes, the results for Q4 are similar to those for Q1.

5.3. Varying the packet size

In this experiment, we varied the packet size p from 24 bytes to 152 bytes. Fig. 12 shows the results of Q3 and Q4; we do not show the results of Q1 and Q2 because the patterns are very similar. As the size of a packet increases, the number of packets to be transmitted is reduced, but the energy to transmit a packet increases. Thus, as shown in Fig. 12, the performance of every technique is not much affected by the packet size. The most interesting point is that, when p is small (i.e., p = 24), the energy consumption of LCA-EA is higher than that of LCA-EA with p = 56 (see Fig. 12) as well as p = 56 and 88 (see Fig. 12(a)). As mentioned earlier, we set the size of a bloom filter to p − 4 bytes. When the packet size is small (i.e., p = 24), more uniquely detected events tend to be identified as potential duplicates since the false positive rate of the bloom filter is higher. As p increases, the false positive rate decreases, and the percentage of detected unique events increases. Consequently, as p increases, energy consumption drops.
However, when p becomes even larger (e.g., p = 120), the communication overhead of sending bloom filters in the first phase starts to outweigh the benefit of the lower false positive rate, and energy consumption starts to rise again.

5.4. Varying the number of animals and sensors

In this experiment, we varied the number of animals a in the sensing field from 1000 to 20,000 with the default values of the other parameters. We only show the experimental result of Q3 since the results of the other queries show similar patterns. Fig. 13(a) shows the energy consumption as the number of animals increases. We find that the energy
Fig. 12. Varying p.
Fig. 13. Varying a and n.
consumption grows with the number of animals. Furthermore, BF is the worst performer. On average, the performance of LCA is 1.76 times better than that of BF. In addition, LCA-EA is 1.6 times better than LCA. Fig. 13(b) shows the energy consumption as the number of sensor nodes varies. When the number of sensor nodes increases, the energy consumption of BF increases dramatically due to the growing length of the routing path from each leaf node to the base station. In contrast, the energy consumption of our proposed techniques grows only gradually with the number of sensors, because they remove the duplicates progressively along the routing paths to the base station. These results indicate that our proposed techniques are scalable in WSN environments.

In sum, LCA-EA shows the best performance in most cases since it identifies potential duplicates efficiently in its first phase. However, when there is a high percentage of duplicate events, LCA shows better performance than LCA-EA. The reason is that, in LCA-EA, the overhead of the first phase dominates the gain generated by partial aggregation on unique events in the second phase.

5.5. Sliding window aggregation

We next report the performance results of BF, LCA-SW, and LCA SW H for sliding window aggregate queries. We let all the animals move according to the probabilistic model obtained from a real buffalo data set. The average and the maximum speeds of the objects are 3.4 m/min and 86.97 m/min, respectively. For these experiments, we reused the queries in Table 2, changing the window size w and the slide step ℓ with a fixed duration d (= 100). The default values of the window size w and the slide step ℓ are 10 and 1, respectively.

Varying w. We varied the window size w from 3 to 10 with the default value of the slide step ℓ (i.e., ℓ = 1) and plot the total energy consumption of all sensor nodes at each timestamp in Fig. 14. We only report the results of Q1 and Q3 in Fig. 14 since the performance patterns of Q2 and Q4 are similar. As shown in Fig. 14, the energy consumption of BF is not affected by the window size w since each sensor node transmits its readings to the base station at each timestamp. In contrast, the window size w affects the performance results of LCA-SW and LCA SW H. In particular, LCA-SW and LCA SW H show better performance than BF since both algorithms calculate the e-representative with the duplicate events of each identical object by performing de-duplication at each timestamp; hence, the volume of data to be transmitted to the base station is reduced.
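The effect of de-duplication on an aggregate can be illustrated with a small centralized sketch. It is not the distributed LCA-SW procedure: the `Detection` record, the sensor and object identifiers, and the rule for choosing one reading per object (the reading from the lowest sensor ID) are all hypothetical and stand in for the e-representative computation defined earlier in the paper.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Detection(NamedTuple):
    sensor_id: int
    object_id: str
    value: float  # e.g. a body temperature reading


def naive_aggregate(detections: Iterable[Detection]) -> Tuple[int, float]:
    """COUNT and SUM over raw detections; duplicates are counted repeatedly."""
    values = [d.value for d in detections]
    return len(values), sum(values)


def deduplicated_aggregate(detections: Iterable[Detection]) -> Tuple[int, float]:
    """COUNT and SUM after keeping one detection per object."""
    by_object = defaultdict(list)
    for d in detections:
        by_object[d.object_id].append(d)
    # Hypothetical representative rule: keep the reading from the lowest sensor ID.
    representatives: List[Detection] = [
        min(group, key=lambda d: d.sensor_id) for group in by_object.values()
    ]
    return len(representatives), sum(d.value for d in representatives)


# Three animals; a1 and a3 lie in overlapping sensing regions.
detections = [
    Detection(1, "a1", 38.0), Detection(2, "a1", 38.0),
    Detection(2, "a2", 39.0),
    Detection(1, "a3", 37.5), Detection(3, "a3", 37.5), Detection(4, "a3", 37.5),
]

print(naive_aggregate(detections))         # six tuples are aggregated
print(deduplicated_aggregate(detections))  # three tuples are aggregated
```

In this toy input the raw stream holds six tuples for three animals, so de-duplication halves the number of tuples and corrects both COUNT and SUM.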
Fig. 14. Varying w when s = 1 and d = 100.
Fig. 15. Varying ℓ when w = 10 and d = 100.
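To make the roles of the window size w and the slide step ℓ concrete, the following sketch enumerates the timestamps at which a time-based sliding window is reported and the interval each window covers. It assumes that a window is reported at t = w + j·ℓ for j = 0, 1, 2, ... and covers the timestamps [t − w + 1, t]; the function names and the cut-off parameter `last_t` are hypothetical.

```python
from typing import List, Tuple


def report_times(w: int, step: int, last_t: int) -> List[int]:
    """Timestamps t = w + j*step (j = 0, 1, ...) that do not exceed last_t."""
    return list(range(w, last_t + 1, step))


def window_bounds(t: int, w: int) -> Tuple[int, int]:
    """Inclusive interval of timestamps covered by the window reported at t."""
    return (t - w + 1, t)


w = 10
for step in (1, 3, 9):
    times = report_times(w, step, last_t=30)
    first_windows = [window_bounds(t, w) for t in times[:3]]
    print(f"slide step {step}: {len(times)} reports, first windows {first_windows}")
```

With w = 10 and a cut-off of 30 timestamps, a slide step of 1 yields 21 reports while a slide step of 9 yields only 3, so a larger slide step means that window results are generated less often.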
Since every object moves around, the number of objects that pass through regions not covered by any sensor node within a window increases as w increases. Thus, a large number of nodes in the routing path cannot receive w e-representatives. Consequently, as w grows, the energy consumption of LCA-SW dramatically increases. Meanwhile, although the performance of LCA SW H is slightly affected by the window size w, LCA SW H shows the best performance in all cases since it utilizes not only the intermediate representatives to prevent surplus transmission of the e-representatives but also the super-coordinators to calculate the w-representatives eagerly.

Varying ℓ. We varied the slide step ℓ from 1 to 9 with the default window size w (i.e., w = 10) and report the results of Q1 and Q3 in Fig. 15. Only when t = w + j·ℓ for an integer j such that 0 ≤ j ≤ ⌊d/ℓ⌋ does each coordinator transmit the e-representatives generated during [t − w + 1, t] (and the intermediate representatives in LCA SW H) along the routing path, and only then is the partial aggregation result for a window generated. Thus, as ℓ increases, the energy consumptions of LCA-SW and LCA SW H decrease. LCA SW H shows the best performance in all cases since it utilizes the intermediate representatives and the super-coordinators. However, as ℓ increases, the performance gap between LCA-SW and LCA SW H decreases since the aggregation result is rarely generated, and hence the gains obtained by utilizing the intermediate representatives and super-coordinators are small.

6. Related work

The pioneering TAG work by Madden et al. [20] studied in-network aggregation for reducing communication overhead using summary data (e.g., SUM) and/or exemplary data (e.g., MIN and MAX). In TAG, partial aggregation values are computed while climbing up a routing tree from the leaf nodes to the base station.
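The idea of computing partial aggregates along a routing tree can be sketched as follows for AVERAGE, whose partial state is a (sum, count) pair that each node merges with the states of its children before forwarding a single pair upward. This is a minimal centralized illustration of the general idea, not TAG's implementation; the `Node` class and the example tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    readings: List[float] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def partial_avg(node: Node) -> Tuple[float, int]:
    """Return the (sum, count) partial state for the subtree rooted at node."""
    total = sum(node.readings)
    count = len(node.readings)
    for child in node.children:
        child_sum, child_count = partial_avg(child)  # one pair per child link
        total += child_sum
        count += child_count
    return total, count


# Hypothetical routing tree: the root plays the role of the base station.
root = Node(children=[
    Node([20.0, 22.0]),
    Node([24.0], children=[Node([26.0])]),
])

total, count = partial_avg(root)
print(total, count, total / count)
```

Each link carries one fixed-size pair regardless of how many readings lie below it. Note that merging such pairs is only correct when every reading is counted once, which is exactly the condition that overlapping sensing regions violate.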
In order to reduce the communication overhead, approximate in-network aggregation techniques have also been proposed [7,24–26]. These approximation techniques cannot be applied to exact aggregation since the approximated aggregation values are transmitted from leaf nodes to the base station. Considine et al. [7] presented a robust method using FM sketch and multi-path routing for processing approximate aggregate queries in the presence of failures. Shrivastava et al. [25] developed the q-digest structure to support approximate processing for quantile queries. Nath et al. [24] also introduced a general framework based on synopsis diffusion for various approximate aggregate queries. For some aggregations such as MIN and MAX, Silberstein et al. [26] developed an algorithm called HAT with the goal of minimizing communication cost in a sensor network while guaranteeing data accuracy.

Recently, some aggregation techniques for a spatial region in sensor networks have been proposed [5,27,35]. In [27], Soheili et al. proposed a distributed spatial index, called SPIX, to process an aggregate query for a user-defined spatial region. Zhuang and Chen [35] proposed a max regional aggregate query, which finds a region with the maximum aggregation value, and used sampling to reduce the communication overhead. Finally, in [5], Choi and Chung proposed to use the smallest enclosing circles to obtain aggregation values.

None of the studies mentioned so far allows the sensing regions to overlap or handles duplicates. To the best of our knowledge, the algorithms proposed here are the first exact methods for in-network aggregation in the presence of duplicates. Moreover, the algorithms are general enough to handle sliding window aggregate queries.

In the spatio-temporal data management area, there are many studies for efficient query processing such as nearest neighbor searches for mobile objects [34].
Among the previous techniques for spatio-temporal data, some studies consider aggregate query processing [17,31]. However, most of the spatio-temporal aggregate query processing techniques utilize index structures such as MVSB-tree [32], MR-tree [33] and MRA-tree [17] to minimize CPU cost as well as disk I/O cost (see details in [19]). These studies assume that mobile objects are stored on a local disk, and thus such techniques do not consider the energy efficiency of aggregate query processing.

7. Conclusions

In this paper, we studied collaborative in-network aggregation in the presence of duplicates. We considered a type of sensor that detects objects in its sensing region, where sensing regions may overlap each other. Although many effective aggregation techniques have been proposed in the WSN and spatio-temporal communities, the previous techniques cannot be applied to our setting since an identical object can be detected by several sensor nodes. To solve this problem, we proposed the single-phase algorithm LCA, which is later extended to the two-phase algorithm LCA-EA. The latter uses a variant of the spectral bloom filter to identify unique events, which can then be aggregated early. Empirical results show that LCA-EA performs the best in some situations. However, if there are strong overlaps among the sensors with a high percentage of duplicates, the simpler LCA would be the recommended choice. We also extended LCA to the LCA-SW and LCA SW H algorithms for sliding window aggregate queries. The LCA-SW algorithm shows better performance than the naive BF algorithm due to the de-duplication at each timestamp. Performance is further enhanced by the optimized LCA SW H algorithm.

Acknowledgment

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012R1A1B3003060). This work was also supported by
Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2012M3C4A7033342). In addition, this research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-1022) supervised by the NIPA (National IT Industry Promotion Agency).

References

[1] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, R. Varma, J. Widom, STREAM: the Stanford stream data manager, IEEE Data Eng. Bullet. 26 (1) (2003).
[2] A. Arasu, S. Babu, J. Widom, The CQL continuous query language: semantic foundations and query execution, VLDB J. 15 (2) (2006) 121–142.
[3] B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM 13 (7) (1970) 422–426.
[4] A. Broder, M. Mitzenmacher, Network applications of bloom filters: a survey, Internet Math. 1 (4) (2004) 485–509.
[5] D.-W. Choi, C.-W. Chung, Request: region-based query processing in sensor networks, in: Proceedings of DASFAA, 2011, pp. 266–279.
[6] S. Cohen, Y. Matias, Spectral bloom filters, in: Proceedings of ACM SIGMOD, 2003, pp. 241–252.
[7] J. Considine, F. Li, G. Kollios, J. Byers, Approximate aggregation techniques for sensor databases, in: Proceedings of ICDE, 2004, pp. 449–460.
[8] A.J. Demers, J. Gehrke, R. Rajaraman, A. Trigoni, Y. Yao, The cougar project: a work-in-progress report, SIGMOD Rec. 32 (4) (2003) 53–59.
[9] P. Flajolet, G.N. Martin, Probabilistic counting algorithms for data base applications, J. Comput. Syst. Sci. 31 (2) (1985) 182–209.
[10] I. Galpin, C.Y. Brenninkmeijer, A.J. Gray, F. Jabeen, A.A. Fernandes, N.W. Paton, SNEE: a query processor for wireless sensor networks, Distrib. Parallel Datab. 29 (1–2) (2011) 31–85.
[11] L. Golab, T. Özsu, Issues in data stream management, ACM SIGMOD Rec. 32 (2) (2003) 5–14.
[12] A. Guttman, R-trees: a dynamic index structure for spatial searching, ACM SIGMOD Rec. 14 (2) (1984) 47–57.
[13] W.B. Heinzelman, A.P. Chandrakasan, H. Balakrishnan, An application-specific protocol architecture for wireless microsensor networks, IEEE TWC 1 (4) (2002) 660–670.
[14] P. Juang, H. Oki, Y. Wang, M. Martonosi, L.-S. Peh, D. Rubenstein, Energy-efficient computing for wildlife tracking: design tradeoffs and early experiences with ZebraNet, in: Proceedings of ASPLOS, 2002, pp. 96–107.
[15] B. Karp, H.T. Kung, GPSR: greedy perimeter stateless routing for wireless networks, in: Proceedings of MobiCom, 2000, pp. 243–254.
[16] D. Klan, M. Karnstedt, K. Hose, L. Ribe-Baumann, K. Sattler, Stream engines meet wireless sensor networks: cost-based planning and processing of complex queries in AnduIN, Distrib. Parallel Datab. 29 (1) (2011) 151–183.
[17] I. Lazaridis, S. Mehrotra, Progressive approximate aggregate queries with a multi-resolution tree structure, in: Proceedings of ACM SIGMOD, 2001, pp. 401–412.
[18] J. Li, D. Maier, K. Tufte, V. Papadimos, P.A. Tucker, Semantics and evaluation techniques for window aggregates in data streams, in: Proceedings of ACM SIGMOD, 2005, pp. 311–322.
[19] I.F.V. Lopez, R.T. Snodgrass, B. Moon, Spatiotemporal aggregate computation: a survey, IEEE TKDE 17 (2) (2005) 271–286.
[20] S. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, TAG: a tiny aggregation service for ad-hoc sensor networks, in: Proceedings of OSDI, 2002.
[21] S.R. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, TinyDB: an acquisitional query processing system for sensor networks, ACM TODS 30 (1) (2005) 122–173.
[22] W.M. Merrill, F. Newberg, K. Sohrabi, W. Kaiser, G. Pottie, Collaborative networking requirements for unattended ground sensor systems, in: Proceedings of IEEE Aerospace, 2003, pp. 2153–2165.
[23] MoveBank.
[24] S. Nath, P.B. Gibbons, S. Seshan, Z.R. Anderson, Synopsis diffusion for robust aggregation in sensor networks, in: Proceedings of SenSys, 2004, pp. 250–262.
[25] N. Shrivastava, C. Buragohain, D. Agrawal, S. Suri, Medians and beyond: new aggregation techniques for sensor networks, in: Proceedings of SenSys, 2004, pp. 239–249.
[26] A. Silberstein, K. Munagala, J. Yang, Energy-efficient monitoring of extreme values in sensor networks, in: Proceedings of ACM SIGMOD, 2006, pp. 169–180.
[27] A. Soheili, V. Kalogeraki, D. Gunopulos, Spatial queries in sensor networks, in: Proceedings of ACM GIS, 2005, pp. 61–70.
[28] K. Sohrabi, J. Gao, V. Ailawadhi, G.J. Pottie, Protocols for self-organization of a wireless sensor network, IEEE Pers. Commun. 7 (5) (2000) 16–27.
[29] B. Sundararaman, U. Buy, A.D. Kshemkalyani, Clock synchronization for wireless sensor networks: a survey, Ad Hoc Netw. 3 (2005) 281–323.
[30] R. Szewczyk, E. Osterweil, J. Polastre, M. Hamilton, A.M. Mainwaring, D. Estrin, Habitat monitoring with sensor networks, Commun. ACM 47 (6) (2004) 34–40.
[31] Y. Tao, G. Kollios, J. Considine, F. Li, D. Papadias, Spatio-temporal aggregation using sketches, in: Proceedings of IEEE ICDE, 2004, pp. 214–225.
[32] D. Zhang, A. Markowetz, V. Tsotras, D. Gunopulos, B. Seeger, Efficient computation of temporal aggregates with range predicates, in: Proceedings of ACM PODS, 2001, pp. 237–245.
[33] D. Zhang, V.J. Tsotras, Improving min/max aggregation over spatial objects, in: Proceedings of ACM GIS, 2001, pp. 88–93.
[34] B. Zheng, D.L. Lee, Semantic caching in location-dependent query processing, in: Proceedings of SSTD, 2001, pp. 97–116.
[35] Y. Zhuang, L. Chen, Max regional aggregate over sensor networks, in: Proceedings of IEEE ICDE, 2009, pp. 1295–1298.